Voiceprint Biometric System

The goal of this project is to develop a voiceprint biometric system. However, there are many steps involved in accomplishing this goal and it may not be possible to complete all of them in this project. The major steps are as follows: collect speech samples, find or develop the necessary speech processing tools, develop an appropriate set of voiceprint features from the common-utterance speech samples, and use the existing Pace University biometric system backend to conduct experiments.

There are four types of passphrases:

  1. user-specified phrase, like the user's name
  2. specified phrase common to all users
  3. random phrase, e.g., one displayed on the computer screen
  4. random phrase that can vary at the user's discretion

We have decided to focus on a specified phrase common to all users. Using a common passphrase simplifies the segmentation problem, allows the common phrase to be chosen carefully so that the choice and variety of its phonetic units have authentication value, facilitates testing for impostors, and (according to Killourhy and Maxion, see reference below) permits the measurement of true voice authentication performance while avoiding potential experimental flaws.

Data Samples of Common-Phrase Speech Utterances

The common phrase to be recorded is "My name is" followed by the person's name -- for example, "My name is John Smith." For recording the utterances, a laptop's built-in microphone or an inexpensive microphone attached to a desktop computer can be used.
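
A minimal recording sketch in Python appears below. It assumes the sounddevice and scipy packages, which are our tool choices rather than project requirements; the 16 kHz sampling rate, five-second duration, and file name are illustrative.

    # Record one utterance from the default microphone and save it as a .wav file.
    # sounddevice and scipy are assumed tool choices; the rate, duration, and
    # file name are illustrative.
    import sounddevice as sd
    from scipy.io import wavfile

    RATE = 16000   # 16 kHz is a common rate for speech recording
    SECONDS = 5    # long enough for "My name is <name>"

    recording = sd.rec(int(SECONDS * RATE), samplerate=RATE, channels=1, dtype="int16")
    sd.wait()      # block until the recording completes
    wavfile.write("subject01_day1_take1.wav", RATE, recording)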

We would like to obtain 20 sample utterances from each of 30 subjects (10 samples from each of 100 subjects would be preferred). The utterance samples should preferably be collected in groups of five over a period of a week or more -- for example, five samples per day on each of four different days. In the worst case, record ten samples per subject per day, requiring data collection on only two different days.

The instructions to the subjects should be: "Please speak naturally but clearly in producing the utterance samples." Each subject should practice recording the utterance about ten times, and the subject and the experimenter should review the practice utterances for clarity and naturalness. The experimenter will keep a record of the date of each recording and the microphone used.

The following would be nice but is not required. A database of the speech recordings (.wav files) could be made accessible through a Web interface so users can input new recordings or listen to selected existing recordings. An example of a similar database for speech samples can be found at George Mason University's Speech Database.

Speech Processing Tools

  1. Segmentation tools: Because the biometric system will operate on the initial "My name is" portion of the utterance, that portion must be isolated from the background noise at the beginning of the utterance and from the person's name at the end of the utterance. This requires the two tools described in the next items:
  2. Speech spectrogram tool to perform a spectral analysis of a speech signal. This is a standard speech visualization tool that typically gives a grey-scale plot of frequency bands as a function of time. We anticipate finding an appropriate spectral analysis tool on the Internet. However, we need a spectrographic tool that provides access to the actual numerical data (e.g., the energy in a particular frequency band in a particular time interval). The numeric data are usually represented as a matrix of frequency bands versus time intervals. These data will be used by both the elastic matching and feature extraction components of the system (a sketch of computing such a matrix appears after this list).
  3. Elastic matching (dynamic time warping) algorithm to align each sample speech signal of the utterance with one that has been pre-segmented into the seven sounds ([m], [ai], [n], [ei], [m], [i], [z]) in preparation for feature extraction. We may have to develop the alignment tool in-house, but that should not be difficult because DTW is a rather concise algorithm (a sketch appears under "This Semester" below).
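
As an illustration of the numeric access we need, the sketch below computes a matrix of energy per frequency band per time frame with scipy; scipy is an assumed tool choice, and the window and overlap values are illustrative.

    # Compute a numeric spectrogram: energy per frequency band per time frame.
    # scipy is an assumed tool choice; window and overlap values are illustrative.
    import numpy as np
    from scipy import signal
    from scipy.io import wavfile

    rate, samples = wavfile.read("subject01_day1_take1.wav")
    freqs, times, spec = signal.spectrogram(
        samples.astype(np.float64).ravel(),  # flatten a single-channel 2-D array
        fs=rate,
        window="hamming",
        nperseg=int(0.030 * rate),   # 30 msec analysis window
        noverlap=int(0.010 * rate),  # 10 msec overlap between windows
    )
    # spec[i, j] is the energy in frequency band freqs[i] during time frame
    # times[j] -- the matrix of frequency bands versus time intervals described
    # in item 2 above.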

This Semester

Most important is that the same software produce both the spectrogram and the numeric frequency-versus-time energy values. The numeric values would support segmentation of the utterance from the background noise and DTW segmentation of the seven phonetic sound units, and the results could then be presented on the spectrogram.
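
Since the alignment tool may be developed in-house, the following is a minimal DTW sketch. It assumes each utterance has been reduced to a sequence of frame-level feature vectors (for example, the spectrogram columns above); the function names and the Euclidean frame distance are our choices.

    # Minimal dynamic time warping sketch: align the frames of a new utterance to
    # a reference utterance whose seven sound boundaries are already known.
    # Frame-level feature vectors are assumed; function names are our choices.
    import numpy as np

    def dtw_path(ref, new):
        """Return the optimal warping path as (ref_frame, new_frame) index pairs."""
        n, m = len(ref), len(new)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(ref[i - 1] - new[j - 1])  # local frame distance
                cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        # Backtrack from the end to recover the optimal alignment path.
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    def transfer_boundaries(path, ref_boundaries):
        """Map the reference's sound boundaries onto the new utterance."""
        mapping = dict(path)  # each ref frame -> its last matching new frame
        return [mapping[b] for b in ref_boundaries]

With seven sounds there are six internal boundaries; the frame indices returned by transfer_boundaries segment the new utterance into its seven phonetic units.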

Feature Extraction

Data measurements (features) will be designed and code written to extract them from each sample utterance. The output of the feature extractor will be a fixed-length vector of measurements appropriate for input to the Pace University biometric authentication system. Several feature sets will be explored, and all features will be normalized over the varying lengths of the speech utterances.

The initial speech processing of the utterance samples will consist of a standard spectral analysis. One possibility is as follows. Compute the 13 lowest mel-frequency cepstral coefficients (MFCCs) from 40 mel-spaced filters: 13 spaced linearly with 133.33 Hz between center frequencies, and 27 spaced logarithmically with a frequency factor of 1.07 between adjacent filters. The spectral analysis time frame could be a 30 msec Hamming window with 10 msec of overlap between adjacent windows. The number of time windows per utterance will vary because the windows are of fixed size while the lengths of the voice samples vary.
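
A sketch of such an analysis using the librosa package follows. librosa is an assumed tool choice, and its default mel filter bank approximates, rather than exactly reproduces, the 13-linear/27-logarithmic filter spacing described above.

    # MFCC sketch using librosa (an assumed tool choice). librosa's mel filter
    # bank approximates, but does not exactly reproduce, the filter spacing above.
    import librosa

    y, sr = librosa.load("subject01_day1_take1.wav", sr=None)  # keep native rate
    win = int(0.030 * sr)        # 30 msec Hamming window
    hop = win - int(0.010 * sr)  # 10 msec overlap => 20 msec between window starts

    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=13,               # 13 lowest cepstral coefficients
        n_mels=40,               # 40 mel-spaced filters
        n_fft=win, win_length=win, hop_length=hop,
        window="hamming",
    )
    # mfcc has shape (13, n_frames); n_frames varies with utterance length.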

Features will then be extracted from the numeric values of the spectral analysis. For example, one feature set could consist of the means and variances of each of the 13 frequency bands over the entire utterance, for a total of 26 features per utterance. Additional features will be extracted from each of the seven sound regions of the utterance. For example, each utterance could be divided into its seven speech sounds and the energy in the 13 frequency bands averaged within each sound, for a total of 7 x 13 = 91 features. The first cepstral coefficient might be omitted because it represents the energy of the signal and is probably not speaker specific.
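
The sketch below assembles both feature sets from a 13-by-n_frames MFCC matrix, assuming the six internal sound boundaries (frame indices) have already been obtained from the DTW alignment; the function names are our choices.

    # Assemble the 26 utterance-level and 91 sound-level features from a
    # 13 x n_frames MFCC matrix. The six internal sound boundaries are assumed
    # to come from the DTW alignment; function names are our choices.
    import numpy as np

    def utterance_features(mfcc):
        """Mean and variance of each band over the whole utterance (26 values)."""
        return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])

    def sound_features(mfcc, boundaries):
        """Average of the 13 bands within each of the 7 sounds (91 values)."""
        edges = [0] + list(boundaries) + [mfcc.shape[1]]  # 6 boundaries -> 7 spans
        means = [mfcc[:, a:b].mean(axis=1) for a, b in zip(edges, edges[1:])]
        return np.concatenate(means)

    def feature_vector(mfcc, boundaries):
        """Fixed-length vector (26 + 91 = 117 values) for the biometric system."""
        return np.concatenate([utterance_features(mfcc), sound_features(mfcc, boundaries)])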

Additional Features This Semester

In addition to the features described above, this semester we want to extract the voice fundamental frequency (pitch, usually denoted F0) and the first three formant frequencies (called F1, F2, F3). The F0 value should be averaged over the voiced sounds of the utterance, and its standard deviation obtained to characterize the variation. The formant frequency values should be obtained in each of the vowel sounds, and at the beginning and end of diphthongs (a sketch of extracting F0 and the formants follows the feature list below). Also, try to find additional features of your choosing.

The additional utterance-length features are: F0 (mean and standard deviation, 2 features).
The additional sound-related features are:
  [m]:  none
  [ai]: F1, F2, F3 at the beginning and end of the sound (6 features)
  [n]:  none
  [ei]: F1, F2, F3 at the beginning and end of the sound (6 features)
  [m]:  none
  [i]:  F1, F2, F3 (3 features)
  [z]:  none
This gives 2 + 6 + 6 + 3 = 17 additional features per utterance.
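
A sketch of extracting F0 and the formant frequencies follows. librosa's pYIN pitch tracker and an LPC-root method for the formants are our assumed techniques, not ones prescribed by the project, and the pitch range and LPC order are illustrative.

    # F0 via librosa's pYIN tracker; formants via the roots of an LPC polynomial.
    # These are assumed techniques; the pitch range and LPC order are illustrative.
    import numpy as np
    import librosa

    y, sr = librosa.load("subject01_day1_take1.wav", sr=None)

    # F0: mean and standard deviation over the voiced frames (2 features).
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0_voiced = f0[voiced]            # keep only the frames judged voiced
    f0_mean, f0_std = f0_voiced.mean(), f0_voiced.std()

    def formants(segment, sr):
        """Estimate F1, F2, F3 from the roots of an LPC polynomial."""
        order = 2 + sr // 1000                 # common rule of thumb for LPC order
        a = librosa.lpc(segment, order=order)
        roots = [r for r in np.roots(a) if np.imag(r) > 0]  # one per conjugate pair
        freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
        return [f for f in freqs if f > 90][:3]  # drop near-DC roots; keep F1-F3

    # Example: formants in a 30 msec window at the start of the [ai] diphthong,
    # given its starting sample index from the segmentation.
    # f1, f2, f3 = formants(y[start:start + int(0.030 * sr)], sr)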

Pace University Biometric Authentication System

Because the team has so far been unable to reliably segment the utterance into the seven sounds, we need manually segmented data in order to provide biometric performance results early in the semester. The deadline for providing accurately segmented data (manually, automatically, or a combination) from 30 users is March 1. If we do not currently have data from 30 users, you must obtain such data.

By the end of the semester we clearly need an automated system so we can process large numbers of sample utterances.

The generic Pace University Biometric Authentication System will be used to perform various voiceprint authentication experiments.


Background

Previous work on this problem

Rationale for choosing the same phrase for all users

Pace University biometric authentication system

Links to some speech processing tools