The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT
Rio Akasaka ’09, Youngmoo Kim, Ph.D.*
Department of Linguistics/Engineering, Swarthmore College
*Drexel University
Conclusions
Although speaker recognition is expected to perform best with a larger training set, the amount of testing material did not appear to affect performance, provided that at least three files were used and that the number of training files was equal to or greater than the number of testing files. This tolerance might be expected to extend to the length of the WAV files, but it did not: testing with half a file consistently produced poor results.
Most importantly, training and testing with vowel phones only yielded a recognition rate of approximately 93%, which merits further study.
With regard to the contribution of individual phones, a single phoneme was not found to predict a speaker more effectively when the same phoneme was used for training than when any other phoneme was used.
However, two phonemes consistently outperformed the others in predicting a speaker: 'ae' and 'ay'. Across the five trials, 'ae' had the highest recognition rate three times and 'ay' twice, and both were among the top two in four of the trials. More tests are being run to reach a statistically significant conclusion.
Mel-Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech that are commonly used to characterize sound files. They are derived by taking the Fourier transform of a signal, mapping the resulting spectrum onto the mel scale (a perceptually based scale of pitch differences), and taking the discrete cosine transform of the logarithm of the mel-scaled energies. With these features computed for each speech file, the similarity between two files can be measured by the Kullback–Leibler (KL) distance between the probability distributions of their features. Given a training set on which to base the decision, the corresponding speaker can then be identified.
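The comparison step can be sketched in Python. This is a minimal illustration under stated assumptions, not the method used in this research: it assumes each file is summarized by a single diagonal-covariance Gaussian over its MFCC frames, uses a symmetrized KL distance, and uses one common mel-scale formula. All function names are hypothetical, and the MFCC frames are assumed to have been computed already.

```python
import math

def hz_to_mel(f_hz):
    """Map Hz to mels with one widely used formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def fit_diag_gaussian(frames):
    """Return per-coefficient (means, variances) for a list of MFCC frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # Floor the variance so the KL terms below never divide by zero.
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-9)
        for d in range(dims)
    ]
    return means, variances

def kl_diag(p, q):
    """KL(p || q) for two diagonal-covariance Gaussians (means, variances)."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        math.log(vq[d] / vp[d]) + (vp[d] + (mp[d] - mq[d]) ** 2) / vq[d] - 1.0
        for d in range(len(mp))
    )

def symmetric_kl(p, q):
    """Symmetrized KL distance: KL(p || q) + KL(q || p)."""
    return kl_diag(p, q) + kl_diag(q, p)

def identify(test_frames, training_models):
    """Return the speaker whose training model is closest to the test file."""
    test_model = fit_diag_gaussian(test_frames)
    return min(
        training_models,
        key=lambda speaker: symmetric_kl(test_model, training_models[speaker]),
    )
```

Here `training_models` is assumed to map a speaker ID to the output of `fit_diag_gaussian` on that speaker's training frames. A richer model, such as a mixture of Gaussians, would need a different distance computation.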
The goal of this research is to test the robustness of MFCCs in speaker detection by varying the testing and training parameters with the following methods:
1) using segments of a whole speech file,
2) varying the number of speech files used, and
3) splicing together the vowel phones of a speech file.
The following nomenclature is adopted in this poster:
F: Full (complete) speech file
H: Speech file segmented at the middle (half file)
V: File consisting of vowel phones only
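The H and V variants can be derived from a full file (F) roughly as follows. This is a sketch under assumptions: samples are held in a Python list, phone segments are given as (start sample, end sample, label) tuples, and the vowel label set is an assumed list of TIMIT vowel symbols, since the poster does not state which labels were treated as vowels. The function names are hypothetical.

```python
# Assumed set of TIMIT vowel labels; the poster does not list the set it used.
VOWELS = {
    "iy", "ih", "eh", "ey", "ae", "aa", "aw", "ay", "ah", "ao",
    "oy", "ow", "uh", "uw", "ux", "er", "ax", "ix", "axr", "ax-h",
}

def half_files(samples):
    """Split a full file (F) at its midpoint into two half files (H)."""
    mid = len(samples) // 2
    return samples[:mid], samples[mid:]

def vowel_file(samples, segments, vowels=VOWELS):
    """Splice together the samples of every vowel segment (V)."""
    spliced = []
    for start, end, label in segments:
        if label in vowels:
            spliced.extend(samples[start:end])
    return spliced
```

Both halves are returned because the poster does not say which half was used for testing.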
To learn more about the role that individual phones play in speaker recognition, the same algorithm was applied to individual phonemes extracted from each speaker. Each training set consists of files that contain only the segments for a particular phoneme, and these files are then tested individually.
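Building the per-phoneme files can be sketched as below. This is an illustrative assumption about the bookkeeping rather than the actual procedure: segments are (start sample, end sample, label) tuples, and `segments_by_phoneme` is a hypothetical helper name.

```python
from collections import defaultdict

def segments_by_phoneme(samples, segments):
    """Concatenate one speaker's samples per phoneme label."""
    grouped = defaultdict(list)
    for start, end, label in segments:
        grouped[label].extend(samples[start:end])
    return dict(grouped)
```

Each value in the returned dictionary corresponds to one single-phoneme file for that speaker, which could then be used for training or testing.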
The TIMIT Corpus
The TIMIT corpus was created as a joint effort between Texas Instruments (TI) and MIT. It consists of time-aligned orthographic, phonetic, and word transcriptions for each of its 6,300 16-bit, 16 kHz speech files. Each of 630 speakers, drawn from the 8 major dialects of American English, read 10 ‘phonetically rich’ texts, 2 of which are common to all speakers.
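Example time-aligned transcriptions (start sample, end sample, label):
Phonetic:
10160 10733 y
10733 11880 axr
Word:
10160 11880 your
Orthographic:
0 57140 She had your dark suit in greasy wash water all year.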
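Each transcription entry pairs a start sample and an end sample with a label. A minimal Python sketch for reading such entries follows, assuming whitespace-separated fields with the label as the remainder of the line. The function names are hypothetical, and the 16 kHz sample rate is taken from the corpus description.

```python
SAMPLE_RATE_HZ = 16000  # sample rate stated in the corpus description

def parse_alignment(text):
    """Parse 'start end label' lines into (start, end, label) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2].strip()
        entries.append((start, end, label))
    return entries

def duration_seconds(start, end, sample_rate=SAMPLE_RATE_HZ):
    """Convert a span in samples to a duration in seconds."""
    return (end - start) / sample_rate
```

For the phone 'y' in the example, `duration_seconds(10160, 10733)` gives 573 samples at 16 kHz, or about 36 ms.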
In order to investigate the distribution of the phonemes in TIMIT, the plot shown above was generated. The average sample length is
The individual texts may be phonetically rich, but taken as a whole the distribution of the phonemes is unbalanced.
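One way to quantify this imbalance is to compute the relative frequency of each phoneme label across the corpus. The sketch below assumes segments are available as (start sample, end sample, label) tuples; `phoneme_distribution` is a hypothetical name, and this is not necessarily how the plot was produced.

```python
from collections import Counter

def phoneme_distribution(segments):
    """Return each phoneme's share of all segments, as a fraction of 1."""
    counts = Counter(label for _, _, label in segments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

A balanced corpus would give every phoneme roughly the same share, so a wide spread in these values indicates an unbalanced distribution.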
Figure 2. Speaker prediction based on individual phonemes. Although speaker recognition based on an individual phoneme is low (μ=3.60%, σ=2.34), the diagonal shows slightly higher recognition rates, as would be expected.
Figure 1. Speaker recognition based on 144 vowel-based files
Figure 1. The predicted speaker ID plotted against the actual speaker, for 144 full speech files.
Grateful acknowledgement is made to Youngmoo Kim for providing insight and direction throughout my research and to Jiahong Yuan for encouraging my pursuit of corpus phonetics.
For further information
Figure 3. Predicted speaker plotted against actual speaker, by speaker ID, to test whether one speaker is consistently retrieved as the best candidate for a particular phoneme. Speaker 183 is selected most often in this scenario.
Please contact [email protected]. Further details about the methodology are available online at wiki.rioleo.org.