The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT
1 / 1

Introduction - PowerPoint PPT Presentation

  • Uploaded on

The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT. Rio Akasaka ’09, Youngmoo Kim, Ph.D* Department of Linguistics/Engineering, Swarthmore College *Drexel University. Results. Conclusions

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Introduction' - will

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Introduction 3343675

The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT

Rio Akasaka ’09, Youngmoo Kim, Ph.D*

Department of Linguistics/Engineering, Swarthmore College *Drexel University



While optimal performance in speaker recognition is expected with a larger training set, the availability of testing material did not seem to affect performance if at least three files are used and if the number of training files is equal to or greater than the number of testing files. Though this might be expected to extend to the length of the wav files, it was not necessarily the case because using half a file to test consistently demonstrated poor results.

Most importantly, testing and training with vowel phones only provided impressive recognition rates at approximately 93%, meriting further study.

With regards to individual phone contributions to recognition, it was found that a single phoneme does not predict a speaker more effectively when using the same phoneme to train, as compared to any other phoneme.

However, two phonemes consistently outperform the others in predicting 1a speaker: 'ae' and 'ay'. Of the five trials, 'ae' was ranked most highly recognized 3 times, 'ay' was highest twice, and both were among the top two in four of the trials. More tests are being done to obtain a statistically significant conclusion.


Mel- Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are derived by obtaining the Fourier transform of a signal and mapping the result on the mel-scale, which is an auditory perception-based scale of pitch differences. With these unique labels on speech files, the similarity between two files can be determined by the Kullback–Leibler (KL) distance, which is based on probability distributions, and, given a training set upon which to base one’s decisions, the corresponding speaker can be identified.

The goal of this research is to test the robustness of MFCCs in speaker detection by varying the testing and training parameters with the following methods:

1) using segments of a whole speech file

2) varying the number of speech files used, and

3) splicing together the vowel phones of a speech


  • Vowels only, cont.

    • If individual phoneme files are used for training and testing, the results are impressive.

  • Train: 5V, Test: 5V

  • Evaluated: 570 Correct: 533, Percentage: 0.935088

  • Train: 3V, Test: 3V

  • Evaluated: 342 Correct: 278, Percentage: 0.812865

The following nomenclature is adopted in this poster:

F: Full (complete) speech file

H: Speech file segmented at middle

V: File consisting of vowel phones only

  • Control

  • Train: 5F , Test: 5F (not including SA)

  • Evaluated: 570 Correct: 493, Percentage: 0.864912

  • Difference in number

    • Reducing both the number of training and testing files to be consistent results in an optimal (~84.2%) success rate, but only up to 3 files.

  • Train: 3F , Test: 5F

  • Evaluated: 570 Correct: 440, Percentage: 0.771930

  • Train: 5F , Test: 3F

  • Evaluated: 342 Correct: 292, Percentage: 0.853801

  • Train: 3F , Test: 3F

  • Evaluated: 342 Correct: 290, Percentage: 0.847953

Further examination

In order to extract more information about the role that individual phones play in speaker recognition, the same algorithm was applied to test recognition based on individual phonemes that are extracted from each speaker. The training set consists of files containing only file segments for a particular phoneme, which are then later tested individually.

The TIMIT Corpus

The TIMIT corpus was created as a joint effort between Texas Instruments (TI) and MIT and consists of time-aligned orthographic, phonetic and word transcriptions for each of the 6300 16-bit 16kHz speech files. 630 speakers from the 8 major dialects of American English each read from 10 ‘phonetically rich’ texts, among which 2 are common across all speakers.


10160 10733 y

10733 11880 axr


10160 11880 your


0 57140 She had

your dark suit

in greasy wash

water all year.

In order to investigate the distribution of the phonemes in TIMIT, the plot shown above was generated. The average sample length is

The individual texts may be phonetically rich, but taken as a whole the distribution of the phonemes is unbalanced.

Literature cited

Cole, Ronald A., et al.. 1996. The Contribution of Consonants Versus Vowels to Word Recognition in Fluent Speech

Van Heerden, C.J, E. Bernard. 2008. Speaker-specific variability of Phoneme Durations.

Fattah, Mohamed, Ren Fuji, Shingo Kuroiwa. 2006. Phoneme Based Speaker Modeling to Improve Speaker Identification

Figure 2. Speaker prediction based on individual phonemes. The results show that while speaker recognition based on individual phoneme is considerably low (μ=3.60%, σ=2.34), the diagonal does show slightly higher recognition rates, as would be expected.

Figure 1. Speaker recognition based on 144 vowel-based files

Figure 1. The predicted speaker ID plotted against the actual speaker, for 144 full speech files.

  • Difference in size

    • Performance is considerably better with full speech files during testing, regardless of which half of the file we use.

  • Train: 5F, Test: 5H

  • Evaluated: 570 Correct: 327, Percentage: 0.573684

  • Train: 5H, Test: 5F

  • Evaluated: 570 Correct: 442, Percentage: 0.775439

  • Train: 5H, Test: 3F

  • Evaluated: 342 Correct: 277, Percentage: 0.809942


Grateful acknowledgement is made to Youngmoo Kim for providing insight and direction throughout my research and to Jiahong Yuan for encouraging my pursuit of corpus phonetics.

For further information

  • Vowels only

    • Individual phonemes are exceedingly difficult to use when predicting speaker based on entire wav files.

  • Train: 5V, Test: 3F

  • Evaluated: 342 Correct: 242, Percentage: 0.707602

  • Train: 5F, Test: 5V

  • Evaluated: 570 Correct: 53, Percentage: 0.092982

Figure 3. To test the possibility that one speaker is consistently retrieved as the ideal candidate for a particular phoneme, the above plot was generated to plot the predicted speaker vs the actual speaker based on speaker ID. Speaker 183 is selected most often in the above scenario.

Please contact Further details about the methodology may be read online at