Speaker ID Smorgasbord or How I spent My Summer at ICSI

Kofi A. Boakye, International Computer Science Institute


Presentation Transcript


  1. Speaker ID Smorgasbord, or How I Spent My Summer at ICSI
     Kofi A. Boakye, International Computer Science Institute
     Speech Group Lunch Talk

  2. Outline
     • Keyword System
     • Enhancements
     • Monophone System
     • Hybrid HMM/SVM
     • Score Combinations
     • Possible Directions

  3. Keyword System: A Review
     • Motivation
       I. Text-dependent systems have high performance, but limited flexibility compared to text-independent systems.
          => Capitalize on the advantages of text-dependent systems in this text-independent domain by limiting the words of interest to a select group: backchannels (yeah, uhhuh), filled pauses (um, uh), and discourse markers (like, well, now…), which combine high frequency with high speaker-characteristic quality.
       II. GMMs assume frames are independent and fail to take advantage of sequential information.
          => Use HMMs instead to model the evolution of speech in time.

  4. Keyword System: A Review
     • Approach
       - Model each speaker using a collection of keyword HMMs
       - Speaker models generated via adaptation of background models trained on a development data set
       - Use standard likelihood ratio approach: compute log likelihood ratio scores using accumulated log probabilities from keyword HMMs
       - Use a speech recognizer to:
         - Locate words in the speech stream
         - Align speech frames to the HMM
         - Generate acoustic likelihood scores
     [Diagram: signal → Word Extractor → HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N → Combination]
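The likelihood ratio step on slide 4 can be sketched roughly as follows. This is a minimal illustration, not the system's actual scoring code: the function name `keyword_llr`, the per-frame normalization, and all numbers are assumptions made for the example.

```python
def keyword_llr(target_logprobs, background_logprobs, frame_counts):
    """Accumulate log-probabilities over all tokens of a keyword and form a ratio.

    target_logprobs / background_logprobs: one log-probability per keyword token,
    from the speaker-adapted HMM and the background HMM respectively.
    frame_counts: number of frames aligned to each token.
    Dividing by the total frame count is an assumption, not stated in the slides.
    """
    total_frames = sum(frame_counts)
    if total_frames == 0:
        raise ValueError("no keyword frames found in the test utterance")
    return (sum(target_logprobs) - sum(background_logprobs)) / total_frames

# Three hypothetical tokens of 'yeah' located by the recognizer (made-up values).
tgt = [-410.0, -395.5, -402.0]
ubm = [-420.0, -400.5, -410.0]
frames = [40, 38, 42]
score = keyword_llr(tgt, ubm, frames)
print(score)
```

A positive score means the speaker model explains the keyword tokens better than the background model does.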

  5. Keyword System: A Review
     • Keywords
       - Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
       - Filled pauses: {um, uh}
       - Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}
     • Keyword Models
       - Simple left-to-right (whole-word) HMMs with self-loops and no skips
       - 4 Gaussian components per state
       - Number of states related to number of phones and median number of frames for the word
       - HMMs trained and scored using HTK
       - Acoustic features: 19 mel-cepstra, zeroth cepstrum, and their first differences

  6. System Performance
     • Switchboard 1 Dev Set
       - Data partitioned into 6 splits
       - Tests use a jack-knifing procedure: test on splits 1-3 using a background model trained on splits 4-6 (and vice versa)
       - For development, tested primarily on split 1 with 8-side training
     • Result: EER = 0.83%
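For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the miss rate equals the false-alarm rate. The sketch below estimates it by sweeping a threshold over the observed scores; the function name and the toy scores are illustrative assumptions, and evaluation tools may interpolate differently.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by trying every observed score as the decision threshold."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted when its score is >= the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer

# Made-up scores for four target trials and four impostor trials.
targets = [2.0, 1.5, 0.9, 0.2]
impostors = [1.0, 0.1, -0.3, -1.2]
eer = equal_error_rate(targets, impostors)
print(eer)
```

In this toy set one target falls below and one impostor rises above the best threshold, so both error rates are one in four.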

  7. System Performance
     • Observations:
       - Well-performing bigrams have comparable EERs
       - Poorly-performing bigrams suffer from a paucity of data
       - Suggests possibility of frequency threshold for performance
       - Single word ‘yeah’ yields EER of 4.62%

  8. Enhancements: Words
     • Examine the performance of other words
     • Sturim et al. propose word sets for text-constrained GMM system
       - Full set: 50 words that occur in > 70% of conversation sides

  9. Enhancements: Words
     • Examine the performance of other words
     • Sturim et al. propose word sets for text-constrained GMM system
       - Full set: 50 words that occur in > 70% of conversation sides
         {and, I, that, yeah, you, just, like, uh, to, think, the, have, so, know, in, but, they, really, it, well, is, not, because, my, that’s, on, its, about, do, for, was, don’t, one, get, all, with, oh, a, we, be, there, of, this, I’m, what, out, or, if, are, at}

  10. Enhancements: Words
      • Examine the performance of other words
      • Sturim et al. propose word sets for text-constrained GMM system
        - Full set: 50 words that occur in > 70% of conversation sides
        - Min set: 11 words that yield the lowest word-specific EERs

  11. Enhancements: Words
      • Examine the performance of other words
      • Sturim et al. propose word sets for text-constrained GMM system
        - Full set: 50 words that occur in > 70% of conversation sides
        - Min set: 11 words that yield the lowest word-specific EERs
          {and, I, that, yeah, you, just, like, uh, to, think, the}

  12. Enhancements: Words
      • Performance
        - Full set: EER = 1.16%
        - My set ∩ Full set = {yeah, like, uh, well, I, think, you}

  13. Enhancements: Words
      • Observations:
        - Some poorly performing words occur quite frequently
        - Such words may simply not be highly discriminative in nature
        - Single word ‘and’ yields EER of 2.48%!

  14. Enhancements: Words
      • Performance
        - Min set: EER = 0.99%
        - My set ∩ Min set = {yeah, like, uh, I, you, think}

  15. Enhancements: Words
      • Observations:
        - Except for ‘and’, min set words have comparable performance
        - Most fall into one of the three categories of filled pause, discourse marker, or backchannel, either in isolation or in conjunction

  16. Enhancements: HNorm
      • Target model scores have different distributions for utterances based on handset type
      • Perform mean and variance normalization of scores based on the estimated impostor score distribution
      • For split 1, use impostor utterances from splits 2 and 3
        - 75 females
        - 86 males
      [Figure: LR scores → HNorm scores, with electret (elec) and carbon-button (carb) score distributions for target models tgt1 and tgt2]
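The normalization described on slide 16 can be sketched as below for a single target model. This is an assumption-laden illustration: the function names, the use of the population standard deviation, and the scores are invented for the example and are not taken from the talk's implementation.

```python
from statistics import mean, pstdev

def hnorm_params(impostor_scores_by_handset):
    """Estimate (mean, std) of impostor scores against one target model, per handset type."""
    return {h: (mean(s), pstdev(s)) for h, s in impostor_scores_by_handset.items()}

def hnorm(score, handset, params):
    """Normalize a raw LR score using the statistics for the test utterance's handset."""
    mu, sigma = params[handset]
    return (score - mu) / sigma

# Made-up impostor scores against one target model, grouped by handset label.
impostors = {
    "elec": [-1.0, 0.0, 1.0],
    "carb": [-3.0, -2.0, -1.0],
}
params = hnorm_params(impostors)
print(hnorm(1.0, "elec", params), hnorm(-1.0, "carb", params))
```

Two raw scores that sit the same distance above their handset's impostor mean end up with the same normalized score, which is the intended effect.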

  17. Enhancements: HNorm
      • Performance: EER = 1.65%
      • Performance worsened!
      • Possible issue in HNorm implementation?

  18. Enhancements: HNorm
      • Examine effect of HNorm on particular speaker scores
      • Speakers of interest: those generating the most errors
        - 3 speakers, each generating 4 errors

  19. Enhancements: HNorm

  20. Enhancements: HNorm

  21. Enhancements: HNorm

  22. Enhancements: HNorm
      • Conclusion: HNorm works…but doesn’t
      • One possibility: look at computed devs…
        - Distributions are widening in some cases

  23. Enhancements: Deltas
      • Problem: System performance differs significantly by gender
      • Hypothesis: Higher deltas for females may be noisier
      • Solution: Use longer window for delta computation to smooth
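The effect of a longer delta window can be illustrated with a regression-style delta of the kind used by HTK-like front ends. This is a minimal sketch: the slides do not state which formula the system used, and the function name `delta`, the edge handling (repeating the first and last frame), and the sample values are assumptions.

```python
def delta(coeffs, window):
    """Regression delta of one coefficient track using `window` frames on each side."""
    n = len(coeffs)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        num = 0.0
        for k in range(1, window + 1):
            ahead = coeffs[min(t + k, n - 1)]   # repeat the last frame past the end
            behind = coeffs[max(t - k, 0)]      # repeat the first frame before the start
            num += k * (ahead - behind)
        deltas.append(num / denom)
    return deltas

# A steadily rising cepstral track (made-up values) has slope 1 per frame.
ramp = [float(i) for i in range(9)]
narrow = delta(ramp, window=2)
wide = delta(ramp, window=3)
print(narrow[4], wide[4])
```

Away from the edges both windows recover the same slope on a clean ramp. The wider window averages over more frames, which smooths noise but can also blur fast spectral changes.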

  24. Enhancements: Deltas
      • Extended window size from 2 to 3
      • Result: EER = 0.83%
      • Performance nearly indistinguishable

  25. Enhancements: Deltas
      • Extended window size from 2 to 3
      • Result: Male and female disparity remains

  26. Enhancements: Deltas
      • Extended window size from 3 to 5
      • Result: EER = 1.32%
      • Performance worsens!

  27. Enhancements: Deltas
      • Extended window size from 3 to 5
      • Result: Male and female disparity widens
      • Further investigation necessary

  28. Monophone System
      • Motivation
        - Keyword system, with its use of HMMs, appears to have good performance
        - However, we are only using a small amount (~10%) of the total data available
          => Get full coverage by using phone HMMs rather than word HMMs
        - System represents a trade-off between token coverage and “sharpness” of modeling

  29. Monophone System
      • Implementation
        - System implemented similarly to keyword system, with phones replacing words
        - Background models differ in that:
          - All models have 3 states, with 128 Gaussians per state
          - Models trained by successive splitting and Baum-Welch re-estimation, starting with a single Gaussian
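Growing a state from a single Gaussian to 128 by successive splitting takes seven doublings. The sketch below shows only the splitting step for a one-dimensional mixture; the ±0.2 standard-deviation mean offset, the choice to split every component at each round, and the function name are assumptions, and the Baum-Welch re-estimation between splits is omitted.

```python
def split_mixture(components, offset=0.2):
    """Split every (weight, mean, std) component into two perturbed copies."""
    out = []
    for weight, mu, sd in components:
        # Each child keeps the parent's spread and takes half of its weight.
        out.append((weight / 2, mu - offset * sd, sd))
        out.append((weight / 2, mu + offset * sd, sd))
    return out

mixture = [(1.0, 0.0, 1.0)]  # start from a single Gaussian
n_splits = 0
while len(mixture) < 128:
    mixture = split_mixture(mixture)
    # Baum-Welch re-estimation of the new components would run here.
    n_splits += 1
print(n_splits, len(mixture))
```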

  30. Monophone System
      • Performance: EER = 1.16%
      • Similar performance to keyword system
      • Uses a lot more data!

  31. Hybrid HMM/SVM System
      • Motivation
        - SVMs have been shown to yield good performance in speaker recognition systems
        - Features used:
          - Frames
          - Phone and word n-gram counts/frequencies
          - Phone lattices

  32. Hybrid HMM/SVM System
      • Motivation
        - Keyword system looks at “distance” between target and background models as measured by log-probabilities
        - Look at distance between models more explicitly
          => Use model parameters as features

  33. Hybrid HMM/SVM System
      • Approach
        - Use concatenated mixture means as features for SVM
        - Positive examples obtained by adapting background HMM to each of 8 training conversations
        - Negative examples obtained by adapting background HMM to each conversation in the background set
        - Keyword-level SVM outputs combined to give final score
          - Presently a simple linear combination with equal weighting is used (though clearly suboptimal)
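Two pieces of the approach on slide 33 are easy to show in isolation: building a feature vector from concatenated mixture means, and the equal-weight combination of keyword-level outputs. The sketch below leaves out the SVM itself; the function names, the nested-list layout, and all values are assumptions for illustration.

```python
def supervector(state_mixture_means):
    """Flatten means[state][mixture][dimension] of one adapted HMM into one vector."""
    return [x for state in state_mixture_means for mix in state for x in mix]

def combine_equal(keyword_scores):
    """Equal-weight linear combination of per-keyword SVM outputs."""
    return sum(keyword_scores.values()) / len(keyword_scores)

# Toy adapted HMM: 2 states, 2 mixture components per state, 3-dimensional means.
means = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
]
features = supervector(means)

# Hypothetical SVM outputs for three keywords on one test conversation.
final_score = combine_equal({"yeah": 0.5, "um": -0.25, "you_know": 1.25})
print(len(features), final_score)
```

With the models from slide 5 (4 components per state, 40-dimensional features), each keyword's vector would have 160 entries per HMM state.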

  34. Hybrid HMM/SVM System
      • Performance: EER = 1.82%
      • Promising first start

  35. Score Combination
      • We have three independent systems, so let’s see how they combine…
      • Perform post facto (read: cheating) linear combination
      • Each best combination yields same EER
        => Possibly approaching EER limit for data set
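A post facto linear combination amounts to searching for the weights that give the lowest EER on the very trials being evaluated, which is why the slide calls it cheating. The sketch below does this with a coarse grid over three systems; the grid resolution, function names, and scores are assumptions, not the procedure used in the talk.

```python
def eer(targets, impostors):
    """Threshold-sweep estimate of the equal error rate."""
    best_gap, rate = float("inf"), None
    for t in sorted(set(targets) | set(impostors)):
        miss = sum(s < t for s in targets) / len(targets)
        fa = sum(s >= t for s in impostors) / len(impostors)
        if abs(miss - fa) < best_gap:
            best_gap, rate = abs(miss - fa), (miss + fa) / 2
    return rate

def fuse(per_system_scores, weights):
    """Weighted sum across systems, trial by trial."""
    return [sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*per_system_scores)]

def best_combination(tgt, imp, steps=4):
    """Try every weight triple on a grid that sums to 1; keep the lowest EER."""
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i / steps, j / steps, (steps - i - j) / steps)
            rate = eer(fuse(tgt, w), fuse(imp, w))
            if best is None or rate < best[0]:
                best = (rate, w)
    return best

# Made-up scores: one list per system, one entry per trial.
tgt = [[2.0, 0.4, 1.5, 1.1], [1.0, 1.8, 0.2, 1.4], [0.9, 1.2, 1.6, 0.3]]
imp = [[0.5, -0.2, 1.2, 0.1], [0.3, 1.1, -0.4, 0.0], [0.2, 0.4, -0.1, 1.0]]
best_rate, best_weights = best_combination(tgt, imp)
print(best_rate, best_weights)
```

Because the grid includes the three single-system corners, the selected combination can never score worse than the best individual system on the same trials.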

  36. Possible Directions
      • Develop on SWB2
      • Create word “master list” for keyword system
      • TNorm
      • Modify features to address gender-specific performance disparity
      • Score combination for hybrid system
      • Modified hybrid system
      • Tuning
      • Plowing

  37. Fin