Recommendations Based on Speech Classification

Recommendations Based on Speech Classification (and examples of what recommender systems can learn from signal processing). Christian Müller, German Research Center for Artificial Intelligence; International Computer Science Institute, Berkeley, CA.

Presentation Transcript


  1. Recommendations Based on Speech Classification (and examples of what recommender systems can learn from signal processing). Christian Müller, German Research Center for Artificial Intelligence; International Computer Science Institute, Berkeley, CA

  2. Overview • Speech as a source of information for non-intrusive user modeling • Speech/signal processing topics, each with its take-away message: • Vocal aging -> features for speaker age recognition (take-away: knowledge-driven feature selection) • GMM/SVM supervector approach for acoustic speech features (take-away: classification methods for independent “bag of observations” features) • Detection task and pseudo-NIST evaluation procedure (take-away: valid application-independent evaluation) • Rank and polynomial rank normalization (take-away: feature space warping normalization) • Conclusions

  3. Speech as a Source for Non-Intrusive UM • An adaptive speech dialog system with a user model adapts its dialog behavior (e.g. detailed map with shops vs. arrows) and provides recommendations (e.g. a different route to the gate). • Example system prompt: “Now it’s time to get to gate 38.” • Information about the user: A. speaker classification (speech = sensor; inference from sensors, not intrusive) B. explicit statement (intrusive)

  4. Speaker Classification Systems (input: audio segment, telephone quality) • Cognitive load: Best Research Paper Award, UM 2001 • Age and gender: Voice Award 2007; Telekom live operation 2009 • Language: 14 languages + dialects; NIST evaluation 2007 • Identity: project with BKA 2009; NIST* evaluation 2008 • Acoustic events: project with VW 2008; Interspeech 2008

  5. Recommendations Based on Speech Classification

  6. Product Recommendations Based on Age and Gender

  7. Product Recommendations Based on Age and Gender (figure labels: YF, AM). Michael Feld and Christian Müller. Speaker Classification for Mobile Devices. In Proceedings of the 2nd IEEE International Interdisciplinary Conference on Portable Information Devices (Portable 2008), 2008.

  8. How can you find features for building your models by explicitly studying the underlying phenomena? • Proposing knowledge-driven feature selection on the example of features for speaker age recognition

  9. Speaker Classification as an Interdisciplinary Area of Research • What are the manifestations of age (and gender) in the speaker’s voice and speaking style? • How can the age (and the gender) of a speaker be recognized automatically? • What are the requirements of a speaker classification system and how can they be met on the implementation layer? • Contributing disciplines: speech technology / artificial intelligence, phonetics, voice pathology, software technology

  10. Impact of Aging on Human Speech Production: speech breathing • Thorax: stiffer • Lungs: lighter, less elastic, lower position • Effects: lower expiratory volume, more speech pauses, lower amplitude

  11. Impact of Aging on Human Speech Production: laryngeal area • Larynx: calcification and ossification • Vocal folds: loss of tissue, stiffening • Effects: rise of fundamental frequency (in men), reduced voice quality

  12. Impact of Aging on Human Speech Production: supralaryngeal area • Facial bones and muscles: degeneration, reduced elasticity • Effects: imprecise articulation, for example vowel centralization

  13. Impact of Aging on Human Speech Production: neurological effects • Loss of tissue in the cortex • Reduced performance of the neuronal transmitters • Effects: reduced articulation rate, defective coordination between the articulators, vowel centralization

  14. Development of F0 in Men / Women [Figure: F0 (Hz, axis 90 to 170) over age in years (20 to 90); men: non-smokers only; women: smokers and non-smokers. Source: Linville (2001)]

  15. Age Classes (female / male) • Children (<= 13 years): CF / CM • Youth (14 - 19 years): YF / YM • Adults (20 - 64 years): AF / AM • Seniors (>= 65 years): SF / SM
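
The age-class scheme on this slide can be sketched as a small lookup. This is only an illustration: the function name `age_gender_class` and the single-letter gender codes "F" and "M" are assumptions made for the example, not something the talk specifies.

```python
def age_gender_class(age_years: int, gender: str) -> str:
    """Map an age and a gender code to one of the eight class labels
    (CF, CM, YF, YM, AF, AM, SF, SM) using the age boundaries on the slide."""
    if gender not in ("F", "M"):
        raise ValueError("gender must be 'F' or 'M'")
    if age_years <= 13:
        group = "C"  # children
    elif age_years <= 19:
        group = "Y"  # youth
    elif age_years <= 64:
        group = "A"  # adults
    else:
        group = "S"  # seniors
    return group + gender


if __name__ == "__main__":
    print(age_gender_class(17, "F"))  # YF
```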

  17. Features • Fundamental frequency (pitch): mean (pitch_mean), standard deviation (pitch_stddev), min, max and difference (pitch_min / pitch_max / pitch_diff) • Voice quality: shimmer (shim_l / shim_ldb / shim_apq3 / shim_apq11 / shim_ddp), jitter (jitt_l / jitt_la / jitt_rap / jitt_ppq / jitt_ddp), harmonics-to-noise ratio (harm_mean / harm_stddev) • Articulation rate (ar_rate) • Speech pauses (pause_num / pause_dur)
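
A minimal sketch of how some of these utterance-level features might be computed from a pitch track. Several things here are assumptions rather than details from the talk: that unvoiced frames are marked with 0, that the population standard deviation is used, and that `jitt_l` stands for local jitter in the common sense of the mean absolute difference between consecutive glottal periods divided by the mean period. The function names are hypothetical.

```python
from statistics import mean, pstdev


def pitch_features(f0_hz):
    """Utterance-level pitch statistics from per-frame F0 values in Hz.
    Frames with F0 == 0 are treated as unvoiced and skipped (assumption)."""
    voiced = [f for f in f0_hz if f > 0]
    return {
        "pitch_mean": mean(voiced),
        "pitch_stddev": pstdev(voiced),  # population std. dev. (assumption)
        "pitch_min": min(voiced),
        "pitch_max": max(voiced),
        "pitch_diff": max(voiced) - min(voiced),
    }


def local_jitter(periods_s):
    """Mean absolute difference of consecutive periods over the mean period.
    Assumed here to be what the feature name jitt_l refers to."""
    diffs = [abs(a - b) for a, b in zip(periods_s, periods_s[1:])]
    return mean(diffs) / mean(periods_s)


if __name__ == "__main__":
    print(pitch_features([100.0, 0.0, 120.0, 110.0]))
    print(local_jitter([0.010, 0.012, 0.010]))
```

Because each statistic is computed once per utterance, the resulting feature vector has a fixed length regardless of how long the speaker talks.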

  18. Features • Voice: fundamental frequency (pitch: mean, standard deviation, min, max and difference); voice quality (shimmer, jitter, harmonics-to-noise ratio) • Speaking style: articulation rate, speech pauses

  19. Example Results [Figures: jitter (high jitter value = low voice quality), fundamental frequency (F0) and speech pauses per class CF, CM, YF, YM, AF, AM, SF, SM; panel label C_YFAFSFYM_AM_SM]. Christian Müller. Zweistufige kontextsensitive Sprecherklassifikation am Beispiel von Alter und Geschlecht [Two-layered Context-Sensitive Speaker Classification on the Example of Age and Gender]. AKA, Berlin, 2006.

  20. Hierarchical Feature Model, from high-level features (learned characteristics) down to low-level features (physical characteristics): • semantics • dialog • idiolect (e.g. <s> how shall I say this <c> <s> yeah I know...) • phonetics (e.g. /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ // /n/ /i:/ ...) • prosody • spectrum

  21. How can your features be modeled assuming that they • are multi-dimensional • represent repeating observations of the same kind • can be assumed to be independent (“bag” of observations) • Proposing the GMM/SVM supervector approach on the example of frame-by-frame acoustic features

  22. General Classification Scheme • Preprocessing, e.g. channel compensation (not addressed in this talk) • Feature extraction • Classification, e.g. support-vector machines, multilayer perceptron networks • Fusion • Top-down knowledge

  23. Modeling Acoustics and Prosodics • Feature hierarchy: semantics, dialog, idiolect (e.g. <s> how shall I say this <c> <s> yeah I know...), phonetics (e.g. /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ // /n/ /i:/ ...), prosody, spectrum • no ASR

  24. Generative Approach: Gaussian Mixture Model (GMM) • Training: “emergency vehicle” data -> feature extraction -> probability density (“emergency vehicle” model) • Test: frame of speech -> feature extraction -> “emergency vehicle” model -> avg. likelihood over all frames for class “emergency vehicle”

  25. Generative Approach: Gaussian Mixture Model (GMM) • Test: frame of speech -> feature extraction -> “emergency vehicle” model and background model -> avg. log likelihood ratio over all frames for class “emergency vehicle”
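
The average log likelihood ratio against a background model can be sketched in a few lines. This is a hedged illustration, not the system from the talk: it assumes diagonal-covariance GMMs given as lists of hypothetical `(weight, mean, variance)` tuples with hand-set parameters, whereas a real system would train them on data.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log likelihood of x under a GMM given as [(weight, mean, var), ...],
    computed with the log-sum-exp trick for numerical stability."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def avg_llr(frames, target_gmm, background_gmm):
    """Average per-frame log likelihood ratio: target model vs. background."""
    total = sum(
        gmm_log_likelihood(f, target_gmm) - gmm_log_likelihood(f, background_gmm)
        for f in frames
    )
    return total / len(frames)


if __name__ == "__main__":
    target = [(1.0, [0.0], [1.0])]      # toy one-component "target" model
    background = [(1.0, [3.0], [1.0])]  # toy background model
    print(avg_llr([[0.1], [-0.2], [0.3]], target, background))
```

Averaging over frames makes the score independent of segment length, which is the property the supervector approach later builds on.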

  26. A Mixture of Gaussians • Means, variances, and mixture weights are optimized in training • Black line = mixture of 3 Gaussians

  27. Discriminative Method: Support Vector Machine (SVM) • Features are transformed into a higher-dimensional space where the problem is linear • Discriminating hyperplane is learned by maximizing the margin between the classes • Trade-off between training error and width of margin • Model is stored in the form of “support vectors” (data points on the margin) • Training: “em. vehic.” (1) and “not em. vehic.” (-1) examples -> feature extraction -> “em. vehic.” model

  28. Discriminative Method: Support Vector Machine (SVM) • Discriminative methods have been shown to be superior to generative methods for similar tasks • Feature vectors have to be of the same length (sensitive to variable segment lengths) • Solutions: • feature statistics calculated over the entire utterance • fixed portion of the segment • sequential kernels • Test: segment -> feature extraction -> score (distance to hyperplane)

  29. GMM/SVM Supervector Approach • Combines the discriminative power of SVMs with the length independence of GMMs • Very successful with similar tasks such as speaker recognition • GMM is trained using MAP adaptation • Pipeline: feature extraction -> Gaussian means (MAP adapted), which are stacked into one fixed-length supervector for the SVM
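
A hedged sketch of how a supervector can be built: the means of a background GMM are MAP-adapted to the frames of one segment and then concatenated. The update used here is the standard relevance-MAP rule for means (adapted mean = alpha * data mean + (1 - alpha) * prior mean, with alpha = n / (n + r)); the relevance factor of 16, the data layout and all names are assumptions for illustration, not details from the talk.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def map_adapted_supervector(frames, ubm, relevance=16.0):
    """MAP-adapt the means of a background GMM [(weight, mean, var), ...]
    to the given frames and return the concatenated adapted means."""
    dim = len(ubm[0][1])
    counts = [0.0] * len(ubm)              # soft frame counts per component
    sums = [[0.0] * dim for _ in ubm]      # responsibility-weighted sums
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in ubm]
        top = max(logs)
        post = [math.exp(l - top) for l in logs]
        total = sum(post)
        for k, p in enumerate(post):
            gamma = p / total              # responsibility of component k
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    supervector = []
    for k, (_, prior_mean, _) in enumerate(ubm):
        alpha = counts[k] / (counts[k] + relevance)
        for d in range(dim):
            data_mean = sums[k][d] / counts[k] if counts[k] > 0 else prior_mean[d]
            supervector.append(alpha * data_mean + (1.0 - alpha) * prior_mean[d])
    return supervector


if __name__ == "__main__":
    ubm = [(0.5, [-5.0], [1.0]), (0.5, [5.0], [1.0])]
    print(map_adapted_supervector([[6.0]] * 16, ubm))
```

The output length depends only on the number of components and the feature dimension, so segments of any duration yield vectors an SVM can compare directly.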

  30. Evaluation Results. Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.

  31. How can you evaluate your multi-class models independently from the given application? • How can you establish an appropriate evaluation procedure in order to obtain valid results? • Proposing the detection task and the “pseudo NIST” evaluation procedure on the example of acoustic event detection and speaker age recognition.

  32. Background • With multi-class recognition problems, many test/analysis methods are very application specific, e.g. confusion matrices. • We want a method that allows results to be generalized across a large set of applications. • With home-grown databases, parameter tuning on the evaluation set often compromises the validity of the results/inferences. • We want a fair “one shot” evaluation.

  33. The Detection Task • Given • a speech segment (s) • and an acoustic event to be detected (target event, ET) • the task is to decide whether ET is present in s (yes or no) • the system's output shall also contain a score indicating its confidence, with more positive scores indicating greater confidence. • Example: target “emergency vehicle?” -> system -> “yes, 1.324326”

  34. Terminology • Segment class: e.g. segment event, segment age-class; ground truth (not known). • Target: the hypothesized class. • Trial: a combination of segment and target.

  35. Evaluation • The system performance is evaluated by presenting it with a set of trials. • Each test segment is used for multiple trials. • The absence of all targets is explicitly included. • Example targets for one segment: emergency vehicle? music? talking? laughing? phone? no event? • Example system outputs (decision, score): yes 1.32432; no -0.3212; no 1.8463; no -2.5773; yes 0.00132; no 2.20122
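
The trial construction described on this slide can be sketched as a cross product of segments and targets. The data layout (a dictionary from segment id to true class) and the function name `make_trials` are assumptions for illustration; the target list is taken from the example on the slide, including the explicit "no event" target.

```python
TARGETS = ["emergency vehicle", "music", "talking", "laughing", "phone", "no event"]


def make_trials(segment_classes, targets):
    """Pair every segment with every target.
    Returns (segment_id, target, is_target_trial) tuples, where the flag is
    True when the hypothesized target equals the segment's true class."""
    trials = []
    for segment_id, true_class in segment_classes.items():
        for target in targets:
            trials.append((segment_id, target, true_class == target))
    return trials


if __name__ == "__main__":
    segments = {"seg001": "emergency vehicle", "seg002": "no event"}
    for trial in make_trials(segments, TARGETS):
        print(trial)
```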

  36. Types of Errors • MISS: segment “em. vehic.”, target “em. vehic.”?, system says no • FALSE ALARM: segment “em. vehic.”, target “phone”?, system says yes

  37. Decision-Error Tradeoff • Selecting an operating point (decision threshold) along the dotted line trades misses off against false alarms. • Optimal operating point is application dependent. • Low false alarm rates are desirable for most applications. [Figure: misses vs. false alarms, with the “equal error rate” point marked]
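
To make the trade-off concrete, the following sketch computes miss and false alarm rates for a given threshold and approximates the equal error rate by scanning the observed scores. The convention that a trial is accepted when its score is at or above the threshold, and the toy scores, are assumptions for illustration.

```python
def error_rates(scored_trials, threshold):
    """scored_trials: list of (score, is_target) pairs.
    A trial is accepted when score >= threshold (assumed convention).
    Returns (miss rate on target trials, false alarm rate on non-targets)."""
    targets = [s for s, is_target in scored_trials if is_target]
    nontargets = [s for s, is_target in scored_trials if not is_target]
    p_miss = sum(s < threshold for s in targets) / len(targets)
    p_fa = sum(s >= threshold for s in nontargets) / len(nontargets)
    return p_miss, p_fa


def approx_eer(scored_trials):
    """Scan all observed scores as thresholds and return (error rate, threshold)
    at the point where miss and false alarm rates are closest."""
    best = None
    for threshold in sorted({s for s, _ in scored_trials}):
        p_miss, p_fa = error_rates(scored_trials, threshold)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2.0, threshold)
    return best[1], best[2]


if __name__ == "__main__":
    trials = [(2.0, True), (1.5, True), (1.0, True), (-1.0, True),
              (0.5, False), (0.0, False), (-0.5, False), (-2.0, False)]
    print(error_rates(trials, 0.0))
    print(approx_eer(trials))
```

Raising the threshold lowers the false alarm rate at the price of more misses, which is exactly the movement along the curve described on the slide.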

  38. Decision Cost Function • $C(E_T, E_N) = C_{\mathrm{Miss}} \cdot P_{\mathrm{Target}} \cdot P_{\mathrm{Miss}}(E_T) + C_{\mathrm{FA}} \cdot (1 - P_{\mathrm{Target}}) \cdot P_{\mathrm{FA}}(E_T, E_N)$, where $E_T$ and $E_N$ are the target and non-target events, and $C_{\mathrm{Miss}}$, $C_{\mathrm{FA}}$ and $P_{\mathrm{Target}}$ are application model parameters. • Weighted sum of misses and false alarms using variable costs and priors. • Application model parameters are selected according to the application. • The application parameters for EER are: $C_{\mathrm{Miss}} = C_{\mathrm{FA}} = 1$ and $P_{\mathrm{Target}} = 0.5$.
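
The cost function on this slide translates directly into code. The function and parameter names are illustrative; the defaults are the EER parameters given on the slide, and the second parameter set in the usage example is made up to show how an application with rare targets and expensive misses would weight the two error types.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted sum of miss and false alarm probabilities.
    Defaults are the EER parameters: C_Miss = C_FA = 1, P_Target = 0.5."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


if __name__ == "__main__":
    # Same error rates, two different application models.
    print(detection_cost(0.10, 0.02))
    # Hypothetical application: targets are rare, misses are expensive.
    print(detection_cost(0.10, 0.02, c_miss=10.0, c_fa=1.0, p_target=0.01))
```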

  39. Example DET-Plot [Figure: miss probability vs. false alarm probability]. Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.

  40. Example Cost Chart: acoustic GMM/SVM supervector system on 7-class age task

  41. Pseudo NIST Evaluation Procedure • ERL provided development and evaluation data that were as representative as possible of the application. • Three months before the evaluation, ICSI was provided with the development data. • At a pre-determined date, the blind evaluation data was provided to ICSI for processing. • The system's output was submitted to ERL in NIST format. • ERL downloaded the scoring software from NIST’s website and made the necessary modifications due to the changes in the labels. • ERL ran the software on the submitted system output. • The results were then disclosed to ICSI along with the keys (truth) for further analysis. • --> Fair “one-shot” evaluation, no parameter tuning on the evaluation set.

  42. How can you normalize your features in order to obtain a uniform scale and a uniform distribution? • Proposing rank normalization and polynomial rank normalization

  43. Background • Fundamental frequency (pitch): 75-200 Hz • Jitter: 0.001324 PPQ • --> features on such different scales are implicitly weighted

  44. Mean/Variance Normalization • $a_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)}$ • uniform scale • non-uniform distribution
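
The formula on this slide rescales each value by the observed minimum and maximum. A minimal sketch follows; the function name and the handling of a constant feature (all values mapped to 0) are assumptions. The usage example illustrates the "non-uniform distribution" point: a single large value squeezes the rest of the data into a small part of the range.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    pitch_hz = [75.0, 80.0, 85.0, 90.0, 200.0]
    # The scale is uniform, but most values end up close to 0.
    print(min_max_normalize(pitch_hz))
```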

  45. Rank Normalization • create ordered list of values using background (bg) data • rank = position in list / number of values • no occurrence mapped to 0 • Example background model for feature 0101 (value -> normalized feature): 0 -> 0; 0.01 -> 0.25; 0.06 -> 0.5; 0.13 -> 0.75; 0.29 -> 1 • [Diagram: incoming features such as 0101 0.75, 0123 0.4, 2317 0.2 are looked up in the background model to obtain the normalized feature]
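
A minimal sketch of rank normalization for a single feature, following the rule "rank = position in list / number of values". It is a simplified illustration: the name `make_rank_normalizer` is hypothetical, ties are counted via `bisect_right`, and unseen values simply take the rank of the next smaller background value rather than being interpolated.

```python
from bisect import bisect_right


def make_rank_normalizer(background_values):
    """Build a normalizer from background data for one feature.
    A value is mapped to the fraction of background values <= that value,
    so anything below the smallest background value maps to 0."""
    ordered = sorted(background_values)
    count = len(ordered)

    def normalize(value):
        return bisect_right(ordered, value) / count

    return normalize


if __name__ == "__main__":
    normalize = make_rank_normalizer([0.01, 0.06, 0.13, 0.29])
    for v in [0.0, 0.01, 0.06, 0.13, 0.29]:
        print(v, normalize(v))
```

Because the mapping depends only on the ordering of the background data, the normalized feature is spread evenly over [0, 1] regardless of the original scale.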

  46. Rank Normalization • (+) uniform distribution • (-) large three-dimensional lookup tables • (-) linear interpolation for unseen values • larger values? smaller values?

  47. Polynomial Rank Normalization • use ranks to train a polynomial • apply polynomial instead of look-up tables • better interpolation • no need to store look-up tables. Christian Müller and Joan-Isaac Biel. The ICSI 2007 Language Recognition System. In Proceedings of the Odyssey 2008 Workshop on Speaker and Language Recognition. Stellenbosch, South Africa, 2008.
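
One way to read "use ranks to train a polynomial" is a least-squares fit from background values to their ranks, after which only the coefficients need to be stored. The sketch below follows that reading under several assumptions that are not from the talk: the polynomial degree, the fitting method (normal equations solved by Gaussian elimination, adequate only for low degrees), the clipping of outputs to [0, 1], and all function names.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.
    Returns coefficients with the lowest order first."""
    m = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        known = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - known) / a[i][i]
    return coeffs


def make_polynomial_rank_normalizer(background_values, degree=3):
    """Fit a polynomial that maps a raw value to its rank in the background data."""
    ordered = sorted(background_values)
    count = len(ordered)
    ranks = [(i + 1) / count for i in range(count)]
    coeffs = fit_polynomial(ordered, ranks, degree)

    def normalize(value):
        result = 0.0
        for c in reversed(coeffs):  # Horner evaluation
            result = result * value + c
        return min(1.0, max(0.0, result))  # clip to [0, 1] (assumption)

    return normalize


if __name__ == "__main__":
    background = [i / 10 for i in range(1, 11)]
    normalize = make_polynomial_rank_normalizer(background, degree=3)
    print(normalize(0.55))
```

Compared with a lookup table, the polynomial gives a smooth value for inputs that never occurred in the background data, and only `degree + 1` numbers per feature have to be kept.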

  48. Conclusions • Speech as a source of information for non-intrusive user modeling • Speech/signal processing topics, each with its take-away message: • Vocal aging -> features for speaker age recognition (take-away: knowledge-driven feature selection) • GMM/SVM supervector approach for acoustic speech features (take-away: classification methods for independent “bag of observations” features) • Detection task and pseudo-NIST evaluation procedure (take-away: valid application-independent evaluation) • Rank and polynomial rank normalization (take-away: feature space warping normalization)

  49. Thank you! Christian Müller
