
Maximum Likelihood Adaptation of Semi-Continuous HMMs

This paper discusses the adaptation of semi-continuous Hidden Markov Models (HMMs) using maximum likelihood methods. It also explores the application of Probabilistic Latent Semantic Analysis (PLSA) for adaptation in both speech recognition and information retrieval systems. The evaluation results show the effectiveness of the proposed adaptation techniques.

Presentation Transcript


  1. Maximum Likelihood Adaptation of Semi-Continuous HMMs by Latent Variable Decomposition of State Distributions. LTI Student Research Symposium 2004. Antoine Raux, work done in collaboration with Rita Singh.

  2. Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation

  3. Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation

  4. HMMs for Speech Recognition • Generative probabilistic model of speech • States represent sub-phonemic units • In general, 2 types of parameters: • Temporal aspect: transition probabilities • Spectral aspect: output distributions (means, variances, mixing weights of mixtures of Gaussians) • 2 broad types of structure: • Continuous Density • Semi-Continuous

  5. Continuous Density HMMs [Diagram: each state Si, Sj, Sk owns its own Gaussians N(mi1,vi1), N(mi2,vi2), N(mi3,vi3), ..., N(mk3,vk3), mixed with state-specific weights such as wi1 = P(Ci1 | S = Si).]

  6. Semi-Continuous HMMs [Diagram: all states Si, Sj, Sk share a single codebook of Gaussians N(m1,v1), ..., N(m7,v7); only the mixture weights wi1, ..., wi7 are state-specific.]

  7. SCHMMs vs CDHMMs • Less powerful (i.e., continuous-density models do better with large amounts of training data) • BUT faster to compute (fewer Gaussian computations) and they train well on less data • Training of the codebook and the mixture weights can be decoupled
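
To make the structural difference concrete, here is a minimal sketch (not from the slides; function and variable names are assumptions) of how a state's emission density is evaluated in each model type. The computational advantage claimed above is visible in the SCHMM version: the shared codebook densities are computed once per frame and reused by every state.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cdhmm_emission(x, state_gaussians, state_weights):
    """CDHMM: b_i(x) = sum_k w_ik N(x; m_ik, v_ik).
    Every state owns its own Gaussians and its own weights."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=v)
               for w, (m, v) in zip(state_weights, state_gaussians))

def schmm_emission(x, codebook, state_weights):
    """SCHMM: b_i(x) = sum_k w_ik N(x; m_k, v_k).
    All states share one codebook of Gaussians; only the weights w_ik
    are state-specific, so the codebook densities can be computed once
    per frame and reused by all states."""
    shared = np.array([multivariate_normal.pdf(x, mean=m, cov=v) for m, v in codebook])
    return float(np.asarray(state_weights) @ shared)
```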

  8. Acoustic Adaptation • Both CDHMMs and SCHMMs need a large amount of data for training • Such amounts are not available for every condition (domain, speakers, environment) • Acoustic adaptation: modify models trained on a large amount of data so that they match different conditions, using a small amount of data

  9. Model-based (ML) Adaptation • Tie the parameters of different states so that all states can be adapted with little data • Typical method: Maximum Likelihood Linear Regression (MLLR) used to adapt means and variances of CDHMMs
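
For context, the MLLR mean update referred to here (the standard Leggetter and Woodland formulation, not spelled out on the slide) ties Gaussian means into regression classes and applies one affine transform per class:

    \hat{\mu}_k = A_r\,\mu_k + b_r

where the transform (A_r, b_r) of regression class r is estimated by maximizing the likelihood of the adaptation data. This works for means and variances, but there is no comparable linear transform for mixture weights, which is the problem taken up on the next slide.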

  10. Adapting Mixture Weights • Problem: MLLR does not work for the mixture weights of SCHMMs • The weights are constrained (their sum always equals 1), so they are not evenly distributed in parameter space • Standard clustering algorithms are ineffective • Goal: tie states with similar weight distributions

  11. Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation

  12. Parallel with Information Retrieval • Typical problem in Information Retrieval: identify similar documents • Documents can be represented as distributions over the vocabulary: tie documents with similar word distributions

  13. Word Document Representation [Diagram: each document Di, Dj, Dk is a distribution over the vocabulary Word1, ..., Word7, with document-specific weights wi1, ..., wi7.]

  14. Problems with Word Document Representation • Word distribution for a document is sparse • Ambiguous words, synonyms… • Cannot reliably compare distributions to compare documents

  15. PLSA for IR • Solution proposed by Hofmann (1999): Probabilistic Latent Semantic Analysis • Express documents and words as distributions over a latent variable (topic?) • The latent variable takes a small number of values compared to the number of words/documents • Similar to standard LSA but guarantees proper probability distributions
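
In Hofmann's formulation (standard PLSA notation, not copied from the slide), the document-word joint probability is factored through the latent topic variable z:

    P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)

so that d and w are conditionally independent given z; this is the independence assumption mentioned on the decomposition slide below.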

  16. PLSA for IR [Diagram: words Word1, ..., Word7 and documents Di, Dj, Dk are linked through latent topics Z1, ..., Z4, with wz11 = P(Word1 | Z = Z1) on the word side and wdi1 = P(Z1 | D = Di) on the document side.]

  17. PLSA Decomposition • Decompose the joint probability through the latent variable, as in the formula above (independence assumption: d and w are conditionally independent given z!) • The decomposed distribution Pd(d,w) lies on a sub-space of the probability simplex (the PLS-space) • Estimate the parameters with the EM algorithm so as to minimize the KL-divergence between the empirical P(d,w) and Pd(d,w)
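
As an illustration of this estimation step, here is a minimal numpy sketch of the EM iterations for the symmetric PLSA decomposition above (variable names and the dense, toy implementation are my own; a real IR system would use sparse counts):

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """EM for P(d, w) ~ sum_z P(z) P(d|z) P(w|z).
    counts: (n_docs, n_words) document-word co-occurrence matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random, normalized initialization of the three factors.
    p_z = np.full(n_topics, 1.0 / n_topics)                                         # P(z)
    p_d_z = rng.random((n_topics, n_docs)); p_d_z /= p_d_z.sum(1, keepdims=True)    # P(d|z)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)   # P(w|z)

    for _ in range(n_iters):
        # E-step: posterior P(z|d,w) proportional to P(z) P(d|z) P(w|z).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]  # (z, d, w)
        post = joint / np.maximum(joint.sum(0, keepdims=True), 1e-300)

        # M-step: re-estimate the factors from expected counts n(d,w) * P(z|d,w).
        expected = counts[None, :, :] * post
        p_w_z = expected.sum(axis=1); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_d_z = expected.sum(axis=2); p_d_z /= p_d_z.sum(1, keepdims=True)
        p_z = expected.sum(axis=(1, 2)); p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z
```

The same updates, with states in place of documents and codebook Gaussians in place of words, are what the adaptation scheme in the next section applies to the SCHMM mixture-weight matrix.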

  18. Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation

  19. Back to Speech Recognition… [Diagram: the SCHMM picture from slide 6 again: states Si, Sj, Sk share the codebook N(m1,v1), ..., N(m7,v7) with state-specific weights wi1, ..., wi7.]

  20. PLSA for SCHMMs [Diagram: the codebook Gaussians N(m1,v1), ..., N(m7,v7) and the states Si, Sj, Sk are linked through latent variables Z1, ..., Z4, with wz11 = P(C1 | Z = Z1) on the codeword side and wsi1 = P(Z1 | S = Si) on the state side.]
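
The quantities in the figure correspond to decomposing each state's mixture-weight distribution over the shared codebook through the latent variable (notation assumed by analogy with the IR case):

    w_{ik} = P(C_k \mid S_i) \approx \sum_{z} P(C_k \mid z)\, P(z \mid S_i)

so states play the role of documents and codebook Gaussians the role of words.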

  21. Adaptation through PLSA [Flowchart: train an SCHMM with Baum-Welch on the large database, giving Transitions I, Means I, Variances I, Weights I; decompose Weights I with PLSA into P(Z) I, P(C|Z) I, P(S|Z) I; retrain the SCHMM with Baum-Welch on the small database, giving Transitions II, Means II, Variances II, Weights II; decompose Weights II with PLSA into P(Z) I/II, P(C|Z) I/II, P(S|Z) I/II; recompose these factors into Weights III.]
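
A minimal sketch of the final "recompose weights" box, assuming the three factors have already been (re-)estimated by a PLSA EM run like the one above on the retrained weight matrix; the function name and array shapes are my own, not from the paper:

```python
import numpy as np

def recompose_weights(p_z, p_s_z, p_c_z):
    """Recompose SCHMM mixture weights from a PLSA decomposition.

    p_z   : (n_topics,)              P(Z)
    p_s_z : (n_topics, n_states)     P(S|Z)
    p_c_z : (n_topics, n_codewords)  P(C|Z)
    Returns an (n_states, n_codewords) matrix of smoothed weights
    w_ik = sum_z P(C_k|z) P(z|S_i), with P(z|S_i) obtained by Bayes' rule.
    """
    p_zs = p_s_z * p_z[:, None]                           # joint P(z, s)
    p_z_given_s = (p_zs / p_zs.sum(0, keepdims=True)).T   # (n_states, n_topics)
    weights = p_z_given_s @ p_c_z                         # (n_states, n_codewords)
    return weights / weights.sum(1, keepdims=True)        # guard against rounding
```

Because the recomposed matrix lies on the low-dimensional PLS-space, this step acts as the smoothing by projection mentioned in the conclusion.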

  22. Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation

  23. Evaluation Experiment • Training data/Original models • 50 hours of calls to the Communicator system • Mostly native speakers • 4000 states, 256 Gaussian components • Adaptation data • 3 hours of calls to the Let’s Go system • Non-native speakers • Evaluation data • 449 utterances (20 min) from calls to Let’s Go • Non-native speakers

  24. Evaluation results

  25. Evaluation results [Chart: results for Transitions I, Means I, Variances I, Weights I (the original models).]

  26. Evaluation results [Chart: results for Transitions II, Means II, Variances II, Weights II (retrained on the adaptation data).]

  27. Evaluation results [Chart: results for Transitions II, Means II, Variances II, Weights III (PLSA-recomposed). Best Result: Readapt everything!!]

  28. Reestimating all three distributions: P(Z), P(C|Z) and P(S|Z) [Flowchart: the same pipeline as slide 21: train the SCHMM on the large database, decompose Weights I into P(Z) I, P(C|Z) I, P(S|Z) I, retrain on the small database, decompose Weights II with all three distributions re-estimated (P(Z) I/II, P(C|Z) I/II, P(S|Z) I/II), and recompose them into Weights III.]

  29. Conclusion • PLSA ties states of SCHMMs by introducing a latent variable • PLSA adaptation improves accuracy • Best method is equivalent to smoothing the retrained weight distributions by projection on the PLS-space • Future direction: directly learn the PLSA parameters in the Baum-Welch training

  30. Thank you… Questions?
