
Conditional Random Fields for Automatic Speech Recognition



  1. Conditional Random Fields for Automatic Speech Recognition Jeremy Morris 05/12/2010

  2. Motivation • What is the purpose of Automatic Speech Recognition? • Take an acoustic speech signal … • … and extract higher level information (e.g. words) from it “speech”

  3. Motivation • How do we extract this higher level information from the speech signal? • First extract lower level information • Use it to build models of phones, words “speech” / s p iy ch /

  4. Motivation • State-of-the-art ASR takes a top-down approach to this problem • Extract acoustic features from the signal • Model a process that generates these features • Use these models to find the word sequence that best fits the features “speech” / s p iy ch /

  5. Motivation • A bottom-up approach • Look for evidence of speech in the signal • Phones, phonological features • Combine this evidence together to find the most probable sequence of words in the signal voicing? burst? frication? “speech” / s p iy ch /

  6. Motivation • How can we combine this evidence? • Conditional Random Fields (CRFs) • Discriminative, probabilistic sequence model • Models the conditional probability of a sequence given evidence voicing? burst? frication? “speech” / s p iy ch /

  7. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  8. CRF Models • Conditional Random Fields (CRFs) • Discriminative probabilistic sequence model • Directly defines a posterior probability P(Y|X) of a label sequence Y given evidence X

  9. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence

  10. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence

  11. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence

  12. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence

  13. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence • Evidence can influence transitions between states

  14. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence • Evidence can influence transitions between states

  15. CRF Models • Evidence is incorporated via feature functions state feature functions

  16. CRF Models • Evidence is incorporated via feature functions transition feature functions state feature functions

  17. CRF Models state feature functions transition feature functions • The form of the CRF is an exponential model of weighted feature functions • Weights trained via gradient descent to maximize the conditional likelihood
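As a rough sketch of this exponential form (standard linear-chain CRF notation, not copied from the slides), the model can be written as

    P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_t \Big[ \sum_i \lambda_i \, s_i(y_t, X, t) + \sum_j \mu_j \, f_j(y_{t-1}, y_t, X, t) \Big] \Big)

where the s_i are state feature functions, the f_j are transition feature functions, \lambda and \mu are the trained weights, and Z(X) sums the same exponential over all possible label sequences so that P(Y | X) is a proper distribution.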

  18. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  19. Phone Recognition • What evidence do we have to combine? • MLP ANN trained to estimate frame-level posteriors for phonological features • MLP ANN trained to estimate frame-level posteriors for phone classes P(voicing|X) P(burst|X) P(frication|X) … P( /ah/ | X) P( /t/ | X) P( /n/ | X) …

  20. Phone Recognition • Use these MLP outputs to build state feature functions

  21. Phone Recognition • Use these MLP outputs to build state feature functions

  22. Phone Recognition • Use these MLP outputs to build state feature functions
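A minimal Python sketch of one such state feature function, assuming the MLP posteriors are available as a frames-by-outputs array (the names and indices below are illustrative, not taken from the system described in the slides):

    import numpy as np

    def state_feature(mlp_posteriors, t, y_t, crf_label, mlp_output):
        # Fires only when the label hypothesized at frame t matches crf_label;
        # its value is then the MLP posterior estimate for mlp_output at that frame.
        # mlp_posteriors: array of shape (num_frames, num_mlp_outputs)
        if y_t != crf_label:
            return 0.0
        return float(mlp_posteriors[t, mlp_output])

    # Example: a feature pairing CRF label index 5 with MLP output index 5
    posteriors = np.random.dirichlet(np.ones(61), size=100)   # stand-in for real MLP outputs
    value = state_feature(posteriors, t=10, y_t=5, crf_label=5, mlp_output=5)

The CRF's per-frame state score is then the weighted sum of such features over all (CRF label, MLP output) pairs, with weights trained by gradient descent as on slide 17.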

  23. Phone Recognition • Pilot task – phone recognition on TIMIT • ICSI Quicknet MLPs trained on TIMIT, used as inputs to the CRF models • Compared to Tandem and a standard PLP HMM baseline model • Output of ICSI Quicknet MLPs as inputs • Phone class attributes (61 outputs) • Phonological feature attributes (44 outputs)

  24. Phone Recognition *Significantly (p<0.05) better than the comparable Tandem system (Morris & Fosler-Lussier 08)

  25. Phone Recognition • Moving forward: How do we make use of CRF classification for word recognition? • Attempt to fit CRFs into current state-of-the-art models for speech recognition? • Attempt to use CRFs directly? • Each approach has its benefits • Fitting CRFs into a standard framework lets us reuse existing code and ideas • A model that uses CRFs directly opens up new directions for investigation • Requires some rethinking of the standard model for ASR

  26. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  27. HMM-CRF Word Recognition • Inspired by Tandem HMM systems • Uses ANN outputs as input features to an HMM “speech” / s p iy ch / PCA

  28. HMM-CRF Word Recognition • Inspired by Tandem HMM systems • Uses ANN outputs as input features to an HMM • HMM-CRF system (Crandem) • Use a CRF to generate input features for the HMM • See if improved phone accuracy helps the system • Problem: CRFs estimate the probability of the entire sequence, not individual frames “speech” / s p iy ch / PCA

  29. HMM-CRF Word Recognition • One solution: Forward-Backward Algorithm • Used during CRF training to maximize the conditional likelihood • Provides an estimate of the posterior probability of a phone label given the input
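A compact sketch of that computation in Python, assuming the CRF's weighted state and transition scores are already available as log-domain arrays (the shapes and names here are assumptions for illustration, not the thesis implementation):

    import numpy as np
    from scipy.special import logsumexp

    def frame_posteriors(log_state, log_trans):
        # log_state: (T, L) weighted state-feature scores per frame and label
        # log_trans: (L, L) weighted transition-feature scores between labels
        T, L = log_state.shape
        alpha = np.zeros((T, L))
        beta = np.zeros((T, L))
        alpha[0] = log_state[0]
        for t in range(1, T):                 # forward pass
            alpha[t] = log_state[t] + logsumexp(alpha[t - 1][:, None] + log_trans, axis=0)
        for t in range(T - 2, -1, -1):        # backward pass
            beta[t] = logsumexp(log_trans + log_state[t + 1] + beta[t + 1], axis=1)
        log_z = logsumexp(alpha[-1])          # log partition function Z(X)
        return np.exp(alpha + beta - log_z)   # (T, L) per-frame posteriors P(y_t = l | X)

These per-frame posteriors take the place of the MLP outputs in the modified Tandem (Crandem) pipeline shown on the next slides.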

  30. HMM-CRF Word Recognition • Original Tandem system “speech” / s p iy ch / PCA

  31. HMM-CRF Word Recognition • Modified Tandem system (Crandem) Local Feature Calc. PCA “speech” / s p iy ch /

  32. HMM-CRF Word Recognition • Pilot task – phone recognition on TIMIT • Same ICSI Quicknet MLP outputs used as inputs • Crandem compared to Tandem, a standard PLP HMM baseline model, and to the original CRF • Evidence on transitions • This work also examines the effect of using the same MLP outputs as transition features for the CRF

  33. HMM-CRF Word Recognition • Pilot Results 1 (Fosler-Lussier & Morris 08) *Significant (p<0.05) improvement at 0.6% difference between models

  34. HMM-CRF Word Recognition • Pilot Results 2 (Fosler-Lussier & Morris 08) *Significant (p<0.05) improvement at 0.6% difference between models

  35. HMM-CRF Word Recognition • Extension – Word recognition on WSJ0 • New MLPs and CRFs trained on the WSJ0 corpus of read speech • No phone level assignments, only word transcripts • Initial alignments from HMM forced alignment of MFCC features • Compare Crandem baseline to Tandem and original MFCC baselines • WSJ0 5K Word Recognition task • Same bigram language model used for all systems

  36. HMM-CRF Word Recognition • Results (Morris & Fosler-Lussier 09) *Significant (p≤0.05) improvement at roughly 0.9% difference between models

  37. HMM-CRF Word Recognition *Significant (p≤0.05) improvement at roughly 0.06% difference between models

  38. HMM-CRF Word Recognition Comparison of MLP activation vs. CRF activation

  39. HMM-CRF Word Recognition Ranked average per-frame activation MLP vs. CRF

  40. HMM-CRF Word Recognition • Insights from these experiments • CRF posteriors very different in flavor from MLP posteriors • Overconfident in the local decisions being made • Higher phone accuracy did not translate to lower WER • Further experiment to test this idea • Transform posteriors by taking a root and renormalizing, bringing classes closer together (see the sketch below) • Results were not significantly different from the baseline and no longer degraded with further epochs of training (though they did not improve either)
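A minimal sketch of that transformation, assuming per-frame posteriors in a frames-by-labels array (the root value here is only an example):

    import numpy as np

    def flatten_posteriors(posteriors, root=2.0):
        # Taking a root pulls overconfident posteriors back toward one another;
        # renormalizing keeps each frame a valid probability distribution.
        p = np.power(posteriors, 1.0 / root)
        return p / p.sum(axis=1, keepdims=True)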

  41. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  42. CRF Word Recognition • Instead of feeding CRF outputs into an HMM “speech” / s p iy ch /

  43. CRF Word Recognition • Instead of feeding CRF outputs into an HMM • Why not decode words directly off the CRF? “speech” / s p iy ch / “speech” / s p iy ch /

  44. CRF Word Recognition • The standard model of ASR uses likelihood based acoustic models • CRFs provide a conditional acoustic model P(Φ|X) Acoustic Model Lexicon Model Language Model
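To make the contrast concrete (standard notation, not copied from the slides): the conventional decoder searches

    W^* = \arg\max_W P(W \mid X) \approx \arg\max_{W, \Phi} P(X \mid \Phi) \, P(\Phi \mid W) \, P(W)

with a likelihood-based acoustic model P(X | Φ), a lexicon model P(Φ | W), and a language model P(W). A CRF instead supplies the conditional P(Φ | X); since P(X | Φ) is proportional to P(Φ | X) / P(Φ), dropping the conditional model into the same search also requires handling a phone prior term, which is plausibly the role of the phone penalty model introduced on the next slide.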

  45. CRF Word Recognition Lexicon Model Language Model CRF Acoustic Model Phone Penalty Model

  46. CRF Word Recognition • Models implemented using OpenFST • Viterbi beam search to find the best word sequence • Word recognition on WSJ0 • WSJ0 5K Word Recognition task • Same bigram language model used for all systems • Same MLPs used for CRF-HMM (Crandem) experiments • CRFs trained using a 3-state phone model instead of a 1-state model • Compare to Tandem and original MFCC baselines
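For intuition only, a toy frame-level Viterbi beam search over CRF scores might look like the following; the actual system instead composes the CRF acoustic scores with the lexicon and language model FSTs in OpenFST:

    import numpy as np

    def viterbi_beam(log_state, log_trans, beam=10.0):
        # log_state: (T, L) per-frame label scores; log_trans: (L, L) transition scores
        T, L = log_state.shape
        score = log_state[0].copy()
        backptr = np.zeros((T, L), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + log_trans             # cand[i, j]: best path ending in i, then i -> j
            backptr[t] = np.argmax(cand, axis=0)          # best predecessor for each current label
            score = cand[backptr[t], np.arange(L)] + log_state[t]
            score[score < score.max() - beam] = -np.inf   # prune hypotheses falling outside the beam
        path = [int(np.argmax(score))]
        for t in range(T - 1, 0, -1):                     # backtrace
            path.append(int(backptr[t, path[-1]]))
        return path[::-1]                                 # one label per frame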

  47. CRF Word Recognition • Results – Phone Classes only *Significant (p≤0.05) improvement at roughly 0.9% difference between models

  48. CRF Word Recognition • Results – Phone & Phonological features *Significant (p≤0.05) improvement at roughly 0.9% difference between models

  49. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  50. Conclusions & Future Work • Designed and developed software for CRF training for ASR • Developed a system for word-level ASR using CRFs • Meets baseline performance of an MLE trained HMM system • Platform for further exploration
