
Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition



  1. Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition
  Konstantin Markov and Satoshi Nakamura
  ATR Spoken Language Communication Research Laboratories (ATR-SLC), Kyoto, Japan
  SRIV2006, Toulouse, France

  2. Outline
  • Motivation and previous studies.
  • HMM-based accent acoustic modeling.
  • Hybrid HMM/BN acoustic model for accented speech.
  • Evaluation and results.
  • Conclusion.

  3. Motivation and Previous Studies
  • Accent variability:
    • Causes performance degradation due to training/testing condition mismatch.
    • Becomes a major factor for ASR's public applications.
  • Differences due to accent variability are mainly:
    • Phonetic:
      • Lexicon modification (Liu, ICASSP'98).
      • Accent-dependent dictionary (Humphries, ICASSP'98).
    • Acoustic (addressed in this work):
      • Pooled-data HMM (Chengalvarayan, Eurospeech'01).
      • Accent identification (Huang, ICASSP'05).

  4. HMM-based approaches (1)
  The accent-dependent data (A, B, C) is pooled and used to train a single multi-accent model, which the decoder uses for all input:
  accent-dependent data A, B, C → pooled data → MA-HMM (multi-accent AM)
  input speech → Feature Extraction → Decoder → recognition result

  5. HMM-based approaches (2)
  Each accent's data trains its own model; the accent-dependent HMMs are combined into a parallel acoustic model:
  accent-dependent data A, B, C → A-HMM, B-HMM, C-HMM → PA-HMM (parallel AM)
  input speech → Feature Extraction → Decoder → recognition result
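The parallel-model idea above can be sketched as scoring the input under every accent-dependent model and keeping the best-scoring one. A minimal toy sketch (the scalar diagonal-Gaussian "models" and all parameter values are made up for illustration; real acoustic models are triphone HMMs):

```python
import math

def gauss_loglik(features, mean, var):
    """Log-likelihood of a sequence of scalar features under one Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in features)

# Three toy accent-dependent "models" (parameters are invented).
models = {"US": (0.0, 1.0), "BRT": (2.0, 1.0), "AUS": (-2.0, 1.0)}

def recognize(features):
    # Score the utterance under every accent model; the decoder keeps the
    # model (and its hypothesis) with the highest likelihood.
    scores = {a: gauss_loglik(features, m, v) for a, (m, v) in models.items()}
    return max(scores, key=scores.get)

print(recognize([1.8, 2.2, 1.9]))  # features closest to the "BRT" Gaussian
```

In a real decoder the accent-dependent models run as parallel branches of one network, so the maximization happens implicitly during the search rather than as a separate pass.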

  6. HMM-based approaches (3)
  The data is split by gender instead; the gender-dependent HMMs form a parallel acoustic model:
  accent-dependent data A, B, C → M-HMM, F-HMM → GD-HMM (parallel AM)
  input speech → Feature Extraction → Decoder → recognition result

  7. Hybrid HMM/BN Background
  • HMM/BN structure:
    • HMM at the top level: models the temporal characteristics of speech by state transitions (states q1, q2, q3).
    • Bayesian Network (BN) at the bottom level: represents the state PDFs.
  • Simple BN example: variables are the HMM state Q, the mixture component index M, and the observation X.
  • State output probability, when M is hidden:
    b_j(x) = Σ_m P(x | M=m, Q=q_j) P(M=m | Q=q_j)
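The marginalization over the hidden mixture index M reduces, for this simple BN topology, to an ordinary Gaussian mixture evaluation. A self-contained sketch for one state (all weights and Gaussian parameters are toy values, and the observation is scalar for brevity):

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# BN parameters for one state q_j: mixture priors P(M=m | q_j) and the
# component Gaussians P(x | M=m, q_j) (toy values).
weights = [0.6, 0.4]
components = [(0.0, 1.0), (3.0, 2.0)]  # (mean, variance) per component

def state_output_prob(x):
    # b_j(x) = sum_m P(M=m | q_j) * P(x | M=m, q_j): marginalize the hidden M.
    return sum(w * gaussian_pdf(x, m, v)
               for w, (m, v) in zip(weights, components))
```

The appeal of the BN view is that richer topologies (the additional variables on the next slide) change only which factors appear inside this sum, not the decoding interface.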

  8. HMM/BN based Accent Model
  • Accent and gender are modeled as additional variables of the BN:
    • G = {F, M} (gender)
    • A = {A, B, C} (accent)
  • When G and A are hidden, the state output probability marginalizes over both:
    b_j(x) = Σ_a Σ_g P(x | A=a, G=g, Q=q_j) P(A=a) P(G=g)
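A hedged sketch of that marginalization (toy numbers throughout): the observation depends on the state, the accent A, and the gender G, and when A and G are unobserved at decoding time the state PDF sums over every accent/gender combination. The factorization with independent priors P(A)P(G), and one Gaussian per (accent, gender) pair, are simplifying assumptions of this sketch:

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

P_G = {"F": 0.5, "M": 0.5}                        # gender prior (assumed)
P_A = {"US": 1 / 3, "BRT": 1 / 3, "AUS": 1 / 3}   # accent prior (assumed)

# One made-up Gaussian per (accent, gender) pair for this state.
gaussians = {(a, g): (0.0 if g == "F" else 2.0, 1.0)
             for a in P_A for g in P_G}

def state_output_prob(x):
    # b_j(x) = sum_a sum_g P(a) P(g) P(x | a, g, q_j)
    return sum(P_A[a] * P_G[g] * gaussian_pdf(x, *gaussians[(a, g)])
               for a in P_A for g in P_G)
```

When accent or gender labels are available (as during training), the corresponding sum collapses to the single observed value, which is what makes the labelled-data training on the next slide straightforward.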

  9. HMM/BN Training
  • Initial conditions:
    • Bootstrap HMM: gives the (tied) state structure.
    • Labelled data: each feature vector has an accent and gender label.
  • Training algorithm:
    Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
    Step 2: Initialization of the BN parameters.
    Step 3: Forward-Backward based embedded HMM/BN training.
    Step 4: If the convergence criterion is met, stop; otherwise go to Step 3.
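The Step 3/Step 4 loop can be illustrated on a deliberately tiny problem: re-estimating the mixture weights of one state with an EM-style iteration and stopping once the log-likelihood gain drops below a threshold. The data and Gaussian parameters are invented; real embedded training also re-estimates the Gaussians and accumulates Forward-Backward statistics over whole utterances rather than pre-aligned frames:

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

data = [-0.2, 0.1, 0.3, 3.9, 4.2]       # frames aligned to this state (Step 1)
weights = [0.5, 0.5]                     # Step 2: uniform initialization
components = [(0.0, 1.0), (4.0, 1.0)]    # fixed component Gaussians (toy)

def log_likelihood():
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, (m, v) in zip(weights, components)))
               for x in data)

prev = log_likelihood()
for _ in range(100):                     # Step 3: re-estimation iterations
    # E-step: per-frame responsibilities of each mixture component.
    resp = [[w * gaussian_pdf(x, m, v)
             for w, (m, v) in zip(weights, components)] for x in data]
    resp = [[r / sum(row) for r in row] for row in resp]
    # M-step: new mixture weights from the accumulated responsibilities.
    weights = [sum(row[k] for row in resp) / len(data) for k in range(2)]
    cur = log_likelihood()
    if cur - prev < 1e-6:                # Step 4: convergence criterion
        break
    prev = cur
```

With three frames near the first component and two near the second, the weights settle close to [0.6, 0.4] after a couple of iterations.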

  10. HMM/BN approach
  Accent- and gender-dependent data trains a single HMM/BN acoustic model:
  accent/gender-dependent data A(M), B(M), C(M), A(F), B(F), C(F) → HMM/BN AM
  input speech → Feature Extraction → Decoder → recognition result

  11. Comparison of state distributions
  (Figure: state distributions of the MA-HMM, PA-HMM, GD-HMM, and HMM/BN models.)

  12. Database and speech pre-processing
  • Database:
    • Accents: American (US), British (BRT), Australian (AUS).
    • Speakers: 100 per accent (90 for training + 10 for evaluation).
    • Utterances: 300 per speaker.
    • Speech material is the same for each accent: travel arrangement dialogs.
  • Speech feature extraction:
    • 20 ms frames at a 10 ms rate.
    • 25-dimensional feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
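The 20 ms / 10 ms framing above determines how many feature vectors an utterance yields. A small sketch of that arithmetic; the 16 kHz sampling rate is an assumption (the slide does not state it):

```python
def num_frames(n_samples, rate=16000, win_ms=20, shift_ms=10):
    """Number of analysis frames for an utterance of n_samples samples."""
    win = rate * win_ms // 1000      # window length in samples (320 at 16 kHz)
    shift = rate * shift_ms // 1000  # frame shift in samples (160 at 16 kHz)
    if n_samples < win:
        return 0                     # too short for even one full window
    return 1 + (n_samples - win) // shift

print(num_frames(16000))  # one second of audio -> 99 overlapping frames
```

Each of those frames then becomes one 25-dimensional vector (12 MFCC + 12 ΔMFCC + ΔE), so the model sees roughly 100 vectors per second of speech.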

  13. Models
  • Acoustic models:
    • All HMM-based AMs have:
      • Three states, left-to-right, triphone contexts.
      • 3,275 states (MDL-SSS).
      • Variants with 6, 18, 30, and 42 total Gaussians per state.
    • HMM/BN model:
      • Same state structure as the HMM models.
      • Same number of Gaussian components.
  • Language model:
    • Bi-gram and tri-gram (600,000 training sentences).
    • 35,000-word vocabulary.
    • Test data perplexity: 116.5 and 27.8.
  • Pronunciation lexicon: American English.
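For readers unfamiliar with the perplexity figures quoted above: a test-set perplexity is the exponential of the average negative log probability the language model assigns per word. The probabilities below are made up for illustration; 116.5 and 27.8 are the measured values for the real test data and LMs:

```python
import math

def perplexity(word_probs):
    """Perplexity from the per-word probabilities an LM assigned to test text."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that always assigns probability 1/4 is as "surprised" as a uniform
# choice among 4 words, so its perplexity comes out to about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the LM constrains the search more tightly, which is why the tri-gram figure (27.8) is so much lower than the bi-gram one (116.5).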

  14. Evaluation results

  15. Evaluation results
  Word accuracies (%), all models with a total of 42 Gaussians per state.

  16. Conclusions
  • In the matched-accent case, accent-dependent models are the best choice.
  • The HMM/BN model performs best among the multi-accent approaches, almost matching the accent-dependent models, but it requires more mixture components.
  • The multi-accent HMM is the most efficient in terms of performance versus complexity.
  • The differing performance levels of the accent-dependent models are apparently caused by phonetic accent differences.
