
ADVANCES IN MANDARIN BROADCAST SPEECH RECOGNITION



ADVANCES IN MANDARIN BROADCAST SPEECH RECOGNITION
M.Y. Hwang1, W. Wang2, X. Lei1, J. Zheng2, O. Cetin3, G. Peng1
1. Department of Electrical Engineering, University of Washington, Seattle, WA, USA
2. SRI International, Menlo Park, CA, USA
3. ICSI, Berkeley, CA, USA

Overview
• Goal: build a highly accurate Mandarin speech recognizer for broadcast news (BN) and broadcast conversations (BC).
• Improvements over the previous version:
• Increased training data
• Discriminative features
• Frame-level discriminative training criterion
• Multiple-pass AM adaptation
• System combination
• LM adaptation
• Together, these yield a 24%–64% relative CER reduction.

Perplexity

Two Acoustic Models
1. MFCC, 39-dim, CW + SAT, fMPE+MPE, 3000x128 Gaussians.
2. MFCC + MPE-phoneme posterior feature, 74-dim, nonCW, fMPE+MPE, 3000x64 Gaussians.
• MLP features:
[1] Zheng et al., ICASSP 2007, "Combining discriminative feature, transform, and model training for large vocabulary speech recognition."
[2] Chen et al., ICSLP 2004, "Learning long-term temporal features in LVCSR using neural networks."

Character Error Rates

Increased Training Data
• Acoustic training data increased from 97 hours to 465 hours (2/3 BN, 1/3 BC).

Test data
• ML word segmentation used for Chinese text.
• Training text increased from 420M words to 849M words.
• Lexicon size increased from 49K to 60K words (including 1,700 English words).
• Six LMs trained and interpolated into one: bigrams (qLM2), trigrams (LM3, qLM3), and 5-grams (LM5a, LM5b). LM5b uses count-based smoothing.

LM Adaptation for BC (Dev05bc)
• LMBN = (1)–(6) BN + EARS Conversational Telephone Speech (159M words)
• LMBC = (2)+(6) BC
• LMALL = interpolation(LMBN, LMBC)
• LMBN-C = LMBN adapted with (2) GALE-BC
• One LM adaptation per show i, maximizing the likelihood of the first-pass hypothesis h.
• The entire recognition process is restarted after LM adaptation.
• LMBN' = LMBN adapted with h dynamically, per show i.
• The same strategy gives no improvement on the BN test data, since BN training text is already plentiful.
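The interpolation of the six component LMs described above is typically done by tuning linear mixture weights on held-out (or adaptation) text with EM, which is also how a per-show adapted LM can be re-weighted to maximize the likelihood of the hypothesis h. The sketch below is a minimal, hypothetical illustration of that weight-estimation step: it assumes the component LM probabilities for each held-out token have already been computed into a table (`lm_probs`), since the real n-gram models are not part of this transcript.

```python
import math

def em_interp_weights(lm_probs, num_iters=20):
    """Estimate linear-interpolation weights for component LMs by EM.

    lm_probs: one row per held-out token; row[j] is the probability
    component LM j assigns to that token (a hypothetical pre-computed
    table standing in for real n-gram lookups).
    """
    k = len(lm_probs[0])
    lam = [1.0 / k] * k                      # start from uniform weights
    for _ in range(num_iters):
        counts = [0.0] * k
        for row in lm_probs:
            mix = sum(l * p for l, p in zip(lam, row))
            for j in range(k):
                counts[j] += lam[j] * row[j] / mix   # posterior of LM j
        total = sum(counts)
        lam = [c / total for c in counts]    # re-normalize
    return lam

def perplexity(lm_probs, lam):
    """Perplexity of the interpolated model on the same held-out tokens."""
    log_sum = sum(math.log(sum(l * p for l, p in zip(lam, row)))
                  for row in lm_probs)
    return math.exp(-log_sum / len(lm_probs))
```

Because each EM iteration cannot decrease the held-out likelihood, the resulting weights give a perplexity no worse than the uniform mixture, and they naturally shift toward whichever component LM best matches the adaptation text (e.g., the BC component when adapting to a BC show).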
Decoding Architecture
1. Acoustic segmentation, VTLN/CMN/CVN, speaker clustering.
2. CW MFCC decoding with MLLR and LM3; nonCW MLP decoding with qLM3.
3. nonCW MLP decoding with MLLR and LM3, producing hypothesis h.
• LM5a and LM5b rescoring is applied to both decoding branches.
• Confusion Network Combination merges the rescored outputs.

Future Work
• Topic-dependent language model adaptation.
• Machine-translation (MT) targeted error rates.
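The final Confusion Network Combination step merges the rescored outputs by summing word posteriors across systems within each aligned confusion bin and keeping the highest-scoring word. The sketch below illustrates only that voting step under a simplifying assumption: the bins are taken as already aligned across systems (real CNC must first align the confusion networks), and the bin dictionaries and `<eps>` null-arc label are hypothetical stand-ins.

```python
def combine_confusion_bins(system_bins, weights=None):
    """Pick the best word in each confusion-network bin by weighted
    posterior voting across systems.

    system_bins: list over systems; each is a list of bins, where a bin
    maps word -> posterior. Bins are assumed pre-aligned across systems.
    """
    n_sys = len(system_bins)
    weights = weights or [1.0 / n_sys] * n_sys   # equal system weights
    result = []
    for bins in zip(*system_bins):               # one aligned bin per system
        votes = {}
        for w_sys, bin_ in zip(weights, bins):
            for word, post in bin_.items():
                votes[word] = votes.get(word, 0.0) + w_sys * post
        best = max(votes, key=votes.get)
        if best != "<eps>":                      # drop null-arc winners
            result.append(best)
    return result
```

In practice the system weights would be tuned on a development set so that the more accurate branch (e.g., the CW MFCC system) contributes more to each vote.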
