
Building A Highly Accurate Mandarin Speech Recognizer




  1. Building A Highly Accurate Mandarin Speech Recognizer Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Mari Ostendorf, et al.

  2. Outline • Mandarin-specific modules: • Word segmentation. • Tonal phonetic pronunciations. • Pronunciation look-up tools. • Linguistic questions for CART state clustering. • Pitch features. • Mandarin-optimized acoustic segmenter.

  3. Outline • Language independent techniques: • MPE training. • fMPE feature transform. • MLP feature front end. • System combination. • Jan-08 system • Future

  4. Word segmentation and lexicon • Started from BBN 64K lexicon (originally from LDC 44K lexicon) • /g/ssli/data/mandarin-bn/external-sites/ • Added 20K new entries (especially names) from various sources. • First-pass: Longest-first match (LFM) word segmentation • Selected most frequent 60K words as our decoding lexicon. • UW ∩ BBN = 46.8K • UW \ BBN = 13.6K (阿扁,马英九) • BBN \ UW = 17.3K (狼狈为奸,心慌意乱,北京烤鸭)
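
The first-pass longest-first-match segmenter can be sketched as a greedy scan; this is a minimal illustration, and the toy lexicon entries below are only a hypothetical subset:

```python
# A minimal sketch of longest-first-match (LFM) word segmentation,
# assuming `lexicon` is a set of known words (toy entries here).
def lfm_segment(text, lexicon, max_len=8):
    """At each position, greedily take the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in lexicon:
                words.append(cand)
                i += length
                break
    return words

lexicon = {"记者", "从中", "中国", "国家计划委员会", "有关部门", "获悉"}
print(lfm_segment("记者从中国国家计划委员会有关部门获悉", lexicon))
# → ['记者', '从中', '国', '国家计划委员会', '有关部门', '获悉']
```

Note how greedy matching commits to 从中 and strands the following 国 as a singleton, the exact failure mode the ML re-segmentation on the next slide addresses.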

  5. Word segmentation and lexicon • Train 3-gram. Treat OOV = @reject@ = garbage. • Second-pass: Re-segment training text with ML word segmentation. • /homes/mhwang/src/ngramseg/wseg/ngram -order 1 -lm <DARPA n-gram> • Output depends on (1) algorithm, (2) lexicon. • 记者-从中-国-国家计划委员会-有关部门-获悉 • 记者-从-中国-国家计划委员会-有关部门-获悉 • Re-train 3-gram, 4-gram, 5-gram. • Very minor perplexity improvement. Character accuracy improved from 74.42% (LFM) to 75.01% (ML), as measured by NTU.
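
The second-pass maximum-likelihood segmentation can be sketched as a Viterbi search over all lexicon segmentations under a unigram model; the toy log-probabilities below are hypothetical stand-ins for the SRILM unigram LM:

```python
import math

# A minimal sketch of ML (unigram) word segmentation by dynamic
# programming. `logprob` is a hypothetical unigram table; unknown
# single characters get a low OOV score so every string is coverable.
def ml_segment(text, logprob, max_len=8, oov=-20.0):
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)   # (score, backpointer)
    best[0] = (0.0, 0)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            w = text[i:j]
            lp = logprob.get(w, oov if j - i == 1 else -math.inf)
            if best[i][0] + lp > best[j][0]:
                best[j] = (best[i][0] + lp, i)
    words, j = [], n
    while j > 0:
        words.append(text[best[j][1]:j])
        j = best[j][1]
    return words[::-1]

logprob = {"记者": -4.0, "从": -3.0, "从中": -6.0, "中国": -4.0,
           "国": -5.0, "国家计划委员会": -8.0, "有关部门": -6.0, "获悉": -5.0}
print(ml_segment("记者从中国国家计划委员会有关部门获悉", logprob))
# → ['记者', '从', '中国', '国家计划委员会', '有关部门', '获悉']
```

With these toy scores the globally best path recovers 从-中国 where greedy LFM picked 从中-国, mirroring the two example outputs on the slide.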

  6. Lexicon and Perplexity • 1.2B words of training text. • qLMn: quick (highly pruned) n-gram

  7. Two Tonal Phone Sets • 70 tonal phones from BBN originally, using the IBM main-vowel idea: • Split the Mandarin final into vowel+coda to increase parameter sharing. • bang /b a NG/ • ban /b a N/ • {n,N}, {y,Y}, {w,W} for unique syllabification • Silence for pauses, and rej for noises/garbage/foreign. • Introducing diphthongs and neutral tones for BC → 79 tonal phones

  8. Phone-81: Diphthongs for BC • Add diphthongs (4x4=16) for fast speech and modeling longer triphone context. • Maintain unique syllabification. • Syllable ending W and Y not needed anymore.

  9. Phone-81: Frequent Neutral Tones • Neutral tones more common in conversation. • Neutral tones were not modeled. The 3rd tone was used as replacement. • Add 3 neutral tones for frequent chars.

  10. Phone-81: Special CI Phones • Filled pauses (hmm, ah) common in BC. Add two CI phones for them. • Add CI /V/ for English.

  11. Phone-81: Simplification of Other Phones • Now 72+14+3+3=92 phones, too many triphones to model. • Merge similar phones to reduce #triphones. I2 was modeled by I1, now i2. • 92 – (4x3–1) = 81 phones.

  12. Different Phone Sets • Pruned trigram, SI nonCW-PLP ML, on dev07. • Indeed different error behaviors: good for system combination.

  13. Pronunciation Look-up Tools • SRC=/g/ssli/data/mandarin-bn/scripts/pron • $SRC/wlookup.pl: Look up pronunciations from a word dictionary, for Chinese and/or English words. • $SRC/eng2bbn.pl: Look up English word pronunciations in Mandarin phone set. • $SRC/standarnd-all.sc: P72 Single-char lexicon. First pronunciation = most common • $SRC/sc2bbn.pl: Look up Chinese word pronunciation from individual characters. • $SRC/pconvert.pl: convert a dict from one phone set to another • $SRC/RWTH/: RWTH-70 phone set (3rd phone set)

  14. Pitch Features • get_f0 to compute pitch for voiced segments. • Pass to graphtrack to reduce the pitch halving/doubling problem. • Spline interpolation for unvoiced regions. • Log, delta (D), delta-delta (DD)
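
The pitch front end above can be sketched as follows; np.interp (linear) stands in for the spline interpolation used in the actual system, and get_f0/graphtrack are assumed to have already produced the per-frame F0 track (0 for unvoiced frames):

```python
import numpy as np

# A minimal sketch of the pitch feature pipeline: fill unvoiced frames
# by interpolation, take the log, and append delta and delta-delta.
# Linear interpolation here is a stand-in for the spline described.
def pitch_features(f0):
    f0 = np.asarray(f0, dtype=float)
    t = np.arange(len(f0))
    voiced = f0 > 0
    # Interpolate F0 through unvoiced regions (where get_f0 gave 0).
    filled = np.where(voiced, f0, np.interp(t, t[voiced], f0[voiced]))
    logf0 = np.log(filled)
    d = np.gradient(logf0)        # delta
    dd = np.gradient(d)           # delta-delta
    return np.stack([logf0, d, dd], axis=1)   # 3 pitch features per frame
```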

  15. Acoustic Segmentation [HMM topology diagram: Start/null → {speech, silence, noise} → End/null] • The former segmenter, inherited from the English system, caused high deletion errors: it mis-classified some speech segments as noise. • Speech segment minimum duration: 18 × 30 ms = 540 ms ≈ 0.5 s

  16. New Acoustic Segmenter [HMM topology diagram: Start/null → {Mandarin 1, Mandarin 2, Foreign, silence, noise} → End/null] • Allow shorter speech duration. • Model Mandarin vs. Foreign (English) separately.

  17. Improved Acoustic Segmentation Pruned trigram, SI nonCW-MLP MPE, on Eval06

  18. Language-Independent Technologies

  19. Two Sets of Acoustic Models • MLP-model: • MFCC+pitch+MLP (32-dim) = 74-dim • CW Triphones with SD SAT feature transform • MPE trained • P72 • PLP-model: • PLP+pitch = 32-dim • CW Triphones with SD SAT feature transform • Followed by fMPE SI feature transform • MPE trained • P81

  20. MLP Phoneme Posterior Features • One MLP to compute Tandem features with pitch+PLP input. 71 output units. • 20 MLPs to compute HATs features with 19 critical bands. 71 output units. • Combine Tandem and HATs posterior vectors into one 71-dim vector, valued [0..1]. • PCA(Log(71)) → 32 • MFCC + pitch + MLP = 74-dim
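
The posterior combination and PCA(Log(71)) → 32 reduction above can be sketched as follows; the averaging rule for combining the two posterior streams is an assumption for illustration, not necessarily the exact recipe used:

```python
import numpy as np

# A minimal sketch: merge Tandem and HATs 71-dim posteriors, take the
# log, and project to 32 dimensions with a PCA basis estimated from
# training frames. The 50/50 average is an assumed combination rule.
def estimate_pca_basis(logpost_train, out_dim=32):
    X = logpost_train - logpost_train.mean(axis=0)
    cov = X.T @ X / len(X)
    _, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    return vecs[:, ::-1][:, :out_dim]      # keep the top components

def mlp_features(tandem, hats, basis):
    post = 0.5 * (tandem + hats)                 # combined 71-dim posteriors
    logpost = np.log(np.clip(post, 1e-10, None))
    return logpost @ basis                       # (frames, 71) -> (frames, 32)
```

The resulting 32-dim vector is appended to MFCC + pitch for the 74-dim MLP-model front end.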

  21. Tandem Features [T1,T2,…,T71] [MLP diagram: PLP (39×9) + pitch (3×9) input; network (42×9)×15000×71] • Input: 9 frames of PLP+pitch

  22. MLP and Pitch Features nonCW ML, Hub4 Training, MLLR, LM2 on Eval04

  23. HATS Features [H1,H2,…,H71] [MLP diagram: 19 critical-band energy trajectories E1…E19; per-band nets 51×60×71; merger net (60×19)×8000×71]

  24. PLP Models with fMPE Transform • PLP model with fMPE transform to compete with the MLP model. • Smaller ML-trained Gaussian posterior model: 3500x32 CW+SAT • 5 neighboring frames of Gaussian posteriors. • M is 42 × (3500*32*5); h_t is (3500*32*5) × 1. • Ref: Zheng ICASSP 07 paper
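
The fMPE transform has the standard form y_t = x_t + M h_t, where h_t stacks Gaussian posteriors from the neighboring frames and M projects them back to feature dimension. A minimal sketch with toy dimensions (the slide's M is 42 × (3500·32·5)), assuming the posteriors are already computed:

```python
import numpy as np

# A minimal sketch of the fMPE feature transform y_t = x_t + M h_t.
# Dimensions are toy; the posterior matrix is an assumed input.
def fmpe_transform(x, post, M, context=2):
    """x: (T, d) features; post: (T, g) Gaussian posteriors;
    M: (d, g * (2*context+1)). context=2 gives 5 stacked frames."""
    T, g = post.shape
    h = np.empty((T, g * (2 * context + 1)))
    for k, off in enumerate(range(-context, context + 1)):
        idx = np.clip(np.arange(T) + off, 0, T - 1)   # repeat edge frames
        h[:, k * g:(k + 1) * g] = post[idx]
    return x + h @ M.T
```

Only M is trained (discriminatively, with the MPE criterion); with M = 0 the transform reduces to the original features, which is how fMPE training is initialized.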

  25. Eval07: June 2007

  26. Jan 08: RWTH Improvements • Using RWTH-70 phone set, converted from UW dictionary. • Using UW-ICSI MLP features. • On Dev07

  27. Jan-2008: Decoding Architecture • Manual acoustic segmentation. • Removing sub-segments. • Removing the ending of the first utterance when partially overlapped. • Gender-ID per utterance. • Auto speaker clustering per gender. • VTLN per speaker. • CMN/CVN per utterance.
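
The last step of the pipeline above, per-utterance cepstral mean/variance normalization, can be sketched as:

```python
import numpy as np

# A minimal sketch of per-utterance CMN/CVN: normalize each feature
# dimension to zero mean and unit variance over the utterance.
def cmn_cvn(feats):
    """feats: (frames, dims) -> normalized copy."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / np.maximum(sigma, 1e-8)   # guard constant dims
```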

  28. Jan-2008 Decoding Architecture [Diagram: SI MLP nonCW pass with qLM3; adapted passes PLP CW SAT+fMPE (MLLR, LM3) and MLP CW SAT (MLLR, LM3), plus the Aachen (RWTH) system; PLP-SA and MLP-SA outputs merged by Confusion Network Combination]

  29. Re-Test: Jan 2008 • Dev07 • PLP-SA-1: 10.2% • PLP-SA-2: 9.9% (very competitive with the MLP-model after adaptation) • MLP-SA-2: 9.8% • {PLP-SA-2, MLP-SA-2}: 9.5% • RWTH: 9.9% (more sub errors, fewer del errors) • {RWTH, PLP-SA-2, MLP-SA-2}: 9.2% • Eval07-retest: 8.1% → 7.3%

  30. Future Work • Putting all words together, re-do word segmentation and re-select the decoding lexicon. • Automatically create new words using point-wise mutual information: • PMI(w1,w2) = log [ P(w1,w2) / (P(w1) P(w2)) ] • LM adaptation • Finer topics • Names • Must coordinate with MT/NE
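
The PMI scoring above can be sketched over adjacent word pairs; high-PMI pairs are candidates to merge into new lexicon words. The toy corpus below is hypothetical:

```python
import math
from collections import Counter

# A minimal sketch of PMI(w1,w2) = log P(w1,w2) / (P(w1) P(w2)) over
# adjacent word pairs in a segmented corpus (toy sentences here).
def pmi_scores(corpus):
    uni, bi, n = Counter(), Counter(), 0
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
        n += len(sent)
    total_bi = sum(bi.values())
    return {(w1, w2): math.log((c / total_bi) /
                               ((uni[w1] / n) * (uni[w2] / n)))
            for (w1, w2), c in bi.items()}

corpus = [["马", "英九", "当选"], ["马", "英九", "表示"], ["今天", "当选"]]
scores = pmi_scores(corpus)
```

Here the pair (马, 英九) always co-occurs and scores higher than (英九, 当选), so it would be proposed as a new word 马英九.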
