
Mandarin Pronunciation Variation Modeling

This study models pronunciation variation in Mandarin, including sound changes, phone changes, and accent variation. It establishes a corpus annotated with spontaneous phenomena to improve pronunciation modeling.


Presentation Transcript


  1. NCMMSC'01, 20-22 Nov. 2001, Shenzhen, China Mandarin Pronunciation Variation Modeling Thomas Fang Zheng Center of Speech Technology State Key Lab of Intelligent Technology and Systems Department of Computer Science & Technology Tsinghua University fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/

  2. Motivation • In spontaneous speech, the pronunciations of individual words vary; there are often • sound changes, and • phone changes (a change can be an insertion, a deletion, or a substitution) • For Chinese there is • an additional accent problem, even when people are speaking Mandarin, due to different dialect backgrounds (Chinese has 7 major dialects) • colloquialism, grammar, and style issues • Goal: modeling the pronunciation variations • Establishing a corpus with spontaneous phenomena, because we need to know what the canonical phones change to • Finding solutions to pronunciation modeling, theoretically and practically Center of Speech Technology, Tsinghua University

  3. Overview Center of Speech Technology, Tsinghua University

  4. Necessity to establish a new annotated spontaneous speech corpus • The existing databases (incl. Broadcast News, CallHome, CallFriend, …) do not cover all the Chinese spoken-language phenomena • Sound changes: voiced, unvoiced, nasalization, … • Phone changes: retroflexed, OOV-phoneme, … • The existing databases do not contain pronunciation variation information for use in bootstrap training • A Chinese Annotated Spontaneous Speech (CASS) Corpus was established before the Workshop 2000 on Speech and Language Processing (WS00) at JHU • Completely spontaneous (discourses, lectures, ...) • Remarkable background noise, accent background, ... • Recorded onto tapes and then digitized Center of Speech Technology, Tsinghua University

  5. Chinese Annotated Spontaneous Speech (CASS) Corpus • CASS w/ Five-Tier Transcription • Character Level : base form • Syllable (or Pinyin) Level (w/ tone) : base form • Initial/Final (IF) Level : w/ time boundaries, for the base form • SAMPA-C Level : surface form • Miscellaneous Level : used for garbage modeling • Lengthening, breathing, laughing, coughing, disfluency, noise, silence, murmur (unclear), modal, smack, non-Chinese • Example Center of Speech Technology, Tsinghua University

  6. SAMPA-C: Machine-Readable IPA • Phonologic Consonants - 23 • Phonologic Vowels - 9 • Initials - 21 • Finals - 38 • Retroflexed finals - 38 • Tones and Silences • Sound Changes • Spontaneous Phenomena Labels Center of Speech Technology, Tsinghua University

  7. Key Points in PM (1) • Choosing and generating the speech recognition unit (SRU) set • So as to describe the phone changes and sound changes well • Could be syllables, semi-syllables, or INITIALs/FINALs • Constructing a multi-pronunciation lexicon (MPL) • A syllable-to-SRU lexicon to reflect the relation between the grammatical units and the acoustic models • Acoustically modeling spontaneous speech • Theoretical framework • CD modeling; confusion matrix; data-driven Center of Speech Technology, Tsinghua University

  8. Key Points in PM (2) • Customizing the decoding algorithm according to the new lexicon • An improved time-synchronous search algorithm to reduce the path expansion (caused by CD modeling) • An A*-based tree-trellis search algorithm to score multiple pronunciation variations simultaneously in the path • Modifying the statistical language model Center of Speech Technology, Tsinghua University

  9. Establishment of the Multi-Pron. Lexicon • Two major approaches • Defined by linguists and phoneticians • Data-driven: confusion matrix, rewriting rules, decision tree ... • Our method: • Find all possible pronunciations in SAMPA-C from the database • Reduce the size according to occurrence frequencies Center of Speech Technology, Tsinghua University
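The following is a minimal sketch of the data-driven route under stated assumptions: it counts how often each SAMPA-C surface form is observed for a syllable and prunes rare variants. The 1% cutoff is an illustrative choice, not a figure from the talk.

```python
from collections import Counter, defaultdict

def build_mpl(observations, min_prob=0.01):
    """Build a multi-pronunciation lexicon from (syllable, surface form) pairs,
    keeping only surface forms above a relative-frequency threshold."""
    counts = defaultdict(Counter)
    for syllable, surface in observations:
        counts[syllable][surface] += 1
    lexicon = {}
    for syllable, surf_counts in counts.items():
        total = sum(surf_counts.values())
        kept = {s: n / total for s, n in surf_counts.items() if n / total >= min_prob}
        norm = sum(kept.values())  # renormalize over the retained variants
        lexicon[syllable] = {s: p / norm for s, p in kept.items()}
    return lexicon
```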

  10. Learning pronunciations • Definition of Generalized Initial-Finals (GIFs): collect all surface forms of the IFs and choose the most frequent ones as GIFs, e.g. • z ts : canonical • z ts_v : voiced • z ts` : changed to ‘zh’ • z ts`_v : changed to voiced ‘zh’ • e 7 : canonical • e 7` : retroflexed or changed to ‘er’ • e @ : changed • Definition of Generalized Syllables (GSs), defined according to the GIF set: a probabilistic lexicon giving the surface form for each IF and syllable, P([GIF_i] GIF_f | Syllable), e.g. • chang [0.7850] ts`_h AN • chang [0.1215] ts`_h_v AN • chang [0.0280] ts`_v AN • chang [0.0187] <deletion> AN • chang [0.0187] z` AN • chang [0.0093] <deletion> iAN • chang [0.0093] ts_h AN • chang [0.0093] ts`_h UN Center of Speech Technology, Tsinghua University
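A hypothetical encoding of the probabilistic GS entry for chang shown above; the probabilities are those from the slide, but the data structure is our own.

```python
# Each entry maps a (generalized initial, generalized final) pair to
# P([GIF_i] GIF_f | Syllable); None marks a deleted initial.
chang = {
    ("ts`_h", "AN"): 0.7850,    # canonical
    ("ts`_h_v", "AN"): 0.1215,  # voiced initial
    ("ts`_v", "AN"): 0.0280,
    (None, "AN"): 0.0187,       # initial deleted
    ("z`", "AN"): 0.0187,
    (None, "iAN"): 0.0093,
    ("ts_h", "AN"): 0.0093,
    ("ts`_h", "UN"): 0.0093,
}

best_surface = max(chang.items(), key=lambda kv: kv[1])
print(best_surface)  # (('ts`_h', 'AN'), 0.785)
```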

  11. Probabilistic Pronunciation Modeling [diagram: AM and LM, with the AM split into a refined AM and a surface-form output probability] • Theory • Recognizer goal: K* = argmax_K P(K|A) = argmax_K P(A|K) P(K) • Applying the independence assumption: P(A|K) = Π_n P(a_n|k_n) • Pronunciation modeling part (1/2), via introducing the surface form s: P(a|k) = Σ_s P(a|k,s) P(s|k) • Symbols • a: acoustic signal, k: IF, s: GIF • A, K, S: the corresponding strings Center of Speech Technology, Tsinghua University
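A toy numeric check of the marginalization over surface forms, with made-up values for one IF k and two candidate GIFs:

```python
p_s_given_k = {"s1": 0.8, "s2": 0.2}     # surface-form output prob. P(s|k)
p_a_given_ks = {"s1": 0.01, "s2": 0.05}  # refined acoustic likelihood P(a|k,s)

# P(a|k) = sum over s of P(a|k,s) * P(s|k)
p_a_given_k = sum(p_a_given_ks[s] * p_s_given_k[s] for s in p_s_given_k)
print(p_a_given_k)  # 0.8*0.01 + 0.2*0.05 = 0.018
```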

  12. Refined Acoustic Modeling (RAM) • P(a|k, s) -- RAM • It cannot be trained directly; the solutions could be: • Use P(a|k) instead -- IF modeling • Use P(a|s) instead -- GIF modeling • Adapt P(a|k) to P(a|k, s) -- B-GIF modeling • Adapt P(a|s) to P(a|k, s) -- S-GIF modeling • The IF-GIF transcription should be generated from the IF and GIF transcriptions • Needs more data, but the data amount is fixed • Hence adaptation is used Center of Speech Technology, Tsinghua University
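A minimal sketch of what "adapt P(a|k) to P(a|k,s)" could look like with MAP-style mean adaptation; this is a generic illustration, not the paper's exact adaptation recipe, and tau is an assumed prior weight.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Move a seed Gaussian mean toward the frames force-aligned to one
    (IF, GIF) pair; tau controls how strongly the prior mean is kept."""
    frames = np.asarray(frames)
    return (tau * prior_mean + frames.sum(axis=0)) / (tau + len(frames))

# B-GIF scheme: seed with the baseform (IF) model's mean;
# S-GIF scheme: seed with the surface-form (GIF) model's mean.
if_mean = np.zeros(39)                    # e.g. a 39-dim MFCC feature mean
aligned = np.random.randn(50, 39) + 0.3   # frames aligned to this IF-GIF pair
print(map_adapt_mean(if_mean, aligned)[:3])
```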

  13. [Diagram: a set of IF models (IF1, IF2, IF3) and a set of GIF models (GIF1, GIF2, GIF3); the RAM is generated via adaptation technology: adapt P(a|k) to P(a|k, s) -- the B-GIF scheme, or adapt P(a|s) to P(a|k, s) -- the S-GIF scheme] Center of Speech Technology, Tsinghua University

  14. Probabilistic Pronunciation Modeling (cont'd) [diagram: AM and LM, with the refined AM and the surface-form output probability] • Recognizer goal: K* = argmax_K P(K|A) = argmax_K P(A|K) P(K) • Independence assumption: P(A|K) = Π_n P(a_n|k_n) • Pronunciation modeling part (2/2): P(a|k) = Σ_s P(a|k,s) P(s|k), now focusing on the surface-form output probability P(s|k) • Symbols • a: acoustic signal, k: IF, s: GIF • A, K, S: the corresponding strings Center of Speech Technology, Tsinghua University

  15. Surface-form Output Probability Modeling (SOPM) • P(s|k) - SOPM • Solution: Direct Output Prob. (DOP) learned from CASS • Problem: data sparseness • Idea: syllable-level data sparseness DOESN'T mean IF/GIF-level data sparseness • New solution – Context-Dependent Weighting (CDW): • P(GIF|IF) = Σ_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF) • P(GIF | (IF_L, IF)): the GIF output prob. given the left context IF_L • P(IF_L | IF): the IF transition prob. • Both items can be learned from CASS Center of Speech Technology, Tsinghua University
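A sketch of the CDW computation with toy probability tables; the table layout (dicts keyed by left context) is an assumption for illustration.

```python
def cdw_prob(gif, if_unit, p_gif_given_ctx, p_left_given_if):
    """P(GIF|IF) = sum over left contexts IF_L of
    P(GIF | (IF_L, IF)) * P(IF_L | IF)."""
    return sum(
        p_gif_given_ctx.get((left, if_unit), {}).get(gif, 0.0) * p_left
        for left, p_left in p_left_given_if.get(if_unit, {}).items()
    )

# Toy tables for the final "AN" with two possible left-context initials
p_left = {"AN": {"ts`_h": 0.6, "ts_h": 0.4}}
p_ctx = {("ts`_h", "AN"): {"AN": 0.9, "iAN": 0.1},
         ("ts_h", "AN"): {"AN": 0.7, "iAN": 0.3}}
print(cdw_prob("AN", "AN", p_ctx, p_left))  # 0.6*0.9 + 0.4*0.7 = 0.82
```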

  16. Generate SOPM via CDW • P(S-Syl | B-Syl): B-Syl = (i, f), S-Syl = (gi, gf) • CDW: • P(S-Syl | B-Syl) = P(gi | i) · P(gf | f) • P(GIF|IF) = Σ_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF) • Q(GIF|IF) = max_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF) • M_L(GIF|IF) = P(GIF | (L, IF)) · P(L | IF), for a specific left context L • Different estimations of P(S-Syl | B-Syl): • P(gi | i) · P(gf | f) • P(gi | i) · Q(gf | f) • P(gi | i) · M_i(gf | f) Center of Speech Technology, Tsinghua University
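The three GIF-level estimators side by side, reusing the toy table layout from the previous sketch; the function names are ours, not the paper's.

```python
def sum_est(gif, if_unit, p_ctx, p_left):
    """P(GIF|IF): sum over all left contexts (full CDW)."""
    return sum(p_ctx.get((l, if_unit), {}).get(gif, 0.0) * pl
               for l, pl in p_left.get(if_unit, {}).items())

def max_est(gif, if_unit, p_ctx, p_left):
    """Q(GIF|IF): keep only the best-scoring left context."""
    return max((p_ctx.get((l, if_unit), {}).get(gif, 0.0) * pl
                for l, pl in p_left.get(if_unit, {}).items()), default=0.0)

def fixed_ctx_est(gif, if_unit, left, p_ctx, p_left):
    """M_L(GIF|IF): use one specific left context L, e.g. the actual
    initial i when estimating the final's surface form."""
    return (p_ctx.get((left, if_unit), {}).get(gif, 0.0)
            * p_left.get(if_unit, {}).get(left, 0.0))
```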

  17. Can CDW be better (1)? • The Pronunciation Lexicon's Intrinsic Confusion (PLIC) • The introduction of the MPL is useful for pronunciation variation modeling, but it • enlarges the confusion among syllables • The recognition target: an IF string • What we actually get: a GIF string • Even if the GIF recognizer achieves 100% accuracy, we cannot get a 100% correct IF string, because of the MPL Center of Speech Technology, Tsinghua University

  18. Can CDW be better (2)? • Pronunciation Lexicon's Intrinsic Confusion (PLIC) • reflects the extent of the syllable-level intrinsic confusion • is a lower bound on the syllable error rate • CDW can reduce PLIC Center of Speech Technology, Tsinghua University
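The slide gives no formula, so the following is only our hypothetical formalization of the idea: even a perfect surface-form recognizer must map each GIF string back to a base syllable, and any surface form shared by several syllables forces errors.

```python
def plic(lexicon):
    """Hypothetical PLIC: 1 minus the probability mass recovered by mapping
    each surface form to its most probable base syllable (MAP decoding).
    `lexicon` maps base syllable -> {surface form: P(surface|base)};
    base syllables are assumed equiprobable for simplicity."""
    n = len(lexicon)
    joint = {}  # surface form -> {base: P(base, surface)}
    for base, surfaces in lexicon.items():
        for s, p in surfaces.items():
            joint.setdefault(s, {})[base] = p / n
    recovered = sum(max(bases.values()) for bases in joint.values())
    return 1.0 - recovered

lex = {"zhang": {"ts`AN": 0.9, "tsAN": 0.1}, "zang": {"tsAN": 1.0}}
print(round(plic(lex), 3))  # 0.05: "tsAN" from "zhang" is always mis-mapped
```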

  19. Can CDW be better (3) ? Center of Speech Technology, Tsinghua University

  20. Experiment conditions • The CASS Corpus was used for the experiment • Training set: 3 hours of data • Testing set: 15 minutes of data • Features • MFCC + Δ + ΔΔ + E (with CMN) • HTK • Accuracy calculated at the syllable level • %Cor = Hit / Num * 100% • %Acc = (Hit - Ins) / Num * 100% Center of Speech Technology, Tsinghua University
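A small helper computing the two HTK-style syllable scores; the numbers in the example are made up.

```python
def syllable_scores(num, hits, ins):
    """%Cor counts hits only; %Acc additionally penalizes insertions."""
    cor = hits / num * 100.0
    acc = (hits - ins) / num * 100.0
    return cor, acc

print(syllable_scores(num=1000, hits=620, ins=40))  # (62.0, 58.0)
```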

  21. Experimental results Center of Speech Technology, Tsinghua University

  22. Question: does it work when more data without phonetic transcription is available? Center of Speech Technology, Tsinghua University

  23. Using more data w/o IF transcription • A question: is the above method still useful when only a small amount of data with IF transcription is available? • The answer depends on how we use the data w/o IF transcription • Two parts of data: • Seed database: the part w/ phonetic transcription • Extra database: the part w/o phonetic transcription Center of Speech Technology, Tsinghua University

  24. What’s the purpose of these two databases? • Seed Database • To define the SRU set (surface form) • To train initial acoustic models • To train initial CDW weights • Extra Database • To refine the existing acoustic models • To refine the CDW weights Center of Speech Technology, Tsinghua University

  25. How to use the extra database? • The problem is that the extra database contains only higher-level transcriptions (say, syllables instead of IFs) • An algorithm is needed to generate the phonetic-level (IF-level) transcription • Our solution is the iterative forced-alignment based transcription (IFABT) algorithm Center of Speech Technology, Tsinghua University

  26. Steps for IFABT (1) • Use the forced-alignment technique and the MPL to decode both the seed database and the extra database • to generate the IF-GIF transcription under the constraints of the previous canonical syllable-level transcription • Use the two databases with the IF-GIF transcription • to redefine the MPL • to retrain the CDW weights • to retrain the IF-GIF models • These two steps are repeated until the result is satisfactory (see the sketch below) Center of Speech Technology, Tsinghua University
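A skeleton of the IFABT loop as described above; every lower-level function is a placeholder standing in for a real forced aligner, lexicon builder, and HMM trainer (assumptions, not APIs from the talk).

```python
# Placeholder hooks; a real system would call an HMM toolkit (e.g. HTK) here.
def train_seed(db): ...
def align(utt, models, mpl): ...          # forced alignment against the MPL
def redefine_mpl(transcripts): ...
def retrain_cdw(transcripts): ...
def retrain_models(transcripts): ...

def ifabt(seed_db, extra_db, n_iter=5):
    """Iterative forced-alignment based transcription (IFABT) skeleton."""
    models, mpl, cdw = train_seed(seed_db)  # initial AMs, lexicon, CDW weights
    for _ in range(n_iter):
        # 1. Force-align both databases, constrained by the canonical
        #    syllable transcription, to obtain IF-GIF transcriptions.
        transcripts = [align(u, models, mpl) for u in seed_db + extra_db]
        # 2. Re-estimate lexicon, weights, and models from the enlarged data.
        mpl = redefine_mpl(transcripts)
        cdw = retrain_cdw(transcripts)
        models = retrain_models(transcripts)
    return models, mpl, cdw
```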

  27. Steps for IFABT (2) Center of Speech Technology, Tsinghua University

  28. Experiments done on CASS-II (1) • Database • Enlarge the database from 3 hrs to 6 hrs, to • cover more spontaneous phenomena, and • provide more training data • The additional 3 hrs of data are transcribed only at the canonical syllable level Center of Speech Technology, Tsinghua University

  29. Experiments done on CASS-II (2) [results table comparing training on CASS-I alone with training on CASS-I/-II] Center of Speech Technology, Tsinghua University

  30. Summary • An annotated spontaneous speech corpus is important • At the syllable level, the use of GIFs as acoustic models always achieves better results than IFs • Both context-dependent modeling and Gaussian density sharing are good methods for pronunciation variation modeling • Context-dependent weighting is more useful than Gaussian density sharing for pronunciation modeling, because it can reduce the MPL's PLIC value • The IFABT method is helpful when more data with a higher-level transcription, yet without the phonetic transcription, is available Center of Speech Technology, Tsinghua University

  31. References • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, “Mandarin Pronunciation Modeling Based on CASS Corpus,” Sino-French Symposium on Speech and Language Processing, pp. 47-53, Oct. 16, 2000, Beijing • Pascale Fung, William Byrne, Thomas Fang Zheng, Terri Kamm, Yi Liu, Zhanjiang Song, Veera Venkataramani, and Umar Ruhi, “Pronunciation Modeling of Mandarin Casual Speech,” Workshop 2000 on Speech and Language Processing: Final Report for MPM Group, http://www.clsp.jhu.edu/index.shtml • Zhanjiang Song, “Research on Pronunciation Modeling for Spontaneous Chinese Speech Recognition,” Ph.D. dissertation, Tsinghua University, Beijing, China, Apr. 2001 • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, “Modeling Pronunciation Variation Using Context-Dependent Weighting and B/S Refined Acoustic Modeling,” EuroSpeech, 1:57-60, Sept. 3-7, 2001, Aalborg, Denmark • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, “Mandarin Pronunciation Modeling Based on CASS Corpus,” to appear in J. Computer Science & Technology Center of Speech Technology, Tsinghua University

  32. Announcement (1) • ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology, September 14-15, 2002, Colorado, USA • http://www.clsp.jhu.edu/pmla2002/ Center of Speech Technology, Tsinghua University

  33. Announcement (2) • International Joint Conference of SNLP-O-COCOSDA, May 9-11, 2002, Prachuapkirikhan, Thailand • http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ or http://www.links.nectec.or.th/itech/snlp-o-cocosda2002/ Center of Speech Technology, Tsinghua University

  34. Thanks for listening Thomas Fang Zheng Center of Speech Technology State Key Lab of Intelligent Technology and Systems Department of Computer Science & Technology Tsinghua University fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/
