
Language Independent & Adaptive Speech Recognition



Presentation Transcript


  1. Language Independent & Adaptive Speech Recognition Tanja Schultz Carnegie Mellon University Speech Seminar at Johns Hopkins University November 27, 2001

  2. Motivation • Computerization: Speech processing is a key technology • Internet and mobile systems • Globalization: Multilinguality • Myth: “Everyone speaks English, why bother?” • BUT: About 4500 different languages in the world • Increasing number of languages on the Internet (an indicator) • Computer remote control in the mother tongue → Speech recognition in many languages, for Human-to-Human communication and Human-Computer interfaces

  3. Motivation • Speech Recognition in many Languages • Myth: It’s just retraining on foreign data, no science • BUT: Other languages bring unseen challenges, e.g. scripts, vocabulary and morphology, tonality, less-than-ideal speech data resources, very few if any text data, ... • Rapid Adaptation to new Languages • Many languages, few data • Combine data and knowledge of many languages • Apply both to unseen languages

  4. Speech Recognition in Multiple Languages • Goal: Speech recognition in many different languages • Problem: Data collection is time consuming and expensive — each language needs a sound system, speech data (10 hours), pronunciation rules, and text data [Figure: recognizer components — acoustic model (AM), pronunciation lexicon (Lex: ela /e/l/a/, eu /e/u/, sou /s/u/), language model (LM: eu sou, você é, ela é)]
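The three knowledge sources on this slide can be sketched with toy data structures; this is a minimal illustration using the slide's own Portuguese examples, not any actual recognizer format:

```python
# Pronunciation lexicon (Lex): word -> phoneme sequence, from the slide
lexicon = {
    "ela": ["e", "l", "a"],
    "eu":  ["e", "u"],
    "sou": ["s", "u"],
}

# Language model (LM): a toy bigram count over the slide's text examples
from collections import Counter
text = [["eu", "sou"], ["você", "é"], ["ela", "é"]]
bigrams = Counter((w1, w2) for sent in text for w1, w2 in zip(sent, sent[1:]))

print(lexicon["sou"])           # phoneme sequence for "sou"
print(bigrams[("eu", "sou")])   # how often "eu sou" occurs
```

The acoustic model (AM) is the third, data-hungry component; it is the part the following slides propose to borrow across languages.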

  5. Speech Recognition in Multiple Languages • Example: Speech recognition for Brazilian Portuguese — sound system, speech data, pronunciation rules, text data [Figure: recognizer components AM, Lex, LM with Portuguese examples]

  6. Speech Recognition in Multiple Languages • Idea: Borrow existing data and run adaptation to the target — sound system, speech data + ?, pronunciation rules, text data [Figure: recognizer components AM, Lex, LM]

  7. Speech Recognition in Multiple Languages — sound system ✓, speech data ✓, pronunciation rules ✓, text data ✓ [Figure: recognizer components AM, Lex, LM with Portuguese examples]

  8. Monolingual Speech Recognition • Step 1: • Uniform multilingual speech database • Monolingual recognizers in many languages

  9. GlobalPhone Multilingual Database • Widespread languages • Native speakers • Uniformity • Broad domain • Huge text resources • Internet newspapers • Total resources: 15 languages so far, about 300 hours of speech data, about 1400 native speakers • Languages: Arabic, Chinese-Mandarin, Chinese-Shanghai, Croatian, English, French, German, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish

  10. Top-15 Languages of the World (Webster’s New Encyclopedic Dictionary, 1992)

  11. Language Characteristics • Writing system, grapheme-to-phoneme relation • Segmentation: short vs long phrases, missing segmentation • Morphology example (Turkish): Osman-lı-laş-tır-ama-yabil-ecek-ler-imiz-den-miş-siniz • Prosody: tonal languages like Mandarin • Sound system: simple vs very complex systems • Phonotactics: simple syllable structure vs complex consonant clusters

  12. Monolingual Recognizers in 10 Languages

  13. Phonemes vs Error Rate • High correlation between number of phonemes and phoneme error rate • The more compact the phone set, the less confusability among phones • Outlier TU (Turkish): high confusion among [e], [i], [y]

  14. Multilingual Acoustic Modeling • Step 2: • Combine acoustic models • Share data across languages

  15. Multilingual Acoustic Modeling • Sound production is human, not language specific: International Phonetic Alphabet (IPA) • Multilingual acoustic modeling: 1) Universal sound inventory based on IPA: 485 sounds are reduced to 162 IPA sound classes 2) Each sound class is represented by one “phoneme” which is trained through data sharing across languages • m, n, s, l occur in all languages • p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages • No sharing of triphthongs and palatal consonants
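The two steps above can be sketched as a simple pooling over an IPA mapping; the mapping entries below are illustrative toy examples, not the actual GlobalPhone tables:

```python
# Map language-specific phones onto shared IPA classes, then pool the
# training material per class. Language codes and entries are assumptions.
from collections import defaultdict

phone_to_ipa = {
    ("DE", "m"): "m",  ("PO", "m"): "m",   # /m/ occurs in all languages -> shared
    ("DE", "u"): "u",  ("PO", "u"): "u",
    ("DE", "pf"): "pf",                    # language-specific sound, not shared
}

pooled = defaultdict(list)
for (lang, phone), ipa in phone_to_ipa.items():
    pooled[ipa].append((lang, phone))

# The shared /m/ class is now trained on data from both languages:
print(sorted(pooled["m"]))
```

In the real system each pooled class is an HMM trained on the combined speech data of every language that contributes to it.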

  16. IPA Scheme for Consonants

  17. IPA Scheme for Vowels

  18. Multilingual Acoustic Modeling • Sound production is human, not language specific: International Phonetic Alphabet (IPA) • Multilingual acoustic modeling: 1) Universal sound inventory based on IPA: 485 sounds are reduced to 162 IPA sound classes 2) Each sound class is represented by one “phoneme” which is trained through data sharing across languages • m, n, s, l occur in all languages • p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages • No sharing of triphthongs and palatal consonants

  19. Acoustic Model Combination ML-Mix ML-Sep ML-Tag
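The three combination schemes named on this slide differ in how models are shared across languages (as described in the Schultz & Waibel Speech Communication paper cited at the end: ML-Sep keeps separate models per language, ML-Mix shares one model, ML-Tag shares training but preserves a language tag). A toy naming sketch, with illustrative identifiers:

```python
# Illustrative model-naming sketch for the three combination schemes.
def model_name(phone, lang, scheme):
    if scheme == "ML-Sep":   # separate models per language, no sharing
        return f"{phone}_{lang}"
    if scheme == "ML-Mix":   # one shared model, language identity discarded
        return phone
    if scheme == "ML-Tag":   # shared training data, language kept as a tag
        return f"{phone}({lang})"
    raise ValueError(f"unknown scheme: {scheme}")

print(model_name("m", "PO", "ML-Sep"))
print(model_name("m", "PO", "ML-Mix"))
print(model_name("m", "PO", "ML-Tag"))
```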

  20. Acoustic Model Combination

  21. Rapid Language Adaptation • Step 3: • Use ML acoustic models, borrow data • Adapt ML acoustic models to the target language [Figure: recognizer components AM, Lex, LM]

  22. Rapid Language Adaptation • Model mapping to the target language: 1) Map the multilingual phonemes to Portuguese ones based on the IPA scheme 2) Copy the corresponding acoustic models in order to initialize Portuguese models • Problem: Contexts are language specific — how to apply context dependent models to a new target language? • Solution: Adaptation of multilingual contexts to the target language based on limited training data
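The two mapping steps can be sketched as follows; the model parameters are dummy lists standing in for real HMM parameters, and the mapping entries are assumptions:

```python
# Step 1: map each target-language phoneme to a multilingual model via
# its IPA symbol. Step 2: copy that model's parameters as initialization.
import copy

multilingual_models = {"a": [0.1, 0.2], "e": [0.3, 0.4], "s": [0.5, 0.6]}
ipa_map = {"a": "a", "e": "e", "s": "s"}   # Portuguese phoneme -> IPA class

portuguese_models = {
    phone: copy.deepcopy(multilingual_models[ipa])   # independent copy, so
    for phone, ipa in ipa_map.items()                # later adaptation won't
}                                                    # touch the originals

portuguese_models["s"][0] = 0.9        # adapting the copy...
print(multilingual_models["s"][0])     # ...leaves the original untouched
```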

  23. Polyphone Coverage

  24. Context Decision Trees • Context dependent phoneme (n = polyphone), e.g. /k/ in Blaukraut, Brautkleid, Brotkorb, Weinkarte: lau k ra, ut k le, ot k or, in k ar • Trainability vs granularity • Divisive clustering based on linguistically motivated questions (e.g. -1=Plosive?, +2=Vowel?) • One model is assigned to each context cluster • Multilingual case: should we ignore language information? • Depends on the application • Yes, in case of adaptation to new target languages [Figure: context decision tree for /k/, questions -1=Plosive? and +2=Vowel? yielding models k(0), k(1), k(2)]
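Divisive clustering by linguistic questions can be sketched with the slide's /k/ contexts; this toy version uses a fixed question order rather than a likelihood-gain criterion, and the leaf labels are illustrative, not the slide's exact assignment:

```python
# Toy decision tree for /k/ polyphones: route each (left, center, right)
# context through two linguistically motivated questions.
polyphones = [("lau", "k", "ra"), ("ut", "k", "le"),
              ("ot", "k", "or"), ("in", "k", "ar")]

PLOSIVES = {"p", "b", "t", "d", "k", "g"}
VOWELS = {"a", "e", "i", "o", "u"}

def leaf(left, _center, right):
    # Question 1: is the -1 context (last phone on the left) a plosive?
    if left[-1] in PLOSIVES:
        return "k(1)"
    # Question 2: is the +2 context (second phone on the right) a vowel?
    if len(right) > 1 and right[1] in VOWELS:
        return "k(2)"
    return "k(0)"

for p in polyphones:
    print(p, "->", leaf(*p))
```

Each leaf gets its own acoustic model, trading granularity (more specific contexts) against trainability (enough data per leaf).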

  25. Polyphone Decision Tree Specialization (PDTS) [Figure: the multilingual context decision tree (questions -1=Plosive?, +2=Vowel?) is grown further with new target-language questions such as +1=Nasal? and -2=n?, splitting existing leaves into specialized models for Portuguese contexts]

  26. Language Adaptation Experiments [Chart: Ø tree vs Po-Tree]

  27. Language Adaptation Experiments [Chart: Ø tree, ML-Tree, Po-Tree]

  28. Language Adaptation Experiments [Chart: Ø tree, ML-Tree, Po-Tree]

  29. Language Adaptation Experiments [Chart: Ø tree, ML-Tree, Po-Tree, PDTS]

  30. Summary • Multilingual database suitable for MLVCSR (multilingual large vocabulary continuous speech recognition) • Language dependent recognition in 10 languages • Language independent acoustic modeling • Global phoneme set that covers 10 languages • Data sharing through multilingual models • Language adaptive speech recognition • Limited amount of language specific data • Adaptation through PDTS → Create speech engines in new target languages using only limited data, saving time and money

  31. Selected Publications • Language Independent and Language Adaptive Acoustic Modeling. T. Schultz and A. Waibel. In: Speech Communication, Volume 35, Issues 1-2, pp. 31-51, August 2001 • Multilinguality in Speech and Spoken Language Systems. A. Waibel, P. Geutner, L. Mayfield-Tomokiyo, T. Schultz, and M. Woszczyna. In: Proceedings of the IEEE, Special Issue on Spoken Language Processing, Volume 88(8), pp. 1297-1313, August 2000 • Polyphone Decision Tree Specialization for Language Adaptation. T. Schultz and A. Waibel. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2000), Istanbul, Turkey, June 2000 • Download from http://www.cs.cmu.edu/~tanja

  32. Language Independent & Adaptive Speech Recognition Tanja Schultz Carnegie Mellon University Speech Seminar at Johns Hopkins University November 27, 2001

  33. Applications • Language Identification: • Contribution of language independent and language dependent acoustic models for LID • Speaker Identification: • Combine the votes of phoneme recognizers of many different languages

  34. Language Identification • Zissman and others have already proved the usefulness of phoneme recognizers and phoneme LMs for LID • How do multilingual acoustic models contribute to LID? • Observation 1: Language dependent acoustic models (ML-Sep) outperform language independent ones (ML-Mix) • Observation 2: Phoneme LMs improve LID significantly if language switches are allowed • Results on 8 languages (~8 sec chunks), ID rate in % (min-max):

      LM-switch        LM-no switch     No LM
      77.2 (52.2-100)  84.8 (70.3-100)  78.0 (58.9-100)
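The phoneme-LM approach referenced on this slide can be sketched as: decode one phone string, score it with a per-language phone bigram LM, and pick the best-scoring language. All probabilities below are made-up toy numbers:

```python
# Toy phone-bigram LID scorer; unseen bigrams get a small floor probability.
import math

def log_score(phones, bigram_logprob, default=math.log(1e-3)):
    return sum(bigram_logprob.get(pair, default)
               for pair in zip(phones, phones[1:]))

phone_lms = {
    "PO": {("s", "u"): math.log(0.30), ("e", "u"): math.log(0.20)},
    "DE": {("s", "u"): math.log(0.01)},
}
decoded = ["e", "u", "s", "u"]   # phone string from the recognizer (stubbed)

best = max(phone_lms, key=lambda lang: log_score(decoded, phone_lms[lang]))
print(best)
```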

  35. Speaker Identification • Phone strings as features for speaker identification • Phone strings capture pronunciation idiosyncrasy • Pronunciation is one elementary factor used by human beings for identifying a speaker • Decode phone strings using phoneme recognizers of many different languages • Robustness by applying multiple languages • Speaker identification independent of what language is spoken

  36. Train a Speaker’s Model [Figure: training speech from speaker i is decoded by phoneme recognizers (PR) for languages 1…N, with no phoneme LM information used during decoding; from each phoneme string a phoneme language model (PLM) is calculated, giving PLM(Spk i, L1) … PLM(Spk i, LN)]
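Training one PLM per (speaker, language) pair reduces to estimating a phone n-gram model from each decoded string; a minimal bigram sketch, with the recognizer outputs stubbed as fixed strings:

```python
# Maximum-likelihood bigram PLM per (speaker, language); real systems
# would smooth the estimates. Decoded strings below are toy stand-ins.
from collections import Counter

def train_plm(phone_string):
    pairs = Counter(zip(phone_string, phone_string[1:]))
    total = sum(pairs.values())
    return {pair: c / total for pair, c in pairs.items()}

decoded = {
    ("spk1", "PO"): list("eusoueusou"),   # stub for PR output, language PO
    ("spk1", "DE"): list("aynaynayn"),    # stub for PR output, language DE
}
plms = {key: train_plm(s) for key, s in decoded.items()}
print(round(plms[("spk1", "PO")][("u", "s")], 3))
```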

  37. Identify a Speaker [Figure: test speech from speaker X is decoded by phoneme recognizers PR L1 … PR LN; each phoneme string is scored against all speakers’ PLMs (PLM Spk 1…k for L1 … LN); the identified speaker is the argmin of the perplexity]
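The argmin-of-perplexity decision on this slide can be sketched directly; the PLMs here are toy bigram models, and unseen bigrams get a small floor probability:

```python
# Score the test phone string against every speaker's PLM and return the
# speaker whose model assigns it the lowest perplexity.
import math

def perplexity(phones, plm, floor=1e-4):
    pairs = list(zip(phones, phones[1:]))
    ll = sum(math.log(plm.get(p, floor)) for p in pairs)
    return math.exp(-ll / len(pairs))

plms = {
    "spk1": {("e", "u"): 0.4, ("u", "s"): 0.4},
    "spk2": {("e", "u"): 0.05},
}
test = ["e", "u", "s"]
speaker = min(plms, key=lambda s: perplexity(test, plms[s]))
print(speaker)
```

Combining the decisions of several language-specific recognizers (as in the figure) makes the final vote more robust than any single phone string.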

  38. Experimental Setup • Closed-set speaker identification • 30 native speakers read English text • 5 sessions per speaker, 4 sessions for training (~7min), 1 for testing (~1min) • 8 parallel recordings at 6 different microphone distances per speaker • 4000~5000 phones per speaker per distance for training the PLM • Phone Recognizers in 8 different languages: Chinese, Croatian, French, German, Japanese, Portuguese, Spanish, and Turkish

  39. Results on close-talking microphone: speaker ID rate (%) by number of phones in the test utterance (~1000-1500 phones per test at full length)

      #phones  CH    DE    FR    JA    KR    PO    SP    TU    All
      ~1500    100   80    70    30    40    76.7  70    53.3  96.7
      800      96.7  80    66.7  20    40    66.7  66.7  53.3  96.7
      500      100   76.7  56.7  30    33.3  66.7  56.7  50    96.7
      300      93.3  63.3  53.3  36.7  33.3  63.3  50    46.7  96.7
      100      56.7  50    46.7  36.7  30    33.3  30    30    96.7
      50       40    33.3  16.7  26.7  26.7  20    20    16.7  93.3
      30       26.7  26.7  13.3  16.7  36.7  10    16.7  20    80

  → Combination of all languages compensates for the decreasing ID rate at shorter test lengths

  40. Results on matching conditions

  41. Results on non-matching conditions • Recording conditions / microphone distance may vary • GMM suffers severe degradation if training and testing conditions do not match • Train speaker and distance dependent PLMs

      Matching          D0-0   D1-1   D2-2   D6-6
      GMM               100    90     93.3   83.3
      ML phone string   96.7   90     96.7   83.3

      Non-matching      D0     D1     D2     D6
      GMM               50     65.8   66.6   53.3
      ML phone string   96.7   96.7   83.3

  42. Conclusion: Applications • Multilingual acoustic models can be successfully applied to the Language Identification problem • Preserve language specific information! • Recognition engines in many different languages can be successfully applied to the Speaker Identification problem • Robust against non-matching conditions • Independent of the spoken language
