
Broadcast News Transcription System for Khmer Language



  1. Broadcast News Transcription System for Khmer Language • S. Seng (1,2,3), S. Sam (1,2,3), B. Bigi (1), L. Besacier (1), E. Castelli (2) • (1) LIG, Grenoble, France • (2) MICA, Hanoi, Vietnam • (3) ITC, Phnom Penh, Cambodia • Sopheap.Seng@imag.fr

  2. Outline • ASR for Khmer: the challenges • Language data acquisition • Word/Sub-word Language Modeling • Acoustic Modeling • Experiments & Results • Conclusion & future work S. SENG, Lrec'08 Marrakech

  3. Khmer Language • Official language of Cambodia • Spoken by more than 15 million people • A non-tonal language • Writing system • 33 consonants, 23 dependent vowels • 14 independent vowels, 13 diacritics and various signs • No explicit word boundary

  4. ASR for Khmer: the challenges • An under-resourced language • Lack of text and speech data in digital form • No explicit word segmentation • Automatic segmentation is needed to make language modeling feasible • State-of-the-art segmentation methods • Use hand-crafted lexicons, statistics, optimization criteria … • Error-prone • Other under-resourced, unsegmented languages in the region: Burmese, Lao, Vietnamese …

  5. Language data acquisition: text data • Retrieving text from the Web • Well-selected, content-rich websites vs. crawling the Web • Adapting ClipsTextTk, an open-source corpus-creation tool, to the Khmer language • Conversion from legacy character encodings to Unicode • Automatic segmentation • Conversion of special signs and numbers to text • Normalization of word spelling • Text corpus obtained from 5 well-selected sites: • 2.5000 html documents retrieved • After processing: 0.5 M sentences, 15 M words • Duration: November 2007 – January 2008

  6. Language data acquisition: text data • An example of segmentation of Khmer text • Word segmentation • Based on an 18k-entry lexicon from the official Chhoun Nat dictionary • Optimization criterion: longest matching • Imperfect segmentation • Syllable segmentation • Rule-based (20 rules) • Imperfect segmentation • Character Cluster (CC) segmentation • A CC is a group of characters with a well-defined structure • CC segmentation is a trivial task
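The longest-matching criterion named on this slide can be illustrated with a minimal sketch. It is not the authors' tool: the function name, the toy Latin-script lexicon (standing in for the 18k-entry Khmer lexicon) and the fallback of emitting unknown characters one by one are all assumptions made for illustration.

```python
def longest_match_segment(text, lexicon, max_len=None):
    """Greedy left-to-right longest-matching segmentation.

    At each position the longest lexicon entry that matches is taken.
    Characters that start no lexicon entry are emitted on their own
    (an assumed fallback, not taken from the slides).
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy lexicon; unsegmented input "themend".
toy_lexicon = {"the", "them", "theme", "me", "men", "end", "mend"}
print(longest_match_segment("themend", toy_lexicon))
```

On this toy input the greedy choice of "theme" leaves "n" and "d" stranded, although "the" + "mend" and "them" + "end" are both full segmentations. This is one way the "imperfect segmentation" noted on the slide can arise.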

  7. Language data acquisition: speech signal • Speech data collection • Downloadable Khmer radio programs • Sites: Voice of America, Radio Free Asia, Radio Australia … • Quality: narrowband, poor quality • Recording of local radio broadcast news, Phnom Penh • Manual transcription campaign • Student volunteers carried out the transcription • 6 h 30 min of transcribed speech obtained (news read in the studio) • 3000 sentences, 45k words, 8 speakers (3 women)

  8. Statistical Language Modeling • Problem statement • Very limited quantity of text data • Word segmentation errors, high OOV rate • Is the word still an optimal modeling unit? • In the literature, alternative modeling units have been used: • Morphemes for morphologically rich languages [MorphoChallenge 05] • Logographic characters for Japanese [Den 06] • Logographic characters + words for Chinese [Chen 00] • Syllables for Vietnamese [Le 06] • Sub-word units in Khmer: syllable, Character Cluster
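The OOV (out-of-vocabulary) rate mentioned on this slide is commonly computed as the share of running test tokens that are missing from the vocabulary. The sketch below illustrates that definition on hypothetical toy data. The vocabularies, the test tokens and the per-letter "sub-word" split are invented for illustration and are not the Khmer data.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)


# Hypothetical toy data: two known "words", five known "sub-word units".
word_vocab = {"ab", "cd"}
unit_vocab = {"a", "b", "c", "d", "e"}
test_words = ["ab", "ce", "cd", "ae"]
# Toy sub-word split: every letter counts as one unit.
test_units = [u for w in test_words for u in w]

print(oov_rate(test_words, word_vocab))  # 0.5
print(oov_rate(test_units, unit_vocab))  # 0.0
```

In this toy case half of the word tokens are OOV, while every sub-word token is covered by a much smaller unit inventory. That is the kind of trade-off that motivates sub-word modeling units.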

  9. Statistical Language Modeling • Using words & sub-words in language modeling • Exploit different views of the same data • Deal with the OOV problem • Compensate for errors introduced by automatic segmentation • Create hybrid LMs by combining words and sub-words

  10. Acoustic modeling • Grapheme-based pronunciation • Each grapheme is used directly as a modeling unit • Grapheme-to-phoneme rule-based pronunciation • General Khmer syllable structure: C[C]V[CF] • 20 conversion rules derived from the rules in [Huffman 70] • Not all words could be phonetized by the rules (especially words from Sanskrit and Pali)
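A grapheme-based pronunciation dictionary can be sketched as follows: each word's "pronunciation" is simply its sequence of written characters. This is only an illustration under stated assumptions. The unit labels (`U` plus the hexadecimal code point) are hypothetical, and treating every Unicode character, including the subscript sign U+17D2, as its own unit may differ from the 77 units the authors actually used. The output uses the common "word followed by space-separated units" dictionary layout.

```python
def grapheme_pronunciation(word):
    """Return one modeling-unit label per character of the word.

    Labels are hypothetical: 'U' followed by the code point in hex.
    """
    return ["U%04X" % ord(ch) for ch in word]


def dictionary_lines(words):
    """Build 'word unit unit ...' lines, one per distinct word."""
    return ["%s %s" % (w, " ".join(grapheme_pronunciation(w)))
            for w in sorted(set(words))]


# Code points intended to spell the word "Khmer" in Khmer script.
khmer = "\u1781\u17d2\u1798\u17c2\u179a"
print(grapheme_pronunciation(khmer))
for line in dictionary_lines([khmer]):
    print(line)
```

No conversion rules are needed, so every word receives a pronunciation, including the Sanskrit and Pali words that the 20 grapheme-to-phoneme rules could not handle.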

  11. Acoustic modeling • Khmer phoneme inventory (table not reproduced) • Source: [Huffman 70]

  12. Experiments: ASR system • Decoder • Sphinx V3.6 • Test corpus • 172 utterances • Acoustic model HMM training • SphinxTrain • Grapheme-based: 77 modeling units • Phoneme-based: 33 modeling units (single phone) • Context-independent and context-dependent models • Trigram LMs • Word/sub-word LMs: word, syllable, CC • Hybrid LMs: CC + N most frequent words (N varies from 0 to 20k) • Vocabulary: 20k most frequent words, 8800 syllables, 3500 CCs • Evaluation metrics • WER (Word Error Rate) • SER (Syllable Error Rate) • CCER (Character Cluster Error Rate)
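WER, SER and CCER are usually computed the same way and differ only in the unit the transcripts are split into: the error rate is (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions in the best alignment and N is the number of reference units. The following is a minimal sketch of that standard definition, not the scoring tool used in the experiments. The function names and the toy transcripts are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """(S + D + I) / N over any unit: words, syllables or CCs."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical toy transcripts, already split into units.
reference = ["a", "b", "c", "d"]
hypothesis = ["a", "x", "c"]
print(error_rate(reference, hypothesis))  # 0.5
```

Passing word, syllable or CC sequences to `error_rate` gives WER, SER or CCER respectively. Rates on different units are not directly comparable, because N changes with the unit.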

  13. Experiments: baseline results • Grapheme vs. phoneme • Performance of the grapheme-based and phoneme-based models is comparable • This shows the potential of the grapheme-based approach

  14. Experiments: Word/Sub-Word LM • Comparison of word and sub-word LMs • On average, a Khmer word consists of 3.2 syllables and 4.3 CCs

  15. Experiments: Hybrid LMs • Hybrid LMs: • Progressively add the N most frequent words to the CC vocabulary to create hybrid vocabularies • The small V5K vocabulary gives performance comparable to the word-based LM
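The hybrid vocabulary construction can be sketched as below. The first two functions follow the slide (CC vocabulary plus the N most frequent words). The third one, which rewrites a word-segmented text so that frequent words stay whole and all other words are broken into CCs, is an assumption about how such a vocabulary would be applied to LM training text. All names and the toy data are hypothetical, and a plain per-character split stands in for real CC segmentation.

```python
from collections import Counter


def top_n_words(word_corpus, n):
    """The n most frequent words of a word-segmented corpus."""
    return {w for w, _ in Counter(word_corpus).most_common(n)}


def hybrid_vocabulary(cc_vocab, frequent_words):
    """CC vocabulary extended with the selected frequent words."""
    return set(cc_vocab) | set(frequent_words)


def hybrid_tokenize(words, frequent_words, split_into_ccs):
    """Keep frequent words whole; break every other word into CCs."""
    tokens = []
    for w in words:
        if w in frequent_words:
            tokens.append(w)
        else:
            tokens.extend(split_into_ccs(w))
    return tokens


# Hypothetical toy corpus; list() splits a word into letters ("toy CCs").
corpus = ["ab", "ab", "cd", "ab", "cd", "ef"]
cc_vocab = {"a", "b", "c", "d", "e", "f"}
frequent = top_n_words(corpus, 1)
vocab = hybrid_vocabulary(cc_vocab, frequent)
print(sorted(vocab))
print(hybrid_tokenize(["ab", "cd", "ef"], frequent, list))
```

With N = 0 the vocabulary reduces to the pure CC vocabulary, and increasing N moves it step by step toward a word vocabulary, which matches the range of N described in the experimental setup.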

  16. Conclusion & Future work • ASR for Khmer, an under-resourced, unsegmented language • Tools for language data acquisition and processing • Word/sub-word units for language modeling: hybrid LMs • Grapheme-based vs. grapheme-to-phoneme rule-based acoustic modeling • Future work • Discounting illegal sub-word sequences • A tree structure for words/sub-words at the LM level • System combination scheme based on lattice combination

  17. Questions or suggestions?

  18. References • [MorphoChallenge 05] M. Kurimo et al. Unsupervised segmentation of words into morphemes - Morpho Challenge 2005: Application to Automatic Speech Recognition. In Proc. Interspeech, pages 1021-1024, Pittsburgh, PA, 2006. • [Le 06] Viet-Bac Le. « Reconnaissance automatique de la parole pour les langues peu dotées ». Thèse de doctorat de l’Université J. Fourier - Grenoble I, France, 2006. • [Den 06] E. Denoual and Y. Lepage. The character as an appropriate unit of processing for non-segmenting languages. NLP Annual Meeting, pages 731-734, Tokyo, Japan, 2006. • [Huffman 70] F. Huffman. Cambodian System of Writing and Beginning Reader. Yale University Press, 1970.
