Arianna Bisazza & Roberto Gretter – FBK, Italy

TURKIC LANGUAGES WORKSHOP, May 2012, İstanbul. BUILDING A TURKISH ASR SYSTEM WITH MINIMAL RESOURCES.


Presentation Transcript


  1. TURKIC LANGUAGES WORKSHOP, May 2012, İstanbul • BUILDING A TURKISH ASR SYSTEM WITH MINIMAL RESOURCES • Arianna Bisazza & Roberto Gretter – FBK, Italy

  2. Typical ASR recipe to build a news transcription system • Speech recordings • Manual transcriptions • Written text collection • Pre-processing: tokenization, case, digits etc. • Pronunciation Lexicon • N-gram Language Model

  3. …for Turkish, with minimal resources • Speech recordings: TV recordings • Manual transcriptions: expensive! • Written text collection: web-crawled • Pre-processing: tokenization, case, digits etc. • Pronunciation Lexicon: cheap for TR • N-gram Language Model

  4. …for Turkish, with minimal resources • Speech recordings: TV recordings • Automatic transcriptions: existing ASR system • Written text collection: web-crawled • Pre-processing: tokenization, case, digits etc. • + morphology: expensive! • Pronunciation Lexicon: cheap for TR • N-gram Language Model

  6. …for Turkish, with minimal resources • Speech recordings: TV recordings • Automatic transcriptions: existing ASR system • Written text collection: web-crawled • Pre-processing: tokenization, case, digits etc. • + morphology: data-driven segmentation & suffix lexicalization • Pronunciation Lexicon: cheap for TR • N-gram Language Model

  7. Outline • Data Collection • Unsupervised Acoustic Modeling • Language Modeling for Turkish: • Word Segmentation • Data-driven Morphophonemics

  8. Data Collection • International satellite TV channel broadcasting: • one video stream, • parallel audio streams in many languages, including Turkish, English, Italian etc. • Written news collected daily from the same channel and from other newspaper websites. • Data sets: • AMTrain: 108 h untranscribed audio • TurTest: 12 min transcribed audio • LMTrain: 130M words • TurDev: 3.2M words

  9. Unsupervised Acoustic Modeling

  10. Unsupervised Acoustic Modeling • Scenario: we have a good ASR system for lang. X, audio & text data but no transcription for lang. Y • Method: • transcribe Turkish audio with Italian AMs

  11. Unsupervised Acoustic Modeling • Scenario: we have a good ASR system for lang. X, audio & text data but no transcription for lang. Y • Method: • transcribe Turkish audio with Italian AMs • use these transcriptions to retrain AM

  12. Unsupervised Acoustic Modeling • Scenario: we have a good ASR system for lang. X, audio & text data but no transcription for lang. Y • Method: • transcribe Turkish audio with Italian AMs • use these transcriptions to retrain AM • repeat until convergence

  13. Unsupervised Acoustic Modeling • Scenario: we have a good ASR system for lang. X, audio & text data but no transcription for lang. Y • Method: • transcribe Turkish audio with Italian AMs • use these transcriptions to retrain AM • repeat until convergence • No manual transcription required • Language independent in principle, works better if phonemes are similar
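
The transcribe / retrain / repeat loop on slide 13 can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `decode` and `train` are hypothetical callables standing in for a real recognizer and a real acoustic-model trainer, and the stopping rule (the automatic transcriptions stop changing, or an iteration cap is reached) is an assumption, since the slides only say "repeat until convergence".

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Utterance = TypeVar("Utterance")


def bootstrap_acoustic_model(
    audio: Sequence[Utterance],
    seed_model: Model,
    decode: Callable[[Model, Utterance], str],
    train: Callable[[Sequence[Utterance], List[str]], Model],
    max_iters: int = 10,
) -> Tuple[Model, List[str], int]:
    """Iteratively retrain an acoustic model on its own transcriptions.

    seed_model plays the role of the AMs borrowed from another language
    (Italian in the slides). decode and train are hypothetical stand-ins.
    """
    model = seed_model
    previous: List[str] = []
    hyps: List[str] = []
    rounds = 0
    for rounds in range(1, max_iters + 1):
        # Step 1: transcribe all audio with the current model.
        hyps = [decode(model, utt) for utt in audio]
        # Assumed convergence test: transcriptions no longer change.
        if hyps == previous:
            break
        # Step 2: retrain the acoustic model on the automatic transcriptions.
        model = train(audio, hyps)
        previous = hyps
    return model, hyps, rounds
```

In a real system the convergence test would more plausibly track recognition accuracy on a small transcribed set such as TurTest; the equality check above is only the simplest criterion that makes the loop terminate.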

  14. Turkish Language Modeling

  15. Relevant features of Turkish • Agglutination → fast vocabulary growth • Rich suffix allomorphy due to vowel harmony & other phonological phenomena • We address both with data-driven methods

  16. 1) Word segmentation • Unsupervised morphological segmentation: Morfessor tool (based on Minimum Description Length) • Parameter PPthreshold controls the level of segmentation • Stem+ending representation to avoid units that are too small: trade-off between coverage and recognition accuracy (cf. Erdoğan&al.05, Arısoy&al.09) • 1 word → max 2 segments
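
To illustrate the "max 2 segments" constraint, the sketch below collapses a multi-morph segmentation into a stem and a single ending, written in the `stem+ +ending` notation that appears on slide 24. The slide does not say which morph boundary is kept; taking the first morph as the stem is an assumption made here, and `to_stem_ending` is a hypothetical helper, not part of Morfessor.

```python
from typing import List, Sequence


def to_stem_ending(morphs: Sequence[str]) -> List[str]:
    """Collapse a morph sequence into at most two units: stem+ and +ending.

    Assumption: the first morph is the stem and all remaining morphs are
    concatenated into one ending. Unsegmented words are returned unchanged.
    """
    if len(morphs) <= 1:
        return list(morphs)
    stem = morphs[0]
    ending = "".join(morphs[1:])
    return [stem + "+", "+" + ending]


# A four-morph analysis becomes two units; a single morph stays as it is.
print(to_stem_ending(["çalış", "ma", "lar", "ın"]))
print(to_stem_ending(["görüştü"]))
```

Keeping at most two units per word limits how short the recognition units can get, which is the coverage versus accuracy trade-off the slide refers to.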

  17. 2) Data-driven suffix normalization • Goal: factorize together word ending allomorphs • Procedure: • define letter equivalence classes: A={a,e}, H={ı,i,ü,u}, D={d,t}, K={k,ğ}, C={c,ç} • normalize = map letters to their class: kural+lar → kural+lAr, santral+ler → santral+lAr • train LM, build ASR system, transcribe • recover surface forms with simple statistical models
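
The normalization step on slide 17 amounts to a per-letter mapping applied to the ending only. The sketch below uses the five equivalence classes listed on the slide; the function names and the convention that a single `+` separates stem from ending are assumptions chosen for illustration.

```python
# Letter equivalence classes exactly as listed on slide 17.
CLASSES = {
    "A": "ae",
    "H": "ıiüu",
    "D": "dt",
    "K": "kğ",
    "C": "cç",
}

# Invert to a letter -> class-symbol lookup table.
LETTER_TO_CLASS = {ch: cls for cls, letters in CLASSES.items() for ch in letters}


def normalize_ending(ending: str) -> str:
    """Map each letter of a (lowercase) ending to its class symbol, if any."""
    return "".join(LETTER_TO_CLASS.get(ch, ch) for ch in ending)


def normalize_word(segmented: str) -> str:
    """Normalize only the part after the first '+'; the stem is left untouched."""
    stem, sep, ending = segmented.partition("+")
    return stem + sep + normalize_ending(ending)


print(normalize_word("kural+lar"))
print(normalize_word("santral+ler"))
```

Both plural allomorphs map to the same unit `lAr`, so the language model sees one ending type instead of two.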

  18. 2) Data-driven suffix normalization • Intrinsic evaluation on clean data: • 27% of tokens in TurDev are ambiguous lexicalized endings • 99.7% of them are assigned the correct surface form • some "errors" are due to misspellings in web-crawled data: *yatirimci+lArIn → *yatirimci+lerin • Impact on language model PP:
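
The slides do not preserve which "simple statistical models" were used to recover surface forms. Purely as an illustration of what such a model could look like, the sketch below picks the most frequent surface ending seen in training for a given stem and normalized ending, and backs off to the stem's last vowel when the stem is unseen. The class name, the backoff scheme and the toy training pairs are all assumptions, not the authors' method.

```python
from collections import Counter, defaultdict
from typing import Optional

VOWELS = set("aeıioöuü")


def last_vowel(stem: str) -> str:
    """Return the last vowel of the stem, or '' if it has none."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    return ""


class SurfaceFormModel:
    """Hypothetical frequency-based predictor of surface endings."""

    def __init__(self) -> None:
        self.by_stem = defaultdict(Counter)   # (stem, normalized ending) -> counts
        self.by_vowel = defaultdict(Counter)  # (last vowel, normalized ending) -> counts

    def observe(self, stem: str, norm_ending: str, surface_ending: str) -> None:
        self.by_stem[(stem, norm_ending)][surface_ending] += 1
        self.by_vowel[(last_vowel(stem), norm_ending)][surface_ending] += 1

    def predict(self, stem: str, norm_ending: str) -> Optional[str]:
        # Prefer stem-specific evidence, then back off to the last vowel.
        lookups = (
            (self.by_stem, (stem, norm_ending)),
            (self.by_vowel, (last_vowel(stem), norm_ending)),
        )
        for table, key in lookups:
            counts = table.get(key)
            if counts:
                return counts.most_common(1)[0][0]
        return None


model = SurfaceFormModel()
for stem, surface in [("kural", "lar"), ("kitap", "lar"), ("santral", "ler")]:
    model.observe(stem, "lAr", surface)
print(model.predict("santral", "lAr"), model.predict("masa", "lAr"))
```

The stem-level table lets the model keep exceptions such as santral+ler, while the vowel-level backoff captures regular vowel harmony for stems it has never seen.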

  19. Results

  20. ASR results • Scores: WA (word accuracy) | HWA (half-word accuracy) • Performance of recent work on related tasks (WA): [Erdoğan&al.05] 53%, [Kurimo&al.06] 67%, [Arısoy&al.09] 76%
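
The slide does not define WA or HWA. Assuming the usual definition of word accuracy, WA = 1 - (S + D + I) / N, with S substitutions, D deletions, I insertions and N reference tokens, and assuming HWA is the same measure computed over stem/ending units instead of full words, both scores can be sketched with one token-level edit distance. The function names and the example hypothesis are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Token-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Accuracy = 1 - (S + D + I) / N over whatever token unit is passed in."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Word-level reference taken from slide 24; the hypothesis is invented.
ref_words = "Bakan Çağlayan Çin Ticaret Bakanı Deming ile görüştü".split()
hyp_words = "Bakan Çağlayan Çin Ticaret Bakan Deming ile görüştü".split()
print(accuracy(ref_words, hyp_words))
```

Passing half-word tokens such as `Bakan+ +H` to the same function gives the half-word variant under the assumption stated above. Note that this measure can go below zero when the hypothesis contains many insertions.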

  21. Conclusions

  22. Conclusion • We built a Turkish ASR system with almost no language-specific resources, achieving reasonably good results: • unsupervised AM, bootstrapped from an unrelated language • unsupervised segmentation with an off-the-shelf tool • The level of segmentation (PPth) affects ASR accuracy and should be tuned for the specific task • We proposed a highly accurate data-driven method for suffix normalization + surface prediction • No gain in ASR quality so far; more analysis is needed • Replicate the experiments on a larger test set

  23. Conclusion • Similar methods may be applied to ASR (and MT) of under-resourced agglutinative languages • Thank you for your attention

  24. Example text: each original line is followed by its segmented and normalized version.
      Bakan Çağlayan Çin Ticaret Bakanı Deming ile görüştü
      Bakan Çağlayan Çin Ticaret Bakan+ +H Deming ile görüştü
      Devlet Bakanı Zafer Çağlayan her iki ülkenin kendi parasıyla karşılıklı
      Devlet Bakan+ +H Zafer Çağlayan her iki ülke+ +nHn kendi parası+ +ylA karşılık+ +lH
      ticaret yapma konusundaki çalışmaların sürdüğünü bildirdi
      ticaret yapma konusu+ +nDAKH çalış+ +mAlArHn sürdüğünü bildird+ +H
      Çağlayan Çin Ticaret Bakanı Chen Deming ve beraberinde özel ve kamu
      Çağlayan Çin Ticaret Bakan+ +H Chen Deming ve beraber+ +HnDA özel ve kamu
      sektörü temsilcilerinden oluşan heyetle görüştü
      sektörü temsilci+ +lArHnDAn oluşan heyet+ +lA görüştü
      Ümit ediyorum ki yarım sabah itibariyle Sayın Bakan ile beraber _bir_ milyar
      Ümit ediyor+ +Hm ki yarım sabah itibar+ +HylA Sayın Bakan ile beraber _bir_ milyar
      doların üzerinde bir Çin 'den Türkiye 'den alım gerçekleştirilmiş olacak dedi
      dolar+ +Hn üzerin+ +DA bir Çin 'den Türk+ +HyA 'den alım gerç+ +AKlAşDHrHlmHş olacak dedi
