Arianna Bisazza & Roberto Gretter – FBK, Italy

TURKIC LANGUAGES WORKSHOP, May 2012, İstanbul
BUILDING A TURKISH ASR SYSTEM WITH MINIMAL RESOURCES

Typical ASR recipe to build a news transcription system
• Speech recordings
• Manual transcriptions
• Written text collection
• Pre-processing: tokenization, case, digits etc.
• Pronunciation lexicon
• N-gram language model

…for Turkish, with minimal resources
• Speech recordings: TV recordings
• Manual transcriptions: expensive! Replaced by automatic transcriptions from an existing ASR system
• Written text collection: web-crawled
• Pre-processing: tokenization, case, digits etc. + morphology: expensive! Replaced by data-driven segmentation & suffix lexicalization
• Pronunciation lexicon: cheap for Turkish
• N-gram language model

Outline
• Data Collection
• Unsupervised Acoustic Modeling
• Language Modeling for Turkish
  • Word Segmentation
  • Data-driven Morphophonemics

Data Collection
• International satellite TV channel broadcasting:
  • one video stream,
  • parallel audio streams in many languages, including Turkish, English, Italian etc.
• Written news collected daily from the same channel and from other newspaper websites.

Data sets:
• AMTrain: 108 h of untranscribed audio
• TurTest: 12 min of transcribed audio
• LMTrain: 130M words
• TurDev: 3.2M words

Unsupervised Acoustic Modeling

Unsupervised Acoustic Modeling
• Scenario: we have a good ASR system for language X, plus audio and text data but no transcriptions for language Y
• Method:
  • transcribe Turkish audio with Italian AMs
  • use these transcriptions to retrain the AM
  • repeat until convergence
• No manual transcription required
• Language independent in principle; works better if the phonemes are similar
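
The transcribe / retrain / repeat loop above can be sketched as follows. This is only an illustration: `transcribe` and `retrain` are hypothetical callables standing in for a real decoder and AM trainer, and the stopping rule (fraction of utterances whose automatic transcript changed since the previous pass) is an assumption, since the slides do not say how convergence is measured.

```python
from typing import Callable, List, Sequence

Transcripts = List[str]


def bootstrap_acoustic_model(
    audio: Sequence[str],
    seed_model: object,
    transcribe: Callable[[object, Sequence[str]], Transcripts],
    retrain: Callable[[Sequence[str], Transcripts], object],
    max_iters: int = 10,
    min_change: float = 0.01,
) -> object:
    """Iteratively retrain an acoustic model on its own automatic transcripts.

    seed_model is the AM of the well-resourced language (Italian in the talk).
    transcribe/retrain are placeholders (assumptions), not a real toolkit API.
    """
    model = seed_model
    previous: Transcripts = []
    for _ in range(max_iters):
        # Step 1: decode the untranscribed audio with the current model.
        hyps = transcribe(model, audio)
        # Assumed convergence test: stop when almost no transcript changed.
        if previous:
            changed = sum(h != p for h, p in zip(hyps, previous))
            if changed / max(len(hyps), 1) < min_change:
                break
        # Step 2: retrain the AM using the automatic transcripts as labels.
        model = retrain(audio, hyps)
        previous = hyps
    return model
```

Passing the decoder and trainer as arguments keeps the sketch independent of any specific ASR toolkit; `max_iters` bounds the loop in case the transcripts never stabilize.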

Turkish Language Modeling

Relevant features of Turkish
• Agglutination → fast vocabulary growth
• Rich suffix allomorphy due to vowel harmony & other phonological phenomena
• We address both with data-driven methods

1) Word segmentation
• Unsupervised morphological segmentation: Morfessor tool (based on Minimum Description Length)
• Parameter PPthreshold controls the level of segmentation
• Stem+ending representation to avoid too-small units: trade-off between coverage and recognition accuracy (cf. Erdoğan&al.05, Arısoy&al.09)
• 1 word → max 2 segments
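
A minimal sketch of the "1 word → max 2 segments" step, assuming the segmenter returns a list of morphs per word and that the first morph is the stem (an assumption; the slides do not state how the split point is chosen). The `+` markers follow the `Bakan+ +H` notation used in the segmented example text of this deck; the function name is hypothetical.

```python
from typing import List


def to_stem_ending(morphs: List[str]) -> List[str]:
    """Collapse a morph sequence into at most two units: 'stem+' and '+ending'.

    Assumption: morphs[0] is the stem and all following morphs are suffixes,
    which are concatenated into a single ending.
    """
    if not morphs:
        return []
    if len(morphs) == 1:
        # Unsegmented word: keep it as a single unit without markers.
        return [morphs[0]]
    stem = morphs[0]
    ending = "".join(morphs[1:])
    return [stem + "+", "+" + ending]


print(to_stem_ending(["temsilci", "ler", "in", "den"]))
```

Merging all suffixes into one ending keeps the units long enough for recognition while still letting the vocabulary share stems across inflected forms.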

2) Data-driven suffix normalization
• Goal: factorize together word-ending allomorphs
• Procedure:
  • define letter equivalence classes: A={a,e} H={ı,i,ü,u} D={d,t} K={k,ğ} C={c,ç}
  • normalize = map letters to their class:
    kural+lar → kural+lAr
    santral+ler → santral+lAr
  • train LM, build ASR system, transcribe
  • recover surface forms with simple statistical models
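
The normalization step can be sketched directly from the letter classes listed above. The function names are illustrative, and the sketch assumes a word is written as `stem+ending` with a single `+`, as in the `kural+lar` example; only the ending is normalized.

```python
# Letter equivalence classes as given on the slide.
LETTER_CLASSES = {"A": "ae", "H": "ıiüu", "D": "dt", "K": "kğ", "C": "cç"}

# Reverse lookup: surface letter -> class symbol.
LETTER_TO_CLASS = {
    letter: cls for cls, letters in LETTER_CLASSES.items() for letter in letters
}


def normalize_ending(ending: str) -> str:
    """Map every letter of an ending to its class; other letters are kept."""
    return "".join(LETTER_TO_CLASS.get(ch, ch) for ch in ending)


def normalize_word(segmented: str) -> str:
    """Normalize only the part after '+'; words without '+' are unchanged."""
    stem, sep, ending = segmented.partition("+")
    return stem + sep + normalize_ending(ending)


for word in ("kural+lar", "santral+ler", "temsilci+lerinden"):
    print(word, "->", normalize_word(word))
```

Because the mapping is many-to-one, both `+lar` and `+ler` become `+lAr`, which is what lets the language model pool their counts.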

2) Data-driven suffix normalization
• Intrinsic evaluation on clean data:
  • 27% of the tokens in TurDev are ambiguous lexicalized endings
  • 99.7% of them are assigned the correct surface form
  • some "errors" are due to misspellings in web-crawled data: *yatirimci+lArIn → *yatirimci+lerin
• Impact on language model perplexity (PP)
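
The slides do not specify the statistical models used to recover surface forms, so the following is only a stand-in to make the idea and its intrinsic evaluation concrete. It assumes a most-frequent-form lookup keyed on (stem, normalized ending), backing off to the normalized ending alone. The class name, the back-off scheme and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# (stem, normalized ending, surface ending), e.g. ("kural", "lAr", "lar")
Example = Tuple[str, str, str]


class SurfaceModel:
    """Frequency-based surface-form predictor (illustrative assumption)."""

    def __init__(self) -> None:
        self.by_stem: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
        self.by_ending: Dict[str, Counter] = defaultdict(Counter)

    def train(self, examples: Iterable[Example]) -> None:
        for stem, norm, surface in examples:
            self.by_stem[(stem, norm)][surface] += 1
            self.by_ending[norm][surface] += 1

    def predict(self, stem: str, norm: str) -> str:
        # Prefer the stem-specific statistics, then back off to the ending.
        for counts in (self.by_stem.get((stem, norm)), self.by_ending.get(norm)):
            if counts:
                return counts.most_common(1)[0][0]
        # Unseen normalized ending: leave it unresolved.
        return norm


def accuracy(model: SurfaceModel, test: List[Example]) -> float:
    """Share of test endings whose predicted surface form is correct."""
    if not test:
        return 0.0
    correct = sum(model.predict(s, n) == surf for s, n, surf in test)
    return correct / len(test)


model = SurfaceModel()
model.train([("kural", "lAr", "lar"), ("santral", "lAr", "ler"), ("okul", "lAr", "lar")])
print(model.predict("santral", "lAr"))
```

A lookup of this kind would also reproduce the misspelling effect noted above: a stem spelled with the wrong vowels in the training text is stored with the ending that matched the misspelled form.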

Results

ASR results
• Scores: WA (word accuracy) | HWA (half-word accuracy)
• Performance of recent works on related tasks (WA): [Erdoğan&al.05] 53%, [Kurimo&al.06] 67%, [Arısoy&al.09] 76%
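
For reference, a sketch of how a word accuracy score can be computed, assuming the common definition WA = 1 − WER = (N − S − D − I) / N, with N the number of reference tokens (the slides do not define the metric). Under the same assumption, HWA would be this function applied to half-word token sequences such as `Bakan+ +H`.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Token-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: 1 - (S + D + I) / N; can be negative with many insertions."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


ref = "Bakan Çağlayan Deming ile görüştü".split()
hyp = "Bakan Çağlayan Demin ile görüştü".split()
print(word_accuracy(ref, hyp))
```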

Conclusions

Conclusion
• We built a Turkish ASR system with almost no language-specific resources, achieving reasonably good results:
  • unsupervised AM, bootstrapped from an unrelated language
  • unsupervised segmentation with an off-the-shelf tool
• The level of segmentation (PPth) affects ASR accuracy; it should be tuned for the specific task
• We proposed a highly accurate data-driven method for suffix normalization + surface prediction
  • No gain in ASR quality so far; more analysis is needed
• Replicate experiments on a larger test set

Conclusion
• Similar methods may be applied to ASR (and MT) of under-resourced agglutinative languages

Thank you for your attention

Bakan Çağlayan Çin Ticaret Bakanı Deming ile görüştü
Bakan Çağlayan Çin Ticaret Bakan+ +H Deming ile görüştü

Devlet Bakanı Zafer Çağlayan her iki ülkenin kendi parasıyla karşılıklı
Devlet Bakan+ +H Zafer Çağlayan her iki ülke+ +nHn kendi parası+ +ylA karşılık+ +lH

ticaret yapma konusundaki çalışmaların sürdüğünü bildirdi
ticaret yapma konusu+ +nDAKH çalış+ +mAlArHn sürdüğünü bildird+ +H

Çağlayan Çin Ticaret Bakanı Chen Deming ve beraberinde özel ve kamu
Çağlayan Çin Ticaret Bakan+ +H Chen Deming ve beraber+ +HnDA özel ve kamu

sektörü temsilcilerinden oluşan heyetle görüştü
sektörü temsilci+ +lArHnDAn oluşan heyet+ +lA görüştü

Ümit ediyorum ki yarım sabah itibariyle Sayın Bakan ile beraber _bir_ milyar
Ümit ediyor+ +Hm ki yarım sabah itibar+ +HylA Sayın Bakan ile beraber _bir_ milyar

doların üzerinde bir Çin 'den Türkiye 'den alım gerçekleştirilmiş olacak dedi
dolar+ +Hn üzerin+ +DA bir Çin 'den Türk+ +HyA 'den alım gerç+ +AKlAşDHrHlmHş olacak dedi
