HYP update - part1 Morphological Analysis for Phrase-Based Statistical Machine Translation Luong Minh Thang WING group meeting – 15 Aug, 2008
Agenda • Introduction - what does my project title mean? • Language pair • English-Finnish challenges • Related works • Project direction
Introduction I: phrase-based SMT • Statistical: derive statistical information from large amounts of data • Phrase-based: capture local constraints • [Figure: source sentence and target sentence]
Introduction II - Morphology • Morpheme: minimal meaning-bearing unit • machines = machine + s • translation = translate + ion • goalkeeper = goal + keeper • English is a low-inflected language with a simple morphological structure • High-inflected languages are much more complicated!
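The three decompositions above can be reproduced with a very small sketch. The lexicon below (`STEMS`, `SUFFIXES`) and the function `segment` are hypothetical and cover only the slide's examples; this is not the segmentation method used in the project.

```python
# Toy lexicon: hypothetical, just large enough for the slide's three examples.
STEMS = {"machine", "translate", "goal", "keeper"}
SUFFIXES = ("s", "ion")

def segment(word):
    """Split a word into known morphemes; return [word] if no split is found."""
    if word in STEMS:
        return [word]
    # Case 1: stem + suffix, allowing a dropped final "e"
    # (translate + ion is written "translation").
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            for stem in (base, base + "e"):
                if stem in STEMS:
                    return [stem, suffix]
    # Case 2: compound of two known stems.
    for i in range(1, len(word)):
        if word[:i] in STEMS and word[i:] in STEMS:
            return [word[:i], word[i:]]
    return [word]

for w in ("machines", "translation", "goalkeeper"):
    print(w, "=", " + ".join(segment(w)))
```

Even for English the sketch needs a spelling rule (the dropped "e"), which hints at why a hand-written approach scales poorly to richer morphology.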
Introduction III – high-inflected languages • Concatenate a chain of morphemes to form a word • Finnish: oppositio + kansa + n + edusta + ja (opposition + people + of + represent + -ative) = opposition member of parliament • Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (uygar + laş + tır + ama + dık + lar + ımız + dan + mış + sınız + casına) = (behaving) as if you are among those whom we could not cause to become civilized • This is a single word!
Introduction IV – Why morphology-aware SMT? • Tackle the data sparseness problem (statistics from 1,021,180 sentence pairs) • Capture the relations among words • Spanish: máquina, máquinas • English: machine, machines
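A small sketch of why segmentation helps with sparseness. The corpus here is a toy one built for illustration (three Finnish stems combined with three case suffixes, hand-segmented); it is not the 1,021,180-sentence corpus mentioned on the slide, and `count_types` is a hypothetical helper.

```python
from collections import Counter
from itertools import product

# Toy data, assumed for illustration only.
STEMS = ["talo", "auto", "koulu"]   # house, car, school
SUFFIXES = ["ssa", "sta", "lla"]    # "in", "from", "at/on"

def count_types():
    """Count unit frequencies with and without morpheme segmentation."""
    word_counts = Counter()
    morph_counts = Counter()
    for stem, suffix in product(STEMS, SUFFIXES):
        word_counts[stem + suffix] += 1            # unsegmented: "talossa"
        morph_counts.update([stem, "+" + suffix])  # segmented: "talo", "+ssa"
    return word_counts, morph_counts

words, morphs = count_types()
print(len(words), "word types, each seen", min(words.values()), "time(s)")
print(len(morphs), "morpheme types, each seen", min(morphs.values()), "time(s)")
```

With whole words every form is observed once; after segmentation there are fewer distinct units and each is observed several times, so the statistics are less sparse. Segmentation also makes the relation between forms such as "talossa" and "talosta" explicit through the shared stem.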
Language pair I – our choice? • We chose English-Finnish as our main translation task • [Figure: languages ranged from low-inflected (e.g. Vietnamese) to highly inflected (Dyer, 2007)]
Language pair II – why Finnish? • Honestly, I don’t know Finnish … • But because: • Corpora are available • Finnish is an agglutinative, morphologically complex language, suitable for our project scope • We investigate translation from low-inflected to high-inflected languages: an area worth exploring, yet hard!
English-Finnish challenges I – many-to-one word relationship • Finnish uses suffixes to express grammatical relations and also to derive new words (about 14-15 cases for nouns) • Word formation is not merely concatenation • The English-Finnish word relationship is many-to-one, so a word-morpheme correspondence is needed
English-Finnish challenges II – word order • Word order is “free” in Finnish • Pete rakastaa Annaa = Pete loves Anna (normal) • Annaa Pete rakastaa: emphasizes Annaa • Rakastaa Pete Annaa: emphasizes rakastaa = Pete does love Anna • Pete Annaa rakastaa: stress on Pete • Rakastaa Annaa Pete: does not sound like a normal sentence, but is quite understandable
English-Finnish challenges III – surface form generation • After translating from English words to Finnish morphemes, we need a surface generation step: oppositio + kansa + n + edusta + ja → oppositiokansanedustaja • What if morphemes are missing or the morpheme order changes? • We need a more error-tolerant surface recovery algorithm
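A minimal sketch of the surface generation step, assuming a list of known Finnish word forms is available. It concatenates the morphemes and, if the result is not a known word, falls back to the closest known form by string similarity using Python's standard `difflib`. The function `recover_surface`, the vocabulary, and the 0.8 cutoff are illustrative assumptions, not the recovery algorithm developed in the project.

```python
from difflib import get_close_matches

def recover_surface(morphemes, vocabulary, cutoff=0.8):
    """Join morphemes; if the result is unknown, return the closest known word.

    `vocabulary` is an assumed list of known surface forms. When nothing is
    similar enough, the plain concatenation is returned unchanged.
    """
    candidate = "".join(morphemes)
    if candidate in vocabulary:
        return candidate
    matches = get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate

vocab = ["oppositiokansanedustaja", "kansanedustaja"]

# All morphemes present: plain concatenation already gives a known word.
print(recover_surface(["oppositio", "kansa", "n", "edusta", "ja"], vocab))

# The genitive "n" is missing: the closest known form is used instead.
print(recover_surface(["oppositio", "kansa", "edusta", "ja"], vocab))
```

String similarity tolerates a missing morpheme but does little for reordered morphemes, which is one reason a more general recovery method is needed.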
Related works I – low-to-high inflected languages • Much work exists on translating from high-inflected to low-inflected languages, but very little on the opposite direction, which is considered hard in (Koehn, 2005) • (Yang & Kirchhoff, 2006): Finnish-English, backoff • (Oflazer & Durgar El-Kahlout, 2006, 2007): English-Turkish, word-morpheme translation, then simply concatenating morphemes • All use language-dependent tools and syntactic knowledge: TreeTagger, Snowball stemmer …
Related works II – surface form recovery • (Toutanova et al., 2007, 2008): English-Russian, English-Arabic; translate stem-to-stem; predict inflections from stems using many different features (lexical, morphological, and syntactic) • (Avramidis & Koehn, 2008): English-Greek; use syntax to get the “missing” morphology, depending on the syntactic position; noun case agreement and verb person conjugation • These rely mostly on manually annotated data
Project direction • Use a language-independent tool (Morfessor), based on unannotated data only (i.e. no feature data or syntactic information) • Work on general surface-form recovery • We would like to have a unified view of the translation process, separating low-low, low-high, high-low, and high-high • We are here
Reference I • Chris Dyer, 2007 http://www.ling.umd.edu/~redpony/edinburgh.pdf • Jurafsky, D., & Martin, J. H. (2007). Speech and Language Processing (book) • The Finnish language http://www.cs.tut.fi/~jkorpela/Finnish.html • Yang & Kirchhoff, 2006: Phrase-based backoff models for machine translation of highly inflected languages • Oflazer & Durgar El-Kahlout, 2006: Initial Explorations in English to Turkish Statistical Machine Translation
Reference II • Oflazer & Durgar El-Kahlout, 2007: Exploring different representational units in English-to-Turkish statistical machine translation • Toutanova et al., 2007: Generating complex morphology for machine translation • Toutanova et al., 2008: Applying morphology generation models to machine translation • Avramidis & Koehn, 2008: Enriching morphologically poor languages for statistical machine translation
To be continued … • Thank you !!!