
Machine Translation: Challenges and Approaches



  1. Natural Language Processing. Machine Translation: Challenges and Approaches. Based on a talk by Nizar Habash (http://www.nizarhabash.com/), Associate Research Scientist, Center for Computational Learning Systems, Columbia University, New York.

  2. Google offers translations between many languages: Afrikaans, Albanian, Arabic, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Welsh, Yiddish

  3. WorldMapper: the world as you've never seen it before. List of languages with maps showing worldwide distribution: http://www.worldmapper.org/extraindex/text_language.html (e.g. the map for Arabic, shown on the slide).

  4. “BBC found similar support”!!!

  5. Road Map
  • Multilingual Challenges for MT
  • MT Approaches
  • MT Evaluation

  6. Multilingual Challenges
  • Complex orthography (writing system)
  • Ambiguous spelling, e.g. Arabic vowels omitted: كتب الاولاد اشعارا → كَتَبَ الأوْلادُ اشعَاراً ("the boys wrote poems")
  • Ambiguous word boundaries, e.g. Chinese (see the segmentation sketch below)
  • Lexical ambiguity:
    • bank → بنك (financial) vs. ضفة (river)
    • eat → essen (human) vs. fressen (animal)
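
Since written Chinese has no spaces, even finding the words is a translation-time decision. A minimal sketch of greedy maximum matching against a toy lexicon (the lexicon entries and the max_match helper are illustrative, not from the talk):

```python
# Greedy maximum matching: repeatedly take the longest lexicon entry
# that prefixes the remaining text.
LEXICON = {"研究", "研究生", "生命", "命"}  # research, grad student, life, fate

def max_match(text, lexicon, max_len=4):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:                       # no entry matches: emit one character
            words.append(text[i])
            i += 1
    return words

# "研究生命" here means "to research life", but the greedy matcher
# prefers the longer entry 研究生 (graduate student) and mis-segments:
print(max_match("研究生命", LEXICON))  # ['研究生', '命'], not ['研究', '生命']
```

The mis-segmentation is the point: the string is genuinely ambiguous, and a word-boundary error propagates straight into the translation.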

  7. Multilingual Challenges: morphological complexity and variation
  • Affixation vs. Root+Pattern
  • Tokenization: a "word" can be a whole phrase (the slide decomposes a single token into conjunction + article + noun + plural)

  8. Translation Divergences: conflation
  English: I am not here
  French: Je ne suis pas ici (gloss: I not am not here)
  Arabic: لست هنا (gloss: I-am-not here)
  A single Arabic word, لست, conflates the subject, copula and negation that English spreads over three words ("I am not").

  9. Translation Divergences: head swap and categorial (the syntactic head or the part of speech changes between the two languages)

  10. Corpus resources for training. Need a corpus (e.g. web-as-corpus: BootCaT) and a DICTIONARY, plus other NLP resources. BUT: we really need a PARALLEL corpus, with SOURCE and TARGET sentence-ALIGNED (a minimal reader is sketched below). Some languages have few resources, especially non-European languages: Bengali, Amharic, … WorldMapper: many international languages.
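
In practice a sentence-aligned parallel corpus is usually stored as two line-aligned text files, one per language. A minimal sketch of reading such a pair (the file names and the read_parallel helper are hypothetical):

```python
# Read a sentence-aligned parallel corpus: line i of the source file
# is the translation of line i of the target file.
def read_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        return [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt)]

# e.g. pairs = read_parallel("corpus.es", "corpus.en")
# pairs[0] -> ("Sobre la base de ...", "On the basis of ...")
```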

  11. Road Map
  • Multilingual Challenges for MT
  • MT Approaches
  • MT Evaluation

  12. MT Approaches: the MT Pyramid (Gisting). The pyramid's left side is Analysis (source word → source syntax → source meaning); the right side is Generation (target meaning → target syntax → target word). Gisting crosses at the base: source word straight to target word.

  13. MT Approaches: Gisting Example
  Source: Sobre la base de dichas experiencias se estableció en 1988 una metodología.
  Gisting (word for word): Envelope her basis out speak experiences them settle at 1988 one methodology.
  Human translation: On the basis of these experiences, a methodology was arrived at in 1988.
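
Gisting is essentially per-word dictionary lookup in source order, which is exactly why it fails. A minimal sketch reproducing the slide's output (the one-sense-per-word dictionary is illustrative):

```python
# Word-for-word "gisting": look each word up in a bilingual dictionary,
# keep source order, and hope. No reordering, agreement, or sense choice.
DICT = {"sobre": "envelope", "la": "her", "base": "basis", "de": "out",
        "dichas": "speak", "experiencias": "experiences", "se": "them",
        "estableció": "settle", "en": "at", "1988": "1988",
        "una": "one", "metodología": "methodology"}

def gist(sentence):
    return " ".join(DICT.get(w, w) for w in sentence.lower().split())

print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología"))
# -> envelope her basis out speak experiences them settle at 1988 one methodology
```

Every word gets a translation, but picking the wrong sense of each ambiguous word ("sobre" the noun rather than the preposition) yields word salad.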

  14. MT Approaches: the MT Pyramid (Gisting, Transfer). Transfer crosses one level up: source syntax is mapped to target syntax.

  15. MT Approaches: Transfer Example
  • Transfer lexicon: map SL structure to TL structure
  • poner (:subj X, :obj mantequilla, :mod en (:obj Y)) → butter (:subj X, :obj Y)
  • X puso mantequilla en Y → X buttered Y (a toy rule engine is sketched below)
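
A transfer rule pairs a source-language structure with a target-language one. A minimal sketch of the slide's poner-mantequilla rule, matching flat token patterns instead of real dependency trees (the RULES table and transfer helper are illustrative):

```python
import re

# One transfer rule: Spanish "X puso mantequilla en Y" (X put butter
# on Y) maps to the English denominal verb "X buttered Y". A real
# system matches parse structures (:subj/:obj/:mod), not raw strings,
# and would also translate the X and Y arguments themselves.
RULES = [
    (re.compile(r"(\w+) puso mantequilla en (\w+)"), r"\1 buttered \2"),
]

def transfer(sentence):
    for pattern, template in RULES:
        sentence = pattern.sub(template, sentence)
    return sentence

print(transfer("Juan puso mantequilla en pan"))  # -> Juan buttered pan
```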

  16. MT Approaches: the MT Pyramid (Gisting, Transfer, Interlingua). Interlingua crosses at the apex: source meaning and target meaning coincide in one language-neutral representation.

  17. MT Approaches: Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

  18. MT Approaches: the MT Pyramid (recap). The three routes: gisting (word to word), transfer (syntax to syntax), interlingua (shared meaning).

  19. MT Approaches: the MT Pyramid and its resources. Each crossing level needs its own resources: dictionaries/parallel corpora at the word level, transfer lexicons at the syntax level, interlingual lexicons at the meaning level.

  20. MT Approaches: Statistical vs. Rule-based. Either paradigm can operate at any level of the pyramid; what differs is whether the source-to-target mappings are learned from data or hand-written.

  21. Statistical MT: the Noisy Channel Model (portions from http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf)
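
The noisy channel view: pretend the foreign sentence f is an English sentence e that was "corrupted" in transmission, and recover the most probable e. Since P(f) is fixed for a given input, Bayes' rule splits the problem in two:

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(f \mid e)\,P(e)}{P(f)}
        = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}}\;\underbrace{P(e)}_{\text{language model}}
```

The translation model rewards adequacy (e explains f); the language model rewards fluency (e is good English).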

  22. Statistical MT: Automatic Word Alignment (slide based on Kevin Knight's http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)
  • GIZA++: a statistical machine translation toolkit used to train word alignments
  • Uses Expectation-Maximization with various constraints to bootstrap alignments
  Example pair: Maria no dio una bofetada a la bruja verde ↔ Mary did not slap the green witch

  23. Statistical MT: the IBM Models (word-based; see http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf). A minimal version of the simplest model is sketched below.
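
A minimal sketch of IBM Model 1 training, the simplest word-based model: EM alternates between distributing each source word's alignment probability over the target words it co-occurs with, and re-normalizing the translation table (the three-sentence toy corpus is illustrative; real toolkits such as GIZA++ add NULL words, fertility, distortion, and the higher IBM models):

```python
from collections import defaultdict

# Toy sentence-aligned corpus (Spanish / English), illustrative only.
corpus = [
    ("la casa".split(), "the house".split()),
    ("la casa verde".split(), "the green house".split()),
    ("casa".split(), "house".split()),
]

# Initialize t(f|e) uniformly over the source vocabulary.
src_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(src_vocab))

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # marginal counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: spread f's alignment mass over every e it co-occurs with
            z = sum(t[f, e] for e in es)
            for e in es:
                count[f, e] += t[f, e] / z
                total[e] += t[f, e] / z
    for (f, e), c in count.items():      # M-step: re-normalize t(f|e)
        t[f, e] = c / total[e]

print(round(t["casa", "house"], 2))      # climbs toward 1.0
print(round(t["verde", "green"], 2))     # climbs toward 1.0
```

Even with no seed dictionary, co-occurrence alone pins "casa" to "house" (sentence 3), which in turn disambiguates "la" and "verde": that bootstrapping is exactly what the GIZA++ slide describes.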

  24. Phrase-Based Statistical MT (slide courtesy of Kevin Knight, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)
  Example: Morgen fliege ich nach Kanada zur Konferenz → Tomorrow I will fly to the conference in Canada
  • Foreign input is segmented into phrases; a "phrase" is any sequence of words
  • Each phrase is probabilistically translated into English, e.g. P(to the conference | zur Konferenz), P(into the meeting | zur Konferenz)
  • Phrases are probabilistically re-ordered (a toy scorer is sketched below)
  See [Koehn et al., 2003] for an intro. This is state-of-the-art!
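
A minimal sketch of how one candidate translation is scored against a phrase table: sum the log-probabilities of the chosen phrase pairs (the table entries and values are invented for illustration; a real decoder also searches over segmentations and reorderings and adds language-model and distortion scores):

```python
import math

# Toy phrase table: P(english phrase | german phrase), invented values.
PHRASE_TABLE = {
    ("morgen", "tomorrow"): 0.9,
    ("fliege ich", "i will fly"): 0.4,
    ("nach kanada", "in canada"): 0.5,
    ("zur konferenz", "to the conference"): 0.6,
    ("zur konferenz", "into the meeting"): 0.2,
}

def score(phrase_pairs):
    """Log-probability of one segmentation plus phrase choice."""
    return sum(math.log(PHRASE_TABLE[p]) for p in phrase_pairs)

candidate = [("morgen", "tomorrow"), ("fliege ich", "i will fly"),
             ("nach kanada", "in canada"),
             ("zur konferenz", "to the conference")]
print(score(candidate))   # higher (less negative) beats rival candidates
```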

  25. Word Alignment Induced Phrases (slide courtesy of Kevin Knight, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)
  Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch
  Word-level pairs: (Maria, Mary), (no, did not), (dió una bofetada, slap), (la, the), (bruja, witch), (verde, green)

  26. Word Alignment Induced Phrases (continued)
  Adding: (a la, the), (dió una bofetada a, slap the)

  27. Word Alignment Induced Phrases (continued)
  Adding: (Maria no, Mary did not), (no dió una bofetada, did not slap), (dió una bofetada a la, slap the), (bruja verde, green witch)

  28. Word Alignment Induced Phrases (continued)
  Adding: (Maria no dió una bofetada, Mary did not slap), (a la bruja verde, the green witch), …

  29. Word Alignment Induced Phrases (continued)
  Finally, the whole sentence is itself a phrase pair: (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch). The extraction criterion behind slides 25-29 is sketched below.
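
These pairs all come from one criterion: a source span and a target span form a phrase pair when the box around them contains at least one alignment link and no link escapes it. A minimal sketch (the links are hand-coded to one plausible alignment of the example; standard extractors such as the one in Moses also extend pairs over unaligned boundary words):

```python
# Extract all phrase pairs consistent with a word alignment.
def extract_phrases(src, tgt, links, max_len=9):
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(len(src), i1 + max_len)):
            # target positions linked to the source span [i1, i2]
            ts = [t for s, t in links if i1 <= s <= i2]
            if not ts:
                continue
            j1, j2 = min(ts), max(ts)
            # consistent iff every link into [j1, j2] comes from [i1, i2]
            if all(i1 <= s <= i2 for s, t in links if j1 <= t <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]),
                           " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# (source index, target index) links; "a" (index 5) left unaligned
links = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
         (6, 4), (7, 6), (8, 5)]
for pair in sorted(extract_phrases(src, tgt, links)):
    print(pair)
# includes ('Maria', 'Mary'), ('no dió una bofetada', 'did not slap'),
# ('bruja verde', 'green witch'), ... up to the full sentence pair
```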

  30. Advantages of Phrase-Based SMT (slide courtesy of Kevin Knight, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)
  • Many-to-many mappings can handle non-compositional phrases
  • Local context is very useful for disambiguating: "interest rate" → …, "interest in" → …
  • The more data, the longer the learned phrases; sometimes whole sentences

  31. MT Approaches: Statistical vs. Rule-based vs. Hybrid (the MT pyramid again; hybrid systems mix learned and hand-written mappings across its levels)

  32. MT Approaches: Practical Considerations
  • Resource availability
    • Parsers and generators: input/output compatibility
    • Translation lexicons: word-based vs. transfer/interlingua
    • Parallel corpora: domain of interest; bigger is better
  • Time availability: statistical training, resource building

  33. Road Map
  • Multilingual Challenges for MT
  • MT Approaches
  • MT Evaluation

  34. MT Evaluation
  • More art than science
  • Wide range of metrics/techniques: interface, …, scalability, …, faithfulness, …, space/time complexity, … etc.
  • Automatic vs. human-based: dumb machines vs. slow humans

  35. Human-based Evaluation Example: Accuracy Criteria (rating table shown on the slide)

  36. Human-based Evaluation Example: Fluency Criteria (rating table shown on the slide)

  37. Automatic Evaluation Example: the BLEU Metric (Papineni et al., 2001)
  • BLEU: BiLingual Evaluation Understudy
  • Modified n-gram precision with a length penalty (formula below)
  • Quick, inexpensive and language-independent
  • Correlates highly with human evaluation
  • Biased against synonyms and inflectional variations
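
Written out (standard BLEU with uniform n-gram weights; the worked example on slide 40 uses N = 2 and can skip the brevity penalty because the candidate is not shorter than the closest reference):

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \Big(\prod_{n=1}^{N} p_n\Big)^{1/N},
\qquad
\mathrm{BP} = \min\!\big(1,\; e^{\,1 - r/c}\big)
```

Here p_n is the modified (clipped) n-gram precision, c is the candidate length, and r is the effective reference length.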

  38. Automatic Evaluation Example: BLEU Metric
  Test sentence: colorless green ideas sleep furiously
  Gold standard references:
  1. all dull jade ideas sleep irately
  2. drab emerald concepts sleep furiously
  3. colorless immature thoughts nap angrily

  39. Automatic Evaluation Example: BLEU Metric
  Test sentence: colorless green ideas sleep furiously
  Matched unigrams: "ideas" and "sleep" (reference 1), "furiously" (reference 2), "colorless" (reference 3); only "green" matches nothing.
  Unigram precision = 4/5

  40. Automatic Evaluation Example: BLEU Metric
  Test sentence: colorless green ideas sleep furiously
  Unigram precision = 4 / 5 = 0.8
  Matched bigrams: "ideas sleep" (reference 1) and "sleep furiously" (reference 2), out of 4 test bigrams
  Bigram precision = 2 / 4 = 0.5
  BLEU score = (p1 × p2 × … × pn)^(1/n) = (0.8 × 0.5)^(1/2) = 0.6325, i.e. 63.25% (a code check follows)
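
A minimal sketch reproducing the slide's numbers, with clipped n-gram counts and a uniform geometric mean (the brevity penalty is omitted, as on the slide):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngrams(candidate, n)
    # clip each n-gram count by its maximum count in any single reference
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref, n).items():
            max_ref[g] = max(max_ref[g], c)
    return sum(min(c, max_ref[g]) for g, c in cand.items()) / sum(cand.values())

test = "colorless green ideas sleep furiously".split()
refs = ["all dull jade ideas sleep irately".split(),
        "drab emerald concepts sleep furiously".split(),
        "colorless immature thoughts nap angrily".split()]

p1 = modified_precision(test, refs, 1)   # 4/5 = 0.8
p2 = modified_precision(test, refs, 2)   # 2/4 = 0.5
print(math.sqrt(p1 * p2))                # 0.6324... i.e. 63.25%
```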

  41. Summary
  • Multilingual challenges for MT: different writing systems, morphology, segmentation, word order; need parallel corpora, dictionaries, NLP tools
  • MT approaches: statistical vs. rule-based
    • Gisting: word for word
    • Transfer: source-to-target structure mapping
    • Interlingua: map to/from a "semantic representation"
  • MT evaluation: accuracy, fluency; IBM's BLEU method counts n-gram overlaps with reference (human) translations

  42. Interested in MT?
  • Contact habash@cs.columbia.edu
  • Research courses, projects
  • Languages of interest: English, Arabic, Hebrew, Chinese, Urdu, Spanish, Russian, …
  • Topics:
    • Statistical, hybrid MT
    • Phrase-based MT with linguistic extensions
    • Component improvements or full-system improvements
    • MT evaluation
    • Multilingual computing
