
Machine Translation: Challenges and Approaches



Presentation Transcript


  1. Invited Lecture, CS 4705: Introduction to Natural Language Processing, Fall 2004. Machine Translation: Challenges and Approaches. Nizar Habash, Post-doctoral Fellow, Center for Computational Learning Systems, Columbia University

  2. Sounds like Faulkner? Faulkner vs. Machine Translation. http://www.ee.ucla.edu/~simkin/sounds_like_faulkner.html

  3. Progress in MT: Statistical MT example. From a talk by Charles Wayne, DARPA

  4. Road Map • Why Machine Translation (MT)? • Multilingual Challenges for MT • MT Approaches • MT Evaluation

  5. Why (Machine) Translation? • Languages in the world • 6,800 living languages • 600 with written tradition • 95% of world population speaks 100 languages • Translation Market • $8 Billion Global Market • Doubling every five years • (Donald Barabé, invited talk, MT Summit 2003)

  6. Why Machine Translation? • Full Translation • Domain specific • Weather reports • Machine-aided Translation • Translation dictionaries • Translation memories • Requires post-editing • Cross-lingual NLP applications • Cross-language IR • Cross-language Summarization

  7. Road Map • Why Machine Translation (MT)? • Multilingual Challenges for MT • Orthographic variations • Lexical ambiguity • Morphological variations • Translation divergences • MT Paradigms • MT Evaluation

  8. Multilingual Challenges • Orthographic Variations • Ambiguous spelling: كتب الاولاد اشعارا vs. كَتَبَ الأوْلادُ اشعَاراً • Ambiguous word boundaries • Lexical Ambiguity • Bank → بنك (financial) vs. ضفة (river) • Eat → essen (human) vs. fressen (animal)

  9. Multilingual Challenges: Morphological Variations • Affixation vs. Root+Pattern • Tokenization • [Figure: a word segmented into conjunction, article, noun, and plural marker]

  10. Multilingual Challenges: Translation Divergences • How languages map semantics to syntax • 35% of sentences in TREC El Norte Corpus (Dorr et al 2002) • Divergence Types • Categorial (X tener hambre → X be hungry) [98%] • Conflational (X dar puñaladas a Z → X stab Z) [83%] • Structural (X entrar en Y → X enter Y) [35%] • Head Swapping (X cruzar Y nadando → X swim across Y) [8%] • Thematic (X gustar a Y → Y like X) [6%]

  11. Translation Divergences: conflation • English: I am not here • French: Je ne suis pas ici (I not-be-not here) • Arabic: لست هنا (I-am-not here)

  12. Translation Divergences: categorial, thematic and structural • English: I am cold • Spanish: tengo frio (I-have cold) • Hebrew: קר לי (cold for-me) • Arabic: انا بردان (I cold)

  13. Translation Divergences: head swap and categorial • English: I swam across the river quickly • Arabic: اسرعت عبور النهر سباحة (I-sped crossing the-river swimming)

  14. Translation Divergences: head swap and categorial • English: I swam across the river quickly • Hebrew: חציתי את הנהר בשחיה במהירות (I-crossed obj river in-swim speedily)

  15. Translation Divergences: head swap and categorial • [Figure: dependency trees for the English, Arabic, and Hebrew sentences above, with part-of-speech labels (verb, noun, prep, adverb) showing how the same content changes category across languages]

  16. Translation Divergences: Orthography+Morphology+Syntax • English: mom's car (car possessed-by mom) • Chinese: 妈妈的车 (mama de che) • Arabic: سيارة ماما (sayyArat mama) • French: la voiture de maman

  17. Road Map • Why Machine Translation (MT)? • Multilingual Challenges for MT • MT Approaches • Gisting / Transfer / Interlingua • Statistical / Symbolic / Hybrid • Practical Considerations • MT Evaluation

  18. MT Approaches: MT Pyramid • [Figure: the MT pyramid, with source word, source syntax, and source meaning on the analysis side and target word, target syntax, and target meaning on the generation side; Gisting marked at the word level]

  19. MT Approaches: Gisting Example • Source (Spanish): Sobre la base de dichas experiencias se estableció en 1988 una metodología. • Gisting (word-for-word): Envelope her basis out speak experiences them settle at 1988 one methodology. • Human translation: On the basis of these experiences, a methodology was arrived at in 1988.

  20. MT Approaches: MT Pyramid • [Figure: the same MT pyramid, now with Gisting (word level) and Transfer (syntax level) marked]

  21. MT Approaches: Transfer Example • Transfer Lexicon: map SL structure to TL structure • SL pattern: poner [:subj X] [:obj mantequilla] [:mod en [:obj Y]] → TL pattern: butter [:subj X] [:obj Y] • Example: X puso mantequilla en Y → X buttered Y
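A minimal sketch (not from the lecture) of how such a transfer-lexicon entry could be represented and applied in code. The dictionary-based rule format and the apply_transfer helper are illustrative assumptions, not the formalism shown on the slide.

```python
# One hypothetical transfer-lexicon entry: a source-language (SL) tree pattern
# paired with a target-language (TL) tree template. "X" and "Y" are variables.
RULE = {
    "sl": {"head": "poner", "obj": "mantequilla", "mod": ("en", "Y"), "subj": "X"},
    "tl": {"head": "butter", "subj": "X", "obj": "Y"},
}

def apply_transfer(sl_tree, rule):
    """Return the TL structure if the SL tree matches the rule's pattern, else None."""
    pat = rule["sl"]
    if (sl_tree.get("head") == pat["head"]
            and sl_tree.get("obj") == pat["obj"]
            and sl_tree.get("mod", (None,))[0] == pat["mod"][0]):
        bindings = {"X": sl_tree["subj"], "Y": sl_tree["mod"][1]}
        return {k: bindings.get(v, v) for k, v in rule["tl"].items()}
    return None

# "Maria puso mantequilla en el pan" -> {'head': 'butter', 'subj': 'Maria', 'obj': 'pan'}
print(apply_transfer(
    {"head": "poner", "subj": "Maria", "obj": "mantequilla", "mod": ("en", "pan")},
    RULE,
))
```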

  22. MT Approaches: MT Pyramid • [Figure: the same MT pyramid, now with Gisting, Transfer, and Interlingua (meaning level) marked]

  23. MT Approaches: Interlingua Example, Lexical Conceptual Structure (Dorr, 1993)

  24. MT Approaches: MT Pyramid • [Figure: the complete MT pyramid with Gisting, Transfer, and Interlingua marked]

  25. MT Approaches: MT Pyramid • [Figure: the MT pyramid annotated with the resources needed at each level: Dictionaries/Parallel Corpora (word level), Transfer Lexicons (syntax level), Interlingual Lexicons (meaning level)]

  26. MT Approaches: Statistical vs. Symbolic • [Figure: the MT pyramid (source/target word, syntax, meaning) indicating where statistical and symbolic approaches operate]

  27. MT Approaches: Noisy Channel Model • Portions from http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
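The slide's figure is not reproduced here, but the noisy-channel idea it presents is standard: choose the target sentence e maximizing P(e) · P(f|e), combining a target language model with a translation model. Below is a minimal sketch of that argmax; the toy lm/tm score tables are made-up illustrations, not values from the lecture.

```python
def noisy_channel_best(source, candidates, lm_logprob, tm_logprob):
    """Pick the candidate e maximizing log P(e) + log P(f|e).

    lm_logprob(e) scores target-language fluency; tm_logprob(f, e) scores
    faithfulness of e to the source sentence f. Both return log-probabilities.
    """
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(source, e))

# Made-up log-probabilities for the "I am not here" example from slide 11.
lm = {"I am not here": -5.0, "I not here": -9.0}
tm = {("لست هنا", "I am not here"): -2.5, ("لست هنا", "I not here"): -1.8}

best = noisy_channel_best(
    "لست هنا",
    list(lm),
    lm_logprob=lambda e: lm[e],
    tm_logprob=lambda f, e: tm[(f, e)],
)
print(best)  # "I am not here": the fluent candidate wins overall (-7.5 vs. -10.8)
```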

  28. MT Approaches: IBM Model (Word-based Model) • http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
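As a rough companion to the word-based IBM models the slide points to, here is a simplified EM estimator in the spirit of IBM Model 1 (no NULL word, fertility, or distortion), which learns word translation probabilities t(f|e) from a sentence-aligned corpus. The toy corpus and function names are illustrative assumptions, not material from the lecture.

```python
from collections import defaultdict

def ibm_model1(parallel_corpus, iterations=10):
    """Estimate t(f|e) with EM over a list of (source_words, target_words) pairs."""
    # Initialize t(f|e) uniformly over the source vocabulary.
    f_vocab = {f for fs, _ in parallel_corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # t[(f, e)]

    for _ in range(iterations):
        count = defaultdict(float)   # expected count(f, e)
        total = defaultdict(float)   # expected count(e)
        for fs, es in parallel_corpus:
            for f in fs:
                # Distribute f's alignment probability over the target sentence.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():   # M-step: renormalize
            t[(f, e)] = c / total[e]
    return t

# Tiny toy corpus; EM drives t("casa"|"house") up, making "casa" the most
# likely source word for "house".
corpus = [(["la", "casa"], ["the", "house"]),
          (["la", "flor"], ["the", "flower"]),
          (["casa", "verde"], ["green", "house"])]
t = ibm_model1(corpus)
print(round(t[("casa", "house")], 2))
```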

  29. MT Approaches: Statistical vs. Symbolic vs. Hybrid • [Figure: the MT pyramid indicating where statistical, symbolic, and hybrid approaches operate]

  30. MT Approaches: Statistical vs. Symbolic vs. Hybrid • [Figure: the MT pyramid indicating where statistical, symbolic, and hybrid approaches operate]

  31. MT Approaches: Hybrid Example (GHMT) • Generation-Heavy Hybrid Machine Translation • Lexical transfer but NO structural transfer • Example: Maria puso la mantequilla en el pan. • Lexical alternatives: poner → {lay, locate, place, put, render, set, stand}; mantequilla → {butter, bilberry}; en → {on, in, into, at}; pan → {bread, loaf}; Maria → Maria

  32. MT Approaches: Hybrid Example (GHMT) • LCS-driven Expansion • Conflation Example: [CAUSE GO] PUT_V (Agent MARIA, Theme BUTTER_N, Goal BREAD) → [CAUSE GO] BUTTER_V (Agent MARIA, Goal BREAD), via Categorial Variation

  33. MT Approaches: Hybrid Example (GHMT) • Structural Overgeneration • [Figure: many overgenerated candidate structures combining put/lay/render/butter with on/into/at, Maria, and bread/loaf, …]

  34. MT Approaches: Hybrid Example (GHMT), Target Statistical Resources • Structural N-gram Model • Long-distance • Lexemes • Surface N-gram Model • Local • Surface-forms • Example: John bought a red car • [Figure: dependency structure buy(John, car(a, red)) over which structural n-grams are counted]

  35. MT Approaches: Hybrid Example (GHMT), Linearization & Ranking • Maria buttered the bread (-47.0841) • Maria butters the bread (-47.2994) • Maria breaded the butter (-48.7334) • Maria breads the butter (-48.835) • Maria buttered the loaf (-51.3784) • Maria butters the loaf (-51.5937) • Maria put the butter on bread (-54.128)
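A rough sketch of the surface n-gram ranking step these two slides describe: overgenerated candidates are scored by a target-language n-gram model and sorted by log-probability. The tiny add-one-smoothed bigram model and its training sentences below are made-up stand-ins; the scores on the slide come from a much larger model.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Add-one-smoothed bigram model over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])               # context counts
        bigrams.update(zip(toks[:-1], toks[1:])) # bigram counts
    vocab = len(set(unigrams) | {"</s>"})
    def logprob(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks[:-1], toks[1:]))
    return logprob

# Hypothetical target-language data; a real system would train on a large corpus.
lm = train_bigram_lm(["Maria buttered the bread", "she buttered the bread",
                      "Maria ate the bread"])
candidates = ["Maria buttered the bread", "Maria breaded the butter",
              "Maria put the butter on bread"]
for score, cand in sorted(((lm(c), c) for c in candidates), reverse=True):
    print(f"{score:8.3f}  {cand}")  # candidates the LM prefers rank first
```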

  36. MT Approaches: Practical Considerations • Resource Availability • Parsers and Generators • Input/Output compatibility • Translation Lexicons • Word-based vs. Transfer/Interlingua • Parallel Corpora • Domain of interest • Bigger is better • Time Availability • Statistical training, resource building

  37. MT Approaches: Resource Poverty • No Parser? • No Translation Dictionary? • Parallel Corpus • Align with rich language • Extract dictionary • Parse rich side • Infer parses • Build a statistical parser

  38. Road Map • Why Machine Translation (MT)? • Multilingual Challenges for MT • MT Approaches • MT Evaluation

  39. MT Evaluation • More art than science • Wide range of Metrics/Techniques • interface, …, scalability, …, faithfulness, ... space/time complexity, … etc. • Automatic vs. Human-based • Dumb Machines vs. Slow Humans

  40. MT Evaluation Metrics (Church and Hovy 1993) • System-based Metrics: count internal resources (size of lexicon, number of grammar rules, etc.) • easy to measure • not comparable across systems • not necessarily related to utility

  41. MT Evaluation Metrics • Text-based Metrics • Sentence-based Metrics • Quality: Accuracy, Fluency, Coherence, etc. • 3-point scale to 100-point scale • Comprehensibility Metrics • Comprehension, Informativeness, • x-point scales, questionnaires • most related to utility • hard to measure

  42. MT Evaluation Metrics • Text-based Metrics (cont’d) • Amount of Post-Editing • number of keystrokes per page • not necessarily related to utility • Cost-based Metrics • Cost per page • Time per page

  43. Human-based Evaluation Example: Accuracy Criteria

  44. Human-based Evaluation Example: Fluency Criteria

  45. Fluency vs. Accuracy • [Figure: plot of fluency against accuracy locating FAHQ MT (high on both), Prof. MT, Info. MT, and con. MT]

  46. Automatic Evaluation Example: Bleu Metric • Bleu: BiLingual Evaluation Understudy (Papineni et al 2001) • Modified n-gram precision with length penalty • Quick, inexpensive and language independent • Correlates highly with human evaluation • Bias against synonyms and inflectional variations

  47. Automatic Evaluation Example: Bleu Metric • Test Sentence: colorless green ideas sleep furiously • Gold Standard References: • all dull jade ideas sleep irately • drab emerald concepts sleep furiously • colorless immature thoughts nap angrily

  48. Automatic Evaluation Example: Bleu Metric • Test Sentence: colorless green ideas sleep furiously • References: all dull jade ideas sleep irately / drab emerald concepts sleep furiously / colorless immature thoughts nap angrily • Matching unigrams: colorless, ideas, sleep, furiously • Unigram precision = 4/5

  49. Automatic Evaluation Example: Bleu Metric • Test Sentence: colorless green ideas sleep furiously • References: all dull jade ideas sleep irately / drab emerald concepts sleep furiously / colorless immature thoughts nap angrily • Unigram precision = 4/5 = 0.8 • Bigram precision (matching bigrams: ideas sleep, sleep furiously) = 2/4 = 0.5 • Bleu Score = (p1 × p2 × … × pn)^(1/n) = (0.8 × 0.5)^(1/2) = 0.6325, i.e. 63.25
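A short sketch reproducing the arithmetic on slides 47-49: modified n-gram precision clips each candidate n-gram count by its maximum count in any single reference, and the score here is the geometric mean of unigram and bigram precision (the brevity penalty is 1 in this example, since the candidate is no shorter than the closest reference). The helper names are my own; this is not the reference Bleu implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def modified_precision(candidate, references, n):
    """Candidate n-gram counts, clipped by the max count in any single reference."""
    cand = ngrams(candidate.split(), n)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref.split(), n).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

test = "colorless green ideas sleep furiously"
refs = ["all dull jade ideas sleep irately",
        "drab emerald concepts sleep furiously",
        "colorless immature thoughts nap angrily"]

p1 = modified_precision(test, refs, 1)              # 4/5 = 0.8
p2 = modified_precision(test, refs, 2)              # 2/4 = 0.5
bleu = math.exp((math.log(p1) + math.log(p2)) / 2)  # geometric mean; BP = 1 here
print(p1, p2, round(bleu, 4))                       # 0.8 0.5 0.6325
```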
