

A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
Minh-Thang Luong, Preslav Nakov & Min-Yen Kan
EMNLP 2010, Massachusetts, USA



Presentation Transcript


  1. A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
Minh-Thang Luong, Preslav Nakov & Min-Yen Kan
EMNLP 2010, Massachusetts, USA

  2. lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas
This is a Finnish word!
lento+ kone+ suihku+ turbiini+ moottori+ apu+ mekaanikko+ ali+ upseeri+ oppilas
(flight + machine + shower + turbine + engine + assistance + mechanic + under + officer + student)
= "technical warrant officer trainee specialized in aircraft jet engines", used in the Finnish Air Force
• Highly inflected languages (Arabic, Basque, Turkish, Russian, Hungarian, etc.)
• Extensive use of affixes
• Huge number of distinct word forms
Highly inflected languages are hard for MT  analysis at the morpheme level is sensible.

  3. Morphological Analysis Helps
auto+ si = your car
auto+ i+ si = your car+ s
auto+ i+ ssa+ si = in your car+ s
auto+ i+ ssa+ si+ ko = in your car+ s?
• Morpheme representation alleviates data sparseness
  • Arabic  English (Lee, 2004)
  • Czech  English (Goldwater & McClosky, 2005)
  • Finnish  English (Yang & Kirchhoff, 2006)
• Word representation captures context better for large corpora
  • Arabic  English (Sadat & Habash, 2006)
  • Finnish  English (de Gispert et al., 2009)
Our approach: the basic unit of translation is the morpheme, but word boundaries are respected at all stages.

  4. Translation into Morphologically Rich Languages
• A challenging translation direction
• Recent interest in morphologically rich languages:
  • Arabic (Badr et al., 2008) and Turkish (Oflazer and El-Kahlout, 2007)  enhance performance only for small bi-texts
  • Greek (Avramidis and Koehn, 2008) and Russian (Toutanova et al., 2008)  rely heavily on language-specific tools
We are interested in an unsupervised approach, and we experiment with translation on a large dataset.

  5. Methodology
Pipeline: 1) morphological analysis (words  morphemes)  alignment training  2) phrase extraction (alignment  phrase pairs  phrase table)  3) MERT tuning  4) decoding  5) translation model enrichment (PT scoring)
• Morphological Analysis: unsupervised
• Morphological Enhancements: respect word boundaries
• Translation Model Enrichment: merge phrase tables (PTs)

  6. Morphological Analysis
• Use Morfessor (Creutz and Lagus, 2007), an unsupervised morphological analyzer
• Segments words  morphemes (PRE, STM, SUF): un/PRE+ care/STM+ ful/SUF+ ly/SUF
• The "+" sign is used to enforce word boundary constraints later
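The marker notation above can be rendered mechanically. A minimal sketch (this is my own illustration of the notation, not Morfessor's actual API; `to_marked` is a hypothetical helper): every morpheme carries its tag, and every morpheme except the word-final one carries a trailing "+".

```python
def to_marked(morphemes):
    """Render one word's segmentation in the slide's marker notation.

    morphemes: list of (surface, tag) pairs, tag in {"PRE", "STM", "SUF"}.
    A trailing "+" means "this word continues"; the last morpheme has none.
    """
    out = []
    for i, (surface, tag) in enumerate(morphemes):
        plus = "+" if i < len(morphemes) - 1 else ""
        out.append(f"{surface}/{tag}{plus}")
    return " ".join(out)

print(to_marked([("un", "PRE"), ("care", "STM"), ("ful", "SUF"), ("ly", "SUF")]))
# un/PRE+ care/STM+ ful/SUF+ ly/SUF
```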

  7. Word Boundary-aware Phrase Extraction
• Typical SMT: maximum phrase length n = 7 words
• Problem: morpheme phrases
  • of length n can span fewer than n words
  • may only partially span words  risk of proposing non-words
   severe for morphologically rich languages
• Solution: morpheme phrases must
  • span n words or less
  • span a sequence of whole words (avoid proposing non-words)
Example:
SRC = the/STM new/STM , un/PRE+ democratic/STM immigration/STM policy/STM
TGT = uusi/STM , epä/PRE+ demokraat/STM+ t/SUF+ i/SUF+ s/SUF+ en/SUF maahanmuutto/PRE+ politiikan/STM
(uusi = new, epädemokraattisen = undemocratic, maahanmuuttopolitiikan = immigration policy)
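The boundary constraint is easy to state over the "+" markers. A hedged sketch (my reconstruction of the rule as described on the slide, not the authors' code; `is_valid_span` is a hypothetical name): a morpheme span is legal only if it starts and ends at word boundaries and covers at most n whole words.

```python
def is_valid_span(tokens, i, j, max_words=7):
    """Check the boundary-aware constraint for the span tokens[i:j].

    tokens: morpheme tokens such as 'epä/PRE+', where a trailing '+'
    means the word continues into the next token.
    """
    if i > 0 and tokens[i - 1].endswith("+"):
        return False                       # starts in the middle of a word
    if tokens[j - 1].endswith("+"):
        return False                       # ends in the middle of a word
    # every token WITHOUT a trailing '+' closes exactly one word
    n_words = sum(1 for t in tokens[i:j] if not t.endswith("+"))
    return n_words <= max_words

toks = ["uusi/STM", ",", "epä/PRE+", "demokraat/STM+", "t/SUF+",
        "i/SUF+", "s/SUF+", "en/SUF", "maahanmuutto/PRE+", "politiikan/STM"]
print(is_valid_span(toks, 2, 8))   # whole word epädemokraattisen
```

With the example sentence from the slide, the span covering all six morphemes of "epädemokraattisen" is accepted, while any prefix of it that stops mid-word is rejected.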

  8. Morpheme MERT Optimizing Word BLEU
• Why?
  • The BLEU brevity penalty is influenced by sentence length
  • The same number of words can span different numbers of morphemes
    • e.g., a 6-morpheme Finnish word: epä/PRE+ demokraat/STM+ t/SUF+ i/SUF+ s/SUF+ en/SUF
   suboptimal weight for the word penalty feature function
• Solution: for each iteration of MERT,
  • decode at the morpheme level
  • convert the morpheme translation  word sequence
  • compute word BLEU
  • convert back to morphemes
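The morpheme-to-word conversion used during tuning follows directly from the markers. A minimal sketch (my own illustration of the conversion step, not the released tuning script; `morphemes_to_words` is a hypothetical name):

```python
def morphemes_to_words(tokens):
    """Rejoin a morpheme-level hypothesis into words.

    A trailing '+' glues a token to the next one; tags like '/PRE'
    are stripped from the surface form.
    """
    words, buf = [], ""
    for tok in tokens:
        continues = tok.endswith("+")
        surface = tok.rstrip("+").rsplit("/", 1)[0]   # drop '+' and '/TAG'
        buf += surface
        if not continues:
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)   # tolerate a dangling '+' at hypothesis end
    return words

print(morphemes_to_words(["epä/PRE+", "demokraat/STM+", "t/SUF+",
                          "i/SUF+", "s/SUF+", "en/SUF"]))
# ['epädemokraattisen']
```

Word BLEU is then computed on the rejoined sequence, which keeps the brevity penalty and the word-penalty feature weight calibrated to words rather than morphemes.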

  9. Decoding with Twin Language Models
• Morpheme language model (LM)
  • Pros: alleviates data sparseness
  • Cons: phrases span fewer words
• Introduce a second LM at the word level
  • Log-linear model: add a separate feature
  • Moses decoder: add a word-level "view" on the morpheme-level hypotheses
Current hypothesis: uusi/STM , epä/PRE+ demokraat/STM+ t/SUF+ i/SUF+ s/SUF+ en/SUF maahanmuutto/PRE+ politiikan/STM

  10. (animation) As the hypothesis is extended, the morpheme LM scores: "s/SUF+ en/SUF maahanmuutto/PRE+"

  11. (animation) Morpheme LM scores: "s/SUF+ en/SUF maahanmuutto/PRE+" ; "en/SUF maahanmuutto/PRE+ politiikan/STM"

  12. (animation) Word LM scores the completed words: uusi , epädemokraattisen maahanmuuttopolitiikan

  13. This is: (1) different from scoring with two word-level LMs, and (2) superior to n-best rescoring.
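The twin-LM combination can be sketched as a toy log-linear scorer. This is an assumption-laden simplification, not Moses' feature implementation: `morph_logprob` and `word_logprob` are hypothetical stand-ins for real n-gram LMs (unigram callables here), and the word-level "view" is rebuilt inline from the "+" markers.

```python
def twin_lm_score(tokens, morph_logprob, word_logprob, w_m=1.0, w_w=1.0):
    """Toy log-linear score: morpheme-LM feature + word-LM feature.

    tokens: morpheme tokens like 'epä/PRE+' ('+' = word continues).
    morph_logprob / word_logprob: callables returning log-probabilities.
    """
    score = sum(w_m * morph_logprob(t) for t in tokens)
    # rebuild the word-level "view": '+' glues a morpheme to the next one
    words, buf = [], ""
    for t in tokens:
        buf += t.rstrip("+").rsplit("/", 1)[0]
        if not t.endswith("+"):
            words.append(buf)
            buf = ""
    score += sum(w_w * word_logprob(w) for w in words)
    return score

# one 3-morpheme word: 3 morpheme-LM terms + 1 word-LM term
print(twin_lm_score(["epä/PRE+", "demokraat/STM+", "en/SUF"],
                    lambda t: -1.0, lambda w: -2.0))
# -5.0
```

The key design point from the slide survives even in this toy: the word LM only ever sees completed words, so partial words at the hypothesis frontier contribute no word-level score until they are closed.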

  14. Building Twin Translation Models
From the same source, we generate two translation models:
• Word path: parallel corpora  GIZA++ word alignment  phrase extraction  PTw (e.g., problems ||| vaikeuksista); morphological segmentation of PTw then yields PTwm
• Morpheme path: morphological segmentation  GIZA++ morpheme alignment  phrase extraction  PTm
• Example morpheme-level pairs: problem/STM+ s/SUF ||| ongelma/STM+ t/SUF and problem/STM+ s/SUF ||| vaikeu/STM+ ksi/SUF+ sta/SUF
• The two tables are combined by PT merging before decoding

  15. Phrase Table (PT) Merging
• Add-feature methods (e.g., Chen et al., 2009): heuristic-driven
• Interpolation-based methods (e.g., Wu & Wang, 2007): linear interpolation of phrase and lexicalized translation probabilities
   designed for two PTs that originate from different sources
• We take into account the fact that our twin translation models are of equal quality
Example entries (phrase translation probabilities, lexicalized translation probabilities, phrase penalty):
problem/STM+ s/SUF ||| ongelma/STM+ t/SUF ||| 0.37 0.60 0.11 0.14 2.7
problem/STM+ s/SUF ||| vaikeu/STM+ ksi/SUF+ sta/SUF ||| 0.07 0.11 0.01 0.01 2.7

  16. Our Method: Phrase Translation Probabilities
• Preserve the normalized ML estimations (Koehn et al., 2003)
• Use the raw counts of both models (PTwm and PTm), i.e. the number of times each phrase pair was extracted from the training dataset, to compute the merged probabilities
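A minimal sketch of that count-based merge, under my reading of the slide (pool the raw extraction counts of the two tables, then renormalize per source phrase in the Koehn et al. (2003) ML style); this is an illustration, not the authors' released code, and `merge_phrase_tables` is a hypothetical name:

```python
from collections import defaultdict

def merge_phrase_tables(counts_m, counts_wm):
    """Pool raw extraction counts from PTm and PTw->m, then renormalize.

    counts_m, counts_wm: dicts mapping (src, tgt) phrase pairs to the
    number of times the pair was extracted from the training data.
    Returns phi(tgt | src) as a dict over the union of pairs.
    """
    pooled = defaultdict(float)
    for table in (counts_m, counts_wm):
        for (src, tgt), c in table.items():
            pooled[(src, tgt)] += c
    totals = defaultdict(float)
    for (src, tgt), c in pooled.items():
        totals[src] += c
    return {(s, t): c / totals[s] for (s, t), c in pooled.items()}

pm  = {("problem+s", "ongelma+t"): 3.0}
pwm = {("problem+s", "ongelma+t"): 1.0, ("problem+s", "vaikeu+ksista"): 1.0}
print(merge_phrase_tables(pm, pwm)[("problem+s", "ongelma+t")])  # 0.8
```

Pooling counts rather than interpolating probabilities is what lets both tables contribute in proportion to how often they actually saw each pair.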

  17. Our Method: Lexicalized Translation Probabilities
• Use linear interpolation
• What happens if a phrase pair belongs to only one PT?
  • Previous methods: interpolate with 0
  •  might cause some good phrases to be penalized
• Our method: induce all scores before interpolation
  •  use the lexical model of one PT to score the phrase pairs of the other
Example: the pair problem/STM+ s/SUF ||| ongelma/STM+ t/SUF can be scored by the morpheme lexical model (PTm) and also, after rejoining to words (problems ||| ongelmat), by the word lexical model (PTw).
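The "induce before interpolating" idea can be sketched as follows. This is my own illustration of the slide's rule, not the authors' implementation: `pt_m` and `pt_wm` map phrase pairs to their lexicalized scores, and `lex_m` / `lex_wm` are hypothetical lexical-model callables used to fill in a score when a pair is missing from one table, so that nothing is ever interpolated against 0.

```python
def interpolate_lexicalized(pt_m, pt_wm, lex_m, lex_wm, alpha=0.5):
    """Linearly interpolate lexicalized scores over the union of pairs.

    For a pair absent from one table, its score under that table is
    INDUCED from the table's lexical model instead of being taken as 0.
    """
    merged = {}
    for pair in set(pt_m) | set(pt_wm):
        a = pt_m.get(pair, lex_m(*pair))     # induce if absent from PTm
        b = pt_wm.get(pair, lex_wm(*pair))   # induce if absent from PTw->m
        merged[pair] = alpha * a + (1 - alpha) * b
    return merged
```

A pair found only in PTwm thus still receives a sensible PTm-side score, rather than being dragged down by a zero.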

  18. Dataset & Settings
• Dataset
  • Past shared task WPT05 (English/Finnish)
  • 714K sentence pairs
  • Split into T1, T2, T3, and T4 of sizes 40K, 80K, 160K, and 320K
• Standard phrase-based SMT settings:
  • Moses
  • IBM Model 4
  • Case-insensitive BLEU

  19. MT Baseline Systems
• w-system: word level
• m-system: morpheme level
• m-BLEU: morpheme version of BLEU
Either the m-system does not perform as well as the w-system, or BLEU is not capable of measuring morpheme-level improvements.

  20. Morphological Enhancements: Individual
• phr: boundary-aware phrase extraction
• tune: MERT tuning with word BLEU
• lm: decoding with twin LMs
Each individual enhancement yields improvements for both small and large corpora.

  21. Morphological Enhancements: Combined
• phr: boundary-aware phrase extraction
• tune: MERT tuning with word BLEU
• lm: decoding with twin LMs
The combined morphological enhancements are on par with the w-system and yield sizable improvements over the m-system.

  22. Translation Model Enrichment
• add-1: one extra feature
• add-2: two extra features
• interpolation: linear interpolation
• ourMethod
Our method outperforms the w-system baseline.

  23. Result Significance
• Absolute improvement of 0.74 BLEU over the m-system
  • a non-trivial relative improvement of 5.6%
• Outperforms the w-system by 0.24 BLEU points (1.56% relative)
• Statistically significant with p < 0.01 (Collins' sign test)
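For intuition about the significance claim, a generic two-sided sign test can be sketched (an assumption on my part about the test's form: Collins' sign test as used in MT compares per-sentence wins and losses under a binomial null with p = 0.5, dropping ties; `sign_test_p` is a hypothetical helper, not the authors' script):

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided sign test p-value: binomial tail with p = 0.5.

    wins/losses: per-sentence comparisons favouring one system or the
    other; ties are assumed to have been dropped beforehand.
    """
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(9, 1))   # 9 wins out of 10 decisions
```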

  24. Translation Proximity Match
Hypothesis: our approach yields translations close to the reference wordforms but is unjustly penalized by BLEU.
• Automatically extract phrase triples (src, out, ref)
  • using src as the pivot between the translation output (out) and the reference (ref)
• Compute the longest common subsequence ratio
  • high character-level similarity between out and ref, e.g. (economic and social, taloudellisia ja sosiaalisia, taloudellisten ja sosiaalisten)
• Of 16,262 triples, 6,758 match the references exactly; the remaining triples are close wordforms
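The similarity measure named on the slide is standard and easy to state exactly. A self-contained sketch (the DP formulation is my own; the slide only names the measure):

```python
def lcsr(a, b):
    """Longest common subsequence ratio: |LCS(a, b)| / max(|a|, |b|).

    Used here as a character-level similarity between an MT output
    wordform and the corresponding reference wordform.
    """
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n, 1)

print(lcsr("taloudellisia", "taloudellisten"))  # high, though BLEU sees a miss
```

On the slide's example, the output and reference share a long common stem, so the ratio is high even though word-level BLEU counts the pair as a complete mismatch.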

  25. Human Evaluation
• 4 native Finnish speakers
• 50 random test sentences
• Followed the WMT'09 evaluation:
  • provided judges with the source sentence, its reference translation, and
  • the outputs of the m-system, the w-system, and ourSystem, shown in random order
  • asked for three pairwise judgments
The judges consistently preferred: (1) ourSystem to the m-system, (2) ourSystem to the w-system, and (3) the w-system to the m-system.

  26. Sample Translations
(annotations used on the slide: match reference, wrong case, confusing meaning, OOV, change the meaning)
src: we were very constructive and we negotiated until the last minute of these talks in the hague .
ref: olimme erittäin rakentavia ja neuvottelimme haagissa viime hetkeen saakka .
our: olemme olleet hyvin rakentavia ja olemme neuvotelleet viime hetkeen saakka näiden neuvottelujen haagissa .
w : olemme olleet hyvin rakentavia ja olemme neuvotelleet viime tippaan niin näiden neuvottelujen haagissa .
m : olimme erittäin rakentavan ja neuvottelimme viime hetkeen saakka näiden neuvotteluiden haagissa .
Rank: our > m ≥ w
src: it would be a very dangerous situation if the europeans were to become logistically reliant on russia .
ref: olisi erittäin vaarallinen tilanne , jos eurooppalaiset tulisivat logistisesti riippuvaisiksi venäjästä .
our: olisi erittäin vaarallinen tilanne , jos eurooppalaiset tulee logistisesti riippuvaisia venäjän .
w : se olisi erittäin vaarallinen tilanne , jos eurooppalaisten tulisi logistically riippuvaisia venäjän .
m : se olisi hyvin vaarallinen tilanne , jos eurooppalaiset haluavat tulla logistisesti riippuvaisia venäjän .
Rank: our > w ≥ m
ourSystem consistently outperforms the m-system and the w-system, and seems to blend translations from both baselines well!

  27. Conclusion
• Two key challenges:
  • bring the performance of morpheme systems to a level rivaling the standard word-based ones
  • incorporate morphological analysis directly into the translation process
• We have built a preliminary framework addressing the first challenge
• Future work:
  • extend the morpheme-level framework
  • tackle the second challenge
Thank you!

  28. Analysis: Translation Model Comparison
• PTm+phr vs. PTm
  • PTm+phr is about half the size of PTm
  • 95.07% of the phrase pairs in PTm+phr are also in PTm
   boundary-aware phrase extraction selects good phrase pairs from PTm to retain in PTm+phr
• PTm+phr vs. PTwm
  • comparable in size: 22.5M and 28.9M pairs
  • the overlap is only 47.67% of PTm+phr
   enriching the translation model with PTwm helps improve coverage
