
Comparing Example-Based & Statistical Machine Translation

Andy Way*†, Nano Gough*, Declan Groves†. National Centre for Language Technology, School of Computing, Dublin City University. {away,ngough,dgroves}@computing.dcu.ie

Presentation Transcript


  1. Comparing Example-Based & Statistical Machine Translation • Andy Way*† • Nano Gough*, Declan Groves† • National Centre for Language Technology • School of Computing, Dublin City University • {away,ngough,dgroves}@computing.dcu.ie • [*To appear in the Journal of Natural Language Engineering, June 2005] • [† To appear in the Workshop on Building and Using Parallel Texts: Data-Driven MT and Beyond, ACL-05, June 2005]

  2. Plan of the Talk • Basic Situation in MT today: • Statistical MT (SMT) • Example-Based MT (EBMT) • Differences between Phrase-based SMT & EBMT. • Our ‘Marker-based’ EBMT system. • Testing EBMT vs. word- & phrase-based SMT. • Results & Observations. • Concluding Remarks. • Future Research Avenues.

  3. What is the Situation today in MT? • Most MT research undertaken today is corpus-based (compared with rule-based methods). • Two main data-driven approaches: • Example-Based MT (EBMT) • Statistical MT (SMT) • SMT is by far the more dominant paradigm.

  4. How does EBMT work? • [Diagram: an input sentence EX is searched against the source side of the example base; the matching target fragments (F2, F4) are retrieved and recombined into the output translation FX.]

  5. A (much simplified) Example • Given in corpus: John went to school → Jean est allé à l’école. / The butcher’s is next to the baker’s → La boucherie est à côté de la boulangerie. • Isolate useful fragments: John went to → Jean est allé à / the baker’s → la boulangerie • We can now translate: John went to the baker’s → Jean est allé à la boulangerie.
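
To make the fragment-and-recombine idea concrete, here is a minimal, hypothetical Python sketch: a memory of source-to-target fragment pairs is consulted with greedy longest-match lookup, and the matched target fragments are concatenated. The fragment memory and function names are invented for illustration; the real system is described on the following slides.

```python
# Toy illustration of the simplified EBMT example above (not the authors' code).
# Fragment pairs harvested from the corpus live in a memory; a new input is
# translated by greedy longest-match lookup and concatenation of the
# corresponding target fragments.

fragment_memory = {
    ("john", "went", "to"): "Jean est allé à",
    ("the", "baker's"): "la boulangerie",
}

def translate(sentence):
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # try the longest source fragment starting at position i
        for j in range(len(words), i, -1):
            chunk = tuple(words[i:j])
            if chunk in fragment_memory:
                output.append(fragment_memory[chunk])
                i = j
                break
        else:
            output.append(words[i])   # no match: copy the word through
            i += 1
    return " ".join(output)

print(translate("John went to the baker's"))
# -> Jean est allé à la boulangerie
```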

  6. How does SMT work? • SMT deduces language & translation models from huge quantities of monolingual and bilingual data using a range of theoretical approaches to probability distribution and estimation. • Translation model establishes the set of target language words (and more recently, phrases) which are most likely to be useful in translating the source string. • takes into account source and target word (and phrase) co-occurrence frequencies, sentence lengths and the relative sentence positions of source and target words. • Language model tries to assemble these words (and phrases) in the best order possible. • trained by determining all bigram and/or trigram frequency distributions occurring in the training data.
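
As a rough illustration of how the two models interact, the following sketch scores candidate target strings with a toy translation model and a toy bigram language model and keeps the best one. All probabilities and word pairs below are invented for illustration; this is not a real SMT decoder.

```python
# Illustrative noisy-channel sketch: translation model scores how well target
# words account for the source, a bigram language model scores target word
# order, and the best-scoring candidate wins. Toy probabilities only.

from math import log

# P(source_word | target_word): toy translation model
t_prob = {("maison", "house"): 0.8, ("maison", "home"): 0.2,
          ("la", "the"): 0.9, ("la", "a"): 0.1}

# P(word | previous_word): toy bigram language model
lm = {("<s>", "the"): 0.5, ("the", "house"): 0.4,
      ("<s>", "a"): 0.3, ("a", "home"): 0.1}

def score(source, target):
    """log P(source | target) + log P(target) under the toy models."""
    s = 0.0
    for f, e in zip(source, target):                  # naive 1-to-1 alignment
        s += log(t_prob.get((f, e), 1e-6))
    for prev, w in zip(["<s>"] + target[:-1], target):
        s += log(lm.get((prev, w), 1e-6))
    return s

candidates = [["the", "house"], ["a", "home"]]
best = max(candidates, key=lambda e: score(["la", "maison"], e))
print(best)   # ['the', 'house']
```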

  7. The Paradigms are Converging • Harder than it has ever been to describe the differences between the two methods. • This used to be easy: • from the beginning, EBMT has sought to translate new texts by means of a range of sub-sentential data—both lexical and phrasal—stored in the system's memory. • until quite recently, SMT models of translation were based on the simple IBM word alignment models of [Brown et al., 1990].

  8. From word- to phrase-based SMT • SMT systems now learn phrasal as well as lexical alignments [e.g. Koehn, Och, Marcu 2003; Och, 2003]. • Unsurprisingly, the quality of today's phrase-based SMT systems is considerably better than that of the poorer word-based models. • Despite the fact that EBMT models have been modelling lexical and phrasal correspondences for 20 years, no papers on SMT acknowledge this debt to EBMT, nor describe their approach as ‘example-based’ …

  9. Differences between EBMT and Phrase-Based SMT? • EBMT alignments remain available for reuse in the system, whereas (similar) SMT alignments ‘disappear’ in the probability models. • SMT systems never ‘learn’ from previously encountered data, i.e. when SMT sees a string it’s seen before, it processes it in the same way as ‘unseen’ data—EBMT will just ‘look up’ such strings in its databases and output the translation quite straightforwardly. • Depending on the model, EBMT builds in (some) syntax at its core—most SMT systems only use models of syntax in a post hoc reranking process, and even here, [Koehn et al., JHU Workshop 2003] demonstrated that ‘bolting on’ syntax in this manner did not help improve translation quality; • Given (3), phrase-based SMT systems are likely to ‘learn’ (some) chunks that EBMT systems would not.

  10. SMT chunks are different from EBMT chunks • En: Mary did not slap the green witch • Sp: Maria no dió una bofetada a la bruja verde. • (Lit: ‘Mary not gave a slap to the witch green’) • From this aligned example, an SMT system would potentially learn the following ‘phrases’ (along with many others): • slap the → dió una bofetada a • slap the → dió una bofetada a la • the green witch → a la bruja verde • NB, SMT essentially learns n-gram sequences, rather than phrases per se. • [Koehn & Knight, AMTA-04 SMT Tutorial Notes]
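
The kind of phrase learning referred to here can be illustrated with the standard consistency criterion for extracting phrase pairs from a word alignment. The alignment points and the length limit below are our own illustrative choices, and the extension over unaligned boundary words (which is what yields chunks such as "slap the → dió una bofetada a") is omitted for brevity.

```python
# Hedged sketch of phrase-pair extraction from a word-aligned sentence pair:
# collect every pair of contiguous source/target spans that is consistent with
# the alignment, i.e. no alignment link points outside the box.

src = "Mary did not slap the green witch".split()
tgt = "Maria no dió una bofetada a la bruja verde".split()
links = {(0, 0), (2, 1), (3, 2), (3, 3), (3, 4), (4, 6), (5, 8), (6, 7)}  # (src, tgt)

def extract_phrases(src, tgt, links, max_len=4):
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(len(src), s1 + max_len)):
            ts = [t for (s, t) in links if s1 <= s <= s2]   # targets linked to the span
            if not ts:
                continue
            t1, t2 = min(ts), max(ts)
            # consistent only if everything in the target span links back inside [s1, s2]
            if all(s1 <= s <= s2 for (s, t) in links if t1 <= t <= t2) and t2 - t1 < max_len:
                pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

for s, t in extract_phrases(src, tgt, links):
    print(f"{s}  ->  {t}")
# output includes "slap -> dió una bofetada" and "the green witch -> la bruja verde"
```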

  11. Our Marker-Based EBMT System • “The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” [Green, 1979] • Markers for English (and French): [table not reproduced in the transcript]
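
Since the slide's table of marker words is not reproduced here, the following sketch gives an idea of the kind of closed-class categories involved; the word lists are partial and chosen by us for illustration, not the system's actual resources.

```python
# Illustrative only: small closed-class "marker" word sets of the kind the
# Marker Hypothesis refers to. The categories match the tags used on the next
# slides (<DET>, <PREP>, <PRON>, ...), but these word lists are our own and
# are not exhaustive.

MARKERS_EN = {
    "DET":  {"the", "a", "an", "this", "that"},
    "PREP": {"to", "of", "in", "on", "with", "from"},
    "PRON": {"you", "he", "she", "it", "we", "they"},
    "CONJ": {"and", "or", "but"},
}

MARKERS_FR = {
    "DET":  {"le", "la", "les", "l'", "un", "une"},
    "PREP": {"à", "de", "sur", "pour", "dans", "avec"},
    "PRON": {"vous", "il", "elle", "nous", "ils"},
    "CONJ": {"et", "ou", "mais"},
}

def marker_category(word, markers):
    """Return the marker tag for a word, or None for a non-marker (lexical) word."""
    for tag, words in markers.items():
        if word.lower() in words:
            return tag
    return None

print(marker_category("sur", MARKERS_FR))   # PREP
```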

  12. An Example • En: you click apply to view the effect of the selection • Fr: vous cliquez sur appliquer pour visualiser l'effet de la sélection • Source—target aligned sentences are traversed word by word and automatically tagged with their marker categories: • <PRON>you click apply <PREP>to view <DET>the effect <PREP>of <DET>the selection • <PRON>vous cliquez <PREP>sur appliquer <PREP>pour visualiser <DET>l'effet <PREP>de <DET>la sélection

  13. Deriving Sub-Sentential Source—Target Chunks • From these tagged strings, we generate the following aligned marker chunks: • <PRON> you click apply : vous cliquez sur appliquer • <PREP> to view : pour visualiser • <DET> the effect : l'effet • <PREP> of the selection : de la sélection • New source and target (not necessarily source—target!) fragments begin where marker words are met and end at the next marker word [+ cognates, MI etc → source—target sub-sentential alignments]. • One further constraint: each chunk must contain at least one non-marker word (cf. 4th marker chunk).
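
A minimal sketch of this tagging and segmentation step, assuming a toy marker lexicon (not the authors' implementation): a chunk opens at a marker word and may only be closed off once it contains at least one non-marker word, which is why "of the selection" stays in a single chunk. Aligning source chunks with target chunks (via cognates, mutual information, etc.) is a separate step not shown here.

```python
# Toy marker lexicon and monolingual chunker illustrating slides 12-13.

MARKERS_EN = {"you": "PRON", "the": "DET", "to": "PREP", "of": "PREP"}

def marker_chunks(sentence, markers):
    chunks, current, has_lexical = [], [], False
    for word in sentence.split():
        tag = markers.get(word.lower())
        if tag is not None and current and has_lexical:
            chunks.append(current)            # a marker word opens a new chunk,
            current, has_lexical = [], False  # but only after a non-marker word was seen
        current.append(f"<{tag}>{word}" if tag else word)
        has_lexical = has_lexical or tag is None
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

print(marker_chunks("you click apply to view the effect of the selection", MARKERS_EN))
# ['<PRON>you click apply', '<PREP>to view', '<DET>the effect',
#  '<PREP>of <DET>the selection']
```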

  14. Deriving Lexical Mappings • Where chunks contain just one non-marker word in both source and target, we assume they are translations. • Thus we can extract the following ‘word-level’ translations: • <PREP> to : pour • <LEX> view : visualiser • <LEX> effect : effet • <PRON> you : vous • <DET> the : l’ • <PREP> of : de
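
Illustratively, this word-level extraction can be sketched as follows; the chunk pairs, marker lists, and the crude handling of the elided French article are our own simplifications.

```python
# Sketch of slide 14: when an aligned chunk pair contains exactly one
# non-marker word on each side, pair off the lexical words, and the marker
# words, as word-level translations.

EN_MARKERS = {"to", "the", "of", "you"}
FR_MARKERS = {"pour", "le", "la", "l'", "de", "vous"}

chunk_pairs = [
    ("to view", "pour visualiser"),
    ("the effect", "l'effet"),
]

def lexical_mappings(pairs):
    lexicon = []
    for src, tgt in pairs:
        s_words = src.split()
        t_words = tgt.replace("l'", "l' ").split()   # crude split of the elided article
        s_lex = [w for w in s_words if w not in EN_MARKERS]
        t_lex = [w for w in t_words if w not in FR_MARKERS]
        if len(s_lex) == 1 and len(t_lex) == 1:
            lexicon.append((s_lex[0], t_lex[0]))      # e.g. view : visualiser
            # naively pair off the marker words in order, e.g. to : pour
            lexicon.extend(zip([w for w in s_words if w in EN_MARKERS],
                               [w for w in t_words if w in FR_MARKERS]))
    return lexicon

print(lexical_mappings(chunk_pairs))
# [('view', 'visualiser'), ('to', 'pour'), ('effect', 'effet'), ('the', "l'")]
```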

  15. Deriving Generalised Templates • In a final pre-processing stage, we produce a set of generalised marker templates by replacing marker words with their tags: • <PRON> click apply : <PRON> cliquez sur appliquer • <PREP> view : <PREP> visualiser • <DET> effect : <DET> effet • <PREP> the selection : <PREP> la sélection • Any marker tag pair can now be inserted at the appropriate tag location. • More general examples add flexibility to the matching process and improve coverage (and quality).
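
One possible rendering of the generalisation step, using the chunk notation from the previous slides (our sketch, not the system's code): the marker word that opens a chunk is replaced by its bare tag, so any other word of that category can later be slotted in at the tag position.

```python
# Sketch of template generalisation over aligned marker chunks.

import re

def generalise(chunk):
    # "<PREP>to view" -> "<PREP> view"
    return re.sub(r"<(\w+)>\S+", r"<\1>", chunk, count=1)

chunk_pairs = [
    ("<PREP>to view", "<PREP>pour visualiser"),
    ("<PRON>you click apply", "<PRON>vous cliquez sur appliquer"),
]

templates = [(generalise(s), generalise(t)) for s, t in chunk_pairs]
print(templates)
# [('<PREP> view', '<PREP> visualiser'),
#  ('<PRON> click apply', '<PRON> cliquez sur appliquer')]
```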

  16. Summary of Knowledge Sources • the original sententially-aligned source—target pairs; • the marker-aligned chunks; • the generalised marker chunks; • the word-level lexicon. • New strings are segmented into all possible n-grams that might be retrieved from the system's memories. • Resources are searched in the order provided here, from maximal (specific source—target sentence-pairs) to minimal context (word-for-word translation).
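
The back-off order can be pictured as a simple cascade of lookups. This is a sketch with placeholder resources; the real system also handles partial matches, recombination and generalised-template substitution, which are not shown here.

```python
# Cascade of knowledge sources, searched from maximal to minimal context.

sentence_pairs = {
    "you click apply to view the effect of the selection":
        "vous cliquez sur appliquer pour visualiser l'effet de la sélection",
}
marker_chunks = {"to view": "pour visualiser", "the effect": "l'effet"}
generalised_templates = {"<PREP> view": "<PREP> visualiser"}
word_lexicon = {"view": "visualiser", "effect": "effet", "of": "de"}

RESOURCES = [sentence_pairs, marker_chunks, generalised_templates, word_lexicon]

def lookup(fragment):
    """Return the first translation found, searching maximal to minimal context."""
    for resource in RESOURCES:
        if fragment in resource:
            return resource[fragment]
    return None

print(lookup("the effect"))   # "l'effet", found in the marker-chunk memory
```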

  17. Application Areas for our EBMT System • Seeding System Memories with Penn-II Treebank phrases and translations [AMTA-02]. • Controlled Language & EBMT [MT Summit-03, EAMT-04, MT Journal-05]. • Integration with web-based MT Systems [CL Journal-03]. • Using the Web for Translation Validation (and Correction, if required). • Scalable EBMT [TMI-04, NLE Journal-05, ACL-05]. • Largest English→French EBMT System. • Robust, Wide-Coverage, Good Quality. • Outperforms good on-line MT Systems.

  18. What are we interested in finding out? • Whether our marker-based EBMT system could outperform (1) word-based and (2) phrase-based SMT systems compiled from generally available tools; • Whether such SMT systems outperform our EBMT system when given ‘enough’ training text. • Whether seeding SMT (and EBMT) systems with SMT and/or EBMT data improves translation quality. • NB, (astonishingly), no previous published research on comparing EBMT and SMT …

  19. What have we done vs. what are we doing? • WBSMT vs. EBMT • PBSMT seeded with: • SMT chunks; • EBMT chunks • Both knowledge sources (‘Hybrid Example-Based SMT’). • PBSMT vs. EBMT • Ongoing work • EBMT seeded with: • SMT chunks; • EBMT chunks • Merged knowledge sources (‘Hybrid Statistical EBMT’).

  20. Word-Based SMT vs. EBMT • Marker-Based EBMT system [Gough & Way, TMI-04] • To develop language and translation models for the WBSMT system, we used: • Giza++ (for word-alignment) • the CMU-Cambridge statistical toolkit (for computing the language and translation models) • the ISI ReWrite Decoder (for deriving translations)

  21. Experiment 1 Set-Up • 207K English—French Sun TM. • Randomly extracted 4K sentence test set. • Split remaining sentences into three training sets: roughly 50K (1.1M words), 100K and 203K (4.8M words) sentence-pairs to test impact of training set size. • Translation performed at each stage from English—French and French—English. • Resulting translations evaluated using a range of automatic metrics.
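
For concreteness, the hold-out and training-set construction might look like the following sketch; this is an assumed workflow, not the authors' scripts, and any filtering or exact split sizes are as described in the paper.

```python
# Assumed set-up workflow: hold out a random 4,000-sentence test set from the
# 207K translation memory, then carve the remainder into nested training sets
# of roughly 50K, 100K and 203K sentence pairs.

import random

def make_splits(sentence_pairs, test_size=4000, seed=42):
    random.seed(seed)
    shuffled = list(sentence_pairs)
    random.shuffle(shuffled)
    test = shuffled[:test_size]
    rest = shuffled[test_size:]
    train_sets = {
        "TS1_50K":  rest[:50_000],
        "TS2_100K": rest[:100_000],
        "TS3_full": rest,               # everything remaining, roughly 203K pairs
    }
    return test, train_sets
```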

  22. WBSMT vs. EBMT: English—French • All metrics bar one suggest that EBMT can outperform WBSMT from English—French; • Only exception is for TS1, where WBSMT outperforms EBMT in terms of precision (.674 compared to .653)

  23. WBSMT vs. EBMT: English—French • In general, scores incrementally improve as training data increases. • But apart from SER, metrics suggest that training on just over 100K sentence pairs yields better results than training on just over 200K. • Why? Maybe due to overfitting or odd data … • Surprising: generally assumed that increasing training data in Machine Learning approaches will improve the quality of the output translations (variance analysis: bootstrap-resampling on test set [Koehn, EMNLP-04]; different test sets). • Note especially the similarity of the WER scores, and the difference in SER values. Much more significant improvement for EBMT (20.6%) than for WBSMT (0.1%).

  24. WBSMT vs. EBMT: French—English • All WBSMT scores higher than for English—French; • For EBMT, better translations from French—English for BLEU, Recall and SER; worse for WER (FR-EN: .508, EN-FR: .448) and precision (FR-EN: .678, EN-FR: .736);

  25. WBSMT vs. EBMT: French—English • For TS1, EBMT does not outperform WBSMT from French—English for any of the five metrics. • For TS2, EBMT beats WBSMT in terms of BLEU, Recall and SER (66.5% compared to 81.3% for WBSMT), while WBSMT gets higher scores for Precision and WER (46.2% compared to 55.2%). • For TS3, WBSMT again beats EBMT in terms of Precision (2.5%) and WER (4% - both less significant differences than for TS1 and TS2), but EBMT wins out according to the other three metrics—notably, by a huge 29.6% for SER. • BLEU: WBSMT obtains significantly higher scores for French—English compared to English—French: 8% higher for TS1, 6% higher for TS2, and 12% higher for TS3. Apart from TS1, the EBMT scores for the two different language directions are much more in line, indicating perhaps that EBMT may be more consistent even for the same language pair in different directions.

  26. Summary of Results • Both EBMT & WBSMT achieve better translation quality from French—English compared to English—French. Of the five automatic evaluation metrics for each of the three training sets, in 9/15 cases WBSMT wins out over our EBMT system. • For English—French, in 14/15 cases EBMT beats WBSMT. • Summing these results together, EBMT outperforms WBSMT in 20 tests, while WBSMT does better in 10 experiments. • Assuming all of these tests to be of equal importance, EBMT appears to outperform WBSMT by a factor of two to one. • While the results are a little mixed, it is clear that EBMT tends to outperform WBSMT on this sublanguage and on these training sets.

  27. Experiment 2: Phrase-Based SMT vs. EBMT • Same EBMT system as for WBSMT experiment • To develop language and translation models for the SMT system, we used: • Giza++ to extract word-alignments; • Refine these to extract Giza++ phrase-alignments; • Construct Probability Tables; • Pass these to CMU-SRI statistical toolkit & Pharaoh Decoder to derive translations. • Same Translation Pairs, Training Sets, Test Sets • Resulting translations evaluated using a range of automatic metrics

  28. PBSMT vs. EBMT: English—French • PBSMT with Giza++ sub-sentential alignments wins out over PBSMT with EBMT data, but cf. size of data sets: • EBMT: 403,317 • PBSMT: 1.73M • PBSMT beats WBSMT, notably for BLEU; but 5% worse for WER. SER still (disappointingly) high • EBMT beats PBSMT, esp. for BLEU, Recall, WER & SER

  29. PBSMT vs. EBMT: French—English • PBSMT with Giza++ sub-sentential alignments wins out over PBSMT with EBMT data (with same caveat) • PBSMT with both knowledge sources better for F—E than for E—F • PBSMT doesn’t beat WBSMT - ?? • EBMT beats PBSMT

  30. Experiment 3a: Seeding Pharaoh with Giza++ Words and EBMT Phrases: English—French • Hybrid PBSMT system beats ‘baseline’ PBSMT for BLEU, P&R, and SER; slightly worse WER • Data Size: 430K (cf. PBSMT 1.73M, EBMT 403K) • Still worse than EBMT scores

  31. Experiment 3b: Seeding Pharaoh with Giza++ Words and EBMT Phrases: French—English • Hybrid PBSMT system beats ‘baseline’ PBSMT for BLEU; slightly worse for P&R, and SER; quite a bit worse for WER; • Still shy of the results for EBMT.

  32. Experiment 4a: Seeding Pharaoh with All Data, English—French • Hybrid System beats ‘semi-hybrid’ system on all metrics; • Loses out to EBMT system, except for Precision. • Data Set now >2M items.

  33. Experiment 4b: Seeding Pharaoh with All Data, French—English • Hybrid System beats ‘semi-hybrid’ system on all metrics; • Hybrid System beats EBMT on BLEU & Precision; EBMT ahead for Recall & WER; still well ahead for SER.

  34. Summary of Results: WBSMT vs. EBMT • None of these are ‘bad’ systems: for TS3, worst BLEU score is for WBSMT, EF, .322; • WBSMT loses out to EBMT 2:1 (but better overall for FE); • For TS3, WBSMT BLEU score of .446 and EBMT score of .461 are high scores; • For WBSMT vs. EBMT experiments, odd finding: higher scores for 100K training set: investigate in future work.

  35. Summary of Results: PBSMT vs. EBMT • PBSMT scores better than for WBSMT, but odd result for FE …?! • Best PBSMT BLEU scores (with Giza++ data only): .375 (EF), .420 (FE); • Seeding PBSMT with EBMT data gets good scores: for BLEU, .364 (EF), .395 (FE); note differences in data size (1.73M vs. 403K); • PBSMT loses out to EBMT; • PBSMT SER still very high (83—87%).

  36. Summary of Results: Semi-Hybrid Systems • Seeding Pharaoh with SMT words and EBMT phrases improves over baseline Giza++ seeded system; • Data size diminishes considerably (430K vs. 1.73M); • Still worse result for ‘semi-hybrid’ system for FE than for WBSMT… ?! • Still worse results than for EBMT.

  37. Summary of Results: Fully Hybrid Systems • Better results than for ‘semi-hybrid’ systems: EF .426 (.396), FE .489 (.427); • Data size increases; • For FE, Hybrid system beats EBMT on BLEU (.461) & Precision; EBMT ahead for Recall & WER; still well ahead (27%) for SER.

  38. Concluding Remarks • Despite the convergence between EBMT and SMT, further gains to be made; • Merging Giza++ and EBMT-induced data leads to an improved Hybrid Example-Based SMT system; • Lesson for SMT community: don’t disregard the large body of work on EBMT! • We expect in further work that adding SMT sub-sentential data to our EBMT system will also lead to improvements; • Lesson for EBMT-ers: SMT data can help you too!

  39. Future Work • Carry out significance tests on these results. • Investigate what’s going on in 2nd 100K training set. • Develop ‘Statistical EBMT System’ as described; • Other issues in hybridity: • Use target LM in EBMT; • Replace EBMT recombination process with SMT decoder; • Try different decoders, LMs and TMs; • Factor in Marker Tags into SMT Probability Tables. • Experiment with other training data in other sublanguage domains, especially those where larger corpora are available (e.g. Canadian Hansards, European Parliament …); • Try other language pairs.
