
Hybridity in MT: Experiments on the Europarl Corpus


Presentation Transcript


  1. Hybridity in MT: Experiments on the Europarl Corpus Declan Groves 24th May, NCLT Seminar Series 2006

  2. Outline • Example-Based Machine Translation • Marker-Based EBMT • Statistical Machine Translation • Phrasal Extraction • Experiments: • Data Sources Used • EBMT vs PBSMT • Hybrid System Experiments • Improving EBMT lexicon • Making use of merged data sets • Conclusions • Future Work

  3. Example-Based MT • As with SMT, makes use of information extracted from a sententially aligned bilingual corpus. In general: • SMT uses only the estimated model parameters and throws away the data itself • EBMT makes use of the linguistic units directly • During translation: • The source side of the bitext is searched for close matches • Source-target sub-sentential links are determined • Relevant target fragments are retrieved and recombined to derive the final translation

  4. Example-Based MT: An Example • Assumes an aligned bilingual corpus of examples against which input text is matched • Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)


  6. Example-Based MT: An Example • Identify useful fragments

  7. Example-Based MT: An Example • Identify useful fragments • Recombination depends on nature of examples used: on Monday: lundi; John went to: Jean est allé à; the baker’s: la boulangerie

  8. Marker-Based EBMT “The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” (Green, 1979) • Universal psycholinguistic constraint: languages are marked for syntactic structure at surface level by a closed set of lexemes or morphemes • Use a set of closed-class marker words to segment aligned source and target sentences during a pre-processing stage

  9. Marker-Based EBMT • Source-target sentence pairs are marked with their Marker categories EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection FR: <PRON> vous cliquez <PRON> sur appliquer <PREP> pour visualiser <DET> l’effet <PREP> de <DET> la sélection • Aligned source-target chunks are created by segmenting the sentences based on these marker tags, along with cognate and word co-occurrence information: <PRON> you click apply: <PRON> vous cliquez sur appliquer <PREP> to view: <PREP> pour visualiser <DET> the effect: <DET> l’effet <PREP> of the selection: <PREP> de la sélection
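
To make the segmentation step concrete, here is a minimal sketch in Python. The marker set and its category labels are illustrative assumptions, not the system's actual resources, and the cognate/co-occurrence alignment of source chunks to target chunks is omitted:

    # Minimal sketch of marker-based segmentation: open a new chunk at
    # each closed-class marker word and label it with that marker's tag.
    MARKERS = {"you": "PRON", "it": "PRON", "to": "PREP", "of": "PREP",
               "the": "DET", "a": "DET"}

    def segment(sentence):
        chunks, current = [], None
        for word in sentence.lower().split():
            if word in MARKERS:
                if current is not None:
                    chunks.append(current)
                current = [MARKERS[word], word]   # open a new chunk
            elif current is None:
                current = ["LEX", word]           # sentence-initial material
            else:
                current.append(word)
        if current is not None:
            chunks.append(current)
        return chunks

    print(segment("you click apply to view the effect of the selection"))
    # [['PRON', 'you', 'click', 'apply'], ['PREP', 'to', 'view'],
    #  ['DET', 'the', 'effect'], ['PREP', 'of'], ['DET', 'the', 'selection']]

The actual system additionally requires each chunk to contain at least one non-marker word, which is why "<PREP> of the selection" above remains a single chunk; that constraint is left out of the sketch for brevity.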

  10. Marker-Based EBMT • Chunks containing only one non-marker word in both source and target languages can then be used to extract a word-level lexicon: <PREP> to: <PREP> pour <LEX> view: <LEX> visualiser <LEX> effect: <LEX> effet <DET> the: <DET> l <PREP> of: <PREP> de • In a final pre-processing stage, we produce a set of generalized marker templates by replacing marker words with their tags: <PRON> click apply: <PRON> cliquez sur appliquer <PREP> view: <PREP> visualiser <DET> effect: <DET> effet <PREP> the selection: <PREP> la sélection • Any marker tag pair can now be inserted at the appropriate tag location • More general examples add flexibility to the matching process and improve coverage (and quality)
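
A sketch of these two pre-processing steps, under the assumption that an aligned chunk pair is represented as (tag, source_words, target_words); the marker word list is again illustrative:

    # Lexicon extraction and template generalisation (illustrative).
    MARKER_WORDS = {"to", "of", "the", "pour", "de", "la", "l"}

    def extract_lexicon(chunk_pairs):
        """Chunks with exactly one non-marker word on each side yield
        both a marker-word entry and a <LEX> entry."""
        lexicon = []
        for tag, src, tgt in chunk_pairs:
            src_lex = [w for w in src if w not in MARKER_WORDS]
            tgt_lex = [w for w in tgt if w not in MARKER_WORDS]
            if len(src_lex) == 1 and len(tgt_lex) == 1:
                lexicon.append((tag, src[0], tgt[0]))          # marker pair
                lexicon.append(("LEX", src_lex[0], tgt_lex[0]))  # word pair
        return lexicon

    def generalise(chunk_pairs):
        """Replace the chunk-initial marker word by its tag to obtain a
        template; any chunk with the same tag can later fill the slot."""
        return [(tag, src[1:], tgt[1:]) for tag, src, tgt in chunk_pairs]

    pairs = [("PREP", ["to", "view"], ["pour", "visualiser"])]
    print(extract_lexicon(pairs))
    # [('PREP', 'to', 'pour'), ('LEX', 'view', 'visualiser')]
    print(generalise(pairs))
    # [('PREP', ['view'], ['visualiser'])]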

  11. Marker-Based EBMT • During translation: • Resources are searched from maximal (specific source-target sentence pairs) to minimal context (word-for-word translation) • Retrieved example translation candidates are recombined, along with their weights, based on source sentence order • System outputs an n-best list of translations
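
The maximal-to-minimal back-off search can be pictured with a small greedy sketch. The real system weights competing candidates and returns an n-best list; here a single greedy translation is returned for clarity, and the resource dictionaries are hypothetical:

    # Greedy sketch of the back-off search: whole sentence first, then
    # the longest matching chunk, then word-for-word from the lexicon.
    def translate(words, sentence_memory, chunk_memory, lexicon):
        key = " ".join(words)
        if key in sentence_memory:              # maximal context
            return sentence_memory[key]
        output, i = [], 0
        while i < len(words):
            for j in range(len(words), i, -1):  # longest chunk match first
                span = " ".join(words[i:j])
                if span in chunk_memory:
                    output.append(chunk_memory[span])
                    i = j
                    break
            else:                               # minimal context
                output.append(lexicon.get(words[i], words[i]))
                i += 1
        return " ".join(output)

    chunks = {"to view": "pour visualiser", "the effect": "l’effet"}
    print(translate("to view the effect".split(), {}, chunks, {}))
    # pour visualiser l’effet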

  12. Phrase-Based SMT • Translation models now estimate both word-to-word and phrasal translation probabilities (additionally allowing many-to-one and many-to-many word mappings) • Phrases incorporate some notion of local syntax, capturing more meaningful relationships between words within phrases • To extract phrases, we can make use of word alignments

  13. SMT Phrasal Extraction • Perform word alignment in both source-target and target-source directions • Take the intersection of the uni-directional alignments, producing a set of high-confidence word alignments • Extend the intersection iteratively into the union by adding adjacent alignments within the alignment space (Och & Ney, 2003; Koehn et al., 2003) • Extract all phrase pairs from the sentence pairs that are consistent with these alignments (possibly including full sentences) • Phrase probabilities can be calculated from relative frequencies • The phrases and their probabilities make up the phrase translation table (the translation model)
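
A compressed sketch of this pipeline follows. It is a simplification of the cited heuristics: the growing step makes a single pass over the union rather than iterating to convergence, and the extension to unaligned boundary words is omitted:

    from collections import Counter

    def symmetrise(src2tgt, tgt2src):
        """src2tgt, tgt2src: sets of (i, j) word-alignment points from
        the two uni-directional alignments."""
        inter = src2tgt & tgt2src              # high-confidence core
        union = src2tgt | tgt2src
        grown = set(inter)
        for (i, j) in sorted(union - inter):   # single pass for brevity
            # accept a union point only if it neighbours an accepted one
            if any((i + di, j + dj) in grown
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                grown.add((i, j))
        return grown

    def extract_phrases(src, tgt, alignment, max_len=7):
        """Keep every (source span, target span) pair consistent with
        the alignment: no point links inside the pair to outside it."""
        phrases = Counter()
        for i1 in range(len(src)):
            for i2 in range(i1, min(i1 + max_len, len(src))):
                cols = [j for (i, j) in alignment if i1 <= i <= i2]
                if not cols:
                    continue
                j1, j2 = min(cols), max(cols)
                if all(i1 <= i <= i2
                       for (i, j) in alignment if j1 <= j <= j2):
                    phrases[" ".join(src[i1:i2 + 1]),
                            " ".join(tgt[j1:j2 + 1])] += 1
        return phrases

    # Phrase probabilities by relative frequency over the corpus:
    #   p(t | s) = count(s, t) / sum over t' of count(s, t')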

  14. Experiments: Data Resources • Made use of French-English training and testing sets of the Europarl corpus (Koehn, 2005) • Extracted training data from designated training sets, filtering based on sentence length and relative sentence length. • For testing, randomly extracted 5000 sentences from the Europarl common test set. • Avg. sentence lengths: 20.5 words (French), 19.0 words (English)
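
A plausible reconstruction of the filtering step; the exact thresholds below are assumptions, as the slide does not give them:

    # Standard Europarl-style cleaning: drop pairs that are empty,
    # overly long, or very unbalanced in length (thresholds assumed).
    def keep_pair(src_words, tgt_words, max_len=40, max_ratio=1.5):
        ls, lt = len(src_words), len(tgt_words)
        if ls == 0 or lt == 0 or max(ls, lt) > max_len:
            return False
        return max(ls, lt) / min(ls, lt) <= max_ratio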

  15. EBMT vs PBSMT • Compared the performance of our Marker-Based EBMT system against that of a PBSMT system built using: • the Pharaoh phrase-based decoder (Koehn, 2003) • the SRI LM toolkit • the refined alignment strategy (Och & Ney, 2003) • Trained on incremental data sets, tested on the 5000-sentence test set • Performed translation for French-English and English-French

  16. EBMT vs PBSMT: French-English • Doubling the amount of training data improves performance across the board for both EBMT and PBSMT • The PBSMT system clearly outperforms the EBMT system, achieving a BLEU score on average 0.07 higher • PBSMT also achieves a significantly lower WER (e.g. 68.55 vs. 82.43 for the 322K data set) • Increasing the amount of training data results in: • a 3-5% relative BLEU increase for PBSMT • a 6.2-10.3% relative BLEU increase for EBMT [Results chart for the 78K, 156K and 322K training sets]

  17. EBMT vs PBSMT: English-French • PBSMT continues to outperform the EBMT system by some distance • e.g. 0.1933 vs 0.1488 BLEU score, 0.518 vs 0.4578 recall for the 322K data set • The difference between scores is somewhat smaller for English-French than for French-English • EBMT system performance is much more consistent across the two directions • The PBSMT system performs 2% BLEU (10% relative) worse for English-French than for French-English • French-English is ‘easier’: fewer agreement errors and fewer problems with boundary friction, e.g. le -> the (French-English) vs. the -> le, la, les, l’ (English-French) [Results chart for the 78K, 156K and 322K training sets]

  18. Hybrid System Experiments • Decided to merge elements of the EBMT marker-based alignments with the PBSMT phrases and words induced via GIZA++ • A number of hybrid systems: • LEX-EBMT: replaced the EBMT lexicon with the higher-quality PBSMT word alignments, to lower WER • H-EBMT vs H-PBSMT: merged PBSMT words and phrases with the EBMT data (words and phrases) and passed the resulting data to the baseline EBMT and baseline PBSMT systems • EBMT-LM and H-EBMT-LM: reranked the output of the EBMT and H-EBMT systems using the PBSMT system’s equivalent language model
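
The reranking in EBMT-LM and H-EBMT-LM can be pictured as follows. The linear score combination and its weight are assumptions, and lm_logprob stands in for a real language-model scorer (e.g. one built with the SRI LM toolkit):

    # Sketch of n-best reranking with a language model.
    def rerank(nbest, lm_logprob, weight=0.5):
        """nbest: list of (translation, system_score) pairs; returns the
        candidates re-sorted by the combined score, best first."""
        scored = [(weight * score + (1.0 - weight) * lm_logprob(trans), trans)
                  for trans, score in nbest]
        return [trans for _, trans in sorted(scored, reverse=True)]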

  19. Hybrid Experiments: French-English • Use of the improved lexicon (LEX-EBMT) leads to only slight improvements (average relative BLEU increase of 2.9%) • Adding the hybrid data improves on the baselines for both EBMT (H-EBMT) and PBSMT (H-PBSMT) • The H-PBSMT system trained on 78K and 156K achieves a higher BLEU score than the PBSMT system trained on twice as much data • The addition of the language model to the H-EBMT system helps guide word order after lexical selection and thus improves results further

  20. Hybrid Experiments: English-French • Results for English-French are similar to those for French-English • Using the hybrid data set we get a 15% average relative increase in BLEU score for the EBMT system, and 6.2% for the H-PBSMT system over its baseline • The H-EBMT system performs almost as well as the baseline system trained on over 4 times the amount of data

  21. Conclusions • In Groves & Way (2005), we showed how an EBMT system outperforms a PBSMT system when trained on the Sun Microsystems data set • This time around, the baseline PBSMT system achieves higher quality than all variants of the EBMT system • heterogeneous Europarl vs. homogeneous Sun data • Chunk coverage is lower on the Europarl data set: 6% of translations produced using chunks alone (Sun) vs. 1% (Europarl) • The EBMT system considered on average only 13 words for direct translation • Significant improvements seen when using the higher-quality lexicon • Improvements also seen when the LM is introduced • The H-PBSMT system is able to outperform the baseline PBSMT system • Further gains to be made from hybrid corpus-based approaches

  22. Future Work • Automatic detection of marker words • Plan to increase levels of hybridity: • Code a simple EBMT decoder, factoring in the Marker-Based recombination approach along with probabilities (rather than weights) • Use exact sentence matching as in EBMT, along with statistical weighting of knowledge sources • Integration of generalized templates into the PBSMT system • Use Good-Turing methods to assign probabilities to fuzzy matches • Often a fuzzy chunk match may be preferable to a word-for-word translation • Plan to code a robust, wide-coverage statistical EBMT system • Make use of EBMT principles in a statistically-driven system
