
Hybridity in MT: Experiments on the Europarl Corpus


Presentation Transcript


  1. Hybridity in MT: Experiments on the Europarl Corpus Declan Groves 24th May, NCLT Seminar Series 2006

  2. Outline • Example-Based Machine Translation • Marker-Based EBMT • Statistical Machine Translation • Phrasal Extraction • Experiments: • Data Sources Used • EBMT vs PBSMT • Hybrid System Experiments • Improving EBMT lexicon • Making use of merged data sets • Conclusions • Future Work

  3. Example-Based MT • As with SMT, makes use of information extracted from a sententially aligned bilingual corpus. In general: • SMT uses only the estimated model parameters and throws away the data itself • EBMT makes use of the linguistic units directly • During translation: • The source side of the bitext is searched for close matches • Source-target sub-sentential links are determined • Relevant target fragments are retrieved and recombined to derive the final translation

  4. Example-Based MT: An Example • Assumes an aligned bilingual corpus of examples against which input text is matched • Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)


  6. Example-Based MT: An Example • Identify useful fragments

  7. Example-Based MT: An Example • Identify useful fragments • Recombination depends on nature of examples used: on Monday: lundi; John went to: Jean est allé à; the baker’s: la boulangerie

  8. Marker-Based EBMT “The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” (Green, 1979) • Universal psycholinguistic constraint: languages are marked for syntactic structure at surface level by a closed set of lexemes or morphemes • Use a set of closed-class marker words to segment aligned source and target sentences during a pre-processing stage

  9. Marker-Based EBMT • Source-target sentence pairs are marked with their Marker categories EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection FR: <PRON> vous cliquez <PRON> sur appliquer <PREP> pour visualiser <DET> l’effet <PREP> de <DET> la sélection • Aligned source-target chunks are created by segmenting the sentences based on these marker tags, along with cognate and word co-occurrence information: <PRON> you click apply: <PRON> vous cliquez sur appliquer <PREP> to view: <PREP> pour visualiser <DET> the effect: <DET> l’effet <PREP> of the selection: <PREP> de la sélection
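
To make the segmentation step concrete, here is a minimal sketch in Python. The marker set and its category labels are illustrative assumptions, not the system's actual resources, and the cognate/co-occurrence alignment of source chunks to target chunks is omitted:

    # Minimal sketch of marker-based segmentation: open a new chunk at
    # each closed-class marker word and label it with that marker's tag.
    MARKERS = {"you": "PRON", "it": "PRON", "to": "PREP", "of": "PREP",
               "the": "DET", "a": "DET"}

    def segment(sentence):
        chunks, current = [], None
        for word in sentence.lower().split():
            if word in MARKERS:
                if current is not None:
                    chunks.append(current)
                current = [MARKERS[word], word]   # open a new chunk
            elif current is None:
                current = ["LEX", word]           # sentence-initial material
            else:
                current.append(word)
        if current is not None:
            chunks.append(current)
        return chunks

    print(segment("you click apply to view the effect of the selection"))
    # [['PRON', 'you', 'click', 'apply'], ['PREP', 'to', 'view'],
    #  ['DET', 'the', 'effect'], ['PREP', 'of'], ['DET', 'the', 'selection']]

The actual system additionally requires each chunk to contain at least one non-marker word, which is why "<PREP> of the selection" above remains a single chunk; that constraint is left out of the sketch for brevity.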

  10. Marker-Based EBMT • Chunks containing only one non-marker word in both source and target languages can then be used to extract a word-level lexicon: <PREP> to: <PREP> pour <LEX> view: <LEX> visualiser <LEX> effect: <LEX> effet <DET> the: <DET> l <PREP> of: <PREP> de • In a final pre-processing stage, we produce a set of generalized marker templates by replacing marker words with their tags: <PRON> click apply: <PRON> cliquez sur appliquer <PREP> view: <PREP> visualiser <DET> effect: <DET> effet <PREP> the selection: <PREP> la sélection • Any marker tag pair can now be inserted at the appropriate tag location • More general examples add flexibility to the matching process and improve coverage (and quality)
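
A sketch of these two pre-processing steps, under the assumption that an aligned chunk pair is represented as (tag, source_words, target_words); the marker word list is again illustrative:

    # Lexicon extraction and template generalisation (illustrative).
    MARKER_WORDS = {"to", "of", "the", "pour", "de", "la", "l"}

    def extract_lexicon(chunk_pairs):
        """Chunks with exactly one non-marker word on each side yield
        both a marker-word entry and a <LEX> entry."""
        lexicon = []
        for tag, src, tgt in chunk_pairs:
            src_lex = [w for w in src if w not in MARKER_WORDS]
            tgt_lex = [w for w in tgt if w not in MARKER_WORDS]
            if len(src_lex) == 1 and len(tgt_lex) == 1:
                lexicon.append((tag, src[0], tgt[0]))          # marker pair
                lexicon.append(("LEX", src_lex[0], tgt_lex[0]))  # word pair
        return lexicon

    def generalise(chunk_pairs):
        """Replace the chunk-initial marker word by its tag to obtain a
        template; any chunk with the same tag can later fill the slot."""
        return [(tag, src[1:], tgt[1:]) for tag, src, tgt in chunk_pairs]

    pairs = [("PREP", ["to", "view"], ["pour", "visualiser"])]
    print(extract_lexicon(pairs))
    # [('PREP', 'to', 'pour'), ('LEX', 'view', 'visualiser')]
    print(generalise(pairs))
    # [('PREP', ['view'], ['visualiser'])]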

  11. Marker-Based EBMT • During translation: • Resources are searched from maximal (specific source-target sentence pairs) to minimal context (word-for-word translation) • Retrieved example translation candidates are recombined, along with their weights, based on source sentence order • System outputs an n-best list of translations
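
The maximal-to-minimal back-off search can be pictured with a small greedy sketch. The real system weights competing candidates and returns an n-best list; here a single greedy translation is returned for clarity, and the resource dictionaries are hypothetical:

    # Greedy sketch of the back-off search: whole sentence first, then
    # the longest matching chunk, then word-for-word from the lexicon.
    def translate(words, sentence_memory, chunk_memory, lexicon):
        key = " ".join(words)
        if key in sentence_memory:              # maximal context
            return sentence_memory[key]
        output, i = [], 0
        while i < len(words):
            for j in range(len(words), i, -1):  # longest chunk match first
                span = " ".join(words[i:j])
                if span in chunk_memory:
                    output.append(chunk_memory[span])
                    i = j
                    break
            else:                               # minimal context
                output.append(lexicon.get(words[i], words[i]))
                i += 1
        return " ".join(output)

    chunks = {"to view": "pour visualiser", "the effect": "l’effet"}
    print(translate("to view the effect".split(), {}, chunks, {}))
    # pour visualiser l’effet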

  12. Phrase-Based SMT • Translation models now estimate both word-to-word and phrasal translation probabilities (additionally allowing many-to-one and many-to-many word mappings) • Phrases incorporate some notion of local syntax, capturing more meaningful relationships between words within phrases • To extract phrases, we can make use of word alignments

  13. SMT Phrasal Extraction • Perform word alignment in both source-target and target-source directions • Take the intersection of the uni-directional alignments, producing a set of high-confidence word alignments • Extend the intersection iteratively into the union by adding adjacent alignments within the alignment space (Och & Ney, 2003; Koehn et al., 2003) • Extract all phrase pairs from the sentence pairs that are consistent with these alignments (possibly including full sentences) • Phrase probabilities can be calculated from relative frequencies • The phrases and their probabilities make up the phrase translation table (the translation model)
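
A compressed sketch of this pipeline follows. It is a simplification of the cited heuristics: the growing step makes a single pass over the union rather than iterating to convergence, and the extension to unaligned boundary words is omitted:

    from collections import Counter

    def symmetrise(src2tgt, tgt2src):
        """src2tgt, tgt2src: sets of (i, j) word-alignment points from
        the two uni-directional alignments."""
        inter = src2tgt & tgt2src              # high-confidence core
        union = src2tgt | tgt2src
        grown = set(inter)
        for (i, j) in sorted(union - inter):   # single pass for brevity
            # accept a union point only if it neighbours an accepted one
            if any((i + di, j + dj) in grown
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                grown.add((i, j))
        return grown

    def extract_phrases(src, tgt, alignment, max_len=7):
        """Keep every (source span, target span) pair consistent with
        the alignment: no point links inside the pair to outside it."""
        phrases = Counter()
        for i1 in range(len(src)):
            for i2 in range(i1, min(i1 + max_len, len(src))):
                cols = [j for (i, j) in alignment if i1 <= i <= i2]
                if not cols:
                    continue
                j1, j2 = min(cols), max(cols)
                if all(i1 <= i <= i2
                       for (i, j) in alignment if j1 <= j <= j2):
                    phrases[" ".join(src[i1:i2 + 1]),
                            " ".join(tgt[j1:j2 + 1])] += 1
        return phrases

    # Phrase probabilities by relative frequency over the corpus:
    #   p(t | s) = count(s, t) / sum over t' of count(s, t')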

  14. Experiments: Data Resources • Made use of French-English training and testing sets of the Europarl corpus (Koehn, 2005) • Extracted training data from designated training sets, filtering based on sentence length and relative sentence length. • For testing, randomly extracted 5000 sentences from the Europarl common test set. • Avg. sentence lengths: 20.5 words (French), 19.0 words (English)
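
A plausible reconstruction of the filtering step; the exact thresholds below are assumptions, as the slide does not give them:

    # Standard Europarl-style cleaning: drop pairs that are empty,
    # overly long, or very unbalanced in length (thresholds assumed).
    def keep_pair(src_words, tgt_words, max_len=40, max_ratio=1.5):
        ls, lt = len(src_words), len(tgt_words)
        if ls == 0 or lt == 0 or max(ls, lt) > max_len:
            return False
        return max(ls, lt) / min(ls, lt) <= max_ratio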

  15. EBMT vs PBSMT • Compared the performance of our Marker-Based EBMT system against that of a PBSMT system built using: • the Pharaoh phrase-based decoder (Koehn, 2003) • the SRI LM toolkit • the refined alignment strategy (Och & Ney, 2003) • Trained on incremental data sets, tested on the 5000-sentence test set • Performed translation for French-English and English-French

  16. EBMT vs PBSMT: French-English • Doubling the amount of training data improves performance across the board for both EBMT and PBSMT • The PBSMT system clearly outperforms the EBMT system, achieving a BLEU score on average 0.07 higher • PBSMT also achieves a significantly lower WER (e.g. 68.55 vs. 82.43 for the 322K data set) • Increasing the amount of training data results in: • a 3-5% relative BLEU increase for PBSMT • a 6.2-10.3% relative BLEU increase for EBMT [Results chart for the 78K, 156K and 322K training sets]

  17. EBMT vs PBSMT: English-French • PBSMT continues to outperform the EBMT system by some distance • e.g. 0.1933 vs 0.1488 BLEU score, 0.518 vs 0.4578 recall for the 322K data set • The difference between scores is somewhat smaller for English-French than for French-English • EBMT system performance is much more consistent across the two directions • The PBSMT system performs 2% BLEU (10% relative) worse for English-French than for French-English • French-English is ‘easier’: fewer agreement errors and fewer problems with boundary friction, e.g. le -> the (French-English) vs. the -> le, la, les, l’ (English-French) [Results chart for the 78K, 156K and 322K training sets]

  18. Hybrid System Experiments • Decided to merge elements of the EBMT marker-based alignments with the PBSMT phrases and words induced via GIZA++ • A number of hybrid systems: • LEX-EBMT: replaced the EBMT lexicon with the higher-quality PBSMT word alignments, to lower WER • H-EBMT vs H-PBSMT: merged PBSMT words and phrases with the EBMT data (words and phrases) and passed the resulting data to the baseline EBMT and baseline PBSMT systems • EBMT-LM and H-EBMT-LM: reranked the output of the EBMT and H-EBMT systems using the PBSMT system’s equivalent language model
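
The reranking in EBMT-LM and H-EBMT-LM can be pictured as follows. The linear score combination and its weight are assumptions, and lm_logprob stands in for a real language-model scorer (e.g. one built with the SRI LM toolkit):

    # Sketch of n-best reranking with a language model.
    def rerank(nbest, lm_logprob, weight=0.5):
        """nbest: list of (translation, system_score) pairs; returns the
        candidates re-sorted by the combined score, best first."""
        scored = [(weight * score + (1.0 - weight) * lm_logprob(trans), trans)
                  for trans, score in nbest]
        return [trans for _, trans in sorted(scored, reverse=True)]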

  19. Hybrid Experiments: French-English • Use of the improved lexicon (LEX-EBMT) leads to only slight improvements (average relative BLEU increase of 2.9%) • Adding the hybrid data improves on the baselines for both EBMT (H-EBMT) and PBSMT (H-PBSMT) • The H-PBSMT system trained on 78K and 156K achieves a higher BLEU score than the PBSMT system trained on twice as much data • The addition of the language model to the H-EBMT system helps guide word order after lexical selection and thus improves results further

  20. Hybrid Experiments: English-French • Results for English-French are similar to those for French-English • Using the hybrid data set we get a 15% average relative increase in BLEU score for the EBMT system, and 6.2% for the H-PBSMT system over its baseline • The H-EBMT system performs almost as well as the baseline system trained on over 4 times the amount of data

  21. Conclusions • In Groves & Way (2005), we showed how an EBMT system outperforms a PBSMT system when trained on the Sun Microsystems data set • This time around, the baseline PBSMT system achieves higher quality than all variants of the EBMT system • heterogeneous Europarl vs. homogeneous Sun data • Chunk coverage is lower on the Europarl data set: 6% of translations produced using chunks alone (Sun) vs. 1% (Europarl) • The EBMT system considered on average only 13 words for direct translation • Significant improvements seen when using the higher-quality lexicon • Improvements also seen when the LM is introduced • The H-PBSMT system is able to outperform the baseline PBSMT system • Further gains to be made from hybrid corpus-based approaches

  22. Future Work • Automatic detection of marker words • Plan to increase levels of hybridity: • Code a simple EBMT decoder, factoring in the Marker-Based recombination approach along with probabilities (rather than weights) • Use exact sentence matching as in EBMT, along with statistical weighting of knowledge sources • Integration of generalized templates into the PBSMT system • Use Good-Turing methods to assign probabilities to fuzzy matches • Often a fuzzy chunk match may be preferable to a word-for-word translation • Plan to code a robust, wide-coverage statistical EBMT system • Make use of EBMT principles in a statistically-driven system
