
Using Comparable Corpora to Adapt a Translation Model to Domains




  1. The 7th International Conference on Language Resources and Evaluation, Malta, May 2010. Using Comparable Corpora to Adapt a Translation Model to Domains Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada Department of Computer Science, Shizuoka University

  2. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  3. Motivation and goal • Statistical machine translation • Learns a translation model from a parallel corpus • Suffers from the limited availability of large parallel corpora • Use comparable corpora for SMT • Estimate translation pseudo-probabilities from a bilingual dictionary and comparable corpora • Use the pseudo-probabilities estimated from in-domain comparable corpora to • Adapt a translation model learned from an out-of-domain parallel corpus, or • Augment a translation model learned from a small in-domain parallel corpus

  4. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  5. Basic idea for estimating word translation pseudo-probabilities from comparable corpora • Word associations suggest particular senses or translations of a polysemous word (Yarowsky 1993): (tank, soldier) → the “military vehicle” sense, i.e., the translation “戦車[SENSHA]” of “tank”; (tank, gasoline) → the “container for liquid or gas” sense, i.e., the translation “タンク[TANKU]” of “tank” • Comparable corpora allow us to determine which word associations suggest which translations of a polysemous word (Kaji & Morimoto 2002) • Assumption: the more word associations suggest a translation, the higher the probability of that translation

  6. Naive method for estimating word translation pseudo-probabilities • Extract word associations from the English corpus, e.g., (tank, fuel), (tank, gasoline), (tank, missile), (tank, soldier), and from the Japanese corpus, e.g., (タンク[TANKU], 燃料[NENRYOU]), (タンク[TANKU], ガソリン[GASORIN]), (戦車[SENSHA], ミサイル[MISAIRU]), (戦車[SENSHA], 兵士[HEISHI]) • Align the word associations via the English-Japanese dictionary: “fuel”, “gasoline”, and others suggest “タンク[TANKU]”; “missile”, “soldier”, and others suggest “戦車[SENSHA]” • Calculate the percentage of associated words suggesting each translation: Pps(タンク[TANKU]|tank) = |{fuel, gasoline, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|), Pps(戦車[SENSHA]|tank) = |{missile, soldier, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|)
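
To make the percentage computation above concrete, here is a minimal Python sketch of the naive estimate. The toy association sets stand in for what would be extracted from the two corpora and aligned via the English-Japanese dictionary; the data and names are illustrative, not taken from the paper.

```python
# Naive pseudo-probability estimate: the share of associated words of "tank"
# that suggest each Japanese translation. Toy data for illustration only.
aligned_associations = {
    "タンク": {"fuel", "gasoline"},    # container-for-liquid-or-gas sense
    "戦車": {"missile", "soldier"},    # military-vehicle sense
}

def naive_pseudo_probabilities(aligned):
    """P_ps(translation | word) = |words suggesting it| / |all associated words|."""
    total = sum(len(words) for words in aligned.values())
    return {t: len(words) / total for t, words in aligned.items()}

print(naive_pseudo_probabilities(aligned_associations))
# -> {'タンク': 0.5, '戦車': 0.5}
```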

  7. Difficulties the naive method suffers from • Failure in word-association alignment: (tank, Chechen) → ?, because the disparity in topical coverage between the two language corpora leaves no counterpart association; (tank, Chechen) cannot be aligned with (戦車[SENSHA], チェチェン[CHECHEN]) because of the incomplete coverage of the intermediary bilingual dictionary • Incorrect word-association alignment: (tank, troop) → (水槽[SUISOU], 群れ[MURE]), due to an incidental word-for-word correspondence between word associations that do not really correspond to each other

  8. How to overcome the difficulties • Two words associated with a third word are likely to suggest the same sense or translation of the third word when they are also associated with each other: “soldier” and “troop”, both of which are associated with “tank”, are associated with each other, so “soldier” and “troop” suggest the same translation “戦車[SENSHA]” • Define the correlation between an associated word and a translation using the correlations between the other associated words and the translation: C(troop, 戦車[SENSHA]) ∝ MI(troop, tank) × {MI(troop, soldier) × C(soldier, 戦車[SENSHA]) + MI(troop, missile) × C(missile, 戦車[SENSHA]) + …}, and C(troop, タンク[TANKU]) ∝ MI(troop, tank) × {MI(troop, soldier) × C(soldier, タンク[TANKU]) + MI(troop, missile) × C(missile, タンク[TANKU]) + …}

  9. Calculate the correlations iteratively, starting with initial values C0(associated_word, translation) determined from the results of word-association alignment via the bilingual dictionary
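
A rough Python sketch of how such an iteration could be organized, assuming the pointwise mutual information scores have already been computed; the per-word normalization and the fixed number of iterations are illustrative simplifications, not the paper's exact formulation.

```python
# Iterative update C_k(a, t) ∝ MI(a, e) * Σ_b MI(a, b) * C_{k-1}(b, t),
# starting from C_0 given by dictionary-based word-association alignment.
def iterate_correlations(assoc_words, translations, mi_with_headword,
                         mi_between, c0, iterations=10):
    c = dict(c0)  # keys: (associated_word, translation) -> correlation
    for _ in range(iterations):
        new_c = {}
        for a in assoc_words:
            for t in translations:
                new_c[(a, t)] = mi_with_headword.get(a, 0.0) * sum(
                    mi_between.get((a, b), 0.0) * c.get((b, t), 0.0)
                    for b in assoc_words if b != a
                )
            # Normalize over translations so each associated word's
            # correlations form a distribution (an illustrative choice).
            z = sum(new_c[(a, t)] for t in translations) or 1.0
            for t in translations:
                new_c[(a, t)] /= z
        c = new_c
    return c
```

The resulting matrix is then turned into pseudo-probabilities by assigning each associated word to its highest-correlation translation, as slide 10 describes.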

  10. Overview of our method for estimating noun translation pseudo-probabilities • Extract pairs of words co-occurring in a window (window size = 10 content words) from the English corpus and from the Japanese corpus • Calculate pointwise mutual information to obtain English word associations and Japanese word associations • Align the word associations via the English-Japanese dictionary to obtain the initial value of the correlation matrix of English associated words vs. Japanese translations for an English noun • Calculate the pairwise correlations between associated words and translations iteratively to obtain the correlation matrix of associated words vs. translations • Assign each associated word to the translation with which it has the highest correlation and calculate the percentage of associated words assigned to each translation, yielding the noun translation pseudo-probabilities
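
As a concrete illustration of the monolingual preprocessing step, the sketch below extracts co-occurring word pairs within a window and scores them with pointwise mutual information. The 10-content-word window follows the slide; tokenization, content-word filtering, and the frequency threshold are simplified assumptions.

```python
import math
from collections import Counter

def pmi_associations(sentences, window=10, min_pair_count=2):
    """Score word pairs co-occurring within a window by pointwise MI."""
    word_counts, pair_counts, total_words = Counter(), Counter(), 0
    for tokens in sentences:           # tokens: content words of one sentence
        total_words += len(tokens)
        word_counts.update(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    total_pairs = sum(pair_counts.values()) or 1
    pmi = {}
    for (w, v), c in pair_counts.items():
        if c < min_pair_count:
            continue                   # drop rare, unreliable pairs
        p_wv = c / total_pairs
        p_w = word_counts[w] / total_words
        p_v = word_counts[v] / total_words
        pmi[(w, v)] = math.log(p_wv / (p_w * p_v))
    return pmi
```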

  11. Example correlation matrix and estimated noun translation pseudo-probabilities

  12. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  13. Our method for estimating noun-sequence translation pseudo-probabilities • Extract a noun sequence F = f1 f2 … fm with its frequency from the source-language corpus • Generate all compositional translations E(1) = e1(1) e2(1) … em(1), E(2) = e1(2) e2(2) … em(2), …, E(n) = e1(n) e2(n) … em(n) using the English-Japanese dictionary • Retrieve the compositional translations in the target-language corpus and count their frequencies • Estimate pseudo-probabilities according to the constituent-word translation pseudo-probabilities and according to the occurrence frequencies, and combine the two estimates
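
The generation and constituent-word scoring steps can be sketched as follows; the combination with target-corpus frequencies is only hinted at, since the slides do not give the exact formula, and all function names are illustrative.

```python
from itertools import product

def compositional_translations(noun_sequence, dictionary):
    """All word-by-word translations E = e1 e2 ... em of F = f1 f2 ... fm."""
    options = [dictionary.get(f, []) for f in noun_sequence]
    return list(product(*options))

def constituent_score(noun_sequence, candidate, pseudo_prob):
    """Product of per-word pseudo-probabilities P_ps(e | f)."""
    score = 1.0
    for f, e in zip(noun_sequence, candidate):
        score *= pseudo_prob.get((f, e), 0.0)
    return score

# The frequency-based estimate would weight each candidate by how often it
# occurs in the target-language corpus; the two estimates are then combined.
```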

  14. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  15. Phrase-based SMT using translation pseudo-probabilities • Learn a basic phrase table from the out-of-domain (or in-domain) parallel corpus with GIZA++ and heuristics • Estimate translation pseudo-probabilities from the bilingual dictionary and the in-domain source-language and target-language corpora, giving an in-domain phrase table (pseudo-probabilities) • Merge the two tables into an adapted (or augmented) phrase table • Learn an in-domain language model from the in-domain target-language corpus with SRILM • Decode source-language text into target-language text with the Moses decoder, using the merged phrase table and the in-domain language model
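
A minimal sketch of the merging step, assuming the usual Moses text layout of "source ||| target ||| scores" per line; how overlapping entries and feature columns are actually combined is a design choice the slides do not spell out, so the code below simply adds the in-domain entries that the basic table lacks.

```python
def load_phrase_table(path):
    """Read 'source ||| target ||| scores ...' lines into a dictionary."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ||| ")
            table[(fields[0], fields[1])] = fields[2:]
    return table

def merge_tables(basic, in_domain):
    """Keep the basic table and add in-domain entries it does not cover."""
    merged = dict(basic)
    for key, fields in in_domain.items():
        merged.setdefault(key, fields)
    return merged
```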

  16. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  17. Experimental setting • Experiment A: adapt a phrase table learned from an out-of-domain parallel corpus by using in-domain comparable corpora • Experiment B: augment a phrase table learned from a small in-domain parallel corpus by using larger in-domain comparable corpora

  18. Our method in four cases using different volumes of comparable corpora: Japanese: all, English: all; Japanese: half, English: all; Japanese: all, English: half; Japanese: half, English: half • Two baseline methods using the phrase table learned from the parallel corpus: baseline without dictionary; baseline with dictionary (the phrase table was augmented with the bilingual dictionary) • [Note] The target-language language model learned from the whole target-language monolingual corpus was used in all cases, for both our method and the baseline methods • Evaluation metric: BLEU-4
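
For reference, a BLEU-4 score over a decoded test set can be computed, for example, with NLTK's corpus_bleu, which uses uniform 4-gram weights by default; the tokenized toy sentences below are purely illustrative and unrelated to the paper's test data.

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of reference translations per test sentence, plus one hypothesis.
references = [[["the", "tank", "was", "filled", "with", "fuel"]]]
hypotheses = [["the", "tank", "was", "filled", "with", "gasoline"]]

print(corpus_bleu(references, hypotheses))  # default weights give BLEU-4
```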

  19. Experimental results • Our method improved the BLEU-4 score, though only slightly • The effect of the difference in volume of comparable corpora remains unclear • Simply adding a bilingual dictionary improved the out-of-domain phrase table, but did not improve the in-domain phrase table

  20. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  21. Discussion • Optimization of the parameters: parameters, including the window size and the thresholds for word occurrence frequency, co-occurrence frequency, and pointwise mutual information, affect the correlation matrix of associated words vs. translations; how to optimize their values remains unsolved • Alternatives for the word-association measure: pointwise mutual information, which tends to overestimate low-frequency words, is not the most suitable measure for acquiring word associations and needs to be compared with alternatives such as the log-likelihood ratio and the Dice coefficient
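
As a concrete point of comparison, the Dice coefficient mentioned above can be computed directly from the same co-occurrence counts used for PMI; unlike PMI, it is bounded in [0, 1] and does not blow up for rare pairs. The counts below are made up for illustration.

```python
def dice(pair_count, count_w, count_v):
    """Dice(w, v) = 2 * f(w, v) / (f(w) + f(v))."""
    return 2.0 * pair_count / (count_w + count_v)

print(dice(pair_count=50, count_w=200, count_v=300))  # frequent pair -> 0.2
print(dice(pair_count=1, count_w=4, count_v=6))       # rare pair     -> 0.2
```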

  22. Refinement of the definition of translation pseudo-probability • Need to consider the frequencies of associated words as well as the dependence among associated words • Need to reconsider the strategy of assigning an associated word to only one translation • Estimation of verb translation pseudo-probabilities • Need to use syntactic co-occurrence, instead of co-occurrence in a window, to extract verb-noun associations from corpora • Need to define the pairwise correlation between associated nouns and translations recursively, based on the heuristic that two nouns associated with a verb are likely to suggest the same sense of the verb when they belong to the same semantic class

  23. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  24. Related work • Many studies on bilingual lexicon acquisition from bilingual comparable corpora have been reported since the mid-1990s, but few studies address the estimation of word translation probabilities from bilingual comparable corpora • Estimating word translation probabilities from comparable corpora with an EM algorithm (Koehn & Knight 2000) could be greatly affected by the occurrence frequencies of translation candidates in the TL corpus; in contrast, our method produces translation pseudo-probabilities that reflect the distribution of the senses of the SL word in the SL corpus • Methods for extracting parallel sentence pairs from bilingual comparable corpora (Zhao & Vogel, 2002; Utiyama & Isahara, 2003; Fung & Cheung, 2004; Munteanu & Marcu, 2005): the extracted parallel sentences could be used to learn a translation model with a conventional method based on word-for-word alignment, but this approach is applicable only to closely comparable corpora; in contrast, our method is applicable even to a pair of unrelated monolingual corpora

  25. Overview • Motivation and goal • Proposed method • Estimating noun translation pseudo-probabilities • Estimating noun-sequence translation pseudo-probabilities • Phrase-based SMT using translation pseudo-probabilities • Experiments • Discussion • Related work • Summary

  26. Summary • A method for estimating translation pseudo-probabilities from a bilingual dictionary and bilingual comparable corpora was created • Assumption: the more associated words a translation is correlated with, the higher its translation probability • Essence of the method: calculate pairwise correlations between the associated words of an SL word and its TL translations • A phrase-based SMT framework using an out-of-domain parallel corpus and in-domain comparable corpora was proposed • An experiment showed promising results; the BLEU score was improved by using the translation pseudo-probabilities estimated from in-domain comparable corpora • Future work includes optimizing the parameters and extending the method to estimate translation pseudo-probabilities for verbs
