
Human Language Technology


Presentation Transcript


1. Human Language Technology: Statistical MT

2. Approaching MT
• There are many different ways of approaching the problem of MT.
• The choice of approach is complex and depends upon:
• Task requirements
• Human resources
• Linguistic resources

3. Criterial Issues
• Do we want a translation system for one language pair or for many language pairs?
• Can we assume a constrained vocabulary, or do we need to deal with arbitrary text?
• What resources exist for the languages that we are dealing with?
• How long will it take to develop the necessary resources, and with what human resources?

4. Parallel Data
• Lots of translated text is available: hundreds of millions of words of translated text for some language pairs.
• A book has a few hundred thousand words.
• An educated person may read 10,000 words a day: 3.5 million words a year, 300 million in a lifetime.
• Computers can therefore see more translated text than a human reads in a lifetime.
• A machine can learn how to translate foreign languages. [Koehn 2006]

5. Statistical Translation
• Robust
• Domain independent
• Extensible
• Does not require language specialists
• Does require parallel texts
• Uses the noisy channel model of translation

6. Noisy Channel Model: Sentence Translation (Brown et al. 1990)
[Diagram: a source sentence passes through a noisy channel and is observed as the target sentence]

7. Statistical Modelling
• Observation = T. Learn P(T|S) from a parallel corpus.
• There is not sufficient data to estimate P(T|S) directly. [from Koehn 2006]

8. The Problem of Translation
• Given a sentence T of the target language, seek the source sentence S from which a translator produced T, i.e. find S that maximises P(S|T).
• By Bayes' theorem, P(S|T) = P(S) × P(T|S) / P(T), whose denominator is independent of S.
• Hence it suffices to maximise P(S) × P(T|S).

9. The Three Components of a Statistical MT Model
• A method for computing language model probabilities P(S).
• A method for computing translation probabilities P(T|S).
• A method for searching amongst source sentences for one that maximises P(S) × P(T|S).
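
Slides 8 and 9 amount to a one-line decision rule: pick the S with the best combined score. A minimal Python sketch, with invented toy probability tables (the lm and tm values below are not from the slides), scores candidates in log space and keeps the argmax:

import math

# Toy noisy-channel decision rule. All probability values are invented
# for illustration; a real system derives them from trained models.
lm = {"John loves Mary": 1e-6, "John love Mary": 1e-9}          # P(S)
tm = {("Jean aime Marie", "John loves Mary"): 0.05,             # P(T|S)
      ("Jean aime Marie", "John love Mary"): 0.06}

def best_source(target, candidates):
    # argmax over S of log P(S) + log P(T|S)
    return max(candidates, key=lambda s: math.log(lm[s]) + math.log(tm[(target, s)]))

print(best_source("Jean aime Marie", list(lm)))
# -> John loves Mary: the language model outweighs the slightly higher
#    translation probability of the ungrammatical candidate

The ungrammatical candidate has the higher translation probability, but the language model overrules it: exactly the division of labour the noisy channel model is designed to exploit.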

10. A Statistical MT System
[Diagram: the language model supplies P(S) and the translation model supplies P(T|S); given T, the decoder outputs the S that maximises P(S) × P(T|S)]

11. Three Kinds of Model

12. Language Models Based on N-Grams of Words
• General: P(s1 s2 ... sn) = P(s1) × P(s2|s1) × ... × P(sn|s1 ... s(n-1))
• Trigram: P(s1 s2 ... sn) = P(s1) × P(s2|s1) × P(s3|s1 s2) × ... × P(sn|s(n-2) s(n-1))
• Bigram: P(s1 s2 ... sn) = P(s1) × P(s2|s1) × ... × P(sn|s(n-1))
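
As a concrete instance of the bigram formula, here is a minimal unsmoothed bigram model in Python, trained on the toy corpus that appears on slide 14 (the <s> start symbol and the corpus choice are our own; a real LM would need smoothing for unseen bigrams):

from collections import Counter

# Minimal unsmoothed bigram language model (sketch only).
corpus = ["the cat sleeps", "the dog sleeps", "the horse eats"]

context, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    context.update(words[:-1])              # counts of each conditioning word
    bigrams.update(zip(words, words[1:]))

def p_bigram(sentence):
    # P(s1 ... sn) = P(s1|<s>) * P(s2|s1) * ... * P(sn|s(n-1))
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / context[prev]
    return p

print(p_bigram("the cat sleeps"))   # 1.0 * 1/3 * 1.0 = 0.333...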

13. Word-Based Translation Models
• The translation process is decomposed into smaller steps.
• Each step is tied to individual words.
• Based on the IBM Models [Brown et al., 1993]. [from Koehn 2006]

14. Word TM Derived from a Bitext
SOURCE            TARGET
the cat sleeps    le chat dort
the dog sleeps    le chien dort
the horse eats    le cheval mange

15. le chat dort (T) / the cat sleeps (S)

16. le chien dort (T) / the dog sleeps (S)

17. le cheval mange (T) / the horse eats (S)
Word translation probabilities P(t|s) estimated from the nine alignment links:
P(le|the) = 3/9, P(chat|cat) = 1/9, P(chien|dog) = 1/9, P(cheval|horse) = 1/9, P(dort|sleeps) = 2/9, P(mange|eats) = 1/9
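
The probabilities above can be reproduced by counting alignment links across the three sentence pairs. A short sketch, assuming the obvious positional one-to-one alignments from slides 15-17:

from collections import Counter

# Estimate P(t|s) as relative alignment-link frequency, as on the slide.
bitext = [("the cat sleeps", "le chat dort"),
          ("the dog sleeps", "le chien dort"),
          ("the horse eats", "le cheval mange")]

links = Counter()
for src, tgt in bitext:
    links.update(zip(src.split(), tgt.split()))   # positional alignment

total = sum(links.values())                        # 9 links in total
for (s, t), n in links.items():
    print(f"P({t}|{s}) = {n}/{total}")
# P(le|the) = 3/9, P(dort|sleeps) = 2/9, all other pairs 1/9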

18. Parameter Estimation
• Based on counting occurrences within monolingual and bilingual data.
• For the language model, we need only source language text.
• For the translation model, we need pairs of sentences that are translations of each other.
• Use the EM (Expectation Maximisation) algorithm (Baum 1972) to optimise the model parameters.

19. EM Algorithm
• Word alignments for the sentence pair ("a b c", "x y z") are formed from arbitrary pairings of words from the two sentences, and include (a.x, b.y, c.z), (a.z, b.y, c.x), etc.
• There is a large number of possible alignments, since we also allow pairings such as (ab.x, 0.y, c.z).

20. EM Algorithm
• Make an initial estimate of the parameters. This can be used to compute the probability of any possible word alignment.
• Re-estimate the parameters by ranking each possible alignment by its probability according to the initial guess.
• Repeated iterations assign ever greater probability to the set of sentences actually observed. The algorithm leads to a local maximum of the probability of the observed sentence pairs as a function of the model parameters.
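
Below is a compact sketch of this EM loop for IBM Model 1, i.e. the word translation probabilities only, without the fertility and distortion parameters of the full model; the toy bitext and variable names are ours:

from collections import defaultdict
from itertools import product

# EM for IBM Model 1 on the toy bitext: t[(f, e)] approximates P(f|e).
bitext = [("the cat sleeps".split(), "le chat dort".split()),
          ("the dog sleeps".split(), "le chien dort".split()),
          ("the horse eats".split(), "le cheval mange".split())]

e_vocab = {e for es, _ in bitext for e in es}
f_vocab = {f for _, fs in bitext for f in fs}
t = {(f, e): 1.0 / len(f_vocab) for e, f in product(e_vocab, f_vocab)}  # uniform initial estimate

for _ in range(20):
    count, total = defaultdict(float), defaultdict(float)
    for es, fs in bitext:                 # E-step: expected link counts
        for f in fs:
            z = sum(t[(f, e)] for e in es)
            for e in es:
                count[(f, e)] += t[(f, e)] / z
                total[e] += t[(f, e)] / z
    t = {fe: c / total[fe[1]] for fe, c in count.items()}   # M-step

print(round(t[("chat", "cat")], 3))   # rises towards 1.0 over the iterations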

21. Parameters for the IBM Translation Model
• Word translation probability P(t|s): the probability that source word s is translated as target word t.
• Fertility P(n|s): the probability that source word s is translated by n target words (0 ≤ n ≤ 25).
• Distortion P(i|j,l): the probability that the source word at position j is translated by the target word at position i in a target sentence of length l.

22. Experiment 1 (Brown et al. 1990)
• Hansard corpus: 40,000 pairs of sentences, approx. 800,000 words in each language.
• Considered the 9,000 most common words in each language.
• Assumptions (initial parameter values):
• each of the 9,000 target words is equally likely as a translation of each of the source words;
• each fertility from 0 to 25 is equally likely for each of the 9,000 source words;
• each target position is equally likely given each source position and target length.

23. English: the
French   Probability        Fertility   Probability
le       .610               1           .871
la       .178               0           .124
l'       .083               2           .004
les      .023
ce       .013
il       .012
de       .009
à        .007
que      .007

24. English: not
French        Probability   Fertility   Probability
pas           .469          2           .758
ne            .460          0           .133
non           .024          1           .106
pas du tout   .003
faux          .003
plus          .002
ce            .002
que           .002
jamais        .002

25. English: hear
French     Probability      Fertility   Probability
bravo      .992             0           .584
entendre   .005             1           .416
entendu    .002
entends    .001

26. Sentence Translation Probability
• Given a translation model for words, we can compute the translation probability of a sentence, taking the parameters into account.
• P(Jean aime Marie | John loves Mary) =
P(Jean|John) × P(1|John) × P(1|1,3) ×
P(aime|loves) × P(1|loves) × P(2|2,3) ×
P(Marie|Mary) × P(1|Mary) × P(3|3,3)
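
Multiplying the product out with hypothetical parameter values (the slides give no numbers for this example) looks like this:

# Hypothetical parameter values for the slide's worked product; each
# factor is word translation * fertility * distortion, as on slide 21.
p_word = {("Jean", "John"): 0.9, ("aime", "loves"): 0.8, ("Marie", "Mary"): 0.9}   # P(t|s)
p_fert = {("John", 1): 0.9, ("loves", 1): 0.85, ("Mary", 1): 0.9}                  # P(n|s)
p_dist = {(1, 1, 3): 0.8, (2, 2, 3): 0.8, (3, 3, 3): 0.8}                          # P(i|j,l)

p = (p_word[("Jean", "John")]  * p_fert[("John", 1)]  * p_dist[(1, 1, 3)]
   * p_word[("aime", "loves")] * p_fert[("loves", 1)] * p_dist[(2, 2, 3)]
   * p_word[("Marie", "Mary")] * p_fert[("Mary", 1)]  * p_dist[(3, 3, 3)])

print(p)   # P(Jean aime Marie | John loves Mary) under these toy values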

27. Flaws in Word-Based Translation
• The model handles many:one translations P(t t t|s), but not one:many translations P(t|s s s), e.g.
• Zeitmangel erschwert das Problem.
• (gloss: lack-of-time makes-more-difficult the problem.)
• Correct translation: Lack of time makes the problem more difficult.
• MT output: Time makes the problem. [from Koehn 2006]

28. Flaws in Word-Based Translation (2)
• Phrasal translation P(t t t|s s s s), e.g. erübrigt sich / there is no point in:
• Eine Diskussion erübrigt sich demnach.
• (gloss: a discussion is-made-unnecessary itself therefore.)
• Correct translation: Therefore, there is no point in a discussion.
• MT output: A debate turned therefore. [from Koehn 2006]

29. Flaws in Word-Based Translation (3)
• Syntactic transformations, e.g. object/subject reordering:
• Den Vorschlag lehnt die Kommission ab.
• (gloss: the proposal rejects the commission off.)
• Correct translation: The commission rejects the proposal.
• MT output: The proposal rejects the commission. [from Koehn 2006]

30. Phrase-Based Translation Models
• Foreign input is segmented into phrases.
• Phrases are any sequences of words, not necessarily linguistically motivated.
• Each phrase is translated into English.
• The translated phrases are reordered. [from Koehn 2006]
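
A minimal sketch of these steps on the sentence used in the Pharaoh demo later in the deck; the phrase table entries and the fixed segmentation are invented for illustration:

# Segment, translate phrase-by-phrase, then (trivially) reorder.
phrase_table = {"das ist": "this is", "ein": "a", "kleines haus": "small house"}

foreign = "das ist ein kleines haus"
segmentation = ["das ist", "ein", "kleines haus"]   # any segmentation is allowed
assert " ".join(segmentation) == foreign

english = [phrase_table[p] for p in segmentation]
# here the monotone order is already correct; in general the decoder
# also scores permutations of the translated phrases
print(" ".join(english))   # this is a small house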

31. Syntax Based Translation Models

32. Word-Based Decoding: Searching for the Best Translation (Brown 1990)
• Maintain a list of hypotheses.
• Initial hypothesis: (Jean aime Marie | *)
• Search proceeds iteratively. At each iteration we extend the most promising hypotheses with additional words:
Jean aime Marie | John(1) *
Jean aime Marie | * loves(2) *
Jean aime Marie | * Mary(3) *
Jean aime Marie | Jean(1) *
• Parenthesised numbers indicate the corresponding position in the target sentence.

33. Phrase-Based Decoding
• Build the translation left to right:
• select foreign word(s) to be translated
• find an English phrase translation
• add the English phrase to the end of the partial translation [Koehn 2006]

34. Decoding Process
• one to many translation (N.B. unlike word based) [Koehn 2006]

35. Decoding Process
• many to one translation [Koehn 2006]

36. Decoding Process
• translation finished [Koehn 2006]

37. Hypothesis Expansion
• Start with the empty hypothesis:
• e: no English words
• f: no foreign words covered
• p: probability 1 [Koehn 2006]
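
In code, this state can be captured by a small record; the field names and the expand helper below are our own sketch, not Pharaoh's internals:

from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    e: str                # partial English translation produced so far
    f: frozenset          # positions of foreign words already covered
    p: float              # probability accumulated so far

def expand(hyp, span, phrase, phrase_prob):
    # extend a hypothesis by translating the foreign positions in span
    assert not (span & hyp.f), "each foreign word is translated only once"
    return Hypothesis((hyp.e + " " + phrase).strip(), hyp.f | span, hyp.p * phrase_prob)

empty = Hypothesis(e="", f=frozenset(), p=1.0)        # the empty hypothesis
h = expand(empty, frozenset({0, 1}), "this is", 0.6)
h = expand(h, frozenset({2}), "a", 0.8)
print(h.e, h.p)   # this is a 0.48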

38. Hypothesis Expansion

39. Hypothesis Expansion
• further hypothesis expansion [Koehn 2006]

40. Decoding Process
• Adding more hypotheses leads to an explosion of the search space. [Koehn 2006]

41. Hypothesis Recombination
• Sometimes different hypothesis choices lead to the same translation result.
• Such paths can be combined. [Koehn 2006]

42. Hypothesis Recombination
• Drop the weaker path.
• Keep a pointer from the weaker path. [Koehn 2006]

43. Pruning
• Hypothesis recombination is not sufficient: heuristically discard weak hypotheses early.
• Organise hypotheses in stacks, e.g. by:
• same foreign words covered
• same number of foreign words covered (Pharaoh does this)
• same number of English words produced
• Compare hypotheses within a stack and discard bad ones:
• histogram pruning: keep the top n hypotheses in each stack (e.g. n = 100)
• threshold pruning: keep hypotheses that are at most α times the cost of the best hypothesis in the stack (e.g. α = 0.001)
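
Both pruning rules fit in a few lines. Note the slide states the threshold in terms of cost; working with probabilities, the equivalent test keeps hypotheses within a factor α of the best. A sketch over a stack represented as (hypothesis, probability) pairs:

# Histogram and threshold pruning over one decoder stack (sketch).
def prune(stack, n=100, alpha=0.001):
    stack = sorted(stack, key=lambda h: h[1], reverse=True)
    best = stack[0][1]
    survivors = [h for h in stack if h[1] >= alpha * best]   # threshold pruning
    return survivors[:n]                                      # histogram pruning

stack = [("this is", 0.5), ("this", 0.4), ("it", 0.0000001)]
print(prune(stack, n=2))   # the weak hypothesis is discarded by both rules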

44. Hypothesis Stacks
• Organisation of hypotheses into stacks, here based on the number of foreign words translated.
• During translation, all hypotheses from one stack are expanded.
• Expanded hypotheses are placed into stacks. [Koehn 2006]

45. Comparing Hypotheses Covering the Same Number of Foreign Words
• A hypothesis that covers the easy part of the sentence is preferred.
• We therefore need to consider the future cost of the uncovered parts.
• This should take account of one to many translation. [Koehn 2006]

46. Future Cost Estimation
• Use future cost estimates when pruning hypotheses.
• For each maximal contiguous uncovered span, look up its future cost and add it to the cost actually accumulated so far, for pruning purposes. [Koehn 2006]
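
A sketch of the bookkeeping: precompute the cheapest cost of every contiguous span by dynamic programming, then charge each hypothesis for its maximal uncovered spans. The per-word costs are invented and stand in for negative log probabilities:

# Future cost table over spans [i, j) of the foreign sentence (sketch).
word_cost = [1.2, 0.7, 2.1, 0.9]          # cheapest translation cost per word
n = len(word_cost)

future = [[0.0] * (n + 1) for _ in range(n + 1)]
for length in range(1, n + 1):
    for i in range(n - length + 1):
        j = i + length
        if length == 1:
            future[i][j] = word_cost[i]
        else:
            # best split into two cheaper spans; a real decoder would also
            # consider multi-word phrase options covering [i, j)
            future[i][j] = min(future[i][k] + future[k][j] for k in range(i + 1, j))

def estimate(accumulated, uncovered_spans):
    # accumulated cost plus future cost of each maximal uncovered span
    return accumulated + sum(future[i][j] for i, j in uncovered_spans)

print(estimate(3.0, [(0, 1), (2, 4)]))    # 3.0 + 1.2 + (2.1 + 0.9) = 7.2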

47. Pharaoh
• A beam search decoder for phrase-based models:
• works with various phrase-based models
• beam search algorithm
• time complexity roughly linear in input length
• good quality takes about 1 second per sentence
• Very good performance in the DARPA/NIST evaluation.
• Freely available for researchers: http://www.isi.edu/licensed-sw/pharaoh/
• Coming soon: an open source version of Pharaoh.

48. Pharaoh Demo
% echo 'das ist ein kleines haus' | pharaoh -f pharaoh.ini > out
Pharaoh v1.2.9, written by Philipp Koehn
a beam search decoder for phrase-based statistical machine translation models
(c) 2002-2003 University of Southern California
(c) 2004 Massachusetts Institute of Technology
(c) 2005 University of Edinburgh, Scotland
loading language model from europarl.srilm
loading phrase translation table from phrase-table, stored 21, pruned 0, kept 21
loaded data structures in 2 seconds
reading input sentences
translating 1 sentences. translated 1 sentences in 0 seconds
% cat out
this is a small house

49. Brown Experiment 2
• Translation was performed using the 1,000 most frequent words in the English corpus.
• The French vocabulary comprised the 1,700 French words used most frequently in translations of sentences completely covered by the 1,000-word English vocabulary.
• 117,000 pairs of sentences were completely covered by both vocabularies.
• Parameters of the English language model were estimated from 570,000 sentences in the English part of the corpus.

50. Experiment 2 (contd)
• 73 French sentences from elsewhere in the corpus were tested. Results were classified as:
• Exact: same as the actual translation
• Alternate: same meaning
• Different: a legitimate translation, but with a different meaning
• Wrong: could not be interpreted as a translation
• Ungrammatical: grammatically deficient
• Corrections to the last three categories were made and keystrokes were counted.
