Machine Translation - 2. Autumn 2008. Lecture 17, 4 Sep 2008. Statistical Machine Translation.

Statistical Machine Translation • Goal: • Given foreign sentence f: • “Maria no dio una bofetada a la bruja verde” • Find the most likely English translation e: • “Maria did not slap the green witch”

Statistical Machine Translation • Most likely English translation ê is given by: ê = argmax_e P(e | f) • P(e | f) estimates the conditional probability of any e given f

What makes a good translation • Translators often talk about two factors we want to maximize: • Faithfulness or fidelity • How close is the meaning of the translation to the meaning of the original? • (Even better: does the translation cause the reader to draw the same inferences as the original would have?) • Fluency or naturalness • How natural is the translation, just considering its fluency in the target language?

Statistical MT Systems • [Diagram: Spanish/English bilingual text and English text each feed a statistical analysis step; the first yields a Spanish-to-broken-English stage, the second a broken-English-to-English stage] • Example: “Que hambre tengo yo” -> candidates “What hunger have I”, “Hungry I am so”, “I am so hungry”, “Have I that hunger”, … -> “I am so hungry”

Statistical MT Systems • [Diagram: Spanish/English bilingual text -> statistical analysis -> translation model P(s|e); English text -> statistical analysis -> language model P(e)] • Decoding algorithm: argmax_e P(e) * P(s|e) • Example: “Que hambre tengo yo” -> “I am so hungry”

Three Problems for Statistical MT • Language model • Given an English string e, assigns P(e) by formula • good English string -> high P(e) • random word sequence -> low P(e) • Translation model • Given a pair of strings <f,e>, assigns P(f | e) by formula • <f,e> look like translations -> high P(f | e) • <f,e> don’t look like translations -> low P(f | e) • Decoding algorithm • Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e)

Sentence Alignment • If document De is a translation of document Df, how do we find the translation for each sentence? • The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df • In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments • Approximately 90% of the sentence alignments are 1:1

Sentence Alignment (cont’d) • There are several sentence alignment algorithms: • Align (Gale & Church): Aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences do). Works astonishingly well • Char-align (Church): Aligns based on shared character sequences. Works fine for similar languages or technical domains • K-Vec (Fung & Church): Induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs.
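Length-based alignment in the spirit of Align can be sketched as a small dynamic program. The version below is a deliberate simplification and not the Gale & Church algorithm itself: it replaces their probabilistic length model with an absolute character-length difference plus a fixed penalty per bead type. The function name, the penalties, and the example sentences are all invented for illustration.

```python
def align_by_length(src, tgt):
    """Align two lists of sentences using character lengths only.

    Simplified cost of a bead: |chars in source side - chars in target side|
    plus a fixed penalty for anything other than a 1:1 bead.
    """
    # Hypothetical penalties per bead type (source sentences : target sentences).
    beads = {(1, 1): 0, (1, 0): 10, (0, 1): 10, (2, 1): 5, (1, 2): 5}
    n, k = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (k + 1) for _ in range(n + 1)]
    back = [[None] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(k + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in beads.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > k:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(s) for s in tgt[j:nj])
                c = cost[i][j] + abs(len_src - len_tgt) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back pointers from the end to recover the chosen beads.
    path, i, j = [], n, k
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return path[::-1]


# Invented example: the last two English sentences correspond to one Spanish sentence.
english = ["The witch is green.", "Maria did not slap her.", "She ran away."]
spanish = ["La bruja es verde.", "Maria no le dio una bofetada y se fue corriendo."]
for src_side, tgt_side in align_by_length(english, spanish):
    print(src_side, "<->", tgt_side)
```

On this toy input the cheapest path is a 1:1 bead followed by a 2:1 bead, which matches the intuition that sentence lengths carry most of the alignment signal.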

Computing Translation Probabilities • Given a parallel corpus we can estimate P(e | f) • The maximum likelihood estimate of P(e | f) is: freq(e, f) / freq(f) • Way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts! • P(e | f) could be redefined at the word level as: P(e | f) = ∏_j max_{e_i} P(e_i | f_j) • Problem: The English words maximizing P(e | f) might not result in a readable sentence

Computing Translation Probabilities (cont’d) • We can account for adequacy: each foreign word translates into its most likely English word • We cannot guarantee that this will result in a fluent English sentence • Solution: transform P(e | f) with Bayes’ rule: P(e | f) = P(e) P(f | e) / P(f) • P(f | e) accounts for adequacy • P(e) accounts for fluency

Decoding • The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation: ê = argmax_e P(e) P(f | e) • The choice of word e’ as translation of f’ depends on the translation probability P(f’ | e’) and on the context, i.e. other English words preceding e’

Noisy Channel Model • Generative story: • Generate e with probability p(e) • Pass e through noisy channel • Out comes f with probability p(f|e) • Translation task: • Given f, deduce most likely e that produced f, or: ê = argmax_e p(e|f) = argmax_e p(e) p(f|e)
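The noisy-channel decision rule can be illustrated with a toy decoder that scores a fixed list of candidates by P(e) * P(f | e). This is only a sketch: a real decoder searches a very large space of English strings, and every probability below is invented for the example. The candidate strings are the ones from the "Que hambre tengo yo" slide.

```python
def decode(candidates, lm, tm):
    """Return the candidate e with the highest P(e) * P(f | e)."""
    return max(candidates, key=lambda e: lm[e] * tm[e])


# Hypothetical language model scores P(e): how fluent is each English string?
language_model = {
    "What hunger have I": 0.0001,
    "Hungry I am so": 0.0002,
    "I am so hungry": 0.0100,
    "Have I that hunger": 0.0001,
}
# Hypothetical translation model scores P(f | e) for f = "Que hambre tengo yo".
translation_model = {
    "What hunger have I": 0.30,
    "Hungry I am so": 0.20,
    "I am so hungry": 0.20,
    "Have I that hunger": 0.25,
}

best = decode(list(language_model), language_model, translation_model)
print(best)
```

With these made-up numbers the most literal candidate has the highest translation score, but the language model outweighs it and the fluent candidate wins.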

Translation Model • How to model P(f|e)? • Learn parameters of P(f|e) from a bilingual corpus S of sentence pairs <e_i, f_i>: <e_1, f_1> = <the blue witch, la bruja azul>; <e_2, f_2> = <green, verde>; …; <e_S, f_S> = <the witch, la bruja>

Translation Model • Insufficient data in parallel corpus to estimate P(f|e) at the sentence level (Why?) • Decompose process of translating e -> f into small steps whose probabilities can be estimated

Translation Model • English sentence e = e_1…e_l • Foreign sentence f = f_1…f_m • Alignment A = {a_1…a_m}, where a_j ∈ {0…l} • A indicates which English word generates each foreign word

Alignments e: “the blue witch” f: “la bruja azul” A = {1,3,2} (intuitively “good” alignment)

Alignments e: “the blue witch” f: “la bruja azul” A = {1,1,1} (intuitively “bad” alignment)

Alignments e: “the blue witch” f: “la bruja azul” (illegal alignment!)

Alignments • Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m? • Answer: • Each foreign word can align with any one of |e| = l words, or it can remain unaligned • Each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words • So, there are (l+1)^m alignments for a given e and f
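The counting argument can be checked by brute force. The sketch below enumerates every alignment for the “the blue witch” / “la bruja azul” pair, using 0 for an unaligned foreign word; the function name is hypothetical.

```python
from itertools import product


def all_alignments(l, m):
    """Yield every alignment (a_1, ..., a_m) with each a_j in {0, ..., l}."""
    return product(range(l + 1), repeat=m)


l, m = 3, 3  # |e| = 3 ("the blue witch"), |f| = 3 ("la bruja azul")
alignments = list(all_alignments(l, m))
print(len(alignments), (l + 1) ** m)
print((1, 3, 2) in alignments)  # the intuitively "good" alignment
```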

Alignments • Question: If all alignments are equally likely, what is the probability of any one alignment, given e? • Answer: • P(A|e) = p(|f| = m) * 1/(l+1)^m • If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C • P(A|e) = C /(l+1)^m

Generative Story e: “blue witch” f: “bruja azul” ? How do we get from e to f?

Language Modeling • Determines the probability of some English sequence e = e_1…e_l of length l • P(e) is hard to estimate directly, unless l is very small • P(e) is normally approximated as: P(e) ≈ ∏_{i=1..l} P(e_i | e_{i-m}…e_{i-1}), where m is the size of the context, i.e. the number of previous words that are considered; normally m = 2 (trigram language model)
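A trigram model (context size 2) can be sketched with plain relative-frequency counts. The padding symbols `<s>` and `</s>`, the function names, and the three-sentence corpus are assumptions made for the example, and there is no smoothing, so any unseen trigram gives probability 0.

```python
from collections import Counter


def train_trigram_lm(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            trigrams[tuple(tokens[i - 2:i + 1])] += 1
            contexts[tuple(tokens[i - 2:i])] += 1
    return trigrams, contexts


def sentence_prob(sentence, trigrams, contexts):
    """P(e) as a product of unsmoothed P(e_i | e_{i-2} e_{i-1})."""
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for i in range(2, len(tokens)):
        context = tuple(tokens[i - 2:i])
        if contexts[context] == 0:
            return 0.0  # context never seen in training
        prob *= trigrams[tuple(tokens[i - 2:i + 1])] / contexts[context]
    return prob


corpus = ["the green witch", "the blue witch", "the witch"]
trigrams, contexts = train_trigram_lm(corpus)
print(sentence_prob("the green witch", trigrams, contexts))
print(sentence_prob("witch the green", trigrams, contexts))
```

The well-formed string gets a positive probability while the scrambled one gets zero, which is the behaviour the language model is meant to provide.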

Translation Modeling • Determines the probability that the foreign word f is a translation of the English word e • How to compute P(f | e) from a parallel corpus? • Statistical approaches rely on the co-occurrence of e and f in the parallel data: If e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another

Finding Translations in a Parallel Corpus • Into which foreign words f, . . . , f’ does e translate? • Commonly, four factors are used: • How often do e and f co-occur? (translation) • How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example: English is a verb-second language, whereas German is a verb-final language • How likely is e to translate into more than one word? (fertility) For example: defeated can translate into eine Niederlage erleiden • How likely is a foreign word to be spuriously generated? (null translation)

IBM Models 1–5 • Model 1: Bag of words • Unique local maxima • Efficient EM algorithm (Model 1–2) • Model 2: General alignment: • Model 3: fertility: n(k | e) • No full EM, count only neighbors (Model 3–5) • Deficient (Model 3–4) • Model 4: Relative distortion, word classes • Model 5: Extra variables to avoid deficiency

IBM Model 1 • Model parameters: • T(f_j | e_{a_j}) = translation probability of a foreign word given the English word that generated it

IBM Model 1 • Generative story: • Given e: • Pick m = |f|, where all lengths m are equally probable • Pick A with probability P(A | e) = 1/(l+1)^m, since all alignments are equally likely given l and m • Pick f_1…f_m with probability P(f | A, e) = ∏_{j=1..m} T(f_j | e_{a_j}), where T(f_j | e_{a_j}) is the translation probability of f_j given the English word it is aligned to
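The generative story can be turned into a direct computation of P(f, A | e). The sketch assumes the joint probability is the alignment probability 1/(l+1)^m times the product of the word translation probabilities, with the length probability left as a constant factor (1.0 by default). The function name and the values in the table `t` are invented.

```python
def model1_joint(f_words, e_words, alignment, t, length_prob=1.0):
    """P(f, A | e) = length_prob / (l + 1)^m * prod_j T(f_j | e_{a_j})."""
    e_with_null = ["NULL"] + e_words  # position 0 stands for "unaligned"
    l, m = len(e_words), len(f_words)
    prob = length_prob / (l + 1) ** m
    for f_j, a_j in zip(f_words, alignment):
        # Pairs missing from the table are treated as probability 0.
        prob *= t.get((f_j, e_with_null[a_j]), 0.0)
    return prob


# Hypothetical translation probabilities T(f | e), keyed as (f, e).
t = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7}

# e = "blue witch", f = "bruja azul", A = {2, 1}
p = model1_joint(["bruja", "azul"], ["blue", "witch"], (2, 1), t)
print(p)
```

Swapping the alignment to (1, 2) would pair "bruja" with "blue" and "azul" with "witch", and with this table the joint probability drops to zero.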

IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick A = {2,1} with probability 1/(l+1)^m

IBM Model 1 Example e: “blue witch” f: “bruja f2” Pick f1 = “bruja” with probability T(bruja | witch)

IBM Model 1 Example e: “blue witch” f: “bruja azul” Pick f2 = “azul” with probability T(azul | blue)

IBM Model 1: Parameter Estimation • How does this generative story help us to estimate P(f|e) from the data? • Since the model for P(f|e) contains the parameter T(f_j | e_{a_j}), we first need to estimate T(f_j | e_{a_j})

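IBM Model 1: Parameter Estimation • How to estimate T(f_j | e_{a_j}) from the data? • If we had the data and the alignments A, along with P(A | f, e), then we could estimate T(f_j | e_{a_j}) using expected counts as follows: T(f | e) = count(f, e) / Σ_{f'} count(f', e), where count(f, e) adds P(A | f, e) for every link between f and e in every alignment A of every sentence pair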

IBM Model 1: Parameter Estimation • How to estimate P(A | f, e)? • P(A | f, e) = P(A, f | e) / P(f | e) • But P(f | e) = Σ_A P(A, f | e) • So we need to compute P(A, f | e)… • This is given by the Model 1 generative story: P(A, f | e) = C/(l+1)^m * ∏_{j=1..m} T(f_j | e_{a_j})

IBM Model 1 Example e: “the blue witch” f: “la bruja azul” P(A|f,e) = P(f,A|e)/ P(f|e) =
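For a sentence pair this small, the posterior P(A | f, e) = P(f, A | e) / P(f | e) can be computed by enumerating all alignments and normalizing. In the sketch below the constant factor in front of the product is dropped because it cancels in the ratio. The translation table is invented: each foreign word gets 0.7 for its intuitive English partner and 0.1 for every other English position, including NULL.

```python
from itertools import product
from math import prod


def alignment_posterior(f_words, e_words, t):
    """Return {A: P(A | f, e)} under Model 1 by brute-force enumeration."""
    e_with_null = ["NULL"] + e_words
    scores = {}
    for a in product(range(len(e_with_null)), repeat=len(f_words)):
        # Unnormalized P(f, A | e): product of T(f_j | e_{a_j}).
        scores[a] = prod(t[(f, e_with_null[i])] for f, i in zip(f_words, a))
    total = sum(scores.values())  # proportional to P(f | e)
    return {a: s / total for a, s in scores.items()}


e_words = ["the", "blue", "witch"]
f_words = ["la", "bruja", "azul"]
likely = {"la": "the", "bruja": "witch", "azul": "blue"}  # invented preferences
t = {(f, e): (0.7 if likely[f] == e else 0.1)
     for f in f_words for e in ["NULL"] + e_words}

post = alignment_posterior(f_words, e_words, t)
print(post[(1, 3, 2)], post[(1, 1, 1)])
```

Under these assumed values the “good” alignment {1,3,2} receives far more posterior mass than the “bad” alignment {1,1,1}. Enumeration costs (l+1)^m terms, so it is only practical for toy examples.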

IBM Model 1: Parameter Estimation • So, in order to estimate P(f|e), we first need to estimate the model parameter T(f_j | e_{a_j}) • In order to compute T(f_j | e_{a_j}), we need to estimate P(A|f,e) • And in order to compute P(A|f,e), we need to estimate T(f_j | e_{a_j})…

IBM Model 1: Parameter Estimation • Training data is a set of pairs <e_i, f_i> • Log likelihood of training data given model parameters is: Σ_i log P(f_i | e_i) = Σ_i log Σ_A P(A, f_i | e_i) • To maximize log likelihood of training data given model parameters, use EM: • hidden variable = alignments A • model parameters = translation probabilities T

EM • Initialize model parameters T(f|e) • Calculate alignment probabilities P(A|f,e) under current values of T(f|e) • Calculate expected counts from alignment probabilities • Re-estimate T(f|e) from these expected counts • Repeat until log likelihood of training data converges to a maximum

IBM Model 1 Example • Parallel ‘corpus’: the dog :: le chien; the cat :: le chat • Step 1+2 (collect candidates and initialize uniformly): • P(le | the) = P(chien | the) = P(chat | the) = 1/3 • P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3 • P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3 • P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3
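The EM loop can be sketched on this two-sentence corpus. The code below is a minimal illustration, not the lecture's implementation: it relies on the standard Model 1 property that alignment posteriors factor per foreign word, so expected counts are collected word by word instead of by enumerating alignments. The function name and the fixed iteration count (used in place of a convergence test) are assumptions.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """corpus: list of (english_words, foreign_words). Returns T keyed as (f, e)."""
    pairs = [(["NULL"] + e_words, f_words) for e_words, f_words in corpus]
    f_vocab = {f for _, fs in pairs for f in fs}
    e_vocab = {e for es, _ in pairs for e in es}
    # Uniform initialization: T(f | e) = 1 / |foreign vocabulary|.
    t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalizer for this f
                for e in es:
                    c = t[(f, e)] / z  # posterior that e generated f
                    count[(f, e)] += c
                    total[e] += c
        # Re-estimate T(f | e) from the expected counts.
        t = {(f, e): count[(f, e)] / total[e] for (f, e) in t}
    return t


toy_corpus = [(["the", "dog"], ["le", "chien"]), (["the", "cat"], ["le", "chat"])]
t = train_model1(toy_corpus)
for f in ("le", "chien", "chat"):
    print(f, "| dog:", round(t[(f, "dog")], 3))
```

After a few iterations T(chien | dog) rises above T(le | dog), because "le" is also explained by "the" and NULL in both sentence pairs, while T(chat | dog) falls to zero since the two words never co-occur.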