Machine Translation- 3 - PowerPoint PPT Presentation

Presentation Transcript

1. Machine Translation- 3 Autumn 2008 Lecture 18 8 Sep 2008

2. Translation Steps

3. IBM Models 1–5 • Model 1: Bag of words • Unique local maximum (so EM converges to the global optimum) • Efficient EM algorithm (Models 1–2) • Model 2: General alignment • Model 3: Fertility: n(k | e) • No full EM, count only neighbors (Models 3–5) • Deficient (Models 3–4) • Model 4: Relative distortion, word classes • Model 5: Extra variables to avoid deficiency

4. IBM Model 1 • Model parameters: • T(fj | eaj ) = translation probability of foreign word given English word that generated it

5. IBM Model 1 • Generative story: • Given e (of length l): • Pick m = |f|, where all lengths m are equally probable • Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m • Pick f1…fm with probability P(f | A, e) = Π_{j=1..m} T(fj | eaj ), where T(fj | eaj ) is the translation probability of fj given the English word it is aligned to

6. IBM Model 1 Example e: “blue witch”

7. IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick m = |f| = 2

8. IBM Model 1 Example e: “blue witch” f: “f1 f2” Pick A = {2,1} with probability 1/(l+1)^m

9. IBM Model 1 Example e: “blue witch” f: “bruja f2” Pick f1 = “bruja” with probability T(bruja|witch)

10. IBM Model 1 Example e: “blue witch” f: “bruja azul” Pick f2 = “azul” with probability T(azul|blue)
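
The generative story in this example can be traced numerically. The following is a minimal sketch, not part of the original slides: the translation probabilities in `T` are made-up illustrative values, the function name `model1_joint` is hypothetical, and the constant length probability for m is left out.

```python
from math import prod

# Hypothetical translation table T[(f, e)] = T(f | e); values are illustrative only.
T = {
    ("bruja", "witch"): 0.8,
    ("azul", "blue"): 0.7,
}

def model1_joint(f_words, e_words, alignment, T):
    """P(f, A | e) up to the constant length term:
    1/(l+1)^m * prod_j T(f_j | e_{a_j}).

    alignment[j] is the 1-based English position for foreign word j (0 = NULL).
    """
    e_with_null = ["NULL"] + list(e_words)
    l, m = len(e_words), len(f_words)
    p_alignment = 1.0 / (l + 1) ** m  # every alignment is equally likely
    p_words = prod(
        T.get((f, e_with_null[a]), 0.0) for f, a in zip(f_words, alignment)
    )
    return p_alignment * p_words

# A = {2,1}: "bruja" is generated by "witch", "azul" by "blue".
p = model1_joint(["bruja", "azul"], ["blue", "witch"], [2, 1], T)
print(p)
```

With l = 2 and m = 2 the alignment term is 1/9, so the result is (1/9) × T(bruja|witch) × T(azul|blue).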

11. IBM Model 1: Parameter Estimation • How does this generative story help us to estimate P(f|e) from the data? • Since the model for P(f|e) contains the parameter T(fj | eaj ), we first need to estimate T(fj | eaj )

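12. IBM Model 1: Parameter Estimation • How to estimate T(fj | eaj ) from the data? • If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(fj | eaj ) using expected counts as follows: T(f | e) = tc(f | e) / total(e), where tc(f | e) is the expected number of times f is aligned to e under P(A|f,e), and total(e) = Σ_f tc(f | e)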

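13. IBM Model 1: Parameter Estimation • How to estimate P(A|f,e)? • P(A|f,e) = P(A,f|e) / P(f|e) • But P(f|e) = Σ_A P(A,f|e) • So we need to compute P(A,f|e)… • This is given by the Model 1 generative story: P(A,f|e) = P(m|e) × 1/(l+1)^m × Π_{j=1..m} T(fj | eaj ), where P(m|e) is the constant length probability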
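
The ratio P(A|f,e) = P(A,f|e) / P(f|e) can be checked by brute force on a tiny sentence pair. This is only a sketch under stated assumptions: the function name `alignment_posteriors` is hypothetical, the table `T` borrows the probabilities reached after the first EM iteration of the "the dog :: le chien" toy example later in these slides, and enumeration is feasible only for very short sentences.

```python
from itertools import product
from math import prod

def alignment_posteriors(f_words, e_words, T):
    """Return {A: P(A | f, e)} by enumerating all (l+1)^m alignments."""
    e_with_null = ["NULL"] + list(e_words)
    m = len(f_words)
    joint = {}
    for A in product(range(len(e_with_null)), repeat=m):
        # The length term and 1/(l+1)^m are identical for every A,
        # so they cancel in the ratio and are omitted here.
        joint[A] = prod(
            T.get((f, e_with_null[a]), 0.0) for f, a in zip(f_words, A)
        )
    z = sum(joint.values())  # proportional to P(f | e)
    return {A: p / z for A, p in joint.items()}

# Assumed values: the toy example's probabilities after one EM iteration.
T = {
    ("le", "NULL"): 0.5, ("le", "the"): 0.5, ("le", "dog"): 0.5,
    ("chien", "NULL"): 0.25, ("chien", "the"): 0.25, ("chien", "dog"): 0.5,
}

post = alignment_posteriors(["le", "chien"], ["the", "dog"], T)
best = max(post, key=post.get)
print(best, post[best])
```

Because Model 1 treats positions independently, the sum over alignments factorizes into a product of per-word sums, which is why the worked example later computes one "total" per foreign word instead of enumerating alignments.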

14. IBM Model 1 Example e: “the blue witch” f: “la bruja azul” P(A|f,e) = P(f,A|e)/ P(f|e) =

15. IBM Model 1: Parameter Estimation • So, in order to estimate P(f|e), we first need to estimate the model parameter T(fj | eaj ) • In order to compute T(fj | eaj ), we need to estimate P(A|f,e) • And in order to compute P(A|f,e), we need to estimate T(fj | eaj )…

16. IBM Model 1: Parameter Estimation • Training data is a set of pairs <ei, fi> • Log likelihood of training data given model parameters is: Σ_i log P(fi | ei) = Σ_i log Σ_A P(A, fi | ei) • To maximize log likelihood of training data given model parameters, use EM: • hidden variable = alignments A • model parameters = translation probabilities T

17. EM • Initialize model parameters T(f|e) • Calculate alignment probabilities P(A|f,e) under current values of T(f|e) • Calculate expected counts from alignment probabilities • Re-estimate T(f|e) from these expected counts • Repeat until log likelihood of training data converges to a maximum
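
The EM loop above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not code from the lecture: the function name `train_model1` is hypothetical, a NULL word is prepended to every English sentence, and a fixed number of iterations stands in for a convergence test on the log likelihood.

```python
from collections import defaultdict

def train_model1(corpus, iterations):
    """corpus: list of (english_words, foreign_words) pairs.
    Returns a dict t[(f, e)] = T(f | e)."""
    pairs = [(["NULL"] + list(e), list(f)) for e, f in corpus]
    f_vocab = {f for _, fs in pairs for f in fs}
    e_vocab = {e for es, _ in pairs for e in es}
    # Initialize uniformly over the foreign vocabulary.
    t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}
    for _ in range(iterations):
        tc = defaultdict(float)     # expected counts tc(f | e)
        total = defaultdict(float)  # total(e) = sum over f of tc(f | e)
        # E-step: distribute each foreign word over its candidate English words.
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / z
                    tc[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize; pairs that never co-occur drop to 0.
        t = {(f, e): tc[(f, e)] / total[e] for f in f_vocab for e in e_vocab}
    return t

corpus = [
    (["the", "dog"], ["le", "chien"]),
    (["the", "cat"], ["le", "chat"]),
]
t = train_model1(corpus, 5)
print(t[("le", "the")], t[("chien", "dog")])
```

Run on the two-sentence corpus used in the example slides, five iterations give P(le | the) ≈ 0.7556 and P(chien | dog) ≈ 0.8381, in line with the "After 5 iterations" values listed there.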

18. IBM Model 1 Example • Parallel ‘corpus’: the dog :: le chien the cat :: le chat • Step 1+2 (collect candidates and initialize uniformly): P(le | the) = P(chien | the) = P(chat | the) = 1/3 P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3 P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3 P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3

19. IBM Model 1 Example • Step 3: Iterate • NULL the dog :: le chien • j=1 total = P(le | NULL)+P(le | the)+P(le | dog)= 1 tc(le | NULL) += P(le | NULL)/1 = 0 += .333/1 = 0.333 tc(le | the) += P(le | the)/1 = 0 += .333/1 = 0.333 tc(le | dog) += P(le | dog)/1 = 0 += .333/1 = 0.333 • j=2 total = P(chien | NULL)+P(chien | the)+P(chien | dog)=1 tc(chien | NULL) += P(chien | NULL)/1 = 0 += .333/1 = 0.333 tc(chien | the) += P(chien | the)/1 = 0 += .333/1 = 0.333 tc(chien | dog) += P(chien | dog)/1 = 0 += .333/1 = 0.333

20. IBM Model 1 Example • NULL the cat :: le chat • j=1 total = P(le | NULL)+P(le | the)+P(le | cat)=1 tc(le | NULL) += P(le | NULL)/1 = 0.333 += .333/1 = 0.666 tc(le | the) += P(le | the)/1 = 0.333 += .333/1 = 0.666 tc(le | cat) += P(le | cat)/1 = 0 +=.333/1 = 0.333 • j=2 total = P(chat | NULL)+P(chat | the)+P(chat | cat)=1 tc(chat | NULL) += P(chat | NULL)/1 = 0 += .333/1 = 0.333 tc(chat | the) += P(chat | the)/1 = 0 += .333/1 = 0.333 tc(chat | cat) += P(chat | cat)/1 = 0 += .333/1 = 0.333

21. IBM Model 1 Example • Re-compute translation probabilities • total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.333 + 0.333 = 1.333 P(le | the) = tc(le | the)/total(the) = 0.666 / 1.333 = 0.5 P(chien | the) = tc(chien | the)/total(the) = 0.333 / 1.333 = 0.25 P(chat | the) = tc(chat | the)/total(the) = 0.333 / 1.333 = 0.25 • total(dog) = tc(le | dog) + tc(chien | dog) = 0.666 P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.666 = 0.5 P(chien | dog) = tc(chien | dog)/total(dog) = 0.333 / 0.666 = 0.5

22. IBM Model 1 Example • Iteration 2: • NULL the dog :: le chien • j=1 total = P(le | NULL)+P(le | the)+P(le | dog) = 0.5 + 0.5 + 0.5 = 1.5 tc(le | NULL) += P(le | NULL)/1.5 = 0 += .5/1.5 = 0.333 tc(le | the) += P(le | the)/1.5 = 0 += .5/1.5 = 0.333 tc(le | dog) += P(le | dog)/1.5 = 0 += .5/1.5 = 0.333 • j=2 total = P(chien | NULL)+P(chien | the)+P(chien | dog) = 0.25 + 0.25 + 0.5 = 1 tc(chien | NULL) += P(chien | NULL)/1 = 0 += .25/1 = 0.25 tc(chien | the) += P(chien | the)/1 = 0 += .25/1 = 0.25 tc(chien | dog) += P(chien | dog)/1 = 0 += .5/1 = 0.5

23. IBM Model 1 Example • NULL the cat :: le chat • j=1 total = P(le | NULL)+P(le | the)+P(le | cat) = 0.5 + 0.5 + 0.5 = 1.5 tc(le | NULL) += P(le | NULL)/1.5 = 0.333 += .5/1.5 = 0.666 tc(le | the) += P(le | the)/1.5 = 0.333 += .5/1.5 = 0.666 tc(le | cat) += P(le | cat)/1.5 = 0 += .5/1.5 = 0.333 • j=2 total = P(chat | NULL)+P(chat | the)+P(chat | cat) = 0.25 + 0.25 + 0.5 = 1 tc(chat | NULL) += P(chat | NULL)/1 = 0 += .25/1 = 0.25 tc(chat | the) += P(chat | the)/1 = 0 += .25/1 = 0.25 tc(chat | cat) += P(chat | cat)/1 = 0 += .5/1 = 0.5

24. IBM Model 1 Example • Re-compute translations (iteration 2): • total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.25 + 0.25 = 1.167 P(le | the) = tc(le | the)/total(the) = 0.666 / 1.167 = 0.571 P(chien | the) = tc(chien | the)/total(the) = 0.25 / 1.167 = 0.214 P(chat | the) = tc(chat | the)/total(the) = 0.25 / 1.167 = 0.214 • total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.5 = 0.833 P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.833 = 0.4 P(chien | dog) = tc(chien | dog)/total(dog) = 0.5 / 0.833 = 0.6

25. IBM Model 1 Example • After 5 iterations: P(le | NULL) = 0.755608028335301 P(chien | NULL) = 0.122195985832349 P(chat | NULL) = 0.122195985832349 P(le | the) = 0.755608028335301 P(chien | the) = 0.122195985832349 P(chat | the) = 0.122195985832349 P(le | dog) = 0.161943319838057 P(chien | dog) = 0.838056680161943 P(le | cat) = 0.161943319838057 P(chat | cat) = 0.838056680161943

26. IBM Model 1 Recap • IBM Model 1 allows for an efficient computation of translation probabilities • No notion of fertility, i.e., it’s possible that the same English word is the best translation for all foreign words • No positional information, i.e., depending on the language pair, there might be a tendency that words occurring at the beginning of the English sentence are more likely to align to words at the beginning of the foreign sentence

27. IBM Model 2 • Model parameters: • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it • d(i|j,l,m) = distortion probability, or probability that fj is aligned to ei , given l and m

28. IBM Model 3 • Model parameters: • T(fj | eaj ) = translation probability of foreign word fj given English word eaj that generated it • r(j|i,l,m) = reverse distortion probability, or probability of position fj, given its alignment to ei, l, and m • n(ei) = fertility of word ei , or number of foreign words aligned to ei • p1 = probability of generating a foreign word by alignment with the NULL English word

29. IBM Model 3 • IBM Model 3 offers two additional features compared to IBM Model 1: • How likely is an English word e to align to k foreign words (fertility)? • Positional information (distortion), how likely is a word in position i to align to a word in position j?

30. IBM Model 3: Fertility • The best Model 1 alignment could be that a single English word aligns to all foreign words • This is clearly not desirable, so we want to constrain the number of words an English word can align to • Fertility models a probability distribution that word e aligns to k words: n(k | e) • Consequence: translation probabilities cannot be computed independently of each other anymore • IBM Model 3 has to work with full alignments; note that there are up to (l+1)^m different alignments

31. IBM Model 3 • Generative Story: • Choose fertilities for each English word • Insert spurious words according to probability of being aligned to the NULL English word • Translate English words -> foreign words • Reorder words according to reverse distortion probabilities

32. IBM Model 3 Example • Consider the following example from [Knight 1999]: • Maria did not slap the green witch

33. IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Choose fertilities: phi(Maria) = 1

34. IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Maria not slap slap slap NULL the green witch • Insert spurious words: p(NULL)

35. IBM Model 3 Example • Maria did not slap the green witch • Maria not slap slap slap the green witch • Maria not slap slap slap NULL the green witch • Maria no dio una bofetada a la verde bruja • Translate words: t(verde|green)

36. IBM Model 3 Example • Maria no dio una bofetada a la verde bruja • Maria no dio una bofetada a la bruja verde • Reorder words
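
The fertilities in this example can be read off an alignment vector. The sketch below is illustrative only: the function name `fertilities` is hypothetical, and the alignment is one reading of the example, in which "a" is the spurious word generated by NULL and "did" has fertility zero.

```python
from collections import Counter

def fertilities(alignment, l):
    """alignment[j] = English position (1..l, or 0 for NULL) of foreign word j.
    Returns a list whose entry i is the number of foreign words aligned to i."""
    counts = Counter(alignment)
    return [counts[i] for i in range(l + 1)]

english = "Maria did not slap the green witch".split()
spanish = "Maria no dio una bofetada a la bruja verde".split()
# Maria->Maria, no->not, dio/una/bofetada->slap, a->NULL, la->the,
# bruja->witch, verde->green (assumed reading of the example).
alignment = [1, 3, 4, 4, 4, 0, 5, 7, 6]

phi = fertilities(alignment, len(english))
for word, k in zip(["NULL"] + english, phi):
    print(word, k)

# Size of the full alignment space that Model 3 cannot afford to enumerate.
n_alignments = (len(english) + 1) ** len(spanish)
print(n_alignments)
```

Even for this nine-word sentence the alignment space has 8^9 entries, which illustrates why Model 3 training restricts itself to a small set of likely alignments.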

37. IBM Model 3 • For models 1 and 2: • We can compute exact EM updates • For models 3 and 4: • Exact EM updates cannot be efficiently computed • Use best alignments from previous iterations to initialize each successive model • Explore only the subspace of potential alignments that lies within same neighborhood as the initial alignments

38. IBM Model 4 • Model parameters: • Same as model 3, except uses more complicated model of reordering (for details, see Brown et al. 1993)

39. IBM Model 1 + Model 3 • Iterating over all possible alignments is computationally infeasible • Solution: Compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging) • Model 3 takes this restricted set of alignments as input

40. Pegging • Given an alignment a we can derive additional alignments from it by making small changes: • Changing a link (j,i) to (j,i’) • Swapping a pair of links (j,i) and (j’,i’) to (j,i’) and (j’,i) • The resulting set of alignments is called the neighborhood of a
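
The neighborhood described above can be generated directly from the two operations. This is a minimal sketch with a hypothetical function name `neighborhood`; it represents an alignment as a tuple in which entry j holds the English position (0 = NULL) linked to foreign position j.

```python
def neighborhood(alignment, l):
    """All alignments reachable from `alignment` by one move or one swap."""
    base = tuple(alignment)
    m = len(base)
    neighbors = set()
    # Moves: change a link (j, i) to (j, i').
    for j in range(m):
        for i_new in range(l + 1):
            if i_new != base[j]:
                neighbors.add(base[:j] + (i_new,) + base[j + 1:])
    # Swaps: links (j, i) and (j', i') become (j, i') and (j', i).
    for j1 in range(m):
        for j2 in range(j1 + 1, m):
            if base[j1] != base[j2]:
                swapped = list(base)
                swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
                neighbors.add(tuple(swapped))
    return neighbors

# Two foreign words, two English words, starting from the monotone alignment.
hood = neighborhood((1, 2), l=2)
print(sorted(hood))
```

The neighborhood grows roughly as m·l moves plus m² swaps, far smaller than the (l+1)^m alignments of the full space.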

41. IBM Model 3: Distortion • The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences: d(j | i, l, m) • Note, positions are absolute positions

42. Deficiency • Problem with IBM Model 3: It assigns probability mass to impossible strings • Well formed string: “This is possible” • Ill-formed but possible string: “This possible is” • Impossible string: • Impossible strings are due to distortion values that generate different words at the same position • Impossible strings can still be filtered out in later stages of the translation process

43. Limitations of IBM Models • Only 1-to-N word mapping • Handling fertility-zero words (difficult for decoding) • Almost no syntactic information • Word classes • Relative distortion • Long-distance word movement • Fluency of the output depends entirely on the English language model

44. Decoding • How to translate new sentences? • A decoder uses the parameters learned on a parallel corpus • Translation probabilities • Fertilities • Distortions • In combination with a language model the decoder generates the most likely translation • Standard algorithms can be used to explore the search space (A*, greedy searching, …) • Similar to the traveling salesman problem

45. Three Problems for Statistical MT • Language model • Given an English string e, assigns P(e) by formula • good English string -> high P(e) • random word sequence -> low P(e) • Translation model • Given a pair of strings <f,e>, assigns P(f | e) by formula • <f,e> look like translations -> high P(f | e) • <f,e> don’t look like translations -> low P(f | e) • Decoding algorithm • Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e) Slide from Kevin Knight
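
The decoding objective, finding e that maximizes P(e) * P(f | e), can be illustrated on a fixed candidate list. This is only a sketch: a real decoder searches a huge space rather than scoring a handful of candidates, the function name `decode` is hypothetical, and every probability below is made up for illustration.

```python
import math

def decode(f, candidates, lm_prob, tm_prob):
    """Return the candidate e maximizing log P(e) + log P(f | e)."""
    def score(e):
        return math.log(lm_prob[e]) + math.log(tm_prob[(f, e)])
    return max(candidates, key=score)

f = "la bruja verde"
candidates = ["the green witch", "the witch green"]

# Hypothetical language model: fluent English gets a higher P(e).
lm_prob = {"the green witch": 1e-2, "the witch green": 1e-4}
# Hypothetical translation model: the word-for-word order scores slightly higher.
tm_prob = {
    (f, "the green witch"): 0.2,
    (f, "the witch green"): 0.3,
}

best = decode(f, candidates, lm_prob, tm_prob)
print(best)
```

In this toy setting the translation model slightly prefers the literal word order, but the language model outweighs it, so the fluent candidate wins.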

46. The Classic Language Model: Word N-Grams • Goal of the language model: choose among: • He is on the soccer field • He is in the soccer field • Is table the on cup the • The cup is on the table • Rice shrine • American shrine • Rice company • American company Slide from Kevin Knight
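
A word n-gram model makes this kind of choice by scoring each string from counts of adjacent words. The following is a minimal bigram sketch with add-one smoothing; the three training sentences, the function names `train_bigram` and `logprob`, and the choice of smoothing are all assumptions for illustration, not details from the slides.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect bigram counts, context counts, and vocabulary size."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, contexts, len(vocab)

def logprob(sentence, model):
    """Log P(sentence) under the bigram model with add-one smoothing."""
    bigrams, contexts, v = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(tokens, tokens[1:])
    )

# Tiny hypothetical training corpus.
model = train_bigram([
    "the cup is on the table",
    "he is on the soccer field",
    "the table is in the room",
])

good = logprob("the cup is on the table", model)
bad = logprob("is table the on cup the", model)
print(good, bad)
```

The scrambled string uses the same words but none of its adjacent pairs were seen in training, so it receives the lower score.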

47. Intuition of phrase-based translation (Koehn et al. 2003) • Generative story has three steps • Group words into phrases • Translate each phrase • Move the phrases around

48. Generative story again • Group English source words into phrases e1, e2, …, en • Translate each English phrase ei into a Spanish phrase fj. • The probability of doing this is φ(fj|ei) • Then (optionally) reorder each Spanish phrase • We do this with a distortion probability • A measure of distance between positions of a corresponding phrase in the two languages • “What is the probability that a phrase in position X in the English sentence moves to position Y in the Spanish sentence?”

49. Distortion probability • The distortion probability is parameterized by a_i − b_{i−1} • Where a_i is the start position of the foreign (Spanish) phrase generated by the ith English phrase e_i • And b_{i−1} is the end position of the foreign (Spanish) phrase generated by the (i−1)th English phrase e_{i−1} • We’ll call the distortion probability d(a_i − b_{i−1}) • And we’ll have a really stupid model: • d(a_i − b_{i−1}) = α^|a_i − b_{i−1}| • Where α is some small constant
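
The distortion penalty can be sketched as follows. This assumes the formula on the slide is the exponential penalty α raised to the distance between the start of the current foreign phrase and the end of the previous one; the function names, the value α = 0.5, and the phrase spans are illustrative assumptions. Some formulations subtract 1 inside the absolute value so that a phrase starting immediately after the previous one incurs no penalty.

```python
def distortion(a_i, b_prev, alpha=0.5):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1}| (assumed form)."""
    return alpha ** abs(a_i - b_prev)

def total_distortion(spans, alpha=0.5):
    """Product of distortion terms for foreign phrase spans (start, end),
    listed in the order of the English phrases that generated them."""
    score = 1.0
    b_prev = 0  # by convention, the position before the first foreign word
    for start, end in spans:
        score *= distortion(start, b_prev, alpha)
        b_prev = end
    return score

# Three English phrases whose translations land at foreign positions
# 1-2, 5-6, and 3-4 (the last two are swapped relative to English order).
reordered = total_distortion([(1, 2), (5, 6), (3, 4)])
monotone = total_distortion([(1, 2), (3, 4), (5, 6)])
print(reordered, monotone)
```

With α below 1, larger jumps are penalized more heavily, so the monotone ordering receives the higher distortion score.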