Statistical Machine Translation

Presentation Transcript


  1. Statistical Machine Translation Or, How to Put a Lot of Work Into Getting the Wrong Answer Timothy White Presentation for CSC 9010: Natural Language Processing, 2005

  2. Historic Importance • Almost immediately recognized as an important application of computers • Warren Weaver (1949): “I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.” • In 1954 IBM demonstrated a word-for-word system Adapted from [Dorr04], slide 3

  3. Modern Importance • Commercial • Translation is one of the most-used Google features; Babelfish also sees extensive use • The EU spends >$1 billion per year on translations • The many businesses conducting international operations are also a huge potential market • Academic • Requires knowledge of many NLP areas: lexical semantics, parsing, morphological analysis, statistical modeling, etc. • Would greatly increase the ease of sharing knowledge and research among the world-wide scientific community Adapted from [Dorr04], slides 4-5

  4. Goals of this Presentation • Present the basic concepts underlying modern Statistical Machine Translation • Present the IBM Model 3 system as an example • Describe methods of utilizing Language/Translation Models to produce actual translations • Thus: set a foundation for self-exploration

  5. Statistical Machine Translation • Motivation: • to produce a (translated) sentence β in language B that is grammatically correct and semantically identical to a source sentence α in language A • The Noisy Channel: • assume that sentence β was meant to be written, but was corrupted by ‘noise’ and came out as α in language A • therefore, determine sentence β by considering 1) which sentences β are likely to be written and 2) how a sentence β gets corrupted into a sentence α • Philosophy: • a person thought sentence β in language B, but accidentally communicated it in language A; so we must figure out which β was intended, given α. [Knig99]

  6. Bayes’ Theorem • In English: • The probability of event β occurring, given that event α occurred, equals the probability of α occurring given β, times the probability of β, divided by the probability of α • Logically: • probability(β if α) = probability(α if β) * probability(β) / probability(α) • Formally: • P(β|α) = P(α|β)P(β) / P(α)
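To make the formula concrete, here is a minimal Python illustration of the same identity. Every probability below is invented for the sake of the example; nothing here comes from a real model.

```python
# Toy illustration of Bayes' theorem as used for translation scoring.
# All probabilities below are invented for the sake of the example.

def posterior(p_alpha_given_beta, p_beta, p_alpha):
    """P(beta | alpha) = P(alpha | beta) * P(beta) / P(alpha)."""
    return p_alpha_given_beta * p_beta / p_alpha

# Suppose the translation model gives P(alpha | beta) = 0.02, the language
# model gives P(beta) = 0.001, and P(alpha) = 0.00004 (all made up):
print(posterior(0.02, 0.001, 0.00004))  # -> 0.5
```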

  7. Why It Is Relevant • We know that sentence α occurred. • Let’s generate every possible translation and call the group a haystack; somewhere in the haystack there must be the needle that is the 'perfect' translation… But how could we find it? • If we know the probability of a translation occurring (i.e. its grammatical correctness), we can eliminate 99% of the sentences. Perfect Translation: the translation that would be produced if a team of experts were to translate the sentence.

  8. Applying Bayes’ Theorem • The most likely translation (β) of the source sentence (α) is the one that maximizes P(β) * P(α|β) ** P(α) is constant across all candidate translations, so it cancels out of the comparison. • The likelihood of a translation is the product of • 1) the grammatical correctness of sentence β (i.e. the likelihood someone would say it) ---- P(β) • 2) the semantic correctness of sentence β compared to α (i.e. the likelihood that sentence β would be rendered as α) ---- P(α|β) • The probability that if someone thought β in language B, they would say α in language A. • The best translation is the sentence that has the highest score [Knig99]
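A sketch of what this argmax could look like in code: choose the candidate β that maximizes P(β) * P(α|β). Both model functions below are placeholders with invented numbers, not the real models described later; only the structure of the search matters here.

```python
# Sketch of the argmax on this slide.  language_model and translation_model are
# placeholders with made-up values; only the structure of the search matters.

def language_model(beta):
    # Placeholder for P(beta), e.g. from an n-gram model.
    toy = {
        "Mary did not slap the green witch": 1e-6,
        "Mary not did slap witch green the": 1e-12,
    }
    return toy.get(beta, 1e-15)

def translation_model(alpha, beta):
    # Placeholder for P(alpha | beta), e.g. from IBM Model 3.
    return 1e-4  # constant here, so the language model decides the winner

def best_translation(alpha, candidates):
    return max(candidates,
               key=lambda beta: language_model(beta) * translation_model(alpha, beta))

alpha = "Mary no daba una botefada a la bruja verde"
candidates = ["Mary did not slap the green witch",
              "Mary not did slap witch green the"]
print(best_translation(alpha, candidates))  # -> the grammatical candidate
```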

  9. Plan of Attack • In order to translate a sentence: • 1) create a word-for-word translation of the source sentence • 2) add, replace, and reorder words to increase grammatical accuracy • 3) repeat step 2 until a maximal translation has been found (until there are no more possible modifications that increase the overall score of the sentence) • For our purposes we will assume B is English and A is Spanish. * Not necessarily in that order *

  10. Parts of a Statistical Machine Translation System • Language Model: assigns a probability P(β) to any string in the destination language, representing the grammatical correctness of that sentence ~ called ‘Fluency’ in the textbook • Translation Model: assigns a probability P(α|β) to any pair of strings in languages A and B, representing the meaningful similarity of the sentences ~ called ‘Faithfulness’ in the textbook • Decoder: takes a sentence α and attempts to return the sentence β that maximizes P(β|α), utilizing Bayes’ Theorem together with the Language Model and Translation Model • Typically creates a forest of guessed β’s, then finds the best by evaluating them according to the Language and Translation Models. ~ called ‘Search’ in the textbook

  11. Language Model • “How likely is this sentence to be uttered by a native B speaker?” • Provides a measure of the grammatical accuracy of the sentence • How do you compare Language Models? • Using a corpus of test data, compare P(model | test-data) • P(model | test-data) = P(test-data | model) * P(model) / P(test-data) • But P(model) and P(test-data) are the same for every model being compared • So P(model | test-data) ∝ P(test-data | model) • Therefore, the language model that assigns the highest probability to the test data is the best. • A language model is therefore simply an N-gram/Brill/CFG/etc. structure used to compute the probabilities of word combinations, instead of using those same probabilities to assign Part-Of-Speech tags. • Because the probabilities assigned to data are extremely small, models are usually compared by the quantity 2^(-log2(P(e))/N) • This is called the perplexity; as P(e) decreases, perplexity increases – small perplexity is better [Knig99]
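As an illustration of the last two bullets, here is a minimal bigram language model with the perplexity measure 2^(-log2 P(e)/N). The two-sentence "corpus" and the add-k smoothing constant are arbitrary choices made only for this sketch.

```python
# Minimal bigram language model plus the perplexity measure from this slide.
import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log2_prob(sentence, unigrams, bigrams, k=1.0):
    # log2 P(e) under the bigram model, with add-k smoothing for unseen pairs.
    vocab = len(unigrams)
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * vocab)
        total += math.log2(p)
    return total

def perplexity(sentence, unigrams, bigrams):
    n = len(sentence.split())
    return 2 ** (-log2_prob(sentence, unigrams, bigrams) / n)  # 2^(-log2 P(e) / N)

unigrams, bigrams = train_bigram(["Mary did not slap the green witch",
                                  "the witch did not slap Mary"])
print(perplexity("Mary did not slap the witch", unigrams, bigrams))
```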

  12. Translation Model • Based on a ‘philosophy’ of translation: • Sentences in language B are converted into predicate logic/logical assertions, and these assertions are then converted into language A (the interlingua approach). “Bob and Jane are friends” => Friends(Bob,Jane) => Bob y Jane son amigos. • A sentence in language B gets syntactically parsed (i.e. POS) into a tree diagram, the tree is rearranged to match language A syntax, and finally all words are translated into language A. • The “Transfer Metaphor” • Words in language B are translated word-for-word into language A and then randomly scrambled. • Most basic, commonly used models: IBM Models 1-5.

  13. Translation Model: IBM Model 3 • Developed at IBM by Brown, Della Pietra, Della Pietra, and Mercer, • described in “The mathematics of statistical machine translation: Parameter estimation” (1993) • Very easy to understand and implement. • When used in conjunction with a good Language Model, it will weed out poor permutations. • The basic idea: translate the sentence word-for-word, then return a permutation of that sentence. We rely on the Language Model to determine how good the permutation is, and we keep feeding back permutations until a good one has been found. • Accounts for words that don’t have a 1:1 translation.

  14. IBM Model 3:Fertility • Definition: the number of words in language A that are required to translate a given word in language B. • Example: • Mary did not slap the green witch. English • Mary no daba una botefada a la bruja verde. Spanish • The word ‘slap’ in English translates to ‘daba una botefada’ in Spanish • Thus, when translating English to Spanish, ‘slap’ has a fertility of 3 Adapted from [Knig99]

  15. IBM Model 3: Example the First: Translate “Mary did not slap the green witch” from English to Spanish • Step 1: Recopy the sentence, copying every word the number of times dictated by its fertility. • Fertilities: did = 0, Mary = not = the = green = witch = 1, slap = 3; plus 1 ‘spurious’ word • Mary not slap slap slap the green witch • Step 2: Translate the words into language A (the spurious word surfaces as ‘a’). • Mary no daba una botefada a la verde bruja • Step 3: Permute the words to find the proper structure. • Mary no daba una botefada a la bruja verde Adapted from [Knig99]

  16. IBM Model 3 Structure • Translation: Τ(a|b) = probability of producing a from b • Τ(‘verde’|’green’) = probability of translating ‘green’ into ‘verde’ • Τ = a two-dimensional table of floating point values • Fertility: η(x|b) = probability that b will produce exactly x words when translated • η(3|’slap’) = probability that ‘slap’ will be translated as exactly 3 words • η = a two-dimensional table of floating point values • Distortion: δ(x|y,q,r) = probability that the word in location y will end up in location x when translated, given that the original sentence has q words and the translated sentence has r words • δ(7|6,7,9) = probability that the word in the 7th position of the translated sentence (‘bruja’) will have resulted from the word in the 6th position in the original sentence (‘witch’) when the original sentence was of length 7 and the translated sentence is of length 9 • ‘Mary did not slap the green *witch*’ -> ‘Mary no daba una botefada a la *bruja* verde’ • δ = a four-dimensional table of floating point values (indexed by x, y, q, r) • Spurious Production: the chance that words appear in the translation for which there was no directly corresponding word in the original sentence. • Assign a probability ρ; every time a word is translated there is a ρ chance that a spurious word is produced. • ρ = a single floating point number [Knig99]
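One possible way to hold these parameters in memory, sketched in Python with plain dictionaries rather than dense matrices; every probability value below is invented.

```python
# One possible in-memory layout for the Model 3 parameters above; all values
# here are made up for illustration.

t = {("verde", "green"): 1.0,       # translation  T(a | b)
     ("bruja", "witch"): 1.0,
     ("daba", "slap"): 1.0}

fertility = {(3, "slap"): 0.66,     # fertility  eta(x | b)
             (2, "slap"): 0.34,
             (1, "witch"): 1.0}

distortion = {(7, 6, 7, 9): 0.75}   # distortion  delta(x | y, q, r)

p_spurious = 0.111                  # spurious-production probability  rho

print(t[("verde", "green")], fertility[(3, "slap")], distortion[(7, 6, 7, 9)])
```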

  17. IBM Model 3 Algorithm • Step 1: For each original word αi, choose a fertility θi with probability η(θi | αi); the sum of all these fertilities is Μ • Step 2: Choose θ0 ‘spurious’ translated words to be generated from α0 = NULL, using probability ρ and Μ. • Step 3: For each i = 1..n and each k = 1..θi, choose a translated word τik with probability Τ(τik | αi). • Step 4: For each i = 1..n and each k = 1..θi, choose a target translated position πik with probability δ(πik|i,n,m). • Step 5: For each k = 1..θ0, choose a position π0k from the θ0 – k + 1 remaining vacant positions in 1..m, for a total probability of 1/(θ0!). • Step 6: Output the translated sentence with words τik in positions πik. [Knig99]
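The six steps can be read as a sampling procedure. The sketch below assumes hypothetical nested-dictionary parameter tables (e.g. fertility['slap'] = {3: 0.66, 2: 0.34}, translation['green'] = {'verde': 1.0}, distortion[(i, n, m)] = {position: probability}, and translation[None] for the NULL word) and 0-indexed positions. It illustrates the generative story only; it is not a faithful reimplementation of the IBM system.

```python
import random

def sample(dist):
    """dist: {outcome: probability}. Draw one outcome (probabilities sum to ~1)."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def model3_generate(source, fertility, translation, distortion, p_spurious):
    # Step 1: choose a fertility for each source word; M is their sum.
    ferts = [sample(fertility[w]) for w in source]
    M = sum(ferts)
    # Step 2: decide how many spurious words NULL contributes (one coin flip per real word).
    n_spurious = sum(1 for _ in range(M) if random.random() < p_spurious)
    # Step 3: choose a translated word for every fertility slot, and for every spurious slot.
    slots = [[sample(translation[w]) for _ in range(f)] for w, f in zip(source, ferts)]
    spurious = [sample(translation[None]) for _ in range(n_spurious)]  # None stands for NULL
    m = M + n_spurious
    # Step 4: choose a target position for each real translated word.
    target = [None] * m
    for i, words in enumerate(slots):
        for w in words:
            pos = sample(distortion[(i, len(source), m)])
            target[pos] = w          # a full implementation would handle position collisions
    # Step 5: drop the spurious words into the remaining vacant positions at random.
    vacant = [j for j, w in enumerate(target) if w is None]
    random.shuffle(vacant)
    for w, pos in zip(spurious, vacant):
        target[pos] = w
    # Step 6: read off the translated sentence.
    return " ".join(w for w in target if w is not None)
```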

  18. IBM Model 3: Example the Second: Translating “Mary did not slap the green witch” from English to Spanish • Step 1: For each original word αi, choose a fertility θi with probability η(θi | αi); the sum of all these fertilities is Μ • α0 = NULL, θ0 = ?; α1 = ‘Mary’, θ1 = 1; α2 = ‘did’, θ2 = 0; α3 = ‘not’, θ3 = 1; α4 = ‘slap’, θ4 = 3; α5 = ‘the’, θ5 = 1; α6 = ‘green’, θ6 = 1; α7 = ‘witch’, θ7 = 1 • Μ = 8 • Step 2: Choose θ0 ‘spurious’ translated words to be generated from α0 = NULL, using probability ρ and Μ. • θ0 = 1 • Step 3: For each i = 1..n and each k = 1..θi, choose a translated word τik with probability Τ(τik | αi). • τ01 = ‘a’ (from α0 = NULL); τ11 = ‘Mary’; (α2 = ‘did’ produces no word); τ31 = ‘no’; τ41 = ‘daba’; τ42 = ‘una’; τ43 = ‘botefada’; τ51 = ‘la’; τ61 = ‘verde’; τ71 = ‘bruja’ • Step 4: For each i = 1..n and each k = 1..θi, choose a target translated position πik with probability δ(πik|i,n,m). • π11 = 0; π31 = 1; π41 = 2; π42 = 3; π43 = 4; π51 = 6; π61 = 8; π71 = 7 • Step 5: For each k = 1..θ0, choose a position π0k from the θ0 – k + 1 remaining vacant positions in 1..m, for a total probability of 1/(θ0!). • π01 = 5 • Step 6: Output the translated sentence with words τik in positions πik. • “Mary no daba una botefada a la bruja verde” Adapted from [Knig99]

  19. Step 1: For each original word αi, choose a fertility θi with probability η(θi | αi); the sum of all these fertilities is Μ • For every word in the original sentence, determine the most likely number of corresponding translated words. • α1 = ‘Mary’, θ1 = 1: η(1 | 'Mary') = 1.0 • α2 = ‘did’, θ2 = 0: η(0 | 'did') = 1.0 • α3 = ‘not’, θ3 = 1: η(1 | 'not') = .75, η(2 | 'not') = .25 • α4 = ‘slap’, θ4 = 3: η(3 | 'slap') = .66, η(2 | 'slap') = .34 • α5 = ‘the’, θ5 = 1: η(1 | 'the') = 1.0 • α6 = ‘green’, θ6 = 1: η(1 | 'green') = 1.0 • α7 = ‘witch’, θ7 = 1: η(1 | 'witch') = 1.0 • Μ = 1 + 0 + 1 + 3 + 1 + 1 + 1 = 8

  20. Step 2: Choose θ0 ‘spurious’ translated words to be generated from α0 = NULL, using probability ρ and Μ • Try to guess how many words will appear in the translated sentence that are not directly related to any of the words from the original sentence • α0 = NULL, θ0 = 1: ρ = .111, Μ = 8, expected number of spurious words ≈ ρ · Μ ≈ 1 (this word will surface as ‘a’) • α1 = ‘Mary’, θ1 = 1 • α2 = ‘did’, θ2 = 0 • α3 = ‘not’, θ3 = 1 • α4 = ‘slap’, θ4 = 3 • α5 = ‘the’, θ5 = 1 • α6 = ‘green’, θ6 = 1 • α7 = ‘witch’, θ7 = 1 • Μ = 8

  21. Step 3: For each i = 1..n, and each k = 1..θi, choose a translated word τik with probability Τ(τik | αi) • Choose translations based on the most probable translation for a given word and model. • τ01 = 'a': α0 = NULL Τ('a' | NULL) = 1.0 • τ11 = 'Mary': α1 = ‘Mary’ Τ('Mary' | 'Mary') = 1.0 • α2 = ‘did’: θ2 = 0, so no translated word is produced • τ31 = 'no': α3 = ‘not’ T('no' | 'not') = .7, T('nada' | 'not') = .3 • τ41 = 'daba': α4 = ‘slap’ T('daba' | 'slap') = 1.0* • τ42 = 'una': α4 = ‘slap’ T('una' | 'slap') = .55, T('un' | 'slap') = .45 • τ43 = 'botefada': α4 = ‘slap’ T('botefada' | 'slap') = 1.0 • τ51 = 'la': α5 = ‘the’ T('la' | 'the') = .55, T('el' | 'the') = .45 • τ61 = 'verde': α6 = ‘green’ T('verde' | 'green') = 1.0 • τ71 = 'bruja': α7 = ‘witch’ T('bruja' | 'witch') = 1.0 * The probability that the first (τ41) translated word corresponding to 'slap' will be 'daba'

  22. Step 4: For each i = 1..n and each k = 1..θi, choose a target translated position πik with probability δ(πik|i,n,m) • π11 = 0 δ(0|1,7,9) = .75 • π31 = 1 δ(1|3,7,9) = .75 • π41 = 2 δ(2|4,7,9) = .3 • π42 = 3 δ(3|4,7,9) = .3 • π43 = 4 δ(4|4,7,9) = .3 • π51 = 6 δ(6|5,7,9) = .75 • π61 = 8 δ(8|6,7,9) = .75 • π71 = 7 δ(7|7,7,9) = .75

  23. Step 5: For each k = 1..θ0, choose a position π0k from the θ0 – k + 1 remaining vacant positions in 1..m, for a total probability of 1/(θ0!) • π01 = 5: θ0 = 1, k = 1, 1 vacant position remaining • Total Probability = 1/(1!) = 1

  24. Step 6:Output the translated sentence with words τi,k in positions πik. • Display the result. • σ1 = ‘Mary’ • σ2 = ‘no’ • σ3 = ‘daba’ • σ4 = ‘una’ • σ5 = ‘botefada’ • σ6 = ‘a’ • σ7 = ‘la’ • σ8 = ‘bruja’ • σ9 = ‘verde’ • “Mary no daba una botefada a la bruja verde”

  25. Warning • All data in the previous example was completely made up. • Each function could be implemented in almost any manner. • A simple one would be to store ρ as a single floating point value, Τ and η as two-dimensional floating point matrices, and δ as a table indexed by its four arguments.

  26. Decoding • Takes a previously unseen sentence α and tries to find the translation β that maximizes P(β|α) [∝ P(β) * P(α|β)]. • If translation is constrained such that the translation and the source sentence have the same word order, then decoding can be done in linear time. • If translation is constrained such that the syntax of the translated sentence can be obtained from rotations around binary tree nodes (simple tree re-ordering), then decoding requires high-polynomial time. • For most languages, which require what amounts to arbitrary word reordering, decoding is provably NP-complete. • An optimal solution: map translation of a sentence onto the Traveling Salesman Problem, then use available Integer Programming software to find an optimal solution. [Germ01]

  27. A Greedy Solution • Start decoding with an approximate solution, and incrementally improve it until no more improvements can be found. • Begin with a word-for-word most-likely translation. • At each step in decoding, modify the translation with the improvement that most increases the overall score P(β) * P(α|β) • Main improvements: • translateOneOrTwoWords(j1,e1,j2,e2): changes the translations of the word(s) located at j1 (and j2) into e1 and e2, placing any new words at the locations that most increase the score of the sentence • translateAndInsert(j,e1,e2): changes the translation of the word at j into e1 and inserts e2 at the location that most increases the score of the sentence • swapSegments(i1,i2,j1,j2): swaps the non-overlapping word segments [i1,i2] and [j1,j2]. • Can find ‘reasonable’ translations in just a few seconds per sentence. • Overall, this greedy algorithm is O(n^6), where n is the number of words in the sentence. Proposed by Germann et al. in Fast Decoding and Optimal Decoding for Machine Translation, 2001.
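A compact sketch of that hill-climbing loop. word_for_word, propose_changes, and score are placeholders standing in for the initial gloss, the three improvement operators, and the P(β) * P(α|β) evaluation, respectively; none of them is defined here.

```python
# Hedged sketch of the greedy decoder described above: start from a
# word-for-word seed and keep applying the single best-scoring local change.

def greedy_decode(alpha, word_for_word, propose_changes, score):
    beta = word_for_word(alpha)            # initial word-for-word translation
    current = score(alpha, beta)
    while True:
        best_beta, best_score = beta, current
        for candidate in propose_changes(alpha, beta):
            s = score(alpha, candidate)
            if s > best_score:
                best_beta, best_score = candidate, s
        if best_beta == beta:              # no single change improves the score
            return beta
        beta, current = best_beta, best_score
```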

  28. Greedy Decoding: Example the Third: Translate “Bien entendu, il parle de une belle victoire” from French into English • Create the initial word-for-word most-likely translation: • "bien entendu, il parle de une belle victoire" => • "well heard, it talking a beautiful victory" • Modify using translateOneOrTwoWords(5,talks,7,great) • "bien entendu, il parle de une belle victoire" => • "well heard, it talks a great victory" • Modify using translateOneOrTwoWords(2,understood,0,about) • "bien entendu, il parle de une belle victoire" => • "well understood, it talks about a great victory" • Modify using translateOneOrTwoWords(4,he,null,null) • "bien entendu, il parle de une belle victoire" => • "well understood, he talks about a great victory" • Modify using translateOneOrTwoWords(1,quite,2,naturally) • "bien entendu, il parle de une belle victoire" => • "quite naturally, he talks about a great victory” • Final Result: "quite naturally, he talks about a great victory" Adapted from [Germ01]

  29. Improving Greedy Algorithms • The above algorithm can be simplified to run in almost linear time by: • Limiting the ability to swap sentence segments • On a first sweep, identifying all independent improvements that increase P(β) • Limiting improvements to these sections and considering the rest of the sentence ‘optimized’ • The speedup gained by adding these constraints greatly offsets the decrease in accuracy, allowing multiple searches utilizing different starting permutations. [Germ03]

  30. Incorporating Syntax into SMT • What if we use syntactic phrases as the foundation for our model, instead of words? • Extract cues about syntactic structure from training data, in addition to word frequencies and alignments • Can result in more "perfect translations", where the output sentence needs no human editing. [Char03]

  31. Syntax Based MT Translation Model: • Given a parse tree for a sentence in language B, produce a sentence in language A. • 3 Steps: • Reorder the nodes of the parse tree • Insert optional words at each node • Translate each leaf (word) from B into A • Utilizes a Probabilistic Context Free Grammar extracted from a training corpus • (Each operation is governed by the corresponding probability) • Also accounts for phrasal translations – a phrase directly translatable without any operations • not -> ne…pas (English -> French) • The "Transfer Metaphor" [Char03]
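A rough sketch of the reorder / insert / translate operations applied recursively to a toy parse tree, reusing the sample() helper from the Model 3 sketch above. The table layouts (r_table for child reorderings, n_table for optional insertions, t_table for leaf translations) are hypothetical stand-ins for parameters estimated from an aligned corpus, not the tables used by Charniak et al.

```python
# Hypothetical tables, e.g.:
#   r_table["VP"] = {(0, 1): 0.3, (1, 0): 0.7}      # child orderings
#   n_table["VP"] = {None: 0.8, "ne": 0.2}          # optional inserted word
#   t_table["not"] = {"pas": 1.0}                   # leaf translations

def transform(node, r_table, n_table, t_table):
    """node is either a word (str) or a pair (label, children)."""
    if isinstance(node, str):                 # leaf: translate the word
        return [sample(t_table[node])]
    label, children = node
    out = []
    inserted = sample(n_table[label])         # optionally insert a function word
    if inserted is not None:
        out.append(inserted)
    order = sample(r_table[label])            # e.g. (1, 0) swaps two children
    for idx in order:
        out.extend(transform(children[idx], r_table, n_table, t_table))
    return out

# e.g. transform(("S", ["he", ("VP", ["talks", "not"])]), r_table, n_table, t_table)
```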

  32. Syntax Based MT • Decoding: • Build a large forest of nearly all possible parse trees for a sentence • Remove all parse trees with a PCFG probability below a threshold (e.g. .00001) • Use a lexicalized PCFG to evaluate the remaining trees (it is too complicated to apply to the full forest) • Choose the best translation. While the parse trees are therefore pruned using the non-lexicalized model, empirical evidence has shown that the threshold of .00001 removes most irrelevant parses without affecting parse accuracy (because very few lexical combinations can overcome such a disadvantage). [Char03]
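The two-pass pruning idea in code form: score every candidate parse with the cheap non-lexicalized PCFG, discard those below the threshold, then re-rank only the survivors with the expensive lexicalized model. Both scoring functions are placeholders here.

```python
# Sketch of threshold pruning followed by lexicalized re-ranking.
# pcfg_prob and lexicalized_prob are placeholder scoring functions.

def prune_and_rerank(parses, pcfg_prob, lexicalized_prob, threshold=1e-5):
    survivors = [p for p in parses if pcfg_prob(p) >= threshold]
    return max(survivors, key=lexicalized_prob) if survivors else None
```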

  33. Conclusion • Statistical Machine Translation relies on utilizing a Language Model and a Translation Model to maximize P(β|α) ∝ P(α|β)P(β) • Translations produced are good enough to get a general idea of the content, but usually need human post-editing • Utilizing syntax can increase the number of sentences that do not require post-editing. • Much time and effort is put into producing inaccurate results. • If you followed this superbly interesting and supremely stimulating presentation, you should now be capable of developing (at least in theory) a fairly powerful Machine Translation implementation!

  34. References • Brown, P., Della Pietra, S., Della Pietra, V., and Mercer, R. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics. (available online at http://acl.ldc.upenn.edu/J/J93/J93-2003.pdf ) • Charniak, E., Knight, K., and Yamada, K. 2003. Syntax-based Language Models for Machine Translation. Proceedings of MT Summit IX 2003. New Orleans. (available online at http://www.isi.edu/natural-language/projects/rewrite/mtsummit03.pdf ) • Dorr, B. and Monz, C. 2004. Statistical Machine Translation. Presented as a lecture for CMSC 723 at the University of Maryland. (available online at http://www.umiacs.umd.edu/~christof/courses/cmsc723-fall04/lecture-notes/Lecture8-statmt.ppt ) • Germann, U. 2003. Greedy Decoding for Statistical Machine Translation in Almost Linear Time. Proceedings of HLT-NAACL 2003. Edmonton, AB, Canada. (available online at http://acl.ldc.upenn.edu/N/N03/N03-1010.pdf ) • Germann, U., Jahr, M., Knight, K., Marcu, D., and Yamada, K. 2001. Fast Decoding and Optimal Decoding for Machine Translation. Proceedings of the Conference of the Association for Computational Linguistics (ACL-2001), Toulouse, France, July 2001. (available online at http://www.isi.edu/natural-language/projects/rewrite/decoder.pdf ) • Jurafsky, D. and Martin, J. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (Textbook) • Knight, K. 1999. A Statistical MT Tutorial Workbook. Developed for the JHU 1999 Summer MT Workshop. (available online at http://www.isi.edu/natural-language/mt/wkbk.rtf )
