
Introduction to Natural Language Processing (600.465): Statistical Machine Translation

Presentation Transcript


  1. Introduction to Natural Language Processing (600.465): Statistical Machine Translation
  Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.
  hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

  2. The Main Idea
  • Treat translation as a noisy channel problem:
    • Input (source) E: English words...  →  the channel (adds "noise")  →  noisy output (target) F: Les mots Anglais...
  • The model: P(E|F) = P(F|E) P(E) / P(F)
  • Interested in rediscovering E given F. After the usual simplification (P(F) is fixed):
    argmax_E P(E|F) = argmax_E P(F|E) P(E)
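
A minimal sketch of this argmax over a tiny, made-up candidate set; the sentences, probability values, and table-based `lm_prob`/`tm_prob` helpers are illustrative assumptions only (a real decoder searches a huge space rather than enumerating candidates).

```python
def lm_prob(e):
    """Hypothetical language model P(E); here just a lookup table."""
    table = {"the program has been implemented": 1e-6,
             "program the implemented been has": 1e-12}
    return table.get(e, 1e-15)

def tm_prob(f, e):
    """Hypothetical translation model P(F|E); here just a lookup table."""
    table = {("le programme a ete mis en application",
              "the program has been implemented"): 1e-4}
    return table.get((f, e), 1e-10)

def decode(f, candidates):
    # argmax_E P(F|E) * P(E); P(F) is constant and can be dropped
    return max(candidates, key=lambda e: tm_prob(f, e) * lm_prob(e))

print(decode("le programme a ete mis en application",
             ["the program has been implemented",
              "program the implemented been has"]))
```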

  3. The Necessities
  • Language Model (LM): P(E)
  • Translation Model (TM), target given source: P(F|E)
  • Search procedure
    • Given F, find the best E using the LM and TM distributions.
  • Usual problem: sparse data
    • We cannot create a "sentence dictionary" E ↔ F
    • Typically, we do not see a sentence even twice!

  4. The Language Model
  • Any LM will do:
    • 3-gram LM
    • 3-gram class-based LM
    • decision-tree LM with hierarchical classes
  • Does not necessarily operate on word forms:
    • cf. the "analysis" and "generation" procedures later
    • for simplicity, imagine for now that it does operate on word forms

  5. The Translation Models
  • Do not care about producing correct strings of English words (that is the task of the LM)
  • Therefore, we can make more independence assumptions:
    • for a start, use the "tagging" approach:
      1 English word ("tag") ~ 1 French word ("word")
    • not realistic: rarely is even the number of words the same in both sentences (let alone a 1:1 correspondence!)
    • ⇒ use "alignment".

  6. The Alignment
  • e0 And the program has been implemented          (positions 0 1 2 3 4 5 6)
  • f0 Le programme a été mis en application         (positions 0 1 2 3 4 5 6 7)
  • Linear notation:
    • f0(1) Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6)
    • e0 And(0) the(1) program(2) has(3) been(4) implemented(5,6,7)
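
The same alignment can be held in a very simple data structure; the sketch below just encodes the example from this slide (with e0 as the empty word), nothing more.

```python
# a[j-1] = i means the j-th French word links to the i-th English word (0 = empty word e0).
e = ["e0", "And", "the", "program", "has", "been", "implemented"]   # positions 0..6
f = ["Le", "programme", "a", "été", "mis", "en", "application"]     # positions 1..7
a = [2, 3, 4, 5, 6, 6, 6]   # Le->the, programme->program, ..., mis/en/application->implemented

for j, (fw, i) in enumerate(zip(f, a), start=1):
    print(f"f{j} {fw} -> e{i} {e[i]}")
```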

  7. Alignment Mapping
  • In general (|F| = m, |E| = l, the sentence lengths):
    • l·m possible connections (each French word to any English word),
    • 2^(l·m) different alignments for any pair (E,F) (any subset of connections)
  • In practice (from English to French):
    • each English word gets 1-n connections (n = empirical maximum fertility?)
    • each French word gets exactly 1 connection
    • therefore "only" (l+1)^m alignments ( << 2^(l·m) )
    • a_j = i (the link from the j-th French word goes to the i-th English word)
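
To get a feel for these counts, here is the arithmetic for the example pair on slide 6 (l = 6 English words, m = 7 French words); the snippet only evaluates the formulas above.

```python
l, m = 6, 7
print(l * m)            # 42 possible connections
print(2 ** (l * m))     # 2^42 ≈ 4.4e12 unrestricted alignments (any subset of connections)
print((l + 1) ** m)     # 7^7 = 823543 restricted alignments (one link per French word)
```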

  8. Elements of Translation Model(s)
  • Basic distribution:
    • P(F,A,E) - the joint distribution of the English sentence, the alignment, and the French sentence (of length m)
  • Interested also in marginal distributions:
    • P(F,E) = Σ_A P(F,A,E)
    • P(F|E) = P(F,E) / P(E) = Σ_A P(F,A,E) / Σ_{A,F} P(F,A,E) = Σ_A P(F,A|E)
  • Useful decomposition (one of several possible decompositions):
    P(F,A|E) = P(m|E) · Π_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^j, f_1^{j-1}, m, E)

  9. Decomposition
  • The decomposition formula again:
    P(F,A|E) = P(m|E) · Π_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^j, f_1^{j-1}, m, E)
    • m - length of the French sentence
    • a_j - the alignment (single connection) going from the j-th French word
    • f_j - the j-th French word of F
    • a_1^{j-1} - the sequence of alignments a_i up to the word preceding f_j
    • a_1^j - the sequence of alignments a_i up to and including the word f_j
    • f_1^{j-1} - the sequence of French words up to the word preceding f_j

  10. Decomposition and the Generative Model
  • ...and again:
    P(F,A|E) = P(m|E) · Π_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^j, f_1^{j-1}, m, E)
  • Generate:
    • first, the length of the French sentence given the English words E;
    • then, the link from the first position in F (not knowing the actual French word yet) ⇒ now we know the English word;
    • then, given the link (and thus the English word), generate the French word at the current position;
    • then, move to the next position in F, until all m positions are filled.

  11. Approximations
  • Still too many parameters
    • similar situation to an n-gram model with "unlimited" n
    • impossible to estimate reliably.
  • Use 5 models, from the simplest to the most complex (i.e., from heavy independence assumptions to light ones)
  • Parameter estimation: estimate the parameters of Model 1; use them as an initial estimate for estimating the Model 2 parameters; etc.

  12. Model 1
  • Approximations:
    • French length P(m|E) is constant (a small ε)
    • the alignment-link distribution P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) depends only on the English length l (= 1/(l+1))
    • the French word distribution depends only on the English and French words connected by the link a_j.
  • ⇒ Model 1 distribution:
    P(F,A|E) = ε / (l+1)^m · Π_{j=1..m} p(f_j | e_{a_j})
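
A minimal sketch of evaluating this Model 1 formula for one fixed alignment; the translation table `t`, the ε value, and the probabilities in it are made-up numbers for illustration, not trained parameters.

```python
from math import prod

def model1_prob(f_words, e_words, alignment, t, eps=0.1):
    """P(F,A|E) = eps/(l+1)^m * prod_j p(f_j | e_{a_j});
    e_words[0] is the empty word e0, alignment[j-1] = i links f_j to e_i."""
    l = len(e_words) - 1          # English length (excluding e0)
    m = len(f_words)
    return (eps / (l + 1) ** m) * prod(
        t.get((fj, e_words[i]), 1e-12) for fj, i in zip(f_words, alignment))

t = {("le", "the"): 0.4, ("programme", "program"): 0.3,
     ("a", "has"): 0.2, ("été", "been"): 0.2,
     ("mis", "implemented"): 0.1, ("en", "implemented"): 0.05,
     ("application", "implemented"): 0.05}

e = ["e0", "and", "the", "program", "has", "been", "implemented"]
f = ["le", "programme", "a", "été", "mis", "en", "application"]
print(model1_prob(f, e, [2, 3, 4, 5, 6, 6, 6], t))
```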

  13. Models 2-5
  • Model 2
    • adds more detail to P(a_j|...): more "vertical" links are preferred
  • Model 3
    • adds "fertility": the number of links for a given English word is explicitly modeled, P(n | e_i)
    • "distortion" replaces the alignment probabilities of Model 2
  • Model 4
    • the notion of "distortion" is extended to chunks of words
  • Model 5
    • is Model 4, but not deficient (does not waste probability mass on non-strings)

  14. The Search Procedure
  • "Decoder":
    • given the "output" (French), discover the "input" (English)
    • the translation model goes in the opposite direction: p(f|e) = ....
    • naive methods do not work.
  • Possible solution (roughly):
    • generate English words one by one, keeping only an n-best list (variable n); also account for the different lengths of the English sentence candidates!

  15. Analysis - Translation - Generation (ATG)
  • Word forms: too sparse
  • Use four basic analysis/generation steps:
    • tagging
    • lemmatization
    • word-sense disambiguation
    • noun-phrase "chunks" (non-compositional translations)
  • Translation proper:
    • use chunks as "words"

  16. Training vs. Test with ATG
  • Training:
    • analyze both languages using all four analysis steps
    • train the TM(s) on the result (i.e., on chunks, tags, etc.)
    • train the LM on the analyzed source (English)
  • Runtime/Test:
    • analyze the given sentence (French) using the same tools as in training
    • translate using the trained translation/language model(s)
    • generate the source (English), reversing the analysis process

  17. Analysis: Tagging and Morphology
  • Replace word forms by morphologically processed text:
    • lemmas
    • tags
    • original approach: mix them into the text and call them "words",
      e.g. She bought two books. ⇒ she buy VBP two book NNS.
  • Tagging: yes
    • but in reversed order: tag first, then lemmatize [NB: this does not work for inflective languages]
    • technically easy
    • hand-written deterministic rules for tag + form ⇒ lemma
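
A toy sketch of what such hand-written deterministic (form, tag) ⇒ lemma rules might look like; the rules and the exception list below are illustrative assumptions, not the actual rule set from the lecture.

```python
EXCEPTIONS = {("bought", "VBP"): "buy", ("was", "VBD"): "be"}   # toy irregular forms

def lemmatize(form, tag):
    form = form.lower()
    if (form, tag) in EXCEPTIONS:
        return EXCEPTIONS[(form, tag)]
    if tag == "NNS" and form.endswith("s"):     # regular plural noun: books -> book
        return form[:-1]
    if tag == "VBZ" and form.endswith("s"):     # 3rd person singular: buys -> buy
        return form[:-1]
    return form

# "She bought two books." tagged roughly as in the slide example:
print([lemmatize(w, t) for w, t in
       [("She", "PRP"), ("bought", "VBP"), ("two", "CD"), ("books", "NNS")]])
```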

  18. Word Sense Disambiguation, Word Chunking
  • Sets of senses for each English and French word:
    • e.g. book-1, book-2, ..., book-n
    • prepositions (de-1, de-2, de-3, ...), and many others
  • Senses are derived automatically using the TM:
    • translation probabilities are measured on senses: p(de-3 | from-5)
  • Result:
    • a statistical model for assigning senses monolingually based on context (a MaxEnt model is also used here for each word)
  • Chunks: group words for non-compositional translation

  19. Generation
  • The inverse of analysis
  • Much simpler:
    • chunks ⇒ words (lemmas) with senses (trivial)
    • words (lemmas) with senses ⇒ words (lemmas) (trivial)
    • words (lemmas) + tags ⇒ word forms
  • Additional step:
    • source-language ambiguity:
      electric vs. electrical, hath vs. has, you vs. thou: treated as a single unit in translation proper, but must be disambiguated at the end of the generation phase, using an additional pure LM on word forms.

  20. Introduction to Natural Language Processing (600.465): Statistical Translation: Alignment and Parameter Estimation
  Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.
  hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

  21. Alignment
  • Available corpus assumed:
    • parallel text (translation E ↔ F)
    • no alignment present (day marks only)!
  • Sentence alignment
    • sentence detection
    • sentence alignment
  • Word alignment
    • tokenization
    • word alignment (with restrictions)

  22. Sentence Boundary Detection
  • Rules, lists:
    • Sentence breaks:
      • paragraphs (if marked)
      • certain characters: ?, !, ; (...almost surely a break)
    • The problem: the period "."
      • could be the end of a sentence (... left yesterday. He was heading to ...)
      • decimal point: 3.6 (three point six)
      • thousands separator: 3.200 (three thousand two hundred)
      • abbreviations that never end a sentence: cf., e.g., Calif., Mt., Mr.
      • ellipsis: ...
      • other languages: ordinal-number indication (2nd ~ 2.)
      • initials: A. B. Smith
  • Statistical methods: e.g., maximum entropy
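
A rough, rule-based period disambiguator in the spirit of the list above; the abbreviation list and patterns are toy assumptions, not a complete or real rule set.

```python
import re

ABBREV = {"cf.", "e.g.", "calif.", "mt.", "mr."}   # toy abbreviation list

def ends_sentence(token):
    """Decide whether a whitespace token ending in '.' closes a sentence."""
    t = token.lower()
    if not t.endswith("."):
        return False
    if t in ABBREV:
        return False                       # known abbreviation, never sentence-final
    if re.fullmatch(r"\d+\.\d+", t):
        return False                       # decimal point: 3.6
    if re.fullmatch(r"[a-z]\.", t):
        return False                       # initials: A. B. Smith
    return True

text = "Mr. Smith left yesterday. He was heading to Mt. Everest."
for tok in text.split():
    print(tok, ends_sentence(tok))
```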

  23. Sentence Alignment
  • The problem: only the sentence boundaries have been detected [diagram: sentence boxes in E and F].
  • Desired output: a segmentation with an equal number of segments, spanning the whole text continuously.
  • Original sentence boundaries are kept [diagram: aligned segment boxes in E and F].
  • Alignments obtained: 2-1, 1-1, 1-1, 2-2, 2-1, 0-1
  • The new segments are called "sentences" from now on.

  24. Alignment Methods
  • Several methods (probabilistic and non-probabilistic)
    • character-length based
    • word-length based
    • "cognates" (word identity used)
    • using an existing dictionary (F: prendre ~ E: make, take)
    • using word "distance" (similarity): names, numbers, borrowed words, words of Latin origin, ...
  • Best performing:
    • statistical, word- or character-length based (perhaps with some word information)

  25. Length-based Alignment
  • First, define the problem probabilistically:
    argmax_A P(A|E,F) = argmax_A P(A,E,F)   (E,F fixed)
  • Define a "bead": a group of aligned segments [diagram; 2:2 in this case, i.e. two English sentences aligned with two French sentences].
  • Approximate:
    P(A,E,F) ≈ Π_{i=1..n} P(B_i), where B_i is a bead; P(B_i) does not depend on the rest of E,F.

  26. The Alignment Task
  • Given the model definition P(A,E,F) ≈ Π_{i=1..n} P(B_i), find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data.
  • Define B_i = p:q_{a_i}, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2} describes the type of alignment.
  • We want to use some sort of dynamic programming:
    • define Pref(i,j) ... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j).

  27. Recursive Definition
  [diagram: the cells Pref(i-2,j-2) ... Pref(i,j-1) reachable from (i,j), labeled with the six bead probabilities P(p:q_{a_k})]
  • Initialize: Pref(0,0) = 1.
  • Pref(i,j) = max ( Pref(i,j-1) P(0:1_{a_k}),
                     Pref(i-1,j) P(1:0_{a_k}),
                     Pref(i-1,j-1) P(1:1_{a_k}),
                     Pref(i-1,j-2) P(1:2_{a_k}),
                     Pref(i-2,j-1) P(2:1_{a_k}),
                     Pref(i-2,j-2) P(2:2_{a_k}) )
  • This is enough for a Viterbi-like search.
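
A minimal sketch of this recursion as a Viterbi-like search, working in log space to avoid underflow. The helper `bead_logprob(p, q, i, j)` is a placeholder assumption for log P(p:q_{a_k}) of a bead ending at English sentence i and French sentence j (e.g. the length-based score of the next slide); the toy scorer at the bottom is made up purely to exercise the code.

```python
from math import log

BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def align_sentences(n_e, n_f, bead_logprob):
    """Return the best bead sequence covering n_e English and n_f French sentences."""
    NEG_INF = float("-inf")
    pref = [[NEG_INF] * (n_f + 1) for _ in range(n_e + 1)]
    back = [[None] * (n_f + 1) for _ in range(n_e + 1)]
    pref[0][0] = 0.0                                  # log 1 for the empty prefix
    for i in range(n_e + 1):
        for j in range(n_f + 1):
            for p, q in BEAD_TYPES:
                if i >= p and j >= q and pref[i - p][j - q] > NEG_INF:
                    score = pref[i - p][j - q] + bead_logprob(p, q, i, j)
                    if score > pref[i][j]:
                        pref[i][j], back[i][j] = score, (p, q)
    beads, i, j = [], n_e, n_f                        # follow back-pointers
    while (i, j) != (0, 0):
        p, q = back[i][j]
        beads.append((p, q))
        i, j = i - p, j - q
    return list(reversed(beads))

# Toy usage: a flat score that prefers 1:1 beads; aligning 3 English to 4 French
# sentences then yields mostly 1:1 beads plus one 1:2 bead.
toy = lambda p, q, i, j: log(0.9) if (p, q) == (1, 1) else log(0.02)
print(align_sentences(3, 4, toy))
```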

  28. Probability of a Bead
  • It remains to define P(p:q_{a_k}) (the red part of the diagram on the previous slide):
    • k refers to the "next" bead, with segments of p and q sentences and lengths l_{k,e} and l_{k,f}.
  • Use a normal distribution for the length variation:
    • P(p:q_{a_k}) = P(δ(l_{k,e}, l_{k,f}, μ, σ²), p:q) ≈ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) P(p:q)
    • δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} - μ·l_{k,e}) / √(l_{k,e}·σ²)
  • Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data.
  • Words etc. might be used as better clues in the definition of P(p:q_{a_k}).
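
A minimal sketch of this length-based bead score. The mean μ, variance σ², and the P(p:q) prior below are made-up illustrative numbers, not values estimated from data; the log of such a score is the kind of thing the `bead_logprob` placeholder in the previous sketch would return.

```python
from math import exp, pi, sqrt

MU, SIGMA2 = 1.1, 6.8                          # assumed length ratio and variance
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.0445, (1, 2): 0.0445, (2, 2): 0.011}

def bead_prob(len_e, len_f, p, q):
    """len_e, len_f: total lengths of the bead's p English and q French sentences."""
    if len_e == 0:                             # a 0:1 bead has no English length to normalize by
        return PRIOR[(p, q)] * 1e-4            # crude constant, just for the sketch
    delta = (len_f - MU * len_e) / sqrt(len_e * SIGMA2)
    density = exp(-delta * delta / 2) / sqrt(2 * pi)   # standard normal density at delta
    return density * PRIOR[(p, q)]

print(bead_prob(len_e=52, len_f=58, p=1, q=1))
```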

  29. Saving Time
  • For long texts (> 10^4 sentences), even Viterbi (in the version needed) is not effective (O(S^2) time, S = number of sentences)
  • Go paragraph by paragraph if the paragraphs are aligned 1:1
  • What if they are not?
    • Apply the same method to the paragraphs first!
      • identify paragraphs roughly in both languages
      • run the algorithm to get aligned paragraph-like segments
      • then run it on the sentences within the paragraphs.
  • Performs well if there are not many consecutive 1:0 or 0:1 beads.

  30. Word Alignment
  • Length alone does not help anymore:
    • mainly because words can be swapped, and mutual translations often have vastly different lengths.
  • ...but at least we have "sentences" (sentence-like segments) aligned; that will be exploited heavily.
  • Idea:
    • assume some (simple) translation model (such as Model 1);
    • find its parameters by considering virtually all alignments;
    • after we have the parameters, find the best alignment given those parameters.

  31. Word Alignment Algorithm
  • Start with a sentence-aligned corpus.
  • Let (E,F) be a pair of sentences (actually, a bead).
  • Initialize p(f|e) randomly (e.g., uniformly), for f ∈ F, e ∈ E.
  • Compute expected counts over the corpus:
    c(f,e) = Σ_{(E,F); e∈E, f∈F} p(f|e)
    (∀ aligned pairs (E,F): check whether e is in E and f is in F; if so, add p(f|e).)
  • Re-estimate: p(f|e) = c(f,e) / c(e)   [where c(e) = Σ_f c(f,e)]
  • Iterate until the change in p(f|e) is small.
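
A minimal sketch of this estimation loop for Model 1. Note one refinement over the slide's simplified count: the sketch distributes each count as p(f|e) / Σ_{e'∈E} p(f|e'), which is the standard Model 1 expected count; the corpus and helper name are made up for illustration.

```python
from collections import defaultdict

def train_word_translation(corpus, iterations=10):
    """corpus: list of (E, F) beads, each a list of words; E includes the empty word "NULL"."""
    t = defaultdict(lambda: 1.0)                     # p(f|e), start from a uniform guess
    for _ in range(iterations):
        count = defaultdict(float)                   # expected counts c(f, e)
        total = defaultdict(float)                   # c(e) = sum_f c(f, e)
        for E, F in corpus:
            for f in F:
                z = sum(t[(f, e)] for e in E)        # normalizer for this French word
                for e in E:
                    c = t[(f, e)] / z                # expected count of the pairing (f, e)
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():              # re-estimate p(f|e) = c(f,e) / c(e)
            t[(f, e)] = c / total[e]
    return dict(t)

corpus = [(["NULL", "the", "house"], ["la", "maison"]),
          (["NULL", "the", "book"], ["le", "livre"]),
          (["NULL", "a", "book"], ["un", "livre"])]
t = train_word_translation(corpus)
print(round(t[("livre", "book")], 2), round(t[("maison", "house")], 2))
```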

  32. Best Alignment
  • Select, for each (E,F):
    A = argmax_A P(A|F,E) = argmax_A P(F,A|E) / P(F) = argmax_A P(F,A|E)
      = argmax_A ( ε / (l+1)^m · Π_{j=1..m} p(f_j | e_{a_j}) )
      = argmax_A Π_{j=1..m} p(f_j | e_{a_j})   (IBM Model 1)
  • Again, use dynamic programming, a Viterbi-like algorithm.
  • Recompute p(f|e) based on the best alignment
    • (only if you are so inclined; the "original" summed-over-all-alignments distribution might perform better).
  • Note: we have also obtained all the Model 1 parameters.
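
A minimal sketch of this selection for Model 1 specifically: because the product factorizes over j, each link a_j can be chosen independently as argmax_i p(f_j | e_i) (the Viterbi-style search becomes necessary for the later, less factorized models). The table `t` below is a made-up illustration.

```python
def best_alignment(E, F, t):
    """E includes the empty word at position 0; returns a_j for j = 1..m (0 = empty word)."""
    return [max(range(len(E)), key=lambda i: t.get((f, E[i]), 1e-12)) for f in F]

t = {("la", "the"): 0.5, ("maison", "house"): 0.8, ("maison", "the"): 0.1}
print(best_alignment(["NULL", "the", "house"], ["la", "maison"], t))   # -> [1, 2]
```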
