
Statistical Machine Translation Word Alignment






Presentation Transcript


  1. Statistical Machine Translation Word Alignment
  Stephan Vogel, MT Class, Spring Semester 2011

  2. Overview
  • Word alignment – some observations
  • Models IBM2 and IBM1: 0th-order position model
  • HMM alignment model: 1st-order position model
  • IBM3: fertility
  • IBM4: plus relative distortion

  3. Alignment Example
  Observations:
  • Mostly 1-1
  • Some 1-to-many
  • Some 1-to-nothing
  • Often monotone
  • Not always clear-cut
    • English ‘eight’ is a time
    • German has ‘acht Uhr’
    • Could also leave ‘Uhr’ unaligned

  4. Evaluating Alignment
  • Given some manually aligned data (ref) and automatically aligned data (hyp), links can be:
    • Correct, i.e. link in hyp matches link in ref: true positive (tp)
    • Wrong, i.e. link in hyp but not in ref: false positive (fp)
    • Missing, i.e. link in ref but not in hyp: false negative (fn)
  • Evaluation measures:
    • Precision: P = tp / (tp + fp) = correct / links_in_hyp
    • Recall: R = tp / (tp + fn) = correct / links_in_ref
    • Alignment Error Rate: AER = 1 – F = 1 – 2tp / (2tp + fp + fn)
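A minimal sketch of these measures, assuming the hyp and ref alignments are given as sets of (source position, target position) pairs; the function and variable names are illustrative, not from the slides:

    def alignment_scores(hyp_links, ref_links):
        # hyp_links, ref_links: sets of (j, i) link pairs
        tp = len(hyp_links & ref_links)   # links in both hyp and ref
        fp = len(hyp_links - ref_links)   # links only in hyp
        fn = len(ref_links - hyp_links)   # links only in ref
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        aer = 1.0 - 2.0 * tp / (2.0 * tp + fp + fn)   # 1 - F1
        return precision, recall, aer

For example, hyp = {(1, 1), (2, 3)} against ref = {(1, 1), (2, 2)} gives P = R = 0.5 and AER = 0.5.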

  5. Sure and Possible Links
  • Sometimes it is difficult for human annotators to decide
  • Differentiate between sure and possible links
    • En: Det Noun – Ch: Noun; don’t align Det, or align it to NULL?
    • En: Det Noun – Ar: DetNoun; should Det be aligned to DetNoun?
  • Alignment Error Rate with sure and possible links (Och 2000)
    • A = generated links
    • S = sure links (not finding a sure link is an error)
    • P = possible links (generating a link which is not possible is an error)
    • AER(S, P; A) = 1 – (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
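A sketch of this variant, assuming S ⊆ P as in Och and Ney's formulation; set names follow the slide:

    def aer_sure_possible(A, S, P):
        # A: generated links, S: sure links, P: possible links (S is a subset of P)
        return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))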

  6. Word Alignment Models
  • IBM1 – lexical probabilities only
  • IBM2 – lexicon plus absolute position
  • IBM3 – plus fertilities
  • IBM4 – inverted relative position alignment
  • IBM5 – non-deficient version of model 4
  • HMM – lexicon plus relative position
  • BiBr – bilingual bracketing, lexical probabilities plus reordering via parallel segmentation
  • Syntactic alignment models
  [Brown et al. 1993, Vogel et al. 1996, Och et al. 2000, Wu 1997, Yamada et al. 2003, and many others]

  7. GIZA++ Alignment Toolkit
  • All standard alignment models (IBM1 … IBM5, HMM) are implemented in GIZA++
  • The toolkit was started (as GIZA) at the Johns Hopkins University workshop in 1998
  • Extended and improved by Franz Josef Och
  • Now used by many groups
  • Known problems:
    • Memory when training on large corpora
    • Writes many large files (depends on your parameter settings)
  • Extensions for large corpora (Qin Gao):
    • Distributed GIZA: runs on many machines, I/O bound
    • Multithreaded GIZA: runs on one machine, multiple cores

  8. Notation
  • Source language
    • f: source (French) word
    • J: length of source sentence
    • j: position in source sentence (target position)
    • f_1^J = f_1 … f_J: source sentence
  • Target language
    • e: target (English) word
    • I: length of target sentence
    • i: position in target sentence (source position)
    • e_1^I = e_1 … e_I: target sentence
  • Alignment: relation mapping source to target positions
    • a_j = i: source position j is aligned to target position i, i.e. f_j is aligned to e_i
    • a_1^J = a_1 … a_J: whole alignment
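To make the notation concrete, a small illustrative sketch of a sentence pair and an alignment as data structures; the sentences extend the 'acht Uhr' / 'eight' example from slide 3 and are not taken from the slides:

    # Lists are padded at index 0 so list positions match the 1-based notation f_1..f_J, e_1..e_I.
    f = ["<pad>", "ich", "komme", "um", "acht", "Uhr"]   # source sentence, J = 5
    e = ["<pad>", "i", "come", "at", "eight"]            # target sentence, I = 4
    # Alignment a_1^J: one target position a_j for each source position j.
    a = [None, 1, 2, 3, 4, 4]    # a_5 = 4: 'Uhr' is also aligned to 'eight'
    for j in range(1, len(f)):
        print(f[j], "->", e[a[j]])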

  9. SMT - Principle
  • Translate a ‘French’ string f_1^J into an ‘English’ string e_1^I
  • Bayes’ decision rule for translation:
    e_best = argmax_{e_1^I} { p(e_1^I) · p(f_1^J | e_1^I) }
  • Why this inversion of the translation direction?
    • Decomposition of dependencies: makes modeling easier
    • Cooperation of two knowledge sources for the final decision
  • Note: the IBM paper and GIZA call e the source and f the target

  10. Alignment as Hidden Variable
  • ‘Hidden alignments’ to capture word-to-word correspondences
  • Mapping: A ⊆ [1, …, J] x [1, …, I]
  • Number of connections: J · I (each source word with each target word)
  • Number of alignments: 2^(J·I) (each connection yes/no)
  • Summation over all alignments:
    p(f_1^J | e_1^I) = Σ_A p(f_1^J, A | e_1^I)
  • Too many alignments, summation not feasible

  11. Restricted Alignment
  • Each source word has one connection
  • Alignment mapping becomes a function: j -> i = a_j
  • Number of alignments is now: I^J
  • Sum over all alignments:
    p(f_1^J | e_1^I) = Σ_{a_1^J} p(f_1^J, a_1^J | e_1^I)
  • Not possible to enumerate
  • In some situations full summation is possible through Dynamic Programming
  • In other situations: take only the best alignment and perhaps some alignments close to the best one
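A quick illustration of how these counts grow, comparing the unrestricted 2^(J·I) alignments of slide 10 with the I^J functional alignments of this slide; the sentence lengths are illustrative:

    J, I = 10, 10   # modest sentence lengths
    unrestricted = 2 ** (J * I)   # every link on/off: 2^(J*I) alignments
    restricted = I ** J           # one target position per source word: I^J
    print(f"unrestricted: {unrestricted:.3e}")   # ~1.3e30
    print(f"restricted:   {restricted:.3e}")     # 1.0e10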

  12. Empty Position (Null Word)
  • Sometimes a word has no correspondence
  • The alignment function aligns each source word to one target word, i.e. it cannot skip a source word
  • Solution:
    • Introduce empty position 0 with null word e_0
    • ‘Skip’ source word f_j by aligning it to e_0
    • Target sentence is extended to: e_0^I = e_0 e_1 … e_I
    • Alignment positions are extended to: a_j ∈ {0, 1, …, I}

  13. Translation Model
  • Sum over all alignments:
    p(f_1^J | e_1^I) = Σ_{a_1^J} p(f_1^J, a_1^J | e_1^I)
  • 3 probability distributions:
    • Length: p(J | e_1^I) ≈ p(J | I)
    • Alignment: p(a_j | j, I, J)
    • Lexicon: p(f_j | e_{a_j})

  14. Model Assumptions
  Decompose the interaction into pairwise dependencies:
  • Length: source length only dependent on target length (very weak): p(J | I)
  • Alignment:
    • Zero-order model: target position only dependent on source position: p(a_j | j, I, J)
    • First-order model: target position only dependent on previous target position: p(a_j | a_{j-1}, I)
  • Lexicon: source word only dependent on the aligned target word: p(f_j | e_{a_j})

  15. Mixture Model
  • Interpretation as mixture model by direct decomposition:
    p(f_1^J | e_1^I) = Π_j Σ_i p(i | j, I, J) · p(f_j | e_i)
  • Again, simplifying model assumptions applied

  16. Training IBM2
  • Expectation-Maximization (EM) Algorithm
  • Define posterior weight (i.e. sum over column = 1):
    w(i | j) = p(i | j, I, J) · p(f_j | e_i) / Σ_{i'} p(i' | j, I, J) · p(f_j | e_{i'})
  • Lexicon probabilities: count how often word pairs are aligned (accumulate w(i | j) into Count(f_j, e_i)), then turn counts into probabilities
  • Alignment probabilities: accumulate w(i | j) into Count(i, j, I, J) and renormalize
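A minimal sketch of the posterior weight for one source position j, assuming the lexicon is stored as a dictionary t[(f, e)] and the alignment model as a[(i, j, I, J)]; these names are illustrative, not from the toolkit:

    def posterior_weights(j, f_j, e, t, a, I, J):
        # e: target sentence padded at index 0, so e[1]..e[I] are the target words
        scores = [a[(i, j, I, J)] * t[(f_j, e[i])] for i in range(1, I + 1)]
        total = sum(scores)
        return [s / total for s in scores]   # w(i | j) for i = 1..I, sums to 1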

  17. IBM1 Model
  • Assume a uniform probability for the position alignment:
    p(i | j, I, J) = 1 / (I + 1)
  • Alignment probability:
    p(f_1^J | e_1^I) = 1 / (I + 1)^J · Π_j Σ_i p(f_j | e_i)
  • In training: only collect counts for word pairs

  18. Training for IBM1 Model – Pseudo Code
  # Accumulation (over corpus)
  For each sentence pair
      For each source position j
          Sum = 0.0
          For each target position i
              Sum += p(fj|ei)
          For each target position i
              Count(fj,ei) += p(fj|ei) / Sum
  # Re-estimate probabilities (over count table)
  For each target word e
      Sum = 0.0
      For each source word f
          Sum += Count(f,e)
      For each source word f
          p(f|e) = Count(f,e) / Sum
  # Repeat for several iterations
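A runnable sketch of the same loop in Python, assuming a corpus of tokenized sentence pairs; the uniform initialization and the omission of the NULL word are simplifications for illustration:

    from collections import defaultdict

    def train_ibm1(corpus, iterations=5):
        # corpus: list of (f_sentence, e_sentence) pairs, each a list of tokens
        f_vocab = {f for fs, _ in corpus for f in fs}
        t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform p(f|e) initialization
        for _ in range(iterations):
            count = defaultdict(float)    # Count(f, e)
            total = defaultdict(float)    # per-e normalizer
            # Accumulation (E-step)
            for fs, es in corpus:
                for f in fs:
                    norm = sum(t[(f, e)] for e in es)
                    for e in es:
                        c = t[(f, e)] / norm
                        count[(f, e)] += c
                        total[e] += c
            # Re-estimation (M-step)
            for (f, e), c in count.items():
                t[(f, e)] = c / total[e]
        return t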

  19. HMM Alignment Model
  • Idea: relative position model
  • Entire word groups (phrases) are moved with respect to the source position
  [Figure: alignment path over target vs. source positions]

  20. HMM Alignment
  • First-order model: target position dependent on previous target position (captures movement of entire phrases)
  • Alignment probability:
    p(f_1^J | e_1^I) = Σ_{a_1^J} Π_j p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})
  • Maximum approximation:
    p(f_1^J | e_1^I) ≈ max_{a_1^J} Π_j p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})

  21. Viterbi Training on HMM Model
  # Accumulation (over corpus)
  # Find Viterbi path
  For each sentence pair
      For each source position j
          For each target position i
              Pbest = 0; t = p(fj|ei)
              For each target position i'
                  Pprev = P(j-1,i')
                  a = p(i|i',I,J)
                  Pnew = Pprev * t * a
                  if (Pnew > Pbest)
                      Pbest = Pnew
                      BackPointer(j,i) = i'
              P(j,i) = Pbest
      # Update counts along the best path
      i = argmax_i{ P(J,i) }
      For each j from J downto 1
          Count(fj,ei)++
          iprev = BackPointer(j,i)
          Count(i,iprev,I,J)++
          i = iprev
  # Renormalize counts into probabilities
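A compact Python sketch of the Viterbi search for one sentence pair, assuming the lexicon t[(f, e)] and transition probabilities a[(i, i_prev, I)] (i.e. p(i | i_prev, I)) are dictionaries; the names are illustrative and the NULL word is omitted for brevity:

    def viterbi_alignment(fs, es, t, a):
        # fs: source words f_1..f_J, es: target words e_1..e_I (0-based lists here)
        J, I = len(fs), len(es)
        P = [[0.0] * I for _ in range(J)]    # best path score ending in (j, i)
        back = [[0] * I for _ in range(J)]   # back-pointers
        for i in range(I):                   # initialization: uniform start distribution
            P[0][i] = (1.0 / I) * t[(fs[0], es[i])]
        for j in range(1, J):
            for i in range(I):
                best_prev = max(range(I), key=lambda ip: P[j-1][ip] * a[(i, ip, I)])
                P[j][i] = P[j-1][best_prev] * a[(i, best_prev, I)] * t[(fs[j], es[i])]
                back[j][i] = best_prev
        # Backtrace from the best final state
        i = max(range(I), key=lambda k: P[J-1][k])
        alignment = [0] * J
        for j in range(J - 1, -1, -1):
            alignment[j] = i
            i = back[j][i]
        return alignment   # alignment[j] = target index aligned to source word j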

  22. HMM Forward-Backward Training
  • Gamma γ_j(i): probability to emit f_j when in state i in sentence s
  • Sum over all paths through (j, i)
  [Figure: trellis with all paths through node (j, i)]

  23. HMM Forward-Backward Training
  • Epsilon ε_j(i', i): probability to transit from state i' into i while emitting f_j
  • Sum over all paths through (j-1, i') and (j, i), emitting f_j
  [Figure: trellis with the transition from (j-1, i') to (j, i)]

  24. Forward Probabilities
  • Defined as: α_j(i) = p(f_1 … f_j, a_j = i | e_1^I)
  • Recursion: α_j(i) = [ Σ_{i'} α_{j-1}(i') · p(i | i', I) ] · p(f_j | e_i)
  • Initial condition: α_1(i) = p(i) · p(f_1 | e_i), with p(i) the initial alignment distribution

  25. Backward Probabilities
  • Defined as: β_j(i) = p(f_{j+1} … f_J | a_j = i, e_1^I)
  • Recursion: β_j(i) = Σ_{i'} β_{j+1}(i') · p(i' | i, I) · p(f_{j+1} | e_{i'})
  • Initial condition: β_J(i) = 1

  26. Forward-Backward
  • Calculate Gamma and Epsilon with Alpha and Beta:
  • Gammas:
    γ_j(i) = α_j(i) · β_j(i) / Σ_{i'} α_j(i') · β_j(i')
  • Epsilons:
    ε_j(i', i) = α_{j-1}(i') · p(i | i', I) · p(f_j | e_i) · β_j(i) / Σ_{i''} α_j(i'') · β_j(i'')

  27. Parameter Re-Estimation
  • Lexicon probabilities: accumulate γ_j(i) into Count(f_j, e_i), then normalize over all f for each e:
    p(f | e) = Σ_{s,j,i: fj=f, ei=e} γ_j(i) / Σ_{s,j,i: ei=e} γ_j(i)
  • Alignment probabilities: accumulate ε_j(i', i) into Count(i, i'), then normalize over i:
    p(i | i', I) = Σ_{s,j} ε_j(i', i) / Σ_{s,j,i''} ε_j(i', i'')

  28. Forward-Backward Training – Pseudo Code
  # Accumulation
  For each sentence pair {
      Forward.  (Calculate Alpha's)
      Backward. (Calculate Beta's)
      Calculate Epsilon's and Gamma's.
      For each source word {
          Increase LexiconCount(f_j|e_i) by Gamma(j,i).
          Increase AlignCount(i|i') by Epsilon(j,i,i').
      }
  }
  # Update
  Normalize LexiconCount to get P(f_j|e_i).
  Normalize AlignCount to get P(i|i').
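A runnable sketch of the forward-backward pass for one sentence pair, assuming the same illustrative dictionaries t[(f, e)] and a[(i, i_prev, I)] as in the Viterbi sketch above; scaling or log-space handling is omitted for brevity:

    def forward_backward_counts(fs, es, t, a, lex_count, align_count):
        # fs: source words, es: target words (0-based lists); count dicts are updated in place
        J, I = len(fs), len(es)
        alpha = [[0.0] * I for _ in range(J)]
        beta = [[1.0] * I for _ in range(J)]   # beta_J(i) = 1
        for i in range(I):                     # uniform initial distribution
            alpha[0][i] = (1.0 / I) * t[(fs[0], es[i])]
        for j in range(1, J):                  # forward recursion
            for i in range(I):
                alpha[j][i] = sum(alpha[j-1][ip] * a[(i, ip, I)] for ip in range(I)) * t[(fs[j], es[i])]
        for j in range(J - 2, -1, -1):         # backward recursion
            for i in range(I):
                beta[j][i] = sum(a[(ip, i, I)] * t[(fs[j+1], es[ip])] * beta[j+1][ip] for ip in range(I))
        norm = sum(alpha[J-1][i] for i in range(I))   # p(f_1^J | e_1^I)
        for j in range(J):
            for i in range(I):
                gamma = alpha[j][i] * beta[j][i] / norm
                lex_count[(fs[j], es[i])] = lex_count.get((fs[j], es[i]), 0.0) + gamma
                if j > 0:
                    for ip in range(I):
                        eps = alpha[j-1][ip] * a[(i, ip, I)] * t[(fs[j], es[i])] * beta[j][i] / norm
                        align_count[(i, ip, I)] = align_count.get((i, ip, I), 0.0) + eps

The accumulated counts are then normalized as on slide 27 to obtain the new lexicon and alignment probabilities.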

  29. Example HMM Training
