
Statistical Machine Translation Word Alignment






Presentation Transcript


  1. Statistical Machine Translation Word Alignment
  Stephan Vogel, MT Class, Spring Semester 2011

  2. Overview
  • Word alignment – some observations
  • Models IBM2 and IBM1: 0th-order position model
  • HMM alignment model: 1st-order position model
  • IBM3: fertility
  • IBM4: plus relative distortion

  3. Alignment Example
  Observations:
  • Mostly 1-1
  • Some 1-to-many
  • Some 1-to-nothing
  • Often monotone
  • Not always clear-cut
    • English ‘eight’ is a time
    • German has ‘acht Uhr’
    • Could also leave ‘Uhr’ unaligned

  4. Evaluating Alignment
  • Given some manually aligned data (ref) and automatically aligned data (hyp), links can be:
    • Correct, i.e. link in hyp matches link in ref: true positive (tp)
    • Wrong, i.e. link in hyp but not in ref: false positive (fp)
    • Missing, i.e. link in ref but not in hyp: false negative (fn)
  • Evaluation measures:
    • Precision: P = tp / (tp + fp) = correct / links_in_hyp
    • Recall: R = tp / (tp + fn) = correct / links_in_ref
    • Alignment Error Rate: AER = 1 – F = 1 – 2tp / (2tp + fp + fn)
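A minimal sketch of these measures, assuming the hyp and ref alignments are given as sets of (source position, target position) pairs; the function and variable names are illustrative, not from the slides:

    def alignment_scores(hyp_links, ref_links):
        # hyp_links, ref_links: sets of (j, i) link pairs
        tp = len(hyp_links & ref_links)   # links in both hyp and ref
        fp = len(hyp_links - ref_links)   # links only in hyp
        fn = len(ref_links - hyp_links)   # links only in ref
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        aer = 1.0 - 2.0 * tp / (2.0 * tp + fp + fn)   # 1 - F1
        return precision, recall, aer

For example, hyp = {(1, 1), (2, 3)} against ref = {(1, 1), (2, 2)} gives P = R = 0.5 and AER = 0.5.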

  5. Sure and Possible Links
  • Sometimes it is difficult for human annotators to decide
  • Differentiate between sure and possible links
    • En: Det Noun – Ch: Noun; don’t align Det, or align it to NULL?
    • En: Det Noun – Ar: DetNoun; should Det be aligned to DetNoun?
  • Alignment Error Rate with sure and possible links (Och 2000)
    • A = generated links
    • S = sure links (not finding a sure link is an error)
    • P = possible links (generating a link which is not possible is an error)
    • AER(S, P; A) = 1 – (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
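A sketch of this variant, assuming S ⊆ P as in Och and Ney's formulation; set names follow the slide:

    def aer_sure_possible(A, S, P):
        # A: generated links, S: sure links, P: possible links (S is a subset of P)
        return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))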

  6. Word Alignment Models
  • IBM1 – lexical probabilities only
  • IBM2 – lexicon plus absolute position
  • IBM3 – plus fertilities
  • IBM4 – inverted relative position alignment
  • IBM5 – non-deficient version of model 4
  • HMM – lexicon plus relative position
  • BiBr – bilingual bracketing, lexical probabilities plus reordering via parallel segmentation
  • Syntactic alignment models
  [Brown et al. 1993, Vogel et al. 1996, Och et al. 2000, Wu 1997, Yamada et al. 2003, and many others]

  7. GIZA++ Alignment Toolkit
  • All standard alignment models (IBM1 … IBM5, HMM) are implemented in GIZA++
  • The toolkit was started (as GIZA) at the Johns Hopkins University workshop in 1998
  • Extended and improved by Franz Josef Och
  • Now used by many groups
  • Known problems:
    • Memory when training on large corpora
    • Writes many large files (depends on your parameter settings)
  • Extensions for large corpora (Qin Gao):
    • Distributed GIZA: runs on many machines, I/O bound
    • Multithreaded GIZA: runs on one machine, multiple cores

  8. Notation
  • Source language
    • f: source (French) word
    • J: length of source sentence
    • j: position in source sentence (target position)
    • f_1^J = f_1 … f_J: source sentence
  • Target language
    • e: target (English) word
    • I: length of target sentence
    • i: position in target sentence (source position)
    • e_1^I = e_1 … e_I: target sentence
  • Alignment: relation mapping source to target positions
    • a_j = i: source position j is aligned to target position i, i.e. f_j is aligned to e_i
    • a_1^J = a_1 … a_J: whole alignment
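To make the notation concrete, a small illustrative sketch of a sentence pair and an alignment as data structures; the sentences extend the 'acht Uhr' / 'eight' example from slide 3 and are not taken from the slides:

    # Lists are padded at index 0 so list positions match the 1-based notation f_1..f_J, e_1..e_I.
    f = ["<pad>", "ich", "komme", "um", "acht", "Uhr"]   # source sentence, J = 5
    e = ["<pad>", "i", "come", "at", "eight"]            # target sentence, I = 4
    # Alignment a_1^J: one target position a_j for each source position j.
    a = [None, 1, 2, 3, 4, 4]    # a_5 = 4: 'Uhr' is also aligned to 'eight'
    for j in range(1, len(f)):
        print(f[j], "->", e[a[j]])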

  9. SMT - Principle
  • Translate a ‘French’ string f_1^J into an ‘English’ string e_1^I
  • Bayes’ decision rule for translation:
    e_best = argmax_{e_1^I} { p(e_1^I) · p(f_1^J | e_1^I) }
  • Why this inversion of the translation direction?
    • Decomposition of dependencies: makes modeling easier
    • Cooperation of two knowledge sources for the final decision
  • Note: the IBM paper and GIZA call e the source and f the target

  10. Alignment as Hidden Variable
  • ‘Hidden alignments’ to capture word-to-word correspondences
  • Mapping: A ⊆ [1, …, J] x [1, …, I]
  • Number of connections: J · I (each source word with each target word)
  • Number of alignments: 2^(J·I) (each connection yes/no)
  • Summation over all alignments:
    p(f_1^J | e_1^I) = Σ_A p(f_1^J, A | e_1^I)
  • Too many alignments, summation not feasible

  11. Restricted Alignment
  • Each source word has one connection
  • Alignment mapping becomes a function: j -> i = a_j
  • Number of alignments is now: I^J
  • Sum over all alignments:
    p(f_1^J | e_1^I) = Σ_{a_1^J} p(f_1^J, a_1^J | e_1^I)
  • Not possible to enumerate
  • In some situations full summation is possible through Dynamic Programming
  • In other situations: take only the best alignment and perhaps some alignments close to the best one
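A quick illustration of how these counts grow, comparing the unrestricted 2^(J·I) alignments of slide 10 with the I^J functional alignments of this slide; the sentence lengths are illustrative:

    J, I = 10, 10   # modest sentence lengths
    unrestricted = 2 ** (J * I)   # every link on/off: 2^(J*I) alignments
    restricted = I ** J           # one target position per source word: I^J
    print(f"unrestricted: {unrestricted:.3e}")   # ~1.3e30
    print(f"restricted:   {restricted:.3e}")     # 1.0e10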

  12. Empty Position (Null Word)
  • Sometimes a word has no correspondence
  • The alignment function aligns each source word to one target word, i.e. it cannot skip a source word
  • Solution:
    • Introduce empty position 0 with null word e_0
    • ‘Skip’ source word f_j by aligning it to e_0
    • Target sentence is extended to: e_0^I = e_0 e_1 … e_I
    • Alignment positions are extended to: a_j ∈ {0, 1, …, I}

  13. Translation Model
  • Sum over all alignments:
    p(f_1^J | e_1^I) = Σ_{a_1^J} p(f_1^J, a_1^J | e_1^I)
  • 3 probability distributions:
    • Length: p(J | e_1^I) ≈ p(J | I)
    • Alignment: p(a_j | j, I, J)
    • Lexicon: p(f_j | e_{a_j})

  14. Model Assumptions
  Decompose the interaction into pairwise dependencies:
  • Length: source length only dependent on target length (very weak): p(J | I)
  • Alignment:
    • Zero-order model: target position only dependent on source position: p(a_j | j, I, J)
    • First-order model: target position only dependent on previous target position: p(a_j | a_{j-1}, I)
  • Lexicon: source word only dependent on the aligned target word: p(f_j | e_{a_j})

  15. Mixture Model
  • Interpretation as mixture model by direct decomposition:
    p(f_1^J | e_1^I) = Π_j Σ_i p(i | j, I, J) · p(f_j | e_i)
  • Again, simplifying model assumptions applied

  16. Training IBM2
  • Expectation-Maximization (EM) Algorithm
  • Define posterior weight (i.e. sum over column = 1):
    w(i | j) = p(i | j, I, J) · p(f_j | e_i) / Σ_{i'} p(i' | j, I, J) · p(f_j | e_{i'})
  • Lexicon probabilities: count how often word pairs are aligned (accumulate w(i | j) into Count(f_j, e_i)), then turn counts into probabilities
  • Alignment probabilities: accumulate w(i | j) into Count(i, j, I, J) and renormalize
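A minimal sketch of the posterior weight for one source position j, assuming the lexicon is stored as a dictionary t[(f, e)] and the alignment model as a[(i, j, I, J)]; these names are illustrative, not from the toolkit:

    def posterior_weights(j, f_j, e, t, a, I, J):
        # e: target sentence padded at index 0, so e[1]..e[I] are the target words
        scores = [a[(i, j, I, J)] * t[(f_j, e[i])] for i in range(1, I + 1)]
        total = sum(scores)
        return [s / total for s in scores]   # w(i | j) for i = 1..I, sums to 1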

  17. IBM1 Model
  • Assume a uniform probability for the position alignment:
    p(i | j, I, J) = 1 / (I + 1)
  • Alignment probability:
    p(f_1^J | e_1^I) = 1 / (I + 1)^J · Π_j Σ_i p(f_j | e_i)
  • In training: only collect counts for word pairs

  18. Training for IBM1 Model – Pseudo Code
  # Accumulation (over corpus)
  For each sentence pair
      For each source position j
          Sum = 0.0
          For each target position i
              Sum += p(fj|ei)
          For each target position i
              Count(fj,ei) += p(fj|ei) / Sum
  # Re-estimate probabilities (over count table)
  For each target word e
      Sum = 0.0
      For each source word f
          Sum += Count(f,e)
      For each source word f
          p(f|e) = Count(f,e) / Sum
  # Repeat for several iterations
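A runnable sketch of the same loop in Python, assuming a corpus of tokenized sentence pairs; the uniform initialization and the omission of the NULL word are simplifications for illustration:

    from collections import defaultdict

    def train_ibm1(corpus, iterations=5):
        # corpus: list of (f_sentence, e_sentence) pairs, each a list of tokens
        f_vocab = {f for fs, _ in corpus for f in fs}
        t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform p(f|e) initialization
        for _ in range(iterations):
            count = defaultdict(float)    # Count(f, e)
            total = defaultdict(float)    # per-e normalizer
            # Accumulation (E-step)
            for fs, es in corpus:
                for f in fs:
                    norm = sum(t[(f, e)] for e in es)
                    for e in es:
                        c = t[(f, e)] / norm
                        count[(f, e)] += c
                        total[e] += c
            # Re-estimation (M-step)
            for (f, e), c in count.items():
                t[(f, e)] = c / total[e]
        return t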

  19. HMM Alignment Model
  • Idea: relative position model
  • Entire word groups (phrases) are moved with respect to the source position
  [Figure: alignment path over target vs. source positions]

  20. HMM Alignment
  • First-order model: target position dependent on previous target position (captures movement of entire phrases)
  • Alignment probability:
    p(f_1^J | e_1^I) = Σ_{a_1^J} Π_j p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})
  • Maximum approximation:
    p(f_1^J | e_1^I) ≈ max_{a_1^J} Π_j p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})

  21. Viterbi Training on HMM Model
  # Accumulation (over corpus)
  # Find Viterbi path
  For each sentence pair
      For each source position j
          For each target position i
              Pbest = 0; t = p(fj|ei)
              For each target position i'
                  Pprev = P(j-1,i')
                  a = p(i|i',I,J)
                  Pnew = Pprev * t * a
                  if (Pnew > Pbest)
                      Pbest = Pnew
                      BackPointer(j,i) = i'
              P(j,i) = Pbest
      # Update counts along the best path
      i = argmax_i{ P(J,i) }
      For each j from J downto 1
          Count(fj,ei)++
          iprev = BackPointer(j,i)
          Count(i,iprev,I,J)++
          i = iprev
  # Renormalize counts into probabilities
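A compact Python sketch of the Viterbi search for one sentence pair, assuming the lexicon t[(f, e)] and transition probabilities a[(i, i_prev, I)] (i.e. p(i | i_prev, I)) are dictionaries; the names are illustrative and the NULL word is omitted for brevity:

    def viterbi_alignment(fs, es, t, a):
        # fs: source words f_1..f_J, es: target words e_1..e_I (0-based lists here)
        J, I = len(fs), len(es)
        P = [[0.0] * I for _ in range(J)]    # best path score ending in (j, i)
        back = [[0] * I for _ in range(J)]   # back-pointers
        for i in range(I):                   # initialization: uniform start distribution
            P[0][i] = (1.0 / I) * t[(fs[0], es[i])]
        for j in range(1, J):
            for i in range(I):
                best_prev = max(range(I), key=lambda ip: P[j-1][ip] * a[(i, ip, I)])
                P[j][i] = P[j-1][best_prev] * a[(i, best_prev, I)] * t[(fs[j], es[i])]
                back[j][i] = best_prev
        # Backtrace from the best final state
        i = max(range(I), key=lambda k: P[J-1][k])
        alignment = [0] * J
        for j in range(J - 1, -1, -1):
            alignment[j] = i
            i = back[j][i]
        return alignment   # alignment[j] = target index aligned to source word j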

  22. HMM Forward-Backward Training
  • Gamma γ_j(i): probability to emit f_j when in state i in sentence s
  • Sum over all paths through (j, i)
  [Figure: trellis with all paths through node (j, i)]

  23. HMM Forward-Backward Training
  • Epsilon ε_j(i', i): probability to transit from state i' into i while emitting f_j
  • Sum over all paths through (j-1, i') and (j, i), emitting f_j
  [Figure: trellis with the transition from (j-1, i') to (j, i)]

  24. Forward Probabilities
  • Defined as: α_j(i) = p(f_1 … f_j, a_j = i | e_1^I)
  • Recursion: α_j(i) = [ Σ_{i'} α_{j-1}(i') · p(i | i', I) ] · p(f_j | e_i)
  • Initial condition: α_1(i) = p(i) · p(f_1 | e_i), with p(i) the initial alignment distribution

  25. Backward Probabilities
  • Defined as: β_j(i) = p(f_{j+1} … f_J | a_j = i, e_1^I)
  • Recursion: β_j(i) = Σ_{i'} β_{j+1}(i') · p(i' | i, I) · p(f_{j+1} | e_{i'})
  • Initial condition: β_J(i) = 1

  26. Forward-Backward
  • Calculate Gamma and Epsilon with Alpha and Beta:
  • Gammas:
    γ_j(i) = α_j(i) · β_j(i) / Σ_{i'} α_j(i') · β_j(i')
  • Epsilons:
    ε_j(i', i) = α_{j-1}(i') · p(i | i', I) · p(f_j | e_i) · β_j(i) / Σ_{i''} α_j(i'') · β_j(i'')

  27. Parameter Re-Estimation
  • Lexicon probabilities: accumulate γ_j(i) into Count(f_j, e_i), then normalize over all f for each e:
    p(f | e) = Σ_{s,j,i: fj=f, ei=e} γ_j(i) / Σ_{s,j,i: ei=e} γ_j(i)
  • Alignment probabilities: accumulate ε_j(i', i) into Count(i, i'), then normalize over i:
    p(i | i', I) = Σ_{s,j} ε_j(i', i) / Σ_{s,j,i''} ε_j(i', i'')

  28. Forward-Backward Training – Pseudo Code
  # Accumulation
  For each sentence pair {
      Forward.  (Calculate Alpha's)
      Backward. (Calculate Beta's)
      Calculate Epsilon's and Gamma's.
      For each source word {
          Increase LexiconCount(f_j|e_i) by Gamma(j,i).
          Increase AlignCount(i|i') by Epsilon(j,i,i').
      }
  }
  # Update
  Normalize LexiconCount to get P(f_j|e_i).
  Normalize AlignCount to get P(i|i').
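A runnable sketch of the forward-backward pass for one sentence pair, assuming the same illustrative dictionaries t[(f, e)] and a[(i, i_prev, I)] as in the Viterbi sketch above; scaling or log-space handling is omitted for brevity:

    def forward_backward_counts(fs, es, t, a, lex_count, align_count):
        # fs: source words, es: target words (0-based lists); count dicts are updated in place
        J, I = len(fs), len(es)
        alpha = [[0.0] * I for _ in range(J)]
        beta = [[1.0] * I for _ in range(J)]   # beta_J(i) = 1
        for i in range(I):                     # uniform initial distribution
            alpha[0][i] = (1.0 / I) * t[(fs[0], es[i])]
        for j in range(1, J):                  # forward recursion
            for i in range(I):
                alpha[j][i] = sum(alpha[j-1][ip] * a[(i, ip, I)] for ip in range(I)) * t[(fs[j], es[i])]
        for j in range(J - 2, -1, -1):         # backward recursion
            for i in range(I):
                beta[j][i] = sum(a[(ip, i, I)] * t[(fs[j+1], es[ip])] * beta[j+1][ip] for ip in range(I))
        norm = sum(alpha[J-1][i] for i in range(I))   # p(f_1^J | e_1^I)
        for j in range(J):
            for i in range(I):
                gamma = alpha[j][i] * beta[j][i] / norm
                lex_count[(fs[j], es[i])] = lex_count.get((fs[j], es[i]), 0.0) + gamma
                if j > 0:
                    for ip in range(I):
                        eps = alpha[j-1][ip] * a[(i, ip, I)] * t[(fs[j], es[i])] * beta[j][i] / norm
                        align_count[(i, ip, I)] = align_count.get((i, ip, I), 0.0) + eps

The accumulated counts are then normalized as on slide 27 to obtain the new lexicon and alignment probabilities.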

  29. Example HMM Training
