Learn about statistical approaches to translation, IBM Model 1, and word alignment in natural language processing. Understand Expectation Maximization algorithm steps and training translation models. Dive into decoding and improving translation systems.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
CS 479, section 1: Natural Language Processing
Lecture #34: Machine Translation, Word Alignment Models
Thanks to Dan Klein of UC Berkeley and Chris Manning of Stanford for all of the materials used in this lecture.
Announcements
• Project #4
  • Note the clarification about horizontal markovization *order 2* in the instructions
• Project #5
  • Help session: today at 4pm in the CS Conference Room (3350 TMCB)
• Propose-your-own
  • Keep moving forward
• Project report
  • Early: Wednesday after Thanksgiving
  • Due: Friday after Thanksgiving
• Homework 0.4
  • See end of lecture
  • Due: Monday
Quiz – take 2 • What are the four steps of the Expectation Maximization (EM) algorithm? • Think of the document clustering example, if that helps • What is the primary purpose of EM?
Objectives • Understand the role of alignment in statistical approaches to translation • Understand statistical word alignment • Define IBM Model 1, and understand how to train it using EM
The Coding View • “One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ” • Warren Weaver (1955:18, quoting a letter he wrote in 1947)
Learning Correspondence • What would you do, if I asked you to align these two strings? • What if I asked you to learn a translation lexicon from this pair? • What is the missing data here?
What if you had more pairs? (Three example sentence pairs were shown on the slide.)
MT System Components
• Language Model (the source): P(e)
• Translation Model (the channel): P(f|e)
• Decoder: given the observed foreign sentence f, find the best English sentence e*
  e* = argmax_e P(e|f) = argmax_e P(f|e) P(e)
• This finds an English translation that is both fluent and semantically faithful to the original foreign-language sentence.
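As a rough illustration of the noisy-channel factorization above, here is a minimal sketch (not the course decoder) that scores a fixed list of candidate English translations by log P(f|e) + log P(e) and returns the argmax; the scoring functions lm_logprob and tm_logprob are hypothetical placeholders, not APIs from the course projects.

```python
import math

def decode(f, candidates, lm_logprob, tm_logprob):
    """Noisy-channel decoding sketch: pick the English candidate e that
    maximizes log P(f|e) + log P(e) over a fixed candidate list.
    A real decoder searches the (huge) space of translations instead."""
    best_e, best_score = None, -math.inf
    for e in candidates:
        score = tm_logprob(f, e) + lm_logprob(e)  # log P(f|e) + log P(e)
        if score > best_score:
            best_e, best_score = e, score
    return best_e
```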
Simple MT
• The components of a simple MT system:
  • You already know about the LM
  • Word-alignment-based Translation Models (TMs)
    • IBM Models 1 and 2 – Assignment #0.4 and Project #5!
  • A simple decoder
• Next few classes, as time permits:
  • More complex word-level and phrase-level TMs
  • More sophisticated decoders
A Word-Level TM?
• What might a model of P(f|e) look like?
• How would we estimate it?
• What can go wrong here?
A Word-Level TM?
• Can we break the model down to a finer granularity to overcome the trouble posed by sparsity?
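To make the sparsity worry concrete, here is a small sketch (my own illustration, not from the slides) of the naive alternative: estimating P(f|e) by relative frequency over whole sentence pairs. Because almost every sentence pair occurs at most once in a real corpus, nearly every new sentence gets probability zero, which is what motivates moving down to the word level.

```python
from collections import Counter, defaultdict

def sentence_level_tm(pairs):
    """Estimate P(f|e) by relative frequency over whole sentences.
    pairs: list of (french_sentence, english_sentence) strings."""
    counts = defaultdict(Counter)
    for f, e in pairs:
        counts[e][f] += 1
    return {e: {f: c / sum(cs.values()) for f, c in cs.items()}
            for e, cs in counts.items()}

# With realistic corpora, a new English sentence e almost never appears in
# training, so P(f|e) is undefined or zero -- hence word-level models.
```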
IBM Model 1 (Brown et al., 1993)
• Alignment: a hidden vector specifying which English source word (including a special NULL word) is responsible for each French target word
• How do we get from an unaligned sentence pair to such an alignment?
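The quantities used in the next two slides can be written out explicitly; this is the standard Model 1 formulation from Brown et al. (1993), where t(f|e) are the word translation probabilities, l = |e|, m = |f|, and ε is a constant accounting for the French sentence length.

```latex
% IBM Model 1: joint probability of French sentence f and alignment a given English e
P(f, a \mid e) = \frac{\epsilon}{(l+1)^{m}} \prod_{j=1}^{m} t(f_j \mid e_{a_j})
% Marginalizing over alignments (each a_j chosen independently and uniformly):
P(f \mid e) = \frac{\epsilon}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```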
EM for Model 1
• Model 1 parameters:
  • Translation probabilities t(f|e)
  • Start with uniform t(f|e) (or random), including t(f|NULL)
• Top: Initialize count(f, e) = 0 for all words f and e
• (E-step) For each pair of sentences (f, e) in the parallel corpus:
  • For each French position j
    • For each English position i (including the NULL position)
      • Calculate the posterior probability: P(a_j = i | f, e) = t(f_j|e_i) / Σ_i' t(f_j|e_i')
      • Increment the count of word f_j with word e_i by this amount (a "partial count"): count(f_j, e_i) += P(a_j = i | f, e)
EM for Model 1 (part 2)
• (M-step)
  • For each English word e that appears in at least one English sentence (including NULL)
    • For each French word f that appears in at least one French sentence paired with e
      • Re-estimate t(f|e) by normalizing the count: t(f|e) = count(f, e) / Σ_f' count(f', e)
• Repeat at step "Top:" until
  • convergence of t, or
  • a pre-specified number of iterations
• Result: a "translation table" t(·|e) for each value of e (a code sketch follows below)
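A compact sketch of this training loop, assuming the parallel corpus is given as a list of (French token list, English token list) pairs; this follows the steps above but is my own illustrative code, not the Project #5 starter.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM for IBM Model 1. pairs: list of (french_tokens, english_tokens).
    Returns t[e][f] = P(f | e), with "NULL" as the empty English word."""
    pairs = [(f, ["NULL"] + e) for f, e in pairs]
    f_vocab = {w for f, _ in pairs for w in f}
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(f_vocab)))  # uniform init

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))   # Top: zero the counts
        for f_sent, e_sent in pairs:                       # E-step
            for fj in f_sent:
                norm = sum(t[ei][fj] for ei in e_sent)
                for ei in e_sent:
                    count[ei][fj] += t[ei][fj] / norm      # partial count P(a_j = i | f, e)
        for ei, cs in count.items():                       # M-step: renormalize per English word
            total = sum(cs.values())
            for fj, c in cs.items():
                t[ei][fj] = c / total
    return t
```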
Assignment #0.4
• Objective: to work with and understand IBM Model 1 and EM
• Data:
  • I like it | Me gusta
  • You like it | Te gusta
• Result:
  • An IBM Model 1 translation table t(·|e) for each English word e, including NULL
• See course wiki for details
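If you were to run the illustrative train_model1 sketch from the EM slide on this toy corpus, the call would look like the following (Spanish here plays the role of the "French" target side); the actual probability values depend on the number of iterations and are not shown.

```python
# Toy corpus from the assignment, as (foreign_tokens, english_tokens) pairs
pairs = [
    ("Me gusta".split(), "I like it".split()),
    ("Te gusta".split(), "You like it".split()),
]
t = train_model1(pairs, iterations=20)   # train_model1 defined in the earlier sketch
for e_word in ["NULL", "I", "You", "like", "it"]:
    print(e_word, dict(t[e_word]))       # one row of the translation table per English word
```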
Next • Trouble with Model 1 • Improvement to Model 1: Model 2! • Happy Thanksgiving!