
Statistical Machine Translation

Alona Fyshe

Based on slides from Colin Cherry and Dekang Lin

Basic Statistics
  • 0 ≤ P(x) ≤ 1
  • P(A)
    • Probability that A happens
  • P(A,B)
    • Probability that A and B happen
  • P(A|B)
    • Probability that A happens given that we know B happened
Basic Statistics
  • Conditional probability (definition written out below)

Basic Statistics
  • Use the definition of conditional probability to derive the chain rule

Basic Statistics
  • Just remember
    • Definition of conditional probability
    • Bayes rule
    • Chain rule
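
Written out, the three identities the rest of the lecture leans on are:

    P(A|B) = P(A,B) / P(B)                                   (conditional probability)
    P(A|B) = P(B|A) P(A) / P(B)                               (Bayes rule)
    P(x1, … , xn) = P(x1) P(x2|x1) … P(xn | x1, … , xn-1)     (chain rule)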
Goal
  • Translate.
  • I’ll use French (F) into English (E) as the running example.
Oh, Canada
  • I'm Canadian
    • Mandatory French class in school until grade 6
    • I speak "Cereal Box French"
  • Gratuit (free), Gagner (win), Chocolat (chocolate), Glaçage (icing), Sans gras (fat free), Sans cholestérol (cholesterol free), Élevé dans la fibre ("high in fibre")

Machine Translation
  • Translation is easy for (bilingual) people
  • Process:
    • Read the text in French
    • Understand it
    • Write it down in English
Machine Translation
  • Understanding language and writing well-formed text are hard tasks for computers
  • The human process is invisible, intangible
One approach: Babelfish
  • A rule-based approach to machine translation
  • A 30-year-old feat of software engineering
  • Programming the knowledge in by hand is difficult and expensive
Alternate Approach: Statistics
  • We are trying to model P(E|F)
    • I give you a French sentence
    • You give me back English
  • How are we going to model this?
    • We could use Bayes rule: P(E|F) = P(F|E) P(E) / P(F)
Why Bayes rule at all?
  • Why not model P(E|F) directly?
  • P(F|E)P(E) decomposition allows us to be sloppy
    • P(E) worries about good English
    • P(F|E) worries about French that matches English
    • The two can be trained independently
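
Since P(F) is fixed once the French sentence is given, maximizing P(E|F) is the same as maximizing the product of the two sloppier models:

    argmax_E P(E|F) = argmax_E P(F|E) P(E) / P(F) = argmax_E P(F|E) P(E)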
Crime Scene Analogy
  • F is a crime scene. E is a person who may have committed the crime
    • P(E|F) - look at the scene - who did it?
    • P(E) - who had a motive? (Profiler)
    • P(F|E) - could they have done it? (CSI - transportation, access to weapons, alibi)
  • Some people might have great motives, but no means - you need both!
On voit Jon à la télévision ("We see Jon on television")

[Table: candidate English translations of the sentence, scored by P(E) and P(F|E)]

Table borrowed from Jason Eisner

I speak English good.
  • How are we going to model good English?
  • How do we know these sentences are not good English?
    • Jon appeared in TV.
    • It back twelve saw.
    • In Jon appeared TV.
    • TV appeared on Jon.
    • Je ne parle pas l'anglais.

I speak English good.
  • Je ne parle pas l'anglais. ("I do not speak English.")
    • These aren't English words.
  • It back twelve saw.
    • These are English words, but it's gibberish.
  • Jon appeared in TV.
    • "appeared in TV" isn't proper English
I speak English good.
  • Let's say we have a huge collection of documents written in English
    • Like, say, the Internet.
  • It would be a pretty comprehensive list of English words
    • Save for "named entities": people, places, things
    • Might include some non-English words
      • Speling mitsakes! lol!
  • Could also tell us if a phrase is good English
Google, is this good English?
  • Jon appeared in TV.
    • “Jon appeared” 1,800,000 Google results
    • “appeared in TV” 45,000 Google results
    • “appeared on TV” 210,000 Google results
  • It back twelve saw.
    • “twelve saw” 1,100 Google results
    • “It back twelve” 586 Google results
    • “back twelve saw” 0 Google results
  • Imperfect counting… why?
Google, is this good English?
  • Language is often modeled this way
    • Collect statistics about the frequency of words and phrases
    • N-gram statistics
      • 1-gram = unigram
      • 2-gram = bigram
      • 3-gram = trigram
      • 4-gram = four-gram
      • 5-gram = five-gram
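
As a concrete sketch of what "collect statistics about the frequency of words and phrases" means, here is a tiny n-gram counter over a made-up corpus (the corpus and names are invented for illustration):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count every n-gram (tuple of n consecutive tokens) in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    corpus = "jon appeared on tv and jon appeared on the news".split()
    bigrams = ngram_counts(corpus, 2)
    print(bigrams[("appeared", "on")])  # 2
    print(bigrams[("appeared", "in")])  # 0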
Google, is this good English?
  • Seriously, you want to query Google for every phrase in the translation?
  • Google created and released a 5-gram data set.
    • Now you can query Google locally
      • (kind of)
Language Modeling
  • What's P(e)?
    • P(English sentence)
    • P(e1, e2, e3 … ei)
    • Using the chain rule: P(e1, e2, … , ei) = P(e1) P(e2|e1) … P(ei | e1, … , ei-1)
Language Modeling
  • Markov assumption
    • The choice of word ei depends only on the n words before ei
  • Definition of conditional probability

Language Modeling
  • Approximate the probability using counts
  • Use the n-gram corpus!
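
Putting the Markov assumption together with the count-based estimate (a sketch of the formulas these slides show):

    P(ei | e1, … , ei-1) ≈ P(ei | ei-n, … , ei-1)                                  (Markov assumption)
    P(ei | ei-n, … , ei-1) ≈ count(ei-n, … , ei-1, ei) / count(ei-n, … , ei-1)     (estimate from the n-gram corpus)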
Language Modeling
  • Use the n-gram corpus!
    • Not surprisingly, given that you love to eat, loving to eat chocolate is more probable (0.177)
Language Modeling
  • But what if one of the n-grams in the sentence never appears in the corpus?
  • Then P(e) = 0
  • Happens even if the sentence is grammatically correct
    • "Al Gore's pink Hummer was stolen."
Language Modeling
  • Smoothing
    • Many techniques
  • Add one smoothing
    • Add one to every count
    • No more zeros, no problems
  • Backoff
    • If P(e1, e2, e3, e4, e5) = 0 use something related to P(e1, e2, e3, e4)
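
A minimal sketch of add-one smoothing for a bigram model (the toy corpus and function name are invented for illustration; real systems use better smoothing):

    from collections import Counter

    def bigram_prob_add_one(w_prev, w, bigrams, unigrams, vocab_size):
        """P(w | w_prev) with add-one smoothing: pretend every bigram was seen once more."""
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

    tokens = "we saw jon on tv and we saw the news".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # a real system would also reserve a slot for unknown words

    print(bigram_prob_add_one("saw", "jon", bigrams, unigrams, vocab_size))     # seen bigram
    print(bigram_prob_add_one("saw", "hummer", bigrams, unigrams, vocab_size))  # unseen, but no longer zero

Backoff would instead fall back to a lower-order estimate (the bigram or unigram probability) whenever the higher-order count is zero.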
Language Modeling
  • Wait… Is this how people "generate" English sentences?
    • Do you choose your fifth word based only on the words right before it?
  • Admittedly, this is an approximation to a process which is both
    • intangible and
    • hard for humans themselves to explain
  • If you disagree, and care to defend yourself, consider a PhD in NLP
Back to Translation
  • Anyway, where were we?
    • Oh right…
    • So, we’ve got P(e), let’s talk P(f|e)
Where will we get P(F|E)?

[Diagram: cereal boxes in English + the same cereal boxes in French → machine learning magic → a P(F|E) model]

Where will we get P(F|E)?

[Diagram: books in English + the same books in French → machine learning magic → a P(F|E) model]

  • We call collections stored in two languages parallel corpora or parallel texts
  • Want to update your system? Just add more text!

Translated Corpora
  • The Canadian Parliamentary Debates
    • Available in both French and English
  • UN documents
    • Available in Arabic, Chinese, English, French, Russian and Spanish
Problem:
  • How are we going to generalize from examples of translations?
  • I’ll spend the rest of this lecture telling you:
    • What makes a useful P(F|E)
    • How to obtain the statistics needed for P(F|E) from parallel texts
Strategy: Generative Story
  • When modeling P(X|Y):
    • Assume you start with Y
    • Decompose the creation of X from Y into some number of operations
    • Track statistics of individual operations
    • For a new example X,Y: P(X|Y) can be calculated based on the probability of the operations needed to get X from Y
What if…?

The quick fox jumps over the lazy dog

Le renard rapide saut par - dessus le chien parasseux

New Information
  • Call this new info a word alignment (A)
  • With A, we can make a good story

[Diagram: the same sentence pair, with lines linking each French word to the English word it corresponds to]

P(F,A|E) Story

null The quick fox jumps over the lazy dog

[Diagram: empty slots f1, f2, f3, … , f10 waiting to be filled with French words]

Simplifying assumption: Choose the length of the French sentence f. All lengths have equal probability ε

P(F,A|E) Story

null The quick fox jumps over the lazy dog

[Diagram: each of the slots f1 … f10 can be linked to any of the 8 English words or to null]

There are (l+1)^m = (8+1)^10 possible alignments

P(F,A|E) Story

[Diagram: "null The quick fox jumps over the lazy dog" aligned word-by-word with "Le renard rapide saut par - dessus le chien parasseux"]
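
Putting the pieces of the story together gives the model the next slides estimate (a sketch of the formula the deck shows as images; this is the IBM Model 1 form): pick a length m with probability ε, pick one of the (l+1)^m alignments uniformly, then generate each French word from the English word it is aligned to:

    P(F, A|E) = ε / (l+1)^m × Pt(f1 | e_a1) × Pt(f2 | e_a2) × … × Pt(fm | e_am)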

Getting Pt(f|e)

[Diagram: a large collection of aligned sentence pairs like "null The quick fox jumps over the lazy dog" / "Le renard rapide saut par - dessus le chien parasseux", with lines linking each French word to its English word]

  • We need numbers for Pt(f|e)
  • Example: Pt(le|the)
    • Count lines in a large collection of aligned text
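
Counting lines gives the obvious relative-frequency estimate (a sketch):

    Pt(le | the) ≈ (number of times "le" is linked to "the") / (number of times "the" is linked to anything)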
Where do we get the lines?
  • That sure looked like a lot of monkeys…
  • Remember: sometimes the information hidden in the text just jumps out at you
    • We'll get alignments out of unaligned text by treating the alignment as a hidden variable
    • We infer an A that maximizes the probability of our corpus
    • Generalization of ideas in HMM training: called EM
Where's "heaven" in Vietnamese?

English: In the beginning God created the heavens and the earth.

Vietnamese: Ban dâu Dúc Chúa Tròi dung nên tròi dât.

English: God called the expanse heaven.

Vietnamese: Dúc Chúa Tròi dat tên khoang không la tròi.

English: … you are this day like the stars of heaven in number.

Vietnamese: … các nguoi dông nhu sao trên tròi.

"tròi" is the one Vietnamese word that shows up in all three sentences, just as "heaven(s)" shows up in all three English sentences - it jumps out as the likely translation.

Example borrowed from Jason Eisner

EM: Expectation Maximization
  • Assume a probability distribution (weights) over hidden events
    • Take counts of events based on this distribution
    • Use counts to estimate new parameters
    • Use parameters to re-weight examples.
  • Rinse and repeat
Alignment Hypotheses

[Diagram: eight candidate word alignments between "null I like milk" and "Je aime le lait", each carrying a weight: 0.65, 0.25, 0.05, 0.01, 0.01, 0.01, 0.01, 0.001]
Weighted Alignments
  • What we’ll do is:
    • Consider every possible alignment
    • Give each alignment a weight - indicating how good it is
    • Count weighted alignments as normal
Good grief! We forgot about P(F|E)!
  • No worries, a little more stats gets us what we need: P(F|E) = Σ_A P(F,A|E)
Big Example: Corpus
  • Sentence pair 1: "fast car" / "voiture rapide"
  • Sentence pair 2: "fast" / "rapide"

Possible Alignments
  • 1a and 1b: the two ways to align "fast car" with "voiture rapide" (fast-voiture with car-rapide, or fast-rapide with car-voiture)
  • 2: the single alignment of "fast" with "rapide"

Parameters

[Table: the current Pt(f|e) estimates for every English/French word pair; at the start all are equal]

Weight Calculations

[Table: each alignment weighted by the product of its Pt(f|e) values; with uniform parameters, 1a and 1b get equal weight]

Count Lines
  • Collect weighted counts of the word pairs each alignment links: 1a contributes 1/2, 1b contributes 1/2, 2 contributes 1
  • Normalize

Parameters

[Table: the re-estimated Pt(f|e) values; Pt(rapide|fast) has grown, because sentence pair 2 always links fast to rapide]

Weight Calculations

[Table: alignment weights recomputed from the new parameters]

Count Lines
  • New weighted counts: 1a contributes 1/4, 1b contributes 3/4, 2 contributes 1
  • Normalize

After many iterations:
  • 1a: weight ~0
  • 1b (fast-rapide, car-voiture): weight ~1
  • 2: weight 1
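
A minimal, runnable sketch of this EM loop on the same toy corpus (an IBM-Model-1-style trainer; the variable names are invented for illustration and null words are ignored to keep it short):

    from collections import defaultdict
    from itertools import product

    # Toy parallel corpus from the slides: (English words, French words)
    corpus = [(["fast", "car"], ["voiture", "rapide"]),
              (["fast"], ["rapide"])]

    # Start with uniform translation probabilities Pt(f|e)
    french_vocab = {f for _, fr in corpus for f in fr}
    t = defaultdict(lambda: 1.0 / len(french_vocab))

    for _ in range(50):
        counts = defaultdict(float)   # expected count of each (e, f) link
        totals = defaultdict(float)   # expected count of e linking to anything
        for eng, fra in corpus:
            # Every alignment: each French position picks one English position
            alignments = list(product(range(len(eng)), repeat=len(fra)))
            weights = []
            for a in alignments:
                w = 1.0
                for j, i in enumerate(a):
                    w *= t[(eng[i], fra[j])]
                weights.append(w)
            z = sum(weights)
            # E-step: count the links each alignment proposes, weighted by its share of the probability
            for a, w in zip(alignments, weights):
                for j, i in enumerate(a):
                    counts[(eng[i], fra[j])] += w / z
                    totals[eng[i]] += w / z
        # M-step: normalize the weighted counts into new parameters
        for (e, f) in counts:
            t[(e, f)] = counts[(e, f)] / totals[e]

    print(round(t[("fast", "rapide")], 3), round(t[("car", "voiture")], 3))  # both converge to ~1.0

This enumerates every alignment explicitly, which is fine for two-word sentences; the last slides of the deck show why the real computation never has to.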

Seems too easy?
  • What if you have no 1-word sentence?
    • Words in shorter sentences will get more weight - fewer possible alignments
    • Weight is additive throughout the corpus: if a word e shows up frequently with some other word f, P(f|e) will go up
      • Like our heaven example
The Final Product
  • Now we have a model for P(F|E)
  • Test it by aligning a corpus!
    • i.e., find argmax_A P(A|F,E)
  • Use it for translation:
    • Combine with our n-gram model for P(E)
    • Search the space of English sentences for one that maximizes P(E)P(F|E) for a given F
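
A note on that argmax (it follows from the model sketched earlier): because each French word chooses its English word independently, the best alignment can be read off one position at a time:

    for each French position j:  aj = argmax_i Pt(fj | ei)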
Model could be a lot better:
  • Word positions
  • Multiple f’s generated by the same e
  • Could take into account who generated your neighbors
  • Could use syntax, parsing
  • Could align phrases
Sure, but is it any better?
  • We've got some good ideas for improving translation
  • How can we quantify the change in translation quality?
Sure, but is it any better?
  • How to (automatically) measure translation?
    • Original French

Dès qu'il fut dehors, Pierre se dirigea vers la rue de Paris, la principale rue du Havre, éclairée, animée, bruyante.

    • Human translation to English

As soon as he got out, Pierre made his way to the Rue de Paris, the high-street of Havre, brightly lighted up, lively and noisy.

    • Two machine translations back to French:
      • Dès qu'il est sorti, Pierre a fait sa manière à la rue De Paris, la haut-rue de Le Havre, brillamment allumée, animée et bruyante.
      • Dès qu'il en est sorti, Pierre s'est rendu à la Rue de Paris, de la grande rue du Havre, brillamment éclairés, animés et bruyants.

Example from http://www.readwriteweb.com/archives/google_translation_systran.php

BLEU Score
  • BLEU
    • Bilingual Evaluation Understudy
    • A metric for comparing translations
  • Considers
    • n-grams in common with the target translation
    • Length of the target translation
  • A score of 1 means identical to the target; 0 means no n-grams in common
  • Even human translations don't score 1
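
A stripped-down sketch of the idea behind BLEU (clipped n-gram precision plus a brevity penalty; real BLEU also handles multiple references and smoothing, and this toy version is only for illustration):

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def simple_bleu(candidate, reference, max_n=4):
        """Geometric mean of clipped n-gram precisions, times a brevity penalty."""
        precisions = []
        for n in range(1, max_n + 1):
            cand, ref = ngrams(candidate, n), ngrams(reference, n)
            overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
            precisions.append(max(overlap, 1e-9) / max(sum(cand.values()), 1))
        brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
        return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

    reference = "jon appeared on tv".split()
    print(simple_bleu("jon appeared on tv".split(), reference))  # 1.0
    print(simple_bleu("jon appeared in tv".split(), reference))  # much lower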
Google Translate
  • http://translate.google.com/translate_t
    • 25 language pairs
  • In the news (digg.com)
    • http://www.readwriteweb.com/archives/google_translation_systran.php
  • In competition
    • http://www.nist.gov/speech/tests/mt/doc/mt06eval_official_results.html
References (Inspiration, Sources of borrowed material)
  • Colin Cherry, MT for NLP, 2005, http://www.cs.ualberta.ca/~colinc/ta/MT650.pdf
  • Knight, K., Automating Knowledge Acquisition for Machine Translation, AI Magazine 18(4), 1997.
  • Knight, K., A Statistical Machine Translation Tutorial Workbook, 1999, http://www.clsp.jhu.edu/ws99/projects/mt/mt-workbook.htm
  • Eisner, J., JHU NLP Course notes: Machine Translation, 2001, http://www.cs.jhu.edu/~jason/465/PDFSlides/lect32-translation.pdf
  • Olga Kubassova, Probability for NLP, http://www.comp.leeds.ac.uk/olga/ProbabilityTutorial.ppt
Enumerating all alignments

There are (l+1)^m possible alignments!

Gah!

[Example: "Null (0) Fast (1) car (2)" aligned with "Voiture (1) rapide (2)": P(F|E) written as a sum over every alignment of products of Pt(f|e) terms]

Let's move these over here…

And now we can do this…

[The sum over alignments is rearranged: because each French word chooses its English word independently, the big sum of products factors into a product of small sums]

So, it turns out:

Requires only (l+1) × m operations.

Can be used whenever your alignment choice for one word does not affect the probability of the rest of the alignment
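
Written out, the rearrangement the last few slides perform (a sketch using the earlier notation):

    P(F|E) = Σ_A P(F,A|E)
           = ε / (l+1)^m × Σ_a1 … Σ_am  Pt(f1|e_a1) × … × Pt(fm|e_am)
           = ε / (l+1)^m × [ Σ_i Pt(f1|ei) ] × … × [ Σ_i Pt(fm|ei) ]

Each inner sum runs over the l English words plus null, so computing P(F|E) needs only about (l+1) × m terms instead of a sum over (l+1)^m alignments.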