


Improving SMT with Phrase to Phrase Translations

Joy Ying Zang, Ashish Venugopal,

Stephan Vogel, Alex Waibel

Carnegie Mellon University

Project: Mega-RADD


The Mega-RADD Team

  • SMT: Stephan Vogel, Alex Waibel, John Lafferty

  • EBMT: Ralf Brown, Bob Frederking

  • Chinese: Joy Ying Zang, Ashish Venugopal, Bing Zhao, Fei Huang

  • Arabic: Alicia Tribble, Ahmed Badran


  • Goals:

    • Develop Data-Driven General Purpose MT Systems

    • Train on Large and Small Corpora, Evaluate to test Portability

  • Approaches

    • Two Data-driven Approaches: Statistical, Example-Based

    • Also Grammar based Translation System

    • Multi-Engine Translation

  • Languages: Chinese and Arabic

  • Statistical Translation:

    • Exploit Structure in Language: Phrases

    • Determine Phrases from Mono- and Bi-Lingual Co-occurrences

    • Determine Phrases from Lexical and Alignment Information

Arabic: Initial System

  • 1 million words of UN data, 300 sentences for testing

  • Preprocessing: separation of punctuation marks, lower case for English, correction of corrupted numbers

  • Adding human knowledge: cleaning the statistical lexicon for the 100 most frequent words; building lists of names, simple date expressions, and numbers (total: 1000 entries; total effort: two part-timers * 4 weeks)

  • Alignment: IBM1 plus HMM training, lexicon plus phrase translations

  • Language Model: trained on 1m sub-corpus

  • Results (20 May 2002):

    • UN test data (300 sentences): Bleu = 0.1176

    • NIST devtest (203 sentences): Bleu = 0.0242, NIST = 2.0608

Arabic: Portability to a New Language

  • Training on subset of UN corpus chosen to cover vocabulary of test data

  • Training English to Arabic for extraction of phrase translations

  • Minimalist Morphology: strip/add suffixes for ~200 unknown words. NIST: 5.5368 → 5.6700

  • Adapting LM: select stories from 2 years of English Xinhua stories according to an 'Arabic' keyword list (280 entries); size 6.9m words. NIST: 5.5368 → 5.9183

  • Results:

    • 20 May (devtest): 2.0608

    • 13 June (devtest): 6.5805

    • 14 June (evaltest): 5.4662 (final training not completed)

    • 17 June (evaltest): 6.4499 (after completed training)

    • 19 July (devtest): 7.0482

Two Approaches

  • Determine Phrases from Mono- and Bi-Lingual Co-occurrences

    • Joy

  • Determine Phrases from Lexical and Alignment Information

    • Ashish

Why phrases?

  • Mismatch between languages: word to word translation doesn’t work

  • Phrases encapsulate the context of words, e.g. verb tense

Why phrases? (Cont.)

  • Local reordering, e.g. Chinese relative clause

  • Using phrases to compensate for word segmentation errors

Utilizing bilingual information

  • Given a sentence pair (S,T),

    S=<s1,s2,…,si,…,sm> and T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.

  • Given an m*n matrix B, where

    B(i,j) = co-occurrence(si,tj), computed from the counts a, b, c, d,

    where N = a+b+c+d.
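The co-occurrence formula itself did not survive in this transcript. The remark N = a+b+c+d suggests a 2×2 contingency table of sentence-pair counts, so the following is only a sketch under that assumption, using the chi-square statistic as a stand-in association measure. The function names `contingency`, `chi_square` and `cooccurrence_matrix` are hypothetical, and the measure actually used on the slide may differ.

```python
def contingency(pairs, s, t):
    """Count sentence pairs by whether they contain source word s and target word t.

    Assumed meaning of the counts: a = both present, b = only s,
    c = only t, d = neither.
    """
    a = b = c = d = 0
    for src, tgt in pairs:
        has_s, has_t = s in src, t in tgt
        if has_s and has_t:
            a += 1
        elif has_s:
            b += 1
        elif has_t:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square association score (an assumed choice, not confirmed by the slides)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def cooccurrence_matrix(pairs, src_sent, tgt_sent):
    """Build the m*n matrix B for one sentence pair from corpus-level counts."""
    return [[chi_square(*contingency(pairs, s, t)) for t in tgt_sent]
            for s in src_sent]
```

Each row of the returned matrix corresponds to a source word and each column to a target word, matching the B(i,j) indexing above.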

Utilizing bilingual information (Cont.)

  • Goal: find a partition over matrix B, under the constraint that one src/tgt word can only align to one tgt/src word or phrase (adjacent word sequence)

Legal segmentation, imperfect alignment

Illegal segmentation, perfect alignment

Utilizing bilingual information (Cont.)

For each sentence pair in the training data:

    While (some row or column is still not aligned) {

        Find cell[i,j], where B(i,j) is the max among all available (not aligned) cells;

        Expand cell[i,j] with similarity threshold sim_thresh to region[RowStart,RowEnd; ColStart,ColEnd];

        Mark all the cells in the region as aligned

    }

    Output the aligned regions as phrases

Sub expand cell[i,j] with sim_thresh {

    current aligned region: region[RowStart=i, RowEnd=i; ColStart=j, ColEnd=j]

    While (still ok to expand) {

        if all cells[m,n], where m=RowStart-1, ColStart<=n<=ColEnd, have B(m,n) similar to B(i,j) then RowStart = RowStart - 1; //expand to north

        if all cells[m,n], where m=RowEnd+1, ColStart<=n<=ColEnd, have B(m,n) similar to B(i,j) then RowEnd = RowEnd + 1; //expand to south

        … //expand to east

        … //expand to west

    }

}

Define similar(x,y) = true, if abs((x-y)/y) < 1-sim_thresh
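The greedy partitioning above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. Several points are assumptions: the east/west steps mirror the north/south ones, "marking cells as aligned" removes the region's rows and columns from further use, the loop stops once either rows or columns run out, and a zero seed value is treated as similar to nothing. The names `expand` and `partition` are hypothetical.

```python
def similar(x, y, sim_thresh):
    """True if x lies within a relative distance of (1 - sim_thresh) from y."""
    # Guard against division by zero (an added assumption).
    return y != 0 and abs((x - y) / y) < 1 - sim_thresh


def expand(B, i, j, rows, cols, sim_thresh):
    """Grow cell (i, j) into a rectangle of unaligned cells similar to B[i][j]."""
    top, bottom, left, right = i, i, j, j
    seed = B[i][j]

    def row_ok(m):
        # Row m must be unaligned and similar to the seed across the region's columns.
        return m in rows and all(
            similar(B[m][n], seed, sim_thresh) for n in range(left, right + 1))

    def col_ok(n):
        # Column n must be unaligned and similar to the seed across the region's rows.
        return n in cols and all(
            similar(B[m][n], seed, sim_thresh) for m in range(top, bottom + 1))

    grew = True
    while grew:
        grew = False
        if row_ok(top - 1):       # expand to north
            top, grew = top - 1, True
        if row_ok(bottom + 1):    # expand to south
            bottom, grew = bottom + 1, True
        if col_ok(right + 1):     # expand to east (assumed to mirror north/south)
            right, grew = right + 1, True
        if col_ok(left - 1):      # expand to west (assumed to mirror north/south)
            left, grew = left - 1, True
    return top, bottom, left, right


def partition(B, sim_thresh=0.8):
    """Return aligned regions as (row_start, row_end, col_start, col_end) tuples."""
    rows, cols = set(range(len(B))), set(range(len(B[0])))
    regions = []
    while rows and cols:
        # Pick the best remaining cell; ties go to the first cell in row-major order.
        i, j = max(((r, c) for r in sorted(rows) for c in sorted(cols)),
                   key=lambda rc: B[rc[0]][rc[1]])
        top, bottom, left, right = expand(B, i, j, rows, cols, sim_thresh)
        rows -= set(range(top, bottom + 1))
        cols -= set(range(left, right + 1))
        regions.append((top, bottom, left, right))
    return regions
```

With a matrix such as `[[9.0, 0.1, 0.1], [0.1, 5.0, 4.9], [0.1, 0.2, 0.1]]` and `sim_thresh=0.9`, the second source word is grouped with the two-word target phrase in columns 1 to 2, while the third source word stays unaligned.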

Utilizing bilingual information (Cont.)

Figure: expanding the aligned region to the north, south, east, and west (example phrases: Santa Clarita, Union town, Los Angeles)

Integrating monolingual information

  • Motivation:

    • Use more information in the alignment

    • Easier for aligning phrases

    • There is much more monolingual data than bilingual data


Integrating monolingual information (Cont.)

  • Given a sentence pair (S,T),

    S=<s1,s2,…,si,…sm> and T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.

  • Construct m*m matrix A, where A(i,j) = collocation(si, sj); only A(i,i-1) and A(i,i+1) have values.

  • Construct n*n matrix C, where C(i,j) = collocation(ti, tj); only C(j-1,j) and C(j+1,j) have values.

  • Construct m*n matrix B, where B(i,j)= co-occurrence(si, tj).

Integrating monolingual information (Cont.)

  • Normalize A so that:

  • Normalize C so that:

  • Normalize B so that:

  • Calculating new src-tgt matrix B’ = A*B*C
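The normalization conditions are not preserved in this transcript. As a rough sketch only, the following assumes that each matrix is normalized so that every row sums to 1, and then forms the product A*B*C mentioned in the discussion of complexity. The helper names `normalize_rows`, `matmul` and `smooth` are hypothetical, and the matrices are passed in by the caller rather than built from collocation counts.

```python
def normalize_rows(M):
    """Scale each row to sum to 1 (assumed normalization; all-zero rows stay zero)."""
    out = []
    for row in M:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(X, Y):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def smooth(A, B, C):
    """Combine source collocations A (m*m), co-occurrences B (m*n) and
    target collocations C (n*n) into a new m*n matrix B'."""
    return matmul(matmul(normalize_rows(A), normalize_rows(B)), normalize_rows(C))
```

A dense product is used here for brevity. Because A and C only have entries for adjacent words, an implementation that exploits this sparsity is what makes the O(m*n) cost quoted in the slides plausible.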



Discussion and Results

  • Simple

  • Efficient

    • Partitioning the matrix is linear: O(min(m,n)).

    • The construction of A*B*C is O(m*n).

  • Effective

    • Improved the translation quality from the baseline (NIST = 6.3775, Bleu = 0.1417) to (NIST = 6.7405, Bleu = 0.1681) on the small data track dev-test

Utilizing alignment information: Motivation

  • Alignment model associates words and their translations on the sentence level.

  • Context and co-occurrence are represented when considering a set of sentence level alignments.

  • Extract phrase relations from the alignment information.

Processing Alignments

  • Identification – Selection of target phrase candidates for each source phrase.

  • Scoring – Assigning a score to each candidate phrase pair to create a ranking.

  • Pruning – Reducing the set of candidate translations to a computationally tractable number.


Identification

  • Extraction from sentence-level alignments.

  • For each source phrase, identify the sentences in which it occurs and load their sentence alignments.

  • Form a sliding/expanding window in the alignment to identify candidate translations.
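One possible reading of the sliding/expanding window is sketched below. It assumes the window is bounded by the leftmost and rightmost target positions linked to the source phrase, and that every contiguous target span inside that window becomes a candidate. The function name `candidate_translations` and the representation of the alignment as (source index, target index) pairs are assumptions made for illustration.

```python
def candidate_translations(tgt_words, alignment, src_start, src_end):
    """List candidate target phrases for the source span [src_start, src_end].

    alignment is an iterable of (source_index, target_index) links.
    """
    # Target positions linked to any word of the source phrase.
    linked = [j for i, j in alignment if src_start <= i <= src_end]
    if not linked:
        return []
    lo, hi = min(linked), max(linked)
    # Slide the start position and expand the end position within the window.
    return [" ".join(tgt_words[a:b + 1])
            for a in range(lo, hi + 1)
            for b in range(a, hi + 1)]
```

For the target sentence "it is in step" with links {(0,1), (1,3)} and the source span 0 to 1, the window covers "is in step" and yields six candidates, from "is" up to "is in step".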

Identification Example - I

Identification Example - II

  • is

  • is in step with the

  • is in step with the establishment

  • is in step with the establishment of

  • is in step with the establishment of its

  • is in step with the establishment of its legal

  • is in step with the establishment of its legal system

  • the

  • the establishment

  • the establishment of

  • ……

  • the establishment of its legal system

  • ……

  • establishment

  • establishment of

  • establishment of its

  • ….

Scoring - I

  • This candidate set H needs to be scored and ranked before pruning.

  • Alignment based scores.

  • Similarity clustering

    • Assume that the hypothesis set contains several similar phrases (across several sentences) and several noisy phrases.

    • SimScore(h) = Mean(EditDistance(h, h’)/AvgLen(h,h’)) for h,h’ in H
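The SimScore formula can be sketched as below. Two details are assumptions: the edit distance is computed over words rather than characters, and the mean runs over all other hypotheses in H, excluding h itself. With this definition a lower score means the hypothesis sits closer to the rest of the set. The names `edit_distance` and `sim_scores` are hypothetical, and hypotheses are assumed to be non-empty strings.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def sim_scores(hypotheses):
    """Return SimScore for each hypothesis, in input order."""
    toks = [h.split() for h in hypotheses]
    scores = []
    for k, h in enumerate(toks):
        ratios = [edit_distance(h, o) / ((len(h) + len(o)) / 2)
                  for m, o in enumerate(toks) if m != k]
        scores.append(sum(ratios) / len(ratios) if ratios else 0.0)
    return scores
```

In a set such as "the legal system", "its legal system" and "banana", the first two hypotheses score lower than the noisy third one.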

Scoring Example

Scoring - II

  • Lexicon augmentation

    • Weight each point in the alignment scoring by its lexical probability.

      • P( si | tj ), where I, J represent the area of the translation hypothesis being considered. Only the pairs of words where there is an alignment are considered.

    • Calculate translation probability of hypothesis

      • ΣiΠj P( si | tj ). All words in the hypothesis are considered.
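The first variant, weighting aligned points by lexical probability, might look like the sketch below. The slides do not say how the weighted points are aggregated, so summing them is an assumption, as are the `floor` value for unseen word pairs, the dictionary layout of the lexicon, and the function name `lexical_alignment_score`.

```python
def lexical_alignment_score(src, tgt, alignment, lex, src_span, tgt_span, floor=1e-7):
    """Sum P(s_i | t_j) over alignment points inside the hypothesis area.

    lex maps (source_word, target_word) to a probability; src_span and
    tgt_span are inclusive (start, end) index pairs, standing in for I and J.
    """
    (i0, i1), (j0, j1) = src_span, tgt_span
    # Keep only alignment points that fall inside the hypothesis area.
    points = [(i, j) for i, j in alignment if i0 <= i <= i1 and j0 <= j <= j1]
    return sum(lex.get((src[i], tgt[j]), floor) for i, j in points)
```

Widening the target span can only add points, so under this sketch the score of a hypothesis never drops when its area grows.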

Combining Scores

  • FinalScore(h) = Πj Scorej(h), taken over the scoring methods j.

  • Due to additional morphology present in English as compared to Chinese, a length model is used to adjust the final score to prefer longer phrases.

  • DiffRatio = (I-J) / J if I>J

  • FinalScore(h) = FinalScore(h) * (1.0 + c * e^(-1.0 * DiffRatio))

    • c is an experimentally determined constant
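The score combination and length adjustment can be sketched as follows. The slides only define DiffRatio for I > J, so treating it as 0 otherwise is an assumption, as is reading I and J as the two phrase lengths. The default `c=0.5` is a placeholder, not the experimentally determined value, and the function names are hypothetical.

```python
import math


def diff_ratio(i_len, j_len):
    """DiffRatio = (I - J) / J when I > J; assumed to be 0 in all other cases."""
    return (i_len - j_len) / j_len if i_len > j_len else 0.0


def final_score(scores, i_len, j_len, c=0.5):
    """Multiply the individual scores, then apply the length adjustment."""
    base = math.prod(scores)  # product over all scoring methods
    return base * (1.0 + c * math.exp(-1.0 * diff_ratio(i_len, j_len)))
```

Under these assumptions the multiplier is largest (1 + c) when DiffRatio is 0 and decays towards 1 as the length mismatch grows.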


Pruning

  • The large candidate list is now sorted by score and is ready for pruning.

  • It is difficult to pick a threshold that works across different phrases. We need a split point that separates the useful candidates from the noisy ones.

  • Split point = argmax p {MeanScore(h<p) – MeanScore(h>=p)}, where h represents each hypothesis in the ordered set H.
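The split-point rule can be sketched directly from the formula. The sketch assumes that higher scores are better and that the list is sorted in descending order, and it keeps every candidate when fewer than two are available. The names `split_point` and `prune` are hypothetical.

```python
def split_point(sorted_scores):
    """Return p maximizing mean(scores[:p]) - mean(scores[p:]).

    sorted_scores must be in descending order. With fewer than two
    scores there is nothing to split, so everything is kept.
    """
    best_p, best_gap = len(sorted_scores), float("-inf")
    for p in range(1, len(sorted_scores)):
        head, tail = sorted_scores[:p], sorted_scores[p:]
        gap = sum(head) / len(head) - sum(tail) / len(tail)
        if gap > best_gap:
            best_p, best_gap = p, gap
    return best_p


def prune(candidates):
    """Keep the (phrase, score) pairs ranked above the split point."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    p = split_point([score for _, score in ranked])
    return ranked[:p]
```

For scores 0.9, 0.85, 0.8, 0.2, 0.1 the largest gap between the two means occurs at p = 3, so the three high-scoring candidates are kept and the two low-scoring ones are dropped.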


  • Alignment model – experimented with one-way (EF) and two-way (EF-FE union/intersection) for IBM Models 1-4.

    • Best results found using union (high recall model) from model 4.

  • Both lexical augmentation (using model 1 lexicon) scores and length bonus were applied.

Results and Thoughts

Table: Baseline (IBM1+LDC-Dic) and + Phrases, each on the Small Track and the Large Track



  • More effective pruning techniques would significantly reduce the experimentation cycle

  • Improved alignment models that better combine bi-directional alignment information

Combining Methods

Table: Small Data Track (Dec-01 data); rows: + Phrases Joy, + Phrases Ashish, + Phrases Joy & Ashish
