IBM T. J. Watson Research Center

IBM Statistical Machine Translation for Spoken Languages

Young-Suk Lee
IWSLT 2005
October 24−25, 2005

© 2005 IBM Corporation

Outline
  • Baseline Phrase Translation System
    • Block Acquisition
    • Decoding
  • Performance Enhancing Techniques
    • Extended Block Acquisition Algorithm
    • System Combination
  • IWSLT 2005 Evaluations
  • Conclusions & Future Work

Baseline System: Block Acquisition

Block (b): a phrase translation pair consisting of a source phrase and a target phrase

[Figure: word alignment grid over e1 to e6 and f1 to f3]
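
To illustrate how blocks might be read off a word-aligned sentence pair, here is a minimal Python sketch of alignment-consistent phrase-pair extraction. This is a common textbook procedure and an assumption here, not necessarily the acquisition method used in the system described; the function name `extract_blocks`, the `max_len` limit, and the toy alignment links are all hypothetical.

```python
def extract_blocks(src, tgt, links, max_len=3):
    """Return phrase pairs (blocks) that are consistent with a word alignment.

    src, tgt: lists of source / target words.
    links: set of (i, j) pairs meaning src[i] is aligned to tgt[j].
    max_len: hypothetical cap on phrase length, in words.
    """
    blocks = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_pos = [j for (i, j) in links if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word inside [j1, j2] may link outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in links):
                continue
            blocks.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return blocks


# Toy sentence pair with hypothetical alignment links.
src = ["f1", "f2", "f3"]
tgt = ["e1", "e2", "e3", "e4"]
links = {(0, 0), (1, 2), (1, 3), (2, 1)}
blocks = extract_blocks(src, tgt, links)
```

In this toy case the pair ("f2", "e3 e4") is a valid block, while no block starts at f1 and covers f2, because the target span it would need exceeds the length cap.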

Decoding I
  • Phrase translation models
    • Direct model:
    • Source channel model:
    • Block unigram model:
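
The formulas for these three models did not survive in the transcript. As a rough sketch only, the Python below assumes the usual relative-frequency estimates over block counts: a direct model p(target | source), a source channel model p(source | target), and a block unigram model p(b). These definitions, the function name `phrase_model_scores`, and the toy counts are assumptions, not taken from the slide.

```python
from collections import Counter


def phrase_model_scores(block_counts):
    """Relative-frequency estimates for each block b = (source, target).

    Assumed definitions (not taken from the slide):
      direct          p(target | source) = N(b) / N(source)
      source channel  p(source | target) = N(b) / N(target)
      block unigram   p(b)               = N(b) / sum of all block counts
    """
    src_totals, tgt_totals = Counter(), Counter()
    for (src, tgt), n in block_counts.items():
        src_totals[src] += n
        tgt_totals[tgt] += n
    total = sum(block_counts.values())
    return {
        (src, tgt): {
            "direct": n / src_totals[src],
            "source_channel": n / tgt_totals[tgt],
            "block_unigram": n / total,
        }
        for (src, tgt), n in block_counts.items()
    }


# Hypothetical block counts collected from a training corpus.
counts = {
    ("lA Aryd", "I do n't want"): 3,
    ("lA Aryd", "do n't want"): 1,
    ("AzAlthA", "it extracted"): 4,
}
scores = phrase_model_scores(counts)
```

With these counts, the block ("lA Aryd", "I do n't want") gets a direct score of 3/4, because the source phrase occurs four times in total and three of those occurrences pair with this target phrase.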

Decoding II
  • IBM Model 1 cost per phrase in both directions
  • Word trigram language model
  • Word-level distortion models applied to blocks
  • Word count penalty
  • Block count penalty
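
One plausible way to use feature costs like these in a decoder is a weighted sum per hypothesis, with the lowest total cost winning. The slide does not say how the costs are combined, so the sketch below is an assumption; the feature names, values, and weights are hypothetical.

```python
def hypothesis_cost(features, weights):
    """Weighted sum of feature costs for one translation hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical feature values for one hypothesis
# (negative log-probabilities for the models, raw counts for the penalties).
features = {
    "model1_src_to_tgt": 4.2,
    "model1_tgt_to_src": 3.9,
    "trigram_lm": 7.5,
    "distortion": 1.0,
    "word_count": 5,
    "block_count": 2,
}
# Hypothetical weights; in practice these would be tuned on a development set.
weights = {
    "model1_src_to_tgt": 0.5,
    "model1_tgt_to_src": 0.5,
    "trigram_lm": 1.0,
    "distortion": 0.3,
    "word_count": -0.2,
    "block_count": 0.1,
}
cost = hypothesis_cost(features, weights)
```

A negative weight on the word count, as assumed here, would favor longer outputs; a positive weight on the block count would favor hypotheses built from fewer, longer blocks.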

Extended Block Acquisition

Arabic: lA Aryd AzAlthA (لا أريد إزالتها)
English: I do n't want it extracted

Extended Block Acquisition Algorithm

Example: "I don't want it extracted" aligned with "lA Aryd AzAlthA" (لا أريد إزالتها)
  • Expansion word list: A list of target words typically aligned to null source words (e.g. I, do, it)
  • Extend the target phrase to include an expansion word if it occurs in the neighborhood of a seed block
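
The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: "neighborhood" is read as a small window of adjacent words on either side of the seed block's target span, and the function name `extend_target_span`, the `window` parameter, and the seed positions are hypothetical.

```python
EXPANSION_WORDS = {"i", "do", "it"}  # example expansion words from the slide


def extend_target_span(tgt_words, start, end,
                       expansion_words=EXPANSION_WORDS, window=2):
    """Grow the seed target span [start, end] over adjacent expansion words.

    At most `window` words are added on each side; growth on a side stops
    at the first neighbor that is not an expansion word.
    """
    new_start, new_end = start, end
    while (new_start > 0 and start - new_start < window
           and tgt_words[new_start - 1].lower() in expansion_words):
        new_start -= 1
    while (new_end < len(tgt_words) - 1 and new_end - end < window
           and tgt_words[new_end + 1].lower() in expansion_words):
        new_end += 1
    return new_start, new_end


tgt = "I do n't want it extracted".split()
# Hypothetical seed block: source "AzAlthA" paired with target "extracted" (position 5).
s, e = extend_target_span(tgt, 5, 5)
extended = " ".join(tgt[s:e + 1])
```

Here the seed target phrase "extracted" grows to "it extracted", since "it" is an expansion word directly to its left and "want" is not.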

Impact of Extended Block Acquisition: A2E

[Chart: BLEUr16n4 on CSTAR 03 Dev Set and IWSLT 04 Dev Set; series labels: EXTENDED, Reordering Rules]

Impact of Extended Block Acquisition: C2E

[Chart: BLEUr16n4 on CSTAR 03 Dev Set and IWSLT 04 Dev Set; series labels: EXTENDED, Reordering Rules]

System Combination: Recipe
  • SYSTEM 1: translate with Phrase Lexicon 1
  • SYSTEM 2: translate with Phrase Lexicon 2
  • SYSTEM 3: translate with Phrase Lexicon 3
  • Algorithm: Select the Best

Arabic-to-English Phrase Lexicons

llmEArDp 'of the opposition' → l# Al# EArD +p → l# Al# EArDp

lA Aryd AzAlthA → lA A# ryd AzAl +t +hA → lA Aryd AzAlt +hA

[Table: OOV Ratio]

System Combination Algorithm
  • h-sys (system producing the highest BLEU score) vs. l-sys1, l-sys2, ..., l-sysn
  • For each l-sysi, from l-sys1 to l-sysn:
    • If cost(h-sys) > cost(l-sysi) + threshold_i (YES): select output(l-sysi)
    • Otherwise (NO): go on to the next system
  • If every test answers NO: select output(h-sys)
  • Combine the selected output as the final translation
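
The selection step can be sketched in Python as below. This is one reading of the flowchart, assuming the l-sys tests are tried in order and the first one that passes wins; the function name `select_output`, the tuple layout, and the example costs and thresholds are hypothetical.

```python
def select_output(h_sys, l_systems):
    """Pick one translation for a sentence.

    h_sys: (output, cost) from the system with the highest BLEU score.
    l_systems: list of (output, cost, threshold) for l-sys1 ... l-sysn, in order.
    """
    h_output, h_cost = h_sys
    for l_output, l_cost, threshold in l_systems:
        # Prefer l-sys_i only if it beats h-sys by more than its threshold.
        if h_cost > l_cost + threshold:
            return l_output
    return h_output


# Hypothetical decoder outputs and costs for one input sentence.
h_sys = ("i do not want it removed", 12.0)
l_systems = [
    ("i don't want it taken out", 11.5, 1.0),   # 12.0 > 12.5 is false
    ("i don't want it extracted", 9.0, 2.0),    # 12.0 > 11.0 is true
]
final = select_output(h_sys, l_systems)
```

Applying the function sentence by sentence and concatenating the results gives the combined translation of a test set.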

Impact of System Combination: IWSLT 05 A2E Unrestricted Data Track

[Chart: BLEUr16n4; series labels: system combination, morph segmented, morph analysis, unsegmented, Reordering Rules]

Impact of System Combination: IWSLT 05 C2E Unrestricted Data Track

[Chart: BLEUr16n4; series labels: char seg & unreordered, system combination, word seg & reorder, char seg & reorder, Reordering Rules]

IWSLT 2005: Training Corpora for A2E

TM: Number of sentence pairs, LM: Number of words

Conclusions & Future Work
  • Conclusions
    • Robust system performance on
      • Large & small training corpora
      • Various language pairs: A2E, C2E, S2E, E2S
    • System combination & Extended block acquisition algorithm
      • Effective for A2E & C2E translations
  • Future Work: System Combination
    • Extend the technique to models derived by distinct algorithms
    • Refine the algorithm to discriminate effective decoder parameters
    • Apply the technique to TC-Star SLT partner systems

IWSLT 2005: Training Corpora for C2E

TM: Number of sentence pairs, LM: Number of words
