IBM T. J. Watson Research Center

IBM Statistical Machine Translation for Spoken Languages

Young-Suk Lee
IWSLT 2005
October 24−25, 2005

© 2005 IBM Corporation

Outline
  • Baseline Phrase Translation System
    • Block Acquisition
    • Decoding
  • Performance Enhancing Techniques
    • Extended Block Acquisition Algorithm
    • System Combination
  • IWSLT 2005 Evaluations
  • Conclusions & Future Work

Baseline System: Block Acquisition

[Figure: word alignment grid over the words e1 to e6 and f1 to f3]

Block (b): a phrase translation pair consisting of a source phrase and a target phrase

Decoding I
  • Phrase translation models
    • Direct model:
    • Source channel model:
    • Block unigram model:
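
The formulas for the three models did not survive in this transcript. As a minimal sketch only, the code below assumes each model is a relative-frequency estimate over block counts, which is a common choice in phrase-based systems but is not confirmed by the slides. The function name `phrase_models` and the toy counts are hypothetical.

```python
from collections import Counter

def phrase_models(block_counts):
    """Relative-frequency estimates over blocks.

    ASSUMPTION: the slide's own formulas are missing; these are the usual
    count-based estimates, not necessarily the ones the system uses.
    block_counts maps (source_phrase, target_phrase) -> count.
    """
    src_totals = Counter()
    tgt_totals = Counter()
    for (src, tgt), n in block_counts.items():
        src_totals[src] += n
        tgt_totals[tgt] += n
    total = sum(block_counts.values())

    # Direct model: p(target phrase | source phrase)
    direct = {b: n / src_totals[b[0]] for b, n in block_counts.items()}
    # Source channel model: p(source phrase | target phrase)
    channel = {b: n / tgt_totals[b[1]] for b, n in block_counts.items()}
    # Block unigram model: p(block)
    unigram = {b: n / total for b, n in block_counts.items()}
    return direct, channel, unigram

# Toy block counts, invented for illustration.
counts = {
    ("lA Aryd", "I do n't want"): 3,
    ("lA Aryd", "do n't want"): 1,
    ("AzAlthA", "it extracted"): 4,
}
direct, channel, unigram = phrase_models(counts)
```

Each dictionary is keyed by block, so a decoder could look up all three scores for the same block in one place.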

Decoding II
  • IBM Model 1 cost per phrase in both directions
  • Word trigram language model
  • Word-level distortion models applied to blocks
  • Word count penalty
  • Block count penalty
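
The first feature above can be illustrated with the textbook IBM Model 1 score. This is a sketch under assumptions: it uses the standard formulation with a NULL word and uniform alignment, drops the length constant, and floors zero probabilities. Whether the system normalizes the cost this way is not stated on the slide, and the lexical tables here are invented toy values.

```python
import math

def model1_cost(given, produced, t, floor=1e-9):
    """Negative log IBM Model 1 probability of `produced` given `given`.

    t maps (produced_word, given_word) -> lexical probability.
    ASSUMPTIONS: NULL word on the conditioning side, uniform alignment,
    length constant dropped, zero probabilities floored.
    """
    conditioning = ["NULL"] + list(given)
    cost = 0.0
    for word in produced:
        total = sum(t.get((word, g), 0.0) for g in conditioning)
        cost -= math.log(max(total / len(conditioning), floor))
    return cost

# Toy lexical tables for the two directions (hypothetical values).
t_tgt_given_src = {("want", "Aryd"): 0.5}
t_src_given_tgt = {("Aryd", "want"): 0.8}

# "Both directions": score the target phrase given the source phrase, and the reverse.
cost_forward = model1_cost(["Aryd"], ["want"], t_tgt_given_src)
cost_backward = model1_cost(["want"], ["Aryd"], t_src_given_tgt)
```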

Extended Block Acquisition

Arabic: lA Aryd AzAlthA (لا أريد إزالتها)

English: I do n't want it extracted

Extended Block Acquisition Algorithm

English: I don't want it extracted

Arabic: lA Aryd AzAlthA (لا أريد إزالتها)
  • Expansion word list: A list of target words typically aligned to null source words (e.g. I, do, it)
  • Extend the target phrase to include an expansion word if it occurs in the neighborhood of a seed block
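
The extension step can be sketched as follows. This is an illustration, not the presented implementation: it assumes "neighborhood" means words directly adjacent to the seed block's target span, and that extension continues while the adjacent word is on the expansion list. The function name `extend_block` and the seed span are hypothetical.

```python
def extend_block(target_words, span, expansion_words):
    """Grow a seed block's target span over adjacent expansion words.

    span is a half-open (start, end) index pair into target_words.
    ASSUMPTION: "neighborhood" is read as immediate adjacency.
    """
    start, end = span
    # Extend to the left while the preceding word is an expansion word.
    while start > 0 and target_words[start - 1] in expansion_words:
        start -= 1
    # Extend to the right while the following word is an expansion word.
    while end < len(target_words) and target_words[end] in expansion_words:
        end += 1
    return start, end

target = ["I", "do", "n't", "want", "it", "extracted"]
expansion_words = {"I", "do", "it"}

# Hypothetical seed block whose target side is "n't want" (indices 2 and 3).
start, end = extend_block(target, (2, 4), expansion_words)
extended_phrase = " ".join(target[start:end])
```

With the example sentence from the slide, the seed target phrase "n't want" grows to "I do n't want it"; "extracted" is not on the expansion list, so the extension stops there.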

Impact of Extended Block Acquisition: A2E

[Figure: BLEUr16n4 on the CSTAR 03 Dev Set and the IWSLT 04 Dev Set; bars labeled EXTENDED; legend: Reordering Rules]

Impact of Extended Block Acquisition: C2E

[Figure: BLEUr16n4 on the CSTAR 03 Dev Set and the IWSLT 04 Dev Set; bars labeled EXTENDED; legend: Reordering Rules]

System Combination: Recipe
  • Phrase Lexicon 1 → SYSTEM 1 → translate
  • Phrase Lexicon 2 → SYSTEM 2 → translate
  • Phrase Lexicon 3 → SYSTEM 3 → translate
  • Algorithm: Select the Best

Arabic-to-English Phrase Lexicons

llmEArDp 'of the opposition' → l# Al# EArD +p → l# Al# EArDp

lA Aryd AzAlthA → lA A# ryd AzAl +t +hA → lA Aryd AzAlt +hA

OOV Ratio

System Combination Algorithm
  • h-sys (system producing the highest BLEU score) vs. l-sys1, l-sys2, ..., l-sysn
  • For each l-sysi (i = 1, ..., n) in turn, test: cost(h-sys) > cost(l-sysi) + threshold_i
    • YES: select output(l-sysi)
    • NO: go on to the next system
  • If every test answers NO: select output(h-sys)
  • Combine the selected output as the final translation
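
The selection rule can be sketched per sentence as below. This is a reading of the flowchart rather than a confirmed implementation: it assumes the lower-scoring systems are tested in a fixed order, that the first passing test wins, and that decoder costs are comparable across systems. The function name `select_output` and all numbers are hypothetical.

```python
def select_output(h_sys, l_systems):
    """Pick one translation for a sentence.

    h_sys: (output, cost) from the system with the highest BLEU score.
    l_systems: list of (output, cost, threshold) for l-sys1 ... l-sysn.
    ASSUMPTION: systems are tested in list order and the first system
    satisfying cost(h-sys) > cost(l-sys) + threshold is selected.
    """
    h_output, h_cost = h_sys
    for output, cost, threshold in l_systems:
        if h_cost > cost + threshold:
            return output
    # No lower-scoring system beat h-sys by its threshold.
    return h_output

# Toy costs and thresholds, invented for illustration.
chosen = select_output(
    ("translation from h-sys", 10.0),
    [
        ("translation from l-sys1", 9.5, 1.0),  # 10.0 > 10.5 is false
        ("translation from l-sys2", 8.0, 1.0),  # 10.0 > 9.0 is true
    ],
)
```

The per-system thresholds bias the choice toward h-sys: a lower-scoring system is used only when its cost is lower by more than its threshold.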

Impact of System Combination: IWSLT 05 A2E Unrestricted Data Track

[Figure: BLEUr16n4 for the unsegmented, morph analysis, morph segmented, and system combination systems; legend: Reordering Rules]

Impact of System Combination: IWSLT 05 C2E Unrestricted Data Track

[Figure: BLEUr16n4 for the char seg & unreordered, char seg & reorder, word seg & reorder, and system combination systems; legend: Reordering Rules]

IWSLT 2005: Training Corpora for A2E

TM: Number of sentence pairs, LM: Number of words

Conclusions & Future Work
  • Conclusions
    • Robust system performance on
      • Large & small training corpora
      • Various language pairs: A2E, C2E, S2E, E2S
    • System combination & Extended block acquisition algorithm
      • Effective for A2E & C2E translations
  • Future Work: System Combination
    • Extend the technique to models derived by distinct algorithms
    • Refine the algorithm to discriminate effective decoder parameters
    • Apply the technique to TC-Star SLT partner systems

IWSLT 2005: Training Corpora for C2E

TM: Number of sentence pairs, LM: Number of words
