IBM Statistical Machine Translation for Spoken Languages

Presentation Transcript

  1. IBM T. J. Watson Research Center
     IBM Statistical Machine Translation for Spoken Languages
     Young-Suk Lee
     IWSLT 2005, October 24-25, 2005
     © 2005 IBM Corporation

  2. Outline
     • Baseline Phrase Translation System
       • Block Acquisition
       • Decoding
     • Performance Enhancing Techniques
       • Extended Block Acquisition Algorithm
       • System Combination
     • IWSLT 2005 Evaluations
     • Conclusions & Future Work

  3. Baseline System: Block Acquisition
     • Block (b): a phrase translation pair consisting of a source phrase and a target phrase
     [Figure: alignment diagram over the words e1 e2 e3 e4 e5 e6 and f1 f2 f3]

  4. Decoding I
     • Phrase translation models
       • Direct model:
       • Source channel model:
       • Block unigram model:
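The formulas for the three phrase translation models are not preserved in this transcript. Purely as an illustration of what models with these names usually look like, the sketch below estimates each one by relative frequency over a list of extracted blocks. The estimation formulas, the function name `estimate_block_models`, and the toy data are all assumptions, not the definitions shown on the slide.

```python
from collections import Counter

def estimate_block_models(blocks):
    """Relative-frequency estimates over a list of (source, target) phrase pairs.

    Assumed forms (not taken from the slide):
      direct model:         p(target | source) = count(s, t) / count(s)
      source-channel model: p(source | target) = count(s, t) / count(t)
      block unigram model:  p(block)           = count(s, t) / total blocks
    """
    pair_counts = Counter(blocks)
    source_counts = Counter(s for s, _ in blocks)
    target_counts = Counter(t for _, t in blocks)
    total = len(blocks)

    models = {}
    for (s, t), c in pair_counts.items():
        models[(s, t)] = {
            "direct": c / source_counts[s],
            "source_channel": c / target_counts[t],
            "block_unigram": c / total,
        }
    return models

# Toy block occurrences (made-up data for illustration only).
toy_blocks = [
    ("lA Aryd", "I do n't want"),
    ("lA Aryd", "I do n't want"),
    ("lA Aryd", "do n't want"),
    ("AzAlthA", "it extracted"),
]
models = estimate_block_models(toy_blocks)
```

In this toy data the block ("lA Aryd", "I do n't want") accounts for two of the three occurrences of its source phrase, every occurrence of its target phrase, and half of all block occurrences, which is what the three estimates report.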

  5. Decoding II
     • IBM Model 1 cost per phrase in both directions
     • Word trigram language model
     • Word-level distortion models applied to blocks
     • Word count penalty
     • Block count penalty
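The slide lists the component scores but not how they are combined. A common arrangement in phrase-based decoders is a weighted sum of per-feature costs, with the lowest total cost winning. The sketch below assumes that arrangement; the feature names, weights, and cost values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def hypothesis_cost(features, weights):
    """Weighted sum of component costs for one translation hypothesis.

    `features` and `weights` are dicts keyed by feature name; lower is better.
    A feature without a weight raises KeyError rather than being ignored.
    """
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature names mirroring the slide's list; all numbers are made up.
weights = {
    "model1_src2tgt": 1.0,
    "model1_tgt2src": 1.0,
    "trigram_lm": 1.0,
    "distortion": 0.5,
    "word_count": 0.5,
    "block_count": 2.0,
}
hyp_a = {
    "model1_src2tgt": 2.0, "model1_tgt2src": 3.0, "trigram_lm": 4.0,
    "distortion": 1.0, "word_count": 6.0, "block_count": 2.0,
}
hyp_b = {
    "model1_src2tgt": 2.5, "model1_tgt2src": 3.0, "trigram_lm": 3.0,
    "distortion": 0.0, "word_count": 5.0, "block_count": 3.0,
}

cost_a = hypothesis_cost(hyp_a, weights)
cost_b = hypothesis_cost(hyp_b, weights)
best = min((cost_a, "hyp_a"), (cost_b, "hyp_b"))[1]
```

With these made-up numbers, `hyp_b` has the better language model and distortion costs but uses one more block, and the block count penalty tips the decision toward `hyp_a`.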

  6. Extended Block Acquisition
     Arabic: lA Aryd AzAlthA لا أريد إزالتها
     English: I do n't want it extracted

  7. Extended Block Acquisition Algorithm
     English: I don't want it extracted
     Arabic: lA Aryd AzAlthA لا أريد إزالتها
     • Expansion word list: a list of target words typically aligned to null source words (e.g. I, do, it)
     • Extend the target phrase to include an expansion word if it occurs in the neighborhood of a seed block
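The extension step can be sketched roughly as follows. The slide does not define "neighborhood" precisely, so this sketch assumes it means words immediately adjacent to the seed block's target span, within a small window (`max_extra`). The function name, the window size, and the seed block spans are hypothetical; only the sentence and the expansion words come from the slides.

```python
def extend_target_span(target_tokens, start, end, expansion_words, max_extra=2):
    """Grow the target span [start, end] over adjacent expansion words.

    Adds at most `max_extra` words on each side, stopping at the first
    neighbor that is not in the expansion list.
    """
    new_start = start
    while (new_start > 0
           and start - new_start < max_extra
           and target_tokens[new_start - 1] in expansion_words):
        new_start -= 1

    new_end = end
    while (new_end < len(target_tokens) - 1
           and new_end - end < max_extra
           and target_tokens[new_end + 1] in expansion_words):
        new_end += 1

    return new_start, new_end

target = "I do n't want it extracted".split()
expansion_words = {"I", "do", "it"}

# Hypothetical seed block: target "n't want" (tokens 2..3).
s1, e1 = extend_target_span(target, 2, 3, expansion_words)
extended_1 = " ".join(target[s1:e1 + 1])

# Hypothetical seed block: target "extracted" (token 5).
s2, e2 = extend_target_span(target, 5, 5, expansion_words)
extended_2 = " ".join(target[s2:e2 + 1])
```

Under these assumptions the first seed grows to "I do n't want it" and the second to "it extracted", so target words that would otherwise align to nothing on the source side end up inside a block.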

  8. Impact of Extended Block Acquisition: A2E
     [Chart: BLEUr16n4; panels: CSTAR 03 Dev Set, IWSLT 04 Dev Set; labels: EXTENDED, Reordering Rules]

  9. Impact of Extended Block Acquisition: C2E
     [Chart: BLEUr16n4; panels: CSTAR 03 Dev Set, IWSLT 04 Dev Set; labels: EXTENDED, Reordering Rules]

  10. System Combination: Recipe
      • Phrase Lexicon 1 → SYSTEM 1 → translate
      • Phrase Lexicon 2 → SYSTEM 2 → translate
      • Phrase Lexicon 3 → SYSTEM 3 → translate
      • Algorithm: Select the Best

  11. Arabic-to-English Phrase Lexicons
      • llmEArDp 'of the opposition' → l# Al# EArD +p → l# Al# EArDp
      • lA Aryd AzAlthA → lA A# ryd AzAl +t +hA → lA Aryd AzAlt +hA
      • OOV Ratio

  12. System Combination Algorithm
      • h-sys (system producing the highest BLEU score) vs. l-sys1, l-sys2, ..., l-sysn
        • cost(h-sys) > cost(l-sys1) + threshold_1? YES: output(l-sys1); NO: go to the next test
        • ...
        • cost(h-sys) > cost(l-sysn) + threshold_n? YES: output(l-sysn); NO: output(h-sys)
      • Combine the selected output as the final translation
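One way to read the flowchart on this slide is as a cascade applied to each sentence: the lower-scoring systems are tested in order, and the first one whose cost plus its threshold is still below the h-sys cost supplies the output. The sketch below follows that reading. The order of the tests, the first-match rule, the function name, and all cost and threshold values are assumptions.

```python
def select_translation(h_output, h_cost, low_systems):
    """Pick one output for a sentence using the cascade of cost comparisons.

    `low_systems` is a list of (output, cost, threshold) tuples for
    l-sys1 .. l-sysn, checked in order. The first system that satisfies
    cost(h-sys) > cost(l-sys) + threshold wins; otherwise h-sys is kept.
    """
    for output, cost, threshold in low_systems:
        if h_cost > cost + threshold:
            return output
    return h_output

# Made-up costs and thresholds for two lower-scoring systems.
low_systems = [
    ("output of l-sys1", 11.5, 1.0),
    ("output of l-sys2", 9.0, 2.0),
]

# h-sys is clearly worse than l-sys2 on this sentence, so l-sys2 is chosen.
choice_1 = select_translation("output of h-sys", 12.0, low_systems)

# No lower-scoring system beats h-sys by its threshold, so h-sys is kept.
choice_2 = select_translation("output of h-sys", 10.0, low_systems)
```

The thresholds bias the choice toward h-sys: a lower-scoring system replaces it only when its decoder cost is better by a clear margin.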

  13. Impact of System Combination: IWSLT 05 A2E Unrestricted Data Track
      [Chart: BLEUr16n4; labels: system combination, morph segmented, morph analysis, unsegmented, Reordering Rules]

  14. Impact of System Combination: IWSLT 05 C2E Unrestricted Data Track
      [Chart: BLEUr16n4; labels: char seg & unreordered, system combination, word seg & reorder, char seg & reorder, Reordering Rules]

  15. IWSLT 2005: Training Corpora for A2E (TM: number of sentence pairs; LM: number of words)

  16. IWSLT 2005: IBM System Performances

  17. Conclusions & Future Work
      • Conclusions
        • Robust system performances on
          • Large & small training corpora
          • Various language pairs: A2E, C2E, S2E, E2S
        • System combination & Extended block acquisition algorithm
          • Effective for A2E & C2E translations
      • Future Work: System Combination
        • Extend the technique to models derived by distinct algorithms
        • Refine the algorithm to discriminate effective decoder parameters
        • Apply the technique to TC-Star SLT partner systems

  18. IWSLT 2005: Training Corpora for C2E (TM: number of sentence pairs; LM: number of words)