IBM T. J. Watson Research Center
IBM Statistical Machine Translation for Spoken Languages
Young-Suk Lee
IWSLT 2005, October 24−25, 2005
© 2005 IBM Corporation
Outline
• Baseline Phrase Translation System
  • Block Acquisition
  • Decoding
• Performance Enhancing Techniques
  • Extended Block Acquisition Algorithm
  • System Combination
• IWSLT 2005 Evaluations
• Conclusions & Future Work
Baseline System: Block Acquisition
[Figure: word alignment grid over the words e1 ... e6 and f1 ... f3]
• Block (b): a phrase translation pair consisting of a source phrase and a target phrase
Decoding I
• Phrase translation models
  • Direct model:
  • Source channel model:
  • Block unigram model:
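The formulas for the three phrase translation models did not survive on this slide. As a rough illustration only, the sketch below assumes plain relative-frequency estimates over the extracted blocks: the direct model as count(block) / count(source phrase), the source channel model as count(block) / count(target phrase), and the block unigram model as count(block) / total blocks. The function name `estimate_block_models` and the toy blocks are hypothetical and not taken from the talk.

```python
from collections import Counter


def estimate_block_models(blocks):
    """Relative-frequency estimates over (source_phrase, target_phrase) blocks.

    Assumption (not from the slides): each model is a simple count ratio.
    Returns three dicts keyed by block: direct, source_channel, unigram.
    """
    block_counts = Counter(blocks)
    source_counts = Counter(src for src, _ in blocks)
    target_counts = Counter(tgt for _, tgt in blocks)
    total = len(blocks)

    # Direct model: probability of the target phrase given the source phrase.
    direct = {b: c / source_counts[b[0]] for b, c in block_counts.items()}
    # Source channel model: probability of the source phrase given the target phrase.
    source_channel = {b: c / target_counts[b[1]] for b, c in block_counts.items()}
    # Block unigram model: probability of the block as a unit.
    unigram = {b: c / total for b, c in block_counts.items()}
    return direct, source_channel, unigram


# Toy blocks, invented for illustration.
toy_blocks = [("lA", "no"), ("lA", "not"), ("lA", "not"), ("Aryd", "want")]
direct, source_channel, unigram = estimate_block_models(toy_blocks)
print(source_channel[("lA", "not")])  # 1.0
print(unigram[("Aryd", "want")])      # 0.25
```

Real systems typically smooth or threshold these counts; the sketch leaves that out.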
Decoding II
• IBM Model 1 cost per phrase in both directions
• Word trigram language model
• Word-level distortion models applied to blocks
• Word count penalty
• Block count penalty
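The slide lists the cost components but not how the decoder combines them. A common choice in phrase-based decoders, assumed here rather than confirmed by the talk, is a weighted sum of per-component costs. The feature names, weights, and the function `hypothesis_cost` below are hypothetical.

```python
def hypothesis_cost(features, weights):
    """Weighted sum of component costs for one translation hypothesis.

    Assumption (not from the slides): costs combine linearly, and every
    weighted feature must be present in `features`.
    """
    missing = set(weights) - set(features)
    if missing:
        raise KeyError(f"missing feature costs: {sorted(missing)}")
    return sum(weights[name] * features[name] for name in weights)


# Hypothetical component costs for one hypothesis (lower is better).
features = {
    "model1_source_to_target": 4.0,
    "model1_target_to_source": 3.0,
    "trigram_lm": 6.0,
    "distortion": 1.0,
    "word_count": 5.0,
    "block_count": 2.0,
}
# Hypothetical weights; in practice these would be tuned on a dev set.
weights = {
    "model1_source_to_target": 0.5,
    "model1_target_to_source": 0.5,
    "trigram_lm": 1.0,
    "distortion": 0.5,
    "word_count": 0.0,
    "block_count": 1.0,
}
print(hypothesis_cost(features, weights))  # 12.0
```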
Extended Block Acquisition
Arabic: lA Aryd AzAlthA (لا أريد إزالتها)
English: I do n't want it extracted
Extended Block Acquisition Algorithm
[Figure: alignment of "I don't want it extracted" with lA Aryd AzAlthA (لا أريد إزالتها)]
• Expansion word list: a list of target words typically aligned to null source words (e.g. I, do, it)
• Extend the target phrase to include an expansion word if it occurs in the neighborhood of a seed block
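The extension step above can be sketched as follows. The slide does not define "neighborhood", so this sketch assumes it means target words directly adjacent to the seed block's target span, absorbed one at a time up to a hypothetical limit `max_extend` on each side. The function name and the seed span used in the example are illustrative assumptions, not the talk's implementation.

```python
def extend_target_span(target_words, start, end, expansion_words, max_extend=2):
    """Grow the target span [start, end) of a seed block over adjacent
    expansion words.

    Assumptions (not from the slides): "neighborhood" means contiguous
    adjacent words, and at most `max_extend` words are added per side.
    `expansion_words` is a set of lowercased words.
    """
    new_start = start
    while (new_start > 0
           and start - new_start < max_extend
           and target_words[new_start - 1].lower() in expansion_words):
        new_start -= 1

    new_end = end
    while (new_end < len(target_words)
           and new_end - end < max_extend
           and target_words[new_end].lower() in expansion_words):
        new_end += 1

    return new_start, new_end


target = "I do n't want it extracted".split()
expansion_words = {"i", "do", "it"}  # the examples given on the slide

# Hypothetical seed block whose target side is "n't want" (positions 2..3).
new_start, new_end = extend_target_span(target, 2, 4, expansion_words)
print(" ".join(target[new_start:new_end]))  # I do n't want it
```

Under these assumptions the seed target phrase grows to cover the unaligned "I", "do", and "it", while "extracted" stays outside because it is not an expansion word.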
Impact of Extended Block Acquisition: A2E
[Chart: BLEUr16n4 on the CSTAR 03 Dev Set and the IWSLT 04 Dev Set; labels: EXTENDED, Reordering Rules]
Impact of Extended Block Acquisition: C2E
[Chart: BLEUr16n4 on the CSTAR 03 Dev Set and the IWSLT 04 Dev Set; labels: EXTENDED, Reordering Rules]
System Combination: Recipe
• Phrase Lexicon 1 → SYSTEM 1 → translate
• Phrase Lexicon 2 → SYSTEM 2 → translate
• Phrase Lexicon 3 → SYSTEM 3 → translate
• Algorithm: Select the Best
Arabic-to-English Phrase Lexicons
• llmEArDp 'of the opposition' → l# Al# EArD +p → l# Al# EArDp
• lA Aryd AzAlthA → lA A# ryd AzAl +t +hA → lA Aryd AzAlt +hA
[Table row: OOV Ratio; values not recoverable]
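The OOV ratio compared across these lexicons can be computed as sketched below. The slide does not say whether the ratio is counted over tokens or over word types, so this sketch assumes a token-level ratio: the fraction of test tokens never seen in training. The function name and the tiny token lists are illustrative only.

```python
def oov_ratio(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens.

    Assumption (not from the slides): the ratio is token-level, not type-level.
    """
    if not test_tokens:
        return 0.0
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)


# Toy data built from the segmented forms shown on the slide.
train_tokens = "l# Al# EArDp lA Aryd".split()
test_tokens = "lA Aryd AzAlt +hA".split()
print(oov_ratio(train_tokens, test_tokens))  # 0.5
```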
System Combination Algorithm
• h-sys (system producing the highest BLEU score) vs. l-sys1, l-sys2, ..., l-sysn
  • cost(h-sys) > cost(l-sys1) + threshold_1? YES: output(l-sys1). NO: test the next system
  • ...
  • cost(h-sys) > cost(l-sysn) + threshold_n? YES: output(l-sysn). NO: output(h-sys)
• Combine the selected output as the final translation
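The decision chain above can be sketched as a short selection function. The sketch assumes the test is applied per input sentence, that the lower-scoring systems are checked in a fixed order, and that the first system passing its threshold test wins; none of these details are stated on the slide. The function name and example costs are hypothetical.

```python
def select_translation(h_output, h_cost, low_systems):
    """Pick one translation for a sentence.

    `low_systems` is an ordered list of (output, cost, threshold) tuples for
    l-sys1 ... l-sysn. Assumption (not from the slides): the first system
    whose test succeeds is selected; otherwise h-sys is kept.
    """
    for output, cost, threshold in low_systems:
        # Prefer the lower-BLEU system only when h-sys is costlier by more
        # than that system's threshold.
        if h_cost > cost + threshold:
            return output
    return h_output


# Hypothetical costs and thresholds for one sentence.
low_systems = [
    ("translation from l-sys1", 9.5, 1.0),  # 10.0 > 10.5 is false
    ("translation from l-sys2", 7.0, 2.0),  # 10.0 > 9.0 is true
]
print(select_translation("translation from h-sys", 10.0, low_systems))
# translation from l-sys2
```

Running this over every sentence and concatenating the selected outputs gives the combined translation described in the last bullet.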
Impact of System Combination: IWSLT 05 A2E Unrestricted Data Track
[Chart: BLEUr16n4; labels: system combination, morph segmented, morph analysis, unsegmented, Reordering Rules]
Impact of System Combination: IWSLT 05 C2E Unrestricted Data Track
[Chart: BLEUr16n4; labels: char seg & unreordered, system combination, word seg & reorder, char seg & reorder, Reordering Rules]
IWSLT 2005: Training Corpora for A2E
[Table; TM: number of sentence pairs, LM: number of words]
IWSLT 2005: IBM System Performances
Conclusions & Future Work
• Conclusions
  • Robust system performances on
    • Large & small training corpora
    • Various language pairs: A2E, C2E, S2E, E2S
  • System combination & Extended block acquisition algorithm
    • Effective for A2E & C2E translations
• Future Work: System Combination
  • Extend the technique to models derived by distinct algorithms
  • Refine the algorithm to discriminate effective decoder parameters
  • Apply the technique to TC-Star SLT partner systems
IWSLT 2005: Training Corpora for C2E
[Table; TM: number of sentence pairs, LM: number of words]