
Microsoft Research Asia Chinese-to-English Translation System: Technical Report for the CWMT2008 Evaluation


Presentation Transcript


  1. Microsoft Research Asia Chinese-to-English Translation System: Technical Report for the CWMT2008 Evaluation • 张冬冬 李志灏 李沐 周明 • Microsoft Research Asia

  2. Outline • Overview • MSRA Submissions • System Description • Experiments • Training Data & Toolkits • Chinese-English Machine Translation • Chinese-English System Combination • Conclusion

  3. Evaluation Task Participation

  4. MSRA Submission • Machine translation task • Primary submission • Unlimited training corpus • Combining: SysA + SysB + SysC + SysD • Contrast submission • Limited training corpus • Combining: SysA + SysB + SysC • System combination task • Limited training corpus • Combining: 10 systems

  5. SysA • Phrase-based model • CYK decoding algorithm • BTG grammar • Features: • Similar to (Koehn, 2004) • Maximum Entropy reordering model • (Zhang et al., 2007; Xiong et al., 2006)
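
To make the maximum-entropy reordering idea concrete, the sketch below scores the two BTG orientations (straight vs. inverted) of a pair of neighbouring blocks from boundary-word features. The feature templates and toy weights are illustrative assumptions, not the system's actual model.

```python
# A minimal sketch of MaxEnt BTG reordering (in the spirit of Xiong et al., 2006):
# choose "straight" vs. "inverted" combination of two neighbouring blocks from
# boundary-word features. Feature names and weights below are toy assumptions.
import math

def boundary_features(left_block, right_block):
    """Boundary-word features of two (src_words, tgt_words) blocks."""
    return [
        "src_left_last=" + left_block[0][-1],
        "src_right_first=" + right_block[0][0],
        "tgt_left_last=" + left_block[1][-1],
        "tgt_right_first=" + right_block[1][0],
    ]

def orientation_probs(feats, weights):
    """Softmax over the two BTG orientations given sparse feature weights."""
    scores = {label: sum(weights.get((label, f), 0.0) for f in feats)
              for label in ("straight", "inverted")}
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}

if __name__ == "__main__":
    left = (["在", "北京"], ["in", "Beijing"])
    right = (["举行", "会议"], ["held", "a", "meeting"])
    toy_weights = {("inverted", "src_left_last=北京"): 0.8,
                   ("straight", "tgt_right_first=held"): 0.3}
    print(orientation_probs(boundary_features(left, right), toy_weights))
```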

  6. SysB • Syntactic pre-reordering model • (Li et al., 2007) • Motivations • Isolating the reordering model from the decoder • Making use of syntactic information
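
The sketch below illustrates the general pre-reordering idea with a simplified, rule-driven stand-in: children of source parse nodes are permuted before the reordered string is handed to the decoder. The tree encoding and the sample 的-construction rule are assumptions for illustration; Li et al. (2007) use a statistical model rather than hand-written permutations.

```python
# A minimal sketch of syntactic pre-reordering: permute the children of source
# parse nodes with label-sequence rules, then emit the reordered source string.
# The tree format and the sample rule are illustrative assumptions.

def reorder(tree, rules):
    """tree = (label, [children]) for internal nodes, or a word (str) for leaves."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [reorder(c, rules) for c in children]
    child_labels = tuple(c[0] if isinstance(c, tuple) else "LEAF" for c in children)
    perm = rules.get((label, child_labels))
    if perm is not None:
        children = [children[i] for i in perm]
    return (label, children)

def yield_words(tree):
    """Read the (possibly reordered) terminal string off the tree."""
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in yield_words(c)]

if __name__ == "__main__":
    # A 的-construction NP: move the head noun before the modifier, English-like order.
    tree = ("NP", [("CP", ["他", "写"]), ("DEC", ["的"]), ("NP", ["书"])])
    rules = {("NP", ("CP", "DEC", "NP")): (2, 1, 0)}
    print(yield_words(reorder(tree, rules)))
```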

  7. SysC • Hierarchical phrase-based model • (Chiang, 2005) • Hiero re-implementation • Weighted synchronous CFG
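
To make the weighted synchronous CFG formalism concrete, the sketch below applies a single Hiero-style rule X → ⟨X1 的 X2, X2 of X1⟩ to translations of its gaps. The rule, its weight, and the toy example are illustrative assumptions rather than entries from SysC's grammar.

```python
# A minimal sketch of a Hiero-style synchronous CFG rule: the source side is
# what the decoder matches during chart parsing, the target side says how the
# gap translations are rearranged. Rule and example are toy assumptions.
RULE = {
    "src": ["X1", "的", "X2"],      # X -> < X1 的 X2 , ... >
    "tgt": ["X2", "of", "X1"],      #      < ... , X2 of X1 >
    "weight": 0.7,                   # rule weight in the log-linear model
}

def apply_rule(rule, gap_translations):
    """Fill the rule's target side with the translations of its gaps."""
    return " ".join(gap_translations.get(tok, tok) for tok in rule["tgt"])

if __name__ == "__main__":
    # 中国 的 经济 -> "the economy of China", once the two gaps are translated
    print(apply_rule(RULE, {"X1": "China", "X2": "the economy"}))
```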

  8. SysD • String-to-dependency MT • (Shen et al., 2008) • Integrating a target dependency language model • Motivations • Target dependency structures integrate linguistic knowledge • Defined directly on lexical items, simpler than a CFG • Captures long-distance relations via local dependency trees
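
The sketch below illustrates how a target dependency language model can score a hypothesis, using a simplified head-to-dependent bigram in place of the richer left/right dependency chains of Shen et al. (2008); the toy probabilities are assumptions for illustration.

```python
# A minimal sketch of a target dependency LM: each dependent word is
# conditioned on its head word. The probabilities below are toy assumptions.
import math

DEP_LM = {   # P(dependent | head), with a small floor for unseen pairs
    ("held", "meeting"): 0.20,
    ("held", "in"): 0.15,
    ("in", "Beijing"): 0.30,
    ("meeting", "a"): 0.40,
}
FLOOR = 1e-4

def dep_lm_logprob(tree):
    """tree = (head_word, [sub-trees]); sum of log P(dependent | head)."""
    head, children = tree
    logp = 0.0
    for child in children:
        logp += math.log(DEP_LM.get((head, child[0]), FLOOR))
        logp += dep_lm_logprob(child)
    return logp

if __name__ == "__main__":
    # "held" heads "meeting" and "in"; "meeting" heads "a"; "in" heads "Beijing"
    hyp = ("held", [("meeting", [("a", [])]), ("in", [("Beijing", [])])])
    print(dep_lm_logprob(hyp))
```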

  9. System Combination • Analogous to BBN's work (Rosti et al., 2007)

  10. System Combination (Cont.) • Adaptations in the MSRA system • Single confusion network • Candidate skeletons come from the top-1 translations of each system • The best skeleton is the candidate most similar to the others, measured by BLEU • Word alignment between the skeleton and the other candidate translations is performed with GIZA++ • Parameters are tuned to maximize BLEU on the development data
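
The sketch below illustrates the skeleton-selection step: among the systems' top-1 outputs, pick the hypothesis with the highest average sentence-level BLEU against the other candidates. The tiny smoothed BLEU (n ≤ 2) is a simplified stand-in for the metric actually used.

```python
# A minimal sketch of skeleton selection for the confusion network.
# The simplified sentence-level BLEU (n <= 2, brevity penalty, epsilon
# smoothing) is an assumption for illustration, not the system's metric.
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def sentence_bleu(hyp, ref, max_n=2, eps=1e-9):
    hyp, ref = hyp.split(), ref.split()
    logp = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        logp += math.log((overlap + eps) / (total + eps)) / max_n
    bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(logp)

def pick_skeleton(candidates):
    """Return the candidate with the highest average BLEU against the others."""
    def avg_bleu(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(sentence_bleu(candidates[i], o) for o in others) / len(others)
    return candidates[max(range(len(candidates)), key=avg_bleu)]

if __name__ == "__main__":
    top1 = ["a meeting was held in beijing",
            "the meeting was held in beijing",
            "beijing held a meeting"]
    print(pick_skeleton(top1))
```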

  11. Outline • Overview • MSRA Submissions • System Description • Experiments • Training Data & Toolkits • Chinese-English Machine Translation • Chinese-English System Combination • Conclusion

  12. Training Data • Primary MT Submission • Contrast MT Submission

  13. Pre-/Post-processing • Pre-processing • Tokenization of Chinese and English sentences • Performed before word alignment and language model training • Special tokens (date, time and number) are recognized and normalized in the training data • Special tokens in the test data are pre-translated with rules before decoding • Post-processing • English case restoration after translation • OOVs are removed from the final translation
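
The sketch below illustrates the special-token step: dates, times and numbers are recognized with regular expressions and replaced by placeholders, with the originals remembered so rules can translate or restore them after decoding. The patterns and placeholder names are assumptions for illustration, not the system's actual rules.

```python
# A minimal sketch of special-token normalization: recognize dates / times /
# numbers with regexes, replace them with placeholder tokens, and remember the
# original spans for rule-based translation later. Patterns are toy assumptions.
import re

PATTERNS = [
    ("$date",   re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")),
    ("$time",   re.compile(r"\d{1,2}[点:]\d{2}分?")),
    ("$number", re.compile(r"\d+(?:\.\d+)?")),
]

def normalize(sentence):
    """Return (normalized sentence, list of (placeholder, original span))."""
    memo = []
    for tag, pat in PATTERNS:
        def repl(m, tag=tag):
            memo.append((tag, m.group(0)))
            return tag
        sentence = pat.sub(repl, sentence)
    return sentence, memo

if __name__ == "__main__":
    s = "会议于2008年11月27日9点30分开始，历时3小时"
    print(normalize(s))
```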

  14. Tools • MSR-SEG • MSRA word segmentation tool, used to segment the Chinese sentences in the parallel data • Berkeley parser • Parses both training and test sentences for the syntactic pre-reordering based system • GIZA++ • Used for bilingual word alignment • MaxEnt Toolkit • Reordering model (Le Zhang, 2004) • MSRA internal tools • Language modeling • Decoders • Case restoration for English words • System combination
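
As a small illustration of consuming word alignments, the sketch below parses the common one-pair-per-token "srcIdx-tgtIdx" text format that is typically obtained after symmetrizing GIZA++ output; the format is an assumption for illustration, not a description of MSRA's internal pipeline.

```python
# A minimal sketch of parsing word alignments in the widely used
# "srcIdx-tgtIdx" pair format (one aligned sentence pair per line).
# The format assumption is illustrative only.

def parse_alignment_line(line):
    """Parse one line of "srcIdx-tgtIdx" pairs into (source, target) index links."""
    links = []
    for pair in line.split():
        s, t = pair.split("-")
        links.append((int(s), int(t)))
    return links

if __name__ == "__main__":
    # e.g. source word 0 aligned to target word 0, source 1 to target 2, ...
    print(parse_alignment_line("0-0 1-2 2-1 3-3"))
```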

  15. Experiments for MT Task

  16. Experiments for System Combination (unlimited LM)

  17. Conclusions • Syntactic information improves SMT • Syntactic pre-reordering model • Target dependency model • The limited LM hurts system combination • Combination output is worse than with the unlimited LM when only the limited LM is used

  18. Thanks!

  19. MSRA Systems • SysA: • Phrase-based translation model over contiguous phrases • SysB: • SysA + multiple pre-reordered source-language inputs • SysC: • Hierarchical phrase-based translation model • SysD: • String-to-target-dependency-tree translation model

  20. SysB • Syntactic pre-reordering model • (Li et al., 2007) • Motivations • Isolating the reordering model from the decoder • Making use of syntactic parse information
