

Presentation Transcript


  1. A Hybrid Approach for Bengali to Hindi Machine Translation Sanjay Chatterji, Devshri Roy, Sudeshna Sarkar, Anupam Basu CSE, IIT Kharagpur

  2. Contents • Abstract and Motivation • Rule Based and Statistical Machine Translation • Hybrid System • System Architecture • Phrase table enhancement using lexical resources • Suffix, Infix and Pattern based postprocessing • Experiments with Example Sentence • Evaluation • Conclusion • References

  3. Abstract and Previous Work • MT translates text from one natural language (such as Bengali) into another (such as Hindi) – The meaning must be preserved • Current MT software allows customization by domain – Limiting the scope improves output quality • History: • 1946: A. D. Booth proposed using digital computers for the translation of natural languages. • 1954: The Georgetown experiment machine-translated about 60 Russian sentences into English. Its authors claimed that within 3-5 years MT would be a solved problem. • 1966: The ALPAC report concluded that ten years of research had failed to fulfill expectations. • Translation challenges: decoding the meaning of the source text, then re-encoding that meaning in the target language

  4. Rule Based and Statistical MT • Rule-based MT • Relies on a large set of built-in linguistic rules and dictionaries • Good out-of-domain quality; predictable output • Lacks fluency; development is long and costly • Bengali-Hindi: 2-year, 5-person effort – BLEU score 0.0424 • Statistical MT • Uses statistical models trained on bilingual corpora • Gives good quality when large, high-quality corpora are available • Poor on out-of-domain text • Fluent output; cheaper to build • Bengali-Hindi: 2-month, 2-person effort – BLEU score 0.1745

  5. Hybrid System • There is a clear need for a third approach through which • Users get better translation quality and high performance (rule-based strength) • With less investment of cost and time (statistical strength) • Bengali-Hindi: BLEU score 0.2318

  6. System Architecture

  7. Feeding dictionary into SMT • Lexical entries from the transfer-based system (tourism domain) are used to increase word alignments in the SMT system (news domain) • The dictionary contains only words, not phrases • The dictionary is from another domain
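The slide does not spell out how the dictionary is fed in; a common way to do this in phrase-based SMT is to append each bilingual entry to the parallel training data as a one-word pseudo-sentence pair, so the word aligner sees extra evidence. The sketch below assumes that setup and a tab-separated dictionary file; file names and format are illustrative, not from the paper.

```python
# Sketch (assumed approach): append each Bengali-Hindi dictionary
# entry as a one-line pseudo-sentence pair to the parallel corpus,
# so GIZA++-style word alignment gets extra anchor points.

def augment_corpus(dict_path, src_corpus, tgt_corpus):
    """Append each tab-separated Bengali\tHindi pair as an extra
    parallel 'sentence' at the end of the two corpus files."""
    with open(dict_path, encoding="utf-8") as d, \
         open(src_corpus, "a", encoding="utf-8") as src, \
         open(tgt_corpus, "a", encoding="utf-8") as tgt:
        for line in d:
            bengali, hindi = line.rstrip("\n").split("\t")
            src.write(bengali + "\n")
            tgt.write(hindi + "\n")
```

Because the entries are single words from another domain, they mainly help alignment coverage rather than phrase fluency, which matches the caveats on this slide.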

  8. Postprocessing by suffix list • Suffix list (1,000 entries) • Monolingual corpora of the same size for the source and target languages (500K words each) • Some suffixes occur more than 1,000 times in the Bengali corpus and zero times in the Hindi corpus • Other suffixes occur more than 5,000 times in the Bengali corpus, with more than 99% of their occurrences in the combined corpus coming from the Bengali side
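The two frequency criteria above can be sketched directly as a filter over suffix counts. This is a minimal reading of the slide, assuming the inputs are simple suffix-to-frequency maps computed from the two monolingual corpora.

```python
# Sketch of the two selection criteria on this slide; bn_counts and
# hi_counts map suffix -> frequency in the respective corpus.

def select_suffixes(bn_counts, hi_counts):
    selected = set()
    for suffix, bn in bn_counts.items():
        hi = hi_counts.get(suffix, 0)
        # Criterion 1: frequent in Bengali, absent from Hindi.
        if bn > 1000 and hi == 0:
            selected.add(suffix)
        # Criterion 2: very frequent in Bengali, and >99% of the
        # combined occurrences come from the Bengali corpus.
        elif bn > 5000 and bn / (bn + hi) > 0.99:
            selected.add(suffix)
    return selected
```

A suffix that survives either test is almost certainly an untranslated Bengali inflection when it appears in the Hindi output, which is what makes it a safe postprocessing target.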

  9. Suffix list

  10. Infix based postprocessing • Multiple suffixes can be attached to a word, and they are stacked • chhelegulike = chhele + guli + ke • An infix in Bengali is translated to the corresponding infix in Hindi
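Splitting a stacked form like chhelegulike can be sketched as greedily peeling known suffixes from the right until a known stem remains. The suffix inventory and the stem set below are illustrative assumptions, not the paper's actual lists.

```python
# Sketch: peel stacked Bengali suffixes off the right end of a word.
# SUFFIXES is an assumed toy inventory, not the paper's 1,000-entry list.

SUFFIXES = {"guli", "ke", "ra", "der"}

def split_suffixes(word, stems):
    """Greedily strip known suffixes (longest first) until a known
    stem remains, returning [stem, suffix1, suffix2, ...]."""
    parts = []
    while word not in stems:
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                parts.insert(0, suf)
                word = word[: -len(suf)]
                break
        else:
            break  # no known suffix matches; give up
    return [word] + parts
```

Once the stack is recovered, each element before the final suffix is an "infix" in the slide's terminology and can be mapped to its Hindi counterpart.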

  11. Pattern based postprocessing • After suffix- and infix-based postprocessing, the output is further inspected for known error patterns • A "te" or "ke" suffix attached after 5 or more (transliterated) characters is very rare in Hindi
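The rarity rule above suggests a regex-style fix: detach a trailing "te" or "ke" from any sufficiently long word. The sketch below assumes the detached suffix is rewritten to a Hindi postposition; the specific mapping (ke → ko, te → me.N) is an illustrative guess, not the paper's rule table.

```python
import re

# Sketch of the error-pattern fix: an attached "ke"/"te" after a word
# of 5+ characters is detached. The replacement strings below are
# assumptions for illustration only.

FIXES = {"ke": " ko", "te": " me.N"}

def fix_patterns(text):
    def repl(m):
        return m.group(1) + FIXES[m.group(2)]
    # 5+ non-space characters followed directly by "te" or "ke"
    # at a word boundary
    return re.sub(r"(\S{5,}?)(te|ke)\b", repl, text)
```

Short Hindi words that legitimately end in "ke" or "te" are untouched because of the length threshold, which is exactly why the slide restricts the pattern to long words.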

  12. Experiment • Resources: Training corpus (12K sentences) of EMILLE-CIIL; Development corpus (1K sentences) of EMILLE-CIIL; Test corpus (100 sentences) of EMILLE-CIIL; Suffix list: 1,000 Bengali linguistic suffixes; Dictionary: 15,000 parallel synsets from ILMT-DIT; Gazetteer list: 50K parallel names from ILMT-DIT; Monolingual corpora: 500K words each from the source and target languages • Tools: GIZA++, Moses, MERT, Pharaoh

  13. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulagulike pariskArabhAbe batA diYA hai kI mAtApitAderake eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  14. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulagulike pariskArabhAbe batA diYA hai kI mAtApitAderake eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  15. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulagulike sApha tarahA se batA diYA hai kI mAtApitAderake eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  16. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulao.Nke sApha tarahA se batA diYA hai kI mAtApitAo.Nke eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  17. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
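The whole postprocessing chain on the example sentence can be summarized as a small chain of string rewrites. The replacement table below is hand-built from the before/after strings on these slides, purely to make the transformation concrete; it is not the paper's actual rule set.

```python
# Rough sketch: the net effect of suffix, infix and pattern
# postprocessing on the example SMT output above. The table is
# reverse-engineered from the slides, not taken from the paper.

REPLACEMENTS = [
    ("pariskArabhAbe", "sApha tarahA se"),  # untranslated Bengali word
    ("gulike", "o.N ko"),                   # suffix + infix + pattern steps
    ("derake", "o.N ko"),                   # suffix + infix + pattern steps
]

def postprocess(sentence):
    for old, new in REPLACEMENTS:
        sentence = sentence.replace(old, new)
    return sentence
```

Applied to the SMT output of slide 13, this yields exactly the postprocessing output shown on slide 17.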

  18. Evaluation

  19. BLEU • An automatic, inexpensive, quick and language-independent evaluation method • The closer a machine translation output is to a professional human reference translation, the better its BLEU score • A source word can be translated into different word choices • The candidate translation selects one of them • It may not match the word choice of the reference translation
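The word-matching behaviour described above comes from BLEU's modified n-gram precision: candidate n-gram counts are clipped by their counts in the reference, so a synonym the reference did not use scores zero. A minimal sketch of that precision term (without the brevity penalty or geometric mean over n) is:

```python
# Minimal sketch of BLEU's modified (clipped) n-gram precision.
# Full BLEU also combines n = 1..4 and applies a brevity penalty.
from collections import Counter

def modified_precision(candidate, reference, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split())
    ref = ngrams(reference.split())
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

The clipping is why a candidate that picks a valid but different word choice than the reference is penalized, which is the weakness the next slide's concept-based modification addresses.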

  20. BLEU Cont. • Candidate translation vs. reference translation: BLEU 0.2275 • With a monolingual concept dictionary: modified BLEU 0.2318 • The BLEU score improves when matching considers concepts rather than surface words

  21. Conclusion • Targeted at postprocessing the inflected words which remain unchanged after translation • Words which are wrongly translated are not considered • A morphological analyzer/generator may be useful • Using the dictionary decreases the fluency level

  22. References
W. S. Bennett, J. Slocum. 1985. The LRC Machine Translation System. Computational Linguistics, 11(2-3): 111-121.
P. F. Brown, S. D. Pietra, V. J. D. Pietra, R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2): 263-312.
A. Eisele, C. Federmann, H. Uszkoreit, H. Saint-Amand, M. Kay, M. Jellinghaus, S. Hunsicker, T. Herrmann, Y. Chen. 2008. Hybrid Machine Translation Architectures within and beyond the EuroMatrix project. In Proceedings of the European Machine Translation Conference, pp. 27-34.
Ethnologue: Languages of the World, 16th edition, edited by M. Paul Lewis, 2009.
P. Isabelle, C. Goutte, M. Simard. 2007. Domain adaptation of MT systems through automatic post-editing. In Proceedings of MT Summit XI, pp. 255-261, Copenhagen, Denmark.
P. Koehn, F. J. Och, D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL-HLT, pp. 48-54, Edmonton, Canada.
P. Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA-2004.
F. J. Och, H. Ney. 2000. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the ACL, pp. 440-447.
F. J. Och, H. Ney. 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30(4): 417-449.
K. Papineni, S. Roukos, T. Ward, W. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 311-318.
A. Ushioda. 2007. Phrase Alignment for Integration of SMT and RBMT Resources. In MT Summit XI Workshop on Patent Translation.
H. Wu, H. Wang. 2004. Improving Statistical Word Alignment with a Rule-Based Machine Translation System. In Proceedings of COLING, pp. 29-35.

  23. Thank You
