

Presentation Transcript


  1. A Hybrid Approach for Bengali to Hindi Machine Translation Sanjay Chatterji, Devshri Roy, Sudeshna Sarkar, Anupam Basu CSE, IIT Kharagpur

  2. Contents • Abstract and Motivation • Rule Based and Statistical Machine Translation • Hybrid System • System Architecture • Phrase table enhancement using lexical resources • Suffix, Infix and Pattern based postprocessing • Experiments with Example Sentence • Evaluation • Conclusion • References

  3. Abstract and Previous Work • MT translates text from one natural language (such as Bengali) into another (such as Hindi) – The meaning must be preserved • Current MT software allows customization by domain – Limiting the scope improves output quality • History: • 1946: A. D. Booth proposed using digital computers for the translation of natural languages. • 1954: The Georgetown experiment machine-translated about 60 Russian sentences into English. Its authors claimed that within 3-5 years MT would be a solved problem. • 1966: The ALPAC report concluded that ten years of research had failed to fulfill expectations. • Translation challenges: decoding the meaning of the source text, then re-encoding that meaning in the target language

  4. Rule Based and Statistical MT • Rule-based MT • Relies on a large set of built-in linguistic rules and dictionaries • Good out-of-domain quality; predictable output • Lacks fluency; development is long and costly • Bengali-Hindi: 2-year, 5-person effort – BLEU score 0.0424 • Statistical MT • Uses statistical models trained on bilingual corpora • Gives good quality when large, high-quality corpora are available • Poor on out-of-domain text • Fluent output; cheaper to build • Bengali-Hindi: 2-month, 2-person effort – BLEU score 0.1745

  5. Hybrid System • There is a clear need for a third approach through which • Users get better translation quality and high performance (rule-based strength) • With less investment of cost and time (statistical strength) • Bengali-Hindi: BLEU score 0.2318

  6. System Architecture

  7. Feeding dictionary into SMT • Lexical entries from the transfer-based system (tourism domain) are used to increase word alignments in the SMT system (news domain) • The dictionary contains only words, not phrases • The dictionary is from another domain
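The slide does not spell out how the dictionary is fed in; a common way to do this in phrase-based SMT is to append each bilingual entry to the parallel training data as a one-word pseudo-sentence pair, so the word aligner sees extra evidence. The sketch below assumes that setup and a tab-separated dictionary file; file names and format are illustrative, not from the paper.

```python
# Sketch (assumed approach): append each Bengali-Hindi dictionary
# entry as a one-line pseudo-sentence pair to the parallel corpus,
# so GIZA++-style word alignment gets extra anchor points.

def augment_corpus(dict_path, src_corpus, tgt_corpus):
    """Append each tab-separated Bengali\tHindi pair as an extra
    parallel 'sentence' at the end of the two corpus files."""
    with open(dict_path, encoding="utf-8") as d, \
         open(src_corpus, "a", encoding="utf-8") as src, \
         open(tgt_corpus, "a", encoding="utf-8") as tgt:
        for line in d:
            bengali, hindi = line.rstrip("\n").split("\t")
            src.write(bengali + "\n")
            tgt.write(hindi + "\n")
```

Because the entries are single words from another domain, they mainly help alignment coverage rather than phrase fluency, which matches the caveats on this slide.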

  8. Postprocessing by suffix list • Suffix list (1,000 entries) • Monolingual corpora of the same size for the source and target languages (500K words each) • Some suffixes occur more than 1,000 times in the Bengali corpus and zero times in the Hindi corpus • Other suffixes occur more than 5,000 times in the Bengali corpus, with more than 99% of their occurrences in the combined corpus coming from the Bengali side
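The two frequency criteria above can be sketched directly as a filter over suffix counts. This is a minimal reading of the slide, assuming the inputs are simple suffix-to-frequency maps computed from the two monolingual corpora.

```python
# Sketch of the two selection criteria on this slide; bn_counts and
# hi_counts map suffix -> frequency in the respective corpus.

def select_suffixes(bn_counts, hi_counts):
    selected = set()
    for suffix, bn in bn_counts.items():
        hi = hi_counts.get(suffix, 0)
        # Criterion 1: frequent in Bengali, absent from Hindi.
        if bn > 1000 and hi == 0:
            selected.add(suffix)
        # Criterion 2: very frequent in Bengali, and >99% of the
        # combined occurrences come from the Bengali corpus.
        elif bn > 5000 and bn / (bn + hi) > 0.99:
            selected.add(suffix)
    return selected
```

A suffix that survives either test is almost certainly an untranslated Bengali inflection when it appears in the Hindi output, which is what makes it a safe postprocessing target.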

  9. Suffix list

  10. Infix based postprocessing • Multiple suffixes can be attached to a word, and they are stacked • chhelegulike = chhele + guli + ke • An infix in Bengali is translated to the corresponding infix in Hindi
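Splitting a stacked form like chhelegulike can be sketched as greedily peeling known suffixes from the right until a known stem remains. The suffix inventory and the stem set below are illustrative assumptions, not the paper's actual lists.

```python
# Sketch: peel stacked Bengali suffixes off the right end of a word.
# SUFFIXES is an assumed toy inventory, not the paper's 1,000-entry list.

SUFFIXES = {"guli", "ke", "ra", "der"}

def split_suffixes(word, stems):
    """Greedily strip known suffixes (longest first) until a known
    stem remains, returning [stem, suffix1, suffix2, ...]."""
    parts = []
    while word not in stems:
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                parts.insert(0, suf)
                word = word[: -len(suf)]
                break
        else:
            break  # no known suffix matches; give up
    return [word] + parts
```

Once the stack is recovered, each element before the final suffix is an "infix" in the slide's terminology and can be mapped to its Hindi counterpart.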

  11. Pattern based postprocessing • After suffix- and infix-based postprocessing, the output is further inspected for known error patterns • A "te" or "ke" suffix attached after 5 or more (transliterated) characters is very rare in Hindi
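The rarity rule above suggests a regex-style fix: detach a trailing "te" or "ke" from any sufficiently long word. The sketch below assumes the detached suffix is rewritten to a Hindi postposition; the specific mapping (ke → ko, te → me.N) is an illustrative guess, not the paper's rule table.

```python
import re

# Sketch of the error-pattern fix: an attached "ke"/"te" after a word
# of 5+ characters is detached. The replacement strings below are
# assumptions for illustration only.

FIXES = {"ke": " ko", "te": " me.N"}

def fix_patterns(text):
    def repl(m):
        return m.group(1) + FIXES[m.group(2)]
    # 5+ non-space characters followed directly by "te" or "ke"
    # at a word boundary
    return re.sub(r"(\S{5,}?)(te|ke)\b", repl, text)
```

Short Hindi words that legitimately end in "ke" or "te" are untouched because of the length threshold, which is exactly why the slide restricts the pattern to long words.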

  12. Experiment • Resources: Training corpus (12K sentences) of EMILLE-CIIL; Development corpus (1K sentences) of EMILLE-CIIL; Test corpus (100 sentences) of EMILLE-CIIL; Suffix list: 1,000 Bengali linguistic suffixes; Dictionary: 15,000 parallel synsets from ILMT-DIT; Gazetteer list: 50K parallel names from ILMT-DIT; Monolingual corpora: 500K words each from the source and target languages • Tools: GIZA++, Moses, MERT, Pharaoh

  13. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulagulike pariskArabhAbe batA diYA hai kI mAtApitAderake eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  14. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulagulike pariskArabhAbe batA diYA hai kI mAtApitAderake eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  15. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulagulike sApha tarahA se batA diYA hai kI mAtApitAderake eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  16. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulao.Nke sApha tarahA se batA diYA hai kI mAtApitAo.Nke eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

  17. Example Sentence Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena. English: We have clearly told all the schools to give the parents a written document. SMT (with enhanced phrase table) output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N. Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
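The whole postprocessing chain on the example sentence can be summarized as a small chain of string rewrites. The replacement table below is hand-built from the before/after strings on these slides, purely to make the transformation concrete; it is not the paper's actual rule set.

```python
# Rough sketch: the net effect of suffix, infix and pattern
# postprocessing on the example SMT output above. The table is
# reverse-engineered from the slides, not taken from the paper.

REPLACEMENTS = [
    ("pariskArabhAbe", "sApha tarahA se"),  # untranslated Bengali word
    ("gulike", "o.N ko"),                   # suffix + infix + pattern steps
    ("derake", "o.N ko"),                   # suffix + infix + pattern steps
]

def postprocess(sentence):
    for old, new in REPLACEMENTS:
        sentence = sentence.replace(old, new)
    return sentence
```

Applied to the SMT output of slide 13, this yields exactly the postprocessing output shown on slide 17.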

  18. Evaluation

  19. BLEU • An automatic, inexpensive, quick and language-independent evaluation method • The closer a machine translation output is to a professional human reference translation, the better its BLEU score • A source word can be translated into different word choices • The candidate translation selects one of them • It may not match the word choice of the reference translation
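The word-matching behaviour described above comes from BLEU's modified n-gram precision: candidate n-gram counts are clipped by their counts in the reference, so a synonym the reference did not use scores zero. A minimal sketch of that precision term (without the brevity penalty or geometric mean over n) is:

```python
# Minimal sketch of BLEU's modified (clipped) n-gram precision.
# Full BLEU also combines n = 1..4 and applies a brevity penalty.
from collections import Counter

def modified_precision(candidate, reference, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split())
    ref = ngrams(reference.split())
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

The clipping is why a candidate that picks a valid but different word choice than the reference is penalized, which is the weakness the next slide's concept-based modification addresses.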

  20. BLEU Cont. • Candidate translation vs. reference translation: BLEU 0.2275 • With a monolingual concept dictionary: modified BLEU 0.2318 • The BLEU score improves when matching considers concepts rather than surface words

  21. Conclusion • Targeted at postprocessing the inflected words which remain unchanged after translation • Words which are wrongly translated are not considered • A morphological analyzer/generator may be useful • Using the dictionary decreases the fluency level

  22. References
W. S. Bennett, J. Slocum. 1985. The LRC Machine Translation System. Computational Linguistics, 11(2-3): 111-121.
P. F. Brown, S. D. Pietra, V. J. D. Pietra, R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2): 263-312.
A. Eisele, C. Federmann, H. Uszkoreit, H. Saint-Amand, M. Kay, M. Jellinghaus, S. Hunsicker, T. Herrmann, Y. Chen. 2008. Hybrid Machine Translation Architectures within and beyond the EuroMatrix project. In Proceedings of the European Machine Translation Conference, pp. 27-34.
Ethnologue: Languages of the World, 16th edition, edited by M. Paul Lewis, 2009.
P. Isabelle, C. Goutte, M. Simard. 2007. Domain adaptation of MT systems through automatic post-editing. In Proceedings of MT Summit XI, pp. 255-261, Copenhagen, Denmark.
P. Koehn, F. J. Och, D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL-HLT, pp. 48-54, Edmonton, Canada.
P. Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA-2004.
F. J. Och, H. Ney. 2000. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the ACL, pp. 440-447.
F. J. Och, H. Ney. 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30(4): 417-449.
K. Papineni, S. Roukos, T. Ward, W. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 311-318.
A. Ushioda. 2007. Phrase Alignment for Integration of SMT and RBMT Resources. In MT Summit XI Workshop on Patent Translation.
H. Wu, H. Wang. 2004. Improving Statistical Word Alignment with a Rule-Based Machine Translation System. In Proceedings of COLING, pp. 29-35.

  23. Thank You
