
Morphological Processing for Statistical Machine Translation

COMS E6998: Topics in Computer Science: Machine Translation. Reading Set #1, February 7, 2013. Presenter: Nizar Habash. Papers discussed: Habash & Sadat (2006) and Singh & Habash (2012).


Presentation Transcript


  1. COMS E6998: Topics in Computer Science: Machine Translation February 7, 2013 Reading Set #1 Morphological Processing for Statistical Machine Translation Presenter: Nizar Habash

  2. Papers Discussed • Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation. • Nimesh Singh and Nizar Habash. 2012. Hebrew Morphological Preprocessing for Statistical Machine Translation.

  3. Outline • Introduction • Arabic and Hebrew Morphology • Approach • Experimental Settings • Results • Conclusions

  4. The Basic Idea • Reduction of word sparsity improves translation quality • This reduction can be achieved by • increasing training data, or by • morphologically driven preprocessing

  5. Introduction • Morphologically rich languages are especially challenging for SMT • Model sparsity and high OOV rates, especially under low-resource conditions • A common solution is to tokenize the source words in a preprocessing step • Lower OOV rate → better SMT (in terms of BLEU) • Increased token symmetry → better SMT models • conj+article+noun :: conj article noun • wa+Al+kitAb :: and the book
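The token-symmetry point can be made concrete with a toy splitter. The following is a minimal sketch, not the papers' actual preprocessing: it peels a leading conjunction w+ and article Al+ off a Buckwalter-transliterated word so that wAlktAb lines up one-to-one with "and the book"; the length checks are ad hoc guards, not real linguistic rules.

```python
# Minimal sketch (not the papers' tools): peel the conjunction w+ and the
# article Al+ off a Buckwalter-transliterated Arabic token so that
# "wAlktAb" aligns token-for-token with "and the book".

def decliticize(token):
    """Greedily strip a leading conjunction and a leading article clitic."""
    pieces = []
    if token.startswith("w") and len(token) > 3:    # conjunction w+ 'and'
        pieces.append("w+")
        token = token[1:]
    if token.startswith("Al") and len(token) > 4:   # article Al+ 'the'
        pieces.append("Al+")
        token = token[2:]
    pieces.append(token)
    return pieces

print(decliticize("wAlktAb"))   # ['w+', 'Al+', 'ktAb'] :: and the book
```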

  6. Introduction • Different tokenizations can be used • No one “correct” tokenization. Tokenizations vary in terms of • Scheme (what) and Technique (how) • Accuracy • Consistency • Sparsity reduction • The two papers consider different preprocessing options and other settings to study SMT from Arabic/Hebrew to English

  7. Outline • Introduction • Arabic and Hebrew Morphology • Approach • Experimental Settings • Results • Conclusions

  8. Linguistic Issues • Arabic & Hebrew are Semitic languages • Root-and-pattern morphology • Extensive use of affixes and clitics • Rich morphology • Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]], e.g. w+ l+ Al+ mktb 'and+ for+ the+ office' • Morphotactics: w+l+Al+mktb is written wllmktb (وللمكتب, from و+ل+ال+مكتب)

  9. Linguistic Issues • Orthographic & morphological ambiguity • وجدنا wjdnA • wjd+nA wajad+nA (we found) • w+jd+nA wa+jad~u+nA (and our grandfather) • בשורה bšwrh

  10. Arabic Orthographic Ambiguity (MT lab hints) • Arabic is normally written without short vowels and with clitics attached, so surface words are highly ambiguous • Example: wdrst AltAlbAt AlErbyAt ktAbA bAlSynyp • Segmented: w+drs+t Al+Talb+At Al+Erb+y+At ktAb+A b+Al+Syn+y+p • Gloss: and+study+they the+student+f.pl. the+Arab+f.pl. book+a in+the+Chinese • Translation: 'The Arab students studied a book in Chinese' • Note the extra w+ and the repeated Al+ in the segmented form • English analogy (attach clitics, drop vowels): the+arab students studied a+book in+chinese → th+rb stdnts stdd +bk n+chns → thrb stdnts stdd bk nchns → 'to+herb so+too+dents studded bake in chains?'

  11. Arabic Morphemes (MT lab hints) • [Figure: morpheme structure for verbs and nominals, including circumfixes] • Clitics are optional, affixes are obligatory!

  12. Outline • Introduction • Arabic and Hebrew Morphology • Approach • Experimental Settings • Results • Conclusions

  13. Approach (Habash & Sadat 2006 / Singh & Habash 2012) • Preprocessing scheme: what to tokenize • Preprocessing technique: how to tokenize • Regular expressions • Morphological analysis • Morphological tagging / disambiguation • Unsupervised morphological segmentation • Scheme and technique are not always independent

  14. Arabic Preprocessing Schemes • ST: simple tokenization • D1: decliticize conjunctions w+/f+ • D2: D1 + decliticize particles b+/l+/k+/s+ • D3: D2 + decliticize article Al+ and pronominal clitics • BW: morphological stem and affixes • EN: D3 + lemmatize, English-like POS tags, Subj • ON: orthographic normalization • WA: wa+ decliticization • TB: Arabic Treebank tokenization • L1: lemmatize, Arabic POS tags • L2: lemmatize, English-like POS tags • Example input: wsyktbhA? 'and he will write it?' • ST: wsyktbhA ? • D1: w+ syktbhA ? • D2: w+ s+ yktbhA ? • D3: w+ s+ yktb +hA ? • BW: w+ s+ y+ ktb +hA ? • EN: w+ s+ ktb/VBZ S:3MS +hA ?
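One way to see how the D-schemes nest is a rough scheme-parameterized splitter. This is only a surface-regex approximation with an abridged, made-up clitic list (closer in spirit to the REGEX technique on the next slide than to the BAMA/MADA + TOKAN pipelines the papers actually use), but it reproduces the slide's wsyktbhA example.

```python
import re

# Surface-regex sketch of the ST/D1/D2/D3 schemes over Buckwalter
# transliteration. The papers derive these schemes from morphological
# analyses via TOKAN; a pure regex version is deliberately naive about
# ambiguity, and the clitic inventory below is abridged.

PRON_RE = re.compile(r"(hA|hmA|hm|hn|h|kmA|km|kn|k|nA|ny|y)$")

def tokenize(word, scheme="D3"):
    out = []
    if scheme in ("D1", "D2", "D3") and re.match(r"[wf].{3,}$", word):
        out.append(word[0] + "+")              # conjunction w+ / f+
        word = word[1:]
    if scheme in ("D2", "D3") and re.match(r"[blks].{3,}$", word):
        out.append(word[0] + "+")              # particle b+/l+/k+/s+
        word = word[1:]
    if scheme == "D3":
        if word.startswith("Al") and len(word) > 4:
            out.append("Al+")                  # article Al+
            word = word[2:]
        m = PRON_RE.search(word)
        if m and m.start() >= 2:               # pronominal clitic +hA etc.
            return out + [word[: m.start()], "+" + m.group(1)]
    out.append(word)
    return out

for s in ("ST", "D1", "D2", "D3"):
    print(s, tokenize("wsyktbhA", s))
# ST ['wsyktbhA']            D1 ['w+', 'syktbhA']
# D2 ['w+', 's+', 'yktbhA']  D3 ['w+', 's+', 'yktb', '+hA']
```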

  15. Arabic Preprocessing Techniques • REGEX: regular expressions • BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004) • Pick the first analysis • Use TOKAN (Habash 2006), a generalized tokenizer • Assumes a disambiguated morphological analysis • Declarative specification of any preprocessing scheme • MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005) • Multiple SVM classifiers + a combiner • Selects a BAMA analysis • Use TOKAN
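The technique distinction can be illustrated with toy data structures; these are not BAMA or MADA. A morphological analyzer returns several analyses per word; "pick first" just takes the first one, while a disambiguator scores the analyses in context and picks the best.

```python
# Toy illustration of the technique difference, not the real tools.
# Hypothetical analyzer output for Buckwalter 'wjdnA' (cf. slide 9).
ANALYSES = {
    "wjdnA": [
        {"tokens": ["wjd", "+nA"],       "gloss": "we found"},
        {"tokens": ["w+", "jd", "+nA"],  "gloss": "and our grandfather"},
    ]
}

def pick_first(word):
    """BAMA-style shortcut: just take the analyzer's first analysis."""
    return ANALYSES[word][0]

def disambiguate(word, context, score):
    """MADA-style idea: rank analyses with a context-aware scoring model."""
    return max(ANALYSES[word], key=lambda a: score(a, context))

# MADA uses SVM taggers plus a combiner; here we fake a score that prefers
# the verbal reading, so the two techniques happen to agree on this word.
fake_score = lambda a, ctx: 1.0 if a["gloss"] == "we found" else 0.0
print(pick_first("wjdnA")["tokens"])                    # ['wjd', '+nA']
print(disambiguate("wjdnA", [], fake_score)["tokens"])  # ['wjd', '+nA']
```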

  16. Hebrew Preprocessing Techniques/Schemes • Regular Expressions • RegEx-S1 = Conjunctions: ו ‘and’ and ש ‘that/who’ • RegEx-S2 = RegEx-S1 and Prepositions: ב ‘in’, כ ‘like/as’, ל ‘to/for’, and מ ‘from’ • RegEx-S3 = RegEx-S2 and the article ה ‘the’ • RegEx-S4 = RegEx-S3 and pronominal enclitics • Morfessor (Creutz and Lagus, 2007) • Morf - Unsupervised splitting into morphemes • Hebrew Morphological Tagger (Adler, 2009) • Htag - Hebrew morphological analysis and disambiguation
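A rough sketch of the RegEx scheme family follows, assuming the prefix letters listed above and an ad hoc minimum stem length (the paper's actual rules may differ, and RegEx-S4's pronominal enclitics are omitted). Each scheme peels at most one prefix per layer, which is exactly why such rules over-split words that merely begin with these letters.

```python
# Sketch of the RegEx-S1..S3 idea: peel at most one conjunction, one
# preposition, and one article prefix, in that order (S4's enclitics omitted).
CONJ = "וש"      # ו 'and', ש 'that/who'
PREP = "בכלמ"    # ב 'in', כ 'like/as', ל 'to/for', מ 'from'
ART  = "ה"       # ה 'the'

def split_prefixes(word, scheme="RegEx-S3", min_stem=2):
    layers = {"RegEx-S1": [CONJ],
              "RegEx-S2": [CONJ, PREP],
              "RegEx-S3": [CONJ, PREP, ART]}[scheme]
    prefixes = []
    for letters in layers:
        if word and word[0] in letters and len(word) - 1 >= min_stem:
            prefixes.append(word[0] + "+")
            word = word[1:]
    return prefixes + [word]

# בשורה (bšwrh, slide 9's example) is ambiguous: one noun, or ב+ plus a noun.
# The regex always splits, regardless of the intended reading.
print(split_prefixes("בשורה"))   # ['ב+', 'שורה']
```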

  17. Tokenization System Statistics • Aggressive tokenization schemes have: • More tokens • More change from the baseline (untokenized) • Fewer OOVs (baseline OOV is 7%)
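For reference, the OOV rate behind these statistics is straightforward to compute: the fraction of test tokens never seen in training, before vs. after tokenization. A minimal sketch with hypothetical mini-corpora (the 7% baseline figure comes from the actual data, not from this toy example):

```python
# OOV rate: fraction of test tokens whose form never occurs in training.
def oov_rate(train_tokens, test_tokens):
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)

# Hypothetical mini-corpora in Buckwalter transliteration.
train_raw = "ktb AlktAb wAlktAb".split()
test_raw  = "wktAb llktAb".split()

# After a clitic-splitting tokenization, e.g. wktAb -> w+ ktAb.
train_tok = "ktb Al+ ktAb w+ Al+ ktAb".split()
test_tok  = "w+ ktAb l+ Al+ ktAb".split()

print(oov_rate(train_raw, test_raw))   # 1.0 (both raw test words unseen)
print(oov_rate(train_tok, test_tok))   # 0.2 (only l+ is unseen)
```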

  18. Tokenization System Statistics

  19. Outline • Introduction • Arabic and Hebrew Morphology • Approach • Experimental Settings • Results • Conclusions

  20. Arabic-English Experiments • Portage phrase-based MT (Sadat et al., 2005) • Training data: parallel, 5 million words only • All in the news genre • Learning curve: 1%, 10%, and 100% • Language modeling: 250 million words • Development (tuning) data: MT03 eval set • Test data: MT04 • Mixed genre: news, speeches, editorials • Metric: BLEU (Papineni et al., 2001)
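The learning-curve condition amounts to training separate systems on random 1%, 10%, and 100% samples of the parallel sentence pairs. A minimal sketch with toy data (the real corpus is the 5M-word Arabic-English training set):

```python
import random

# Sketch of the learning-curve setup: sample a fraction of the parallel
# sentence pairs, keeping source and target sides aligned.
def subsample_parallel(pairs, fraction, seed=0):
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    keep = max(1, int(len(pairs) * fraction))
    return pairs[:keep]

# Hypothetical parallel corpus: list of (source, target) sentence pairs.
corpus = [("w+ Al+ ktAb jdyd", "and the book is new")] * 1000

for frac in (0.01, 0.10, 1.00):
    subset = subsample_parallel(corpus, frac)
    print(f"{frac:.0%}: {len(subset)} sentence pairs")
```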

  21. Arabic-English Experiments • Each experiment: • Select a preprocessing scheme • Select a preprocessing technique • Some combinations do not exist, e.g. REGEX with the EN scheme

  22. Arabic-English Results • [Chart: BLEU learning curves over 1%, 10%, and 100% of the training data] • Across training sizes, MADA > BAMA > REGEX

  23. Hebrew-English Experiments • Phrase-based statistical MT • Moses (Koehn et al., 2007) • MERT (Och, 2003) tuned for BLEU (Papineni et al., 2002) • Language models: English Gigaword (5-gram) plus the training data's English side (3-gram) • True casing for English output • Training data: ~850,000 words

  24. Hebrew-English Experiments • Compare seven systems • Vary only the preprocessing • Baseline, RegEx-S{1-4}, Morf, and Htag • Metrics: BLEU, NIST (Doddington, 2002), and METEOR (Banerjee & Lavie, 2005)
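For scoring, a corpus-level BLEU can be computed with, for example, the third-party sacrebleu package; this is an assumption for illustration, since the slides do not name the scoring scripts, and NIST and METEOR require their own tools.

```python
# Corpus-level BLEU with sacrebleu (an assumed tool, not named in the slides).
import sacrebleu

hypotheses = ["the arab students studied a book in chinese"]
references = [["the arab students studied a book in chinese"]]  # one ref set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # 100.0 for an exact match
```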

  25. Results • Htag is consistently best, and Morf is consistently second best, in terms of BLEU and NIST.

  26. Results • Morf has a very low OOV rate but still does worse than Htag (and even more poorly according to METEOR), indicating that it sometimes over-tokenizes.

  27. Results • Within the RegEx schemes, BLEU peaks at S2/S3, similar to the Arabic D2 scheme (Habash & Sadat, 2006).

  28. Translation Example

  29. Outline • Introduction • Arabic and Hebrew Morphology • Approach • Experimental Settings • Results • Conclusions

  30. Conclusions • Preprocessing is useful for improving Arabic-English and Hebrew-English SMT • But as more training data is added, its value diminishes • Tokenization with a morphological tagger does best, but requires a lot of linguistic knowledge • Morfessor does quite well with no linguistic information and significantly reduces OOV (though perhaps erroneously) • The optimal scheme/technique choice varies by training data size • In Arabic, with large amounts of training data, splitting off conjunctions and particles performs best • But for small amounts of training data, an English-like tokenization performs best

  31. Thank you! Questions? Nizar Habash habash@cs.columbia.edu
