

  1. Source Language Adaptation for Resource-Poor Machine Translation. Pidong Wang, National University of Singapore; Preslav Nakov, QCRI, Qatar Foundation; Hwee Tou Ng, National University of Singapore

  2. Introduction

  3. Overview. Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts). Problem: such training bi-texts do not exist for most languages. Idea: adapt a bi-text from a related resource-rich language.

  4. Idea & Motivation. Idea: reuse bi-texts from related resource-rich languages to improve resource-poor SMT. Related languages have overlapping vocabulary (cognates), e.g., casa (‘house’) in Spanish and Portuguese, and similar word order and syntax.

  5. Resource-Rich vs. Resource-Poor Languages. Related EU – non-EU pairs: Swedish – Norwegian, Bulgarian – Macedonian. Related EU pairs: Spanish – Catalan, Czech – Slovak, Irish – Scottish Gaelic, Standard German – Swiss German. Related pairs outside Europe: MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi), Hindi – Urdu, Turkish – Azerbaijani, Russian – Ukrainian, Malay – Indonesian. We will explore these pairs.

  6. Our main focus: improving Indonesian-English SMT using Malay-English.

  7. Malay vs. Indonesian. Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan. Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan. (~50% exact word overlap; from Article 1 of the Universal Declaration of Human Rights)

  8. Malay Can Look “More Indonesian”… Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan. “Indonesian” (Malay post-edited by an Indonesian speaker to look Indonesian): Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan. This raises the exact word overlap with the true Indonesian version (Article 1 of the Universal Declaration of Human Rights) to ~75%. We attempt to do this automatically: adapt Malay to look Indonesian, then use it to improve SMT…
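The overlap percentages quoted on these two slides can be approximated with a simple type-level comparison. A minimal sketch in Python; the exact overlap metric used in the talk is not specified, so treating it as shared word types over the smaller type set is an assumption:

    # Minimal sketch: "exact word overlap" between two sentences, measured
    # over word types. Normalizing by the smaller type set is an assumption;
    # other normalizations (Jaccard, token-level) are equally plausible.
    def word_overlap(sent_a: str, sent_b: str) -> float:
        a = set(sent_a.lower().split())
        b = set(sent_b.lower().split())
        return len(a & b) / min(len(a), len(b))

Applied to the sentence pairs above, such a measure lands in the ballpark of the quoted figures, depending on tokenization and punctuation handling.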

  9. Method at a Glance. Step 1: Adaptation: adapt the resource-rich Malay-English bi-text into an “Indonesian”-English bi-text. Step 2: Combination: train on the resource-poor Indonesian-English bi-text plus the adapted “Indonesian”-English bi-text. Note that we have no Malay-Indonesian bi-text!

  10. Step 1: Adapting Malay-English to “Indonesian”-English

  11. Word-Level Bi-text Adaptation: Overview. Given a Malay-English sentence pair, we adapt the Malay sentence to “Indonesian” using word-level paraphrases, phrase-level paraphrases, and cross-lingual morphology. We then pair the adapted “Indonesian” sentence with the English side of the Malay-English sentence pair. Thus, we generate a new “Indonesian”-English sentence pair.

  12. Word-Level Bi-text Adaptation: Overview. Example Malay input: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010. Each Malay word is expanded into weighted “Indonesian” alternatives, and we decode the resulting confusion network using a large Indonesian LM.

  13. Word-Level Bi-text Adaptation: Overview. We pair each adapted “Indonesian” sentence with the English counterpart: Malaysia’s GDP is expected to reach 8 per cent in 2010. Thus, we generate a new “Indonesian”-English bi-text.
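To make the word-level adaptation step concrete, here is a minimal sketch (not the authors' implementation) of decoding such a per-word confusion network with a beam search. The PARA entries and the lm_logprob stub are illustrative assumptions; a real system would use the pivoted paraphrase weights from the next slide and query a large Indonesian n-gram LM:

    import math

    # Toy paraphrase options per Malay word (weights invented for
    # illustration; real weights come from pivoting over English).
    PARA = {
        "dijangka": [("dijangka", 0.4), ("diperkirakan", 0.5), ("diduga", 0.1)],
        "cecah": [("cecah", 0.3), ("mencapai", 0.7)],
    }

    def lm_logprob(prev: str, word: str) -> float:
        """Stub for a large Indonesian LM; a real system would query an
        n-gram model (e.g., SRILM/KenLM) here."""
        return 0.0

    def adapt(malay_tokens, beam=5):
        """Beam search over the word-level confusion network: each Malay
        token contributes a column of weighted 'Indonesian' alternatives."""
        hyps = [([], 0.0)]  # (adapted prefix, log score)
        for tok in malay_tokens:
            cands = PARA.get(tok, [(tok, 1.0)])  # unknown words pass through
            hyps = sorted(
                ((pfx + [w],
                  s + math.log(p) + lm_logprob(pfx[-1] if pfx else "<s>", w))
                 for pfx, s in hyps for w, p in cands),
                key=lambda h: -h[1])[:beam]
        return hyps[0][0]

    sent = "KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010 ."
    print(" ".join(adapt(sent.split())))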

  14. Word-Level Adaptation: Extracting Paraphrases. Indonesian translations for Malay words are obtained by pivoting over English: Malay words are aligned to English words in the ML-EN bi-text, English words are aligned to Indonesian words in the IN-EN bi-text, and composing the two alignments yields weighted Malay-Indonesian paraphrases. Note: we have no Malay-Indonesian bi-text, so we pivot.
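A minimal sketch of the pivoting computation, under the usual independence assumption p(in|ml) = sum over en of p(in|en) * p(en|ml); the toy tables and numbers are illustrative only:

    from collections import defaultdict

    def pivot(p_en_given_ml, p_in_given_en):
        """Compose ML->EN and EN->IN translation tables over the shared
        English pivot: p(in|ml) = sum_en p(in|en) * p(en|ml)."""
        p_in_given_ml = defaultdict(lambda: defaultdict(float))
        for ml, en_dist in p_en_given_ml.items():
            for en, p1 in en_dist.items():
                for in_w, p2 in p_in_given_en.get(en, {}).items():
                    p_in_given_ml[ml][in_w] += p1 * p2
        return p_in_given_ml

    # Toy example: Malay "cecah" aligns to English "reach"/"hit", which in
    # turn align to Indonesian candidates.
    p_en_ml = {"cecah": {"reach": 0.8, "hit": 0.2}}
    p_in_en = {"reach": {"mencapai": 0.9, "capai": 0.1},
               "hit": {"menyentuh": 1.0}}
    print(dict(pivot(p_en_ml, p_in_en)["cecah"]))
    # ~ {'mencapai': 0.72, 'capai': 0.08, 'menyentuh': 0.2}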

  15. Word-Level Adaptation: Issue 1. The IN-EN bi-text is small, so the IN-EN word alignments are unreliable, which yields bad ML-IN paraphrases. Solution: improve the IN-EN alignments using the ML-EN bi-text. Concatenate IN-EN*k + ML-EN, where k ≈ |ML-EN| / |IN-EN|; run word alignment on the concatenation; keep the alignments for one copy of IN-EN only. This works because of the cognates between Malay and Indonesian.
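A sketch of this concatenation trick under stated assumptions (bi-texts as in-memory lists of sentence pairs; the word aligner itself, e.g., GIZA++, runs as an external step on the concatenated corpus):

    # Balanced concatenation for word alignment. After aligning the whole
    # corpus, only the alignment lines for the first IN-EN copy are kept.
    def build_alignment_corpus(in_en, ml_en):
        k = max(1, round(len(ml_en) / len(in_en)))  # k ~ |ML-EN| / |IN-EN|
        corpus = in_en * k + ml_en  # repeated IN-EN copies balance the sizes
        return corpus, len(in_en)   # len(in_en) = how many alignment lines
                                    # to keep from the front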

  16. Word-Level Adaptation: Issue 2. The IN-EN bi-text is small, so the Indonesian vocabulary available for ML-IN paraphrases is small. Solution: add cross-lingual morphological variants. Given a Malay word, e.g., seperminuman, find its Malay lemma, minum, and propose all known Indonesian words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makanan-minuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum. Note: the Indonesian variants come from a larger monolingual Indonesian text.
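A sketch of how the variant expansion could be indexed, assuming lemmatizers for both languages (the ml_lemma and in_lemma callables are hypothetical stand-ins; Malay and Indonesian share most lemmas, which is what the method exploits):

    from collections import defaultdict

    def build_variant_index(indonesian_vocab, in_lemma):
        """Map each Indonesian lemma to all surface forms observed in a
        large monolingual Indonesian corpus."""
        index = defaultdict(set)
        for word in indonesian_vocab:
            index[in_lemma(word)].add(word)
        return index

    def variants_for(ml_word, ml_lemma, index):
        """All known Indonesian words sharing the Malay word's lemma,
        e.g., seperminuman -> minum -> {diminum, minuman, peminum, ...}."""
        return sorted(index.get(ml_lemma(ml_word), set()))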

  17. Word-Level Adaptation: Issue 3. Word-level pivoting ignores context (relying on the LM alone) and cannot drop, insert, merge, split, or reorder words. Solution: phrase-level pivoting. Build ML-EN and EN-IN phrase tables; induce an ML-IN phrase table by pivoting over English; adapt the ML side of ML-EN to get an “IN”-EN bi-text, using the Indonesian LM and n-best “IN” as before, and also using cross-lingual morphological variants. This models context better (not only the Indonesian LM, but also phrases) and allows many word operations, e.g., insertion and deletion.
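A simplified sketch of inducing the ML-IN phrase table by pivoting over English; real phrase tables carry several scores (forward/backward phrase and lexical probabilities), collapsed here to a single probability for brevity:

    from collections import defaultdict

    def induce_ml_in(ml_en_table, en_in_table):
        """Join ML-EN and EN-IN phrase pairs on the shared English side and
        multiply their probabilities; tables map (src, tgt) -> prob."""
        by_en = defaultdict(list)  # index EN-IN entries by English phrase
        for (en, in_), p in en_in_table.items():
            by_en[en].append((in_, p))
        ml_in = defaultdict(float)
        for (ml, en), p1 in ml_en_table.items():
            for in_, p2 in by_en[en]:
                ml_in[(ml, in_)] += p1 * p2
        return ml_in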

  18. Step 2: Combining IN-EN + “IN”-EN

  19. Combining IN-EN and “IN”-EN Bi-texts. Simple concatenation: IN-EN + “IN”-EN. Balanced concatenation: IN-EN * k + “IN”-EN. Sophisticated phrase table combination (Nakov & Ng, EMNLP 2009, “Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages”; Nakov & Ng, JAIR 2012): improved word alignments for IN-EN, and phrase table combination with extra features.
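The two concatenation schemes are straightforward; a sketch over in-memory sentence-pair lists (the phrase-table combination of Nakov & Ng is more involved and omitted here):

    def simple_concat(in_en, adapted_in_en):
        """IN-EN + "IN"-EN."""
        return in_en + adapted_in_en

    def balanced_concat(in_en, adapted_in_en):
        """IN-EN * k + "IN"-EN: up-weight the small, trusted IN-EN bi-text
        so the larger adapted bi-text does not drown it out."""
        k = max(1, round(len(adapted_in_en) / len(in_en)))
        return in_en * k + adapted_in_en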

  20. Experiments & Evaluation

  21. Data (tokens). Translation data (for IN-EN): IN2EN-train: 0.9M; IN2EN-dev: 37K; IN2EN-test: 37K; EN monolingual: 5M. Adaptation data (for ML-EN → “IN”-EN): ML2EN: 8.6M; IN monolingual: 20M.

  22. Isolated Experiments: Training on “IN”-EN Only. [BLEU results chart omitted] System combination using MEMT (Heafield & Lavie, 2010).

  23. Combined Experiments: Training on IN-EN + “IN”-EN. [BLEU results chart omitted]

  24. Experiments: Improvements. [BLEU results chart omitted]

  25. Application to Other Languages & Domains. We improve Macedonian-English SMT by adapting a Bulgarian-English bi-text: we adapt BG-EN (11.5M words) to “MK”-EN (1.2M words), using OPUS movie subtitles. [BLEU results chart omitted]

  26. Conclusion

  27. Conclusion & Future Work. We adapt bi-texts from related resource-rich languages, using confusion networks, word-level & phrase-level paraphrasing, and cross-lingual morphological analysis. Achieved: +6.7 BLEU over ML2EN; +2.6 BLEU over IN2EN; +1.5-3.0 BLEU over comb(IN2EN, ML2EN). Future work: add split/merge as word operations; better integrate the word-level and phrase-level methods; apply our methods to other languages & NLP problems. Thank you! Supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

  28. Further Analysis

  29. Paraphrasing Non-Indonesian Malay Words Only. [BLEU comparison chart omitted] So, we do need to paraphrase all words.

  30. Human Judgments. Is the adapted sentence better Indonesian than the original Malay sentence? We judged 100 random sentences. [Judgment results omitted] Morphology yields worse top-3 adaptations but better phrase tables, due to coverage.

  31. Reverse Adaptation. Idea: adapt the dev/test Indonesian input to “Malay”, then translate with a Malay-English system. Input to SMT: a “Malay” lattice, or the 1-best “Malay” sentence from the lattice. Adapting dev/test is worse than adapting the training bi-text; so, we need both the n-best list and the LM.

  32. Related Work

  33. Related Work (1). Machine translation between related languages, e.g., Cantonese–Mandarin (Zhang, 1998), Czech–Slovak (Hajič et al., 2000), Turkish–Crimean Tatar (Altintas & Cicekli, 2002), Irish–Scottish Gaelic (Scannell, 2006), Bulgarian–Macedonian (Nakov & Tiedemann, 2012). We do not translate (there is no training data); we “adapt”.

  34. Related Work (2). Adapting dialects to a standard language, e.g., Arabic (Bakr et al., 2008; Sawaf, 2010; Salloum & Habash, 2011): manual rules. Normalizing tweets and SMS (Aw et al., 2006; Han & Baldwin, 2011): informal text (spelling, abbreviations, slang), same language.

  35. Related Work (3). Adapting Brazilian to European Portuguese (Marujo et al., 2011): rule-based, language-dependent, tiny improvements for SMT. Reusing bi-texts between related languages (Nakov & Ng, 2009): no language adaptation (just transliteration).
