Learning to Generate Complex Morphology for Machine Translation

Einat Minkov†, Kristina Toutanova* and Hisami Suzuki*
*Microsoft Research, †Carnegie Mellon University

Motivation
English: I would like to meet this nice woman.
Arabic: اود ان مواجهه هذا جيد امراه.
Gloss of the last three Arabic words: هذا = this (masc), جيد = nice (masc), امراه = woman (fem)

Motivation • System guess (Quirk et al, 05) • Correct

SMT challenges for English → morphology-rich language • Information ‘missing’ on source side • Data sparsity • Morphological agreement in the target language

Related work • Translation from morphology-rich languages to English • Preprocessing of the inputs to improve alignments: Arabic (Lee, 04), German (Koehn and Knight, 03; Nießen and Ney, 04; Popović and Ney, 04; Collins et al., 05), Czech (Goldwater and McClosky, 05) • Translation from English to morphology-rich languages • Preprocessing and postprocessing: Turkish (El-Kahlout and Oflazer, 06), Spanish and Catalan (Ueffing and Ney, 03) • Our approach • Extension of Japanese case marker prediction (Suzuki and Toutanova, 06)

Morphology Prediction • Morphology generation as classification: classify each stem into an inflected form
Source: Eliminate a primary key constraint
System guess: eliminare un vincolo di chiave primario
Possible inflections:
eliminare: eliminare, elimino, elimini, eliminiamo, …
un: un, una
vincolo: vincolo, vincoli
di: di, del, dei, della, …
chiave: chiave, chiavi
primario: primario, primaria, primari, primarie

Outline • Morphology • Russian, Arabic • Lexicon operations • The task of inflection prediction • A log-linear model • Features • Lexical, Syntactic and Morphological • Experiments

Russian Morphology • 3 genders, 2 numbers, 6 cases (nom, acc, location …) • Nouns have gender, and inflect for number and case • Adjectives agree with nouns in number, gender, and case • Verbs agree with the subject in person and number (past tense agrees in gender and number)
Example: У меня есть синий карандаш
У (at)
меня (me): Pers1, Gen
есть (is): Pres
синий (blue): Nom, Masc, Sing
карандаш (pencil): Nom, Masc, Sing

Arabic morphology • Arabic: inflection + clitics • Prefixes: Conj/Prep/Det (in strict order) • Suffixes: Object pronouns/Possessive pronouns • Agreement: in person, number, gender and definiteness
(from Bar-Haim et al)
فقلناها /faqulnāhā/
ف+ قال+ نا+ ها
fa+qul+na+hā
so+said+we+it
so we said it
وللمكتبات /walilmaktabāt/
و+ل+ال+مكتبة+ات
wa+li+al+maktabāt
and+for+the+libraries
and for the libraries
(from Nizar Habash)

Lexicon Operations
Surface word: то
Stemming → set of possible lemmas: то, тот
Inflection → set of possible morphological variants: того, тому, тем, том, те, тех, теми, то
Analysis → set of possible morphological analyses:
тот+PronAdj+DemPron+Neut+Sg+NomAcc (that)
то+Pron+Neut+Inanim+Sg+NomAcc (it)
то+Conj (then)
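The three lexicon operations on this slide (stemming, inflection, analysis) can be sketched as lookups in a small table. This is a minimal illustration, not the lexicon used in the talk: the analyses for "то" are copied from the slide, while the function names and the assignment of each variant to a lemma are assumptions made for the example.

```python
# Analyses per surface word: (lemma, tags). The entries for "то" come from the slide.
ANALYSES = {
    "то": [
        ("тот", "PronAdj+DemPron+Neut+Sg+NomAcc"),  # that
        ("то", "Pron+Neut+Inanim+Sg+NomAcc"),       # it
        ("то", "Conj"),                             # then
    ],
}

# Surface forms per lemma. Which variant belongs to which lemma is an
# assumption for illustration; the slide only lists the combined set.
FORMS = {
    "тот": ["то", "того", "тому", "тем", "том", "те", "тех", "теми"],
    "то": ["то"],
}


def analyze(word):
    """Analysis: all morphological analyses of a surface word."""
    return list(ANALYSES.get(word, []))


def stem(word):
    """Stemming: the set of possible lemmas of a surface word."""
    return sorted({lemma for lemma, _ in analyze(word)})


def inflect(word):
    """Inflection: all variants of all possible lemmas, without duplicates."""
    variants = []
    for lemma in stem(word):
        for form in FORMS.get(lemma, []):
            if form not in variants:
                variants.append(form)
    return variants


lemmas = stem("то")
variants = inflect("то")
```

In a setup like this, `inflect` would supply the candidate labels for a stem and `analyze` the morphological attributes that features can refer to.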

Inflection Prediction Model • Given a sentence, predict the inflection of each word (y_1, y_2, y_3, y_4) • Conditional Markov Model • Sentence processed left-to-right (can be applied top-down) • Features: pairs of target and context predicates • Can model agreement: POS(y_{i-2})=DT & Number(y_{i-1})=sg & Number(y_i)=sg
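A left-to-right conditional Markov model of this kind can be sketched as follows. Each position gets a log-linear distribution over the candidate inflections of its stem, conditioned on the two previous predictions. Several things here are assumptions rather than details from the talk: the greedy search (a beam or Viterbi search would be alternatives), the hand-set weights, the feature names, and the toy phrase "un vincolo primario", which is shortened from the Italian example earlier in the deck.

```python
import math


def features(cand, prev1, prev2):
    """Features that fire for one candidate given the two previous predictions."""
    feats = []
    if prev1 is not None:
        # Number agreement with the previous prediction.
        feats.append(("num_agree", prev1["number"] == cand["number"]))
    if prev1 is not None and prev2 is not None:
        # Agreement feature from the slide: POS(y_{i-2}) & Number(y_{i-1}) & Number(y_i).
        feats.append(("pos_num_num", prev2["pos"], prev1["number"], cand["number"]))
    return feats


def local_probs(candidates, prev1, prev2, weights):
    """Log-linear distribution over the candidate inflections of one stem."""
    scores = [
        sum(weights.get(f, 0.0) for f in features(c, prev1, prev2))
        for c in candidates
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(sentence, weights):
    """Pick the most probable inflection at each position, left to right."""
    output = []
    for candidates in sentence:
        prev1 = output[-1] if output else None
        prev2 = output[-2] if len(output) > 1 else None
        probs = local_probs(candidates, prev1, prev2, weights)
        best = max(range(len(candidates)), key=probs.__getitem__)
        output.append(candidates[best])
    return output


# Hand-set weights, for illustration only; a real model would learn them.
WEIGHTS = {
    ("num_agree", True): 2.0,
    ("pos_num_num", "DT", "sg", "sg"): 2.0,
}

# One list of candidate inflections per stem.
SENTENCE = [
    [{"form": "un", "pos": "DT", "number": "sg"}],
    [{"form": "vincolo", "pos": "N", "number": "sg"},
     {"form": "vincoli", "pos": "N", "number": "pl"}],
    [{"form": "primario", "pos": "ADJ", "number": "sg"},
     {"form": "primari", "pos": "ADJ", "number": "pl"}],
]

decoded = [c["form"] for c in greedy_decode(SENTENCE, WEIGHTS)]
```

Because each decision conditions on earlier predictions, a wrong early choice can propagate under greedy search, which is one reason a wider search may be preferable in practice.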

Linguistic annotations • Annotations used in the Quirk et al (05) system
Figure labels: Source dependency tree; POS & morphological features; Surface features; POS & morphological features; Projected dependency tree

Features
Monolingual
Lexical: stem, left stem, right stem, y_{i-1}, y_{i-2}
Syntax: parent stem, …
Morph.: POS(y_{i-1}), number(y_{i-1}), person(y_{i-1}), tense(y_{i-1}), …
Bilingual
Lexical: aligned words a_i
Syntax: parent(a_i), left sister(a_i), right sister(a_i)
Morph.: POS(a_i), number(a_i), person(a_i), tense(a_i), det*(a_i), prep*(a_i), pron*(a_i), …
Inflection
inflection(y_i), POS(y_i), tense(y_i), number(y_i), …

Russian
[PrevStem=X, Case_Inflection=y]
[AlignedWords=will, Tense_Inflection=future]
[AlignedWords=been, Tense_Inflection=past]
[AlignedWords=click, Tense_Inflection=imperative]
Arabic
[Prev.Stem=qam~-u_qam~, Prep_Inflection=bi]
[Aligned_Number=Plur, Number_Inflection=pl]
[AlignedWords=and, Conj_Inflection=true]
[PrevStem=fiy_y, Prep_Inflection=none]
[AlignedWords=applications, Gender_Inflection=fem]
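The example features above each pair a context predicate with a predicate on the predicted inflection. One simple way to generate such pairs is to cross every context predicate with every target predicate, as in the sketch below. Taking the full cross product is an assumption; the slides only show individual pairs, and the function name `pair_features` is hypothetical.

```python
from itertools import product


def pair_features(context, target):
    """Cross context predicates with target (inflection) predicates.

    context: list of (name, value) pairs, e.g. ("AlignedWords", "will");
             a list is used because a name may occur with several values.
    target:  dict of attribute -> value for one candidate inflection.
    """
    ctx = [f"{name}={value}" for name, value in context]
    tgt = [f"{attr}_Inflection={value}" for attr, value in target.items()]
    return [f"[{c}, {t}]" for c, t in product(ctx, tgt)]


feats = pair_features(
    context=[("AlignedWords", "will"), ("PrevStem", "X")],
    target={"Tense": "future"},
)
```

Each resulting string would serve as the name of one binary indicator feature in the log-linear model.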

Reference Experiments • Baselines • Random baseline (pick a label at random) • Word-trigram language model baseline • Trained using the CMU toolkit on the same training dataset • Models • Monolingual word / all, Bilingual word / all • Lexicons • Russian: dictionary, Arabic: Buckwalter analyzer • Evaluated only on words in the lexicon
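The random baseline and the restriction of evaluation to in-lexicon words can be sketched as below. This is an illustration under assumptions: the function names and the boolean in-lexicon mask are hypothetical, and the candidate sets reuse Italian forms from the earlier example rather than the Russian or Arabic data of the experiments.

```python
import random


def random_baseline(candidate_sets, rng):
    """Pick one label uniformly at random from each word's candidate set."""
    return [rng.choice(candidates) for candidates in candidate_sets]


def accuracy(predictions, references, in_lexicon):
    """Fraction of correct predictions, counting only words found in the lexicon."""
    scored = [
        (p, r) for p, r, known in zip(predictions, references, in_lexicon) if known
    ]
    if not scored:
        return 0.0
    return sum(p == r for p, r in scored) / len(scored)


candidate_sets = [["vincolo", "vincoli"], ["chiave", "chiavi"]]
guesses = random_baseline(candidate_sets, random.Random(0))

# The third word is outside the lexicon, so it does not count.
acc = accuracy(["a", "b", "c"], ["a", "x", "c"], [True, True, False])
```

With uniform sampling, the expected accuracy of the random baseline on a word is one over the size of its candidate set, so it is lower for stems with many possible inflections.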

Russian inflection prediction: accuracy • The suggested model outperforms the language model baseline • Syntactic and morphological features are informative

Arabic inflection prediction: accuracy

Accuracy vs. training data size

Error Analysis • Russian • Gender of pronoun (it ~ he/she/it) • Case/Gender in coordinate construction • Morphological analysis ambiguity • Arabic • Gender/Number of pronoun • Definiteness in noun phrases

Summary • Proposed a general framework for improving SMT into morphology-rich languages • Showed that morpho-syntactic features and source sentence information, derived from an aligned sentence pair and a lexicon, are effective • Achieved good results even with little training data

Future Directions • Integration with the MT system • Initial results for Russian: 1.7 BLEU improvement • Improvements to the model and features • Morphological disambiguation • Semantic role labeling • Longer distance agreements (e.g. pronoun coreference) • More languages

Thanks! Questions?
