Statistical XFER: Hybrid Statistical Rule-based Machine Translation

Alon Lavie
Language Technologies Institute, Carnegie Mellon University
Joint work with: Jaime Carbonell, Lori Levin, Bob Frederking, Erik Peterson, Christian Monson, Vamshi Ambati, Greg Hanneman, Kathrin Probst, Ariadna Font-Llitjos, Alison Alvarez, Roberto Aranovich

Outline
• Background and Rationale
• Stat-XFER Framework Overview
• Elicitation
• Learning Transfer Rules
• Automatic Rule Refinement
• Example Prototypes
• Major Research Challenges
Statistical XFER MT

Progression of MT
• Started with rule-based systems
  • Very large expert human effort to construct language-specific resources (grammars, lexicons)
  • High-quality MT extremely expensive → only for a handful of language pairs
• Along came EBMT and then Statistical MT…
  • Replaced human effort with extremely large volumes of parallel text data
  • Less expensive, but still only feasible for a small number of language pairs
  • We “traded” human labor for data
• Where does this take us in 5-10 years?
  • Large parallel corpora for maybe 25-50 language pairs
• What about all the other languages?
• Is all this data (with very shallow representation of language structure) really necessary?
• Can we build MT approaches that learn deeper levels of language structure and how they map from one language to another?

Rule-based vs. Statistical MT
• Traditional Rule-based MT:
  • Expressive and linguistically-rich formalisms capable of describing complex mappings between the two languages
  • Accurate “clean” resources
  • Everything constructed manually by experts
  • Main challenge: obtaining broad coverage
• Phrase-based Statistical MT:
  • Learn word and phrase correspondences automatically from large volumes of parallel data
  • Search-based “decoding” framework:
    • Models propose many alternative translations
    • Effective search algorithms find the “best” translation
  • Main challenge: obtaining high translation accuracy

Main Principles of Stat-XFER
• Integrate the major strengths of rule-based and statistical MT within a common framework:
  • Linguistically rich formalism that can express complex and abstract compositional transfer rules
  • Rules can be written by human experts and also acquired automatically from data
  • Easy integration of morphological analyzers and generators
  • Word and basic phrase correspondences (e.g. base NPs) can be automatically acquired from parallel text when available
  • Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam-search, parameter optimization, etc.
  • Framework suitable for both resource-rich and resource-poor language scenarios

Stat-XFER MT Approach
[Figure: MT pyramid between Source (e.g. Quechua) and Target (e.g. English). Labels: Direct (SMT, EBMT); Statistical-XFER (Syntactic Parsing, Transfer Rules, Text Generation); Interlingua (Semantic Analysis, Sentence Planning).]

[Figure: system architecture, with the following labeled components and examples]
Source Input: בשורה הבאה
Preprocessing
Morphology
Transfer Engine
Transfer Rules:
    {NP1,3}
    NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
    (
    (X3::Y1) (X1::Y2)
    ((X1 def) = +)
    ((X1 status) =c absolute)
    ((X1 num) = (X3 num))
    ((X1 gen) = (X3 gen))
    (X0 = X1)
    )
Translation Lexicon:
    N::N |: ["$WR"] -> ["BULL"]
    ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
    N::N |: ["$WRH"] -> ["LINE"]
    ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))
Translation Output Lattice:
    (0 1 "IN" @PREP)
    (1 1 "THE" @DET)
    (2 2 "LINE" @N)
    (1 2 "THE LINE" @NP)
    (0 2 "IN LINE" @PP)
    (0 4 "IN THE NEXT LINE" @PP)
Decoder
Language Model + Additional Features
English Output: in the next line

Transfer Rule Formalism
Annotated parts of the rule: type information; part-of-speech/constituent information; alignments; x-side constraints; y-side constraints; xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
    ;SL: the old man, TL: ha-ish ha-zaqen
    NP::NP [DET ADJ N] -> [DET N DET ADJ]
    (
    (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
    ((X1 AGR) = *3-SING)
    ((X1 DEF) = *DEF)
    ((X3 AGR) = *3-SING)
    ((X3 COUNT) = +)
    ((Y1 DEF) = *DEF)
    ((Y3 DEF) = *DEF)
    ((Y2 AGR) = *3-SING)
    ((Y2 GENDER) = (Y4 GENDER))
    )

Transfer Rule Formalism (II)
Annotated parts of the rule: value constraints; agreement constraints
    ;SL: the old man, TL: ha-ish ha-zaqen
    NP::NP [DET ADJ N] -> [DET N DET ADJ]
    (
    (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
    ((X1 AGR) = *3-SING)
    ((X1 DEF) = *DEF)
    ((X3 AGR) = *3-SING)
    ((X3 COUNT) = +)
    ((Y1 DEF) = *DEF)
    ((Y3 DEF) = *DEF)
    ((Y2 AGR) = *3-SING)
    ((Y2 GENDER) = (Y4 GENDER))
    )

Hebrew Manual Transfer Grammar (human-developed)
• Initially developed in a couple of days, with some later revisions by a CL post-doc
• Current grammar has 36 rules:
  • 21 NP rules
  • one PP rule
  • 6 verb complexes and VP rules
  • 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local) structural differences between Hebrew and English

Hebrew Transfer Grammar: Example Rules
    {NP1,2}
    ;;SL: $MLH ADWMH
    ;;TL: A RED DRESS
    NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
    (
    (X2::Y1) (X1::Y2)
    ((X1 def) = -)
    ((X1 status) =c absolute)
    ((X1 num) = (X2 num))
    ((X1 gen) = (X2 gen))
    (X0 = X1)
    )

    {NP1,3}
    ;;SL: H $MLWT H ADWMWT
    ;;TL: THE RED DRESSES
    NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
    (
    (X3::Y1) (X1::Y2)
    ((X1 def) = +)
    ((X1 status) =c absolute)
    ((X1 num) = (X3 num))
    ((X1 gen) = (X3 gen))
    (X0 = X1)
    )
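To make the effect of rule {NP1,2} concrete, here is a minimal Python sketch of how such a rule might be applied to two already-translated constituents. It is not the XFER engine's implementation: the feature dictionaries, the `target` key and the function name are assumptions made for illustration, and every constraint is treated as a strict equality test, which is stricter than true unification (where `=` can succeed on an unspecified feature and only `=c` requires the value to be present).

```python
from typing import Optional


def apply_np1_adj_rule(np1: dict, adj: dict) -> Optional[dict]:
    """Sketch of {NP1,2}: NP1::NP1 [NP1 ADJ] -> [ADJ NP1].

    np1 and adj are hypothetical feature dicts carrying a "target" string.
    Returns the feature dict of the new NP1, or None if a constraint fails.
    """
    # ((X1 def) = -): the noun phrase must be indefinite.
    if np1.get("def") != "-":
        return None
    # ((X1 status) =c absolute): the feature must be present with this value.
    if np1.get("status") != "absolute":
        return None
    # ((X1 num) = (X2 num)) and ((X1 gen) = (X2 gen)): agreement.
    for feature in ("num", "gen"):
        if np1.get(feature) != adj.get(feature):
            return None
    # (X0 = X1): the parent inherits the noun phrase's features.
    parent = dict(np1)
    # (X2::Y1) (X1::Y2): the adjective precedes the noun on the target side.
    parent["target"] = f"{adj['target']} {np1['target']}"
    return parent


# Hypothetical lexicon output for $MLH ADWMH.
dress = {"target": "DRESS", "def": "-", "status": "absolute", "num": "s", "gen": "f"}
red = {"target": "RED", "num": "s", "gen": "f"}
print(apply_np1_adj_rule(dress, red)["target"])  # RED DRESS
```

In this sketch the article "A" of the slide's target "A RED DRESS" is not produced, since it would come from a higher-level NP rule rather than from {NP1,2} itself.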

The XFER Engine
• Input: source-language input sentence, or source-language confusion network
• Output: lattice representing collection of translation fragments at all levels supported by transfer rules
• Basic Algorithm: “bottom-up” integrated “parsing-transfer-generation” guided by the transfer rules
  • Start with translations of individual words and phrases from translation lexicon
  • Create translations of larger constituents by applying applicable transfer rules to previously created lattice entries
  • Beam-search controls the exponential combinatorics of the search-space, using multiple scoring features

Source-language Confusion Network: Hebrew Example
• Input word: B$WRH
    0     1     2     3     4
    |--------B$WRH--------|
    |-----B-----|$WR|--H--|
    |--B--|-H--|--$WRH---|
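One way to picture this input is as a small graph whose arcs are candidate morphemes. The Python sketch below enumerates the segmentations such a graph admits. The node positions are read off the diagram above and are an assumption, as are the names `ARCS` and `enumerate_segmentations`; the actual input format of the engine is not shown in the slides.

```python
from collections import defaultdict

# (start_node, end_node, token); positions are assumed from the diagram.
ARCS = [
    (0, 4, "B$WRH"),
    (0, 2, "B"), (2, 3, "$WR"), (3, 4, "H"),
    (0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH"),
]


def enumerate_segmentations(arcs, start, end):
    """Return every token sequence that leads from node start to node end."""
    by_start = defaultdict(list)
    for s, e, token in arcs:
        by_start[s].append((e, token))

    def walk(node):
        if node == end:
            yield []
            return
        for next_node, token in by_start[node]:
            for rest in walk(next_node):
                yield [token] + rest

    return list(walk(start))


paths = enumerate_segmentations(ARCS, 0, 4)
for path in paths:
    print(" ".join(path))
```

Because the rows share nodes in this representation, the graph admits five paths rather than only the three rows drawn on the slide (for example `B $WRH`); the transfer engine and decoder would be left to score them.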

XFER Output Lattice
    (28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
    (29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")
    (29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")
    (29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")
    (30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")
    (30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")
    (30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")
    (30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")
    (30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")
    (30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")
    (30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")
    (30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")

The Lattice Decoder
• Simple Stack Decoder, similar in principle to simple Statistical MT decoders
• Searches for best-scoring path of non-overlapping lattice arcs
• No reordering during decoding
• Scoring based on log-linear combination of scoring components, with weights trained using MERT
• Scoring components:
  • Statistical Language Model
  • Fragmentation: how many arcs to cover the entire translation?
  • Length Penalty
  • Rule Scores
  • Lexical Probabilities
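The search for a best-scoring monotone path can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the decoder described above: it uses dynamic programming instead of a stack decoder with a beam, and it leaves out the language model, whose score depends on neighboring arcs. The arc tuples, feature names and weights are all hypothetical; in the real system the weights would come from MERT.

```python
def arc_score(features, weights):
    """Log-linear score of one arc: weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())


def decode(arcs, n_words, weights):
    """Best monotone path of non-overlapping arcs covering source words 0..n_words.

    arcs: list of (start, end, target_text, features) with end exclusive.
    Returns (score, [target_text, ...]) or None if no full cover exists.
    """
    best = [None] * (n_words + 1)
    best[0] = (0.0, [])
    for pos in range(n_words):
        if best[pos] is None:
            continue  # no partial translation ends at this position
        score, path = best[pos]
        for start, end, text, features in arcs:
            if start != pos:
                continue
            candidate = score + arc_score(features, weights)
            if best[end] is None or candidate > best[end][0]:
                best[end] = (candidate, path + [text])
    return best[n_words]


# Hypothetical arcs: "frag" counts one unit per arc used.
arcs = [
    (0, 1, "IN", {"logprob": -1.0, "frag": 1.0}),
    (1, 2, "THE", {"logprob": -1.0, "frag": 1.0}),
    (2, 3, "LINE", {"logprob": -2.0, "frag": 1.0}),
    (0, 3, "IN THE LINE", {"logprob": -4.5, "frag": 1.0}),
]
print(decode(arcs, 3, {"logprob": 1.0, "frag": -1.0}))  # (-5.5, ['IN THE LINE'])
print(decode(arcs, 3, {"logprob": 1.0, "frag": 0.0}))   # (-4.0, ['IN', 'THE', 'LINE'])
```

The two calls show why the fragmentation feature matters in this toy setting: with a fragmentation penalty the single large arc wins, and without it the three small arcs win on lexical probability alone.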

XFER Lattice Decoder
    0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
    Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13
    235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
    918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
    584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>

Data Elicitation for Languages with Limited Resources
• Rationale:
  • Large volumes of parallel text not available → create a small maximally-diverse parallel corpus that directly supports the learning task
  • Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
  • Elicitation corpus designed to be typologically and structurally comprehensive and compositional
  • Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data

Elicitation Tool: English-Chinese Example

Elicitation Tool: English-Hindi Example

Elicitation Tool: English-Arabic Example

Elicitation Tool: Spanish-Mapudungun Example

Designing Elicitation Corpora
• Goal: Create a small representative parallel corpus that contains examples of the most important translation correspondences and divergences between the two languages
• Method:
  • Elicit translations and word alignments for a broad diversity of linguistic phenomena and constructions
• Current Elicitation Corpus: ~3100 sentences and phrases, constructed based on a broad feature-based specification
• Open Research Issues:
  • Feature Detection: discover what features exist in the language and where/how they are marked
    • Example: does the language mark gender of nouns? How and where are these marked?
  • Dynamic corpus navigation based on feature detection: no need to elicit for combinations involving non-existent features

Rule Learning - Overview
• Goal: Acquire Syntactic Transfer Rules
• Use available knowledge from the source side (grammatical structure)
• Three steps:
  • Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  • Compositionality Learning: use previously learned rules to learn hierarchical structure
  • Constraint Learning: refine rules by learning appropriate feature constraints

Flat Seed Rule Generation
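The content of this slide did not survive extraction, so the following Python sketch only illustrates the general idea stated in the overview: a flat seed rule is a first guess built directly from one word-aligned example. It assumes, hypothetically, that the seed consists of the part-of-speech sequences of both sides plus the elicited word alignments, written in the transfer rule notation shown earlier. The function name and input format are not from the source.

```python
def flat_seed_rule(category, src_tags, tgt_tags, alignments):
    """Build a flat rule string from one aligned example.

    alignments: iterable of 1-based (source_index, target_index) pairs.
    No feature constraints are produced; those belong to a later step.
    """
    lhs = " ".join(src_tags)
    rhs = " ".join(tgt_tags)
    links = " ".join(f"(X{i}::Y{j})" for i, j in sorted(alignments))
    return f"{category}::{category} [{lhs}] -> [{rhs}] ({links})"


# SL: the old man, TL: ha-ish ha-zaqen
rule = flat_seed_rule(
    "NP",
    ["DET", "ADJ", "N"],
    ["DET", "N", "DET", "ADJ"],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)
print(rule)
# NP::NP [DET ADJ N] -> [DET N DET ADJ] ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
```

The output matches the structural part of the NP rule on the Transfer Rule Formalism slide; the value and agreement constraints of that rule are what the Constraint Learning step would add.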

Compositionality Learning

Constraint Learning

Automated Rule Refinement
• Bilingual informants can identify and pinpoint translation errors
• A sophisticated trace of the translation path can identify likely sources for the error and do “Blame Assignment”
• Rule Refinement operators can be developed to modify the underlying translation grammar (and lexicon) based on characteristics of the error source:
  • Add or delete feature constraints from a rule
  • Bifurcate a rule into two rules (general and specific)
  • Add or correct lexical entries
• See [Font-Llitjos, Carbonell & Lavie, 2005]

Stat-XFER MT Prototypes
• General Statistical XFER framework under development for the past five years (funded by NSF and DARPA)
• Prototype systems so far:
  • Chinese-to-English
  • Dutch-to-English
  • French-to-English
  • Hindi-to-English
  • Hebrew-to-English
  • Mapudungun-to-Spanish
• In progress or planned:
  • Brazilian Portuguese-to-English
  • Native-Brazilian languages to Brazilian Portuguese
  • Hebrew-to-Arabic
  • Iñupiaq-to-English
  • Urdu-to-English
  • Turkish-to-English

Chinese-English Stat-XFER System
• Bilingual lexicon: over 1.1 million entries (multiple resources, incl. ADSO, Wikipedia, extracted base NPs)
• Manual syntactic XFER grammar: 76 rules! (mostly NPs, a few PPs, and reordering of NPs/PPs within VPs)
• Multiple overlapping Chinese word segmentations
• English morphology generation
• Uses CMU SMT-group’s Suffix-Array LM toolkit for LM
• Current Performance (GALE dev-test; B = BLEU, M = METEOR):
  • NW:
    • XFER: 10.89(B)/0.4509(M)
    • Best (UMD): 15.58(B)/0.4769(M)
  • NG:
    • XFER: 8.92(B)/0.4229(M)
    • Best (UMD): 12.96(B)/0.4455(M)
• In Progress:
  • Automatic extraction of “clean” base NPs from parallel data
  • Automatic learning and extraction of high-quality transfer-rules from parallel data

Translation Example
• REFERENCE: When responding to whether it is possible to extend Russian fleet's stationing deadline at the Crimean peninsula, Yanukovych replied, "Without a doubt.
• Stat-XFER (0.3989): In reply to whether the possibility to extend the Russian fleet stationed in Crimea Pen. left the deadline of the problem , Yanukovich replied : " of course .
• IBM-ylee (0.2203): In response to the possibility to extend the deadline for the presence in Crimea peninsula , the Queen Vic said : " of course .
• CMU-SMT (0.2067): In response to a possible extension of the fleet in the Crimean Peninsula stay on the issue , Yanukovych vetch replied : " of course .
• maryland-hiero (0.1878): In response to the possibility of extending the mandate of the Crimean peninsula in , replied: "of course.
• IBM-smt (0.1862): The answer is likely to be extended the Crimean peninsula of the presence of the problem, Yanukovych said: " Of course.
• CMU-syntax (0.1639): In response to the possibility of extension of the presence in the Crimean Peninsula , replied : " of course .

Major Research Directions
• Automatic Transfer Rule Learning:
  • From manually word-aligned elicitation corpus
  • From large volumes of automatically word-aligned “wild” parallel data
  • In the absence of morphology or POS annotated lexica
  • Compositionality and generalization
  • Identifying “good” rules from “bad” rules
  • Effective models for rule scoring for
    • Decoding: using scores at runtime
    • Pruning the large collections of learned rules
  • Learning Unification Constraints

Major Research Directions
• Extraction of Base-NP translations from parallel data:
  • Base-NPs are extremely important “building blocks” for transfer-based MT systems
    • Frequent, often align 1-to-1, improve coverage
    • Correctly identifying them greatly helps automatic word-alignment of parallel sentences
  • Parsers (or NP-chunkers) available for both languages: Extract base-NPs independently on both sides and find their correspondences
  • Parsers (or NP-chunkers) available for only one language (e.g. English): Extract base-NPs on one side, and find reliable correspondences for them using word-alignment, frequency distributions, other features…
  • Promising preliminary results
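For the case where a chunker exists on only one side, one plausible ingredient is to project each base-NP span through the word alignment and keep it only when the projection is consistent. The Python sketch below shows that single step. It is an assumption about how "reliable correspondences" might be filtered, borrowed from the usual phrase-extraction consistency check, and it ignores the frequency and other features the slide mentions. The function name and the Spanish example are hypothetical.

```python
def project_base_np(np_span, alignment):
    """Project a source base-NP span onto the target side.

    np_span: (start, end) source token indices, end exclusive.
    alignment: set of (source_index, target_index) word links.
    Returns the target span (start, end), or None when the NP is unaligned
    or a target word inside the projected span links outside the NP.
    """
    start, end = np_span
    targets = [t for s, t in alignment if start <= s < end]
    if not targets:
        return None
    lo, hi = min(targets), max(targets) + 1
    for s, t in alignment:
        if lo <= t < hi and not (start <= s < end):
            return None  # inconsistent: outside word aligned into the span
    return (lo, hi)


# the(0) red(1) dress(2) is(3) new(4)  /  el(0) vestido(1) rojo(2) es(3) nuevo(4)
links = {(0, 0), (1, 2), (2, 1), (3, 3), (4, 4)}
print(project_base_np((0, 3), links))             # (0, 3)
print(project_base_np((0, 3), links | {(3, 1)}))  # None
```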

Major Research Directions
• Algorithms for XFER and Decoding
  • Integration and optimization of multiple features into search-based XFER parser
  • Complexity and efficiency improvements (e.g. “Cube Pruning”)
  • Non-monotonicity issues (LM scores, unification constraints) and their consequences on search

Major Research Directions
• Discriminative Language Modeling for MT:
  • Current standard statistical LMs provide only weak discrimination between good and bad translation hypotheses
  • New Idea: Use “occurrence-based” statistics:
    • Extract instances of lexical, syntactic and semantic features from each translation hypothesis
    • Determine whether these instances have been “seen before” (at least once) in a large monolingual corpus
  • The Conjecture: more grammatical MT hypotheses are likely to contain higher proportions of feature instances that have been seen in a corpus of grammatical sentences.
  • Goals:
    • Find the set of features that provides the best discrimination between good and bad translations
    • Learn how to combine these into a LM-like function for scoring alternative MT hypotheses
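The occurrence-based idea can be illustrated with the simplest possible feature type. The Python sketch below uses word n-grams as a stand-in for the lexical features and reports the proportion of a hypothesis's n-grams that occur at least once in a monolingual corpus. The syntactic and semantic features mentioned above are not modeled, and the function names and toy corpus are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_seen_set(corpus_sentences, n):
    """Collect every n-gram that occurs at least once in the corpus."""
    seen = set()
    for sentence in corpus_sentences:
        seen.update(ngrams(sentence.lower().split(), n))
    return seen


def seen_fraction(hypothesis, seen, n):
    """Proportion of the hypothesis's n-grams that were seen before."""
    grams = ngrams(hypothesis.lower().split(), n)
    if not grams:
        return 0.0
    return sum(gram in seen for gram in grams) / len(grams)


# Toy monolingual corpus (hypothetical).
corpus = ["the lion ate the rabbit", "on the fourth day"]
seen_bigrams = build_seen_set(corpus, 2)
print(seen_fraction("the lion ate the rabbit", seen_bigrams, 2))  # 1.0
print(seen_fraction("the lion ate the day", seen_bigrams, 2))     # 0.75
print(seen_fraction("lion the ate rabbit the", seen_bigrams, 2))  # 0.0
```

Under the conjecture on the slide, the scrambled third hypothesis would be ranked below the first two; combining several such proportions into one LM-like score is the open problem listed under Goals.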

Major Research Directions
• Building Elicitation Corpora:
  • Feature Detection
  • Corpus Navigation
• Automatic Rule Refinement
• Translation for highly polysynthetic languages such as Mapudungun and Iñupiaq

Questions?
