Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System

Alon Lavie, Language Technologies Institute, Carnegie Mellon University. Joint work with Shuly Wintner and Yaniv Eytani (University of Haifa) and Erik Peterson and Katharina Probst (Carnegie Mellon).


Presentation Transcript


  1. Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System
     Alon Lavie, Language Technologies Institute, Carnegie Mellon University
     Joint work with: Shuly Wintner, Yaniv Eytani (University of Haifa); Erik Peterson, Katharina Probst (Carnegie Mellon)

  2. Outline
     • Hebrew and its Challenges for MT
     • CMU Transfer-based MT Framework
     • Hebrew-to-English System
     • Input pre-processing and Morphological Analysis
     • MT Resources: lexicon and grammar
     • Performance Evaluation
     • Conclusions, Current and Future Work
     TMI-2004

  3. Modern Hebrew
     • Native language of about 3-4 million people in Israel
     • Semitic language, closely related to Arabic and with similar linguistic properties
       • Root+Pattern word formation system
       • Rich verb and noun morphology
       • Particles attach as prefixes to the following word: definite article (H), prepositions (B, K, L, M), coordinating conjunction (W), relativizers ($, K$)…
     • Unique alphabet and writing system
       • 22 letters represent (mostly) consonants
       • Vowels represented (mostly) by diacritics
       • Modern texts omit the diacritic vowels, which adds a further level of ambiguity: "bare" word → word
       • Example: MHGR → mehager, m+hagar, m+h+ger
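The prefix ambiguity in the MHGR example can be reproduced with a small sketch. It only peels the particles named on the slide off the front of a romanized word; it has no lexicon, so it over-generates on most words. The function name `prefix_splits` and the `min_stem` cutoff are illustrative assumptions, not part of the system described here.

```python
# Prefix particles named on the slide, in the romanized notation.
PREFIXES = ("K$", "H", "B", "K", "L", "M", "W", "$")


def prefix_splits(word, min_stem=2):
    """Return every way to peel prefix particles off the front of `word`.

    No lexicon is consulted, so many returned splits are not real readings.
    `min_stem` (an assumed cutoff) keeps at least that many letters as the stem.
    """
    splits = [(word,)]  # the reading with no prefix at all
    for p in PREFIXES:
        rest = word[len(p):]
        if word.startswith(p) and len(rest) >= min_stem:
            splits.extend((p,) + tail for tail in prefix_splits(rest, min_stem))
    return splits


for s in prefix_splits("MHGR"):
    print("+".join(s))
```

For MHGR this prints MHGR, M+HGR and M+H+GR, the three segmentations behind mehager, m+hagar and m+h+ger.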

  4. Modern Hebrew Spelling
     • Two main spelling variants
       • "KTIV XASER" (deficient): spelling that relies on the vowel diacritics, leaving consonant-only words when the diacritics are removed
       • "KTIV MALEH" (full): words with I/O/U vowels are written with long vowels, which include a letter
     • KTIV MALEH is predominant, but not strictly adhered to even in newspapers and official publications → inconsistent spelling
     • Example:
       • niqud (spelling): NIQWD, NQWD, NQD
       • Written as NQD, it could also be niqed, naqed, nuqad

  5. Challenges for Hebrew MT
     • Paucity of existing language resources for Hebrew
       • No publicly available broad-coverage morphological analyzer
       • No publicly available bilingual lexicons or dictionaries
       • No POS-tagged corpus or parse tree-bank corpus for Hebrew
       • No large Hebrew/English parallel corpus
     • Scenario well suited for the CMU transfer-based MT framework for languages with limited resources

  6. Hebrew Input: בשורה הבאה
     Components: Preprocessing, Morphology, Transfer Engine, Decoder
     Resources: Transfer Rules, Translation Lexicon, English Language Model
     Transfer Rules:
       {NP1,3}
       NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
       ((X3::Y1) (X1::Y2) ((X1 def) = +) ((X1 status) =c absolute) ((X1 num) = (X3 num)) ((X1 gen) = (X3 gen)) (X0 = X1))
     Translation Lexicon:
       N::N |: ["$WR"] -> ["BULL"] ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
       N::N |: ["$WRH"] -> ["LINE"] ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))
     Translation Output Lattice:
       (0 1 "IN" @PREP) (1 1 "THE" @DET) (2 2 "LINE" @N) (1 2 "THE LINE" @NP) (0 2 "IN LINE" @PP) (0 4 "IN THE NEXT LINE" @PP)
     English Output: in the next line

  7. Transfer Rule Formalism
     Rule components:
     • Type information
     • Part-of-speech/constituent information
     • Alignments
     • x-side constraints
     • y-side constraints
     • xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
     Example rule:
       ;SL: the old man, TL: ha-ish ha-zaqen
       NP::NP [DET ADJ N] -> [DET N DET ADJ]
       (
       (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
       ((X1 AGR) = *3-SING)
       ((X1 DEF) = *DEF)
       ((X3 AGR) = *3-SING)
       ((X3 COUNT) = +)
       ((Y1 DEF) = *DEF)
       ((Y3 DEF) = *DEF)
       ((Y2 AGR) = *3-SING)
       ((Y2 GENDER) = (Y4 GENDER))
       )

  8. Transfer Rule Formalism (II)
     Constraint types:
     • Value constraints
     • Agreement constraints
     Example rule:
       ;SL: the old man, TL: ha-ish ha-zaqen
       NP::NP [DET ADJ N] -> [DET N DET ADJ]
       (
       (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
       ((X1 AGR) = *3-SING)
       ((X1 DEF) = *DEF)
       ((X3 AGR) = *3-SING)
       ((X3 COUNT) = +)
       ((Y1 DEF) = *DEF)
       ((Y3 DEF) = *DEF)
       ((Y2 AGR) = *3-SING)
       ((Y2 GENDER) = (Y4 GENDER))
       )
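To see what the alignment lines of the example rule do on their own, here is a minimal sketch that places each source constituent's translation at the target positions it is aligned to. It ignores all feature constraints, which a real transfer engine would check by unification. The function name `apply_alignments` and the word-level translations are hypothetical.

```python
# Alignments of the example rule, as (source index, target index), 1-based:
# (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
ALIGNMENTS = [(1, 1), (1, 3), (2, 4), (3, 2)]


def apply_alignments(source_translations, alignments):
    """Put the translation of source constituent Xi at every aligned slot Yj."""
    target = [None] * max(y for _, y in alignments)
    for x, y in alignments:
        target[y - 1] = source_translations[x - 1]
    return target


# Hypothetical translations of DET ADJ N = "the old man".
translated = ["ha", "zaqen", "ish"]
print(apply_alignments(translated, ALIGNMENTS))
```

The result is `['ha', 'ish', 'ha', 'zaqen']`: the determiner is copied to both Y1 and Y3 and the noun moves ahead of the adjective, matching the target comment "ha-ish ha-zaqen".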

  9. The Transfer Engine

  10. XFER + Decoder
      • XFER engine produces a lattice of all possible transferred fragments
      • Decoder searches for and selects the best-scoring sequence of fragments as the final translation output
      • Main advantages:
        • Very high robustness
          • always some translation output
          • no transfer grammar → word-to-word translation
        • Scoring can take into account word-to-word translation probabilities, transfer rule scores, and a target statistical language model
        • Effective framework for late-stage disambiguation
      • Main difficulty: lattice size too big → pruning
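The decoder's job, picking the best-scoring sequence of fragments that covers the input, can be sketched as a left-to-right dynamic program over the lattice. This is an illustration of the idea only, not the CMU decoder: the `Arc` type, the additive scores and the toy lattice are all assumptions, and spans here use an exclusive end position.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    start: int    # first input position covered
    end: int      # position after the last one covered (exclusive)
    text: str     # translated fragment
    score: float  # hypothetical log-score, higher is better


def decode(arcs, n):
    """Return (score, fragments) of the best arc sequence covering 0..n, or None."""
    best = {0: (0.0, [])}
    for pos in range(n):  # arcs only move forward, so best[pos] is final here
        if pos not in best:
            continue
        score, path = best[pos]
        for arc in arcs:
            if arc.start != pos or arc.end <= pos:
                continue
            cand = score + arc.score
            if arc.end not in best or cand > best[arc.end][0]:
                best[arc.end] = (cand, path + [arc.text])
    return best.get(n)


# Toy lattice with made-up scores.
LATTICE = [
    Arc(0, 1, "IN", -1.0), Arc(1, 2, "THE", -0.5), Arc(2, 3, "NEXT", -1.0),
    Arc(3, 4, "LINE", -1.0), Arc(3, 4, "BULL", -2.5),
    Arc(1, 4, "THE NEXT LINE", -1.5),
    Arc(0, 4, "IN THE NEXT LINE", -2.0),
]
print(decode(LATTICE, 4))
```

With these scores the single long fragment wins. If it is removed, the search falls back to "IN" followed by "THE NEXT LINE", which is the robustness property listed above: shorter fragments, down to single words, still yield an output.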

  11. Hebrew Text Encoding Issues
      • Input texts are (most commonly) in the standard Windows encoding for Hebrew, but also in Unicode (UTF-8) and others…
      • Morphology analyzer and other resources are already set up to work in a romanized "ascii-like" representation
      • → Converter script converts the input into the romanized representation: a 1-to-1 mapping!
      • All further processing is done in the romanized representation
      • Lexicon and grammar rules are also converted into the romanized representation
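A character-level converter of this kind might look like the sketch below. The table is deliberately partial: it lists only the letters needed for the input phrase בשורה הבאה, with values chosen to agree with the romanized forms shown in these slides (for example B$WRH). The full table used by the actual converter is not given here, so the mapping, the function name `romanize` and the error behaviour are assumptions.

```python
# Partial, assumed letter table; only what the example phrase needs.
ROMANIZATION = {
    "א": "A",
    "ב": "B",
    "ה": "H",
    "ו": "W",
    "ר": "R",
    "ש": "$",
}


def romanize(text):
    """Map each Hebrew letter to one ASCII symbol; pass other characters through."""
    out = []
    for ch in text:
        if ch in ROMANIZATION:
            out.append(ROMANIZATION[ch])
        elif "\u05d0" <= ch <= "\u05ea":
            # A Hebrew letter missing from this partial table.
            raise ValueError(f"no romanization listed for {ch!r}")
        else:
            out.append(ch)  # spaces, digits, punctuation
    return "".join(out)


print(romanize("בשורה הבאה"))
```

Because every letter maps to exactly one symbol, the output has the same length as the input and the mapping can be inverted by reversing the table.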

  12. Morphological Analyzer
      • Analyzer program developed at the Technion was available; it works on Windows and, with minimal adaptation, on Linux
      • Coverage is reasonable (for nouns, verbs and adjectives)
      • Produces all analyses or a disambiguated analysis for each word
      • Output format includes lexeme (base form), POS, and morphological features
      • Output was adapted to our representation needs (POS and feature mappings)

  13. Morphological Processing
      • Split attached prefixes and suffixes into separate words for translation
      • Produce f-structures as output
      • Convert feature-value codes to our conventions
      • "All analyses mode": all possible analyses for each input word are returned, represented in the form of an input lattice
      • Analyzer installed as a server integrated with the input preprocessor

  14. Morphology Example
      • Input word: B$WRH
        0     1     2     3     4
        |---------B$WRH---------|
        |-----B-----|-$WR-|--H--|
        |--B--|--H--|---$WRH----|

  15. Morphology Example
      Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
      Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
      Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
      Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
      Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
      Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
      Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
      Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
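An input lattice like the one for B$WRH can be held as a list of (start, end, lexeme) arcs, and its readings enumerated as paths from the first position to the last. The sketch below takes the arc spans from the lattice drawing for B$WRH (positions 0 to 4); the representation and the function name `paths` are assumptions for illustration.

```python
# (start, end, lexeme) arcs, one row of the drawing per line.
ARCS = [
    (0, 4, "B$WRH"),
    (0, 2, "B"), (2, 3, "$WR"), (3, 4, "H"),
    (0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH"),
]


def paths(arcs, start, goal):
    """List every sequence of lexemes whose arcs chain from `start` to `goal`."""
    if start == goal:
        return [[]]
    found = []
    for s, e, lex in arcs:
        if s == start:
            found.extend([lex] + rest for rest in paths(arcs, e, goal))
    return found


for p in paths(ARCS, 0, 4):
    print(" ".join(p))
```

This arc list admits five paths, including combinations that cross rows of the drawing, such as B followed directly by $WRH. Whether the analyzer's own output licenses every such combination is not stated in the slides.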

  16. Translation Lexicon
      • Constructed our own Hebrew-to-English lexicon, based primarily on the existing "Dahan" H-to-E and E-to-H dictionary made available to us
      • Coverage is not great but not bad
        • Dahan H-to-E is about 15K translation pairs
        • Dahan E-to-H is about 7K translation pairs
      • POS information on both sides
      • No proper names or named entities
      • Converted Dahan into our representation, added entries for missing closed-class items (pronouns, prepositions, etc.)
      • Issue with spelling conventions
        • Dahan dictionary uses deficient KTIV XASER
        • Developed conversion scripts for the most common patterns of verbs
        • Add/merge these into the resulting lexicon
      • Target-side (English) morphological variants added into the lexicon

  17. Translation Lexicon: Examples
      PRO::PRO |: ["ANI"] -> ["I"]
      (
      (X1::Y1)
      ((X0 per) = 1)
      ((X0 num) = s)
      ((X0 case) = nom)
      )
      PRO::PRO |: ["ATH"] -> ["you"]
      (
      (X1::Y1)
      ((X0 per) = 2)
      ((X0 num) = s)
      ((X0 gen) = m)
      ((X0 case) = nom)
      )
      N::N |: ["$&H"] -> ["HOUR"]
      (
      (X1::Y1)
      ((X0 NUM) = s)
      ((Y0 NUM) = s)
      ((Y0 lex) = "HOUR")
      )
      N::N |: ["$&H"] -> ["hours"]
      (
      (X1::Y1)
      ((Y0 NUM) = p)
      ((X0 NUM) = p)
      ((Y0 lex) = "HOUR")
      )
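The two "$&H" entries show how a source-side feature selects among target forms. A minimal sketch of that lookup follows, assuming a flat list of entries keyed by lexeme and source number. The data layout and the function name `translate` are illustrative and say nothing about how the transfer engine stores its lexicon.

```python
# The two "$&H" entries above, reduced to lexeme, source NUM and target form.
LEXICON = [
    {"src": "$&H", "x_num": "s", "tgt": "HOUR"},
    {"src": "$&H", "x_num": "p", "tgt": "hours"},
]


def translate(lexeme, num):
    """Return all target forms whose entry matches the lexeme and source number."""
    return [e["tgt"] for e in LEXICON if e["src"] == lexeme and e["x_num"] == num]


print(translate("$&H", "s"))
print(translate("$&H", "p"))
```

A singular analysis of the word selects "HOUR" and a plural analysis selects "hours"; a lexeme with no entry yields an empty list.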

  18. Transfer Grammar (human-developed)
      • Written by Alon in a few days…
      • Current grammar has 36 rules:
        • 21 NP rules
        • one PP rule
        • 6 verb complexes and VP rules
        • 8 higher-phrase and sentence-level rules
      • Captures the most common (mostly local) structural differences between Hebrew and English

  19. Transfer Grammar Example Rules
      {NP1,2}
      ;;SL: $MLH ADWMH
      ;;TL: A RED DRESS
      NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
      (
      (X2::Y1) (X1::Y2)
      ((X1 def) = -)
      ((X1 status) =c absolute)
      ((X1 num) = (X2 num))
      ((X1 gen) = (X2 gen))
      (X0 = X1)
      )
      {NP1,3}
      ;;SL: H $MLWT H ADWMWT
      ;;TL: THE RED DRESSES
      NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
      (
      (X3::Y1) (X1::Y2)
      ((X1 def) = +)
      ((X1 status) =c absolute)
      ((X1 num) = (X3 num))
      ((X1 gen) = (X3 gen))
      (X0 = X1)
      )
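Rule {NP1,2} can be read as a guarded reordering: check the constraints on the noun phrase and adjective, then emit the adjective first. The sketch below hard-codes that one rule over plain dictionaries. It simplifies in two labelled ways: `=` is treated as "absent or equal" for `def` and as strict equality for agreement, rather than full unification, and `=c` is read as "the feature must already be present with this value". The function name and the English glosses in the usage lines are assumptions.

```python
def apply_np1_2(np1, adj):
    """Apply a simplified {NP1,2}: [NP1 ADJ] -> [ADJ NP1], or return None.

    `np1` and `adj` are dicts holding a target word under 'lex' plus
    source-side features.
    """
    # ((X1 def) = -): here, absent or equal to '-'.
    if np1.get("def", "-") != "-":
        return None
    # ((X1 status) =c absolute): must be present with exactly this value.
    if np1.get("status") != "absolute":
        return None
    # ((X1 num) = (X2 num)) and ((X1 gen) = (X2 gen)): strict equality here.
    if np1.get("num") != adj.get("num") or np1.get("gen") != adj.get("gen"):
        return None
    # (X2::Y1) (X1::Y2): adjective first, then the noun phrase.
    return [adj["lex"], np1["lex"]]


dress = {"lex": "DRESS", "def": "-", "status": "absolute", "num": "s", "gen": "f"}
red = {"lex": "RED", "num": "s", "gen": "f"}
print(apply_np1_2(dress, red))
```

For $MLH ADWMH this yields `['RED', 'DRESS']`. A definite noun phrase, or an adjective that disagrees in number or gender, makes the rule fail, leaving the decision to other rules such as {NP1,3}.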

  20. Sample Output (dev-data)
      maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

  21. Evaluation Results
      • Test set of 62 sentences from the Haaretz newspaper, 2 reference translations

  22. Current and Future Work
      • Issues specific to the Hebrew-to-English system:
        • Further improvements in the translation lexicon and morphological analyzer
        • Manual grammar development
        • Acquiring/training of word-to-word translation probabilities
        • Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
      • General issues related to the XFER framework:
        • Effective pruning during full lattice construction
        • Effective model for assigning scores to transfer rules
        • Extending the decoder to incorporate rule scores
        • Improved grammar learning

  23. Conclusions
      • Test case for the CMU XFER framework for rapid MT prototyping
      • Two-month, three-person effort; we were quite happy with the outcome
      • Core concept of XFER + Decoder is very powerful and promising
      • We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...

  24. Questions?

  25. Learning Transfer-Rules for Languages with Limited Resources
      • Rationale:
        • Large bilingual corpora not available
        • Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
        • Elicitation corpus designed to be typologically comprehensive and compositional
        • Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data

  26. English-Hindi Example

  27. Rule Learning - Overview
      • Goal: Acquire Syntactic Transfer Rules
      • Use available knowledge from the source side (grammatical structure)
      • Three steps:
        • Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
        • Compositionality: use previously learned rules to add hierarchical structure
        • Seeded Version Space Learning: refine rules by learning appropriate feature constraints
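The first step, flat seed generation, can be pictured as turning one word-aligned example into a rule with no hierarchy and no feature constraints. The slides only name the step, so the sketch below is one assumed reading of "flat": it takes the POS sequences of both sides and the word alignments and prints them in the rule notation used in this deck. The function name `flat_seed_rule` and its inputs are hypothetical.

```python
def flat_seed_rule(category, sl_pos, tl_pos, alignments):
    """Build a flat rule string from POS sequences and 1-based word alignments."""
    lhs = " ".join(sl_pos)
    rhs = " ".join(tl_pos)
    aligns = " ".join(f"(X{x}::Y{y})" for x, y in alignments)
    return f"{category}::{category} [{lhs}] -> [{rhs}] ({aligns})"


# The aligned pair "the old man" / "ha-ish ha-zaqen" from the formalism slides.
rule = flat_seed_rule(
    "NP",
    ["DET", "ADJ", "N"],
    ["DET", "N", "DET", "ADJ"],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)
print(rule)
```

This prints `NP::NP [DET ADJ N] -> [DET N DET ADJ] ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))`. The later steps would then add hierarchical structure and feature constraints on top of such a seed.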

  28. Flat Seed Rule Generation

  29. Compositionality

  30. Seeded Version Space Learning
