
Machine Translation for Indigenous Languages

Jaime Carbonell, Lori Levin, Alon Lavie. Language Technologies Institute, Carnegie Mellon University. {jgc, lsl, alavie}@cs.cmu.edu. Context: Project NICE. Very low-density languages (e.g. Mapudungun, Inupiaq, Siona,…)


Presentation Transcript


  1. Machine Translation for Indigenous Languages
     Jaime Carbonell, Lori Levin, Alon Lavie
     Language Technologies Institute, Carnegie Mellon University
     {jgc, lsl, alavie}@cs.cmu.edu
     NSF, August 6, 2001

  2. Context: Project NICE
     • Very low-density languages (e.g. Mapudungun, Inupiaq, Siona, …)
     • Minimal amount of parallel text (< 100K words)
     • No standard orthography/spelling
     • No available trained linguists
     • Access to native informants possible
     • Minimize development time and cost
     • Target: functional but rudimentary MT

  3. Two Technical Approaches
     • Generalized EBMT
       • Parallel text 50K-2MB (uncontrolled corpus)
       • Rapid implementation
       • Proven for major L’s with reduced data
     • Transfer-rule learning
       • Elicitation (controlled) corpus to extract grammatical properties
       • Seeded version-space learning

  4. Architecture Diagram
     [Figure. Labels: SL Input, Run-Time Module, Learning Module, SL Parser, EBMT Engine, Elicitation Process, SVS Learning Process, Transfer Rules, Transfer Engine, TL Generator, User, Unifier Module, TL Output]

  5. EBMT Example
     English: I would like to meet her.
     Mapudungun: Ayükefun trawüael fey engu.
     English: The tallest man is my father.
     Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
     English: I would like to meet the tallest man
     Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
     Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
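
The recombination step in the example above can be sketched as a greedy longest-match lookup over stored fragments. This is a minimal illustration, not the NICE EBMT engine: the fragment alignments in `EXAMPLES` are assumed (the slide gives only whole-sentence pairs), and `translate_by_fragments` is a hypothetical helper.

```python
# Toy example base. The sub-sentence alignments are assumed for illustration;
# the slide only gives full sentence pairs.
EXAMPLES = [
    ("i would like to meet", "Ayükefun trawüael"),
    ("the tallest man", "chi doy fütra chi wentru"),
]

def translate_by_fragments(sentence, examples):
    """Cover the input left to right with the longest stored source fragments."""
    words = sentence.lower().rstrip(".").split()
    out, i = [], 0
    while i < len(words):
        # Try the longest remaining chunk first, then shorter ones.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            match = next((tgt for src, tgt in examples if src == chunk), None)
            if match is not None:
                out.append(match)
                i = j
                break
        else:
            out.append(f"<{words[i]}>")  # mark a word with no stored fragment
            i += 1
    return " ".join(out)

print(translate_by_fragments("I would like to meet the tallest man", EXAMPLES))
```

Plain concatenation reproduces the slide's "new" output, which differs from the correct translation in verb form, the dropped second "chi", and the final "engu". That gap is what the boundary handling in a real EBMT system has to close.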

  6. Elicitation of Data for Seeded Version Space Learning

  7. Example: Elicitation Corpus
     English: I fell. Spanish: Caí. Mapudungun: Tranün
     English: I am falling. Spanish: Estoy cayendo. Mapudungun: Tranmeken
     English: You (John) fell. Spanish: Tu (Juan) caiste. Mapudungun: Eymi tranimi (Kuan)
     English: You (John) are falling. Spanish: Tu (Juan) estás cayendo. Mapudungun: Eimi (Kuan) tranmekeymi
     English: You (Mary) fell. Spanish: Tu (María) caiste. Mapudungun: Eymi tranimi (Maria)
     English: You (Mary) are falling. Spanish: Tu (María) estás cayendo. Mapudungun: Eimi tranmekeymi (Maria)
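
One way to read such elicited pairs is as minimal contrasts: two prompts differ in a single feature, and the informant's answers either differ or they do not. The sketch below is an assumption about how that comparison could be done, not the project's feature detector; `contrast` is a hypothetical helper that splits two forms into a shared prefix and their differing remainders.

```python
import os

def contrast(form_a, form_b):
    """Return (shared prefix, rest of form_a, rest of form_b)."""
    prefix = os.path.commonprefix([form_a, form_b])
    return prefix, form_a[len(prefix):], form_b[len(prefix):]

# Tense/aspect contrast, first person: "I fell" vs. "I am falling".
print(contrast("Tranün", "Tranmeken"))

# Gender contrast, second person: "You (John) fell" vs. "You (Mary) fell".
# Identical verb forms suggest gender is not marked on the verb here.
print(contrast("tranimi", "tranimi"))
```

An empty pair of remainders is weak evidence that the varied feature is not marked on that word. A non-empty pair gives candidate affixes to check against further elicited sentences.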

  8. The Elicitation Corpus
     • List of sentences in a major language
       • English
       • Spanish
     • Dynamically adaptable
       • Different sentences are presented depending on what was previously elicited
     • Compositional
       • Joe, Joe’s brother, I saw Joe’s brother, I told you that I saw Joe’s brother, etc.
     • Aim for typological completeness
       • Cover all types of languages

  9. Version Space Learning
     • Symbolic learning from + and – examples
     • Invented by Mitchell, refined by Hirsh
     • Builds generalization lattice implicitly
     • Bounded by G and S sets
     • Worst-case exponential complexity (in size of G and S)
     • Slow convergence rate
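
The role of the S and G boundary sets can be illustrated with a simplified candidate-elimination loop over conjunctive attribute-value hypotheses. This is a sketch under stated assumptions: the feature tuples (subject number, verb number, tense) are invented for illustration, S is kept as a single hypothesis, and non-maximal members of G are not pruned.

```python
WILDCARD = "?"

def covers(h, x):
    """True if hypothesis h matches example x."""
    return all(a == WILDCARD or a == b for a, b in zip(h, x))

def generalize(s, x):
    """Minimal generalization of s that also covers positive example x."""
    return tuple(a if a == b else WILDCARD for a, b in zip(s, x))

def specialize(g, s, x):
    """Minimal specializations of g that exclude negative x but still cover s."""
    return [g[:i] + (s[i],) + g[i + 1:]
            for i in range(len(g))
            if g[i] == WILDCARD and s[i] != WILDCARD and s[i] != x[i]]

def candidate_elimination(examples):
    """examples: list of (features, is_positive); the first must be positive."""
    first, label = examples[0]
    assert label, "this sketch seeds S from a positive example"
    S = tuple(first)
    G = [tuple(WILDCARD for _ in first)]
    for x, positive in examples[1:]:
        if positive:
            S = generalize(S, x)
            G = [g for g in G if covers(g, x)]
        else:
            new_G = []
            for g in G:
                new_G.extend(specialize(g, S, x) if covers(g, x) else [g])
            G = list(dict.fromkeys(new_G))  # drop duplicates, keep order
    return S, G

# Hypothetical features: (subject number, verb number, tense).
data = [
    (("sg", "sg", "past"), True),
    (("sg", "sg", "pres"), True),
    (("pl", "sg", "past"), False),
]
S, G = candidate_elimination(data)
print(S)  # ('sg', 'sg', '?')
print(G)  # [('sg', '?', '?')]
```

Every hypothesis between S and G in the lattice is still consistent with the data, which is why convergence needs many examples.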

  10. Seeded Version Spaces
      • Generate concept seed from first + example
        • Generalization-level hypothesis (POS + feature agreement for T-rules in NICE)
      • Generalization/specialization level bounds
        • Up to k levels of generalization, and up to j levels of specialization
      • Implicit lattice explored seed-outwards
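
The seed-outward exploration can be sketched for the generalization direction only. The assumptions here are that a hypothesis is a tuple of feature values, that the single generalization operator replaces one value with a wildcard, and that `seeded_search` is a hypothetical name. The j-level specialization search would mirror this and is omitted.

```python
from itertools import combinations

WILDCARD = "?"

def covers(h, x):
    """True if hypothesis h matches example x."""
    return all(a == WILDCARD or a == b for a, b in zip(h, x))

def generalizations_up_to(seed, k):
    """Yield hypotheses reachable from seed by wildcarding at most k features,
    ordered by level (level 0 is the seed itself)."""
    for level in range(k + 1):
        for idxs in combinations(range(len(seed)), level):
            yield tuple(WILDCARD if i in idxs else v for i, v in enumerate(seed))

def seeded_search(seed, positives, negatives, k):
    """Hypotheses within k levels of the seed that cover every positive
    example and no negative example."""
    return [h for h in generalizations_up_to(seed, k)
            if all(covers(h, p) for p in positives)
            and not any(covers(h, n) for n in negatives)]

# Hypothetical features: (subject number, verb number, tense).
seed = ("sg", "sg", "past")
found = seeded_search(seed,
                      positives=[("sg", "sg", "pres")],
                      negatives=[("pl", "sg", "past")],
                      k=2)
print(found)  # [('sg', 'sg', '?'), ('sg', '?', '?')]
```

Capping the level at k keeps the number of candidates small regardless of how large the full lattice is.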

  11. Complexity of SVS
      • O(g^k) upward search, where g = # of generalization operators
      • O(s^j) downward search, where s = # of specialization operators
      • Since j and k are constants, SVS runs in polynomial time of order max(j,k)
      • Convergence rates bounded by F(j,k)
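
A worked check of the upward bound, reading it as g to the power k. The slide does not give the counting argument, so this is one plausible reading: it assumes each hypothesis admits at most g generalization operators and the search stops after k levels.

```latex
\[
\sum_{i=0}^{k} g^{i} \;=\; \frac{g^{k+1}-1}{g-1} \;=\; O\!\left(g^{k}\right)
\qquad (g > 1,\ k \text{ fixed})
\]
```

For example, g = 5 and k = 3 give at most 1 + 5 + 25 + 125 = 156 hypotheses. The downward search is bounded the same way with s and j, so for fixed j and k the total is polynomial in g and s with degree max(j,k).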

  12. NICE Partners

  13. Pilot Version of Elicitation Corpus
      • Approximately 800 sentences
      • Tested on Swahili and Mapudungun
      • Vocabulary
        • Include a variety of semantic classes, e.g., animate, inanimate, man-made objects, natural objects, etc.
      • Noun phrases
        • Detect number, gender, types of possessives, classifiers, etc.
      • Basic sentences
        • Detect agreement between verb and subject and/or object, basic word order, problems with indefinite or inanimate subjects, etc.
      • Complex constructions
        • Currently relative clauses. Later, comparatives, questions, embedded clauses, etc.

  14. Detection of Grammatical Features
      • Each language uses a different inventory of grammatical features: tense, number, person, agreement.
      • Swahili
        English: The hunter kill-ed the animal
        Swahili: Mwindaji a-li-mu-ua mnyama
        a = class-one subject; li = past tense; mu = class-one object; ua = kill
      • Fox (Algonquian)
        Ne-waapam-aa-wa: I-see-direct-him
        Ne-waapam-ek-wa: me-see-indirect-he
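
The Swahili gloss above can be represented as a small morpheme table. This is only a data-structure sketch: the table holds the four glosses given on the slide, the input is assumed to be pre-segmented with hyphens, and `gloss` is a hypothetical helper, not part of the project's PC Kimmo morphology.

```python
# Morpheme glosses copied from the slide's Swahili example.
GLOSSES = {
    "a": "class-one subject",
    "li": "past tense",
    "mu": "class-one object",
    "ua": "kill",
}

def gloss(segmented_word, table=GLOSSES):
    """Pair each hyphen-separated morpheme with its gloss ('?' if unknown)."""
    return [(m, table.get(m, "?")) for m in segmented_word.split("-")]

for morpheme, meaning in gloss("a-li-mu-ua"):
    print(f"{morpheme}\t{meaning}")
```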

  15. Mapudungun Data for EBMT
      • Spanish-Mapudungun parallel corpora
        • Total words: 223,366
        • Bilingual newspaper, 4 issues
        • Ultimas Familias (memoirs)
        • Memorias de Pascual Coña
          • A publishable version of a historical text with a new translation into Spanish
      • 35 hours transcribed speech (will be translated into Spanish)
      • 80 hours recorded speech
      • Spanish-Mapudungun glossary
        • About 5500 entries

  16. NICE/Mapudungun: Other Products
      • Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.
      • Training for spoken language collection: In January 2001 native speakers of Mapudungun were trained in the recording and transcription of spoken data.

  17. Summary of Results: iRBMT
      • Preliminary design and implementation of transfer rule formalism for machine translation.
      • Design and pilot testing of prototype elicitation corpus.
      • First prototype of feature detection.
      • Morphological processing in PC Kimmo covering about 40 Mapudungun morphemes.
      • Preliminary version of new parser for run-time translation component.

  18. Next Steps (original plan)
      • Lexical and phrasal generalization for EBMT
      • Complete implementation of transfer-rule interpreter
      • Implementation of SVS to learn transfer rules
      • Extend elicitation corpus for evaluation
      • Evaluate first on Mapudungun MT

  19. DARPA Redirection for NICE
      • Focus on technology for rapid deployment of MT for new (low density) languages.
      • Not interested in indigenous endangered L’s
        • Somali, Kirgistani, Bahasa => yes
        • Siona, US-indigenous, Mapudungun => no
      • First focus on limited-data evaluation for major L’s, such as Chinese & Arabic
      • Statistical methods favored over linguistic.
