
NICE Machine Translation for Low-Density Languages


Presentation Transcript


  1. NICE Machine Translation for Low-Density Languages
  • NICE (Native Language Interpretation and Communication Environment) is a project for rapid development of machine translation for low and very low density languages.
  • Classification of MT by Language Density
    • High density pairs (E-F, E-S, E-J, …): statistical or traditional MT approaches are O.K.
    • Medium density (E-Czech, E-Croatian, …): example-based MT (success with Croatian, Korean); JHU reports initial success with statistical MT (Czech).
    • Low density (S-Mapudungun, E-Iñupiaq, …): 10,000 to 1 million speakers; insufficient bilingual corpora for SMT and EBMT; only partial corpus-based resources; insufficient trained computational linguists.
  • Machine Translation of Very Low Density Languages
    • No text in electronic form, so current methods for statistical MT cannot be applied.
    • No standard spelling or orthography.
    • Few literate native speakers.
    • Few linguists familiar with the language, so nobody is available to do rule-based MT.
    • Not enough money or time for years of linguistic information gathering and analysis.
    • E.g., Siona (Colombia).
  • Motivation for LDMT
    • Methods developed for languages with very scarce resources will generalize to all MT.
    • Policy makers can get input from indigenous people (e.g., has there been an epidemic or a crop failure?).
    • Indigenous people can participate in government, education, and the internet without losing their language.
    • First MT of polysynthetic languages.
  • New Ideas
    • MT without large amounts of text and without trained linguists.
    • Machine learning of rule-based MT.
    • A Multi-Engine architecture can flexibly take advantage of whatever resources are available (see the sketch after this slide).
    • Research partnerships with indigenous communities.
    • (Future: exponential models for data-miserly SMT.)
  • Approach
    • Machine learning from an uncontrolled corpus (Generalized Example-Based MT) and from a controlled corpus elicited from native speakers (Version Space Learning).
    • Multi-Engine MT: flexibly adapt to whatever resources are available and take advantage of the strengths of different MT approaches.
  • NICE Partners
    • Mapudungun (Chile)
    • Iñupiaq (Alaska)
    • Siona (Colombia)
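To make the Multi-Engine idea above concrete, here is a minimal sketch, assuming each engine exposes a translate-with-confidence interface; the function signature, the highest-confidence combination strategy, and the engine names in the usage comment are illustrative assumptions, not the NICE implementation.

from typing import Callable, Optional, Tuple

# Hypothetical type for a translation engine: returns (translation, confidence)
# or None if the engine cannot handle the input sentence.
Engine = Callable[[str], Optional[Tuple[str, float]]]

def translate_multi_engine(sentence: str, engines: list[Engine]) -> Optional[str]:
    """Run every available engine and keep the highest-confidence output."""
    best: Optional[Tuple[str, float]] = None
    for engine in engines:
        result = engine(sentence)   # e.g. rule-based transfer, EBMT, glossary lookup
        if result is None:
            continue
        translation, confidence = result
        if best is None or confidence > best[1]:
            best = (translation, confidence)
    return best[0] if best is not None else None

# Usage (engine functions would be supplied by whatever resources exist):
#   translate_multi_engine("I saw Joe's brother",
#                          [rule_based_engine, ebmt_engine, glossary_engine])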

  2. NICE
  • Elicitation Process
    • Purpose: controlled elicitation of data that will be the input to machine learning of translation rules.
  • Elicitation Interface
    • The native informant sees a source language sentence (in English or Spanish).
    • The informant types in a translation, then uses the mouse to add word alignments (a sketch of the resulting record follows this slide).
    • The informant is literate and bilingual, but not an expert in linguistics or computation.
  • The Elicitation Corpus
    • A list of sentences in a major language (English, Spanish).
    • Dynamically adaptable: different sentences are presented depending on what was previously elicited.
    • Compositional: Joe, Joe's brother, I saw Joe's brother, I told you that I saw Joe's brother, etc.
    • Aims for typological completeness: cover all types of languages.
  • Data Collection in Mapudungun
    • Spanish-Mapudungun parallel corpora: 223,366 total words.
    • Spanish-Mapudungun glossary: about 5,500 entries.
    • 40 hours of speech recorded; 6 hours of speech transcribed.
    • The speech data will be translated into Spanish.
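As a rough illustration of what the elicitation interface collects for each sentence, here is a minimal sketch; the ElicitedPair class and its field names are hypothetical, not the actual NICE data format, and the example reuses the Hebrew/English pair shown on the next slide.

from dataclasses import dataclass, field

@dataclass
class ElicitedPair:
    source: list[str]                  # tokenized major-language prompt (English or Spanish)
    target: list[str]                  # tokenized translation typed by the informant
    alignments: list[tuple[int, int]] = field(default_factory=list)  # (source idx, target idx)

    def aligned_words(self) -> list[tuple[str, str]]:
        """Return the word pairs the informant linked with the mouse."""
        return [(self.source[i], self.target[j]) for i, j in self.alignments]

# Example record, reusing the Hebrew/English pair from the next slide:
pair = ElicitedPair(
    source=["the", "big", "boy"],
    target=["ha-yeled", "ha-gadol"],
    alignments=[(1, 1), (2, 0)],       # big <-> ha-gadol, boy <-> ha-yeled
)
print(pair.aligned_words())            # [('big', 'ha-gadol'), ('boy', 'ha-yeled')]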

  3. NICE – Current Work
  • Instructible Knowledge-Based Machine Translation (iKBMT)
  • The Learning Process
    Learning instance:
      Hebrew:  ha-yeled ha-gadol
      English: the big boy
    Acquired transfer rule:
      Hebrew: NP: N ADJ <==> English: NP: the ADJ N
      ;; x-side constraints (Hebrew)
      (X1 def) = *+
      (X2 def) = *+
      (X0 = X1)
      ;; y-side constraints (English)
      (Y0 = Y3)
      ;; x-y constraints (Hebrew-English equivalence, constituent alignments)
      (X:ADJ <==> Y:ADJ)
      (X:N <==> Y:N)
  • Seeded Version Space Learning
    • SVS is based on Mitchell-style inductive version-space learning, but instead of keeping the full S and G boundaries for each concept, it starts from a seeded rule and grows it by generalization, specialization, and rule bifurcation as data is acquired incrementally (a sketch of these operators follows this slide).
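To make the Seeded Version Space description concrete, here is a minimal sketch, assuming a transfer rule is stored as constituent patterns plus a set of feature-constraint strings; the TransferRule class and the generalize/specialize/bifurcate functions are simplified illustrations of the named operators, not the iKBMT code.

from dataclasses import dataclass

@dataclass
class TransferRule:
    x_pattern: tuple[str, ...]                 # source-side constituents, e.g. ("N", "ADJ")
    y_pattern: tuple[str, ...]                 # target-side constituents, e.g. ("the", "ADJ", "N")
    constraints: frozenset[str] = frozenset()  # feature constraints, e.g. "(X1 def) = *+"

def generalize(rule: TransferRule, positive: TransferRule) -> TransferRule:
    """Keep only the constraints that a new positive example also satisfies."""
    return TransferRule(rule.x_pattern, rule.y_pattern,
                        rule.constraints & positive.constraints)

def specialize(rule: TransferRule, extra_constraint: str) -> TransferRule:
    """Add a constraint so the rule stops covering a negative example."""
    return TransferRule(rule.x_pattern, rule.y_pattern,
                        rule.constraints | {extra_constraint})

def bifurcate(rule: TransferRule, new_seed: TransferRule) -> list[TransferRule]:
    """When no single rule covers both old and new data, keep the old rule and add a new seed."""
    return [rule, new_seed]

# The seed for the Hebrew/English example above, written as a maximally specific rule:
seed = TransferRule(
    ("N", "ADJ"), ("the", "ADJ", "N"),
    frozenset({"(X1 def) = *+", "(X2 def) = *+", "(X0 = X1)", "(Y0 = Y3)"}),
)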
