
NICE: Native language Interpretation and Communication Environment





  1. NICE: Native language Interpretation and Communication Environment Lori Levin, Jaime Carbonell, Alon Lavie, Ralf Brown, Erik Peterson, Kathrin Probst, Rodolfo Vega, Hal Daume Language Technologies Institute Carnegie Mellon University April 12, 2001

  2. Example of Feature Detection • Detect features using minimal pairs • E.g., to detect plural: • “The man slept.” • “The men slept.” • In the translations, compare the words for “man” and “men”.
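The comparison described on this slide can be sketched in a few lines. This is a minimal illustration only; the Spanish translations below are an assumption for the example, not the project's target language.

```python
# Minimal-pair feature detection, sketched: translate both members of the
# pair and collect the word positions where the translations differ.
def differing_words(translation_a, translation_b):
    """Return the word pairs that differ between two translations of a
    minimal pair -- the candidate carriers of the contrasting feature."""
    return [(a, b)
            for a, b in zip(translation_a.split(), translation_b.split())
            if a != b]

# English minimal pair: "The man slept." / "The men slept."
# Hypothetical Spanish translations for illustration:
diffs = differing_words("El hombre durmió.", "Los hombres durmieron.")
print(diffs)  # article, noun, and verb all change, so all mark plural
```

Here every word differs, which itself is informative: Spanish marks plural on the article, the noun, and the verb.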

  3. Definition of the Problem • Machine Translation of very low density languages • No text in electronic form • Can’t apply current methods for statistical MT • Few literate native speakers • No standard spelling or orthography • Few linguists familiar with the language • Nobody is available to do traditional knowledge based MT • Not enough money or time for years of development

  4. Impact • Rapid development of machine translation for languages with very scarce resources • Policy makers can get input from indigenous people • E.g., has there been an epidemic or a crop failure? • Indigenous people can participate in government, education, and the internet without losing their language • Possibly the first MT of polysynthetic languages

  5. New Ideas • Machine learning of knowledge–based rules without large amounts of text and without trained linguists. • Multi-Engine architecture can flexibly take advantage of whatever resources are available. • Research partnerships with indigenous communities.

  6. History of NICE • Arose from a series of joint workshops of NSF and OAS-CICAD • Recommendations: create multinational projects using information technology to: • provide immediate benefits to governments and citizens • develop critical infrastructure for communication and collaborative research • train researchers and engineers • advance science and technology

  7. Approach • Multi-Engine MT • Flexibly adapt to whatever resources are available • Take advantage of the strengths of different MT approaches • Machine learning • Uncontrolled corpus (General Example-Based MT) • Controlled corpus elicited from native speakers (Version Space Learning)
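One way to picture "flexibly adapt to whatever resources are available" is an ordered fallback over engines. This is a deliberately simplified sketch (real multi-engine MT combines hypotheses from all engines rather than taking the first answer); the engine names and behavior are placeholders.

```python
# Simplified multi-engine sketch: try each available engine in order of
# expected quality and return the first hypothesis produced.
def translate(sentence, engines):
    """Ask each engine in turn; fall through to the next on failure."""
    for engine in engines:
        hypothesis = engine(sentence)
        if hypothesis is not None:
            return hypothesis
    return sentence  # last resort: pass the input through untranslated

# Toy engines: a transfer-rule engine that covers one phrase, then a
# catch-all example-based (EBMT) stub.
rule_engine = lambda s: "the big boy" if s == "ha-yeled ha-gadol" else None
ebmt_engine = lambda s: "<ebmt hypothesis for: %s>" % s

print(translate("ha-yeled ha-gadol", [rule_engine, ebmt_engine]))
```

The design point is that engines with different resource requirements (learned rules vs. raw parallel text) can coexist behind one interface.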

  8. Evaluation • To achieve a given level of translation quality for a series of languages L1 to Ln • Reduce the amount of required data • Reduce the amount of development time

  9. Evaluation Baseline (Generalized EBMT) • High density language (French) • 1MW parallel corpus (subset of Hansards) • Consistent spelling • Corpus is grammatically correct in both languages

  10. Evaluation Baseline • GEBMT • French Hansards

  11. Evaluation Goal

  12. Progress to Date Overview • Establishing partnerships • Example-Based MT • Collection of data • Standardizing spelling and orthography • Instructible Knowledge-Based MT • Elicitation corpus • Elicitation interface • Feature detection

  13. Establishing Partnerships

  14. NICE Partners

  15. Agreement between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile • Contributions of IEI • Socio-linguistic knowledge • Linguistic knowledge • Experience in multicultural bilingual education • The use of IEI facilities, faculty/researchers, and staff for the project • Electronic network support and computer technical support

  16. The IEI Team • Coordinator (leader of a bilingual and multicultural education project) • Distinguished native speaker • Linguists (one native speaker, one near-native) • Typists/Transcribers • Recording assistants • Translators • Native speaker linguistic informants

  17. Agreement between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile • Contributions of LTI • Equipment: four computers and four DAT recorders • Payment of consulting fees pending funding from the Chilean Ministry of Education • Expertise in language technologies

  18. LTI/IEI Agreement • Cooperate in expanding the project into convergent areas, such as bilingual education, as well as in pursuing additional funding

  19. MINEDUC/IEI Agreement Highlights • Introduction: Based on the LTI/IEI agreement, both institutions involved the Chilean Ministry of Education in funding the data collection and processing team for the year 2001. This agreement will be renewed each year, as needed.

  20. MINEDUC/IEI Agreement • Objectives: • To evaluate the NICE/Mapudungun writing conventions proposal • To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health care, both traditional and Western

  21. MINEDUC/IEI Agreement • Products: • An oral corpus of 800 recorded hours, proportional to the demography of each currently spoken dialect • 120 hours transcribed and translated from Mapudungun to Spanish • A refined proposal for writing Mapudungun

  22. NICE/Mapudungun: Current Products • Writing conventions (Grafemario) • Mapudungun/Spanish vocabulary (8,715) • Bilingual newspaper, 4 issues (19,647) • Últimas Familias, memoirs (25,289) • Memorias de Pascual Coña (76,311) • 6 hours transcribed • 40 hours recorded

  23. Instructible Knowledge-Based MT

  24. Grammar Acquisition Tool

  25. Grammar Acquisition Tool

  26. A Noun Phrase Learning Instance and Transfer Rule
  Learning Instance:
    English: the big boy
    Hebrew: ha-yeled ha-gadol
  Acquired Transfer Rule:
    Hebrew: NP: N ADJ <==> English: NP: the ADJ N
    where: (Hebrew:N <==> English:N)
           (Hebrew:ADJ <==> English:ADJ)
           (Hebrew:N has ((def +)))
           (Hebrew:ADJ has ((def +)))
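To make the acquired rule concrete, here is a toy application of it. The dictionary representation and feature check are illustrative assumptions, not the project's actual rule formalism.

```python
# Toy application of the acquired transfer rule:
#   Hebrew NP: N ADJ, both ((def +))  <==>  English NP: the ADJ N
def apply_np_rule(noun, adj):
    """Transfer a Hebrew [N ADJ] noun phrase into English word order,
    inserting 'the' when both words carry the (def +) feature."""
    if noun["def"] and adj["def"]:
        return "the {} {}".format(adj["gloss"], noun["gloss"])
    return "{} {}".format(adj["gloss"], noun["gloss"])

noun = {"surface": "ha-yeled", "gloss": "boy", "def": True}
adj = {"surface": "ha-gadol", "gloss": "big", "def": True}
print(apply_np_rule(noun, adj))  # -> the big boy
```

Note how the rule both reorders the constituents (N ADJ becomes ADJ N) and maps a morphological feature (the repeated definite prefix ha-) to an English function word.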

  27. Grammar Acquisition Tool

  28. A Noun Phrase Learning Instance and Transfer Rule
  Learning Instance:
    English: the big boy
    Hebrew: ha-yeled ha-gadol
  Acquired Transfer Rule:
    Hebrew: NP: N ADJ <==> English: NP: the ADJ N
    where: (Hebrew:N <==> English:N)
           (Hebrew:ADJ <==> English:ADJ)
           (Hebrew:N has ((def +)))
           (Hebrew:ADJ has ((def +)))

  29. Version Space Abstraction Lattice

  30. The Elicitation Corpus • Dynamically adaptable list of sentences in a major language • Compositional • Vocabulary, noun phrases, basic sentences, complex constructions • Pilot version: 800 sentences, tested on Swahili

  31. Elicitation Tool Purpose • Provides a simple, intuitive interface for translation and alignment of elicitation corpus • Output from tool is used in version space learning

  32. Eliciation Tool – Ideas, Goals • Central ideas: • User: bilingual speaker, not an expert in linguistics • This user translates sentences • The user also specifies word alignment • Aimed at covering major typological linguistic features (Swadesh list) • Goals: • Learn grammar rules • Use these rules to automatically learn transfer rules

  33. Elicitation Tool Usage • User sees source language sentence (in English or Spanish) • User types in translation, then uses mouse to add word alignments • Alignments are indicated with lines between matching words
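The tool's output for the learner might look like the record below. The field names, the index-pair encoding of the drawn lines, and the Spanish translation are all assumptions for illustration; the slides do not specify the actual output format.

```python
# Hedged sketch of one aligned record produced by the elicitation session.
record = {
    "source": "The men slept.",
    "target": "Los hombres durmieron.",   # the user's typed translation
    # each user-drawn line as a (source_word_index, target_word_index) pair
    "alignment": [(0, 0), (1, 1), (2, 2)],
}

def aligned_pairs(rec):
    """Expand the index links back into the actual word pairs."""
    src, tgt = rec["source"].split(), rec["target"].split()
    return [(src[i], tgt[j]) for i, j in rec["alignment"]]

print(aligned_pairs(record))
```

Index pairs rather than word pairs let one source word link to several target words (or none), which matters for a polysynthetic target language.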

  34. Elicitation Tool Example

  35. Detection of Grammatical Features • Number, case, agreement, etc.

  36. Simplifying Assumptions • Features are marked only on a single word • No mechanism yet for checking changes in word order or changes on other words in the sentence • One-to-one correspondence between source language and target language words • Pre-defined features

  37. Organization of Tests [diagram: diagnostic tests organized by grammatical feature, e.g. Plural with dependents Dual and Paucal; Subj-V Agr; …]

  38. Optimization – Diagnostic Planning • For optimization, minimize the number of sentences that need to be translated • Base the sequencing and choice of tests on the results of previous tests • E.g., if the corpus detects that a language has no feature “plural”, prune the tests for “dual”, “paucal”, etc.
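The pruning idea on this slide can be sketched as a prerequisite table over tests. The table below is a hypothetical stand-in for the actual test organization, chosen to match the slide's example.

```python
# Diagnostic pruning sketch: each test may list a prerequisite feature;
# a test runs only if its prerequisite was detected (or it has none).
PREREQUISITE = {"plural": None, "dual": "plural", "paucal": "plural"}

def next_tests(detected):
    """Return the tests still worth running given detection results so far."""
    return [t for t, prereq in PREREQUISITE.items()
            if prereq is None or detected.get(prereq, False)]

print(next_tests({"plural": False}))  # "dual" and "paucal" are pruned
print(next_tests({"plural": True}))   # all three tests remain
```

Sequencing tests this way trades a fixed elicitation corpus for an adaptive one, which is exactly where the translation savings come from.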

  39. System Implementation • For each state: a sequence of tests • Tests specify the indices of sentences • Separate file with the list of sentences (facilitates re-use of sentences across multiple tests) • Ability to run in batch and interactive modes
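The indexed-sentence scheme might look like the sketch below. The file layout and names are assumptions; the point is only that tests reference a shared sentence list by index, so one sentence can serve several tests.

```python
# Shared sentence list plus tests that reference sentences by index.
SENTENCES = ["The man slept.", "The men slept.", "The two men slept."]
TESTS = {
    "plural": [0, 1],   # minimal pair for plural
    "dual":   [1, 2],   # index 1 is reused by both tests
}

def sentences_for(test_name):
    """Materialize the sentences a given diagnostic test needs."""
    return [SENTENCES[i] for i in TESTS[test_name]]

print(sentences_for("dual"))
```

Because the indices are data rather than code, the same driver can run a whole battery of tests in batch mode or present them one at a time interactively.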

  40. Morphology
  • mUtrUm.e.lu.mu eymi, amu.la.y.mi
    call.IDO.SVN.DS you, go.NEG.IND.2s
    “when they called you, you did not go”
  • fey.pi.a.l.mi, witra.kUnu.w.a.y.mi
    that.say.PROG.COND.2s get_up.PFPS.REF.NRLD.IND.2s
    “if you are going to speak, you must stand up”
  • ohafo.a.fu.y.mi wUtre.mew
    catch_a_cold.NRLD.IPD.IND.2s cold.INSTR
    “you might catch a cold because of the cold”
  • sungu.a.fi.y.mi
    speak.NRLD.EDO.IND.2s
    “you must speak with him”
  • pUntU.ke.nie.w.Uy.ng.u
    apart.DISTR.PRPS.REF.IND.3sn.d
    “are they apart from each other?”
  • fey.pi.nge.r.pa.n
    that_say.PASS.ITR.Hh.IND.1s
    “on my way here I was told”

  41. Morphology
  • kudu.le.me.we.la.n
    lay_down.ST.Hh.REM.NEG.IND.1s
    “I am not going to lie down there any more”
  • illku.faluw.kUle.n
    get_angry.SIM.ST.IND.1s
    “I am pretending to be angry”
  • antU.kUdaw.kiaw.ke.rke.fu.y
    day.work.CIRC.CF.REP.IPD.IND.3s
    “he used to work here and there as a day laborer, I am told”
  • wisa.ka.dungu.fe.nge.y.mi
    bad.VERB.FAC.speak.NOM.VERB.IND.2s
    “you are someone who always does and says nasty things”
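As a reading aid for the interlinear notation on these slides: each surface form is paired with a gloss line whose tags are separated by “.”, so zipping the two segmentations recovers the morpheme/tag pairs. (This snippet is only a notation aid, not part of the NICE system.)

```python
# Pair each Mapudungun morpheme with its gloss tag for one slide example.
surface = "amu.la.y.mi"
gloss = "go.NEG.IND.2s"
pairs = list(zip(surface.split("."), gloss.split(".")))
print(pairs)  # morpheme/tag pairs: root, negation, mood, person/number
```

The long verb forms above show why this matters for MT: a single Mapudungun word can pack in what English spreads over a whole clause, which is the polysynthesis challenge mentioned earlier.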

  42. Example Based MT: Data Collection

  43. Parallel Text Data • Spanish-Mapudungun parallel corpora • Total words: 223,366 • Spanish-Mapudungun glossary • About 5,500 entries

  44. Goals for Year 2 • From the statement of work

  45. Future Projects • Discussion
