
NICE: Native language Interpretation and Communication Environment





  1. NICE: Native language Interpretation and Communication Environment Lori Levin, Jaime Carbonell, Alon Lavie, Ralf Brown, Erik Peterson, Kathrin Probst, Rodolfo Vega, Hal Daume Language Technologies Institute Carnegie Mellon University April 12, 2001

  2. Example of Feature Detection • Detect features using minimal pairs • E.g., to detect plural: • “The man slept.” • “The men slept.” • In the translations, compare the words for “man” and “men”.
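The comparison described on this slide can be sketched in a few lines. This is a minimal illustration only; the Spanish translations below are an assumption for the example, not the project's target language.

```python
# Minimal-pair feature detection, sketched: translate both members of the
# pair and collect the word positions where the translations differ.
def differing_words(translation_a, translation_b):
    """Return the word pairs that differ between two translations of a
    minimal pair -- the candidate carriers of the contrasting feature."""
    return [(a, b)
            for a, b in zip(translation_a.split(), translation_b.split())
            if a != b]

# English minimal pair: "The man slept." / "The men slept."
# Hypothetical Spanish translations for illustration:
diffs = differing_words("El hombre durmió.", "Los hombres durmieron.")
print(diffs)  # article, noun, and verb all change, so all mark plural
```

Here every word differs, which itself is informative: Spanish marks plural on the article, the noun, and the verb.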

  3. Definition of the Problem • Machine Translation of very low density languages • No text in electronic form • Can’t apply current methods for statistical MT • Few literate native speakers • No standard spelling or orthography • Few linguists familiar with the language • Nobody is available to do traditional knowledge based MT • Not enough money or time for years of development

  4. Impact • Rapid development of machine translation for languages with very scarce resources • Policy makers can get input from indigenous people • E.g., has there been an epidemic or a crop failure? • Indigenous people can participate in government, education, and the internet without losing their language • Possibly the first MT of polysynthetic languages

  5. New Ideas • Machine learning of knowledge–based rules without large amounts of text and without trained linguists. • Multi-Engine architecture can flexibly take advantage of whatever resources are available. • Research partnerships with indigenous communities.

  6. History of NICE • Arose from a series of joint workshops of NSF and OAS-CICAD • Recommendations: create multinational projects using information technology to: • provide immediate benefits to governments and citizens • develop critical infrastructure for communication and collaborative research • train researchers and engineers • advance science and technology

  7. Approach • Multi-Engine MT • Flexibly adapt to whatever resources are available • Take advantage of the strengths of different MT approaches • Machine learning • Uncontrolled corpus (General Example-Based MT) • Controlled corpus elicited from native speakers (Version Space Learning)
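One way to picture "flexibly adapt to whatever resources are available" is an ordered fallback over engines. This is a deliberately simplified sketch (real multi-engine MT combines hypotheses from all engines rather than taking the first answer); the engine names and behavior are placeholders.

```python
# Simplified multi-engine sketch: try each available engine in order of
# expected quality and return the first hypothesis produced.
def translate(sentence, engines):
    """Ask each engine in turn; fall through to the next on failure."""
    for engine in engines:
        hypothesis = engine(sentence)
        if hypothesis is not None:
            return hypothesis
    return sentence  # last resort: pass the input through untranslated

# Toy engines: a transfer-rule engine that covers one phrase, then a
# catch-all example-based (EBMT) stub.
rule_engine = lambda s: "the big boy" if s == "ha-yeled ha-gadol" else None
ebmt_engine = lambda s: "<ebmt hypothesis for: %s>" % s

print(translate("ha-yeled ha-gadol", [rule_engine, ebmt_engine]))
```

The design point is that engines with different resource requirements (learned rules vs. raw parallel text) can coexist behind one interface.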

  8. Evaluation • To achieve a given level of translation quality for a series of languages L1 to Ln • Reduce the amount of required data • Reduce the amount of development time

  9. Evaluation Baseline (Generalized EBMT) • High density language (French) • 1MW parallel corpus (subset of Hansards) • Consistent spelling • Corpus is grammatically correct in both languages

  10. Evaluation Baseline • GEBMT • French Hansards

  11. Evaluation Goal

  12. Progress to Date Overview • Establishing partnerships • Example-Based MT • Collection of data • Standardizing spelling and orthography • Instructible Knowledge-Based MT • Elicitation corpus • Elicitation interface • Feature detection

  13. Establishing Partnerships

  14. NICE Partners

  15. Agreement between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile • Contributions of IEI • Socio-linguistic knowledge • Linguistic knowledge • Experience in multicultural bilingual education • The use of IEI facilities, faculty/researchers, and staff for the project • Electronic network support and computer technical support

  16. The IEI Team • Coordinator (leader of a bilingual and multicultural education project) • Distinguished native speaker • Linguists (one native speaker, one near-native) • Typists/Transcribers • Recording assistants • Translators • Native speaker linguistic informants

  17. Agreement between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile • Contributions of LTI • Equipment: four computers and four DAT recorders • Payment of consulting fees pending funding from the Chilean Ministry of Education • Expertise in language technologies

  18. LTI/IEI Agreement • Cooperate in expanding the project into convergent areas, such as bilingual education, as well as in pursuing additional funding

  19. MINEDUC/IEI Agreement Highlights • Introduction: Based on the LTI/IEI agreement, both institutions involved the Chilean Ministry of Education in funding the data collection and processing team for the year 2001. This agreement will be renewed each year, as needed.

  20. MINEDUC/IEI Agreement • Objectives: • To evaluate the NICE/Mapudungun writing conventions proposal • To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health care, both traditional and Western

  21. MINEDUC/IEI Agreement • Products: • An oral corpus of 800 recorded hours, proportional to the demography of each currently spoken dialect • 120 hours transcribed and translated from Mapudungun to Spanish • A refined proposal for writing Mapudungun

  22. NICE/Mapudungun: Current Products • Writing conventions (Grafemario) • Mapudungun/Spanish vocabulary (8,715) • Bilingual newspaper, 4 issues (19,647) • Últimas Familias, memoirs (25,289) • Memorias de Pascual Coña (76,311) • 6 hours transcribed • 40 hours recorded

  23. Instructible Knowledge-Based MT

  24. Grammar Acquisition Tool

  25. Grammar Acquisition Tool

  26. A Noun Phrase Learning Instance and Transfer Rule
  Learning Instance:
    English: the big boy
    Hebrew: ha-yeled ha-gadol
  Acquired Transfer Rule:
    Hebrew: NP: N ADJ <==> English: NP: the ADJ N
    where: (Hebrew:N <==> English:N)
           (Hebrew:ADJ <==> English:ADJ)
           (Hebrew:N has ((def +)))
           (Hebrew:ADJ has ((def +)))
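To make the acquired rule concrete, here is a toy application of it. The dictionary representation and feature check are illustrative assumptions, not the project's actual rule formalism.

```python
# Toy application of the acquired transfer rule:
#   Hebrew NP: N ADJ, both ((def +))  <==>  English NP: the ADJ N
def apply_np_rule(noun, adj):
    """Transfer a Hebrew [N ADJ] noun phrase into English word order,
    inserting 'the' when both words carry the (def +) feature."""
    if noun["def"] and adj["def"]:
        return "the {} {}".format(adj["gloss"], noun["gloss"])
    return "{} {}".format(adj["gloss"], noun["gloss"])

noun = {"surface": "ha-yeled", "gloss": "boy", "def": True}
adj = {"surface": "ha-gadol", "gloss": "big", "def": True}
print(apply_np_rule(noun, adj))  # -> the big boy
```

Note how the rule both reorders the constituents (N ADJ becomes ADJ N) and maps a morphological feature (the repeated definite prefix ha-) to an English function word.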

  27. Grammar Acquisition Tool

  28. A Noun Phrase Learning Instance and Transfer Rule
  Learning Instance:
    English: the big boy
    Hebrew: ha-yeled ha-gadol
  Acquired Transfer Rule:
    Hebrew: NP: N ADJ <==> English: NP: the ADJ N
    where: (Hebrew:N <==> English:N)
           (Hebrew:ADJ <==> English:ADJ)
           (Hebrew:N has ((def +)))
           (Hebrew:ADJ has ((def +)))

  29. Version Space Abstraction Lattice

  30. The Elicitation Corpus • Dynamically adaptable list of sentences in a major language • Compositional • Vocabulary, noun phrases, basic sentences, complex constructions • Pilot version: 800 sentences, tested on Swahili

  31. Elicitation Tool Purpose • Provides a simple, intuitive interface for translation and alignment of elicitation corpus • Output from tool is used in version space learning

  32. Eliciation Tool – Ideas, Goals • Central ideas: • User: bilingual speaker, not an expert in linguistics • This user translates sentences • The user also specifies word alignment • Aimed at covering major typological linguistic features (Swadesh list) • Goals: • Learn grammar rules • Use these rules to automatically learn transfer rules

  33. Elicitation Tool Usage • User sees source language sentence (in English or Spanish) • User types in translation, then uses mouse to add word alignments • Alignments are indicated with lines between matching words
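The tool's output for the learner might look like the record below. The field names, the index-pair encoding of the drawn lines, and the Spanish translation are all assumptions for illustration; the slides do not specify the actual output format.

```python
# Hedged sketch of one aligned record produced by the elicitation session.
record = {
    "source": "The men slept.",
    "target": "Los hombres durmieron.",   # the user's typed translation
    # each user-drawn line as a (source_word_index, target_word_index) pair
    "alignment": [(0, 0), (1, 1), (2, 2)],
}

def aligned_pairs(rec):
    """Expand the index links back into the actual word pairs."""
    src, tgt = rec["source"].split(), rec["target"].split()
    return [(src[i], tgt[j]) for i, j in rec["alignment"]]

print(aligned_pairs(record))
```

Index pairs rather than word pairs let one source word link to several target words (or none), which matters for a polysynthetic target language.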

  34. Elicitation Tool Example

  35. Detection of Grammatical Features • Number, case, agreement, etc.

  36. Simplifying Assumptions • Features are marked only on a single word • No mechanism yet for checking changes in word order or changes on other words in the sentence • One-to-one correspondence between source language and target language words • Pre-defined features

  37. Organization of Tests [diagram: diagnostic tests organized by grammatical feature, e.g. Plural with dependents Dual and Paucal; Subj-V Agr; …]

  38. Optimization – Diagnostic Planning • For optimization, minimize the number of sentences that need to be translated • Base the sequencing and choice of tests on the results of previous tests • E.g., if the corpus detects that a language has no feature “plural”, prune the tests for “dual”, “paucal”, etc.
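The pruning idea on this slide can be sketched as a prerequisite table over tests. The table below is a hypothetical stand-in for the actual test organization, chosen to match the slide's example.

```python
# Diagnostic pruning sketch: each test may list a prerequisite feature;
# a test runs only if its prerequisite was detected (or it has none).
PREREQUISITE = {"plural": None, "dual": "plural", "paucal": "plural"}

def next_tests(detected):
    """Return the tests still worth running given detection results so far."""
    return [t for t, prereq in PREREQUISITE.items()
            if prereq is None or detected.get(prereq, False)]

print(next_tests({"plural": False}))  # "dual" and "paucal" are pruned
print(next_tests({"plural": True}))   # all three tests remain
```

Sequencing tests this way trades a fixed elicitation corpus for an adaptive one, which is exactly where the translation savings come from.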

  39. System Implementation • For each state: a sequence of tests • Tests specify the indices of sentences • Separate file with the list of sentences (facilitates re-use of sentences across multiple tests) • Ability to run in batch and interactive modes
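The indexed-sentence scheme might look like the sketch below. The file layout and names are assumptions; the point is only that tests reference a shared sentence list by index, so one sentence can serve several tests.

```python
# Shared sentence list plus tests that reference sentences by index.
SENTENCES = ["The man slept.", "The men slept.", "The two men slept."]
TESTS = {
    "plural": [0, 1],   # minimal pair for plural
    "dual":   [1, 2],   # index 1 is reused by both tests
}

def sentences_for(test_name):
    """Materialize the sentences a given diagnostic test needs."""
    return [SENTENCES[i] for i in TESTS[test_name]]

print(sentences_for("dual"))
```

Because the indices are data rather than code, the same driver can run a whole battery of tests in batch mode or present them one at a time interactively.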

  40. Morphology
  • mUtrUm.e.lu.mu eymi, amu.la.y.mi
    call.IDO.SVN.DS you, go.NEG.IND.2s
    “when they called you, you did not go”
  • fey.pi.a.l.mi, witra.kUnu.w.a.y.mi
    that.say.PROG.COND.2s get_up.PFPS.REF.NRLD.IND.2s
    “if you are going to speak, you must stand up”
  • ohafo.a.fu.y.mi wUtre.mew
    catch_a_cold.NRLD.IPD.IND.2s cold.INSTR
    “you might catch a cold because of the cold”
  • sungu.a.fi.y.mi
    speak.NRLD.EDO.IND.2s
    “you must speak with him”
  • pUntU.ke.nie.w.Uy.ng.u
    apart.DISTR.PRPS.REF.IND.3sn.d
    “are they apart from each other?”
  • fey.pi.nge.r.pa.n
    that_say.PASS.ITR.Hh.IND.1s
    “on my way here I was told”

  41. Morphology
  • kudu.le.me.we.la.n
    lay_down.ST.Hh.REM.NEG.IND.1s
    “I am not going to lie down there any more”
  • illku.faluw.kUle.n
    get_angry.SIM.ST.IND.1s
    “I am pretending to be angry”
  • antU.kUdaw.kiaw.ke.rke.fu.y
    day.work.CIRC.CF.REP.IPD.IND.3s
    “he used to work here and there as a day laborer, I am told”
  • wisa.ka.dungu.fe.nge.y.mi
    bad.VERB.FAC.speak.NOM.VERB.IND.2s
    “you are someone who always does and says nasty things”
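As a reading aid for the interlinear notation on these slides: each surface form is paired with a gloss line whose tags are separated by “.”, so zipping the two segmentations recovers the morpheme/tag pairs. (This snippet is only a notation aid, not part of the NICE system.)

```python
# Pair each Mapudungun morpheme with its gloss tag for one slide example.
surface = "amu.la.y.mi"
gloss = "go.NEG.IND.2s"
pairs = list(zip(surface.split("."), gloss.split(".")))
print(pairs)  # morpheme/tag pairs: root, negation, mood, person/number
```

The long verb forms above show why this matters for MT: a single Mapudungun word can pack in what English spreads over a whole clause, which is the polysynthesis challenge mentioned earlier.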

  42. Example Based MT: Data Collection

  43. Parallel Text Data • Spanish-Mapudungun parallel corpora • Total words: 223,366 • Spanish-Mapudungun glossary • About 5,500 entries

  44. Goals for Year 2 • From the statement of work

  45. Future Projects • Discussion
