MT For Low-Density Languages - PowerPoint PPT Presentation

Presentation Transcript

  1. MT For Low-Density Languages Ryan Georgi Ling 575 – MT Seminar Winter 2007

  2. What is “Low Density”?

  3. What is “Low Density”? • In NLP, languages are usually chosen for: • Economic Value • Ease of development • Funding (NSA, anyone?)

  4. What is “Low Density”? • As a result, NLP work until recently has focused on a rather small set of languages. • e.g. English, German, French, Japanese, Chinese

  5. What is “Low Density”? • “Density” refers to the availability of resources (primarily digital) for a given language. • Parallel text • Treebanks • Dictionaries • Chunked, semantically tagged, or other annotation


  6. What is “Low Density”? • “Density” is not necessarily linked to speaker population • Our favorite example: Inuktitut

  7. So, why study LDL?

  8. So, why study LDL? • Preserving endangered languages • Spreading benefits of NLP to other populations • (Tegic has T9 for Azerbaijani now) • Benefits of wide typological coverage for cross-linguistic research • (?)

  9. Problem of LDL?

  10. Problem of LDL? • “The fundamental problem for annotation of lower-density languages is that they are lower density” – Maxwell & Hughes • The easiest (and often best) NLP development is done with statistical methods • Training requires lots of resources • Resources require lots of money • A cost/benefit chicken-and-egg problem

  11. What are our options? • Create corpora by hand • Very time-consuming (= expensive) • Requires trained native speakers • Digitize printed resources • Time-consuming • May require trained native speakers • e.g. an orthography without Unicode support

  12. What are our options? • Traditional requirements are going to be difficult to satisfy, no matter how we slice it. • We therefore need to: • Maximize the information extracted from the resources we can get • Reduce the requirements for building a system

  13. Maximizing Information with IGT

  14. Maximizing Information with IGT • Interlinear Glossed Text • Traditional form of transcription for linguistic field researchers and grammarians • Example (Welsh):
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
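The three-line shape of an IGT entry maps naturally onto a small data structure. Below is a minimal sketch (the class name and fields are hypothetical, not from any particular toolkit) of how one entry could be represented for processing:

```python
from dataclasses import dataclass

@dataclass
class IGTEntry:
    """One interlinear-glossed example: language line, gloss line, translation."""
    source: list       # language-line tokens
    gloss: list        # one gloss token per source token
    translation: str   # free translation in the high-density language

welsh = IGTEntry(
    source=["Rhoddodd", "yr", "athro", "lyfr", "i'r", "bachgen", "ddoe"],
    gloss=["gave-3sg", "the", "teacher", "book", "to-the", "boy", "yesterday"],
    translation="The teacher gave a book to the boy yesterday",
)

# The gloss line aligns one-to-one with the language line by position,
# so every source token comes with morpheme-level annotation for free.
assert len(welsh.source) == len(welsh.gloss)
```

The positional source-gloss correspondence is what makes the alignment tricks on the following slides possible.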

  15. Benefits of IGT • As IGT is frequently used in fieldwork, it is often available for low-density languages • IGT provides information about syntax and morphology • The translation line is usually in a high-density language that we can use as a pivot language.

  16. Drawbacks of IGT • Data can be ‘abnormal’ in a number of ways • Usually quite short • May be used by a grammarian to illustrate fringe usages • Often purposely limited vocabularies • Still, in working with LDL it might be all we’ve got

  17. Utilizing IGT • First, a big nod to Fei (this is her paper!) • As we saw in HW#2, word alignment is hard. • IGT, however, often gets us halfway there!

  18. Utilizing IGT • Take the previous example:
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”

  23. Utilizing IGT • Take the previous example:
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
• The interlinear already aligns the source with the gloss • Often, the gloss uses words found in the translation already
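Because the gloss line is positionally aligned to the source and shares vocabulary with the translation, a first-pass word alignment falls out almost for free. A hedged sketch of that heuristic (function name and token cleanup are my own, not from the paper):

```python
def align_via_gloss(gloss, translation):
    """Heuristic word alignment through the gloss line: each gloss token is
    split on '-' into morpheme glosses, and the corresponding source word is
    aligned to every translation word that matches one of them."""
    trans = [t.strip('".,').lower() for t in translation.split()]
    pairs = []
    for i, g in enumerate(gloss):
        for part in g.lower().split("-"):
            for j, t in enumerate(trans):
                if part == t:
                    pairs.append((i, j))  # (source index, translation index)
    return pairs

gloss = ["gave-3sg", "the", "teacher", "book", "to-the", "boy", "yesterday"]
translation = "The teacher gave a book to the boy yesterday"
pairs = align_via_gloss(gloss, translation)
# "gave-3sg" links source word 0 ("Rhoddodd") to English "gave" (index 2);
# function words like "the" match ambiguously and need later disambiguation.
assert (0, 2) in pairs
```

On clean examples like the Welsh one this recovers most content-word links; the next slides show where it breaks down.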

  24. Utilizing IGT • Alignment isn’t always this easy…
xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
‘They left the room as we were entering it’
(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)

  25. Utilizing IGT • Alignment isn’t always this easy… (same example as above) • We can get a little more by stemming…

  26. Utilizing IGT • Alignment isn’t always this easy… (same example as above) • We can get a little more by stemming… • …but we’re going to need more.
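What stemming buys us here can be shown concretely. A deliberately crude suffix stripper (illustration only; a real system would use a proper stemmer) lets the gloss morpheme "enter" from "1PL-enter" match "entering" in the translation, while "exited" vs. "left" stays unmatched, exactly the gap the slide points at:

```python
def crude_stem(word):
    """Very crude English suffix stripper (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Stemming lets the gloss morpheme "enter" (from "1PL-enter") match the
# translation word "entering"...
assert crude_stem("entering") == "enter"
# ...but "exited" stems to "exit", which still won't match "left",
# so stemming alone is not enough.
assert crude_stem("exited") == "exit"
```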

  27. Utilizing IGT • Thankfully, with an English translation, we already have tools to get phrase and dependency structures that we can project: (Source: Will & Fei’s NAACL 2007 Paper!)
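The core idea of projection can be sketched generically (this is a simplified direct-projection sketch, not the exact algorithm of the cited paper): parse the English translation, then carry each dependency edge across the word alignment to the source language.

```python
def project_dependencies(eng_deps, alignment):
    """Direct projection of dependency edges: an English edge (head, dep)
    becomes a source-language edge between the source words aligned to
    head and dep. `alignment` is a list of (source_idx, english_idx)."""
    eng_to_src = {j: i for i, j in alignment}
    return [
        (eng_to_src[h], eng_to_src[d])
        for h, d in eng_deps
        if h in eng_to_src and d in eng_to_src
    ]

# Hypothetical English dependencies for "The teacher gave a book ...":
# gave(2) -> teacher(1), gave(2) -> book(4)
eng_deps = [(2, 1), (2, 4)]
# Partial alignment from the Welsh IGT:
# Rhoddodd<->gave, athro<->teacher, lyfr<->book
alignment = [(0, 2), (2, 1), (3, 4)]
assert project_dependencies(eng_deps, alignment) == [(0, 2), (0, 3)]
```

Edges whose endpoints are unaligned are simply dropped here; real projection work has to handle one-to-many and unaligned words more carefully.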

  28. Utilizing IGT • Thankfully, with an English translation, we already have tools to get phrase and dependency structures that we can project: (Source: Will & Fei’s NAACL 2007 Paper!)

  29. Utilizing IGT • What can we get from this? • Automatically generated CFGs • Can infer word order from these CFGs • Can infer possible constituents • …suggestions? • From a small amount of data, this is a lot of information, but what about…
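Once trees have been projected, reading CFG productions off them is mechanical. A small sketch (tree shape and labels are hypothetical):

```python
def extract_rules(tree):
    """Collect CFG productions from a nested-tuple tree (label, child, ...);
    leaves are plain strings and are treated as terminals."""
    label, *children = tree
    rules, rhs = [], []
    for child in children:
        if isinstance(child, str):
            rhs.append(child)
        else:
            rhs.append(child[0])
            rules.extend(extract_rules(child))
    rules.append((label, tuple(rhs)))
    return rules

# A hypothetical projected tree for the Welsh example (note VSO order):
tree = ("S", "Rhoddodd", ("NP", "yr", "athro"), ("NP", "lyfr"))
rules = extract_rules(tree)
assert ("S", ("Rhoddodd", "NP", "NP")) in rules  # verb-first order falls out
```

Even toy rules like these expose word order and candidate constituents, which is exactly the information the slide says we can mine.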

  30. Reducing Data Requirements with Prototyping

  31. Grammar Induction • So, we have a way to get production rules from a small amount of data. • Is this enough? • Probably not. • CFGs aren’t known for their robustness • How about using what we have as a bootstrap?

  32. Grammar Induction • Given unannotated text, we can derive PCFGs • Without annotation, though, we just have unlabelled trees: [Tree diagram: an unlabelled parse of “the dog fell asleep” with placeholder nonterminals (ROOT, C2, X0, X1, Y2, Z3, N4) and rule probabilities (p=0.02, p=0.45e-4, p=0.003, p=0.09, p=5.3e-2)] • Such an unlabelled parse doesn’t give us S -> NP VP, though.

  33. Grammar Induction • Can we get labeled trees without annotated text? • Haghighi & Klein (2006) • Propose a way in which production rules can be passed to a PCFG induction algorithm as “prototypical” constituents • Think of these prototypes as a rubric that could be given to a human annotator • e.g. for English, NP -> DT NN
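The simplest reading of a prototype is exact matching over POS sequences. A sketch of that baseline (names are my own; the actual induction algorithm uses prototypes as soft features, not hard matches):

```python
def prototype_spans(pos_tags, prototypes):
    """Mark every span whose POS sequence exactly matches a prototype,
    e.g. NP -> DT NN, as a candidate labelled constituent."""
    spans = []
    for label, pattern in prototypes:
        n = len(pattern)
        for start in range(len(pos_tags) - n + 1):
            if tuple(pos_tags[start:start + n]) == pattern:
                spans.append((label, start, start + n))
    return spans

tags = ["DT", "NN", "VBD", "DT", "NN", "NN"]  # "the dog saw the train station"
spans = prototype_spans(tags, [("NP", ("DT", "NN"))])
assert ("NP", 0, 2) in spans
# Exact matching misses the DT NN NN span ("the train station"),
# which is why similar-but-not-identical sequences must be handled too.
```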

  34. Grammar Induction • Let’s take the possible constituent DT NN • We could tell our PCFG algorithm to apply this as a constituent everywhere it occurs • But what about DT NN NN (“the train station”)? • We would like to catch this as well

  35. Grammar Induction • H&K’s solution? • Distributional clustering • “a similarity measure between two items on the basis of their immediate left and right contexts” • …to be honest, I lose them in the math here. • Importantly, however, weighting the probability of a constituent with the right measure improves the f-measure from the unsupervised baseline of 35.3 to 62.2
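The "immediate left and right contexts" idea is easy to make concrete, even if the paper's full model is more involved. A minimal sketch (my own simplification, assuming single-token targets and cosine similarity over context counts):

```python
from collections import Counter
from math import sqrt

def context_signature(target, sentences):
    """Distribution over (left, right) neighbour pairs of a target token."""
    sig = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == target:
                left = sent[i - 1] if i > 0 else "<s>"
                right = sent[i + 1] if i + 1 < len(sent) else "</s>"
                sig[(left, right)] += 1
    return sig

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sents = [["the", "dog", "ran"], ["the", "cat", "ran"]]
sim = cosine(context_signature("dog", sents), context_signature("cat", sents))
assert abs(sim - 1.0) < 1e-9  # identical contexts -> maximal similarity
```

Items that occur in the same contexts as a prototype constituent can then be treated as likely instances of the same category, which is the intuition behind weighting constituent probabilities this way.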

  36. So… what now?

  37. Next Steps • By extracting production rules from a very small amount of data using IGT and using Haghighi & Klein’s unsupervised methods, it may be possible to bootstrap an effective language model from very little data!

  38. Next Steps • Possible applications: • Automatic generation of language resources • (While a system with the same goals would only compound error, automatically annotated data could be easier for a human to correct than to generate by hand) • Assist linguists in the field • (Better model performance could imply better grammar coverage) • …you tell me!