Towards automatic enrichment and analysis of linguistic data for low-density languages

Presentation Transcript


  1. Towards automatic enrichment and analysis of linguistic data for low-density languages Fei Xia University of Washington Joint work with William Lewis and Dan Jinguji

  2. Motivation: theoretical linguistics • For a particular language (e.g., Yaqui), find answers to questions such as: • What is the word order: SVO, SOV, VSO, …? • Does it have a double-object construction? • Can a coordinated phrase be discontinuous (e.g., “NP1 Verb and NP2”)? • … • We want the answers for hundreds of languages.

  3. Motivation: computational linguistics • For a particular language, we want to build • a part-of-speech (POS) tagger and a parser • Common approach: create a treebank • an MT system • Common approach: • collect parallel data • study translation divergence (Dorr, 1994; Fox, 2002; Hwa et al., 2002)

  4. Main ideas • Projecting structures from a resource-rich language (e.g., English) to a low-density language • Tapping the large body of Web-based linguistic data → using the ODIN dataset

  5. Structure projection • Previous work: • (Yarowsky & Ngai, 2001): POS tags and NP boundaries • (Xi & Hwa, 2005): POS tags • (Hwa et al., 2002): dependency structures • (Quirk et al., 2005): dependency structures • Our work: • Projecting both dependency structures and phrase structures • It does not require a large amount of parallel data or hand-aligned data. • It can be applied to hundreds of languages.

  6. Outline • Background: IGT and ODIN • Data enrichment • Word alignment • Structure projection • Grammar extraction • Experiments • Conclusion and future work

  7. Background: IGT and ODIN

  8. Interlinear Glossed Text (IGT) Rhoddodd yr athro lyfr i’r bachgen ddoe Gave-3sg the teacher book to-the boy yesterday The teacher gave a book to the boy yesterday (Bailyn, 2001)
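
To make the three-line format concrete, here is a minimal sketch of an IGT instance as a Python data structure. The field names are ours, not ODIN's actual schema:

```python
# A minimal sketch of a three-line IGT instance as a data structure.
from dataclasses import dataclass

@dataclass
class IGT:
    source: str       # language line, e.g. Welsh
    gloss: str        # word-by-word gloss with grams (3sg, PST, ...)
    translation: str  # free English translation

welsh = IGT(
    source="Rhoddodd yr athro lyfr i'r bachgen ddoe",
    gloss="Gave-3sg the teacher book to-the boy yesterday",
    translation="The teacher gave a book to the boy yesterday",
)
```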

  9. ODIN • Online Database of INterlinear text • Storing and indexing IGT found in scholarly documents on the Web • Searchable by language name, language family, concept/gram, etc. • Current size: • 36,439 instances • 725 languages

  10. Data Enrichment

  11. The goal • Original IGT: three lines • Enriched IGT: • English phrase structure (PS), dependency structure (DS) • Source PS and DS • Word alignment between source and English translation

  12. Three steps • Parse the English translation • Align the source sentence and its English translation • Project the English PS and DS onto the source side

  13. Step 1: Parsing the English translation The teacher gave a book to the boy yesterday
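
The slides do not say which English parser the system used, so the following is only a stand-in: a dependency parse of the translation line with spaCy, an off-the-shelf parser. This yields the English DS; a constituency parser would supply the PS.

```python
# Stand-in for step 1, not the parser used in the original work:
# parse the English translation line with spaCy to get a DS.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
doc = nlp("The teacher gave a book to the boy yesterday")
for tok in doc:
    print(tok.i, tok.text, tok.tag_, f"<-{tok.dep_}-", tok.head.text)
```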

  14. Step 2: Word alignment

  15. Source-gloss alignment

  16. Gloss-translation alignment
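
In IGT, gloss words sit directly under the source words they annotate, so the source-gloss alignment is essentially positional; composing it with a gloss-translation alignment (the subject of the next slides) yields source-translation links. A minimal sketch, with a function name of our own:

```python
# Sketch: compose the positional source-gloss alignment with a
# gloss-translation alignment. gloss_to_trans maps a gloss-word
# index to the set of translation-word indices it links to.
def source_to_translation(gloss_to_trans: dict) -> set:
    # source word i sits directly above gloss word i, so the
    # source-gloss alignment is the identity on positions
    return {(i, j) for i, js in gloss_to_trans.items() for j in js}
```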

  17. Heuristic word aligner Gave-3sg the teacher book to-the boy yesterday The teacher gave a book to the boy yesterday The aligner aligns two words if they have the same root form.
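
A minimal sketch of this heuristic in Python. The helper names are ours, and where the real aligner matches root forms, this version only matches surface strings after stripping gram markers:

```python
import re

def morphemes(word: str) -> list:
    # "Gave-3sg" -> ["gave", "3sg"]; "to-the" -> ["to", "the"]
    return [m.lower() for m in re.split(r"[-.]", word) if m]

def heuristic_align(gloss_line: str, trans_line: str) -> set:
    """Link gloss word i and translation word j when they share a form."""
    links = set()
    trans = trans_line.split()
    for i, g in enumerate(gloss_line.split()):
        for j, t in enumerate(trans):
            if t.lower() in morphemes(g):
                links.add((i, j))
    return links

# heuristic_align("Gave-3sg the teacher book to-the boy yesterday",
#                 "The teacher gave a book to the boy yesterday")
```

Note that exact matching can never link pairs like grasp/caught, which is precisely the limitation the next slide shows.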

  18. Limitation of heuristic word aligner 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG I caught the pig and the cat

  19. Statistical word aligner • GIZA++ package (Och and Ney, 2000) • Implements the IBM models (Brown et al., 1993) • Widely used in the statistical MT field • Training data: the parallel corpus formed by the gloss and translation lines of all the IGT examples in ODIN

  20. Improving the word aligner • Train in both directions (gloss→trans, trans→gloss) and combine the results • Split words on the gloss line into morphemes: 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG → 1SG pig -NNOM -SG grasp -PST and cat -NNOM -SG
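
A sketch of these two ideas, with helper names of our own; the exact splitting and combination heuristics used in the original system may differ:

```python
import re

def split_morphemes(gloss_line: str) -> str:
    # "pig-NNOM.SG" -> "pig -NNOM .SG": split at each "-" or ".",
    # keeping the separator attached to the affix
    out = []
    for w in gloss_line.split():
        out.extend(p for p in re.split(r"(?=[-.])", w) if p)
    return " ".join(out)

def symmetrize(g2t: set, t2g: set) -> set:
    # simplest combination: keep links found by both directions
    # (the combination heuristic actually used may differ)
    return g2t & {(g, t) for (t, g) in t2g}
```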

  21. Improving the word aligner (cont.) Pedro-NOM Goyo-ACC yesterday horse-ACC steal-PRFV-SAY-PRES Pedro says Goyo has stolen the horse yesterday. Add (x, x) sentence pairs: (Pedro, Pedro) (Goyo, Goyo) …
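
The slide does not spell out how the x's are chosen, so treat this as a guess: one simple way is to harvest tokens that appear verbatim on both lines and add each as a one-word "sentence" pair to the GIZA++ training corpus.

```python
def identity_pairs(corpus):
    # corpus: list of (gloss_line, translation_line) pairs;
    # run after split_morphemes so "Pedro-NOM" contributes "Pedro"
    vocab = set()
    for gloss, trans in corpus:
        vocab |= set(gloss.split()) & set(trans.split())
    return [(w, w) for w in sorted(vocab)]  # e.g. ("Pedro", "Pedro")
```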

  22. Step 3: Projecting structures • Projecting DS • Previous work: • (Hwa et al., 2002) • (Quirk et al., 2005) • Projecting PS

  23. Projecting phrase structure

  24. Projecting PS • Copy the English PS and remove all unaligned English words • Replace the English words with the corresponding source words • Starting from the root, reorder the children of each node • Attach unaligned source words (steps 1–3 are sketched below)
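
A compact sketch of steps 1–3 over an nltk.Tree, under simplifying assumptions: the alignment is a dict from English leaf index to a single source index, overlapping spans (slide 38) and unaligned source words (step 4) are not handled, and all helper names are ours.

```python
from nltk import Tree

def project_ps(eng_ps: Tree, align: dict, src: list) -> Tree:
    """align: English leaf index -> source word index (1-to-1 here;
    the full algorithm also handles one-to-many links and merging,
    as with IN and DT on slide 35)."""
    t = eng_ps.copy(deep=True)

    # Steps 1-2: drop unaligned English leaves, substitute source words
    for i, pos in enumerate(t.treepositions("leaves")):
        t[pos] = (src[align[i]], align[i]) if i in align else None

    def min_src_pos(node):
        if isinstance(node, Tree):
            return min(min_src_pos(k) for k in node)
        return node[1]

    def rebuild(node):
        if not isinstance(node, Tree):
            return node                     # (word, position) or None
        kids = [k for k in (rebuild(c) for c in node) if k is not None]
        if not kids:
            return None
        kids.sort(key=min_src_pos)          # step 3: reorder by span
        return Tree(node.label(), kids)

    return rebuild(t)

# e.g. for the running example:
# src = "Rhoddodd yr athro lyfr i'r bachgen ddoe".split()
# align = {0: 1, 1: 2, 2: 0, 4: 3, 5: 4, 6: 4, 7: 5, 8: 6}  # "a" unaligned
```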

  25. Starting with English PS The teacher gave a book to the boy yesterday

  26. Replacing English words

  27. Reordering children

  28. Calculating phrase spans

  29. “Reordering” NP and VP

  30. Removing VP

  31. Removing a node in PS

  32. After removing VP

  33. Reordering VBD and NP

  34. Removing NP

  35. Merging IN and DT

  36. Before “reordering”

  37. After reordering (figure: the reordered tree over source word positions 1–7)

  38. Reordering two children of x: y1 and y2 Let Si be the phrase span of yi: • S1 and S2 don’t overlap: reorder the two nodes according to their spans. • S1 ⊂ S2: remove y2 • S1 ⊃ S2: remove y1 • S1 and S2 overlap, and neither is a strict subset of the other: remove both nodes; if y1 and y2 are leaf nodes, merge them instead.
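
The same case analysis as executable code, with spans as Python sets of source-word positions (strict subset/superset are the `<`/`>` operators on sets):

```python
def compare_spans(s1: set, s2: set) -> str:
    """Span comparison from slide 38 for two children y1 and y2."""
    if not (s1 & s2):
        return "reorder"      # disjoint: order y1, y2 by position
    if s1 < s2:               # S1 strictly inside S2
        return "remove_y2"
    if s1 > s2:               # S1 strictly contains S2
        return "remove_y1"
    return "remove_both"      # overlap, neither contains the other
                              # (merge instead if both are leaves)
```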

  39. Attaching unaligned source words (figure: a node y with children yi, yj, yk)

  40. Information that can be extracted from enriched IGT • Grammars for source language • Transfer rules • Examples with interesting properties (e.g., crossing dependencies)

  41. Grammars: S → VBD NP NP PP NP; NP → DT NN; NP → NN; PP → IN+DT NN
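
Given a projected source PS, these CFG rules can be read off mechanically. A sketch with nltk, using the running Welsh example; the bracketing is our reconstruction from slides 25–37, chosen so that it yields exactly the rules above:

```python
from nltk import Tree

t = Tree.fromstring(
    "(S (VBD Rhoddodd) (NP (DT yr) (NN athro)) (NP (NN lyfr))"
    " (PP (IN+DT i'r) (NN bachgen)) (NP (NN ddoe)))")
for prod in t.productions():
    if prod.is_nonlexical():      # keep phrase rules, skip lexical ones
        print(prod)               # S -> VBD NP NP PP NP, NP -> DT NN, ...
```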

  42. Examples of crossing dependencies Inepo kow-ta bwuise-k into mis-ta 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG I caught the pig and the cat (Martinez Fabian, 2006)

  43. Examples of crossing dependencies

  44. Examples of crossing dependencies

  45. Outline • Background: IGT and ODIN • Data enrichment • Experiments • Conclusion and future work

  46. Experiments • Test on a small set of IGT examples for seven languages: • SVO: German (GER) and Hausa (HUA) • SOV: Korean (KKN) and Yaqui (YAQ) • VSO: Irish (GLI) and Welsh (WLS) • VOS: Malagasy (MEX)

  47. Test set Numbers in the last row come from the Ethnologue (Gordon, 2005). Human annotators checked the system output and corrected: - the English DS - the word alignment - the source DS

  48. Heuristic word aligner → high precision, low recall.

  49. Statistical word aligner: training data
