The String Edit Distance (SED) heuristic for morpheme discovery: a look at Swahili

Yu Hu, Irina Matveeva, John Goldsmith, Colin Sprague

Presentation Transcript


  1. The String Edit Distance (SED) heuristic for morpheme discovery: a look at Swahili. Yu Hu, Irina Matveeva, John Goldsmith, Colin Sprague. SED heuristic: morpheme discovery

  2. A new heuristic for morpheme discovery • Overall goal: to understand the process of going from untagged corpora in an unknown language to a parsing of each word in the corpus into its component morphemes. • That is, morphology induction.

  3. Linguistica • http://linguistica.uchicago.edu

  4. General structure of morphology induction • Search method within morphology space • Objective function to evaluate goodness of any given morphology for a specific corpus. Search method divided into two parts: • Initial, or bootstrapping, heuristic • Incremental heuristics

  5. Zellig Harris (1909-1992) • Proposed successor frequency (SF) as a method for finding morpheme breaks. [Figure: the prefix "govern" with its successors e, i, m, o, s, #]

  6. • SF (Z. Harris) works reasonably well for European languages, though it produces too many false positives. • It does not work well for languages with rich morphologies, where the average number of morphemes per word is high.
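
Successor frequency can be sketched in a few lines. This is a minimal illustration, not the Linguistica implementation: the word list is a hypothetical toy corpus, `#` marks the end of a word, and the function names are made up for this example. A peak in the profile suggests a morpheme break after that prefix.

```python
from collections import defaultdict

def successor_frequencies(words, end="#"):
    """Map each prefix to the set of distinct symbols that follow it."""
    successors = defaultdict(set)
    for word in set(words):
        padded = word + end  # '#' stands for "the word ends here"
        for i in range(len(padded)):
            successors[padded[:i]].add(padded[i])
    return successors

def sf_profile(word, successors):
    """Successor frequency after each prefix of `word` (lengths 1..len(word))."""
    return [len(successors[word[:i]]) for i in range(1, len(word) + 1)]

# Hypothetical toy corpus (not taken from the slides' data).
words = ["govern", "governed", "governing", "government", "governor", "governs"]
succ = successor_frequencies(words)

print(sorted(succ["govern"]))        # ['#', 'e', 'i', 'm', 'o', 's']
print(sf_profile("governed", succ))  # [1, 1, 1, 1, 1, 6, 1, 1]
```

The peak of 6 after `govern` is the candidate cut; on real corpora such peaks also appear at non-boundaries, which is the false-positive problem the next slides address.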

  7. SF: false positives • Most of SF’s false positives can be weeded out by looking for signatures: multiple stems co-occurring with multiple suffixes. • That is:

  8. SF peaks as FSA

  9. Signature: reduces false positives of SF
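
The signature filter can be illustrated as follows. This sketch assumes candidate cuts arrive as hypothetical `(stem, suffix)` pairs and that "multiple" means at least two; both thresholds and the function name are assumptions, not values from the slides.

```python
from collections import defaultdict

def find_signatures(splits, min_stems=2, min_suffixes=2):
    """Group stems by the exact set of suffixes they occur with.

    A suffix set is kept as a signature only if it has several suffixes
    and is shared by several stems (thresholds are assumptions).
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in splits:
        suffixes_of[stem].add(suffix)

    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        stems_of[frozenset(suffixes)].add(stem)

    return {
        signature: stems
        for signature, stems in stems_of.items()
        if len(signature) >= min_suffixes and len(stems) >= min_stems
    }

# Hypothetical candidate cuts, e.g. from successor-frequency peaks.
splits = [
    ("jump", "s"), ("jump", "ed"), ("jump", "ing"),
    ("walk", "s"), ("walk", "ed"), ("walk", "ing"),
    ("ca", "t"), ("ca", "r"),  # spurious cut: only one stem has this suffix set
]
signatures = find_signatures(splits)
print(signatures)
```

Only the suffix set shared by `jump` and `walk` survives; the spurious `ca` + `t`/`r` cut is discarded because no second stem supports it.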

  10. Generalize the signature… Sequential FSA: each state has a unique successor.

  11. Here is how we do it.

  12. 1. Alignments

  13. 1.1 Alignments: String edit distance algorithm

  14. • SED is slow, and there are many pairs of words in a corpus. • So we make an effort to avoid applying SED when it’s futile. • Alphabetize the letters of each word to quickly count the overlap in the bag of letters in each word. • Set a minimum threshold of 3 shared letters.
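
A minimal sketch of the prefilter plus edit distance, under assumptions: unit costs for insertion, deletion and substitution (the slides do not state the costs), and hypothetical helper names. The overlap count sorts each word's letters and merges the two sorted lists, as the slide describes.

```python
def letter_overlap(a, b):
    """Count letters shared by the two bags of letters (sorted-merge)."""
    sa, sb = sorted(a), sorted(b)
    i = j = shared = 0
    while i < len(sa) and j < len(sb):
        if sa[i] == sb[j]:
            shared += 1
            i += 1
            j += 1
        elif sa[i] < sb[j]:
            i += 1
        else:
            j += 1
    return shared

def edit_distance(a, b):
    """Levenshtein distance with unit costs (an assumption), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def candidate_pairs(words, min_overlap=3):
    """Yield (a, b, distance) only for pairs that pass the cheap overlap test."""
    words = sorted(set(words))
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if letter_overlap(a, b) >= min_overlap:
                yield a, b, edit_distance(a, b)

pairs = list(candidate_pairs(["anapenda", "alipenda", "kitabu"]))
print(pairs)  # [('alipenda', 'anapenda', 2)]
```

Both pairs involving `kitabu` share fewer than three letters with the other word, so the quadratic-time distance is never computed for them.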

  15. 1.2 Alignments: make cuts

  16. 1.3 Result: elementary alignment
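
The figures for these two slides are not preserved, so the following is only one plausible reading: align two words with unit-cost edit distance, then cut wherever the alignment switches between identical and differing material. The function name and the `same`/`diff` labels are hypothetical.

```python
def align(a, b):
    """Return spans (kind, piece_of_a, piece_of_b), kind in {'same', 'diff'}."""
    n, m = len(a), len(b)
    # Full edit-distance table (unit costs, an assumption).
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))

    # Trace one optimal path back from the corner, preferring diagonal moves.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            kind = "same" if a[i - 1] == b[j - 1] else "diff"
            ops.append((kind, a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("diff", a[i - 1], ""))
            i -= 1
        else:
            ops.append(("diff", "", b[j - 1]))
            j -= 1
    ops.reverse()

    # Make cuts: merge consecutive operations of the same kind into spans.
    spans = []
    for kind, x, y in ops:
        if spans and spans[-1][0] == kind:
            spans[-1] = (kind, spans[-1][1] + x, spans[-1][2] + y)
        else:
            spans.append((kind, x, y))
    return spans

spans = align("anapenda", "alipenda")
print(spans)  # [('same', 'a', 'a'), ('diff', 'na', 'li'), ('same', 'penda', 'penda')]
```

Read as a small sequential FSA, the result is `a` followed by a choice of `na` or `li`, followed by `penda`: shared spans become single-label transitions and the differing span becomes a two-way transition.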

  17. 2.1 Collapsing elementary alignments [figure labels: context, context]

  18. 2.2 Two or more sequential FSAs with identical contexts are collapsed:

  19. 3. Further collapsing FSAs

  20. 4.1 Evaluating the robustness of these templates (sequential FSAs) • Measure: How many letters do we save by expressing words in a template rather than by writing each one out individually? Answer: 36 - 17 = 19.
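
The measure can be sketched as letters in the spelled-out words minus letters in the template. This assumes a template generates every combination of its transitions' morphemes; the templates below are hypothetical and are not the one behind the slide's 36 - 17 = 19.

```python
from itertools import product

def robustness(template):
    """Letters saved by a template over listing each of its words in full.

    `template` is a list of transitions, each a list of morphemes; the words
    it stands for are assumed to be all combinations (an assumption).
    """
    words = ["".join(parts) for parts in product(*template)]
    spelled_out = sum(len(word) for word in words)
    in_template = sum(len(m) for transition in template for m in transition)
    return spelled_out - in_template

# Hypothetical templates for illustration.
small = [["a"], ["na", "li"], ["penda"]]        # anapenda, alipenda
larger = [["a", "wa"], ["na", "li"], ["penda"]]

print(robustness(small))   # 16 - 10 = 6
print(robustness(larger))  # 34 - 12 = 22
```

Adding one two-letter prefix doubles the number of words covered, which is why larger templates score so much higher.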

  21. 4.2 In practice… • Significant templates save from 200 to 5,000 letters. • Ranking templates by this measure gives a good indication of how significant they are in the overall morphology of the language.

  22. Swahili (Bantu, East Africa)

  23. The goal… • Is to learn an FSA that matches what we know the morphology of Swahili to be. • As a first approximation:

  24. Swahili verb

  25. Swahili verb: Subject marker

  26. Swahili verb: Subject marker, Tense marker

  27. Swahili verb: Subject marker, Tense marker, Object marker

  28. Swahili verb: Subject marker, Tense marker, Object marker, Root

  29. Swahili verb: Subject marker, Tense marker, Object marker, Root, Voice (active/passive)

  30. Swahili verb: Subject marker, Tense marker, Object marker, Root, Voice (active/passive), Final vowel

  31. Swahili verb: Subject marker, Tense marker, Object marker, Root, Voice (active/passive), choyoye, Final vowel

  32. 4.3 Top templates: 8,200 Swahili words

  33. 4.4 Precision and recall

  34. 5.1 Improvements through disambiguation • When all of the final letters of the production of a (non-final) state S are identical, then there is an uncertainty in the analysis:

  35. Can we distinguish grammatical from lexical morphemes? • In general, yes, based on the number of distinct morphemes generated by a state transition: more than 5 means lexical. • Better: use morpheme length and morpheme frequency in addition to the size of the arc production. • When one set of morphemes is lexical and the other is grammatical, put the ambiguous material in the grammatical morphemes. Why? This keeps the number of letters in the morphology small(er).
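
A minimal sketch of this rule, assuming the simple count-based test (more than 5 distinct morphemes means lexical) and a lexical transition followed by a grammatical one. The function names and the example stems are hypothetical.

```python
import os

def is_lexical(morphemes, threshold=5):
    """Count-based test from the slide: more than `threshold` distinct morphemes."""
    return len(set(morphemes)) > threshold

def common_suffix(strings):
    """Longest string that every item ends with."""
    reversed_items = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_items)[::-1]

def shift_ambiguous(left, right):
    """Move letters shared by every morpheme on `left` onto `right`.

    Applied only when `left` is lexical and `right` is grammatical, so the
    shared letters are stored once in the small set, not once per stem.
    """
    shared = common_suffix(sorted(left))
    if shared and is_lexical(left) and not is_lexical(right):
        left = {m[:-len(shared)] for m in left}
        right = {shared + m for m in right}
    return set(left), set(right)

# Hypothetical transitions: six stems all ending in 'w', then final vowel 'a'.
stems = {"pendw", "somw", "pigw", "fungw", "andikw", "imbw"}
new_stems, new_suffixes = shift_ambiguous(stems, {"a"})
print(sorted(new_stems), new_suffixes)
```

Here the `w` is written once (in `wa`) instead of six times, which is the letter saving the slide appeals to.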

  36. When both sets are grammatical: • Now we think of the cost of the labels on the FSA edges in terms of the encoding length of the pointer to the morpheme. • We would rather have edges that point to high-frequency morphemes than low-frequency morphemes. • The overall use of a string in the grammar plays the crucial role here.

  37. Actual implementation • We do not have access to any frequencies or probabilities yet (by construction). • We associate with each morpheme m the total robustness of all the templates in which it appears so far. • If a word can be parsed in two ways, we choose the parse for which the sum of the robustness of the pieces is the greatest.
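
The parse selection described here can be sketched as follows. The templates and their robustness values are hypothetical numbers chosen for illustration, and the function names are not from the original system.

```python
from collections import defaultdict

def morpheme_scores(templates):
    """Sum, for each morpheme, the robustness of every template containing it."""
    scores = defaultdict(int)
    for transitions, robustness in templates:
        for morpheme in {m for transition in transitions for m in transition}:
            scores[morpheme] += robustness
    return scores

def best_parse(parses, scores):
    """Pick the parse whose pieces have the greatest total score."""
    return max(parses, key=lambda parse: sum(scores.get(m, 0) for m in parse))

# Hypothetical templates as (transitions, robustness) pairs.
templates = [
    ([["pend", "som"], ["wa", "a"]], 30),
    ([["pendw", "somw"], ["a"]], 10),
]
scores = morpheme_scores(templates)

parses = [("pendw", "a"), ("pend", "wa")]
print(best_parse(parses, scores))  # ('pend', 'wa'): 30 + 30 beats 10 + 40
```

Summed robustness serves as a stand-in for frequency: a morpheme that appears in many robust templates is cheap to point to, in the spirit of the encoding-length argument on the previous slide.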

  38. Example 1: Swahili

  39. Collapsing templates to generate unseen words • Label each transition as grammatical or lexical. • We consider collapsing pairs of 4-state FSAs. Our conditions for collapsing: • The two lexical transitions must share at least two stems. • One pair of grammatical transitions must be identical. • Other pair: symmetric difference
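
One way to read these conditions, sketched under assumptions: each 4-state FSA has three transitions, exactly one of them lexical, and a successful collapse takes the union of each pair of transitions. The slide's last condition ("symmetric difference") is terse, so the treatment of the non-identical grammatical pair here is a guess, and all names and data are hypothetical.

```python
from itertools import product

def try_collapse(t1, t2, min_shared_stems=2):
    """Merge two 3-transition templates, or return None if conditions fail.

    Each template is a list of (kind, set_of_morphemes) pairs.
    """
    kinds = [kind for kind, _ in t1]
    if kinds != [kind for kind, _ in t2] or kinds.count("lexical") != 1:
        return None
    lex = kinds.index("lexical")
    if len(t1[lex][1] & t2[lex][1]) < min_shared_stems:
        return None  # lexical transitions must share at least two stems
    grammatical = [i for i in range(len(kinds)) if i != lex]
    if not any(t1[i][1] == t2[i][1] for i in grammatical):
        return None  # one grammatical pair must be identical
    # Assumption: the remaining pair is merged by union.
    return [(kind, a | b) for (kind, a), (_, b) in zip(t1, t2)]

def generate(template):
    """All words a template produces."""
    return {"".join(parts) for parts in product(*(sorted(ms) for _, ms in template))}

t1 = [("grammatical", {"a"}), ("grammatical", {"na"}), ("lexical", {"penda", "soma", "cheza"})]
t2 = [("grammatical", {"a"}), ("grammatical", {"li"}), ("lexical", {"penda", "soma", "lala"})]

merged = try_collapse(t1, t2)
unseen = generate(merged) - generate(t1) - generate(t2)
print(sorted(unseen))  # ['alicheza', 'analala']
```

The collapsed template predicts two words that neither original template generated, which is the sense in which collapsing produces unseen words.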

  42. Adding “incomplete” stems • Try to reparse each word in the corpus according to the current templates. • Success will (may) mean the hypothesis of a new stem T (lexical morpheme). • If creating stem T predicts the existence of 3+ words that truly exist in the corpus, then we admit stem T.
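
A sketch of the admission test, under assumptions: the slides do not say how the reparse works, so this version strips the material generated before and after the lexical transition and treats what remains as the candidate stem T. The template, corpus, and function name are hypothetical.

```python
from itertools import product

def propose_stems(corpus, template, lex_index, min_attested=3):
    """Return {new_stem: attested_words} for stems the corpus supports.

    `template` is a list of transitions (lists of morphemes) and `lex_index`
    is the position of its lexical transition.
    """
    corpus = set(corpus)
    before = ["".join(p) for p in product(*template[:lex_index])]
    after = ["".join(p) for p in product(*template[lex_index + 1:])]
    known = set(template[lex_index])
    admitted = {}
    for word in corpus:
        for pre in before:
            for post in after:
                if not (word.startswith(pre) and word.endswith(post)):
                    continue
                if len(word) <= len(pre) + len(post):
                    continue
                stem = word[len(pre):len(word) - len(post)]
                if stem in known or stem in admitted:
                    continue
                # Words the template would generate if `stem` were added.
                predicted = {p + stem + s for p in before for s in after}
                attested = predicted & corpus
                if len(attested) >= min_attested:
                    admitted[stem] = attested
    return admitted

template = [["a", "wa"], ["na", "li"], ["penda", "soma"]]
corpus = ["anapenda", "alipenda", "wanasoma", "analala", "alilala",
          "wanalala", "anacheza", "kitabu"]
new_stems = propose_stems(corpus, template, lex_index=2)
print(new_stems)
```

`lala` is admitted because three of its four predicted words occur in the corpus; `cheza` is rejected because only `anacheza` does.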

  43. Results: Disambiguation. Training corpus: 7,180 distinct words of Swahili (50,000 running words).

  44. Collapsed templates

  45. Next steps: • Integrate these sub-FSAs into a single FSA. • Split some of the single state transitions into sequences: E.g.,