
ParaMor & Morpho Challenge 2008

ParaMor & Morpho Challenge 2008. Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin. Turkish Morphology – Beads on a String. One Turkish word: götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular): "You are not being taken."


Presentation Transcript


  1. ParaMor & Morpho Challenge 2008. Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin

  2. Turkish Morphology – Beads on a String. One Turkish word: götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular): "You are not being taken."

  3. Computational Morphology Improves: Machine Translation: Turkish-English (Oflazer, 2007), Czech-English (Goldwater and McClosky, 2005). Information Retrieval: English, German, Finnish (Kurimo et al., 2008). Speech Recognition: Finnish (Creutz, 2006). Grapheme-to-Phoneme Conversion: German (Demberg, 2007).

  4. Morphology is Complex – Operations: Prefixation, Suffixation

  5. Morphology is Complex – Operations: Prefixation, Suffixation, Reduplication

  6. Morphology is Complex – Operations: Prefixation, Suffixation, Reduplication, Infixation

  7. Morphology is Complex – Operations: Prefixation, Suffixation, Reduplication, Infixation

  8. Morphology is Complex – Operations: Prefixation, Suffixation, Reduplication, Infixation

  9. Morphology is Complex – Morphophonology. götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular): "You are not being taken."

  10. Morphology is Complex – Morphophonology. götür (take) + ül (passive) + m (negative) + yecek (future) + sun (2nd person singular): "You will not be taken."

  11. Morphology is Complex – Morphophonology. götür (take) + ül (passive) + m (negative) + yecek (future) + sun (2nd person singular): "You will not be taken."

  12. Morphology is Complex – Morphophonology. götür (take) + ül (passive) + me (negative) + yecek (future) + sun (2nd person singular): "You will not be taken."

  13. Morphology is Complex – Morphophonology. götür (take) + ül (passive) + me (negative) + yecek (future) + sin (2nd person singular): "You will not be taken."

  14. Morphology is Complex – Morphophonology. götür (take) + ül (passive) + me (negative) + yecek (future) + sin (2nd person singular): "You will not be taken."

  15. Morphology is Complex – Ambiguity. Hungarian mentek: men +tek = go +Present.2nd.Plural, ‘yinz go’

  16. Morphology is Complex – Ambiguity. Hungarian mentek: men +tek = go +Present.2nd.Plural, ‘yinz go’; men +t +ek = go +PastParticiple +Plural, ‘those who have gone’
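The ambiguity on slides 15 and 16 can be reproduced with a small enumeration. This is only an illustrative sketch, not part of ParaMor: the function name `analyses` and the toy stem and suffix inventories are assumptions made up for this example, using just the Hungarian forms shown on the slides.

```python
def analyses(word, stems, suffixes):
    """Return every split of `word` into a known stem plus one or more known suffixes."""
    results = []

    def extend(rest, parts):
        # `rest` is the part of the word not yet covered by a suffix.
        if not rest:
            results.append(parts)
            return
        for suffix in sorted(suffixes):
            if rest.startswith(suffix):
                extend(rest[len(suffix):], parts + [suffix])

    for i in range(1, len(word)):
        stem = word[:i]
        if stem in stems:
            extend(word[i:], [stem])
    return results


# Hypothetical toy inventories taken from the slide example only.
toy_stems = {"men"}
toy_suffixes = {"tek", "t", "ek"}

for parts in analyses("mentek", toy_stems, toy_suffixes):
    print(" +".join(parts))
```

Both `men +t +ek` and `men +tek` come back, which is the point of the slide: the surface string alone does not decide between the two analyses.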

  17. In Morphology Systems for New Languages Complexity Time + Expertise

  18. In Morphology Systems for New Languages Complexity Time + Expertise. Kemal Oflazer: expert on Turkish computational morphology. Time: 3-4 months to manually build a basic Turkish analyzer, plus lexicon development and maintenance.

  19. The Solution Raw Text Unsupervised Morphology Induction

  20. The Solution Raw Text ?

  21. The Solution Raw Text Language Structure

  22. Techniques for Unsupervised Morphology Induction. Transition Likelihood: Harris (1955) – Finite State Automata; Bernhard (2007).

  23. Techniques for Unsupervised Morphology Induction. Transition Likelihood: Harris (1955) – Finite State Automata; Bernhard (2007). Minimum Description Length: Goldsmith (2001, 2006); Creutz’s Morfessor (2006).

  24. Techniques for Unsupervised Morphology Induction. Contextual Similarity: Wicentowski (2002); Schone (2002).

  25. Techniques for Unsupervised Morphology Induction. Contextual Similarity: Wicentowski (2002); Schone (2002). The Paradigm: Snover (2002); ParaMor (2007).

  26. What is a Paradigm? götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular)

  27. Paradigms Structure Inflectional Morphology. Person & Number: götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular)

  28. Paradigms Structure Inflectional Morphology Person & Number ül m üyor um götür um present progressive take passive negative 1st person singular

  29. Paradigms Structure Inflectional Morphology Person & Number ül m üyor um götür um Ø present progressive take passive negative 3rd person singular

  30. Paradigms Structure Inflectional Morphology Person & Number ül m üyor um götür um Ø uz present progressive take passive negative

  31. Paradigms Structure Inflectional Morphology Paradigm Paradigm Mutually substitutable morphological operations ül m üyor um götür um Ø uz present progressive take passive negative

  32. Paradigms Structure Inflectional Morphology yecek Tense & Aspect Person & Number Voice Polarity üyor ül m um um Ø uz

  33. Paradigms Structure Inflectional Morphology yecek Paradigms Paradigm Mutually substitutable morphological operations üyor ül m um um Ø uz

  34. The ParaMor Algorithm yecek Paradigm Paradigm Mutually substitutable strings üyor ül m um um Ø uz

  35. The ParaMor Algorithm yecek Candidate Stems Paradigm üyor ül m um um Ø uz 1 Morpheme Boundary

  36. The ParaMor Algorithm. Simplifying assumptions: suffixes only (70% of the world’s languages are suffixing; Dryer, 2005); strict concatenation.

  37. The ParaMor Algorithm. Simplifying assumptions: suffixes only (70% of the world’s languages are suffixing; Dryer, 2005); strict concatenation. Only a high-level overview.

  38. The ParaMor Algorithm Identify Paradigms in 3 Steps

  39. The ParaMor Algorithm Identify Paradigms in 3 Steps • Search for candidate paradigms

  40. The ParaMor Algorithm Identify Paradigms in 3 Steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm

  41. The ParaMor Algorithm Identify Paradigms in 3 Steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm • Filter least likely candidates

  42. The ParaMor Algorithm Identify Paradigms in 3 Steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm • Filter least likely candidates Segment Words Using the discovered paradigms

  43. The ParaMor Algorithm Identify Paradigms in 3 Steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm • Filter Segment Words Using the discovered paradigms Today

  44. The ParaMor Algorithm Identify Paradigms in 3 Steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm • Filter Segment Words Using the discovered paradigms

  45. Search for Candidate Paradigms. Propose a morpheme boundary at every character boundary in every word. Consolidate identical candidate suffixes into paradigm seeds. Spanish example, word list of 50,000 types: autorizaciones, buscabamos, costas, importadoras, vallas, … Paradigm seed: s (10697).
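The seeding step on slide 45 can be sketched in a few lines. This is a hedged illustration rather than ParaMor's implementation: the function name `paradigm_seeds` is hypothetical, the empty string stands in for the null suffix Ø, and the choice to also allow a boundary at the end of each word is an assumption.

```python
from collections import defaultdict


def paradigm_seeds(words):
    """Map each candidate suffix to the set of candidate stems it attaches to."""
    stems_by_suffix = defaultdict(set)
    for word in set(words):  # work on word types, not tokens
        # Propose a boundary after every character; i == len(word)
        # yields the empty suffix, used here to represent Ø.
        for i in range(1, len(word) + 1):
            stems_by_suffix[word[i:]].add(word[:i])
    return stems_by_suffix


# The five Spanish word forms listed on the slide.
words = ["autorizaciones", "buscabamos", "costas", "importadoras", "vallas"]
seeds = paradigm_seeds(words)
print(sorted(seeds["s"]))
print(sorted(seeds["as"]))
```

Each key is a one-suffix paradigm seed, and the size of its stem set plays the role of the count shown next to `s` on the slide.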

  46. Search for Candidate Paradigms. Identify the most frequent mutually replaceable candidate suffix. Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm. Spanish example: autorizaciones, buscabamos, costaØ, costas, importadoraØ, importadoras, vallaØ, vallas, … Candidate paradigms: Ø s (5513); s (10697).
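One way to read "most frequent mutually replaceable candidate suffix" is: among the stems the current suffix set already shares, find the additional suffix that the largest number of them also take. The sketch below follows that reading. The names `stems_for` and `best_extension` are hypothetical, the empty string again represents Ø, and the vocabulary is a toy list built from the word forms on slides 46 and 47.

```python
def stems_for(suffixes, vocabulary):
    """Candidate stems that form a vocabulary word with every suffix in `suffixes`."""
    first, *rest = suffixes
    stems = {
        w[: len(w) - len(first)]
        for w in vocabulary
        if w.endswith(first) and len(w) > len(first)
    }
    return {t for t in stems if all(t + s in vocabulary for s in rest)}


def best_extension(suffixes, vocabulary):
    """Return (suffix, stems) for the new suffix shared by the most current stems, or None."""
    counts = {}
    for stem in stems_for(suffixes, vocabulary):
        for word in vocabulary:
            if word.startswith(stem):
                candidate = word[len(stem):]  # "" means the bare stem is itself a word
                if candidate not in suffixes:
                    counts.setdefault(candidate, set()).add(stem)
    if not counts:
        return None
    # sorted() makes tie-breaking deterministic.
    suffix = max(sorted(counts), key=lambda s: len(counts[s]))
    return suffix, counts[suffix]


vocabulary = {
    "autorizaciones", "buscabamos", "costa", "costas", "costar",
    "importadora", "importadoras", "valla", "vallas",
}
print(best_extension(["s"], vocabulary))
```

On this toy list the seed `s` is extended with Ø, because three of its stems (costa, importadora, valla) also occur as bare words, while `r` is supported only by costa.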

  47. Search for Candidate Paradigms. A parameter halts the introduction of suffixes when the most frequent mutually replaceable candidate suffix severely decreases the stem count. Spanish example: autorizaciones, buscabamos, costar, costaØ, costas, importadoraØ, importadoras, vallaØ, vallas, … Candidate paradigms: Ø r s (281); Ø s (5513); s (10697).
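The halting behaviour can be sketched as a greedy loop with a threshold. The slides do not state ParaMor's actual criterion or parameter value, so everything specific here is an assumption: the name `grow`, the parameter `min_ratio`, and its default of 0.5 are hypothetical stand-ins for "severely decreases the stem count".

```python
def stems_for(suffixes, vocabulary):
    """Candidate stems that form a vocabulary word with every suffix in `suffixes`."""
    first, *rest = suffixes
    stems = {
        w[: len(w) - len(first)]
        for w in vocabulary
        if w.endswith(first) and len(w) > len(first)
    }
    return {t for t in stems if all(t + s in vocabulary for s in rest)}


def grow(seed, vocabulary, min_ratio=0.5):
    """Greedily add suffixes to `seed` until the best addition keeps too few stems."""
    suffixes = [seed]
    stems = stems_for(suffixes, vocabulary)
    while True:
        counts = {}
        for stem in stems:
            for word in vocabulary:
                if word.startswith(stem):
                    candidate = word[len(stem):]  # "" stands for the null suffix
                    if candidate not in suffixes:
                        counts.setdefault(candidate, set()).add(stem)
        if not counts:
            break
        best = max(sorted(counts), key=lambda s: len(counts[s]))
        # Hypothetical halting rule: stop when the stem count would fall
        # below min_ratio of its current value.
        if len(counts[best]) < min_ratio * len(stems):
            break
        suffixes.append(best)
        stems = counts[best]
    return suffixes, stems


vocabulary = {
    "autorizaciones", "buscabamos", "costa", "costas", "costar",
    "importadora", "importadoras", "valla", "vallas",
}
print(grow("s", vocabulary))
```

With the counts on the slide, this hypothetical threshold would accept the step from `s` (10697) to `Ø s` (5513), a ratio of about 0.52, and reject the step to `Ø r s` (281), a ratio of about 0.05. Whether ParaMor's real parameter draws the line there is not something the slides confirm.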

  48. Search for Candidate Paradigms. Parameters set to produce high-recall Spanish paradigms, and then frozen. Spanish example: autorizaciones, buscabamos, costar, costaØ, costas, importadoraØ, importadoras, vallaØ, vallas, … Candidate paradigms: Ø r s (281); Ø s (5513); s (10697).

  49. Search for Candidate Paradigms. Move on to the next most frequent paradigm seed. Candidate paradigms: Ø r s (281); Ø s (5513); a (9020); s (10697).

  50. Search for Candidate Paradigms. Candidate paradigms: a as o os (899); a o os (1418); Ø r s (281); a o (2325); Ø s (5513); a (9020); s (10697).
