
Unsupervised Morphological Segmentation With Log-Linear Models


Presentation Transcript


  1. Unsupervised Morphological Segmentation With Log-Linear Models Hoifung Poon University of Washington Joint Work with Colin Cherry and Kristina Toutanova

  2. Machine Learning in NLP

  3. Machine Learning in NLP Unsupervised Learning

  4. Machine Learning in NLP Unsupervised Learning Log-Linear Models ?

  5. Machine Learning in NLP Unsupervised Learning Log-Linear Models Little work except for a couple of cases

  6. Machine Learning in NLP Unsupervised Learning Log-Linear Models ? Global Features

  7. Machine Learning in NLP Unsupervised Learning Log-Linear Models ??? Global Features

  8. Machine Learning in NLP Unsupervised Learning Log-Linear Models Global Features We developed a method for Unsupervised Learning of Log-Linear Models with Global Features

  9. Machine Learning in NLP Unsupervised Learning Log-Linear Models Global Features We applied it to morphological segmentation and reduced F1 error by 10% to 50% compared to the state of the art

  10. Outline • Morphological segmentation • Our model • Learning and inference algorithms • Experimental results • Conclusion

  11. Morphological Segmentation • Breaks words into morphemes

  12. Morphological Segmentation • Breaks words into morphemes governments

  13. Morphological Segmentation • Breaks words into morphemes governments → govern ment s

  14. Morphological Segmentation • Breaks words into morphemes governments → govern ment s lm$pxtm (according to their families)

  15. Morphological Segmentation • Breaks words into morphemes governments → govern ment s lm$pxtm → l m$px t m (according to their families)

  16. Morphological Segmentation • Breaks words into morphemes governments → govern ment s lm$pxtm → l m$px t m (according to their families) • Key component in many NLP applications • Particularly important for morphologically rich languages (e.g., Arabic, Hebrew, …)

  17. Why Unsupervised Learning ? • Text: Unlimited supplies in any language • Segmentation labels? • Only for a few languages • Expensive to acquire

  18. Why Log-Linear Models ? Can incorporate arbitrary overlapping features E.g., Alrb (the lord) • Morpheme features: • Substrings Al, rb are likely morphemes • Substrings Alr, lrb are not likely morphemes • Etc. • Context features: • Substrings between Al and # are likely morphemes • Substrings between lr and # are not likely morphemes • Etc.

  19. Why Global Features ? • Words can inform each other on segmentation • E.g., Alrb (the lord), lAlrb (to the lord)

  20. State of the Art in Unsupervised Morphological Segmentation • Use directed graphical models • Morfessor [Creutz & Lagus 2007]: Hidden Markov Model (HMM) • Goldwater et al. [2006]: based on Pitman-Yor processes • Snyder & Barzilay [2008a, 2008b] • Based on Dirichlet processes • Uses bilingual information to help segmentation • Phrasal alignment • Prior knowledge of phonetic correspondence, e.g., Hebrew w corresponds to Arabic w, f; ...

  21. Unsupervised Learning with Log-Linear Models Few approaches exist to date • Contrastive estimation [Smith & Eisner 2005] • Sampling [Poon & Domingos 2008]

  22. This Talk • First log-linear model for unsupervised morphological segmentation • Combines contrastive estimation with sampling • Achieves state-of-the-art results • Can be applied to semi-supervised learning

  23. Outline • Morphological segmentation • Our model • Learning and inference algorithms • Experimental results • Conclusion

  24. Log-Linear Model • State variable x ∈ X • Features fi: X → R • Weights λi • Defines probability distribution over the states

  25. Log-Linear Model • State variables x ∈ X • Features fi: X → R • Weights λi • Defines probability distribution over the states
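
For reference, the distribution described on this slide has the standard log-linear form; this is a generic statement of that form, not copied from the slide:

```latex
P_\lambda(x) = \frac{1}{Z}\exp\Big(\sum_i \lambda_i f_i(x)\Big),
\qquad
Z = \sum_{x' \in X} \exp\Big(\sum_i \lambda_i f_i(x')\Big)
```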

  26. States for Unsupervised Morphological Segmentation • Words: wvlAvwn, Alrb, … • Segmentation: w vlAv wn, Al rb, … • Induced lexicon (unique morphemes): w, vlAv, wn, Al, rb

  27. Features for Unsupervised Morphological Segmentation • Morphemes and contexts • Exponential priors on model complexity

  28. Morphemes and Contexts • Count number of occurrences • Inspired by CCM [Klein & Manning, 2001] • E.g., w vlAv wn: w (##_vl), vlAv (#w_wn), wn (Av_##); unsegmented wvlAvwn (##_##)
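
A minimal sketch of the morpheme and context counting above, assuming the two-character context window with # padding shown in the slide's example; the function name and feature encoding are illustrative, not the authors' code:

```python
from collections import Counter

def morpheme_context_features(morphemes, pad="#"):
    """Count morpheme and (left_right) context occurrences for one segmented word.

    morphemes: list of strings, e.g. ["w", "vlAv", "wn"] for the word "wvlAvwn".
    Contexts use the two characters on each side, padded with '#' at word edges.
    """
    padded = pad * 2 + "".join(morphemes) + pad * 2
    feats = Counter()
    pos = 2  # skip the left padding
    for m in morphemes:
        left = padded[pos - 2:pos]
        right = padded[pos + len(m):pos + len(m) + 2]
        feats[("MORPHEME", m)] += 1
        feats[("CONTEXT", f"{left}_{right}")] += 1
        pos += len(m)
    return feats

# Example from the slide: w vlAv wn
print(morpheme_context_features(["w", "vlAv", "wn"]))
# contexts ##_vl, #w_wn, Av_## for morphemes w, vlAv, wn
```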

  29. Complexity-Based Priors • Lexicon prior: • On lexicon length (total number of characters) • Favor fewer and shorter morpheme types • Corpus prior: • On number of morphemes per word (normalized by word length) • Favor fewer morpheme tokens • E.g., for the segmentations l Al rb and Al rb: the lexicon {l, Al, rb} has length 5; l Al rb contributes 3 morphemes / 5 characters = 3/5; Al rb contributes 2 morphemes / 4 characters = 2/4
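
A small sketch of how these two prior features could be computed for the example above; the function name and the corpus representation are assumptions made for illustration:

```python
from fractions import Fraction

def prior_features(segmented_corpus):
    """segmented_corpus: dict mapping word -> list of morphemes."""
    lexicon = {m for morphs in segmented_corpus.values() for m in morphs}
    lexicon_length = sum(len(m) for m in lexicon)          # lexicon prior feature
    corpus_feature = sum(Fraction(len(morphs), len(word))  # corpus prior feature
                         for word, morphs in segmented_corpus.items())
    return lexicon_length, corpus_feature

print(prior_features({"lAlrb": ["l", "Al", "rb"], "Alrb": ["Al", "rb"]}))
# -> (5, Fraction(11, 10))   i.e. lexicon length 5, corpus term 3/5 + 2/4
```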

  30. Lexicon Prior Is Global Feature • Renders words interdependent in segmentation • E.g., lAlrb, Alrb lAlrb → ? Alrb → ?

  31. Lexicon Prior Is Global Feature • Renders words interdependent in segmentation • E.g., lAlrb, Alrb lAlrb → l Al rb Alrb → ?

  32. Lexicon Prior Is Global Feature • Renders words interdependent in segmentation • E.g., lAlrb, Alrb lAlrb → l Al rb Alrb → Al rb

  33. Lexicon Prior Is Global Feature • Renders words interdependent in segmentation • E.g., lAlrb, Alrb lAlrb → l Alrb Alrb → ?

  34. Lexicon Prior Is Global Feature • Renders words interdependent in segmentation • E.g., lAlrb, Alrb lAlrb → l Alrb Alrb → Alrb

  35. Probability Distribution • For corpus W and segmentation S, the probability is defined by four groups of features: morphemes, contexts, the lexicon prior, and the corpus prior [formula shown on slide]
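
A reconstruction of the distribution from the four factors named on the slide; the exact notation is assumed, with α and β as negative weights so that the last two terms act as exponential penalties:

```latex
P(W, S) \propto \exp\Big(
    \sum_{\sigma} \lambda_{\sigma} f_{\sigma}(S)
  + \sum_{c} \lambda_{c} f_{c}(S)
  + \alpha \sum_{\sigma \in \mathrm{Lex}(S)} |\sigma|
  + \beta \sum_{w \in W} \frac{|S_w|}{|w|}
\Big)
```

Here f_σ and f_c count morpheme and context occurrences, Lex(S) is the induced lexicon, and |S_w| is the number of morphemes in word w.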

  36. Outline • Morphological segmentation • Our model • Learning and inference algorithms • Experimental results • Conclusion

  37. Learning with Log-Linear Models • Maximizes likelihood of the observed data • Moves probability mass to the observed data • From where? The set X that Z sums over • Normally, X = {all possible states} • Major challenge: Efficient computation (approximation) of the sum • Particularly difficult in unsupervised learning

  38. Contrastive Estimation • Smith & Eisner [2005] • X = a neighborhood of the observed data • Neighborhood → pseudo-negative examples • Discriminate them from observed instances

  39. Problem with Contrastive Estimation • Objects are treated as independent of each other • Using global features leads to intractable inference • In our case, the lexicon prior could not be used

  40. Sampling to the Rescue • Similar to Poon & Domingos [2008] • Markov chain Monte Carlo • Estimates sufficient statistics based on samples • Straightforward to handle global features

  41. Our Learning Algorithm • Combines both ideas • Contrastive estimation → creates an informative neighborhood • Sampling → enables the global feature (the lexicon prior)

  42. Learning Objective • Observed: W* (words) • Hidden: S (segmentation) • Maximizes log-likelihood of observing the words
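
One way to write this objective under contrastive estimation, with u_θ denoting the model's unnormalized score and N(·) the neighborhood defined on the next slide (the notation is assumed):

```latex
L_\theta(W^*) = \log \frac{\sum_{S} u_\theta(W^*, S)}
                          {\sum_{W \in N(W^*)} \sum_{S} u_\theta(W, S)}
```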

  43. Neighborhood • TRANS1: transpose any pair of adjacent characters • Intuition: transposition usually leads to a non-word • E.g., lAlrb → Allrb, llArb, … Alrb → lArb, Arlb, …
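
A minimal sketch of TRANS1 neighbor generation (the function name is illustrative):

```python
def trans1_neighborhood(word):
    """All strings obtained by transposing one pair of adjacent characters."""
    neighbors = set()
    for i in range(len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if swapped != word:          # skip no-ops from repeated characters
            neighbors.add(swapped)
    return neighbors

print(trans1_neighborhood("Alrb"))   # {'lArb', 'Arlb', 'Albr'}
```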

  44. Optimization • Gradient descent
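
For this log-linear objective the gradient has the usual difference-of-expectations form, matching the two expectations computed by the sampler on slide 46; this is a sketch, not necessarily the exact equation on the slide:

```latex
\frac{\partial L_\theta}{\partial \lambda_i}
  = \mathbb{E}_{S \mid W^*}\!\left[f_i\right]
  - \mathbb{E}_{W, S}\!\left[f_i\right]
```

The first expectation conditions on the observed words; the second ranges over the neighborhood words and their segmentations.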

  45. Supervised Learning and Semi-Supervised Learning • Readily applicable if there are labeled segmentations (S*) • Supervised: labels for all words • Semi-supervised: labels for some words

  46. Inference: Expectation • Gibbs sampling • E_{S|W*}[fi]: For each observed word in turn, sample the next segmentation, conditioning on the rest • E_{W,S}[fi]: For each observed word in turn, sample a word from the neighborhood and the next segmentation, conditioning on the rest
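
A highly simplified sketch of the word-at-a-time Gibbs sampler described above, shown here for estimating E_{S|W*}[fi]; segmentations are enumerated exhaustively for brevity, and log_score / features are assumed interfaces standing in for the model, not the authors' implementation:

```python
import math, random
from collections import Counter

def segmentations(word):
    """All ways to split `word` into contiguous morphemes (2^(n-1) of them)."""
    if len(word) <= 1:
        return [[word]]
    out = []
    for i in range(1, len(word) + 1):
        rest = segmentations(word[i:]) if word[i:] else [[]]
        out.extend([word[:i]] + r for r in rest)
    return out

def gibbs_expectation(words, log_score, features, n_iters=50):
    """Estimate E_{S|W*}[f]: resample each word's segmentation in turn,
    conditioned on the current segmentations of all other words.

    log_score(word, seg, state) -> unnormalized log score given the rest;
    features(word, seg) -> dict of feature counts.  Both are assumed interfaces.
    """
    state = {w: [w] for w in words}            # start with every word unsegmented
    totals, n = Counter(), 0
    for _ in range(n_iters):
        for w in words:
            cands = segmentations(w)
            logp = [log_score(w, s, state) for s in cands]
            m = max(logp)
            weights = [math.exp(v - m) for v in logp]
            state[w] = random.choices(cands, weights=weights)[0]
        for w, seg in state.items():
            totals.update(features(w, seg))
        n += 1
    return {f: c / n for f, c in totals.items()}
```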

  47. Inference: MAP Segmentation • Deterministic annealing • Gibbs sampling with temperature • Gradually lower the temperature from 10 to 0.1
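
A sketch of annealed sampling: raise unnormalized probabilities to the power 1/T and lower T over time, so the sampler gradually concentrates on the highest-scoring segmentation. Only the 10 → 0.1 range comes from the slide; the geometric schedule is an assumption:

```python
import math

def annealing_schedule(t_start=10.0, t_end=0.1, n_steps=20):
    """Geometrically interpolated temperatures from t_start down to t_end."""
    ratio = (t_end / t_start) ** (1.0 / (n_steps - 1))
    return [t_start * ratio ** k for k in range(n_steps)]

def annealed_weights(log_scores, temperature):
    """Sampling weights proportional to P^(1/T): divide log scores by T."""
    m = max(log_scores)
    return [math.exp((s - m) / temperature) for s in log_scores]
```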

  48. Outline • Morphological segmentation • Our model • Learning and inference algorithms • Experimental results • Conclusion

  49. Dataset • S&B: Snyder & Barzilay [2008a, 2008b] • About 7,000 parallel short phrases • Arabic and Hebrew with gold segmentation • Arabic Penn Treebank (ATB): 120,000 words

  50. Methodology • Development set: 500 words from S&B • Use trigram context in our full model • Evaluation: Precision, recall, F1 on segmentation points
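
A sketch of boundary-level scoring as described above (precision, recall, F1 over segmentation points); representing each segmentation by its set of split positions is an assumption about the bookkeeping, not the authors' scorer:

```python
def boundary_set(morphemes):
    """Split positions inside a word, e.g. ["govern", "ment", "s"] -> {6, 10}."""
    points, pos = set(), 0
    for m in morphemes[:-1]:
        pos += len(m)
        points.add(pos)
    return points

def segmentation_f1(gold, pred):
    """gold, pred: parallel lists of segmented words (each a list of morphemes)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gb, pb = boundary_set(g), boundary_set(p)
        tp += len(gb & pb)
        fp += len(pb - gb)
        fn += len(gb - pb)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

print(segmentation_f1([["govern", "ment", "s"]], [["govern", "ments"]]))
# precision 1.0, recall 0.5, F1 ≈ 0.67
```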
