Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu


Presentation Transcript


  1. Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data. Noah A. Smith and Jason Eisner, Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University. {nasmith,jason}@cs.jhu.edu. ACL 2005.

  2. Nutshell Version. Unannotated text + “max ent” features + sequence models → tractable training via contrastive estimation with lattice neighborhoods. Experiments on unlabeled data: POS tagging, 46% error rate reduction (relative to EM); “max ent” features make it possible to survive damage to the tag dictionary. Dependency parsing, 21% attachment error reduction (relative to EM).

  3. “Red leaves don’t hide blue jays.”

  4. Maximum Likelihood Estimation (Supervised). Observed tags y = JJ NNS MD VB JJ NNS over observed words x = red leaves don’t hide blue jays. The numerator is p of the observed pair (x, y); the denominator sums p over all of Σ* × Λ* (every possible sentence paired with every possible tagging).

  5. Maximum Likelihood Estimation (Unsupervised). The tags are hidden: the numerator sums p(x, y) over every tagging y of the observed words x = red leaves don’t hide blue jays; the denominator still sums over all of Σ* × Λ*. This is what EM does.

  6. Focusing Probability Mass. [figure: the numerator set sits inside the denominator set; training moves probability mass toward the numerator]

  7. Focusing Probability Mass. [figure, continued: mass concentrates on the numerator at the expense of the rest of the denominator]

  8. Conditional Estimation (Supervised). The numerator is the observed pair: tags y = JJ NNS MD VB JJ NNS over words x = red leaves don’t hide blue jays. The denominator sums over the same words under every possible tagging, i.e. {x} × Λ*. A different denominator!
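Written out (a reconstruction from the slide's description, not a formula shown in the transcript), the conditional objective keeps the joint numerator but restricts the denominator to taggings of the observed sentence:

```latex
\max_\theta \prod_i p_\theta(y_i \mid x_i)
  = \max_\theta \prod_i
    \frac{p_\theta(x_i, y_i)}{\sum_{y' \in \Lambda^*} p_\theta(x_i, y')}
```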

  9. Objective Functions. [table: each estimation criterion written as a numerator over a denominator. *For generative models.]

  10. Objective Functions. These objectives can all be optimized with generic numerical solvers (in this talk, LMVM L-BFGS). For Contrastive Estimation, the numerator is the observed data (in this talk, the raw word sequence, summed over all possible taggings); the denominator is yet to be chosen. *For generative models.

  11. This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

  12. Language Learning (Syntax). red leaves don’t hide blue jays. Why didn’t he say, “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words? That is EM’s question.

  13. Language Learning (Syntax). red leaves don’t hide blue jays. Why did he pick that sequence for those words? Why not say “leaves red ...” or “... hide don’t ...” or ...?

  14. What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood.

  15. The Job of Syntax: “Explain why each word is necessary.” → DEL1WORD neighborhood. For “red leaves don’t hide blue jays”: • leaves don’t hide blue jays • red don’t hide blue jays • red leaves hide blue jays • red leaves don’t blue jays • red leaves don’t hide jays • red leaves don’t hide blue • (and the sentence itself)
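A minimal sketch of DEL1WORD as a generator (an illustration, not the authors' code; the paper encodes this set compactly as a lattice rather than enumerating it):

```python
def del1word(words):
    """Yield the DEL1WORD neighborhood of a sentence: the sentence
    itself plus every sequence formed by deleting exactly one word."""
    yield list(words)                     # the observed sentence stays in
    for i in range(len(words)):
        yield words[:i] + words[i + 1:]   # drop word i

for neighbor in del1word("red leaves don't hide blue jays".split()):
    print(" ".join(neighbor))
```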

  16. The Job of Syntax: “Explain the (local) order of the words.” → TRANS1 neighborhood. For “red leaves don’t hide blue jays”: • leaves red don’t hide blue jays • red don’t leaves hide blue jays • red leaves hide don’t blue jays • red leaves don’t blue hide jays • red leaves don’t hide jays blue • (and the sentence itself)
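TRANS1 admits the same kind of sketch (again an illustration; the lattice encoding is what makes this efficient in practice):

```python
def trans1(words):
    """Yield the TRANS1 neighborhood: the sentence itself plus every
    sequence formed by swapping one pair of adjacent words."""
    yield list(words)
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        yield swapped
```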

  17. [figure: the sentences in the TRANS1 neighborhood, each paired with every possible tagging; their total probability is the denominator, set against p of the observed sentence summed over its taggings]

  18. [figure: numerator = p(red leaves don’t hide blue jays, with any tagging); denominator = p summed over the sentences in the TRANS1 neighborhood, represented compactly as a lattice]

  19. The New Modeling Imperative. A good sentence hints that a set of bad ones is nearby. The numerator is the good sentence; the denominator is its neighborhood. “Make the good sentence likely, at the expense of those bad neighbors.”
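In symbols, this is the contrastive objective from the paper, with u_θ(x, y) = exp(θ · f(x, y)) the unnormalized score and N(x_i) the neighborhood of the i-th observed sentence:

```latex
\max_\theta \prod_i
  \frac{\sum_{y \in \Lambda^*} u_\theta(x_i, y)}
       {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u_\theta(x', y)}
```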

  20. This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

  21. Log-Linear Models. The model assigns each pair (x, y) a score, exp(θ · f(x, y)), normalized by the partition function Z. Computing Z is undesirable: it sums over all possible taggings of all possible sentences! Conditional estimation (supervised) shrinks the sum to 1 sentence; contrastive estimation (unsupervised) shrinks it to a few sentences.
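A brute-force toy of this restricted normalizer (an assumed sketch, not the authors' implementation; `feats`, `neighborhood`, and `taggings` are hypothetical callables, and plain enumeration stands in for the lattice dynamic programming the paper actually uses):

```python
import math

def log_ce(theta, feats, x, neighborhood, taggings):
    """Contrastive log-likelihood of one observed sentence x:
        log sum_y u(x, y)  -  log sum_{x' in N(x)} sum_y u(x', y)
    with u(x, y) = exp(theta . feats(x, y))."""
    def u(sent, tags):
        return math.exp(sum(theta.get(f, 0.0) * v
                            for f, v in feats(sent, tags).items()))
    numerator = sum(u(x, y) for y in taggings(x))
    denominator = sum(u(xp, y)
                      for xp in neighborhood(x) for y in taggings(xp))
    return math.log(numerator) - math.log(denominator)
```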

  22. A Big Picture: Sequence Model Estimation. [diagram: generative models keep sums tractable, trained by MLE on p(x, y) or, on unannotated data, by EM on p(x); log-linear models allow overlapping features, trained by MLE on p(x, y), by conditional estimation on p(y | x), or by EM on p(x); log-linear CE with lattice neighborhoods is the combination that gets unannotated data, overlapping features, and tractable sums all at once]

  23. Contrastive Neighborhoods. • Guide the learner toward models that do what syntax is supposed to do. • Lattice representation → efficient algorithms. There is an art to choosing neighborhood functions.

  24. Neighborhoods. [diagram: DEL1WORD and TRANS1, their union DELORTRANS1, the larger DEL1SUBSEQUENCE, and the full set Σ*]
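Two of the remaining neighborhoods, sketched in the same hypothetical style, reusing `del1word` and `trans1` from the sketches above (DELORTRANS1 is simply their union):

```python
def delortrans1(words):
    """DELORTRANS1: the union of DEL1WORD and TRANS1."""
    yield from del1word(words)
    for neighbor in trans1(words):
        if neighbor != list(words):       # skip the duplicate original
            yield neighbor

def del1subsequence(words):
    """DEL1SUBSEQUENCE: the sentence plus every sequence formed by
    deleting one contiguous run of words (but not the whole sentence)."""
    n = len(words)
    yield list(words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if (i, j) != (0, n):
                yield words[:i] + words[j:]
```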

  25. The Merialdo (1994) Task. Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag. The dictionary is a form of supervision.

  26. Trigram Tagging Model. Tags JJ NNS MD VB JJ NNS over words red leaves don’t hide blue jays. Feature set: • tag trigrams • tag/word pairs from a POS dictionary
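A hypothetical rendering of this feature set (feature names and padding symbols are mine, not from the paper):

```python
from collections import Counter

def tagging_features(words, tags):
    """Count tag-trigram features and tag/word pair features
    for one (sentence, tagging) pair."""
    counts = Counter()
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    for trigram in zip(padded, padded[1:], padded[2:]):
        counts[("trigram",) + trigram] += 1
    for word, tag in zip(words, tags):
        counts[("tag/word", tag, word)] += 1
    return counts

feats = tagging_features("red leaves don't hide blue jays".split(),
                         ["JJ", "NNS", "MD", "VB", "JJ", "NNS"])
```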

  27. [chart: tagging accuracy on the Merialdo task with 96K words, a full POS dictionary, an uninformative initializer, and the best of 8 smoothing conditions. DELORTRANS1 and TRANS1 outperform EM (Merialdo, 1994), EM with 10 × the data, and deterministic annealing (DA; Smith & Eisner, 2004); LENGTH ≈ log-linear EM; DEL1WORD and DEL1SUBSEQUENCE do worse; random and the supervised HMM and CRF bracket the range]

  28. What if we damage the POS dictionary? Dictionary includes ... • all words • words from 1st half of corpus • words with count ≥ 2 • words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.

  29. [chart: tagging accuracy under each dictionary condition from the previous slide (96K words, 17 coarse POS tags, uninformative initializer); DELORTRANS1 and LENGTH degrade more gracefully than EM as the dictionary is damaged; random shown as a baseline]

  30. Trigram Tagging Model + Spelling. Tags JJ NNS MD VB JJ NNS over words red leaves don’t hide blue jays. Feature set: • tag trigrams • tag/word pairs from a POS dictionary • 1- to 3-character suffixes, contains hyphen, contains digit
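The added spelling features might be extracted like this (again a hypothetical rendering; the feature names are mine):

```python
def spelling_features(word, tag):
    """Spelling features for one word/tag pair: 1- to 3-character
    suffixes, plus contains-hyphen and contains-digit indicators."""
    feats = [("suffix", tag, word[-k:]) for k in (1, 2, 3) if len(word) >= k]
    if "-" in word:
        feats.append(("hyphen", tag))
    if any(ch.isdigit() for ch in word):
        feats.append(("digit", tag))
    return feats
```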

  31. Spelling features aided recovery, but only with a smart neighborhood. [chart: accuracy under damaged dictionaries for EM, random, LENGTH, LENGTH + spelling, DELORTRANS1, and DELORTRANS1 + spelling]

  32. The model need not be finite-state.

  33. Unsupervised Dependency Parsing. [chart: attachment accuracy for the Klein & Manning (2004) model trained with EM versus CE with LENGTH and TRANS1 neighborhoods, under different initializers] See our paper at the IJCAI 2005 Grammatical Inference workshop.

  34. To Sum Up ... Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both). Now we can use the task to guide the unsupervised learner (as discriminative techniques do for supervised learners). It’s a particularly good fit for log-linear models: “max ent” features for unsupervised sequence models, all in time for ACL 2006.

  35. [closing slide]
