Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu


Presentation Transcript


  1. Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data. Noah A. Smith and Jason Eisner, Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University. {nasmith,jason}@cs.jhu.edu. ACL 2005.

  2. Nutshell Version. Unannotated text + “max ent” features + sequence models → tractable training via contrastive estimation with lattice neighborhoods. Experiments on unlabeled data: POS tagging, 46% error rate reduction (relative to EM); “max ent” features make it possible to survive damage to the tag dictionary. Dependency parsing, 21% attachment error reduction (relative to EM).

  3. “Red leaves don’t hide blue jays.”

  4. Maximum Likelihood Estimation (Supervised). Observed tags y = JJ NNS MD VB JJ NNS over observed words x = red leaves don’t hide blue jays. The numerator is p of the observed pair (x, y); the denominator sums p over all of Σ* × Λ* (every possible sentence paired with every possible tagging).

  5. Maximum Likelihood Estimation (Unsupervised). The tags are hidden: the numerator sums p(x, y) over every tagging y of the observed words x = red leaves don’t hide blue jays; the denominator still sums over all of Σ* × Λ*. This is what EM does.

  6. Focusing Probability Mass. [figure: the numerator set sits inside the denominator set; training moves probability mass toward the numerator]

  7. Focusing Probability Mass. [figure, continued: mass concentrates on the numerator at the expense of the rest of the denominator]

  8. Conditional Estimation (Supervised). The numerator is the observed pair: tags y = JJ NNS MD VB JJ NNS over words x = red leaves don’t hide blue jays. The denominator sums over the same words under every possible tagging, i.e. {x} × Λ*. A different denominator!
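Written out (a reconstruction from the slide's description, not a formula shown in the transcript), the conditional objective keeps the joint numerator but restricts the denominator to taggings of the observed sentence:

```latex
\max_\theta \prod_i p_\theta(y_i \mid x_i)
  = \max_\theta \prod_i
    \frac{p_\theta(x_i, y_i)}{\sum_{y' \in \Lambda^*} p_\theta(x_i, y')}
```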

  9. Objective Functions. [table: each estimation criterion written as a numerator over a denominator. *For generative models.]

  10. Objective Functions. These objectives can all be optimized with generic numerical solvers (in this talk, LMVM L-BFGS). For Contrastive Estimation, the numerator is the observed data (in this talk, the raw word sequence, summed over all possible taggings); the denominator is yet to be chosen. *For generative models.

  11. This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

  12. Language Learning (Syntax). red leaves don’t hide blue jays. Why didn’t he say, “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words? That is EM’s question.

  13. Language Learning (Syntax). red leaves don’t hide blue jays. Why did he pick that sequence for those words? Why not say “leaves red ...” or “... hide don’t ...” or ...?

  14. What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood.

  15. The Job of Syntax: “Explain why each word is necessary.” → DEL1WORD neighborhood. For “red leaves don’t hide blue jays”: • leaves don’t hide blue jays • red don’t hide blue jays • red leaves hide blue jays • red leaves don’t blue jays • red leaves don’t hide jays • red leaves don’t hide blue • (and the sentence itself)
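A minimal sketch of DEL1WORD as a generator (an illustration, not the authors' code; the paper encodes this set compactly as a lattice rather than enumerating it):

```python
def del1word(words):
    """Yield the DEL1WORD neighborhood of a sentence: the sentence
    itself plus every sequence formed by deleting exactly one word."""
    yield list(words)                     # the observed sentence stays in
    for i in range(len(words)):
        yield words[:i] + words[i + 1:]   # drop word i

for neighbor in del1word("red leaves don't hide blue jays".split()):
    print(" ".join(neighbor))
```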

  16. The Job of Syntax: “Explain the (local) order of the words.” → TRANS1 neighborhood. For “red leaves don’t hide blue jays”: • leaves red don’t hide blue jays • red don’t leaves hide blue jays • red leaves hide don’t blue jays • red leaves don’t blue hide jays • red leaves don’t hide jays blue • (and the sentence itself)
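TRANS1 admits the same kind of sketch (again an illustration; the lattice encoding is what makes this efficient in practice):

```python
def trans1(words):
    """Yield the TRANS1 neighborhood: the sentence itself plus every
    sequence formed by swapping one pair of adjacent words."""
    yield list(words)
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        yield swapped
```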

  17. [figure: the sentences in the TRANS1 neighborhood, each paired with every possible tagging; their total probability is the denominator, set against p of the observed sentence summed over its taggings]

  18. [figure: numerator = p(red leaves don’t hide blue jays, with any tagging); denominator = p summed over the sentences in the TRANS1 neighborhood, represented compactly as a lattice]

  19. The New Modeling Imperative. A good sentence hints that a set of bad ones is nearby. The numerator is the good sentence; the denominator is its neighborhood. “Make the good sentence likely, at the expense of those bad neighbors.”
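In symbols, this is the contrastive objective from the paper, with u_θ(x, y) = exp(θ · f(x, y)) the unnormalized score and N(x_i) the neighborhood of the i-th observed sentence:

```latex
\max_\theta \prod_i
  \frac{\sum_{y \in \Lambda^*} u_\theta(x_i, y)}
       {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u_\theta(x', y)}
```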

  20. This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

  21. Log-Linear Models. The model assigns each pair (x, y) a score, exp(θ · f(x, y)), normalized by the partition function Z. Computing Z is undesirable: it sums over all possible taggings of all possible sentences! Conditional estimation (supervised) shrinks the sum to 1 sentence; contrastive estimation (unsupervised) shrinks it to a few sentences.
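A brute-force toy of this restricted normalizer (an assumed sketch, not the authors' implementation; `feats`, `neighborhood`, and `taggings` are hypothetical callables, and plain enumeration stands in for the lattice dynamic programming the paper actually uses):

```python
import math

def log_ce(theta, feats, x, neighborhood, taggings):
    """Contrastive log-likelihood of one observed sentence x:
        log sum_y u(x, y)  -  log sum_{x' in N(x)} sum_y u(x', y)
    with u(x, y) = exp(theta . feats(x, y))."""
    def u(sent, tags):
        return math.exp(sum(theta.get(f, 0.0) * v
                            for f, v in feats(sent, tags).items()))
    numerator = sum(u(x, y) for y in taggings(x))
    denominator = sum(u(xp, y)
                      for xp in neighborhood(x) for y in taggings(xp))
    return math.log(numerator) - math.log(denominator)
```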

  22. A Big Picture: Sequence Model Estimation. [diagram: generative models keep sums tractable, trained by MLE on p(x, y) or, on unannotated data, by EM on p(x); log-linear models allow overlapping features, trained by MLE on p(x, y), by conditional estimation on p(y | x), or by EM on p(x); log-linear CE with lattice neighborhoods is the combination that gets unannotated data, overlapping features, and tractable sums all at once]

  23. Contrastive Neighborhoods. • Guide the learner toward models that do what syntax is supposed to do. • Lattice representation → efficient algorithms. There is an art to choosing neighborhood functions.

  24. Neighborhoods. [diagram: DEL1WORD and TRANS1, their union DELORTRANS1, the larger DEL1SUBSEQUENCE, and the full set Σ*]
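Two of the remaining neighborhoods, sketched in the same hypothetical style, reusing `del1word` and `trans1` from the sketches above (DELORTRANS1 is simply their union):

```python
def delortrans1(words):
    """DELORTRANS1: the union of DEL1WORD and TRANS1."""
    yield from del1word(words)
    for neighbor in trans1(words):
        if neighbor != list(words):       # skip the duplicate original
            yield neighbor

def del1subsequence(words):
    """DEL1SUBSEQUENCE: the sentence plus every sequence formed by
    deleting one contiguous run of words (but not the whole sentence)."""
    n = len(words)
    yield list(words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if (i, j) != (0, n):
                yield words[:i] + words[j:]
```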

  25. The Merialdo (1994) Task. Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag. The dictionary is a form of supervision.

  26. Trigram Tagging Model. Tags JJ NNS MD VB JJ NNS over words red leaves don’t hide blue jays. Feature set: • tag trigrams • tag/word pairs from a POS dictionary
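A hypothetical rendering of this feature set (feature names and padding symbols are mine, not from the paper):

```python
from collections import Counter

def tagging_features(words, tags):
    """Count tag-trigram features and tag/word pair features
    for one (sentence, tagging) pair."""
    counts = Counter()
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    for trigram in zip(padded, padded[1:], padded[2:]):
        counts[("trigram",) + trigram] += 1
    for word, tag in zip(words, tags):
        counts[("tag/word", tag, word)] += 1
    return counts

feats = tagging_features("red leaves don't hide blue jays".split(),
                         ["JJ", "NNS", "MD", "VB", "JJ", "NNS"])
```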

  27. [chart: tagging accuracy on the Merialdo task with 96K words, a full POS dictionary, an uninformative initializer, and the best of 8 smoothing conditions. DELORTRANS1 and TRANS1 outperform EM (Merialdo, 1994), EM with 10 × the data, and deterministic annealing (DA; Smith & Eisner, 2004); LENGTH ≈ log-linear EM; DEL1WORD and DEL1SUBSEQUENCE do worse; random and the supervised HMM and CRF bracket the range]

  28. What if we damage the POS dictionary? Dictionary includes ... • all words • words from 1st half of corpus • words with count ≥ 2 • words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.

  29. [chart: tagging accuracy under each dictionary condition from the previous slide (96K words, 17 coarse POS tags, uninformative initializer); DELORTRANS1 and LENGTH degrade more gracefully than EM as the dictionary is damaged; random shown as a baseline]

  30. Trigram Tagging Model + Spelling. Tags JJ NNS MD VB JJ NNS over words red leaves don’t hide blue jays. Feature set: • tag trigrams • tag/word pairs from a POS dictionary • 1- to 3-character suffixes, contains hyphen, contains digit
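The added spelling features might be extracted like this (again a hypothetical rendering; the feature names are mine):

```python
def spelling_features(word, tag):
    """Spelling features for one word/tag pair: 1- to 3-character
    suffixes, plus contains-hyphen and contains-digit indicators."""
    feats = [("suffix", tag, word[-k:]) for k in (1, 2, 3) if len(word) >= k]
    if "-" in word:
        feats.append(("hyphen", tag))
    if any(ch.isdigit() for ch in word):
        feats.append(("digit", tag))
    return feats
```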

  31. Spelling features aided recovery, but only with a smart neighborhood. [chart: accuracy under damaged dictionaries for EM, random, LENGTH, LENGTH + spelling, DELORTRANS1, and DELORTRANS1 + spelling]

  32. The model need not be finite-state.

  33. Unsupervised Dependency Parsing. [chart: attachment accuracy for the Klein & Manning (2004) model trained with EM versus CE with LENGTH and TRANS1 neighborhoods, under different initializers] See our paper at the IJCAI 2005 Grammatical Inference workshop.

  34. To Sum Up ... Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both). Now we can use the task to guide the unsupervised learner (as discriminative techniques do for supervised learners). It’s a particularly good fit for log-linear models: “max ent” features for unsupervised sequence models, all in time for ACL 2006.

  35. [closing slide]
