
Log-Linear Models in NLP



Presentation Transcript


  1. Log-Linear Models in NLP Noah A. Smith Department of Computer Science / Center for Language and Speech Processing Johns Hopkins University nasmith@cs.jhu.edu

  2. Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection

  3. Data For now, we’re just talking about modeling data. No task. How to assign probability to each shape type?

  4. Maximum Likelihood. 11 degrees of freedom (12 – 1). Fewer parameters? How to smooth?

  5. Some other kinds of models [figure: model structure over Color, Shape, Size]. 11 degrees of freedom (1 + 4 + 6). These two are the same! Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape | Color) • Pr(Size | Color, Shape)

  6. Some other kinds of models [figure: model structure over Color, Shape, Size]. 9 degrees of freedom (1 + 2 + 6). Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape) • Pr(Size | Color, Shape)

  7. Some other kinds of models [figure: model structure over Color, Shape, Size]. 7 degrees of freedom (1 + 2 + 4). No zeroes here ... Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape | Size) • Pr(Color | Size)

  8. Some other kinds of models [figure: model structure over Color, Shape, Size]. 4 degrees of freedom (1 + 2 + 1). Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape) • Pr(Color)
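Those degree-of-freedom counts come straight from the sizes of the conditional probability tables. A minimal sketch that reproduces them, assuming two colors, three shapes, and two sizes (the combination that gives the 12-cell joint table):

```python
# Degree-of-freedom counts for the factorizations on slides 4-8.
# Assumption (not stated explicitly above): |Color| = 2, |Shape| = 3, |Size| = 2.

def cpt_dof(child_size, parent_sizes=()):
    """Free parameters in Pr(child | parents): (|child| - 1) * product of |parent|."""
    dof = child_size - 1
    for p in parent_sizes:
        dof *= p
    return dof

C, S, Z = 2, 3, 2  # Color, Shape, Size

print(C * S * Z - 1)                                        # full joint: 11
print(cpt_dof(C) + cpt_dof(S, (C,)) + cpt_dof(Z, (C, S)))   # 1 + 4 + 6 = 11
print(cpt_dof(C) + cpt_dof(S) + cpt_dof(Z, (C, S)))         # 1 + 2 + 6 = 9
print(cpt_dof(Z) + cpt_dof(S, (Z,)) + cpt_dof(C, (Z,)))     # 1 + 4 + 2 = 7
print(cpt_dof(Z) + cpt_dof(S) + cpt_dof(C))                 # 1 + 2 + 1 = 4
```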

  9. This is difficult. Different factorizations affect: • smoothing • # parameters (model size) • model complexity • “interpretability” • goodness of fit • ... Usually, this isn’t done empirically, either!

  10. Desiderata • You decide which features to use. • Some intuitive criterion tells you how to use them in the model. • Empirical.

  11. Maximum Entropy “Make the model as uniform as possible ... but I noticed a few things that I want to model ... so pick a model that fits the data on those things.”

  12. Occam’s Razor One should not increase, beyond what is necessary, the number of entities required to explain anything.

  13. Uniform model

  14. Constraint: Pr(small) = 0.625

  15. Constraint: Pr([shape], small) = 0.048

  16. Constraint: Pr(large, [shape]) = 0.125

  17. Questions • Is there an efficient way to solve this problem? • Does a solution always exist? What to do if it doesn’t? • Is there a way to express the model succinctly?

  18. Entropy • A statistical measurement on a distribution. • Measured in bits. • H(p) ∈ [0, log2|X|] • High entropy: close to uniform • Low entropy: close to deterministic • Concave in p.
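A small illustration of these properties, not from the talk, assuming a 12-outcome space like the shape data:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x), in bits, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

uniform = np.full(12, 1 / 12)       # close to uniform -> high entropy
deterministic = np.eye(12)[0]       # all mass on one outcome -> low entropy

print(entropy_bits(uniform))        # log2(12) ~ 3.585 bits, the maximum
print(entropy_bits(deterministic))  # 0.0 bits, the minimum
```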

  19. The Max Ent Problem [figure: the entropy H plotted over the probability simplex (axes p1, p2), with its maximum marked]

  20. The Max Ent Problem: picking a distribution p. The objective function is H(p); the probabilities must sum to 1 and be nonnegative; and there are n constraints, one per feature, requiring the expected feature value under the model to match the expected feature value from the data.
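The same problem can be handed to a generic constrained optimizer, which is not how the talk solves it (iterative methods come later) but makes the pieces concrete. The sketch below assumes 6 of the 12 cells are "small" and imposes the Pr(small) = 0.625 constraint from slide 14:

```python
import numpy as np
from scipy.optimize import minimize

f_small = np.array([1.0] * 6 + [0.0] * 6)  # indicator feature; assumed: first 6 cells are "small"
target = 0.625                              # expected feature value from the data

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))     # minimizing -H maximizes H

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},         # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p @ f_small - target},   # E_p[f] = empirical value
]
p0 = np.full(12, 1 / 12)                                     # start from the uniform model
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * 12, constraints=constraints)

print(res.x.round(4))  # ~0.1042 on each small cell, ~0.0625 on each large cell
```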

  21. The Max Ent Problem [figure: the entropy H over the simplex (axes p1, p2) again]

  22. About feature constraints. Example indicator features: • 1 if x is a small [shape icon], 0 otherwise • 1 if x is large and light, 0 otherwise • 1 if x is small, 0 otherwise
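Written as code, each constraint is built from an indicator feature function. A sketch, assuming each x is a (color, shape, size) triple; "circle" stands in for the shape pictured on the slide:

```python
def f1(x):
    color, shape, size = x
    return 1 if size == "small" and shape == "circle" else 0  # small <pictured shape>

def f2(x):
    color, shape, size = x
    return 1 if size == "large" and color == "light" else 0   # large and light

def f3(x):
    color, shape, size = x
    return 1 if size == "small" else 0                         # small
```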

  23. Mathematical Magic. On one side, a constrained problem: |X| variables (p), concave in p. On the other, an unconstrained problem: N variables (θ), concave in θ.

  24. What’s the catch? The model takes on a specific, parameterized form. It can be shown that any max-ent model must take this form.

  25. Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection

  26. Log-linear models. “Log-linear”: the log of the probability is a linear function of the features.

  27. Log-linear models: Pr(x) = exp(Σi θi fi(x)) / Z(θ). One parameter (θi) for each feature; exp(Σi θi fi(x)) is the unnormalized probability, or weight; Z(θ) is the partition function that makes the probabilities sum to 1.
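A minimal sketch of that form over a small finite set X (function and argument names are illustrative):

```python
import numpy as np

def loglinear_probs(X, features, theta):
    """Pr(x) = exp(theta . f(x)) / Z(theta) for every x in X."""
    scores = np.array([np.dot(theta, features(x)) for x in X])
    weights = np.exp(scores)   # unnormalized probabilities ("weights")
    Z = weights.sum()          # partition function
    return weights / Z
```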

  28. Mathematical Magic. On one side, the max ent problem: constrained, |X| variables (p), concave in p. On the other, the log-linear ML problem: unconstrained, N variables (θ), concave in θ.

  29. What does MLE mean? We assume independence among examples, so the likelihood is a product over them; and the arg max is the same in the log domain, so we can maximize the sum of log-probabilities instead.
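Both points in two lines, using made-up per-example probabilities:

```python
import math

probs = [0.5, 0.25, 0.125]                        # independent examples: likelihood is a product
likelihood = math.prod(probs)
log_likelihood = sum(math.log(p) for p in probs)  # the log turns the product into a sum
assert abs(math.log(likelihood) - log_likelihood) < 1e-12  # log(product) == sum of logs, so the arg max is unchanged
```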

  30. MLE: Then and Now

  31. Iterative Methods All of these methods are correct and will converge to the right answer; it’s just a matter of how fast. • Generalized Iterative Scaling • Improved Iterative Scaling • Gradient Ascent • Newton/Quasi-Newton Methods • Conjugate Gradient • Limited-Memory Variable Metric • ...
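As an illustration of the simplest option, plain gradient ascent: the gradient of the average log-likelihood of a log-linear model is E_data[f] - E_model[f], so each update nudges the model expectations toward the empirical ones. A sketch for the joint model over a small finite X (function and argument names are mine):

```python
import numpy as np

def fit_loglinear(X, features, data, lr=0.5, iters=500):
    """Gradient ascent on the log-likelihood of a log-linear model over finite X."""
    F = np.array([features(x) for x in X])               # |X| x n feature matrix
    emp = np.mean([features(x) for x in data], axis=0)   # expected feature values from the data
    theta = np.zeros(F.shape[1])
    for _ in range(iters):
        scores = F @ theta
        p = np.exp(scores - scores.max())
        p /= p.sum()                                      # current model distribution
        theta += lr * (emp - F.T @ p)                     # E_data[f] - E_model[f]
    return theta
```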

  32. Questions • Is there an efficient way to solve this problem? Yes, many iterative methods. • Does a solution always exist? Yes, if the constraints come from the data. • Is there a way to express the model succinctly? Yes, a log-linear model.

  33. Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection

  34. Conditional Estimation. The training data are examples paired with labels. Training objective: the conditional likelihood of the labels given the examples. Classification rule: pick the most probable label given the example.
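A sketch of that training objective, conditional log-likelihood for a log-linear model with joint features f(x, y); names are illustrative:

```python
import numpy as np

def cond_log_likelihood(examples, labels, features, theta):
    """Sum over labeled examples (x, y) of log Pr(y | x) under the log-linear model."""
    total = 0.0
    for x, y in examples:
        scores = np.array([np.dot(theta, features(x, y2)) for y2 in labels])
        log_Z = scores.max() + np.log(np.exp(scores - scores.max()).sum())  # log of the per-example partition function
        total += np.dot(theta, features(x, y)) - log_Z
    return total
```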

  35.–38. Maximum Likelihood [figures: a joint model assigns probability to (label, object) pairs]

  39. Conditional Likelihood [figure: a conditional model assigns probability to the label given the object]

  40. Remember: combine log-linear models with conditional estimation: Pr(y | x) = exp(Σi θi fi(x, y)) / Σy′ exp(Σi θi fi(x, y′)).

  41. The Whole Picture

  42. Log-linear models: MLE vs. CLE. For MLE, the normalizing sum runs over all example types × all labels; for CLE, it runs over all labels only.

  43. Classification Rule Pick the most probable label y: We don’t need to compute the partition function at test time! But it does need to be computed during training.
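In code, the rule is just an arg max over unnormalized scores, since Z(x) is the same for every candidate label (a sketch; names are illustrative):

```python
import numpy as np

def classify(x, labels, features, theta):
    """argmax_y Pr(y | x): compare theta . f(x, y); the partition function cancels."""
    return max(labels, key=lambda y: np.dot(theta, features(x, y)))
```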

  44. Outline • Maximum Entropy principle • Log-linear models • Conditional modeling for classification • Ratnaparkhi’s tagger • Conditional random fields • Smoothing • Feature Selection

  45. Ratnaparkhi’s POS Tagger (1996) • Probability model: • Assume unseen words behave like rare words. • Rare words ≡ count < 5 • Training: GIS • Testing/Decoding: beam search
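A sketch of the beam-search decoding step, assuming the trained model is wrapped in a local_prob(tag, words, i, previous_tags) function returning Pr(tag | history); this framing is mine, not code from the talk:

```python
import math

def beam_search(words, tagset, local_prob, beam_size=5):
    """Keep only the top-scoring partial tag sequences at each position."""
    beam = [([], 0.0)]                                   # (tags so far, log-probability)
    for i in range(len(words)):
        candidates = []
        for tags, logp in beam:
            for t in tagset:
                p = local_prob(t, words, i, tags)        # Pr(t_i | history)
                candidates.append((tags + [t], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]                    # prune to the beam
    return beam[0][0]                                    # best-scoring tag sequence
```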

  46. Features: common words

  47. Features: rare words
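The feature tables from these two slides are not reproduced in this transcript, but the flavor (following Ratnaparkhi 1996) is: common words fire their identity, while rare words (count < 5) instead fire spelling features such as prefixes, suffixes, and character-class flags. A hedged sketch with template names of my own:

```python
def word_features(words, i, rare):
    """Identity/spelling features for position i; `rare` means training count < 5."""
    w = words[i]
    feats = []
    if not rare:
        feats.append(f"w={w}")                 # common word: its identity
    else:
        for k in range(1, 5):                  # prefixes and suffixes up to length 4
            feats.append(f"prefix={w[:k]}")
            feats.append(f"suffix={w[-k:]}")
        if any(c.isdigit() for c in w):
            feats.append("has-digit")
        if any(c.isupper() for c in w):
            feats.append("has-uppercase")
        if "-" in w:
            feats.append("has-hyphen")
    return feats
```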

  48. The “Label Bias” Problem [figure]

  49. The “Label Bias” Problem [figure: a tag lattice for “born to wealth” / “born to run”]. For “born to wealth”, the path through IN scores Pr(VBN | born) Pr(IN | VBN, to) Pr(NN | VBN, IN, wealth) = 1 * .4 * 1 = .4, while the path through TO scores Pr(VBN | born) Pr(TO | VBN, to) Pr(VB | VBN, TO, wealth) = 1 * .6 * 1 = .6, so the locally normalized model prefers the TO/VB reading even though the final word is “wealth”.

  50. Is this symptomatic of log-linear models? No!
