
Maximum Entropy Model (I)


Presentation Transcript


  1. Maximum Entropy Model (I) LING 572 Fei Xia Week 5: 02/03-02/05/09

  2. MaxEnt in NLP • The maximum entropy principle has a long history. • The MaxEnt algorithm was introduced to the NLP field by Berger et al. (1996). • Used in many NLP tasks: Tagging, Parsing, PP attachment, …

  3. Reference papers • (Ratnaparkhi, 1997) • (Berger et al., 1996) • (Ratnaparkhi, 1996) • (Klein and Manning, 2003) People often choose different notations.

  4. Notation We follow the notation in (Berger et al., 1996).

  5. Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training** • Case study: POS tagging

  6. The Overview

  7. Joint vs. Conditional models ** • Given training data {(x,y)}, we want to build a model to predict y for new x’s. For each model, we need to estimate the parameters µ. • Joint (generative) models estimate P(x,y) by maximizing the likelihood: P(X,Y|µ) • Ex: n-gram models, HMM, Naïve Bayes, PCFG • Choosing weights is trivial: just use relative frequencies. • Conditional (discriminative) models estimate P(y|x) by maximizing the conditional likelihood: P(Y|X, µ) • Ex: MaxEnt, SVM, etc. • Choosing weights is harder.
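A minimal sketch of the "relative frequencies" point for joint models, using a made-up three-example training set (the words and classes are hypothetical):

```python
from collections import Counter

# Hypothetical labeled data: (document words, class) pairs.
data = [(["bank", "bailout"], "politics"),
        (["game", "score"], "sports"),
        (["bank", "loan"], "politics")]

# For a joint (generative) model, the MLE parameters are just relative frequencies.
class_counts = Counter(y for _, y in data)
word_counts = Counter((y, w) for words, y in data for w in words)

prior = {c: n / len(data) for c, n in class_counts.items()}                  # P(c)
cond = {(c, w): n / sum(m for (c2, _), m in word_counts.items() if c2 == c)
        for (c, w), n in word_counts.items()}                                # P(w | c)

print(prior["politics"])            # 0.666...
print(cond[("politics", "bank")])   # 0.5
```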

  8. Naïve Bayes Model [Diagram: class node C with feature nodes f1, f2, …, fn] Assumption: each fm is conditionally independent of fn given C.

  9. The conditional independence assumption fm and fn are conditionally independent given c: P(fm | c, fn) = P(fm | c) Counter-examples in the text classification task: - P(“bank” | politics) != P(“bank” | politics, “bailout”) - P(“shameful” | politics) != P(“shameful” | politics, “18.4B”) Q: How to deal with correlated features? A: Many models, including MaxEnt, do not assume that features are conditionally independent.

  10. Naïve Bayes highlights • Choose c* = arg maxc P(c) ∏k P(fk | c) • Two types of model parameters: • Class prior: P(c) • Conditional probability: P(fk | c) • The number of model parameters: |C| + |C|*|V|
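A minimal sketch of the NB decision rule c* = arg maxc P(c) ∏k P(fk | c), computed in log space to avoid underflow; the parameter values below are made up:

```python
import math

# Hypothetical NB parameters: class priors P(c) and conditionals P(f | c).
prior = {"politics": 0.6, "sports": 0.4}
cond = {("politics", "bailout"): 0.05, ("politics", "game"): 0.001,
        ("sports", "bailout"): 0.001, ("sports", "game"): 0.08}

def nb_classify(features):
    """c* = arg max_c P(c) * prod_k P(f_k | c), computed as a sum of logs."""
    def log_score(c):
        return math.log(prior[c]) + sum(math.log(cond[(c, f)]) for f in features)
    return max(prior, key=log_score)

print(nb_classify(["bailout"]))  # politics
print(nb_classify(["game"]))     # sports
```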

  11. P(f | c) in NB Each cell is a weight for a particular (class, feat) pair.

  12. Weights in NB and MaxEnt • In NB • P(f | y) are probabilities (i.e., ∈ [0,1]) • P(f | y) are multiplied at test time • In MaxEnt • the weights are real numbers: they can be negative. • the weights are added at test time

  13. The highlights in MaxEnt fj(x,y) is a feature function, which normally corresponds to a (feature, class) pair. Training: to estimate the weights λj. Testing: to calculate P(y|x).
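A minimal sketch of the testing (decoding) step, assuming the usual exponential form P(y|x) = exp(∑j λj fj(x,y)) / Z(x); the feature functions and weights here are hypothetical:

```python
import math

# Hypothetical binary feature functions, one per (feature, class) pair, with weights.
def f1(x, y): return 1 if "bailout" in x and y == "politics" else 0
def f2(x, y): return 1 if "game" in x and y == "sports" else 0

feature_functions = [f1, f2]
lambdas = [1.5, 2.0]          # real-valued weights; they may also be negative
classes = ["politics", "sports"]

def p_y_given_x(x):
    """P(y|x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, feature_functions)))
              for y in classes}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

print(p_y_given_x(["bank", "bailout"]))   # "politics" gets most of the probability mass
```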

  14. Main questions • What is the maximum entropy principle? • What is a feature function? • Modeling: Why does P(y|x) have the form? • Training: How do we estimate λj?

  15. Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training* • Case study

  16. The maximum entropy principle

  17. The maximum entropy principle • Related to Occam’s razor and other similar justifications for scientific inquiry • Make the minimum assumptions about unseen data. • Also: Laplace’s Principle of Insufficient Reason: when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely.

  18. Maximum Entropy • Why maximum entropy? • Maximize entropy = Minimize commitment • Model all that is known and assume nothing about what is unknown. • Model all that is known: satisfy a set of constraints that must hold • Assume nothing about what is unknown: choose the most “uniform” distribution → choose the one with maximum entropy

  19. Ex1: Coin-flip example (Klein & Manning, 2003) • Toss a coin: p(H)=p1, p(T)=p2. • Constraint: p1 + p2 = 1 • Question: what’s p(x)? That is, what is the value of p1? • Answer: choose the p that maximizes H(p) [Plot: entropy H as a function of p1, with the point p1 = 0.3 marked]

  20. Coin-flip example** (cont) [Plot: the entropy surface H over (p1, p2), intersected with the constraint plane p1 + p2 = 1.0 and the additional constraint p1 = 0.3]
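A quick numeric check of the coin example: under the single constraint p1 + p2 = 1, the entropy H(p) peaks at p1 = 0.5, while an extra constraint such as p1 = 0.3 leaves no freedom and simply pins the answer:

```python
import math

def H(p1):
    """Entropy (in bits) of a coin with p(H) = p1 and p(T) = 1 - p1."""
    return -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0)

# With only the constraint p1 + p2 = 1, entropy is maximized at p1 = 0.5.
best = max((i / 100 for i in range(101)), key=H)
print(best, H(best))   # 0.5 1.0

# With the additional constraint p1 = 0.3, the solution is forced to p1 = 0.3.
print(H(0.3))          # ~0.881
```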

  21. Ex2: An MT example (Berger et al., 1996) Possible translations for the word “in”: {dans, en, à, au cours de, pendant} Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 Intuitive answer: 1/5 each

  22. An MT example (cont) Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 and p(dans) + p(en) = 3/10 Intuitive answer: p(dans) = p(en) = 3/20, and 7/30 each for the other three

  23. An MT example (cont) Constraints: the two above, plus p(dans) + p(à) = 1/2 Intuitive answer: ?? (no longer obvious)

  24. Ex3: POS tagging (Klein and Manning, 2003)

  25. Ex3 (cont)

  26. Ex4: Overlapping features (Klein and Manning, 2003) [Table of cell probabilities p1, p2, p3, p4]

  27. Ex4 (cont) [Table of cell probabilities p1, p2]

  28. Ex4 (cont) [Table of cell probability p1]

  29. The MaxEnt Principle summary • Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p). • Q1: How to represent constraints? • Q2: How to find such distributions?

  30. Reading #2 (Q1): Let P(X=i) be the probability of getting an i when rolling a die. What is the maximum-entropy distribution P(X) if the following is true? (a) P(X=1) + P(X=2) = 1/2 Answer: 1/4, 1/4, 1/8, 1/8, 1/8, 1/8 (b) P(X=1) + P(X=2) = 1/2 and P(X=6) = 1/3 Answer: 1/4, 1/4, 1/18, 1/18, 1/18, 1/3
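A quick check of these answers: the proposed distribution for (a) has higher entropy than another distribution that also satisfies P(X=1) + P(X=2) = 1/2, and the proposed distribution for (b) sums to 1:

```python
import math

def H(p):
    """Entropy (in bits) of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

answer_a = [1/4, 1/4, 1/8, 1/8, 1/8, 1/8]     # proposed MaxEnt solution for (a)
other_a  = [1/2, 0,   1/8, 1/8, 1/8, 1/8]     # also satisfies P(X=1) + P(X=2) = 1/2
print(H(answer_a), H(other_a))                # 2.5 > 2.0: the "most uniform" choice wins

answer_b = [1/4, 1/4, 1/18, 1/18, 1/18, 1/3]  # proposed solution for (b)
print(abs(sum(answer_b) - 1.0) < 1e-12)       # True: it is a valid distribution
```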

  31. (Q2) In the text classification task, |V| is the number of features, |C| is the number of classes. How many feature functions are there? |C| * |V|

  32. Outline • Overview • The Maximum Entropy Principle • Modeling** • Decoding • Training** • Case study

  33. Modeling

  34. The Setting • From the training data, collect (x, y) pairs: • x ∈ X: the observed data • y ∈ Y: thing to be predicted (e.g., a class in a classification problem) • Ex: In a text classification task • x: a document • y: the category of the document • To estimate P(y|x)

  35. The basic idea • Goal: to estimate p(y|x) • Choose p(x,y) with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).

  36. The outline for modeling • Feature function: • Calculating the expectation of a feature function • The forms of P(x,y) and P(y|x)

  37. Feature function

  38. The definition • A feature function is a binary-valued function on events: fj: X × Y → {0, 1} • Ex: fj(x, y) = 1 if y = “politics” and x contains “bailout”, and 0 otherwise

  39. The weights in NB

  40. The weights in NB Each cell is a weight for a particular (class, feat) pair.

  41. The matrix in MaxEnt Each feature function fj corresponds to a (feat, class) pair.

  42. The weights in MaxEnt Each feature function fj has a weight λj.

  43. Feature function summary • A feature function in MaxEnt corresponds to a (feat, class) pair. • The number of feature functions in MaxEnt is approximately |C| * |F|. • A MaxEnt trainer learns the weights for the feature functions.
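A minimal sketch of how the roughly |C| * |F| binary feature functions can be enumerated, one per (feat, class) pair; the feature and class inventories below are made up:

```python
# Hypothetical feature and class inventories.
FEATS = ["bank", "bailout", "game", "score"]
CLASSES = ["politics", "sports"]

def make_feature_function(feat, cls):
    # f(x, y) = 1 iff the document x contains `feat` AND the label y is `cls`.
    return lambda x, y: 1 if feat in x and y == cls else 0

# One binary feature function per (feat, class) pair: |C| * |F| in total.
feature_functions = [make_feature_function(f, c) for c in CLASSES for f in FEATS]

print(len(feature_functions))                                  # 8  (= 2 * 4)
print(feature_functions[1](["bank", "bailout"], "politics"))   # 1  (the ("bailout", politics) function fires)
```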

  44. The outline for modeling • Feature function: • Calculating the expectation of a feature function • The forms of P(x,y) and P(y|x)

  45. The expected return • Ex1: • Flip a coin • if it is a head, you will get 100 dollars • if it is a tail, you will lose 50 dollars • What is the expected return? P(X=H) * 100 + P(X=T) * (-50) • Ex2: • If the outcome is xi, you will receive vi dollars. • What is the expected return? ∑i P(X=xi) * vi

  46. Calculating the expectation of a function

  47. Empirical expectation • Denoted as Ep̃[f]: the expectation under the empirical distribution p̃ • Ex1: Toss a coin four times and get H, T, H, and H. • The average return: (100-50+100+100)/4 = 62.5 • Empirical distribution: p̃(H) = 3/4, p̃(T) = 1/4 • Empirical expectation: 3/4 * 100 + 1/4 * (-50) = 62.5

  48. Model expectation • Ex1: Toss a coin four times and get H, T, H, and H. • A model: p(H) = p(T) = 1/2 • Model expectation: 1/2 * 100 + 1/2 * (-50) = 25
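A small sketch contrasting the two expectations on the coin example above (returns of +100 for heads and -50 for tails, observed sequence H, T, H, H):

```python
# Coin example: returns v(H) = 100, v(T) = -50; observed sequence H, T, H, H.
v = {"H": 100, "T": -50}
sample = ["H", "T", "H", "H"]

# Empirical expectation: weight each outcome by its relative frequency in the data.
p_emp = {o: sample.count(o) / len(sample) for o in v}
print(sum(p_emp[o] * v[o] for o in v))     # 0.75*100 + 0.25*(-50) = 62.5

# Model expectation: weight each outcome by the model's probability p(H) = p(T) = 1/2.
p_model = {"H": 0.5, "T": 0.5}
print(sum(p_model[o] * v[o] for o in v))   # 0.5*100 + 0.5*(-50) = 25.0
```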

  49. Some notations • Training data: {(x1, y1), …, (xN, yN)} • Empirical distribution: p̃(x, y) • A model: p(y|x) • The jth feature function: fj(x, y) • Empirical expectation of fj: Ep̃[fj] • Model expectation of fj: Ep[fj]

  50. Empirical expectation** Ep̃[fj] = ∑x,y p̃(x, y) fj(x, y)
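A minimal sketch of the empirical expectation of one feature function over a made-up training set; since p̃ puts mass 1/N on each observed pair, this reduces to the average value of fj over the training examples:

```python
# Hypothetical training data: (document words, class) pairs.
train = [(["bank", "bailout"], "politics"),
         (["game", "score"], "sports"),
         (["bank", "loan"], "politics")]

def f_j(x, y):
    # The ("bank", politics) feature function.
    return 1 if "bank" in x and y == "politics" else 0

# E_p~[f_j] = sum_{x,y} p~(x,y) f_j(x,y); with p~ the relative frequency of each
# observed pair, this is just the average of f_j over the training examples.
emp = sum(f_j(x, y) for x, y in train) / len(train)
print(emp)   # 0.666...: "bank" occurs with the class politics in 2 of the 3 examples
```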
