

  1. Maximum Entropy Model LING 572 Fei Xia 02/08/07

  2. Topics in LING 572 • Easy: • kNN, Rocchio, DT, DL • Feature selection, binarization, system combination • Bagging • Self-training

  3. Topics in LING 572 • Slightly more complicated • Boosting • Co-training • Hard (to some people): • MaxEnt • EM

  4. History • The concept of Maximum Entropy can be traced back along multiple threads to Biblical times. • Introduced to the NLP area by Berger et al. (1996). • Used in many NLP tasks: tagging, parsing, PP attachment, LM, …

  5. Outline • Main idea • Modeling • Training: estimating parameters • Feature selection during training • Case study

  6. Main idea

  7. Maximum Entropy • Why maximum entropy? • Maximize entropy = Minimize commitment • Model all that is known and assume nothing about what is unknown. • Model all that is known: satisfy a set of constraints that must hold • Assume nothing about what is unknown: choose the most “uniform” distribution → choose the one with maximum entropy

  8. Ex1: Coin-flip example (Klein & Manning, 2003) • Toss a coin: p(H)=p1, p(T)=p2. • Constraint: p1 + p2 = 1 • Question: what’s your estimation of p=(p1, p2)? • Answer: choose the p that maximizes H(p) [Plot: entropy H as a function of p1, with p1 = 0.3 marked]
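A quick worked version of this plot (my reconstruction): with only the constraint $p_1 + p_2 = 1$, the entropy is

    H(p) = -p_1 \log p_1 - (1 - p_1) \log (1 - p_1)

which is maximized at $p_1 = p_2 = 0.5$; a skewed choice such as $p_1 = 0.3$ satisfies the constraint but has strictly lower entropy.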

  9. Coin-flip example (cont) [Plot: the entropy surface H over (p1, p2), intersected with the constraint line p1 + p2 = 1.0; the point p1 = 0.3 is marked for comparison]

  10. Ex2: An MT example (Berger et al., 1996) • Possible translations for the word “in”: {dans, en, à, au cours de, pendant} • Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 • Intuitive answer: 1/5 for each translation

  11. An MT example (cont) • Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 and p(dans) + p(en) = 3/10 • Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30

  12. An MT example (cont) • Constraints: the two above, plus p(dans) + p(à) = 1/2 • Intuitive answer: ?? (no obvious “uniform” split remains)

  13. Ex3: POS tagging (Klein and Manning, 2003)

  14. Ex3 (cont)

  15. Ex4: overlapping features (Klein and Manning, 2003)

  16. Modeling the problem • Objective function: H(p) • Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p). • Question: How to represent constraints?

  17. Modeling

  18. Reference papers • (Ratnaparkhi, 1997) • (Ratnaparkhi, 1996) • (Berger et al., 1996) • (Klein and Manning, 2003) → Different notations.

  19. The basic idea • Goal: estimate p • Choose p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).

  20. Setting • From training data, collect (a, b) pairs: • a: thing to be predicted (e.g., a class in a classification problem) • b: the context • Ex: POS tagging: • a=NN • b=the words in a window and previous two tags • Learn the prob of each (a, b): p(a, b)

  21. Features in POS tagging (Ratnaparkhi, 1996) [Table: for each word, the context (a.k.a. history) and its allowable classes]

  22. Features • A feature (a.k.a. feature function, indicator function) is a binary-valued function on events: • A: the set of possible classes (e.g., tags in POS tagging) • B: space of contexts (e.g., neighboring words/tags in POS tagging) • Ex:
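The definition and the example are not in the transcript; a standard reconstruction (my notation, following Ratnaparkhi 1996) is:

    f_j : A \times B \rightarrow \{0, 1\}

For instance, in POS tagging: $f_j(a, b) = 1$ if $a = \mathrm{VBG}$ and the current word in context $b$ ends in “-ing”, and $0$ otherwise.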

  23. Some notations • Finite training sample of events: • Observed probability of x in S: • The model p’s probability of x: • The jth feature: • Observed expectation of fj (empirical count of fj): • Model expectation of fj:
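The symbols themselves are missing from the transcript; the usual definitions (a reconstruction in common MaxEnt notation) are:

    S = \{x_1, \ldots, x_N\}, \quad x_i = (a_i, b_i)
    \tilde{p}(x) = \mathrm{count}(x, S) / N
    p(x) \quad \text{(the model's probability of } x\text{)}
    f_j(x) = f_j(a, b) \in \{0, 1\}
    E_{\tilde{p}} f_j = \sum_x \tilde{p}(x)\, f_j(x)
    E_p f_j = \sum_x p(x)\, f_j(x)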

  24. Constraints • Model’s feature expectation = observed feature expectation • How to calculate these expectations?
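In symbols, the constraint (a standard reconstruction) is:

    E_p f_j = E_{\tilde{p}} f_j \quad \text{for } j = 1, \ldots, k

The observed side is computed directly from the training sample: $E_{\tilde{p}} f_j = \frac{1}{N} \sum_{i=1}^{N} f_j(a_i, b_i)$.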

  25. Training data → observed events

  26. Restating the problem • The task: find p* s.t. the constraints hold • Objective function: H(p) • Constraints: • Add a feature:
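The equations on this slide are lost in the transcript; in standard form the restated problem is (a reconstruction):

    p^* = \arg\max_{p \in P} H(p), \qquad P = \{\, p : E_p f_j = E_{\tilde{p}} f_j,\; j = 1, \ldots, k \,\}

with $H(p) = -\sum_x p(x) \log p(x)$. Each constraint corresponds to one feature, so adding a feature adds one constraint to P.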

  27. Questions • Is P empty? • Does p* exist? • Is p* unique? • What is the form of p*? • How to find p*?

  28. What is the form of p*? (Ratnaparkhi, 1997) Theorem: if then Furthermore, p* is unique.
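The body of the theorem is missing from the transcript; Ratnaparkhi's (1997) statement is roughly (my reconstruction):

    Q = \Big\{\, p : p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)},\; \alpha_j > 0 \,\Big\}

Theorem: if $p^* \in P \cap Q$, then $p^* = \arg\max_{p \in P} H(p)$. Furthermore, p* is unique.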

  29. Using Lagrange multipliers Minimize A(p):
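The objective A(p) is not shown in the transcript; a standard Lagrangian for this constrained problem (a reconstruction) is:

    A(p) = -H(p) + \sum_{j} \lambda_j \big( E_p f_j - E_{\tilde{p}} f_j \big) + \mu \Big( \sum_x p(x) - 1 \Big)

Setting $\partial A / \partial p(x) = 0$ yields the exponential (log-linear) form of p*.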

  30. Two equivalent forms
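The two forms do not survive in the transcript; the standard pair is:

    p(x) = \frac{1}{Z} \prod_{j} \alpha_j^{f_j(x)}
    \qquad\Longleftrightarrow\qquad
    p(x) = \frac{1}{Z} \exp\Big( \sum_{j} \lambda_j f_j(x) \Big), \quad \alpha_j = e^{\lambda_j}

where Z is the normalization constant.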

  31. Relation to Maximum Likelihood The log-likelihood of the empirical distribution as predicted by a model q is defined as Theorem: if then Furthermore, p* is unique.
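The definition and theorem here are also missing; the usual statements (a reconstruction) are:

    L(q) = \sum_x \tilde{p}(x) \log q(x)

Theorem: if $p^* = \arg\max_{q \in Q} L(q)$, then $p^* = \arg\max_{p \in P} H(p)$ (and conversely). Furthermore, p* is unique.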

  32. Summary (so far) • Goal: find p* in P, which maximizes H(p). • It can be proved that, when p* exists, it is unique. • The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample.

  33. Summary (cont) • Adding constraints (features): (Klein and Manning, 2003) • Lower maximum entropy • Raise maximum likelihood of data • Bring the distribution further away from uniform • Bring the distribution closer to data

  34. Training

  35. Algorithms • Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972) • Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)

  36. GIS: setup Requirements for running GIS: • Obey form of model and constraints: • An additional constraint: every event must have the same total feature count • Let C be that constant and add a new feature fk+1:
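The GIS-specific constraint and the definitions of C and f_{k+1} are not in the transcript; the standard setup is:

    \sum_{j=1}^{k} f_j(x) = C \;\; \text{for all } x, \qquad
    C = \max_x \sum_{j=1}^{k} f_j(x), \qquad
    f_{k+1}(x) = C - \sum_{j=1}^{k} f_j(x)

That is, if the number of active features varies across events, a correction feature f_{k+1} is added so that every event has exactly C active features.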

  37. GIS algorithm • Compute dj, j=1, …, k+1 • Initialize (any values, e.g., 0) • Repeat until convergence • For each j • Compute • Compute • Update
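The quantities computed in the loop are missing from the transcript. In the usual GIS formulation, each iteration computes the model expectation $E_{p^{(n)}} f_j$ and updates $\lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C}\log\frac{d_j}{E_{p^{(n)}} f_j}$, where $d_j = E_{\tilde p} f_j$. Below is a minimal Python sketch of that loop (my own illustration with assumed helper names `events`, `feats`, and `classes`; not the course's reference implementation):

```python
import math
from collections import defaultdict

# Minimal GIS sketch for illustration only.
#   events:  list of (a, b) pairs from the training sample (class, context)
#   feats:   function mapping (a, b) to the list of active feature names
#   classes: the set A of possible classes

def train_gis(events, feats, classes, iterations=100):
    N = len(events)
    # GIS constant C; a full implementation would also add the correction
    # feature f_{k+1} so that every event has exactly C active features.
    C = max(len(feats(a, b)) for (a, b) in events)

    # Observed expectations d_j = E_{p~}[f_j]
    observed = defaultdict(float)
    for a, b in events:
        for f in feats(a, b):
            observed[f] += 1.0 / N

    lam = defaultdict(float)  # feature weights lambda_j, initialized to 0

    def prob(a, b):
        # Conditional model p(a|b) = exp(sum_j lam_j f_j(a,b)) / Z(b)
        scores = {c: math.exp(sum(lam[f] for f in feats(c, b))) for c in classes}
        z = sum(scores.values())
        return scores[a] / z

    for _ in range(iterations):
        # Model expectations E_p[f_j], approximated over training contexts
        expected = defaultdict(float)
        for _, b in events:
            for c in classes:
                p_cb = prob(c, b) / N
                for f in feats(c, b):
                    expected[f] += p_cb
        # GIS update: lambda_j += (1/C) * log(d_j / E_p[f_j])
        for f in observed:
            if expected[f] > 0:
                lam[f] += math.log(observed[f] / expected[f]) / C
    return lam
```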

  38. Approximation for calculating feature expectation
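The formula for this approximation does not appear in the transcript; the standard version replaces the sum over all contexts with a sum over contexts seen in training:

    E_p f_j \approx \sum_{b} \tilde{p}(b) \sum_{a \in A} p(a \mid b)\, f_j(a, b)
    = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in A} p(a \mid b_i)\, f_j(a, b_i)

so only events whose context occurs in the training data contribute.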

  39. Properties of GIS • L(p(n+1)) >= L(p(n)) • The sequence is guaranteed to converge to p*. • Convergence can be very slow. • The running time of each iteration is O(NPA): • N: the training set size • P: the number of classes • A: the average number of features that are active for a given event (a, b).

  40. Feature selection

  41. Feature selection • Throw in many features and let the machine select the weights • Manually specify feature templates • Problem: too many features • An alternative: greedy algorithm • Start with an empty set S • Add a feature at each iteration

  42. Notation • With the feature set S: • After adding a feature f: • The gain in the log-likelihood of the training data:
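The symbols are missing from the transcript; the usual notation (a reconstruction, following Berger et al., 1996) is:

    p_S = \text{the MaxEnt model constrained by the features in } S, \qquad
    p_{S \cup \{f\}} = \text{the model after adding } f

    G_S(f) = L(p_{S \cup \{f\}}) - L(p_S)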

  43. Feature selection algorithm (Berger et al., 1996) • Start with S being empty; thus pS is uniform. • Repeat until the gain is small enough • For each candidate feature f • Compute the model using IIS • Calculate the log-likelihood gain • Choose the feature with maximal gain, and add it to S → Problem: too expensive

  44. Approximating gains (Berger et al., 1996) • Instead of recalculating all the weights, calculate only the weight of the new feature.

  45. Training a MaxEnt Model Scenario #1: no feature selection during training • Define feature templates • Create the feature set • Determine the optimum feature weights via GIS or IIS Scenario #2: with feature selection during training • Define feature templates • Create candidate feature set S • At every iteration, choose the feature from S (with max gain) and determine its weight (or choose top-n features and their weights).

  46. Case study

  47. POS tagging (Ratnaparkhi, 1996) • Notation variation: • fj(a, b): a: class, b: context • fj(hi, ti): hi: history for the ith word, ti: tag for the ith word • History: • Training data: • Treat it as a list of (hi, ti) pairs. • How many pairs are there?
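The definition of the history is not in the transcript; in Ratnaparkhi (1996) it is (a reconstruction):

    h_i = \{\, w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2} \,\}

and the training data yields one (hi, ti) pair per word token, so the number of pairs equals the number of tokens in the corpus.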

  48. Using a MaxEnt Model • Modeling: • Training: • Define feature templates • Create the feature set • Determine the optimum feature weights via GIS or IIS • Decoding:

  49. Modeling
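The model equation on this slide is missing; for this tagging task the standard form (a reconstruction, following Ratnaparkhi 1996) is:

    p(t \mid h) = \frac{1}{Z(h)} \prod_{j} \alpha_j^{f_j(h, t)}, \qquad
    p(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \prod_{i=1}^{n} p(t_i \mid h_i)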

  50. Training step 1: define feature templates [Table: feature templates, each pairing a predicate on the history hi with the tag ti]
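To make the idea concrete, here is a small, hypothetical Python sketch of such templates (the helper name `extract_features` and the particular template set are my own illustration, not the slide's exact table):

```python
# Hypothetical feature templates over a (history, tag) pair.
def extract_features(h, t):
    """h: dict describing the history hi (current/neighboring words and
    previous tags); t: candidate tag ti. Returns the names of the binary
    features that fire for this (h, t) event."""
    return [
        f"w_i={h['w_i']} & t_i={t}",
        f"w_i-1={h['w_prev']} & t_i={t}",
        f"w_i+1={h['w_next']} & t_i={t}",
        f"t_i-1={h['t_prev']} & t_i={t}",
        f"t_i-2,t_i-1={h['t_prev2']},{h['t_prev']} & t_i={t}",
    ]

# Example event from a tagged sentence
h = {"w_i": "about", "w_prev": "stories", "w_next": "well-heeled",
     "t_prev": "NNS", "t_prev2": "VBD"}
print(extract_features(h, "IN"))
```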
