
A Maximum Entropy Approach to Natural Language Processing

This presentation discusses the maximum entropy modeling approach for natural language processing, covering historical perspectives, the concept of entropy, feature selection, case studies, and key takeaways.


Presentation Transcript


  1. A Maximum Entropy Approach to Natural Language Processing, by Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. Presentation by Shwetha Srinath / Mihir Mathur

  2. Agenda • Some Historical Perspective • A Motivating Scenario • What is Entropy and Max Entropy? • Maximum Entropy Modeling • Feature Selection • Case Studies • Key Takeaways

  3. Some Historical Perspective Occam’s Razor: “Simpler solutions are more likely to be correct than complex ones” (William of Ockham, c. 1287–1347). Principle of Insufficient Reason: “When one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely” (P. S. Laplace, 1749–1827).

  4. A Motivating Scenario Task: Translate the English word “in” to French. First, we collect a sample of an expert’s translations and observe that there are 5 French phrases with the same meaning as “in”!

  5. A Motivating Scenario Task: Translate the English word “in” to French. Now we want to build a model p that assigns a probability to each French phrase f: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1. This is our first constraint.

  6. A Motivating Scenario Task: Translate the English word “in” to French. We observe the parallel phrases more closely: • the expert chooses either dans or en 30% of the time • the expert chooses either dans or à 50% of the time. So our constraints are now: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1, p(dans) + p(en) = 3/10, p(dans) + p(à) = 1/2.

  7. A Motivating Scenario Task: Translate the English word “in” to French. We have the constraints of our model; now what? Let’s just use the most uniform p! ...but what does “uniform” mean? How can we measure the uniformity of a model? How can we find the most uniform model subject to these constraints?
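As a quick illustration (not from the slides), the most uniform distribution subject to these three constraints can also be found numerically. The sketch below, assuming NumPy and SciPy are available, maximizes entropy over the five phrases directly:

```python
import numpy as np
from scipy.optimize import minimize

phrases = ["dans", "en", "a", "au cours de", "pendant"]

def neg_entropy(p):
    # negative Shannon entropy; the small epsilon avoids log(0)
    return np.sum(p * np.log(p + 1e-12))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(a) = 1/2
]

p0 = np.full(5, 0.2)  # start from the uniform distribution
result = minimize(neg_entropy, p0, bounds=[(0, 1)] * 5,
                  constraints=constraints, method="SLSQP")

for phrase, prob in zip(phrases, result.x):
    print(f"p({phrase}) = {prob:.3f}")
```

The constraints define the feasible set of models; maximum entropy simply picks the least-committed member of that set.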

  8. We will answer these questions through the concept of Max Entropy

  9. What is Entropy? The simplest definition: entropy is the average rate at which information is produced by a stochastic source of data. E.g., consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails.
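In symbols (the standard Shannon definition, added here for reference):
H(p) = -\sum_{x} p(x) \log p(x)
For a coin with heads probability q, H = -q \log q - (1 - q)\log(1 - q), which is largest when q = 1/2 (the fair coin) and zero when the outcome is certain.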

  10. What is Max Entropy? Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible. Intuition of the Principle of Maximum Entropy: take precisely stated prior data or testable information about a probability distribution function, and consider the set of all trial probability distributions that would encode that prior data. According to this principle, the distribution with maximal information entropy is the best choice. https://en.wikipedia.org/wiki/Principle_of_maximum_entropy

  11. Max Entropy Modeling Consider a random process which produces an output value y ∈ 𝐘. In generating y, the process is influenced by some contextual information x ∈ 𝐗. So the training data is of the form (x₁, y₁), (x₂, y₂), ..., (x_N, y_N). We can summarize the training sample in terms of its empirical probability distribution. Typically, a particular pair (x, y) will either not occur at all in the sample or will occur at most a few times.
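In the paper’s notation, with N training pairs, the empirical distribution is defined by
\tilde{p}(x, y) \equiv \frac{1}{N} \times \text{(number of times that } (x, y) \text{ occurs in the sample)}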

  12. Max Entropy Modeling Consider a random process which produces an output value y ∈ 𝐘. In generating y, the process is influenced by some contextual information x ∈ 𝐗. So the training data is of the form (x₁, y₁), (x₂, y₂), ..., (x_N, y_N). In our example, 𝐘 = {en, dans, à, au cours de, pendant} and 𝐗 = the words in the English sentence surrounding “in”. Training data: pairs of (words surrounding “in”, expert translation of “in”). Task: construct a stochastic model 𝒑 that accurately represents the behavior of the training sample.

  13. Max Entropy Modeling FEATURES & CONSTRAINTS Our building blocks for 𝒑: statistics of the training sample. Constraints can be expressed as functions. For example, consider the constraint: “in” translates as “en” when “April” is the following word. We encode this as a feature function f, and measure its expected value w.r.t. the empirical distribution.
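In the paper, this pair is written as
f(x, y) = \begin{cases} 1 & \text{if } y = \textit{en} \text{ and \textit{April} follows \textit{in}} \\ 0 & \text{otherwise} \end{cases}
\tilde{p}(f) \equiv \sum_{x, y} \tilde{p}(x, y)\, f(x, y)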

  14. Max Entropy Modeling FEATURES & CONSTRAINTS Thus, we can express any statistic of the sample as the expected value of a feature function! If we feel a statistic is useful, we can require that our model p accords with it. We do this by constraining the expected value that the model assigns to the corresponding feature function f: the expected value of f with respect to the model p(y|x) must equal the empirical expected value of f.
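Following the paper, these are
p(f) \equiv \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) \quad \text{(expected value of } f \text{ w.r.t. the model)}
p(f) = \tilde{p}(f) \quad \text{(the constraint)}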

  15. Max Entropy Modeling FEATURES & CONSTRAINTS So the feature constraint equations we have so far: (1) the expected value of 𝑓 w.r.t. the empirical distribution, (2) the expected value of 𝑓 with respect to the model p(y|x), and (3) the constraint on the expected value of feature 𝑓. Combining 1, 2, and 3, we get the constraint equation.
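Written out, as in the paper, the constraint equation is
\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) \;=\; \sum_{x, y} \tilde{p}(x, y)\, f(x, y)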

  16. Max Entropy Modeling FEATURES & CONSTRAINTS - RECAP To sum up so far: we now have a means of representing statistical phenomena of a sample of data, and a means of requiring that our model of the process exhibit these phenomena.

  17. Max Entropy Modeling MAX ENTROPY PRINCIPLE Now, say we have n feature functions fᵢ that represent important statistics. We would like our model to accord with these stats ⇒ we would like p to lie in the subset C of P. Among the models p in C, the maximum entropy philosophy dictates that we select the distribution that is most uniform.
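In the paper’s notation,
\mathcal{C} \equiv \bigl\{\, p \in \mathcal{P} \;\bigm|\; p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\} \,\bigr\}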

  18. Max Entropy Modeling MAX ENTROPY PRINCIPLE A mathematical measure of the uniformity of a conditional distribution 𝑝(𝘺 | 𝑥) is provided by its conditional entropy. PRINCIPLE OF MAX ENTROPY: to select a model from a set C of allowed probability distributions, choose the model p* in C with maximum entropy H(p).
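That is, following the paper,
H(p) \equiv -\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, \log p(y \mid x), \qquad p^{*} = \operatorname*{argmax}_{p \in \mathcal{C}} H(p)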

  19. Max Entropy Modeling PARAMETRIC FORM We now have a problem in constrained optimization: find the p* in C which maximizes H(p). Unfortunately, finding an explicit solution is not possible in most cases, so we use Lagrange multipliers! PRIMAL PROBLEM: for each feature fᵢ we introduce a parameter λᵢ (a Lagrange multiplier) and define the Lagrangian Λ(𝑝, λ).
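In the paper it is defined as
\Lambda(p, \lambda) \equiv H(p) + \sum_{i} \lambda_i \bigl( p(f_i) - \tilde{p}(f_i) \bigr)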

  20. Max Entropy Modeling PARAMETRIC FORM Holding λ fixed, we want to find the 𝑝 where the Lagrangian Λ(𝑝, λ) achieves its maximum, and the value of the function at that maximum.
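In the paper these are denoted
p_{\lambda} \equiv \operatorname*{argmax}_{p \in \mathcal{P}} \Lambda(p, \lambda), \qquad \Psi(\lambda) \equiv \Lambda(p_{\lambda}, \lambda)
where \Psi(\lambda) is called the dual function.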

  21. Max Entropy Modeling PARAMETRIC FORM By differentiating the Lagrangian w.r.t. p and setting the derivative to 0, we arrive at a model in exponential form with a normalizing constant, and an expression for the value of the Lagrangian at this maximum.
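Following the paper, the result is
p_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Bigl( \sum_{i} \lambda_i f_i(x, y) \Bigr) \quad \text{(exponential form)}
Z_{\lambda}(x) = \sum_{y} \exp\Bigl( \sum_{i} \lambda_i f_i(x, y) \Bigr) \quad \text{(normalizing constant)}
\Psi(\lambda) = -\sum_{x} \tilde{p}(x) \log Z_{\lambda}(x) + \sum_{i} \lambda_i\, \tilde{p}(f_i)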

  22. Max Entropy Modeling PARAMETRIC FORM

  23. Max Entropy Modeling RELATION TO MAXIMUM LIKELIHOOD The log-likelihood L𝑝̃(p) of the empirical distribution 𝑝̃ as predicted by a model p is defined below. It is easy to check that the dual function 𝞇(λ) of the previous section is, in fact, just the log-likelihood for the exponential model p_λ.
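In symbols (as in the paper),
L_{\tilde{p}}(p) \equiv \log \prod_{x, y} p(y \mid x)^{\tilde{p}(x, y)} = \sum_{x, y} \tilde{p}(x, y)\, \log p(y \mid x), \qquad \Psi(\lambda) = L_{\tilde{p}}(p_{\lambda})
So the maximum entropy model in C is also the exponential model that maximizes the likelihood of the training sample.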

  24. Max Entropy Modeling COMPUTING PARAMETERS We need numerical methods to find the λ* that maximizes 𝞇(λ). Some options: • Gradient Ascent • Coordinate-wise Ascent: iteratively maximizing one coordinate at a time (the Brown algorithm) • Conjugate Gradient: an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation

  25. Max Entropy Modeling COMPUTING PARAMETERS: DARROCH-RATCLIFF PROCEDURE Input: feature functions f₁, ..., fₙ and the empirical distribution 𝑝̃. Output: optimal parameter values λ*, optimal model 𝑝λ*.
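As a rough sketch of iterative scaling in this spirit (Generalized Iterative Scaling in the style of Darroch and Ratcliff, not the slide's exact procedure), assuming binary feature functions and small, enumerable sets 𝐗 and 𝐘; all names are illustrative:

```python
import numpy as np

def train_gis(features, xs, ys, samples, iterations=100):
    """Generalized Iterative Scaling sketch (Darroch-Ratcliff spirit).

    features : list of binary functions f(x, y) -> 0 or 1
    xs, ys   : all possible contexts x and outputs y
    samples  : observed (x, y) training pairs
    """
    n, N = len(features), len(samples)
    lam = np.zeros(n)

    # GIS uses a constant C bounding the total feature count of any (x, y)
    # (a slack feature would make the count exactly C; omitted in this sketch)
    C = max(sum(f(x, y) for f in features) for x in xs for y in ys) or 1

    # empirical expectations of each feature
    emp = np.array([sum(f(x, y) for x, y in samples) / N for f in features])

    def p_lambda(x):
        # conditional model p_lambda(y | x) over all y in ys
        scores = np.array([sum(lam[i] * features[i](x, y) for i in range(n)) for y in ys])
        expd = np.exp(scores - scores.max())   # subtract max for numerical stability
        return expd / expd.sum()

    for _ in range(iterations):
        # model expectations of each feature, averaged over observed contexts
        mod = np.zeros(n)
        for x, _ in samples:
            probs = p_lambda(x)
            for i, f in enumerate(features):
                mod[i] += sum(p * f(x, y) for p, y in zip(probs, ys)) / N
        # multiplicative GIS update, written in log space
        lam += np.log(np.maximum(emp, 1e-12) / np.maximum(mod, 1e-12)) / C
    return lam
```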

  26. Feature Selection Consider the problem of translating “in” to French. These are the features provided: • p(dans | in) • p(en | in) • p(à | in) • p(au cours de | in) • p(pendant | in) Without any information about these features, how would we assign probabilities to each of these words being the right translation for “in”?

  27. Feature Selection How do we choose features? Start with a “large” set and choose a subset S of active features. Two goals: • S captures as much information as possible about the random process • S consists of reliably estimable features

  28. Feature Selection

  29. Feature Selection Optimal model = the model with greatest entropy

  30. Feature Selection Optimal model = the model with greatest entropy. When we add a new feature f, what is the gain in log-likelihood?
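Following the paper, the gain from adding a candidate feature f to the active set S is the increase in training-data log-likelihood:
\Delta L(\mathcal{S}, f) \equiv L(p_{\mathcal{S} \cup f}) - L(p_{\mathcal{S}})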

  31. Feature Selection Problem with basic feature selection: it is not efficient! Instead, we compute an approximate gain. A model pS has one parameter for each feature in S; when we add a new feature, another parameter is added.

  32. Feature Selection Approximate Gain Assumption: when we add a new feature to the set S, we only need to find the optimal value of the new parameter. (This is not true; it just makes things much easier for us.)

  33. Feature Selection Assume the best model containing S ∪ f has the form given below. We now maximize the approximate gain, changing our previous equations to get a new gain and a new optimal model.
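Roughly as in the paper, the one-parameter family and the approximate gain look like
p^{\alpha}_{\mathcal{S} \cup f}(y \mid x) = \frac{1}{Z_{\alpha}(x)}\, p_{\mathcal{S}}(y \mid x)\, e^{\alpha f(x, y)}, \qquad \widetilde{\Delta L}(\mathcal{S}, f) \equiv \max_{\alpha} \bigl( L(p^{\alpha}_{\mathcal{S} \cup f}) - L(p_{\mathcal{S}}) \bigr)
so only the single new parameter α is optimized when scoring a candidate feature.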

  34. Let’s look at some use cases of Max Entropy

  35. Statistical Translation Candide’s basic translation model. Input: French sentence F. Output: English sentence E. And from Bayes’ theorem we have:
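In the standard noisy-channel form,
\hat{E} = \operatorname*{argmax}_{E}\, p(E \mid F) = \operatorname*{argmax}_{E}\, p(E)\, p(F \mid E)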

  36. Statistical Translation How do we get from E to F? • Each word in E generates 0 or more French words. • Order the French words to generate F. Define an alignment A that maps each word f of F to the word e in E that generated f. We now have a sum over |E|^|F| possible alignments!
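That is,
p(F \mid E) = \sum_{A} p(F, A \mid E)
with one term for every way of aligning the |F| French words to the |E| English words.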

  37. Statistical Translation Simplifying assumption: there exists a highly probable alignment Â for which the sum is dominated by that single term. Then we have the Basic Translation Model.
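A hedged sketch of this step, in the spirit of the paper rather than its exact notation:
p(F \mid E) \approx p(F, \hat{A} \mid E), \qquad p(F, \hat{A} \mid E) \approx \prod_{(f, e) \in \hat{A}} p(f \mid e)
so the basic model only needs context-independent word-translation probabilities p(f | e).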

  38. Statistical Translation

  39. Statistical Translation Errors in translation: We need to take context into account!

  40. Context-Dependent Word Models A max entropy model for each English word e to predict the French word, considering a six-word context (3 words before, 3 after).
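A minimal sketch of what such context-window features could look like in code; the dict-based window representation and helper names are assumptions for illustration, not the paper’s implementation:

```python
def window_feature(french_word, offset, english_word):
    """Indicator feature: fires when the candidate translation y is `french_word`
    and `english_word` sits at the given offset in the six-word context window
    (offsets -3..-1 and +1..+3 around the English word being translated)."""
    def f(context, y):
        return 1 if y == french_word and context.get(offset) == english_word else 0
    return f

# Example from the earlier slide: "in" translates as "en" when "April" is the next word.
f_april = window_feature("en", +1, "April")
context = {-1: "arrives", +1: "April", +2: "with"}   # toy window around "in"
print(f_april(context, "en"))    # 1: the feature fires
print(f_april(context, "dans"))  # 0: wrong translation, feature silent
```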

  41. Context-Dependent Word Models (Table on the slide: candidate features f, the gain from selecting each feature f, and the model entropy after that feature is added.)

  42. Context-Dependent Word Models We can therefore replace the context-independent model with the following context-dependent model:
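Schematically (a paraphrase of the idea, not the paper’s exact notation), the context-independent probabilities are replaced as
p(f \mid e) \;\longrightarrow\; p_{e}(f \mid x)
where x is the six-word English context around e.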

  43. Context-Dependent Word Models New and improved translation!

  44. Word Reordering There are certain instances where the English and French word orderings are interchanged. For example, “a blue cat” translates to “un chat bleu”. In some cases, the order does not change. NOUN de NOUN phrases: no interchange: conflit d'intérêt -> conflict of interest; interchange: taux d'intérêt -> interest rate. Features can now be defined accordingly.
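One plausible indicator in this style, built from the slide’s own example (an assumption about the form, not a quotation of the paper):
f(x, y) = \begin{cases} 1 & \text{if } y = \textit{interchange} \text{ and the left noun of the phrase is } \textit{taux} \\ 0 & \text{otherwise} \end{cases}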

  45. Word Reordering

  46. Key Takeaways • Max Entropy: Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible. • Constraint Equation: To obtain the constraint equation, equate the expected value of each feature function under the empirical distribution and under the model. • Max Likelihood: Finding the maximum entropy model is the same as finding the exponential model that maximizes the likelihood of the training data. • Parameter Selection: Gradient ascent or the Darroch-Ratcliff procedure can be used for finding the best parameters. • Incremental Feature Selection: At each step, pick the feature that tells us the most about the training data. • Use Cases: Translation, segmentation, and word reordering are some NLP tasks we can solve.

  47. Questions?

  48. Thank you, everyone!
