Team Project Proposal: Exponential and Maximum Entropy Models

This project proposal explores the use of Exponential and Maximum Entropy models for classification problems. It includes an introduction, proposed approaches, and a plan for the project.

Presentation Transcript


  1. Project • Now it is time to think about the project • It is team work • Each team will consist of 2 people • It is better to propose a project of your own • Otherwise, I will assign you to some “difficult” project • Important dates • 03/11: project proposal due • 04/01: project progress report due • 04/22 and 04/24: final presentation • 05/03: final report due

  2. Project Proposal • What do I expect? • Introduction: describe the research problem that you are trying to solve • Related work: describe the existing approaches and their deficiencies • Proposed approaches: describe your approaches and why they may have the potential to alleviate the deficiencies of existing approaches • Plan: what do you plan to do in this project? • Format • It should look like a research paper • The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip

  3. Project Progress Report • Introduction: give an overview of the problem that you are trying to solve and the solutions that you presented in the proposal • Progress • Algorithm description in more detail • Related data collection and cleanup • Preliminary results • Format should be the same as the project report

  4. Project Final Report • It should look like a research paper that is ready for submission to a research conference • What do I expect? • Introduction • Algorithm description and discussion • Empirical studies • I am expecting a careful analysis of the results, whether the approach is a success or a complete failure • Presentation • 25 minute presentation • 5 minute discussion

  5. Exponential Model and Maximum Entropy Model Rong Jin

  6. Recap: Logistic Regression Model • Assume the inputs and outputs are related by a log-linear function • Estimate weights: MLE approach
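
The formulas on this slide did not survive in the transcript. A standard way of writing the binary logistic regression model and its maximum likelihood objective is sketched below. The symbols are assumptions chosen to match the later slides: labels y in {+1, -1}, weight vector w, threshold c, and n training pairs (x_i, y_i).

```latex
p(y \mid x) = \frac{1}{1 + \exp\bigl(-y\,(w \cdot x + c)\bigr)}, \qquad y \in \{+1, -1\}

(w^{*}, c^{*}) = \arg\max_{w,\,c} \sum_{i=1}^{n} \log p(y_i \mid x_i)
```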

  7. How to Extend Logistic Regression Model to Multiple Classes? • y ∈ {+1, −1} → y ∈ {1, 2, …, C}?

  8. Conditional Exponential Model • Introduce a different set of parameters for each class • Ensure that the probabilities sum to 1
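
The model equation itself is missing from the transcript. Under the two bullets above, a common form is the following, where the per-class weights w_y and threshold c_y follow the notation of the next slide and the denominator is what makes the probabilities sum to 1. This is a reconstruction, not necessarily the exact expression shown in the lecture.

```latex
p(y \mid x) = \frac{\exp(c_y + w_y \cdot x)}{\sum_{y'=1}^{C} \exp(c_{y'} + w_{y'} \cdot x)}, \qquad y \in \{1, 2, \ldots, C\}
```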

  9. Conditional Exponential Model • Prediction probability • Model parameters: • For each class y, we have weights w_y and threshold c_y • Maximum likelihood estimation • Any problems?
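
A minimal sketch of the prediction probability and the log-likelihood that MLE maximizes, assuming the usual softmax form of the conditional exponential model. The function names `predict_proba` and `log_likelihood` are hypothetical, and classes are indexed 0..C-1 here rather than 1..C.

```python
import numpy as np


def predict_proba(X, W, c):
    """Return p(y|x) for every row of X.

    X: (n, d) inputs, W: (C, d) one weight vector per class,
    c: (C,) one threshold per class.
    """
    scores = X @ W.T + c                                   # (n, C)
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expo = np.exp(shifted)
    return expo / expo.sum(axis=1, keepdims=True)          # rows sum to 1


def log_likelihood(X, y, W, c):
    """Sum of log p(y_i | x_i) over the training data (labels in 0..C-1)."""
    P = predict_proba(X, W, c)
    return float(np.log(P[np.arange(len(y)), y]).sum())
```

With all weights and thresholds at zero, every class gets probability 1/C, which is a convenient sanity check.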

  10. Conditional Exponential Model • If we add the same constant vector to every weight vector, the log-likelihood function is unchanged • The optimal solution is not unique! • How to resolve this problem? • Solution: set w_1 to be a zero vector and c_1 to be zero
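
The non-uniqueness can be checked numerically. The sketch below assumes the softmax form of the model; the data, the shift vector `v`, and the helper `softmax_rows` are made up for illustration. It shows that shifting every weight vector by `v` leaves all probabilities unchanged, and that pinning the first class to zero picks one representative of the equivalent solutions.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax with the usual max-shift for stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    expo = np.exp(shifted)
    return expo / expo.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 illustrative inputs with 3 features
W = rng.normal(size=(4, 3))   # 4 classes
c = rng.normal(size=4)
v = rng.normal(size=3)        # arbitrary vector added to every w_y

P_original = softmax_rows(X @ W.T + c)
P_shifted = softmax_rows(X @ (W + v).T + c)
same_after_shift = np.allclose(P_original, P_shifted)

# Pin the first class: subtract w_1 and c_1 from every class.
W_pinned = W - W[0]
c_pinned = c - c[0]
P_pinned = softmax_rows(X @ W_pinned.T + c_pinned)
same_after_pinning = np.allclose(P_original, P_pinned)
```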

  11. Modified Conditional Exponential Model • Prediction probability • Model parameters: • For each class y > 1, we have weights w_y and threshold c_y • Maximum likelihood estimation
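
The prediction probability is not reproduced in the transcript. If the first class is pinned to a zero weight vector and zero threshold, its score is exp(0) = 1, which suggests the form below. This is a reconstruction under that assumption.

```latex
p(y \mid x) = \frac{\exp(c_y + w_y \cdot x)}{1 + \sum_{y'=2}^{C} \exp(c_{y'} + w_{y'} \cdot x)} \quad (y > 1),
\qquad
p(1 \mid x) = \frac{1}{1 + \sum_{y'=2}^{C} \exp(c_{y'} + w_{y'} \cdot x)}
```

For C = 2 this reduces to ordinary logistic regression.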

  12. Maximum Entropy Model: Motivation • Consider a translation example • English ‘in’ → French {dans, en, à, au-cours-de, pendant} • Goal: p(dans), p(en), p(à), p(au-cours-de), p(pendant) • Case 1: no prior knowledge on translation • What is your guess of the probabilities?

  13. Maximum Entropy Model: Motivation • Consider a translation example • English ‘in’ → French {dans, en, à, au-cours-de, pendant} • Goal: p(dans), p(en), p(à), p(au-cours-de), p(pendant) • Case 1: no prior knowledge on translation • What is your guess of the probabilities? • p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5 • Case 2: 30% of the time either dans or en is used

  14. Maximum Entropy Model: Motivation • Consider a translation example • English ‘in’ → French {dans, en, à, au-cours-de, pendant} • Goal: p(dans), p(en), p(à), p(au-cours-de), p(pendant) • Case 1: no prior knowledge on translation • What is your guess of the probabilities? • p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5 • Case 2: 30% of the time either dans or en is used • What is your guess of the probabilities? • p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30 • Uniform distribution is favored

  15. Maximum Entropy Model: Motivation • Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used • What is your guess of the probabilities?

  16. Maximum Entropy Model: Motivation • Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used • What is your guess of the probabilities? • A good probability distribution should • Satisfy the constraints • Be close to the uniform distribution, but how? • Measure uniformity using the Kullback-Leibler distance!

  17. Maximum Entropy Principle (MaxEnt) • The uniformity of a distribution is measured by its entropy • Solution (approximately): p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
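
The Case 3 solution can be checked with a few lines of code. The sketch below is one possible way to do it, not the method used in the lecture: the two constraints leave a single free parameter a = p(dans), entropy is concave in a, so bisection on its derivative finds the maximizer. The check suggests the exact maximizer is near p(dans) ≈ 0.186, so the values on the slide read as one-decimal roundings. All function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def distribution(a):
    """Distribution meeting both Case 3 constraints, given a = p(dans).

    p(dans) + p(en) = 0.3 and p(dans) + p(à) = 0.5; the remaining mass is
    split evenly between au-cours-de and pendant, which maximizes entropy
    for those two entries.
    """
    rest = (0.2 + a) / 2.0
    return {"dans": a, "en": 0.3 - a, "à": 0.5 - a,
            "au-cours-de": rest, "pendant": rest}


def entropy_slope(a):
    """Derivative of the entropy of distribution(a) with respect to a."""
    return (-math.log(a) + math.log(0.3 - a) + math.log(0.5 - a)
            - math.log((0.2 + a) / 2.0))


# The slope is positive near 0 and negative near 0.3, so bisect on its sign.
lo, hi = 1e-9, 0.3 - 1e-9
for _ in range(100):
    mid = (lo + hi) / 2.0
    if entropy_slope(mid) > 0:
        lo = mid
    else:
        hi = mid

best = distribution((lo + hi) / 2.0)
slide = distribution(0.2)   # the rounded values from the slide
```

Comparing `entropy(best.values())` with `entropy(slide.values())` shows how little entropy is lost by the rounding.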

  18. MaxEnt for Classification Problems • Want p(y|x) to be close to a uniform distribution • Maximize the conditional entropy on the training data • Constraints • Valid probability distribution • From training data: the model should be consistent with the data • For each class, model mean of x = empirical mean of x
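
Written out, the bullets above correspond to a constrained optimization problem along the following lines. The notation is an assumption, since the slide formulas are not in the transcript: n training pairs (x_i, y_i), C classes, and δ(y, y_i) equal to 1 when y = y_i and 0 otherwise.

```latex
\max_{p}\; H(y \mid x) = -\sum_{i=1}^{n}\sum_{y=1}^{C} p(y \mid x_i)\log p(y \mid x_i)

\text{subject to}\quad \sum_{y=1}^{C} p(y \mid x_i) = 1 \quad \text{for every } i,
\qquad
\sum_{i=1}^{n} p(y \mid x_i)\,x_i = \sum_{i=1}^{n} \delta(y, y_i)\,x_i \quad \text{for every class } y
```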

  20. MaxEnt for Classification Problems • Require the mean to be consistent between the empirical data and the model • No assumption about the parametric form of the likelihood • Only assume it is C2 continuous

  21. MaxEnt Model • Consistency with data is ensured by the equality constraints • For each feature, the empirical mean equals the model mean • Beyond feature vector x:

  22. Translation Problem • Parameters: p(dans), p(en), p(à), p(au-cours-de), p(pendant) • Represent each French word with two features

  23. Constraints
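
The two features and the constraint equations are not shown in the transcript. A natural reading, assumed here, is one indicator feature per Case 3 statement: f1 = 1 for dans or en, and f2 = 1 for dans or à. The sketch below checks that the distribution quoted on the MaxEnt principle slide gives model means of 0.3 and 0.5 for these features. The names `features` and `model_mean` are hypothetical.

```python
words = ["dans", "en", "à", "au-cours-de", "pendant"]

# Assumed encoding: (f1, f2) = (is dans or en, is dans or à).
features = {
    "dans": (1, 1),
    "en": (1, 0),
    "à": (0, 1),
    "au-cours-de": (0, 0),
    "pendant": (0, 0),
}

# Probabilities quoted on the MaxEnt principle slide.
slide_solution = {
    "dans": 0.2, "en": 0.1, "à": 0.3, "au-cours-de": 0.2, "pendant": 0.2,
}


def model_mean(p, j):
    """Expected value of feature j under the distribution p."""
    return sum(p[w] * features[w][j] for w in words)


mean_f1 = model_mean(slide_solution, 0)   # should match the 30% statement
mean_f2 = model_mean(slide_solution, 1)   # should match the 50% statement
```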

  24. Solution to MaxEnt • Surprisingly, the solution is just the conditional exponential model without thresholds • Why?

  25. Solution to MaxEnt
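
The derivation on this slide is missing from the transcript. One standard route, sketched here under assumed notation, uses Lagrange multipliers: a vector w_y for the mean constraint of each class and a scalar μ_i for the normalization constraint of each training point x_i, with δ(y, y_i) equal to 1 when y = y_i and 0 otherwise.

```latex
\mathcal{L} = -\sum_{i}\sum_{y} p(y \mid x_i)\log p(y \mid x_i)
  + \sum_{y} w_y \cdot \Bigl(\sum_{i} p(y \mid x_i)\,x_i - \sum_{i}\delta(y, y_i)\,x_i\Bigr)
  + \sum_{i}\mu_i\Bigl(\sum_{y} p(y \mid x_i) - 1\Bigr)

\frac{\partial \mathcal{L}}{\partial p(y \mid x_i)}
  = -\log p(y \mid x_i) - 1 + w_y \cdot x_i + \mu_i = 0
\;\Longrightarrow\;
p(y \mid x_i) = \exp(w_y \cdot x_i + \mu_i - 1)

\sum_{y} p(y \mid x_i) = 1
\;\Longrightarrow\;
p(y \mid x_i) = \frac{\exp(w_y \cdot x_i)}{\sum_{y'} \exp(w_{y'} \cdot x_i)}
```

The multipliers w_y play the role of the class weights, and no threshold appears because no constraint involves a constant feature. Appending a feature that is always 1 to x would produce a threshold term.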

  26. Maximum Entropy Model versus Conditional Exponential Model • Maximum Entropy Model • Dual Problem • Conditional Exponential Model

  27. Maximum Entropy Model vs. Conditional Exponential Model • Maximum Entropy • Conditional Exponential • However, where is the threshold term c?

  28. Solving Maximum Entropy Model • Iterative scaling algorithm • Assume

  29. Solving Maximum Entropy Model • Compute the empirical mean for each feature of every class, i.e., for every j and every class y • Start with w_1, w_2, …, w_C = 0 • Repeat • Compute p(y|x) for each training data point (x_i, y_i) using w and c from the previous iteration • Compute the mean of each feature of every class using the estimated probabilities, i.e., for every j and every y • Compute for every j and every y • Update w as

  30. Solving Maximum Entropy Model • Compute the empirical mean for each feature of every class, i.e., for every j and every class y • Start with w_1, w_2, …, w_C = 0 • Repeat • Compute p(y|x) for each training data point (x_i, y_i) using w from the previous iteration • Compute the mean of each feature of every class using the estimated probabilities, i.e., for every j and every y • Compute for every j and every y • Update w as
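
The update formulas are missing from the transcript, so the sketch below fills them in with the classical generalized iterative scaling rule: each weight moves by (1/M) log(empirical mean / model mean). It assumes every feature is non-negative, every input's features sum to the same constant M, and every empirical mean is strictly positive. The function names and the synthetic data are hypothetical, and classes are indexed from 0.

```python
import numpy as np


def softmax_rows(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    expo = np.exp(shifted)
    return expo / expo.sum(axis=1, keepdims=True)


def log_likelihood(X, y, W):
    P = softmax_rows(X @ W.T)
    return float(np.log(P[np.arange(len(y)), y]).sum())


def iterative_scaling(X, y, n_classes, n_iters=50):
    """Iterative scaling sketch for the model without thresholds."""
    n, d = X.shape
    row_sums = X.sum(axis=1)
    M = row_sums[0]
    if np.any(X < 0) or not np.allclose(row_sums, M):
        raise ValueError("features must be non-negative with a constant sum")
    one_hot = np.eye(n_classes)[y]            # (n, C) indicator of y_i = y
    empirical = one_hot.T @ X / n             # empirical mean per (class, feature)
    W = np.zeros((n_classes, d))              # start with all weights at zero
    history = [log_likelihood(X, y, W)]
    for _ in range(n_iters):
        P = softmax_rows(X @ W.T)             # p(y|x_i) under current weights
        model = P.T @ X / n                   # model mean per (class, feature)
        W += np.log(empirical / model) / M    # scaling update
        history.append(log_likelihood(X, y, W))
    return W, history


# Illustrative synthetic data: positive features normalized to sum to 1.
rng = np.random.default_rng(0)
X = rng.random((60, 3)) + 0.1
X /= X.sum(axis=1, keepdims=True)
y = rng.integers(0, 3, size=60)
W, history = iterative_scaling(X, y, n_classes=3)
```

The `history` list records the log-likelihood after each pass, which makes it easy to inspect how the objective evolves.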

  31. Solving Maximum Entropy Model • The likelihood function always increases!

  32. Solving Maximum Entropy Model • What if a feature can take both positive and negative values? • What if the sum of the features is not a constant? • How can this approach be applied to the conditional exponential model with a bias term (or threshold term)?

  33. Improved Iterative Scaling • It only requires all the input features to be positive • Compute the empirical mean for each feature of every class, i.e., for every j and every class y • Start with w_1, w_2, …, w_C = 0 • Repeat • Compute p(y|x) for each training data point (x_i, y_i) using w and c from the previous iteration • Solve for every j and every y • Update w as
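
The equation to solve is missing from the transcript. In the usual formulation of improved iterative scaling, the step δ for class y and feature j satisfies: sum over i of p(y|x_i) x_ij exp(δ s_i) equals the sum over i with y_i = y of x_ij, where s_i is the sum of the features of x_i. The sketch below assumes that form and solves it with a few Newton steps. All names and the synthetic data are hypothetical.

```python
import numpy as np


def softmax_rows(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    expo = np.exp(shifted)
    return expo / expo.sum(axis=1, keepdims=True)


def log_likelihood(X, y, W):
    P = softmax_rows(X @ W.T)
    return float(np.log(P[np.arange(len(y)), y]).sum())


def solve_delta(P, X, target, n_newton=25):
    """Newton's method for the per-(class, feature) scaling equations."""
    s = X.sum(axis=1)                                   # feature sum per input
    delta = np.zeros_like(target)                       # (C, d)
    for _ in range(n_newton):
        growth = np.exp(delta[None, :, :] * s[:, None, None])   # (n, C, d)
        terms = P[:, :, None] * X[:, None, :] * growth
        g = terms.sum(axis=0) - target                  # residual of the equation
        g_prime = (terms * s[:, None, None]).sum(axis=0)
        delta = delta - g / g_prime
    return delta


def improved_iterative_scaling(X, y, n_classes, n_iters=30):
    one_hot = np.eye(n_classes)[y]
    target = one_hot.T @ X                # empirical feature sums per class
    W = np.zeros((n_classes, X.shape[1]))
    history = [log_likelihood(X, y, W)]
    for _ in range(n_iters):
        P = softmax_rows(X @ W.T)
        W = W + solve_delta(P, X, target)
        history.append(log_likelihood(X, y, W))
    return W, history


# Illustrative synthetic data: positive features, row sums NOT constant.
rng = np.random.default_rng(1)
X = rng.random((60, 3)) + 0.1
y = rng.integers(0, 3, size=60)
W, history = improved_iterative_scaling(X, y, n_classes=3)
```

When every input has the same feature sum, the equation has a closed-form solution and the procedure reduces to the plain iterative scaling update.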

  34. Choice of Features • A feature does not have to be one of the inputs • For the maximum entropy model, bounded features are preferable • Very often, people use binary features • Feature selection • Features with small weights are eliminated

  35. Feature Selection vs. Regularizers • Regularizer → sparse solution → automatic feature selection • But the L2 regularizer rarely results in features with zero weights → not appropriate for feature selection • For the purpose of feature selection, the L1 norm is usually used
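
The contrast between the two regularizers shows up even in one dimension. The sketch below swaps in a simple quadratic loss 0.5 (w - a)^2 for the log-likelihood, purely for illustration, because its regularized minimizers have closed forms: soft-thresholding for the L1 penalty λ|w| and uniform shrinkage for the L2 penalty 0.5 λ w^2. The values in `a` and the choice of `lam` are made up.

```python
import numpy as np


def l1_minimizer(a, lam):
    """argmin_w 0.5 * (w - a)**2 + lam * |w|  (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def l2_minimizer(a, lam):
    """argmin_w 0.5 * (w - a)**2 + 0.5 * lam * w**2  (shrinkage)."""
    return a / (1.0 + lam)


a = np.array([3.0, -0.4, 0.2, -2.0, 0.05])   # unregularized weights
lam = 0.5

w_l1 = l1_minimizer(a, lam)
w_l2 = l2_minimizer(a, lam)

n_zero_l1 = int(np.sum(w_l1 == 0.0))   # small weights are set exactly to zero
n_zero_l2 = int(np.sum(w_l2 == 0.0))   # weights shrink but stay non-zero
```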

  37. Solving the L1 Regularized Conditional Exponential Model • Solving the L1 regularized conditional exponential model directly is rather difficult • Because the absolute value function is not differentiable at zero • Any suggestions to alleviate this problem?

  38. Solving the L1 Regularized Conditional Exponential Model • Slack variables
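
The slack-variable formulation is not reproduced in the transcript. A common version, sketched here under assumed notation, replaces each |w_{y,j}| with a new variable t_{y,j} bounded by two linear inequalities, so the objective becomes smooth and the non-smooth part moves into the constraints. The regularization weight λ and the symbol t are assumptions.

```latex
\min_{w,\,c,\,t}\; -\sum_{i=1}^{n} \log p(y_i \mid x_i) + \lambda \sum_{y}\sum_{j} t_{y,j}
\quad \text{subject to} \quad -t_{y,j} \le w_{y,j} \le t_{y,j} \ \text{ for every } y, j
```

At the optimum each t_{y,j} equals |w_{y,j}|, so the problem has the same solutions as the original L1 regularized one.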
