This lecture covers the course project logistics and introduces conditional exponential and maximum entropy models for classification, including motivation, model formulation, and estimation algorithms.
Project • Now it is time to think about the project • It is team work: each team will consist of 2 people • It is better to propose a project of your own • Otherwise, I will assign you to some “difficult” project • Important dates • 03/11: project proposal due • 04/01: project progress report due • 04/22 and 04/24: final presentations • 05/03: final report due
Project Proposal • What do I expect? • Introduction: describe the research problem that you are trying to solve • Related work: describe the existing approaches and their deficiencies • Proposed approaches: describe your approaches and why they may have the potential to alleviate the deficiencies of existing approaches • Plan: what do you plan to do in this project? • Format • It should look like a research paper • The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip
Project Progress Report • Introduction: overview of the problem that you are trying to solve and the solutions that you presented in the proposal • Progress • Algorithm description in more detail • Related data collection and cleanup • Preliminary results • The format should be the same as the project report
Project Final Report • It should look like a research paper that is ready for submission to a research conference • What do I expect? • Introduction • Algorithm description and discussion • Empirical studies • I expect a careful analysis of results, whether the approach is a success or a complete failure • Presentation • 25-minute presentation • 5-minute discussion
Recap: Logistic Regression Model • Assume the inputs and outputs are related through a log-linear function • Estimate the weights with the MLE approach
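The slide's formula did not survive extraction. As a hedged reconstruction of the standard binary logistic regression model the recap refers to:

p(y \mid \mathbf{x}) = \frac{1}{1 + \exp\!\left(-y(\mathbf{w}^\top \mathbf{x} + c)\right)}, \qquad y \in \{+1, -1\}

and the MLE approach estimates w and c by maximizing the log-likelihood \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i).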
How to Extend the Logistic Regression Model to Multiple Classes? • y ∈ {+1, -1} → y ∈ {1, 2, …, C}?
Conditional Exponential Model • Introduce a different set of parameters for each class • Ensure the probabilities sum to 1
Conditional Exponential Model • Prediction probability • Model parameters: for each class y, we have a weight vector wy and a threshold cy • Maximum likelihood estimation • Any problems?
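The prediction formula is missing from the extracted slide; a hedged reconstruction consistent with the parameters listed (a weight vector w_y and a threshold c_y per class):

p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^\top \mathbf{x} + c_y)}{\sum_{y'=1}^{C} \exp(\mathbf{w}_{y'}^\top \mathbf{x} + c_{y'})}

with MLE \max_{\{\mathbf{w}_y, c_y\}} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i). The problem the slide hints at: this parameterization is over-complete, as the next slide shows.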
Conditional Exponential Model • If we add the same constant vector to every weight vector, we obtain the same log-likelihood function • The optimal solution is not unique! • How to resolve this problem? • Solution: set w1 to be a zero vector and c1 to be zero
Modified Conditional Exponential Model • Prediction probability • Model parameters: for each class y > 1, we have a weight vector wy and a threshold cy • Maximum likelihood estimation
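A hedged reconstruction of the modified model with the class-1 parameters pinned at zero (w_1 = 0, c_1 = 0):

p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^\top \mathbf{x} + c_y)}{1 + \sum_{y'=2}^{C} \exp(\mathbf{w}_{y'}^\top \mathbf{x} + c_{y'})}

where the 1 in the denominator is the class-1 term exp(0). This removes the redundancy in the parameterization noted on the previous slide.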
Maximum Entropy Model: Motivation • Consider a translation example • English ‘in’ → French {dans, en, à, au cours de, pendant} • Goal: estimate p(dans), p(en), p(à), p(au cours de), p(pendant) • Case 1: no prior knowledge about the translation • What is your guess of the probabilities? • p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5 • Case 2: 30% of the time either dans or en is used • What is your guess of the probabilities? • p(dans) = p(en) = 3/20, p(à) = p(au cours de) = p(pendant) = 7/30 • The uniform distribution is favored: spread the mass as evenly as the constraints allow
Maximum Entropy Model: Motivation • Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used • What is your guess of the probabilities? • A good probability distribution should • Satisfy the constraints • Be close to the uniform distribution, but how? • Measure uniformity using the Kullback-Leibler distance to the uniform distribution!
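Why the KL distance to the uniform distribution is the right notion of uniformity: for a distribution p over N outcomes and the uniform distribution u with u_k = 1/N,

D_{KL}(p \,\|\, u) = \sum_{k} p_k \log \frac{p_k}{1/N} = \log N - H(p)

so minimizing the KL distance to uniform is exactly maximizing the entropy H(p). This is the step that motivates the maximum entropy principle on the next slide.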
Maximum Entropy Principle (MaxEnt) • The uniformity of a distribution is measured by its entropy • Solution: p(dans) = 0.2, p(en) = 0.1, p(à) = 0.3, p(au cours de) = 0.2, p(pendant) = 0.2
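For Case 3, the maximum entropy program the slide solves can be written as:

\max_{p} \; H(p) = -\sum_{k} p_k \log p_k \quad \text{s.t.} \quad p(\text{dans}) + p(\text{en}) = 0.3, \quad p(\text{dans}) + p(\text{à}) = 0.5, \quad \sum_k p_k = 1, \quad p_k \ge 0

The stated solution satisfies all the constraints: 0.2 + 0.1 = 0.3, 0.2 + 0.3 = 0.5, and the five probabilities sum to 1.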
MaxEnt for Classification Problems • Want p(y|x) to be close to a uniform distribution • Maximize the conditional entropy on the training data • Constraints • Valid probability distribution • From the training data: the model should be consistent with the data • For each class, the model mean of x = the empirical mean of x
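Written out, one common form of the program this slide describes (hedged: the exact normalization is an assumption, since the original formula did not survive extraction):

\max_{p} \; -\sum_{i=1}^{n} \sum_{y=1}^{C} p(y \mid \mathbf{x}_i) \log p(y \mid \mathbf{x}_i)

\text{s.t.} \quad \sum_{y} p(y \mid \mathbf{x}_i) = 1 \;\text{ for all } i, \qquad \sum_{i=1}^{n} p(y \mid \mathbf{x}_i)\, \mathbf{x}_i = \sum_{i:\, y_i = y} \mathbf{x}_i \;\text{ for every class } y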
MaxEnt for Classification Problems • Require the means to be consistent between the empirical data and the model • No assumption about the parametric form of the likelihood • Only assume it is C2-continuous
MaxEnt Model • Consistency with the data is ensured by the equality constraints • For each feature, the empirical mean equals the model mean • Beyond the raw feature vector x: the constraints can be defined over arbitrary features of the input and the class
Translation Problem • Parameters: p(dans), p(en), p(à), p(au cours de), p(pendant) • Represent each French word with two features, as sketched below
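The feature table is missing from the extracted slide; a plausible reconstruction consistent with the two constraints of Case 3 uses two binary features:

f_1(w) = 1 if w ∈ {dans, en}, else 0    (constraint: E[f_1] = 0.3)
f_2(w) = 1 if w ∈ {dans, à}, else 0     (constraint: E[f_2] = 0.5)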
Solution to MaxEnt • Surprisingly, the solution is just the conditional exponential model without threshold terms • Why?
Maximum Entropy Model versus Conditional Exponential Model • Maximum entropy model ↔ (dual problem) ↔ conditional exponential model • The conditional exponential model is the dual problem of the maximum entropy model
Maximum Entropy Model vs. Conditional Exponential Model • The maximum entropy solution matches the conditional exponential model • However, where is the threshold term c?
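One standard answer, offered here as a hedged note rather than the slide's own derivation: the threshold can be recovered by appending a constant feature. If \tilde{\mathbf{x}} = (\mathbf{x}, 1) and \tilde{\mathbf{w}}_y = (\mathbf{w}_y, c_y), then

\tilde{\mathbf{w}}_y^\top \tilde{\mathbf{x}} = \mathbf{w}_y^\top \mathbf{x} + c_y

so a MaxEnt model over the augmented features is exactly the conditional exponential model with thresholds.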
Solving Maximum Entropy Model • Iterative scaling algorithm • Assume all features are non-negative and that, for every input, the features sum to a constant
Solving Maximum Entropy Model • Compute the empirical mean of each feature for every class, i.e., for every j and every class y • Start with w1 = w2 = … = wC = 0 • Repeat • Compute p(y|xi) for each training data point (xi, yi) using the w from the previous iteration • Compute the model mean of each feature for every class using the estimated probabilities, i.e., for every j and every y • Compute the update for every j and every y • Update w accordingly (see the sketch below)
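A minimal NumPy sketch of the iterative scaling loop above (Generalized Iterative Scaling). This is my own illustration, not the lecture's code: the function name gis_maxent, the smoothing constant eps, and the assumption that every row of X sums to the same constant C are all mine.

```python
import numpy as np

def gis_maxent(X, y, n_classes, n_iter=100):
    """Generalized Iterative Scaling for a conditional MaxEnt model.

    Assumes all features are non-negative and every row of X sums to
    the same constant C (the classic GIS requirement stated above).
    """
    n, d = X.shape
    C = X.sum(axis=1)[0]              # common feature sum per example (assumed constant)
    W = np.zeros((n_classes, d))      # one weight vector per class, all starting at zero

    # Empirical mean of each feature for every class
    emp = np.zeros((n_classes, d))
    for i in range(n):
        emp[y[i]] += X[i]
    emp /= n

    for _ in range(n_iter):
        # p(y | x_i) under the current weights (no thresholds in the MaxEnt solution)
        scores = X @ W.T                              # (n, n_classes)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)

        # Model mean of each feature for every class under the estimated probabilities
        model = (P.T @ X) / n                         # (n_classes, d)

        # GIS update: w_{y,j} += (1/C) * log(empirical mean / model mean)
        eps = 1e-12                                   # guards against log(0)
        W += np.log((emp + eps) / (model + eps)) / C

    return W

# Usage: W = gis_maxent(X, y, n_classes=3); then p(y|x) is proportional to exp(W @ x).
```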
Solving Maximum Entropy Model • The likelihood function is guaranteed to increase at every iteration!
Solving Maximum Entropy Model • What if a feature can take both positive and negative values? • What if the sum of the features is not a constant? • How can this approach be applied to the conditional exponential model with a bias term (or threshold term)?
Improved Iterative Scaling • It only requires all input features to be non-negative • Compute the empirical mean of each feature for every class, i.e., for every j and every class y • Start with w1 = w2 = … = wC = 0 • Repeat • Compute p(y|xi) for each training data point (xi, yi) using the w from the previous iteration • Solve a one-dimensional equation for the update, for every j and every y (see below) • Update w accordingly
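The update equation omitted above, as given in the standard IIS derivation (Berger et al., 1996) and adapted here to this notation: for every class y and feature j, find \delta_{y,j} solving

\sum_{i=1}^{n} p(y \mid \mathbf{x}_i)\, x_{ij}\, \exp\!\left(\delta_{y,j}\, f^{\#}(\mathbf{x}_i)\right) = \sum_{i:\, y_i = y} x_{ij}, \qquad f^{\#}(\mathbf{x}) = \sum_{j} x_j

then set w_{y,j} \leftarrow w_{y,j} + \delta_{y,j}. When f^{\#} is constant this reduces to the closed-form GIS update of the previous slides; otherwise the one-dimensional equation is solved numerically, e.g., by Newton's method.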
Choice of Features • A feature does not have to be one of the inputs • For the maximum entropy model, bounded features are more favorable • Very often, people use binary features • Feature selection • Features with small weights are eliminated
Feature Selection vs. Regularizers • Regularizer → sparse solution → automatic feature selection • But an L2 regularizer rarely results in features with zero weights → not appropriate for feature selection • For the purpose of feature selection, the L1 norm is usually used
Solving the L1 Regularized Conditional Exponential Model • Solving the L1 regularized conditional exponential model directly is rather difficult • Because the absolute value function is not differentiable at zero • Any suggestions to alleviate this problem?
Solving the L1 Regularized Conditional Exponential Model • Solution: introduce slack variables to remove the absolute values from the objective, as sketched below
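A minimal sketch of the standard slack-variable reformulation (the notation λ for the regularization weight and t for the slacks is assumed, since the slide's formula did not survive extraction):

\min_{\mathbf{w},\, \mathbf{t}} \; -\sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \mathbf{w}) + \lambda \sum_{y, j} t_{y,j} \qquad \text{s.t.} \quad -t_{y,j} \le w_{y,j} \le t_{y,j}

At the optimum t_{y,j} = |w_{y,j}|, so this is equivalent to the L1 regularized model, but the objective is now smooth and the absolute values have become linear constraints.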