Maximum Entropy: Modeling, Decoding, Training
Advanced Statistical Methods in NLP
Ling 572
February 2, 2012
Roadmap
• MaxEnt:
  • Recap
• Modeling:
  • Computing expectations
  • Constraints in the model
• Decoding
• HW #5
• MaxEnt (cont'd):
  • Training
Maximum Entropy Principle: Summary
• Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes the entropy:
  H(p) = -\sum_x p(x) \log p(x)
• Questions:
  1) How do we model the constraints?
  2) How do we select the distribution?
Example II: MT (Berger et al., 1996)
• What if we find out that the translator uses dans or en 30% of the time?
• Constraint: p(dans) + p(en) = 3/10
• Now what is the maxent model?
  • p(dans) = p(en) = 3/20
  • p(à) = p(au cours de) = p(pendant) = 7/30
• What if we also know the translator picks à or dans 50% of the time?
• Add the new constraint: p(à) + p(dans) = 0.5
• Now what is the maxent model?
  • Not intuitively obvious; see the numeric sketch below.
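With both constraints active there is no answer we can read off by symmetry, so a numeric solver helps. A minimal sketch, assuming scipy is available (the word ordering and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)           # guard against log(0)
    return float(np.sum(p * np.log(p)))  # minimizing -H(p) maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},  # p(à) + p(dans) = 1/2
]

res = minimize(neg_entropy, x0=np.full(5, 0.2),
               bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for w, prob in zip(words, res.x):
    print(f"p({w}) = {prob:.4f}")
```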
Feature Functions
• A feature function is a binary-valued indicator function, e.g.:
  f_j(x, y) = \begin{cases} 1 & \text{if } y = \text{"guns" and } x \text{ includes "rifle"} \\ 0 & \text{otherwise} \end{cases}
• In text classification, j refers to a specific (feature, class) pair, s.t. the feature is present when y is that class.
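In code, such an indicator is just a predicate. A minimal sketch (the function name and the representation of x as a set of tokens are assumptions):

```python
def f_guns_rifle(x, y):
    """Fires (returns 1) iff the class is 'guns' and the document contains 'rifle'."""
    return 1 if y == "guns" and "rifle" in x else 0

print(f_guns_rifle({"rifle", "scope"}, "guns"))      # 1
print(f_guns_rifle({"rifle", "scope"}, "politics"))  # 0
```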
Empirical Expectation: Example
• Training data:
  x1 c1 t1 t2 t3
  x2 c2 t1 t4
  x3 c1 t3 t4
  x4 c3 t1 t3
• Raw counts of (feature, class) pairs:
       t1  t2  t3  t4
  c1    1   1   2   1
  c2    1   0   0   1
  c3    1   0   1   0
• Empirical expectation (raw counts divided by N = 4):
       t1   t2   t3   t4
  c1  1/4  1/4  2/4  1/4
  c2  1/4    0    0  1/4
  c3  1/4    0  1/4    0
Example due to F. Xia
Calculating Empirical Expectation
• Collect a set of training samples of size N
• For each instance x in the training data:
  • y = true label of x
  • For each feature t in x:
    • empirical_expectation[t][y] += 1/N
• This builds the previous table; a runnable sketch follows.
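A direct Python rendering of this loop on the toy data above (representing each instance as a feature set is an assumption):

```python
from collections import defaultdict

# Toy training data from the example: (feature set, gold class) pairs.
data = [({"t1", "t2", "t3"}, "c1"),
        ({"t1", "t4"}, "c2"),
        ({"t3", "t4"}, "c1"),
        ({"t1", "t3"}, "c3")]
N = len(data)

empirical = defaultdict(float)
for x, y in data:
    for t in x:
        empirical[(t, y)] += 1.0 / N  # each observed (t, y) pair adds 1/N

print(empirical[("t3", "c1")])  # t3 occurs with c1 in x1 and x3 -> 2/4 = 0.5
```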
Model Expectation: Example
• Let P(y|xi) = 1/3 for every class y
• Training data:
  x1 c1 t1 t2 t3
  x2 c2 t1 t4
  x3 c1 t4
  x4 c3 t1 t3
• Model expectation: each instance containing feature t contributes P(y|x)/N = (1/3)/4 = 1/12 to every class y:
        t1    t2    t3    t4
  c1  3/12  1/12  2/12  2/12
  c2  3/12  1/12  2/12  2/12
  c3  3/12  1/12  2/12  2/12
Example due to F. Xia
Calculating Model Expectation
• Collect a set of training samples of size N
• For each instance x in the training data:
  • Compute P(y|x) for each y in Y
  • For each feature t in x:
    • For each y in Y:
      • model_expectation[t][y] += 1/N * P(y|x)
• This builds the previous table; a runnable sketch follows.
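The same loop in Python, using the slide's stand-in uniform model P(y|x) = 1/3 (the data representation is the same assumption as before):

```python
from collections import defaultdict

classes = ["c1", "c2", "c3"]
# Toy training data from the example (gold labels are not used here).
data = [({"t1", "t2", "t3"}, "c1"),
        ({"t1", "t4"}, "c2"),
        ({"t4"}, "c1"),
        ({"t1", "t3"}, "c3")]
N = len(data)

def p_y_given_x(y, x):
    return 1.0 / len(classes)  # stand-in uniform model P(y|x) = 1/3

model_expectation = defaultdict(float)
for x, _gold in data:
    for t in x:
        for y in classes:
            model_expectation[(t, y)] += p_y_given_x(y, x) / N

print(model_expectation[("t1", "c1")])  # t1 occurs in 3 instances -> 3 * (1/3) / 4 = 0.25
```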
Comparing Expectations
• Empirical expectation:
  E_{\tilde p}[f_j] = \sum_{x,y} \tilde p(x, y) f_j(x, y)
• Model expectation:
  E_p[f_j] = \sum_{x,y} \tilde p(x) p(y|x) f_j(x, y)
• The empirical expectation counts observed (feature, class) pairs; the model expectation weights each candidate class by the model's P(y|x).
Incorporating Constraints
• Maximum entropy models:
  • Model known constraints
  • Otherwise, apply maximum entropy (minimal commitment)
  • Discriminative models: maximize conditional likelihood
• What are our constraints?
  • Our model must be consistent with the training data
  • So, model expectation = empirical expectation:
    E_p[f_j] = E_{\tilde p}[f_j] for every feature f_j
Conditional Likelihood
• Given data (X, Y), the conditional log-likelihood is a function of the parameters λ:
  L(\lambda) = \sum_i \log p_\lambda(y_i | x_i)
Constraints
• Make the model more consistent with the training data
• Move away from the simplest maximum-entropy model (the uniform distribution):
  • Make the model less uniform
  • Lower entropy
  • Increase likelihood
The Modeling Problem
• Goal: find p* s.t. p* = argmax_p H(p), subject to
  P = {p | E_p[f_j] = d_j, j = 1, …, k}
  where d_j = E_{\tilde p}[f_j] is the empirical expectation of feature f_j
• That is: maximize H(p), subject to the k constraints defining P.
Maximizing H(p)
• Problem: it is hard to compute the max of H(p) analytically
• Approach:
  • Convert to an alternate form that is easier to optimize, and whose optimum is also an optimum of H(p)
  • Technically, employ Lagrange multipliers:
    • Find multipliers λ that minimize the Lagrangian
    • The solution minimizing the new form will maximize H(p)
Solving w/ Lagrange Multipliers
• Minimize A(p), the Lagrangian; in the standard construction,
  A(p) = -H(p) + \sum_j \lambda_j (E_p[f_j] - d_j)
• Set A'(p) = 0 and solve; the derivation is sketched below
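Working this out yields the exponential (log-linear) form of p*. A sketch of the derivation, assuming the standard Lagrangian with an added normalization multiplier μ:

```latex
% Sketch of the standard derivation (the normalization multiplier \mu is assumed):
\begin{align*}
A(p) &= -H(p) + \sum_{j=1}^{k}\lambda_j\bigl(E_p[f_j] - d_j\bigr)
        + \mu\Bigl(\sum_{y} p(y \mid x) - 1\Bigr)\\
\frac{\partial A}{\partial p(y \mid x)} = 0
  &\;\Longrightarrow\;
  p(y \mid x) = \frac{\exp\bigl(\sum_j \lambda_j f_j(x,y)\bigr)}{Z_\lambda(x)},
  \qquad
  Z_\lambda(x) = \sum_{y'}\exp\Bigl(\sum_j \lambda_j f_j(x,y')\Bigr)
\end{align*}
```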
The Modeling Problem
• Goal: find p* s.t. p* = argmax_p H(p), subject to
  P = {p | E_p[f_j] = d_j, j = 1, …, k}
• Now what?
  • Are there p's that satisfy these constraints?
  • Does p* exist?
  • Is p* unique?
  • What is the form of p*?
  • How can we compute it?
p*: Existence, Form, & Uniqueness
• P = {p | E_p[f_j] = d_j, j = 1, …, k}
• Theorem 1 of (Ratnaparkhi, 1997) shows that:
  • If p* ∈ P ∩ Q, where Q is the set of distributions of log-linear (exponential) form, then p* = argmax_{p ∈ P} H(p), and p* is unique
p*
• Two forms: by optimization and by constraint
  • By optimization (the maxent solution):
    p*(y|x) = (1/Z(x)) \exp(\sum_j \lambda_j f_j(x, y))
  • By constraint (Ratnaparkhi's product form):
    p*(y|x) = \pi \prod_j \alpha_j^{f_j(x, y)}
• Equivalent: \pi = 1/Z; \lambda_j = \ln \alpha_j, since \prod_j \alpha_j^{f_j} = \exp(\sum_j f_j \ln \alpha_j)
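A quick numeric sanity check of the equivalence (the λ values and feature vector are arbitrary):

```python
import math

lambdas = [0.5, -1.2]
alphas = [math.exp(l) for l in lambdas]  # alpha_j = exp(lambda_j)
f = [1, 1]                               # feature values f_j(x, y) for one (x, y)

unnorm_opt = math.exp(sum(l * fj for l, fj in zip(lambdas, f)))  # exp-of-sum form
unnorm_con = math.prod(a ** fj for a, fj in zip(alphas, f))      # product form
assert abs(unnorm_opt - unnorm_con) < 1e-12  # identical up to rounding
```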
The Model: Summary
• Goal: find p* s.t. p* = argmax_p H(p), subject to
  P = {p | E_p[f_j] = d_j, j = 1, …, k}
• p*:
  • Is unique
  • Maximizes the conditional likelihood of the training data
  • Is of the form:
    p*(y|x) = (1/Z(x)) \exp(\sum_j \lambda_j f_j(x, y))
Decoding
• p(y|x) = (1/Z(x)) \exp(\sum_j \lambda_j f_j(x, y)), where Z(x) = \sum_{y'} \exp(\sum_j \lambda_j f_j(x, y')) is the normalization term
Decoding
• Given a trained model with weights λj:
  • Z = 0
  • For each y in Y:
    • sum = 0  # initialize, or set to a default_weight
    • For each t in x:
      • sum += weight for (t, y)
• A runnable sketch, including the final normalization step, follows.
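A Python rendering of the decoding loop; the final steps (exponentiate each sum, accumulate Z, and divide) follow from the model form above, and `weights` / `default_weight` are illustrative names:

```python
import math

def decode(x, classes, weights, default_weight=0.0):
    """Return p(y|x) for each class; weights[(t, y)] is the learned
    lambda for feature t paired with class y."""
    scores = {}
    for y in classes:
        s = default_weight                 # initialize the sum
        for t in x:
            s += weights.get((t, y), 0.0)  # add the weight for (t, y)
        scores[y] = math.exp(s)            # unnormalized score exp(sum)
    Z = sum(scores.values())               # normalization term
    return {y: scores[y] / Z for y in classes}

probs = decode({"t1", "t3"}, ["c1", "c2", "c3"], {("t1", "c1"): 0.5})
print(max(probs, key=probs.get))  # most probable class: c1
```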