Maximum Entropy Discrimination


Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)

Classification
• inputs x, class y ∈ {+1, −1}
• data D = { (x_1, y_1), …, (x_T, y_T) }
• learn f_opt(x), a discriminant function from F = {f}, a family of discriminants
• classify: y = sign f_opt(x)
Model averaging
• many f in F have near-optimal performance → average over all f in F
• Q(f) = weight of f
• y(x) = sign Σ_{f ∈ F} Q(f) f(x) = sign ⟨f(x)⟩_Q

To specify: F = { f }, the family of discriminant functions
To learn: Q(f), a distribution over F
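A minimal numerical sketch of this averaging rule, assuming a small hand-picked family of linear discriminants and an arbitrary weighting Q (the weight vectors and Q values are illustrative, not from the talk):

```python
import numpy as np

# Hypothetical finite family of discriminants: f_k(x) = w_k . x
W = np.array([[1.0, 0.5],
              [0.8, 1.2],
              [-0.2, 1.0]])          # one row per f_k

# Q(f_k): weight of each discriminant (sums to 1)
Q = np.array([0.5, 0.3, 0.2])

def classify(x):
    # y(x) = sign Σ_k Q(f_k) f_k(x) = sign <f(x)>_Q
    avg = Q @ (W @ x)
    return 1 if avg >= 0 else -1
```

Any single f_k could be used alone; the point of the averaged rule is that the sign of the Q-weighted score decides the label.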
Goal of this work
• Define a discriminative criterion for averaging over models

• can incorporate prior
• can use generative model
• computationally feasible
• generalizes to other discrimination tasks
Maximum Entropy Discrimination

given data set D = { (x_1, y_1), …, (x_T, y_T) }, find

Q_ME = argmax_Q H(Q)

s.t. y_t ⟨f(x_t)⟩_Q ≥ γ for all t = 1,…,T   (C)

for some γ > 0

• the solution Q_ME correctly classifies D
• among all admissible Q, Q_ME has maximum entropy
• maximum entropy → least specific about f

[Figure: the admissible set of distributions; λ = 0 gives the uniform Q0, λ = λ_ME gives Q_ME]

Solution: Q_ME as a projection
• convex problem: Q_ME is unique
• solution: Q_ME(f) ∝ exp{ Σ_{t=1}^{T} λ_t y_t f(x_t) }
• λ_t ≥ 0 are Lagrange multipliers
Finding the solution
• needed: λ_t, t = 1,…,T
• obtained by solving the dual problem

max_λ J(λ) = max_λ [ −log Z_+(λ) − log Z_-(λ) + γ Σ_t λ_t ]

s.t. λ_t ≥ 0 for t = 1,…,T

Algorithm
• iterative ascent on J(λ) until convergence
• derivative: ∂J/∂λ_t = γ − y_t ⟨ log P_+(x_t)/P_-(x_t) + b ⟩_Q(P)
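The ascent can be sketched end-to-end for a finite family of discriminants, where the partition function is a finite sum and Q_ME(f) ∝ Q0(f) exp(Σ_t λ_t y_t f(x_t)) is computable exactly. The toy data, the 16-direction family, the step size, and the margin value are all illustrative assumptions:

```python
import numpy as np

# Toy data: four labeled points in 2-D
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hypothetical finite family f_k(x) = w_k . x, with w_k ranging over
# 16 unit directions and a uniform prior Q0 over them
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
Wc = np.stack([np.cos(angles), np.sin(angles)], axis=1)
F = X @ Wc.T                 # F[t, k] = f_k(x_t)
gamma, eta = 0.5, 0.02

lam = np.zeros(len(y))       # one Lagrange multiplier per constraint
for _ in range(5000):
    # Q_ME(f_k) ∝ Q0(f_k) exp( Σ_t λ_t y_t f_k(x_t) )
    s = (lam * y) @ F
    Q = np.exp(s - s.max())
    Q /= Q.sum()
    # ∂J/∂λ_t = γ - y_t <f(x_t)>_Q ; projection keeps λ_t ≥ 0
    grad = gamma - y * (F @ Q)
    lam = np.maximum(0.0, lam + eta * grad)

margins = y * (F @ Q)        # should end up ≥ γ, up to step-size error
```

After convergence the constraints y_t ⟨f(x_t)⟩_Q ≥ γ hold (this toy problem is feasible), and λ_t stays at zero for examples whose constraint is slack.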

Q_ME as a sparse solution
• classification rule: y(x) = sign ⟨f(x)⟩_Q_ME
• γ is the classification margin
• λ_t > 0 only for y_t ⟨f(x_t)⟩_Q = γ → x_t lies on the margin (a support vector!)

[Figure: Q(f) over F — the prior Q0 is flat, Q_ME concentrates near f_opt]

Q_ME as regularization
• uniform distribution Q0 ↔ λ = 0
• "smoothness" of Q = H(Q)
• Q_ME is the smoothest admissible distribution

Goal of this work
• Define a discriminative criterion for averaging over models ✓

Extensions
• incorporate a prior
• relationship to support vectors
• use generative models
• generalize to other discrimination tasks

[Figure: Q_MRE as the KL-projection of the prior Q0 onto the admissible set, minimizing KL(Q || Q0)]

Priors
• prior Q0(f)
• Minimum Relative Entropy Discrimination:

Q_MRE = argmin_Q KL(Q || Q0)

s.t. y_t ⟨f(x_t)⟩_Q ≥ γ for all t = 1,…,T   (C)

• a prior on γ → learn Q_MRE(f, γ) → soft margin
Soft margins
• average also over the margin γ
• define Q0(f, γ) = Q0(f) Q0(γ)
• constraints: ⟨ y_t f(x_t) − γ ⟩_Q(f,γ) ≥ 0
• learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)

Q0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1

[Figure: the resulting potential as a function of λ]
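As a sanity check on this margin prior, one can verify numerically that Q0(γ) = c exp[c(γ − 1)] integrates to 1 (the support γ ≤ 1 is assumed here so the density normalizes) and that the expected margin is 1 − 1/c, so larger c pushes the prior margin toward 1. The value of c is arbitrary:

```python
import numpy as np
from scipy.integrate import quad

c = 5.0  # penalty parameter, assumed value for illustration

# Margin prior from the slide: Q0(γ) = c exp[c(γ - 1)] for γ ≤ 1
q0 = lambda g: c * np.exp(c * (g - 1.0))

total, _ = quad(q0, -np.inf, 1.0)                   # normalization
mean, _ = quad(lambda g: g * q0(g), -np.inf, 1.0)   # E[γ]
# total ≈ 1, mean = 1 - 1/c = 0.8 for c = 5
```

Negative margins are allowed but exponentially penalized, which is what produces the soft-margin (misclassification-tolerant) behavior.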

Examples: support vector machines
• Theorem: For f(x) = θ·x + b, Q0(θ) = Normal(0, I), and Q0(b) a non-informative prior, the Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λ_t < c and Σ_t λ_t y_t = 0, where

J(λ) = Σ_t [ λ_t + log(1 − λ_t/c) ] − ½ Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s

• separable D → SVM recovered exactly
• inseparable D → SVM recovered with a different misclassification penalty

[Figure: decision boundaries of a linear SVM, a maximum-likelihood Gaussian classifier, and the MRE Gaussian]
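The dual objective in the theorem can be maximized numerically on toy data; this is only a sketch (the data, the value of c, and the use of SciPy's SLSQP solver are assumptions, not the authors' setup):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
c = 10.0                         # assumed penalty parameter
K = (X @ X.T) * np.outer(y, y)

def neg_J(lam):
    # J(λ) = Σ_t [λ_t + log(1 - λ_t/c)] - ½ Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s
    return -(np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * lam @ K @ lam)

res = minimize(neg_J, x0=np.full(len(y), 0.01),
               bounds=[(0.0, 0.999 * c)] * len(y),
               constraints={"type": "eq", "fun": lambda lam: lam @ y},
               method="SLSQP")
lam = res.x
w = (lam * y) @ X                # mean weight vector Σ_t λ_t y_t x_t
```

The resulting mean weight vector plays the role of the SVM weight vector; the log(1 − λ_t/c) barrier is what replaces the hard box constraint of the standard soft-margin SVM dual.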

SVM extensions
• f(x) = log P_+(x)/P_-(x) + b with P_+(x) = Normal(x; m_+, V_+)
• Q(V_+, V_-) = distribution over kernel widths
• example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)

Using generative models
• generative models P_+(x), P_-(x) for y = +1, −1
• f(x) = log P_+(x)/P_-(x) + b
• learn Q_MRE(P_+, P_-, b, γ)
• if Q0(P_+, P_-, b, γ) = Q0(P_+) Q0(P_-) Q0(b) Q0(γ), then Q_MRE(P_+, P_-, b, γ) = Q_MRE(P_+) Q_MRE(P_-) Q_MRE(b) Q_MRE(γ)

(factored prior → factored posterior)
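The log-likelihood-ratio discriminant can be illustrated with fixed class-conditional Gaussians; the parameters below are assumed for illustration (the MED machinery would place a distribution Q over these parameters rather than fixing them):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed class-conditional models (illustrative parameters)
P_pos = multivariate_normal(mean=[1.0, 1.0], cov=np.eye(2))
P_neg = multivariate_normal(mean=[-1.0, -1.0], cov=np.eye(2))
b = 0.0   # bias; would absorb log prior odds for unbalanced classes

def f(x):
    # discriminant f(x) = log P+(x)/P-(x) + b
    return P_pos.logpdf(x) - P_neg.logpdf(x) + b

# points near each mean fall on the corresponding side of f(x) = 0
print(np.sign(f([2.0, 0.5])), np.sign(f([-1.5, -0.5])))
```

With equal spherical covariances this f(x) is linear in x, which is why the Gaussian case connects directly to the SVM example above.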

Examples: other distributions
• multinomial (one discrete variable) ✓
• graphical model (fixed structure, no hidden variables) ✓
• tree graphical model (Q over both structures and parameters) ✓

Tree graphical models
• P(x | E, θ) = P0(x) Π_{uv ∈ E} P_uv(x_u, x_v | θ_uv)
• prior Q0(P) = Q0(E) Q0(θ | E)
• Q0(E) ∝ Π_{uv ∈ E} a_uv
• Q0(θ | E) = conjugate prior over E and θ
• Q_MRE(P) = W_0 Π_{uv} W_uv, which can be integrated analytically

Trees: experiments
• 25 inputs, 400 training examples
• compared with maximum-likelihood trees
• test error: ML 14%, MaxEnt 12.3%
• Classification
• Classification with partially labeled data
• Anomaly detection
Partially labeled data
• Problem: given F, a family of discriminants, and data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }, find

Q(f, γ, y) = argmin_Q KL(Q || Q0)

s.t. ⟨ y_t f(x_t) − γ ⟩_Q ≥ 0 for all t = 1,…,T   (C)

[Figure: results with complete data, with 10% labeled + 90% unlabeled, and with 10% labeled only]

Partially labeled data: experiment
• splice junction classification
• 25 inputs
• T_total = 1000
Anomaly detection
• Problem: given P = { P }, a family of generative models, and data set D = { x_1, …, x_T }, find Q(P, γ) such that

Q(P, γ) = argmin_Q KL(Q || Q0)

s.t. ⟨ log P(x_t) − γ ⟩_Q ≥ 0 for all t = 1,…,T   (C)
Conclusions
• New framework for classification
• Based on regularization in the space of distributions
• Enables use of generative models
• Enables use of priors
• Generalizes to other discrimination tasks