Maximum Entropy Discrimination

Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)

Classification
  • inputs x, class y ∈ {+1, -1}
  • data D = { (x_1, y_1), …, (x_T, y_T) }
  • learn f_opt(x), a discriminant function from F = { f }, a family of discriminants
  • classify: y = sign f_opt(x)
Model averaging
  • many f in F have near-optimal performance
  • instead of choosing f_opt, average over all f in F
  • Q(f) = weight of f
  • y(x) = sign ∫_F Q(f) f(x) df = sign < f(x) >_Q

To specify: F = { f }, the family of discriminant functions.
To learn: Q(f), a distribution over F.
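As a concrete illustration of the averaged classification rule, here is a minimal sketch (not from the slides) that averages a small, hypothetical finite family of linear discriminants under weights Q and classifies by the sign of the expectation:

```python
import numpy as np

def averaged_classify(x, discriminants, weights):
    """Classify x by the sign of the Q-weighted average discriminant value."""
    avg = sum(w * f(x) for f, w in zip(discriminants, weights))
    return np.sign(avg)

# Hypothetical finite family: three linear discriminants f(x) = theta . x + b
thetas = [np.array([1.0, 0.5]), np.array([0.8, 0.7]), np.array([1.2, 0.3])]
biases = [0.0, -0.1, 0.1]
fs = [lambda x, th=th, b=b: float(th @ x + b) for th, b in zip(thetas, biases)]
Q = [0.5, 0.3, 0.2]                      # Q(f): weight of each discriminant

print(averaged_classify(np.array([0.2, -0.4]), fs, Q))   # prints 1.0 or -1.0
```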
Goal of this work
  • Define a discriminative criterion for averaging over models

Advantages

    • can incorporate prior
    • can use generative model
    • computationally feasible
    • generalizes to other discrimination tasks
Maximum Entropy Discrimination

given data set D = { (x_1, y_1), …, (x_T, y_T) }, find

Q_ME = argmax_Q H(Q)

s.t. y_t < f(x_t) >_Q ≥ γ for all t = 1,…,T   (C)

and some γ > 0

  • the solution Q_ME correctly classifies D
  • among all admissible Q, Q_ME has maximum entropy
  • maximum entropy ⇒ least specific about f
Solution: Q_ME as a projection

[Figure: Q_ME is the projection of the uniform distribution Q_0 (λ = 0) onto the set of admissible Q, reached at λ_ME.]

  • convex problem: Q_ME is unique
  • solution:

    Q_ME(f) ∝ exp{ Σ_{t=1}^{T} λ_t y_t f(x_t) }

  • λ_t ≥ 0 are Lagrange multipliers
  • finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
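A brief reasoning step, not spelled out on the slide, shows where the exponential form comes from (written for a discrete family F; the continuous case replaces the sum over f by an integral). Writing the Lagrangian of the maximum-entropy problem and setting its derivative with respect to Q(f) to zero gives

\[
  \mathcal{L}(Q,\lambda,\nu)
  = H(Q)
  + \sum_{t=1}^{T} \lambda_t \bigl( y_t \langle f(x_t) \rangle_Q - \gamma \bigr)
  + \nu \Bigl( \sum_f Q(f) - 1 \Bigr),
  \qquad
  \frac{\partial \mathcal{L}}{\partial Q(f)} = 0
  \;\Longrightarrow\;
  Q_{\mathrm{ME}}(f) \propto \exp\Bigl\{ \sum_{t=1}^{T} \lambda_t\, y_t\, f(x_t) \Bigr\}.
\]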
Finding the solution
  • needed: λ_t, t = 1,…,T
  • by solving the dual problem

    max_λ J(λ) = max_λ [ -log Z_+ - log Z_- - γ Σ_t λ_t ]
    s.t. λ_t ≥ 0 for t = 1,…,T

Algorithm

  • start with λ_t = 0 (uniform distribution)
  • iterative ascent on J(λ) until convergence
  • derivative: ∂J/∂λ_t = y_t < log [ P_+(x_t) / P_-(x_t) ] + b >_{Q(P)} - γ
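The sketch below illustrates this iterative ascent under strong simplifying assumptions: instead of the generative models P_+ and P_-, it uses a small, hypothetical finite family of discriminants so that the partition function and < f(x_t) >_Q can be computed exactly; all data and constants are made up.

```python
import numpy as np

# Hypothetical toy problem: T points, a finite family of n_f discriminants.
rng = np.random.default_rng(0)
T, n_f = 8, 50
X = rng.normal(size=(T, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)     # labels y_t in {+1, -1}
thetas = rng.normal(size=(n_f, 2))
F = np.tanh(X @ thetas.T).T                         # F[i, t] = f_i(x_t), bounded values
Q0 = np.full(n_f, 1.0 / n_f)                        # uniform prior over the family
gamma, lr, steps = 0.1, 0.5, 2000

lam = np.zeros(T)                                   # Lagrange multipliers, start at 0
for _ in range(steps):
    # Q(f_i) ∝ Q0(f_i) exp( Σ_t λ_t y_t f_i(x_t) )
    logw = np.log(Q0) + F @ (lam * y)
    Q = np.exp(logw - logw.max())
    Q /= Q.sum()
    # ascent direction: ∂J/∂λ_t = y_t < f(x_t) >_Q - γ, projected onto λ_t >= 0
    grad = y * (Q @ F) - gamma
    lam = np.maximum(lam + lr * grad, 0.0)

pred = np.sign(Q @ F)                               # averaged classifier on training points
print("training accuracy:", np.mean(pred == y))
```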

Q_ME as a sparse solution
  • classification rule: y(x) = sign < f(x) >_{Q_ME}
  • γ is the classification margin
  • λ_t > 0 only for points with y_t < f(x_t) >_Q = γ, i.e. x_t on the margin (support vector!)
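The sparsity claim is just complementary slackness from the KKT conditions of the convex problem; stated explicitly:

\[
  \lambda_t \bigl( y_t \langle f(x_t) \rangle_{Q_{\mathrm{ME}}} - \gamma \bigr) = 0
  \quad \text{for all } t,
  \qquad \text{so } \lambda_t > 0 \text{ only when the constraint is tight, }
  y_t \langle f(x_t) \rangle_{Q_{\mathrm{ME}}} = \gamma .
\]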

Q_ME as regularization

[Figure: distributions Q(f) over the family F, from the uniform Q_0 to a point mass at f_opt, with Q_ME in between.]

  • uniform distribution Q_0 ↔ λ = 0
  • "smoothness" of Q = H(Q)
  • Q_ME is the smoothest admissible distribution
Goal of this work
  • Define a discriminative criterion for averaging over models ✓

Extensions

    • incorporate prior
    • relationship to support vectors
    • use generative models
    • generalize to other discrimination tasks
Priors

[Figure: Q_MRE is the KL( Q || Q_0 ) projection of the prior Q_0 onto the set of admissible Q.]

  • prior Q_0(f)
  • Minimum Relative Entropy Discrimination

    Q_MRE = argmin_Q KL( Q || Q_0 )
    s.t. y_t < f(x_t) >_Q ≥ γ for all t = 1,…,T   (C)

  • prior on γ ⇒ learn Q_MRE(f, γ) ⇒ soft margin
Soft margins
  • average also over the margin γ
  • define Q_0(f, γ) = Q_0(f) Q_0(γ)
  • constraints: < y_t f(x_t) - γ >_{Q(f,γ)} ≥ 0
  • learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)
  • Q_0(γ) = c exp[ c(γ - 1) ]

[Figure: the resulting margin potential as a function of λ.]
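As a worked step (assuming the margin prior is supported on γ ≤ 1, which is what makes its normalization finite), integrating the margin out of the partition function produces exactly the per-example potential that reappears in the SVM theorem on the next slide:

\[
  -\log \int_{-\infty}^{1} c\, e^{\,c(\gamma - 1)}\, e^{-\lambda_t \gamma}\, d\gamma
  \;=\; -\log \frac{c\, e^{-\lambda_t}}{c - \lambda_t}
  \;=\; \lambda_t + \log\!\Bigl(1 - \frac{\lambda_t}{c}\Bigr),
  \qquad 0 \le \lambda_t < c .
\]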

Examples: support vector machines
  • Theorem

    For f(x) = θ·x + b, Q_0(θ) = Normal(0, I), Q_0(b) = non-informative prior, the Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λ_t ≤ c and Σ_t λ_t y_t = 0, where

    J(λ) = Σ_t [ λ_t + log( 1 - λ_t/c ) ] - 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s

  • separable D ⇒ SVM recovered exactly
  • inseparable D ⇒ SVM recovered with a different misclassification penalty
  • adaptive kernel SVM....
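A minimal sketch of the theorem in use, with made-up two-dimensional data and a hypothetical value of c: maximize J(λ) under the stated constraints with an off-the-shelf solver, then read off the posterior mean of θ (which for the Gaussian prior is Σ_t λ_t y_t x_t).

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two Gaussian blobs, labels in {+1, -1}.
rng = np.random.default_rng(0)
T = 40
X = np.vstack([rng.normal(+1, 1, size=(T // 2, 2)),
               rng.normal(-1, 1, size=(T // 2, 2))])
y = np.hstack([np.ones(T // 2), -np.ones(T // 2)])
K = X @ X.T                                  # linear kernel x_t . x_s
c = 5.0                                      # parameter of the margin prior Q_0(gamma)

def neg_J(lam):
    # J(lam) = sum_t [ lam_t + log(1 - lam_t/c) ] - 1/2 sum_{t,s} lam_t lam_s y_t y_s K_ts
    quad = 0.5 * (lam * y) @ K @ (lam * y)
    return -(np.sum(lam + np.log(1.0 - lam / c)) - quad)

res = minimize(
    neg_J,
    x0=np.full(T, 1e-3),
    bounds=[(0.0, c * (1.0 - 1e-6))] * T,                      # 0 <= lambda_t < c
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],  # sum_t lambda_t y_t = 0
    method="SLSQP",
)
lam = res.x
theta_mean = (lam * y) @ X                   # posterior mean of theta under Q_MRE
print("support vectors (lambda_t > 0):", int(np.sum(lam > 1e-4)))
```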
SVM extensions

[Figure: decision boundaries on the crabs data for a linear SVM, a maximum-likelihood Gaussian classifier, and the MRE Gaussian classifier.]

  • Example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)

    f(x) = log [ P_+(x) / P_-(x) ] + b

    with P_±(x) = Normal( x ; m_±, V_± )  ⇒  quadratic classifier

  • Q( V_+, V_- ) = distribution of kernel width
Using generative models
  • generative models P_+(x), P_-(x) for y = +1, -1
  • f(x) = log [ P_+(x) / P_-(x) ] + b
  • learn Q_MRE( P_+, P_-, b, γ )
  • if Q_0( P_+, P_-, b, γ ) = Q_0(P_+) Q_0(P_-) Q_0(b) Q_0(γ)
    then Q_MRE( P_+, P_-, b, γ ) = Q_MRE(P_+) Q_MRE(P_-) Q_MRE(b) Q_MRE(γ)

    (factored prior ⇒ factored posterior)
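To make the discriminant concrete, here is a minimal sketch on made-up data; it fits the two class models by maximum likelihood (a point estimate standing in for the average over Q_MRE) and classifies by the sign of the log-likelihood ratio plus bias:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical data for the two classes.
rng = np.random.default_rng(1)
Xp = rng.normal(+1.0, 1.0, size=(60, 2))    # examples with y = +1
Xm = rng.normal(-1.0, 1.0, size=(60, 2))    # examples with y = -1

# Gaussian class models P_+ and P_- (ML point estimates, not averaged over Q).
Pp = multivariate_normal(Xp.mean(axis=0), np.cov(Xp, rowvar=False))
Pm = multivariate_normal(Xm.mean(axis=0), np.cov(Xm, rowvar=False))
b = 0.0                                      # bias term

def f(x):
    # f(x) = log P_+(x) - log P_-(x) + b
    return Pp.logpdf(x) - Pm.logpdf(x) + b

x_new = np.array([0.3, -0.2])
print("y =", int(np.sign(f(x_new))))
```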

Examples: other distributions
  • Multinomial (1 discrete variable) ✓
  • Graphical model (fixed structure, no hidden variables) ✓
  • Tree graphical model (Q over structures and parameters) ✓
Tree graphical models
  • P(x | E, θ) = P_0(x) Π_{uv ∈ E} P_uv(x_u, x_v | θ_uv)
  • prior Q_0(P) = Q_0(E) Q_0(θ | E), conjugate over both E and θ
    • Q_0(E) ∝ Π_{uv ∈ E} a_uv
    • Q_0(θ | E) = conjugate prior
  • Q_MRE(P) ∝ W_0 Π_{uv ∈ E} W_uv, which can be integrated analytically
Trees: experiments

[Figure: learned trees; maximum likelihood, err = 14% vs. MaxEnt, err = 12.3%.]

  • splice junction classification task
    • 25 inputs, 400 training examples
    • compared with maximum likelihood trees
Discrimination tasks

[Figure: scatter of labeled (+, -) and unlabeled (x) points.]

  • Classification
  • Classification with partially labeled data
  • Anomaly detection
Partially labeled data
  • Problem: given a family F of discriminants and a data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }, find

    Q(f, γ, y) = argmin_Q KL( Q || Q_0 )
    s.t. < y_t f(x_t) - γ >_Q ≥ 0 for all t = 1,…,T   (C)
Partially labeled data: experiment

[Figure: classification performance with complete data, 10% labeled + 90% unlabeled, and 10% labeled only.]

  • splice junction classification
  • 25 inputs
  • T_total = 1000
Anomaly detection
  • Problem: given a family P = { P } of generative models and a data set D = { x_1, …, x_T }, find

    Q( P, γ ) = argmin_Q KL( Q || Q_0 )
    s.t. < log P(x_t) - γ >_Q ≥ 0 for all t = 1,…,T   (C)
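For intuition only, a minimal sketch of how the learned distribution would be used: a point is flagged as anomalous when its (averaged) log-likelihood falls below the margin. Everything here is hypothetical; a single maximum-likelihood Gaussian and a quantile-based margin stand in for Q(P, γ).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D = rng.normal(0.0, 1.0, size=(200, 2))                           # training data x_1..x_T
P = multivariate_normal(D.mean(axis=0), np.cov(D, rowvar=False))  # stand-in for <P>_Q
gamma = np.quantile(P.logpdf(D), 0.05)                            # hypothetical margin choice

def is_anomalous(x):
    # flag x when log P(x) < gamma
    return bool(P.logpdf(x) < gamma)

print(is_anomalous(np.array([0.1, -0.3])), is_anomalous(np.array([6.0, 6.0])))
```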

Conclusions
  • New framework for classification
  • Based on regularization in the space of distributions
  • Enables use of generative models
  • Enables use of priors
  • Generalizes to other discrimination tasks