
Maximum Entropy Discrimination



Presentation Transcript


  1. Maximum Entropy Discrimination Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)

  2. Classification • inputs x, class y ∈ {+1, −1} • data D = { (x_1, y_1), …, (x_T, y_T) } • learn f_opt(x), a discriminant function from F = {f}, a family of discriminants • classify y = sign f_opt(x)

  3. Model averaging • many f in F achieve near-optimal performance • instead of choosing f_opt, average over all f in F • Q(f) = weight of f • y(x) = sign ∫_F Q(f) f(x) df = sign ⟨ f(x) ⟩_Q • to specify: F = {f}, the family of discriminant functions • to learn: Q(f), a distribution over F

  4. Goal of this work • define a discriminative criterion for averaging over models Advantages • can incorporate priors • can use generative models • computationally feasible • generalizes to other discrimination tasks

  5. Maximum Entropy Discrimination • given data set D = { (x_1, y_1), …, (x_T, y_T) } • find Q_ME = argmax_Q H(Q) s.t. y_t ⟨ f(x_t) ⟩_Q ≥ γ for all t = 1, …, T (C) and some γ > 0 • the solution Q_ME correctly classifies D • among all admissible Q, Q_ME has maximum entropy • maximum entropy = least specific about f

  6. Solution: Q_ME as a projection [figure: path from the uniform distribution Q0 (λ = 0) to Q_ME (λ_ME) on the boundary of the admissible set of Q] • convex problem: Q_ME is unique • solution: Q_ME(f) ∝ exp{ Σ_{t=1}^T λ_t y_t f(x_t) } • λ_t ≥ 0 are Lagrange multipliers • finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
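
The exponential form of Q_ME follows from a standard maximum-entropy argument; a short sketch (not on the original slide) of the step the slide leaves implicit:

```latex
% Lagrangian for max H(Q) under the constraints (C), with multipliers \lambda_t \ge 0
% (plus a multiplier for the normalization \int Q(f)\,df = 1):
\mathcal{L}(Q,\lambda) = -\int Q(f)\log Q(f)\,df
  + \sum_{t=1}^{T} \lambda_t \left( y_t \int Q(f) f(x_t)\,df - \gamma \right)

% Setting the functional derivative with respect to Q(f) to zero:
-\log Q(f) - 1 + \sum_{t=1}^{T} \lambda_t y_t f(x_t) + \mathrm{const} = 0
\quad\Longrightarrow\quad
Q_{ME}(f) \propto \exp\Big\{ \sum_{t=1}^{T} \lambda_t y_t f(x_t) \Big\}
```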

  7. Finding the solution • need λ_t, t = 1, …, T • obtained by solving the dual problem: max_λ J(λ) = max_λ [ −log Z_+ − log Z_− − γ Σ_t λ_t ] s.t. λ_t ≥ 0 for t = 1, …, T Algorithm • start with λ_t = 0 (uniform distribution) • iterative ascent on J(λ) until convergence • derivative: ∂J/∂λ_t = y_t ⟨ log [ P_+(x_t) / P_−(x_t) ] + b ⟩_{Q(P)} − γ
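
A minimal sketch of this ascent in Python (not from the slides): the gradient routine grad_J is assumed to be supplied by whichever family of discriminants is in use, and the step size and stopping rule are illustrative.

```python
import numpy as np

def solve_dual(grad_J, T, lr=0.01, tol=1e-6, max_iter=10000):
    """Projected gradient ascent on the MED dual J(lambda).

    grad_J: callable returning dJ/dlambda as a length-T array; per the
            slide, component t is y_t <f(x_t)>_Q - gamma, so the ascent
            follows the gradient of the unsatisfied constraints.
    Starts at lambda = 0 (the uniform distribution) and projects onto
    the feasible set lambda_t >= 0 after each step.
    """
    lam = np.zeros(T)
    for _ in range(max_iter):
        new_lam = np.maximum(0.0, lam + lr * grad_J(lam))  # step + project
        if np.max(np.abs(new_lam - lam)) < tol:            # converged
            return new_lam
        lam = new_lam
    return lam
```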

  8. Q_ME as a sparse solution • classification rule: y(x) = sign ⟨ f(x) ⟩_{Q_ME} • γ is the classification margin • λ_t > 0 only for y_t ⟨ f(x_t) ⟩_Q = γ, i.e. x_t on the margin (a support vector!)

  9. Q_ME as regularization [figure: distributions Q(f) over f, comparing a point estimate at f_opt, the uniform Q0, and Q_ME] • uniform distribution Q0 ↔ λ = 0 • "smoothness" of Q = H(Q) • Q_ME is the smoothest admissible distribution

  10. Goal of this work • define a discriminative criterion for averaging over models ✓ Extensions • incorporate priors • relationship to support vectors • use generative models • generalize to other discrimination tasks

  11. Priors [figure: KL(Q || Q0) projection of the prior Q0 onto the admissible set, giving Q_MRE] • prior Q0(f) • Minimum Relative Entropy Discrimination: Q_MRE = argmin_Q KL(Q || Q0) s.t. y_t ⟨ f(x_t) ⟩_Q ≥ γ for all t = 1, …, T (C) • a prior on γ ⇒ learn Q_MRE(f, γ) ⇒ soft margin

  12. Soft margins • average also over the margin γ • define Q0(f, γ) = Q0(f) Q0(γ) • constraints: ⟨ y_t f(x_t) − γ ⟩_{Q(f,γ)} ≥ 0 • learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ) • margin prior: Q0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1 [figure: the resulting potential as a function of λ]
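
Integrating out the margin under this prior (a step between this slide and the next) produces the per-example penalty that appears in J(λ) on the following slide; a sketch of the computation:

```latex
% Contribution of one margin variable to the dual partition function,
% with prior Q_0(\gamma) = c\, e^{c(\gamma - 1)} on \gamma \le 1:
\int_{-\infty}^{1} c\, e^{c(\gamma-1)}\, e^{-\lambda \gamma}\, d\gamma
  = c\, e^{-c} \cdot \frac{e^{\,c-\lambda}}{c-\lambda}
  = \frac{e^{-\lambda}}{1 - \lambda/c}
  \qquad (0 \le \lambda < c)

% Taking -log of this factor gives the per-example term in J(\lambda):
-\log\!\left[ \frac{e^{-\lambda}}{1 - \lambda/c} \right]
  = \lambda + \log(1 - \lambda/c)
```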

  13. Example: support vector machines • Theorem: for f(x) = θ·x + b, Q0(θ) = Normal(0, I), Q0(b) = non-informative prior, the Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λ_t < c and Σ_t λ_t y_t = 0, where J(λ) = Σ_t [ λ_t + log(1 − λ_t/c) ] − ½ Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s • separable D ⇒ SVM recovered exactly • inseparable D ⇒ SVM recovered with a different misclassification penalty • adaptive-kernel SVMs...
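
A minimal numerical sketch of the theorem (not from the slides): maximizing J(λ) on a toy 2-D data set with a generic constrained optimizer. The data, the scipy SLSQP solver, and the recovery of θ as Σ_t λ_t y_t x_t (the posterior mean under Q0(θ) = Normal(0, I)) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
T, c = len(y), 10.0
K = (y[:, None] * y[None, :]) * (X @ X.T)  # y_t y_s x_t.x_s

def neg_J(lam):
    # J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 1/2 lam' K lam
    return -(np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * lam @ K @ lam)

res = minimize(
    neg_J,
    x0=np.full(T, 1e-3),                  # start near lambda = 0
    method="SLSQP",
    bounds=[(0.0, c * (1 - 1e-9))] * T,   # 0 <= lambda_t < c
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],  # sum lam_t y_t = 0
)
lam = res.x
theta = (lam * y) @ X  # posterior mean of theta; classify y = sign(theta.x + b)
print("lambda:", np.round(lam, 3), "theta:", np.round(theta, 3))
```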

  14. SVM extensions [figure: test performance comparing Linear SVM, Max Likelihood Gaussian, and MRE Gaussian] • Example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120) • f(x) = log [ P_+(x) / P_−(x) ] + b with P_±(x) = Normal(x; m_±, V_±) ⇒ quadratic classifier • Q(V_+, V_−) = distribution over kernel widths

  15. Using generative models • generative models P_+(x), P_−(x) for y = +1, −1 • f(x) = log [ P_+(x) / P_−(x) ] + b • learn Q_MRE(P_+, P_−, b, γ) • if Q0(P_+, P_−, b, γ) = Q0(P_+) Q0(P_−) Q0(b) Q0(γ) • then Q_MRE(P_+, P_−, b, γ) = Q_MRE(P_+) Q_MRE(P_−) Q_MRE(b) Q_MRE(γ) (factored prior ⇒ factored posterior)
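
For contrast with the MRE treatment, a sketch of the plain Max Likelihood plug-in version of this discriminant (the "Max Likelihood Gaussian" baseline from the crabs comparison): fit each class-conditional Gaussian by ML and classify with y = sign f(x). The data and helper names are illustrative; MED would instead average f over Q_MRE(P_+, P_−, b).

```python
import numpy as np

def fit_gaussian(X):
    """ML estimates of mean and covariance for one class."""
    m = X.mean(axis=0)
    V = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # small ridge
    return m, V

def log_normal(x, m, V):
    d = x - m
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (d @ np.linalg.solve(V, d) + logdet + len(m) * np.log(2 * np.pi))

def discriminant(x, pos, neg, b=0.0):
    # f(x) = log P+(x) - log P-(x) + b  (a quadratic decision boundary)
    return log_normal(x, *pos) - log_normal(x, *neg) + b

# Usage: classify y = sign f(x) on synthetic 5-input data
rng = np.random.default_rng(0)
Xp = rng.normal(loc=+1.0, size=(40, 5))   # class +1 samples (synthetic)
Xn = rng.normal(loc=-1.0, size=(40, 5))   # class -1 samples (synthetic)
pos, neg = fit_gaussian(Xp), fit_gaussian(Xn)
x = rng.normal(size=5)
print("predicted class:", np.sign(discriminant(x, pos, neg)))
```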

  16. Examples: other distributions • multinomial (1 discrete variable) ✓ • graphical model (fixed structure, no hidden variables) ✓ • tree graphical model (Q over structures and parameters) ✓

  17. Tree graphical models [figure: example tree structures E over the variables] • P(x | E, θ) = P0(x) Π_{uv∈E} P_uv(x_u, x_v | θ_uv) • prior: Q0(P) = Q0(E) Q0(θ | E) • Q0(E) ∝ Π_{uv∈E} a_uv • Q0(θ | E) = conjugate prior, so Q0(P) is a conjugate prior over both E and θ • Q_MRE(P) ∝ W0 Π_{uv∈E} W_uv, which can be integrated analytically
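
The "integrated analytically" claim for a Q over structures typically rests on the Matrix Tree Theorem, which turns a sum over all spanning trees into a determinant; a sketch of that identity (an assumption about the machinery, not stated on the slide):

```latex
% Matrix Tree Theorem: for nonnegative edge weights \beta_{uv},
\sum_{E \,\in\, \text{spanning trees}} \; \prod_{uv \in E} \beta_{uv}
  \;=\; \det\!\big( [L(\beta)]_{\setminus 1} \big),
% where L(\beta) is the weighted graph Laplacian
L(\beta)_{uv} =
  \begin{cases}
    \sum_{w \neq u} \beta_{uw} & u = v \\
    -\beta_{uv}                & u \neq v
  \end{cases}
% and [\,\cdot\,]_{\setminus 1} deletes one row and the matching column.
% Sums of the form \sum_E \prod_{uv \in E} a_{uv} W_{uv} are therefore tractable.
```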

  18. Trees: experiments [figure: ML trees, err = 14%; MaxEnt trees, err = 12.3%] • splice junction classification task • 25 inputs, 400 training examples • compared with Max Likelihood trees

  19. Trees: experiments (contd.) [figure: learned tree edge weights]

  20. Discrimination tasks [figure: example labeled, partially labeled, and unlabeled data sets] • classification • classification with partially labeled data • anomaly detection

  21. Partially labeled data • Problem: given F, a family of discriminants, and data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N } • find Q(f, γ, y) = argmin_Q KL(Q || Q0) s.t. ⟨ y_t f(x_t) − γ ⟩_Q ≥ 0 for all t = 1, …, N (C) • the unknown labels y_{T+1}, …, y_N are averaged over as part of Q

  22. Partially labeled data: experiment [figure: performance with complete data vs. 10% labeled + 90% unlabeled vs. 10% labeled only] • splice junction classification • 25 inputs • T_total = 1000

  23. Anomaly detection • Problem: given P = {P}, a family of generative models, and data set D = { x_1, …, x_T } • find Q(P, γ) = argmin_Q KL(Q || Q0) s.t. ⟨ log P(x_t) − γ ⟩_Q ≥ 0 for all t = 1, …, T (C)
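
As a concrete point of comparison for the experiments that follow, a sketch of the Max Likelihood baseline: fit one generative model to D and flag points whose log-likelihood falls below a margin γ. The Gaussian model, synthetic data, and percentile threshold are illustrative assumptions; MED would instead average log P(x) over Q(P, γ).

```python
import numpy as np

def fit_ml_gaussian(D):
    """Max Likelihood baseline: a single Gaussian fit to the training data."""
    m = D.mean(axis=0)
    V = np.cov(D, rowvar=False) + 1e-6 * np.eye(D.shape[1])  # small ridge
    return m, V

def log_lik(x, m, V):
    d = x - m
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (d @ np.linalg.solve(V, d) + logdet + len(m) * np.log(2 * np.pi))

def is_anomaly(x, m, V, gamma):
    # flag x as anomalous when log P(x) falls below the margin gamma
    return log_lik(x, m, V) < gamma

rng = np.random.default_rng(1)
D = rng.normal(size=(200, 5))                 # "normal" training data (synthetic)
m, V = fit_ml_gaussian(D)
gamma = np.quantile([log_lik(x, m, V) for x in D], 0.05)  # 5th-percentile threshold
print(is_anomaly(rng.normal(loc=4.0, size=5), m, V, gamma))  # distant point -> likely True
```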

  24. Anomaly detection: experiments [figure: Max Likelihood vs. MaxEnt detection results]

  25. Anomaly detection: experiments (contd.) [figure: MaxEnt vs. Max Likelihood detection results]

  26. Conclusions • New framework for classification • Based on regularization in the space of distributions • Enables use of generative models • Enables use of priors • Generalizes to other discrimination tasks
