
Max-Margin Markov Networks by Ben Taskar, Carlos Guestrin, and Daphne Koller





Presentation Transcript


  1. Max-Margin Markov Networks by Ben Taskar, Carlos Guestrin, and Daphne Koller. Presented by Michael Cafarella, CSE574, May 25, 2005

  2. Introduction • Kernel methods (SVMs) and max-margin training are terrific for classification • But they offer no way to model structure or relations among outputs • Graphical models (Markov networks) can capture complex structure • But they are not trained for discrimination • Maximum Margin Markov (M3) networks capture the advantages of both

  3. Standard classification • We want to learn a classification function of the form below • f(x,y) are the features (basis functions), w are the weights • y is a multi-label classification; the set of possible assignments, Y, is exponential in the number of labels l • So we can't compute the argmax naively, and can't even represent all the features explicitly
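The classifier formula on this slide did not survive the transcript; a hedged reconstruction in the linear form used in the paper, with w the weights and f(x, y) the vector of basis functions:

  \[ h_{\mathbf{w}}(x) \;=\; \arg\max_{y \in \mathcal{Y}} \; \mathbf{w}^{\top} \mathbf{f}(x, y) \]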

  4. Probabilistic classification • A graphical model defines P(Y|X); select the label argmax_y P(y | x) • Exploit sparseness in the dependencies through model design (e.g., OCR characters are independent given their neighbors) • We'll use a pairwise Markov network as the model (sketched below) • The log of each potential function is a weighted sum of basis functions
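The pairwise network itself is omitted from the transcript; a sketch consistent with the slide's description, where V and E are the nodes and edges of the network and the log-potentials are weighted sums of basis functions:

  \[ P(y \mid x) \;\propto\; \prod_{i \in V} \phi_i(x, y_i) \prod_{(i,j) \in E} \phi_{ij}(x, y_i, y_j), \qquad \log \phi_{ij}(x, y_i, y_j) \;=\; \sum_k w_k \, f_k(x, y_i, y_j) \]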

  5. M3N • For regular Markov networks, we train w to maximize the likelihood or conditional likelihood • For M3N, we'll train w to maximize the margin • The main contribution of this paper is how to choose w accordingly

  6. Choosing w • With SVMs, we choose w to maximize the margin γ, subject to the constraints reconstructed below • The constraints ensure that maximizing the margin magnifies the difference between the value of the true label and the best runner-up
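A reconstruction of the omitted margin formulation, hedged up to the exact normalization of w; here t(x) denotes the true labeling of example x:

  \[ \max \; \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1, \qquad \mathbf{w}^{\top} \Delta\mathbf{f}_x(y) \;\ge\; \gamma \quad \forall x,\; \forall y \ne t(x), \qquad \text{where } \Delta\mathbf{f}_x(y) = \mathbf{f}(x, t(x)) - \mathbf{f}(x, y) \]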

  7. Multiple labels • Structured problems have multiple labels, not a single classification • We extend the "margin" to scale with the number of mistaken labels, giving the constraint sketched below
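A reconstruction of the scaled-margin constraint, where Δt_x(y) counts the per-node mistakes in a candidate labeling y:

  \[ \mathbf{w}^{\top} \Delta\mathbf{f}_x(y) \;\ge\; \gamma \, \Delta t_x(y) \quad \forall x, \forall y, \qquad \Delta t_x(y) \;=\; \sum_i \mathbf{1}\big[\, y_i \ne (t(x))_i \,\big] \]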

  8. Convert to an optimization problem • We can eliminate the explicit margin term (fix the margin and minimize ||w|| instead) to obtain a quadratic program • We have to add slack variables, because the data might not be separable • We can now reformulate the whole M3N learning problem as the optimization task sketched below
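The resulting slack-penalized quadratic program, reconstructed from the surrounding definitions (C is the slack penalty, ξ_x the per-example slack):

  \[ \min \; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_x \xi_x \quad \text{s.t.} \quad \mathbf{w}^{\top} \Delta\mathbf{f}_x(y) \;\ge\; \Delta t_x(y) - \xi_x \quad \forall x, \forall y \]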

  9. Grand formulation • The primal is the slack-penalized QP above • The dual is sketched below • Note the extra dual variables; they have no effect on the solution
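The dual the slide refers to, reconstructed; α_x(y) is the dual variable for the margin constraint on example x and labeling y:

  \[ \max \; \sum_{x,\, y} \alpha_x(y)\, \Delta t_x(y) \;-\; \tfrac{1}{2} \Big\| \sum_{x,\, y} \alpha_x(y)\, \Delta\mathbf{f}_x(y) \Big\|^2 \quad \text{s.t.} \quad \sum_y \alpha_x(y) = C \;\; \forall x, \qquad \alpha_x(y) \ge 0 \]

The "extra" dual variable α_x(t(x)) multiplies Δt_x and Δf_x terms that are both zero, which is why it has no effect on the solution.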

  10. Unfortunately, not enough! • The number of constraints in the primal, and of variables in the dual, is exponential in the number of labels l • Let's interpret the dual variables as a density function over y, conditional on x • The dual objective is a function of expectations; we need only the node and edge marginals of the dual variables to compute them • Define the marginal dual variables as in the reconstruction below
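The marginal dual variables, reconstructed from the slide's description as the node and edge marginals of the distribution α_x over labelings:

  \[ \mu_x(y_i) \;=\; \sum_{y' \,:\, y'_i = y_i} \alpha_x(y'), \qquad \mu_x(y_i, y_j) \;=\; \sum_{y' \,:\, y'_i = y_i,\; y'_j = y_j} \alpha_x(y') \]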

  11. Now reformulate the QP • But first, a pause • I can’t copy any more formulae. • I’m sorry. • It’s making me crazy. • I just can’t. • Please refer to the paper, section 4! • OK, now back to work…

  12. Now reformulate the QP (2) • The dual variables must arise from a legal density; that is, they must lie in the marginal polytope • See equation 9! • That means we must enforce consistency between the pairwise and singleton marginal variables (sketched below) • See equation 10! • If the network is not a forest, those constraints aren't enough • We can triangulate and add new variables and constraints • Or, approximate with a relaxation of the polytope, as in belief propagation
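The consistency and normalization constraints the slide points to have roughly this form (a reconstruction; the equation numbers cited above refer to the paper):

  \[ \sum_{y_i} \mu_x(y_i, y_j) \;=\; \mu_x(y_j) \quad \forall (i,j) \in E,\; \forall y_j, \qquad \sum_{y_i} \mu_x(y_i) \;=\; C \quad \forall i, \qquad \mu_x \ge 0 \]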

  13. Experiment #1: Handwriting • 6100 handwritten words, 8 characters long, from 150 subjects • Each character is a 16x8 pixel image • Y is the classified word; each Yi is one of the 26 letters • Logistic regression and CRFs are trained by maximizing the conditional likelihood of the labels given the features • SVMs and M3Ns are trained by margin maximization
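Not from the paper, but a minimal sketch of how prediction works on this chain-structured task once w is learned: the argmax over all 26^8 candidate words is computed by Viterbi decoding over node and edge scores. The score arrays below are hypothetical stand-ins for w·f evaluated on one word:

  import numpy as np

  def viterbi_decode(node_scores, edge_scores):
      """Best labeling of a chain under node + transition scores.

      node_scores: (T, K) array; node_scores[t, k] = score of label k at position t.
      edge_scores: (K, K) array; edge_scores[a, b] = score of transition a -> b.
      Returns the highest-scoring label sequence as a list of ints.
      """
      T, K = node_scores.shape
      best = np.empty((T, K))             # best score of any prefix ending in label k
      back = np.zeros((T, K), dtype=int)  # backpointers
      best[0] = node_scores[0]
      for t in range(1, T):
          cand = best[t - 1][:, None] + edge_scores + node_scores[t][None, :]
          back[t] = cand.argmax(axis=0)
          best[t] = cand.max(axis=0)
      path = [int(best[-1].argmax())]     # trace back from the best final label
      for t in range(T - 1, 0, -1):
          path.append(int(back[t][path[-1]]))
      return path[::-1]

  # Toy usage: an 8-character word over 26 letters, with random scores.
  rng = np.random.default_rng(0)
  print(viterbi_decode(rng.normal(size=(8, 26)), rng.normal(size=(26, 26))))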

  14. Experiment #2: Hypertext • The usual collective classification task • Four CS departments; each page is one of course, faculty, student, project, or other • Each page has web and anchor text, represented as a binary feature vector • Each page also has hyperlinks to other examples • The RMN is trained to maximize the conditional probability of the labels given the text and links • The SVM and M3N are trained with max-margin

  15. Conclusions • M3Ns seem to work well for discriminative tasks • It's nice to be able to borrow theoretical results from SVMs • Not much testing so far • Future work should use more complicated models and problems • Future presentations should be done in LaTeX, not PowerPoint
