
Learning Structured Prediction Models: A Large Margin Approach


Presentation Transcript


  1. Learning Structured Prediction Models: A Large Margin Approach. Ben Taskar (U.C. Berkeley), Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein, Daphne Koller, Chris Manning

  2. “Don’t worry, Howard. The big questions are multiple choice.”

  3. Handwriting recognition. Input x: an image of a handwritten word; output y: the character sequence ("brace"). Sequential structure.

  4. Object segmentation. Input x: a 3D scan of a scene; output y: an object label for each point. Spatial structure.

  5. Natural language parsing. Input x: a sentence ("The screen was a sea of red"); output y: its parse tree. Recursive structure.

  6. Disulfide connectivity prediction. Input x: a protein sequence (RSCCPCYWGGCPWGQNCYPEGCSGPKV); output y: the pairing of its cysteines into disulfide bonds. Combinatorial structure.

  7. Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Geometric View • Structured model polytopes • Linear programming inference • Structured large margin estimation • Min-max formulation • Application: 3D object segmentation • Certificate formulation • Application: disulfide connectivity prediction

  8. Structured models. Mild assumption: the scoring function is a linear combination of features, score(x, y) = wᵀf(x, y), and prediction returns the highest-scoring y in the space of feasible outputs Y(x).
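
To make the "mild assumption" concrete, here is a minimal sketch (not from the talk) of linear structured scoring: the score is wᵀf(x, y) for a joint feature map f, and prediction takes the arg max over a small, explicitly enumerated space of feasible outputs. The feature map, weights, and output set below are toy stand-ins.

```python
import numpy as np

def predict(w, x, feasible_outputs, joint_features):
    """Structured prediction under the linear-score assumption:
    y_hat = argmax_{y in Y(x)} w^T f(x, y)."""
    scores = {y: float(w @ joint_features(x, y)) for y in feasible_outputs}
    return max(scores, key=scores.get), scores

# purely illustrative joint feature map and output space
def toy_features(x, y):
    return np.array([x[0] * (y == "left"), x[1] * (y == "right"), 1.0 * (y == "none")])

w = np.array([2.0, 1.0, 0.5])
print(predict(w, (1, 0), ["left", "right", "none"], toy_features))
```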

  9. Chain Markov Net (aka CRF*). Each label yi ranges over a-z. P(y|x) ∝ ∏i φ(xi, yi) ∏i φ(yi, yi+1), with node potentials φ(xi, yi) = exp{w · f(xi, yi)} and edge potentials φ(yi, yi+1) = exp{w · f(yi, yi+1)}. Example features: edge feature f(y, y') = I(y = 'z', y' = 'a'); node feature f(x, y) = I(xp = 1, y = 'z'). *Lafferty et al. '01

  10. Chain Markov Net (aka CRF*), joint feature view. P(y|x) ∝ ∏i φi(xi, yi) ∏i φi(yi, yi+1) = exp{wᵀf(x, y)}, where weights and features are stacked over positions: w = [… , wj , …], f(x, y) = [… , fj(x, y), …], with φi(xi, yi) = exp{w · fi(xi, yi)} and φi(yi, yi+1) = exp{w · fi(yi, yi+1)}. The stacked features are counts over the sequence, e.g. f(x, y) = #(yi = 'z', yi+1 = 'a') and f(x, y) = #(xp = 1, yi = 'z'). *Lafferty et al. '01
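
As a hedged illustration of this factorization (not code from the talk), the sketch below builds node and edge potentials as exponentiated scores and computes P(y|x) by brute-force normalization over all labelings, which is only feasible at toy sizes; real chains would use dynamic programming. The label set and potential tables are assumptions for the example.

```python
import itertools
import numpy as np

LABELS = "abc"   # small label set keeps brute-force normalization cheap

def chain_crf_prob(phi_node, phi_edge, y):
    """P(y|x) proportional to prod_i phi(x_i, y_i) * prod_i phi(y_i, y_{i+1});
    Z is computed by brute-force enumeration (fine only for toy sizes)."""
    def unnormalized(labels):
        p = 1.0
        for i, yi in enumerate(labels):
            p *= phi_node[i][yi]          # phi(x_i, y_i) = exp(w . f(x_i, y_i))
        for yi, yj in zip(labels, labels[1:]):
            p *= phi_edge[yi][yj]         # phi(y_i, y_{i+1}) = exp(w . f(y_i, y_{i+1}))
        return p
    n = len(phi_node)
    Z = sum(unnormalized(ys) for ys in itertools.product(LABELS, repeat=n))
    return unnormalized(y) / Z

rng = np.random.default_rng(0)
phi_node = [{l: float(np.exp(rng.normal())) for l in LABELS} for _ in range(4)]   # per-position node potentials
phi_edge = {l: {m: float(np.exp(rng.normal())) for m in LABELS} for l in LABELS}  # transition potentials
print(chain_crf_prob(phi_node, phi_edge, ("a", "b", "b", "c")))
```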

  11. Associative Markov Nets. Point features: spin-images, point height. Edge features: length of edge, edge orientation. Node potentials φi(yi) and edge potentials φij(yi, yj), with the "associative" restriction: edge potentials reward neighboring nodes that take the same label.

  12. PCFG. Features count rule applications in the parse tree: #(NP → DT NN), …, #(PP → IN NP), …, #(NN → 'sea').
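
A small sketch (my own, using a hypothetical nested-tuple tree encoding) of how such PCFG features can be computed: walk a parse tree and count each production, yielding features like #(NP → DT NN) and #(NN → 'sea').

```python
from collections import Counter

def rule_counts(tree):
    """Count production applications in a parse tree; trees are nested tuples
    (label, child1, child2, ...) with leaf words as plain strings."""
    counts = Counter()
    if isinstance(tree, str):                 # leaf word
        return counts
    label, *children = tree
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    counts[f"{label} -> {rhs}"] += 1
    for child in children:
        counts += rule_counts(child)
    return counts

# "The screen was a sea of red" with a toy bracketing
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "screen")),
        ("VP", ("VBD", "was"),
               ("NP", ("NP", ("DT", "a"), ("NN", "sea")),
                      ("PP", ("IN", "of"), ("NP", ("NN", "red"))))))
print(rule_counts(tree))
```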

  13. Disulfide bonds: non-bipartite matching. The six cysteines of RSCCPCYWGGCPWGQNCYPEGCSGPKV (numbered 1-6) must be paired with each other, e.g. the pairing 1-6, 2-5, 3-4. Fariselli & Casadio '01, Baldi et al. '04

  14. Scoring function. Each candidate bond between a pair of cysteines in RSCCPCYWGGCPWGQNCYPEGCSGPKV is scored with string features: residues and physical properties around the pair.
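
For intuition (this is not the talk's inference procedure), the sketch below scores connectivity patterns by brute force: it enumerates all perfect matchings of the six cysteines and returns the highest-scoring one under a toy pair score. The learned model would instead score each pair as wᵀf(pair) with residue/physical-property features and use combinatorial inference.

```python
import itertools

def perfect_matchings(nodes):
    """Enumerate all perfect matchings of an even-size node list."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for i, partner in enumerate(rest):
        for match in perfect_matchings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + match

def best_matching(score, nodes):
    """argmax over matchings of the summed pair scores (brute force; fine for ~6 cysteines)."""
    return max(perfect_matchings(nodes), key=lambda m: sum(score(i, j) for i, j in m))

# toy pair score; a learned model would use w^T f(pair) with sequence-window features
seq = "RSCCPCYWGGCPWGQNCYPEGCSGPKV"
cys_positions = [i for i, aa in enumerate(seq) if aa == "C"]    # the 6 cysteines
score = lambda i, j: -abs(cys_positions[i] - cys_positions[j])  # e.g. prefer nearby pairs
print(best_matching(score, list(range(len(cys_positions)))))
```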

  15. Structured models (revisited). Mild assumption: linear scoring function wᵀf(x, y). Another mild assumption: the arg max over the space of feasible outputs can be computed by linear programming.

  16. MAP inference as a linear program. LP inference for: • Chains • Trees • Associative Markov Nets • Bipartite Matchings • …

  17. Markov Net Inference LP. Maximize Σi Σyi μi(yi) θi(yi) + Σij Σyi,yj μij(yi, yj) θij(yi, yj) over marginal variables μ ≥ 0, subject to normalization constraints (Σyi μi(yi) = 1) and agreement constraints (Σyj μij(yi, yj) = μi(yi)). Has integral solutions y for chains and trees; gives an upper bound for general networks.
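
A minimal sketch of this LP for a chain, assuming the standard pairwise relaxation described above (node and edge "marginal" variables with normalization and agreement constraints), solved with scipy's linprog on toy potentials. Per the slide, for chains the LP optimum is integral, so reading off the node marginals recovers the MAP labeling.

```python
import numpy as np
from scipy.optimize import linprog

def chain_map_lp(theta_node, theta_edge):
    """MAP LP relaxation for a chain MRF: variables are node marginals mu_i(y)
    and edge marginals mu_{i,i+1}(y,y'); for chains the optimum is integral."""
    n, K = theta_node.shape
    n_node, n_edge = n * K, (n - 1) * K * K
    node = lambda i, y: i * K + y                              # index of mu_i(y)
    edge = lambda i, y, yp: n_node + i * K * K + y * K + yp    # index of mu_{i,i+1}(y,y')

    c = -np.concatenate([theta_node.ravel(), theta_edge.ravel()])  # maximize -> minimize
    A_eq, b_eq = [], []
    for i in range(n):                                  # normalization: sum_y mu_i(y) = 1
        row = np.zeros(n_node + n_edge)
        row[[node(i, y) for y in range(K)]] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for i in range(n - 1):                              # agreement constraints
        for y in range(K):                              # sum_{y'} mu_ij(y,y') = mu_i(y)
            row = np.zeros(n_node + n_edge)
            row[[edge(i, y, yp) for yp in range(K)]] = 1.0
            row[node(i, y)] = -1.0
            A_eq.append(row); b_eq.append(0.0)
        for yp in range(K):                             # sum_y mu_ij(y,y') = mu_{i+1}(y')
            row = np.zeros(n_node + n_edge)
            row[[edge(i, y, yp) for y in range(K)]] = 1.0
            row[node(i + 1, yp)] = -1.0
            A_eq.append(row); b_eq.append(0.0)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    mu_node = res.x[:n_node].reshape(n, K)
    return mu_node.argmax(axis=1), -res.fun             # integral for chains

# toy potentials: 4 positions, 3 labels
rng = np.random.default_rng(0)
theta_node = rng.normal(size=(4, 3))
theta_edge = rng.normal(size=(3, 3, 3))
print(chain_map_lp(theta_node, theta_edge))
```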

  18. Associative MN Inference LP (with the "associative" restriction) • For K = 2 labels, solutions are always integral (optimal) • For K > 2, within a factor of 2 of optimal • Constraint matrix A is linear in the number of nodes and edges, regardless of tree-width

  19. Other Inference LPs • Context-free parsing • Dynamic programs • Bipartite matching • Network flow • Many other combinatorial problems

  20. Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Geometric View • Structured model polytopes • Linear programming inference • Structured large margin estimation • Min-max formulation • Application: 3D object segmentation • Certificate formulation • Application: disulfide connectivity prediction

  21. Learning w • Training example (x, y*) • Probabilistic approach: maximize the conditional likelihood, log Pw(y*|x) = wᵀf(x, y*) − log Zw(x) • Problem: computing Zw(x) is #P-complete
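
To see why Zw(x) is the bottleneck, here is a naive sketch (with a toy score function standing in for wᵀf(x, y)) that sums over every labeling; the number of terms grows exponentially with the sequence length, and only special structures such as chains and trees admit efficient dynamic-programming alternatives.

```python
import itertools
import numpy as np

def partition_function(score, length, labels):
    """Naive Z_w(x) = sum over all |labels|^length labelings of exp(score(y))."""
    return float(sum(np.exp(score(y)) for y in itertools.product(labels, repeat=length)))

# toy score standing in for w^T f(x, y): rewards repeated adjacent labels
score = lambda y: 0.5 * sum(a == b for a, b in zip(y, y[1:]))
for L in (4, 8, 12):
    print(L, "labelings:", 3 ** L, "Z =", partition_function(score, L, "abc"))
```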

  22. Geometric Example. Training data: an example (x, y*). Goal: learn w such that wᵀf(x, y*) points the "right" way, i.e. scores y* above all competing outputs.

  23. OCR Example • We want: argmax_word wᵀf(x, word) = "brace" • Equivalently: wᵀf(x, "brace") > wᵀf(x, "aaaaa"), wᵀf(x, "brace") > wᵀf(x, "aaaab"), …, wᵀf(x, "brace") > wᵀf(x, "zzzzz") … a lot of constraints!

  24. Large margin estimation • Given a training example (x, y*), we want wᵀf(x, y*) to exceed wᵀf(x, y) for every other y • Maximize the margin by which y* wins • Mistake-weighted margin: require a larger margin against y the more mistakes y contains (ℓ(y*, y) = # of mistakes in y) *Taskar et al. '03

  25. Large margin estimation • Brute force enumeration • Min-max formulation • ‘Plug-in’ linear program for inference
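
As a point of reference for the options above, here is a sketch of the brute-force-enumeration variant on a toy chain: every competing labeling contributes one margin constraint, scaled by its Hamming loss, and the resulting QP is solved with cvxpy (my choice of solver for the example; the talk's contribution is precisely avoiding this enumeration via the min-max formulation). The feature map is a toy stand-in.

```python
import itertools
import numpy as np
import cvxpy as cp

LABELS = "ab"

def features(x, y):
    """Node counts #(x_i=v, y_i=l) plus transition counts #(y_i=l, y_{i+1}=l')."""
    f = np.zeros(2 * len(LABELS) + len(LABELS) ** 2)
    for xi, yi in zip(x, y):
        f[LABELS.index(yi) * 2 + xi] += 1.0
    for yi, yj in zip(y, y[1:]):
        f[2 * len(LABELS) + LABELS.index(yi) * len(LABELS) + LABELS.index(yj)] += 1.0
    return f

def hamming(y, y_star):
    return sum(a != b for a, b in zip(y, y_star))

def max_margin_brute_force(x, y_star):
    """min 1/2 ||w||^2  s.t.  w^T f(x,y*) >= w^T f(x,y) + loss(y,y*)  for all y != y*."""
    w = cp.Variable(2 * len(LABELS) + len(LABELS) ** 2)
    f_star = features(x, y_star)
    constraints = [
        w @ (f_star - features(x, "".join(y))) >= hamming(y, y_star)
        for y in itertools.product(LABELS, repeat=len(x)) if "".join(y) != y_star
    ]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value

print(max_margin_brute_force((0, 1, 1, 0), "abba"))
```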

  26. Min-max formulation. The margin constraints against all y are equivalent to a single constraint, wᵀf(x, y*) ≥ max_y [wᵀf(x, y) + ℓ(y*, y)]. Assume a linear loss (Hamming): the loss decomposes like the features, so the inner max is itself an inference problem solvable by the same LP inference.
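
Because the Hamming loss decomposes over positions, the loss-augmented maximization max_y [wᵀf(x, y) + ℓ(y*, y)] can be solved by the same machinery as plain MAP. Below is a hedged sketch for a chain that folds the per-position loss into the node scores and runs standard Viterbi (the talk plugs in an LP instead; the score tables here are toys).

```python
import numpy as np

def loss_augmented_viterbi(theta_node, theta_edge, y_star):
    """Compute argmax_y [ score(y) + Hamming(y, y*) ] for a chain by folding the
    per-position loss into the node scores and running standard Viterbi."""
    n, K = theta_node.shape
    aug = theta_node + 1.0                      # +1 for every position ...
    aug[np.arange(n), y_star] -= 1.0            # ... except where y matches y*
    best = aug[0].copy()
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        trans = best[:, None] + theta_edge      # best[y_prev] + theta_edge[y_prev, y]
        back[i] = trans.argmax(axis=0)
        best = trans.max(axis=0) + aug[i]
    y = [int(best.argmax())]
    for i in range(n - 1, 0, -1):               # backtrack
        y.append(int(back[i][y[-1]]))
    return list(reversed(y)), float(best.max())

rng = np.random.default_rng(1)
theta_node = rng.normal(size=(5, 3))            # toy scores; real ones would be w . f(...)
theta_edge = rng.normal(size=(3, 3))
y_star = np.array([0, 1, 1, 2, 0])
print(loss_augmented_viterbi(theta_node, theta_edge, y_star))
```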

  27. Min-max formulation. By strong LP duality, the inner maximization (an LP) can be replaced by its dual, a minimization over dual variables z; the whole problem then becomes a single minimization jointly over w and z.

  28. Min-max formulation • Formulation produces compact QP for • Low-treewidth Markov networks • Associative Markov networks • Context free grammars • Bipartite matchings • Any problem with compact LP inference

  29. 3D Mapping. Data provided by Michael Montemerlo & Sebastian Thrun; collected with a laser range finder, GPS, and IMU. Labels: ground, building, tree, shrub. Training: 30 thousand points; testing: 3 million points.

  30. Segmentation results on 180K hand-labeled test points.

  31. Fly-through

  32. Certificate formulation • Non-bipartite matchings: O(n3) combinatorial algorithm, but no polynomial-size LP known • Spanning trees: no polynomial-size LP known, but a simple certificate of optimality exists • Intuition: verifying optimality is easier than optimizing • Use a compact optimality condition of y* with respect to competing edges (ij, kl)

  33. Certificate for non-bipartite matching. Alternating cycle: every other edge is in the matching. Augmenting alternating cycle: the score of the edges not in the matching is greater than the score of the edges in the matching. Negate the scores of edges not in the matching; then an augmenting alternating cycle becomes a negative-length alternating cycle. The matching is optimal if and only if there are no negative alternating cycles. Edmonds '65

  34. Certificate for non-bipartite matching. Pick any node r as root and let dj be the length of the shortest alternating path from r to j. Triangle inequality: dj ≤ di + length(i, j). Theorem: there is no negative-length cycle if and only if such a distance function d exists. This can be expressed as linear constraints: O(n) distance variables, O(n2) constraints.
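
Setting aside the alternating-path bookkeeping, the principle behind this certificate is the classical one: a distance function satisfying the triangle inequalities dj ≤ di + length(i, j) exists exactly when no negative-length cycle is reachable, and Bellman-Ford either constructs such a d or exposes a violated inequality. A small sketch on a toy directed graph (my illustration, not the talk's matching-specific construction):

```python
def distance_certificate(n, edges, root=0):
    """Bellman-Ford from `root`: returns a distance function d satisfying
    d[j] <= d[i] + length(i, j) for every edge (a certificate that no negative
    cycle is reachable), or None if a negative-length cycle exists."""
    INF = float("inf")
    d = [INF] * n
    d[root] = 0.0
    for _ in range(n - 1):                     # relax all edges n-1 times
        for i, j, length in edges:
            if d[i] + length < d[j]:
                d[j] = d[i] + length
    for i, j, length in edges:                 # any remaining violation => negative cycle
        if d[i] + length < d[j]:
            return None
    return d

# toy directed graph with one negative edge but no negative cycle
edges = [(0, 1, 2.0), (1, 2, -1.0), (2, 3, 3.0), (0, 3, 5.0), (3, 1, 4.0)]
print(distance_certificate(4, edges))          # a valid distance function
print(distance_certificate(3, [(0, 1, 1.0), (1, 2, -2.0), (2, 0, 0.5)]))  # negative cycle -> None
```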

  35. Certificate formulation • Formulation produces compact QP for • Spanning trees • Non-bipartite matchings • Any problem with compact optimality condition

  36. Disulfide connectivity prediction • Dataset • Swiss-Prot protein database, release 39 • Fariselli & Casadio '01, Baldi et al. '04 • 446 sequences (4-50 cysteines) • Features: window profiles (size 9) around each pair • Two modes: bonded state known/unknown • Comparison: • SVM-trained weights (ignoring constraints during learning) • DAG Recursive Neural Network [Baldi et al. '04] • Our model: • Max-margin matching using RBF kernel • Training: off-the-shelf LP/QP solver CPLEX (~1 hour)

  37. Known bonded state. Precision / accuracy results, 4-fold cross-validation.

  38. Unknown bonded state. Precision / recall / accuracy results, 4-fold cross-validation.

  39. Formulation summary • Brute force enumeration • Min-max formulation • ‘Plug-in’ convex program for inference • Certificate formulation • Directly guarantee optimality of y*

  40. Estimation vs. normalization. Generative models estimate P(x, y); discriminative models estimate P(y|x) or a margin. Normalization can be local, P(z) = ∏i P(zi | parents), or global, P(z) = 1/Z ∏c φc(zc). HMMs and PCFGs are generative and locally normalized; MRFs are generative and globally normalized; MEMMs are discriminative and local; CRFs (and the margin-based models here) are discriminative and global.

  41. Omissions • Formulation details • Kernels • Multiple examples • Slacks for non-separable case • Approximate learning of intractable models • General MRFs • Learning to cluster • Structured generalization bounds • Scalable algorithms (no QP solver needed) • Structured SMO (works for chains, trees) • Structured EG (works for chains, trees) • Structured PG (works for chains, matchings, AMNs, …)

  42. Current Work • Learning approximate energy functions • Protein folding • Physical processes • Semi-supervised learning • Hidden variables • Mixing labeled and unlabeled data • Discriminative structure learning • Using sparsifying priors

  43. Conclusion • Two general techniques for structured large-margin estimation • Exact, compact, convex formulations • Allow efficient use of kernels • Tractable when other estimation methods are not • Structured generalization bounds • Efficient learning algorithms • Empirical success on many domains • Papers at http://www.cs.berkeley.edu/~taskar

  44. Duals and Kernels • Kernel trick works! • Scoring functions (log-potentials) can use kernels • Same for certificate formulation

  45. Handwriting Recognition. Data: words of length ~8 characters; each letter is a 16x8 pixel image; 10-fold train/test splits of 5000/50000 letters (600/6000 words). Models: multiclass SVMs (*Crammer & Singer '01), CRFs, M^3 nets; features: raw pixels, quadratic kernel, cubic kernel. Results (test error, average per-character): M^3 nets give a 45% error reduction over linear CRFs and a 33% error reduction over multiclass SVMs.
