
Learning Structured Prediction Models: A Large Margin Approach

Ben Taskar

U.C. Berkeley

Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein,
Daphne Koller, Chris Manning



Handwriting recognition

Input x: a sequence of character images; output y: the word they spell, here “brace”.

Sequential structure


Object segmentation

Input x: a 3D scan; output y: an object label for each point.

Spatial structure


Natural language parsing

Input x: a sentence, “The screen was a sea of red”; output y: its parse tree.

Recursive structure


Disulfide connectivity prediction

Input x: an amino-acid sequence, e.g. RSCCPCYWGGCPWGQNCYPEGCSGPKV; output y: the pairing of its cysteines into disulfide bonds.

Combinatorial structure


Outline

  • Structured prediction models

    • Sequences (CRFs)

    • Trees (CFGs)

    • Associative Markov networks (Special MRFs)

    • Matchings

  • Geometric View

    • Structured model polytopes

    • Linear programming inference

  • Structured large margin estimation

    • Min-max formulation

    • Application: 3D object segmentation

    • Certificate formulation

    • Application: disulfide connectivity prediction


Structured models

Mild assumption: the scoring function is a linear combination of features,

s(x, y) = wᵀ f(x, y),

maximized over the space of feasible outputs Y.
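A minimal sketch of this setup in Python; the feature map `f` and the explicit list of feasible outputs are placeholders of this sketch (real structured models never enumerate Y):

```python
import numpy as np

def score(w, f, x, y):
    """Linear scoring function s(x, y) = w . f(x, y)."""
    return w @ f(x, y)

def predict(w, f, x, feasible_outputs):
    """MAP prediction: the highest-scoring feasible output.

    Brute-force argmax for illustration; structured models replace
    this loop with dynamic programming, LPs, or matching algorithms.
    """
    return max(feasible_outputs, key=lambda y: score(w, f, x, y))
```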


Chain Markov Net (aka CRF*)

A chain over labels y₁ … yₙ (each ranging over a–z) conditioned on the observed sequence x:

P(y|x) ∝ ∏ᵢ φᵢ(xᵢ, yᵢ) ∏ᵢ ψᵢ(yᵢ, yᵢ₊₁)

φᵢ(xᵢ, yᵢ) = exp{ w · f(xᵢ, yᵢ) }
ψᵢ(yᵢ, yᵢ₊₁) = exp{ w · f(yᵢ, yᵢ₊₁) }

Example features:
f(y, y′) = 1(y = ‘z’, y′ = ‘a’)
f(x, y) = 1(xₚ = 1, y = ‘z’)

*Lafferty et al. ’01


Chain Markov Net (aka CRF*)

The same model in log-linear form:

P(y|x) ∝ ∏ᵢ φᵢ(xᵢ, yᵢ) ∏ᵢ ψᵢ(yᵢ, yᵢ₊₁) = exp{ wᵀ f(x, y) }

where the weights and features are concatenated across parts,
w = [… , w_node , … , w_edge , …]
f(x, y) = [… , f_node(x, y) , … , f_edge(x, y) , …]

and the global features are counts, e.g.:
f(x, y) = #(yᵢ = ‘z’, yᵢ₊₁ = ‘a’)
f(x, y) = #(xᵢ,ₚ = 1, yᵢ = ‘z’)

*Lafferty et al. ’01
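MAP inference in this chain model is Viterbi dynamic programming. A minimal sketch (the score arrays stand in for the w·f terms and are naming assumptions of this sketch, not the talk's notation):

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """MAP assignment for a chain model.

    node_scores: (n, k) array, node_scores[i, y]  = w . f(x_i, y)
    edge_scores: (k, k) array, edge_scores[y, y2] = w . f(y, y2)
    Returns the highest-scoring label sequence as a list of ints.
    """
    n, k = node_scores.shape
    best = node_scores[0].copy()            # best score ending in each label
    back = np.zeros((n, k), dtype=int)      # backpointers
    for i in range(1, n):
        cand = best[:, None] + edge_scores + node_scores[i]
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    labels = [int(best.argmax())]
    for i in range(n - 1, 0, -1):
        labels.append(int(back[i, labels[-1]]))
    return labels[::-1]
```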


Associative Markov Nets

Point features (spin-images, point height) → node potentials φᵢ(yᵢ)
Edge features (length of edge, edge orientation) → edge potentials φᵢⱼ(yᵢ, yⱼ)

“Associative” restriction: edge potentials only reward neighboring nodes that take the same label.

PCFG

Features are production counts, e.g.:
#(NP → DT NN)
#(PP → IN NP)
#(NN → ‘sea’)
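A sketch of such count features over a parse tree; the nested-tuple tree encoding is a hypothetical choice for this sketch, not the talk's representation:

```python
from collections import Counter

def production_counts(tree):
    """Count grammar productions in a parse tree.

    Trees are nested tuples, e.g. ("NP", ("DT", "the"), ("NN", "sea")).
    """
    counts = Counter()
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        counts[f"{label} -> '{children[0]}'"] += 1        # e.g. NN -> 'sea'
    else:
        rhs = " ".join(child[0] for child in children)
        counts[f"{label} -> {rhs}"] += 1                  # e.g. NP -> DT NN
        for child in children:
            counts += production_counts(child)
    return counts
```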


Disulfide bonds: non-bipartite matching

RSCCPCYWGGCPWGQNCYPEGCSGPKV
(cysteines numbered 1–6 along the sequence)

[Figure: the six cysteines as nodes of a graph; a connectivity pattern is a perfect matching on them.]

Fariselli & Casadio ’01, Baldi et al. ’04


Scoring function

RSCCPCYWGGCPWGQNCYPEGCSGPKV
(cysteines 1–6)

Each candidate bond between two cysteines is scored from string features around the pair: residues, physical properties.


Structured models

Mild assumption: the scoring function is linear, s(x, y) = wᵀ f(x, y).

Another mild assumption: maximizing the scoring function over the space of feasible outputs is solvable by linear programming.


MAP inference as a linear program

  • LP inference for

    • Chains

    • Trees

    • Associative Markov Nets

    • Bipartite Matchings


Markov Net Inference LP

Relax MAP to an LP over node marginals μᵢ(y) and edge marginals μᵢⱼ(y, y′):

maximize  Σᵢ,y μᵢ(y) θᵢ(y) + Σᵢⱼ,yy′ μᵢⱼ(y, y′) θᵢⱼ(y, y′)
subject to  μ ≥ 0,  Σ_y μᵢ(y) = 1,  Σ_y′ μᵢⱼ(y, y′) = μᵢ(y)

Has integral solutions y for chains, trees.
Gives an upper bound for general networks.
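A sketch of this LP for a chain, built with scipy.optimize.linprog (dense matrices, illustrative only; the variable names are mine). For chains the optimum is integral, so the node marginals round to the MAP labels:

```python
import numpy as np
from scipy.optimize import linprog

def chain_map_lp(node_scores, edge_scores):
    """LP relaxation of chain MAP over node and edge marginals.

    node_scores: (n, k); edge_scores: (n-1, k, k).
    """
    n, k = node_scores.shape
    n_node, n_edge = n * k, (n - 1) * k * k
    c = -np.concatenate([node_scores.ravel(), edge_scores.ravel()])  # linprog minimizes

    def nv(i, y): return i * k + y                         # node-variable index
    def ev(i, y, y2): return n_node + (i * k + y) * k + y2  # edge-variable index

    rows = []
    for i in range(n):                                     # sum_y mu_i(y) = 1
        r = np.zeros(n_node + n_edge)
        r[[nv(i, y) for y in range(k)]] = 1
        rows.append((r, 1.0))
    for i in range(n - 1):                                 # marginalization
        for y in range(k):                                 # sum_y2 mu_i(y, y2) = mu_i(y)
            r = np.zeros(n_node + n_edge)
            r[[ev(i, y, y2) for y2 in range(k)]] = 1
            r[nv(i, y)] = -1
            rows.append((r, 0.0))
        for y2 in range(k):                                # sum_y mu_i(y, y2) = mu_{i+1}(y2)
            r = np.zeros(n_node + n_edge)
            r[[ev(i, y, y2) for y in range(k)]] = 1
            r[nv(i + 1, y2)] = -1
            rows.append((r, 0.0))
    A_eq = np.array([r for r, _ in rows])
    b_eq = np.array([b for _, b in rows])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x[:n_node].reshape(n, k).argmax(axis=1)     # integral for chains
```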


Associative MN Inference LP

(under the “associative” restriction on edge potentials)

  • For K = 2 labels, solutions are always integral (optimal)

  • For K > 2, within a factor of 2 of optimal

  • Constraint matrix A is linear in the number of nodes and edges, regardless of the tree-width


Other Inference LPs

  • Context-free parsing

  • Dynamic programs

  • Bipartite matching

  • Network flow

  • Many other combinatorial problems


Outline

  • Structured prediction models

    • Sequences (CRFs)

    • Trees (CFGs)

    • Associative Markov networks (Special MRFs)

    • Matchings

  • Geometric View

    • Structured model polytopes

    • Linear programming inference

  • Structured large margin estimation

    • Min-max formulation

    • Application: 3D object segmentation

    • Certificate formulation

    • Application: disulfide connectivity prediction


Learning w

  • Training example (x, y*)

  • Probabilistic approach: maximize conditional likelihood

    log P_w(y* | x) = wᵀ f(x, y*) − log Z_w(x)

  • Problem: computing the partition function Z_w(x) is #P-complete


Geometric Example

Training data: examples (x, y*).

Goal: learn w s.t. wᵀ f(x, y*) points the “right” way, scoring the true output above all alternatives.


OCR Example

  • We want:

    argmax_word wᵀ f(x, word) = “brace”

  • Equivalently:

    wᵀ f(x, “brace”) > wᵀ f(x, “aaaaa”)

    wᵀ f(x, “brace”) > wᵀ f(x, “aaaab”)

    …

    wᵀ f(x, “brace”) > wᵀ f(x, “zzzzz”)

    a lot of constraints!


Large margin estimation

  • Given training example (x, y*), we want:

    wᵀ f(x, y*) > wᵀ f(x, y)  for all y ≠ y*

  • Maximize the margin γ:

    wᵀ f(x, y*) ≥ γ + wᵀ f(x, y)  for all y ≠ y*

  • Mistake-weighted margin: require a larger margin for worse outputs,

    wᵀ f(x, y*) ≥ γ ℓ(y*, y) + wᵀ f(x, y),  where ℓ(y*, y) = # of mistakes in y

*Taskar et al. ’03


Large margin estimation

  • Brute force enumeration (sketched below)

  • Min-max formulation

    • ‘Plug-in’ linear program for inference
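A sketch of the brute-force variant using cvxpy (assumed available), enumerating the output space explicitly; this is exactly what the min-max formulation is designed to avoid:

```python
import cvxpy as cp

def max_margin_brute_force(F_star, F_others, losses):
    """Structured max-margin QP by brute-force enumeration (sketch).

    F_star: feature vector f(x, y*), shape (d,)
    F_others: one row f(x, y) per alternative output, shape (m, d)
    losses: loss(y*, y) per alternative output, shape (m,)
    Only viable when the output space is small enough to enumerate.
    """
    d = F_star.shape[0]
    w = cp.Variable(d)
    constraints = [F_star @ w >= losses[j] + F_others[j] @ w
                   for j in range(F_others.shape[0])]
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value
```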


Min-max formulation

Assume a linear loss (Hamming): ℓ(y*, y) decomposes over positions.

The exponentially many constraints collapse into one, involving loss-augmented inference:

wᵀ f(x, y*) ≥ max_y [ wᵀ f(x, y) + ℓ(y*, y) ]

and this inner maximization is itself an LP whenever inference is.
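For chains, Hamming loss decomposes per position, so it folds straight into the node scores and loss-augmented inference reuses the `viterbi` sketch from earlier (same illustrative arrays):

```python
import numpy as np

def loss_augmented_viterbi(node_scores, edge_scores, y_star):
    """Solve max_y [ w.f(x, y) + Hamming(y*, y) ] on a chain.

    Adds 1 to the score of every label that disagrees with y*,
    then runs ordinary Viterbi on the shifted scores.
    """
    augmented = node_scores + 1.0                       # +1 everywhere ...
    augmented[np.arange(len(y_star)), y_star] -= 1.0    # ... except the truth
    return viterbi(augmented, edge_scores)
```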


Min-max formulation (cont.)

By strong LP duality, the inner maximization can be replaced by its dual LP's minimization over dual variables z, yielding a single compact QP: minimize jointly over w and z.

Min-max formulation (cont.)

  • Formulation produces compact QP for

    • Low-treewidth Markov networks

    • Associative Markov networks

    • Context free grammars

    • Bipartite matchings

    • Any problem with compact LP inference


3D Mapping

Data provided by: Michael Montemerlo & Sebastian Thrun

Sensors: laser range finder, GPS, IMU

Labels: ground, building, tree, shrub

Training: 30 thousand points; testing: 3 million points


Segmentation results

Hand-labeled 180K test points. [Results figure omitted in transcript.]


Fly-through


Certificate formulation

  • Non-bipartite matchings:

    • O(n³) combinatorial algorithm

    • No polynomial-size LP known

  • Spanning trees

    • No polynomial-size LP known

    • Simple certificate of optimality

  • Intuition:

    • Verifying optimality easier than optimizing

  • Compact optimality condition of y* w.r.t. w


Certificate for non-bipartite matching

Alternating cycle:

  • every other edge is in the matching

Augmenting alternating cycle:

  • the edges not in the matching score higher than the edges in it

Negate the scores of edges not in the matching:

  • augmenting alternating cycle = negative-length alternating cycle

  • matching is optimal ⇔ no negative alternating cycles

Edmonds ‘65


Certificate for non-bipartite matching (cont.)

Pick any node r as root, and let dⱼ = length of the shortest alternating path from r to j.

Triangle inequality: dⱼ ≤ dₖ + len(k, j).

Theorem: no negative-length alternating cycle ⇔ such a distance function d exists.

This can be expressed as linear constraints: O(n) distance variables, O(n²) constraints.
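The same "distances certify optimality" idea, sketched for ordinary directed graphs via Bellman-Ford; the matching case restricts attention to alternating paths, which this sketch does not model:

```python
def shortest_path_certificate(n, edges, root=0):
    """Bellman-Ford sketch of the certificate idea.

    edges: list of (u, v, length) triples over nodes 0..n-1.
    Returns distances d with d[v] <= d[u] + length for every edge
    (the linear certificate), or None if a negative cycle exists.
    """
    INF = float("inf")
    d = [INF] * n
    d[root] = 0.0
    for _ in range(n - 1):                      # n-1 rounds of relaxation
        for u, v, length in edges:
            if d[u] + length < d[v]:
                d[v] = d[u] + length
    for u, v, length in edges:                  # one more pass: any improvement
        if d[u] + length < d[v]:                # implies a negative cycle
            return None
    return d
```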


Certificate formulation

  • Formulation produces compact QP for

    • Spanning trees

    • Non-bipartite matchings

    • Any problem with compact optimality condition


Disulfide connectivity prediction

  • Dataset

    • Swiss-Prot protein database, release 39

      • Fariselli & Casadio 01, Baldi et al. 04

    • 446 sequences (4-50 cysteines)

    • Features: window profiles (size 9) around each pair

    • Two modes: bonded state known/unknown

  • Comparison:

    • SVM-trained weights (ignoring constraints during learning)

    • DAG Recursive Neural Network [Baldi et al. 04]

  • Our model:

    • Max-margin matching using RBF kernel

    • Training: off-the-shelf LP/QP solver CPLEX (~1 hour)


Known bonded state

Precision / accuracy, 4-fold cross-validation. [Results table omitted in transcript.]


Unknown bonded state

Precision / recall / accuracy, 4-fold cross-validation. [Results table omitted in transcript.]


Formulation summary

  • Brute force enumeration

  • Min-max formulation

    • ‘Plug-in’ convex program for inference

  • Certificate formulation

    • Directly guarantee optimality of y*


Estimation

[Diagram: models arranged along two axes, local → global and generative → discriminative → margin]

  • Generative, P(x, y): HMMs, PCFGs (local); MRFs (global)

  • Discriminative, P(y|x): MEMMs (local); CRFs (global)

  • Margin-based: the models in this talk

Local normalization: P(z) = ∏ᵢ P(zᵢ | parents(zᵢ))    Global normalization: P(z) = (1/Z) ∏_c φ_c(z_c)


Omissions

  • Formulation details

    • Kernels

    • Multiple examples

    • Slacks for non-separable case

  • Approximate learning of intractable models

    • General MRFs

    • Learning to cluster

  • Structured generalization bounds

  • Scalable algorithms (no QP solver needed)

    • Structured SMO (works for chains, trees)

    • Structured EG (works for chains, trees)

    • Structured PG (works for chains, matchings, AMNs, …)


Current Work

  • Learning approximate energy functions

    • Protein folding

    • Physical processes

  • Semi-supervised learning

    • Hidden variables

    • Mixing labeled and unlabeled data

  • Discriminative structure learning

    • Using sparsifying priors


Conclusion

  • Two general techniques for structured large-margin estimation

  • Exact, compact, convex formulations

  • Allow efficient use of kernels

  • Tractable when other estimation methods are not

  • Structured generalization bounds

  • Efficient learning algorithms

  • Empirical success on many domains

  • Papers at http://www.cs.berkeley.edu/~taskar


Duals and Kernels

  • Kernel trick works!

    • Scoring functions (log-potentials) can use kernels

    • Same for certificate formulation


Handwriting Recognition

Length: ~8 chars; letter: 16×8 pixels

10-fold train/test: 5000/50000 letters, 600/6000 words

Models: multiclass SVMs*, CRFs, M³ nets
Features: raw pixels, quadratic kernel, cubic kernel

[Bar chart: average per-character test error for MC-SVMs, CRFs, M³ nets; lower is better]

45% error reduction over linear CRFs
33% error reduction over multiclass SVMs

*Crammer & Singer ’01


Hypertext Classification

  • WebKB dataset

    • Four CS department websites: 1300 pages/3500 links

    • Classify each page: faculty, course, student, project, other

    • Train on three universities/test on fourth

[Bar chart: test error for SVMs, RMNs (loopy belief propagation), and the max-margin model (relaxed dual); lower is better]

53% error reduction over SVMs
38% error reduction over RMNs

*Taskar et al. ’02


Projected Gradient

Projecting y′ onto the constraints:
⇒ min-cost convex flow for Markov nets, matchings

Convergence: same as steepest gradient.
Conjugate gradient also possible (two-metric projection).

[Figure: iterates yₖ, yₖ₊₁, … projected back onto the feasible polytope]
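A generic sketch of the iteration yₖ₊₁ = Proj(yₖ + α ∇); the Euclidean projection onto a probability simplex below is a stand-in for the talk's min-cost flow projections:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def projected_gradient_ascent(grad, y0, step=0.1, iters=100):
    """y_{k+1} = Proj(y_k + step * grad(y_k)); fixed step size for brevity."""
    y = y0
    for _ in range(iters):
        y = project_simplex(y + step * grad(y))
    return y
```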



Min-Cost Flow for Markov Chains

  • Capacities = C

  • Edge costs = negated scores (so min-cost = max-score)

  • Edges from source s and into sink t have cost 0

[Figure: source s → a column of label nodes (a–z) per position → sink t]
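A sketch with networkx (assumed available): a unit of flow from s to t selects one label per position, and negated, integer-scaled scores turn MAP decoding into min-cost flow:

```python
import networkx as nx

def chain_map_via_flow(node_scores, edge_scores, labels):
    """MAP for a chain as a min-cost flow (sketch).

    node_scores: dict (i, y) -> score; edge_scores: dict (y, y2) -> score.
    networkx's solver expects integer weights, so scores are scaled.
    """
    n = max(i for i, _ in node_scores) + 1
    G = nx.DiGraph()
    G.add_node("s", demand=-1)                  # one unit of supply ...
    G.add_node("t", demand=1)                   # ... absorbed at the sink
    for y in labels:
        G.add_edge("s", (0, y), weight=-round(1000 * node_scores[0, y]),
                   capacity=1)
        G.add_edge((n - 1, y), "t", weight=0, capacity=1)
    for i in range(n - 1):
        for y in labels:
            for y2 in labels:
                cost = node_scores[i + 1, y2] + edge_scores[y, y2]
                G.add_edge((i, y), (i + 1, y2),
                           weight=-round(1000 * cost), capacity=1)
    flow = nx.min_cost_flow(G)
    path, node = [], "s"                        # follow the unit of flow
    while node != "t":
        node = next(v for v, f in flow[node].items() if f > 0)
        if node != "t":
            path.append(node[1])                # the chosen label
    return path
```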


Min-Cost Flow for Bipartite Matchings

[Figure: source s → left nodes → right nodes → sink t]

  • Capacities = C

  • Edge costs = negated scores

  • Edges from source s and into sink t have cost 0


CFG Chart

  • CNF tree = set of two types of parts:

    • Constituents (A, s, e)

    • CF-rules (A → B C, s, m, e)
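Decoding over these two part types is CKY dynamic programming. A dictionary-based sketch scoring a CNF parse as a sum of constituent-part and rule-part scores (illustrative only):

```python
def cky_max_score(n, rule_scores, const_scores):
    """Best CNF parse score from the two part types (sketch).

    const_scores[(A, s, e)]: score of constituent part (A, s, e)
    rule_scores[(A, B, C)]:  score of rule part A -> B C
    best[(A, s, e)]: best score of an A-rooted subtree over span [s, e)
    """
    NEG_INF = float("-inf")
    best = {}
    for (A, s, e), sc in const_scores.items():   # width-1 spans seed the chart
        if e == s + 1:
            best[(A, s, e)] = sc
    for width in range(2, n + 1):
        for s in range(n - width + 1):
            e = s + width
            for (A, B, C), rule_sc in rule_scores.items():
                for m in range(s + 1, e):        # split point
                    if (B, s, m) in best and (C, m, e) in best:
                        cand = (rule_sc + best[(B, s, m)] + best[(C, m, e)]
                                + const_scores.get((A, s, e), 0.0))
                        if cand > best.get((A, s, e), NEG_INF):
                            best[(A, s, e)] = cand
    return best
```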


CFG Inference LP

LP variables and constraints mirror the inside and outside structure of the chart.

Has integral solutions y for trees.
