Learning Structured Prediction Models: A Large Margin Approach


### Learning Structured Prediction Models: A Large Margin Approach

Ben Taskar

U.C. Berkeley

Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein, Daphne Koller, Chris Manning

Outline

- Structured prediction models
- Sequences (CRFs)
- Trees (CFGs)
- Associative Markov networks (Special MRFs)
- Matchings
- Geometric View
- Structured model polytopes
- Linear programming inference
- Structured large margin estimation
- Min-max formulation
- Application: 3D object segmentation
- Certificate formulation
- Application: disulfide connectivity prediction

[Figure: chain of five nodes, each taking a label in a–z]

Chain Markov Net (aka CRF*)

P(y|x) ∝ ∏ᵢ φ(xᵢ, yᵢ) ∏ᵢ ψ(yᵢ, yᵢ₊₁)

φ(xᵢ, yᵢ) = exp{w · f(xᵢ, yᵢ)}

ψ(yᵢ, yᵢ₊₁) = exp{w · f(yᵢ, yᵢ₊₁)}

Edge features: f(y, y′) = I(y = ‘z’, y′ = ‘a’)

Node features: f(x, y) = I(x_p = 1, y = ‘z’)

*Lafferty et al. ’01


Chain Markov Net (aka CRF*)

P(y|x) ∝ ∏ᵢ φᵢ(xᵢ, yᵢ) ∏ᵢ ψᵢ(yᵢ, yᵢ₊₁) = exp{wᵀf(x, y)}

w = [… , w_node , … , w_edge , …]

f(x, y) = [… , ∑ᵢ f(xᵢ, yᵢ) , … , ∑ᵢ f(yᵢ, yᵢ₊₁) , …]

φᵢ(xᵢ, yᵢ) = exp{w · f(xᵢ, yᵢ)}

ψᵢ(yᵢ, yᵢ₊₁) = exp{w · f(yᵢ, yᵢ₊₁)}

Edge feature counts: f(x, y) = #(yᵢ = ‘z’, yᵢ₊₁ = ‘a’)

Node feature counts: f(x, y) = #(x_p = 1, yᵢ = ‘z’)

*Lafferty et al. ’01
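The factored score wᵀf(x, y) above can be made concrete. Below is a minimal sketch (not the authors' code; the tiny alphabet, pixel count, and feature layout are invented for illustration) that sums indicator node and edge features along a chain:

```python
# Sketch: scoring a chain labeling y for input x under w·f(x,y),
# with indicator node features f(x_i, y_i) and edge features f(y_i, y_{i+1}).
# LABELS and N_PIX are toy stand-ins for the a-z alphabet and pixel features.
import numpy as np

LABELS = ["a", "b", "c"]
N_PIX = 4

def node_feat(x_i, y_i):
    """f(x_i, y_i): indicator of each (pixel on, label) pair."""
    f = np.zeros((N_PIX, len(LABELS)))
    f[x_i == 1, LABELS.index(y_i)] = 1.0
    return f.ravel()

def edge_feat(y_i, y_j):
    """f(y_i, y_j): indicator of each label transition."""
    f = np.zeros((len(LABELS), len(LABELS)))
    f[LABELS.index(y_i), LABELS.index(y_j)] = 1.0
    return f.ravel()

def chain_score(w_node, w_edge, x, y):
    """w·f(x,y) = sum_i w·f(x_i,y_i) + sum_i w·f(y_i,y_{i+1})."""
    s = sum(w_node @ node_feat(x[i], y[i]) for i in range(len(y)))
    s += sum(w_edge @ edge_feat(y[i], y[i + 1]) for i in range(len(y) - 1))
    return s
```

Because the score is linear in w, learning reduces to estimating one weight vector shared across all positions.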

Associative Markov Nets

Point features: spin-images, point height

Edge features: length of edge, edge orientation

“Associative” restriction: edge potentials favor the two endpoints taking the same label

[Figure: network of points i, j with labels yᵢ, yⱼ, node potentials φᵢ and edge potentials φᵢⱼ]

Disulfide bonds: non-bipartite matching

RSCCPCYWGGCPWGQNCYPEGCSGPKV

(cysteines numbered 1–6 along the sequence)

[Figure: perfect matching on the six cysteines showing which pairs are bonded]

Fariselli & Casadio ’01, Baldi et al. ’04

Scoring function

RSCCPCYWGGCPWGQNCYPEGCSGPKV

(cysteines 1–6)

String features: residues, physical properties

[Figure: two candidate matchings of the cysteines, scored from string features around each pair]
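Since the matching score decomposes over bonded pairs, a tiny instance can be solved by enumerating perfect matchings. A sketch (the `pair_score` callable is hypothetical; real inference uses the O(n³) combinatorial algorithm discussed later in the talk):

```python
# Sketch: brute-force best perfect matching on a small even node set.
# pair_score(i, j) stands in for w·f(x, i, j) computed from string features.
def best_matching(pair_score, nodes):
    """Return (matching, score) maximizing the sum of pair scores."""
    nodes = sorted(nodes)
    best, best_m = float("-inf"), None

    def rec(rest, acc, s):
        nonlocal best, best_m
        if not rest:
            if s > best:
                best, best_m = s, acc[:]
            return
        i = rest[0]
        for j in rest[1:]:
            acc.append((i, j))
            rec([k for k in rest[1:] if k != j], acc, s + pair_score(i, j))
            acc.pop()

    rec(nodes, [], 0.0)
    return best_m, best
```

Enumeration is exponential in the number of cysteines, which is why a polynomial algorithm (or certificate) is needed at real sizes.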

Structured models

Mild assumption: the scoring function is linear in the parameters, score(x, y) = wᵀf(x, y).

Another mild assumption: MAP inference over the space of feasible outputs can be solved by linear programming.

MAP inference ⇔ linear program

- LP inference for
- Chains
- Trees
- Associative Markov Nets
- Bipartite Matchings
- …

Markov Net Inference LP

Has integral solutions y for chains, trees

Gives upper bound for general networks

Associative MN Inference LP

“associative” restriction

- For K=2, solutions are always integral (optimal)
- For K>2, within factor of 2 of optimal
- Constraint matrix A is linear in number of nodes and edges, regardless of the tree-width

Other Inference LPs

- Context-free parsing
- Dynamic programs
- Bipartite matching
- Network flow
- Many other combinatorial problems
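For chains, the integral LP solution coincides with the output of the standard Viterbi dynamic program. A minimal sketch (generic score matrices, not tied to any particular feature set):

```python
# Sketch: Viterbi MAP decoding for a chain, the combinatorial counterpart
# of the chain inference LP (which has integral solutions).
import numpy as np

def viterbi(node_scores, edge_scores):
    """node_scores: (T, K) per-position label scores;
    edge_scores: (K, K) transition scores. Returns the argmax label sequence."""
    T, K = node_scores.shape
    dp = np.zeros((T, K))
    bp = np.zeros((T, K), dtype=int)   # backpointers
    dp[0] = node_scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + edge_scores + node_scores[t][None, :]
        bp[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0)
    y = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(bp[t][y[-1]]))
    return y[::-1]
```

The same max-sum recursion, written as a flow of marginals, is exactly the LP relaxation being integral on trees.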

Outline

- Structured prediction models
- Sequences (CRFs)
- Trees (CFGs)
- Associative Markov networks (Special MRFs)
- Matchings
- Geometric View
- Structured model polytopes
- Linear programming inference
- Structured large margin estimation
- Min-max formulation
- Application: 3D object segmentation
- Certificate formulation
- Application: disulfide connectivity prediction

Learning w

- Training example (x, y*)
- Probabilistic approach:
- Maximize conditional likelihood
- Problem: computing the partition function Z_w(x) is #P-complete

OCR Example

- We want:

argmax_word wᵀf(x, word) = “brace”

- Equivalently:

wᵀf(x, “brace”) > wᵀf(x, “aaaaa”)

wᵀf(x, “brace”) > wᵀf(x, “aaaab”)

…

wᵀf(x, “brace”) > wᵀf(x, “zzzzz”)

… a lot of constraints!
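The brute-force view can be made explicit: enumerate every candidate word and keep the best-scoring one. The sketch below (hypothetical `score` callable) only works for toy alphabets and lengths, which is the point: there are |Σ|^L candidates, hence that many margin constraints:

```python
# Sketch: brute-force argmax over all labelings of a fixed length.
# Feasible only for tiny alphabets; the talk's formulations avoid this blowup.
from itertools import product

def brute_force_argmax(score, alphabet, length):
    """Enumerate all |alphabet|**length labelings and return the best."""
    return max(product(alphabet, repeat=length), key=score)
```

For a 26-letter alphabet and 5-character words this is already 26⁵ ≈ 11.9 million candidates.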

Large margin estimation

- Given training example (x, y*), we want:

wᵀf(x, y*) > wᵀf(x, y), for all y ≠ y*

- Maximize the margin
- Mistake-weighted margin: require a margin that grows with the number of mistakes in y

*Taskar et al. ’03

Large margin estimation

- Brute force enumeration
- Min-max formulation
- ‘Plug-in’ linear program for inference
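The mistake-weighted margin can be folded into a single structured hinge quantity via loss-augmented inference. A toy sketch that enumerates the output space directly (real instances replace enumeration with the min-max or certificate machinery):

```python
# Sketch: structured hinge with Hamming loss Δ(y*, y).
# The margin constraints w·f(x,y*) >= w·f(x,y) + Δ(y*,y) all hold
# exactly when the hinge value is zero.
def hamming(y_star, y):
    """Δ(y*, y): number of positions where the labelings disagree."""
    return sum(a != b for a, b in zip(y_star, y))

def structured_hinge(score, y_star, all_outputs):
    """max_y [score(y) + Δ(y*, y)] - score(y*)."""
    augmented = max(score(y) + hamming(y_star, y) for y in all_outputs)
    return augmented - score(y_star)
```

The inner max is the loss-augmented inference problem; it has the same structure as MAP inference, so the same LP can be plugged in.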

Min-max formulation

- Formulation produces compact QP for
- Low-treewidth Markov networks
- Associative Markov networks
- Context-free grammars
- Bipartite matchings
- Any problem with compact LP inference

3D Mapping

Data provided by: Michael Montemerlo & Sebastian Thrun

Laser Range Finder

GPS

IMU

Labels: ground, building, tree, shrub

Training: 30 thousand points; testing: 3 million points

Segmentation results

Hand-labeled 180K test points

Certificate formulation

- Non-bipartite matchings:
- O(n³) combinatorial algorithm
- No polynomial-size LP known
- Spanning trees:
- No polynomial-size LP known
- Simple certificate of optimality
- Intuition:
- Verifying optimality is easier than optimizing
- Compact optimality condition of y* w.r.t. competing outputs

[Figure: matching with an alternating cycle through edges ij and kl]

Certificate for non-bipartite matching

Alternating cycle:

- Every other edge is in the matching

Augmenting alternating cycle:

- Score of edges not in the matching greater than score of edges in the matching

Negate the score of edges not in the matching:

- Augmenting alternating cycle = negative-length alternating cycle

Matching is optimal ⇔ no negative alternating cycles

Edmonds ’65

Certificate for non-bipartite matching

Pick any node r as root; let dⱼ = length of the shortest alternating path from r to j.

Triangle inequality: dⱼ ≤ dᵢ + length(i, j)

Theorem: no negative-length cycle ⇔ such a distance function d exists

Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
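The theorem's "no negative cycle ⇔ a distance function exists" equivalence is the classical shortest-path potential argument: Bellman-Ford either constructs such a function or exposes a negative cycle. A generic sketch over plain directed edges (not the alternating-path construction itself):

```python
# Sketch: Bellman-Ford style check. Returns potentials d satisfying
# d[j] <= d[i] + length(i,j) for every edge, or None if a negative cycle
# makes such a function impossible.
def potentials_or_none(n, edges):
    """n: node count; edges: list of (i, j, length) tuples."""
    d = [0.0] * n
    for _ in range(n - 1):           # relax every edge n-1 times
        for i, j, w in edges:
            if d[i] + w < d[j]:
                d[j] = d[i] + w
    for i, j, w in edges:            # any further improvement => negative cycle
        if d[i] + w < d[j]:
            return None
    return d
```

In the certificate formulation, the existence of d is encoded directly as linear constraints on the learning problem rather than checked algorithmically.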

Certificate formulation

- Formulation produces compact QP for
- Spanning trees
- Non-bipartite matchings
- Any problem with compact optimality condition

Disulfide connectivity prediction

- Dataset
- Swiss-Prot protein database, release 39
- Fariselli & Casadio 01, Baldi et al. 04
- 446 sequences (4-50 cysteines)
- Features: window profiles (size 9) around each pair
- Two modes: bonded state known/unknown
- Comparison:
- SVM-trained weights (ignoring constraints during learning)
- DAG Recursive Neural Network [Baldi et al. 04]
- Our model:
- Max-margin matching using RBF kernel
- Training: off-the-shelf LP/QP solver CPLEX (~1 hour)

Formulation summary

- Brute force enumeration
- Min-max formulation
- ‘Plug-in’ convex program for inference
- Certificate formulation
- Directly guarantee optimality of y*

Estimation

- Generative, P(x, y):
- Local normalization, P(z) = ∏ᵢ P(zᵢ | z_pa(i)): HMMs, PCFGs
- Global normalization, P(z) = (1/Z) ∏_c ψ_c(z_c): MRFs
- Discriminative, P(y | x):
- Local: MEMMs
- Global: CRFs
- Margin-based estimation: discriminative, and avoids computing Z altogether

Omissions

- Formulation details
- Kernels
- Multiple examples
- Slacks for non-separable case
- Approximate learning of intractable models
- General MRFs
- Learning to cluster
- Structured generalization bounds
- Scalable algorithms (no QP solver needed)
- Structured SMO (works for chains, trees)
- Structured EG (works for chains, trees)
- Structured PG (works for chains, matchings, AMNs, …)

Current Work

- Learning approximate energy functions
- Protein folding
- Physical processes
- Semi-supervised learning
- Hidden variables
- Mixing labeled and unlabeled data
- Discriminative structure learning
- Using sparsifying priors

Conclusion

- Two general techniques for structured large-margin estimation
- Exact, compact, convex formulations
- Allow efficient use of kernels
- Tractable when other estimation methods are not
- Structured generalization bounds
- Efficient learning algorithms
- Empirical success on many domains
- Papers at http://www.cs.berkeley.edu/~taskar

Duals and Kernels

- Kernel trick works!
- Scoring functions (log-potentials) can use kernels
- Same for certificate formulation

Handwriting Recognition

- Length: ~8 chars
- Letter: 16×8 pixels
- 10-fold train/test
- 5000/50000 letters
- 600/6000 words

Models: multiclass SVMs*, CRFs, M³ nets

Kernels compared: raw pixels, quadratic kernel, cubic kernel

[Figure: bar chart of test error (average per-character, 0–30%) for MC-SVMs, CRFs, and M³ nets; lower is better]

- 45% error reduction over linear CRFs
- 33% error reduction over multiclass SVMs

*Crammer & Singer ’01

Hypertext Classification

- WebKB dataset
- Four CS department websites: 1300 pages / 3500 links
- Classify each page: faculty, course, student, project, other
- Train on three universities, test on the fourth

[Figure: test error for SVMs, RMNs (loopy belief propagation), and M³ nets (relaxed dual); lower is better]

- 53% error reduction over SVMs
- 38% error reduction over RMNs

*Taskar et al. ’02

Projected Gradient

Projecting y′ onto constraints: min-cost convex flow for Markov nets, matchings

Convergence: same as steepest gradient

Conjugate gradient also possible (two-metric projection)

[Figure: gradient iterates y_k, y_{k+1}, …, each projected back onto the feasible set]
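The iteration itself is simple. A minimal generic sketch (a box projection via `np.clip` stands in for the min-cost-flow projection the slide refers to):

```python
# Sketch: projected gradient descent. Step along -grad, then project
# the iterate back onto the feasible set.
import numpy as np

def projected_gradient(grad, project, y0, step=0.1, iters=100):
    """grad: gradient callable; project: projection onto the feasible set."""
    y = np.asarray(y0, dtype=float)
    for _ in range(iters):
        y = project(y - step * grad(y))
    return y
```

The whole structured-learning burden sits inside `project`; for Markov nets and matchings that projection is itself a min-cost convex flow problem.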

Min-Cost Flow for Markov Chains

- Capacities = C
- Edge costs =
- For edges from node s, and to node t, cost = 0

[Figure: flow network from source s to sink t through one column of label nodes (a–z) per position]

Min-Cost Flow for Bipartite Matchings

- Capacities = C
- Edge costs =
- For edges from node s, and to node t, cost = 0

[Figure: bipartite flow network from source s to sink t]

CFG Chart

- CNF tree = set of two types of parts:
- Constituents (A, s, e)
- CF-rules (A → B C, s, m, e)
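The two part types can be read off a CNF tree mechanically. A sketch, using a hypothetical `(label, children)` tuple encoding of binary trees with single-position leaves:

```python
# Sketch: decompose a CNF parse tree into its parts:
# constituents (A, s, e) and rule applications (A -> B C, s, m, e).
def tree_parts(node, s=0):
    """node: (label, children), children = [] for a leaf spanning one position,
    or [left, right]. Returns (end, constituents, rules)."""
    label, children = node
    if not children:
        return s + 1, [(label, s, s + 1)], []
    m, c1, r1 = tree_parts(children[0], s)            # left child spans [s, m)
    e, c2, r2 = tree_parts(children[1], m)            # right child spans [m, e)
    rule = ((label, children[0][0], children[1][0]), s, m, e)
    return e, [(label, s, e)] + c1 + c2, [rule] + r1 + r2
```

Because the tree score decomposes over exactly these parts, parsing fits the same linear scoring and LP-inference template as chains and matchings.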
