
Structured Prediction: A Large Margin Approach

Ben Taskar

University of Pennsylvania

Joint work with: V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning

“Don’t worry, Howard. The big questions are multiple choice.”



Handwriting Recognition

[Figure: handwritten image of the word “brace” as input x; its letter sequence as output y]

Sequential structure


Object Segmentation

[Figure: 3D laser scan as input x; per-point object labels as output y]

Spatial structure


Natural Language Parsing

[Figure: the sentence “The screen was a sea of red” as input x; its parse tree as output y]

Recursive structure


Bilingual Word Alignment

[Figure: word alignment y between the two sentences of an English–French pair x]

What is the anticipated cost of collecting fees under the new proposal?

En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?

Combinatorial structure


Protein Structure and Disulfide Bridges

AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK

Protein: 1IMT

[Figure: 3D structure with disulfide bridges between cysteines]


Local Prediction

Classify each part using local information alone; ignores correlations & constraints!

[Figure: the letters of “brace” classified independently]

Local Prediction

[Figure: 3D scan points classified independently as building, tree, shrub, or ground]


Structured Prediction

  • Use local information

  • Exploit correlations

[Figure: the letters of “brace” predicted jointly]


Structured Prediction

[Figure: 3D scan points labeled jointly as building, tree, shrub, or ground]


Outline

  • Structured prediction models

    • Sequences (CRFs)

    • Trees (CFGs)

    • Associative Markov networks (Special MRFs)

    • Matchings

  • Structured large margin estimation

    • Margins and structure

    • Min-max formulation

    • Linear programming inference

    • Certificate formulation


Structured Models

Mild assumption: the scoring function is a linear combination of features,

    ŷ = argmax_{y ∈ Y(x)} w · f(x, y)

where w · f(x, y) is the scoring function and Y(x) is the space of feasible outputs.
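A minimal Python sketch of this setup (mine, not the talk's): a linear scoring function and prediction by brute-force enumeration of Y(x); real structured predictors replace the enumeration with combinatorial inference.

    import numpy as np

    def score(w, f, x, y):
        # linear scoring function: w . f(x, y)
        return float(np.dot(w, f(x, y)))

    def predict(w, f, x, feasible_outputs):
        # argmax over the space of feasible outputs Y(x);
        # brute force, viable only when Y(x) is small enough to enumerate
        return max(feasible_outputs(x), key=lambda y: score(w, f, x, y))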


Chain Markov Net (aka CRF*)

[Figure: chain-structured model over an input x (letter images) and labels y, each label ranging over a–z]

*Lafferty et al. 01


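For chains, the argmax can be computed exactly by dynamic programming. A short Viterbi sketch (my code; per-position node scores and a shared transition matrix are assumptions):

    import numpy as np

    def viterbi(node_scores, edge_scores):
        # node_scores: (T, K) score of label k at position t
        # edge_scores: (K, K) score of transitioning from label j to label k
        T, K = node_scores.shape
        best = np.zeros((T, K))           # best score of any prefix ending in label k
        back = np.zeros((T, K), dtype=int)
        best[0] = node_scores[0]
        for t in range(1, T):
            cand = best[t - 1][:, None] + edge_scores + node_scores[t][None, :]
            back[t] = cand.argmax(axis=0)
            best[t] = cand.max(axis=0)
        y = [int(best[-1].argmax())]      # backtrack from the best final label
        for t in range(T - 1, 0, -1):
            y.append(int(back[t, y[-1]]))
        return y[::-1]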


Associative Markov Nets

Point features: spin-images, point height

Edge features: length of edge, edge orientation

The “associative” restriction: edge potentials between neighboring nodes j and k reward agreement, y_j = y_k.

CFG Parsing

#(NP → DT NN)

#(PP → IN NP)

#(NN → ‘sea’)

Bilingual Word Alignment

Features on each candidate edge (j, k) between English word j and French word k:

  • position

  • orthography

  • association

[Figure: candidate alignment edges between “What is the anticipated cost of collecting fees under the new proposal?” and “En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?”]

Disulfide Bonds: Non-bipartite Matching

RSCCPCYWGGCPWGQNCYPEGCSGPKV

[Figure: the six cysteines, numbered 1–6, with candidate pairings shown as a non-bipartite matching]

Fariselli & Casadio ‘01, Baldi et al. ‘04


Scoring Function

RSCCPCYWGGCPWGQNCYPEGCSGPKV

Features on each candidate cysteine pair:

  • amino acid identities

  • phys/chem properties

[Figure: a candidate matching over cysteines 1–6]

Structured Models

Mild assumptions: the scoring function is a linear combination of features and decomposes into a sum of part scores,

    ŷ = argmax_{y ∈ Y(x)} w · f(x, y),   with w · f(x, y) = Σ_p w · f(x, y_p)

where Y(x) is the space of feasible outputs.

Supervised Structured Prediction

Data: a sample of input–output pairs {(x_i, y_i)}

Model: scoring function w · f(x, y)

Learning: estimate w from the data

Prediction: ŷ = argmax_{y ∈ Y(x)} w · f(x, y)

Example: weighted matching. Generally: combinatorial optimization.

Learning approaches: local estimation (ignores structure); likelihood (intractable for general structures); margin-based (this talk).

Outline

  • Structured prediction models

    • Sequences (CRFs)

    • Trees (CFGs)

    • Associative Markov networks (Special MRFs)

    • Matchings

  • Structured large margin estimation

    • Margins and structure

    • Min-max formulation

    • Linear programming inference

    • Certificate formulation


OCR Example

  • We want: the score of the correct word to beat every alternative,

    score(x, “brace”) > score(x, “aaaaa”), score(x, “aaaab”), …, score(x, “zzzzz”)

  • Equivalently: w · f(x, “brace”) > w · f(x, y′) for every other word y′ — a lot of constraints! (26⁵ − 1 ≈ 11.9 million for a five-letter word)

Parsing Example

  • We want: the score of the correct parse of ‘It was red’ to beat every alternative tree

  • Equivalently: w · f(x, y) > w · f(x, y′) for each of the exponentially many other trees y′ — a lot of constraints!

[Figure: the correct parse of ‘It was red’ alongside competing trees over nonterminals S, A, B, C, D, E, F, G, H]

Alignment Example

  • We want: the score of the correct alignment of ‘What is the’ / ‘Quel est le’ to beat every alternative matching

  • Equivalently: w · f(x, y) > w · f(x, y′) for every other matching y′ — a lot of constraints!

[Figure: the correct alignment (1–1, 2–2, 3–3) alongside the other matchings of the three word pairs]


Structured Loss

The loss ℓ(y, y′) counts wrong parts. Per-character (Hamming) loss against “brace”:

b c a r e → 2

b r o r e → 2

b r o c e → 1

b r a c e → 0

[Figure: analogous part-based losses for alignments of ‘What is the’ / ‘Quel est le’ and for parse trees of ‘It was red’]
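A one-function sketch of this Hamming-style loss (mine, not the talk's):

    def hamming_loss(y_true, y_pred):
        # number of parts (here, characters) where prediction and truth disagree
        return sum(t != p for t, p in zip(y_true, y_pred))

    assert hamming_loss("brace", "bcare") == 2
    assert hamming_loss("brace", "broce") == 1
    assert hamming_loss("brace", "brace") == 0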


Large Margin Estimation

  • Given training examples {(x_i, y_i)}, we want:

    w · f(x_i, y_i) ≥ w · f(x_i, y) + γ for all y ∈ Y(x_i)

  • Maximize the margin γ

  • Mistake-weighted margin: scale the required margin by the # of mistakes in y,

    w · f(x_i, y_i) ≥ w · f(x_i, y) + γ ℓ(y_i, y)

*Collins 02, Altun et al 03, Taskar 03


Large Margin Estimation

  • Eliminate γ by fixing the scale: minimize ‖w‖² subject to the margin constraints

  • Add slacks ξ_i for the inseparable case (see the sketch below):

    min ½‖w‖² + C Σ_i ξ_i   s.t.   w · f(x_i, y_i) ≥ w · f(x_i, y) + ℓ(y_i, y) − ξ_i for all y
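At the optimum, the slack ξ_i equals the structured hinge loss. A brute-force sketch (my code; enumeration stands in for loss-augmented inference):

    import numpy as np

    def structured_hinge(w, f, x, y_true, feasible_outputs, loss):
        # xi_i = max_y [w.f(x,y) + loss(y_true, y)] - w.f(x, y_true)
        best = max(float(np.dot(w, f(x, y))) + loss(y_true, y)
                   for y in feasible_outputs(x))
        return max(0.0, best - float(np.dot(w, f(x, y_true))))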


Large Margin Estimation

  • Brute force enumeration

  • Min-max formulation

    • ‘Plug-in’ linear program for inference


Min-max Formulation

Structured loss (Hamming): ℓ(y_i, y) = Σ_j 1[y_i,j ≠ y_j]

Key step: the constraint involves max_y [w · f(x_i, y) + ℓ(y_i, y)], a discrete optimization; replace this inference problem with an equivalent LP, turning it into a continuous optimization that plugs into the learning problem.


Outline

  • Structured prediction models

    • Sequences (CRFs)

    • Trees (CFGs)

    • Associative Markov networks (Special MRFs)

    • Matchings

  • Structured large margin estimation

    • Margins and structure

    • Min-max formulation

    • Linear programming inference

    • Certificate formulation


y ↔ z Map for Markov Nets

Each assignment y corresponds to indicator variables z: z_j(k) = 1[y_j = k] for each node j and z_jk(k, k′) = 1[y_j = k, y_k = k′] for each edge jk.


Markov Net Inference LP

    max_z Σ node and edge scores · z   s.t.   normalization: Σ_k z_j(k) = 1 for each node j;   agreement: Σ_k′ z_jk(k, k′) = z_j(k) for each edge jk

  • Has integral solutions z for chains, trees

  • Can be fractional for untriangulated networks

A concrete chain instance is sketched below.
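A minimal sketch of this LP for a chain (my construction, using scipy; variable names are mine): maximize the score of node and edge marginals z subject to the normalization and agreement constraints above.

    import numpy as np
    from scipy.optimize import linprog

    def chain_inference_lp(node_scores, edge_scores):
        # node_scores: (T, K); edge_scores: (T-1, K, K)
        T, K = node_scores.shape
        n_node, n_edge = T * K, (T - 1) * K * K
        nid = lambda t, k: t * K + k                        # index of z_t(k)
        eid = lambda t, j, k: n_node + (t * K + j) * K + k  # index of z_t(j,k)
        c = -np.concatenate([node_scores.ravel(), edge_scores.ravel()])  # maximize
        A, b = [], []
        for t in range(T):        # normalization: sum_k z_t(k) = 1
            row = np.zeros(n_node + n_edge)
            row[[nid(t, k) for k in range(K)]] = 1
            A.append(row); b.append(1.0)
        for t in range(T - 1):    # agreement between edge and node marginals
            for j in range(K):    # sum_k z_t(j,k) = z_t(j)
                row = np.zeros(n_node + n_edge)
                row[[eid(t, j, k) for k in range(K)]] = 1
                row[nid(t, j)] = -1
                A.append(row); b.append(0.0)
            for k in range(K):    # sum_j z_t(j,k) = z_{t+1}(k)
                row = np.zeros(n_node + n_edge)
                row[[eid(t, j, k) for j in range(K)]] = 1
                row[nid(t + 1, k)] = -1
                A.append(row); b.append(0.0)
        res = linprog(c, A_eq=np.array(A), b_eq=np.array(b), bounds=(0, 1))
        return res.x[:n_node].reshape(T, K)   # integral for chains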


Associative MN Inference LP

With the “associative” restriction on edge potentials:

  • For K = 2 labels, solutions are always integral (optimal)

  • For K > 2, within factor of 2 of optimal


CFG Chart

  • CNF tree = set of two types of parts:

    • Constituents (A, s, e)

    • CF-rules (A → B C, s, m, e)


CFG Inference LP

Constraints mirror the parse chart: root, inside, and outside consistency.

Has integral solutions z


Matching Inference LP

    max_z Σ_jk s_jk z_jk   s.t.   degree constraints: Σ_k z_jk ≤ 1 for each word j, Σ_j z_jk ≤ 1 for each word k

[Figure: candidate alignment edges (j, k) between the English and French sentences of the earlier example]

Has integral solutions z
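Because the bipartite matching LP is integral, exact inference is also available combinatorially. A sketch with scipy's Hungarian-algorithm solver (the score matrix is made up):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # scores[j, k] = w . f(x, jk): score of aligning source word j to target word k
    scores = np.array([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.4],
                       [0.1, 0.2, 1.8]])
    rows, cols = linear_sum_assignment(scores, maximize=True)
    # rows[i] is aligned to cols[i]; here the diagonal alignment 1-1, 2-2, 3-3 wins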


LP Duality

  • Linear programming duality

    • Variables ↔ constraints

    • Constraints ↔ variables

  • Optimal values are the same

    • When both feasible regions are bounded
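A tiny numeric illustration that the primal and dual optima coincide (the example LP is made up):

    import numpy as np
    from scipy.optimize import linprog

    # primal: max c.x s.t. Ax <= b, x >= 0   <->   dual: min b.y s.t. A'y >= c, y >= 0
    c = np.array([3.0, 2.0])
    A = np.array([[1.0, 1.0], [2.0, 1.0]])
    b = np.array([4.0, 5.0])
    primal = linprog(-c, A_ub=A, b_ub=b, bounds=(0, None))      # linprog minimizes
    dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=(0, None))
    assert abs(-primal.fun - dual.fun) < 1e-6                   # both equal 9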


Min-max Formulation

By LP duality, the inner inference LP can be replaced by its dual, collapsing the min-max problem into a single compact minimization (a QP).


Min-max Formulation Summary

  • Formulation produces concise QP for

    • Low-treewidth Markov networks

    • Associative MNs (K=2)

    • Context free grammars

    • Bipartite matchings

    • Approximate for untriangulated MNs, AMNs with K>2

*Taskar et al 04


Unfactored Primal/Dual

By QP duality, the flat formulation has exponentially many constraints (primal) and variables (dual).


Factored Primal/Dual

By QP duality, the factored dual inherits structure from the problem-specific inference LP: its variables correspond to a decomposition of the variables of the flat case.


The Connection

The factored dual acts like a distribution over whole outputs; e.g. for the word “brace”:

b c a r e: loss 2, dual weight .2

b r o r e: loss 2, dual weight .15

b r o c e: loss 1, dual weight .25

b r a c e: loss 0, dual weight .4

Summing per position yields the factored marginals, e.g. position 2: r = .15 + .25 + .4 = .8, c = .2; position 3: a = .6, o = .4; position 4: c = .65, r = .35; positions 1 and 5: b = 1, e = 1.


Duals and Kernels

  • Kernel trick works:

    • Factored dual

    • Local functions (log-potentials) can use kernels


Alternatives: Perceptron

  • Simple iterative method

  • Unstable for structured output: fewer instances, big updates

    • May not converge if non-separable

    • Noisy

  • Voted / averaged perceptron [Freund & Schapire 99, Collins 02]

    • Regularize / reduce variance by aggregating over iterations
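A sketch of the averaged variant just described (my code; feat and argmax are assumed callbacks):

    import numpy as np

    def averaged_structured_perceptron(data, feat, argmax, n_feats, epochs=10):
        # data: list of (x, y) pairs; feat(x, y) -> feature vector;
        # argmax(w, x) -> highest-scoring output under the current weights
        w = np.zeros(n_feats)
        w_sum = np.zeros(n_feats)
        for _ in range(epochs):
            for x, y in data:
                y_hat = argmax(w, x)                   # inference
                if y_hat != y:
                    w += feat(x, y) - feat(x, y_hat)   # big structured update
                w_sum += w                             # averaging reduces variance
        return w_sum / (epochs * len(data))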


Alternatives: Constraint Generation

  • Add most violated constraint

  • Handles more general loss functions

  • Only polynomial # of constraints needed

  • Need to re-solve QP many times

  • Worst case # of constraints larger than factored

[Collins 02; Altun et al, 03; Tsochantaridis et al, 04]
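A schematic of the constraint-generation loop (a sketch; solve_qp and loss_aug_argmax are assumed helpers, not any specific library's API):

    def constraint_generation(data, feat, loss, loss_aug_argmax, solve_qp, eps=1e-3):
        # working set of constraints, grown by adding the most violated one per example
        working_set = []
        w, slacks = solve_qp(working_set)          # assumed helper: re-solves the margin QP
        while True:
            added = False
            for i, (x, y) in enumerate(data):
                y_hat = loss_aug_argmax(w, x, y)   # most violated constraint for example i
                violation = loss(y, y_hat) - (feat(x, y) - feat(x, y_hat)).dot(w)
                if violation > slacks[i] + eps:    # violated beyond the current slack
                    working_set.append((i, x, y, y_hat))
                    added = True
            if not added:
                return w                           # all constraints satisfied to tolerance
            w, slacks = solve_qp(working_set)      # QP must be re-solved many times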


Handwriting Recognition

Setup: length ~8 chars; letters 16×8 pixels; 10-fold train/test; 5000/50000 letters; 600/6000 words

Models: multiclass SVMs* (raw pixels, quadratic kernel, cubic kernel), CRFs, M^3 nets

[Figure: average per-character test error (%), lower is better; 45% error reduction over linear CRFs, 33% error reduction over multiclass SVMs]

*Crammer & Singer 01


Hypertext Classification

  • WebKB dataset

    • Four CS department websites: 1300 pages/3500 links

    • Classify each page: faculty, course, student, project, other

    • Train on three universities/test on fourth

[Figure: test error, lower is better; the relaxed-dual margin model gives 53% error reduction over SVMs and 38% error reduction over RMNs trained with loopy belief propagation]

*Taskar et al 02


3D Mapping

Data provided by: Michael Montemerlo & Sebastian Thrun

Sensors: laser range finder, GPS, IMU

Labels: ground, building, tree, shrub

Training: 30 thousand points; testing: 3 million points


Segmentation Results

Hand labeled 180K test points


Fly-through


Word Alignment Results

Data: Hansards (Canadian Parliament)

Features induced on ~1 mil unsupervised sentences

Trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges)

[Figure: alignment error* for Taskar+al 05 and Lacoste-Julien+Taskar+al 06]

*Error: weighted combination of precision/recall


Outline

  • Structured prediction models

    • Sequences (CRFs)

    • Trees (CFGs)

    • Associative Markov networks (Special MRFs)

    • Matchings

  • Structured large margin estimation

    • Margins and structure

    • Min-max formulation

    • Linear programming inference

    • Certificate formulation


Certificate Formulation

  • Non-bipartite matchings:

    • O(n³) combinatorial algorithm

    • No polynomial-size LP known

  • Spanning trees:

    • No polynomial-size LP known

    • Simple certificate of optimality

  • Intuition: verifying optimality is easier than optimizing

  • Key: a compact optimality condition certifying that the target output scores highest


Certificate for Non-bipartite Matching

Alternating cycle: every other edge is in the matching

Augmenting alternating cycle: score of edges not in the matching is greater than score of edges in the matching

Negate the score of edges not in the matching: an augmenting alternating cycle becomes a negative-length alternating cycle

Matching is optimal ⇔ no negative alternating cycles

Edmonds ‘65


Certificate for Non-bipartite Matching

Pick any node r as root, and let d_j = length of the shortest alternating path from r to j

Triangle inequality: d_k ≤ d_j + length(j, k) along alternating edges

Theorem: no negative-length alternating cycle ⇔ such a distance function d exists

Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints


Certificate Formulation

  • Formulation produces compact QP for

    • Spanning trees

    • Non-bipartite matchings

    • Any problem with compact optimality condition

*Taskar et al. ‘05


Disulfide Bonding Prediction

Data [Swiss Prot 39]

  • 450 sequences (4–10 cysteines)

  • Features:

    • windows around C–C pair

    • physical/chemical properties

AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK

*Accuracy: % proteins with all correct bonds

[Taskar+al 05]


Formulation Summary

  • Brute force enumeration

  • Min-max formulation

    • ‘Plug-in’ convex program for inference

  • Certificate formulation

    • Directly guarantees optimality of the target output


Omissions

  • Kernels

    • Non-parametric models

  • Structured generalization bounds

    • Bounds on Hamming loss

  • Scalable algorithms (no QP solver needed)

    • Structured SMO (works for chains, trees) [Taskar 04]

    • Structured ExpGrad (works for chains, trees) [Bartlett+al 04]

    • Structured ExtraGrad (works for matchings, AMNs) [Taskar+al 06]


Open Questions

  • Statistical consistency

    • Hinge loss not consistent for non-binary output

    • [See Tewari & Bartlett 05, McAllester 07]

  • Learning with approximate inference

    • Does constant factor approximate inference guarantee anything about learning?

    • No [See Kulesza & Pereira 07]

    • Perhaps other assumptions needed

  • Discriminative structure learning

    • Using sparsifying priors


Conclusion

  • Two general techniques for structured large-margin estimation

  • Exact, compact, convex formulations

  • Allow efficient use of kernels

  • Tractable when other estimation methods are not

  • Efficient learning algorithms

  • Empirical success on many domains


References

Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. ICML 2003.

M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP 2002.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 2001.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 2001.

  • More papers at http://www.cis.upenn.edu/~taskar


Modeling First-Order Effects

  • With first-order interactions, alignment becomes a QAP, which is NP-complete

  • Sentences (30 words, ≈1k vars) solve in a few seconds (Mosek)

  • Learning: use LP relaxation

  • Testing: using LP, 83.5% of sentences and 99.85% of edges are integral


Segmentation Model: Min-Cut

Computing the argmax is hard in general, but if the edge potentials are attractive, a min-cut algorithm solves the binary case exactly; for the multiclass case, use the LP relaxation of multiway-cut. The score combines local evidence (per-point 0/1 costs) with spatial smoothness (penalties on disagreeing neighbors); a sketch of the binary case follows below.

[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]
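A sketch of the binary attractive case as an s–t min-cut (my construction, using networkx: unary costs sit on terminal edges, the smoothness penalty on neighbor edges):

    import networkx as nx

    def binary_mrf_mincut(unary, pairs):
        # unary: {i: (cost_if_0, cost_if_1)}; pairs: {(i, j): penalty if y_i != y_j}
        G = nx.DiGraph()
        for i, (c0, c1) in unary.items():
            G.add_edge("s", i, capacity=c0)   # cutting this edge <=> y_i = 0
            G.add_edge(i, "t", capacity=c1)   # cutting this edge <=> y_i = 1
        for (i, j), lam in pairs.items():     # attractive ("associative") smoothness
            G.add_edge(i, j, capacity=lam)
            G.add_edge(j, i, capacity=lam)
        _, (s_side, _) = nx.minimum_cut(G, "s", "t")
        return {i: 1 if i in s_side else 0 for i in unary}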


Scalable Algorithms

  • Batch and online

  • Linear in the size of the data

  • Iterate until convergence

    • For each example in the training sample

      • Run inference using current parameters (varies by method)

      • Online: Update parameters using computed example values

    • Batch: Update parameters using computed sample values

  • Structured SMO (Taskar et al, 03; Taskar 04)

  • Structured Exponentiated Gradient (Bartlett et al, 04)

  • Structured Extragradient (Taskar et al, 05)
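The loop structure these methods share, as a generic sketch (this is a plain online subgradient update, not the actual SMO/ExpGrad/extragradient steps):

    import numpy as np

    def online_train(data, feat, loss_aug_argmax, n_feats, C=1.0, eta=0.1, epochs=10):
        w = np.zeros(n_feats)
        for _ in range(epochs):                   # iterate until convergence
            for x, y in data:                     # for each example in the sample
                y_hat = loss_aug_argmax(w, x, y)  # inference with current parameters
                # online update: regularizer plus hinge subgradient for this example
                g = w + C * (feat(x, y_hat) - feat(x, y))
                w -= eta * g
        return w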


Experimental Setup

  • Standard Penn treebank split (2-21/22/23)

  • Generative baselines

    • Klein & Manning 03 and Collins 99

  • Discriminative

    • Basic = max-margin version of K&M 03

    • Lexical & Lexical + Aux

  • Lexical features (on constituent parts only): the spanned words x_s … x_e and their predicted tags t_s … t_e, plus the adjacent context x_{s−1}, x_{e+1} and t_{s−1}, t_{e+1}

  • Auxiliary features

    • Flat classifier using same features

    • Prediction of K&M 03 on each span


Results for Sentences ≤ 40 Words

*Trained only on sentences ≤20 words

*Taskar et al 04


Example

“The Egyptian president said he would visit Libya today to resume the talks.”

Generative model: “Libya today” is a base NP

Lexical model: “today” is a one-word constituent

