Conditional Random Fields

Sequence Labeling: The Problem

  • Given a sequence (in NLP, words), assign appropriate labels to each word.

  • For example, POS tagging:

The/DT  cat/NN  sat/VBD  on/IN  the/DT  mat/NN  ./.



Sequence Labeling: The Problem

  • Given a sequence (in NLP, words), assign appropriate labels to each word.

  • Another example, partial parsing (aka chunking):

The/B-NP  cat/I-NP  sat/B-VP  on/B-PP  the/B-NP  mat/I-NP



Sequence Labeling: The Problem

  • Given a sequence (in NLP, words), assign appropriate labels to each word.

  • Another example, relation extraction:

The/B-Arg  cat/I-Arg  sat/B-Rel  on/I-Rel  the/B-Arg  mat/I-Arg



The CRF Equation

  • A CRF model consists of

    • F = <f1, …, fk>, a vector of “feature functions”

    • θ = <θ1, …, θk>, a vector of weights for each feature function.

  • Let O = < o1, …, oT> be an observed sentence

  • Let X = <x1, …, xT> be the latent variables.

  • This is the same as the Maximum Entropy equation!
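The equation itself did not survive the transcript; the model it describes is presumably the standard one:

    P(X | O) = exp( Σ_k θ_k f_k(X, O) ) / Σ_{X'} exp( Σ_k θ_k f_k(X', O) )

That is, a log-linear model over whole label sequences X, normalized by summing over all competing label sequences X'.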



CRF Equation, standard format

  • Note that the denominator depends on O, but not on the label sequence X (it marginalizes over all possible label sequences).

  • Typically, we write the model with an explicit normalizer Z(O), as sketched below.
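The equations from this slide are missing from the transcript; the standard normalized form being referred to is presumably:

    P(X | O) = (1 / Z(O)) · exp( Σ_k θ_k f_k(X, O) ),   where   Z(O) = Σ_{X'} exp( Σ_k θ_k f_k(X', O) )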



Making Structured Predictions



Structured prediction vs. Text Classification

Recall: max. ent. for text classification:

CRFs for sequence labeling:

What’s the difference?
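The two equations shown on this slide are missing from the transcript; in standard notation they are presumably:

    Maximum entropy text classification:   P(c | d) = exp( Σ_k λ_k f_k(c, d) ) / Σ_{c'} exp( Σ_k λ_k f_k(c', d) )

    CRF sequence labeling:                 P(X | O) = exp( Σ_k θ_k f_k(X, O) ) / Σ_{X'} exp( Σ_k θ_k f_k(X', O) )

Here c ranges over a small set of classes for a document d, while X ranges over entire label sequences for the observation sequence O.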



Structured prediction vs. Text Classification

Two (related) differences, both for the sake of efficiency:

  • Feature functions in CRFs are restricted to graph parts (described later)

  • We can’t do brute force to compute the argmax. Instead, we do Viterbi.



Finding the Best Sequence

The best sequence is X* = argmax_X P(X | O).

Recall from the HMM discussion: if there are K possible states for each variable xi, and N total xi variables, then there are K^N possible settings for x.

So brute force can't find the best sequence.

Instead, we resort to a Viterbi-like dynamic program.


Viterbi Algorithm

[Trellis diagram: states x1 … xt-1, xt = hj over observations o1 … oT]

δt(j): the score of the state sequence which maximizes the score of seeing the observations to time t-1, landing in state hj at time t, and seeing the observation at time t


Viterbi Algorithm

[Trellis diagram: states x1 … xT over observations o1 … oT]

Compute the most likely state sequence by working backwards


Viterbi Algorithm

[Trellis diagram: states x1 … xt-1, xt = hj, xt+1 over observations o1 … oT]

Recursive Computation: ??! (the recursion can't be written yet; it is revisited in "Viterbi Algorithm – 2nd Try" below, once feature functions are restricted to graph parts)



Feature functions and Graph parts

To make efficient computation (dynamic programs) possible, we restrict the feature functions to:

Graph part (or just part): a feature function that counts how often a particular configuration occurs for a clique in the CRF graph.

Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.


Clique Example

The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.

[Figure: linear-chain CRF with label nodes x1 … x6 above observation nodes o1 … o6]


Clique Example

The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.

[Figure: the same CRF with the individual-node cliques {x1}, {x2}, …, {x6} highlighted]


Clique Example

The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.

[Figure: the same CRF with the pair-of-node cliques {x1, x2}, {x2, x3}, …, {x5, x6} highlighted]


Clique Example

For non-linear-chain CRFs (something we won't normally consider in this class), you can get larger cliques:

[Figure: the CRF from before with an additional label node x5′, illustrating larger cliques]


Graph part as Feature Function Example

[Figure: linear-chain CRF with labels x1=D, x2=N, x3=V, x4=D, x5=A, x6=N over observations o1 … o6]

Graph parts are feature functions f(x,o) that count how many cliques have a particular configuration.

For example, f(x,o) = count of [xi = Noun].

Here, x2 and x6 are both Nouns, so f(x,o) = 2.



Graph part as Feature Function Example

[Figure: the same CRF, with labels x1=D, x2=N, x3=V, x4=D, x5=A, x6=N over observations o1 … o6]

For a pair-of-nodes example,

f(x,o) = count of [xi = Noun, xi+1 = Verb]

Here, x2 is a Noun and x3 is a Verb, so f(x,o) = 1.

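A minimal Python sketch (not from the slides; the function names and list encoding are assumptions) of the two kinds of counting feature functions used in these examples:

    def count_node_part(x, o, label):
        """Count single-node cliques {x_i} with x_i = label (e.g. label = 'N' for Noun)."""
        return sum(1 for xi in x if xi == label)

    def count_pair_part(x, o, label1, label2):
        """Count consecutive-pair cliques with (x_i, x_{i+1}) = (label1, label2)."""
        return sum(1 for xi, xj in zip(x, x[1:]) if (xi, xj) == (label1, label2))

    x = ["D", "N", "V", "D", "A", "N"]          # x1..x6 from the example
    o = ["o1", "o2", "o3", "o4", "o5", "o6"]    # these particular parts ignore o

    print(count_node_part(x, o, "N"))           # 2  (x2 and x6)
    print(count_pair_part(x, o, "N", "V"))      # 1  (x2 = N, x3 = V)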


Features can depend on the whole observation

In a CRF, each feature function can depend on o, in addition to a clique in x.

Normally, we draw a CRF like this:

[Figure: an HMM and a CRF drawn side by side, each with label nodes x1 … x6 connected only to their own observations o1 … o6]


Features can depend on the whole observation

In a CRF, each feature function can depend on o, in addition to a clique in x.

But really, it's more like this:

[Figure: the HMM as before, next to a CRF in which every label node xi is connected to the whole observation sequence o1 … o6]

This would cause problems for a generative model, but in a conditional model, o is always a fixed constant. So we can still calculate relevant algorithms like Viterbi efficiently.


Graph part as Feature Function Example

The/D  cat/N  chased/V  the/D  tiny/A  fly/N

An example part including x and o:

f(x,o) = count of [xi = A or D, xi+1 = N, o2 = cat]

Here, x1 is a D and x2 is an N, plus x5 is an A and x6 is an N, plus o2 = cat, so f(x,o) = 2.

Notice that the clique x5-x6 is allowed to depend on o2.



Graph part as Feature Function Example

The/D  cat/N  chased/V  the/D  tiny/A  fly/N

A more usual example including x and o:

f(x,o) = count of [xi = A or D, xi+1 = N, oi+1 = cat]

Here, x1 is a D and x2 is an N, plus o2 = cat, so f(x,o) = 1.

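A sketch (an assumption, not the lecture's code) of this observation-dependent part, f(x,o) = count of [xi = A or D, xi+1 = N, oi+1 = cat]:

    def count_pair_obs_part(x, o):
        # count pair cliques (x_i, x_{i+1}) labeled (A-or-D, N) whose second word is "cat"
        return sum(
            1
            for i in range(len(x) - 1)
            if x[i] in ("A", "D") and x[i + 1] == "N" and o[i + 1] == "cat"
        )

    x = ["D", "N", "V", "D", "A", "N"]
    o = ["The", "cat", "chased", "the", "tiny", "fly"]
    print(count_pair_obs_part(x, o))   # 1: only the (x1, x2) pair also has o2 = "cat"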



The CRF Equation, with Parts

  • A CRF model consists of

    • P = <p1, …, pk>, a vector of parts

    • θ = <θ1, …, θk>, a vector of weights for each part.

  • Let O = < o1, …, oT> be an observed sentence

  • Let X = <x1, …, xT> be the latent variables.
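The equation image from this slide is not in the transcript; with parts in place of generic feature functions it presumably has the same form as before:

    P(X | O) = (1 / Z(O)) · exp( Σ_j θ_j p_j(X, O) ),   where   Z(O) = Σ_{X'} exp( Σ_j θ_j p_j(X', O) )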


Viterbi Algorithm – 2nd Try

[Trellis diagram: states x1 … xt-1, xt = hj, xt+1 over observations o1 … oT]

Recursive Computation
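The recursion itself did not survive the transcript. For a linear-chain CRF with node and pair-of-node parts, a standard reconstruction of the Viterbi score (the slide's exact notation may differ) is:

    δ_t(j) = max_i [ δ_{t-1}(i) + Σ_k θ_k f_k(x_{t-1} = h_i, x_t = h_j, O, t) ],   with   δ_1(j) = Σ_k θ_k f_k(x_1 = h_j, O, 1)

Back-pointers record the maximizing i at each step; the best label sequence is recovered by taking argmax_j δ_T(j) and tracing the back-pointers. A minimal Python sketch under the same assumptions (precomputed node and transition scores; names are illustrative, not the lecture's code):

    import numpy as np

    def viterbi(node_scores, trans_scores):
        # node_scores[t, j]  = sum_k theta_k * f_k(x_t = j, O, t)         (node parts)
        # trans_scores[i, j] = sum_k theta_k * f_k(x_{t-1} = i, x_t = j)  (pair parts)
        T, K = node_scores.shape
        delta = np.empty((T, K))
        backptr = np.zeros((T, K), dtype=int)
        delta[0] = node_scores[0]
        for t in range(1, T):
            # score of ending in state j at time t, for every previous state i
            cand = delta[t - 1][:, None] + trans_scores + node_scores[t][None, :]
            backptr[t] = cand.argmax(axis=0)
            delta[t] = cand.max(axis=0)
        # trace back-pointers from the best final state
        best = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            best.append(int(backptr[t, best[-1]]))
        return best[::-1], float(delta[-1].max())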



Supervised Parameter Estimation



Conditional Training

  • Given a set of observations o and the correct labels x for each, determine the best θ (the θ that maximizes the conditional likelihood of the labels given the observations).

  • Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way:

    • Determine the gradient

    • Step in the direction of the gradient

    • Repeat until convergence



Recall: Training a ME model

Training is an optimization problem:

find the value for λ that maximizes the conditional log-likelihood of the training data:
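The objective from this slide is missing from the transcript; the conditional log-likelihood it refers to is presumably (with c_d the correct class for document d):

    CLL(λ) = Σ_d log P_λ(c_d | d) = Σ_d [ Σ_k λ_k f_k(c_d, d) − log Σ_{c'} exp( Σ_k λ_k f_k(c', d) ) ]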



Recall: Training a ME model

Optimization is normally performed using some form of gradient ascent (equivalently, gradient descent on the negative CLL):

0) Initialize λ0 to 0

1) Compute the gradient: ∇CLL

2) Take a step in the direction of the gradient: λi+1 = λi + α ∇CLL

3) Repeat until the CLL doesn't improve: stop when |CLL(λi+1) – CLL(λi)| < ε
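A minimal Python sketch of this loop (the callables, step size α, and tolerance ε are illustrative assumptions, not the lecture's code):

    import numpy as np

    def train_me(compute_cll, compute_gradient, num_features, alpha=0.1, eps=1e-6):
        lam = np.zeros(num_features)       # 0) initialize lambda_0 to 0
        prev_cll = compute_cll(lam)
        while True:
            grad = compute_gradient(lam)   # 1) gradient of the conditional log-likelihood
            lam = lam + alpha * grad       # 2) step in the direction of the gradient
            cll = compute_cll(lam)
            if abs(cll - prev_cll) < eps:  # 3) stop when the CLL stops improving
                return lam
            prev_cll = cll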




Recall: Training a ME model

Computing the gradient:

Involves a sum over all possible classes
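The gradient expression from this slide is missing from the transcript; the standard form (observed feature counts minus expected feature counts) is presumably:

    ∂CLL/∂λ_k = Σ_d f_k(c_d, d) − Σ_d Σ_{c'} P_λ(c' | d) f_k(c', d)

The second term, the expected feature counts, is the sum over all possible classes c' referred to above.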



Recall: Training a ME model: Expected feature counts

  • In ME models, each document d is classified independently.

  • The sum involves as many terms as there are classes c’.

  • Very doable.



Training a CRF

The hard part for CRFs



Training a CRF: Expected feature counts

  • For CRFs, the corresponding term (the expected feature counts, spelled out below) involves an exponential sum.

  • The solution again involves dynamic programming, very similar to the Forward algorithm for HMMs.
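The term itself does not appear in the transcript; for a CRF the expected feature counts are presumably:

    Σ_{X'} P_θ(X' | O) f_k(X', O)

which, done naively, sums over all K^T possible label sequences X'.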



CRFs vs. HMMs



Generative (Joint Probability) Models

  • HMMs are generative models: That is, they can compute the joint probability

    P(sentence, hidden-states)

  • From a generative model, one can compute

    • Two conditional models:

      • P(sentence | hidden-states) and

      • P(hidden-states | sentence)

    • Marginal models P(sentence) and P(hidden-states)

  • For sequence labeling, we want

    P(hidden-states | sentence)



Discriminative (Conditional) Models

  • Most often, people are interested in the conditional probability

    P(hidden-states | sentence)

    For example, this is the distribution needed for sequence labeling.

  • Discriminative (also called conditional) models directly represent the conditional distribution

    P(hidden-states | sentence)

    • These models cannot tell you the joint distribution, marginals, or other conditionals.

    • But they’re quite good at this particular conditional distribution.



Discriminative vs. Generative


CRFs vs. HMMs, a closer look

It's possible to convert an HMM into a CRF:

Set p_prior,state(x,o) = count[x1 = state]
Set θ_prior,state = log P_HMM(x1 = state) = log π_state

Set p_trans,state1,state2(x,o) = count[xi = state1, xi+1 = state2]
Set θ_trans,state1,state2 = log P_HMM(xi+1 = state2 | xi = state1) = log A_state1,state2

Set p_obs,state,word(x,o) = count[xi = state, oi = word]
Set θ_obs,state,word = log P_HMM(oi = word | xi = state) = log B_state,word
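A sketch (an assumption, not the lecture's code) of this conversion, packing HMM parameters π (initial), A (transition), and B (emission) into CRF weights θ for the prior, transition, and observation parts above:

    import numpy as np

    def hmm_to_crf_weights(pi, A, B):
        K, V = B.shape                            # K states, V word types
        theta = {}
        for s in range(K):
            theta[("prior", s)] = np.log(pi[s])
            for s2 in range(K):
                theta[("trans", s, s2)] = np.log(A[s, s2])
            for w in range(V):
                theta[("obs", s, w)] = np.log(B[s, w])
        return theta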



CRF vs. HMM, a closer look

If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities.

  • Therefore, they will all be between –∞ and 0

    Notice: CRF parameters can be between –∞ and +∞.

    So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)?

  • HMMs have more bias

  • CRFs have more variance



Comparing feature functions

The biggest advantage of CRFs over HMMs is that they can handle overlapping features.

For example, for POS tagging, using words as features (like oi = “the” or oi = “jogging”) is quite useful.

However, it’s often also useful to use “orthographic” features, like “the word ends in –ing” or “the word starts with a capital letter.”

These features overlap: a word like “jogging” is both the word “jogging” and a word that ends in –ing.

  • Generative models have trouble handling overlapping features correctly

  • Discriminative models don’t: they can simply use the features.



CRF Example

A CRF POS Tagger for English



Vocabulary

We need to determine the set of possible word types V.

Let V = {all types in 1 million tokens of Wall Street Journal text, which we'll use for training} ∪ {UNKNOWN} (for word types we haven't seen)



L = Label Set

Standard Penn Treebank tagset





CRF Features

