Conditional Random Fields

Advanced Statistical Methods in NLP

Ling 572

February 9, 2012


Roadmap

  • Graphical Models

    • Modeling independence

    • Models revisited

    • Generative & discriminative models

  • Conditional random fields

    • Linear chain models

    • Skip chain models


Preview

  • Conditional random fields

    • Undirected graphical model

      • Due to Lafferty, McCallum, and Pereira, 2001

    • Discriminative model

      • Supports integration of rich feature sets

    • Allows range of dependency structures

      • Linear-chain, skip-chain, general

      • Can encode long-distance dependencies

    • Used in diverse NLP sequence labeling tasks:

      • Named entity recognition, coreference resolution, etc.


Graphical Models

  • Graphical model

    • Simple, graphical notation for conditional independence

    • Probabilistic model where:

      • Graph structure denotes conditional independence between random variables

      • Nodes: random variables

      • Edges: dependency relation between random variables

  • Model types:

    • Bayesian Networks

    • Markov Random Fields


Modeling (In)dependence

  • Bayesian network

    • Directed acyclic graph (DAG)

      • Nodes = Random Variables

      • Arc ~ directly influences, conditional dependency

    • Arcs = Child depends on parent(s)

      • No arcs = independent (0 incoming: only a priori)

      • Parents of X = Pa(X)

      • For each X, need P(X | Pa(X))
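The factorization implied by these per-node conditional tables is the standard Bayesian network chain rule (stated here for reference; the formula image from the original slide is not preserved):

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```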


Example I

Russell & Norvig, AIMA


Simple Bayesian Network

  • MCBN1

Graph: A → B, A → C, B → D, C → D, C → E

Need:        Truth table size:
P(A)         2          A = only a priori
P(B|A)       2*2        B depends on A
P(C|A)       2*2        C depends on A
P(D|B,C)     2*2*2      D depends on B,C
P(E|C)       2*2        E depends on C


Holmes Example (Pearl)

Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes’ concern, might (30%) call even if the alarm is silent. Holmes’ other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.


Holmes Example: Model

There are four binary random variables:

B: whether Holmes’ house has been burgled

A: whether his alarm sounded

W: whether Watson called

G: whether Gibbons called

Graph: B → A, A → W, A → G


Holmes Example: Tables

P(B):            B=#t     B=#f
                 0.0001   0.9999

P(A|B):          A=#t     A=#f
  B=#t           0.95     0.05
  B=#f           0.01     0.99

P(W|A):          W=#t     W=#f
  A=#t           0.90     0.10
  A=#f           0.30     0.70

P(G|A):          G=#t     G=#f
  A=#t           0.40     0.60
  A=#f           0.10     0.90
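A minimal sketch (not from the original slides) that plugs these tables into the factorization P(B,A,W,G) = P(B) P(A|B) P(W|A) P(G|A) and computes P(burglary | Watson called) by enumerating the unobserved variables:

```python
from itertools import product

# CPTs from the slide, stored as P(variable = True | parent value)
P_B = 0.0001                     # prior probability of a burglary
P_A = {True: 0.95, False: 0.01}  # P(A=t | B)
P_W = {True: 0.90, False: 0.30}  # P(W=t | A)
P_G = {True: 0.40, False: 0.10}  # P(G=t | A)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(b, a, w, g):
    """P(B,A,W,G) = P(B) P(A|B) P(W|A) P(G|A)."""
    return (bern(P_B, b) * bern(P_A[b], a) *
            bern(P_W[a], w) * bern(P_G[a], g))

# P(B=t | W=t): enumerate the hidden variables A and G
num = sum(joint(True, a, True, g) for a, g in product([True, False], repeat=2))
den = sum(joint(b, a, True, g) for b, a, g in product([True, False], repeat=3))
print("P(burglary | Watson called) =", num / den)
```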


Bayes’ Nets: Markov Property

  • Bayes’ nets:

    • Satisfy the local Markov property

      • Variables are conditionally independent of their non-descendants given their parents


Simple Bayesian Network

  • MCBN1

Graph: A → B, A → C, B → D, C → D, C → E

A = only a priori

B depends on A

C depends on A

D depends on B,C

E depends on C

P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)

There exist algorithms for training and inference on BNs.
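To make the factorization concrete, here is a minimal sketch with made-up CPT values for the MCBN1 structure (only the structure comes from the slide; the numbers are illustrative). Summing the joint over all 2^5 assignments returns 1.0, showing that the small per-node tables define a full joint distribution:

```python
from itertools import product

# Illustrative CPTs for the MCBN1 structure: P(variable = True | parent values)
P_A = 0.3
P_B = {True: 0.8, False: 0.2}                      # keyed by A
P_C = {True: 0.6, False: 0.1}                      # keyed by A
P_D = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.05}   # keyed by (B, C)
P_E = {True: 0.7, False: 0.2}                      # keyed by C

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)."""
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c) *
            bern(P_D[(b, c)], d) * bern(P_E[c], e))

# Far fewer parameters than a full 2^5 table, yet a proper joint distribution:
print(sum(joint(*v) for v in product([True, False], repeat=5)))  # -> 1.0
```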


Naïve Bayes Model

  • Bayes’ Net:

    • Conditional independence of features given class

Graph: Y → f1, f2, f3, …, fk
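This conditional independence assumption gives the standard Naïve Bayes factorization (standard form, stated here since the slide’s formula image is not preserved):

```latex
P(y, f_1, \dots, f_k) \;=\; P(y) \prod_{i=1}^{k} P(f_i \mid y)
```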


Hidden Markov Model

  • Bayesian Network where:

    • yt depends on yt-1

    • xt depends on yt

Graph: y1 → y2 → y3 → … → yk, with yt → xt for each t
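Together these dependencies give the usual HMM joint factorization (standard form, not preserved from the slide images), with P(y_1 | y_0) read as the initial state distribution:

```latex
P(x_1, \dots, x_T, y_1, \dots, y_T) \;=\; \prod_{t=1}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)
```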


Generative Models

  • Both Naïve Bayes and HMMs are generative models

  • We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y.

    • (Sutton & McCallum, 2006)

  • State y generates an observation (instance) x

  • Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts


    Markov Random Fields

    • aka Markov Network

    • Graphical representation of probabilistic model

      • Undirected graph

        • Can represent cyclic dependencies

        • (vs DAG in Bayesian Networks, can represent induced dep)

    • Also satisfy the local Markov property:

      • P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X


    Factorizing MRFs

    • Many MRFs can be analyzed in terms of cliques

      • Clique: in an undirected graph G(V,E), a clique is a subset of the vertices V such that every pair of vertices vi, vj in the subset is connected by an edge (vi,vj) in E

      • Maximal clique: a clique that cannot be extended by adding another vertex

      • Maximum clique: the largest clique in G

    Example due to F. Xia


    MRFs

    • Given an undirected graph G(V,E), random vars: X

    • Cliques over G: cl(G)

    Example due to F. Xia
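    The distribution itself (the formula on the original slide is not preserved) is standardly written as a normalized product of potential functions over the cliques cl(G):

```latex
P(\mathbf{x}) \;=\; \frac{1}{Z} \prod_{C \in cl(G)} \psi_C(\mathbf{x}_C),
\qquad
Z \;=\; \sum_{\mathbf{x}'} \prod_{C \in cl(G)} \psi_C(\mathbf{x}'_C)
```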


    Conditional Random Fields

    • Definition due to Lafferty et al, 2001:

      • Let G = (V,E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.

    • A CRF is a Markov Random Field globally conditioned on the observation X, and has the form:
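    The formula from the slide image is not preserved; the standard statement of this form (following Lafferty et al. and Sutton & McCallum) is a product of clique potentials normalized per observation X:

```latex
p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \prod_{C \in cl(G)} \psi_C(\mathbf{y}_C, \mathbf{x}),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \prod_{C \in cl(G)} \psi_C(\mathbf{y}'_C, \mathbf{x})
```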


    Linear-Chain CRF

    • CRFs can have arbitrary graphical structure, but..

    • Most common form is linear chain

      • Supports sequence modeling

      • Many sequence labeling NLP problems:

        • Named Entity Recognition (NER), Coreference

      • Similar to combining HMM sequence w/MaxEnt model

        • Supports sequence structure like HMM

          • but HMMs can’t do rich feature structure

        • Supports rich, overlapping features like MaxEnt

          • but MaxEnt doesn’t directly support sequence labeling


    Discriminative & Generative

    • Model perspectives (Sutton & McCallum)


    Linear-Chain CRFs

    • Feature functions:

      • In MaxEnt: f: X × Y → {0,1}

        • e.g. fj(x,y) = 1 if x=“rifle” and y=talk.politics.guns, 0 otherwise

      • In CRFs: f: Y × Y × X × T → R

        • e.g. fk(yt, yt-1, x, t) = 1 if yt=V and yt-1=N and xt=“flies”, 0 otherwise

        • frequently an indicator function, for efficiency
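    A minimal sketch of such feature functions and how their weighted sum scores a linear-chain tag sequence. The specific features, weights, and start symbol below are illustrative assumptions, and the normalizer Z(x) is omitted:

```python
import math

def f_V_after_N_flies(y_t, y_prev, x, t):
    """Indicator: current tag V, previous tag N, current word 'flies'."""
    return 1.0 if (y_t == "V" and y_prev == "N" and x[t] == "flies") else 0.0

def f_cap_is_NNP(y_t, y_prev, x, t):
    """Indicator: current word is capitalized and tagged NNP."""
    return 1.0 if (x[t][0].isupper() and y_t == "NNP") else 0.0

# One weight lambda_k per feature function (values made up for illustration)
weighted_features = [(f_V_after_N_flies, 1.7), (f_cap_is_NNP, 0.9)]

def unnormalized_prob(y, x):
    """exp( sum_t sum_k lambda_k * f_k(y_t, y_{t-1}, x, t) ); dividing by Z(x),
    the sum of this quantity over all tag sequences, would give p(y | x)."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "<s>"   # assumed start symbol
        for f_k, lam in weighted_features:
            total += lam * f_k(y[t], y_prev, x, t)
    return math.exp(total)

print(unnormalized_prob(["N", "V", "RB"], ["time", "flies", "fast"]))
```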


    Linear-chain CRFs: Training & Decoding

    • Training:

      • Learn λj

      • Approach similar to MaxEnt: e.g. L-BFGS

    • Decoding:

      • Compute label sequence that optimizes P(y|x)

      • Can use approaches like HMM, e.g. Viterbi
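    A compact Viterbi sketch for this decoding step. It assumes the local contributions have already been collapsed into a function score(y_prev, y, t); for a linear-chain CRF that score would be sum_k lambda_k * f_k(y, y_prev, x, t), an assumption about how the rest of the decoder is organized:

```python
def viterbi(tags, score, T):
    """Return the tag sequence maximizing sum_t score(y_prev, y_t, t).

    tags:  list of candidate labels
    score: function (y_prev, y, t) -> local log-score; y_prev is None at t=0
    T:     sequence length
    """
    # best[t][y]: best log-score of any partial path ending in tag y at position t
    best = [{y: score(None, y, 0) for y in tags}]
    back = [{}]
    for t in range(1, T):
        best.append({})
        back.append({})
        for y in tags:
            prev = {yp: best[t - 1][yp] + score(yp, y, t) for yp in tags}
            yp_best = max(prev, key=prev.get)
            best[t][y] = prev[yp_best]
            back[t][y] = yp_best
    # Trace back from the best final tag to recover the full sequence
    y_last = max(best[T - 1], key=best[T - 1].get)
    path = [y_last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```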


    Skip-chain CRFs


    Motivation

    • Long-distance dependencies:

      • Linear-chain CRFs, HMMs, beam search, etc.

      • All make local Markov assumptions

        • Preceding label; current data given current label

        • Good for some tasks

      • However, longer context can be useful

        • e.g. NER: Repeated capitalized words should get same tag


    Skip-Chain CRFs

    • Basic approach:

      • Augment the linear-chain CRF model with long-distance ‘skip edges’

        • Add evidence from both endpoints

    • Which edges?

      • Identical words, words with same stem?

    • How many edges?

      • Not too many, increases inference cost
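    A small sketch of the edge-selection heuristic above (connect identical capitalized words); the exact matching criterion used by Sutton & McCallum may differ in detail, so treat this as illustrative:

```python
def skip_edges(words):
    """Return (i, j) position pairs connecting later occurrences of a
    capitalized word to its earlier occurrences."""
    seen = {}          # capitalized word -> positions seen so far
    edges = []
    for j, w in enumerate(words):
        if w[:1].isupper():
            for i in seen.get(w, []):
                edges.append((i, j))
            seen.setdefault(w, []).append(j)
    return edges

print(skip_edges("Speaker : Robert Frederking Robert will talk at 3 pm".split()))
# -> [(2, 4)]  (the two occurrences of "Robert")
```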


    Skip Chain CRF Model

    • Two clique templates:

      • Standard linear chain template

      • Skip edge template
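    The template formulas from the slide are not preserved; in the standard presentation (Sutton & McCallum), the linear-chain template and the skip-edge template combine as:

```latex
p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
\exp\Bigl( \sum_{t}\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t)
\;+\; \sum_{(u,v)\in\mathcal{I}}\sum_{l} \mu_l\, g_l(y_u, y_v, \mathbf{x}, u, v) \Bigr)
```

    where I is the set of skip-edge pairs (u, v).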


    Skip Chain NER

    • Named Entity Recognition:

      • Task: start time, end time, speaker, location

        • In corpus of seminar announcement emails

    • All approaches:

      • Orthographic, gazetteer, POS features

        • Within preceding, following 4 word window

    • Skip chain CRFs:

      • Skip edges between identical capitalized words


    NER Features


    Skip Chain NER Results

    • Skip chain improves substantially on ‘speaker’ recognition

    • Slight reduction in accuracy for times


    Summary

    • Conditional random fields (CRFs)

      • Undirected graphical model

        • Compare with Bayesian Networks, Markov Random Fields

      • Linear-chain models

        • HMM sequence structure + MaxEnt feature models

      • Skip-chain models

        • Augment with longer distance dependencies

      • Pros: Good performance

      • Cons: Compute intensive


    HW #5


    HW #5: Beam Search

    • Apply Beam Search to MaxEnt sequence decoding

    • Task: POS tagging

    • Given files:

      • test data: usual format

      • boundary file: sentence lengths

      • model file

    • Comparisons:

      • Different topN, topK, beam_width
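    A minimal beam-search sketch for this setup. The pruning semantics of topN, topK, and beam_width below (top N tags per word, at most topK surviving paths, and paths dropped when their log-probability falls more than beam_width below the best) are one interpretation and may differ from what the assignment specifies:

```python
import math

def beam_search(words, tag_probs, topN, topK, beam_width):
    """Decode a tag sequence with beam search.

    tag_probs(history, word) -> dict {tag: P(tag | history, word)} from the
    MaxEnt model; 'history' is the list of tags chosen so far.
    """
    beam = [([], 0.0)]                       # (tag sequence so far, log-prob)
    for w in words:
        candidates = []
        for tags, logp in beam:
            probs = tag_probs(tags, w)
            best_tags = sorted(probs, key=probs.get, reverse=True)[:topN]
            for t in best_tags:
                candidates.append((tags + [t], logp + math.log(probs[t])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        best_logp = candidates[0][1]
        # Keep paths within beam_width of the best, at most topK of them
        beam = [c for c in candidates if c[1] >= best_logp - beam_width][:topK]
    return beam[0]                            # best (tag sequence, log-prob)
```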


    Tag Context

    • Following Ratnaparkhi ‘96, model uses previous tag (prevT=tag) and previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1})

    • These are NOT in the data file; you compute them on the fly.

    • Notes:

      • Due to sparseness, it is possible a bigram may not appear in the model file. Skip it.

      • These are feature functions: If you have a different candidate tag for the same word, weights will differ.
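    A sketch of computing these features on the fly during decoding; the feature-string formats and the BOS padding are assumptions that should be checked against the actual model file:

```python
def tag_context_features(prev_tags):
    """Build prevT / prevTwoTags features from the tags chosen so far.

    prev_tags: tags already assigned to the preceding words in the sentence.
    Feature string formats and "BOS" padding are assumptions; match them to
    the model file.
    """
    prev1 = prev_tags[-1] if len(prev_tags) >= 1 else "BOS"
    prev2 = prev_tags[-2] if len(prev_tags) >= 2 else "BOS"
    return ["prevT=" + prev1, "prevTwoTags=" + prev2 + "+" + prev1]

def score(weights, label, static_feats, prev_tags):
    """Sum model weights for (label, feature) pairs, skipping unseen features."""
    total = 0.0
    for f in static_feats + tag_context_features(prev_tags):
        total += weights.get((label, f), 0.0)   # skip features not in the model
    return total
```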


    Uncertainty

    • Real world tasks:

      • Partially observable, stochastic, extremely complex

      • Probabilities capture “Ignorance & Laziness”

        • Lack relevant facts, conditions

        • Failure to enumerate all conditions, exceptions


    Motivation

    • Uncertainty in medical diagnosis

      • Diseases produce symptoms

      • In diagnosis, observed symptoms => disease ID

      • Uncertainties

        • Symptoms may not occur

        • Symptoms may not be reported

        • Diagnostic tests not perfect

          • False positive, false negative

    • How do we estimate confidence?

