Conditional Random Fields

Advanced Statistical Methods in NLP

Ling 572

February 9, 2012


Roadmap

  • Graphical Models

    • Modeling independence

    • Models revisited

    • Generative & discriminative models

  • Conditional random fields

    • Linear chain models

    • Skip chain models


Preview

  • Conditional random fields

    • Undirected graphical model

      • Due to Lafferty, McCallum, and Pereira, 2001

    • Discriminative model

      • Supports integration of rich feature sets

    • Allows a range of dependency structures

      • Linear-chain, skip-chain, general

      • Can encode long-distance dependencies

    • Used in diverse NLP sequence labeling tasks:

      • Named entity recognition, coreference resolution, etc.


Graphical Models

  • Graphical model

    • Simple, graphical notation for conditional independence

    • Probabilistic model where:

      • Graph structure denotes conditional independence between random variables

      • Nodes: random variables

      • Edges: dependency relations between random variables

  • Model types:

    • Bayesian Networks

    • Markov Random Fields


Modeling (In)dependence

  • Bayesian network

    • Directed acyclic graph (DAG)

      • Nodes = Random Variables

      • Arc ~ directly influences, conditional dependency

    • Arcs = child depends on parent(s)

      • No arcs = independent (0 incoming: only a priori)

      • Parents of X = Pa(X)

      • For each X, need P(X | Pa(X))

Example I

  (Figure from Russell & Norvig, AIMA; not reproduced in the transcript)


Simple Bayesian Network

  • MCBN1: A → B, A → C, B → D, C → D, C → E

  Need:       Truth table:   Dependencies:
  P(A)        2              A = only a priori
  P(B|A)      2*2            B depends on A
  P(C|A)      2*2            C depends on A
  P(D|B,C)    2*2*2          D depends on B,C
  P(E|C)      2*2            E depends on C


Holmes Example (Pearl)

Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes’ concern, might (30%) call even if the alarm is silent. Holmes’ other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.


Holmes Example: Model

  (Graph: B → A, A → W, A → G)

There are four binary random variables:

B: whether Holmes’ house has been burgled

A: whether his alarm sounded

W: whether Watson called

G: whether Gibbons called


Holmes Example: Tables

P(B):
          B=#t     B=#f
          0.0001   0.9999

P(A|B):
  B       A=#t     A=#f
  #t      0.95     0.05
  #f      0.01     0.99

P(W|A):
  A       W=#t     W=#f
  #t      0.90     0.10
  #f      0.30     0.70

P(G|A):
  A       G=#t     G=#f
  #t      0.40     0.60
  #f      0.10     0.90
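
Below is a small sketch (not from the slides) showing how these tables support inference in the network, computing P(B=#t | W=#t) by enumeration; variable names follow the model above.

```python
# Holmes network CPTs, transcribed from the tables above (True = #t, False = #f).
P_B = {True: 0.0001, False: 0.9999}                      # P(B)
P_A_given_B = {True: {True: 0.95, False: 0.05},          # P(A | B)
               False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10},          # P(W | A)
               False: {True: 0.30, False: 0.70}}
P_G_given_A = {True: {True: 0.40, False: 0.60},          # P(G | A)
               False: {True: 0.10, False: 0.90}}

def joint(b, a, w, g):
    """P(B,A,W,G) = P(B) P(A|B) P(W|A) P(G|A), following the graph B -> A -> {W, G}."""
    return P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w] * P_G_given_A[a][g]

def p_burglary_given_watson():
    """P(B=#t | W=#t): sum out A and G, then normalize."""
    tf = (True, False)
    num = sum(joint(True, a, True, g) for a in tf for g in tf)
    den = sum(joint(b, a, True, g) for b in tf for a in tf for g in tf)
    return num / den

print(p_burglary_given_watson())   # about 2.8e-4 with these numbers
```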


Bayes’ Nets: Markov Property

  • Bayes’ Nets:

    • Satisfy the local Markov property

      • Variables: conditionally independent of non-descendants given their parents


Simple Bayesian Network

  • MCBN1: A → B, A → C, B → D, C → D, C → E

A = only a priori

B depends on A

C depends on A

D depends on B,C

E depends on C

P(A,B,C,D,E) = P(A)P(B|A)P(C|A)P(D|B,C)P(E|C)

There exist algorithms for training and inference on BNs.
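
A minimal sketch of this factorization in code (the CPT numbers below are placeholders for illustration, not values from the slides):

```python
# Conditional probability tables for MCBN1; all variables are Boolean.
cpt = {
    "A":     {True: 0.3, False: 0.7},
    "B|A":   {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}},
    "C|A":   {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}},
    "D|B,C": {(True, True): {True: 0.9, False: 0.1},
              (True, False): {True: 0.6, False: 0.4},
              (False, True): {True: 0.4, False: 0.6},
              (False, False): {True: 0.05, False: 0.95}},
    "E|C":   {True: {True: 0.7, False: 0.3}, False: {True: 0.1, False: 0.9}},
}

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)."""
    return (cpt["A"][a] * cpt["B|A"][a][b] * cpt["C|A"][a][c]
            * cpt["D|B,C"][(b, c)][d] * cpt["E|C"][c][e])

# The factored tables need 2 + 4 + 4 + 8 + 4 = 22 entries (the truth-table
# counts given earlier), versus 2^5 = 32 entries for the full joint distribution.
print(joint(True, True, False, True, False))
```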


Naïve Bayes Model

  • Bayes’ Net:

    • Conditional independence of features given class

  (Graph: class node Y with feature children f1, f2, f3, …, fk)
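
The accompanying formula did not survive the transcript; the standard Naïve Bayes factorization corresponding to this structure is:

  P(Y, f1, …, fk) = P(Y) · P(f1|Y) · P(f2|Y) · … · P(fk|Y)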


Hidden Markov Model

  • Bayesian Network where:

    • yt depends on yt-1

    • xt depends on yt

  (Graph: y1 → y2 → y3 → … → yk, with each state yt emitting observation xt)
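
The joint distribution this structure encodes (standard HMM factorization, stated here since the slide formula is not in the transcript) is:

  P(x1, …, xT, y1, …, yT) = Π_t P(yt | yt-1) · P(xt | yt)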


Generative Models

  • Both Naïve Bayes and HMMs are generative models

  • “We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y.”

    • (Sutton & McCallum, 2006)

  • State y generates an observation (instance) x

  • Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative counterparts


Markov Random Fields

  • aka Markov Network

  • Graphical representation of a probabilistic model

    • Undirected graph

      • Can represent cyclic dependencies

      • (vs. the DAG in Bayesian Networks, which can represent induced dependencies)

  • Also satisfy a local Markov property, where ne(X) denotes the neighbors of X:
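
The formula itself is not in the transcript; the standard statement of the property is:

  P(X | all other variables) = P(X | ne(X))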


Factorizing MRFs

  • Many MRFs can be analyzed in terms of cliques

    • Clique: in an undirected graph G(V,E), a clique is a subset of vertices v in V, s.t. for every pair of vertices vi, vj, there exists E(vi,vj)

    • Maximal clique: a clique that cannot be extended

    • Maximum clique: the largest clique in G

  (Example graph over nodes A, B, C, D, E, used to identify cliques, maximal cliques, and the maximum clique; example due to F. Xia)


MRFs

  • Given an undirected graph G(V,E), random vars: X

  • Cliques over G: cl(G)

  (Example due to F. Xia)
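
The factorization formula on this slide is not in the transcript; the standard MRF factorization over cliques is:

  P(X) = (1/Z) Π_{C in cl(G)} φ_C(X_C),  where Z = Σ_X Π_{C in cl(G)} φ_C(X_C)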


Conditional Random Fields

  • Definition due to Lafferty et al., 2001:

    • “Let G = (V,E) be a graph such that Y = (Yv), v in V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Yv obey the Markov property with respect to the graph: p(Yv | X, Yw, w ≠ v) = p(Yv | X, Yw, w ∼ v), where w ∼ v means that w and v are neighbors in G.”

  • A CRF is a Markov Random Field globally conditioned on the observation X, and has the form:
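
The formula itself is not in the transcript; a standard way to write it (as in Lafferty et al. and Sutton & McCallum) is:

  p(y | x) = (1/Z(x)) Π_{C in cl(G)} φ_C(y_C, x),  where Z(x) = Σ_y Π_{C in cl(G)} φ_C(y_C, x)

For the linear-chain case this becomes:

  p(y | x) = (1/Z(x)) exp( Σ_t Σ_k λk fk(yt, yt-1, x, t) )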


Linear-Chain CRF

  • CRFs can have arbitrary graphical structure, but…

  • The most common form is the linear chain

    • Supports sequence modeling

    • Many sequence labeling NLP problems:

      • Named Entity Recognition (NER), Coreference

    • Similar to combining HMM sequence structure with a MaxEnt model

      • Supports sequence structure like an HMM

        • but HMMs can’t do rich feature structure

      • Supports rich, overlapping features like MaxEnt

        • but MaxEnt doesn’t directly support sequence labeling


Discriminative & Generative

  • Model perspectives (Sutton & McCallum)


Linear-Chain CRFs

  • Feature functions:

    • In MaxEnt: f: X × Y → {0,1}

      • e.g. fj(x,y) = 1 if x = “rifle” and y = talk.politics.guns, 0 otherwise

    • In CRFs: f: Y × Y × X × T → R

      • e.g. fk(yt, yt-1, x, t) = 1 if yt = V and yt-1 = N and xt = “flies”, 0 otherwise

      • Frequently an indicator function, for efficiency (see the sketch below)
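
A minimal sketch of the CRF feature function from the example above (the sentence used here is just for illustration):

```python
def f_k(y_t, y_prev, x, t):
    """Indicator feature: fires on an N -> V transition when the current word is 'flies'."""
    return 1 if (y_t == "V" and y_prev == "N" and x[t] == "flies") else 0

x = ["time", "flies", "like", "an", "arrow"]
print(f_k("V", "N", x, 1))   # 1: 'flies' tagged V, preceded by an N
print(f_k("N", "N", x, 1))   # 0: the feature does not fire
```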




Linear-Chain CRFs: Training & Decoding

  • Training:

    • Learn the weights λj

    • Approach similar to MaxEnt: e.g. L-BFGS

  • Decoding:

    • Compute the label sequence that optimizes P(y|x)

    • Can use approaches like those for HMMs, e.g. Viterbi (sketched below)
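
A sketch of Viterbi decoding for a linear-chain CRF, under the assumption that score(y_prev, y_t, x, t) returns Σ_k λk fk(yt, yt-1, x, t); Z(x) can be ignored at decoding time since it does not depend on y:

```python
def viterbi(x, labels, score):
    """Return the label sequence with the highest total (unnormalized) score."""
    # best[t][y] = (best score of any sequence ending in y at position t, that sequence)
    best = [{y: (score(None, y, x, 0), [y]) for y in labels}]   # y_prev=None marks the start
    for t in range(1, len(x)):
        column = {}
        for y in labels:
            prev_y, (prev_score, prev_path) = max(
                best[t - 1].items(),
                key=lambda kv: kv[1][0] + score(kv[0], y, x, t))
            column[y] = (prev_score + score(prev_y, y, x, t), prev_path + [y])
        best.append(column)
    return max(best[-1].values(), key=lambda sp: sp[0])[1]

# Toy scoring function standing in for the weighted feature sum.
def score(y_prev, y_t, x, t):
    s = 1.0 if (x[t] == "flies" and y_t == "V") else 0.0
    s += 0.5 if (y_prev == "N" and y_t == "V") else 0.0
    return s

print(viterbi(["time", "flies"], ["N", "V"], score))   # ['N', 'V']
```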



Motivation

  • Long-distance dependencies:

    • Linear-chain CRFs, HMMs, beam search, etc.

    • All make very local Markov assumptions

      • Preceding label; current data given current label

      • Good for some tasks

    • However, longer context can be useful

      • e.g. NER: repeated capitalized words should get the same tag


Skip-Chain CRFs

  • Basic approach:

    • Augment the linear-chain CRF model with long-distance ‘skip edges’

      • Add evidence from both endpoints

  • Which edges?

    • Identical words, words with the same stem?

  • How many edges?

    • Not too many: more edges increase inference cost


Skip Chain CRF Model

  • Two clique templates:

    • Standard linear-chain template

    • Skip-edge template
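
The template formulas are not in the transcript; following Sutton & McCallum’s formulation, the skip-chain distribution can be written as:

  p(y | x) = (1/Z(x)) Π_t Ψt(yt, yt-1, x) Π_{(u,v) in I} Ψuv(yu, yv, x)

where the first product ranges over the linear-chain cliques, I is the set of skip edges, and Ψuv are the skip-edge potentials.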


Skip Chain NER

  • Named Entity Recognition:

    • Task: start time, end time, speaker, location

      • In a corpus of seminar announcement emails

  • All approaches:

    • Orthographic, gazetteer, POS features

      • Within a window of the 4 preceding and following words

  • Skip-chain CRFs:

    • Skip edges between identical capitalized words



Skip Chain NER Results

  • Skip chain improves substantially on ‘speaker’ recognition

  • Slight reduction in accuracy for times


Summary

  • Conditional random fields (CRFs)

    • Undirected graphical model

      • Compare with Bayesian Networks, Markov Random Fields

    • Linear-chain models

      • HMM sequence structure + MaxEnt feature models

    • Skip-chain models

      • Augment with longer-distance dependencies

    • Pros: Good performance

    • Cons: Compute intensive



HW #5: Beam Search

  • Apply beam search to MaxEnt sequence decoding

  • Task: POS tagging

  • Given files:

    • test data: usual format

    • boundary file: sentence lengths

    • model file

  • Comparisons:

    • Different topN, topK, beam_width
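
A rough sketch of the intended decoder (the exact pruning rules and file handling are specified in the assignment; prob_dist here is a hypothetical stand-in for the MaxEnt model):

```python
import math

def beam_search(words, prob_dist, topN=3, topK=10, beam_width=2.0):
    """Beam search over tag sequences for one sentence.

    prob_dist(words, t, history) should return {tag: P(tag | context)} from the
    MaxEnt model. Keep the topN tags per word, then prune to hypotheses within
    beam_width (in log-prob) of the best, capped at topK.
    """
    beams = [([], 0.0)]                                   # (tags so far, log-prob)
    for t in range(len(words)):
        candidates = []
        for tags, logp in beams:
            dist = sorted(prob_dist(words, t, tags).items(),
                          key=lambda kv: kv[1], reverse=True)[:topN]
            for tag, p in dist:
                candidates.append((tags + [tag], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        best_logp = candidates[0][1]
        beams = [c for c in candidates if c[1] >= best_logp - beam_width][:topK]
    return beams[0][0]                                    # highest-scoring tag sequence
```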


Tag Context

  • Following Ratnaparkhi ’96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1})

  • These are NOT in the data file; you compute them on the fly.

  • Notes:

    • Due to sparseness, it is possible that a bigram may not appear in the model file. Skip it.

    • These are feature functions: if you have a different candidate tag for the same word, the weights will differ.
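
A small sketch of computing these features on the fly from the candidate tag history (the boundary symbol is an assumption; use whatever the model file expects):

```python
BOS = "BOS"   # assumed marker for positions before the start of the sentence

def tag_context_features(history):
    """history: tags already assigned to the preceding words of the sentence."""
    prev1 = history[-1] if len(history) >= 1 else BOS
    prev2 = history[-2] if len(history) >= 2 else BOS
    return ["prevT=" + prev1, "prevTwoTags=" + prev2 + "+" + prev1]

print(tag_context_features(["DT", "NN"]))   # ['prevT=NN', 'prevTwoTags=DT+NN']
```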


Uncertainty

  • Real-world tasks:

    • Partially observable, stochastic, extremely complex

    • Probabilities capture “Ignorance & Laziness”

      • Lack of relevant facts, conditions

      • Failure to enumerate all conditions, exceptions


Motivation

  • Uncertainty in medical diagnosis

    • Diseases produce symptoms

    • In diagnosis, observed symptoms => disease ID

    • Uncertainties

      • Symptoms may not occur

      • Symptoms may not be reported

      • Diagnostic tests are not perfect

        • False positive, false negative

  • How do we estimate confidence?

