
Conditional Random Fields

Advanced Statistical Methods in NLP

Ling 572

February 9, 2012

Roadmap
  • Graphical Models
    • Modeling independence
    • Models revisited
    • Generative & discriminative models
  • Conditional random fields
    • Linear chain models
    • Skip chain models
Preview
  • Conditional random fields
    • Undirected graphical model
      • Due to Lafferty, McCallum, and Pereira, 2001
    • Discriminative model
      • Supports integration of rich feature sets
    • Allows range of dependency structures
      • Linear-chain, skip-chain, general
      • Can encode long-distance dependencies
    • Used in diverse NLP sequence labeling tasks:
      • Named entity recognition, coreference resolution, etc.
Graphical Models
  • Graphical model
    • Simple, graphical notation for conditional independence
    • Probabilistic model where:
      • Graph structure denotes conditional independence between random variables
      • Nodes: random variables
      • Edges: dependency relation between random variables
  • Model types:
    • Bayesian Networks
    • Markov Random Fields
Modeling (In)dependence
  • Bayesian network
    • Directed acyclic graph (DAG)
      • Nodes = Random Variables
      • Arc ~ directly influences, conditional dependency
    • Arcs = Child depends on parent(s)
      • No arcs = independent (0 incoming: only a priori)
      • Parents of X = nodes with arcs into X
      • For each X, need P(X | Parents(X))
Example I

Russell & Norvig, AIMA

Simple Bayesian Network
  • MCBN1

[Graph: A → B, A → C, B → D, C → D, C → E]

A = only a priori: need P(A), truth table size 2
B depends on A: need P(B|A), truth table size 2*2
C depends on A: need P(C|A), truth table size 2*2
D depends on B,C: need P(D|B,C), truth table size 2*2*2
E depends on C: need P(E|C), truth table size 2*2
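
To see where these truth-table sizes come from, here is a minimal sketch (my own illustration, not from the slides) that derives each table size from the number of parents of a binary node:

```python
# Hypothetical sketch: truth-table sizes for the binary-valued MCBN1 network.
parents = {
    "A": [],          # A = only a priori
    "B": ["A"],       # B depends on A
    "C": ["A"],       # C depends on A
    "D": ["B", "C"],  # D depends on B, C
    "E": ["C"],       # E depends on C
}

for node, pa in parents.items():
    # Each table P(node | parents) has 2^(1 + #parents) entries for binary variables.
    size = 2 ** (1 + len(pa))
    print(f"P({node}|{','.join(pa)}): {size} entries" if pa else f"P({node}): {size} entries")

# For comparison, the full joint over 5 binary variables has 2**5 = 32 entries.
```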

Holmes Example (Pearl)

Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes’ concern, might (30%) call even if the alarm is silent. Holmes’ other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.


Holmes Example: Model

[Graph: B → A, A → W, A → G]

There are four binary random variables:

B: whether Holmes’ house has been burgled
A: whether his alarm sounded
W: whether Watson called
G: whether Gibbons called

Holmes Example: Tables

P(B):
        B=#t      B=#f
        0.0001    0.9999

P(A|B):
  B     A=#t      A=#f
  #t    0.95      0.05
  #f    0.01      0.99

P(W|A):
  A     W=#t      W=#f
  #t    0.90      0.10
  #f    0.30      0.70

P(G|A):
  A     G=#t      G=#f
  #t    0.40      0.60
  #f    0.10      0.90
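
As a sanity check on how these tables combine under the network B → A → {W, G}, here is a minimal sketch (my own illustration, not part of the original slides) that enumerates the joint P(B)P(A|B)P(W|A)P(G|A) and computes the posterior probability of a burglary given that Watson called:

```python
from itertools import product

P_B = {True: 0.0001, False: 0.9999}
P_A_given_B = {True: {True: 0.95, False: 0.05}, False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10}, False: {True: 0.30, False: 0.70}}
P_G_given_A = {True: {True: 0.40, False: 0.60}, False: {True: 0.10, False: 0.90}}

def joint(b, a, w, g):
    # P(B, A, W, G) = P(B) P(A|B) P(W|A) P(G|A)
    return P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w] * P_G_given_A[a][g]

# P(B=#t | W=#t): sum out A and G in the numerator, B, A, G in the denominator.
num = sum(joint(True, a, True, g) for a, g in product([True, False], repeat=2))
den = sum(joint(b, a, True, g) for b, a, g in product([True, False], repeat=3))
print(num / den)   # small, since a burglary is a priori very unlikely
```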

Bayes’ Nets: Markov Property
  • Bayes’ Nets:
    • Satisfy the local Markov property
      • Variables: conditionally independent of non-descendants given their parents
Simple Bayesian Network
  • MCBN1

[Graph: A → B, A → C, B → D, C → D, C → E]

A = only a priori
B depends on A
C depends on A
D depends on B,C
E depends on C

P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)

There exist algorithms for training and inference on BNs.

Naïve Bayes Model
  • Bayes’ Net:
    • Conditional independence of features given class

[Graph: class Y with children f1, f2, f3, …, fk]
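
Written out, the independence assumption this network encodes is the standard Naïve Bayes factorization:

$$P(y \mid f_1, \ldots, f_k) \;\propto\; P(y) \prod_{i=1}^{k} P(f_i \mid y)$$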

Hidden Markov Model
  • Bayesian Network where:
    • yt depends on yt-1
    • xt depends on yt

[Graph: chain y1 → y2 → y3 → … → yk, with each xt emitted by yt]
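
In standard form, the joint distribution this network encodes factorizes as

$$P(x_{1:k}, y_{1:k}) \;=\; \prod_{t=1}^{k} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)$$

with P(y1 | y0) taken to be the initial state distribution P(y1).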

Generative Models
  • Both Naïve Bayes and HMMs are generative models
  • “We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y.”
      • (Sutton & McCallum, 2006)
    • State y generates an observation (instance) x
  • Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts
Markov Random Fields
  • aka Markov Network
  • Graphical representation of probabilistic model
    • Undirected graph
      • Can represent cyclic dependencies
      • (vs. DAG in Bayesian Networks; can represent induced dependencies)
  • Also satisfy a local Markov property:
    • A node X is conditionally independent of all other nodes given ne(X), where ne(X) are the neighbors of X
Factorizing MRFs
  • Many MRFs can be analyzed in terms of cliques
    • Clique: in an undirected graph G(V,E), a clique is a subset of vertices V′ ⊆ V s.t. for every pair of vertices vi, vj in V′, the edge (vi,vj) is in E
    • Maximal clique: a clique that cannot be extended by adding another vertex
    • Maximum clique is largest clique in G.

[Figure (example due to F. Xia): graph over nodes A, B, C, D, E annotated with a clique, a maximal clique, and the maximum clique]

MRFs
  • Given an undirected graph G(V,E), random vars: X
  • Cliques over G: cl(G)
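
The factorization over cliques that these definitions set up is the standard MRF (Hammersley-Clifford) form, stated here for reference:

$$P(X) \;=\; \frac{1}{Z} \prod_{c \,\in\, cl(G)} \psi_c(X_c), \qquad Z \;=\; \sum_{x} \prod_{c \,\in\, cl(G)} \psi_c(x_c)$$

where each ψ_c is a non-negative potential function over the variables in clique c.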

Example due to F. Xia


Conditional Random Fields
  • Definition due to Lafferty et al, 2001:
    • Let G = (V,E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.
  • A CRF is a Markov Random Field globally conditioned on the observation X, and has the form:
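
(stated here in the standard log-linear form over the cliques of G, following Sutton & McCallum, 2006)

$$p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \prod_{c \,\in\, cl(G)} \exp\Big(\sum_{k} \lambda_k f_k(\mathbf{y}_c, \mathbf{x})\Big), \qquad Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \prod_{c \,\in\, cl(G)} \exp\Big(\sum_{k} \lambda_k f_k(\mathbf{y}'_c, \mathbf{x})\Big)$$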
Linear-Chain CRF
  • CRFs can have arbitrary graphical structure, but..
  • Most common form is linear chain
    • Supports sequence modeling
    • Many sequence labeling NLP problems:
      • Named Entity Recognition (NER), Coreference
    • Similar to combining HMM sequence w/MaxEnt model
      • Supports sequence structure like HMM
        • but HMMs can’t do rich feature structure
      • Supports rich, overlapping features like MaxEnt
        • but MaxEnt doesn’t directly support sequence labeling
Discriminative & Generative
  • Model perspectives (Sutton & McCallum): generative vs. discriminative pairs: Naïve Bayes vs. MaxEnt, HMM vs. linear-chain CRF
Linear-Chain CRFs
  • Feature functions:
    • In MaxEnt: f: X × Y → {0,1}
      • e.g. fj(x,y) = 1 if x=“rifle” and y=talk.politics.guns, 0 otherwise
    • In CRFs: f: Y × Y × X × T → R
      • e.g. fk(yt, yt-1, x, t) = 1 if yt=V and yt-1=N and xt=“flies”, 0 otherwise
      • frequently an indicator function, for efficiency (see the sketch below)
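
To make the types above concrete, here is a minimal Python sketch (hypothetical feature and weights, my own illustration) of an indicator feature function and how it contributes to the unnormalized score of a label sequence:

```python
import math

# Hypothetical indicator feature f_k(y_t, y_prev, x, t) for a linear-chain CRF.
def f_flies_N_V(y_t, y_prev, x, t):
    """1 if the current tag is V, the previous tag is N, and the current word is 'flies'."""
    return 1.0 if (y_t == "V" and y_prev == "N" and x[t] == "flies") else 0.0

def sequence_score(x, y, features, weights):
    """Unnormalized score exp(sum_t sum_k lambda_k * f_k(y_t, y_{t-1}, x, t))."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "<s>"   # dummy start tag
        total += sum(w * f(y[t], y_prev, x, t) for f, w in zip(features, weights))
    return math.exp(total)

# Toy usage: one feature with weight 1.5 on the sentence "time flies".
x = ["time", "flies"]
print(sequence_score(x, ["N", "V"], [f_flies_N_V], [1.5]))   # exp(1.5)
```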
Linear-chain CRFs: Training & Decoding
  • Training:
    • Learn λj
    • Approach similar to MaxEnt: e.g. L-BFGS
  • Decoding:
    • Compute label sequence that optimizes P(y|x)
    • Can use approaches like HMM, e.g. Viterbi
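
A minimal decoding sketch along these lines, assuming a user-supplied local_score(y_t, y_prev, x, t) that returns the weighted feature sum at one position (hypothetical interface, my own illustration):

```python
# Minimal Viterbi sketch for a linear-chain model; local_score is assumed to
# return sum_k lambda_k * f_k(y_t, y_prev, x, t) for one position.
def viterbi(x, labels, local_score, start="<s>"):
    n = len(x)
    best = [{} for _ in range(n)]   # best[t][y] = best score of any prefix ending in tag y
    back = [{} for _ in range(n)]   # backpointers
    for y in labels:
        best[0][y] = local_score(y, start, x, 0)
    for t in range(1, n):
        for y in labels:
            prev, score = max(((yp, best[t - 1][yp] + local_score(y, yp, x, t))
                               for yp in labels), key=lambda p: p[1])
            best[t][y], back[t][y] = score, prev
    # Trace back from the best final tag.
    y_last = max(best[n - 1], key=best[n - 1].get)
    path = [y_last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy usage with a hypothetical scoring function:
print(viterbi(["time", "flies"], ["N", "V"],
              lambda y, yp, x, t: 1.0 if (y == "V" and x[t] == "flies") else 0.0))
```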
Motivation
  • Long-distance dependencies:
    • Linear-chain CRFs, HMMs, beam search, etc.
    • All make very local Markov assumptions
      • Preceding label; current data given current label
      • Good for some tasks
    • However, longer context can be useful
      • e.g. NER: Repeated capitalized words should get same tag
Skip-Chain CRFs
  • Basic approach:
    • Augment linear-chain CRF model with long-distance ‘skip edges’
      • Add evidence from both endpoints
  • Which edges?
    • Identical words, words with same stem?
  • How many edges?
    • Not too many, increases inference cost
Skip Chain CRF Model
  • Two clique templates:
    • Standard linear chain template
    • Skip edge template
Skip Chain NER
  • Named Entity Recognition:
    • Task: start time, end time, speaker, location
      • In corpus of seminar announcement emails
  • All approaches:
    • Orthographic, gazetteer, POS features
      • Within preceding, following 4 word window
  • Skip chain CRFs:
    • Skip edges between identical capitalized words
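
As an illustration of that heuristic, a minimal sketch (hypothetical code, not from the original experiments) that proposes skip edges between all pairs of positions holding the same capitalized word:

```python
from collections import defaultdict
from itertools import combinations

def skip_edges(tokens):
    """Connect all pairs of positions that hold the same capitalized word."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions[tok].append(i)
    edges = []
    for idxs in positions.values():
        edges.extend(combinations(idxs, 2))   # one skip edge per identical pair
    return edges

# Toy usage on a seminar-announcement-style sentence (made-up example):
tokens = "Speaker : Robert Wilensky . Robert will talk at 4 pm .".split()
print(skip_edges(tokens))   # [(2, 5)]: the two occurrences of 'Robert'
```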
Skip Chain NER Results

  • Skip chain improves substantially on ‘speaker’ recognition
  • Slight reduction in accuracy for times

Summary
  • Conditional random fields (CRFs)
    • Undirected graphical model
      • Compare with Bayesian Networks, Markov Random Fields
    • Linear-chain models
      • HMM sequence structure + MaxEnt feature models
    • Skip-chain models
      • Augment with longer distance dependencies
    • Pros: Good performance
    • Cons: Compute intensive
HW #5: Beam Search
  • Apply Beam Search to MaxEnt sequence decoding
  • Task: POS tagging
  • Given files:
    • test data: usual format
    • boundary file: sentence lengths
    • model file
  • Comparisons:
    • Different topN, topK, beam_width
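
For orientation, here is a minimal beam-search sketch (hypothetical interfaces and scoring; the assignment's own file formats and pruning definitions take precedence):

```python
import math

def beam_search(words, tags, score, top_n=3, top_k=5, beam_width=2.0):
    """Keep, at each position, the top_k best partial tag sequences whose log-prob
    is within beam_width of the best one; expand each hypothesis with the top_n tags."""
    beam = [([], 0.0)]                      # (partial tag sequence, log-prob)
    for word in words:
        candidates = []
        for seq, lp in beam:
            # Score every tag given the history, keep the top_n for this hypothesis.
            scored = sorted(((t, score(word, seq, t)) for t in tags),
                            key=lambda p: p[1], reverse=True)[:top_n]
            for t, p in scored:
                candidates.append((seq + [t], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        best = candidates[0][1]
        beam = [c for c in candidates if c[1] >= best - beam_width][:top_k]
    return beam[0][0]

# Toy usage with a hypothetical (uniform) scoring function:
print(beam_search("time flies fast".split(), ["N", "V", "ADV"],
                  lambda w, hist, t: 1.0 / 3))
```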
Tag Context
  • Following Ratnaparkhi ‘96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1})
  • These are NOT in the data file; you compute them on the fly.
  • Notes:
    • Due to sparseness, it is possible a bigram may not appear in the model file. Skip it.
    • These are feature functions: If you have a different candidate tag for the same word, weights will differ.
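
A minimal sketch of computing these two features on the fly (hypothetical weight-table layout; feature names follow the slide):

```python
def tag_context_features(prev_tags, model_weights, candidate_tag):
    """Build prevT= and prevTwoTags= features from the two previous tags and
    sum their weights for the candidate tag, skipping features absent from the model."""
    t1 = prev_tags[-1] if len(prev_tags) >= 1 else "BOS"   # hypothetical boundary tag
    t2 = prev_tags[-2] if len(prev_tags) >= 2 else "BOS"
    names = ["prevT=" + t1, "prevTwoTags=" + t2 + "+" + t1]
    total = 0.0
    for name in names:
        # Due to sparseness the feature may not appear in the model file: skip it.
        total += model_weights.get((candidate_tag, name), 0.0)
    return total

# Toy usage with a hypothetical weight table keyed by (candidate tag, feature name):
weights = {("VBZ", "prevT=NN"): 0.7, ("VBZ", "prevTwoTags=DT+NN"): 0.4}
print(tag_context_features(["DT", "NN"], weights, "VBZ"))   # 0.7 + 0.4 ≈ 1.1
```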
Uncertainty
  • Real world tasks:
    • Partially observable, stochastic, extremely complex
    • Probabilities capture “Ignorance & Laziness”
      • Lack relevant facts, conditions
      • Failure to enumerate all conditions, exceptions
Motivation
  • Uncertainty in medical diagnosis
    • Diseases produce symptoms
    • In diagnosis, observed symptoms => disease ID
    • Uncertainties
      • Symptoms may not occur
      • Symptoms may not be reported
      • Diagnostic tests not perfect
        • False positive, false negative
  • How do we estimate confidence?