
Conditional Random Fields


Advanced Statistical Methods in NLP

Ling 572

February 9, 2012

- Graphical Models
- Modeling independence
- Models revisited
- Generative & discriminative models

- Conditional random fields
- Linear chain models
- Skip chain models


- Conditional random fields
- Undirected graphical model
- Due to Lafferty, McCallum, and Pereira, 2001

- Discriminative model
- Supports integration of rich feature sets

- Allows range of dependency structures
- Linear-chain, skip-chain, general
- Can encode long-distance dependencies

- Used in diverse NLP sequence labeling tasks:
- Named entity recognition, coreference resolution, etc.



- Graphical model
- Simple, graphical notation for conditional independence
- Probabilistic model where:
- Graph structure denotes conditional independence between random variables
- Nodes: random variables
- Edges: dependency relation between random variables

- Model types:
- Bayesian Networks
- Markov Random Fields


- Bayesian network
- Directed acyclic graph (DAG)
- Nodes = Random Variables
- Arc ~ directly influences, conditional dependency

- Arcs = child depends on parent(s)
- No arcs = independent (0 incoming edges: only a priori probability)
- Parents of X = Pa(X)
- For each X need P(X | Pa(X))


Russell & Norvig, AIMA

- MCBN1: five binary random variables A, B, C, D, E
- A = only a priori: need P(A), truth table size 2
- B depends on A: need P(B|A), size 2*2
- C depends on A: need P(C|A), size 2*2
- D depends on B, C: need P(D|B,C), size 2*2*2
- E depends on C: need P(E|C), size 2*2
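The truth-table sizes above follow a simple pattern: a binary node with k parents needs a table of 2 * 2^k entries. A minimal sketch (variable names match MCBN1):

```python
# Truth-table sizes for binary variables in the MCBN1 network.
# Each node's conditional probability table has 2^(1 + #parents) entries.
parents = {
    "A": [],          # a priori only
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["C"],
}

table_sizes = {node: 2 ** (1 + len(ps)) for node, ps in parents.items()}
print(table_sizes)  # {'A': 2, 'B': 4, 'C': 4, 'D': 8, 'E': 4}
```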

Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes’ concern, might (30%) call even if the alarm is silent. Holmes’ other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.

There are four binary random variables:

B: whether Holmes’ house has been burgled

A: whether his alarm sounded

W: whether Watson called

G: whether Gibbons called

There a four binary random variables:

B: whether Holmes’ house has been burgled

A: whether his alarm sounded

W: whether Watson called

G: whether Gibbons called

P(B):
B=#t: 0.0001, B=#f: 0.9999

P(A|B):
B=#t: A=#t 0.95, A=#f 0.05
B=#f: A=#t 0.01, A=#f 0.99

P(W|A):
A=#t: W=#t 0.90, W=#f 0.10
A=#f: W=#t 0.30, W=#f 0.70

P(G|A):
A=#t: G=#t 0.40, G=#f 0.60
A=#f: G=#t 0.10, G=#f 0.90
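These conditional probability tables are enough to answer queries by enumeration. A minimal sketch computing P(B=#t | W=#t), with the CPT values taken from the Holmes story (G is unobserved and marginalizes out, so it can be omitted from the sum):

```python
# CPTs from the Holmes network; dictionary keys are truth values.
P_B = {True: 0.0001, False: 0.9999}                 # P(B = b)
P_A_given_B = {True: {True: 0.95, False: 0.05},     # P(A = a | B = b) = P_A_given_B[b][a]
               False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10},     # P(W = w | A = a) = P_W_given_A[a][w]
               False: {True: 0.30, False: 0.70}}

def joint_B_W(b, w):
    # Marginalize over the hidden alarm variable A.
    return sum(P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w]
               for a in (True, False))

# Posterior P(B = #t | W = #t): Watson's call barely moves the posterior,
# since his 30% prank-call rate swamps the 10^-4 burglary prior.
num = joint_B_W(True, True)
posterior = num / (num + joint_B_W(False, True))
print(round(posterior, 6))  # 0.000284
```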

- Bayes’s Nets:
- Satisfy the local Markov property
- Variables: conditionally independent of non-descendants given their parents


- MCBN1: nodes A, B, C, D, E
- A = only a priori
- B depends on A
- C depends on A
- D depends on B, C
- E depends on C

P(A,B,C,D,E)=P(A)P(B|A)P(C|A)P(D|B,C)P(E|C)

There exist algorithms for training and inference on BNs
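The factorization above can be turned directly into code. A sketch with hypothetical CPT values (the numbers are illustrative, not from the slides), checking that the factored joint sums to 1:

```python
from itertools import product

# Hypothetical CPTs, each giving P(node = True) for every parent assignment.
p_A = 0.3
p_B = {True: 0.8, False: 0.2}                       # P(B = True | A)
p_C = {True: 0.5, False: 0.1}                       # P(C = True | A)
p_D = {(True, True): 0.9, (True, False): 0.6,       # P(D = True | B, C)
       (False, True): 0.4, (False, False): 0.05}
p_E = {True: 0.7, False: 0.3}                       # P(E = True | C)

def bern(p, value):
    return p if value else 1.0 - p

def joint(a, b, c, d, e):
    # P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)
    return (bern(p_A, a) * bern(p_B[a], b) * bern(p_C[a], c)
            * bern(p_D[(b, c)], d) * bern(p_E[c], e))

# Sanity check: the factored joint sums to 1 over all 2^5 assignments.
total = sum(joint(*vals) for vals in product((True, False), repeat=5))
print(round(total, 10))  # 1.0
```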

- Bayes’ Net:
- Conditional independence of features given class
- Nodes: Y; f1, f2, f3, …, fk

- Bayesian Network where:
- yt depends on yt-1
- xt depends on yt
- Nodes: y1, y2, y3, …, yk; x1, x2, x3, …, xk
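The HMM structure above corresponds to the joint factorization P(x,y) = P(y1) P(x1|y1) ∏_t P(yt|yt-1) P(xt|yt). A toy sketch with hypothetical parameters (a two-tag POS example):

```python
# Toy HMM with hypothetical parameters.
init = {"N": 0.6, "V": 0.4}                          # P(y_1)
trans = {("N", "N"): 0.3, ("N", "V"): 0.7,           # P(y_t | y_{t-1})
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "time"): 0.2, ("N", "flies"): 0.1,     # P(x_t | y_t)
        ("V", "time"): 0.05, ("V", "flies"): 0.3}

def hmm_joint(states, words):
    # P(x, y) = P(y_1) P(x_1|y_1) * prod_t P(y_t|y_{t-1}) P(x_t|y_t)
    p = init[states[0]] * emit[(states[0], words[0])]
    for t in range(1, len(states)):
        p *= trans[(states[t - 1], states[t])] * emit[(states[t], words[t])]
    return p

print(hmm_joint(["N", "V"], ["time", "flies"]))  # 0.6 * 0.2 * 0.7 * 0.3 ≈ 0.0252
```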

- Both Naïve Bayes and HMMs are generative models
- We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y.
- (Sutton & McCallum, 2006)

- State y generates an observation (instance) x



- aka Markov Network
- Graphical representation of probabilistic model
- Undirected graph
- Can represent cyclic dependencies
- (vs. DAG in Bayesian Networks, which can represent induced dependencies)

- Undirected graph
- Also satisfy local Markov property:
- P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X


- Many MRFs can be analyzed in terms of cliques
- Clique: in undirected graph G(V,E), clique is a subset of vertices v in V, s.t. for every pair of vertices vi,vj, there exists E(vi,vj)
- Maximal clique: cannot be extended by adding another vertex
- Maximum clique is largest clique in G.
[Figure: example graph over vertices A, B, C, D, E illustrating a clique, a maximal clique, and the maximum clique]

Example due to F. Xia
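The clique definitions can be illustrated by brute force on a small graph. Since the slide's figure is not recoverable, the edge set below is a hypothetical example:

```python
from itertools import combinations

nodes = ["A", "B", "C", "D", "E"]
# Hypothetical edge set (the slide's figure is not recoverable).
edges = {frozenset(e) for e in
         [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]}

def is_clique(vs):
    # Every pair of vertices in vs must be connected.
    return all(frozenset(p) in edges for p in combinations(vs, 2))

cliques = [set(vs) for r in range(1, len(nodes) + 1)
           for vs in combinations(nodes, r) if is_clique(vs)]
maximal = [c for c in cliques if not any(c < d for d in cliques)]  # cannot be extended
maximum = max(cliques, key=len)                                    # largest overall

print(maximal)       # three maximal cliques: {D,E}, {A,B,C}, {B,C,D}
print(len(maximum))  # 3
```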

- Given an undirected graph G(V,E), random vars: X
- Cliques over G: cl(G)
- Joint factors over cliques: P(X) = (1/Z) ∏_{c in cl(G)} φ_c(X_c), where Z normalizes over all assignments

Example due to F. Xia


- Definition due to Lafferty et al, 2001:
- Let G = (V,E) be a graph such that Y = (Yv), v in V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Yv obey the Markov property with respect to the graph: p(Yv | X, Yw, w ≠ v) = p(Yv | X, Yw, w ∼ v), where w ∼ v means that w and v are neighbors in G.

- A CRF is a Markov Random Field globally conditioned on the observation X, and has the form: p(y|x) = (1/Z(x)) ∏_{c in cl(G)} φ_c(y_c, x)


- CRFs can have arbitrary graphical structure, but..
- Most common form is linear chain
- Supports sequence modeling
- Many sequence labeling NLP problems:
- Named Entity Recognition (NER), Coreference

- Similar to combining HMM sequence w/MaxEnt model
- Supports sequence structure like HMM
- but HMMs can’t do rich feature structure

- Supports rich, overlapping features like MaxEnt
- but MaxEnt doesn’t directly support sequence labeling


- Model perspectives (Sutton & McCallum)


- Feature functions:
- In MaxEnt: f: X × Y → {0,1}
- e.g. fj(x,y) = 1 if x = “rifle” and y = talk.politics.guns; 0 otherwise

- In CRFs, f: Y × Y × X × T → R
- e.g. fk(yt, yt-1, x, t) = 1 if yt = V and yt-1 = N and xt = “flies”; 0 otherwise
- frequently an indicator function, for efficiency

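The CRF feature above can be written directly as a function; the names `f_k` and `x` are illustrative:

```python
# CRF-style indicator feature function f_k(y_t, y_{t-1}, x, t).
def f_k(y_t, y_prev, x, t):
    # Fires when the current tag is V, the previous tag is N,
    # and the current word is "flies".
    return 1.0 if (y_t == "V" and y_prev == "N" and x[t] == "flies") else 0.0

x = ["time", "flies", "fast"]
print(f_k("V", "N", x, 1))  # 1.0
print(f_k("N", "V", x, 1))  # 0.0
```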


- Training:
- Learn λj
- Approach similar to MaxEnt: e.g. L-BFGS

- Decoding:
- Compute label sequence that optimizes P(y|x)
- Can use approaches like HMM, e.g. Viterbi
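Viterbi decoding over a linear chain can be sketched as follows. The score table here is hypothetical and stands in for the weighted feature sums Σ_k λ_k f_k(yt, yt-1, x, t):

```python
tags = ["N", "V"]

def score(y_t, y_prev, t):
    # Hypothetical precomputed scores; in a real CRF these would be
    # sums of weighted features for a specific input sentence x.
    table = {("N", None): 1.0, ("V", None): 0.2,   # position 0: no previous tag
             ("N", "N"): 0.1, ("N", "V"): 1.2,
             ("V", "N"): 1.5, ("V", "V"): 0.3}
    return table[(y_t, y_prev if t > 0 else None)]

def viterbi(n):
    # best[y] = (score of the best path ending in tag y, that path)
    best = {y: (score(y, None, 0), [y]) for y in tags}
    for t in range(1, n):
        best = {y: max(((s + score(y, prev, t), path + [y])
                        for prev, (s, path) in best.items()),
                       key=lambda sp: sp[0])
                for y in tags}
    return max(best.values(), key=lambda sp: sp[0])[1]

print(viterbi(3))  # ['N', 'V', 'N']
```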


- Long-distance dependencies:
- Linear chain CRFs, HMMs, beam search, etc
- All make local Markov assumptions
- Preceding label; current data given current label
- Good for some tasks

- However, longer context can be useful
- e.g. NER: repeated capitalized words should get the same tag


- Basic approach:
- Augment linear-chain CRF model with
- Long-distance ‘skip edges’
- Add evidence from both endpoints

- Which edges?
- Identical words, words with same stem?

- How many edges?
- Not too many, increases inference cost
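The skip-edge heuristic discussed here (connect identical capitalized words) is easy to sketch; the example sentence is illustrative:

```python
from itertools import combinations

def skip_edges(words):
    # Connect every pair of positions holding identical capitalized words.
    return [(i, j) for i, j in combinations(range(len(words)), 2)
            if words[i] == words[j] and words[i][0].isupper()]

words = "Speaker : Prof. Smith . Smith will talk at 3pm".split()
print(skip_edges(words))  # [(3, 5)]
```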


- Two clique templates:
- Standard linear chain template
- Skip edge template



- Named Entity Recognition:
- Task: start time, end time, speaker, location
- In corpus of seminar announcement emails

- All approaches:
- Orthographic, gazetteer, POS features
- Within the preceding and following 4-word window

- Skip chain CRFs:
- Skip edges between identical capitalized words

- Skip chain improves substantially on ‘speaker’ recognition

- Slight reduction in accuracy for times


- Conditional random fields (CRFs)
- Undirected graphical model
- Compare with Bayesian Networks, Markov Random Fields

- Linear-chain models
- HMM sequence structure + MaxEnt feature models

- Skip-chain models
- Augment with longer distance dependencies

- Pros: Good performance
- Cons: Compute intensive


- Apply Beam Search to MaxEnt sequence decoding
- Task: POS tagging
- Given files:
- test data: usual format
- boundary file: sentence lengths
- model file

- Comparisons:
- Different topN, topK, beam_width

- Following Ratnaparkhi ‘96, the model uses the previous tag (prevT=tag_{i-1}) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1})
- These are NOT in the data file; you compute them on the fly.
- Notes:
- Due to sparseness, it is possible a bigram may not appear in the model file. Skip it.
- These are feature functions: If you have a different candidate tag for the same word, weights will differ.
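The pruning step in beam search can be sketched as follows; the scorer is a toy stand-in for the MaxEnt model's scores, and `beam_width` corresponds to the assignment's beam parameter:

```python
# Beam search for sequence tagging: keep only the top beam_width partial
# tag sequences at each position.
def beam_search(n_words, tags, local_score, beam_width=2):
    beam = [(0.0, [])]                       # (score so far, tag sequence)
    for t in range(n_words):
        candidates = [(s + local_score(t, seq, y), seq + [y])
                      for s, seq in beam for y in tags]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]       # prune to the top hypotheses
    return beam[0][1]

def toy_score(t, prev_seq, y):
    # Toy scorer: prefers tags alternating N, V, N, V, ...
    return 1.0 if y == ("N" if t % 2 == 0 else "V") else 0.0

print(beam_search(4, ["N", "V"], toy_score))  # ['N', 'V', 'N', 'V']
```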

- Real world tasks:
- Partially observable, stochastic, extremely complex
- Probabilities capture “Ignorance & Laziness”
- Lack relevant facts, conditions
- Failure to enumerate all conditions, exceptions

- Uncertainty in medical diagnosis
- Diseases produce symptoms
- In diagnosis, observed symptoms => disease ID
- Uncertainties
- Symptoms may not occur
- Symptoms may not be reported
- Diagnostic tests not perfect
- False positive, false negative

- How do we estimate confidence?