Conditional Random Fields

Roadmap

- Graphical Models
- Modeling independence
- Models revisited
- Generative & discriminative models

- Conditional random fields
- Linear chain models
- Skip chain models

Preview

- Conditional random fields
- Undirected graphical model
- Due to Lafferty, McCallum, and Pereira, 2001

- Discriminative model
- Supports integration of rich feature sets

- Allows range of dependency structures
- Linear-chain, skip-chain, general
- Can encode long-distance dependencies

- Used in diverse NLP sequence labeling tasks:
- Named entity recognition, coreference resolution, etc.

Graphical Models

- Graphical model
- Simple, graphical notation for conditional independence
- Probabilistic model where:
- Graph structure denotes conditional independence between random variables
- Nodes: random variables
- Edges: dependency relation between random variables

- Model types:
- Bayesian Networks
- Markov Random Fields

Modeling (In)dependence

- Bayesian network
- Directed acyclic graph (DAG)
- Nodes = Random Variables
- Arc ~ directly influences, conditional dependency

- Arcs = Child depends on parent(s)
- No arcs = independent (0 incoming: only a priori)
- Parents of X = nodes with an arc into X
- For each X need P(X | Parents(X))

Example I

Russell & Norvig, AIMA

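Simple Bayesian Network - MCBN1

- A = only a priori; need P(A); truth table size 2
- B depends on A; need P(B|A); truth table size 2*2
- C depends on A; need P(C|A); truth table size 2*2
- D depends on B,C; need P(D|B,C); truth table size 2*2*2
- E depends on C; need P(E|C); truth table size 2*2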

Holmes Example (Pearl)

Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes’ concern, might (30%) call even if the alarm is silent. Holmes’ other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.

Holmes Example: Model

There are four binary random variables:

B: whether Holmes’ house has been burgled

A: whether his alarm sounded

W: whether Watson called

G: whether Gibbons called

Holmes Example: Tables

P(B):
- B=#t: 0.0001, B=#f: 0.9999

P(A|B):
- B=#t: A=#t 0.95, A=#f 0.05
- B=#f: A=#t 0.01, A=#f 0.99

P(W|A):
- A=#t: W=#t 0.90, W=#f 0.10
- A=#f: W=#t 0.30, W=#f 0.70

P(G|A):
- A=#t: G=#t 0.40, G=#f 0.60
- A=#f: G=#t 0.10, G=#f 0.90
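The Holmes tables are enough to answer queries such as "how likely is a burglary, given that Watson called?". Below is a minimal sketch of inference by enumeration. It assumes the network structure implied by the tables (B influences A; A influences W and G). The helper names `bernoulli`, `joint`, and `posterior_burglary` are illustrative, not from the slides.

```python
from itertools import product

# CPTs copied from the Holmes tables (True = #t, False = #f).
P_B = {True: 0.0001, False: 0.9999}
P_A_GIVEN_B = {True: 0.95, False: 0.01}   # P(A=#t | B)
P_W_GIVEN_A = {True: 0.90, False: 0.30}   # P(W=#t | A)
P_G_GIVEN_A = {True: 0.40, False: 0.10}   # P(G=#t | A)


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(#t) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(b, a, w, g):
    """P(B,A,W,G) = P(B) P(A|B) P(W|A) P(G|A) (assumed structure)."""
    return (P_B[b]
            * bernoulli(P_A_GIVEN_B[b], a)
            * bernoulli(P_W_GIVEN_A[a], w)
            * bernoulli(P_G_GIVEN_A[a], g))


def posterior_burglary(**evidence):
    """P(B=#t | evidence), summing the joint over all consistent worlds."""
    totals = {True: 0.0, False: 0.0}
    for b, a, w, g in product([True, False], repeat=4):
        world = {"a": a, "w": w, "g": g}
        if all(world[k] == v for k, v in evidence.items()):
            totals[b] += joint(b, a, w, g)
    return totals[True] / (totals[True] + totals[False])


p_w = posterior_burglary(w=True)            # Watson called
p_wg = posterior_burglary(w=True, g=True)   # Watson and Gibbons both called
```

Even with both neighbors calling, the posterior stays small because the prior is only 10^-4 and Watson often calls as a joke.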

Bayes’ Nets: Markov Property

- Bayes’ nets:
- Satisfy the local Markov property
- Variables: conditionally independent of non-descendants given their parents

Simple Bayesian Network - MCBN1

- A = only a priori
- B depends on A
- C depends on A
- D depends on B,C
- E depends on C

P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)

There exist algorithms for training, inference on BNs
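As a rough illustration of the factorization P(A,B,C,D,E) = P(A)P(B|A)P(C|A)P(D|B,C)P(E|C), the sketch below evaluates the joint from per-node tables and checks that it sums to one. All CPT numbers are hypothetical values chosen for illustration; the slides give none for this network.

```python
from itertools import product

# Hypothetical CPT values, for illustration only (not from the slides).
P_A = 0.3                                   # P(A=1)
P_B_GIVEN_A = {1: 0.8, 0: 0.1}              # P(B=1 | A)
P_C_GIVEN_A = {1: 0.6, 0: 0.2}              # P(C=1 | A)
P_D_GIVEN_BC = {(1, 1): 0.9, (1, 0): 0.5,
                (0, 1): 0.4, (0, 0): 0.05}  # P(D=1 | B, C)
P_E_GIVEN_C = {1: 0.7, 0: 0.25}             # P(E=1 | C)


def pr(p_one, value):
    """Probability that a binary variable with P(1) = p_one equals `value`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) as the product of each node given its parents."""
    return (pr(P_A, a) * pr(P_B_GIVEN_A[a], b) * pr(P_C_GIVEN_A[a], c)
            * pr(P_D_GIVEN_BC[(b, c)], d) * pr(P_E_GIVEN_C[c], e))


total = sum(joint(*world) for world in product([0, 1], repeat=5))

# Table entries: factored form vs. one full joint table over 5 binary variables.
factored_entries = 2 + 2 * 2 + 2 * 2 + 2 * 2 * 2 + 2 * 2
full_joint_entries = 2 ** 5
```

The factored tables hold 22 entries against 32 for the full joint. The gap widens quickly as more variables with few parents are added.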

Hidden Markov Model

- Bayesian Network where:
- y_t depends on y_{t-1}
- x_t depends on y_t

[Figure: chain of states y1, y2, y3, ..., yk, each state y_t emitting an observation x_t]
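Read as a Bayesian network, the HMM's two dependencies give the joint P(x, y) = P(y_1) P(x_1|y_1) ∏ P(y_t|y_{t-1}) P(x_t|y_t). The sketch below evaluates that product for a toy two-state model. The start distribution and every number in the tables are assumptions made for illustration.

```python
# Hypothetical two-state HMM; all numbers are illustrative assumptions.
START = {"N": 0.6, "V": 0.4}                       # P(y_1)
TRANS = {"N": {"N": 0.3, "V": 0.7},                # P(y_t | y_{t-1})
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"time": 0.5, "flies": 0.5},          # P(x_t | y_t)
        "V": {"time": 0.1, "flies": 0.9}}


def hmm_joint(states, observations):
    """P(x, y) for equal-length, non-empty state and observation sequences."""
    prob = START[states[0]] * EMIT[states[0]][observations[0]]
    for t in range(1, len(states)):
        prob *= TRANS[states[t - 1]][states[t]] * EMIT[states[t]][observations[t]]
    return prob


p = hmm_joint(["N", "V"], ["time", "flies"])   # 0.6 * 0.5 * 0.7 * 0.9
```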

Generative Models

- Both Naïve Bayes and HMMs are generative models
- We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y.
- (Sutton & McCallum, 2006)

- State y generates an observation (instance) x
- Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts

Markov Random Fields

- aka Markov Network
- Graphical representation of probabilistic model
- Undirected graph
- Can represent cyclic dependencies
- (vs DAG in Bayesian Networks, can represent induced dependencies)

- Also satisfy local Markov property: P(X | all other variables) = P(X | ne(X)),
- where ne(X) are the neighbors of X

Factorizing MRFs

- Many MRFs can be analyzed in terms of cliques
- Clique: in an undirected graph G(V,E), a subset of vertices in V such that every pair of vertices vi, vj in the subset is joined by an edge in E
- Maximal clique: a clique that cannot be extended by adding another vertex
- Maximum clique: a largest clique in G

[Figure: example graph over nodes A, B, C, D, E illustrating a clique, a maximal clique, and a maximum clique]

Example due to F. Xia
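The three clique notions can be checked by brute force on a small graph. The sketch below uses a hypothetical edge set over the node names A to E, since the edges of the original example figure are not available. Brute-force enumeration is exponential and only suitable for toy graphs.

```python
from itertools import combinations

# Hypothetical undirected graph; the edges are assumptions for illustration.
NODES = ["A", "B", "C", "D", "E"]
EDGES = {frozenset(e) for e in [("A", "B"), ("A", "C"), ("B", "C"),
                                ("B", "D"), ("C", "D"), ("D", "E")]}


def is_clique(vertices):
    """True if every pair of vertices is joined by an edge."""
    return all(frozenset(pair) in EDGES for pair in combinations(vertices, 2))


def all_cliques():
    """Every non-empty vertex subset that is a clique (brute force)."""
    return [set(s) for r in range(1, len(NODES) + 1)
            for s in combinations(NODES, r) if is_clique(s)]


def maximal_cliques():
    """Cliques that are not a proper subset of any other clique."""
    cliques = all_cliques()
    return [c for c in cliques if not any(c < other for other in cliques)]


def maximum_cliques():
    """The largest maximal cliques (there may be ties)."""
    maximal = maximal_cliques()
    size = max(len(c) for c in maximal)
    return [c for c in maximal if len(c) == size]
```

On this assumed graph, {D, E} is maximal but not maximum, while {A, B, C} and {B, C, D} tie as maximum cliques.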

MRFs

- Given an undirected graph G(V,E), random vars: X
- Cliques over G: cl(G)

Example due to F. Xia
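A common way to use the cliques cl(G) is to define the joint as a normalized product of clique potentials, P(X) = (1/Z) ∏ φ_c(X_c). The sketch below is a minimal brute-force version of that idea. The clique list, the `potential` function, and its values are all hypothetical.

```python
from itertools import product

# Hypothetical cliques and potentials over binary variables.
CLIQUES = [("A", "B", "C"), ("B", "C", "D"), ("D", "E")]
VARIABLES = ["A", "B", "C", "D", "E"]


def potential(clique, values):
    """Illustrative potential: reward cliques whose variables all agree."""
    return 2.0 if len(set(values)) == 1 else 1.0


def unnormalized(assignment):
    """Product of clique potentials for one full assignment (a dict)."""
    score = 1.0
    for clique in CLIQUES:
        score *= potential(clique, tuple(assignment[v] for v in clique))
    return score


def all_assignments():
    """Yield every assignment of 0/1 values to the variables."""
    for values in product([0, 1], repeat=len(VARIABLES)):
        yield dict(zip(VARIABLES, values))


Z = sum(unnormalized(a) for a in all_assignments())   # partition function


def probability(assignment):
    """P(X = assignment) = unnormalized score divided by Z."""
    return unnormalized(assignment) / Z
```

Unlike the conditional tables of a Bayesian network, the potentials need not be probabilities. Only the normalized product is.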

Conditional Random Fields

- Definition due to Lafferty et al, 2001:
- Let G = (V,E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.

- A CRF is a Markov Random Field globally conditioned on the observation X, and has the form:
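One standard way to write such a globally conditioned model is as a product of clique potentials normalized per observation. The symbols \Phi_c and Z(x) below are notational assumptions, not taken from the slide.

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in \mathrm{cl}(G)} \Phi_c(y_c, x),
\qquad
Z(x) = \sum_{y'} \prod_{c \in \mathrm{cl}(G)} \Phi_c(y'_c, x)
```

The only difference from an unconditioned MRF is that every potential may inspect the whole observation x, and the normalizer Z(x) sums over label assignments only.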

Linear-Chain CRF

- CRFs can have arbitrary graphical structure, but...
- Most common form is linear chain
- Supports sequence modeling
- Many sequence labeling NLP problems:
- Named Entity Recognition (NER), Coreference

- Similar to combining HMM sequence w/MaxEnt model
- Supports sequence structure like HMM
- but HMMs can’t do rich feature structure

- Supports rich, overlapping features like MaxEnt
- but MaxEnt doesn’t directly support sequence labeling

Discriminative & Generative

- Model perspectives (Sutton & McCallum)

Linear-Chain CRFs

- Feature functions:
- In MaxEnt: f: X × Y → {0,1}
- e.g. f_j(x,y) = 1, if x=“rifle” and y=talk.politics.guns, 0 o.w.

- In CRFs, f: Y × Y × X × T → ℝ
- e.g. f_k(y_t, y_{t-1}, x, t) = 1, if y_t=V and y_{t-1}=N and x_t=“flies”, 0 o.w.
- frequently indicator function, for efficiency
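To make the CRF feature signature concrete, the sketch below scores a label sequence as the weighted sum of feature functions over positions and normalizes by brute force to get p(y|x). The first feature is the slide's example. The second feature, the weights, and the "START" placeholder for the label before position 0 are assumptions for illustration.

```python
import math
from itertools import product

LABELS = ["N", "V"]


def f_noun_then_verb_flies(y_t, y_prev, x, t):
    """The slide's example: y_t=V, y_{t-1}=N and x_t="flies"."""
    return 1.0 if y_t == "V" and y_prev == "N" and x[t] == "flies" else 0.0


def f_capitalized_noun(y_t, y_prev, x, t):
    """Hypothetical extra feature: a capitalized word tagged N."""
    return 1.0 if y_t == "N" and x[t][:1].isupper() else 0.0


FEATURES = [f_noun_then_verb_flies, f_capitalized_noun]
WEIGHTS = [1.5, 0.8]          # hypothetical lambda_k values


def score(y, x):
    """sum_t sum_k lambda_k * f_k(y_t, y_{t-1}, x, t)."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "START"
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(y[t], y_prev, x, t)
    return total


def conditional_probability(y, x):
    """p(y|x) = exp(score(y, x)) / Z(x), with Z(x) summed by brute force."""
    z = sum(math.exp(score(list(cand), x))
            for cand in product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z


x = ["Time", "flies"]
p_nv = conditional_probability(["N", "V"], x)
```

Each feature sees the whole observation x, so overlapping features such as capitalization, suffixes, or neighboring words can be added without any independence assumption about the inputs.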

Linear-chain CRFs: Training & Decoding

- Training:
- Learn λ_j
- Approach similar to MaxEnt: e.g. L-BFGS

- Decoding:
- Compute label sequence that optimizes P(y|x)
- Can use approaches like HMM, e.g. Viterbi
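Decoding with Viterbi can be sketched as follows. The function assumes a non-empty input and a caller-supplied `local_score(y_t, y_prev, x, t)` standing in for the weighted feature sum at one position. `toy_score` and the "START" placeholder are hypothetical.

```python
def viterbi(x, labels, local_score):
    """Return the label sequence maximizing sum_t local_score(y_t, y_prev, x, t).

    Because exp is monotonic and Z(x) does not depend on y, maximizing this
    sum also maximizes P(y|x).
    """
    # best[t][y] = (best score of any path ending in y at t, backpointer)
    best = [{y: (local_score(y, "START", x, 0), None) for y in labels}]
    for t in range(1, len(x)):
        column = {}
        for y in labels:
            prev, s = max(
                ((p, best[t - 1][p][0] + local_score(y, p, x, t)) for p in labels),
                key=lambda pair: pair[1])
            column[y] = (s, prev)
        best.append(column)
    # Trace back from the best final label.
    last = max(labels, key=lambda y: best[-1][y][0])
    path = [last]
    for t in range(len(x) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))


def toy_score(y_t, y_prev, x, t):
    """Hypothetical weighted features, for illustration only."""
    s = 0.0
    if y_t == "V" and y_prev == "N" and x[t] == "flies":
        s += 1.5
    if y_t == "N" and x[t][:1].isupper():
        s += 0.8
    return s


decoded = viterbi(["Time", "flies"], ["N", "V"], toy_score)
```

The cost is linear in sentence length and quadratic in the number of labels, the same as for HMM decoding.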

Motivation

- Long-distance dependencies:
- Linear chain CRFs, HMMs, beam search, etc
- All make local Markov assumptions
- Preceding label; current data given current label
- Good for some tasks

- However, longer context can be useful
- e.g. NER: Repeated capitalized words should get same tag

Skip-Chain CRFs

- Basic approach:
- Augment linear-chain CRF model with
- Long-distance ‘skip edges’
- Add evidence from both endpoints

- Which edges?
- Identical words, words with same stem?

- How many edges?
- Not too many, increases inference cost

Skip Chain CRF Model

- Two clique templates:
- Standard linear chain template
- Skip edge template
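With these two templates, a skip-chain model is commonly written as a product of linear-chain factors and skip-edge factors. The form below follows the usual presentation in Sutton & McCallum; the symbols \Psi_t, \Psi_{uv}, and the skip-edge set \mathcal{I} are notational assumptions.

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x)
  \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, x)
```

Here \mathcal{I} is the set of position pairs joined by a skip edge. With \mathcal{I} empty, the model reduces to a linear-chain CRF.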

Skip Chain NER

- Named Entity Recognition:
- Task: start time, end time, speaker, location
- In corpus of seminar announcement emails

- All approaches:
- Orthographic, gazetteer, POS features
- Within preceding, following 4 word window

- Skip chain CRFs:
- Skip edges between identical capitalized words
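Selecting skip edges between identical capitalized words can be sketched as below. Linking every pair of occurrences is an assumption of this sketch; the example sentence and the function name `skip_edges` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def skip_edges(tokens):
    """Return index pairs (u, v), u < v, of identical capitalized tokens."""
    positions = defaultdict(list)
    for i, token in enumerate(tokens):
        if token[:1].isupper():
            positions[token].append(i)
    edges = []
    for indices in positions.values():
        # Assumption: connect every pair of occurrences of the same word.
        edges.extend(combinations(indices, 2))
    return sorted(edges)


tokens = "Speaker : John Smith . Professor Smith will talk at 3 pm".split()
edges = skip_edges(tokens)
```

A word repeated m times yields m(m-1)/2 edges under this rule, which is one reason to keep the edge set small: each added edge raises inference cost.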

Skip Chain NER Results

Skip chain improves substantially on ‘speaker’ recognition

- Slight reduction in accuracy for times

Summary

- Conditional random fields (CRFs)
- Undirected graphical model
- Compare with Bayesian Networks, Markov Random Fields

- Linear-chain models
- HMM sequence structure + MaxEnt feature models

- Skip-chain models
- Augment with longer distance dependencies

- Pros: Good performance
- Cons: Compute intensive

HW #5: Beam Search

- Apply Beam Search to MaxEnt sequence decoding
- Task: POS tagging
- Given files:
- test data: usual format
- boundary file: sentence lengths
- model file

- Comparisons:
- Different topN, topK, beam_width

Tag Context

- Following Ratnaparkhi ‘96, model uses previous tag (prevT=tag) and previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1})
- These are NOT in the data file; you compute them on the fly.
- Notes:
- Due to sparseness, it is possible a bigram may not appear in the model file. Skip it.
- These are feature functions: If you have a different candidate tag for the same word, weights will differ.
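A rough sketch of beam search with tag-context features computed on the fly is shown below. The parameter meanings are assumptions, since the slides do not define them: `top_n` is the number of tags kept per word for each path, `top_k` is the maximum number of paths kept per position, and `beam_width` is the largest allowed log-probability gap to the best path. `tag_probs` is a hypothetical stand-in for the MaxEnt model and is assumed to return strictly positive probabilities. "BOS" is a hypothetical placeholder for missing previous tags.

```python
import math


def beam_search(words, tag_probs, top_n, top_k, beam_width):
    """Beam-search decoding for a tagger with tag-context features.

    tag_probs(words, i, prev_t, prev_two_tags) returns
    {tag: P(tag | word i, context)} (hypothetical model interface).
    """
    # Each path is (log probability, list of tags so far).
    paths = [(0.0, [])]
    for i in range(len(words)):
        candidates = []
        for logp, tags in paths:
            # Tag-context features are computed on the fly from the path.
            prev_t = tags[-1] if tags else "BOS"
            prev_prev = tags[-2] if len(tags) > 1 else "BOS"
            prev_two_tags = prev_prev + "+" + prev_t
            probs = tag_probs(words, i, prev_t, prev_two_tags)
            best_tags = sorted(probs.items(), key=lambda kv: -kv[1])[:top_n]
            for tag, p in best_tags:
                candidates.append((logp + math.log(p), tags + [tag]))
        candidates.sort(key=lambda c: -c[0])
        best_logp = candidates[0][0]
        # Keep at most top_k paths, all within beam_width of the best one.
        paths = [c for c in candidates[:top_k]
                 if c[0] + beam_width >= best_logp]
    return paths[0][1]


def toy_tag_probs(words, i, prev_t, prev_two_tags):
    """Toy distribution: prefer V right after N, otherwise prefer N."""
    if prev_t == "N":
        return {"N": 0.3, "V": 0.7}
    return {"N": 0.8, "V": 0.2}


tags = beam_search(["time", "flies"], toy_tag_probs,
                   top_n=2, top_k=3, beam_width=5.0)
```

Because the context features depend on the candidate path, the same word can receive different tag distributions on different paths. This is why the features cannot be precomputed in the data file.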

Uncertainty

- Real world tasks:
- Partially observable, stochastic, extremely complex
- Probabilities capture “Ignorance & Laziness”
- Lack relevant facts, conditions
- Failure to enumerate all conditions, exceptions

Motivation

- Uncertainty in medical diagnosis
- Diseases produce symptoms
- In diagnosis, observed symptoms => disease ID
- Uncertainties
- Symptoms may not occur
- Symptoms may not be reported
- Diagnostic tests not perfect
- False positive, false negative

- How do we estimate confidence?
