Conditional Random Fields

Conditional Random Fields Advanced Statistical Methods in NLP Ling 572 February 9, 2012

Roadmap • Graphical Models • Modeling independence • Models revisited • Generative & discriminative models • Conditional random fields • Linear chain models • Skip chain models

Preview • Conditional random fields • Undirected graphical model • Due to Lafferty, McCallum, and Pereira, 2001 • Discriminative model • Supports integration of rich feature sets • Allows range of dependency structures • Linear-chain, skip-chain, general • Can encode long-distance dependencies • Used in diverse NLP sequence labeling tasks: • Named entity recognition, coreference resolution, etc.

Graphical Models

Graphical Models • Graphical model • Simple, graphical notation for conditional independence • Probabilistic model where: • Graph structure denotes conditional independence between random variables • Nodes: random variables • Edges: dependency relation between random variables • Model types: • Bayesian Networks • Markov Random Fields

Modeling (In)dependence • Bayesian network • Directed acyclic graph (DAG) • Nodes = Random Variables • Arc ~ directly influences, conditional dependency • Arcs = Child depends on parent(s) • No arcs = independent (0 incoming: only a priori) • Parents of X = the nodes with an arc into X • For each X need P(X | Parents(X))

Example I Russell & Norvig, AIMA

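Simple Bayesian Network • MCBN1 (nodes A, B, C, D, E)
• A = only a priori: need P(A), truth table size 2
• B depends on A: need P(B|A), truth table size 2*2
• C depends on A: need P(C|A), truth table size 2*2
• D depends on B,C: need P(D|B,C), truth table size 2*2*2
• E depends on C: need P(E|C), truth table size 2*2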
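
The table sizes listed above can be checked with a short Python sketch. It assumes every variable is binary, as on the slide; the names `parents` and `table_size` are chosen here for illustration and are not from the lecture.

```python
# Parents of each node in the example network, as listed on the slide.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["C"],
}


def table_size(node, parents, arity=2):
    """Number of entries in the conditional probability table of `node`.

    There is one entry per joint assignment of the node and its parents,
    so the size is arity ** (1 + number of parents) when all variables
    share the same arity (assumed binary here).
    """
    return arity ** (1 + len(parents[node]))


sizes = {node: table_size(node, parents) for node in parents}
total = sum(sizes.values())        # entries across all factored tables
full_joint = 2 ** len(parents)     # entries in the unfactored joint table

print(sizes)       # {'A': 2, 'B': 4, 'C': 4, 'D': 8, 'E': 4}
print(total)       # 22
print(full_joint)  # 32
```

Under these assumptions the factored tables hold 22 entries versus 32 for the full joint table over five binary variables; the gap grows quickly as more variables with few parents are added.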

Holmes Example (Pearl) Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes’ concern, might (30%) call even if the alarm is silent. Holmes’ other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.

Holmes Example: Model There are four binary random variables: • B: whether Holmes’ house has been burgled • A: whether his alarm sounded • W: whether Watson called • G: whether Gibbons called • (Figure: network over the nodes B, A, W, G)

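Holmes Example: Tables
• P(B): B=#t 0.0001, B=#f 0.9999
• P(A|B): given B=#t: A=#t 0.95, A=#f 0.05; given B=#f: A=#t 0.01, A=#f 0.99
• P(W|A): given A=#t: W=#t 0.90, W=#f 0.10; given A=#f: W=#t 0.30, W=#f 0.70
• P(G|A): given A=#t: G=#t 0.40, G=#f 0.60; given A=#f: G=#t 0.10, G=#f 0.90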
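
To see how these tables are used, here is a minimal Python sketch of inference by enumeration. It assumes the network structure implied by the tables (B is the parent of A, and A is the parent of both W and G); the function names `bern`, `joint`, and `posterior_burglary` are illustrative, not from the lecture.

```python
from itertools import product

# Conditional probability tables copied from the slide (True = #t, False = #f).
# Each entry stores the probability that the child is #t given its parent.
p_b_true = 0.0001                          # P(B=#t)
p_a_given_b = {True: 0.95, False: 0.01}    # P(A=#t | B)
p_w_given_a = {True: 0.90, False: 0.30}    # P(W=#t | A)
p_g_given_a = {True: 0.40, False: 0.10}    # P(G=#t | A)


def bern(p_true, value):
    """Probability of `value` for a binary variable with P(#t) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(b, a, w, g):
    """P(B=b, A=a, W=w, G=g) = P(b) P(a|b) P(w|a) P(g|a)."""
    return (bern(p_b_true, b) * bern(p_a_given_b[b], a)
            * bern(p_w_given_a[a], w) * bern(p_g_given_a[a], g))


def posterior_burglary(**evidence):
    """P(B=#t | evidence), summing the joint over all unobserved variables.

    Evidence is given as keyword arguments over the keys b, a, w, g.
    """
    numerator = denominator = 0.0
    for b, a, w, g in product([True, False], repeat=4):
        world = {"b": b, "a": a, "w": w, "g": g}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this assignment contradicts the evidence
        p = joint(b, a, w, g)
        denominator += p
        if b:
            numerator += p
    return numerator / denominator


print(posterior_burglary(w=True))           # Watson called
print(posterior_burglary(w=True, g=True))   # both neighbors called
```

Enumeration over all 16 assignments is fine for four binary variables but grows exponentially with the number of variables, which is why dedicated inference algorithms are used on larger networks.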

Bayes’ Nets: Markov Property • Bayes’ Nets: • Satisfy the local Markov property • Variables: conditionally independent of non-descendants given their parents

Simple Bayesian Network • MCBN1 (nodes A, B, C, D, E) • A = only a priori • B depends on A • C depends on A • D depends on B,C • E depends on C • P(A,B,C,D,E) = P(A)P(B|A)P(C|A)P(D|B,C)P(E|C) • There exist algorithms for training, inference on BNs
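
The product above is an instance of the general Bayesian network factorization, written out below in LaTeX. The symbol \mathrm{Pa}(X_i) for the set of parents of X_i is notation assumed for this sketch, not taken from the slides.

```latex
% General factorization: one conditional factor per node, given its parents.
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

% Instance for MCBN1, where Pa(A) is empty, Pa(B) = Pa(C) = {A},
% Pa(D) = {B, C}, and Pa(E) = {C}:
P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A)\,P(D \mid B,C)\,P(E \mid C)
```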

Naïve Bayes Model • Bayes’ Net: • Conditional independence of features given class • (Figure: class node Y with feature nodes f1, f2, f3, ..., fk)
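
Read as a Bayesian network in which the class Y is the only parent of each feature, the conditional independence assumption corresponds to the following joint distribution, sketched in LaTeX using the slide's symbols Y and f_1, ..., f_k.

```latex
% Naive Bayes: each feature depends only on the class Y.
P(Y, f_1, \dots, f_k) = P(Y) \prod_{i=1}^{k} P(f_i \mid Y)
```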

Hidden Markov Model • Bayesian Network where: • yt depends on yt-1 • xt depends on yt • (Figure: state nodes y1, y2, y3, ..., yk and observation nodes x1, x2, x3, ..., xk)
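
The two dependencies above imply that the joint probability of a state sequence and an observation sequence is a product of transition and emission terms. The Python sketch below illustrates this with hypothetical toy parameters (the tags N and V, the words, and all probabilities are invented for illustration); the initial distribution P(y1) is a standard addition that the slide does not show.

```python
# Hypothetical toy parameters, not from the lecture.
start = {"N": 0.6, "V": 0.4}                  # P(y_1)
trans = {"N": {"N": 0.3, "V": 0.7},           # P(y_t | y_{t-1})
         "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.5, "swim": 0.5},      # P(x_t | y_t)
        "V": {"fish": 0.4, "swim": 0.6}}


def hmm_joint(states, observations):
    """Joint probability P(y_1..y_k, x_1..x_k) under the HMM factorization.

    P(y, x) = P(y_1) P(x_1 | y_1) * prod_{t>1} P(y_t | y_{t-1}) P(x_t | y_t)
    """
    prob = start[states[0]] * emit[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        prob *= trans[prev][cur] * emit[cur][obs]
    return prob


# 0.6 * 0.5 (start, emit "fish") * 0.7 * 0.6 (N -> V, emit "swim")
print(hmm_joint(["N", "V"], ["fish", "swim"]))
```

Because every state "generates" its observation through an emission term, this is a generative model of the pair (y, x) rather than a direct model of y given x.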

Generative Models • Both Naïve Bayes and HMMs are generative models • "We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006) • State y generates an observation (instance) x • Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts

Conditional Random Fields