Conditional Random Fields William W. Cohen CALD
Announcements • Upcoming assignments: • Today: Sha & Pereira, Lafferty et al • Mon 2/23: Klein & Manning, Toutanova et al • Wed 2/25: no writeup due • Mon 3/1: no writeup due • Wed 3/3: project proposal due: personnel + 1-2 page • Spring break week, no class
Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words:
• identity of word • ends in “-ski” • is capitalized • is part of a noun phrase • is in a list of city names • is under node X in WordNet • is in bold font • is indented • is in hyperlink anchor • …
[Figure: states S above observations O at positions t-1, t, t+1; example features of the observation at t: is “Wisniewski”, part of noun phrase, ends in “-ski”]
Motivation for CMMs
[Figure: states S above observations O at positions t-1, t, t+1, with the same overlapping word features as on the previous slide]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state
Implications of the model • Does this do what we want? • Q: does Y[i-1] depend on X[i+1]? • “A node is conditionally independent of its non-descendants given its parents”
Label Bias Problem
• Consider this MEMM:
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
• P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
• In the training data, label value 2 is the only label value observed after label value 1
• Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
• Per-state normalization rules this out: state 1 has a single outgoing transition, so it must pass all of its probability mass to state 2 whatever the observation is
Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rib) = 1
Pr(0453|rob) = 1
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 * 1
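The arithmetic on this slide can be reproduced with a small script. This is only a sketch: the states 0 to 5 and the words "rib" and "rob" come from the slide, but the numeric scores in SCORES are hypothetical values chosen for illustration, and the globally normalized variant is included just to contrast with per-state normalization.

```python
import math

# Allowed transitions in the example MEMM (state -> possible next states).
SUCCESSORS = {0: [1, 4], 1: [2], 4: [5], 2: [3], 5: [3]}

# Hypothetical local scores for (prev_state, next_state, observation).
# Unlisted triples score 0. The scores strongly prefer 'i' on edge 1->2
# and 'o' on edge 4->5.
SCORES = {(1, 2, "i"): 5.0, (1, 2, "o"): -5.0,
          (4, 5, "o"): 5.0, (4, 5, "i"): -5.0}


def score(prev, nxt, obs):
    return SCORES.get((prev, nxt, obs), 0.0)


def memm_path_prob(path, word):
    """Product of per-state (locally) normalized transition probabilities."""
    prob = 1.0
    for prev, nxt, obs in zip(path, path[1:], word):
        z = sum(math.exp(score(prev, s, obs)) for s in SUCCESSORS[prev])
        prob *= math.exp(score(prev, nxt, obs)) / z
    return prob


def global_path_prob(path, word):
    """Same scores, normalized once over both complete paths."""
    paths = [(0, 1, 2, 3), (0, 4, 5, 3)]

    def total(p):
        return sum(score(a, b, o) for a, b, o in zip(p, p[1:], word))

    z = sum(math.exp(total(p)) for p in paths)
    return math.exp(total(tuple(path))) / z


if __name__ == "__main__":
    for word in ("rib", "rob"):
        for path in ((0, 1, 2, 3), (0, 4, 5, 3)):
            print(word, path,
                  round(memm_path_prob(path, word), 4),
                  round(global_path_prob(path, word), 4))
```

With local normalization, states 1 and 4 each have one successor, so the score on that edge cancels and both paths get probability 0.5 for either word. With a single global normalizer the same scores separate the two paths.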
How important is label bias? • Could be avoided in this case by changing structure: • Our models are always wrong – is this “wrongness” a problem? • See Klein & Manning’s paper for next week….
Another view of label bias [Sha & Pereira] So what’s the alternative?
New model From CMMs to CRFs Recall why we’re unhappy: we don’t want local normalization
What’s the new model look like? What’s independent?
[Figure: graph over observations x1, x2, x3 and labels y1, y2, y3]
What’s the new model look like? What’s independent now?
[Figure: graph over labels y1, y2, y3 and a single observation node x]
Hammersley-Clifford
• For positive distributions P(x1,…,xn):
• Pr(xi|x1,…,xi-1,xi+1,…,xn) = Pr(xi|Neighbors(xi))
• Pr(A|B,S) = Pr(A|S) where A, B are sets of nodes and S is a set that separates A and B
• P can be written as a normalized product of “clique potentials”
So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”)
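Written out, the "normalized product of clique potentials" has the following shape. The symbols ψ_C (potential on clique C), 𝒞 (the set of cliques of the graph), and Z (the normalizer) are notation assumed here for illustration; the slide does not name them.

```latex
P(x_1, \ldots, x_n) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z \;=\; \sum_{x_1, \ldots, x_n} \; \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
\psi_C(x_C) > 0 .
```

Each ψ_C looks only at the variables in its clique, and Z is a single global constant, in contrast to the per-state normalizers of the MEMM.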
Definition of CRFs
• X is a random variable over data sequences to be labeled
• Y is a random variable over corresponding label sequences
Graphical comparison among HMMs, MEMMs and CRFs
[Figure: three graphical models side by side: HMM, MEMM, CRF]
Lafferty et al notation
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:
$$p_\theta(y \mid x) \;\propto\; \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) \;+\; \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$
• x is a data sequence
• y is a label sequence
• v is a vertex from vertex set V = set of label random variables
• e is an edge from edge set E over V
• fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature
• k indexes the features
• θ = (λ1, λ2, …; μ1, μ2, …) are parameters to be estimated
• y|e is the set of components of y defined by edge e
• y|v is the set of components of y defined by vertex v
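For a linear chain, the exponent is a weighted count of edge and vertex features along the sequence. The sketch below assumes one hypothetical edge feature and one hypothetical vertex feature (neither is from Lafferty et al) and uses the name/nonName labels that appear later in the deck; it computes only the unnormalized score, not a probability.

```python
import math


def edge_feature(y_prev, y_cur, x, i):
    # Hypothetical Boolean edge feature f: a name label follows a name label.
    return 1.0 if (y_prev == "name" and y_cur == "name") else 0.0


def vertex_feature(y_cur, x, i):
    # Hypothetical Boolean vertex feature g: a capitalized word labeled name.
    return 1.0 if (x[i][:1].isupper() and y_cur == "name") else 0.0


def unnormalized_score(y, x, lam, mu):
    """exp(lam * sum of edge features + mu * sum of vertex features)."""
    total = 0.0
    for i in range(len(y)):
        total += mu * vertex_feature(y[i], x, i)
        if i > 0:  # an edge connects positions i-1 and i
            total += lam * edge_feature(y[i - 1], y[i], x, i)
    return math.exp(total)


if __name__ == "__main__":
    x = ["William", "Cohen", "spoke"]
    y = ["name", "name", "nonName"]
    print(unnormalized_score(y, x, lam=1.0, mu=2.0))
```

Dividing this score by its sum over all label sequences y gives the conditional probability of y given x.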
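Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
$$p_\theta(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) \;+\; \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$
where λk and μk are the weights of the edge features fk and vertex features gk, and Z(x) normalizes over all label sequences y for the data sequence x
• Learning:
• Lafferty et al’s IIS-based method is rather inefficient.
• Gradient-based methods are faster
• Trickiest bit is computing normalization, which is over exponentially many y vectors.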
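The gradient used by gradient-based training has a simple form, sketched below for the unregularized conditional log-likelihood (any prior or penalty term is left out). The training set {(x^(j), y^(j))}, the objective name L, and the global feature count F_k are notation assumed here for illustration.

```latex
\mathcal{L}(\theta) \;=\; \sum_{j} \log p_\theta\big(y^{(j)} \mid x^{(j)}\big),
\qquad
F_k(y, x) \;=\; \sum_{e \in E} f_k(e, y|_e, x)

\frac{\partial \mathcal{L}}{\partial \lambda_k}
\;=\; \sum_{j} \Big[ F_k\big(y^{(j)}, x^{(j)}\big)
\;-\; \sum_{y} p_\theta\big(y \mid x^{(j)}\big)\, F_k\big(y, x^{(j)}\big) \Big]
```

The first term is the observed feature count and the second is the expected count under the current model. The derivative with respect to μ_k is analogous, with the vertex features g_k in place of f_k. The expectation is the part that requires summing over exponentially many y vectors unless it is computed by dynamic programming.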
CRF learning – from Sha & Pereira
Something like forward-backward
• Idea:
• Define matrix of y,y’ “affinities” at stage i
• Mi[y,y’] = “unnormalized probability” of transition from y to y’ at stage i
• Mi * Mi+1 = “unnormalized probability” of any path through stages i and i+1
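The matrix idea can be checked on a toy problem: multiplying the stage matrices gives the same Z(x) as enumerating every label sequence. This is a sketch under stated assumptions, not Sha & Pereira's implementation: the two labels, the fixed start label, and the scores in toy_weight are all hypothetical, and the sum runs over all final labels rather than ending in a dedicated stop state.

```python
import itertools
import math

LABELS = ["name", "nonName"]


def toy_weight(y_prev, y, x, i):
    # Hypothetical log-affinity of moving from y_prev to y at stage i.
    w = 0.0
    if x[i][:1].isupper() and y == "name":
        w += 2.0
    if y_prev == "name" and y == "name":
        w += 1.0
    return w


def stage_matrix(x, i, weight_fn):
    """M_i[y][y'] = exp(weight of the transition y -> y' at stage i)."""
    return [[math.exp(weight_fn(y, y2, x, i)) for y2 in LABELS] for y in LABELS]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def partition_by_matrices(x, weight_fn, start="nonName"):
    """Z(x) from the product M_1 * ... * M_n, starting at a fixed label."""
    row = [[1.0 if y == start else 0.0 for y in LABELS]]
    for i in range(len(x)):
        row = matmul(row, stage_matrix(x, i, weight_fn))
    return sum(row[0])


def partition_by_enumeration(x, weight_fn, start="nonName"):
    """Z(x) by summing over all |LABELS|**n label sequences."""
    total = 0.0
    for ys in itertools.product(LABELS, repeat=len(x)):
        prev, s = start, 0.0
        for i, y in enumerate(ys):
            s += weight_fn(prev, y, x, i)
            prev = y
        total += math.exp(s)
    return total


if __name__ == "__main__":
    x = ["William", "Cohen", "spoke"]
    print(partition_by_matrices(x, toy_weight))
    print(partition_by_enumeration(x, toy_weight))
```

The matrix version does work linear in the sequence length (times the square of the number of labels), while the enumeration grows exponentially with the length.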
[Figure: label nodes y1, y2, y3 and observation node x]
Forward backward ideas
[Figure: three-stage lattice with states name and nonName at each stage; transitions labeled a through h]
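On a lattice like this one, forward and backward passes give Z and the per-stage label marginals without enumerating paths. The sketch below assumes the stage matrices are already given as plain numbers (the values in the usage example are hypothetical, standing in for edge weights such as a through h) and assumes a fixed start label; a brute-force version is included only as a cross-check.

```python
import itertools


def forward_backward(M, start=1):
    """M[i][a][b]: unnormalized weight of moving from label a to label b
    at stage i. Returns (Z, marginals) with marginals[i][b] = P(y_i = b)."""
    n, L = len(M), len(M[0])
    # alpha[i][b]: total weight of partial paths that end in label b at stage i
    alpha = [[0.0] * L for _ in range(n)]
    for b in range(L):
        alpha[0][b] = M[0][start][b]
    for i in range(1, n):
        for b in range(L):
            alpha[i][b] = sum(alpha[i - 1][a] * M[i][a][b] for a in range(L))
    # beta[i][a]: total weight of path suffixes that leave label a at stage i
    beta = [[1.0] * L for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in range(L):
            beta[i][a] = sum(M[i + 1][a][b] * beta[i + 1][b] for b in range(L))
    Z = sum(alpha[n - 1])
    marginals = [[alpha[i][b] * beta[i][b] / Z for b in range(L)]
                 for i in range(n)]
    return Z, marginals


def brute_force(M, start=1):
    """Same quantities by explicit enumeration of all L**n paths."""
    n, L = len(M), len(M[0])
    Z = 0.0
    mass = [[0.0] * L for _ in range(n)]
    for path in itertools.product(range(L), repeat=n):
        w, prev = 1.0, start
        for i, b in enumerate(path):
            w *= M[i][prev][b]
            prev = b
        Z += w
        for i, b in enumerate(path):
            mass[i][b] += w
    return Z, [[m / Z for m in row] for row in mass]


if __name__ == "__main__":
    # Hypothetical weights; label 0 = name, label 1 = nonName.
    M = [[[1.0, 2.0], [3.0, 4.0]],
         [[0.5, 1.5], [2.0, 1.0]],
         [[1.0, 1.0], [2.0, 3.0]]]
    print(forward_backward(M))
    print(brute_force(M))
```

The same alpha and beta tables also give the pairwise marginals needed for expected edge-feature counts during training.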
Sha & Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron
Sha & Pereira results in minutes, 375k examples
POS tagging Experiments in Lafferty et al
• Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)
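The orthographic features listed above are easy to sketch as a feature extractor. The function name, the feature names, and the choice to test suffixes with an end-of-word match are assumptions made for illustration; the slide does not say how Lafferty et al implemented them.

```python
# Suffix list taken from the slide.
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")


def orthographic_features(word):
    """Return a dict of Boolean spelling features for one word."""
    feats = {
        "begins_with_number": word[:1].isdigit(),
        "begins_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suffix in SUFFIXES:
        # Assumption: "contains a suffix" is read as "ends with it".
        feats["suffix_" + suffix] = word.endswith(suffix)
    return feats


if __name__ == "__main__":
    for w in ("Well-being", "3rd", "quickly"):
        active = [name for name, on in orthographic_features(w).items() if on]
        print(w, active)
```

Features like these depend only on the spelling, so they remain informative for out-of-vocabulary words that were never observed in the training set.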