Conditional Random Fields

Conditional Random Fields William W. Cohen CALD

Announcements
• Upcoming assignments:
  • Today: Sha & Pereira, Lafferty et al
  • Mon 2/23: Klein & Manning, Toutanova et al
  • Wed 2/25: no writeup due
  • Mon 3/1: no writeup due
  • Wed 3/3: project proposal due: personnel + 1-2 pages
  • Spring break week, no class

Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words:
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …
[Figure: chain of states S(t-1), S(t), S(t+1) over observations O(t-1), O(t), O(t+1); the observation at time t is annotated with features such as is “Wisniewski”, part of noun phrase, ends in “-ski”.]

Motivation for CMMs
[Figure: chain of states S(t-1), S(t), S(t+1) over observations O(t-1), O(t), O(t+1), annotated with overlapping word features such as is “Wisniewski”, part of noun phrase, ends in “-ski”.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
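
A minimal sketch of this idea, not the model from any of the papers: for each previous state, a maxent (softmax) distribution over the next state given the current word. The label set, the feature function and the weights are all made up for illustration.

```python
import math

STATES = ["name", "nonName"]  # hypothetical label set


def features(prev_state, word, state):
    """Hypothetical overlapping Boolean features of (previous state, word, state)."""
    return {
        f"cap&{state}": word[:1].isupper(),
        f"ski&{state}": word.endswith("ski"),
        f"prev={prev_state}&{state}": True,
    }


def local_distribution(weights, prev_state, word):
    """Maxent model P(state | prev_state, word), normalized at this time step only."""
    scores = {}
    for s in STATES:
        active = [name for name, on in features(prev_state, word, s).items() if on]
        scores[s] = math.exp(sum(weights.get(name, 0.0) for name in active))
    z = sum(scores.values())  # local normalizer: depends on prev_state and word
    return {s: v / z for s, v in scores.items()}


# Made-up weights; unlisted features get weight 0.
weights = {"cap&name": 1.5, "ski&name": 2.0, "prev=name&name": 0.5}
dist = local_distribution(weights, "nonName", "Wisniewski")
```

Because every call normalizes over the next state separately, the probabilities leaving each state always sum to one. That per-state normalization is what the later slides on label bias object to.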

Implications of the model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
• “A node is conditionally independent of its non-descendants given its parents”

How important is label bias?
• Could be avoided in this case by changing structure:
• Our models are always wrong – is this “wrongness” a problem?
• See Klein & Manning’s paper for next week…

Another view of label bias [Sha & Pereira]
So what’s the alternative?

Review of maxent

Review of maxent/MEMM/CMMs

Details on CMMs

New model
From CMMs to CRFs
Recall why we’re unhappy: we don’t want local normalization

What does the new model look like? What’s independent?
[Figure: graphical model over x1, x2, x3 and y1, y2, y3.]

What does the new model look like? What’s independent now?
[Figure: graphical model over y1, y2, y3 and a single node x.]

Hammersley-Clifford
• For positive distributions P(x1,…,xn):
  • Pr(xi | x1,…,xi-1,xi+1,…,xn) = Pr(xi | Neighbors(xi))
  • Pr(A | B,S) = Pr(A | S), where A, B are sets of nodes and S is a set that separates A and B
  • P can be written as a normalized product of “clique potentials”
So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”).
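
The “normalized product of clique potentials” can be written out as below. The notation is an assumption, not taken from the slide: \mathcal{C} is the set of cliques of the graph, x_C is the subset of variables in clique C, and each \psi_C is a strictly positive potential function.

```latex
P(x_1,\dots,x_n) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z \;=\; \sum_{x'_1,\dots,x'_n} \; \prod_{C \in \mathcal{C}} \psi_C(x'_C)
```

Z is a single global constant here. Nothing forces the potentials on any one clique to sum to one by themselves.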

Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences

Example of CRFs

Graphical comparison among HMMs, MEMMs and CRFs
[Figure: three panels, labeled HMM, MEMM and CRF.]

Lafferty et al notation
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:

$$p_\theta(y \mid x) \;\propto\; \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) \;+\; \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$

• x is a data sequence
• y is a label sequence
• v is a vertex from vertex set V = set of label random variables
• e is an edge from edge set E over V
• fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature
• k indexes the features
• θ = (λ1, λ2, …; μ1, μ2, …) are parameters to be estimated
• y|e is the set of components of y defined by edge e
• y|v is the set of components of y defined by vertex v

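Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

$$p_\theta(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) \;+\; \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$

$$Z(x) \;=\; \sum_{y'} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y'|_e, x) \;+\; \sum_{v \in V,\, k} \mu_k g_k(v, y'|_v, x) \Big)$$

  Z(x) is a normalization for the data sequence x: it sums over all label sequences y' for that x
• Learning:
  • Lafferty et al’s IIS-based method is rather inefficient.
  • Gradient-based methods are faster
  • Trickiest bit is computing normalization, which is over exponentially many y vectors.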
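
To make “exponentially many y vectors” concrete, here is a brute-force sketch that computes Z(x) by enumerating every label sequence. The label set, the scoring function and its weights are hypothetical stand-ins for the weighted vertex and edge features. It is only practical for tiny inputs.

```python
import itertools
import math

LABELS = ["name", "nonName"]  # hypothetical label set


def score(x, y):
    """Hypothetical unnormalized log-score: weighted vertex and edge features, summed."""
    total = 0.0
    for i, (word, label) in enumerate(zip(x, y)):
        if word[:1].isupper() and label == "name":
            total += 1.0  # vertex feature (made-up weight)
        if i > 0 and y[i - 1] == label:
            total += 0.5  # edge feature (made-up weight)
    return total


def brute_force_z(x):
    """Z(x): sum of exp(score) over all len(LABELS) ** len(x) label sequences."""
    return sum(math.exp(score(x, y))
               for y in itertools.product(LABELS, repeat=len(x)))


def conditional(x, y):
    """p(y | x) with a single global normalizer per input sequence."""
    return math.exp(score(x, y)) / brute_force_z(x)


x = ["Mr", "Wisniewski", "spoke"]
probs = {y: conditional(x, y) for y in itertools.product(LABELS, repeat=len(x))}
```

With two labels and three words there are only 2^3 = 8 sequences, but the count doubles with every added word. That is why learning needs a dynamic-programming computation instead of enumeration.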

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira
Something like forward-backward
• Idea:
  • Define matrix of y,y’ “affinities” at stage i
  • Mi[y,y’] = “unnormalized probability” of transition from y to y’ at stage i
  • Mi * Mi+1 = “unnormalized probability” of any path through stages i and i+1
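
The matrix idea above can be sketched as follows. Z(x) is computed as a product of per-stage matrices and checked against explicit enumeration of all paths. The label set, the affinity function and the fixed start label are assumptions made for this illustration, not details from Sha & Pereira.

```python
import itertools
import math

LABELS = ["name", "nonName"]  # hypothetical label set


def edge_score(prev, cur, word):
    """Hypothetical log-affinity for moving from label prev to label cur at this word."""
    s = 0.5 if prev == cur else 0.0
    if word[:1].isupper() and cur == "name":
        s += 1.0
    return s


def stage_matrix(word):
    """M_i[y][y'] = exp(log-affinity): the unnormalized transition 'probability'."""
    return [[math.exp(edge_score(p, c, word)) for c in LABELS] for p in LABELS]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def z_by_matrices(x, start="nonName"):
    """Z(x) as a product of stage matrices, begun from an assumed fixed start label."""
    row = [[1.0 if lab == start else 0.0 for lab in LABELS]]  # 1 x |LABELS| row vector
    for word in x:
        row = matmul(row, stage_matrix(word))
    return sum(row[0])  # sum over the final label


def z_by_enumeration(x, start="nonName"):
    """The same quantity, summing over every label path explicitly."""
    total = 0.0
    for y in itertools.product(LABELS, repeat=len(x)):
        prev, log_score = start, 0.0
        for word, cur in zip(x, y):
            log_score += edge_score(prev, cur, word)
            prev = cur
        total += math.exp(log_score)
    return total


x = ["Mr", "Wisniewski", "spoke"]
z_fast, z_slow = z_by_matrices(x), z_by_enumeration(x)
```

The matrix version does work proportional to len(x) times the square of the number of labels, while the enumeration grows exponentially in len(x). Both give the same Z(x).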

[Figure: graphical model over x and y1, y2, y3.]

Forward backward ideas
[Figure: lattice with the states name and nonName at each of three positions; edges labeled a through h.]

CRF learning – from Sha & Pereira

Sha & Pereira results
CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Sha & Pereira results
in minutes, 375k examples

POS tagging Experiments in Lafferty et al
• Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)
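
The orthographic features listed above could be extracted roughly as follows. The function name and the feature-name strings are hypothetical, and “contains a suffix” is interpreted here as “ends with”.

```python
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")


def orthographic_features(word):
    """Return the set of orthographic feature names that fire for a word."""
    feats = set()
    if word[:1].isdigit():
        feats.add("begins_with_number")
    if word[:1].isupper():
        feats.add("begins_with_upper")
    if "-" in word:
        feats.add("contains_hyphen")
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            feats.add("suffix=-" + suffix)
    return feats


example = orthographic_features("Well-being")
```

Several features can fire for the same word (for example, “cities” ends in both -s and -ies). Overlapping evidence of this kind is awkward for a generative HMM, but conditional models accept it directly.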

POS tagging vs MXPost
