# Conditional Random Fields

##### Presentation Transcript

1. Conditional Random Fields William W. Cohen CALD

2. Announcements • Upcoming assignments: • Today: Sha & Pereira, Lafferty et al. • Mon 2/23: Klein & Manning, Toutanova et al. • Wed 2/25: no writeup due • Mon 3/1: no writeup due • Wed 3/3: project proposal due: personnel + 1-2 pages • Spring break week, no class

3. Review: motivation for CMMs. Ideally we would like to use many arbitrary, overlapping features of words: • identity of word • ends in “-ski” • is capitalized • is part of a noun phrase • is in a list of city names • is under node X in WordNet • is in bold font • is indented • is in hyperlink anchor • … [Figure: a chain of states S at positions t-1, t, t+1 above observations O at the same positions; the observation at t is “Wisniewski”, which is part of a noun phrase and ends in “-ski”.]

4. Motivation for CMMs. Idea: replace the generative model in an HMM with a maxent model, where the state depends on the observations and the previous state. Candidate features of each word: • identity of word • ends in “-ski” • is capitalized • is part of a noun phrase • is in a list of city names • is under node X in WordNet • is in bold font • is indented • is in hyperlink anchor • … [Figure: a chain of states S at positions t-1, t, t+1 above observations O at the same positions; the observation at t is “Wisniewski”, which is part of a noun phrase and ends in “-ski”.]
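One way to write down the idea on this slide is as a per-state exponential (maxent) model. The symbols below (feature functions f_k, weights λ_k, state s_t, observation o_t) are assumed notation for illustration and are not taken from the slides:

```latex
\Pr(s_t \mid s_{t-1}, o_t)
  = \frac{\exp\Big(\sum_k \lambda_k \, f_k(o_t, s_{t-1}, s_t)\Big)}
         {\sum_{s'} \exp\Big(\sum_k \lambda_k \, f_k(o_t, s_{t-1}, s')\Big)}
```

The denominator sums only over candidate next states s', so each distribution is normalized locally, per previous state and observation.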

5. Implications of the model • Does this do what we want? • Q: does Y[i-1] depend on X[i+1]? • “A node is conditionally independent of its non-descendants given its parents”

6. Label Bias Problem • Consider this MEMM: • P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r) • P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r) • Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri) • In the training data, label value 2 is the only label value observed after label value 1 • Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x • However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro) • Per-state normalization does not allow the required expectation

7. Label Bias Problem • Consider this MEMM, and enough training data to perfectly model it: • Pr(0123|rib) = 1 • Pr(0453|rob) = 1 • Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1 • Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 * 1
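The arithmetic on this slide can be checked with a small sketch. The raw scores below are hypothetical (the slides give only the resulting probabilities); the point is that a state with a single outgoing transition gets probability 1 after per-state normalization no matter how poorly the observation fits, whereas normalizing once over whole paths keeps that evidence.

```python
from math import prod

# Hypothetical unnormalized scores: raw[(prev_state, observation)][next_state].
# States 1 and 4 each have one successor; the observation that was never seen
# with that state in training gets a small raw score (an assumed value).
raw = {
    (0, "r"): {1: 1.0, 4: 1.0},
    (1, "i"): {2: 1.0},
    (1, "o"): {2: 0.01},
    (4, "o"): {5: 1.0},
    (4, "i"): {5: 0.01},
    (2, "b"): {3: 1.0},
    (5, "b"): {3: 1.0},
}

PATHS = [(0, 1, 2, 3), (0, 4, 5, 3)]


def local_prob(prev, obs, nxt):
    """Per-state normalization: divide by the sum over successors of `prev`."""
    scores = raw[(prev, obs)]
    return scores[nxt] / sum(scores.values())


def memm_path_prob(path, word):
    """Product of locally normalized transition probabilities (MEMM style)."""
    return prod(local_prob(p, o, n) for p, o, n in zip(path, word, path[1:]))


def path_score(path, word):
    """Product of raw scores along a path, with no local normalization."""
    return prod(raw[(p, o)][n] for p, o, n in zip(path, word, path[1:]))


def global_path_prob(path, word):
    """Normalize once over all complete paths instead of per state."""
    return path_score(path, word) / sum(path_score(p, word) for p in PATHS)


for p in PATHS:
    print(p, memm_path_prob(p, "rob"), global_path_prob(p, "rob"))
```

With local normalization both paths score 0.5 on "rob", matching the slide; with a single global normalizer the path through states 4 and 5 dominates.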

8. How important is label bias? • Could be avoided in this case by changing structure: • Our models are always wrong – is this “wrongness” a problem? • See Klein & Manning’s paper for next week….

9. Another view of label bias [Sha & Pereira] So what’s the alternative?

10. Review of maxent

11. Review of maxent/MEMM/CMMs

12. Details on CMMs

13. New model: from CMMs to CRFs. Recall why we’re unhappy: we don’t want local normalization.

14. What does the new model look like? What’s independent? [Figure: observation nodes x1, x2, x3 and label nodes y1, y2, y3.]

15. What does the new model look like? What’s independent now? [Figure: label nodes y1, y2, y3 and a single observation node x.]

16. Hammersley-Clifford • For positive distributions P(x1,…,xn): • Pr(xi|x1,…,xi-1,xi+1,…,xn) = Pr(xi|Neighbors(xi)) • Pr(A|B,S) = Pr(A|S), where A and B are sets of nodes and S is a set that separates A and B • P can be written as a normalized product of “clique potentials” • So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”)
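The “normalized product of clique potentials” can be written out as follows. The notation (ψ_C for the potential on clique C, x_C for the variables in that clique, Z for the normalizer) is assumed here for illustration:

```latex
P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathrm{cliques}(G)} \psi_C(x_C),
\qquad
Z = \sum_{x_1, \dots, x_n} \; \prod_{C \in \mathrm{cliques}(G)} \psi_C(x_C)
```

Each ψ_C is a positive function of the variables in one clique only. It need not be a probability; the single constant Z makes the product sum to one.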

17. Definition of CRFs • X is a random variable over data sequences to be labeled • Y is a random variable over corresponding label sequences

18. Example of CRFs

19. Graphical comparison among HMMs, MEMMs and CRFs [Figure: three graphical models, labeled HMM, MEMM, and CRF.]

20. If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields: p(y | x) ∝ exp( Σ_{e∈E,k} λk fk(e, y|e, x) + Σ_{v∈V,k} μk gk(v, y|v, x) ) (Lafferty et al. notation) • x is a data sequence • y is a label sequence • v is a vertex from the vertex set V = set of label random variables • e is an edge from the edge set E over V • fk and gk are given and fixed; gk is a Boolean vertex feature and fk is a Boolean edge feature • k indexes the features • λk and μk are parameters to be estimated • y|e is the set of components of y defined by edge e • y|v is the set of components of y defined by vertex v

21. Conditional Distribution (cont’d) • CRFs use the observation-dependent normalization Z(x) for the conditional distributions: p(y | x) = (1/Z(x)) exp( Σ_{e∈E,k} λk fk(e, y|e, x) + Σ_{v∈V,k} μk gk(v, y|v, x) ) • Z(x) is a normalization over the data sequence x • Learning: • Lafferty et al.’s IIS-based method is rather inefficient • Gradient-based methods are faster • The trickiest bit is computing the normalization, which is a sum over exponentially many y vectors
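To make the cost of the normalization concrete, here is a minimal brute-force sketch for a linear chain. The label set, the Boolean features, and the weights are all hypothetical and chosen only for illustration; the sum in `partition` visits every one of the |labels|^n label sequences, which is exactly what becomes infeasible for long sequences.

```python
import itertools
import math

LABELS = ["name", "nonName"]  # hypothetical label set


def edge_features(y_prev, y_cur, x, i):
    """Boolean edge features f_k (illustrative choices, not from the slides)."""
    return [y_prev == y_cur, y_prev == "name" and y_cur == "nonName"]


def vertex_features(y_cur, x, i):
    """Boolean vertex features g_k (illustrative choices, not from the slides)."""
    return [y_cur == "name" and x[i][0].isupper(),
            y_cur == "nonName" and x[i].islower()]


LAMBDA = [0.5, -0.3]  # hypothetical edge-feature weights
MU = [1.2, 0.8]       # hypothetical vertex-feature weights


def score(y, x):
    """Weighted sum of all edge and vertex features along the chain."""
    total = 0.0
    for i in range(len(x)):
        total += sum(m * g for m, g in zip(MU, vertex_features(y[i], x, i)))
        if i > 0:
            total += sum(l * f for l, f in
                         zip(LAMBDA, edge_features(y[i - 1], y[i], x, i)))
    return total


def partition(x):
    """Z(x): sum of exp(score) over every possible label sequence."""
    return sum(math.exp(score(y, x))
               for y in itertools.product(LABELS, repeat=len(x)))


def prob(y, x):
    """p(y | x) under the globally normalized model."""
    return math.exp(score(y, x)) / partition(x)


x = ["William", "Cohen", "teaches"]
for y in itertools.product(LABELS, repeat=len(x)):
    print(y, round(prob(y, x), 4))
```

The enumeration doubles in size with every extra token for two labels, which is why practical learners replace it with a dynamic program.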

22. CRF learning – from Sha & Pereira

23. CRF learning – from Sha & Pereira

24. CRF learning – from Sha & Pereira. Something like forward-backward • Idea: • Define a matrix of y,y’ “affinities” at stage i • Mi[y,y’] = “unnormalized probability” of transition from y to y’ at stage i • Mi * Mi+1 = “unnormalized probability” of any path through stages i and i+1
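The matrix idea can be sketched as follows, under assumed potentials: `log_potential` is a hypothetical scoring function, not the one used by Sha & Pereira. Multiplying the stage matrices sums over all intermediate labels, so the normalizer comes out of a chain of small matrix products rather than an enumeration of every label sequence. The brute-force function is included only to check the result.

```python
import itertools
import math

LABELS = ["name", "nonName"]  # hypothetical label set


def log_potential(y_prev, y_cur, x, i):
    """Hypothetical log-score for label y_cur at position i after y_prev.

    y_prev is None at the first position.
    """
    s = 0.0
    if y_cur == "name" and x[i][0].isupper():
        s += 1.2
    if y_cur == "nonName" and x[i].islower():
        s += 0.8
    if y_prev is not None and y_prev == y_cur:
        s += 0.5
    return s


def stage_matrix(x, i):
    """M_i[y][y']: unnormalized score of moving from y to y' at stage i."""
    return [[math.exp(log_potential(a, b, x, i)) for b in LABELS]
            for a in LABELS]


def matmul(A, B):
    """Plain-Python matrix product, to keep the sketch dependency-free."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def partition_by_matrices(x):
    """Z(x) as (initial row vector) * M_1 * ... * M_{n-1}, then summed."""
    acc = [[math.exp(log_potential(None, b, x, 0)) for b in LABELS]]
    for i in range(1, len(x)):
        acc = matmul(acc, stage_matrix(x, i))
    return sum(acc[0])


def partition_brute_force(x):
    """Z(x) by explicit enumeration of every label sequence."""
    total = 0.0
    for y in itertools.product(LABELS, repeat=len(x)):
        prev, s = None, 0.0
        for i, lab in enumerate(y):
            s += log_potential(prev, lab, x, i)
            prev = lab
        total += math.exp(s)
    return total


x = ["William", "Cohen", "teaches", "class"]
print(partition_by_matrices(x), partition_brute_force(x))
```

The matrix version costs on the order of n·|labels|² operations, against |labels|^n terms for the enumeration.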

25. [Figure: label nodes y1, y2, y3 with observation x, shown twice.]

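26. Forward-backward ideas [Figure: a three-stage lattice with states name and nonName at each stage; the edges between stages 1 and 2 are labeled a, b, c, d and the edges between stages 2 and 3 are labeled e, f, g, h.]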
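A small sketch of the forward-backward computation on a three-stage, two-state lattice like this one. The numeric edge weights, and the assignment of the labels a through h to particular edges, are assumptions made for illustration. The forward vector sums the weights of all paths arriving at each middle state, the backward vector sums the weights of all paths leaving it, and their product divided by the total gives the marginal of that state.

```python
STATES = ["name", "nonName"]

# Hypothetical edge weights; which letter sits on which edge is assumed.
a, b, c, d = 2.0, 1.0, 0.5, 3.0   # stage 1 -> stage 2
e, f, g, h = 1.5, 0.5, 1.0, 2.0   # stage 2 -> stage 3

M1 = [[a, c], [b, d]]  # M1[y1][y2]
M2 = [[e, g], [f, h]]  # M2[y2][y3]


def forward_backward(M1, M2):
    """Return the total path weight Z and the marginals of the middle stage."""
    n = len(M1)
    # alpha2[y2]: total weight of all paths that end in y2 at stage 2
    alpha2 = [sum(M1[y1][y2] for y1 in range(n)) for y2 in range(n)]
    # beta2[y2]: total weight of all paths that start from y2 at stage 2
    beta2 = [sum(M2[y2][y3] for y3 in range(n)) for y2 in range(n)]
    Z = sum(alpha2[y2] * beta2[y2] for y2 in range(n))
    marginals = [alpha2[y2] * beta2[y2] / Z for y2 in range(n)]
    return Z, marginals


Z, marginals = forward_backward(M1, M2)
for state, m in zip(STATES, marginals):
    print(state, m)
```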

27. CRF learning – from Sha & Pereira

28. CRF learning – from Sha & Pereira

29. Sha & Pereira results • CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

30. Sha & Pereira results (in minutes, 375k examples)

31. POS tagging experiments in Lafferty et al. • Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging • Each word in a given input sentence must be labeled with one of 45 syntactic tags • Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies • oov = out-of-vocabulary (not observed in the training set)
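The orthographic features listed on this slide could be computed along these lines. This is a sketch: the function name and feature keys are hypothetical, "begins with a number" and "begins with an upper case letter" are treated as two separate features (an assumption), and "contains a suffix" is read as "ends with that suffix".

```python
# Suffix list taken from the slide; stored without the leading hyphen.
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")


def orthographic_features(word):
    """Return a dict of Boolean spelling features for one word."""
    feats = {
        "starts_with_digit": word[:1].isdigit(),
        "starts_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    lowered = word.lower()
    for suffix in SUFFIXES:
        feats["suffix_" + suffix] = lowered.endswith(suffix)
    return feats


for w in ["Running", "3-day", "cities"]:
    active = [name for name, on in orthographic_features(w).items() if on]
    print(w, active)
```

Because the features depend only on spelling, they can fire on out-of-vocabulary words, which have no word-identity feature learned from training.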

32. POS tagging vs MXPost