
An Introduction to Conditional Random Field



Presentation Transcript


  1. An Introduction to Conditional Random Field Ching-Chun Hsiao

  2. Outline • Problem description • Why conditional random fields(CRF) • Introduction to CRF • CRF model • Inference of CRF • Learning of CRF • Applications • References

  3. Reference • Charles Elkan, “Log-linear Models and Conditional Random Fields,” Notes for a tutorial at CIKM, 2008. • Charles Sutton and Andrew McCallum, “An Introduction to Conditional Random Fields for Relational Learning,” in Introduction to Statistical Relational Learning, MIT Press, 2006. • Andrew Y. Ng and Michael I. Jordan, “On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes,” in Advances in Neural Information Processing Systems (NIPS), 2002. • John Lafferty, Andrew McCallum and Fernando Pereira, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” in Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, 2001.

  4. Outline • Problem description • Why conditional random fields(CRF) • Introduction to CRF • CRF model • Inference of CRF • Learning of CRF • Applications • References

  5. Problem Description • Given observed data X, we wish to predict the labels Y • Example: • X = {Temperature, Humidity, ...}; Xn = observation on day n • Y = {Sunny, Rainy, Cloudy}; Yn = weather on day n • [Figure: is day n Sunny, Rainy, or Cloudy? The label may depend on yesterday's weather, and the observations (light breeze, 30°C, 20% humidity) may depend on one another]

  6. Outline • Problem description • Why conditional random fields(CRF) • Introduction to CRF • CRF model • Inference of CRF • Learning of CRF • Applications • References

  7. Generative Model vs. Discriminative Model • Generative model • A model that generates the observed data randomly • Models the joint probability p(x,y) • Discriminative model • Directly estimates the posterior probability p(y|x) • Aims at modeling the “discrimination” between different outputs • Model families by output structure (generative / conditional): • Single variable: Naïve Bayes, … / Logistic regression, … • Sequence: HMM, … / Linear-chain CRF, MEMM, … • General: Bayesian network, MRF, … / General CRF, …

  8. Why Conditional Random Fields –1 • Generative model • A generative model estimates the joint probability p(x,y) and predicts by applying Bayes' rule to obtain p(y|x) • Ex: naïve Bayes (single output) and HMM (hidden Markov model; sequence output) • Naïve Bayes: x is a vector of features, assumed independent of one another given y • HMM assumptions: 1. each state depends only on its immediate predecessor 2. each observation is conditionally independent of everything else given its state
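
The two factorizations named on this slide can be written out as below. This is the standard textbook form rather than a transcription of the slide's own equations, which did not survive; d (number of features) and T (sequence length) are assumed symbols.

```latex
% Naive Bayes: the d features are independent of one another given y
\[ p(x, y) = p(y) \prod_{j=1}^{d} p(x_j \mid y) \]

% HMM: first-order Markov states, each observation depends only on its state
% (p(y_1 \mid y_0) is read as the initial distribution p(y_1))
\[ p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t) \]

% Prediction in either model goes through Bayes' rule
\[ p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')} \]
```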

  9. Why Conditional Random Fields –2 • Notation: A → B means A causes B • Naïve Bayes view: given the weather, humidity (20%), temperature (30°C) and wind scale (light breeze) are independent • [Figure: observations for four consecutive days, Mon. to Thu.: {22°C, 60%, moderate breeze}, {30°C, 20%, light breeze}, {28°C, 30%, light breeze}, {25°C, 40%, moderate breeze}]

  10. Why Conditional Random Fields –3 • Difficulties for generative models • It is not practical to represent multiple interacting features (p(x) is hard to model) or long-range dependencies among the observations • They impose very strict independence assumptions on the observations • [Figure: observations for four consecutive days, Mon. to Thu.: {22°C, 60%, moderate breeze}, {30°C, 20%, light breeze}, {28°C, 30%, light breeze}, {25°C, 40%, moderate breeze}]

  11. Why Conditional Random Fields –4 • Discriminative models • Directly model the posterior p(y|x) • Aim at modeling the “discrimination” between different outputs • Ex: logistic regression (maximum entropy) and CRF

  12. Why Conditional Random Fields –5 • Advantages of discriminative models • Training finds the optimal coefficients for the features whether or not the features are correlated • Less sensitive to unbalanced training data • For classification in particular, p(x) never needs to be modeled

  13. Why Conditional Random Fields –6 • Logistic regression (maximum entropy) • Suppose we have a bin of candies, each with an associated label (A, B, C, or D) • Each candy has multiple colors in its wrapper • Each candy is assigned a label randomly based on some distribution over wrapper colors • Label: 4 kinds of flavors (A: chocolate, B: strawberry, C: lemon, D: milk) • Observation: the color of the wrapper

  14. Why Conditional Random Fields –7 • For any candy with a red wrapper pulled from the bin: • P(A|red)+P(B|red)+P(C|red)+P(D|red) = 1 • Infinitely many distributions satisfy this constraint • The one that fits the idea of maximum entropy is the most uniform: • P(A|red)=0.25 • P(B|red)=0.25 • P(C|red)=0.25 • P(D|red)=0.25

  15. Why Conditional Random Fields –8 • Now suppose we add some evidence to our model • We note that 80% of all candies with red wrappers are labeled either A or B • P(A|red) + P(B|red) = 0.8 • The updated model that reflects this is: • P(A|red) = 0.4 • P(B|red) = 0.4 • P(C|red) = 0.1 • P(D|red) = 0.1 • As we make more observations and find more constraints, the model gets more complex
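
The candy example can be checked numerically. The sketch below is an illustration, not part of the original deck: `entropy` and `best_on_grid` are hypothetical helper names, and the grid search simply confirms that (0.4, 0.4, 0.1, 0.1) has the largest entropy among distributions on a 1/20 grid that satisfy P(A|red) + P(B|red) = 0.8.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

def best_on_grid(denom=20):
    """Search distributions (A, B, C, D) whose entries are multiples of
    1/denom, subject to P(A) + P(B) = 0.8, and return the max-entropy one."""
    ab = round(0.8 * denom)   # grid units shared between A and B
    cd = denom - ab           # grid units shared between C and D
    candidates = (
        (i / denom, (ab - i) / denom, j / denom, (cd - j) / denom)
        for i, j in product(range(ab + 1), range(cd + 1))
    )
    return max(candidates, key=entropy)

uniform = (0.25, 0.25, 0.25, 0.25)   # only the sum-to-one constraint
updated = (0.4, 0.4, 0.1, 0.1)       # after adding P(A) + P(B) = 0.8
skewed = (0.7, 0.1, 0.15, 0.05)      # also satisfies the new constraint

print(entropy(uniform), entropy(updated), entropy(skewed))
print(best_on_grid())
```

Adding a constraint can only lower the achievable entropy, which is why the updated model is less uniform than the first one yet still as uniform as the evidence allows.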

  16. Why Conditional Random Fields –9 • Given a collection of facts, choose a model that is consistent with all the facts but otherwise as uniform as possible • Feature functions are defined to encode the evidence; their weights are obtained by learning • [Figure: factor graph connecting y to x1, x2, ..., xd]
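
For reference, the maximum entropy solution under feature-expectation constraints has the log-linear (logistic regression) form below. The slide's own equation was lost in extraction, so the notation here is an assumption: f_j are the feature functions and ω_j their learned weights, following the ω used later in the deck.

```latex
\[
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{j} \omega_j\, f_j(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{j} \omega_j\, f_j(x, y') \Big)
\]
```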

  17. Outline • Problem description • Why conditional random fields(CRF) • Introduction to CRF • CRF model • Inference of CRF • Learning of CRF • Applications • References

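  18. Linear-Chain CRF –1 • Extending logistic regression to a sequence problem gives the linear-chain CRF; the feature functions may depend on the entire x • [Figure: chain yt-1, yt, yt+1, each label connected to its features x1, x2, ..., xd]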
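
A linear-chain CRF is commonly written as follows. This is the standard form from the cited tutorials rather than a recovered copy of the slide's equation; T (sequence length) and the weight symbol ω_j are assumed notation.

```latex
\[
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\Big( \sum_{t=1}^{T} \sum_{j} \omega_j\, f_j(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \sum_{j} \omega_j\, f_j(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
\]
```

Each feature function sees only two adjacent labels but may inspect the whole observation sequence x, which is what keeps exact inference tractable.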

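  19. Linear-Chain CRF –2 • [Figure: two linear-chain graphs over y1, y2, y3: one with per-position observations x1, x2, x3, and one in which every label is connected to the entire x]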

  20. General CRF • Divide the graph G into templates ψA; the parameters inside each template are tied • K(A) is the number of feature functions for the template
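
With the symbols on this slide, a general CRF can be sketched as a product over templates. The form follows the Sutton and McCallum tutorial cited in the reference slide; the weight symbol ω_{Ak} is an assumption chosen to match the ω used elsewhere in the deck.

```latex
\[
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\prod_{\psi_A \in G} \exp\Big( \sum_{k=1}^{K(A)} \omega_{Ak}\, f_{Ak}(\mathbf{y}_A, \mathbf{x}_A) \Big)
\]
```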

  21. Inference of CRF • Problem description: • Given the observations ({xi}) and the probability model (parameters such as ωi mentioned above), we want to find the best state sequence • For general graphs, exact inference in CRFs is intractable • Chain- or tree-structured CRFs admit exact inference • Otherwise, approximate solutions are used

  22. Inference of Linear-Chain CRF –1 • The inference of linear-chain CRF is very similar to that of HMM • Example: POS (part-of-speech) tagging • The identification of words as nouns, verbs, adjectives, adverbs, etc. • Example sentence: Students/noun need/verb another/article break/noun

  23. Inference of Linear-Chain CRF –2 • We first illustrate the inference of HMM • [Figure: HMM trellis from the start node o/s over "students need another break", with a value at each word/tag node] • V: students 7.6×10^-6, need 0.00031, another 0, break 2.6×10^-9 • N: students 0.00725, need 1.3×10^-5, another 1.2×10^-7, break 4.3×10^-6 • P: students 0, need 0.0002, another 0, break 0 • ART: students 0, need 0, another 7.2×10^-5, break 0

  24. Inference of Linear-Chain CRF –3 • Then back to CRF

  25. Inference of Linear-Chain CRF –4 • gi can be represented as an m×m matrix, where m is the cardinality of the tag set; its entries are indexed by the previous tag yi-1 and the current tag yi, each ranging over the tags (e.g. N, V, ART)

  26. Inference of Linear-Chain CRF –5 • The inference of linear-chain CRF is similar to that of HMM, which uses the Viterbi algorithm • Let v range over the tags • Define U(k,v) as the score of the best sequence of tags from position 1 to k, where tag k is required to be v
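
The definition of U(k,v) leads to the usual Viterbi recursion U(k,v) = max over u of [U(k-1,u) + g_k(u,v)]. Below is a minimal sketch of that recursion, assuming the per-position scores have already been computed as matrices; the function name `viterbi`, the arguments `start` and `trans`, and all toy scores are hypothetical and not taken from the deck.

```python
def viterbi(start, trans):
    """Best tag sequence for a linear chain.

    start[v]       : score of tag v at the first position, i.e. U(1, v)
    trans[i][u][v] : score g(u, v) of tag u at position i followed by
                     tag v at position i + 1 (0-based positions)
    Returns (list of tag indices, score of that sequence).
    """
    m = len(start)
    U = [list(start)]   # U[k][v]: best score of a prefix ending in tag v
    back = []           # back[k][v]: best predecessor of tag v
    for g in trans:
        prev = U[-1]
        row, ptr = [], []
        for v in range(m):
            u_best = max(range(m), key=lambda u: prev[u] + g[u][v])
            row.append(prev[u_best] + g[u_best][v])
            ptr.append(u_best)
        U.append(row)
        back.append(ptr)
    # Pick the best final tag, then follow the back-pointers.
    v = max(range(m), key=lambda t: U[-1][t])
    best_score = U[-1][v]
    path = [v]
    for ptr in reversed(back):
        v = ptr[v]
        path.append(v)
    path.reverse()
    return path, best_score

# Toy scores (made up for illustration) for a four-word sentence.
tags = ["N", "V", "ART"]
start = [2.0, 0.5, 0.1]
trans = [
    [[0.1, 2.0, 0.3], [0.5, 0.1, 1.0], [1.5, 0.1, 0.0]],
    [[0.2, 0.4, 0.1], [0.6, 0.1, 1.8], [1.0, 0.0, 0.0]],
    [[0.3, 0.9, 0.1], [0.7, 0.2, 0.4], [2.2, 0.1, 0.0]],
]
path, score = viterbi(start, trans)
print([tags[t] for t in path], score)
```

The cost is O(n·m²) for n positions and m tags, compared with m^n for enumerating every tag sequence.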

  27. Learning of CRF • Problem description • Given training pairs ({xi,yi}), we wish to estimate the parameters of the model ({ωi}) • Method • Chain- or tree-structured CRFs can be trained by maximum likelihood; we will focus on the learning of linear-chain CRFs • Training general CRFs is intractable, hence approximate solutions are necessary

  28. Learning of Linear-chain CRF –1 • Conditional maximum likelihood (CML) • x: observations; y: labels • Apply CML to the learning of CRF • It can be shown that the conditional log-likelihood of the linear-chain CRF is a concave function of the parameters, so gradient ascent can be applied to the CML problem
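
The CML objective can be written out as follows. This is the standard formulation rather than the slide's lost equation; T is the training set, and F_j (the sum of feature j over all positions of a sequence) is an assumed shorthand.

```latex
\[
\omega^{*} = \arg\max_{\omega} \sum_{(\mathbf{x}, \mathbf{y}) \in T} \log p(\mathbf{y} \mid \mathbf{x}; \omega)
\]
\[
\log p(\mathbf{y} \mid \mathbf{x}; \omega) = \sum_{j} \omega_j\, F_j(\mathbf{x}, \mathbf{y}) - \log Z(\mathbf{x}, \omega),
\qquad
F_j(\mathbf{x}, \mathbf{y}) = \sum_{i} f_j(y_{i-1}, y_i, \mathbf{x}, i)
\]
```

The first term is linear in ω and log Z is convex in ω, which is why the objective is concave.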

  29. Learning of Linear-chain CRF –2 • For the entire training set T, the gradient with respect to each weight is the difference between two terms: • The expectation of the feature with respect to the empirical distribution • The expectation of the feature with respect to the model distribution • Ep[·] denotes expectation with respect to distribution p
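
Written out, the gradient of the conditional log-likelihood has the form below. The notation is assumed: F_j(x, y) is the sum of feature j over all positions of the sequence, and ω_j is its weight.

```latex
\[
\frac{\partial}{\partial \omega_j} \sum_{(\mathbf{x}, \mathbf{y}) \in T} \log p(\mathbf{y} \mid \mathbf{x}; \omega)
= \sum_{(\mathbf{x}, \mathbf{y}) \in T}
\Big( F_j(\mathbf{x}, \mathbf{y}) - E_{\mathbf{y}' \sim p(\mathbf{y}' \mid \mathbf{x}; \omega)}\big[ F_j(\mathbf{x}, \mathbf{y}') \big] \Big)
\]
```

The first term is the observed (empirical) feature count and the second is the count the current model expects; the gradient vanishes exactly when the two agree for every feature.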

  30. Learning of Linear-chain CRF –3 • To yield the best model: • The expectation of each feature with respect to the model distribution equals its expected value under the empirical distribution of the training data • This is the same condition as in the “maximum entropy model” • Extending logistic regression (maximum entropy) to sequences gives the linear-chain CRF

  31. Learning of Linear-chain CRF –4 • Apply stochastic gradient ascent • Change the parameter values one example at a time • Stochastic: because the derivative based on a randomly chosen single example is a random approximation to the true derivative based on all training data
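
To keep the update rule visible, the sketch below applies stochastic gradient ascent to binary logistic regression, the single-output special case, rather than to a full CRF. The function `sga_logistic`, the learning rate, and the toy data are all assumptions made for illustration; the two toy features are deliberately correlated.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sga_logistic(data, n_features, lr=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent on the conditional log-likelihood.

    data: list of (x, y) pairs, x a tuple of floats, y in {0, 1}.
    The weights are changed after every single example.
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)   # visit the examples in random order
        for x, y in data:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            # d/dw_j log p(y|x) = (y - p) * x_j:
            # observed feature value minus the value the model expects.
            for j in range(n_features):
                w[j] += lr * (y - p) * x[j]
    return w

# Toy data, made up for illustration.
toy = [((1.0, 0.5), 1), ((2.0, 1.0), 1), ((-1.0, -0.5), 0), ((-2.0, -1.5), 0)]
w = sga_logistic(toy, n_features=2)
preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x, _ in toy]
print(w, preds)
```

For a linear-chain CRF the update has the same shape, but the expected feature value must be computed over all tag sequences, typically with a forward-backward pass.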

  32. Outline • Problem description • Why conditional random fields(CRF) • Introduction to CRF • CRF model • Inference of CRF • Learning of CRF • Comparisons • Applications • References

  33. Outline • Problem description • Why conditional random fields(CRF) • Introduction to CRF • CRF model • Inference of CRF • Learning of CRF • Applications • References

  34. Application – Stereo Matching (1) • Ref: Learning Conditional Random Fields for Stereo (CVPR, 2007) • [Figure: an object (obj) viewed in the rectified left and right images, imL and imR]

  35. Application – Stereo Matching (2) • Model the stereo matching problem using CRF • p: pixels in the reference image • dp: the disparity at pixel p • cp: the matching cost at pixel p • gpq: the color gradient between neighboring pixels p and q, (p,q) ∈ N

  36. Application – Image Labeling(1) • Ref: Multiscale Conditional Random Fields for Image Labeling (CVPR, 2004) • Image labeling: assign each pixel to one of a finite set of labels

  37. Application – Image Labeling(2) • Model the image labeling problem using CRF • X: input image • L: output label field • S: the entire image • The model combines a local classifier applied to X, a regional feature extracted from L, and a global feature extracted from L

  38. Application – Image Labeling(3)

  39. Application – Gesture Recognition (1) • Ref.: S. Wang, A. Quattoni, L. Morency, D. Demirdjian, and T. Darrell, “Hidden conditional random fields for gesture recognition,” CVPR, 2006.

  40. Application – Gesture Recognition (2) • s = {s1, s2, ..., sm}, each si ∈ S captures certain underlying structure of each class and S is the set of hidden states in the model

  41. Application – Gesture Recognition (3) • The graph E is a chain in which each node corresponds to a hidden state variable at time t • ω is the window size: it defines the amount of past and future history used when predicting the state at time t • The parameters are θ = [θe θy θs] • θs[sj] refers to the parameters in θs that correspond to state sj ∈ S • θy[y, sj] refers to the parameters that correspond to class y and state sj • θe[y, sj, sk] refers to the parameters that correspond to class y and the pair of states sj and sk • The observation feature is a vector that can include any feature of the observation sequence for a specific window size ω
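
The parameters on this slide fit together in the hidden-CRF model roughly as follows. This is a sketch of the form usually given for hidden CRFs, not a recovered copy of the slide's equation; φ(x, j, ω) is an assumed symbol for the observation feature vector at position j with window size ω.

```latex
\[
P(y \mid \mathbf{x}, \theta, \omega)
= \sum_{\mathbf{s}} P(y, \mathbf{s} \mid \mathbf{x}, \theta, \omega)
= \frac{\sum_{\mathbf{s}} e^{\Psi(y, \mathbf{s}, \mathbf{x}; \theta, \omega)}}
       {\sum_{y', \mathbf{s}} e^{\Psi(y', \mathbf{s}, \mathbf{x}; \theta, \omega)}}
\]
\[
\Psi(y, \mathbf{s}, \mathbf{x}; \theta, \omega)
= \sum_{j=1}^{m} \varphi(\mathbf{x}, j, \omega) \cdot \theta_s[s_j]
+ \sum_{j=1}^{m} \theta_y[y, s_j]
+ \sum_{(j,k) \in E} \theta_e[y, s_j, s_k]
\]
```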

  42. Application – Gesture Recognition (4) • Thirteen users were asked to perform these six gestures; an average of 90 gestures per class were collected.

  43. Summary • Discriminative models have the advantages of • Being less sensitive to unbalanced training data • Dealing with correlated features • CRF is a discriminative model and is consistent with the maximum entropy model

  44. Factor Graph • Represent naïve Bayes using factor graphs: [Figure: y connected to x1, x2, x3] • Represent HMM using factor graphs: [Figure: chain y1, y2, y3 with factors f(y1), f(y2|y1), f(y3|y2), and factors f(y1|x1), f(y2|x2), f(y3|x3) linking each yt to xt]
