

1. Introduction to Conditional Random Fields
John Osborne, Sept 4, 2009

2. Overview
• Useful Definitions
• Background
  • HMM
  • MEMM
• Conditional Random Fields
  • Statistical and Graph Definitions
  • Computation (Training and Inference)
• Extensions
  • Bayesian Conditional Random Fields
  • Hierarchical Conditional Random Fields
  • Semi-CRFs
• Future Directions

3. Useful Definitions
• Random Field (Wikipedia)
  • In probability theory, let S = {X1, ..., Xn}, with the Xi in {0, 1, ..., G − 1} being a set of random variables on the sample space Ω = {0, 1, ..., G − 1}^n. A probability measure π is a random field if, for all ω in Ω, π(ω) > 0.
• Markov Process (chain if finite sequence)
  • Stochastic process with the Markov property
• Markov Property
  • The probability that a random variable assumes a value depends on the other random variables only through the ones that are its immediate neighbors
  • "Memoryless"
• Hidden Markov Model (HMM)
  • Markov model where the current state is unobserved
• Viterbi Algorithm
  • Dynamic programming technique to discover the most likely sequence of hidden states required to explain the observations in an HMM (see the sketch below)
  • Used to determine labels
• Potential Function == Feature Function
  • In a CRF the potential function scores the compatibility of y_t, y_{t-1} and w_t(X)
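
Since the Viterbi algorithm recurs throughout what follows, here is a minimal NumPy sketch for a discrete HMM. The function and argument names (pi, A, B) are conventional textbook notation, not from any particular library.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for a discrete HMM.

    obs: observation indices, length T
    pi:  initial state probabilities, shape (S,)
    A:   transitions, A[i, j] = P(state j | state i), shape (S, S)
    B:   emissions, B[i, k] = P(obs k | state i), shape (S, K)
    """
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))             # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)   # argmax backpointers

    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (prev, cur) pairs
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])

    # Follow backpointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```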

4. Background
• Interest in CRFs arose from Richa's work with gene expression
• Current literature shows them performing better on NLP tasks than other commonly used NLP approaches such as Support Vector Machines (SVMs), neural networks, and HMMs
• Term coined by Lafferty in 2001
• Predecessors were HMMs and maximum entropy Markov models (MEMMs)

5. HMM
• Definition
  • Markov model where the current state is unobserved
• Generative model
• Examining all of the input X would be prohibitive, hence the Markov property: only the current element of the sequence is considered
• Cannot represent multiple interacting features or long-range dependencies

6. MEMMs
• McCallum et al., 2000
• Non-generative finite-state model based on a next-state classifier
• Directed graph
• P(Y|X) = ∏_t P(y_t | y_{t-1}, w_t(X)), where w_t(X) is a sliding window over the X sequence (see the sketch below)
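
A minimal sketch of the per-step classifier behind that product, assuming a simple log-linear form; memm_step_probs and the feature encoding are hypothetical names for illustration.

```python
import numpy as np

def memm_step_probs(w, f_prev_window):
    """Locally normalized next-state distribution P(y_t | y_{t-1}, w_t(X)).

    w: weight matrix, shape (num_labels, num_features)
    f_prev_window: feature vector encoding (y_{t-1}, window), shape (num_features,)
    """
    scores = w @ f_prev_window
    scores -= scores.max()          # numerical stability
    p = np.exp(scores)
    return p / p.sum()              # normalization is per step, not per sequence
```

Note that the softmax normalizes at every position independently; the sequence probability is just the product of these per-step distributions. It is exactly this local normalization that gives rise to the label bias problem discussed next.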

7. Label Bias Problem
• Transitions leaving a given state compete only against each other, rather than against all other transitions in the model
• Implies "conservation of score mass" (Bottou, 1991)
• Observations can be ignored; Viterbi decoding can't downgrade a branch (illustrated below)
• CRFs solve this problem by having a single exponential model for the joint probability of the ENTIRE SEQUENCE OF LABELS given the observation sequence
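
A toy numeric illustration of the effect: because each state's outgoing transitions are normalized locally, a state with a single successor forwards all of its probability mass regardless of the observation.

```python
import numpy as np

def local_softmax(scores):
    """Per-state normalization over outgoing transitions, as in a MEMM."""
    p = np.exp(scores - scores.max())
    return p / p.sum()

# State A has two successors; the observation can influence the split.
print(local_softmax(np.array([2.0, 0.5])))   # e.g. [0.82, 0.18]

# State B has ONE successor: whatever score the observation contributes,
# local normalization forces the transition probability to 1.0.
print(local_softmax(np.array([-5.0])))       # [1.0] -- observation ignored
```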

8. Big Picture Definition
• Wikipedia definition (Aug 2009)
  • A conditional random field (CRF) is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.
• A probabilistic model is a statistical model, in mathematical terms "a pair (Y,P) where Y is the set of possible observations and P the set of possible probability distributions on Y"
  • In statistical terms, the objective is to infer (or pick) the distinct element (probability distribution) in the set P given your observation Y
• Discriminative model, meaning it models the conditional probability distribution P(y|x), which can predict y given x
  • It cannot go the other way around (produce x from y), since it is not a generative model (capable of generating sample data given a model): it does not model a joint probability distribution
  • Similar to other discriminative models like support vector machines and neural networks
• When analyzing sequential data, a conditional model specifies the probabilities of possible label sequences given an observation sequence

9. CRF Graphical Definition (from Lafferty)
• Undirected graphical model
• Let G = (V,E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G

10. Computation of CRF
• Training
  • Conditioning
  • Calculation of feature functions
• P(Y|X) = (1/Z(X)) exp( ∑_t ψ(y_t, y_{t-1}, w_t(X)) )
  • Z(X) is a normalizing factor
  • The potential function ψ sits inside the exponent
• Inference (see the scoring sketch below)
  • Viterbi decoding
  • Approximate model averaging
  • Others?
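
To make the formula above concrete, here is a minimal NumPy sketch that scores a label sequence and computes the normalizer Z(X) with the forward algorithm. The name crf_log_prob and the (T, S, S) log-potential layout are illustrative choices, not notation from the original paper.

```python
import numpy as np

def logsumexp_matvec(alpha, M):
    """alpha'_j = log sum_i exp(alpha_i + M[i, j])."""
    x = alpha[:, None] + M
    m = x.max(axis=0)
    return m + np.log(np.exp(x - m).sum(axis=0))

def crf_log_prob(psi, y):
    """log P(y | x) for a linear-chain CRF.

    psi: log-potentials, shape (T, S, S); psi[t, i, j] scores the pair
         (y_{t-1} = i, y_t = j) together with the window features w_t(X).
         Position 0 uses a dummy "start" previous label (row 0).
    y:   label sequence, length T
    """
    T, S, _ = psi.shape

    # Unnormalized score of the given label sequence.
    score = psi[0, 0, y[0]]
    for t in range(1, T):
        score += psi[t, y[t - 1], y[t]]

    # log Z(X) via the forward algorithm: log-sum-exp over ALL sequences,
    # which is what distinguishes the CRF from the per-step MEMM softmax.
    alpha = psi[0, 0]                      # shape (S,)
    for t in range(1, T):
        alpha = logsumexp_matvec(alpha, psi[t])
    log_Z = np.logaddexp.reduce(alpha)

    return score - log_Z
```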

11. Training Approaches
• CRFs are trained with supervised learning, so one can use:
• Maximum Likelihood (original paper)
  • Used an iterative scaling method; very slow
• Gradient Ascent
  • Also slow when naïve
  • The Mallet implementation uses the BFGS algorithm (see the sketch after this list)
    • http://en.wikipedia.org/wiki/BFGS
    • Broyden-Fletcher-Goldfarb-Shanno, an approximate 2nd-order algorithm
• Stochastic gradient method (2006), accelerated via Stochastic Meta-Descent
• Gradient Tree Boosting (a variant of 2001-era gradient boosting)
  • http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
  • Potential functions are sums of regression trees (decision trees with real-valued outputs)
  • Published 2008; competitive with Mallet
• Bayesian (estimate the posterior probability)
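
As a sketch of the quasi-Newton approach, in the same spirit as Mallet's BFGS training, the snippet below hands an objective helper to SciPy's L-BFGS-B optimizer. The helper neg_log_likelihood_and_grad is an assumed, hypothetical function, not part of any library: implementing it requires the forward-backward algorithm to compute expected feature counts under the model.

```python
import numpy as np
from scipy.optimize import minimize

def train_crf(neg_log_likelihood_and_grad, num_params):
    """Fit CRF weights by maximizing the conditional log-likelihood.

    neg_log_likelihood_and_grad(theta) -> (float, ndarray) must return
    the negative log-likelihood of the training data and its gradient
    with respect to the parameter vector theta.
    """
    theta0 = np.zeros(num_params)
    result = minimize(neg_log_likelihood_and_grad, theta0,
                      jac=True,            # objective returns (value, gradient)
                      method="L-BFGS-B")   # limited-memory BFGS
    return result.x
```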

12. Conditional Random Field Extensions: Semi-CRF
• Semi-CRF
  • Instead of assigning labels to each member of the sequence, labels are assigned to sub-sequences (segments)
  • Advantage: "features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian" (see the recursion sketch below)
  • http://www.cs.cmu.edu/~wcohen/postscript/semiCRF.pdf
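
A sketch of the semi-CRF forward recursion, under the assumptions that segments are at most L tokens long and that a segment-level log-potential seg_psi is supplied by the caller; the function name and signature are illustrative, not from the semi-CRF paper.

```python
import numpy as np

def semi_crf_log_Z(seg_psi, T, S, L):
    """log Z for a semi-CRF: sum over segmentations with segments of length <= L.

    seg_psi(start, end, prev_label, label) -> log-potential of a segment
    covering positions [start, end) with the given label, following a
    segment labeled prev_label (prev_label = -1 at the sequence start).
    """
    NEG_INF = -np.inf
    # alpha[t][y]: log-sum of scores of segmentations of x[0:t] ending with label y
    alpha = [[NEG_INF] * S for _ in range(T + 1)]
    for t in range(1, T + 1):
        for y in range(S):
            terms = []
            for d in range(1, min(L, t) + 1):   # candidate segment length
                s = t - d
                if s == 0:
                    terms.append(seg_psi(s, t, -1, y))
                else:
                    terms += [alpha[s][yp] + seg_psi(s, t, yp, y)
                              for yp in range(S) if alpha[s][yp] > NEG_INF]
            alpha[t][y] = np.logaddexp.reduce(terms) if terms else NEG_INF
    return np.logaddexp.reduce(alpha[T])
```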

13. Bayesian CRF
• Qi et al., 2005
  • http://www.cs.purdue.edu/homes/alanqi/papers/Qi-Bayesian-CRF-AIstat05.pdf
• Replacement for the maximum-likelihood method of Lafferty
• Reduces over-fitting
• "Power EP method"

14. Hierarchical CRF (HCRF)
• http://www.springerlink.com/content/r84055k2754464v5/
• http://www.cs.washington.edu/homes/fox/postscripts/places-isrr-05.pdf
• Applied to GPS motion data for surveillance, tracking, and dividing people's workdays into labels such as work, travel, sleep, etc.
• Less published work in this area

15. Future Directions
• Less work on conditional random fields in biology
• PubMed hits
  • "Conditional Random Field": 21
  • "Conditional Random Fields": 43
  • Searches combining CRF variants with promoter/regulatory elements show no hits
  • Searches combining CRF and ontology show no hits
• Plan
  • Implement a CRF in Java, apply it to biology problems, and try to find ways to extend it

16. Useful Papers
• Links to the original paper and a review paper
  • http://www.inference.phy.cam.ac.uk/hmw26/crf/
• Review paper
  • http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
• Another review
  • http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
• Review slides
  • http://www.cs.pitt.edu/~mrotaru/comp/nlp/Random%20Fields/Tutorial%20CRF%20Lafferty.pdf
• The boosting paper has a nice review
  • http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
