

1. Introduction to Conditional Random Fields
John Osborne, Sept 4, 2009

2. Overview
• Useful Definitions
• Background
  • HMM
  • MEMM
• Conditional Random Fields
  • Statistical and Graph Definitions
  • Computation (Training and Inference)
• Extensions
  • Bayesian Conditional Random Fields
  • Hierarchical Conditional Random Fields
  • Semi-CRFs
• Future Directions

3. Useful Definitions
• Random Field (Wikipedia)
  • In probability theory, let S = {X1, ..., Xn}, with the Xi in {0, 1, ..., G − 1} being a set of random variables on the sample space Ω = {0, 1, ..., G − 1}^n. A probability measure π is a random field if, for all ω in Ω, π(ω) > 0.
• Markov Process (chain if finite sequence)
  • Stochastic process with the Markov property
• Markov Property
  • The probability that a random variable assumes a value depends on the other random variables only through the ones that are its immediate neighbors
  • "Memoryless"
• Hidden Markov Model (HMM)
  • Markov model where the current state is unobserved
• Viterbi Algorithm
  • Dynamic programming technique to discover the most likely sequence of hidden states required to explain the observations in an HMM (see the sketch below)
  • Used to determine labels
• Potential Function == Feature Function
  • In a CRF the potential function scores the compatibility of y_t, y_{t-1} and w_t(X)
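
Since the Viterbi algorithm recurs throughout what follows, here is a minimal NumPy sketch for a discrete HMM. The function and argument names (pi, A, B) are conventional textbook notation, not from any particular library.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for a discrete HMM.

    obs: observation indices, length T
    pi:  initial state probabilities, shape (S,)
    A:   transitions, A[i, j] = P(state j | state i), shape (S, S)
    B:   emissions, B[i, k] = P(obs k | state i), shape (S, K)
    """
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))             # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)   # argmax backpointers

    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (prev, cur) pairs
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])

    # Follow backpointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```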

4. Background
• Interest in CRFs arose from Richa's work with gene expression
• Current literature shows them performing better on NLP tasks than other commonly used NLP approaches such as Support Vector Machines (SVMs), neural networks, and HMMs
• Term coined by Lafferty in 2001
• Predecessors were HMMs and maximum entropy Markov models (MEMMs)

5. HMM
• Definition
  • Markov model where the current state is unobserved
• Generative model
• Examining all of the input X would be prohibitive, hence the Markov property: only the current element of the sequence is considered
• Cannot represent multiple interacting features or long-range dependencies

6. MEMMs
• McCallum et al., 2000
• Non-generative finite-state model based on a next-state classifier
• Directed graph
• P(Y|X) = ∏_t P(y_t | y_{t-1}, w_t(X)), where w_t(X) is a sliding window over the X sequence (see the sketch below)
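
A minimal sketch of the per-step classifier behind that product, assuming a simple log-linear form; memm_step_probs and the feature encoding are hypothetical names for illustration.

```python
import numpy as np

def memm_step_probs(w, f_prev_window):
    """Locally normalized next-state distribution P(y_t | y_{t-1}, w_t(X)).

    w: weight matrix, shape (num_labels, num_features)
    f_prev_window: feature vector encoding (y_{t-1}, window), shape (num_features,)
    """
    scores = w @ f_prev_window
    scores -= scores.max()          # numerical stability
    p = np.exp(scores)
    return p / p.sum()              # normalization is per step, not per sequence
```

Note that the softmax normalizes at every position independently; the sequence probability is just the product of these per-step distributions. It is exactly this local normalization that gives rise to the label bias problem discussed next.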

7. Label Bias Problem
• Transitions leaving a given state compete only against each other, rather than against all other transitions in the model
• Implies "conservation of score mass" (Bottou, 1991)
• Observations can be ignored; Viterbi decoding can't downgrade a branch (illustrated below)
• CRFs solve this problem by having a single exponential model for the joint probability of the ENTIRE SEQUENCE OF LABELS given the observation sequence
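
A toy numeric illustration of the effect: because each state's outgoing transitions are normalized locally, a state with a single successor forwards all of its probability mass regardless of the observation.

```python
import numpy as np

def local_softmax(scores):
    """Per-state normalization over outgoing transitions, as in a MEMM."""
    p = np.exp(scores - scores.max())
    return p / p.sum()

# State A has two successors; the observation can influence the split.
print(local_softmax(np.array([2.0, 0.5])))   # e.g. [0.82, 0.18]

# State B has ONE successor: whatever score the observation contributes,
# local normalization forces the transition probability to 1.0.
print(local_softmax(np.array([-5.0])))       # [1.0] -- observation ignored
```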

8. Big Picture Definition
• Wikipedia definition (Aug 2009)
  • A conditional random field (CRF) is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.
• A probabilistic model is a statistical model, in mathematical terms "a pair (Y,P) where Y is the set of possible observations and P the set of possible probability distributions on Y"
  • In statistical terms, the objective is to infer (or pick) the distinct element (probability distribution) in the set P given your observation Y
• Discriminative model, meaning it models the conditional probability distribution P(y|x), which can predict y given x
  • It cannot go the other way around (produce x from y), since it is not a generative model (capable of generating sample data given a model): it does not model a joint probability distribution
  • Similar to other discriminative models like support vector machines and neural networks
• When analyzing sequential data, a conditional model specifies the probabilities of possible label sequences given an observation sequence

9. CRF Graphical Definition (from Lafferty)
• Undirected graphical model
• Let G = (V,E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G

10. Computation of CRF
• Training
  • Conditioning
  • Calculation of feature functions
• P(Y|X) = (1/Z(X)) exp( ∑_t ψ(y_t, y_{t-1}, w_t(X)) )
  • Z(X) is a normalizing factor
  • The potential function ψ sits inside the exponent
• Inference (see the scoring sketch below)
  • Viterbi decoding
  • Approximate model averaging
  • Others?
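
To make the formula above concrete, here is a minimal NumPy sketch that scores a label sequence and computes the normalizer Z(X) with the forward algorithm. The name crf_log_prob and the (T, S, S) log-potential layout are illustrative choices, not notation from the original paper.

```python
import numpy as np

def logsumexp_matvec(alpha, M):
    """alpha'_j = log sum_i exp(alpha_i + M[i, j])."""
    x = alpha[:, None] + M
    m = x.max(axis=0)
    return m + np.log(np.exp(x - m).sum(axis=0))

def crf_log_prob(psi, y):
    """log P(y | x) for a linear-chain CRF.

    psi: log-potentials, shape (T, S, S); psi[t, i, j] scores the pair
         (y_{t-1} = i, y_t = j) together with the window features w_t(X).
         Position 0 uses a dummy "start" previous label (row 0).
    y:   label sequence, length T
    """
    T, S, _ = psi.shape

    # Unnormalized score of the given label sequence.
    score = psi[0, 0, y[0]]
    for t in range(1, T):
        score += psi[t, y[t - 1], y[t]]

    # log Z(X) via the forward algorithm: log-sum-exp over ALL sequences,
    # which is what distinguishes the CRF from the per-step MEMM softmax.
    alpha = psi[0, 0]                      # shape (S,)
    for t in range(1, T):
        alpha = logsumexp_matvec(alpha, psi[t])
    log_Z = np.logaddexp.reduce(alpha)

    return score - log_Z
```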

11. Training Approaches
• CRFs are trained with supervised learning, so one can use:
• Maximum Likelihood (original paper)
  • Used an iterative scaling method; very slow
• Gradient Ascent
  • Also slow when naïve
  • The Mallet implementation uses the BFGS algorithm (see the sketch after this list)
    • http://en.wikipedia.org/wiki/BFGS
    • Broyden-Fletcher-Goldfarb-Shanno, an approximate 2nd-order algorithm
• Stochastic gradient method (2006), accelerated via Stochastic Meta-Descent
• Gradient Tree Boosting (a variant of 2001-era gradient boosting)
  • http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
  • Potential functions are sums of regression trees (decision trees with real-valued outputs)
  • Published 2008; competitive with Mallet
• Bayesian (estimate the posterior probability)
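
As a sketch of the quasi-Newton approach, in the same spirit as Mallet's BFGS training, the snippet below hands an objective helper to SciPy's L-BFGS-B optimizer. The helper neg_log_likelihood_and_grad is an assumed, hypothetical function, not part of any library: implementing it requires the forward-backward algorithm to compute expected feature counts under the model.

```python
import numpy as np
from scipy.optimize import minimize

def train_crf(neg_log_likelihood_and_grad, num_params):
    """Fit CRF weights by maximizing the conditional log-likelihood.

    neg_log_likelihood_and_grad(theta) -> (float, ndarray) must return
    the negative log-likelihood of the training data and its gradient
    with respect to the parameter vector theta.
    """
    theta0 = np.zeros(num_params)
    result = minimize(neg_log_likelihood_and_grad, theta0,
                      jac=True,            # objective returns (value, gradient)
                      method="L-BFGS-B")   # limited-memory BFGS
    return result.x
```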

12. Conditional Random Field Extensions: Semi-CRF
• Semi-CRF
  • Instead of assigning labels to each member of the sequence, labels are assigned to sub-sequences (segments)
  • Advantage: "features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian" (see the recursion sketch below)
  • http://www.cs.cmu.edu/~wcohen/postscript/semiCRF.pdf
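
A sketch of the semi-CRF forward recursion, under the assumptions that segments are at most L tokens long and that a segment-level log-potential seg_psi is supplied by the caller; the function name and signature are illustrative, not from the semi-CRF paper.

```python
import numpy as np

def semi_crf_log_Z(seg_psi, T, S, L):
    """log Z for a semi-CRF: sum over segmentations with segments of length <= L.

    seg_psi(start, end, prev_label, label) -> log-potential of a segment
    covering positions [start, end) with the given label, following a
    segment labeled prev_label (prev_label = -1 at the sequence start).
    """
    NEG_INF = -np.inf
    # alpha[t][y]: log-sum of scores of segmentations of x[0:t] ending with label y
    alpha = [[NEG_INF] * S for _ in range(T + 1)]
    for t in range(1, T + 1):
        for y in range(S):
            terms = []
            for d in range(1, min(L, t) + 1):   # candidate segment length
                s = t - d
                if s == 0:
                    terms.append(seg_psi(s, t, -1, y))
                else:
                    terms += [alpha[s][yp] + seg_psi(s, t, yp, y)
                              for yp in range(S) if alpha[s][yp] > NEG_INF]
            alpha[t][y] = np.logaddexp.reduce(terms) if terms else NEG_INF
    return np.logaddexp.reduce(alpha[T])
```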

13. Bayesian CRF
• Qi et al., 2005
  • http://www.cs.purdue.edu/homes/alanqi/papers/Qi-Bayesian-CRF-AIstat05.pdf
• Replacement for the maximum-likelihood method of Lafferty
• Reduces over-fitting
• "Power EP method"

14. Hierarchical CRF (HCRF)
• http://www.springerlink.com/content/r84055k2754464v5/
• http://www.cs.washington.edu/homes/fox/postscripts/places-isrr-05.pdf
• Applied to GPS motion data for surveillance, tracking, and dividing people's workdays into labels such as work, travel, sleep, etc.
• Less published work in this area

15. Future Directions
• Less work on conditional random fields in biology
• PubMed hits
  • "Conditional Random Field": 21
  • "Conditional Random Fields": 43
  • Searches combining CRF variants with promoter/regulatory elements show no hits
  • Searches combining CRF and ontology show no hits
• Plan
  • Implement a CRF in Java, apply it to biology problems, and try to find ways to extend it

16. Useful Papers
• Links to the original paper and a review paper
  • http://www.inference.phy.cam.ac.uk/hmw26/crf/
• Review paper
  • http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
• Another review
  • http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
• Review slides
  • http://www.cs.pitt.edu/~mrotaru/comp/nlp/Random%20Fields/Tutorial%20CRF%20Lafferty.pdf
• The boosting paper has a nice review
  • http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
