
# Introduction to Conditional Random Fields - PowerPoint PPT Presentation



### Introduction to Conditional Random Fields

John Osborne

Sept 4, 2009

Overview

• Useful Definitions

• Background

• HMM

• MEMM

• Conditional Random Fields

• Statistical and Graph Definitions

• Computation (Training and Inference)

• Extensions

• Bayesian Conditional Random Fields

• Hierarchical Conditional Random Fields

• Semi-CRFs

• Future Directions

• Random Field (wikipedia)

• In probability theory, let S = {X1, ..., Xn} be a set of random variables, with each Xi taking values in {0, 1, ..., G − 1}, on the sample space Ω = {0, 1, ..., G − 1}^n. A probability measure π is a random field if, for all ω in Ω, π(ω) > 0.

• Markov Process (chain if finite sequence)

• Stochastic process with Markov property

• Markov Property

• The probability that a random variable assumes a value depends on the other random variables only through the ones that are its immediate neighbors

• “memoryless”

• Hidden Markov Model (HMM)

• Markov Model where the current state is unobserved

• Viterbi Algorithm

• Dynamic programming technique to discover the most likely sequence of hidden states that explains the observed outputs of an HMM

• Determine labels
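As a concrete illustration, here is a minimal Viterbi decoder in plain Python (the state names, probabilities, and observation sequence below are invented for the example, not taken from the slides):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for `obs` in an HMM."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor for state s at time t (dynamic programming step)
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Backtrack from the best final state to recover the label sequence.
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy model: hidden weather states, observed activities.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

print(viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p))
# -> ['Sunny', 'Rainy', 'Rainy']
```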

• Potential Function == Feature Function

• In a CRF the potential function scores the compatibility of y_t, y_{t−1} and w_t(X)

• Interest in CRFs arose from Richa’s work with gene expression

• Current literature shows them performing better on NLP tasks than other commonly used approaches such as support vector machines (SVMs), neural networks, and HMMs

• Term coined by Lafferty in 2001

• Predecessors were HMMs and maximum entropy Markov models (MEMMs)

• Definition

• Markov Model where the current state is unobserved

• Generative Model

• Examining all of the input X would be prohibitive, hence the Markov property: look only at the current element of the sequence

• Cannot model multiple interacting features or long-range dependencies

• McCallum et al, 2000

• Non-generative finite-state model based on a next-state classifier

• Directed graph

• P(Y|X) = ∏_t P(y_t | y_{t−1}, w_t(X)), where w_t(X) is a sliding window over the X sequence

• Transitions leaving a given state compete only against each other, rather than against all other transitions in the model

• Implies “conservation of score mass” (Bottou, 1991)

• Observations can be effectively ignored: Viterbi decoding cannot downgrade a branch (the label bias problem)

• CRFs solve this problem by having a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
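The effect of per-state (local) normalization can be seen directly in a tiny sketch (the scores here are invented): a state with only one outgoing transition must assign it probability 1, no matter how badly it fits the observation, which is the mechanism behind the label bias problem.

```python
import math

def softmax(scores):
    """Locally normalize a dict of raw scores into a probability distribution."""
    m = max(scores.values())
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

# A state with a single outgoing transition: the raw score is terrible,
# but local normalization still forces probability 1 onto it, so the
# observation is effectively ignored.
only_choice = softmax({"B": -100.0})
assert only_choice["B"] == 1.0

# A state with two outgoing transitions can at least distribute mass.
two_choices = softmax({"B": 2.0, "C": 0.0})
print(only_choice, two_choices)
```

A globally normalized model (the CRF) avoids this because whole label sequences, not individual transitions, compete for probability mass.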

Big Picture Definition

• Wikipedia Definition (Aug 2009)

• A conditional random field (CRF) is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.

• Probabilistic model is a statistical model, in math terms “a pair (Y,P) where Y is the set of possible observations and P the set of possible probability distributions on Y”

• In statistics terms this means the objective is to infer (or pick) the distinct element (probability distribution) in the set “P” given your observation Y

• Discriminative model meaning it models the conditional probability distribution P(y|x) which can predict y given x.

• It cannot go the other way around (produce x from y), since it is not a generative model (capable of generating sample data given a model); it does not model a joint probability distribution

• Similar to other discriminative models like support vector machines and neural networks

• When analyzing sequential data a conditional model specifies the probabilities of possible label sequences given an observation sequence

CRF Graphical Definition

Definition from Lafferty

CRF Undirected Graph

• Undirected graphical model

• Let G = (V, E) be a graph such that Y = (Y_v)_{v∈V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when, conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G

• Training

• Conditioning

• Calculation of Feature Function

• P(Y|X) = (1/Z(X)) exp ∑_t Ψ(y_t, y_{t−1}, w_t(X))

• Z(X) is the normalizing factor (partition function)

• Ψ, the term inside the exponential, is the potential function
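A toy sketch of this globally normalized model, computing Z(X) by brute-force enumeration over all label sequences (feasible only for tiny examples; the label set, observations, and potential function here are invented for illustration):

```python
import itertools, math

LABELS = ("A", "B")

def psi(y, y_prev, x):
    # Hypothetical potential: reward a label matching the observation,
    # plus a small bonus for staying in the same label.
    score = 1.5 if y == x else 0.0
    if y_prev is not None and y == y_prev:
        score += 0.5
    return score

def seq_score(ys, xs):
    """Sum of potentials over the whole sequence: sum_t psi(y_t, y_{t-1}, x_t)."""
    return sum(psi(ys[t], ys[t - 1] if t else None, xs[t])
               for t in range(len(xs)))

def crf_prob(ys, xs):
    # Z(X): sum of exp(score) over ALL label sequences -- global normalization.
    Z = sum(math.exp(seq_score(cand, xs))
            for cand in itertools.product(LABELS, repeat=len(xs)))
    return math.exp(seq_score(ys, xs)) / Z

xs = ("A", "A", "B")
probs = {ys: crf_prob(ys, xs)
         for ys in itertools.product(LABELS, repeat=3)}
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a proper distribution over sequences
best = max(probs, key=probs.get)
print(best)  # -> ('A', 'A', 'B')
```

In practice Z(X) is computed with the forward algorithm rather than by enumeration, since the number of label sequences grows exponentially with sequence length.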

• Inference

• Viterbi Decoding

• Approximate Model Averaging

• Others?

• CRF training is supervised learning, so one can train using

• Maximum Likelihood (original paper)

• Used an iterative scaling method, which was very slow

• Also slow when implemented naïvely

• Mallet Implementation used BFGS algorithm

• http://en.wikipedia.org/wiki/BFGS

• Broyden-Fletcher-Goldfarb-Shanno

• Approximate second-order (quasi-Newton) algorithm

• Stochastic Gradient Method (2006) accelerated via Stochastic Meta Descent

• Gradient Tree Boosting (variant of a 2001 method)

• http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf

• Potential functions are sums of regression trees

• Decision trees using real values

• Published 2008

• Competitive with Mallet
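The maximum-likelihood objective all of these optimizers target can be sketched in a few lines: the gradient of the log-likelihood is observed feature counts minus expected feature counts under the model. Below is a toy version using plain gradient ascent and brute-force enumeration for Z (the two features, the data, and the learning rate are invented; real implementations use BFGS-style or stochastic-gradient optimizers):

```python
import itertools, math

LABELS = ("A", "B")

def feats(ys, xs):
    # Two hypothetical global features: label-matches-observation, label-repeats.
    f1 = sum(1 for t in range(len(xs)) if ys[t] == xs[t])
    f2 = sum(1 for t in range(1, len(xs)) if ys[t] == ys[t - 1])
    return (f1, f2)

def log_lik_and_grad(w, xs, ys):
    """Log-likelihood of (xs, ys) and its gradient w.r.t. the weights w."""
    cands = list(itertools.product(LABELS, repeat=len(xs)))
    scores = [sum(wi * fi for wi, fi in zip(w, feats(c, xs))) for c in cands]
    m = max(scores)
    logZ = m + math.log(sum(math.exp(s - m) for s in scores))
    # Expected feature counts under the current model distribution.
    exp_f = [0.0, 0.0]
    for c, s in zip(cands, scores):
        p = math.exp(s - logZ)
        for i, fi in enumerate(feats(c, xs)):
            exp_f[i] += p * fi
    obs_f = feats(ys, xs)
    ll = sum(wi * fi for wi, fi in zip(w, obs_f)) - logZ
    grad = [obs_f[i] - exp_f[i] for i in range(2)]  # observed minus expected
    return ll, grad

xs, ys = ("A", "A", "B"), ("A", "A", "B")
w = [0.0, 0.0]
lls = []
for _ in range(50):
    ll, g = log_lik_and_grad(w, xs, ys)
    lls.append(ll)
    w = [wi + 0.1 * gi for wi, gi in zip(w, g)]  # plain gradient ascent step
assert lls[-1] > lls[0]  # the log-likelihood improves during training
```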

• Bayesian (estimate posterior probability)

• Semi-CRF

• Instead of assigning labels to each member of sequence, labels are assigned to sub-sequences

• Advantage – “features for semi-CRF can measure properties of segments, and transition within a segment can be non-Markovian”

• http://www.cs.cmu.edu/~wcohen/postscript/semiCRF.pdf
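The segment-level idea can be sketched as follows: enumerate labeled segmentations of a sequence and score each segment as a whole, so a potential can see segment properties such as length (the labels, maximum segment length, and length-based potential below are invented for illustration; a real semi-CRF uses dynamic programming, not enumeration):

```python
import itertools, math

LABELS = ("O", "GENE")

def segmentations(n, max_len=3):
    """Yield lists of (start, end) spans covering range(n) with contiguous segments."""
    if n == 0:
        yield []
        return
    for length in range(1, min(max_len, n) + 1):
        for rest in segmentations(n - length, max_len):
            yield [(0, length)] + [(s + length, e + length) for s, e in rest]

def segment_score(label, start, end, xs):
    # Hypothetical segment potential: "GENE" segments prefer length >= 2,
    # a property an element-wise CRF potential could not express directly.
    return 1.0 if label == "GENE" and (end - start) >= 2 else 0.0

def best_labeling(xs):
    """Brute-force search over all labeled segmentations of xs."""
    best, best_score = None, -math.inf
    for segs in segmentations(len(xs)):
        for labels in itertools.product(LABELS, repeat=len(segs)):
            s = sum(segment_score(l, a, b, xs) for l, (a, b) in zip(labels, segs))
            if s > best_score:
                best_score, best = s, list(zip(labels, segs))
    return best, best_score

labeling, score = best_labeling(("a", "b", "c", "d"))
print(score)  # -> 2.0 (two GENE segments of length 2)
```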

• Qi et al, (2005)

• http://www.cs.purdue.edu/homes/alanqi/papers/Qi-Bayesian-CRF-AIstat05.pdf

• Replacement for ML method of Lafferty

• Reducing over-fitting

• “Power EP Method”

Hierarchical CRF (HCRF)

• http://www.cs.washington.edu/homes/fox/postscripts/places-isrr-05.pdf

• GPS motion data for surveillance and tracking, e.g. dividing a person's workday into labels such as work, travel, sleep, etc.

• Less work on conditional random fields in biology

• PubMed hits

• Conditional Random Field - 21

• Conditional Random Fields - 43

• CRF variants & promoter/regulatory elements show no hits

• CRF and ontology show no hits

• Plan

• Implement CRF in Java, apply to biology problems, try to find ways to extend?

• Link to original paper and review paper

• http://www.inference.phy.cam.ac.uk/hmw26/crf/

• Review paper:

• http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf

• Another review

• http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf

• Review slides

• http://www.cs.pitt.edu/~mrotaru/comp/nlp/Random%20Fields/Tutorial%20CRF%20Lafferty.pdf

• The boosting paper has a nice review

• http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf