


Introduction to Conditional Random Fields

John Osborne

Sept 4, 2009


  • Useful Definitions

  • Background

    • HMM

    • MEMM

  • Conditional Random Fields

    • Statistical and Graph Definitions

  • Computation (Training and Inference)

  • Extensions

    • Bayesian Conditional Random Fields

    • Hierarchical Conditional Random Fields

    • Semi-CRFs

  • Future Directions

Useful Definitions

  • Random Field (wikipedia)

    • In probability theory, let S = {X_1, ..., X_n}, with each X_i taking values in {0, 1, ..., G − 1}, be a set of random variables on the sample space Ω = {0, 1, ..., G − 1}^n. A probability measure π is a random field if, for all ω in Ω, π(ω) > 0.

  • Markov Process (chain if finite sequence)

    • Stochastic process with Markov property

  • Markov Property

    • The probability that a random variable assumes a value depends on the other random variables only through the ones that are its immediate neighbors

    • “memoryless”

  • Hidden Markov Model (HMM)

    • Markov model in which the states are hidden (unobserved); only the outputs they emit are observed

  • Viterbi Algorithm

    • Dynamic programming technique that finds the most likely sequence of hidden states to explain the observed outputs of an HMM

    • Determine labels

  • Potential Function == Feature Function

    • In a CRF the potential function scores the compatibility of y_t, y_{t-1} and w_t(X)
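The Viterbi definition above can be illustrated with a minimal sketch. The two-state HMM and all of its probabilities below are toy values assumed for illustration; they do not come from the presentation.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_path, best_log_prob) for an HMM given as nested dicts."""
    # best[t][s]: log-probability of the best state path ending in s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor r that maximizes the path score into s
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Follow the back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy model (assumed values): hidden health state, observed symptoms
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, logp = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, math.exp(logp))
```

Working in log space avoids numerical underflow on long sequences; the same recursion applies to CRF decoding once probabilities are replaced by potential scores.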


Background

  • Interest in CRFs arose from Richa’s work with gene expression

  • Current literature shows them performing better on NLP tasks than other commonly used NLP approaches like Support Vector Machines (SVM), neural networks, HMMs and others

    • Term coined by Lafferty et al. in 2001

  • Predecessors were HMMs and maximum entropy Markov models (MEMMs)

Hidden Markov Models (HMM)

  • Definition

    • Markov model in which the states are hidden (unobserved); only the outputs they emit are observed

  • Generative Model

  • Modeling dependencies on the entire input X would be prohibitive, hence the Markov assumption: each state looks only at the current element of the sequence

  • Cannot represent multiple interacting features or long-range dependencies
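As a worked form of the points above, the standard first-order HMM factorization can be written as follows. The notation (T for sequence length, the convention that p(y_1 | y_0) denotes the initial state distribution) is assumed here rather than taken from the slides.

```latex
p(X, Y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)
```

Each observation x_t depends only on its own state y_t, which is why overlapping or long-range features of X do not fit into this model.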


Maximum Entropy Markov Models (MEMM)

  • McCallum et al., 2000

  • Non-generative finite-state model based on next-state classifier

  • Directed graph

  • P(Y|X) = ∏_t P(y_t | y_{t-1}, w_t(X)) where w_t(X) is a sliding window over the X sequence

Label Bias Problem

  • Transitions leaving a given state compete only against each other, rather than against all other transitions in the model

  • Implies “Conservation of score mass” (Bottou, 1991)

  • Observations can be ignored, Viterbi decoding can’t downgrade a branch

  • CRF will solve this problem by having a single exponential model for the joint probability of the ENTIRE SEQUENCE OF LABELS given the observation sequence
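The label bias effect can be seen in a small sketch. It assumes a toy finite-state model with two paths, one fitting "rib" and one fitting "rob"; every score in `SCORES` is a made-up value for illustration. The per-state normalization mimics an MEMM, and the whole-path normalization mimics a CRF.

```python
import math

# Hypothetical scores for (previous state, next state, observation)
SCORES = {
    (0, 1, "r"): 0.5, (0, 4, "r"): 0.0,   # first step slightly favors the "rib" branch
    (1, 2, "i"): 2.0, (1, 2, "o"): -2.0,  # state 1 fits "i", not "o"
    (4, 5, "o"): 2.0, (4, 5, "i"): -2.0,  # state 4 fits "o", not "i"
    (2, 3, "b"): 0.0, (5, 3, "b"): 0.0,
}
SUCCESSORS = {0: [1, 4], 1: [2], 4: [5], 2: [3], 5: [3]}
PATHS = {"rib-path": [0, 1, 2, 3], "rob-path": [0, 4, 5, 3]}

def memm_path_prob(path, obs):
    """Product of per-state normalized transition probabilities (MEMM-style)."""
    prob = 1.0
    for prev, nxt, o in zip(path, path[1:], obs):
        z = sum(math.exp(SCORES[(prev, s, o)]) for s in SUCCESSORS[prev])
        prob *= math.exp(SCORES[(prev, nxt, o)]) / z
    return prob

def crf_path_probs(obs):
    """Normalize once over whole paths (CRF-style)."""
    totals = {
        name: sum(SCORES[(p, n, o)] for p, n, o in zip(path, path[1:], obs))
        for name, path in PATHS.items()
    }
    z = sum(math.exp(v) for v in totals.values())
    return {name: math.exp(v) / z for name, v in totals.items()}

obs = "rob"
memm = {name: memm_path_prob(path, obs) for name, path in PATHS.items()}
crf = crf_path_probs(obs)
print("MEMM:", memm)
print("CRF: ", crf)
```

State 1 has a single successor, so its local probability is 1 whatever the observation is; the MEMM therefore still prefers the "rib" path for the input "rob", while the globally normalized model lets the poor fit of "o" lower that path's score.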

Big Picture Definition

  • Wikipedia Definition (Aug 2009)

    • A conditional random field (CRF) is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.

  • Probabilistic model is a statistical model, in math terms “a pair (Y,P) where Y is the set of possible observations and P the set of possible probability distributions on Y”

    • In statistics terms this means the objective is to infer (or pick) the distinct element (probability distribution) in the set “P” given your observation Y

  • Discriminative model meaning it models the conditional probability distribution P(y|x) which can predict y given x.

    • It cannot go the other way around (produce x from y) since it is not a generative model (one capable of generating sample data given a model): it does not model a joint probability distribution

    • Similar to other discriminative models like support vector machines and neural networks

  • When analyzing sequential data a conditional model specifies the probabilities of possible label sequences given an observation sequence

CRF Graphical Definition

Definition from Lafferty

CRF Undirected Graph

  • Undirected graphical model

  • Let G = (V, E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G

Computation of CRF

  • Training

    • Conditioning

    • Calculation of Feature Function

    • P(Y|X) = (1/Z(X)) exp(∑_t Ψ(y_t, y_{t-1}, w_t(X)))

      • Z(X) is the normalizing factor

      • The potential function Ψ is the term in parentheses

  • Inference

    • Viterbi Decoding

    • Approximate Model Averaging

    • Others?
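The formula for P(Y|X) above can be sketched for a tiny linear-chain model. The sketch assumes the window w_t(X) is just the current word, and the weights in `TRANS` and `EMIT` are hypothetical. It computes Z(X) twice, by enumerating every label sequence and by a forward recursion, to show that the two agree.

```python
import itertools
import math

LABELS = ["N", "V"]
# Hypothetical weights, chosen only for illustration
TRANS = {("N", "N"): 0.2, ("N", "V"): 1.0, ("V", "N"): 0.8, ("V", "V"): -0.5}
EMIT = {("N", "dogs"): 1.5, ("V", "dogs"): -0.5, ("N", "bark"): 0.3, ("V", "bark"): 1.2}

def psi(y_t, y_prev, x_t):
    """Potential for one position; there is no transition term at the first position."""
    trans = TRANS[(y_prev, y_t)] if y_prev is not None else 0.0
    return trans + EMIT[(y_t, x_t)]

def sequence_score(labels, words):
    prevs = [None] + list(labels[:-1])
    return sum(psi(y, p, x) for y, p, x in zip(labels, prevs, words))

def log_z_brute_force(words):
    """log Z(X) by summing over all |LABELS|^T label sequences."""
    scores = [sequence_score(ys, words)
              for ys in itertools.product(LABELS, repeat=len(words))]
    return math.log(sum(math.exp(s) for s in scores))

def log_z_forward(words):
    """log Z(X) by the forward recursion, linear in the sequence length."""
    alpha = {y: psi(y, None, words[0]) for y in LABELS}
    for x in words[1:]:
        alpha = {
            y: math.log(sum(math.exp(alpha[p] + psi(y, p, x)) for p in LABELS))
            for y in LABELS
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))

def prob(labels, words):
    return math.exp(sequence_score(labels, words) - log_z_forward(words))

words = ["dogs", "bark", "dogs"]
total = sum(prob(ys, words) for ys in itertools.product(LABELS, repeat=len(words)))
print(log_z_brute_force(words), log_z_forward(words), total)
```

Enumeration grows exponentially with sequence length, so practical implementations rely on the forward recursion (and its backward counterpart) for training, and on Viterbi for decoding.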

Training Approaches

  • CRF training is supervised learning, so the model can be trained using

    • Maximum Likelihood (original paper)

      • Used iterative scaling method, was very slow

    • Gradient Ascent

      • Also slow when naïve

    • Mallet implementation used the BFGS algorithm

      • Broyden-Fletcher-Goldfarb-Shanno

      • Approximate 2nd order algorithm

    • Stochastic Gradient Method (2006) accelerated via Stochastic Meta Descent

    • Gradient Tree Boosting (variant of a 2001


      • Potential functions are sums of regression trees

        • Decision trees using real values

      • Published 2008

      • Competitive with Mallet

    • Bayesian (estimate posterior probability)
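To make the gradient ascent option concrete, here is a minimal sketch of maximum likelihood training on a toy data set. It uses the standard result that the gradient of the conditional log-likelihood is the observed feature counts minus the expected feature counts under the model. The feature templates, the data, the learning rate, and the brute-force enumeration are all simplifying assumptions; this is not the procedure of any particular paper or toolkit.

```python
import itertools
import math
from collections import Counter

LABELS = ["N", "V"]

def features(labels, words):
    """Count emission and transition indicator features for one labeled sequence."""
    feats = Counter()
    for t, (y, x) in enumerate(zip(labels, words)):
        feats[("emit", y, x)] += 1
        if t > 0:
            feats[("trans", labels[t - 1], y)] += 1
    return feats

def score(feats, weights):
    return sum(weights.get(f, 0.0) * c for f, c in feats.items())

def distribution(words, weights):
    """P(labels | words) for every label sequence, by brute-force enumeration."""
    seqs = list(itertools.product(LABELS, repeat=len(words)))
    scores = [score(features(ys, words), weights) for ys in seqs]
    z = sum(math.exp(s) for s in scores)
    return {ys: math.exp(s) / z for ys, s in zip(seqs, scores)}

def log_likelihood(data, weights):
    return sum(math.log(distribution(x, weights)[tuple(y)]) for x, y in data)

def gradient_step(data, weights, lr=0.2):
    """One gradient ascent step: observed counts minus expected counts."""
    grad = Counter()
    for words, labels in data:
        grad.update(features(labels, words))           # observed counts
        for ys, p in distribution(words, weights).items():
            for f, c in features(ys, words).items():
                grad[f] -= p * c                       # expected counts
    new_weights = dict(weights)
    for f, g in grad.items():
        new_weights[f] = new_weights.get(f, 0.0) + lr * g
    return new_weights

# Toy training data (assumed): word sequences with gold labels
data = [(["dogs", "bark"], ["N", "V"]), (["cats", "sleep"], ["N", "V"])]
weights = {}
ll_before = log_likelihood(data, weights)
for _ in range(50):
    weights = gradient_step(data, weights)
ll_after = log_likelihood(data, weights)
dist = distribution(["dogs", "bark"], weights)
print(ll_before, ll_after, max(dist, key=dist.get))
```

Quasi-Newton methods such as BFGS use the same gradient but choose better step directions, which is one reason they tend to converge in far fewer iterations than this naive loop.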

Conditional Random Field Extensions: Semi-CRF

  • Semi-CRF

    • Instead of assigning labels to each member of sequence, labels are assigned to sub-sequences

    • Advantage – “features for semi-CRF can measure properties of segments, and transition within a segment can be non-Markovian”
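A rough sketch of segment-level decoding may help here. The scoring functions `segment_score` and `transition_score`, the two labels, and the maximum segment length are hypothetical choices made for illustration; a real semi-CRF would learn weighted segment features rather than hard-code them.

```python
def segment_score(label, tokens):
    """Hypothetical score that looks at a whole segment at once."""
    if label == "NAME":
        if all(tok[0].isupper() for tok in tokens):
            # Segment-level feature: longer capitalized runs earn a bonus
            return 2.0 * len(tokens) + 0.5 * (len(tokens) - 1)
        return -3.0
    # Label "O": reward lowercase tokens, discourage long segments
    per_token = sum(-1.0 if tok[0].isupper() else 1.0 for tok in tokens)
    return per_token - 0.5 * (len(tokens) - 1)

def transition_score(prev_label, label):
    """Hypothetical penalty for two adjacent segments with the same label."""
    return -0.25 if prev_label == label else 0.0

def semi_markov_decode(words, labels=("NAME", "O"), max_len=3):
    n = len(words)
    # best[i][y] = (score, start, prev_label) for the best segmentation of
    # words[:i] whose last segment is words[start:i] with label y
    best = [dict() for _ in range(n + 1)]
    best[0][None] = (0.0, 0, None)
    for i in range(1, n + 1):
        for y in labels:
            candidates = [
                (prev_score + transition_score(prev, y)
                 + segment_score(y, words[j:i]), j, prev)
                for j in range(max(0, i - max_len), i)
                for prev, (prev_score, _, _) in best[j].items()
            ]
            best[i][y] = max(candidates, key=lambda c: c[0])
    # Backtrack from the best final label
    y = max(best[n], key=lambda lab: best[n][lab][0])
    i, segments = n, []
    while i > 0:
        _, j, prev = best[i][y]
        segments.append((y, words[j:i]))
        i, y = j, prev
    segments.reverse()
    return segments

words = ["yesterday", "John", "Smith", "visited", "New", "York", "City"]
segments = semi_markov_decode(words)
print(segments)
```

The dynamic program resembles Viterbi, except that each step chooses a segment boundary as well as a label, which is what lets the score depend on properties of the whole segment.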


Bayesian CRF

  • Qi et al. (2005)


  • Replacement for the maximum likelihood (ML) method of Lafferty

  • Reducing over-fitting

  • “Power EP Method”

Hierarchical CRF (HCRF)



  • Applied to GPS motion data for surveillance and tracking, e.g. dividing people’s workday into labels such as work, travel, sleep, etc.

  • Less work

Future Directions

  • Less work on conditional random fields in biology

    • PubMed hits

      • Conditional Random Field - 21

      • Conditional Random Fields - 43

    • CRF variants & promoter/regulatory element show no hits

  • CRF and ontology show no hits

  • Plan

    • Implement CRF in Java, apply to biology problems, try to find ways to extend?

Useful Papers

  • Link to original paper and review paper


    • Review paper:


  • Another review


  • Review slides


  • The boosting paper has a nice review