maximum entropy me maximum entropy markov model memm conditional random field crf l.
Skip this Video
Loading SlideShow in 5 Seconds..
Maximum Entropy (ME) Maximum Entropy Markov Model (MEMM) Conditional Random Field (CRF ) PowerPoint Presentation
Download Presentation
Maximum Entropy (ME) Maximum Entropy Markov Model (MEMM) Conditional Random Field (CRF )

Loading in 2 Seconds...

play fullscreen
1 / 31

Maximum Entropy (ME) Maximum Entropy Markov Model (MEMM) Conditional Random Field (CRF ) - PowerPoint PPT Presentation

  • Uploaded on

Maximum Entropy (ME) Maximum Entropy Markov Model (MEMM) Conditional Random Field (CRF ). Boltzmann-Gibbs Distribution. Given: States s 1 , s 2 , … , s n Density p ( s ) = p s Maximum entropy principle : Without any information, one chooses the density p s to maximize the entropy

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Maximum Entropy (ME) Maximum Entropy Markov Model (MEMM) Conditional Random Field (CRF )' - foy

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
maximum entropy me maximum entropy markov model memm conditional random field crf

Maximum Entropy (ME)Maximum Entropy Markov Model (MEMM)Conditional Random Field (CRF)

boltzmann gibbs distribution
Boltzmann-Gibbs Distribution
  • Given:
    • States s1, s2, …, sn
    • Density p(s) = ps
  • Maximum entropy principle:
    • Without any information, one chooses the density ps to maximize the entropy

subject to the constraints

boltzmann gibbs cnt d
Boltzmann-Gibbs (Cnt’d)
  • Consider the Lagrangian
  • Take partial derivatives of L with respect to psand set them to zero, we obtain Boltzmann-Gibbs density functions

where Z is the normalizing factor

  • From the Lagrangian


boltzmann gibbs cnt d5
Boltzmann-Gibbs (Cnt’d)
  • Classification Rule
    • Use of Boltzmann-Gibbs as prior distribution
    • Compute the posterior for given observed data and features fi
    • Use the optimal posterior to classify
boltzmann gibbs cnt d6
Boltzmann-Gibbs (Cnt’d)
  • Maximum Entropy (ME)
    • The posterior is the state probability density

p(s | X), where X = (x1, x2, …, xn)

  • Maximum entropy Markov model (MEMM)
    • The posterior consists of transition probability densities p(s | s´, X)
boltzmann gibbs cnt d7
Boltzmann-Gibbs (Cnt’d)
  • Conditional random field (CRF)
    • The posterior consists of both transition probability densities p(s | s´, X) and state probability densities

p(s | X)

  • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001.
  • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
  • P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.
an example
An Example
  • Five possible French translations of the English word in:
    • Dans, en, à, au cours de, pendant
  • Certain constraints obeyed:
    • When April follows in, the proper translation is en
  • How do we make the proper translation of a French word y under an English context x?
  • Probability assignment p(y|x):
    • y: French word, x: English context
  • Indicator function of a context feature f
expected values of f
Expected Values of f
  • The expected value of f with respect to the empirical distribution
  • The expected value of f with respect to the conditional probabilityp(y|x)
constraint equation
Constraint Equation
  • Set equal the two expected values:

or equivalently,

maximum entropy principle
Maximum Entropy Principle
  • Given n feature functions fi, we want p(y|x) to maximize the entropy measure

where p is chosen from

constrained optimization problem
Constrained Optimization Problem
  • The Lagrangian
  • Solutions
iterative solution
Iterative Solution
  • Compute the expectation of fi under the current estimate of probability function
  • Update Lagrange multipliers
  • Update probability functions
feature selection
Feature Selection
  • Motivation:
    • For a large collection of candidate features, we want to select a small subset
    • Incremental growth
  • Computation of maximum entropy model is costly for each candidate f
  • Simplification assumption:
    • The multipliers λ associated with S do not change when f is added to S
difference from memm
Difference from MEMM
  • If the state feature is dropped, we obtain a MEMM model
  • The drawback of MEMM
    • The state probabilities are not learnt, but inferred
    • Bias can be generated, since the transition feature is dominating in the training
difference from hmm
Difference from HMM
  • HMM is a generative model
  • In order to define a joint distribution, this model must enumerate all possible observation sequences and their corresponding label sequences
  • This task is intractable, unless observation elements are represented as isolated units
crf training methods
CRF Training Methods
  • CRF training requires intensive efforts in numerical manipulation
  • Preconditioned conjugate gradient
    • Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction
  • Limited-Memory Quasi-Newton
    • Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation
  • Voted perceptron
voted perceptron
Voted Perceptron
  • Like the perceptron algorithm, this algorithm scans through the training instances, updating the weight vectorλt when a prediction error is detected
  • Instead of taking just the final weight vector, the voted perceptron algorithms takes the average of theλt


  • A. L. Berger, S. A. D. Pietra, V. J. D. Pietra, A maximum entropy approach to natural language processing
  • A. McCallum and F. Pereira, Maximum entropy Markov models for information extraction and segmentation
  • H. M. Wallach, Conditional random fields: an introduction
  • J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmentation and labeling sequence data
  • F. Sha and F. Pereira, Shallow parsing with conditional random fields