Latent Variables, Mixture Models and EM

Christopher M. Bishop

Microsoft Research, Cambridge

BCS Summer School, Exeter, 2003

Overview
  • K-means clustering
  • Gaussian mixtures
  • Maximum likelihood and EM
  • Probabilistic graphical models
  • Latent variables: EM revisited
  • Bayesian Mixtures of Gaussians
  • Variational Inference
Old Faithful Data Set

[Figure: scatter plot of eruption duration (minutes) against time between eruptions (minutes)]

K-means Algorithm
  • Goal: represent a data set in terms of K clusters each of which is summarized by a prototype
  • Initialize prototypes, then iterate between two phases:
    • E-step: assign each data point to nearest prototype
    • M-step: update prototypes to be the cluster means
  • Simplest version is based on Euclidean distance
    • re-scale Old Faithful data
  • Responsibilities assign data points to clusters, such that each data point belongs to exactly one cluster
  • Example: 5 data points and 3 clusters
K-means Cost Function
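The cost function itself was an image on the original slide; in the notation of Bishop's other writing on K-means it is presumably the sum-of-squares distortion

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\bigl\| \mathbf{x}_n - \boldsymbol{\mu}_k \bigr\|^2 ,
\qquad r_{nk} \in \{0, 1\}, \qquad \sum_{k=1}^{K} r_{nk} = 1
```

where the binary responsibility r_{nk} equals 1 exactly when data point x_n is assigned to prototype mu_k.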
Minimizing the Cost Function
  • E-step: minimize the cost function w.r.t. the responsibilities
    • assigns each data point to its nearest prototype
  • M-step: minimize the cost function w.r.t. the prototypes
    • each prototype is set to the mean of the points in its cluster
  • Convergence guaranteed since there is a finite number of possible settings for the responsibilities
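The two-phase iteration above can be sketched in a few lines of plain Python (an illustrative toy implementation, not from the slides; the function name and the toy data are invented):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: the E-step hard-assigns each point to
    its nearest prototype, the M-step moves each prototype to the
    mean of its cluster."""
    rng = random.Random(seed)
    protos = rng.sample(points, k)  # initialize prototypes from the data
    for _ in range(iters):
        # E-step: assign each point to the nearest prototype (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - protos[i][0]) ** 2 + (y - protos[i][1]) ** 2)
            clusters[j].append((x, y))
        # M-step: set each prototype to the mean of the points assigned to it
        for i, c in enumerate(clusters):
            if c:
                protos[i] = (sum(p[0] for p in c) / len(c),
                             sum(p[1] for p in c) / len(c))
    return protos

# Two well-separated blobs: the prototypes converge to the blob centres
data = [(0.1, 0.0), (-0.1, 0.1), (0.0, -0.1),
        (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
centres = sorted(kmeans(data, 2))
```

The finite number of possible responsibility settings mentioned above is what guarantees this loop cannot cycle forever.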
Limitations of K-means
  • Hard assignments of data points to clusters – small shift of a data point can flip it to a different cluster
  • Not clear how to choose the value of K
  • Solution: replace ‘hard’ clustering of K-means with ‘soft’ probabilistic assignments
  • Represents the probability distribution of the data as a Gaussian mixture model
The Gaussian Distribution
  • Multivariate Gaussian
  • Define precision to be the inverse of the covariance
  • In 1-dimension
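The density formulas were images in the transcript; the standard forms referred to are presumably

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}
    \exp\Bigl\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}}
    \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Bigr\},
\qquad
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\Bigl\{ -\frac{(x - \mu)^2}{2\sigma^2} \Bigr\}
```

with the precision defined as the inverse covariance, Lambda = Sigma^{-1} (or 1/sigma^2 in one dimension).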
Likelihood Function
  • Data set of N observed data points
  • Assume observed data points generated independently
  • Viewed as a function of the parameters, this is known as the likelihood function
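The likelihood equation was lost with the slide image; for independently generated points it is presumably

```latex
p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}),
\qquad
\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}
```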
Maximum Likelihood
  • Set the parameters by maximizing the likelihood function
  • Equivalently maximize the log likelihood
Maximum Likelihood Solution
  • Maximizing w.r.t. the mean gives the sample mean
  • Maximizing w.r.t. the covariance gives the sample covariance
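The maximum likelihood solutions referred to in the two bullets above are presumably the familiar

```latex
\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N}
  (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})
  (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}
```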
Bias of Maximum Likelihood
  • Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
  • The maximum likelihood solution systematically under-estimates the covariance
  • This is an example of over-fitting
Unbiased Variance Estimate
  • Clearly we can remove the bias by dividing by N - 1 instead of N, since this gives an unbiased estimate of the variance
  • Arises naturally in a Bayesian treatment (see later)
  • For an infinite data set the two expressions are equal
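In one dimension the expectations behind the last two slides, reconstructed in standard notation, are presumably

```latex
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu,
\qquad
\mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N - 1}{N}\, \sigma^2,
\qquad
\widetilde{\sigma}^2 = \frac{1}{N - 1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2
\;\Rightarrow\;
\mathbb{E}\bigl[\widetilde{\sigma}^2\bigr] = \sigma^2
```

and as N goes to infinity the factor (N - 1)/N tends to 1, so the two estimates coincide.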
Gaussian Mixtures
  • Linear super-position of Gaussians
  • Normalization and positivity require
  • Can interpret the mixing coefficients as prior probabilities
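The mixture density itself was an image; the linear superposition referred to is presumably

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad
0 \le \pi_k \le 1,
\qquad
\sum_{k=1}^{K} \pi_k = 1
```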
Sampling from the Gaussian
  • To generate a data point:
    • first pick one of the components, with probability given by its mixing coefficient
    • then draw a sample from that component
  • Repeat these two steps for each new data point
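This ancestral-sampling procedure can be sketched in plain Python (illustrative only; the function name and the mixture parameters are invented):

```python
import random

def sample_gmm(pis, mus, sigmas, n, seed=0):
    """Ancestral sampling from a 1-D Gaussian mixture: pick a
    component with probability pi_k, then draw from that
    component's Gaussian."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(pis)), weights=pis)[0]  # pick component
        samples.append(rng.gauss(mus[k], sigmas[k]))      # draw from it
    return samples

# Two components at -4 and +4 with mixing coefficients 0.3 and 0.7
xs = sample_gmm([0.3, 0.7], [-4.0, 4.0], [1.0, 1.0], 1000)
```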
Fitting the Gaussian Mixture
  • We wish to invert this process – given the data set, find the corresponding parameters:
    • mixing coefficients
    • means
    • covariances
  • If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
  • Problem: the data set is unlabelled
  • We shall refer to the labels as latent (= hidden) variables
Posterior Probabilities
  • We can think of the mixing coefficients as prior probabilities for the components
  • For a given value of the observed variable we can evaluate the corresponding posterior probabilities, called responsibilities
  • These are given from Bayes’ theorem by
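The Bayes' theorem expression for the responsibilities was an image; it is presumably

```latex
\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x})
  = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```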
Maximum Likelihood for the GMM
  • The log likelihood function takes the form
  • Note: sum over components appears inside the log
  • There is no closed form solution for maximum likelihood
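The log likelihood referred to above, reconstructed from the mixture density, is presumably

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \Bigl\{ \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \Bigr\}
```

and it is the sum over k inside the logarithm that blocks a closed-form solution.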
Over-fitting in Gaussian Mixture Models
  • Singularities in the likelihood function arise when a component ‘collapses’ onto a data point: the likelihood grows without bound as that component’s variance shrinks to zero
  • Likelihood function gets larger as we add more components (and hence parameters) to the model
    • not clear how to choose the number K of components
Problems and Solutions
  • How to maximize the log likelihood
    • solved by expectation-maximization (EM) algorithm
  • How to avoid singularities in the likelihood function
    • solved by a Bayesian treatment
  • How to choose number K of components
    • also solved by a Bayesian treatment
EM Algorithm – Informal Derivation
  • Let us proceed by simply differentiating the log likelihood
  • Setting the derivative with respect to each mean equal to zero gives an update in which each mean is the responsibility-weighted average of the data points
EM Algorithm – Informal Derivation
  • Similarly for the covariances
  • For the mixing coefficients, use a Lagrange multiplier to enforce normalization
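The resulting update equations were images on the slides; reconstructed in standard notation they are presumably

```latex
N_k = \sum_{n=1}^{N} \gamma(z_{nk}),
\qquad
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,
  (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}},
\qquad
\pi_k = \frac{N_k}{N}
```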
EM Algorithm – Informal Derivation
  • The solutions are not closed form since they are coupled
  • Suggests an iterative scheme for solving them:
    • Make initial guesses for the parameters
    • Alternate between the following two stages:
      • E-step: evaluate responsibilities
      • M-step: update parameters using ML results
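The full E/M loop for a one-dimensional mixture can be sketched as follows (illustrative only, not the speaker's code; the quantile initialization and the collapse guard are my own choices):

```python
import math
import random

def em_gmm_1d(xs, k=2, iters=50):
    """Minimal EM sketch for a 1-D Gaussian mixture."""
    xs_sorted = sorted(xs)
    n = len(xs)
    # deterministic initial guesses: evenly spaced quantiles of the data
    mus = [xs_sorted[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    sigmas = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: evaluate responsibilities via Bayes' theorem
        gammas = []
        for x in xs:
            w = [pis[j] * math.exp(-0.5 * ((x - mus[j]) / sigmas[j]) ** 2)
                 / (sigmas[j] * math.sqrt(2 * math.pi)) for j in range(k)]
            s = sum(w)
            gammas.append([wj / s for wj in w])
        # M-step: re-estimate parameters using the weighted ML results
        for j in range(k):
            nj = sum(g[j] for g in gammas)
            pis[j] = nj / n
            mus[j] = sum(g[j] * x for g, x in zip(gammas, xs)) / nj
            var = sum(g[j] * (x - mus[j]) ** 2 for g, x in zip(gammas, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # guard against collapse
    return pis, mus, sigmas

# Fit a two-component mixture to data drawn from Gaussians at -4 and +4
rng = random.Random(1)
data = ([rng.gauss(-4.0, 1.0) for _ in range(200)]
        + [rng.gauss(4.0, 1.0) for _ in range(200)])
pis, mus, sigmas = em_gmm_1d(data)
```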
Digression: Probabilistic Graphical Models
  • Graphical representation of a probabilistic model
  • Each variable corresponds to a node in the graph
  • Links in the graph denote relations between variables
  • Motivation:
    • visualization of models and motivation for new models
    • graphical determination of conditional independence
    • complex calculations (inference) performed using graphical operations (e.g. forward-backward for HMM)
  • Here we consider directed graphs
Example: 3 Variables
  • General distribution over 3 variables
  • Apply product rule of probability twice
  • Express as a directed graph
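The equations were lost with the slide image; applying the product rule twice presumably gave

```latex
p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a)
```

with one node per variable and a directed link into each variable from everything it is conditioned on.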
General Decomposition Formula
  • Joint distribution is a product of conditionals, each conditioned on its parent nodes
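Written out, the general factorization is presumably

```latex
p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)
```

where pa_k denotes the set of parents of node x_k in the directed graph.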
EM – Latent Variable Viewpoint
  • Binary latent variables describing which component generated each data point
  • Conditional distribution of observed variable
  • Prior distribution of latent variables
  • Marginalizing over the latent variables we obtain
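The three distributions named above were equation images; for a one-of-K binary latent vector z they are presumably

```latex
p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K}
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k},
\qquad
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k},
\qquad
p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})
  = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
```

which recovers the Gaussian mixture density.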
Latent Variable View of EM
  • If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which has a trivial closed-form solution (fit each component to its corresponding set of data points)
  • We don’t know the values of the latent variables
  • However, for given parameter values we can compute the expected values of the latent variables
Expected Complete-Data Log Likelihood
  • Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
  • Use these to evaluate the responsibilities
  • Consider the expected complete-data log likelihood, where the responsibilities are computed using the guessed parameter values
  • We are implicitly ‘filling in’ the latent variables with our best guess
  • Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results
K-means Revisited
  • Consider a GMM whose components share a common covariance
  • Take the limit as the common variance goes to zero
  • Responsibilities become binary
  • The expected complete-data log likelihood reduces to the K-means cost function
EM in General
  • Consider an arbitrary distribution over the latent variables
  • The following decomposition of the log likelihood always holds
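The decomposition, with the two terms that were images on the slide, is presumably

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\theta})
  = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p),
\qquad
\mathcal{L}(q, \boldsymbol{\theta})
  = \sum_{\mathbf{Z}} q(\mathbf{Z})
    \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})},
\qquad
\mathrm{KL}(q \,\|\, p)
  = -\sum_{\mathbf{Z}} q(\mathbf{Z})
    \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}
```

Since the KL divergence is non-negative, L(q, theta) is a lower bound on the log likelihood.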
Optimizing the Bound
  • E-step: maximize the bound with respect to the distribution over latent variables
    • equivalent to minimizing the KL divergence
    • sets the distribution equal to the posterior over the latent variables
  • M-step: maximize the bound with respect to the parameters
    • equivalent to maximizing the expected complete-data log likelihood
  • Each EM cycle must increase incomplete-data likelihood unless already at a (local) maximum
Bayesian Inference
  • Include prior distributions over parameters
  • Advantages in using conjugate priors
  • Example: consider a single Gaussian over one variable
    • assume variance is known and mean is unknown
    • likelihood function for the mean
  • Choose Gaussian prior for mean
Bayesian Inference for a Gaussian
  • Posterior (proportional to the product of prior and likelihood) will then also be Gaussian
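The posterior parameters, which were an image on the slide, are presumably the standard conjugate update

```latex
p(\mu \mid \mathbf{X}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2),
\qquad
\mu_N = \frac{\sigma^2}{N \sigma_0^2 + \sigma^2}\, \mu_0
      + \frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2}\, \mu_{\mathrm{ML}},
\qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
```

where mu_0 and sigma_0^2 are the prior mean and variance, and sigma^2 is the known data variance.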
Bayesian Mixture of Gaussians
  • Conjugate priors for the parameters:
    • Dirichlet prior for mixing coefficients
    • Normal-Wishart prior for means and precisions (the Wishart is the conjugate prior for the precision matrix of a multivariate Gaussian)
Graphical Representation
  • Parameters and latent variables appear on equal footing
Variational Inference
  • As with many Bayesian models, exact inference for the mixture of Gaussians is intractable
  • Approximate Bayesian inference traditionally based on Laplace’s method (local Gaussian approximation to the posterior) or Markov chain Monte Carlo
  • Variational Inference is an alternative, broadly applicable deterministic approximation scheme
General View of Variational Inference
  • Consider again the previous decomposition, but where the posterior is over all latent variables and parameters
  • Maximizing the bound over the approximating distribution would give the true posterior distribution, but this is intractable by definition
Factorized Approximation
  • Goal: choose a family of distributions which are:
    • sufficiently flexible to give good posterior approximation
    • sufficiently simple to remain tractable
  • Here we consider factorized distributions
  • No further assumptions are required!
  • Optimal solution for one factor, keeping the remainder fixed
  • The solutions are coupled, so initialize them and then cyclically update each in turn
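The optimal-factor equation, lost with the slide image, is presumably the standard mean-field result

```latex
\ln q_j^{\star}(\mathbf{Z}_j)
  = \mathbb{E}_{i \neq j}\bigl[ \ln p(\mathbf{X}, \mathbf{Z}) \bigr] + \mathrm{const}
```

where the expectation is taken over all factors other than q_j.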
Lower Bound
  • The bound can also be evaluated explicitly
  • Useful for maths/code verification
  • Also useful for model comparison
Illustration: Univariate Gaussian
  • Likelihood function
  • Conjugate priors
  • Factorized variational distribution
Exact Solution
  • For this very simple example there is an exact solution
  • The expected precision can be evaluated in closed form
  • Compare with earlier maximum likelihood solution
Variational Mixture of Gaussians
  • Assume factorized posterior distribution
  • Gives an optimal solution in which the factor over the mixing coefficients is a Dirichlet distribution, and the factor over the means and precisions is a Normal-Wishart distribution
Sufficient Statistics
  • Small computational overhead compared to maximum likelihood EM
Sparse Bayes for Gaussian Mixture
  • Instead of comparing different values of K, start with a large value and prune out excess components
  • Treat mixing coefficients as parameters, and maximize marginal likelihood [Corduneanu + Bishop, AIStats ’01]
  • Gives simple re-estimation equations for the mixing coefficients – interleave with variational updates
General Variational Framework
  • Currently for each new model:
    • derive the variational update equations
    • write application-specific code to find the solution
  • Both stages are time consuming and error prone
  • Can we build a general-purpose inference engine which automates these procedures?
  • Variational Inference for Bayesian Networks
  • Bishop and Winn (1999)
  • A general inference engine using variational methods
  • Models specified graphically
Local Computation in VIBES
  • A key observation is that, in the general solution, the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket
  • Permits a local object-oriented implementation
Take-home Messages
  • Bayesian mixture of Gaussians
    • no singularities
    • determines optimal number of components
  • Variational inference
    • effective solution for Bayesian GMM
    • optimizes rigorous bound
    • little computational overhead compared to EM
    • rapid prototyping of probabilistic models
    • graphical specification
Viewgraphs, tutorials and publications available from: