
# Latent Variables, Mixture Models and EM

Christopher M. Bishop

Microsoft Research, Cambridge

BCS Summer School, Exeter, 2003

Overview
• K-means clustering
• Gaussian mixtures
• Maximum likelihood and EM
• Probabilistic graphical models
• Latent variables: EM revisited
• Bayesian Mixtures of Gaussians
• Variational Inference
• VIBES
Old Faithful Data Set

[Scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]

K-means Algorithm
• Goal: represent a data set in terms of K clusters each of which is summarized by a prototype
• Initialize prototypes, then iterate between two phases:
• E-step: assign each data point to nearest prototype
• M-step: update prototypes to be the cluster means
• Simplest version is based on Euclidean distance
• first re-scale the Old Faithful data so that the two variables have comparable ranges
Responsibilities
• Responsibilities $r_{nk} \in \{0, 1\}$ assign data points to clusters, such that $\sum_{k} r_{nk} = 1$ for each $n$
• Example: 5 data points and 3 clusters

[Figure: the 5 data points, the 3 prototypes, and the corresponding 5 × 3 binary matrix of responsibilities]

K-means Cost Function
• Sum of squared distances from each data point to its assigned prototype:
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$$
Minimizing the Cost Function
• E-step: minimize $J$ w.r.t. the responsibilities $r_{nk}$
• assigns each data point to its nearest prototype
• M-step: minimize $J$ w.r.t. the prototypes $\boldsymbol{\mu}_k$
• gives
$$\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\, \mathbf{x}_n}{\sum_{n} r_{nk}}$$
• each prototype set to the mean of the points in that cluster
• Convergence guaranteed since there is a finite number of possible settings for the responsibilities
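As a concrete illustration, here is a minimal NumPy sketch of these two alternating steps (my own illustration, not code from the slides); `X` is an N × D array such as the re-scaled Old Faithful data:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the E-step (assignment) and M-step (means)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # initialize prototypes
    for _ in range(n_iters):
        # E-step: r[n] = index of the nearest prototype to x_n
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d2.argmin(axis=1)
        # M-step: move each prototype to the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: assignments can no longer change
            break
        mu = new_mu
    return mu, r
```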
Limitations of K-means
• Hard assignments of data points to clusters – small shift of a data point can flip it to a different cluster
• Not clear how to choose the value of K
• Solution: replace the ‘hard’ clustering of K-means with ‘soft’ probabilistic assignments
• represent the probability distribution of the data as a Gaussian mixture model

The Gaussian Distribution
• Multivariate Gaussian, with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$
• Define the precision to be the inverse of the covariance, $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$
• In 1-dimension
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$$
Likelihood Function
• Data set $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
• Assume observed data points generated independently:
$$p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$
• Viewed as a function of the parameters, this is known as the likelihood function
Maximum Likelihood
• Set the parameters by maximizing the likelihood function
• Equivalently, maximize the log likelihood
$$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})$$
Maximum Likelihood Solution
• Maximizing w.r.t. the mean gives the sample mean
$$\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$$
• Maximizing w.r.t. the covariance gives the sample covariance
$$\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}$$
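These closed-form estimates are two lines of NumPy; a quick sketch on synthetic data (the true parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_ml = X.mean(axis=0)             # sample mean
diff = X - mu_ml
sigma_ml = diff.T @ diff / len(X)  # sample covariance (divides by N)
print(mu_ml, sigma_ml, sep="\n")
```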
Bias of Maximum Likelihood
• Consider the expectations of the maximum likelihood estimates under the Gaussian distribution:
$$\mathbb{E}[\boldsymbol{\mu}_{\mathrm{ML}}] = \boldsymbol{\mu}, \qquad \mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\, \boldsymbol{\Sigma}$$
• The maximum likelihood solution systematically under-estimates the covariance
• This is an example of over-fitting
Unbiased Variance Estimate
• Clearly we can remove the bias by using
$$\widetilde{\boldsymbol{\Sigma}} = \frac{N}{N-1}\, \boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}$$
since this gives $\mathbb{E}[\widetilde{\boldsymbol{\Sigma}}] = \boldsymbol{\Sigma}$
• Arises naturally in a Bayesian treatment (see later)
• For an infinite data set the two expressions are equal
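A small numerical check of the bias (a toy experiment of mine, not from the slides): averaging the two estimators over many samples of size N = 5 shows the 1/N version shrinking towards (N-1)/N of the true variance.

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, true_var = 5, 200_000, 4.0
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))

biased = samples.var(axis=1, ddof=0).mean()    # divides by N
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by N - 1
print(biased, unbiased)  # roughly 3.2 (= 4.0 * (N-1)/N) versus roughly 4.0
```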
Gaussian Mixtures
• Linear super-position of Gaussians:
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
• Normalization and positivity require $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$
• Can interpret the mixing coefficients $\pi_k$ as prior probabilities for the components
Sampling from the Gaussian Mixture
• To generate a data point:
• first pick one of the components with probability $\pi_k$
• then draw a sample $\mathbf{x}$ from that component's Gaussian
• Repeat these two steps for each new data point
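This two-stage ('ancestral') sampling procedure is straightforward in NumPy; the mixture parameters here are invented for illustration:

```python
import numpy as np

def sample_gmm(pi, mus, covs, n, seed=0):
    """Ancestral sampling: pick component k with probability pi[k],
    then draw x from N(mus[k], covs[k])."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(pi), size=n, p=pi)
    X = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in ks])
    return X, ks

pi = [0.5, 0.3, 0.2]  # illustrative parameters
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2)] * 3
X, labels = sample_gmm(pi, mus, covs, n=500)
```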
Fitting the Gaussian Mixture
• We wish to invert this process – given the data set, find the corresponding parameters:
• mixing coefficients $\pi_k$
• means $\boldsymbol{\mu}_k$
• covariances $\boldsymbol{\Sigma}_k$
• If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
• Problem: the data set is unlabelled
• We shall refer to the labels as latent (= hidden) variables
Posterior Probabilities
• We can think of the mixing coefficients as prior probabilities for the components
• For a given value of $\mathbf{x}$ we can evaluate the corresponding posterior probabilities, called responsibilities
• These are given by Bayes’ theorem:
$$\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
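In code, evaluating responsibilities is a direct transcription of Bayes' theorem (a sketch using SciPy's Gaussian density; parameters as produced by the `sample_gmm` sketch above would do):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, covs):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    dens = np.column_stack([p * multivariate_normal.pdf(X, m, c)
                            for p, m, c in zip(pi, mus, covs)])
    return dens / dens.sum(axis=1, keepdims=True)
```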
Maximum Likelihood for the GMM
• The log likelihood function takes the form
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
• Note: the sum over components appears inside the log
• There is no closed-form solution for maximum likelihood
Over-fitting in Gaussian Mixture Models
• Singularities in the likelihood function when a component ‘collapses’ onto a data point: setting $\boldsymbol{\mu}_j = \mathbf{x}_n$ gives
$$\mathcal{N}(\mathbf{x}_n \mid \mathbf{x}_n, \sigma_j^2 \mathbf{I}) = \frac{1}{(2\pi)^{D/2} \sigma_j^{D}}$$
then consider the limit $\sigma_j \to 0$, in which this term diverges
• Likelihood function gets larger as we add more components (and hence parameters) to the model
• not clear how to choose the number K of components
Problems and Solutions
• How to maximize the log likelihood
• solved by expectation-maximization (EM) algorithm
• How to avoid singularities in the likelihood function
• solved by a Bayesian treatment
• How to choose number K of components
• also solved by a Bayesian treatment
EM Algorithm – Informal Derivation
• Let us proceed by simply differentiating the log likelihood
• Setting the derivative with respect to $\boldsymbol{\mu}_k$ equal to zero gives
$$\sum_{n=1}^{N} \gamma_k(\mathbf{x}_n)\, \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0$$
giving
$$\boldsymbol{\mu}_k = \frac{\sum_{n} \gamma_k(\mathbf{x}_n)\, \mathbf{x}_n}{\sum_{n} \gamma_k(\mathbf{x}_n)}$$
which is simply the weighted mean of the data
EM Algorithm – Informal Derivation
• Similarly for the covariances:
$$\boldsymbol{\Sigma}_k = \frac{\sum_{n} \gamma_k(\mathbf{x}_n)\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}}{\sum_{n} \gamma_k(\mathbf{x}_n)}$$
• For the mixing coefficients, use a Lagrange multiplier (enforcing $\sum_k \pi_k = 1$) to give
$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_k(\mathbf{x}_n)$$
EM Algorithm – Informal Derivation
• The solutions are not closed form since they are coupled
• Suggests an iterative scheme for solving them:
• Make initial guesses for the parameters
• Alternate between the following two stages:
• E-step: evaluate responsibilities
• M-step: update parameters using ML results
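Putting the two stages together gives the familiar EM loop; a compact NumPy/SciPy sketch (illustrative only, with no guard against the singularities discussed earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: the E-step computes responsibilities,
    the M-step re-estimates pi, mu, Sigma from the weighted data."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k]
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, sigma
```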
Digression: Probabilistic Graphical Models
• Graphical representation of a probabilistic model
• Each variable corresponds to a node in the graph
• Links in the graph denote relations between variables
• Motivation:
• visualization of models and motivation for new models
• graphical determination of conditional independence
• complex calculations (inference) performed using graphical operations (e.g. forward-backward for HMM)
• Here we consider directed graphs
Example: 3 Variables
• General distribution over 3 variables $p(a, b, c)$
• Apply the product rule of probability twice:
$$p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a)$$
• Express as a directed graph: one node per variable, with a link into each node from every variable on which it is conditioned
General Decomposition Formula
• Joint distribution is a product of conditionals, each conditioned on its parent nodes:
$$p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)$$
• Example (for a particular directed graph over seven variables):
$$p(x_1, \ldots, x_7) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\, p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)$$
EM – Latent Variable Viewpoint
• Binary latent variables $z_{nk} \in \{0, 1\}$, with $\sum_k z_{nk} = 1$, describing which component generated each data point
• Conditional distribution of the observed variable:
$$p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}$$
• Prior distribution of the latent variables:
$$p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}$$
• Marginalizing over the latent variables we obtain
$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Latent Variable View of EM
• If we knew the values of the latent variables, we would maximize the complete-data log likelihood
$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
which has a trivial closed-form solution (fit each component to the corresponding set of data points)
• We don’t know the values of the latent variables
• However, for given parameter values we can compute the expected values of the latent variables
Expected Complete-Data Log Likelihood
• Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
• Use these to evaluate the responsibilities
• Consider the expected complete-data log likelihood
$$\mathbb{E}_{\mathbf{Z}}\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
where the responsibilities $\gamma_{nk}$ are computed using the current guessed parameter values
• We are implicitly ‘filling in’ the latent variables with our best guess
• Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results
K-means Revisited
• Consider a GMM with common covariances $\boldsymbol{\Sigma}_k = \epsilon \mathbf{I}$
• Take the limit $\epsilon \to 0$
• Responsibilities become binary: $\gamma_{nk} \to r_{nk} \in \{0, 1\}$
• Expected complete-data log likelihood becomes
$$-\frac{1}{2\epsilon} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 + \mathrm{const.}$$
so maximizing it is equivalent to minimizing the K-means cost function $J$
EM in General
• Consider an arbitrary distribution $q(\mathbf{Z})$ over the latent variables
• The following decomposition always holds:
$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)$$
where
$$\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}, \qquad \mathrm{KL}(q \,\|\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}$$
Optimizing the Bound
• E-step: maximize $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to $q(\mathbf{Z})$
• equivalent to minimizing the KL divergence
• sets $q(\mathbf{Z})$ equal to the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$
• M-step: maximize the bound with respect to $\boldsymbol{\theta}$
• equivalent to maximizing the expected complete-data log likelihood
• Each EM cycle must increase the incomplete-data likelihood, unless it is already at a (local) maximum
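Since each cycle must not decrease the incomplete-data log likelihood, it is worth evaluating it after every iteration as a convergence and correctness check (a sketch consistent with the EM code above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, sigma):
    """Incomplete-data log likelihood: sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }."""
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                            for k in range(len(pi))])
    return np.log(dens.sum(axis=1)).sum()
```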
Bayesian Inference
• Include prior distributions over parameters
• Advantages in using conjugate priors
• Example: consider a single Gaussian over one variable
• assume the variance $\sigma^2$ is known and the mean $\mu$ is unknown
• likelihood function for the mean:
$$p(\mathbf{X} \mid \mu) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$$
• Choose a Gaussian prior for the mean:
$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$$
Bayesian Inference for a Gaussian
• The posterior (proportional to the product of the prior and the likelihood) will then also be Gaussian:
$$p(\mu \mid \mathbf{X}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$$
where
$$\mu_N = \frac{\sigma^2}{N \sigma_0^2 + \sigma^2}\, \mu_0 + \frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2}\, \mu_{\mathrm{ML}}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
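These formulas are easily checked numerically (a sketch; the prior hyperparameter values below are arbitrary):

```python
import numpy as np

def posterior_mean_params(x, sigma2, mu0, sigma02):
    """Posterior N(mu | mu_N, sigma_N^2) for the mean of a Gaussian with
    known variance sigma2, under the prior N(mu | mu0, sigma02)."""
    N, mu_ml = len(x), float(np.mean(x))
    sigma_n2 = 1.0 / (1.0 / sigma02 + N / sigma2)           # precisions add
    mu_n = sigma_n2 * (mu0 / sigma02 + N * mu_ml / sigma2)  # precision-weighted average
    return mu_n, sigma_n2

x = np.random.default_rng(2).normal(1.5, 1.0, size=20)  # known variance = 1.0
print(posterior_mean_params(x, sigma2=1.0, mu0=0.0, sigma02=10.0))
```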
Bayesian Mixture of Gaussians
• Conjugate priors for the parameters:
• Dirichlet prior for the mixing coefficients: $p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}_0)$
• Normal-Wishart prior for the means and precisions:
$$p(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k) = \mathcal{N}\left(\boldsymbol{\mu}_k \mid \mathbf{m}_0, (\beta_0 \boldsymbol{\Lambda}_k)^{-1}\right) \mathcal{W}(\boldsymbol{\Lambda}_k \mid \mathbf{W}_0, \nu_0)$$
where the Wishart distribution is given by
$$\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu) = B(\mathbf{W}, \nu)\, |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\mathbf{W}^{-1} \boldsymbol{\Lambda}) \right\}$$
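For concreteness, all of these priors are available in scipy.stats; a sketch of drawing one set of mixture parameters from them (the hyperparameter values are placeholders):

```python
import numpy as np
from scipy.stats import dirichlet, wishart, multivariate_normal

alpha0, beta0, nu0 = np.ones(3), 1.0, 4.0  # placeholder hyperparameters (K=3, D=2)
m0, W0 = np.zeros(2), np.eye(2)

pi = dirichlet.rvs(alpha0, random_state=0)[0]        # pi ~ Dir(alpha0)
Lam = wishart.rvs(df=nu0, scale=W0, random_state=0)  # Lambda_k ~ W(W0, nu0)
mu = multivariate_normal.rvs(m0, np.linalg.inv(beta0 * Lam),  # mu_k | Lambda_k
                             random_state=0)
```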
Graphical Representation
• Parameters and latent variables appear on equal footing
Variational Inference
• As with many Bayesian models, exact inference for the mixture of Gaussians is intractable
• Approximate Bayesian inference traditionally based on Laplace’s method (local Gaussian approximation to the posterior) or Markov chain Monte Carlo
• Variational Inference is an alternative, broadly applicable deterministic approximation scheme
General View of Variational Inference
• Consider again the previous decomposition, but where the posterior is now over all latent variables and parameters, collectively denoted $\boldsymbol{\theta}$:
$$\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$$
where
$$\mathcal{L}(q) = \int q(\boldsymbol{\theta}) \ln \frac{p(\mathbf{X}, \boldsymbol{\theta})}{q(\boldsymbol{\theta})}\, \mathrm{d}\boldsymbol{\theta}, \qquad \mathrm{KL}(q \,\|\, p) = -\int q(\boldsymbol{\theta}) \ln \frac{p(\boldsymbol{\theta} \mid \mathbf{X})}{q(\boldsymbol{\theta})}\, \mathrm{d}\boldsymbol{\theta}$$
• Maximizing $\mathcal{L}(q)$ over all possible distributions $q$ would give the true posterior distribution – but this is intractable by definition
Factorized Approximation
• Goal: choose a family of distributions which are:
• sufficiently flexible to give good posterior approximation
• sufficiently simple to remain tractable
• Here we consider factorized distributions
$$q(\boldsymbol{\theta}) = \prod_{i} q_i(\boldsymbol{\theta}_i)$$
• No further assumptions are required!
• Optimal solution for one factor, keeping the remainder fixed:
$$\ln q_i^{\star}(\boldsymbol{\theta}_i) = \mathbb{E}_{j \neq i}\left[\ln p(\mathbf{X}, \boldsymbol{\theta})\right] + \mathrm{const.}$$
• The solutions are coupled, so initialize them and then update cyclically
Lower Bound
• The bound $\mathcal{L}(q)$ can also be evaluated explicitly
• Useful for verifying both the mathematics and the code
• Also useful for model comparison, since $\mathcal{L}(q)$ approximates the log marginal likelihood $\ln p(\mathbf{X})$
Illustration: Univariate Gaussian
• Likelihood function:
$$p(\mathcal{D} \mid \mu, \tau) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \tau^{-1})$$
• Conjugate priors:
$$p(\mu \mid \tau) = \mathcal{N}\left(\mu \mid \mu_0, (\lambda_0 \tau)^{-1}\right), \qquad p(\tau) = \mathrm{Gam}(\tau \mid a_0, b_0)$$
• Factorized variational distribution: $q(\mu, \tau) = q_{\mu}(\mu)\, q_{\tau}(\tau)$
Exact Solution
• For this very simple example there is an exact solution
• With a non-informative prior, the expected precision is given by
$$\frac{1}{\mathbb{E}[\tau]} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \bar{x})^2$$
• Compare with the earlier maximum likelihood solution, which divides by $N$ rather than $N - 1$
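For comparison, the factorized variational solution for this model can be iterated to convergence in a few lines. This sketch follows the conjugate parameterization above; the update equations are the standard ones for this example (as in Bishop's later textbook treatment), so treat the details as my reconstruction rather than the slides' code:

```python
import numpy as np

def cavi_univariate_gaussian(x, mu0=0.0, lam0=0.0, a0=0.0, b0=0.0, n_iters=50):
    """Cyclic variational updates for q(mu, tau) = q(mu) q(tau), with
    q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gam(a_n, b_n).
    Zero hyperparameters correspond to a non-informative prior limit."""
    N, xbar = len(x), float(np.mean(x))
    e_tau = 1.0               # initial guess for E[tau]
    a_n = a0 + 0.5 * (N + 1)  # the Gamma shape is fixed by N alone
    for _ in range(n_iters):
        # update q(mu), which depends on q(tau) only through E[tau]
        mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_n = (lam0 + N) * e_tau
        e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n
        # update q(tau), which depends on q(mu) through E[mu] and E[mu^2]
        b_n = b0 + 0.5 * (np.sum(x ** 2) - 2.0 * e_mu * np.sum(x) + N * e_mu2
                          + lam0 * (e_mu2 - 2.0 * mu0 * e_mu + mu0 ** 2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n
```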
Variational Mixture of Gaussians
• Assume a factorized posterior distribution:
$$q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = q(\mathbf{Z})\, q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})$$
• Gives the optimal solution in the form
$$q^{\star}(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = q^{\star}(\boldsymbol{\pi}) \prod_{k=1}^{K} q^{\star}(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)$$
where $q^{\star}(\boldsymbol{\pi})$ is a Dirichlet distribution, and each $q^{\star}(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)$ is a Normal-Wishart distribution
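A present-day, off-the-shelf implementation of variational inference for Gaussian mixtures (postdating this talk) is scikit-learn's `BayesianGaussianMixture`; a brief usage sketch:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.default_rng(3).normal(size=(500, 2))  # stand-in data set
vb = BayesianGaussianMixture(n_components=10, max_iter=500, random_state=0).fit(X)
print(vb.weights_.round(3))  # superfluous components are driven to near-zero weight
```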
Sufficient Statistics
• The variational updates depend on the data only through the responsibility-weighted statistics
$$N_k = \sum_{n} \gamma_{nk}, \qquad \bar{\mathbf{x}}_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, \mathbf{x}_n, \qquad \mathbf{S}_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, (\mathbf{x}_n - \bar{\mathbf{x}}_k)(\mathbf{x}_n - \bar{\mathbf{x}}_k)^{\mathrm{T}}$$
• Small computational overhead compared to maximum likelihood EM
Sparse Bayes for Gaussian Mixture
• Instead of comparing different values of K, start with a large value and prune out excess components
• Treat mixing coefficients as parameters, and maximize marginal likelihood [Corduneanu + Bishop, AIStats ’01]
• Gives simple re-estimation equations for the mixing coefficients – interleave with variational updates
General Variational Framework
• Currently for each new model:
• derive the variational update equations
• write application-specific code to find the solution
• Both stages are time consuming and error prone
• Can we build a general-purpose inference engine which automates these procedures?
VIBES
• Variational Inference for Bayesian Networks
• Bishop and Winn (1999)
• A general inference engine using variational methods
• Models specified graphically
Local Computation in VIBES
• A key observation is that in the general solution
$$\ln q_i^{\star}(\boldsymbol{\theta}_i) = \mathbb{E}_{j \neq i}\left[\ln p(\mathbf{X}, \boldsymbol{\theta})\right] + \mathrm{const.}$$
the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket
• Permits a local, object-oriented implementation
Take-home Messages
• Bayesian mixture of Gaussians
• no singularities
• determines optimal number of components
• Variational inference
• effective solution for Bayesian GMM
• optimizes rigorous bound
• little computational overhead compared to EM
• VIBES
• rapid prototyping of probabilistic models
• graphical specification

### Viewgraphs, tutorials and publications available from:

http://research.microsoft.com/~cmbishop