- By
**paul** - Follow User

- 504 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Christopher M. Bishop' - paul

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Viewgraphs, tutorials andpublications available from:

Latent Variables,Mixture Modelsand EM

Christopher M. Bishop

Microsoft Research, Cambridge

BCS Summer SchoolExeter, 2003

Overview

- K-means clustering
- Gaussian mixtures
- Maximum likelihood and EM
- Probabilistic graphical models
- Latent variables: EM revisited
- Bayesian Mixtures of Gaussians
- Variational Inference
- VIBES

K-means Algorithm

- Goal: represent a data set in terms of K clusters each of which is summarized by a prototype
- Initialize prototypes, then iterate between two phases:
- E-step: assign each data point to nearest prototype
- M-step: update prototypes to be the cluster means
- Simplest version is based on Euclidean distance
- re-scale Old Faithful data

Responsibilities

- Responsibilities assign data points to clusterssuch that
- Example: 5 data points and 3 clusters

Minimizing the Cost Function

- E-step: minimize w.r.t.
- assigns each data point to nearest prototype
- M-step: minimize w.r.t
- gives
- each prototype set to the mean of points in that cluster
- Convergence guaranteed since there is a finite number of possible settings for the responsibilities

Limitations of K-means

- Hard assignments of data points to clusters – small shift of a data point can flip it to a different cluster
- Not clear how to choose the value of K
- Solution: replace ‘hard’ clustering of K-means with ‘soft’ probabilistic assignments
- Represents the probability distribution of the data as a Gaussian mixture model

mean

The Gaussian Distribution- Multivariate Gaussian
- Define precision to be the inverse of the covariance
- In 1-dimension

Likelihood Function

- Data set
- Assume observed data points generated independently
- Viewed as a function of the parameters, this is known as the likelihood function

Maximum Likelihood

- Set the parameters by maximizing the likelihood function
- Equivalently maximize the log likelihood

Maximum Likelihood Solution

- Maximizing w.r.t. the mean gives the sample mean
- Maximizing w.r.t covariance gives the sample covariance

Bias of Maximum Likelihood

- Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
- The maximum likelihood solution systematically under-estimates the covariance
- This is an example of over-fitting

Unbiased Variance Estimate

- Clearly we can remove the bias by usingsince this gives
- Arises naturally in a Bayesian treatment (see later)
- For an infinite data set the two expressions are equal

Gaussian Mixtures

- Linear super-position of Gaussians
- Normalization and positivity require
- Can interpret the mixing coefficients as prior probabilities

Sampling from the Gaussian

- To generate a data point:
- first pick one of the components with probability
- then draw a sample from that component
- Repeat these two steps for each new data point

Fitting the Gaussian Mixture

- We wish to invert this process – given the data set, find the corresponding parameters:
- mixing coefficients
- means
- covariances
- If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
- Problem: the data set is unlabelled
- We shall refer to the labels as latent (= hidden) variables

Posterior Probabilities

- We can think of the mixing coefficients as prior probabilities for the components
- For a given value of we can evaluate the corresponding posterior probabilities, called responsibilities
- These are given from Bayes’ theorem by

Maximum Likelihood for the GMM

- The log likelihood function takes the form
- Note: sum over components appears inside the log
- There is no closed form solution for maximum likelihood

Over-fitting in Gaussian Mixture Models

- Singularities in likelihood function when a component ‘collapses’ onto a data point:then consider
- Likelihood function gets larger as we add more components (and hence parameters) to the model
- not clear how to choose the number K of components

Problems and Solutions

- How to maximize the log likelihood
- solved by expectation-maximization (EM) algorithm
- How to avoid singularities in the likelihood function
- solved by a Bayesian treatment
- How to choose number K of components
- also solved by a Bayesian treatment

EM Algorithm – Informal Derivation

- Let us proceed by simply differentiating the log likelihood
- Setting derivative with respect to equal to zero givesgivingwhich is simply the weighted mean of the data

EM Algorithm – Informal Derivation

- Similarly for the covariances
- For mixing coefficients use a Lagrange multiplier to give

EM Algorithm – Informal Derivation

- The solutions are not closed form since they are coupled
- Suggests an iterative scheme for solving them:
- Make initial guesses for the parameters
- Alternate between the following two stages:
- E-step: evaluate responsibilities
- M-step: update parameters using ML results

Digression: Probabilistic Graphical Models

- Graphical representation of a probabilistic model
- Each variable corresponds to a node in the graph
- Links in the graph denote relations between variables
- Motivation:
- visualization of models and motivation for new models
- graphical determination of conditional independence
- complex calculations (inference) performed using graphical operations (e.g. forward-backward for HMM)
- Here we consider directed graphs

Example: 3 Variables

- General distribution over 3 variables
- Apply product rule of probability twice
- Express as a directed graph

General Decomposition Formula

- Joint distribution is product of conditionals, conditioned on parent nodes
- Example:

EM – Latent Variable Viewpoint

- Binary latent variables describing which component generated each data point
- Conditional distribution of observed variable
- Prior distribution of latent variables
- Marginalizing over the latent variables we obtain

Expected Value of Latent Variable

- From Bayes’ theorem

Latent Variable View of EM

- If we knew the values for the latent variables, we would maximize the complete-data log likelihoodwhich gives a trivial closed-form solution (fit each component to the corresponding set of data points)
- We don’t know the values of the latent variables
- However, for given parameter values we can compute the expected values of the latent variables

Expected Complete-Data Log Likelihood

- Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
- Use these to evaluate the responsibilities
- Consider expected complete-data log likelihood where responsibilities are computed using
- We are implicitly ‘filling in’ latent variables with best guess
- Keeping the responsibilities fixed and maximizing with respect to the parameters give the previous results

K-means Revisited

- Consider GMM with common covariances
- Take limit
- Responsibilities become binary
- Expected complete-data log likelihood becomes

EM in General

- Consider arbitrary distribution over the latent variables
- The following decomposition always holdswhere

Optimizing the Bound

- E-step: maximize with respect to
- equivalent to minimizing KL divergence
- sets equal to the posterior distribution
- M-step: maximize bound with respect to
- equivalent to maximizing expected complete-data log likelihood
- Each EM cycle must increase incomplete-data likelihood unless already at a (local) maximum

Bayesian Inference

- Include prior distributions over parameters
- Advantages in using conjugate priors
- Example: consider a single Gaussian over one variable
- assume variance is known and mean is unknown
- likelihood function for the mean
- Choose Gaussian prior for mean

Bayesian Inference for a Gaussian

- Posterior (proportional to product of prior and likelihood) will then also be Gaussianwhere

Bayesian Mixture of Gaussians

- Conjugate priors for the parameters:
- Dirichlet prior for mixing coefficients
- Normal-Wishart prior for means and precisionswhere the Wishart distribution is given by

Graphical Representation

- Parameters and latent variables appear on equal footing

Variational Inference

- As with many Bayesian models, exact inference for the mixture of Gaussians is intractable
- Approximate Bayesian inference traditionally based on Laplace’s method (local Gaussian approximation to the posterior) or Markov chain Monte Carlo
- Variational Inference is an alternative, broadly applicable deterministic approximation scheme

General View of Variational Inference

- Consider again the previous decomposition, but where the posterior is over all latent variables and parameterswhere
- Maximizing over would give the true posterior distribution – but this is intractable by definition

Factorized Approximation

- Goal: choose a family of distributions which are:
- sufficiently flexible to give good posterior approximation
- sufficiently simple to remain tractable
- Here we consider factorized distributions
- No further assumptions are required!
- Optimal solution for one factor, keeping the remained fixed
- Coupled solutions so initialize then cyclically update

Lower Bound

- Can also be evaluated
- Useful for maths/code verification
- Also useful for model comparison:

Illustration: Univariate Gaussian

- Likelihood function
- Conjugate priors
- Factorized variational distribution

Exact Solution

- For this very simple example there is an exact solution
- Expected precision given by
- Compare with earlier maximum likelihood solution

Variational Mixture of Gaussians

- Assume factorized posterior distribution
- Gives optimal solution in the formwhere is a Dirichlet, and is a Normal-Wishart

Sufficient Statistics

- Small computational overhead compared to maximum likelihood EM

Sparse Bayes for Gaussian Mixture

- Instead of comparing different values of K, start with a large value and prune out excess components
- Treat mixing coefficients as parameters, and maximize marginal likelihood [Corduneanu + Bishop, AIStats ’01]
- Gives simple re-estimation equations for the mixing coefficients – interleave with variational updates

General Variational Framework

- Currently for each new model:
- derive the variational update equations
- write application-specific code to find the solution
- Both stages are time consuming and error prone
- Can we build a general-purpose inference engine which automates these procedures?

VIBES

- Variational Inference for Bayesian Networks
- Bishop and Winn (1999)
- A general inference engine using variational methods
- Models specified graphically

Local Computation in VIBES

- A key observation is that in the general solutionthe update for a particular node (or group of nodes) depends only on other nodes in the Markov blanket
- Permits a local object-oriented implementation

Take-home Messages

- Bayesian mixture of Gaussians
- no singularities
- determines optimal number of components
- Variational inference
- effective solution for Bayesian GMM
- optimizes rigorous bound
- little computational overhead compared to EM
- VIBES
- rapid prototyping of probabilistic models
- graphical specification

http://research.microsoft.com/~cmbishop

Download Presentation

Connecting to Server..