EM algorithm continued & Fisher kernels for image representation

Presentation Transcript


  1. EM algorithm continued & Fisher kernels for image representation Jakob Verbeek December 11, 2009

  2. Plan for this course • Introduction to machine learning • Clustering techniques • k-means, Gaussian mixture density • Gaussian mixture density continued • Parameter estimation with EM, Fisher kernels • Classification techniques 1 • Introduction, generative methods, semi-supervised • Classification techniques 2 • Discriminative methods, kernels • Decomposition of images • Topic models, …

  3. Clustering with k-means and MoG • Hard assignment in k-means is not robust near the border of quantization cells • Soft assignment in MoG accounts for ambiguity in the assignment • Both algorithms are sensitive to initialization • Run from several initializations • Keep the best result • The number of clusters needs to be set • Both algorithms can be generalized to other types of distances or densities • Images from [Gemert et al., IEEE TPAMI, 2010]
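
The contrast between hard and soft assignment can be made concrete with a small numerical sketch; the two cluster centres, the test point near the cell border, and the equal-weight unit-covariance Gaussians below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Two cluster centres and a point close to the border between their cells (toy values).
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
x = np.array([1.05, 0.0])

# Hard assignment (k-means): only the index of the closest centre is kept.
dists = np.linalg.norm(centers - x, axis=1)
hard = np.argmin(dists)                     # -> 1, the marginally closer centre

# Soft assignment (MoG with equal mixing weights and unit isotropic covariances):
# posterior responsibilities reflect the ambiguity of the assignment.
log_resp = -0.5 * dists**2                  # log N(x | mu_k, I) up to a constant
resp = np.exp(log_resp - log_resp.max())
resp /= resp.sum()                          # -> approximately [0.48, 0.52]

print(hard, resp)
```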

  4. Clustering with Gaussian mixture density • The mixture density is a weighted sum of Gaussians • Mixing weight: the importance of each cluster • The density has to integrate to 1, so we require the mixing weights to be non-negative and to sum to one
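
In formulas, with K components, the mixture density and its constraint read as follows (standard MoG notation, reconstructed here because the slide's own equations are not in the transcript):

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad
\pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1 .
```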

  5. Clustering with Gaussian mixture density • Given: a data set of N points xn, n=1,…,N • Find the mixture of Gaussians (MoG) that best explains the data • Assigns maximum likelihood to the data • Assume the data points are drawn independently from the MoG • Maximize the log-likelihood of the fixed data set X w.r.t. the parameters of the MoG • As with k-means, the objective function has local optima • Can use the Expectation-Maximization (EM) algorithm • Similar to the iterative k-means algorithm
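
The objective is the log-likelihood of the data set under the independence assumption; in standard notation (the slide's own equation is not in the transcript):

```latex
\mathcal{L}(\theta) = \log p(X \mid \theta)
= \sum_{n=1}^{N} \log p(x_n \mid \theta)
= \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),
\qquad
\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K} .
```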

  6. Maximum likelihood estimation of MoG • Use the EM algorithm • Initialize the MoG: parameters or soft-assignments • E-step: softly assign the data points to clusters • M-step: update the cluster parameters • Repeat EM steps, terminate when converged • Convergence of the parameters or of the assignments • E-step: compute the posterior on z given x (see below) • M-step: update the Gaussians from the data points, weighted by the posterior
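
The E-step posterior (the responsibility of cluster k for point xn) is the standard MoG posterior; writing zn = k for the assignment of xn to cluster k:

```latex
q_{nk} \;=\; p(z_n = k \mid x_n)
\;=\; \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
           {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} .
```

The M-step then re-estimates each Gaussian and each mixing weight from the data points weighted by these responsibilities; the explicit updates are derived on slides 13 and 14.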

  7. Maximum likelihood estimation of MoG Example of several EM iterations
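
For concreteness, a minimal self-contained EM loop for a MoG on synthetic 2-D data is sketched below. This is an illustrative implementation under its own assumptions (two well-separated blobs, random soft-assignment initialization, full covariances with a small ridge), not the code behind the slide's figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two 2-D Gaussian blobs (illustrative).
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (100, 2)),
               rng.normal([3.0, 1.0], 0.8, (100, 2))])
N, D, K = X.shape[0], X.shape[1], 2

# Initialization by random soft-assignments (the "soft-assign" option on slide 6).
resp = rng.dirichlet(np.ones(K), size=N)

def gauss_logpdf(X, mu, Sigma):
    """Row-wise log-density of a multivariate Gaussian N(mu, Sigma)."""
    d = X - mu
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, d.T)
    return (-0.5 * np.sum(sol**2, axis=0)
            - np.log(np.diag(L)).sum()
            - 0.5 * D * np.log(2 * np.pi))

for it in range(20):
    # M-step: update mixing weights, means and covariances from weighted points.
    Nk = resp.sum(axis=0)                    # effective number of points per cluster
    pi = Nk / N
    mu = (resp.T @ X) / Nk[:, None]
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (resp[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)

    # E-step: recompute the responsibilities (posterior of z given x).
    log_p = np.stack([np.log(pi[k]) + gauss_logpdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
    log_norm = np.logaddexp.reduce(log_p, axis=1)
    resp = np.exp(log_p - log_norm[:, None])

    # The data log-likelihood never decreases over the iterations.
    print(f"iteration {it:2d}   log-likelihood {log_norm.sum():.2f}")
```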

  8. Bound optimization view of EM algorithm • The EM algorithm is another iterative bound-optimization algorithm • Goal: maximize the data log-likelihood • Bound: uses the soft-assignment of data points to clusters • The bound uses two information-theoretic quantities • Entropy • Kullback-Leibler divergence
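
For a discrete distribution q and a reference distribution p, the two quantities are (standard definitions; base-2 logarithms give values in bits):

```latex
H(q) = -\sum_{z} q(z) \log q(z),
\qquad
D(q \,\|\, p) = \sum_{z} q(z) \log \frac{q(z)}{p(z)} \;\ge\; 0 .
```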

  9. Entropy of a distribution • Entropy captures the uncertainty in a distribution • Maximal for the uniform distribution • Minimal, zero, for a delta peak on a single value • Connection to information coding (noiseless coding theorem, Shannon 1948) • Frequent messages get short codes; the optimal code length is (at least) -log p bits • Entropy: the expected code length • Suppose a uniform distribution over 8 outcomes: 3-bit code words • Suppose the distribution 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64: entropy 2 bits! • Code words: 0, 10, 110, 1110, 111100, 111101, 111110, 111111 • Code words are “self-delimiting”: a code word has length 6, or stops after the first 0 • (Figure: a low-entropy and a high-entropy distribution)
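
The 2-bit entropy claimed for the non-uniform example can be verified directly; this small check is not part of the slides:

```python
import numpy as np

q = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])   # lengths of the code words above

entropy = -(q * np.log2(q)).sum()        # -> 2.0 bits
expected_len = (q * code_lengths).sum()  # -> 2.0 bits: the code attains the entropy
print(entropy, expected_len)
```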

  10. Kullback-Leibler divergence • Asymmetric quantity between distributions • Minimum, zero, if the distributions are equal • Maximum, infinity, if p has a zero where q is non-zero • Interpretation in coding theory • Sub-optimality when messages are distributed according to q, but coding with code-word lengths derived from p • Difference of expected code lengths • Suppose distribution q: 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64 • Coding with the uniform 3-bit code, p = uniform • Expected code length using p: 3 bits • Optimal expected code length, entropy H(q) = 2 bits • KL divergence D(q||p) = 1 bit
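
The 1-bit KL divergence in the example checks out the same way (again just a verification snippet, not from the slides):

```python
import numpy as np

q = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
p = np.full(8, 1/8)                        # uniform distribution: the 3-bit code

kl = (q * np.log2(q / p)).sum()            # D(q || p) -> 1.0 bit
expected_len_p = -(q * np.log2(p)).sum()   # coding q with p's code lengths: 3.0 bits
entropy_q = -(q * np.log2(q)).sum()        # optimal expected code length: 2.0 bits
print(kl, expected_len_p - entropy_q)      # both equal 1.0: D(q||p) is the coding overhead
```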

  11. EM bound on log-likelihood • Goal: maximize the observed data log-likelihood • Hidden/latent/missing variables Z • Define p(X) as the “marginal” distribution of p(X,Z) • Let qn(zn) be an arbitrary distribution over the latent variable zn • Bound the log-likelihood by subtracting the KL divergence D(qn(zn) || p(zn|xn))
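
Written out, the bound is the standard EM decomposition (reconstructed here, since the slide's equations are not in the transcript):

```latex
\log p(x_n)
= \underbrace{\sum_{z_n} q_n(z_n) \log \frac{p(x_n, z_n)}{q_n(z_n)}}_{\mathcal{F}(q_n, \theta)}
\;+\;
\underbrace{D\big(q_n(z_n) \,\|\, p(z_n \mid x_n)\big)}_{\;\ge\; 0}
\;\;\ge\;\; \mathcal{F}(q_n, \theta),
\qquad
\mathcal{F}(q, \theta) = \sum_{n=1}^{N} \mathcal{F}(q_n, \theta) \;\le\; \log p(X) .
```

Equivalently, F(qn, θ) = E_qn[log p(xn, zn)] + H(qn), which is where the entropy and the KL divergence of slides 9 and 10 enter.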

  12. Maximizing the EM bound on log-likelihood • E-step: fix the model parameters, update the distributions qn • The KL divergence is zero if the distributions are equal • Thus set qn(zn) = p(zn|xn) • M-step: fix the qn, update the model parameters • The terms for each Gaussian decouple from the rest (see the sketch below)
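
With the qn fixed, the part of the bound that depends on the parameters is, up to the constant entropy terms H(qn):

```latex
\sum_{n=1}^{N} \sum_{k=1}^{K} q_{nk}
\Big( \log \pi_k \,+\, \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Big),
\qquad q_{nk} = q_n(z_n = k),
```

so each Gaussian's mean and covariance appears only in its own terms and can be updated independently; only the mixing weights remain coupled through the sum-to-one constraint.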

  13. Maximizing the EM bound on log-likelihood • Derive the optimal values for the mixing weights • Maximize the bound with respect to the mixing weights • Take into account that the weights sum to one: define one weight, say the first, as one minus the sum of the others • Take the derivative for each mixing weight k>1 and set it to zero (see the derivation sketched below)
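
One way to carry out the sketched derivation is to substitute π1 = 1 - Σk>1 πk into the objective above and set the derivatives to zero; the slide's own algebra is not in the transcript, so this is a reconstruction of the standard result:

```latex
\frac{\partial}{\partial \pi_k} \sum_{n=1}^{N} \Big( q_{n1}\log \pi_1 + q_{nk}\log \pi_k \Big)
= -\frac{\sum_n q_{n1}}{\pi_1} + \frac{\sum_n q_{nk}}{\pi_k} = 0
\quad\Longrightarrow\quad
\pi_k \propto \sum_{n} q_{nk}
\quad\Longrightarrow\quad
\pi_k = \frac{1}{N} \sum_{n=1}^{N} q_{nk},
```

using that the responsibilities sum to one over k for every n, so the normalization constant is N.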

  14. Maximizing the EM bound on log-likelihood • Derive the optimal values for the remaining MoG parameters, the means and covariances • Maximize the bound with respect to each Gaussian’s mean and covariance (see below)
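
Setting the derivatives of the bound with respect to μk and Σk to zero gives the usual weighted-mean and weighted-covariance updates (standard result, stated here because the slide's equations are missing from the transcript):

```latex
\mu_k = \frac{\sum_{n} q_{nk} \, x_n}{\sum_{n} q_{nk}},
\qquad
\Sigma_k = \frac{\sum_{n} q_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^{\top}}{\sum_{n} q_{nk}} .
```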

  15. EM bound on log-likelihood • The bound F is a lower bound on the data log-likelihood for any q • Iterative coordinate ascent on F • E-step: optimize q • M-step: optimize the parameters

  16. How to set the nr of clusters? • The optimization criterion of k-means and MoG is always improved by adding more clusters • K-means: the minimum distance to the closest cluster cannot increase by adding a cluster center • MoG: can always add the new Gaussian with zero mixing weight; (k+1)-component models contain the k-component models • The optimization criterion therefore cannot be used to select the number of clusters • Model selection by adding a penalty term that increases with the number of clusters • Minimum description length (MDL) principle • Bayesian information criterion (BIC) • Akaike information criterion (AIC) • Cross-validation if used for another task, e.g. image categorization: check the performance of the final system on a validation set of labeled images • For more details see “Pattern Recognition & Machine Learning” by C. Bishop, 2006; in particular chapter 9 and section 3.4
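
For example, with M free parameters, N data points, and maximized log-likelihood, the two named criteria take the standard forms below (lower is better in this formulation; the formulas are not spelled out on the slide):

```latex
\mathrm{BIC} = -2\hat{\mathcal{L}} + M \log N,
\qquad
\mathrm{AIC} = -2\hat{\mathcal{L}} + 2M .
```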

  17. How to set the nr of clusters? • Bayesian model that treats the parameters as missing values • Prior distribution over the parameters • Likelihood of the data given by averaging over parameter values • Variational Bayesian inference for various numbers of clusters • Approximate the data log-likelihood using the EM bound • E-step: the distribution q is generally too complex to represent exactly • Use a factorizing distribution q, which is not exact, so the KL divergence is > 0 • Models with many parameters fit many data sets, models with few parameters won’t fit the data well, and the “right” number of parameters gives a good fit • (Figure: model fit as a function of the data set, for models of different complexity)
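
Schematically, the Bayesian treatment averages the likelihood over a prior on the parameters, and variational inference lower-bounds the resulting log-marginal-likelihood with a factorizing q; this is a reconstruction of the quantities named on the slide, not its exact equations:

```latex
p(X \mid K) = \int p(X \mid \theta, K)\, p(\theta \mid K)\, d\theta,
\qquad
\log p(X \mid K) \;\ge\;
\mathbb{E}_{q(Z)\,q(\theta)}\!\left[ \log \frac{p(X, Z, \theta \mid K)}{q(Z)\, q(\theta)} \right],
```

with equality only if q(Z) q(θ) equals the exact posterior p(Z, θ | X, K), which the factorization in general prevents; hence the KL divergence > 0 mentioned above.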

  18. Plan for this course • Introduction to machine learning • Clustering techniques • k-means, Gaussian mixture density • Gaussian mixture density continued • Parameter estimation with EM • Reading for next week: • Perronnin and Dance “Fisher Kernels on Visual Vocabularies for Image Categorization”, CVPR 2007 • Papers from last week! • Available on course website http://lear.inrialpes.fr/~verbeek/teaching • Fisher kernels + classification techniques 1 • Introduction, generative methods, semi-supervised • Classification techniques 2 • Discriminative methods, kernels • Decomposition of images • Topic models, …
