
2010 Winter School on Machine Learning and Vision. Sponsored by the Canadian Institute for Advanced Research and Microsoft Research India, with additional support from the Indian Institute of Science, Bangalore, and the University of Toronto, Canada.
Outline • Approximate inference: mean field and variational methods • Learning generative models of images • Learning 'epitomes' of images
Part A: Approximate inference: Mean field and variational methods
Line processes for binary images (Geman and Geman 1984) [Figure: a function f over small binary patterns; patterns that contain line segments get high f, isolated-pixel patterns get low f. Under P, "lines" are probable.]
Part B: Learning Generative Models of Images. Brendan Frey, University of Toronto and Canadian Institute for Advanced Research
Generative models • Generative models are trained to explain many different aspects of the input image • Using an objective function like log P(image), a generative model benefits by accounting for all pixels in the image • Contrast with discriminative models trained in a supervised fashion (eg, object recognition) • Using an objective function like log P(class|image), a discriminative model benefits by accounting for pixel features that distinguish between classes
What constitutes an “image” • Uniform 2-D array of color pixels • Uniform 2-D array of grey-scale pixels • Non-uniform images (eg, retinal images, compressed sampling images) • Features extracted from the image (eg, SIFT features) • Subsets of image pixels selected by the model (must be careful to represent universe) • …
Maximum likelihood learning when all variables are visible (complete data) • Suppose we observe N IID training cases v^(1)…v^(N) • Let θ be the parameters of a model P(v) • Maximum likelihood estimate of θ: θ_ML = argmax_θ ∏_n P(v^(n)|θ) = argmax_θ log( ∏_n P(v^(n)|θ) ) = argmax_θ ∑_n log P(v^(n)|θ)
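To make the complete-data estimate concrete, here is a minimal numpy sketch (not from the slides) for a single Gaussian model of scalar data, where the argmax of the summed log-likelihood has a closed form:

```python
# Minimal sketch: maximum likelihood for a single Gaussian, where
# argmax_theta sum_n log P(v^(n) | theta) gives the sample mean and
# sample variance in closed form.
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(loc=2.0, scale=1.5, size=1000)   # N IID training cases

mu_ml = v.mean()                    # argmax of the summed log-likelihood
var_ml = ((v - mu_ml) ** 2).mean()  # ML variance (no N-1 correction)

print(mu_ml, var_ml)                # close to the true values 2.0 and 1.5**2
```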
Complete data in Bayes nets • All variables are observed, so P(v|θ) = ∏_i P(v_i|pa_i,θ_i), where pa_i = parents of v_i and θ_i parameterizes P(v_i|pa_i) • Since argmax () = argmax log (), θ_i^ML = argmax_{θ_i} ∑_n log P(v^(n)|θ) = argmax_{θ_i} ∑_n ∑_i log P(v_i^(n)|pa_i^(n),θ_i) = argmax_{θ_i} ∑_n log P(v_i^(n)|pa_i^(n),θ_i) • Each child-parent module can be learned separately
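A minimal sketch of this decoupling, assuming a hypothetical two-node network A → B with binary variables: each conditional distribution is estimated separately by counting.

```python
# Minimal sketch (hypothetical two-node network A -> B, both binary):
# with complete data, each conditional P(x_i | pa_i) is estimated
# separately, because the log-likelihood decouples across modules.
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=5000)
B = (rng.random(5000) < np.where(A == 1, 0.8, 0.3)).astype(int)  # true P(B=1|A)

p_A1 = A.mean()                                               # ML estimate of P(A=1)
p_B1_given_A = np.array([B[A == a].mean() for a in (0, 1)])   # ML estimate of P(B=1|A=a)

print(p_A1, p_B1_given_A)   # approximately 0.5 and [0.3, 0.8]
```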
Example: Learning a mixture of Gaussians from labeled data • Recall: For cluster k, the probability density of x is N(x; μ_k, Σ_k); the probability of cluster k is p(z_k = 1) = π_k • Complete data: Each training case is a (z_n, x_n) pair; let N_k be the number of cases in class k • ML estimation: π_k = N_k/N, μ_k = (1/N_k) ∑_{n: z_n=k} x_n, Σ_k = (1/N_k) ∑_{n: z_n=k} (x_n − μ_k)(x_n − μ_k)ᵀ • That is, just learn one Gaussian for each class of data
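A minimal sketch of these updates, assuming data x of shape (N, D) and integer labels z (names are illustrative, not from the slides):

```python
# Minimal sketch: ML fitting of a mixture of Gaussians from labeled
# (complete) data -- one Gaussian per class, plus class proportions.
import numpy as np

def fit_labeled_mog(x, z, K):
    """x: (N, D) data, z: (N,) integer class labels in 0..K-1."""
    N = x.shape[0]
    pi = np.array([(z == k).sum() / N for k in range(K)])            # N_k / N
    mu = np.array([x[z == k].mean(axis=0) for k in range(K)])        # class means
    Sigma = np.array([np.cov(x[z == k].T, bias=True) for k in range(K)])  # ML covariances
    return pi, mu, Sigma
```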
Example: Learning from complete data, a continuous child with continuous parents • Estimation becomes a regression-type problem • Eg, linear Gaussian model: P(v_i|pa_i,θ_i) = N(v_i; w_i0 + ∑_{j: v_j ∈ pa_i} w_ij v_j, C_i) • mean = linear function of parents • Estimation: linear regression
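A minimal sketch of the regression view, assuming the child and parent values are stacked into numpy arrays (function and variable names are illustrative):

```python
# Minimal sketch: estimating a linear-Gaussian CPD
# P(v_i | pa_i) = N(v_i; w0 + sum_j w_j * pa_ij, C) by least squares.
import numpy as np

def fit_linear_gaussian(child, parents):
    """child: (N,) values of v_i; parents: (N, P) values of pa_i."""
    X = np.column_stack([np.ones(len(child)), parents])   # prepend a bias column for w0
    w, *_ = np.linalg.lstsq(X, child, rcond=None)          # ML weights = linear regression
    resid = child - X @ w
    C = resid.var()                                        # ML noise variance
    return w, C
```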
Learning fully-observed MRFs • It turns out we can NOT directly estimate each potential using only observations of its variables • P(v|θ) = ∏_i ϕ(v_{C_i}|θ_i) / (∑_v ∏_i ϕ(v_{C_i}|θ_i)) • Problem: The partition function (denominator) depends on all of the potentials jointly
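A small toy illustration (not from the slides) of why the partition function is the problem: even evaluating it exactly requires summing over all 2^n configurations of a binary MRF.

```python
# Minimal sketch (toy example): brute-force evaluation of the partition
# function of a pairwise MRF over n binary variables -- a sum over 2^n
# configurations, which is what makes direct ML estimation hard.
import itertools
import numpy as np

def partition_function(pairwise, n):
    """pairwise: dict {(i, j): 2x2 potential table} over binary variables."""
    Z = 0.0
    for v in itertools.product([0, 1], repeat=n):   # 2^n terms
        prod = 1.0
        for (i, j), phi in pairwise.items():
            prod *= phi[v[i], v[j]]
        Z += prod
    return Z

phi = np.array([[2.0, 1.0], [1.0, 2.0]])            # potential that favours agreement
print(partition_function({(0, 1): phi, (1, 2): phi}, n=3))
```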
Example: Mixture of K unit-variance Gaussians • P(x) = ∑_k π_k a exp(−(x−μ_k)²/2), where a = (2π)^(−1/2) • The log-likelihood to be maximized is log(∑_k π_k a exp(−(x−μ_k)²/2)) • The parameters {π_k, μ_k} that maximize this do not have a simple, closed-form solution • One approach: Use a nonlinear optimizer • This approach is intractable if the number of components is too large • A different approach…
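A minimal sketch of the "nonlinear optimizer" approach, assuming a softmax parameterization of the mixing proportions (the parameterization and data are illustrative, not from the slides):

```python
# Minimal sketch: maximizing the unit-variance mixture log-likelihood
# directly with a generic nonlinear optimizer, instead of EM.  The
# mixing proportions are parameterized through a softmax so that they
# stay positive and sum to one.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def neg_log_lik(params, x, K):
    mu, logits = params[:K], params[K:]
    log_pi = logits - logsumexp(logits)
    # log of: sum_k pi_k * (2*pi)^(-1/2) * exp(-(x - mu_k)^2 / 2)
    log_comp = -0.5 * (x[:, None] - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return -logsumexp(log_pi + log_comp, axis=1).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
K = 2
res = minimize(neg_log_lik, x0=np.array([-1.0, 1.0, 0.0, 0.0]), args=(x, K))
print(res.x[:K], softmax(res.x[K:]))   # recovered means and mixing proportions
```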
The expectation maximization (EM) algorithm(Dempster, Laird and Rubin 1977) • Learning was more straightforward when the data was complete • Can we use probabilistic inference (compute P(h|v,q)) to “fill in” the missing data and then use the learning rules for complete data? • YES: This is called the EM algorithm
Expectation maximization (EM) algorithm • Initialize θ (randomly or cleverly) • E-Step: Compute Q^(n)(h) = P(h|v^(n),θ) for hidden variables h, given visible variables v • M-Step: Holding Q^(n)(h) constant, maximize ∑_n ∑_h Q^(n)(h) log P(v^(n),h|θ) w.r.t. θ • Repeat E and M steps until convergence • Each iteration increases log P(v|θ) = ∑_n log(∑_h P(v^(n),h|θ)) • "Ensemble completion"
EM in Bayesian networks • Recall P(v,h|θ) = ∏_i P(x_i|pa_i,θ_i), where x = (v,h) • Then, maximizing ∑_n ∑_h Q^(n)(h) log P(v^(n),h|θ) w.r.t. θ_i becomes equivalent to maximizing, for each x_i, ∑_n ∑_{x_i,pa_i} Q^(n)(x_i,pa_i) log P(x_i|pa_i,θ_i), where Q^(n) places all of its mass on the observed value x_k* whenever x_k is observed • GIVEN the Q-distributions, the conditional P-distributions can be updated independently
EM in Bayesian networks • E-Step: Compute Q^(n)(x_i,pa_i) = P(x_i,pa_i|v^(n),θ) for each variable x_i • M-Step: For each x_i, maximize ∑_n ∑_{x_i,pa_i} Q^(n)(x_i,pa_i) log P(x_i|pa_i,θ_i) w.r.t. θ_i
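A minimal sketch of the M step for one child-parent module with discrete variables, assuming the E step has already produced the per-case marginals Q^(n)(x_i, pa_i) as arrays (names are illustrative):

```python
# Minimal sketch: the M step for one child-parent module of a discrete
# Bayes net.  Given the per-case posterior marginals Q^(n)(x_i, pa_i)
# from the E step, the CPT update is a normalization of expected counts.
import numpy as np

def m_step_cpt(Q_list):
    """Q_list: list of arrays, one per training case n, each of shape
    (num_child_values, num_parent_configs) holding Q^(n)(x_i, pa_i)."""
    expected_counts = np.sum(Q_list, axis=0)                          # sum over cases n
    cpt = expected_counts / expected_counts.sum(axis=0, keepdims=True)  # normalize over x_i
    return cpt                                                         # P(x_i | pa_i)
```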
EM for a mixture of Gaussians (recall: for labeled data, γ(z_nk) = z_nk) • Initialization: Pick μ's, Σ's, π's randomly but validly • E Step: For each training case we need q(z) = p(z|x) = p(x|z)p(z) / (∑_z p(x|z)p(z)); defining γ(z_nk) = q(z_nk=1), we actually need to compute γ(z_nk) = π_k N(x_n; μ_k, Σ_k) / ∑_j π_j N(x_n; μ_j, Σ_j) — do it in the log-domain! (a numpy sketch follows the E-step slides below) • M Step: Update the π's, μ's and Σ's using the responsibilities, as on the following slides
EM for mixture of Gaussians: E step [Figures: starting from π_1 = π_2 = 0.5 and initial means μ_1, μ_2 and variances Φ_1, Φ_2, the posterior P(c|z) is computed for each image z in the data set; for four example images, P(c=1|z) = 0.52, 0.51, 0.48, 0.43 and P(c=2|z) = 0.48, 0.49, 0.52, 0.57.]
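A minimal numpy sketch of this E step, assuming a diagonal-covariance model to match the slides' diag(Φ) notation (array shapes and names are illustrative):

```python
# Minimal sketch: the E step computes responsibilities gamma_nk = P(c=k | z_n)
# in the log domain for numerical stability.
import numpy as np
from scipy.special import logsumexp

def e_step(z, pi, mu, var):
    """z: (N, D) images as vectors; pi: (K,); mu, var: (K, D) diagonal Gaussians."""
    log_lik = -0.5 * (((z[:, None, :] - mu) ** 2) / var
                      + np.log(2 * np.pi * var)).sum(axis=2)     # (N, K) log p(z | c)
    log_post = np.log(pi) + log_lik                              # unnormalized log P(c | z)
    log_post -= logsumexp(log_post, axis=1, keepdims=True)       # normalize per case
    return np.exp(log_post)                                      # gamma, shape (N, K)
```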
EM for mixture of Gaussians: M step • Set μ_1 to the average of z weighted by P(c=1|z), and μ_2 to the average of z weighted by P(c=2|z) • Set Φ_1 to the average of diag((z−μ_1)(z−μ_1)ᵀ) weighted by P(c=1|z), and Φ_2 to the average of diag((z−μ_2)(z−μ_2)ᵀ) weighted by P(c=2|z)
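A minimal sketch of the corresponding M step, continuing the E-step code above (names and shapes are illustrative):

```python
# Minimal sketch: the M step sets each mu_k to the responsibility-weighted
# average of the images z, each diagonal variance to the weighted average
# squared deviation, and each pi_k to the average responsibility.
import numpy as np

def m_step(z, gamma):
    """z: (N, D) images; gamma: (N, K) responsibilities from the E step."""
    Nk = gamma.sum(axis=0)                                  # effective counts, (K,)
    pi = Nk / z.shape[0]                                    # average responsibility
    mu = (gamma.T @ z) / Nk[:, None]                        # weighted means, (K, D)
    var = (gamma.T @ (z ** 2)) / Nk[:, None] - mu ** 2      # diag(Phi_k), weighted variances
    return pi, mu, var
```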
… after iterating to convergence: [Figure: learned means μ_1, μ_2 and variances Φ_1, Φ_2, with π_1 = 0.6, π_2 = 0.4.]
Gibbs free energy • Somehow, we need to move the log() function in the expression log(∑_h P(h,v)) inside the summation to obtain log P(h,v), which simplifies • We can do this using Jensen's inequality: log P(v) = log ∑_h Q(h) P(h,v)/Q(h) ≥ ∑_h Q(h) log(P(h,v)/Q(h)) • The negative of this bound is the (Gibbs) free energy: F = ∑_h Q(h) log(Q(h)/P(h,v))
Properties of free energy • F ≥ −log P(v) • The minimum of F w.r.t. Q is attained at Q(h) = P(h|v), which gives F = −log P(v)
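A tiny numerical check of these two properties on a toy discrete model (the joint probabilities are made up purely for illustration):

```python
# Minimal sketch (toy discrete model): numerically checking that the free
# energy F(Q) = sum_h Q(h) log(Q(h) / P(h, v)) satisfies F >= -log P(v),
# with equality when Q(h) = P(h | v).
import numpy as np

P_hv = np.array([0.1, 0.25, 0.05, 0.2])        # joint P(h, v) for a fixed observation v
log_Pv = np.log(P_hv.sum())                    # log P(v) = log sum_h P(h, v)

def free_energy(Q):
    return np.sum(Q * (np.log(Q) - np.log(P_hv)))

posterior = P_hv / P_hv.sum()                  # exact P(h | v)
rng = np.random.default_rng(0)
Q_random = rng.dirichlet(np.ones(4))           # an arbitrary distribution over h

print(free_energy(Q_random), -log_Pv)          # F(Q_random) >= -log P(v)
print(free_energy(posterior), -log_Pv)         # equal when Q = P(h | v)
```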
Proof that EM maximizes log P(v) (Neal and Hinton 1993) • E-Step: By setting Q(h) = P(h|v), we make the bound tight, so that F = −log P(v) • M-Step: By maximizing ∑_h Q(h) log P(h,v) w.r.t. the parameters of P, we are minimizing F w.r.t. the parameters of P • Since −log P_new(v) ≤ F_new ≤ F_old = −log P_old(v), we have log P_new(v) ≥ log P_old(v). ☐
Generalized EM • M-Step: Instead of minimizing F w.r.t. P, just decrease F w.r.t. P • E-Step: Instead of minimizing F w.r.t. Q (ie, by setting Q(h)=P(h|v)), just decrease F w.r.t. Q • Approximations: • Variational techniques (which decrease F w.r.t. Q) • Loopy belief propagation (note the phrase "loopy") • Markov chain Monte Carlo (stochastic …)
Summary of learning Bayesian networks • Observed variables decouple learning in different conditional PDFs • In contrast, hidden variables couple learning in different conditional PDFs • Learning models with hidden variables entails iteratively filling in hidden variables using exact or approximate probabilistic inference, and updating every child-parent conditional PDF
Back to… Learning Generative Models of Images. Brendan Frey, University of Toronto and Canadian Institute for Advanced Research
What constitutes an “image” • Uniform 2-D array of color pixels • Uniform 2-D array of grey-scale pixels • Non-uniform images (eg, retinal images, compressed sampling images) • Features extracted from the image (eg, SIFT features) • Subsets of image pixels selected by the model (must be careful to represent universe) • …
Experiment: Fitting a mixture of Gaussians to pixel vectors extracted from complicated images [Figure: learned model parameters for model sizes of 1, 2, 3 and 4 classes.]
Why didn’t it work? • Is there a bug in the software? • I don’t think so, because the log-likelihood monotonically increases and the software works properly for toy data generated from a mixture of Gaussians • Is there a mistake in our mathematical derivation? • The EM algorithm for a mixture of Gaussians has been studied by many people – I think the math is ok
Why didn’t it work? • Are we missing some important hidden variables? • YES: The location of each object
Transformed mixtures of Gaussians (TMG) (Frey and Jojic, 1999-2001) • Bayes net: class c and shift T are parents of the observed image x, through the latent image z • Class: P(c) = π_c, with class means μ_c, diagonal variances diag(Φ_c) and mixing proportions (eg π_1 = 0.6, π_2 = 0.4) • Latent image: P(z|c) = N(z; μ_c, Φ_c) • Shift: P(T) • Observation: P(x|z,T) = N(x; Tz, Ψ), with Ψ diagonal • [Figure: example latent image z, shift T, and observed image x.]
EM for TMG • E step: Compute Q(T)=P(T|x), Q(c)=P(c|x), Q(c,z)=P(c,z|x) and Q(T,z)=P(T,z|x) for each x in the data • M step: Set • π_c = avg of Q(c) • ρ_T = avg of Q(T) • μ_c = avg mean of z under Q(z|c) • Φ_c = avg variance of z under Q(z|c) • Ψ = avg variance of x − Tz under Q(T,z) • [Figure: Bayes net with c → z, T → x, z → x.]
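A minimal sketch of the TMG E step for 1-D signals with circular shifts, where the joint posterior over (c, T) is computed by enumerating the shifts (the names and the 1-D setting are illustrative simplifications of the image model):

```python
# Minimal sketch: the TMG E step enumerates the discrete shifts T and
# computes the joint posterior P(c, T | x) from
# log P(x | c, T) + log pi_c + log rho_T, with
# P(x | c, T) = N(x; T mu_c, T Phi_c T' + Psi).  The covariance stays
# diagonal because a circular shift T is a permutation matrix.
import numpy as np
from scipy.special import logsumexp

def tmg_e_step(x, pi, rho, mu, phi, psi):
    """x: (D,) observed signal; pi: (K,) class prior; rho: (D,) prior over shifts;
    mu, phi: (K, D) class means and diagonal variances; psi: (D,) noise variances."""
    K, D = mu.shape
    log_joint = np.empty((K, D))
    for T in range(D):                        # one circular shift per column
        m = np.roll(mu, T, axis=1)            # T mu_c
        v = np.roll(phi, T, axis=1) + psi     # diagonal of T Phi_c T' + Psi
        log_lik = -0.5 * (((x - m) ** 2) / v + np.log(2 * np.pi * v)).sum(axis=1)
        log_joint[:, T] = np.log(pi) + np.log(rho[T]) + log_lik
    return np.exp(log_joint - logsumexp(log_joint))   # P(c, T | x), shape (K, D)
```

The class and shift posteriors Q(c) and Q(T) used in the M step are then just the row and column sums of this joint.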
Experiment: Fitting transformed mixtures of Gaussians to complicated images (random initialization) [Figure: learned model parameters for model sizes of 1, 2, 3 and 4 classes.]
Let's peek into the Bayes net (different movie) [Figure: for each input frame x, the posterior P(c|x), argmax_c P(c|x), argmax_T P(T|x), E[z|x] and E[Tz|x].]