
2010 Winter School on Machine Learning and Vision. Sponsored by the Canadian Institute for Advanced Research and Microsoft Research India, with additional support from the Indian Institute of Science, Bangalore, and the University of Toronto, Canada.
Outline • Approximate inference: mean field and variational methods • Learning generative models of images • Learning 'epitomes' of images
Part A: Approximate inference: Mean field and variational methods
Line processes for binary images (Geman and Geman 1984) [Figure: a function f over small binary patterns; patterns that contain line segments get high f, isolated-pixel patterns get low f. Under P, "lines" are probable.]
Part B: Learning Generative Models of Images. Brendan Frey, University of Toronto and Canadian Institute for Advanced Research
Generative models • Generative models are trained to explain many different aspects of the input image • Using an objective function like log P(image), a generative model benefits by accounting for all pixels in the image • Contrast with discriminative models trained in a supervised fashion (eg, object recognition) • Using an objective function like log P(class|image), a discriminative model benefits by accounting for pixel features that distinguish between classes
What constitutes an “image” • Uniform 2-D array of color pixels • Uniform 2-D array of grey-scale pixels • Non-uniform images (eg, retinal images, compressed sampling images) • Features extracted from the image (eg, SIFT features) • Subsets of image pixels selected by the model (must be careful to represent universe) • …
Maximum likelihood learning when all variables are visible (complete data) • Suppose we observe N IID training cases v^(1)…v^(N) • Let θ be the parameters of a model P(v) • Maximum likelihood estimate of θ: θ_ML = argmax_θ ∏_n P(v^(n)|θ) = argmax_θ log( ∏_n P(v^(n)|θ) ) = argmax_θ ∑_n log P(v^(n)|θ)
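To make the complete-data estimate concrete, here is a minimal numpy sketch (not from the slides) for a single Gaussian model of scalar data, where the argmax of the summed log-likelihood has a closed form:

```python
# Minimal sketch: maximum likelihood for a single Gaussian, where
# argmax_theta sum_n log P(v^(n) | theta) gives the sample mean and
# sample variance in closed form.
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(loc=2.0, scale=1.5, size=1000)   # N IID training cases

mu_ml = v.mean()                    # argmax of the summed log-likelihood
var_ml = ((v - mu_ml) ** 2).mean()  # ML variance (no N-1 correction)

print(mu_ml, var_ml)                # close to the true values 2.0 and 1.5**2
```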
Complete data in Bayes nets • All variables are observed, so P(v|θ) = ∏_i P(v_i|pa_i,θ_i), where pa_i = parents of v_i and θ_i parameterizes P(v_i|pa_i) • Since argmax () = argmax log (), θ_i^ML = argmax_{θ_i} ∑_n log P(v^(n)|θ) = argmax_{θ_i} ∑_n ∑_i log P(v_i^(n)|pa_i^(n),θ_i) = argmax_{θ_i} ∑_n log P(v_i^(n)|pa_i^(n),θ_i) • Each child-parent module can be learned separately
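A minimal sketch of this decoupling, assuming a hypothetical two-node network A → B with binary variables: each conditional distribution is estimated separately by counting.

```python
# Minimal sketch (hypothetical two-node network A -> B, both binary):
# with complete data, each conditional P(x_i | pa_i) is estimated
# separately, because the log-likelihood decouples across modules.
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=5000)
B = (rng.random(5000) < np.where(A == 1, 0.8, 0.3)).astype(int)  # true P(B=1|A)

p_A1 = A.mean()                                               # ML estimate of P(A=1)
p_B1_given_A = np.array([B[A == a].mean() for a in (0, 1)])   # ML estimate of P(B=1|A=a)

print(p_A1, p_B1_given_A)   # approximately 0.5 and [0.3, 0.8]
```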
Example: Learning a mixture of Gaussians from labeled data • Recall: For cluster k, the probability density of x is N(x; μ_k, Σ_k); the probability of cluster k is p(z_k = 1) = π_k • Complete data: Each training case is a (z_n, x_n) pair; let N_k be the number of cases in class k • ML estimation: π_k = N_k/N, μ_k = (1/N_k) ∑_{n: z_n=k} x_n, Σ_k = (1/N_k) ∑_{n: z_n=k} (x_n − μ_k)(x_n − μ_k)ᵀ • That is, just learn one Gaussian for each class of data
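A minimal sketch of these updates, assuming data x of shape (N, D) and integer labels z (names are illustrative, not from the slides):

```python
# Minimal sketch: ML fitting of a mixture of Gaussians from labeled
# (complete) data -- one Gaussian per class, plus class proportions.
import numpy as np

def fit_labeled_mog(x, z, K):
    """x: (N, D) data, z: (N,) integer class labels in 0..K-1."""
    N = x.shape[0]
    pi = np.array([(z == k).sum() / N for k in range(K)])            # N_k / N
    mu = np.array([x[z == k].mean(axis=0) for k in range(K)])        # class means
    Sigma = np.array([np.cov(x[z == k].T, bias=True) for k in range(K)])  # ML covariances
    return pi, mu, Sigma
```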
Example: Learning from complete data, a continuous child with continuous parents • Estimation becomes a regression-type problem • Eg, linear Gaussian model: P(v_i|pa_i,θ_i) = N(v_i; w_i0 + ∑_{j: v_j ∈ pa_i} w_ij v_j, C_i) • mean = linear function of parents • Estimation: linear regression
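A minimal sketch of the regression view, assuming the child and parent values are stacked into numpy arrays (function and variable names are illustrative):

```python
# Minimal sketch: estimating a linear-Gaussian CPD
# P(v_i | pa_i) = N(v_i; w0 + sum_j w_j * pa_ij, C) by least squares.
import numpy as np

def fit_linear_gaussian(child, parents):
    """child: (N,) values of v_i; parents: (N, P) values of pa_i."""
    X = np.column_stack([np.ones(len(child)), parents])   # prepend a bias column for w0
    w, *_ = np.linalg.lstsq(X, child, rcond=None)          # ML weights = linear regression
    resid = child - X @ w
    C = resid.var()                                        # ML noise variance
    return w, C
```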
Learning fully-observed MRFs • It turns out we can NOT directly estimate each potential using only observations of its variables • P(v|θ) = ∏_i ϕ(v_{C_i}|θ_i) / (∑_v ∏_i ϕ(v_{C_i}|θ_i)) • Problem: The partition function (denominator) depends on all of the potentials jointly
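A small toy illustration (not from the slides) of why the partition function is the problem: even evaluating it exactly requires summing over all 2^n configurations of a binary MRF.

```python
# Minimal sketch (toy example): brute-force evaluation of the partition
# function of a pairwise MRF over n binary variables -- a sum over 2^n
# configurations, which is what makes direct ML estimation hard.
import itertools
import numpy as np

def partition_function(pairwise, n):
    """pairwise: dict {(i, j): 2x2 potential table} over binary variables."""
    Z = 0.0
    for v in itertools.product([0, 1], repeat=n):   # 2^n terms
        prod = 1.0
        for (i, j), phi in pairwise.items():
            prod *= phi[v[i], v[j]]
        Z += prod
    return Z

phi = np.array([[2.0, 1.0], [1.0, 2.0]])            # potential that favours agreement
print(partition_function({(0, 1): phi, (1, 2): phi}, n=3))
```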
Example: Mixture of K unit-variance Gaussians • P(x) = ∑_k π_k a exp(−(x−μ_k)²/2), where a = (2π)^(−1/2) • The log-likelihood to be maximized is log(∑_k π_k a exp(−(x−μ_k)²/2)) • The parameters {π_k, μ_k} that maximize this do not have a simple, closed-form solution • One approach: Use a nonlinear optimizer • This approach is intractable if the number of components is too large • A different approach…
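A minimal sketch of the "nonlinear optimizer" approach, assuming a softmax parameterization of the mixing proportions (the parameterization and data are illustrative, not from the slides):

```python
# Minimal sketch: maximizing the unit-variance mixture log-likelihood
# directly with a generic nonlinear optimizer, instead of EM.  The
# mixing proportions are parameterized through a softmax so that they
# stay positive and sum to one.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def neg_log_lik(params, x, K):
    mu, logits = params[:K], params[K:]
    log_pi = logits - logsumexp(logits)
    # log of: sum_k pi_k * (2*pi)^(-1/2) * exp(-(x - mu_k)^2 / 2)
    log_comp = -0.5 * (x[:, None] - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return -logsumexp(log_pi + log_comp, axis=1).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
K = 2
res = minimize(neg_log_lik, x0=np.array([-1.0, 1.0, 0.0, 0.0]), args=(x, K))
print(res.x[:K], softmax(res.x[K:]))   # recovered means and mixing proportions
```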
The expectation maximization (EM) algorithm(Dempster, Laird and Rubin 1977) • Learning was more straightforward when the data was complete • Can we use probabilistic inference (compute P(h|v,q)) to “fill in” the missing data and then use the learning rules for complete data? • YES: This is called the EM algorithm
Expectation maximization (EM) algorithm • Initialize θ (randomly or cleverly) • E-Step: Compute Q^(n)(h) = P(h|v^(n),θ) for hidden variables h, given visible variables v • M-Step: Holding Q^(n)(h) constant, maximize ∑_n ∑_h Q^(n)(h) log P(v^(n),h|θ) w.r.t. θ • Repeat E and M steps until convergence • Each iteration increases log P(v|θ) = ∑_n log(∑_h P(v^(n),h|θ)) • "Ensemble completion"
EM in Bayesian networks • Recall P(v,h|θ) = ∏_i P(x_i|pa_i,θ_i), where x = (v,h) • Then, maximizing ∑_n ∑_h Q^(n)(h) log P(v^(n),h|θ) w.r.t. θ_i becomes equivalent to maximizing, for each x_i, ∑_n ∑_{x_i,pa_i} Q^(n)(x_i,pa_i) log P(x_i|pa_i,θ_i), where Q^(n) places all of its mass on the observed value x_k* whenever x_k is observed • GIVEN the Q-distributions, the conditional P-distributions can be updated independently
EM in Bayesian networks • E-Step: Compute Q^(n)(x_i,pa_i) = P(x_i,pa_i|v^(n),θ) for each variable x_i • M-Step: For each x_i, maximize ∑_n ∑_{x_i,pa_i} Q^(n)(x_i,pa_i) log P(x_i|pa_i,θ_i) w.r.t. θ_i
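A minimal sketch of the M step for one child-parent module with discrete variables, assuming the E step has already produced the per-case marginals Q^(n)(x_i, pa_i) as arrays (names are illustrative):

```python
# Minimal sketch: the M step for one child-parent module of a discrete
# Bayes net.  Given the per-case posterior marginals Q^(n)(x_i, pa_i)
# from the E step, the CPT update is a normalization of expected counts.
import numpy as np

def m_step_cpt(Q_list):
    """Q_list: list of arrays, one per training case n, each of shape
    (num_child_values, num_parent_configs) holding Q^(n)(x_i, pa_i)."""
    expected_counts = np.sum(Q_list, axis=0)                          # sum over cases n
    cpt = expected_counts / expected_counts.sum(axis=0, keepdims=True)  # normalize over x_i
    return cpt                                                         # P(x_i | pa_i)
```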
EM for a mixture of Gaussians (recall: for labeled data, γ(z_nk) = z_nk) • Initialization: Pick μ's, Σ's, π's randomly but validly • E Step: For each training case we need q(z) = p(z|x) = p(x|z)p(z) / (∑_z p(x|z)p(z)); defining γ(z_nk) = q(z_nk=1), we actually need to compute γ(z_nk) = π_k N(x_n; μ_k, Σ_k) / ∑_j π_j N(x_n; μ_j, Σ_j) — do it in the log-domain! (a numpy sketch follows the E-step slides below) • M Step: Update the π's, μ's and Σ's using the responsibilities, as on the following slides
EM for mixture of Gaussians: E step [Figures: starting from π_1 = π_2 = 0.5 and initial means μ_1, μ_2 and variances Φ_1, Φ_2, the posterior P(c|z) is computed for each image z in the data set; for four example images, P(c=1|z) = 0.52, 0.51, 0.48, 0.43 and P(c=2|z) = 0.48, 0.49, 0.52, 0.57.]
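A minimal numpy sketch of this E step, assuming a diagonal-covariance model to match the slides' diag(Φ) notation (array shapes and names are illustrative):

```python
# Minimal sketch: the E step computes responsibilities gamma_nk = P(c=k | z_n)
# in the log domain for numerical stability.
import numpy as np
from scipy.special import logsumexp

def e_step(z, pi, mu, var):
    """z: (N, D) images as vectors; pi: (K,); mu, var: (K, D) diagonal Gaussians."""
    log_lik = -0.5 * (((z[:, None, :] - mu) ** 2) / var
                      + np.log(2 * np.pi * var)).sum(axis=2)     # (N, K) log p(z | c)
    log_post = np.log(pi) + log_lik                              # unnormalized log P(c | z)
    log_post -= logsumexp(log_post, axis=1, keepdims=True)       # normalize per case
    return np.exp(log_post)                                      # gamma, shape (N, K)
```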
EM for mixture of Gaussians: M step • Set μ_1 to the average of z weighted by P(c=1|z), and μ_2 to the average of z weighted by P(c=2|z) • Set Φ_1 to the average of diag((z−μ_1)(z−μ_1)ᵀ) weighted by P(c=1|z), and Φ_2 to the average of diag((z−μ_2)(z−μ_2)ᵀ) weighted by P(c=2|z)
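A minimal sketch of the corresponding M step, continuing the E-step code above (names and shapes are illustrative):

```python
# Minimal sketch: the M step sets each mu_k to the responsibility-weighted
# average of the images z, each diagonal variance to the weighted average
# squared deviation, and each pi_k to the average responsibility.
import numpy as np

def m_step(z, gamma):
    """z: (N, D) images; gamma: (N, K) responsibilities from the E step."""
    Nk = gamma.sum(axis=0)                                  # effective counts, (K,)
    pi = Nk / z.shape[0]                                    # average responsibility
    mu = (gamma.T @ z) / Nk[:, None]                        # weighted means, (K, D)
    var = (gamma.T @ (z ** 2)) / Nk[:, None] - mu ** 2      # diag(Phi_k), weighted variances
    return pi, mu, var
```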
… after iterating to convergence: [Figure: learned means μ_1, μ_2 and variances Φ_1, Φ_2, with π_1 = 0.6, π_2 = 0.4.]
Gibbs free energy • Somehow, we need to move the log() function in the expression log(∑_h P(h,v)) inside the summation to obtain log P(h,v), which simplifies • We can do this using Jensen's inequality: log P(v) = log ∑_h Q(h) P(h,v)/Q(h) ≥ ∑_h Q(h) log(P(h,v)/Q(h)) • The negative of this bound is the (Gibbs) free energy: F = ∑_h Q(h) log(Q(h)/P(h,v))
Properties of free energy • F ≥ −log P(v) • The minimum of F w.r.t. Q is attained at Q(h) = P(h|v), which gives F = −log P(v)
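A tiny numerical check of these two properties on a toy discrete model (the joint probabilities are made up purely for illustration):

```python
# Minimal sketch (toy discrete model): numerically checking that the free
# energy F(Q) = sum_h Q(h) log(Q(h) / P(h, v)) satisfies F >= -log P(v),
# with equality when Q(h) = P(h | v).
import numpy as np

P_hv = np.array([0.1, 0.25, 0.05, 0.2])        # joint P(h, v) for a fixed observation v
log_Pv = np.log(P_hv.sum())                    # log P(v) = log sum_h P(h, v)

def free_energy(Q):
    return np.sum(Q * (np.log(Q) - np.log(P_hv)))

posterior = P_hv / P_hv.sum()                  # exact P(h | v)
rng = np.random.default_rng(0)
Q_random = rng.dirichlet(np.ones(4))           # an arbitrary distribution over h

print(free_energy(Q_random), -log_Pv)          # F(Q_random) >= -log P(v)
print(free_energy(posterior), -log_Pv)         # equal when Q = P(h | v)
```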
Proof that EM maximizes log P(v) (Neal and Hinton 1993) • E-Step: By setting Q(h) = P(h|v), we make the bound tight, so that F = −log P(v) • M-Step: By maximizing ∑_h Q(h) log P(h,v) w.r.t. the parameters of P, we are minimizing F w.r.t. the parameters of P • Since −log P_new(v) ≤ F_new ≤ F_old = −log P_old(v), we have log P_new(v) ≥ log P_old(v). ☐
Generalized EM • M-Step: Instead of minimizing F w.r.t. P, just decrease F w.r.t. P • E-Step: Instead of minimizing F w.r.t. Q (ie, by setting Q(h)=P(h|v)), just decrease F w.r.t. Q • Approximations: • Variational techniques (which decrease F w.r.t. Q) • Loopy belief propagation (note the phrase "loopy") • Markov chain Monte Carlo (stochastic …)
Summary of learning Bayesian networks • Observed variables decouple learning in different conditional PDFs • In contrast, hidden variables couple learning in different conditional PDFs • Learning models with hidden variables entails iteratively filling in hidden variables using exact or approximate probabilistic inference, and updating every child-parent conditional PDF
Back to… Learning Generative Models of Images. Brendan Frey, University of Toronto and Canadian Institute for Advanced Research
What constitutes an “image” • Uniform 2-D array of color pixels • Uniform 2-D array of grey-scale pixels • Non-uniform images (eg, retinal images, compressed sampling images) • Features extracted from the image (eg, SIFT features) • Subsets of image pixels selected by the model (must be careful to represent universe) • …
Experiment: Fitting a mixture of Gaussians to pixel vectors extracted from complicated images [Figure: learned model parameters for model sizes of 1, 2, 3 and 4 classes.]
Why didn’t it work? • Is there a bug in the software? • I don’t think so, because the log-likelihood monotonically increases and the software works properly for toy data generated from a mixture of Gaussians • Is there a mistake in our mathematical derivation? • The EM algorithm for a mixture of Gaussians has been studied by many people – I think the math is ok
Why didn’t it work? • Are we missing some important hidden variables? • YES: The location of each object
Transformed mixtures of Gaussians (TMG) (Frey and Jojic, 1999-2001) • Bayes net: class c and shift T are parents of the observed image x, through the latent image z • Class: P(c) = π_c, with class means μ_c, diagonal variances diag(Φ_c) and mixing proportions (eg π_1 = 0.6, π_2 = 0.4) • Latent image: P(z|c) = N(z; μ_c, Φ_c) • Shift: P(T) • Observation: P(x|z,T) = N(x; Tz, Ψ), with Ψ diagonal • [Figure: example latent image z, shift T, and observed image x.]
EM for TMG • E step: Compute Q(T)=P(T|x), Q(c)=P(c|x), Q(c,z)=P(c,z|x) and Q(T,z)=P(T,z|x) for each x in the data • M step: Set • π_c = avg of Q(c) • ρ_T = avg of Q(T) • μ_c = avg mean of z under Q(z|c) • Φ_c = avg variance of z under Q(z|c) • Ψ = avg variance of x − Tz under Q(T,z) • [Figure: Bayes net with c → z, T → x, z → x.]
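A minimal sketch of the TMG E step for 1-D signals with circular shifts, where the joint posterior over (c, T) is computed by enumerating the shifts (the names and the 1-D setting are illustrative simplifications of the image model):

```python
# Minimal sketch: the TMG E step enumerates the discrete shifts T and
# computes the joint posterior P(c, T | x) from
# log P(x | c, T) + log pi_c + log rho_T, with
# P(x | c, T) = N(x; T mu_c, T Phi_c T' + Psi).  The covariance stays
# diagonal because a circular shift T is a permutation matrix.
import numpy as np
from scipy.special import logsumexp

def tmg_e_step(x, pi, rho, mu, phi, psi):
    """x: (D,) observed signal; pi: (K,) class prior; rho: (D,) prior over shifts;
    mu, phi: (K, D) class means and diagonal variances; psi: (D,) noise variances."""
    K, D = mu.shape
    log_joint = np.empty((K, D))
    for T in range(D):                        # one circular shift per column
        m = np.roll(mu, T, axis=1)            # T mu_c
        v = np.roll(phi, T, axis=1) + psi     # diagonal of T Phi_c T' + Psi
        log_lik = -0.5 * (((x - m) ** 2) / v + np.log(2 * np.pi * v)).sum(axis=1)
        log_joint[:, T] = np.log(pi) + np.log(rho[T]) + log_lik
    return np.exp(log_joint - logsumexp(log_joint))   # P(c, T | x), shape (K, D)
```

The class and shift posteriors Q(c) and Q(T) used in the M step are then just the row and column sums of this joint.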
Experiment: Fitting transformed mixtures of Gaussians to complicated images (random initialization) [Figure: learned model parameters for model sizes of 1, 2, 3 and 4 classes.]
Let's peek into the Bayes net (different movie) [Figure: for each input frame x, the posterior P(c|x), argmax_c P(c|x), argmax_T P(T|x), E[z|x] and E[Tz|x].]