Presentation Transcript

  1. 2010 Winter School on Machine Learning and Vision. Sponsored by the Canadian Institute for Advanced Research and Microsoft Research India, with additional support from the Indian Institute of Science, Bangalore and The University of Toronto, Canada

  2. Outline • Approximate inference: Mean field and variational methods • Learning generative models of images • Learning ‘epitomes’ of images

  3. Part A. Approximate inference: Mean field and variational methods

  4. Line processes for binary images (Geman and Geman 1984) • Function $f$ • Patterns with high $f$ • Patterns with low $f$ • Under $P$, “lines” are probable • [Figure: grids of binary (0/1) pixel patterns illustrating high-$f$ and low-$f$ configurations]

  5. Use tablet to derive variational inference method

  6. Denoising images using line process models

  7. Part B. Learning Generative Models of Images. Brendan Frey, University of Toronto and Canadian Institute for Advanced Research

  8. Generative models • Generative models are trained to explain many different aspects of the input image • Using an objective function like log P(image), a generative model benefits by accounting for all pixels in the image • Contrast with discriminative models trained in a supervised fashion (eg, object recognition) • Using an objective function like log P(class|image), a discriminative model benefits by accounting for pixel features that distinguish between classes

  9. What constitutes an “image” • Uniform 2-D array of color pixels • Uniform 2-D array of grey-scale pixels • Non-uniform images (eg, retinal images, compressed sampling images) • Features extracted from the image (eg, SIFT features) • Subsets of image pixels selected by the model (must be careful to represent universe) • …

  10. What constitutes a generative model?

  11. Learning Bayesian Networks: Exact and approximate methods

  12. Maximum likelihood learning when all variables are visible (complete data) • Suppose we observe N IID training cases $v^{(1)}, \dots, v^{(N)}$ • Let $\theta$ be the parameters of a model $P(v)$ • Maximum likelihood estimate of $\theta$: $\theta^{ML} = \arg\max_\theta \prod_n P(v^{(n)}|\theta) = \arg\max_\theta \log\big(\prod_n P(v^{(n)}|\theta)\big) = \arg\max_\theta \sum_n \log P(v^{(n)}|\theta)$

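  13. Complete data in Bayes nets • All variables are observed, so $P(v|\theta) = \prod_i P(v_i|pa_i,\theta_i)$, where $pa_i$ = parents of $v_i$ and $\theta_i$ parameterizes $P(v_i|pa_i)$ • Since $\arg\max(\cdot) = \arg\max \log(\cdot)$, $\theta_i^{ML} = \arg\max_{\theta_i} \sum_n \log P(v^{(n)}|\theta) = \arg\max_{\theta_i} \sum_n \sum_j \log P(v_j^{(n)}|pa_j^{(n)},\theta_j) = \arg\max_{\theta_i} \sum_n \log P(v_i^{(n)}|pa_i^{(n)},\theta_i)$ • Each child-parent module can be learned separately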
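
For discrete variables, learning each child-parent module separately reduces to counting. The following is a minimal sketch under that assumption; the function name `fit_cpt` and the toy `rain` → `wet` network are hypothetical and not from the slides.

```python
from collections import Counter

def fit_cpt(cases, child, parents):
    """ML estimate of P(child | parents) from fully observed cases, by counting."""
    joint = Counter()    # counts of (parent values, child value)
    margin = Counter()   # counts of parent values alone
    for case in cases:
        pa = tuple(case[p] for p in parents)
        joint[(pa, case[child])] += 1
        margin[pa] += 1
    # Normalize each count by the count of its parent configuration.
    return {key: n / margin[key[0]] for key, n in joint.items()}

# Hypothetical toy network rain -> wet with four fully observed cases.
cases = [
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 0},
    {"rain": 0, "wet": 0},
]
p_rain = fit_cpt(cases, "rain", [])        # module with no parents
p_wet = fit_cpt(cases, "wet", ["rain"])    # module with one parent
```

Each call touches only one variable and its parents, which is the decoupling the slide describes.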

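  14. Example: Learning a Mixture of Gaussians from labeled data • Recall: For cluster k, the probability density of x is $p(x|z_k=1) = \mathcal{N}(x; \mu_k, \Sigma_k)$ • The probability of cluster k is $p(z_k = 1) = \pi_k$ • Complete data: Each training case is a $(z_n, x_n)$ pair; let $N_k$ be the number of cases in class k • ML estimation: $\pi_k = N_k/N$, $\mu_k = \frac{1}{N_k}\sum_n z_{nk} x_n$, $\Sigma_k = \frac{1}{N_k}\sum_n z_{nk}(x_n-\mu_k)(x_n-\mu_k)^T$ • That is, just learn one Gaussian for each class of data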
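
Learning one Gaussian per class from labeled data can be sketched in a few lines of NumPy. This is an illustration only: the function name `fit_labeled_gaussians` and the toy data are assumptions, and labels are taken to be integers 0..K-1 rather than one-hot vectors.

```python
import numpy as np

def fit_labeled_gaussians(x, labels, num_classes):
    """ML estimates (pi_k, mu_k, Sigma_k) from labeled data x of shape (N, D)."""
    n = len(x)
    params = []
    for k in range(num_classes):
        xk = x[labels == k]                  # the N_k cases in class k
        pi_k = len(xk) / n                   # class probability N_k / N
        mu_k = xk.mean(axis=0)               # class mean
        diff = xk - mu_k
        sigma_k = diff.T @ diff / len(xk)    # ML covariance (divides by N_k)
        params.append((pi_k, mu_k, sigma_k))
    return params

# Hypothetical toy data: two cases in class 0, three in class 1.
x = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0], [10.0, 14.0]])
labels = np.array([0, 0, 1, 1, 1])
params = fit_labeled_gaussians(x, labels, num_classes=2)
```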

  15. Example: Learning from complete data, a continuous child with continuous parents • Estimation becomes a regression-type problem • Eg, linear Gaussian model: $P(v_i|pa_i,\theta_i) = \mathcal{N}(v_i;\, w_{i0} + \sum_{n: v_n \in pa_i} w_{in} v_n,\, C_i)$ • mean = linear function of parents • Estimation: Linear regression
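
A linear Gaussian child can be fit by ordinary least squares, as in the sketch below. The function name `fit_linear_gaussian`, the scalar child, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_linear_gaussian(parents, child):
    """Fit child ~ N(w0 + w . parents, c) by least squares; returns (w0, w, c)."""
    n = len(child)
    design = np.column_stack([np.ones(n), parents])   # leading 1 gives the bias w0
    coef, *_ = np.linalg.lstsq(design, child, rcond=None)
    resid = child - design @ coef
    return coef[0], coef[1:], resid @ resid / n       # ML variance divides by n

# Synthetic data with known weights (1, 2, -3) and noise std 0.1.
rng = np.random.default_rng(0)
parents = rng.normal(size=(500, 2))
child = 1.0 + parents @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=500)
w0, w, c = fit_linear_gaussian(parents, child)
```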

  16. Learning fully-observed MRFs • It turns out we can NOT directly estimate each potential using only observations of its variables • $P(v|\theta) = \prod_i \phi(v_{C_i}|\theta_i) \,/\, \sum_v \prod_i \phi(v_{C_i}|\theta_i)$ • Problem: The partition function (denominator) depends on all of the potentials jointly
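
To see why the partition function is the obstacle, consider computing it by brute force. The sketch below assumes a hypothetical chain of binary variables with one shared potential $\exp(\theta\,[v_i = v_{i+1}])$ per neighboring pair; neither the chain structure nor this potential comes from the slides.

```python
import itertools
import numpy as np

def log_partition(theta, num_vars):
    """log Z for a binary chain MRF with potentials exp(theta * [v_i == v_{i+1}])."""
    total = 0.0
    # The sum ranges over all 2**num_vars joint configurations.
    for v in itertools.product([0, 1], repeat=num_vars):
        agreements = sum(v[i] == v[i + 1] for i in range(num_vars - 1))
        total += np.exp(theta * agreements)
    return np.log(total)

log_z = log_partition(0.7, 5)
```

The loop grows exponentially with the number of variables, and every potential enters every term, so the potentials cannot be fit one at a time. For this particular chain a closed form, $Z = 2(1 + e^\theta)^{n-1}$, happens to exist, which makes the sketch easy to check.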

  17. Learning Bayesian networks when there is missing data

  18. Example: Mixture of K unit-variance Gaussians • $P(x) = \sum_k \pi_k\, a \exp(-(x-\mu_k)^2/2)$, where $a = (2\pi)^{-1/2}$ • The log-likelihood to be maximized is $\log\big(\sum_k \pi_k\, a \exp(-(x-\mu_k)^2/2)\big)$ • The parameters $\{\pi_k, \mu_k\}$ that maximize this do not have a simple, closed form solution • One approach: Use a nonlinear optimizer • This approach is intractable if the number of components is too large • A different approach…
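
The log-likelihood on this slide can be evaluated stably with a log-sum-exp, as sketched below for 1-D data. The function name `mixture_log_likelihood` and the example values are assumptions.

```python
import numpy as np

def mixture_log_likelihood(x, pi, mu):
    """Sum over cases of log sum_k pi_k * a * exp(-(x_n - mu_k)^2 / 2)."""
    log_a = -0.5 * np.log(2 * np.pi)
    # log_terms[n, k] = log pi_k + log a - (x_n - mu_k)^2 / 2
    log_terms = np.log(pi) + log_a - 0.5 * (x[:, None] - mu[None, :]) ** 2
    # logaddexp.reduce computes log(sum(exp(.))) over k without underflow.
    return np.logaddexp.reduce(log_terms, axis=1).sum()

x = np.array([0.0, 1.0])
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
ll = mixture_log_likelihood(x, pi, mu)
```

The log of a sum does not split into per-component terms, which is why no closed-form maximizer exists.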

  19. The expectation maximization (EM) algorithm (Dempster, Laird and Rubin 1977) • Learning was more straightforward when the data was complete • Can we use probabilistic inference (compute $P(h|v,\theta)$) to “fill in” the missing data and then use the learning rules for complete data? • YES: This is called the EM algorithm

  20. Expectation maximization (EM) algorithm • Initialize $\theta$ (randomly or cleverly) • E-Step: Compute $Q^{(n)}(h) = P(h|v^{(n)},\theta)$ for hidden variables h, given visible variables v • M-Step: Holding $Q^{(n)}(h)$ constant, maximize $\sum_n \sum_h Q^{(n)}(h) \log P(v^{(n)},h|\theta)$ wrt $\theta$ • Repeat E and M steps until convergence • Each iteration increases $\log P(v|\theta) = \sum_n \log\big(\sum_h P(v^{(n)},h|\theta)\big)$ • “Ensemble completion”
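
The E and M steps can be sketched for the simplest case, a 1-D mixture of unit-variance Gaussians, where the hidden variable is the component index. The function names, the synthetic data, and the initial values below are assumptions; this is an illustration, not the course software.

```python
import numpy as np

def log_joint(x, pi, mu):
    """log(pi_k * N(x_n; mu_k, 1)) for every case n and component k."""
    return np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - mu) ** 2

def em_unit_variance(x, pi, mu, num_iters=50):
    """EM for a 1-D mixture of unit-variance Gaussians; returns (pi, mu, history)."""
    history = []
    for _ in range(num_iters):
        lj = log_joint(x, pi, mu)
        log_px = np.logaddexp.reduce(lj, axis=1)    # log P(x_n | theta)
        history.append(log_px.sum())                # log-likelihood before update
        q = np.exp(lj - log_px[:, None])            # E-step: Q(k) = P(k | x_n)
        nk = q.sum(axis=0)
        pi = nk / len(x)                            # M-step: weighted counts
        mu = (q * x[:, None]).sum(axis=0) / nk      # M-step: weighted means
    return pi, mu, history

# Synthetic data: 300 cases near -2 and 200 cases near 3.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])
pi, mu, history = em_unit_variance(x, np.array([0.5, 0.5]), np.array([-1.0, 1.0]))
```

The recorded `history` should be non-decreasing, which is a convenient sanity check when debugging an EM implementation.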

  21. EM in Bayesian networks • Recall $P(v,h|\theta) = \prod_i P(x_i|pa_i,\theta_i)$, $x = (v,h)$ • Then, maximizing $\sum_n \sum_h Q^{(n)}(h) \log P(v^{(n)},h|\theta)$ wrt $\theta_i$ becomes equivalent to maximizing, for each $x_i$, $\sum_n \sum_{x_i,pa_i} Q^{(n)}(x_i,pa_i) \log P(x_i|pa_i,\theta_i)$, where $Q(\dots, x_k = x_k', \dots) = 0$ for every $x_k' \neq x_k^*$ if $x_k$ is observed to be $x_k^*$ • GIVEN the Q-distributions, the conditional P-distributions can be updated independently

  22. EM in Bayesian networks • E-Step: Compute $Q^{(n)}(x_i,pa_i) = P(x_i,pa_i|v^{(n)},\theta)$ for each variable $x_i$ • M-Step: For each $x_i$, maximize $\sum_n \sum_{x_i,pa_i} Q^{(n)}(x_i,pa_i) \log P(x_i|pa_i,\theta_i)$ wrt $\theta_i$

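  23. EM for a mixture of Gaussians • Recall: For labeled data, $\gamma(z_{nk}) = z_{nk}$ • Initialization: Pick $\mu$’s, $\Sigma$’s, $\pi$’s randomly but validly • E Step: For each training case, we need $q(z) = p(z|x) = p(x|z)p(z) \,/\, \sum_z p(x|z)p(z)$. Defining $\gamma(z_{nk}) = q(z_{nk}=1)$, we need to actually compute: $\gamma(z_{nk}) = \pi_k \mathcal{N}(x_n;\mu_k,\Sigma_k) \,/\, \sum_j \pi_j \mathcal{N}(x_n;\mu_j,\Sigma_j)$ • M Step: Re-estimate $\pi_k$, $\mu_k$, $\Sigma_k$ as for labeled data, with $\gamma(z_{nk})$ in place of $z_{nk}$ • Do it in the log-domain!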
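
Working in the log domain matters for images: with hundreds of pixels, the Gaussian likelihoods underflow to zero before they can be normalized. The sketch below assumes diagonal covariances; the function name `responsibilities` and the 1000-pixel example are hypothetical.

```python
import numpy as np

def responsibilities(x, pi, mu, var):
    """gamma[n, k] = q(z_nk = 1) for diagonal Gaussians, computed in the log domain."""
    diff = x[:, None, :] - mu[None, :, :]                        # shape (N, K, D)
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff**2 / var[None]).sum(axis=2)
    log_post = np.log(pi)[None, :] + log_lik                     # log p(x|z) p(z)
    log_post -= log_post.max(axis=1, keepdims=True)              # guard against underflow
    gamma = np.exp(log_post)
    return gamma / gamma.sum(axis=1, keepdims=True)

# One 1000-pixel case far from both means: exp(log-likelihood) would underflow to 0.
x = np.full((1, 1000), 5.0)
mu = np.stack([np.zeros(1000), np.full(1000, 0.1)])
var = np.ones((2, 1000))
pi = np.array([0.5, 0.5])
gamma = responsibilities(x, pi, mu, var)
```

Subtracting the per-case maximum before exponentiating leaves the normalized result unchanged while keeping at least one term equal to 1.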

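  24. EM for mixture of Gaussians: E step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • [Figure: Bayes net c → z, with z an image from the data set]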

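  25. EM for mixture of Gaussians: E step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • z = [image from data set] • $P(c=1|z) = 0.52$, $P(c=2|z) = 0.48$

  26. EM for mixture of Gaussians: E step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • z = [image from data set] • $P(c=1|z) = 0.51$, $P(c=2|z) = 0.49$

  27. EM for mixture of Gaussians: E step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • z = [image from data set] • $P(c=1|z) = 0.48$, $P(c=2|z) = 0.52$

  28. EM for mixture of Gaussians: E step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • z = [image from data set] • $P(c=1|z) = 0.43$, $P(c=2|z) = 0.57$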

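  29. EM for mixture of Gaussians: M step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • Set $\mu_1$ to the average of z weighted by $P(c=1|z)$ • Set $\mu_2$ to the average of z weighted by $P(c=2|z)$

  30. EM for mixture of Gaussians: M step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • Set $\mu_1$ to the average of z weighted by $P(c=1|z)$ • Set $\mu_2$ to the average of z weighted by $P(c=2|z)$

  31. EM for mixture of Gaussians: M step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • Set $\Phi_1$ to the average of $\mathrm{diag}((z-\mu_1)^T(z-\mu_1))$ weighted by $P(c=1|z)$ • Set $\Phi_2$ to the average of $\mathrm{diag}((z-\mu_2)^T(z-\mu_2))$ weighted by $P(c=2|z)$

  32. EM for mixture of Gaussians: M step • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.5$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.5$ • Set $\Phi_1$ to the average of $\mathrm{diag}((z-\mu_1)^T(z-\mu_1))$ weighted by $P(c=1|z)$ • Set $\Phi_2$ to the average of $\mathrm{diag}((z-\mu_2)^T(z-\mu_2))$ weighted by $P(c=2|z)$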
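
The M-step updates on these slides amount to posterior-weighted averages. A minimal NumPy sketch, assuming diagonal variances and a posterior matrix `post[n, c]` = $P(c|z_n)$ produced by an E step; the function name `m_step` and the toy inputs are hypothetical.

```python
import numpy as np

def m_step(z, post):
    """Update (pi, mu, phi) from data z (N, D) and posteriors post[n, c] = P(c | z_n)."""
    nc = post.sum(axis=0)                          # effective number of cases per class
    pi = nc / len(z)
    mu = (post.T @ z) / nc[:, None]                # posterior-weighted mean of z
    sq = (z[:, None, :] - mu[None, :, :]) ** 2     # squared deviations, shape (N, C, D)
    phi = (post[:, :, None] * sq).sum(axis=0) / nc[:, None]   # diagonal variances
    return pi, mu, phi

# With hard (0/1) posteriors the updates reduce to per-class sample statistics.
z = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
post = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pi, mu, phi = m_step(z, post)
```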

  33. … after iterating to convergence: • $\mu_1$ = [image], $\Phi_1$ = [image], $\pi_1 = 0.6$ • $\mu_2$ = [image], $\Phi_2$ = [image], $\pi_2 = 0.4$

  34. Why does EM work?

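  35. Gibbs free energy • Somehow, we need to move the log() function in the expression $\log(\sum_h P(h,v))$ inside the summation to obtain $\log P(h,v)$, which simplifies • We can do this using Jensen’s inequality: $-\log \sum_h P(h,v) = -\log \sum_h Q(h)\frac{P(h,v)}{Q(h)} \le -\sum_h Q(h) \log\frac{P(h,v)}{Q(h)} = F$ (free energy)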

  36. Properties of free energy • $F \ge -\log P(v)$ • The minimum of F wrt Q is attained at $Q(h) = P(h|v)$ and gives $F = -\log P(v)$
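
Both properties can be checked numerically for a small discrete hidden variable, taking $F(Q) = \sum_h Q(h) \log\big(Q(h)/P(h,v)\big)$, the standard form of the bound obtained from Jensen’s inequality. The joint values below are made up for illustration.

```python
import numpy as np

def free_energy(q, joint):
    """F(Q) = sum_h Q(h) log(Q(h) / P(h, v)) for a discrete hidden variable h."""
    mask = q > 0                        # terms with Q(h) = 0 contribute nothing
    return np.sum(q[mask] * (np.log(q[mask]) - np.log(joint[mask])))

joint = np.array([0.10, 0.25, 0.05])    # hypothetical P(h, v) for the observed v
neg_log_pv = -np.log(joint.sum())       # -log P(v)
posterior = joint / joint.sum()         # P(h | v)
uniform = np.full(3, 1.0 / 3.0)         # some other choice of Q

f_posterior = free_energy(posterior, joint)   # bound is tight here
f_uniform = free_energy(uniform, joint)       # bound is loose here
```

The gap between `f_uniform` and `neg_log_pv` is the KL divergence from the uniform Q to the true posterior.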

  37. Proof that EM maximizes log P(v) (Neal and Hinton 1993) • E-Step: By setting $Q(h)=P(h|v)$, we make the bound tight, so that $F = -\log P(v)$ • M-Step: By maximizing $\sum_h Q(h) \log P(h,v)$ wrt the parameters of P, we are minimizing F wrt the parameters of P • Since $-\log P_{new}(v) \le F_{new} \le F_{old} = -\log P_{old}(v)$, we have $\log P_{new}(v) \ge \log P_{old}(v)$. ☐

  38. Generalized EM • M-Step: Instead of minimizing F wrt P, just decrease F wrt P • E-Step: Instead of minimizing F wrt Q (ie, by setting $Q(h)=P(h|v)$), just decrease F wrt Q • Approximations • Variational techniques (which decrease F wrt Q) • Loopy belief propagation (note the phrase “loopy”) • Markov chain Monte Carlo (stochastic …)

  39. Summary of learning Bayesian networks • Observed variables decouple learning in different conditional PDFs • In contrast, hidden variables couple learning in different conditional PDFs • Learning models with hidden variables entails iteratively filling in hidden variables using exact or approximate probabilistic inference, and updating every child-parent conditional PDF

  40. Back to… Learning Generative Models of Images. Brendan Frey, University of Toronto and Canadian Institute for Advanced Research

  41. What constitutes an “image” • Uniform 2-D array of color pixels • Uniform 2-D array of grey-scale pixels • Non-uniform images (eg, retinal images, compressed sampling images) • Features extracted from the image (eg, SIFT features) • Subsets of image pixels selected by the model (must be careful to represent universe) • …

  42. Experiment: Fitting a mixture of Gaussians to pixel vectors extracted from complicated images • [Figure: learned models for model sizes of 1, 2, 3 and 4 classes]

  43. Why didn’t it work? • Is there a bug in the software? • I don’t think so, because the log-likelihood monotonically increases and the software works properly for toy data generated from a mixture of Gaussians • Is there a mistake in our mathematical derivation? • The EM algorithm for a mixture of Gaussians has been studied by many people – I think the math is ok

  44. Why didn’t it work? • Are we missing some important hidden variables? • YES: The location of each object

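  45. Transformed mixtures of Gaussians (TMG) (Frey and Jojic, 1999-2001) • Class c: $P(c) = \pi_c$; for c=1, $\mu_1$ = [image], $\mathrm{diag}(\Phi_1)$ = [image], $\pi_1 = 0.6$; for c=2, $\mu_2$ = [image], $\mathrm{diag}(\Phi_2)$ = [image], $\pi_2 = 0.4$ • Latent image z = [image] • Shift T = [image], with prior $P(T)$ • Observed image x = [image]: $P(x|z,T) = \mathcal{N}(x; Tz, \Psi)$, with $\Psi$ diagonal • [Figure: Bayes net c → z → x ← T]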

  46. EM for TMG • E step: Compute $Q(T)=P(T|x)$, $Q(c)=P(c|x)$, $Q(c,z)=P(z,c|x)$ and $Q(T,z)=P(z,T|x)$ for each x in data • M step: Set • $\pi_c$ = avg of $Q(c)$ • $\rho_T$ = avg of $Q(T)$ • $\mu_c$ = avg mean of z under $Q(z|c)$ • $\Phi_c$ = avg variance of z under $Q(z|c)$ • $\Psi$ = avg variance of $x - Tz$ under $Q(T,z)$
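
Part of the TMG E step can be sketched for 1-D signals: the joint posterior $P(c,T|x)$, from which $Q(c)$ and $Q(T)$ follow by marginalizing. The sketch makes several simplifying assumptions that are not in the slides: T is a cyclic shift, the latent z is Gaussian given c with mean $\mu_c$ and diagonal variance $\Phi_c$ and is integrated out analytically (so that $x|c,T$ is Gaussian with mean $T\mu_c$ and diagonal variance $T\Phi_c T^T + \Psi$), and the name `tmg_posterior` is hypothetical. It does not reproduce tmgEM.m.

```python
import numpy as np

def tmg_posterior(x, pi, rho, mu, phi, psi):
    """Joint posterior P(c, T | x) for 1-D signals, with T a cyclic shift by t."""
    num_classes, dim = mu.shape
    log_post = np.empty((num_classes, dim))
    for c in range(num_classes):
        for t in range(dim):
            mean = np.roll(mu[c], t)             # T mu_c
            var = np.roll(phi[c], t) + psi       # diagonal of T Phi_c T^T + Psi
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
            log_post[c, t] = np.log(pi[c]) + np.log(rho[t]) + log_lik
    log_post -= log_post.max()                   # normalize in the log domain
    post = np.exp(log_post)
    return post / post.sum()

mu = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # class 0: a narrow bump
               [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]])  # class 1: a wide bump
phi = np.full((2, 6), 0.01)
psi = np.full(6, 0.01)
pi = np.array([0.5, 0.5])
rho = np.full(6, 1.0 / 6.0)                      # uniform prior over shifts
x = np.roll(mu[1], 2)                            # class 1, shifted by 2
post = tmg_posterior(x, pi, rho, mu, phi, psi)
c_hat, t_hat = np.unravel_index(post.argmax(), post.shape)
q_c, q_t = post.sum(axis=1), post.sum(axis=0)    # Q(c) and Q(T)
```

The double loop makes the cost explicit: the E step visits every class under every allowed shift for each training case.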

  47. Experiment: Fitting transformed mixtures of Gaussians to complicated images • Random initialization • [Figure: learned models for model sizes of 1, 2, 3 and 4 classes]

  48. Let’s peek into the Bayes net (different movie) • [Movie panels: x, $P(c|x)$, $\mu_{\arg\max_c P(c|x)}$, $\arg\max_T P(T|x)$, $E[z|x]$, $E[Tz|x]$]

  49. tmgEM.m is available on the web

  50. Accounting for multiple objects in the same image