
Sparse Approximations to Bayesian Gaussian Processes



Presentation Transcript


  1. Sparse Approximations to Bayesian Gaussian Processes. Matthias Seeger, University of Edinburgh

  2. Collaborators • Neil Lawrence (Sheffield) • Chris Williams (Edinburgh) • Ralf Herbrich (MSR Cambridge)

  3. Overview of the Talk • Gaussian processes and approximations • Understanding sparse schemes as likelihood approximations • Two schemes and their relationships • Fast greedy selection for the projected latent variables scheme (GP regression)

  4. Why Sparse Approximations? • GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet: (almost) nobody uses them! • Reason: horrible O(n³) scaling in the number of training points n • If sparse approximations work, there is a host of applications, e.g. as building blocks in Bayesian networks, etc.
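
For a sense of where the O(n³) comes from, here is a minimal sketch of exact GP regression (zero prior mean assumed; the kernel and variable names are illustrative, not from the slides). The Cholesky factorisation of the n×n kernel matrix is the bottleneck the talk is about:

    import numpy as np

    def exact_gp_predict_mean(X, y, X_star, kernel, noise):
        # Exact GP regression mean: k_*^T (K + noise*I)^{-1} y.
        K = kernel(X, X)                                     # n x n kernel matrix
        L = np.linalg.cholesky(K + noise * np.eye(len(X)))   # the O(n^3) bottleneck
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # two triangular solves
        return kernel(X_star, X) @ alpha                     # O(n) per test mean

    def rbf(A, B, lengthscale=1.0):
        # Illustrative RBF kernel on the rows of A (m x p) and B (n x p).
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)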

  5. Gaussian Process Models. [Graphical model figure: inputs x1, x2, x3; latent outputs u1, u2, u3; targets y1, y2, y3.] Target y is separated by the latent u from all other variables, so inference is a finite problem. Gaussian prior (dense), kernel K.

  6. Conditional GP (Prior). n-dim. Gaussian parameterisation. Data D = {(xi, yi) | i = 1,…,n}. Latent outputs u = (u1,…,un). Approximate the posterior process P(u(·) | D) by a GP Q(u(·) | D).
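
A common way to parameterise such a Gaussian approximation, and the one that fits the likelihood-approximation view taken below, attaches a Gaussian "site" term to each data point; the site notation (mi, vi) and the zero prior mean are assumptions made here for concreteness, not the slide's own symbols:

    Q(u \mid D) \;\propto\; N(u \mid 0, K) \prod_{i=1}^{n} N(u_i \mid m_i, v_i)

so that Q(u | D) is again an n-dimensional Gaussian, determined by 2n site parameters.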

  7. GP Approximations • Most (non-MCMC) GP approximations use this representation • Exact computation of Q(u | D) is intractable in general and needs further approximation • Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D)

  8. Assumed Density Filtering Update (ADF step):
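
Spelled out (a standard statement of the ADF step rather than a quote from the slide): fold one likelihood term into the current approximation Q and project back onto the Gaussian family by moment matching,

    \hat{P}(u) = \frac{1}{Z_i}\, P(y_i \mid u_i)\, Q(u \mid D), \qquad
    Q_{\mathrm{new}}(u \mid D) = \operatorname*{argmin}_{Q' \text{ Gaussian}} \mathrm{KL}\big( \hat{P} \,\big\|\, Q' \big)

i.e. Qnew matches the first two moments of the tilted distribution.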

  9. Towards Sparsity • ADF = Bayesian Online [Opper]. Multiple updates: cavity method [Opper, Winther], EP [Minka] • Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther: COW] • Sequential updates suitable for sparse online or greedy methods

  10. Likelihood Approximations. Active set: I ⊂ {1,…,n}, |I| = d ≪ n. Several sparse schemes can be understood as likelihood approximations whose approximate likelihood depends on uI only.
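
In symbols (our paraphrase, not the slide's own formula): the exact likelihood factorises over all n latent values, and a sparse scheme replaces it by a single term that depends on the active subset uI only,

    P(y \mid u) \;=\; \prod_{i=1}^{n} P(y_i \mid u_i) \;\approx\; t(u_I)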

  11. Likelihood Approximations (II). [Graphical model figure: inputs x1,…,x4; latent outputs u1,…,u4; targets y1,…,y4.] Active set I = {2,3}.

  12. Likelihood Approximations (III) For such sparse schemes: • O(d²) parameters at most • Prediction in O(d²), O(d) for mean only • Approximations to marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc., become cheap as well!

  13. Two Schemes • IVM [Lawrence, Seeger, Herbrich: LSH]: ADF with fast greedy forward selection • Sparse Greedy GPR [Smola, Bartlett: SB]: greedy, expensive. Can be sped up: Projected Latent Variables [Seeger, Lawrence, Williams]. More general: sparse batch ADATAP [COW] • Not here: Sparse Online GP [Csato, Opper]

  14. Informative Vector Machine • ADF, stopped after d inclusions [could do deletions, exchanges]; only d of the n site terms are non-zero • Fast greedy forward selection using criteria known in active learning (a sketch follows below) • Faster than SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.)
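
A minimal sketch of this kind of greedy forward selection, specialised to Gaussian noise (regression) so that the differential-entropy score has a simple closed form; the function and parameter names, and the use of the entropy score, are illustrative assumptions rather than the IVM implementation itself:

    import numpy as np

    def greedy_select(K, noise, d):
        """Pick d active points by largest entropy-reduction score,
        updating all posterior marginal variances in O(n) per inclusion."""
        n = K.shape[0]
        zeta = np.diag(K).copy()          # current posterior marginal variances
        M = np.zeros((d, n))              # rank-one update vectors accumulated so far
        active = []
        for j in range(d):
            score = 0.5 * np.log1p(zeta / noise)   # differential-entropy reduction
            score[active] = -np.inf                # never re-select a point
            i = int(np.argmax(score))
            # include point i: rank-one downdate of all marginal variances
            m = (K[:, i] - M[:j].T @ M[:j, i]) / np.sqrt(zeta[i] + noise)
            M[j] = m
            zeta = zeta - m ** 2
            active.append(i)
        return active

The loop touches every remaining point once per inclusion, so the total cost is O(n d²), matching the figures quoted later in the talk.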

  15. Why So Simple? • Locality property of ADF: the marginal Qnew(ui) is obtained in O(1) from Q(ui) • Locality property and Gaussianity: simple relations between Q(ui) and Qnew(ui) give fast evaluation of differential criteria

  16. KL-Optimal Projections • Csato/Opper observed that the projection onto a sparse approximation can be chosen to be KL-optimal:

  17. KL-Optimal Projections (II) • For a Gaussian likelihood the projection is available in closed form • Can be used online or batch • A bit unfortunate: we use relative entropy both ways around!
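
The "both ways around" remark can be made explicit; this is our reading, in the notation of the ADF step above, and should be treated as an interpretation rather than a quote. The ADF update puts the approximation on the right of the relative entropy, while the sparse projection constrains the approximation to keep the prior conditional P(uR | uI) and puts it on the left:

    Q_{\mathrm{new}} = \operatorname*{argmin}_{Q' \text{ Gaussian}} \mathrm{KL}\big( \hat{P} \,\big\|\, Q' \big)
    \qquad\text{vs.}\qquad
    \hat{Q} = \operatorname*{argmin}_{Q':\, Q'(u_R \mid u_I) = P(u_R \mid u_I)} \mathrm{KL}\big( Q' \,\big\|\, P(u \mid D) \big)

For a Gaussian likelihood, the constrained minimiser replaces each P(yi | ui) by N(yi | E[ui | uI], σ²) as a function of uI, which is exactly the projected-latent-variables likelihood of the next slide.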

  18. Projected Latent Variables • Full GPR samples uI ∼ P(uI), uR ∼ P(uR | uI), y ∼ N(y | u, σ²I). • Instead: y ∼ N(y | E[u | uI], σ²I). The latent variables uR are replaced by projections in the likelihood [SB] (though without this interpretation) • Note: sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods)
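
Under this likelihood everything stays Gaussian, so the posterior over uI and the predictive mean come out in closed form. A minimal numpy sketch, assuming a zero-mean GP prior on uI; the matrix names (K_I for the d×d active-set kernel matrix, K_nI for the n×d cross-kernel matrix, K_sI for test-to-active) are chosen here for illustration:

    import numpy as np

    def plv_posterior_mean_uI(K_I, K_nI, y, sigma2):
        """Posterior mean of u_I under y ~ N(K_nI K_I^{-1} u_I, sigma2*I), u_I ~ N(0, K_I):
        E[u_I | y] = K_I (sigma2*K_I + K_In K_nI)^{-1} K_In y."""
        B = sigma2 * K_I + K_nI.T @ K_nI     # d x d matrix, O(n d^2) to form
        return K_I @ np.linalg.solve(B, K_nI.T @ y)

    def plv_predict_mean(K_sI, K_I, K_nI, y, sigma2):
        """Predictive mean at test points, k_*I K_I^{-1} E[u_I | y],
        which simplifies to k_*I (sigma2*K_I + K_In K_nI)^{-1} K_In y."""
        B = sigma2 * K_I + K_nI.T @ K_nI
        return K_sI @ np.linalg.solve(B, K_nI.T @ y)

Once B has been formed and factorised (O(n d²) overall), each test mean costs only O(d), in line with the costs on slide 12.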

  19. Fast Greedy Selections • With this likelihood approximation, typical forward selection criteria (MAP [SB]; diff. entropy, info-gain [LSH]) are too expensive • Problem: Upon inclusion, latent ui is coupled with all targets y • Cheap criterion: Ignore most couplings for score evaluation (not for inclusion!)

  20. Yet Another Approximation • To score xi, we approximate Qnew(u | D) after inclusion of i by a cheaper distribution that ignores most of the new couplings • Example: information gain

  21. Fast Greedy Selections (II) • Leads to O(1) criteria. Cost of searching over all remaining points dominated by cost for inclusion • Can easily be generalized to allow for couplings between ui and some targets, if desired • Can be done for sparse batch ADATAP as well

  22. Marginal Likelihood • The marginal likelihood of the projected model is available in closed form (see below) • Can be optimized efficiently w.r.t. σ and the kernel parameters: O(n d (d+p)) per gradient, where p is the number of parameters • Keep I fixed during line searches, reselect for search directions
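
Assuming a zero-mean GP prior on uI (our assumption; the slide leaves the mean implicit), marginalising uI out of the projected likelihood above gives

    Q(y) \;=\; N\big( y \,\big|\, 0,\; \sigma^2 I + K_{nI} K_I^{-1} K_{In} \big)

whose log and gradient can be computed with the matrix inversion lemma at the O(n d (d + p)) per-gradient cost quoted on the slide.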

  23. Conclusions • Most sparse approximations can be understood as likelihood approximations • Several schemes available, all O(n d²), yet constants do matter here! • Fast information-theoretic criteria effective for classification; extension to active learning is straightforward

  24. Conclusions (II) • Missing: experimental comparison, esp. to test effectiveness of marginal likelihood optimization • Extensions: • C classes: easy in O(n d² C²), maybe in O(n d² C) • Integrate with Bayesian networks [Friedman, Nachman]
