
Sparse Approximations to Bayesian Gaussian Processes

Presentation Transcript


  1. Sparse Approximations to Bayesian Gaussian Processes. Matthias Seeger, University of Edinburgh

  2. Joint Work With • Christopher Williams (Edinburgh) • Neil Lawrence (Sheffield) Builds on prior work by: • Lehel Csato, Manfred Opper (Birmingham)

  3. Overview of the Talk • Gaussian processes and approximations • Understanding sparse schemes as likelihood approximations • Fast greedy selection • Model selection

  4. The Recipe • Goal: Probabilistic approximation to GP inference. Scaling (in principle) O(n) • Ingredients: • Gaussian approximations • m-projections (moment matching) • e-projections (mean field)
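As a concrete illustration of the m-projection ingredient, here is a minimal numerical sketch (an addition, not from the slides): the m-projection of a 1-D distribution onto the Gaussian family is simply moment matching. The grid quadrature and the probit-tilted example density are illustrative assumptions.

```python
# Illustrative sketch of an m-projection (moment matching) in 1-D, assuming
# a simple grid quadrature; the probit-tilted example density is made up.
import numpy as np
from scipy.stats import norm

def m_project_1d(log_density, grid):
    """Return (mean, var) of the Gaussian that moment-matches the density."""
    dx = grid[1] - grid[0]
    logp = log_density(grid)
    p = np.exp(logp - logp.max())
    p /= p.sum() * dx                         # normalise on the grid
    mean = (grid * p).sum() * dx
    var = ((grid - mean) ** 2 * p).sum() * dx
    return mean, var

# Example: Gaussian prior times probit likelihood (label y = +1).
grid = np.linspace(-8.0, 8.0, 4001)
log_tilted = lambda u: norm.logpdf(u, loc=0.0, scale=1.5) + norm.logcdf(u)
print(m_project_1d(log_tilted, grid))
```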

  5. Gaussian Process Models [Graphical model: inputs x_1, x_2, x_3; latent values u_1, u_2, u_3 under a dense Gaussian prior with kernel K; observations y_1, y_2, y_3.] Given u_i: y_i is independent of the rest!
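To make the graphical model concrete, a hedged generative sketch follows; the RBF kernel and Gaussian noise are illustrative choices (the talk covers general likelihoods), the point being the dense prior over u and the conditional independence of each y_i given u_i.

```python
# Hedged sketch of the model in the figure; the RBF kernel and Gaussian noise
# are illustrative choices. Key structure: dense prior over u, and each y_i
# depends on u only through u_i.
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))                 # inputs x_i
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))             # dense prior, kernel K
u = rng.multivariate_normal(np.zeros(len(X)), K)         # latents u_i
y = u + 0.1 * rng.standard_normal(len(X))                # y_i | u_i independent
```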

  6. Roadmap: Non-Gaussian posterior process → (m-projection, EP) → GP approximation: finite Gaussian approximation, feasible fitting scheme → (likelihood approximation by e-projection) → sparse scheme: sparse Gaussian approximation, leading to a sparse predictor

  7. Step 1: Infinite → Finite • Gaussian process approximation Q(u(·) | y) of the posterior P(u(·) | y) by m-projection • Data constrains u = (u_1, …, u_n) only → Q is determined by the finite Gaussian Q(u | y) and the prior GP • The optimal Gaussian Q(u | y) is hard to find, and not sparse

  8. Step 2: Expectation Propagation • Behind EP: approximate variational principle (e-projections) with “weak marginalisation” (moment) constraints (m-projections) • Replace the likelihood terms P(y_i | u_i) by Gaussian-like sites t_i(u_i) ∝ N(u_i | m_i, p_i^-1) • Update: swap t_i(u_i) → P(y_i | u_i), m-project to a Gaussian, extract the new t_i(u_i) • t_i(u_i): the role of Shafer/Shenoy update factors
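For concreteness, the following is a hedged sketch of one such site update in a single dimension, assuming a probit likelihood P(y_i | u_i) = Φ(y_i u_i) so the moment matching has a closed form; the function and variable names are illustrative, not the talk's.

```python
# Sketch of a single EP site update (assumed 1-D probit likelihood; names
# illustrative). Pattern: remove the site to form the cavity, swap in the
# true likelihood, m-project (match moments), extract the new site t_i(u_i).
import numpy as np
from scipy.stats import norm

def ep_site_update(post_mean, post_var, site_m, site_p, y):
    # Cavity: divide the site t_i(u_i) ~ N(u_i | m_i, p_i^-1) out of Q.
    cav_p = 1.0 / post_var - site_p
    cav_mean = (post_mean / post_var - site_p * site_m) / cav_p
    cav_var = 1.0 / cav_p
    # m-projection: moments of cavity x probit likelihood (closed form).
    z = y * cav_mean / np.sqrt(1.0 + cav_var)
    r = norm.pdf(z) / norm.cdf(z)
    new_mean = cav_mean + y * cav_var * r / np.sqrt(1.0 + cav_var)
    new_var = cav_var - cav_var**2 * r * (z + r) / (1.0 + cav_var)
    # Extract the new site so that cavity x site has the matched moments.
    new_site_p = 1.0 / new_var - cav_p
    new_site_m = (new_mean / new_var - cav_mean / cav_var) / new_site_p
    return new_mean, new_var, new_site_m, new_site_p
```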

  9. Likelihood Approximations [Graphical model: inputs x_1, …, x_4; latents u_1, …, u_4; observations y_1, …, y_4.] Active set I = {2, 3}, d = |I| = 2. The likelihood P(y | u) is replaced by P(y | u_I): a sparse approximation!

  10. Step 3: KL-optimal Projections • If P(u | y) ∝ N(m | u, P^-1) P(u), e-projection onto the I-LH-approximation family gives Q(u | y) ∝ N(m | E_P[u | u_I], P^-1) P(u) [Csato/Opper] • Here: m and P collect the EP site parameters, m = (m_i), P = diag(p_i) • Good news: E_P[u | u_I] = P_I^T u_I requires only the small inversion K_I^-1! → The O(n^3) scaling can be circumvented
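Numerically, the "good news" amounts to the following sketch (illustrative names, an addition to the transcript): under a zero-mean GP prior, E_P[u | u_I] = K_{·,I} K_I^-1 u_I, so only the d × d matrix K_I needs to be factorised.

```python
# Sketch of the small inversion: E[u | u_I] = K_{.,I} K_I^{-1} u_I under a
# zero-mean GP prior. Only the d x d active-set kernel matrix K_I is
# factorised, so the cost is O(n d^2 + d^3) rather than O(n^3).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def conditional_mean(K_nI, K_I, u_I, jitter=1e-8):
    chol = cho_factor(K_I + jitter * np.eye(K_I.shape[0]), lower=True)
    return K_nI @ cho_solve(chol, u_I)
```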

  11. Sparse Approximation Scheme • Iterate between: • Select a new i and include it in I • EP updates (m-projection), followed by e-projection onto the I-LH-approximation family [skip EP if the likelihood is Gaussian] • Exchange moves are possible (unstable?) • But how to select inclusion candidates i using a fast score?
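Schematically, the iteration looks roughly like this skeleton (function names are placeholders, not the authors' code): `score_candidate` stands in for the fast selection score of the next slide, `refit_posterior` for the EP update followed by e-projection.

```python
# Skeleton of the sparse approximation scheme; the two callbacks are
# placeholders for the fast selection score and for the EP update followed
# by e-projection onto the I-likelihood-approximation family.
def sparse_gp_fit(candidates, d_max, score_candidate, refit_posterior):
    active, state = [], None
    candidates = list(candidates)
    while len(active) < d_max and candidates:
        # Greedy step: pick the candidate with the best (fast) score.
        best = max(candidates, key=lambda j: score_candidate(j, active, state))
        candidates.remove(best)
        active.append(best)
        # EP update + e-projection (EP is skipped for a Gaussian likelihood).
        state = refit_posterior(active)
    return active, state
```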

  12. Fast Selection Scores • Criteria like the information gain D[Q_new || Q] (Q_new after inclusion of i) are too expensive: u_i is immediately coupled with all n sites! • Approximate the criteria by removing most couplings in Q_new → O(|H| d^2 + 1), with H ⊆ {1, …, n} \ (I ∪ {i}) [Diagram: sites and latents u_I, u_i, with the candidate i highlighted.]
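One way to read the cheapest case H = ∅ (used in the experiments later) is that only the candidate's own marginal is allowed to change, so the information gain collapses to a scalar Gaussian KL divergence at O(1) per candidate. This is a hedged reading, not necessarily the paper's exact score:

```python
# Hedged illustration (not necessarily the paper's exact score): with H = {}
# only the candidate's own marginal N(m, v) is updated, so the information
# gain reduces to a scalar Gaussian KL, evaluated in O(1) per candidate.
import numpy as np

def gauss_kl_1d(m_new, v_new, m_old, v_old):
    """D[N(m_new, v_new) || N(m_old, v_old)] for scalar Gaussians."""
    return 0.5 * (np.log(v_old / v_new)
                  + (v_new + (m_new - m_old) ** 2) / v_old
                  - 1.0)
```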

  13. Model Selection • Gaussian likelihood (regression): sparse approximation Q(y) to the marginal likelihood P(y) by plugging in the likelihood approximation. Iterate between gradient steps on log Q(y) and re-selection of I • General case: minimise the variational criterion behind EP (ADATAP), similar to EM, using Q(u | y) instead of the posterior
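For the Gaussian-noise case, plugging the likelihood approximation into the marginal gives a low-rank-plus-noise Gaussian, Q(y) = N(y | 0, σ² I + K_{·,I} K_I^-1 K_{I,·}). Below is a hedged sketch of evaluating log Q(y) in O(n d²) via the Woodbury and determinant lemmas; variable names are illustrative.

```python
# Sketch (a reconstruction of the Gaussian-noise case, names illustrative):
# log of Q(y) = N(y | 0, sigma^2 I + K_nI K_I^{-1} K_In), evaluated with the
# Woodbury identity and the matrix determinant lemma in O(n d^2).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sparse_log_marginal(y, K_nI, K_I, sigma2, jitter=1e-8):
    n, d = K_nI.shape
    K_I = K_I + jitter * np.eye(d)
    A = sigma2 * K_I + K_nI.T @ K_nI            # d x d inner matrix
    cA = cho_factor(A, lower=True)
    cK = cho_factor(K_I, lower=True)
    Kty = K_nI.T @ y
    quad = (y @ y - Kty @ cho_solve(cA, Kty)) / sigma2
    logdet = (2.0 * np.log(np.diag(cA[0])).sum()
              - 2.0 * np.log(np.diag(cK[0])).sum()
              + (n - d) * np.log(sigma2))
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))
```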

  14. Related Work • Csato/Opper: the same approximation, but an online-like scheme of including/removing points instead of greedy forward selection • Smola/Bartlett: restricted to regression with Gaussian noise; expensive selection heuristic [O(n d)] → high training cost

  15. Experiments • Regression with Gaussian noise, simplest selection score approximation (H = ∅). See the paper for details • Promising: a hard, low-noise task with many irrelevant attributes. The sparse scheme matches the performance of full GPR in less than 1/10 of the time. Methods with an isotropic kernel fail badly. Model selection is essential here

  16. Conclusions • Sparse approximations overcome the severe scaling problem of GP methods • Greedy selection based on “active learning” criteria can yield very sparse solutions with errors close to, or better than, those of full GPs • Sparse inference is the inner loop of model selection → fast selection scores are essential for greedy schemes

  17. Conclusions (II) • Controllable sparsity and training time • Staying as close as possible to the “gold standard” (EP), given resource constraints → transfer of properties (error bars, model selection, embedding in other models, …) • A fast, flexible C++ implementation will be made available
