
Sparse Approximations to Bayesian Gaussian Processes

Presentation Transcript


  1. Sparse Approximations to Bayesian Gaussian Processes. Matthias Seeger, University of Edinburgh

  2. Joint Work With • Christopher Williams (Edinburgh) • Neil Lawrence (Sheffield) Builds on prior work by: • Lehel Csato, Manfred Opper (Birmingham)

  3. Overview of the Talk • Gaussian processes and approximations • Understanding sparse schemes as likelihood approximations • Fast greedy selection • Model selection

  4. The Recipe • Goal: Probabilistic approximation to GP inference. Scaling (in principle) O(n) • Ingredients: • Gaussian approximations • m-projections (moment matching) • e-projections (mean field)
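As a concrete illustration of the m-projection ingredient, here is a minimal numerical sketch (an addition, not from the slides): the m-projection of a 1-D distribution onto the Gaussian family is simply moment matching. The grid quadrature and the probit-tilted example density are illustrative assumptions.

```python
# Illustrative sketch of an m-projection (moment matching) in 1-D, assuming
# a simple grid quadrature; the probit-tilted example density is made up.
import numpy as np
from scipy.stats import norm

def m_project_1d(log_density, grid):
    """Return (mean, var) of the Gaussian that moment-matches the density."""
    dx = grid[1] - grid[0]
    logp = log_density(grid)
    p = np.exp(logp - logp.max())
    p /= p.sum() * dx                         # normalise on the grid
    mean = (grid * p).sum() * dx
    var = ((grid - mean) ** 2 * p).sum() * dx
    return mean, var

# Example: Gaussian prior times probit likelihood (label y = +1).
grid = np.linspace(-8.0, 8.0, 4001)
log_tilted = lambda u: norm.logpdf(u, loc=0.0, scale=1.5) + norm.logcdf(u)
print(m_project_1d(log_tilted, grid))
```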

  5. Gaussian Process Models [Graphical model: inputs x_1, x_2, x_3; latent values u_1, u_2, u_3 under a dense Gaussian prior with kernel K; observations y_1, y_2, y_3.] Given u_i: y_i is independent of the rest!
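To make the graphical model concrete, a hedged generative sketch follows; the RBF kernel and Gaussian noise are illustrative choices (the talk covers general likelihoods), the point being the dense prior over u and the conditional independence of each y_i given u_i.

```python
# Hedged sketch of the model in the figure; the RBF kernel and Gaussian noise
# are illustrative choices. Key structure: dense prior over u, and each y_i
# depends on u only through u_i.
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))                 # inputs x_i
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))             # dense prior, kernel K
u = rng.multivariate_normal(np.zeros(len(X)), K)         # latents u_i
y = u + 0.1 * rng.standard_normal(len(X))                # y_i | u_i independent
```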

  6. Roadmap: Non-Gaussian posterior process → (m-projection, EP) → GP approximation: finite Gaussian approximation, feasible fitting scheme → (likelihood approximation by e-projection) → sparse scheme: sparse Gaussian approximation, leading to a sparse predictor

  7. Step 1: Infinite → Finite • Gaussian process approximation Q(u(·) | y) of the posterior P(u(·) | y) by m-projection • Data constrains u = (u_1, …, u_n) only → Q is determined by the finite Gaussian Q(u | y) and the prior GP • The optimal Gaussian Q(u | y) is hard to find, and not sparse

  8. Step 2: Expectation Propagation • Behind EP: approximate variational principle (e-projections) with “weak marginalisation” (moment) constraints (m-projections) • Replace the likelihood terms P(y_i | u_i) by Gaussian-like sites t_i(u_i) ∝ N(u_i | m_i, p_i^-1) • Update: swap t_i(u_i) → P(y_i | u_i), m-project to a Gaussian, extract the new t_i(u_i) • t_i(u_i): the role of Shafer/Shenoy update factors
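For concreteness, the following is a hedged sketch of one such site update in a single dimension, assuming a probit likelihood P(y_i | u_i) = Φ(y_i u_i) so the moment matching has a closed form; the function and variable names are illustrative, not the talk's.

```python
# Sketch of a single EP site update (assumed 1-D probit likelihood; names
# illustrative). Pattern: remove the site to form the cavity, swap in the
# true likelihood, m-project (match moments), extract the new site t_i(u_i).
import numpy as np
from scipy.stats import norm

def ep_site_update(post_mean, post_var, site_m, site_p, y):
    # Cavity: divide the site t_i(u_i) ~ N(u_i | m_i, p_i^-1) out of Q.
    cav_p = 1.0 / post_var - site_p
    cav_mean = (post_mean / post_var - site_p * site_m) / cav_p
    cav_var = 1.0 / cav_p
    # m-projection: moments of cavity x probit likelihood (closed form).
    z = y * cav_mean / np.sqrt(1.0 + cav_var)
    r = norm.pdf(z) / norm.cdf(z)
    new_mean = cav_mean + y * cav_var * r / np.sqrt(1.0 + cav_var)
    new_var = cav_var - cav_var**2 * r * (z + r) / (1.0 + cav_var)
    # Extract the new site so that cavity x site has the matched moments.
    new_site_p = 1.0 / new_var - cav_p
    new_site_m = (new_mean / new_var - cav_mean / cav_var) / new_site_p
    return new_mean, new_var, new_site_m, new_site_p
```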

  9. Likelihood Approximations [Graphical model: inputs x_1, …, x_4; latents u_1, …, u_4; observations y_1, …, y_4.] Active set I = {2, 3}, d = |I| = 2. The likelihood P(y | u) is replaced by P(y | u_I): a sparse approximation!

  10. Step 3: KL-optimal Projections • If P(u | y) ∝ N(m | u, P^-1) P(u), e-projection onto the I-LH-approximation family gives Q(u | y) ∝ N(m | E_P[u | u_I], P^-1) P(u) [Csato/Opper] • Here: m and P collect the EP site parameters, m = (m_i), P = diag(p_i) • Good news: E_P[u | u_I] = P_I^T u_I requires only the small inversion K_I^-1! → The O(n^3) scaling can be circumvented
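Numerically, the "good news" amounts to the following sketch (illustrative names, an addition to the transcript): under a zero-mean GP prior, E_P[u | u_I] = K_{·,I} K_I^-1 u_I, so only the d × d matrix K_I needs to be factorised.

```python
# Sketch of the small inversion: E[u | u_I] = K_{.,I} K_I^{-1} u_I under a
# zero-mean GP prior. Only the d x d active-set kernel matrix K_I is
# factorised, so the cost is O(n d^2 + d^3) rather than O(n^3).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def conditional_mean(K_nI, K_I, u_I, jitter=1e-8):
    chol = cho_factor(K_I + jitter * np.eye(K_I.shape[0]), lower=True)
    return K_nI @ cho_solve(chol, u_I)
```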

  11. Sparse Approximation Scheme • Iterate between: • Select a new i and include it in I • EP updates (m-projection), followed by e-projection onto the I-LH-approximation family [skip EP if the likelihood is Gaussian] • Exchange moves are possible (unstable?) • But how to select inclusion candidates i using a fast score?
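Schematically, the iteration looks roughly like this skeleton (function names are placeholders, not the authors' code): `score_candidate` stands in for the fast selection score of the next slide, `refit_posterior` for the EP update followed by e-projection.

```python
# Skeleton of the sparse approximation scheme; the two callbacks are
# placeholders for the fast selection score and for the EP update followed
# by e-projection onto the I-likelihood-approximation family.
def sparse_gp_fit(candidates, d_max, score_candidate, refit_posterior):
    active, state = [], None
    candidates = list(candidates)
    while len(active) < d_max and candidates:
        # Greedy step: pick the candidate with the best (fast) score.
        best = max(candidates, key=lambda j: score_candidate(j, active, state))
        candidates.remove(best)
        active.append(best)
        # EP update + e-projection (EP is skipped for a Gaussian likelihood).
        state = refit_posterior(active)
    return active, state
```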

  12. Fast Selection Scores • Criteria like the information gain D[Q_new || Q] (Q_new after inclusion of i) are too expensive: u_i is immediately coupled with all n sites! • Approximate the criteria by removing most couplings in Q_new → O(|H| d^2 + 1), with H ⊆ {1, …, n} \ (I ∪ {i}) [Diagram: sites and latents u_I, u_i, with the candidate i highlighted.]
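One way to read the cheapest case H = ∅ (used in the experiments later) is that only the candidate's own marginal is allowed to change, so the information gain collapses to a scalar Gaussian KL divergence at O(1) per candidate. This is a hedged reading, not necessarily the paper's exact score:

```python
# Hedged illustration (not necessarily the paper's exact score): with H = {}
# only the candidate's own marginal N(m, v) is updated, so the information
# gain reduces to a scalar Gaussian KL, evaluated in O(1) per candidate.
import numpy as np

def gauss_kl_1d(m_new, v_new, m_old, v_old):
    """D[N(m_new, v_new) || N(m_old, v_old)] for scalar Gaussians."""
    return 0.5 * (np.log(v_old / v_new)
                  + (v_new + (m_new - m_old) ** 2) / v_old
                  - 1.0)
```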

  13. Model Selection • Gaussian likelihood (regression): sparse approximation Q(y) to the marginal likelihood P(y) by plugging in the likelihood approximation. Iterate between gradient steps on log Q(y) and re-selection of I • General case: minimise the variational criterion behind EP (ADATAP), similar to EM, using Q(u | y) instead of the posterior
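For the Gaussian-noise case, plugging the likelihood approximation into the marginal gives a low-rank-plus-noise Gaussian, Q(y) = N(y | 0, σ² I + K_{·,I} K_I^-1 K_{I,·}). Below is a hedged sketch of evaluating log Q(y) in O(n d²) via the Woodbury and determinant lemmas; variable names are illustrative.

```python
# Sketch (a reconstruction of the Gaussian-noise case, names illustrative):
# log of Q(y) = N(y | 0, sigma^2 I + K_nI K_I^{-1} K_In), evaluated with the
# Woodbury identity and the matrix determinant lemma in O(n d^2).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sparse_log_marginal(y, K_nI, K_I, sigma2, jitter=1e-8):
    n, d = K_nI.shape
    K_I = K_I + jitter * np.eye(d)
    A = sigma2 * K_I + K_nI.T @ K_nI            # d x d inner matrix
    cA = cho_factor(A, lower=True)
    cK = cho_factor(K_I, lower=True)
    Kty = K_nI.T @ y
    quad = (y @ y - Kty @ cho_solve(cA, Kty)) / sigma2
    logdet = (2.0 * np.log(np.diag(cA[0])).sum()
              - 2.0 * np.log(np.diag(cK[0])).sum()
              + (n - d) * np.log(sigma2))
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))
```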

  14. Related Work • Csato/Opper: the same approximation, but an online-like scheme of including/removing points instead of greedy forward selection • Smola/Bartlett: restricted to regression with Gaussian noise; expensive selection heuristic [O(n d)] → high training cost

  15. Experiments • Regression with Gaussian noise, simplest selection score approximation (H = ∅). See the paper for details • Promising: a hard, low-noise task with many irrelevant attributes. The sparse scheme matches the performance of full GPR in less than 1/10 of the time. Methods with an isotropic kernel fail badly. Model selection is essential here

  16. Conclusions • Sparse approximations overcome the severe scaling problem of GP methods • Greedy selection based on “active learning” criteria can yield very sparse solutions with errors close to, or better than, those of full GPs • Sparse inference is the inner loop of model selection → fast selection scores are essential for greedy schemes

  17. Conclusions (II) • Controllable sparsity and training time • Staying as close as possible to the “gold standard” (EP), given resource constraints → transfer of properties (error bars, model selection, embedding in other models, …) • A fast, flexible C++ implementation will be made available
