## PAC-Bayesian Theorems for Gaussian Process Classifications

Matthias Seeger, University of Edinburgh

**Overview**
• PAC-Bayesian theorem for Gibbs classifiers
• Application to Gaussian process classification
• Experiments
• Conclusions

**What Is a PAC Bound?**
• Sample $S = \{(x_i, t_i) \mid i = 1, \dots, n\}$, drawn i.i.d. from an unknown distribution $P^*$
• Algorithm: $S \mapsto$ predictor of $t_*$ from $x_*$
• Generalisation error: $\mathrm{gen}(S)$, the probability under $P^*$ of misclassifying a fresh pattern $(x_*, t_*)$
• PAC (distribution-free) bound: with probability at least $1 - \delta$ over the i.i.d. draw of $S$, $\mathrm{gen}(S) \le \mathrm{emp}(S) + \varepsilon(n, \delta)$, uniformly over data distributions $P^*$

**Nonuniform PAC Bounds**
• A PAC bound has to hold independent of the *correctness* of prior knowledge
• It does not have to be independent of prior knowledge itself
• Unfortunately, most standard VC bounds depend only vaguely on the prior/model they are applied to, and therefore lack tightness

**Gibbs Classifiers**
(Figure: graphical model in which the weights $w$ generate latent outputs $y_1, y_2, y_3$, which in turn generate the targets $t_1, t_2, t_3$; predictions take values in $\{-1, +1\}$.)
• Bayes classifier: average over $w \sim Q$ first, then threshold: $t_* = \mathrm{sign}\, \mathbb{E}_{w \sim Q}[y(x_*, w)]$
• Gibbs classifier: draw a new independent $w \sim Q$ for each prediction, then predict $t_* = \mathrm{sign}\, y(x_*, w)$

**PAC-Bayesian Theorem**
Result for Gibbs classifiers:
• Prior $P(w)$, independent of $S$
• Posterior $Q(w)$, may depend on $S$
• Expected generalisation error: $\mathrm{gen}(Q) = \mathbb{E}_{w \sim Q}[\mathrm{gen}(w)]$
• Expected empirical error: $\mathrm{emp}(S, Q) = \mathbb{E}_{w \sim Q}\bigl[\tfrac{1}{n} \sum_{i=1}^n \mathbf{1}\{t_i \ne \mathrm{sign}\, y(x_i, w)\}\bigr]$

**PAC-Bayesian Theorem (II)**
McAllester (1999): with probability at least $1 - \delta$ over $S$, simultaneously for all $Q$,
$$\mathrm{gen}(Q) \le \mathrm{emp}(S, Q) + \sqrt{\frac{D[Q \,\|\, P] + \ln\frac{1}{\delta} + \ln n + 2}{2n - 1}}$$
• $D[Q \,\|\, P]$: relative entropy. If $Q(w)$ is a feasible approximation to the Bayesian posterior, we can compute $D[Q \,\|\, P]$

**The Proof Idea**
Step 1: an inequality for a dumb classifier. Let $D(w) = D_{\mathrm{Ber}}[\mathrm{emp}(S, w) \,\|\, \mathrm{gen}(w)]$, the relative entropy between Bernoulli distributions. A large deviation bound $\mathbb{E}_S[\exp(n\, D(w))] \le n + 1$ holds for fixed $w$ (use the asymptotic equipartition property). Since $P(w)$ is independent of $S$, the bound also holds "on average": $\mathbb{E}_S\, \mathbb{E}_{w \sim P}[\exp(n\, D(w))] \le n + 1$, so by Markov's inequality, with probability at least $1 - \delta$ over $S$, $\mathbb{E}_{w \sim P}[\exp(n\, D(w))] \le \frac{n+1}{\delta}$.

**The Proof Idea (II)**
Could use Jensen's inequality: $\mathbb{E}_{w \sim P}[\exp(n\, D(w))] \ge \exp\bigl(n\, \mathbb{E}_{w \sim P}[D(w)]\bigr)$, giving $\mathbb{E}_{w \sim P}[D(w)] \le \tfrac{1}{n} \ln \tfrac{n+1}{\delta}$. But so what?? $P$ is fixed a priori, giving a pretty dumb classifier!
• Can we exchange $P$ for $Q$? Yes!
• What do we have to pay? $n^{-1} D[Q \,\|\, P]$

**Convex Duality**
• Could finish the proof using tricks and Jensen. Let's see what's behind it instead!
• Convex (Legendre) duality: a very simple but powerful concept: parameterise the linear lower bounds to a convex function
• Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem

**The Proof Idea (III)**
• Works just as well for spaces of functions and distributions.
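The change-of-measure inequality delivered by this duality, $\mathbb{E}_{Q}[l(w)] \le D[Q \,\|\, P] + \ln \mathbb{E}_{P}[\exp(l(w))]$, can be checked numerically. A minimal sketch on a finite weight space (the distributions and variable names here are illustrative, not from the talk):

```python
import numpy as np

# Check E_Q[l(w)] <= D[Q || P] + ln E_P[exp(l(w))] on a finite weight space.
# P, Q, l are arbitrary hypothetical choices for illustration.
rng = np.random.default_rng(0)
m = 50                               # size of the finite weight space
P = rng.dirichlet(np.ones(m))        # "prior" P(w)
Q = rng.dirichlet(np.ones(m))        # "posterior" Q(w)
l = rng.normal(size=m)               # arbitrary function l(w)

lhs = np.sum(Q * l)                                  # E_Q[l]
kl = np.sum(Q * np.log(Q / P))                       # D[Q || P]
rhs = kl + np.log(np.sum(P * np.exp(l)))             # D[Q||P] + ln E_P[e^l]
assert lhs <= rhs + 1e-12

# Equality is attained by the Gibbs distribution Q*(w) ∝ P(w) exp(l(w)):
Qstar = P * np.exp(l)
Qstar /= Qstar.sum()
lhs_star = np.sum(Qstar * l)
rhs_star = np.sum(Qstar * np.log(Qstar / P)) + np.log(np.sum(P * np.exp(l)))
assert abs(lhs_star - rhs_star) < 1e-8
```

The Gibbs distribution attaining equality is precisely the "parameterised linear lower bound" view of Legendre duality applied to the log partition function $\ln \mathbb{E}_P[\exp(l(w))]$.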
• For our purpose: $\Lambda(l) = \ln \mathbb{E}_{w \sim P}[\exp(l(w))]$ is convex in $l$ and has the dual $\Lambda^*(Q) = D[Q \,\|\, P]$

**The Proof Idea (IV)**
• This gives the bound $\mathbb{E}_{w \sim Q}[l(w)] \le D[Q \,\|\, P] + \ln \mathbb{E}_{w \sim P}[\exp(l(w))]$ for all $Q$, $l$
• Set $l(w) = n\, D(w)$. We have already bounded the second term on the right: with probability at least $1 - \delta$, $\ln \mathbb{E}_{w \sim P}[\exp(n\, D(w))] \le \ln \tfrac{n+1}{\delta}$. And on the left (Jensen again, using the joint convexity of the relative entropy): $n\, D_{\mathrm{Ber}}[\mathrm{emp}(S, Q) \,\|\, \mathrm{gen}(Q)] \le \mathbb{E}_{w \sim Q}[n\, D(w)]$. Together: $D_{\mathrm{Ber}}[\mathrm{emp}(S, Q) \,\|\, \mathrm{gen}(Q)] \le \tfrac{1}{n}\bigl(D[Q \,\|\, P] + \ln \tfrac{n+1}{\delta}\bigr)$

**Comments**
• The PAC-Bayesian technique is generic: use specific large deviation bounds for the $Q$-independent term
• Choice of $Q$: trade-off between $\mathrm{emp}(S, Q)$ and the divergence $D[Q \,\|\, P]$. The Bayesian posterior is a good candidate

**Gaussian Process Classification**
• Recall yesterday: we approximate the true posterior process by a Gaussian one, $Q = \mathcal{N}(\mu, \Sigma)$ over the latent outputs at the training points, with GP prior $P = \mathcal{N}(0, K)$ for kernel matrix $K$

**The Relative Entropy**
• But then the relative entropy is just
$$D[Q \,\|\, P] = \tfrac{1}{2}\Bigl(\mathrm{tr}(K^{-1}\Sigma) + \mu^\top K^{-1}\mu - n + \ln\frac{\det K}{\det \Sigma}\Bigr)$$
• Straightforward to compute for all GPC approximations in this class

**Concrete GPC Methods**
We considered so far:
• Laplace GPC [Barber/Williams]
• Sparse greedy GPC (IVM) [Csató/Opper, Lawrence/Seeger/Herbrich]
Setup: downsampled MNIST (2s vs. 3s), RBF kernels, model selection using independent holdout sets (no ML-II allowed here!)

**Results: Sparse Greedy GPC**
• Extremely tight for a kernel classifier bound
• Note: these results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them

**Comparison: Compression Bound**
• Compression bound for sparse greedy GPC (Bayes version, not Gibbs)
• Problem: the bound is not configurable by prior knowledge and not specific to the algorithm

**Comparison With SVM**
• Compression bound (best we could find!)
• Note: the bound values are lower than for sparse GPC only because of the sparser solution: the bound does not depend on the algorithm!

**The Bayes Classifier**
• Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers
• It uses recent Rademacher complexity bounds together with a convex duality argument
• It can be applied to GP classification as well (not yet done)

**Conclusions**
• The PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge)
• Easy extension to multi-class scenarios
• Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge)

**Conclusions (II)**
• Value in practice: the bound holds for any posterior approximation, not just the true posterior itself
• Some open problems:
• Unbounded loss functions
• Characterising the slack in the bound
• Incorporating ML-II model selection over a continuous hyperparameter space
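The theorem bounds $\mathrm{gen}(Q)$ only implicitly, through the binary relative entropy; quoting a number requires inverting it. A minimal sketch of how such bound values could be evaluated (the function names and the plugged-in numbers are illustrative, not the experimental values from the talk):

```python
import math

def binary_kl(q, p):
    """Relative entropy D_Ber[q || p] between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_bound(emp, kl_qp, n, delta):
    """Largest g with D_Ber[emp || g] <= (D[Q||P] + ln((n+1)/delta)) / n,
    found by bisection: a PAC-Bayesian bound on gen(Q)."""
    rhs = (kl_qp + math.log((n + 1) / delta)) / n
    lo, hi = emp, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if binary_kl(emp, mid) <= rhs:
            lo = mid          # mid still satisfies the inequality: move up
        else:
            hi = mid          # mid violates it: move down
    return lo

# Illustrative numbers only:
bound = pac_bayes_bound(emp=0.03, kl_qp=50.0, n=5000, delta=0.01)
print(round(bound, 4))
```

Since $D_{\mathrm{Ber}}[q \,\|\, g]$ is increasing in $g$ for $g \ge q$, the bisection converges to the unique crossing point, so the returned value is the tightest generalisation error consistent with the inequality.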
