
PAC-Bayesian Theorems for Gaussian Process Classifications


Presentation Transcript


  1. PAC-Bayesian Theorems for Gaussian Process Classifications Matthias Seeger, University of Edinburgh

  2. Overview • PAC-Bayesian theorem for Gibbs classifiers • Application to Gaussian process classification • Experiments • Conclusions

  3. What Is a PAC Bound? Sample S = {(x_i, t_i) | i = 1,…,n}, drawn i.i.d. from an unknown distribution P* • Algorithm: S ↦ predictor of t* from x*. Generalisation error: gen(S) • PAC / distribution-free bound over the i.i.d. sample S, for any P* (generic form below)
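
The displayed inequality on this slide did not survive the transcript. A generic form of such a bound, given here only as an illustration of the shape (the right-hand side ε(n, δ) stands in for whatever expression the slide showed), is:

```latex
% Generic PAC / distribution-free statement: for any data distribution P*,
% with probability at least 1 - delta over the i.i.d. sample S of size n,
\[
  \Pr_{S \sim (P^*)^n}\!\left[ \mathrm{gen}(S) \le \mathrm{emp}(S) + \varepsilon(n, \delta) \right] \ge 1 - \delta .
\]
```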

  4. Nonuniform PAC Bounds • A PAC bound has to hold independent of the correctness of prior knowledge • It does not have to be independent of prior knowledge • Unfortunately, most standard VC bounds are only vaguely dependent on the prior/model they are applied to, and therefore lack tightness

  5. Gibbs Classifiers [Figure: graphical model with weight vector w, latent outputs y1, y2, y3 and targets t1, t2, t3; a classifier mapping R^3 → {-1,+1}] • Bayes classifier (see sketch below) • Gibbs classifier: a new independent w is drawn for each prediction
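
The classifier definitions on this slide were equations in the original and are missing here; the standard definitions, which I assume are what was shown, read:

```latex
% Bayes classifier: average the decisions under the (approximate) posterior Q, then threshold.
\[
  t_{\mathrm{Bayes}}(x_*) = \operatorname{sgn} \mathbb{E}_{w \sim Q}\!\left[ \operatorname{sgn} y(x_*; w) \right]
\]
% Gibbs classifier: draw a fresh, independent w ~ Q for every single prediction.
\[
  t_{\mathrm{Gibbs}}(x_*) = \operatorname{sgn} y(x_*; w), \qquad w \sim Q \ \text{drawn anew for each } x_* .
\]
```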

  6. PAC-Bayesian Theorem Result for Gibbs classifiers: • Prior P(w), independent of S • Posterior Q(w), may depend on S • Expected generalisation error gen(Q) • Expected empirical error emp(S, Q) (definitions below)
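
The two error quantities were given as formulas that are not in the transcript; the standard PAC-Bayesian definitions, assumed here to match the slide, are:

```latex
% Expected generalisation and empirical error of the Gibbs classifier induced by Q.
\[
  \mathrm{gen}(Q) = \mathbb{E}_{w \sim Q} \Pr_{(x, t) \sim P^*}\!\left[ \operatorname{sgn} y(x; w) \ne t \right],
  \qquad
  \mathrm{emp}(S, Q) = \mathbb{E}_{w \sim Q} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ \operatorname{sgn} y(x_i; w) \ne t_i \}.
\]
```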

  7. PAC-Bayesian Theorem (II) McAllester (1999): a bound on gen(Q) in terms of emp(S, Q) and D[Q || P] • D[Q || P]: relative entropy. If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P]
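
The theorem itself was displayed as an equation. A representative statement in the binary relative entropy form used in this line of work follows; the exact constants are an assumption on my part:

```latex
% With probability at least 1 - delta over S (|S| = n), simultaneously for all posteriors Q:
\[
  D\!\left[ \mathrm{emp}(S, Q) \,\Vert\, \mathrm{gen}(Q) \right]
  \;\le\; \frac{D[Q \,\Vert\, P] + \ln\frac{n+1}{\delta}}{n},
  \qquad
  D[q \,\Vert\, p] = q \ln\frac{q}{p} + (1 - q)\ln\frac{1 - q}{1 - p}.
\]
```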

  8. The Proof Idea Step 1: An inequality for a dumb classifier. Let D(w) measure the deviation between emp(S, w) and gen(w) for a fixed w (sketched below). A large deviation bound holds for fixed w (use the asymptotic equipartition property). Since P(w) is independent of S, the bound also holds “on average” over w ~ P
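
The omitted inequality for a fixed w is presumably the usual large deviation statement; a sketch, with D(w) taken to be the binary relative entropy between empirical and true error of w:

```latex
% For fixed w, with D(w) = D[ emp(S, w) || gen(w) ], the method of types / AEP gives
\[
  \mathbb{E}_S\!\left[ e^{\, n D(w)} \right] \le n + 1,
  \qquad\text{hence}\qquad
  \Pr_S\!\left[ D(w) \ge \epsilon \right] \le (n + 1)\, e^{-n\epsilon}.
\]
% Since P(w) does not depend on S, the same holds after averaging over w ~ P:
% E_{w ~ P} E_S [ exp(n D(w)) ] <= n + 1.
```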

  9. The Proof Idea (II) Could use Jensen’s inequality (sketched below). But so what?? P is fixed a priori, giving a pretty dumb classifier! • Can we exchange P for Q? Yes! • What do we have to pay? An extra n^{-1} D[Q || P]
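
The Jensen step alluded to here is omitted; the following sketch shows what it would give for the prior-averaged ("dumb") Gibbs classifier:

```latex
% Jensen: exp is convex, so
\[
  \mathbb{E}_{w \sim P}\!\left[ e^{\, n D(w)} \right] \ge e^{\, n\, \mathbb{E}_{w \sim P}[D(w)]} .
\]
% Combined with Markov's inequality and the bound of the previous slide, with probability
% at least 1 - delta:  E_{w ~ P}[ D(w) ] <= (1/n) ln((n+1)/delta).  Correct, but useless,
% because the prior P ignores the data.
```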

  10. Convex Duality • Could finish the proof using tricks and Jensen. Let’s see what’s behind it instead! • Convex (Legendre) duality: a very simple but powerful concept: parameterise the linear lower bounds to a convex function • Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem
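
For reference, the duality the slide appeals to can be written as follows (standard Legendre-Fenchel notation, not taken from the transcript):

```latex
% Legendre-Fenchel duality for a closed convex f: every slope lambda gives a linear lower bound.
\[
  f^*(\lambda) = \sup_x \left[ \lambda x - f(x) \right],
  \qquad
  f(x) = \sup_\lambda \left[ \lambda x - f^*(\lambda) \right]
  \;\Longrightarrow\;
  f(x) \ge \lambda x - f^*(\lambda) \ \ \text{for all } \lambda .
\]
```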

  11. Convex Duality (II)

  12. Convex Duality (III)

  13. The Proof Idea (III) • Works just as well for spaces of functions and distributions. • For our purpose: the log moment-generating functional l ↦ ln E_{w~P}[e^{l(w)}] is convex and has the relative entropy D[Q || P] as its dual (see below)
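
The functional pair the slide refers to is, in all likelihood, the Donsker-Varadhan pair; the resulting variational inequality, stated here as an assumption about the omitted equation, is:

```latex
% The convex functional  l -> ln E_{w ~ P}[ exp(l(w)) ]  has the relative entropy as its dual:
\[
  \mathbb{E}_{w \sim Q}\!\left[ l(w) \right]
  \;\le\; D[Q \,\Vert\, P] + \ln \mathbb{E}_{w \sim P}\!\left[ e^{\, l(w)} \right]
  \qquad \text{for all distributions } Q \text{ and functions } l .
\]
```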

  14. The Proof Idea (IV) • This gives the bound below, for all Q and l • Set l(w) = n D(w). Then: the second term on the right has already been bounded, and on the left Jensen's inequality is applied once more (sketch below)
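
The chain of inequalities on this slide is missing from the transcript; a sketch of how the pieces presumably combine:

```latex
% Duality with l(w) = n D(w):
\[
  n\, \mathbb{E}_{w \sim Q}[D(w)] \;\le\; D[Q \,\Vert\, P] + \ln \mathbb{E}_{w \sim P}\!\left[ e^{\, n D(w)} \right].
\]
% Right-hand side: by Markov and slide 8, with probability >= 1 - delta the log term is
% at most ln((n+1)/delta).  Left-hand side: Jensen and joint convexity of D[. || .] give
\[
  n\, D\!\left[ \mathrm{emp}(S, Q) \,\Vert\, \mathrm{gen}(Q) \right] \;\le\; n\, \mathbb{E}_{w \sim Q}[D(w)],
\]
% and dividing by n yields the theorem of slide 7.
```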

  15. Comments • PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term • Choice of Q: trade-off between emp(S, Q) and the divergence D[Q || P]. The Bayesian posterior is a good candidate

  16. Gaussian Process Classification • Recall from yesterday: we approximate the true posterior process by a Gaussian one (a standard parameterisation is sketched below)
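
The parameterisation of the Gaussian approximation was shown as an equation; a standard form for GP classification, given here as an assumption, is:

```latex
% Latent values u = (y(x_1), ..., y(x_n)) under a zero-mean GP prior with kernel matrix K:
\[
  P(u) = \mathcal{N}(u \,|\, 0, K),
  \qquad
  P(u \,|\, S) \propto \mathcal{N}(u \,|\, 0, K) \prod_{i=1}^{n} P(t_i \,|\, u_i),
\]
% approximated by a Gaussian  Q(u) = N(u | m, A),  which extends to a Gaussian process
% posterior over the whole latent function.
```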

  17. The Relative Entropy • But then the relative entropy is just the divergence between two Gaussians (closed form below) • Straightforward to compute for all GPC approximations in this class
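
The closed form is the relative entropy between two Gaussians, D[N(m, A) || N(0, K)]. A small numerical sketch; the function name and the dense-Cholesky implementation are my own choices, not from the talk:

```python
import numpy as np

def gaussian_relative_entropy(m, A, K):
    """D[N(m, A) || N(0, K)] in nats:
    0.5 * ( tr(K^{-1} A) + m^T K^{-1} m - n + ln det K - ln det A ).
    Sketch assuming dense, symmetric positive definite A and K."""
    n = m.shape[0]
    L_K = np.linalg.cholesky(K)                         # K = L_K L_K^T
    L_A = np.linalg.cholesky(A)                         # A = L_A L_A^T
    alpha = np.linalg.solve(L_K, m)                     # alpha^T alpha = m^T K^{-1} m
    K_inv_A = np.linalg.solve(L_K.T, np.linalg.solve(L_K, A))
    logdet_K = 2.0 * np.sum(np.log(np.diag(L_K)))
    logdet_A = 2.0 * np.sum(np.log(np.diag(L_A)))
    return 0.5 * (np.trace(K_inv_A) + alpha @ alpha - n + logdet_K - logdet_A)
```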

  18. Concrete GPC Methods We considered so far: • Laplace GPC [Barber/Williams] • Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich] Setup: downsampled MNIST (2s vs. 3s), RBF kernels, model selection using independent holdout sets (no ML-II allowed here!)
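
To connect the theorem (slide 7) with the bound values reported on the next slides, here is a sketch of how a bound on gen(Q) could be evaluated from the empirical Gibbs error and D[Q || P]. The helper names, the bisection, and the example numbers are mine, and the constants follow the assumed form of the theorem given above:

```python
import numpy as np

def binary_relative_entropy(q, p):
    """D[q || p] between Bernoulli(q) and Bernoulli(p), in nats."""
    eps = 1e-12
    q = np.clip(q, eps, 1.0 - eps)
    p = np.clip(p, eps, 1.0 - eps)
    return q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p))

def gibbs_error_bound(emp, div_qp, n, delta=0.05):
    """Largest gen with D[emp || gen] <= (div_qp + ln((n+1)/delta)) / n, found by
    bisection on the branch gen >= emp, where D[emp || gen] increases in gen."""
    rhs = (div_qp + np.log((n + 1) / delta)) / n
    lo, hi = emp, 1.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if binary_relative_entropy(emp, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical example: 2% empirical Gibbs error, D[Q || P] = 50 nats, n = 5000 points.
print(gibbs_error_bound(emp=0.02, div_qp=50.0, n=5000))
```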

  19. Results for Laplace GPC

  20. Results Sparse Greedy GPC Extremely tight for a kernel classifier bound. Note: these results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them

  21. Comparison: Compression Bound • Compression bound for sparse greedy GPC (Bayes version, not Gibbs) • Problem: the bound is not configurable by prior knowledge and not specific to the algorithm

  22. Comparison With SVM • Compression bound (best we could find!) • Note: bound values are lower than for sparse GPC only because of the sparser solution: the bound does not depend on the algorithm!

  23. Model Selection

  24. The Bayes Classifier • Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers • Uses recent Rademacher complexity bounds together with a convex duality argument • Can be applied to GP classification as well (not yet done)

  25. Conclusions • PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge) • Easy extension to multi-class scenarios • Application to GP classification:Tighter bounds than previously available for kernel machines (to our knowledge)

  26. Conclusions (II) • Value in practice: Bound holds for any posterior approximation, not just the true posterior itself • Some open problems: • Unbounded loss functions • Characterize the slack in the bound • Incorporating ML-II model selection over continuous hyperparameter space
