
Suboptimality of Bayes and MDL in Classification


Presentation Transcript


  1. Suboptimality of Bayes and MDL in Classification Peter Grünwald CWI/EURANDOM www.grunwald.nl joint work with John Langford, TTI Chicago. Preliminary version appeared in the Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004)

  2. Our Result • We study Bayesian and Minimum Description Length (MDL) inference in classification problems • Bayes and MDL are supposed to deal with overfitting automatically • We show there exist classification domains where Bayes and MDL, when applied in a standard manner, perform suboptimally (overfit!) even as the sample size tends to infinity

  3. Why is this interesting? • Practical viewpoint: • Bayesian methods • used a lot in practice • sometimes claimed to be ‘universally optimal’ • MDL methods • even designed to deal with overfitting • Yet MDL and Bayes can ‘fail’ even with infinite data • Theoretical viewpoint • How can this result be reconciled with the various strong Bayesian consistency theorems?

  4. Menu • Classification • Abstract statement of main result • Precise statement of result • Discussion

  5. Classification • Given: • Feature space X • Label space Y • Sample S = ((x1,y1), …, (xn,yn)) • Set of hypotheses (classifiers) C, each a function c: X → Y • Goal: find a c ∈ C that makes few mistakes on future data from the same source • We say ‘c has small generalization error/classification risk’

  6. Classification Models • Types of Classifiers: • hard classifiers (-1/+1 output): decision trees, stumps, forests • soft classifiers (real-valued output): support vector machines, neural networks • probabilistic classifiers: Naïve Bayes/Bayesian network classifiers, logistic regression • (initial focus of the talk: hard classifiers)

  7. Generalization Error • As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution D on the joint (input, label)-space • Performance of a classifier c is measured in terms of its generalization error (classification risk), defined as err_D(c) := P_(X,Y)~D( c(X) ≠ Y )
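As a quick illustration of this definition, the following sketch estimates err_D(c) by Monte Carlo sampling; the distribution sample_from_D used here is purely illustrative and is not the scenario from this talk.

```python
import numpy as np

def generalization_error(classifier, sample_from_D, n_samples=100_000, seed=0):
    """Monte Carlo estimate of err_D(c) = P_{(X,Y)~D}( c(X) != Y )."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_samples):
        x, y = sample_from_D(rng)          # draw one (input, label) pair from D
        errors += (classifier(x) != y)
    return errors / n_samples

# Illustrative D (NOT the talk's scenario): X uniform on [0,1], Y = 1 iff X > 0.5,
# with the label flipped with probability 0.1.
def sample_from_D(rng):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    return x, y

c = lambda x: int(x > 0.5)                            # best classifier for this toy D
print(generalization_error(c, sample_from_D))         # approximately 0.10
```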

  8. Learning Algorithms • A learning algorithm LA, based on a set of candidate classifiers C, is a function that, for each sample S of arbitrary length, outputs a classifier LA(S) ∈ C

  9. Consistent Learning Algorithms • Suppose (X1,Y1), (X2,Y2), … are i.i.d. • A learning algorithm LA is consistent or asymptotically optimal if, no matter what the ‘true’ distribution D is, err_D(LA(S)) → inf_{c ∈ C} err_D(c) in D-probability, as n → ∞.

  10. Consistent Learning Algorithms • Suppose (X1,Y1), (X2,Y2), … are i.i.d. • A learning algorithm LA is consistent or asymptotically optimal if, no matter what the ‘true’ distribution D is, err_D(LA(S)) → inf_{c ∈ C} err_D(c) in D-probability, as n → ∞. Here LA(S) is the ‘learned’ classifier and the infimum picks out the ‘best’ classifier in C
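For concreteness, here is a minimal example of a learning algorithm in the above sense: empirical risk minimization over a finite set of candidate classifiers. It is only an illustration of the definition (the coordinate classifiers and the toy sample are made up for the example), not an algorithm analyzed in the talk.

```python
def empirical_error(c, sample):
    """Fraction of (x, y) pairs in the sample that classifier c gets wrong."""
    return sum(c(x) != y for x, y in sample) / len(sample)

def erm(classifiers, sample):
    """A learning algorithm LA: maps a sample S to the classifier in C
    with the smallest empirical error (empirical risk minimization)."""
    return min(classifiers, key=lambda c: empirical_error(c, sample))

# Usage: toy candidates c_j(x) = j-th coordinate of a binary feature vector.
classifiers = [lambda x, j=j: x[j] for j in range(5)]
sample = [((1, 0, 1, 1, 0), 1), ((0, 1, 0, 0, 1), 0), ((1, 1, 1, 0, 0), 1)]
best = erm(classifiers, sample)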

  11. Main Result • There exists • an input domain X • a prior P, non-zero on a countable set of classifiers C • a ‘true’ distribution D • a constant K > 0 such that the Bayesian learning algorithm is asymptotically K-suboptimal: with D-probability 1, err_D(Bayes(S)) ≥ inf_{c ∈ C} err_D(c) + K as n → ∞

  12. Main Result • There exists • an input domain X • a prior P, non-zero on a countable set of classifiers C • a ‘true’ distribution D • a constant K > 0 such that the Bayesian learning algorithm is asymptotically K-suboptimal: with D-probability 1, err_D(Bayes(S)) ≥ inf_{c ∈ C} err_D(c) + K as n → ∞ • The same holds for the MDL learning algorithm

  13. Remainder of Talk • How is the “Bayes learning algorithm” defined? • What is the scenario? • what do C, the ‘true’ distribution D and the prior P look like? • How dramatic is the result? • How large is K? • How strange are the choices of C, D and P? • How bad can Bayes get? • Why is the result surprising? • can it be reconciled with Bayesian consistency results?

  14. Bayesian Learning of Classifiers • Problem: Bayesian inference is defined for models that are sets of probability distributions • In our scenario, models are sets of classifiers C, i.e. functions c: X → Y • How can we find a posterior over classifiers using Bayes’ rule? • Standard answer: convert each classifier c to a corresponding conditional distribution and apply Bayes to the set of distributions thus obtained

  15. Classifiers → probability distributions • Standard conversion method from classifiers to conditional distributions: the logistic (sigmoid) transformation • For each c ∈ C and η ≥ 0, set p_{c,η}(y | x) = 1 / (1 + e^{−η·y·c(x)}) for y, c(x) ∈ {−1, +1} • Define priors P(c) on C and P(η) on η, and set P(c, η) = P(c)·P(η)
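A sketch of this conversion for hard classifiers with outputs in {−1, +1}, under the standard sigmoid parameterization (an assumption consistent with the likelihood described on the next slide):

```python
import numpy as np

def p_c_eta(y, x, c, eta):
    """Logistic conversion: P_{c,eta}(y | x) = 1 / (1 + exp(-eta * y * c(x)))
    for y, c(x) in {-1, +1}.  If y == c(x) the probability is 1/(1+e^-eta),
    otherwise 1/(1+e^eta); eta = 0 gives the uniform distribution on labels."""
    return 1.0 / (1.0 + np.exp(-eta * y * c(x)))

def log_likelihood(sample, c, eta):
    """Log-likelihood of the observed labels under p_{c,eta}.
    Equals -eta * (#mistakes) - n * log(1 + e^-eta)."""
    return sum(np.log(p_c_eta(y, x, c, eta)) for x, y in sample)
```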

  16. Logistic transformation - intuition • Consider ‘hard’ classifiers c: X → {−1, +1} • For each c ∈ C, p_{c,η}(y^n | x^n) = e^{−η·k} / (1 + e^{−η})^n • Here êrr(c) = k/n is the empirical error that c makes on the data, and k is the number of mistakes c makes on the data

  17. Logistic transformation - intuition • For fixed η: • the log-likelihood is a (decreasing) linear function of the number of mistakes c makes on the data • the (log-)likelihood is maximized by the c that is optimal for the observed data • For fixed c: • maximizing the likelihood over η also makes sense
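A small numerical check of the last point, under the same assumed parameterization: for fixed c with k mistakes out of n, the likelihood e^{−ηk}/(1 + e^{−η})^n is maximized at η = ln((n − k)/k), i.e. at the η whose implied noise rate matches the empirical error of c.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 100, 23                         # n data points, k mistakes made by c

def neg_log_lik(eta):
    # minus the log of  e^{-eta*k} / (1 + e^{-eta})^n
    return eta * k + n * np.log1p(np.exp(-eta))

res = minimize_scalar(neg_log_lik, bounds=(0.0, 20.0), method="bounded")
print(res.x, np.log((n - k) / k))      # both approximately 1.208
```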

  18. Logistic transformation - intuition • In Bayesian practice, the logistic transformation is a standard tool, nowadays performed without giving any motivation or explanation • We did not find it in Bayesian textbooks… • …but tested it with three well-known Bayesians! • It is analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise; for hard classifiers it expresses Y = c(X)·Z, where Z is an independent noise bit

  19. Main Result (Grünwald & Langford, COLT 2004) • There exists • an input domain X • a prior P on a countable set of classifiers C • a ‘true’ distribution D • a constant K > 0 such that the Bayesian learning algorithm is asymptotically K-suboptimal: with D-probability 1, err_D(Bayes(S)) ≥ inf_{c ∈ C} err_D(c) + K as n → ∞ • holds both for full Bayes and for Bayes (S)MAP

  20. Definition of the Bayes learning algorithm • Posterior: P(c, η | S) ∝ p_{c,η}(y^n | x^n) · P(c) · P(η) • Predictive distribution: P(y | x, S) = Σ_c ∫ p_{c,η}(y | x) P(c, η | S) dη • “Full Bayes” learning algorithm: output the classifier that predicts the label with the highest predictive probability, Bayes(S)(x) := arg max_y P(y | x, S)
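A minimal computational sketch of these three definitions, assuming the sigmoid conversion above, a finite set of candidate classifiers, labels in {−1, +1}, and a discrete grid of η values standing in for a continuous prior on η (all simplifications relative to the talk's setting):

```python
import numpy as np

def bayes_classifier(classifiers, prior_c, etas, prior_eta, sample):
    """'Full Bayes' prediction: compute the posterior P(c, eta | S) over a finite
    classifier set and an eta grid, then predict the label with the highest
    posterior predictive probability P(y | x, S).  Labels are in {-1, +1}."""
    n = len(sample)
    post = np.zeros((len(classifiers), len(etas)))
    for i, c in enumerate(classifiers):
        k = sum(c(x) != y for x, y in sample)               # mistakes of c on S
        for j, eta in enumerate(etas):
            loglik = -eta * k - n * np.log1p(np.exp(-eta))  # log p_{c,eta}(y^n | x^n)
            post[i, j] = prior_c[i] * prior_eta[j] * np.exp(loglik)
    post /= post.sum()                                      # posterior P(c, eta | S)

    def predict(x):
        # Predictive: P(y = +1 | x, S) = sum_{c,eta} P_{c,eta}(+1 | x) P(c, eta | S)
        p_plus = sum(post[i, j] / (1.0 + np.exp(-eta * classifiers[i](x)))
                     for i, _ in enumerate(classifiers)
                     for j, eta in enumerate(etas))
        return +1 if p_plus >= 0.5 else -1

    return predict
```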

  21. Issues/Remainder of Talk • How is the “Bayes learning algorithm” defined? • What is the scenario? • what do C, the ‘true’ distribution D and the prior P look like? • How dramatic is the result? • How large is K? • How strange are the choices of C, D and P? • How bad can Bayes get? • Why is the result surprising? • can it be reconciled with Bayesian consistency results?

  22. Scenario • Definition of Y, X and C • Definition of the prior: • for some small constant, for all large n, … • the prior on η can be any strictly positive smooth prior (or a discrete prior with sufficient precision)

  23. Scenario – II: Definition of the true D • Toss a fair coin to determine the value of Y • Toss a coin Z with bias … • If Z indicates an easy example, then for all j, set X_j = Y • If Z indicates a hard example, then set X_1 … and, for all other j, independently set X_j …

  24. Result: • All features are informative about Y, but X_1 is more informative than all the others, so c_1 is the best classifier: err_D(c_1) < err_D(c_j) for all j ≥ 2 • Nevertheless, with ‘true’ D-probability 1, the Bayes classifier remains K-suboptimal as n → ∞ (but note: for each fixed j, …)

  25. Issues/Remainder of Talk • How is the “Bayes learning algorithm” defined? • What is the scenario? • what do C, the ‘true’ distribution D and the prior P look like? • How dramatic is the result? • How large is K? • How strange are the choices of C, D and P? • How bad can Bayes get? • Why is the result surprising? • can it be reconciled with Bayesian consistency results?

  26. Theorem 1 (Grünwald & Langford, COLT 2004) • There exists • an input domain X • a prior P on a countable set of classifiers C • a ‘true’ distribution D • a constant K > 0 such that the Bayesian learning algorithm is asymptotically K-suboptimal: with D-probability 1, err_D(Bayes(S)) ≥ inf_{c ∈ C} err_D(c) + K as n → ∞ • holds both for full Bayes and for Bayes MAP

  27. Theorem 1, extended • [Figure: X-axis: …; curves show the maximum suboptimality for Bayes MAP/MDL and the maximum for full Bayes (binary entropy); the maximum difference is achieved at …]

  28. How ‘natural’ is the scenario? • The basic scenario is quite unnatural • We chose it because we could prove something about it! But: • The priors are natural (take e.g. Rissanen’s universal prior) • Clarke (2002) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context • Bayesian inference is consistent under very weak conditions. So even if the scenario is unnatural, the result is still interesting!

  29. Issues/Remainder of Talk • How is the “Bayes learning algorithm” defined? • What is the scenario? • what do C, the ‘true’ distribution D and the prior P look like? • How dramatic is the result? • How large is K? • How strange are the choices of C, D and P? • How bad can Bayes get? • Why is the result surprising? • can it be reconciled with Bayesian consistency results?

  30. Bayesian Consistency Results • Doob (1949, special case): Suppose • the model is a countable set of distributions • it contains the ‘true’ conditional distribution • Then, with D-probability 1, the posterior converges to the true conditional distribution, weakly/in Hellinger distance

  31. Bayesian Consistency Results • If the posterior converged to the true conditional distribution… then we must also have err_D(Bayes(S)) → inf_{c ∈ C} err_D(c) • Our result says that this does not happen in our scenario. Hence the (countable!) model we constructed must be misspecified: • the model is homoskedastic, the ‘true’ D is heteroskedastic!

  32. Bayesian consistency under misspecification • Suppose we use Bayesian inference based on the ‘model’ { p_{c,η} : c ∈ C, η ≥ 0 } • If the true D is not in the model, then under ‘mild’ generality conditions, Bayes still converges to the distribution in the model that is closest to D in KL-divergence (relative entropy) • The logistic transformation ensures that the minimum KL-divergence is achieved for a c that also achieves the minimum generalization error err_D(c)
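A short derivation behind the last point, written out here under the logistic parameterization assumed earlier (a sketch, not the talk's own formula): for a fixed classifier c with err_D(c) < 1/2, the smallest expected log loss attainable by varying η is the binary entropy of err_D(c),

```latex
\min_{\eta \ge 0} \mathbb{E}_D\!\left[-\log p_{c,\eta}(Y \mid X)\right]
  \;=\; \min_{\eta \ge 0}\Bigl(\eta\,\mathrm{err}_D(c) + \log\bigl(1+e^{-\eta}\bigr)\Bigr)
  \;=\; H\!\bigl(\mathrm{err}_D(c)\bigr),
\qquad H(p) := -p\log p - (1-p)\log(1-p).
```

Since H is increasing on [0, 1/2) and the KL-divergence from D to p_{c,η} differs from this expected log loss only by a term not depending on (c, η), minimizing KL-divergence over the model selects the same c as minimizing the generalization error err_D(c).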

  33. Bayesian consistency under misspecification • In our case, the Bayesian posterior does not converge to the distribution with the smallest classification generalization error, so it also does not converge to the distribution closest to the true D in KL-divergence • Apparently, the ‘mild’ generality conditions for ‘Bayesian consistency under misspecification’ are violated • Conditions for consistency under misspecification are much stronger than conditions for standard consistency! • the model must either be convex or “simple” (e.g. parametric)

  34. Is consistency achievable at all? • Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent: • Vapnik’s methods (based on VC-dimension etc.) • McAllester’s PAC-Bayes methods • These methods invariably punish ‘complex’ (low-prior) classifiers much more than ordinary Bayes – in the simplest version of PAC-Bayes, the complexity penalty is of the form shown in the bound below
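For flavour, here is a standard Occam-style generalization bound of the kind underlying these methods (a common textbook form, not necessarily the exact bound meant on the slide): with probability at least 1 − δ over a sample S of size n, simultaneously for all c in a countable C with prior P,

```latex
\mathrm{err}_D(c) \;\le\; \widehat{\mathrm{err}}_S(c)
  \;+\; \sqrt{\frac{\ln\frac{1}{P(c)} + \ln\frac{1}{\delta}}{2n}} .
```

The complexity penalty thus scales as the square root of ln(1/P(c))/n, which for large n is much larger than the roughly ln(1/P(c))/n penalty that effectively enters the Bayesian posterior, in line with the slide's remark.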

  35. Consistency and Data Compression - I • Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm • MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors • …however…

  36. Consistency and Data Compression - II • There already exist (in)famous inconsistency results for Bayesian inference, due to Diaconis and Freedman • For some highly non-parametric models, even if the “true” D is in the model, Bayes may not converge to it • These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data • With MDL priors, if the true D is in the model, then consistency is guaranteed under no further conditions at all (Barron ’98)

  37. Issues/Remainder of Talk • How is the “Bayes learning algorithm” defined? • What is the scenario? • what do C, the ‘true’ distribution D and the prior P look like? • How dramatic is the result? • How large is K? • How strange are the choices of C, D and P? • How bad can Bayes get? (& “what happens”) • Why is the result surprising? • can it be reconciled with Bayesian consistency results?

  38. Thm 2: the full Bayes result is ‘tight’ • [Figure: X-axis: …; curves show the maximum suboptimality for Bayes MAP/MDL and the maximum for full Bayes (binary entropy); the maximum difference is achieved at …]

  39. Theorem 2

  40. Proof Sketch • Log loss of Bayes upper bounds the 0/1-loss: • For every sequence … • Log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a log-term

  41. Proof Sketch

  42. Proof Sketch • Log loss of Bayes upper bounds the 0/1-loss: • For every sequence … • Log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a log-term

  43. Proof Sketch • Log loss of Bayes upper bounds the 0/1-loss: • For every sequence … • Log loss of Bayes is upper bounded by the log loss of the 0/1-optimal classifier plus a log-term (law of large numbers/Hoeffding)
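The second bullet is an instance of a standard property of Bayesian sequential prediction (stated here, as a sketch, for a discrete prior over (c, η)): the accumulated log loss of the Bayesian predictions exceeds that of any fixed p_{c,η} by at most the log of its prior mass,

```latex
\sum_{i=1}^{n} -\log P\bigl(y_i \mid x_i, S^{i-1}\bigr)
  \;=\; -\log \sum_{c,\eta} P(c,\eta)\, p_{c,\eta}\bigl(y^n \mid x^n\bigr)
  \;\le\; \min_{c,\eta}\Bigl(-\log p_{c,\eta}\bigl(y^n \mid x^n\bigr) \;+\; \log\tfrac{1}{P(c,\eta)}\Bigr).
```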

  44. Wait a minute… • Accumulated log loss of sequential Bayesian predictions is always within a constant (the log of the prior mass) of the accumulated log loss of the optimal p_{c,η} • So Bayes is ‘good’ with respect to log loss/KL-divergence • But Bayes is ‘bad’ with respect to 0/1-loss • How is this possible? • The Bayesian posterior effectively becomes a mixture of ‘bad’ distributions (a different mixture at different sample sizes m) • The mixture is closer to the true distribution D than the best single distribution in KL-divergence/log loss prediction • But it performs worse than the best classifier in terms of 0/1 error

  45. Bayes predicts too well • Let M be a set of distributions, and let the Bayesian predictive distribution be defined with respect to a prior that makes it a universal data-compressor with respect to M • One can show that the only true distributions D for which Bayes can ever become inconsistent in the KL-divergence sense… • …are those under which the posterior predictive distribution becomes closer in KL-divergence to D than the best single distribution in M

  46. Conclusion • Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification • A Bayesian may argue that the Bayesian machinery was never intended for misspecified models • Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time • In this case, Bayes may overfit even in the limit of an infinite amount of data

  47. Thank you for your attention!
