Generalization to Unseen Cases: (No) Free Lunches and Good-Turing estimation. Teemu Roos, Helsinki Institute for Information Technology. Joint work with Peter Grunwald (CWI), Petri Myllymaki (HIIT) and H. Tirri (HIIT). (NIPS 2005)
Off-Training Set Error • Classification setting: • Feature space • Label space • Sample drawn i.i.d. • A classifier is a function from the feature space to the label space • Ordinary generalization error of a classifier h • Off-training set error (Wolpert ’92, ’94; Schaffer ’93)
Off-Training Set (OTS) Error • The probability of making a mistake on a test instance, conditioned on not having seen it before. • An intuitive way of measuring generalization error: • Textbooks: inductive generalization is about performance on unseen cases. • Rules out memorization as a form of learning. • Can differ significantly from the standard generalization error in practice, even with continuous-valued features.
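The formulas on these slides did not survive extraction. The block below is a sketch of the two error measures as the bullets describe them. The symbols (P for the data distribution, D for the training sample, X_D for the set of x-values in D, and the names of the two errors) are assumed notation, not taken from the slides.

```latex
\begin{align*}
  D &= \big((x_1, y_1), \dots, (x_n, y_n)\big) \ \text{drawn i.i.d. from } P,
      \qquad h : \mathcal{X} \to \mathcal{Y}, \\
  \mathcal{E}_{\mathrm{iid}}(h) &= \Pr\big[\, h(X) \neq Y \,\big], \\
  \mathcal{E}_{\mathrm{ots}}(h, D) &= \Pr\big[\, h(X) \neq Y \mid X \notin X_D \,\big],
      \qquad X_D = \{x_1, \dots, x_n\}.
\end{align*}
```

The OTS error is only defined when the event X ∉ X_D has positive probability, that is, when the training sample does not cover the whole distribution.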
Generalization Bounds? • No generalization bounds are known for OTS error. • Wolpert (2001) suggests we can’t get them: “The primary reason that the conventional frameworks allow overlap between training set and test set is that much of their research has been driven by the mathematical tools their practitioners are well-versed in rather than by consideration of what the important issues in supervised learning are. Unfortunately, those tools are ill-suited for investigating off-training set behaviour.” • We take up Wolpert’s challenge and give a data-dependent bound on the difference between OTS and IID error that is relevant for practical data sets.
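Intuition • Lemma: uniformly for all functions, the difference between the OTS error and the standard generalization error is at most the sample coverage, where sample coverage = 1 – missing mass. • If the missing mass is close to 1, then uniformly for all hypotheses, the OTS error must be close to the standard generalization error.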
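The lemma can be checked numerically on a toy problem. The sketch below reads the lemma as |OTS error − standard error| ≤ sample coverage, which is an interpretation of the slide, since the formula itself is missing. The discrete feature space, the binary labels, and all function and parameter names are illustrative assumptions.

```python
import random


def iid_and_ots_error(p_x, p_y1, h, seen):
    """Exact errors for a discrete feature X and a binary label Y.

    p_x[x]  : probability of feature value x
    p_y1[x] : probability that Y = 1 given X = x
    h[x]    : predicted label (0 or 1) for x
    seen    : set of x-values that occur in the training sample
    """
    values = range(len(p_x))
    # Probability of a mistake at each x under the true conditional P(Y | x).
    err_at = [p_y1[x] if h[x] == 0 else 1.0 - p_y1[x] for x in values]
    e_iid = sum(p_x[x] * err_at[x] for x in values)
    missing_mass = sum(p_x[x] for x in values if x not in seen)
    # OTS error: error conditioned on X falling outside the training sample.
    e_ots = sum(p_x[x] * err_at[x] for x in values if x not in seen) / missing_mass
    return e_iid, e_ots, 1.0 - missing_mass


def check_lemma(num_values=50, n=20, trials=200, seed=0):
    """Return the smallest observed value of coverage - |E_ots - E_iid|."""
    rng = random.Random(seed)
    worst_slack = float("inf")
    for _ in range(trials):
        weights = [rng.random() for _ in range(num_values)]
        total = sum(weights)
        p_x = [w / total for w in weights]
        p_y1 = [rng.random() for _ in range(num_values)]
        h = [rng.randint(0, 1) for _ in range(num_values)]  # arbitrary classifier
        # n < num_values, so some feature values are always unseen.
        seen = set(rng.choices(range(num_values), weights=p_x, k=n))
        e_iid, e_ots, coverage = iid_and_ots_error(p_x, p_y1, h, seen)
        worst_slack = min(worst_slack, coverage - abs(e_ots - e_iid))
    return worst_slack
```

A non-negative return value from `check_lemma()` (up to floating-point rounding) means the inequality held in every trial, whatever the classifier.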
Good, Turing and OTS • If the missing mass is close to 1, then uniformly for all hypotheses, the OTS error must be close to the standard generalization error. • Then any generalization bound for any hypothesis space carries over to an almost equally good bound for OTS error. • We bound the missing mass based on the training set… • Alan Turing and I.J. Good developed estimators for the missing mass as early as WWII (when cracking the Enigma code). • McAllester and Schapire (COLT 2000) analysed the rate of convergence of Good-Turing estimators, which yields probabilistic bounds on the missing mass. • Our bounds are stronger than the existing ones for small samples with relatively few repetitions (typical in machine learning).
Intuition • Consider the event that you observe several repeated X-values (x_i = x_j for i ≠ j) while at the same time the sample covers only a small part of the total probability mass. • Lemma: No matter what the true distribution is, the probability of this event is small. • Therefore: If we see few or no repetitions in the training set, this indicates that the missing mass is large.
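The link between repetitions and missing mass can be illustrated with the classical Good-Turing estimate, which takes the fraction of sample points whose value occurs exactly once. This is the textbook estimator, not the bound derived in the talk, and the function name is an assumption made for this sketch.

```python
from collections import Counter


def good_turing_missing_mass(xs):
    """Classical Good-Turing estimate of the missing mass.

    xs is a non-empty sequence of hashable feature values. The estimate is
    the number of values seen exactly once, divided by the sample size.
    """
    counts = Counter(xs)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(xs)
```

For a sample with no repetitions, such as `["a", "b", "c", "d"]`, the estimate is 1.0; when every value repeats, as in `["a", "a", "b", "b"]`, it drops to 0.0.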
Main Theorem • Theorem (data-dependent OTS generalization bound)
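Intuition • Lemma: uniformly for all functions, the difference between the OTS error and the standard generalization error is at most the sample coverage, where sample coverage = 1 – missing mass. • If the missing mass is close to 1, then uniformly for all hypotheses, the OTS error must be close to the standard generalization error.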
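A short derivation shows why a bound of this kind holds. The notation is assumed rather than taken from the slides: c is the sample coverage, E_iid the standard generalization error, E_ots the OTS error, and E_in the error conditioned on the test instance having appeared in the training set.

```latex
\begin{align*}
  \mathcal{E}_{\mathrm{iid}}(h)
    &= c \, \mathcal{E}_{\mathrm{in}}(h, D) + (1 - c) \, \mathcal{E}_{\mathrm{ots}}(h, D), \\
  \mathcal{E}_{\mathrm{iid}}(h) - \mathcal{E}_{\mathrm{ots}}(h, D)
    &= c \, \big( \mathcal{E}_{\mathrm{in}}(h, D) - \mathcal{E}_{\mathrm{ots}}(h, D) \big), \\
  \big| \mathcal{E}_{\mathrm{iid}}(h) - \mathcal{E}_{\mathrm{ots}}(h, D) \big|
    &\le c .
\end{align*}
```

The last step uses only that both conditional errors lie in [0, 1], so the argument does not depend on the hypothesis h or on the distribution.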
Relation to Wolpert’s No Free Lunch Theorems • Wolpert: • Proper study of supervised learning should be based on OTS error. • NFL Theorem (correct): With OTS error and a uniform prior over all hypotheses, no learning algorithm performs better on average than random guessing. • Consequence (false): Error maximization is ‘as good as’ error minimization; anti-cross-validation is as good as cross-validation. • Influential! (e.g. the textbook by Duda, Hart and Stork, 2000)
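The NFL statement can be reproduced by brute force on a toy domain. The sketch below assumes noise-free binary labels, a uniform prior over all target functions on a five-point domain, and a uniform test distribution over the unseen points; the two learners and all names are illustrative, not from the talk.

```python
from itertools import product


def average_ots_error(learner, domain_size=5, train_xs=(0, 1, 2)):
    """Average OTS error of `learner` over all binary target functions."""
    test_xs = [x for x in range(domain_size) if x not in train_xs]
    total = 0.0
    count = 0
    # Enumerate every target function f: {0, ..., domain_size - 1} -> {0, 1}.
    for f in product((0, 1), repeat=domain_size):
        train = [(x, f[x]) for x in train_xs]
        mistakes = sum(learner(train, x) != f[x] for x in test_xs)
        total += mistakes / len(test_xs)
        count += 1
    return total / count


def majority(train, x):
    """Predict the most frequent training label (ties go to 0)."""
    ones = sum(y for _, y in train)
    return 1 if 2 * ones > len(train) else 0


def anti_majority(train, x):
    """Predict the opposite of the majority label."""
    return 1 - majority(train, x)
```

Both `average_ots_error(majority)` and `average_ots_error(anti_majority)` come out at 0.5. The equality is produced by the uniform prior, under which the labels of unseen points are independent of the training labels; it says nothing about distributions where they are not.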
A Free Lunch Theorem • COLT defense: standard distribution-free bounds show that empirical error minimization is better than error maximization. • The ‘uniform prior’ must be leading to very misleading conclusions; inherent differences between learning algorithms do exist. • NFL reply: you use standard generalization error, and that’s no good. • We show that learning bounds transfer to OTS error, which makes the COLT defense a valid one!