
  1. T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory. Michael Pfeiffer, pfeiffer@igi.tugraz.at, 25.11.2004

  2. Motivation • Supervised Learning • learn functional relationships from a finite set of labelled training examples • Generalization • How well does the learned function perform on unseen test examples? • Central question in supervised learning

  3. What you will hear • New Idea: Stability implies predictivity • a learning algorithm is stable if small perturbations of the training set do not change the hypothesis much • Conditions for generalization are placed on the learning map rather than on the hypothesis space • in contrast to VC-analysis

  4. Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion

  5. Some Definitions 1/2 • Training Data: S = {z1 = (x1, y1), ..., zn = (xn, yn)} • Z = X × Y • examples drawn from an unknown distribution µ(x, y) on Z • Hypothesis Space: H • Hypothesis fS ∈ H, fS: X → Y • Learning Algorithm: L: S ↦ fS • Regression: fS is real-valued / Classification: fS is binary • symmetric learning algorithm (ordering of the training examples is irrelevant)

  6. Some Definitions 2/2 • Loss Function: V(f, z) • e.g. V(f, z) = (f(x) – y)^2 • Assume that V is bounded • Empirical Error (Training Error) IS[f] • Expected Error (True Error) I[f] • see the definitions below
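The formulas for these two errors were on the original slides but did not survive the transcript; a reconstruction of the standard definitions, in the notation of the paper (I_S for the empirical and I for the expected error), is:

```latex
% Empirical (training) error of a hypothesis f on the sample S = {z_1, ..., z_n}
I_S[f] \;=\; \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)

% Expected (true) error of f under the unknown distribution \mu on Z
I[f] \;=\; \mathbb{E}_{z \sim \mu} \big[ V(f, z) \big]
```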

  7. Generalization and Consistency • Convergence in Probability • Generalization • Performance on the training examples must be a good indicator of performance on future examples • Consistency • Expected error converges to the best achievable error in H
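Written out (a reconstruction of the standard statements, using the error notation above), the two properties read:

```latex
% Generalization: the training error of the learned hypothesis f_S tracks its true error
\forall \varepsilon > 0: \quad
\lim_{n \to \infty} \mathbb{P}_S \big\{ \, \lvert I_S[f_S] - I[f_S] \rvert > \varepsilon \, \big\} \;=\; 0

% Consistency (w.r.t. H): the true error of f_S approaches the best achievable error in H
\forall \varepsilon > 0: \quad
\lim_{n \to \infty} \mathbb{P}_S \big\{ \, I[f_S] > \inf_{f \in H} I[f] + \varepsilon \, \big\} \;=\; 0
```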

  8. Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion

  9. Empirical Risk Minimization (ERM) • Focus of classical learning theory research • exact and almost ERM • Minimize training error over H: fS = arg min f∈H IS[f] • take the best hypothesis on the training data (a minimal sketch follows below) • For ERM: Generalization ⇔ Consistency
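A minimal sketch of exact ERM in Python, over a toy finite hypothesis class of 1-D threshold classifiers (the class, data, and names are illustrative choices of mine, not from the paper): the algorithm simply returns the hypothesis with the smallest training error.

```python
# Exact empirical risk minimization over a finite hypothesis class H of
# threshold classifiers f_t(x) = 1 if x >= t else 0, with 0-1 loss.
import numpy as np

def erm(xs, ys, thresholds):
    """Return the threshold t minimizing the empirical error I_S[f_t]."""
    best_t, best_err = None, np.inf
    for t in thresholds:
        preds = (xs >= t).astype(int)
        emp_err = np.mean(preds != ys)    # I_S[f_t] = (1/n) * sum_i V(f_t, z_i)
        if emp_err < best_err:
            best_t, best_err = t, emp_err
    return best_t, best_err

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=200)
ys = (xs >= 0.6).astype(int)              # noiseless target: threshold at 0.6
flip = rng.uniform(size=200) < 0.1        # flip 10% of the labels as noise
ys = np.where(flip, 1 - ys, ys)
t_hat, err = erm(xs, ys, thresholds=np.linspace(0, 1, 101))
print(f"ERM threshold: {t_hat:.2f}, training error: {err:.3f}")
```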

  10. What algorithms are ERM? • All these belong to class of ERM algorithms • Least Squares Regression • Decision Trees • ANN Backpropagation (?) • ... • Are all learning algorithms ERM? • NO! • Support Vector Machines • k-Nearest Neighbour • Bagging, Boosting • Regularization • ...

  11. Vapnik asked What property must the hypothesis space H have to ensure good generalization of ERM?

  12. Classical Results for ERM [1] • Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class: • convergence of empirical means to true expected values • uniform convergence in probability of the loss functions induced by H and V [1] e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44(4), 1997
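The uGC condition, written out (a reconstruction of the standard definition consistent with the cited Alon et al. paper, using the error notation from above):

```latex
% H (with the bounded loss V) is a uniform Glivenko-Cantelli class if, for every
% eps > 0, the empirical errors converge to the expected errors uniformly over H
% and uniformly over all distributions mu on Z:
\forall \varepsilon > 0: \quad
\lim_{n \to \infty} \; \sup_{\mu} \;
\mathbb{P}_S \Big\{ \sup_{f \in H} \big\lvert I_S[f] - I[f] \big\rvert > \varepsilon \Big\} \;=\; 0
```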

  13. VC-Dimension • Binary functions f: X → {0, 1} • VC-dim(H) = size of the largest finite set in X that can be shattered by H • e.g. linear separation in 2D yields VC-dim = 3 (see the sketch below) • Theorem: Let H be a class of binary valued hypotheses, then H is a uGC-class if and only if VC-dim(H) is finite [1]. [1] Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44(4), 1997
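A small illustration of the 2-D example (my own sketch, not part of the talk): a brute-force shattering check for linear separators, using an LP feasibility test via scipy. It shows that three points in general position can be shattered while the four-point XOR configuration cannot; concluding VC-dim = 3 additionally uses the known fact that no set of four points in the plane is shattered by halfplanes.

```python
# Brute-force shattering check for linear separators in 2-D: a labeling is
# realizable iff some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i,
# which is a linear-programming feasibility problem.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some (w, b) separates the labeled points with margin >= 1."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Variables v = [w1, w2, b]; constraint rows: -y_i * [x_i1, x_i2, 1] . v <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0   # 0 = feasible solution found, 2 = infeasible

def shattered(points):
    """True if every +/-1 labeling of the point set is linearly separable."""
    return all(linearly_separable(points, labs)
               for labs in product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
print(shattered([(0, 0), (1, 1), (0, 1), (1, 0)]))  # False: the XOR configuration
```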

  14. Achievements of Classical Learning Theory • Complete characterization of necessary and sufficient conditions for generalization and consistency of ERM • Remaining questions: • What about non-ERM algorithms? • Can we establish criteria not only for the hypothesis space?

  15. Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion

  16. Poggio et al. asked What property must the learning map L have for good generalization of general algorithms? Can a new theory subsume the classical results for ERM?

  17. Stability • Small perturbations of the training set should not change the hypothesis much • especially deleting one training example • Si = S \ {zi} • How can this be mathematically defined? • [Diagram: the learning map takes the original training set S and the perturbed training set Si to hypotheses in the hypothesis space]

  18. Uniform Stability [1] • A learning algorithm L is uniformly stable if (see the condition below) • after deleting one training sample the change of the loss must be small at all points z ∈ Z • Uniform stability implies generalization • Requirement is too strong • Most algorithms (e.g. ERM) are not uniformly stable [1] Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2001
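The defining inequality was lost from the slide; roughly, the condition from Bousquet and Elisseeff, written in the notation used here (and up to the exact constants in their paper), is:

```latex
% Uniform (beta) stability: for every training set S of size n and every index i,
% removing z_i changes the loss by at most beta_n at EVERY point z in Z:
\sup_{z \in Z} \big\lvert V(f_S, z) - V(f_{S^i}, z) \big\rvert \;\le\; \beta_n ,
\qquad \beta_n \to 0 \ \text{as } n \to \infty
```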

  19. CVloo stability [1] • Cross-Validation leave-one-out stability • considers only the error at the removed training point • strictly weaker than uniform stability • [Diagram: remove zi, look at the error at xi] [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
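Stated informally (a paraphrase of the definition in the Mukherjee et al. memo, with the exact quantifiers left to the source), CVloo stability only asks that the loss at the removed point changes little, in probability over the draw of S:

```latex
% CVloo stability: for all i = 1, ..., n,
\mathbb{P}_S \Big\{ \big\lvert V(f_{S^i}, z_i) - V(f_S, z_i) \big\rvert \le \beta_{CV}^{(n)} \Big\}
\;\ge\; 1 - \delta_{CV}^{(n)} ,
\qquad \text{with } \beta_{CV}^{(n)} \to 0 \ \text{and } \delta_{CV}^{(n)} \to 0
```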

  20. Equivalence for ERM [1] • Theorem: For good loss functions the following statements are equivalent for ERM: • L is distribution-independent CVloo stable • ERM generalizes and is universally consistent • H is a uGC class • Question: Does CVloo stability ensure generalization for all learning algorithms? [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

  21. CVloo Counterexample [1] • X is uniform on [0, 1] • Y ∈ {-1, +1} • Target f*(x) = 1 • Learning algorithm L: • No change at the removed training point ⇒ CVloo stable • Algorithm does not generalize at all! [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

  22. Additional Stability Criteria • Error (Eloo) stability • Empirical Error (EEloo) stability • Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM) • Not sufficient for generalization
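The two conditions, roughly (my paraphrase of the definitions in the Mukherjee et al. memo; the precise probabilistic quantifiers are in the source): the expected error and the empirical error, respectively, must change little when one training point is removed.

```latex
% Error (Eloo) stability: with high probability over S, for all i,
\big\lvert I[f_S] - I[f_{S^i}] \big\rvert \;\le\; \beta_{E}^{(n)} , \qquad \beta_{E}^{(n)} \to 0

% Empirical error (EEloo) stability: with high probability over S, for all i,
\big\lvert I_S[f_S] - I_{S^i}[f_{S^i}] \big\rvert \;\le\; \beta_{EE}^{(n)} , \qquad \beta_{EE}^{(n)} \to 0
```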

  23. CVEEEloo Stability • Learning Map L is CVEEEloo stable if it is • CVloo stable • and Eloo stable • and EEloo stable • Question: • Does this imply generalization for all L?

  24. CVEEEloo implies Generalization [1] • Theorem: If L is CVEEEloo stable and the loss function is bounded, then fS generalizes • Remarks: • Neither condition (CV, E, EE) by itself is sufficient • Eloo and EEloo stability together are not sufficient • For ERM, CVloo stability alone is necessary and sufficient for generalization and consistency [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

  25. Consistency • CVEEEloo stability in general does NOT guarantee consistency • Good generalization does NOT necessarily mean good prediction • but poor expected performance is indicated by poor training performance

  26. CVEEEloo stable algorithms • Support Vector Machines and Regularization • k-Nearest Neighbour (k increasing with n) • Bagging (number of regressors increasing with n) • More results to come (e.g. AdaBoost) • For some of these algorithms a 'VC-style' analysis is impossible (e.g. k-NN) • For all these algorithms generalization is guaranteed by the theorems above! (see the stability probe sketched below)
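A hedged illustration (my own sketch, not from the talk or the paper): an empirical probe of CVloo stability for k-NN regression with k growing like sqrt(n). It compares the squared loss at a training point before and after removing that point; the average gap shrinks as n grows, which is the behaviour the stability results describe.

```python
# Empirical probe of CVloo stability for plain k-nearest-neighbour regression.
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """k-NN regression in 1-D: average the labels of the k nearest training points."""
    idx = np.argsort(np.abs(train_x - query_x))[:k]
    return train_y[idx].mean()

def cvloo_gap(train_x, train_y, k, i):
    """|V(f_{S^i}, z_i) - V(f_S, z_i)| for the squared loss at the removed point z_i."""
    full = knn_predict(train_x, train_y, train_x[i], k)
    loo = knn_predict(np.delete(train_x, i), np.delete(train_y, i), train_x[i], k)
    return abs((full - train_y[i]) ** 2 - (loo - train_y[i]) ** 2)

rng = np.random.default_rng(1)
for n in (100, 1000, 10000):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
    k = max(1, int(np.sqrt(n)))                     # k increasing with n, as on the slide
    sample = rng.choice(n, size=50, replace=False)  # probe 50 random training points
    gaps = [cvloo_gap(x, y, k, i) for i in sample]
    print(f"n={n:6d}  k={k:3d}  mean CVloo gap = {np.mean(gaps):.4f}")
```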

  27. Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion

  28. Implications • Classical "VC-style" conditions • Occam's Razor: prefer simple hypotheses • CVloo stability • incremental change • online algorithms • Inverse Problems: stability ⇔ well-posedness • condition numbers characterize stability • Stability-based learning may have more direct connections with the brain's learning mechanisms • condition on the learning machinery

  29. Language Learning • Goal: learn grammars from sentences • Hypothesis Space: class of all learnable grammars • What is easier to characterize and gives more insight into real language learning? • the language learning algorithm • or the class of all learnable grammars? • A focus on algorithms shifts the focus to stability

  30. Conclusion • Stability implies generalization • intuitive (CVloo) and technical (Eloo, EEloo) criteria • Theory subsumes classical ERM results • Generalization criteria also for non-ERM algorithms • Restrictions on learning map rather than hypothesis space • New approach for designing learning algorithms

  31. Open Questions • Easier / other necessary and sufficient conditions for generalization • Conditions for general consistency • Tight bounds for sample complexity • Applications of the theory for new algorithms • Stability proofs for existing algorithms

  32. Thank you!

  33. Sources • T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature Vol. 428, pp. 419-422, 2004 • S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003 • T. Mitchell: Machine Learning, McGraw-Hill, 1997 • C. Tomasi: Past Performance and Future Results, Nature Vol. 428, p. 378, 2004 • N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997
