Stat 231. A.L. Yuille. Fall 2004


Presentation Transcript


  1. Stat 231. A.L. Yuille. Fall 2004 • PAC Learning and Generalizability. • Margin Errors. • Structural Risk Minimization.

  2. Induction: History • Francis Bacon described empiricism: formulate hypotheses and test them by experiments. English Empiricist School of Philosophy. • David Hume. Scottish. Scepticism: “Why should the Sun rise tomorrow just because it always has?” • Karl Popper. The Logic of Scientific Discovery. Falsifiability Principle: “A hypothesis is useless unless it can be disproven.”

  3. Risk and Empirical Risk • Risk: R(α) = ∫ L(y, f(x; α)) dP(x, y). • Specialize to two classes (M = 2), with the loss function counting misclassifications, i.e. L(y, f(x; α)) = 1 if y ≠ f(x; α) and 0 otherwise. • Empirical risk on the dataset {(x_i, y_i) : i = 1, …, n}: R_emp(α) = (1/n) Σ_{i=1}^n L(y_i, f(x_i; α)). • The parameter α ranges over the set of learning machines (e.g. all thresholded hyperplanes).
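
A minimal sketch (my own, not from the lecture) of the two-class empirical risk with the 0-1 loss; the thresholded hyperplane here is just an illustrative member of the set of learning machines.

import numpy as np

def empirical_risk(predict, X, y):
    # Fraction of misclassified samples: (1/n) * sum of 0-1 losses.
    return float(np.mean(predict(X) != y))

# Illustrative data and a candidate thresholded-hyperplane classifier f(x; w, b).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])                      # "true" labels in {-1, +1}
predict = lambda Z: np.sign(Z @ np.array([1.0, 0.0]) + 0.1)
print(empirical_risk(predict, X, y))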

  4. Risk and Empirical Risk • Key concept: the Vapnik-Chervonenkis (VC) dimension h. • The VC dimension is a function of the set of classifiers; it is independent of the distribution P(x, y) of the dataset. • The VC dimension is a measure of the “degrees of freedom” of the set of classifiers. Intuitively, the size of the dataset n must be larger than the VC dimension before you can learn. • E.g. Cover's theorem: for hyperplanes in d dimensions, roughly 2(d+1) samples are needed before a random dichotomy of the data becomes unlikely to be separable purely by chance.
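
A small Monte Carlo sketch (my own illustration, not the lecture's code) of Cover's counting result: n points in general position in d dimensions with random +/-1 labels are almost always linearly separable when n is well below 2(d+1), and rarely separable when n is well above it. The dimension d and the number of trials are arbitrary choices.

import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    # LP feasibility check for y_i (w . x_i + b) >= 1.
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0          # status 0: a feasible (w, b) exists

rng = np.random.default_rng(0)
d = 10
for n in [d + 1, 2 * (d + 1), 4 * (d + 1)]:
    hits = sum(linearly_separable(rng.normal(size=(n, d)),
                                  rng.choice([-1.0, 1.0], size=n))
               for _ in range(200))
    print(f"n = {n:3d}: fraction of random dichotomies separable = {hits / 200:.2f}")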

  5. PAC • Probably Approximately Correct (PAC). • If h < n, where h is the VC dimension of the classifier set, then with probability at least 1 − η: R(α) ≤ R_emp(α) + φ(h, n, η), • where φ(h, n, η) = sqrt( [h (ln(2n/h) + 1) − ln(η/4)] / n ). • For hyperplanes, h = d + 1.
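
A short sketch (function name and example numbers are mine) evaluating the capacity term φ(h, n, η) in the bound above, to see how it shrinks as n grows relative to h.

import math

def capacity_term(h, n, eta):
    # VC confidence term: sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n).
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# Example: thresholded hyperplanes in d = 256 dimensions (h = d + 1),
# at confidence level 1 - eta = 0.95.
for n in [1_000, 10_000, 100_000]:
    print(n, round(capacity_term(h=257, n=n, eta=0.05), 3))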

  6. PAC • Generalizability: small empirical risk implies, with high probability, small risk, provided the capacity term φ(h, n, η) is small. • “Probably Approximately Correct” because we can never be completely sure that we haven't been misled by rare samples. • In practice, require h/n to be small, with η small.

  7. PAC • This is the basic Machine Learning result. There are a number of variants. • The VC dimension is one measure of the capacity of the set of classifiers. Other measures give tighter bounds but are harder to compute: the annealed VC entropy and the growth function. • The VC dimension is d + 1 for thresholded hyperplanes. It can also be bounded nicely for kernel classifiers via the margin (later this lecture). • A forthcoming lecture will sketch the derivation of PAC. It makes use of the probability of rare events (e.g. Cramér's theorem, Sanov's theorem).

  8. VC for Margins • The VC dimension is the largest number of data points that can be shattered by the classifier set. • Shattered means that every possible dichotomy of the dataset can be expressed by some classifier in the set (c.f. Cover's hyperplane result). • The VC dimension is d + 1 for thresholded hyperplanes in d dimensions. • But we can get tighter VC bounds by considering margins. These bounds extend directly to kernel hyperplanes.
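
A brute-force sketch (my own, not the lecture's) of the shattering definition: a point set is shattered by thresholded hyperplanes if every one of its 2^n labelings is linearly separable. Three non-collinear points in the plane are shattered, four points are not, consistent with VC dimension d + 1 = 3 for d = 2.

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Feasibility of y_i (w . x_i + b) >= 1 for some (w, b).
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0

def shattered(X):
    return all(separable(X, np.array(lab))
               for lab in itertools.product([-1.0, 1.0], repeat=len(X)))

tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # 3 points in the plane
quad = np.vstack([tri, [1.0, 1.0]])                      # 4 points in the plane
print(shattered(tri), shattered(quad))                   # expect: True False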

  9. VC Margin Hyperplanes • Hyperplanes w·x + b = 0 are normalized (canonical) w.r.t. the data: min_i |w·x_i + b| = 1. • Then the set of classifiers satisfying ||w|| ≤ Λ has VC dimension h satisfying h ≤ min(R²Λ², d) + 1, • where R is the radius of the smallest sphere containing the datapoints. Recall that the margin is 1/||w||, so margin ≥ 1/Λ. • Enforcing a large margin effectively limits the VC dimension.
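
A small sketch (helper name is mine) of the margin bound above: data of radius R classified by canonical hyperplanes with ||w|| ≤ Λ (i.e. margin ≥ 1/Λ) has VC dimension at most min(R²Λ², d) + 1.

import math

def margin_vc_bound(R, Lam, d):
    # Upper bound on the VC dimension for margin >= 1/Lam and data radius R.
    return min(math.floor(R * R * Lam * Lam), d) + 1

# Example: data in a ball of radius 1 in d = 256 dimensions.
print(margin_vc_bound(R=1.0, Lam=5.0, d=256))    # 26: a large margin caps h well below d + 1
print(margin_vc_bound(R=1.0, Lam=50.0, d=256))   # 257: a small margin gives no improvement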

  10. VC Margin: Kernels • The same technique applies to kernels. • Claim: finding the radius R of the minimum sphere that encloses the data in feature space depends on the feature vectors only through the kernel (the kernel trick). • Primal: minimize R² over the centre c and radius R, subject to ||φ(x_i) − c||² ≤ R² for all i. • Introduce Lagrange multipliers λ_i ≥ 0. • Dual: maximize W(λ) = Σ_i λ_i K(x_i, x_i) − Σ_{i,j} λ_i λ_j K(x_i, x_j), • s.t. λ_i ≥ 0, Σ_i λ_i = 1. • This depends on dot products (kernel values) only!
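
A rough sketch (not the lecture's code) of solving this dual: maximize W(λ) = Σ_i λ_i K_ii − λᵀKλ over the simplex; its optimal value is R², the squared radius of the smallest enclosing sphere in feature space. A simple Frank-Wolfe loop, which repeatedly moves weight onto the point farthest from the current centre, gives an estimate; the iteration count is an arbitrary choice.

import numpy as np

def min_enclosing_sphere_radius2(K, iters=2000):
    # K is the n x n kernel (Gram) matrix; returns an estimate of R^2.
    n = K.shape[0]
    lam = np.full(n, 1.0 / n)                  # start from the uniform distribution
    diag = np.diag(K)
    for t in range(iters):
        grad = diag - 2.0 * (K @ lam)          # dW/dlam_i = K_ii - 2 (K lam)_i
        i_star = int(np.argmax(grad))          # point farthest from the current centre
        gamma = 2.0 / (t + 2.0)                # standard Frank-Wolfe step size
        lam *= (1.0 - gamma)
        lam[i_star] += gamma
    return float(lam @ diag - lam @ K @ lam)   # W(lam), an estimate of R^2

# Example with a linear kernel on random data; any kernel matrix K can be used.
X = np.random.default_rng(0).normal(size=(50, 3))
print(min_enclosing_sphere_radius2(X @ X.T))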

  11. Generalizability for Kernels • The capacity term is a monotonic function of h. • Use the margin VC bound to decide which kernels will do best for learning the US Post Office handwritten digit dataset. • For each kernel choice, solve the dual problem to estimate R. • Assume that the empirical risk is negligible, because it is possible to classify the training digits correctly using polynomial kernels (but not with a linear classifier). • This predicts that the fourth-order kernel generalizes best, which compares nicely with the results of the classifiers when tested.
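
A sketch of the recipe on this slide, run on synthetic data rather than the US Post Office digits (which are not reproduced here); the kernel degrees, the SVM C value, and the crude centroid-based estimate of R² are my assumptions, not the lecture's numbers. For each polynomial kernel we fit an SVM, read off ||w||² = 1/margin² from the dual coefficients, estimate R², and rank kernels by the margin bound R²||w||² + 1 (a smaller bound predicts better generalization, assuming negligible empirical risk for every kernel).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
G = X @ X.T / X.shape[1]                        # scaled linear Gram matrix

for degree in [1, 2, 3, 4, 5]:
    K = (G + 1.0) ** degree                     # polynomial kernel of this degree
    clf = SVC(kernel="precomputed", C=10.0).fit(K, y)
    sv = clf.support_
    alpha_y = clf.dual_coef_[0]                 # y_i * alpha_i for the support vectors
    w_norm2 = alpha_y @ K[np.ix_(sv, sv)] @ alpha_y      # ||w||^2 = 1 / margin^2
    # Crude R^2 estimate: squared radius of the ball centred at the feature-space mean.
    R2 = float(np.max(np.diag(K) - 2 * K.mean(axis=1) + K.mean()))
    print(f"degree {degree}: margin VC bound ~ {R2 * w_norm2 + 1:.1f}")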

  12. Generalization for Kernels (figure slide; not reproduced in this transcript)

  13. Structural Risk Minimization • Standard learning says: pick the classifier that minimizes the empirical risk. • Traditional approach: use cross-validation to determine whether the chosen classifier is generalizing. • VC theory says: evaluate the bound R(α) ≤ R_emp(α) + φ(h, n, η), and ensure there are enough samples that the capacity term φ is small. • Alternative: Structural Risk Minimization. Divide the set of classifiers into a nested hierarchy of sets S_1 ⊂ S_2 ⊂ … ⊂ S_p ⊂ …, with corresponding VC dimensions h_1 ≤ h_2 ≤ … ≤ h_p ≤ …

  14. Structural Risk Minimization • Select the classifier (and the level p of the hierarchy) to minimize: Empirical Risk + Capacity Term. • The capacity term determines the “generalizability” of the classifier. • Increasing the amount of training data allows you to increase p and use a richer class of classifiers. • Is the bound tight enough?
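
A toy sketch of the SRM selection rule (the hierarchy, its VC dimensions h_p, and the empirical risks below are illustrative assumptions, not the lecture's numbers): pick the level p of the nested hierarchy whose classifier minimizes empirical risk plus capacity term.

import math

def capacity_term(h, n, eta=0.05):
    # Same VC confidence term as in the PAC bound.
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

n = 5000
# Hypothetical nested sets S_1 subset S_2 subset ... : (VC dimension h_p,
# best empirical risk attainable within S_p on the training data).
hierarchy = [(10, 0.20), (50, 0.08), (200, 0.03), (1000, 0.02), (4000, 0.015)]

bounds = [emp + capacity_term(h, n) for h, emp in hierarchy]
for p, ((h, emp), bound) in enumerate(zip(hierarchy, bounds), start=1):
    print(f"S_{p}: h = {h:4d}, emp risk = {emp:.3f}, emp risk + capacity = {bound:.3f}")
best = min(range(len(bounds)), key=bounds.__getitem__)
print(f"SRM selects S_{best + 1}")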

  15. Structural Risk Minimization (figure slide; not reproduced in this transcript)

  16. Summary • PAC Learning and the VC dimension. • The VC dimension is a measure of the capacity of the set of classifiers. • The risk is bounded by the empirical risk plus a capacity term. • VC dimensions can be bounded for linear and kernel classifiers by the margin concept. • This can predict which kernels are best able to generalize. • Structural Risk Minimization: penalize classifiers that have poor generalization bounds.
