Presentation Transcript


  1. Announcements • Get to work on the MP! • No official class Wednesday 11/29 • To make up for Jordan’s talk • We will be here for MP questions

  2. The Bayes Optimal Classifier: getting away from generative models, and our first ensemble method! • H is a parameterized hypothesis space • hML is the maximum likelihood hypothesis given some data (definitions below) • Can we do better (higher expected accuracy) than hML? • Yes! We expect hMAP to outperform hML IF… • There is an interesting prior P(h) (i.e., not uniform) • Can we do better (higher expected accuracy) than hMAP? • Yes! Bayes Optimal will outperform hMAP IF… (some assumptions)
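For reference, the standard definitions of the two estimates compared above (conventional notation, not taken from the slide):

```latex
h_{\mathrm{ML}}  = \arg\max_{h \in H} P(D \mid h),
\qquad
h_{\mathrm{MAP}} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)
```

When the prior P(h) is uniform the two coincide, which is why a non-uniform ("interesting") prior is needed for hMAP to beat hML.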

  3. Bayes Optimal Classifier. Getting a second opinion from another doctor, a third opinion from yet another… One doctor is the most confident: he is hML. One doctor is the most reliable / accurate: she is hMAP. But she may be only a little more trustworthy than the others. What if hMAP says “+” but *all* other h ∈ H say “-”? • If P(hMAP | D) < 0.5, perhaps we should prefer “-” • Think of each hi as casting a weighted vote • Weight each hi by how likely it is to be correct given the training data • Not just by P(h), which is already reflected in hMAP • Rather by P(h|D) • The most reliable joint opinion may contradict hMAP

  4. Bayes Optimal Classifier: Example • Assume a space of 3 hypotheses, with h1 the single most probable given the data (so h1 = hMAP) • Given a new instance x, assume that h1(x) = 1, h2(x) = 0, h3(x) = 0 • In this case P(f(x) = 1) = 0.4 and P(f(x) = 0) = 0.6, but hMAP(x) = 1 • We want to determine the most probable classification by combining the predictions of all hypotheses • We can weight each by its posterior probability (there are additional lurking assumptions…)

  5. Bayes Optimal Classifier: Example (2) • Let V be the set of possible classifications • Bayes Optimal Classification: (rule and example worked out below) • In the example, the optimal prediction is indeed 0.
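The slide's equations were images; here is the standard Bayes optimal classification rule together with the example computation, assuming the hypotheses make deterministic predictions so that P(v | h) is 1 or 0:

```latex
v_{OB} = \arg\max_{v \in V} \sum_{h \in H} P(v \mid h)\, P(h \mid D)
```

In the example, only h1 votes for 1 and only h2, h3 vote for 0:

```latex
\sum_{h} P(1 \mid h)\,P(h \mid D) = P(h_1 \mid D) = 0.4,
\qquad
\sum_{h} P(0 \mid h)\,P(h \mid D) = P(h_2 \mid D) + P(h_3 \mid D) = 0.6,
```

so the Bayes optimal prediction is 0, even though hMAP = h1 predicts 1.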

  6. What Assumptions are We Making? (1) • Will this always work? • Hint: we are finding a linear combination of h’s • What if several doctors shared training: med school, classes, instructors, internships…? • What if some “doctors” were really phone / web site referrals to a single doctor? More generally… • Significant covariance among doctors (h’s) that is not due to accuracy indicates interdependent redundancy • We over-weight these opinions • Bayes optimal looks like marginalization over H (is it? see the note below)
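A note on that last question (standard probability, not slide content): the Bayes optimal rule is exact marginalization over H whenever the hypothesis screens off the data, i.e. P(v | h, D) = P(v | h):

```latex
P(v \mid D) = \sum_{h \in H} P(v \mid h, D)\, P(h \mid D) = \sum_{h \in H} P(v \mid h)\, P(h \mid D)
```

The first equality is the law of total probability; the second holds when h alone determines the prediction.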

  7. What Assumptions (2) • What does it mean “to work”? • As |D| grows without bound, Bayes optimal classification should converge to the best answer. Does it? • Consider the weight vector w as |D| → ∞ • “Best answer” means better than any other • Assume no perfect tie, so there is a best answer • The best w… • …will be all zeros except for a single one (for the best h) • In general, must this happen? Why? How can we force it to happen?

  8. Bayes Optimal Classifier • Without additional information we can do no better than Bayes optimal • The Bayes optimal classifier in general is not a member of H (!) • Bayes optimal classifiers make a strong assumption about “independence” (or nonredundancy) among the h’s: mistakes are uncorrelated • Mistakes are uncorrelated – kind of like naïve Bayes, but now among H • Another strong assumption: some h ∈ H is correct; NOT agnostic • View it as combining expertise: finding a linear combination (ensemble) of experts

  9. Gibbs Classifier • Bayes optimal classifiers can be expensive to train • Must calculate posteriors for all h ∈ H • Instead, train and classify according to the current posterior over H • Training (sketched in code below): • Assume some prior (perhaps uniform) over H • Draw an h • Classify with that h • Update the posterior of that h • Repeat • Multiple passes through the training set; can draw & update several h’s at once • Effort is focused on “problem” h’s (those mistakenly assigned high accuracy) • Training mistakes tend to lower the posteriors of the offending h’s; through normalization, this raises the posteriors of everything else • Tends (eventually) to exercise h’s that are more accurate • Converges • Expected error is at worst twice the error rate of Bayes optimal
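A minimal sketch, in Python/NumPy, of the draw-classify-update loop described above. The uniform prior and the multiplicative penalty `beta` on mistakes are illustrative assumptions consistent with the slide, not a prescribed update rule; `hypotheses` is assumed to be a list of callables h(x) returning a label, and X, y are NumPy arrays.

```python
import numpy as np

def gibbs_train(hypotheses, X, y, passes=5, beta=0.5, rng=None):
    """Sketch of Gibbs-classifier training: keep a posterior over H,
    repeatedly draw an h, classify a training point with it, and
    down-weight h when it is wrong (then renormalize)."""
    rng = rng or np.random.default_rng(0)
    posterior = np.full(len(hypotheses), 1.0 / len(hypotheses))   # uniform prior (assumed)
    for _ in range(passes):                                       # multiple passes through the data
        for x, label in zip(X, y):
            i = rng.choice(len(hypotheses), p=posterior)          # draw h ~ current posterior
            if hypotheses[i](x) != label:                         # training mistake
                posterior[i] *= beta                              # penalize the offending h (assumed rule)
                posterior /= posterior.sum()                      # normalization raises everything else
    return posterior

def gibbs_classify(hypotheses, posterior, x, rng=None):
    """Classify by drawing a single h from the current posterior."""
    rng = rng or np.random.default_rng()
    i = rng.choice(len(hypotheses), p=posterior)
    return hypotheses[i](x)
```

Classification draws a single hypothesis from the learned posterior rather than averaging over all of H, which is what gives the "at worst twice Bayes optimal" expected error mentioned above.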

  10. Bagging: Bootstrap AGGregatING • Variance reduction • The problem of unstable classifiers • Overfitting arises from low-statistical-confidence choices • They find specious patterns in the data • “Average” over a number of classifiers • Bootstrap: data resampling • Generate multiple training sets • Resample the original training data • With replacement • The data sets have different “specious” patterns • Learn a number of classifiers (see the sketch after this slide) • Specious patterns will not correlate • The underlying true pattern will be common to many • Combine the classifiers: label new test examples by a majority vote among the classifiers
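A minimal sketch of the procedure above, assuming a hypothetical `train(X, y)` function that fits one base classifier and returns a predict callable, with X and y as NumPy arrays:

```python
import numpy as np

def bagging(train, X, y, n_classifiers=25, rng=None):
    """Bootstrap AGGregatING: resample the training set with replacement,
    fit one base classifier per resample, and return a majority-vote predictor."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)          # bootstrap sample: n draws with replacement
        models.append(train(X[idx], y[idx]))      # each model sees different "specious" patterns
    def predict(x):
        votes = [m(x) for m in models]            # majority vote over the ensemble
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]
    return predict
```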

  11. Bagging • Recall that logistic regression can overfit • When the data are linearly separable • Overly steep probability fits • Consider a bagging approach… • With many features (dimensions) • Extreme steepness in some dimension may be common • But these will not be systematic • So averaging tends to diminish their effect • Generate a collection of trained but “random” classifiers • Sometimes resampling isn’t even necessary – consider iterative algorithms • Resampling reduces the information in the training set • Properly done, the effect can be small… • But it can be enough simply to permute the data • Then there is no reduction in the information / evidence of the training set • Decision “stumps” (a minimal one is sketched below) • A decision tree with just one level (split) • Sometimes a few levels, to capture some nonlinearity among features • Bagging decision stumps • Often works surprisingly well • A good first thing to try
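A minimal decision-stump learner that could serve as the `train` argument of the bagging sketch above. The exhaustive search over observed thresholds and the labels in {-1, +1} are illustrative assumptions; X is a 2-D NumPy array of features.

```python
import numpy as np

def train_stump(X, y):
    """Fit a one-split decision tree: pick the (feature, threshold, sign)
    whose simple threshold rule makes the fewest training errors."""
    best = None
    for j in range(X.shape[1]):                       # try every feature
        for t in np.unique(X[:, j]):                  # and every observed threshold
            for sign in (1, -1):                      # "x_j > t" predicts +1 or -1
                pred = np.where(sign * (X[:, j] - t) > 0, 1, -1)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda x: 1 if sign * (x[j] - t) > 0 else -1   # predict callable, as bagging() expects
```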

  12. Boosting: Weak to Strong Learning • A weak learner: • given a set of weighted training examples, produces a hypothesis • with high probability • of accuracy at least *slightly* better than random guessing • over any distribution • Given a training set Z and a hypothesis space H • Learn a (sequence of) linear combinations of weak classifiers • At each iteration, add a classifier hi ∈ H • Weight hi by its performance on the weighted Z • Each new h is trained on the same Z, but reweighted so that hard examples zj count more • Two sets of weights – one for the data and one for the weak learners • Classify using a weighted vote of the classifiers (see the formula below) • Builds a “strong” learner: arbitrarily high accuracies can be achieved
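The weighted vote takes the standard form, where α_t denotes the weight of weak hypothesis h_t (conventional notation, not taken from the slide):

```latex
H(x) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t\, h_t(x) \right)
```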

  13. Boosting • A “meta” learning algorithm • Any learning algorithm that builds “weak learners” can be “boosted” • Boosting yields performance as high as desired (!) • Continued training improves performance even after perfectly classifying the training set (!) • Seems not to overfit • Can be overly sensitive to outliers (and noisy data) • The popular and practical choice is AdaBoost (for adaptive boosting)

  14. AdaBoost (from the Freund and Schapire tutorial); the algorithm is sketched below
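A sketch of standard AdaBoost for labels in {-1, +1}. The `weak_learner(X, y, w)` interface (fit under example weights `w`, return a predict callable) is an assumption for illustration; the earlier stump learner would need a weighted-error variant to plug in here.

```python
import numpy as np

def adaboost(weak_learner, X, y, T=50):
    """Standard AdaBoost sketch: maintain a distribution over examples,
    fit a weak hypothesis against it each round, weight that hypothesis by
    its weighted error, and up-weight the examples it got wrong."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # D_1: uniform distribution over examples
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)                # weak hypothesis trained on the weighted data
        pred = np.array([h(x) for x in X])
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # weighted training error
        alpha = 0.5 * np.log((1 - eps) / eps)    # weight of this weak learner
        w *= np.exp(-alpha * y * pred)           # mistakes (y * pred = -1) get larger weight
        w /= w.sum()                             # renormalize to a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    def predict(x):
        return int(np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses))))
    return predict
```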

  15. [Plot: training error and test error versus boosting iteration (size of classifier), for boosted C4.5 decision trees.] Somewhat surprising behavior! Why surprising?

  16. Boosted Stumps / Boosted Decision Trees • [Error scatter plots on problems from the UCI repository.] What does that line represent? Above? Below? • Stumps (left) are more efficient to learn • Decision trees (right) reduce boosting’s outlier sensitivity

  17. Why does boosting work? • It’s complicated • There are many approaches to boosting (AdaBoost is a standard) • Heuristic? No – provably effective • Heuristic? Yes – invented and refined, not derived • Accepted intuition: • Empirically identifies support vectors • Finds a “large margin” strong classifier • Better: improves the margin distribution rather than just the margin (see the definition below)
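For reference, the usual normalized margin of a boosted classifier on a labeled example (x, y) with y ∈ {-1, +1} (standard definition, not slide content):

```latex
\operatorname{margin}(x, y) \;=\; \frac{y \sum_{t} \alpha_t\, h_t(x)}{\sum_{t} \alpha_t} \;\in\; [-1, 1]
```

A positive margin means the example is classified correctly, and a larger margin means a more confident vote; the margin distribution is the distribution of this quantity over the training set, which is what the next two slides plot.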

  18. Margin Distribution: number of points as distance from the classifier increases. [Plot contrasting the best (minimum) margin with a better margin distribution.]

  19. Boosting as Improving the Margin Distribution. [Plots: training error and test error versus boosting iteration (size of classifier); cumulative margin distribution for 5 (dotted), 100 (dashed), and 1000 (solid) boosting iterations.]
