
Ensemble Learning




  1. Ensemble Learning Textbook, Learning From Examples, Section 10-12 (pp. 56-66).

  2. From the book Guesstimation • How much domestic trash and recycling is collected each year in the US (in tons)? • Individual answers • Confidence-weighted ensemble

  3. Answer: 245 million tons

  4. Ensemble learning (diagram): each training set S1, S2, ..., SN is used to learn a corresponding hypothesis h1, h2, ..., hN, and the individual hypotheses are combined into a single ensemble hypothesis H.

  5. Advantages of ensemble learning • Can be very effective at reducing generalization error! (E.g., by voting.) • Ideal case: the hi have independent errors

  6. Example Given three hypotheses, h1, h2, h3, with hi(x) ∈ {−1, 1}. Suppose each hi has 60% generalization accuracy, and assume errors are independent. Now suppose H(x) is the majority vote of h1, h2, and h3. What is the probability that H is correct?
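Working this out under the independence assumption: H is correct exactly when at least two of the three hypotheses are correct, so P(H correct) = 0.6³ + 3 · 0.6² · 0.4 = 0.216 + 0.432 ≈ 0.65, which is better than any individual hi.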

  7. Another Example Again, given three hypotheses, h1, h2, h3. Suppose each hi has 40% generalization accuracy, and assume errors are independent. Now suppose we classify x as the majority vote of h1, h2, and h3. What is the probability that the classification is correct?
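By the same counting, P(correct) = 0.4³ + 3 · 0.4² · 0.6 = 0.064 + 0.288 ≈ 0.35: when the individual hypotheses are worse than chance, majority voting makes the ensemble even worse.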

  8. General case In general, if hypotheses h1, ..., hM all have generalization accuracy A, what is the probability that a majority vote will be correct?
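A minimal sketch of this computation in Python (assuming independent errors and an odd number of hypotheses M; the function name is illustrative):

```python
from math import comb

def p_majority_correct(M, A):
    """Probability that a majority vote of M hypotheses, each with
    generalization accuracy A and independent errors, is correct (M odd)."""
    need = M // 2 + 1  # votes needed for a majority
    return sum(comb(M, k) * A**k * (1 - A)**(M - k) for k in range(need, M + 1))

print(p_majority_correct(3, 0.6))   # ~0.648 (the 60% example above)
print(p_majority_correct(3, 0.4))   # ~0.352 (the 40% example above)
print(p_majority_correct(21, 0.6))  # ~0.83  (accuracy grows with M when A > 0.5)
```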

  9. Possible problems with ensemble learning • Errors are typically not independent • Training time and classification time are increased by a factor of M. • Hard to explain how ensemble hypothesis does classification. • How to get enough data to create M separate data sets, S1, ..., SM?

  10. Three popular methods: • Voting: • Train a classifier on M different training sets Si to obtain M different classifiers hi. • For a new instance x, define H(x) as the confidence-weighted vote H(x) = sign(Σi αi hi(x)), where αi is a confidence measure for classifier hi
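A sketch of that confidence-weighted vote (hypotheses as callables returning ±1; the names are illustrative, and the αi would come from whatever confidence measure is used):

```python
def weighted_vote(hypotheses, alphas, x):
    """H(x): sign of the alpha-weighted sum of the individual votes h_i(x) in {-1, +1}."""
    total = sum(a * h(x) for h, a in zip(hypotheses, alphas))
    return 1 if total >= 0 else -1
```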

  11. Bagging (Breiman, 1990s): • To create Si, create “bootstrap replicates” of original training set S • Boosting (Schapire & Freund, 1990s) • To create Si, reweight examples in original training set S as a function of whether or not they were misclassified on the previous round.
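For bagging, a bootstrap replicate is simply |S| examples drawn uniformly from S with replacement; a one-function sketch (the function name is illustrative):

```python
import random

def bootstrap_replicate(S, rng=random.Random(0)):
    """Draw |S| examples from S uniformly with replacement (some repeat, some are left out)."""
    return [rng.choice(S) for _ in range(len(S))]
```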

  12. Adaptive Boosting (Adaboost) A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)

  13. Sketch of algorithm Given examples S and learning algorithm L, with |S| = N • Initialize probability distribution over examples w1(i) = 1/N. • Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hK. • At each step, derive St from S by choosing examples probabilistically according to probability distribution wt. Use St to learn ht. • At each step, derive wt+1 by giving more probability to examples that were misclassified at step t. • The final ensemble classifier H is a weighted sum of the ht's, with each weight being a function of the corresponding ht's error on its training set.

  14. Adaboost algorithm • Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1} • Initialize w1(i) = 1/N. (Uniform distribution over the data)

  15. For t = 1, ..., K: • Select new training set St from S with replacement, according to wt • Train L on St to obtain hypothesis ht • Compute the training error εt of ht on S: εt = Σi wt(i) · 1[ht(xi) ≠ yi] (the total weight of the examples that ht misclassifies) • Compute coefficient αt = ½ ln((1 − εt)/εt)

  16. Compute new weights on data: For i = 1 to N, wt+1(i) = wt(i) e^(−αt yi ht(xi)) / Zt, so correctly classified examples are down-weighted and misclassified ones up-weighted, where Zt is a normalization factor chosen so that wt+1 will be a probability distribution: Zt = Σi wt(i) e^(−αt yi ht(xi))

  17. At the end of K iterations of this algorithm, we have h1, h2, ..., hK. We also have α1, α2, ..., αK, where αt = ½ ln((1 − εt)/εt) • Ensemble classifier: H(x) = sign(Σt αt ht(x)) • Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
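A compact sketch of this resampling variant in Python (assuming a weak_learner(X, y) helper that returns a prediction function; this is an illustration of the steps above, not code from the textbook):

```python
import numpy as np

def adaboost(X, y, weak_learner, K, rng=np.random.default_rng(0)):
    """X: (N, d) examples, y: (N,) labels in {-1, +1}.
    weak_learner(Xs, ys) must return a function h with h(X) -> {-1, +1} predictions."""
    N = len(y)
    w = np.full(N, 1.0 / N)                                  # w1(i) = 1/N
    hypotheses, alphas = [], []
    for t in range(K):
        idx = rng.choice(N, size=N, replace=True, p=w)       # draw S_t from S according to w_t
        h = weak_learner(X[idx], y[idx])                     # train L on S_t
        pred = h(X)
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # training error eps_t on S
        alpha = 0.5 * np.log((1 - eps) / eps)                # coefficient alpha_t
        w = w * np.exp(-alpha * y * pred)                    # up-weight misclassified examples
        w /= w.sum()                                         # normalize: divide by Z_t
        hypotheses.append(h)
        alphas.append(alpha)

    def H(Xnew):
        """Ensemble classifier: sign of the alpha-weighted sum of the h_t's."""
        return np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))
    return H
```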

  18. A Hypothetical Example where {x1, x2, x3, x4} are class +1 and {x5, x6, x7, x8} are class −1. t = 1: w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8} S1 = {x1, x2, x2, x5, x5, x6, x7, x8} (notice some repeats) Train classifier on S1 to get h1. Run h1 on S. Suppose classifications are: {1, −1, −1, −1, −1, −1, −1, −1} • Calculate error: ε1 = w1(2) + w1(3) + w1(4) = 3/8 (h1 misclassifies x2, x3, x4)

  19. Calculate α's: α1 = ½ ln((1 − ε1)/ε1) Calculate new w's: w2(i) = w1(i) e^(−α1 yi h1(xi)) / Z1, which shifts weight toward the misclassified x2, x3, x4

  20. t = 2 • w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102} • S2 = {x1, x2, x2, x3, x4, x4, x7, x8} • Train classifier on S2 to get h2 • Run h2 on S. Suppose classifications are: {1, 1, 1, 1, 1, 1, 1, 1} • Calculate error: ε2 = w2(5) + w2(6) + w2(7) + w2(8) ≈ 0.4 (h2 misclassifies x5–x8)

  21. Calculate α's: α2 = ½ ln((1 − ε2)/ε2) Calculate new w's: w3(i) = w2(i) e^(−α2 yi h2(xi)) / Z2

  22. t = 3 • w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125} • S3 = {x2, x3, x3, x3, x5, x6, x7, x8} • Train classifier on S3 to get h3 • Run h3 on S. Suppose classifications are: {1, 1, −1, 1, −1, −1, 1, −1} • Calculate error: ε3 = w3(3) + w3(7) ≈ 0.26 (h3 misclassifies x3 and x7)

  23. Calculate α's: α3 = ½ ln((1 − ε3)/ε3) • Ensemble classifier: H(x) = sign(α1 h1(x) + α2 h2(x) + α3 h3(x))

  24. Recall the training set: • {x1, x2, x3, x4} are class +1 • {x5, x6, x7, x8} are class −1 What is the accuracy of H on the training data?
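A short script to check the whole example (a sketch: it recomputes εt, αt, and the weights from the stated classifications, so the printed weights can differ slightly from the slides' rounded values):

```python
import numpy as np

y = np.array([1, 1, 1, 1, -1, -1, -1, -1])       # x1..x4 are +1, x5..x8 are -1
h = np.array([[1, -1, -1, -1, -1, -1, -1, -1],   # h1's classifications of x1..x8
              [1,  1,  1,  1,  1,  1,  1,  1],   # h2's classifications
              [1,  1, -1,  1, -1, -1,  1, -1]])  # h3's classifications

w = np.full(8, 1 / 8)                             # w1: uniform
alphas = []
for t in range(3):
    eps = w[h[t] != y].sum()                      # eps_t: total weight of misclassified examples
    alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_t
    alphas.append(alpha)
    w = w * np.exp(-alpha * y * h[t])             # reweight
    w /= w.sum()                                  # divide by Z_t
    print(f"t={t + 1}: eps={eps:.3f} alpha={alpha:.3f} w={np.round(w, 3)}")

H = np.sign(np.array(alphas) @ h)                 # ensemble classification of x1..x8
print("training accuracy of H:", (H == y).mean()) # 0.75 -- only x3 and x7 are misclassified
```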

  25. Sources of Classifier Error: Bias, Variance, and Noise • Bias: • The classifier cannot learn the correct hypothesis (no matter what training data is given), and so an incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets. • Variance: • The training data is not representative enough of all data, so the learned classifier h varies from one training set to another. • Noise: • The training data contains errors, so an incorrect hypothesis h is learned.

  26. Adaboost seems to reduce both bias and variance. Adaboost does not seem to overfit for increasing K.
