
Boosting and Additive Trees (Part 1)



Presentation Transcript


  1. Boosting and Additive Trees (Part 1) Ch. 10 Presented by Tal Blum

  2. Overview • Ensemble methods and motivations • Describing the Adaboost.M1 algorithm • Show that Adaboost minimizes the exponential loss • Other loss functions for classification and regression

  3. Ensemble Learning – Additive Models • INTUITION: combining the predictions of an ensemble is more accurate than using a single classifier. • Justification (several reasons): • it is easy to find many reasonably accurate “rules of thumb”, but hard to find a single highly accurate prediction rule. • If the training examples are few and the hypothesis space is large, then there are several equally accurate classifiers (model uncertainty). • The hypothesis space may not contain the true function, but a linear combination of hypotheses might. • Exhaustive global search in the hypothesis space is expensive, so we can instead combine the predictions of several locally accurate classifiers. • Examples: bagging, hierarchical mixtures of experts (HME), splines

  4. Boosting (explaining)

  5. Example learning curve for the simulated data Y = 1 if ∑j Xj² > χ²10(0.5), and Y = −1 otherwise
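A minimal sketch of this simulated example, assuming ten standard Gaussian features and scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 stump); the sample sizes and the number of boosting rounds are illustrative choices, not taken from the slide.

```python
# Sketch: the chi-squared threshold example and its test-error learning curve.
import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

def make_data(n, p=10):
    X = rng.standard_normal((n, p))
    # Y = 1 if sum_j X_j^2 exceeds the chi^2_10 median (about 9.34), else -1
    y = np.where((X ** 2).sum(axis=1) > chi2.ppf(0.5, df=p), 1, -1)
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(10000)

# AdaBoostClassifier's default base learner is a depth-1 decision tree (stump);
# 400 boosting rounds is an illustrative choice.
boost = AdaBoostClassifier(n_estimators=400).fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., 400 rounds,
# i.e. the points of the test-error learning curve.
test_errors = [np.mean(pred != y_test) for pred in boost.staged_predict(X_test)]
print("test error after 1 round:   ", round(test_errors[0], 3))
print("test error after 400 rounds:", round(test_errors[-1], 3))
```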

  6. Adaboost.M1 Algorithm • W(x) is the distribution of weights over the N training points, ∑i W(xi) = 1 • Initially assign uniform weights W0(xi) = 1/N for all xi • At each iteration k: • find the best weak classifier Ck(x) using the weights Wk(x) • compute the weighted error rate εk = [ ∑i Wk(xi) ∙ I(yi ≠ Ck(xi)) ] / [ ∑i Wk(xi) ] • set the classifier Ck's weight in the final hypothesis: αk = log((1 – εk)/εk) • update the weights: Wk+1(xi) = Wk(xi) ∙ exp[αk ∙ I(yi ≠ Ck(xi))] • CFINAL(x) = sign[ ∑k αk Ck(x) ] (a from-scratch sketch of these steps follows below)
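A from-scratch numpy sketch of exactly these steps, assuming labels y ∈ {−1, +1} and decision stumps as the weak classifiers Ck; the helper names (fit_stump, adaboost_m1, ...) are illustrative, not from the slide.

```python
# Sketch of Adaboost.M1 with decision stumps, assuming y in {-1, +1}.
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: best (feature, threshold, sign) under weights w."""
    best = (np.inf, 0, 0.0, 1)                   # (weighted error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, t, s)
    return best[1:]                               # (feature, threshold, sign)

def stump_predict(X, stump):
    j, t, s = stump
    return np.where(X[:, j] > t, s, -s)

def adaboost_m1(X, y, M=50):
    N = len(y)
    w = np.full(N, 1.0 / N)                       # W0(x) = 1/N
    stumps, alphas = [], []
    for _ in range(M):
        stump = fit_stump(X, y, w)                # best weak classifier under w
        pred = stump_predict(X, stump)
        miss = (pred != y).astype(float)
        eps = np.sum(w * miss) / np.sum(w)        # weighted error rate
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against division by zero
        alpha = np.log((1 - eps) / eps)           # classifier weight alpha_k
        w = w * np.exp(alpha * miss)              # up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    F = sum(a * stump_predict(X, s) for s, a in zip(stumps, alphas))
    return np.sign(F)                             # C_FINAL(x) = sign(sum_k alpha_k C_k(x))
```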

  7. Boosting as an Additive Model • The final boosting prediction f(x) can be expressed as an additive expansion of individual classifiers • The process is iterative and can be expressed as forward stagewise fitting (see the expansion and criterion below) • Typically we would try to minimize a loss function on the training examples
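In the usual notation, the expansion and the fitting criterion referred to above are as follows (a standard formulation, with b(x; γ) denoting a basis function, i.e. a weak learner with parameters γ):

```latex
% Additive expansion of M basis functions (weak learners)
f(x) = \sum_{m=1}^{M} \beta_m \, b(x;\gamma_m)
% Fitting by minimizing a loss over the training data
\min_{\{\beta_m,\gamma_m\}_1^M} \; \sum_{i=1}^{N} L\!\Big(y_i,\ \sum_{m=1}^{M}\beta_m\, b(x_i;\gamma_m)\Big)
```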

  8. Forward Stagewise Additive Modeling – algorithm • Initialize f0(x) = 0 • For m = 1 to M: • Compute (βm, γm) = argmin β,γ ∑i L(yi, fm−1(xi) + β b(xi; γ)) • Set fm(x) = fm−1(x) + βm b(x; γm)

  9. Forward Stagewise Additive Modeling • Sequentially add new basis functions without adjusting the parameters and coefficients of those already added • Simple case: squared-error loss L(y, f(x)) = (y − f(x))² • Here forward stagewise modeling amounts to just fitting the residuals yi − fm−1(xi) from the previous iteration (see the sketch below) • Squared-error loss is not robust for classification
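A minimal sketch of the squared-error case, where each stage fits a shallow regression tree to the current residuals; the tree depth, the number of stages, and the step size `nu` are illustrative choices, not part of the slide.

```python
# Sketch: forward stagewise additive modeling with squared-error loss.
# Each basis function is a shallow regression tree fit to the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=100, max_depth=2, nu=1.0):
    f = np.zeros(len(y))                      # f_0(x) = 0
    trees = []
    for _ in range(M):
        r = y - f                             # residuals from the previous iteration
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += nu * tree.predict(X)             # f_m(x) = f_{m-1}(x) + b(x; gamma_m)
        trees.append(tree)
    return trees

def stagewise_predict(X, trees, nu=1.0):
    return nu * sum(t.predict(X) for t in trees)
```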

  10. Exponential Loss and Adaboost • AdaBoost for classification: • L(y, f(x)) = exp(−y ∙ f(x)) – the exponential loss function

  11. Exponential Loss and Adaboost • With exponential loss, each stagewise step must solve (βm, Gm) = argmin β,G ∑i exp[−yi (fm−1(xi) + β G(xi))] = argmin β,G ∑i wi(m) ∙ exp[−β yi G(xi)], where wi(m) = exp[−yi fm−1(xi)] depends only on the previous iterations • Assuming β > 0, the minimizing Gm is the classifier that minimizes the weighted error ∑i wi(m) I(yi ≠ G(xi))

  12. Finding the best β • ∑i=1..N wi(m) exp[−β yi G(xi)] = (e^β − e^−β) ∙ [∑i=1..N wi(m) I(yi ≠ G(xi))] + e^−β ∙ ∑i=1..N wi(m) • Gm = argmin G ∑i=1..N wi(m) I(yi ≠ G(xi)) • βm = argmin β (e^β − e^−β) ∙ errm + e^−β = argmin β H(β)
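Setting the derivative of H(β) to zero gives the stagewise weight (the algebra below follows the standard derivation):

```latex
H(\beta) = (e^{\beta} - e^{-\beta})\,\mathrm{err}_m + e^{-\beta}
\quad\Rightarrow\quad
H'(\beta) = (e^{\beta} + e^{-\beta})\,\mathrm{err}_m - e^{-\beta} = 0
\;\Longrightarrow\;
e^{2\beta}\,\mathrm{err}_m = 1 - \mathrm{err}_m
\;\Longrightarrow\;
\beta_m = \tfrac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}
```

The classifier weight in Adaboost.M1 (slide 6) is αk = log((1 − εk)/εk) = 2βm; the factor of 2 does not change the sign of the final classifier, so the two procedures produce the same predictions.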

  13. Historical Notes • Adaboost was first presented in ML theory as a way to boost a weak classifier • At first people thought it defied the “no free lunch” theorem and did not overfit. • The connection between Adaboost and stagewise additive modeling was only discovered later.

  14. Why Exponential Loss? • Mainly computational: • derivatives are easy to compute • the optimal weak classifier minimizes the weighted sample error • under mild assumptions the instance weights decrease exponentially fast • Statistical: • exponential loss is not necessary for the success of boosting – see “On Boosting and the Exponential Loss” (Wyner) • we will see why in the next slides

  15. Why Exponential Loss? • Population minimizer (Friedman 2000): f*(x) = argmin f E[e^(−Yf(x)) | x] = ½ log [ Pr(Y=1|x) / Pr(Y=−1|x) ] • This justifies using its sign as a classification rule. (A sketch of the calculation follows below.)
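The minimizer follows by differentiating the conditional expectation with respect to f(x) and setting it to zero (the standard calculation, written out here):

```latex
\frac{\partial}{\partial f(x)}\, \mathrm{E}\!\left[e^{-Y f(x)} \mid x\right]
= -\Pr(Y{=}1\mid x)\,e^{-f(x)} + \Pr(Y{=}{-1}\mid x)\,e^{f(x)} = 0
\;\Longrightarrow\;
f^{*}(x) = \tfrac{1}{2}\log\frac{\Pr(Y{=}1\mid x)}{\Pr(Y{=}{-1}\mid x)}
```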

  16. Why Exponential Loss? • For exponential loss the minimizer implies Pr(Y=1|x) = 1 / (1 + e^(−2f*(x))) • This interprets f as one half of the logit transform of Pr(Y=1|x) • The population minimizer of the exponential loss and the population maximizer of the expected binomial log-likelihood (equivalently, the minimizer of the deviance) are the same

  17. Loss Functions and Robustness • For a finite dataset, exponential loss and binomial deviance are not the same criterion. • Both criteria are monotone decreasing functions of the margin y∙f(x). • Examples with negative margin y∙f(x) < 0 are classified incorrectly.

  18. Loss Functions and Robustness • The problem: classification error is not differentiable everywhere and has derivative 0 wherever it is differentiable. • We want a criterion that is efficient to optimize and as close as possible to the true classification loss. • Any loss criterion used for classification should give higher weight to misclassified examples. • The squared-error loss is therefore not appropriate for classification: it also penalizes examples that are classified correctly with margin y∙f(x) > 1.

  19. Loss Functions and Robustness • Both functions can be thought of as continuous approximations to the misclassification loss • Exponential loss grows exponentially fast for instances with large negative margin • The weights of such instances increase exponentially, which makes Adaboost very sensitive to mislabeled examples • The deviance generalizes to K classes; exponential loss does not. (A numerical comparison of the criteria is sketched below.)
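A small numpy sketch comparing these criteria as functions of the margin m = y∙f(x); the scalings are the usual ones (deviance written as log(1 + e^(−2yf)) and squared error as (1 − yf)², which equals (y − f)² for y ∈ {−1, +1}), while the grid of margins is illustrative.

```python
# Sketch: classification losses as functions of the margin m = y * f(x).
import numpy as np

margin = np.linspace(-2, 2, 9)                       # illustrative grid of margins
misclassification = (margin < 0).astype(float)       # I(y*f < 0)
exponential = np.exp(-margin)                        # exp(-y*f): explodes for large negative margins
binomial_deviance = np.log1p(np.exp(-2 * margin))    # log(1 + exp(-2*y*f)): grows only linearly
squared_error = (1 - margin) ** 2                    # (1 - y*f)^2: also penalizes margins > 1

for name, loss in [("misclass", misclassification), ("exp", exponential),
                   ("deviance", binomial_deviance), ("squared", squared_error)]:
    print(f"{name:9s}", np.round(loss, 2))
```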

  20. Robust Loss Functions for Regression • The relationship between squared loss and absolute loss is analogous to that between exponential loss and deviance. • Their population solutions are the conditional mean and the conditional median. • Absolute loss is more robust. • With squared-error loss, forward stagewise boosting for regression amounts to repeatedly fitting the residuals. • Huber loss combines the efficiency of squared error for Gaussian errors with robustness to outliers: • L(y, f) = (y − f)² if |y − f| ≤ δ, and 2δ|y − f| − δ² otherwise (a small implementation follows below)
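A minimal numpy implementation of this Huber criterion, with the threshold δ as a parameter; the default value used here is an illustrative choice, not from the slide.

```python
# Sketch: Huber loss - quadratic for small residuals, linear for large ones.
import numpy as np

def huber_loss(y, f, delta=1.345):
    """L(y,f) = (y-f)^2 if |y-f| <= delta, else 2*delta*|y-f| - delta^2."""
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2 * delta * r - delta ** 2)

y = np.array([0.0, 0.1, 3.0, 10.0])   # the last value mimics an outlier
f = np.zeros_like(y)
print(huber_loss(y, f))               # the outlier is penalized linearly, not quadratically
```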

  21. Comparison on a sample of UCI datasets

  22. Next Presentation
