Weak Models: Bagging, Boosting, Bootstrap Aggregation

Bagging and boosting improve the stability and accuracy of machine learning algorithms: bagging reduces variance and helps avoid overfitting, and can be applied to many different base methods; it is a special case of model averaging. Boosting primarily reduces bias.


Presentation Transcript


  1. Weak Models: Bagging, Boosting, Bootstrap Aggregation Peter Fox Data Analytics – ITWS-4600/ITWS-6600/MATP-4450 Group 3 Module 11, March 26, 2018

  2. Bootstrap aggregation (bagging) • Improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. • Also reduces variance and helps to avoid overfitting. • Usually applied to decision tree methods, but can be used with any type of method. • Bagging is a special case of the model averaging approach. • Harder to interpret – why?
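Not part of the original slides: a minimal hand-rolled bagging sketch in R, to make the idea concrete. It bags rpart classification trees on the iris data; B = 25 bootstrap replicates is an arbitrary choice.

library(rpart)

set.seed(1)
data(iris)
B <- 25                   # number of bootstrap replicates (arbitrary)
n <- nrow(iris)

# Grow one tree per bootstrap sample of the training rows.
trees <- lapply(1:B, function(b) {
  boot <- iris[sample(n, n, replace = TRUE), ]
  rpart(Species ~ ., data = boot)
})

# Aggregate: majority vote over the B trees.
votes  <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)   # resubstitution accuracy of the bagged ensemble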

  3. Cf. Random Forest • “Averages” over the trees… i.e. a different form of model averaging • But the trees are “dimension-reduced” and provide immediate “prescriptive” capability • Local partitioning – but applied in a different way than in bagging • Let’s see how…
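For comparison (not in the original slides), a random forest fit with the randomForest package; setting mtry to the number of predictors reduces it to plain bagging of trees, while the default mtry samples a subset of predictors at each split. ntree = 100 is an arbitrary choice.

library(randomForest)

set.seed(1)
rf  <- randomForest(Species ~ ., data = iris, ntree = 100)             # default mtry ~ sqrt(p)
bag <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 4)   # mtry = p: plain bagging
rf$confusion    # out-of-bag confusion matrix
bag$confusion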

  4. Ozone
library(ipred)                                  # ipred::bagging handles the numeric response V4
library(rpart)
data(Ozone, package = "mlbench")
Ozone <- Ozone[!is.na(Ozone$V4), ]              # drop rows with a missing response
l <- length(Ozone[, 1])
sub <- sample(1:l, 2*l/3)
OZ.bagging <- bagging(V4 ~ ., data = Ozone[sub, -1],        # train on the 2/3 sample; drop month (V1)
                      nbagg = 30,                           # ipred's argument (adabag would use mfinal)
                      control = rpart.control(maxdepth = 5))
OZ.bagging.pred <- predict(OZ.bagging, newdata = Ozone[-sub, -1])

  5. Ozone [figure: 10 of 100 bootstrap samples and their average] What other local models? Splines – more in the next few modules.

  6. Example reading… • http://amunategui.github.io/bagging-in-R/ • Note comment about “competitions” • https://www.r-bloggers.com/improve-predictive-performance-in-r-with-bagging/

  7. Bagging shows improvements for unstable procedures (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression • … but it can mildly degrade the performance of stable methods such as k-nearest neighbors

  8. Bagging (bootstrap aggregation)*
library(mlbench)
library(adabag)   # provides bagging() / predict.bagging(); requires a number of other packages
data(BreastCancer)
l <- length(BreastCancer[,1])
sub <- sample(1:l, 2*l/3)
BC.bagging <- bagging(Class ~ ., data = BreastCancer[,-1],    # note: for a true hold-out test, train on
                      mfinal = 20,                            # BreastCancer[sub,-1] as in the Vehicle
                      control = rpart.control(maxdepth = 3))  # example (slide 10); base learner is rpart
BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub, -1])
BC.bagging.pred$confusion
                Observed Class
Predicted Class benign malignant
      benign       142         2
      malignant      8        81
BC.bagging.pred$error
[1] 0.04291845

  9. A “little later” - randomized
> data(BreastCancer)
> l <- length(BreastCancer[,1])
> sub <- sample(1:l, 2*l/3)
> BC.bagging <- bagging(Class ~ ., data = BreastCancer[,-1], mfinal = 20,
+                       control = rpart.control(maxdepth = 3))
> BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub, -1])
> BC.bagging.pred$confusion
                Observed Class
Predicted Class benign malignant
      benign       147         1
      malignant      7        78
> BC.bagging.pred$error
[1] 0.03433476

For comparison, the earlier run (slide 8):
                Observed Class
Predicted Class benign malignant
      benign       142         2
      malignant      8        81
BC.bagging.pred$error
[1] 0.04291845

  10. Bagging (Vehicle)
> data(Vehicle)
> l <- length(Vehicle[,1])
> sub <- sample(1:l, 2*l/3)
> Vehicle.bagging <- bagging(Class ~ ., data = Vehicle[sub, ], mfinal = 40,
+                            control = rpart.control(maxdepth = 5))
> Vehicle.bagging.pred <- predict.bagging(Vehicle.bagging, newdata = Vehicle[-sub, ])
> Vehicle.bagging.pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   63   10    8   0
           opel   1   42   27   0
           saab   0   18   30   0
           van    5    7    9  62
> Vehicle.bagging.pred$error
[1] 0.3014184

  11. Up to now • Strong models • Direct use of the (independent) variables – some or all • Averaging to reduce overfitting* • Guided by statistical significance (R², p-value, other measures, error rate) • Strong models + “weaker” models • PCA – identifying dominant dimensions • Factor analysis – cross-correlations down to r = .3, combining variables into factors • Aimed at explaining variance

  12. Weak models … • A weak learner: a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing) • A strong learner: a classifier that is arbitrarily well-correlated with the true classification. • Can a set of weak learners create a single strong learner (not called latent but same idea)?

  13. Boosting • … reduces bias in supervised learning • Most boosting algorithms iteratively learn weak classifiers with respect to a distribution over the training data and add them to a final strong classifier • The weak learners are typically weighted in a way related to their accuracy • After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight • Thus, future weak learners focus more on the examples that previous weak learners misclassified (see the sketch below)
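To make the reweighting loop concrete, here is a minimal AdaBoost-style sketch (not the course code) using depth-1 rpart stumps on a two-class subset of iris; M = 10 rounds is an arbitrary choice.

library(rpart)

set.seed(1)
d <- iris[iris$Species != "setosa", ]        # two-class problem: versicolor vs virginica
d$Species <- factor(d$Species)               # drop the unused factor level
y <- ifelse(d$Species == "versicolor", 1, -1)
n <- nrow(d)

w  <- rep(1, n)                              # start with equal example weights
Fx <- rep(0, n)                              # running score of the boosted ensemble
M  <- 10                                     # boosting rounds (arbitrary)

for (m in 1:M) {
  # Weak learner: a depth-1 tree ("stump") fitted with the current weights.
  stump <- rpart(Species ~ ., data = d, weights = w,
                 control = rpart.control(maxdepth = 1, minsplit = 0, cp = -1))
  pred <- ifelse(predict(stump, d, type = "class") == "versicolor", 1, -1)
  err  <- sum(w * (pred != y)) / sum(w)      # weighted error of this weak learner
  if (err == 0 || err >= 0.5) break          # weak-learner assumption violated: stop
  alpha <- 0.5 * log((1 - err) / err)        # this learner's weight in the final vote
  Fx <- Fx + alpha * pred                    # add the weighted weak learner to the ensemble
  w  <- w * exp(-alpha * y * pred)           # misclassified examples gain weight,
  w  <- w / sum(w) * n                       # correctly classified ones lose weight
}

mean(sign(Fx) == y)                          # training accuracy of the boosted ensemble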

  14. Diamonds (lab this week) Compare the identification of this variable under boosting versus using strong learners – one possible setup for the Expensive indicator used on the next slides is sketched below.
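A hedged setup for the lab (an assumption, not the course definition): the Expensive indicator used on the next two slides is not a column of ggplot2's diamonds data, so one plausible construction is price above the 75th percentile, with price itself dropped so the classification is not trivial. A glm fit serves as the strong-learner baseline.

library(ggplot2)

diamonds <- as.data.frame(ggplot2::diamonds)
diamonds$Expensive <- as.integer(diamonds$price > quantile(diamonds$price, 0.75))
diamonds$price <- NULL                       # drop the variable Expensive was derived from

# Strong-learner baseline: ordinary logistic regression on all remaining predictors.
strong <- glm(Expensive ~ ., data = diamonds, family = binomial)
head(summary(strong)$coefficients)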

  15. Using diamonds… boost (glm)
> library(mboost)
> mglmboost <- glmboost(as.factor(Expensive) ~ ., data = diamonds,
+                       family = Binomial(link = "logit"))
> summary(mglmboost)

	 Generalized Linear Models Fitted via Gradient Boosting

Call:
glmboost.formula(formula = as.factor(Expensive) ~ ., data = diamonds,
    family = Binomial(link = "logit"))

	 Negative Binomial Likelihood

Loss function: {
    f <- pmin(abs(f), 36) * sign(f)
    p <- exp(f)/(exp(f) + exp(-f))
    y <- (y + 1)/2
    -y * log(p) - (1 - y) * log(1 - p)
}

  16. Using diamonds… boost (glm)
> summary(mglmboost)   # continued

Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.339537

Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
  from a model fitted via glm(..., family = 'binomial').
  See Warning section in ?coef.mboost

(Intercept)       carat   clarity.L
 -1.5156330   1.5388715   0.1823241
attr(,"offset")
[1] -1.339537

Selection frequencies:
      carat (Intercept)   clarity.L
       0.50        0.42        0.08      # add up to 1.0
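The mstop = 100 shown above is mboost's default rather than a tuned value; the package's cross-validated risk is the usual way to choose it. A short sketch, assuming the mglmboost object from slide 15 (cvrisk can be slow on the full diamonds data):

set.seed(1)
cvm <- cvrisk(mglmboost)          # 25 bootstrap folds by default
plot(cvm)                         # cross-validated risk over the boosting iterations
mstop(cvm)                        # iteration with the lowest cross-validated risk
mglmboost[mstop(cvm)]             # set the model to that many iterations (changes it in place)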

  17. Cluster boosting • Assessment of the clusterwise stability of a clustering of data, which can be cases x variables or dissimilarity data. • The data is resampled using several schemes (bootstrap, subsetting, jittering, replacement of points by noise) and the Jaccard similarities of the original clusters to the most similar clusters in the resampled data are computed. • The mean over these similarities is used as an index of the stability of a cluster (other statistics can be computed as well).

  18. Cluster boosting • Quite general clustering methods are possible, i.e. methods estimating or fixing the number of clusters, methods producing overlapping clusters or not assigning all cases to clusters (but declaring them as "noise"). • In R (fpc::clusterboot) – clustermethod = … selects the method, e.g. kmeansCBI for k-means; see the sketch below • Lab this week … (iris, etc.)
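A minimal clusterboot call on iris, assuming the fpc package (whose documentation the description above follows); k = 3 clusters and B = 20 resampling runs are arbitrary choices for the lab.

library(fpc)

set.seed(1)
cb <- clusterboot(iris[, 1:4],               # cases x variables data
                  B = 20,                    # number of resampling runs
                  bootmethod = "boot",       # nonparametric bootstrap
                  clustermethod = kmeansCBI, # k-means interface
                  krange = 3,                # number of clusters passed to kmeansCBI
                  count = FALSE)             # suppress per-run progress output
cb$bootmean                                  # mean Jaccard similarity per cluster (stability index)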

  19. Example - bodyfat • The response variable is body fat measured by DXA (DEXfat), which can be seen as the gold standard for measuring body fat. • However, DXA measurements are too expensive and complicated for broad use. • Anthropometric measurements such as waist or hip circumference are, in comparison, very easy to obtain in a standard screening. • A prediction formula based only on these measures could therefore be a valuable alternative with high clinical relevance for daily use. • Tutorial (lab): https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf

  20. bodyfat
library(mboost)
data("bodyfat", package = "TH.data")      # the bodyfat data ships with TH.data (as in the mboost tutorial)

## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)

## Estimate the same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)

## We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)

## or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"])                  ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+")))      ## build formula

  21. Compare linear models
> coef(lm1)
(Intercept)     hipcirc kneebreadth    anthro3a
-75.2347840   0.5115264   1.9019904   8.9096375
> coef(glm1, off2int = TRUE)   ## off2int adds the offset to the intercept
(Intercept)     hipcirc kneebreadth    anthro3a
-75.2073365   0.5114861   1.9005386   8.9071301
Conclusion?

  22.
> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "")   ## select all
 (Intercept)          age    waistcirc      hipcirc elbowbreadth  kneebreadth
 -98.8166077    0.0136017    0.1897156    0.3516258   -0.3841399    1.7365888
    anthro3a     anthro3b     anthro3c      anthro4
   3.3268603    3.6565240    0.5953626    0.0000000
attr(,"offset")
[1] 30.78282

  23. plot(glm2, off2int = TRUE)   ## coefficient paths over the boosting iterations, offset added to the intercept

  24. plot(glm2, ylim = range(coef(glm2, which = preds)))   ## same paths, y-axis restricted to the predictors (intercept excluded)

  25. Adaboost • Preparation for lab: http://math.mit.edu/~rothvoss/18.304.3PM/Presentations/1-Eric-Boosting304FinalRpdf.pdf
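Before the reading, a small AdaBoost.M1 run with adabag::boosting on the BreastCancer data, mirroring the earlier bagging calls (mfinal = 20 and maxdepth = 3 are arbitrary choices):

library(adabag)
library(mlbench)

data(BreastCancer)
set.seed(1)
l <- nrow(BreastCancer)
sub <- sample(1:l, 2*l/3)
BC.adaboost <- boosting(Class ~ ., data = BreastCancer[sub, -1], mfinal = 20,
                        control = rpart.control(maxdepth = 3))
BC.adaboost.pred <- predict.boosting(BC.adaboost, newdata = BreastCancer[-sub, -1])
BC.adaboost.pred$confusion    # compare with the bagging confusion matrices above
BC.adaboost.pred$error
BC.adaboost$importance        # relative importance of each predictor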

  26. Other forms of boosting • gamboost = Generalized Additive Model boosting – gradient boosting for optimizing arbitrary loss functions, where component-wise smoothing procedures are utilized as (univariate) base-learners.

  27.
> gam1 <- gamboost(DEXfat ~ bbs(hipcirc) + bbs(kneebreadth) + bbs(anthro3a),
+                  data = bodyfat)
> ## Using plot() on a gamboost object automatically delivers the partial effects
> ## of the different base-learners:
> par(mfrow = c(1, 3))   ## 3 plots in one frame
> plot(gam1)             ## get the partial effects
> ## base-learners: bbs (smooth splines), bols (linear), btree (trees), ...

  28. Compare to rpart
> fattree <- rpart(DEXfat ~ ., data = bodyfat)
> plot(fattree)
> text(fattree)
> labels(fattree)
[1] "root"            "waistcirc< 88.4" "anthro3c< 3.42"  "anthro3c>=3.42"  "hipcirc< 101.3"  "hipcirc>=101.3"
[7] "waistcirc>=88.4" "hipcirc< 109.9"  "hipcirc>=109.9"

  29. Variants on boosting – loss fn
cars.gb <- blackboost(dist ~ speed, data = cars,
                      control = boost_control(mstop = 50))
### plot fit
plot(dist ~ speed, data = cars)
lines(cars$speed, predict(cars.gb), col = "red")

  30. Blackboosting (cf. brown) Gradient boosting for optimizing arbitrary loss functions, where regression trees are utilized as base-learners.
> cars.gb

	 Model-based Boosting

Call:
blackboost(formula = dist ~ speed, data = cars, control = boost_control(mstop = 50))

	 Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 50
Step size: 0.1
Offset: 42.98
Number of baselearners: 1

  31. Cars - gamboost [figure: fitted curve is more “localized”] Note the characteristics of the model, cf. blackboosting

  32. iris [figure]

  33. cars [figure]

  34. library(mboost) [figure]

  35. Cars [figure]

  36. Sparse matrix example
> coef(mod, which = which(beta > 0))
     V306     V1052     V1090     V3501     V4808     V5473     V7929     V8333     V8799     V9191
2.1657532 0.0000000 4.8756163 4.7068006 0.4429911 5.4029763 3.6435648 0.0000000 3.7843504 0.4038770
attr(,"offset")
[1] 2.90198

  37. Aside: Boosting and SVM… • Remember “margins” from the SVM? Partitioning the “linear” or transformed space? • In boosting we are effectively (not explicitly) attempting to maximize the minimum margin of any training example

  38. Assignment 7 • E.g. https://rpubs.com/chengjiun/52658 • A7 is available on LMS • Lab this week (group3/lab4…) • Group 4 next week – cross validation ++ in relation to ~ all other methods
