Topic 10 - Ensemble Methods

Data Mining - Volinsky - 2011 - Columbia University


Ensemble Models - Motivation

  • Remember this picture?

  • Always looking for a balance between low complexity (‘good on average’ but bad for specific predictions) and high complexity (‘good for specific cases’ but might overfit)

  • By combining many different models, ensembles make it easier to hit the ‘sweet spot’ of modelling.

  • Best for models to draw from diverse, independent opinions

    • Wisdom Of Crowds

[Figure: the familiar plot of Strain(θ) and Stest(θ) versus model complexity]

Ensemble Methods - Motivation

  • Models are just models — none of them is literally ‘true’.

    • The assumption that a single model captures the truth is usually not true!

    • The truth is often much more complex than any single model can capture.

    • Combinations of simple models can be arbitrarily complex. (e.g. spam/robots models, neural nets, splines)

  • Notion: An average of several measurements is often more accurate and stable than a single measurement

    Accuracy: how well the model does for estimation and prediction

    Stability: small changes in inputs have little effect on outputs

Ensemble Methods – How They Work

  • The ensemble predicts a target value as an average or a vote of the predictions (of several individual models)...

    • Each model is fit independently of the others

    • Final prediction is a combination of the independent predictions of all models

  • For a continuous target, an ensemble averages predictions

    • Usually weighted

  • For a categorical target (classification), an ensemble may average the probabilities of the target values…or may use ‘voting’.

    • Voting classifies a case into the class that was selected most often by the individual models (see the sketch below)
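Concretely, a minimal sketch of this combination step (my own illustrative code, not from the lecture), assuming each fitted model exposes a predict method and that class labels are coded as small non-negative integers:

import numpy as np

def ensemble_average(models, X, weights=None):
    # Continuous target: (optionally weighted) average of the individual predictions
    preds = np.array([m.predict(X) for m in models])   # shape (n_models, n_cases)
    return np.average(preds, axis=0, weights=weights)

def ensemble_vote(models, X):
    # Categorical target: each case goes to the class chosen by the most models
    preds = np.array([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)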

Ensemble Models – Why they work

  • Voting example

    • 5 independent classifiers

    • 70% accuracy for each

    • Use voting…

    • What is the probability that the ensemble model is correct?

      • Let’s simulate it (see the sketch below)

    • What about 100 examples?

    • (not a realistic example, why?)
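A quick way to answer this (an illustrative sketch of my own, assuming SciPy is available; it reads “100” as roughly 100 independent voters and uses 101 to avoid ties):

import numpy as np
from scipy.stats import binom

# Exact answer: a majority of 5 voters is right when at least 3 of the 5 are right
p5 = 1 - binom.cdf(2, 5, 0.7)                 # ~0.837, up from 0.70 for a single model

# Simulation of the same quantity
rng = np.random.default_rng(0)
correct = rng.random((100_000, 5)) < 0.7      # which of the 5 voters are right on each trial
p5_sim = (correct.sum(axis=1) >= 3).mean()    # ~0.837

# With ~100 independent voters (101 to avoid ties) the majority is essentially always right
p101 = 1 - binom.cdf(50, 101, 0.7)            # ~0.99999

print(p5, p5_sim, p101)

The reason this is optimistic: classifiers fit to the same data are correlated, so their errors are not independent.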

Ensemble Schemes

  • The beauty is that you can average together models of any kind!!!

  • Don’t need fancy schemes – just average!

  • But there are fancy schemes: each one fits many models to the same data in its own way, then uses voting or averaging

    • Stacking (Wolpert 92): fit many leave-1-out models

    • Bagging (Breiman 96): build models on many bootstrap samples of the original data

    • Boosting (Freund & Schapire 96): iteratively re-model, using re-weighted data based on errors from previous models…

    • Arcing (Breiman 98), Bumping (Tibshirani 97), Crumpling (Anderson & Elder 98) , Born-Again (Breiman 98):

    • Bayesian Model Averaging - near to my heart…

  • We’ll explore BMA, bagging and boosting…

Ensemble Methods – Bayesian Model Averaging

Model Averaging

  • Idea: account for inherent variance of the model selection process

  • Posterior Variance = Within-Model Variance + Between-Model Variance

  • Data-driven model selection is risky: “Part of the evidence is spent to specify the model” (Leamer, 1978)

  • Model-based inferences can be over-precise

Model Averaging

  • For some quantity of interest $\Delta$, average over all models $M_k$, given the data $D$:

    $$P(\Delta \mid D) = \sum_{k} P(\Delta \mid M_k, D)\, P(M_k \mid D)$$

    To calculate the first term properly, you need to integrate out the model parameters $\theta_k$:

    $$P(\Delta \mid M_k, D) = \int P(\Delta \mid \theta_k, M_k, D)\, P(\theta_k \mid M_k, D)\, d\theta_k \;\approx\; P(\Delta \mid \hat{\theta}_k, M_k, D),$$

    where $\hat{\theta}_k$ is the MLE.

    For the second term, note that

    $$P(M_k \mid D) \;\propto\; P(D \mid M_k)\, P(M_k), \qquad P(D \mid M_k) = \int P(D \mid \theta_k, M_k)\, P(\theta_k \mid M_k)\, d\theta_k \;\approx\; e^{-\mathrm{BIC}_k/2}.$$
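As a sketch of how these approximations are used in practice (illustrative code of my own, assuming each candidate model reports a BIC value), the BIC-based weights P(M_k | D) ≈ exp(-BIC_k/2) / Σ_j exp(-BIC_j/2) can be computed directly:

import numpy as np

def bma_weights(bics):
    # Approximate posterior model probabilities from BIC values (smaller BIC = better model)
    bics = np.asarray(bics, dtype=float)
    rel = np.exp(-0.5 * (bics - bics.min()))   # shift by the minimum for numerical stability
    return rel / rel.sum()

def bma_predict(model_predictions, bics):
    # BMA prediction: each model's prediction weighted by its posterior model probability
    return np.average(np.asarray(model_predictions, dtype=float), axis=0, weights=bma_weights(bics))

# e.g. three candidate models with BICs 210.3, 212.1, 218.7 -> weights ~ (0.70, 0.29, 0.01)
print(bma_weights([210.3, 212.1, 218.7]))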

Bayesian Model Averaging

  • The approximations on the previous page allow you to calculate many posterior model probabilities quickly, and give you the weights to use for averaging.

  • But, how do you know which models to average over?

    • Example: regression with p candidate predictors

    • Each subset of the p predictors is a ‘model’

    • 2^p possible models!

  • Idea:

Model Averaging

  • But how to find the best models without fitting all models?

  • Solution: the Leaps and Bounds algorithm can find the best models without fitting all of them

    • Goal: find the single best model for each model size

[Figure annotation: don’t need to traverse this part of the tree, since there is no way it can beat AB]

BMA - Example

PMP = Posterior Model Probability

[Figure: the best models and their PMPs; score on holdout data: BMA wins]

Ensemble Methods - Boosting

Boosting…

  • Different approach to model ensembles – mostly for classification

  • Observed: when model predictions are not highly correlated, combining does well

  • Big idea: can we fit models specifically to the “difficult” parts of the data?

Boosting – Algorithm

From HTF p. 339
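The boosting algorithm referenced here is AdaBoost (named on a later slide); a compact illustrative sketch of AdaBoost.M1 with decision stumps as the weak learner — my own code, not the lecture's — could look like:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    # y must be coded as -1 / +1
    n = len(y)
    w = np.full(n, 1.0 / n)                              # start with equal case weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)                  # this round's weight in the final vote
        w *= np.exp(alpha * miss)                        # up-weight the "difficult" cases it missed
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    # Weighted vote of all the stumps
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))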

Example

  • [Three example slides illustrating boosting (figures only), courtesy M. Littman]

Boosting - Advantages

  • Fast algorithms - AdaBoost

  • Flexible – can work with any classification algorithm

  • Individual models don’t have to be good

    • In fact, the method works best with bad models!

    • (bad = slightly better than random guessing)

    • Most common model – “boosted stumps”


Boosting Example from HTF p. 302

Ensemble Methods – Bagging / Stacking

Bagging for Combining Classifiers

Bagging = Bootstrap aggregating

  • Big Idea:

    • To avoid overfitting to a specific dataset, fit models to “bootstrapped” random versions of the data

  • Bootstrap

    • Random sample, with replacement, from the data set

    • Size of sample = size of data

    • X= (1,2,3,4,5,6,7,8,9,10)

    • B1=(1,2,3,3,4,5,6,6,7,8)

    • B2=(1,1,1,1,2,2,2,5,6,8)

  • Bootstrap samples have roughly the same statistical properties as the original data

  • By creating similar datasets you can see how much stability there is in your data. If there is a lack of stability, averaging helps.
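A small illustrative sketch (not from the slides) of drawing a bootstrap sample, plus a check of the fact used on the next slide that roughly 63% of the original cases appear in any one sample:

import numpy as np

rng = np.random.default_rng(1)

X = np.arange(1, 11)                                      # the toy data set (1, 2, ..., 10)
boot = np.sort(rng.choice(X, size=len(X), replace=True))  # sample with replacement, same size as the data
print(boot)                                               # one bootstrap sample, like B1 / B2 above

# Fraction of distinct original cases that appear in one bootstrap sample of a large data set
n = 100_000
frac = len(np.unique(rng.integers(0, n, size=n))) / n
print(frac)                                               # ~0.632 = 1 - 1/e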

Bagging

  • Training data set of size N

  • Generate B “bootstrap” sampled data sets of size N

  • Build B models (e.g., trees), one for each bootstrap sample

    • Intuition is that the bootstrapping “perturbs” the data enough to make the models more resistant to true variability

    • Note: only ~63% of the data points are included in any given bootstrap sample

      • Can use the rest as an out-of-sample estimate!

  • For prediction, combine the predictions from the B models

    • Voting or averaging, based on the “out-of-bag” sample

    • Plus: generally improves accuracy on models such as trees

    • Negative: lose interpretability
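A sketch of bagged trees along these lines (my own illustrative code, assuming X and y are NumPy arrays; scikit-learn's BaggingClassifier packages the same idea, including an out-of-bag score option):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, B=50, seed=2):
    # Build B trees, one per bootstrap sample of the N training cases
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample of size N, with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    # Combine the B predictions by majority vote (use the mean instead for a continuous target)
    votes = np.array([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)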


HTF Bagging Example, p. 285

Ensemble Methods – Random Forests

Random Forests

  • Trees are great, but

    • As we’ve seen, they are “unstable”

    • Also, trees are sensitive to the primary split, which can lead the tree in inappropriate directions

    • One way to see this: fit trees to different random or bootstrapped samples of the data and compare them (next slide)

Example of Tree Instability


from G. Ridgeway, 2003


Random Forests

  • Solution:

    • random forests: an ensemble of decision trees

    • Similar to bagging: inject randomness to overcome instability

    • each tree is built on a random subset of the training data

      • Bootstrapped version of the data

    • at each split point, only a random subset of predictors is considered

    • Use the “out-of-bag” hold-out sample to estimate the size of each tree

    • prediction is simply a majority vote of the trees (or the mean prediction of the trees).

  • Randomizing the variables used is the key

    • Reduces correlation between models!

  • Has the advantages of trees, with more robustness and a smoother decision rule.
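In practice such a forest can be fit in a few lines; a sketch using scikit-learn (my choice of tool, not something shown in the lecture), where X_train and y_train are assumed to exist:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,            # number of trees, each grown on a bootstrap sample
    max_features="sqrt",         # random subset of predictors considered at each split point
    oob_score=True,              # "out-of-bag" estimate of accuracy, no separate hold-out needed
    random_state=0,
)
# rf.fit(X_train, y_train)
# rf.oob_score_                  # out-of-bag accuracy estimate
# rf.feature_importances_        # which variables the trees keep finding useful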


HTF Example, p. 589


Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32

Random Forests – How Big A Tree?

  • Breiman’s original algorithm said: “to keep bias low, trees are to be grown to maximum depth”

  • However, empirical evidence typically shows that “stumps” do best

Ensembles – Main Points

  • Averaging models together has been shown to be effective for prediction

  • Many weird names:

    • See papers by Leo Breiman (e.g. “Bagging Predictors”, “Arcing the Edge”, and “Random Forests”) for more detail

  • Key points

    • Models average well if they are uncorrelated

    • Can inject randomness to ensure uncorrelated models

    • Averaging small models works better than averaging large ones

  • Also, can give more insight into variables than a simple tree

    • Variables that show up again and again must be good

Visualizing Forests

  • Data: Wisconsin Breast Cancer

    • Courtesy S. Urbanek

References

  • Random Forests from Leo Breiman himself

  • Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32

  • Hastie, Tibshirani, Friedman (HTF), The Elements of Statistical Learning

    • Chapters 8, 10, 15, 16
