
Topic 10 - Ensemble Methods

Data Mining - Volinsky - 2011 - Columbia University



Ensemble Models - Motivation

  • Remember this picture?

  • Always looking for a balance between low complexity (‘good on average’, i.e. stable, but may miss structure and predict poorly) and high complexity (‘good for specific cases’, but might overfit)

  • By combining many different models, ensembles make it easier to hit the ‘sweet spot’ of modelling.

  • Best for models to draw from diverse, independent opinions

    • Wisdom Of Crowds

[Figure: training error S_train(θ) and test error S_test(θ) as a function of model complexity]


Ensemble Methods - Motivation

  • Models are just models.

    • The assumption that any single model is ‘the truth’? Usually not true!

    • The truth is often much more complex than any single model can capture.

    • Combinations of simple models can be arbitrarily complex. (e.g. spam/robots models, neural nets, splines)

  • Notion: An average of several measurements is often more accurate and stable than a single measurement

    Accuracy: how well the model does for estimation and prediction

    Stability: small changes in inputs have little effect on outputs

Data Mining - Volinsky - 2011 - Columbia University


Ensemble methods how they work l.jpg

Ensemble Methods – How They Work

  • The ensemble predicts a target value as an average or a vote of the predictions of several individual models (see the sketch after this list)…

    • Each model is fit independently of the others

    • Final prediction is a combination of the independent predictions of all models

  • For a continuous target, an ensemble averages predictions

    • Usually weighted

  • For a categorical target (classification), an ensemble may average the probabilities of the target values…or may use ‘voting’.

    • Voting classifies a case into the class that was selected most by individual models
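
A minimal sketch of these two combination rules, with made-up numbers (the predictions, weights, and labels below are illustrative, not from the slides):

```python
# Combining predictions from several already-fitted models.
import numpy as np

# Continuous target: (usually weighted) average of the models' predictions.
preds = np.array([2.9, 3.4, 3.1])            # one prediction per model
weights = np.array([0.5, 0.3, 0.2])          # e.g., posterior model probabilities
y_hat = np.average(preds, weights=weights)   # weighted average -> 3.09

# Categorical target, option 1: vote over predicted class labels.
votes = np.array(["spam", "ham", "spam"])
labels, counts = np.unique(votes, return_counts=True)
y_class = labels[counts.argmax()]            # -> "spam"

# Categorical target, option 2: average the predicted class probabilities.
probs = np.array([[0.8, 0.2], [0.4, 0.6], [0.7, 0.3]])
y_prob = probs.mean(axis=0)                  # -> [0.633, 0.367]
```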


Ensemble Models – Why They Work

  • Voting example

    • 5 independent classifiers

    • 70% accuracy for each

    • Use voting…

    • What is the probability that the ensemble model is correct?

      • Let’s simulate it (see the sketch below)

    • What about 100 classifiers?

    • (not a realistic example; why? Because real classifiers’ errors are rarely independent)
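
A quick check of the voting example, both exactly (binomial) and by simulation; the numbers follow from the stated 70% accuracy and the independence assumption:

```python
# P(majority of 5 independent 70%-accurate classifiers is correct).
import numpy as np
from math import comb

# Exact: at least 3 of 5 correct, Binomial(5, 0.7).
p5 = sum(comb(5, k) * 0.7**k * 0.3**(5 - k) for k in range(3, 6))
print(f"5 classifiers:   {p5:.4f}")          # 0.8369

# Monte Carlo check of the same quantity.
rng = np.random.default_rng(0)
correct = rng.random((100_000, 5)) < 0.7
print("simulated:      ", (correct.sum(axis=1) >= 3).mean())

# With 100 classifiers, a majority (>= 51 correct) is almost surely right.
p100 = sum(comb(100, k) * 0.7**k * 0.3**(100 - k) for k in range(51, 101))
print(f"100 classifiers: {p100:.6f}")        # ~0.99999
```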


Ensemble Schemes

  • The beauty is that you can average together models of any kind!!!

  • Don’t need fancy schemes – just average!

  • But there are fancy schemes: each fits many models to the same data in its own way, then combines them by voting or averaging

    • Stacking (Wolpert 92): fit many leave-1-out models

    • Bagging (Breiman 96): build models on many bootstrap samples of the original data

    • Boosting (Freund & Schapire 96): iteratively re-model, using re-weighted data based on errors from previous models…

    • Arcing (Breiman 98), Bumping (Tibshirani 97), Crumpling (Anderson & Elder 98), Born-Again (Breiman 98), …

    • Bayesian Model Averaging - near to my heart…

  • We’ll explore BMA, bagging and boosting…


Ensemble Methods – Bayesian Model Averaging


Model Averaging

  • Idea: account for inherent variance of the model selection process

  • Posterior Variance = Within-Model Variance + Between-Model Variance (written out below)

  • Data-driven model selection is risky: “Part of the evidence is spent to specify the model” (Leamer, 1978)

  • Model-based inferences can be over-precise
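
Written out, the decomposition above is the law of total variance applied across models (a standard identity, stated here in the notation of the next slide):

```latex
\operatorname{Var}(\Delta \mid D)
  = \underbrace{\mathrm{E}_{M}\!\left[\operatorname{Var}(\Delta \mid M, D)\right]}_{\text{within-model variance}}
  + \underbrace{\operatorname{Var}_{M}\!\left[\mathrm{E}(\Delta \mid M, D)\right]}_{\text{between-model variance}}
```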


Model Averaging

  • For some quantity of interest $\Delta$, average over all models $M_k$, given the data D:

    $P(\Delta \mid D) = \sum_{k} P(\Delta \mid M_k, D)\, P(M_k \mid D)$

    To calculate the first term properly, you need to integrate out the model parameters $\theta_k$:

    $P(\Delta \mid M_k, D) = \int P(\Delta \mid \theta_k, M_k, D)\, P(\theta_k \mid M_k, D)\, d\theta_k \approx P(\Delta \mid \hat{\theta}_k, M_k, D)$,

    where $\hat{\theta}_k$ is the MLE. For the second term, note that

    $P(M_k \mid D) = \dfrac{P(D \mid M_k)\, P(M_k)}{\sum_{l} P(D \mid M_l)\, P(M_l)}$, with the marginal likelihood approximated by $\log P(D \mid M_k) \approx -\tfrac{1}{2}\,\mathrm{BIC}_k$.


Bayesian Model Averaging

  • The approximations on the previous page allow you to calculate many posterior model probabilities quickly, and give you the weights to use for averaging.

  • But, how do you know which models to average over?

    • Example: regression with p candidate predictors

    • Each subset of the p predictors is a ‘model’

    • 2^p possible models!

  • Idea: average over only the best models, rather than all 2^p (a sketch of BIC-based weighting follows)
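
A minimal sketch of that idea for linear regression, using the BIC approximation from the previous slide to weight models. Everything here (the brute-force enumeration, the uniform model prior, the Gaussian-likelihood BIC) is an illustrative assumption, and enumerating all subsets is only feasible for small p, which is exactly why the next slide turns to leaps and bounds:

```python
# BIC-weighted model averaging over all subsets of p predictors.
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def bma_weights(X, y):
    n, p = X.shape
    models, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            # Gaussian likelihood: BIC = n log(RSS/n) + (num. params) log(n)
            bics.append(n * np.log(rss(Xs, y) / n) + (k + 1) * np.log(n))
            models.append(subset)
    bics = np.array(bics)
    w = np.exp(-(bics - bics.min()) / 2)  # P(M_k | D) ~ exp(-BIC_k / 2)
    return models, w / w.sum()            # posterior model probabilities
```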


Model Averaging

  • But how to find the best models without fitting all models?

  • Solution: Leaps and Bounds algorithm can find the best model without fitting all models

    • Goal: find the single best model for each model size

[Figure: leaps-and-bounds search tree. There is no need to traverse a branch of the tree that cannot beat the current best model, e.g., model AB.]


BMA - Example

PMP = Posterior Model Probability

[Figure: the best individual models ranked by PMP; scored on holdout data, BMA wins]


Ensemble Methods - Boosting


Boosting…

  • Different approach to model ensembles – mostly for classification

  • Observed: when model predictions are not highly correlated, combining does well

  • Big idea: can we fit models specifically to the “difficult” parts of the data?


Boosting – Algorithm

From HTF p. 339
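
A minimal sketch of the referenced algorithm, on the assumption (consistent with the AdaBoost mention on the next slide) that it is AdaBoost.M1. Labels are coded -1/+1, depth-1 trees serve as the weak learner, and the clipping of the error rate is a numerical guard, not part of HTF's statement:

```python
# AdaBoost.M1 with decision stumps as the weak learner.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """y must be coded as -1 / +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # 1. uniform observation weights
    stumps, alphas = [], []
    for _ in range(M):                        # 2. for m = 1, ..., M:
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      #    (a) fit to weighted data
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()    #    (b) weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)  #        (numerical guard)
        alpha = np.log((1 - err) / err)       #    (c) weight of this learner
        w *= np.exp(alpha * (pred != y))      #    (d) up-weight the mistakes
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # 3. final prediction: sign of the weighted vote.
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```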


Example

  • Courtesy M. Littman


Example

  • Courtesy M. Littman


Example

  • Courtesy M. Littman


Boosting - Advantages

  • Fast algorithms - AdaBoost

  • Flexible – can work with any classification algorithm

  • Individual models don’t have to be good

    • In fact, the method works best with bad models!

    • (bad = slightly better than random guessing)

    • Most common model – “boosted stumps” (one-call example below)
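
For reference, boosted stumps in a single call via scikit-learn; a sketch, not part of the slides (note the parameter is named base_estimator in sklearn versions before 1.2):

```python
# "Boosted stumps": AdaBoost over depth-1 trees, via scikit-learn.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

boosted_stumps = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak: barely beats guessing
    n_estimators=200,
)
# boosted_stumps.fit(X_train, y_train); boosted_stumps.score(X_test, y_test)
```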


Boosting Example from HTF p. 302


Ensemble Methods – Bagging / Stacking


Bagging for Combining Classifiers

Bagging = Bootstrap aggregating

  • Big Idea:

    • To avoid overfitting to any one specific dataset, fit models to “bootstrapped” random sets of the data

  • Bootstrap

    • Random sample, with replacement, from the data set

    • Size of sample = size of data

    • X= (1,2,3,4,5,6,7,8,9,10)

    • B1=(1,2,3,3,4,5,6,6,7,8)

    • B2=(1,1,1,1,2,2,2,5,6,8)

  • Bootstrap samples have (approximately) the same statistical properties as the original data

  • By creating many similar datasets you can see how much stability there is in your results. If there is a lack of stability, averaging helps (see the demo below).
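
A quick demo of bootstrap sampling and the out-of-bag fraction (the random seed and sizes are arbitrary):

```python
# Bootstrap: sample n items with replacement from a dataset of size n.
import numpy as np

rng = np.random.default_rng(1)
X = np.arange(1, 11)                        # X = (1, 2, ..., 10)
B1 = np.sort(rng.choice(X, size=len(X)))    # replace=True is the default
print(B1)                                   # e.g., [1 2 2 4 5 6 6 8 9 9]

# Fraction of distinct original points that land in a bootstrap sample:
# approaches 1 - 1/e ~ 0.632 as n grows.
n = 1000
frac = np.mean([len(np.unique(rng.choice(n, size=n))) / n for _ in range(200)])
print(f"{frac:.3f}")                        # ~0.632
```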


Bagging

  • Training data set of size N

  • Generate B “bootstrap” sampled data sets of size N

  • Build B models (e.g., trees), one for each bootstrap sample

    • Intuition is that the bootstrapping “perturbs” the data enough to make the models more resistant to true variability

    • Note: only ~63% of the data points are included in any bootstrap sample

      • Can use the rest as an out-of-sample estimate!

  • For prediction, combine the predictions from the B models (a sketch follows this list)

    • Voting or averaging, with the “out-of-bag” sample as a built-in holdout

    • Plus: generally improves accuracy for models such as trees

    • Negative: lose interpretability
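
A minimal sketch of bagged trees with an out-of-bag estimate, assuming a binary 0/1 target (scikit-learn's BaggingClassifier packages the same idea):

```python
# Bagging: B trees on B bootstrap samples, majority vote, OOB evaluation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    """y must be coded as 0 / 1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models, oob_votes = [], np.zeros((n, 2))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample of size n
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        oob = np.setdiff1d(np.arange(n), idx)     # the ~37% left out this round
        oob_votes[oob, models[-1].predict(X[oob])] += 1
    oob_pred = oob_votes.argmax(axis=1)
    print("OOB accuracy:", (oob_pred == y).mean())
    return models

def bagged_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])  # shape (B, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)     # majority vote
```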


HTF Bagging Example p 285


Ensemble Methods – Random Forests


Random Forests

  • Trees are great, but

    • As we’ve seen, they are “unstable”

    • Also, trees are sensitive to the primary split, which can lead the tree in inappropriate directions

    • One way to see this: fit trees to random samples, or bootstrapped samples, of the data and compare the results


Example of Tree Instability


from G. Ridgeway, 2003



Random Forests

  • Solution:

    • Random forests: an ensemble of decision trees

    • Similar to bagging: inject randomness to overcome instability

    • Each tree is built on a random subset of the training data

      • A bootstrapped version of the data

    • At each split point, only a random subset of predictors is considered

    • Use the “out-of-bag” holdout sample to estimate the size of each tree

    • Prediction is simply a majority vote of the trees (or the mean prediction of the trees)

  • Randomizing the variables used is the key

    • Reduces correlation between models!

  • Has the advantages of trees, with more robustness and a smoother decision rule (see the example below)
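
A short random-forest run with scikit-learn, echoing the recipe above; the dataset choice (sklearn's built-in Wisconsin breast-cancer data, matching the later slides) and the settings are illustrative:

```python
# Random forest: bootstrap sample per tree + random predictor subset per split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",     # random subset of predictors at each split
    oob_score=True,          # "out-of-bag" cases as a built-in holdout
    random_state=0,
).fit(Xtr, ytr)

print("OOB accuracy: ", rf.oob_score_)
print("test accuracy:", rf.score(Xte, yte))
# Variables that show up in many trees get high importance scores:
print(rf.feature_importances_.round(3))
```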


HTF Example p 589


Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32


Random Forests – How Big A Tree

  • Breiman’s original algorithm said: “to keep bias low, trees are to be grown to maximum depth”

  • However, empirical evidence typically shows that “stumps” do best


Ensembles – Main Points

  • Averaging models together has been shown to be effective for prediction

  • Many weird names:

    • See papers by Leo Breiman (e.g. “Bagging Predictors”, “Arcing the Edge”, and “Random Forests”) for more detail

  • Key points

    • Models average well if they are uncorrelated

    • Can inject randomness to ensure uncorrelated models

    • Averaging small models works better than averaging large ones

  • Also, can give more insight into variables than simple tree

    • Variables that show up again and again must be good


Visualizing Forests

  • Data: Wisconsin Breast Cancer

    • Courtesy S. Urbanek


References

  • Random Forests from Leo Breiman himself

  • Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32

  • Hastie, Tibshirani, Friedman (HTF), The Elements of Statistical Learning

    • Chapters 8, 10, 15, 16
