Turning Bayesian Model Averaging Into Bayesian Model Combination
    Presentation Transcript
    1. Turning Bayesian Model Averaging Into Bayesian Model Combination. Kristine Monteith, James L. Carroll, Kevin Seppi, Tony Martinez. Presented by James L. Carroll at LANL CNLS 2011 and AMS 2011. LA-UR 11-05664

    2. Abstract Bayesian methods are theoretically optimal in many situations. Bayesian model averaging is generally considered the standard model for creating ensembles of learners using Bayesian methods, but this technique is often outperformed by more ad hoc methods in empirical studies. The reason for this failure has important theoretical implications for our understanding of why ensembles work. It has been proposed that Bayesian model averaging struggles in practice because it accounts for uncertainty about which model is correct but still operates under the assumption that only one of them is. In order to more effectively access the benefits inherent in ensembles, Bayesian strategies should therefore be directed more towards model combination rather than the model selection implicit in Bayesian model averaging. This work provides empirical verification for this hypothesis using several different Bayesian model combination approaches tested on a wide variety of classification problems. We show that even the most simplistic of Bayesian model combination strategies outperforms the traditional ad hoc techniques of bagging and boosting, as well as BMA, over a wide variety of cases. This suggests that the power of ensembles does not come from their ability to account for model uncertainty, but instead comes from the changes in representational and preferential bias inherent in the process of combining several different models.

    3. Supervised UBDTM [Graphical model with nodes X, F, Y over DTrain and DTest, the decision O, and End Use Utility]

    4-8. The Mathematics of Learning [Build slides; the equations, not captured in the transcript, are reconstructed below] • Learning about F (repeat the update to get the posterior over F) • Classification or Regression • Decision making • 0-1 Loss Decision making
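    A plausible reconstruction of the standard Bayesian formulation these slides outline, assuming the notation of the UBDTM slide (F the target function, DTrain the training data, x a query input, y its label, o a decision with utility U), is:

        % Learning about F: Bayes' rule, applied and repeated over the training data
        p(F \mid D_{Train}) \propto p(D_{Train} \mid F)\, p(F)

        % Classification or regression: the posterior predictive distribution for a new x
        p(y \mid x, D_{Train}) = \int p(y \mid x, F)\, p(F \mid D_{Train})\, dF

        % Decision making: choose the decision o with maximum expected utility
        o^{*} = \arg\max_{o} \sum_{y} U(o, y)\, p(y \mid x, D_{Train})

        % 0-1 loss: expected-utility maximization reduces to picking the most probable class
        y^{*} = \arg\max_{y} p(y \mid x, D_{Train})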

    9. Realistic Machine Learning [Diagram: Training Data -> Learning Algorithm -> hypothesis; Input: Unlabeled Instance -> hypothesis -> Output: Class label]

    10-11. Using a Learner [Diagram: information about sepal length, sepal width, petal length, and petal width is fed to the hypothesis, which outputs Iris Setosa, Iris Virginica, or Iris Versicolor]
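    As a concrete stand-in for slides 9-11, here is a minimal sketch of training and querying a single learner on the iris task; scikit-learn, the decision-tree learner, and the example measurements are illustrative assumptions, not taken from the slides:

        # Minimal single-learner sketch (illustrative; not from the talk).
        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier

        iris = load_iris()
        X, y = iris.data, iris.target                      # sepal/petal measurements, class labels

        hypothesis = DecisionTreeClassifier(random_state=0).fit(X, y)

        # Query the hypothesis on one unlabeled instance.
        instance = [[5.1, 3.5, 1.4, 0.2]]                  # sepal length/width, petal length/width
        for name, p in zip(iris.target_names, hypothesis.predict_proba(instance)[0]):
            print(name, round(p, 3))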

    12. Using a Learner [Same diagram] Can we do better?

    13. Ensembles • Multiple learners vs. a single learner

    14. Creating Ensemble Diversity [Diagram: the Training Data yields data1 ... data5, each fed to the same Learning Algorithm to produce hypothesis1 ... hypothesis5]

    15. Creating Ensemble Diversity [Diagram: the same Training Data is fed to algorithm1 ... algorithm5, producing hypothesis1 ... hypothesis5]
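    A minimal sketch of the two diversity strategies on slides 14-15, again using scikit-learn learners as assumed stand-ins (bootstrap resampling of the data for slide 14, different algorithms for slide 15):

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.naive_bayes import GaussianNB
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        rng = np.random.default_rng(0)

        # Slide 14: one algorithm, several resampled training sets.
        data_diverse = []
        for _ in range(5):
            idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample of the training data
            data_diverse.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

        # Slide 15: one training set, several different algorithms.
        algorithm_diverse = [
            DecisionTreeClassifier().fit(X, y),
            GaussianNB().fit(X, y),
            KNeighborsClassifier().fit(X, y),
        ]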

    16-17. Classifying an Instance [Diagram: an unlabeled instance is given to hypotheses h1 ... h5, each returning a class distribution (h1: Setosa 0.1, Virginica 0.3, Versicolor 0.6; h2: 0.3, 0.3, 0.4; h3: 0.4, 0.5, 0.1; h4 and h5 values not captured); the distributions are combined into an output class label]

    18. Possible Options for Combining Hypotheses • Bagging: one hypothesis, one vote • Boosting: weight by predictive accuracy on the training set • Bayesian Model Averaging (BMA): weight by the formal probability that each hypothesis is correct given all the data (xi: unlabeled instance; yi: probability of the class label)
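    To illustrate how the three weighting schemes differ, the sketch below combines the per-hypothesis class distributions shown on slides 16-17; the boosting and BMA weights are invented placeholders, since the slides do not give them:

        import numpy as np

        # Class distributions from three hypotheses (columns: Setosa, Virginica, Versicolor).
        preds = np.array([
            [0.1, 0.3, 0.6],   # h1
            [0.3, 0.3, 0.4],   # h2
            [0.4, 0.5, 0.1],   # h3
        ])

        def combine(preds, weights):
            # Weighted mixture of the hypotheses' class distributions.
            w = np.asarray(weights, dtype=float)
            return (w / w.sum()) @ preds

        # Bagging: one hypothesis, one vote (usually a hard majority vote; shown here
        # as its soft-vote analogue with uniform weights).
        print(combine(preds, [1, 1, 1]))

        # Boosting-style: weight by training-set accuracy (placeholder accuracies).
        print(combine(preds, [0.80, 0.85, 0.90]))

        # BMA: weight by the posterior probability of each hypothesis (placeholder posteriors).
        print(combine(preds, [0.05, 0.15, 0.80]))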

    19-21. BMA Two Steps • Step 0: Train learners • Step 1: Grade learners • Step 2: Use learners • Optimal Solution: [equation reconstructed below]
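    Under the usual BMA formulation, the Optimal Solution referenced on slides 19-21 is the pair of equations below, where h ranges over the trained hypotheses and D is the training data (a reconstruction rather than the slide's own rendering):

        % Step 1 (grade): posterior probability of each hypothesis
        p(h \mid D) \propto p(D \mid h)\, p(h)

        % Step 2 (use): average the hypotheses' predictions, weighted by their posteriors
        p(y \mid x, D) = \sum_{h} p(y \mid x, h)\, p(h \mid D)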

    22. Bayesian Techniques • “Bayes is right, and everything else is wrong, or is a (potentially useful) approximation.” - James Carroll • “Please compare your algorithm to Bayesian Model Averaging.” - Reviewer for a conference where Kristine submitted her thesis research on ensemble learning

    23. BMA is the “Optimal” Ensemble Technique? “Given the ‘correct’ model space and prior distribution, Bayesian model averaging is the optimal method for making predictions; in other words, no other approach can consistently achieve lower error rates than it does.” - Pedro Domingos

    24. Domingos’ Experiments • Domingos decided to put this theory to the test. • His 2000 empirical study of ensemble methods compared: • J48 • Bagging • BMA

    25-27. Domingos’ Experiments [Results figures not captured in the transcript]

    28. Domingos’ Observation: • Bayesian Model Averaging gives too much weight to the “maximum likelihood” hypothesis

    29. Domingos’ Observation: • Bayesian Model Averaging gives too much weight to the “maximum likelihood” hypothesis. • Compare two classifiers on 100 data points, one with 95% predictive accuracy and one with 94%: Bayesian Model Averaging weights the first classifier as 17 TIMES more likely!
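    The 17x figure can be reproduced by treating each classifier's record on the 100 training points as an independent Bernoulli likelihood with equal priors; this is an assumed reconstruction of the arithmetic, not code from the talk:

        # Likelihood of the training data under each classifier.
        like_95 = 0.95 ** 95 * 0.05 ** 5    # 95 correct, 5 wrong
        like_94 = 0.94 ** 94 * 0.06 ** 6    # 94 correct, 6 wrong

        # With equal priors, the posterior odds equal the likelihood ratio.
        print(like_95 / like_94)            # ~17.2: the 95%-accurate model gets ~17x the weight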

    30. Clarke’s Experiments • 2003 comparison between BMA and stacking • Similar results to Domingos: BMA is vastly outperformed by stacking

    31-34. Clarke’s Claim [Build slides: hypotheses h1, h2, h3 in model space, the Data Generating Model (DGM), and the projection of the DGM onto the hypotheses]

    35-42. Clarke’s Claim: • BMA converges to the single model closest to the Data Generating Model (DGM) instead of converging to the combination closest to the DGM! [Animation over h1, h2, h3 and the DGM]

    43-46. Is Clarke Correct? [Figures not captured in the transcript]

    47. Is Clarke Correct? • 5 samples

    48. Is Clarke Correct? • 10 samples

    49. Is Clarke Correct? • 15 samples

    50. Is Clarke Correct? • 20 samples
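    The plots behind slides 43-50 are not captured in the transcript. The sketch below is a hedged illustration of the behavior they examine: as the sample count grows from 5 to 20, BMA's posterior weight concentrates on the single most accurate hypothesis rather than settling on a combination. The three accuracies are invented for illustration.

        import numpy as np

        # Assumed per-hypothesis accuracies (illustrative only).
        accuracies = np.array([0.80, 0.75, 0.70])

        for n in (5, 10, 15, 20):
            # Roughly how many of the n points each hypothesis classifies correctly.
            correct = np.rint(accuracies * n)
            # Likelihood of that record under each hypothesis, and the resulting
            # BMA posterior weights given a uniform prior over the hypotheses.
            likelihood = accuracies ** correct * (1 - accuracies) ** (n - correct)
            posterior = likelihood / likelihood.sum()
            print(n, np.round(posterior, 3))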