# End of Chapter 8

Presentation Transcript

### End of Chapter 8

Neil Weisenfeld

March 28, 2005

Outline
• 8.6 MCMC for Sampling from the Posterior
• 8.7 Bagging
• 8.7.1 Examples: Trees with Simulated Data
• 8.8 Model Averaging and Stacking
• 8.9 Stochastic Search: Bumping
MCMC for Sampling from the Posterior
• Markov chain Monte Carlo method
• Estimates parameters of a Bayesian model by sampling from their posterior distribution
• Gibbs sampling, a form of MCMC, is closely related to EM, except that it samples from the conditional distributions rather than maximizing over them
Gibbs Sampling
• Wish to draw a sample from the joint distribution of random variables $U_1, U_2, \ldots, U_K$
• This may be difficult, but suppose it is easy to simulate from the conditional distributions $\Pr(U_j \mid U_1, \ldots, U_{j-1}, U_{j+1}, \ldots, U_K)$
• The Gibbs sampler simulates from each of these conditionals in turn
• The process produces a Markov chain whose stationary distribution is the desired joint distribution
Algorithm 8.3: Gibbs Sampler
• 1. Take some initial values $U_k^{(0)}$, $k = 1, 2, \ldots, K$
• 2. Repeat for $t = 1, 2, \ldots$: for $k = 1, 2, \ldots, K$ generate $U_k^{(t)}$ from $\Pr\bigl(U_k^{(t)} \mid U_1^{(t)}, \ldots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \ldots, U_K^{(t-1)}\bigr)$
• 3. Continue step 2 until the joint distribution of $\bigl(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)}\bigr)$ does not change
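Below is a minimal runnable sketch of Algorithm 8.3 for a toy target that is not in the slides: a bivariate standard normal with correlation rho, whose full conditionals are univariate normals. The target, the variable names, and the burn-in length are illustrative assumptions only.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, seed=0):
    """Gibbs sampler (Algorithm 8.3) for an illustrative target: a bivariate
    normal with zero means, unit variances, and correlation rho.  Each full
    conditional is univariate normal, so step 2 is one draw per coordinate."""
    rng = np.random.default_rng(seed)
    u1, u2 = 0.0, 0.0                       # step 1: initial values U^(0)
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):                 # step 2: cycle through the coordinates
        u1 = rng.normal(rho * u2, np.sqrt(1 - rho**2))   # U1 | U2 ~ N(rho*u2, 1 - rho^2)
        u2 = rng.normal(rho * u1, np.sqrt(1 - rho**2))   # U2 | U1 ~ N(rho*u1, 1 - rho^2)
        samples[t] = (u1, u2)
    return samples                          # discard an initial burn-in before use

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws[1000:].T))          # off-diagonal entries should be near 0.8
```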

Gibbs Sampling
• Only the ability to sample from the conditional distributions is needed, but if the explicit form of the conditional density $\Pr(U_k \mid U_\ell,\ \ell \ne k)$ is known, then
$$\widehat{\Pr}_{U_k}(u) = \frac{1}{M - m} \sum_{t = m+1}^{M} \Pr\bigl(u \mid U_\ell^{(t)},\ \ell \ne k\bigr)$$
is a better estimate of the marginal density of $U_k$ (averaging over the samples after a burn-in of length $m$)

Gibbs sampling for mixtures
• Consider the latent data $\Delta_i$ from the EM procedure to be additional parameters, alongside $\theta = (\mu_1, \mu_2)$
• See Algorithm 8.4 (next slide): it is the same as EM, except that each step samples from a conditional distribution instead of maximizing
Algorithm 8.4: Gibbs sampling for mixtures
• 1. Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$
• 2. Repeat for $t = 1, 2, \ldots$:
• (a) For $i = 1, 2, \ldots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with $\Pr(\Delta_i^{(t)} = 1) = \hat{\gamma}_i(\theta^{(t)})$, the responsibility of component 2 for observation $i$
• (b) Set $\hat{\mu}_1 = \dfrac{\sum_i \bigl(1 - \Delta_i^{(t)}\bigr) y_i}{\sum_i \bigl(1 - \Delta_i^{(t)}\bigr)}$ and $\hat{\mu}_2 = \dfrac{\sum_i \Delta_i^{(t)} y_i}{\sum_i \Delta_i^{(t)}}$, and generate $\mu_1^{(t)} \sim N(\hat{\mu}_1, \hat{\sigma}_1^2)$ and $\mu_2^{(t)} \sim N(\hat{\mu}_2, \hat{\sigma}_2^2)$
• 3. Continue step 2 until the joint distribution of $\bigl(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)}\bigr)$ does not change
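The following is a minimal sketch of the simplified Algorithm 8.4, with a common variance and the mixing proportion held fixed. The data, the fixed values sigma = 1 and pi = 0.5, and the initialization are illustrative assumptions, not the book's example.

```python
import numpy as np

def gibbs_two_gaussian_mixture(y, sigma=1.0, pi=0.5, n_iter=2000, seed=0):
    """Simplified Algorithm 8.4: the common variance sigma^2 and the mixing
    proportion pi are held fixed; only the two means mu1, mu2 and the latent
    indicators Delta_i are sampled."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = np.min(y), np.max(y)                 # step 1: crude initial values
    trace = np.empty((n_iter, 2))
    for t in range(n_iter):
        # step 2(a): responsibilities gamma_i, then sample Delta_i in {0, 1}
        d1 = pi * np.exp(-0.5 * ((y - mu2) / sigma) ** 2)
        d0 = (1 - pi) * np.exp(-0.5 * ((y - mu1) / sigma) ** 2)
        delta = rng.binomial(1, d1 / (d0 + d1))
        # step 2(b): draw each mean from its conditional given the assignment
        y1, y2 = y[delta == 0], y[delta == 1]
        if len(y1) > 0:
            mu1 = rng.normal(y1.mean(), sigma / np.sqrt(len(y1)))
        if len(y2) > 0:
            mu2 = rng.normal(y2.mean(), sigma / np.sqrt(len(y2)))
        trace[t] = (mu1, mu2)
    return trace

# e.g. a mixture with means near 0 and 4 (illustrative data, not the book's)
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
print(gibbs_two_gaussian_mixture(y)[500:].mean(axis=0))   # roughly (0, 4)
```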

Figure 8.8: Gibbs Sampling from Mixtures

Simplified case with fixed variances and mixing proportion

Outline
• 8.6 MCMC for Sampling from the Posterior
• 8.7 Bagging
• 8.7.1 Examples: Trees with Simulated Data
• 8.8 Model Averaging and Stacking
• 8.9 Stochastic Search: Bumping
8.7 Bagging
• Use the bootstrap to improve the estimate or prediction itself
• The bootstrap mean is approximately a posterior average
• Consider a regression problem: fit a model to training data $Z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, obtaining the prediction $\hat{f}(x)$ at input $x$
• Bagging averages the estimates over bootstrap samples $Z^{*b}$, $b = 1, \ldots, B$, to produce $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$
Bagging, cont'd
• The point is to reduce the variance of the estimate while leaving the bias unchanged
• $\hat{f}_{\mathrm{bag}}(x)$ is a Monte Carlo estimate of the "true" bagging estimate $\mathrm{E}_{\hat{P}}\,\hat{f}^{*}(x)$, approaching it as $B \to \infty$
• The bagged estimate differs from the original estimate $\hat{f}(x)$ only when the latter is an adaptive or nonlinear function of the data (a sketch of the Monte Carlo average follows below)
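A small sketch of the Monte Carlo bagging average $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_b \hat{f}^{*b}(x)$. The base learner (a scikit-learn regression tree) and B = 200 are assumptions chosen for illustration; any fitting procedure with fit/predict would do.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_regression(X, y, X_new, B=200, seed=0):
    """Monte Carlo bagging: fit the base learner to B bootstrap samples of
    (X, y) and average the B predictions at each row of X_new."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(X_new)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample Z*b
        fit = DecisionTreeRegressor().fit(X[idx], y[idx])   # f-hat^{*b}
        preds[b] = fit.predict(X_new)
    return preds.mean(axis=0)                               # f-hat_bag(x)
```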
Bagging B-Spline Example
• Bagging would average the curves in the lower left-hand corner at each x value.
Quick Tree Intro
• A general partition of the feature space into arbitrary regions can't be described by a simple rule
• Recursive binary subdivision splits on one feature at a time, producing rectangular regions
• The sequence of splits can be represented as a binary tree
• The fitted $\hat{f}$ is constant within each terminal region
Bagging Trees
• Each run produces different trees
• Each tree may have different terminal nodes
• The bagged estimate is the average prediction at $x$ over the $B$ trees. The prediction can be a 0/1 indicator function, in which case bagging gives $p_k$, the proportion of trees predicting class $k$ at $x$ (see the voting sketch below).
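A sketch of the 0/1-indicator version: each bootstrapped tree casts a vote, and bagging records the proportion $p_k$ of trees predicting class $k$ at each query point. The scikit-learn classifier and B = 200 are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_class_proportions(X, y, X_new, B=200, seed=0):
    """Grow a classification tree on each bootstrap sample and return, for
    every row of X_new, the proportion p_k of trees predicting class k."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    n = len(y)
    votes = np.zeros((len(X_new), len(classes)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                     # bootstrap sample
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        pred = tree.predict(X_new)
        for k, c in enumerate(classes):
            votes[:, k] += (pred == c)                       # tally the votes for class c
    return votes / B     # bagged proportion of trees predicting each class at x
```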
8.7.1: Example Trees with Simulated Data
• Original and 5 bootstrap-grown trees
• Two classes, five features, each with a standard Gaussian distribution
• $Y$ generated according to $\Pr(Y = 1 \mid x_1 \le 0.5) = 0.2$ and $\Pr(Y = 1 \mid x_1 > 0.5) = 0.8$
• Bayes error 0.2
• Trees fit to 200 bootstrap samples
Example Performance
• High variance among trees because features have pairwise correlation 0.95.
• Bagging successfully smooths out this variance and thereby reduces the test error.
Where Bagging Doesn’t Help
• Classifier is a single axis-oriented split.
• The split is chosen along either $x_1$ or $x_2$ so as to minimize training error.
• Boosting is shown on the right.
Outline
• 8.6 MCMC for Sampling from the Posterior
• 8.7 Bagging
• 8.7.1 Examples: Trees with Simulated Data
• 8.8 Model Averaging and Stacking
• 8.9 Stochastic Search: Bumping
Model Averaging and Stacking
• More general Bayesian model averaging
• Given candidate models $\mathcal{M}_m$, $m = 1, \ldots, M$, a training set $Z$, and a quantity of interest $\zeta$ (e.g. a prediction $f(x)$ at some fixed $x$)
• The Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model: $\mathrm{E}(\zeta \mid Z) = \sum_{m=1}^{M} \mathrm{E}(\zeta \mid \mathcal{M}_m, Z)\,\Pr(\mathcal{M}_m \mid Z)$
Other Averaging Strategies
• Simple unweighted average of predictions (each model equally likely)
• BIC: use it to estimate the posterior model probabilities, weighting each model according to how well it fits and how many parameters it uses
• Full Bayesian strategy: $\Pr(\mathcal{M}_m \mid Z) \propto \Pr(\mathcal{M}_m) \cdot \Pr(Z \mid \mathcal{M}_m) = \Pr(\mathcal{M}_m) \int \Pr(Z \mid \theta_m, \mathcal{M}_m)\,\Pr(\theta_m \mid \mathcal{M}_m)\, d\theta_m$
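As a sketch of the BIC strategy, the weights below use the standard approximation $w_m \propto \exp(-\mathrm{BIC}_m/2)$ with $\mathrm{BIC}_m = -2\,\mathrm{loglik}_m + d_m \log N$; the numbers in the example call are made up for illustration.

```python
import numpy as np

def bic_weights(log_likelihoods, n_params, n_obs):
    """Approximate posterior model probabilities from BIC:
    BIC_m = -2 * loglik_m + d_m * log(N), and w_m is proportional to
    exp(-BIC_m / 2), so models that fit well with few parameters win."""
    ll = np.asarray(log_likelihoods, dtype=float)
    d = np.asarray(n_params, dtype=float)
    bic = -2.0 * ll + d * np.log(n_obs)
    w = np.exp(-0.5 * (bic - bic.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

# e.g. three candidate models fit to N = 100 observations (made-up numbers)
print(bic_weights([-120.0, -118.5, -118.0], [3, 5, 9], 100))
```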
Frequentist Viewpoint of Averaging
• Given predictions $\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_M(x)$ from $M$ models, we seek the optimal weights $w = (w_1, \ldots, w_M)$: $\hat{w} = \operatorname{argmin}_w \mathrm{E}_P\bigl[Y - \sum_{m=1}^{M} w_m \hat{f}_m(x)\bigr]^2$
• The input $x$ is fixed and the $N$ observations in $Z$ (and the target $Y$) are distributed according to $P$. The solution is the population linear regression of $Y$ on the vector of model predictions $\hat{F}(x)^T = [\hat{f}_1(x), \ldots, \hat{f}_M(x)]$: $\hat{w} = \mathrm{E}_P[\hat{F}(x)\hat{F}(x)^T]^{-1}\,\mathrm{E}_P[\hat{F}(x)\,Y]$
Notes on the Frequentist Viewpoint
• At the population level, adding models and using the optimal weights can only help: the combination is at least as good as any single model
• But the population distribution is, of course, not available
• Regression over the training set can be used instead, but this may not be ideal: model complexity is not taken into account, so complex models receive unfairly large weights (see the sketch below)
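A sketch of that naive training-set version: regress y on the in-sample predictions of each fitted model. It assumes scikit-learn-style estimators with fit/predict, and the helper name is hypothetical. Because complex models fit the training data too well, their weights come out too large, which motivates the stacking variant sketched after the next slide.

```python
import numpy as np

def training_set_weights(models, X, y):
    """Naive frequentist weights: fit each candidate model on the full
    training set, then regress y on the matrix of in-sample predictions.
    Model complexity is ignored, so flexible models are over-weighted."""
    F = np.column_stack([m.fit(X, y).predict(X) for m in models])
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w
```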
Stacked Generalization, Stacking
• Cross-validated predictions avoid giving unfairly high weight to models with high complexity: the stacking weights are $\hat{w}^{\mathrm{st}} = \operatorname{argmin}_w \sum_{i=1}^{N} \bigl[y_i - \sum_{m=1}^{M} w_m \hat{f}_m^{-i}(x_i)\bigr]^2$, where $\hat{f}_m^{-i}$ is model $m$ fit with the $i$th observation removed
• If $w$ is restricted to vectors with one weight equal to 1 and the rest 0, stacking reduces to choosing the model with the smallest leave-one-out cross-validation error
• In practice we use the combined models with the optimal weights: better prediction, but less interpretability (a sketch follows below)
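A sketch of stacking under the same assumptions as above, but with cross-validated predictions in place of the in-sample ones; K-fold cross-validation is used here as a cheaper stand-in for the leave-one-out scheme on the slide.

```python
import numpy as np
from sklearn.model_selection import KFold

def stacking_weights(models, X, y, n_splits=10, seed=0):
    """Stacked generalization: build cross-validated predictions f_m^{-i}(x_i)
    for every candidate model, then regress y on them to obtain the weights."""
    cv_preds = np.zeros((len(y), len(models)))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        for m, model in enumerate(models):
            fit = model.fit(X[train_idx], y[train_idx])      # fit without the held-out fold
            cv_preds[test_idx, m] = fit.predict(X[test_idx]) # predict on the held-out fold
    # least-squares weights minimizing sum_i (y_i - sum_m w_m f_m^{-i}(x_i))^2
    w, *_ = np.linalg.lstsq(cv_preds, y, rcond=None)
    return w
```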
Outline
• 8.6 MCMC for Sampling from the Posterior
• 8.7 Bagging
• 8.7.1 Examples: Trees with Simulated Data
• 8.8 Model Averaging and Stacking
• 8.9 Stochastic Search: Bumping
Stochastic Search: Bumping
• Rather than average models, try to find a better single model.
• Good for avoiding local minima in the fitting method.
• Like bagging, draw bootstrap samples and fit the model to each, but instead of averaging, keep the single model that best fits the original training data
Stochastic Search: Bumping
• Given $B$ bootstrap samples $Z^{*1}, \ldots, Z^{*B}$, fitting the model to each yields predictions $\hat{f}^{*b}(x)$, $b = 1, \ldots, B$
• For squared error, choose the model from bootstrap sample $\hat{b} = \operatorname{argmin}_b \sum_{i=1}^{N} \bigl[y_i - \hat{f}^{*b}(x_i)\bigr]^2$; the original training sample is included among the bootstrap samples, so the method is free to keep the original fit
• Bumping tries to move around the model space by perturbing the data.
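A minimal bumping sketch with a regression tree as the (assumed) base learner and squared error as the criterion; the original sample is included among the candidates so the procedure can keep the untouched fit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bump(X, y, B=20, seed=0):
    """Bumping: fit the model to B bootstrap samples (plus the original data)
    and keep the single fit with the smallest squared error on the original
    training set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    samples = [np.arange(n)]                           # include the original data
    samples += [rng.integers(0, n, size=n) for _ in range(B)]
    best_fit, best_err = None, np.inf
    for idx in samples:
        fit = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
        err = np.mean((y - fit.predict(X)) ** 2)       # error on the original training data
        if err < best_err:
            best_fit, best_err = fit, err
    return best_fit
```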
A contrived case where bumping helps
• Greedy tree-based algorithm tries to split on each dimension separately, first one, then the other.
• Bumping stumbles upon the right answer.