- By **booth**

## PowerPoint Slideshow about 'End of Chapter 8' - booth

Presentation Transcript

Outline

- 8.6 MCMC for Sampling from the Posterior
- 8.7 Bagging
- 8.7.1 Examples: Trees with Simulated Data

- 8.8 Model Averaging and Stacking
- 8.9 Stochastic Search: Bumping

MCMC for Sampling from the Posterior

- Markov chain Monte Carlo (MCMC) methods
- Estimate the parameters of a Bayesian model by sampling from the posterior distribution
- Gibbs sampling, a form of MCMC, is like EM, except that it samples from the conditional distributions rather than maximizing over them

Gibbs Sampling

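- Wish to draw a sample from the joint distribution of random variables $U_1, U_2, \ldots, U_K$
- Sampling from the joint distribution may be difficult, while simulating from the conditional distributions $\Pr(U_j \mid U_1, \ldots, U_{j-1}, U_{j+1}, \ldots, U_K)$, $j = 1, \ldots, K$, is easy
- The Gibbs sampler simulates from each of these in turn
- The process produces a Markov chain whose stationary distribution is the desired joint distribution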

Algorithm 8.3: Gibbs Sampler

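- Take some initial values $U_k^{(0)}$, $k = 1, 2, \ldots, K$
- Repeat for $t = 1, 2, \ldots$:
- For $k = 1, 2, \ldots, K$ generate $U_k^{(t)}$ from $\Pr(U_k^{(t)} \mid U_1^{(t)}, \ldots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \ldots, U_K^{(t-1)})$

- Continue step 2 until the joint distribution of $(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)})$ does not change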
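
The steps of the Gibbs sampler can be sketched on a toy target. This is a minimal illustration, not taken from the slides: it assumes a standard bivariate normal with correlation `rho`, for which each conditional is $N(\rho u, 1 - \rho^2)$. The function name, the number of iterations, and the burn-in length are all illustrative choices.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    u1, u2 = 0.0, 0.0                   # step 1: initial values
    cond_sd = np.sqrt(1.0 - rho ** 2)   # sd of each conditional distribution
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):             # step 2: sweep over the components
        u1 = rng.normal(rho * u2, cond_sd)   # U1 | U2 ~ N(rho * U2, 1 - rho^2)
        u2 = rng.normal(rho * u1, cond_sd)   # U2 | U1, using the new U1
        draws[t] = (u1, u2)
    return draws[burn_in:]              # drop early draws (assumed burn-in)

samples = gibbs_bivariate_normal(rho=0.6)
print(np.corrcoef(samples.T)[0, 1])     # should be close to rho
```

In practice the stopping rule in step 3 is replaced by a fixed number of sweeps plus a burn-in period, as assumed here.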

Gibbs Sampling

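- Only need to be able to sample from the conditional distributions, but if the explicit form of the conditional density $\Pr(U_k \mid U_\ell, \ell \neq k)$ is known, then $\widehat{\Pr}_{U_k}(u) = \frac{1}{M - m + 1} \sum_{t=m}^{M} \Pr(u \mid U_\ell^{(t)}, \ell \neq k)$ is a better estimate of the marginal density of $U_k$; the first $m - 1$ draws are dropped as a burn-in period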

Gibbs sampling for mixtures

- Treat the latent data from the EM procedure (the component indicators $\Delta_i$) as another parameter for the Gibbs sampler
- See the algorithm (next slide): same as EM, except that we sample instead of maximize
- Additional steps can be added to include other informative priors

Algorithm 8.4: Gibbs sampling for mixtures

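- Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$
- Repeat for $t = 1, 2, \ldots$:
- For $i = 1, 2, \ldots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with $\Pr(\Delta_i^{(t)} = 1) = \hat\gamma_i(\theta)$, the EM responsibility evaluated at the current parameter values
- Set $\hat\mu_1 = \frac{\sum_{i=1}^{N} (1 - \Delta_i^{(t)})\, y_i}{\sum_{i=1}^{N} (1 - \Delta_i^{(t)})}$ and $\hat\mu_2 = \frac{\sum_{i=1}^{N} \Delta_i^{(t)}\, y_i}{\sum_{i=1}^{N} \Delta_i^{(t)}}$, then generate $\mu_1^{(t)} \sim N(\hat\mu_1, \hat\sigma_1^2)$ and $\mu_2^{(t)} \sim N(\hat\mu_2, \hat\sigma_2^2)$

- Continue step 2 until the joint distribution of $(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ doesn't change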
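
A rough sketch of Gibbs sampling for a two-component Gaussian mixture with fixed variances and mixing proportion follows. Several things here are assumptions of the sketch rather than content of the slides: the function name, the simulated data, the starting values, and the choice to draw each mean from $N(\bar y_k, \sigma_k^2 / n_k)$, which is the conditional posterior under a flat prior on the means.

```python
import numpy as np

def gibbs_mixture(y, sigma1=1.0, sigma2=1.0, pi=0.5, n_iter=500, seed=0):
    """Gibbs sampler for the two means of a two-component Gaussian mixture."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = y.min(), y.max()          # step 1: crude initial values (assumed)
    trace = np.empty((n_iter, 2))
    for t in range(n_iter):
        # step 2(a): responsibilities at the current means, then sample indicators
        d1 = (1 - pi) * np.exp(-0.5 * ((y - mu1) / sigma1) ** 2) / sigma1
        d2 = pi * np.exp(-0.5 * ((y - mu2) / sigma2) ** 2) / sigma2
        gamma = d2 / (d1 + d2)
        delta = rng.random(y.size) < gamma   # True means component 2
        # step 2(b): sample each mean around the average of its assigned points
        n1, n2 = (~delta).sum(), delta.sum()
        if n1 > 0:
            mu1 = rng.normal(y[~delta].mean(), sigma1 / np.sqrt(n1))
        if n2 > 0:
            mu2 = rng.normal(y[delta].mean(), sigma2 / np.sqrt(n2))
        trace[t] = (mu1, mu2)
    return trace

# Simulated data (illustrative only): components centred at 0 and 4
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
trace = gibbs_mixture(y)
post_mean = trace[100:].mean(axis=0)     # drop a burn-in period, then average
print(post_mean)
```

The structure mirrors EM: the indicator draw replaces the expectation step and the mean draw replaces the maximization step.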

Figure 8.8: Gibbs Sampling from Mixtures

Simplified case with fixed variances and mixing proportion

8.7 Bagging

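- Use the bootstrap to improve the estimate itself
- The bootstrap mean is approximately a posterior average
- Consider a regression problem: fit a model to training data $Z = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, obtaining the prediction $\hat f(x)$ at input $x$
- Bagging averages this prediction over bootstrap samples $Z^{*b}$, $b = 1, \ldots, B$, to produce $\hat f_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x)$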

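Bagging, cont'd

- The point is to reduce the variance of the estimate while leaving the bias unchanged
- The bagged estimate $\hat f_{\text{bag}}(x)$, an average over $B$ bootstrap samples, is a Monte Carlo estimate of the "true" bagging estimate $E_{\hat{\mathcal P}} \hat f^*(x)$, approaching it as $B \to \infty$ (here $\hat{\mathcal P}$ is the empirical distribution of the training data)
- The bagged estimate differs from the original estimate only when the latter is an adaptive or non-linear function of the data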
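
As a hedged illustration of bagging an adaptive, non-linear estimator, the sketch below bags a one-split regression stump. The stump fitter, the simulated sine data, and the choice of `B=200` are assumptions made for the example, not details from the slides.

```python
import numpy as np

def fit_stump(x, y):
    """Least-squares regression stump: one split, constant on each side."""
    best = (np.inf, x.max(), y.mean(), y.mean())   # fallback: no split at all
    for s in np.unique(x)[:-1]:                    # candidate split points
        left, right = y[x <= s], y[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lo, hi = best
    return lambda x_new: np.where(x_new <= s, lo, hi)

def bagged_predict(x, y, x_new, B=200, seed=0):
    """Average the stump predictions over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    preds = np.empty((B, x_new.size))
    for b in range(B):
        idx = rng.integers(0, x.size, size=x.size)   # bootstrap sample Z*b
        preds[b] = fit_stump(x[idx], y[idx])(x_new)
    return preds.mean(axis=0)

# Simulated data (illustrative only)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
x_new = np.linspace(0.0, 1.0, 11)
single = fit_stump(x, y)(x_new)       # piecewise constant with one jump
bagged = bagged_predict(x, y, x_new)  # smoother: the split point varies by sample
print(single)
print(bagged)
```

A single stump takes at most two distinct values, while the bagged prediction varies more gradually because each bootstrap sample can place the split somewhere else.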

Bagging B-Spline Example

- Bagging would average the curves in the lower left-hand corner at each x value.

Quick Tree Intro

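- A general partition of the feature space that recursive binary splitting can't produce.
- A partition obtained by recursive binary subdivision.
- The tree corresponding to that partition.
- The resulting piecewise-constant prediction surface, f-hat.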

Bagging Trees

- Each bootstrap sample produces a different tree
- Each tree may have different terminal nodes
- The bagged estimate is the average prediction at x from the B trees. The prediction can be a 0/1 indicator function, in which case bagging gives p_k, the proportion of trees predicting class k at x.
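
The class proportions can be computed directly from a table of per-tree votes. In this small sketch the `votes` array is made up for illustration; in practice each row would hold the class predictions of one bootstrap-grown tree.

```python
import numpy as np

def bagged_class_proportions(votes, n_classes):
    """votes: (B, n_points) integer class labels, one row per bootstrap tree.

    Returns an (n_points, n_classes) array whose column k is the
    proportion of trees predicting class k at each point.
    """
    return np.stack([(votes == k).mean(axis=0) for k in range(n_classes)], axis=1)

# Hypothetical votes from B = 4 trees at 3 query points
votes = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 1, 1]])
p = bagged_class_proportions(votes, n_classes=2)
bagged_class = p.argmax(axis=1)   # class with the most votes (ties go to the lower label)
print(p)
print(bagged_class)
```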

8.7.1 Example: Trees with Simulated Data

- Original tree and 5 bootstrap-grown trees
- Two classes, five features, each with a Gaussian distribution
- Y generated from $\Pr(Y = 1 \mid x_1 \le 0.5) = 0.2$, $\Pr(Y = 1 \mid x_1 > 0.5) = 0.8$
- Bayes error 0.2
- Trees fit to 200 bootstrap samples

Example Performance

- High variance among the trees because the features have pairwise correlation 0.95.
- Bagging successfully smooths out this variance and reduces the test error.

Where Bagging Doesn’t Help

- Classifier is a single axis-oriented split.
- Split is chosen along either x1 or x2 in order to minimize training error.
- Boosting is shown on the right.

Model Averaging and Stacking

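- A more general view: Bayesian model averaging
- Given candidate models $\mathcal M_m$, $m = 1, \ldots, M$, a training set $Z$, and a quantity of interest $\zeta$ (for example, a prediction $f(x)$ at a fixed $x$), the posterior mean is $E(\zeta \mid Z) = \sum_{m=1}^{M} E(\zeta \mid \mathcal M_m, Z) \Pr(\mathcal M_m \mid Z)$
- The Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model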

Other Averaging Strategies

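- Simple unweighted average of the predictions (each model treated as equally likely)
- BIC: use it to estimate the posterior model probabilities, so each model's weight depends on how well it fits and how many parameters it uses
- Full Bayesian strategy: $\Pr(\mathcal M_m \mid Z) \propto \Pr(\mathcal M_m) \cdot \Pr(Z \mid \mathcal M_m) = \Pr(\mathcal M_m) \int \Pr(Z \mid \theta_m, \mathcal M_m) \Pr(\theta_m \mid \mathcal M_m)\, d\theta_m$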
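
The BIC-based weighting can be sketched as below. The sketch assumes BIC is on the usual $-2 \cdot \text{loglik} + (\log N) \cdot d$ scale, so that $e^{-\text{BIC}_m / 2}$, normalized over models, approximates the posterior model probability. The function names and the BIC values are hypothetical.

```python
import numpy as np

def bic_weights(bic):
    """Approximate posterior model probabilities from BIC values."""
    bic = np.asarray(bic, dtype=float)
    w = np.exp(-0.5 * (bic - bic.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

def averaged_prediction(preds, weights):
    """preds: (M, n_points) predictions from M models; returns their weighted average."""
    return np.asarray(weights) @ np.asarray(preds)

# Hypothetical BIC values and predictions for M = 3 models at 2 points
w = bic_weights([100.0, 102.0, 110.0])
preds = np.array([[1.0, 2.0],
                  [2.0, 3.0],
                  [3.0, 4.0]])
avg = averaged_prediction(preds, w)
print(w)
print(avg)
```

Setting all weights to `1 / M` instead recovers the simple unweighted average.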

Frequentist Viewpoint of Averaging

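- Given predictions $\hat f_1(x), \ldots, \hat f_M(x)$ from $M$ models, under squared-error loss we seek the optimal weights $w = (w_1, \ldots, w_M)$: $\hat w = \arg\min_w E_{\mathcal P} \big[ Y - \sum_{m=1}^{M} w_m \hat f_m(x) \big]^2$
- The input $x$ is fixed and the $N$ observations in $Z$ are distributed according to $\mathcal P$. The solution is the population linear regression of $Y$ on the vector of model predictions $\hat F(x)^T = [\hat f_1(x), \ldots, \hat f_M(x)]$: $\hat w = E_{\mathcal P}[\hat F(x) \hat F(x)^T]^{-1} E_{\mathcal P}[\hat F(x) Y]$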

Notes on the Frequentist Viewpoint

- At the population level, combining the models with optimally chosen weights can only help: the weighted combination has error no larger than that of any single model.
- But the population is, of course, not available
- Regression over the training set can be used instead, but this may not be ideal: model complexity is not taken into account…

Stacked Generalization, Stacking

- Estimating the weights from cross-validated predictions avoids giving unfairly high weight to models with higher complexity
- If w is restricted to vectors with one unit weight and the rest zero, stacking reduces to choosing the model with the smallest leave-one-out cross-validation error
- In practice we use the combined models with the estimated optimal weights: better prediction, but less interpretability
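
A minimal sketch of stacking with leave-one-out predictions follows. The candidate models (polynomials of degree 0, 1, and 2), the simulated data, and the function names are assumptions chosen for illustration. The weights are unconstrained least-squares weights; restricting them to be non-negative is a common refinement that is not shown here.

```python
import numpy as np

def loo_predictions(x, y, degrees):
    """Column m holds the leave-one-out predictions of a degree-degrees[m] polynomial."""
    n = x.size
    F = np.empty((n, len(degrees)))
    for i in range(n):
        keep = np.arange(n) != i                     # drop observation i
        for m, d in enumerate(degrees):
            coef = np.polyfit(x[keep], y[keep], d)   # refit without observation i
            F[i, m] = np.polyval(coef, x[i])         # predict the held-out point
    return F

def stacking_weights(F, y):
    """Least-squares regression of y on the cross-validated predictions."""
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w

# Simulated data (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = x ** 2 + rng.normal(0.0, 0.1, x.size)
degrees = (0, 1, 2)
F = loo_predictions(x, y, degrees)
w = stacking_weights(F, y)
loo_err = ((y[:, None] - F) ** 2).sum(axis=0)   # leave-one-out error of each single model
stack_err = ((y - F @ w) ** 2).sum()            # error of the stacked combination
print(w, loo_err, stack_err)
```

Picking `degrees[loo_err.argmin()]` corresponds to the special case of one unit weight and the rest zero, so the stacked error on these cross-validated predictions cannot exceed the smallest entry of `loo_err`.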

Stochastic Search: Bumping

- Rather than average models, try to find a better single model.
- Good for avoiding local minima in the fitting method.
- Like bagging, draw bootstrap samples and fit model to each, but choose model that best fits the training data

Stochastic Search: Bumping

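- Given B bootstrap samples $Z^{*1}, \ldots, Z^{*B}$, fitting the model to each yields predictions $\hat f^{*b}(x)$, $b = 1, \ldots, B$, at input point $x$
- For squared error, choose the model from bootstrap sample $\hat b = \arg\min_b \sum_{i=1}^{N} [y_i - \hat f^{*b}(x_i)]^2$, the one with the smallest error on the original training set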
- Bumping tries to move around the model space by perturbing the data.
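
A hedged sketch of bumping is given below. To make the effect visible, it uses a deliberately crude, hypothetical fitter (a stump that always splits at the median of x), so that perturbing the data moves the split. Including the original sample as one of the candidates is a convention assumed here; it lets the procedure fall back on the original fit.

```python
import numpy as np

def fit_median_stump(x, y):
    """Deliberately crude fitter (hypothetical): split at the median of x."""
    s = np.median(x)
    left, right = y[x <= s], y[x > s]
    lo = left.mean()
    hi = right.mean() if right.size else lo
    return lambda x_new: np.where(x_new <= s, lo, hi)

def bump(fit, x, y, B=50, seed=0):
    """Fit to bootstrap samples and keep the model with the lowest training error."""
    rng = np.random.default_rng(seed)
    best_model, best_err = None, np.inf
    for b in range(B + 1):
        # b == 0 uses the original sample, so bumping can never do worse than it
        idx = np.arange(x.size) if b == 0 else rng.integers(0, x.size, size=x.size)
        model = fit(x[idx], y[idx])
        err = ((y - model(x)) ** 2).sum()   # error on the ORIGINAL training data
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err

# Simulated step data (illustrative only): the true jump is at x = 0.7
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 60)
y = (x > 0.7).astype(float) + rng.normal(0.0, 0.1, x.size)
base_err = ((y - fit_median_stump(x, y)(x)) ** 2).sum()
model, bumped_err = bump(fit_median_stump, x, y)
print(base_err, bumped_err)
```

Unlike bagging, the result is a single model, so it stays as interpretable as the original fit.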

A contrived case where bumping helps

- Greedy tree-based algorithm tries to split on each dimension separately, first one, then the other.
- Bumping stumbles upon the right answer.
