No CLT – No Problem? Enter the Bootstrap! - PowerPoint PPT Presentation

Presentation Transcript

  1. No CLT – No Problem? Enter the Bootstrap! John McGready, Department of Biostatistics, Johns Hopkins University, http://www.biostat.jhsph.edu/~jmcgread

  2. Slide #2

  3. Goals of Inferential Statistics • Much of what we do in statistics involves trying to talk about true characteristics of a process, using an imperfect subset of information from the process Population Information (what we WANT) Sample Information (what we have)

  4. Medical Expenditures • Suppose we want to study the FY 2005 medical expenditures for 13,000+ employees in a particular company • However, the benefits administrator will only give us one random sample of 200 employees

  5. Medical Expenditures
      Population: (True) mean = 2.3, (True) sd = 5.0; Median = 0.59, Mean = 2.3, sd = 5.0
      Sample: (Sample) mean = 1.9, (Sample) sd = 4.0; Median = 0.57, Mean = 2.0, sd = 4.3

  6. Medical Expenditures • Given the right skew, our first choice for estimating the center of the distribution is to work with the median • We can only estimate the true median using the sample median from our 200 observations

  7. Medical Expenditures • We are interested in how “good a guess” the sample median is of the true median • We would also like to estimate a range of possibilities for the true median (i.e., a confidence interval)

  8. Medical Expenditures • In order to understand how a sample median from 200 observations relates to the true median, let’s call our administrator and see if we can get 1,000 more random samples of size 200 • This way, we can compute 1,000 more sample medians and see how variable they are

  9. Making the Call

  10. The Response No Way!

  11. What to Do Now?? • Well, it seems we are out of luck • Let’s just estimate the mean instead, and use the Central Limit Theorem to estimate a range of possible values for the true mean

  12. Review: Sampling Behavior via the CLT Standard error (spread) = $\sigma / \sqrt{n} = 5.0 / \sqrt{200} \approx 0.35$

  13. Sampling Behavior via the CLT • Most (95%) of the sample means we could get from samples of 200 would fall between the 2.5th and 97.5th percentiles of this distribution • These percentiles correspond to true mean +/- 1.96 standard errors
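
The percentile arithmetic above can be sketched in a few lines. This is only an illustration using the true mean and sd quoted on slide 5 and the sample size from slide 4; the variable names are made up for the example.

```python
import math
from statistics import NormalDist

true_mean = 2.3   # thousand $, from slide 5
true_sd = 5.0     # thousand $, from slide 5
n = 200           # sample size, from slide 4

# Standard error of the sample mean: sd / sqrt(n)
standard_error = true_sd / math.sqrt(n)

# 97.5th percentile of the standard normal distribution (about 1.96)
z = NormalDist().inv_cdf(0.975)

# Range that should contain about 95% of sample means under the CLT
lower = true_mean - z * standard_error
upper = true_mean + z * standard_error
print(f"SE = {standard_error:.3f}; 95% of sample means in ({lower:.2f}, {upper:.2f})")
```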

  14. Sampling Behavior via the CLT

  15. Sampling Behavior via the CLT • Rub #1 • If we knew the true mean, we wouldn’t care about possible mean values • However, taking this one step further implies that 95% of the sample means we could get will fall within a known range of the truth

  16. Sampling Behavior via the CLT

  17. Sampling Behavior via the CLT

  18. Sampling Behavior via the CLT • Rub #2 • If we only have one sample, we don’t know the true sampling distribution • However, CLT says it will be normal • We estimate the spread from our sample data, and center it at our sample mean

  19. Sampling Behavior via the CLT • Our Sample info • Sample mean: 2.0 (thousand $) • Sample standard deviation: 4.3 (thousand $) • Sample estimate of standard error (spread of sampling distribution): 4.3 / sqrt(200) ≈ 0.30 (thousand $)

  20. Sampling Behavior via the CLT

  21. Sampling Behavior via the CLT

  22. Sampling Behavior via the CLT • True 95% CI • Sample mean +/- 1.96*(true standard error) • (1.3,2.7) • Estimated 95% CI • Sample mean +/- 1.97*(estimated standard error) • (1.4, 2.6)

  23. Another Approach to Estimating Sampling Distribution • Instead of relying on CLT, how about we simulate the sampling distribution using just our sample of 200? • Treat our sample as “truth” • Resample multiple times (say 1,000), taking random draws of 200 with replacement

  24. Resampling With Replacement Original sample (n=4): Potential resample of same size: S1 S1 S2 S2 S3 S3 S4
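
A minimal sketch of one resample with replacement, using the n = 4 labels from this slide. The seed is arbitrary and only there to make the run repeatable; some labels may appear more than once and others not at all.

```python
import random

rng = random.Random(0)  # arbitrary seed, for repeatability only

original = ["S1", "S2", "S3", "S4"]

# Draw len(original) items, putting each one "back" before the next draw
resample = rng.choices(original, k=len(original))

print("Original:", original)
print("Resample:", resample)
```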

  25. Re-Sampling

  26. Bootstrap Estimate of Sampling Distribution • Take 1,000 resamples • Compute the mean of each re-sample • Plot a distribution of the means
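
The three steps above might look like the following. The slide data are not available here, so the sketch substitutes a hypothetical right-skewed sample of 200 values (a lognormal draw with made-up parameters); `bootstrap_distribution` is an illustrative helper name, not something from the talk.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed

# Hypothetical stand-in for the 200 expenditures (thousand $); NOT the slide data
sample = [rng.lognormvariate(-0.5, 1.5) for _ in range(200)]

def bootstrap_distribution(data, statistic, n_resamples=1000):
    """Apply `statistic` to n_resamples resamples of `data` drawn with replacement."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

boot_means = bootstrap_distribution(sample, statistics.fmean)

# A histogram of boot_means would be the plotted bootstrap distribution;
# here we only summarize its center and spread.
print("Sample mean:          ", round(statistics.fmean(sample), 3))
print("Mean of resample means:", round(statistics.fmean(boot_means), 3))
print("SD of resample means:  ", round(statistics.stdev(boot_means), 3))
```

The spread of `boot_means` plays the role of the standard error, so it should land near the sample sd divided by sqrt(200).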

  27. Bootstrap Estimate of Sampling Distribution

  28. Bootstrap Estimate of Sampling Distribution

  29. Bootstrap 95% CIs • How to get a 95% CI from the bootstrap dist • Assume normality (normal bootstrap method) • But estimate standard error from bootstrap distribution • Pick off 2.5th, 97.5th percentiles (bootstrap percentile method) • Pick off “adjusted” percentiles (bias-corrected accelerated, BCa, method)
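
Two of the three methods can be sketched with the standard library alone. The data below are again a hypothetical skewed sample, not the slide data, and the BCa method is left out because its percentile adjustment needs extra bias-correction and acceleration steps.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(1)  # arbitrary seed

# Hypothetical right-skewed sample of 200 values; NOT the slide data
sample = [rng.lognormvariate(-0.5, 1.5) for _ in range(200)]
n = len(sample)

# Bootstrap distribution of the mean from 1,000 resamples
boot_means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(1000)]

# Normal bootstrap method: assume normality, but take the
# standard error from the spread of the bootstrap distribution
z = NormalDist().inv_cdf(0.975)
center = statistics.fmean(sample)
boot_se = statistics.stdev(boot_means)
normal_ci = (center - z * boot_se, center + z * boot_se)

# Percentile method: pick off the 2.5th and 97.5th percentiles.
# n=40 yields 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(boot_means, n=40, method="inclusive")
percentile_ci = (cuts[0], cuts[-1])

print("Bootstrap normal:     (%.2f, %.2f)" % normal_ci)
print("Bootstrap percentile: (%.2f, %.2f)" % percentile_ci)
```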

  30. 95% CIs • True Mean 2.3
      Method                   95% CI
      CLT Estimate             1.40 - 2.60
      Bootstrap Normal         1.39 - 2.60
      Bootstrap Percentile     1.41 - 2.58
      BCa                      1.47 - 2.68

  31. We Could Do with 10,000 Resamples

  32. Bootstrap 95% CIs : Mean • Empirical Coverage Probabilities (1)
      Method                   1K resamps   10K resamps
      CLT Estimate (2)         93.4%
      Bootstrap Normal (2)     93.2%        92.5%
      Bootstrap Percentile     92.4%        91.6%
      BCa                      92.3%        93.4%
      (1) To be thorough, should also look at average width
      (2) Some intervals could contain illegal (negative) values
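
An empirical coverage probability like those above can be estimated by simulation: draw many samples from a population with a known mean, build an interval from each, and count how often the interval contains the truth. The sketch below assumes a hypothetical lognormal population (not the expenditure data) and uses far fewer trials and resamples than the slide so that it runs quickly, so its output should not be compared to the table.

```python
import math
import random
import statistics

# Hypothetical lognormal population; its true mean is known in closed form
MU, SIGMA = -0.5, 1.5
TRUE_MEAN = math.exp(MU + SIGMA ** 2 / 2)

def percentile_ci(data, rng, n_resamples):
    """95% bootstrap percentile interval for the mean of `data`."""
    n = len(data)
    boot = [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]
    cuts = statistics.quantiles(boot, n=40, method="inclusive")
    return cuts[0], cuts[-1]

def coverage(n_trials=200, n=200, n_resamples=200, seed=0):
    """Fraction of simulated samples whose interval contains TRUE_MEAN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        sample = [rng.lognormvariate(MU, SIGMA) for _ in range(n)]
        low, high = percentile_ci(sample, rng, n_resamples)
        hits += low <= TRUE_MEAN <= high
    return hits / n_trials

estimated_coverage = coverage()
print(f"Estimated coverage of the percentile interval: {estimated_coverage:.1%}")
```

Recording `high - low` for each trial would give the average width mentioned in the first footnote.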

  33. What’s The Big Deal? • Why not just use CLT? • For many statistics, we do not have a CLT (or good CLT) based approach • Median • Ratio of mean to sd • Correlation coefficients
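
One way to see the appeal: the same resampling code works for any statistic that can be computed from a sample. The sketch below applies a percentile interval to the median and to the ratio of mean to sd. The data are a hypothetical skewed sample, and `percentile_ci` and `mean_to_sd` are illustrative names rather than anything from the talk.

```python
import random
import statistics

def percentile_ci(data, statistic, n_resamples=1000, seed=0):
    """95% bootstrap percentile interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    boot = [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]
    cuts = statistics.quantiles(boot, n=40, method="inclusive")
    return cuts[0], cuts[-1]

def mean_to_sd(values):
    """Ratio of the mean to the standard deviation."""
    return statistics.fmean(values) / statistics.stdev(values)

# Hypothetical right-skewed sample of 200 values; NOT the slide data
data_rng = random.Random(7)
sample = [data_rng.lognormvariate(-0.5, 1.5) for _ in range(200)]

median_ci = percentile_ci(sample, statistics.median)
ratio_ci = percentile_ci(sample, mean_to_sd)

print("Median 95%% CI:     (%.2f, %.2f)" % median_ci)
print("Mean/sd 95%% CI:    (%.2f, %.2f)" % ratio_ci)
```

A correlation coefficient would follow the same pattern, except that the resampling would need to draw (x, y) pairs together rather than single values.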

  34. Getting a 95% CI for A Median

  35. 95% CIs For Median • True Median 0.59
      Method                   95% CI (1,000 Reps)
      CLT Estimate             NA
      Bootstrap Normal         0.44 - 0.71
      Bootstrap Percentile     0.39 - 0.68
      BCa                      0.39 - 0.68

  36. Bootstrap 95% CIs : Median • Empirical Coverage Probabilities (1)
      Method                   1K resamps   10K resamps
      Bootstrap Normal (2)     94.1%        94.4%
      Bootstrap Percentile     93.9%        95.0%
      BCa                      94.0%        95.2%
      (1) To be thorough, should also look at average width
      (2) Some intervals could contain illegal (negative) values

  37. Wrap Up • Pros/Cons of bootstrap • Theoretical Justification