No CLT – No Problem? Enter the Bootstrap!



**No CLT – No Problem? Enter the Bootstrap!**
John McGready
Department of Biostatistics, Johns Hopkins University
http://www.biostat.jhsph.edu/~jmcgread

**Goals of Inferential Statistics**
• Much of what we do in statistics involves trying to talk about true characteristics of a process, using an imperfect subset of information from the process
  • Population information (what we WANT)
  • Sample information (what we have)

**Medical Expenditures**
• Suppose we want to study the FY 2005 medical expenditures for 13,000+ employees in a particular company
• However, the benefits administrator will only give us one random sample of 200 employees

**Medical Expenditures**
[Figure: distributions of medical expenditures (thousand $) for the full population and for the sample of 200]
• Population: (True) mean = 2.3, (True) sd = 5.0; Median = 0.59, Mean = 2.3, sd = 5.0
• Sample: (Sample) mean = 1.9, (Sample) sd = 4.0; Median = 0.57, Mean = 2.0, sd = 4.3

**Medical Expenditures**
• Given the right skew, our first choice for estimating the center of the distribution is to work with the median
• We can only estimate the true median using the sample median from our 200 observations

**Medical Expenditures**
• We are interested in how “good a guess” the sample median is of the true median
• We would also like to estimate a range of possibilities for the true median (i.e., a confidence interval)

**Medical Expenditures**
• In order to understand how a sample median from 200 observations relates to the true median, let’s call our administrator and see if we can get 1,000 more random samples of size 200
• This way, we can compute 1,000 more sample medians and see how variable they are

**The Response**
No Way!

**What to Do Now??**
• Well, it seems we are out of luck
• Let’s just estimate the mean instead, and use the Central Limit Theorem to estimate a range of possible values for the true mean

**Review: Sampling Behavior via the CLT**
• Standard error (spread) = σ/√n = 5.0/√200 ≈ 0.35 (thousand $), where σ is the true sd and n = 200 is the sample size

**Sampling Behavior via the CLT**
• Most (95%) of the sample means we could get from samples of 200 would fall between the 2.5th and 97.5th percentiles of this distribution
• These percentiles correspond to true mean +/- 1.96 standard errors

**Sampling Behavior via the CLT**
• Rub #1
  • If we knew the true mean, we wouldn’t care about possible mean values
  • However, taking this one step further implies that 95% of the sample means we could get will fall within a known range of the truth

**Sampling Behavior via the CLT**
• Rub #2
  • If we only have one sample, we don’t know the true sampling distribution
  • However, the CLT says it will be normal
  • We estimate the spread from our sample data, and center it at our sample mean

**Sampling Behavior via the CLT**
• Our sample info
  • Sample mean: 2.0 (thousand $)
  • Sample standard deviation: 4.3 (thousand $)
  • Sample estimate of standard error (spread of sampling distribution): s/√n = 4.3/√200 ≈ 0.30 (thousand $)

**Sampling Behavior via the CLT**
• True 95% CI
  • Sample mean +/- 1.96*(true standard error)
  • (1.3, 2.7)
• Estimated 95% CI
  • Sample mean +/- 1.97*(estimated standard error)
  • (1.4, 2.6)

**Another Approach to Estimating Sampling Distribution**
• Instead of relying on the CLT, how about we simulate the sampling distribution using just our sample of 200?
• Treat our sample as “truth”
• Resample multiple times (say 1,000), taking random draws of 200 with replacement

**Resampling With Replacement**
[Diagram: an original sample (n = 4) of S1, S2, S3, S4 next to a potential resample of the same size; because draws are made with replacement, some observations can appear more than once and others not at all]

**Bootstrap Estimate of Sampling Distribution**
• Take 1,000 resamples
• Compute the mean of each resample
• Plot a distribution of the means

**Bootstrap 95% CIs**
• How to get a 95% CI from the bootstrap distribution
  • Assume normality (normal bootstrap method)
    • But estimate the standard error from the bootstrap distribution
  • Pick off the 2.5th and 97.5th percentiles (bootstrap percentile method)
  • Pick off “adjusted” percentiles (bias-corrected and accelerated, or BCa, method)

**95% CIs**
• True mean: 2.3

| Method | 95% CI |
|---|---|
| CLT Estimate | 1.40 - 2.60 |
| Bootstrap Normal | 1.39 - 2.60 |
| Bootstrap Percentile | 1.41 - 2.58 |
| BCa | 1.47 - 2.68 |

**Bootstrap 95% CIs : Mean**
• Empirical coverage probabilities¹

| Method | 1K resamps | 10K resamps |
|---|---|---|
| CLT Estimate² | 93.4% | |
| Bootstrap Normal² | 93.2% | 92.5% |
| Bootstrap Percentile | 92.4% | 91.6% |
| BCa | 92.3% | 93.4% |

¹ To be thorough, should also look at average width
² Some intervals could contain illegal (negative) values

**What’s The Big Deal?**
• Why not just use the CLT?
• For many statistics, we do not have a CLT-based (or a good CLT-based) approach
  • Median
  • Ratio of mean to sd
  • Correlation coefficients

**95% CIs For Median**
• True median: 0.59

| Method | 95% CI (1,000 reps) |
|---|---|
| CLT Estimate | NA |
| Bootstrap Normal | 0.44 - 0.71 |
| Bootstrap Percentile | 0.39 - 0.68 |
| BCa | 0.39 - 0.68 |

**Bootstrap 95% CIs : Median**
• Empirical coverage probabilities¹

| Method | 1K resamps | 10K resamps |
|---|---|---|
| Bootstrap Normal² | 94.1% | 94.4% |
| Bootstrap Percentile | 93.9% | 95.0% |
| BCa | 94.0% | 95.2% |

¹ To be thorough, should also look at average width
² Some intervals could contain illegal (negative) values

**Wrap Up**
• Pros/cons of the bootstrap
• Theoretical justification
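
The resampling recipe and two of the interval methods from the slides (normal bootstrap and bootstrap percentile) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the slides' numbers. The expenditure data are not available, so `sample` is a hypothetical right-skewed stand-in drawn from a lognormal distribution, and the function names (`bootstrap_statistics`, `percentile_ci`, `normal_ci`, `clt_mean_ci`) are made up for this example. The BCa method is omitted because it needs extra bias and acceleration estimates.

```python
import math
import random
import statistics


def bootstrap_statistics(sample, stat=statistics.median, n_resamples=1000, seed=0):
    """Compute `stat` on n_resamples resamples of len(sample), drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_ci(boot_stats, level=0.95):
    """Bootstrap percentile method: pick off the lower and upper tail percentiles."""
    ordered = sorted(boot_stats)
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * (len(ordered) - 1))
    hi_idx = int(round((1.0 - alpha) * (len(ordered) - 1)))
    return ordered[lo_idx], ordered[hi_idx]


def normal_ci(estimate, boot_stats, z=1.96):
    """Normal bootstrap method: estimate +/- z * (sd of the bootstrap statistics)."""
    se = statistics.stdev(boot_stats)
    return estimate - z * se, estimate + z * se


def clt_mean_ci(sample, z=1.96):
    """CLT-based interval for the mean: sample mean +/- z * s / sqrt(n)."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * se, mean + z * se


# Hypothetical stand-in for the 200 expenditures (thousand $); NOT the real data.
data_rng = random.Random(42)
sample = [data_rng.lognormvariate(-0.5, 1.6) for _ in range(200)]

sample_median = statistics.median(sample)
boot_medians = bootstrap_statistics(sample, stat=statistics.median)

print("sample median:          ", round(sample_median, 2))
print("bootstrap normal CI:    ", normal_ci(sample_median, boot_medians))
print("bootstrap percentile CI:", percentile_ci(boot_medians))
print("CLT CI for the mean:    ", clt_mean_ci(sample))
```

Passing `stat=statistics.mean` instead of the median gives the bootstrap distribution of the sample mean, which can be compared directly with `clt_mean_ci`. The fixed seeds are there only to make the sketch reproducible.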