94 Views

Download Presentation
## Stats 120A

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Stats 120A**Review of CIs, hypothesis tests and more**Sample/Population**• Last time we collected height/armspan data. Is this a sample or a population?**Gallup Poll, 1/9/07**"As you may know, the Bush administration is considering a temporary but significant increase in the number of U.S. troops in Iraq to help stabilize the situation there. Would you favor or oppose this?"**Results**• Results based on 1004 randomly selected adults (> 18 years) interviewed Jan 5-7, 2007. • 61% are opposed. • "For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is ±3 percentage points. "**Pop Quiz**• Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling**Sampling paradigm**• In the U.S., the proportion of adults who are opposed to a surge is p, (or p*100%). • We take a random sample of n = 1004. • The proportion of our sample ("p hat") is an estimate of the proportion in the population.**A simulation:**• Choose a value to serve as p (say p = .6) • Our "data" consist of 1004 numbers: 0's represent those in favor, 1's are those opposed. • x = 589 out of 1004 say "opposed", so p-hat = 589/1004 = .5866 • mean(x) = .5866 • sd(x) = .4926**How do we know sample proportion is a good estimate of**population proportion? • Law of Large Numbers: sample averages (and proportions) converge on population values •implying that for finite values, the sample proportion might be close if the sample size is large**So if we stop earlier, say n = 10**p-hat = .60**Which raises the question:**• If we stop early, how far away will our sample proportion be from the true value? • Or, in a survey setting, if we take a finite sample of n=1004, how far off from the population proportion are we likely to be?**A simulation might help:**• Assume p = .60 (population proportion) • Take sample of n = 1004 and find p-hat. • Save this value • Repeat above 3 steps 10000 times.**The R code (for the record)**• phat <- c() for (i in 1:10000){ x <- sample(c(0,1),1004,replace=T,prob=c(.4, .6)) temp <- sum(x)/1004 phat <- c(phat,temp)} • hist(phat)**Observe that...**• sample proportions are centered on the true population value: p = .60 • variability is not great: smallest is .54, biggest is .66 • distribution is bell-shaped**We've just witnessed the Central Limit Theorem**If samples are independent and random and sufficiently large • means (and proportions) follow a nearly Normal distribution • the mean of the Normal is the mean of the population • the SD of the Normal (aka the standard error) is the population SD divided by sqrt(n)**CLT applied to sample proportions**• phat is distributed with an approx Normal • mean is p • SE is sqrt(p*(1-p)/n) • For our simulation, p = .60 so our p-hats will be centered on .6 with a SD of sqrt(.6*.4/1004) = 0.0155**We saw**• Normal • mean(phat) = 0.600(expected .6) • sd(phat) = 0.01554(expected 0.0155)**In practice, we don't know p**but we can get a good approximation to the standard error using sqrt(phat * (1-phat)/n) rather than sqrt(p*(1-p)/n)**So if we take a random sample of n = 1004**and we see p-hat = .61, we know that: • The true value of p can't be far away. SE = sqrt(.61*.39/1004) = 0.0154 •So 68% of the time we do this, p will be within 0.0154 of phat •And 95% of the time it will be with 2*.0154 = 0.03**Which leads us to conclude**that the true proportion of the population that opposes a surge is somewhere in the interval.61 - .03 = 0.58 to .61+.03 = 0.64**Confidence intervals**• This is an example of a 95% confidence interval. • Because 95% of all samples will produce a p-hat that is within 2 standard errors of the true value, we are 95% confident that ours is a "good" interval.**Formula**A 95% CI for a proportion is estimate +/- 2 * (Standard Error) p-hat +/- 2*sqrt(phat*(1-phat)/n) 0.61 +/- 2*sqrt(.61*.39/1004) (.58, .64) note: our replacing phat for p in SE means we get an approximate value**What does 95% mean?**• If we repeat this infinitely many times: • take a sample of n = 1004 from population • calculate sample proportion • find an interval using +/- 2 * SE • then 95% of these CIs will contain the truth and 5% will not. • We see only one: (.58, .64). It is either good or bad, but we are confident it is good.**Where did the 95% come from?**• It came from the normal curve. • The CLT told us that p-hat followed a (approx) normal distribution. • For Normal's, 68% of probability is within 1 standard deviation of mean, 95% within 2, 99.7% within 3. • A normal table gives other probabilities**Change confidence level by changing the width of margin of**error -0.015 +.015 1 SE 68% 2 SEs 95% 3 SEs 99.7% 90% 1.6 SE phat =0.61**The CLT applies to**• any linear combination of the observations • assuming observations are randomly sampled, and independent • it does NOT matter what the distribution of the population looks like • if n is small, the distribution will be only approximately normal, and this might be a very poor approximation**the CLT does NOT apply to**• non-linear combinations, such as the sample median or the standard deviation • non-random samples • samples that are dependent**simulation**• http://onlinestatbook.com/stat_sim/sampling_dist/index.html**Summary**• Confidence Level is a statement about the sampling process, not the sample • Margin of error is determined to achieve the desired confidence level • We can calculate the confidence level only if we know the sampling distribution: the probability distribution of the sample**Pop Quiz**• Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling**Pop Quiz**• Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling**For next time:**• In WWII, German army produced tanks with sequential serial numbers. The allies captured a few tanks, and wanted to infer the total number of tanks produced. • Suppose you had captured 10 tanks. Come up with three estimators for the total number of tanks. • Data: 911 5146 6083 944 11944 9365 6087 6647 7076 12275