Stats 120A

Stats 120A Review of CIs, hypothesis tests and more

Sample/Population • Last time we collected height/armspan data. Is this a sample or a population?

Gallup Poll, 1/9/07 "As you may know, the Bush administration is considering a temporary but significant increase in the number of U.S. troops in Iraq to help stabilize the situation there. Would you favor or oppose this?"

Results • Results based on 1004 randomly selected adults (18 years and older) interviewed Jan 5-7, 2007. • 61% are opposed. • "For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is ±3 percentage points."

Pop Quiz • Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling

Sampling paradigm • In the U.S., the proportion of adults who are opposed to a surge is p (or p*100%). • We take a random sample of n = 1004. • The proportion of our sample who are opposed ("p hat") is an estimate of the proportion in the population.

A simulation: • Choose a value to serve as p (say p = .6) • Our "data" x consist of 1004 numbers: 0's represent those in favor, 1's are those opposed. • sum(x) = 589 out of 1004 say "opposed", so p-hat = 589/1004 = .5866 • mean(x) = .5866 • sd(x) = .4926
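The single simulated survey above can be sketched in R as follows. This is a minimal illustration, not the code used for the slide: the seed is an arbitrary assumption, so the counts will differ from the 589 shown, and `s_formula` is a name introduced here only to show why sd(x) is determined by p-hat for 0/1 data.

```r
# One simulated survey: 1 = opposed, 0 = in favor.
# p = 0.6 and n = 1004 come from the slide; the seed is arbitrary.
set.seed(120)
n <- 1004
p <- 0.6
x <- sample(c(0, 1), n, replace = TRUE, prob = c(1 - p, p))

phat <- mean(x)   # sample proportion; identical to sum(x) / n
s <- sd(x)        # sample standard deviation of the 0/1 data

# For 0/1 data the sample SD is a function of phat alone:
# sd(x) = sqrt(phat * (1 - phat) * n / (n - 1))
s_formula <- sqrt(phat * (1 - phat) * n / (n - 1))

print(c(phat = phat, s = s, s_formula = s_formula))
```

Because the mean of 0/1 data is the proportion of 1's, anything known about sample means carries over to sample proportions.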

xbar=.5866, s = .493

How do we know the sample proportion is a good estimate of the population proportion? • Law of Large Numbers: sample averages (and proportions) converge on population values • implying that for a finite sample, the sample proportion is likely to be close to the population proportion if the sample size is large

Coin flips: sample proportion "settles down" to 0.5
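The "settling down" of the coin-flip proportion can be reproduced with a short R sketch. The number of flips (5000) and the seed are assumptions chosen for illustration; the slide does not say how many flips were plotted.

```r
# Running proportion of heads for a fair coin.
# 5000 flips and the seed are illustrative choices, not from the slide.
set.seed(1)
flips <- rbinom(5000, size = 1, prob = 0.5)        # 1 = heads, 0 = tails
running_prop <- cumsum(flips) / seq_along(flips)   # proportion after each flip

# Proportion of heads after 10, 100, 1000 and 5000 flips
print(running_prop[c(10, 100, 1000, 5000)])
```

Calling `plot(running_prop, type = "l")` followed by `abline(h = 0.5, lty = 2)` draws a picture of this kind: large swings early on, then a curve that hugs 0.5.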

So if we stop earlier, say at n = 10, p-hat = .60

Which raises the question: • If we stop early, how far away will our sample proportion be from the true value? • Or, in a survey setting, if we take a finite sample of n=1004, how far off from the population proportion are we likely to be?

A simulation might help: • Assume p = .60 (population proportion) • Take sample of n = 1004 and find p-hat. • Save this value • Repeat above 3 steps 10000 times.

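The R code (for the record)
    phat <- numeric(10000)            # one slot per simulated survey
    for (i in 1:10000) {
      # 1004 respondents: 0 = in favor (prob .4), 1 = opposed (prob .6)
      x <- sample(c(0, 1), 1004, replace = TRUE, prob = c(.4, .6))
      phat[i] <- sum(x) / 1004        # sample proportion for this survey
    }
    hist(phat)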

each dot represents one survey of 1004 people

10,000 sample proportions, n = 1004

Observe that... • sample proportions are centered on the true population value: p = .60 • variability is not great: smallest is .54, biggest is .66 • distribution is bell-shaped

We've just witnessed the Central Limit Theorem If samples are independent and random and sufficiently large • means (and proportions) follow a nearly Normal distribution • the mean of the Normal is the mean of the population • the SD of the Normal (aka the standard error) is the population SD divided by sqrt(n)

CLT applied to sample proportions • phat follows an approximately Normal distribution • mean is p • SE is sqrt(p*(1-p)/n) • For our simulation, p = .60 so our p-hats will be centered on .6 with an SD of sqrt(.6*.4/1004) = 0.0155

We saw • Normal • mean(phat) = 0.600 (expected .6) • sd(phat) = 0.01554 (expected 0.0155)
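The comparison between the simulated p-hats and the CLT prediction can be checked with a sketch like the one below. It swaps the explicit loop for `rbinom`, which draws the number of "opposed" responses in each survey directly; that is an equivalent shortcut chosen here for brevity, not the method on the slide. The seed is arbitrary, so the printed values will differ slightly from 0.600 and 0.01554.

```r
# 10,000 simulated surveys of n = 1004 with true p = 0.6.
# rbinom() returns the count of 1's per survey, so dividing by n gives p-hat.
set.seed(2007)
n <- 1004
p <- 0.6
reps <- 10000
phat <- rbinom(reps, size = n, prob = p) / n

se_theory <- sqrt(p * (1 - p) / n)   # CLT standard error, about 0.0155

# Simulated center and spread next to the theoretical SE
print(c(mean = mean(phat), sd = sd(phat), se_theory = se_theory))

# Share of simulated p-hats within 2 standard errors of p
within_2se <- mean(abs(phat - p) <= 2 * se_theory)
print(within_2se)
```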

In practice, we don't know p but we can get a good approximation to the standard error using sqrt(phat * (1-phat)/n) rather than sqrt(p*(1-p)/n)

So if we take a random sample of n = 1004 and we see p-hat = .61, we know that: • The true value of p is unlikely to be far away. SE = sqrt(.61*.39/1004) = 0.0154 • So 68% of the time we do this, p will be within 0.0154 of phat • And 95% of the time it will be within 2*.0154 = 0.03

Which leads us to conclude that the true proportion of the population that opposes a surge is somewhere in the interval .61 - .03 = 0.58 to .61 + .03 = 0.64

Confidence intervals • This is an example of a 95% confidence interval. • Because 95% of all samples will produce a p-hat that is within 2 standard errors of the true value, we are 95% confident that ours is a "good" interval.

Formula • A 95% CI for a proportion is estimate +/- 2 * (Standard Error) • p-hat +/- 2*sqrt(phat*(1-phat)/n) • 0.61 +/- 2*sqrt(.61*.39/1004) = (.58, .64) • note: substituting phat for p in the SE means we get an approximate value
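The formula translates directly into a few lines of R. The function name `ci_prop` and its `mult` argument are hypothetical, introduced here for illustration; the multiplier defaults to 2 to match the slide rather than the more exact 1.96.

```r
# Approximate confidence interval for a proportion:
#   phat +/- mult * sqrt(phat * (1 - phat) / n)
# ci_prop and mult are illustrative names, not from the slides.
ci_prop <- function(phat, n, mult = 2) {
  se <- sqrt(phat * (1 - phat) / n)
  c(lower = phat - mult * se, upper = phat + mult * se, se = se)
}

# The Gallup result: 61% opposed out of 1004 respondents
poll <- ci_prop(0.61, 1004)
print(round(poll, 4))
```

Rounded to two decimals, this gives the interval (.58, .64) with a standard error of about 0.0154.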

What does 95% mean? • If we repeat this infinitely many times: • take a sample of n = 1004 from population • calculate sample proportion • find an interval using +/- 2 * SE • then 95% of these CIs will contain the truth and 5% will not. • We see only one: (.58, .64). It is either good or bad, but we are confident it is good.
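The repeated-sampling interpretation can be imitated by simulation, with 10,000 repetitions standing in for "infinitely many". This is a sketch under assumptions: the true p is set to 0.6 as in the earlier simulation, the seed is arbitrary, and `rbinom` is used to generate each survey's count of "opposed" responses.

```r
# Coverage of the +/- 2 SE interval when the true p is known to be 0.6.
set.seed(53)
n <- 1004
p <- 0.6
reps <- 10000

phat <- rbinom(reps, size = n, prob = p) / n   # one p-hat per survey
se_hat <- sqrt(phat * (1 - phat) / n)          # SE estimated from each sample
lower <- phat - 2 * se_hat
upper <- phat + 2 * se_hat

covered <- lower <= p & p <= upper             # TRUE if the interval contains p
coverage <- mean(covered)
print(coverage)
```

Each simulated interval either contains 0.6 or it does not; the confidence level describes the long-run fraction that do, which should land close to 0.95.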

Where did the 95% come from? • It came from the Normal curve. • The CLT told us that p-hat follows an (approximately) Normal distribution. • For Normal distributions, 68% of the probability is within 1 standard deviation of the mean, 95% within 2, 99.7% within 3. • A Normal table gives other probabilities

Change confidence level by changing the width of the margin of error. [Figure: Normal curve centered at phat = 0.61 with 1 SE = 0.015; within 1 SE: 68%, within 1.6 SEs: 90%, within 2 SEs: 95%, within 3 SEs: 99.7%]
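In place of a Normal table, R's `pnorm` and `qnorm` give the link between the number of standard errors and the confidence level. The sketch below is illustrative; the SE of 0.0154 is the value computed earlier for the poll, and the variable names are assumptions.

```r
# Probability within k standard errors of the mean for a Normal distribution
k <- c(1, 1.6, 2, 3)
prob_within <- pnorm(k) - pnorm(-k)
print(round(prob_within, 4))    # roughly 0.68, 0.89, 0.95, 0.997

# Going the other way: multiplier needed for a given confidence level
conf_levels <- c(0.68, 0.90, 0.95, 0.997)
multipliers <- qnorm(1 - (1 - conf_levels) / 2)
print(round(multipliers, 2))    # roughly 0.99, 1.64, 1.96, 2.97

# Margin of error for the poll (SE = 0.0154) at each confidence level
se <- 0.0154
margin <- multipliers * se
print(round(margin, 3))
```

The rounded multipliers 1, 1.6, 2 and 3 used on the slide are approximations to these values; a higher confidence level always costs a wider margin of error.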

The CLT applies to • any linear combination of the observations • assuming observations are randomly sampled, and independent • it does NOT matter what the distribution of the population looks like • if n is small, the distribution will be only approximately normal, and this might be a very poor approximation
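The claim that the population's shape does not matter, but that small n can give a poor approximation, can be illustrated with a strongly skewed population. The exponential population, the sample sizes 5 and 200, and the helper names `sample_means` and `skewness` are all assumptions made for this sketch; none of them come from the slides.

```r
# Sample means from a skewed population: Exponential(rate = 1),
# which has mean 1 and SD 1, so the CLT predicts SE = 1 / sqrt(n).
set.seed(59)
reps <- 10000

# Hypothetical helper: reps sample means, each from a sample of size n
sample_means <- function(n) replicate(reps, mean(rexp(n, rate = 1)))

# Hypothetical helper: simple moment-based skewness (0 for a symmetric shape)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

means_small <- sample_means(5)     # small n: still visibly skewed
means_large <- sample_means(200)   # large n: close to Normal

print(c(sd_small = sd(means_small), se_small = 1 / sqrt(5)))
print(c(sd_large = sd(means_large), se_large = 1 / sqrt(200)))
print(c(skew_small = skewness(means_small), skew_large = skewness(means_large)))
```

The standard error formula holds at both sample sizes, but the skewness shows that the Normal shape only emerges once n is reasonably large.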

the CLT does NOT apply to • non-linear combinations, such as the sample median or the standard deviation • non-random samples • samples that are dependent

simulation • http://onlinestatbook.com/stat_sim/sampling_dist/index.html

Summary • Confidence Level is a statement about the sampling process, not the sample • Margin of error is determined to achieve the desired confidence level • We can calculate the confidence level only if we know the sampling distribution: the probability distribution of the sample statistic (here, p-hat) over repeated samples

Pop Quiz • Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling

For next time: • In WWII, the German army produced tanks with sequential serial numbers. The Allies captured a few tanks, and wanted to infer the total number of tanks produced. • Suppose you had captured 10 tanks. Come up with three estimators for the total number of tanks. • Data: 911 5146 6083 944 11944 9365 6087 6647 7076 12275
