Point and Confidence Interval Estimation of a Population Proportion, p

Presentation Transcript

1. We are frequently interested in estimating the proportion of a population with a characteristic of interest, such as: • the proportion of smokers • the proportion of cancer patients who survive at least 5 years • the proportion of new HIV patients who are female

2. If we take a random sample from a population and observe the number of subjects with the characteristic of interest (the number of “successes”), we are observing a binomial random variable. • Now, however, we will focus on estimating the true proportion, p, in the population rather than on the count.

3. Again, one way to deal with this type of data is to define a random variable X that can take two values: • X = 1, if the characteristic is present (a “success”) • X = 0, if the characteristic is absent (a “failure”) • Then, if we sum all values in a population of size N, we are summing zeros and ones, and this gives a count of the number of individuals in the population with the characteristic: $\sum_{i=1}^{N} x_i$

4. The population mean of X is the proportion of individuals in the population with the characteristic: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = p$. The sample proportion is then: $P = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$. Therefore, P is the estimator of p, the proportion with the characteristic of interest.

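5. By the Central Limit Theorem, we know that for n large, $\bar{X} \sim N(\mu,\ \sigma^2/n)$ approximately, even when X is not normally distributed. When X is a 0,1 variable, $\bar{X} = P$ and $\mu = p$, so for n large we know $P \sim N(p,\ \sigma^2/n)$ approximately, from the Central Limit Theorem.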

6. What is the variance, $\sigma^2$, for a 0,1 variable? We know $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. By use of algebra, and the fact that $x_i^2 = x_i$ for a 0,1 variable, we can show that $\sigma^2 = p(1-p)$.

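7. For those who want the algebra (with $\mu = p$):
$\sigma^2 = \frac{1}{N}\sum (x_i - p)^2$
$= \frac{1}{N}\sum (x_i^2 - 2p\,x_i + p^2)$ (expand)
$= \frac{1}{N}\sum (x_i - 2p\,x_i + p^2)$ ($x^2 = x$ for 0,1)
$= \frac{1}{N}\sum x_i - 2p \cdot \frac{1}{N}\sum x_i + p^2$ (sum over constant)
$= p - 2p^2 + p^2 = p(1-p)$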
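
The result that a 0,1 variable has variance p(1 - p) can be checked numerically. The sketch below uses a small hypothetical population (the 30/70 split is made up for illustration) and compares the variance computed from its definition with p(1 - p).

```python
# Hypothetical 0/1 population: 1 = characteristic present, 0 = absent.
# The 30/70 split is an illustrative assumption, not data from the slides.
population = [1] * 30 + [0] * 70

N = len(population)
p = sum(population) / N  # population mean of a 0/1 variable = proportion

# Population variance from its definition: average squared deviation from the mean.
variance_by_definition = sum((x - p) ** 2 for x in population) / N

# Closed form for a 0/1 variable.
variance_by_formula = p * (1 - p)

print(p, variance_by_definition, variance_by_formula)
```

The two variance values should agree up to floating-point rounding for any mix of zeros and ones.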

8. Hence, the variance of the sample proportion is $\sigma_P^2 = \sigma^2/n = p(1-p)/n$. The standard error of the sample proportion, P, is: $\sigma_P = \sqrt{p(1-p)/n}$

9. We also know, by the Central Limit Theorem, that for large n, P is approximately normally distributed: $P \sim N(p,\ p(1-p)/n)$. For estimation of the population proportion, p: • Point estimate: $P = x/n$ • Confidence interval estimate: $P \pm z_{1-\alpha/2}\,\sigma_P$
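
A small simulation can make the large-sample behaviour of P concrete. This is only a sketch: the values of p, n, the number of repetitions, and the seed are illustrative assumptions. It compares the simulated standard deviation of P with sqrt(p(1 - p)/n) and counts how often P falls within 1.96 standard errors of p.

```python
import math
import random


def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` random samples of size n from a 0/1 population with
    proportion p and return the sample proportion from each sample."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


# Illustrative settings (assumptions, not values from a real study).
p, n, reps = 0.585, 400, 2000
props = simulate_sample_proportions(p, n, reps)

se_theory = math.sqrt(p * (1 - p) / n)
mean_sim = sum(props) / reps
se_sim = math.sqrt(sum((x - mean_sim) ** 2 for x in props) / (reps - 1))

# Share of sample proportions lying within 1.96 standard errors of p.
within = sum(abs(x - p) <= 1.96 * se_theory for x in props) / reps

print(f"theoretical SE = {se_theory:.4f}, simulated SE = {se_sim:.4f}")
print(f"share within 1.96 SE of p: {within:.3f}")
```

If the normal approximation is reasonable, the share printed on the last line should be close to 0.95.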

10. Example: Suppose that a sample of 1000 voters is taken to determine presidential preference. In this sample, 585 persons indicated that they would vote for candidate A. Construct a 95% confidence interval estimate for the true proportion, p, in the population planning to vote for candidate A. The confidence interval for p takes the form: $P \pm z_{1-\alpha/2}\,\sigma_P$

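11. The point estimate of the proportion is: P = (585/1000) = .585 • The 95% confidence interval estimator of p is $P \pm 1.96\sqrt{p(1-p)/n}$ • However, we don’t know p, so we will use P in its place to estimate the standard error: $s_P = \sqrt{P(1-P)/n} = \sqrt{(.585)(.415)/1000} = .0156$, giving $.585 \pm 1.96(.0156) = .585 \pm .031$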

12. The 95% CI on the proportion preferring candidate A is (.554, .616). This does not include the value .50. Either p really is .5 and we obtained an unusually large sample proportion (so that the interval estimate did not cover .5), or the population proportion is not .5, suggesting that candidate A will win the election.
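
The voter example can be reproduced in a few lines. The function name `proportion_ci` is a hypothetical helper introduced here, and the sketch uses P in place of p in the standard error, as in the example.

```python
import math
from statistics import NormalDist


def proportion_ci(x, n, conf=0.95):
    """Normal-approximation CI for a proportion: P +/- z * sqrt(P(1-P)/n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    se = math.sqrt(p_hat * (1 - p_hat) / n)       # P used in place of p
    return p_hat - z * se, p_hat + z * se


lower, upper = proportion_ci(585, 1000)
print(f"({lower:.3f}, {upper:.3f})")  # prints (0.554, 0.616)
```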

13. When is the sample large enough to use the normal approximation to the binomial? (Here π denotes the true proportion, p.) • When $n\pi \ge 5$ and $n(1-\pi) \ge 5$ • That is, when both the expected number of successes and the expected number of failures are at least 5.

14. Aside: an improvement to the normal approximation for a binomial • The binomial distribution is discrete, while the normal distribution is continuous. When the true proportion, π, is known, we can match the binomial distribution better to a normal distribution by including a correction. The correction is called the “continuity correction”. • For example, when π = .5 and n = 10, to approximate $P(a \le X \le b)$ for the count X, we use instead the normal approximation for the probability $P(a - 0.5 \le X \le b + 0.5)$

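15. Example of the “continuity correction” to the normal approximation to the binomial. Suppose π = .5 and n = 16. Compare the exact, normal-approximation, and continuity-corrected values of P(.4375 ≤ P ≤ .5), that is, P(7 ≤ X ≤ 8) for the count X, which has mean 8 and standard deviation 2. • From binomial table: P(X = 7) + P(X = 8) ≈ .3709 • Using normal approximation, no correction: P(-0.5 ≤ Z ≤ 0) ≈ .1915 • Using correction: P(6.5 ≤ X ≤ 8.5) = P(-0.75 ≤ Z ≤ 0.25) ≈ .3721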
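
The three probabilities in this example can be computed directly. The sketch below works on the count scale (7 ≤ X ≤ 8 out of n = 16) and uses only the Python standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n, pi = 16, 0.5            # values from the example
lo_count, hi_count = 7, 8  # .4375 * 16 = 7 and .5 * 16 = 8

# Exact binomial probability P(7 <= X <= 8).
exact = sum(math.comb(n, k) * pi**k * (1 - pi) ** (n - k)
            for k in range(lo_count, hi_count + 1))

# Normal approximation to the count X: mean n*pi, sd sqrt(n*pi*(1-pi)).
approx = NormalDist(mu=n * pi, sigma=math.sqrt(n * pi * (1 - pi)))

# Without correction: use the interval endpoints as they are.
no_correction = approx.cdf(hi_count) - approx.cdf(lo_count)

# With continuity correction: widen the interval by 0.5 on each side.
with_correction = approx.cdf(hi_count + 0.5) - approx.cdf(lo_count - 0.5)

print(f"exact                   = {exact:.4f}")
print(f"normal, no correction   = {no_correction:.4f}")
print(f"normal, with correction = {with_correction:.4f}")
```

The corrected value lands much closer to the exact binomial probability than the uncorrected one, which is the point of the correction.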

16. Using P in place of p to estimate the standard error $\sigma_P$: • 1. If $n\pi \ge 5$ and $n(1-\pi) \ge 5$, use P: $s_P = \sqrt{P(1-P)/n}$ • 2. Otherwise, a) assume π = .5, or b) use an “exact” method for the CI • We do this to avoid underestimating the variance, p(1 - p), which is at a maximum when p = .5 • Don’t use Student’s t with proportions, since the assumption of normality of the underlying population elements is not satisfied by a 0,1 variable.
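
One way to express this rule in code is sketched below. Two assumptions are made here that the slide does not state: the rule is checked with P standing in for the unknown π, and the cut-off is treated as "at least 5". The function name `estimated_standard_error` is hypothetical.

```python
import math


def estimated_standard_error(x, n, threshold=5):
    """Estimated standard error of the sample proportion P = x/n.

    Assumption for this sketch: the large-sample rule is checked with P
    in place of the unknown true proportion. If either expected count is
    below `threshold`, fall back to the conservative value 0.5.
    """
    p_hat = x / n
    if n * p_hat >= threshold and n * (1 - p_hat) >= threshold:
        p_used = p_hat
    else:
        p_used = 0.5  # maximizes p(1 - p), so the variance is not underestimated
    return math.sqrt(p_used * (1 - p_used) / n)


se_large = estimated_standard_error(585, 1000)  # rule satisfied, uses P = .585
se_small = estimated_standard_error(2, 20)      # n*P = 2 < 5, uses 0.5
print(se_large, se_small)
```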

17. What do we use when the normal approximation is not appropriate? • Exact Binomial Confidence Intervals for p can be computed: • Solve for x in the following and then substitute into p= x/n: • Lower Limit: • Upper Limit: • Clearly, exact binomial CI is not simple to compute
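
For readers who want to see what such a computation involves, here is a minimal sketch of one standard exact construction, the Clopper-Pearson interval. It is an assumption that this matches the limit equations on the slide. The sketch solves directly for the limits of p by bisection: the lower limit makes P(X ≥ x) equal to α/2 and the upper limit makes P(X ≤ x) equal to α/2. All function names are hypothetical.

```python
import math


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), summed on the log scale for stability."""
    if x < 0:
        return 0.0
    if x >= n:
        return 1.0
    total = 0.0
    for k in range(x + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, conf=0.95):
    """Clopper-Pearson interval for a proportion (assumed construction)."""
    alpha = 1 - conf
    eps = 1e-12  # keep p strictly inside (0, 1) so the logs are defined
    # Lower limit: the p at which P(X >= x) = alpha/2.
    lower = 0.0 if x == 0 else bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2, eps, 1 - eps)
    # Upper limit: the p at which P(X <= x) = alpha/2.
    upper = 1.0 if x == n else bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2, eps, 1 - eps)
    return lower, upper


exact_lower, exact_upper = exact_ci(585, 1000)
print(f"({exact_lower:.6f}, {exact_upper:.6f})")
```

For the voter example (x = 585, n = 1000) the result can be compared with the exact interval reported by the software.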

18. Go to Minitab or other software: Stat > Basic Statistics > 1 Proportion. Enter n and x. Leave the normal-approximation option blank for the binomial CI; check it for the normal approximation.

19. EXACT Binomial: Test and CI for One Proportion
Test of p = 0.5 vs p not = 0.5
Sample    X     N  Sample p  95.0% CI              Exact P-Value
1       585  1000     0.585  (0.553748, 0.615750)          0.000
Normal Approximation: Test and CI for One Proportion
Test of p = 0.5 vs p not = 0.5
Sample    X     N  Sample p  95.0% CI              Z-Value  P-Value
1       585  1000     0.585  (0.554461, 0.615539)     5.38    0.000

20. Sample Size Estimation when the goal is Estimating a Population Proportion, p • The pattern is the same as when the goal is estimation of a mean: • If we know • the desired precision (width of interval) • the confidence level • a “guess” of the proportion, to get the standard error • we can estimate the sample size, n.

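21. The width of a confidence interval for p is: $w = 2\,z_{1-\alpha/2}\,\sigma_P$, where $\sigma_P$ is the standard error of P. The interval runs from $P - z_{1-\alpha/2}\,\sigma_P$ to $P + z_{1-\alpha/2}\,\sigma_P$. Using $\sigma_P = \sqrt{p(1-p)/n}$ we have $w = 2\,z_{1-\alpha/2}\sqrt{p(1-p)/n}$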

22. Solving for n gives us $n = 4\,p(1-p)\,z_{1-\alpha/2}^2 / w^2$ • Note: • this requires information about p, which is our goal! • However, p(1 - p) is at a maximum when p = .5 • To be conservative (over- rather than under-estimate the sample size), use .5 in place of p, which gives $n = z_{1-\alpha/2}^2 / w^2$

23. Example: • For an election poll, how many voters should be surveyed to estimate the proportion in favor of re-electing the current mayor to within ±5%, with 95% confidence? • We have a confidence level, $1-\alpha = .95$, so $z_{.975} = 1.96$ • We have a desired precision of ±5% = ±.05, so w = .10 • Conservative: $n = z_{1-\alpha/2}^2 / w^2 = (1.96)^2/(.10)^2 = 384.16$ • We should poll 385 voters to achieve a 95% CI of ±5%

24. What if we have some information on p? • A previous poll tells us that the current office-holder had about 75% of the voter support. • Assuming p = .75: • $n = 4\,p(1-p)\,z_{1-\alpha/2}^2 / w^2 = 4(.75)(.25)(1.96)^2/(.10)^2 = 288.12$ • Using available information, we get a sample size estimate of 289 voters, which can save us considerable time and expense compared to the more conservative estimate.
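
Both sample size calculations follow the same formula, n = 4p(1 - p)z²/w², rounded up to the next whole person. A minimal sketch follows; the function name `sample_size_for_proportion` is hypothetical, and z is taken from the standard normal distribution rather than rounded to 1.96, which does not change either answer here.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(width, conf=0.95, p_guess=0.5):
    """Sample size so that the CI for a proportion has total width `width`.

    Uses n = 4 p (1 - p) z^2 / w^2 and rounds up. The default p_guess = 0.5
    is the conservative choice, since p(1 - p) is largest there.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n_exact = 4 * p_guess * (1 - p_guess) * z**2 / width**2
    return math.ceil(n_exact)


n_conservative = sample_size_for_proportion(0.10)              # +/- 5%, p = .5
n_informed = sample_size_for_proportion(0.10, p_guess=0.75)    # +/- 5%, p = .75
print(n_conservative, n_informed)  # prints 385 289
```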

25. Confidence Interval Calculation for the Difference between two proportions, p1 – p2, Two independent groups • We are often interested in comparing proportions from 2 populations: • Is the incidence of disease A the same in two populations? • Patients are treated with either drug D, or with placebo. Is the proportion “improved” the same in both groups?

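26. Suppose we take independent, random samples from two groups, and estimate a proportion in each. For large enough sample sizes, we know: $P_1 \sim N(p_1,\ p_1(1-p_1)/n_1)$ and $P_2 \sim N(p_2,\ p_2(1-p_2)/n_2)$, approximately. Then the standard error of the difference between the sample proportions is the square root of the sum of the variances: $\sigma_{P_1 - P_2} = \sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$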

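27. Or, since we don’t know the true proportions, the sample estimate of the standard error: $s_{P_1 - P_2} = \sqrt{P_1(1-P_1)/n_1 + P_2(1-P_2)/n_2}$. Thus, for n large, the $(1-\alpha)$ confidence interval estimator is: $(P_1 - P_2) \pm z_{1-\alpha/2}\,s_{P_1 - P_2}$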

28. Example: In a clinical trial for a new drug to treat hypertension, 50 patients were randomly assigned to receive the new drug, and 50 patients to receive a placebo. 34 of the patients receiving the drug showed improvement, while 15 of those receiving placebo showed improvement. Compute a 95% confidence interval estimate for the difference between proportions improved.

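29. Point estimate of $(p_1 - p_2)$: • $P_1$ = 34/50 = .68 • $P_2$ = 15/50 = .30, so $(P_1 - P_2)$ = .68 - .30 = .38 • Since we have $n_1 = n_2 = 50$, our sample size is large enough to use the sample estimate of the standard error: $s_{P_1 - P_2} = \sqrt{(.68)(.32)/50 + (.30)(.70)/50} = \sqrt{.008552} = .0925$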

30. Confidence coefficient: for $1-\alpha = .95$, $z_{1-\alpha/2} = z_{.975} = 1.96$ • Confidence interval estimate: $.38 \pm 1.96(.0925) = .38 \pm .181$ • The 95% CI estimate is (.199, .561), or (19.9%, 56.1%) • The difference between proportions improved is bounded away from zero: it seems that the proportion improved by the drug is clearly greater than the proportion improved by placebo.
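
The hypertension example can be reproduced with a short function. This is a sketch using the unpooled sample estimate of the standard error; the name `two_proportion_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, conf=0.95):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    # Unpooled standard error: square root of the sum of the two variances.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


diff_lower, diff_upper = two_proportion_ci(34, 50, 15, 50)
print(f"({diff_lower:.3f}, {diff_upper:.3f})")  # prints (0.199, 0.561)
```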

31. Using Minitab: Stat > Basic Statistics > 2 Proportions • Enter sample sizes n1 and n2 • Enter the number of successes x1 and x2

32. Test and Confidence Interval for Two Proportions
Sample   X   N  Sample p
1       34  50  0.680000
2       15  50  0.300000
Estimate for p(1) - p(2): 0.38
95% CI for p(1) - p(2): (0.198748, 0.561252)

33. The same cautions apply here as for estimates of a single proportion • the sample size should be large enough in each group that the normal approximation will hold: • $n\pi \ge 5$ and $n(1-\pi) \ge 5$ for each sample • Otherwise: a) use .5 in place of π when estimating the variance for the confidence interval, or b) use some other method. • Minitab offers the option to compute a pooled estimate of the standard error

34. And in summary: • Confidence interval estimates provide • a range of likely values • an associated probability, or confidence level. • The width of the confidence interval depends upon: • The underlying variability in the population • The sample size • The confidence level

35. It is important to keep track of assumptions that we must make about the data: • Samples should be selected randomly • selection of any element is independent of selection of any others • For many cases, we must assume that the underlying population follows a normal distribution • without this assumption, probabilities computed using the • t-distribution • χ²-distribution • F-distribution may not be correct.

36. When we speak of “knowing” the population variance, σ², • we really mean that we have an outside source of information • previous research, census data, etc. • the key is that we are not using the sample estimate, s², based upon the current sample.

37. The key to confidence interval estimation is to know • what parameter you are estimating • the point estimate of the parameter • the confidence level • what distributional assumptions are required • the associated distribution for computing probabilities. • I have started a summary table for you below – completing this table will be a good review exercise.