Chapter 7 Inferences Based on a Single Sample
Parameters and Statistics • A parameter is a numeric characteristic of a population or distribution, usually symbolized by a Greek letter, such as μ, the population mean. • Inferential Statistics uses sample information to estimate parameters. • A Statistic is a number calculated from data. • There are usually statistics that do the same job for samples that the parameters do for populations, such as x̄, the sample mean.
Using Samples for Estimation [Diagram: a statistic computed from the sample (known) is used to estimate the corresponding parameter of the population (unknown), such as μ.]
The Idea of Estimation • We want to find a way to estimate the population parameters. • We only have information from a sample, available in the form of statistics. • The sample mean, x̄, is an estimator of the population mean, μ. • This is called a “point estimate” because it is one point, or a single value.
Interval Estimation • There is variation in x̄, since it is a random variable calculated from data. • A point estimate doesn’t reveal anything about how much the estimate varies. • An interval estimate gives a range of values that is likely to contain the parameter. • Intervals are often reported in polls, such as “56% ±4% favor candidate A.” This suggests we are not sure it is exactly 56%, but we are quite sure that it is between 52% and 60%. • 56% is the point estimate, whereas (52%, 60%) is the interval estimate.
The Confidence Interval • A confidence interval is a special interval estimate involving a percent, called the confidence level. • The confidence level tells how often, if samples were repeatedly taken, the interval estimate would surround the true parameter. • We can use this notation: (L,U) or (LCL,UCL). • L and U stand for Lower and Upper endpoints. The longer versions, LCL and UCL, stand for “Lower Confidence Limit” and “Upper Confidence Limit.” • This interval is built around the point estimate.
Theory of Confidence Intervals • Alpha (α) represents the probability that when the sample is taken, the calculated CI will miss the parameter. • The confidence level is given by (1-α)×100%, and used to name the interval, so for example, we may have “a 90% CI for μ.” • After sampling, we say that we are, for example, “90% confident that we have captured the true parameter.” (There is no probability at this point. Either we did or we didn’t, but we don’t know.)
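The "how often would the interval surround the true parameter" idea can be checked by simulation. The sketch below is only an illustration: it assumes a normal population with made-up values μ = 50 and σ = 10, uses the known-σ interval x̄ ± z·σ/√n as the interval recipe, and the function name `coverage_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage_rate(mu, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated intervals (sigma known) that contain mu."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z-value with upper-tail area alpha/2
    margin = z * sigma / sqrt(n)              # same margin for every sample
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials

# Assumed illustration values: mu = 50, sigma = 10, n = 25, 90% confidence.
rate = coverage_rate(mu=50, sigma=10, n=25, alpha=0.10, trials=2000)
print(f"Observed coverage: {rate:.3f}")
```

The observed coverage should land near 0.90, and roughly a fraction α of the intervals miss μ. Any single interval either contains μ or it does not, which is why the slide speaks of "confidence" rather than probability after sampling.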
How to Calculate CI’s • Many CI’s have the following basic structure: • P ± TS • Where P is the parameter estimate, • T is a “table” value equal to the number of standard deviations needed for the confidence level, • and S is the standard deviation of the estimate. • The quantity TS is also called the “Error Bound” (B) or “Margin of Error.” • The CI should be written as (L,U) where L= P-TS, and U= P+TS. • Don’t forget to convert your P ± TS expression to confidence interval form, including parentheses!
A Confidence Interval for μ • If σ is known, and • the population is normally distributed, or n>30 (so that we can say x̄ is approximately normally distributed), • then x̄ ± z(α/2)·σ/√n gives the endpoints for a (1-α)100% CI for μ. • Note how this corresponds to the P ± TS formula given earlier: P = x̄, T = z(α/2), and S = σ/√n.
Distribution Details • What is z(α/2)? • α is the significance level, P(CI will miss) • The subscript on z, written here in parentheses, refers to the upper tail probability, that is, P(Z>z). • To find this value in the table, look up the z-value for a probability of .5-α/2. • Examples
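As a cross-check on the table lookup, the z-values for common confidence levels can be computed directly. This is a minimal sketch using Python's standard library; the helper name `z_upper` is hypothetical.

```python
from statistics import NormalDist

def z_upper(tail_prob):
    """z-value whose upper-tail probability P(Z > z) equals tail_prob."""
    # inv_cdf works with left-tail area, so convert the upper tail first.
    return NormalDist().inv_cdf(1 - tail_prob)

for conf_level in (0.90, 0.95, 0.99):
    alpha = 1 - conf_level
    print(f"{conf_level:.0%} CI: z({alpha / 2:.3f}) = {z_upper(alpha / 2):.3f}")
```

For a 95% CI, α = 0.05, so the value needed is z(0.025), about 1.96.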
Example: Estimation of µ (σ Known) A random sample of 25 items resulted in a sample mean of 50. Construct a 95% confidence interval estimate for µ if σ = 10.
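The worked solution for this example is not shown in the slide text. A sketch of the calculation, using the x̄ ± z(α/2)·σ/√n form, might look like the following. Since n = 25 is below 30, it assumes the population is normal, which the example does not state; the function name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf_level):
    """CI for the mean when sigma is known: xbar +/- z(alpha/2) * sigma / sqrt(n)."""
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper-tail area alpha/2
    margin = z * sigma / sqrt(n)              # the "TS" part of P +/- TS
    return xbar - margin, xbar + margin

lower, upper = z_interval(xbar=50, sigma=10, n=25, conf_level=0.95)
print(f"({lower:.2f}, {upper:.2f})")
```

By hand: 50 ± 1.96·(10/√25) = 50 ± 3.92, which gives (46.08, 53.92).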
Confidence Interval Estimates [Diagram: types of confidence intervals: Mean (σ Known, σ Unknown), Proportion, Variance.]
Estimation of μ (σ unknown) • We now turn to the situation where σ is unknown but the sample size is large or the sampled population is normal. • Since σ is unknown, we use s in its place. • However, without knowing σ, we are not able to make use of the z table in building a confidence interval. • Instead, we will use a distribution called t (Student’s t). • The t distribution is symmetric and bell-shaped like the standard normal, and also has μ=0, but σ>1, so the shape is flatter in the middle and thicker in the tails.
Student’s t-Distributions • Degrees of Freedom, df: A parameter that identifies each different distribution of Student’s t-distribution. For the methods presented in this chapter, the value of df will be the sample size minus 1, df = n-1. [Graph: the normal distribution compared with Student’s t for df = 15 and df = 5.]
Using t • As the previous graph shows, the t distribution has another parameter, called degrees of freedom (df). So this is actually a family of distributions, with different df values. • The higher the df, the closer the t distribution comes to the standard normal. • For our purposes, df=n-1. It is actually related to the denominator in the formula for s². • There is a t-table in the back of the book. It is different from the z-table, so we have to understand how it works.
The t table • Refer to the table. First you will notice the left-hand column is for df. • When df ≥100, the z-table can be used, because the values will be very close. • This table gives tail probabilities, similar to z(α). However, only a selection of probabilities is given, across the top of the table. • The interior of the table gives the t-values, so it is arranged almost opposite of the z-table. • The notation used for t-values is t(df,α). • Just like z(α), α refers to the upper tail probability.
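Example: Find the value of t(12, 0.025). [Portion of t-table: row df = 12, column for an upper tail probability of 0.025.] Reading across the df = 12 row to the 0.025 column gives t(12, 0.025) ≈ 2.18.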
Confidence Intervals • When we build our confidence interval, α refers to the probability in both tails. • This is not the same α used in looking up the distribution! So what we have to look up is actually α/2, because that’s the upper tail probability. • And so we come to the formula for a (1-α)100% CI for μ when σ is unknown: x̄ ± t(n-1, α/2)·s/√n
Example: A study is conducted to learn how long it takes the typical taxpayer to complete his or her federal income tax return. A random sample of 17 income tax filers showed a mean time (in hours) of 7.8 and a standard deviation of 2.3. Find a 95% confidence interval for the true mean time required to complete a federal income tax return. Assume the time to complete the return is normally distributed. Solution: 1. Parameter of Interest: the mean time required to complete a federal income tax return. 2. Confidence Interval Criteria: a. Assumptions: Sampled population assumed normal, σ unknown. b. Distribution table value: t will be used. c. Confidence level: 1 - α = 0.95
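3. The Sample Evidence: n = 17, x̄ = 7.8, s = 2.3. 4. Calculations: df = 16 and α/2 = 0.025, so t(16, 0.025) = 2.12; 7.8 ± (2.12)(2.3/√17) = 7.8 ± 1.18. 5. (6.62, 8.98) is the 95% confidence interval for µ.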
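The tax-return interval can be reproduced without a t-table. The sketch below is one possible approach, not the method used in the slides: it gets the t-value by numerically integrating the Student's t density (Simpson's rule) and then bisecting, using only the standard library. The helper names `t_cdf` and `t_upper` are hypothetical, and the approach assumes a tail probability below 0.5 and a moderate df (very large df would overflow `math.gamma`).

```python
from math import gamma, pi, sqrt

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, integrating the t density on [0, x] by Simpson's rule."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    h = x / steps                              # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3                 # symmetry: half the area lies below 0

def t_upper(df, tail_prob):
    """t(df, tail_prob): the value with upper-tail probability tail_prob (< 0.5)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):                        # bisection on the upper-tail area
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > tail_prob:
            lo = mid                           # tail still too big, move right
        else:
            hi = mid
    return (lo + hi) / 2

n, xbar, s = 17, 7.8, 2.3                      # sample evidence from the example
t_crit = t_upper(n - 1, 0.025)                 # 95% CI, so look up alpha/2 = 0.025
margin = t_crit * s / sqrt(n)
lower, upper = xbar - margin, xbar + margin
print(f"t = {t_crit:.2f}, CI = ({lower:.2f}, {upper:.2f})")
```

The result agrees with the interval (6.62, 8.98) stated in the solution. In practice a statistics package would supply the t-value directly; the numerical integration here only keeps the sketch self-contained.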
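Confidence Interval for a Proportion • Assumptions • Population Follows Binomial Distribution • Normal Approximation Can Be Used if • p̂ ± 3√(p̂(1-p̂)/n) does not Include 0 or 1 • Or (older guideline) np ≥ 5 and nq ≥ 5, where q = 1-p • Confidence Interval Estimate: p̂ ± z(α/2)·√(p̂(1-p̂)/n), where p̂ = x/n is the sample proportion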
Example A random sample of 400 graduates showed 32 went to grad school. Set up a 95% confidence interval estimate for p.
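The solution to this example is not shown in the slide text. A sketch of the traditional calculation, p̂ ± z(α/2)·√(p̂(1-p̂)/n), could look like this; the function name `proportion_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, conf_level):
    """Traditional (normal-approximation) CI for a population proportion."""
    p_hat = successes / n                          # point estimate
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)     # z times the standard error
    return p_hat - margin, p_hat + margin

lower, upper = proportion_interval(successes=32, n=400, conf_level=0.95)
print(f"({lower:.3f}, {upper:.3f})")
```

By hand: p̂ = 32/400 = 0.08 and 0.08 ± 1.96·√(0.08·0.92/400) = 0.08 ± 0.027, so the interval is roughly (0.053, 0.107). Here np̂ = 32 and nq̂ = 368, so the normal approximation is reasonable.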
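New Method • A new method (Agresti & Coull, 1998) can be used to avoid the problems with extreme p’s. There is no need to check the np or nq values with this method. • Define n* = n + z² and p* = (x + z²/2)/n*, where z = z(α/2) and x is the number of successes • Then a (1-α)100% CI for p is given by p* ± z(α/2)·√(p*(1-p*)/n*)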
Example • In the 2004 presidential election, Ralph Nader had about 0.34% of the vote. Suppose an exit poll was taken to estimate Nader’s share of the vote, with a sample size of 200, and 2 people indicated they voted for Nader. • Note that with the traditional method, p̂ = 2/200 = 0.01 and np̂ = 2, which is too small, so the formula is not valid. • Use the p* method to construct a 95% CI for p.
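The slide's own definition of p* did not survive in this text, so the sketch below assumes the adjusted estimate from Agresti & Coull (1998): n* = n + z² and p* = (x + z²/2)/n*, with the interval p* ± z·√(p*(1-p*)/n*). The function names are hypothetical, and the traditional interval is computed alongside for comparison.

```python
from math import sqrt
from statistics import NormalDist

def traditional_interval(x, n, conf_level):
    """Traditional CI: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p_hat = x / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def adjusted_interval(x, n, conf_level):
    """Agresti-Coull style CI (assumed form): shift the estimate toward 1/2."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    n_star = n + z * z                         # adjusted sample size
    p_star = (x + z * z / 2) / n_star          # adjusted proportion
    margin = z * sqrt(p_star * (1 - p_star) / n_star)
    return p_star - margin, p_star + margin

trad = traditional_interval(x=2, n=200, conf_level=0.95)
adj = adjusted_interval(x=2, n=200, conf_level=0.95)
print(f"traditional: ({trad[0]:.4f}, {trad[1]:.4f})")
print(f"adjusted:    ({adj[0]:.4f}, {adj[1]:.4f})")
```

With 2 successes out of 200, the traditional lower limit falls below zero, which is impossible for a proportion. The adjusted interval stays inside (0, 1) and still contains the actual vote share of about 0.0034.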
Sample Size Calculation • We may wish to decide upon a sample size so that we can get a confidence interval with a pre-determined width. • This is common in polls, where the margin of error is usually decided in advance. • All CI’s we have seen so far have the form P±B, where B is the margin of error. • We want to fix B in advance.
Sample Size for Estimating µ, σ Known • Suppose X is a random variable with σ=10 and we want a 90% CI to have a Bound, or Margin of Error, of 3. • Use the formula B = z(α/2)·σ/√n. • Fill in the numbers: 3 = 1.645·10/√n • Solve: n = (1.645·10/3)² ≈ 30.07 • This is the minimum sample size, but we need a whole number, so round up to n=31.
Sample Size for Estimating µ, σ Unknown • If σ is unknown, the confidence interval will be calculated using the t distribution, unless n is very large. • But the degrees of freedom depend on n, which we don’t know. • The calculation also depends on s, which we don’t know until after sampling. • We must have an initial guess for s, and then use the normal distribution to approximate the t distribution, since it does not require knowing n.
Example (σ unknown) • A manufacturer needs to be able to estimate the width of a new part to within 2mm with 95% confidence. There is not enough history to know what σ would be, so a pilot study is run by measuring 6 parts, and finding s=3.4mm. • Using z(0.025) = 1.96, with s as the initial guess for σ: n = (1.96·3.4/2)² ≈ 11.10 • Rounding up to the next whole number gives n=12.
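Both sample-size calculations for µ come from solving B = z(α/2)·σ/√n for n, which gives n = (z(α/2)·σ/B)², rounded up. A minimal sketch follows; the function name `sample_size_for_mean` is hypothetical, and when σ is unknown the pilot-study s is passed in as the guess.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sd, bound, conf_level):
    """Smallest whole n with z(alpha/2) * sd / sqrt(n) <= bound."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z * sd / bound) ** 2)        # always round up, never down

n_known = sample_size_for_mean(sd=10, bound=3, conf_level=0.90)    # sigma known
n_pilot = sample_size_for_mean(sd=3.4, bound=2, conf_level=0.95)   # s from pilot study
print(n_known, n_pilot)
```

These match the slide answers of n=31 and n=12. Note that in the σ unknown case the final interval would be built with t rather than z, so the actual margin of error may come out somewhat larger than planned for a sample this small.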
Sample Size for Estimating p, a Population Proportion • With a population proportion, we also have a problem in getting the standard deviation part of the Margin of Error, since it depends on p, the thing we are trying to estimate. • There are two possibilities: • 1) We may have a preliminary guess about p that we can use, or • 2) We can use p=.5 because that maximizes the standard deviation. • The sample size will be calculated from the desired margin of error, or error bound.
Example (proportion) • A pollster wants to do a simple random sample to estimate the proportion of the population favoring an increase in property taxes for school funding. He wants a margin of error of 3%, with 90% confidence. The general belief is that it will be a close election, so an initial value of p=.5 is reasonable. • Using z(0.05) = 1.645: n = z²·p(1-p)/B² = (1.645)²(.5)(.5)/(.03)² ≈ 751.67 • Rounding up to the next whole number gives n=752.
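The pollster's calculation solves B = z(α/2)·√(p(1-p)/n) for n, giving n = z²·p(1-p)/B². A sketch, with the hypothetical function name `sample_size_for_proportion`:

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(bound, conf_level, p_guess=0.5):
    """Smallest whole n with z(alpha/2) * sqrt(p(1-p)/n) <= bound."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # p_guess = 0.5 maximizes p(1-p), so it gives the most conservative n.
    return ceil(z * z * p_guess * (1 - p_guess) / bound ** 2)

n_poll = sample_size_for_proportion(bound=0.03, conf_level=0.90)
print(n_poll)
```

This reproduces n=752. Supplying a preliminary guess other than .5 through `p_guess` would give a smaller required sample, since p(1-p) is largest at p=.5.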
Misc. Notes • The CI for µ formula using z is also called the “Large Sample” CI. It is valid when σ is known, for any sample size, but it also serves as an approximation of the t formula (using s) when n is large. How large? Many books say n≥30. I recommend making use of the t table up to n=100 since that is how far it goes. Statistical computer programs will always calculate t values, regardless of how large n is, for the σ unknown case.
Misc. Notes • The CI for µ formula using t is also called the “Small Sample” CI, but only because the other one is called “Large Sample.” It is valid for any sample size when σ is unknown and the population is normal. • We do not cover methods for small samples that do not come from a normal population in this course (non-parametric methods).
Misc. Notes • The t table is limited because it does not have a very good selection of probabilities. It also “jumps” in the df column. It is possible to use the “closest” value or interpolate when you can’t find what you need, but a better option is to use the Excel functions, TDIST and TINV. • However, you have to be VERY careful about what Excel is giving you.
Excel’s TDIST function • TDIST takes a t value and returns the tail probability. You can choose one or two tails.
Excel’s TINV Function • The TINV Function takes a two-tailed probability and returns a t-value (just what we need now).
Excel Function Comparison The NORMSINV Function, by contrast, takes a left-tailed probability and returns a z-value. This means you have to enter α/2 and take the negative, or else use 1- α/2 as the argument.