Linking Probability to Statistical Inference

Concepts in Statistics

The Big Picture

Probability

Statistical inference always involves an argument based on probability. Recall the following important points about probability:
• Probability is a measure of how likely an event is to occur.
• We can make probability statements only about random events. Random here means that the outcome is uncertain in the short run but has a predictable pattern in the long run.

Inference

Research Questions That Involve Inference

Inference

Each research question from the previous slide relates to either a categorical variable or a quantitative variable. In this course, three criteria determine the inference procedure we use:
• The type of variable.
• The type of inference (estimate a population value or test a claim about a population value).
• The number of populations involved.

Proportions from Random Samples Vary

Imagine a small college with only 200 students, and suppose that 60% of these students are eligible for financial aid.
• Population: the 200 students at the college.
• Variable: Eligibility for financial aid is a categorical variable, so we use a proportion as a summary.
• Population proportion: 0.60 of the population is eligible for financial aid.
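To see that sample proportions vary, here is a minimal Python sketch of drawing a few random samples from this college. The sample size of 8, the number of samples, and the seed are arbitrary choices made for illustration; they do not come from the slides.

```python
import random

# Population from the example: 200 students, 60% eligible (120 eligible, 80 not).
# 1 = eligible for financial aid, 0 = not eligible.
population = [1] * 120 + [0] * 80


def sample_proportion(pop, n, rng):
    """Proportion of eligible students in one simple random sample of size n."""
    sample = rng.sample(pop, n)  # draws n students without replacement
    return sum(sample) / n


rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
# Sample size 8 and five repetitions are hypothetical choices.
proportions = [sample_proportion(population, 8, rng) for _ in range(5)]
print(proportions)
```

Each run of `sample_proportion` gives one statistic. The values generally differ from one sample to the next, even though the population proportion stays fixed at 0.60.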

Parameters vs. Statistics

One of the goals of inference is to draw a conclusion about a population on the basis of a random sample from the population.
• A parameter is a number that describes a population.
• A statistic is a number that we calculate from a sample.
When we do inference, the parameter is not known because it is impossible or impractical to gather data from everyone in the population.
• We make an inference about the population parameter on the basis of a sample statistic.
• Statistics from samples vary.
If the variable is categorical, the parameter and the statistic are both proportions. If the variable is quantitative, the parameter and the statistic are both means.

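Parameters vs. Statistics

Different notation for parameters and statistics:
• Proportion: the population parameter is $p$ and the sample statistic is $\hat{p}$.
• Mean: the population parameter is $\mu$ and the sample statistic is $\bar{x}$.
Sometimes we refer to the sample statistics as “p-hat” and “x-bar.”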

Random Sampling

Observations about patterns in random sampling:
• Proportions from random samples approximate the population proportion, $p$, so sample proportions average out to the population proportion.
• Larger random samples better approximate the population proportion, so large samples have sample proportions closer to $p$. In other words, a sampling distribution for large samples has less variability.
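Both observations can be checked with a short simulation. The sketch below is only an illustration: it assumes a very large population with proportion 0.60, so each sampled individual is treated as an independent draw, and the sample sizes (25 and 100), the number of samples (2000), and the seed are arbitrary choices.

```python
import random
import statistics


def simulate_proportions(p, n, num_samples, rng):
    """Return sample proportions from num_samples random samples of size n."""
    results = []
    for _ in range(num_samples):
        # Each individual is a success with probability p (large-population assumption).
        successes = sum(1 for _ in range(n) if rng.random() < p)
        results.append(successes / n)
    return results


rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable
small = simulate_proportions(0.60, 25, 2000, rng)
large = simulate_proportions(0.60, 100, 2000, rng)

for label, props in (("n = 25", small), ("n = 100", large)):
    # Print the mean and standard deviation of the simulated sample proportions.
    print(label, round(statistics.mean(props), 3), round(statistics.stdev(props), 3))
```

In a run like this, both means land near 0.60, and the standard deviation for the larger sample size is noticeably smaller.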

The Sampling Distribution of Sample Proportions

For a categorical variable, imagine a population with a proportion $p$ of successes. Create a model that describes the sample proportions from all possible random samples of size $n$ from this population. The distribution of sample proportions for ALL samples of the same size is called the sampling distribution of sample proportions. The model has center, spread, and shape.
• Center: The mean of the sample proportions is $p$, the population proportion.
• Spread: The standard deviation of the sample proportions is $\sqrt{\frac{p(1-p)}{n}}$. The standard deviation of the sampling distribution is called the standard error.
• Shape: A normal model is a good fit if the expected numbers of successes and failures are each at least 10. We can translate these conditions into formulas: $np \geq 10$ and $n(1-p) \geq 10$.
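The spread formula and the shape conditions translate directly into code. This is a small sketch with illustrative function names; the values $p = 0.60$ and $n = 50$ (and $n = 8$ for contrast) are hypothetical inputs, not figures from the slides.

```python
import math


def standard_error(p, n):
    """Standard deviation of the sampling distribution: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def normal_model_fits(p, n):
    """Check that expected successes and failures are each at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10


p, n = 0.60, 50  # hypothetical population proportion and sample size
print(standard_error(p, n))     # spread of the sample proportions
print(normal_model_fits(p, n))  # expected counts are 30 and 20
print(normal_model_fits(p, 8))  # expected counts are 4.8 and 3.2
```

With $n = 50$ both expected counts clear the threshold, so the normal model is reasonable. With $n = 8$ neither does, so it is not.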

Applying the Model for Sampling Distribution

Compare the mean and standard deviation observed in the simulation to the model. The conditions are met, so a normal model is a good fit. The model is a good description of the center, spread, and shape we observed in the simulation.

General Process for Developing a Probability Model for Inference

Sampling Distribution

If a normal model is a good fit for the sampling distribution, we can standardize the values by calculating a $z$-score.

Formulas for sample proportions:
$$\text{standard error} = \sqrt{\frac{p(1-p)}{n}}$$
$$z = \frac{\hat{p} - p}{\text{standard error}}$$
We can also write this as one formula:
$$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}$$
• A positive $z$-score indicates that the sample proportion is larger than the parameter.
• A negative $z$-score indicates that the sample proportion is smaller than the parameter.
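As a quick illustration of the one-formula version, the sketch below computes $z$-scores for two hypothetical sample proportions. The function name and the inputs ($p = 0.60$, $n = 50$, sample proportions 0.70 and 0.50) are assumptions chosen for the example.

```python
import math


def z_score(p_hat, p, n):
    """Standardize a sample proportion: (p_hat - p) / sqrt(p(1 - p) / n)."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return (p_hat - p) / standard_error


# Hypothetical values: population proportion 0.60, samples of size 50.
z_above = z_score(0.70, 0.60, 50)  # sample proportion larger than the parameter
z_below = z_score(0.50, 0.60, 50)  # sample proportion smaller than the parameter
print(round(z_above, 2), round(z_below, 2))
```

The first $z$-score is positive and the second is negative, matching the interpretation of the sign given above.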

Statistical Inference

Our goal in statistical inference is to infer from the sample data some conclusion about the wider population the sample represents. Statistical inference uses the language of probability to say how trustworthy our conclusions are.

Two types of inference: confidence intervals and hypothesis tests.
• We construct a confidence interval when our goal is to estimate a population parameter.
• We conduct a hypothesis test when our goal is to test a claim about a population parameter.

Confidence Interval

The purpose of a confidence interval is to use the sample proportion to construct an interval of values that we can be reasonably confident contains the true population proportion. If we use 2 standard errors as the margin of error, we can write the confidence interval as:
sample statistic ± margin of error

Confidence Interval

A look at 95% confidence intervals on the number line.

The formula for a 95% confidence interval:
sample statistic ± margin of error
sample proportion ± 2(standard error)
• The lower end of the confidence interval is sample proportion − 2(standard error).
• The upper end of the confidence interval is sample proportion + 2(standard error).

Confidence Interval

Every confidence interval defines an interval on the number line that is centered at the sample proportion. For example, suppose a sample of 100 part-time college students is 64% female. Here is the 95% confidence interval built around this sample proportion of 0.64.
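As a rough sketch of the arithmetic behind this interval, the snippet below builds it from the sample proportion 0.64 and the sample size 100. It assumes the standard error is estimated by substituting the sample proportion for the unknown population proportion in $\sqrt{p(1-p)/n}$; the function name is illustrative.

```python
import math


def confidence_interval_95(p_hat, n):
    """Approximate 95% interval: sample proportion +/- 2 standard errors.

    Assumption: the standard error is estimated with p_hat in place of the
    unknown population proportion p.
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = 2 * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error


# Example from the text: 100 part-time college students, 64% female.
lower, upper = confidence_interval_95(0.64, 100)
print(round(lower, 3), round(upper, 3))
```

Under that assumption the standard error is 0.048, the margin of error is 0.096, and the interval runs from about 0.544 to about 0.736.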

Confidence Interval

We know the margin of error in a confidence interval comes from the standard error in the sampling distribution. For a 95% confidence interval, the margin of error is equal to 2 standard errors. This is shown in the following diagram.

Hypothesis Tests

The purpose of a hypothesis test is to use sample data to test a claim about a population parameter.
• We make a claim about a population proportion.
• From the claim, we state an assumption about the value of the population proportion.
• We construct a simulation or a normal model to represent the sampling distribution that occurs when sampling from a population with this assumed value.
• If the normal model is a good fit for the sampling distribution, we can find a z-score and use a simulation to associate a probability with a “likely” or “unlikely” statement.

Hypothesis Tests: Example

If last year 20% of the US adult population smoked, we might claim that the percentage of smokers in the US this year is greater. So in the simulation we set $p = 0.20$ and see if the data cause us to question this assumption.
• If a sample proportion is likely to occur in the sampling distribution, then this sample result could have come from a population with the assumed value. We therefore conclude that the evidence from the sample is not strong enough to support our original claim.
• In our example, a sample proportion that is likely to occur means we do not question the assumption that we made when we set $p = 0.20$. We cannot conclude that the percentage of smokers in the US is greater than 20% this year.

Hypothesis Tests: Example continued

If last year 20% of the US adult population smoked, we might claim that the percentage of smokers in the US this year is greater. So in the simulation we set $p = 0.20$ and see if the data cause us to question this assumption.
• If a sample proportion supports our claim and is unlikely to occur in the sampling distribution, then it is unlikely that this sample result came from a population with the assumed value. In this situation, the data lead us to doubt our assumption about the value of the parameter. We conclude that the evidence from the sample is strong enough to support the claim.
• In our example, a sample proportion that is unusually large means that the data make us doubt the assumption we made when we set $p = 0.20$. We therefore conclude that the percentage of smokers in the US is probably greater than 20% this year.
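The reasoning in the smoking example can be sketched with a normal model. The sample below (400 adults, 96 of them smokers) is hypothetical data invented for illustration, and the slides do not give a cutoff for "unlikely", so the code only reports the probability.

```python
import math
from statistics import NormalDist

p_assumed = 0.20  # assumed population proportion (last year's value)
n = 400           # hypothetical sample size
smokers = 96      # hypothetical number of smokers in the sample

p_hat = smokers / n  # sample proportion
standard_error = math.sqrt(p_assumed * (1 - p_assumed) / n)
z = (p_hat - p_assumed) / standard_error

# Probability of a sample proportion at least this large if p really is 0.20.
probability = 1 - NormalDist().cdf(z)

# The normal model is reasonable here: expected counts are 80 and 320.
print(round(p_hat, 2), round(z, 2), round(probability, 4))
```

For this made-up sample the $z$-score is 2, so a sample proportion this large would be fairly unusual if the population proportion were still 0.20. A sample proportion much closer to 0.20 would give a $z$-score near 0 and a large probability, which would not support the claim.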

Quick Review

• What is inference based on?
• What are the two types of inference procedures?
• What is a parameter?
• Do larger samples have more or less variability?
• What is the purpose of a confidence interval?
• When a normal model is a good fit for the sampling distribution, the 95% confidence interval has a margin of error equal to what?
