Statistical Inference: Understanding Population Distribution Trends

Introduction to Basic Statistical Methods Part 1: “Statistics in a Nutshell” UWHC Scholarly Forum March 19, 2014 Ismor Fischer, Ph.D. UW Dept of Statistics ifischer@wisc.edu

All slides posted at http://www.stat.wisc.edu/~ifischer/Intro_Stat/UWHC

Right-click on image for full .pdf article • Links in article to access datasets

“Statistical Inference” POPULATION Women in the U.S. who have given birth

“Statistical Inference” • POPULATION: But what does that mean (at least in principle)? • Study Question: Has the mean (i.e., average) of X = “Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)? • Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. • [Figure: Population Distribution of X, with its unknown features marked “?”]

“Statistical Inference” • Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. • [Figure: Population Distribution of X, built up from individual ages x, ad infinitum…] • Individual ages from the population tend to collect around a single center with a certain amount of spread, but occasional “outliers” are present in left and right symmetric tails. More precisely…


~ The Normal Distribution ~ • Parameters: “population mean” μ and “population standard deviation” σ • symmetric about its mean • unimodal (i.e., one peak), with left and right “tails” • models many (but not all) naturally-occurring systems • useful mathematical properties… • Example: X = Body Temp (°F), centered at 98.6, low variability • Example: X = IQ score, centered at 100, high variability

~ The Normal Distribution ~ • Parameters: “population mean” μ and “population standard deviation” σ • symmetric about its mean • unimodal (i.e., one peak), with left and right “tails” • models many (but not all) naturally-occurring systems • useful mathematical properties… • Approximately 95% of the population values are contained between μ − 2σ and μ + 2σ, leaving 2.5% in each tail. • 95% is called the confidence level. 5% is called the significance level.
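The "approximately 95% within 2 standard deviations" rule can be checked numerically. The following is a minimal sketch in Python, using only the standard library's `statistics.NormalDist`; the variable names are illustrative and not part of the slides.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
# Any normal distribution gives the same answer after standardizing.
z = NormalDist(mu=0.0, sigma=1.0)

# Share of values within 2 standard deviations of the mean.
within_two_sigma = z.cdf(2.0) - z.cdf(-2.0)

# Multiplier that captures exactly the central 95%.
exact_multiplier = z.inv_cdf(0.975)

print(f"P(mu - 2*sigma < X < mu + 2*sigma) = {within_two_sigma:.4f}")
print(f"Multiplier for exactly 95% = {exact_multiplier:.3f}")
```

The first number comes out slightly above 95% and the second slightly below 2, which is why the slides say "≈ 2σ".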

“Statistical Inference” via… “Hypothesis Testing” • Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)? • Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. • The population mean μ cannot be found with 100% certainty, but can be estimated with high confidence (e.g., 95%). • “Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)

“Statistical Inference” via… “Hypothesis Testing” (T-test) • “Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010) • Random Sample, size n = 400 ages: x1, x2, x3, x4, x5, … etc…, x400 • FORMULA: sample mean age x̄ = (x1 + x2 + … + x400) / 400 = ? • Do the data tend to support or refute the null hypothesis? Is the difference STATISTICALLY SIGNIFICANT, at the 5% level?

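~ The Normal Distribution ~ • Population Distribution (of ages): X, with mean μ and standard deviation σ • Take samples of size n and compute the mean age x̄ of each one… etc… • “Sampling Distribution” (of mean ages): what does it look like? • Via mathematical proof, it is also normal, centered at the same mean μ, with standard deviation σ/√n. • Actually, this is a special case of…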

~ The Normal Distribution ~ • Population Distribution (of ages): X • Samples, size n… etc… • “Sampling Distribution” (of mean ages) • Actually, this is a special case of the CENTRAL LIMIT THEOREM: even if the population distribution (of ages) is not normal, the sampling distribution (of mean ages) becomes approximately normal as n gets larger.
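The Central Limit Theorem can be illustrated by simulation. The sketch below is a hedged Python example, not taken from the slides: it assumes a deliberately skewed population (exponential, mean 1, standard deviation 1), and the sample size, number of samples, and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(2014)  # fixed seed so the run is repeatable

n = 50              # size of each sample (hypothetical)
num_samples = 2000  # number of repeated samples (hypothetical)

# Skewed, non-normal population: exponential with mean 1 and sd 1.
pop_mean, pop_sd = 1.0, 1.0

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = pop_sd / n ** 0.5  # theory: sigma / sqrt(n)

# Share of sample means within 2 predicted sd's of the population mean.
half_width = 2 * predicted_sd
share_within = sum(abs(m - pop_mean) <= half_width for m in sample_means) / num_samples

print(f"mean of sample means: {mean_of_means:.3f} (population mean {pop_mean})")
print(f"sd of sample means:   {sd_of_means:.3f} (predicted {predicted_sd:.3f})")
print(f"share within 2 sd:    {share_within:.3f}")
```

Even though individual values are strongly skewed, the sample means should cluster symmetrically around the population mean, with far less spread than the population itself.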

~ The Normal Distribution ~ • Population Distribution (of ages): X • “Sampling Distribution” (of mean ages) • The sample mean values have much less variability about μ than the population values!

~ The Normal Distribution ~ “Sampling Distribution” Population Distribution (of mean ages) (of ages) 95% 2.5% 2.5% ≈ 2 σ ≈ 2 σ Approximately 95% of the population values are contained between  – 2σ and  + 2σ. Approximately 95% of the sample mean values are contained between and

In principle… • 2σ/√n is called the 95% margin of error. • Approximately 95% of the sample mean values x̄ (Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, etc…) are contained between μ − 2σ/√n and μ + 2σ/√n.


But from the samples’ point of view… • 2σ/√n is called the 95% margin of error. • Approximately 95% of the sample mean values x̄ are contained between μ − 2σ/√n and μ + 2σ/√n. • Approximately 95% of the intervals from x̄ − 2σ/√n to x̄ + 2σ/√n (one interval per sample: Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, …) contain μ, and approx 5% do not.
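The switch from "sample means lie near μ" to "intervals around each sample mean contain μ" is a rearrangement of one inequality. A short worked derivation, assuming the 95% margin of error is 2σ/√n:

```latex
\begin{align*}
\mu - \frac{2\sigma}{\sqrt{n}} \le \bar{x} \le \mu + \frac{2\sigma}{\sqrt{n}}
  &\iff -\frac{2\sigma}{\sqrt{n}} \le \bar{x} - \mu \le \frac{2\sigma}{\sqrt{n}} \\
  &\iff \bar{x} - \frac{2\sigma}{\sqrt{n}} \le \mu \le \bar{x} + \frac{2\sigma}{\sqrt{n}}
\end{align*}
```

The two statements are true for exactly the same samples, so if the first holds for approximately 95% of samples, so does the second.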

~ The Normal Distribution ~ “Sampling Distribution” Population Distribution (of mean ages) (of ages) 95% 2.5% 2.5% ≈ 2 σ ≈ 2 σ Approximately 95% of the population values are contained between  – 2σ and  + 2σ. Approximately 95% of the intervals from to contain , and approx 5% do not. Approximately 95% of the sample mean values are contained between and

“Statistical Inference” via… “Hypothesis Testing” • “Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010) • SAMPLE, n = 400 ages: x1, x2, x3, x4, x5, … etc…, x400 • FORMULA: sample mean x̄ = 25.6 • Approximately 95% of the intervals from x̄ − 2σ/√n to x̄ + 2σ/√n contain μ, and approx 5% do not. • 95% margin of error = 2σ/√n • PROBLEM! σ is unknown the vast majority of the time!

“Statistical Inference” POPULATION via… “Hypothesis Testing” Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)? Present Day: Assume “Mean Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. “Null Hypothesis” H0: pop mean age  = 25.4 (i.e., no change since 2010) FORMULA SAMPLE n = 400 ages x4 x1 sample variance x3 = modified average of the squared deviations from the mean x2 x5 sample mean … etc… = 25.6 x400 sample standard deviation 95% margin of error

“Statistical Inference” POPULATION via… “Hypothesis Testing” Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)? Present Day: Assume “Mean Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population. “Null Hypothesis” H0: pop mean age  = 25.4 (i.e., no change since 2010) FORMULA SAMPLE n = 400 ages 400 x4 x1 sample variance x3 x2 x5 sample mean … etc… = 25.6 x400 sample standard deviation 95% margin of error = 1.6 1.6 = 0.16

Approximately 95% of the intervals from x̄ − (margin of error) to x̄ + (margin of error) contain μ, and approx 5% do not.

95% margin of error = 0.16 • x̄ − 0.16 = 25.44 and x̄ + 0.16 = 25.76 • BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”).
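The interval arithmetic can be reproduced from the three summary numbers quoted in the slides (n = 400, sample mean 25.6, sample standard deviation 1.6). A small Python sketch, using the slides' rounded multiplier of 2; the variable names are illustrative:

```python
import math

n = 400           # sample size (from the slides)
x_bar = 25.6      # sample mean (from the slides)
s = 1.6           # sample standard deviation (from the slides)
null_mean = 25.4  # value of mu under the null hypothesis

standard_error = s / math.sqrt(n)    # 1.6 / 20
margin_of_error = 2 * standard_error # slides' rounded multiplier of 2
lower = x_bar - margin_of_error
upper = x_bar + margin_of_error

# Does the interval contain the null value?
null_inside = lower <= null_mean <= upper

print(f"95% margin of error = {margin_of_error:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
print(f"null value {null_mean} inside interval: {null_inside}")
```

Statistical software typically uses the exact T multiplier rather than 2, which is why the R output in the slides reports the slightly narrower interval 25.44273 to 25.75727.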

Two main ways to conduct a formal hypothesis test: • 95% CONFIDENCE INTERVAL FOR µ: 25.44 to 25.76. BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”). • “P-VALUE” of our sample: IF H0 is true, then we would expect a random sample mean that is at least 0.2 years away from μ = 25.4 (as ours was) to occur with probability 1.28%. Very informally, the p-value of a sample measures how well it “agrees” with the null hypothesis (it is a probability, hence a number between 0 and 1, computed assuming H0 is true). Hence a very small p-value indicates strong evidence against the null hypothesis. The smaller the p-value, the stronger the evidence, and the more “statistically significant” the finding (e.g., p < .0001).
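The 1.28% figure can be reproduced from the summary statistics. The sketch below is one possible Python approach, not the method used in the slides (which rely on R's `t.test`): it computes the T-score, then integrates the T density numerically with Simpson's rule, since the Python standard library has no built-in T distribution. The helper names `t_pdf` and `t_central_area` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's T distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_central_area(t, df, steps=2000):
    """P(-t < T < t), by Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # double the half-area, by symmetry

n, x_bar, s, null_mean = 400, 25.6, 1.6, 25.4

# How many standard errors is the sample mean from the null value?
t_score = (x_bar - null_mean) / (s / math.sqrt(n))

# Two-sided p-value: probability of a T-score at least this extreme.
p_value = 1 - t_central_area(abs(t_score), n - 1)

print(f"t = {t_score:.2f}, df = {n - 1}, p-value = {p_value:.5f}")
```

The result should agree with the `t = 2.5, df = 399, p-value = 0.01282` line in the R output quoted in the slides.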

Two main ways to conduct a formal hypothesis test: • FORMAL CONCLUSIONS: • The 95% confidence interval corresponding to our sample mean (25.44 to 25.76) does not contain the “null value” of the population mean, μ = 25.4 years. • The p-value of our sample, .0128, is less than the predetermined α = .05 significance level. • Based on our sample data, we may (moderately) reject the null hypothesis H0: μ = 25.4 in favor of the two-sided alternative hypothesis HA: μ ≠ 25.4, at the α = .05 significance level. • INTERPRETATION: According to the results of this study, there exists a statistically significant difference between the mean ages at first birth in 2010 (25.4 years old) and today, at the 5% significance level. Moreover, the evidence from the sample data would suggest that the population mean age today is significantly older than in 2010, rather than significantly younger. • However, one problem remains…


Population Distribution (of ages): Normal Distribution • “Sampling Distribution” (mean ages): T Distribution in place of the Normal Distribution, once the unknown σ is replaced by the sample standard deviation s. Alas, this introduces “sampling variability.” • …IF n is large, e.g., n ≥ 30, the ≈ 2 multiplier still applies: • Approximately 95% of the population values are contained between μ − 2s and μ + 2s. • Approximately 95% of the sample mean values are contained between μ − 2s/√n and μ + 2s/√n. • Approximately 95% of the intervals from x̄ − 2s/√n to x̄ + 2s/√n contain μ, and approx 5% do not.

Edited R code:

Generates a normally-distributed random sample of 400 age values.

y = rnorm(400, 0, 1)
z = (y - mean(y)) / sd(y)
x = 25.6 + 1.6*z
sort(round(x, 1))

[1] 19.6 20.2 20.4 20.5 21.2 22.3 22.3 22.4 22.4 22.4 22.6 22.7 22.7 22.7 22.8
[16] 23.0 23.0 23.1 23.1 23.2 23.2 23.2 23.2 23.2 23.3 23.4 23.4 23.4 23.5 23.5
etc...
[391] 28.7 28.7 28.9 29.2 29.3 29.4 29.6 29.7 29.9 30.2

Calculates sample mean and standard deviation.

c(mean(x), sd(x))
[1] 25.6 1.6

t.test(x, mu = 25.4)

One Sample t-test
data: x
t = 2.5, df = 399, p-value = 0.01282
alternative hypothesis: true mean is not equal to 25.4
95 percent confidence interval: 25.44273 25.75727
sample estimates: mean of x 25.6

Population Distribution (of ages): Normal Distribution • “Sampling Distribution” (mean ages): T Distribution in place of the Normal Distribution • …IF n is large, e.g., n ≥ 30: approximately 95% of the population values are contained between μ − 2s and μ + 2s; approximately 95% of the sample mean values are contained between μ − 2s/√n and μ + 2s/√n; and approximately 95% of the intervals from x̄ − 2s/√n to x̄ + 2s/√n contain μ, and approx 5% do not. • But if n is small…

If n is large, T-score ≈ 2. • If n is small, T-score > 2. • … the “T-score” increases (from ≈ 2 to a max of 12.706, at n = 2, for a 95% confidence level) as n decreases ⇒ larger margin of error ⇒ less power to reject, even if a genuine statistically significant difference exists!
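The growth of the T-score as n shrinks can be tabulated. The following is a sketch in Python, standard library only, that finds the 95% T-score by bisection on a numerically integrated T distribution; the helper names (`t_pdf`, `t_cdf`, `t_critical`) and the chosen sample sizes are hypothetical, and a statistics package would normally supply these values directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's T distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, by Simpson's rule on [0, x] (steps even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, confidence=0.95):
    """T-score leaving (1 - confidence)/2 in each tail, by bisection."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0  # assumes the answer lies below 100
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Degrees of freedom for a one-sample T-test are n - 1.
for n in (2, 5, 10, 30, 400):
    print(f"n = {n:3d}: T-score = {t_critical(n - 1):.3f}")
```

The smallest possible sample, n = 2, gives the maximum value quoted in the slide, and the T-score falls toward 2 as n grows.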

“Statistical Inference” POPULATION via… “Hypothesis Testing” Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)? Present Day: Assume “Mean Age at First Birth” follows a normal distribution (i.e., “bell curve”)in the population. T-test Two loose ends H0: pop mean age  = 25.4 (i.e., no change since 2010) Random Sample size n = 400 ages “Null Hypothesis” x4 x1 x3 FORMULA x2 x5 sample mean age … etc… x400 Do the data tend to support or refute the null hypothesis? Is the difference STATISTICALLY SIGNIFICANT, at the 5% level?

“Statistical Inference” POPULATION via… “Hypothesis Testing” Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)? Present Day: Assume “Mean Age at First Birth” follows a normal distribution (i.e., “bell curve”)in the population. T-test Two loose ends H0: pop mean age  = 25.4 (i.e., no change since 2010) “Null Hypothesis” Check? The reasonableness of the normality assumption is empirically verifiable, and in fact formally testable from the sample data. If violated (e.g., skewed) or inconclusive (e.g., small sample size), then “distribution-free” nonparametric tests can be used instead of the T-test. Examples: Sign Test, Wilcoxon Signed Rank Test (= Mann-Whitney Test)

“Statistical Inference” POPULATION via… “Hypothesis Testing” Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)? Present Day: Assume “Mean Age at First Birth” follows a normal distribution (i.e., “bell curve”)in the population. T-test Two loose ends H0: pop mean age  = 25.4 (i.e., no change since 2010) Random Sample size n = 400 ages “Null Hypothesis” x4 x1 x3 x2 Sample size n partially depends on the power of the test, i.e., the desired probability of correctly rejecting a false null hypothesis (80% or more). x5 … etc… x400

Introduction to Basic Statistical Methods • Part 1: Statistics in a Nutshell • Part 2: Overview of Biostatistics: “Which Test Do I Use??” • Sincere thanks to… • Judith Payne • Heidi Miller • Samantha Goodrich • Troy Lawrence • YOU!
