Topic 5. Statistical Inference: Tests of Significance
The Reasoning of Tests of Significance
One may claim that he makes 80% of his basketball free throws. How is his claim tested? The only solution is to ask him to shoot, say, 50 free throws. If he makes
The Survey of Study Habits and Attitudes (SSHA) is a psychological test that measures students’ study habits and attitudes toward school. Scores range from 0 to 200. Suppose we know that scores for college students in the population are normally distributed with mean 115 and standard deviation σ = 30.
A teacher suspects that the mean score for older students is higher than 115. She gives the SSHA to an SRS of 25 students who are at least 30 years old.
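Under the claim µ = 115, the sampling distribution of the sample mean is N(115, 6). Why? The population is normal, so the sample mean is also normal (for non-normal populations the CLT gives this approximately when n is large), and its standard deviation is σ/√n = 30/√25 = 6.
Sketch the density curve of this distribution and mark the axis with the cutoffs specified in the 68-95-99.7 rule. Which of the following sample means might be indicative of good evidence against the claim of µ = 115? (a) 118.6 (b) 120.3 (c) 128.2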
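The comparison asked for above can also be checked numerically. The following is a minimal sketch in R, assuming the null distribution N(115, 6) stated in the text and a right-tail comparison (matching the teacher's suspicion that the mean is higher than 115); the variable names are illustrative only.

```r
# Null sampling distribution of the sample mean: N(mu0, sigma / sqrt(n))
mu0   <- 115
sigma <- 30
n     <- 25
se    <- sigma / sqrt(n)            # 30 / 5 = 6

# The three candidate sample means from the exercise
xbar <- c(a = 118.6, b = 120.3, c = 128.2)

# Standardized distance of each sample mean from mu0
z <- (xbar - mu0) / se

# Right-tail area beyond each z under the standard normal curve
p_right <- pnorm(z, lower.tail = FALSE)

print(round(cbind(xbar, z, p_right), 4))
```

Only 128.2 lies more than two standard deviations (2 × 6 = 12) above 115, which is the region the 68-95-99.7 rule marks as unusual.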
The P-value for a test of H0: µ = µ0, based on the z statistic z = (x̄ − µ0)/(σ/√n), against
Ha: µ > µ0, is the right tail area under the z density curve that is beyond the z statistic value.
Ha: µ < µ0, is the left tail area under the z density curve that is beyond the z statistic value.
Ha: µ ≠ µ0, is twice the right tail area under the z density curve that is beyond |z|, the absolute value of the z statistic.
The P-value of a test of H0 is the probability, computed assuming that H0 is true, that the test statistic would take a value as extreme as or more extreme than the value actually observed.
Usually, the P-value of a test of H0 is compared with a threshold value called the significance level (denoted α). A P-value smaller than α leads to rejection of H0; the result is then said to be statistically significant at level α.
Example: (Plasma Aldosterone in Dogs) Aldosterone is a hormone involved in maintaining fluid balance in the body. In a veterinary study, 8 dogs with heart failure were treated with the drug Captopril, and plasma concentrations of aldosterone were measured before and after the treatment.
Suppose that the before-after change (before – after) in concentration has a normal distribution with standard deviation 15.
Test the claim that the drug Captopril has an effect of reducing plasma concentrations of aldosterone. Interpret the P-value.
Here are the IQ scores of 31 seventh-grade girls randomly chosen from a school district: 114 100 104 89 102 91 114 114 103 105 108 130 98 122 111 118 108 116 86 72 111 103 74 112 107 103 98 96 112 112 93
The sample mean of the 31 scores listed above is 104.0645.
Suppose that the standard deviation of IQ scores in the population is known to be 15. IQ scores in a broad population are supposed to have mean µ = 110. Is there evidence that the mean in this district is less than 100?
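A one-sample z test with known σ can be sketched in R as below. The helper `z_test` is hypothetical (it is not a base R function), and the null value `mu0 = 100` is an assumption taken from the question as worded; the text also mentions 110, and `mu0` can be changed accordingly.

```r
# IQ scores exactly as listed in the text (n = 31)
iq <- c(114, 100, 104, 89, 102, 91, 114, 114, 103, 105, 108, 130, 98, 122,
        111, 118, 108, 116, 86, 72, 111, 103, 74, 112, 107, 103, 98, 96,
        112, 112, 93)

# Hypothetical helper: one-sample z test when sigma is known
z_test <- function(x, mu0, sigma,
                   alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
  p <- switch(alternative,
              less      = pnorm(z),                       # left tail
              greater   = pnorm(z, lower.tail = FALSE),   # right tail
              two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
  list(mean = mean(x), z = z, p.value = p)
}

# Assumed null value 100 (from the question); sigma = 15 as stated
res <- z_test(iq, mu0 = 100, sigma = 15, alternative = "less")
print(res)
```

With `alternative = "less"`, a sample mean above the null value produces a positive z and a P-value above 0.5, so the direction of Ha matters as much as the size of z.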
Here are the IQ scores of 31 seventh-grade girls randomly chosen from a school district: 114 100 104 89 102 91 114 114 103 105 108 130 120 132 111 128 118 119 86 72 111 103 74 112 107 103 98 96 112 112 93
The sample mean of these 31 scores is 105.8387.
Suppose that the standard deviation of IQ scores in the population is known to be 15. IQ scores in a broad population are supposed to have mean µ = 110. Is there evidence that the mean in this district differs from 100?
A confidence interval for a parameter gives a set of plausible values of the parameter at a given confidence level.
A test of significance makes a decision about whether a claimed value is a plausible value of the parameter considered.
A level α two-sided significance test rejects a null hypothesis H0: µ = µ0 exactly when the value µ0 falls outside a level 1 – α CI for µ.
A level α one-sided significance test rejects a null hypothesis H0: µ = µ0 exactly when the value µ0 falls outside a level 1 – 2α CI for µ and the sample mean lies on the side of µ0 that Ha specifies.
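The two-sided duality can be illustrated with the second IQ data set. This is a sketch under stated assumptions: σ = 15 as given, the null value µ0 = 100 from the "differs from 100" question, and α = 0.05 chosen here for illustration.

```r
# Second IQ data set from the text (n = 31)
iq2 <- c(114, 100, 104, 89, 102, 91, 114, 114, 103, 105, 108, 130, 120, 132,
         111, 128, 118, 119, 86, 72, 111, 103, 74, 112, 107, 103, 98, 96,
         112, 112, 93)

mu0   <- 100    # assumed null value (from the question)
sigma <- 15     # known population standard deviation
alpha <- 0.05   # assumed significance level

se <- sigma / sqrt(length(iq2))

# Two-sided z test
z <- (mean(iq2) - mu0) / se
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Level 1 - alpha confidence interval for mu
z_star <- qnorm(1 - alpha / 2)
ci <- mean(iq2) + c(-1, 1) * z_star * se

# The two decisions should agree
reject_by_test <- p_two_sided < alpha
reject_by_ci   <- mu0 < ci[1] || mu0 > ci[2]

print(list(z = z, p.value = p_two_sided, ci = ci,
           reject_by_test = reject_by_test, reject_by_ci = reject_by_ci))
```

Both decisions are computed from the same standardized distance, which is why they always agree for the two-sided test.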
The P-value for a test of H0: µ = µ0, based on the one-sample t statistic t = (x̄ − µ0)/(s/√n), against
Ha: µ > µ0, is the right tail area under the t(n-1) density curve that is beyond the t statistic value.
Ha: µ < µ0, is the left tail area under the t(n-1) density curve that is beyond the t statistic value.
Ha: µ ≠ µ0, is twice the right tail area under the t(n-1) density curve that is beyond |t|, the absolute value of the t statistic.
We wish to see if the dial indicating the oven temperature for a certain model oven is properly calibrated. Four ovens of this model are selected at random. The dial on each is set to 300ºF, and, after one hour, the actual temperature of each is measured. The temperatures measured are 305º, 310º, 300º, and 305º. Assuming that the actual temperatures for this model when the dial is set for 300º are normally distributed with mean µ, we test whether the dial is properly calibrated by testing the hypotheses H0: µ = 300 versus Ha: µ ≠ 300. Find the P-value for this test.
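One way to obtain this P-value is with base R's `t.test`, checked against the formula by hand. The sketch assumes the null value is the dial setting, 300; the variable names are illustrative.

```r
# Measured oven temperatures (degrees F) with the dial set to 300
temps <- c(305, 310, 300, 305)

# Built-in one-sample t test, two-sided
res <- t.test(temps, mu = 300, alternative = "two.sided")

# Same computation by hand: t = (xbar - mu0) / (s / sqrt(n)), df = n - 1
n      <- length(temps)
t_stat <- (mean(temps) - 300) / (sd(temps) / sqrt(n))
p_val  <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

print(res)
print(c(t = t_stat, df = n - 1, p.value = p_val))
```

With only four observations the t(3) curve has heavy tails, so the P-value stays between 0.05 and 0.10 even though the sample mean is 5º above the dial setting.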
Do students tend to improve their SAT mathematics (SAT-M) score the second time they take the test? A random sample of four students who took the test twice received the following scores.
Student        1     2     3     4
First score    450   520   720   600
Second score   440   600   720   630
Assuming that the change in SAT-M score (second score - first score) for the population of all students taking the test twice is normally distributed with mean µ, are we convinced that retaking the test improves scores? Find the P-value for an appropriate test.
This example shows what is termed the matched pairs t procedure. The design is called a matched pairs design, in which subjects are matched in pairs and each treatment is given to one subject in each pair.
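The matched pairs t procedure is a one-sample t test applied to the within-pair differences. A minimal sketch for the SAT-M data, assuming the one-sided alternative Ha: µ > 0 for the change (second score minus first score):

```r
# SAT-M scores for the four students
first  <- c(450, 520, 720, 600)
second <- c(440, 600, 720, 630)

# Within-student change: second score minus first score
d <- second - first

# One-sample t test on the differences, Ha: mean change > 0
res <- t.test(d, mu = 0, alternative = "greater")

# Equivalent call using the paired interface
res_paired <- t.test(second, first, paired = TRUE, alternative = "greater")

print(res)
```

The two calls give the same t statistic and P-value. Here the P-value is larger than 0.10, so these four students alone do not provide convincing evidence of improvement.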
The P-value for a test of H0: µ1 = µ2 against
Ha: µ1 > µ2, is the right tail area beyond the t statistic value under the t density curve with degrees of freedom equal to the smaller of n1 – 1 and n2 – 1.
Ha: µ1 < µ2, is the left tail area beyond the t statistic value under the t density curve with degrees of freedom equal to the smaller of n1 – 1 and n2 – 1.
Ha: µ1 ≠ µ2, is twice the right tail area beyond |t|, the absolute value of the t statistic, under the t density curve with degrees of freedom equal to the smaller of n1 – 1 and n2 – 1.
A researcher wished to compare the effect of two stepping heights (low and high) on heart rate in a step-aerobics workout. A collection of fifty adult volunteers was randomly divided into two groups of twenty-five subjects each. Group 1 did a standard step-aerobics workout at the low height. The mean heart rate at the end of the workout for the subjects in Group 1 was x̄1 = 90.00 beats per minute with a standard deviation s1 = 9 beats per minute. Group 2 did the same workout but at the high step height. The mean heart rate at the end of the workout for the subjects in Group 2 was x̄2 = 95.08 beats per minute with a standard deviation s2 = 12 beats per minute. Assume that the two groups are independent and the data are approximately normal. Let µ1 and µ2 represent the mean heart rates we would observe for the entire population represented by the volunteers, if all members of this population did the workout using the low or high step height, respectively. Suppose the researcher had wished to test the hypotheses H0: µ1 = µ2 against Ha: µ1 < µ2. The P-value for the test is (use the conservative value for the degrees of freedom)
A. larger than 0.10.
B. between 0.10 and 0.05.
C. between 0.05 and 0.01.
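Because only summary statistics are given, the two-sample t statistic can be computed directly from its formula. This is a sketch using the unpooled standard error and the conservative degrees of freedom described above; the variable names are illustrative.

```r
# Summary statistics from the step-aerobics study
n1 <- 25; xbar1 <- 90.00; s1 <- 9    # Group 1: low step height
n2 <- 25; xbar2 <- 95.08; s2 <- 12   # Group 2: high step height

# Unpooled standard error of xbar1 - xbar2
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# Two-sample t statistic
t_stat <- (xbar1 - xbar2) / se

# Conservative degrees of freedom: smaller of n1 - 1 and n2 - 1
df_conservative <- min(n1 - 1, n2 - 1)

# Ha: mu1 < mu2, so the P-value is the left tail area
p_left <- pt(t_stat, df = df_conservative)

print(c(se = se, t = t_stat, df = df_conservative, p.value = p_left))
```

The standard error works out to exactly 3, and the left tail area falls between 0.05 and 0.10.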
Growth: 17, 20, 170, 315, 22, 190, 64
Gap: 22, 29, 13, 16, 15, 18, 14, 6
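# Samples as listed above: X holds the Growth values, Y the Gap values
X <- c(17, 20, 170, 315, 22, 190, 64)
Y <- c(22, 29, 13, 16, 15, 18, 14, 6)
# One-sided Wilcoxon rank sum test, normal approximation without continuity correction
wilcox.test(X, Y, alternative = "greater", correct = FALSE)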
Wilcoxon rank sum test
data: X and Y
W = 49.5, p-value = 0.00638
alternative hypothesis: true location shift is greater than 0.
Notice that a t test could not be applied here because two of the observations are incomplete: patient 3 died with a graft still surviving and observation on patient 10 was incomplete for an unspecified reason.
Carry out a sign test to compare the survival times of the two sets of skin grafts. The null hypothesis is H0: The survival time distribution is the same for close compatibility as it is for poor compatibility against the directional alternative Ha: Skin grafts tend to last longer when the HL-A compatibility is close.
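# Paired survival times of the two skin grafts, one pair per patient
X <- c(37, 19, 57, 93, 16, 23, 20, 63, 29, 60, 18)
Y <- c(29, 13, 15, 26, 11, 18, 26, 43, 18, 42, 19)
# One-sided Wilcoxon signed rank test on the paired differences X - Y
wilcox.test(X, Y, paired = TRUE, alternative = "greater", correct = FALSE)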
Wilcoxon signed rank test
data: X and Y
V = 60.5, p-value = 0.007193
alternative hypothesis: true location shift is greater than 0.
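The exercise asks for a sign test, whereas the call above runs the Wilcoxon signed rank test. A sign test uses only the signs of the paired differences, so it can be sketched with base R's `binom.test`. The sketch assumes X holds the close-compatibility survival times and Y the poor-compatibility times, as the `alternative = "greater"` argument above suggests.

```r
# Paired survival times (assumed: X = close compatibility, Y = poor)
X <- c(37, 19, 57, 93, 16, 23, 20, 63, 29, 60, 18)
Y <- c(29, 13, 15, 26, 11, 18, 26, 43, 18, 42, 19)

# Signs of the paired differences; zero differences would be dropped
d       <- X - Y
n_plus  <- sum(d > 0)
n_minus <- sum(d < 0)

# Under H0 each sign is + or - with probability 1/2, so the number of
# positive signs is Binomial(n_plus + n_minus, 0.5)
res <- binom.test(n_plus, n_plus + n_minus, p = 0.5, alternative = "greater")

print(c(n_plus = n_plus, n_minus = n_minus, p.value = res$p.value))
```

Because only the sign of each difference is needed, the test remains valid when an incomplete observation is known to be at least as large as the recorded value, provided the sign of the difference is still determined.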
The P-value for a test of H0: p = p0, based on the z statistic z = (p̂ − p0)/√(p0(1 − p0)/n), against
Ha: p > p0, is the right tail area beyond the z statistic value under the standard normal density curve.
Ha: p < p0, is the left tail area beyond the z statistic value under the standard normal density curve.
Ha: p ≠ p0, is twice the right tail area beyond |z|, the absolute value of the z statistic, under the standard normal density curve.
A Gallup Poll asked a sample of Canadian adults if they thought the law should allow doctors to end the life of a patient who is in great pain and near death if the patient makes a request in writing. The poll included 270 people in Quebec, 221 of whom agreed that doctor-assisted suicide should be allowed. Is the poll evidence that the majority of people in Quebec favor doctor-assisted suicide?
A coin is flipped 25 times and heads appears 13 times. Is the coin balanced?
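Both proportion examples can be worked with the large-sample z statistic. The helper `prop_z_test` below is hypothetical (not a base R function); the sketch assumes p0 = 0.5 in both cases, with a one-sided alternative for the poll ("majority") and a two-sided alternative for the coin ("balanced").

```r
# Hypothetical helper: large-sample z test for a single proportion
prop_z_test <- function(x, n, p0,
                        alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  p_hat <- x / n
  z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # standard error uses p0
  p <- switch(alternative,
              less      = pnorm(z),
              greater   = pnorm(z, lower.tail = FALSE),
              two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
  list(p_hat = p_hat, z = z, p.value = p)
}

# Quebec poll: 221 of 270 agreed; Ha: p > 0.5
quebec <- prop_z_test(221, 270, p0 = 0.5, alternative = "greater")

# Coin: 13 heads in 25 flips; Ha: p != 0.5
coin <- prop_z_test(13, 25, p0 = 0.5, alternative = "two.sided")

print(quebec)
print(coin)
```

For a sample as small as 25 flips, base R's `binom.test(13, 25, p = 0.5)` gives an exact alternative to the normal approximation.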
The P-value quantifies the degree of evidence provided by the sample against the null hypothesis. The smaller the P-value, the stronger the evidence. How small is small?
Answers vary. Reporting the P-value allows each of us to decide individually if the evidence is sufficiently strong.
When we say that the evidence provided by the sample is sufficiently strong (indicated by a very small P-value), we mean the result is significant and the null hypothesis should be rejected.
Large samples can capture even tiny deviations from the null hypothesis; that is, large samples tend to produce significant results.
On the other hand, small samples can miss even large deviations from the null hypothesis; that is, small samples tend to produce non-significant results.
When a null hypothesis is rejected at a significance level of, say, α = 0.05, there is good evidence that an effect is present. But that effect may be so small that it can be ignored in practice; it may have been detected only because the sample size is very large.
Statistical significance does not tell us whether an effect is large enough to be important. That is, statistical significance and practical significance are not the same.
The author of the textbook suggests that confidence intervals be used more often than tests of significance, because a confidence interval estimates the size of an effect, while a test only answers whether the effect is too large to occur by chance alone.
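The role of sample size can be seen by holding a small effect fixed and letting n grow. The numbers below are hypothetical and chosen only for illustration: an observed difference of 0.1 from the null value and a known σ of 15.

```r
# Hypothetical setting: tiny observed difference, known sigma
effect <- 0.1    # assumed difference xbar - mu0
sigma  <- 15     # assumed population standard deviation

# Increasing sample sizes
n <- c(100, 10000, 1000000)

# z statistic and two-sided P-value for the same effect at each n
z <- effect / (sigma / sqrt(n))
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(data.frame(n = n, z = round(z, 3), p.value = signif(p, 3)))
```

The effect is the same in every row; only the standard error shrinks, so the result moves from clearly non-significant to highly significant.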
Running one test and reaching the 5% level of significance is reasonably good evidence that you have found something, but running 20 tests and reaching the 5% level of significance only once is NOT.
This is because, even when all 20 null hypotheses are true, we would expect about 1 of the 20 tests to be significant at the 5% level by chance alone (20 × 5% = 1).
Similar arguments can be made for confidence intervals: A single 95% confidence interval has probability 0.95 of capturing the true parameter each time you use it, but the probability that all of 20 confidence intervals will capture their parameters is much less than 0.95.
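These probabilities are easy to compute if the 20 tests (or intervals) are assumed to be independent, which is an assumption made here for illustration and not stated in the text.

```r
alpha <- 0.05   # significance level of each test
k     <- 20     # number of tests or intervals

# Expected number of significant results when every H0 is true
expected_false_positives <- k * alpha

# Assuming independence: chance that at least one test is significant
p_at_least_one <- 1 - (1 - alpha)^k

# Assuming independence: chance that all k 95% intervals cover their parameters
p_all_cover <- (1 - alpha)^k

print(c(expected = expected_false_positives,
        at_least_one = p_at_least_one,
        all_cover = p_all_cover))
```

Under independence, the chance that all 20 intervals capture their parameters is about 0.36, far below 0.95.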