Chapter 10: Hypothesis Testing

Outline (Topics from 10.2 and 10.4) • Hypothesis Testing • Definitions • The p-value • Examples and summary of steps • Significance levels • Z-test for means and proportions

Tests of significance • How do we determine how good our estimate of a parameter is? • Establish a Confidence Interval around that estimate. • Or: Do a Test of Significance • Test of Significance: uses the observed data to assess a hypothesis, or claim, about the population. • The results of the test are expressed in terms of a probability that measures how well the data support the hypothesis.

Example: Fair Coin? You flip a coin 100 times. You get 64 Heads (even though you “expect” 50). Is this difference due to chance error? Or is there something wrong with this coin? To answer this question we need a test of significance: if the observed value is “too many” standard errors away from the expected value (the claimed one), then it is hard to explain it by chance! Here, we are looking at a sum of draws, or a count on a population. The observed number of heads is 64, based on a sample of size n = 100. The standard error of the count is σ_count = σ × sqrt(n) = 0.5 × sqrt(100) = 5
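The value σ = 0.5 is the standard deviation of a single toss scored 1 for heads and 0 for tails when the coin is fair. A minimal Python sketch of that arithmetic follows; the variable names are illustrative only and do not come from the slides.

```python
import math

n = 100          # number of tosses
p_heads = 0.5    # chance of heads for a fair coin (the claim being tested)

# SD of one toss scored 1 for heads and 0 for tails
sd_single = math.sqrt(p_heads * (1 - p_heads))

expected_count = n * p_heads          # expected number of heads
se_count = sd_single * math.sqrt(n)   # standard error of the count

print(expected_count, se_count)
```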

Stating the hypotheses: The null hypothesis expresses the idea that the observed difference is due to chance. It represents the statement being tested in a significance test. The null hypothesis is abbreviated as H0. In the example, the null hypothesis states that there is nothing wrong with the coin. So the null hypothesis, H0, says that the expected value for the sum of 100 draws is μ_count = 100 × μ = 100 × 0.5 = 50. The alternative hypothesis represents the idea that the difference is real. It is expressed as a statement we hope or suspect is true instead of the null hypothesis. The alternative hypothesis is abbreviated as Ha (or H1). In the example, the alternative hypothesis, Ha, states that the sum of draws would be higher, that is: μ_count > 50

Test statistics and significance levels In the example, the difference between the observed value and the expected value, measured in standard errors, is (64 − 50)/5 = 2.8. This says that the observed value is 2.8 Standard Errors away from the expected value. This is an example of a “test statistic”. A test statistic is used to measure the difference between the data and what is expected under the null hypothesis. The test statistic used above is called a z-statistic and its general form is: z = (observed value − expected value) / standard error
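As a small illustration, the z-statistic for the coin example can be computed as below. The function name `z_statistic` is a hypothetical helper chosen for this sketch.

```python
def z_statistic(observed, expected, standard_error):
    """Number of standard errors between the observed and expected values."""
    return (observed - expected) / standard_error

# Coin example: 64 heads observed, 50 expected under H0, SE = 5
z = z_statistic(observed=64, expected=50, standard_error=5)
print(z)
```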

Z-tests The tests based on the z-statistic are called z-tests. The z-statistic says how many S.E.’s away an observed value is from its expected value if the null hypothesis were true. In the example the z-statistic is z = 2.8. In large samples, we can use the standard normal curve to find the area to the right of z = 2.8, which is about 0.26%. [Figure: standard normal curve with the area to the right of z = 2.8 shaded, p = 0.26%]
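Instead of a printed table, the right-tail area can be looked up numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`.

```python
from statistics import NormalDist

z = 2.8
# Area under the standard normal curve to the right of z
p_value = 1 - NormalDist().cdf(z)
print(f"{p_value:.4f}")
```

Expressed as a percentage, the result rounds to the 0.26% quoted for the coin example.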

P-values • The computed chance 0.26% in the example is called the observed significance level. It is often denoted by p (for probability) and is called the p-value. • We looked at the area to the right of the z-value because the alternative hypothesis points to larger counts, so the values “as extreme or more extreme” than the observed one lie even farther to the right. • A test of significance finds the probability of getting a test statistic as extreme or more extreme than the actually observed one. • The chance is computed on the basis that the null hypothesis is true. • The smaller this chance is, the stronger the evidence against the null hypothesis is. • The procedure for finding the p-value also depends on the alternative hypothesis.

If the p-value is small, then the null hypothesis should be rejected. If the p-value is large (say, larger than a chosen level such as 0.05 or 0.01), the sample results do not provide enough evidence against the null hypothesis, and the null hypothesis is not rejected. [Figure: sample values compared with the expected value under H0; a sample with p = 0.24 leads to “do not reject H0”, a sample with p = 0.04 leads to “reject H0”.] Note: The p-value is NOT the probability of the null hypothesis being right in the light of the data! The p-value measures how likely it is that a sample at least as extreme as the one obtained will occur when the null hypothesis is assumed to be true.

The calculation of the p-value is based on the assumption that the null hypothesis is true. Roughly speaking, a small p-value (such as 0.05 or smaller) indicates that the sample results are unlikely under the assumption of the null hypothesis. Hence a small p-value is strong evidence against the null hypothesis, since the null hypothesis does not provide a “good explanation” for the observed sample. [Figure: distance between the observed sample value and the expected value under the null hypothesis, with p-value < 0.05.]

In our example, we found p-value = 0.0026 = 0.26%, i.e., p-value < 0.05. Conclusion: This p-value is very small. If the null hypothesis were true (that is, “the coin is fair”), a sample like this one would be unlikely. Therefore, we reject the Null Hypothesis. The result is statistically significant: the data give strong evidence that there is something wrong with this coin.

Example: Nicotine content To determine whether the mean nicotine content of a brand of cigarettes is greater than the advertised value of 1.4 milligrams, a health advocacy group takes a sample of 500 cigarettes and measures the amount of nicotine in the sample. The sample average of nicotine is computed as 1.51 mg and the standard deviation of the observations is 1.016 mg. Is the difference between the sample value and the advertised value real? Or is it due to chance error? Let’s do a Test of Significance to answer this. Here, we are looking at the mean of a population. The estimated amount of nicotine is 1.51 mg, based on the sample values. The standard error of the estimate is: σ_x̄ = s/sqrt(n) = 1.016/sqrt(500) ≈ 0.045

Null Hypothesis: The null hypothesis states that the advertised amount of nicotine is true and that the observed value of 1.51 mg is simply due to chance. H0: μ_x̄ = μ = 1.4 mg. Alternative Hypothesis: The alternative hypothesis states that the cigarettes in reality contain a higher amount of nicotine, and that this difference is not simply due to chance. Ha: μ_x̄ > 1.4 mg

The test statistic that looks at the difference between the observed value and the conjectured value is: z = (1.51 − 1.4) / (1.016/sqrt(500)) = 0.11/0.0454 ≈ 2.42. Use the standard normal curve to find the area to the right of z = 2.42. [Figure: standard normal curve with the area to the right of z = 2.42 shaded, p = 0.78%]

p-value = 0.78%. The chance of getting a sample average 2.42 S.E.’s or more above its expected value is very small: about 0.0078, or 0.78%. Conclusion: Reject the Null Hypothesis. If the advertised value were true, a sample average this far above 1.4 mg would be unlikely, so the observed difference is hard to explain by chance error alone.
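The arithmetic of this example can be checked with a few lines of Python. This is only a sketch that recomputes the standard error, the z-statistic and the right-tail area from the sample figures quoted in the example (n = 500, mean 1.51, SD 1.016); the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 500
sample_mean = 1.51
sample_sd = 1.016
claimed_mean = 1.4   # advertised value, assumed true under H0

se = sample_sd / math.sqrt(n)            # standard error of the sample mean
z = (sample_mean - claimed_mean) / se    # distance from H0 in standard errors
p_value = 1 - NormalDist().cdf(z)        # right tail, because Ha: mean > 1.4

print(f"SE = {se:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```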

Recap the Steps: Making a test of significance • Identify what quantity is being studied: Mean, Proportion? • Set up the null hypothesis H0, the hypothesis you want to test. • Under the null hypothesis, establish μ and σ of the sample mean or proportion. • Set up the alternative hypothesis Ha, what we accept if H0 is rejected. • Compute the test statistic to measure the difference between the data and what is expected under the null hypothesis: the z-statistic, or the t-statistic. • Compute the observed significance level: the p-value. This is the probability, calculated assuming that H0 is true, of getting a test statistic as extreme or more extreme than the observed one in the direction of the alternative hypothesis. • State a conclusion. Given the significance level α, you conclude: • If p-value ≤ α, the null hypothesis is rejected at level α. • If p-value > α, the data do not provide enough evidence to reject H0.
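The steps above can be strung together in one small function. This is a sketch for one-sided z-tests only; the function name `z_test` and its `tail` argument are hypothetical choices for this illustration, not part of any library.

```python
from statistics import NormalDist

def z_test(observed, null_value, standard_error, alpha, tail):
    """Follow the recap steps for a one-sided z-test.

    tail is "right" for Ha: parameter > null_value,
    or "left" for Ha: parameter < null_value.
    """
    # Test statistic: distance from the null value in standard errors
    z = (observed - null_value) / standard_error
    # p-value: tail area in the direction of the alternative hypothesis
    area_left = NormalDist().cdf(z)
    if tail == "right":
        p_value = 1 - area_left
    elif tail == "left":
        p_value = area_left
    else:
        raise ValueError("tail must be 'right' or 'left'")
    # Conclusion at significance level alpha
    decision = "reject H0" if p_value <= alpha else "do not reject H0"
    return z, p_value, decision

# Coin example: 64 heads observed, 50 expected under H0, SE = 5
print(z_test(64, 50, 5, alpha=0.05, tail="right"))
```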

Significance levels • In common statistical terminology: • If p is less than α = 5%, the null hypothesis is rejected at the 5% significance level and the test result is called “statistically significant”. • If p is less than α = 1%, the null hypothesis is rejected at the 1% significance level and the test result is called “highly significant”. • Significance levels are very popular for reporting test results. • However, it is better practice to summarize the test results by reporting what test was used, the p-value, and whether the result was “statistically significant” or “highly significant”.
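The terminology above amounts to a simple lookup on the p-value. In the sketch below, the function name and the label “not statistically significant” for p-values of 5% or more are assumptions added for illustration.

```python
def describe_significance(p_value):
    """Map a p-value to the conventional labels at the 5% and 1% levels."""
    if p_value < 0.01:
        return "highly significant"
    if p_value < 0.05:
        return "statistically significant"
    return "not statistically significant"

for p in (0.0026, 0.03, 0.20):
    print(p, describe_significance(p))
```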

Example: Age of commercial jets A report in USA Today (July 7, 1995) stated that the average age of commercial jets in the U.S. is 14 years. An executive of a large airline company selects a sample of 40 airplanes and finds that the average age of the planes is 11.8 years. The standard deviation of the sample is 2.7 years. Is it true that the average age of the planes in his company is less than the national average? (Use a significance level of 1%.) • We need to test the null hypothesis H0: μ_x̄ = 14 • against the alternative hypothesis Ha: μ_x̄ < 14. The z-statistic is: z = (11.8 − 14) / (2.7/sqrt(40)) = −2.2/0.427 ≈ −5.15. The value of the z-statistic says that the observed value is more than 5 standard errors below the age of 14 years that we assumed in the null hypothesis. Thus, if the hypothesis H0 were true, the observed sample would be very unlikely.

Assuming that the sample is random, we can compute the approximate p-value of the significance test, using the normal approximation for the z-statistic. The p-value is equal to the area under the standard normal curve to the left of z. We can use the table of areas under the standard normal curve in the textbook. [Figure: standard normal curve with the p-value shaded to the left of the observed z, about 5 standard errors below 0.] The p-value is about 0. Conclusion: since the p-value is far below the 1% significance level, the data provide strong evidence against the hypothesis that the average age of the planes is 14 years. We therefore reject H0 and conclude that the average age of the planes in the company is less than 14 years.
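Printed tables usually stop before z values this extreme, so a numerical check is useful here. The sketch below recomputes the left-tailed test from the figures in the example (n = 40, mean 11.8, SD 2.7, α = 1%); the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 40
sample_mean = 11.8
sample_sd = 2.7
national_mean = 14   # value assumed under H0
alpha = 0.01

se = sample_sd / math.sqrt(n)
z = (sample_mean - national_mean) / se
p_value = NormalDist().cdf(z)   # left tail, because Ha: mean < 14

print(f"z = {z:.2f}, p-value = {p_value:.2e}, reject H0: {p_value <= alpha}")
```

The p-value is not exactly 0, but it is far smaller than any commonly used significance level.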

Example: Testing effectiveness of nicotine patches In one study of 71 smokers who tried to quit smoking with nicotine patch therapy, 39 were smoking one year after the treatment (from Journal of the American Medical Association, Vol 274). Use a 0.10 significance level to test the claim that among smokers who try to quit with nicotine patch therapy, the majority are smoking a year after the treatment. Do these results suggest that the nicotine patch therapy is not effective? Let p be the probability that a smoker is still smoking a year after the treatment with nicotine patch therapy. We carry out a significance test of the claim above. Null hypothesis H0: p = 0.5. Alternative hypothesis Ha: p > 0.5. From the sample, the observed proportion is p̂ = 39/71 ≈ 0.55. Z-statistic: z = (p̂ − 0.5) / sqrt(0.5 × 0.5/71) = (0.549 − 0.5)/0.0593 ≈ 0.83

[Figure: standard normal curve with the p-value shaded to the right of z = 0.83.] The p-value is equal to the area to the right of z. It is computed as p-value ≈ 0.203, or about 20.3%. The p-value is larger than the chosen significance level α = 0.10. Hence the null hypothesis cannot be rejected. On the basis of the data, we cannot reject the hypothesis that half of the smokers smoke a year after the treatment with the nicotine patch therapy. Furthermore, we can compute a 95% confidence interval for the percentage of smokers that smoke a year after the treatment. It is given as 0.55 ± 1.96 × 0.06 = 0.55 ± 0.118, or (0.432, 0.668). The percentage of smokers still smoking is between 43.2% and 66.8%. This is a wide interval that includes 50%, so it is hard to decide about the effectiveness of the nicotine patches.
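The proportion test and the confidence interval can be reproduced as follows. This sketch takes 39 of the 71 participants as still smoking, which is the count behind the observed proportion 0.55; the variable names are illustrative. Because it uses the unrounded standard error rather than 0.06, the interval it prints is slightly narrower than the hand calculation.

```python
import math
from statistics import NormalDist

n = 71
still_smoking = 39
p0 = 0.5       # proportion assumed under H0
alpha = 0.10

p_hat = still_smoking / n
se_null = math.sqrt(p0 * (1 - p0) / n)   # SE computed assuming H0 is true
z = (p_hat - p0) / se_null
p_value = 1 - NormalDist().cdf(z)        # right tail, because Ha: p > 0.5

# The 95% confidence interval uses the SE based on the observed proportion
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
margin = NormalDist().inv_cdf(0.975) * se_hat
ci = (p_hat - margin, p_hat + margin)

print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, p-value = {p_value:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```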

In this chapter, there are three ways to set up the null and alternative hypotheses: • Equal versus not equal (two-tailed test): H0: parameter = some value, Ha: parameter ≠ some value • Equal versus less than (left-tailed test): H0: parameter = some value, Ha: parameter < some value • Equal versus greater than (right-tailed test): H0: parameter = some value, Ha: parameter > some value
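For a z-test, the three set-ups differ only in which tail area counts as the p-value. The sketch below shows this for a single z value; the function name `p_value` and the labels passed as `alternative` are hypothetical choices for this illustration.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """P-value of a z-statistic under the three forms of Ha.

    alternative: "not equal", "less" or "greater".
    """
    area_left = NormalDist().cdf(z)
    if alternative == "less":        # left-tailed test
        return area_left
    if alternative == "greater":     # right-tailed test
        return 1 - area_left
    if alternative == "not equal":   # two-tailed test: both extremes count
        return 2 * min(area_left, 1 - area_left)
    raise ValueError("alternative must be 'not equal', 'less' or 'greater'")

for alt in ("not equal", "less", "greater"):
    print(alt, round(p_value(2.8, alt), 4))
```

For the same z, the two-tailed p-value is twice the smaller one-tailed area, so a two-tailed test needs stronger evidence to reject H0 at the same level.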

“In Other Words” The null hypothesis is a statement of “status quo” or “no difference” and always contains a statement of equality. The null hypothesis is assumed to be true until we have evidence to the contrary. The claim that we are trying to gather evidence for determines the alternative hypothesis.

Examples: For each of the following claims, determine the null and alternative hypotheses. State whether the test is two-tailed, left-tailed or right-tailed. a) In 2008, 62% of American adults regularly volunteered their time for charity work. A researcher believes that this percentage is different today. b) According to a study published in March 2006, the mean length of a phone call on a cellular telephone was 3.25 minutes. A researcher believes that the mean length of a call has increased since then.

Solution a) In 2008, 62% of American adults regularly volunteered their time for charity work. A researcher believes that this percentage is different today. The hypothesis deals with a population proportion, p. If the percentage participating in charity work is no different than in 2008, it will be 0.62, so the null hypothesis is H0: p = 0.62. Since the researcher believes that the percentage is different today, the alternative hypothesis is a two-tailed hypothesis: Ha: p ≠ 0.62.

Solution b) According to a study published in March 2006, the mean length of a phone call on a cellular telephone was 3.25 minutes. A researcher believes that the mean length of a call has increased since then. The hypothesis deals with a population mean, μ. If the mean call length on a cellular phone is no different than in 2006, it will be 3.25 minutes, so the null hypothesis is H0: μ = 3.25. Since the researcher believes that the mean call length has increased, the alternative hypothesis is Ha: μ > 3.25, a right-tailed test.

Four Outcomes from Hypothesis Testing • We reject the null hypothesis when the alternative hypothesis is true. This decision would be correct. • We do not reject the null hypothesis when the null hypothesis is true. This decision would be correct. • We reject the null hypothesis when the null hypothesis is true. This decision would be incorrect. This type of error is called a Type I error. • We do not reject the null hypothesis when the alternative hypothesis is true. This decision would be incorrect. This type of error is called a Type II error.

Type I Error = rejecting H0 when H0 is true. Type II Error = not rejecting H0 when H1 is true.
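To make the two error types concrete, the sketch below returns to the coin example and computes both error rates exactly from the binomial distribution. The decision rule (reject H0 when 59 or more heads appear in 100 tosses) and the alternative chance of heads, 0.6, are hypothetical choices made only for this illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses with chance p of heads."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100
cutoff = 59   # hypothetical rule: reject H0 (fair coin) when heads >= 59

# Type I error rate: chance of rejecting H0 when the coin really is fair
type_1 = sum(binom_pmf(k, n, 0.5) for k in range(cutoff, n + 1))

# Type II error rate: chance of not rejecting H0 when the true chance of heads is 0.6
type_2 = sum(binom_pmf(k, n, 0.6) for k in range(0, cutoff))

print(f"Type I rate = {type_1:.3f}, Type II rate (p = 0.6) = {type_2:.3f}")
```

Raising the cutoff makes a Type I error less likely but a Type II error more likely, so the two error rates trade off against each other for a fixed sample size.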

Examples: Type I and Type II Errors For each of the following claims, explain what it would mean to make a Type I error. What would it mean to make a Type II error? a) In 2008, 62% of American adults regularly volunteered their time for charity work. A researcher believes that this percentage is different today. b) According to a study published in March 2006, the mean length of a phone call on a cellular telephone was 3.25 minutes. A researcher believes that the mean length of a call has increased since then.

Solution a) In 2008, 62% of American adults regularly volunteered their time for charity work. A researcher believes that this percentage is different today. A Type I error is made if the researcher concludes that p ≠ 0.62 when the true proportion of Americans 18 years or older who participated in some form of charity work is currently 62%. A Type II error is made if the sample evidence leads the researcher to believe that the current percentage of Americans 18 years or older who participated in some form of charity work is still 62% when, in fact, this percentage differs from 62%.

Solution b) According to a study published in March 2006, the mean length of a phone call on a cellular telephone was 3.25 minutes. A researcher believes that the mean length of a call has increased since then. A Type I error occurs if the sample evidence leads the researcher to conclude that μ > 3.25 when, in fact, the actual mean call length on a cellular phone is still 3.25 minutes. A Type II error occurs if the researcher fails to reject the hypothesis that the mean length of a phone call on a cellular phone is 3.25 minutes when, in fact, it is longer than 3.25 minutes.
