Basic Quantitative Methods in the Social Sciences (AKA Intro Stats)

Basic Quantitative Methods in the Social Sciences (AKA Intro Stats) 02-250-01 Lecture 10

Quantitative vs. Frequency Data • Recall from our first lecture, that data could take the form of: • Quantitative data (AKA measurement data) whereby each observation represents a score on a continuum • Most common statistics: mean and SD • Examples: height, weight, IQ, rating of Chretien on a scale of 1-10. • Categorical data (AKA frequency data) whereby frequencies of observations fall into one of two or more categories. • Examples: female vs. male, brown eyes vs. blue eyes, opposed to Chretien vs. support Chretien • This type of data is not a measurement of anything per se, it is simply frequencies of occurrence in the nominal classes (e.g., # of males vs. females, etc.)

What if we want to know whether the observed frequencies were what we had expected? • When it comes to frequency data, once we have counted our observations, we have… • Observed Frequencies – frequencies that have actually occurred • Expected Frequencies – frequencies that we expected to occur if some assumption is true • The Null Hypothesis (H0) is that the observed frequencies do not differ from what we had expected (the “expected frequencies”)

Chi - Square • Examines the difference between the Observed and the Expected frequencies among groups • Both variables are Nominal (therefore all we can measure is observed frequencies) • E.g., we know how many males are in this class (the observed frequency), and how many we would expect to be in this class

Non-parametric tests • The Chi-Square test is a non-parametric test. • With non-parametric tests, we do not need to assume that the population data are normally distributed. • Non-parametric tests allow for nominal and ordinal scales of measurement. • Although non-parametric tests are still inferential, they are less “powerful” than parametric tests (e.g., t-tests, correlations). This means that non-parametric tests are not as likely to find significant results when the effect size is small.

Cindy was hired at a fitness club in town. She was told by the boss that 68% of people who come in and inquire about becoming a member end up joining after they are shown around the club. • Three months later, Cindy is fired because the boss feels that she is not good at “selling” the club to people inquiring about memberships. Over the three months, Cindy gave tours to 75 people. 44 of them ended up joining. • Cindy thinks that the boss just doesn’t like her.

Is the Boss’ Accusation True? • The boss originally said that 68% of people typically join. Therefore, of the 75 people Cindy gave tours to, 51 of them should have joined if Cindy is on par with the other employees.

The Chi-Square Goodness-of-Fit Test • The Chi-Square Goodness-of-Fit Test is used when you have one classification variable (but it has 2 or more categories). • H0: The assumption that Cindy is on par with the other employees is true. • Ha: The assumption that Cindy is on par with the other employees is not true. • The Chi-Square test allows a decision about whether observed and expected frequencies differ significantly. • Rejection of H0 suggests that our assumption that led to the expected frequencies is wrong (or in this example, Cindy is not on par with the other club employees).

$\chi^2 = \sum \left[ \frac{(f_o - f_e)^2}{f_e} \right]$ • χ = Greek letter Chi, f_o (O) = observed frequencies, f_e (E) = expected frequencies • H0: O – E = 0 (no difference between observed and expected frequencies), therefore χ² = 0.

Calculating Chi Square • Create a table with columns for each category (so here, “join” and “not join”) • Create a row for each of the following: • Observed frequencies (O) • Expected frequencies (E) • O – E • (O – E)² • (O – E)² / E • The Chi Square statistic is then the sum of this final row of (O – E)² / E

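Chi Square Goodness of Fit Table

                    Join      Not join    Total
Observed (O)        44        31          75
Expected (E)        51        24          75
O – E               –7        7
(O – E)²            49        49
(O – E)² / E        0.9608    2.0417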

Calculating Chi-Squared (χ²) • Sum: 0.9608 + 2.0417 = 3.0025

Testing the Significance of χ² • df = k – 1, where k = the number of outcome categories. • Table E.1 in the text (p. 439): at the .050 level of significance, df = 1 • χ²crit = 3.84 • Since χ²obt = 3.00 is less than χ²crit, we retain H0 (NOTE: give the obt value to 2 decimal places) • Therefore, the result is not significant: Cindy’s recruiting performance at the fitness club is not significantly different from that of the other employees.
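The Cindy example can be retraced in a few lines of Python. This is only a sketch of the hand calculation: the function name `chi_square_gof` is made up for illustration, and the critical value 3.84 is simply copied from the table lookup above rather than computed.

```python
def chi_square_gof(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n_tours = 75
p_join = 0.68                          # the boss's claimed joining rate
expected_join = round(p_join * n_tours)

observed = [44, n_tours - 44]          # joined, did not join
expected = [expected_join, n_tours - expected_join]

chi2_obt = chi_square_gof(observed, expected)
df = len(observed) - 1                 # k - 1 categories
chi2_crit = 3.84                       # df = 1, alpha = .05 (from the table)

print(f"expected = {expected}")
print(f"chi2_obt = {chi2_obt:.2f}, df = {df}")
print("reject H0" if chi2_obt > chi2_crit else "retain H0")
```

The decision rule is the same as on the slide: H0 is rejected only when the obtained value exceeds the critical value.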

It is easy to see that in the numerator of the formula, observed frequencies are compared to expected frequencies to assess how well the sample data match the hypothesized data. Why must we divide the numerator by the expected frequency for each category? • Suppose you were going to throw a party and you expected 1000 people to show up. However, at the party, you counted the number of guests and observed that 1040 actually showed up. Forty more guests than expected are no major problem when all along you were planning for 1000. There will probably still be enough beer and chips for everyone.

On the other hand, suppose you had a party and you expected 10 people to attend but instead 50 actually showed up. Forty more guests in this case spell big trouble. How “significant” the discrepancy is depends in part on what you were originally expecting. • With very large expected frequencies, allowances are made for more errors between observed and expected frequencies. This is accomplished in the chi-square formula by dividing the squared discrepancy for each category by its expected frequency.
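The party example can be put into numbers. The sketch below (the helper name `contribution` is hypothetical) computes the single-category term (O – E)² / E for both parties: the raw discrepancy is 40 guests each time, but dividing by the expected frequency makes the two cases look very different.

```python
def contribution(observed, expected):
    """One category's contribution to chi-square: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

big_party = contribution(1040, 1000)    # 40 extra guests, 1000 expected
small_party = contribution(50, 10)      # 40 extra guests, 10 expected

print(f"big party:   {big_party:.1f}")
print(f"small party: {small_party:.1f}")
```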

What About When There are More Than Two Categories? • In the preceding example, observed frequencies fell into one of two categories: joined or did not join. • What if there are more than two categories?

Example • Suppose a study showed that of 90 people in trauma-induced comas who were treated with traditional medicine, 30 died, 30 woke up and fully recovered, and 30 remained comatose indefinitely. (Note: These data were made up). • Dr. X, a naturopathic doctor who works with patients with trauma-induced comas, claims that alternative approaches result in superior recovery rates. To test his claim, 90 comatose people were treated with his alternative approach and were then observed. 40 of them woke up and were fully recovered, 30 died, and 20 remained comatose indefinitely.

Chi-Square: Observed (O) and Expected (E) Frequencies

                  Woke    Died    Stayed in Coma    Total
Observed (O)      40      30      20                90
Expected (E)      30      30      30                90

• n = 30 + 30 + 30 = 90 • What’s H0?

Chi-Square: Figure for Each Cell

$\chi^2 = \sum \left[ \frac{(f_o - f_e)^2}{f_e} \right]$

Woke + Died + Stayed in Coma:
χ² = (40 – 30)² / 30 + (30 – 30)² / 30 + (20 – 30)² / 30
   = (10)² / 30 + (0)² / 30 + (–10)² / 30
   = 100 / 30 + 0 / 30 + 100 / 30
   = 3.3333 + 0 + 3.3333

Chi- Square 2obt = 6.6666 = 6.67 df = k - 1 = 2 2crit = 5.99 Therefore: We reject H0, Dr. X’s alternative approach does indeed generate significantly more recoveries.

Distribution of violent crimes in the United States, 1995

Sample results for 500 randomly selected violent-crime reports from last year

Expected frequencies if last year’s violent-crime distribution is the same as the 1995 distribution

Calculating the goodness of fit (Chi-Square) • χ²obt = 4.219. With df = k – 1 = 3, at .05, χ²crit = 7.81 • Therefore, we retain H0: last year’s crime distribution is not significantly different from that in 1995.

The Chi-Square Test for Independence • The χ² statistic can also be used to test whether or not there is a relationship between two categorical (nominal) variables. • Each individual in the sample is measured or classified on two separate variables. • Also known as Contingency Table Analysis

Do people with cell phones have more car accidents than people without cell phones? • The Department of Transportation wanted to see if cell phone users have more car accidents than non-cell phone users. The following data are a sample of 50 people who have had car accidents over the past month, and 50 randomly sampled drivers who have not had car accidents over the past month:

Chi-Square: Observed Frequencies (O)

                     Cell Phone    No Cell Phone
Car Accident         29            21
No Car Accident      16            34

So what now? • Notice that here, only observed frequencies are given to you. You have to calculate the expected frequencies. • First, total your rows and columns.

Chi-Square: Observed Frequencies (O) with Row and Column Totals

                     Cell Phone    No Cell Phone    Total
Car Accident         29            21               50
No Car Accident      16            34               50
Total                45            55               100

Eij = RiCj / N where: Eij = the expected frequency at row i, column j. Ri = Row i’s total Cj = Column j’s total N = Grand total (all cells included)

Chi-Square: Expected Frequencies (E)
• Car Accident, Cell Phone: E = (50 × 45) / 100 = 22.5
• Car Accident, No Cell Phone: E = (50 × 55) / 100 = 27.5
• No Car Accident, Cell Phone: E = (50 × 45) / 100 = 22.5
• No Car Accident, No Cell Phone: E = (50 × 55) / 100 = 27.5

Each cell shows O (E):

                     Cell Phone    No Cell Phone    Total
Car Accident         29 (22.5)     21 (27.5)        50
No Car Accident      16 (22.5)     34 (27.5)        50
Total                45            55               100
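The rule Eij = RiCj / N can be applied to every cell at once. The following is a minimal sketch in plain Python; the function name `expected_frequencies` is an assumption for illustration, and the table is laid out with rows as Car Accident / No Car Accident and columns as Cell Phone / No Cell Phone.

```python
def expected_frequencies(observed):
    """Expected count for each cell: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: Car Accident, No Car Accident; columns: Cell Phone, No Cell Phone
observed = [[29, 21],
            [16, 34]]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
```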

And then… • You now have four cells with expected and observed frequencies. Now use the Chi Square formula!
(29 – 22.5)² / 22.5 = 1.8778
(21 – 27.5)² / 27.5 = 1.5364
(16 – 22.5)² / 22.5 = 1.8778
(34 – 27.5)² / 27.5 = 1.5364
χ²obt = 1.8778 + 1.5364 + 1.8778 + 1.5364 = 6.8284 = 6.83

To test this statistic: • Let’s use the .05 level of significance. • Variable 1: Cell phone? • Variable 2: Car accident? • Because we are dealing with frequency data of two categorical variables, we will perform a chi-square test of independence. • Because it is a chi-square test, the alternative hypothesis is non-directional (it does not say which way the frequencies differ). • H0: Cell phones and car accidents are independent • Ha: Cell phones and car accidents are not independent

Chi-Square • df (for test of independence): df = (R – 1)(C – 1) • Where R = number of rows • Where C = number of columns • df = (2 – 1)(2 – 1) = 1, χ²crit = 3.84 (from Table E.1, p. 439) • χ²obt = 6.83, so reject H0.
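Putting the pieces together for the cell phone data, a test of independence might be sketched as below. The function name `chi_square_independence` is hypothetical, and the critical value 3.84 is taken from the table lookup rather than computed.

```python
def chi_square_independence(observed):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency for cell (i, j)
            chi2 += (o - e) ** 2 / e

    df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (R - 1)(C - 1)
    return chi2, df

# Rows: Car Accident, No Car Accident; columns: Cell Phone, No Cell Phone
observed = [[29, 21],
            [16, 34]]

chi2_obt, df = chi_square_independence(observed)
chi2_crit = 3.84                                    # df = 1, alpha = .05

print(f"chi2_obt = {chi2_obt:.2f}, df = {df}")
print("reject H0" if chi2_obt > chi2_crit else "retain H0")
```

Note that the code sums unrounded cell values, so the fourth decimal place can differ slightly from a hand calculation that rounds each cell first.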

SO…. • Results are significant. The frequency of being in a car accident depends on whether or not one uses a cell phone.

Note: • When the expected frequencies are too small, chi-square may not be a valid test, therefore all expected frequencies should be at least 5 (dependent on sample size). • The chi-square test is also only valid when the observations are independent from each other, therefore N should be equal to the number of subjects (every subject should only be measured once).

What about when you have more than two categories? • A fast-food marketing consultant wanted to know whether men and women had different preferences for fast-food restaurants. She randomly sampled 150 men and 100 women and asked each to declare his or her preference among four fast-food restaurants. Here are her data:

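Observed Frequencies (O)

            Restaurant (one column each)      Total
Women       35      25      15      25        100
Men         20      30      70      30        150
Total       55      55      85      55        250

Columns are the four restaurants (Burger King, Subway, Harvey’s, McDonalds), shown in the order used in the calculation; the restaurant-to-column labels are not recoverable.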

Now remember…. • To calculate expected frequencies for each cell, Eij = RiCj / N • So: (Row sum) (Column sum) / N • Do this for each cell to get expected frequencies.

Observed and Expected Frequencies, each cell shown as O (E)

            Restaurant (one column each)                  Total
Women       35 (22)    25 (22)    15 (34)    25 (22)      100
Men         20 (33)    30 (33)    70 (51)    30 (33)      150
Total       55         55         85         55           250

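You now have what you need to calculate χ²
Women:
(35 – 22)² / 22 = 7.6818
(25 – 22)² / 22 = 0.4091
(15 – 34)² / 34 = 10.6176
(25 – 22)² / 22 = 0.4091
Men:
(20 – 33)² / 33 = 5.1212
(30 – 33)² / 33 = 0.2727
(70 – 51)² / 51 = 7.0784
(30 – 33)² / 33 = 0.2727
Add these up! • χ²obt = 31.8626 = 31.86 • df = (R – 1)(C – 1) = (2 – 1)(4 – 1) = 3 • χ²crit = 7.81 (at .05) • Since χ²obt = 31.86 exceeds χ²crit, we reject H0: men and women differ significantly in their fast-food restaurant preferences.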
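The same steps extend to the 2 × 4 table. The sketch below also keeps each cell's (O – E)² / E so that the cells driving the result are visible. The function name `cell_contributions` is made up for illustration; columns are in the order used in the calculation above, and the critical value is the df = 3, .05-level value (about 7.81).

```python
def cell_contributions(observed):
    """Return a table of (O - E)^2 / E values, one per cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [
        [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(observed, row_totals)
    ]

# Rows: Women, Men; columns: the four restaurants, in calculation order
observed = [[35, 25, 15, 25],
            [20, 30, 70, 30]]

contributions = cell_contributions(observed)
chi2_obt = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)   # (R - 1)(C - 1)
chi2_crit = 7.81                                    # df = 3, alpha = .05

# Locate the cell with the largest contribution as (value, row index, column index)
largest = max(
    (value, i, j)
    for i, row in enumerate(contributions)
    for j, value in enumerate(row)
)

print(f"chi2_obt = {chi2_obt:.2f}, df = {df}")
print("reject H0" if chi2_obt > chi2_crit else "retain H0")
print(f"largest cell contribution: {largest[0]:.4f} at row {largest[1]}, column {largest[2]}")
```

Inspecting the individual contributions is a common informal follow-up: here the largest terms come from the column whose total is 85.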
