Chi-Square and Analysis of Variance (ANOVA)

Chi-Square and Analysis of Variance (ANOVA) Lecture 9

The Chi-Square Distribution and Test for Independence Hypothesis testing between two or more categorical variables

Chi-square Test of Independence • Tests the association between two nominal (categorical) variables. • Null Hyp: The 2 variables are independent. • Its really just a comparison between expected frequencies and observed frequencies among the cells in a crosstabulation table.

Example Crosstab: gender x binary question

Degrees of freedom • Chi-square degrees of freedom • df = (r-1) (c-1) • Where r = # of rows, c = # of columns • Thus, in any 2x2 contingency table, the degrees of freedom = 1. • As the degrees of freedom increase, the distribution shifts to the right and the critical values of chi-square become larger.

Chi-Square Distribution • The chi-square distribution results when independent variables with standard normal distributions are squared and summed.

Requirements for Chi-Square test • Must be a random sample from population • Data must be in raw frequencies • Variables must be independent • Categories for each I.V. must be mutually exclusive and exhaustive

Using the Chi-Square Test • Often used with contingency tables (i.e., crosstabulations) • E.g., gender x race • Basically, the chi-square test of independence tests whether the columns are contingent on the rows in the table. • In this case, the null hypothesis is that there is no relationship between row and column frequencies.

Practical Example: • Expected frequencies versus observed frequencies • General Social Survey Example

ANOVA and the f-distribution Hypothesis testing between a 3+ category variable and a metric variable

Analysis of Variance • In its simplest form, it is used to compare means for three or more categories. • Example: • Life Happiness scale and Marital Status (married, never married, divorced) • Relies on the F-distribution • Just like the t-distribution and chi-square distribution, there are several sampling distributions for each possible value of df.

What is ANOVA? • If we have a categorical variable with 3+ categories and a metric/scale variable, we could just run 3 t-tests. • The problem is that the 3 tests would not be independent of each other (i.e., all of the information is known). • A better approach: compare the variability between groups (treatment variance + error) to the variability within the groups (error)

The F-ratio • MS = mean square • bg = between groups • wg = within groups • Numerator is the “effect” and denominator is the “error” • df = # of categories – 1 (k-1)

Between-Group Sum of Squares (Numerator) • Total variability – Residual Variability • Total variability is quantified as the sum of the squares of the differences between each value and the grand mean. • Also called the total sum-of-squares • Variability within groups is quantified as the sum of squares of the differences between each value and its group mean • Also called residual sum-of-squares

Null Hypothesis in ANOVA • If there is no difference between the means, then the between-group sum of squares should = the within-group sum of squares.

F-distribution • F-test is always a one-tailed test. • Why?

Logic of the ANOVA • Conceptual Intro to ANOVA

Bringing it all together: Choosing the appropriate bivariate statistic

Reminder About Causality • Remember from earlier lectures: bivariate statistics do not test causal relationships, they only show that there is a relationship. • Even if you plan to use more sophisticated causal tests, you should always run simple bivariate statistics on your key variables to understand their relationships.

Choosing the Appropriate Statistical Test • General rules for choosing a bivariate test: • Two categorical variables • Chi-Square (crosstabulations) • Two metric variables • Correlation • One 3+ categorical variable, one metric variable • ANOVA • One binary categorical variable, one metric variable • T-test

Assignment #2 • Online (course website) • Due next Monday in class (April 10th)

Chi-Square and Analysis of Variance (ANOVA)