The Chi Square statistic tests :

PowerPoint Slideshow about 'The Chi Square statistic tests :' - milt


The Chi Square statistic tests:

  • Whether the difference between what you observe and what chance would predict is due to sampling error.

  • The greater the deviation between what we observe and what we would expect by chance, the stronger the evidence that the difference is NOT due to chance.

Hypothesis Testing

  • Step 1: State Hypothesis

    • What is the Null?

    • What is the Alternative (Research) Hypothesis?

First Step: Compute Percentage Table

  • Row margins:

    • 335/908 = 36.9% Disapprove

    • 573/908 = 63.1% Approve

  • NOTE: Most Citizens Approve of Clinton

  • BUT: We Are Testing for a Gender Effect: Are Women More Supportive Than Men?

Interpretation

  • Row Marginal: Most Citizens (63%) Approved of Clinton.

  • Column Marginal: There Are More Men in the Sample (54%).

  • Based on the percent table, it looks like women are more supportive of Clinton than men are: 69% vs. 58%.

    Step #1: Hypotheses:

  • Null Hypothesis: H0: Women – Men = 0 + Chance

  • Cell percents show women more supportive, so the null hypothesis is challenged.

Step 2: What Is the Distribution?

Categorizing the same individuals in two ways: Approval and Gender

Looking at the Effect of an Independent Variable (Gender) on Dependent Variable (Approval).

This is a classic application for the χ² test.

·  1. We have nominal data in both variables - men vs women, approve vs disapprove.

·  2. The data are in the form of frequencies and

·  3. We are looking to see if there is a relationship between the two variables.

Step 4: Determine Critical Value of χ²

We set as our standard 95% confidence that the difference we observe in our study is not due to chance. This is equivalent to setting alpha, the risk level, at 5% (α = .05).


  • Degrees of Freedom:

    • (# rows – 1) * (# Columns – 1)

    • Here: (2 – 1) * (2 – 1) = 1 * 1 = 1

  • Look up Critical Value of χ²* at 5% level with 1 df and find: χ²* = 3.84
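The degrees-of-freedom rule and the table lookup can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the shortcut used for the critical value (squaring the two-tailed normal z score) is valid for 1 degree of freedom only, which is the case here.

```python
from statistics import NormalDist


def degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """(# rows - 1) * (# columns - 1) for a contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def critical_value_df1(alpha: float = 0.05) -> float:
    """Chi-square critical value for 1 df ONLY.

    With 1 df, chi-square is the square of a standard normal variable,
    so the critical value is the squared two-tailed z score.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


df = degrees_of_freedom(2, 2)
crit = critical_value_df1(0.05)
print(df, round(crit, 2))  # 1 3.84
```

For tables with more than 1 degree of freedom, a printed chi-square table or a statistics library would be needed instead of this shortcut.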

STEP 5: Calculate Test Statistic and Make Decision

  • Question: Is the difference between the proportions of men and women approving of Clinton larger than what sampling error alone would produce in all but 5% of samples?

    Assuming the null hypothesis is true what would be the expected values?

  • What we need now are the expected values against which to compare our observed values. What would you expect by assuming the null is true?

Calculating Expected Values

  • Look at the Marginal Values. Note: 63% of all respondents approve

  • Therefore, assuming the null is true — that there is no gender difference — what would you expect the cell percentages to look like?

  • 37% of women should disapprove as well as 37% of men, + sampling error

  • 63% of women as well as 63% of men should approve, + sampling error

  • The proportions of men and women approving and disapproving should be the same + sampling error

Two Methods for Computing Expected Values

  • 1. Easiest Method:

    (Row Margin * Column Margin) / Total N

    Cell a: 335 * 418 / 908 = 140,030 / 908 = 154

    Cell b: 335 * 490 / 908 = 164,150 / 908 = 181

    Cell c: 573 * 418 / 908 = 239,514 / 908 = 264

    Cell d: 573 * 490 / 908 = 280,770 / 908 = 309

2. Null Percentage Method: If the null is true, the percentage approving (and disapproving) should be the same for men and women. Compute each expected frequency by applying that percentage to the column total.

  • Cell a: .369 * 418 = 154

  • Cell b: .369 * 490 = 181

  • Cell c: .631 * 418 = 264

  • Cell d: .631 * 490 = 309
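Both methods can be checked side by side with a short sketch. The margins (335, 573, 418, 490, 908) come from the slides; the variable names are hypothetical, and the gender labels on the columns are an inference from the statement that men make up 54% of the sample (490 / 908).

```python
# Margins taken from the slides. Gender labels are inferred, not stated:
# 490 / 908 = 54%, which matches the reported share of men.
row_margins = {"disapprove": 335, "approve": 573}
col_margins = {"women": 418, "men": 490}
total_n = 908

# Method 1: (row margin * column margin) / total N
expected_by_margins = {
    (row, col): r * c / total_n
    for row, r in row_margins.items()
    for col, c in col_margins.items()
}

# Method 2: apply the overall row percentage (rounded to three decimals,
# as on the slide: .369 and .631) to each column total.
expected_by_percent = {
    (row, col): round(r / total_n, 3) * c
    for row, r in row_margins.items()
    for col, c in col_margins.items()
}

for cell in expected_by_margins:
    print(cell, round(expected_by_margins[cell]), round(expected_by_percent[cell]))
```

After rounding to whole respondents, the two methods give the same four expected frequencies: 154, 181, 264, and 309.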

Key Questions

  • How closely do the observed frequencies (fo) match the expected frequencies (fe)?

  • Do the squared fo – fe differences fit the null hypothesis?

  • Or, are the differences between observations and chance expectations so different as to justify rejecting the null?
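The comparison of fo and fe is usually summarized with the standard chi-square statistic, the sum over cells of (fo - fe)² / fe. A minimal sketch follows. The function name is hypothetical, and because the observed table did not survive in this transcript, the "observed" counts below are invented by shifting 10 respondents per cell away from the expected values. They are for illustration only and are not the study's data.

```python
def chi_square(observed, expected):
    """Sum of (fo - fe)^2 / fe over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Expected frequencies for cells a, b, c, d, computed from the slide margins.
expected = [335 * 418 / 908, 335 * 490 / 908, 573 * 418 / 908, 573 * 490 / 908]

# If observations matched chance expectations exactly, the statistic is zero.
print(chi_square(expected, expected))  # 0.0

# HYPOTHETICAL observed counts (not from the source): move 10 respondents
# per cell in a pattern that leaves every row and column margin unchanged.
shift = 10
hypothetical_observed = [
    expected[0] - shift,
    expected[1] + shift,
    expected[2] + shift,
    expected[3] - shift,
]
print(chi_square(hypothetical_observed, expected))
```

The larger the gaps between observed and expected counts, the larger the statistic, which is what the key questions above are probing.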

Compare Chi-Square Values:

Critical value: 3.84

Chi-square computed from data: 10.07

Decision: Reject Null.


  • Computed value of chi-square is greater than the critical value; therefore, reject the null hypothesis.

  • Substantive interpretation: The difference in approval between men and women is statistically significant; it is larger than the null hypothesis of no gender effect would predict from chance alone.
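The decision rule can be written out as a small sketch using the two values reported above (3.84 and 10.07). The p-value helper is an addition not found in the slides; it relies on the fact that, with 1 degree of freedom only, the upper-tail chi-square probability equals erfc(sqrt(x / 2)). Names are hypothetical.

```python
import math


def p_value_df1(chi2: float) -> float:
    """Upper-tail probability of a chi-square value with 1 df ONLY."""
    return math.erfc(math.sqrt(chi2 / 2))


critical_value = 3.84   # from the chi-square table, 1 df, alpha = .05
computed = 10.07        # chi-square computed from the data (as reported)

decision = "reject null" if computed > critical_value else "fail to reject null"
p = p_value_df1(computed)
print(decision, p < 0.05)  # reject null True
```

Comparing the statistic to the critical value and comparing the p-value to alpha are two views of the same decision.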

  • Γ = (AP – DP) / (AP + DP)

  • Will give us a value between -1 and 1.

  • Tells us strength and direction.

How to Calculate Gamma

Thinking about cases as pairs - Concordant and Discordant Pairs.

  • Concordant:

    A pair where case A scores higher or lower than does case B on BOTH variables.

  • Discordant:

    A pair where case A scores higher or lower than does case B on ONE variable and the opposite for the other variable.

  • Tied:

    Cases A and B tie on at least one of the variables.

  • We add up the number of tied and Concordant and Discordant pairs.

Calculating Gamma

Concordant Pairs: Down and Right

CP = 30(30 + 40) + 20(40) = 2900

Discordant = Down and Left

DP = 20(20) + 10(20+30) = 900

Gamma = (CP – DP) / (CP + DP) = (2900 – 900) / (2900 + 900) = .53
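The "down and right" and "down and left" counting can be sketched for any ordered table. The table itself is missing from this transcript, so the 2 x 3 table below is an assumption, reconstructed from the CP and DP arithmetic shown above. The function name is hypothetical.

```python
def concordant_discordant(table):
    """Count concordant (down and right) and discordant (down and left) pairs."""
    n_rows, n_cols = len(table), len(table[0])
    cp = dp = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):       # every row below row i
                for l in range(n_cols):
                    if l > j:                    # down and right
                        cp += table[i][j] * table[k][l]
                    elif l < j:                  # down and left
                        dp += table[i][j] * table[k][l]
    return cp, dp


# ASSUMED table, inferred from CP = 30(30 + 40) + 20(40) and
# DP = 20(20) + 10(20 + 30); the original slide image is not available.
table = [
    [30, 20, 10],
    [20, 30, 40],
]

cp, dp = concordant_discordant(table)
gamma = (cp - dp) / (cp + dp)
print(cp, dp, round(gamma, 2))  # 2900 900 0.53
```

Pairs in the same row or the same column are ties and are left out of both counts, which is why Gamma ignores them.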

Τα and Τc

Τα: Same numerator as Gamma but out of total number of pairs.

Τc: Adjusts for the size of the table. M = rows or columns, whichever is less.
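The slide formulas for these two measures did not survive extraction, so the sketch below uses the standard textbook forms as an assumption: tau-a divides CP - DP by the total number of pairs, n(n - 1)/2, and tau-c uses 2m(CP - DP) / (n²(m - 1)). The sample size n = 150 and the 2 x 3 shape are also assumptions, taken from a table inferred from the Gamma example.

```python
# Values from the Gamma example in the slides.
cp, dp = 2900, 900

# ASSUMPTIONS: n and the table shape come from a table inferred from the
# CP/DP arithmetic, not from a table shown in the transcript.
n = 150
n_rows, n_cols = 2, 3
m = min(n_rows, n_cols)          # rows or columns, whichever is less

total_pairs = n * (n - 1) // 2   # every possible pair of cases

tau_a = (cp - dp) / total_pairs
tau_c = 2 * m * (cp - dp) / (n ** 2 * (m - 1))

print(round(tau_a, 3), round(tau_c, 3))
```

Both values come out smaller than Gamma for the same table because tied pairs are no longer left out of the denominator.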


Quantifying Linear Relationships

  • Introduction to Regression Analysis


Two Interests

All science is concerned with the relationships between variables -- the effect of one variable on another. This is what hypothesis testing is all about. We hypothesize that X is related to Y. The two most powerful techniques for analyzing the relationship between interval level variables are:

1. Regression: Magnitude of relationship between the independent variable and the dependent variable (how much change in one yields how much change in the other).

2. Correlation: the predictive power of one variable on another (direction and strength of association).

Consider the relationship between education and income. First we could look at the magnitude of the relationship, that is, the impact of education on income, asking how much of a change in income is associated with one’s number of years of education. E.g., how many more dollars of income would someone earn, on average, if he or she finished college rather than dropping out after 2 years?

We are asking, as education increases how much does income change? Given a positive relationship between education and income (as X goes up Y goes up) how do years of education vary with dollars of yearly income? Is the effect big or small?

Correlation analysis asks: how good a predictor is the "independent variable" of the dependent variable? Here, how good a predictor of income is education? Is education a good indicator of income or not? How accurate is our prediction of the effect of education on income? It tells us how strongly related, how predictive, one variable is of another, say, education of income.

  • Both types of analyses go together, and both concepts can be pictured on scatterplots.

  • Whereas regression effects are depicted by the slope of the line, correlation can be seen as the spread of points around the regression line.

Types of Correlation

  • Positive Correlation: An increase in one variable is associated with an increase in the other.

  • Negative Correlation: An increase in one variable is associated with a decrease in the other.

    The greater the amount of spread of points around the regression line, the less predictive is X of Y and, consequently, the weaker the correlation.
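The link between spread and strength of correlation can be illustrated with the standard Pearson correlation coefficient. This is a sketch only: the function name is hypothetical and all three data sets are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of X and Y over their joint spread."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]   # points hug a rising line
loose = [5.0, 1.0, 9.0, 4.0, 12.0, 8.0]    # same trend, much more spread
falling = [12, 10, 8, 6, 4, 2]             # perfectly negative

r_tight = pearson_r(xs, tight)
r_loose = pearson_r(xs, loose)
r_falling = pearson_r(xs, falling)
print(r_tight > r_loose, round(r_falling, 2))  # True -1.0
```

The sign of r gives the direction (positive or negative), and its distance from zero shrinks as the points scatter more widely around the line.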



Scatter Plot

  • Is a pictorial depiction of the relationship between variables.

  • Is a two-dimensional surface on which all the X and Y scores of all the objects in your study are represented with each object’s value on X and value on Y appearing as a single point.

    Draw a straight line through these points that follows their overall trend. That line is called the "regression line". The regression line is the "best-fitting line" drawn through the points on an X-Y scatterplot.

[Figure: two example regression lines; the surviving labels read "= 1" and "= -2", presumably their slopes]

Imperfect Correlation and Relationships

  • We rarely see perfect correlation.

  • However, even with imperfect correlation, we can have some expectation of what will happen on average.

  • While correlation is rarely perfect, we can draw a line to summarize the trend in the data points. This is the Regression Line.


Establishing Relationships

Now Add 5 years of education

10 Years of Education Means about $12,000 Income

It adds an Additional $4,000 of Income!

Slope of the regression line, called “beta”, written b.

Note that some of the points are above the line, some below. The regression line, the best-fitting line, is the one line that can be drawn through the plot of points that produces the minimum amount of deviation of points from the line. If you drew the line properly, no other line would yield a smaller overall summary measure of distance from each point to the line.

Beta is the change in the dependent variable associated with one unit of change on the predictor variable.

Deviation is the sum of the squared distances of points to the regression line.

Plan 1: Minimize the sum of the distances between the points and the line

Problem: Positive and negative distances cancel, so they all add up to zero!

Solution: Square the Distances

Ŷ (Y-hat) is the point on the regression line directly above or below a given value of X; it is the estimated value of Y for that X. In other words, Ŷ is the predicted value of Y for each value of X.

Fitting a regression line to data points by this method is called the "least squares method" -- the regression line is that line which minimizes the squared difference between the observed points and the point predicted by the line.

The best-fitting line is that line which -- compared to any other line you could plot through the points -- produced the lowest sum of squared deviations. So what we do in a regression analysis is compute that line which minimizes the squared deviation of points from the "best-fitting line".
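A small sketch can show why plain (signed) distances fail as a criterion while squared distances work. The data points and candidate slopes are invented for illustration, and the function names are hypothetical. Every candidate line below passes through the point of means, so its signed deviations cancel to zero, yet the sums of squared deviations differ sharply.

```python
# HYPOTHETICAL data: years of education and yearly income, invented here.
xs = [8, 10, 12, 14, 16]
ys = [10000, 12500, 13500, 15000, 17000]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)


def signed_sum(a, b):
    """Sum of raw deviations of points from the line y = a + b*x."""
    return sum(y - (a + b * x) for x, y in zip(xs, ys))


def squared_sum(a, b):
    """Sum of squared deviations of points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


results = {}
for b in (0, 400, 800, 1200):          # candidate slopes
    a = mean_y - b * mean_x            # forces the line through the means
    results[b] = (signed_sum(a, b), squared_sum(a, b))
    print(b, results[b])

# Among these candidates, the line with the smallest squared sum fits best.
best_slope = min(results, key=lambda slope: results[slope][1])
print(best_slope)
```

The signed sums cannot tell a flat line from a steep one, which is the "they all add up to zero" problem; the squared sums single out one candidate as closest to the points.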

The regression line represents the average amount of change on Y due to changes in X.

Hopefully, these pictures will help you visualize relationships. What regression and correlation analyses each do is produce a summary number to represent a relationship. Regression tells you the magnitude of the relation [shown by the slope of the line], and correlation tells you the predictive power of the relationship [as summarized by the correlation coefficient, written r], which gives you a summary measure of errors in prediction.

y = a + bX

a is the "intercept" or "constant": the point where the regression line intersects the Y axis. It is the value of Y when X is zero.

b (beta) is the slope of the regression line

X is the value of the independent variable

Interpretation: a one unit change on X relates to a beta change on Y; the predicted value of Y is that slope effect added to the value of the intercept.

e.g., Income = $4100 (intercept) + $800 * X (Years of Education)
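The income example can be turned into a short sketch. The intercept and slope are the ones given above; the function and constant names are hypothetical.

```python
INTERCEPT = 4100   # a: predicted income with zero years of education
SLOPE = 800        # b: dollars of income per additional year of education


def predicted_income(years_of_education):
    """y = a + b * X"""
    return INTERCEPT + SLOPE * years_of_education


print(predicted_income(0))                          # 4100
print(predicted_income(10))                         # 12100
print(predicted_income(15) - predicted_income(10))  # 4000
```

Ten years of education gives a predicted income of about $12,000, and five more years add $4,000, in line with the scatterplot example earlier in the transcript.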

The dependence of Y on X can be of two types: “deterministic” or “probabilistic”.

The classic case of a deterministic relationship is that between Fahrenheit and Centigrade measures of temperature:

F° = 32 + (9/5)C

Where a, the intercept, is 32°, so when C = 0 degrees, F = 32. b (beta) is the slope of the line, here (9/5) or 1.8. C is X, degrees Centigrade. So for every one degree of change in degrees C, Fahrenheit goes up by 1.8 degrees, starting at 32 degrees:

when C = 0°, F = 32 + (9/5)(0) = 32°

when C = 100°, F = 32 + (9/5)(100) = 212°
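The deterministic case is easy to verify in code, since every input gives exactly one output with no error term. The function name below is hypothetical.

```python
def centigrade_to_fahrenheit(c):
    """F = 32 + (9/5) * C: intercept 32, slope 1.8."""
    return 32 + (9 / 5) * c


print(centigrade_to_fahrenheit(0))    # 32.0
print(centigrade_to_fahrenheit(100))  # 212.0
```

Unlike the income example, there is no scatter around this line: knowing C fixes F completely.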

  • Probabilistic Regression

  • Not perfectly predictive.

  • On average, we expect a certain amount of change in Y for a certain change in X.

  • The formula for beta, the slope of the regression line: b = cov(X, Y) / var(X)

Where the numerator is the covariance of X and Y and the denominator is the variance of X.
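The covariance-over-variance description of the slope can be sketched directly. The function name is hypothetical, and the intercept formula a = mean(Y) - b * mean(X) is the standard least-squares companion to the slope rather than something stated on the slide. The data are generated from the transcript's own example line (Income = 4100 + 800 * X) with no noise added, so the sketch should recover those two numbers.

```python
def slope_and_intercept(xs, ys):
    """Least-squares slope b = cov(X, Y) / var(X) and intercept a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    b = cov_xy / var_x
    a = mean_y - b * mean_x   # the line passes through the point of means
    return a, b


# Noise-free illustration built from the example equation in the transcript.
years = [8, 10, 12, 14, 16]
income = [4100 + 800 * x for x in years]

a, b = slope_and_intercept(years, income)
print(a, b)  # 4100.0 800.0
```

With real, noisy data the same two lines of arithmetic apply; the points simply no longer fall exactly on the fitted line.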