The Chi Square statistic tests: Whether the difference between what you observe and what chance would predict is due to sampling error. The greater the deviation of what we observe from what we would expect by chance, the greater the probability that the difference is NOT due to chance.
Step #1: Hypotheses:
Categorizing the same individuals in two ways: Approval and Gender
Looking at the Effect of an Independent Variable (Gender) on Dependent Variable (Approval).
This is a classic application for the χ² test.
· 1. We have nominal data in both variables - men vs women, approve vs disapprove.
· 2. The data are in the form of frequencies and
· 3. We are looking to see if there is a relationship between the two variables.
Step 3: DETERMINE LEVEL OF SIGNIFICANCE
We set as our standard 95% confidence that the difference we observe in our study is not due to chance. This is equivalent to setting alpha, the risk level, at 5% (= .05).
Assuming the null hypothesis is true, what would be the expected values?
Expected frequency = (Row Margin * Column Margin) / N, where N is the total number of cases (here 908)
Cell a: 335 * 418 / 908 = 140,030 / 908 = 154
Cell b: 335 * 490 / 908 = 164,150 / 908 = 181
Cell c: 573 * 418 / 908 = 239,514 / 908 = 264
Cell d: 573 * 490 / 908 = 280,770 / 908 = 309
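The four cell calculations above can be reproduced with a short sketch. It assumes only the margins used in the arithmetic (row margins 335 and 573, column margins 418 and 490, N = 908); the function name `expected_counts` is made up for illustration.

```python
# Margins taken from the worked cell calculations above.
row_margins = [335, 573]
col_margins = [418, 490]


def expected_counts(rows, cols):
    """Expected frequency per cell under the null: row margin * column margin / N."""
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]


expected = expected_counts(row_margins, col_margins)

# Flatten row by row so the cells come out in the order a, b, c, d.
flat = [value for row in expected for value in row]
for label, value in zip("abcd", flat):
    print(f"Cell {label}: {value:.2f} (rounded: {round(value)})")
```

Rounding each expected value to the nearest whole number gives the figures shown above.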
2. Null Percentage Method: If the null is true, the percentage of Men and Women should be the same. Then compute the frequency based on that percentage.
Critical value: 3.84
Chi-square computed from data: 10.07
Decision: Reject Null.
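The decision rule can be sketched as follows. The sketch assumes 1 degree of freedom, which is what a 2 x 2 table (cells a through d) gives and what the critical value of 3.84 implies. With 1 degree of freedom a chi-square value is a squared standard normal score, so the standard library is enough. The computed statistic 10.07 is taken as given, since the observed frequencies are not listed here.

```python
from math import erfc, sqrt
from statistics import NormalDist

alpha = 0.05
chi_square = 10.07  # computed value reported above

# Assumption: df = 1. Then the critical value is the square of the
# two-tailed z cutoff (about 1.96 for alpha = .05).
critical_value = NormalDist().inv_cdf(1 - alpha / 2) ** 2

# Upper-tail probability of the computed statistic (valid for df = 1 only).
p_value = erfc(sqrt(chi_square / 2))

reject_null = chi_square > critical_value
print(f"Critical value: {critical_value:.2f}")
print(f"p-value: {p_value:.4f}")
print("Decision:", "Reject Null" if reject_null else "Fail to reject Null")
```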
STEP 6: STATE CONCLUSION
Thinking about cases as pairs - Concordant and Discordant Pairs.
Concordant pair: a pair where case A scores higher (or lower) than case B on BOTH variables.
Discordant pair: a pair where case A scores higher (or lower) than case B on ONE variable and the opposite on the other variable.
Tied pair: cases A and B tie on at least one of the variables.
Concordant Pairs: Down and Right
CP = 30(30 + 40) + 20(40) = 2900
Discordant = Down and Left
DP = 20(20) + 10(20+30) = 900
Gamma = (CP – DP) / (CP + DP) = (2900 – 900) / (2900 + 900) = .53
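The pair counting can be checked with a small sketch. The table itself does not appear in the text, so the 2 x 3 table below is an assumption reconstructed from the arithmetic (first row 30, 20, 10; second row 20, 30, 40). The function name `pair_counts` is made up for illustration.

```python
# Assumed 2 x 3 table of counts, reconstructed from the CP and DP arithmetic;
# the original table is not shown in the text.
table = [
    [30, 20, 10],
    [20, 30, 40],
]


def pair_counts(t):
    """Return (concordant, discordant) pair counts for an ordered table."""
    concordant = discordant = 0
    n_rows, n_cols = len(t), len(t[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):  # only rows below cell (i, j)
                for m in range(n_cols):
                    if m > j:    # down and right: concordant
                        concordant += t[i][j] * t[k][m]
                    elif m < j:  # down and left: discordant
                        discordant += t[i][j] * t[k][m]
    return concordant, discordant


cp, dp = pair_counts(table)
gamma = (cp - dp) / (cp + dp)
print(f"CP = {cp}, DP = {dp}, Gamma = {gamma:.2f}")
```

Pairs that fall in the same row or the same column are ties and are skipped, which is why Gamma uses only CP + DP in the denominator.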
(CP – DP) / (total number of pairs)
Same numerator as Gamma but out of total number of pairs.
Adjusts for the size of the table
M = rows or columns, whichever is less.
Quantifying Linear Relationships
All science is concerned with the relationships between variables -- the effect of one variable on another. This is what hypothesis testing is all about. We hypothesize that X is related to Y. The two most powerful techniques for analyzing the relationship between interval level variables are:
1. Regression: Magnitude of the relationship between the independent variable and the dependent variable (how much change in one yields how much change in the other).
2. Correlation: the predictive power of one variable on another (direction and strength of association).
Consider the relationship between education and income. First we could look at the strength of the relationship, that is, the impact of education on income, asking how much of a change in income is associated with one’s number of years of education. E.g., how many more dollars of income would someone earn, on average, if he or she finished college rather than dropping out after 2 years?
We are asking, as education increases how much does income change? Given a positive relationship between education and income (as X goes up Y goes up) how do years of education vary with dollars of yearly income? Is the effect big or small?
Correlation analysis asks: how good a predictor is the "independent variable" of the dependent variable? Here, how good a predictor of income is education? Is education a good indicator of income or not? How accurate is our prediction of the effect of education on income? It tells us how strongly related, that is, how predictive, one variable is of another, say, education of income.
Types of Correlation
The greater the spread of points around the regression line, the less predictive X is of Y and, consequently, the weaker the correlation.
Draw a straight line through these points. Connect the dots. That line is called the "regression line". The regression line is the "best-fitting line" drawn through the points on an X-Y scatterplot.
Correlation = 1, Slope = 1
Imperfect Correlation and Relationships
Now Add 5 years of education
10 Years of Education Means about $12,000 Income
It adds an Additional $4,000 of Income!
Slope of the regression line, called “beta”, written b. Note that some of the points are above the line, some below. The regression line, the best-fitting line, is the one line that can be drawn through the plot of points that produces the minimum amount of deviation of points from the line. If you drew the line properly, no other line would yield a smaller overall summary measure of distance from each point to the line.
Beta is the change in the dependent variable associated with one unit of change on the predictor variable.
Deviation is the sum of the squared distances from the points to the regression line.
Where do we Draw the Line?
Plan 1: Minimize the sum of the distances between the points and the line
Problem: They all add up to zero!
Solution: Square the Distances
Yhat is the point where X meets the regression line; it is the estimated point on Y for each X value. Yhat is that point on the regression line predicted for each value of X: it’s the predicted value of Y for each value of X.
Fitting a regression line to data points by this method is called the "least squares method" -- the regression line is that line which minimizes the squared difference between the observed points and the point predicted by the line.
The best-fitting line is that line which -- compared to any other line you could plot through the points -- produced the lowest sum of squared deviations. So what we do in a regression analysis is compute that line which minimizes the squared deviation of points from the "best-fitting line".
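To see why the distances are squared, here is a minimal sketch using five hypothetical (years of education, income) points invented for illustration. It fits the line with the usual closed-form least squares formulas, shows that the signed distances add up to zero, and checks that nudging the line in any direction makes the sum of squared deviations larger. The helper names are not from the original notes.

```python
# Hypothetical data, invented for illustration only.
xs = [8, 10, 12, 14, 16]                    # years of education
ys = [10000, 12500, 13000, 16000, 16500]    # yearly income


def least_squares(xs, ys):
    """Return intercept a and slope b of the least squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residuals(a, b, xs, ys):
    """Signed distance of each observed Y from its predicted value Yhat."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sum_squared(a, b, xs, ys):
    """Sum of squared deviations of the points from the line."""
    return sum(e ** 2 for e in residuals(a, b, xs, ys))


a, b = least_squares(xs, ys)
signed_total = sum(residuals(a, b, xs, ys))
best = sum_squared(a, b, xs, ys)

print(f"a = {a:.1f}, b = {b:.1f}")
print(f"Sum of signed distances: {signed_total:.6f}")
print(f"Sum of squared distances: {best:.1f}")

# Shift the intercept or tilt the slope a little: the squared sum always grows.
for da, db in [(500, 0), (-500, 0), (0, 50), (0, -50)]:
    worse = sum_squared(a + da, b + db, xs, ys)
    print(f"a{da:+}, b{db:+}: {worse:.1f}")
```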
The regression line represents the average amount of change on Y due to changes in X.
Hopefully, these pictures will help you visualize relationships. What regression and correlation analyses each do is produce a summary number to represent a relationship. Regression tells you the strength of the relation [shown by the slope of the line], while the correlation coefficient, written r, summarizes the predictive power of the relationship and gives you a summary measure of errors in prediction.
The point where the regression line intersects the Y axis is called the "intercept" or "constant", written a. It is the point where the independent variable is zero: the value of Y when X is zero.
b (beta) is the slope of the regression line
X is the value of the independent variable
Interpretation: a one-unit change in X relates to a change of b (beta) in Y; the predicted value of Y is the intercept plus b times X.
y = a + bX
e.g., Income = $4100 (intercept) + $800 * X(Years of Education)
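As a quick sketch, the example equation can be turned into a prediction function. The intercept and slope come from the example above; the function name `predicted_income` is made up for illustration.

```python
INTERCEPT = 4100  # a: predicted income at zero years of education
SLOPE = 800       # b: dollars of income per additional year of education


def predicted_income(years_of_education):
    """Predicted yearly income from y = a + bX."""
    return INTERCEPT + SLOPE * years_of_education


print(predicted_income(10))                         # 12100
print(predicted_income(15) - predicted_income(10))  # 4000
```

Ten years of education predicts about $12,000, and five more years add 5 * $800 = $4,000, which matches the figures quoted earlier.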
The dependence of Y on X can be of two types: “deterministic” or “probabilistic”.
The classic case of deterministic relationship is that between Fahrenheit and Centigrade measure of temperature:
F° = 32 + (9/5)C
Where a, the intercept, is 32°, so when C = 0 degrees, F = 32. b (beta) is the slope of the line, here (9/5) or 1.8. C is X, degrees Centigrade. So for every one degree of change in degrees C, Fahrenheit goes up by 1.8 degrees, starting at 32 degrees:
when C = 0°, F = 32° + (9/5)0 = 32°
when C = 100°, F = 32° + (9/5)100 = 212°
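A short sketch of what "deterministic" means in practice: every point falls exactly on the line, so the correlation is perfect. The temperatures below are arbitrary illustration values, and `pearson_r` is a hand-written helper using the standard product-moment formula rather than anything from the original notes.

```python
from math import sqrt


def fahrenheit(celsius):
    """Deterministic linear relationship: a = 32, b = 9/5."""
    return 32 + (9 / 5) * celsius


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


celsius = [0, 25, 50, 75, 100]             # arbitrary illustration values
fahr = [fahrenheit(c) for c in celsius]

r = pearson_r(celsius, fahr)
print(f"F at 0 C: {fahrenheit(0):.1f}")
print(f"F at 100 C: {fahrenheit(100):.1f}")
print(f"r = {r:.4f}")
```

In a probabilistic relationship such as education and income, the points scatter around the line and r falls below 1.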
Where the numerator is the covariance of X and Y and the denominator is the variance.
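The formula this sentence describes is not shown in the text. Assuming it is the usual least squares slope, b = covariance(X, Y) / variance(X), the sketch below checks that reading against data generated exactly from the earlier example equation Income = $4100 + $800 * X. The data and helper names are hypothetical.

```python
# Hypothetical data generated exactly from Income = 4100 + 800 * X.
years = [8, 10, 12, 14, 16]
income = [4100 + 800 * x for x in years]


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of X and Y."""
    mean_x, mean_y = mean(xs), mean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def variance(xs):
    """Sample variance: the covariance of X with itself."""
    return covariance(xs, xs)


# Assumed formula: slope = covariance of X and Y over variance of X.
b = covariance(years, income) / variance(years)
a = mean(income) - b * mean(years)
print(f"b = {b:.1f}, a = {a:.1f}")
```

Because the data sit exactly on the line, the calculation recovers the slope of 800 and the intercept of 4100.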