CORRELATION. Introduction to Correlation.
In this scatterplot, computer anxiety scores (openness to computing) are plotted along the Y (vertical) axis and computer self-efficacy scores are plotted along the X (horizontal) axis. For example, the person to whom the arrow is pointing had a score of about 17 on the openness scale and about 162 on the self-efficacy scale. What were the scores on the two scales of the person with the star next to their point?
The purpose of the scatterplot is to visualize the relationship between the two variables represented by the horizontal and vertical axes. Note that although the relationship is not perfect, higher values of openness to computing tend to be associated with higher values of computer self-efficacy, suggesting that as openness increases, self-efficacy increases. This indicates that there is a positive correlation.
Let’s draw a line through the swarm of points that best “fits” the data set, that is, the line that minimizes the sum of the squared vertical distances between the line and the points. This imposes a linear description on the relationship between the two variables. Sometimes you might want to find out whether a line representing a curvilinear relationship (in this case an inverted U) would be a better fit, but we’ll leave that question for another time. The line that best represents this relationship mathematically is called a “regression line,” and the point at which it crosses the y axis is called the “intercept.”
Strong negative Relationship between X and Y; points tightly clustered around line; nonlinear trend at lower weights
Essentially no relationship between X and Y; points loosely clustered around line
Positive Relationship between X and Y
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$$
Where X is a person’s or case’s score on the independent variable, Y is a person’s or case’s score on the dependent variable, and X-bar and Y-bar are the means of the scores on the independent and dependent variables, respectively. The quantity in the numerator is called the sum of the crossproducts (SP). The quantity in the denominator is the square root of the product of the sum of squares for both variables (SSx and SSy)
Here is another computing formula
$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$
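The definitional formula (SP over the square root of SSx times SSy) and the computing formula are algebraically equivalent. The sketch below checks this on a small set of made-up scores; the data and the function names are illustrative assumptions, not values from the correlation.sav data set.

```python
import math


def pearson_r_definitional(x, y):
    """r = SP / sqrt(SSx * SSy), using deviations from the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)


def pearson_r_computing(x, y):
    """r from raw sums: N, sum X, sum Y, sum XY, sum X^2, sum Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical scores, made up purely for illustration.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

r_def = pearson_r_definitional(x_scores, y_scores)
r_comp = pearson_r_computing(x_scores, y_scores)
print(round(r_def, 3), round(r_comp, 3))
```

Both functions should print the same value, which is why the computing formula is convenient for hand calculation: it needs only running totals, not the means.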
We will do an example using this computing formula next, so let’s download the correlation.sav data set
A negative relationship: The more shy you are (the farther you are along the X axis), the fewer speeches you give (the lower you are on the Y axis)
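$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} = \frac{(6 \times 107) - (30)(32)}{\sqrt{\left[6(230) - 30^2\right]\left[6(226) - 32^2\right]}}$$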
r = -.797 (note crossproducts term in the numerator is negative) and R-square = .635
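The arithmetic in this example can be checked with a short script. This is a minimal sketch that plugs the six totals from the worked example (N = 6, ∑X = 30, ∑Y = 32, ∑XY = 107, ∑X² = 230, ∑Y² = 226) into the computing formula; the function name `r_from_sums` is an illustrative assumption.

```python
import math


def r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson r from the raw totals used in the computing formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Totals taken from the worked example above.
r = r_from_sums(n=6, sum_x=30, sum_y=32, sum_xy=107, sum_x2=230, sum_y2=226)
r_squared = r ** 2

print(f"r = {r:.3f}, R-square = {r_squared:.3f}")  # r = -0.797, R-square = 0.635
```

The numerator works out to 642 − 960 = −318, which is where the negative sign of r comes from.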
Significance of r is tested with a t-statistic with N-2 degrees of freedom, where

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$
SPSS provides the results of the t test of the significance of r for you. You can also consult Table F in Levin and Fox.
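For readers who want to check the t test by hand, here is a minimal sketch applying the t formula to the earlier example (r = -.797, N = 6). The function name is an illustrative assumption, and the critical value 2.776 is the standard two-tailed .05 table value for 4 degrees of freedom, not a figure taken from this presentation.

```python
import math


def t_for_r(r, n):
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = -0.797          # correlation from the earlier worked example
n = 6               # number of cases in that example
df = n - 2          # degrees of freedom for the test

t = t_for_r(r, n)

# Standard t-table value for df = 4, two-tailed alpha = .05 (an added assumption).
critical_t = 2.776
significant = abs(t) > critical_t

print(f"t({df}) = {t:.3f}, significant at .05 (two-tailed): {significant}")
```

With so few cases, even a fairly large r can fall short of the critical value, which is one reason to report N alongside r.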
Write a sentence which states your findings. Report the correlation coefficient, r, R² (the proportion of variance in Y accounted for by X), the significance level, and N, as well as the means on each of the two variables. Indicate whether or not your hypothesis was supported.
The correlation of a variable with itself is 1, which appears in all the main diagonal cells
The hypothesis that the proportion of its people living in cities would be positively associated with a country’s rate of male literacy was confirmed (r = .587, DF=83, p < .01, one-tailed).
The intercept, or a (sometimes called β0)
Beta weight when X and Y are expressed in standard score units
The slope, or β
The regression equation for predicting Y (male literacy) is Y = a + (b)X, or Y = 52.372 +.495X, so if we wanted to predict the male literacy rate in country j we would multiply its percentage of people living in cities by .495, and add the constant, 52.372. Compare this to the scatterplot. Does it look right?
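The prediction step can be sketched in a few lines. The intercept and slope below come from the regression equation above; the function name and the 50% urban figure are hypothetical, chosen only to illustrate the arithmetic.

```python
INTERCEPT = 52.372   # a, the constant from the regression equation
SLOPE = 0.495        # b, the unstandardized slope


def predict_male_literacy(pct_urban):
    """Predicted male literacy rate: Y = a + (b)X."""
    return INTERCEPT + SLOPE * pct_urban


# Hypothetical country with 50% of its people living in cities.
predicted = predict_male_literacy(50.0)
print(f"Predicted male literacy: {predicted:.3f}")  # 77.122
```

Setting X to 0 returns the intercept itself, which is a quick way to check that the line crosses the y axis where the scatterplot says it should.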
When scores on X and Y are expressed as Z scores, they are in the same standardized units, so there is no intercept (constant): you don’t have to make an adjustment for the differences in scale between X and Y. The equation for the regression line then becomes simply Y = (b) X, or in this case Y = .587 X, where .587 is the standardized version of b. (Note that it is also the value of r, but only when there is just the one X variable and not multiple independent variables.)
The correlation coefficient and the coefficient of determination. The coefficient of determination, or R-square, is the proportion of variance in the dependent variable which can be accounted for by the independent variable. Adjusted R-square is an adjustment made to R-square when you have many independent variables or predictors in the equation, or complexities like cubic terms. With only one predictor the adjustment is minor.
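To see how small the adjustment is with one predictor, here is a minimal sketch using the literacy example (r = .587). It assumes N = 85, inferred from the reported DF of 83 (N - 2), and uses the usual adjusted R-square formula, 1 − (1 − R²)(N − 1)/(N − k − 1); SPSS output computed from unrounded values may differ slightly.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-square for n cases and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


r = 0.587            # correlation reported for % living in cities and male literacy
n = 85               # assumed: DF = 83 implies N = 85
k = 1                # one predictor

r_squared = r ** 2
adj_r_squared = adjusted_r_squared(r_squared, n, k)

print(f"R-square = {r_squared:.3f}, adjusted R-square = {adj_r_squared:.3f}")
```

The penalty grows as k increases relative to N, which is why the adjustment matters more in equations with many predictors.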
If the independent variable, X, were of no value in predicting Y, the best estimate of Y would be the mean of Y. To see how much better our calculated regression line is as a predictor of Y than the simple mean of Y, we calculate the sum of squares for the regression line and then a residual sum of squares (the variation left over after the regression line has done its work as a predictor), which shows how well or how badly the regression line fits the actual obtained scores on Y. If the residual mean square is large compared to the regression mean square, the value of F will be low and the resulting F ratio may not be significant. If the F ratio is statistically significant, it suggests that we can reject the hypothesis that the slope for our predictor, β, is zero in the population, and say that the regression line predicts Y better than the mean of Y alone.
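The partition of the sums of squares can be sketched as follows. The scores are made up for illustration, and the function name is an assumption; the sketch fits the least-squares line, splits the total sum of squares into regression and residual parts, and forms F as the ratio of the two mean squares (1 and N - 2 degrees of freedom with one predictor).

```python
def regression_anova(x, y):
    """Return (ss_regression, ss_residual, ss_total, f_ratio) for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept.
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = sp / ss_x
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]
    # Partition the variation in Y.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_regression = sum((pi - mean_y) ** 2 for pi in predicted)
    # Mean squares: regression has 1 df, residual has n - 2 df.
    f_ratio = (ss_regression / 1) / (ss_residual / (n - 2))
    return ss_regression, ss_residual, ss_total, f_ratio


# Hypothetical scores, made up purely for illustration.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

ss_reg, ss_res, ss_tot, f_ratio = regression_anova(x_scores, y_scores)
print(f"SS regression = {ss_reg:.2f}, SS residual = {ss_res:.2f}, F = {f_ratio:.2f}")
```

The regression and residual sums of squares add up to the total sum of squares, and with a single predictor F equals the square of the t statistic for r.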
In the calculation of the partial correlation coefficient rYX2.X1, the area of interest is section a, and the effects removed are those in b, c, and d. Partial correlation is the relationship of X2 and Y after the influence of X1 is completely removed from both variables. When only the effect of X1 on X2 is removed, this is called a part correlation: part correlation first removes from X2 all variance which may be accounted for by X1 (sections c and b), then correlates the remaining unique component of X2 with the dependent variable, Y.
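Both coefficients can be computed from the three zero-order correlations using the standard first-order formulas. The sketch below is illustrative only: the function names are assumptions, and the input correlations are hypothetical values, not the ones from the literacy data.

```python
import math


def partial_r(r_yx2, r_yx1, r_x2x1):
    """Partial correlation of Y and X2 with X1 removed from both."""
    numerator = r_yx2 - r_yx1 * r_x2x1
    return numerator / math.sqrt((1 - r_yx1 ** 2) * (1 - r_x2x1 ** 2))


def part_r(r_yx2, r_yx1, r_x2x1):
    """Part (semipartial) correlation: X1 removed from X2 only."""
    numerator = r_yx2 - r_yx1 * r_x2x1
    return numerator / math.sqrt(1 - r_x2x1 ** 2)


# Hypothetical zero-order correlations, chosen only for illustration.
r_yx2 = 0.5   # Y with the predictor of interest, X2
r_yx1 = 0.4   # Y with the control variable, X1
r_x2x1 = 0.3  # X2 with the control variable, X1

partial = partial_r(r_yx2, r_yx1, r_x2x1)
part = part_r(r_yx2, r_yx1, r_x2x1)
print(f"partial r = {partial:.3f}, part r = {part:.3f}")
```

The two formulas share a numerator and differ only in the denominator: the partial correlation also removes X1 from Y, so its absolute value is never smaller than that of the part correlation.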
Comparing Partial to Zero-Order Correlation: Effect of Controlling for GDP on Relationship Between Percent Living in Cities and Male Literacy
[Annotated SPSS output. Callouts: zero-order r; control variable with X; r when effect of GDP is removed]
Note that the partial correlation of % people living in cities and male literacy is only .4644 when GDP is held constant, where the zero order correlation you obtained previously was .5871. So clearly GDP is a control variable which influences the relationship between % of people living in cities and male literacy, although the % living in cities-literacy relationship is still significant even with GDP removed