Explore simple bi-variate regression, the general linear model, residual plots, and multiple regression for predicting variables in the social sciences. Learn how to compute the best-fitting line and understand error minimization.
Statistics for the Social Sciences (Psychology 340, Fall 2006): Prediction
Outline (for week) • Simple bi-variate regression, least-squares fit line • The general linear model • Residual plots • Using SPSS • Multiple regression • Comparing models (change in r2) • Using SPSS
Regression • Last time: with correlation, we examined whether variables X & Y are related • This time: with regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.
Regression • Last time: “it doesn’t matter which variable goes on the X-axis or the Y-axis” • For regression this is NOT the case • The variable that you are predicting goes on the Y-axis (criterion variable) • The variable that you are making the prediction based on goes on the X-axis (predictor variable) • [Scatterplot: hours of study (X, predictor) vs. quiz performance (Y, criterion)]
Regression • Last time: “Imagine a line through the points” • But there are lots of possible lines • One line is the “best fitting line” • Today: learn how to compute the equation corresponding to this “best fitting line” • [Scatterplot: hours of study (X) vs. quiz performance (Y)]
The equation for a line • A brief review of geometry • Y = (X)(slope) + (intercept) • The intercept is the value of Y when X = 0 (here, 2.0)
The equation for a line • A brief review of geometry • Y = (X)(slope) + (intercept) • slope = change in Y / change in X; in the figure, Y rises by 1 when X increases by 2, so the slope is 0.5
The equation for a line • A brief review of geometry • Y = (X)(slope) + (intercept) • Y = (X)(0.5) + 2.0
Regression • A brief review of geometry • Consider a perfect correlation • If X = 5, what is Y? Y = (X)(0.5) + (2.0) = (5)(0.5) + (2.0) = 2.5 + 2.0 = 4.5 • We can make specific predictions about Y based on X
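As a quick check of the arithmetic, here is a minimal Python sketch of the line equation; the function name predict_y and the default slope/intercept values are just the example's numbers, not anything from the course materials:

```python
# Minimal sketch: Y = (X)(slope) + (intercept), using the slope (0.5)
# and intercept (2.0) from the worked example above.

def predict_y(x, slope=0.5, intercept=2.0):
    """Return the predicted Y for a given X on the line."""
    return x * slope + intercept

print(predict_y(5))  # 5 * 0.5 + 2.0 = 4.5, matching the worked example
```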
Regression • Consider a less than perfect correlation • The line still represents the predicted values of Y given X • If X = 5: Y = (X)(0.5) + (2.0) = (5)(0.5) + (2.0) = 2.5 + 2.0 = 4.5
Regression • The “best fitting line” is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points) • Rather than comparing the errors from different lines and picking the best, we will directly compute the equation for the best fitting line (a small sketch of the comparison idea follows)
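To see what “minimizing error” means, here is a small sketch comparing the sum of squared errors for a few candidate lines. The data are the five (X, Y) pairs from the dataset introduced a couple of slides below; the first two slope/intercept pairs are arbitrary choices, and the third is the least-squares line computed later in the lecture:

```python
# Sketch: compare the total squared error of a few candidate lines
# against the least-squares line for the lecture's five data points.

data = [(6, 6), (1, 2), (5, 6), (3, 4), (3, 2)]

def sum_squared_error(slope, intercept):
    """Sum of squared differences between actual Y and the line's prediction."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in data)

for slope, intercept in [(0.5, 2.0), (1.0, 0.0), (0.92, 0.688)]:
    print(slope, intercept, round(sum_squared_error(slope, intercept), 2))
# The last pair (0.92, 0.688) is the least-squares line, and it yields the
# smallest sum of squared errors (about 3.1; the slides round to 3.09).
```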
Regression • The linear model: Y = intercept + slope(X) + error • The betas (β) are sometimes called parameters • They come in two types: standardized and unstandardized • Now let’s go through an example computing these things
Scatterplot • Using the dataset from our correlation lecture: the five (X, Y) pairs are (6, 6), (1, 2), (5, 6), (3, 4), (3, 2) • [Scatterplot of these five points]
From the Computing Pearson’s r lecture:

X    Y    X-3.6   (X-3.6)^2   Y-4.0   (Y-4.0)^2   (X-3.6)(Y-4.0)
6    6     2.4      5.76       2.0      4.0            4.8
1    2    -2.6      6.76      -2.0      4.0            5.2
5    6     1.4      1.96       2.0      4.0            2.8
3    4    -0.6      0.36       0.0      0.0            0.0
3    2    -0.6      0.36      -2.0      4.0            1.2

mean X = 3.6, mean Y = 4.0; SSX = 15.20, SSY = 16.0, SP = 14.0
Computing regression line (with raw scores) • Using SSX = 15.20, SSY = 16.0, SP = 14.0, mean X = 3.6, mean Y = 4.0 from the table above • slope = SP / SSX = 14.0 / 15.20 ≈ 0.92 • intercept = mean Y − (slope)(mean X) = 4.0 − (0.92)(3.6) = 0.688
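A minimal Python sketch of the same raw-score computation (the variable names are just illustrative):

```python
# Sketch: compute SP, SSX, slope and intercept for the example data.

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]

mean_x = sum(X) / len(X)   # 3.6
mean_y = sum(Y) / len(Y)   # 4.0

SSX = sum((x - mean_x) ** 2 for x in X)                        # 15.20
SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))   # 14.0

slope = SP / SSX                      # ~0.92
intercept = mean_y - slope * mean_x   # ~0.68-0.69 depending on rounding

print(round(slope, 2), round(intercept, 3))
```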
Computing regression line (with raw scores) • [Scatterplot of the five (X, Y) points with the best-fitting line drawn through them]
Computing regression line (with raw scores) • The two means will be on the line: the point (mean X, mean Y) = (3.6, 4.0) always falls on the best-fitting line • [Scatterplot with the line passing through (3.6, 4.0)]
Computing regression line (standardized, using z-scores) • Sometimes the regression equation is standardized • Computed based on z-scores rather than with raw scores:

X    Y    X-3.6   (X-3.6)^2   Y-4.0   (Y-4.0)^2    zX      zY
6    6     2.4      5.76       2.0      4.0        1.38     1.1
1    2    -2.6      6.76      -2.0      4.0       -1.49    -1.1
5    6     1.4      1.96       2.0      4.0        0.80     1.1
3    4    -0.6      0.36       0.0      0.0       -0.34     0.0
3    2    -0.6      0.36      -2.0      4.0       -0.34    -1.1

mean X = 3.6, mean Y = 4.0; SSX = 15.20, SSY = 16.0; std dev X = 1.74, std dev Y = 1.79; the z-score means are 0.0
Computing regression line (standardized, using z-scores) • Sometimes the regression equation is standardized • Computed based on z-scores rather than with raw scores • Prediction model: the predicted Z score (on the criterion variable) = the standardized regression coefficient multiplied by the Z score on the predictor variable • Formula: predicted zY = (β)(zX) • The standardized regression coefficient (β): in bivariate prediction, β = r (a short computational sketch follows)
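A short sketch of the standardized version, confirming that the standardized coefficient equals Pearson's r in the bivariate case (standard deviations are computed by dividing by n, to match the slide's values of 1.74 and 1.79):

```python
# Sketch: z-score both variables and verify that the standardized slope
# equals Pearson's r for bivariate regression.
import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in X) / n)   # ~1.74
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / n)   # ~1.79

zX = [(x - mean_x) / sd_x for x in X]
zY = [(y - mean_y) / sd_y for y in Y]

r = sum(zx * zy for zx, zy in zip(zX, zY)) / n            # ~0.90
print(round(r, 2))   # the standardized regression coefficient (beta) equals r
```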
Computing regression line (with z-scores) • [Scatterplot of the five z-scored points: (1.38, 1.1), (-1.49, -1.1), (0.80, 1.1), (-0.34, 0.0), (-0.34, -1.1); both z-score means are 0.0]
Regression • Y = intercept + slope(X) + error • The linear equation isn’t the whole thing: we also need a measure of error • Both panels share the same line, Y = X(0.5) + (2.0) + error • Same line, but different relationships (strength difference) • [Two scatterplots with the same fitted line but different spread of points around it]
Regression • Error: the actual score minus the predicted score • Measures of error: • r2 (r-squared) • Proportionate reduction in error • Note: total squared error when predicting from the mean = SSTotal = SSY • Squared error using the prediction model = sum of the squared residuals = SSresidual = SSerror
R-squared • r2 represents the percent variance in Y accounted for by X • Example: r = 0.8 gives r2 = 0.64, so 64% of the variance is explained; r = 0.5 gives r2 = 0.25, so 25% of the variance is explained • [Two scatterplots illustrating the stronger (r = 0.8) and weaker (r = 0.5) relationships]
Computing Error around the line • Compute the difference between the predicted values and the observed values (the “residuals”) • Square the differences • Add up the squared differences • Sum of the squared residuals = SSresidual = SSerror • [Scatterplot showing the vertical distances from each point to the line]
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • The predicted values of Y are the points on the line • Data: X = 6, 1, 5, 3, 3; Y = 6, 2, 6, 4, 2 (mean X = 3.6, mean Y = 4.0)
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • For X = 6, the predicted value is (0.92)(6) + 0.688 = 6.2
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • Predicted values for each X: (0.92)(6) + 0.688 = 6.2; (0.92)(1) + 0.688 = 1.6; (0.92)(5) + 0.688 = 5.3; (0.92)(3) + 0.688 = 3.45; (0.92)(3) + 0.688 = 3.45
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • [Scatterplot with the predicted values 6.2, 1.6, 5.3, 3.45, 3.45 marked on the line]
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • Residuals (observed minus predicted): 6 − 6.2 = −0.20; 2 − 1.6 = 0.40; 6 − 5.3 = 0.70; 4 − 3.45 = 0.55; 2 − 3.45 = −1.45 • Quick check: the residuals sum to 0.00
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • Squared residuals: 0.04, 0.16, 0.49, 0.30, 2.10 • SSerror = 3.09
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • For comparison, the squared deviations of Y from its mean are 4.0, 4.0, 4.0, 0.0, 4.0, so SSY = 16.0 (while SSerror = 3.09)
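A minimal sketch of the same residual computation in Python, using the rounded slope and intercept from the slides:

```python
# Sketch: predicted values, residuals, and the sum of squared residuals.

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
slope, intercept = 0.92, 0.688   # rounded values from the slides

predicted = [slope * x + intercept for x in X]             # ~6.2, 1.6, 5.3, 3.45, 3.45
residuals = [y - y_hat for y, y_hat in zip(Y, predicted)]  # ~-0.20, 0.40, 0.70, 0.55, -1.45

SS_error = sum(res ** 2 for res in residuals)
print(round(sum(residuals), 2))   # 0.0 (the residuals sum to zero)
print(round(SS_error, 2))         # ~3.1 (3.09 on the slides, which round the predicted values first)
```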
Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • Proportionate reduction in error = (SSY − SSerror) / SSY = (16.0 − 3.09) / 16.0 ≈ 0.81 • Like r2, this represents the percent variance in Y accounted for by X • In fact, it is mathematically identical to r2
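A self-contained sketch of the proportionate reduction in error, which comes out equal to r squared for these data:

```python
# Sketch: proportionate reduction in error equals r squared.

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
slope, intercept = 0.92, 0.688

mean_y = sum(Y) / len(Y)
SS_Y = sum((y - mean_y) ** 2 for y in Y)                                   # 16.0
SS_error = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(X, Y))   # ~3.1

PRE = (SS_Y - SS_error) / SS_Y
print(round(PRE, 2))   # ~0.81, the same as r**2 for these data (r ~ 0.90)
```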
Seeing patterns in the error • Residual plots • The sum of the residuals should always equal 0 (as should their mean) • The least squares regression line splits the data in half: half of the error is above the line and half is below the line • In addition to summing to zero, we also want the residuals to be randomly distributed • That is, there should be no pattern to the residuals • If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables • Residual plots are very useful tools to examine the relationship even further • These are basically scatterplots of the residuals against the explanatory (X) variable (note: the examples actually plot residuals that have been transformed into z-scores); a minimal plotting sketch follows
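A minimal matplotlib sketch of a residual plot for the example data (this assumes matplotlib is available; the course itself uses SPSS for these plots):

```python
# Sketch: plot residuals against the predictor to look for patterns.
import matplotlib.pyplot as plt

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
slope, intercept = 0.92, 0.688

residuals = [y - (slope * x + intercept) for x, y in zip(X, Y)]

plt.scatter(X, residuals)
plt.axhline(0, linestyle="--")   # residuals should scatter randomly around this line
plt.xlabel("X (predictor)")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()
```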
Seeing patterns in the error • [Scatterplot and residual plot, side by side] • The scatterplot shows a nice linear relationship • The residual plot shows that the residuals fall randomly above and below the line; critically, there doesn’t seem to be a discernible pattern to the residuals
Seeing patterns in the error • [Scatterplot and residual plot, side by side] • The scatterplot again shows a nice linear relationship • The residual plot, however, shows that the residuals get larger as X increases • This suggests that the variability around the line is not constant across values of X • This is referred to as a violation of homogeneity of variance
Seeing patterns in the error • [Scatterplot and residual plot, side by side] • The scatterplot shows what may be a linear relationship • The residual plot suggests that a non-linear relationship may be more appropriate (note the curved pattern that appears in the residual plot)
Regression in SPSS • Running the analysis is pretty easy: Analyze > Regression > Linear • Predictor variables go into the “independent variable” field • The predicted variable goes into the “dependent variable” field • You get a lot of output
Regression in SPSS • The output reports the variables in the model, r, and r2 (we’ll get back to these numbers in a few weeks) • Unstandardized coefficients: the slope (labeled with the predictor variable’s name) and the intercept (labeled “Constant”) • Standardized coefficients
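The course uses SPSS; for readers working outside SPSS, a roughly analogous run in Python with the statsmodels library (an alternative tool, not part of the course materials) produces the same pieces of output for the example data:

```python
# Sketch: a bivariate regression run with statsmodels, reporting the
# intercept, slope, and r-squared, similar to the SPSS output described above.
import statsmodels.api as sm

X = [6, 1, 5, 3, 3]   # predictor (independent variable)
Y = [6, 2, 6, 4, 2]   # criterion (dependent variable)

design = sm.add_constant(X)     # adds the intercept column (the "constant")
model = sm.OLS(Y, design).fit()

print(model.params)      # intercept (~0.68) and slope (~0.92)
print(model.rsquared)    # ~0.81
print(model.summary())   # full output table, loosely comparable to SPSS output
```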
Multiple Regression • Multiple regression prediction models use several predictors at once: predicted Y = intercept + b1(X1) + b2(X2) + … + bk(Xk) • The predicted portion of the equation is the “fit”; the difference between the observed and predicted scores is the “residual”
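A minimal sketch of a multiple regression fit using numpy's least-squares solver; the second predictor X2 here is made up purely for illustration and is not part of the lecture's dataset:

```python
# Sketch: multiple regression, predicted Y = a + b1*X1 + b2*X2,
# fit by ordinary least squares with numpy.
import numpy as np

X1 = np.array([6, 1, 5, 3, 3])   # first predictor (the lecture's X)
X2 = np.array([2, 5, 1, 4, 3])   # hypothetical second predictor
Y  = np.array([6, 2, 6, 4, 2])

# Design matrix: a column of 1s for the intercept, then the predictors.
design = np.column_stack([np.ones(len(X1)), X1, X2])
coefs, ss_res, rank, sv = np.linalg.lstsq(design, Y, rcond=None)

a, b1, b2 = coefs
fit = design @ coefs    # the "fit": predicted scores
residuals = Y - fit     # the "residuals": observed minus predicted
print(a, b1, b2)
```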
Prediction in Research Articles • Bivariate prediction models rarely reported • Multiple regression results commonly reported