
Statistics for the Social Sciences


Presentation Transcript


  1. Statistics for the Social Sciences Psychology 340 Fall 2006 Prediction

  2. Outline (for week) • Simple bivariate regression, least-squares fit line • The general linear model • Residual plots • Using SPSS • Multiple regression • Comparing models (Δr²) • Using SPSS

  3. Regression • Last time: with correlation, we examined whether variables X & Y are related • This time: with regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.

  4. Regression • Last time: “it doesn’t matter which variable goes on the X-axis or the Y-axis” • For regression this is NOT the case • The variable that you are predicting goes on the Y-axis (the criterion variable, e.g., quiz performance) • The variable that you are making the prediction based on goes on the X-axis (the predictor variable, e.g., hours of study)
  [Figure: scatterplot of quiz performance (Y) against hours of study (X)]

  5. Regression • Last time: “imagine a line through the points” • But there are lots of possible lines • One line is the “best fitting line” • Today: learn how to compute the equation corresponding to this “best fitting line”
  [Figure: scatterplot of quiz performance (Y) against hours of study (X)]

  6. The equation for a line • A brief review of geometry • Y = (X)(slope) + (intercept) • The intercept is the value of Y when X = 0 (here, 2.0)
  [Figure: a line crossing the Y-axis at 2.0]

  7. The equation for a line • A brief review of geometry • Y = (X)(slope) + (intercept) • slope = change in Y / change in X (here, Y rises 1 for every 2 units of X, so the slope is 0.5)
  [Figure: the same line, with the change in Y and change in X marked]

  8. The equation for a line • A brief review of geometry • Y = (X)(slope) + (intercept) • Y = (X)(0.5) + 2.0
  [Figure: the line Y = 0.5X + 2.0]

  9. Regression • A brief review of geometry • Consider a perfect correlation • Y = (X)(0.5) + 2.0 • If X = 5, then Y = (5)(0.5) + 2.0 = 2.5 + 2.0 = 4.5 • We can make specific predictions about Y based on X
  [Figure: reading the predicted value 4.5 off the line at X = 5]
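To make slides 6-9 concrete, here is a minimal Python sketch of the line equation and the prediction it gives at X = 5 (the function name is ours, not from the slides):

    # Line from slides 6-9: Y = (X)(slope) + (intercept)
    def predict_y(x, slope=0.5, intercept=2.0):
        # Return the predicted Y for a given X on the line
        return slope * x + intercept

    print(predict_y(5))  # 0.5 * 5 + 2.0 = 4.5, matching slide 9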

  10. Regression • Consider a less than perfect correlation • The line still represents the predicted values of Y given X • If X = 5, then Y = (5)(0.5) + 2.0 = 4.5
  [Figure: points scattered around the line, with the prediction 4.5 at X = 5]

  11. Regression • The “best fitting line” is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points) • Rather than compare the errors from different lines and pick the best, we will directly compute the equation for the best fitting line
  [Figure: scatterplot with a candidate best fitting line]

  12. Regression • The linear model: Y = intercept + slope(X) + error • The betas (β) are sometimes called parameters • They come in two types: standardized and unstandardized • Now let’s go through an example computing these things

  13. Scatterplot • Using the dataset from our correlation lecture:
      X: 6  1  5  3  3
      Y: 6  2  6  4  2
  [Figure: scatterplot of the five (X, Y) points]

  14. From the Computing Pearson’s r lecture:
      X    Y    X-X̄     (X-X̄)²   Y-Ȳ     (Y-Ȳ)²   (X-X̄)(Y-Ȳ)
      6    6     2.4     5.76     2.0     4.0      4.8
      1    2    -2.6     6.76    -2.0     4.0      5.2
      5    6     1.4     1.96     2.0     4.0      2.8
      3    4    -0.6     0.36     0.0     0.0      0.0
      3    2    -0.6     0.36    -2.0     4.0      1.2
      mean: 3.6, 4.0   sums: SSX = 15.20, SSY = 16.0, SP = 14.0

  15. Computing regression line (with raw scores) • Using SSX = 15.20, SSY = 16.0, SP = 14.0, and the means (3.6, 4.0) • slope: b = SP / SSX = 14.0 / 15.20 ≈ 0.92 • intercept: a = Ȳ - b(X̄) = 4.0 - (0.92)(3.6) = 0.688 • So the regression line is Ŷ = (0.92)(X) + 0.688
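As a cross-check of slides 14-15, here is a Python sketch that computes SSX, SSY, SP, and the least-squares slope and intercept from the five (X, Y) pairs (variable names are ours):

    X = [6, 1, 5, 3, 3]
    Y = [6, 2, 6, 4, 2]

    mean_x = sum(X) / len(X)                  # 3.6
    mean_y = sum(Y) / len(Y)                  # 4.0
    SSX = sum((x - mean_x) ** 2 for x in X)   # 15.20
    SSY = sum((y - mean_y) ** 2 for y in Y)   # 16.0
    SP = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # 14.0

    b = SP / SSX             # slope, ~0.92
    a = mean_y - b * mean_x  # intercept, ~0.69 (0.688 if b is rounded to 0.92 first)
    print(b, a)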

  16. Computing regression line (with raw scores)
  [Figure: the five data points (X: 6, 1, 5, 3, 3; Y: 6, 2, 6, 4, 2) with the fitted line Ŷ = 0.92X + 0.688]

  17. Computing regression line (with raw scores) • The two means will be on the line: the fitted line always passes through the point of means (X̄, Ȳ) = (3.6, 4.0)
  [Figure: the fitted line passing through the point (3.6, 4.0)]

  18. Computing regression line (standardized, using z-scores) • Sometimes the regression equation is standardized • Computed based on z-scores rather than raw scores:
      X    Y    ZX      ZY
      6    6     1.38    1.1
      1    2    -1.49   -1.1
      5    6     0.80    1.1
      3    4    -0.34    0.0
      3    2    -0.34   -1.1
      mean: 3.6, 4.0 (z-score means: 0.0, 0.0)   std dev: 1.74, 1.79

  19. Computing regression line (standardized, using z-scores) • Sometimes the regression equation is standardized, computed based on z-scores rather than raw scores • Prediction model: the predicted Z score (on the criterion variable) = the standardized regression coefficient multiplied by the Z score on the predictor variable • Formula: ẐY = (β)(ZX) • The standardized regression coefficient (β): in bivariate prediction, β = r
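Here is a sketch of the standardized version from slides 18-19, computing z-scores with population standard deviations (dividing by N, which is what the slides' values of 1.74 and 1.79 imply) and confirming that β = r in the bivariate case:

    import math

    X = [6, 1, 5, 3, 3]
    Y = [6, 2, 6, 4, 2]
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in X) / n)  # ~1.74
    sdy = math.sqrt(sum((y - my) ** 2 for y in Y) / n)  # ~1.79

    zx = [(x - mx) / sdx for x in X]
    zy = [(y - my) / sdy for y in Y]

    # In bivariate prediction the standardized coefficient (beta) equals r
    beta = sum(a * b for a, b in zip(zx, zy)) / n  # r, ~0.90
    z_pred = [beta * z for z in zx]                # predicted z-scores on Y
    print(beta, z_pred)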

  20. Computing regression line (with z-scores)
  [Figure: scatterplot of the z-scores (ZX: 1.38, -1.49, 0.80, -0.34, -0.34; ZY: 1.1, -1.1, 1.1, 0.0, -1.1) with the standardized regression line passing through the origin, the mean of (0, 0)]

  21. Regression • The linear equation isn’t the whole thing: Y = intercept + slope(X) + error • We also need a measure of error • Y = X(0.5) + 2.0 + error • The same line can describe different relationships (a difference in strength)
  [Figure: two scatterplots sharing the same line Y = 0.5X + 2.0 but differing in how tightly the points cluster around it]

  22. Regression • Error: the actual score minus the predicted score • Measures of error: • r² (r-squared) • Proportionate reduction in error • Squared error using the prediction model = sum of the squared residuals = SSresidual = SSerror • Note: total squared error when predicting from the mean = SStotal = SSY

  23. R-squared • r² represents the percent variance in Y accounted for by X • r = 0.8 → r² = 0.64: 64% variance explained • r = 0.5 → r² = 0.25: 25% variance explained
  [Figure: two scatterplots illustrating the stronger (r = 0.8) and weaker (r = 0.5) relationships]

  24. Computing Error around the line • Compute the difference between the predicted values and the observed values (the “residuals”) • Square the differences • Add up the squared differences • Sum of the squared residuals = SSresidual = SSerror
  [Figure: scatterplot showing the vertical distances from the points to the line]

  25. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • First we need the predicted values of Y (the points on the line) for each X
      X: 6 1 5 3 3   Y: 6 2 6 4 2   means: 3.6, 4.0

  26. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • Predicted value for X = 6: Ŷ = (0.92)(6) + 0.688 = 6.2

  27. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • Predicted values (points on the line):
      Ŷ = (0.92)(6) + 0.688 = 6.2
      Ŷ = (0.92)(1) + 0.688 = 1.6
      Ŷ = (0.92)(5) + 0.688 = 5.3
      Ŷ = (0.92)(3) + 0.688 = 3.45
      Ŷ = (0.92)(3) + 0.688 = 3.45

  28. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror
      X    Y    Ŷ
      6    6    6.2
      1    2    1.6
      5    6    5.3
      3    4    3.45
      3    2    3.45
  [Figure: the predicted values 1.6, 3.45, 5.3, and 6.2 plotted on the regression line]

  29. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror • Residuals:
      X    Y    Ŷ       residual (Y - Ŷ)
      6    6    6.2     6 - 6.2  = -0.20
      1    2    1.6     2 - 1.6  =  0.40
      5    6    5.3     6 - 5.3  =  0.70
      3    4    3.45    4 - 3.45 =  0.55
      3    2    3.45    2 - 3.45 = -1.45
      mean: 3.6, 4.0   sum of residuals: 0.00 (quick check)
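A short sketch of the residual computation in slides 26-29, using the rounded slope and intercept from the slides (b = 0.92, a = 0.688):

    X = [6, 1, 5, 3, 3]
    Y = [6, 2, 6, 4, 2]

    Y_hat = [0.92 * x + 0.688 for x in X]            # 6.2, 1.6, 5.3, 3.45, 3.45
    residuals = [y - yh for y, yh in zip(Y, Y_hat)]  # -0.20, 0.40, 0.70, 0.55, -1.45
    print(round(sum(residuals), 2))                  # 0.0: residuals sum to zero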

  30. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror
      X    Y    Ŷ       residual   residual²
      6    6    6.2     -0.20      0.04
      1    2    1.6      0.40      0.16
      5    6    5.3      0.70      0.49
      3    4    3.45     0.55      0.30
      3    2    3.45    -1.45      2.10
      mean: 3.6, 4.0   sum of residuals: 0.00   SSERROR = 3.09

  31. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror
      X    Y    Ŷ       residual   residual²   (Y-Ȳ)²
      6    6    6.2     -0.20      0.04        4.0
      1    2    1.6      0.40      0.16        4.0
      5    6    5.3      0.70      0.49        4.0
      3    4    3.45     0.55      0.30        0.0
      3    2    3.45    -1.45      2.10        4.0
      mean: 3.6, 4.0   SSERROR = 3.09   SSY = 16.0

  32. Computing Error around the line • Sum of the squared residuals = SSresidual = SSerror = 3.09, with SSY = 16.0 • Proportionate reduction in error = (SSY - SSERROR) / SSY = (16.0 - 3.09) / 16.0 ≈ 0.81 • This also (like r²) represents the percent variance in Y accounted for by X • In fact, it is mathematically identical to r²
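And the proportionate reduction in error from slides 30-32, again with the rounded line from the slides (small rounding differences from 3.09 are expected):

    X = [6, 1, 5, 3, 3]
    Y = [6, 2, 6, 4, 2]
    mean_y = sum(Y) / len(Y)

    residuals = [y - (0.92 * x + 0.688) for x, y in zip(X, Y)]
    SS_error = sum(e ** 2 for e in residuals)  # ~3.1 (3.09 with the slides' rounding)
    SS_Y = sum((y - mean_y) ** 2 for y in Y)   # 16.0

    PRE = (SS_Y - SS_error) / SS_Y             # ~0.81, identical to r**2
    print(SS_error, PRE)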

  33. Seeing patterns in the error • Residual plots • The sum of the residuals should always equal 0 (as should their mean): the least squares regression line splits the data in half, so half of the error is above the line and half is below • In addition to summing to zero, we also want the residuals to be randomly distributed • That is, there should be no pattern to the residuals • If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables • Residual plots are very useful tools for examining the relationship further • These are basically scatterplots of the residuals against the explanatory (X) variable (note: the following examples actually plot residuals that have been transformed into z-scores)
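The slides build residual plots in SPSS; as an illustration only, here is how the same plot could be drawn with matplotlib (assuming it is installed):

    import matplotlib.pyplot as plt

    X = [6, 1, 5, 3, 3]
    Y = [6, 2, 6, 4, 2]
    residuals = [y - (0.92 * x + 0.688) for x, y in zip(X, Y)]

    plt.scatter(X, residuals)  # residuals against the explanatory variable
    plt.axhline(0)             # they should scatter randomly around zero
    plt.xlabel("X (predictor)")
    plt.ylabel("Residual")
    plt.show()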

  34. Seeing patterns in the error • Scatterplot: shows a nice linear relationship • Residual plot: the residuals fall randomly above and below the line • Critically, there doesn’t seem to be a discernible pattern to the residuals

  35. Seeing patterns in the error • Scatterplot: also shows a nice linear relationship • Residual plot: the residuals get larger as X increases • This suggests that the variability around the line is not constant across values of X • This is referred to as a violation of homogeneity of variance

  36. Seeing patterns in the error • Scatterplot: shows what may be a linear relationship • Residual plot: suggests that a non-linear relationship may be more appropriate (note how a curved pattern appears in the residual plot)

  37. Regression in SPSS • Running the analysis is pretty easy • Analyze → Regression → Linear • Predictor variables go into the ‘independent variable’ field • The predicted variable goes into the ‘dependent variable’ field • You get a lot of output

  38. Regression in SPSS • The output includes: • The variables in the model • r and r² (we’ll get back to the other numbers in a few weeks) • Unstandardized coefficients: the slope (labeled with the independent variable’s name) and the intercept (labeled “Constant”) • Standardized coefficients
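If SPSS is not at hand, the same numbers can be checked in Python with SciPy’s linregress (a stand-in for the SPSS output, not part of the lecture):

    from scipy.stats import linregress

    X = [6, 1, 5, 3, 3]
    Y = [6, 2, 6, 4, 2]

    result = linregress(X, Y)
    print(result.slope)      # unstandardized slope, ~0.92
    print(result.intercept)  # the constant (intercept), ~0.69
    print(result.rvalue)     # r, ~0.90; r**2 is ~0.81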

  39. Multiple Regression • Multiple regression prediction models include several predictors: Y = intercept + slope1(X1) + slope2(X2) + ... + error • The intercept-and-slopes part of the equation is the “fit”; the error term is the “residual”
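A minimal multiple-regression sketch with two predictors, using made-up data for the second predictor just to show the fit-plus-residual decomposition (all names and values here are illustrative, not from the slides):

    import numpy as np

    X1 = np.array([6, 1, 5, 3, 3], dtype=float)
    X2 = np.array([2, 4, 1, 3, 5], dtype=float)  # hypothetical second predictor
    Y = np.array([6, 2, 6, 4, 2], dtype=float)

    # Design matrix: a column of 1s (intercept) plus one column per predictor
    A = np.column_stack([np.ones_like(X1), X1, X2])
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)  # [intercept, b1, b2]

    fit = A @ coeffs    # the "fit" part of the model
    residual = Y - fit  # the "residual" part
    print(coeffs, residual)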

  40. Prediction in Research Articles • Bivariate prediction models rarely reported • Multiple regression results commonly reported
