
Simple Linear Regression




  1. Simple Linear Regression Key Points about Statistical Test Visualizing Regression Analysis Sample Homework Problem Solving the Problem with SPSS Logic for Simple Linear Regression

  2. Regression Analysis • Regression analysis is the generic term for several statistical tests for evaluating the relationship between interval level dependent and independent variables. • When we are considering the relationship between one dependent variable and one independent variable, we use Simple Linear Regression. • When we are considering the relationship between one dependent variable and more than one independent variable, we use Multiple Regression. • SPSS uses the same procedure for both Simple Linear Regression and Multiple Regression, which adds some complications to our interpretation.

  3. Purpose of Simple Linear Regression - 1 • The purpose of simple linear regression analysis is to answer three questions that have been identified as requirements for understanding the relationship between an independent and a dependent variable: • Is there a relationship between the two variables? • How strong is the relationship (e.g. trivial, weak, or strong; how much does it reduce error)? • What is the direction of the relationship (are high scores on one variable associated with high or with low scores on the other)?

  4. Purpose of Simple Linear Regression - 2 • The question of the existence of a relationship between the variables is answered by the hypothesis test in regression analysis. • The strength of the relationship is based on interpretation of the correlation coefficient, r (as trivial, small, medium, large) and/or the coefficient of determination, r-squared (as the proportion by which error was reduced or accuracy was improved). • The question of the direction of the relationship is based on the interpretation of the sign of the b coefficient or the beta coefficient.

  5. Simple Linear Regression: Examples • There is a relationship between undergraduate GPA’s and graduate GPA’s. • GRE scores are a useful predictor of graduate GPA’s. • For social work students, the relationship between GPA and future income enables us to predict future earnings based on academic performance.

  6. Simple Linear Regression - 1 • When we studied measures of central tendency, we showed that the best measure of central tendency for an interval level variable (assuming it is not badly skewed) was the mean. • When we used the mean as the estimated score for all cases in the distribution, the total error computed for all of the cases was smaller than the error would be for any other value used for the estimate. • Error was measured as the deviation or difference between the mean and the score for each case, squared and summed.

  7. Simple Linear Regression - 2 • Simple linear regression tests the existence of a relationship between an independent and a dependent variable by determining whether or not there is a statistically significant reduction in total error if we use the scores on the independent variable to estimate the scores on the dependent variable. • Regression analysis finds the equation or formula for the straight line that minimizes the total error. • The regression equation takes the algebraic form for a straight line: y = a + bx, where y is the dependent variable, x is the independent variable, b is the slope of the line, and a is the point at which the line crosses the y axis.

  8. The Regression Equation The regression equation is the algebraic formula for the regression line, which states the mathematical relationship between the independent and the dependent variable. We can use the regression line to estimate the value of the dependent variable for any value of the independent variable. The stronger the relationship between the independent and dependent variables, the closer these estimates will come to the actual score that each case had on the dependent variable.

  9. Components of the Regression Equation • The regression equation has two components. • The first component is a number called the y-intercept that defines where the regression line crosses the vertical y axis. • The second component is called the slope of the line, and is a number that multiplies the value of the independent variable. • These two elements are combined in the general form for the regression equation: • the estimated score on the dependent variable = the y-intercept + the slope × the score on the independent variable

  10. The Standard Form of the Regression Equation • The standard form for the regression equation or formula is: Y = a + bX • where • Y is the estimated score for the dependent variable • X is the score for the independent variable • b is the slope of the regression line, or the multiplier of X • a is the intercept, or the point where the regression line crosses the vertical y-axis

  11. Depicting the Regression Equation The regression equation includes both the y-intercept and the slope of the line. The y-intercept is 1.0 and the slope is 0.5. The y-intercept is the point on the vertical y-axis where the regression line crosses the axis, i.e. 1.0. The slope is the multiplier of x. It is the amount of change in y for a change of one unit in x. If x changes one unit from 2.0 to 3.0, depicted by the blue arrow, y will change by 0.5 units, from 2.0 to 2.5 as depicted by the red arrow.
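The example above (intercept 1.0, slope 0.5) can be sketched in a few lines of Python; this is an added illustration, not part of the original slides or the SPSS workflow:

```python
# Sketch of the slide's example line: y = 1.0 + 0.5 * x.
def predict(x, a=1.0, b=0.5):
    """Estimated y for a given x using y = a + b * x."""
    return a + b * x

# A one-unit change in x changes the estimate by the slope:
# x = 2.0 gives y = 2.0, and x = 3.0 gives y = 2.5.
print(predict(2.0), predict(3.0))  # 2.0 2.5
```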

  12. Deriving the Regression Equation - 1 • In this plot, none of the points fall on the regression line. • The difference between the actual value for the dependent variable and the predicted value for each point is shown by the red lines. These differences are called the residuals, and represent the errors between the actual and predicted values.

  13. Deriving the Regression Equation - 2 • The regression equation is computed to minimize the total amount of error in predicting values for the dependent variable. The method for deriving the equation is called the "method of least squares," meaning that the regression line minimizes the sum of the squared residuals, or errors between actual and predicted values.
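The method of least squares described above has a closed-form solution: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A minimal Python sketch, with made-up data (not from the GSS data set):

```python
# Least-squares sketch with made-up illustrative data.
def least_squares(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # the fitted line passes through (mean of x, mean of y)
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
a, b = least_squares(xs, ys)  # a ≈ 1.09, b ≈ 0.97
```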

  14. Interpreting the Regression Equation: the Intercept • The intercept is the point on the vertical axis where the regression line crosses the axis. It is the predicted value for the dependent variable when the independent variable has a value of zero. • This may or may not be useful information depending on the context of the problem.

  15. Interpreting the Regression Equation: the Slope • The slope is interpreted as the amount of change in the predicted value of the dependent variable associated with a one unit change in the value of the independent variable. • If the slope has a negative sign, the direction of the relationship is negative or inverse, meaning that the scores on the two variables move in opposite directions. • If the slope has a positive sign, the direction of the relationship is positive or direct, meaning that the scores on the two variables move in the same direction.

  16. Interpreting the Regression Equation: the Slope equals 0 • If there is no relationship between two variables, the slope of the regression line is zero and the regression line is parallel to the horizontal axis. • A slope of zero means that the predicted value of the dependent variable will not change, no matter what value of the independent variable is used. • If there is no relationship, using the regression equation to predict values of the dependent variable is no improvement over using the mean of the dependent variable.

  17. Simple Linear Regression: Hypotheses • The hypothesis tested in simple linear regression is based on the slope or angle of the regression line. • Hypotheses: • Null: the slope of the regression line as measured by the b coefficient = 0, i.e. there is no relationship • Research: the slope of the regression line as measured by the b coefficient ≠ 0, i.e. there is a relationship • The b coefficient is tested with a two-tailed t-test • Decision: • Reject the null hypothesis if the p-value reported by SPSS ≤ alpha
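The t-test of the b coefficient can be sketched by hand with toy data (a Python illustration; SPSS reports the corresponding two-tailed significance in the Coefficients table):

```python
import math

# Toy data (not from the GSS data set) to illustrate the t-test of the slope.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx  # slope
a = my - b * mx                                             # intercept
# Standard error of the slope, from the residual sum of squares.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
t = b / se_b  # compare to a t distribution with n - 2 degrees of freedom
```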

  18. Simple Linear Regression: Level of Measurement • Dependent variable is interval level (ordinal with caution) • Independent variable is interval level (ordinal with caution) or dichotomous

  19. Simple Linear Regression: Sample Size Requirements - 1 • In previous semesters, the rule of thumb for required sample size that I have used was a minimum of 5 cases for each independent variable included in the analysis, and preferably 15 cases for each independent variable. This rule was based on the text Multivariate Data Analysis by Hair, Black, Babin, Anderson, and Tatham. • Since incorporating more material on power analysis, I have found that rule inadequate: with so few cases we are unlikely to achieve statistical significance in all but the simplest problems that contain very strong relationships.

  20. Simple Linear Regression: Sample Size Requirements - 2 • In Using Multivariate Statistics, Tabachnick and Fidell recommend that the required number of cases should be the larger of the number of independent variables × 8 + 50 or the number of independent variables + 105. • Following this rule, simple linear regression with one independent variable would require a sample of 106 cases.
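The Tabachnick and Fidell rule as stated above reduces to one line of Python (m is the number of independent variables):

```python
# Required sample size: the larger of 8m + 50 and m + 105 (rule as stated on the slide).
def required_n(m):
    return max(8 * m + 50, m + 105)

print(required_n(1))  # 106 cases for simple linear regression
```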

  21. Simple Linear Regression: Assumptions • The relationship between the variables is linear • The residuals (errors) have the same variance for all values of the independent variable • The residuals (errors) are independent, i.e. not correlated from one case to the next • The residuals (errors) are normally distributed • We will defer the evaluation of assumptions until the next class.

  22. Simple Linear Regression: APA Style • The t-test for the Beta coefficient: β = -.34, t(225) = 6.53, p < .01 • Example: Social support significantly predicted depression scores, β = -.34, t(225) = 6.53, p < .01. Social support also explained a significant proportion of variance in depression scores, R² = .12, F(1, 225) = 42.64, p < .01. • The report gives the beta coefficient, the value of the t-statistic, the significance of the t-statistic, and the degrees of freedom for the t-test (which are not in the SPSS output).

  23. Visualizing Regression Analysis - 1 • While we will base our problem solving on numeric statistical results computed by SPSS, we can use a scatterplot to demonstrate regression graphically. • We will use the variable "highest year of school completed" [educ] as the independent variable and "occupational prestige score" [prestg80] as the dependent variable from the GSS2000R data set to demonstrate the relationship graphically.

  24. Visualizing Regression Analysis - 2 A scatterplot of prestg80 by educ produced by SPSS. The dependent variable is plotted on the y-axis, or the vertical axis. The dots in the body of the chart represent the cases in the distribution. The independent variable is plotted on the x-axis, or the horizontal axis.

  25. Visualizing Regression Analysis - 3 I have drawn a green horizontal line through the mean of prestg80 (44.17). The differences between the mean line and the dots (shown as pink lines) are the deviations. The sum of the squared deviations is the measure of total error when the mean is used as the estimated score for each case. NOTE: the plots were created in SPSS by adding features to the default plot.

  26. Visualizing Regression Analysis - 4 A regression line and the regression equation are added in red to the scatterplot. The pink deviations from the mean have been replaced with the orange deviations from the regression line. Deviations between cases and the regression line are called residuals.

  27. Visualizing Regression Analysis - 5 The existence of a relationship between the variables is supported when the sum of the squared orange residuals is significantly less than the sum of the squared pink deviations. Recall that both deviations and residuals can be referred to as errors. If there is a relationship, we can characterize it as a reduction in error.

  28. Visualizing Regression Analysis – 6 While it is difficult for us to square and sum deviations and residuals by hand, SPSS regression output provides us with the answer. The sum of the squared orange residuals from the regression line is the Residual Sum of Squares in the ANOVA table (37086.80). The sum of the squared pink deviations from the mean is the Total Sum of Squares in the ANOVA table (49104.91).

  29. Visualizing Regression Analysis – 7 The difference between the Total Sum of Squares and the Residual Sum of Squares is the Regression Sum of Squares. The Regression Sum of Squares is the amount of error that can be eliminated by using the regression equation to estimate values of prestg80 instead of the mean of prestg80. The Regression Sum of Squares in the ANOVA table is 12018.11.

  30. Visualizing Regression Analysis – 8 We can compute the proportion of error that was reduced by the regression by dividing the Regression Sum of Squares by the Total Sum of Squares: 12018.11 ÷ 49104.91 = 0.245
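The arithmetic above can be checked directly, using the sums of squares the slides report from the SPSS ANOVA table:

```python
# Sums of squares reported on the slides for prestg80 regressed on educ.
ss_total = 49104.91     # Total Sum of Squares (squared deviations from the mean)
ss_residual = 37086.80  # Residual Sum of Squares (squared deviations from the line)
ss_regression = ss_total - ss_residual   # error eliminated by the regression
r_squared = ss_regression / ss_total     # proportional reduction in error
print(round(ss_regression, 2), round(r_squared, 3))  # 12018.11 0.245
```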

  31. Visualizing Regression Analysis – 9 • The reduction in error that we computed (0.245) is equal to the R Square that SPSS provides in the Model Summary table. • R² is the coefficient of determination which is usually characterized as: • the proportion of variance in the dependent variable explained by the independent variable, or • the reduction in error (or increase in accuracy). In multiple regression, the symbol for coefficient of determination is R². In simple linear regression, the symbol is r².

  32. Visualizing Regression Analysis – 10 The correlation coefficient, Multiple R, is the positive square root of R Square. This can be misleading in Simple Linear Regression when the correlation for the relationship between the two variables, r, can have a negative sign for an inverse relationship. Aside from the direction of the relationship, the value of Multiple R will be the same as the value for r in Simple Linear Regression.

  33. Visualizing Regression Analysis – 11 The ANOVA table tests the null hypothesis that R² = 0, i.e. the reduction in error associated with the regression is zero. The test of this hypothesis is reported for Multiple Regression as a test of an overall relationship between the dependent variable and the independent variables. In Simple Linear Regression, we usually report the hypothesis test that the slope = 0, though we would reach the same conclusion no matter which test we report.

  34. Visualizing Regression Analysis – 12 The test of the null hypothesis that the slope of the regression line (b coefficient) = 0 is reported in the Coefficients table. Note that the significance of the t-test is the same as the significance of the F-test. Furthermore, in simple linear regression, the value of the F-statistic (81.662) is the same as the square of the t-statistic (9.037).
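The identity mentioned above, F = t², can be checked against the slide's numbers (allowing for rounding in the SPSS output):

```python
# In simple linear regression the overall F-test and the t-test of the
# slope are equivalent: F equals t squared.
t_stat = 9.037   # t-statistic for the slope, from the Coefficients table
f_stat = 81.662  # F-statistic, from the ANOVA table
assert abs(t_stat ** 2 - f_stat) < 0.01  # equal up to rounding
```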

  35. Visualizing Regression Analysis - 13 We can depict the hypothesis test visually. The null hypothesis for simple linear regression is that the slope of the regression line is zero. The slope of the green mean line is zero, so under the null hypothesis the red regression line would be equal to the green mean line. In this example, the red regression line is obviously different from the green mean line, which is verified by the value of the slope in the regression equation (2.36) and the t-test of the B coefficient.

  36. Visualizing Regression Analysis – 14 The regression equation is based on the Unstandardized Coefficients(B) in the table of Coefficients. The B coefficient labeled (Constant) is the intercept. The B coefficient for the variable educ is the slope of the regression line. The regression equation for the relationship between prestg80 and educ is: prestg80 = 12.928 + 2.359 x educ

  37. Visualizing Regression Analysis – 15 The Standardized Coefficients(Beta) in the table of Coefficients are the regression coefficients for the relationship between the standardized dependent variable (z-scores) and the standardized independent variable (z-scores). Since standardizing variables removes the unit of measurement from the coefficients, we can compare the Beta coefficients to interpret the relative importance of each independent variable in Multiple Regression. In Simple Linear Regression, Beta will be equal to r, the correlation coefficient. Multiple R, r, and Beta all have the same numeric value, though Multiple R will be positive even when r and Beta are negative.
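The claim that Beta equals r in simple linear regression can be sketched with toy data: z-score both variables, then the slope of the standardized regression equals Pearson's r. This is an added Python illustration, not SPSS output:

```python
import math

def zscores(vals):
    """Standardize to mean 0, standard deviation 1 (sample sd, n - 1)."""
    n = len(vals)
    m = sum(vals) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in vals) / (n - 1))
    return [(v - m) / sd for v in vals]

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

xs = [1, 2, 3, 4, 5]            # toy data, not from the GSS data set
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
beta = slope(zscores(xs), zscores(ys))  # standardized slope (Beta)
# beta matches pearson_r(xs, ys) up to floating-point error
```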

  38. Visualizing Regression Analysis – 16 The sign of the Beta coefficient, as well as the sign of the B coefficient, tells us the direction of the relationship. If the coefficients are positive, the relationship is characterized as direct or positive, meaning that higher values of the dependent variable are associated with higher values of the independent variables. If the coefficients are negative, the relationship is characterized as inverse or negative, meaning that lower values of the dependent variable are associated with higher values of the independent variables.

  39. Visualizing Regression Analysis - 17 The regression line represents the estimated value of prestg80 for every value of educ. To obtain the estimate, we draw a vertical line from the value on the x-axis to the point where it intersects the regression line. We then draw a horizontal line from the intersection point to the y-axis. The intersection point on the y-axis is the estimated value for the dependent variable.

  40. Visualizing Regression Analysis - 18 If we draw a vertical line from the educ value of 5 to the regression line and then a horizontal line to the vertical axis, we see that the estimated value for prestg80 is about 25. We can compute the exact value by substituting in the regression equation: prestg80 = 12.93 + 2.36 x 5 = 24.73

  41. Visualizing Regression Analysis - 19 If we draw a vertical line from the educ value of 15 to the regression line and then a horizontal line to the vertical axis, we see that the estimated value for prestg80 is about 50. We can compute the exact value by substituting in the regression equation: prestg80 = 12.93 + 2.36 x 15 = 48.33
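The two predictions above can be reproduced by substituting into the regression equation (coefficients rounded as on the slides):

```python
# Regression equation from the slides: prestg80 = 12.93 + 2.36 * educ.
def predict_prestige(educ, a=12.93, b=2.36):
    return a + b * educ

print(round(predict_prestige(5), 2))   # 24.73
print(round(predict_prestige(15), 2))  # 48.33
```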

  42. Sample homework problem: Simple linear regression Based on information from the data set GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Assume that the assumptions of linear regression are satisfied. Use .05 for alpha. Simple linear regression revealed a strong, positive relationship between "highest academic degree" [degree] and "occupational prestige score" [prestg80] (β = 0.546, t(250) = 10.30, p < .001). Survey respondents who had higher academic degrees had more prestigious occupations. The accuracy of predicting scores for the dependent variable "occupational prestige score" will improve by approximately 30% if the prediction is based on scores for the independent variable "highest academic degree" (r² = 0.298). • True • True with caution • False • Incorrect application of a statistic This is the general framework for the problems in the homework assignment on simple linear regression problems. The description is similar to findings one might state in a research article.

  43. Sample homework problem: Data set and alpha The problem statement and answer choices are repeated from the previous slide. • The first paragraph identifies: • The data set to use, e.g. GSS2000R.Sav • The alpha level for the hypothesis test

  44. Sample homework problem: Specifications for the test - 1 The problem statement and answer choices are repeated from the previous slide. • The second paragraph states the finding that we want to verify with a simple linear regression. The finding identifies: • The independent variable • The dependent variable • The strength of the relationship • The direction of the relationship

  45. Sample homework problem: Specifications for the test - 2 The problem statement and answer choices are repeated from the previous slide. • The second paragraph also states additional statements that can be included in findings: • An interpretive statement about the direction of the relationship • The proportional reduction in error (PRE) interpretation of the coefficient of determination r²

  46. Sample homework problem: Simple linear regression The problem statement and answer choices are repeated from the previous slide. The answer will be True if all parts of the finding in the problem statement are correct. The answer will be True with caution if the analysis supports the finding, but one or both of the variables is ordinal level. The answer will be False if any part of the finding in the problem statement is not correct. The answer will be Incorrect application of a statistic if the level of measurement or sample size requirement is violated.

  47. Solving the problem with SPSS:Level of measurement Simple linear regression requires that the dependent variable be interval and the independent variable be interval or dichotomous. "Occupational prestige score" [prestg80] is interval level, satisfying the requirement for the dependent variable. "Highest academic degree" [degree] is ordinal level. However, we will follow the common convention of using ordinal variables with interval level statistics, adding a caution to any true findings.

  48. Solving the problem with SPSS: Simple linear regression- 1 Before we can address the other issues involved in solving the problem, we need to generate the SPSS output. Select Regression > Linear… from the Analyze menu.

  49. Solving the problem with SPSS: Simple linear regression - 2 The problem states that: Simple linear regression revealed a strong, positive relationship between "highest academic degree" [degree] and "occupational prestige score" [prestg80] (β = 0.546, t(250) = 10.30, p < .001). Unless the problem statement clearly specifies which variable is having an effect on the other, we treat the variable mentioned first as the independent variable and the one mentioned second as the dependent variable. We enter the independent and dependent variables in the dialog box. First, move the dependent variable prestg80 to the Dependent list box. Second, move the independent variable degree to the Independent(s) list box. Third, click on the Statistics button to add the additional statistics.

  50. Solving the problem with SPSS: Simple linear regression - 3 First, in addition to the SPSS defaults, we add the check box for Descriptive statistics. Second, click on the Continue button to close the dialog box.
