Go to Exercise #5 on Class Handout #6:

5. In the “SIMPLE LINEAR REGRESSION” section of the textbook, read the subsection “Assumptions”, and open the version of the SPSS data file Job Satisfaction that was saved after Exercise #9 on Class Handout #5.

(a) Use the Analyze > Correlate > Bivariate options in SPSS to obtain Table 3.1, and use the Graphs > Legacy Dialogs > Scatter/Dot options in SPSS to obtain Figure 3.2.

(b) Use the Analyze > Regression > Linear options in SPSS to select the variable Satisfaction for the Dependent slot and select the variable SQRT_Supervision for the Independent(s) section. Click on the Plots button to display the Linear Regression: Plots dialog box. From the list of variables on the left, select ZRESID, and click on the arrow pointing toward the Y slot; then select ZPRED, and click on the arrow pointing toward the X slot. This will generate the plot in Figure 3.3. In the Standardized Residual Plots section of the Linear Regression: Plots dialog box, select the Histogram option and the Normal probability plot option. This will generate the plots in Figures 3.4A and 3.4B (except that you will need to add the line at zero to the plot). Click on the Continue button, and then click on the OK button, after which the SPSS output will be generated.

(c) Read the PRACTICAL EXAMPLE section from page 81 to page 95; you do not need to follow any of the instructions to produce the various SPSS output displayed in the text.

6. In a study of the impact of temperature during the summer months on the maximum amount of power that must be generated to meet demand each day, the prediction of daily peak power load (megawatts) from daily high temperature (degrees Fahrenheit) is of interest. Data for 25 randomly selected summer days is stored in the SPSS data file powerloads.

(a) Identify the dependent (response) variable and the independent (explanatory) variable for a regression analysis. The dependent (response) variable is Y = “daily peak power load”, and the independent (explanatory) variable is X = “daily high temperature”.

(b) Does the data appear to be observational or experimental? Since the daily high temperature is random, the data is observational.

(c) In the document titled Using SPSS Version 19.0, use SPSS with the section titled Performing a simple linear regression with bivariate data, with checks of linearity, homoscedasticity, and normality assumptions to do the following: Follow the instructions in the first five steps to graph the least squares line on a scatter plot; then state why it might appear that the linearity assumption is not satisfied.

The data points do not appear to be randomly distributed around the least squares line. As temperature increases, the power loads appear to increase at a faster rate.

Class Handout #7 (Chapter 4) Definitions: Multiple Linear Regression

multiple linear regression data: a sample of n sets of observations of k independent variables $X_1, X_2, \dots, X_k$ and a quantitative dependent variable $Y$: $(x_{11}, x_{21}, \dots, x_{k1}, y_1), \dots, (x_{1n}, x_{2n}, \dots, x_{kn}, y_n)$. Each independent variable is called a factor or a predictor.

least squares regression equation: the linear equation which minimizes the sum of the squared differences between observed values of Y and predicted values of Y.

We again use $Y$ to represent an observed value of the dependent (response) variable, use $\hat{Y}$ to represent a predicted value of the dependent (response) variable, and use $\bar{Y}$ to represent the mean of all observed values of the dependent (response) variable.

The (estimated) unstandardized least squares regression equation is

$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

where $a$ is the intercept, which is the (predicted) value $\hat{Y}$ when $X_1 = X_2 = \dots = X_k = 0$ (Note: This predicted value often is not meaningful in practice.), and where $b_i$ is the regression coefficient for $X_i$, which is the (estimated) amount Y changes on average with each increase of one unit in the variable $X_i$ with all other independent variables (factors) remaining fixed.

The (estimated) standardized least squares regression equation is

$Z_{\hat{Y}} = \beta_1 Z_{X_1} + \beta_2 Z_{X_2} + \dots + \beta_k Z_{X_k}$

where $\beta_i$ (beta) is the standardized regression coefficient for $X_i$, $Z_{X_i}$ is the z-score for the value of $X_i$, and $Z_{\hat{Y}}$ is the z-score for the predicted value of Y.
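To make the two equations concrete, here is a minimal sketch in plain Python (not SPSS) that fits an unstandardized least squares equation by solving the normal equations and then converts each coefficient to a standardized one by multiplying it by the ratio of the predictor's standard deviation to the standard deviation of Y. The data and the function names `solve` and `least_squares` are made up for illustration; the toy values lie exactly on the plane Y = 2 + 3·X1 − X2 so the result can be checked by eye.

```python
from statistics import stdev

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def least_squares(columns, y):
    """Return [a, b1, ..., bk] minimizing the sum of squared residuals."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical toy data (not from any SPSS file): y = 2 + 3*x1 - x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [2 + 3 * u - v for u, v in zip(x1, x2)]

a, b1, b2 = least_squares([x1, x2], y)

# Standardized coefficient: beta_i = b_i * (std dev of X_i) / (std dev of Y).
beta1 = b1 * stdev(x1) / stdev(y)
beta2 = b2 * stdev(x2) / stdev(y)

print("unstandardized:", round(a, 6), round(b1, 6), round(b2, 6))
print("standardized:  ", round(beta1, 4), round(beta2, 4))
```

The standardized equation has no intercept because z-scores have mean zero; refitting the same model on z-scores of Y, X1, and X2 should reproduce `beta1` and `beta2`.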

1. The data stored in the SPSS data file realestate is to be used in a study concerning the prediction of sale price of a residential property (dollars). Appraised land value (dollars), appraised value of improvements (dollars), and area of property living space (square feet) are to be considered as possible predictors, and the 20 properties selected for the data set are a random sample.

(a) Does the data appear to be observational or experimental? Since the land value, improvement value, and area are all random, the data is observational.

One assumption in multiple regression is that the dependent (response) variable Y is linearly related to each quantitative independent (predictor) variable; if this assumption is not satisfied, then a higher-order term, which refers to a function of one or more of the independent variables, can be added to the model.

Another assumption in multiple regression is that the variance of the variable Y is the same for any fixed values of the independent variables. As in simple linear regression, homoscedasticity refers to this assumption being satisfied, while heteroscedasticity refers to this assumption not being satisfied. A scatter plot with standardized predicted values on the horizontal axis and standardized residuals on the vertical axis can be examined to evaluate this uniformity of variance assumption.

Another assumption in multiple regression is that the distribution of random errors, estimated by the residuals (residual = observed Y – predicted Y), is at least approximately normal. This assumption can be evaluated by examining a normal probability plot and by using the Shapiro-Wilk test.

NOTE: The textbook suggests (see pages 101 to 103 of the textbook) that the dependent variable Y and each quantitative independent variable must each have at least an approximately normal distribution. In practice, however, it is often sufficient to simply check the residuals for normality.

In the document titled Using SPSS Version 19.0, use SPSS with the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to do each of the following:

(b) Follow the instructions in the first six steps to graph the least squares line on a scatter plot for the dependent variable with each quantitative independent variable; then decide whether or not the linearity assumption appears to be satisfied.

For each of the quantitative predictors, the relationship looks reasonably linear, since the data points appear randomly distributed around the least squares line.

1(b)-continued Continue to follow the instructions beginning with the 8th step (notice that step 7 is not necessary here) down to the 15th step to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, decide whether or not each of these assumptions appears to be satisfied. There appears to be much variation, but it looks reasonably uniform.

The histogram of standardized residuals looks somewhat non-normal, and the points on the normal probability plot seem to depart somewhat from the diagonal line.

1.-continued (c) Based on the histogram and normal probability plot for the standardized residuals in part (b), explain why we might want to look at the skewness coefficient, the kurtosis coefficient, and the results of the Shapiro-Wilk test. Then use SPSS with the section titled Data Diagnostics to make a statement about whether or not non-normality needs to be a concern.

Since there appears to be some possible evidence of non-normality in part (b), we want to know if non-normality needs to be a concern. Since the skewness and kurtosis coefficients are each well within two standard errors of zero, and p = 0.166 in the Shapiro-Wilk test is not less than 0.001, non-normality need not be a concern in the regression.
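The "within two standard errors of zero" check can be sketched in plain Python as follows. This is an illustration only: the formulas are the bias-adjusted sample skewness and excess kurtosis, with their usual standard errors, that statistical packages commonly report (treat the exact match to SPSS output as an assumption), and the residual values are made up. The Shapiro-Wilk test is not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev

def skewness(xs):
    """Bias-adjusted sample skewness coefficient."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def excess_kurtosis(xs):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    fourth = sum(((x - m) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def se_skewness(n):
    """Standard error of the skewness coefficient for sample size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of the kurtosis coefficient for sample size n."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def within_two_se(statistic, standard_error):
    """True when the statistic is less than two standard errors from zero."""
    return abs(statistic) < 2 * standard_error

# Hypothetical standardized residuals (NOT the realestate residuals).
residuals = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.2, 0.5, 1.0, 1.5]
n = len(residuals)

skew_ok = within_two_se(skewness(residuals), se_skewness(n))
kurt_ok = within_two_se(excess_kurtosis(residuals), se_kurtosis(n))
print("skewness within 2 SE:", skew_ok)
print("kurtosis within 2 SE:", kurt_ok)
```

The standard errors depend only on the sample size, so for the 20 properties in this exercise `se_skewness(20)` and `se_kurtosis(20)` give the yardsticks against which the SPSS coefficients are compared.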

Multicollinearity is said to be present among a set of independent variables when there is at least one high correlation between two of the independent variables or between two different linear combinations of the independent variables. Multicollinearity can cause problems with the calculations involved in a multiple linear regression. Often, multicollinearity can be detected by observing that the correlation between a pair of independent variables is larger than 0.80. The tolerance for any given independent variable is the proportion of variance in that independent variable unexplained by the other independent variables, and the variance inflation factor (VIF) is the reciprocal of the tolerance. A multicollinearity problem is indicated by tolerance < 0.10, that is, VIF > 10.
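Both screening rules can be sketched in plain Python, starting from the correlation matrix of the independent variables. The sketch relies on a standard identity: the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' correlation matrix, and the tolerance is its reciprocal. The correlation values and the names `invert` and `collinearity_report` are hypothetical, and flagging by absolute correlation (so strong negative correlations are caught too) is an assumption of this sketch.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect multicollinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def collinearity_report(names, corr, r_cutoff=0.80, tol_cutoff=0.10):
    """Flag highly correlated pairs and compute tolerance and VIF."""
    inverse = invert(corr)
    high_pairs = [(names[i], names[j], corr[i][j])
                  for i in range(len(names))
                  for j in range(i + 1, len(names))
                  if abs(corr[i][j]) > r_cutoff]
    stats = {}
    for i, name in enumerate(names):
        vif = inverse[i][i]          # VIF = diagonal of inverse correlation matrix
        tolerance = 1.0 / vif        # tolerance is the reciprocal of VIF
        stats[name] = {"tolerance": tolerance, "vif": vif,
                       "problem": tolerance < tol_cutoff}
    return high_pairs, stats

# Hypothetical correlation matrix for three predictors (illustration only).
names = ["x1", "x2", "x3"]
corr = [[1.0, 0.5, 0.0],
        [0.5, 1.0, 0.0],
        [0.0, 0.0, 1.0]]

high_pairs, stats = collinearity_report(names, corr)
print(high_pairs)
for name in names:
    print(name, stats[name])
```

Working from the correlation matrix keeps the sketch short; regressing each predictor on the others and taking 1 − R² would give the same tolerances.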

(d) From the Correlations table of the SPSS output, comment on the possibility of multicollinearity in the multiple regression. Since the correlation matrix does not contain any correlation greater than 0.8 for any pair of independent variables, there is no indication that multicollinearity will be a problem.

Analysis of Variance (ANOVA) can be used to derive hypothesis tests in multiple regression, concerning the statistical significance of the overall regression and the statistical significance of individual independent variables (factors). The partitioning of the total sum of squares is analogous to that in simple linear regression. The basic ANOVA table can be organized as follows:

Source        df
Regression    k
Error         n – k – 1
Total         n – 1

The f statistic in this ANOVA table is for deciding whether or not at least one regression coefficient is significantly different from zero (0), or in other words, whether or not the overall regression is statistically significant. Note that the regression degrees of freedom (df) is k (since the dependent variable is predicted from k independent variables), and that the total degrees of freedom is, as always, one less than the sample size n. Since the df for regression and the df for error must sum to the df for total, the df for error must be equal to n – k – 1.
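The arithmetic behind the table can be sketched in a few lines of Python. Each mean square is a sum of squares divided by its degrees of freedom, and the f statistic is the regression mean square divided by the error mean square. The sums of squares below are made-up numbers chosen for easy arithmetic; only n = 20 and k = 3 correspond to the realestate exercise, and the function name `anova_table` is hypothetical.

```python
def anova_table(ss_regression, ss_error, n, k):
    """Build the regression ANOVA quantities for n cases and k predictors."""
    df_regression = k
    df_error = n - k - 1
    df_total = n - 1                      # equals df_regression + df_error
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return {
        "df": (df_regression, df_error, df_total),
        "ss_total": ss_regression + ss_error,
        "ms_regression": ms_regression,
        "ms_error": ms_error,
        "f": ms_regression / ms_error,
    }

# Hypothetical sums of squares; n and k match a 20-case, 3-predictor model.
table = anova_table(ss_regression=900.0, ss_error=160.0, n=20, k=3)
print(table)
```

With these inputs the degrees of freedom are 3, 16, and 19, which are the values that appear as subscripts on the f statistic in a 20-case, 3-predictor regression.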

(e) With a 0.05 significance level, summarize the results of the f-test in the ANOVA table. Since $f_{3,16} = 46.662$ and $f_{3,16;\,0.05} = 3.24$, we have sufficient evidence to reject $H_0$ at the 0.05 level. We conclude that the linear regression to predict sale price from land value, improvements value, and area is significant (p < 0.001); that is, at least one coefficient in the linear regression to predict sale price from land value, improvements value, and area is different from zero.

(f) Use the SPSS output to find the least squares regression equation.

$\widehat{\text{sale\_prc}} = 1470.276 + 0.814(\text{land\_val}) + 0.820(\text{impr\_val}) + 13.529(\text{area})$
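As an illustration of how the fitted equation is used, the short Python sketch below plugs values into it. The coefficients are the ones reported in part (f); the property being priced (land value 10,000 dollars, improvements 50,000 dollars, 1,500 square feet) is invented for the example and is not one of the 20 properties in the data file.

```python
# Coefficients taken from the least squares equation in part (f).
INTERCEPT = 1470.276
COEFFICIENTS = {"land_val": 0.814, "impr_val": 0.820, "area": 13.529}

def predict_sale_price(land_val, impr_val, area):
    """Predicted sale price (dollars) from the fitted regression equation."""
    return (INTERCEPT
            + COEFFICIENTS["land_val"] * land_val
            + COEFFICIENTS["impr_val"] * impr_val
            + COEFFICIENTS["area"] * area)

# Hypothetical property, chosen only to show the arithmetic.
predicted = predict_sale_price(land_val=10_000, impr_val=50_000, area=1_500)
print(round(predicted, 2))

# Interpretation of one coefficient: 100 extra square feet, with the two
# appraised values held fixed, changes the prediction by 100 * 13.529.
extra = predict_sale_price(10_000, 50_000, 1_600) - predicted
print(round(extra, 2))
```

The second calculation mirrors the definition of a regression coefficient: the estimated average change in the dependent variable per unit increase in one predictor, with the others remaining fixed.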

When a large number of predictors are available in the prediction of a dependent variable Y, it is desirable to have some method for screening the predictors, that is, selecting those predictors which are most important. One possible method is to select independent variables that are significantly correlated with the dependent variable Y. A more sophisticated method is stepwise regression. Given that significance levels $\alpha_E$ and $\alpha_R$ are chosen respectively for entry and removal, stepwise regression is applied as follows:

Note: The text description of stepwise regression is not quite correct.

Step 1 All possible regressions with exactly one predictor are fit to the data. Among all predictors which are statistically significant at the $\alpha_E$ level, if any, the one for which the p-value is lowest (i.e., the one which is “most statistically significant”) is entered into the model, and Step 2 is performed next; if no predictors are statistically significant at the chosen $\alpha_E$ level, no predictors are entered into the model and the procedure ends.

Step 2 Labeling the predictor entered into the model as $X_1$, then all possible regressions with $X_1$ and one other predictor are fit to the data. Among all predictors which are statistically significant with $X_1$ in the model at the $\alpha_E$ level, if any, the one for which the p-value is lowest (i.e., the one which is “most statistically significant”) is entered into the model, and Step 3 is performed next; if no predictors are statistically significant with $X_1$ in the model at the $\alpha_E$ level, no predictors are entered into the model and the procedure ends.

Step 3 Labeling the predictor entered into the model after $X_1$ as $X_2$, then a check is performed to see if $X_1$ is statistically significant with $X_2$ in the model at the $\alpha_R$ level, and if not, then $X_1$ is removed. Next, Step 4 is performed.

Step 4 All possible regressions with the predictor(s) currently in the model and one other predictor are fit to the data. Among all predictors which are statistically significant after the predictor(s) currently in the model at the $\alpha_E$ level, if any, the one for which the p-value is lowest (i.e., the one which is “most statistically significant”) is entered into the model, and Step 5 is performed next; if no predictors are statistically significant after the predictor(s) currently in the model at the $\alpha_E$ level, no predictors are entered into the model and the procedure ends.

Step 5 A check is then performed to see if each of the predictors in the model is statistically significant with all other predictors now in the model. Among all predictors which are not statistically significant at the $\alpha_R$ level, if any, the one for which the p-value is highest is removed from the model, and the check is repeated until no more variables can be removed.

Step 6 Steps 4 and 5 are repeated successively until no more variables can be entered or removed.
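The control flow of the steps above can be sketched in Python as a loop that alternates an entry check with a removal check. This is only an outline of the logic, not SPSS's implementation: the p-values come from a caller-supplied function `p_value(predictor, others)` (a hypothetical interface that would normally run a regression and return the p-value of `predictor` in a model that also contains `others`), and the lookup table of p-values below is invented so the example runs without any statistics library. The `max_rounds` guard is an added safety limit, not part of the procedure.

```python
def stepwise(candidates, p_value, alpha_enter=0.05, alpha_remove=0.10,
             max_rounds=100):
    """Return (predictors in the final model, list of enter/remove events)."""
    model, history = [], []
    for _ in range(max_rounds):
        # Entry check: p-value of each outside predictor given the current model.
        outside = [c for c in candidates if c not in model]
        scored = [(p_value(c, frozenset(model)), c) for c in outside]
        eligible = [(p, c) for p, c in scored if p < alpha_enter]
        if not eligible:
            break                         # nothing can be entered: stop
        _, best = min(eligible)           # lowest p-value is entered
        model.append(best)
        history.append(("enter", best))
        # Removal check: repeat until every predictor left is significant.
        while True:
            scored = [(p_value(c, frozenset(model) - {c}), c) for c in model]
            weak = [(p, c) for p, c in scored if p > alpha_remove]
            if not weak:
                break
            _, worst = max(weak)          # highest p-value is removed
            model.remove(worst)
            history.append(("remove", worst))
    return model, history

# Invented p-values keyed by (predictor, other predictors in the model).
fs = frozenset
toy_p = {
    ("x1", fs()): 0.001, ("x2", fs()): 0.010, ("x3", fs()): 0.200,
    ("x2", fs({"x1"})): 0.030, ("x3", fs({"x1"})): 0.400,
    ("x1", fs({"x2"})): 0.002,
    ("x3", fs({"x1", "x2"})): 0.300,
}

model, history = stepwise(["x1", "x2", "x3"], lambda c, others: toy_p[(c, others)])
print(model)
print(history)
```

Keeping the removal level larger than the entry level, as in the defaults used here, makes it unlikely that a predictor is entered and then immediately removed.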

Other methods to decide which of many predictors are the most important include the forward selection method (which is the same as stepwise regression except there is no option to remove variables from the model), the backward elimination method (where we begin with all predictors in the model and remove the most statistically insignificant at each step until no more predictors can be removed), and various methods which depend on doing all possible regressions (discussed in Section 6.3). The hypothesis test to decide whether or not a predictor is statistically significant when entered into a model after other predictors are already in the model is equivalent to the hypothesis test to decide whether or not the partial correlation between the dependent variable Y and the predictor being entered into the model given all the predictors already in the model is statistically significant. The results of only one method with one data set to select the most important predictors should not be considered final. Typically, further analysis is necessary. For instance, several procedures could be used with the same data to see if the same results are obtained. Higher order terms can also be investigated.

1.-continued (g) In the document titled Using SPSS Version 19.0, use SPSS with the five instructions at the end of the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to obtain the output for a stepwise regression.

1.-continued (h) From the Collinearity Statistics section of the Coefficients table of the SPSS output, add to the comment on the possibility of multicollinearity in the multiple regression. We see that tolerance > 0.10 (i.e., VIF < 10) for each independent variable, which is a further indication that multicollinearity will not be a problem.

(i) From the Variables Entered/Removed table of the SPSS output, find the default values of the significance level to enter an independent variable into the model and the significance level to remove an independent variable from the model. Respectively, these are 0.05 and 0.10.

(j) From the Variables Entered/Removed table of the SPSS output, find the number of steps in the stepwise multiple regression, and list the independent variables selected and removed at each step. There were two steps in the stepwise multiple regression; the variable “appraised value of improvements” was entered in the first step, and the variable “area of property living space” was entered in the second step. No variables were removed at either step.

We shall finish this exercise next class.
