Multiple Regression

Multiple Regression Chapter 17

Ch 17 Introduction • In this chapter we extend the simple linear regression model, and allow for any number of independent variables. • We expect to build a model that fits the data better than the simple linear regression model.

Introduction • We shall use computer printout to • Assess the model • How well it fits the data • Is it useful • Are any required conditions violated? • Employ the model • Interpreting the coefficients • Predictions using the prediction equation • Estimating the expected value of the dependent variable

Dependent variable Independent variables 17.2 Model and Required Conditions • We allow for k independent variables to potentially be related to the dependent variable y = b0 + b1x1+ b2x2 + …+ bkxk + e Coefficients Random error variable

Multiple Regression for k = 2, Graphical Demonstration - I y The simple linear regression model allows for one independent variable, “x” y =b0 + b1x + e y = b0 + b1x y = b0 + b1x y = b0 + b1x y = b0 + b1x Note how the straight line becomes a plain, and... y = b0 + b1x1 + b2x2 y = b0 + b1x1 + b2x2 y = b0 + b1x1 + b2x2 y = b0 + b1x1 + b2x2 y = b0 + b1x1 + b2x2 X y = b0 + b1x1 + b2x2 1 y = b0 + b1x1 + b2x2 The multiple linear regression model allows for more than one independent variable. Y = b0 + b1x1 + b2x2 + e X2

Multiple Regression for k = 2, Graphical Demonstration - II y y= b0+ b1x2 Note how a parabola becomes a parabolic Surface. b0 X 1 y = b0 + b1x12 + b2x2 X2

Required conditions for the error variable • The error e is normally distributed. • The mean is equal to zero and the standard deviation is constant (se)for all values of y. • The errors are independent. • We will test for normality and independence.

Estimating the Coefficients and Assessing the Model • The procedure used to perform regression analysis: • Obtain the model coefficients and statistics using a statistical software. • Diagnose violations of required conditions. Try to remedy problems when identified. • Assess the model fit using statistics obtained from the sample. • If the model assessment indicates good fit to the data, use it to interpret the coefficients and generate predictions.

Estimating the Coefficients and Assessing the Model, Example • Example 17.1 Where to locate a new motor inn? • La Quinta Motor Inns is planning an expansion. • Management wishes to predict which sites are likely to be profitable. • Several areas where predictors of profitability can be identified are: • Competition • Market awareness • Demand generators • Demographics • Physical quality

Example 17.1 – La Quinta Inns… • Where should La Qunita locate a new motel? Factors influencing profitability… Profitability Competition Market Awareness Demand Generators Community Physical factor # of rooms within 3 mile radius Distance to the nearest La Quinta inn. Median household income. Distance to downtown. Offices, Higher Ed. measure *these need to be interval data !

Example 17.1 – La Quinta Inns… • Where should La Qunita locate a new motel? • Several possible predictors of profitability were identified, and data were collected. Its believed that operating margin (y) is dependent upon these factors: • x1 = Total motel and hotel rooms within 3 mile radius • x2 = Number of miles to closest competition • x3 = Volume of office space in surrounding community • x4 = College and university student numbers in community • x5 = Median household income in community • x6 = Distance (in miles) to the downtown core.

Transformation… • Can we transform this data: • into a mathematical model that looks like this: margin awareness (distance to nearest alt.) physical (distance to downtown) competition (i.e. # of rooms) …

Regression Analysis, Minitab Output • Regression Analysis: Margin versus Number, Nearest, ... • The regression equation is • Margin = 38.1 - 0.00762 Number + 1.65 Nearest + 0.0198 Office S + 0.212 Enrollment + 0.413 Income - 0.225 Distance • Predictor Coef SE Coef T P • Constant 38.139 6.993 5.45 0.000 • Number -0.007618 0.001255 -6.07 0.000 • Nearest 1.6462 0.6328 2.60 0.011 • Office S 0.019766 0.003410 5.80 0.000 • Enrollme 0.2118 0.1334 1.59 0.116 • Income 0.4131 0.1396 2.96 0.004 • Distance -0.2253 0.1787 -1.26 0.211 • S = 5.51208 R-Sq = 52.5% R-Sq(adj) = 49.4% • Analysis of Variance • Source DF SS MS F P • Regression 6 3123.83 520.64 17.14 0.000 • Residual Error 9 3 2825.63 30.38 • Total 99 5949.46

Example 17.1… COMPUTE • In Excel: Tools > Data Analysis… > Regression

The Model… INTERPRET • Although we haven’t done any assessment of the model yet, at first pass: • it suggests that increases in the number of miles to closest competition, office space, student enrollment and household income will positively impact the operating margin. • Likewise, increases in the total number of lodging rooms within a short distance and the distance from downtown will negatively impact the operating margin…

Model Assessment… • We will assess the model in three ways: • the standard error of estimate, • the coefficient of determination, and • the F-test of the analysis of variance. • The standard error of estimate is used “downstream” in the subsequent calculations (though we will restrict our discussion to Excel based output vs. manual calculations).

Standard Error of Estimate… • In multiple regression, the standard error of estimate is defined as: • n is the sample size and k is the number of independent variables in the model. We compare this value to the mean value of y: • It seems the standard error of estimate is not particularly small. What can we conclude?

Coefficient of Determination… • Again, the coefficient of determination is defined as: • This means that 52.51% of the variation in operating margin is explained by the six independent variables, • but 47.49% remains unexplained.

Adjusted R2 value… • What’s this? • The “adjusted” R2 is: • the coefficient of • determination adjusted for degrees of freedom. • It takes into account the sample size n, and k, the number of independent variables, and is given by:

Testing the Validity of the Model • We pose the question: Is there at least one independent variable linearly related to the dependent variable? • To answer the question we test the hypothesis H0: b0 = b1 = b2 = … = bk H1: At least one bi is not equal to zero. • If at least one bi is not equal to zero, the model has some validity.

Testing the Validity of the Model… • In a multiple regression model (i.e. more than one independent variable), we utilize an analysis of variance technique to test the overall validity of the model. Here’s the idea: • H0: • H1: At least one is not equal to zero. • If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid. • If at least one is not equal to 0, the model does have some validity.

Source of Variation • degrees of freedom • Sums of Squares • Mean Squares • F-Statistic • Regression • k • SSR • MSR = SSR/k • F=MSR/MSE • Error • n–k–1 • SSE • MSE = SSE/(n–k-1) • Total • n–1 Testing the Validity of the Model… • ANOVA table for regression analysis… A large value of F indicates that most of the variation in y is explained by the regression equation and that the model is valid. A small value of F indicates that most of the variation in y is unexplained.

Testing the Validity of the La Quinta Inns Regression Model [Variation in y] = SSR + SSE. Large F results from a large SSR. Then, much of the variation in y is explained by the regression model; the model is useful, and thus, the alternative hypothesis should be accepted.

Testing the Validity of the La Quinta Inns Regression Model Conclusion: There is overwhelming evidence to accept the alternative hypothesis. At least one of the bi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid The p-value (Significance F) = 0.0000 Accept the alternative hypothesis.

SSE • R2 • F • Assessment of Model • 0 • 0 • 1 • Perfect • small • small • close to 1 • large • Good • large • large • close to 0 • small • Poor • 0 • 0 • Useless Table 17.2… Summary Once we’re satisfied that the model fits the data as well as possible, and that the required conditions are satisfied, we can interpret and test the individual coefficients and use the model to predict and estimate…

Interpreting the Coefficients • b0 = 38.14. This is the intercept, the value of y when all the variables take the value zero. Since the data range of all the independent variables do not cover the value zero, do not interpret the intercept. • b1 = – 0.0076. In this model, for each additional room within 3 mile of the La Quinta inn, the operating margin decreases on average by .0076% (assuming the other variables are held constant).

Interpreting the Coefficients • b2 = 1.65. In this model, for each additional mile that the nearest competitor is to a La Quinta inn, the operating margin increases on average by 1.65% when the other variables are held constant. • b3 = 0.020.For each additional 1000 sq-ft of office space, the operating margin will increase on average by .02% when the other variables are held constant. • b4 = 0.21. For each additional thousand students the operating margin increases on average by .21% when the other variables are held constant.

Interpreting the Coefficients • b5 = 0.41. For additional $1000 increase in median household income, the operating margin increases on average by .41%, when the other variables remain constant. • b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by .23% when the other variables are held constant.

Test statistic Testing the Coefficients • The hypothesis for each bi is H0: bi= 0 H1: bi¹ 0 d.f. = n - k -1

Regression Analysis: Margin versus Number, Nearest, ... The regression equation is Margin = 38.1 - 0.00762 Number + 1.65 Nearest + 0.0198 Office S + 0.212 Enrollment + 0.413 Income - 0.225 Distance Predictor Coef SE Coef T P Constant 38.139 6.993 5.45 0.000 Accept H1 Number -0.007618 0.001255 -6.07 0.000 Accept H1 Nearest 1.6462 0.6328 2.60 0.011 Accept H1 Office S 0.019766 0.003410 5.80 0.000 Accept H1 Enrollme 0.2118 0.1334 1.590.116 Do not accept H1 Income 0.4131 0.1396 2.96 0.004 Accept H1 Distance -0.2253 0.1787 -1.260.211 Do not accept H1 S = 5.51208 R-Sq = 52.5% R-Sq(adj) = 49.4%

Errors reasonable normal? Yes

Errors Independent? Yes

Conclusion • We are 95% confident that in this model the number of hotel and motel rooms, distance to the nearest motel, amount of office space, and median household income are linearly related to the operating margin. Moreover in this model we found no evidence to infer that college enrollment and distance to downtown center are linearly related to operating margin.

Next Step • Re-do the analyses using only “significant” x-variables. • In this case, this means eliminating enrollment and distance to downtown center.

Multiple linear regression resultsDependent Variable: MarginIndependent Variable(s): Number, Nearest, Office Space, Income Parameter estimates:

Regression Analysis: Margin versus Number, Nearest, Office S, Income The regression equation is Margin = 41.3 - 0.00786 Number + 1.65 Nearest + 0.0196 Office S + 0.399 Income Predictor Coef SE Coef T P Constant 41.273 6.415 6.43 0.000 Number -0.007863 0.001260 -6.24 0.000 Nearest 1.6537 0.6353 2.60 0.011 Office S 0.019607 0.003439 5.70 0.000 Income 0.3994 0.1399 2.86 0.005 S = 5.56330 R-Sq = 50.6% R-Sq(adj) = 48.5% Analysis of Variance Source DF SS MS F P Regression 4 3009.18 752.30 24.31 0.000 Residual Error 95 2940.27 30.95 Total 99 5949.46

Predictions • Use this printout & data to have Minitab or StatCrunch form prediction intervals and confidence intervals as required.

Next • Check the validity for this new model (normality and independence only.)

Finishing up • You can now interpret coefficients, interpret R2, make predictions, etc using the new model.

Interpreting the Coefficients b0 = 41.273. This is the intercept, the value of y when all the variables take the value zero. Since the data range of all the independent variables do not cover the value zero, do not interpret the intercept. b1 = – 0.007863. In this model, for each additional room within 3 mile of the La Quinta inn, the operating margin decreases on average by .0079% (assuming the other variables are held constant).

Interpreting the Coefficients b2 = 1.6537. In this model, for each additional mile that the nearest competitor is to a La Quinta inn, the operating margin increases on average by 1.65% when the other variables are held constant. b3 = 0.019607.For each additional 1000 sq-ft of office space, the operating margin will increase on average by .02% when the other variables are held constant. b4 = 0.3994. For additional $1000 increase in median household income, the operating margin increases on average by .40%, when the other variables remain constant.

Interpreting the Coefficients • Notice, these coefficients are almost the same as the ones done in the first part.

Coefficient of Determination • From the printout, R2 = 0.506 • 50.6% of the variation in operating margin is explained by the four independent variables. 49.4% remains unexplained. • Again, this is close to the other one using the six variables.

Using the Linear Regression Equation • The model can be used for making predictions by • Producing prediction interval estimate for the particular value of y, for a given values of xi. • Producing a confidence interval estimate for the expected value of y, for given values of xi. • The model can be used to learn about relationships between the independent variables xi, and the dependent variable y, by interpreting the coefficients bi

La Quinta Inns, Predictions • Predict the average operating margin of an inn at a site with the following characteristics: • 3815 rooms within 3 miles, • Closet competitor .9 miles away, • 476,000 sq-ft of office space, • $35,000 median household income, Xm17-01 MARGIN = 38.14 - 0.0076(3815)+1.65(.9) + 0.020(476) + 0.41(35) = 37.1%

La Quinta Inns, Predictions It is predicted, with 95% confidence that the operating margin will lie between 25.4% and 48.8%. It is estimated the average operating margin of all sites that fit this category falls within 33% and 41.2%. The average inn would not be profitable (Less than 50%). • Interval estimates by Minitab • Predicted Values for New Observations • New • Obs Fit SE Fit 95% CI 95% PI • 1 37.091 2.076 (32.970, 41.213) (25.395, 48.788) • Values of Predictors for New Observations • New • Obs Number Nearest Office S Income • 1 3815 0.900 476 35.0

Using the Regression Equation… • We add one row (our given values for the independent variables) to the bottom of our data set: • Then we use: • Tools > Data Analysis Plus > • Prediction Interval • to crunch the numbers…

Prediction Interval… INTERPRET • We predict that the operating margin will fall between 25.4 and 48.8. • If management defines a profitable inn as one with an operating margin greater than 50% and an unprofitable inn as one with an operating margin below 30%, they will pass on this site, since the entire prediction interval is below 50%.

Confidence Interval… INTERPRET • The expected operating margin of all sites that fit this category is estimated to be between 33.0 and 41.2. • We interpret this to mean that if we built inns on an infinite number of sites that fit the category described, the mean operating margin would fall between 33.0 and 41.2. In other words, the average inn wouldnot be profitable either…

Regression Diagnostics… • Calculate the residuals and check the following: • Is the error variable nonnormal? • Draw the histogram of the residuals • Is the error variance constant? • Plot the residuals versus the predicted values of y. • Are the errors independent (time-series data)? • Plot the residuals versus the time periods. • Are there observations that are inaccurate or do not belong to the target population? • Double-check the accuracy of outliers and influential observations.

Multiple Regression