1 / 62

MBA Statistics 51-651-00 COURSE #4

MBA Statistics 51-651-00 COURSE #4. Simple and multiple linear regression What should be the sales of ice cream?. Example:. Before beginning building a movie theater, one must estimate the daily number of people entering the building. How can we estimate it ?

bree
Download Presentation

MBA Statistics 51-651-00 COURSE #4

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MBAStatistics 51-651-00COURSE #4 Simple and multiplelinear regression What should be the sales of ice cream?

  2. Example: • Before beginning building a movie theater, one must estimate the daily number of people entering the building. • How can we estimate it? • There are 2 millions individuals in the city.

  3. Possible solutions: • One could realize a local market study. However it is often imprecise, specially for new projects. • One could get data from similar projects in other cities.

  4. What do you think?Can we do better?

  5. Probably, taking into account the size of the city

  6. Case study: Ice Cream Sales • The file icecream.xls contains pairs of data representing ice cream sales and temperature recorded that day, for 30 days. • Is there a relation between temperature and sales? • Can temperature be used to predict ice cream sales? • If so what’s the prediction when the temperature is 25?

  7. Introduction • One of the principle objectives of statistics is to explain the variability that we observe in data. • Linear regression (or linear models) is a statistical tool MUCH USED to study the presence of a linear relation between • a dependent variable Y (quantitative and continuous) • and one or more independent variables X1, X2, …, Xp(qualitative and/or quantitative), called independent or explanatory variables.

  8. For example, a manager could be interested in seeing if he could explain a good part of the variability that he observes in sales in his differents branches (dependant variable Y) in the last 12 months, by the area, number of employees, number of payed overtime hours, quality of customer service, number of promotions, etc. ( independent or explanatory variables).

  9. A regression model can be used to answer one of the following three objectives: • Describe data coming from non experimental studies i.e. we observe reality as it is. • Examine the hypothesis (data coming from controled experimental studies). • Predict (if we like to take risks!!).

  10. Example: • We are interested in knowing what are the important factors that influence or determine the value of a property and we want to build a model that would help us evaluate this value using certain factors. • To do this, we have obtained the total value for a sample of 79 properties in a given region. The following variables have also been collected for each property:

  11. Brief glimpse of the data file:house.xls # of square feet total land first outdoor heating OBS value value # of acres floor condition type 1 199657 63247 1.63 1726 Good NatGas 2 78482 38091 0.495 1184 Good NatGas 3 119962 37665 0.375 1014 Good Electric 4 116492 54062 0.981 1260 Average Electric 5 131263 61546 1.14 1314 Average NatGas ... 78 253480 57948 0.862 1720 Good Electric 79 257037 57489 0.95 2004 Excellnt Electric # of # of # of completed # of non completed # of OBS rooms bedroom bathrooms bathrooms fire-places GARAGE 1 8 4 2 1 2 Garage 2 6 2 1 0 0 NoGarage 3 7 3 2 0 1 Garage 4 6 3 2 0 1 Garage 5 8 4 2 1 2 NoGarage ... 78 10 5 5 1 1 Garage 79 9 4 2 2 2 Garage

  12. Is there a link between the total value and the different factors?

  13. The Pearson correlation coefficient r is used to measure the intensity of the linear relation between two quantitative variables. • The correlation coefficient r will take its values between -1 and 1. • If a perfect linearrelationexistbetweenX and Y, then r= 1 (r =1 if X and Y vary in the same direction and r = -1 if X varies in the opposite direction of Y). • If r = 0, there is no linear link between X and Y. • The more the r value furthers from 0 to get closer to 1, the more the linear link intensity between X and Y becomes larger.

  14. Y | 6.5 | * r = 0.035Y | r = 1 | | | 31 | * 6.0 | * * 29 | * | 27 | * | 25 | * 5.5 | * * 23 | * | 21 | * | 19 | * 5.0 | * 17 | * | 15 | * | 13 | * 4.5 | * * * 11 | * | | | ------------------------------------ 4.0 | * * 4 5 6 7 8 9 10 11 12 13 14 |----------------------------------- 4 5 6 7 8 9 10 11 12 13 14 X X Y | r = -1 | -8.0 | * -10.5 | * -13.0 | * -15.5 | * -18.0 | * -20.5 | * -23.0 | * -25.5 | * -28.0 | * -30.5 | * -33.0 | * | ---------------------------------- 4 5 6 7 8 9 10 11 12 13 14 X

  15. Descriptive statistics Variable N Mean Median Sta.Deviation Minimum Maximum Total 79 187253 156761 84401 74365 453744 Land 79 65899 59861 22987 35353 131224 Acre 79 1.579 1.040 1.324 0.290 5.880 Sq.Feet 79 1678 1628 635 672 3501 Rooms 79 8.519 8.000 2.401 5 18 Bedrooms 79 3.987 4.000 1.266 2 8 C.Bathro 79 2.241 2.000 1.283 1 7 Bathro 79 0.7215 1.000 0.715 0 3 Fire-pl. 79 1.975 2.000 1.368 0 7 Pearson Correlation Coefficients Total Land Acre Sq.Feet Rooms Bedroom C.Bathro Bathro Land 0.815 Acre 0.608 0.918 Sq.Feet 0.767 0.516 0.301 Rooms 0.626 0.518 0.373 0.563 Bedrooms 0.582 0.497 0.382 0.431 0.791 C.Bathro 0.626 0.506 0.376 0.457 0.479 0.586 Bathro 0.436 0.236 0.074 0.354 0.489 0.166 0.172 Fire-pl. 0.548 0.497 0.391 0.365 0.394 0.400 0.486 0.386

  16. BECAREFULL!! it is important to interpret the correlation coefficient with the graph. r = 0.816 in all cases below 12.5 | 10 | | | * | | * * * | * | * 10.0 | * 8 | * * | | * Y1 | * Y2 | | * * | 7.5 | * * 6 | * | * | | | | * | * 5.0 | * 4 | | * | | | * | | 2.5 | 2 | ----------------------------------- ------------------------------------ 4 5 6 7 8 9 10 11 12 13 14 4 5 6 7 8 9 10 11 12 13 14 X X 15.0 | Y4 | | 12.5 | * | | | | 12.5 | * | | | Y3 | 10.0 | | | 10.0 | | * | | * | * | * | * 7.5 | * 7.5 | * * | * | * * | * | * * | * | * * | * 5.0 | 5.0 | ----------------------------------- ----------------------------- 4 5 6 7 8 9 10 11 12 13 14 8 19 X X

  17. Simple linear regression • To describe a linear relation between two quantitative variables or to be able to predict Y for a given value of X, we use a regression line: Y = 0 + 1X +  • Since any statistical model is only an approximation (we hope the best possible !!) and because the linear link is never perfect , in the model, there is always an error, noted . • If there was a perfect linear relation between Y and X, the error term would always be equal to 0, and all the variability of Y would be explained by the independent variable X.

  18. So, for a given value of X, we would like to estimate Y. • Thus, with the help of the data sample we will estimate the regression model parameters 0 and 1 in order to minimize the residuals (errors) sum of squares. • The squared correlation coefficient is called the coefficient of determination and the percentage of the variability of Y explained by X: R2 = 1 - (n-2)/(n-1){Se /Sy}2, where Se is the standard deviation of the errors and Sy is the standard deviation of Y.

  19. We can also use the adjusted coefficient of determination to indicate the percentage of the variability of Y explained by X: R2ajusted = 1 - {Se/Sy}2 .

  20. Simple linear regression example: MODEL 1. Regression Analysis The regression equation is Total = 16209 + 102 Sq.Feet Predictor Coef StDev T P Constant 16209 17447 0.93 0.356 Sq.Feet 101.939 9.734 10.47 0.000 S = 54556 R-Sq = 58.8% R-Sq(adj) = 58.2% Analysis of Variance Source DF SS MS F P Regression 1 3.26460E+11 3.26460E+11 109.68 0.000 Residual Error 77 2.29181E+11 2976374177 Total 78 5.55641E+11

  21. MODEL 2. The regression equation is : Total = - 347 + 22021 Rooms Predictor Coef StDev T P Constant -347 27621 -0.01 0.990 Rooms 22021 3122 7.05 0.000 S = 66210 R-Sq = 39.3% R-Sq(adj) = 38.5% Analysis of Variance Source DF SS MS F P Regression 1 2.18090E+11 2.18090E+11 49.75 0.000 Residual Error 77 3.37551E+11 4383775699 Total 78 5.55641E+11 __________________________________________________________________ MODEL 3. The regression equation is : Total = 32428 + 38829 Bedrooms Predictor Coef StDev T P Constant 32428 25826 1.26 0.213 Bedrooms 38829 6177 6.29 0.000 S = 69056 R-Sq = 33.9% R-Sq(adj) = 33.1% Analysis of Variance Source DF SS MS F P Regression 1 1.88445E+11 1.88445E+11 39.52 0.000 Residual Error 77 3.67196E+11 4768775127 Total 78 5.55641E+11

  22. Model 1: • total value = 16209 + 102*( # ofsquared feet ). • R2 = 58.8%. Thus 58.8% of the variability of the total value is explained by the # ofsquared feet . • Model 2: • total value = -347 + 22021*(#of rooms ). • R2 = 39.3%. Thus 39.3% of the variability of the total value is explained by the #of rooms . • Model 3: • total value = 32428 + 38829 *(#of bedrooms ). • R2 = 33.9%. Thus 33.9% of the variability of the total value is explained by the #of bedrooms .

  23. Which one of the 3 previous models would you choose and why? Model 1 because it has the largest value of R2.

  24. 1- confidence interval for the mean of the values ofYfor a specific value ofX: • For model 1 and a value of X=1500 sq.ft we obtain the following point estimation : • est. total value = 16 209 + 102*1500 = 169 117$ • 95% confidence interval for the mean of the total value for properties of 1500 sq.ft : [156 418, 181 817] as calculated byCI-regression.xls

  25. 1- confidence interval for a new value ofY(prediction) being given a specific value ofX: • For model 1 and a value of X=1500 sq.ft we obtain the following point estimation : • est.total value = 16 209 + 101.939*1500 = 169 117$ • 95% confidence interval for a predicted total value when the area of the first floor is 1500 sq.ft : [59 742, 278 492] • The confidence interval for a predicted value is always larger than for the mean of the value of Y for a specific X .

  26. Inference on regression model parameters: • If there is no linear link between Y and X then 1 = 0. So, we want to examine the following hypothesis : • H0 : 1 = 0 vs H1 : 1  0 • We will reject H0 when the ‘ p-value ’ is too small • This test will be valid if • the relation between X and Y is linear • the data are independent • the variance of Y is the same for every value of X. • Y has a normal distribution for every value of X or the sample size n is large.

  27. Multiple linear regression • It is more likely possible that the variability of the dependent variable Y will be explained not only by one independent variableX, but rather by a linear combination of several independent variables X1, X2, …, Xp. • In this case, the multiple regression model is given by: Y = 0 + 1X1 + 2X2 + … + pXp +  • Also, using the sample data, we will estimate the regression model parameters 0, 1, …, p in order to minimize the residuals (errors) sum of squares.

  28. The multiple correlation coefficient R2, also called the coefficient of determination, represents the percentage of the variability of Y explained by the independent variables X1, X2, …, Xp. • In the model, when we add one or more independent variables, R2 increases. • The question is to know if R2 increases to a significant degree. • Note that we cannot have more independent variables in the model that there are observations in the sample. (general rule: n  5p).

  29. Example: MODEL 1. The regression equation is Total = - 89131 + 3.05 Land - 20730 Acre + 43.3 Sq.Feet - 4352 Rooms + 10049 Bedroom + 7606 C.Bathro + 18725 Bathro + 882 Fire-pl. Predictor Coef StDev T P Constant -89131 18302 -4.87 0.000 Land 3.0518 0.5260 5.80 0.000 Acre -20730 7907 -2.62 0.011 Sq.Feet 43.336 7.670 5.65 0.000 Rooms -4352 3036 -1.43 0.156 Bedroom 10049 5307 1.89 0.062 CBathro 7606 3610 2.11 0.039 Bathro 18725 6585 2.84 0.006 Fire-pl. 882 3184 0.28 0.783 S = 29704 R-Sq = 88.9% R-Sq(adj) = 87.6% Analysis of Variance Source DF SS MS F P Regression 8 4.93877E+11 61734659810 69.97 0.000 Residual Error 70 61763515565 882335937 Total 78 5.55641E+11

  30. MODEL 2 Regression Analysis The regression equation is Total = - 97512 + 3.11 Land - 21880 Acre + 40.2 Sq.Feet + 4411 Bedroom + 8466 C.bathro + 14328 Bathro Predictor Coef StDev T P Constant -97512 17466 -5.58 0.000 Land 3.1103 0.5236 5.94 0.000 Acre -21880 7884 -2.78 0.007 Sq.Feet 40.195 7.384 5.44 0.000 Bedroom 4411 3469 1.27 0.208 C.bathro 8466 3488 2.43 0.018 Bathro 14328 5266 2.72 0.008 S = 29763 R-Sq = 88.5% R-Sq(adj) = 87.6% Analysis of Variance Source DF SS MS F P Regression 6 4.91859E+11 81976430646 92.54 0.000 Residual Error 72 63782210167 885864030 Total 78 5.55641E+11

  31. MODEL 3 Regression Analysis The regression equation is Total = - 90408 + 3.20 Land - 22534 Acre + 41.1 Sq.Feet + 10234 C.bathro + 14183 Bathro Predictor Coef StDev T P Constant -90408 16618 -5.44 0.000 Land 3.2045 0.5205 6.16 0.000 Acre -22534 7901 -2.85 0.006 Sq.Feet 41.060 7.383 5.56 0.000 C.bathro 10234 3213 3.19 0.002 Bathro 14183 5287 2.68 0.009 S = 29889 R-Sq = 88,3% R-Sq(adj) = 87,5% Analysis of Variance Source DF SS MS F P Regression 5 4.90426E+11 98085283380 109.80 0.000 Residual Error 73 65214377146 893347632 Total 78 5.55641E+11

  32. Model without the area of the land ( # of acres ) because of the multicolinearity with the land value. MODEL 4 The regression equation is Total = - 55533 + 1.82 Land + 49.8 Sq.Feet + 11696 C.bathro + 18430 Bathro Predictor Coef StDev T P Constant -55533 11783 -4.71 0.000 Land 1.8159 0.1929 9.42 0.000 Sq.Feet 49.833 7.028 7.09 0.000 C.bathro 11696 3321 3.52 0.001 Bathro 18430 5312 3.47 0.001 S = 31297 R-Sq = 87.0% R-Sq(adj) = 86.3% Analysis of Variance Source DF SS MS F P Regression 4 4.83160E+11 1.20790E+11 123.32 0.000 Residual Error 74 72481137708 979474834 Total 78 5.55641E+11

  33. Which one of the 4 previous models would you choose and why? • Probably model 4because all the independent variables are significant at the 5% level (i.e. for each  in the model, p-value < 5%) and although R2 is smaller, it is just marginally smaller. Moreover, all the model coefficients make « sense »! • In model 1 , the variables ‘ # of rooms  ’ and ‘ # of fire-place’ are not statistically significant at the 5% level (p-value > 5%). The variable ‘ # of bedrooms ’ is at the limit with a p-value = 0.0624.

  34. Which one of the 4 previous models would you choose and why?(continued) • In model 2 the variable ‘ # of bedroom ’ is not statistically significant at the 5% level. • In model 3 (and the previous models), the variable ‘ # of acres ’ coefficient is negative which is contrary to « common sense » and to what we observed in the scatter plot and the positive Pearson correlation coefficient (r = 0.608). • In models 1 to 3,the negative coefficient for the variable ‘ # of acres ’ is due to the fact that there is a strong linear relation between the value of the land and the area of the land (r = 0.918): multicolinearity problem.

  35. Multicolinearity • If two or more explanatory variables are strongly correlated (> 0.85 in absolute value), one says that there is multicolinearity. It has an influence on the estimation of parameters in the model. • If two explanatory variables are highly correlated, then can get rid of one of these variables. Because of the strong correlation, the contribution of the other variable is not significant. • The correlation between several pairs of variables can be calculated in Excel using correlation in the Data Analysis toolbox.

  36. How can we choose a particular linear regression model among all the possible ones? There are several techniques: • Step by step selection by adding one variable at a time, starting with the most significant one (stepwise, forward). • Selection starting from the model in which all the variables are included and removing one variable at a time starting with the least significant (backward). • Construct all possible models and choose the best subset of variables according to certain specific criteria (ex: adjustedR2, Cp de Mallow.)

  37. Example of selection among the best subsets: Best Subsets Regression : Response is Total B C S e b B F q R d a a i L A f o r t t r a c e o o h h e Adj. n r e m o r r p Vars R-Sq R-Sq C-p s d e t s m o o l 1 66.4 65.9 136.8 49262 X 1 58.8 58.2 184.7 54556 X 1 39.3 38.5 307.6 66210 X 2 82.7 82.2 35.9 35564 X X 2 78.8 78.3 60.3 39343 X X 2 74.4 73.7 88.1 43244 X X 3 85.6 85.0 19.5 32637 X X X 3 84.8 84.2 24.5 33521 X X X 3 84.8 84.2 24.9 33591 X X X 4 87.1 86.4 12.2 31115 X X X X 4 87.0 86.3 13.1 31297 X X X X 4 86.6 85.9 15.2 31682 X X X X 5 88.3 87.5 6.9 29889 X X X X X 5 87.6 86.7 11.2 30744 X X X X X 5 87.4 86.5 12.4 30979 X X X X X 6 88.5 87.6 7.3 29763 X X X X X X 6 88.3 87.3 8.6 30030 X X X X X X 6 88.3 87.3 8.9 30096 X X X X X X 7 88.9 87.8 7.1 29510 X X X X X X X 7 88.6 87.4 9.1 29924 X X X X X X X 7 88.3 87.2 10.6 30240 X X X X X X X 8 88.9 87.6 9.0 29704 X X X X X X X X

  38. Selection of the model without the variable# of acres Best Subsets Regression : Response is Total B C S e b B F q R d a a i L f o r t t r a e o o h h e Adj. n e m o r r p Vars R-Sq R-Sq C-p s d t s m o o l 1 66.4 65.9 120.6 49262 X 1 58.8 58.2 164.9 54556 X 1 39.3 38.5 278.3 66210 X 2 82.7 82.2 27.6 35564 X X 2 72.7 71.9 86.0 44704 X X 2 72.5 71.8 86.8 44813 X X 3 84.8 84.2 17.2 33521 X X X 3 84.8 84.2 17.6 33591 X X X 3 84.0 83.3 22.3 34467 X X X 4 87.0 86.3 6.9 31297 X X X X 4 86.1 85.3 12.1 32352 X X X X 4 85.3 84.5 16.5 33226 X X X X 5 87.3 86.4 6.9 31100 X X X X X 5 87.0 86.1 8.5 31439 X X X X X 5 87.0 86.1 8.9 31509 X X X X X 6 87.8 86.8 6.1 30707 X X X X X X 6 87.3 86.3 8.7 31264 X X X X X X 6 87.0 85.9 10.5 31656 X X X X X X 7 87.8 86.6 8.0 30908 X X X X X X X

  39. The selection of the best model is done according to the combination: • The greatest value of R2 adjusted for the number of variables in the model. • The smallest value of Cp . • For the models with R2 adjusted and comparable Cp, we will choose the model which has the most « common sense » according to the experts in the field. • For the models with R2 adjusted and comparable Cp, the model with the independent variables that are the easiest and least expensive to measure. • The model validity.

  40. 1- confidence interval forYmean and a new value of Y(prediction) being given a specific value combinationforX1, X2, …, Xp. • For model 4 and property with a land= 65 000$, sq.ft = 1500, 2 completed bathrooms and 1 not-completed, we obtain the following point estimation : • est.total value = -55 533 + 1.816*65 000 + 49.833*1 500 + 11 696*2 + 18 430*1 = 179 074$ • 95% confidence interval for the mean of the total value: [170 842, 187 306] • 95% confidence interval for a total predicted value : [116 173, 241 974]

  41. Notes: • For a 1500 sq.ft property, the multiple regression model gives a smaller 95% confidence intervals than the simple regression model. • Therefore the addition of several other variables in the model helped to better explain the total value variability and to improve our estimations. • If two or more independent variables are correlated we will say that there is multicolinearity. This can influence the value of the parameters in the model . • Also, if two independent variables are strongly correlated then only one of the two variables would be included in the model, the other one bringing very little additional information. • Certain conditions are required for the validity of the model and the corresponding inference (similar to the simple linear regression ).

  42. Dummy variables • How can one take into account qualitative information in a regression? • Application: Test on two or more means

  43. Trick • If a qualitative variable takes two values, one defined one dummy variable taking values 0 or 1. • Examples: Sex: 1 if male, 0 otherwise Garage: 1 if garage, 0 if not.

  44. Trick (continued) • More generally, if a qualitative variable can take m values, one defines (m-1) dummy variables all taking values 0 or 1. • Example: Sex and job category (executive, white-collar, blue-collar) • X1 = 1 if male, 0 otherwise. • X2 = 1 si exe, 0 otherwise. • X3 = 1 si w-c, 0 otherwise.

  45. Example • One wants to explain the salary of an employee (Y) with the following variables: sex, job category and experience. • X1 = 1 if male, 0 otherwise. • X2 = 1 if exe, 0 otherwise. • X3 = 1 if w-c, 0 otherwise. • X4 = years of experience.

  46. Example (continued) • Regression model: Y = 0 + 1X1 + 2X2 + 3X3 + 4X4 +  Question: Interpret 0, 1, 2, 3 , 4 . How do know if women have a smaller salary?

  47. “P-value” for one-tailed tests in Excel. • The evaluation of the p-value of a “one-tailed” test hypothesis H1 is not given in general, only the p-value of a “two-tailed” test . For example, in regression, Excel calculates the “p-value” P corresponding to • H0 : bi = 0 vs H1 : bi ≠ 0 . • How can we calculate the p-value correponding to one-tailed hypotheses H1?

  48. Rules : • P: p-value for the two-tailed test. • If H1 is of the form bi > 0 and bi >0, then the “p-value” of the right-tailed is P/2. Otherwise it is 1- P/2. • If H1 is of the form bi < 0 and bi <0, then the “p-value” of the left-tailed is P/2. Otherwise it is 1- P/2. •  In other words, the one-tailed p-value is half of the two-tailed p-value when the estimated coefficient has the same sign as the coefficient in H1. Otherwise, it is 1- “p-value”/2.

More Related