# MBA Statistics 51-651-00 COURSE #4 - PowerPoint PPT Presentation

1 / 62

MBA Statistics 51-651-00 COURSE #4. Simple and multiple linear regression What should be the sales of ice cream?. Example:. Before beginning building a movie theater, one must estimate the daily number of people entering the building. How can we estimate it ?

## Related searches for MBA Statistics 51-651-00 COURSE #4

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

MBA Statistics 51-651-00 COURSE #4

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

## MBAStatistics 51-651-00COURSE #4

Simple and multiplelinear regression

What should be the sales of ice cream?

### Example:

• Before beginning building a movie theater, one must estimate the daily number of people entering the building.

• How can we estimate it?

• There are 2 millions individuals in the city.

### Possible solutions:

• One could realize a local market study. However it is often imprecise, specially for new projects.

• One could get data from similar projects in other cities.

### Case study: Ice Cream Sales

• The file icecream.xls contains pairs of data representing ice cream sales and temperature recorded that day, for 30 days.

• Is there a relation between temperature and sales?

• Can temperature be used to predict ice cream sales?

• If so what’s the prediction when the temperature is 25?

### Introduction

• One of the principle objectives of statistics is to explain the variability that we observe in data.

• Linear regression (or linear models) is a statistical tool MUCH USED to study the presence of a linear relation between

• a dependent variable Y (quantitative and continuous)

• and one or more independent variables X1, X2, …, Xp(qualitative and/or quantitative), called independent or explanatory variables.

• For example, a manager could be interested in seeing if he could explain a good part of the variability that he observes in sales in his differents branches (dependant variable Y) in the last 12 months, by the area, number of employees, number of payed overtime hours, quality of customer service, number of promotions, etc. ( independent or explanatory variables).

### A regression model can be used to answer one of the following three objectives:

• Describe data coming from non experimental studies i.e. we observe reality as it is.

• Examine the hypothesis (data coming from controled experimental studies).

• Predict (if we like to take risks!!).

### Example:

• We are interested in knowing what are the important factors that influence or determine the value of a property and we want to build a model that would help us evaluate this value using certain factors.

• To do this, we have obtained the total value for a sample of 79 properties in a given region. The following variables have also been collected for each property:

### Brief glimpse of the data file:house.xls

# of

square

feet

total land first outdoor heating

OBS value value # of acres floor condition type

1 199657 63247 1.63 1726 Good NatGas

2 78482 38091 0.495 1184 Good NatGas

3 119962 37665 0.375 1014 Good Electric

4 116492 54062 0.981 1260 Average Electric

5 131263 61546 1.14 1314 Average NatGas

...

78 253480 57948 0.862 1720 Good Electric

79 257037 57489 0.95 2004 Excellnt Electric

# of # of # of completed # of non completed # of

OBS rooms bedroom bathrooms bathrooms fire-places GARAGE

1 8 4 2 1 2 Garage

2 6 2 1 0 0 NoGarage

3 7 3 2 0 1 Garage

4 6 3 2 0 1 Garage

5 8 4 2 1 2 NoGarage

...

78 10 5 5 1 1 Garage

79 9 4 2 2 2 Garage

### The Pearson correlation coefficient r is used to measure the intensity of the linear relation between two quantitative variables.

• The correlation coefficient r will take its values between -1 and 1.

• If a perfect linearrelationexistbetweenX and Y, then r= 1 (r =1 if X and Y vary in the same direction and r = -1 if X varies in the opposite direction of Y).

• If r = 0, there is no linear link between X and Y.

• The more the r value furthers from 0 to get closer to 1, the more the linear link intensity between X and Y becomes larger.

Y |

6.5 | * r = 0.035Y | r = 1

| |

| 31 | *

6.0 | * * 29 | *

| 27 | *

| 25 | *

5.5 | * * 23 | *

| 21 | *

| 19 | *

5.0 | * 17 | *

| 15 | *

| 13 | *

4.5 | * * * 11 | *

| |

| ------------------------------------

4.0 | * * 4 5 6 7 8 9 10 11 12 13 14

|-----------------------------------

4 5 6 7 8 9 10 11 12 13 14 X

X

Y | r = -1

|

-8.0 | *

-10.5 | *

-13.0 | *

-15.5 | *

-18.0 | *

-20.5 | *

-23.0 | *

-25.5 | *

-28.0 | *

-30.5 | *

-33.0 | *

|

----------------------------------

4 5 6 7 8 9 10 11 12 13 14

X

Descriptive statistics

Variable N Mean Median Sta.Deviation Minimum Maximum

Total 79 187253 156761 84401 74365 453744

Land 79 65899 59861 22987 35353 131224

Acre 79 1.579 1.040 1.324 0.290 5.880

Sq.Feet 79 1678 1628 635 672 3501

Rooms 79 8.519 8.000 2.401 5 18

Bedrooms 79 3.987 4.000 1.266 2 8

C.Bathro 79 2.241 2.000 1.283 1 7

Bathro 79 0.7215 1.000 0.715 0 3

Fire-pl. 79 1.975 2.000 1.368 0 7

Pearson Correlation Coefficients

Total Land Acre Sq.Feet Rooms Bedroom C.Bathro Bathro

Land 0.815

Acre 0.608 0.918

Sq.Feet 0.767 0.516 0.301

Rooms 0.626 0.518 0.373 0.563

Bedrooms 0.582 0.497 0.382 0.431 0.791

C.Bathro 0.626 0.506 0.376 0.457 0.479 0.586

Bathro 0.436 0.236 0.074 0.354 0.489 0.166 0.172

Fire-pl. 0.548 0.497 0.391 0.365 0.394 0.400 0.486 0.386

### BECAREFULL!! it is important to interpret the correlation coefficient with the graph.

r = 0.816 in all cases below

12.5 | 10 |

| | *

| | * * *

| * | *

10.0 | * 8 | * *

| | *

Y1 | * Y2 |

| * * |

7.5 | * * 6 | *

| * |

| |

| * | *

5.0 | * 4 |

| * |

| | *

| |

2.5 | 2 |

----------------------------------- ------------------------------------

4 5 6 7 8 9 10 11 12 13 14 4 5 6 7 8 9 10 11 12 13 14

X X

15.0 | Y4 |

| 12.5 | *

| |

| |

12.5 | * |

| |

Y3 | 10.0 |

| |

10.0 | | *

| | *

| * | *

| * 7.5 | *

7.5 | * * | *

| * * | *

| * * | *

| * * | *

5.0 | 5.0 |

----------------------------------- -----------------------------

4 5 6 7 8 9 10 11 12 13 14 8 19

X X

### Simple linear regression

• To describe a linear relation between two quantitative variables or to be able to predict Y for a given value of X, we use a regression line:

Y = 0 + 1X + 

• Since any statistical model is only an approximation (we hope the best possible !!) and because the linear link is never perfect , in the model, there is always an error, noted .

• If there was a perfect linear relation between Y and X, the error term would always be equal to 0, and all the variability of Y would be explained by the independent variable X.

• So, for a given value of X, we would like to estimate Y.

• Thus, with the help of the data sample we will estimate the regression model parameters 0 and 1 in order to minimize the residuals (errors) sum of squares.

• The squared correlation coefficient is called the coefficient of determination and the percentage of the variability of Y explained by X:

R2 = 1 - (n-2)/(n-1){Se /Sy}2,

where Se is the standard deviation of the errors and Sy is the standard deviation of Y.

• We can also use the adjusted coefficient of determination to indicate the percentage of the variability of Y explained by X:

R2ajusted = 1 - {Se/Sy}2 .

### Simple linear regression example:

MODEL 1.

Regression Analysis

The regression equation is

Total = 16209 + 102 Sq.Feet

Predictor Coef StDev T P

Constant 16209 17447 0.93 0.356

Sq.Feet 101.939 9.734 10.47 0.000

S = 54556 R-Sq = 58.8% R-Sq(adj) = 58.2%

Analysis of Variance

Source DF SS MS F P

Regression 1 3.26460E+11 3.26460E+11 109.68 0.000

Residual Error 77 2.29181E+11 2976374177

Total 78 5.55641E+11

MODEL 2.

The regression equation is : Total = - 347 + 22021 Rooms

Predictor Coef StDev T P

Constant -347 27621 -0.01 0.990

Rooms 22021 3122 7.05 0.000

S = 66210 R-Sq = 39.3% R-Sq(adj) = 38.5%

Analysis of Variance

Source DF SS MS F P

Regression 1 2.18090E+11 2.18090E+11 49.75 0.000

Residual Error 77 3.37551E+11 4383775699

Total 78 5.55641E+11

__________________________________________________________________

MODEL 3.

The regression equation is : Total = 32428 + 38829 Bedrooms

Predictor Coef StDev T P

Constant 32428 25826 1.26 0.213

Bedrooms 38829 6177 6.29 0.000

S = 69056 R-Sq = 33.9% R-Sq(adj) = 33.1%

Analysis of Variance

Source DF SS MS F P

Regression 1 1.88445E+11 1.88445E+11 39.52 0.000

Residual Error 77 3.67196E+11 4768775127

Total 78 5.55641E+11

• Model 1:

• total value = 16209 + 102*( # ofsquared feet ).

• R2 = 58.8%. Thus 58.8% of the variability of the total value is explained by the # ofsquared feet .

• Model 2:

• total value = -347 + 22021*(#of rooms ).

• R2 = 39.3%. Thus 39.3% of the variability of the total value is explained by the #of rooms .

• Model 3:

• total value = 32428 + 38829 *(#of bedrooms ).

• R2 = 33.9%. Thus 33.9% of the variability of the total value is explained by the #of bedrooms .

### Which one of the 3 previous models would you choose and why?

Model 1 because it has the largest value of R2.

### 1- confidence interval for the mean of the values ofYfor a specific value ofX:

• For model 1 and a value of X=1500 sq.ft we obtain the following point estimation :

• est. total value = 16 209 + 102*1500 = 169 117\$

• 95% confidence interval for the mean of the total value for properties of 1500 sq.ft :

[156 418, 181 817]

as calculated byCI-regression.xls

### 1- confidence interval for a new value ofY(prediction) being given a specific value ofX:

• For model 1 and a value of X=1500 sq.ft we obtain the following point estimation :

• est.total value = 16 209 + 101.939*1500 = 169 117\$

• 95% confidence interval for a predicted total value when the area of the first floor is 1500 sq.ft :

[59 742, 278 492]

• The confidence interval for a predicted value is always larger than for the mean of the value of Y for a specific X .

### Inference on regression model parameters:

• If there is no linear link between Y and X then 1 = 0. So, we want to examine the following hypothesis :

• H0 : 1 = 0 vs H1 : 1  0

• We will reject H0 when the ‘ p-value ’ is too small

• This test will be valid if

• the relation between X and Y is linear

• the data are independent

• the variance of Y is the same for every value of X.

• Y has a normal distribution for every value of X or the sample size n is large.

### Multiple linear regression

• It is more likely possible that the variability of the dependent variable Y will be explained not only by one independent variableX, but rather by a linear combination of several independent variables X1, X2, …, Xp.

• In this case, the multiple regression model is given by:

Y = 0 + 1X1 + 2X2 + … + pXp + 

• Also, using the sample data, we will estimate the regression model parameters 0, 1, …, p in order to minimize the residuals (errors) sum of squares.

• The multiple correlation coefficient R2, also called the coefficient of determination, represents the percentage of the variability of Y explained by the independent variables X1, X2, …, Xp.

• In the model, when we add one or more independent variables, R2 increases.

• The question is to know if R2 increases to a significant degree.

• Note that we cannot have more independent variables in the model that there are observations in the sample. (general rule: n  5p).

### Example:

MODEL 1.

The regression equation is

Total = - 89131 + 3.05 Land - 20730 Acre + 43.3 Sq.Feet - 4352 Rooms

+ 10049 Bedroom + 7606 C.Bathro + 18725 Bathro + 882 Fire-pl.

Predictor Coef StDev T P

Constant -89131 18302 -4.87 0.000

Land 3.0518 0.5260 5.80 0.000

Acre -20730 7907 -2.62 0.011

Sq.Feet 43.336 7.670 5.65 0.000

Rooms -4352 3036 -1.43 0.156

Bedroom 10049 5307 1.89 0.062

CBathro 7606 3610 2.11 0.039

Bathro 18725 6585 2.84 0.006

Fire-pl. 882 3184 0.28 0.783

S = 29704 R-Sq = 88.9% R-Sq(adj) = 87.6%

Analysis of Variance

Source DF SS MS F P

Regression 8 4.93877E+11 61734659810 69.97 0.000

Residual Error 70 61763515565 882335937

Total 78 5.55641E+11

MODEL 2

Regression Analysis

The regression equation is

Total = - 97512 + 3.11 Land - 21880 Acre + 40.2 Sq.Feet

+ 4411 Bedroom + 8466 C.bathro + 14328 Bathro

Predictor Coef StDev T P

Constant -97512 17466 -5.58 0.000

Land 3.1103 0.5236 5.94 0.000

Acre -21880 7884 -2.78 0.007

Sq.Feet 40.195 7.384 5.44 0.000

Bedroom 4411 3469 1.27 0.208

C.bathro 8466 3488 2.43 0.018

Bathro 14328 5266 2.72 0.008

S = 29763 R-Sq = 88.5% R-Sq(adj) = 87.6%

Analysis of Variance

Source DF SS MS F P

Regression 6 4.91859E+11 81976430646 92.54 0.000

Residual Error 72 63782210167 885864030

Total 78 5.55641E+11

MODEL 3

Regression Analysis

The regression equation is

Total = - 90408 + 3.20 Land - 22534 Acre + 41.1 Sq.Feet

+ 10234 C.bathro + 14183 Bathro

Predictor Coef StDev T P

Constant -90408 16618 -5.44 0.000

Land 3.2045 0.5205 6.16 0.000

Acre -22534 7901 -2.85 0.006

Sq.Feet 41.060 7.383 5.56 0.000

C.bathro 10234 3213 3.19 0.002

Bathro 14183 5287 2.68 0.009

S = 29889 R-Sq = 88,3% R-Sq(adj) = 87,5%

Analysis of Variance

Source DF SS MS F P

Regression 5 4.90426E+11 98085283380 109.80 0.000

Residual Error 73 65214377146 893347632

Total 78 5.55641E+11

### Model without the area of the land ( # of acres ) because of the multicolinearity with the land value.

MODEL 4

The regression equation is

Total = - 55533 + 1.82 Land + 49.8 Sq.Feet + 11696 C.bathro

+ 18430 Bathro

Predictor Coef StDev T P

Constant -55533 11783 -4.71 0.000

Land 1.8159 0.1929 9.42 0.000

Sq.Feet 49.833 7.028 7.09 0.000

C.bathro 11696 3321 3.52 0.001

Bathro 18430 5312 3.47 0.001

S = 31297 R-Sq = 87.0% R-Sq(adj) = 86.3%

Analysis of Variance

Source DF SS MS F P

Regression 4 4.83160E+11 1.20790E+11 123.32 0.000

Residual Error 74 72481137708 979474834

Total 78 5.55641E+11

### Which one of the 4 previous models would you choose and why?

• Probably model 4because all the independent variables are significant at the 5% level (i.e. for each  in the model, p-value < 5%) and although R2 is smaller, it is just marginally smaller. Moreover, all the model coefficients make « sense »!

• In model 1 , the variables ‘ # of rooms  ’ and ‘ # of fire-place’ are not statistically significant at the 5% level (p-value > 5%). The variable ‘ # of bedrooms ’ is at the limit with a p-value = 0.0624.

### Which one of the 4 previous models would you choose and why?(continued)

• In model 2 the variable ‘ # of bedroom ’ is not statistically significant at the 5% level.

• In model 3 (and the previous models), the variable ‘ # of acres ’ coefficient is negative which is contrary to « common sense » and to what we observed in the scatter plot and the positive Pearson correlation coefficient (r = 0.608).

• In models 1 to 3,the negative coefficient for the variable ‘ # of acres ’ is due to the fact that there is a strong linear relation between the value of the land and the area of the land (r = 0.918): multicolinearity problem.

### Multicolinearity

• If two or more explanatory variables are strongly correlated (> 0.85 in absolute value), one says that there is multicolinearity. It has an influence on the estimation of parameters in the model.

• If two explanatory variables are highly correlated, then can get rid of one of these variables. Because of the strong correlation, the contribution of the other variable is not significant.

• The correlation between several pairs of variables can be calculated in Excel using correlation in the Data Analysis toolbox.

### How can we choose a particular linear regression model among all the possible ones?

There are several techniques:

• Step by step selection by adding one variable at a time, starting with the most significant one (stepwise, forward).

• Selection starting from the model in which all the variables are included and removing one variable at a time starting with the least significant (backward).

• Construct all possible models and choose the best subset of variables according to certain specific criteria (ex: adjustedR2, Cp de Mallow.)

### Example of selection among the best subsets:

Best Subsets Regression : Response is Total

B C

S e b B F

q R d a a i

L A f o r t t r

a c e o o h h e

Adj. n r e m o r r p

Vars R-Sq R-Sq C-p s d e t s m o o l

1 66.4 65.9 136.8 49262 X

1 58.8 58.2 184.7 54556 X

1 39.3 38.5 307.6 66210 X

2 82.7 82.2 35.9 35564 X X

2 78.8 78.3 60.3 39343 X X

2 74.4 73.7 88.1 43244 X X

3 85.6 85.0 19.5 32637 X X X

3 84.8 84.2 24.5 33521 X X X

3 84.8 84.2 24.9 33591 X X X

4 87.1 86.4 12.2 31115 X X X X

4 87.0 86.3 13.1 31297 X X X X

4 86.6 85.9 15.2 31682 X X X X

5 88.3 87.5 6.9 29889 X X X X X

5 87.6 86.7 11.2 30744 X X X X X

5 87.4 86.5 12.4 30979 X X X X X

6 88.5 87.6 7.3 29763 X X X X X X

6 88.3 87.3 8.6 30030 X X X X X X

6 88.3 87.3 8.9 30096 X X X X X X

7 88.9 87.8 7.1 29510 X X X X X X X

7 88.6 87.4 9.1 29924 X X X X X X X

7 88.3 87.2 10.6 30240 X X X X X X X

8 88.9 87.6 9.0 29704 X X X X X X X X

### Selection of the model without the variable# of acres

Best Subsets Regression : Response is Total

B C

S e b B F

q R d a a i

L f o r t t r

a e o o h h e

Adj. n e m o r r p

Vars R-Sq R-Sq C-p s d t s m o o l

1 66.4 65.9 120.6 49262 X

1 58.8 58.2 164.9 54556 X

1 39.3 38.5 278.3 66210 X

2 82.7 82.2 27.6 35564 X X

2 72.7 71.9 86.0 44704 X X

2 72.5 71.8 86.8 44813 X X

3 84.8 84.2 17.2 33521 X X X

3 84.8 84.2 17.6 33591 X X X

3 84.0 83.3 22.3 34467 X X X

4 87.0 86.3 6.9 31297 X X X X

4 86.1 85.3 12.1 32352 X X X X

4 85.3 84.5 16.5 33226 X X X X

5 87.3 86.4 6.9 31100 X X X X X

5 87.0 86.1 8.5 31439 X X X X X

5 87.0 86.1 8.9 31509 X X X X X

6 87.8 86.8 6.1 30707 X X X X X X

6 87.3 86.3 8.7 31264 X X X X X X

6 87.0 85.9 10.5 31656 X X X X X X

7 87.8 86.6 8.0 30908 X X X X X X X

### The selection of the best model is done according to the combination:

• The greatest value of R2 adjusted for the number of variables in the model.

• The smallest value of Cp .

• For the models with R2 adjusted and comparable Cp, we will choose the model which has the most « common sense » according to the experts in the field.

• For the models with R2 adjusted and comparable Cp, the model with the independent variables that are the easiest and least expensive to measure.

• The model validity.

1- confidence interval forYmean and a new value of Y(prediction) being given a specific value combinationforX1, X2, …, Xp.

• For model 4 and property with a land= 65 000\$, sq.ft = 1500, 2 completed bathrooms and 1 not-completed, we obtain the following point estimation :

• est.total value = -55 533 + 1.816*65 000 + 49.833*1 500 + 11 696*2 + 18 430*1 = 179 074\$

• 95% confidence interval for the mean of the total value:

[170 842, 187 306]

• 95% confidence interval for a total predicted value :

[116 173, 241 974]

### Notes:

• For a 1500 sq.ft property, the multiple regression model gives a smaller 95% confidence intervals than the simple regression model.

• Therefore the addition of several other variables in the model helped to better explain the total value variability and to improve our estimations.

• If two or more independent variables are correlated we will say that there is multicolinearity. This can influence the value of the parameters in the model .

• Also, if two independent variables are strongly correlated then only one of the two variables would be included in the model, the other one bringing very little additional information.

• Certain conditions are required for the validity of the model and the corresponding inference (similar to the simple linear regression ).

### Dummy variables

• How can one take into account qualitative information in a regression?

• Application: Test on two or more means

### Trick

• If a qualitative variable takes two values, one defined one dummy variable taking values 0 or 1.

• Examples:

Sex: 1 if male, 0 otherwise

Garage: 1 if garage, 0 if not.

### Trick (continued)

• More generally, if a qualitative variable can take m values, one defines (m-1) dummy variables all taking values 0 or 1.

• Example: Sex and job category (executive, white-collar, blue-collar)

• X1 = 1 if male, 0 otherwise.

• X2 = 1 si exe, 0 otherwise.

• X3 = 1 si w-c, 0 otherwise.

### Example

• One wants to explain the salary of an employee (Y) with the following variables: sex, job category and experience.

• X1 = 1 if male, 0 otherwise.

• X2 = 1 if exe, 0 otherwise.

• X3 = 1 if w-c, 0 otherwise.

• X4 = years of experience.

### Example (continued)

• Regression model:

Y = 0 + 1X1 + 2X2 + 3X3 + 4X4 + 

Question: Interpret 0, 1, 2, 3 , 4 .

How do know if women have a smaller salary?

### “P-value” for one-tailed tests in Excel.

• The evaluation of the p-value of a “one-tailed” test hypothesis H1 is not given in general, only the p-value of a “two-tailed” test . For example, in regression, Excel calculates the “p-value” P corresponding to

• H0 : bi = 0 vs H1 : bi ≠ 0 .

• How can we calculate the p-value correponding to one-tailed hypotheses H1?

### Rules :

• P: p-value for the two-tailed test.

• If H1 is of the form bi > 0 and bi >0, then the “p-value” of the right-tailed is P/2. Otherwise it is 1- P/2.

• If H1 is of the form bi < 0 and bi <0, then the “p-value” of the left-tailed is P/2. Otherwise it is 1- P/2.

•  In other words, the one-tailed p-value is half of the two-tailed p-value when the estimated coefficient has the same sign as the coefficient in H1. Otherwise, it is 1- “p-value”/2.

### Question:

• One wants to know if having a garage increase the total value of the property. The hypotheses to be tested should be:

H0: bgarage 0 vs H1: bgarage> 0

• Since bgarage = 22372 > 0, the p-value corresponding to H1: bgarage> 0 is 0.058/2 = 0.029 < 0.05. The anwser is yes because we accept H1.

• Does the decision depend on coding?

• If the dummy is defined by 0 if there is a garage and 1 otherwise, we would have got:

Totale = - 72080 + 1,83 Terrain + 47,2 Pied2

+ 11535 SbainsC + 18899 Sbains - 22372 Garage

Predictor Coef StDev T P

Constant -72080 14175 -5,08 0,000

Terrain 1,8342 0,1892 9,69 0,000

Pied2 47,175 7,013 6,73 0,000

SbainsC 11535 3256 3,54 0,001

Sbains 18899 5211 3,63 0,001

Garage -22372 11116 -2,01 0,058

S = 30671 R-Sq = 87,6% R-Sq(adj) = 86,8%

• In that case, the right choice for hypotheses would have been:

H0: bgarage≥ 0 vs H1: bgarage< 0

• The corresponding p-value stays 0.029 = 0.058/2 because bgarage = -22372 < 0 has the same sign as bgarage in H1.

### Comparison of several means

• Suppose one wants to compare the respective means of a quantitative variable Y for two groups: m1 = mean of group 1, m2 = mean of group 2.

• One can use regression by defining X = 1 for group 1, and X= 0 for group 2.

• In this case, b = m1 –m2.

• Hypothesis H1 : m1> m2 correspond to H1 : b > 0.

• Hypothesis H1 : m1< m2 correspond to H1 : b < 0.

• Hypothesis H1 : m1 ≠m2 correspond to H1 : b ≠ 0.

### Example

• A manager has some doubts on the (positive) effects of a course in order to improve the speed a given task is performed by employees.

• To confirm his belief, he asked a technician to choose at random 10 employees and to measure the time (hours) to complete a task.

• Then the same employees attend the course.

• After the course the employees had to realize a similar task.

• The results are summarized in the following table:manager.xls

### Questions:

• a) Should the company maintain the formation program? Take a = 5%.

• b) The technician in charge of the measurements forget to identify employees on the measurements form. What is the conclusion using that data set?

• Unfortunately, case b is based on a real case.

### Solution

• For situation a), data are paired and we have to check if the differences « Before – After » are significantly positive. The p-value is 0.0003 < 0.05 = a.

• One accepts H1 and the manager conclude that the program should be maintained.

• In the second case, data are not paired. One can use regression with

• Y = time of execution, and

• X = 1 for measurements before the course and X = 0 for measurements after the course.

In that case, the right choice for H1 is:

• H1: b > 0

• Results are given by:

• Since H1 : b > 0(which is equivalent to H1 : mbefore> mafter ), and

• b = 0.244 > 0, the p-value is 0.201/2 = 0.1005 > 0.05.

• One accepts H0, so the formation program shouldn’t be maintained.

• This is a very good example of the consequence of the greater variability for two samples compared to a paired sample.

### Remark: Comparing several means

• If one needs to compare the means of k groups, for some variable Y, one can use also regression.

• For i=1, 2, …, k-1, set:

• Xi = 1 for group i, 0 otherwise.

• Then

• 0 = mean of group k = k and

• i = i - k, 1  i  k-1.

• Therefore, the regression test where H0 is given by

H0: 1 = 2 = ... = k-1 = 0

• is equivalent to a test where H0 is given by

H0: 1 = 2 = ... = k

• If H0 is rejected, then we conclude that at least two means are different.

• The p-value is the Significance of F, found in the ANOVA table.