Qualitative Independent Variables. Sometimes called Dummy Variables.
In the simple and multiple regression we have studied so far, the dependent variable, y, and the independent variable(s), x, have been quantitative variables. But regression can be used with other kinds of variables. We will study the case where the dependent variable, y, is quantitative; one (or more, in general) independent variable is quantitative; and one independent variable is qualitative. Remember that a qualitative variable is one whose different values are just categories. Some examples include gender and method of payment (cash, check, credit card).
An example. y = the repair time in hours. The company provides maintenance and it would like to understand why repairs take as long as they do. With an understanding of repair time, maybe it can schedule employee hours better or improve company performance in some other way. x1 = the number of months since the last repair service was performed. The idea is that the longer it has been since the last repair, the more that will need to be done. This is a quantitative variable. x2 = the type of repair service needed. In this example there are only two types of repairs: electrical and mechanical. This is a qualitative variable. So, the company has clients that need repairs and the company is exploring what accounts for the time it takes to make a repair.
On the next slide I have a graph where the two quantitative variables are on the axes. The two ovals represent the "cloud" of data points. Here the points suggest a positive relationship between months since last repair and repair time. Of course, we will have to test whether this is really the case, but the graph suggests it is. I have two ovals because it is thought that each type of repair may have a different impact on repair time. Each oval represents what is happening for one type of repair, and here I am suggesting that repair time differs between the two types. We will also do a test to see whether the different types of repair lead to different repair times.
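[Graph: repair time (vertical axis) against months since last repair (horizontal axis), with two ovals of data points, one for each type of repair.]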
The model. Here the regression model is y = B0 + B1x1 + B2x2. When we estimate the model we use data on y, x1 and x2. Here we make the data for x2 special. We will say that x2 = 0 if the data point is for a mechanical repair and x2 = 1 if the data point is for an electrical repair. Now, when we look at the model for the two types of repair we get the following. When x2 = 0, y = B0 + B1x1 + B2(0) = B0 + B1x1, and when x2 = 1, y = B0 + B1x1 + B2(1) = (B0 + B2) + B1x1. The impact of creating x2 as a 0, 1 variable is that when the value is 0 we have one line and when the value is 1 we have another line with a different intercept. The intercept is B0 for the mechanical repair and B0 + B2 for the electrical repair.
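To make the 0, 1 coding concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the records are made up (they are not the data behind the Excel printout), y is generated without noise from coefficients I picked (1, 2/5 and 6/5), and the helper names dummy, solve and fit are hypothetical. It fits the model by ordinary least squares through the normal equations, using exact fractions so the arithmetic is easy to check.

```python
from fractions import Fraction


def dummy(repair_type):
    """Code the qualitative variable: 0 = mechanical, 1 = electrical."""
    return {"mechanical": 0, "electrical": 1}[repair_type]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot entry is 1.
        m[col] = [v / m[col][col] for v in m[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]


def fit(rows, y):
    """Least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical records: (months since last repair, type of repair).
records = [(2, "electrical"), (6, "mechanical"), (8, "electrical"),
           (3, "mechanical"), (7, "electrical"), (9, "mechanical")]

# Assumed coefficients, used only to generate noise-free repair times.
TRUE_B0, TRUE_B1, TRUE_B2 = Fraction(1), Fraction(2, 5), Fraction(6, 5)
y = [TRUE_B0 + TRUE_B1 * months + TRUE_B2 * dummy(kind)
     for months, kind in records]

# Each row of the design matrix is [1, x1, x2], with x2 the 0, 1 variable.
rows = [[Fraction(1), Fraction(months), Fraction(dummy(kind))]
        for months, kind in records]
b0, b1, b2 = fit(rows, y)

print("mechanical line: intercept", b0, "slope", b1)
print("electrical line: intercept", b0 + b2, "slope", b1)
```

Because the made-up data contain no noise, the fit returns the generating coefficients, and the two printed lines share a slope and differ only in the intercept. With real data you would more likely use a regression routine from a statistics package or Excel, as in these slides.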
Getting and interpreting the results. The previous slide has the Excel printout for this regression model. The interpretation starts with the F test. The null hypothesis is that both B1 and B2 are equal to zero. Here the F stat is 21.357 with a p-value (Significance F) of .001. So we would reject the null with alpha as small as .001 (and certainly at alpha = .05) and go with the alternative that at least one of the betas is not equal to zero. In other words, as a package the x's exhibit a relationship with the y variable. The next step is to do the t tests on each slope, B1 and B2, separately (here too we tend to ignore the test on B0, because we typically do not have much data with all the x's = 0). Here the p-values on both slopes are less than .05, so we reject the null for each and conclude each variable has an impact on y.
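As a rough check on the F test, the overall F stat can be recovered from R square with the standard identity F = (R square / k) / ((1 - R square) / (n - k - 1)), where k is the number of x's and n is the number of data points. The sketch below is only an illustration. R square = .8592 is the value reported for this regression and k = 2, but the sample size is not stated in these slides, so n = 10 is an assumption, picked because it makes the identity land close to the reported F stat of 21.357. The function name f_statistic is hypothetical.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic implied by R square for a regression with an intercept."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


# R square from the printout; n_obs = 10 is an ASSUMPTION, not given in the slides.
f_stat = f_statistic(r_squared=0.8592, n_obs=10, n_predictors=2)
print(round(f_stat, 2))
```

The small gap between this value and 21.357 comes from R square being rounded to four decimal places.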
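[Graph: repair time (vertical axis) against months since last repair (horizontal axis), with two parallel fitted lines. Electrical: y = (.9305 + 1.2627) + .3876x1, with intercept .9305 + 1.2627. Mechanical: y = .9305 + .3876x1, with intercept .9305.]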
On the previous slide I reproduced the graph I had before, and I added the equations for repair time under each value of x2. When x2 = 0 we have the line for mechanical repairs. When x2 = 1 we have the line for electrical repairs. The only difference between the two lines is the intercept; the slope of each line is the same. This means that months since the last repair has the same impact on repair time under either type of repair. Since b2 = 1.2627 (really, since we rejected the null that B2 = 0), the electrical line has a higher intercept. We can use each equation to predict repair time given the value of months since last repair and the type of repair. Of course, if the type is mechanical we use the mechanical line, and if the type is electrical we use the electrical line. The next thing we would do is evaluate R square. Here the value is .8592, which indicates that just over 85% of the variation in y is explained by the x's.
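Prediction with the two lines can be sketched in a few lines of Python. The coefficients are the estimates quoted on the graph (.9305, .3876 and 1.2627); the function name predict_repair_hours and the 4-month example are my own choices for illustration.

```python
# Estimates from the fitted equations on the graph.
B0_HAT = 0.9305   # intercept of the mechanical (base) line
B1_HAT = 0.3876   # slope on months since last repair, shared by both lines
B2_HAT = 1.2627   # shift in the intercept for electrical repairs


def predict_repair_hours(months_since_last_repair, repair_type):
    """Predicted repair time in hours for a given repair type."""
    x2 = {"mechanical": 0, "electrical": 1}[repair_type]
    return B0_HAT + B1_HAT * months_since_last_repair + B2_HAT * x2


# Example: a machine last serviced 4 months ago (hypothetical input).
mechanical_hours = predict_repair_hours(4, "mechanical")
electrical_hours = predict_repair_hours(4, "electrical")
print(round(mechanical_hours, 4), round(electrical_hours, 4))
print(round(electrical_hours - mechanical_hours, 4))
```

Whatever number of months you plug in, the gap between the two predictions stays at 1.2627 hours, because the lines are parallel.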
The qualitative variable. In our example we had a qualitative variable with two categories. Note we added 1 x variable for this 1 qualitative variable. The reason is that the variable had 2 categories. Now if the 1 qualitative variable has 3 categories we would have to have 2 x variables. Say we had mechanical, electrical and industrial repair types. We would need x2 and x3 variables, in addition to months since last repair, x1. With 3 categories we would have 3 lines. When x2 = 0 and x3 = 0 the intercept would be B0 for the mechanical line. When x2 = 1 and x3 = 0 the intercept would be B0 + B2 for the electrical line (assuming the tests had us reject the null). When x2 = 0 and x3 = 1 the intercept would be B0 + B3 for the industrial line.
In general, if the 1 qualitative variable has k categories, we add k-1 x's. When all of these x's are zero we have intercept B0, and the line represents the equation for 1 of the categories. The other x's then account for the change from B0 for each of the other k-1 categories.
Summary. 1 qualitative variable would have k lines associated with it (assuming the tests reject Ho), and we add k-1 x's of the 0, 1 type to account for all k categories. 1 category is made the "base" category and its line will have intercept B0. Each of the other categories will have intercept B0 + Bt, where the subscript t is different for each of the other categories.
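The k-1 rule can be sketched as a small Python helper. The function name dummy_columns is hypothetical, and the category list reuses the mechanical, electrical and industrial example, with mechanical as the base category.

```python
def dummy_columns(value, categories, base):
    """Return the k-1 values of the 0, 1 variables for one data point.

    The base category gets all zeros; every other category gets a single 1
    in the position of its own x variable.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    if base not in categories:
        raise ValueError(f"unknown base category: {base!r}")
    return [1 if value == category else 0
            for category in categories if category != base]


repair_types = ["mechanical", "electrical", "industrial"]   # k = 3 categories

for repair_type in repair_types:
    # Prints the (x2, x3) pair for each category, with mechanical as the base.
    print(repair_type, dummy_columns(repair_type, repair_types, base="mechanical"))
```

Adding a third 0, 1 variable for the base category as well would make the x's add up to the intercept column, so the separate intercepts could not be estimated. That is why only k-1 are used.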