
Lecture 13: Categorical Variables


Presentation Transcript


  1. Lecture 13: Categorical Variables February 26th, 2014

  2. Question • A simple regression model is to be expanded by the addition of a new explanatory variable. Which of the following would be an indication of collinearity in the new model? • The increase in the value of R2 is much more than expected. • The variance inflation factors (VIFs) for the explanatory variables are near a value of one. • The standard error of the partial slope of the original variable is larger than the standard error of the marginal slope for this variable. • When compared to the marginal slope, the partial slope of the original variable has changed very little or not at all. • More than one of the above

  3. Administrative • Mid-semester grades posted after Spring Break • Have a good break and don’t think about Regression • Today’s quiz probably not included. • Workshop after the break. • Non-punitive quiz at the end of Monday’s class. So pay attention and follow along. • Mid-semester FCEs?

  4. Last time: • Collinearity • VIFs • Two ways to calculate • Until you’re confident that you’re inverting the matrix correctly, do at least one by hand and then check by inverting the correlation matrix of the explanatory variables.
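A quick way to check the two calculations against each other, sketched in Python (the data and the column names x1 and x2 are made up for illustration; substitute your own explanatory variables):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical explanatory variables with some built-in correlation
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100)})
X["x2"] = 0.8 * X["x1"] + rng.normal(scale=0.5, size=100)

# Way 1: regress each explanatory variable on the others; VIF = 1 / (1 - R^2)
vif_by_regression = {}
for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2 = sm.OLS(X[col], others).fit().rsquared
    vif_by_regression[col] = 1.0 / (1.0 - r2)

# Way 2: invert the correlation matrix of the explanatory variables;
# the diagonal of the inverse holds the VIFs
vif_from_corr = np.diag(np.linalg.inv(X.corr().values))

print(vif_by_regression)
print(dict(zip(X.columns, vif_from_corr)))  # should match Way 1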

  5. Collinearity Signs of Collinearity: • R2 increases less than we would expect • Slopes of correlated explanatory variables in the model change dramatically • The F-statistic is more impressive than the individual t-statistics. • Standard errors for partial slopes are larger than those for marginal slopes • VIFs increase • No hard and fast rules for VIF thresholds. Some people say 5, some say 10.

  6. Collinearity • Perfect collinearity: is it possible? • Any examples? • Definitely possible; you need to make sure you don’t include a perfectly collinear relationship by accident: • Eg: imagine SAT total score = SAT Math + SAT Writing + SAT CR • Including all 4 variables as explanatory variables will be a perfectly collinear relationship (3 of them define the 4th) • Why is this a problem? • We can’t estimate all 4 coefficients at once; the model isn’t identified. Multiple sets of coefficients produce the same answer. • Easier mistake to make (and more common) with categorical explanatory variables.
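To see the SAT example concretely, a small numeric sketch (the scores are randomly generated; only the exact linear relationship among the columns matters):

import numpy as np

rng = np.random.default_rng(1)
math = rng.integers(400, 800, size=50)
writing = rng.integers(400, 800, size=50)
cr = rng.integers(400, 800, size=50)
total = math + writing + cr           # an exact linear combination of the other three

# Design matrix with an intercept and all four score variables
X = np.column_stack([np.ones(50), math, writing, cr, total])
print(np.linalg.matrix_rank(X))       # 4, not 5: one column is redundant
# X'X is singular, so the least-squares solution is not unique;
# the model with all four scores is not identified.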

  7. Collinearity Remedies for collinearity: • Remove redundant explanatory variables • Re-express explanatory variables • Eg: use the average of (Market %change + Dow %change) as an alternative explanatory variable • Do nothing • Not a joke, but only if the explanatory variables are sensible estimates. Realize that some collinearity will exist.

  8. Removing Explanatory Vars • After adding several explanatory variables to a model, some of those added and some of those originally present may not be statistically significant. • Remove those variables for which both statistics and substance indicate removal (e.g., remove Dow % Change rather than Market % Change).

  9. Multiple Regression: Choosing Independent Vars Several kinds of Specification Error (error in specifying the model): • Not including a relevant variable: omitted variable bias • Could lead to entire regression equation being suspect; might positively or negatively bias estimates, depending on the correlations with the omitted variable. • Including a redundant variable: • Less precise estimates, increased collinearity. Lower adjusted R2 • Incorrect functional form (non-linear) • Already dealt with this to some degree. • Simultaneity / endogeneity bias • More on this when we come to causality. Theory, not statistical fit, should be the most important criterion for the inclusion of a variable in a regression equation.

  10. Choosing Independent Vars Choice of explanatory variables: • Causality language helps (jumping ahead slightly) • Imagine we're really interested in one coefficient in particular, the “treatment” variable, but want to build a model estimating a dependent variable:

  11. Choosing Independent Vars Relationships between variables: Type A: Affects the dependent variable but uncorrelated with the treatment

  12. Choosing Independent Vars Relationships between variables: Type B: Affects the dependent variable but correlated with the treatment due to a common cause.

  13. Choosing Independent Vars Relationships between variables: Type C: Affects the dependent variable but correlated with the treatment by chance.

  14. Choosing Independent Vars Relationships between variables: Type D: Affects the dependent variable directly but also indirectly via the treatment variable.

  15. Choosing Independent Vars Relationships between variables: Type E: Affects the dependent variable directly but is itself influenced by the treatment variable. Problematic variable! Don't include it in the model.

  16. Choosing Independent Vars When deciding whether to include a variable: theory is key. If you have a good understanding of the theoretical relationships between the variables (often we do): • Include types A-D • Avoid including type E. • Also known as a “post-treatment” variable. • Even if including it increases your R2 and/or lowers the standard error of the regression, avoid it. Including it will bias the estimate on the treatment variable.

  17. Transformed variables What about transformed explanatory and response variables? • We can still transform independent or dependent variables. • Same rules as before. • Some independent (explanatory) variables could be transformed while others are not. • Be cautious – the same caveats about partial vs marginal slopes apply.
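For instance, a transformation can be applied to one explanatory variable inside the model while the others enter untouched. A minimal sketch with made-up data (the variable names spending, income, and age are placeholders):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: spending depends on log(income) and (untransformed) age
rng = np.random.default_rng(2)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.5, size=200),
                   "age": rng.integers(20, 65, size=200)})
df["spending"] = (500 + 300 * np.log(df["income"]) + 12 * df["age"]
                  + rng.normal(scale=200, size=200))

# np.log() inside the formula transforms income only; age enters as-is
fit = smf.ols("spending ~ np.log(income) + age", data=df).fit()
print(fit.params)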

  18. Categorical Variables Let’s return to simple regression for a moment: • What is the meaning of the intercept in the following fitted model? • Estimated Earnings = 1451.36 – 251.47 Female • Female = 1 if the respondent is female, = 0 if male. • Recall what regression is: E(Y|X = x) • Expected Earnings for males = 1451.36. • For females = 1451.36 – 251.47 = 1199.89
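The same interpretation can be reproduced on made-up data (the simulated numbers below will not match the 1451.36 and –251.47 from the slide; they only illustrate that the intercept equals the mean of the group coded 0):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical earnings data
rng = np.random.default_rng(3)
female = rng.integers(0, 2, size=500)
earnings = 1450 - 250 * female + rng.normal(scale=300, size=500)
df = pd.DataFrame({"Earnings": earnings, "Female": female})

fit = smf.ols("Earnings ~ Female", data=df).fit()
print(fit.params)
# Intercept: mean earnings when Female = 0 (men)
# Female   : difference in mean earnings, women minus men
print(df.groupby("Female")["Earnings"].mean())  # group means match the fit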

  19. Categorical Variables Female in the previous example is an Indicator variable • Sometimes called a “dummy” variable • Indicates if a condition is true or not. • Allows for many kinds of qualitative (or non-quantitative) data to be incorporated into regression analysis. • Allows for group comparisons. • Be careful about possible omitted variables that would account for differences. E.g., if men and women differ in experience, that might account for the differences in salary.

  20. Indicator variables • So now consider a multiple regression with an indicator and a variable for years of experience: Estimated Earnings = b0 + b1 Years + b2 Female • What is the intercept? • What does the coefficient on Female mean? • relative to when the variable = 0, i.e., men. • What does the coefficient on Years mean?
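A sketch of this model on hypothetical data (again, the simulated coefficients are invented; the point is the interpretation of each estimate):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical earnings data with years of experience added
rng = np.random.default_rng(4)
n = 500
female = rng.integers(0, 2, size=n)
years = rng.integers(0, 30, size=n)
earnings = 1200 - 250 * female + 25 * years + rng.normal(scale=300, size=n)
df = pd.DataFrame({"Earnings": earnings, "Female": female, "Years": years})

fit = smf.ols("Earnings ~ Female + Years", data=df).fit()
print(fit.params)
# Intercept: expected earnings for a man (Female = 0) with 0 years of experience
# Female   : shift in expected earnings for women, holding Years fixed
# Years    : change in expected earnings per additional year, for either group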

  21. Indicator variables • The dummy variable basically shifts the intercept of the regression line (next week we’ll allow the slope to change): [Figure: two parallel fitted lines, one per group, separated vertically by b2, the coefficient on the indicator.]

  22. Categorical Variables • What if your categorical data doesn't just include 2 possible values? For example, “Grade in Prob and Stats” • Slightly more complicated, but manageable. • We deal with this by splitting the variable up into multiple dummy variables: • Did the student get an A? • Did the student get a B? • Did the student get a C? • Did the student get a D?

  23. Categorical Variables • When you split a categorical variable into multiple indicator variables there are a couple of things to always remember: • You can NOT include all possible indicator variables in the regression equation. Why? • There will then be perfect collinearity. Therefore you must exclude one group (or dummy variable) • Because you’re excluding one of the possible dummy variables, all of your coefficients will be relative to that group.
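In practice this splitting is one function call. A minimal sketch with pandas (the grade values are made up), showing both the perfect-collinearity problem and the fix of dropping one category:

import pandas as pd

# Hypothetical grades for "Grade in Prob and Stats"
grades = pd.Series(["A", "B", "C", "B", "A", "D", "C"], name="Grade")

# All four indicators: every row sums to 1, so together with an intercept
# they are perfectly collinear
all_dummies = pd.get_dummies(grades, prefix="Grade")
print(all_dummies.sum(axis=1).unique())   # always 1

# Drop one category (here A); the remaining coefficients are then
# interpreted relative to that omitted group
usable = pd.get_dummies(grades, prefix="Grade", drop_first=True)
print(usable.columns.tolist())            # ['Grade_B', 'Grade_C', 'Grade_D']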

  24. Categorical Variables Example: • Could you fit the following model (home price regressed directly on Style, where Style is coded 1, 2, 3, 4) as is, with no constructed variables? • Yes, no problems • Yes, but the results wouldn't make much sense • No, Excel (or your favorite stats software) won't let you

  25. Categorical Var example • Yes, you can because the variable Style is coded 1, 2, 3, 4. But it wouldn't make much sense to analyze it as such since the categories (split-level, ranch, colonial, tudor) don't have an order. • Therefore we create a dummy variable for each style and include 3 of the 4 in a regression model: • Then the additional value of a Tudor-style home is embedded into the intercept. • Let's assume b1 = -120350 • Then a split-level home sells, on average, for 120,350 less than a Tudor-style home.
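A sketch of this home-price model in Python (prices and style labels are invented; the Treatment coding makes "tudor" the omitted reference group, so each Style coefficient is the average difference from a Tudor-style home):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical home prices by style
rng = np.random.default_rng(5)
styles = rng.choice(["split_level", "ranch", "colonial", "tudor"], size=200)
base = {"split_level": 280_000, "ranch": 320_000,
        "colonial": 350_000, "tudor": 400_000}
price = np.array([base[s] for s in styles]) + rng.normal(scale=20_000, size=200)
df = pd.DataFrame({"Price": price, "Style": styles})

# "tudor" is the omitted (reference) category; the other three styles each
# get their own dummy, and their coefficients are differences from Tudor
fit = smf.ols("Price ~ C(Style, Treatment(reference='tudor'))", data=df).fit()
print(fit.params)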

  26. After Spring Break • More on categorical variables • Allowing the slopes to vary by group (interacting variables) • Categorical explanatory variables will be on Exam 2
