
Lecture 15: Categorical Variables



Presentation Transcript


  1. Lecture 15: Categorical Variables March 12th, 2014

  2. Question In general, how did you find the hypothesis testing class on Monday? • Very useful • Somewhat useful • Not somewhat useful • Very not useful • I don’t know.

  3. Administrative • Mid-semester grades posted • 30% quizzes 70% exam. • Very noisy signal about final grade • Don’t freak out: lots left in the course, • …but often a predictor of future behavior is past behavior • Homework 6 posted at noon; due Tuesday (18th) at noon • Quiz 4 next Wednesday (week from today) • Exam 2 – 2 weeks from Monday

  4. Administrative • Mid-semester FCEs • Main positive: helpful, applied, lectures • Main thing to improve: • More clicker questions to probe understanding – I’ll try. • Class takes too much time • Sorry… can’t do much. A 9-unit class implies that you should expect to spend an additional 6 hours per week. And if you want to improve, put in more time. • Trust me: the homework is good practice. I can make them shorter but then you get less practice (and won’t do as well on the exams)

  5. Last time: • Monday: hypothesis testing • Wednesday: • Variable selection • Path/Influence diagrams + problems with including a post-treatment variable.

  6. Categorical Variables Let’s return to simple regression for a moment: • What is the meaning of the intercept in the following fitted model? • Estimated Earnings = 1451.36 − 251.47 × Female • Female = 1 if the respondent is female, = 0 if male. • Recall what regression is: E(Y|X = x) • Expected earnings for males = 1451.36. • For females = 1451.36 − 251.47 = 1199.89
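A minimal sketch of why the intercept is the mean for the group coded 0 and the coefficient on Female is the gap between the two group means. The numbers below are made up for illustration (they are not the course data), and NumPy is an assumed tool; the lecture itself works in Excel.

```python
import numpy as np

# Hypothetical earnings, NOT the course data.
female = np.array([0, 0, 0, 1, 1, 1], dtype=float)
earnings = np.array([1400.0, 1500.0, 1450.0, 1150.0, 1250.0, 1200.0])

# Design matrix: a column of ones (intercept) and the Female indicator.
X = np.column_stack([np.ones_like(female), female])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
b0, b1 = coef

mean_male = earnings[female == 0].mean()
mean_female = earnings[female == 1].mean()

print(b0, mean_male)                # intercept matches the male mean
print(b1, mean_female - mean_male)  # coefficient matches the gap in group means
```

With a single indicator as the only explanatory variable, the regression is just a comparison of two group averages.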

  7. Categorical Variables Female in the previous example is an indicator variable • Sometimes called a “dummy” variable • Indicates if a condition is true or not. • Allows for many kinds of qualitative (or non-quantitative) data to be incorporated into regression analysis. • Allows for group comparisons. • Be careful about possible omitted variables that would account for differences. E.g., if men and women differ in experience, that might account for the differences in salary.

  8. Indicator variables • So now consider a multiple regression with an indicator and a variable for years of experience: • What is the intercept? • What does the coefficient on Female mean? • It is measured relative to when the variable = 0, i.e., men. • What does the coefficient on Years mean?

  9. Indicator variables • The dummy variable basically shifts the intercept of the regression line (next week we’ll allow the slope to change). [Figure: regression lines for the two groups, with the intercept shift labeled b2]
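A small sketch of the intercept shift, under assumptions: the data are synthetic and noise-free, and the ordering of coefficients (b1 on years, b2 on the Female dummy) is chosen here for illustration rather than taken from the slide's fitted model.

```python
import numpy as np

# Hypothetical, noise-free data built from known coefficients so the fit is exact.
years = np.array([1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0])
female = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
salary = 1000.0 + 50.0 * years - 200.0 * female

# Columns: intercept, years of experience, Female indicator.
X = np.column_stack([np.ones_like(years), years, female])
b0, b1, b2 = np.linalg.lstsq(X, salary, rcond=None)[0]

def predict(yrs, is_female):
    """Fitted salary for a given experience level and group."""
    return b0 + b1 * yrs + b2 * is_female

# Both groups share the slope b1; the lines differ only by the constant b2.
gap_at_2 = predict(2, 1) - predict(2, 0)
gap_at_10 = predict(10, 1) - predict(10, 0)
print(gap_at_2, gap_at_10, b2)
```

The gap between the two fitted lines is the same at every experience level, which is what "shifting the intercept" means.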

  10. Categorical Variables • What if your categorical data doesn’t just include 2 possible values? For example, “Grade in Prob and Stats” • Slightly more complicated, but manageable. • We deal with this by splitting the variable up into multiple dummy variables: • Did the student get an A? • Did the student get a B? • Did the student get a C? • Did the student get a D?

  11. Categorical Variables • When you split a categorical variable into multiple indicator variables there are a couple of things to always remember: • You can NOT include all possible indicator variables in the regression equation. Why? • There will then be perfect collinearity. Therefore you must exclude one group (or dummy variable) • Because you’re excluding one of the possible dummy variables, all of your coefficients will be relative to that group.
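The collinearity problem can be checked numerically. This is a sketch with hypothetical grades, assuming for simplicity that A through D are the only levels: the indicator columns add up to the intercept column, so the design matrix with all of them is rank deficient.

```python
import numpy as np

# Hypothetical grades for six students (assumes only four grade levels).
grades = np.array(["A", "B", "C", "D", "A", "C"])
levels = ["A", "B", "C", "D"]

# One indicator column per grade.
dummies = np.column_stack([(grades == g).astype(float) for g in levels])
ones = np.ones((len(grades), 1))

X_all = np.hstack([ones, dummies])          # intercept + all 4 dummies
X_drop = np.hstack([ones, dummies[:, 1:]])  # intercept + 3 dummies (A excluded)

# The four dummies sum to the intercept column, so X_all loses a rank.
print(np.allclose(dummies.sum(axis=1), 1.0))
print(X_all.shape[1], np.linalg.matrix_rank(X_all))
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))
```

When the rank is smaller than the number of columns, the coefficients are not uniquely determined, which is why one group has to be excluded.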

  12. Categorical Variables Example: • Could you fit the following model as is, with no constructed variables? • Yes, no problems • Yes, but the results wouldn’t make much sense • No, Excel (or your favorite stats software) won’t let you • I don’t know

  13. Categorical example • Yes, you can because the variable Style is coded 1, 2, 3, 4. But it wouldn’t make much sense to analyze it as such since the categories (split-level, ranch, colonial, Tudor) don’t have an order (or, more specifically, a 1-unit increase in Style has no meaningful interpretation). • Therefore we can create multiple dummy variables and include 3 of the 4 in a regression model: • Additional value of a Tudor-style home (the excluded group) is embedded into the intercept. • Let’s assume b1 = -120350, where b1 is the coefficient on the split-level dummy. • Then a split-level home is estimated to sell for 120350 less than a Tudor-style home.
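A sketch of building the dummies from a coded Style column. Everything specific here is an assumption: the mapping of codes to styles, the prices, and the fact that the model contains only the style dummies (the slide's model may include other variables).

```python
import numpy as np

# Assumed coding (not confirmed by the slides):
# 1 = split-level, 2 = ranch, 3 = colonial, 4 = Tudor.
style = np.array([1, 1, 2, 2, 3, 3, 4, 4])
# Hypothetical sale prices.
price = np.array([200e3, 220e3, 250e3, 270e3, 300e3, 320e3, 330e3, 350e3])

split_level = (style == 1).astype(float)
ranch = (style == 2).astype(float)
colonial = (style == 3).astype(float)
# Tudor (style == 4) is the excluded group, so it gets no column.

X = np.column_stack([np.ones(len(style)), split_level, ranch, colonial])
b0, b1, b2, b3 = np.linalg.lstsq(X, price, rcond=None)[0]

tudor_mean = price[style == 4].mean()
split_mean = price[style == 1].mean()
print(b0, tudor_mean)               # intercept = mean price of the excluded group
print(b1, split_mean - tudor_mean)  # b1 = split-level mean minus Tudor mean
```

In this dummies-only setup each coefficient is a difference in average price relative to Tudor homes, which matches the interpretation of b1 on the slide.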

  14. Categorical Variables • With a categorical variable, your estimates for the intercept and slope are relative to the group you’ve chosen to exclude. • I.e., if we excluded the SplitLevel dummy and included Tudor, then the intercept and slopes on the other groups would change to be relative to the split-level homes. • In general, j − 1 dummy variables are needed for j groups.
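To see how the choice of excluded group matters, here is a sketch that fits the same hypothetical data twice with different baselines. The helper `fit_excluding`, the code-to-style mapping, and the prices are all illustrative assumptions.

```python
import numpy as np

# Assumed coding: 1 = split-level, 2 = ranch, 3 = colonial, 4 = Tudor.
style = np.array([1, 1, 2, 2, 3, 3, 4, 4])
price = np.array([200e3, 220e3, 250e3, 270e3, 300e3, 320e3, 330e3, 350e3])

def fit_excluding(baseline):
    """OLS of price on an intercept plus dummies for every style except `baseline`."""
    kept = [s for s in (1, 2, 3, 4) if s != baseline]
    cols = [np.ones(len(style))] + [(style == s).astype(float) for s in kept]
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, price, rcond=None)[0]
    return dict(zip(["intercept"] + kept, coef)), X @ coef

coef_tudor, fitted_tudor = fit_excluding(4)  # Tudor is the baseline
coef_split, fitted_split = fit_excluding(1)  # split-level is the baseline

print(coef_tudor)
print(coef_split)
print(np.allclose(fitted_tudor, fitted_split))  # predictions do not change
```

The coefficients change because they are measured against a different reference group, but the fitted values are identical: the two versions are the same model written in different terms.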

  15. Example Let’s estimate managerial earnings: Data: managers.csv • Construct a dummy variable to indicate whether the employee is Female or not • Estimate earnings by Female • Do women in the data earn more or less than men? • Less and the difference is statistically significant • Less but it is not statistically significant • More but not statistically significant • More and it is statistically significant • I have no idea

  16. Example Now estimate a model predicting earnings by years of experience and sex of the employee • Do women in the data earn more or less than men, after controlling for experience? • Less and the difference is statistically significant • Less but it is not statistically significant • More but not statistically significant • More and it is statistically significant • I have no idea
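A sketch of how controlling for experience can change the estimated gap. The data are synthetic (this is not managers.csv), built so that the women have fewer years of experience; the sketch does not address statistical significance.

```python
import numpy as np

# Synthetic data, NOT managers.csv: women here have fewer years of experience.
years = np.array([6.0, 8.0, 10.0, 12.0, 2.0, 4.0, 6.0, 8.0])
female = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
earnings = 1000.0 + 100.0 * years - 50.0 * female  # gap at equal experience is -50

ones = np.ones_like(years)

# Model 1: earnings on Female only.
raw = np.linalg.lstsq(np.column_stack([ones, female]), earnings, rcond=None)[0]
# Model 2: earnings on Female and years of experience.
adj = np.linalg.lstsq(np.column_stack([ones, female, years]), earnings, rcond=None)[0]

print(raw[1])  # raw gap: mixes the sex difference with the experience difference
print(adj[1])  # gap after controlling for experience
```

In this constructed example most of the raw gap comes from the difference in experience, which is the omitted-variable concern raised earlier in the lecture.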

  17. Example • Remember that including a dummy allows the intercept to vary by group:

  18. Varying Slopes: Interactions between explanatory variables Including a dummy variable allows the intercept to vary by group • What if we wanted the slopes of the explanatory variables to vary as well? • If we want all the slopes of our explanatory variables to vary by group then we could do a subset analysis • Estimate the model twice: • once on the subset of data where the indicator = 1 • once on the subset of data where the indicator = 0 • Alternatively, if we just want one slope to vary, we “interact” the dummy with the explanatory variable • In Excel create a new variable = Years * Male • Estimate the model including ALL THREE variables: Male, YearsExperience, Years*Male.
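A sketch connecting the two approaches on synthetic data (the coefficients, sample size, and noise level are arbitrary assumptions). With a single explanatory variable, the model with Male, Years, and Years*Male reproduces the two lines from the subset analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
years = rng.uniform(0, 20, size=n)
male = (np.arange(n) % 2).astype(float)  # alternate groups: 20 per group
salary = (1000 + 40 * years + 150 * male + 15 * years * male
          + rng.normal(0, 25, size=n))

def ols(X, y):
    """Least-squares coefficients for design matrix X and response y."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# Full model with ALL THREE variables: Years, Male, Years*Male.
b0, b1, b2, b3 = ols(np.column_stack([ones, years, male, years * male]), salary)

# Subset analysis: one simple regression per group.
m = male == 1
a_f, s_f = ols(np.column_stack([ones[~m], years[~m]]), salary[~m])
a_m, s_m = ols(np.column_stack([ones[m], years[m]]), salary[m])

print(np.allclose([b0, b1], [a_f, s_f]))            # baseline group (Male = 0)
print(np.allclose([b0 + b2, b1 + b3], [a_m, s_m]))  # Male = 1 group
```

The interaction model has the advantage that the differences in intercept and slope (b2 and b3) appear directly as coefficients.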

  19. Interactions In Excel you’ll need to create the variable yourself:

  20. Interactions Warning: Estimating a model with interactions is easy… but interpreting the estimated intercept and slopes can get tricky. • Not hard, but caution is urged: Think. • The equation for the group coded as 0 forms the baseline for comparison. • The coefficient on the dummy variable is the difference between the intercepts (as before). The coefficient on the interaction is the difference between the estimated slopes.

  21. Interpreting Interactions We’re estimating (Group = 1 if the manager is male): • Salary = β0 + β1 * Years + β2 * Group + β3 * (Years * Group) • The estimated salary for a woman with 7 years of experience? • = β0 + β1 * (7) + β2 * (0) + β3 * (7 * 0) • = β0 + β1 * (7) • The estimated salary for a man with 6 years of experience? • = (β0 + β2) + (β1 + β3) * (6) • The coefficient on the dummy variable is the difference between the intercepts (as before). The coefficient on the interaction is the difference between the estimated slopes.
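The two calculations above can be sketched as a small function. The model form (β1 on Years, β2 on Group, β3 on Years*Group) is inferred from the worked arithmetic on the slide, and the coefficient values are hypothetical, not estimates from the course data.

```python
# Hypothetical coefficients for illustration only; not estimates from the course data.
B0, B1, B2, B3 = 1000.0, 40.0, 150.0, 15.0

def predicted_salary(years, male):
    """b0 + b1*Years + b2*Group + b3*(Years*Group), with Group = 1 for men."""
    return B0 + B1 * years + B2 * male + B3 * (years * male)

woman_7 = predicted_salary(7, 0)  # reduces to b0 + b1*7
man_6 = predicted_salary(6, 1)    # equals (b0 + b2) + (b1 + b3)*6

print(woman_7)
print(man_6)
```

For the group coded 0 the dummy and interaction terms vanish; for the group coded 1 they add to the intercept and slope respectively.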

  22. Varying slope / varying intercept • Plotting both (not super easy in Excel):

  23. Next time • Collinearity with Interactions • Short version: Yup, it’s there. By our own doing. • Important stuff not in the book: I’ll post a reading tonight • Interactions with quantitative explanatory variables • Diagnostics in Multiple Regression
