Class 20: Thurs., Nov. 18

• Specially Constructed Explanatory Variables • Dummy variables for categorical variables • Interactions involving dummy variables • I will e-mail you HW8 tomorrow. It will be due Tuesday, Nov. 30th. • Schedule: • Tuesday, Nov. 23rd: One-way ANOVA • Tuesday, Nov. 30th: Review • Thursday, Dec. 2nd: Midterm II • Tuesday, Dec. 7th and Thursday, Dec. 9th: Two-way ANOVA

Categorical variables • Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County). • How to use categorical variables as explanatory variables in regression analysis: • If the variable has two categories (e.g., sex (male/female), rain or not rain, snow or not snow), we have defined a variable that equals 1 for one of the categories and 0 for the other category.

Predicting Emergency Calls to the AAA Club • Rain forecast = 1 if rain is in forecast, 0 if not • Snow forecast = 1 if snow is in forecast, 0 if not • Weekday = 1 if weekday, 0 if not

Comparing Toy Factory Managers • An analysis has shown that the time required to complete a production run in a toy factory increases with the number of toys produced. Data were collected for the time required to process 20 randomly selected production runs as supervised by three managers (A, B and C). Data in toyfactorymanager.JMP. • How do the managers compare?

Marginal Comparison • Marginal comparison could be misleading. We know that large production runs with more toys take longer than small runs with few toys.

How can we be sure that Manager C's advantage is not simply due to having supervised smaller production runs? • Solution: Run a multiple regression that includes the size of the production run as an explanatory variable, along with manager, in order to control for the size of the production run.

Including Categorical Variable in Multiple Regression: Wrong Approach • We could assign numeric codes to the managers, e.g., Manager A=0, Manager B=1, Manager C=2. • This model says that for the same run size, Manager B is 31 minutes faster than Manager A and Manager C is 31 minutes faster than Manager B. • This model restricts the difference between Managers A and B to be the same as the difference between Managers B and C; we have no reason to do this. • If we use a different coding for Manager, we get different results, e.g., with Manager B=0, Manager A=1, Manager C=2, Manager A is estimated to be 5 min. faster than Manager B.
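To see why a single numeric code is restrictive, it helps to write out the model it implies. The sketch below uses generic symbols (the names \beta_0, \beta_1, \beta_2 are notation assumed here, not taken from the JMP output):

```latex
\[
E(\text{Time}\mid \text{RunSize}, \text{Code})
  = \beta_0 + \beta_1\,\text{RunSize} + \beta_2\,\text{Code},
\qquad \text{Code} = 0\ (\text{A}),\ 1\ (\text{B}),\ 2\ (\text{C})
\]
\[
\underbrace{E(\text{Time}\mid \text{B}) - E(\text{Time}\mid \text{A})}_{=\ \beta_2}
\;=\;
\underbrace{E(\text{Time}\mid \text{C}) - E(\text{Time}\mid \text{B})}_{=\ \beta_2}
\quad \text{at any fixed run size.}
\]
```

With only one coefficient attached to the code there is only one gap to estimate, so relabeling the managers changes which gaps are forced to be equal and therefore changes the fit.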

Including Categorical Variable in Multiple Regression: Right Approach • Create an indicator (dummy) variable for each category. • Manager[a] = 1 if Manager is A, 0 if Manager is not A • Manager[b] = 1 if Manager is B, 0 if Manager is not B • Manager[c] = 1 if Manager is C, 0 if Manager is not C
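A minimal sketch of building these indicator columns by hand, in plain Python. The list of manager labels is made up for illustration and the helper name `indicator_columns` is hypothetical; JMP constructs these columns internally when the variable is nominal.

```python
# Hypothetical manager labels for six production runs (made-up data).
managers = ["A", "B", "C", "A", "C", "B"]


def indicator_columns(labels):
    """Return one 0/1 indicator column per category, keyed by category."""
    categories = sorted(set(labels))
    return {cat: [1 if lab == cat else 0 for lab in labels] for cat in categories}


dummies = indicator_columns(managers)
for cat, column in dummies.items():
    print(f"Manager[{cat.lower()}] = {column}")

# Each run has exactly one manager, so the indicators sum to 1 in every row.
row_sums = [sum(col[i] for col in dummies.values()) for i in range(len(managers))]
```

Because the three indicators always sum to 1, they are redundant with the intercept; software resolves this with a constraint or by dropping one category.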

For a run size of 100, the estimated times for runs supervised by Managers A, B and C are • For the same run size, Manager A is estimated to be on average 38.41-(-14.65)=53.06 minutes slower than Manager B and 38.41-(-23.76)=62.17 minutes slower than Manager C.
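The pairwise comparisons are just differences of the manager coefficients, since the intercept and run-size terms cancel when run size is held fixed. A short sketch using the three coefficients quoted above (the helper name `slower_by` is hypothetical):

```python
# Manager coefficients (in minutes) quoted in the notes.
manager_effect = {"a": 38.41, "b": -14.65, "c": -23.76}


def slower_by(m1, m2):
    """Estimated extra minutes for manager m1 relative to m2 at the same run size."""
    return round(manager_effect[m1] - manager_effect[m2], 2)


a_vs_b = slower_by("a", "b")
a_vs_c = slower_by("a", "c")
b_vs_c = slower_by("b", "c")
print(a_vs_b, a_vs_c, b_vs_c)
```

The same subtraction gives the B versus C comparison, which the slide does not state explicitly.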

Categorical Variables in Multiple Regression in JMP • Make sure that the categorical variable is coded as nominal. To change the coding, right click on the variable's column, click Column Info and change Modeling Type to nominal. • Use Fit Model and include the categorical variable in the multiple regression. • After Fit Model, click the red triangle next to Response and click Estimates, then Expanded Estimates (the initial output in JMP uses a different, more confusing, coding of the dummy variables).

The coefficients on Manager A, Manager B and Manager C add up to zero. So the positive coefficient on Manager A means that Manager A is slower than the average (of Manager A, B and C) and the negative coefficients on Manager B and Manager C mean that these two managers are faster than the average (of Manager A, B and C). • The coefficients on the indicator variables will always add up to zero in JMP. • Caution: Different software uses different coding for indicator variables. It doesn’t change the predictions from the multiple regression but does change the interpretation.
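The caution about coding can be checked numerically. The sketch below converts the sum-to-zero manager coefficients into a reference (baseline) coding with Manager A as the baseline and confirms the predictions agree. The intercept and run-size slope are hypothetical placeholders, since the fitted values are not reproduced in these notes.

```python
# Sum-to-zero manager coefficients (minutes) quoted in the notes.
effect = {"a": 38.41, "b": -14.65, "c": -23.76}

# Hypothetical intercept and run-size slope, for illustration only.
intercept, slope = 170.0, 0.25


def predict_sum_to_zero(manager, run_size):
    """Prediction when each manager coefficient is a deviation from the average."""
    return intercept + slope * run_size + effect[manager]


# Reference coding: Manager A is absorbed into the intercept,
# and the other coefficients become differences from Manager A.
ref_intercept = intercept + effect["a"]
ref_offset = {m: effect[m] - effect["a"] for m in effect}


def predict_reference(manager, run_size):
    """Prediction when each manager coefficient is a difference from Manager A."""
    return ref_intercept + slope * run_size + ref_offset[manager]


for m in effect:
    print(m, predict_sum_to_zero(m, 100), predict_reference(m, 100))
```

Under reference coding a coefficient reads as "difference from the baseline manager", while under sum-to-zero coding it reads as "difference from the average of the managers"; the fitted values are identical.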

Equivalence of Using One 0/1 Dummy Variable and Two 0/1 Dummy Variables when Categorical Variable has Two Categories • The two models give equivalent predictions. • The difference in the mean number of emergency calls between a day with a rain forecast and a day without a rain forecast, holding all other variables fixed, is 429.71, which matches 214.85-(-214.85) up to rounding.
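The equivalence can be written out in a few lines. The symbols below are notation assumed for this sketch (\beta for the single 0/1 dummy, \alpha_1 and \alpha_0 for the two constrained dummies); "other terms" stands for the remaining explanatory variables, which are the same in both models.

```latex
\[
\text{Model 1:}\quad \mu = \beta_0 + \beta\,\text{Rain} + (\text{other terms}),
\qquad \mu_{\text{rain}} - \mu_{\text{no rain}} = \beta
\]
\[
\text{Model 2:}\quad \mu = \beta_0^{*} + \alpha_1\,\text{Rain}[1] + \alpha_0\,\text{Rain}[0] + (\text{other terms}),
\qquad \alpha_1 + \alpha_0 = 0
\]
\[
\mu_{\text{rain}} - \mu_{\text{no rain}} = \alpha_1 - \alpha_0 = 2\alpha_1
\quad\Longrightarrow\quad
\beta = 2\alpha_1, \qquad \beta_0 = \beta_0^{*} + \alpha_0 .
\]
```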

Effect Tests • Effect test for manager: H0: manager[a]=manager[b]=manager[c] vs. Ha: not all of manager[a], manager[b], manager[c] are equal. The null hypothesis is that all managers are the same (in terms of mean run time) when run size is held fixed; the alternative hypothesis is that not all managers are the same (in terms of mean run time) when run size is held fixed. • p-value for Effect Test <.0001. Strong evidence that not all managers are the same when run size is held fixed. • Note: H0 is equivalent to manager[a]=manager[b]=manager[c]=0 because JMP has the constraint that manager[a]+manager[b]+manager[c]=0. • Effect test for Run size tests the null hypothesis that the Run size coefficient is 0 versus the alternative hypothesis that the Run size coefficient isn't zero. Same p-value as the t-test.
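For reference, an effect test of this kind is usually computed as a partial F-test comparing the full model with a reduced model that drops the effect. The formula below is the standard form, written under that assumption; the symbols (SSE, k, n, p) are notation introduced here rather than taken from the JMP output.

```latex
\[
F = \frac{(\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}})/(k-1)}
         {\mathrm{SSE}_{\text{full}}/(n-p)}
\;\sim\; F_{\,k-1,\; n-p} \quad \text{under } H_0
\]
```

Here k = 3 is the number of manager categories, n is the number of production runs, and p = 4 is the number of free parameters in the full model (intercept, run size, and two free manager coefficients). For a one-degree-of-freedom effect such as run size, F equals the square of the t-statistic, which is why the two p-values coincide.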

The effect test shows that the managers are not equal. • For the same run size, Manager C is best (lowest mean run time), followed by Manager B and then Manager A. • The above model assumes no interaction between Manager and run size: the difference between the mean run times of the managers is the same for all run sizes.

Interaction Model

Interaction Model in JMP • To add interactions involving categorical variables in JMP, follow the same procedure as with two continuous variables. Run Fit Model in JMP, add the usual explanatory variables first, then highlight one of the variables in the interaction in the Construct Model Effects box and highlight the other variable in the interaction in the Columns box and then click Cross in the Construct Model Effects box.

Interaction Model • Interaction between run size and Manager: The effect on mean run time of increasing run size by one is different for different managers. • Effect Test for Interaction: • Manager*Run Size Effect test tests null hypothesis that there is no interaction (effect on mean run time of increasing run size is same for all managers) vs. alternative hypothesis that there is an interaction between run size and managers. p-value =0.0333. Evidence that there is an interaction.
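One way to write the interaction model, using sum-to-zero coefficients in the style of the earlier slides. The Greek symbols are notation assumed for this sketch, and the model is written with uncentered run size; software may center run size in the cross term, which changes the reported coefficients but not the fitted lines.

```latex
\[
E(\text{Time}\mid \text{RunSize}, \text{Manager}=m)
  = \beta_0 + \beta_1\,\text{RunSize} + \gamma_m + \delta_m\,\text{RunSize},
\qquad m \in \{a, b, c\}
\]
\[
\gamma_a + \gamma_b + \gamma_c = 0, \qquad \delta_a + \delta_b + \delta_c = 0
\]
\[
\text{Manager } m:\quad \text{intercept } \beta_0 + \gamma_m, \qquad \text{slope } \beta_1 + \delta_m,
\qquad H_0:\ \delta_a = \delta_b = \delta_c = 0 .
\]
```

Each manager gets a separate line, and the effect test for Manager*Run Size asks whether the three slopes can be treated as equal.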

The runs supervised by Manager A appear abnormally time consuming. • Manager B has a higher initial fixed setup time than Manager C (186.565>149.706) but a lower per-unit production time (0.136<0.259).
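Because one of these two managers has the higher intercept and the other the higher slope, their fitted lines cross. The sketch below uses only the four numbers quoted above to locate the crossover run size; whether that run size lies inside the range of the observed data is not stated in these notes, so treat the result as illustrative.

```python
# Fitted intercepts (setup) and slopes (per unit) quoted in the notes.
lines = {"b": (186.565, 0.136), "c": (149.706, 0.259)}


def fitted_time(manager, run_size):
    """Fitted mean run time for a manager at a given run size."""
    intercept, slope = lines[manager]
    return intercept + slope * run_size


# Run size at which the two fitted lines give the same mean time.
(b0, b1), (c0, c1) = lines["b"], lines["c"]
crossover = (b0 - c0) / (c1 - b1)

print(f"crossover run size: {crossover:.1f}")
print(f"at 100 units: b={fitted_time('b', 100):.1f}, c={fitted_time('c', 100):.1f}")
print(f"at 400 units: b={fitted_time('b', 400):.1f}, c={fitted_time('c', 400):.1f}")
```

Below the crossover the lower-intercept line is faster; above it the lower-slope line is faster, which is what an interaction means in practice.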

Interaction Profile Plot • Lower left-hand plot shows mean time for run vs. run size for the three managers a, b and c.

Interactions Involving Categorical Variables: General Approach • First fit model with an interaction between categorical explanatory variable and continuous explanatory variable. Use effect test on interaction to see if there is evidence of an interaction. • If there is evidence of an interaction (p-value <0.05 for effect test), use interaction model. • If there is not strong evidence of an interaction (p-value >0.05 for effect test), use model without interactions.

Example: A Sex Discrimination Lawsuit • Did a bank discriminatorily pay higher starting salaries to men than to women? Harris Trust and Savings Bank was sued by a group of female employees who accused the bank of paying lower starting salaries to women. The data in harrisbank.JMP are the starting salaries for all 32 male and all 61 female skilled, entry-level clerical employees hired by the bank between 1969 and 1977, as well as the education levels and sex of the employees.

No evidence of an interaction between Sex and Education. Fit model without interactions.

Discrimination Case Regression Results • Strong evidence that there is a difference in the mean starting salaries of women and men of the same education level. • Estimated difference: Men have 345.904+345.904=$691.81 higher mean starting salaries than women of the same education level. • 95% confidence interval for mean difference = (2*$214.55,2*$477.25)=($429.10,$954.50). • Bank's defense: Omitted variable bias. Variables such as Seniority, Age and Experience also need to be controlled for.
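The doubling comes from the sum-to-zero coding: with two categories the male and female coefficients are equal and opposite, so the male-minus-female difference is twice the reported coefficient, and the interval endpoints scale the same way. A quick check using the figures quoted above (variable names are chosen for this sketch):

```python
# Sex coefficient and its 95% confidence limits, as quoted in the notes (dollars).
coef = 345.904
ci_coef = (214.55, 477.25)

# Male minus female difference: coef - (-coef) = 2 * coef.
diff = 2 * coef
ci_diff = tuple(2 * limit for limit in ci_coef)

print(f"estimated difference: ${diff:.2f}")
print(f"95% CI for difference: (${ci_diff[0]:.2f}, ${ci_diff[1]:.2f})")
```

The interval for the difference is centered on the estimated difference, which is a useful sanity check on the arithmetic.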
