
Multiple Regression



Presentation Transcript


  1. Multiple Regression

  2. Multiple Regression The test you choose depends on level of measurement:

     Independent Variable                     Dependent Variable       Test
     Dichotomous                              Continuous               Independent Samples t-test
     Dichotomous, Nominal                     Dichotomous, Nominal     Cross Tabs
     Dichotomous, Nominal                     Continuous               ANOVA
     Dichotomous, Continuous                  Continuous               Bivariate Regression/Correlation
     Two or more Continuous or Dichotomous    Continuous               Multiple Regression

  3. Multiple Regression • Multiple Regression is very popular among sociologists. • Most social phenomena have more than one cause. • It is very difficult to manipulate just one social variable through experimentation. • Sociologists must attempt to model complex social realities to explain them.

  4. Multiple Regression • Multiple Regression allows us to: • Use several variables at once to explain the variation in a continuous dependent variable. • Isolate the unique effect of one variable on the continuous dependent variable while taking into consideration that other variables are affecting it too. • Write a mathematical equation that tells us the overall effects of several variables together and the unique effects of each on a continuous dependent variable. • Control for other variables to demonstrate whether bivariate relationships are spurious.

  5. Multiple Regression • For example: A sociologist may be interested in the relationship between Education, Family Income, and Number of Children in a family. [Diagram: Independent Variables (Education, Family Income) → Dependent Variable (Number of Children)]

  6. Multiple Regression • For example: • Research Hypothesis: As education of respondents increases, the number of children in families will decline (negative relationship). • Research Hypothesis: As family income of respondents increases, the number of children in families will decline (negative relationship). [Diagram: Independent Variables (Education, Family Income) → Dependent Variable (Number of Children)]

  7. Multiple Regression • For example: • Null Hypothesis: There is no relationship between education of respondents and the number of children in families. • Null Hypothesis: There is no relationship between family income and the number of children in families. [Diagram: Independent Variables (Education, Family Income) → Dependent Variable (Number of Children)]

  8. Multiple Regression • Bivariate regression is based on fitting a line as close as possible to the plotted coordinates of your data on a two-dimensional graph. • Trivariate regression is based on fitting a plane as close as possible to the plotted coordinates of your data on a three-dimensional graph.

     Case:                 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
     Children (Y):         2  5  1  9  6  3  0  3  7  7  2  5  1  9  6  3  0  3  7 14  2  5  1  9  6
     Education (X1):      12 16 20 12  9 18 16 14  9 12 12 10 20 11  9 18 16 14  9  8 12 10 20 11  9
     Income, 1=$10K (X2):  3  4  9  5  4 12 10  1  4  3 10  4  9  4  4 12 10  6  4  1 10  3  9  2  4

  9. Multiple Regression [Figure: plotted coordinates for Education (X1), Income (X2), and Number of Children (Y) on a three-dimensional graph; same 25-case data table as slide 8.]

  10. Multiple Regression What multiple regression does is fit a plane to these coordinates. [Figure: the plane fit to the plotted coordinates on the three-dimensional graph.]

     Case:                 1  2  3  4  5  6  7  8  9 10
     Children (Y):         2  5  1  9  6  3  0  3  7  7
     Education (X1):      12 16 20 12  9 18 16 14  9 12
     Income, 1=$10K (X2):  3  4  9  5  4 12 10  1  4  3

  11. Multiple Regression • Mathematically, that plane is: Y = a + b1X1 + b2X2
     a = y-intercept, the value of Y where both X’s equal zero
     b = coefficient or slope for each variable
     For our problem, SPSS says the equation is: Y = 11.8 - .36X1 - .40X2
     Expected # of Children = 11.8 - .36*Educ - .40*Income
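This plane can also be fit outside SPSS. As a sketch (variable names are mine, not the presentation's), the following Python recovers the least-squares plane from the 25-case data on slide 8 with NumPy; the coefficients should come out close to the slide's 11.8, -.36, and -.40.

```python
import numpy as np

# 25-case data from slide 8 (Income in units of $10K)
children = np.array([2, 5, 1, 9, 6, 3, 0, 3, 7, 7, 2, 5, 1, 9, 6,
                     3, 0, 3, 7, 14, 2, 5, 1, 9, 6], dtype=float)
education = np.array([12, 16, 20, 12, 9, 18, 16, 14, 9, 12, 12, 10, 20,
                      11, 9, 18, 16, 14, 9, 8, 12, 10, 20, 11, 9], dtype=float)
income = np.array([3, 4, 9, 5, 4, 12, 10, 1, 4, 3, 10, 4, 9, 4, 4,
                   12, 10, 6, 4, 1, 10, 3, 9, 2, 4], dtype=float)

# Design matrix: a column of ones for the intercept a, then X1 and X2
X = np.column_stack([np.ones_like(education), education, income])
a, b1, b2 = np.linalg.lstsq(X, children, rcond=None)[0]
print(f"Y-hat = {a:.2f} {b1:+.2f}*Educ {b2:+.2f}*Income")
```

Least squares places the plane so that the sum of squared vertical distances from the data points to the plane is as small as possible, which is exactly the criterion the later slides describe.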

  12. Multiple Regression • Let’s take a moment to reflect… Why do I write the equation: Y = a + b1X1 + b2X2 Whereas KBM often write: Yi = a + b1X1i + b2X2i + ei One is the equation for a prediction; the other is the value of a data point for a particular person, including that person’s error (ei).

  13. Multiple Regression 57% of the variation in number of children is explained by education and income!  Y = 11.8- .36X1- .40X2

  14. Multiple Regression For Y = 11.8 - .36X1 - .40X2:
     r2 = [Σ(Y – Ȳ)² – Σ(Y – Ŷ)²] ÷ Σ(Y – Ȳ)²
        = 161.518 ÷ 281.76 = .573
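As a check on the arithmetic (a sketch of my own, not part of the slides), the r² decomposition can be computed directly from the slide-8 data and the slide's fitted equation; the total sum of squares works out to exactly 281.76, and r² should land near the slide's .573.

```python
import numpy as np

# 25-case data from slide 8
children = np.array([2, 5, 1, 9, 6, 3, 0, 3, 7, 7, 2, 5, 1, 9, 6,
                     3, 0, 3, 7, 14, 2, 5, 1, 9, 6], dtype=float)
education = np.array([12, 16, 20, 12, 9, 18, 16, 14, 9, 12, 12, 10, 20,
                      11, 9, 18, 16, 14, 9, 8, 12, 10, 20, 11, 9], dtype=float)
income = np.array([3, 4, 9, 5, 4, 12, 10, 1, 4, 3, 10, 4, 9, 4, 4,
                   12, 10, 6, 4, 1, 10, 3, 9, 2, 4], dtype=float)

yhat = 11.8 - 0.36 * education - 0.40 * income    # the slide's equation
tss = np.sum((children - children.mean()) ** 2)   # total sum of squares, sum(Y - Ybar)^2
sse = np.sum((children - yhat) ** 2)              # error sum of squares, sum(Y - Yhat)^2
r2 = (tss - sse) / tss
print(round(tss, 2))  # 281.76
print(round(r2, 3))
```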

  15. Multiple Regression So what does our equation tell us? Y = 11.8 - .36X1 - .40X2 Expected # of Children = 11.8 - .36*Educ - .40*Income Try “plugging in” some values for your variables. 

  16. Multiple Regression So what does our equation tell us? Ŷ = 11.8 - .36X1 - .40X2 Expected # of Children = 11.8 - .36*Educ - .40*Income

     If Education equals:   If Income equals:   Then, children equals:
      0                      0                  11.8
     10                      0                   8.2
     10                     10                   4.2
     20                     10                   0.6
     20                     11                   0.2
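You can do the "plugging in" with a small helper (a sketch; the function name is mine, not the presentation's):

```python
def expected_children(educ, income):
    """Plug values into the slide's equation: 11.8 - .36*Educ - .40*Income."""
    return 11.8 - 0.36 * educ - 0.40 * income

# Reproduce the table's rows
for educ, inc in [(0, 0), (10, 0), (10, 10), (20, 10), (20, 11)]:
    print(educ, inc, round(expected_children(educ, inc), 2))
```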

  17. Multiple Regression So what does our equation tell us? Ŷ = 11.8 - .36X1 - .40X2 Expected # of Children = 11.8 - .36*Educ - .40*Income

     If Education equals:   If Income equals:   Then, children equals:
     1                       0                  11.44
     1                       1                  11.04
     1                       5                   9.44
     1                      10                   7.44
     1                      15                   5.44

  18. Multiple Regression So what does our equation tell us? Ŷ = 11.8 - .36X1 - .40X2 Expected # of Children = 11.8 - .36*Educ - .40*Income

     If Education equals:   If Income equals:   Then, children equals:
      0                     1                   11.40
      1                     1                   11.04
      5                     1                    9.60
     10                     1                    7.80
     15                     1                    6.00

  19. Multiple Regression If graphed, holding one variable constant produces a two-dimensional graph for the other variable. [Figure: two panels. Left: X1 = Education from 0 to 15, line falling from 11.40 to 6.00 with b = -.36. Right: X2 = Income from 0 to 15, line falling from 11.44 to 5.44 with b = -.40.]

  20. Multiple Regression • An interesting effect of controlling for other variables is “Simpson’s Paradox.” • The direction of relationship between two variables can change when you control for another variable. [Diagram: Education → (+) → Crime Rate; Y = -51.3 + 1.5X]

  21. Multiple Regression • “Simpson’s Paradox” Bivariate regression: Education → (+) → Crime Rate, Y = -51.3 + 1.5X1. Urbanization is related to both: Urbanization → (+) → Education and Urbanization → (+) → Crime Rate. Regression controlling for Urbanization: Education → (–) → Crime Rate, Y = 58.9 - .6X1 + .7X2, where X2 = Urbanization (+).

  22. Multiple Regression [Figure: Crime plotted against Education. The original bivariate regression line slopes upward; looking at each level of urbanization (Rural, Small town, Suburban, City), the new lines slope downward.]
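Simpson's Paradox is easy to reproduce with constructed numbers (the data below are mine, chosen so that the bivariate education slope is positive while the partial slope, controlling urbanization, is exactly -2; they are not the presentation's crime data):

```python
import numpy as np

urban = np.repeat([0.0, 1.0, 2.0, 3.0], 3)            # four urbanization levels, 3 cases each
educ = 10 + 3 * urban + np.tile([-1.0, 0.0, 1.0], 4)  # education rises with urbanization
crime = 20 * urban - 2 * educ                         # within a level, education lowers crime

ones = np.ones_like(educ)
# Bivariate: crime regressed on education alone -- the slope comes out positive
b_biv = np.linalg.lstsq(np.column_stack([ones, educ]), crime, rcond=None)[0][1]
# Multiple: controlling for urbanization flips the education slope to -2
b_multi = np.linalg.lstsq(np.column_stack([ones, educ, urban]), crime, rcond=None)[0][1]
print(round(b_biv, 2), round(b_multi, 2))
```

Because urbanization drives both education and crime upward, the bivariate line absorbs urbanization's effect; once urbanization is held constant, education's own (negative) effect emerges, just as on the slide.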

  23. Multiple Regression Now… More Variables! • What happens when you have even more variables? • The social world is very complex. • For example: A sociologist may be interested in the effects of Education, Income, Sex, and Gender Attitudes on Number of Children in a family. Independent Variables Education Family Income Sex Gender Attitudes Dependent Variable Number of Children

  24. Multiple Regression • Research Hypotheses: • As education of respondents increases, the number of children in families will decline (negative relationship). • As family income of respondents increases, the number of children in families will decline (negative relationship). • As one moves from male to female, the number of children in families will increase (positive relationship). • As gender attitudes get more conservative, the number of children in families will increase (positive relationship). Independent Variables Education Family Income Sex Gender Attitudes Dependent Variable Number of Children

  25. Multiple Regression • Null Hypotheses: • There will be no relationship between education of respondents and the number of children in families. • There will be no relationship between family income and the number of children in families. • There will be no relationship between sex and number of children. • There will be no relationship between gender attitudes and number of children. Independent Variables Education Family Income Sex Gender Attitudes Dependent Variable Number of Children

  26. Multiple Regression • Bivariate regression is based on fitting a line as close as possible to the plotted coordinates of your data on a two-dimensional graph. • Trivariate regression is based on fitting a plane as close as possible to the plotted coordinates of your data on a three-dimensional graph. • Regression with more than two independent variables is based on fitting a shape to your constellation of data on a multi-dimensional graph.

  27. Multiple Regression • Regression with more than two independent variables is based on fitting a shape to your constellation of data on a multi-dimensional graph. • The shape will be placed so that it minimizes the distance (sum of squared errors) from the shape to every data point.

  28. Multiple Regression • Regression with more than two independent variables is based on fitting a shape to your constellation of data on a multi-dimensional graph. • The shape will be placed so that it minimizes the distance (sum of squared errors) from the shape to every data point. • The shape is no longer a line, but if you hold all other variables constant, it is linear for each independent variable.

  29. Multiple Regression [Figure: imagining a graph with four dimensions!]

  30. Multiple Regression For our problem, our equation could be: Y = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4 E(Children) = 7.5 - .30*Educ - .40*Income + 0.5*Sex + 0.25*Gender Att. 

  31. Multiple Regression So what does our equation tell us? Ŷ = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4 E(Children) = 7.5 - .30*Educ - .40*Income + 0.5*Sex + 0.25*Gender Att.

     Education:   Income:   Sex:   Gender Att:   Children:
     10            5        0      0             2.5
     10            5        0      5             3.75
     10           10        0      5             1.75
     10            5        1      0             3.0
     10            5        1      5             4.25
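The table can be reproduced with the same plug-in approach as before (a sketch; the helper name is mine):

```python
def expected_children(educ, income, sex, gender_att):
    """Plug values into the slide's four-variable equation."""
    return 7.5 - 0.30 * educ - 0.40 * income + 0.5 * sex + 0.25 * gender_att

# Reproduce the table's rows
for row in [(10, 5, 0, 0), (10, 5, 0, 5), (10, 10, 0, 5),
            (10, 5, 1, 0), (10, 5, 1, 5)]:
    print(row, round(expected_children(*row), 2))
```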

  32. Multiple Regression Each variable, holding the other variables constant, has a linear, two-dimensional graph of its relationship with the dependent variable. Here we hold every other variable constant at “zero.” Ŷ = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4 [Figure: two panels. Left: X1 = Education from 0 to 10, line falling from 7.5 to 4.5 with b = -.3. Right: X2 = Income from 0 to 10, line falling from 7.5 to 3.5 with b = -.4.]

  33. Multiple Regression Each variable, holding the other variables constant, has a linear, two-dimensional graph of its relationship with the dependent variable. Here we hold every other variable constant at “zero.” Ŷ = 7.5 - .30X1 - .40X2 + 0.5X3 + 0.25X4 [Figure: two panels. Left: X3 = Sex from 0 to 1, line rising from 7.5 to 8 with b = .5. Right: X4 = Gender Attitudes from 0 to 5, line rising from 7.5 to 8.75 with b = .25.]

  34. Multiple Regression • R2 • (TSS – SSE) / TSS • TSS = total sum of squares: the squared distance from the mean to the value on Y, summed over every case • SSE = sum of squared errors: the squared distance from the shape to the value on Y, summed over every case • Can be interpreted the same for multiple regression—joint explanatory value of all of your variables (or “your model”) • Can request a change in R2 test from SPSS to see if adding new variables improves the fit of your model

  35. Multiple Regression • R • The correlation of your actual Y value and the predicted Y value using your model for each person • Adjusted R2 • Explained variation can never go down when new variables are added to a model. • Because R2 can never go down, some statisticians figured out a way to adjust R2 by the number of variables in your model. • This is a way of ensuring that your explanatory power is not just a product of throwing in a lot of variables. • Standard error of the estimate: the average deviation from the regression shape.
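The slide describes the idea behind adjusted R2; the standard formula (a sketch using the usual definition, not quoted from the slides) penalizes R2 by the number of predictors k relative to the number of cases n:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Slide 13's model: R2 = .573 with n = 25 cases and k = 2 predictors
print(round(adjusted_r2(0.573, 25, 2), 3))  # 0.534
```

Adding a useless predictor raises k without raising R2 much, so adjusted R2 can fall even though R2 itself cannot.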

  36. Multiple Regression • The BLUE Regression Criteria (KBM pages: 256 – 257) • We should recognize that we are forcing a model (a shape) onto our data. If that model is sensible, we should proceed with regression. • Violating the BLUE assumptions may result in biased estimates or incorrect significance tests, although OLS is robust to most violations. • Assumptions we make about our data include: • The relationship between the dependent variable and its predictors is linear; no relevant variables are omitted from the equation, and no irrelevant variables are included in it. • All variables are measured without error.

  37. Multiple Regression • Assumptions we make about our data include: • The relationship between the dependent variable and its predictors is linear; no relevant variables are omitted from the equation, and no irrelevant variables are included in it. • All variables are measured without error. • The error term (ei) for a single regression equation has the following properties: • Error is normally distributed • The mean of the errors is zero • The errors are independently distributed with constant variances (homoscedasticity) • Each predictor is uncorrelated with the equation’s error term • In systems of interrelated equations, the errors in one equation are assumed to be uncorrelated with the errors in the other equations.

  38. Multiple Regression Multicollinearity Controlling for other variables means finding how one variable affects the dependent variable at each level of the other variables. So what if two of your independent variables were highly correlated with each other??? [Diagram: Education ≈ Urbanization, both predicting Crime]

  39. Multiple Regression Multicollinearity So what if two of your independent variables were highly correlated with each other??? (this is the problem called multicollinearity) How would one have a relationship independent of the other? As you hold one constant, you in effect hold the other constant! Each variable would have the same value for the dependent variable at each level, so the partial effect on the dependent variable for each may be 0. [Diagram: Education ≈ Years Studying Math, both predicting Crime]

  40. Multiple Regression Multicollinearity Some solutions for multicollinearity: • Remove one of the variables • Create a scale out of the two variables (making one variable out of two) • Run separate models with each independent variable [Diagram: Education ≈ Years Studying Math]
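Before running a model, you can screen for multicollinearity by checking how strongly the predictors correlate (or by computing variance inflation factors). A sketch with made-up data mimicking "Education ≈ Years Studying Math" (none of these numbers are from the presentation):

```python
import numpy as np

education = np.array([8.0, 10, 12, 12, 14, 16, 16, 18, 20, 21])
# Hypothetical: years studying math tracks education almost perfectly
years_math = education - 6 + np.array([0.1, -0.2, 0.0, 0.3, -0.1,
                                       0.2, -0.3, 0.1, 0.0, -0.1])

r = np.corrcoef(education, years_math)[0, 1]
vif = 1 / (1 - r ** 2)  # variance inflation factor for one of a pair of predictors
print(round(r, 3), round(vif, 1))  # r is near 1, so the VIF explodes
```

A common rule of thumb treats VIFs above about 10 as a warning sign; here the near-perfect correlation pushes the VIF far beyond that, which is why one of the solutions above is needed.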

  41. Multiple Regression • Dummy Variables What are dummy variables?! • They are simply dichotomous variables that are entered into regression. They have 0 – 1 coding, where 0 = absence of something and 1 = presence of something. E.g., Female (0=M; 1=F) or Southern (0=Non-Southern; 1=Southern).

  42. Multiple Regression Dummy Variables are especially nice because they allow us to use nominal variables in regression. But YOU said we CAN’T do that! A nominal variable has no rank or order, rendering the numerical coding scheme useless for regression.

  43. Multiple Regression • The way you use nominal variables in regression is by converting them to a series of dummy variables.

     Nominal Variable    Recode into different Dummy Variables
     Race                1. White: 0 = Not White; 1 = White
      1 = White          2. Black: 0 = Not Black; 1 = Black
      2 = Black          3. Other: 0 = Not Other; 1 = Other
      3 = Other

  44. Multiple Regression • The way you use nominal variables in regression is by converting them to a series of dummy variables.

     Nominal Variable        Recode into different Dummy Variables
     Religion                1. Catholic: 0 = Not Catholic; 1 = Catholic
      1 = Catholic           2. Protestant: 0 = Not Prot.; 1 = Protestant
      2 = Protestant         3. Jewish: 0 = Not Jewish; 1 = Jewish
      3 = Jewish             4. Muslim: 0 = Not Muslim; 1 = Muslim
      4 = Muslim             5. Other Religions: 0 = Not Other; 1 = Other Relig.
      5 = Other Religions
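The recode itself is mechanical. Here is a sketch (the respondent codes are made up for illustration) that turns the 1–5 religion coding into five 0/1 dummies:

```python
religion_codes = [1, 2, 2, 3, 5, 4, 1, 2]  # hypothetical respondents
labels = {1: "Catholic", 2: "Protestant", 3: "Jewish", 4: "Muslim", 5: "Other"}

# One 0/1 dummy per category: 1 if the respondent has that code, else 0
dummies = {name: [1 if code == k else 0 for code in religion_codes]
           for k, name in labels.items()}
print(dummies["Catholic"])  # [1, 0, 0, 0, 0, 0, 1, 0]
```

Each respondent scores 1 on exactly one dummy, which is why (as the next slides stress) one dummy must be left out of the regression.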

  45. Multiple Regression • When you need to use a nominal variable in regression (like race), just convert it to a series of dummy variables. • When you enter the variables into your model, you MUST LEAVE OUT ONE OF THE DUMMIES. Leave Out One: White. Enter Rest into Regression: Black, Other.

  46. Multiple Regression • The reason you MUST LEAVE OUT ONE OF THE DUMMIES is that regression is mathematically impossible without an excluded group. • If all were in, holding one of them constant would prohibit variation in all the rest. Leave Out One: Catholic. Enter Rest into Regression: Protestant, Jewish, Muslim, Other Religion.

  47. Multiple Regression • The regression equations for dummies will look the same. For Race, with 3 dummies, predicting self-esteem: Y = a + b1X1 + b2X2  a = the y-intercept, which in this case is the predicted value of self-esteem for the excluded group, white. b1 = the slope for variable X1, black b2 = the slope for variable X2, other

  48. Multiple Regression • If our equation were: For Race, with 3 dummies, predicting self-esteem: Y = 28 + 5X1 – 2X2 Plugging in values for the dummies tells you each group’s self-esteem average: White = 28 Black = 33 Other = 26  a = the y-intercept, which in this case is the predicted value of self-esteem for the excluded group, white. 5 = the slope for variable X1, black -2 = the slope for variable X2, other When cases’ values for X1 = 0 and X2 = 0, they are white; when X1 = 1 and X2 = 0, they are black; when X1 = 0 and X2 = 1, they are other.
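Plugging the dummy values into slide 48's equation (a sketch; the function name is mine) recovers each group's predicted self-esteem average:

```python
def self_esteem(black, other):
    """Slide 48's equation: Y = 28 + 5*X1 - 2*X2 (white is the excluded group)."""
    return 28 + 5 * black - 2 * other

print(self_esteem(0, 0))  # white: 28
print(self_esteem(1, 0))  # black: 33
print(self_esteem(0, 1))  # other: 26
```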

  49. Multiple Regression • Dummy variables can be entered into multiple regression along with other dichotomous and continuous variables. • For example, you could regress self-esteem on sex, race, and education: Y = a + b1X1 + b2X2 + b3X3 + b4X4 How would you interpret this? Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4  X1 = Female X2 = Black X3 = Other X4 = Education 

  50. Multiple Regression How would you interpret this? Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4 • Women’s self-esteem is 4 points lower than men’s. • Blacks’ self-esteem is 5 points higher than whites’. • Others’ self-esteem is 2 points lower than whites’ and consequently 7 points lower than blacks’. • Each year of education improves self-esteem by 0.3 units. X1 = Female X2 = Black X3 = Other X4 = Education 
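The same plug-in logic works for the mixed model (a sketch; the example respondents are illustrative, not from the slides):

```python
def self_esteem(female, black, other, educ):
    """Slide 49's equation: Y = 30 - 4*X1 + 5*X2 - 2*X3 + 0.3*X4."""
    return 30 - 4 * female + 5 * black - 2 * other + 0.3 * educ

# e.g., a Black woman with 12 years of education vs. a white man with 12
print(round(self_esteem(1, 1, 0, 12), 1))  # 34.6
print(round(self_esteem(0, 0, 0, 12), 1))  # 33.6
```

The difference between the two predictions (+1) is just the sum of the relevant dummy effects (-4 for female, +5 for black), since education is held constant.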
