Adding additional variables: multiple regression


Last time
• Regression introduction
• Basics of OLS: calculations of beta, alpha, error term, etc.
• Bivariate analysis
• Basic model diagnostics: R2, F-tests, MSE
Today (and Monday):
• Multivariate regression
• Assumptions of OLS, detection of violations and what to do about them
• Check your mail and/or GUL for data today!

Back to Sir Galton (diagram from Galton, published in Nature, 1898)

Multivariate regression
• So far, we've just kept it simple with bivariate regression models: Y = α + βX + e
• With multiple regression, we're of course adding more variables ('parameters') to our model: Y = α + β1X1 + β2X2 + ... + βnXn + e
• In stats terms, the bivariate model is a 'constrained' or 'restricted' version of this model (the coefficients on the left-out variables are fixed at 0); we're now estimating a less restricted model.
• We're thus able to account for a greater number of explanations as to why Y varies.
• Additional variables can be included for a number of reasons: controls, additional theoretical explanations, interactions (later)

How now to interpret our coefficients?
• βn = the change in Y for a one unit change in xn, holding all other variables constant (or 'all things being equal', or 'ceteris paribus'). In other words, one x's effect is the average marginal effect across all values of the additional X's in the model.
• α (intercept) = the estimated value of Y when all X's are held at '0'. This may or may not be realistic depending on the scales of your IV's.

• Circle Y: The total variation in the dependent variable (TSS)
• Circle x1: The total variation in the first independent variable
• Circle x2: The total variation in the second independent variable
• A: The unique covariance between the independent variable x1 and y
• B: The unique covariance between the independent variable x2 and y
• C: Shared covariance between all three variables
• D: Covariance between the two independent variables, not including the dependent variable
• E and F: Variation in x1 (E) and x2 (F) respectively that is not associated with the other variables
• G: The variation in the dependent variable NOT explained by the independent variables (RSS); this is the variation that could be explained by additional independent variables

β coefficients in multiple regression: how we get Beta for x1 when we also have x2
Essentially:
• Run a regression for y (dependent) and x2 (independent)
• Areas C and B are predicted by the equation: ŷ = a + b·x2
• Areas A and G are shown in w (error), which equals: w = y - ŷ
• Areas A and G are secured in y through w.
• Now, we can calculate the unique effect of x1 on y under control for x2 (and vice versa for x2)

Simple ex.
We'll use this data later on, but our DV is '% female in parliament' and the IVs are expenditure on primary school (x1) and level of corruption (x2).
Our full model for reference:
Now, let's check the independent effect of x1 on the DV

Simple ex.
Then we get the residuals from this bivariate model (Y - Yhat):
predict hat
gen res = ipu_l_sw - hat
Then we check the independent effect of x2 on the residuals, controlling for x1…. voilà!

Simple ex.
If we compare with the original…
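The residual procedure from the last three slides can be sketched on data everyone has. This is only an illustration under assumptions: it uses Stata's built-in auto data rather than the lecture data, with price as y, weight as x1 and mpg as x2, and the scalar names b_full and b_resid are made up for the example.

```stata
* Sketch on Stata's built-in auto data (an assumption; not the lecture data)
sysuse auto, clear

* Full model for reference: y = price, x1 = weight, x2 = mpg
regress price weight mpg
scalar b_full = _b[mpg]

* Step 1: bivariate regression of y on x1, then keep the residuals (Y - Yhat)
regress price weight
predict double hat, xb
generate double res = price - hat

* Step 2: regress the residuals on x2, controlling for x1
regress res mpg weight
scalar b_resid = _b[mpg]

* Step 3: compare with the coefficient on x2 from the full model
display "full model: " b_full "   residual regression: " b_resid
```

The two numbers should agree: the part of y already accounted for by x1 carries no additional information about x2 once x1 is controlled for.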

Calculation of the β coefficients in multiple regression
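The formula from this slide did not survive in the text. As a hedged reminder, and not necessarily the expression originally shown, two standard and equivalent ways to write the coefficient on x1 in the two-predictor case are given below. The symbols (r for correlations, s for standard deviations, and the residual of x1 after regressing it on x2) are notation chosen for this sketch.

```latex
% Two-predictor model: y = \alpha + \beta_1 x_1 + \beta_2 x_2 + e

% (1) Residual form: \hat{r}_{1i} is the residual from regressing x_1 on x_2
\hat{\beta}_1 = \frac{\sum_i \hat{r}_{1i}\, y_i}{\sum_i \hat{r}_{1i}^{2}},
\qquad \hat{r}_{1i} = x_{1i} - \hat{x}_{1i}

% (2) Correlation form
\hat{\beta}_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^{2}} \cdot \frac{s_y}{s_{x_1}}
```

When x1 and x2 are uncorrelated, form (2) reduces to the bivariate slope.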

Starting simple: dummy variables in regression
• A dummy variable is a dichotomous variable coded 0 and 1 (based on an original nominal or ordinal variable). Aka binary variable.
• If an independent variable is nominal, we can still use it by creating dummy variables (several of them if it has more than 2 categories).
• The number of dummy variables needed in the regression depends on the number of categories on the original variable: # of dummies = # of categories on the original variable - 1
• Ex. occupation: 1. public sector, 2. private sector, 3. not working. We would include a dummy for 2 groups, and these βs are compared with the third (omitted) group.
• We can also do this for ordinal IV's, like low, middle and high, for example. Excluding 'low' means that the βs for 'middle' and 'high' are in relation to 'low'.

Starting simple: dummy variables in regression
• In any regression, the intercept equals the mean of the dependent variable when all X's = 0; thus with a dummy variable it equals the mean of Y for the reference category (RC). Very relevant in survey research..
• The coefficients show each category's difference in mean Y relative to the RC.
• If we add other independent variables to our model, the intercept is the predicted Y when ALL independent variables are 0.

Dummy/categorical variables in STATA
In STATA, when we have a categorical/ordered variable, we can do this in two ways:
1. Simply write i.xvar in any regression ('i.' tells STATA that this is a categorical variable, and it will omit the lowest #)
2. tab xvar, gen(new_xvar) will create a binary 0/1 variable for each category. Then put all but one dummy variable in the regression
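A small sketch of both ways, again assuming Stata's built-in auto data rather than the course data. The repair record rep78 (5 categories) is treated as categorical purely for illustration, and the scalar names are made up for the example.

```stata
* Sketch on Stata's built-in auto data (an assumption; not the lecture data)
sysuse auto, clear

* Way 1: factor-variable notation; the lowest category (rep78 == 1) is omitted
regress price i.rep78
scalar b2_factor = _b[2.rep78]
scalar cons = _b[_cons]

* Way 2: create one 0/1 dummy per category, then include all but one
tabulate rep78, generate(rep_)
regress price rep_2 rep_3 rep_4 rep_5
scalar b2_manual = _b[rep_2]

* The intercept is the mean of the DV in the reference category
summarize price if rep78 == 1
display "intercept: " cons "   mean in reference category: " r(mean)
```

Both ways give the same coefficients; the i. notation is less typing and keeps the original variable intact.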

Example: support for EU integration, EES_EUsuportdata_s13.dta (on GUL)
• Let's say we're interested in explaining why support for further EU integration varies at the individual level in Sweden.
• DV: Some say European unification should be pushed further. Others say it already has gone too far. What is your opinion? Please indicate your views using a scale from 0 to 10, where '0' means unification "has already gone too far" and '10' means it "should be pushed further".
• 3 IV's: gender (0=male, 1=female), education (1=some post-secondary or more, 0 otherwise) and European identity (attachment, 0-3, higher = more attached; 0=very unattached, 3=very attached)

Summary stats
[Figure: histogram of Supp_EU_int (x-axis: 0 to 10; y-axis: frequency, 0 to 300)]
• hist xvar
• DV ranges from 0-10
• 2 binary IV's
• 1 ordinal IV

1. Intercept: the predicted level of the DV when all IVs = 0 (men, w/out college, who are strongly detached from Europe)
2. Female: the effect of gender is significant. Holding constant education and European identity, females score 0.40 lower on support for further EU integration on average compared to men
3. Education: the effect is also significant. Having some post-secondary education increases support for EU integration by 0.37, holding gender and European identity constant
4. European attachment: is significant. Holding constant education and gender, a one unit increase in attachment results in an increase in the DV of 1.05 on average.

A visual with gender and identity (additive effects)

Some predictions from our model
• Sometimes, we might want to draw out 'ideal types' for high/low levels of our DV & show the average predicted values..
• What is the predicted level of support for further EU integration for a:
1. Male with some university and a strong European identity (3)? = 2.21 - 0.40(0) + 0.37(1) + 1.05(3) = 5.73
2. Female with no university and a very weak European attachment (0)? = 2.21 - 0.40(1) + 0.37(0) + 1.05(0) = 1.81
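The two predictions above are plain arithmetic on the reported coefficients, which can be checked in Stata without any data loaded. The scalar names below are made up for this sketch; the numbers are the ones on the slide.

```stata
* Coefficients as reported on the slides
scalar b0 = 2.21
scalar b_female = -0.40
scalar b_educ = 0.37
scalar b_attach = 1.05

* Male (0), some university (1), strong European identity (3)
scalar pred_a = b0 + b_female*0 + b_educ*1 + b_attach*3

* Female (1), no university (0), very weak attachment (0)
scalar pred_b = b0 + b_female*1 + b_educ*0 + b_attach*0

display %4.2f pred_a
display %4.2f pred_b
```

With the fitted model in memory, the margins command with the at() option would give the same kind of prediction along with a confidence interval.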

Alternatively, we can create dummies for EU_attach
• With the 'i.' before the variable, STATA creates dummy variables for n-1 categories.
• The lowest category is used as reference (in this case, 0). All Betas are in comparison to '0'
• Notice no change in other variables or model stats, but the constant is different
• To change the ref. category to 3, write 'ib3.xvar'

Comparing marginal effects
• Significance values are not always interesting: most everything tends to become significant with many observations, as in large survey data or large CSTS datasets.
• Another great feature of OLS is that we can compare both marginal and total effects of all B's. When you are about to publish your results, you often want to say which variables have the greatest impact in the model.
• Marginal effects (shown in the regression output): these b-values only show the change in Y associated with a one unit change in X.
• Total effects (min to max effect, or the effect within a certain range): here one has to consider the scale of the variable.
Question: what is the marginal and total effect of our 3 variables?

Answer..
• For binary variables, marginal and total are the same
• For ordinal/continuous variables, we can do a few things to check this:
• 'Normalize' (re-scale) the variable to run from 0 to 1 (see do file 'day 2 ESS ex.do' for this)
• Compare standardized coefficients (just add the option 'beta')
• Alternative: use the 'margins' or 'mchange' command (more later..)
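A hedged sketch of the re-scaling approach, using Stata's built-in auto data as a stand-in (the course do file may do this differently). The variable weight01 and the scalar names are made up for the example.

```stata
* Sketch on Stata's built-in auto data (an assumption; not the lecture data)
sysuse auto, clear

* Marginal effect: change in price for a one-unit (one pound) change in weight
regress price weight foreign
scalar b_marginal = _b[weight]

* Re-scale weight so that it runs from 0 (minimum) to 1 (maximum)
summarize weight
scalar w_range = r(max) - r(min)
generate double weight01 = (weight - r(min)) / (r(max) - r(min))

* The coefficient on the re-scaled variable is the min-to-max (total) effect
regress price weight01 foreign
scalar b_total = _b[weight01]
display "marginal: " b_marginal "   total: " b_total

* Standardized coefficients are shown when the beta option is added
regress price weight foreign, beta
```

The total effect is just the marginal effect multiplied by the range of the variable, so the re-scaling changes the units of the coefficient and nothing else in the model.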

For our model….

Direct comparison: standardized coefficients
• Standardized coefficients can be used to make direct comparisons of the effects of IV's
• When standardized coefficients (beta values) are used, the scale unit of all variables is deviations from the mean, measured in number of standard deviations
• Thus, we gain comparability but lose the intuitive feeling of our interpretation of the results. We can always report both 'regular' betas and standardized ones.

STANDARDIZED COEFFICIENTS (BETAS)
• The standardization of a variable is just subtracting the mean from each observation and dividing by the st. dev.
• The standardized variable will have a mean of '0' and a st. dev. of '1'
• Standardized scores are also known as z-scores, so often they are labeled with a 'z'
In STATA (for variables 'y' and 'x'); r(mean) and r(sd) are only filled in after running sum on that variable:
sum y
gen zy = (y - r(mean))/r(sd)
sum x
gen zx = (x - r(mean))/r(sd)
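To see that the 'beta' output and a regression on z-scores are the same thing, here is a minimal sketch on Stata's built-in auto data (an assumption; the lecture uses other data). The z_ variables and scalar names are made up for the example.

```stata
* Sketch on Stata's built-in auto data (an assumption; not the lecture data)
sysuse auto, clear

* Standardize each variable: subtract its mean, divide by its st. dev.
foreach v of varlist price weight mpg {
    quietly summarize `v'
    generate double z_`v' = (`v' - r(mean)) / r(sd)
}

* Regular coefficients, with standardized ones shown by the beta option
regress price weight mpg, beta
scalar b_weight = _b[weight]

* Standardized coefficient by hand: b * sd(x) / sd(y)
quietly summarize weight
scalar sd_x = r(sd)
quietly summarize price
scalar sd_y = r(sd)
scalar beta_weight = b_weight * sd_x / sd_y

* Regressing z-scores on z-scores gives the same number as the slope
regress z_price z_weight z_mpg
display "by hand: " beta_weight "   from z-scores: " _b[z_weight]
```

This works here because no observations are dropped; with missing values, the z-scores should be computed on the estimation sample only.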

Lab exercise #1: see GUL
• Work in 2s; please try to mix so that people with STATA experience work with people without it
• Open Word file: "multivariate regression STATA exercise"
• Open data file "states.dta"

Ordinary least squares regression: assumptions, violations and what to do

OLS is 'BLUE'
• What is this?
• It is the Best Linear Unbiased Estimator
• Aka the 'Gauss-Markov theorem'
• It states that in a linear regression model in which the errors have expectation zero, are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator. Here "best" means giving the lowest variance of the estimate, as compared to other unbiased, linear estimators.

Assumptions of OLS
Again, OLS is fantastic if our data meet several assumptions, and before we make any inferences, we should always check them. In order to make generalizable inferences:
• The linear model is suitable (correct specification)
• The conditional standard deviation is the same for all levels of X (homoskedasticity)
• Error terms are normally distributed for all levels of X
• There are no severe outliers
• There is no autocorrelation
• No multicollinearity
• Our sample is representative of the population (random selection in the case of a survey, for ex.)

1) Model specification: main issues
a) Causality in the relationships: not so much a problem for the statistical model but rather a theoretical &/or research design problem. Better data and modelling: use panel data, experiments & theory!
b) Is LINEAR the best way to define the relationship between DV and IV? If not, OLS regression will give biased results.
c) All theoretically relevant variables should be included. The ideal is a randomly drawn sample w/equal proportions of all relevant characteristics, as per an experiment.. Not always possible..
- If they are not included, this will lead to "omitted variable bias": if an important variable is left out of a model, this will influence the coefficients of the other variables in the model.
- Remedy? Theory, previous literature. Motivate all variables. Some statistical tests/checks.

1. Linear model is suitable
When 1 or more IV's have a non-linear effect on the DV, a relationship exists but cannot be properly detected in standard OLS. This is one of the easiest problems to detect:
• Bivariate scatterplot: if the scatter plot doesn't show an approximately linear pattern, the fitted line may be almost useless.
• Ramsey RESET test (F-test)
• Theory
If X and Y do not fit a linear pattern, there are several measures you can take.

Non-linearity can be detected!

Checking for this: US States data (from exercise)
Scatter looks ok, but let's check more formally with the Ramsey RESET test. 3 steps:
1. Run regression in STATA
2. Run command linktest
3. Run command ovtest (estat ovtest in newer versions of STATA)
The linktest re-estimates your DV with the predicted value (_hat) and the squared predicted value (_hatsq) of your model as IVs; a significant _hatsq suggests misspecification.
ovtest, Ho: the model is specified correctly (no omitted variables). A significant p-value implies that the model is incorrectly specified.
If sig., make an adjustment and re-run the regression & tests.
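The three steps, sketched on Stata's built-in auto data so that they can be run without the course files (an assumption; the exercise itself uses states.dta). The model price on weight is picked only for illustration.

```stata
* Sketch on Stata's built-in auto data (an assumption; not the lecture data)
sysuse auto, clear

* Step 1: run the regression
regress price weight

* Step 2: link test; look at the p-value on _hatsq
linktest

* Step 3: Ramsey RESET test; H0: the model has no omitted variables
estat ovtest
display "RESET p-value: " r(p)
```

A small p-value in step 3 is the cue to try a transformation or a squared term and then repeat all three steps.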

Example with US states data: the 3 steps
1. Run regression. What do you see?
2. Run linktest. What do you see?
3. Check ovtest. What do you see?
• Ho = original model correctly specified.
• Sig. result = something's wrong!!

Issues with non-linearity
• Problems with curvilinear relationships: we will under- or overestimate the effect on the dependent variable for different values of the independent variable.
• However, this is a 'sexy' problem to have at times..
• OLS can be used for relationships that are not strictly linear in y and x by using non-linear functions of y and x
3 standard approaches depending on the data:
1. Natural log of x, y or both*. The logarithm is the inverse of exponentiation
2. Quadratic forms of x or y (e.g. squared terms)
3. Interactions of x variables
• Or adding more data/observations…
*Taking the natural logarithm downplays extreme values and tends to make a right-skewed variable more normally distributed.

Variable transformation: natural logarithm
• 'Log models' are invariant to the scale of the variables since they are now measuring percent changes.
• Sometimes done to constrain extreme outliers and downplay their effect in the model, and/or make the distribution more 'compact'.
• Standard variables in social science that researchers tend to log:
1. Positive variables representing wealth (personal income, country GDP, etc.).
2. Other variables that take large values: population, geographic area size, etc.
• Important to note: the rank order does not change from the original scale!

Transforming your variables
Using the natural logarithm (i.e. the inverse of the exponential function). Only for x>0.
Ex. corruption explained by country size (population)
[Figures: population and corruption; logged population and corruption]
In Stata:
reg DV IV
gen logIV = log(IV)
reg DV logIV
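A minimal sketch of the log transformation, which also illustrates the scale-invariance point made for log models. It assumes Stata's built-in auto data instead of the corruption data; the ln_ variables and scalar names are made up for the example. In Stata, log() and ln() are the same function (natural log).

```stata
* Sketch on Stata's built-in auto data (an assumption; not the lecture data)
sysuse auto, clear

* Natural log; only defined for positive values (missing otherwise)
generate double ln_price = ln(price)
generate double ln_weight = ln(weight)

* Log-log model: the slope is an elasticity
regress ln_price ln_weight
scalar elast_lbs = _b[ln_weight]

* Change the unit of x (pounds to kilograms): the elasticity does not move
generate double ln_weight_kg = ln(weight * 0.4536)
regress ln_price ln_weight_kg
scalar elast_kg = _b[ln_weight_kg]
display "elasticity (lbs): " elast_lbs "   elasticity (kg): " elast_kg
```

Only the intercept changes when the unit changes, because ln(c·x) = ln(c) + ln(x).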

Interpretation of transformations with logs
1. Logged DV and non-logged IV: ln(y) = β0 + β1x + u
• β1 is approximately the percentage change in y given an absolute change in x. A 1-unit increase in the IV gives a coefficient*100 percent increase in the DV. (%Δy = 100·β1·Δx)
2. Logged IV and non-logged DV: y = β0 + β1ln(x) + u
• β1 is approximately the absolute change in y for a percentage change in x. A 1% increase in the IV gives a coefficient/100 increase in the DV in absolute terms. (Δy = (β1/100)·%Δx)
3. Logged DV and IV: ln(y) = β0 + β1ln(x) + u
• β1 is the elasticity of y with respect to x (%Δy = β1·%Δx)
• β1 is thus the percentage change in y for a percentage change in x
• NOTE: The interpretation is only applicable for log base e (natural log) transformations.
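The word "approximately" in rules 1 and 2 matters when coefficients or changes are large. A short worked note, using the same symbols as the rules above, shows the exact expressions next to the approximations:

```latex
% Rule 1 (logged DV): exact versus approximate percentage change in y
\%\Delta y = 100\left(e^{\beta_1 \Delta x} - 1\right) \approx 100\,\beta_1\,\Delta x
\quad \text{for small } \beta_1 \Delta x

% Examples with \Delta x = 1:
%   \beta_1 = 0.05:  exact 5.13\%,  approximation 5\%
%   \beta_1 = 0.50:  exact 64.87\%, approximation 50\%

% Rule 2 (logged IV): exact versus approximate absolute change in y
\Delta y = \beta_1 \ln\!\left(1 + \frac{\%\Delta x}{100}\right) \approx \frac{\beta_1}{100}\,\%\Delta x
```

For coefficients below roughly 0.1 in absolute value the shortcut is usually close enough; beyond that, reporting the exact figure is safer.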

Summary of rules for interpretation of Beta with log-transformed variables

Quadratic forms (e.g. squared)
• Ex. Democracy versus corruption
• Explained later by an interaction with economic development
Charron, N., & Lapuente, V. (2010). Does democracy produce quality of government? European Journal of Political Research, 49(4), 443-470.

Quadratic forms: capture diminishing or increasing returns
How to model this? Quite simple: add a squared term of the non-linear IV

Quadratic forms: interpretation
• Analyses including quadratic terms can be viewed as a special case of interactions (more on Friday on this topic)
• Include both the original variable and the squared term in your model: y = β0 + β1x + β2x² + u
• For 'u' shaped curves, B1 should be negative, while B2 should be positive
• Including the squared term means that β1 can't be interpreted alone as measuring the change in y for a unit change in x; we need to take into account β2 as well, since the slope now depends on x: Δy/Δx ≈ β1 + 2β2x
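For reference, the slope of the quadratic model and the point where the curve turns follow directly from the model equation. This is a standard calculus result written in the notation of the slide:

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2 + u
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x} = \beta_1 + 2\beta_2 x

% The slope is zero at the turning point of the curve
x^{*} = -\frac{\beta_1}{2\beta_2}
```

With β1 negative and β2 positive (the 'u' shape), the turning point is positive: the effect of x on y is negative below it and positive above it.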

In Stata
• 2 approaches:
1. Generate a new squared variable: gen x2 = x*x
2. Tell STATA in the regression with the '##' sign (better solution). For continuous or ordinal variables we need to add the 'c.' prior to the variable; for binary, use 'i.':
Ex. reg csat c.percent##c.percent
Now, let's go back to our example, we see..

We should also re-check the Ramsey RESET test after the adjustment.. What do we see?

Quadratic forms: getting concrete model predictions using the margins command following regression
margins, at(percent=(4(10)84))
marginsplot
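Both approaches to the squared term, followed by margins, can be tried on Stata's built-in auto data (an assumption; the lecture example uses csat and percent from states.dta). The variable weight2, the scalar names and the range of weights in at() are chosen only for this sketch.

```stata
* Sketch on Stata's built-in auto data (an assumption; not the lecture data)
sysuse auto, clear

* Approach 1: generate the squared term by hand
generate double weight2 = weight * weight
regress price weight weight2
scalar b_sq_manual = _b[weight2]

* Approach 2: factor-variable notation, so Stata knows both terms are one variable
regress price c.weight##c.weight
scalar b_lin = _b[weight]
scalar b_sq = _b[c.weight#c.weight]

* Turning point of the curve: -b1 / (2*b2)
scalar turn = -b_lin / (2 * b_sq)
display "turning point at weight = " turn

* Predicted price over a range of weights, then plot the curve
margins, at(weight=(2000(500)4500))
marginsplot
```

The coefficients are identical under both approaches, but only the second lets margins move weight and its square together, which is why it is the better solution.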

Other things to watch for with the OLS assumptions
• The sample is a simple random representative sample (SRS) from the population.
• Model has correct values
• Data is valid and accurately measures concepts
• No omitted variables (exogeneity)…
