
Regression Assumptions



Presentation Transcript


  1. Regression Assumptions

  2. Best Linear Unbiased Estimate (BLUE)
  • OLS gives the best linear unbiased estimate if the following assumptions are met:
    • The model is
      • complete
      • linear
      • additive
    • Variables are
      • measured at an interval or ratio scale
      • measured without error
    • The regression error term
      • is normally distributed
      • has an expected value of 0
      • is independent across observations
      • has equal variance across predictions (homoscedasticity)
      • is unrelated to the predictors
      • in a system of interrelated equations, the errors are unrelated to each other
  • Characteristics of OLS if the sample is a probability sample:
    • unbiased
    • efficient
    • consistent

  3. The Three Desirable Characteristics
  • Lack of bias
    • E(b) = β, where b is the sample coefficient and β is the true, population coefficient
    • On average we are on target
  • Efficiency
    • The standard error will be at a minimum
    • Remember: OLS minimizes σ^2 (the error variance)
  • Consistency
    • As N increases, the standard error decreases
    • Notice: as N increases so does Σxi^2, and since the standard error of b is σ/√(Σxi^2) (with xi measured as deviations from the mean), a larger Σxi^2 means a smaller standard error
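
  A minimal simulation sketch of these properties (not from the original slides; the data-generating process and coefficients below are invented): across repeated samples the average estimated slope stays near the true β (unbiasedness), and its spread shrinks as N grows (consistency).

    # Sketch: repeated-sampling view of OLS. The true slope is set to 2.0.
    import numpy as np

    rng = np.random.default_rng(0)
    beta_true = 2.0

    def ols_slope(n):
        x = rng.normal(size=n)
        y = 1.0 + beta_true * x + rng.normal(scale=3.0, size=n)
        return np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope = Cov(x, y) / Var(x)

    for n in (50, 500, 5000):
        slopes = np.array([ols_slope(n) for _ in range(2000)])
        print(f"N={n}: mean(b)={slopes.mean():.3f}  sd(b)={slopes.std():.3f}")
    # mean(b) is close to 2.0 at every N; sd(b) falls as N increases.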

  4. Completeness

  . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

        Source |       SS        df       MS             Number of obs =   10082
  -------------+------------------------------           F(  6, 10075) = 2947.08
         Model |  65503313.6      6  10917218.9          Prob > F      =  0.0000
      Residual |  37321960.3  10075  3704.41293          R-squared     =  0.6370
  -------------+------------------------------           Adj R-squared =  0.6368
         Total |   102825274  10081  10199.9081          Root MSE      =  60.864

  ------------------------------------------------------------------------------
         API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
  -------------+----------------------------------------------------------------
         MEALS |   .1843877   .0394747     4.67   0.000                 .0508435
        AVG_ED |   92.81476   1.575453    58.91   0.000                 .6976283
          P_EL |   .6984374   .0469403    14.88   0.000                 .1225343
        P_GATE |   .8179836   .0666113    12.28   0.000                 .0769699
          EMER |  -1.095043   .1424199    -7.69   0.000                 -.046344
          DMOB |   4.715438   .0817277    57.70   0.000                 .3746754
         _cons |   52.79082   8.491632     6.22   0.000                        .
  ------------------------------------------------------------------------------

  [Slide callout: Meals]

  . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

        Source |       SS        df       MS             Number of obs =   10082
  -------------+------------------------------           F( 13, 10068) = 1488.01
         Model |     67627352     13     5202104         Prob > F      =  0.0000
      Residual |  35197921.9  10068  3496.01926          R-squared     =  0.6577
  -------------+------------------------------           Adj R-squared =  0.6572
         Total |   102825274  10081  10199.9081          Root MSE      =  59.127

  ------------------------------------------------------------------------------
         API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
  -------------+----------------------------------------------------------------
         MEALS |    .370891   .0395857     9.37   0.000                 .1022703
        AVG_ED |   89.51041   1.851184    48.35   0.000                 .6727917
          P_EL |   .2773577   .0526058     5.27   0.000                 .0486598
        P_GATE |   .7084009   .0664352    10.66   0.000                 .0666584
          EMER |  -.7563048   .1396315    -5.42   0.000                  -.032008
          DMOB |   4.398746   .0817144    53.83   0.000                  .349512
        PCT_AA |  -1.096513   .0651923   -16.82   0.000                -.1112841
        PCT_AI |  -1.731408   .1560803   -11.09   0.000                -.0718944
        PCT_AS |   .5951273   .0585275    10.17   0.000                 .0715228
        PCT_FI |   .2598189   .1650952     1.57   0.116                 .0099543
        PCT_HI |   .0231088   .0445723     0.52   0.604                 .0066676
        PCT_PI |  -2.745531   .6295791    -4.36   0.000                -.0274142
        PCT_MR |  -.8061266   .1838885    -4.38   0.000                -.0295927
         _cons |   96.52733   9.305661    10.37   0.000                        .
  ------------------------------------------------------------------------------

  [Slide callout: Parents' education]
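
  To make the completeness point concrete, here is a hedged Python sketch with made-up data (not the API13 file used above): when a relevant variable that is correlated with an included predictor is left out, the included predictor's coefficient absorbs part of its effect, much as MEALS changes once the ethnic composition variables enter the model.

    # Sketch of omitted-variable bias with synthetic data (illustrative only).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 5000
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)            # x2 is correlated with x1
    y = 1.0 + 0.5 * x1 + 1.5 * x2 + rng.normal(size=n)

    short = sm.OLS(y, sm.add_constant(x1)).fit()                        # incomplete model
    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # complete model
    print(short.params)   # slope on x1 is inflated (roughly 0.5 + 1.5*0.6)
    print(full.params)    # slope on x1 is close to the true 0.5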

  5. Diagnosis and Remedy
  • Diagnosis: theoretical (an omitted variable cannot be detected from the data alone)
  • Remedy: including new variables

  6. Linearity
  • Violations of linearity:
    • An almost perfect but non-linear relationship can appear as a weak one
    • Almost all linear relations stop being linear at a certain point

  7. Diagnosis & Remedy
  • Diagnosis:
    • Visual: scatter plots
    • Comparing the regression with a continuous vs. a dummied independent variable
  • Remedy:
    • Use dummies: Y = a + bX + e becomes Y = a + b1D1 + ... + b(k-1)D(k-1) + e, where X is broken up into k dummies (Di) and k-1 of them are included. If the R-square of this equation is significantly higher than the R-square of the original, that is a sign of non-linearity. The pattern of the slopes (bi) will indicate the shape of the non-linearity.
    • Transform the variables through a non-linear transformation, so that Y = a + bX + e becomes
      • Quadratic: Y = a + b1X + b2X^2 + e
      • Cubic: Y = a + b1X + b2X^2 + b3X^3 + e
      • Kth-degree polynomial: Y = a + b1X + ... + bkX^k + e
      • Logarithmic: Y = a + b*log(X) + e
      • Exponential: log(Y) = a + bX + e, i.e. Y = e^(a + bX + e)
      • Inverse: Y = a + b/X + e, etc.
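
  A brief sketch of the polynomial remedy with simulated data (the curve and coefficients below are invented for illustration): fit the straight line and the quadratic, then compare R-squares, as suggested above.

    # Sketch: linear vs. quadratic fit as a non-linearity check.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=1000)
    y = 3 + 2 * x - 0.15 * x**2 + rng.normal(size=1000)   # the true relation is curved

    linear = sm.OLS(y, sm.add_constant(x)).fit()
    quadratic = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
    print(linear.rsquared, quadratic.rsquared)   # the quadratic R-square is clearly higher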

  8. Example

  9. Meaningless! For the quadratic fit Y = a + b1X + b2X^2, the slope is b1 + 2*b2*X, so the turning point is at X = -b1/(2*b2) = -(-3.666183)/(2*.0181756) = 100.85425. As you approach 100% the negative effect disappears.

  10. Other non-linear functions. Example: count data as the dependent variable
  • Underdispersion: Mean/Std.Dev. > 1
  • Overdispersion: Mean/Std.Dev. < 1
  • As the mean is greater than the standard deviation, we have a case of (small) underdispersion.
  • We care about dispersion because it tells us something not just about how spread out the distribution is but also about its shape. Remember that count data cannot be less than 0. So if the mean is less than the standard deviation, the distribution has to be asymmetric (often with lots of 0s to keep the mean low, but a few very large values to pull the Std.Dev. up).

  11. Poisson and Negative Binomial Regressions
  • Poisson regression assumes for the dependent variable that Mean = Std.Dev. (no over- or underdispersion). Then
    Pr(Y = y | X) = e^(-μ) * μ^y / y!   with   μ = e^(Xθ),
    where θ stands for all the coefficients to be estimated (constant and slopes).
  • Use negative binomial regression when there is overdispersion (when the mean is smaller than the standard deviation). Overdispersion happens when you have a lot of 0s.
  • alpha = 0 means no over- or underdispersion. Here alpha is small but significantly different from 0 (the 95% confidence interval does not include 0).
  • The log of the expected count is now the unit of the dependent variable.
  • In this case, given the slight underdispersion, you should opt for the Poisson regression.
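
  A hedged sketch of the same workflow in Python with simulated counts (the data used on the slide are not available here): check the dispersion, then fit both Poisson and negative binomial with statsmodels and look at the estimated alpha.

    # Sketch: dispersion check, then Poisson vs. negative binomial (synthetic counts).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 2000
    x = rng.normal(size=n)
    mu = np.exp(0.5 + 0.4 * x)
    y = rng.poisson(mu)                      # counts generated from a Poisson model

    print(y.mean(), y.std())                 # rough dispersion check, as on the slide
    X = sm.add_constant(x)
    poisson_fit = sm.Poisson(y, X).fit(disp=0)
    negbin_fit = sm.NegativeBinomial(y, X).fit(disp=0)
    print(poisson_fit.params)
    print(negbin_fit.params)                 # the last parameter is alpha; near 0 here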

  12. Additivity
  • Y = a + b1X1 + b2X2 + e
  • The assumption is that X1 and X2 each add to Y separately, regardless of the value of the other.
  • When additivity is violated you cannot simply add the two effects: X1 works differently depending on the value of X2.
  • There are many examples of violations of additivity:
    • The effect of previous knowledge (X1) and effort (X2) on grades (Y): less effort will bring better grades if you have previous knowledge about the material taught in the class.
    • The effect of gender and education on income (discrimination): women increase their income less by increasing their educational achievements. Education does not pay the same way for men and women.
    • The effect of paternal and maternal education on academic achievement: if you have an educated father, your mom's education matters less (or if you have an educated mom, your father's education matters less). You cannot just add the effects of the two parents' education.

  13. Diagnosis & Remedy
  • Diagnosis: try other functional forms and compare R-squares.
  • Remedy: introduce the multiplicative term as a new variable, so Y = a + b1X1 + b2X2 + e becomes Y = a + b1X1 + b2X2 + b3Z + e, where Z = X1*X2.
  • Suppose X2 is a dummy variable:
    • If X2 = 0: Y = a + b1X1 + b2X2 + b3X1*X2 + e = a + b1X1 + b2*0 + b3*X1*0 + e = a + b1X1 + e
    • If X2 = 1: Y = a + b1X1 + b2X2 + b3X1*X2 + e = a + b1X1 + b2*1 + b3*X1*1 + e = (a + b2) + (b1 + b3)X1 + e = a' + b1'X1 + e
    • So when X2 = 0 the constant is a and the slope is b1, and when X2 = 1 the constant is a' and the slope is b1'.
    • The difference between a and a' is b2; the difference between b1 and b1' is b3.
  • Other remedy, if the model is fully multiplicative: transform the equation into additive form. If Y = a * X1^b1 * X2^b2 * e, then log(Y) = log(a) + b1*log(X1) + b2*log(X2) + log(e).
  [Slide figure: the two regression lines plotted against X1, with intercepts a and a' and slopes b1 and b1'.]
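
  A short sketch of this remedy with made-up data: compute Z = X1*X2 by hand and add it as a regressor. The order of the estimated coefficients mirrors the a, b1, b2, b3 notation above (variable names and values are illustrative).

    # Sketch: adding a multiplicative (interaction) term to the regression.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 3000
    x1 = rng.normal(size=n)
    x2 = rng.integers(0, 2, size=n)                       # a dummy variable
    y = 1 + 2 * x1 + 0.5 * x2 - 1.5 * x1 * x2 + rng.normal(size=n)

    Z = x1 * x2                                           # the multiplicative term
    X = sm.add_constant(np.column_stack([x1, x2, Z]))
    fit = sm.OLS(y, X).fit()
    print(fit.params)   # a, b1, b2, b3; for the x2 = 1 group the slope is b1 + b3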

  14. Example with one dummy variable

  Does parents' education matter more in elementary school or later?
  ESCHOOL = 1 if it is an elementary school, ESCHOOL = 0 otherwise.

  Model 1 - Model Summary (Dependent Variable: API13; Predictors: (Constant), ESCHOOL, AVG_ED)
    R = .720   R Square = .519   Adjusted R Square = .519   Std. Error of the Estimate = 70.918

  Model 1 - Coefficients
                    B         Std. Error   Beta     t         Sig.
    (Constant)      510.030   2.738                 186.250   .000
    AVG_ED          87.476    .930         .649     94.085    .000
    ESCHOOL         54.352    1.424        .264     38.179    .000

  Model 2 - Model Summary (Dependent Variable: API13; Predictors: (Constant), INTESXED, AVG_ED, ESCHOOL)
    R = .730   R Square = .533   Adjusted R Square = .533   Std. Error of the Estimate = 69.867

  Model 2 - Coefficients
                                   B         Std. Error   Beta     t         Sig.
    (Constant)                     454.542   4.151                 109.497   .000
    AVG_ED                         107.938   1.481        .801     72.896    .000
    ESCHOOL                        145.801   5.386        .707     27.073    .000
    AVG_ED*ESCHOOL (interaction)   -33.145   1.885        -.495    -17.587   .000

  15. Equations
  • Pred(API13) = 454.542 + 107.938*AVG_ED + 145.801*ESCHOOL + (-33.145)*AVG_ED*ESCHOOL
  • If ESCHOOL = 1, i.e. the school is an elementary school:
    Pred(API13) = 454.542 + 107.938*AVG_ED + 145.801*1 + (-33.145)*AVG_ED*1
                = (454.542 + 145.801) + (107.938 - 33.145)*AVG_ED
                = 600.343 + 74.793*AVG_ED
  • If ESCHOOL = 0, i.e. the school is not an elementary but a middle or high school:
    Pred(API13) = 454.542 + 107.938*AVG_ED + 145.801*0 + (-33.145)*AVG_ED*0
                = 454.542 + 107.938*AVG_ED
  • The effect of parental education is larger after elementary school!
  • Is this difference statistically significant? Yes: the interaction term in the Model 2 coefficients table above (AVG_ED*ESCHOOL, B = -33.145, t = -17.587) is significant at Sig. = .000.

  16. Example with continuous variables. Does parents' education work differently depending on the percent of English learners? Yes: the higher the proportion of English learners, the weaker the positive effect of parents' education.

  17. Proper Level of Measurement

  18. Measurement Error
  • Take Y = a + bX + e.
  • Suppose X* = X + u, where X is the real value and u is a random measurement error.
  • Then Y = a + b'X* + e'  →  Y = a + b'(X + u) + e' = a + b'X + b'u + e'  →  Y = a + b'X + E, where E = b'u + e' and b' = b.
  • The slope (b) will not change, but the error will increase as a result:
    • our R-square will be smaller
    • our standard errors will be larger → t-values smaller → significance smaller.
  • (Strictly, this conclusion holds for random measurement error in Y; random error in the predictor X also attenuates the estimated slope toward zero, the errors-in-variables problem.)
  • Suppose instead X# = X + cW + u, where W is a systematic measurement error and c is a weight.
  • Then Y = a + b'X# + e'  →  Y = a + b'(X + cW + u) + e' = a + b'X + b'cW + E.
  • b' = b only if r(W,X) = 0 or r(W,Y) = 0; otherwise b' ≠ b, which means the slope changes along with the increase in the error. Apart from the problems stated above, this means that our slope will be wrong.
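
  A small simulation sketch (invented data) of what random measurement error does: add noise to the predictor, refit, and compare. The fit gets noisier, and with error in X the slope also shrinks toward zero, as noted in the caveat above.

    # Sketch: random measurement error added to X (errors-in-variables illustration).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 5000
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)

    x_star = x + rng.normal(scale=1.0, size=n)            # X measured with random error
    clean = sm.OLS(y, sm.add_constant(x)).fit()
    noisy = sm.OLS(y, sm.add_constant(x_star)).fit()
    print(clean.params[1], clean.bse[1], clean.rsquared)
    print(noisy.params[1], noisy.bse[1], noisy.rsquared)  # slope near 1 instead of 2, lower R-square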

  19. Diagnosis & Remedy
  • Diagnosis: look at the correlation of the measure with other measures of the same variable.
  • Remedy:
    • Use multiple indicators and structural equation models
    • Confirmatory factor analysis
    • Better measures

  20. Normally Distributed Error

  21. Non-Normal Error
  • Our calculations of statistical significance depend on this assumption.
  • Statistical inference can be robust even when the error is non-normal.
  • Diagnosis: look at the distribution of the error. Because of the homoscedasticity assumption (see later), the errors, pooled across predictions, should also be normal. (In principle, we have multiple observations for each prediction.)
  • Remember! Our measured variables (Y and X) do not have to have a normal distribution; only the error for each prediction does.
  • Remedy: any non-linear transformation will change the shape of the distribution of the error.
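
  A sketch of this diagnostic in Python (synthetic data): fit the model, keep the residuals, and test them for normality rather than testing Y or X themselves.

    # Sketch: checking the residuals, not the raw variables, for normality.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import jarque_bera

    rng = np.random.default_rng(6)
    x = rng.normal(size=1000)
    y = 1 + 2 * x + rng.standard_t(df=3, size=1000)   # heavy-tailed (non-normal) error

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    jb_stat, jb_p, skew, kurtosis = jarque_bera(fit.resid)
    print(jb_stat, jb_p, skew, kurtosis)              # a small p-value flags non-normal residuals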

  22. 1.000 .998 .996 .994 .992 .990 .988 .986 Rsq = 0.6211 Z 200 400 600 800 1000 1200 X Error Has a Non-Zero Mean • The solid line gives a negative • The dotted line a positive mean • This can happen when we have some selection problem • Diagnosis: • Visual scatter plot will not help unless we know in advance somehow the true regression line • Remedy: • If it is a selection problem try to address it.

  23. Non-independent errors
  • Example 1: Suppose you take a survey of 10 people but you interview everyone 10 times. Now your N = 100, but your errors are not independent: for the same person you will have similar errors.
  • Example 2: Suppose you take 10 countries and you observe them in 10 different time periods. Now your N = 100, but your errors are not independent: for the same country you will have similar errors.
  • Example 3: Suppose you take 100 countries and you observe them only once. Now your N = 100. But countries that are next to each other are often similar (same geography and climate, similar history, cooperation, etc.). If your model underpredicts Denmark, it is likely to underpredict Sweden as well.
  • Example 4: Suppose you take 100 people but they are all couples, so what you really have is 50 couples. Husband and wife tend to be similar. If your model underestimates one, chances are it does the same for the other: spouses have similar errors.
  • Statistical inference assumes that each case is independent of the others, and in the examples above that is not the case. In effect, your usable N is smaller than your nominal N.
  • This biases your standard error because the formula is "tricked into believing" that you have a larger sample than you actually have, and larger samples give smaller standard errors and better statistical significance.
  • This may also bias your estimates of the intercept and the slope. Non-linearity is a special case of correlated errors.
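
  A hedged sketch of Example 1 in Python (10 hypothetical people, 10 interviews each): the point estimates stay the same, but cluster-robust standard errors keep the formula from being "tricked" by the inflated N.

    # Sketch: naive vs. cluster-robust standard errors with repeated observations per person.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n_people, n_obs = 10, 10
    person = np.repeat(np.arange(n_people), n_obs)     # person id for each observation
    x = rng.normal(size=n_people)[person]              # predictor varies only by person
    u = rng.normal(size=n_people)[person]              # shared, person-level error
    y = 1 + 0.5 * x + u + rng.normal(size=n_people * n_obs)

    X = sm.add_constant(x)
    naive = sm.OLS(y, X).fit()
    clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": person})
    print(naive.bse[1], clustered.bse[1])   # the clustered SE is typically much larger in this setup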

  24. Diagnosis & Remedy
  • It is called autocorrelation because the correlation is between cases, not variables, although autocorrelation can often be traced to certain variables, such as a common geographic location or the same country, person, or family.
  • Diagnosis:
    • Visual: scatterplot
    • Checking groups of cases that are theoretically suspect
    • Certain forms of serial or spatial autocorrelation can be diagnosed by calculating certain statistics (e.g., the Durbin-Watson test)
  • Remedy:
    • You can include new variables in the equation; e.g., for serial (temporal) correlation you can include the value of Y at t-1 as an independent variable
    • For spatial correlation we can often model the relationships by introducing a weight matrix
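
  A short sketch of one of the diagnostics named above, the Durbin-Watson statistic, using simulated serially correlated errors (values near 2 suggest no first-order serial correlation; values well below 2 suggest positive autocorrelation).

    # Sketch: Durbin-Watson statistic on residuals from a model with AR(1) errors.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(8)
    n = 500
    x = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):                    # e(t) = 0.7*e(t-1) + noise
        e[t] = 0.7 * e[t - 1] + rng.normal()
    y = 1 + 2 * x + e

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(durbin_watson(fit.resid))          # well below 2 -> positive serial correlation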

  25. Heteroscedasticity
  • Homoscedasticity means equal variance; heteroscedasticity means unequal variance.
  • We assume that each prediction is not just on target on average but also that we make the same amount of error everywhere.
  • Heteroscedasticity results in biased standard errors and statistical significance.
  • Diagnosis: visual, scatter plot.
  • Remedy: weighted least squares, i.e. introducing a weight matrix (e.g., using 1/X as the weight).
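
  A sketch of the diagnosis and remedy in Python with invented data: a Breusch-Pagan test on the residuals, then a weighted least squares refit in the spirit of the 1/X weighting mentioned above (the specific weights are an assumption of the example).

    # Sketch: Breusch-Pagan test, then weighted least squares.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(9)
    n = 2000
    x = rng.uniform(1, 10, size=n)
    y = 1 + 2 * x + rng.normal(scale=x, size=n)        # the error variance grows with x

    X = sm.add_constant(x)
    ols = sm.OLS(y, X).fit()
    lm_stat, lm_p, f_stat, f_p = het_breuschpagan(ols.resid, X)
    print(lm_p)                                        # small p-value -> heteroscedasticity
    wls = sm.WLS(y, X, weights=1.0 / x**2).fit()       # weights proportional to 1/variance
    print(ols.bse[1], wls.bse[1])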

  26. Predictor Related to Error
  • The error represents all factors influencing Y that are not included in the regression equation.
  • If an omitted variable is related to X, the assumption is violated. This is the same as the completeness (omitted variable) problem.
  • Diagnosis: the estimated error will ALWAYS be uncorrelated with X by construction, and there is no way to establish the TRUE error, so the diagnosis is theoretical.
  • Remedy: adding new variables to the model.

  27. Correlated errors across interrelated equations
  • We sometimes estimate more than one regression.
  • Suppose Y(t) = a + b1*X(t-1) + b2*Z(t-1) + e, but also X(t) = a' + b1'*Y(t-1) + b2'*Z(t-1) + e'.
  • e and e' will be correlated: whatever is omitted from both equations will show up in both e and e', making them correlated.
  • This is also the case in sample selection models:
    • S = a + b1*X + b2*Z + e, where S is whether one is selected into the sample
    • Y = a' + b1'*X + b2'*Z + b3'*W + b4'*V + e', where Y is the outcome of interest
    • e and e' will be correlated: whatever is omitted from both equations will show up in both e and e', making them correlated.
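
  A minimal simulation sketch of the mechanism (names and coefficients are invented): a variable W that belongs in both equations is omitted from both, so it ends up in both error terms and makes them correlated.

    # Sketch: a common omitted variable produces correlated errors across two equations.
    import numpy as np

    rng = np.random.default_rng(10)
    n = 10000
    w = rng.normal(size=n)                       # omitted from both equations
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    e1 = 0.8 * w + rng.normal(size=n)            # the error of equation 1 absorbs W
    e2 = 0.8 * w + rng.normal(size=n)            # the error of equation 2 absorbs W
    y1 = 1 + 0.5 * x + 0.5 * z + e1              # first equation
    y2 = 2 + 0.3 * x + 0.7 * z + e2              # second equation
    print(np.corrcoef(e1, e2)[0, 1])             # clearly positive, not near 0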
