
Unit 10: Model Assumptions



  1. Unit 10: Model Assumptions

  2. Goals of Unit
• Understand the 5 assumptions of GLMs
• Understand the consequences of assumption violations:
  • Inefficient standard errors (low power, Type II errors)
  • Inaccurate standard errors (incorrect statistical tests, Type I errors)
  • Inaccurate parameter estimates
• Learn how to detect violations (statistical tests and visual diagnosis)
• Recognize the range of options available when violations are detected; implementation of these options will be covered over the two semesters
• Examination of assumptions will further inform you about your data

  3. Assessment of Assumptions for Sig. Tests
All GLM procedures commonly make the 5 assumptions below. When these assumptions are met, OLS regression coefficients are MVUE (Minimum Variance Unbiased Estimators) and BLUE (Best Linear Unbiased Estimators). With the exception of #1, these assumptions are expressed (and assessed) with respect to the residuals around the prediction line.
1. Exact X: The IVs are assumed to be known exactly (i.e., without measurement error)
2. Independence: Residuals are independently distributed (the probability of obtaining a specific observation does not depend on any other observation)
3. Normality: All residual distributions are normally distributed
4. Constant variance: All residual distributions have a constant variance, SEE²
5. Linearity: All residual distributions (i.e., one for each Y') are assumed to have means equal to zero

  4. Exact X
Violations of the Exact X assumption lead to biased (i.e., inaccurate) estimates of regression coefficients. Violations are caused by unreliable measurement of your predictors. In simple bivariate regression, how will reducing reliability affect the regression model?

  5. Exact X
It will reduce b1: we will underestimate the strength of the relationship between X and Y. In multiple-predictor models, the bias can be either positive or negative, depending on the nature of the correlations among the predictors. USE RELIABLE VARIABLES. See the reliability demo script on the website.
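The attenuation can be seen in a small simulation (a hypothetical sketch, not the course's reliability demo script; all variable names are illustrative). Adding measurement error with variance equal to the true-score variance gives reliability of .50, and the estimated slope shrinks by roughly that factor:

```r
# Hypothetical demo of attenuation: unreliable X shrinks b1 toward zero.
set.seed(1)
n     <- 10000
x     <- rnorm(n)                  # true scores for X (variance 1)
y     <- 2 * x + rnorm(n)          # true slope b1 = 2
x.err <- x + rnorm(n, sd = 1)      # observed X with error (reliability = .50)

coef(lm(y ~ x))[2]       # recovers b1 near 2
coef(lm(y ~ x.err))[2]   # attenuated to roughly b1 * reliability = 1
```

With many predictors the direction of bias is no longer predictable from reliability alone, which is why the slide's warning matters.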

  6. Exact X
What are the implications of unreliable X for the use of covariates to “control” variables? Covariates only “control” for the construct they measure to the degree that they are reliable (and valid) measures of that construct. Analyses that rely on unreliable covariates do not control the variance for the construct well.

  7. Independence
Violations of the independence-of-residuals assumption can compromise the validity of our statistical tests (inaccurate standard errors). Violation of residual independence is a function of the research design, caused by repeated measures on the same individual or by related individuals/observations (participants in the same family, school, etc.). It is often difficult to detect in the data but clear from the research design. It can be fixed by a variety of approaches, including repeated-measures analyses (next semester) or multi-level, mixed-effects, and/or hierarchical linear models (next semester).

  8. Assessment of Residuals: General Issues
The remaining three assumptions (normally distributed residuals with a mean of 0 and constant variance) can be assessed by examining the residuals. Use of graphical methods is emphasized; statistical tests of assumptions exist but should be used cautiously. Assessment of assumptions about residuals is an inexact science, and conclusions are tentative. The process of examining residuals will increase your understanding of your data: it may suggest transformations of your data, may suggest alternative analytic strategies, and will increase your confidence in your conclusions.

  9. Residuals by Y: Normal, M=0, S=Constant
Norm <- data.frame(Y = rep(1:25, 1000), E = rnorm(25000, mean = 0, sd = 1))
E10 <- Norm[Norm$Y == 10, 2]
hist(E10, probability = TRUE, main = "Y' = 10", xlab = "E's")
lines(density(E10))

  10. Normally Distributed Errors
The errors for each Y' are assumed to be normally distributed. Normally distributed errors are required for OLS regression coefficients to be MVUE but not BLUE. The central limit theorem indicates that even with non-normal errors, significance tests and confidence intervals are approximately correct with large N. Coefficients are still the best unbiased efficient estimators among linear solutions (i.e., BLUE), but more efficient non-linear solutions may exist (e.g., generalized linear models such as Poisson regression for thick-tailed distributions). The mean may not be the best measure of the center of a highly skewed distribution. Multimodal error distributions suggest the omission of one or more categorical variables that divide the data into groups. Transformations may correct the shape of the residuals (this semester).

  11. Normality: Quantile Comparison & Density plots
modelAssumptions(m2, Type='normal', one.page=TRUE)

  12. Normality: q-q plot examples
n1 <- as.data.frame(matrix(rnorm(100 * 1, mean = 0, sd = 1), ncol = 1))
car::qqPlot(n1[, 1])  # quantile-comparison plot of the simulated normal sample
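As a companion to the snippet above, a base-R sketch (simulated data, not the course dataset) of how normal vs. skewed residuals look in a quantile-comparison plot:

```r
# Simulated contrast: normal errors track the reference line in a q-q plot;
# positively skewed errors bow away from it in the upper tail.
set.seed(1)
e.norm <- rnorm(100)        # normal sample
e.skew <- rexp(100) - 1     # positively skewed sample (centered at 0)

par(mfrow = c(1, 2))
qqnorm(e.norm, main = "Normal errors");  qqline(e.norm)
qqnorm(e.skew, main = "Skewed errors");  qqline(e.skew)
```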

  13. Normality: Density Plot of Residuals
varDescribe(rstudent(m2))
  var  n    mean   sd median   min  max  skew kurtosis
1   1 94 0.00235 1.02 -0.196 -2.78 2.72 0.385    0.741

  14. Constant Variance
The errors for each Y' are assumed to have a constant variance (homoscedasticity). This is necessary for the OLS-estimated coefficients to be BLUE. If the errors are heteroscedastic, the coefficients remain unbiased, but efficiency (precision of estimation) is impaired and the coefficient SEs become inaccurate. The degree of the problem depends on the severity of the violation and the sample size. A rough rule is that estimation is seriously degraded if the ratio of the largest to smallest variance is 10 or greater (or, more conservatively, 4 or greater).
1. Transformations may fix this issue (next semester)
2. Weighted least squares provides an alternative approach to estimation when heteroscedasticity exists (maybe next semester?)
3. Corrections also exist for SEs when errors are heteroscedastic (more on this in a moment)
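The variance-ratio rule can be illustrated with simulated data (hypothetical, not the course data): when the error SD grows with X, the ratio of largest to smallest residual variance easily exceeds the cutoff of 10.

```r
# Hypothetical heteroscedastic data: error SD equals the value of X (1 to 5),
# so residual variances range from about 1 to about 25.
set.seed(1)
x <- rep(1:5, each = 200)
y <- x + rnorm(1000, sd = x)

v <- tapply(resid(lm(y ~ x)), x, var)  # residual variance within each X value
max(v) / min(v)                        # ratio well above the rough cutoff of 10
```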

  15. Constant Variance: Residual Plot
modelAssumptions(m2, Type='constant', one.page=FALSE)
Suggested power transformation: 0.2699878

  16. Constant Variance: Spread-Level Plot
A spread-level plot is a plot of log(abs(studentized residuals)) vs. log(fitted values). 1 − b, where b is the slope of the regression line in the plot, is the suggested power transformation of Y to stabilize its variance.
Suggested power transformation: 0.2699878
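The 1 − b calculation can be done by hand (car's spreadLevelPlot() automates it; the model and data here are simulated stand-ins): regress log absolute studentized residuals on log fitted values and subtract the slope from 1.

```r
# Hand-rolled spread-level calculation on simulated Poisson-like data, where
# the variance grows with the mean and the suggested power is near 0.5 (sqrt).
set.seed(1)
x <- runif(500, 1, 10)
y <- rpois(500, lambda = x)
m <- lm(y ~ x)

fit <- fitted(m)
ok  <- fit > 0                                      # logs need positive values
b   <- coef(lm(log(abs(rstudent(m)[ok])) ~ log(fit[ok])))[2]
1 - b                                               # suggested power for Y
```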

  17. Constant Variance: Statistical Test
Breusch & Pagan (1979) and Cook & Weisberg (1983) independently developed a test for constant variance: ncvTest() in the car package, also provided by modelAssumptions(). Do not use it blindly: all statistical tests are designed to be sensitive to specific types of violations, may miss other types of violations, and may also produce false positives due to other aspects of the distribution.
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 17.88416  Df = 1  p = 2.34767e-05
Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47, 1287-1294.
Cook, R. D., & Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika, 70, 1-10.

  18. Constant Variance: Correcting SEs
Standard errors are inaccurate when the variance of the residuals is not constant. A procedure providing White (1980) corrected SEs is described in Fox (2008), Chapter 12, pp. 275-276.
modelCorrectSE(m2, digits=2)
Uncorrected Tests of Coefficients
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    41.478      7.093     5.8  7.9e-08
BAC          -254.979     76.812    -3.3  1.3e-03
TA              0.088      0.029     3.0  3.2e-03
Sexmale       -16.112      6.033    -2.7  9.0e-03
White (1980) heteroscedasticity-corrected SEs and Tests
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    41.478      8.346     5.0  3.2e-06
BAC          -254.979     79.515    -3.2  1.9e-03
TA              0.088      0.031     2.8  6.1e-03
Sexmale       -16.112      6.107    -2.6  9.8e-03
Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54, 217-224.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817-838.
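A minimal hand-coded version of the White correction, using only base R and a simulated model in place of the course's m2 (modelCorrectSE() and the sandwich package's vcovHC() are the production routes; this is the original HC0 estimator, while variants like HC3 are usually preferred in small samples):

```r
# White (1980) HC0 sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1.
set.seed(1)
x  <- runif(200)
y  <- x + rnorm(200, sd = x)    # heteroscedastic errors
m2 <- lm(y ~ x)                 # simulated stand-in for the course's m2

X        <- model.matrix(m2)
e        <- resid(m2)
XtXi     <- solve(crossprod(X))
vcov.hc0 <- XtXi %*% t(X) %*% diag(e^2) %*% X %*% XtXi

sqrt(diag(vcov.hc0))  # White-corrected SEs
sqrt(diag(vcov(m2)))  # uncorrected OLS SEs, for comparison
```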

  19. Linearity: Component + Residuals plots
If the linearity assumption is not met, coefficients are biased. Plot the partial residual (ei(j) = ei + bjXij) against each predictor: crPlots(). Can include factors but can't include interactions with factors; code those regressors manually.
modelAssumptions(m2, Type='linear')
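The partial residual above can be computed by hand (crPlots() automates this; the two-predictor model here is simulated for illustration): a clear curve in the plot flags a linearity violation for that predictor.

```r
# Hand-computed component + residual plot for x1, which truly enters y
# quadratically; the lowess line should bend, flagging the violation.
set.seed(1)
x1 <- rnorm(300)
x2 <- rnorm(300)
y  <- x1^2 + x2 + rnorm(300)
m  <- lm(y ~ x1 + x2)

pr <- resid(m) + coef(m)["x1"] * x1   # partial residual e_i + b1 * x1
plot(x1, pr, xlab = "x1", ylab = "Component + Residual")
lines(lowess(x1, pr))
```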

  20. Global Test of Model Assumptions
Pena, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101, 341-354. Provided by modelAssumptions() using the gvlma package.
gvlma(m2)
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call: gvlma(x = m2)
                       Value   p-value                   Decision
Global Stat         16.40260 0.0025239 Assumptions NOT satisfied!
Skewness             2.23346 0.1350514   Assumptions acceptable.
Kurtosis             1.61765 0.2034198   Assumptions acceptable.
Link Function        0.00358 0.9522862   Assumptions acceptable.
Heteroscedasticity  12.54792 0.0003966 Assumptions NOT satisfied!

  21. Transformations: The Family of Powers & Roots
• Power transformations (next unit) are very useful for correcting problems with normality, constant variance, and linearity of errors.
• Polynomial regression (next semester) is useful when you have quadratic, cubic, etc. effects of Xs on Y.
• Generalized linear models (e.g., logistic regression; next semester) are also available.

  22. Summary of Violation Consequences and Solutions
• Exact X: inaccurate (biased) parameter estimates. Use reliable measures of Xs; use SEM with latent variables.
• Independence: inaccurate standard errors. Use repeated measures; use multi-level models.
• Normally distributed errors: inefficient standard errors. Consider omitted variables; use transformations; use generalized linear models.
• Constant variance for errors: inaccurate and inefficient standard errors. Use SE corrections (but still inefficient); use transformations; use weighted least squares.
• Linearity (error distributions all have mean of 0): inaccurate (biased) parameter estimates. Use transformations; use polynomial regression; use generalized linear models.
