
Stat 391 – Lecture 14



  1. Stat 391 – Lecture 14 Regression, Part B: Going a bit deeper Assaf Oron, May 2008

  2. Overview
  • We introduced simple linear regression, and some responsible-use tips
  • (dot your t’s and cross your i’s, etc.)
  • Today, we go behind the scenes:
  • Regression with binary X, and t-tests
  • The statistical approach to regression
  • Multiple regression
  • Regression hypothesis tests and inference
  • Regression with categorical X, and ANOVA
  • Advanced model selection in regression
  • Advanced regression alternatives

  3. Binary X and t-tests
  • It is convenient to introduce regression using continuous X
  • But it can also be done when X is limited to a finite number of values, or even to non-numerical values
  • We use the exact same formulae and framework
  • When X is binary – that is, it divides the data into two groups (e.g., “male” vs. “female”) – the regression is completely equivalent to the two-sample t-test
  • (the version with the equal-variance assumption)
  • The regression assigns x=0 to one group, x=1 to the other, so our “slope” becomes the difference between group means, and our “intercept” is the mean of the x=0 group
  • Let’s see this in action:
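As an illustration of this equivalence (an added sketch, not from the original slides; the data are simulated and the names grp/y are made up):

    # Simulated two-group data
    set.seed(1)
    grp <- rep(c(0, 1), each = 30)                  # binary covariate
    y   <- 5 + 2 * grp + rnorm(60)                  # true group difference = 2

    fit <- lm(y ~ grp)                              # regression with a binary x
    summary(fit)$coefficients                       # slope = difference between group means

    t.test(y[grp == 1], y[grp == 0], var.equal = TRUE)  # equal-variance two-sample t-test
    # The slope's t statistic and p-value match the t-test exactly,
    # and the intercept equals mean(y[grp == 0]).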

  4. Regression: the Statistical Approach
  • Our treatment of regression thus far has been devoid of any probability assumptions
  • All we saw was least-squares optimization, partition of sums-of-squares, some diagnostics, etc.
  • But regression can be viewed via a probability model:
  •     y_i = β0 + β1 x_i + ε_i ,   i = 1, …, n
  • The β’s are seen (in classical statistics) as fixed constant parameters, to be estimated. The x’s are fixed as well.
  • The ε are random, and are different between different y’s
  • They have expectation 0, and under standard regression are assumed i.i.d. normal

  5. Regression: the Statistical Approach (2)
  • The equation in the previous slide is a simple example of a probabilistic regression model
  • Such models describe observations as a function of fixed explanatory variables (x) – known as covariates – plus random noise
  • The linear-regression formula can also be written as a conditional probability:
  •     Y_i | x_i  ~  N(β0 + β1 x_i, σ²)

  6. Regression: the Statistical Approach (3)
  • The probability framework allows us to use the tools of hypothesis testing, confidence intervals – and statistical estimation
  • Under the i.i.d.-normal-error assumption, the MLE’s for intercept and slope are identical to the least-squares solutions
  • (this is because the log-likelihood is quadratic, so the MLE mechanics are equivalent to least-squares optimization)
  • Hence the “hats” in the formula
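To make that step explicit (a short derivation added here, not part of the original slides): under i.i.d. N(0, σ²) errors the log-likelihood is

    \ell(\beta_0, \beta_1, \sigma^2)
      = -\frac{n}{2}\log(2\pi\sigma^2)
        - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2

so for any fixed σ², maximizing the log-likelihood over (β0, β1) is exactly the same as minimizing the sum of squared residuals – the least-squares criterion.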

  7. Multiple Regression
  • Often, our response y can potentially be explained by more than one covariate
  • For example: earthquake ground movement at a specific location is affected by both the magnitude and the distance from the epicenter (attenu dataset)
  • It turns out that everything we did for a single x can be done with p covariates, using analogous formulae
  • Instead of finding the least-squares line in 2D, we find the least-squares hyperplane in p+1 dimensions
  • We have to convert to matrix-vector terminology:

  8. Multiple Regression (2)
  • In matrix-vector form the model is y = Xβ + ε, where
  •   y – Responses: vector of length n
  •   X – Model Matrix: n rows, (p+1) columns
  •   β – Parameter vector quantifying the effects; length p+1
  •   ε – Errors: i.i.d. vector of normal r.v.’s; length n
  • Where has the intercept term gone?
  • It is merged into X, the model matrix, as the first column: a column of 1’s
  • (check it out)
  • Each covariate takes up a subsequent column

  9. Multiple Regression (3)
  • This was the math; conceptually, what does multiple regression do?
  • It calculates the “pure” effect of each covariate, while neutralizing the effect of the other covariates
  • (We call this “adjusting for the other covariates”)
  • If a covariate is NOT in the model, it is NOT neutralized – so it becomes a potential confounder
  • Let’s see this in action on the attenu data:
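A minimal sketch of the kind of demonstration meant here, using R's built-in attenu dataset (the specific comparison is an added illustration):

    data(attenu)                                   # peak ground acceleration vs. magnitude and distance

    fit1 <- lm(accel ~ mag, data = attenu)         # magnitude alone
    fit2 <- lm(accel ~ mag + dist, data = attenu)  # adjusting for distance as well

    summary(fit1)$coefficients
    summary(fit2)$coefficients
    # The coefficient of mag changes once dist is in the model:
    # in fit2 it is the effect of magnitude "adjusted for" distance,
    # while in fit1 distance is a potential confounder.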

  10. Multiple Regression (4)
  • …and now for the first time: we actually write out the solutions
  • First for the parameters:
  •     β̂ = (XᵀX)⁻¹ Xᵀ y
  • (note the matrix transpose and inverse operators)
  • And these are the fitted values for y:
  •     ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y
  • All of X (XᵀX)⁻¹ Xᵀ is a function of X, and can be written as a single matrix H – a.k.a. “the Hat Matrix” (why?)
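As a check (an added illustration, continuing with the attenu example), the closed-form solution can be computed directly and compared with lm():

    y <- attenu$accel
    X <- cbind(1, attenu$mag, attenu$dist)          # model matrix: column of 1's, then the covariates

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'y
    beta_hat
    coef(lm(accel ~ mag + dist, data = attenu))     # same numbers

    H     <- X %*% solve(t(X) %*% X) %*% t(X)       # the hat matrix
    y_hat <- H %*% y                                # "puts the hat on y" -- hence the name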

  11. Regression Inference
  • Why did we bother with these matrix formulae?
  • To show you that both parameter estimates and fitted values are linear combinations of the original observations
  • Each individual estimate or fitted value can be written as a weighted sum of the y’s
  • Useful fact: linear combinations of normal r.v.’s are also normal r.v.’s
  • So if our model assumptions hold, the beta-hats and y-hats are all normal (how convenient)

  12. Regression Inference (2)
  • What if the observation errors are not normal?
  • Well, recall that since a sum is just the mean multiplied by n, its shape also becomes “approximately normal” as n increases, due to the CLT
  • This holds for weighted sums as well, under fairly general assumptions
  • Bottom line: if the errors are “reasonably well-behaved” and n is large enough, our estimates are still as good as normal

  13. Regression Inference (3)
  • So each individual beta-hat or y-hat can be assumed (approximately) normal, with variance σ² times a known function of X; for the coefficients, Var(β̂) = σ² (XᵀX)⁻¹
  • The only missing piece is to estimate σ², the variance of the observation errors
  • …And σ² is easily estimated using the residuals: σ̂² = Σ eᵢ² / (n−p−1)
  • But since we estimate the variance from the data, all our inference is based on the t-distribution
  • We lose a degree of freedom for each parameter, including the intercept, to end up with n−p−1
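In R these pieces are available directly from a fitted lm object (an added illustration, again using the attenu fit):

    fit <- lm(accel ~ mag + dist, data = attenu)

    summary(fit)$sigma   # residual standard error: sqrt(sum(residuals(fit)^2) / df.residual(fit))
    vcov(fit)            # estimated covariance matrix of the beta-hats
    confint(fit)         # t-based confidence intervals for the coefficients
    df.residual(fit)     # n - p - 1 degrees of freedom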

  14. Back to that Printout again…
  Call:
  lm(formula = y1 ~ x1, data = anscombe)

  Residuals:
       Min       1Q   Median       3Q      Max
  -1.92127 -0.45577 -0.04136  0.70941  1.83882

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   3.0001     1.1247   2.667  0.02573 *
  x1            0.5001     0.1179   4.241  0.00217 **
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  Residual standard error: 1.237 on 9 degrees of freedom
  Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
  F-statistic: 17.99 on 1 and 9 DF, p-value: 0.002170

  • The t-statistics and p-values are for tests against a null hypothesis that the true parameter value is zero (each parameter tested separately). If your null is different, you’ll have to do the test on your own
  • Which parameter does this null usually NOT make sense for?
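If you do need a different null, the t statistic can be recomputed by hand from the printed estimate and standard error (an added sketch; the null value 0.4 is arbitrary):

    fit <- lm(y1 ~ x1, data = anscombe)
    est <- coef(summary(fit))["x1", "Estimate"]
    se  <- coef(summary(fit))["x1", "Std. Error"]

    null_value <- 0.4                               # hypothetical H0: slope = 0.4
    t_stat <- (est - null_value) / se
    p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
    c(t = t_stat, p = p_val)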

  15. Note: Beware The (Model) Matrix
  • If X has a column which is a linear function of other column(s), it is said to be a singular matrix
  • It cannot be inverted
  • The software may scream at you
  • Conceptually, you are asking the method to decide between two identical explanations; it cannot do this
  • If X has a column which is “almost” a linear function of other column(s), it is said to suffer from collinearity
  • Your beta-hat S.E.’s will be huge
  • Conceptually, you are asking the method to decide between two nearly-identical explanations; still not a good prospect
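A small simulated illustration of both problems (added here, not from the original slides):

    set.seed(2)
    x1 <- rnorm(50)
    x2 <- 2 * x1                           # exactly a linear function of x1: singular X
    x3 <- x1 + rnorm(50, sd = 0.01)        # almost a linear function of x1: collinear
    y  <- 1 + x1 + rnorm(50)

    coef(lm(y ~ x1 + x2))                  # R reports NA for the aliased coefficient x2
    summary(lm(y ~ x1 + x3))$coefficients  # both slopes get huge standard errors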

  16. Categorical X and ANOVA
  • We saw that simple regression with binary X is equivalent to a t-test
  • Similarly, we can model a categorical covariate having k>2 categories, within multiple regression
  • For example: ethnic origin vs. life expectancy
  • The covariate will take up k-1 columns in X – meaning there’ll be k-1 parameters to estimate
  • R interprets text covariates as categorical; you can also convert numerical values to categorical using factor
  • Let’s see this in action:
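A minimal sketch of a categorical covariate in lm(), using R's built-in chickwts data (the dataset choice is an added illustration; feed has k = 6 levels):

    data(chickwts)
    levels(chickwts$feed)              # the 6 feed types

    fit <- lm(weight ~ feed, data = chickwts)
    coef(fit)                          # intercept + k-1 = 5 coefficients
    head(model.matrix(fit))            # the k-1 dummy columns R builds automatically

    # A numeric code would be treated as continuous unless wrapped in factor(),
    # e.g. lm(weight ~ factor(group_code), ...) for a hypothetical numeric code group_code.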

  17. Categorical X and ANOVA (2)
  • Regression on a categorical variable is equivalent to a technique called ANOVA: analysis of variance
  • ANOVA is used to analyze designed experiments
  • ANOVA’s name is derived from the fact that its hypothesis tests are performed by comparing sums of square deviations (such as those shown last lecture)
  • This is known as the F test, and appears in our standard regression printout
  • ANOVA is considered an older technology, but is still very useful in engineering, agriculture, etc.
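Continuing the chickwts sketch from the previous slide, the comparison of sums of squares is what anova() reports:

    fit <- lm(weight ~ feed, data = chickwts)
    anova(fit)      # ANOVA table: between-group vs. residual sums of squares, and the F statistic
    summary(fit)    # the same F statistic appears in the last line of the lm printout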

  18. Regression Inference and Model Selection
  • So… we can keep fitting the data better by adding as many covariates as we want?
  • Not quite. If p≥n-1, you can fit the observations perfectly. This is known as a saturated model; pretty useless for drawing conclusions
  • (in statistics jargon, you will have used up all your degrees of freedom)
  • Before reaching n-1, each additional covariate improves the fit. Where to stop?
  • Obviously, there is a tradeoff; we seek the optimum between over-fitting and under-fitting
  • From a conceptual perspective, we usually prefer the simpler models (fewer covariates)
  • However, given all possible covariates, “how to find the optimal combination?” is an open question

  19. Regression Inference: Nested Models
  • If two models are nested, we can make a formal hypothesis test between them, called a likelihood-ratio test (LRT)
  • This test checks whether the gain in explained variability is “worth” the price paid in degrees of freedom
  • But when are two models nested?
  • The simplest case: if model B = model A + some added terms, then A is nested in B
  • (sometimes, nesting includes simplification of more complicated multi-level covariates: e.g., region vs. state)
  • In R, the LRT is available via lrtest, in the lmtest package (and also via the anova function)
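A short sketch of such a nested comparison on the attenu data (added illustration; model B adds one term to model A, and lmtest is assumed to be installed):

    fitA <- lm(accel ~ mag, data = attenu)          # smaller model
    fitB <- lm(accel ~ mag + dist, data = attenu)   # adds one term, so A is nested in B

    anova(fitA, fitB)                               # F test between the nested fits

    # install.packages("lmtest")                    # if not already installed
    library(lmtest)
    lrtest(fitA, fitB)                              # likelihood-ratio test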

  20. Open-Ended Model Selection
  • When all else fails… use common sense. Your final model is not necessarily the best in terms of “bang for the buck”
  • Your covariate of interest should definitely go in
  • “Nuisance covariates” required by the client or the accepted wisdom should go in as well
  • Causal diagrams are a must for nontrivial problems; covariates with a clear causal connection to the response should go in first
  • (there are also model-selection tools, known as AIC, BIC, cross-validation, BMA, etc.)
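For reference, the AIC/BIC tools mentioned in the last bullet are built into R (an added sketch, continuing with the attenu fit):

    fitB <- lm(accel ~ mag + dist, data = attenu)

    AIC(fitB)                  # Akaike information criterion
    BIC(fitB)                  # Bayesian information criterion
    step(fitB, trace = FALSE)  # stepwise search that drops or keeps terms by AIC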

  21. Open-Ended Model Selection (2)
  • Additionally, the goal of the model matters:
  • If it is for formal inference/policy/scientific conclusions, you should be more conservative (fewer covariates, less effort to fit the data closely)
  • If it is for prediction and forecasting under conditions similar to those observed, you can be a bit more aggressive
  • In any case, there is no magic solution
  • Always remember not to put too much faith in the model

  22. More Sophisticated Regressions
  • The assumptions of linearity-normality-i.i.d. are quite restrictive
  • Some violations can be handled via standard regression:
  • Nonlinearity – transform the variables
  • Unequal variances – weighted least squares
  • For other violations, extensions of ordinary regression have been developed
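Both remedies fit within lm() itself; a brief added sketch on attenu (the log transform and the 1/dist² weights are illustrative choices, not taken from the slides):

    # Transforming variables to address nonlinearity
    fit_log <- lm(log(accel) ~ mag + log(dist), data = attenu)

    # Weighted least squares to address unequal variances
    fit_wls <- lm(accel ~ mag + dist, data = attenu, weights = 1 / dist^2)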

  23. More Sophisticated Regressions (2)
  • For some types of non-normality we have generalized linear models (GLM)
  • The GLM solution is also an MLE
  • GLM’s cover a family of distributions that includes the normal, exponential, Gamma, binomial, Poisson
  • The variant with binomial responses is known as Logistic Regression; let’s see it in action
  • If we suffer from outliers or heavy tails, there are many types of robust regression to choose from
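A minimal logistic-regression sketch in R (added illustration on simulated data):

    set.seed(3)
    x <- rnorm(200)
    p <- plogis(-0.5 + 1.5 * x)                 # true success probabilities
    y <- rbinom(200, size = 1, prob = p)        # binary (binomial) responses

    fit <- glm(y ~ x, family = binomial)        # logistic regression: logit link by default
    summary(fit)$coefficients                   # MLE's, on the log-odds scale
    predict(fit, newdata = data.frame(x = 1), type = "response")  # fitted probability at x = 1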

  24. More Sophisticated Regressions (3)
  • If observations are not i.i.d., but are instead divided into groups, we can use hierarchical or “mixed” models (this is very common)
  • Of course, any regression can be done using Bayesian methods
  • These are especially useful for complicated hierarchical models
  • Finally, if y’s dependence upon x is not described well by any single function, there is nonparametric regression (“smoothing”)
  • Some of which we may see next week
