
Stat 391 – Lecture 14



  1. Stat 391 – Lecture 14 Regression, Part B: Going a bit deeper Assaf Oron, May 2008

  2. Overview
  • We introduced simple linear regression, and some responsible-use tips
  • (dot your t’s and cross your i’s, etc.)
  • Today, we go behind the scenes:
  • Regression with binary X, and t-tests
  • The statistical approach to regression
  • Multiple regression
  • Regression hypothesis tests and inference
  • Regression with categorical X, and ANOVA
  • Advanced model selection in regression
  • Advanced regression alternatives

  3. Binary X and t-tests
  • It is convenient to introduce regression using continuous X
  • But it can also be done when X is limited to a finite number of values, or even to non-numerical values
  • We use the exact same formulae and framework
  • When X is binary – that is, it divides the data into two groups (e.g., “male” vs. “female”) – the regression is completely equivalent to the two-sample t-test
  • (the version with the equal-variance assumption)
  • The regression assigns x=0 to one group, x=1 to the other, so our “slope” becomes the difference between group means, and our “intercept” is the mean of the x=0 group
  • Let’s see this in action:
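As an illustration of this equivalence (an added sketch, not from the original slides; the data are simulated and the names grp/y are made up):

    # Simulated two-group data
    set.seed(1)
    grp <- rep(c(0, 1), each = 30)                  # binary covariate
    y   <- 5 + 2 * grp + rnorm(60)                  # true group difference = 2

    fit <- lm(y ~ grp)                              # regression with a binary x
    summary(fit)$coefficients                       # slope = difference between group means

    t.test(y[grp == 1], y[grp == 0], var.equal = TRUE)  # equal-variance two-sample t-test
    # The slope's t statistic and p-value match the t-test exactly,
    # and the intercept equals mean(y[grp == 0]).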

  4. Regression: the Statistical Approach
  • Our treatment of regression thus far has been devoid of any probability assumptions
  • All we saw was least-squares optimization, partition of sums-of-squares, some diagnostics, etc.
  • But regression can be viewed via a probability model:
  •     y_i = β0 + β1 x_i + ε_i ,   i = 1, …, n
  • The β’s are seen (in classical statistics) as fixed constant parameters, to be estimated. The x’s are fixed as well.
  • The ε are random, and are different between different y’s
  • They have expectation 0, and under standard regression are assumed i.i.d. normal

  5. Regression: the Statistical Approach (2)
  • The equation in the previous slide is a simple example of a probabilistic regression model
  • Such models describe observations as a function of fixed explanatory variables (x) – known as covariates – plus random noise
  • The linear-regression formula can also be written as a conditional probability:
  •     Y_i | x_i  ~  N(β0 + β1 x_i, σ²)

  6. Regression: the Statistical Approach (3)
  • The probability framework allows us to use the tools of hypothesis testing, confidence intervals – and statistical estimation
  • Under the i.i.d.-normal-error assumption, the MLE’s for intercept and slope are identical to the least-squares solutions
  • (this is because the log-likelihood is quadratic, so the MLE mechanics are equivalent to least-squares optimization)
  • Hence the “hats” in the formula
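To make that step explicit (a short derivation added here, not part of the original slides): under i.i.d. N(0, σ²) errors the log-likelihood is

    \ell(\beta_0, \beta_1, \sigma^2)
      = -\frac{n}{2}\log(2\pi\sigma^2)
        - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2

so for any fixed σ², maximizing the log-likelihood over (β0, β1) is exactly the same as minimizing the sum of squared residuals – the least-squares criterion.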

  7. Multiple Regression
  • Often, our response y can potentially be explained by more than one covariate
  • For example: earthquake ground movement at a specific location is affected by both the magnitude and the distance from the epicenter (attenu dataset)
  • It turns out that everything we did for a single x can be done with p covariates, using analogous formulae
  • Instead of finding the least-squares line in 2D, we find the least-squares hyperplane in p+1 dimensions
  • We have to convert to matrix-vector terminology:

  8. Multiple Regression (2)
  • In matrix-vector form the model is y = Xβ + ε, where
  •   y – Responses: vector of length n
  •   X – Model Matrix: n rows, (p+1) columns
  •   β – Parameter vector quantifying the effects; length p+1
  •   ε – Errors: i.i.d. vector of normal r.v.’s; length n
  • Where has the intercept term gone?
  • It is merged into X, the model matrix, as the first column: a column of 1’s
  • (check it out)
  • Each covariate takes up a subsequent column

  9. Multiple Regression (3)
  • This was the math; conceptually, what does multiple regression do?
  • It calculates the “pure” effect of each covariate, while neutralizing the effect of the other covariates
  • (We call this “adjusting for the other covariates”)
  • If a covariate is NOT in the model, it is NOT neutralized – so it becomes a potential confounder
  • Let’s see this in action on the attenu data:
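A minimal sketch of the kind of demonstration meant here, using R's built-in attenu dataset (the specific comparison is an added illustration):

    data(attenu)                                   # peak ground acceleration vs. magnitude and distance

    fit1 <- lm(accel ~ mag, data = attenu)         # magnitude alone
    fit2 <- lm(accel ~ mag + dist, data = attenu)  # adjusting for distance as well

    summary(fit1)$coefficients
    summary(fit2)$coefficients
    # The coefficient of mag changes once dist is in the model:
    # in fit2 it is the effect of magnitude "adjusted for" distance,
    # while in fit1 distance is a potential confounder.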

  10. Multiple Regression (4)
  • …and now for the first time: we actually write out the solutions
  • First for the parameters:
  •     β̂ = (XᵀX)⁻¹ Xᵀ y
  • (note the matrix transpose and inverse operators)
  • And these are the fitted values for y:
  •     ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y
  • All of X (XᵀX)⁻¹ Xᵀ is a function of X, and can be written as a single matrix H – a.k.a. “the Hat Matrix” (why?)
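As a check (an added illustration, continuing with the attenu example), the closed-form solution can be computed directly and compared with lm():

    y <- attenu$accel
    X <- cbind(1, attenu$mag, attenu$dist)          # model matrix: column of 1's, then the covariates

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'y
    beta_hat
    coef(lm(accel ~ mag + dist, data = attenu))     # same numbers

    H     <- X %*% solve(t(X) %*% X) %*% t(X)       # the hat matrix
    y_hat <- H %*% y                                # "puts the hat on y" -- hence the name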

  11. Regression Inference
  • Why did we bother with these matrix formulae?
  • To show you that both parameter estimates and fitted values are linear combinations of the original observations
  • Each individual estimate or fitted value can be written as a weighted sum of the y’s
  • Useful fact: linear combinations of normal r.v.’s are also normal r.v.’s
  • So if our model assumptions hold, the beta-hats and y-hats are all normal (how convenient)

  12. Regression Inference (2)
  • What if the observation errors are not normal?
  • Well, recall that since a sum is just the mean multiplied by n, its shape also becomes “approximately normal” as n increases, due to the CLT
  • This holds for weighted sums as well, under fairly general assumptions
  • Bottom line: if the errors are “reasonably well-behaved” and n is large enough, our estimates are still as good as normal

  13. Regression Inference (3)
  • So each individual beta-hat or y-hat can be assumed (approximately) normal, with variance σ² times a known function of X; for the coefficients, Var(β̂) = σ² (XᵀX)⁻¹
  • The only missing piece is to estimate σ², the variance of the observation errors
  • …And σ² is easily estimated using the residuals: σ̂² = Σ eᵢ² / (n−p−1)
  • But since we estimate the variance from the data, all our inference is based on the t-distribution
  • We lose a degree of freedom for each parameter, including the intercept, to end up with n−p−1
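In R these pieces are available directly from a fitted lm object (an added illustration, again using the attenu fit):

    fit <- lm(accel ~ mag + dist, data = attenu)

    summary(fit)$sigma   # residual standard error: sqrt(sum(residuals(fit)^2) / df.residual(fit))
    vcov(fit)            # estimated covariance matrix of the beta-hats
    confint(fit)         # t-based confidence intervals for the coefficients
    df.residual(fit)     # n - p - 1 degrees of freedom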

  14. Back to that Printout again…
  Call:
  lm(formula = y1 ~ x1, data = anscombe)

  Residuals:
       Min       1Q   Median       3Q      Max
  -1.92127 -0.45577 -0.04136  0.70941  1.83882

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   3.0001     1.1247   2.667  0.02573 *
  x1            0.5001     0.1179   4.241  0.00217 **
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  Residual standard error: 1.237 on 9 degrees of freedom
  Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
  F-statistic: 17.99 on 1 and 9 DF, p-value: 0.002170

  • The t-statistics and p-values are for tests against a null hypothesis that the true parameter value is zero (each parameter tested separately). If your null is different, you’ll have to do the test on your own
  • Which parameter does this null usually NOT make sense for?
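If you do need a different null, the t statistic can be recomputed by hand from the printed estimate and standard error (an added sketch; the null value 0.4 is arbitrary):

    fit <- lm(y1 ~ x1, data = anscombe)
    est <- coef(summary(fit))["x1", "Estimate"]
    se  <- coef(summary(fit))["x1", "Std. Error"]

    null_value <- 0.4                               # hypothetical H0: slope = 0.4
    t_stat <- (est - null_value) / se
    p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
    c(t = t_stat, p = p_val)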

  15. Note: Beware The (Model) Matrix
  • If X has a column which is a linear function of other column(s), it is said to be a singular matrix
  • It cannot be inverted
  • The software may scream at you
  • Conceptually, you are asking the method to decide between two identical explanations; it cannot do this
  • If X has a column which is “almost” a linear function of other column(s), it is said to suffer from collinearity
  • Your beta-hat S.E.’s will be huge
  • Conceptually, you are asking the method to decide between two nearly-identical explanations; still not a good prospect
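A small simulated illustration of both problems (added here, not from the original slides):

    set.seed(2)
    x1 <- rnorm(50)
    x2 <- 2 * x1                           # exactly a linear function of x1: singular X
    x3 <- x1 + rnorm(50, sd = 0.01)        # almost a linear function of x1: collinear
    y  <- 1 + x1 + rnorm(50)

    coef(lm(y ~ x1 + x2))                  # R reports NA for the aliased coefficient x2
    summary(lm(y ~ x1 + x3))$coefficients  # both slopes get huge standard errors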

  16. Categorical X and ANOVA
  • We saw that simple regression with binary X is equivalent to a t-test
  • Similarly, we can model a categorical covariate having k>2 categories, within multiple regression
  • For example: ethnic origin vs. life expectancy
  • The covariate will take up k-1 columns in X – meaning there’ll be k-1 parameters to estimate
  • R interprets text covariates as categorical; you can also convert numerical values to categorical using factor
  • Let’s see this in action:
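A minimal sketch of a categorical covariate in lm(), using R's built-in chickwts data (the dataset choice is an added illustration; feed has k = 6 levels):

    data(chickwts)
    levels(chickwts$feed)              # the 6 feed types

    fit <- lm(weight ~ feed, data = chickwts)
    coef(fit)                          # intercept + k-1 = 5 coefficients
    head(model.matrix(fit))            # the k-1 dummy columns R builds automatically

    # A numeric code would be treated as continuous unless wrapped in factor(),
    # e.g. lm(weight ~ factor(group_code), ...) for a hypothetical numeric code group_code.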

  17. Categorical X and ANOVA (2)
  • Regression on a categorical variable is equivalent to a technique called ANOVA: analysis of variance
  • ANOVA is used to analyze designed experiments
  • ANOVA’s name is derived from the fact that its hypothesis tests are performed by comparing sums of square deviations (such as those shown last lecture)
  • This is known as the F test, and appears in our standard regression printout
  • ANOVA is considered an older technology, but is still very useful in engineering, agriculture, etc.
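Continuing the chickwts sketch from the previous slide, the comparison of sums of squares is what anova() reports:

    fit <- lm(weight ~ feed, data = chickwts)
    anova(fit)      # ANOVA table: between-group vs. residual sums of squares, and the F statistic
    summary(fit)    # the same F statistic appears in the last line of the lm printout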

  18. Regression Inference and Model Selection
  • So… we can keep fitting the data better by adding as many covariates as we want?
  • Not quite. If p≥n-1, you can fit the observations perfectly. This is known as a saturated model; pretty useless for drawing conclusions
  • (in statistics jargon, you will have used up all your degrees of freedom)
  • Before reaching n-1, each additional covariate improves the fit. Where to stop?
  • Obviously, there is a tradeoff; we seek the optimum between over-fitting and under-fitting
  • From a conceptual perspective, we usually prefer the simpler models (fewer covariates)
  • However, given all possible covariates, “how to find the optimal combination?” is an open question

  19. Regression Inference: Nested Models
  • If two models are nested, we can make a formal hypothesis test between them, called a likelihood-ratio test (LRT)
  • This test checks whether the gain in explained variability is “worth” the price paid in degrees of freedom
  • But when are two models nested?
  • The simplest case: if model B = model A + some added terms, then A is nested in B
  • (sometimes, nesting includes simplification of more complicated multi-level covariates: e.g., region vs. state)
  • In R, the LRT is available via lrtest, in the lmtest package (and also via the anova function)
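A short sketch of such a nested comparison on the attenu data (added illustration; model B adds one term to model A, and lmtest is assumed to be installed):

    fitA <- lm(accel ~ mag, data = attenu)          # smaller model
    fitB <- lm(accel ~ mag + dist, data = attenu)   # adds one term, so A is nested in B

    anova(fitA, fitB)                               # F test between the nested fits

    # install.packages("lmtest")                    # if not already installed
    library(lmtest)
    lrtest(fitA, fitB)                              # likelihood-ratio test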

  20. Open-Ended Model Selection
  • When all else fails… use common sense. Your final model is not necessarily the best in terms of “bang for the buck”
  • Your covariate of interest should definitely go in
  • “Nuisance covariates” required by the client or the accepted wisdom should go in as well
  • Causal diagrams are a must for nontrivial problems; covariates with a clear causal connection to the response should go in first
  • (there are also model-selection tools, known as AIC, BIC, cross-validation, BMA, etc.)
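For reference, the AIC/BIC tools mentioned in the last bullet are built into R (an added sketch, continuing with the attenu fit):

    fitB <- lm(accel ~ mag + dist, data = attenu)

    AIC(fitB)                  # Akaike information criterion
    BIC(fitB)                  # Bayesian information criterion
    step(fitB, trace = FALSE)  # stepwise search that drops or keeps terms by AIC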

  21. Open-Ended Model Selection (2)
  • Additionally, the goal of the model matters:
  • If it is for formal inference/policy/scientific conclusions, you should be more conservative (fewer covariates, less effort to fit the data closely)
  • If it is for prediction and forecasting under conditions similar to those observed, you can be a bit more aggressive
  • In any case, there is no magic solution
  • Always remember not to put too much faith in the model

  22. More Sophisticated Regressions
  • The assumptions of linearity-normality-i.i.d. are quite restrictive
  • Some violations can be handled via standard regression:
  • Nonlinearity – transform the variables
  • Unequal variances – weighted least squares
  • For other violations, extensions of ordinary regression have been developed
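Both remedies fit within lm() itself; a brief added sketch on attenu (the log transform and the 1/dist² weights are illustrative choices, not taken from the slides):

    # Transforming variables to address nonlinearity
    fit_log <- lm(log(accel) ~ mag + log(dist), data = attenu)

    # Weighted least squares to address unequal variances
    fit_wls <- lm(accel ~ mag + dist, data = attenu, weights = 1 / dist^2)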

  23. More Sophisticated Regressions (2)
  • For some types of non-normality we have generalized linear models (GLM)
  • The GLM solution is also an MLE
  • GLM’s cover a family of distributions that includes the normal, exponential, Gamma, binomial, Poisson
  • The variant with binomial responses is known as Logistic Regression; let’s see it in action
  • If we suffer from outliers or heavy tails, there are many types of robust regression to choose from
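A minimal logistic-regression sketch in R (added illustration on simulated data):

    set.seed(3)
    x <- rnorm(200)
    p <- plogis(-0.5 + 1.5 * x)                 # true success probabilities
    y <- rbinom(200, size = 1, prob = p)        # binary (binomial) responses

    fit <- glm(y ~ x, family = binomial)        # logistic regression: logit link by default
    summary(fit)$coefficients                   # MLE's, on the log-odds scale
    predict(fit, newdata = data.frame(x = 1), type = "response")  # fitted probability at x = 1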

  24. More Sophisticated Regressions (3)
  • If observations are not i.i.d., but are instead divided into groups, we can use hierarchical or “mixed” models (this is very common)
  • Of course, any regression can be done using Bayesian methods
  • These are especially useful for complicated hierarchical models
  • Finally, if y’s dependence upon x is not described well by any single function, there is nonparametric regression (“smoothing”)
  • Some of which we may see next week
