
Stat 324 – Day 30







  1. Stat 324 – Day 30 Review II

  2. Last Time: Elastic Net • Penalty (glmnet form): 𝜆 Σj [ (1 − 𝛼) 𝛽j²/2 + 𝛼 |𝛽j| ] • 𝛼 = 0: Ridge • 𝛼 = 1: Lasso • 𝛼 close to 1 is a popular choice since we essentially get the lasso plus a little ridge-style handling of the singularities caused by extreme correlations. • If two variables are highly correlated, the lasso will probably keep only one of them
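The slide's penalty formula did not survive extraction; below is a minimal sketch assuming the usual glmnet parameterization, 𝜆 Σ [(1 − 𝛼)𝛽² / 2 + 𝛼|𝛽|], which reduces to ridge at 𝛼 = 0 and the lasso at 𝛼 = 1. The coefficient values are made up for illustration.

```python
# Elastic net penalty in the glmnet convention (an assumption; the slide's
# own formula was lost in extraction).
def enet_penalty(betas, lam, alpha):
    """alpha=0 gives the ridge penalty, alpha=1 gives the lasso penalty."""
    return lam * sum((1 - alpha) * b * b / 2 + alpha * abs(b) for b in betas)

betas = [2.0, -1.0, 0.5]                          # hypothetical coefficients
print(enet_penalty(betas, lam=1.0, alpha=0.0))    # pure ridge: sum(b^2)/2 = 2.625
print(enet_penalty(betas, lam=1.0, alpha=1.0))    # pure lasso: sum(|b|) = 3.5
```

Intermediate 𝛼 values simply mix the two penalties, which is why 𝛼 near 1 behaves like the lasso with a small ridge component.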

  3. Reminders • Penalized regression can be used when best subsets is infeasible • Neither ridge nor lasso will universally dominate the other • Key question: are there underlying population slope coefficients that are exactly zero? • Jeremy from State Farm: wants an iPhone app that includes "all of the important predictors," so he will use a model that includes multicollinearity in order to demonstrate that those important variables have been adjusted for • Why not find a model a couple of different ways and see whether the results agree?! If not, then investigate the differences more closely.

  4. Last Time: Model Validation • If enough data are available, one way to assess the predictive power of a model is to build the model on a training data set and then apply the model (estimated coefficients) to a test data set. • Otherwise: internal validation • If MSPE (or PRESS/n) is not very different from s², have faith that making inferences from the chosen model will not be misleading • Also make sure the model makes sense in context!
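The train/test idea on the slide can be sketched in a few lines; the data here are a made-up toy example (training rows follow y = 1 + 2x exactly, so the held-out MSPE is zero), not anything from the course.

```python
# Sketch of external validation: fit least squares on training rows only,
# then compute the mean squared prediction error (MSPE) on held-out rows.
def fit_slr(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def mspe(xs, ys, b0, b1):
    """Mean squared prediction error of y-hat = b0 + b1*x on new data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

b0, b1 = fit_slr([0, 1, 2, 3], [1, 3, 5, 7])   # training data
print(b0, b1)                                  # 1.0 2.0
print(mspe([4, 5], [9, 11], b0, b1))           # test MSPE: 0.0
```

With real data the test MSPE will exceed zero; the question is whether it is close to the training-set error, as the slide says.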

  5. Study advice • Learn from the HW assignments • Preliminary solutions to the Lab are posted • Think globally about how everything ties together • What are the "implications" of the model? • What was the ending "moral" of some of the HW questions? • What were some of the repeated messages? • Number crunching / Art / Philosophy

  6. What We’ve Been Doing: Multiple Regression • "Partial coefficients": each reflects a variable's effect after accounting for the combined effect of the other variables already in the model • Matrix scatterplots, correlation matrix • Added variable plots • Multicollinearity • No slope, p-value, or confidence interval can be interpreted without considering which other variables are in the model

  7. Model Building • Testing multiple coefficients simultaneously in "nested" models • Overall F test, partial F test (full vs. reduced models) • Is the simpler model "good enough"? (parsimony) • Unusual observations • Extreme in x-space; influential to the model • Special terms: polynomial, indicator variables, interaction terms • Supersedes two-sample procedures, ANOVA (pooled) • Variable selection techniques and validation

  8. Modeling… • Example: Salary vs. education • What if • Huge increase for college education? • Decrease in pay for graduate education? • Transformation? • Indicator variables? • EV = Colleges?

  9. Residual Plots • Start with residuals vs. y-hats • If you find an issue, explore further • E.g., residuals vs. x1, residuals vs. x2, … • If normality and equal variance are a problem and there is curvature across all EVs, transform y. If the curvature only appears with individual x's, transform those individual x's. • Also residuals vs. order • (only meaningful if the observations in the data file are in time order)

  10. Translations • Interaction term • Does the effect of age depend on smoking status? • Quadratic term • Does the effect of age differ between younger and older people? • Multicollinearity • Is the effect of age indistinguishable from the effect of education?
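The interaction "translation" above can be made concrete with made-up coefficients: in y-hat = b0 + b1·age + b2·smoke + b3·age·smoke, the per-year effect of age is b1 for non-smokers and b1 + b3 for smokers, which is exactly what "the effect of age depends on smoking status" means.

```python
# Hypothetical fitted model (all coefficient values invented for illustration).
def yhat(age, smoke, b0=100.0, b1=-0.5, b2=-10.0, b3=-0.3):
    """Prediction with an age-by-smoking interaction; smoke is 0 or 1."""
    return b0 + b1 * age + b2 * smoke + b3 * age * smoke

slope_nonsmoker = yhat(41, 0) - yhat(40, 0)   # b1 = -0.5
slope_smoker = yhat(41, 1) - yhat(40, 1)      # b1 + b3 = -0.8
print(slope_nonsmoker, slope_smoker)
```

The same one-unit-difference trick works for the quadratic translation: with an age² term, the per-year effect changes with age itself.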

  11. Mallows' Cp • "Picking a single model from among all those with small Cp is usually a matter of selecting the most convenient one whose coefficients all differ significantly from zero" • If you want just one model, choose the one with the smallest Cp • Or read the list in order and stop the first time Cp gets near p • Or take the several models where Cp is small and near p and consider them further • [Annotations on the slide's example output: "This one probably good enough"; "This one 'better'"; "Is age statistically significant?"]
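The screening rule on the slide can be sketched numerically. This assumes the standard definition Cp = SSEp/MSEfull − (n − 2p), with p counting parameters (including the intercept) in the candidate model; all the sums of squares below are invented.

```python
# Mallows' Cp for a candidate model with p parameters, given the full
# model's MSE. A well-fitting candidate has Cp near p.
def mallows_cp(sse_p, mse_full, n, p):
    return sse_p / mse_full - (n - 2 * p)

n, mse_full = 50, 4.0
candidates = [(2, 260.0), (3, 188.0), (4, 184.0)]   # hypothetical (p, SSE_p)
for p, sse in candidates:
    print(p, mallows_cp(sse, mse_full, n, p))
# p=2 gives Cp = 19 (clearly biased); p=3 gives Cp = 3, the first time
# Cp is near p, so the "read in order" rule stops there.
```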

  12. Interpretations • Interpreting standardized coefficients • A one-SD change in x is associated with a b-SD change in y • Interpreting R2adj • You don't, really; use it to compare models with different numbers of parameters • Use R2 to describe the % of variability in y explained
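The standardized-coefficient interpretation rests on the identity b_std = b_raw · (sd_x / sd_y); the numbers below are made up, but the arithmetic is the general rule.

```python
# Convert a raw slope to a standardized slope: "a one-SD change in x is
# associated with a b_std-SD change in y."
def standardize_slope(b_raw, sd_x, sd_y):
    return b_raw * sd_x / sd_y

b_std = standardize_slope(b_raw=2.5, sd_x=4.0, sd_y=20.0)
print(b_std)   # 0.5: one SD of x predicts half an SD of y
```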

  13. Interpreting Plots • Added variable plot • Use to assess utility of adding variable to model • Tells you the approximate slope and p-value if that variable is added to the model • Tells you what percentage of the currently unexplained variability will be explained

  14. Interaction Plots • One quantitative, one categorical (coded scatterplot) • Both categorical (ANOVA)

  15. Tests of significance • Stating hypotheses • About parameters • How many terms differ between the two models? • Why test a collection of terms vs. one at a time • Match the interpretation (all vs. at least one) • Corresponding degrees of freedom • When you have output vs. when working by hand
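The "by hand" case for a nested-model comparison is one formula: F = ((SSE_reduced − SSE_full)/q) / (SSE_full/(n − k − 1)), where q is the number of terms dropped and k the number of predictors in the full model. The sums of squares below are invented for illustration.

```python
# Partial F statistic for comparing nested models by hand.
def partial_f(sse_reduced, sse_full, q, df_full):
    """q = number of terms tested; df_full = n - k - 1 for the full model."""
    return ((sse_reduced - sse_full) / q) / (sse_full / df_full)

# Full model: k = 5 predictors, n = 30, so df_full = 30 - 5 - 1 = 24.
F = partial_f(sse_reduced=500.0, sse_full=380.0, q=2, df_full=24)
print(F)   # compare to an F(2, 24) reference distribution
```

The degrees of freedom (q, n − k − 1) are exactly the "corresponding degrees of freedom" bullet on the slide.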

  16. Parameterization • (0, 1) "indicator parameterization" • Reference group corresponds to x = 0 • x = 1 corresponds to the group listed • Binary: b is the difference in population means • (−1, 1) "effect parameterization" • Estimates correspond to how far each group is from "average" • The missing coefficient is found by seeing what makes all the coefficients sum to zero • Binary: −b and b are the two "effects," so 2b is the difference (and use 2 × SE for its standard error)
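The two binary codings can be checked directly from a pair of made-up group means: indicator coding puts the difference in means into b, while effect coding puts half that difference into b, so 2b recovers it.

```python
# Two hypothetical group means (groups A and B).
mean_A, mean_B = 10.0, 16.0

# Indicator (0,1) coding: yhat = b0 + b1*x with x = 0 for A, 1 for B.
b0_ind = mean_A
b1_ind = mean_B - mean_A            # 6.0, the difference in means

# Effect (-1,1) coding: yhat = b0 + b1*x with x = -1 for A, +1 for B.
b0_eff = (mean_A + mean_B) / 2      # 13.0, the overall "average"
b1_eff = (mean_B - mean_A) / 2      # 3.0; 2*b1_eff recovers the difference

print(b1_ind, b1_eff, 2 * b1_eff)
```

Both codings fit the same two group means; only the meaning of the coefficients changes.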

  17. Penalized Regression • What you need to know • There is more in this world than "least squares estimation" • We want to find the "best" estimates of the predictors' coefficients • Another key idea is how variable those estimates are and how far we expect the estimates to be from the population values/true model values • These procedures can be used when you have a very large number of predictor variables to sort through

  18. Comments • “Several methods have been proposed for dealing with collinear data. Although these methods are sometimes useful, none can be recommended generally… Methods that are commonly (and, more often than not, unjustifiably) employed with collinear data include model respecification, variable selection (stepwise and subset methods), biased estimation (e.g., ridge regression), and the introduction of additional prior information”

  19. Piecewise functions

  20. What would you consider here?

  21. Splines • Think polynomial models but with more flexibility • With a polynomial you only have so much control over where the turns happen • Splines fit different functions between the knots, often with a similar number of degrees of freedom • Still parametric models (vs. the LOESS smoother)
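A minimal way to see "different functions between the knots" is the linear spline written with a truncated term (a toy example of mine, not the slide's): y-hat = b0 + b1·x + b2·(x − knot)₊ has slope b1 before the knot and slope b1 + b2 after it, joining continuously at the knot.

```python
# Truncated "plus" term: (x - knot)_+ is 0 to the left of the knot.
def pos(t):
    return t if t > 0 else 0.0

def spline_yhat(x, b0, b1, b2, knot):
    """Linear spline with one knot; slope is b1 before it, b1+b2 after."""
    return b0 + b1 * x + b2 * pos(x - knot)

# Hypothetical coefficients: slope 2 before the knot at x=5, slope -1 after.
print(spline_yhat(4, b0=1.0, b1=2.0, b2=-3.0, knot=5.0))   # 9.0
print(spline_yhat(6, b0=1.0, b1=2.0, b2=-3.0, knot=5.0))   # 10.0
print(spline_yhat(7, b0=1.0, b1=2.0, b2=-3.0, knot=5.0))   # 9.0
```

Cubic splines work the same way with (x − knot)₊³ terms, which is what gives the extra control over where the turns happen.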

  22. Big Lessons • Writing out prediction equations • Making predictions / parameterization • Be precise (more than "I see the condition is met") • Interpretations, justifications • Partial F tests • Interpret confidence intervals • Factors that affect the widths of intervals • Does the model make sense in context? • Curvature • Model simplicity vs. model fit

  23. Big Lessons • Using slopes to compare groups • Use Ha to reflect the research question • one-sided or two-sided p-value • Be precise • "54% of the variability is explained" • "change" vs. increase or decrease • Prediction intervals vs. confidence intervals

  24. Confounding • A variable is confounding if including it changes the effect of the other variable. • There is a smoking effect, but how we view that smoking effect changes when age is included in the model • The original smoking effect was really an age effect • To be a confounding variable, it needs to be • Related to the response variable • Related to the other explanatory variable

  25. REGRESS test • Most missed questions • Q2: finding the intercept • When not to draw conclusions (extrapolation, observational studies) • Assessing patterns • Curvature vs. outliers (Q11) • Q13-Q15 (which model) • Q16 (paired data) • Assessing normality (Q 18)

  26. REGRESS test • Most missed questions • Q19 (confounder) • Q20 (inflated p-values) • Q23-Q24 (nested models, collinearity) • Q27 (confounder)
