


Chapter 3: Linear Methods for Regression
The Elements of Statistical Learning
Aaron Smalter
Chapter Outline • 3.1 Introduction • 3.2 Linear Regression Models and Least Squares • 3.3 Multiple Regression from Simple Univariate Regression • 3.4 Subset Selection and Coefficient Shrinkage • 3.5 Computational Considerations
3.1 Introduction • A linear regression model assumes the regression function E(Y|X) is linear in the inputs X1, ..., Xp • Simple, precomputer-era model • Can outperform nonlinear models when there are few training cases, a low signal-to-noise ratio, or sparse data • Can also be applied to transformations of the inputs
3.2 Linear Regression Models and Least Squares • Vector of inputs: X = (X1, X2, ..., Xp) • Predict real-valued output: Y
3.2 Linear Regression Models and Least Squares (cont'd) • Inputs derived from various sources • Quantitative inputs • Transformations of inputs (log, square, sqrt) • Basis expansions (X2 = X1^2, X3 = X1^3) • Numeric coding (map one multi-level input into many Xi's) • Interactions between variables (X3 = X1 * X2)
3.2 Linear Regression Models and Least Squares (cont'd) • Set of training data used to estimate the parameters, (x1,y1) ... (xN,yN) • Each xi is a vector of feature measurements, xi = (xi1, xi2, ..., xip)^T
3.2 Linear Regression Models and Least Squares (cont'd) • Least Squares • Pick the set of coefficients, B = (B0, B1, ..., Bp)^T • In order to minimize the residual sum of squares (RSS): RSS(B) = sum_i (yi - B0 - sum_j xij Bj)^2
3.2 Linear Regression Models and Least Squares (cont'd) • How do we pick B so that we minimize the RSS? • Let X be the N x (p+1) matrix • Rows are input vectors (1 in first position) • Cols are feature vectors • Let y be the N x 1 vector of outputs
3.2 Linear Regression Models and Least Squares (cont'd) • Rewrite RSS as RSS(B) = (y - XB)^T (y - XB) • Differentiate with respect to B: dRSS/dB = -2 X^T (y - XB) • Set the first derivative to zero (assume X has full column rank, so X^TX is positive definite): X^T (y - XB) = 0 • Obtain the unique solution: B^ = (X^TX)^-1 X^T y • Fitted values are: y^ = X B^ = X (X^TX)^-1 X^T y
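A minimal NumPy sketch of this computation (the function name and toy data are illustrative, not from the text): it prepends the column of ones and solves the normal equations under the full-column-rank assumption.

```python
import numpy as np

def least_squares(X, y):
    """Ordinary least squares: returns B^ and the fitted values y^.

    X: (N, p) input matrix without the intercept column; y: (N,) response.
    """
    N = X.shape[0]
    X1 = np.column_stack([np.ones(N), X])      # put a 1 in the first position
    # Normal equations X1^T X1 B = X1^T y (assumes X1 has full column rank).
    beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
    y_hat = X1 @ beta_hat                      # fitted values
    return beta_hat, y_hat

# Toy usage with simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=50)
beta_hat, y_hat = least_squares(X, y)
```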
3.2.2 The Gauss-Markov Theorem • Asserts: the least squares estimates have the smallest variance among all linear unbiased estimates. • Estimate any linear combination of the parameters, theta = a^T B
3.2.2 The Gauss-Markov Theorem (cont'd) • The least squares estimate is theta^ = a^T B^ = a^T (X^TX)^-1 X^T y • If X is fixed, this is a linear function c0^T y of the response vector y • If the linear model is correct, a^T B^ is an unbiased estimate of a^T B. • Gauss-Markov Theorem: if c^T y is any other linear estimator that is unbiased for a^T B, then Var(a^T B^) <= Var(c^T y)
3.2.2 The Gauss-Markov Theorem (cont'd) • However, an unbiased estimator is not always best. • A biased estimator may exist with better MSE, trading a little bias for a larger reduction in variance. • In reality, most models are approximations of the truth and therefore biased; we need to pick a model that balances bias and variance.
3.3 Multiple Regression from Simple Univariate Regression • A linear model with p > 1 inputs is a multiple linear regression model. • Suppose first we have a univariate model with no intercept (p = 1): Y = XB + e • The least squares estimate and residuals are B^ = sum_i xi yi / sum_i xi^2 and ri = yi - xi B^
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • With inner product notation we can rewrite these as B^ = <x,y> / <x,x> and r = y - x B^ • We will use univariate regression to build up multiple least squares regression.
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • Suppose the inputs x1, x2, ..., xp are orthogonal (<xj,xk> = 0 for all j != k) • Then the multiple least squares estimates Bj^ are equal to the univariate estimates <xj,y> / <xj,xj> • When inputs are orthogonal, they have no effect on each other's parameter estimates.
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • Observed inputs are not usually orthogonal, so we must orthogonalize them. • Given an intercept and a single input x, the least squares coefficient of x has the form B1^ = <x - xbar 1, y> / <x - xbar 1, x - xbar 1>, where xbar = sum_i xi / N and 1 is the vector of N ones. • This is the result of two simple regression applications: • 1. regress x on 1 to produce the residual z = x - xbar 1 • 2. regress y on the residual z to give the coefficient B1^
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • In this procedure, the phrases "regress b on a" or "b is regressed on a" refer to: • a simple univariate regression of b on a with no intercept, • producing the coefficient gamma^ = <a,b> / <a,a>, • and the residual vector b - gamma^ a. • We say b is adjusted for a, or is "orthogonalized" with respect to a.
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • We can generalize this approach to the case of p inputs (Gram-Schmidt procedure): initialize z0 = x0 = 1; for j = 1, ..., p regress xj on z0, z1, ..., z_{j-1} to produce the residual zj; finally regress y on zp. • The result of the algorithm is: Bp^ = <zp, y> / <zp, zp>, the multiple regression coefficient of xp.
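A sketch of this successive orthogonalization in NumPy (function and variable names are my own; assumes the columns of X are linearly independent). It recovers the multiple regression coefficient of the last input by regressing y on the final residual zp.

```python
import numpy as np

def last_coefficient_by_orthogonalization(X, y):
    """Gram-Schmidt style regression: returns Bp^, the multiple regression
    coefficient of the last column of X (intercept handled via z0 = 1)."""
    N, p = X.shape
    Z = [np.ones(N)]                       # z0 = x0 = 1
    for j in range(p):
        xj = X[:, j].astype(float)
        resid = xj.copy()
        for z in Z:                        # regress xj on z0, ..., z_{j-1}
            gamma = (z @ xj) / (z @ z)
            resid -= gamma * z
        Z.append(resid)                    # zj: xj adjusted for the earlier z's
    zp = Z[-1]
    return (zp @ y) / (zp @ zp)            # Bp^ = <zp, y> / <zp, zp>

# Matches the last coefficient of the full least squares fit:
# np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0][-1]
```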
3.3.1 Multiple Outputs • Suppose we have multiple outputs Y1, Y2, ..., YK, • predicted from the inputs X0, X1, ..., Xp. • Assume a linear model for each output: Yk = B0k + sum_j Xj Bjk + ek • In matrix notation: Y = XB + E • Y is the N x K response matrix, with ik entry yik • X is the N x (p+1) input matrix • B is the (p+1) x K parameter matrix • E is the N x K matrix of errors
3.3.1 Multiple Outputs (cont'd) • Generalization of the univariate loss function: RSS(B) = sum_k sum_i (yik - fk(xi))^2 = tr[(Y - XB)^T (Y - XB)] • The least squares estimate has the same form: B^ = (X^TX)^-1 X^T Y • Multiple outputs do not affect each other's least squares estimates
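A brief sketch (illustrative names) showing that the multi-output fit is just column-by-column least squares: solving once with a response matrix Y gives each output its own estimate.

```python
import numpy as np

def multi_output_least_squares(X, Y):
    """Least squares with an N x K response matrix Y.

    Column k of the returned (p+1) x K matrix equals the univariate least
    squares fit for output k alone, so outputs do not affect each other.
    """
    N = X.shape[0]
    X1 = np.column_stack([np.ones(N), X])            # N x (p+1)
    return np.linalg.solve(X1.T @ X1, X1.T @ Y)      # B^ = (X^TX)^-1 X^T Y
```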
3.4 Subset Selection and Coefficient Shrinkage • As mentioned, unbiased estimators are not always the best. • Two reasons why we may not be satisfied with the least squares estimates: • 1. prediction accuracy: low bias but large variance; can be improved by shrinking or setting some coefficients to zero. • 2. interpretation: with a large number of predictors the signal gets lost in the noise; we want to find the subset with the strongest effects.
3.4.1 Subset Selection • Improve estimator performance by retaining only a subset of the variables. • Use least squares to estimate the coefficients of the remaining inputs. • Several strategies: • Best subset regression • Forward stepwise selection • Backward stepwise selection • Hybrid stepwise selection
3.4.1 Subset Selection (cont'd) • Best subset regression • For each k in {0, 1, 2, ..., p}, • find the subset of size k that gives the smallest residual sum of squares • Leaps and bounds procedure (Furnival and Wilson 1974) • Feasible for p up to 30-40. • Typically choose the k that minimizes an estimate of expected prediction error.
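The leaps and bounds algorithm is beyond a short example, but a brute-force sketch (illustrative names, only feasible for small p) makes the size-k search explicit.

```python
import numpy as np
from itertools import combinations

def best_subset_of_size_k(X, y, k):
    """Return the size-k subset of columns with the smallest RSS."""
    N, p = X.shape
    best_rss, best_vars = np.inf, ()
    for subset in combinations(range(p), k):
        X1 = np.column_stack([np.ones(N), X[:, subset]])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        rss = np.sum((y - X1 @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_vars = rss, subset
    return best_vars, best_rss
```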
3.4.1 Subset Selection (cont'd) • Forward stepwise selection • Searching all subsets is time consuming • Instead, find a good path through them. • Start with the intercept, • sequentially add the predictor that most improves the fit. • "Improved fit" is based on the F-statistic: add the predictor that gives the largest F value • Stop adding when no remaining predictor produces a significant F value.
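A greedy sketch of forward stepwise selection (my own function names; the F threshold of 4.0 is an illustrative cutoff, not a value from the text).

```python
import numpy as np

def rss(X1, y):
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

def forward_stepwise(X, y, f_threshold=4.0):
    """Add, one at a time, the predictor with the largest F statistic;
    stop when no candidate exceeds f_threshold."""
    N, p = X.shape
    active = []
    rss_current = rss(np.ones((N, 1)), y)             # intercept-only model
    while len(active) < p:
        best = None                                   # (F, j, rss_new)
        for j in range(p):
            if j in active:
                continue
            X1 = np.column_stack([np.ones(N), X[:, active + [j]]])
            rss_new = rss(X1, y)
            df_resid = N - len(active) - 2            # N - (k+1) - 1
            f_stat = (rss_current - rss_new) / (rss_new / df_resid)
            if best is None or f_stat > best[0]:
                best = (f_stat, j, rss_new)
        if best[0] < f_threshold:                     # no significant improvement
            break
        active.append(best[1])
        rss_current = best[2]
    return active
```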
3.4.1 Subset Selection (cont'd) • Backward stepwise selection • Similar to the previous procedure, but starts with the full model • and sequentially deletes predictors. • Drop the predictor giving the smallest F value. • Stop when every remaining predictor produces a significant F value.
3.4.1 Subset Selection (cont'd) • Hybrid stepwise selection • Consider both forward and backward moves at each iteration • Make the "best" move • Need a parameter to set the threshold for choosing between the "add" and "drop" operations. • The stopping rules (for all stepwise methods) lead only to a local optimum; they are not guaranteed to find the best model.
3.4.3 Shrinkage Methods • Subset selection is a discrete process (variables retained or not) so may give high variance. • Shrinkage methods are similar, but use a continuous process to avoid high variance. • Examine two methods: • Ridge regression • Lasso
3.4.3 Shrinkage Methods (cont'd) • Ridge regression • Shrinks the regression coefficients by imposing a penalty on their size. • Ridge coefficients minimize the penalized residual sum of squares: B^ridge = argmin_B { sum_i (yi - B0 - sum_j xij Bj)^2 + lambda sum_j Bj^2 } • Where lambda >= 0 is a complexity parameter controlling the amount of shrinkage. • Mitigates the high variance produced when many input variables are correlated.
3.4.3 Shrinkage Methods (cont'd) • Ridge regression • Can be reparameterized using centered inputs, replacing each xij with xij - xbar_j • Estimate B0 by ybar = (1/N) sum_i yi • The remaining ridge regression solutions are given by B^ridge = (X^TX + lambda I)^-1 X^T y, where X is now the N x p centered input matrix
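A minimal sketch of this closed-form ridge solution with centered inputs (names are illustrative; lam plays the role of lambda).

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: intercept = ybar (unpenalized), slopes from the
    centered inputs via B^ridge = (Xc^T Xc + lam I)^-1 Xc^T y."""
    Xc = X - X.mean(axis=0)                # xij - xbar_j
    beta0 = y.mean()                       # estimate of B0
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - beta0))
    return beta0, beta
```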
3.4.3 Shrinkage Methods (cont'd) • The singular value decomposition (SVD) of the centered input matrix X has the form X = U D V^T • U and V are N x p and p x p orthogonal matrices • Cols of U span the column space of X • Cols of V span the row space of X • D is a p x p diagonal matrix with d1 >= ... >= dp >= 0 (the singular values)
3.4.3 Shrinkage Methods (cont'd) • Using the SVD, rewrite the least squares fitted vector: X B^ls = X (X^TX)^-1 X^T y = U U^T y • Now the ridge fitted vector is: X B^ridge = X (X^TX + lambda I)^-1 X^T y = sum_j uj [dj^2 / (dj^2 + lambda)] uj^T y • A greater amount of shrinkage is applied to the directions uj with smaller dj^2
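The same ridge fit written through the SVD, as a sketch; it makes the per-direction shrinkage factors dj^2 / (dj^2 + lambda) explicit (with lam = 0 it reduces to the least squares fit U U^T y).

```python
import numpy as np

def ridge_fitted_via_svd(Xc, y, lam):
    """Ridge fitted values for centered inputs Xc (add ybar separately)."""
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    shrink = d**2 / (d**2 + lam)           # shrinkage factor for each uj
    return U @ (shrink * (U.T @ y))        # sum_j uj * [dj^2/(dj^2+lam)] * <uj, y>
```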
3.4.3 Shrinkage Methods (cont'd) • The SVD of the centered matrix X expresses the principal components of the variables in X. • The eigendecomposition of X^TX is X^TX = V D^2 V^T • The eigenvectors vj are the principal component directions of X. • The first principal component direction has the largest sample variance; the last one has the minimum variance.
3.4.3 Shrinkage Methods (cont'd) • Ridge regression • Assumes that the response varies most in the directions of high variance of the inputs. • Shrinks the coefficients of the low-variance components more than the high-variance ones
3.4.3 Shrinkage Methods (cont'd) • Lasso • Shrinks coefficients, like ridge regression, but has some important differences. • The estimate is defined by B^lasso = argmin_B sum_i (yi - B0 - sum_j xij Bj)^2 subject to sum_j |Bj| <= t • Again, we reparameterize with centered inputs and fit the model without an intercept. • The ridge penalty sum_j Bj^2 is replaced by the lasso penalty sum_j |Bj|
3.4.3 Shrinkage Methods (cont'd) • Compute using quadratic programming. • The lasso is a kind of continuous subset selection • Making t sufficiently small will force some coefficients to be exactly zero • If t is larger than the threshold t0 = sum_j |Bj^ls|, the lasso estimates are the same as least squares • If t is chosen as half of this threshold, the least squares coefficients are shrunk by about 50% on average. • Choose t to minimize an estimate of prediction error.
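The quadratic-programming route mentioned above is involved; a common alternative (not described in this chapter) is cyclic coordinate descent with soft thresholding, sketched here for the penalized form with parameter lam rather than the bound t. Names and the iteration count are illustrative, and X and y are assumed centered.

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimizes 0.5*||y - X B||^2 + lam * sum_j |Bj| (no intercept;
    assumes X and y have been centered)."""
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with the contribution of xj removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta
```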
3.4.4 Methods Using Derived Input Directions • Sometimes we have a large number of inputs, and instead want to produce a smaller number of derived inputs that are linear combinations of the original inputs. • Different methods for constructing the linear combinations: • Principal components regression • Partial least squares
3.4.4 Methods Using Derived Input Directions (cont'd) • Principal Components Regression • Constructs the linear combinations using the principal components zm = X vm • Since the zm are orthogonal, the regression is a sum of univariate regressions: y^(M) = ybar 1 + sum_{m=1..M} theta^_m zm, where theta^_m = <zm, y> / <zm, zm> • The zm are each linear combinations of the original xj • Similar to ridge regression, but instead of shrinking the principal components we discard the p - M components with the smallest eigenvalues.
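A sketch of principal components regression via the SVD of the centered inputs (illustrative names; M <= p is the number of retained components).

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression with M components.

    Returns the intercept ybar and coefficients on the centered inputs;
    predict with ybar + (x_new - X.mean(axis=0)) @ beta.
    """
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T
    Z = Xc @ V[:, :M]                      # derived inputs zm = X vm
    theta = (Z.T @ y) / (d[:M] ** 2)       # <zm, y> / <zm, zm>, since <zm, zm> = dm^2
    beta = V[:, :M] @ theta                # back to coefficients on the xj
    return y.mean(), beta
```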
3.4.4 Methods Using Derived Input Directions (cont'd) • Partial least squares • The linear combinations use y in addition to X. • Begin by computing the univariate regression coefficient of y on each xj, phi^_1j = <xj, y> • Construct the derived input z1 = sum_j phi^_1j xj • In the construction of each zm, the inputs are weighted by the strength of their univariate effect on y. • The outcome y is regressed on z1 to give the coefficient theta^_1 • Each xj is then orthogonalized with respect to z1 • Iterate until M <= p directions have been obtained.
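A sketch of this partial least squares iteration (my own names; assumes the columns of X have already been standardized, as the construction requires, and returns the in-sample fit after M directions).

```python
import numpy as np

def pls_fitted(X, y, M):
    """Partial least squares: build M derived inputs zm, regress y on each,
    and orthogonalize the inputs with respect to each zm in turn."""
    N, p = X.shape
    Xm = X.astype(float).copy()
    y_hat = np.full(N, y.mean())                # start from the mean of y
    for m in range(M):
        phi = Xm.T @ y                          # univariate coefficients <xj, y>
        z = Xm @ phi                            # zm: inputs weighted by phi
        theta = (z @ y) / (z @ z)               # regress y on zm
        y_hat = y_hat + theta * z               # update the fit
        # orthogonalize each input with respect to zm
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))
    return y_hat
```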
3.4.6 Multiple Outcome Shrinkage and Selection • The least squares estimates in a multi-output linear model are simply the collection of individual estimates for each output. • To apply selection and shrinkage we could apply a univariate technique • individually to each outcome, • or simultaneously to all outcomes
3.4.6 Multiple Outcome Shrinkage and Selection (cont'd) • Canonical correlation analysis (CCA) • A data reduction technique that combines the responses • Similar to PCA • Finds • a sequence of uncorrelated linear combinations X vm of the inputs, • and a corresponding sequence of uncorrelated linear combinations Y um of the responses, • that maximize the correlations: Corr^2(Y um, X vm)
3.4.6 Multiple Outcome Shrinkage and Selection (cont'd) • Reduced-rank regression • Formalizes the CCA approach using a regression model that explicitly pools information across responses • Solves the multivariate regression problem with a rank constraint on the coefficient matrix: B^rr(m) = argmin over B with rank(B) = m of sum_i (yi - B^T xi)^T Sigma^-1 (yi - B^T xi), where Sigma is the error covariance