


Chapter 3: Linear Methods for Regression
The Elements of Statistical Learning
Aaron Smalter
Chapter Outline • 3.1 Introduction • 3.2 Linear Regression Models and Least Squares • 3.3 Multiple Regression from Simple Univariate Regression • 3.4 Subset Selection and Coefficient Shrinkage • 3.5 Computational Considerations
3.1 Introduction • A linear regression model assumes the regression function E(Y|X) is linear in the inputs X1, ..., Xp • Simple, precomputer-era model • Can outperform nonlinear models when there are few training cases, a low signal-to-noise ratio, or sparse data • Can also be applied to transformations of the inputs
3.2 Linear Regression Models and Least Squares • Vector of inputs: X = (X1, X2, ..., Xp) • Predict real-valued output: Y
3.2 Linear Regression Models and Least Squares (cont'd) • Inputs derived from various sources • Quantitative inputs • Transformations of inputs (log, square, sqrt) • Basis expansions (X2 = X1^2, X3 = X1^3) • Numeric coding (map one multi-level input into many Xi's) • Interactions between variables (X3 = X1 * X2)
3.2 Linear Regression Models and Least Squares (cont'd) • Set of training data used to estimate the parameters, (x1,y1) ... (xN,yN) • Each xi is a vector of feature measurements, xi = (xi1, xi2, ..., xip)^T
3.2 Linear Regression Models and Least Squares (cont'd) • Least Squares • Pick the set of coefficients, B = (B0, B1, ..., Bp)^T • In order to minimize the residual sum of squares (RSS): RSS(B) = sum_i (yi - B0 - sum_j xij Bj)^2
3.2 Linear Regression Models and Least Squares (cont'd) • How do we pick B so that we minimize the RSS? • Let X be the N x (p+1) matrix • Rows are input vectors (1 in first position) • Cols are feature vectors • Let y be the N x 1 vector of outputs
3.2 Linear Regression Models and Least Squares (cont'd) • Rewrite RSS as RSS(B) = (y - XB)^T (y - XB) • Differentiate with respect to B: dRSS/dB = -2 X^T (y - XB) • Set the first derivative to zero (assume X has full column rank, so X^TX is positive definite): X^T (y - XB) = 0 • Obtain the unique solution: B^ = (X^TX)^-1 X^T y • Fitted values are: y^ = X B^ = X (X^TX)^-1 X^T y
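A minimal NumPy sketch of this computation (the function name and toy data are illustrative, not from the text): it prepends the column of ones and solves the normal equations under the full-column-rank assumption.

```python
import numpy as np

def least_squares(X, y):
    """Ordinary least squares: returns B^ and the fitted values y^.

    X: (N, p) input matrix without the intercept column; y: (N,) response.
    """
    N = X.shape[0]
    X1 = np.column_stack([np.ones(N), X])      # put a 1 in the first position
    # Normal equations X1^T X1 B = X1^T y (assumes X1 has full column rank).
    beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
    y_hat = X1 @ beta_hat                      # fitted values
    return beta_hat, y_hat

# Toy usage with simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=50)
beta_hat, y_hat = least_squares(X, y)
```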
3.2.2 The Gauss-Markov Theorem • Asserts: the least squares estimates have the smallest variance among all linear unbiased estimates. • Estimate any linear combination of the parameters, theta = a^T B
3.2.2 The Gauss-Markov Theorem (cont'd) • The least squares estimate is theta^ = a^T B^ = a^T (X^TX)^-1 X^T y • If X is fixed, this is a linear function c0^T y of the response vector y • If the linear model is correct, a^T B^ is an unbiased estimate of a^T B. • Gauss-Markov Theorem: if c^T y is any other linear estimator that is unbiased for a^T B, then Var(a^T B^) <= Var(c^T y)
3.2.2 The Gauss-Markov Theorem (cont'd) • However, an unbiased estimator is not always best. • A biased estimator may exist with better MSE, trading a little bias for a larger reduction in variance. • In reality, most models are approximations of the truth and therefore biased; we need to pick a model that balances bias and variance.
3.3 Multiple Regression from Simple Univariate Regression • A linear model with p > 1 inputs is a multiple linear regression model. • Suppose first we have a univariate model with no intercept (p = 1): Y = XB + e • The least squares estimate and residuals are B^ = sum_i xi yi / sum_i xi^2 and ri = yi - xi B^
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • With inner product notation we can rewrite these as B^ = <x,y> / <x,x> and r = y - x B^ • We will use univariate regression to build up multiple least squares regression.
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • Suppose the inputs x1, x2, ..., xp are orthogonal (<xj,xk> = 0 for all j != k) • Then the multiple least squares estimates Bj^ are equal to the univariate estimates <xj,y> / <xj,xj> • When inputs are orthogonal, they have no effect on each other's parameter estimates.
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • Observed inputs are not usually orthogonal, so we must orthogonalize them. • Given an intercept and a single input x, the least squares coefficient of x has the form B1^ = <x - xbar 1, y> / <x - xbar 1, x - xbar 1>, where xbar = sum_i xi / N and 1 is the vector of N ones. • This is the result of two simple regression applications: • 1. regress x on 1 to produce the residual z = x - xbar 1 • 2. regress y on the residual z to give the coefficient B1^
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • In this procedure, the phrases "regress b on a" or "b is regressed on a" refer to: • a simple univariate regression of b on a with no intercept, • producing the coefficient gamma^ = <a,b> / <a,a>, • and the residual vector b - gamma^ a. • We say b is adjusted for a, or is "orthogonalized" with respect to a.
3.3 Multiple Regression from Simple Univariate Reg. (cont'd) • We can generalize this approach to the case of p inputs (Gram-Schmidt procedure): initialize z0 = x0 = 1; for j = 1, ..., p regress xj on z0, z1, ..., z_{j-1} to produce the residual zj; finally regress y on zp. • The result of the algorithm is: Bp^ = <zp, y> / <zp, zp>, the multiple regression coefficient of xp.
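A sketch of this successive orthogonalization in NumPy (function and variable names are my own; assumes the columns of X are linearly independent). It recovers the multiple regression coefficient of the last input by regressing y on the final residual zp.

```python
import numpy as np

def last_coefficient_by_orthogonalization(X, y):
    """Gram-Schmidt style regression: returns Bp^, the multiple regression
    coefficient of the last column of X (intercept handled via z0 = 1)."""
    N, p = X.shape
    Z = [np.ones(N)]                       # z0 = x0 = 1
    for j in range(p):
        xj = X[:, j].astype(float)
        resid = xj.copy()
        for z in Z:                        # regress xj on z0, ..., z_{j-1}
            gamma = (z @ xj) / (z @ z)
            resid -= gamma * z
        Z.append(resid)                    # zj: xj adjusted for the earlier z's
    zp = Z[-1]
    return (zp @ y) / (zp @ zp)            # Bp^ = <zp, y> / <zp, zp>

# Matches the last coefficient of the full least squares fit:
# np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0][-1]
```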
3.3.1 Multiple Outputs • Suppose we have multiple outputs Y1, Y2, ..., YK, • predicted from the inputs X0, X1, ..., Xp. • Assume a linear model for each output: Yk = B0k + sum_j Xj Bjk + ek • In matrix notation: Y = XB + E • Y is the N x K response matrix, with ik entry yik • X is the N x (p+1) input matrix • B is the (p+1) x K parameter matrix • E is the N x K matrix of errors
3.3.1 Multiple Outputs (cont'd) • Generalization of the univariate loss function: RSS(B) = sum_k sum_i (yik - fk(xi))^2 = tr[(Y - XB)^T (Y - XB)] • The least squares estimate has the same form: B^ = (X^TX)^-1 X^T Y • Multiple outputs do not affect each other's least squares estimates
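A brief sketch (illustrative names) showing that the multi-output fit is just column-by-column least squares: solving once with a response matrix Y gives each output its own estimate.

```python
import numpy as np

def multi_output_least_squares(X, Y):
    """Least squares with an N x K response matrix Y.

    Column k of the returned (p+1) x K matrix equals the univariate least
    squares fit for output k alone, so outputs do not affect each other.
    """
    N = X.shape[0]
    X1 = np.column_stack([np.ones(N), X])            # N x (p+1)
    return np.linalg.solve(X1.T @ X1, X1.T @ Y)      # B^ = (X^TX)^-1 X^T Y
```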
3.4 Subset Selection and Coefficient Shrinkage • As mentioned, unbiased estimators are not always the best. • Two reasons why we may not be satisfied with the least squares estimates: • 1. prediction accuracy: low bias but large variance; can be improved by shrinking or setting some coefficients to zero. • 2. interpretation: with a large number of predictors the signal gets lost in the noise; we want to find the subset with the strongest effects.
3.4.1 Subset Selection • Improve estimator performance by retaining only a subset of the variables. • Use least squares to estimate the coefficients of the remaining inputs. • Several strategies: • Best subset regression • Forward stepwise selection • Backward stepwise selection • Hybrid stepwise selection
3.4.1 Subset Selection (cont'd) • Best subset regression • For each k in {0, 1, 2, ..., p}, • find the subset of size k that gives the smallest residual sum of squares • Leaps and bounds procedure (Furnival and Wilson 1974) • Feasible for p up to 30-40. • Typically choose the k that minimizes an estimate of expected prediction error.
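The leaps and bounds algorithm is beyond a short example, but a brute-force sketch (illustrative names, only feasible for small p) makes the size-k search explicit.

```python
import numpy as np
from itertools import combinations

def best_subset_of_size_k(X, y, k):
    """Return the size-k subset of columns with the smallest RSS."""
    N, p = X.shape
    best_rss, best_vars = np.inf, ()
    for subset in combinations(range(p), k):
        X1 = np.column_stack([np.ones(N), X[:, subset]])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        rss = np.sum((y - X1 @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_vars = rss, subset
    return best_vars, best_rss
```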
3.4.1 Subset Selection (cont'd) • Forward stepwise selection • Searching all subsets is time consuming • Instead, find a good path through them. • Start with the intercept, • sequentially add the predictor that most improves the fit. • "Improved fit" is based on the F-statistic: add the predictor that gives the largest F value • Stop adding when no remaining predictor produces a significant F value.
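A greedy sketch of forward stepwise selection (my own function names; the F threshold of 4.0 is an illustrative cutoff, not a value from the text).

```python
import numpy as np

def rss(X1, y):
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

def forward_stepwise(X, y, f_threshold=4.0):
    """Add, one at a time, the predictor with the largest F statistic;
    stop when no candidate exceeds f_threshold."""
    N, p = X.shape
    active = []
    rss_current = rss(np.ones((N, 1)), y)             # intercept-only model
    while len(active) < p:
        best = None                                   # (F, j, rss_new)
        for j in range(p):
            if j in active:
                continue
            X1 = np.column_stack([np.ones(N), X[:, active + [j]]])
            rss_new = rss(X1, y)
            df_resid = N - len(active) - 2            # N - (k+1) - 1
            f_stat = (rss_current - rss_new) / (rss_new / df_resid)
            if best is None or f_stat > best[0]:
                best = (f_stat, j, rss_new)
        if best[0] < f_threshold:                     # no significant improvement
            break
        active.append(best[1])
        rss_current = best[2]
    return active
```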
3.4.1 Subset Selection (cont'd) • Backward stepwise selection • Similar to the previous procedure, but starts with the full model • and sequentially deletes predictors. • Drop the predictor giving the smallest F value. • Stop when every remaining predictor produces a significant F value.
3.4.1 Subset Selection (cont'd) • Hybrid stepwise selection • Consider both forward and backward moves at each iteration • Make the "best" move • Need a parameter to set the threshold for choosing between the "add" and "drop" operations. • The stopping rules (for all stepwise methods) lead only to a local optimum; they are not guaranteed to find the best model.
3.4.3 Shrinkage Methods • Subset selection is a discrete process (variables retained or not) so may give high variance. • Shrinkage methods are similar, but use a continuous process to avoid high variance. • Examine two methods: • Ridge regression • Lasso
3.4.3 Shrinkage Methods (cont'd) • Ridge regression • Shrinks the regression coefficients by imposing a penalty on their size. • Ridge coefficients minimize the penalized residual sum of squares: B^ridge = argmin_B { sum_i (yi - B0 - sum_j xij Bj)^2 + lambda sum_j Bj^2 } • Where lambda >= 0 is a complexity parameter controlling the amount of shrinkage. • Mitigates the high variance produced when many input variables are correlated.
3.4.3 Shrinkage Methods (cont'd) • Ridge regression • Can be reparameterized using centered inputs, replacing each xij with xij - xbar_j • Estimate B0 by ybar = (1/N) sum_i yi • The remaining ridge regression solutions are given by B^ridge = (X^TX + lambda I)^-1 X^T y, where X is now the N x p centered input matrix
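A minimal sketch of this closed-form ridge solution with centered inputs (names are illustrative; lam plays the role of lambda).

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: intercept = ybar (unpenalized), slopes from the
    centered inputs via B^ridge = (Xc^T Xc + lam I)^-1 Xc^T y."""
    Xc = X - X.mean(axis=0)                # xij - xbar_j
    beta0 = y.mean()                       # estimate of B0
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - beta0))
    return beta0, beta
```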
3.4.3 Shrinkage Methods (cont'd) • The singular value decomposition (SVD) of the centered input matrix X has the form X = U D V^T • U and V are N x p and p x p orthogonal matrices • Cols of U span the column space of X • Cols of V span the row space of X • D is a p x p diagonal matrix with d1 >= ... >= dp >= 0 (the singular values)
3.4.3 Shrinkage Methods (cont'd) • Using the SVD, rewrite the least squares fitted vector: X B^ls = X (X^TX)^-1 X^T y = U U^T y • Now the ridge fitted vector is: X B^ridge = X (X^TX + lambda I)^-1 X^T y = sum_j uj [dj^2 / (dj^2 + lambda)] uj^T y • A greater amount of shrinkage is applied to the directions uj with smaller dj^2
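The same ridge fit written through the SVD, as a sketch; it makes the per-direction shrinkage factors dj^2 / (dj^2 + lambda) explicit (with lam = 0 it reduces to the least squares fit U U^T y).

```python
import numpy as np

def ridge_fitted_via_svd(Xc, y, lam):
    """Ridge fitted values for centered inputs Xc (add ybar separately)."""
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    shrink = d**2 / (d**2 + lam)           # shrinkage factor for each uj
    return U @ (shrink * (U.T @ y))        # sum_j uj * [dj^2/(dj^2+lam)] * <uj, y>
```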
3.4.3 Shrinkage Methods (cont'd) • The SVD of the centered matrix X expresses the principal components of the variables in X. • The eigendecomposition of X^TX is X^TX = V D^2 V^T • The eigenvectors vj are the principal component directions of X. • The first principal component direction has the largest sample variance; the last one has the minimum variance.
3.4.3 Shrinkage Methods (cont'd) • Ridge regression • Assumes that the response varies most in the directions of high variance of the inputs. • Shrinks the coefficients of the low-variance components more than the high-variance ones
3.4.3 Shrinkage Methods (cont'd) • Lasso • Shrinks coefficients, like ridge regression, but has some important differences. • The estimate is defined by B^lasso = argmin_B sum_i (yi - B0 - sum_j xij Bj)^2 subject to sum_j |Bj| <= t • Again, we reparameterize with centered inputs and fit the model without an intercept. • The ridge penalty sum_j Bj^2 is replaced by the lasso penalty sum_j |Bj|
3.4.3 Shrinkage Methods (cont'd) • Compute using quadratic programming. • The lasso is a kind of continuous subset selection • Making t sufficiently small will force some coefficients to be exactly zero • If t is larger than the threshold t0 = sum_j |Bj^ls|, the lasso estimates are the same as least squares • If t is chosen as half of this threshold, the least squares coefficients are shrunk by about 50% on average. • Choose t to minimize an estimate of prediction error.
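The quadratic-programming route mentioned above is involved; a common alternative (not described in this chapter) is cyclic coordinate descent with soft thresholding, sketched here for the penalized form with parameter lam rather than the bound t. Names and the iteration count are illustrative, and X and y are assumed centered.

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimizes 0.5*||y - X B||^2 + lam * sum_j |Bj| (no intercept;
    assumes X and y have been centered)."""
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with the contribution of xj removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta
```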
3.4.4 Methods Using Derived Input Directions • Sometimes we have a large number of inputs, and instead want to produce a smaller number of derived inputs that are linear combinations of the original inputs. • Different methods for constructing the linear combinations: • Principal components regression • Partial least squares
3.4.4 Methods Using Derived Input Directions (cont'd) • Principal Components Regression • Constructs the linear combinations using the principal components zm = X vm • Since the zm are orthogonal, the regression is a sum of univariate regressions: y^(M) = ybar 1 + sum_{m=1..M} theta^_m zm, where theta^_m = <zm, y> / <zm, zm> • The zm are each linear combinations of the original xj • Similar to ridge regression, but instead of shrinking the principal components we discard the p - M components with the smallest eigenvalues.
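A sketch of principal components regression via the SVD of the centered inputs (illustrative names; M <= p is the number of retained components).

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression with M components.

    Returns the intercept ybar and coefficients on the centered inputs;
    predict with ybar + (x_new - X.mean(axis=0)) @ beta.
    """
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T
    Z = Xc @ V[:, :M]                      # derived inputs zm = X vm
    theta = (Z.T @ y) / (d[:M] ** 2)       # <zm, y> / <zm, zm>, since <zm, zm> = dm^2
    beta = V[:, :M] @ theta                # back to coefficients on the xj
    return y.mean(), beta
```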
3.4.4 Methods Using Derived Input Directions (cont'd) • Partial least squares • The linear combinations use y in addition to X. • Begin by computing the univariate regression coefficient of y on each xj, phi^_1j = <xj, y> • Construct the derived input z1 = sum_j phi^_1j xj • In the construction of each zm, the inputs are weighted by the strength of their univariate effect on y. • The outcome y is regressed on z1 to give the coefficient theta^_1 • Each xj is then orthogonalized with respect to z1 • Iterate until M <= p directions have been obtained.
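A sketch of this partial least squares iteration (my own names; assumes the columns of X have already been standardized, as the construction requires, and returns the in-sample fit after M directions).

```python
import numpy as np

def pls_fitted(X, y, M):
    """Partial least squares: build M derived inputs zm, regress y on each,
    and orthogonalize the inputs with respect to each zm in turn."""
    N, p = X.shape
    Xm = X.astype(float).copy()
    y_hat = np.full(N, y.mean())                # start from the mean of y
    for m in range(M):
        phi = Xm.T @ y                          # univariate coefficients <xj, y>
        z = Xm @ phi                            # zm: inputs weighted by phi
        theta = (z @ y) / (z @ z)               # regress y on zm
        y_hat = y_hat + theta * z               # update the fit
        # orthogonalize each input with respect to zm
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))
    return y_hat
```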
3.4.6 Multiple Outcome Shrinkage and Selection • The least squares estimates in a multi-output linear model are simply the collection of individual estimates for each output. • To apply selection and shrinkage we could apply a univariate technique • individually to each outcome, • or simultaneously to all outcomes
3.4.6 Multiple Outcome Shrinkage and Selection (cont'd) • Canonical correlation analysis (CCA) • A data reduction technique that combines the responses • Similar to PCA • Finds • a sequence of uncorrelated linear combinations X vm of the inputs, • and a corresponding sequence of uncorrelated linear combinations Y um of the responses, • that maximize the correlations: Corr^2(Y um, X vm)
3.4.6 Multiple Outcome Shrinkage and Selection (cont'd) • Reduced-rank regression • Formalizes the CCA approach using a regression model that explicitly pools information across responses • Solves the multivariate regression problem with a rank constraint on the coefficient matrix: B^rr(m) = argmin over B with rank(B) = m of sum_i (yi - B^T xi)^T Sigma^-1 (yi - B^T xi), where Sigma is the error covariance