Linear Methods for Regression (2)

Linear Methods for Regression (2) • Yi Zhang, Kevyn Collins-Thompson • Advanced Statistical Learning Seminar 11741, Fall 2002

What We Have Discussed • Lecture 1 (Kevyn): Unrestricted models • Linear Regression: Least-squares estimate • Confidence of Parameter Estimates • Gauss-Markov Theorem • Multiple Regression in terms of Univariates

Outline • Subset selection (feature selection) • Coefficient shrinkage (smoothing) • Ridge regression • Lasso • Using derived input directions • Principal components regression • Partial least squares • Comparing subset selection with shrinkage • Multiple outcome shrinkage and selection

Subset Selection and Shrinkage: Motivation • Bias-variance tradeoff • Goal: choose a model that minimizes prediction error • Method: accept a little bias in exchange for a reduction in variance • Better interpretation: find the strongest factors in the input space

Subset Selection • Produces a model that is interpretable and has possibly lower prediction error • Forces the coefficients of some dimensions of x to zero, thus probably decreasing the variance of the estimate

Subset Selection Methods • Find the globally optimal model: best subset regression (computationally expensive: with p predictors there are $2^p$ candidate subsets) • Greedy search for a good model (practical): • Forward stepwise selection • Begin with the empty set and sequentially add predictors • Backward stepwise selection • Begin with the full model and sequentially delete predictors • Stepwise selection: a combination of forward and backward moves
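To make the cost of the exhaustive search concrete, here is a minimal sketch of best subset regression for a fixed subset size. The function name `best_subset` and the synthetic data are illustrative assumptions, not part of the slides.

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Exhaustively search all size-k subsets of columns for the lowest RSS.

    Hypothetical helper for illustration; the number of candidates grows
    combinatorially with the number of predictors.
    """
    n, p = X.shape
    ones = np.ones((n, 1))  # intercept column
    best_rss, best_cols = np.inf, None
    for cols in itertools.combinations(range(p), k):
        A = np.hstack([ones, X[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        rss = float(resid @ resid)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss

# Synthetic example: only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 6))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + 0.1 * rng.normal(size=150)
cols_found, rss_found = best_subset(X_demo, y_demo, k=2)
```

Searching every subset size from 0 to p multiplies this cost further, which is why the greedy methods on the slide are the practical choice for larger p.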

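Adding/Dropping Feature Criteria • Goal: minimize $RSS(\beta)$ • F-test • Tests whether the change in RSS between two nested models is significant: $F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(N - p_1 - 1)}$, where model 1 is the larger model with $p_1$ predictors and model 0 the smaller one with $p_0$ • Forward selection: use the F-test to find the feature that decreases $RSS(\beta)$ most, and add it to the feature set • Backward selection: use the F-test to find the feature whose removal increases $RSS(\beta)$ least, and delete it from the feature set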
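The forward rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the stopping cutoff `f_threshold=4.0` is an arbitrary illustrative value (the slides do not give one), and the helper names are hypothetical.

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of the least-squares fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, f_threshold=4.0):
    """Greedy forward selection with an F-statistic stopping rule.

    f_threshold is an assumed, illustrative cutoff.
    """
    n, p = X.shape
    ones = np.ones((n, 1))
    selected = []
    current_rss = rss(ones, y)  # intercept-only model
    while len(selected) < p:
        # Find the unused feature whose addition gives the lowest RSS.
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cand = rss(np.hstack([ones, X[:, selected + [j]]]), y)
            if cand < best_rss:
                best_j, best_rss = j, cand
        # Residual degrees of freedom of the larger model (with intercept).
        dof = n - (len(selected) + 1) - 1
        if dof <= 0:
            break
        f_stat = (current_rss - best_rss) / (best_rss / dof)
        if f_stat < f_threshold:
            break  # improvement not large enough: stop adding features
        selected.append(best_j)
        current_rss = best_rss
    return selected

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4] + 0.1 * rng.normal(size=200)
chosen = forward_stepwise(X_demo, y_demo)
```

Backward selection follows the same pattern in reverse: start from all columns and repeatedly drop the feature whose removal raises RSS least, stopping once the F statistic for the drop is significant.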

Shrinkage • Intuition: a continuous version of subset selection • Goal: impose a penalty on model complexity to get lower variance • Two examples: • Ridge regression • Lasso

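Ridge Regression • Penalize by the sum of squares of the parameters: $\hat\beta^{ridge} = \arg\min_\beta \left\{ \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\}$, with $\lambda \ge 0$ • Or, equivalently: $\hat\beta^{ridge} = \arg\min_\beta \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2$ subject to $\sum_{j=1}^p \beta_j^2 \le s$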
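For centered inputs the penalized criterion has the closed-form solution $(X^T X + \lambda I)^{-1} X^T y$. A minimal sketch, assuming the intercept is left unpenalized and handled by centering; the function name `ridge_fit` and the data are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via the closed form on centered data.

    The intercept is not penalized; it is recovered from the column means.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + 0.3 * rng.normal(size=60)

_, beta_ols = ridge_fit(X_demo, y_demo, lam=0.0)     # lam = 0 gives least squares
_, beta_ridge = ridge_fit(X_demo, y_demo, lam=10.0)  # lam > 0 shrinks the coefficients
```

Comparing `np.linalg.norm(beta_ridge)` with `np.linalg.norm(beta_ols)` shows the shrinkage: the coefficient norm decreases as `lam` grows.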

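Understanding of Ridge Regression • Find the orthogonal principal components (basis vectors) of the inputs, then apply a greater amount of shrinkage to the basis vectors with small variance • Assumption: y varies most in the directions of high input variance • Intuitive example: stop words in text classification, assuming no covariance between words (each word is then its own principal direction) • Relates to MAP estimation. If: $\beta \sim N(0, \tau I)$, $y \sim N(X\beta, \sigma^2 I)$. Then: the ridge estimate is the mode (and mean) of the posterior of $\beta$, with $\lambda = \sigma^2 / \tau$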
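The principal-component view can be checked numerically. With the SVD $X = U D V^T$ of the centered inputs, the ridge fit is $\sum_j u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y$, so each component is shrunk by the factor $d_j^2 / (d_j^2 + \lambda)$. A small sketch on synthetic data (the column scales and `lam` are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Columns with very different variances, then centered.
X = rng.normal(size=(50, 3)) * np.array([5.0, 1.0, 0.2])
X = X - X.mean(axis=0)
y = rng.normal(size=50)
y = y - y.mean()
lam = 2.0

# Singular values d are returned in decreasing order.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Per-component shrinkage factors: close to 1 for high-variance
# directions, close to 0 for low-variance directions.
shrink = d**2 / (d**2 + lam)

# Ridge fitted values two ways: via the SVD and via the closed form.
fit_svd = U @ (shrink * (U.T @ y))
beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
fit_direct = X @ beta
```

The two fits agree, and `shrink` decreases along with the singular values, which is the "low-variance directions are shrunk more" statement on the slide.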

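Lasso • Penalize by the absolute values of the parameters: $\hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2$ subject to $\sum_{j=1}^p |\beta_j| \le t$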
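The slides do not name a solver for the lasso. As one possible illustration, the sketch below uses cyclic coordinate descent with soft-thresholding on the Lagrangian form $\frac{1}{2}\|y - X\beta\|^2 + \lambda \|\beta\|_1$; this scaling of the objective, the function names, and the value of `lam` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning exactly zero if |z| <= gamma."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1.

    Assumes X and y are centered, so no intercept is fitted.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's contribution added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
X_demo = X_demo - X_demo.mean(axis=0)
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=100)
y_demo = y_demo - y_demo.mean()

beta_lasso = lasso_cd(X_demo, y_demo, lam=20.0)
```

Unlike ridge, the absolute-value penalty sets some coefficients exactly to zero, so the lasso performs a form of continuous subset selection.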

Using Derived Input Directions • Goal: use linear combinations of the inputs as inputs in the regression • Usually the derived input directions are orthogonal to each other • Principal components regression (PCR) • Get $v_m$ using the SVD of X • Use $z_m = X v_m$ as inputs in the regression
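A minimal sketch of principal components regression as outlined above, assuming centered inputs and an illustrative function name `pcr_fit`. Because the derived inputs $z_m$ are orthogonal, the regression on them reduces to a set of univariate regressions.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal directions v_m, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    Z = Xc @ V  # derived inputs z_m = X v_m (mutually orthogonal)
    theta = (Z.T @ (y - y_mean)) / (Z**2).sum(axis=0)  # univariate regressions
    beta = V @ theta  # coefficients expressed in the original input space
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.2 * rng.normal(size=80)

b0_pcr2, beta_pcr2 = pcr_fit(X_demo, y_demo, n_components=2)
b0_full, beta_full = pcr_fit(X_demo, y_demo, n_components=4)

# Ordinary least squares with an intercept, for comparison.
A = np.hstack([np.ones((80, 1)), X_demo])
ols_coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
```

With all p components retained, PCR reproduces the least-squares fit; dropping components discards the low-variance directions.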

More questions • (Jian’s question) Q: Chapter 3 talks about how to balance the tradeoff between bias and variance. However, all of that work is based on a known distribution (linear in most cases). What could we do to balance them if we do NOT know the distribution, which is more common in reality? How about breaking the problem down into local or kernel regressions, then computing and combining them? (Then how many bins? How do we choose the roughness penalty?) A: Try linear first because it is simple and standard. It can represent many relationships and has lower variance (fewer parameters than a more complicated model). You can also try methods that make no model assumptions: nonparametric models such as kernel methods and KNN. We will learn more in later classes. NN: semi-parametric model…

Partial Least Squares • Idea: find directions that have high variance and high correlation with y • In the construction of each $z_m$, the inputs are weighted by the strength of their univariate effect on y • Step 1: form $z_1 = \sum_{j=1}^p \hat\varphi_{1j} x_j$, where $\hat\varphi_{1j} = \langle x_j, y \rangle$ • Step 2: regress y on $z_1$ • Step 3: orthogonalize $x_1, x_2, \dots, x_p$ with respect to $z_1$ • Repeat from step 1 with the orthogonalized inputs to get $z_1, z_2, \dots, z_M$
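The three steps above can be sketched as a loop. This is a minimal illustration assuming standardized inputs; the function name `pls_fit` and the synthetic data are not from the slides.

```python
import numpy as np

def pls_fit(X, y, n_components):
    """Partial least squares: returns training fitted values and the directions z_m."""
    Xm = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized inputs
    y_hat = np.full(len(y), y.mean())
    Z = []
    for _ in range(n_components):
        phi = Xm.T @ y                 # step 1: univariate inner products <x_j, y>
        z = Xm @ phi                   #         derived direction z_m
        theta = (z @ y) / (z @ z)      # step 2: regress y on z_m
        y_hat = y_hat + theta * z
        # step 3: orthogonalize every input column with respect to z_m
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))
        Z.append(z)
    return y_hat, np.column_stack(Z)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.2 * rng.normal(size=80)

fit_pls2, Z2 = pls_fit(X_demo, y_demo, n_components=2)
fit_pls4, Z4 = pls_fit(X_demo, y_demo, n_components=4)

# Least-squares fitted values, for comparison with M = p.
A = np.hstack([np.ones((80, 1)), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
fit_ols = A @ coef
```

The directions are mutually orthogonal by construction, and with M = p the PLS fit coincides with the least-squares fit; smaller M gives a more regularized model.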

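PCR vs. PLS vs. Ridge Regression • PCR discards the smallest-eigenvalue components (low-variance directions). The mth component $v_m$ solves: $\max_\alpha \mathrm{Var}(X\alpha)$ subject to $\|\alpha\| = 1$, $v_\ell^T S \alpha = 0$, $\ell = 1, \dots, m-1$, where $S$ is the sample covariance matrix of the inputs • PLS shrinks the low-variance directions, but can inflate some high-variance directions. The mth PLS direction $\hat\varphi_m$ solves: $\max_\alpha \mathrm{Corr}^2(y, X\alpha)\,\mathrm{Var}(X\alpha)$ subject to $\|\alpha\| = 1$, $\hat\varphi_\ell^T S \alpha = 0$, $\ell = 1, \dots, m-1$ • Ridge regression: shrinks the coefficients of the principal components; low-variance directions are shrunk more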

Compare Selection and Shrinkage • [Figure; labels: PCR, PLS, Least Squares, Ridge, Lasso, Best Subset]

Multiple Outcome Shrinkage and Selection • Option 1: ignore the correlation among the outcomes, and apply single-outcome shrinkage and selection to each outcome separately • Option 2: exploit the correlations among the outcomes

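Canonical Correlation Analysis • Derive input and outcome spaces based on canonical correlation analysis (CCA): find a sequence of uncorrelated linear combinations $X v_m$ and $Y u_m$ that maximize $\mathrm{Corr}^2(Y u_m, X v_m)$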
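One standard way to compute the canonical directions is through QR factorizations of the centered data followed by an SVD; the slides do not specify a computation, so the approach and the function name `cca` below are assumptions for illustration.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations and directions via QR + SVD.

    Returns (s, V_dirs, U_dirs): canonical correlations in decreasing order,
    input directions v_m (columns of V_dirs), outcome directions u_m
    (columns of U_dirs).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations.
    A, s, Bt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    V_dirs = np.linalg.solve(Rx, A)     # so that Xc @ V_dirs = Qx @ A
    U_dirs = np.linalg.solve(Ry, Bt.T)  # so that Yc @ U_dirs = Qy @ Bt.T
    return s, V_dirs, U_dirs

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
Y_demo = np.column_stack([
    X_demo[:, 0] + 0.5 * rng.normal(size=200),  # outcome related to the inputs
    rng.normal(size=200),                        # pure-noise outcome
])

s, V_dirs, U_dirs = cca(X_demo, Y_demo)
xv = (X_demo - X_demo.mean(axis=0)) @ V_dirs[:, 0]
yu = (Y_demo - Y_demo.mean(axis=0)) @ U_dirs[:, 0]
first_corr = np.corrcoef(xv, yu)[0, 1]
```

The sample correlation of the first pair of canonical variates, `first_corr`, matches the largest canonical correlation `s[0]`.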

Reduced Rank Regression • Regression in derived directions • Step 1: map y into the derived directions • Step 2: do the regression in the derived space • Step 3: map back to y’s original space
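The three steps can be sketched with the canonical outcome directions as the derived space. This is a sketch under assumptions: the directions are computed by QR + SVD, the map back uses the first rows of the inverse of the full direction matrix as the generalized inverse, and the code assumes at least as many inputs as outcomes. The function name is illustrative.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Rank-constrained coefficients B_rr = B_ols @ U_m @ U_m_minus.

    Assumes X has at least as many columns as Y. Returns (B_rr, B_ols),
    both for centered data.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    B_ols, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)

    # Canonical outcome directions u_m via QR + SVD.
    Qx, _ = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    _, _, Bt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    U_dirs = np.linalg.solve(Ry, Bt.T)  # columns are u_1, ..., u_K
    U_inv = Bt @ Ry                     # inverse of U_dirs

    U_m = U_dirs[:, :rank]       # step 1: map Y to Y @ U_m
    U_m_minus = U_inv[:rank, :]  # step 3: map back to the original outcome space
    # Step 2: regressing Yc @ U_m on Xc gives B_ols @ U_m by linearity.
    return B_ols @ U_m @ U_m_minus, B_ols

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
# Outcomes driven by a single latent combination of the inputs.
B_true = np.outer(rng.normal(size=5), np.array([1.0, -0.5, 2.0]))
Y_demo = X_demo @ B_true + 0.3 * rng.normal(size=(100, 3))

B_rr1, B_ols = reduced_rank_regression(X_demo, Y_demo, rank=1)
B_rr3, _ = reduced_rank_regression(X_demo, Y_demo, rank=3)
```

With the rank equal to the number of outcomes the result is the ordinary least-squares coefficient matrix; lower ranks pool information across correlated outcomes.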

Summary • Bias-variance tradeoff: • Subset selection (feature selection, discrete) • Coefficient shrinkage (smoothing) • Using derived input directions • Multiple outcome shrinkage and selection • Most of these algorithms are sensitive to the scaling of the inputs • Standardize the inputs, for example by normalizing the input directions to the same variance
