### Linear Methods for Regression (2)

Yi Zhang, Kevyn Collins-Thompson

Advanced Statistical Learning Seminar

11741 fall 2002

What We Have Discussed
• Lecture 1 (Kevyn): Unrestricted models
• Linear Regression: Least-squares estimate
• Confidence of Parameter Estimates
• Gauss-Markov Theorem
• Multiple Regression in terms of Univariates
Outline
• Subset selection (feature selection)
• Coefficient Shrinkage (smoothing)
• Ridge Regression
• Lasso
• Using derived input direction
• Principal component regression
• Partial Least Squares
• Compare subset selection with shrinkage
• Multiple outcome shrinkage and selection
Subset Selection and Shrinkage: Motivation
• Bias Variance Trade Off
• Goal: choose model to minimize error
• Method: sacrifice a little bit of bias to reduce the variance
• Better interpretation: find the strongest factors from the input space
Subset Selection
• Produces model that is interpretable and has possibly lower prediction error.
• Forces the coefficients of some dimensions of x to zero, which typically decreases variance (at the cost of a small increase in bias)
Subset Selection Methods
• Find the global optimal model: Best subset regression (too computationally expensive)
• Greedy search for the optimal model (practical):
• Forward stepwise selection
• Begin with the empty set, and sequentially add predictors
• Backward stepwise selection
• Begin with the full model, and sequentially delete predictors
• Stepwise selection: combination of forward and backward move
Adding/Dropping Feature Criteria
• Goal: Minimize RSS()
• F-test
• Tests hypothesis that two samples have different variances
• Forward selection: Use F-test to find the feature that decreases RSS() most, and add it to the feature set
• Backward selection: Use F-test to find the feature that increases RSS() least, and delete it from the feature set
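As a concrete illustration, here is a minimal numpy sketch of forward stepwise selection with an F-to-enter rule; the function names and the `f_enter` threshold are illustrative choices, not from the slides.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_stepwise(X, y, f_enter=4.0):
    """Greedy forward selection: at each step, add the feature whose
    inclusion most decreases RSS, as long as its F-statistic for one
    added term exceeds f_enter."""
    n, p = X.shape
    selected = []
    current = np.ones((n, 1))            # intercept-only model
    rss_cur = rss(current, y)
    while len(selected) < p:
        best_j, best_rss = None, None
        for j in range(p):
            if j in selected:
                continue
            cand = np.hstack([current, X[:, [j]]])
            r = rss(cand, y)
            if best_rss is None or r < best_rss:
                best_j, best_rss = j, r
        df = n - (len(selected) + 2)     # residual df after adding one term
        f_stat = (rss_cur - best_rss) / (best_rss / df)
        if f_stat < f_enter:
            break                        # no remaining feature is significant
        selected.append(best_j)
        current = np.hstack([current, X[:, [best_j]]])
        rss_cur = best_rss
    return selected
```

Backward stepwise selection is the mirror image: start from the full model and repeatedly delete the feature whose removal increases RSS least, while its F-statistic stays below a drop threshold.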
Shrinkage
• Intuition: continuous version of subset selection
• Goal: impose a penalty on the complexity of the model to get lower variance
• Two examples:
• Ridge regression
• Lasso
Ridge Regression
• Penalize by the sum of squares of the parameters:

$\hat\beta^{\text{ridge}} = \arg\min_\beta \left\{ \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\}$

• Or, equivalently, minimize the residual sum of squares subject to $\sum_{j=1}^p \beta_j^2 \le t$
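The penalized criterion has a closed-form minimizer, $(X^TX + \lambda I)^{-1}X^Ty$. A minimal numpy sketch, assuming X and y are centered so the intercept is left unpenalized:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate via the closed form (X'X + lam*I)^{-1} X'y.
    Assumes the columns of X and y are centered, so the intercept
    (fit separately as mean(y)) is not penalized."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

As lam goes to 0 this recovers the least-squares estimate; as lam grows, the coefficient vector is shrunk toward zero.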
Understanding of Ridge Regression
• Find the orthogonal principal components (basis vectors), and then apply greater amount of shrinkage to basis vectors with small variance.
• Assumption: y varies most in the directions of high variance of the inputs
• Intuitive example: down-weighting stop words in text classification, if assuming no covariance between words
• Relates to MAP Estimation

If: $\beta \sim N(0, \tau^2 I)$ and $y \mid \beta \sim N(X\beta, \sigma^2 I)$

Then: the negative log-posterior of $\beta$ is, up to a constant, the ridge criterion with $\lambda = \sigma^2/\tau^2$, so the ridge estimate is the posterior mode (and, by Gaussianity, also the posterior mean).

Lasso
• Penalize by the absolute value of the parameters:

$\hat\beta^{\text{lasso}} = \arg\min_\beta \left\{ \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}$
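The slides give no algorithm for the lasso; one standard approach is cyclic coordinate descent with soft-thresholding (shown here for the $\tfrac12\|y - X\beta\|^2 + \lambda\|\beta\|_1$ convention). A hedged numpy sketch, assuming centered y and standardized X:

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: shrink z toward 0 by g."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent: each coefficient update is
    a soft-thresholded univariate least-squares fit to the partial
    residual. Assumes y is centered and the columns of X standardized."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove all fitted effects except x_j's
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta
```

Unlike ridge, the absolute-value penalty sets some coefficients exactly to zero, which is why the lasso behaves like a continuous form of subset selection.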
Using Derived Input Directions
• Goal: use linear combinations of the inputs as the regressors
• Usually the derived input directions are orthogonal to each other
• Principal component regression
• Get the principal directions $v_m$ using the SVD of X
• Use the derived inputs $z_m = X v_m$ in the regression
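A minimal numpy sketch of principal component regression, assuming the columns of X are centered (the function name is illustrative):

```python
import numpy as np

def pcr(X, y, m):
    """Principal component regression: regress y on the first m
    principal components z_k = X v_k from the SVD X = U D V'.
    Assumes the columns of X are centered."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:m].T                     # derived inputs z_1..z_m
    theta = (Z.T @ y) / (d[:m] ** 2)     # univariate fits; Z has orthogonal columns
    return Vt[:m].T @ theta              # map back to coefficients on X
```

With m equal to the number of inputs, PCR reproduces the least-squares solution; choosing m smaller discards the low-variance directions.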
More questions
• (Jian’s question)

Q: Chapter 3 talks about how to balance the tradeoff between bias and variance. However, all of that work is based on a known distribution (linear in most cases). What could we do to balance them if we do NOT know the distribution, which is more common in reality? How about breaking the problem down into local/kernel regressions, then computing and combining them? (Then, how many bins? How to choose the roughness penalty?)

A: Try linear first because it is simple and common. It can represent many relationships and has less variance (fewer parameters compared to a complicated model). You can also try kernel methods without model assumptions: non-parametric models such as k-nearest neighbors. We will learn more in later classes (e.g., neural networks as semi-parametric models).

Partial Least Squares
• Idea: find directions that have high variance and high correlation with y
• In the construction of each $z_m$, the inputs are weighted by the strength of their univariate effect on y
• Step 1: compute $z_1 = \sum_j \hat\varphi_{1j} x_j$, where $\hat\varphi_{1j} = \langle x_j, y \rangle$
• Step 2: regress y on $z_1$
• Step 3: orthogonalize $x_1, x_2, \ldots, x_p$ with respect to $z_1$
• Repeat from step 1 to obtain $z_1, z_2, \ldots, z_M$
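The steps above can be sketched in numpy as follows, assuming y is centered and the columns of X are standardized (names are illustrative):

```python
import numpy as np

def pls(X, y, M):
    """Partial least squares sketch: build M derived directions, each a
    combination of the (deflated) inputs weighted by their univariate
    inner product with y. Assumes y centered, columns of X standardized."""
    Xj = X.copy()
    y_hat = np.zeros(len(y))             # y is centered, so start at 0
    Z = []
    for m in range(M):
        phi = Xj.T @ y                   # univariate strengths <x_j, y>
        z = Xj @ phi                     # derived input z_m
        theta = (z @ y) / (z @ z)        # regress y on z_m
        y_hat = y_hat + theta * z
        # orthogonalize each remaining x_j with respect to z_m
        Xj = Xj - np.outer(z, (z @ Xj) / (z @ z))
        Z.append(z)
    return np.column_stack(Z), y_hat
```

Because each deflation step removes the component along $z_m$, the derived directions come out mutually orthogonal, and with M equal to the number of inputs the PLS fit matches least squares.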
PCR discards the components with the smallest eigenvalues (the low-variance directions). The mth component direction $v_m$ solves:

$\max_{\alpha} \operatorname{Var}(X\alpha)$ subject to $\|\alpha\| = 1,\ \alpha^T S v_l = 0,\ l = 1, \ldots, m-1$

PLS shrinks the low-variance directions, but can inflate the high-variance directions. The mth component direction solves:

$\max_{\alpha} \operatorname{Corr}^2(y, X\alpha)\operatorname{Var}(X\alpha)$ subject to $\|\alpha\| = 1,\ \alpha^T S \hat\varphi_l = 0,\ l = 1, \ldots, m-1$

Ridge Regression: shrinks the coefficients of the principal components; the low-variance directions are shrunk more, by the factor $d_j^2/(d_j^2 + \lambda)$

PCR vs. PLS vs. Ridge Regression
Compare Selection and Shrinkage

(Figures: coefficient-profile plots comparing PCR, PLS, and Least Squares, and comparing ridge, lasso, and Best subset selection.)
Multiple Outcome Shrinkage and Selection
• Option 1: do not consider the correlation in different outcomes, and apply single outcome shrinkage and selection to each outcome
• Option 2: Exploit correlations in different outcomes
Canonical Correlation Analysis
• Derive input and outcome directions $X v_m$ and $Y u_m$ using canonical correlation analysis (CCA), which maximizes $\operatorname{Corr}^2(Y u_m, X v_m)$
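One standard way to compute CCA is via QR decompositions of the centered data blocks followed by an SVD of $Q_x^T Q_y$, whose singular values are the canonical correlations. A hedged numpy sketch (function name illustrative):

```python
import numpy as np

def cca(X, Y):
    """Canonical correlation analysis via QR + SVD.
    Returns canonical correlations s (descending) and direction
    matrices A, B such that corr(Xc @ A[:, k], Yc @ B[:, k]) = s[k]."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    qx, rx = np.linalg.qr(Xc)            # orthonormal bases of each block
    qy, ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(qx.T @ qy)  # singular values = canonical corrs
    A = np.linalg.solve(rx, U)           # map back to input coordinates
    B = np.linalg.solve(ry, Vt.T)        # map back to outcome coordinates
    return s, A, B
```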
Reduced Rank Regression
• Regression in derived directions
• Step 1: Map y into derived directions
• Step 2: Do regression in the derived space
• Step 3: Map back to y's original space
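One common formulation of these steps (assuming an identity error covariance) fits ordinary least squares and then projects the coefficient matrix onto the leading singular directions of the fitted values. A hedged numpy sketch:

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r multivariate regression sketch: fit OLS, then project the
    coefficient matrix onto the top r right singular directions of the
    fitted values X @ B. With identity error covariance this is
    reduced-rank regression."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)    # p x q OLS coefficients
    fitted = X @ B
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    P = Vt[:r].T @ Vt[:r]                        # projection in outcome space
    return B @ P                                 # rank-r coefficient matrix
```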
Summary
• Bias Variance trade off:
• Subset selection (feature selection, discrete)
• Coefficient shrinkage (smoothing)
• Using derived input direction
• Multiple outcome shrinkage and selection
• Most of the algorithms are sensitive to the scaling of the inputs
• Standardize the inputs, e.g., normalize each input direction to the same variance
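A minimal helper for this preprocessing step:

```python
import numpy as np

def standardize(X):
    """Center each input to mean 0 and scale to unit variance, so that
    penalties such as ridge and lasso treat all inputs comparably.
    Returns the standardized matrix plus the means and scales, which
    are needed to transform new data the same way."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return (X - mu) / sd, mu, sd
```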