Penalized Regression, Part 2


Penalized Regression, Part 2



Penalized Regression

Recall that in penalized regression, we rewrite our loss function to include not only the squared error loss but also a penalty term

Our goal then becomes to minimize this penalized loss function (i.e. the penalized sum of squares)

In the regression setting we can write M(θ) in terms of our regression parameters β as

M(β) = Σ_i (y_i − x_i'β)² + p(β)

The penalty function takes the form p(β) = λ Σ_j |β_j|^q, with q = 2 giving the ridge penalty and q = 1 the lasso



Ridge Regression

Last class we discussed ridge regression as an alternative to OLS when covariates are collinear

Ridge regression can reduce the variability and improve accuracy of a regression model

However, ridge regression provides no means of variable selection

Ideally we want to be able to reduce the variability in a model but also be able to select which variables are most strongly associated with our outcome



The Lasso versus Ridge Regression

In ridge regression, the penalized objective function is

Σ_i (y_i − x_i'β)² + λ Σ_j β_j²

Consider instead the estimator which minimizes

Σ_i (y_i − x_i'β)² + λ Σ_j |β_j|

The only change is to the penalty function, and while the change is subtle, it has a big impact on our regression estimator
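To see that only the penalty changes, here is a minimal R sketch (assuming a numeric predictor matrix x and response vector y are already defined); in glmnet, alpha=0 gives the ridge penalty and alpha=1 gives the lasso:

library(glmnet)

# Assumes x (numeric matrix of predictors) and y (numeric response) exist;
# glmnet standardizes the columns of x internally by default
ridge.fit <- glmnet(x, y, alpha=0)   # ridge: squared-coefficient penalty
lasso.fit <- glmnet(x, y, alpha=1)   # lasso: absolute-value penalty

# Coefficient paths: ridge shrinks but keeps all predictors,
# while the lasso sets some coefficients exactly to zero
par(mfrow=c(1,2))
plot(ridge.fit, xvar="lambda")
plot(lasso.fit, xvar="lambda")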



The Lasso

The name lasso stands for “Least Absolute Shrinkage and Selection Operator”

As in ridge regression, penalizing the coefficients shrinks them towards zero; in the lasso the penalty is on their absolute values

But in the lasso, some coefficients are shrunk completely to zero

Solutions where multiple coefficient estimates are identically zero are called sparse

Thus the penalty performs a continuous variable selection, hence the name



Geometry of Ridge versus Lasso: 2-Dimensional Case

Solid areas represent the constraint regions

The ellipses represent the contours of the least square error function



The Lasso

Because the lasso penalty involves an absolute value, the objective function is not differentiable and there is no closed-form solution

As a result, we must use optimization algorithms to find the minimum

Examples of these algorithms include

-Quadratic programming (limit ~100 predictors)

-Least Angle Regression/LAR (limit ~10,000 predictors)



Selection of λ

Since the lasso is not a linear estimator, there is no hat matrix H such that ŷ = Hy

Thus the degrees of freedom are more difficult to estimate

One approach is to estimate the degrees of freedom by the number of non-zero parameters in the model and then use AIC, BIC or Cp to select the best λ

Alternatively (and often preferably) we could select λ via cross-validation
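As a minimal sketch of cross-validated selection of λ with glmnet (again assuming x and y are defined), cv.glmnet returns both the λ minimizing the CV error and the largest λ within one standard error of that minimum:

library(glmnet)

set.seed(1)                                  # CV folds are random, so fix the seed
cv.fit <- cv.glmnet(x, y, alpha=1, nfolds=10)

cv.fit$lambda.min                            # lambda minimizing the CV error
cv.fit$lambda.1se                            # largest lambda within 1 SE of the minimum
plot(cv.fit)                                 # CV error curve versus log(lambda)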



Forward Stagewise Selection

An alternative method for variable subset selection designed to handle correlated predictors

An iterative process that begins with all coefficients equal to zero and builds the regression function in successive small steps

The algorithm is similar to forward selection in that predictors are added to the model successively

However, it is much more cautious than forward stepwise model selection

-e.g. for a model with 10 possible predictors, stepwise takes at most 10 steps, while stagewise may take 5000+



Forward Stagewise Selection

Stagewise algorithm (an R sketch of this loop follows below):

(1) Initialize the model with r = y and β_1 = β_2 = … = β_p = 0 (predictors standardized)

(2) Find the predictor X_j that is most correlated with r and add it to the model

(3) Update β_j ← β_j + δ_j, where δ_j = η·sign[corr(r, X_j)]

-Note, η is a small constant controlling the step-length

(4) Update r ← r − δ_j·X_j

(5) Repeat steps 2 through 4 until no predictor has any remaining correlation with r
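A minimal R sketch of this loop (for illustration only; not the lars implementation), assuming the columns of x are standardized and y is centered; the function name stagewise and the step size eps are just placeholders:

stagewise <- function(x, y, eps=0.01, n.steps=5000) {
  beta <- rep(0, ncol(x))                # (1) start with all coefficients at zero
  r <- y                                 #     and the residual equal to the response
  for (s in seq_len(n.steps)) {
    cors <- drop(crossprod(x, r))        # inner products x_j'r (proportional to correlations)
    j <- which.max(abs(cors))            # (2) predictor most correlated with r
    if (abs(cors[j]) < 1e-8) break       # (5) stop once nothing is correlated with r
    delta <- eps * sign(cors[j])         # (3) small step in the sign of the correlation
    beta[j] <- beta[j] + delta
    r <- r - delta * x[, j]              # (4) update the residual
  }
  beta
}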



Stagewise versus Lasso

Although the algorithms look entirely different, their results are very similar!

They will trace very similar paths for addition of predictors to the model

They both represent special cases of a method called least angle regression (LAR)



Least Angle Regression

LAR algorithm:

(1) Initialize the model with r = y − ȳ and β_1 = β_2 = … = β_p = 0 (predictors standardized)

Also initialize an empty "active set" A (a subset of the predictor indices)

(2) Find the predictor X_j1 that is most correlated with r; update the active set to include j1

(3) Move β_j1 from 0 toward its least squares coefficient ⟨X_j1, r⟩ until some other covariate X_j2 has the same correlation with the current residual r that X_j1 does. Update the active set to include j2

(4) Update r and move (β_j1, β_j2) along the joint OLS direction for the regression of r on (X_j1, X_j2) until a third covariate X_j3 is as correlated with r as the first two predictors. Update the active set to include j3

(5) Continue until all k covariates have been added to the model



In Pictures

Consider a case where we have 2 predictors…

Efron et al. 2004



Relationship Between LAR and Lasso

LAR is a more general method than the lasso

A modification of the LAR algorithm produces the entire lasso path as λ is varied from 0 to infinity

The modification is triggered when a previously non-zero coefficient is estimated to be zero at some point in the algorithm

If this occurs, the LAR algorithm is modified so that the coefficient is removed from the active set and the joint direction is recomputed

This modified algorithm is the most frequently implemented version of LAR



Relationship Between LAR and Stagewise

LAR is also a more general method than stagewise selection

Can also reproduce stagewise results using modified LAR

Start with the LAR algorithm and determine the best direction at each stage

If the direction for any predictor in the active set doesn’t agree in sign with the correlation between r and Xj, adjust to move in the direction of corr(r, Xj)

As the step size goes to 0, this procedure coincides with a modified version of the LAR algorithm



Summary of the Three Methods

  • LARS

    • Uses least squares directions in the active set of variables

  • Lasso

    • Uses the least squares directions

    • If a coefficient crosses 0, that variable is removed from the active set

  • Forward stagewise

    • Uses non-negative least squares directions in the active set
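In R, the lars() function fits all of these variants through its type argument; a brief sketch (assuming x and y as before):

library(lars)

fit.lar   <- lars(x, y, type="lar")                # least angle regression
fit.lasso <- lars(x, y, type="lasso")              # LAR modified to give the lasso path
fit.stage <- lars(x, y, type="forward.stagewise")  # (infinitesimal) forward stagewise
fit.step  <- lars(x, y, type="stepwise")           # ordinary forward stepwise, for comparison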



Degrees of Freedom in LAR and Lasso

Consider fitting a LAR model with k < p parameters

Equivalently, use a lasso bound t that constrains the full regression fit

A general definition for the effective degrees of freedom (edf) of an adaptively fit vector ŷ is

df(ŷ) = (1/σ²) Σ_i Cov(ŷ_i, y_i)

For LARS, after the kth step the edf of the fit vector is exactly k

For the lasso, at any stage of the fit the effective degrees of freedom is approximately the number of predictors in the model (i.e. those with non-zero coefficients)
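For example, the df component of a glmnet lasso fit simply counts the non-zero coefficients at each value of λ, which is the approximation used here (a sketch, assuming x and y as before):

fit <- glmnet(x, y, alpha=1)

# Approximate effective degrees of freedom along the path:
# the number of non-zero coefficients at each lambda
cbind(lambda=fit$lambda, df=fit$df)[1:5, ]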



Software Packages

What if we consider lasso, forward stagewise, or LAR as alternatives?

There are 2 packages in R that will allow us to do this

-lars

-glmnet

The lars package has the advantage of being able to fit all three model types (plus a typical forward stepwise selection algorithm)

However, the glmnet package can fit lasso-penalized models for several types of regression

-linear, logistic, Cox proportional hazards, multinomial, and Poisson
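For example (a sketch; the binary outcome y.bin and the survival variables surv.time and surv.status are hypothetical), the same glmnet call fits a lasso-penalized logistic or Cox model by changing the family argument:

library(glmnet)

# Lasso-penalized logistic regression for a hypothetical binary outcome y.bin
fit.logit <- glmnet(x, y.bin, family="binomial", alpha=1)

# Lasso-penalized Cox model; the hypothetical survival outcome is supplied as a
# two-column matrix with columns named "time" and "status"
y.surv <- cbind(time=surv.time, status=surv.status)
fit.cox <- glmnet(x, y.surv, family="cox", alpha=1)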



Body Fat Example

Recall our regression model

> summary(mod13)

Call:

lm(formula = PBF ~ Age + Wt + Ht + Neck + Chest + Abd + Hip + Thigh + Knee + Ankle + Bicep + Arm + Wrist, data = bodyfat, x = T)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -18.18849 17.34857 -1.048 0.29551

Age 0.06208 0.03235 1.919 0.05618 .

Wt -0.08844 0.05353 -1.652 0.09978 .

Ht -0.06959 0.09601 -0.725 0.46925

Neck -0.47060 0.23247 -2.024 0.04405 *

Chest -0.02386 0.09915 -0.241 0.81000

Abd 0.95477 0.08645 11.04 < 2e-16 ***

Hip -0.20754 0.14591 -1.422 0.15622

Thigh 0.23610 0.14436 1.636 0.10326

Knee 0.01528 0.24198 0.063 0.94970

Ankle 0.17400 0.22147 0.786 0.43285

Bicep 0.18160 0.17113 1.061 0.28966

Arm 0.45202 0.19913 2.270 0.02410 *

Wrist -1.62064 0.53495 -3.030 0.00272 **

Residual standard error: 4.305 on 238 degrees of freedom. Multiple R-squared: 0.749,

Adjusted R-squared: 0.7353. F-statistic: 54.65 on 13 and 238 DF, p-value: < 2.2e-16



Body Fat Example

LAR:

>library(lars)

>par(mfrow=c(2,2))

>object <- lars(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), type="lasso")

>plot(object, breaks=F)

>object2 <- lars(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), type="lar")

>plot(object2, breaks=F)

>object3 <- lars(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), type="forward.stagewise")

>plot(object3, breaks=F)

>object4 <- lars(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), type="stepwise")

>plot(object4, breaks=F)



Body Fat Example

A closer look at the model:

>object <- lars(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), type="lasso")

> names(object)

[1] "call" "type" "df" "lambda" "R2" "RSS" "Cp" "actions"

[9] "entry" "Gamrat" "arc.length" "Gram" "beta" "mu" "normx" "meanx"

> object$beta

Age Wt Ht Neck Chest Abd Hip Thigh Knee Ankle

0 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.000000000 0.00000000 0.00000000 0.0000000

1 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.5164924 0.000000000 0.00000000 0.00000000 0.0000000

2 0.00000000 0.00000000 -0.04395065 0.00000000 0.00000000 0.5314218 0.000000000 0.00000000 0.00000000 0.0000000

3 0.01710504 0.00000000 -0.13752803 0.00000000 0.00000000 0.5621288 0.000000000 0.00000000 0.00000000 0.0000000

4 0.04880181 0.00000000 -0.15894236 0.00000000 0.00000000 0.6550929 0.000000000 0.00000000 0.00000000 0.0000000

5 0.04994577 0.00000000 -0.15905246 -0.02624509 0.00000000 0.6626603 0.000000000 0.00000000 0.00000000 0.0000000

6 0.06499276 0.00000000 -0.15911969 -0.25799496 0.00000000 0.7079872 0.000000000 0.00000000 0.00000000 0.0000000

7 0.06467180 0.00000000 -0.15921694 -0.26404701 0.00000000 0.7118167 -0.004720494 0.00000000 0.00000000 0.0000000

8 0.06022586 -0.01117359 -0.14998300 -0.29599536 0.00000000 0.7527298 -0.022557736 0.00000000 0.00000000 0.0000000

9 0.05710956 -0.02219531 -0.14039586 -0.32675736 0.00000000 0.7842966 -0.035675017 0.00000000 0.00000000 0.0000000

10 0.05853733 -0.04577935 -0.11203059 -0.39386199 0.00000000 0.8425758 -0.101022340 0.09657784 0.00000000 0.0000000

11 0.06132775 -0.07889636 -0.07798153 -0.45141574 0.00000000 0.9142944 -0.171178163 0.20141924 0.00000000 0.1259630

12 0.06214695 -0.08452690 -0.07220347 -0.46528070 -0.01582661 0.9402896 -0.194491760 0.22553958 0.00000000 0.1586161

13 0.06207865 -0.08844468 -0.06959043 -0.47060001 -0.02386415 0.9547735 -0.207541123 0.23609984 0.01528121 0.1739954

Bicep Arm Wrist

0 0.00000000 0.0000000 0.000000

1 0.00000000 0.0000000 0.000000

2 0.00000000 0.0000000 0.000000

3 0.00000000 0.0000000 0.000000

4 0.00000000 0.0000000 -1.169755

5 0.00000000 0.0000000 -1.198047

6 0.00000000 0.2175660 -1.535349

7 0.00000000 0.2236663 -1.538953

8 0.00000000 0.2834326 -1.535810

9 0.04157133 0.3117864 -1.534938

10 0.09096070 0.3635421 -1.522325

11 0.15173471 0.4229317 -1.587661

12 0.17055965 0.4425212 -1.607395

13 0.18160242 0.4520249 -1.620639



Body Fat Example

A closer look at the model:

> names(object)

[1] "call" "type" "df" "lambda" "R2" "RSS" "Cp" "actions"

[9] "entry" "Gamrat" "arc.length" "Gram" "beta" "mu" "normx" "meanx"

> object$df

Intercept
        1   2   3   4   5   6   7   8   9  10  11  12  13  14

> object$Cp

      0      1      2      3      4      5      6      7      8      9     10     11     12     13
  698.4  93.62  85.47  65.41  30.12  30.51  19.39  20.91  18.68  17.41  12.76  10.47  12.06  14.00
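As a sketch, the Cp-optimal step can be read off the fitted object directly (here step 11, where Cp = 10.47 is the minimum):

# Position of the smallest Cp; the rows of object$beta line up with object$Cp
best <- which.min(object$Cp)

object$Cp[best]        # 10.47, at step 11 in the output above
object$beta[best, ]    # coefficient vector for the Cp-optimal step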



Body Fat Example

Glmnet:

>fit<-glmnet(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), alpha=1)

>fit.cv<-cv.glmnet(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), alpha=1)

>plot(fit.cv, sign.lambda=-1)

>fit<-glmnet(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), alpha=1, lambda=0.02123575)



Body Fat Example

Glmnet:

>fit<-glmnet(x=as.matrix(bodyfat[,3:15]),y=as.vector(bodyfat[,2]), alpha=1)

>names(fit)

[1] "a0" "beta" "df" "dim" "lambda" "dev.ratio" "nulldev" "npasses" "jerr"

[10] "offset" "call" "nobs"

> fit$lambda

[1] 6.793883455 6.190333574 5.640401401 5.139323686 4.682760334 4.266756812 3.887709897 3.542336464 3.227645056

[10] 2.940909965 2.679647629 2.441595119 2.224690538 2.027055162 1.846977168 1.682896807 1.533392893 1.397170495

[19] 1.273049719 1.159955490 1.056908242 0.963015426 0.877463790 0.799512325 0.728485854 0.663769178 0.604801754

[28] 0.551072833 0.502117041 0.457510347 0.416866389 0.379833128 0.346089800 0.315344136 0.287329832 0.261804242

[37] 0.238546274 0.217354481 0.198045308 0.180451508 0.164420694 0.149814013 0.136504949 0.124378225 0.113328806

[46] 0.103260988 0.094087566 0.085729086 0.078113150 0.071173793 0.064850910 0.059089734 0.053840365 0.049057335

[55] 0.044699216 0.040728261 0.037110075 0.033813318 0.030809436 0.028072411 0.025578535 0.023306209 0.021235749

[64] 0.019349224 0.017630292 0.016064066 0.014636978 0.013336669 0.012151876 0.011072337 0.010088701 0.009192449

[73] 0.008375817 0.007631733 0.006953750 0.006335998 0.005773126 0.005260257



Body Fat Example

Glmnet:

>fit.cv<-cv.glmnet(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), alpha=1)

> names(fit.cv)

[1] "lambda" "cvm" "cvsd" "cvup" "cvlo" "nzero" "name" "glmnet.fit"

[9] "lambda.min" "lambda.1se"

> fit.cv$lambda.min

[1] 0.02123575
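The coefficients at the CV-selected λ can then be pulled straight from the cv.glmnet fit, for example:

coef(fit.cv, s="lambda.min")   # sparse coefficient vector at the CV-optimal lambda
coef(fit.cv, s="lambda.1se")   # more conservative choice within 1 SE of the minimum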



Ridge versus Lasso Coefficient Paths

(Figure: coefficient path plot for the ridge fit)



Trace Plot


(Figure: coefficient trace plots for the Lasso, LAR, Stagewise, and Stepwise fits)



Body Fat Example



If we remove the outliers and clean up the data before analysis…



Body Fat Example

What can we do in SAS?

SAS can also do cross-validation

However, PROC GLMSELECT only fits linear regression models

Here’s the basic SAS code

ods graphics on;

proc glmselect data=bf plots=all;

model pbf=age wt ht neck chest abd hip thigh knee ankle bicep arm wrist/selection=lasso(stop=none choose=AIC);

run;

ods graphics off;



The GLMSELECT Procedure

LASSO Selection Summary

                Effect       Effect       Number
  Step          Entered      Removed      Effects In          AIC
  -----------------------------------------------------------------
     0          Intercept                      1          1325.7477
     1          Abd                            2          1070.4404
     2          Ht                             3          1064.8357
     3          Age                            4          1049.4793
     4          Wrist                          5          1019.1226
     5          Neck                           6          1019.6222
     6          Arm                            7          1009.0982
     7          Hip                            8          1010.6285
     8          Wt                             9          1008.4396
     9          Bicep                         10          1007.1631
    10          Thigh                         11          1002.3524
    11          Ankle                         12           999.8569*
    12          Chest                         13          1001.4229
    13          Knee                          14          1003.3574


Penalized Regression, Part 2

Penalized regression methods are most useful when

-high collinearity exists among the predictors

-p >> n (many more predictors than observations)

Keep in mind you still need to look at the data first

Other forms of penalized regression could also be considered, though in practice these alternatives are rarely used

