
Gibbs Variable Selection


Presentation Transcript


  1. Gibbs Variable Selection Xiaomei Pan, Kellie Poulin, Jigang Yang, Jianjun Zhu

  2. Topics
     1. Overview of Variable Selection Procedures
     2. Gibbs Variable Selection (GVS)
     3. How to implement in WinBUGS
     4. Recommendations
     5. Appendices

  3. Overview of Variable Selection Procedures
     • Selecting the best model involves choosing:
       • The best likelihood
       • The link function
       • Priors
       • Variable selection

  4. Overview of Variable Selection Procedures
     • The type of response variable typically narrows our choices for:
       • The best likelihood
       • The link function
     • Also, we often have only a relatively small number of priors to choose from:
       • If no prior information is known, non-informative priors are chosen for the model parameters.
       • If prior information is known, this will also narrow our choices for priors.

  5. Overview of Variable Selection Procedures
     • However, there may be many choices for which variables should be included in the model.
     • In many real-world problems the number of candidate variables is in the tens to hundreds.
     • For example, 10 candidate variables lead to 2^10 = 1024 different possible linear models.
     • Even more models are possible if interactions and non-linear terms are considered!

  6. Overview of Variable Selection Procedures
     • Things to keep in mind when doing variable selection:
       • P-values for the selected variables are no longer valid.
       • Variable selection is a form of "data snooping" (recall "multiple comparison procedures").
       • Correlation between the predictors can lead to less-than-optimal results.
       • It is best to use cross-validation techniques as a safeguard.
       • Variable selection is best used in "prediction models" rather than "effect models".

  7. Overview of Variable Selection Procedures
     • Variable selection methods
       • Frequentists use methods such as:
         • Stepwise regression
         • Mallows's Cp
         • Maximum R-squared
       • What methods do Bayesians use?

  8. Overview of Bayesian Variable Selection Procedures * Refer to the references slide

  9. Overview of Bayesian Variable Selection Procedures * Thanks to Dr. Matt Bognar for insights into reversible jump. His thesis in 241 SH contains examples of using this method.

  10. Gibbs Variable Selection • GVS Sampling Procedure
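      The original slide presents the sampling procedure as a figure. As a sketch of the structure it depicts (consistent with the WinBUGS code later in this presentation, and with Dellaportas, Forster & Ntzoufras, 2002):

          \begin{aligned}
          Y_i \mid \beta, g &\sim N\Big(\beta_0 + \textstyle\sum_{j=1}^{p} g_j \beta_j z_{ij},\ \sigma^2\Big) \\
          g_j &\sim \mathrm{Bernoulli}(0.5) \\
          \beta_j \mid g_j = 1 &\sim N(0,\ c^2) \quad \text{(diffuse "actual" prior)} \\
          \beta_j \mid g_j = 0 &\sim N(\tilde{\mu}_j,\ \tilde{s}_j^2) \quad \text{(pseudo-prior, from a pilot run of the full model)}
          \end{aligned}

      Because the pseudo-prior is used only when g_j = 0, in which case the term g_j beta_j z_ij drops out of the likelihood, the choice of pseudo-prior affects the sampler's efficiency but not the posterior model probabilities.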

  11. How to Implement in WinBUGS
      • We adapted code from Ntzoufras, I. (2003).
      • This code and the paper are available on the WinBUGS web site.
      • The example showed variable selection for a model with 3 candidate predictor variables.
      • That code required the user to modify it extensively to use it with their own data.

  12. How to Implement in WinBUGS
      • Our WinBUGS code requires no changes in the model specification.
      • The user must only insert their data and modify the initial values:
        • p = number of x variables
        • N = number of observations
        • models = number of models (2^p)
        • Initial values for the betas

  13. How to Implement in WinBUGS
      • Provided at the end of this presentation:
        • Full WinBUGS code for variable selection
        • Code for fitting the full model in WinBUGS (for development of pseudo-priors); this code also only requires the user to change the data and initial values
        • R code to assist in interpreting output from WinBUGS
        • SAS code used to develop sample data

  14. How to Implement in WinBUGS: EXAMPLE
      • Data used for the example:
        • Validated our code using published results for a model with 3 variables
        • Created a simulated data set with 500 observations and 10 predictors (9 continuous, 1 binary)
        • Created a version of the file with correlation between predictors to test robustness of GVS to non-orthogonal data
      • Code used to create the simulated data is in Appendix 1
      • Full correlation matrix (for the correlated data) is in Appendix 2

  15. How to Implement in WinBUGS
      Target model created in simulated data:
      Y = 2*X1 + 0.7*X6 + 0.22*X10 - 0.7*X5 + random error (Normal(0, 1))
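      For readers who prefer R to SAS, a minimal simulation sketch of the same target model (this mirrors the SAS code in Appendix 1 but is not the authors' code; the seed is illustrative):

          set.seed(1)                                   # illustrative seed, not the SAS seed
          n <- 500
          x <- matrix(rnorm(n * 10), n, 10) + rep(1:10, each = n)  # x_j ~ Normal(j, 1)
          x[, 5] <- as.numeric(x[, 5] <= 5)             # turn x5 into a binary variable
          y <- 2*x[, 1] + 0.7*x[, 6] + 0.22*x[, 10] - 0.7*x[, 5] + rnorm(n)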

  16. How to implement in WinBUGS: Prior to Running Our Code
      • Step 1: Standardize the X matrix and (optionally) orthogonalize it**
      • Step 2: Fit the full model to develop information for the pseudo-priors
      • Step 3: Determine the likelihoods and priors
      ** see recommendations

  17. How to Implement GVS in WinBUGS: Step 1
      • Standardize the X matrix:
        • Centering the covariates removes correlation between the model coefficients.
        • Dividing by the standard deviation allows coefficients to be compared on the same scale.
        • It may also make it easier to assign proper non-informative priors.
        • If the user wants to use informative priors, this rescaling may be a nuisance.

  18. How to Implement GVS in WinBUGS: Step 1
      • Standardize the X matrix
      • In SAS:
        proc standard data=input_file_name mean=0 std=1 out=output_file_name;
          var variable_names;
        run;
      • In WinBUGS:
        # This is at the top of the model code and is done automatically
        for (j in 1:p) {
          b[j] <- beta[j]/sd(x[,j])
          for (i in 1:N) {
            z[i,j] <- (x[i,j] - mean(x[,j]))/sd(x[,j])
          }
          temp_mean[j] <- b[j]*mean(x[,j])
        }
        b0 <- beta0 - sum(temp_mean[])
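      As a cross-check outside WinBUGS, a minimal R sketch of the same standardization and coefficient back-transformation (assumes x is a numeric matrix, and beta, beta0 were fit on the standardized scale):

          z  <- scale(x)                                   # center each column, divide by its sd
          b  <- beta / attr(z, "scaled:scale")             # coefficients on the raw scale
          b0 <- beta0 - sum(b * attr(z, "scaled:center"))  # raw-scale intercept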

  19. How to Implement GVS in WinBUGS: Step 1
      • Orthogonalize the X matrix**
        • This is recommended by those who designed GVS, to make variable selection more accurate.
        • However, it makes interpretation of the coefficients impossible.
        • We do not include this step in the WinBUGS code, but suggest some alternative methods for handling correlation in our recommendations.

  20. How to Implement GVS in WinBUGS: Step 1
      • Orthogonalize the X matrix**
        • In our example we simulated data with correlations, and GVS seems able to handle some correlation.
        • More analysis is needed to determine the sensitivity of the procedure to correlations in the X matrix.
        • We recommend variable clustering to remove highly correlated variables.
      • If you do want to orthogonalize your X matrix, this is the appropriate SAS code:
        proc princomp data=input_file_name out=output_file_name;
          var variable_names;
        run;
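      If you prefer R, a minimal sketch of the same idea via principal components (the score columns are mutually uncorrelated, at the cost of interpretability):

          pc <- prcomp(x, center = TRUE, scale. = TRUE)  # principal components of the X matrix
          z  <- pc$x                                     # orthogonal columns to use in place of x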

  21. How to Implement GVS in WinBUGS: Step 2
      • Fit the full model to get pseudo-priors
      • In SAS:
        proc reg data=input_file_name;
          model y = variable_list;
        run;
      • In WinBUGS (full code is at the end of this document):
        model {
          # standardization code here
          # likelihood
          for (i in 1:N) {
            Y[i] ~ dnorm(mu[i], tau)
            for (j in 1:p) {
              temp[i,j] <- beta[j]*z[i,j]
            }
            mu[i] <- beta0 + sum(temp[i,])
          }
          # priors
          beta0 ~ dnorm(0, 0.00001)
          for (j in 1:p) {
            beta[j] ~ dnorm(0, 1.0E-6)
          }
          tau ~ dgamma(1.0E-3, 1.0E-3)
          sigma <- sqrt(1/tau)
        }
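      A minimal R sketch of the same pilot fit (assumes the response y and the standardized predictors are in a data frame d; the estimates and standard errors become the pseudo-prior means and sds fed to the GVS code):

          fit <- lm(y ~ ., data = d)            # full model on the standardized predictors
          cf  <- summary(fit)$coefficients
          pseudo_mean <- cf[-1, "Estimate"]     # drop the intercept row
          pseudo_se   <- cf[-1, "Std. Error"]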

  22. How to Implement GVS in WinBUGS: Step 2
      • Fit the full model to get pseudo-priors
      • In SAS:
        Parameter Estimates
        Variable    DF   Estimate   Std. Error   t Value   Pr > |t|
        Intercept    1    7.99442     0.04431     180.42    <.0001
        x1           1    2.03194     0.05115      39.72    <.0001
        x2           1   -0.01728     0.05096      -0.34    0.7347
        x3           1   -0.09008     0.04458      -2.02    0.0439
        x4           1    0.02241     0.04495       0.50    0.6183
        x5           1   -0.38520     0.04590      -8.39    <.0001
        x6           1    0.71948     0.04489      16.03    <.0001
        x7           1    0.03432     0.04496       0.76    0.4456
        x8           1    0.04153     0.04471       0.93    0.3534
        x9           1   -0.01484     0.04682      -0.32    0.7515
        x10          1    0.20915     0.04488       4.66    <.0001

  23. How to Implement GVS in WinBUGS: Step 2
      • Fit the full model to get pseudo-priors
      • In WinBUGS (note the estimates are very similar, and SAS runs in a fraction of the time)

  24. How to Implement GVS in WinBUGS: Step 3
      • Determine the likelihood and priors:
        • Our GVS code is set up with the standard likelihood and non-informative priors.
        • Edit the GVS code before running to insert the data for the pseudo-priors.

  25. How to implement in WinBUGS: Running Our Adapted Code
      model {
        # Standardize x's and coefficients
        for (j in 1:p) {
          b[j] <- beta[j]/sd(x[,j])
          for (i in 1:N) {
            z[i,j] <- (x[i,j] - mean(x[,j]))/sd(x[,j])
          }
          temp_mean[j] <- b[j]*mean(x[,j])
        }
        b0 <- beta0 - sum(temp_mean[])

  26. How to implement in WinBUGS
        # likelihood
        for (i in 1:N) {
          Y[i] ~ dnorm(mu[i], tau)
          for (j in 1:p) {
            temp[i,j] <- g[j]*beta[j]*z[i,j]
          }
          mu[i] <- beta0 + sum(temp[i,])
          # residuals
          stres[i] <- (Y[i] - mu[i])/sigma
          # flag as an outlier if the standardized residual exceeds 2.5 in absolute value
          outlier[i] <- step(stres[i] - 2.5) + step(-(stres[i] + 2.5))
        }

  27. How to implement in WinBUGS
        # Create indicators for the possible variables in the model;
        # each g[j] contributes pow(2, j-1) to the model number
        for (j in 1:p) {
          TempIndicator[j] <- g[j]*pow(2, j-1)
        }
        # Create a model number for each possible model
        mdl <- 1 + sum(TempIndicator[])
        # Calculate the percentage of time each model is selected
        for (j in 1:models) {
          pmdl[j] <- equals(mdl, j)
        }
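      To make the numbering concrete, a small R sketch (a hypothetical helper, same arithmetic as the WinBUGS lines above):

          model_number <- function(g) 1 + sum(g * 2^(seq_along(g) - 1))
          model_number(rep(0, 10))              # null model (intercept only) -> model 1
          model_number(c(1,0,0,0,1,1,0,0,0,1))  # target model x1+x5+x6+x10 -> model 562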

  28. How to implement in WinBUGS
        # diffuse normal prior on the intercept
        beta0 ~ dnorm(0, 0.00001)
        # if the parameter is not in the model (g[j] = 0), an informative
        # pseudo-prior is used; it is calculated from a prior run of the full model
        for (j in 1:p) {
          bprior[j] <- (1 - g[j])*mean[j]
          tprior[j] <- g[j]*0.001 + (1 - g[j])/(se[j]*se[j])
          beta[j] ~ dnorm(bprior[j], tprior[j])
          g[j] ~ dbern(0.5)
        }
        tau ~ dgamma(1.0E-3, 1.0E-3)
        sigma <- sqrt(1/tau)
      }
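      Written out (recall dnorm in WinBUGS is parameterized by precision), the prior these lines implement switches between a diffuse prior and the pseudo-prior according to g_j:

          \beta_j \mid g_j \sim N\!\Big((1-g_j)\,\tilde{\mu}_j,\ \big(g_j \cdot 0.001 + (1-g_j)/\tilde{s}_j^{\,2}\big)^{-1}\Big)

      so g_j = 1 gives a diffuse N(0, 1000) prior, and g_j = 0 gives the pseudo-prior N(mean[j], se[j]^2) from the pilot run of the full model.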

  29. How to implement in WinBUGS
      # Fit the full model and put the information for the mean and se of each
      # standardized beta here for use in the pseudo-priors
      DATA
      mean[]    se[]
       2.013    0.112
       0.149    0.112
      -0.259    0.094
      -0.161    0.096
      -0.293    0.134
       0.564    0.097
       0.094    0.098
      -0.023    0.097
      -0.018    0.133
       0.214    0.088
      END

  30. How to implement in WinBUGS
      # Set the initial values for the betas, the overall
      # precision, and the variable selection indicators
      Initial Values
      list(beta0 = 0, beta = c(0,0,0,0,0,0,0,0,0,0), tau = .1,
           g = c(1,1,1,1,1,1,1,1,1,1))
      # p = number of parameters
      # N = number of observations
      # models = number of possible models (2^p)
      DATA
      list(p = 10, N = 500, models = 1024,
           Y = c(6.339, …

  31. How to implement in WinBUGS
      • Output
        • You should monitor at least all of the betas and pmdl, the percentage of the time each model was picked.
        • Models will be numbered 1 to 1024 in our example (the total number of models).
        • To see which model number corresponds to which variables selected, we created an R function which outputs the model names in order (Appendix 5).
        • To use this function, type print.ind(number of x variables). Example: print.ind(10)
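      A hypothetical R post-processing sketch (assumes pmdl is the vector of posterior means of pmdl[1..models] copied from the WinBUGS node statistics):

          top <- order(pmdl, decreasing = TRUE)[1:5]  # indices of the 5 most-visited models
          data.frame(model = top, prob = pmdl[top])   # model numbers with visit frequencies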

  32. How to implement in WinBUGS: Correlated Data
      • Our Example
        • Ran 50,000 iterations; some models were not visited.
        • beta are the standardized coefficients; b are the raw coefficients.
        • Simulated: b1 = 2, b5 = -.7, b6 = .7, b10 = .22

  33. How to implement in WinBUGS: Correlated Data
      • Our Example
        • The model visited 97% of the time was the target model.
        • This is due to the strong correlations between the response and predictors.
        • We also modeled data with weaker correlations; in these cases the target model was usually in the top 5 models, and several models were visited with higher frequency.

  34. How to implement in WinBUGS: Correlated Data
      • Our Example
        • The frequentist stepwise method picks the appropriate model but also includes x3 at the .05 level.
        • Note that the p-value of .0473 is not accurate (since we were data snooping).
        • Parameter estimates are very similar.

  35. How to implement in WinBUGS: Non-Correlated Data
      • Our Example
        • Ran 50,000 iterations; some models were not visited.
        • beta are the standardized coefficients; b are the raw coefficients.
        • Simulated: b1 = 2, b5 = -.7, b6 = .7, b10 = .22

  36. How to implement in WinBUGS: Non-Correlated Data
      • Our Example
        • Oddly, the coefficients for x10 and x5 are farther off than in the correlated data. This is by chance (found in a review of the simulated data).
        • The model visited most often is the target model.
        • This model is visited a slightly higher percentage of the time than in the correlated data.

  37. How to implement in WinBUGS: Non-Correlated Data
      • Our Example
        • The frequentist stepwise method picks the appropriate model.
        • Parameter estimates are very similar.

  38. Recommendations
      • Orthogonalizing the X matrix makes interpretation of the coefficients impossible.
      • If the correlations between the response and explanatory variables are stronger than the correlations among the explanatory variables, you may not need to orthogonalize the X matrix.
      • If correlations within the X matrix are high, we recommend using a variable clustering procedure and then picking one explanatory variable from each cluster; in SAS, use PROC VARCLUS. (A rough R analogue is sketched below.)
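      PROC VARCLUS itself is SAS-specific; as a rough R analogue (a sketch only, assuming x is a numeric matrix with column names), hierarchical clustering on a correlation distance groups highly correlated variables:

          d  <- as.dist(1 - abs(cor(x)))   # distance is small when |r| is large
          cl <- cutree(hclust(d), h = 0.5) # with complete linkage, members of a
                                           # cluster have pairwise |r| > 0.5
          split(colnames(x), cl)           # then pick one representative per cluster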

  39. Recommendations
      • Remember, the number of candidate models is 2 raised to the number of variables you have; in our example, 2^10 = 1024.
      • The sampler may therefore have to run many iterations to visit the candidate models adequately.

  40. Conclusions
      • Using the adapted WinBUGS code, it is now easy to implement GVS in the standard regression setting.
      • We are also working on adapting other code from the Ntzoufras paper (such as ridge regression) to make it easy to implement.
      • In the case of standard linear regression with non-informative priors, the GVS method appears to give almost identical results to the frequentist methods implemented in SAS.
      • SAS took seconds to fit these models, while WinBUGS took hours.
      • The additional time to use GVS may only be warranted when one wants to use informative priors.

  41. Appendix 1: Simulated Data Creation Code
      *Correlated Data;
      data hw.simulate (drop=i j);
        array x{10};
        do i = 1 to 500;
          * Make 10 Normal(j,1) variables;
          do j = 1 to 10;
            x{j} = rannor(12458) + j;
          end;
          * Turn one into a binary variable;
          if x5 le 5 then x5 = 1; else x5 = 0;
          * Create correlation between two predictors;
          x2 = .5*x1 + rannor(5);
          if x5 = 1 then x9 = rannor(5) + .5; else x9 = rannor(5);
          * Create y from x1, x5, x6, and x10;
          y = 2*x1 + .7*x6 + .22*x10 - .7*x5 + rannor(1);
          output;
        end;
      run;

      *Non-correlated Data;
      data hw.simulate_nocorr (drop=i j);
        array x{10};
        do i = 1 to 500;
          * Make 10 Normal(j,1) variables;
          do j = 1 to 10;
            x{j} = rannor(12458) + j;
          end;
          * Turn one into a binary variable;
          if x5 le 5 then x5 = 1; else x5 = 0;
          * Create y from x1, x5, x6, and x10;
          y = 2*x1 + .7*x6 + .22*x10 - .7*x5 + rannor(1);
          output;
        end;
      run;

  42. Appendix 2 Simulated Data Correlation Matrix

  43. Appendix 3: Code to Fit the Full Model in WinBUGS
      # This code can be run with modification only to the data and initial values
      model {
        # Standardize x's and coefficients
        for (j in 1:p) {
          b[j] <- beta[j]/sd(x[,j])
          for (i in 1:N) {
            z[i,j] <- (x[i,j] - mean(x[,j]))/sd(x[,j])
          }
          temp_mean[j] <- b[j]*mean(x[,j])
        }
        b0 <- beta0 - sum(temp_mean[])
        # likelihood
        for (i in 1:N) {
          Y[i] ~ dnorm(mu[i], tau)
          for (j in 1:p) {
            temp[i,j] <- beta[j]*z[i,j]
          }
          mu[i] <- beta0 + sum(temp[i,])
        }
        # priors
        beta0 ~ dnorm(0, 0.00001)
        for (j in 1:p) {
          beta[j] ~ dnorm(0, 1.0E-6)
        }
        tau ~ dgamma(1.0E-3, 1.0E-3)
        sigma <- sqrt(1/tau)
      }

      DATA
      # p = the number of explanatory variables
      # N = the number of observations
      # Put your response variable and explanatory variables here
      list(p = 10, N = 500,
           Y = c(6.339, …, 8.4532),
           x = structure(.Data = c(0.3534, …, 10.6885), .Dim = c(500,10)))

      Initial Values
      # Enter initial values for the betas.
      list(beta0 = 0, beta = c(0,0,0,0,0,0,0,0,0,0), tau = .1)

  44. Appendix 4: GVS Variable Selection Code
      # Code from "Gibbs Variable Selection Using BUGS":
      # the code of Ioannis Ntzoufras for variable selection with 3
      # predictor variables was modified to allow for variable
      # selection for any number of variables.
      # The user need only modify the inits and data.
      model {
        # Standardize x's and coefficients
        for (j in 1:p) {
          b[j] <- beta[j]/sd(x[,j])
          for (i in 1:N) {
            z[i,j] <- (x[i,j] - mean(x[,j]))/sd(x[,j])
          }
          temp_mean[j] <- b[j]*mean(x[,j])
        }
        b0 <- beta0 - sum(temp_mean[])
        # likelihood
        for (i in 1:N) {
          Y[i] ~ dnorm(mu[i], tau)
          for (j in 1:p) {
            temp[i,j] <- g[j]*beta[j]*z[i,j]
          }
          mu[i] <- beta0 + sum(temp[i,])
          # residuals
          stres[i] <- (Y[i] - mu[i])/sigma
          # flag as an outlier if the standardized residual exceeds 2.5 in absolute value
          outlier[i] <- step(stres[i] - 2.5) + step(-(stres[i] + 2.5))
        }
        # Create indicators for the possible variables in the model
        for (j in 1:p) {
          TempIndicator[j] <- g[j]*pow(2, j-1)
        }
        # Create a model number for each possible model
        mdl <- 1 + sum(TempIndicator[])
        # Calculate the percentage of time each model is selected
        for (j in 1:models) {
          pmdl[j] <- equals(mdl, j)
        }
        # Priors
        # diffuse normal prior on the intercept
        beta0 ~ dnorm(0, 0.00001)
        # if the parameter is not in the model an informative prior is used;
        # this prior is calculated using a prior run of the full model
        for (j in 1:p) {
          bprior[j] <- (1 - g[j])*mean[j]
          tprior[j] <- g[j]*0.001 + (1 - g[j])/(se[j]*se[j])
          beta[j] ~ dnorm(bprior[j], tprior[j])
          g[j] ~ dbern(0.5)
        }
        tau ~ dgamma(1.0E-3, 1.0E-3)
        sigma <- sqrt(1/tau)
      }

  45. Appendix 4: GVS Variable Selection Code (continued)
      # Fit the full model and put the information for
      # the mean and se of each beta here for use in
      # the pseudo-priors
      DATA
      mean[]    se[]
       2.013    0.112
       0.149    0.112
      -0.259    0.094
      -0.161    0.096
      -0.293    0.134
       0.564    0.097
       0.094    0.098
      -0.023    0.097
      -0.018    0.133
       0.214    0.088
      END

      # Set the initial values for the betas, the overall
      # precision, and the variable selection indicators
      Initial Values
      list(beta0 = 0, beta = c(0,0,0,0,0,0,0,0,0,0), tau = .1,
           g = c(1,1,1,1,1,1,1,1,1,1))

      # p = number of parameters
      # N = the number of observations
      # models = the number of possible models (2^p)
      list(p = 10, N = 500, models = 1024,
           Y = c(6.339, 4.4347, …),
           x = structure(.Data = c(0.3534, …, 10.6885), .Dim = c(500,10)))

  46. Appendix 5: R Code to Name Models
      ind <- function(p) {
        if (p == 0) {
          return(t <- 0)
        } else if (p == 1) {
          return(t <- rbind(0, 1))
        } else if (p == 2) {
          return(t <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1)))
        } else {
          t <- rbind(cbind(ind(p - 1), rep(0, 2^(p - 1))),
                     cbind(ind(p - 1), rep(1, 2^(p - 1))))
          return(t)
        }
      }

      print.ind <- function(p) {
        t <- ind(p)
        print("intercept")
        for (i in 2:nrow(t)) {
          e <- NULL
          L <- TRUE
          for (j in 1:ncol(t)) {
            if (t[i, j] == 1 & L == TRUE) {
              e <- paste(e, "x", j, sep = "")
              L <- FALSE
            } else if (t[i, j] == 1 & L == FALSE) {
              e <- paste(e, "+ x", j, sep = "")
            }
          }
          print(e)
        }
      }

  47. References
      • Carlin, B.P. and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo Methods", Journal of the Royal Statistical Society, Series B, 57, 473-484.
      • Dellaportas, P., Forster, J., and Ntzoufras, I. (2000), "Bayesian Variable Selection Using the Gibbs Sampler", in Dey, D., Ghosh, S., and Mallick, B. (Eds.), Generalized Linear Models: A Bayesian Perspective, New York, NY: Marcel Dekker Inc.
      • Dellaportas, P., Forster, J., and Ntzoufras, I. (2002), "On Bayesian Model and Variable Selection Using MCMC", Statistics and Computing, 12, 27-36.
      • George, E. and McCulloch, R. (1993), "Variable Selection Via Gibbs Sampling", Journal of the American Statistical Association, 88(423), 881-889.
      • Green, P. (1995), "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination", Biometrika, 82(4), 711-732.
      • Kuo, L. and Mallick, B. (1998), "Variable Selection for Regression Models", Sankhya, Series B, 60(Part 1), 65-81.
      • Ntzoufras, I. (2003), "Gibbs Variable Selection Using BUGS", Journal of Statistical Software, 7(7).
