Explore penalization methods for prediction that leverage side information such as sparsity, group structure, and hierarchy, including the L1 penalty and the Composite Absolute Penalty (CAP), and see how these techniques improve prediction accuracy for modern large data sets.
Prediction using Side Information
Bin Yu, Department of Statistics, UC Berkeley
Joint work with Peng Zhao, Guilherme Rocha, and Vince Vu
Outline
• Motivation
• Background
• Penalization methods (building side information into the penalty)
  • L1 penalty (sparsity as the side information)
  • Group and hierarchy as side information: Composite Absolute Penalties (CAP)
    • Building blocks: Lγ-norm regularization
    • Definition
    • Interpretation
    • Algorithms
    • Examples and results
• Unlabeled data as side information: semi-supervised learning
  • Motivating example: the image-fMRI problem in neuroscience
  • Penalty based on the population covariance matrix
  • Theoretical result to compare with OLS
  • Experimental results on image-fMRI data
Characteristics of Modern Data Set Problems
• Goal: efficient use of data for
  • Prediction
  • Interpretation
• Larger number of variables:
  • The number of variables (p) in data sets is large
  • Sample sizes (n) have not increased at the same pace
• Scientific opportunities:
  • New findings in different scientific fields
Regression and classification
Data example: the image-fMRI problem
• Predictor: 11,000 features of an image
• Response: (preprocessed) fMRI signal at a voxel
• n = 1750 samples
Minimization of an empirical loss (e.g. L2) alone leads to
• an ill-posed computational problem, and
• bad prediction
Regularization improves prediction
• Penalization -- linked to computation
  • L2 (numerical stability: ridge; SVM)
  • Model selection (sparsity, combinatorial search)
  • L1 (sparsity, convex optimization)
• Early stopping: the tuning parameter is computational
  • Neural nets
  • Boosting
• Hierarchical modeling (computational considerations)
Lasso: L1-norm as a penalty
• The L1 penalty on the coefficients is T(β) = ||β||_1 = Σ_j |β_j|
• Used initially with L2 loss: β̂(λ) = argmin_β ||Y − Xβ||² + λ ||β||_1
  • Signal processing: Basis Pursuit (Chen & Donoho, 1994)
  • Statistics: Non-Negative Garrote (Breiman, 1995)
  • Statistics: LASSO (Tibshirani, 1996)
• Properties of the Lasso
  • Sparsity (variable selection)
  • Convexity (convex relaxation of the L0 penalty)
Lasso: L1-norm as a penalty (computation)
• The "right" tuning parameter λ is unknown, so the whole path in λ is needed (discretized or continuous)
• Initially: a quadratic program (QP) solved for each λ on a grid
• Later: path-following algorithms
  • Homotopy by Osborne et al. (2000)
  • LARS by Efron et al. (2004)
• Theoretical studies: much recent work on the Lasso
A small path-tracing sketch follows below.
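A minimal sketch of tracing the Lasso regularization path, assuming scikit-learn's coordinate-descent `lasso_path` rather than the homotopy/LARS implementations cited above; the simulated data are only for illustration.

```python
# Trace the Lasso path over a decreasing grid of tuning parameters (lambdas).
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 200                      # p > n, as in the settings discussed above
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, 1, -1]  # sparse truth
y = X @ beta_true + 0.5 * rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
print(coefs.shape)                   # (p, 50): one coefficient column per lambda
print((np.abs(coefs[:, -1]) > 1e-8).sum(), "variables selected at the smallest lambda")
```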
General Penalization Methods
• Given data (X_i, Y_i), i = 1, …, n:
  • X_i: a p-dimensional predictor
  • Y_i: response variable
• The parameters are defined by the penalized problem
  β̂(λ) = argmin_β L(Y, Xβ) + λ T(β)
  where
  • L is the empirical loss function
  • T is a penalty function
  • λ is a tuning parameter
A generic sketch of this template follows below.
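To make the template concrete, here is a minimal sketch of the generic penalized estimator with squared-error loss and a user-supplied penalty T. The helper name `penalized_fit` and the ridge example are illustrative assumptions, not part of the talk; a non-smooth penalty such as L1 would need a different solver (e.g. the path algorithms above).

```python
# beta_hat(lambda) = argmin_beta  L(Y, X beta) + lambda * T(beta)
import numpy as np
from scipy.optimize import minimize

def penalized_fit(X, y, penalty, lam, beta0=None):
    n, p = X.shape
    beta0 = np.zeros(p) if beta0 is None else beta0
    def objective(beta):
        resid = y - X @ beta
        return resid @ resid / (2 * n) + lam * penalty(beta)   # empirical loss + penalty
    return minimize(objective, beta0, method="BFGS").x

# Example with a smooth (ridge, L2^2) penalty so that BFGS applies.
ridge_penalty = lambda b: b @ b
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)
beta_hat = penalized_fit(X, y, ridge_penalty, lam=0.1)
```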
Beyond Sparsity of Individual Predictors: Natural Structures Among Predictors
Rationale: side information might be available, and/or additional regularization beyond the Lasso is needed for p >> n
• Groups:
  • Genes belonging to the same pathway
  • Categorical variables represented by "dummies"
  • Polynomial terms from the same variable
  • Noisy measurements of the same variable
• Hierarchy:
  • Multi-resolution/wavelet models
  • Interaction terms in factorial analysis (ANOVA)
  • Order selection in Markov chain models
Composite Absolute Penalties (CAP): Overview
• The CAP family of penalties is highly customizable:
  • ability to perform grouped selection
  • ability to perform hierarchical selection
• Computational considerations:
  • Feasibility: convexity
  • Efficiency: piecewise linearity in some cases
• Define groups according to structure
• Combine properties of Lγ-norm penalties
• Encompass and go beyond existing work:
  • Elastic Net (Zou & Hastie, 2005)
  • Group Lasso (Yuan & Lin, 2006)
  • Blockwise Sparse Regression (Kim, Kim & Kim, 2006)
Composite Absolute Penalties: Review of Lγ Regularization
Given data and a loss function L:
• Lγ regularization:
  • Penalty: T(β) = ||β||_γ^γ = Σ_j |β_j|^γ
  • Estimate: β̂(λ) = argmin_β L(Y, Xβ) + λ T(β), where λ > 0 is a tuning parameter
• For the squared-error loss function:
  • Hoerl & Kennard (1970): ridge (γ = 2)
  • Frank & Friedman (1993): bridge (general γ)
  • LASSO (1996): γ = 1
  • SCAD (Fan and Li, 1999); γ < 1 (non-convex)
Composite Absolute Penalties: Definition
• The CAP parameter estimate is given by β̂(λ) = argmin_β L(Y, Xβ) + λ T(β), with
  • G_k, k = 1, …, K: indices of the k-th pre-defined group
  • β_{G_k}: the corresponding vector of coefficients
  • group L_{γ_k} norm: N_k = ||β_{G_k}||_{γ_k}
  • overall norm: T(β) = ||(N_1, …, N_K)||_{γ_0}^{γ_0}
• Groups may overlap (hierarchical selection)
A small sketch evaluating T(β) follows below.
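A small sketch of evaluating the CAP penalty for given (possibly overlapping) groups; the groups and γ values in the example are made up for illustration.

```python
# N_k = ||beta_{G_k}||_{gamma_k},   T(beta) = ||(N_1, ..., N_K)||_{gamma_0}^{gamma_0}
import numpy as np

def cap_penalty(beta, groups, gammas, gamma0):
    """groups: list of index arrays; gammas: per-group norm orders gamma_k."""
    beta = np.asarray(beta, dtype=float)
    N = np.array([np.linalg.norm(beta[g], ord=gk) for g, gk in zip(groups, gammas)])
    return np.sum(np.abs(N) ** gamma0)

beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
groups = [np.array([0, 1]), np.array([2, 3, 4])]
print(cap_penalty(beta, groups, gammas=[2, 2], gamma0=1))   # grouped-Lasso-type penalty
```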
Composite Absolute Penalties: A Bayesian Interpretation
• For non-overlapping groups, the CAP estimate is a MAP estimate under a prior proportional to exp(−λ T(β)):
  • the prior on the group norms N_k is governed by γ_0
  • the prior on the individual coefficients within a group is governed by γ_k
Composite Absolute Penalties: Group Selection
[Contour plot for γ_0 = 1, γ_1 = 2, γ_2 = 2]
Tailoring T(β) for group selection:
• Define non-overlapping groups
• Set γ_k > 1 for all k ≠ 0:
  • The group norm γ_k tunes similarity within its group
  • γ_k > 1 causes all variables in group k to be included/excluded together
• Set γ_0 = 1: this yields grouped sparsity
• γ_k = 2 has been studied by Yuan and Lin (Group Lasso, 2006)
A proximal-gradient sketch of the γ_0 = 1, γ_k = 2 case follows below.
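As an illustration of the γ_0 = 1, γ_k = 2 member of the CAP family (the Group Lasso), here is a generic proximal-gradient (block soft-thresholding) sketch on simulated data; this is not the exact iCAP/path algorithm from the talk.

```python
# Group Lasso: (1/2n)||y - X beta||^2 + lambda * sum_k ||beta_{G_k}||_2
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.eigvalsh(X.T @ X / n).max()      # Lipschitz constant of the gradient
    t = 1.0 / L                                    # step size
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        b = beta - t * grad
        for g in groups:                           # block soft-thresholding (prox step)
            norm_g = np.linalg.norm(b[g])
            b[g] = 0.0 if norm_g <= t * lam else (1 - t * lam / norm_g) * b[g]
        beta = b
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 12))
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]   # non-overlapping groups
y = X[:, :4] @ np.array([2.0, -1.0, 1.5, 0.5]) + 0.3 * rng.standard_normal(100)
beta_hat = group_lasso(X, y, groups, lam=0.3)
print([np.linalg.norm(beta_hat[g]) > 1e-8 for g in groups])     # expected: [True, False, False]
```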
Composite Absolute Penalties: Hierarchical Structures
Tailoring T(β) for hierarchical structure:
• Set γ_0 = 1
• Set γ_i > 1 for all i ≠ 0
• Let groups overlap:
  • If β_2 appears in all groups in which β_1 is included,
  • then X_2 enters the model only after X_1
• As an example:
Composite Absolute Penalties: Hierarchical Structures (Example)
[Directed graph: X_1 → X_2]
• Represent the hierarchy by a directed graph
• Then construct the penalty from each node's group (the node and its descendants)
• For the graph above, with γ_0 = 1 and group norm γ:
  T(β) = ||(β_1, β_2)||_γ + ||β_2||_γ
A sketch constructing such overlapping groups from a graph follows below.
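A tiny sketch of one natural way to build the overlapping groups from a directed graph, consistent with the two-node example above: each node's group contains the node and all of its descendants, so (with γ_0 = 1 and γ_k > 1) a descendant can enter the model only after its ancestors. The helper name `descendant_groups` is illustrative, not from the CAP paper.

```python
def descendant_groups(children, nodes):
    """children: dict mapping a node to its list of direct children in the graph."""
    def descend(v):
        out = {v}
        for c in children.get(v, []):
            out |= descend(c)
        return out
    return {v: sorted(descend(v)) for v in nodes}

# Graph X1 -> X2 (nodes 0 and 1): group of X1 is {X1, X2}, group of X2 is {X2}.
print(descendant_groups({0: [1]}, nodes=[0, 1]))   # {0: [0, 1], 1: [1]}
```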
Composite Absolute Penalties: Computation
• CAP with general Lγ norms:
  • Approximate algorithms available for tracing the regularization path
  • Two examples: Rosset (2004); Boosted Lasso (BLasso; Zhao and Yu, 2004)
• CAP with L1-L∞ norms:
  • Exact algorithms for tracing the regularization path
  • Some applications: grouped selection (iCAP); hierarchical selection (hiCAP) for ANOVA and wavelets
iCAP: Degrees of Freedom (DFs) for Tuning Parameter Selection
Two ways of selecting the tuning parameter λ in iCAP:
1. Cross-validation
2. A model selection criterion, AICc, where the DF used is a generalization to iCAP of the Lasso df of Zou et al. (2004)
A hedged AICc-selection sketch follows below.
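A hedged sketch of AICc-based selection along the path, given per-λ residual sums of squares and estimated DFs. The AICc form below is the standard Hurvich-Tsai finite-sample correction; the iCAP-specific DF estimate from the talk is assumed to be supplied by the caller (`df_path`) rather than computed here.

```python
import numpy as np

def aicc(rss, n, df):
    # Hurvich-Tsai corrected AIC for Gaussian regression; requires n > df + 2.
    return n * np.log(rss / n) + n * (n + df) / (n - df - 2)

def select_lambda(lambdas, rss_path, df_path, n):
    """Pick the lambda on the path with the smallest AICc score."""
    scores = [aicc(rss, n, df) for rss, df in zip(rss_path, df_path)]
    return lambdas[int(np.argmin(scores))]
```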
Simulation Studies (p > n, partially adaptive grouping): Summary of Results
• Good prediction accuracy:
  • Extra structure results in a non-trivial reduction of model error
• Sparsity/parsimony:
  • Less sparse models in the L0 sense
  • Sparser in terms of degrees of freedom
  • Estimated degrees of freedom (group and iCAP only)
• Good choices of the regularization parameter:
  • AICc: model errors close to CV
ANOVA Hierarchical Selection: Simulation Setup
• 55 variables (10 main effects, 45 interactions)
• 121 observations
• 200 replications in the results that follow
Summary on CAP: Group and Hierarchical Sparsity
• CAP penalties:
  • Are built from Lγ "blocks"
  • Allow different structures to be incorporated into the fitted model:
    • Groups of variables
    • Hierarchy among predictors
• Algorithms:
  • Approximation using BLasso for general CAP penalties
  • Exact and efficient for particular cases (L2 loss, L1 and L∞ norms)
• Choice of the regularization parameter λ:
  • Cross-validation
  • AICc for particular cases (L2 loss, L1 and L∞ norms)
Regularization Using Unlabeled Data: Semi-supervised Learning
• Motivating example: the image-fMRI problem in neuroscience (Gallant Lab at UC Berkeley)
• Goal: to understand how natural images relate to fMRI signals
Stimuli: natural image stimuli
Stimulus to Response
• Natural image stimuli drawn randomly from a database of 11,499 images
• The experiment is designed so that responses from different presentations are nearly independent
• The response is pre-processed and roughly Gaussian
Linear Model
A separate linear model is fit for each voxel: Y = Xβ + ε
Model fitting:
• X: p = 10,921 dimensions (features)
• n = 1750 training samples
The fitted model is tested on 120 validation samples; performance is measured by correlation between predicted and observed responses.
Ordinary Least Squares (OLS)
OLS minimizes the empirical squared-error risk. Note that the OLS estimate is a function of estimates of the covariance of X (Σ_xx) and the covariance of X with Y (Σ_xy):
β̂_OLS = Σ̂_xx⁻¹ Σ̂_xy, with Σ̂_xx = XᵀX/n and Σ̂_xy = XᵀY/n.
OLS
The sample covariance matrix of X is often nearly singular, so the inversion is ill-posed. Some existing solutions:
• Ridge regression
• Pseudo-inverse (or truncated SVD)
• Lasso (closely related to L2 boosting, the current method at the Gallant Lab)
Semi-supervised Learning
Abundant unlabeled data are available: samples from the marginal distribution of X.
• Book on semi-supervised learning (2006), eds. Chapelle, Schölkopf, and Zien
• Statistical Science article (2007) by Liang, Mukherjee and West
Image-fMRI: the images in the database are the unlabeled data.
Semi-supervised linear regression uses
• labeled data (X_i, Y_i), i = 1, …, n, and
• unlabeled data X_i, i = n+1, …, n+m
to fit the model.
Semi-supervised Learning
Does the marginal distribution of X play a role?
• For a fixed design X, the marginal distribution of X plays no role.
• Brown (1990) shows that the OLS estimate of the intercept is inadmissible if X is assumed random.
Refining OLS
The unknown parameter satisfies β = Σ_xx⁻¹ Σ_xy, so OLS can be seen as a plug-in estimate for this equation. Can we plug in an improved estimate of Σ_xx?
A First Approach
Suppose the population covariance of X is known (an infinite amount of unlabeled data). Use a linear combination of the sample and population covariances. Ledoit and Wolf (2004) considered convex combinations of the sample covariance and another matrix from a parametric model.
A plug-in sketch with a shrinkage covariance estimate follows below.
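A sketch of the plug-in idea: estimate Σ_xx by something better conditioned than the raw sample covariance and substitute it into β̂ = Σ̂_xx⁻¹ Σ̂_xy. Here a Ledoit-Wolf shrinkage estimate stands in for the population covariance that unlabeled data would provide; the data and the helper name `plugin_regression` are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def plugin_regression(X, y, cov_estimate):
    n = X.shape[0]
    sigma_xy = X.T @ y / n
    return np.linalg.solve(cov_estimate, sigma_xy)    # beta_hat = Sigma_hat^{-1} Sigma_xy

rng = np.random.default_rng(3)
X = rng.standard_normal((80, 120))                    # p > n: the sample covariance is singular
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(80)
# Well-conditioned shrinkage estimate of the covariance (data assumed centered).
lw_cov = LedoitWolf(assume_centered=True).fit(X).covariance_
beta_hat = plugin_regression(X, y, lw_cov)
```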
Semi-supervised OLS
Plugging the improved estimate of Σ_xx into the OLS formula gives "semi-OLS": the sample covariance Σ̂_xx is replaced by a combination of Σ̂_xx and the population covariance Σ_xx before inverting.
Semi-supervised OLS
• Equivalent to penalized least squares
• Equivalent to ridge regression in pre-whitened covariates
A numerical check of this equivalence appears in the sketch below.
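A numerical check of the stated equivalence, under the assumption that semi-OLS replaces Σ̂_xx by Σ̂_xx + α Σ_xx (the talk's exact weighting may differ by a reparametrization of α); the simulated "population" covariance is for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, alpha = 60, 30, 0.5
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + 0.1 * np.eye(p)                  # "population" covariance of X
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X[:, 0] + 0.2 * rng.standard_normal(n)

# (1) semi-OLS: plug the combined covariance into the OLS formula
Sxx, Sxy = X.T @ X / n, X.T @ y / n
beta_semi = np.linalg.solve(Sxx + alpha * Sigma, Sxy)

# (2) ridge regression on pre-whitened covariates W = X Sigma^{-1/2}
evals, evecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
W = X @ Sigma_inv_sqrt
beta_W = np.linalg.solve(W.T @ W / n + alpha * np.eye(p), W.T @ y / n)
beta_ridge_whitened = Sigma_inv_sqrt @ beta_W          # map back to the original coordinates

print(np.allclose(beta_semi, beta_ridge_whitened))     # True
```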
Spectrally Semi-supervised OLS
Let W = X Σ_xx^{-1/2} be the pre-whitened covariates, with spectral decomposition WᵀW/n = V Λ Vᵀ. Ridge regression in (W, Y) is just one transformation of Λ. More generally, one can consider arbitrary transformations h of the spectrum of W, with resulting estimator
β̂_h = Σ_xx^{-1/2} V h(Λ) Vᵀ WᵀY/n.
Spectrally Semi-supervised OLS: Examples
• OLS: h(s) = 1/s
• Semi-OLS = ridge on pre-whitened predictors: h(s) = 1/(s + α)
• Truncated SVD on pre-whitened predictors (PCA regression): h(s) = 1/s if s > c, otherwise 0
A sketch of this family follows below.
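A sketch of the spectrally semi-supervised family with the three h's listed above; `spectral_estimator` is an illustrative name, and the implementation follows the β̂_h form described above, which should be checked against the paper.

```python
import numpy as np

def spectral_estimator(X, y, Sigma, h):
    """Whiten with the population covariance, transform the spectrum by h, map back."""
    n, p = X.shape
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    W = X @ Sigma_inv_sqrt                       # pre-whitened covariates
    s, V = np.linalg.eigh(W.T @ W / n)           # spectrum of the whitened sample covariance
    beta_W = V @ (h(s) * (V.T @ (W.T @ y / n)))  # apply h to the eigenvalues
    return Sigma_inv_sqrt @ beta_W               # back to the original coordinates

alpha, c = 0.5, 0.1
h_ols   = lambda s: 1.0 / s                                        # OLS (needs n > p so all s > 0)
h_semi  = lambda s: 1.0 / (s + alpha)                              # ridge on pre-whitened predictors
h_trunc = lambda s: np.where(s > c, 1.0 / np.maximum(s, c), 0.0)   # truncated SVD / PCA regression
```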
Large n, p Asymptotic MSPE: Assumptions
• Σ_xx is non-degenerate
• Z = X Σ_xx^{-1/2} is n-by-p with i.i.d. entries satisfying mean 0, variance 1, and finite 4th moment
• h is a bounded function
• βᵀ Σ_xx β / σ² has a finite limit SNR² as p, n tend to ∞
• p/n has a finite, strictly positive limit r
Large n, p MSPE Theorem
The mean squared prediction error converges, as p, n → ∞ with p/n → r, to a limit expressed as an integral with respect to F_r, the Marchenko-Pastur law with index r.
Consequences
• An asymptotically optimal h can be derived
• Asymptotically better than OLS and truncated SVD
• Reminiscent of the shrinkage factor in the James-Stein estimate
• SNR might be easily estimated
Back to the Image-fMRI Problem
Fitting details:
• Regularization parameters selected by 5-fold cross-validation
• L2 boosting applied to all 10,000+ features (L2 boosting is the method of choice in the Gallant Lab)
• Other methods applied to 500 features pre-selected by correlation
Other Methods
• k = 1: semi-OLS (theoretically better than OLS)
• k = 0: ridge
• k = −1: semi-OLS (inverse)
[Figure: features used by L2 boosting]
[Figure: comparison of the feature locations, semi methods vs. L2 boosting]
Further Work
Image-fMRI problem based on a linear model:
• Compare methods for other voxels
• Use fewer features for the semi-methods? (average # features for L2 boosting = 120; # features for semi-methods = 500, by design)
• Interpretation of the results of the different methods
• Theoretical results for ridge and semi inverse OLS?
Image-fMRI problem, non-linear modeling:
• Understanding the image space (clusters? manifolds?)
• Different linear models on different clusters (manifolds)?
• Non-linear models on different clusters (manifolds)?
• …
CAP
• Code: www.stat.berkeley.edu/~yugroup
• Paper (to appear in the Annals of Statistics): www.stat.berkeley.edu/~binyu
Thanks: the Gallant Lab at UC Berkeley
Proof Ingredients
The MSPE can be shown to decompose into a bias term and a variance term, to which results in random matrix theory apply:
• The BIAS term is a quadratic form in the sample covariance matrix
• The VARIANCE term is an integral with respect to the empirical spectral distribution of the sample covariance matrix