
Sparse Principal Component Analysis


Presentation Transcript


  1. Sparse Principal Component Analysis Zou, Hastie, Tibshirani presented by Banu Dost

  2. Outline • Microarray expression data • PCA on expression data • SPCA • Definition • Motivation • Method • Linear regression • Lasso/elastic net • 2 Examples

  3. DNA Microarrays • Gene → (expression) → Protein • Monitors gene expression levels on a genomic scale (≈ abundance of each protein in the cell). From “Genome-Wide Expression in Escherichia coli K-12”, Blattner et al., 1999

  4. DNA Microarrays • One (micro-)array = one snapshot of the cell • Each spot represents a gene • The intensity of the colors indicates the expression level of the corresponding gene in a given experiment • Noisy: each expression level is only an approximation.

  5. Microarray Expression Data • Rows: n genes; columns: m arrays • n is on the order of thousands, while m is on the order of hundreds, so n >> m.

  6. PCA on Expression Data • A linear transformation maps the n genes × m arrays expression matrix to eigen-genes and eigen-arrays.

  7. PCA on Expression Data • SVD of the expression matrix: E = U × D × VT, where the columns of U are the eigen-arrays, the rows of VT are the eigen-genes, and the diagonal of D holds the m eigen-expression levels. • Each eigen-gene is expressed only in the corresponding eigen-array, with the corresponding eigen-expression level. (Alter et al., 2000)
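The factorization on this slide can be sketched with NumPy's SVD; the expression matrix below is a random toy matrix, not real microarray data, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, m_arrays = 50, 8                 # toy sizes; real data has n >> m
E = rng.normal(size=(n_genes, m_arrays))  # stand-in for an expression matrix

# SVD factors the expression matrix as E = U D VT:
# columns of U are the eigen-arrays, rows of VT are the eigen-genes,
# and d holds the m eigen-expression levels.
U, d, VT = np.linalg.svd(E, full_matrices=False)

E_reconstructed = U @ np.diag(d) @ VT     # exact reconstruction of E
```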

  8. Drawbacks of PCA • In E = U × D × VT, the PCs (eigen-genes) are linear combinations of all n variables (genes). • Each PC corresponds to a loading vector (a column of V); the loadings are the coefficients of the variables in the linear combination. • With all n loadings non-zero, the PCs are difficult to interpret.

  9. SPCA: Motivation • Idea: Reduce the number of explicitly used variables (genes). • Approach: Modify PCA so that PCs have sparse loadings = sparse PCA (SPCA)

  10. SPCA • Formulates PCA as a regression-type optimization problem. • Uses the lasso, a variable selection technique that produces sparse models. • Result: a modified PCA whose PCs have sparse loadings.

  11. Linear Regression Problem • Input variables • Response variable • Regression coefficients • Multivariate linear model

  12. Linear Regression Problem • Model: Y = Xβ + ε, with response Y, training data X (N observations, p predictors), coefficients β, and error ε. • Goal: estimate the coefficients β.

  13. Least Squares Solution • β̂ = argmin_β ||Y − Xβ||²
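A minimal NumPy sketch of the least-squares fit on synthetic data (the sizes and true coefficients are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 5                                  # N observations, p predictors
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
Y = X @ beta_true + 0.1 * rng.normal(size=N)   # small Gaussian noise

# Least squares: beta_hat = argmin_beta ||Y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```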

  14. Lasso Solution • Pros: • The lasso continuously shrinks the coefficients toward zero. • Produces a sparse model. • Acts as a variable selection method.
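The criterion itself does not survive in the transcript; the standard lasso formulation (Tibshirani, 1996) is:

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta} \; \| Y - X\beta \|^{2}
  + \lambda_{1} \sum_{j=1}^{p} |\beta_j|
```

Equivalently, minimize ||Y − Xβ||² subject to Σ_j |β_j| ≤ t.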

  15. Lasso Solution • Cons: • The number of selected variables is limited by n, the number of observations; e.g. in microarray expression data, n (arrays) << p (genes). • From a group of highly correlated variables, the lasso selects only one, without regard to which one ends up in the final model.

  16. Elastic Net Solution

  17. Elastic Net Solution • Pros: • The ridge penalty removes the lasso's limitation: more than n variables can enter the model. • Grouping effect: once one variable from a group of highly correlated variables is selected, the whole group is selected (the lasso picks only one of them, arbitrarily). • A good gene selection method for microarray data analysis.
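The slide's criterion is likewise missing from the transcript; the (naive) elastic net of Zou & Hastie (2005) combines the ridge and lasso penalties:

```latex
\hat{\beta}^{\text{en}}
  = \arg\min_{\beta} \; \| Y - X\beta \|^{2}
  + \lambda_{2} \|\beta\|^{2}
  + \lambda_{1} \sum_{j=1}^{p} |\beta_j|
```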

  18. SPCA • Goal: • Construct a regression framework in which PCA can be reconstructed exactly. • Use lasso/elastic-net to construct modified PCs with sparse loadings.

  19. Reconstruction of PCA in a Regression Framework • Idea: each PC is a linear combination of the p variables, so its loadings can be recovered by regressing the PC on the p variables. • Response: the ith PC; training data: the p variables; coefficients: the ith loading vector (a column of V in E = U × D × VT).

  20. Theorem 1 • For any λ > 0, the ith loading vector Vi can be recovered by ridge regression of the ith PC, Yi = (UD)i, on X: if β̂ = argmin_β ||Yi − Xβ||² + λ||β||², then Vi = β̂ / ||β̂||.

  21. Reconstruction of PCA in a Regression Framework • By Theorem 1, the loadings of the PCs can be reconstructed exactly by solving a linear regression problem. • This is not an alternative to PCA, since it uses PCA's results. • The ridge penalty is not there to penalize the coefficients but to ensure reconstruction of the PCs. • Now add a lasso penalty to the problem to penalize the absolute values of the coefficients.
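Theorem 1 can be checked numerically: for any λ > 0, the normalized ridge coefficients from regressing a PC on the data equal the corresponding loading vector. A pure-NumPy sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 6
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                  # center the data, as PCA assumes

U, d, VT = np.linalg.svd(X, full_matrices=False)
Y1 = X @ VT[0]                       # first principal component (scores)

lam = 10.0                           # any positive ridge penalty works
# Ridge regression of the PC on X: beta = (X'X + lam I)^-1 X' Y1
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y1)
v_hat = beta / np.linalg.norm(beta)  # normalization removes the ridge shrinkage
```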

  22. Construction of SPCA in a Regression Framework
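Only the title of this slide survives; the per-component construction in the SPCA paper adds an ℓ1 (lasso) penalty to the ridge regression of Theorem 1, and the normalized solution gives an approximate sparse loading vector:

```latex
\hat{\beta}
  = \arg\min_{\beta} \; \| Y_i - X\beta \|^{2}
  + \lambda \|\beta\|^{2}
  + \lambda_{1} \|\beta\|_{1},
\qquad
\hat{V}_i = \frac{\hat{\beta}}{\|\hat{\beta}\|}
```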

  23. Summary • Goal: Construct PCs with sparse loadings. • Method: • Step 1: Perform PCA. • Step 2: For each PC, solve the elastic-net regression of the PC on the variables to obtain sparse approximations of the PCs and their loadings.

  24. Transformation of PCA to a Regression Problem • Question: “Can we derive PCs from a ‘self-contained’ regression type criterion?”

  25. Theorem 2
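The statement is missing from the transcript; Theorem 2 of the paper, reconstructed here, shows that the leading loading vector solves a self-contained regression-type criterion: for any λ > 0,

```latex
(\hat{\alpha}, \hat{\beta})
  = \arg\min_{\alpha,\, \beta} \;
    \sum_{i=1}^{n} \| x_i - \alpha \beta^{T} x_i \|^{2}
    + \lambda \|\beta\|^{2}
  \quad \text{s.t.} \quad \|\alpha\|^{2} = 1,
```

and then β̂ is proportional to the first loading vector V1.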

  26. Theorem 3(generalization of Theorem 2)
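The generalization, reconstructed from the paper, covers the first k components: with A and B being p × k matrices whose columns are αj and βj,

```latex
(\hat{A}, \hat{B})
  = \arg\min_{A,\, B} \;
    \sum_{i=1}^{n} \| x_i - A B^{T} x_i \|^{2}
    + \lambda \sum_{j=1}^{k} \|\beta_j\|^{2}
  \quad \text{s.t.} \quad A^{T} A = I_{k},
```

and then β̂j ∝ Vj for j = 1, …, k.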

  27. SPCA criterion • Add lasso penalty for sparseness.
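The resulting SPCA criterion, reconstructed from the paper, adds a separate lasso penalty λ1,j for each component to the criterion of Theorem 3:

```latex
(\hat{A}, \hat{B})
  = \arg\min_{A,\, B} \;
    \sum_{i=1}^{n} \| x_i - A B^{T} x_i \|^{2}
    + \lambda \sum_{j=1}^{k} \|\beta_j\|^{2}
    + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_{1}
  \quad \text{s.t.} \quad A^{T} A = I_{k}
```

The sparse loadings are then V̂j = β̂j / ||β̂j||.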

  28. Simulation Example: PCA vs. SPCA • Data points X = (X1, X2, …, X10): 10 variables. • Model used to generate the data, with 3 hidden factors: • V1 ~ N(0, 290) • V2 ~ N(0, 300) • V3 = −0.3·V1 + 0.925·V2 + ε, ε ~ N(0, 1) • Factor variances: 290, 300, and ≈283.8. • Xi = V1 + εi1, εi1 ~ N(0, 1), i = 1,2,3,4 (4 variables associated with V1) • Xi = V2 + εi2, εi2 ~ N(0, 1), i = 5,6,7,8 (4 variables associated with V2) • Xi = V3 + εi3, εi3 ~ N(0, 1), i = 9,10 (2 variables associated with V3)
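The factor model above has a closed-form covariance matrix, so the next slide's "PCA on the exact covariance matrix" can be sketched in NumPy (eigendecomposition only; the SPCA step itself is not implemented here):

```python
import numpy as np

# Exact covariance of the hidden factors:
# V1 ~ N(0, 290), V2 ~ N(0, 300), V3 = -0.3*V1 + 0.925*V2 + e, e ~ N(0, 1)
var_v3 = 0.3**2 * 290 + 0.925**2 * 300 + 1      # = 283.7875
factor_cov = np.array([
    [290.0,              0.0,   -0.3 * 290],
    [0.0,              300.0,  0.925 * 300],
    [-0.3 * 290, 0.925 * 300,       var_v3],
])

# Loading pattern: X1..X4 <- V1, X5..X8 <- V2, X9..X10 <- V3,
# each variable carrying its own N(0, 1) noise term.
L = np.zeros((10, 3))
L[:4, 0] = 1.0
L[4:8, 1] = 1.0
L[8:, 2] = 1.0

C = L @ factor_cov @ L.T + np.eye(10)           # exact 10x10 covariance of X

eigvals, eigvecs = np.linalg.eigh(C)            # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                            # leading PCA loading vector
```

Because V2 and V3 are highly correlated, the leading loading vector puts most of its weight on X5, …, X10, which is what makes a sparse recovery of the factor structure plausible.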

  29. Simulation Example: PCA vs. SPCA • How many observations? PCA and SPCA are performed on the exact covariance matrix of (X1, …, X10), i.e. as if there were infinitely many data points. • We expect to derive 2 PCs with the right sparse loadings: • One from (X5, X6, X7, X8), recovering V2 • One from (X1, X2, X3, X4), recovering V1

  30. Table of Loadings • (Table comparing the PCA and SPCA loadings, grouped by the variables associated with V1 and V2; not reproduced in this transcript.)

  31. SPCA on Microarray Data • Ramaswamy dataset: p = 16063 genes, n = 144 samples (p >> n). • SPCA applied to find the leading PC. • λ = ∞; a sequence of λ1 values is used, so the number of non-zero loadings varies.

  32. Sparse Leading PC of Microarray Data • (Plot: percentage of explained variance vs. number of non-zero loadings; not reproduced in this transcript.)

  33. Conclusion • Two methods proposed to derive PCs with sparse loadings: • Direct sparse approximations of the PCs' loadings. • Formulating PCA as a self-contained regression problem and deriving PCs under ridge and lasso constraints.
