
Smoothing Proximal Gradient Method for General Structured Sparse Learning



  1. Smoothing Proximal Gradient Method for General Structured Sparse Learning Eric Xing epxing@cs.cmu.edu School of Computer Science Carnegie Mellon University Joint work with: Xi Chen, Qihang Lin, Seyoung Kim, Jaime Carbonell

  2. Modern High-Dimensional Problems Genomics: 3 billion bases, millions of "mutations" in a linear ordering; 20K genes, related by a network. Computer Vision: 3.6 billion photos with million-dimensional features, labeled with ~10^4 classes, organized by a gigantic taxonomy

  3. Genome-Wide Association (GWA) GWA mapping: Single Nucleotide Polymorphism (SNP) genotyping became affordable due to high-throughput sequencing technologies. This problem is extremely difficult: number of samples << number of SNPs; complex genetic architecture, population structure, and other confounders. What would you do if you needed to perform GWA for multiple traits, e.g., ALL gene expressions, ALL clinical phenotypes, on MANY SNPs? And you know their STRUCTURES? It is desirable to incorporate the rich structures of multiple traits and SNPs via Sparse Structured I/O Models to improve GWA mapping!

  4. Human-level Image Classification The knowledge ontology • Large scale in 3 dimensions • Data: 12 million images • Features: ~1 million (number from the top-performing system in ILSVRC10, [Lin et al. 2011]) • Classes: 17k classes. Courtesy L. Fei-Fei

  5. Toward Large-Scale Problems • Large data size: stochastic/online methods; parallel computation, e.g., Map-Reduce • Large feature dimension: sparsity-inducing regularization; structured sparsity; sparse coding • Large concept space: multi-task and transfer learning; structured sparsity

  6. Outline Structured Sparse Learning Problems Smoothing Proximal-Gradient Descent Method Extension: Multi-task Structured Sparse Learning Experimental Results: Simulation Study Experimental Results: Real Genetic Data Analysis

  7. Sparse Learning • Linear model • Sparse linear regression (Lasso) [R. Tibshirani, 96]: regression loss plus individual feature-level sparsity; extensions exploit group structure or graph structure (reconstruction below)
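
The formulas on this slide did not survive extraction; a standard reconstruction of the linear model and the Lasso objective (with λ ≥ 0 the regularization parameter) is:

```latex
% Linear model
y = X\beta + \epsilon, \qquad X \in \mathbb{R}^{n \times p},\; \beta \in \mathbb{R}^{p}
% Lasso (Tibshirani, 1996): regression loss + individual feature-level sparsity
\hat{\beta} = \arg\min_{\beta}\; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1
```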

  8. Structured Prediction • Binary classification: black-and-white decisions • Multi-class classification: the world of technicolor • Can be reduced to several binary decisions, but... • Often better to handle multiple classes directly • How many classes? 2? 5? Exponentially many? • Structured prediction: many classes, strongly interdependent • Example: sequence labeling (number of classes exponential in the sequence length)

  9. Multivariate Regression for Multi-task Classification [Figure: input features x (with LD) mapped to classes Shepherd, Husky, Bulldog under Dogs and Penguin, Duck under Birds; the feature strength between feature j and class i is β_{j,i}] How to combine information across multiple classes to increase the power?

  10. Multivariate Regression for Multi-task Classification [Figure: same setup as the previous slide, with classes now coupled through the taxonomy] We introduce a graph- or tree-guided penalty.

  11. Graph-Guided Fusion Penalty • Fusion penalty: |β_{jk} - β_{jm}| • For two correlated concepts (connected in the network), the association strengths may have similar values • The fusion effect propagates to the entire network • Association between features and subnetworks of concepts [Figure: feature j linked to concepts m and k, with strengths β_{jm} and β_{jk}] Kim and Xing, PLoS Genetics 2009
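
Written out, the graph-guided fused lasso penalty of Kim and Xing (2009) takes roughly the following form (a reconstruction, not copied from the slide), where E is the edge set of the concept network and τ(r_{ml}) weights each edge by the correlation r_{ml} between concepts m and l:

```latex
\Omega_{\text{graph}}(\beta) \;=\; \gamma \sum_{(m,l) \in E} \tau(r_{ml}) \sum_{j}
  \bigl| \beta_{jm} - \mathrm{sign}(r_{ml})\, \beta_{jl} \bigr|
```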

  12. Tree-Guided Group Lasso • For a general tree: select the child nodes jointly or separately? [Figure: tree with node weights h_1, h_2 trading off joint vs. separate selection] The tree-guided group lasso interpolates between joint selection and separate selection (reconstruction below)
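
In broad strokes (the exact weighting scheme is in Kim and Xing's tree-lasso work), the penalty sums weighted ℓ2 norms over groups G_v given by the nodes v of the tree, with weights w_v derived from the joint/separate selection trade-offs h_v:

```latex
\Omega_{\text{tree}}(\beta) \;=\; \sum_{j} \sum_{v \in V} w_v \,\bigl\| \beta_{j,\,G_v} \bigr\|_2
```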

  13. Sparse Coding (unsupervised) • Let x be a signal, e.g., speech, image, etc. • Let B be a set of normalized "basis vectors"; we call it a dictionary • B is "adapted" to x if it can represent it with a few basis vectors: there exists a sparse vector q such that x ≈ Bq • We call q the sparse code [Figure: image x decomposed over dictionary atoms; sailboat, bear, and water responses]
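
The standard sparse-coding objective the slide alludes to, with the dictionary B fixed and the code q penalized for sparsity, is:

```latex
\min_{q}\; \tfrac{1}{2}\| x - Bq \|_2^2 + \lambda \|q\|_1
```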

  14. Hierarchical Image Coding Unsupervised or supervised feature learning [Figure: structured object dictionary with pooling; sailboat, bear, and water responses] L.-J. Li, J. Zhu, H. Su, E. P. Xing, & L. Fei-Fei. In preparation

  15. Challenge • How do we solve the optimization problem for the overlapping group lasso & graph-guided fused lasso penalties (written out below)?
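
The two penalty formulas were lost in extraction; in the notation of the SPG paper they are roughly (groups g ∈ G may overlap; w_g are group weights):

```latex
% Overlapping group lasso
\Omega_{\text{group}}(\beta) \;=\; \gamma \sum_{g \in \mathcal{G}} w_g \,\|\beta_g\|_2
% Graph-guided fused lasso
\Omega_{\text{graph}}(\beta) \;=\; \gamma \sum_{(m,l) \in E} \tau(r_{ml})
  \bigl| \beta_m - \mathrm{sign}(r_{ml})\, \beta_l \bigr|
```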

  16. Optimization • Existing Methods:

  17. Smoothing Proximal Gradient (SPG) Descent • Fast and scalable algorithm: gradient method • Difficulty: non-separability and non-smoothness of the structured sparsity-inducing penalty • Idea: reformulate the structured sparsity-inducing penalty (via the dual norm); introduce its smooth approximation; plug the smooth approximation back into the problem and solve it by an accelerated gradient method (FISTA: fast iterative shrinkage-thresholding algorithm) [Y. Nesterov, 05] [Beck and Teboulle, 09]

  18. Reformulation of Fusion Penalty • Graph-structured sparsity: write the penalty through the edge-vertex incidence matrix C, then apply the dual norm (reconstruction below)
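
Collecting the weighted edge differences into an edge-vertex incidence matrix C makes the penalty an ℓ1 norm; since the dual of ℓ1 is ℓ∞, it becomes a maximization over the unit ℓ∞ ball. A reconstruction consistent with the penalty above:

```latex
\Omega_{\text{graph}}(\beta) = \|C\beta\|_1
  = \max_{\|\alpha\|_\infty \le 1} \alpha^{\top} C \beta,
\quad
C_{e,\,m} = \gamma\,\tau(r_{ml}),\;\;
C_{e,\,l} = -\gamma\,\mathrm{sign}(r_{ml})\,\tau(r_{ml})
\;\text{ for edge } e=(m,l)
```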

  19. Reformulation of Group Penalty • Group-structured sparsity: write the penalty as a max over an auxiliary matrix C whose rows are indexed by (group, feature) pairs and whose columns are indexed by features (reconstruction below)
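
Analogously, since the ℓ2 norm is self-dual, the group penalty is a max over a product Q of unit ℓ2 balls, one per group:

```latex
\Omega_{\text{group}}(\beta) = \max_{\alpha \in Q} \alpha^{\top} C \beta,
\quad
Q = \{\alpha : \|\alpha_g\|_2 \le 1 \;\forall g \in \mathcal{G}\},
\quad
C_{(g,j),\,j} = \gamma\, w_g \;\text{ for } j \in g,\;\text{ 0 otherwise}
```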

  20. Approximation to the Penalty • Subtract a proximity term scaled by a smoothing parameter μ to obtain a smooth lower bound with maximum gap μD, where D differs between the graph and group cases (reconstruction below)
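
Nesterov's smoothing subtracts a strongly convex proximity term inside the max; the result is a smooth lower bound whose gap is at most μD. Reconstructing from the dual-norm form above (the D values are what the ℓ∞ and ℓ2-ball constraints imply, not copied from the slide):

```latex
f_\mu(\beta) = \max_{\alpha \in Q} \Bigl( \alpha^{\top} C \beta - \tfrac{\mu}{2}\|\alpha\|_2^2 \Bigr),
\qquad
f_\mu(\beta) \le \Omega(\beta) \le f_\mu(\beta) + \mu D,
\quad D = \max_{\alpha \in Q} \tfrac{1}{2}\|\alpha\|_2^2
% Graph case: D = |E|/2.  Group case: D = |G|/2.
```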

  21. Geometric Interpretation [Figure: the nonsmooth penalty is the uppermost envelope of a family of lines; after subtracting the proximity term, the uppermost envelope becomes smooth]

  22. Proximal Gradient Descent Original problem: smooth loss + non-smooth penalty with complicated structure + non-smooth ℓ1 penalty with good separability. Approximation problem: replace the structured penalty by its smooth surrogate and collect the smooth terms into h (reconstruction and gradient of h below)
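
The three-part objective and its smoothed surrogate, reconstructed from the definitions above, with α*(β) the maximizer inside f_μ:

```latex
% Original: smooth loss + structured (non-separable) penalty + separable l1
\min_{\beta}\; g(\beta) + \Omega(\beta) + \lambda\|\beta\|_1
% Approximation: replace Omega by its smooth surrogate
\min_{\beta}\; \underbrace{g(\beta) + f_\mu(\beta)}_{h(\beta)} + \lambda\|\beta\|_1,
\qquad
\nabla h(\beta) = \nabla g(\beta) + C^{\top} \alpha^{*}(\beta)
```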

  23. Accelerated Gradient Descent (FISTA) [Beck and Teboulle, 09] • The smooth part h is handled by gradient steps; the non-smooth ℓ1 part, which has good separability, admits a closed-form proximal solution
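
A minimal NumPy sketch of the SPG loop for the graph-guided fused lasso, assuming a squared loss and the ℓ∞-ball dual of the fusion penalty. This is not the authors' code; the function names (`spg_fista`, `soft_threshold`) and default parameters are made up for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding: the prox operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def spg_fista(X, y, C, lam, mu=1e-3, n_iter=500):
    """Sketch of SPG for: min_b 0.5*||y - Xb||^2 + ||C b||_1 + lam*||b||_1,
    with ||C b||_1 smoothed via Nesterov's technique (parameter mu)."""
    n, p = X.shape
    # Lipschitz constant of the smoothed objective's gradient:
    # lambda_max(X^T X) + ||C||_2^2 / mu  (spectral norms).
    L = np.linalg.norm(X, 2) ** 2 + np.linalg.norm(C, 2) ** 2 / mu
    b = np.zeros(p)
    w = b.copy()      # auxiliary (momentum) point
    theta = 1.0
    for _ in range(n_iter):
        # Optimal dual variable of the smoothed penalty at w:
        # projection of C w / mu onto the unit l_inf ball.
        alpha = np.clip(C @ w / mu, -1.0, 1.0)
        grad = X.T @ (X @ w - y) + C.T @ alpha
        # Proximal (soft-thresholding) step handles the separable l1 term.
        b_new = soft_threshold(w - grad / L, lam / L)
        # Standard FISTA momentum update.
        theta_new = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
        w = b_new + (theta - 1) / theta_new * (b_new - b)
        b, theta = b_new, theta_new
    return b
```

In practice μ would be tied to the target accuracy (the paper sets μ = ε/(2D)); here it is a fixed small constant to keep the sketch short.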

  24. Convergence Rate If we require f(β^t) - f(β*) ≤ ε and set μ = ε/(2D), the number of iterations is upper bounded by O(1/ε). Proof idea: combine FISTA's O(1/t^2) rate on the smoothed problem with the μD approximation gap. For comparison, the subgradient method needs O(1/ε^2) iterations.

  25. Time Complexity • Pre-compute: data-dependent quantities only • Per-iteration complexity (computing the gradient): linear in the penalty size (sum of group sizes in the group case; number of edges in the graph case) and, for the proximal-gradient step, independent of sample size

  26. Multi-Task Extension
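
The slide body was lost in extraction; the natural multi-task form, with B ∈ R^{J×K} the coefficient matrix over K tasks and the structured penalty applied across tasks, is roughly:

```latex
\min_{B}\; \tfrac{1}{2}\|Y - XB\|_F^2 + \Omega(B) + \lambda \|B\|_1
```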

  27. Multi-Task Time Complexity • Pre-compute • Per-iteration complexity (computing the gradient): group and graph cases as before; independent of sample size and linear in the number of tasks

  28. Experiment • Multi-task overlapping group lasso (tree-structured) [Figure: binary-tree group structure; recovered coefficients of ground truth vs. Lasso and L1/L2 multi-task Lasso]

  29. Experiment • Multi-task overlapping group lasso (tree-structured) SOCP: runs out of memory storing the Newton linear system; cannot scale up

  30. Experiment • Multi-task graph-guided fused lasso Input: SNPs in the HapMap CEU panel [Figure: recovered coefficients of ground truth vs. Lasso, L1/L2, and graph-fused penalties]

  31. Experiment • Multi-task graph-guided fused lasso SOCP/QP: runs out of memory storing the Newton linear system; cannot scale up

  32. The ImageNet Problem • ILSVRC10: 1.2 million images / 1000 categories • 1000 visual words in the dictionary • Locality-constrained linear coding • Max pooling on a spatial pyramid • Each image represented as a vector in a 21,000-dimensional space Zhao, Fei-Fei and Xing, in preparation

  33. Classification Results • Flat error & hierarchical error

  34. Effects of Augmented Loss Function • APPLET vs. LR • Classification results of APPLET are significantly more informative

  35. Summary • Smoothing Proximal Gradient (SPG) Descent • Reformulate the structured sparsity-inducing penalty (via the dual norm) • Introduce its smooth approximation • Plug the smooth approximation back into the problem and solve it by an accelerated gradient method (FISTA: fast iterative shrinkage-thresholding algorithm)

  36. Thank You! Q & A

  37. Accelerated Gradient Descent (FISTA) • Generalized gradient descent step (projection step) • Closed-form solution (soft-thresholding operation): a Euclidean-distance proximal problem with an exactly sparse (zero) solution (reconstruction below)
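
The projection step and its closed-form soft-thresholding solution, reconstructed from the standard FISTA update (w^t is the momentum point, L the Lipschitz constant):

```latex
\beta^{t+1} = \arg\min_{\beta}\;
  \tfrac{1}{2}\bigl\|\beta - v\bigr\|_2^2 + \tfrac{\lambda}{L}\|\beta\|_1,
\qquad v = w^{t} - \tfrac{1}{L}\nabla h(w^{t})
% Closed form, applied elementwise; small entries become exact zeros:
\beta^{t+1}_j = \mathrm{sign}(v_j)\,\max\!\bigl(|v_j| - \tfrac{\lambda}{L},\, 0\bigr)
```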

  38. Biological Applications • Genome-Wide Association Studies (GWAS): 1,260 genotypes (inputs), expression levels (outputs) of 3,684 genes, 114 yeast strains. Multi-task overlapping group lasso: groups defined among genes by a hierarchical clustering tree. Training:Test = 2:1 (5 folds). 368 iterations, 1,366 seconds. The previous method could handle no more than 100 genotypes [S. Kim, 10]

  39. Multi-Task Time Complexity • Pre-compute • Per-iteration complexity (computing the gradient): tree and graph cases; independent of sample size, linear in the number of concepts, and parallelizable

  40. Proximal Gradient Descent (backup) Original problem, approximation problem, and gradient of the approximation: see slide 22 and its reconstruction above
