Thesis Proposal Learning with Sparsity: Structures, Optimization and Applications

Thesis Proposal. Learning with Sparsity: Structures, Optimization and Applications. Xi Chen. Committee Members: Jaime Carbonell (chair), Tom Mitchell, Larry Wasserman, Robert Tibshirani. Machine Learning Department, Carnegie Mellon University

Modern Data Analysis. Web-text data: both high-dimensional and massive in volume, with structure among word features (e.g., synonyms). Gene expression data for tumor classification: high-dimensional, very few samples, complex structure. Climate data: dynamic, complex structure. Challenges: high dimensionality; complex and dynamic structures.

Solutions: Sparse Learning. Sparse regression for feature selection and prediction: smooth convex loss with L1-regularization [Tibshirani 96]. Incorporating structural prior knowledge: structured penalty (e.g., group, hierarchical tree, graph) [Jenatton et al., 09; Peng et al., 09; Tibshirani et al., 05; Friedman et al., 10; Kim et al., 10]. Nonparametric sparse regression (a flexible model): additive model [Ravikumar et al., 09].

Sparse Learning in Graphical Models. Undirected graphical model (Markov random fields); examples: pairwise model for images, gene graph. Learning the sparse structure of graphical models: Graphical Lasso (gLasso) (Yuan et al. 06; Friedman et al. 07; Banerjee et al. 08), Iterated Lasso (Meinshausen and Bühlmann, 06), Forest Density Estimator (Liu et al. 10).

Thesis Overview: High-dimensional Sparse Learning with Structures
Sparse Single/Multi-task Regression with General Structured Penalty. Challenge: computation. Completed work: unified optimization framework, Smoothing Proximal Gradient [UAI 11, AOAS]. Future work: (1) online learning for massive data; (2) incorporate structured penalty in other models (e.g., PCA, CCA).
Learning Sparse Structures for Undirected Graphical Models. Existing: static or time-varying graph. Challenge: dynamic structures. Completed work: conditional Gaussian graphical model; kernel smoothing method for spatial-temporal graphs [AAAI 10]; partition-based method [NIPS 10]. Future work: relax the conditional Gaussian assumption (continuous and discrete).
Nonparametric Sparse Regression. Existing: additive models. Challenge: (1) generalized models, (2) structures. Completed work: Generalized Forward Regression [NIPS 09]; Penalized Tree Regression [NIPS 10]. Future work: incorporating rich structures.
Application areas: tumor classification using gene expression data [UAI 11, AOAS], climate data analysis [AAAI 10, NIPS 10], web-text mining [ICDM 10, SDM 10]

Roadmap: (1) Smoothing Proximal Gradient for Structured Sparse Regression; (2) Structure Learning in Graphical Models; (3) Nonparametric Sparse Regression; (4) Summary and Timeline; (5) Q & A

Useful Structures and Structured Penalty • Group structure (group-wise selection). Application: pathway selection for gene-expression data in tumor classification [Yuan 06; Peng et al., 09; Kim et al., 10]. Example: WordNet [Bach et al., 09]

Useful Structures and Structured Penalty • Graph structure (to enforce smoothness): piece-wise constant [Tibshirani 05]; graph smoothness [Kim et al., 10]

Challenge: in both single-task and multi-task regression, the structured penalties are nonsmooth and nonseparable. Goal: a unified, efficient and scalable optimization framework for solving all these structured penalties.

Existing Optimization Proximal Operator: [Nesterov 07, Beck and Teboulle, 09]
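For reference, the proximal operator of a penalty is conventionally defined as below. The symbols v, \lambda and \Omega are notation assumed here, not taken from the slide. For the plain L1 norm the minimizer has a closed form (soft-thresholding); for the structured penalties discussed in this talk it generally does not.

```latex
\operatorname{prox}_{\lambda\Omega}(v)
  = \arg\min_{\beta}\; \tfrac{1}{2}\,\lVert \beta - v \rVert_2^2 + \lambda\,\Omega(\beta),
\qquad
\bigl[\operatorname{prox}_{\lambda\lVert\cdot\rVert_1}(v)\bigr]_j
  = \operatorname{sign}(v_j)\,\max\bigl(\lvert v_j\rvert - \lambda,\, 0\bigr).
```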

Overview: Smoothing Proximal Gradient (SPG) [Nesterov 05] • First-order Method (only gradient info): fast and scalable • No exact solution for proximal operator • Idea: • 1) Reformulate the structured penalty (via the dual norm) • 2) Introduce its smooth approximation • 3) Plug the smooth approximation back into the original problem and solve it by accelerated proximal gradient methods • Convergence Results:

Why is the Approximation Smooth? Geometric interpretation (figure): before smoothing, the uppermost line is nonsmooth; after smoothing, the uppermost line is smooth.
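A minimal sketch of the smoothing idea on the simplest possible penalty, the absolute value, may help here. It assumes the dual-norm form |x| = max over a in [-1, 1] of a*x and subtracts (mu/2)*a^2 inside the maximum; the function name and the parameter mu are illustrative choices, not taken from the slides.

```python
def smoothed_abs(x, mu):
    """Smoothed |x|: max over a in [-1, 1] of (a*x - mu/2 * a**2).

    Returns (value, a_star). Because the maximizer a_star is unique,
    it is also the derivative of the smoothed function at x.
    """
    # The unconstrained maximizer is x/mu; clip it to the feasible set [-1, 1].
    a_star = max(-1.0, min(1.0, x / mu))
    value = a_star * x - 0.5 * mu * a_star ** 2
    return value, a_star


# The approximation stays within mu/2 of |x| everywhere.
for x in (-2.0, -0.1, 0.0, 0.1, 2.0):
    value, slope = smoothed_abs(x, mu=0.5)
    print(x, value, slope)
```

The result is quadratic near zero and linear elsewhere, so the kink of |x| at the origin is rounded off while the value never differs from |x| by more than mu/2.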

Smoothing Proximal Gradient (SPG). Original problem: convex smooth loss + non-smooth penalty with complex structure. Approximated problem: smooth function + non-smooth penalty with good separability. Gradient of the approximation: Danskin's Theorem. Proximal operator: soft-thresholding [Nesterov 07; Beck and Teboulle, 09]
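The recipe on this slide can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the thesis implementation: it takes a 1-D signal y, uses the loss 0.5*||y - b||^2, an L1 penalty (handled by soft-thresholding) and a chain fusion penalty gamma*sum|b[i+1] - b[i]| (handled by smoothing), and runs a FISTA-style accelerated loop. All function names and default values are hypothetical.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t*|.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def objective(b, y, lam, gamma):
    """Original (unsmoothed) objective, used only to inspect a solution."""
    loss = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, b))
    l1 = lam * sum(abs(bi) for bi in b)
    fusion = gamma * sum(abs(b[i + 1] - b[i]) for i in range(len(b) - 1))
    return loss + l1 + fusion


def spg_fused_signal(y, lam, gamma, mu=1e-2, iters=500):
    """Approximately minimize objective(b, y, lam, gamma) over b."""
    n = len(y)
    # Lipschitz constant of the smooth part: 1 for the loss plus
    # ||gamma*D||^2 / mu for the smoothed penalty, using ||D||^2 <= 4
    # for the first-difference operator D.
    lipschitz = 1.0 + 4.0 * gamma ** 2 / mu
    b = [0.0] * n
    w = b[:]          # extrapolated point
    theta = 1.0
    for _ in range(iters):
        # Dual maximizer of the smoothed fusion penalty at w.
        a = [max(-1.0, min(1.0, gamma * (w[i + 1] - w[i]) / mu))
             for i in range(n - 1)]
        # Gradient of loss plus gamma * D^T a (gradient of the smoothed penalty).
        g = [w[i] - y[i] for i in range(n)]
        for i in range(n - 1):
            g[i] -= gamma * a[i]
            g[i + 1] += gamma * a[i]
        # Gradient step followed by the closed-form L1 proximal step.
        b_new = [soft_threshold(w[i] - g[i] / lipschitz, lam / lipschitz)
                 for i in range(n)]
        theta_new = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * theta ** 2))
        momentum = (theta - 1.0) / theta_new
        w = [b_new[i] + momentum * (b_new[i] - b[i]) for i in range(n)]
        b, theta = b_new, theta_new
    return b


y = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
b_hat = spg_fused_signal(y, lam=0.1, gamma=0.5)
print(b_hat, objective(b_hat, y, 0.1, 0.5))
```

A smaller mu tracks the original penalty more closely but raises the Lipschitz constant and so shrinks the step size; that trade-off is what a convergence analysis of this kind of scheme has to balance.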

Convergence Rate

Multi-Task Extension

Simulation Study: multi-task graph-guided fused lasso on SNP data (e.g., ACGTTTTACTGTACAATTTAC) and gene-expression data

Biological Application: Breast Cancer Tumor Classification. Gene expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic; logistic regression loss). Canonical pathways from MSigDB containing 637 groups of genes. Training:Test = 2:1. SPG for overlapping group lasso: regularization path (20 parameters) in 331 seconds. Important pathways: proteasome, nicotinate (ENPP1).

Proposed Research • More applications for SPG: complex structured penalty handled by the smoothing technique; simple penalty with good separability handled by the closed-form solution of the proximal operator; e.g., low rank + sparse • Web-scale learning: massive amounts of data • Inputs arrive sequentially at a high rate • Need to provide real-time service • Solution: stochastic optimization for online learning

Proposed Research • Stochastic Optimization • Deterministic: Stochastic: Existing methods: RDA [Xiao 10], Accelerated Stochastic Gradient Descent [Lan et al., 10]; both ruin the sparsity pattern • Goal: sparsity-preserving stochastic optimization for large-scale online learning • Structured Sparsity: Beyond Regression • Canonical Correlation Analysis and its Application in Genome-wide Association Study

Gaussian Graphical Model [Lauritzen 96]. Graphical Lasso (gLasso) [Yuan et al., 06; Friedman et al., 07; Banerjee et al., 08]. Challenge: dynamic graph structure
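For orientation, the graphical lasso is usually stated as an L1-penalized Gaussian log-likelihood over the precision matrix. The notation below (S for the sample covariance, \Theta for the precision matrix, \lambda for the regularization weight) is assumed here and not taken from the slide. Zeros in the estimated \Theta correspond to missing edges in the graph.

```latex
\hat{\Theta}
  = \arg\min_{\Theta \succ 0}\;
    \operatorname{tr}(S\,\Theta) - \log\det\Theta + \lambda\,\lVert \Theta \rVert_1,
\qquad
\Theta_{jk} = 0 \;\Longleftrightarrow\; Y_j \perp Y_k \mid Y_{\setminus\{j,k\}} .
```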

Idea: Graph-Valued Regression. Multivariate regression; undirected graphical model. Input data: Graph-valued regression: Application: [Zhou et al., 08; Song et al., 09]

Applications for higher-dimensional X. Y: gene expression levels. X: patient symptom characterization

Kernel Smoothing Estimator. Conditional Gaussian assumption. Cons of the kernel smoothing estimator: (1) unstable when the dimension of x is high; (2) computationally heavy and difficult to analyze; (3) hard to visualize
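One plausible reading of the kernel smoothing step is a locally weighted mean and covariance of Y around a query point x0, which could then be handed to a sparse precision estimator such as the graphical lasso. The sketch below assumes a scalar x and a Gaussian kernel; the function name, the kernel choice and the bandwidth parameter are assumptions made for illustration, not details from the slides.

```python
import math


def kernel_weighted_covariance(xs, ys, x0, bandwidth):
    """Locally weighted mean and covariance of the rows of ys around x0.

    xs: list of scalar covariates; ys: list of equal-length response vectors.
    """
    # Gaussian kernel weights: samples with x near x0 count the most.
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    p = len(ys[0])
    mean = [sum(w * y[j] for w, y in zip(weights, ys)) / total
            for j in range(p)]
    cov = [[sum(w * (y[j] - mean[j]) * (y[k] - mean[k])
                for w, y in zip(weights, ys)) / total
            for k in range(p)]
           for j in range(p)]
    return mean, cov


xs = [0.0, 1.0, 2.0]
ys = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
print(kernel_weighted_covariance(xs, ys, x0=1.0, bandwidth=0.5))
```

Every query point requires its own weighted covariance (and its own sparse precision fit), which is consistent with the "computationally heavy" drawback listed on the slide.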

Partition-Based Estimator. CART (Classification and Regression Tree) [Breiman 84; Tibshirani et al., 09]. Graphical model: difficult to search for the split point. Partition-based estimator: Graph-Optimized CART (Go-CART)

Dyadic Partitioning Tree (DPT) [Scott and Nowak, 04]. Assumptions and notation:

Graph-Optimized CART (Go-CART) • Go-CART: penalized risk minimization estimator • Go-CART: held-out risk minimization estimator • Split the data: • Practical algorithm: greedy learning using held-out data

Statistical Properties. Without assuming that the underlying partition is dyadic: oracle risk; oracle inequality, which bounds the excess risk relative to the oracle. With the added assumption that the underlying partition is dyadic: tree partitioning consistency (might obtain a finer partition)

Real Climate Data Analysis [Lozano et al., 09, IBM]. Data description: 125 locations in the U.S.; 1990 to 2002 (13 years); monthly observations of 18 variables/factors: CO2, UV, CH4, DIR, CO, ETRN, H2, ETR, WET, GLO, CLD, TMX, VAP, TMP, TMN, PRE, FRS, DTR

Real Climate Data Analysis (figure: glasso). Observations: (1) With the graphical lasso, no edge connects the greenhouse gases (CO2, CH4, CO, H2) with the solar radiation factors (GLO, DIR), which contradicts the IPCC report; with Go-CART, such an edge exists. (2) Graphs along the coasts are sparser than those in the mainland.

Proposed Research • Limitations of Go-CART: (1) conditional Gaussian assumption; (2) only for continuous Y (for discrete Y: approximate likelihood) • Forest Graphical Model [Chow and Liu, 68; Tan et al., 09; Liu et al., 11] • Density only involves univariate and bivariate marginals • Compute mutual information for each pair of variables • Greedily learn the tree structure via the Chow-Liu algorithm • Handle both continuous and discrete data • Forest-Valued Regression
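The two Chow-Liu steps named above (pairwise mutual information, then greedy tree construction) can be sketched as follows. This is a minimal version for discrete data only, using plug-in empirical mutual information and a Kruskal-style maximum spanning tree; the function names are hypothetical, and continuous variables would need a density estimate that is not shown here.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Empirical mutual information (in nats) of two discrete columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log(c * n / (pa[a] * pb[b]))
               for (a, b), c in pab.items())


def chow_liu_tree(data):
    """data: list of samples, each a tuple of discrete values.

    Returns the edges (i, j), i < j, of a maximum-weight spanning tree
    whose edge weights are pairwise mutual informations.
    """
    d = len(data[0])
    cols = list(zip(*data))
    # Score every pair of variables, strongest dependence first.
    edges = sorted(((mutual_information(cols[i], cols[j]), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))  # union-find forest for cycle detection

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:           # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


samples = ([(0, 0, 0)] * 4 + [(1, 1, 1)] * 4 + [(0, 1, 1), (1, 1, 0)])
print(sorted(chow_liu_tree(samples)))
```

Stopping the loop early, for example once the remaining mutual informations fall below a threshold, would yield a forest rather than a full spanning tree.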

Nonparametric Regression. Parametric models; additive models [Hastie et al., 90]; sparse additive models [Ravikumar et al., 09]; generalized nonparametric models, which model interaction between variables. Bottleneck: computation

My Work and Proposed Research • Greedy Learning Method • Additive Forward Regression (AFR) • Generalization of Orthogonal Matching Pursuit [Tropp et al., 06] to the nonparametric setting • Generalized Forward Regression (GFR) • Penalized Regression Tree Method • Proposed Research: • Formulate the functional forms for structured penalties • Develop efficient algorithms for solving the corresponding nonparametric structured sparse regression problems
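Since AFR is described as a generalization of Orthogonal Matching Pursuit, a sketch of plain linear OMP may clarify the greedy pattern being generalized. This is not AFR or GFR: it works on raw linear features, assumes every column is nonzero, and refits by ordinary least squares via the normal equations. The function names are hypothetical.

```python
import math


def solve_least_squares(columns, y):
    """Least-squares coefficients via normal equations and Gauss-Jordan."""
    k = len(columns)
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(a * b for a, b in zip(columns[i], columns[j]))
            for j in range(k)]
           + [sum(a * t for a, t in zip(columns[i], y))]
           for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(k):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [ar - f * ac for ar, ac in zip(aug[r], aug[c])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def omp(columns, y, n_select):
    """Greedy forward selection of n_select columns (features)."""
    selected, coef = [], []
    residual = list(y)

    def score(j):
        # Absolute correlation of column j with the current residual.
        col = columns[j]
        dot = sum(c * r for c, r in zip(col, residual))
        return abs(dot) / math.sqrt(sum(c * c for c in col))

    for _ in range(n_select):
        candidates = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(candidates, key=score))
        # Refit on all selected columns, then update the residual.
        chosen = [columns[j] for j in selected]
        coef = solve_least_squares(chosen, y)
        residual = [t - sum(w * col[i] for w, col in zip(coef, chosen))
                    for i, t in enumerate(y)]
    return selected, coef


features = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
target = [2.0, 0.0, 1.0, -1.0]   # equals 2*features[0] - features[2]
print(omp(features, target, n_select=2))
```

In a nonparametric variant, the correlation score and the least-squares refit would be replaced by smoother-based fits of each candidate component against the residual.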

Summary and Timeline

Acknowledgements. Feedback: Xi Chen (xichen@cs.cmu.edu). My committee members: Jaime Carbonell (advisor), Tom Mitchell, Larry Wasserman, Robert Tibshirani. Acknowledgements: Eric P. Xing, John Lafferty, Seyoung Kim, Manuel Blum, Aarti Singh, Jeff Schneider, Javier Pena, Han Liu, Qihang Lin, Junming Yin, Xiong Liang, Tzu-Kuo Huang, Min Xu, Mladen Kolar, Yan Liu, Jingrui He, Yanjun Qi, Bing Bai. IBM Fellowship
