
Study of Sparse Classifier Design Algorithms



  1. Study of Sparse Classifier Design Algorithms Sachin Nagargoje, 08449 Advisor: Prof. Shirish Shevade 20th June 2013

  2. Outline • Introduction • Sparsity w.r.t. features • Using regularizer/penalty • Traditional regularizer/penalty • Other regularizer/penalty • SparseNet • Sparsity w.r.t. support vectors / basis points • Various Techniques • SVM with L1 regularizer • Greedy Methods • Proposed Methods • Experimental Results • Conclusion / Future Work

  3. Introduction

  4. What is Sparsity? • Sparsity w.r.t. features in the model • e.g. the number of non-zero coefficients of the model • Sparsity w.r.t. support vectors • The decision function depends only on a few support vectors x1, …, xd, so the model is sparser w.r.t. the number of training points, but not w.r.t. the number of features (Vapnik 1992; Vapnik et al., 1995)

  5. Need for Sparsity? • Faster prediction • Decreases the complexity of the model • In the case of sparsity w.r.t. features • Removes • Redundant features • Irrelevant features • Noisy features • As the number of features increases • Data becomes sparse in high dimensions • It becomes difficult to achieve low generalization error

  6. Traditional ways to achieve Sparsity • Filter • Select features before the ML algorithm is run • E.g. rank features and eliminate • Wrapper • Find the best subset of features using ML techniques • E.g. forward selection, random selection • Embedded • Feature selection as part of the ML algorithm • E.g. L1-regularized linear regression (see the sketch below)
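To make the embedded case concrete, here is a minimal, illustrative sketch of L1-regularized linear regression (lasso) performing feature selection; scikit-learn, the synthetic data, and the penalty strength are assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 50 features, only the first 5 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ w_true + 0.1 * rng.standard_normal(200)

# The L1 penalty drives most coefficients exactly to zero (embedded selection).
model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```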

  7. Sparsity w.r.t. features

  8. Using Regularizer/Penalty • Data: x = [x1, x2, …, xn], labels: y = [y1, y2, …, yn]^T, model: w = [w1, w2, …, wp] • A type of embedded approach • E.g., for linear least squares regression, minimize ||y - Xw||^2 + λ R(w), where X is the data matrix with rows x_i • R represents the regularizer, e.g. L0 or L1 Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267, 1994.

  9. Traditional regularizers • L0 penalty: ||w||_0, the number of non-zero coefficients • L1 penalty: ||w||_1 = Σ_j |w_j| • The L0 norm is not continuous, non-convex, and not differentiable at 0 (Figure: plots of the L0 and L1 norms.)

  10. Traditional regularizers contd. • Example: a rainfall prediction problem • Assume both models have the same training error • Model 1, with coefficients [3, -5, 8, -4, 1] • L0 penalty = 1 + 1 + 1 + 1 + 1 = 5 • L1 penalty = |3| + |-5| + |8| + |-4| + |1| = 21 • Model 2, with coefficients [-20, 0, 7, 18, 0] • L0 penalty = 1 + 0 + 1 + 1 + 0 = 3 • L1 penalty = |-20| + |0| + |7| + |18| + |0| = 45 • The L0 norm chooses Model 2 (fewer non-zero coefficients); the L1 norm chooses Model 1 (smaller total magnitude) • Since the L1 penalty both shrinks and selects, it often selects a denser model
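The penalty values above can be checked directly. The coefficient vectors are the ones from the slide; the rest is illustrative.

```python
import numpy as np

model_1 = np.array([3, -5, 8, -4, 1])     # coefficients of Model 1
model_2 = np.array([-20, 0, 7, 18, 0])    # coefficients of Model 2

for name, w in [("Model 1", model_1), ("Model 2", model_2)]:
    l0 = int(np.count_nonzero(w))   # L0 penalty: number of non-zero coefficients
    l1 = int(np.abs(w).sum())       # L1 penalty: sum of absolute values
    print(f"{name}: L0 = {l0}, L1 = {l1}")
# Model 1: L0 = 5, L1 = 21  (preferred by the L1 penalty)
# Model 2: L0 = 3, L1 = 45  (preferred by the L0 penalty)
```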

  11. Other regularizers: MC+

  12. MC+ (Figure: the MC+ penalty family, ranging from penalties closer to the L1 norm to penalties closer to the L0 norm.)
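For reference, a minimal sketch of the MC+ penalty in the parameterization used by the SparseNet paper, P(w; λ, γ) = λ(|w| - w^2/(2γλ)) for |w| < γλ and γλ^2/2 otherwise; γ → ∞ recovers the L1 penalty and γ → 1+ approaches a hard-thresholding, L0-like penalty. The function name and test values are illustrative.

```python
import numpy as np

def mcplus_penalty(w, lam, gamma):
    """MC+ penalty, elementwise: lam * integral_0^|w| (1 - x/(gamma*lam))_+ dx."""
    aw = np.abs(np.asarray(w, dtype=float))
    return np.where(aw < gamma * lam,
                    lam * aw - aw ** 2 / (2.0 * gamma),   # concave rising part
                    0.5 * gamma * lam ** 2)               # constant beyond gamma*lam

# Large gamma behaves like the L1 penalty; gamma near 1 flattens quickly (L0-like).
print(mcplus_penalty([0.5, 2.0], lam=1.0, gamma=1000.0))  # ~ [0.5, 2.0]
print(mcplus_penalty([0.5, 2.0], lam=1.0, gamma=1.01))    # ~ [0.38, 0.505]
```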

  13. SparseNet • Uses coordinate descent with a non-convex penalty • Consider the least squares problem for a single-feature data matrix x: minimize (1/2)||y - xw||^2 over the scalar w • It has the closed-form solution w~ = x^T y / x^T x (= x^T y for a standardized, unit-norm feature) • Our goal is to minimize the penalized problem (1/2)||y - xw||^2 + λ P(w; λ, γ) Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.

  14. SparseNet (cont.) • Define the soft-threshold operator S(w~, λ) = sign(w~)(|w~| - λ)_+ • It comes from the three cases of the single-feature optimality conditions: w > 0, w < 0, and w = 0 • Convert the multiple-feature objective into a sequence of single-feature subproblems • Apply coordinate descent Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
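A minimal sketch of the two threshold operators involved, assuming a unit-norm feature (x^T x = 1): the soft threshold for the L1 penalty and, for comparison, the MC+ threshold used by SparseNet. The MC+ form follows the SparseNet paper and should be read as an assumption here.

```python
import numpy as np

def soft_threshold(w_tilde, lam):
    """L1 case: S(w, lam) = sign(w) * max(|w| - lam, 0)."""
    return np.sign(w_tilde) * np.maximum(np.abs(w_tilde) - lam, 0.0)

def mcplus_threshold(w_tilde, lam, gamma):
    """MC+ case (gamma > 1): 0 for |w| <= lam, a rescaled soft threshold for
    lam < |w| <= gamma*lam, and the unpenalized solution w for |w| > gamma*lam."""
    aw = np.abs(w_tilde)
    shrunk = np.sign(w_tilde) * (aw - lam) / (1.0 - 1.0 / gamma)
    return np.where(aw <= lam, 0.0,
                    np.where(aw <= gamma * lam, shrunk, w_tilde))
```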

  15. SparseNet (cont.) • Now extend the problem to a data matrix with multiple features: holding all coordinates except w_j fixed, the term Σ_{k≠j} x_k w_k is a constant and r_j = y - Σ_{k≠j} x_k w_k is the partial residual • The threshold operator is therefore applied to the single-feature fit of x_j against r_j - Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009. - Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Technical report, Annals of Applied Statistics, 2007.

  16. SparseNet (Algorithm)
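The algorithm figure is not reproduced here, so the following is a minimal coordinate-descent sketch for the L1 case only; SparseNet itself uses the MC+ threshold in place of the soft threshold and sweeps a grid of (λ, γ) values with warm starts, which this sketch omits.

```python
import numpy as np

def coordinate_descent_l1(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/2)*||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # x_j^T x_j for each feature j
    r = y.astype(float) - X @ w              # full residual
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * w[j]              # partial residual: remove feature j
            w_tilde = X[:, j] @ r / col_sq[j]
            # Soft-threshold the single-feature solution (MC+ threshold in SparseNet).
            w[j] = np.sign(w_tilde) * max(abs(w_tilde) - lam / col_sq[j], 0.0)
            r -= X[:, j] * w[j]              # put feature j back with the new w_j
    return w
```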

  17. SparseNet with L1 Penalty • Using the L1 penalty (Figure: results on the Slice Localization dataset; the chosen model is marked.) Slice Localization Dataset, A. Frank and A. Asuncion. UCI machine learning repository, 2010.

  18. SparseNet with MC+ Penalty • Using the MC+ penalty (Figure: results on the Slice Localization dataset.) Slice Localization Dataset, A. Frank and A. Asuncion. UCI machine learning repository, 2010.

  19. Sparsity w.r.t. Support vectors / Basis points

  20. Sparsity w.r.t. Support vectors • Kernel-based learning algorithms • f(x) is a linear combination of terms of the form K(x, x_i), i.e. f(x) = Σ_i α_i K(x, x_i)

  21. Various techniques • Support Vector Machine (SVM) • SVM with L1 penalty • Greedy methods (wrapper): • Kernel Matching Pursuit (KMP) • Building SVM with sparser complexity (Keerthi et al) • Proposed method: • Preprocessing the training points using filtering and then applying wrapper methods

  22. SVM with L1 regularizer • Setting: training data (x_i, y_i) • The SVM optimization problem • SVM with L1 penalty • Solved using linear programming (see the sketch below) • Settings used: λ ∈ {1/100, 1/10, 1, 10, 100}, σ ∈ {1/16, 1/4, 1, 4, 16}
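As a rough illustration only (the slides do not reproduce the exact formulation), here is a minimal sketch of an L1-penalized kernel classifier with hinge slack variables, which becomes a linear program; the RBF kernel, the cvxpy modelling, and the way λ weights the two terms are all assumptions.

```python
import numpy as np
import cvxpy as cp

def rbf_kernel(X, sigma):
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l1_kernel_svm(X, y, lam, sigma):
    """min  lam*||alpha||_1 + sum(xi)
       s.t. y_i * (sum_j alpha_j K(x_i, x_j) + b) >= 1 - xi_i,  xi >= 0."""
    K = rbf_kernel(X, sigma)
    n = len(y)
    alpha, b = cp.Variable(n), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    margins = cp.multiply(y, K @ alpha + b)
    prob = cp.Problem(cp.Minimize(lam * cp.norm1(alpha) + cp.sum(xi)),
                      [margins >= 1 - xi])
    prob.solve()
    return alpha.value, b.value

# Grid from the slide: lam in {1/100, 1/10, 1, 10, 100}, sigma in {1/16, 1/4, 1, 4, 16}.
```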

  23. SVM with L1 regularizer (Figures: decision boundaries and support vectors, with an RBF kernel on dummy data and with polynomial and RBF kernels on the Banana data.)

  24. SVM with L1 regularizer (Table: results across the datasets.) • Our formulation gave sparser results than the standard SVM

  25. Greedy methods

  26. Kernel Matching Pursuit • Inspired by the signal processing community • Decomposes any signal into a linear expansion of waveforms selected from a dictionary of functions • The set of basis points is constructed in a greedy fashion • Removes the requirement that the kernel matrix be positive definite • Allows us to directly control the sparsity (in terms of the number of support vectors) Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.

  27. Kernel Matching Pursuit • Setup: • D = {g_1, …, g_l}, a finite dictionary of functions, typically g_i(·) = K(·, x_i) • l = # training points • n = # support vectors chosen so far • At the (n+1)th step, a basis function g_{γ_{n+1}} and its weight α_{n+1} are chosen to best reduce the residual error • Predictor: f_n(x) = Σ_{i=1..n} α_i g_{γ_i}(x) • where γ_1, …, γ_n are the indexes of the SVs Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.
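A minimal sketch of the "basic" KMP variant with squared loss; Vincent and Bengio also describe back-fitting and pre-fitting variants that refit all coefficients at each step, which are omitted here, and the function name is illustrative.

```python
import numpy as np

def kmp_basic(K, y, n_basis):
    """Basic kernel matching pursuit: K is the l x l kernel matrix whose columns
    are the dictionary functions g_i = K(., x_i) evaluated on the training set."""
    residual = y.astype(float).copy()
    chosen, alphas = [], []
    col_sq = (K ** 2).sum(axis=0)                 # ||g_i||^2 on the training set
    for _ in range(n_basis):
        scores = (K.T @ residual) ** 2 / col_sq   # squared-error reduction per column
        if chosen:
            scores[chosen] = -np.inf              # do not reselect a basis point
        j = int(np.argmax(scores))
        alpha = (K[:, j] @ residual) / col_sq[j]  # optimal weight for column j
        chosen.append(j)
        alphas.append(alpha)
        residual -= alpha * K[:, j]
    return chosen, np.array(alphas)  # predictor: f(x) = sum_i alphas[i] * K(x, x_chosen[i])
```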

  28. Basis points versus Support Vectors (Figure: basis points / support vectors on the Banana dataset.) - Dataset: http://mldata.org/repository/data/viewslug/banana-ida/ - S. Sathiya Keerthi, et al. Building support vector machines with reduced classifier complexity. JMLR, 2006. - Vladimir Vapnik, Steven E. Golowich, and Alex J. Smola. Support vector method for function approximation, regression estimation and signal processing. NIPS, 1996.

  29. Proposed methods

  30. Proposed methods • Two-step process: • Step 1: choose a subset of the training set using clustering: • Modified BIRCH clustering algorithm • K-means clustering • GMM clustering • Step 2: apply a greedy algorithm to the resulting basis points: • Kernel Matching Pursuit (KMP) • Building SVM with sparser complexity (Keerthi et al.) • Pipeline: training points → clustering (modified BIRCH / k-means / GMM) → basis points → KMP / Keerthi et al.'s method → model (a sketch of step 1 follows below) - S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 2006. - Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation, 2002.
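A minimal sketch of step 1 using the k-means option; scikit-learn and the "closest training point to each centroid" rule are assumptions, and the thesis also uses modified BIRCH and GMM clustering for this step.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_basis_candidates(X, n_clusters, random_state=0):
    """Step 1: cluster the training points and keep, for each centroid, the
    index of the closest training point as a candidate basis point."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    candidates = set()
    for c in km.cluster_centers_:
        candidates.add(int(np.argmin(((X - c) ** 2).sum(axis=1))))
    return sorted(candidates)

# Step 2 (not shown): run a greedy method such as KMP or Keerthi et al.'s
# procedure restricted to the kernel columns K(., x_j) of the selected candidates.
```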

  31. BIRCH basics • Balanced Iterative Reducing and Clustering using Hierarchies • Uses one scan over the dataset, so it suits large datasets • Each cluster is summarized by a CF vector (N, LS, SS): N = number of data points, LS = linear sum, SS = squared sum • Merging of two clusters: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) (see the sketch below) • CF tree • Height-balanced tree • Two factors: • B (branching factor): each non-leaf node contains at most B entries [CF_i, child_i], i = 1..B, where CF_i is the sub-cluster represented by child_i; a leaf node contains at most L entries [CF_i], i = 1..L • T (threshold): the radius/diameter of each leaf sub-cluster must not exceed T
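A minimal sketch of the CF-vector bookkeeping described above; the class and method names are illustrative.

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) summarizing one subcluster."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), float(p @ p)

    def merge(self, other):
        """CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        """Root-mean-square distance of points to the centroid, from (N, LS, SS) alone."""
        c = self.centroid()
        return float(np.sqrt(max(self.SS / self.N - c @ c, 0.0)))
```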

  32. BIRCH Example • Insertion into a CF tree with B = 3 and L = 3: a new subcluster, sc8, arrives at leaf node LN1 of a tree whose leaves LN1-LN3 hold subclusters sc1-sc7. (Figure: the CF tree before and after the insertion.) - www.cs.uvm.edu/~xwu/kdd/Birch-09.ppt - Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96.

  33. BIRCH Example • Here the branching factor of the leaf node exceeds 3, so LN1 is split into LN1' and LN1''. (Figure: the CF tree after the leaf split.)

  34. BIRCH Example • Here the branching factor of the non-leaf node (the root) exceeds 3, so the root is split into NLN1 and NLN2 and the height of the CF tree increases by one. (Figure: the CF tree after the root split.)

  35. BIRCH Example • A new point arrives at the tree. (Figure: the CF tree with the new point.)

  36. BIRCH Example • Here the alien point falls inside a leaf node, so the affected subcluster is broken into parts; the branching factor of the leaf node then exceeds 3, so LN3 should split. (Figure: the CF tree after the split.)

  37. Clusters using modified BIRCH (Figure: the resulting clusters and their centroids.)

  38. Experiments

  39. Datasets Used

  40. Modified BIRCH with KMP • Our formulation gave decent results (highlighted in red in the table)

  41. Multi-class Modified BIRCH with KMP • All multi-class datasets gave better results

  42. Modified BIRCH with Keerthi et al’s method

  43. K-means and GMM with KMP • Gave sparse models with lower test-set accuracy (except for the entries in blue)

  44. Conclusion • Studied various sparse classifier design algorithms • Better results were obtained using SVM with an L1 penalty • Modified BIRCH with KMP: • gave decent results on binary datasets • gave good results on multi-class datasets • reduced kernel calculations (and time) to roughly 1/5th of the original • Clustering is a simple (though time-consuming) way to choose basis points, but not very effective • Future work: • Explore greedy embedded sparse multi-class classification with different loss functions, e.g. logistic loss • Explore such techniques for semi-supervised learning

  45. Thank You.
