
Study of Sparse Classifier Design Algorithms



  1. Study of Sparse Classifier Design Algorithms Sachin Nagargoje, 08449 Advisor: Prof. Shirish Shevade 20th June 2013

  2. Outline • Introduction • Sparsity w.r.t. features • Using regularizer/penalty • Traditional regularizer/penalty • Other regularizer/penalty • SparseNet • Sparsity w.r.t. support vectors / basis points • Various Techniques • SVM with L1 regularizer • Greedy Methods • Proposed Methods • Experimental Results • Conclusion / Future Work

  3. Introduction

  4. What is Sparsity? • Sparsity w.r.t. features in the model • e.g. the number of non-zero coefficients of the model • Sparsity w.r.t. support vectors • The decision function depends only on a few support vectors x1, …, xd, so the model is sparser w.r.t. the number of training points, but not w.r.t. the number of features (Vapnik 1992; Vapnik et al., 1995)

  5. Need for Sparsity? • Faster prediction • Decreases the complexity of the model • In the case of sparsity w.r.t. features • Removes • Redundant features • Irrelevant features • Noisy features • As the number of features increases • Data becomes sparse in high dimensions • It becomes difficult to achieve low generalization error

  6. Traditional ways to achieve Sparsity • Filter • Select features before the ML algorithm is run • E.g. rank features and eliminate • Wrapper • Find the best subset of features using ML techniques • E.g. forward selection, random selection • Embedded • Feature selection as part of the ML algorithm • E.g. L1-regularized linear regression (see the sketch below)
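To make the embedded case concrete, here is a minimal, illustrative sketch of L1-regularized linear regression (lasso) performing feature selection; scikit-learn, the synthetic data, and the penalty strength are assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 50 features, only the first 5 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ w_true + 0.1 * rng.standard_normal(200)

# The L1 penalty drives most coefficients exactly to zero (embedded selection).
model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```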

  7. Sparsity w.r.t. features

  8. Using Regularizer/Penalty • Data: x = [x1, x2, …, xn], labels: y = [y1, y2, …, yn]^T, model: w = [w1, w2, …, wp] • A type of embedded approach • E.g., for linear least squares regression, minimize ||y - Xw||^2 + λ R(w), where X is the data matrix with rows x_i • R represents the regularizer, e.g. L0 or L1 Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267, 1994.

  9. Traditional regularizers • L0 penalty: ||w||_0, the number of non-zero coefficients • L1 penalty: ||w||_1 = Σ_j |w_j| • The L0 norm is not continuous, non-convex, and not differentiable at 0 (Figure: plots of the L0 and L1 norms.)

  10. Traditional regularizers contd. • Example: a rainfall prediction problem • Assume both models have the same training error • Model 1, with coefficients [3, -5, 8, -4, 1] • L0 penalty = 1 + 1 + 1 + 1 + 1 = 5 • L1 penalty = |3| + |-5| + |8| + |-4| + |1| = 21 • Model 2, with coefficients [-20, 0, 7, 18, 0] • L0 penalty = 1 + 0 + 1 + 1 + 0 = 3 • L1 penalty = |-20| + |0| + |7| + |18| + |0| = 45 • The L0 norm chooses Model 2 (fewer non-zero coefficients); the L1 norm chooses Model 1 (smaller total magnitude) • Since the L1 penalty both shrinks and selects, it often selects a denser model
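The penalty values above can be checked directly. The coefficient vectors are the ones from the slide; the rest is illustrative.

```python
import numpy as np

model_1 = np.array([3, -5, 8, -4, 1])     # coefficients of Model 1
model_2 = np.array([-20, 0, 7, 18, 0])    # coefficients of Model 2

for name, w in [("Model 1", model_1), ("Model 2", model_2)]:
    l0 = int(np.count_nonzero(w))   # L0 penalty: number of non-zero coefficients
    l1 = int(np.abs(w).sum())       # L1 penalty: sum of absolute values
    print(f"{name}: L0 = {l0}, L1 = {l1}")
# Model 1: L0 = 5, L1 = 21  (preferred by the L1 penalty)
# Model 2: L0 = 3, L1 = 45  (preferred by the L0 penalty)
```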

  11. Other regularizers: MC+

  12. MC+ (Figure: the MC+ penalty family, ranging from penalties closer to the L1 norm to penalties closer to the L0 norm.)
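For reference, a minimal sketch of the MC+ penalty in the parameterization used by the SparseNet paper, P(w; λ, γ) = λ(|w| - w^2/(2γλ)) for |w| < γλ and γλ^2/2 otherwise; γ → ∞ recovers the L1 penalty and γ → 1+ approaches a hard-thresholding, L0-like penalty. The function name and test values are illustrative.

```python
import numpy as np

def mcplus_penalty(w, lam, gamma):
    """MC+ penalty, elementwise: lam * integral_0^|w| (1 - x/(gamma*lam))_+ dx."""
    aw = np.abs(np.asarray(w, dtype=float))
    return np.where(aw < gamma * lam,
                    lam * aw - aw ** 2 / (2.0 * gamma),   # concave rising part
                    0.5 * gamma * lam ** 2)               # constant beyond gamma*lam

# Large gamma behaves like the L1 penalty; gamma near 1 flattens quickly (L0-like).
print(mcplus_penalty([0.5, 2.0], lam=1.0, gamma=1000.0))  # ~ [0.5, 2.0]
print(mcplus_penalty([0.5, 2.0], lam=1.0, gamma=1.01))    # ~ [0.38, 0.505]
```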

  13. SparseNet • Uses coordinate descent with a non-convex penalty • Consider the least squares problem for a single-feature data matrix x: minimize (1/2)||y - xw||^2 over the scalar w • It has the closed-form solution w~ = x^T y / x^T x (= x^T y for a standardized, unit-norm feature) • Our goal is to minimize the penalized problem (1/2)||y - xw||^2 + λ P(w; λ, γ) Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.

  14. SparseNet (cont.) • Define the soft-threshold operator S(w~, λ) = sign(w~)(|w~| - λ)_+ • It comes from the three cases of the single-feature optimality conditions: w > 0, w < 0, and w = 0 • Convert the multiple-feature objective into a sequence of single-feature subproblems • Apply coordinate descent Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
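A minimal sketch of the two threshold operators involved, assuming a unit-norm feature (x^T x = 1): the soft threshold for the L1 penalty and, for comparison, the MC+ threshold used by SparseNet. The MC+ form follows the SparseNet paper and should be read as an assumption here.

```python
import numpy as np

def soft_threshold(w_tilde, lam):
    """L1 case: S(w, lam) = sign(w) * max(|w| - lam, 0)."""
    return np.sign(w_tilde) * np.maximum(np.abs(w_tilde) - lam, 0.0)

def mcplus_threshold(w_tilde, lam, gamma):
    """MC+ case (gamma > 1): 0 for |w| <= lam, a rescaled soft threshold for
    lam < |w| <= gamma*lam, and the unpenalized solution w for |w| > gamma*lam."""
    aw = np.abs(w_tilde)
    shrunk = np.sign(w_tilde) * (aw - lam) / (1.0 - 1.0 / gamma)
    return np.where(aw <= lam, 0.0,
                    np.where(aw <= gamma * lam, shrunk, w_tilde))
```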

  15. SparseNet (cont.) • Now extend the problem to a data matrix with multiple features: holding all coordinates except w_j fixed, the term Σ_{k≠j} x_k w_k is a constant and r_j = y - Σ_{k≠j} x_k w_k is the partial residual • The threshold operator is therefore applied to the single-feature fit of x_j against r_j - Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009. - Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Technical report, Annals of Applied Statistics, 2007.

  16. SparseNet (Algorithm)
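The algorithm figure is not reproduced here, so the following is a minimal coordinate-descent sketch for the L1 case only; SparseNet itself uses the MC+ threshold in place of the soft threshold and sweeps a grid of (λ, γ) values with warm starts, which this sketch omits.

```python
import numpy as np

def coordinate_descent_l1(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/2)*||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # x_j^T x_j for each feature j
    r = y.astype(float) - X @ w              # full residual
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * w[j]              # partial residual: remove feature j
            w_tilde = X[:, j] @ r / col_sq[j]
            # Soft-threshold the single-feature solution (MC+ threshold in SparseNet).
            w[j] = np.sign(w_tilde) * max(abs(w_tilde) - lam / col_sq[j], 0.0)
            r -= X[:, j] * w[j]              # put feature j back with the new w_j
    return w
```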

  17. SparseNet with L1 Penalty • Using the L1 penalty (Figure: results on the Slice Localization dataset; the chosen model is marked.) Slice Localization Dataset, A. Frank and A. Asuncion. UCI machine learning repository, 2010.

  18. SparseNet with MC+ Penalty • Using the MC+ penalty (Figure: results on the Slice Localization dataset.) Slice Localization Dataset, A. Frank and A. Asuncion. UCI machine learning repository, 2010.

  19. Sparsity w.r.t. Support vectors / Basis points

  20. Sparsity w.r.t. Support vectors • Kernel-based learning algorithms • f(x) is a linear combination of terms of the form K(x, x_i), i.e. f(x) = Σ_i α_i K(x, x_i)

  21. Various techniques • Support Vector Machine (SVM) • SVM with L1 penalty • Greedy methods (wrapper): • Kernel Matching Pursuit (KMP) • Building SVM with sparser complexity (Keerthi et al) • Proposed method: • Preprocessing the training points using filtering and then applying wrapper methods

  22. SVM with L1 regularizer • Setting: training data (x_i, y_i) • The SVM optimization problem • SVM with L1 penalty • Solved using linear programming (see the sketch below) • Settings used: λ ∈ {1/100, 1/10, 1, 10, 100}, σ ∈ {1/16, 1/4, 1, 4, 16}
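As a rough illustration only (the slides do not reproduce the exact formulation), here is a minimal sketch of an L1-penalized kernel classifier with hinge slack variables, which becomes a linear program; the RBF kernel, the cvxpy modelling, and the way λ weights the two terms are all assumptions.

```python
import numpy as np
import cvxpy as cp

def rbf_kernel(X, sigma):
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l1_kernel_svm(X, y, lam, sigma):
    """min  lam*||alpha||_1 + sum(xi)
       s.t. y_i * (sum_j alpha_j K(x_i, x_j) + b) >= 1 - xi_i,  xi >= 0."""
    K = rbf_kernel(X, sigma)
    n = len(y)
    alpha, b = cp.Variable(n), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    margins = cp.multiply(y, K @ alpha + b)
    prob = cp.Problem(cp.Minimize(lam * cp.norm1(alpha) + cp.sum(xi)),
                      [margins >= 1 - xi])
    prob.solve()
    return alpha.value, b.value

# Grid from the slide: lam in {1/100, 1/10, 1, 10, 100}, sigma in {1/16, 1/4, 1, 4, 16}.
```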

  23. SVM with L1 regularizer (Figures: decision boundaries and support vectors, with an RBF kernel on dummy data and with polynomial and RBF kernels on the Banana data.)

  24. SVM with L1 regularizer (Table: results across the datasets.) • Our formulation gave sparser results than the standard SVM

  25. Greedy methods

  26. Kernel Matching Pursuit • Inspired by the signal processing community • Decomposes any signal into a linear expansion of waveforms selected from a dictionary of functions • The set of basis points is constructed in a greedy fashion • Removes the requirement that the kernel matrix be positive definite • Allows us to directly control the sparsity (in terms of the number of support vectors) Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.

  27. Kernel Matching Pursuit • Setup: • D = {g_1, …, g_l}, a finite dictionary of functions, typically g_i(·) = K(·, x_i) • l = # training points • n = # support vectors chosen so far • At the (n+1)th step, a basis function g_{γ_{n+1}} and its weight α_{n+1} are chosen to best reduce the residual error • Predictor: f_n(x) = Σ_{i=1..n} α_i g_{γ_i}(x) • where γ_1, …, γ_n are the indexes of the SVs Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.
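A minimal sketch of the "basic" KMP variant with squared loss; Vincent and Bengio also describe back-fitting and pre-fitting variants that refit all coefficients at each step, which are omitted here, and the function name is illustrative.

```python
import numpy as np

def kmp_basic(K, y, n_basis):
    """Basic kernel matching pursuit: K is the l x l kernel matrix whose columns
    are the dictionary functions g_i = K(., x_i) evaluated on the training set."""
    residual = y.astype(float).copy()
    chosen, alphas = [], []
    col_sq = (K ** 2).sum(axis=0)                 # ||g_i||^2 on the training set
    for _ in range(n_basis):
        scores = (K.T @ residual) ** 2 / col_sq   # squared-error reduction per column
        if chosen:
            scores[chosen] = -np.inf              # do not reselect a basis point
        j = int(np.argmax(scores))
        alpha = (K[:, j] @ residual) / col_sq[j]  # optimal weight for column j
        chosen.append(j)
        alphas.append(alpha)
        residual -= alpha * K[:, j]
    return chosen, np.array(alphas)  # predictor: f(x) = sum_i alphas[i] * K(x, x_chosen[i])
```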

  28. Basis points versus Support Vectors (Figure: basis points / support vectors on the Banana dataset.) - Dataset: http://mldata.org/repository/data/viewslug/banana-ida/ - S. Sathiya Keerthi, et al. Building support vector machines with reduced classifier complexity. JMLR, 2006. - Vladimir Vapnik, Steven E. Golowich, and Alex J. Smola. Support vector method for function approximation, regression estimation and signal processing. NIPS, 1996.

  29. Proposed methods

  30. Proposed methods • Two-step process: • Step 1: choose a subset of the training set using clustering: • Modified BIRCH clustering algorithm • K-means clustering • GMM clustering • Step 2: apply a greedy algorithm to the resulting basis points: • Kernel Matching Pursuit (KMP) • Building SVM with sparser complexity (Keerthi et al.) • Pipeline: training points → clustering (modified BIRCH / k-means / GMM) → basis points → KMP / Keerthi et al.'s method → model (a sketch of step 1 follows below) - S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 2006. - Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation, 2002.
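A minimal sketch of step 1 using the k-means option; scikit-learn and the "closest training point to each centroid" rule are assumptions, and the thesis also uses modified BIRCH and GMM clustering for this step.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_basis_candidates(X, n_clusters, random_state=0):
    """Step 1: cluster the training points and keep, for each centroid, the
    index of the closest training point as a candidate basis point."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    candidates = set()
    for c in km.cluster_centers_:
        candidates.add(int(np.argmin(((X - c) ** 2).sum(axis=1))))
    return sorted(candidates)

# Step 2 (not shown): run a greedy method such as KMP or Keerthi et al.'s
# procedure restricted to the kernel columns K(., x_j) of the selected candidates.
```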

  31. BIRCH basics • Balanced Iterative Reducing and Clustering using Hierarchies • Uses one scan over the dataset, so it suits large datasets • Each cluster is summarized by a CF vector (N, LS, SS): N = number of data points, LS = linear sum, SS = squared sum • Merging of two clusters: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) (see the sketch below) • CF tree • Height-balanced tree • Two factors: • B (branching factor): each non-leaf node contains at most B entries [CF_i, child_i], i = 1..B, where CF_i is the sub-cluster represented by child_i; a leaf node contains at most L entries [CF_i], i = 1..L • T (threshold): the radius/diameter of each leaf sub-cluster must not exceed T
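A minimal sketch of the CF-vector bookkeeping described above; the class and method names are illustrative.

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) summarizing one subcluster."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), float(p @ p)

    def merge(self, other):
        """CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        """Root-mean-square distance of points to the centroid, from (N, LS, SS) alone."""
        c = self.centroid()
        return float(np.sqrt(max(self.SS / self.N - c @ c, 0.0)))
```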

  32. BIRCH Example • Insertion into a CF tree with B = 3 and L = 3: a new subcluster, sc8, arrives at leaf node LN1 of a tree whose leaves LN1-LN3 hold subclusters sc1-sc7. (Figure: the CF tree before and after the insertion.) - www.cs.uvm.edu/~xwu/kdd/Birch-09.ppt - Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96.

  33. BIRCH Example • Here the branching factor of the leaf node exceeds 3, so LN1 is split into LN1' and LN1''. (Figure: the CF tree after the leaf split.)

  34. BIRCH Example • Here the branching factor of the non-leaf node (the root) exceeds 3, so the root is split into NLN1 and NLN2 and the height of the CF tree increases by one. (Figure: the CF tree after the root split.)

  35. BIRCH Example • A new point arrives at the tree. (Figure: the CF tree with the new point.)

  36. BIRCH Example • Here the alien point falls inside a leaf node, so the affected subcluster is broken into parts; the branching factor of the leaf node then exceeds 3, so LN3 should split. (Figure: the CF tree after the split.)

  37. Clusters using modified BIRCH (Figure: the resulting clusters and their centroids.)

  38. Experiments

  39. Datasets Used

  40. Modified BIRCH with KMP • Our formulation gave decent results (highlighted in red in the table)

  41. Multi-class Modified BIRCH with KMP • All multi-class datasets gave better results

  42. Modified BIRCH with Keerthi et al’s method

  43. K-means and GMM with KMP • Gave sparse models with lower test-set accuracy (except for the entries in blue)

  44. Conclusion • Studied various sparse classifier design algorithms • Better results were obtained using SVM with an L1 penalty • Modified BIRCH with KMP: • gave decent results on binary datasets • gave good results on multi-class datasets • reduced kernel calculations (and time) to roughly 1/5th of the original • Clustering is a simple (though time-consuming) way to choose basis points, but not very effective • Future work: • Explore greedy embedded sparse multi-class classification with different loss functions, e.g. logistic loss • Explore such techniques for semi-supervised learning

  45. Thank You.
