
Spectral Feature Selection for Handling Very Large Scale Problems



Presentation Transcript


  1. Spectral Feature Selection for Handling Very Large Scale Problems Zheng (Alan) Zhao SAS Institute

  2. Motivation
  • Petabyte datasets are rapidly becoming the norm in data mining applications
    • Google: processing 20 PB of data per day (2008)
    • eBay's Greenplum data warehouse: 6.5 PB of user data containing 170 trillion records, growing by 150 billion new records per day (2009)
  • The efficiency of existing feature selection algorithms degrades significantly, if they do not become entirely inapplicable, when data size exceeds hundreds of gigabytes
  • Distributed computing techniques, such as MPI and MapReduce, can be applied to handle very large data
  • Most existing feature selection algorithms are designed for a centralized computing architecture

  3. Large Scale Spectral Feature Selection
  • Spectral feature selection is a general framework for both supervised and unsupervised feature selection
  • Unifies many existing supervised and unsupervised feature selection algorithms
    • ReliefF, Fisher score, Laplacian score, trace ratio, etc.
  • Can be used to derive families of new algorithms
  • Can be extended to solve many novel problems
    • Semi-supervised feature selection
    • Multi-source feature selection
  • We study how to implement spectral feature selection in distributed computing environments, such as MapReduce and SAS Grid, to exploit the power of parallel processing for tackling very large scale data in feature selection
  • We focus on the MapReduce technique in this talk

  4. MapReduce
  • A technique for processing massive data on a large scale computer cluster
    • A programming model
    • An execution framework
  • The idea of bringing code to the data
  • Hides system-level details from the developers
  • Two key components of MapReduce
    • Mapper
    • Reducer
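To make the mapper/reducer roles concrete, here is the classic word-count job as a minimal in-memory sketch (the shuffle phase is simulated with a sort; this is not tied to any particular MapReduce implementation):

```python
from itertools import groupby

# Mapper: emit a (word, 1) pair for every word in its input split.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Reducer: sum the counts collected for one key.
def reducer(key, values):
    return key, sum(values)

def map_reduce(splits):
    # Simulated shuffle phase: group intermediate pairs by key.
    pairs = sorted(kv for split in splits for kv in mapper(split))
    return dict(reducer(k, (v for _, v in group))
                for k, group in groupby(pairs, key=lambda kv: kv[0]))
```

In a real cluster the mappers run on the nodes holding the input splits, and the framework performs the shuffle and invokes the reducers.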

  5. The Key Ideas
  • The training of many existing algorithms can be decomposed into the computation of a series of
    • sufficient statistics
    • gradient steps
  • These summation forms can be grouped according to the location of the samples and computed locally on each cluster node by the mappers; the local results are then aggregated by the reducer to obtain the final global results
  [Figure: a summation over the data points — partial sums computed by the mappers, combined by the reducer]
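The decomposition above can be sketched with the simplest sufficient statistic, the sample mean (an illustrative in-memory sketch; the function names are invented for this example):

```python
# Mapper: compute the local sufficient statistics (sum, count)
# over the samples stored on this node.
def local_stats(samples):
    return sum(samples), len(samples)

# Reducer: add up the local statistics to get the global mean.
def global_mean(partials):
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

Each mapper touches only its own samples, and the reducer sees just one small (sum, count) pair per node, regardless of how large the data is.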

  6. Linear Regression
  • The objective
  • Solution
  • Decomposition
  • Each column of X is a sample
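The formulas on this slide appear to have been images lost in transcription. With each column of X a sample x_i and y the vector of responses, the standard least-squares derivation (a reconstruction, not necessarily the slide's exact notation) is:

```latex
\min_{w}\; \|X^{\top} w - y\|_2^2,
\qquad
w^{*} = \left(X X^{\top}\right)^{-1} X y,
\qquad
X X^{\top} = \sum_{i} x_i x_i^{\top},
\quad
X y = \sum_{i} y_i\, x_i .
```

Both X Xᵀ and X y are sums over the samples, so each mapper can compute the partial sums over its local samples and the reducer only has to add small matrices and vectors.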

  7. Linear Regression (cont.)
  • Mapper & Reducer
  [Figure: partial sums from the mappers combined by the reducer]
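The mapper/reducer pair for distributed linear regression can be sketched as follows (an in-memory illustration of the summation decomposition; the names lr_mapper and lr_reducer are invented for this sketch):

```python
import numpy as np

# Mapper: from its local block of columns (samples), compute the
# partial sufficient statistics sum(x_i x_i^T) and sum(y_i x_i).
def lr_mapper(X_block, y_block):
    return X_block @ X_block.T, X_block @ y_block

# Reducer: add the local statistics and solve the normal equations.
def lr_reducer(partials):
    A = sum(p[0] for p in partials)   # = X X^T
    b = sum(p[1] for p in partials)   # = X y
    return np.linalg.solve(A, b)
```

Each mapper emits only a d×d matrix and a d-vector, where d is the number of features, so the reducer's work is independent of the number of samples.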

  8. Spectral Feature Selection
  • The basic idea: a good feature should not assign values to the samples at random
  • A motivating example: in feature selection, we want to select features that assign similar values to samples that are similar to each other

  9. The Spectrum of the Similarity Matrix
  • The eigenvectors of the similarity matrix carry the distribution information of the data
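The transcript preserves only the slide's claim, so here is a small sketch (assuming an RBF similarity measure, which the slide may or may not have used) of how an eigenvector of the similarity matrix reveals cluster structure:

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    # Pairwise RBF similarities between the samples (rows of X).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Two well-separated 1-D clusters.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
S = rbf_similarity(X)

# np.linalg.eigh returns eigenvalues in ascending order; the
# second-largest eigenvector changes sign between the two clusters.
vals, vecs = np.linalg.eigh(S)
v = vecs[:, -2]
```

The sign pattern of v matches the cluster membership of the samples, which is exactly the "distribution information" a consistent feature should agree with.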

  10. Univariate vs. Multivariate Formulations
  • Measuring features' consistency by comparing features to the eigenvectors
  • Univariate formulation
  • Multivariate formulation
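The formulas for the two formulations were lost in transcription. As an illustration of the univariate idea, one score from the spectral feature selection (SPEC) literature evaluates a feature vector f against the normalized Laplacian of the similarity matrix; lower means more consistent with the sample similarity. This is a plausible reconstruction, not necessarily the slide's exact formula:

```python
import numpy as np

def spec_score(f, S):
    # Normalized graph Laplacian L = I - D^{-1/2} S D^{-1/2}.
    d = S.sum(axis=1)
    dh = np.sqrt(d)
    L = np.eye(len(d)) - (S / dh[:, None]) / dh[None, :]
    # Degree-weight and normalize the feature, then score it: f~^T L f~.
    ft = dh * f
    ft = ft / np.linalg.norm(ft)
    return ft @ L @ ft
```

A feature that is constant within each cluster of similar samples scores near zero; one that alternates within a cluster scores high. The multivariate formulation instead evaluates a subset of features jointly against the eigenvectors, which lets it account for redundancy among features.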

  11. Efficient Computation

  12. Efficient Computation (Cont.)

  13. MapReduce Adaptation

  14. Similarity Matrix
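This slide's content did not survive transcription. One natural way to build the similarity matrix in MapReduce, sketched here under the assumption of an RBF kernel (the slide's actual kernel and partitioning scheme are unknown), is to let each mapper emit one block of pairwise similarities between two blocks of samples:

```python
import numpy as np

# Hypothetical mapper: given two blocks of samples (rows are samples),
# emit the corresponding block of the RBF similarity matrix.
def similarity_block(block_i, block_j, sigma=1.0):
    sq = ((block_i[:, None, :] - block_j[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))
```

The full matrix is then assembled from the blocks, so no single node ever needs to hold all pairwise distances at once.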

  15. Large Scale MRSF

  16. Large Scale MRSF (cont.)

  17. Large Scale Nesterov
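Only the title survives for this slide; it presumably covers scaling Nesterov's accelerated gradient method, used to solve the multivariate formulation. For reference, the method takes a gradient step at an extrapolated point with a momentum weight. A generic sketch (not the slide's distributed variant):

```python
import numpy as np

def nesterov(grad, x0, step, iters=500):
    # Nesterov's accelerated gradient: extrapolate with momentum
    # weight (t_k - 1) / t_{k+1}, then take a gradient step there.
    x, x_prev, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x + ((t - 1) / t_next) * (x - x_prev)
        x_prev, x = x, y - step * grad(y)
        t = t_next
    return x
```

In the distributed setting, the gradient itself is a summation over the samples, so each iteration's gradient can be computed with the same mapper/reducer pattern shown earlier.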

  18. Thank You! Any Questions? Questions are guaranteed in life; Answers aren't.
