Spectral Feature Selection for Mining Ultrahigh Dimensional Data Zheng (Alan) Zhao Computer Science and Engineering Arizona State University
Massive Data and High Dimensionality • Dimensionality of data has increased exponentially (figure: growth of data dimensionality over time, log-scale axis)
Pervasive Phenomenon • Data containing millions of features is not uncommon: • Text mining: large text corpus • Image processing: 3D MRI data • Genetic analysis: RNA-seq data
Challenges and Opportunities • High dimensionality of data poses a serious challenge to statistical learning • Model overfitting • Computational inefficiency • Low interpretability • Dimensionality reduction techniques can be applied to address the problem • Feature selection • Feature extraction
Feature Selection vs. Extraction • Feature selection: dimension reduction by removing irrelevant features • Feature extraction: dimension reduction by combining original features with a weight matrix (diagram: selecting the relevant columns of the original feature matrix vs. multiplying the data matrix by a weight matrix W to form new features)
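To make the contrast concrete, here is a minimal Python sketch (the data, indices, and dimensions are illustrative, not taken from the slides): selection keeps a subset of the original columns unchanged, while extraction forms new columns as weighted combinations of all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))    # 100 instances, 1000 original features

# Feature selection: keep a subset of the original columns unchanged.
selected = [3, 42, 512]             # indices chosen by some selection criterion
X_selected = X[:, selected]         # shape (100, 3), still original features

# Feature extraction: combine all original features via a weight matrix W.
W = rng.normal(size=(1000, 3))      # e.g., a projection learned by PCA or LDA
X_extracted = X @ W                 # shape (100, 3), new derived features
```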
Research Contributions • Feature selection • Feature Interaction (IJCAI 2007) • Spectral feature selection (ICML 2007) • Minimal redundancy feature selection (AAAI 2010) • Semi-supervised feature selection (SDM 2007) • Multi-source feature selection (KDD 2008a, JMLR 2008, DDDM 2009, SDM 2010, BICoB 2010) • Feature extraction • Probabilistic kernel discriminant analysis (IJCAI 2009, TEC 2010) • Unsupervised discriminant analysis (CVPR 2007, NIPS 2007, KDD 2008b best paper award)
A Motivating Example • A good feature should not assign values to the samples at random • In feature selection, we want to select features that assign similar values to samples that share the same affiliation
Modeling Sample Affiliations • Sample similarity provides a unified way to model class (supervised) and cluster (unsupervised) affiliations • In feature selection, we want to select features that assign similar values to samples that are similar to each other (a sketch follows below)
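As an illustration of how both settings reduce to one similarity matrix, here is a hedged Python sketch; the RBF kernel for the unsupervised case and the 0/1 same-class similarity for the supervised case are common choices, not necessarily the exact ones used in the talk.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Unsupervised case: similarity from pairwise distances (RBF kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def label_similarity(y):
    """Supervised case: two samples are similar iff they share a class label."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)
```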
The Spectrum of The Similarity Matrix • The eigenvectors of the similarity matrix carry the distribution information of the data
Spectral Feature Selection • Measuring a feature's consistency by comparing it to the eigenvectors • Assuming features are normalized, we can measure the closeness of a feature to an eigenvector by their inner product • Considering all eigenvectors together yields the feature evaluation criterion for spectral feature selection, equation (1) (a sketch of its form is given below)
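The criterion itself appears only as an image on the original slide. A sketch of the form such a criterion takes for a feature $F_i$ (my notation; the exact spectral weighting used on the slide is not recoverable from the text): with $\hat{f}_i$ the normalized feature vector and $(\lambda_j, \xi_j)$ the eigenpairs of the normalized similarity matrix,

$$\alpha_j = \hat{f}_i^{\top} \xi_j, \qquad \varphi(F_i) = \sum_{j=1}^{n} \alpha_j^{2}\, \gamma(\lambda_j), \tag{1}$$

where $\gamma(\cdot)$ is a weighting function over the spectrum that emphasizes the leading eigenvectors.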
Advantages of Spectral Feature Selection • Intuitive idea with solid theoretical foundation • Based on spectral graph theory • Related to the research of spectral clustering, numerical linear algebra, and regression • Works well practically • Simple to implement • Very efficient, can handle ultrahigh dimensional data with millions of features • Selects relevant features, which result in high learning performance
Generality of Spectral Feature Selection • Unifies supervised and unsupervised feature selection • Includes many existing popular feature selection algorithms as its special cases • Laplacian Score, Fisher Score, ReliefF, Trace Ratio, and HSIC • The framework is stated in terms of the normalized feature vector and the normalized sample similarity matrix
Handling Feature Redundancy • To handle redundant features in feature selection, features must be evaluated jointly • Instead of evaluating features one by one against the leading eigenvectors, we want to find a set of l features such that their linear combination is close to those eigenvectors
A Multi-output Regression Formulation • When the top k eigenpairs are considered, the joint optimization can be formulated as a multi-output regression problem (a sketch is given below) • Given A, WA can be obtained by simply solving the regression problem • However, finding the optimal A is NP-hard
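In my notation (the slide's equation is an image), the joint problem can be written roughly as

$$\min_{A,\, W_A} \; \bigl\lVert U_k - X_A W_A \bigr\rVert_F^2, \qquad |A| = l,$$

where $U_k$ stacks the top $k$ eigenvectors, $A$ is the index set of the $l$ selected features, $X_A$ is the data restricted to those features, and $W_A$ holds their combination weights.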
Sparse Multi-output Regression • To address the joint optimization of A and WA, we apply sparse multi-output regression, equation (2), with an L2,1-norm constraint on W • Here wi denotes the i-th row of W, and the L2,1-norm is the sum of the L2 norms of the rows of W • Under the constraint, only l rows of W are nonzero
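A sketch of the relaxed problem (2) and the L2,1-norm in the same notation, consistent with the description above:

$$\min_{W \in \mathbb{R}^{m \times k}} \; \lVert U_k - X W \rVert_F^2 \quad \text{s.t.} \quad \lVert W \rVert_{2,1} = \sum_{i=1}^{m} \lVert w^{i} \rVert_2 \le c, \tag{2}$$

where $w^{i}$ is the $i$-th row of $W$; with a suitable $c$, all but $l$ rows of $W$ are driven to zero, and the indices of the surviving rows are the selected features.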
L2,1-Norm Constraint (Example) • Removing features by setting the corresponding rows of W to 0 • The L2,1-norm constraint enforces the removal of redundant features during feature selection (figure: regression on 200 features with an L2,1-norm constraint on the weight matrix W)
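The following Python sketch reproduces the row-sparsity effect on synthetic data. It uses scikit-learn's MultiTaskLasso, which penalizes the L2,1 norm of the coefficient matrix (the penalized rather than the constrained form shown on the slide); the data, the alpha value, and the threshold are illustrative only.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, m, k = 100, 200, 3                     # samples, features, regression targets
X = rng.normal(size=(n, m))
U = X[:, :5] @ rng.normal(size=(5, k))    # targets depend on the first 5 features only

# MultiTaskLasso penalizes the L2,1 norm of the weight matrix, so entire
# feature rows are driven to zero together (a row-sparse solution).
model = MultiTaskLasso(alpha=0.5).fit(X, U)
row_norms = np.linalg.norm(model.coef_, axis=0)   # one norm per feature
selected = np.flatnonzero(row_norms > 1e-8)
print("selected features:", selected)
```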
Computation • Given c, problem (2) can be solved efficiently by Nesterov's method for constrained smooth convex optimization • However, searching for c (grid search or binary search) can be very expensive
An Efficient Path-following Solver • Based on exploiting the necessary and sufficient conditions of the optimal solution • Given Wk-1, the optimal solution selecting k-1 features, a feasible solution W'k, which selects k features and satisfies the necessary condition of an optimal solution, can be obtained in closed form • W'k can then be efficiently adjusted to Wk, the optimal solution that satisfies the necessary and sufficient conditions (diagram: W1 → … → Wk-1 → W'k → Wk, stepping from the optimal (k-1)-feature solution to the optimal k-feature solution)
Empirical Study • The proposed approach is named MRSF • We evaluate MRSF on six benchmark datasets • Supervised learning context • Six baseline algorithms: ReliefF, Fisher score, Trace Ratio Criterion, HSIC, mRMR and AROM-SVM • Performance measures: Accuracy, Redundancy Rate • Unsupervised learning context • Four baseline algorithms: Laplacian Score, SPEC, Trace Ratio Criterion, and HSIC. • Performance measures: Jaccard Index, Redundancy Rate
Redundancy Rate (Supervised) Redundancy rate (the smaller, the better)
Jaccard Index & Redundancy Rate (Unsupervised) Jaccard Index (the bigger, the better) Redundancy Rate (the smaller, the better)
Experimental Evaluation • Efficiency, running time (sec.)
Semi-supervised Feature Selection • Semi-supervised feature selection uses a large amount of unlabeled data together with a small amount of labeled data to improve feature selection performance (figure: comparing two features, f and f', on labeled and unlabeled samples)
Formulating The Idea via SPEC Framework • The idea can be formulated by trading off a feature's consistency on the labeled data against its consistency on the unlabeled data • Can effectively improve the performance of unsupervised feature selection • One of the first semi-supervised feature selection algorithms in the literature
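One plausible way to write the trade-off (my notation; the formula on the slide is an image): the semi-supervised score of a feature mixes its consistency with a similarity graph built from the labeled samples and one built from all samples,

$$\varphi_{\text{semi}}(F_i) = \alpha\, \varphi_{\text{labeled}}(F_i) + (1 - \alpha)\, \varphi_{\text{unlabeled}}(F_i), \qquad \alpha \in [0, 1],$$

where $\alpha$ controls how much the scarce labels are trusted over the abundant unlabeled structure.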
Multi-source Feature Selection • Multi-source feature selection integrates information from multiple knowledge sources for reliable relevance estimation (diagram: knowledge of features (1…p), knowledge of samples (1…q), and the target data all feed into multi-source feature selection)
Multi-source Feature Selection • Pipeline: knowledge collection → knowledge conversion → knowledge integration → feature selection
Application in Gene Selection • Cancer biomarker detection with cDNA microarray • Small-sample problem • Data usually contain >20,000 genes but <100 samples • Existing statistical measures become unreliable • Many irrelevant genes appear relevant on the small set of samples due to sheer randomness • We propose to utilize various types of knowledge to improve the reliability of relevance estimation
Various Types of Knowledge • MicroRNA expression profile • Gene sequence • Entrez, EMBL-EBI, GenBank • Gene function annotation • Gene Ontology (GO), Cancer Gene Census • Genetic Interaction • KEGG, iHOP, BioCarta, Protein-Protein interaction
Categories of Different Types of Knowledge • Knowledge in different categories can be effectively used for calculating sample similarity: • Gene functions + gene expression -> Sample similarity • Gene similarity + gene expression -> Sample similarity • Gene interaction + gene expression -> Sample similarity
Example: Knowledge Conversion • Extracting sample similarity from knowledge of genes via kernel embedding (a sketch of one such conversion follows below)
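A hedged Python sketch of one such conversion: weighting the expression-based inner product between samples by a gene-gene similarity matrix obtained from an external knowledge source. The function name and the normalization choice are mine, not from the talk.

```python
import numpy as np

def sample_similarity_from_gene_knowledge(X, M):
    """X: (n_samples, n_genes) gene expression matrix.
    M: (n_genes, n_genes) positive semidefinite gene-gene similarity derived
       from external knowledge (e.g., GO-based functional similarity).
    Returns a sample-sample similarity that weights genes by that knowledge."""
    K = X @ M @ X.T                           # knowledge-weighted inner products
    d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
    return K / np.outer(d, d)                 # cosine-style normalization
```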
Empirical Study • Human cancer data • 33 tumor tissue samples of 4 types of cancer: Mesothelioma, Uterus, Colon, and Pancreas • Five types of knowledge are used • Evaluation criteria: Accuracy & Hit Ratio (the number of known disease-related genes among the selected genes, used as evidence of biological relevance)
Results (figures: statistical relevance evidence and biological relevance evidence)
Conclusion Spectral feature selection • A general framework unifying supervised and unsupervised feature selection • Includes many popular existing feature selection algorithms as its special cases • Has a solid theoretical foundation • Can be extended to solve many challenging problems • Minimal redundancy feature selection • Semi-supervised feature selection • Multi-source feature selection
Future Work Knowledge Oriented Sparse Learning • An extension of multisource feature selection • Sparse learning • Doing feature selection and model fitting simultaneously • Provides superior learning performance • Knowledge oriented sparse learning • Utilizing multiple types of knowledge to guide the inference of sparse learning model • Higher accuracy, robust performance • Better interpretability of the learning model
Acknowledgments • Dr. Liu, Dr. Ye, Dr. Rao and Dr. Xue, CIDSE • All Members in Data Mining and Machine Learning (DMML) Group • Dr. Jiangxin Wang and Dr. Yung Chang, the Biodesign institute of ASU • Dr. Lei Wang, The Australian National University • Dr. Kari Torkkola, Amazon.com • National Science Foundation and Graduate Research Support Program for sponsorships