Spectral Feature Selection for Mining Ultrahigh Dimensional Data Zheng (Alan) Zhao Computer Science and Engineering Arizona State University
Massive Data and High Dimensionality • Dimensionality of data has increased exponentially (figure: growth of data dimensionality over time, log-scale axis)
Pervasive Phenomenon • Data containing millions of features is not uncommon: • Text mining: large text corpus • Image processing: 3D MRI data • Genetic analysis: RNA-seq data
Challenges and Opportunities • High dimensionality of data poses a serious challenge to statistical learning • Model overfitting • Computational inefficiency • Low interpretability • Dimensionality reduction techniques can be applied to address the problem • Feature selection • Feature extraction
Feature Selection vs. Extraction • Feature selection: dimension reduction by removing irrelevant features • Feature extraction: dimension reduction by combining original features with a weight matrix (diagram: selecting the relevant columns of the original feature matrix vs. multiplying the data matrix by a weight matrix W to form new features)
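To make the contrast concrete, here is a minimal Python sketch (the data, indices, and dimensions are illustrative, not taken from the slides): selection keeps a subset of the original columns unchanged, while extraction forms new columns as weighted combinations of all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))    # 100 instances, 1000 original features

# Feature selection: keep a subset of the original columns unchanged.
selected = [3, 42, 512]             # indices chosen by some selection criterion
X_selected = X[:, selected]         # shape (100, 3), still original features

# Feature extraction: combine all original features via a weight matrix W.
W = rng.normal(size=(1000, 3))      # e.g., a projection learned by PCA or LDA
X_extracted = X @ W                 # shape (100, 3), new derived features
```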
Research Contributions • Feature selection • Feature Interaction (IJCAI 2007) • Spectral feature selection (ICML 2007) • Minimal redundancy feature selection (AAAI 2010) • Semi-supervised feature selection (SDM 2007) • Multi-source feature selection (KDD 2008a, JMLR 2008, DDDM 2009, SDM 2010, BICoB 2010) • Feature extraction • Probabilistic kernel discriminant analysis (IJCAI 2009, TEC 2010) • Unsupervised discriminant analysis (CVPR 2007, NIPS 2007, KDD 2008b best paper award)
A Motivating Example • A good feature should not assign values to the samples at random • In feature selection, we want to select features that assign similar values to samples that share the same affiliation
Modeling Sample Affiliations • Sample similarity provides a unified way to model class (supervised) and cluster (unsupervised) affiliations • In feature selection, we want to select features that assign similar values to samples that are similar to each other (a sketch follows below)
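As an illustration of how both settings reduce to one similarity matrix, here is a hedged Python sketch; the RBF kernel for the unsupervised case and the 0/1 same-class similarity for the supervised case are common choices, not necessarily the exact ones used in the talk.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Unsupervised case: similarity from pairwise distances (RBF kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def label_similarity(y):
    """Supervised case: two samples are similar iff they share a class label."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)
```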
The Spectrum of The Similarity Matrix • The eigenvectors of the similarity matrix carry the distribution information of the data
Spectral Feature Selection • Measuring a feature's consistency by comparing it to the eigenvectors • Assuming features are normalized, we can measure the closeness of a feature to an eigenvector by their inner product • Considering all eigenvectors together yields the feature evaluation criterion for spectral feature selection, equation (1) (a sketch of its form is given below)
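The criterion itself appears only as an image on the original slide. A sketch of the form such a criterion takes for a feature $F_i$ (my notation; the exact spectral weighting used on the slide is not recoverable from the text): with $\hat{f}_i$ the normalized feature vector and $(\lambda_j, \xi_j)$ the eigenpairs of the normalized similarity matrix,

$$\alpha_j = \hat{f}_i^{\top} \xi_j, \qquad \varphi(F_i) = \sum_{j=1}^{n} \alpha_j^{2}\, \gamma(\lambda_j), \tag{1}$$

where $\gamma(\cdot)$ is a weighting function over the spectrum that emphasizes the leading eigenvectors.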
Advantages of Spectral Feature Selection • Intuitive idea with solid theoretical foundation • Based on spectral graph theory • Related to the research of spectral clustering, numerical linear algebra, and regression • Works well practically • Simple to implement • Very efficient, can handle ultrahigh dimensional data with millions of features • Selects relevant features, which result in high learning performance
Generality of Spectral Feature Selection • Unifies supervised and unsupervised feature selection • Includes many existing popular feature selection algorithms as its special cases • Laplacian Score, Fisher Score, ReliefF, Trace Ratio, and HSIC • The framework is stated in terms of the normalized feature vector and the normalized sample similarity matrix
Handling Feature Redundancy • To handle redundant features in feature selection, features must be evaluated jointly • Instead of evaluating features one by one against the leading eigenvectors, we want to find a set of l features such that their linear combination is close to those eigenvectors
A Multi-output Regression Formulation • When the top k eigenpairs are considered, the joint optimization can be formulated as a multi-output regression problem (a sketch is given below) • Given A, WA can be obtained by simply solving the regression problem • However, finding the optimal A is NP-hard
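In my notation (the slide's equation is an image), the joint problem can be written roughly as

$$\min_{A,\, W_A} \; \bigl\lVert U_k - X_A W_A \bigr\rVert_F^2, \qquad |A| = l,$$

where $U_k$ stacks the top $k$ eigenvectors, $A$ is the index set of the $l$ selected features, $X_A$ is the data restricted to those features, and $W_A$ holds their combination weights.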
Sparse Multi-output Regression • To address the joint optimization of A and WA, we apply sparse multi-output regression, equation (2), with an L2,1-norm constraint on W • Here wi denotes the i-th row of W, and the L2,1-norm is the sum of the L2 norms of the rows of W • Under the constraint, only l rows of W are nonzero
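A sketch of the relaxed problem (2) and the L2,1-norm in the same notation, consistent with the description above:

$$\min_{W \in \mathbb{R}^{m \times k}} \; \lVert U_k - X W \rVert_F^2 \quad \text{s.t.} \quad \lVert W \rVert_{2,1} = \sum_{i=1}^{m} \lVert w^{i} \rVert_2 \le c, \tag{2}$$

where $w^{i}$ is the $i$-th row of $W$; with a suitable $c$, all but $l$ rows of $W$ are driven to zero, and the indices of the surviving rows are the selected features.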
L2,1-Norm Constraint (Example) • Removing features by setting the corresponding rows of W to 0 • The L2,1-norm constraint enforces the removal of redundant features during feature selection (figure: regression on 200 features with an L2,1-norm constraint on the weight matrix W)
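The following Python sketch reproduces the row-sparsity effect on synthetic data. It uses scikit-learn's MultiTaskLasso, which penalizes the L2,1 norm of the coefficient matrix (the penalized rather than the constrained form shown on the slide); the data, the alpha value, and the threshold are illustrative only.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, m, k = 100, 200, 3                     # samples, features, regression targets
X = rng.normal(size=(n, m))
U = X[:, :5] @ rng.normal(size=(5, k))    # targets depend on the first 5 features only

# MultiTaskLasso penalizes the L2,1 norm of the weight matrix, so entire
# feature rows are driven to zero together (a row-sparse solution).
model = MultiTaskLasso(alpha=0.5).fit(X, U)
row_norms = np.linalg.norm(model.coef_, axis=0)   # one norm per feature
selected = np.flatnonzero(row_norms > 1e-8)
print("selected features:", selected)
```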
Computation • Given c, problem (2) can be solved efficiently by Nesterov's method for constrained smooth convex optimization • However, searching for c (grid search or binary search) can be very expensive
An Efficient Path-following Solver • Based on exploiting the necessary and sufficient conditions of the optimal solution • Given Wk-1, the optimal solution selecting k-1 features, a feasible solution W'k, which selects k features and satisfies the necessary condition of an optimal solution, can be obtained in closed form • W'k can then be efficiently adjusted to Wk, the optimal solution that satisfies the necessary and sufficient conditions (diagram: W1 → … → Wk-1 → W'k → Wk, stepping from the optimal (k-1)-feature solution to the optimal k-feature solution)
Empirical Study • The proposed approach is named MRSF • We evaluate MRSF on six benchmark datasets • Supervised learning context • Six baseline algorithms: ReliefF, Fisher score, Trace Ratio Criterion, HSIC, mRMR and AROM-SVM • Performance measures: Accuracy, Redundancy Rate • Unsupervised learning context • Four baseline algorithms: Laplacian Score, SPEC, Trace Ratio Criterion, and HSIC. • Performance measures: Jaccard Index, Redundancy Rate
Redundancy Rate (Supervised) Redundancy rate (the smaller, the better)
Jaccard Index & Redundancy Rate (Unsupervised) Jaccard Index (the bigger, the better) Redundancy Rate (the smaller, the better)
Experimental Evaluation • Efficiency, running time (sec.)
Semi-supervised Feature Selection • Semi-supervised feature selection uses a large amount of unlabeled data together with a small amount of labeled data to improve feature selection performance (figure: comparing two features, f and f', on labeled and unlabeled samples)
Formulating The Idea via SPEC Framework • The idea can be formulated by trading off a feature's consistency on the labeled data against its consistency on the unlabeled data • Can effectively improve the performance of unsupervised feature selection • One of the first semi-supervised feature selection algorithms in the literature
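One plausible way to write the trade-off (my notation; the formula on the slide is an image): the semi-supervised score of a feature mixes its consistency with a similarity graph built from the labeled samples and one built from all samples,

$$\varphi_{\text{semi}}(F_i) = \alpha\, \varphi_{\text{labeled}}(F_i) + (1 - \alpha)\, \varphi_{\text{unlabeled}}(F_i), \qquad \alpha \in [0, 1],$$

where $\alpha$ controls how much the scarce labels are trusted over the abundant unlabeled structure.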
Multi-source Feature Selection • Multi-source feature selection integrates information from multiple knowledge sources for reliable relevance estimation (diagram: knowledge of features (1…p), knowledge of samples (1…q), and the target data all feed into multi-source feature selection)
Multi-source Feature Selection • Pipeline: knowledge collection → knowledge conversion → knowledge integration → feature selection
Application in Gene Selection • Cancer biomarker detection with cDNA microarray • Small-sample problem • Data usually contain >20,000 genes but <100 samples • Existing statistical measures become unreliable • Many irrelevant genes appear relevant on the small set of samples due to sheer randomness • We propose to utilize various types of knowledge to improve the reliability of relevance estimation
Various Types of Knowledge • MicroRNA expression profile • Gene sequence • Entrez, EMBL-EBI, GenBank • Gene function annotation • Gene Ontology (GO), Cancer Gene Census • Genetic Interaction • KEGG, iHOP, BioCarta, Protein-Protein interaction
Categories of Different Types of Knowledge • Knowledge in different categories can be effectively used for calculating sample similarity: • Gene functions + gene expression -> Sample similarity • Gene similarity + gene expression -> Sample similarity • Gene interaction + gene expression -> Sample similarity
Example: Knowledge Conversion • Extracting sample similarity from knowledge of genes via kernel embedding (a sketch of one such conversion follows below)
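A hedged Python sketch of one such conversion: weighting the expression-based inner product between samples by a gene-gene similarity matrix obtained from an external knowledge source. The function name and the normalization choice are mine, not from the talk.

```python
import numpy as np

def sample_similarity_from_gene_knowledge(X, M):
    """X: (n_samples, n_genes) gene expression matrix.
    M: (n_genes, n_genes) positive semidefinite gene-gene similarity derived
       from external knowledge (e.g., GO-based functional similarity).
    Returns a sample-sample similarity that weights genes by that knowledge."""
    K = X @ M @ X.T                           # knowledge-weighted inner products
    d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
    return K / np.outer(d, d)                 # cosine-style normalization
```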
Empirical Study • Human cancer data • 33 tumor tissue samples of 4 types of cancer: Mesothelioma, Uterus, Colon, and Pancreas • Five types of knowledge are used • Evaluation criteria: Accuracy & Hit Ratio (the number of known disease-related genes among the selected genes, used as evidence of biological relevance)
Results (figures: statistical relevance evidence and biological relevance evidence)
Conclusion Spectral feature selection • A general framework unifying supervised and unsupervised feature selection • Includes many popular existing feature selection algorithms as its special cases • Has a solid theoretical foundation • Can be extended to solve many challenging problems • Minimal redundancy feature selection • Semi-supervised feature selection • Multi-source feature selection
Future Work Knowledge Oriented Sparse Learning • An extension of multisource feature selection • Sparse learning • Doing feature selection and model fitting simultaneously • Provides superior learning performance • Knowledge oriented sparse learning • Utilizing multiple types of knowledge to guide the inference of sparse learning model • Higher accuracy, robust performance • Better interpretability of the learning model
Acknowledgments • Dr. Liu, Dr. Ye, Dr. Rao and Dr. Xue, CIDSE • All Members in Data Mining and Machine Learning (DMML) Group • Dr. Jiangxin Wang and Dr. Yung Chang, the Biodesign institute of ASU • Dr. Lei Wang, The Australian National University • Dr. Kari Torkkola, Amazon.com • National Science Foundation and Graduate Research Support Program for sponsorships