
Feature Selection and Bioinformatics Applications Isabelle Guyon

Presentation Transcript


  1. Feature Selection and Bioinformatics Applications Isabelle Guyon

  2. Part I INTRODUCTION

  3. Objectives [Diagram: input x → predictor f(x) → output y.] • Reduce the number of features as much as possible without significantly degrading prediction performance. • Possibly improve prediction performance. • Gain insight.

  4. Applications [Chart: application domains placed by number of training examples (10 to 10^5) versus number of inputs (10 to 10^5): High Energy Physics, Market Analysis, OCR/HWR, Machine Vision, Text Categorization, Genomics, System Diagnosis, Bioinformatics, Proteomics.]

  5. This talk: • Simple is beautiful but some (moderate) sophistication is needed. • “Classical statistics” is pessimistic: it advocates the simplest methods to overcome the curse of dimensionality. • Modern statistical methods from soft-computing and machine learning provide necessary additional sophistication and still defeat the curse of dimensionality.

  6. Part II PROBLEM STATEMENT

  7. Correlation Analysis [Figure: heat map of expression values {x_ik} and class labels {y_k}, k = 1…num_patients, showing the top 25 positively correlated and the top 25 negatively correlated features (genes), annotated with class means m-, m+ and standard deviations s-, s+.] 38 training examples (27 ALL, 11 AML); 34 test examples (20 ALL, 14 AML). Golub et al., Science, vol. 286, 15 Oct. 1999.
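The ranking criterion is not spelled out on the slide; a minimal numpy sketch, assuming the signal-to-noise score (m+ - m-)/(s+ + s-) used by Golub et al. and assuming an expression matrix X of shape (num_patients, num_genes) with binary labels y (names are illustrative):

    import numpy as np

    def golub_score(X, y):
        """Signal-to-noise score (m+ - m-) / (s+ + s-) for each feature (gene)."""
        pos, neg = X[y == 1], X[y == 0]
        mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
        sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
        return (mu_p - mu_n) / (sd_p + sd_n + 1e-12)  # small constant avoids division by zero

    # scores = golub_score(X, y)
    # top_pos = np.argsort(scores)[-25:]  # top 25 positively correlated genes
    # top_neg = np.argsort(scores)[:25]   # top 25 negatively correlated genes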

  8. Yes, but ... [Figure: two examples annotated with the same class means m-, m+ and standard deviations s-, s+.]

  9. I.I.D. Features [Figure: scatter plots of independent, identically distributed features.]

  10. I.I.D. Features [Figure: scatter plots of independent, identically distributed features, with class means m- and m+ marked.]

  11. Smaller Win [Figure: scatter plots.]

  12. Bigger Win [Figure: scatter plots.]

  13. Example from Real Data

  14. Explanation: F1 is the peak of interest; F2 is the best local estimate of the baseline.

  15. Two “Useless” Features [Figure: scatter plots of two features and their one-dimensional axis projections.] Axis projections do not help in finding good features.

  16. Higher-dimensional problem Even two-dimensional projections may not help in finding good features.

  17. Part III ALGORITHMS

  18. Main Goal [Diagram: input x → predictor f(x) → output y.] Main goal: rank subsets of useful features. Sub-goals: eliminate useless features (distracters); rank useful features; eliminate redundant features.

  19. Filters and Wrappers • Main goal: rank subsets of useful features. • Danger of overfitting: greedy search often works better. [Diagrams: Filter: all features → filter → feature subset → predictor. Wrapper: all features → multiple feature subsets, each evaluated with the predictor.]
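A minimal sketch of the two schemes, assuming scikit-learn, a univariate F-score filter, and a logistic-regression predictor (all three choices are illustrative, not from the talk): the filter ranks features once and keeps the top k, while the wrapper scores a candidate subset by the predictor's own cross-validated accuracy.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def filter_select(X, y, k=100):
        # Filter: rank features with a univariate score, keep the top k.
        return SelectKBest(f_classif, k=k).fit(X, y).get_support(indices=True)

    def wrapper_score(X, y, subset):
        # Wrapper: judge a candidate subset by the predictor's CV accuracy.
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, subset], y, cv=3).mean()

A search strategy (greedy forward or backward, as in the next slides) decides which subsets the wrapper actually evaluates.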

  20. Nested Subset Methods Nested subset methods perform a greedy search: - At each step, add or remove a single feature so as to best improve (or least degrade) the cost function. - Backward elimination: start with all features, progressively remove (never add). Example: RFE (Guyon, Weston, et al., 2002). - Forward selection: start with an empty set, progressively add (never remove). Example: Gram-Schmidt orthogonalization (Stoppiglia et al., 2003; Rivals and Personnaz, 2003).

  21. Backward elimination: RFE Improve (or least degrade) the cost function J(t): • Exact or approximate difference calculation ΔJ = J(feat+1) - J(feat). • RFE with a linear predictor f(x) = w.x + b: eliminate the feature with the smallest w_i² (Guyon, Weston, et al., 2002). • Zero norm / multiplicative updates (MU): rescale the input with |w_i| at each iteration (Weston, Elisseeff, et al., 2003). • Non-linear RFE and non-linear MU: estimate (ΔJ)_i ≈ αᵀH(i)α.
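A minimal sketch of linear RFE as described above: fit a linear model, drop the feature with the smallest w_i², and repeat. The use of scikit-learn's LinearSVC and the target size n_keep are assumptions for illustration; scikit-learn also ships a ready-made sklearn.feature_selection.RFE.

    import numpy as np
    from sklearn.svm import LinearSVC

    def rfe(X, y, n_keep=7):
        # Backward elimination: remove the single least useful feature per step.
        remaining = list(range(X.shape[1]))
        while len(remaining) > n_keep:
            w = LinearSVC(dual=False).fit(X[:, remaining], y).coef_.ravel()
            worst = int(np.argmin(w ** 2))  # feature with the smallest w_i^2
            del remaining[worst]            # eliminate it; never add it back
        return remaining                    # indices of the surviving features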

  22. Forward selection: Gram-Schmidt Feature ranking in the context of others. • Vanilla (linear) GS: at every iteration, project onto the null space of the features already selected; select the feature most correlated with the target. • Relief (Kira and Rendell, 1992). • GS-Relief combination (Guyon, 2003).
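A minimal numpy sketch of the vanilla (linear) GS step, under the assumption that projecting onto the null space means orthogonalizing the remaining features and the target against each selected feature (function and variable names are illustrative):

    import numpy as np

    def gram_schmidt_select(X, y, n_select=7):
        # Forward selection: pick the feature most correlated with the target,
        # then project everything else onto the null space of that feature.
        X, y = X.astype(float).copy(), y.astype(float).copy()
        selected, candidates = [], list(range(X.shape[1]))
        for _ in range(n_select):
            norms = np.linalg.norm(X[:, candidates], axis=0) * np.linalg.norm(y) + 1e-12
            corr = np.abs(X[:, candidates].T @ y) / norms   # |cosine| with the target
            best = candidates.pop(int(np.argmax(corr)))
            selected.append(best)
            u = X[:, best] / (np.linalg.norm(X[:, best]) + 1e-12)
            X[:, candidates] -= np.outer(u, u @ X[:, candidates])  # orthogonalize features
            y -= u * (u @ y)                                       # orthogonalize target
        return selected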

  23. Part IV EXPERIMENTS

  24. Mass Spectrometry Experiments In collaboration with Biospect Inc., 2003. Data from Cancer Research, Adam et al., 2002. TOF - EVMS prostate cancer data: 326 samples (167 cancer, 159 control). - Preprocessing including m/z 200-10000 and baseline removal. - Split into 3 equal parts and run 3 experiments with 2/3 training and 1/3 test. - Forty-four methods tried.

  25. Method Comparison: 100 Features ... Non-linear multivariate > Linear multivariate > Linear univariate

  26. Method Comparison: 7 Features ... Non-linear multivariate > Linear multivariate > Linear univariate

  27. Part V CONCLUSION

  28. Experimental Results In spite of the risk of overfitting ... • Subset selection methods can outperform single-feature ranking by correlation with the target. • Non-linear feature selection can outperform linear feature selection. ... both in prediction performance and in the number of features.

  29. Which method works best? See the results of the NIPS 2003 competition (presentation on December 19th). See also the JMLR special issue: www.jmlr.org/papers/special/feature.html, I. Guyon and A. Elisseeff, editors, March 2003. Workshop website: www.clopinet.com/isabelle/Projects/NIPS2003. Acknowledgements: Masoud Nikravesh.
