
Feature Selection and Bioinformatics Applications Isabelle Guyon

Presentation Transcript


  1. Feature Selection and Bioinformatics Applications Isabelle Guyon

  2. Part I INTRODUCTION

  3. Objectives [Diagram: input x → predictor f(x) → output y.] • Reduce the number of features as much as possible without significantly degrading prediction performance. • Possibly improve prediction performance. • Gain insight.

  4. Applications [Chart: application domains placed by number of training examples (10 to 10^5) versus number of inputs (10 to 10^5): High Energy Physics, Market Analysis, OCR/HWR, Machine Vision, Text Categorization, Genomics, System Diagnosis, Bioinformatics, Proteomics.]

  5. This talk: • Simple is beautiful but some (moderate) sophistication is needed. • “Classical statistics” is pessimistic: it advocates the simplest methods to overcome the curse of dimensionality. • Modern statistical methods from soft-computing and machine learning provide necessary additional sophistication and still defeat the curse of dimensionality.

  6. Part II PROBLEM STATEMENT

  7. Correlation Analysis [Figure: heat map of expression values {x_ik} and class labels {y_k}, k = 1…num_patients, showing the top 25 positively correlated and the top 25 negatively correlated features (genes), annotated with class means m-, m+ and standard deviations s-, s+.] 38 training examples (27 ALL, 11 AML); 34 test examples (20 ALL, 14 AML). Golub et al., Science, vol. 286, 15 Oct. 1999.
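The ranking criterion is not spelled out on the slide; a minimal numpy sketch, assuming the signal-to-noise score (m+ - m-)/(s+ + s-) used by Golub et al. and assuming an expression matrix X of shape (num_patients, num_genes) with binary labels y (names are illustrative):

    import numpy as np

    def golub_score(X, y):
        """Signal-to-noise score (m+ - m-) / (s+ + s-) for each feature (gene)."""
        pos, neg = X[y == 1], X[y == 0]
        mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
        sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
        return (mu_p - mu_n) / (sd_p + sd_n + 1e-12)  # small constant avoids division by zero

    # scores = golub_score(X, y)
    # top_pos = np.argsort(scores)[-25:]  # top 25 positively correlated genes
    # top_neg = np.argsort(scores)[:25]   # top 25 negatively correlated genes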

  8. Yes, but ... [Figure: two examples annotated with the same class means m-, m+ and standard deviations s-, s+.]

  9. I.I.D. Features [Figure: scatter plots of independent, identically distributed features.]

  10. I.I.D. Features [Figure: scatter plots of independent, identically distributed features, with class means m- and m+ marked.]

  11. Smaller Win [Figure: scatter plots.]

  12. Bigger Win [Figure: scatter plots.]

  13. Example from Real Data

  14. Explanation: F1 is the peak of interest; F2 is the best local estimate of the baseline.

  15. Two “Useless” Features [Figure: scatter plots of two features and their one-dimensional axis projections.] Axis projections do not help in finding good features.

  16. Higher-dimensional problem Even two-dimensional projections may not help in finding good features.

  17. Part III ALGORITHMS

  18. Main Goal [Diagram: input x → predictor f(x) → output y.] Main goal: rank subsets of useful features. Sub-goals: eliminate useless features (distracters); rank useful features; eliminate redundant features.

  19. Filters and Wrappers • Main goal: rank subsets of useful features. • Danger of overfitting: greedy search often works better. [Diagrams: Filter: all features → filter → feature subset → predictor. Wrapper: all features → multiple feature subsets, each evaluated with the predictor.]
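A minimal sketch of the two schemes, assuming scikit-learn, a univariate F-score filter, and a logistic-regression predictor (all three choices are illustrative, not from the talk): the filter ranks features once and keeps the top k, while the wrapper scores a candidate subset by the predictor's own cross-validated accuracy.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def filter_select(X, y, k=100):
        # Filter: rank features with a univariate score, keep the top k.
        return SelectKBest(f_classif, k=k).fit(X, y).get_support(indices=True)

    def wrapper_score(X, y, subset):
        # Wrapper: judge a candidate subset by the predictor's CV accuracy.
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, subset], y, cv=3).mean()

A search strategy (greedy forward or backward, as in the next slides) decides which subsets the wrapper actually evaluates.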

  20. Nested Subset Methods Nested subset methods perform a greedy search: - At each step, add or remove a single feature so as to best improve (or least degrade) the cost function. - Backward elimination: start with all features, progressively remove (never add). Example: RFE (Guyon, Weston, et al., 2002). - Forward selection: start with an empty set, progressively add (never remove). Example: Gram-Schmidt orthogonalization (Stoppiglia et al., 2003; Rivals and Personnaz, 2003).

  21. Backward elimination: RFE Improve (or least degrade) the cost function J(t): • Exact or approximate difference calculation ΔJ = J(feat+1) - J(feat). • RFE with a linear predictor f(x) = w.x + b: eliminate the feature with the smallest w_i² (Guyon, Weston, et al., 2002). • Zero norm / multiplicative updates (MU): rescale the input with |w_i| at each iteration (Weston, Elisseeff, et al., 2003). • Non-linear RFE and non-linear MU: estimate (ΔJ)_i ≈ αᵀH(i)α.
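A minimal sketch of linear RFE as described above: fit a linear model, drop the feature with the smallest w_i², and repeat. The use of scikit-learn's LinearSVC and the target size n_keep are assumptions for illustration; scikit-learn also ships a ready-made sklearn.feature_selection.RFE.

    import numpy as np
    from sklearn.svm import LinearSVC

    def rfe(X, y, n_keep=7):
        # Backward elimination: remove the single least useful feature per step.
        remaining = list(range(X.shape[1]))
        while len(remaining) > n_keep:
            w = LinearSVC(dual=False).fit(X[:, remaining], y).coef_.ravel()
            worst = int(np.argmin(w ** 2))  # feature with the smallest w_i^2
            del remaining[worst]            # eliminate it; never add it back
        return remaining                    # indices of the surviving features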

  22. Forward selection: Gram-Schmidt Feature ranking in the context of others. • Vanilla (linear) GS: at every iteration, project onto the null space of the features already selected; select the feature most correlated with the target. • Relief (Kira and Rendell, 1992). • GS-Relief combination (Guyon, 2003).
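A minimal numpy sketch of the vanilla (linear) GS step, under the assumption that projecting onto the null space means orthogonalizing the remaining features and the target against each selected feature (function and variable names are illustrative):

    import numpy as np

    def gram_schmidt_select(X, y, n_select=7):
        # Forward selection: pick the feature most correlated with the target,
        # then project everything else onto the null space of that feature.
        X, y = X.astype(float).copy(), y.astype(float).copy()
        selected, candidates = [], list(range(X.shape[1]))
        for _ in range(n_select):
            norms = np.linalg.norm(X[:, candidates], axis=0) * np.linalg.norm(y) + 1e-12
            corr = np.abs(X[:, candidates].T @ y) / norms   # |cosine| with the target
            best = candidates.pop(int(np.argmax(corr)))
            selected.append(best)
            u = X[:, best] / (np.linalg.norm(X[:, best]) + 1e-12)
            X[:, candidates] -= np.outer(u, u @ X[:, candidates])  # orthogonalize features
            y -= u * (u @ y)                                       # orthogonalize target
        return selected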

  23. Part IV EXPERIMENTS

  24. Mass Spectrometry Experiments In collaboration with Biospect Inc., 2003. Data from Cancer Research, Adam et al., 2002. TOF - EVMS prostate cancer data: 326 samples (167 cancer, 159 control). - Preprocessing including m/z 200-10000 and baseline removal. - Split into 3 equal parts and run 3 experiments with 2/3 training and 1/3 test. - Forty-four methods tried.

  25. Method Comparison: 100 Features ... Non-linear multivariate > Linear multivariate > Linear univariate

  26. Method Comparison: 7 Features ... Non-linear multivariate > Linear multivariate > Linear univariate

  27. Part V CONCLUSION

  28. Experimental Results In spite of the risk of overfitting ... • Subset selection methods can outperform single-feature ranking by correlation with the target. • Non-linear feature selection can outperform linear feature selection. ... both in prediction performance and in the number of features.

  29. Which method works best? See the results of the NIPS 2003 competition (presentation on December 19th). See also the JMLR special issue: www.jmlr.org/papers/special/feature.html, I. Guyon and A. Elisseeff, editors, March 2003. Workshop website: www.clopinet.com/isabelle/Projects/NIPS2003. Acknowledgements: Masoud Nikravesh.
