
Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools



Presentation Transcript


  1. CENTER FOR INTEGRATIVE BIOINFORMATICS VU. Lecture 8: Feature Selection. Bioinformatics Data Analysis and Tools. Elena Marchiori (elena@few.vu.nl)

  2. Why select features? • Select a subset of “relevant” input variables • Advantages: • it is cheaper to measure fewer variables • the resulting classifier is simpler and potentially faster • prediction accuracy may improve by discarding irrelevant variables • identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

  3. Why select features? [Figure: correlation plots for the Leukemia data (3 classes), comparing no feature selection, selection based on variance, and selection of the top 100 features; color scale from -1 to +1]

  4. Approaches • Wrapper • feature selection takes into account the contribution to the performance of a given type of classifier • Filter • feature selection is based on an evaluation criterion that quantifies how well features (or feature subsets) discriminate between the classes • Embedded • feature selection is part of the training procedure of a classifier (e.g. decision trees)

  5. Embedded methods • Attempt to jointly or simultaneously train both a classifier and a feature subset • Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features. • Intuitively appealing Example: tree-building algorithms Adapted from J. Fridlyand
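As an illustration of the embedded idea, here is a minimal sketch (assuming scikit-learn; the toy data generated below is a stand-in for an expression matrix): a tree ensemble is trained once, and the feature subset falls out of its built-in importance scores.

    # Embedded selection sketch: tree ensemble importances (assumes scikit-learn).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                               random_state=0)      # stand-in for expression data
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Selection is a by-product of training: keep the 20 most important features.
    top = np.argsort(forest.feature_importances_)[::-1][:20]
    X_reduced = X[:, top]
    print("selected feature indices:", top)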

  6. Approaches to Feature Selection [Diagram: Filter approach: input features -> feature selection by distance metric score -> train model. Wrapper approach: input features -> feature selection search over feature sets -> train model, with the importance of features given by the model] Adapted from Shin and Jasso

  7. Filter methods [Diagram: feature selection maps the data from R^p to R^s, with s << p, before classifier design] • Features are scored independently and the top s are used by the classifier • Score: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc. • Easy to interpret. Can provide some insight into the disease markers. Adapted from J. Fridlyand
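A minimal filter sketch (assuming NumPy/SciPy; the toy matrix and the choice of a two-sample t-statistic are illustrative): each feature is scored on its own and the top s scores decide which features reach the classifier.

    # Filter sketch: score each feature independently, keep the top s (assumes NumPy/SciPy).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    X = rng.normal(size=(38, 7129))          # toy expression matrix (samples x genes)
    y = rng.integers(0, 2, size=38)          # toy binary class labels

    # Two-sample t-statistic per feature; larger |t| means better class separation.
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    s = 100
    top_s = np.argsort(-np.abs(t))[:s]       # indices of the s best-scoring genes
    X_filtered = X[:, top_s]                 # reduced data passed on to classifier design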

  8. Problems with filter methods • Redundancy in selected features: features are considered independently and not measured on the basis of whether they contribute new information • Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others) • The classifier has no say in what features should be used: some scores may be more appropriate in conjunction with some classifiers than others. Adapted from J. Fridlyand

  9. Dimension reduction: a variant on a filter method • Rather than retaining a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA) • Problem: we are no longer dealing with one feature at a time but with a linear (or possibly more complicated) combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a “supergene”? (even though we don’t want to confuse the tasks) • These methods tend not to work better than simple filter methods. Adapted from J. Fridlyand
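A short dimension-reduction sketch (assuming scikit-learn; the toy matrix is illustrative): the data are projected onto s principal components, each of which mixes all original genes.

    # Dimension-reduction sketch: project onto s principal components (assumes scikit-learn).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(38, 7129))           # toy expression matrix

    s = 10
    Z = PCA(n_components=s).fit_transform(X)  # each new column mixes all 7129 genes
    # Z feeds the classifier, but a single "supergene" column has no direct biological readout.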

  10. Wrapper methods [Diagram: feature selection maps the data from R^p to R^s, with s << p, before classifier design] • Iterative approach: many feature subsets are scored based on classification performance and the best is used • Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc. Adapted from J. Fridlyand
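A minimal forward-selection sketch of the wrapper idea (assuming scikit-learn; the kNN classifier, CV scoring, and toy data are illustrative choices): at each step the feature whose addition gives the best cross-validated accuracy is appended, until s features have been chosen.

    # Wrapper sketch: greedy forward selection scored by CV accuracy (assumes scikit-learn).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=60, n_features=50, n_informative=5, random_state=0)
    clf, selected, s = KNeighborsClassifier(3), [], 5

    while len(selected) < s:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        # Score every candidate subset by building and evaluating a classifier on it.
        scores = [cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
                  for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])

    print("selected features:", selected)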

  11. Problems with wrapper methods • Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated • No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only • Easy to overfit. Adapted from J. Fridlyand

  12. Example: Microarray Analysis • “Labeled” cases: 38 bone marrow samples (27 ALL, 11 AML), each with 7129 gene expression values • Train a model on the key genes (using Neural Networks, Support Vector Machines, Bayesian nets, etc.) • Apply the model to 34 new unlabeled bone marrow samples to predict AML/ALL

  13. Microarray Data Challenges to Machine Learning Algorithms: • Few samples for analysis (38 labeled) • Extremely high-dimensional data (7129 gene expression values per sample) • Noisy data • Complex underlying mechanisms, not fully understood

  14. Some genes are more useful than others for building classification models Example: genes 36569_at and 36495_at are useful

  15. Some genes are more useful than others for building classification models. Example: genes 36569_at and 36495_at are useful [Figure: scatter plot of the two genes, AML vs. ALL samples]

  16. Some genes are more useful than others for building classification models. Example: genes 37176_at and 36563_at are not useful

  17. Importance of Feature (Gene) Selection • The majority of genes are not directly related to leukemia • Having a large number of features enhances the model’s flexibility, but makes it prone to overfitting • Noise and the small number of training samples make this even more likely • Some types of models, like kNN, do not scale well with many features

  18. With 7129 genes, how do we choose the best? • Use distance metrics to capture class separation • Rank genes according to the distance metric score • Choose the top n ranked genes [Figure: genes ordered from HIGH score to LOW score]

  19. Distance Metrics • Tamayo’s Relative Class Separation • t-test • Bhattacharyya distance
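A scoring sketch for these three metrics (assuming NumPy; the slide does not spell out the formulas, so the standard forms are used: Tamayo’s relative class separation as the signal-to-noise ratio, Welch’s t-statistic, and the Bhattacharyya distance between two univariate Gaussians):

    # Per-gene scoring sketch for the three distance metrics (assumes NumPy).
    import numpy as np

    def gene_scores(X, y):
        """X: samples x genes, y: binary labels. Returns three score vectors."""
        a, b = X[y == 0], X[y == 1]
        m1, m2 = a.mean(0), b.mean(0)
        s1, s2 = a.std(0, ddof=1), b.std(0, ddof=1)
        n1, n2 = len(a), len(b)

        snr = (m1 - m2) / (s1 + s2)                               # relative class separation
        t = (m1 - m2) / np.sqrt(s1**2 / n1 + s2**2 / n2)          # (Welch) t-statistic
        bhatt = (0.25 * (m1 - m2)**2 / (s1**2 + s2**2)            # Bhattacharyya distance,
                 + 0.5 * np.log((s1**2 + s2**2) / (2 * s1 * s2))) # assuming Gaussian classes
        return snr, t, bhatt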

  20. SVM-RFE: wrapper • Recursive Feature Elimination: • Train a linear SVM -> linear decision function • Use the absolute value of the variable weights to rank the variables • Remove the half of the variables with the lowest ranks • Repeat the above steps (train, rank, remove) on the data restricted to the variables not removed • Output: subset of variables

  21. SVM-RFE • Linear binary classifier with decision function f(x) = w·x + b; each variable i is scored by the absolute value of its weight |w_i| • Recursive Feature Elimination (SVM-RFE): at each iteration • eliminate the threshold% of variables with the lowest scores • recompute the scores of the remaining variables

  22. SVM-RFE I. Guyon et al., Machine Learning, 46, 389-422, 2002
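A minimal SVM-RFE sketch following the recipe above (assuming scikit-learn; the halving schedule, stopping size, and toy data are illustrative): train a linear SVM, rank the variables by the absolute value of their weights, drop the lower-ranked half, and repeat on the survivors.

    # SVM-RFE sketch: train, rank by |w|, remove the lower half, repeat (assumes scikit-learn).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=60, n_features=200, n_informative=10, random_state=0)
    remaining = np.arange(X.shape[1])

    while len(remaining) > 10:                          # stop near the desired subset size
        w = LinearSVC(C=1.0, max_iter=10000).fit(X[:, remaining], y).coef_.ravel()
        order = np.argsort(np.abs(w))                   # variables sorted by |weight|
        remaining = remaining[order[len(order) // 2:]]  # keep the better-ranked half

    print("surviving variables:", remaining)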

  23. RELIEF • Idea: relevant variables make nearest examples of the same class closer and nearest examples of opposite classes farther apart • weights = zero • For all examples in the training set: • find the nearest example from the same class (hit) and from the opposite class (miss) • update the weight of each variable by adding abs(example - miss) - abs(example - hit). Kira K., Rendell L., 10th Int. Conf. on AI, 129-134, 1992

  24. RELIEF Algorithm. RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.

      % input: X (two classes)
      % output: W (weights assigned to variables)
      nr_var = total number of variables;
      weights = zero vector of size nr_var;
      for all x in X do
        hit(x) = nnb of x from same class;
        miss(x) = nnb of x from opposite class;
        weights += abs(x - miss(x)) - abs(x - hit(x));
      end;
      nr_ex = number of examples of X;
      return W = weights / nr_ex

  Note: variables have to be normalized (e.g., divide each variable by its (max - min) value)
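A direct Python transcription of the pseudocode above (a sketch assuming NumPy; the L1 neighbor distance and the min-max normalization follow the note on the slide, and constant variables are assumed absent):

    # RELIEF sketch: weight variables by nearest-hit / nearest-miss differences (assumes NumPy).
    import numpy as np

    def relief(X, y):
        """X: examples x variables, y: binary labels. Returns one weight per variable."""
        X = (X - X.min(0)) / (X.max(0) - X.min(0))   # normalize by (max - min), per the slide
        weights = np.zeros(X.shape[1])
        for i, x in enumerate(X):
            dist = np.abs(X - x).sum(axis=1)         # L1 distances to every example
            dist[i] = np.inf                         # exclude the example itself
            same, other = (y == y[i]), (y != y[i])
            hit = X[same][np.argmin(dist[same])]     # nearest neighbor, same class
            miss = X[other][np.argmin(dist[other])]  # nearest neighbor, opposite class
            weights += np.abs(x - miss) - np.abs(x - hit)
        return weights / len(X)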

  25. EXAMPLE: What are the weights of s1, s2, s3 and s4 assigned by RELIEF?

  26. Classification: CV error (N samples) • Training error = empirical error • Error on an independent test set = test error • Cross-validation (CV) error: leave-one-out (LOO) or n-fold CV [Diagram: in each fold, 1/n of the samples are used for testing and (n-1)/n for training; errors are counted over all folds and summarized as the CV error rate]

  27. Two schemes of cross validation (N samples each) • CV1: inside each LOO split, perform feature selection, then train and test the feature-selector and the classifier together; count the errors over the splits • CV2: perform feature selection once on all samples, then inside each LOO split train and test only the classifier; count the errors over the splits

  28. Difference between CV1 and CV2 • CV1: gene selection within LOOCV • CV2: gene selection before LOOCV • CV2 can yield an optimistic estimate of the true classification error • CV2 was used in the paper by Golub et al.: • 0 training errors • 2 CV errors (5.26%) • 5 test errors (14.7%) • CV error differs from the test error!
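A CV1 sketch (assuming scikit-learn/SciPy; the t-statistic filter, kNN classifier, and toy data are illustrative stand-ins): gene selection is redone inside every leave-one-out split, so the held-out sample never influences which genes are chosen.

    # CV1 sketch: feature selection repeated inside each LOO fold (assumes scikit-learn/SciPy).
    import numpy as np
    from scipy import stats
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(38, 500))              # toy expression matrix
    y = rng.integers(0, 2, size=38)

    errors = 0
    for train, test in LeaveOneOut().split(X):
        t, _ = stats.ttest_ind(X[train][y[train] == 0], X[train][y[train] == 1], axis=0)
        genes = np.argsort(-np.abs(t))[:20]     # selection uses the training fold only
        clf = KNeighborsClassifier(3).fit(X[train][:, genes], y[train])
        errors += int(clf.predict(X[test][:, genes])[0] != y[test][0])

    print("LOOCV error rate:", errors / len(y))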

  29. Significance of classification results • Permutation test: • Permute the class labels of the samples • Compute the LOOCV error on the data with permuted labels • Repeat the process many times • Compare with the LOOCV error on the original data: • P-value = (# times LOOCV error on permuted data <= LOOCV error on original data) / total # of permutations considered
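A sketch of this permutation test (assuming NumPy; `loocv_error` is a hypothetical stand-in for whatever LOOCV procedure is being assessed, e.g. the CV1 loop sketched earlier, wrapped as a function of X and y):

    # Permutation-test sketch: compare the real LOOCV error to errors on shuffled labels.
    import numpy as np

    def permutation_p_value(loocv_error, X, y, n_perm=1000, seed=0):
        """loocv_error(X, y) -> error rate; returns the permutation p-value."""
        rng = np.random.default_rng(seed)
        observed = loocv_error(X, y)
        hits = sum(loocv_error(X, rng.permutation(y)) <= observed for _ in range(n_perm))
        return hits / n_perm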

  30. Application: Biomarker detection with Mass Spectrometric data of mixed quality • MALDI-TOF data • samples of mixed quality due to different storage times • controlled molecule spiking used to generate two classes. E. Marchiori et al., IEEE CIBCB, 385-391, 2005

  31. Profiles of one spiked sample

  32. Comparison of ML algorithms • Feature selection + classification: • RFE+SVM • RFE+kNN • RELIEF+SVM • RELIEF+kNN

  33. LOOCV results • Misclassified samples are of bad quality (longer storage time) • The selected features do not always correspond to the m/z values of the spiked molecules

  34. LOOCV results • The variables selected by RELIEF correspond to the spiked peptides • RFE is less robust than RELIEF over the LOOCV runs and also selects “irrelevant” variables • RELIEF-based feature selection yields results that are easier to interpret than RFE

  35. BUT... • RFE+SVM yields higher LOOCV accuracy than RELIEF+SVM • RFE+kNN yields higher accuracy than RELIEF+kNN (perfect LOOCV classification for RFE+1NN) • RFE-based feature selection yields better predictive performance than RELIEF

  36. Conclusion • Better predictive performance does not necessarily correspond to stability and interpretability of results • Open issues: • (ML/BIO) Ad-hoc measure of relevance for potential biomarkers identified by feature selection algorithms (use of domain knowledge)? • (ML) Is stability of feature selection algorithms more important than predictive accuracy?
