
Feature Selection Stability Analysis for Classification Using Microarray Data






Presentation Transcript


  1. Université Libre de Bruxelles, DEA Bioinformatique
  Feature Selection Stability Analysis for Classification Using Microarray Data
  Panagiotis Moulos

  2. Outline
  • Introduction: Motivation; Stability Measure Approach; The bias/variance tradeoff; Contributions
  • Materials and Methods: Stability Metrics; Example (Hamming Distance); Experimental Analysis
  • Results: Visualizing Instability; Stability Results; Accuracy Results; Remarks; Feature Aggregation
  • Discussion: General Remarks; Future Work

  3. Motivation
  • Microarrays are invaluable tools for cancer studies at the molecular level: prognosis and (early) diagnosis
  • Microarray data analysis: supervised/unsupervised learning for tumor classification; Feature Selection techniques to identify the important genes, yielding a cancer genetic signature
  • However, these signatures are sensitive to perturbations: a small perturbation (e.g. removing one sample) may lead to a completely different signature
  • Example: the full gene ranking list (1,3,4,5,2) gives the signature (1,3,4), while after a small perturbation the full list (2,5,3,4,1) gives the signature (2,5,3). What is the similarity between (2,5,3) and (1,3,4)?

  4. Stability Measure Approach
  • The problem of similarity between two gene lists can be approached mathematically through the theory of permutations
  • Given a set Gn = (g1, g2, …, gn) of objects, a permutation π is a bijective function from Gn to Gn
  • In the context of microarray data:
  • The n genes (features) involved are labeled with a unique number between 1, …, n
  • Every full gene ranking list is exactly a permutation π on the set {1, …, n}, where the image π(i) of the ith gene is its rank inside π
  • If we are interested only in the top N ranked genes (features) of Gn, we define π* as the partial ranking list of Gn, which contains the first N elements of π

  5. Stability Measure Approach (Example)
  • A full ranking list: G5 = (1,2,3,4,5)
  • A permutation: π = (3,2,5,4,1), where π(1) = 3, π(2) = 2, π(3) = 5, π(4) = 4, π(5) = 1
  • A partial ranking list with the top N = 3 ranked genes: π* = (3,2,5), where π*(1) = 3, π*(2) = 2, π*(3) = 5
  • How can we summarize the variability between
  • full ranking lists π and σ
  • partial ranking lists π* and σ*
  • Several metrics have been proposed in the statistical literature (e.g. Critchlow, 1985)
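The example above can be sketched in a few lines of Python (the list and function names are illustrative, not the thesis' code):

```python
# A full ranking list as a Python list: pi[i - 1] is pi(i),
# i.e. the rank assigned to gene i (genes are labeled 1..n).
pi = [3, 2, 5, 4, 1]     # pi(1) = 3, pi(2) = 2, pi(3) = 5, pi(4) = 4, pi(5) = 1
sigma = [2, 5, 3, 4, 1]  # a second ranking, e.g. after a perturbation

def partial_ranking(ranking, n_top):
    """The partial ranking list: the first n_top elements of the permutation."""
    return ranking[:n_top]

pi_star = partial_ranking(pi, 3)        # [3, 2, 5], as in the example above
sigma_star = partial_ranking(sigma, 3)  # [2, 5, 3]
```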

  6. The bias/variance tradeoff
  • A central issue in choosing a model for a given problem is selecting the level of structural complexity (number of variables/parameters, etc.) that best suits the data it must accommodate
  • Variance contribution: too many parameters, inclusion of noise, overfitting
  • Bias contribution: too few parameters, not enough flexibility, misfit
  • Deciding on the correct amount of flexibility in a model is therefore a tradeoff between these two sources of misfit. This is called the bias/variance tradeoff

  7. Contributions
  • Experimental study of signature stability in gene expression datasets by resampling (bootstrap, jackknife) the datasets for different ranking/feature selection methods
  • Study of several forms of feature selection stability using statistical similarity measures
  • Classification performance assessment for each feature selection and classification method
  • Study of a possible correlation between feature selection stability and classification accuracy for all feature selection and classification methods
  • Proposal of a feature aggregation procedure to obtain more stable probabilistic gene signatures

  8. Outline
  • Introduction: Motivation; Stability Measure Approach; The bias/variance tradeoff; Contributions
  • Materials and Methods: Stability Metrics; Example (Hamming Distance); Experimental Analysis
  • Results: Visualizing Instability; Stability Results; Accuracy Results; Remarks; Feature Aggregation
  • Discussion: General Remarks; Future Work

  9. Stability Metrics
  • Measures of Feature Selection Stability
  • Stability of Selection: the stability of appearance of certain features after resampling the original dataset. Metrics: Hamming Distance, Inconsistency
  • Stability of Ranking: the stability of both the appearance and the ranking order of certain features after resampling the original dataset. Metrics: Spearman's Footrule, Kendall's Tau
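The two ranking-stability metrics have simple textbook definitions for a pair of full ranking lists: Spearman's Footrule sums the absolute rank differences, and Kendall's Tau distance counts discordantly ordered pairs. A minimal sketch (not the thesis' implementation; it may use normalized versions):

```python
from itertools import combinations

def spearman_footrule(pi, sigma):
    """Sum of absolute rank differences between two full rankings."""
    return sum(abs(p - s) for p, s in zip(pi, sigma))

def kendall_tau_distance(pi, sigma):
    """Number of item pairs ranked in opposite order by the two rankings."""
    n = len(pi)
    return sum(
        1
        for i, j in combinations(range(n), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )

# The two rankings from the earlier example
pi = [3, 2, 5, 4, 1]
sigma = [2, 5, 3, 4, 1]
```

Both metrics are zero for identical rankings and grow as the two lists diverge, which is what makes them usable as instability scores across resamplings.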

  10. Example (Hamming Distance)
  • Example of a stability metric: the Hamming Distance
  • Calculation of the Hamming Distance for m = 5 resamplings of the original dataset, n = 10 genes, and the top N = 5 selected genes
  [Figure: the original dataset and Resamplings 1-5]
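A sketch of a Hamming-distance stability computation over m resampled signatures: each signature is encoded as a binary selection vector over the n genes, and the distances are averaged over all pairs. The signature lists below are invented for illustration, and the thesis may normalize the distance differently:

```python
from itertools import combinations

def selection_mask(signature, n_genes):
    """Binary vector: 1 if gene g (1-based) is in the signature, else 0."""
    chosen = set(signature)
    return [1 if g in chosen else 0 for g in range(1, n_genes + 1)]

def hamming(a, b):
    """Number of positions where the two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_pairwise_hamming(signatures, n_genes):
    """Average Hamming distance over all pairs of resampled signatures.
    0 = perfectly stable selection; larger values = less stable."""
    masks = [selection_mask(s, n_genes) for s in signatures]
    pairs = list(combinations(masks, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

# m = 5 resamplings, n = 10 genes, top N = 5 genes per signature (toy data)
signatures = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 6],
    [1, 2, 3, 5, 7],
    [1, 2, 4, 5, 6],
    [1, 3, 4, 5, 8],
]
```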

  11. Experimental Analysis
  • Datasets: Breast Cancer (HBC, Tamoxifen), Leukemia (MLL, Golub), Lymphoma
  • Classification algorithms: k-NN (k = 5), Support Vector Machines
  • Feature Selection algorithms
  • Filters: Gram–Schmidt orthogonalization; k-NN and SVM correlation-based filters (genes ranked according to the misclassification error of a classifier trained on a single gene)
  • Wrapper: Sequential Forward Selection wrapper
  • Feature aggregation (main personal contribution): gather all the different signatures together, remove duplicates, and exclude features whose selection frequency falls below a threshold
  • Resampling strategies
  • Bootstrap (at each step, resample the patients of a dataset with replacement)
  • Jackknife (at each step, remove 1–5% of the samples)
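The two resampling strategies can be sketched as follows, with patient indices standing in for real samples (function names and sizes are illustrative, not the thesis' code):

```python
import random

def bootstrap(samples, rng):
    """Resample with replacement, back to the original size."""
    return [rng.choice(samples) for _ in samples]

def jackknife(samples, frac, rng):
    """Remove a small fraction (e.g. 1-5%) of the samples, without replacement."""
    k = max(1, int(len(samples) * frac))
    removed = set(rng.sample(range(len(samples)), k))
    return [s for i, s in enumerate(samples) if i not in removed]

rng = random.Random(0)          # fixed seed for reproducibility
patients = list(range(40))      # 40 patient indices as a stand-in dataset
boot = bootstrap(patients, rng)        # same size, duplicates possible
jack = jackknife(patients, 0.05, rng)  # 5% of the patients removed
```

Feature selection would then be rerun on each resampled dataset, and the resulting signatures compared with the stability metrics above.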

  12. Outline
  • Introduction: Motivation; Stability Measure Approach; The bias/variance tradeoff; Contributions
  • Materials and Methods: Stability Metrics; Example (Hamming Distance); Experimental Analysis
  • Results: Visualizing Instability; Stability Results; Accuracy Results; Remarks; Feature Aggregation
  • Discussion: General Remarks; Future Work

  13. Visualizing Instability

  14. Stability Results (1)
  • Stability of Selection (Bootstrap)
  • Stability of Ranking (Bootstrap)

  15. Stability Results (2)
  • Stability of Selection (Jackknife)
  • Stability of Ranking (Jackknife)

  16. Accuracy Results
  [Chart of accuracy results; one series is labeled "No Filtering or Wrapping"]

  17. Remarks
  • Stability
  • Stability is inversely proportional to the size of the perturbation
  • Gram–Schmidt orthogonalization outperforms the classifier-based correlation filters
  • Filters are more stable than the wrapper
  • Stability of selection and stability of ranking are correlated
  • Accuracy
  • Accuracy is proportional to the size of the perturbation
  • Gram–Schmidt orthogonalization is outperformed by the classifier-based correlation filters
  • Filters outperform the wrapper
  • Performance improves after the application of Feature Selection techniques

  18. Feature Aggregation
  • A class permutation test shows no overfitting
  • A t-test between the mean accuracies before and after aggregation reveals an improvement in the performance of the wrapper, but not of the filters
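The aggregation procedure described in Materials and Methods (pool all signatures, ignore duplicates within each, then threshold on selection frequency) can be sketched as follows; the threshold value and the toy signatures are my own, not the thesis':

```python
from collections import Counter

def aggregate(signatures, n_resamplings, min_freq):
    """Pool the resampled signatures and keep the features whose selection
    frequency (fraction of resamplings that selected them) meets min_freq."""
    counts = Counter(g for sig in signatures for g in set(sig))
    return sorted(g for g, c in counts.items() if c / n_resamplings >= min_freq)

# Five toy signatures from five resamplings; keep genes selected >= 60% of the time
signatures = [[1, 2, 3], [1, 2, 4], [1, 3, 5], [1, 2, 3], [2, 3, 6]]
stable = aggregate(signatures, n_resamplings=5, min_freq=0.6)
```

The selection frequencies themselves form the distribution over genes that the thesis proposes as a probabilistic signature.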

  19. Outline
  • Introduction: Motivation; Stability Measure Approach; The bias/variance tradeoff; Contributions
  • Materials and Methods: Stability Metrics; Example (Hamming Distance); Experimental Analysis
  • Results: Visualizing Instability; Stability Results; Accuracy Results; Remarks; Feature Aggregation
  • Discussion: General Remarks; Future Work

  20. General Remarks
  • Similarity metrics: which to use depends on what kind of stability we study
  • Filters are more stable and accurate than wrappers: although wrappers return few variables, their selection procedure can be highly variable
  • One would expect high stability to lead to high accuracy. However, this is not always the case. Why?
  • The best compromise between bias and variance depends on many parameters (feature selection algorithm, top N ranked genes, etc.)
  • Aggregation, seen through the bias/variance tradeoff:
  • High stability (e.g. the filters): lower variance in Feature Selection, a less flexible model, higher bias; aggregation does not improve accuracy
  • Low stability (e.g. the wrapper): higher variance in Feature Selection, a more flexible model, lower bias; aggregation improves accuracy, by adjusting the variance to achieve a better compromise between bias and variance

  21. Conclusions and Future Work
  • Conclusions
  • We have shown that genetic signatures are sensitive to perturbations
  • Stability analysis using similarity metrics is necessary in order to evaluate signature sensitivity
  • The aggregation procedure creates a distribution of selected genes that can be used as a more stable, probabilistic genetic signature for cancer microarray studies
  • It is better to use a more stable probabilistic signature consisting of more genes than a perturbation-sensitive signature consisting of fewer genes
  • Future work
  • Study gene ranking using Markov Chains (MC): the selection of a gene during the selection process could depend on the previously selected gene (1st-order MC) or on several previously selected genes (higher-order MC)
  • Compare the stability of Forward Selection and Backward Elimination wrappers
  • Further research on the relation between stability and accuracy: use of more algorithms, feature ranking based on a stability/accuracy ratio
  • Study the effect on genetic signatures of updating classification models with new data
  • Biological interpretation of the genes selected in probabilistic signatures

  22. Acknowledgements
  • Many thanks to:
  • Gianluca Bontempi (Machine Learning Group, ULB)
  • Christos Sotiriou (Microarray Unit, IJB)
  • Benjamin Haibe-Kains (PhD student, MLG ULB, IJB)
  • Mrs Yiota Poirazi and the Computational Biology Group, FORTH (for this opportunity)
