
New EDA-approaches to feature selection for classification (of biological sequences)


Presentation Transcript


  1. New EDA-approaches to feature selection for classification (of biological sequences) Yvan Saeys

  2. Outline • Feature selection in the data mining process • Need for dimensionality reduction techniques in biology • Feature selection techniques • EDA-based wrapper approaches • Constrained EDA approach • EDA-ranking, EDA-weighting • Application to biological sequence classification • Why am I here? Yvan Saeys, Donostia 2004

  3. Feature selection in the data mining process • pre-processing → feature extraction → feature selection → model induction/classification → post-processing Yvan Saeys, Donostia 2004

  4. Need for dimensionality reduction techniques in biology • Many biological processes are far from being completely understood • In order not to miss relevant information: • Take into account as many features as possible • Use dimension reduction techniques to identify the relevant feature subspaces • Additional difficulty: • Many feature dependencies Yvan Saeys, Donostia 2004

  5. Dimension reduction techniques • Feature selection • Projection • Feature ranking • Feature weighting • Compression • … Yvan Saeys, Donostia 2004

  6. Benefits of feature selection • Attain good or even better classification performance using a small subset of features • Provide more cost-effective classifiers • Fewer features to take into account → faster classifiers • Fewer features to store → smaller datasets • Gain more insight into the processes that generated the data Yvan Saeys, Donostia 2004

  7. Feature selection: another layer of complexity • Bias-variance tradeoff of a classifier • Model selection: find the best classifier with the best parameters for the best subset • For every feature subset: model selection • Extra dimension in the search process Yvan Saeys, Donostia 2004

  8. Feature selection strategies • Filter approach • Wrapper approach • Embedded approach • Feature selection based on signal processing techniques [Diagram: filter (FS applied before the classification model), wrapper (an FS search method wrapped around the classification model), embedded (FS inside the classification model, using its parameters)] Yvan Saeys, Donostia 2004

  9. Filter approach • Independent of classification model • Uses only dataset of annotated examples • A relevance measure for each feature is calculated: • e.g. feature-class entropy • Kullback-Leibler divergence (cross-entropy) • Information gain, gain ratio • Normalize relevance scores → weights • Fast, but discards feature dependencies Yvan Saeys, Donostia 2004
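
To make the filter idea concrete, here is a minimal sketch (illustrative, not the code used in this work) that scores binary features by information gain against a binary class label, normalizes the scores into weights, and ranks the features; the function names and the random data are placeholders.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a Bernoulli distribution with P(1) = p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(X, y):
    """Information gain of each binary feature with respect to a binary class.

    X: (n_samples, n_features) 0/1 matrix, y: (n_samples,) 0/1 labels.
    """
    h_class = entropy(y.mean())
    gains = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j]
        p1 = f.mean()                                   # P(feature = 1)
        h1 = entropy(y[f == 1].mean()) if p1 > 0 else 0.0
        h0 = entropy(y[f == 0].mean()) if p1 < 1 else 0.0
        gains[j] = h_class - (p1 * h1 + (1 - p1) * h0)  # class entropy - conditional entropy
    return gains

# normalize relevance scores -> weights, then rank features (best first)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 400))    # e.g. 400 binary sequence features
y = rng.integers(0, 2, size=200)
gains = information_gain(X, y)
weights = gains / gains.sum()
ranking = np.argsort(weights)[::-1]
```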

  10. Wrapper approach • Specific to a classification algorithm • The search for a good feature subset is guided by a search algorithm (e.g. greedy forward or backward) • The algorithm uses the evaluation of the classifier as a guide to find good feature subsets • Examples: sequential forward or backward search, simulated annealing, stochastic iterative sampling (e.g. GA, EDA) • Computationally intensive, but able to take into account feature dependencies Yvan Saeys, Donostia 2004
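
As an illustration of the wrapper idea, the sketch below runs a greedy sequential forward search guided by the cross-validated accuracy of the wrapped classifier. It assumes scikit-learn for the classifier and cross-validation and is not the presentation's own implementation; stopping criteria and data are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def forward_selection(X, y, clf, max_features=20, cv=5):
    """Greedy sequential forward search guided by cross-validated accuracy."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # evaluate the classifier on every candidate extension of the subset
        scores = [(cross_val_score(clf, X[:, selected + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        score, best_j = max(scores)
        if score <= best_score:          # no candidate improves the subset: stop
            break
        best_score = score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected, best_score

# usage: wrap a Naive Bayes classifier around binary sequence features
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 50))
y = rng.integers(0, 2, size=300)
subset, acc = forward_selection(X, y, BernoulliNB())
```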

  11. Embedded approach • Specific to a classification algorithm • Model parameters are directly used to discard features • Examples: • Reduced error pruning in decision trees • Feature elimination using the weight vector of a linear discriminant function • Usually needs only a few additional computations • Able to take into account feature dependencies Yvan Saeys, Donostia 2004
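
A minimal sketch of the second example, feature elimination using the weight vector of a linear discriminant: a linear SVM is fitted repeatedly and the features with the smallest absolute weights are discarded. The scikit-learn usage and the elimination schedule are assumptions made for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=50, drop_per_step=10):
    """Recursive feature elimination driven by the |w_j| of a linear SVM."""
    active = np.arange(X.shape[1])
    while active.size > n_keep:
        clf = LinearSVC(C=1.0, dual=False).fit(X[:, active], y)
        w = np.abs(clf.coef_).ravel()                 # one weight per active feature
        n_drop = min(drop_per_step, active.size - n_keep)
        keep = np.argsort(w)[n_drop:]                 # discard the smallest-weight features
        active = np.sort(active[keep])
    return active

# usage: selected = svm_rfe(X, y, n_keep=50)
```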

  12. EDA-based wrapper approaches Yvan Saeys, Donostia 2004

  13. EDA-based wrapper approaches • Observations for (biological) datasets with many features: • Many feature subsets result in the same classification performance • Many features are irrelevant • Search process spends most of its time in subsets containing approximately half of the number of features Yvan Saeys, Donostia 2004

  14. EDA-based wrapper approaches • Only a small fraction of the features are relevant • Faster evaluation of a classification model when only a small number of features are present • Constrained Estimation of Distribution Algorithm (CDA): • Determine an upper bound U for the maximally allowed number of features in every individual (sample) • Apply a filter to the generated (sampled) individuals: allow at most U features in the subset Yvan Saeys, Donostia 2004
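
A possible reading of the constraint step in code, assuming a simple univariate EDA (UMDA-style marginals): individuals are sampled from the estimated distribution and then repaired so that at most U features remain selected. The repair strategy (randomly switching off excess features) is an assumption for this sketch, not necessarily the one used in the presented work.

```python
import numpy as np

def sample_constrained(p, n_individuals, U, rng):
    """Sample binary individuals from marginals p, keeping at most U ones each."""
    pop = (rng.random((n_individuals, p.size)) < p).astype(int)
    for ind in pop:
        on = np.flatnonzero(ind)
        if on.size > U:                                  # too many features selected
            off = rng.choice(on, size=on.size - U, replace=False)
            ind[off] = 0                                 # switch off randomly chosen excess
    return pop

rng = np.random.default_rng(0)
p = np.full(400, 0.5)            # initial UMDA marginals P(f_i) = 0.5
population = sample_constrained(p, n_individuals=100, U=40, rng=rng)
assert population.sum(axis=1).max() <= 40
```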

  15. EDA-based wrapper approaches: CDA • Advantages: • Huge reduction of the search space • Example: 400 features: • Full search space: 2^400 feature subsets • U = 100: ≈ 3.3 × 10^96 feature subsets • Reduction by more than 23 orders of magnitude • Faster evaluation of a classification model • Scalable to datasets containing a very large number of features • Scalable to more complex classification models (e.g. SVM using higher order polynomial kernel) Yvan Saeys, Donostia 2004
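
The numbers above can be reproduced with a short computation: the unconstrained space contains 2^400 subsets, while with at most U = 100 selected features it contains the sum of C(400, k) for k ≤ 100.

```python
import math

full_space = 2 ** 400                                       # all subsets of 400 features
constrained = sum(math.comb(400, k) for k in range(101))    # at most U = 100 features

print(f"full search space: {full_space:.2e}")               # ~2.6e120
print(f"constrained space: {constrained:.2e}")              # ~3.3e96
print(f"reduction factor:  {full_space / constrained:.1e}") # ~7.8e23, i.e. more than 23 orders of magnitude
```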

  16. CDA: example. Number of classifier evaluations (# Ev), average evaluated subset size (Average # Features) and runtime on the balanced / unbalanced dataset, for SBE versus CDA:

  Classifier  Method  # Features  # Ev    Average # Features  Balanced   Unbalanced
  NBM         SBE     150         68875   294.40              0 h 34 m   1 h 58 m
  NBM         SBE      80         76960   275.98              0 h 36 m   2 h 09 m
  NBM         SBE      40         79380   269.48              0 h 37 m   2 h 11 m
  NBM         CDA     150         67100   150                 0 h 20 m   0 h 46 m
  NBM         CDA      80         67100    80                 0 h 09 m   0 h 21 m
  NBM         CDA      40         67100    40                 0 h 05 m   0 h 11 m
  LSVM        SBE     150         68875   294.40              2 h 15 m   2 h 38 m
  LSVM        SBE      80         76960   275.98              2 h 19 m   2 h 52 m
  LSVM        SBE      40         79380   269.48              2 h 20 m   2 h 54 m
  LSVM        CDA     150         67100   150                 0 h 38 m   0 h 59 m
  LSVM        CDA      80         67100    80                 0 h 17 m   0 h 27 m
  LSVM        CDA      40         67100    40                 0 h 14 m   0 h 19 m
  PSVM        SBE     150         13875   296.26              9 h 11 m   62 h 02 m
  PSVM        SBE      80         15520   277.68              9 h 42 m   63 h 24 m
  PSVM        SBE      40         16020   271.03              9 h 48 m   63 h 40 m
  PSVM        CDA     150         13510   150                 4 h 54 m   16 h 48 m
  PSVM        CDA      80         13510    80                 2 h 48 m    9 h 38 m
  PSVM        CDA      40         13510    40                 1 h 52 m    6 h 16 m

  Yvan Saeys, Donostia 2004

  17. EDA-based feature ranking • Traditional approach to FS • Only use the best individual found during the search -> optimal feature subset • Many questions remain unanswered • The single best subset provides only a static view of the whole elimination process • How many features can still be eliminated before classification performance drops drastically? • Which features can still be eliminated? • Can we get a more dynamic analysis? Yvan Saeys, Donostia 2004

  18. Feature ranking Yvan Saeys, Donostia 2004

  19. EDA-based feature ranking/weighting • Don’t use the single best individual • Use the whole distribution to assess feature weights • Use the weights to rank the features Yvan Saeys, Donostia 2004
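
One way to turn the whole distribution into weights, sketched under the assumption of a univariate EDA: take the marginal selection frequency of every feature in the selected (best-scoring) part of the population as its weight, then rank features by decreasing weight. The top-fraction selection scheme is illustrative.

```python
import numpy as np

def eda_feature_weights(population, fitness, top_fraction=0.5):
    """Estimate feature weights from the selected part of an EDA population.

    population: (n_individuals, n_features) 0/1 matrix,
    fitness:    (n_individuals,) classifier performance of each individual.
    Returns the marginal selection frequency of every feature (the UMDA update),
    used here as a feature weight.
    """
    n_sel = max(1, int(top_fraction * len(fitness)))
    best = np.argsort(fitness)[::-1][:n_sel]     # indices of the best individuals
    return population[best].mean(axis=0)         # P(f_i = 1) in the selected set

# weights = eda_feature_weights(pop, fit); ranking = np.argsort(weights)[::-1]
```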

  20. EDA-based feature weighting • Can be used to do: • Feature weighting • Feature ranking • Feature selection • Problem: how “convergent” should the final population be? • Not enough convergence: no good feature subsets found yet (early stop) • Too much convergence (in the limit, all individuals are the same) • Solution: convergent enough, but not too convergent  Yvan Saeys, Donostia 2004

  21. How to quantify “enough but not too convergent”? • Define the scaled Hamming distance between two individuals A and B as HD_S(A, B) = HD(A, B) / N, where HD(A, B) is the Hamming distance and N the number of features • Convergence of a distribution: the average scaled Hamming distance between all pairs of individuals Yvan Saeys, Donostia 2004
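
The definition translates directly into code; the helper below also computes the convergence measure of a population as the average scaled Hamming distance over all pairs of individuals (a minimal sketch, with names chosen for this illustration).

```python
import numpy as np
from itertools import combinations

def scaled_hamming(a, b):
    """HD_S(A, B) = HD(A, B) / N for two binary individuals of length N."""
    return np.count_nonzero(a != b) / a.size

def convergence(population):
    """Average scaled Hamming distance over all pairs of individuals.

    Roughly 0.5 for a random initial population with P(f_i) = 0.5,
    and 0.0 when all individuals are identical (full convergence).
    """
    pairs = list(combinations(range(len(population)), 2))
    return sum(scaled_hamming(population[i], population[j]) for i, j in pairs) / len(pairs)
```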

  22. Application to gene prediction [Diagram: gene structure on the DNA: promoter region (enhancer, core promoter), transcription start site, exons and introns, start codon, stop codon, poly-A tail; transcription → pre-mRNA; splicing (removal of introns) → mRNA; translation → protein] Yvan Saeys, Donostia 2004

  23. Splice site prediction [Diagram: transcription produces a pre-mRNA with exons Ex1–Ex4 and introns I1–I3; each intron starts with GT (donor site) and ends with AG (acceptor site); pre-mRNA splicing joins the exons, and translation yields the protein] Yvan Saeys, Donostia 2004

  24. Splice site prediction: features • Position dependent features • e.g. an A on position 1, a C on position 17, … • Position independent features • e.g. subsequence “TCG” occurs, “GAG” occurs, … [Example: local context atcgatcagtatcgat GT ctgagctatgag around a candidate donor site, with positions numbered 1, 2, 3, …, 17, …, 28] Yvan Saeys, Donostia 2004
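
A minimal sketch of how such features could be encoded for one local context sequence: one binary indicator per (position, nucleotide) pair for the position-dependent features, and one indicator per 3-letter subsequence for the position-independent features. The exact feature sets used in the experiments may differ; this encoding is only illustrative.

```python
from itertools import product

NUCS = "ACGT"
TRIMERS = ["".join(p) for p in product(NUCS, repeat=3)]   # 64 possible subsequences

def encode(seq):
    """Binary feature vector for one local context sequence.

    Position-dependent: 4 indicator features per position (A/C/G/T at position i).
    Position-independent: 1 indicator per 3-mer (does it occur anywhere in seq?).
    """
    seq = seq.upper()
    pos_dep = [int(seq[i] == n) for i in range(len(seq)) for n in NUCS]
    pos_indep = [int(t in seq) for t in TRIMERS]
    return pos_dep + pos_indep

# a local context of 100 nucleotides -> 400 position-dependent + 64 3-mer features
x = encode("ACGT" * 25)
assert len(x) == 100 * 4 + 64
```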

  25. Acceptor prediction • Dataset: • 3000 positives • 18,000 negatives • Local context of 100 nucleotides [50,50] • 100 4-valued features → 400 binary features • Classifiers: • Naïve Bayes method • C4.5 • Linear SVM Yvan Saeys, Donostia 2004

  26. A trial on acceptor prediction • 400 binary features (position dependent nucleotides) • Initial distribution: P(f_i) = 0.5 • C(D_0) ≈ 0.5 (each pair of individuals has on average half of the features in common) • C(D) = 0 when all individuals are the same (fully converged) Yvan Saeys, Donostia 2004

  27. Evolution of convergence Yvan Saeys, Donostia 2004

  28. Evaluation of convergence rate Yvan Saeys, Donostia 2004

  29. Evaluation of convergence rate Yvan Saeys, Donostia 2004

  30. EDA-based feature ranking • Best results obtained with a “semi-converged” population • Not looking for the best subset anymore, but for the best distribution • Advantages: • Needs fewer iterations • Dynamic view of the feature selection process Yvan Saeys, Donostia 2004

  31. EDA-based feature weighting • Color coding feature weights to visualize new patterns: a color-coded mapping of the interval [0, 1] (cold → middle → hot) Yvan Saeys, Donostia 2004
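
As an illustration of the color coding, the sketch below maps weights scaled to [0, 1] onto a cold-to-hot colormap and draws them as a position × nucleotide heat map; the matplotlib usage, the colormap choice and the random weights are assumptions for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical weights for 400 position-dependent features (100 positions x A/C/G/T)
rng = np.random.default_rng(0)
weights = rng.random(400)
weights = (weights - weights.min()) / (weights.max() - weights.min())  # scale to [0, 1]

# cold (low weight) -> hot (high weight) color coding, one row per nucleotide
grid = weights.reshape(100, 4).T
plt.imshow(grid, cmap="coolwarm", aspect="auto", vmin=0.0, vmax=1.0)
plt.yticks(range(4), list("ACGT"))
plt.xlabel("position in local context")
plt.colorbar(label="feature weight")
plt.show()
```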

  32. Donor prediction: 400 features [Figure: color-coded feature weights; annotations: local context, 3-base periodicity, G-rich region?, T-rich region?] Yvan Saeys, Donostia 2004

  33. Donor prediction: 528 features Yvan Saeys, Donostia 2004

  34. Donor prediction: 2096 features Yvan Saeys, Donostia 2004

  35. Acceptor prediction: 400 features [Figure: color-coded feature weights; annotations: local context, 3-base periodicity, T-rich region (poly-pyrimidine stretch)] Yvan Saeys, Donostia 2004

  36. Acceptor prediction: 528 features Yvan Saeys, Donostia 2004

  37. Acceptor prediction: 2096 features [Figure: color-coded feature weights; annotations: AG-scanning, TG] Yvan Saeys, Donostia 2004

  38. Comparison with NBM Yvan Saeys, Donostia 2004

  39. Related & Future work • Embedded feature selection in SVM with C-retraining • Feature selection tree: combination of filter feature selection and decision tree • Combining Bayesian decision trees and feature selection • Combinatorial pattern matching in biological sequences • Feature Selection Toolkit for large scale applications (FeaST) Yvan Saeys, Donostia 2004

  40. Why am I here? • Establish collaboration between our research groups • Getting to know each other • Think about future collaborations • Define collaborative research projects • Exchange thoughts / learn more about EDA methods • Probabilistic graphical models for classification • Biological problems • Some ‘test cases’ during these months: apply some of ‘your’ techniques to ‘our’ data • … Yvan Saeys, Donostia 2004

  41. Thank you !! Yvan Saeys, Donostia 2004
