
Feature Selection and Its Application in Genomic Data Analysis March 9, 2004




  1. Feature Selection and Its Application in Genomic Data Analysis
  March 9, 2004
  Lei Yu, Arizona State University

  2. Outline
  • Introduction to feature selection
    • Motivation
    • Problem statement
    • Key research issues
  • Application in genomic data analysis
    • Overview of data mining for microarray data
    • Gene selection
    • A case study
  • Current research directions

  3. Motivation
  • An active field in pattern recognition, machine learning, data mining, and statistics
  • Goodness of feature selection:
    • Reducing dimensionality
    • Improving learning efficiency
    • Increasing predictive accuracy
    • Reducing complexity of learned results

  4. Problem Statement
  • A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
  • Selecting a minimum subset G such that P(C|G) is equal, or as close as possible, to P(C|F) (Koller and Sahami, 1996)

  5. An Example of the Problem
  • Data set with five Boolean features
  • C = F1 ∨ F2, F3 = ¬F2, F5 = ¬F4
  • Optimal subsets: {F1, F2} or {F1, F3}
  • Searching for an optimal subset is combinatorial in nature
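The Boolean example above can be worked through in a few lines of code: generate the data set implied by the slide's definitions, then search subsets in order of increasing size until a consistent one is found (this exhaustive search is exactly the combinatorial cost the slide warns about; the helper names are illustrative, not from the slides).

```python
from itertools import combinations, product

# Training data generated from the slide's example:
# C = F1 or F2, with redundant features F3 = not F2 and F5 = not F4.
rows = []
for f1, f2, f4 in product([0, 1], repeat=3):
    f3, f5 = 1 - f2, 1 - f4
    rows.append(((f1, f2, f3, f4, f5), f1 | f2))

def consistent(subset):
    """A subset is consistent if no two examples agree on the
    selected features but disagree on the class label."""
    seen = {}
    for feats, c in rows:
        key = tuple(feats[i] for i in subset)
        if seen.setdefault(key, c) != c:
            return False
    return True

# Enumerate subsets by increasing size; stop at the first size
# that contains a consistent subset (hence a minimal one).
for k in range(6):
    minimal = [s for s in combinations(range(5), k) if consistent(s)]
    if minimal:
        break

print([tuple(f"F{i+1}" for i in s) for s in minimal])
# → [('F1', 'F2'), ('F1', 'F3')]
```

Both minimal subsets from the slide are found; note that the loop examines every subset of each size, which is infeasible for the thousands of features discussed later.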

  6. Subset Search
  • An example of a search space over feature subsets (Kohavi and John, 1997)

  7. Evaluation Measures
  • Wrapper model
    • Relying on a predetermined classification algorithm
    • Using predictive accuracy as goodness measure
    • High accuracy, but computationally expensive
  • Filter model
    • Separating feature selection from classifier learning
    • Relying on general characteristics of data (distance, correlation, consistency)
    • No bias toward any learning algorithm; fast

  8. A Framework for Algorithms
  • Subset generation produces a candidate subset from the original set
  • Subset evaluation compares the candidate against the current best subset
  • If the stopping criterion is not met, generation continues; otherwise the current best subset is returned as the selected subset
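The generation/evaluation/stopping loop above can be sketched as greedy forward selection, one common instantiation of this framework (the function names and the toy scoring function here are illustrative assumptions, not from the slides).

```python
# A minimal sketch of the generation / evaluation / stopping loop:
# greedily grow the subset with a caller-supplied evaluate() function.
def forward_select(features, evaluate, min_gain=1e-6):
    selected, best_score = [], float("-inf")
    while True:
        # Subset generation: extend the current best subset by one feature.
        candidates = [selected + [f] for f in features if f not in selected]
        if not candidates:
            break
        best_candidate = max(candidates, key=evaluate)
        # Stopping criterion: stop when no candidate improves the score.
        if evaluate(best_candidate) <= best_score + min_gain:
            break
        selected, best_score = best_candidate, evaluate(best_candidate)
    return selected

# Toy evaluation: reward subsets containing features 0 and 2,
# with a small penalty for subset size.
print(forward_select([0, 1, 2, 3], lambda s: len(set(s) & {0, 2}) - 0.01 * len(s)))
# → [0, 2]
```

A wrapper model would plug in cross-validated classifier accuracy as `evaluate`; a filter model would plug in a data-characteristic measure such as correlation or consistency.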

  9. Feature Ranking
  • Weighting and ranking individual features
  • Selecting top-ranked ones for feature selection
  • Advantages
    • Efficient: O(N) in terms of dimensionality N
    • Easy to implement
  • Disadvantages
    • Hard to determine the threshold
    • Unable to consider correlation between features
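Feature ranking reduces to scoring each feature independently and keeping the top k, as in this minimal sketch (the scoring function below, absolute difference of class means, is a stand-in for whatever per-feature measure is used; all names are illustrative).

```python
# Score each feature independently (O(N) scoring passes over the data),
# sort by score, and keep the indices of the top k features.
def rank_features(X, y, score, k):
    n_features = len(X[0])
    scores = [(score([row[j] for row in X], y), j) for j in range(n_features)]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy per-feature score: absolute difference of class means (binary labels).
def mean_diff(col, y):
    pos = [v for v, c in zip(col, y) if c == 1]
    neg = [v for v, c in zip(col, y) if c == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

X = [[5, 0, 1], [6, 1, 1], [1, 0, 0], [2, 1, 0]]
y = [1, 1, 0, 0]
print(rank_features(X, y, mean_diff, k=2))  # → [0, 2]
```

Note how the two disadvantages show up directly: `k` must be chosen by hand, and each feature is scored in isolation, so two perfectly correlated features would both be kept.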

  10. Applications of Feature Selection
  • Text categorization
    • Yang and Pederson, 1997 (CMU)
    • Forman, 2003 (HP Labs)
  • Image retrieval
    • Swets and Weng, 1995 (MSU)
    • Dy et al., 2003 (Purdue University)
  • Gene expression microarray data analysis
    • Xing et al., 2001 (UC Berkeley)
    • Lee et al., 2003 (Texas A&M)
  • Customer relationship management
    • Ng and Liu, 2000 (NUS)
  • Intrusion detection
    • Lee et al., 2000 (Columbia University)

  11. Microarray Technology
  • Enabling simultaneously measuring the expression levels for thousands or tens of thousands of genes in a single experiment
  • Providing new opportunities and challenges for data mining

  Gene       Value
  M23197_at  261
  U66497_at  88
  M92287_at  4778
  ...        ...

  12. Two Ways to View Microarray Data

  Sample    M23197_at  U66497_at  M92287_at  ...  Class
  Sample 1  261        88         4778       ...  ALL
  Sample 2  101        74         2700       ...  ALL
  Sample 3  1450       34         498        ...  AML
  ...       ...        ...        ...        ...  ...

  13. Data Mining Tasks
  • Data points are genes
    • Clustering: grouping similar genes together to find co-regulated genes
  • Data points are samples
    • Clustering: grouping similar samples together to find classes or subclasses
    • Classification: building a classifier to predict the classes of new samples

  14. Gene Selection
  • Data characteristics in sample classification
    • High dimensionality (thousands of genes)
    • Small sample size (often less than 100 samples)
  • Problems
    • Curse of dimensionality
    • Overfitting the training data
  • Traditional gene selection methods
    • Within the filter model
    • Gene ranking

  15. A Case Study (Golub et al., 1999)
  (Figure: heatmap of normalized expression, scale -3 to 3, ALL vs. AML samples)
  • Leukemia data
    • 7129 genes, 72 samples
    • Training: 38 (27 ALL, 11 AML)
    • Test: 34 (20 ALL, 14 AML)
  • Normalization
    • Mean: 0
    • Standard deviation: 1
  • Correlation measure
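The correlation measure in Golub et al. (1999) is often written as a signal-to-noise ratio, P(g) = (μ_ALL − μ_AML) / (σ_ALL + σ_AML), computed per gene on the training samples. A sketch of that formula follows; the expression values are invented for illustration, and this omits the study's other details (normalization, voting over selected genes).

```python
import statistics

# Signal-to-noise score for one gene, in the spirit of Golub et al. (1999):
# (mean over ALL samples - mean over AML samples) / (sum of the stdevs).
# Genes with large |score| discriminate strongly between the two classes.
def signal_to_noise(expr, labels):
    all_vals = [v for v, c in zip(expr, labels) if c == "ALL"]
    aml_vals = [v for v, c in zip(expr, labels) if c == "AML"]
    return (statistics.mean(all_vals) - statistics.mean(aml_vals)) / (
        statistics.stdev(all_vals) + statistics.stdev(aml_vals))

labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
gene = [2.1, 1.9, 2.3, -1.0, -1.2, -0.8]  # toy expression values
print(round(signal_to_noise(gene, labels), 2))  # → 7.75
```

Ranking all 7129 genes by this score and keeping the top ones is an instance of the feature ranking approach from slide 9, which is why the limitations on the next slide (choosing the number of genes, redundancy) apply.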

  16. Case Study (continued)
  • Performance of selected genes
    • Accuracy on training set: 36 out of 38 (94.74%) correctly classified
    • Accuracy on test set: 29 out of 34 (85.29%) correctly classified
  • Limitations
    • Domain knowledge required to determine the number of genes selected
    • Unable to remove redundant genes

  17. Feature/Gene Redundancy
  • Examining redundant genes
    • Two heads are not necessarily better than one
  • Effects of redundant genes
  • How to handle redundancy
    • A challenge
  • Some recent work
    • MRMR (Maximum Relevance Minimum Redundancy) (Ding and Peng, CSB-2003)
    • FCBF (Fast Correlation Based Filter) (Yu and Liu, ICML-2003)
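The core idea behind redundancy handling can be sketched as follows: walk the genes in order of relevance and drop any gene that is too correlated with one already kept. This is only an illustration of the idea; it uses plain Pearson correlation with a fixed threshold, whereas FCBF itself uses symmetrical uncertainty with a relevance-dependent criterion, so every name and number here is an assumption.

```python
# Pearson correlation between two equal-length expression vectors.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Keep genes in decreasing relevance order, dropping any gene that is
# highly correlated with (i.e. redundant to) a gene already kept.
def remove_redundant(columns, relevance, threshold=0.9):
    order = sorted(range(len(columns)), key=lambda j: -relevance[j])
    kept = []
    for j in order:
        if all(abs(pearson(columns[j], columns[i])) < threshold for i in kept):
            kept.append(j)
    return kept

g0 = [1.0, 2.0, 3.0, 4.0]
g1 = [2.1, 4.0, 6.2, 8.1]   # nearly a linear copy of g0 (redundant)
g2 = [4.0, 1.0, 3.5, 0.5]
print(remove_redundant([g0, g1, g2], relevance=[0.9, 0.8, 0.5]))  # → [0, 2]
```

Gene 1 is dropped despite its high individual relevance because it adds almost nothing beyond gene 0, which is exactly what pure ranking methods cannot do.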

  18. Research Directions
  • Feature selection for unlabeled data
    • Commonalities with the labeled case
    • Differences from the labeled case
  • Dealing with different data types
    • Nominal, discrete, continuous
    • Discretization
  • Dealing with large-size data
  • Comparative study and intelligent selection of feature selection methods

  19. References
  • G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML, 1994.
  • L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML, 2003.
  • T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999.
  • C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB, 2003.
  • J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML 2003 tutorial. http://www.cs.wisc.edu/~dpage/ICML-2003-Tutorial-Shavlik-Page.ppt
