
Feature Selection and Its Application in Genomic Data Analysis March 9, 2004




  1. Feature Selection and Its Application in Genomic Data Analysis
  March 9, 2004
  Lei Yu, Arizona State University

  2. Outline
  • Introduction to feature selection
    • Motivation
    • Problem statement
    • Key research issues
  • Application in genomic data analysis
    • Overview of data mining for microarray data
    • Gene selection
    • A case study
  • Current research directions

  3. Motivation
  • An active field in pattern recognition, machine learning, data mining, and statistics
  • Goodness of feature selection:
    • Reducing dimensionality
    • Improving learning efficiency
    • Increasing predictive accuracy
    • Reducing complexity of learned results

  4. Problem Statement
  • A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
  • Selecting a minimum subset G such that P(C|G) is equal, or as close as possible, to P(C|F) (Koller and Sahami, 1996)

  5. An Example of the Problem
  • Data set with five Boolean features
  • C = F1 ∨ F2, F3 = ¬F2, F5 = ¬F4
  • Optimal subsets: {F1, F2} or {F1, F3}
  • Searching for an optimal subset is combinatorial in nature
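The Boolean example above can be worked through in a few lines of code: generate the data set implied by the slide's definitions, then search subsets in order of increasing size until a consistent one is found (this exhaustive search is exactly the combinatorial cost the slide warns about; the helper names are illustrative, not from the slides).

```python
from itertools import combinations, product

# Training data generated from the slide's example:
# C = F1 or F2, with redundant features F3 = not F2 and F5 = not F4.
rows = []
for f1, f2, f4 in product([0, 1], repeat=3):
    f3, f5 = 1 - f2, 1 - f4
    rows.append(((f1, f2, f3, f4, f5), f1 | f2))

def consistent(subset):
    """A subset is consistent if no two examples agree on the
    selected features but disagree on the class label."""
    seen = {}
    for feats, c in rows:
        key = tuple(feats[i] for i in subset)
        if seen.setdefault(key, c) != c:
            return False
    return True

# Enumerate subsets by increasing size; stop at the first size
# that contains a consistent subset (hence a minimal one).
for k in range(6):
    minimal = [s for s in combinations(range(5), k) if consistent(s)]
    if minimal:
        break

print([tuple(f"F{i+1}" for i in s) for s in minimal])
# → [('F1', 'F2'), ('F1', 'F3')]
```

Both minimal subsets from the slide are found; note that the loop examines every subset of each size, which is infeasible for the thousands of features discussed later.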

  6. Subset Search
  • An example of a search space over feature subsets (Kohavi and John, 1997)

  7. Evaluation Measures
  • Wrapper model
    • Relying on a predetermined classification algorithm
    • Using predictive accuracy as goodness measure
    • High accuracy, but computationally expensive
  • Filter model
    • Separating feature selection from classifier learning
    • Relying on general characteristics of data (distance, correlation, consistency)
    • No bias toward any learning algorithm; fast

  8. A Framework for Algorithms
  • Subset generation produces a candidate subset from the original set
  • Subset evaluation compares the candidate against the current best subset
  • If the stopping criterion is not met, generation continues; otherwise the current best subset is returned as the selected subset
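The generation/evaluation/stopping loop above can be sketched as greedy forward selection, one common instantiation of this framework (the function names and the toy scoring function here are illustrative assumptions, not from the slides).

```python
# A minimal sketch of the generation / evaluation / stopping loop:
# greedily grow the subset with a caller-supplied evaluate() function.
def forward_select(features, evaluate, min_gain=1e-6):
    selected, best_score = [], float("-inf")
    while True:
        # Subset generation: extend the current best subset by one feature.
        candidates = [selected + [f] for f in features if f not in selected]
        if not candidates:
            break
        best_candidate = max(candidates, key=evaluate)
        # Stopping criterion: stop when no candidate improves the score.
        if evaluate(best_candidate) <= best_score + min_gain:
            break
        selected, best_score = best_candidate, evaluate(best_candidate)
    return selected

# Toy evaluation: reward subsets containing features 0 and 2,
# with a small penalty for subset size.
print(forward_select([0, 1, 2, 3], lambda s: len(set(s) & {0, 2}) - 0.01 * len(s)))
# → [0, 2]
```

A wrapper model would plug in cross-validated classifier accuracy as `evaluate`; a filter model would plug in a data-characteristic measure such as correlation or consistency.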

  9. Feature Ranking
  • Weighting and ranking individual features
  • Selecting top-ranked ones for feature selection
  • Advantages
    • Efficient: O(N) in terms of dimensionality N
    • Easy to implement
  • Disadvantages
    • Hard to determine the threshold
    • Unable to consider correlation between features
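Feature ranking reduces to scoring each feature independently and keeping the top k, as in this minimal sketch (the scoring function below, absolute difference of class means, is a stand-in for whatever per-feature measure is used; all names are illustrative).

```python
# Score each feature independently (O(N) scoring passes over the data),
# sort by score, and keep the indices of the top k features.
def rank_features(X, y, score, k):
    n_features = len(X[0])
    scores = [(score([row[j] for row in X], y), j) for j in range(n_features)]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy per-feature score: absolute difference of class means (binary labels).
def mean_diff(col, y):
    pos = [v for v, c in zip(col, y) if c == 1]
    neg = [v for v, c in zip(col, y) if c == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

X = [[5, 0, 1], [6, 1, 1], [1, 0, 0], [2, 1, 0]]
y = [1, 1, 0, 0]
print(rank_features(X, y, mean_diff, k=2))  # → [0, 2]
```

Note how the two disadvantages show up directly: `k` must be chosen by hand, and each feature is scored in isolation, so two perfectly correlated features would both be kept.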

  10. Applications of Feature Selection
  • Text categorization
    • Yang and Pederson, 1997 (CMU)
    • Forman, 2003 (HP Labs)
  • Image retrieval
    • Swets and Weng, 1995 (MSU)
    • Dy et al., 2003 (Purdue University)
  • Gene expression microarray data analysis
    • Xing et al., 2001 (UC Berkeley)
    • Lee et al., 2003 (Texas A&M)
  • Customer relationship management
    • Ng and Liu, 2000 (NUS)
  • Intrusion detection
    • Lee et al., 2000 (Columbia University)

  11. Microarray Technology
  • Enabling simultaneously measuring the expression levels for thousands or tens of thousands of genes in a single experiment
  • Providing new opportunities and challenges for data mining

  Gene       Value
  M23197_at  261
  U66497_at  88
  M92287_at  4778
  ...        ...

  12. Two Ways to View Microarray Data

  Sample    M23197_at  U66497_at  M92287_at  ...  Class
  Sample 1  261        88         4778       ...  ALL
  Sample 2  101        74         2700       ...  ALL
  Sample 3  1450       34         498        ...  AML
  ...       ...        ...        ...        ...  ...

  13. Data Mining Tasks
  • Data points are genes
    • Clustering: grouping similar genes together to find co-regulated genes
  • Data points are samples
    • Clustering: grouping similar samples together to find classes or subclasses
    • Classification: building a classifier to predict the classes of new samples

  14. Gene Selection
  • Data characteristics in sample classification
    • High dimensionality (thousands of genes)
    • Small sample size (often less than 100 samples)
  • Problems
    • Curse of dimensionality
    • Overfitting the training data
  • Traditional gene selection methods
    • Within the filter model
    • Gene ranking

  15. A Case Study (Golub et al., 1999)
  (Figure: heatmap of normalized expression, scale -3 to 3, ALL vs. AML samples)
  • Leukemia data
    • 7129 genes, 72 samples
    • Training: 38 (27 ALL, 11 AML)
    • Test: 34 (20 ALL, 14 AML)
  • Normalization
    • Mean: 0
    • Standard deviation: 1
  • Correlation measure
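The correlation measure in Golub et al. (1999) is often written as a signal-to-noise ratio, P(g) = (μ_ALL − μ_AML) / (σ_ALL + σ_AML), computed per gene on the training samples. A sketch of that formula follows; the expression values are invented for illustration, and this omits the study's other details (normalization, voting over selected genes).

```python
import statistics

# Signal-to-noise score for one gene, in the spirit of Golub et al. (1999):
# (mean over ALL samples - mean over AML samples) / (sum of the stdevs).
# Genes with large |score| discriminate strongly between the two classes.
def signal_to_noise(expr, labels):
    all_vals = [v for v, c in zip(expr, labels) if c == "ALL"]
    aml_vals = [v for v, c in zip(expr, labels) if c == "AML"]
    return (statistics.mean(all_vals) - statistics.mean(aml_vals)) / (
        statistics.stdev(all_vals) + statistics.stdev(aml_vals))

labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
gene = [2.1, 1.9, 2.3, -1.0, -1.2, -0.8]  # toy expression values
print(round(signal_to_noise(gene, labels), 2))  # → 7.75
```

Ranking all 7129 genes by this score and keeping the top ones is an instance of the feature ranking approach from slide 9, which is why the limitations on the next slide (choosing the number of genes, redundancy) apply.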

  16. Case Study (continued)
  • Performance of selected genes
    • Accuracy on training set: 36 out of 38 (94.74%) correctly classified
    • Accuracy on test set: 29 out of 34 (85.29%) correctly classified
  • Limitations
    • Domain knowledge required to determine the number of genes selected
    • Unable to remove redundant genes

  17. Feature/Gene Redundancy
  • Examining redundant genes
    • Two heads are not necessarily better than one
  • Effects of redundant genes
  • How to handle redundancy
    • A challenge
  • Some recent work
    • MRMR (Maximum Relevance Minimum Redundancy) (Ding and Peng, CSB-2003)
    • FCBF (Fast Correlation Based Filter) (Yu and Liu, ICML-2003)
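The core idea behind redundancy handling can be sketched as follows: walk the genes in order of relevance and drop any gene that is too correlated with one already kept. This is only an illustration of the idea; it uses plain Pearson correlation with a fixed threshold, whereas FCBF itself uses symmetrical uncertainty with a relevance-dependent criterion, so every name and number here is an assumption.

```python
# Pearson correlation between two equal-length expression vectors.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Keep genes in decreasing relevance order, dropping any gene that is
# highly correlated with (i.e. redundant to) a gene already kept.
def remove_redundant(columns, relevance, threshold=0.9):
    order = sorted(range(len(columns)), key=lambda j: -relevance[j])
    kept = []
    for j in order:
        if all(abs(pearson(columns[j], columns[i])) < threshold for i in kept):
            kept.append(j)
    return kept

g0 = [1.0, 2.0, 3.0, 4.0]
g1 = [2.1, 4.0, 6.2, 8.1]   # nearly a linear copy of g0 (redundant)
g2 = [4.0, 1.0, 3.5, 0.5]
print(remove_redundant([g0, g1, g2], relevance=[0.9, 0.8, 0.5]))  # → [0, 2]
```

Gene 1 is dropped despite its high individual relevance because it adds almost nothing beyond gene 0, which is exactly what pure ranking methods cannot do.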

  18. Research Directions
  • Feature selection for unlabeled data
    • Commonalities with the labeled case
    • Differences from the labeled case
  • Dealing with different data types
    • Nominal, discrete, continuous
    • Discretization
  • Dealing with large-size data
  • Comparative study and intelligent selection of feature selection methods

  19. References
  • G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML, 1994.
  • L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML, 2003.
  • T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999.
  • C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB, 2003.
  • J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML 2003 tutorial. http://www.cs.wisc.edu/~dpage/ICML-2003-Tutorial-Shavlik-Page.ppt
