Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang

purpose of this paper • Compare the performance of different discrimination methods • Nearest Neighbor classifier • Linear discriminant analysis • Classification tree • Machine learning approaches: bagging, boosting • Investigate the use of prediction votes to assess the confidence of each prediction

Statistical problems: • The identification of new/unknown tumor classes using gene expression profiles Clustering analysis/unsupervised learning • The classification of malignancies into known classes Discriminant analysis/supervised learning • The identification of marker genes that identified different tumor classes  Variable (Gene) selection

Datasets Gene expression data on p genes for n mRNA samples: n x p matrix X={x ij}, where x ij denotes the expression level of gene (variable) j in ith mRNA sample(observation) Response: k-dimensional vector Y={yi}, where yi denotes the class of observation i • Lymphoma dataset (p=4682, n=81,k=3) • Leukemia dataset (p=3571, n=72, k=3 or 2) • NCI 60 dataset (p=5244, n=61, k=8)

Data preprocessing • Imputation of missing data (KNN) • Standardization of data (Euclidean distance) • preliminary gene selection • Lymphoma dataset (p=4682  p=50, n=81,k=3) Leukemia dataset (p=3571p=40, n=72, k=3) NCI 60 dataset (p=5244p=30, n=61, k=8)

Visual presentation of Leukemia dataset P=3571 p=40 Correlation matrix (72x72) ordered by class Black: 0 correlation / Red: positive correlation / Green: negative correlation

Prediction Methods • Supervised Learning Methods • Machine learning approaches

Supervised Learning Methods • Nearest Neighbor classifier(NN) • Fisher Linear Discriminant Analysis (LDA) • Weighted Gene Voting • Classification trees (CART)

Nearest Neighbor The k-NN rule • Find the k closest observations in the learning set • Predict the class for each element in the test dataset by majority vote • K is chosen by minimizing cross-validation error rate

Linear Discirminantion Analysis FLDA consists of • finding linear functions a’x of the gene expression levels x=(x1, …,xp)with large ratio of between groups to within groups sum of squares • Predicting the class of an observation by the class whose mean vector is closest to the discrimination variables

Maximum likelihood discriminant rules • Predicts the class of an observation x as C(x)=argmaxkpr(x|y=k)

Weighted Gene Voting • An observation x=(x1,…xp) is classified as 1 iff • Prediction strength as the margin of victory(p9)

Classification tree • Constructed by repeated splits of subsets (nodes) • Each terminal subset is assigned a class label • The size of the tree is determined by minimizing the cross validation error rate • Three aspects to tree construction  the selection of the splits  the stopping criteria  the assignment of each terminal node to a class

Aggregated Predictors There are several ways to generate perturbed learning set: • Bagging • Boosting • Convex Pseudo data (CPD)

Bagging Predictors are built for each sub-sample and aggregated by Majority voting with equal wb=1 Non-parametric bootstrap: • drawing at random with replacement to form a perturbed learning sets of the same size as the original learning set • By product: out of bag observations can be used to estimate misclassification rates of bagged predictors • A prediction for each observation (xi, yi) is obtained by aggregating the classifiers in which (xi,yi) is out-of-bag

Bagging (cont.) Parametric bootstrap: • Perturbed learning sets are generated according to a mixture of MVN distributions • For each class k, the class sample mean and covariance matrix were taken as the estimates of distribution parameters • Make sure at least one observation sampled from each class

Boosting The bth step of the boosting algorithm • Get another learning set Lb of the same size nL • Build a classifier based on Lb • Run the learning set L let di=1 if the ith case is classified incorrectly di=0 otherwise • Define b=Pidi and Bbdi=(1- b)/ b Update by pi=piBbdi/  piBbdi • Re-sampling probabilities are reset to equal if b>=1/2 or b=0

Prediction votes For aggregated classifiers, prediction votes assessing the strength of a prediction may be defined for each observation The prediction vote (PV) for an observation x

Study Design • Randomly divide the dataset into a learning and test set (2:1 scheme) For each of N=150 runs: • Select a subset of p genes from the learning set with the largest BSS/WSS • Build the different predictors using the learning sets with p genes • Apply the predictors to the observations in the test set to obtain test set error rates

Results • Test set error rates: apply classifier build based on learning set to test set. Summarized by box-plot over runs • Observation-wise error rates: for each observation, record the proportion of times it was classified incorrectly. Summarized by means of survival plots • Variable selection: compare the effect of increasing or decreasing number of genes (variables)

Leukemia data, two classes

Leukemia data, three classes

Lymphoma data

Conclusions • In the main comparison, NN and DLDA had the smallest error rates, while FLDA had the highest error rates • Aggregation improved the performance of CART classifiers, the largest gains being with boosting and bagging with CPD • For the lymphoma and leukemia datasets, increasing the number of variables to p=200 did not affect much the performance of the various classifiers. There was an improvement for the NCI 60 dataset. • A more carefully selection of a small number of genes (p=10) improved the performance of FLDA dramatically

Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

Presentation Transcript

Statistical Methods for the Screening and Classification of Microarray Gene Expression Data

The Control of Gene Expression

Statistical Methods for Analyzing Ordered Gene Expression Microarray Data

THE CONTROL OF GENE EXPRESSION

Case Study Presentation Systematic Evaluation of Scaling Methods for Gene Expression Data

Analysis of Gene Expression Data

Classification (Discrimination, Supervised Learning) Using Microarray Data

Discrimination and clustering with microarray gene expression data

Classification of Microarray Gene Expression Data

Classification with Gene Expression Data

Scalable Rule-Based Gene Expression Data Classification

Evaluation of Two Methods to Cluster Gene Expression Data

Classification of microarray gene expression data using support vector machines ( SVM )

Comparison of Classification Methods for Customer Attrition Analysis

Comparison of methods for reconstruction of models for gene expression regulation

Probabilistic Techniques for the Clustering of Gene Expression Data

Gene Expression Data

More Analysis of Gene Expression Data

Classification of Microarray Gene Expression Data

Statistical Methods for the Screening and Classification of Microarray Gene Expression Data

Soft clustering of gene expression data

Eigensolvers for analysis of microarray gene expression data