Advanced analysis: Classification, clustering and other multivariate methods.

Presentation Transcript


  1. Advanced analysis: Classification, clustering and other multivariate methods. Statistics for Microarray Data Analysis – Lecture 4. The Fields Institute for Research in Mathematical Sciences, May 25, 2002

  2. Cluster and discriminant analysis • These techniques group, or equivalently classify, observational units on the basis of measurements. • They differ according to their aims, which in turn depend on the availability of a pre-existing basis for the grouping. • In cluster analysis, there are no predefined groups or labels for the observations, while discriminant analysis is based on the existence of such groups or labels. • Alternative terminology • Computer science: unsupervised and supervised learning. • Microarray literature: class discovery and class prediction.

  3. Tumor classification A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables. In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous. DNA microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression on a genomic scale. This may lead to a more reliable classification of tumors.

  4. Tumor classification, cont There are three main types of statistical problems associated with tumor classification: 1. The identification of new/unknown tumor classes using gene expression profiles - cluster analysis; 2. The classification of malignancies into known classes - discriminant analysis; 3. The identification of “marker” genes that characterize the different tumor classes - variable selection. These issues are relevant to many other questions, e.g. characterizing/classifying neurons or the toxicity of chemicals administered to cells or model animals.

  5. Clustering microarray data We can cluster genes (rows), mRNA samples (cols), or both at once. • Clustering leads to readily interpretable figures. • Clustering strengthens the signal when averages are taken within clusters of genes (Eisen). • Clustering can be helpful for identifying patterns in time or space. Our main experience is in this area. • Clustering is useful, perhaps essential, when seeking new subclasses of cell samples (tumors, etc).

  6. Recap: cDNA gene expression data Data on p genes for n mRNA samples:

             sample1  sample2  sample3  sample4  sample5  ...
     gene 1    0.46     0.30     0.80     1.51     0.90   ...
     gene 2   -0.10     0.49     0.24     0.06     0.46   ...
     gene 3    0.15     0.74     0.04     0.10     0.20   ...
     gene 4   -0.45    -1.03    -0.79    -0.56    -0.32   ...
     gene 5   -0.06     1.06     1.35     1.09    -1.09   ...

  xij = expression level of gene i in mRNA sample j = log(Red intensity / Green intensity); yj = tumor class for sample j.
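For readers who handle these matrices in code, a minimal sketch of the same layout as a labelled data frame (pandas is an assumption here, and the class labels in y are made up for illustration):

```python
import pandas as pd

# p genes (rows) x n mRNA samples (columns) of log ratios, as in the slide
X = pd.DataFrame(
    [[ 0.46,  0.30,  0.80,  1.51,  0.90],
     [-0.10,  0.49,  0.24,  0.06,  0.46],
     [ 0.15,  0.74,  0.04,  0.10,  0.20],
     [-0.45, -1.03, -0.79, -0.56, -0.32],
     [-0.06,  1.06,  1.35,  1.09, -1.09]],
    index=[f"gene{i}" for i in range(1, 6)],
    columns=[f"sample{j}" for j in range(1, 6)],
)

# yj = tumor class label for sample j (hypothetical labels)
y = pd.Series(["classA", "classA", "classB", "classB", "classA"], index=X.columns)
```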

  7. The problem of finding patterns In many situations we have a set of hybridizations with temporal or spatial aspects to them, and seek genes exhibiting patterns. An early example was the yeast cell cycle data, a relatively easy case, as the nature of the pattern of interest was clear. Harder examples include developmental time series, or the current case study (the mouse olfactory bulb). Here genes are “on” or “off”, or “come on” and “go off” separately and jointly, according to their own programs. In a sense every gene has its own pattern, though we expect some sharing of patterns. But most importantly, we usually don’t know the patterns a priori. Our problem seems to be distinguishing interesting or real patterns from meaningless variation, at the level of the gene.

  8. The problem, continued Current solutions to our problem include the various forms of principal component analysis (PCA), e.g. singular value decomposition (SVD) and other descriptive multivariate techniques, and, of course, cluster analysis. These all seem to be based on the idea that we are mostly interested in patterns shared by enough genes to stand out. Can we do better? How can contextual information be used to guide our search? Have I even posed a meaningful question? Few answers here.

  9. Strategies to find patterns Two principal approaches (that I can see): 1a. Find the genes whose expression fits specific, predefined patterns. 1b. Find the genes whose expression follows the pattern of a predefined gene or set of genes. 2. Carry out some kind of exploratory analysis to see what expression patterns emerge. I plan to briefly touch on 1a, and then discuss approach 2.

  10. Some comments on strategy 2 In essence we are using the large-scale hybridizations as a screen. Some reasonable fraction of genes with presumptive 'interesting patterns' will have these checked by in situ hybs or something similar. In essence we want to select genes, based on our analysis, in a way that gives a good hit rate (low false positive rate), certainly compared to random selection or checking out 'candidate' genes. Some patterns are 'stronger' than others, and we would like to see the full range of 'real' patterns in our data, including real but weak ones. A strategy we tried early on was this: simply rank genes on variation across the set of hybs. The problem was that the highly ranked genes tended to give the same few patterns: we didn't see the full range.

  11. Strategy 1a: identifying genes with specific, predefined patterns

  12. Finding early and late response genes in a short time series [Plot: M against time (1/4, 1/2, 1, 4, 16, 24) for clone 576, with the reference pattern u.] Which genes increase or decrease like the function x²?

  13. Doing this in the vector world • For each gene, we have a vector, y, of expression estimates at the different time points. • Project the vector onto the space spanned by the vector u (the values of x² at our time points). • C is the scalar product of y with u.

  14. Use a qq-plot of C to identify genes with “significant” projections along u.
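A minimal sketch of the projection-plus-qq-plot idea, on randomly generated profiles rather than the real data; the time points follow the slide, but the centring and scaling of u and the use of scipy/matplotlib are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data: expression estimates for 5000 genes at the 6 time points
rng = np.random.default_rng(1)
times = np.array([0.25, 0.5, 1, 4, 16, 24])
Y = rng.normal(size=(5000, len(times)))        # rows = genes, cols = time points

# Reference pattern u: the values of x^2 at the time points, centred and scaled
u = times ** 2
u = u - u.mean()
u = u / np.linalg.norm(u)

# C_i = scalar product of gene i's (centred) profile with u, i.e. the length
# of its projection onto the space spanned by u
Yc = Y - Y.mean(axis=1, keepdims=True)
C = Yc @ u

# Normal qq-plot of C: genes far off the line in either tail have
# "significant" projections along u (they increase or decrease like x^2)
stats.probplot(C, dist="norm", plot=plt)
plt.ylabel("projection C")
plt.show()
```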

  15. Strategy 2: pattern discovery

  16. Clustering microarray data We can cluster cell samples (cols), e.g. neuron cells, for identification (profiles). Also, we might want to estimate the number of different neuron cell types in a set of samples, based on gene expression levels. We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes. We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.

  17. Row & column clusters Taken from Nature February, 2000. Paper by A Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by Gene expression profiling.

  18. Discovering tumor subclasses

  19. Basic principles of clustering Aim: to group observations that are “similar” based on predefined criteria. Issues: Which genes / arrays to use? Which similarity or dissimilarity measure? Which clustering algorithm? It is advisable to reduce the number of genes from the full set to some more manageable number, before clustering. The basis for this reduction is usually quite context specific, see later example.

  20. There are many measures of (dis)similarity • Euclidean distance; • Mahalanobis distance; • Manhattan metric; • Minkowski metric; • Canberra metric; • one minus correlation; etc. And there are many methods of clustering…
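As an illustration only, the listed measures map directly onto scipy's pairwise-distance routines; the data here are random and the metric names are scipy's, not the lecture's:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))      # 50 expression profiles measured on 8 arrays

euclid      = squareform(pdist(X, metric="euclidean"))
manhattan   = squareform(pdist(X, metric="cityblock"))    # Manhattan metric
minkowski   = squareform(pdist(X, metric="minkowski", p=3))
canberra    = squareform(pdist(X, metric="canberra"))
mahalanobis = squareform(pdist(X, metric="mahalanobis"))  # uses the sample covariance
one_minus_r = squareform(pdist(X, metric="correlation"))  # 1 - Pearson correlation
```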

  21. Partitioning methods Partition the data into a prespecified number k of mutually exclusive and exhaustive groups. Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares. Examples: • k-means, self-organizing maps (SOM), PAM, etc.; • Fuzzy: needs stochastic model, e.g. Gaussian mixtures.
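A brief k-means sketch of the partitioning idea, using scikit-learn on random data; the choice of k = 4 and of scikit-learn itself are assumptions, not part of the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))        # 500 genes measured on 10 arrays (hypothetical)

# Partition into a prespecified k clusters, iteratively reallocating
# observations to minimise the within-cluster sum of squares
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_                   # cluster assignment for each gene
wss = km.inertia_                     # within-cluster sum of squares at convergence
```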

  22. Hierarchical methods Hierarchical clustering methods produce a tree or dendrogram. They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level. The tree can be built in two distinct ways • bottom-up: agglomerative clustering; • top-down: divisive clustering.

  23. Agglomerative methods • Start with n clusters. At each step, merge the two closest clusters using a measure of between-cluster dissimilarity, which reflects the shape of the clusters. • Between-cluster dissimilarity measures • Unweighted Pair Group Method with Arithmetic mean (UPGMA): average of pairwise dissimilarities. • Single-link: minimum of pairwise dissimilarities. • Complete-link: maximum of pairwise dissimilarities.
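These three between-cluster dissimilarities correspond to the "average", "single" and "complete" linkage options in scipy; a sketch on random data (the data and the cut at k = 5 are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))        # 100 genes measured on 10 arrays
D = pdist(X, metric="euclidean")      # pairwise dissimilarities

# UPGMA, single-link and complete-link agglomeration
Z_upgma    = linkage(D, method="average")
Z_single   = linkage(D, method="single")
Z_complete = linkage(D, method="complete")

# Cutting the tree at some level yields a partition for each k, e.g. k = 5
labels_k5 = fcluster(Z_upgma, t=5, criterion="maxclust")

dendrogram(Z_upgma)                   # the full tree
plt.show()
```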

  24. Divisive methods • Start with only one cluster. At each step, split clusters into two parts. • Advantages. Obtain the main structure of the data, i.e. focus on upper levels of the dendrogram. • Disadvantages. Computational difficulties when considering all possible divisions into two groups.

  25. Partitioning vs. hierarchical • Partitioning. Advantages: optimal for certain criteria. Disadvantages: need the initial k; often require long computation times. • Hierarchical. Advantages: faster computation. Disadvantages: rigid; cannot correct later for erroneous decisions made earlier.

  26. Three generic clustering problems Three important tasks (which are generic) are: 1. Estimating the number of clusters; 2. Assigning each observation to a cluster; 3. Assessing the strength/confidence of cluster assignments for individual observations. Not equally important in every problem. We now return to our mouse olfactory bulb experiment.

  27. Recall: we have sets of contrasts (two forms) across the bulb as our patterns. [Figure: example expression pattern for gene #15,228.]

  28. Patterns and a metric on them Aim: To identify spatial patterns of differential gene expression. For every gene i, we can define a magnitude di for its expression profile. Simply ranking genes according to the size of di produced a list dominated by similar patterns. Sub-aim: To identify genes with different gene expression patterns. The slight novelty here is that we are clustering on estimated contrasts from the data, not the log ratios themselves.

  29. Clustering genes based on their similarity in expression patterns • Start with sets of genes exhibiting some minimal level of differential expression across the bulb: the “top 100” from each of the 15 comparisons, giving 635 genes here. • Carry out hierarchical clustering of the 635 genes, building a dendrogram using Mahalanobis distance and Ward agglomeration. • Now consider all clusters of ≥ 2 genes in the tree; singletons are added separately. • Measure the heterogeneity h of a cluster by calculating the 15 SDs across the cluster of each of the pairwise effects, and taking the largest. • Choose a score s (see plots) and take all maximal disjoint clusters with h < s. Here we used s = 0.46 and obtained 16 clusters.
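A rough sketch of this procedure on stand-in random data (not the olfactory-bulb data). SciPy's Ward linkage is defined for Euclidean distance, so here the contrasts are whitened first, which makes Euclidean distance on the transformed data equal to Mahalanobis distance on the originals; the greedy top-down walk returns maximal disjoint clusters with h < s:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Stand-in for the real data: 635 genes x 15 estimated pairwise effects
rng = np.random.default_rng(5)
contrasts = rng.normal(size=(635, 15))

# Whiten with the inverse Cholesky factor of the covariance, then Ward linkage
L = np.linalg.cholesky(np.cov(contrasts, rowvar=False))
Z = linkage(contrasts @ np.linalg.inv(L).T, method="ward")

def heterogeneity(idx):
    """h = largest of the 15 within-cluster SDs of the pairwise effects."""
    return contrasts[idx].std(axis=0, ddof=1).max()

# Take all maximal disjoint clusters with h < s: walk down from the root and
# stop splitting as soon as a subtree is homogeneous enough.  Singleton genes
# are handled separately, as in the lecture.
s = 0.46
selected = []
stack = [to_tree(Z)]
while stack:
    node = stack.pop()
    if node.is_leaf():
        continue
    members = node.pre_order()        # indices of the genes under this node
    if heterogeneity(members) < s:
        selected.append(members)      # maximal: do not descend further
    else:
        stack.extend([node.left, node.right])

print(f"{len(selected)} clusters found with h < {s}")
```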

  30. Plots guiding choice of clusters of genes [Plots: number of genes and number of clusters (patterns) against cluster heterogeneity h (max of the 15 SDs).]

  31. [Figure: the 15 pairwise (p/w) effects (e.g. DA, VA, MA, LA, MP, VP, DP, LP, VD, MD, LD, MV, LM, LV) plotted for each gene. Red: genes chosen; blue: controls.]

  32. 16 Clusters Systematically Arranged (6-point representation). [Figure.]

  33. One Cluster (6-point representation). Thick red line: average. Dashed blue: ±2h.

  34. Visualizing gene expression patterns across regions in our data set, and finding a set of genes that show different spatial patterns.

  35. Discrimination

  36. Discrimination • A predictor or classifier for K tumor classes partitions the space X of gene expression profiles into K disjoint subsets, A1, ..., AK, such that for a sample with expression profile x = (x1, ..., xp) ∈ Ak the predicted class is k. • Predictors are built from past experience, i.e., from observations which are known to belong to certain classes. Such observations comprise the learning set L = (x1, y1), ..., (xn, yn). • A classifier built from a learning set L is denoted by C(·, L): X → {1, 2, ..., K}, with the predicted class for observation x being C(x, L).

  37. Fisher linear discriminant analysis First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA) consists of i. finding linear combinations xa of the gene expression profiles x = (x1, ..., xp) with large ratios of between-groups to within-groups sums of squares - the discriminant variables; ii. predicting the class of an observation x by the class whose mean vector is closest to x in terms of the discriminant variables.

  38. FLDA

  39. Maximum likelihood discriminant rules When the class conditional densities pr(x | y = k) are known, the maximum likelihood (ML) discriminant rule predicts the class of an observation x by C(x) = argmax_k pr(x | y = k). For multivariate normal class densities, i.e., for x | y = k ~ N(μk, Σk), this is (in general) a quadratic rule. In practice, population mean vectors and covariance matrices are estimated by the corresponding sample quantities.

  40. ML discriminant rules - special cases 1. Linear discriminant analysis, LDA. When the class densities have the same covariance matrix, Σk = Σ, the discriminant rule is based on the square of the Mahalanobis distance and is linear: C(x) = argmin_k (x − μk)′ Σ⁻¹ (x − μk). 2. Diagonal linear discriminant analysis, DLDA. In this simplest case, the class densities have the same diagonal covariance matrix Σ = diag(σ1², …, σp²), so C(x) = argmin_k ∑j (xj − μkj)² / σj². Note: the weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (wrong variance calculation).
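A small illustration of the two rules on synthetic data: LDA via scikit-learn, and DLDA written out directly from the rule above. The data generator and the pooled-variance details are assumptions, not part of the lecture:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic learning set: n = 80 samples, p = 30 "genes", 3 classes
X, y = make_classification(n_samples=80, n_features=30, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# LDA: common (full) covariance matrix estimated from the pooled data
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict(X[:5]))

# DLDA: common *diagonal* covariance, written out from the rule
#   C(x) = argmin_k  sum_j (x_j - mu_kj)^2 / sigma_j^2
classes = np.unique(y)
means = np.array([X[y == k].mean(axis=0) for k in classes])
resid = np.concatenate([X[y == k] - means[i] for i, k in enumerate(classes)])
s2 = resid.var(axis=0, ddof=len(classes))        # pooled within-class variances

def dlda_predict(x):
    d2 = (((x - means) ** 2) / s2).sum(axis=1)   # one value per class
    return classes[np.argmin(d2)]

print([dlda_predict(x) for x in X[:5]])
```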

  41. Nearest neighbor rule Nearest neighbor methods are based on a measure of distance between observations, such as the Euclidean distance or one minus the correlation between two gene expression profiles. The k-nearest neighbor rule, due to Fix and Hodges (1951), classifies an observation x as follows: i. find the k observations in the learning set that are closest to x; ii. predict the class of x by majority vote, i.e., choose the class that is most common among those k observations. The number of neighbors k can be chosen by cross-validation.
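A brief sketch of the k-nearest neighbor rule with k chosen by cross-validation, using scikit-learn on synthetic data (the candidate values of k and the 5-fold split are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=30, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Choose the number of neighbours k by cross-validation
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)

# Classify by majority vote among the best_k closest learning-set observations
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print(best_k, knn.predict(X[:5]))
```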

  42. Nearest neighbor rule

  43. Classification trees Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself. Each terminal subset is assigned a class label and the resulting partition of X corresponds to the classifier. Three main aspects of tree construction are: i. the selection of the splits; ii. the decision to declare a node terminal or to continue splitting; iii. the assignment of each terminal node to a class. Different tree classifiers use different approaches to these three issues. Here, we use CART: Classification And Regression Trees, Breiman et al. (1984).
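Not CART itself, but a CART-style tree via scikit-learn on synthetic data, showing the three ingredients: split selection by an impurity criterion, a simple stopping rule, and majority-class labels at the terminal nodes (all settings here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=80, n_features=30, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# i. splits chosen by Gini impurity; ii. splitting stopped by a minimum node
# size; iii. each terminal node labelled by its majority class
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5,
                              random_state=0).fit(X, y)
print(export_text(tree))          # the sequence of splits and terminal-node classes
print(tree.predict(X[:5]))
```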

  44. Classification trees

  45. Aggregating predictors Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set. In classification, the multiple versions of the predictor are aggregated by voting. Let C(·, Lb) denote the classifier built from the bth perturbed learning set Lb, and let wb denote the weight given to predictions made by this classifier. The predicted class for an observation x is given by argmax_k ∑b wb I(C(x, Lb) = k).

  46. Aggregating predictors • 1. Bagging. Bootstrap samples of the same size as the original learning set: non-parametric bootstrap, Breiman (1996); convex pseudo-data, Breiman (1998). • 2. Boosting. Freund and Schapire (1997), Breiman (1998). The data are resampled adaptively so that the weights in the resampling are increased for those cases most often misclassified. The aggregation of predictors is done by weighted voting.
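A brief scikit-learn sketch of both schemes on synthetic data; the tree settings and B = 50 are arbitrary choices, not the lecture's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=80, n_features=30, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Bagging: B trees built on bootstrap samples of the learning set,
# aggregated by (unweighted) voting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)

# Boosting: the learning set is reweighted adaptively toward misclassified
# cases, and the trees are combined by weighted voting
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50,
                           random_state=0).fit(X, y)
```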

  47. Prediction votes For aggregated classifiers, prediction votes assessing the strength of a prediction may be defined for each observation. The prediction vote (PV) for an observation x is defined to be PV(x) = max_k ∑b wb I(C(x, Lb) = k) / ∑b wb. When the perturbed learning sets are given equal weights, i.e., wb = 1, the prediction vote is simply the proportion of votes for the “winning” class, regardless of whether it is correct or not. Prediction votes belong to [0, 1].
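Continuing the bagging sketch above with equal weights wb = 1, the prediction vote for each observation is simply the proportion of the B bootstrap classifiers that vote for the winning class; a minimal computation (same synthetic data and settings as before):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=80, n_features=30, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)

# Hard votes of each bootstrap classifier, then the winning-class proportion
votes = np.array([est.predict(X) for est in bag.estimators_])   # shape (B, n)
pv = np.array([np.bincount(votes[:, i].astype(int)).max() / len(bag.estimators_)
               for i in range(X.shape[0])])
print(pv[:5])        # values in [0, 1]; near 1 = strongly supported prediction
```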

  48. Other prediction methods • Support vector machines (SVMs) • Neural networks • Bayesian regression methods

  49. Datasets • Leukemia data, Golub et al. (1999). • n=72 tumor mRNA samples; • 3 known classes (B-cell ALL, T-cell ALL, AML); • p=6,817 genes. • Lymphoma data, Alizadeh et al. (2000). • n=81 tumor mRNA samples; • 3 known classes (CLL, FL, DLBCL); • p=4,682 genes. • NCI 60 data, Ross et al. (2000). • n=64 cell line mRNA samples; • 9 known classes; • p=5,244 genes.

  50. Data pre-processing • Image analysis and normalization: beyond our control for these data. • Imputation: k-nearest neighbor imputation, where genes are “neighbors” and the similarity measure between two genes is the correlation in their expression profiles. • Standardization: standardize observations (arrays) to have mean 0 and variance 1 across variables (genes).
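A hedged sketch of the imputation and standardization steps with scikit-learn on random data. Note that KNNImputer finds neighbors by (nan-)Euclidean distance rather than by correlation between gene profiles, so it only approximates the scheme described here:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical genes x arrays matrix with some missing spots
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
X[rng.random(X.shape) < 0.05] = np.nan

# k-nearest-neighbour imputation: missing values of a gene are filled in from
# its nearest neighbouring genes (rows)
X_imp = KNNImputer(n_neighbors=10).fit_transform(X)

# Standardize each array (column) to mean 0 and variance 1 across the genes
X_std = StandardScaler().fit_transform(X_imp)
```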
