
Exploring Data using Dimension Reduction and Clustering



  1. Exploring Data using Dimension Reduction and Clustering. Naomi Altman, Nov. 06

  2. Spellman Cell Cycle data. Yeast cells were synchronized by arrest of a cdc15 temperature-sensitive mutant. Samples were taken every 10 minutes, and one array was hybridized for each sample using a reference design. Two complete cycles are in the data. I downloaded the data and normalized it using loess. (Print-tip data were not available.) I used the normalized value of M as the primary data.

  3. What they did: supervised dimension reduction = regression. They were looking for genes that have cyclic behavior, i.e. a sine or cosine wave in time. They regressed Mi (the M value for gene i) on sine and cosine waves and selected genes for which the R2 was high. The period of the wave was known (from observing the cells?), so they regressed against sin(wt) and cos(wt), where w is set to give the appropriate period. If the period is unknown, a method called Fourier analysis can be used to discover it.
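A minimal sketch of this kind of fit for a single gene, using the time vector and M.yeast matrix that are defined later (slides 7-8); the 80-minute period used here is only a placeholder for illustration, not the value used in the original analysis.
period=80                  # placeholder; substitute the known cell-cycle period
w=2*pi/period
fit=lm(as.numeric(M.yeast[1,]) ~ sin(w*time) + cos(w*time))
summary(fit)$r.squared     # keep genes for which this R^2 is high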

  4. Regression. Suppose we are looking for genes that are associated with a particular quantitative phenotype, or have a pattern that is known in advance. E.g. suppose we are interested in genes that change linearly with temperature and quadratically with pH: Y = b0 + b1*Temp + b2*pH + b3*pH^2 + noise. We might fit this model for each gene (assuming that the arrays came from samples subjected to different levels of Temp and pH). This is similar to differential expression analysis: we have a multiple comparisons problem.
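As an illustration only, a hedged sketch of fitting this model to every gene with ordinary lm(); expr (a genes x arrays matrix), Temp, and pH are hypothetical objects, not part of the yeast example.
# expr: hypothetical genes x arrays matrix; Temp, pH: hypothetical per-array covariates
pvals=apply(expr, 1, function(y) {
  fit=lm(y ~ Temp + pH + I(pH^2))
  f=summary(fit)$fstatistic        # overall F-test for the fitted pattern
  pf(f[1], f[2], f[3], lower.tail=FALSE)
})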

  5. Regression We might compute an adjusted p-value, or goodness-of-fit statistic to select genes based on the fit to a pattern. If we have many "conditions" we do not need to replicate as much as in differential expression analysis because we consider any deviation from the "pattern" to be random variation.
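For example, the per-gene p-values from a fit like the one sketched above could be adjusted with R's built-in p.adjust() before selecting genes (a sketch, continuing the hypothetical pvals vector from the previous sketch):
pvals.adj=p.adjust(pvals, method="BH")    # or method="bonferroni"
sig.genes=which(pvals.adj < 0.05)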

  6. What I did. Unsupervised dimension reduction: I used SVD on the 832 genes x 24 time points. We can see that eigengene 5 captures the cyclic pattern.

  7. For class, I extracted the 304 spots with variance greater than 0.25. To my surprise, several of these were empty or control spots. I removed these, which leaves 295 genes; they are in yeast.txt. Read these into R. Also: time=c(10,30,50,10*(7:25),270,290)
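A sketch of the variance filter described above; M.all is a hypothetical matrix holding the normalized M values for all 832 genes (it is not distributed with yeast.txt).
gene.var=apply(M.all, 1, var)        # per-spot variance across the 24 arrays
M.filtered=M.all[gene.var > 0.25, ]  # keeps the high-variance spots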

  8. yeast=read.delim("yeast.txt",header=T)
time=c(10,30,50,10*(7:25),270,290)
M.yeast=yeast[,2:25]   # strip off the gene names
svd.m=svd(M.yeast)     # svd
# scree plot
plot(1:24,svd.m$d)
par(mfrow=c(4,4))      # plot the first 16 "eigengenes"
for (i in 1:16) plot(time,svd.m$v[,i],main=paste("Eigen",i),type="l")
par(mfrow=c(1,1))
plot(time,svd.m$v[,1],type="l",ylim=c(min(svd.m$v),max(svd.m$v)))
for (i in 2:4) lines(time,svd.m$v[,i],col=i)
# It looks like "eigengenes" 2-4 have the periodic components.

  9. # Reduce dimension by finding genes that are linear combinations
# of these 3 patterns by regression.
# We can use limma to fit a regression to every gene and use e.g.
# the F or p-value to pick significant genes.
library(limma)
design.reg=model.matrix(~svd.m$v[,2:4])
fit.reg=lmFit(M.yeast,design.reg)
# The "reduced dimension" version of each gene is its fitted value:
# b0 + b1*v2 + b2*v3 + b3*v4, where vi is the ith column of svd.m$v
# and the bi are the coefficients.
# Let's look at gene 1 (not periodic) and genes 5, 6, 7.
i=1   # also try i=5, 6, 7
plot(time,as.numeric(M.yeast[i,]),type="l")
lines(time,fit.reg$coef[i,1]+fit.reg$coef[i,2]*svd.m$v[,2]+
      fit.reg$coef[i,3]*svd.m$v[,3]+fit.reg$coef[i,4]*svd.m$v[,4])

  10. # Select the genes with a strong periodic component.
# We could use R^2, but in limma it is simplest to compute the
# moderated F-test for regression and then use the p-values.
# limma requires us to remove the intercept from the coefficients
# to get this test :(
contrast.matrix=cbind(c(0,1,0,0),c(0,0,1,0),c(0,0,0,1))
fit.contrast=contrasts.fit(fit.reg,contrast.matrix)
efit=eBayes(fit.contrast)
# We will use the Bonferroni method to pick a significance level:
# a = 0.05/#genes = 0.05/295 = 0.00017
sigGenes=which(efit$F.p.value<0.00017)
# Plot a few of these genes.
# You might also want to plot a few genes with p-value > 0.5.

  11. Note that we used the normalized but uncentered, unscaled data for this exercise. Things might look very different if the data were transformed.
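For comparison, a sketch of the same SVD after centering each gene (row) at zero; whether such a transformation is appropriate depends on the question being asked.
M.centered=t(scale(t(M.yeast), center=TRUE, scale=FALSE))   # row-center the genes
svd.c=svd(M.centered)
plot(1:24, svd.c$d)    # compare this scree plot with the uncentered one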

  12. Clustering. We might ask which genes have similar expression patterns. Once we have expressed (dis)similarity as a distance measure, we can use this measure to cluster genes that are similar. There are many methods; we will discuss two: hierarchical clustering and k-means clustering.


  14. Hierarchical Clustering (agglomerative)
• Choose a distance function for points, d(x1,x2).
• Choose a distance function for clusters, D(C1,C2) (for clusters formed by just one point, D reduces to d).
• Start from N clusters, each containing one data point.
• At each iteration:
  a) Using the current matrix of cluster distances, find the two closest clusters.
  b) Update the list of clusters by merging the two closest.
  c) Update the matrix of cluster distances accordingly.
• Repeat until all data points are joined in one cluster.
Remarks:
• The method is sensitive to anomalous data points/outliers.
• Mergers are irreversible: "bad" mergers occurring early on affect the structure of the nested sequence.
• If two pairs of clusters are equally (and maximally) close at a given iteration, we have to choose arbitrarily; the choice will affect the structure of the nested sequence.
(F. Chiaromonte, Sp 06)

  15. Defining cluster distance: the linkage function
D(C1,C2) is a function f of the distances { d(x1i,x2j) : x1i in C1, x2j in C2 }:
• Single (string-like, long clusters): f = min
• Complete (ball-like, compact clusters): f = max
• Average: f = average
• Centroid: D(C1,C2) = d(ave(x1i), ave(x2j))
Single and complete linkage produce nested sequences invariant under monotone transformations of d; this is not the case for average linkage. However, average linkage is a compromise between the "long", "stringy" clusters produced by single linkage and the "round", "compact" clusters produced by complete linkage. (See the sketch below.) (F. Chiaromonte, Sp 06)
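A short sketch comparing these linkage choices on the yeast data; any of the method names below can be passed to hclust.
d=dist(M.yeast)                      # Euclidean distances between genes
par(mfrow=c(1,3))
for (m in c("single","complete","average"))
  plot(hclust(d, method=m), labels=FALSE, main=m)
par(mfrow=c(1,1))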

  16. Example. Agglomeration step in constructing the nested sequence (first iteration):
1. Points 3 and 5 are the closest, and are therefore merged into cluster "35".
2. A new distance matrix is computed with complete linkage.
In the dendrogram, the ordinate is the distance (height) at which each merger occurred. The horizontal ordering of the data points is any order that prevents intersections of branches.
(Figures: dendrograms under single linkage and complete linkage. F. Chiaromonte, Sp 06)

  17. Hierarchical Clustering. Hierarchical clustering, per se, does not dictate a partition or a number of clusters. It provides a nested sequence of partitions (this is more informative than just one partition). To settle on one partition, we have to "cut" the dendrogram. Usually we pick a height and cut there, but the most informative cuts are often at different heights for different branches. (F. Chiaromonte, Sp 06)

  18. hclust(dist(M.yeast), method="single")
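A sketch extending the call above: store the result, draw the dendrogram, and cut it into clusters (the choice of k=6 here is arbitrary).
hc=hclust(dist(M.yeast), method="single")
plot(hc, labels=FALSE)        # the dendrogram
groups=cutree(hc, k=6)        # one partition from the nested sequence
table(groups)                 # cluster sizes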

  19. Partitioning algorithms: K-means
• Choose a distance function for points, d(xi,xj).
• Choose K = the number of clusters.
• Initialize the K cluster centroids (with points chosen at random).
• Use the data to iteratively relocate centroids and reallocate points to the closest centroid. At each iteration:
  1. Compute the distance of each data point from each current centroid.
  2. Update the current cluster membership of each data point, selecting the centroid to which the point is closest.
  3. Update the current centroids as the averages of the new clusters formed in step 2.
• Repeat until cluster memberships, and thus centroids, stop changing.
(A from-scratch sketch of this iteration is shown below.) (F. Chiaromonte, Sp 06)
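The from-scratch sketch referred to above, written directly from the steps on this slide for illustration only; it is not the course code (which uses R's kmeans()) and it does not handle clusters that become empty.
simple.kmeans=function(X, K, max.iter=100) {
  X=as.matrix(X)
  centroids=X[sample(nrow(X), K), , drop=FALSE]   # K points chosen at random
  cluster=rep(0, nrow(X))
  for (iter in 1:max.iter) {
    # distance of each data point from each current centroid
    d=as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
    new.cluster=apply(d, 1, which.min)            # closest centroid
    if (all(new.cluster == cluster)) break        # memberships stopped changing
    cluster=new.cluster
    for (k in 1:K)                                # update centroids as cluster averages
      centroids[k,]=colMeans(X[cluster == k, , drop=FALSE])
  }
  list(cluster=cluster, centers=centroids)
}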

  20. Remarks:
• This method is sensitive to anomalous data points/outliers.
• Points can move from one cluster to another, but the final solution depends strongly on centroid initialization (so we usually restart several times to check; see the sketch below).
• If two centroids are equally (and maximally) close to an observation at a given iteration, we have to choose arbitrarily (the problem here is not so serious because points can move later).
• There are several "variants" of the k-means algorithm, using e.g. the median.
• K-means converges to a local minimum of the total within-cluster squared distance (the total within-cluster sum of squares), not necessarily a global one.
• Clusters tend to be ball-shaped with respect to the chosen distance.
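In R's kmeans(), the restarts mentioned above can be requested with the nstart argument; a short sketch:
k.out=kmeans(M.yeast, centers=6, nstart=20)  # keep the best of 20 random starts
k.out$tot.withinss                           # total within-cluster sum of squares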

  21. Starting from the arbitrarily chosen open rectangles (the initial centroids): assign every data value to the cluster defined by the nearest centroid; recompute the centroids based on the current clustering; reassign data values to clusters and repeat. Remarks: The algorithm does not indicate how to pick K. To change K, redo the partitioning. The clusters are not necessarily nested. (F. Chiaromonte, Sp 06)

  22. Here is the yeast data (4 runs). To display the clusters, we often plot against the main eigendirections (the columns of svd.m$u). These do show that much of the clustering is defined by the first 2 directions, but it is not clear that there really are clusters.

  23. (Figure panels: k-means with 6 clusters and with 4 clusters.)
k.out=kmeans(M.yeast,centers=6)
plot(svd.m$u[,1],svd.m$u[,2],col=k.out$cluster)   # color points by cluster membership

  24. Other Partitioning Methods
• Partitioning around medoids (PAM): instead of averages, use multidimensional medians (medoids) as the cluster "prototypes". Dudoit and Fridlyand (2002).
• Self-organizing maps (SOM): add an underlying "topology" (a neighboring structure on a lattice) that relates cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999).
• Fuzzy k-means: allow for a "gradation" of points between clusters; soft partitions. Gasch and Eisen (2002).
• Mixture-based clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides soft partitioning and allows for modeling of cluster centroids and shapes. Yeung et al. (2001), McLachlan et al. (2002).
(F. Chiaromonte, Sp 06)
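As one example, a hedged sketch of PAM using the 'cluster' package (an assumption; it is not used elsewhere in this course code):
library(cluster)
pam.out=pam(M.yeast, k=6)     # medoids (actual genes) instead of centroids
plot(svd.m$u[,1], svd.m$u[,2], col=pam.out$clustering)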

  25. Assessing the Clusters Computationally. The bottom line is that the clustering is "good" if it is biologically meaningful (but this is hard to assess). Computationally we can:
1) Use a goodness-of-cluster measure, such as the within-cluster distances compared to the between-cluster distances.
2) Perturb the data and assess how the clusters change:
  a) add noise (maybe residuals after ANOVA)
  b) resample (genes, arrays)
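A sketch of option 1 using silhouette widths, which compare within-cluster to between-cluster distances; this assumes the 'cluster' package is available.
library(cluster)
k.out=kmeans(M.yeast, centers=6, nstart=20)
sil=silhouette(k.out$cluster, dist(M.yeast))
summary(sil)     # average silhouette width near 1 indicates well-separated clusters
plot(sil)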
