

  1. Agenda • Introduction to clustering • Dissimilarity measure • Preprocessing • Clustering method • Hierarchical clustering • K-means and K-medoids • Self-organizing maps (SOM) • Model-based clustering • Estimate # of clusters • Two new methods allowing scattered genes 4.1 Tight clustering • 4.2 Penalized and weighted K-means • Cluster validation and evaluation • Comparison and discussion

  2. 4.1 Tight clustering A common situation for gene clustering in microarray: [K-means results shown for k = 10, 15, 30.] The clustering looks informative; a closer look, however, reveals a lot of noise in each cluster.

  3. 4.1 Tight clustering Main challenges for clustering in microarray Challenge 1: Lots of scattered genes, i.e., genes not belonging to any tight cluster of biological function.

  4. 4.1 Tight clustering Main challenges for clustering in microarray Challenge 2: Microarray is an exploratory tool to guide further biological experiments. Hypothesis driven: hypothesis => experimental data. Data driven: high-throughput experiment => data mining => hypothesis => further validation experiment. It is therefore important to provide the most informative clusters instead of lots of loose clusters (reduce false positives).

  5. 4.1 Tight clustering • Tight clustering: • Directly identify informative, tight and stable clusters of reasonable size, say, 20~60 genes. • No need to estimate k! • No need to assign all genes into clusters. • Traditional approach: • Estimate the number of clusters, k (except for hierarchical clustering). • Perform clustering by assigning all genes into clusters.

  6. Basic idea: [Figure: scatter plot of the whole data in 2-D (x vs. y), with 11 labeled points.]

  7. 4.1 Tight clustering [Figure: Original data X → random sub-sample X' → K-means → cluster centers C(X', k) = (C1, …, Ck) → co-membership matrix D[C(X', k), X].]

  8. 4.1 Tight clustering • X = {x_ij}_{n×d}: data to be clustered. • X' = {x'_ij}_{(n/2)×d}: random sub-sample. • C(X', k) = (C1, C2, …, Ck): the cluster centers obtained from clustering X' into k clusters. • D[C(X', k), X]: an n×n matrix denoting the co-membership relations of X classified by C(X', k) (Tibshirani 2001). D[C(X', k), X]_ij = 1 if i and j are in the same cluster, and = 0 otherwise.
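To make the co-membership matrix concrete, here is a minimal sketch (not the authors' code): sub-sample half of the rows, run K-means on the sub-sample, classify every point in X by the resulting centers, and record which pairs land in the same cluster. Function and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_membership(X, k, rng):
    """D[C(X', k), X]: entry (i, j) is 1 if points i and j fall in the same cluster, else 0."""
    n = X.shape[0]
    sub = rng.choice(n, size=n // 2, replace=False)                # random sub-sample X'
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[sub])
    labels = km.predict(X)                                         # classify all of X by C(X', k)
    return (labels[:, None] == labels[None, :]).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
D = co_membership(X, k=3, rng=rng)
print(D.shape)   # (100, 100), entries 0/1
```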

  9. 4.1 Tight clustering Algorithm 1 (when fixing k): • Fix k. Draw random sub-samples X(1), …, X(B). Define the average co-membership matrix D̄ as the mean of D[C(X(b), k), X] over the B sub-samples. Note: • D̄_ij = 1 ⇒ i and j are clustered together in every sub-sampling judgment. • D̄_ij = 0 ⇒ i and j are never clustered together in any sub-sampling judgment.

  10. 4.1 Tight clustering Algorithm 1 (when fixing k): (cont'd) Search for large sets of points V such that D̄_ij ≈ 1 for all i, j in V. Sets with this property are candidates for tight clusters. Order the sets with this property by their size to obtain Vk1, Vk2, …
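A hedged sketch of Algorithm 1, building on the co_membership helper in the previous sketch: average the co-membership matrix over B sub-samples, then pull out candidate tight sets. The greedy extraction below (grow a set around each seed point while all pairwise averages stay above 1 − alpha) is only one simple way to find sets with the stated property; the paper's exact search procedure may differ, and alpha is an illustrative threshold name.

```python
import numpy as np

def average_co_membership(X, k, B=20, seed=0):
    """Mean of D[C(X(b), k), X] over B random sub-samples (uses co_membership from above)."""
    rng = np.random.default_rng(seed)
    D_bar = np.zeros((X.shape[0], X.shape[0]))
    for _ in range(B):
        D_bar += co_membership(X, k, rng)
    return D_bar / B

def candidate_tight_sets(D_bar, alpha=0.1):
    """Greedy candidates: grow a set around each seed point while D_bar stays >= 1 - alpha."""
    n = D_bar.shape[0]
    sets = set()
    for seed_pt in range(n):
        members = [seed_pt]
        for j in range(n):
            if j != seed_pt and all(D_bar[j, m] >= 1 - alpha for m in members):
                members.append(j)
        sets.add(frozenset(members))
    return sorted(sets, key=len, reverse=True)   # V_k1, V_k2, ... ordered by size
```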

  11. [Figure: co-membership of the 11 example points across repeated sub-samples.]

  12. 4.1 Tight clustering Tight Clustering Algorithm: (relax the estimation of k) [Figure: example average co-membership matrix with entries between 0 and 1; tight clusters appear as blocks of entries close to 1.]

  13. 4.1 Tight clustering Tight Clustering Algorithm: Start with a suitable k0. Search over consecutive k's and choose the top 3 candidate clusters for each k. Stop when a top candidate remains essentially unchanged across consecutive k's. Select that stable candidate to be the tightest cluster.

  14. 4.1 Tight clustering Tight Clustering Algorithm: (cont'd) Identify the tightest cluster and remove it from the whole data. Decrease k0 by 1. Repeat steps 1–3 to identify the next tight cluster. Remark: the stability/tightness thresholds and k0 determine the tightness and size of the resulting clusters.
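Putting the pieces together, a simplified sketch of the overall loop, assuming the average_co_membership and candidate_tight_sets helpers from the sketches above. The stability check (high overlap between top candidates at consecutive k) and the parameter names k0, alpha, beta stand in for the paper's exact rules and are illustrative only.

```python
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b)

def tight_clustering(X, k0=10, n_tight=5, alpha=0.1, beta=0.7):
    remaining = np.arange(X.shape[0])            # original row indices not yet clustered
    tight_clusters = []
    while len(tight_clusters) < n_tight:
        Xr = X[remaining]
        cand_k  = candidate_tight_sets(average_co_membership(Xr, k0), alpha)[:3]
        cand_k1 = candidate_tight_sets(average_co_membership(Xr, k0 + 1), alpha)[:3]
        # pick the pair of top candidates that agrees best across k0 and k0 + 1
        best = max(((a, b) for a in cand_k for b in cand_k1), key=lambda p: jaccard(*p))
        if jaccard(*best) >= beta:               # stable across consecutive k => tightest cluster
            chosen = sorted(best[1])
            tight_clusters.append(remaining[chosen])    # map back to original indices
            remaining = np.delete(remaining, chosen)    # remove the cluster from the data
            k0 = max(k0 - 1, 2)                         # decrease k0 by 1
        else:
            k0 += 1                                     # keep searching over consecutive k's
    return tight_clusters
```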

  15. 4.1 Tight clustering Example: A simple simulation in 2-D: 14 normally distributed clusters (50 points each) plus 175 scattered points. Stdev = 0.1, 0.2, …, 1.4.
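A minimal numpy sketch of this kind of simulated data; the cluster count, cluster sizes, number of scattered points, and the range of standard deviations follow the slide, while the cluster centers and the bounding box are placed arbitrarily here.

```python
import numpy as np

rng = np.random.default_rng(1)
sds = np.arange(1, 15) / 10.0                          # 0.1, 0.2, ..., 1.4
centers = rng.uniform(0, 20, size=(14, 2))             # arbitrary 2-D cluster centers

clusters = [rng.normal(c, sd, size=(50, 2)) for c, sd in zip(centers, sds)]
scattered = rng.uniform(0, 20, size=(175, 2))          # 175 scattered (noise) points
data = np.vstack(clusters + [scattered])
labels = np.concatenate([np.repeat(np.arange(14), 50), np.full(175, -1)])  # -1 = scattered
print(data.shape)                                       # (875, 2)
```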

  16. 4.1 Tight clustering Example: Tight clustering on simulated data:

  17. 4.1 Tight clustering Example:

  18. 4.1 Tight clustering Example: Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297:2270-2275 • 4028 genes monitored. Reference sample is pooled from all samples. • 66 sequential time points spanning embryonic (E), larval (L), pupal (P) and adult (A) periods. • Filter genes without significant pattern (1100 genes) and standardize each gene to have mean 0 and stdev 1.
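The filtering rule is study-specific, but the per-gene standardization step can be sketched as follows; expr is assumed to be a genes × time-points array of log-ratios.

```python
import numpy as np

def standardize_genes(expr):
    """Scale each gene (row) to mean 0 and standard deviation 1 across the time points."""
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    return (expr - mu) / sd
```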

  19. 4.1 Tight clustering Example: Comparison of various K-means and tight clustering: Seven mini-chromosome maintenance (MCM) deficient genes K-means k=30 K-means k=50

  20. 4.1 Tight clustering Example: K-means k=70 K-means k=100 Tight clustering

  21. 4.1 Tight clustering [Figure: scattered (noisy) genes.] TightClust software download: http://www.pitt.edu/~ctseng/research/tightClust_download.html

  22. 4.2 Penalized and weighted K-means Formulation: K-means K-means criterion: Minimize the within-cluster sum of squared dispersion to obtain C: Σ_j Σ_{x_i ∈ C_j} ||x_i − μ_j||², where μ_j is the mean of cluster C_j. K-medoids criterion: the same form, but each cluster is represented by a medoid (an actual data point) and a general dissimilarity d(x_i, m_j) replaces the squared Euclidean distance.
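As a concrete check on the criterion, the sketch below computes the within-cluster sum of squared dispersion by hand and compares it with the quantity minimized by scikit-learn's KMeans (reported as inertia_); the data and names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# within-cluster sum of squared distances to each cluster mean
wss = sum(((X[km.labels_ == j] - X[km.labels_ == j].mean(axis=0)) ** 2).sum()
          for j in range(3))
print(wss, km.inertia_)   # essentially identical: K-means minimizes exactly this criterion
```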

  23. 4.2 Penalized and weighted K-means Formulation: K-means Proposition: K-means is a special case of classification maximum likelihood (CML) under a Gaussian model with identical spherical clusters. K-means: minimize the within-cluster sum of squares. CML: maximize the classification likelihood of the partition.
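A brief sketch of why the proposition holds, assuming k spherical Gaussian components N(μ_j, σ²I) with a common, fixed σ²:

```latex
\log L_C(\mu, C)
  = \sum_{j=1}^{k} \sum_{x_i \in C_j} \log \phi(x_i;\, \mu_j, \sigma^2 I)
  = \text{const} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

so maximizing the classification likelihood over the partition C and the means μ_j is equivalent to minimizing the within-cluster sum of squares, i.e., the K-means criterion.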

  24. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Goal 1: • Allow a set of scattered genes without being clustered. Goal 2: • Incorporation of prior information in cluster formation.

  25. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Formulation: d(x_i, C_j): dispersion of point x_i in cluster C_j. |S|: # of objects in the noise set S. w(x_i; P): weight function to incorporate prior information P. λ: a tuning parameter. • Penalty term λ|S|: assigns outlying objects of a cluster to the noise set S. • Weighting term w: utilizes prior knowledge of preferred or prohibited patterns P.
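A hedged sketch of how such an objective could be evaluated, assuming it takes the form "weighted dispersion of the clustered points plus λ times the size of the noise set"; the exact functional form in the paper may differ, and all names here are illustrative.

```python
import numpy as np

def pw_kmeans_objective(X, centers, labels, weights, lam):
    """Assumed loss: sum over clustered points of w(x_i; P) * d(x_i, C_j)  +  lam * |S|.

    labels[i] = j  -> point i belongs to cluster j
    labels[i] = -1 -> point i is in the noise set S
    weights[i]     -> w(x_i; P), prior-information weight for point i
    """
    labels = np.asarray(labels)
    weights = np.asarray(weights)
    clustered = labels >= 0
    d = ((X[clustered] - centers[labels[clustered]]) ** 2).sum(axis=1)   # squared dispersion
    return np.sum(weights[clustered] * d) + lam * np.sum(~clustered)     # + lam * |S|
```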

  26. 4.2 Penalized and weighted K-means Properties of PW-Kmeans

  27. 4.2 Penalized and weighted K-means Properties of PW-Kmeans Relation to classification likelihood: the P-Kmeans loss function corresponds to a classification likelihood under a Gaussian model for the clusters, with the noise set uniformly distributed over a space V.

  28. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Prior information Six groups of validated cell cycle genes:

  29. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Prior information Six groups of validated cell cycle genes:

  30. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans A special example of PW-Kmeans for microarray: Prior knowledge of p pathways. The weight is designed as a transformation of the logistic function:

  31. 4.2 Penalized and weighted K-means Design of weight function Formulation: PW-Kmeans

  32. 4.2 Penalized and weighted K-means Application: Yeast cell cycle expression Prior information Six groups of validated cell cycle genes: 8 histone genes tightly coregulated in S phase

  33. 4.2 Penalized and weighted K-means Application: Yeast cell cycle expression Penalized K-means with no prior information. Cluster sizes: C1 = 58, C2 = 31, C3 = 39, C4 = 101, C5 = 71, noise set S = 1276. The 8 histone genes are left in the noise set S without being clustered.

  34. 4.2 Penalized and weighted K-means Application: Yeast cell cycle expression PW-Kmeans: take three randomly selected histone genes as prior information P. Cluster sizes: C1 = 112, C2 = 158, C3 = 88, C4 = 139, C5 = 57, noise set S = 1109. The 8 histone genes are now in cluster 3.

  35. 5. Cluster evaluation • Evaluation and comparison of clustering methods is always difficult. • In supervised learning (classification), the class labels (underlying truth) are known and performance can be evaluated through cross validation. • In unsupervised learning (clustering), external validation is usually not available. • Ideal data for cluster evaluation: • Data with class/tumor labels (for clustering samples) • Cell cycle data (for clustering genes) • Simulated data

  36. 5. Cluster evaluation Rand index: (Rand 1971) Y = {(a,b,c), (d,e,f)} Y' = {(a,b), (c,d,e), (f)} • Rand index: c(Y, Y') = (2+7)/15 = 0.6 (fraction of concordant pairs) • 0 ≤ c(Y, Y') ≤ 1 • Clustering methods can be evaluated by c(Y, Ytruth) if Ytruth is available.
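A small from-scratch script reproducing the slide's example: count the object pairs on which the two partitions agree (same cluster in both, or different clusters in both) and divide by the total number of pairs.

```python
from itertools import combinations

def rand_index(y1, y2):
    """Fraction of object pairs that are concordant between two partitions."""
    pairs = list(combinations(range(len(y1)), 2))
    agree = sum((y1[i] == y1[j]) == (y2[i] == y2[j]) for i, j in pairs)
    return agree / len(pairs)

# objects a, b, c, d, e, f
Y       = [0, 0, 0, 1, 1, 1]   # {(a,b,c), (d,e,f)}
Y_prime = [0, 0, 1, 1, 1, 2]   # {(a,b), (c,d,e), (f)}
print(rand_index(Y, Y_prime))  # 0.6 = (2 + 7) / 15
```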

  37. 5. Cluster evaluation Adjusted Rand index: (Hubert and Arabie 1985) Adjusted Rand index = (Rand index − expected Rand index) / (maximum Rand index − expected Rand index). The adjusted Rand index takes its maximum value at 1 and has a constant expected value of 0 when the two clusterings are totally independent.
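In practice the adjusted index is usually taken from a library; scikit-learn's adjusted_rand_score implements the Hubert and Arabie correction. On the same toy example:

```python
from sklearn.metrics import adjusted_rand_score

Y       = [0, 0, 0, 1, 1, 1]
Y_prime = [0, 0, 1, 1, 1, 2]
print(adjusted_rand_score(Y, Y_prime))   # about 0.12, well below the raw Rand index of 0.6
```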

  38. 6. Comparison

  39. 6. Comparison • Simulation (Thalamuthu et al. 2006): • 20 time-course samples for each gene; 15 clusters. • In each cluster, four groups of samples with similar intensity. • Individual sample and gene variation are added. • # of genes in each cluster ~ Poisson(10). • Scattered (noise) genes are added. • By visualization, the simulated data closely resembles real data.

  40. Different types of perturbations • Type I: a number of randomly simulated scattered genes (0, 5, 10, 20, 60, 100 and 200% of the original total number of clustered genes) is added. E.g., for sample j of a scattered gene, the expression level is randomly sampled from the empirical distribution of the expressions of all clustered genes in sample j. • Type II: a small random error from a normal distribution (SD = 0.05, 0.1, 0.2, 0.4, 0.8, 1.2) is added to each element of the log-transformed expression matrix, to evaluate the robustness of the clustering against potential random errors. • Type III: combination of Types I and II.
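A hedged sketch of the two perturbation types on a log-scale expression matrix (expr, genes × samples); the empirical resampling for the scattered genes follows the slide's description, and the function names are illustrative.

```python
import numpy as np

def add_scattered_genes(expr, fraction, rng):
    """Type I: append scattered genes; each sample value is drawn from the
    empirical distribution of all clustered genes in that same sample."""
    n_genes, n_samples = expr.shape
    n_new = int(round(fraction * n_genes))
    scattered = np.column_stack(
        [rng.choice(expr[:, j], size=n_new, replace=True) for j in range(n_samples)]
    )
    return np.vstack([expr, scattered])

def add_random_error(expr, sd, rng):
    """Type II: add small normal errors to every element of the matrix."""
    return expr + rng.normal(0.0, sd, size=expr.shape)

rng = np.random.default_rng(0)
expr = rng.normal(size=(150, 20))                                             # toy clustered-gene matrix
perturbed = add_random_error(add_scattered_genes(expr, 0.20, rng), 0.1, rng)  # Type III = I + II
print(perturbed.shape)                                                        # (180, 20)
```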

  41. 6. Comparison Different degrees of perturbation in the simulated microarray data

  42. 6. Comparison Simulation schemes performed in the paper. In total, 25 (simulation settings) × 100 (data sets) = 2,500 data sets are evaluated.

  43. 6. Comparison • Adjusted Rand index: a measure of similarity between two clusterings. • Compare each clustering result to the underlying true clusters and obtain the adjusted Rand index (the higher the better). T: tight clustering M: model-based P: K-medoids K: K-means H: hierarchical S: SOM

  44. Consensus Clustering Simpson et al. BMC Bioinformatics 2010, 11:590. doi:10.1186/1471-2105-11-590

  45. 6. Comparison • Consensus clustering with PAM (blue) • Consensus clustering with hierarchical clustering (red) • HOPACH (black) • Fuzzy c-means (green)

  46. 6. Comparison Comparison in real data sets: (see paper for detailed comparison criteria)

  47. References • Tight clustering: • George C. Tseng and Wing H. Wong. (2005) Tight Clustering: A Resampling-based Approach for Identifying Stable and Tight Patterns in Data. Biometrics. 61:10-16. • Penalized and weighted K-means: • George C. Tseng. (2007) Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics. 23:2247-2255. • Comparative study: • Anbupalam Thalamuthu*, Indranil Mukhopadhyay*, Xiaojing Zheng* and George C. Tseng. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 22:2405-2412.

  48. 6. Conclusion • Despite many sophisticated methods for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in microarray analysis. • We should use these methods for visualization, investigation and hypothesis generation. • We should not use these methods inferentially. • In general, methods that use resampling evaluation, allow scattered genes and relate to a model-based approach perform better. • Hierarchical clustering specifically: it merely provides a picture, from which we can draw many (or any) conclusions.

  49. 6. Conclusion • Common mistakes or warnings: • Running K-means with a large k and getting excited to see patterns, without further investigation. • K-means can show you patterns even in randomly generated data, and human eyes tend to see "patterns" anyway (see the sketch below). • Identifying genes that are predictive of survival (e.g. applying t-statistics to long vs. short survivors), clustering samples based on the selected genes, and finding that the samples cluster according to survival status. • The gene selection procedure is already biased towards the desired result.
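A quick way to convince yourself of the first warning, as a minimal sketch: run K-means with a large k on pure noise and look at the cluster mean profiles, which come out looking like distinct "patterns" even though there is no structure at all.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
noise = rng.normal(size=(1000, 20))              # 1000 "genes" x 20 samples of pure noise

km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(noise)
for j in range(3):                               # inspect a few cluster mean profiles
    print(np.round(km.cluster_centers_[j], 2))   # they look like distinct "patterns"
```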
