

  1. Agenda • Introduction to clustering • Dissimilarity measure • Preprocessing • Clustering method • Hierarchical clustering • K-means and K-medoids • Self-organizing maps (SOM) • Model-based clustering • Estimate # of clusters • Two new methods allowing scattered genes 4.1 Tight clustering • 4.2 Penalized and weighted K-means • Cluster validation and evaluation • Comparison and discussion

  2. 4.1 Tight clustering A common situation for gene clustering in microarray: [K-means results shown for k = 10, 15, 30.] The clustering looks informative; a closer look, however, reveals a lot of noise in each cluster.

  3. 4.1 Tight clustering Main challenges for clustering in microarray Challenge 1: Lots of scattered genes, i.e., genes not belonging to any tight cluster of biological function.

  4. 4.1 Tight clustering Main challenges for clustering in microarray Challenge 2: Microarray is an exploratory tool to guide further biological experiments. Hypothesis driven: hypothesis => experimental data. Data driven: high-throughput experiment => data mining => hypothesis => further validation experiment. It is therefore important to provide the most informative clusters instead of lots of loose clusters (reduce false positives).

  5. 4.1 Tight clustering • Tight clustering: • Directly identify informative, tight and stable clusters of reasonable size, say, 20~60 genes. • No need to estimate k! • No need to assign all genes into clusters. • Traditional approach: • Estimate the number of clusters, k (except for hierarchical clustering). • Perform clustering by assigning all genes into clusters.

  6. Basic idea: [Figure: scatter plot of the whole data in 2-D (x vs. y), with 11 labeled points.]

  7. 4.1 Tight clustering [Figure: Original data X → random sub-sample X' → K-means → cluster centers C(X', k) = (C1, …, Ck) → co-membership matrix D[C(X', k), X].]

  8. 4.1 Tight clustering • X = {x_ij}_{n×d}: data to be clustered. • X' = {x'_ij}_{(n/2)×d}: random sub-sample. • C(X', k) = (C1, C2, …, Ck): the cluster centers obtained from clustering X' into k clusters. • D[C(X', k), X]: an n×n matrix denoting the co-membership relations of X classified by C(X', k) (Tibshirani 2001). D[C(X', k), X]_ij = 1 if i and j are in the same cluster, and = 0 otherwise.
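To make the co-membership matrix concrete, here is a minimal sketch (not the authors' code): sub-sample half of the rows, run K-means on the sub-sample, classify every point in X by the resulting centers, and record which pairs land in the same cluster. Function and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_membership(X, k, rng):
    """D[C(X', k), X]: entry (i, j) is 1 if points i and j fall in the same cluster, else 0."""
    n = X.shape[0]
    sub = rng.choice(n, size=n // 2, replace=False)                # random sub-sample X'
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[sub])
    labels = km.predict(X)                                         # classify all of X by C(X', k)
    return (labels[:, None] == labels[None, :]).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
D = co_membership(X, k=3, rng=rng)
print(D.shape)   # (100, 100), entries 0/1
```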

  9. 4.1 Tight clustering Algorithm 1 (when fixing k): • Fix k. Draw random sub-samples X(1), …, X(B). Define the average co-membership matrix D̄ as the mean of D[C(X(b), k), X] over the B sub-samples. Note: • D̄_ij = 1 ⇒ i and j are clustered together in every sub-sampling judgment. • D̄_ij = 0 ⇒ i and j are never clustered together in any sub-sampling judgment.

  10. 4.1 Tight clustering Algorithm 1 (when fixing k): (cont'd) Search for large sets of points V such that D̄_ij ≈ 1 for all i, j in V. Sets with this property are candidates for tight clusters. Order the sets with this property by their size to obtain Vk1, Vk2, …
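A hedged sketch of Algorithm 1, building on the co_membership helper in the previous sketch: average the co-membership matrix over B sub-samples, then pull out candidate tight sets. The greedy extraction below (grow a set around each seed point while all pairwise averages stay above 1 − alpha) is only one simple way to find sets with the stated property; the paper's exact search procedure may differ, and alpha is an illustrative threshold name.

```python
import numpy as np

def average_co_membership(X, k, B=20, seed=0):
    """Mean of D[C(X(b), k), X] over B random sub-samples (uses co_membership from above)."""
    rng = np.random.default_rng(seed)
    D_bar = np.zeros((X.shape[0], X.shape[0]))
    for _ in range(B):
        D_bar += co_membership(X, k, rng)
    return D_bar / B

def candidate_tight_sets(D_bar, alpha=0.1):
    """Greedy candidates: grow a set around each seed point while D_bar stays >= 1 - alpha."""
    n = D_bar.shape[0]
    sets = set()
    for seed_pt in range(n):
        members = [seed_pt]
        for j in range(n):
            if j != seed_pt and all(D_bar[j, m] >= 1 - alpha for m in members):
                members.append(j)
        sets.add(frozenset(members))
    return sorted(sets, key=len, reverse=True)   # V_k1, V_k2, ... ordered by size
```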

  11. [Figure: co-membership of the 11 example points across repeated sub-samples.]

  12. 4.1 Tight clustering Tight Clustering Algorithm: (relax the estimation of k) [Figure: example average co-membership matrix with entries between 0 and 1; tight clusters appear as blocks of entries close to 1.]

  13. 4.1 Tight clustering Tight Clustering Algorithm: Start with a suitable k0. Search over consecutive k's and choose the top 3 candidate clusters for each k. Stop when a top candidate remains essentially unchanged across consecutive k's. Select that stable candidate to be the tightest cluster.

  14. 4.1 Tight clustering Tight Clustering Algorithm: (cont'd) Identify the tightest cluster and remove it from the whole data. Decrease k0 by 1. Repeat steps 1–3 to identify the next tight cluster. Remark: the stability/tightness thresholds and k0 determine the tightness and size of the resulting clusters.
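Putting the pieces together, a simplified sketch of the overall loop, assuming the average_co_membership and candidate_tight_sets helpers from the sketches above. The stability check (high overlap between top candidates at consecutive k) and the parameter names k0, alpha, beta stand in for the paper's exact rules and are illustrative only.

```python
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b)

def tight_clustering(X, k0=10, n_tight=5, alpha=0.1, beta=0.7):
    remaining = np.arange(X.shape[0])            # original row indices not yet clustered
    tight_clusters = []
    while len(tight_clusters) < n_tight:
        Xr = X[remaining]
        cand_k  = candidate_tight_sets(average_co_membership(Xr, k0), alpha)[:3]
        cand_k1 = candidate_tight_sets(average_co_membership(Xr, k0 + 1), alpha)[:3]
        # pick the pair of top candidates that agrees best across k0 and k0 + 1
        best = max(((a, b) for a in cand_k for b in cand_k1), key=lambda p: jaccard(*p))
        if jaccard(*best) >= beta:               # stable across consecutive k => tightest cluster
            chosen = sorted(best[1])
            tight_clusters.append(remaining[chosen])    # map back to original indices
            remaining = np.delete(remaining, chosen)    # remove the cluster from the data
            k0 = max(k0 - 1, 2)                         # decrease k0 by 1
        else:
            k0 += 1                                     # keep searching over consecutive k's
    return tight_clusters
```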

  15. 4.1 Tight clustering Example: A simple simulation in 2-D: 14 normally distributed clusters (50 points each) plus 175 scattered points. Stdev = 0.1, 0.2, …, 1.4.
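A minimal numpy sketch of this kind of simulated data; the cluster count, cluster sizes, number of scattered points, and the range of standard deviations follow the slide, while the cluster centers and the bounding box are placed arbitrarily here.

```python
import numpy as np

rng = np.random.default_rng(1)
sds = np.arange(1, 15) / 10.0                          # 0.1, 0.2, ..., 1.4
centers = rng.uniform(0, 20, size=(14, 2))             # arbitrary 2-D cluster centers

clusters = [rng.normal(c, sd, size=(50, 2)) for c, sd in zip(centers, sds)]
scattered = rng.uniform(0, 20, size=(175, 2))          # 175 scattered (noise) points
data = np.vstack(clusters + [scattered])
labels = np.concatenate([np.repeat(np.arange(14), 50), np.full(175, -1)])  # -1 = scattered
print(data.shape)                                       # (875, 2)
```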

  16. 4.1 Tight clustering Example: Tight clustering on simulated data:

  17. 4.1 Tight clustering Example:

  18. 4.1 Tight clustering Example: Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297:2270-2275 • 4028 genes monitored. Reference sample is pooled from all samples. • 66 sequential time points spanning embryonic (E), larval (L), pupal (P) and adult (A) periods. • Filter genes without significant pattern (1100 genes) and standardize each gene to have mean 0 and stdev 1.
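The filtering rule is study-specific, but the per-gene standardization step can be sketched as follows; expr is assumed to be a genes × time-points array of log-ratios.

```python
import numpy as np

def standardize_genes(expr):
    """Scale each gene (row) to mean 0 and standard deviation 1 across the time points."""
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    return (expr - mu) / sd
```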

  19. 4.1 Tight clustering Example: Comparison of various K-means and tight clustering: Seven mini-chromosome maintenance (MCM) deficient genes K-means k=30 K-means k=50

  20. 4.1 Tight clustering Example: K-means k=70 K-means k=100 Tight clustering

  21. 4.1 Tight clustering [Figure: scattered (noisy) genes.] TightClust software download: http://www.pitt.edu/~ctseng/research/tightClust_download.html

  22. 4.2 Penalized and weighted K-means Formulation: K-means K-means criterion: Minimize the within-cluster sum of squared dispersion to obtain C: Σ_j Σ_{x_i ∈ C_j} ||x_i − μ_j||², where μ_j is the mean of cluster C_j. K-medoids criterion: the same form, but each cluster is represented by a medoid (an actual data point) and a general dissimilarity d(x_i, m_j) replaces the squared Euclidean distance.
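As a concrete check on the criterion, the sketch below computes the within-cluster sum of squared dispersion by hand and compares it with the quantity minimized by scikit-learn's KMeans (reported as inertia_); the data and names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# within-cluster sum of squared distances to each cluster mean
wss = sum(((X[km.labels_ == j] - X[km.labels_ == j].mean(axis=0)) ** 2).sum()
          for j in range(3))
print(wss, km.inertia_)   # essentially identical: K-means minimizes exactly this criterion
```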

  23. 4.2 Penalized and weighted K-means Formulation: K-means Proposition: K-means is a special case of classification maximum likelihood (CML) under a Gaussian model with identical spherical clusters. K-means: minimize the within-cluster sum of squares. CML: maximize the classification likelihood of the partition.
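A brief sketch of why the proposition holds, assuming k spherical Gaussian components N(μ_j, σ²I) with a common, fixed σ²:

```latex
\log L_C(\mu, C)
  = \sum_{j=1}^{k} \sum_{x_i \in C_j} \log \phi(x_i;\, \mu_j, \sigma^2 I)
  = \text{const} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

so maximizing the classification likelihood over the partition C and the means μ_j is equivalent to minimizing the within-cluster sum of squares, i.e., the K-means criterion.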

  24. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Goal 1: • Allow a set of scattered genes without being clustered. Goal 2: • Incorporation of prior information in cluster formation.

  25. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Formulation: d(x_i, C_j): dispersion of point x_i in cluster C_j. |S|: # of objects in the noise set S. w(x_i; P): weight function to incorporate prior information P. λ: a tuning parameter. • Penalty term λ|S|: assigns outlying objects of a cluster to the noise set S. • Weighting term w: utilizes prior knowledge of preferred or prohibited patterns P.
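A hedged sketch of how such an objective could be evaluated, assuming it takes the form "weighted dispersion of the clustered points plus λ times the size of the noise set"; the exact functional form in the paper may differ, and all names here are illustrative.

```python
import numpy as np

def pw_kmeans_objective(X, centers, labels, weights, lam):
    """Assumed loss: sum over clustered points of w(x_i; P) * d(x_i, C_j)  +  lam * |S|.

    labels[i] = j  -> point i belongs to cluster j
    labels[i] = -1 -> point i is in the noise set S
    weights[i]     -> w(x_i; P), prior-information weight for point i
    """
    labels = np.asarray(labels)
    weights = np.asarray(weights)
    clustered = labels >= 0
    d = ((X[clustered] - centers[labels[clustered]]) ** 2).sum(axis=1)   # squared dispersion
    return np.sum(weights[clustered] * d) + lam * np.sum(~clustered)     # + lam * |S|
```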

  26. 4.2 Penalized and weighted K-means Properties of PW-Kmeans

  27. 4.2 Penalized and weighted K-means Properties of PW-Kmeans Relation to classification likelihood: the P-Kmeans loss function corresponds to a classification likelihood under a Gaussian model for the clusters, with the noise set uniformly distributed over a space V.

  28. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Prior information Six groups of validated cell cycle genes:

  29. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans Prior information Six groups of validated cell cycle genes:

  30. 4.2 Penalized and weighted K-means Formulation: PW-Kmeans A special example of PW-Kmeans for microarray: Prior knowledge of p pathways. The weight is designed as a transformation of the logistic function:

  31. 4.2 Penalized and weighted K-means Design of weight function Formulation: PW-Kmeans

  32. 4.2 Penalized and weighted K-means Application: Yeast cell cycle expression Prior information Six groups of validated cell cycle genes: 8 histone genes tightly coregulated in S phase

  33. 4.2 Penalized and weighted K-means Application: Yeast cell cycle expression Penalized K-means with no prior information. Cluster sizes: C1 = 58, C2 = 31, C3 = 39, C4 = 101, C5 = 71, noise set S = 1276. The 8 histone genes are left in the noise set S without being clustered.

  34. 4.2 Penalized and weighted K-means Application: Yeast cell cycle expression PW-Kmeans: take three randomly selected histone genes as prior information P. Cluster sizes: C1 = 112, C2 = 158, C3 = 88, C4 = 139, C5 = 57, noise set S = 1109. The 8 histone genes are now in cluster 3.

  35. 5. Cluster evaluation • Evaluation and comparison of clustering methods is always difficult. • In supervised learning (classification), the class labels (underlying truth) are known and performance can be evaluated through cross validation. • In unsupervised learning (clustering), external validation is usually not available. • Ideal data for cluster evaluation: • Data with class/tumor labels (for clustering samples) • Cell cycle data (for clustering genes) • Simulated data

  36. 5. Cluster evaluation Rand index: (Rand 1971) Y = {(a,b,c), (d,e,f)} Y' = {(a,b), (c,d,e), (f)} • Rand index: c(Y, Y') = (2+7)/15 = 0.6 (fraction of concordant pairs) • 0 ≤ c(Y, Y') ≤ 1 • Clustering methods can be evaluated by c(Y, Ytruth) if Ytruth is available.
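A small from-scratch script reproducing the slide's example: count the object pairs on which the two partitions agree (same cluster in both, or different clusters in both) and divide by the total number of pairs.

```python
from itertools import combinations

def rand_index(y1, y2):
    """Fraction of object pairs that are concordant between two partitions."""
    pairs = list(combinations(range(len(y1)), 2))
    agree = sum((y1[i] == y1[j]) == (y2[i] == y2[j]) for i, j in pairs)
    return agree / len(pairs)

# objects a, b, c, d, e, f
Y       = [0, 0, 0, 1, 1, 1]   # {(a,b,c), (d,e,f)}
Y_prime = [0, 0, 1, 1, 1, 2]   # {(a,b), (c,d,e), (f)}
print(rand_index(Y, Y_prime))  # 0.6 = (2 + 7) / 15
```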

  37. 5. Cluster evaluation Adjusted Rand index: (Hubert and Arabie 1985) Adjusted Rand index = (Rand index − expected Rand index) / (maximum Rand index − expected Rand index). The adjusted Rand index takes its maximum value at 1 and has a constant expected value of 0 when the two clusterings are totally independent.
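In practice the adjusted index is usually taken from a library; scikit-learn's adjusted_rand_score implements the Hubert and Arabie correction. On the same toy example:

```python
from sklearn.metrics import adjusted_rand_score

Y       = [0, 0, 0, 1, 1, 1]
Y_prime = [0, 0, 1, 1, 1, 2]
print(adjusted_rand_score(Y, Y_prime))   # about 0.12, well below the raw Rand index of 0.6
```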

  38. 6. Comparison

  39. 6. Comparison • Simulation (Thalamuthu et al. 2006): • 20 time-course samples for each gene; 15 clusters. • In each cluster, four groups of samples with similar intensity. • Individual sample and gene variation are added. • # of genes in each cluster ~ Poisson(10). • Scattered (noise) genes are added. • By visualization, the simulated data closely resembles real data.

  40. Different types of perturbations • Type I: a number of randomly simulated scattered genes (0, 5, 10, 20, 60, 100 and 200% of the original total number of clustered genes) is added. E.g., for sample j of a scattered gene, the expression level is randomly sampled from the empirical distribution of the expressions of all clustered genes in sample j. • Type II: a small random error from a normal distribution (SD = 0.05, 0.1, 0.2, 0.4, 0.8, 1.2) is added to each element of the log-transformed expression matrix, to evaluate the robustness of the clustering against potential random errors. • Type III: combination of Types I and II.
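A hedged sketch of the two perturbation types on a log-scale expression matrix (expr, genes × samples); the empirical resampling for the scattered genes follows the slide's description, and the function names are illustrative.

```python
import numpy as np

def add_scattered_genes(expr, fraction, rng):
    """Type I: append scattered genes; each sample value is drawn from the
    empirical distribution of all clustered genes in that same sample."""
    n_genes, n_samples = expr.shape
    n_new = int(round(fraction * n_genes))
    scattered = np.column_stack(
        [rng.choice(expr[:, j], size=n_new, replace=True) for j in range(n_samples)]
    )
    return np.vstack([expr, scattered])

def add_random_error(expr, sd, rng):
    """Type II: add small normal errors to every element of the matrix."""
    return expr + rng.normal(0.0, sd, size=expr.shape)

rng = np.random.default_rng(0)
expr = rng.normal(size=(150, 20))                                             # toy clustered-gene matrix
perturbed = add_random_error(add_scattered_genes(expr, 0.20, rng), 0.1, rng)  # Type III = I + II
print(perturbed.shape)                                                        # (180, 20)
```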

  41. 6. Comparison Different degrees of perturbation in the simulated microarray data

  42. 6. Comparison Simulation schemes performed in the paper. In total, 25 (simulation settings) × 100 (data sets) = 2,500 data sets are evaluated.

  43. 6. Comparison • Adjusted Rand index: a measure of similarity between two clusterings. • Compare each clustering result to the underlying true clusters and obtain the adjusted Rand index (the higher the better). T: tight clustering M: model-based P: K-medoids K: K-means H: hierarchical S: SOM

  44. Consensus Clustering Simpson et al. BMC Bioinformatics 2010, 11:590. doi:10.1186/1471-2105-11-590

  45. 6. Comparison • Consensus clustering with PAM (blue) • Consensus clustering with hierarchical clustering (red) • HOPACH (black) • Fuzzy c-means (green)

  46. 6. Comparison Comparison in real data sets: (see paper for detailed comparison criteria)

  47. References • Tight clustering: • George C. Tseng and Wing H. Wong. (2005) Tight Clustering: A Resampling-based Approach for Identifying Stable and Tight Patterns in Data. Biometrics. 61:10-16. • Penalized and weighted K-means: • George C. Tseng. (2007) Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics. 23:2247-2255. • Comparative study: • Anbupalam Thalamuthu*, Indranil Mukhopadhyay*, Xiaojing Zheng* and George C. Tseng. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 22:2405-2412.

  48. 6. Conclusion • Despite many sophisticated methods for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in microarray analysis. • We should use these methods for visualization, investigation and hypothesis generation. • We should not use these methods inferentially. • In general, methods that use resampling evaluation, allow scattered genes and relate to a model-based approach perform better. • Hierarchical clustering specifically: it merely provides a picture, from which we can draw many (or any) conclusions.

  49. 6. Conclusion • Common mistakes or warnings: • Running K-means with a large k and getting excited to see patterns, without further investigation. • K-means can show you patterns even in randomly generated data, and human eyes tend to see "patterns" anyway (see the sketch below). • Identifying genes that are predictive of survival (e.g. applying t-statistics to long vs. short survivors), clustering samples based on the selected genes, and finding that the samples cluster according to survival status. • The gene selection procedure is already biased towards the desired result.
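A quick way to convince yourself of the first warning, as a minimal sketch: run K-means with a large k on pure noise and look at the cluster mean profiles, which come out looking like distinct "patterns" even though there is no structure at all.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
noise = rng.normal(size=(1000, 20))              # 1000 "genes" x 20 samples of pure noise

km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(noise)
for j in range(3):                               # inspect a few cluster mean profiles
    print(np.round(km.cluster_centers_[j], 2))   # they look like distinct "patterns"
```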
