
Important clustering methods used in microarray data analysis



  1. Important clustering methods used in microarray data analysis Steve Horvath Human Genetics and Biostatistics UCLA

  2. Contents • Multidimensional scaling plots • Related to principal component analysis • k-means clustering • hierarchical clustering

  3. Introduction to clustering

  4. MDS plot of clusters

  5. MDS plot of clusters
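The two slides above show MDS plots of clustered samples. As a minimal sketch of how such a plot can be produced (not taken from the slides), the following Python code embeds a synthetic (samples x genes) expression matrix in two dimensions with scikit-learn; the data, group sizes, and labels are invented stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Synthetic "expression" matrix: 3 groups of 20 samples, 50 genes each
X = np.vstack([rng.normal(loc=mu, size=(20, 50)) for mu in (0.0, 2.0, 4.0)])
groups = np.repeat([0, 1, 2], 20)

# Metric MDS: embed the samples in 2 dimensions while approximately
# preserving their pairwise distances
coords = MDS(n_components=2, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=groups)
plt.xlabel("MDS dimension 1")
plt.ylabel("MDS dimension 2")
plt.title("MDS plot of clusters")
plt.show()
```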

  6. Two references for clustering • T. Hastie, R. Tibshirani, J. Friedman (2001) The Elements of Statistical Learning. Springer Series in Statistics • L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics

  7. Introduction to clustering Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.
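The contents list k-means among the methods covered. As a hedged sketch of the grouping idea described above, here is k-means applied to the same kind of synthetic data; the choice k = 3 is an assumption matching the simulated groups, not a general rule.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=mu, size=(20, 50)) for mu in (0.0, 2.0, 4.0)])

# Partition the objects into k = 3 clusters by minimizing the
# within-cluster sum of squared distances to the cluster centroids
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)                 # cluster assignment for each object
print(km.cluster_centers_.shape)  # (3, 50): one centroid per cluster
```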

  8. Proximity matrices are the input to most clustering algorithms Proximity between pairs of objects: similarity or dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality: d(i, k) ≤ d(i, j) + d(j, k).
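A short sketch of building such a proximity matrix in Python; the 1 - correlation conversion shown is one common monotone-decreasing choice (an illustration, not the slide's prescription), and it indeed need not satisfy the triangle inequality.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))   # 6 objects described by 50 measurements

# Symmetric dissimilarity matrix from Euclidean distances
D_euclid = squareform(pdist(X))

# Similarities (correlations) converted to dissimilarities via the
# monotone-decreasing map s -> 1 - s; not guaranteed to be a metric
S = np.corrcoef(X)
D_corr = 1.0 - S
print(D_euclid.shape, D_corr.shape)   # both 6 x 6, symmetric
```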

  9. Different intergroup dissimilarities Let G and H represent 2 groups. Single linkage uses the smallest between-group distance, d_SL(G, H) = min d(i, j) over i in G, j in H; complete linkage uses the largest, d_CL(G, H) = max d(i, j); group average uses the mean of all between-group distances, d_GA(G, H) = mean d(i, j).
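A minimal sketch of these three choices, computed from a precomputed pairwise distance matrix; the data and group index sets are arbitrary examples.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
D = squareform(pdist(rng.normal(size=(10, 5))))  # pairwise distances d(i, j)

G = [0, 1, 2, 3]            # objects in group G (arbitrary example)
H = [4, 5, 6, 7, 8, 9]      # objects in group H
between = D[np.ix_(G, H)]   # all between-group distances

d_single = between.min()    # single linkage: closest pair
d_complete = between.max()  # complete linkage: farthest pair
d_average = between.mean()  # group average: mean over all pairs
print(d_single, d_complete, d_average)
```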

  10. Agglomerative clustering, hierarchical clustering and dendrograms

  11. Hierarchical clustering plot

  12. Agglomerative clustering • Agglomerative clustering algorithms begin with every observation representing a singleton cluster. • At each of the N-1 steps, the two closest (least dissimilar) clusters are merged into a single cluster. • Therefore a measure of dissimilarity between 2 clusters must be defined.
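A sketch of this procedure using scipy, which starts from N singleton clusters, records the N-1 merges, and lets you cut the resulting tree into a chosen number of clusters; the data and the cut at 3 clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=mu, size=(20, 50)) for mu in (0.0, 2.0, 4.0)])

# Each row of Z records one of the N-1 merges: the two clusters
# joined and the intergroup dissimilarity at which they merge
Z = linkage(pdist(X), method="average")

labels = fcluster(Z, t=3, criterion="maxclust")  # cut tree into 3 clusters
print(labels)
```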

  13. Comparing different linkage methods If there is a strong clustering tendency, all 3 methods produce similar results. • Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters. • Complete linkage may lead to clusters in which observations assigned to a cluster are much closer to members of other clusters than to some members of their own cluster. Use for very compact clusters (like pearls on a string). • Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters.
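To see the slide's point that a strong clustering tendency makes the methods agree, one can run all three linkages on the same well-separated synthetic data and compare the resulting partitions; this sketch reuses the simulated matrix from above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=mu, size=(20, 50)) for mu in (0.0, 2.0, 4.0)])
d = pdist(X)

# With well-separated groups, all three linkages should recover
# essentially the same 3-cluster partition
for method in ("single", "complete", "average"):
    labels = fcluster(linkage(d, method=method), t=3, criterion="maxclust")
    print(method, labels)
```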

  14. Dendrogram Recursive binary splitting/agglomeration can be represented by a rooted binary tree. The root node represents the entire data set. The N terminal nodes of the tree represent the individual observations. Each nonterminal node ("parent") has two daughter nodes. The binary tree can thus be plotted so that the height of each node is proportional to the intergroup dissimilarity between its two daughters. A dendrogram provides a complete description of the hierarchical clustering in graphical form.
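A sketch of plotting such a dendrogram with scipy; the height at which branches join is the intergroup dissimilarity of the corresponding merge, as described above. The two-group data are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=mu, size=(10, 50)) for mu in (0.0, 3.0)])

Z = linkage(pdist(X), method="average")
dendrogram(Z)   # leaves: the N observations; heights: merge dissimilarities
plt.ylabel("intergroup dissimilarity")
plt.show()
```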

  15. Comments on dendrograms Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms. Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data. In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data. They are a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality d(i, i') ≤ max(d(i, k), d(i', k)) for all i, i', k.
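One way to quantify how faithful a dendrogram is as a summary: compare the original dissimilarities with the cophenetic distances the tree implies, which are ultrametric by construction. This sketch uses scipy's cophenetic correlation, an addition not on the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))

d = pdist(X)                       # observed pairwise dissimilarities
Z = linkage(d, method="average")

# Cophenetic distance between i and i': the merge height at which they
# first join; these always satisfy the ultrametric inequality
c, coph_d = cophenet(Z, d)
print("cophenetic correlation:", c)  # near 1 => dendrogram summarizes the data well
```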

  16. Figure 1: average, complete, and single linkage dendrograms (figure)
