
  1. Cluster Analysis I 9/28/2012

  2. Outline • Introduction • Distance and similarity measures for individual data points • A few widely used methods: hierarchical clustering, K-means, model-based clustering

  3. Introduction • The goal is to group or segment a collection of objects into subsets or “clusters”, such that objects within each cluster are more closely related to one another than objects assigned to different clusters. • Sometimes the goal is to arrange the clusters into a natural hierarchy. • Cluster genes: similar expression patterns imply co-regulation. • Cluster samples: identify potential sub-classes of disease.

  4. Introduction • Assigning objects to groups. • Estimating the number of clusters. • Assessing the strength/confidence of the cluster assignment for each individual object.

  5. Proximity Matrix • An N×N matrix D (N = number of objects) whose elements record the proximity (dissimilarity) between objects i and i'. • Most often we have measurements on p attributes for each object; then we can define the attribute-wise dissimilarity $D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j})$.

  6. Dissimilarity Measures • Two main classes of distance for continuous variables: • Distance metrics (scale-dependent) • One minus a correlation coefficient (scale-invariant)

  7. Minkowski distance • For two vectors $x = (x_1, \ldots, x_S)$ and $y = (y_1, \ldots, y_S)$ of length S, the Minkowski family of distance measures is defined as $d_k(x, y) = \left( \sum_{s=1}^{S} |x_s - y_s|^k \right)^{1/k}$.

  8. Two commonly used special cases • Manhattan distance (a.k.a. city-block distance, k = 1): $d_1(x, y) = \sum_{s=1}^{S} |x_s - y_s|$ • Euclidean distance (k = 2): $d_2(x, y) = \sqrt{\sum_{s=1}^{S} (x_s - y_s)^2}$
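As a concrete illustration (not from the original slides), here is a minimal NumPy sketch that builds the N×N proximity matrix of slide 5 from the Minkowski family, with k = 1 giving the Manhattan distance and k = 2 the Euclidean distance; the function and variable names are illustrative.

import numpy as np

def minkowski(x, y, k=2):
    """Minkowski distance between two vectors; k=1 Manhattan, k=2 Euclidean."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

def proximity_matrix(X, k=2):
    """N x N matrix D with D[i, j] = Minkowski distance between objects i and j."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = minkowski(X[i], X[j], k)
    return D

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))           # 5 objects measured on 10 variables
D_manhattan = proximity_matrix(X, k=1)
D_euclidean = proximity_matrix(X, k=2)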

  9. Mahalanobis distance • Takes the correlation structure of the variables into account: $d(x, y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)}$, where $\Sigma$ is the covariance matrix. • When the covariance matrix is assumed to be the identity, it is the same as the Euclidean distance.
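A minimal sketch of the Mahalanobis distance, assuming NumPy (an assumption, not something the slides prescribe); with the identity covariance it reduces to the Euclidean distance, as noted above.

import numpy as np

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T  cov^{-1}  (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
cov = np.cov(X, rowvar=False)                 # estimated covariance of the 3 variables
d_mahal = mahalanobis(X[0], X[1], cov)
d_eucl  = mahalanobis(X[0], X[1], np.eye(3))  # identity covariance -> Euclidean distance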

  10. Pearson correlation and inner product • Pearson correlation: $r(x, y) = \frac{\sum_{s}(x_s - \bar{x})(y_s - \bar{y})}{\sqrt{\sum_{s}(x_s - \bar{x})^2}\,\sqrt{\sum_{s}(y_s - \bar{y})^2}}$ • After standardization of the two vectors, it is (up to a factor of 1/S) their inner product. • Sensitive to outliers.

  11. Spearman correlation • Calculated from the ranks of the two vectors (note: the sum of the ranks is n(n+1)/2).

  12. Spearman correlation • When there are no tied observations: $\rho = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of the i-th pair. • Robust to outliers since it is based on the ranks of the data.
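A small illustrative sketch (assuming NumPy and SciPy, which the slides do not mention) contrasting Pearson and Spearman on data with a single outlier, and checking that Spearman is just Pearson computed on the ranks when there are no ties.

import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = x + 0.1 * rng.normal(size=30)
y[0] = 50.0                               # inject a single large outlier

print(pearsonr(x, y)[0])                  # severely distorted by the one outlier
print(spearmanr(x, y)[0])                 # remains high, since ranks barely change

# Spearman is Pearson applied to the ranks (no ties in this example):
rx, ry = rankdata(x), rankdata(y)
print(pearsonr(rx, ry)[0])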

  13. Standardization of the data • Standardize gene rows to mean 0 and stdev 1. • Advantage: makes Euclidean distance and correlation equivalent. Many useful methods require the data to be in Euclidean space.
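A minimal numerical check of the claimed equivalence, assuming NumPy and standardization with the population standard deviation (divide by n): after standardizing, the squared Euclidean distance equals 2p(1 − r), where r is the Pearson correlation and p the number of measurements.

import numpy as np

rng = np.random.default_rng(3)
x, y = rng.normal(size=100), rng.normal(size=100)

def standardize(v):
    return (v - v.mean()) / v.std()       # mean 0, std 1 (ddof=0)

zx, zy = standardize(x), standardize(y)
r = np.corrcoef(x, y)[0, 1]               # Pearson correlation of the raw vectors
lhs = np.sum((zx - zy) ** 2)              # squared Euclidean distance after standardizing
rhs = 2 * len(x) * (1 - r)
print(np.isclose(lhs, rhs))               # True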

  14. Clustering methods • Clustering algorithms come in two flavors: hierarchical methods, which build a tree of nested clusters, and partitioning methods, which divide the objects into a pre-specified number of groups.

  15. Hierarchical clustering • Produces a tree or dendrogram. • Avoids specifying how many clusters are appropriate: cutting the tree at some level provides a partition for each possible k. • The tree can be built in two distinct ways: • Bottom-up: agglomerative clustering (most used). • Top-down: divisive clustering.

  16. Agglomerative Methods • The most popular hierarchical clustering approach. • Start with n singleton clusters. • At each step, merge the two closest clusters according to a measure of between-cluster dissimilarity (the linkage).
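A minimal sketch of bottom-up (agglomerative) clustering using SciPy's hierarchical-clustering routines; SciPy and the toy data are assumptions here, not something the slides prescribe. The linkage choice (single / complete / average) is discussed on the next slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(5, 1, size=(20, 2))])

D = pdist(X, metric='euclidean')                   # condensed pairwise distance matrix
Z = linkage(D, method='average')                   # bottom-up merges (average linkage)
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the tree into 2 clusters
# dendrogram(Z)  # draws the tree if a matplotlib backend is available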

  17. Compute group similarities

  18. Choice of linkage

  19. Comparison of the three methods • Single-link • Elongated clusters • Individual decision, sensitive to outliers • Complete-link • Compact clusters • Individual decision, sensitive to outliers • Average-link or centroid • “In between” • Group decision, insensitive to outliers.

  20. Divisive Methods • Begin with the entire data set as a single cluster, and recursively divide one of the existing clusters into two daughter clusters. • Continue until each cluster contains only one object, or all of its members coincide with each other. • Not as popular as agglomerative methods.

  21. Divisive Algorithms • At each division, another method, e.g. K-means with K = 2, could be used to split a cluster. • Smith et al. (1965) proposed a method that does not rely on another clustering method (see the sketch below): • Start with one cluster G; move the object that is furthest from the others (highest average pair-wise distance) into a new cluster H. • At each subsequent step, move the object in G that is closest to H (largest positive difference between its average pair-wise distance to the objects remaining in G and its average distance to the objects in H). • Stop when every object remaining in G is, on average, closer to the rest of G than to the objects in H.
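A minimal sketch of the splitting step described above, assuming NumPy; the helper name split_cluster and the toy data are illustrative, and this is a simplified reading of the procedure rather than the original authors' implementation.

import numpy as np

def split_cluster(D, members):
    """D: full pairwise distance matrix; members: indices of the cluster to split."""
    members = list(members)
    # seed H with the object having the largest average distance to the others
    avg = [D[i, [j for j in members if j != i]].mean() for i in members]
    h = [members.pop(int(np.argmax(avg)))]
    g = members
    while len(g) > 1:
        # for each object in G: average distance to the rest of G minus average
        # distance to H; move the best candidate only while the gap is positive
        gaps = []
        for i in g:
            a = D[i, [j for j in g if j != i]].mean()
            b = D[i, h].mean()
            gaps.append(a - b)
        best = int(np.argmax(gaps))
        if gaps[best] <= 0:
            break
        h.append(g.pop(best))
    return g, h

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
G, H = split_cluster(D, range(len(X)))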

  22. Hierarchical clustering • The most overused statistical method in gene expression analysis. • Gives us pretty pictures. • Results tend to be unstable and sensitive to small changes in the data.

  23. Partitioning method • Partition the data (size N) into a pre-specified number K of mutually exclusive and exhaustive groups: a many-to-one mapping, or encoder k = C(i), that assigns the ith observation to the kth cluster. • Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimization of a specific loss function.

  24. Partitioning method • A natural loss function would be the within-cluster point scatter: $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$ • The total point scatter $T = \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d(x_i, x_{i'}) = W(C) + B(C)$, where $B(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} d(x_i, x_{i'})$ is the between-cluster point scatter. • Since T is fixed, minimizing W(C) is equivalent to maximizing B(C).
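A minimal NumPy sketch (illustrative, not from the slides) that computes W(C) and B(C) for an arbitrary encoder and verifies that they sum to the fixed total scatter T, which is why minimizing W is the same as maximizing B.

import numpy as np

def scatters(D, labels):
    same = labels[:, None] == labels[None, :]   # True where objects share a cluster
    T = 0.5 * D.sum()                           # total point scatter
    W = 0.5 * D[same].sum()                     # within-cluster point scatter
    B = 0.5 * D[~same].sum()                    # between-cluster point scatter
    return W, B, T

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
labels = rng.integers(0, 3, size=30)            # an arbitrary 3-cluster encoder C(i)
W, B, T = scatters(D, labels)
print(np.isclose(W + B, T))                     # True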

  25. Partitioning method • In principle, we simply need to minimize W or maximize B over all possible assignments of the N objects to K clusters. • However, the number of distinct assignments (a Stirling number of the second kind) grows extremely rapidly as N and K become large, so exhaustive search is infeasible.

  26. Partitioning method • In practice, we can only examine a small fraction of all possible encoders. • Such feasible strategies are based on iterative greedy descent: • An initial partition is specified. • At each iterative step, the cluster assignments are changed in such a way that the value of the criterion is improved from its previous value.

  27. K-means • Choose the squared Euclidean distance as the dissimilarity measure: $d(x_i, x_{i'}) = \|x_i - x_{i'}\|^2$. • Minimize the within-cluster point scatter, which for this choice becomes $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|x_i - x_{i'}\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2$, where $\bar{x}_k$ is the mean vector of cluster k and $N_k$ the number of objects assigned to it.

  28. K-means Algorithm (closely related to the EM algorithm for a particular Gaussian mixture model) • Choose K centroids at random. • Make an initial partition of the objects into K clusters by assigning each object to its closest centroid. • M-step: recalculate the centroid (mean) of each of the K clusters. • E-step: reassign objects to the closest centroids. • Repeat steps 3 and 4 until no reallocations occur.
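A minimal NumPy sketch of the algorithm above (Lloyd-style K-means); in practice a library implementation such as scikit-learn's KMeans would normally be used, and the names and data below are illustrative.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1: random centroids
    for _ in range(n_iter):
        # assignment: each object goes to its closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # update: each centroid becomes the mean of its assigned objects
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):                         # no reallocation -> stop
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)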

  29. K-means example

  30. K-means: local minimum problem • (Figure) Different initial values for K-means; the run marked “x” falls into a local minimum.

  31. K-means: discussion • Advantages: • Fast and easy. • Nice relationship with the Gaussian mixture model. • Disadvantages: • Can run into local minima, so it should be started from multiple initial values (see the sketch below). • Needs the number of clusters to be specified in advance (estimation of the number of clusters). • Does not allow scattered objects (tight clustering).
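A minimal sketch of the multiple-initialization remedy mentioned above, assuming scikit-learn (which the slides do not reference): KMeans reruns the algorithm n_init times and keeps the solution with the smallest within-cluster sum of squares.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=20, random_state=0).fit(X)
print(km.inertia_)        # within-cluster sum of squares of the best of the 20 runs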

  32. Mixture model for clustering

  33. Model-based clustering • Fraley and Raftery (1998) applied a Gaussian mixture model. • The parameters can be estimated by the EM algorithm. • Cluster membership is decided by the posterior probability of each object belonging to cluster k.
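A minimal sketch of Gaussian-mixture model-based clustering, assuming scikit-learn (an assumption; the slides do not name any software): the EM fit returns posterior membership probabilities, and BIC is one common model-selection criterion.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
posterior = gmm.predict_proba(X)     # posterior P(cluster k | x_i) for each object
labels = posterior.argmax(axis=1)    # assign to the most probable component
print(gmm.bic(X))                    # BIC, often used to compare candidate models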

  34. Review of the EM algorithm • The EM algorithm is widely used for missing-data problems. • Here the missing data are the cluster memberships. • Let us review the EM algorithm with a simple example.

  35. E-step / M-step (worked example)

  36. The CML approach • Indicator variables $z_{ik}$, identifying the mixture component of origin for each observation $x_i$, are treated as unknown parameters. • Two CML (classification maximum likelihood) criteria have been proposed, corresponding to two sampling schemes.

  37. Two CMLs • A random sample within each cluster • A random sample from a population with the mixture density

  38. • Classification likelihood: $C_1(\theta, z) = \sum_{k=1}^{K} \sum_{i:\, z_{ik}=1} \log f_k(x_i \mid \theta_k)$ • Mixture (classification) likelihood: $C_2(\theta, \pi, z) = \sum_{k=1}^{K} \sum_{i:\, z_{ik}=1} \log\!\left[\pi_k f_k(x_i \mid \theta_k)\right]$ • Gaussian assumption: $f_k(x \mid \theta_k) = \phi(x \mid \mu_k, \Sigma_k)$, the multivariate normal density.

  39. The Classification EM algorithm
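A minimal sketch of the Classification EM (CEM) idea under equal-proportion spherical Gaussians, assuming NumPy: an E-step computing (log) posteriors, a classification step that assigns each object to its most probable component, and an M-step updating the means. This is a simplified illustration, not Celeux and Govaert's exact algorithm.

import numpy as np

def cem(X, k, n_iter=50, sigma2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: unnormalized log posteriors under N(mu_k, sigma2 * I), equal proportions
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        log_post = -0.5 * d2 / sigma2
        # C-step: hard assignment to the most probable component
        z = log_post.argmax(axis=1)
        # M-step: update each mean from its assigned objects
        new_mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return z, mu

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
z, mu = cem(X, k=2)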

  40. Related to K-means • When $f_k(x)$ is assumed to be Gaussian and the covariance matrix is identical and spherical across all clusters, i.e. $\Sigma_k = \sigma^2 I$ for all k, maximizing the C1-CML criterion is equivalent to minimizing W.
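A short derivation sketch of this equivalence, under the stated assumption $\Sigma_k = \sigma^2 I$ (and equal mixing proportions): $C_1(\theta, z) = \sum_{k=1}^{K} \sum_{C(i)=k} \log \phi(x_i \mid \mu_k, \sigma^2 I) = -\frac{1}{2\sigma^2} \sum_{k=1}^{K} \sum_{C(i)=k} \lVert x_i - \mu_k \rVert^2 + \text{const}$. Maximizing over $\mu_k$ gives $\hat{\mu}_k = \bar{x}_k$, so maximizing $C_1$ over the assignments is the same as minimizing $\sum_{k} \sum_{C(i)=k} \lVert x_i - \bar{x}_k \rVert^2$, the K-means within-cluster sum of squares.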

  41. Model-based methods • Advantages: • Flexibility on cluster covariance structure. • Rigorous statistical inference with full model. • Disadvantages: • Model selection is usually difficult. Data may not fit Gaussian model. • Too many parameters to estimate in complex covariance structure. • Local minimum problem

  42. References • Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning (2nd ed.), New York: Springer. http://www-stat.stanford.edu/~tibs/ElemStatLearn/ • Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011), Cluster Analysis (5th ed.), West Sussex, UK: John Wiley & Sons Ltd. • Celeux, G., and Govaert, G. (1992), “A Classification EM Algorithm for Clustering and Two Stochastic Versions,” Computational Statistics & Data Analysis, 14, 315–332.
