
Clustering




  1. Clustering

  2. Overview • What is clustering? • Similarity/distance metrics • Hierarchical clustering algorithms • K-means

  3. What is clustering? • Group similar objects together • Objects in the same cluster (group) are more similar to each other than objects in different clusters • An exploratory data analysis tool

  4. How to define similarity? • Similarity metric: a measure of pairwise similarity or dissimilarity • Examples: correlation coefficient, Euclidean distance • [Figure: an n × p raw data matrix (genes × experiments) is converted into an n × n gene-by-gene similarity matrix]

  5. Similarity metrics • Euclidean distance • Correlation coefficient

  6. Example • Correlation(X,Y) = 1, Distance(X,Y) = 4 • Correlation(X,Z) = -1, Distance(X,Z) = 2.83 • Correlation(X,W) = 1, Distance(X,W) = 1.41
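The contrast between the two metrics can be sketched in a few lines. The vectors below are illustrative, not the slide's X, Y, Z, W; they show how two profiles can be perfectly correlated (same shape) while still far apart in Euclidean distance (different magnitude):

```python
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    # Pearson correlation: covariance divided by the product of std devs
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 7.0, 9.0]        # Y = 2X + 1: same shape, larger magnitude
print(correlation(X, Y))         # ≈ 1.0: perfectly correlated
print(euclidean(X, Y))           # yet clearly nonzero distance
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Which metric is appropriate depends on the question: correlation groups profiles by shape regardless of scale, Euclidean distance groups them by absolute values.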

  7. Clustering algorithms • Inputs: raw data matrix or similarity matrix; number of clusters • Two classes of clustering algorithms: hierarchical and partitional

  8. Steps to Cluster Analysis • The two key steps in cluster analysis are measuring the distances between objects and grouping the objects based on the resulting distances (linkages). • The distances provide a measure of similarity between objects and may be computed in a variety of ways, such as the Euclidean metric. • Linkages define how the distance between groups is measured.

  9. Hierarchical Clustering • Agglomerative (bottom-up) • Algorithm: • Initialize: each item is its own cluster • Iterate: select the two most similar clusters and merge them • Halt: when the required number of clusters is reached • The sequence of merges can be drawn as a dendrogram
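The initialize/iterate/halt loop above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and single linkage (the slides do not fix a linkage until slide 10), with made-up 2-D points:

```python
import math
from itertools import combinations

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    # cluster distance = distance of the closest pair of points across clusters
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerative(points, k, linkage=single_linkage):
    # Initialize: every item is its own cluster
    clusters = [[p] for p in points]
    # Iterate: merge the two closest clusters; halt when k clusters remain
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(pts, 3))   # three clusters: two close pairs and one outlier
```

Recomputing all pairwise linkages every round is O(n^3) overall; real implementations cache and update the distance matrix instead.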

  10. Clustering Procedure – Linkage Methods • The single linkage method is based on minimum distance, or the nearest-neighbor rule: at every stage, the distance between two clusters is the distance between their two closest points. • The complete linkage method is similar, except that it is based on maximum distance, or the furthest-neighbor rule: the distance between two clusters is the distance between their two furthest points. • The average linkage method defines the distance between two clusters as the average of the distances between all pairs of objects, one member of each pair drawn from each cluster.
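The three linkage rules follow directly from their definitions. The two example clusters below are made up for illustration:

```python
import math

def d(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single(c1, c2):
    # minimum distance: nearest pair across the two clusters
    return min(d(p, q) for p in c1 for q in c2)

def complete(c1, c2):
    # maximum distance: furthest pair across the two clusters
    return max(d(p, q) for p in c1 for q in c2)

def average(c1, c2):
    # mean of all pairwise distances, one point taken from each cluster
    pairs = [d(p, q) for p in c1 for q in c2]
    return sum(pairs) / len(pairs)

c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (4.0, 0.0)]
print(single(c1, c2))    # 3.0, the closest pair
print(complete(c1, c2))  # ≈ 4.47, the furthest pair
print(average(c1, c2))   # somewhere in between
```

By construction, average linkage always lies between single and complete linkage for the same pair of clusters.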

  11.–20. Worked example: single-linkage hierarchical clustering of four items A, B, C, D. [Figures: at each step, the two closest clusters in the distance matrix are merged, at distances 2, then 3, then 10, until all items form one cluster]
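The A/B/C/D walk-through can be reproduced in code. The distance matrix below is an assumption, chosen only so that the single-link merges occur at the distances 2, 3 and 10 shown on the slides:

```python
# Pairwise distances for items A, B, C, D (assumed values, see note above)
D = {("A", "B"): 2, ("A", "C"): 5, ("A", "D"): 10,
     ("B", "C"): 3, ("B", "D"): 11, ("C", "D"): 12}

def dist(x, y):
    return D[(x, y)] if (x, y) in D else D[(y, x)]

def single_link(c1, c2):
    # single linkage: distance of the closest pair across the two clusters
    return min(dist(x, y) for x in c1 for y in c2)

clusters = [{"A"}, {"B"}, {"C"}, {"D"}]
merges = []
while len(clusters) > 1:
    # find the pair of current clusters with the smallest linkage distance
    d_best, i, j = min((single_link(clusters[i], clusters[j]), i, j)
                       for i in range(len(clusters))
                       for j in range(i + 1, len(clusters)))
    merges.append((d_best, sorted(clusters[i]), sorted(clusters[j])))
    clusters[i] |= clusters[j]
    del clusters[j]

for d_best, left, right in merges:
    print(f"merge {left} + {right} at distance {d_best}")
# merge ['A'] + ['B'] at distance 2
# merge ['A', 'B'] + ['C'] at distance 3
# merge ['A', 'B', 'C'] + ['D'] at distance 10
```

The three merge distances are exactly the heights at which the dendrogram branches join.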

  21. Hierarchical: Single Link • Cluster similarity = similarity of the two most similar members • - Potentially long, skinny clusters • + Fast

  22.–24. Example: single link. [Figures: step-by-step merges on five points and the resulting dendrogram]

  25. Hierarchical: Complete Link • Cluster similarity = similarity of the two least similar members • + Tight clusters • - Slow

  26.–28. Example: complete link. [Figures: step-by-step merges on five points and the resulting dendrogram]

  29. Hierarchical: Average Link • Cluster similarity = average similarity of all pairs • + Tight clusters • - Slow

  30.–32. Example: average link. [Figures: step-by-step merges on five points and the resulting dendrogram]

  33. Hierarchical divisive clustering algorithms • Top-down: start with all objects in one cluster • Successively split into smaller clusters • Tend to be less efficient than agglomerative methods • Resolver implemented a deterministic annealing approach

  34. Partitioning – Non-hierarchical • Assign each item to the cluster whose centroid (mean) is closest • Recalculate the centroids of the cluster receiving the item and the cluster losing the item • Repeat the last step until no reassignments occur

  35. Details of k-means • Start from K initial cluster centroids • Iterate until convergence: • Assign each data point to the closest centroid • Compute new centroids as the means of their assigned points
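The assign/update loop can be sketched as below. Initializing the centroids to the first k points is a simplification made here for determinism; practical implementations use random restarts or k-means++ seeding. The example points are made up:

```python
def kmeans(points, k, iters=100):
    # naive deterministic init (an assumption of this sketch): first k points
    centroids = list(points[:k])
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins the cluster with the closest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            groups[i].append(p)
        # update step: recompute each centroid as the mean of its cluster
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:   # no centroid moved: converged (a local optimum)
            break
        centroids = new
    return centroids, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(pts, 2)
print(centroids)   # the centroids settle at the means of the two obvious groups
```

Note the result is only a local optimum: a bad initialization can leave k-means in a worse clustering, which is why multiple restarts are standard practice.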

  36. Properties of k-means • Fast • Proven to converge to a local optimum • In practice, converges quickly • Tends to produce spherical, equal-sized clusters

  37. Summary • Definition of clustering • Pairwise similarity: • Correlation • Euclidean distance • Clustering algorithms: • Hierarchical (single-link, complete-link, average-link) • K-means • Different clustering algorithms → different clusters
