

  1. Clustering. Prof. Navneet Goyal, BITS, Pilani

  2. Hierarchical Algorithms • Single Link (MIN) • MST Single Link • Complete Link (MAX) • Average Link (Group Average)

  3. Single Linkage Clustering • It is an example of agglomerative hierarchical clustering. • We consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

  4. Algorithm. Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of single-linkage clustering is as follows: 1. Start by assigning each item to its own cluster, so that with N items we have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain. 2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that we now have one cluster fewer. 3. Compute the distances (similarities) between the new cluster and each of the old clusters. 4. Repeat steps 2 and 3 until all N items are clustered into a single cluster.
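A minimal Python sketch of these four steps, assuming a precomputed symmetric distance matrix (the function and variable names are illustrative, not from the slides):

```python
def single_linkage(dist):
    """Naive agglomerative single-link clustering.

    dist: symmetric N x N list of lists of pairwise distances.
    Returns the merge history as (cluster_a, cluster_b, level) tuples.
    """
    # Step 1: every item starts in its own cluster.
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters, where the
        # cluster-to-cluster distance is the minimum item-to-item distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Steps 3-4: replace the pair with its union and repeat.
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

This is the O(N^3) textbook formulation; practical implementations maintain the proximity matrix incrementally, as the next slides illustrate.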

  5. Starting Situation. Start with clusters of individual points and a proximity matrix. [Figure: proximity matrix over points p1, p2, p3, p4, p5, ...]

  6. Intermediate Situation. After some merging steps, we have some clusters. [Figure: clusters C1–C5 and the corresponding proximity matrix]

  7. Intermediate Situation. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: clusters C1–C5, with C2 and C5 marked for merging]

  8. After Merging. The question is: how do we update the proximity matrix? [Figure: proximity matrix with rows and columns C1, C2 ∪ C5, C3, C4; the entries involving C2 ∪ C5 are marked '?']
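For single link, the '?' entries come straight from the old rows: the distance from the merged cluster to any other cluster Ck is min(d(C2, Ck), d(C5, Ck)). A small sketch of that update (names illustrative):

```python
def merged_row(prox, i, j):
    """Row of the proximity matrix for the union of clusters i and j
    under single link: element-wise minimum of the two old rows.
    Rows/columns i and j are then deleted and this row appended."""
    return [min(a, b) for a, b in zip(prox[i], prox[j])]
```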

  9. How to Define Inter-Cluster Similarity? • MIN • MAX • Group Average • Distance Between Centroids • Other methods driven by an objective function (Ward's Method uses squared error) [Figure: proximity matrix over points p1–p5]

  10.–13. [Figures: the same list of inter-cluster similarity definitions, with each slide highlighting one of them in turn on the proximity matrix: MIN, MAX, Group Average, Distance Between Centroids]
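When clusters A and B merge, each of these definitions gives its own rule for updating the proximity matrix. A hedged sketch of the three main rules (function and parameter names are illustrative; the cluster sizes matter only for group average):

```python
def updated_distance(d_ak, d_bk, n_a, n_b, method):
    """Distance from the merged cluster (A u B) to another cluster K,
    given the old distances d(A,K) and d(B,K) and the cluster sizes."""
    if method == "min":            # single link
        return min(d_ak, d_bk)
    if method == "max":            # complete link
        return max(d_ak, d_bk)
    if method == "group_average":  # average link (size-weighted mean)
        return (n_a * d_ak + n_b * d_bk) / (n_a + n_b)
    raise ValueError(method)
```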

  14. An Example. A hierarchical clustering of distances in kilometers between some Italian cities, using single linkage.

  15. Input distance matrix (L = 0 for all the clusters). [Table: pairwise distances between BA, FI, MI, NA, RM, TO; not reproduced in this transcript.] The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO", and the level of the new cluster is L(MI/TO) = 138.

  16. Now min d(i,j) = d(NA,RM) = 219, so we merge NA and RM into a new cluster called NA/RM, with L(NA/RM) = 219.

  17. min d(i,j) = d(BA,NA/RM) = 255, so we merge BA and NA/RM into a new cluster called BA/NA/RM, with L(BA/NA/RM) = 255 (in single linkage, the level of a merge is the distance at which it occurs).

  18. min d(i,j) = d(BA/NA/RM,FI) = 268, so we merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM, with L(BA/FI/NA/RM) = 268.

  19. Finally, we merge the last two clusters, MI/TO and BA/FI/NA/RM, at level 295, the smallest distance between a city in one cluster and a city in the other.
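The slides quote the merge levels (138, 219, 255, 268) but not the full input table. Assuming the standard distance matrix from the classic version of this example (the entries not quoted above are reconstructed and should be treated as assumptions), SciPy reproduces the whole hierarchy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distances in km between BA, FI, MI, NA, RM, TO. Only the merge levels
# 138, 219, 255, 268 are quoted on the slides; the rest is assumed.
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
])

Z = linkage(squareform(D), method="single")
print(Z)  # each row: cluster i, cluster j, merge level, new cluster size
```

Under these assumptions the merge levels come out as 138, 219, 255, 268, 295, matching the sequence on slides 15–19.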

  20. Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. • The clusters may correspond to meaningful taxonomies, e.g. in the biological sciences (animal kingdom, phylogeny reconstruction, ...).

  21. Interpreting Dendrograms. [Figure: a set of nested clusters and the corresponding dendrogram]

  22. Advantages • Single linkage is best suited to detecting line-like (elongated) structures. • It is invariant under monotonic transformations of the dissimilarities or similarities: for example, the results do not change if the dissimilarities or similarities are squared, or if we take their log. • Intuitive.
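The invariance claim is easy to check empirically. A quick sketch (assuming SciPy is available; the data is just a random condensed distance matrix for six points):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
d = rng.random(15)  # condensed distances for 6 points (C(6,2) = 15 entries)

Z1 = linkage(d, method="single")
Z2 = linkage(d ** 2, method="single")
# The merge order (first two columns) is identical; only the merge
# levels differ, because squaring is monotonic on nonnegative values.
assert (Z1[:, :2] == Z2[:, :2]).all()
```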

  23. Agglomerative Example. [Figure: points A, B, C, D, E merged at thresholds 1 through 5, with the resulting dendrogram]

  24. MST Example. [Figure: minimum spanning tree over points A, B, C, D, E]

  25. Agglomerative Algorithm. [Figure: pseudocode, not reproduced in this transcript]

  26. Single Link • Views the items as a graph, with links (distances) between them. • Finds maximal connected components in this graph. • Two clusters are merged if there is at least one edge connecting them. • Uses a threshold distance at each level. • Can be agglomerative or divisive.
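A minimal sketch of this graph view (names are illustrative): the single-link clusters at threshold t are exactly the connected components of the graph whose edges join pairs at distance at most t.

```python
def clusters_at_threshold(dist, t):
    """Connected components of the graph with an edge (i, j) whenever
    dist[i][j] <= t; these are the single-link clusters at level t."""
    n = len(dist)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        # depth-first search from each unvisited node
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and dist[i][j] <= t:
                    seen.add(j)
                    stack.append(j)
        components.append(comp)
    return components
```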

  27. MST Single Link Algorithm
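The pseudocode itself is not reproduced in this transcript. A common formulation, sketched below under that assumption, first builds a minimum spanning tree (Prim's algorithm here) and then reads the single-link merges off the MST edges in increasing weight order:

```python
def mst_edges(dist):
    """Prim's algorithm on the complete graph given by a distance matrix.
    Returns the n-1 MST edges as (weight, i, j), sorted by weight;
    processing them in this order reproduces the single-link merges."""
    n = len(dist)
    in_tree = {0}
    # For each vertex outside the tree: (cheapest distance to the tree,
    # the tree vertex that achieves it).
    best = {j: (dist[0][j], 0) for j in range(1, n)}
    edges = []
    while len(in_tree) < n:
        j = min(best, key=lambda k: best[k][0])
        w, i = best.pop(j)
        edges.append((w, i, j))
        in_tree.add(j)
        for k in best:
            if dist[j][k] < best[k][0]:
                best[k] = (dist[j][k], j)
    return sorted(edges)
```

Since the MST has only n-1 edges, this avoids repeatedly scanning the full proximity matrix during the merge phase.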

  28. Single Link Clustering

  29. How to Compute Group Similarity? Three popular methods, given two groups g1 and g2: • Single-link algorithm: s(g1,g2) = similarity of the closest pair. • Complete-link algorithm: s(g1,g2) = similarity of the farthest pair. • Average-link algorithm: s(g1,g2) = average similarity over all pairs.
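The three definitions as plain Python over an arbitrary pairwise similarity function s (names are illustrative). Note that when working with similarities rather than distances, the closest pair is the most similar one:

```python
from itertools import product

def single_link_sim(g1, g2, s):
    # closest pair = most similar pair
    return max(s(x, y) for x, y in product(g1, g2))

def complete_link_sim(g1, g2, s):
    # farthest pair = least similar pair
    return min(s(x, y) for x, y in product(g1, g2))

def average_link_sim(g1, g2, s):
    pairs = list(product(g1, g2))
    return sum(s(x, y) for x, y in pairs) / len(pairs)
```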

  30. Three Methods Illustrated. [Figure: two groups g1 and g2, with the closest pair (single-link), the farthest pair (complete-link), and all pairs (average-link) highlighted]

  31. Hierarchical: Single Link • Cluster similarity = similarity of the two most similar members. - Potentially long and skinny clusters. + Fast.

  32.–34. Example: single link. [Figures: step-by-step single-link merges of points 1–5, with the growing dendrogram]

  35. Hierarchical: Complete Link • Cluster similarity = similarity of the two least similar members. + Tight clusters. - Slow.

  36.–38. Example: complete link. [Figures: step-by-step complete-link merges of points 1–5, with the growing dendrogram]

  39. Hierarchical: Average Link • Cluster similarity = average similarity of all pairs. + Tight clusters. - Slow.

  40.–42. Example: average link. [Figures: step-by-step average-link merges of points 1–5, with the growing dendrogram]

  43. Comparison of the Three Methods • Single-link: "loose" clusters; each merge rests on an individual pair, so it is sensitive to outliers. • Complete-link: "tight" clusters; each merge rests on an individual pair, so it is sensitive to outliers. • Average-link: "in between"; each merge is a group decision, so it is insensitive to outliers. • Which one is best? It depends on what you need!

  44. Other Approaches to Clustering • Density-based methods: based on connectivity and density functions; filter out noise and find clusters of arbitrary shape. • Grid-based methods: quantize the object space into a grid structure.

  45. Some Research Directions • Ensemble Clustering • Parallelizing Clustering Algorithms to leverage a Cluster

  46. Ensemble Clustering • Similar to Ensemble Classification • Consensus Clustering • Obtain different clustering solutions and then reconcile them

  47. Parallelizing Clustering Algorithms • Parallelize to leverage a cluster of machines. • Nodes are typically multi-core. • Two levels of parallelism: node level and core level. • The two levels are not necessarily orthogonal, and hybrid schemes are non-trivial. • Programming environments: MPI, OpenMP.
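As a toy illustration of the core level only: the slide names MPI and OpenMP, but the same idea can be sketched with Python's multiprocessing (everything below is an illustrative stand-in, not the deck's environment). Each worker computes rows of the pairwise distance matrix; node-level parallelism would additionally shard the rows across machines, e.g. with MPI.

```python
from multiprocessing import Pool
import numpy as np

POINTS = np.random.RandomState(0).rand(2000, 2)  # fixed seed: identical in every worker

def row(i):
    # One row of the pairwise Euclidean distance matrix.
    return np.linalg.norm(POINTS - POINTS[i], axis=1)

if __name__ == "__main__":
    with Pool() as pool:  # core-level parallelism on a single node
        dist = np.vstack(pool.map(row, range(len(POINTS))))
    print(dist.shape)     # (2000, 2000)
```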
