
Clustering II



Presentation Transcript


  1. Clustering II Relevant keywords: K-nearest neighbor, Single-link, Complete-link, Average-link, and Centroid-based Clustering

  2. Outline • Motivations • Hierarchical • Overview • Dendrogram • Single-link vs. centroid-based approach • Other clustering approaches

  3. Motivation • Help identify structure • To support browsing • To help refine queries • Recall vs. precision • To improve retrieval efficiency

  4. Hierarchical • [Figure: agglomerative clustering of entities a–e proceeds left to right through steps 0–4, first fusing a with b, then d with e, then c with (d, e), and finally (a, b) with (c, d, e); divisive clustering runs the same steps in reverse, from step 4 back to 0]

  5. Agglomerative • More widely applied • Start with n distinct entities (single-member clusters) and end with a single cluster containing all n entities • At each stage, fuse the entities or clusters that are closest (most similar) • Variant agglomerative methods exist, depending on how distance (similarity) is defined between an entity and a group, or between two groups

  6. Single link agglomerative approach • Requires starting with a similarity (or distance) matrix • The distance between two clusters is defined as the distance between the closest pair of items, one drawn from each cluster, i.e., the nearest-neighbor distance (see the sketch below)
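
  As one concrete reading of this definition, here is a minimal Python sketch; the function name and the dense distance-matrix representation are my own assumptions, not from the presentation:

```python
from itertools import product

def single_link_distance(cluster_a, cluster_b, dist):
    """Single-link (nearest-neighbor) distance between two clusters:
    the smallest item-to-item distance across the two clusters.
    `dist` is a symmetric matrix of pairwise item distances."""
    return min(dist[i][j] for i, j in product(cluster_a, cluster_b))
```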

  7. Example – SLA • Let’s assume we begin with a distance matrix D1 of five entities [D1 was shown as an image on the original slide and is not reproduced in this transcript]. The goal is to derive a dendrogram as depicted in the next slide.

  8. [Dendrogram for the single-link example: the partition of entities 1–5 at each distance level]
  • P5 at 5.0: [1 2 3 4 5]
  • P4 at 4.0: [1 2] [3 4 5]
  • P3 at 3.0: [1 2] [3] [4 5]
  • P2 at 2.0: [1 2] [3] [4] [5]
  • P1 at 1.0: [1] [2] [3] [4] [5]

  9. Dendrogram • The height at which two branches join represents the distance between the pair of items or clusters being merged • The branching structure records the sequence of merges conducted to achieve the clusters

  10. Agglomerative SL Algorithm • Given the distance matrix, find the smallest non-zero entry and merge the corresponding two clusters • Recalculate the distances between clusters based on the closest members of each pair of clusters (i.e., the nearest-neighbor approach) • Test whether the desired number of clusters has been achieved; if not, loop back to the first step
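
  The loop above can be sketched in Python as follows; this is an illustrative implementation assuming a dense symmetric distance matrix indexed by item, not code from the original slides:

```python
from itertools import combinations, product

def single_link_agglomerative(dist, k=1):
    """Fuse single-member clusters until only k clusters remain,
    always merging the pair of clusters whose nearest members
    (single-link distance) are closest."""
    clusters = [{i} for i in range(len(dist))]  # n singleton clusters
    while len(clusters) > k:
        # smallest single-link distance over all pairs of clusters
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(dist[i][j]
                              for i, j in product(clusters[p[0]],
                                                  clusters[p[1]])),
        )
        clusters[a] |= clusters[b]  # merge the two closest clusters
        del clusters[b]             # b > a, so index a is unaffected
    return clusters
```

  Run with k = 2 on the D1 matrix of this example, it would stop at the two clusters (12) and (345) traced on the following slides.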

  11. Stepping through the SLA – Merge 1 • The smallest non-zero entry in the initial matrix is for items 1 and 2; these are fused and the distances are recalculated based on SL • After the first merge: • d(12)3 = min[d13, d23] = d23 = 5.0 • d(12)4 = min[d14, d24] = d24 = 9.0 • d(12)5 = min[d15, d25] = d25 = 8.0

  12. SLA – Post merge 1 • The new matrix after merge 1 is D2 [shown as an image; not reproduced in this transcript]. Its smallest entry is for entities 4 and 5.

  13. SLA – Merge 2 • The entities 4 and 5 are merged and we recalculate the distances: • d(12)3 = 5.0 as before • d(12)(45) = min[d14, d15, d24, d25] = d25 = 8.0 • d(45)3 = min[d34, d35] = d34 = 4.0

  14. SLA – Post merge 2 • The smallest entry in the resulting matrix [shown as an image; not reproduced] is d(45)3 = 4.0

  15. SL – Merges 3 and 4 • Item 3 is merged with (45), and we arrive at two clusters, namely (345) and (12) • This corresponds to partition level 4 (P4) in the dendrogram • These two clusters, (345) and (12), are then merged into one to form the top-level cluster

  16. Centroid Clustering • Another type of clustering takes into account all members of a cluster and requires access to the original raw data • The centroid approach may produce clusters with different “topologies” compared to the single link method

  17. Euclidean Distance Centroid Clustering • Recall that Euclidean distance is “as the crow flies” distance, i.e., a geometric measure: dij = [Σk (xik - xjk)^2]^(1/2) • Most such distance measures are special cases of the so-called Minkowski metric, dij = [Σk |xik - xjk|^p]^(1/p), of which Euclidean distance is the p = 2 case
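
  A small sketch of the Minkowski family; the function name and the plain-vector representation are assumptions of mine:

```python
def minkowski(x, y, p=2):
    """Minkowski metric between two equal-length vectors.
    p = 2 gives Euclidean ("as the crow flies") distance;
    p = 1 gives city-block (Manhattan) distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Euclidean distance as the p = 2 special case:
# minkowski((0, 0), (3, 4))  ->  5.0
```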

  18. CC – Example • Let’s assume we start with the “raw” data matrix of five objects measured on two variables [shown as an image; not reproduced in this transcript]

  19. Euclidean Distance Matrix • The inter-object distances based on Euclidean distance form the matrix C1 [shown as an image; not reproduced]

  20. CC – Merge 1 • Examination of C1 shows that c12 is the smallest entry, so objects 1 and 2 are merged into one cluster • The mean vector (centroid) of the group (12) is calculated as (1, 1.5), and a new Euclidean distance matrix is produced

  21. CC – Post Merge 1 • The resulting matrix is C2 [shown as an image; not reproduced]

  22. CC – Merge 2 • In the new matrix the smallest entry is c45, hence entities 4 and 5 are merged to form a second cluster • The mean vector (centroid) of the new cluster containing 4 and 5 is (8.0, 1.0) • A new distance matrix is then calculated

  23. CC – Post Merge 2 • Calculating the new distance matrix yields C3 [shown as an image; not reproduced]

  24. CC – Merges 3 and 4 • In the last distance matrix the smallest entry is c(45)3, so entities 4, 5, and 3 are merged into one cluster (Merge 3) • Now there are only two clusters, (12) and (345), and in the next iteration these two are merged into one (Merge 4); a sketch of the whole procedure follows
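
  A minimal sketch of this centroid procedure, assuming the raw data are rows of numeric measurements (all names here are mine, not from the slides):

```python
from itertools import combinations

def centroid(points):
    """Mean vector of a group of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def centroid_clustering(data, k=1):
    """Repeatedly merge the two clusters whose centroids are closest,
    recomputing each centroid from the raw data after every merge."""
    clusters = [[row] for row in data]  # singletons over the raw data
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: euclidean(centroid(clusters[p[0]]),
                                    centroid(clusters[p[1]])),
        )
        clusters[a].extend(clusters[b])  # fuse; b > a keeps index a valid
        del clusters[b]
    return clusters
```

  Note how, unlike single link, every step goes back to the raw data to recompute centroids, which is why this method cannot run from the distance matrix alone.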

  25. Additional Clustering Methods • We can think of two classes of techniques: • Those that rely only on the proximity or distance matrix • Single link, complete link, and average link • Those that require access to the “raw” data matrix • Centroid clustering

  26. Illustrations of other methods • [Figure: cluster A = {1, 2} and cluster B = {3, 4, 5}; single link joins the closest pair across the two clusters, complete link the furthest pair, and average link averages all between-cluster pairs] • DAB = (d13 + d14 + d15 + d23 + d24 + d25)/6

  27. Additional Methods – Explanation • Complete link: the distance between clusters is defined in terms of the distance between the furthest members of the two clusters • In the average link approach, the distance between clusters is calculated by averaging the distances between all pairs of members, one from each cluster (as in the DAB formula on the previous slide; sketched below)
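
  Minimal Python sketches of these two definitions, in the same style as the single-link helper earlier (the names are mine, not the presentation’s):

```python
from itertools import product

def complete_link_distance(cluster_a, cluster_b, dist):
    """Complete link: distance between the furthest members
    of the two clusters."""
    return max(dist[i][j] for i, j in product(cluster_a, cluster_b))

def average_link_distance(cluster_a, cluster_b, dist):
    """Average link: mean over all between-cluster pairs, as in
    DAB = (d13 + d14 + d15 + d23 + d24 + d25) / 6 for
    A = {1, 2} and B = {3, 4, 5}."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist[i][j] for i, j in pairs) / len(pairs)
```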
