

  1. Hierarchical Clustering (层次聚类)

  2. Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree-like diagram that records the sequence of merges or splits

  3. Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

  4. Two main types of hierarchical clustering • Agglomerative: • Start with the points as individual clusters • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left • Divisive: • Start with one, all-inclusive cluster • At each step, split a cluster until each cluster contains a single point (or there are k clusters) • Traditional hierarchical algorithms use a similarity or distance matrix • Merge or split one cluster at a time

  5. An Agglomerative Clustering Algorithm • Basic algorithm is straightforward • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms (see the sketch below)
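
A minimal NumPy sketch of this procedure, assuming Euclidean distances and single-link (MIN) proximity by default; the function name, parameters, and structure are illustrative rather than taken from the slides, and for brevity the inter-cluster proximity is recomputed instead of maintained in an updated proximity matrix:

```python
import numpy as np

def agglomerative(points, k=1, linkage=min):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters until only k clusters remain (k=1 builds the full dendrogram).
    Runs in O(N^3) time, matching the complexity quoted on slide 30."""
    # Proximity matrix: Euclidean distance between every pair of points
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    clusters = [[i] for i in range(len(points))]   # each point starts as its own cluster
    merges = []                                    # (cluster A, cluster B, distance) history

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Inter-cluster proximity under the chosen linkage (min, max, ...)
                d = linkage(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]    # merge b into a
        del clusters[b]
        # (A real implementation would update the proximity matrix here.)
    return clusters, merges
```

For example, `agglomerative(np.random.rand(10, 2), k=3)` returns three clusters together with the merge history a dendrogram would record; passing `linkage=max` gives complete link instead.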

  6. Starting Situation • Start with clusters of individual points and a proximity matrix [figure: points p1, p2, … and the corresponding proximity matrix]

  7. Intermediate Situation • After some merging steps, we have some clusters [figure: clusters C1–C5 and the corresponding proximity matrix]

  8. Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix [figure: clusters C1–C5 and the corresponding proximity matrix]

  9. After Merging C2 U C5 • The question is “How do we update the proximity matrix?” [figure: proximity matrix with the rows and columns for the merged cluster C2 U C5 marked with “?”]

  10. How to Define Inter-Cluster Similarity • MIN • MAX • Group Average • Distance Between Centroids • Other methods driven by an objective function • Ward’s Method uses squared error [figure: points p1, p2, … and the corresponding proximity matrix]

  11.–14. [figures repeating the list above, illustrating MIN, MAX (between the two sets), Group Average, and Distance Between Centroids in turn]
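
As a small illustration of these definitions, the sketch below computes the MIN, MAX, and group-average proximities between two clusters from a point-level distance matrix; the function names are assumptions, and `dist` is assumed to hold distances rather than similarities:

```python
import numpy as np

def cluster_distance(dist, A, B, method="min"):
    """Distance between clusters A and B (lists of point indices),
    given a point-to-point distance matrix `dist`."""
    pairs = dist[np.ix_(A, B)]      # all pairwise distances between A and B
    if method == "min":             # single link
        return pairs.min()
    if method == "max":             # complete link
        return pairs.max()
    if method == "average":         # group average
        return pairs.mean()
    raise ValueError(f"unknown method: {method}")

def centroid_distance(points, A, B):
    """Distance between the centroids of clusters A and B; unlike the other
    measures this needs the points themselves, not just the distance matrix."""
    return np.linalg.norm(points[A].mean(axis=0) - points[B].mean(axis=0))
```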

  15. Cluster Similarity: MIN or Single Link • Similarity of two clusters is based on the two most similar (closest) points in the different clusters • Determined by one pair of points, i.e., by one link in the proximity graph [figure: example points 1–5]

  16. Hierarchical Clustering: MIN [figures: nested clusters and the corresponding dendrogram]

  17. Dist({3,6},{2,5}) = min{dist(3,2), dist(6,2), dist(3,5), dist(6,5)} = min{0.15, 0.25, 0.28, 0.39} = 0.15
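
The same calculation in a few lines, using only the four distances quoted on the slide:

```python
# Pairwise distances between the two clusters, as quoted on the slide
d = {(3, 2): 0.15, (6, 2): 0.25, (3, 5): 0.28, (6, 5): 0.39}

single_link = min(d[i, j] for i in (3, 6) for j in (2, 5))
print(single_link)   # 0.15 -> clusters {3,6} and {2,5} are the closest pair
```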

  18. Strength of MIN • Can handle non-elliptical shapes [figures: original points and the two clusters found]

  19. Limitations of MIN • Sensitive to noise and outliers [figures: original points and the two clusters found]

  20. Cluster Similarity: MAX or Complete Linkage • Similarity of two clusters is based on the two least similar (most distant) points in the different clusters • Determined by all pairs of points in the two clusters [figure: example points 1–5]

  21. Hierarchical Clustering: MAX [figures: nested clusters and the corresponding dendrogram]

  22. Dist({3,6},{4}) = max{dist(3,4), dist(6,4)} = max{0.15, 0.22} = 0.22 Dist({3,6},{2,5}) = max{dist(3,2), dist(6,2), dist(3,5), dist(6,5)} = max{0.15, 0.25, 0.28, 0.39} = 0.39

  23. Strength of MAX • Less susceptible to noise and outliers [figures: original points and the two clusters found]

  24. Limitations of MAX • Tends to break large clusters • Biased towards globular clusters [figures: original points and the two clusters found]

  25. Cluster Similarity: Group Average • Proximity of two clusters is the average of pairwise proximity between points in the two clusters • Need to use average connectivity for scalability since total proximity favors large clusters [figure: example points 1–5]
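
Written out, the group-average proximity described above is (standard form; $m_i$ and $m_j$ denote the cluster sizes, matching the notation used later on slide 44):

$$\mathrm{proximity}(C_i, C_j) = \frac{\sum_{p \in C_i} \sum_{q \in C_j} \mathrm{proximity}(p, q)}{m_i \, m_j}$$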

  26. Hierarchical Clustering: Group Average [figures: nested clusters and the corresponding dendrogram]

  27. Hierarchical Clustering: Group Average • Compromise between Single and Complete Link • Strengths • Less susceptible to noise and outliers • Limitations • Biased towards globular clusters

  28. Cluster Similarity: Ward’s Method • Similarity of two clusters is based on the increase in squared error when two clusters are merged • Similar to group average if distance between points is distance squared • Less susceptible to noise and outliers • Biased towards globular clusters • Hierarchical analogue of K-means • Can be used to initialize K-means
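
One common way to write the merge cost that Ward’s method minimizes, i.e. the increase in squared error mentioned above (a sketch of the standard closed form, assuming Euclidean distance, with $c_i$, $c_j$ the cluster centroids and $m_i$, $m_j$ the cluster sizes):

$$\Delta(C_i, C_j) = \mathrm{SSE}(C_i \cup C_j) - \mathrm{SSE}(C_i) - \mathrm{SSE}(C_j) = \frac{m_i \, m_j}{m_i + m_j}\,\lVert c_i - c_j \rVert^2$$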

  29. Hierarchical Clustering: Comparison [figures: the same six points clustered with MIN, MAX, Group Average, and Ward’s Method]

  30. Hierarchical Clustering: Time and Space requirements • O(N²) space since it uses the proximity matrix • N is the number of points • O(N³) time in many cases • There are N steps, and at each step the proximity matrix, of size N², must be updated and searched • Complexity can be reduced to O(N² log N) time for some approaches

  31. Hierarchical Clustering: Problems and Limitations • No global objective function is directly optimized • Different schemes have problems with one or more of the following: • Sensitivity to noise and outliers • Difficulty handling different sized clusters and convex shapes • Breaking large clusters • Once a merge is made, it cannot be undone • The O(n² log n) time complexity limits its applicability

  32. MST: Divisive Hierarchical Clustering • Build MST (Minimum Spanning Tree) • Start with a tree that consists of any point • In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not • Add q to the tree and put an edge between p and q
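
A short sketch of the MST construction described above (essentially Prim’s algorithm); the function name and use of Euclidean distances are assumptions:

```python
import numpy as np

def build_mst(points):
    """Grow an MST as on the slide: start from one point and repeatedly
    attach the closest point (q) outside the tree to a point (p) inside it."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    in_tree = {0}          # start from an arbitrary point
    edges = []             # (p, q, distance) edges of the spanning tree
    while len(in_tree) < n:
        p, q = min(((p, q) for p in in_tree for q in range(n) if q not in in_tree),
                   key=lambda pq: dist[pq])
        edges.append((p, q, dist[p, q]))
        in_tree.add(q)
    return edges
```

The hierarchy of clusters mentioned on the next slide is then usually obtained by repeatedly cutting the longest remaining MST edge, each cut splitting one cluster in two (the cutting step is the common MST-based approach; it is not spelled out on the slides).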

  33. MST: Divisive Hierarchical Clustering • Use MST for constructing hierarchy of clusters

  34. The advantage • It can be used for both flat (planar) and hierarchical clustering • The time complexity of the algorithm is only O(n²), which is the time complexity of building an MST

  35. DBSCAN • DBSCAN is a density-based algorithm. • Density = number of points within a specified radius (Eps) • A point is a core point if it has more than a specified number of points (MinPts) within Eps • These are points that are at the interior of a cluster • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point • A noise point is any point that is not a core point or a border point.

  36. DBSCAN: Core, Border, and Noise Points

  37. The DBSCAN algorithm • Input: Eps, MinPts; Output: a clustering • Using Eps and MinPts, label each point as core, border, or noise • Remove the noise points • Put an edge between every pair of core points that are within Eps of each other • Compute the connected components of this graph; each connected component is one cluster • Assign each border point to (any) one of the clusters of its associated core points
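
A compact sketch of this graph-based formulation, assuming Euclidean distance; here the neighborhood includes the point itself and a point is treated as core when it has at least MinPts neighbors (conventions differ slightly from the strict "more than MinPts" wording of slide 35):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """DBSCAN as outlined on the slide: label core/border/noise points,
    link core points within Eps, take connected components as clusters,
    then attach each border point to a neighboring core point's cluster."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]

    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = np.full(n, -1)        # -1 = noise / not yet assigned

    # Connected components of the graph linking core points within Eps
    cluster_id = 0
    for c in core:
        if labels[c] != -1:
            continue
        stack, labels[c] = [c], cluster_id
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if q in core and labels[q] == -1:
                    labels[q] = cluster_id
                    stack.append(q)
        cluster_id += 1

    # Border points adopt the cluster of any core point in their neighborhood
    for i in range(n):
        if labels[i] == -1:
            for q in neighbors[i]:
                if q in core:
                    labels[i] = labels[q]
                    break
    return labels                   # points still labeled -1 are noise
```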

  38. DBSCAN: Core, Border and Noise Points [figures: original points, and point types (core, border, noise) with Eps = 10, MinPts = 4]

  39. When DBSCAN Works Well • Resistant to noise • Can handle clusters of different shapes and sizes [figures: original points and the clusters found]

  40. When DBSCAN Does NOT Work Well • Varying densities • High-dimensional data [figures: original points, and the clusters found with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]

  41. Cluster Evaluation • Accuracy, precision, recall • Purposes of evaluation • To avoid finding patterns in noise • To compare clustering algorithms • To compare two sets of clusters • To compare two clusters

  42. Clusters found in Random Data [figures: random points, and the clusters found by DBSCAN, K-means, and Complete Link]

  43. Different aspects of cluster validation • 1. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels. • 2. Evaluating how well the results of a cluster analysis fit the data without reference to external information - use only the data. • 3. Comparing the results of two different sets of cluster analyses to determine which is better. • 4. Determining the ‘correct’ number of clusters. • For 1, 2 and 3, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

  44. Cluster validity measures • SSE: the sum of squared distances from each point to its cluster centroid, $\mathrm{SSE} = \sum_i \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$ • Let $c$ be the centroid of all points and let distances be Euclidean; define the between-cluster separation $\mathrm{SSB} = \sum_i m_i \, \mathrm{dist}(c_i, c)^2$, where $m_i$ is the number of points in cluster $i$ and $c_i$ its centroid. The larger SSB is, the better separated the clusters are.

  45. If there are K clusters, all of equal size $m_i = m/K$ (with $m$ the total number of points), then SSB can be written as $\mathrm{SSB} = \frac{m}{K} \sum_i \mathrm{dist}(c_i, c)^2$. Define the total sum of squares TSS as $\mathrm{TSS} = \sum_x \mathrm{dist}(x, c)^2$.

  46. Then it can be shown that TSS = SSE + SSB, i.e., SSE + SSB is a constant. Unsupervised cluster evaluation: using the proximity matrix. Suppose we are given the similarity matrix of a data set together with the cluster labels. Ideally, the similarity between any two points in the same cluster is 1, and the similarity between two points in different clusters is 0. If we sort the points by cluster label, the corresponding similarity matrix should be block-diagonal. In practice the actual similarity values are used: treating them as pixel colour values, the quality of the clustering can be judged visually.
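
The identity TSS = SSE + SSB can be checked numerically; a small sketch on random data with a hypothetical 3-cluster labeling (centroids are the cluster means, distances Euclidean):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((30, 2))
labels = rng.integers(0, 3, size=30)        # an arbitrary 3-cluster assignment
c = points.mean(axis=0)                     # centroid of all points

sse = sum(((points[labels == i] - points[labels == i].mean(axis=0)) ** 2).sum()
          for i in range(3))
ssb = sum((labels == i).sum() * ((points[labels == i].mean(axis=0) - c) ** 2).sum()
          for i in range(3))
tss = ((points - c) ** 2).sum()

print(np.isclose(tss, sse + ssb))           # True: SSE + SSB is constant (= TSS)
```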

  47. Ward’s algorithm • Start out with all sample units in n clusters of size 1 each. • At each step, merge the two clusters whose merge yields the largest r² value (equivalently, the minimum increase in squared error). • Repeat the process until there is only one cluster.

  48. Measuring Cluster Validity Via Correlation • Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets: Corr = -0.9235 for clearly separated clusters; Corr = -0.5810 for clusters that are not clearly separated.
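
A sketch of how such a correlation can be computed; the exact procedure is an assumption: here the incidence matrix marks pairs of points that share a cluster, the proximity matrix holds Euclidean distances, and the correlation is taken over the upper triangle, which is why well-separated clusters give a strongly negative value:

```python
import numpy as np

def incidence_proximity_corr(points, labels):
    """Correlation between the incidence matrix (1 if two points are in the
    same cluster, else 0) and the distance matrix, over distinct pairs."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(points), k=1)      # ignore the diagonal and duplicates
    return np.corrcoef(incidence[iu], dist[iu])[0, 1]
```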

  49. Using Similarity Matrix for Cluster Validation • Order the similarity matrix with respect to cluster labels and inspect visually (evaluating the clustering by visualizing the similarity matrix).
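
A minimal sketch of this visual check, assuming matplotlib is available and using a simple rescaling of Euclidean distance into a similarity in [0, 1]:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sorted_similarity(points, labels):
    """Sort points by cluster label and display the similarity matrix as an
    image; crisp blocks on the diagonal indicate well-separated clusters."""
    order = np.argsort(labels)
    diff = points[order][:, None, :] - points[order][None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    sim = 1.0 - dist / dist.max()       # simple distance-to-similarity conversion
    plt.imshow(sim)
    plt.colorbar()
    plt.show()
```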

  50. Using Similarity Matrix for Cluster Validation • Clusters in random data are not so crisp [figure: sorted similarity matrix for the DBSCAN clusters found in random data]
