1 / 12

SEEM4630 2011-2012

SEEM4630 2011-2012. Tutorial 4 – Clustering. What is Cluster Analysis?. Finding groups of objects such that the objects in a group will be similar (or related to one another and different from (or unrelated to) the objects in other groups.

rtyner
Download Presentation

SEEM4630 2011-2012

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SEEM4630 2011-2012 Tutorial 4 – Clustering

  2. What is Cluster Analysis? • Finding groups of objects such that the objects in a group will be similar (or related to one another and different from (or unrelated to) the objects in other groups. • A good clustering method will produce high quality clusters • high intra-class similarity: cohesive within clusters • low inter-class similarity: distinctive between clusters

  3. How many clusters? Six Clusters Two Clusters Four Clusters Notion of a Cluster can be Ambiguous

  4. K-Means Clustering fixed Euclidean Distance etc.

  5. K-Means Clustering: Example Given: Means of the cluster ki, mi = (ti1 + ti2 + … + tim)/m Data {2, 4, 10, 12, 3, 20, 30, 11, 25} K = 2 Solution: m1 = 2, m2 = 4, K1 = {2, 3}, and K2 = {4, 10, 12, 20, 30, 11, 25} m1 = 2.5, m2 = 16 K1 = {2, 3, 4}, and K2 = {10, 12, 20, 30, 11, 25} m1 = 3, m2 = 18 K1 = {2, 3, 4, 10}, and K2 = {12, 20, 30, 11, 25} m1 = 4.75, m2 = 19.6 K1 = {2, 3, 4, 10, 11, 12}, and K2 = {20, 30, 25} m1 = 7, m2 = 25 K1 = {2, 3, 4, 10, 11, 12}, and K2 = {20, 30, 25}

  6. K-Means Clustering: Evaluation Evaluation Sum of Squared Error (SSE) Given clusters, choose the one with the smallest error Data point in cluster Ci Centroid of cluster Ci

  7. Limitations of K-means • It is hard to determine a good • K value • The initial K centroids • K-means has problems when the data contains outliers. • Outliers can be handled better by hierarchical clustering and density-based clustering

  8. Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree like diagram that records the sequences of merges or splits

  9. Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • Partition direction • Agglomerative: starting with single elements and aggregating them into clusters • Divisive: starting with the complete data set and dividing it into partitions

  10. Agglomerative Hierarchical Clustering • Basic algorithm is straightforward • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • Different approaches to define the distance between clusters

  11. Hierarchical Clustering • Define Inter-Cluster Similarity • Min • Max • Group Average • Distance between Centroids

  12. I1 {I2, I5} {I3, I6} I4 I1 {I2, I5, I3, I6, I4} I1 {I2, I5, I3, I6} I4 I1 I2 {I3, I6} I4 I5 I1 0.00 0.24 0.22 0.37 I1 0.00 0.22 I1 0.00 0.22 0.37 I1 0.00 0.24 0.22 0.37 0.34 {I2, I5} 0.24 0.00 0.15 0.20 {I2, I5, I3, I6, I4} 0.22 0.00 {I2, I5, I3, I6} 0.22 0.00 0.15 I2 0.24 0.00 0.15 0.20 0.14 {I3, I6} 0.22 0.15 0.00 0.15 {I4} 0.37 0.15 0.00 {I3, I6} 0.22 0.15 0.00 0.15 0.28 I4 0.37 0.20 0.15 0.00 I4 0.37 0.20 0.15 0.00 0.29 I5 0.34 0.14 0.28 0.29 0.00 I1 I2 I3 I4 I5 I6 I1 0.00 0.24 0.22 0.37 0.34 0.23 I2 0.24 0.00 0.15 0.20 0.14 0.25 I3 0.22 0.15 0.00 0.15 0.28 0.11 I4 0.37 0.20 0.15 0.00 0.29 0.22 I5 0.34 0.14 0.28 0.29 0.00 0.39 0.23 0.00 I6 0.25 0.11 0.22 0.39 Hierarchical Clustering: Min or Single Link Euclidean distance 0.2 0.15 0.1 0.05 0 3 6 2 5 4 1

More Related