SEEM4630 2011-2012

SEEM4630 2011-2012 Tutorial 4 – Clustering

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• A good clustering method will produce high-quality clusters:
  • high intra-class similarity: cohesive within clusters
  • low inter-class similarity: distinctive between clusters

How many clusters?
[Figure: panels titled "Two Clusters", "Four Clusters", and "Six Clusters"]
The notion of a cluster can be ambiguous.

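K-Means Clustering
• The number of clusters K is fixed in advance
• Closeness between a point and a cluster mean is measured by Euclidean distance, etc.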

K-Means Clustering: Example
Given:
• Mean of cluster Ki: mi = (ti1 + ti2 + … + tim)/m, where ti1, …, tim are the m points in Ki
• Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}
• K = 2
Solution:
• m1 = 2, m2 = 4: K1 = {2, 3} and K2 = {4, 10, 12, 20, 30, 11, 25}
• m1 = 2.5, m2 = 16: K1 = {2, 3, 4} and K2 = {10, 12, 20, 30, 11, 25}
• m1 = 3, m2 = 18: K1 = {2, 3, 4, 10} and K2 = {12, 20, 30, 11, 25}
• m1 = 4.75, m2 = 19.6: K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 30, 25}
• m1 = 7, m2 = 25: K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 30, 25} (assignments no longer change, so the procedure stops)
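The iterations above can be reproduced with a minimal one-dimensional sketch. The function name `kmeans_1d` and the tie-breaking rule (a point equidistant from two means goes to the lower-numbered cluster, which is what sends 3 to K1 in the first step) are assumptions made for illustration, not part of the tutorial.

```python
def kmeans_1d(data, means, max_iter=100):
    """Plain K-means on numbers; `means` holds the initial cluster means."""
    means = list(means)
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in data:
            # Assign x to the nearest mean; min() keeps the first index on ties.
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        print(means, [sorted(c) for c in clusters])
        # Recompute each mean; an empty cluster keeps its previous mean.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:  # means unchanged: converged
            break
        means = new_means
    return means, clusters


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
final_means, final_clusters = kmeans_1d(data, [2, 4])
print(final_means)  # [7.0, 25.0]
```

Each printed line corresponds to one row of the solution: the means in use, followed by the clusters they produce.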

K-Means Clustering: Evaluation
• Sum of Squared Error (SSE):
  SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2
  where x is a data point in cluster C_i and m_i is the centroid of cluster C_i
• Given several candidate clusterings, choose the one with the smallest error
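As a small illustration of the SSE criterion, the sketch below compares two clusterings of the example data {2, 4, 10, 12, 3, 20, 30, 11, 25}: the first assignment and the final one. The helper name `sse` is hypothetical, and distance is taken to be the absolute difference between numbers.

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((x - centroid) ** 2 for x in points)
    return total


first = [[2, 3], [4, 10, 12, 20, 30, 11, 25]]   # first assignment (means 2 and 4)
final = [[2, 3, 4, 10, 11, 12], [20, 30, 25]]   # converged assignment

print(sse(first))  # 514.5
print(sse(final))  # 150.0
```

Under this criterion the converged clustering is preferred, since its SSE is smaller.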

Limitations of K-means
• It is hard to determine a good:
  • K value
  • set of initial K centroids
• K-means has problems when the data contains outliers.
  • Outliers can be handled better by hierarchical clustering and density-based clustering
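One way to see the outlier problem is that a cluster mean is pulled toward any extreme value assigned to it. The numbers below are made up for illustration; 100 plays the role of the outlier.

```python
def centroid(points):
    """Arithmetic mean of a list of numbers."""
    return sum(points) / len(points)


cluster = [2, 3, 4]
with_outlier = cluster + [100]  # 100 is a hypothetical outlier

print(centroid(cluster))       # 3.0
print(centroid(with_outlier))  # 27.25
```

A single outlier moves the mean far from the three points that actually form the group, which in turn distorts later assignments.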

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of merges or splits

Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• Partition direction
  • Agglomerative: starting with single elements and aggregating them into clusters
  • Divisive: starting with the complete data set and dividing it into partitions

Agglomerative Hierarchical Clustering
• Basic algorithm is straightforward
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
    • Merge the two closest clusters
    • Update the proximity matrix
  • Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  • Different approaches to define the distance between clusters
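The loop above can be sketched directly for one-dimensional points. This version recomputes cluster proximities from the raw points on every pass instead of maintaining a proximity matrix, and it uses the smallest pairwise distance as the cluster proximity; both are simplifying assumptions. The function name, the `target` parameter (stop early at a given number of clusters), and the sample points are hypothetical.

```python
def agglomerative(points, target=1):
    """Merge the two closest clusters until `target` clusters remain."""
    clusters = [[p] for p in points]  # let each data point be a cluster
    merges = []
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Proximity of two clusters: smallest pairwise distance.
                prox = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or prox < best[0]:
                    best = (prox, i, j)
        prox, i, j = best
        merges.append((clusters[i], clusters[j], prox))
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters, merges


clusters, merges = agglomerative([2, 3, 10, 11, 25], target=2)
print(clusters)  # [[2, 3, 10, 11], [25]]
```

With `target=1` the loop runs until a single cluster remains, and `merges` then records the full merge sequence that a dendrogram would draw.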

Hierarchical Clustering
• Define Inter-Cluster Similarity
  • Min
  • Max
  • Group Average
  • Distance between Centroids
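The four definitions can be written out for two small clusters of numbers. The function names and the sample clusters are assumptions for illustration, and distance between two points is the absolute difference.

```python
def pairwise(a, b):
    """All point-to-point distances between clusters a and b."""
    return [abs(x - y) for x in a for y in b]


def d_min(a, b):       # Min (single link): closest pair
    return min(pairwise(a, b))


def d_max(a, b):       # Max (complete link): farthest pair
    return max(pairwise(a, b))


def d_avg(a, b):       # Group average: mean over all pairs
    dists = pairwise(a, b)
    return sum(dists) / len(dists)


def d_centroid(a, b):  # Distance between the two cluster centroids
    return abs(sum(a) / len(a) - sum(b) / len(b))


A, B = [2, 3, 4], [10, 12]
print(d_min(A, B), d_max(A, B), d_avg(A, B), d_centroid(A, B))  # 6 10 8.0 8.0
```

Group average and centroid distance happen to agree here because the clusters lie on a line and do not interleave; in general the four measures can rank cluster pairs differently.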

Hierarchical Clustering: Min or Single Link

Euclidean distance matrix:

      I1    I2    I3    I4    I5    I6
I1   0.00  0.24  0.22  0.37  0.34  0.23
I2   0.24  0.00  0.15  0.20  0.14  0.25
I3   0.22  0.15  0.00  0.15  0.28  0.11
I4   0.37  0.20  0.15  0.00  0.29  0.22
I5   0.34  0.14  0.28  0.29  0.00  0.39
I6   0.23  0.25  0.11  0.22  0.39  0.00

After merging I3 and I6 (distance 0.11):

           I1    I2    {I3, I6}  I4    I5
I1         0.00  0.24  0.22      0.37  0.34
I2         0.24  0.00  0.15      0.20  0.14
{I3, I6}   0.22  0.15  0.00      0.15  0.28
I4         0.37  0.20  0.15      0.00  0.29
I5         0.34  0.14  0.28      0.29  0.00

After merging I2 and I5 (distance 0.14):

           I1    {I2, I5}  {I3, I6}  I4
I1         0.00  0.24      0.22      0.37
{I2, I5}   0.24  0.00      0.15      0.20
{I3, I6}   0.22  0.15      0.00      0.15
I4         0.37  0.20      0.15      0.00

After merging {I2, I5} and {I3, I6} (distance 0.15):

                   I1    {I2, I5, I3, I6}  I4
I1                 0.00  0.22              0.37
{I2, I5, I3, I6}   0.22  0.00              0.15
I4                 0.37  0.15              0.00

After merging {I2, I5, I3, I6} and I4 (distance 0.15):

                       I1    {I2, I5, I3, I6, I4}
I1                     0.00  0.22
{I2, I5, I3, I6, I4}   0.22  0.00

[Figure: dendrogram; vertical axis from 0 to 0.2; leaf order 3, 6, 2, 5, 4, 1]
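The sequence of matrices can be reproduced by updating the proximity matrix in place: when two clusters merge, the merged row takes the element-wise minimum of the two old rows. The sketch below is a minimal version under that rule; the function name is hypothetical, and ties (such as the two 0.15 entries) are broken by taking the first pair found in row order, which happens to match the order shown in the tables.

```python
def single_link(names, matrix):
    """Single-link agglomerative clustering on a symmetric distance matrix."""
    clusters = [[n] for n in names]
    d = [row[:] for row in matrix]  # work on a copy
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters (first pair wins on ties).
        i, j = 0, 1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if d[a][b] < d[i][j]:
                    i, j = a, b
        merges.append((clusters[i], clusters[j], d[i][j]))
        # Update the proximity matrix: merged row = element-wise minimum.
        for k in range(len(clusters)):
            if k not in (i, j):
                d[i][k] = d[k][i] = min(d[i][k], d[j][k])
        del d[j]
        for row in d:
            del row[j]
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


names = ["I1", "I2", "I3", "I4", "I5", "I6"]
matrix = [
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
]
merges = single_link(names, matrix)
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

The merge heights (0.11, 0.14, 0.15, 0.15, 0.22) are the levels at which the dendrogram joins its branches.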
