K-means Clustering

1. K-means Clustering Ke Chen

2. Outline: Introduction, K-means Algorithm, Example, K-means Demo, Relevant Issues, Conclusion

3. Introduction: Partitioning Clustering Approach. Partitioning is a typical approach to clustering analysis: it iteratively constructs a partition of a data set into several non-empty clusters, with the number of clusters usually given in advance. In principle, the partition is obtained by minimising the sum of squared distances within each cluster. Given K, the task is to find a partition into K clusters that optimises the chosen partitioning criterion. The global optimum can be found by exhaustively enumerating all partitions; the K-means algorithm is a heuristic alternative. In the K-means algorithm (MacQueen, 1967), each cluster is represented by its centre, and the algorithm converges to stable cluster centres.
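
The sum-of-squared-distances criterion above is commonly written as follows. The symbols C_k (the k-th cluster), mu_k (its centre) and x (a data point) are notation introduced here for illustration; they do not appear in the slides.

```latex
J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
```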

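4. K-means Algorithm: Given the cluster number K, the K-means algorithm is carried out in three steps, illustrated by the example slides that follow: partition the data using the current seed points, compute the new centroids of that partition, and repeat the first two steps until convergence.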
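
The loop can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's implementation: the function name `kmeans`, the use of Euclidean distance, the choice of K random data points as initial seeds, and the handling of empty clusters are all assumptions made here.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    # Initialisation (assumed): use k distinct data points as seed points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: stop once no membership changes between iterations.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: recompute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Made-up data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, membership = kmeans(data, k=2)
```

Which cluster receives label 0 depends on the random seeds, so only the grouping of the points, not the label values, is meaningful.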

5. Problem

6. Example Step 1: Use initial seed points for partitioning

7. Example Step 2: Compute new centroids of the current partition

8. Example Step 2: Renew membership based on new centroids

9. Example Step 3: Repeat the first two steps until convergence

10. Example Step 3 (continued): Repeat the first two steps until convergence

11. K-means Demo

18. Relevant Issues. Efficiency: the computational cost is O(tKn), where n is the number of objects, K is the number of clusters, and t is the number of iterations; normally K, t << n. Local optimum: the algorithm is sensitive to the initial seed points and may converge to a local optimum that is an unwanted solution. Other problems: K, the number of clusters, must be specified in advance; the algorithm cannot handle noisy data and outliers (the K-Medoids algorithm addresses this); it is not suitable for discovering clusters with non-convex shapes; and it is applicable only when a mean is defined, which leaves open the question of categorical data (the K-Modes algorithm addresses this).
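
The sensitivity to initial seed points can be seen on a tiny made-up data set. The sketch below assumes Euclidean distance and uses a hypothetical helper, `lloyd`, that runs the K-means loop from explicitly supplied seeds; neither the data nor the helper comes from the slides.

```python
import math


def lloyd(points, centroids, max_iter=100):
    """Run K-means from explicitly given seed points; return (labels, sse)."""
    centroids = list(centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: memberships are stable
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Sum of squared distances from each point to its own cluster centre.
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, sse


# Four corners of a 4-by-1 rectangle (made-up data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good_labels, good_sse = lloyd(points, [(0, 0), (4, 0)])  # one seed per side
bad_labels, bad_sse = lloyd(points, [(0, 0), (0, 1)])    # both seeds on the left

print(good_labels, good_sse)  # [0, 0, 1, 1] 1.0
print(bad_labels, bad_sse)    # [0, 1, 0, 1] 16.0
```

Both runs converge, but the second stops at a stable top/bottom split whose sum of squared distances is sixteen times larger. A common workaround is to run the algorithm from several initialisations and keep the partition with the lowest criterion value.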

19. Relevant Issues: Cluster Validity. With different initial conditions, the K-means algorithm may produce different partitions of a given data set. Which partition is the "best" one? In theory there is no answer to this question, as no ground truth is available in unsupervised learning. Nevertheless, several cluster validity criteria assess the quality of a clustering from different perspectives. A common criterion is the ratio of the total between-cluster distance to the total within-cluster distance. Between-cluster distance (BCD) is the distance between the means of two clusters; within-cluster distance (WCD) is the sum of all distances between the data points and the mean of a specific cluster. A large BCD:WCD ratio suggests good compactness inside clusters and good separability among different clusters.
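
One way to compute this ratio is sketched below. The slide does not say how the "totals" are formed, so this sketch assumes that total BCD sums the distances between the means of every pair of clusters and that total WCD sums the per-cluster WCD values; the function names are hypothetical.

```python
import math
from itertools import combinations


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of numeric tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def bcd_wcd_ratio(clusters):
    """clusters: list of clusters, each a non-empty list of numeric tuples."""
    means = [mean_point(c) for c in clusters]
    # Total BCD (assumed): distances between the means of every pair of clusters.
    bcd = sum(math.dist(a, b) for a, b in combinations(means, 2))
    # Total WCD (assumed): distances from each point to its own cluster mean.
    wcd = sum(math.dist(p, m) for c, m in zip(clusters, means) for p in c)
    # Note: wcd is zero if every cluster is a single point; not handled here.
    return bcd / wcd


# Two candidate partitions of the same four made-up points.
left_right = [[(0, 0), (0, 1)], [(4, 0), (4, 1)]]
top_bottom = [[(0, 0), (4, 0)], [(0, 1), (4, 1)]]

print(bcd_wcd_ratio(left_right))  # 2.0
print(bcd_wcd_ratio(top_bottom))  # 0.125
```

The left/right partition scores higher, which matches the intuition that its clusters are both more compact and better separated.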

20. Conclusion. The K-means algorithm is a simple yet popular method for clustering analysis. Its performance is determined by the initialisation and by the choice of an appropriate distance measure. Several variants of K-means overcome its weaknesses: K-Medoids (resistance to noise and/or outliers), K-Modes (extension to categorical data), CLARA (dealing with large data sets), and mixture models with the EM algorithm (handling uncertainty of clusters).
