1 / 20

K-means Clustering

2. Outline. IntroductionK-means AlgorithmExampleK-means DemoRelevant IssuesConclusion. 3. Introduction. Partitioning Clustering Approacha typical clustering analysis approach via partitioning data set iterativelyconstruct a partition of a data set to produce several non-empty clusters (u

clancy
Download Presentation

K-means Clustering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. K-means Clustering Ke Chen

    2. 2 Outline Introduction K-means Algorithm Example K-means Demo Relevant Issues Conclusion

    3. 3 Introduction Partitioning Clustering Approach a typical clustering analysis approach via partitioning data set iteratively construct a partition of a data set to produce several non-empty clusters (usually, the number of clusters given in advance) in principle, partitions achieved via minimising the sum of squared distance in each cluster Given a K, find a partition of K clusters to optimise the chosen partitioning criterion global optimal: exhaustively enumerate all partitions heuristic method: K-means algorithm K-means algorithm (MacQueen’67): each cluster is represented by the centre of the cluster and the algorithm converges to stable centres of clusters.

    4. 4 K-mean Algorithm Given the cluster number K, the K-means algorithm is carried out in three steps:

    5. 5 Problem

    6. 6 Example Step 1: Use initial seed points for partitioning

    7. 7 Example Step 2: Compute new centroids of the current partition

    8. 8 Example Step 2: Renew membership based on new centroids

    9. 9 Example Step 3: Repeat the first two steps until its convergence

    10. 10 Example Step 3: Repeat the first two steps until its convergence

    11. 11 K-means Demo

    12. 12 K-means Demo

    13. 13 K-means Demo

    14. 14 K-means Demo

    15. 15 K-means Demo

    16. 16 K-means Demo

    17. 17 K-means Demo

    18. 18 Relevant Issues Efficient in computation O(tKn), where n is number of objects, K is number of clusters, and t is number of iterations. Normally, K, t << n. Local optimum sensitive to initial seed points converge to a local optimum that may be unwanted solution Other problems Need to specify K, the number of clusters, in advance Unable to handle noisy data and outliers (K-Medoids algorithm) Not suitable for discovering clusters with non-convex shapes Applicable only when mean is defined, then what about categorical data? (K-mode algorithm)

    19. 19 Relevant Issues Cluster Validity With different initial conditions, the K-means algorithm may result in different partitions for a given data set. Which partition is the “best” one for the given data set? In theory, no answer to this question as there is no ground-truth available in unsupervised learning Nevertheless, there are several cluster validity criteria to assess the quality of clustering analysis from different perspectives A common cluster validity criterion is the ratio of the total between-cluster to the total within-cluster distances Between-cluster distance (BCD): the distance between means of two clusters Within-cluster distance (WCD): sum of all distance between data points and the mean in a specific cluster A large ratio of BCD:WCD suggests good compactness inside clusters and good separability among different clusters!

    20. 20 Conclusion K-means algorithm is a simple yet popular method for clustering analysis Its performance is determined by initialisation and appropriate distance measure There are several variants of K-means to overcome its weaknesses K-Medoids: resistance to noise and/or outliers K-Modes: extension to categorical data clustering analysis CLARA: dealing with large data sets Mixture models (EM algorithm): handling uncertainty of clusters

More Related