
K-means

K-means: arbitrarily choose k objects as the initial cluster centers. Until no change, (re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster.

pricew

Presentation Transcript


  1. K-means
     • Arbitrarily choose k objects as the initial cluster centers
     • Until no change, do
       • (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
       • Update the cluster means, i.e., calculate the mean value of the objects for each cluster
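The loop on this slide can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the presenter's implementation: it assumes objects are numeric tuples and that "most similar" means smallest squared Euclidean distance. The names `kmeans`, `squared_distance`, `mean_point`, `seed` and `max_iter` are hypothetical and introduced here for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance (assumed similarity measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # "Until no change": stop once the assignment is stable.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update the cluster means; an empty cluster keeps its old center.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean_point(members)
    return centers, assignment


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Which cluster receives index 0 depends on the arbitrary initial choice, so only the grouping of the labels is meaningful, not their numeric values.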

  2. K-Means: Example (K=2)
     [Figure: scatter plots on 0-10 axes showing successive iterations]
     • Arbitrarily choose K objects as the initial cluster centers
     • Assign each object to the most similar center
     • Update the cluster means
     • Reassign the objects, update the cluster means, and reassign again

  3. Pros and Cons of K-means
     • Relatively efficient: O(tkn)
       • n: # objects, k: # clusters, t: # iterations; k, t << n
     • Often terminates at a local optimum
     • Applicable only when the mean is defined
       • What about categorical data?
     • Need to specify the number of clusters
     • Unable to handle noisy data and outliers
     • Unsuitable for discovering non-convex clusters

  4. Variations of K-means
     • Aspects of variations
       • Selection of the initial k means
       • Dissimilarity calculations
       • Strategies to calculate cluster means
     • Handling categorical data: k-modes
       • Use the mode instead of the mean
       • Mode: the most frequent item(s)
     • A mixture of categorical and numerical data: the k-prototype method
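To make the k-modes idea concrete, the sketch below shows the two pieces that change relative to k-means: the cluster center becomes the per-attribute mode, and dissimilarity has to work on categories. The slide does not name a dissimilarity measure; simple matching (counting mismatched attributes) is assumed here. The names `cluster_mode` and `matching_dissimilarity` and the sample records are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ (assumed measure)."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Per-attribute most frequent value; replaces the mean as the cluster center."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


records = [("red", "small"), ("red", "large"), ("blue", "small")]
center = cluster_mode(records)
print(center)                                       # ('red', 'small')
print(matching_dissimilarity(center, records[2]))   # 1
```

When several values are tied for most frequent, this sketch simply keeps whichever `Counter` reports first; a real implementation would need an explicit tie-breaking rule.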

  5. A Problem of K-means
     [Figure: two scatter plots on 0-10 axes, each with a "+" marker]
     • Sensitive to outliers
       • Outlier: an object with extremely large values
       • May substantially distort the distribution of the data
     • K-medoids: instead of the mean, use the medoid, the most centrally located object in a cluster
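A small numeric illustration of the outlier problem, under the assumption that "most centrally located" means the object with the smallest total Euclidean distance to the other objects in the cluster. The data and the names `mean_point` and `medoid` are made up for this sketch.

```python
import math


def mean_point(points):
    """Coordinate-wise mean; need not coincide with any actual object."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Object with the smallest total distance to all objects in the cluster."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Four tightly grouped objects plus one outlier at (10, 10).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (10, 10)]
print(mean_point(cluster))  # pulled toward the outlier
print(medoid(cluster))      # (2, 2), still one of the grouped objects
```

The mean lands outside the tight group because the single outlier contributes its full coordinates to the average, whereas the medoid must be an actual object and the outlier is a poor candidate for it.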
