
Determining the ‘k’ in k-Means Clustering



  1. Determining the ‘k’ in k-Means Clustering
     Jacob Halvorson

  2. Overview
     • K-Means overview
     • Dr. Perrizo’s Total Variation Theory
     • Killer Idea #1
     • Results
     • Killer Idea #2
     • Results
     • Conclusion

  3. K-Means Overview
     • Place K points into the space represented by the objects being clustered. These points represent the initial group centroids.
     • Assign each object to the group with the closest centroid.
     • When all objects have been assigned, recalculate the positions of the K centroids.
     • Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
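
To make the loop concrete, here is a minimal sketch of these four steps in plain Python. The function name, the Euclidean distance, and random initialization are my own illustrative choices; the slide does not prescribe them. Note that k must be supplied up front, which is exactly the gap the rest of this talk tries to close.

    import math
    import random

    def kmeans(points, k, max_iters=100):
        # Step 1: place k initial centroids by sampling from the data.
        centroids = random.sample(points, k)
        for _ in range(max_iters):
            # Step 2: assign each point to the group with the closest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Step 3: recalculate each centroid as the mean of its cluster.
            new_centroids = []
            for old, cluster in zip(centroids, clusters):
                if cluster:
                    new_centroids.append(tuple(sum(coord) / len(cluster)
                                               for coord in zip(*cluster)))
                else:
                    new_centroids.append(old)  # keep a centroid that lost all points
            # Step 4: repeat until the centroids no longer move.
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return centroids, clusters

    # Example: two obvious groups in 2D, k chosen by hand.
    centroids, clusters = kmeans([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)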

  4. Dr. Perrizo’s Total Variation Theory
     • Start at a point in the dataset.
     • Expand around that point until the density drops off.
     • Add that cluster center to a list of possible clusters and remove all points within the radius from the original list.
     • Repeat, choosing each new cluster center far from the previous one, until no points are left.
     • Total variation, radius, and density are the governing factors.
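
The slide describes this only at a high level, so the following is a loose, assumption-laden sketch of the expand-until-density-drops loop. The step size, the drop-off factor of 0.5, and the 2D area-based density are all placeholders I chose for illustration, not part of the theory as stated.

    import math

    def density_scan(points, step=0.5, drop=0.5):
        remaining = list(points)
        centers = []
        while remaining:
            if centers:
                # choose the next start point far from the previous center
                center = max(remaining, key=lambda p: math.dist(p, centers[-1]))
            else:
                center = remaining[0]  # start at a point in the dataset
            radius, prev_density = step, None
            while True:
                inside = [p for p in remaining if math.dist(p, center) <= radius]
                density = len(inside) / (math.pi * radius ** 2)  # 2D density
                # stop once every remaining point is inside, or once the
                # density drops off sharply relative to the previous ring
                if len(inside) == len(remaining):
                    break
                if prev_density is not None and density < drop * prev_density:
                    break
                prev_density = density
                radius += step
            centers.append(center)
            # remove all points within the radius from the original list
            remaining = [p for p in remaining if math.dist(p, center) > radius]
        return centers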

  5. Killer Idea #1
     • Pick any random point in the dataset as the cluster center.
     • Expand the radius by some value (the minimum distance between all points).
     • Determine the density.
     • If (new density) / (old density) > high-density threshold: we have run into another cluster, so throw out the data.
     • If (new density) / (old density) < low-density threshold: we have a cluster or an outlier; add the cluster to the list and remove its points from the original list.
     • Else: expand again.
     • Repeat.
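
A rough rendering of this logic, purely as a sketch: the default thresholds, the doubling expansion step, and the reading of "throw out data" as discarding the current attempt are my assumptions, not the author's code.

    import math
    import random

    def area_density(points, center, radius):
        inside = sum(1 for p in points if math.dist(p, center) <= radius)
        return inside / (math.pi * radius ** 2)

    def killer_idea_1(points, high=2.65, low=0.3):
        remaining = list(points)
        clusters = []
        while remaining:
            # pick any random point in the dataset as the cluster center
            center = random.choice(remaining)
            # seed the radius with the distance to the nearest other point
            radius = min((math.dist(center, p) for p in remaining if p != center),
                         default=1.0)
            old = area_density(remaining, center, radius)
            while True:
                radius *= 2  # "expand the radius some value": doubling is my choice
                new = area_density(remaining, center, radius)
                if new / old > high:
                    # ran into another cluster: throw out this attempt
                    remaining.remove(center)
                    break
                if new / old < low:
                    # density fell away: a cluster (or an outlier) ends here
                    clusters.append((center, radius))
                    remaining = [p for p in remaining
                                 if math.dist(p, center) > radius]
                    break
                old = new  # else expand again
        return clusters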

  6. Simple 2D data

  7. Upper Threshold = 2.65

  8. Upper Threshold = 3.0

  9. Upper Threshold = 3

  10. Killer Idea #2
     • Similar to Killer Idea #1, except we want to run into another cluster; that is our stopping condition.
     • If [(current ring density) > (previous ring density) && (new density) > (old density)]: add the cluster to the list and remove it from the original list.
     • Repeat.
     • Outlier trouble?
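
Again as an illustrative sketch only: here the loop expands until both the ring density and the overall density rise, taking that as the sign the radius has run into a neighboring cluster. The step size and the area-based density definitions are assumed, not given on the slide.

    import math
    import random

    def killer_idea_2(points, step=0.5):
        remaining = list(points)
        clusters = []
        while remaining:
            center = random.choice(remaining)
            radius = step
            prev_ring = prev_overall = None
            while True:
                inner = [p for p in remaining if math.dist(p, center) <= radius]
                ring = [p for p in remaining
                        if radius < math.dist(p, center) <= radius + step]
                ring_area = math.pi * ((radius + step) ** 2 - radius ** 2)
                ring_density = len(ring) / ring_area
                overall = len(inner) / (math.pi * radius ** 2)
                # stopping condition: both the current ring density and the
                # overall density are rising again, i.e. the radius has
                # run into another cluster
                hit_cluster = (prev_ring is not None
                               and ring_density > prev_ring
                               and overall > prev_overall)
                if hit_cluster or len(inner) == len(remaining):
                    clusters.append((center, radius))
                    remaining = [p for p in remaining
                                 if math.dist(p, center) > radius]
                    break
                prev_ring, prev_overall = ring_density, overall
                radius += step
        return clusters

If the random starting point is an outlier, the densities only rise once the radius finally reaches the real data, so the recorded "cluster" can swallow far more than it should; this seems to be the outlier trouble the slide asks about.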

  11. New Algorithm

  12. New Algorithm

  13. Simple 2D data

  14. New Algorithm – Iris Data

  15. Conclusion
     • Both Killer Ideas are very sensitive to their threshold settings.
     • The results can vary from run to run because the starting point is chosen at random.
     • Killer Idea #2 found extra potential clusters that I hadn’t even thought of.
     • What about outliers?
     • More work needs to be done.

  16. References
     • “K-Means Clustering.” http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html. Accessed 28 Nov. 2004.
     • Iris data set. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/iris/. Accessed 21 Nov. 2004.
     • Dr. Perrizo’s lecture notes.
