
How much k-means can be improved by using better initialization and repeats?


Presentation Transcript


  1. How much k-means can be improved by using better initialization and repeats? Pasi Fränti and Sami Sieranoja, 11.4.2019. P. Fränti and S. Sieranoja, "How much k-means can be improved by using better initialization and repeats?", Pattern Recognition, 2019.

  2. Introduction

  3. Goal of k-means. Input: N points, X = {x1, x2, …, xN}. Output: a partition P = {p1, p2, …, pN} (the cluster label of each point) and k centroids C = {c1, c2, …, ck}. Objective function: sum of squared errors, SSE = Σ_{i=1..N} ||x_i − c_{p_i}||².

  4. Goal of k-means (continued). Same input, output, and SSE objective as above. Assumptions: SSE is a suitable objective for the data, and the number of clusters k is known.

  5. Using SSE objective function
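As a concrete reference for the SSE objective above, here is a minimal NumPy sketch (the function name and array shapes are my own choices, not from the slides):

```python
import numpy as np

def sse(X, C, labels):
    """Sum of squared errors: squared distance from every point to its assigned centroid."""
    # X: (N, D) data points, C: (k, D) centroids, labels: (N,) cluster index per point
    diff = X - C[labels]
    return float(np.sum(diff ** 2))
```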

  6. K-means algorithm http://cs.uef.fi/sipu/clustering/animator/

  7. K-means optimization steps. Assignment step: assign each point to its nearest centroid, p_i = argmin_j ||x_i − c_j||². Centroid step: move each centroid to the mean of its assigned points, c_j = mean{x_i : p_i = j}. (The slide shows before/after illustrations of both steps.)
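A minimal sketch of the two alternating steps in NumPy, assuming squared Euclidean distance and leaving an empty cluster's centroid where it was (function names are mine, not from the paper):

```python
import numpy as np

def assignment_step(X, C):
    # Assign every point to its nearest centroid.
    dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (N, k) squared distances
    return dist.argmin(axis=1)

def centroid_step(X, labels, C):
    # Move each centroid to the mean of the points assigned to it.
    C_new = C.copy()
    for j in range(len(C)):
        members = X[labels == j]
        if len(members) > 0:                                    # keep the old centroid if the cluster is empty
            C_new[j] = members.mean(axis=0)
    return C_new

def kmeans(X, C_init, iterations=25):
    # Alternate the two steps for a fixed number of iterations (25, as in the example run on the slides).
    C = C_init.astype(float)
    for _ in range(iterations):
        labels = assignment_step(X, C)
        C = centroid_step(X, labels, C)
    return C, assignment_step(X, C)
```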

  8. Examples

  9. Iterate by k-means 1st iteration

  10. Iterate by k-means 2nd iteration

  11. Iterate by k-means 3rd iteration

  12. Iterate by k-means 16th iteration

  13. Iterate by k-means 17th iteration

  14. Iterate by k-means 18th iteration

  15. Iterate by k-means 19th iteration

  16. Final result 25 iterations

  17. Problems of k-means: distance between clusters. K-means cannot move centroids between clusters that are far apart.

  18. Data and methodology

  19. Clustering basic benchmark. Fränti and Sieranoja, "K-means properties on six clustering benchmark datasets", Applied Intelligence, 2018. Datasets shown: S1–S4, A1–A3, Unbalance (clusters of 2000 and 100 points), Birch1, Birch2, DIM32.

  20. Centroid index (CI): a cluster-level similarity measure; requires ground truth. P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), 3034–3045, September 2014.
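The cited paper defines CI through a nearest-neighbour mapping between two centroid sets; a hedged sketch of that idea (map in both directions and count unmatched, "orphan" centroids; function names are my own):

```python
import numpy as np

def orphan_count(A, B):
    # Map every centroid in A to its nearest centroid in B and count
    # how many centroids in B receive no mapping at all.
    dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)
    return len(B) - len(np.unique(nearest))

def centroid_index(C_solution, C_ground_truth):
    # CI is the larger of the two one-directional orphan counts; CI = 0
    # means every ground-truth cluster is represented by exactly one centroid.
    return max(orphan_count(C_solution, C_ground_truth),
               orphan_count(C_ground_truth, C_solution))
```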

  21. Centroid index example: CI = 4. The figure marks regions with missing centroids and regions with too many centroids.

  22. Success rate: how often is CI = 0? Example: six runs with CI values 1, 2, 1, 0, 2, 2 give a success rate of 17%.
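Success rate is simply the fraction of independent runs whose CI is zero; a tiny sketch using the CI values from the example above:

```python
def success_rate(ci_values):
    # Fraction of k-means runs that found the correct cluster-level structure (CI = 0).
    return sum(ci == 0 for ci in ci_values) / len(ci_values)

print(success_rate([1, 2, 1, 0, 2, 2]))   # 0.166..., i.e. ~17% as on the slide
```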

  23. Properties of k-means

  24. Cluster overlap: definition. For each point, d1 = distance to its nearest centroid and d2 = distance to the nearest centroid of another cluster.
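A sketch of how d1 and d2 can be computed per point, given data, centroids, and ground-truth labels. The slide does not show how the per-point values are aggregated into a single overlap percentage, so the threshold rule below (count a point when d2 < 2·d1) is only an illustrative assumption:

```python
import numpy as np

def overlap_fraction(X, C, labels, factor=2.0):
    dist = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))   # (N, k) distances
    d1 = dist[np.arange(len(X)), labels]            # distance to the point's own centroid
    other = dist.copy()
    other[np.arange(len(X)), labels] = np.inf
    d2 = other.min(axis=1)                          # distance to the nearest centroid of another cluster
    # ASSUMPTION: a point counts as overlapping when d2 < factor * d1;
    # the aggregation rule actually used in the paper is not given on the slide.
    return float(np.mean(d2 < factor * d1))
```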

  25. Dependency on overlap (S1–S4 datasets, overlap increasing from S1 to S4). Success rates (S1→S4): 3%, 11%, 12%, 26%; average CI values range from 0.9 to 1.8.

  26. Why does overlap help? With overlap = 22%, k-means ran 90 iterations; with overlap = 7%, only 13. More overlap keeps the centroids moving longer, giving them a chance to correct errors.

  27. Linear dependency on the number of clusters (k). Birch2 dataset: 16%.

  28. Dependency on dimensions (DIM datasets, dimensions 32, 64, 128, 256, 512, 1024). CI stays between 3.5 and 3.9, and the success rate is 0% in every case.

  29. Lack of overlap is the cause! (G2 datasets.) Correlation between overlap and success rate: 0.91. (The figure plots success against increasing dimensions and overlap.)

  30. Effect of unbalance (Unbalance dataset). Success rate: 0%; average CI: 3.9. The problem originates from the random initialization.

  31. Summary of k-means properties:
  1. Overlap: good!
  2. Number of clusters: linear dependency.
  3. Dimensionality: no direct effect.
  4. Unbalance of cluster sizes: bad!

  32. How to improve?

  33. Repeated k-means (RKM). Repeat 100 times: initialize (the initialization must include randomness), then run k-means; keep the best result. A sketch follows below.
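A self-contained sketch of the RKM loop: 100 runs from random centroids, keeping the lowest-SSE result (parameter names and the fixed iteration count are my own choices):

```python
import numpy as np

def repeated_kmeans(X, k, repeats=100, iterations=25, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    best_sse, best_C, best_labels = np.inf, None, None
    for _ in range(repeats):
        # Random-centroid initialization: the required randomness comes from here.
        C = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(iterations):
            labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    C[j] = X[labels == j].mean(axis=0)
        sse = float(((X - C[labels]) ** 2).sum())
        if sse < best_sse:                          # keep the best of the repeated runs
            best_sse, best_C, best_labels = sse, C, labels
    return best_C, best_labels, best_sse
```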

  34. How to initialize?
  • Some obvious heuristics: furthest point, sorting, density, projection.
  • A clear state of the art is missing: no single technique outperforms the others in all cases.
  • Initialization is not significantly easier than the clustering problem itself.
  • K-means can be used as a fine-tuner with almost anything.
  • Another desperate effort: repeat it LOTS of times.

  35. Initialization techniques: criteria.
  • Simple to implement
  • Lower (or equal) time complexity than k-means
  • No additional parameters
  • Includes randomness

  36. Requirements for initialization.
  1. Simple to implement: random centroids takes 2 functions and 26 lines of code; repeated k-means takes 5 functions and 162 lines of code. The initialization should be simpler than k-means itself.
  2. Faster than k-means: k-means runs in real time (<1 s) up to about N = 10,000 points, since its complexity is O(IkN) ≈ 25·N^1.5 assuming I ≈ 25 iterations and k ≈ √N. The initialization must therefore be faster than O(N²).

  37. Simplicity of algorithms, measured by the number of parameters, functions, and lines of code. Kinnunen, Sidoroff, Tuononen and Fränti, "Comparison of clustering methods: a case study of text-independent speaker modeling", Pattern Recognition Letters, 2011.

  38. Alternative: a better algorithm. Random Swap (RS) reaches CI = 0: P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018. A Genetic Algorithm (GA) also reaches CI = 0: P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 2000.
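For contrast with repeated restarts, a simplified sketch of the random swap idea from the cited paper: repeatedly replace one centroid with a random data point, fine-tune with a couple of k-means iterations, and keep the swap only if SSE improves (a compressed illustration, not the authors' implementation):

```python
import numpy as np

def random_swap(X, k, swaps=1000, tune_iters=2, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    def tune(C):
        # A few k-means iterations to fine-tune a candidate solution.
        for _ in range(tune_iters):
            labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    C[j] = X[labels == j].mean(axis=0)
        return C, labels, float(((X - C[labels]) ** 2).sum())

    C, labels, best = tune(X[rng.choice(len(X), k, replace=False)].copy())
    for _ in range(swaps):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]    # swap one centroid to a random data point
        trial, trial_labels, sse = tune(trial)
        if sse < best:                                      # accept only improving swaps
            C, labels, best = trial, trial_labels, sse
    return C, labels, best
```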

  39. Initialization techniques

  40. Techniques considered

  41. Random centroids
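A sketch of this initialization: pick k distinct data points uniformly at random (names are mine):

```python
import numpy as np

def init_random_centroids(X, k, rng=np.random.default_rng(0)):
    # Choose k distinct data points uniformly at random as the initial centroids.
    idx = rng.choice(len(X), size=k, replace=False)
    return np.asarray(X, dtype=float)[idx]
```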

  42. Random partitions

  43. Random partitions
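A sketch of random-partition initialization: give every point a random label and take the cluster means as the initial centroids (an empty cluster, unlikely when N >> k, falls back to a random point here):

```python
import numpy as np

def init_random_partition(X, k, rng=np.random.default_rng(0)):
    X = np.asarray(X, dtype=float)
    labels = rng.integers(k, size=len(X))           # assign every point to a random cluster
    return np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else X[rng.integers(len(X))]  # fallback for an empty cluster
                      for j in range(k)])
```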

  44. Furthest points (maxmin)

  45. Furthest points (maxmin)
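A sketch of the maxmin heuristic: start from one random point and repeatedly add the point furthest from all centroids chosen so far:

```python
import numpy as np

def init_maxmin(X, k, rng=np.random.default_rng(0)):
    X = np.asarray(X, dtype=float)
    centroids = [X[rng.integers(len(X))]]                     # first centroid: a random data point
    d = ((X - centroids[0]) ** 2).sum(axis=1)                 # squared distance to nearest chosen centroid
    for _ in range(k - 1):
        centroids.append(X[int(d.argmax())])                  # furthest point becomes the next centroid
        d = np.minimum(d, ((X - centroids[-1]) ** 2).sum(axis=1))
    return np.vstack(centroids)
```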

  46. Projection-based initialization.
  • Most common projection axes: diagonal, principal axis (PCA), principal curves.
  • Initialization: uniform partition along the axis (see the sketch below).
  • Used in divisive clustering: iterative split.
  • PCA costs O(DN)–O(D²N).
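A sketch of one projection-based variant: project onto the first principal axis (via SVD) and take the mean of each of k slices along the projected order. Whether "uniform partition" means equal-count slices (used here) or equal-width intervals is an assumption on my part:

```python
import numpy as np

def init_pca_projection(X, k):
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # first right singular vector = principal axis
    order = np.argsort(Xc @ Vt[0])                      # order of the points along the projection
    slices = np.array_split(order, k)                   # k equal-count slices along the axis
    return np.vstack([X[s].mean(axis=0) for s in slices])
```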

  47. Furthest point projection

  48. Projection example (1) Good projection! Birch2

  49. Projection example (2) Bad projection! Birch1

  50. Projection example (3) Bad projection! Unbalance
