1 / 51

Centroid index Cluster level quality measure

Centroid index Cluster level quality measure. Pasi Fränti. 3.9.2018. Clustering accuracy. P. Classification accuracy. Known class labels. Solution A:. Solution B:. Oranges. Oranges. 100 %. Precision = 5/7 = 71% Recall = 5/5 = 100%. Apples. Apples. 100 %. Precision = 3/3 = 100%

Download Presentation

Centroid index Cluster level quality measure

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Centroid indexCluster level quality measure Pasi Fränti 3.9.2018

  2. Clustering accuracy

  3. P Classification accuracy Known class labels Solution A: Solution B: Oranges Oranges 100 % Precision = 5/7 = 71% Recall = 5/5 = 100% Apples Apples 100 % Precision = 3/3 = 100% Recall = 3/5 = 60%

  4. Clustering accuracy No class labels! Solution A: Solution B: ???

  5. Internal index Sum-of-squared error (MSE) Solution B: Solution A: 11.3 31.2 30 40 18.7 8.8

  6. External index Compare two solutions Solution B: Solution A: 5 5/7 = 71% 2/5 = 40% 2 3/5 = 60% 3 • Two clustering (A and B) • Clustering against ground truth

  7. Set-matching based methods M. Rezaei and P. Fränti, "Set-matching measures for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28 (8), 2173-2186, August 2016.

  8. What about this…? Solution 2: Solution 1: Solution 3: ? ?

  9. External index Selection of existed methods • Pair-counting measures • Rand index (RI) [Rand, 1971] • Adjusted Rand index (ARI) [Hubert & Arabie, 1985] • Information-theoretic measures • Mutual information (MI) [Vinh, Epps, Bailey, 2010] • Normalized Mutual information (NMI) [Kvalseth, 1987] • Set-matching based measures • Normalized van Dongen (NVD) [Kvalseth, 1987] • Criterion H (CH) [Meila & Heckerman, 2001] • Purity [Rendon et al, 2011] • Centroid index (CI) [Fränti, Rezaei & Zhao, 2014]

  10. Cluster level measure

  11. Point level vs. cluster level Cluster-level mismatches Point-level differences

  12. Point level vs. cluster level Agglomerative (AC) ARI=0.91 ARI=0.82 Random Swap (RS) K-means ARI=0.88

  13. Point level vs. cluster level

  14. Centroid index

  15. Pigeon hole principle • 15 pigeons (clustering A) • 15 pigeon holes (clustering B) • Only one bijective (1:1) • mapping exists

  16. Centroid index (CI) [Fränti, Rezaei, Zhao, Pattern Recognition 2014] empty CI = 4 15 prototypes(pigeons) 15 real clusters (pigeon holes) empty empty empty

  17. Definitions • Find nearest centroids (AB): • Detect prototypes with no mapping: • Number of orphans: • Centroid index: Number of zero mappings! Mapping both ways

  18. Example S2 1 1 2 IndegreeCounts 1 2 1 Mappings 1 1 0 1 1 1 1 1 0 Value 0 indicates an orphan cluster CI = 2

  19. Example Birch1 Mappings CI = 7

  20. Example Birch2 Mappings 1 0 1 1 Two clustersbut only one allocated 3 1 Three mappedinto one CI = 18

  21. Example Birch2 Mappings 2 1 Two clustersbut only one allocated 0 1 1 0 Three mappedinto one 1 CI = 18

  22. Unbalanced exampleK-means result KMGT CI = 4 500 0 2000 0 1 1 1011 492 4 2 458 560 0 989 490 0 K-means tend to put too many clusters here … … and too few here

  23. Unbalanced exampleK-means result GTKM CI = 4 5 1 1 0 1 0 0 0

  24. Experiments

  25. Mean Squared Errors Green = Correct clustering structure Raw numbers don’t tell much

  26. Adjusted Rand Index [Hubert & Arabie, 1985] How highis good?

  27. Normalized Mutual information [Kvalseth, 1987]

  28. Normalized Van Dongen [Kvalseth, 1987] Lower is better

  29. Centroid Similarity Index [Fränti, Rezaei, Zhao, 2014] Ok but lacks threshold

  30. Centroid Index [Fränti, Rezaei, Zhao, 2014]

  31. Going deeper…

  32. Bridge Accurate clustering GAIS-2002 similar to GAIS-2012 ? -0.04 -0.29 -0.29

  33. GAIS’02 and GAIS’12 the same? Virtually the same MSE-values GAIS 2002 GAIS 2012 160.8 160.72 -0.04 160.68 160.6 -0.29 -0.29 160.43 -0.04 160.39 160.4 GAIS 2002 +RS(8M) GAIS 2002 +RS(8M) 160.2

  34. GAIS’02 and GAIS’12 the same? But different structure! GAIS 2002 GAIS 2012 160.8 160.72 CI=17 160.68 160.6 CI=0 CI=1 160.43 CI=18 160.39 160.4 GAIS 2002 +RS(8M) GAIS 2012 +RS(8M) 160.2

  35. Seemingly the same solutions Same structure “same family”

  36. But why? • Real cluster structure missing • Clusters allocated like well optimized “grid” • Several grids results different allocation • Overall clustering quality can still be the same

  37. RS runs

  38. RS runs 1M vs 8M CI=26 MSE=0.07% CI=26 MSE=0.02% B1 C1 A1 160.8 1M: 161.37 161.48 161.44 CI=1 MSE=0.15% CI=3 MSE=0.28% CI=1 MSE=0.09% 160.6 8M: 160.99 161.22 161.24 160.4 B8 C8 A8 CI=24 MSE=0.01% CI=26 MSE=0.16% 160.2

  39. Generalization

  40. Three alternatives 3. Partition similarity 1. Prototype similarity Prototype must exist 2. Model similarity Derived from model

  41. Partition similarity • Cluster similarity using Jaccard • Calculated from contigency table

  42. x Split-and-Merge EM x Random Swap EM Gaussian mixture model 1 1 RI 0.98 ARI 0.84 MI 3.60 NMI 0.94 NVD 0.08 CH 0.16 1 1 1 1 3 1 CI=2 1 1 1 1 0 1

  43. Gaussian mixture model CI=2 CI=1 CI=3 CI=0 CI=1 CI=0

  44. Arbitrary-shape data

  45. KMGT 1 1 2 CI = 2 2 0 1 0

  46. GTKM 1 1 1 0 CI = 1 1 1 2

  47. SLGT 1 0 1 3 0 1 CI = 2

  48. GTSL 0 2 2 0 0 2 CI = 3

  49. Summary of experimentsprototype similarity

  50. Summary of experimentspartition similarity

More Related