
Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?




  1. Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary? Nguyen Xuan Vinh (UNSW), Julien Epps (UNSW), James Bailey (Uni Melbourne). Australia's ICT Research Centre of Excellence

  2. Correction for Chance for Information Theoretic Measures - Outline Introduction to clustering and clustering comparison A brief survey of clustering comparison measures How does chance agreement affect information theoretic measures? Adjusted-for-chance measures Conclusion

  3. Introduction Clustering: the “art” of dividing the data points in a data set into meaningful groups Notation: • Data set: S={s1, s2, ..., sN} • (Hard) clustering: a way to partition the data set into non-overlapping parts • U={U1, U2, ..., UR}, where the Ui are non-overlapping subsets of S • V={V1, V2, ..., VC}, where the Vj are non-overlapping subsets of S

  4. Introduction • Clustering comparison measures are used to: • Evaluate the goodness of clustering solutions (assuming the “true” clustering is known) • Evaluate clustering algorithms (over multiple data sets) • More active uses: • To search for a good clustering solution, as in ensemble clustering • To quantify the discordance within a set of clusterings => stability assessment, which may give a useful hint for model selection, such as choosing the “right” number of clusters

  5. Correction for Chance for Information Theoretic Measures - Outline • Introduction to clustering and clustering comparison • A brief review of clustering comparison measures • How does chance agreement affect information theoretic measures? • Adjusted-for-chance measures • Conclusion

  6. Clustering comparison measures – A brief review Three categories: • Pair-counting based • Rand Index (RI), Adjusted Rand Index (ARI) • Jaccard Index, Fowlkes & Mallows index… • 22 in total (Albatineh et al. (2006)) • Set-matching based • The “classification error”, the Van Dongen metric • Information theoretic based • Mutual Information (MI), normalized MI • Variation of Information (Meila (2005))

  7. Clustering comparison measures – A brief review – Rand Index • Rand Index (RI): a set of N data points has N(N-1)/2 pairs of points, which can be classified into four categories (co-clustered in both clusterings, in U only, in V only, or in neither) • RI = the proportion of pairs on which both clusterings agree
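
The pair-counting view translates directly into code. This is an illustrative Python sketch (not the authors' released implementation): a pair "agrees" when the two clusterings either co-cluster it in both or separate it in both.

```python
from itertools import combinations

def rand_index(u, v):
    """Rand Index: the fraction of the N(N-1)/2 point pairs on which
    the label sequences u and v agree (co-clustered in both, or in neither)."""
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)

# Identical partitions (up to label renaming) agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```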

  8. Clustering comparison measures – A brief review – Adjusted Rand Index • Problem with the Rand Index: its baseline value (the average value between random clusterings) is high and varies • Solution: an adjusted index of the form (Index − Expected Index) / (Max Index − Expected Index) • The Adjusted Rand Index applies this adjustment to the RI
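
The ARI can be computed in closed form from the contingency table of the two clusterings. A sketch, assuming the standard pair-count formulation of Hubert and Arabie (illustrative, not the authors' code):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(u, v):
    """ARI: a Rand-type index rescaled so that random labelings score ~0.

    Built from the contingency table counts n_ij and the marginals a_i, b_j;
    the expected pair count comes from the hypergeometric model."""
    n = len(u)
    joint = Counter(zip(u, v))          # n_ij = |U_i ∩ V_j|
    a, b = Counter(u), Counter(v)       # marginals a_i, b_j
    sum_ij = sum(comb(nij, 2) for nij in joint.values())
    sum_a = sum(comb(ai, 2) for ai in a.values())
    sum_b = sum(comb(bj, 2) for bj in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```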

  9. Why do we care about information theoretic measures? • Strong theoretical foundation: information theory • Ability to detect general non-linear correlation • They are receiving increasing interest • Ensemble clustering: Strehl and Ghosh (2002); Fern and Brodley (2003); Singh et al. (2007); He et al. (2008) • Comparison measures: Meila (2003, 2005, 2007); Vinh and Phuong (2008a, 2008b)

  10. Ingredients for information theoretic measures • Given a clustering U={U1, U2, ..., UR} • Entropy of U: H(U) = −Σi P(i) log P(i), where P(i) = |Ui|/N • H(U) measures the uncertainty in determining the cluster label of a data point in S
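
As a minimal Python sketch of this quantity (P(i) estimated from cluster sizes; natural log, so values are in nats):

```python
from collections import Counter
from math import log

def entropy(labels):
    """H(U) = -sum_i P(i) log P(i), with P(i) = |U_i| / N:
    the uncertainty about which cluster a randomly drawn point falls in."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

# Two equal-sized clusters: H = log 2 nats.
print(entropy([0, 0, 1, 1]))  # ≈ 0.693
```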

  11. Ingredients for information theoretic measures – Mutual information • Given the clusterings with point distributions U={U1, U2, ..., UR}, P(i)=|Ui|/N and V={V1, V2, ..., VC}, P’(j)=|Vj|/N • Mutual information is calculated from the joint distribution of points into clusters of U and V: I(U,V) = Σi Σj P(i,j) log [ P(i,j) / (P(i)P’(j)) ], where P(i,j) = |Ui ∩ Vj| / N
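
The same double sum in Python (zero cells of the joint distribution contribute nothing, so they are simply skipped; an illustrative sketch):

```python
from collections import Counter
from math import log

def mutual_information(u, v):
    """I(U,V) = sum_ij P(i,j) log( P(i,j) / (P(i) P'(j)) ), in nats.

    P(i,j) = |U_i ∩ V_j| / N is the joint distribution of points over
    cluster pairs; only non-empty cells appear in the Counter."""
    n = len(u)
    pu, pv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * log(nij * n / (pu[i] * pv[j]))
               for (i, j), nij in joint.items())

# Identical two-cluster labelings share I = H = log 2 nats.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # ≈ 0.693
```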

  12. Information theoretic measures • Mutual information • A similarity measure, range: [0, log N] • Measures the information shared between the two clusterings • Normalized Mutual Information • Range: [0,1] • Variation of Information (Meila (2005)): VI(U,V) = H(U) + H(V) − 2I(U,V) • A dissimilarity measure • A true metric on the space of clusterings • Range: [0, log N]
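
Both derived measures follow directly from H and I. In this sketch the NMI uses the sqrt(H(U)·H(V)) normalizer of Strehl and Ghosh; other normalizers (max, mean) appear in the literature, so treat this choice as an assumption:

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    n = len(u)
    pu, pv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * log(nij * n / (pu[i] * pv[j]))
               for (i, j), nij in joint.items())

def nmi(u, v):
    """Normalized MI in [0, 1], here with the sqrt(H(U) H(V)) normalizer."""
    return mutual_information(u, v) / sqrt(entropy(u) * entropy(v))

def variation_of_information(u, v):
    """VI(U,V) = H(U) + H(V) - 2 I(U,V): a dissimilarity, and a true metric."""
    return entropy(u) + entropy(v) - 2 * mutual_information(u, v)

u = [0, 0, 1, 1]
print(nmi(u, u))                       # ~1.0 for identical clusterings
print(variation_of_information(u, u))  # 0.0 for identical clusterings
```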

  13. Correction for Chance for Information Theoretic Measures - Outline • Introduction to clustering and clustering comparison • A brief survey of clustering comparison measures • How does chance agreement affect information theoretic measures? • Adjusted-for-chance measures • Conclusion

  14. How does chance agreement affect information theoretic measures? – Scenario 1 • A known ground-truth clustering with Ktrue clusters • Algorithm 1 generates a clustering with K1 clusters • Algorithm 2 generates a clustering with K2 clusters • If K1 ≠ K2, would the comparison be “fair”? [Figure: true clustering with Ktrue=5; Algorithm 1, K1=3; Algorithm 2, K2=7; distances d, d’ to the truth]

  15. How does chance agreement affect information theoretic measures? – Experiment 1 • Fix a ground-truth clustering with Ktrue clusters • 10000 random clusterings are generated for each value of K in the range [2, Kmax=2Ktrue] • Measure the average similarity from each set to the ground truth using the Normalized Mutual Information [Figure: average distance d from each set of random clusterings (K = 2, 3, ..., Kmax) to the true clustering, Ktrue=Kmax/2]
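
A down-scaled version of this experiment can be run in a few lines. Here 200 random clusterings per K stand in for the 10000 on the slide, and N=100 points with Ktrue=5 are illustrative choices, not values from the talk:

```python
import random
from collections import Counter
from math import log, sqrt

def nmi(u, v):
    """NMI with the sqrt(H(U) H(V)) normalizer (an assumed choice)."""
    n = len(u)
    def H(labels):
        return -sum((c / n) * log(c / n) for c in Counter(labels).values())
    pu, pv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    mi = sum((nij / n) * log(nij * n / (pu[i] * pv[j]))
             for (i, j), nij in joint.items())
    return mi / sqrt(H(u) * H(v))

rng = random.Random(0)
N, K_TRUE = 100, 5
truth = [i % K_TRUE for i in range(N)]  # fixed ground-truth clustering

# Average NMI of purely random clusterings against the truth, per K:
# the baseline rises with K even though every clustering is random.
for k in range(2, 2 * K_TRUE + 1, 2):
    avg = sum(nmi([rng.randrange(k) for _ in range(N)], truth)
              for _ in range(200)) / 200
    print(k, round(avg, 3))
```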

  16. How does chance agreement affect information theoretic measures? – Experiment 1 • Using NMI and RI: random clusterings with a larger number of clusters tend to lie closer to the ground truth than clusterings with fewer clusters • The ARI appears to be unbiased with respect to the number of clusters

  17. How does chance agreement affect information theoretic measures? – Scenario 2 • Selecting the appropriate number of clusters • In clustering, K is unknown • Approaches: • For hierarchical clustering: 30 stopping-rule procedures (Milligan and Cooper (1985)) • For model-based clustering: the Bayesian Information Criterion (BIC) • The Gap statistic • … • The stability assessment approach

  18. How does chance agreement affect information theoretic measures? – Scenario 2 • Selecting the appropriate number of clusters via stability assessment • Generate multiple sets of clusterings, each set having the same number of clusters • Measure the concordance within each set by calculating the average pairwise similarity value (the Consensus Index) • Higher values indicate stability => a hint to select the true number of clusters [Figure: sets of clusterings with #clusters = 2, ..., Ktrue, ..., Kmax]
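
The Consensus Index is just the average pairwise similarity within a set; any of the measures above can be plugged in. A sketch using the Rand Index as the similarity (an illustrative choice):

```python
from itertools import combinations

def rand_index(u, v):
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)

def consensus_index(clusterings, similarity=rand_index):
    """Average pairwise similarity within a set of clusterings of the
    same data: higher values suggest a more stable choice of K."""
    pairs = list(combinations(clusterings, 2))
    return sum(similarity(u, v) for u, v in pairs) / len(pairs)

# Three identical solutions are perfectly concordant.
print(consensus_index([[0, 0, 1, 1]] * 3))  # 1.0
```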

  19. How does chance agreement affect information theoretic measures? – Experiment 2 • Generate 200 random clusterings of N data points for each value of K in the range [2, Kmax] • Measure the average pairwise similarity within each set using the Normalized Mutual Information

  20. How does chance agreement affect information theoretic measures? – Experiment 2 • Using NMI and RI: the average pairwise similarity within sets of random clusterings with a larger number of clusters tends to be higher than that within sets of random clusterings with fewer clusters • Using the ARI: unbiased toward any particular number of clusters

  21. Correction for Chance for Information Theoretic Measures - Outline • Introduction to clustering and clustering comparison • A brief survey of clustering comparison measures • How does chance agreement affect information theoretic measures? • Adjusted-for-chance measures • Conclusion

  22. Adjusting information theoretic measures for chance • Model of randomness: the hypergeometric distribution model (clusterings are generated randomly subject to fixed marginals, i.e. fixed cluster sizes) • This is the model previously employed for the Adjusted Rand Index.
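
This model of randomness can be simulated directly: holding cluster sizes fixed while randomizing assignments is equivalent to permuting the label sequence. A minimal sketch:

```python
import random
from collections import Counter

def random_clustering_fixed_marginals(labels, rng):
    """Sample from the permutation model underlying the adjustment:
    cluster sizes (the marginals) stay fixed, point assignments are shuffled."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
u = [0, 0, 0, 1, 1, 2]
v = random_clustering_fixed_marginals(u, rng)
# The shuffled clustering has exactly the same cluster sizes as u.
print(sorted(Counter(v).values()))  # [1, 2, 3]
```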

  23. Adjusting information theoretic measures for chance • The Expected Mutual Information between a pair of random clusterings under this model is: E[I(U,V)] = Σi Σj Σnij (nij/N) log( N·nij / (ai·bj) ) · P(nij), where ai = |Ui|, bj = |Vj|, nij runs from max(0, ai+bj−N) to min(ai, bj), and P(nij) = ai! bj! (N−ai)! (N−bj)! / ( N! nij! (ai−nij)! (bj−nij)! (N−ai−bj+nij)! ) is the hypergeometric probability of the cell count nij
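
A direct, unoptimized implementation of this triple sum (an illustrative sketch, not the authors' released code): for every cell of the contingency table, sum the MI contribution of each feasible cell count weighted by its hypergeometric probability.

```python
from collections import Counter
from math import comb, log

def expected_mutual_information(u, v):
    """E[I] under the fixed-marginals (hypergeometric) model.

    For each pair of marginals (a_i, b_j), the cell count n_ij follows a
    hypergeometric distribution; n_ij = 0 contributes nothing, so the
    inner loop starts at max(1, a_i + b_j - N)."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = comb(bj, nij) * comb(n - bj, ai - nij) / comb(n, ai)
                emi += p * (nij / n) * log(n * nij / (ai * bj))
    return emi

# Even two unrelated 2-cluster labelings of 4 points have E[I] > 0.
print(expected_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # ≈ 0.231
```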

  24. Adjusted Mutual Information (AMI) • General formula for an adjusted similarity measure: Adjusted Index = (Index − Expected Index) / (Max Index − Expected Index) • The Adjusted Mutual Information applies this adjustment to the mutual information, using an upper bound on I(U,V) as the Max Index
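
Putting the pieces together, a sketch of the AMI. The normalizer here is the max(H(U), H(V)) upper bound on I(U,V); other upper bounds (e.g. sqrt(H(U)·H(V))) give variant normalizations, so treat this choice as an assumption rather than the talk's exact definition:

```python
from collections import Counter
from math import comb, log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    n = len(u)
    pu, pv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * log(nij * n / (pu[i] * pv[j]))
               for (i, j), nij in joint.items())

def expected_mutual_information(u, v):
    n = len(u)
    a, b = Counter(u), Counter(v)
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = comb(bj, nij) * comb(n - bj, ai - nij) / comb(n, ai)
                emi += p * (nij / n) * log(n * nij / (ai * bj))
    return emi

def ami(u, v):
    """AMI = (I - E[I]) / (max(H(U), H(V)) - E[I]): ~0 for chance-level
    agreement, 1 for identical clusterings."""
    e = expected_mutual_information(u, v)
    return (mutual_information(u, v) - e) / (max(entropy(u), entropy(v)) - e)

u = [0, 0, 1, 1]
print(ami(u, u))            # ~1.0: identical clusterings
print(ami(u, [0, 1, 0, 1])) # ≈ -0.5: no better than chance
```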

  25. Experiment 1 • Variation due to chance is negligible • Note: the data are not generated according to the assumed model

  26. Experiment 2 • Variation due to chance is negligible • Note: the data are not generated according to the assumed model

  27. Conclusion & Future work • Information theoretic measures for clustering comparison are affected by chance, especially when the number of data points per cluster is small • Adjusted-for-chance measures have been proposed • They work well in practice, despite the hypergeometric assumption of randomness • Code: http://ee.unsw.edu.au/~nguyenv/Software.htm • What are the differences between the ARI and the AMI? ‘Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance’, N.X. Vinh, Epps, J. and Bailey, J., to be submitted.

  28. Thank you!
