
Adapting the Right Measures for K-means Clustering



Presentation Transcript


  1. Adapting the Right Measures for K-means Clustering Junjie Wu (wujj@buaa.edu.cn) Beihang University Joint Work with Hui Xiong (Rutgers Univ.) & Jian Chen (Tsinghua Univ.)

  2. Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks

  3. Clustering and Cluster Validation (Workflow: Input Data → Data Preprocessing → Clustering → Cluster Validation → Output Clusters) • Cluster analysis provides insight into the data by dividing the objects into groups (clusters) such that objects in a cluster are more similar to each other than to objects in other clusters. • Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988] • How to be “quantitative”: employ validation measures. • How to be “objective”: validate the measures themselves!

  4. Cluster Validation Measures • A Typical View of Cluster Validation Measures: • External measures • Match a cluster structure to prior information, e.g., class labels. • E.g., Rand index, Γ statistic, F-measure, Mutual Information • Internal measures • Assess the fit between the cluster structure and the data themselves. • E.g., Silhouette index, CPCC, Γ statistic • Relative measures • Decide which of two structures is better; often used for selecting the right clustering parameters, e.g., the number of clusters. • E.g., Dunn’s indices, Davies-Bouldin index, partition coefficient • Other Views: • Partitional Indices vs. Hierarchical Indices • Fuzzy Indices vs. Non-Fuzzy Indices • Statistics-based Indices vs. Information-based Indices
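
As a quick illustration of the external measures listed above (a sketch, not part of the original slides), the snippet below computes the Rand index, its adjusted form, and normalized mutual information from class labels and cluster labels using scikit-learn; the toy label vectors are made up for the example.

```python
# Sketch: external validation compares cluster labels against prior class labels.
from sklearn.metrics import rand_score, adjusted_rand_score, normalized_mutual_info_score

class_labels   = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]   # prior information (classes), hypothetical
cluster_labels = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]   # output of a clustering algorithm, hypothetical

print("Rand index:                   ", rand_score(class_labels, cluster_labels))
print("Adjusted Rand index:          ", adjusted_rand_score(class_labels, cluster_labels))
print("Normalized mutual information:", normalized_mutual_info_score(class_labels, cluster_labels))
```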

  5. Research Motivations • There is little work on evaluating the effectiveness of cluster validation measures in a systematic way. Many questions remain! • Which measures are widely used? • Are these measures objective? • Why and how should these measures be normalized? • What are the properties and interrelationships of these measures? • How can we adapt the right measures for a specific clustering algorithm? • The answers to these questions are essential to the success of cluster analysis!

  6. The Scope of this Study • To provide an organized study of external validation measures for K-means clustering. • K-means is a well-known, widely used, and successful clustering method. • 16 external measures were studied; 13 remained after the first screening step.

  7. Workflow Towards Right Measures

  8. Main Contributions • In general, we provided an organized study of selecting the right measures for K-means clustering. Specifically, we • Reviewed 16 well-known external validation measures; • Identified some defective measures; • Established the importance of measure normalization and designed normalization solutions for several validation measures; • Revealed some major properties of these external measures, such as the consistency, sensitivity, and symmetry properties; • Provided final guidance for adapting the right measures for K-means clustering.

  9. Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks

  10. K-means: The Uniform Effect • For data sets with skewed class distributions, K-means tends to produce clusters of relatively uniform size. (Illustrated on the document data set “sports”.)
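
A minimal sketch of the uniform effect (not from the slides; the synthetic two-class data and its skewed sizes are assumptions): on a 950/50 split, K-means with k = 2 typically returns two clusters of much more similar size.

```python
# Sketch of the uniform effect on synthetic 2-D data with skewed class sizes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
big   = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(950, 2))  # large "true" class
small = rng.normal(loc=[4.0, 0.0], scale=1.0, size=(50, 2))   # small "true" class
X = np.vstack([big, small])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("true class sizes:      [950, 50]")
print("K-means cluster sizes:", np.bincount(labels))  # typically far more uniform
```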

  11. A Necessary Selection Criterion • Two clustering results for a sample data set: one with uniform cluster sizes (CV1 = 0) and one with CV1 = 1.125, while the “true” class distribution has CV0 = 1.166; the uniform result is far away from the true distribution.
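
The CV values above are coefficients of variation of the size distributions. A small sketch, assuming CV is the sample standard deviation of the sizes divided by their mean (the size vectors below are hypothetical, not the slide's sample data):

```python
# Sketch: CV of a size distribution = sample standard deviation / mean.
import numpy as np

def coefficient_of_variation(sizes):
    sizes = np.asarray(sizes, dtype=float)
    return sizes.std(ddof=1) / sizes.mean()

class_sizes   = [10, 10, 80]   # hypothetical skewed "true" class distribution
cluster_sizes = [33, 33, 34]   # hypothetical near-uniform K-means result

print("CV0 (classes): ", coefficient_of_variation(class_sizes))    # high CV
print("CV1 (clusters):", coefficient_of_variation(cluster_sizes))  # close to 0
```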

  12. Identifying Defective Measures: An Example • The cluster validation results • Now only 10 measures remained.

  13. Exploring the Defectiveness • Entropy and Purity • Mutual Information • Related quantities shown on the slide: ∑_j max_i n_ij / n (the purity sum) and H(P|C) (the conditional entropy of classes given clusters).
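
A sketch of the two quantities referenced on the slide, computed from a hypothetical class-by-cluster contingency table n_ij: purity as ∑_j max_i n_ij / n, and entropy as the conditional entropy of classes given clusters, H(P|C).

```python
# Sketch: purity and entropy from a contingency table n_ij (rows: classes, columns: clusters).
import numpy as np

def purity(n_ij):
    n = n_ij.sum()
    return n_ij.max(axis=0).sum() / n           # sum over clusters of the majority-class count

def entropy(n_ij):
    n = n_ij.sum()
    col = n_ij.sum(axis=0)                      # cluster sizes n_.j
    p = n_ij / col                              # p(class i | cluster j)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = np.where(n_ij > 0, -p * np.log2(p), 0.0)
    return (col / n * h.sum(axis=0)).sum()      # cluster-size-weighted conditional entropy

n_ij = np.array([[40,  5],
                 [10, 45]])                     # hypothetical table
print("purity :", purity(n_ij))
print("entropy:", entropy(n_ij))
```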

  14. Improving the Defective Measures • Variation of Information (VI) vs. Entropy (E) • van Dongen criterion (VD) vs. Purity (P)
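
A hedged sketch of the two replacement measures, using their common textbook forms (which may be written slightly differently on the slides): VI = H(C) + H(K) − 2I(C; K), and the van Dongen criterion computed from the contingency table and rescaled by 2n.

```python
# Sketch: variation of information (VI) and van Dongen criterion (VD) from n_ij.
import numpy as np

def variation_of_information(n_ij):
    n = n_ij.sum()
    p_ij = n_ij / n
    p_i = p_ij.sum(axis=1, keepdims=True)   # class marginals
    p_j = p_ij.sum(axis=0, keepdims=True)   # cluster marginals
    nz = p_ij > 0
    h_i = -(p_i * np.log(p_i)).sum()
    h_j = -(p_j * np.log(p_j)).sum()
    mi = (p_ij[nz] * np.log(p_ij[nz] / (p_i @ p_j)[nz])).sum()
    return h_i + h_j - 2 * mi

def van_dongen(n_ij):
    n = n_ij.sum()
    return (2 * n - n_ij.max(axis=1).sum() - n_ij.max(axis=0).sum()) / (2 * n)

n_ij = np.array([[40,  5],
                 [10, 45]])                 # hypothetical class-by-cluster table
print("VI:", variation_of_information(n_ij))
print("VD:", van_dongen(n_ij))
```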

  15. Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks

  16. Two Normalization Methods • Normalization enables the use of measures for comparing clustering results across different data sets. • Two types of normalization schemes • Statistics-based normalization • Extreme value-based normalization • Basic Assumption: the contingency table follows a multivariate hypergeometric distribution (row and column sums fixed).
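
As one concrete instance of the statistics-based scheme (a sketch; the slides' own derivations are not reproduced here), a measure S can be normalized as (S − E[S]) / (max S − E[S]), with the expectation taken under the fixed-marginals hypergeometric model. For pair counting this yields the familiar adjusted Rand index:

```python
# Sketch of statistics-based normalization: (S - E[S]) / (max S - E[S]) under fixed
# row/column sums, i.e. the adjusted Rand index computed from a contingency table.
import numpy as np
from scipy.special import comb

def adjusted_rand_from_table(n_ij):
    n = n_ij.sum()
    sum_ij = comb(n_ij, 2).sum()               # pairs grouped together in both partitions
    sum_i  = comb(n_ij.sum(axis=1), 2).sum()   # pairs together in the classes
    sum_j  = comb(n_ij.sum(axis=0), 2).sum()   # pairs together in the clusters
    expected  = sum_i * sum_j / comb(n, 2)     # E[S] under the hypergeometric model
    max_index = (sum_i + sum_j) / 2            # max S
    return (sum_ij - expected) / (max_index - expected)

n_ij = np.array([[40,  5],
                 [10, 45]])                    # hypothetical class-by-cluster table
print("adjusted Rand index:", adjusted_rand_from_table(n_ij))
```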

  17. Normalization Solutions • The Normalized Measures (Type I: statistics-based normalization; Type II: extreme value-based normalization).

  18. Test Normalizations: The DCV Criterion and the Settings • The DCV Criterion • DCV = CV1 − CV0 • As the DCV values go down, the clustering results produced by K-means tend to move away from the “true” class distributions. • As the DCV values go down, good measures are expected to report worse clustering performance. • The Experimental Setup • Data Sets: simulated + sampled, with increasing DCV. • Tools: MATLAB 7.1 and CLUTO 2.1.1.
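
A minimal sketch of the DCV criterion exactly as defined on the slide, DCV = CV1 − CV0, using hypothetical class and cluster sizes:

```python
# Sketch: DCV = CV1 - CV0, where CV1/CV0 are the coefficients of variation of
# the cluster sizes and the "true" class sizes respectively (values hypothetical).
import numpy as np

def cv(sizes):
    sizes = np.asarray(sizes, dtype=float)
    return sizes.std(ddof=1) / sizes.mean()

cv0 = cv([10, 10, 80])    # hypothetical true class sizes
cv1 = cv([33, 33, 34])    # hypothetical K-means cluster sizes
print("DCV = CV1 - CV0 =", cv1 - cv0)   # strongly negative: far from the true distribution
```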

  19. Normalization Experiments: The Results (evaluated by Kendall’s rank correlation with DCV) • Remark • If we use the unnormalized measures for cluster validation, only three measures, namely R, Γ, and Γ’, have strong consistency with DCV. • All the normalized measures show perfect consistency with DCV except for Fn and ξn. • Value ranges become wider after normalization.
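
The consistency check can be sketched as follows (with hypothetical numbers; the actual experiments use the simulated and sampled data sets described above): Kendall's rank correlation between one measure's scores and the DCV values over a series of data sets.

```python
# Sketch: Kendall's rank correlation between one measure's scores and DCV.
from scipy.stats import kendalltau

dcv_values     = [1.2, 0.9, 0.6, 0.3, 0.1]        # hypothetical, decreasing DCV
measure_scores = [0.95, 0.88, 0.80, 0.71, 0.60]   # hypothetical scores of one measure

tau, p_value = kendalltau(dcv_values, measure_scores)
print("Kendall tau:", tau)   # close to +1 => the measure is consistent with DCV
```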

  20. The Impact of the Number of Clusters • Remark • The values of all the measures change as the number of clusters increases. • The normalized measures capture the same optimal cluster number (5) on the data set “la2”.

  21. Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks

  22. The Consistency • The Experimental Setup • Data Sets: 29 benchmark document data sets. • Tools: CLUTO. • Metric: Kendall’s rank correlation. • Result: Correlations of the Measures • The normalized measures show much stronger consistency with one another than the unnormalized measures.

  23. The Consistency, Cont’d • Hierarchical Clustering on the Normalized Measures • Some normalized measures are equivalent. • Some are more similar to one another. • Others show inconsistency in varying degrees. • Only 7 normalized measures remained!

  24. The Sensitivity • Remarks • All the measures show different validation results for the two clusterings except for VDn and Fn. • VIn is the most sensitive measure.

  25. Math Properties

  26. Math Properties, Cont’d

  27. Math Properties, Cont’d

  28. The Selection Process: An Overview • The Way to the Right Measures • Step I: Discard M, MAP and GK. 13 measures remained. • Step II: Filter out E, P, and MI. 10 measures remained. • Step III: Normalize the measures. 10 normalized measures remained. • Step IV: Discard the measures found inconsistent in the hierarchical analysis (slide 23). 7 normalized measures remained. • Step V: Filter out Fn and ξn. 5 normalized measures remained. • Step VI: Discard FMn and Γn. 3 normalized measures remained. • The Three Right Measures for K-means Clustering • Normalized van Dongen criterion (VDn) • Normalized variation of information (VIn) • Normalized Rand index (Rn)

  29. Insights • Guidance for K-means Clustering Validation • It is most suitable to use VDn, since VDn has a simple computational form, satisfies all the mathematically sound properties, and measures well on data with imbalanced class distributions. • When the clustering performances are hard to distinguish, we may use VIn instead, since VIn has high sensitivity in detecting clustering changes. • Rn can also be used as a complement to the above two measures.
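
To make the guidance concrete, here is a hedged sketch that computes the three recommended measures in commonly used normalized forms; the exact normalization formulas from the slides are not reproduced here, so treat the VIn and VDn forms below as assumptions.

```python
# Sketch of the three recommended measures in common normalized forms (assumed,
# possibly not identical to the slides' formulas):
#   Rn  -> adjusted Rand index
#   VIn -> VI rescaled by H(C) + H(K)
#   VDn -> van Dongen criterion with an extreme-value rescaling
import numpy as np
from sklearn.metrics import adjusted_rand_score, mutual_info_score
from scipy.stats import entropy

def normalized_vi(classes, clusters):
    h_c = entropy(np.bincount(classes))
    h_k = entropy(np.bincount(clusters))
    mi = mutual_info_score(classes, clusters)
    return (h_c + h_k - 2 * mi) / (h_c + h_k)     # one possible [0, 1] rescaling (assumption)

def normalized_vd(n_ij):
    n = n_ij.sum()
    d = 2 * n - n_ij.max(axis=1).sum() - n_ij.max(axis=0).sum()
    return d / (2 * n - n_ij.sum(axis=1).max() - n_ij.sum(axis=0).max())  # assumed rescaling

classes  = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])   # hypothetical class labels
clusters = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])   # hypothetical cluster labels

n_ij = np.zeros((classes.max() + 1, clusters.max() + 1), dtype=int)
for c, k in zip(classes, clusters):
    n_ij[c, k] += 1                                    # class-by-cluster contingency table

print("Rn  (adjusted Rand):", adjusted_rand_score(classes, clusters))
print("VIn (assumed form): ", normalized_vi(classes, clusters))
print("VDn (assumed form): ", normalized_vd(n_ij))
```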

  30. Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks

  31. Conclusions • In this study, we compared and contrasted external validation measures for K-means clustering. • It is necessary to normalize validation measures before they can be employed for cluster validation. • We provided normalization solutions for the measures whose normalized forms were not previously available. • We summarized the key properties of these measures; these properties should be considered before deciding which measure to use in practice. • We investigated the relationships among these validation measures.

  32. Thank You! http://datamining.buaa.edu.cn
