More on Choosing #Clusters in General (not just k-means; see the chapter for the fusion plot, etc.)
• Some researchers run their cluster analysis and then, to demonstrate that the resulting clusters are “significantly” different, run a (one-way) ANOVA and, voilà, show that the F is large.
• Well, duh! The objective of the cluster analysis was to find groups that are maximally separable, so a large F is to be expected.
• Take a look at Milligan & Cooper (1985). They compared some 30 methods for determining the proper #clusters and found 3 criteria that produced good results: a pseudo F (Calinski & Harabasz 1974), a J statistic (Duda & Hart 1973), and the CCC, the cubic clustering criterion. The 1st and 3rd of these are displayed in SAS (Proc Cluster).
• For example, the pseudo F:
  • N = #observations (sample size)
  • C = #clusters (at a particular level of the clustering hierarchy)
  • Look at the eqn: it’s basically MSbetween/MSwithin
  • So larger is better, though of course you need to factor in that the fit should improve with more clusters (larger C)
  • If the data are multivariate normal, it is distributed F on p(C-1) & p(N-C) df (where p = #vars)
  • And you can compare F across values of C to find the optimal C
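To make the pseudo F concrete, here is a minimal sketch in plain Python. It assumes the usual Calinski & Harabasz form, pseudo F = [SSB/(C-1)] / [SSW/(N-C)], where SSB and SSW are the between- and within-cluster sums of squares summed over all variables. The toy data, the hand-assigned candidate partitions, and the names `pseudo_f`, `points`, and `candidates` are illustrative assumptions, not part of the slides or of SAS.

```python
def pseudo_f(points, labels):
    """Pseudo F = [SSB / (C - 1)] / [SSW / (N - C)] for one partition."""
    n = len(points)          # N = number of observations
    p = len(points[0])       # p = number of variables
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    c = len(clusters)        # C = number of clusters
    if c < 2 or c >= n:
        raise ValueError("pseudo F needs 2 <= C < N")

    grand = [sum(x[j] for x in points) / n for j in range(p)]
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for members in clusters.values():
        m = len(members)
        centroid = [sum(x[j] for x in members) / m for j in range(p)]
        ssb += m * sum((centroid[j] - grand[j]) ** 2 for j in range(p))
        ssw += sum((x[j] - centroid[j]) ** 2 for x in members for j in range(p))
    return (ssb / (c - 1)) / (ssw / (n - c))


# Hypothetical toy data: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (0, 10), (0, 11), (1, 10)]

# Hand-assigned candidate partitions for C = 2, 3, 4 (stand-ins for
# cutting a clustering hierarchy at different levels).
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 3, 1, 1, 1, 2, 2, 2],
}

scores = {c: pseudo_f(points, labels) for c, labels in candidates.items()}
best_c = max(scores, key=scores.get)
for c in sorted(scores):
    print(f"C={c}: pseudo F = {scores[c]:.2f}")
print("C with the largest pseudo F:", best_c)
```

On this toy data SSW keeps shrinking as C grows, yet the pseudo F peaks at C = 3 because each sum of squares is divided by its df. In practice the candidate partitions would come from the clustering procedure itself rather than being assigned by hand.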
References
• Breckenridge, James N. (2000), “Validating Cluster Analysis: Consistent Replication and Symmetry,” Multivariate Behavioral Research, 35 (2), 261-285.
• Calinski, T. and J. Harabasz (1974), “A Dendrite Method for Cluster Analysis,” Communications in Statistics, 3, 1-27.
• Krolak-Schwerdt, Sabine and Thomas Eckes (1992), “A Graph Theoretic Criterion for Determining the Number of Clusters in a Data Set,” Multivariate Behavioral Research, 27 (4), 541-565.
• Milligan, Glenn W. and Martha C. Cooper (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set,” Psychometrika, 50, 159-179.
• Steinley, Douglas and Michael J. Brusco (2011), “Choosing the Number of Clusters in K-Means Clustering,” Psychological Methods, 16 (3), 285-297.