
Estimating the number of data clusters via the Gap statistic

Estimating the number of data clusters via the Gap statistic. Robert Tibshirani, Guenther Walther and Trevor Hastie. J. R. Statist. Soc. B (2001), 63, pp. 411–423.



  1. Estimating the number of data clusters via the Gap statistic Robert Tibshirani, Guenther Walther and Trevor Hastie J.R. Statist. Soc. B (2001), 63, pp. 411--423

  2. Cluster Analysis • Finding groups in data • No training data needed – Unsupervised • Major challenge – estimation of the optimal number of clusters

  3. Wk -- Measure of compactness of clusters • Suppose we have clustered the data into k clusters, with Cr denoting the indices of observations in cluster r, and nr = |Cr| • Let Dr = Σi,i′∈Cr dii′ be the sum of pairwise distances within cluster r, and Wk = Σr=1..k Dr / (2nr) • If d is the squared Euclidean distance, Wk is the pooled within-cluster sum of squares around the cluster means
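As a sketch of this definition (not from the slides; plain NumPy, assuming cluster labels are already available), Wk can be computed directly from distances to the cluster means, which for squared Euclidean distance equals Σr Dr/(2nr):

```python
import numpy as np

def within_dispersion(X, labels):
    """Pooled within-cluster dispersion W_k = sum_r D_r / (2 n_r).

    For squared Euclidean distance, D_r / (2 n_r) equals the sum of
    squared distances of cluster r's points to their mean, so we use
    that shortcut instead of forming all pairwise distances."""
    Wk = 0.0
    for r in np.unique(labels):
        Xr = X[labels == r]
        # within-cluster sum of squares around the cluster mean
        Wk += ((Xr - Xr.mean(axis=0)) ** 2).sum()
    return Wk
```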

  4. Using Wk to determine # clusters

  5. elbow • Wk decreases monotonically as the number of clusters k increases • But from some k on, the decrease flattens markedly • Such an “elbow” indicates the appropriate number of clusters
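A minimal illustration of the elbow idea, using hypothetical Wk values (the ratio-of-drops heuristic below is an informal illustration of "the decrease flattens markedly", not a method from the paper):

```python
import numpy as np

# Hypothetical W_k values for k = 1..6 with a clear elbow at k = 3:
Wk = np.array([1000.0, 400.0, 120.0, 100.0, 85.0, 75.0])

# Informal heuristic: the elbow is the k after which the successive
# drop in W_k becomes small relative to the preceding drop.
drops = -np.diff(Wk)                                 # [600, 280, 20, 15, 10]
elbow = int(np.argmax(drops[:-1] / drops[1:])) + 2   # biggest flattening: k = 3
```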

  6. The Gap statistic • Standardize the graph of log(Wk) by comparing it to its expectation under an appropriate null reference distribution of the data: Gapn(k) = E*n{log(Wk)} − log(Wk) • E*n denotes expectation under a sample of size n from the reference distribution

  7. Reference distribution • Adopt a null model of a single component and reject it in favor of a k component model (k>1) • Two choices for the reference distribution • Generate each reference feature uniformly in the range of the observed values for that feature • Generate the reference features from a uniform distribution over a box aligned with the principal components of the data

  8. Align with feature axes [figure: observations and Monte Carlo simulations in a bounding box aligned with the feature axes]

  9. Align with principal axes [figure: observations and Monte Carlo simulations in a bounding box aligned with the principal axes]
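Both reference choices can be sketched in NumPy (an illustrative implementation, not the authors' code; the function name and the `pca_align` flag are invented here):

```python
import numpy as np

def reference_sample(X, rng, pca_align=False):
    """Draw one uniform reference dataset with the same shape as X.

    pca_align=False: each feature uniform over its observed range.
    pca_align=True: uniform over a box aligned with the principal
    components of the column-centred data, then rotated back."""
    n, p = X.shape
    if pca_align:
        mu = X.mean(axis=0)
        # principal-component basis via SVD of the centred data
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        Xp = (X - mu) @ Vt.T
        Z = rng.uniform(Xp.min(axis=0), Xp.max(axis=0), size=(n, p))
        return Z @ Vt + mu  # rotate back to the original coordinates
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, p))
```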

  10. Computation of the Gap statistic • Cluster the observed data, varying the total number of clusters from k = 1, 2, …, K, giving within-cluster dispersion measures Wk, k = 1, 2, …, K • Generate B reference datasets, using one of the uniform prescriptions, and cluster each one, giving W*kb, b = 1, 2, …, B, k = 1, 2, …, K. Compute the (estimated) Gap statistic: Gap(k) = (1/B) Σb log(W*kb) − log(Wk) • Let l̄ = (1/B) Σb log(W*kb), compute the standard deviation sdk = [(1/B) Σb (log(W*kb) − l̄)²]^(1/2), and define sk = sdk √(1 + 1/B). Finally, find the smallest k such that Gap(k) ≥ Gap(k+1) − sk+1
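The whole procedure can be sketched end to end (an illustrative NumPy implementation, not the authors' code: a minimal k-means stand-in for the clustering step and the simpler range-based uniform reference):

```python
import numpy as np

def kmeans(X, k, rng, n_init=5, n_iter=50):
    """Minimal Lloyd's k-means; returns labels of the best of
    n_init random restarts (lowest within-cluster sum of squares)."""
    best_labels, best_W = None, np.inf
    for _ in range(n_init):
        centres = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(1)
            for r in range(k):
                if (labels == r).any():
                    centres[r] = X[labels == r].mean(0)
        W = sum(((X[labels == r] - X[labels == r].mean(0)) ** 2).sum()
                for r in np.unique(labels))
        if W < best_W:
            best_W, best_labels = W, labels
    return best_labels

def log_Wk(X, k, rng):
    """log of the within-cluster dispersion after k-means with k clusters."""
    labels = kmeans(X, k, rng)
    return np.log(sum(((X[labels == r] - X[labels == r].mean(0)) ** 2).sum()
                      for r in np.unique(labels)))

def gap_statistic(X, K=5, B=10, rng=None):
    """Estimate the number of clusters: smallest k with
    Gap(k) >= Gap(k+1) - s_{k+1}, using a range-based uniform reference."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, p = X.shape
    lo, hi = X.min(0), X.max(0)
    gaps, s = np.empty(K), np.empty(K)
    for k in range(1, K + 1):
        ref = np.array([log_Wk(rng.uniform(lo, hi, (n, p)), k, rng)
                        for _ in range(B)])
        gaps[k - 1] = ref.mean() - log_Wk(X, k, rng)
        s[k - 1] = ref.std() * np.sqrt(1 + 1 / B)
    for k in range(K - 1):
        if gaps[k] >= gaps[k + 1] - s[k + 1]:
            return k + 1
    return K
```

On two well-separated Gaussian clusters this sketch returns 2; in practice one would substitute a production clustering routine for the toy k-means.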

  11. 2-Cluster Example

  12. No-Cluster Example

  13. Example on cDNA microarray data -- hierarchical clustering • 6834 genes, 64 human tumours

  14. Other Approaches • Calinski and Harabasz ’74 • Krzanowski and Lai ’85 • Hartigan ’75 • Kaufman and Rousseeuw ’90 (silhouette)

  15. Simulation (50 times) • 1 cluster: 200 points in 10-D, uniformly distributed • 3 clusters: each with 25 or 50 points in 2-D, normally distributed, w/ centers (0,0), (0,5) and (5,-3) • 4 clusters: each with 25 or 50 points in 3-D, normally distributed, w/ centers randomly chosen from N(0,5I) (simulation w/ clusters having min distance less than 1.0 was discarded.) • 4 clusters: each w/ 25 or 50 points in 10-D, normally distributed, w/ centers randomly chosen from N(0,1.9I) (simulation w/ clusters having min distance less than 1.0 was discarded.) • 2 clusters: each cluster contains 100 points in 3-D, elongated shape, well-separated

  16. Overlapping Clusters • 50 observations from each of two bivariate normal populations with means (0,0) and (δ,0), and covariance I • δ takes 10 values in [0, 5], with 10 simulations for each δ
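The simulated datasets for this experiment can be generated as follows (a sketch; the function name is invented here):

```python
import numpy as np

def overlap_pair(delta, rng, n=50):
    """One simulated dataset: n draws from N((0,0), I) stacked with
    n draws from N((delta,0), I), as in the overlapping-clusters setup."""
    a = rng.normal(size=(n, 2))
    b = rng.normal(size=(n, 2)) + np.array([delta, 0.0])
    return np.vstack([a, b])

rng = np.random.default_rng(0)
# 10 separation values delta in [0, 5]; one dataset per delta here
# (the experiment repeats each delta 10 times).
datasets = [overlap_pair(d, rng) for d in np.linspace(0.0, 5.0, 10)]
```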

  17. Conclusion • Focus on well-separated clusters • Outperforms other approaches when used with a uniform reference distribution in the principal-component orientation • The simpler uniform reference over the range of the data works well, except when the data are concentrated on a subspace
