
Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation. Jeremy Tantrum, Department of Statistics, University of Washington joint work with Alejandro Murua & Werner Stuetzle Insightful Corporation University of Washington.





Presentation Transcript


  1. Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation Jeremy Tantrum, Department of Statistics, University of Washington. Joint work with Alejandro Murua & Werner Stuetzle, Insightful Corporation and University of Washington. This work has been supported by NSA grant 62-1942.

  2. Motivating Example • Consider clustering documents • Topic Detection and Tracking corpus • 15,863 news stories for one year from Reuters and CNN • 25,000 unique words • Possibly many topics • Large numbers of observations • High dimensions • Many groups

  3. Goal of Clustering • Detect that there are 5 or 6 groups • Assign observations to groups

  4. NonParametric Clustering • Premise: • Observations are sampled from a density p(x) • Groups correspond to modes of p(x)

  5. NonParametric Clustering Fitting: Estimate p(x) nonparametrically and find significant modes of the estimate

  6. Model Based Clustering • Premise: • Observations are sampled from a mixture density p(x) = Σg πg pg(x) • Groups correspond to mixture components

  7. Model Based Clustering Fitting: Estimate the mixing proportions πg and the parameters of the component densities pg(x)

  8. Model Based Clustering Fitting a Mixture of Gaussians • Use the EM algorithm to maximize the log likelihood • E-step: estimate the probabilities of each observation belonging to each group • M-step: maximize the likelihood given these probabilities • Requires a good starting point
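The EM loop described on this slide can be sketched in a few lines. This is my own minimal 1-D numpy illustration (not the authors' code), using evenly spaced quantiles as a crude stand-in for the "good starting point":

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50):
    """Minimal EM for a 1-D mixture of k Gaussians (illustration only)."""
    n = len(x)
    # A good starting point matters: spread the initial means over the data.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: probability of each observation belonging to each group.
        dens = np.array([pi[g] / (sigma[g] * np.sqrt(2 * np.pi))
                         * np.exp(-0.5 * ((x - mu[g]) / sigma[g]) ** 2)
                         for g in range(k)])            # shape (k, n)
        resp = dens / dens.sum(axis=0)
        # M-step: maximize the likelihood given these probabilities.
        ng = resp.sum(axis=1)
        pi = ng / n
        mu = (resp * x).sum(axis=1) / ng
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / ng)
    return pi, mu, sigma

# Two well-separated groups; EM should recover means near 0 and 5.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
pi, mu, sigma = em_gmm_1d(x)
```

With a poor starting point (e.g. both initial means inside one group) the same loop can converge to a much worse local maximum, which is why the next slide's hierarchical step matters.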

  9. Model Based Clustering Hierarchical Clustering • Provides a good starting point for the EM algorithm • Start with every point being its own cluster • Repeatedly merge the two closest clusters, where closeness is measured by the decrease in likelihood when the two clusters are merged • Uses the classification likelihood, not the mixture likelihood • Algorithm is quadratic in the number of observations
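The merge criterion can be made concrete in the 1-D Gaussian case: the cost of a merge is the drop in classification log-likelihood when each cluster's own MLE Gaussian is replaced by one fitted to their union. This is a hypothetical sketch, not the authors' implementation:

```python
import numpy as np

def cluster_loglik(x):
    """Classification log-likelihood of one 1-D cluster under its own MLE Gaussian."""
    n = len(x)
    var = x.var() + 1e-9                      # small floor guards tiny clusters
    return -0.5 * n * (np.log(2 * np.pi * var) + 1)

def merge_cost(a, b):
    """Decrease in classification log-likelihood if clusters a and b are merged."""
    return cluster_loglik(a) + cluster_loglik(b) - cluster_loglik(np.concatenate([a, b]))

# Nearby clusters are cheap to merge; distant clusters are expensive,
# so the agglomerative step always merges the cheapest pair first.
a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])
c = np.array([10.0, 11.0, 12.0])
```

Evaluating this cost for every pair at every step is what makes the plain algorithm quadratic in the number of observations.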

  10. Likelihood Distance [Figure: two pairs of component densities p1(x), p2(x) with their merged density p(x). When the components overlap, the merge gives a small decrease in likelihood; when they are well separated, the merge gives a big decrease in likelihood.]

  11. Bayesian Information Criterion • Choose the number of clusters by maximizing the Bayesian Information Criterion, BIC = 2 log L − r log n • r is the number of parameters • n is the number of observations • Log likelihood penalized for model complexity
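The trade-off is easy to demonstrate numerically (using the maximize-BIC convention from the slide; the log-likelihood values below are made up for illustration):

```python
import numpy as np

def bic(loglik, r, n):
    """BIC = 2 * log-likelihood - r * log(n): fit rewarded, complexity penalized."""
    return 2 * loglik - r * np.log(n)

# A 10-parameter model fits a little better (-95 vs -100) but pays a larger
# penalty, so BIC prefers the simpler 5-parameter model at n = 400.
simple = bic(loglik=-100.0, r=5, n=400)
complex_ = bic(loglik=-95.0, r=10, n=400)
```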

  12. Fractionation Invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets. M is the largest number of observations for which a hierarchical O(M²) algorithm is computationally feasible. [Diagram: the original data of size n is split into n/M fractions of size M; each fraction is partitioned into αM clusters (α < 1), each summarized by a meta-observation mi, giving αn clusters in total; if αn > M, repeat on the meta-observations.]

  13. Fractionation • αn meta-observations after the first round • α²n meta-observations after the second round • αⁱn meta-observations after the i-th round • The i-th pass has α^(i−1)n/M fractions taking O(M²) operations each • Total number of operations: Σi (α^(i−1)n/M)·M² ≤ nM/(1−α) • Total running time is linear in n!
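The geometric-series argument can be checked directly with a rough operation counter (my sketch; constants and rounding details are illustrative, not the authors'):

```python
def total_ops(n, M, alpha):
    """Rough operation count for fractionation (constants ignored).
    Each pass runs an O(M^2) hierarchical algorithm on every fraction,
    and about alpha * n meta-observations survive to the next pass."""
    ops, remaining = 0, n
    while remaining > M:
        fractions = -(-remaining // M)        # ceil(remaining / M) fractions of size <= M
        ops += fractions * M * M              # O(M^2) per fraction
        remaining = int(alpha * remaining)    # meta-observations left for the next pass
    ops += remaining * remaining              # final pass fits in one fraction
    return ops

# Doubling n roughly doubles the work: the geometric series sums to <= n*M/(1-alpha).
ratio = total_ops(200_000, 1000, 0.5) / total_ops(100_000, 1000, 0.5)
```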

  14. Model Based Fractionation • Use model based clustering • Meta-observations contain all sufficient statistics: (ni, mi, Si) • ni is the number of observations – size • mi is the mean – location • Si is the covariance matrix – shape and volume
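Because (ni, mi, Si) are sufficient statistics, two meta-observations can be merged exactly without revisiting the raw data. A sketch of the pooling step (my own; it assumes Si is the MLE, i.e. divide-by-n, covariance):

```python
import numpy as np

def merge_meta(n1, m1, S1, n2, m2, S2):
    """Merge two meta-observations (n_i, m_i, S_i) exactly."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n                       # pooled mean
    d1, d2 = m1 - m, m2 - m
    # Pooled scatter = within-cluster scatter + between-cluster scatter.
    S = (n1 * (S1 + np.outer(d1, d1)) + n2 * (S2 + np.outer(d2, d2))) / n
    return n, m, S

# Merging the stats of two halves reproduces the stats of the whole sample.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 2))
a, b = x[:4], x[4:]
n, m, S = merge_meta(4, a.mean(axis=0), np.cov(a.T, bias=True),
                     6, b.mean(axis=0), np.cov(b.T, bias=True))
```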

  15. Model Based Fractionation An example: 400 observations in 4 groups. [Figures: the observations in the first fraction; 10 meta-observations from each of the first, second, third and fourth fractions; the 40 meta-observations; the final clusters chosen by BIC.] Success!

  16. Example 2 The data: 400 observations in 25 groups. [Figures: the observations in fraction 1; 10 meta-observations from each of the first, second, third and fourth fractions; the 40 meta-observations; the clusters chosen by BIC.] Fractionation fails!

  17. Refractionation Problem: • If the number of meta-observations generated from a fraction is less than the number of groups in that fraction then two or more groups will be merged. • Once observations from two groups are merged they can never be split again. Solution: • Apply fractionation repeatedly. • Use meta-observations from the previous pass of fractionation to create “better” fractions.

  18. Example 2 Continued [Figures: the 40 meta-observations are grouped into 4 new clusters, which define 4 new fractions.]

  19. Example 2 – Pass 2 [Figures: the observations in the new fraction 1; the clusters from each of the four fractions; the 40 meta-observations; the clusters chosen by BIC.]

  20. Example 2 – Pass 3 [Figures: the 40 meta-observations of pass 2 grouped into 4 new clusters and 4 new fractions; the observations in the new fraction 1; the clusters from each of the four fractions; the 40 meta-observations; the clusters chosen by BIC.] Refractionation succeeds!

  21. Realistic Example • 1100 documents from the TDT corpus, partitioned by people into 19 topics • Transformed into a 50 dimensional space using Latent Semantic Indexing [Figure: projection of the data onto a plane – colors represent topics.]

  22. Realistic Example Want to create a dataset with more observations and more groups Idea: Replace each group with a scaled and transformed version of the entire data set.


  24. Realistic Example To measure similarity of clusters to groups: the Fowlkes–Mallows index • Geometric average of: • the probability of 2 randomly chosen observations from the same cluster being in the same group • the probability of 2 randomly chosen observations from the same group being in the same cluster • A Fowlkes–Mallows index near 1 means the clusters are good estimates of the groups • Clustering the 1100 documents gives a Fowlkes–Mallows index of 0.76 – our “gold standard”
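The index can be computed from the group-by-cluster contingency table of pair counts; a short numpy sketch of the standard formulation (equivalent to the geometric average above):

```python
import numpy as np

def fowlkes_mallows(groups, clusters):
    """Fowlkes-Mallows index from pairwise co-occurrence counts."""
    g_ids, g = np.unique(groups, return_inverse=True)
    c_ids, c = np.unique(clusters, return_inverse=True)
    cont = np.zeros((len(g_ids), len(c_ids)))
    np.add.at(cont, (g, c), 1)                 # n_ij = #obs in group i and cluster j
    pairs = lambda v: (v * (v - 1) / 2).sum()  # number of pairs within each count
    together = pairs(cont)                     # pairs in the same group AND cluster
    # Geometric mean of the two conditional probabilities from the slide.
    return together / np.sqrt(pairs(cont.sum(axis=1)) * pairs(cont.sum(axis=0)))
```

A perfect clustering (cluster labels a relabeling of the groups) scores 1; a clustering that splits every group across clusters scores near 0.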

  25. Realistic Example • 19×19 = 361 clusters, 19×1100 = 20,900 observations in 50 dimensions • Fraction size ≈ 1000 with 100 meta-observations per fraction • 4 passes of fractionation choosing 361 clusters [Figure: distribution of the number of groups per fraction, against the number of fractions, by pass.]

  26. Realistic Example • 19×19 = 361 clusters, 19×1100 = 20,900 observations in 50 dimensions • Fraction size ≈ 1000 with 100 meta-observations per fraction • 4 passes of fractionation choosing 361 clusters • The sum of the number of groups represented in each cluster: 361 is perfect

  27. Realistic Example • 19×19 = 361 clusters, 19×1100 = 20,900 observations in 50 dimensions • Fraction size ≈ 1000 with 100 meta-observations per fraction • 4 passes of fractionation choosing 361 clusters Refractionation: • Purifies fractions • Successfully deals with the case where the number of groups is greater than αM, the number of meta-observations

  28. Contributions • Model Based Fractionation • Extended the fractionation idea to the parametric setting • Incorporates information about the size, shape and volume of clusters • Chooses the number of clusters • Still linear in n • Model Based Refractionation • Extended fractionation to handle larger numbers of groups

  29. Extensions • Extend to 100,000s of observations – 1000s of groups • Currently the number of groups must be less than M • Extend to a more flexible class of models • With small groups in high dimensions, we need a more constrained model (fewer parameters) than the full covariance model • Mixture of Factor Analyzers

  30. Fowlkes-Mallows Index The geometric average of: • Pr(2 documents in the same group | they are in the same cluster) • Pr(2 documents in the same cluster | they are in the same group)
