
Randomized Algorithms for Bayesian Hierarchical Clustering


Presentation Transcript


  1. Randomized Algorithms for Bayesian Hierarchical Clustering Katherine A. Heller Zoubin Ghahramani Gatsby Unit, University College London

  2. Hierarchies: • are natural outcomes of certain generative processes • are intuitive representations for certain kinds of data • Examples: • Biological organisms • Newsgroups, Emails, Actions • …

  3. Traditional Hierarchical Clustering • As in Duda and Hart (1973) • Many distance metrics are possible
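
A brief illustration (not from the slides) of the traditional approach, using SciPy's agglomerative clustering: the user must choose a linkage method, a distance metric, and ultimately a number of clusters, which motivates the limitations listed on the next slide.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).randn(20, 2)               # toy 2-D data
Z = linkage(X, method="average", metric="euclidean")    # many metrics/linkages are possible
labels = fcluster(Z, t=3, criterion="maxclust")         # the user still has to pick the number of clusters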

  4. Limitations of Traditional Hierarchical Clustering Algorithms • How many clusters should there be? • It is hard to choose a distance metric • They do not define a probabilistic model of the data, so they cannot: • Predict the probability or cluster assignment of new data points • Be compared to or combined with other probabilistic models • Our Goal: To overcome these limitations by defining a novel statistical approach to hierarchical clustering

  5. Bayesian Hierarchical Clustering • Our algorithm can be understood from two different perspectives: • A Bayesian way to do hierarchical clustering where marginal likelihoods are used to decide which merges are advantageous • A novel fast bottom-up way of doing approximate inference in a Dirichlet Process mixture model (e.g. an infinite mixture of Gaussians)

  6. Outline • Background • Traditional Hierarchical Clustering and its Limitations • Marginal Likelihoods • Dirichlet Process Mixtures (infinite mixture models) • Bayesian Hierarchical Clustering (BHC) algorithm: • Theoretical Results • Experimental Results • Randomized BHC algorithms • Conclusions

  7. Dirichlet Process Mixtures (a.k.a. infinite mixture models) • Consider a mixture model with K components (e.g. Gaussians) • How to choose K? Infer K from data? • But this would require that we really believe the data came from a mixture of some finite number of components – highly implausible. • Instead, a DPM has K = countably infinite components. • A DPM can be derived by taking the limit K → ∞ of a finite mixture model with a Dirichlet prior on the mixing proportions. • The prior on partitions of data points into clusters in a DPM is called a Chinese Restaurant Process (a sampling sketch follows this slide). • The key to avoiding overfitting in DPMs is Bayesian inference: • you can integrate out all infinitely many parameters • and sample from assignments of data points to clusters
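
A minimal sketch (not from the talk) of sampling a partition from the Chinese Restaurant Process prior mentioned above; alpha is the DPM concentration hyperparameter, and the function name is illustrative.

import random

def sample_crp_partition(n, alpha, seed=0):
    """Seat n points one at a time: point i joins an existing cluster with
    probability proportional to that cluster's size, or opens a new cluster
    with probability proportional to alpha."""
    rng = random.Random(seed)
    cluster_sizes = []                        # points per cluster so far
    assignments = []
    for i in range(n):
        weights = cluster_sizes + [alpha]     # last entry = open a new cluster
        r = rng.uniform(0, i + alpha)         # total weight is i + alpha
        cum = 0.0
        for k, w in enumerate(weights):
            cum += w
            if r <= cum:
                break
        if k == len(cluster_sizes):
            cluster_sizes.append(1)           # new cluster
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments

# Example: one draw for 10 points; larger alpha tends to give more clusters.
print(sample_crp_partition(10, alpha=1.0))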

  8. Outline • Background • Traditional Hierarchical Clustering and its Limitations • Marginal Likelihoods • Dirichlet Process Mixtures (infinite mixture models) • Bayesian Hierarchical Clustering (BHC) Algorithm: • Theoretical Results • Experimental Results • Randomized BHC algorithms • Conclusions

  9. Bayesian Hierarchical Clustering: Building the Tree • The algorithm is virtually identical to traditional hierarchical clustering, except that instead of a distance it uses marginal likelihood to decide on merges. • For each potential merge D_k = D_i ∪ D_j it compares two hypotheses: H1: all data in D_k came from one cluster; H2: data in D_k came from some other clustering consistent with the subtrees T_i and T_j • Prior: π_k = p(H1) • Posterior probability of the merged hypothesis: r_k = π_k p(D_k | H1) / p(D_k | T_k) • Probability of data given the tree T_k: p(D_k | T_k) = π_k p(D_k | H1) + (1 - π_k) p(D_i | T_i) p(D_j | T_j) (a code sketch of this merge score follows this slide)
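
A hypothetical sketch of how one candidate merge would be scored with the quantities above. Here marg_lik stands for the single-cluster marginal likelihood p(D_k | H1) from slide 13, and pi_k for the merge prior computed bottom-up (slide 36); the names are illustrative, not the authors' code.

from dataclasses import dataclass

@dataclass
class Subtree:
    prob_data_tree: float      # p(D | T) for this subtree

def score_merge(left: Subtree, right: Subtree, pi_k: float, marg_lik: float):
    """Return (r_k, p(D_k | T_k)) for merging two subtrees into T_k.

    p(D_k | T_k) = pi_k * p(D_k | H1)
                 + (1 - pi_k) * p(D_i | T_i) * p(D_j | T_j)
    r_k          = pi_k * p(D_k | H1) / p(D_k | T_k)
    """
    p_tree = pi_k * marg_lik + (1.0 - pi_k) * left.prob_data_tree * right.prob_data_tree
    r_k = pi_k * marg_lik / p_tree
    return r_k, p_tree

# The greedy loop repeatedly merges the pair of current subtrees with the
# highest r_k; r_k > 0.5 is the paper's criterion for keeping a merge,
# i.e. for deciding where to cut the tree.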

  10. Building the Tree • The algorithm compares two hypotheses: all data in D_k in one cluster, versus all other clusterings consistent with the subtrees T_i and T_j

  11. Comparison Bayesian Hierarchical Clustering Traditional Hierarchical Clustering

  12. Comparison Bayesian Hierarchical Clustering Traditional Hierarchical Clustering

  13. Computing the Single Cluster Marginal Likelihood • The marginal likelihood for the hypothesis that all data points in D_k belong to one cluster is p(D_k | H1) = ∫ p(D_k | θ) p(θ | β) dθ • If we use models which have conjugate priors, this integral is tractable and is a simple function of the sufficient statistics of D_k • Examples: • For continuous Gaussian data we can use Normal-Inverse-Wishart priors • For discrete multinomial data we can use Dirichlet priors (a worked conjugate example follows this slide)
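
As one concrete, hedged instance of such a conjugate computation: a Beta-Bernoulli model for binary data (a special case of the Dirichlet prior mentioned above), with assumed hyperparameters a = b = 1 rather than values from the talk.

import numpy as np
from scipy.special import betaln

def log_marg_lik_bernoulli(X, a=1.0, b=1.0):
    """X: (n, d) binary array holding one candidate cluster D_k.

    Each independent dimension integrates to a ratio of Beta functions,
    a simple function of the sufficient statistics (counts of ones and zeros).
    """
    X = np.asarray(X)
    n, _ = X.shape
    ones = X.sum(axis=0)           # sufficient statistics per dimension
    zeros = n - ones
    return np.sum(betaln(a + ones, b + zeros) - betaln(a, b))

# Example: three similar binary points score higher (less negative) as a
# single cluster than three very different ones.
same = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
diff = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
print(log_marg_lik_bernoulli(same), log_marg_lik_bernoulli(diff))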

  14. Theoretical Results • The BHC algorithm can be thought of as a new approximate inference method for Dirichlet Process mixtures. • Using dynamic programming, for any tree it sums over exponentially many tree-consistent partitions in O(n) time, whereas the exact algorithm is O(n^n). • BHC provides a new lower bound on the marginal likelihood of DPMs.

  15. Tree-Consistent Partitions • Consider the above tree and all 15 possible partitions of {1,2,3,4}: (1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4) • (1 2) (3) (4) and (1 2 3) (4) are tree-consistent partitions • (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions
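
A small sketch that enumerates tree-consistent partitions by the recursion the dynamic programming exploits: a partition either keeps all of a node's leaves in one block or combines tree-consistent partitions of its two subtrees. The demo tree (((1, 2), 3), 4) is inferred from the examples on this slide.

def flatten(tree):
    """All leaves of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return flatten(tree[0]) + flatten(tree[1])

def tree_consistent_partitions(tree):
    """Either all leaves of this node form one block, or the partition is a
    combination of tree-consistent partitions of the two subtrees."""
    if not isinstance(tree, tuple):
        return [[[tree]]]                      # single leaf, single block
    left, right = tree
    partitions = [[flatten(tree)]]             # the "one cluster" case
    for pl in tree_consistent_partitions(left):
        for pr in tree_consistent_partitions(right):
            partitions.append(pl + pr)
    return partitions

for p in tree_consistent_partitions((((1, 2), 3), 4)):
    print(p)
# Prints 4 partitions -- (1 2 3 4), (1 2 3)(4), (1 2)(3)(4), (1)(2)(3)(4) --
# and correctly excludes e.g. (1)(2 3)(4) and (1 3)(2 4).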

  16. Simulations • Toy Example 1 – continuous data • Toy Example 2 – binary data • Toy Example 3 – digits data

  17. Results: a Toy Example

  18. Results: a Toy Example

  19. Predicting New Data Points

  20. Toy Examples

  21. Toy Examples

  22. Binary Digits Example

  23. Binary Digits Example

  24. 4 Newsgroups Results • 800 examples, 50 attributes: rec.sport.baseball, rec.sport.hockey, rec.autos, sci.space

  25. Results: Average Linkage HC

  26. Results: Bayesian HC

  27. Results: Purity Scores Purity is a measure of how well the hierarchical tree structure is correlated with the labels of the known classes.
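
A sketch assuming the standard dendrogram-purity definition (not the authors' code): for every pair of leaves sharing a class label, find the smallest subtree containing both and measure the fraction of its leaves carrying that label, then average over all such pairs.

from itertools import combinations

def leaves(tree):
    return [tree] if not isinstance(tree, tuple) else leaves(tree[0]) + leaves(tree[1])

def smallest_subtree(tree, a, b):
    """Smallest subtree whose leaves contain both a and b."""
    if isinstance(tree, tuple):
        for child in tree:
            if a in leaves(child) and b in leaves(child):
                return smallest_subtree(child, a, b)
    return tree

def dendrogram_purity(tree, labels):
    """labels: dict mapping each leaf to its known class."""
    scores = []
    for a, b in combinations(leaves(tree), 2):
        if labels[a] != labels[b]:
            continue                                   # only same-class pairs
        sub = leaves(smallest_subtree(tree, a, b))
        scores.append(sum(labels[x] == labels[a] for x in sub) / len(sub))
    return sum(scores) / len(scores)

# A tree whose top-level split separates the two classes has purity 1.0.
print(dendrogram_purity(((1, 2), (3, 4)), {1: "A", 2: "A", 3: "B", 4: "B"}))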

  28. Limitations • Greedy algorithm: • The algorithm may not find the globally optimal tree • No tree uncertainty: • The algorithm finds a single tree, rather than a distribution over plausible trees • O(n²) complexity for building the tree: • Fast, but not for very large datasets; this can be improved

  29. Randomized BHC

  30. Randomized BHC • Algorithm is O(n log n) • Each level of the tree has O(n) operations. • Assumptions: • The top level clustering built from a subset of m data points will be a good approximation to the true top level clustering. • The BHC algorithm tends to produce roughly balanced binary trees. • Can stop after any desired number of levels, before nodes containing only 1 data point are reached
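
An illustrative sketch of this idea; the helpers build_bhc_tree and assign_to_split are placeholders (run exact BHC on a small set of points, and route one point to the left or right top-level cluster, e.g. by comparing its probability under the two subtrees), not the paper's API.

import random

def randomized_bhc(data, m, build_bhc_tree, assign_to_split):
    """data: list of points; m: subset size used at each level."""
    if len(data) <= m:
        return build_bhc_tree(data)                 # small enough: exact BHC
    subset = random.sample(data, m)
    top = build_bhc_tree(subset)                    # O(m^2) work at this level
    left, right = [], []
    for x in data:                                  # O(n) filtering per level
        (left if assign_to_split(top, x) == "left" else right).append(x)
    if not left or not right:                       # degenerate split: fall back to exact BHC
        return build_bhc_tree(data)
    # With roughly balanced splits there are O(log n) levels, each costing
    # O(n), giving O(n log n) overall; recursion can be stopped at any depth.
    return (randomized_bhc(left, m, build_bhc_tree, assign_to_split),
            randomized_bhc(right, m, build_bhc_tree, assign_to_split))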

  31. Randomized BHC – An Alternative Based on EM • This randomized algorithm is O(n)

  32. Approximation Methods for Marginal Likelihoods of Mixture Models • Bayesian Information Criterion (BIC) • Laplace Approximation • Variational Bayes (VB) • Expectation Propagation (EP) • Markov chain Monte Carlo (MCMC) • Bayesian Hierarchical Clustering (new!)

  33. BHC Conclusions • We have shown a Bayesian Hierarchical Clustering algorithm which: • Is simple, deterministic and fast (no MCMC, one pass through the data, etc.) • Can take as input any simple probabilistic model p(x|θ) and gives as output a mixture of these models • Suggests where to cut the tree and how many clusters there are in the data • Gives more reasonable results than traditional hierarchical clustering algorithms • This algorithm: • Recursively computes an approximation to the marginal likelihood of a Dirichlet Process Mixture… • …which can be easily turned into a new lower bound

  34. Future Work • Try on other real hierarchical clustering data sets: • Gene expression data • More text data • Spam/email clustering • Generalize to other models p(x|θ), including more complex models which will require approximate inference • Compare to other marginal likelihood approximations (Variational Bayes, EP, MCMC) • Hyperparameter optimization using an EM-like algorithm • Test randomized algorithms

  35. Appendix: Additional Slides

  36. Computing the Prior for Merging • Where do we get π_k from? • It can be computed bottom-up as the tree is built: • π_k is the relative mass of the partition where all points are in one cluster vs. all other partitions consistent with the subtrees, in a Dirichlet process mixture model with hyperparameter α (a sketch of this recursion follows this slide)
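
A minimal sketch of this bottom-up computation, following the recursion in the BHC paper (d_k = α·Γ(n_k) + d_i·d_j and π_k = α·Γ(n_k)/d_k, with d = α at the leaves), done in log space to avoid overflowing Γ(n_k).

from math import lgamma, log, exp
import numpy as np

def init_leaf(alpha):
    """At a leaf: n_k = 1 and d_k = alpha (so pi_k = 1)."""
    return {"n": 1, "log_d": log(alpha)}

def merge_prior(node_i, node_j, alpha):
    """Return the merged node's bookkeeping and its merge prior pi_k."""
    n_k = node_i["n"] + node_j["n"]
    log_term = log(alpha) + lgamma(n_k)                       # alpha * Gamma(n_k)
    log_d = np.logaddexp(log_term, node_i["log_d"] + node_j["log_d"])
    pi_k = exp(log_term - log_d)
    return {"n": n_k, "log_d": log_d}, pi_k

# Example: merging two leaves gives pi_k = alpha*Gamma(2) / (alpha*Gamma(2) + alpha^2)
# = 1 / (1 + alpha), i.e. 0.5 for alpha = 1.
node, pi = merge_prior(init_leaf(1.0), init_leaf(1.0), alpha=1.0)
print(pi)   # 0.5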

  37. Bayesian Occam’s Razor

  38. Model Structure: polynomials

  39. Bayesian Model Comparison

  40. Nonparametric Bayesian Methods

  41. DPM - III

  42. Dendrogram Purity

  43. Marginal Likelihoods • The marginal likelihood (a.k.a. evidence) is one of the key concepts in Bayesian statistics • We will review the concept of a marginal likelihood using the example of a Gaussian mixture model

  44. Theoretical Results
