Bayesian Hierarchical Clustering

Bayesian Hierarchical Clustering • Paper by K. Heller and Z. Ghahramani, ICML 2005 • Presented by David Williams, Paper Discussion Group (10.07.05)

Outline • Traditional Hierarchical Clustering • Bayesian Hierarchical Clustering • Algorithm • Results • Potential Application

Hierarchical Clustering • Given a set of data points, output is a tree • Leaves are the data points • Internal nodes are nested clusters • Examples • Evolutionary tree of living organisms • Internet newsgroups • Newswire documents

Traditional Hierarchical Clustering • Bottom-up agglomerative algorithm • Begin with each data point in its own cluster • Iteratively merge the two “closest” clusters • Stop when a single cluster remains • Closeness is based on a given distance measure (e.g., Euclidean distance between cluster means) • Limitations • No guide to choosing the “correct” number of clusters, or where to prune the tree • Distance metric selection is difficult (especially for data such as images or sequences) • No principled way to evaluate how good the result is, to compare it to other models, or to make predictions and cluster new data with the existing hierarchy
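The bottom-up procedure on this slide can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance between cluster means as the closeness measure; the function names `centroid` and `agglomerate` and the sample points are hypothetical, not from the paper.

```python
import math
from itertools import combinations

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def agglomerate(points):
    """Bottom-up clustering; returns the merges as (left, right, distance)."""
    clusters = [[p] for p in points]  # each point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose means are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerate([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)])
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 3))
```

Note that the loop always runs until one cluster is left; nothing in the procedure itself says where to cut the tree, which is the first limitation listed above.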

Bayesian Hierarchical Clustering (BHC) • Basic idea: • Use marginal likelihoods to decide which clusters to merge • Ask what the probability is that all the data in a potential merge were generated from the same mixture component, and compare this to the exponentially many alternative clusterings at lower levels of the tree • Generative model used is a Dirichlet Process Mixture Model (DPM)

BHC Algorithm Overview • One-pass, bottom-up method • Initializes each data point in own cluster, and iteratively merges pairs of clusters • Uses a statistical hypothesis test to choose which clusters to merge • At each stage, algorithm considers merging all pairs of existing trees

BHC Algorithm: Merging • Two hypotheses compared • 1. All data in the pair of trees to be merged were generated i.i.d. from the same probabilistic model with unknown parameters (e.g., a Gaussian) • 2. The data contain two or more clusters

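Hypothesis H1 • Probability of the data under H1: $p(D_k \mid H_1) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta$ • Prior over the parameters: $p(\theta \mid \beta)$, with hyperparameters $\beta$ • D_k is the data in the two trees to be merged • The integral is tractable when a conjugate prior is employed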
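As one concrete case where the integral under H1 is tractable, the sketch below uses binary data with a Beta prior on the Bernoulli success rate, for which the marginal likelihood is a ratio of Beta functions. The choice of model, the hyperparameters `a` and `b`, and the function names are assumptions made for illustration only.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_bernoulli(data, a=1.0, b=1.0):
    """log p(D | H1) for binary data under a Beta(a, b) prior on the rate.

    Integrating the Bernoulli likelihood against the Beta prior gives
    B(a + ones, b + zeros) / B(a, b) in closed form.
    """
    ones = sum(data)
    zeros = len(data) - ones
    return log_beta(a + ones, b + zeros) - log_beta(a, b)

print(math.exp(log_marginal_bernoulli([1, 0])))
```

Working in log space keeps the computation stable when the data set is large and the marginal likelihood itself would underflow.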

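Hypothesis H2 • Probability of the data under H2: $p(D_k \mid H_2) = p(D_i \mid T_i)\, p(D_j \mid T_j)$ • This is a product over the two sub-trees T_i and T_j • Prior that all points belong to one cluster: $\pi_k = p(H_1)$ • Probability of the data in tree T_k: $p(D_k \mid T_k) = \pi_k\, p(D_k \mid H_1) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)$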

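Merging Clusters • From Bayes' rule, the posterior probability of the merged hypothesis: $r_k = \dfrac{\pi_k\, p(D_k \mid H_1)}{\pi_k\, p(D_k \mid H_1) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)}$ • The pair of trees with the highest posterior probability of the merged hypothesis is merged • Natural place to cut the final tree: where $r_k < 0.5$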
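The merge score can be sketched as follows, using the form given in the Heller and Ghahramani paper: the tree probability mixes the single-cluster hypothesis (weight π_k) with the product of the two sub-tree probabilities (weight 1 - π_k), and the posterior is the share contributed by the single-cluster term. The function name, argument names, and the numbers in the usage line are hypothetical.

```python
import math

def merge_posterior(log_p_h1, log_p_left, log_p_right, pi_k):
    """Return (r_k, log p(D_k | T_k)) for a candidate merge.

    log_p_h1    : log p(D_k | H1), all data from one component
    log_p_left  : log p(D_i | T_i) for the left sub-tree
    log_p_right : log p(D_j | T_j) for the right sub-tree
    pi_k        : prior on the merged hypothesis, assumed to satisfy 0 < pi_k < 1
    """
    log_merged = math.log(pi_k) + log_p_h1
    log_split = math.log1p(-pi_k) + log_p_left + log_p_right
    # log-sum-exp of the two terms gives log p(D_k | T_k)
    m = max(log_merged, log_split)
    log_p_tree = m + math.log(math.exp(log_merged - m) + math.exp(log_split - m))
    r_k = math.exp(log_merged - log_p_tree)
    return r_k, log_p_tree

r_k, log_p_tree = merge_posterior(math.log(0.2), math.log(0.5), math.log(0.2), 0.5)
print(r_k > 0.5)  # a value above 0.5 favours keeping the merge
```

The returned `log_p_tree` is what a parent node would later use as its sub-tree probability, so the quantity is computed once per merge and reused up the tree.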

Dirichlet Process Mixture Models (DPMs) • Probability of a new data point belonging to a cluster is proportional to the number of points already in that cluster • α controls the probability of the new point creating a new cluster
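The two bullets above correspond to the standard Chinese restaurant process view of a DPM: with n points already assigned, a new point joins an existing cluster of size n_c with probability n_c / (n + α) and starts a new cluster with probability α / (n + α). A minimal sketch, with a hypothetical function name and example sizes:

```python
def crp_probabilities(cluster_sizes, alpha):
    """Assignment probabilities for a new point under a DPM prior.

    Returns (probabilities of joining each existing cluster,
             probability of creating a new cluster).
    """
    n = sum(cluster_sizes)
    existing = [size / (n + alpha) for size in cluster_sizes]
    new = alpha / (n + alpha)
    return existing, new

existing, new = crp_probabilities([3, 1], alpha=1.0)
print(existing, new)
```

Larger values of α shift mass toward the new-cluster option, so α acts as a knob on how many clusters the prior expects.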

Merged Hypothesis Prior • A DPM with concentration parameter α defines a prior on all partitions of the n_k data points in D_k • The prior on the merged hypothesis, π_k, is the relative mass of all n_k points belonging to one cluster versus all other partitions of those n_k points that are consistent with the tree structure
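The paper computes this prior bottom-up while the tree is built: each leaf starts with d = α and π = 1, and each internal node k sets d_k = α Γ(n_k) + d_left · d_right and π_k = α Γ(n_k) / d_k. The sketch below follows that recursion in log space; the dictionary layout and the function names `leaf` and `merge` are assumptions for illustration.

```python
import math

def leaf(alpha):
    """A single data point: d = alpha, pi = 1."""
    return {"n": 1, "log_d": math.log(alpha), "pi": 1.0}

def merge(left, right, alpha):
    """Combine two sub-trees and compute pi_k for the new node."""
    n = left["n"] + right["n"]
    log_one = math.log(alpha) + math.lgamma(n)   # log(alpha * Gamma(n_k))
    log_rest = left["log_d"] + right["log_d"]    # log(d_left * d_right)
    m = max(log_one, log_rest)
    log_d = m + math.log(math.exp(log_one - m) + math.exp(log_rest - m))
    return {"n": n, "log_d": log_d, "pi": math.exp(log_one - log_d)}

pair = merge(leaf(1.0), leaf(1.0), alpha=1.0)
triple = merge(pair, leaf(1.0), alpha=1.0)
print(pair["pi"], triple["pi"])
```

For two leaves this gives π = 1 / (1 + α), which matches the DPM probability that the second point joins the first point's cluster.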

DPM • Other quantities needed for the posterior merged hypothesis probabilities can also be written and computed with the DPM (see math/proofs in paper)

Results • Some sample results…

Unique Aspects of Algorithm • Is a hierarchical way of organizing nested clusters, not a hierarchical generative model • Is derived from DPMs • Hypothesis test is not for one vs. two clusters at each stage (is one vs. many other clusterings) • Is not iterative and does not require sampling

Summary • Defines probabilistic model of data, can compute probability of new data point belonging to any cluster in tree. • Model-based criterion to decide on merging clusters. • Bayesian hypothesis testing used to decide which merges are advantageous, and to decide appropriate depth of tree. • Algorithm can be interpreted as approximate inference method for a DPM; gives new lower bound on marginal likelihood by summing over exponentially many clusterings of the data.

Why This Paper? • Mixed-type data problems: both continuous and discrete features • How to perform density estimation? • One way: partition continuous data into groups determined by the values of the discrete features • Problem: number of groups grows quickly (e.g., 5 features, each of which can take 4 values, gives 4^5 = 1024 groups) • How to determine which groups should be combined to reduce the total number of groups? • Possible solution: the idea in this paper, except that rather than leaves being individual data points, they would be groups of data points as determined by the discrete feature values
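The grouping step proposed here can be sketched as below: rows are bucketed by their tuple of discrete feature values, and each bucket would then serve as one leaf of the hierarchy. The row layout, the function name `group_by_discrete`, and the sample rows are hypothetical.

```python
from collections import defaultdict

# Number of possible groups for 5 discrete features with 4 values each.
n_features, n_values = 5, 4
print(n_values ** n_features)  # 1024

def group_by_discrete(rows):
    """Bucket continuous observations by their discrete feature values.

    rows: iterable of (discrete_values, continuous_value) pairs.
    """
    groups = defaultdict(list)
    for discrete, continuous in rows:
        groups[tuple(discrete)].append(continuous)
    return dict(groups)

rows = [((0, 1), 2.5), ((0, 1), 2.7), ((3, 2), 9.1)]
leaves = group_by_discrete(rows)
print(leaves)
```

Only groups that actually occur in the data become leaves, so the number of starting clusters is bounded by the number of rows even when the count of possible groups is large.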
