Bayesian Hierarchical Clustering

Paper by K. Heller and Z. Ghahramani

ICML 2005

Presented by David Williams

Paper Discussion Group (10.07.05)

- Traditional Hierarchical Clustering
- Bayesian Hierarchical Clustering
- Algorithm
- Results
- Potential Application

- Given a set of data points, output is a tree
- Leaves are the data points
- Internal nodes are nested clusters

- Examples
- Evolutionary tree of living organisms
- Internet newsgroups
- Newswire documents

- Bottom-up agglomerative algorithm
- Begin with each data point in own cluster
- Iteratively merge two “closest” clusters
- Stop when have single cluster
- Closeness based on given distance measure (e.g., Euclidean distance between cluster means)
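The bottom-up procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the presentation: the names `centroid` and `agglomerate` and the sample points are assumptions, and closeness is taken to be the Euclidean distance between cluster means, as in the example given.

```python
from itertools import combinations
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points):
    """Merge the two clusters with the closest means until one remains.

    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    history = []

    def gap(pair):
        a, b = pair
        return math.dist(centroid([points[i] for i in a]),
                         centroid([points[i] for i in b]))

    while len(clusters) > 1:
        # Consider every pair of current clusters and pick the closest one.
        a, b = min(combinations(clusters, 2), key=gap)
        history.append((a, b, gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


# Hypothetical data: two tight pairs that are far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
history = agglomerate(points)
```

The merge history is the tree: each entry records which two clusters were joined and at what distance. Nothing in it says where to cut, which is the first limitation listed next.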

- Limitations
- No guide to choosing “correct” number of clusters, or where to prune tree
- Distance metric selection (especially for data such as images or sequences)
- How to evaluate how good result is, how to compare to other models, how to make predictions and cluster new data with existing hierarchy

- Basic idea:
- Use marginal likelihoods to decide which clusters to merge
- Asks what the probability is that all the data in a potential merge were generated from the same mixture component, and compares this to exponentially many hypotheses at lower levels of the tree
- Generative model used is a Dirichlet Process Mixture Model (DPM)

- One-pass, bottom-up method
- Initializes each data point in own cluster, and iteratively merges pairs of clusters
- Uses a statistical hypothesis test to choose which clusters to merge
- At each stage, algorithm considers merging all pairs of existing trees

- Two hypotheses compared
- 1. H1: all data in the pair of trees to be merged were generated i.i.d. from the same probabilistic model $p(x \mid \theta)$ with unknown parameters $\theta$ (e.g., a Gaussian)
- 2. H2: the data contain two or more clusters

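- Probability of the data under H1: $p(D_k \mid H_1^k) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta$
- Prior over the parameters: $p(\theta \mid \beta)$, with hyperparameters $\beta$
- $D_k = D_i \cup D_j$ is the data in the two trees $T_i$ and $T_j$ to be merged
- Integral is tractable when a conjugate prior is employed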
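As one concrete case of a tractable integral, the sketch below assumes binary data with independent features, each given a Beta(a, b) prior on its Bernoulli parameter. That model choice, the function names, and the default hyperparameters are assumptions made for illustration. Under it, the marginal likelihood of a cluster is a product of ratios of Beta functions, computed here in the log domain.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_h1(rows, a=1.0, b=1.0):
    """log p(D | H1) for binary rows, assuming independent
    Beta(a, b)-Bernoulli features (an illustrative conjugate choice)."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):
        ones = sum(column)
        # Integrating out the Bernoulli parameter gives B(a+ones, b+zeros) / B(a, b).
        total += log_beta(a + ones, b + n - ones) - log_beta(a, b)
    return total


# Hypothetical cluster of two binary data points with two features each.
log_evidence = log_marginal_h1([[1, 0], [1, 1]])
```

With a uniform Beta(1, 1) prior, the first feature contributes 1/3 and the second 1/6, so the evidence for this tiny cluster is 1/18.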

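- Probability of the data under H2: $p(D_k \mid H_2^k) = p(D_i \mid T_i)\, p(D_j \mid T_j)$
- Is a product over sub-trees
- Prior that all points belong to one cluster: $\pi_k = p(H_1^k)$
- Probability of the data in tree $T_k$: $p(D_k \mid T_k) = \pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)$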

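- From Bayes Rule, the posterior probability of the merged hypothesis: $r_k = \dfrac{\pi_k\, p(D_k \mid H_1^k)}{\pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)}$
- The pair of trees with the highest $r_k$ is merged
- Natural place to cut the final tree: where $r_k < 0.5$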
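A small sketch of how the merge score might be computed in practice. It takes the prior on the merged hypothesis, the marginal likelihood of the merged data under a single model, and the evidence of the two sub-trees, all as logarithms, and returns the evidence of the merged tree together with the posterior probability of the merged hypothesis. The function name and the example numbers are hypothetical. Working in logs is a design choice here, since the raw probabilities underflow quickly.

```python
import math


def merge_scores(log_pi, log_h1, log_tree_left, log_tree_right):
    """Return (log evidence of the merged tree, posterior of the merged hypothesis).

    log_pi         : log prior that all points form one cluster (must be < 0)
    log_h1         : log marginal likelihood of the merged data under one model
    log_tree_left  : log evidence of the left sub-tree
    log_tree_right : log evidence of the right sub-tree
    """
    log_merged = log_pi + log_h1
    # log(1 - pi) + log of the product over the two sub-trees
    log_split = math.log1p(-math.exp(log_pi)) + log_tree_left + log_tree_right
    # log-sum-exp of the two terms gives the evidence of the merged tree
    hi = max(log_merged, log_split)
    log_tree = hi + math.log(math.exp(log_merged - hi) + math.exp(log_split - hi))
    return log_tree, math.exp(log_merged - log_tree)


# Hypothetical numbers: prior 0.5, merged likelihood 0.3, sub-tree evidences 0.5 and 0.2.
log_tree, r = merge_scores(math.log(0.5), math.log(0.3), math.log(0.5), math.log(0.2))
```

Here the merged term is 0.15 and the split term is 0.05, so the posterior of the merged hypothesis is 0.75. Under the cut rule above, this merge would be kept because the value exceeds 0.5.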

- Probability of a new data point belonging to a cluster is proportional to the number of points already in that cluster
- α controls the probability of the new point creating a new cluster
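These two statements correspond to the standard predictive rule of a Dirichlet process mixture (the Chinese restaurant process). A minimal sketch, with a hypothetical function name and example sizes:

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Prior probability that a new point joins each existing cluster,
    and the probability that it starts a new one."""
    n = sum(cluster_sizes)
    existing = [size / (n + alpha) for size in cluster_sizes]
    new_cluster = alpha / (n + alpha)
    return existing, new_cluster


# Hypothetical state: two clusters holding 3 points and 1 point, with alpha = 1.
existing, new_cluster = crp_assignment_probs([3, 1], alpha=1.0)
```

In this example the larger cluster receives probability 3/5, the smaller one 1/5, and a new cluster 1/5. Raising `alpha` shifts mass toward opening new clusters.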

- DPM with α defines a prior on all partitions of the $n_k$ data points in $D_k$
- Prior on merged hypothesis, $\pi_k$, is the relative mass of all $n_k$ points belonging to one cluster versus all other partitions of those $n_k$ points, consistent with the tree structure.
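The paper computes this prior bottom-up as the tree is built: each leaf starts with mass d = α and prior 1, and each internal node k uses d_k = α Γ(n_k) + d_left · d_right and π_k = α Γ(n_k) / d_k. The sketch below follows that recursion. The dictionary layout and function names are assumptions, and `math.gamma` overflows for large clusters, so a log-domain version would be needed for realistic data sizes.

```python
import math


def leaf(alpha):
    """A single data point: one point, mass alpha, certainly one cluster."""
    return {"n": 1, "d": alpha, "pi": 1.0}


def merge(left, right, alpha):
    """Combine two sub-trees and compute the prior on the merged hypothesis."""
    n = left["n"] + right["n"]
    one_cluster_mass = alpha * math.gamma(n)       # all n points in a single cluster
    d = one_cluster_mass + left["d"] * right["d"]  # plus all tree-consistent splits
    return {"n": n, "d": d, "pi": one_cluster_mass / d}


alpha = 1.0
pair = merge(leaf(alpha), leaf(alpha), alpha)
triple = merge(pair, leaf(alpha), alpha)
```

With α = 1, the three-point tree admits three tree-consistent partitions with masses 2 (one cluster), 1 (the pair plus a singleton), and 1 (three singletons), so the prior on the merged hypothesis is 2/4 = 0.5.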

- Other quantities needed for the posterior merged hypothesis probabilities can also be written and computed with the DPM (see math/proofs in paper)

- Some sample results…

- Is a hierarchical way of organizing nested clusters, not a hierarchical generative model
- Is derived from DPMs
- Hypothesis test is not for one vs. two clusters at each stage (is one vs. many other clusterings)
- Is not iterative and does not require sampling

- Defines probabilistic model of data, can compute probability of new data point belonging to any cluster in tree.
- Model-based criterion to decide on merging clusters.
- Bayesian hypothesis testing used to decide which merges are advantageous, and to decide appropriate depth of tree.
- Algorithm can be interpreted as approximate inference method for a DPM; gives new lower bound on marginal likelihood by summing over exponentially many clusterings of the data.

- Mixed-type data problems: both continuous and discrete features
- How to perform density estimation?
- One way: partition continuous data into groups determined by the values of the discrete features.
- Problem: number of groups grows quickly. (e.g., 5 features, each of which can take 4 values, gives $4^5 = 1024$ groups)
- How to determine which groups should be combined to reduce the total number of groups?
- Possible solution: idea in this paper, except rather than leaves being individual data points, they would be groups of data points as determined by the discrete feature-values
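To make the partitioning step concrete, the sketch below groups continuous measurements by their discrete feature values and counts how many groups are possible. The record layout, function names, and sample records are hypothetical. The resulting groups are what would serve as leaves in the proposed variant.

```python
from collections import defaultdict


def group_by_discrete(records):
    """Partition continuous values by the tuple of discrete feature values.

    Each record is (discrete_features, continuous_value).
    """
    groups = defaultdict(list)
    for discrete, continuous in records:
        groups[tuple(discrete)].append(continuous)
    return dict(groups)


def max_groups(values_per_feature, n_features):
    """Upper bound on the number of groups when every feature has the same arity."""
    return values_per_feature ** n_features


# Hypothetical mixed-type records: two discrete features and one continuous value.
records = [(("a", "x"), 1.0), (("a", "x"), 1.2), (("b", "y"), 3.0)]
groups = group_by_discrete(records)
n_possible = max_groups(values_per_feature=4, n_features=5)
```

Only groups that actually occur in the data are created, but the bound of 1024 for 5 four-valued features shows why many groups would hold few or no points and why merging them is attractive.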