Bayesian hierarchical clustering
Bayesian Hierarchical Clustering. Paper by K. Heller and Z. Ghahramani, ICML 2005. Presented by David Williams, Paper Discussion Group (10.07.05).


Bayesian Hierarchical Clustering

Paper by K. Heller and Z. Ghahramani

ICML 2005

Presented by David Williams

Paper Discussion Group (10.07.05)

Outline


  • Traditional Hierarchical Clustering

  • Bayesian Hierarchical Clustering

    • Algorithm

    • Results

  • Potential Application

Hierarchical Clustering

  • Given a set of data points, output is a tree

    • Leaves are the data points

    • Internal nodes are nested clusters

  • Examples

    • Evolutionary tree of living organisms

    • Internet newsgroups

    • Newswire documents

Traditional Hierarchical Clustering

  • Bottom-up agglomerative algorithm

    • Begin with each data point in own cluster

    • Iteratively merge two “closest” clusters

    • Stop when a single cluster remains

    • Closeness based on given distance measure (e.g., Euclidean distance between cluster means)

  • Limitations

    • No guide to choosing “correct” number of clusters, or where to prune tree

    • Distance metric selection (especially for data such as images or sequences)

    • How to evaluate how good result is, how to compare to other models, how to make predictions and cluster new data with existing hierarchy
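
The bottom-up procedure above can be sketched in a few lines. This is only an illustrative sketch, assuming Euclidean distance between cluster means as the closeness measure; the names `agglomerative`, `centroid`, and `n_clusters` are hypothetical and not from the slides. The fact that the caller has to supply `n_clusters` is exactly the first limitation listed above.

```python
import math

def centroid(points, members):
    """Mean of the points whose indices are listed in `members`."""
    dims = len(points[0])
    return [sum(points[i][d] for i in members) / len(members) for d in range(dims)]

def agglomerative(points, n_clusters=1):
    """Naive bottom-up clustering; returns (merge history, final clusters)."""
    clusters = [[i] for i in range(len(points))]  # each point starts in its own cluster
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose means are closest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = math.dist(centroid(points, clusters[a]),
                                 centroid(points, clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges, clusters
```

The stopping point and the distance function are both arbitrary inputs here, which is what the Bayesian version tries to replace with a model-based criterion.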

Bayesian Hierarchical Clustering (BHC)

  • Basic idea:

    • Use marginal likelihoods to decide which clusters to merge

    • Asks: what is the probability that all the data in a potential merge were generated from the same mixture component? This hypothesis is compared against exponentially many alternative clusterings at lower levels of the tree

    • Generative model used is a Dirichlet Process Mixture Model (DPM)

BHC Algorithm Overview

  • One-pass, bottom-up method

  • Initializes each data point in own cluster, and iteratively merges pairs of clusters

  • Uses a statistical hypothesis test to choose which clusters to merge

  • At each stage, algorithm considers merging all pairs of existing trees

BHC Algorithm: Merging

  • Two hypotheses compared

    • 1. (H1) All data in the pair of trees to be merged were generated i.i.d. from the same probabilistic model with unknown parameters (e.g., a Gaussian)

    • 2. (H2) The data contain two or more clusters

Hypothesis H1

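  • Probability of the data under H1: $p(D_k \mid H_1^k) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta = \int \Big[ \prod_{x^{(i)} \in D_k} p(x^{(i)} \mid \theta) \Big]\, p(\theta \mid \beta)\, d\theta$

  • Prior over the parameters: $p(\theta \mid \beta)$, with hyperparameters $\beta$

  • $D_k$ is the data in the two trees to be merged

  • Integral is tractable when a conjugate prior is employed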
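
To make the tractability point concrete, here is a minimal sketch of the H1 marginal likelihood for one conjugate pair. It assumes binary data with an independent Bernoulli likelihood per dimension and a Beta(a, b) prior on each Bernoulli parameter; that model choice, and the names `log_beta` and `log_marginal_bernoulli`, are illustrative assumptions rather than something stated on the slide. Under this assumption the integral reduces to a ratio of Beta functions per dimension.

```python
import math

def log_beta(a, b):
    """log of the Beta function B(a, b), via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_bernoulli(data, a=1.0, b=1.0):
    """log p(D | H1) for a list of equal-length binary vectors.

    Assumes independent Beta(a, b) priors per dimension, so the integral
    over the Bernoulli parameters has a closed form.
    """
    n = len(data)
    total = 0.0
    for d in range(len(data[0])):
        ones = sum(row[d] for row in data)  # number of 1s in dimension d
        total += log_beta(a + ones, b + n - ones) - log_beta(a, b)
    return total
```

With a uniform Beta(1, 1) prior, two identical one-dimensional points get marginal probability 1/3, while two differing points get 1/6, so agreeing data score higher under the single-model hypothesis.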

Hypothesis H2

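  • Probability of the data under H2: $p(D_k \mid H_2^k) = p(D_i \mid T_i)\, p(D_j \mid T_j)$

  • Is a product over the sub-trees $T_i$ and $T_j$

  • Prior that all points belong to one cluster: $\pi_k = p(H_1^k)$

  • Probability of the data in tree $T_k$: $p(D_k \mid T_k) = \pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)$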

Merging Clusters

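  • From Bayes' rule, the posterior probability of the merged hypothesis: $r_k = p(H_1^k \mid D_k) = \dfrac{\pi_k\, p(D_k \mid H_1^k)}{\pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)}$

  • The pair of trees with the highest merge probability $r_k$ is merged

  • Natural place to cut the final tree: where $r_k < 0.5$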
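
The posterior merge probability is a ratio of products of very small numbers, so in practice it would usually be evaluated in log space. The sketch below assumes the four inputs (the log prior of the merged hypothesis, the log marginal likelihood under H1, and the log probabilities of the data under the two sub-trees) have already been computed elsewhere; the function name `merge_posterior` and its argument names are hypothetical.

```python
import math

def merge_posterior(log_pi, log_p_h1, log_p_left, log_p_right):
    """Return (r_k, log p(D_k | T_k)) for one candidate merge.

    Assumes 0 < pi_k < 1, which holds for any internal node.
    """
    log_merged = log_pi + log_p_h1                   # log of pi_k * p(D_k | H1)
    log_one_minus_pi = math.log1p(-math.exp(log_pi)) # log of (1 - pi_k)
    log_split = log_one_minus_pi + log_p_left + log_p_right
    # log-sum-exp of the two terms gives log p(D_k | T_k)
    hi = max(log_merged, log_split)
    log_tree = hi + math.log(math.exp(log_merged - hi) + math.exp(log_split - hi))
    return math.exp(log_merged - log_tree), log_tree
```

For example, with a merge prior of 0.5, an H1 likelihood of 0.2, and sub-tree probabilities of 0.1 and 0.5, the merged term is 0.1 and the split term is 0.025, so the posterior merge probability is about 0.8 and the merge would be kept when cutting at 0.5.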

Dirichlet Process Mixture Models (DPMs)

  • Probability of a new data point belonging to a cluster is proportional to the number of points already in that cluster

  • α controls the probability of the new point creating a new cluster
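
These two bullets correspond to the standard predictive rule of a Dirichlet process (the Chinese restaurant process): with n points already assigned, a new point joins an existing cluster of size n_j with probability n_j / (n + α) and starts a new cluster with probability α / (n + α). A small sketch, with the hypothetical function name `crp_probabilities`:

```python
def crp_probabilities(cluster_sizes, alpha):
    """Predictive assignment probabilities for one new point.

    Returns (probabilities of joining each existing cluster,
             probability of opening a new cluster).
    """
    n = sum(cluster_sizes)
    denom = n + alpha
    existing = [size / denom for size in cluster_sizes]
    new = alpha / denom
    return existing, new
```

With cluster sizes [3, 1] and α = 1, the new point joins the larger cluster with probability 0.6, the smaller with 0.2, and opens a new cluster with 0.2; raising α shifts mass toward new clusters.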

Merged Hypothesis Prior

  • DPM with α defines a prior on all partitions of the n_k data points in D_k

  • Prior on merged hypothesis, π_k, is the relative mass of all n_k points belonging to one cluster versus all other partitions of those n_k points, consistent with the tree structure.
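
In the paper this prior is computed bottom-up alongside the tree: each leaf starts with d = α and π = 1, and each internal node k uses d_k = α Γ(n_k) + d_left · d_right and π_k = α Γ(n_k) / d_k. The sketch below evaluates one step of that recursion in log space, since Γ(n_k) overflows quickly; the function name `merge_prior` and its argument names are hypothetical.

```python
import math

def merge_prior(alpha, n_left, log_d_left, n_right, log_d_right):
    """One step of the bottom-up recursion for the merge prior.

    Returns (pi_k, log d_k, n_k). Leaves should be passed with
    n = 1 and log d = log(alpha).
    """
    n_k = n_left + n_right
    log_one_cluster = math.log(alpha) + math.lgamma(n_k)  # log(alpha * Gamma(n_k))
    log_children = log_d_left + log_d_right                # log(d_left * d_right)
    hi = max(log_one_cluster, log_children)
    log_d_k = hi + math.log(math.exp(log_one_cluster - hi) +
                            math.exp(log_children - hi))
    pi_k = math.exp(log_one_cluster - log_d_k)
    return pi_k, log_d_k, n_k
```

As a sanity check, merging two leaves with α = 2 gives d_k = 2 + 4 = 6 and π_k = 1/3, which matches the DPM probability 1 / (1 + α) that two points share a cluster.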


  • Other quantities needed for the posterior merged hypothesis probabilities can also be written and computed with the DPM (see math/proofs in paper)


  • Some sample results…

Unique Aspects of Algorithm

  • Is a hierarchical way of organizing nested clusters, not a hierarchical generative model

  • Is derived from DPMs

  • Hypothesis test at each stage is not one vs. two clusters; it is one cluster vs. many other clusterings

  • Is not iterative and does not require sampling


  • Defines probabilistic model of data, can compute probability of new data point belonging to any cluster in tree.

  • Model-based criterion to decide on merging clusters.

  • Bayesian hypothesis testing used to decide which merges are advantageous, and to decide appropriate depth of tree.

  • Algorithm can be interpreted as approximate inference method for a DPM; gives new lower bound on marginal likelihood by summing over exponentially many clusterings of the data.
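
As a rough end-to-end illustration of the one-pass procedure summarized above, the following sketch builds a tree by repeatedly merging the pair with the highest posterior merge probability r_k, then cuts it wherever r_k < 0.5. It is a naive sketch under stated assumptions, not the authors' implementation: it assumes binary data with a Beta-Bernoulli model for H1, recomputes every candidate merge at each step instead of caching, and uses hypothetical names (`log_h1`, `leaf`, `merge`, `bhc`, `cut`).

```python
import math
from itertools import combinations

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_h1(data, a=1.0, b=1.0):
    """log p(D | H1), assuming Bernoulli dimensions with Beta(a, b) priors."""
    n = len(data)
    return sum(log_beta(a + ones, b + n - ones) - log_beta(a, b)
               for ones in (sum(col) for col in zip(*data)))

def logaddexp(x, y):
    hi = max(x, y)
    return hi + math.log(math.exp(x - hi) + math.exp(y - hi))

def leaf(x, alpha):
    # Leaves start with d = alpha and are treated as a single cluster (r = 1).
    return {"data": [x], "log_d": math.log(alpha), "log_p": log_h1([x]),
            "children": None, "r": 1.0}

def merge(ti, tj, alpha):
    data = ti["data"] + tj["data"]
    log_one = math.log(alpha) + math.lgamma(len(data))      # log(alpha * Gamma(n_k))
    log_d = logaddexp(log_one, ti["log_d"] + tj["log_d"])   # log d_k
    log_merged = (log_one - log_d) + log_h1(data)           # log(pi_k * p(D_k | H1))
    # log((1 - pi_k) * p(D_i | T_i) * p(D_j | T_j)), using 1 - pi_k = d_i d_j / d_k
    log_split = ti["log_d"] + tj["log_d"] - log_d + ti["log_p"] + tj["log_p"]
    log_p = logaddexp(log_merged, log_split)                # log p(D_k | T_k)
    return {"data": data, "log_d": log_d, "log_p": log_p,
            "children": (ti, tj), "r": math.exp(log_merged - log_p)}

def bhc(points, alpha=1.0):
    trees = [leaf(x, alpha) for x in points]
    while len(trees) > 1:
        # Consider every pair of existing trees; keep the most probable merge.
        i, j, best = max(((i, j, merge(trees[i], trees[j], alpha))
                          for i, j in combinations(range(len(trees)), 2)),
                         key=lambda cand: cand[2]["r"])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [best]
    return trees[0]

def cut(tree):
    """Flat clustering: split a node only where r_k < 0.5."""
    if tree["children"] is None or tree["r"] >= 0.5:
        return [tree["data"]]
    return cut(tree["children"][0]) + cut(tree["children"][1])
```

Calling `cut(bhc(points))` on a list of equal-length binary vectors returns a list of clusters, each a list of the original vectors. No number of clusters is supplied; the cut depth comes from the hypothesis test itself.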

Why This Paper?

  • Mixed-type data problems: both continuous and discrete features

  • How to perform density estimation?

    • One way: partition continuous data into groups determined by the values of the discrete features.

    • Problem: number of groups grows quickly (e.g., 5 features, each of which can take 4 values, gives 4^5 = 1024 groups)

    • How to determine which groups should be combined to reduce the total number of groups?

    • Possible solution: idea in this paper, except rather than leaves being individual data points, they would be groups of data points as determined by the discrete feature-values
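
The group count in the example above is just the number of distinct discrete feature combinations. A short sketch that enumerates them, with the hypothetical helper name `discrete_groups`:

```python
from itertools import product

def discrete_groups(n_features, n_values):
    """All combinations of discrete feature values; each one defines a group."""
    return list(product(range(n_values), repeat=n_features))

groups = discrete_groups(n_features=5, n_values=4)
print(len(groups))  # 1024, i.e. 4**5
```

Each tuple in `groups` would index one subset of the continuous data, and those subsets would play the role of the leaves in the proposed variant.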