Bayesian Hierarchical Clustering



Bayesian Hierarchical Clustering

Paper by K. Heller and Z. Ghahramani

ICML 2005

Presented by David Williams

Paper Discussion Group (10.07.05)



Outline

  • Traditional Hierarchical Clustering

  • Bayesian Hierarchical Clustering

    • Algorithm

    • Results

  • Potential Application



Hierarchical Clustering

  • Given a set of data points, the output is a tree

    • Leaves are the data points

    • Internal nodes are nested clusters

  • Examples

    • Evolutionary tree of living organisms

    • Internet newsgroups

    • Newswire documents



Traditional Hierarchical Clustering

  • Bottom-up agglomerative algorithm (sketched in code at the end of this slide)

    • Begin with each data point in its own cluster

    • Iteratively merge the two “closest” clusters

    • Stop when a single cluster remains

    • Closeness is based on a given distance measure (e.g., Euclidean distance between cluster means)

  • Limitations

    • No guide to choosing the “correct” number of clusters, or where to prune the tree

    • A distance metric must be selected (especially difficult for data such as images or sequences)

    • No principled way to evaluate how good the result is, to compare it to other models, or to make predictions and cluster new data with an existing hierarchy
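For concreteness, here is a minimal sketch (not from the paper) of the bottom-up agglomerative procedure above, using Euclidean distance between cluster means as the closeness measure; the function name and merge-history format are illustrative choices.

    import numpy as np

    def agglomerative_cluster(points):
        """Bottom-up agglomerative clustering: repeatedly merge the two
        clusters whose means are closest in Euclidean distance."""
        # Each data point starts in its own cluster (indices into points).
        clusters = {i: [i] for i in range(len(points))}
        merges = []  # record of the tree: (cluster, cluster, distance)
        while len(clusters) > 1:
            ids = list(clusters)
            best_pair, best_dist = None, np.inf
            # Find the pair of clusters with the closest means.
            for a in range(len(ids)):
                for b in range(a + 1, len(ids)):
                    i, j = ids[a], ids[b]
                    d = np.linalg.norm(points[clusters[i]].mean(axis=0)
                                       - points[clusters[j]].mean(axis=0))
                    if d < best_dist:
                        best_pair, best_dist = (i, j), d
            i, j = best_pair
            merges.append((list(clusters[i]), list(clusters[j]), best_dist))
            clusters[i].extend(clusters.pop(j))
        return merges

With points as an (n, d) NumPy array, the returned merge list encodes the full tree. Note that nothing tells us where to stop merging or cut the tree, which is exactly the limitation BHC addresses.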



Bayesian Hierarchical Clustering (BHC)

  • Basic idea:

    • Use marginal likelihoods to decide which clusters to merge

    • Asks what the probability is that all the data in a potential merge were generated from the same mixture component, and compares this to the exponentially many alternative hypotheses at lower levels of the tree

    • The generative model used is a Dirichlet Process Mixture Model (DPM)



BHC Algorithm Overview

  • One-pass, bottom-up method

  • Initializes each data point in its own cluster, and iteratively merges pairs of clusters

  • Uses a statistical hypothesis test to choose which clusters to merge

  • At each stage, the algorithm considers merging all pairs of existing trees (see the sketch below)
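A minimal sketch of this outer loop, assuming a merge_score helper (sketched after the “Merging Clusters” slide below) that returns the posterior merge probability r_k and the merged tree; the dictionary representation of a tree is an illustrative choice, not the paper's:

    import numpy as np

    def bhc(points, log_marglik, alpha):
        """One-pass bottom-up BHC: start with one leaf per data point and
        repeatedly merge the pair of trees with the highest posterior
        merge probability r_k."""
        # Leaf initialization from the paper: d_i = alpha, and for a
        # single point p(D_i | T_i) = p(D_i | H1).
        trees = [{'data': [x], 'log_d': np.log(alpha),
                  'log_p': log_marglik([x])} for x in points]
        merges = []
        while len(trees) > 1:
            # Score merging every pair of existing trees.
            candidates = [(i, j) + merge_score(trees[i], trees[j],
                                               log_marglik, alpha)
                          for i in range(len(trees))
                          for j in range(i + 1, len(trees))]
            i, j, log_r_k, merged = max(candidates, key=lambda c: c[2])
            merges.append(log_r_k)  # r_k < 0.5 marks natural cut points
            trees = [t for k, t in enumerate(trees) if k not in (i, j)]
            trees.append(merged)
        return trees[0], merges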



BHC Algorithm: Merging

  • Two hypotheses are compared:

    • 1. All the data in the pair of trees to be merged was generated i.i.d. from the same probabilistic model with unknown parameters (e.g., a Gaussian)

    • 2. The data has two or more clusters in it



Hypothesis H1

  • Probability of the data under H1: p(D_k | H1) = ∫ p(D_k | θ) p(θ | β) dθ

  • Prior over the parameters: p(θ | β), with hyperparameters β

  • D_k is the data in the two trees to be merged

  • The integral is tractable when a conjugate prior is employed (see the sketch below for a binary-data example)
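As an illustration of the tractable integral, a minimal sketch for binary data with a conjugate Beta prior on a single Bernoulli parameter (the Beta hyperparameters play the role of β above; the function name is illustrative):

    from math import lgamma

    def log_marglik_bernoulli(data_k, a=1.0, b=1.0):
        """Closed-form log p(D_k | H1) for 0/1 data: the Bernoulli
        likelihood integrated against a Beta(a, b) prior on theta."""
        def log_beta(x, y):
            return lgamma(x) + lgamma(y) - lgamma(x + y)
        n = len(data_k)       # number of observations in D_k
        h = sum(data_k)       # number of ones
        return log_beta(a + h, b + n - h) - log_beta(a, b)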



Hypothesis H2

  • Probability of the data under H2: p(D_k | H2) = p(D_i | T_i) p(D_j | T_j)

  • Is a product over the two sub-trees being merged

  • Prior that all points belong to one cluster: π_k = p(H1)

  • Probability of the data in tree T_k: p(D_k | T_k) = π_k p(D_k | H1) + (1 − π_k) p(D_i | T_i) p(D_j | T_j)



Merging Clusters

  • From Bayes’ rule, the posterior probability of the merged hypothesis: r_k = π_k p(D_k | H1) / p(D_k | T_k)

  • The pair of trees with the highest merge probability r_k is merged

  • Natural place to cut the final tree: where r_k < 0.5
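Pulling the pieces together, a minimal sketch of the per-candidate computation: π_k comes from the DPM recursion d_k = α Γ(n_k) + d_i d_j (given on the “Merged Hypothesis Prior” slide below), and everything is kept in log space for numerical stability. The tree-as-dict representation matches the bhc loop sketched earlier and is an illustrative choice:

    import numpy as np
    from math import lgamma

    def merge_score(tree_i, tree_j, log_marglik, alpha):
        """Score a candidate merge T_k of two subtrees: returns
        (log r_k, merged tree), where r_k = pi_k p(D_k|H1) / p(D_k|T_k)."""
        data_k = tree_i['data'] + tree_j['data']
        n_k = len(data_k)
        # d_k = alpha * Gamma(n_k) + d_i * d_j, in log space.
        log_d_k = np.logaddexp(np.log(alpha) + lgamma(n_k),
                               tree_i['log_d'] + tree_j['log_d'])
        # pi_k = alpha * Gamma(n_k) / d_k
        log_pi_k = np.log(alpha) + lgamma(n_k) - log_d_k
        log_p_h1 = log_marglik(data_k)                 # p(D_k | H1)
        log_p_h2 = tree_i['log_p'] + tree_j['log_p']   # p(D_i|T_i) p(D_j|T_j)
        # p(D_k|T_k) = pi_k p(D_k|H1) + (1 - pi_k) p(D_i|T_i) p(D_j|T_j)
        log_p_tk = np.logaddexp(log_pi_k + log_p_h1,
                                np.log1p(-np.exp(log_pi_k)) + log_p_h2)
        merged = {'data': data_k, 'log_d': log_d_k, 'log_p': log_p_tk}
        return log_pi_k + log_p_h1 - log_p_tk, merged

Merges whose log r_k falls below log 0.5 flag the natural cut points mentioned above.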



Dirichlet Process Mixture Models (DPMs)

  • Probability of a new data point belonging to a cluster is proportional to the number of points already in that cluster

  • α controls the probability of the new point creating a new cluster
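This is the Chinese-restaurant-process view of the DPM prior; a minimal sketch of the predictive assignment probabilities (function name is an illustrative choice):

    import numpy as np

    def crp_assignment_probs(cluster_sizes, alpha):
        """Predictive probabilities for the next data point: an existing
        cluster is chosen in proportion to its current size, and a brand
        new cluster in proportion to alpha."""
        weights = np.array(list(cluster_sizes) + [alpha], dtype=float)
        return weights / weights.sum()  # last entry = new-cluster prob.

For example, crp_assignment_probs([3, 1], alpha=1.0) gives [0.6, 0.2, 0.2]: the new point joins the size-3 cluster with probability 3/5, the size-1 cluster with probability 1/5, and starts a new cluster with probability 1/5.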



Merged Hypothesis Prior

  • A DPM with concentration parameter α defines a prior on all partitions of the n_k data points in D_k

  • The prior on the merged hypothesis, π_k, is the relative mass of all n_k points belonging to one cluster versus all other partitions of those n_k points consistent with the tree structure: π_k = α Γ(n_k) / d_k, where d_k = α Γ(n_k) + d_i d_j and each leaf has d_i = α



DPM

  • Other quantities needed for the posterior merge probabilities can also be derived and computed with the DPM (see the math and proofs in the paper)



Results

  • Some sample results…



Unique Aspects of Algorithm

  • Is a hierarchical way of organizing nested clusters, not a hierarchical generative model

  • Is derived from DPMs

  • The hypothesis test is not one vs. two clusters at each stage; it compares one cluster vs. many other clusterings

  • Is not iterative and does not require sampling



Summary

  • Defines a probabilistic model of the data and can compute the probability of a new data point belonging to any cluster in the tree.

  • Provides a model-based criterion for deciding which clusters to merge.

  • Uses Bayesian hypothesis testing to decide which merges are advantageous, and to decide the appropriate depth of the tree.

  • The algorithm can be interpreted as an approximate inference method for a DPM, and gives a new lower bound on the marginal likelihood by summing over exponentially many clusterings of the data.



Why This Paper?

  • Mixed-type data problems: both continuous and discrete features

  • How to perform density estimation?

    • One way: partition the continuous data into groups determined by the values of the discrete features.

    • Problem: the number of groups grows quickly (e.g., 5 features, each of which can take 4 values, gives 4^5 = 1024 groups).

    • How do we determine which groups should be combined to reduce the total number of groups?

    • Possible solution: the idea in this paper, except that rather than the leaves being individual data points, they would be groups of data points as determined by the discrete feature values (the initial grouping step is sketched below).
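A minimal sketch of that initial grouping step (the record format and names are illustrative); each resulting group, keyed by its discrete feature values, would then become a leaf for BHC-style merging:

    from collections import defaultdict

    def group_by_discrete(records, discrete_keys):
        """Partition mixed-type records into groups keyed by the values
        of their discrete features; each group's continuous features can
        then be density-estimated, and similar groups merged."""
        groups = defaultdict(list)
        for rec in records:
            key = tuple(rec[k] for k in discrete_keys)
            groups[key].append(rec)
        return dict(groups)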

