Clustering
Rong Jin


What is Clustering?

  • Identify the underlying structure of the given data points

  • Document clustering: group documents on the same topic into the same cluster

[Figure: scatter plot of data points with axes labeled $$$ and age]


Improve IR by Document Clustering

  • Cluster-based retrieval

    • Cluster docs in collection a priori

    • Only compute the relevance scores for docs in the cluster closest to the query

    • Improve retrieval efficiency by searching only a small portion of the document collection
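
The cluster-based retrieval steps above can be sketched as follows. This is a minimal illustration, not the method used in the slides: it assumes documents and queries are sparse term-weight dictionaries, that cosine similarity is the relevance score, and that each cluster is summarized by its centroid. The names `cosine`, `centroid`, and `cluster_based_search` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise average of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cluster_based_search(query, docs, clusters, top_k=3):
    """docs: {doc_id: vector}; clusters: lists of doc_ids, built a priori."""
    # In a real system the centroids would be precomputed offline (assumption).
    centroids = [centroid([docs[d] for d in members]) for members in clusters]
    # Pick the cluster whose centroid is closest to the query.
    best = max(range(len(clusters)), key=lambda j: cosine(query, centroids[j]))
    # Score only the documents inside that cluster.
    scored = sorted(((cosine(query, docs[d]), d) for d in clusters[best]),
                    reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


docs = {
    "d1": {"cat": 1.0, "pet": 1.0},
    "d2": {"cat": 1.0, "dog": 1.0},
    "d3": {"stock": 1.0, "market": 1.0},
    "d4": {"stock": 1.0, "bank": 1.0},
}
clusters = [["d1", "d2"], ["d3", "d4"]]
print(cluster_based_search({"stock": 1.0, "market": 1.0}, docs, clusters))
```

Only the two documents in the matching cluster are scored, which is where the efficiency gain comes from. The trade-off is that relevant documents in other clusters are never seen.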


Application (I): Search Result Clustering


Application (II): Navigation


Application (III): Google News


Application (IV): Visualization

Islands of Music (Pampalk et al., KDD ’03)


How to Find Good Clusters?

  • Measure the compactness by the sum of squared distances within clusters

  • Membership indicators:

    mi,j = 1 if xi is assigned to Cj, and zero otherwise.

  • Find good clusters by minimizing the cluster compactness

    • Cluster centers C1 and C2

    • Membership mi,j

[Figure: data points x1 to x7 grouped around cluster centers C1 and C2]

How to Efficiently Cluster Data?

K-means algorithm: alternate between two updates

  • Update mi,j: assign xi to the closest Cj

  • Update Cj as the average of the xi assigned to Cj
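
Written out, the two updates might look as follows. The notation is an assumption (squared Euclidean distance, n data points, index l running over the cluster centers), since the slides give the updates only in words:

```latex
m_{i,j} =
\begin{cases}
1 & \text{if } j = \arg\min_{l} \lVert x_i - C_l \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad\qquad
C_j = \frac{\sum_{i=1}^{n} m_{i,j} \, x_i}{\sum_{i=1}^{n} m_{i,j}}
```

With the centers held fixed, the first update picks the best membership for each point. With the memberships held fixed, the mean is the point that minimizes the sum of squared distances to the members of a cluster. Neither step can therefore increase the within-cluster sum of squared distances.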


Example of k-means

  • Start with random cluster centers C1 and C2

  • Identify the points that are closer to C1 than to C2

  • Update C1

  • Identify the points that are closer to C2 than to C1

  • Identify the points that are closer to C2 than to C1, and the points that are closer to C1 than to C2

  • Update C1 and C2

[Figure: data points x1 to x7 with cluster centers C1 and C2, redrawn at each step]


K-means for Clustering

  • K-means

    • Start with a random guess of cluster centers

    • Determine the membership of each data point

    • Adjust the cluster centers
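
The three steps can be put together in a short pure-Python sketch. Several details are assumptions rather than part of the slides: points are tuples of numbers, the random guess picks k of the data points as initial centers, distance is squared Euclidean, the loop stops when memberships no longer change, and an empty cluster keeps its previous center.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: random guess of the cluster centers (k distinct data points).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: determine the membership of each data point.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # memberships are stable, so the centers would not move
        assignment = new_assignment
        # Step 3: adjust each center to the average of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = mean(members)
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

The result depends on the initial guess, so in practice the procedure is often run from several random starts and the most compact solution kept.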


K-means

  • Ask user how many clusters they’d like. (e.g. k=5)

  • Randomly guess k cluster Center locations

  • Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

  • Each Center finds the centroid of the points it owns


K-means

Any computational problem?

  • Need to go through each data point at each iteration of k-means


Improve K-means

  • Group nearby data points by region

    • KD tree

    • SR tree

  • Try to update the membership of all the data points in the same region at once


Improved K-means

  • Find the closest center for each rectangle

  • Assign all the points within a rectangle to one cluster
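
The slides do not say how the closest center for a rectangle is decided. One test used in kd-tree-accelerated k-means is sketched below as an assumption: a center owns an axis-aligned rectangle if, at the corner of the rectangle lying furthest toward each rival center, it is still at least as close as that rival. The names `dominates` and `owner_of_rectangle` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dominates(c_best, c_other, lo, hi):
    """True if every point of the box [lo, hi] is at least as close to c_best."""
    # Corner of the box that lies furthest in the direction of c_other.
    corner = tuple(
        h if o > b else l for b, o, l, h in zip(c_best, c_other, lo, hi)
    )
    return sq_dist(corner, c_best) <= sq_dist(corner, c_other)


def owner_of_rectangle(centers, lo, hi):
    """Index of the center closest to all points in the box, or None."""
    midpoint = tuple((l + h) / 2 for l, h in zip(lo, hi))
    # Only the center closest to the midpoint can own the whole box.
    best = min(range(len(centers)), key=lambda j: sq_dist(midpoint, centers[j]))
    for j, c in enumerate(centers):
        if j != best and not dominates(centers[best], c, lo, hi):
            return None  # the box straddles a boundary and must be split
    return best


centers = [(0, 0), (10, 0)]
print(owner_of_rectangle(centers, (0, 0), (2, 2)))   # box near the first center
print(owner_of_rectangle(centers, (4, 0), (6, 2)))   # box across the boundary
```

When `owner_of_rectangle` returns an index, every point stored in that tree node can be assigned to one cluster without computing per-point distances. When it returns `None`, the node's children would be examined instead.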


Document Clustering


A Mixture Model for Document Clustering

  • Assume that data are generated from a mixture of multinomial distributions

  • Estimate the mixture distribution from the observed documents
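
As a sketch of what this model might look like, assume K clusters with mixing weights π_j, a word distribution θ_j over the vocabulary V for each cluster, and n(w, d_i) the count of word w in document d_i. These symbols are assumed notation, not taken from the slides. The probability of a document is then, up to a multinomial coefficient that does not depend on the parameters:

```latex
p(d_i) \;\propto\; \sum_{j=1}^{K} \pi_j \prod_{w \in V} \theta_{j,w}^{\,n(w, d_i)},
\qquad \sum_{j=1}^{K} \pi_j = 1, \quad \sum_{w \in V} \theta_{j,w} = 1 \ \text{for every } j .
```

Estimating the mixture means choosing π and θ to maximize the product of p(d_i) over the observed documents, which is typically done with the EM algorithm.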


Gaussian Mixture Example: Start

Measure the probability for every data point to be associated with each cluster
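
For a Gaussian mixture, this probability is usually computed with Bayes' rule. The sketch below assumes one-dimensional data and known mixture weights, means, and variances; `gaussian_pdf` and `responsibilities` are hypothetical names. It does not guard against numerical underflow for points far from every component.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """For each point, the probability of belonging to each cluster."""
    result = []
    for x in data:
        # Joint probability of the point and each component.
        joint = [
            w * gaussian_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        total = sum(joint)
        # Normalize so the probabilities for one point sum to one.
        result.append([p / total for p in joint])
    return result


resp = responsibilities(
    data=[0.0, 5.0, 2.5],
    weights=[0.5, 0.5],
    means=[0.0, 5.0],
    variances=[1.0, 1.0],
)
for row in resp:
    print([round(p, 4) for p in row])
```

Unlike the hard memberships of k-means, each point here is shared between clusters in proportion to these probabilities. A point halfway between two equal components is split evenly.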


[Figures: the Gaussian mixture fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]


Hierarchical Doc Clustering

  • Goal is to create a hierarchy of topics

  • Challenge: create this hierarchy automatically

  • Approaches: top-down or bottom-up


Hierarchical Agglomerative Clustering (HAC)

  • Given a similarity measure for determining the similarity between two clusters

  • Start with each document in a separate cluster

  • Repeatedly merge the two most similar clusters until there is only one cluster

  • The history of merging forms a binary tree

  • The standard way of depicting this history is a dendrogram.
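
The procedure above can be sketched in a few lines. This is a deliberately naive version that recomputes all pairwise cluster similarities at every step; the function name `hac`, the `linkage` parameter, and the toy similarity (negative absolute difference between numbers) are assumptions made for illustration.

```python
def hac(items, similarity, linkage=max):
    """Return the merge history as (cluster_a, cluster_b, similarity) tuples.

    Clusters are tuples of item indices. `linkage` reduces the pairwise
    similarities between two clusters to one number (max = single link).
    """
    clusters = [(i,) for i in range(len(items))]  # each item starts alone
    history = []

    def cluster_sim(a, b):
        return linkage(similarity(items[i], items[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = cluster_sim(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        s, p, q = best
        history.append((clusters[p], clusters[q], s))
        merged = clusters[p] + clusters[q]
        clusters = [c for i, c in enumerate(clusters) if i not in (p, q)]
        clusters.append(merged)
    return history


values = [0.0, 1.0, 5.0, 6.0, 20.0]
history = hac(values, similarity=lambda a, b: -abs(a - b))
for left, right, sim in history:
    print(left, right, sim)
```

With n items the history always contains n - 1 merges, which is exactly the set of internal nodes of the binary tree. Each recorded similarity gives the height at which the corresponding node would be drawn.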


An Example of a Dendrogram

[Figure: dendrogram with a similarity axis]

With an appropriately chosen similarity cut, we can convert the dendrogram into a flat clustering.


Similarity of Clusters

  • Single-link: Maximum similarity

    • Maximum over all document pairs

  • Complete-link: Minimum similarity

    • Minimum over all document pairs

  • Centroid: Average “intersimilarity”

    • Average over all document pairs
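
The three definitions differ only in how the pairwise document similarities are combined. A minimal sketch, assuming clusters are lists of documents and `sim` is any pairwise similarity function (the function names are hypothetical):

```python
def single_link(a, b, sim):
    """Maximum similarity over all document pairs."""
    return max(sim(x, y) for x in a for y in b)


def complete_link(a, b, sim):
    """Minimum similarity over all document pairs."""
    return min(sim(x, y) for x in a for y in b)


def average_link(a, b, sim):
    """Average similarity over all document pairs."""
    sims = [sim(x, y) for x in a for y in b]
    return sum(sims) / len(sims)


def toy_sim(x, y):
    # Toy similarity for numbers: closer values are more similar.
    return -abs(x - y)


cluster_a = [0.0, 1.0]
cluster_b = [3.0, 5.0]
print(single_link(cluster_a, cluster_b, toy_sim))    # -2.0
print(complete_link(cluster_a, cluster_b, toy_sim))  # -5.0
print(average_link(cluster_a, cluster_b, toy_sim))   # -3.5
```

Single link looks only at the closest pair, complete link only at the furthest pair, and the average takes every pair into account.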


Single Link vs. Complete Link

[Figure: example clusterings produced by single link and by complete link]

  • Complete link usually produces balanced clusters


Divisive Hierarchical Clustering

  • Top-down (instead of bottom-up as in HAC)

  • Start with all docs in one big cluster

  • Then recursively split clusters

  • Eventually each node forms a cluster on its own.

  • Example: Bisecting K-means
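
Bisecting k-means can be sketched as repeated 2-means splits. The slides do not say which cluster to split next or when to stop; the version below assumes it splits the cluster with the largest within-cluster sum of squared distances and stops once k clusters exist. All function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, max_iter=50):
    """Split one cluster into two with plain k-means, k = 2."""
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(max_iter):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # degenerate split, e.g. duplicate points
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # start with all points in one big cluster
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        target = max(candidates, key=sse)  # assumed rule: split the loosest
        left, right = two_means(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50), (50, 51)]
for cluster in bisecting_kmeans(points, k=3):
    print(cluster)
```

Continuing the splits until every cluster holds a single point would produce the full top-down hierarchy described in the bullets.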

