
## Clustering

Rong Jin

### What is Clustering?

• Identify the underlying structure for given data points

• Document clustering: group documents on the same topic into the same cluster

(Figure: example data points plotted along "age" and "\$\$\$" axes, with a query point marked.)

### Improve IR by Document Clustering

• Cluster-based retrieval

• Cluster docs in collection a priori

• Only compute the relevance scores for docs in the cluster closest to the query

• Improve retrieval efficiency by searching only a small portion of the document collection
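The routing step above can be sketched in a few lines of Python. The `dist2` helper and the dot-product scoring used in the usage example are illustrative stand-ins for a real retrieval model, not part of the original slides:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_based_retrieval(query, centroids, clusters, score):
    """Route the query to the closest cluster centroid, then compute
    relevance scores only for the documents in that one cluster,
    instead of scoring the whole collection."""
    j = min(range(len(centroids)), key=lambda j: dist2(query, centroids[j]))
    return sorted(clusters[j], key=lambda doc: score(query, doc), reverse=True)
```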

### Application (III): Visualization

Islands of Music (Pampalk et al., KDD '03)


### How to Find Good Clusters?

• Measure the compactness by the sum of squared distances within clusters

• Membership indicators: mi,j = 1 if xi is assigned to cluster Cj, and 0 otherwise

• Find good clusters by minimizing the cluster compactness with respect to:

• Cluster centers C1 and C2

• Memberships mi,j

(Figure: seven points x1–x7 partitioned into two clusters around centers C1 and C2.)
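With the membership indicators defined above, the compactness objective being minimized can be written out explicitly (for n points and K clusters):

```latex
\min_{\{C_j\},\,\{m_{i,j}\}} \;
\sum_{i=1}^{n} \sum_{j=1}^{K} m_{i,j}\, \lVert x_i - C_j \rVert^2
\qquad \text{s.t.} \quad
m_{i,j} \in \{0, 1\}, \;\;
\sum_{j=1}^{K} m_{i,j} = 1 \;\; \forall i .
```

Minimizing jointly over centers and memberships is hard; the alternating updates in the next section optimize one while holding the other fixed.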

### How to Efficiently Cluster Data?

K-means algorithm: alternate between two updates

• Update mi,j: assign xi to the closest center Cj

• Update Cj as the average of the xi assigned to Cj
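The two alternating updates can be sketched in plain Python. Initializing the centers by sampling k data points is one common choice assumed here, not the only one:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: alternate assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    for _ in range(iters):
        # Update m[i][j]: assign each point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[j].append(p)
        # Update C[j] as the average of the points assigned to it
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                d = len(cluster[0])
                new_centers.append(tuple(
                    sum(p[i] for p in cluster) / len(cluster) for i in range(d)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break                               # assignments are stable: converged
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of points, the alternation settles after a couple of iterations regardless of which pair of points seeds the centers.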

### Example of k-means

• Identify the points that are closer to C1 than to C2, and update C1 as their average

• Identify the points that are closer to C2 than to C1, and update C2 as their average

• Repeat: reassign every point to the closer of C1 and C2, then update both C1 and C2

(Figure: centers C1 and C2 moving toward the two groups of points over successive iterations.)

### K-means for Clustering

• K-means

• Determine the membership of each data point

### K-means

• Ask the user how many clusters they'd like (e.g., k = 5)

• Randomly guess k cluster center locations

• Each data point finds out which center it's closest to (thus each center "owns" a set of data points)

• Each center finds the centroid of the points it owns

• Repeat the previous two steps until the centers stop moving

### K-means

• Any computational problem? We need to go through every data point at each iteration of k-means.

### Improve K-means

• Group nearby data points by region

• KD tree

• SR tree

• Try to update the membership for all the data points in the same region

### Improved K-means

• Find the closest center for each rectangle

• Assign all the points within a rectangle to one cluster
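One way to realize the rectangle step is with distance bounds: if one center's farthest possible distance to the rectangle beats every other center's closest possible distance, then every point inside the rectangle provably belongs to that center. The sketch below is a simplification in the spirit of KD-tree-accelerated k-means (e.g., Pelleg and Moore's "blacklisting" approach), not any paper's exact algorithm:

```python
def rect_assign(rect_lo, rect_hi, centers):
    """If a single center is provably closest to every point in the
    axis-aligned rectangle [rect_lo, rect_hi], return its index;
    otherwise return None (the rectangle must be split further)."""
    def min_d2(c):
        # squared distance from center c to the nearest point of the rectangle
        return sum(max(lo - x, 0.0, x - hi) ** 2
                   for x, lo, hi in zip(c, rect_lo, rect_hi))

    def max_d2(c):
        # squared distance from center c to the farthest corner of the rectangle
        return sum(max((x - lo) ** 2, (x - hi) ** 2)
                   for x, lo, hi in zip(c, rect_lo, rect_hi))

    best = min(range(len(centers)), key=lambda j: max_d2(centers[j]))
    bound = max_d2(centers[best])
    # the winner must beat every other center's closest approach to the rectangle
    if all(min_d2(centers[j]) > bound for j in range(len(centers)) if j != best):
        return best
    return None
```

When `rect_assign` returns an index, all points inside the rectangle can be assigned to that cluster in one step, without visiting them individually.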

### A Mixture Model for Document Clustering

• Assume that data are generated from a mixture of multinomial distributions

• Estimate the mixture distribution from the observed documents

### Gaussian Mixture Example: Start

Measure the probability for every data point to be associated with each cluster
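That per-point probability (the E-step "responsibility") is just each component's weighted density, normalized across components. This sketch uses a 1-D Gaussian mixture for brevity; the slide's document model would use multinomial densities instead:

```python
import math

def responsibilities(x, weights, means, stds):
    """Probability that point x was generated by each Gaussian component."""
    def pdf(x, mu, sigma):
        # 1-D Gaussian density
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    joint = [w * pdf(x, mu, s) for w, mu, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]   # normalize so they sum to 1
```

A point far from one component gets a responsibility near 0 for it; a point equidistant between two identical components splits 50/50.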

### Hierarchical Doc Clustering

• Goal is to create a hierarchy of topics

• Challenge: create this hierarchy automatically

• Approaches: top-down or bottom-up

### Hierarchical Agglomerative Clustering (HAC)

• Given a similarity measure for determining the similarity between two clusters

• Repeatedly merge the two most similar clusters until there is only one cluster

• The history of merging forms a binary tree

• The standard way of depicting this history is a dendrogram
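The merge loop described above can be sketched as follows. The similarity is a user-supplied function over clusters; the single-link variant shown uses negative distance between 1-D points as an illustrative similarity, not the deck's document similarity:

```python
def hac(points, similarity):
    """Hierarchical agglomerative clustering: repeatedly merge the two
    most similar clusters until only one remains; return the merge history."""
    clusters = [(i,) for i in range(len(points))]   # start: one cluster per point
    history = []
    while len(clusters) > 1:
        # find the pair of clusters with the highest similarity
        a, b = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda pair: similarity(pair[0], pair[1], points))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b))   # the merge history forms a binary tree
    return history

def single_link(c1, c2, points):
    """Single link: maximum similarity (here, negative distance) over all pairs."""
    return max(-abs(points[i] - points[j]) for i in c1 for j in c2)
```

The returned history is exactly the binary tree a dendrogram depicts: n-1 merges for n points, from leaves up to the root.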

### An Example of Dendrogram

(Figure: dendrogram; the vertical axis shows similarity.)

With an appropriately chosen similarity cut, we can convert the dendrogram into a flat clustering.

### Similarity of Clusters

• Single link: maximum similarity over all document pairs

• Complete link: minimum similarity over all document pairs

• Centroid: average "inter-similarity" (similarity of the cluster centroids)

• Group average: average similarity over all document pairs

• Complete link usually produces balanced clusters
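The pairwise criteria can be written directly; `sim` here is any document-pair similarity function (a hypothetical stand-in for, say, cosine similarity between document vectors):

```python
def pairwise_sims(A, B, sim):
    """All similarities between documents of cluster A and cluster B."""
    return [sim(a, b) for a in A for b in B]

def max_pair_sim(A, B, sim):
    # single link: maximum over all document pairs
    return max(pairwise_sims(A, B, sim))

def min_pair_sim(A, B, sim):
    # complete link: minimum over all document pairs
    return min(pairwise_sims(A, B, sim))

def avg_pair_sim(A, B, sim):
    # group average: average over all document pairs
    s = pairwise_sims(A, B, sim)
    return sum(s) / len(s)
```

Because complete link scores a candidate merge by its worst pair, it resists absorbing outliers and tends toward the balanced clusters noted above.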

### Divisive Hierarchical Clustering

• Top-down (instead of bottom-up as in HAC)