
Clustering


Rong Jin

- Identify the underlying structure of the given data points
- Document clustering: group documents on the same topic into the same cluster

[Figure: scatter plot of data points, axes labeled "$$$" and "age"]

- Cluster-based retrieval
- Cluster docs in collection a priori
- Only compute the relevance scores for docs in the cluster closest to the query
- Improve retrieval efficiency by searching only a small portion of the document collection
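The retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the method from the slides: the function name `cluster_based_search`, the use of cosine similarity as the relevance score, and the toy term-count vectors are all assumptions made for the example.

```python
import numpy as np

def cluster_based_search(query, docs, centers, labels, top_n=3):
    """Score only the documents in the cluster whose center is closest to the query."""
    def cosine(a, b):
        # Cosine similarity is an assumed relevance score for this sketch.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    # Pick the cluster whose center is most similar to the query.
    best_cluster = max(range(len(centers)), key=lambda j: cosine(query, centers[j]))
    # Compute relevance scores only for the documents in that cluster.
    candidate_ids = np.flatnonzero(labels == best_cluster)
    scores = [(cosine(query, docs[i]), int(i)) for i in candidate_ids]
    scores.sort(reverse=True)
    return best_cluster, [i for _, i in scores[:top_n]]

# Toy collection: 4 documents over a 3-term vocabulary, clustered a priori.
docs = np.array([[3.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 0.0, 4.0], [0.0, 1.0, 3.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([docs[labels == j].mean(axis=0) for j in range(2)])

query = np.array([0.0, 0.0, 1.0])
cluster, ranked = cluster_based_search(query, docs, centers, labels)
print(cluster, ranked)
```

Only the two documents in the selected cluster are scored here; the saving grows with the size of the collection, at the risk of missing relevant documents that fell into other clusters.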

Islands of music

(Pampalk et al., KDD ’03)

[Figure: data points x1 to x7 in the plane, grouped around two cluster centers C1 and C2]

- Measure the compactness by the sum of squared distances within clusters
- Membership indicators: m_{i,j} = 1 if x_i is assigned to C_j, and 0 otherwise.
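With membership indicators, the compactness measure can be written as a single sum. The symbols J (the objective), n (number of data points) and K (number of clusters) are notation assumed here for illustration; the slides do not name them.

```latex
J \;=\; \sum_{i=1}^{n} \sum_{j=1}^{K} m_{i,j} \, \lVert x_i - C_j \rVert^2 ,
\qquad m_{i,j} \in \{0, 1\}, \quad \sum_{j=1}^{K} m_{i,j} = 1 \ \text{for every } i .
```

The constraint on the right says that each point belongs to exactly one cluster, so only the distance to its own center contributes to J.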

- Find good clusters by minimizing the cluster compactness with respect to:
- Cluster centers C1 and C2
- Memberships m_{i,j}

K-means algorithm: alternate between two updates

- Update m_{i,j}: assign x_i to the closest C_j
- Update C_j as the average of the x_i assigned to C_j
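Each of the two updates minimizes the within-cluster sum of squared distances while the other set of unknowns is held fixed. A short derivation, writing J for that sum (J and the index k are notation assumed here, not taken from the slides):

```latex
J = \sum_{i} \sum_{j} m_{i,j} \, \lVert x_i - C_j \rVert^2

% Membership update, centers fixed: each x_i contributes one term, so pick the smallest
m_{i,j} =
\begin{cases}
1 & \text{if } j = \arg\min_{k} \lVert x_i - C_k \rVert^2 \\
0 & \text{otherwise}
\end{cases}

% Center update, memberships fixed: set the gradient with respect to C_j to zero
\frac{\partial J}{\partial C_j} = -2 \sum_{i} m_{i,j} \, (x_i - C_j) = 0
\quad \Longrightarrow \quad
C_j = \frac{\sum_{i} m_{i,j} \, x_i}{\sum_{i} m_{i,j}}
```

Neither update can increase J, which is why the alternation settles down, although possibly at a local rather than global minimum.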

[Figure: K-means walkthrough on data points x1 to x7 with cluster centers C1 and C2]

- Start with random cluster centers C1 and C2
- Identify the points that are closer to C1 than to C2
- Update C1
- Identify the points that are closer to C2 than to C1
- Identify the points that are closer to C2 than to C1, and the points that are closer to C1 than to C2
- Update C1 and C2

- K-means
- Start with a random guess of cluster centers
- Determine the membership of each data point
- Adjust the cluster centers
- Repeat the last two steps until the memberships stop changing
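The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `kmeans`, the choice of k random data points as the initial guess, the stopping rule (memberships unchanged), and the handling of empty clusters are choices made for the example, not details given in the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    # Random guess: use k distinct data points as the initial centers (an assumption).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Membership step: squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # memberships stopped changing
        labels = new_labels
        # Center step: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave the center of an empty cluster where it is
                centers[j] = members.mean(axis=0)
    # Compactness: sum of squared distances from each point to its own center.
    compactness = ((X - centers[labels]) ** 2).sum()
    return centers, labels, compactness

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels, compactness = kmeans(X, k=2)
print(labels, compactness)
```

On less clearly separated data the result depends on the initial guess, so running from several seeds and keeping the run with the smallest compactness is a common precaution.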

- Ask user how many clusters they’d like. (e.g. k=5)
- Randomly guess k cluster Center locations
- Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
- Each Center finds the centroid of the points it owns

Any computational problem?

K-means needs to go through every data point at each iteration.

- Group nearby data points by region
- KD tree
- SR tree

- Try to update the membership of all the data points in the same region at once

- Find the closest center for each rectangle
- Assign all the points within a rectangle to one cluster
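One way to make the rectangle shortcut safe is to check that a single center is closest to every location inside the rectangle before assigning all of its points in one go. The sketch below shows such a check; the slides do not spell out the test, so the function name `owner_of_rectangle`, the representation of a rectangle by its lower and upper corners `lo` and `hi`, and the corner-based dominance test are assumptions made for illustration.

```python
import numpy as np

def owner_of_rectangle(lo, hi, centers):
    """Return the index of the center that is closest to every point of the
    axis-aligned rectangle [lo, hi], or None if no single center qualifies."""
    mid = (lo + hi) / 2.0
    # Candidate owner: the center closest to the midpoint of the rectangle.
    best = int(((centers - mid) ** 2).sum(axis=1).argmin())
    for j, c in enumerate(centers):
        if j == best:
            continue
        # Corner of the rectangle that is most favorable to the rival center j:
        # take the upper bound in every dimension where j lies beyond `best`.
        corner = np.where(c > centers[best], hi, lo)
        if ((corner - centers[best]) ** 2).sum() >= ((corner - c) ** 2).sum():
            return None  # the rival is at least as close somewhere in the rectangle
    return best

centers = np.array([[0.0, 0.0], [10.0, 0.0]])
near_first = owner_of_rectangle(np.array([0.0, -1.0]), np.array([2.0, 1.0]), centers)
straddling = owner_of_rectangle(np.array([4.0, -1.0]), np.array([6.0, 1.0]), centers)
print(near_first, straddling)
```

When the function returns a center, every point stored in that rectangle can be assigned without computing individual distances; when it returns None, a tree-based implementation would typically descend into the child rectangles instead.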

- Assume that data are generated from a mixture of multinomial distributions
- Estimate the mixture distribution from the observed documents

Measure the probability that each data point is associated with each cluster
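For a mixture of multinomials, these probabilities follow from Bayes' rule once the mixture parameters are given. The sketch below computes them for a toy example; the function name `soft_memberships` and the parameter values are hypothetical, and estimating the parameters themselves (for instance with EM) is not shown.

```python
import numpy as np

def soft_memberships(counts, priors, word_probs):
    """Posterior probability of each cluster for each document:
    p(j | d) is proportional to priors[j] * prod_w word_probs[j, w] ** counts[d, w]."""
    # Work in log space to avoid underflow for long documents.
    log_post = np.log(priors)[None, :] + counts @ np.log(word_probs).T
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    # Normalize so that each document's probabilities sum to one.
    return post / post.sum(axis=1, keepdims=True)

# Two hypothetical clusters over a 3-word vocabulary.
priors = np.array([0.5, 0.5])
word_probs = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.1, 0.8]])
counts = np.array([[4, 1, 0],   # mostly word 0
                   [0, 1, 4],   # mostly word 2
                   [2, 1, 2]])  # evenly mixed
post = soft_memberships(counts, priors, word_probs)
print(post.round(3))
```

Unlike the 0/1 memberships of k-means, the evenly mixed third document is split equally between the two clusters.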

- Goal is to create a hierarchy of topics
- Challenge: create this hierarchy automatically
- Approaches: top-down or bottom-up

- Given: a measure of the similarity between two clusters
- Start with each document in a separate cluster
- Repeatedly merge the two most similar clusters
- Stop when there is only one cluster
- The history of merging forms a binary tree
- The standard way of depicting this history is a dendrogram
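The merging loop described above can be sketched directly from a document similarity matrix. This is a naive illustration rather than an efficient implementation: the function name `hac`, the `linkage` argument, and the tuple format of the merge history are assumptions made for the example.

```python
def hac(sim, linkage=max):
    """Bottom-up clustering on a document-by-document similarity matrix.
    `linkage` reduces the pairwise similarities between two clusters to one
    number (max gives single-link, min gives complete-link).
    Returns the merge history as (cluster_a, cluster_b, similarity, new_id)."""
    # Start with each document in a separate cluster.
    clusters = {i: [i] for i in range(len(sim))}
    history = []
    next_id = len(sim)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        # Find the two most similar clusters.
        for a_pos, a in enumerate(ids):
            for b in ids[a_pos + 1:]:
                s = linkage(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        # Merge them into a new cluster and record the merge.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, float(s), next_id))
        next_id += 1
    return history

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.3, 0.1],
       [0.1, 0.3, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
history = hac(sim)
print(history)
```

Each entry of the history joins two existing nodes under a new one, which is the binary tree that a dendrogram draws.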

[Figure: dendrogram with a similarity axis]

With an appropriately chosen similarity cut, we can convert the dendrogram into a flat clustering.
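A similarity cut can be applied by replaying only the merges that happened at or above the chosen similarity. In the sketch below, the function name `cut_dendrogram` and the merge-history format, a list of (cluster_a, cluster_b, similarity, new_cluster_id) tuples with the original documents numbered from 0, are hypothetical choices for illustration.

```python
def cut_dendrogram(n_docs, history, min_similarity):
    """Flat clustering from a merge history: keep only the merges whose
    similarity is at least `min_similarity`."""
    clusters = {i: [i] for i in range(n_docs)}
    for a, b, s, new_id in history:
        # A merge is replayed only if it clears the cut and both parts still exist.
        if s >= min_similarity and a in clusters and b in clusters:
            clusters[new_id] = clusters.pop(a) + clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical history for four documents: 0 and 1 merge first, then 2 and 3,
# and finally the two resulting clusters.
history = [(0, 1, 0.9, 4), (2, 3, 0.8, 5), (4, 5, 0.3, 6)]
flat = cut_dendrogram(4, history, min_similarity=0.5)
print(flat)
```

Raising the cut produces more, smaller clusters; lowering it produces fewer, larger ones.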

- Single-link: Maximum similarity
- Maximum over all document pairs, one document from each cluster
- Complete-link: Minimum similarity
- Minimum over all document pairs, one document from each cluster
- Centroid: Average “intersimilarity”
- Average over all document pairs, one document from each cluster
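The three cluster similarity measures differ only in how they reduce the pairwise document similarities to a single number. A minimal sketch, in which the function name `cluster_similarity` and the method labels are assumptions; the "average" option follows the slide's description of an average over all document pairs across the two clusters:

```python
def cluster_similarity(sim, cluster_a, cluster_b, method):
    """Similarity between two clusters, given a document similarity matrix."""
    # All pairs with one document from each cluster.
    pairs = [sim[i][j] for i in cluster_a for j in cluster_b]
    if method == "single":      # maximum over all document pairs
        return max(pairs)
    if method == "complete":    # minimum over all document pairs
        return min(pairs)
    if method == "average":     # average over all document pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown method: {method}")

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.3, 0.1],
       [0.1, 0.3, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
single = cluster_similarity(sim, [0, 1], [2, 3], "single")
complete = cluster_similarity(sim, [0, 1], [2, 3], "complete")
average = cluster_similarity(sim, [0, 1], [2, 3], "average")
print(single, complete, average)
```

Single-link needs only one close pair to call two clusters similar, while complete-link requires every pair to be close.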

[Figure: clusterings produced by single link and by complete link]

- Complete link usually produces balanced clusters

- Top-down (instead of bottom-up as in hierarchical agglomerative clustering, HAC)
- Start with all docs in one big cluster
- Then recursively split clusters
- Eventually each node forms a cluster on its own.
- Example: Bisecting K-means
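A rough sketch of bisecting K-means follows. Several details are assumptions made for illustration, since the slides do not specify them: the function names, the deterministic seeding of each 2-means split with two far-apart points, the rule of splitting the cluster with the largest sum of squared distances, and stopping at k clusters rather than splitting all the way down to single documents.

```python
import numpy as np

def bisect(X, n_iter=20):
    """Split X into two groups with 2-means, seeded with two far-apart points."""
    c0 = X[((X - X.mean(axis=0)) ** 2).sum(axis=1).argmax()]
    c1 = X[((X - c0) ** 2).sum(axis=1).argmax()]
    centers = np.array([c0, c1], dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(X, k):
    """Top-down clustering: start with one big cluster and split until there are k."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # Assumed rule: split the cluster with the largest sum of squared distances.
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        labels = bisect(X[idx])
        if labels.min() == labels.max():  # nothing left to split (identical points)
            clusters.append(idx)
            break
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0],
              [10.0, 1.0], [30.0, 0.0], [30.0, 1.0]])
clusters = bisecting_kmeans(X, k=3)
groups = sorted(sorted(c.tolist()) for c in clusters)
print(groups)
```

Recording which cluster each split came from would give the top-down hierarchy; the sketch keeps only the final flat clusters.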