Clustering

Clustering
• “Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)” [ACM CS’99]
• Instances within a cluster are very similar
• Instances in different clusters are very different

Text Clustering

[Figure: Example — documents plotted as points in a two-dimensional term space (axes term1 and term2), with nearby points forming clusters]

Applications
• Faster retrieval
• Faster and better browsing
• Structuring of search results
• Revealing classes and other data regularities
• Directory construction
• Better data organization in general


Cluster Searching
• Similar instances tend to be relevant to the same requests
• The query is mapped to the closest cluster by comparison with the cluster-centroids


Notation
• N: number of elements
• Class: real world grouping – ground truth
• Cluster: grouping by algorithm
• The ideal clustering algorithm will produce clusters equivalent to real world classes with exactly the same members


Problems
• How many clusters?
• Complexity: N is usually large
• Quality of clustering
• When is one method better than another?
• Overlapping clusters
• Sensitivity to outliers

[Figure: Example — a scatter plot of data points in two dimensions, forming visible clusters]

Clustering Approaches
• Divisive: build clusters “top-down”, starting from the entire data set
• K-means, Bisecting K-means
• Hierarchical or flat clustering
• Agglomerative: build clusters “bottom-up”, starting with individual instances and iteratively combining them into larger clusters at higher levels
• Hierarchical clustering
• Combinations of the above
• Buckshot algorithm


Hierarchical – Flat Clustering
• Flat: all clusters at the same level
• K-means, Buckshot
• Hierarchical: nested sequence of clusters
• Single cluster with all data at the top & singleton clusters at the bottom
• Intermediate levels are more useful
• Every intermediate level combines two clusters from the next lower level
• Agglomerative, Bisecting K-means

[Figure: Flat Clustering — data points partitioned into clusters at a single level]

[Figure: Hierarchical Clustering — nested clusters (numbered 1–7) at several levels, from singleton clusters up to a single cluster containing all the data]

Text Clustering
• Finds overall similarities among documents or groups of documents
• Faster searching, browsing etc.
• Needs to know how to compute the similarity (or equivalently the distance) between documents


[Figure: two document vectors d1 and d2 with angle θ between them]

Query – Document Similarity
• Similarity is defined as the cosine of the angle θ between the document and query vectors:

  sim(d1, d2) = cos θ = (d1 · d2) / (|d1| |d2|)
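As a concrete illustration, here is a minimal sketch of the cosine computation over plain Python lists (the term-weight values are made up, not from the slides):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = [2.0, 1.0, 0.0]   # hypothetical term weights for a document
d2 = [1.0, 1.0, 1.0]   # hypothetical term weights for a query
print(cosine_similarity(d1, d2))
```

Orthogonal vectors score 0, parallel vectors score 1, regardless of vector length.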

Document Distance
• Consider documents d1, d2 with vectors u1, u2
• Their distance is defined as the length of the segment AB joining the endpoints of the two vectors, i.e. dist(d1, d2) = |u1 − u2|

Normalization by Document Length
• The longer the document is, the more likely it is for a given term to appear in it
• Normalize the term weights by document length (so terms in long documents are not given more weight)


Evaluation of Cluster Quality
• Clusters can be evaluated using internal or external knowledge
• Internal Measures: intra cluster cohesion and cluster separability
• intra cluster similarity
• inter cluster similarity
• External measures: quality of clusters compared to real classes
• Entropy (E), Harmonic Mean (F)


Intra Cluster Similarity
• A measure of cluster cohesion
• Defined as the average pair-wise similarity of documents in a cluster S:

  sim(S) = (1/n²) Σd∈S Σd′∈S cos(d, d′) = c · c

• Where c = (1/n) Σd∈S d is the cluster centroid
• Documents (not centroids) have unit length
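The identity above — that the average pairwise similarity of unit-length documents equals the centroid's self-similarity c · c — can be checked numerically. The sketch below uses made-up 2-D "documents" normalized to unit length:

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# hypothetical unit-length document vectors
docs = [unit(v) for v in ([1.0, 2.0], [2.0, 1.0], [1.0, 1.0])]
n = len(docs)

# centroid c = (1/n) * sum of documents
c = [sum(col) / n for col in zip(*docs)]

# average pairwise similarity over all ordered pairs (self-pairs included)
avg_pairwise = sum(sum(a * b for a, b in zip(u, v))
                   for u in docs for v in docs) / n**2

# equals the centroid's self-similarity c . c
print(avg_pairwise, sum(x * x for x in c))
```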

Inter Cluster Similarity
• Single Link: similarity of two most similar members
• Complete Link: similarity of two least similar members
• Group Average: average similarity between members


[Figure: Example — group-average similarity between two clusters S and S’ with centroids c and c’]

Entropy
• Measures the quality of flat clusters using external knowledge
• Pre-existing classification
• Assessment by experts
• Pij: probability that a member of cluster j belongs to class i
• The entropy of cluster j is defined as Ej = −Σi Pij log Pij

Entropy (con’t)
• Total entropy for all clusters: E = Σj (nj/N) Ej
• Where nj is the size of cluster j
• m is the number of clusters (the sum runs over j = 1…m)
• N is the number of instances
• The smaller the value of E, the better the quality of the clustering
• Entropy is trivially minimized (E = 0) when each cluster contains exactly one instance, so it must be balanced against the number of clusters
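The total-entropy formula above can be sketched as follows; for illustration, each cluster is represented simply as the list of ground-truth class labels of its members (an assumption of this sketch, not a notation from the slides):

```python
import math
from collections import Counter

def clustering_entropy(clusters):
    """clusters: list of clusters, each a list of true class labels.
    Returns total entropy E = sum_j (n_j / N) * E_j,
    where E_j = -sum_i P_ij * log(P_ij)."""
    N = sum(len(c) for c in clusters)
    total = 0.0
    for labels in clusters:
        nj = len(labels)
        counts = Counter(labels)
        ej = -sum((k / nj) * math.log(k / nj) for k in counts.values())
        total += (nj / N) * ej
    return total

# a clustering that exactly reproduces the classes has entropy 0
print(clustering_entropy([["a", "a"], ["b", "b"]]))
# a maximally mixed clustering of two classes has entropy log 2
print(clustering_entropy([["a", "b"], ["a", "b"]]))
```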

Harmonic Mean (F)
• Treats each cluster as a query result
• F combines precision (P) and recall (R)
• Fij for cluster j and class i is defined as

  Fij = 2 Pij Rij / (Pij + Rij), where Pij = nij / nj and Rij = nij / ni

nij: number of instances of class i in cluster j,
ni: number of instances of class i,
nj: number of instances in cluster j

Harmonic Mean (con’t)
• The F value of any class i is the maximum value it achieves over all clusters j

  Fi = maxj Fij

• The F value of a clustering solution is computed as the weighted average over all classes:

  F = Σi (ni/N) Fi

• Where N is the number of data instances
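A corresponding sketch of the F computation, again representing each cluster as the list of ground-truth class labels of its members (an illustrative convention of this sketch):

```python
from collections import Counter

def f_measure(clusters):
    """clusters: list of clusters, each a list of true class labels.
    F = sum_i (n_i / N) * max_j F_ij, with
    F_ij = 2 * P_ij * R_ij / (P_ij + R_ij),
    P_ij = n_ij / n_j (precision), R_ij = n_ij / n_i (recall)."""
    all_labels = [l for c in clusters for l in c]
    N = len(all_labels)
    class_sizes = Counter(all_labels)
    total = 0.0
    for cls, ni in class_sizes.items():
        best = 0.0
        for labels in clusters:
            nij = sum(1 for l in labels if l == cls)
            if nij == 0:
                continue
            p = nij / len(labels)   # precision of cluster as a "query result"
            r = nij / ni            # recall
            best = max(best, 2 * p * r / (p + r))
        total += (ni / N) * best
    return total

# a clustering that exactly reproduces the classes scores F = 1
print(f_measure([["a", "a"], ["b", "b"]]))
```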

Quality of Clustering
• A good clustering method
• Maximizes intra-cluster similarity
• Minimizes inter-cluster similarity
• Minimizes Entropy
• Maximizes Harmonic Mean
• Difficult to achieve all of these simultaneously
• Maximize some objective function of the above
• An algorithm is better than another if it has better values on most of these measures

K-means Algorithm
• Select K centroids
• Repeat I times or until the centroids do not change:
• Assign each instance to the cluster represented by its nearest centroid
• Compute new centroids
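The steps above can be sketched as a minimal K-means in plain Python. The helper names (`dist2`, `mean`), the seeding, and the toy 2-D points are illustrative choices, not from the slides:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty group of points."""
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))

def kmeans(points, k, iters=10, seed=0):
    """Flat K-means sketch: K random instances as initial centroids,
    then alternate assignment and centroid recomputation."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # centroids unchanged: converged
        centroids = new_centroids
    return clusters, centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
clusters, cents = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))
```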

K-Means demo: http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

(Demo slides from Nikos Hourdakis, MSc Thesis)

• Generates a flat partition of K clusters
• K is the desired number of clusters and must be known in advance
• Starts with K random cluster centroids
• A centroid is the mean or the median of a group of instances
• The mean rarely corresponds to a real instance


• Up to I=10 iterations
• Keep the clustering that results in the best inter/intra-cluster similarity, or the final clusters after I iterations
• Complexity O(IKN)
• A repeated application of K-Means for K=2, 4, … can produce a hierarchical clustering

Bisecting K-Means

[Figure: a cluster is repeatedly split into two sub-clusters by running K-means with K=2 at each step]

Choosing Centroids for K-means
• Quality of clustering depends on the selection of initial centroids
• Random selection may result in poor convergence rate, or convergence to sub-optimal clusterings.
• Select good initial centroids using a heuristic or the results of another method
• Buckshot algorithm


Incremental K-Means
• Update each centroid during each iteration after each point is assigned to a cluster rather than at the end of each iteration
• Converges faster than simple K-means
• Usually 2-5 iterations


Bisecting K-Means
• Starts with a single cluster containing all instances
• Select a cluster to split: the largest cluster, or the cluster with the least intra-cluster similarity
• The selected cluster is split into 2 partitions using K-means (K=2)
• Repeat up to the desired depth h
• Hierarchical clustering
• Complexity O(2hN)
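A minimal sketch of bisecting K-means under the "split the largest cluster" policy described above (helper names, seeding, and toy data are illustrative assumptions of this sketch):

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def two_means(points, iters=10, seed=0):
    """One bisecting step: split a cluster into 2 using K-means with K=2."""
    rng = random.Random(seed)
    c = rng.sample(points, 2)
    for _ in range(iters):
        parts = ([], [])
        for p in points:
            parts[0 if dist2(p, c[0]) <= dist2(p, c[1]) else 1].append(p)
        if not parts[0] or not parts[1]:
            break
        c = [mean(parts[0]), mean(parts[1])]
    return parts

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)     # split the largest cluster
        a, b = two_means(target)
        clusters += [a, b]
    return clusters

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (9.0, 0.0), (8.8, 0.3)]
out = bisecting_kmeans(pts, 3)
print([len(c) for c in out])
```

Because each level only reruns K=2 on one cluster, the work per split stays small compared with a full K-means over all the data.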

Agglomerative Clustering
• Compute the similarity matrix between all pairs of instances
• Start from singleton clusters
• Repeat until a single cluster remains:
• Merge the two most similar clusters
• Replace them with a single cluster
• Update the similarity matrix with a row and column for the merged cluster
• Complexity O(N²)
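The agglomerative loop can be sketched as below. Note this naive version recomputes cluster-to-cluster similarities from scratch each round, which is cubic in N; maintaining and updating the similarity matrix, as the slide describes, is what reduces the cost toward O(N²). Passing `max` as the linkage gives single link, `min` gives complete link (the 1-D toy data and the negative-distance similarity are illustrative assumptions):

```python
def agglomerative(points, sim, linkage=max):
    """Bottom-up clustering sketch. sim(a, b) scores two instances;
    linkage folds pairwise scores into a cluster-to-cluster score:
    max = single link, min = complete link."""
    clusters = [[p] for p in points]          # start from singletons
    merges = []
    while len(clusters) > 1:
        # find the most similar pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(sim(a, b)
                            for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
        merges.append((s, merged))
    return merges

# 1-D toy data; similarity chosen as negative distance for illustration
history = agglomerative([0.0, 0.1, 5.0, 5.2], sim=lambda a, b: -abs(a - b))
print([sorted(m) for _, m in history])
```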

Similarity Matrix

[Figure: matrix of pairwise similarities between all current clusters]

Update Similarity Matrix

[Figure: the rows and columns of the two merged clusters are combined]

New Similarity Matrix

[Figure: the updated matrix after the merge, with one row and column fewer]

Single Link
• Selecting the most similar clusters for merging using single link
• Can result in long and thin clusters due to the “chaining effect”
• Appropriate in some domains, such as clustering islands

Complete Link
• Selecting the most similar clusters for merging using complete link
• Results in compact, roughly spherical clusters, which are often preferable

Group Average
• Selecting the most similar clusters for merging using group average
• A compromise between single and complete link

[Figure: Example — group-average similarity between two clusters A and B with centroids c1 and c2]

Inter Cluster Similarity
• A new cluster is represented by its centroid
• The document-to-cluster similarity is computed as the cosine similarity between the document vector and the cluster centroid
• The cluster-to-cluster similarity can be computed as single, complete or group-average similarity

Buckshot K-Means
• Combines Agglomerative and K-Means
• Agglomerative results in a good clustering solution but has O(N²) complexity
• Randomly select a sample of √N instances
• Apply Agglomerative on the sample, which takes O(N) time
• Take the centroids of the resulting k-cluster solution as input to K-Means
• Overall complexity is O(N)

[Figure: Example — 15 sampled points grouped by Agglomerative clustering; the centroids of the resulting groups become the initial centroids for K-Means]

Graph Theoretic Methods
• Two documents with similarity > T (a threshold) are connected with an edge [Duda&Hart73]
• Clusters: the connected components (or, in a stricter variant, the maximal cliques) of the resulting graph
• Problem: selection of an appropriate threshold T

Information Retrieval Models

Zahn’s method [Zahn71]
• Find the minimum spanning tree of the documents
• For each node, delete incident edges with length l > lavg
• E.g., lavg: the average length of its incident edges
• Or remove the longest edges (1 edge removed => 2 clusters, 2 edges removed => 3 clusters)
• Clusters: the connected components of the resulting graph
• http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/1999/clustering/node20.html

[Figure: an MST in which the dashed (inconsistent) edge is deleted, splitting the data into two clusters]
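A sketch of the "remove the longest MST edges" variant described above, using Prim's algorithm for the tree and a union-find pass to read off the connected components (function names and toy 1-D data are illustrative assumptions):

```python
def mst_edges(points, dist):
    """Prim's algorithm: returns the edges of the minimum spanning tree."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        d, i, j = min((dist(points[i], points[j]), i, j)
                      for i in in_tree
                      for j in range(len(points)) if j not in in_tree)
        edges.append((d, i, j))
        in_tree.add(j)
    return edges

def zahn_clusters(points, dist, k):
    """Delete the k-1 longest MST edges; connected components are clusters."""
    keep = sorted(mst_edges(points, dist))[:len(points) - k]
    # union-find over the remaining (short) edges
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for _, i, j in keep:
        parent[find(i)] = find(j)
    groups = {}
    for idx, p in enumerate(points):
        groups.setdefault(find(idx), []).append(p)
    return list(groups.values())

pts = [0.0, 0.2, 5.0, 5.1]
out = zahn_clusters(pts, dist=lambda a, b: abs(a - b), k=2)
print([sorted(c) for c in out])
```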

Cluster Searching
• The M-dimensional query vector is compared with the cluster-centroids
• search closest cluster
• retrieve documents with similarity > T


Soft Clustering
• Hard clustering: each instance belongs to exactly one cluster
• Does not allow for uncertainty
• An instance may belong to two or more clusters
• Soft clustering is based on the probabilities that an instance belongs to each of a set of clusters
• The probabilities over all clusters must sum to 1
• Expectation Maximization (EM) is the most popular approach
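A toy EM sketch for soft clustering of 1-D data with two Gaussian components. To stay short it keeps the priors and variances fixed and re-estimates only the means, which is a deliberate simplification of full EM, not the complete algorithm:

```python
import math

def em_1d(data, iters=30):
    """EM sketch for a 2-component 1-D Gaussian mixture (equal priors
    and unit variances held fixed; only the means are re-estimated)."""
    mu = [min(data), max(data)]          # crude initial means
    resp = []
    for _ in range(iters):
        # E-step: soft membership probabilities (sum to 1 per point)
        resp = []
        for x in data:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M-step: re-estimate each mean from the soft assignments
        for j in range(2):
            tot = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / tot
    return mu, resp

data = [0.0, 0.3, 4.8, 5.1]
mu, resp = em_1d(data)
print(mu)
print(resp[0])   # soft memberships of the first point
```

Each row of `resp` is a probability distribution over the two components, which is exactly the "membership probabilities sum to 1" property of soft clustering.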

References
• “Searching Multimedia Databases by Content”, C. Faloutsos, Kluwer Academic Publishers, 1996
• “A Comparison of Document Clustering Techniques”, M. Steinbach, G. Karypis, V. Kumar, KDD Workshop on Text Mining, 2000
• “Data Clustering: A Review”, A.K. Jain, M.N. Murty, P.J. Flynn, ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999
• “Algorithms for Clustering Data”, A.K. Jain, R.C. Dubes, Prentice-Hall, 1988, ISBN 0-13-022278-X
• “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer”, G. Salton, Addison-Wesley, 1989