Clustering
Presentation Transcript
Clustering

  • “Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)” [ACM CS’99]

  • Instances within a cluster are very similar

  • Instances in different clusters are very different

Text Clustering


Example

[Figure: documents plotted as points in the term1 vs term2 plane, falling into visually distinct groups]

Applications

  • Faster retrieval

  • Faster and better browsing

  • Structuring of search results

  • Revealing classes and other data regularities

  • Directory construction

  • Better data organization in general

Cluster Searching

  • Similar instances tend to be relevant to the same requests

  • The query is mapped to the closest cluster by comparison with the cluster-centroids
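
A minimal Python sketch of this mapping, assuming dense term-weight vectors and cosine similarity (the function names and data are my own, not from the slides):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def closest_cluster(query, centroids):
    # Index of the centroid most similar to the query vector.
    return max(range(len(centroids)), key=lambda i: cosine(query, centroids[i]))

centroids = [[1.0, 0.0], [0.0, 1.0]]   # toy cluster centroids
query = [0.9, 0.1]
best = closest_cluster(query, centroids)
```

The query is then answered from documents of the selected cluster only, instead of scanning the whole collection.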

Notation

  • N: number of elements

  • Class: real world grouping – ground truth

  • Cluster: grouping by algorithm

  • The ideal clustering algorithm will produce clusters equivalent to real world classes with exactly the same members

Problems

  • How many clusters?

  • Complexity? N is usually large

  • Quality of clustering

  • When is one method better than another?

  • Overlapping clusters

  • Sensitivity to outliers

Example

[Figure: a scatter of points forming several visually separable clusters]

Clustering Approaches

  • Divisive: build clusters “top down” starting from the entire data set

    • K-means, Bisecting K-means

    • Hierarchical or flat clustering

  • Agglomerative: build clusters “bottom-up”, starting with individual instances and iteratively combining them to form larger clusters at higher levels

    • Hierarchical clustering

  • Combinations of the above

    • Buckshot algorithm

Hierarchical – Flat Clustering

  • Flat: all clusters at the same level

    • K-means, Buckshot

  • Hierarchical: nested sequence of clusters

    • Single cluster with all data at the top & singleton clusters at the bottom

    • Intermediate levels are more useful

    • Every intermediate level combines two clusters from the next lower level

    • Agglomerative, Bisecting K-means

Flat Clustering

[Figure: points partitioned into clusters, all at a single level]

Hierarchical Clustering

[Figure: nested clusters over the points; merges numbered 1 to 7 mark the levels of the hierarchy]

Text Clustering

  • Finds overall similarities among documents or groups of documents

    • Faster searching, browsing etc.

  • Needs to know how to compute the similarity (or equivalently the distance) between documents

Query – Document Similarity

[Figure: document vectors d1 and d2 separated by angle θ]

  • Similarity is defined as the cosine of the angle between document and query vectors
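
The equation image did not survive the transcript; the standard cosine formula the bullet refers to, for M-dimensional query and document vectors, is:

```latex
\mathrm{sim}(q,d) \;=\; \cos\theta
\;=\; \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert}
\;=\; \frac{\sum_{i=1}^{M} q_i\, d_i}{\sqrt{\sum_{i} q_i^{2}}\;\sqrt{\sum_{i} d_i^{2}}}
```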

Document Distance

  • Consider documents d1, d2 with vectors u1, u2

  • Their distance is defined as the length AB, the segment joining the tips of the two vectors
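
For unit-length vectors (the normalization introduced on a later slide), this distance relates directly to the cosine similarity:

```latex
\lVert\vec{u}_1-\vec{u}_2\rVert^{2}
= \lVert\vec{u}_1\rVert^{2} + \lVert\vec{u}_2\rVert^{2} - 2\,\vec{u}_1\cdot\vec{u}_2
= 2\,(1-\cos\theta)
```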

Normalization by Document Length

  • The longer the document is, the more likely it is for a given term to appear in it

  • Normalize the term weights by document length (so terms in long documents are not given more weight)
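
A sketch of the normalization step, assuming a dense vector of raw term weights (helper name is my own):

```python
import math

def unit_normalize(weights):
    # Divide each term weight by the vector's Euclidean length, so long
    # documents do not dominate similarity scores purely by size.
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

long_doc = [4.0, 3.0]            # raw term weights
normalized = unit_normalize(long_doc)
```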

Evaluation of Cluster Quality

  • Clusters can be evaluated using internal or external knowledge

  • Internal Measures: intra cluster cohesion and cluster separability

    • intra cluster similarity

    • inter cluster similarity

  • External measures: quality of clusters compared to real classes

    • Entropy (E), Harmonic Mean (F)

Intra Cluster Similarity

  • A measure of cluster cohesion

  • Defined as the average pair-wise similarity of documents in a cluster S of size n: sim(S) = (1/n²) Σdi,dj∈S di·dj = c·c

  • Where c is the cluster centroid, c = (1/n) Σd∈S d

  • Documents (not centroids) have unit length
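
With unit-length documents, the identity above (average dot-product similarity over all ordered pairs equals c·c) can be checked numerically; the helper names and toy vectors are my own:

```python
def centroid(docs):
    # Component-wise mean of the document vectors.
    n = len(docs)
    return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

def avg_pairwise_sim(docs):
    # Average dot-product similarity over all ordered pairs
    # (self-pairs included, which is what makes the identity exact).
    n = len(docs)
    total = sum(sum(a * b for a, b in zip(d1, d2)) for d1 in docs for d2 in docs)
    return total / (n * n)

docs = [[1.0, 0.0], [0.0, 1.0]]   # unit-length document vectors
c = centroid(docs)
intra = avg_pairwise_sim(docs)    # equals c · c
```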

Inter Cluster Similarity

  • Single Link: similarity of two most similar members

  • Complete Link: similarity of two least similar members

  • Group Average: average similarity between members
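
Assuming unit-length vectors with dot-product similarity, the three criteria can be sketched as follows (names and toy clusters are my own):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def single_link(A, B):
    # Similarity of the two MOST similar members.
    return max(dot(a, b) for a in A for b in B)

def complete_link(A, B):
    # Similarity of the two LEAST similar members.
    return min(dot(a, b) for a in A for b in B)

def group_average(A, B):
    # Average similarity over all cross-cluster pairs.
    return sum(dot(a, b) for a in A for b in B) / (len(A) * len(B))

A = [[1.0, 0.0], [0.8, 0.6]]
B = [[0.0, 1.0]]
```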

Example

[Figure: two clusters S and S′ with centroids c and c′; the single-link, complete-link, and group-average similarities between them are marked]

Entropy

  • Measures the quality of flat clusters using external knowledge

    • Pre-existing classification

    • Assessment by experts

  • Pij: probability that a member of cluster j belongs to class i

  • The entropy of cluster j is defined as Ej = -Σi Pij log Pij


Text Clustering


Entropy con t
Entropy (con’t)

  • Total entropy for all clusters: E = Σj (nj/N) Ej

  • Where nj is the size of cluster j

  • m is the number of clusters (the sum runs over j = 1, …, m)

  • N is the number of instances

  • The smaller the value of E, the better the quality of the clustering algorithm

  • The best entropy is obtained when each cluster contains exactly one instance
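
The two formulas can be sketched directly in Python (a minimal version using base-2 logarithms; the per-cluster class-count representation is my own):

```python
import math

def cluster_entropy(class_counts):
    # Ej = -sum_i Pij log Pij for one cluster, given counts of each
    # real class inside that cluster.
    n = sum(class_counts)
    probs = [c / n for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def total_entropy(clusters):
    # Size-weighted sum E = sum_j (nj / N) * Ej over all clusters.
    N = sum(sum(c) for c in clusters)
    return sum(sum(c) / N * cluster_entropy(c) for c in clusters)

pure  = [[5, 0], [0, 5]]   # each cluster holds a single class -> E = 0
mixed = [[3, 2], [2, 3]]   # mixed clusters -> higher E
```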

Harmonic Mean (F)

  • Treats each cluster as a query result

  • F combines precision (P) and recall (R)

  • Fij for cluster j and class i is defined as Fij = 2·Pij·Rij / (Pij + Rij), where precision Pij = nij/nj and recall Rij = nij/ni

    nij: number of instances of class i in cluster j,

    ni: number of instances of class i,

    nj: number of instances in cluster j

Text Clustering


Harmonic mean con t
Harmonic Mean (con’t)

  • The F value of any class i is the maximum value it achieves over all clusters j:

    Fi = maxj Fij

  • The F value of a clustering solution is computed as the weighted average over all classes: F = Σi (ni/N) Fi

  • Where N is the number of data instances
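
A sketch of the whole computation from a class-by-cluster contingency table (the table layout and names are my own):

```python
def f_value(nij, ni, nj):
    # Fij from precision P = nij/nj and recall R = nij/ni.
    if nij == 0:
        return 0.0
    P, R = nij / nj, nij / ni
    return 2 * P * R / (P + R)

def overall_f(table):
    # table[i][j] = nij: instances of class i that landed in cluster j.
    N = sum(sum(row) for row in table)
    total = 0.0
    for i, row in enumerate(table):
        ni = sum(row)                                        # class size
        nj = [sum(table[k][j] for k in range(len(table)))    # cluster sizes
              for j in range(len(row))]
        Fi = max(f_value(row[j], ni, nj[j]) for j in range(len(row)))
        total += ni / N * Fi                                 # weighted average
    return total
```

A perfect clustering (each cluster exactly one class) scores F = 1.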

Quality of Clustering

  • A good clustering method

    • Maximizes intra-cluster similarity

    • Minimizes inter cluster similarity

    • Minimizes Entropy

    • Maximizes Harmonic Mean

  • Difficult to achieve all of these simultaneously

  • Maximize some objective function of the above

  • An algorithm is better than another if it achieves better values on most of these measures

K-means Algorithm

  • Select K centroids

  • Repeat I times or until the centroids no longer change:

    • Assign each instance to the cluster represented by its nearest centroid

    • Compute new centroids
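
The assign/recompute loop can be sketched in Python (a minimal version with Euclidean distance and centroids seeded from a random sample; all names and data are my own):

```python
import math
import random

def kmeans(points, k, iters=10, seed=0):
    # Plain K-means: seed centroids from a random sample of the data,
    # then alternate assignment and centroid recomputation.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute centroids as cluster means; stop when unchanged.
        new = [[sum(c) / len(cl) for c in zip(*cl)] if cl else list(centroids[i])
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return clusters, centroids

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters, centroids = kmeans(points, 2)
```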

K-Means demo (1/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

Nikos Hourdakis, MSc Thesis


K-Means demo (2/7) through (7/7)

[Figures: successive frames of the K-means animation at the URL above]


Comments on K-Means (1)

  • Generates a flat partition of K clusters

  • K is the desired number of clusters and must be known in advance

  • Starts with K random cluster centroids

  • A centroid is the mean or the median of a group of instances

  • The mean rarely corresponds to a real instance

Comments on K-Means (2)

  • Up to I=10 iterations

  • Keep the clustering that achieved the best inter/intra-cluster similarity, or the final clusters after I iterations

  • Complexity O(IKN)

  • A repeated application of K-Means for K=2, 4,… can produce a hierarchical clustering

Bisecting K-Means

[Figure: each level splits one cluster with K=2, producing a hierarchy]

Choosing Centroids for K-means

  • Quality of clustering depends on the selection of initial centroids

  • Random selection may result in poor convergence rate, or convergence to sub-optimal clusterings.

  • Select good initial centroids using a heuristic or the results of another method

    • Buckshot algorithm

Incremental K-Means

  • Update each centroid immediately as each point is assigned to a cluster, rather than at the end of the iteration

  • Reassign instances to clusters at the end of each iteration

  • Converges faster than simple K-means

  • Usually 2-5 iterations
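
The per-point centroid update can be done in O(1) with the running-mean formula c ← (n·c + x)/(n + 1); a small sketch (names are my own):

```python
def add_to_cluster(centroid, size, point):
    # Incremental mean update: absorb one new point into a cluster whose
    # centroid currently averages `size` points.
    new_centroid = [(size * c + x) / (size + 1) for c, x in zip(centroid, point)]
    return new_centroid, size + 1

# Cluster currently holds one point at the origin; absorb (2, 2).
c, n = add_to_cluster([0.0, 0.0], 1, [2.0, 2.0])
```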

Bisecting K-Means

  • Starts with a single cluster with all instances

  • Select a cluster to split: the largest cluster, or the cluster with the lowest intra-cluster similarity

  • The selected cluster is split into 2 partitions using K-means (K=2)

  • Repeat up to the desired depth h

  • Hierarchical clustering

  • Complexity O(2hN)
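
A minimal sketch of the split-the-largest-cluster strategy, built on a small K=2 helper (all names, the seeding, and the toy data are my own):

```python
import math
import random

def two_means(points, iters=10, seed=0):
    # Minimal K=2 clustering used for each bisecting step.
    rng = random.Random(seed)
    cents = rng.sample(points, 2)
    for _ in range(iters):
        parts = [[], []]
        for p in points:
            side = 0 if math.dist(p, cents[0]) <= math.dist(p, cents[1]) else 1
            parts[side].append(p)
        cents = [tuple(sum(c) / len(part) for c in zip(*part)) if part else cents[i]
                 for i, part in enumerate(parts)]
    return [part for part in parts if part]

def bisecting_kmeans(points, k):
    # Split the largest remaining cluster with K=2 until k clusters exist.
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()          # the largest cluster
        pieces = two_means(target)
        if len(pieces) < 2:              # degenerate split: stop early
            clusters.append(target)
            break
        clusters.extend(pieces)
    return clusters

data = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (10.0, 0.0), (10.2, 0.0)]
result = bisecting_kmeans(data, 3)
```

Recording each split instead of discarding it would yield the hierarchy the slide mentions.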

Agglomerative Clustering

  • Compute the similarity matrix between all pairs of instances

  • Starting from singleton clusters

  • Repeat until a single cluster remains

    • Merge the two most similar clusters

    • Replace them with a single cluster

    • Replace the merged cluster in the matrix and update the similarity matrix

  • Complexity O(N²)
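
A naive sketch of the merge loop with group-average similarity (names and toy vectors are my own). Note this version recomputes pairwise similarities each round, which is O(N³); maintaining and updating the similarity matrix, as the slides describe, is what brings it down to roughly O(N²):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def agglomerative(docs, k):
    # Merge the two most similar clusters (group-average) until k remain.
    clusters = [[d] for d in docs]                 # start from singletons
    while len(clusters) > k:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = (sum(dot(a, b) for a in clusters[i] for b in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if best is None or sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        merged = clusters[i] + clusters[j]         # replace the pair by one cluster
        clusters = [c for t, c in enumerate(clusters) if t not in pair] + [merged]
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = agglomerative(docs, 2)
```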

Similarity Matrix

[Figure: matrix of pairwise similarities between all clusters]

Update Similarity Matrix

[Figure: the rows and columns of the two merged clusters are highlighted]

New Similarity Matrix

[Figure: the merged pair replaced by a single row and column]

Single Link

  • Selecting the most similar clusters for merging using single link

  • Can result in long and thin clusters due to “chaining effect”

    • Appropriate in some domains, such as clustering islands

Complete Link

  • Selecting the most similar clusters for merging using complete link

  • Results in compact, spherical clusters that are preferable

Group Average

  • Selecting the most similar clusters for merging using group average

  • A fast compromise between single and complete link

Example

[Figure: clusters A and B with centroids c1 and c2; the single-link, complete-link, and group-average similarities are marked]

Inter Cluster Similarity

  • A new cluster is represented by its centroid

  • The document-to-cluster similarity is computed as the similarity between the document vector and the cluster centroid

  • The cluster-to-cluster similarity can be computed as single, complete or group average similarity

Buckshot K-Means

  • Combines Agglomerative and K-Means

  • Agglomerative clustering produces a good solution but has O(N²) complexity

  • Randomly select a sample of √N instances

  • Apply Agglomerative clustering on the sample, which takes O(N) time

  • Take the centroids of the K-cluster solution as input to K-Means

  • Overall complexity is O(N)
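
A compact sketch of the pipeline (sample, agglomerate, seed K-means). All names are my own, the sample size here is the common √(k·N) variant, and the agglomerative step uses centroid distance for brevity:

```python
import math
import random

def mean(points):
    # Component-wise mean of a list of tuples.
    return tuple(sum(c) / len(points) for c in zip(*points))

def agglomerate(points, k):
    # Repeatedly merge the two clusters with the closest centroids.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters = ([c for t, c in enumerate(clusters) if t not in pair]
                    + [clusters[i] + clusters[j]])
    return clusters

def buckshot(points, k, seed=0):
    # 1) Draw a small random sample (about sqrt(k * N) points).
    rng = random.Random(seed)
    sample = rng.sample(points, max(k, int(math.sqrt(k * len(points)))))
    # 2) Cluster the sample agglomeratively; its centroids seed K-means.
    centroids = [mean(c) for c in agglomerate(sample, k)]
    # 3) One K-means assignment pass over ALL points using those centroids.
    clusters = [[] for _ in range(k)]
    for p in points:
        j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
        clusters[j].append(p)
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
          (10.0, 10.0), (10.1, 10.0), (10.2, 10.0)]
clusters = buckshot(points, 2)
```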

Example

[Figure: agglomerative clustering of a sample of points 1-15; the centroids of the resulting clusters become the initial centroids for K-Means]

Graph Theoretic Methods

  • Two documents with similarity > T(threshold) are connected with an edge [Duda&Hart73]

    • Clusters: the connected components (or, more strictly, the maximal cliques) of the resulting graph

    • Problem: selection of an appropriate threshold T
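
The connected-components variant can be sketched directly from a pairwise similarity matrix (names and toy values are my own):

```python
def threshold_clusters(sims, T):
    # sims: symmetric matrix of pairwise similarities.
    # Connect i and j when sims[i][j] > T; clusters are the
    # connected components, found by depth-first search.
    n = len(sims)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in range(n):
                if v not in seen and sims[u][v] > T:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(comp))
    return clusters

sims = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
```

Raising T fragments the graph into more, tighter clusters; lowering it merges everything, which is exactly the threshold-selection problem noted above.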

Information Retrieval Models


Zahn’s method [Zahn71]

[Figure: minimum spanning tree; the dashed edge is inconsistent and is deleted]

  • Find the minimum spanning tree

  • For each node, delete incident edges with length l > lavg

    • E.g., lavg: the average length of its incident edges

  • Or remove the longest edge (1 edge removed => 2 clusters, 2 edges removed => 3 clusters)

  • Clusters: the connected components of the graph

  • http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/1999/clustering/node20.html



Cluster Searching

  • The M-dimensional query vector is compared with the cluster-centroids

    • search closest cluster

    • retrieve documents with similarity > T

Soft Clustering

  • Hard clustering: each instance belongs to exactly one cluster

    • Does not allow for uncertainty

    • An instance may belong to two or more clusters

  • Soft clustering is based on probabilities that an instance belongs to each of a set of clusters

    • probabilities of all categories must sum to 1

    • Expectation Maximization (EM) is the most popular approach
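
A minimal sketch of a soft assignment, assuming the E-step of EM for equal-weight spherical Gaussian clusters (the function name, fixed variance, and toy data are my own simplifications):

```python
import math

def soft_assign(point, centroids, var=1.0):
    # Responsibility of each cluster for the point: a Gaussian likelihood
    # per centroid, normalized so the probabilities sum to 1.
    likes = [math.exp(-math.dist(point, c) ** 2 / (2 * var)) for c in centroids]
    total = sum(likes)
    return [l / total for l in likes]

# A point near the first centroid gets most, but not all, of the mass.
probs = soft_assign((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

A full EM implementation would alternate this E-step with an M-step that re-estimates centroids (and variances) from the responsibilities.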

References

  • "Searching Multimedia Databases by Content", Christos Faloutsos, Kluwer Academic Publishers, 1996

  • “A Comparison of Document Clustering Techniques”, M. Steinbach, G. Karypis, V. Kumar, In KDD Workshop on Text Mining,2000

  • “Data Clustering: A Review”, A.K. Jain, M.N. Murty, P.J. Flynn, ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999

  • “Algorithms for Clustering Data” A.K. Jain, R.C. Dubes; Prentice-Hall , 1988, ISBN 0-13-022278-X

  • “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer”, G. Salton, Addison-Wesley, 1989
