Introduction to Text Clustering: Principles, Approaches, and Evaluation Techniques

Clustering
• “Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)” [ACM CS’99]
• Instances within a cluster are very similar
• Instances in different clusters are very different
Text Clustering

Example
[Figure: points plotted in a two-dimensional term space, axes term1 and term2]

Applications
• Faster retrieval
• Faster and better browsing
• Structuring of search results
• Revealing classes and other data regularities
• Directory construction
• Better data organization in general

Cluster Searching
• Similar instances tend to be relevant to the same requests
• The query is mapped to the closest cluster by comparison with the cluster centroids
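
The query-to-cluster mapping can be sketched in a few lines of Python. This is only an illustration: cosine similarity is assumed as the comparison, and the centroid and query vectors are made-up values over a hypothetical three-term vocabulary.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_cluster(query, centroids):
    """Return the index of the centroid most similar to the query."""
    return max(range(len(centroids)), key=lambda j: cosine(query, centroids[j]))

# Hypothetical centroids and query (illustrative values only).
centroids = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
query = [0.0, 0.1, 0.7]
best = closest_cluster(query, centroids)
```

Only the documents of cluster `best` then need to be compared with the query, which is where the speed-up comes from.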

Notation
• N: number of elements
• Class: real-world grouping (ground truth)
• Cluster: grouping by algorithm
• The ideal clustering algorithm will produce clusters equivalent to real-world classes with exactly the same members

Problems
• How many clusters?
• Complexity? N is usually large
• Quality of clustering
• When is one method better than another?
• Overlapping clusters
• Sensitivity to outliers

Example
[Figure: scatter of data points]

Clustering Approaches
• Divisive: build clusters “top down”, starting from the entire data set
  • K-means, Bisecting K-means
  • Hierarchical or flat clustering
• Agglomerative: build clusters “bottom-up”, starting with individual instances and iteratively combining them to form larger clusters at higher levels
  • Hierarchical clustering
• Combinations of the above
  • Buckshot algorithm

Hierarchical – Flat Clustering
• Flat: all clusters at the same level
  • K-means, Buckshot
• Hierarchical: nested sequence of clusters
  • Single cluster with all data at the top and singleton clusters at the bottom
  • Intermediate levels are more useful
  • Every intermediate level combines two clusters from the next lower level
  • Agglomerative, Bisecting K-means

Flat Clustering
[Figure: data points grouped into flat clusters]

Hierarchical Clustering
[Figure: nested clusters over points labeled 1 to 7]

Text Clustering
• Finds overall similarities among documents or groups of documents
• Faster searching, browsing, etc.
• Needs to know how to compute the similarity (or, equivalently, the distance) between documents

Query – Document Similarity
[Figure: vectors d1 and d2 separated by angle θ]
• Similarity is defined as the cosine of the angle between document and query vectors

Document Distance
• Consider documents d1, d2 with vectors u1, u2
• Their distance is defined as the length AB

Normalization by Document Length
• The longer the document is, the more likely it is for a given term to appear in it
• Normalize the term weights by document length (so terms in long documents are not given more weight)
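
A small Python sketch may help tie the last three slides together. It assumes that the length AB is the Euclidean distance between the endpoints of the two document vectors and that "document length" means the Euclidean norm of the term-weight vector; both are interpretations, and the vectors are made-up examples.

```python
import math

def normalize(v):
    """Scale a term-weight vector to unit (Euclidean) length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def euclidean(u, v):
    """Length of the segment joining the endpoints of u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d1 = [2.0, 0.0, 1.0]
d2 = [4.0, 0.0, 2.0]   # same term proportions as d1, but twice as long
d3 = [0.0, 3.0, 0.0]   # shares no terms with d1

sim_12 = cosine(d1, d2)                                # about 1: same direction
sim_13 = cosine(d1, d3)                                # 0: no terms in common
dist_raw = euclidean(d1, d2)                           # nonzero before normalization
dist_norm = euclidean(normalize(d1), normalize(d2))    # about 0 after normalization
```

Without normalization the longer document d2 looks distant from d1 even though the two use the same terms in the same proportions. For unit-length vectors the two measures carry the same information, since the squared distance equals 2 − 2·cos θ.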

Evaluation of Cluster Quality
• Clusters can be evaluated using internal or external knowledge
• Internal measures: intra-cluster cohesion and cluster separability
  • Intra-cluster similarity
  • Inter-cluster similarity
• External measures: quality of clusters compared to real classes
  • Entropy (E), Harmonic Mean (F)

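Intra Cluster Similarity
• A measure of cluster cohesion
• Defined as the average pair-wise similarity of documents in a cluster S: $\mathrm{sim}(S) = \frac{1}{|S|^2} \sum_{d \in S} \sum_{d' \in S} \cos(d, d') = c \cdot c = \|c\|^2$
• Where $c = \frac{1}{|S|} \sum_{d \in S} d$: cluster centroid
• Documents (not centroids) have unit length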
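
The practical value of the centroid is that, for unit-length documents, the average pairwise cosine over all ordered pairs (self-pairs included) equals the squared length of the centroid, so cohesion can be computed without enumerating pairs. The following sketch checks this numerically on made-up vectors; the function names are illustrative only.

```python
import math
from itertools import product

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(docs):
    """Component-wise mean of the document vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def avg_pairwise_similarity(docs):
    """Average cosine over all ordered pairs, self-pairs included.
    Documents are assumed to be unit length, so cosine == dot product."""
    n = len(docs)
    return sum(dot(a, b) for a, b in product(docs, repeat=2)) / (n * n)

docs = [normalize(d) for d in ([1, 1, 0], [1, 0, 0], [1, 2, 1])]
c = centroid(docs)
pairwise = avg_pairwise_similarity(docs)   # quadratic in the cluster size
centroid_sq = dot(c, c)                    # linear in the cluster size
```

Both quantities agree up to floating-point rounding. The centroid itself is generally shorter than unit length, which is why the slide stresses that only the documents are normalized.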

Inter Cluster Similarity
• Single Link: similarity of the two most similar members
• Complete Link: similarity of the two least similar members
• Group Average: average similarity between members
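
The three definitions differ only in how the cross-cluster pair similarities are aggregated. The sketch below assumes cosine similarity and takes "group average" to mean the average over all pairs with one member from each cluster; the two clusters are made-up examples.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pair_sims(A, B):
    """Similarities of all pairs with one member from A and one from B."""
    return [cosine(a, b) for a in A for b in B]

def single_link(A, B):
    return max(pair_sims(A, B))      # two most similar members

def complete_link(A, B):
    return min(pair_sims(A, B))      # two least similar members

def group_average(A, B):
    sims = pair_sims(A, B)
    return sum(sims) / len(sims)     # average over all cross pairs

A = [[1.0, 0.0], [1.0, 1.0]]
B = [[0.0, 1.0], [1.0, 2.0]]
s, c, g = single_link(A, B), complete_link(A, B), group_average(A, B)
```

By construction the complete-link value is never larger than the group average, which in turn is never larger than the single-link value.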

Example
[Figure: clusters S and S’ with centroids c and c’, illustrating single link, complete link, and group average]

Entropy
• Measures the quality of flat clusters using external knowledge
  • Pre-existing classification
  • Assessment by experts
• $P_{ij}$: probability that a member of cluster j belongs to class i
• The entropy of cluster j is defined as $E_j = -\sum_i P_{ij} \log P_{ij}$ (j: cluster, i: class)

Entropy (con’t)
• Total entropy for all clusters: $E = \sum_{j=1}^{m} \frac{n_j}{N} E_j$
• Where $n_j$ is the size of cluster j
• m is the number of clusters
• N is the number of instances
• The smaller the value of E, the better the quality of the clustering
• The best (lowest) entropy, E = 0, is obtained trivially when each cluster contains exactly one instance
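
A minimal Python sketch of the entropy measure follows. It assumes that the total entropy is the size-weighted sum of the per-cluster entropies (weights n_j/N) and uses the natural logarithm, since the slides do not state a base. Each cluster is represented simply as the list of true class labels of its members; the labels are made up.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Entropy of one cluster, given the class labels of its members."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def total_entropy(clusters):
    """Size-weighted sum of the per-cluster entropies."""
    N = sum(len(c) for c in clusters)
    return sum(len(c) / N * cluster_entropy(c) for c in clusters)

pure = [["a", "a", "a"], ["b", "b", "b"]]      # clusters match the classes
mixed = [["a", "b", "a"], ["b", "a", "b"]]     # classes are mixed up
singletons = [[x] for x in "aaabbb"]           # one instance per cluster

e_pure = total_entropy(pure)
e_mixed = total_entropy(mixed)
e_single = total_entropy(singletons)
```

The pure clustering and the all-singletons clustering both reach zero entropy, which illustrates why entropy is usually read together with the number of clusters or with a second measure such as F.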

Harmonic Mean (F)
• Treats each cluster as a query result
• F combines precision (P) and recall (R)
• $F_{ij}$ for cluster j and class i is defined as $F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}}$, with $P_{ij} = n_{ij} / n_j$ and $R_{ij} = n_{ij} / n_i$
• $n_{ij}$: number of instances of class i in cluster j, $n_i$: number of instances of class i, $n_j$: number of instances of cluster j

Harmonic Mean (con’t)
• The F value of any class i is the maximum value it achieves over all clusters j: $F_i = \max_j F_{ij}$
• The F value of a clustering solution is computed as the weighted average over all classes: $F = \sum_i \frac{n_i}{N} F_i$
• Where N is the number of data instances
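
The F computation can be sketched as follows. The sketch assumes precision n_ij/n_j and recall n_ij/n_i, combined as their harmonic mean, and weights each class by n_i/N; the input format (two parallel label lists) and the example labels are illustrative choices.

```python
from collections import Counter

def f_measure(classes, clusters):
    """classes[k] and clusters[k] are the class label and the cluster id
    of instance k. Returns the weighted F value of the clustering."""
    N = len(classes)
    n_i = Counter(classes)                 # class sizes
    n_j = Counter(clusters)                # cluster sizes
    n_ij = Counter(zip(classes, clusters)) # class/cluster overlaps
    total = 0.0
    for i, ni in n_i.items():
        best = 0.0
        for j, nj in n_j.items():
            nij = n_ij[(i, j)]
            if nij == 0:
                continue
            p, r = nij / nj, nij / ni      # precision and recall
            best = max(best, 2 * p * r / (p + r))
        total += ni / N * best             # F_i weighted by class size
    return total

classes = ["a", "a", "a", "b", "b", "b"]
f_perfect = f_measure(classes, [0, 0, 0, 1, 1, 1])
f_mixed = f_measure(classes, [0, 0, 1, 1, 1, 1])
```

In the second clustering one instance of class "a" lands in the cluster dominated by class "b", which lowers recall for "a" and precision for "b", so F drops below 1.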

Quality of Clustering
• A good clustering method
  • Maximizes intra-cluster similarity
  • Minimizes inter-cluster similarity
  • Minimizes Entropy
  • Maximizes Harmonic Mean
• Difficult to achieve all of these simultaneously
• Maximize some objective function of the above
• An algorithm is better than another if it has better values on most of these measures

K-means Algorithm
• Select K centroids
• Repeat I times or until the centroids do not change
  • Assign each instance to the cluster represented by its nearest centroid
  • Compute new centroids
  • Reassign instances
  • Compute new centroids
  • …
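
The loop above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it uses squared Euclidean distance as the "nearest" criterion (an assumption, cosine similarity on unit vectors would work similarly), takes the initial centroids as an argument, and stops when the assignments, and hence the centroids, no longer change.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, centroids, max_iter=10):
    """Return (assignment, centroids) after at most max_iter iterations."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign each instance to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:   # nothing moved: converged
            break
        assignment = new_assignment
        # Compute new centroids as the mean of each cluster's members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    return assignment, centroids

points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
assignment, centroids = kmeans(points, centroids=[points[0], points[3]])
```

Each iteration compares every instance with every centroid, so the cost grows with the product of the number of iterations, K, and N.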

K-Means demo (1/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html Nikos Hourdakis, MSc Thesis

K-Means demo (2/7) Nikos Hourdakis, MSc Thesis

Comments on K-Means (1)
• Generates a flat partition of K clusters
• K is the desired number of clusters and must be known in advance
• Starts with K random cluster centroids
• A centroid is the mean or the median of a group of instances
• The mean rarely corresponds to a real instance

Comments on K-Means (2)
• Up to I=10 iterations
• Keep the clustering that yields the best inter/intra-cluster similarity, or the final clusters after I iterations
• Complexity O(IKN)
• A repeated application of K-Means for K=2, 4,… can produce a hierarchical clustering

Bisecting K-Means
[Figure: successive splits, each performed with K=2]

Choosing Centroids for K-means
• Quality of clustering depends on the selection of initial centroids
• Random selection may result in a poor convergence rate, or in convergence to sub-optimal clusterings
• Select good initial centroids using a heuristic or the results of another method
  • Buckshot algorithm

Incremental K-Means
• Update each centroid during each iteration, after each point is assigned to a cluster, rather than at the end of each iteration
• Reassign instances to clusters at the end of each iteration
• Converges faster than simple K-means
  • Usually 2-5 iterations
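
One possible reading of the incremental variant is sketched below: each centroid is kept as a running mean and is adjusted the moment a point joins or leaves its cluster. Several details are assumptions made for the sake of a runnable example: squared Euclidean distance is used, each seed centroid is replaced by the first point assigned to it, and reassignment simply happens on the next pass over the data.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def incremental_kmeans(points, centroids, max_iter=10):
    """K-means with centroids updated immediately after each assignment."""
    k = len(centroids)
    centroids = [list(c) for c in centroids]
    counts = [0] * k                       # current cluster sizes
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        for idx, p in enumerate(points):
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            old = assignment[idx]
            if j == old:
                continue
            changed = True
            if old is not None:            # remove p from its old running mean
                counts[old] -= 1
                if counts[old]:
                    n = counts[old]
                    centroids[old] = [(m * (n + 1) - x) / n
                                      for m, x in zip(centroids[old], p)]
            counts[j] += 1                 # add p to its new running mean
            centroids[j] = [m + (x - m) / counts[j]
                            for m, x in zip(centroids[j], p)]
            assignment[idx] = j
        if not changed:                    # a full pass with no moves: converged
            break
    return assignment, centroids

points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
assignment, centroids = incremental_kmeans(points, [points[0], points[3]])
```

Because later points in a pass already see the updated centroids, the result can depend on the order in which the points are visited.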

Bisecting K-Means
• Starts with a single cluster containing all instances
• Select a cluster to split: the largest cluster or the cluster with the lowest intra-cluster similarity
• The selected cluster is split into 2 partitions using K-means (K=2)
• Repeat up to the desired depth h
• Produces a hierarchical clustering
• Complexity O(2hN)
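
A compact sketch of the bisecting procedure is given below. It makes several simplifying assumptions that are not in the slides: the largest cluster is always the one that is split, splitting stops at a requested number of clusters rather than at a depth h, squared Euclidean distance is used, and the two seeds of each split are the first point and the point farthest from it.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def two_means(points, max_iter=10):
    """Split points into two groups with K-means (K=2)."""
    a = points[0]
    b = max(points, key=lambda p: sq_dist(p, a))   # seed 2: farthest from seed 1
    centroids = [list(a), list(b)]
    groups = None
    for _ in range(max_iter):
        new_groups = [[], []]
        for p in points:
            nearer_first = sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1])
            new_groups[0 if nearer_first else 1].append(p)
        if new_groups == groups:                   # no change: converged
            break
        groups = new_groups
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return groups

def bisecting_kmeans(points, n_clusters):
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        largest = max(clusters, key=len)           # cluster selected for splitting
        if len(largest) < 2:
            break
        clusters.remove(largest)
        left, right = two_means(largest)
        if not left or not right:                  # identical points: cannot split
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [20, 0], [21, 0]]
clusters = bisecting_kmeans(points, n_clusters=3)
```

Recording which cluster each split came from would yield the hierarchy; the sketch keeps only the leaf clusters for brevity.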

Agglomerative Clustering
• Compute the similarity matrix between all pairs of instances
• Starting from singleton clusters
• Repeat until a single cluster remains
  • Merge the two most similar clusters
  • Replace them with a single cluster
  • Replace the merged clusters in the matrix and update the similarity matrix
• Complexity O(N²)
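
The merge loop can be sketched as follows. For brevity this version recomputes cluster similarities on every pass instead of maintaining and updating a similarity matrix, so it is slower than the procedure described above; it also stops at k clusters instead of one, uses cosine similarity, and uses group average as the cluster-to-cluster similarity. All of these are illustrative choices.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_average(A, B):
    """Average similarity over all pairs with one member from each cluster."""
    return sum(cosine(a, b) for a in A for b in B) / (len(A) * len(B))

def agglomerative(docs, k, linkage=group_average):
    clusters = [[d] for d in docs]                 # start from singletons
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)               # most similar pair so far
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)                    # replace the pair with the merge
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = agglomerative(docs, k=2)
```

Passing a different `linkage` function (for example one that returns the maximum or minimum pair similarity) switches the same loop to single or complete link.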

Similarity Matrix
[Figure: similarity matrix]

Update Similarity Matrix
[Figure: similarity matrix with the two merged clusters marked]

New Similarity Matrix
[Figure: similarity matrix after the merge]

Single Link
• Selecting the most similar clusters for merging using single link
• Can result in long and thin clusters due to the “chaining effect”
• Appropriate in some domains, such as clustering islands

Complete Link
• Selecting the most similar clusters for merging using complete link
• Results in compact, spherical clusters that are preferable

Group Average
• Selecting the most similar clusters for merging using group average
• Fast compromise between single and complete link

Example
[Figure: clusters A and B with centroids c1 and c2, illustrating single link, complete link, and group average]

Inter Cluster Similarity
• A new cluster is represented by its centroid
• The document-to-cluster similarity is computed as the cosine between the document vector d and the cluster centroid c: $\mathrm{sim}(d, c) = \frac{d \cdot c}{\|d\| \, \|c\|}$
• The cluster-to-cluster similarity can be computed as single, complete or group average similarity

Buckshot K-Means
• Combines Agglomerative and K-Means
• Agglomerative results in a good clustering solution but has O(N²) complexity
• Randomly select a sample of √N instances
• Apply Agglomerative to the sample, which takes O(N) time
• Take the centroids of the resulting k-cluster solution as input to K-Means
• Overall complexity is O(N)
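
The combination can be sketched end to end in Python. This is an illustration under several assumptions: the sample size is the integer square root of N (raised to k if that is smaller), the agglomerative step merges the two clusters with the closest centroids, squared Euclidean distance is used throughout, and the `seed` parameter exists only to make the random sample repeatable.

```python
import math
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def agglomerate(points, k):
    """Naive agglomerative clustering down to k clusters (closest centroids merge)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(sq_dist(mean(clusters[i]), mean(clusters[j])), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

def kmeans(points, centroids, max_iter=10):
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        new = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
               for p in points]
        if new == assignment:
            break
        assignment = new
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment, centroids

def buckshot(points, k, seed=0):
    rng = random.Random(seed)
    size = max(k, math.isqrt(len(points)))         # about sqrt(N) instances
    sample = rng.sample(points, size)
    seeds = [mean(c) for c in agglomerate(sample, k)]
    return kmeans(points, seeds)                   # K-means on the full data

points = ([[i * 0.1, 0.0] for i in range(8)] +
          [[10 + i * 0.1, 0.0] for i in range(8)])
assignment, centroids = buckshot(points, k=2)
```

The quadratic agglomerative step only ever sees the small sample, which is what keeps the overall cost close to that of K-means alone.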

Example: initial centroids for K-Means
[Figure: items numbered 1 to 15]

Graph Theoretic Methods
• Two documents with similarity > T (threshold) are connected with an edge [Duda&Hart73]
• Clusters: the connected components of the resulting graph (a stricter alternative takes the maximal cliques)
• Problem: selection of an appropriate threshold T
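
The connected-components variant can be sketched as follows, assuming cosine similarity and a simple depth-first search; the documents and the two threshold values are made up to show how strongly the result depends on T.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def threshold_clusters(docs, T):
    """Connect documents with similarity > T and return the connected
    components of the resulting graph as lists of document indices."""
    n = len(docs)
    adj = {i: [j for j in range(n)
               if j != i and cosine(docs[i], docs[j]) > T]
           for i in range(n)}
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                       # depth-first search from start
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(comp))
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
strict = threshold_clusters(docs, T=0.9)    # [[0, 1], [2, 3], [4]]
loose = threshold_clusters(docs, T=0.75)    # [[0, 1, 2, 3, 4]]
```

With the lower threshold the fifth document links the two groups into one component even though documents 0 and 2 share no terms, the same chaining behaviour noted for single link.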
