Clustering

- “Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)” [ACM CS’99]
- Instances within a cluster are very similar
- Instances in different clusters are very different

Text Clustering

Applications

- Faster retrieval
- Faster and better browsing
- Structuring of search results
- Revealing classes and other data regularities
- Directory construction
- Better data organization in general

Cluster Searching

- Similar instances tend to be relevant to the same requests
- The query is mapped to the closest cluster by comparison with the cluster-centroids

Notation

- N: number of elements
- Class: real world grouping – ground truth
- Cluster: grouping by algorithm
- The ideal clustering algorithm will produce clusters equivalent to real world classes with exactly the same members

Problems

- How many clusters?
- Complexity? N is usually large
- Quality of clustering
- When is one method better than another?
- Overlapping clusters
- Sensitivity to outliers

Clustering Approaches

- Divisive: build clusters “top down” starting from the entire data set
  - K-means, Bisecting K-means
  - Hierarchical or flat clustering
- Agglomerative: build clusters “bottom-up” starting with individual instances and iteratively combining them to form larger clusters at higher levels
  - Hierarchical clustering
- Combinations of the above
  - Buckshot algorithm

Hierarchical – Flat Clustering

- Flat: all clusters at the same level
  - K-means, Buckshot
- Hierarchical: nested sequence of clusters
  - Single cluster with all data at the top & singleton clusters at the bottom
  - Intermediate levels are more useful
  - Every intermediate level combines two clusters from the next lower level
  - Agglomerative, Bisecting K-means

Text Clustering

- Finds overall similarities among documents or groups of documents
- Faster searching, browsing etc.
- Needs to know how to compute the similarity (or equivalently the distance) between documents

Query – Document Similarity

- Similarity is defined as the cosine of the angle θ between the document and query vectors
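
The cosine measure above can be sketched in a few lines. This is a minimal illustration, assuming documents and queries are already plain lists of term weights; the function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions made here, not part of the slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_u * norm_v)

query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same terms, larger weights
no_shared_terms = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, no_shared_terms))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and vectors with no shared terms score 0.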

Document Distance

- Consider documents d1, d2 with vectors u1, u2
- Their distance is defined as the length of AB, the segment joining the endpoints of u1 and u2

Normalization by Document Length

- The longer the document is, the more likely it is for a given term to appear in it
- Normalize the term weights by document length (so terms in long documents are not given more weight)
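
One common way to apply this normalization is to scale every document vector to unit Euclidean length. The sketch below assumes that choice (the slides do not say which norm is used), and the helper names are hypothetical.

```python
import math

def normalize(u):
    """Scale a term-weight vector to unit Euclidean length (assumed norm)."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u] if norm > 0 else list(u)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]  # same term proportions, ten times longer

raw = euclidean_distance(short_doc, long_doc)
scaled = euclidean_distance(normalize(short_doc), normalize(long_doc))
print(raw, scaled)
```

Before normalization the two documents look far apart only because one is longer; after normalization their distance is essentially zero.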

Evaluation of Cluster Quality

- Clusters can be evaluated using internal or external knowledge
- Internal measures: intra-cluster cohesion and cluster separability
  - Intra-cluster similarity
  - Inter-cluster similarity
- External measures: quality of clusters compared to real classes
  - Entropy (E), Harmonic Mean (F)

Intra Cluster Similarity

- A measure of cluster cohesion
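- Defined as the average pair-wise similarity of documents in a cluster $S$: $\frac{1}{|S|^2} \sum_{d \in S} \sum_{d' \in S} \cos(d, d') = c \cdot c = \lVert c \rVert^2$
- Where $c$: cluster centroid
- Documents (not centroids) have unit length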
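
For unit-length documents, the average pair-wise similarity (self-pairs included) equals the squared length of the centroid, so cohesion can be computed without looping over all pairs. The sketch below checks this numerically on three made-up unit vectors; all names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(docs):
    """Component-wise mean of the document vectors."""
    return [sum(column) / len(docs) for column in zip(*docs)]

def average_pairwise_similarity(docs):
    """Average of d . d' over all ordered pairs, self-pairs included."""
    n = len(docs)
    return sum(dot(a, b) for a in docs for b in docs) / (n * n)

# Three unit-length documents (each has Euclidean norm 1).
docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
c = centroid(docs)

print(average_pairwise_similarity(docs))  # quadratic in cluster size
print(dot(c, c))                          # linear in cluster size
```

Both printed values agree up to floating-point rounding.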

Inter Cluster Similarity

- Single Link: similarity of two most similar members
- Complete Link: similarity of two least similar members
- Group Average: average similarity between members
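
The three definitions differ only in how the member-to-member similarities are aggregated. A minimal sketch, assuming clusters are lists of unit-length vectors and the dot product serves as the similarity; the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def single_link(A, B, sim=dot):
    """Similarity of the two most similar members."""
    return max(sim(a, b) for a in A for b in B)

def complete_link(A, B, sim=dot):
    """Similarity of the two least similar members."""
    return min(sim(a, b) for a in A for b in B)

def group_average(A, B, sim=dot):
    """Average similarity over all cross-cluster pairs."""
    return sum(sim(a, b) for a in A for b in B) / (len(A) * len(B))

A = [[1.0, 0.0], [0.96, 0.28]]
B = [[0.0, 1.0], [0.28, 0.96]]

print(single_link(A, B), group_average(A, B), complete_link(A, B))
```

For any pair of clusters the single-link value is at least the group average, which in turn is at least the complete-link value.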

Entropy

- Measures the quality of flat clusters using external knowledge
  - Pre-existing classification
  - Assessment by experts
- $P_{ij}$: probability that a member of cluster $j$ belongs to class $i$
- The entropy of cluster $j$ is defined as $E_j = -\sum_i P_{ij} \log P_{ij}$

Entropy (con’t)

- Total entropy for all clusters: $E = \sum_{j=1}^{m} \frac{n_j}{N} E_j$
  - Where $n_j$ is the size of cluster $j$
  - $m$ is the number of clusters
  - $N$ is the number of instances
- The smaller the value of $E$, the better the quality of the clustering
- The best entropy ($E = 0$) is obtained trivially when each cluster contains exactly one instance
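
As a rough sketch of the entropy computation, each cluster below is represented simply by the list of true class labels of its members. The natural logarithm is an assumption (the slides do not fix the base), and the function names are hypothetical.

```python
import math
from collections import Counter

def cluster_entropy(class_labels):
    """Entropy of one cluster: -sum_i P_ij * log(P_ij)."""
    n = len(class_labels)
    counts = Counter(class_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def total_entropy(clusters):
    """Size-weighted sum of the per-cluster entropies."""
    N = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / N * cluster_entropy(cluster) for cluster in clusters)

pure = [["a", "a"], ["b", "b"]]    # every cluster holds a single class
mixed = [["a", "b"], ["a", "b"]]   # every cluster is an even mix

print(total_entropy(pure))
print(total_entropy(mixed))
```

Pure clusters give zero entropy, while an even two-class mix gives log 2 per cluster, the worst case for two classes.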

Harmonic Mean (F)

- Treats each cluster as a query result
- F combines precision (P) and recall (R)
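- $F_{ij}$ for cluster $j$ and class $i$ is defined as $F_{ij} = \frac{2PR}{P + R}$, with precision $P = n_{ij} / n_j$ and recall $R = n_{ij} / n_i$
  - $n_{ij}$: number of instances of class $i$ in cluster $j$
  - $n_i$: number of instances of class $i$
  - $n_j$: number of instances of cluster $j$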

Harmonic Mean (con’t)

- The F value of any class $i$ is the maximum value it achieves over all clusters $j$: $F_i = \max_j F_{ij}$
- The F value of a clustering solution is computed as the weighted average over all classes: $F = \sum_i \frac{n_i}{N} F_i$
  - Where $N$ is the number of data instances
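
The F computation can be sketched as follows, again representing each cluster by the list of true class labels of its members. Precision is taken as n_ij / n_j and recall as n_ij / n_i, which is the usual reading of "each cluster as a query result"; the function name is hypothetical.

```python
from collections import Counter

def f_measure(clusters):
    """Overall F: class-size-weighted average of the best F per class."""
    N = sum(len(cluster) for cluster in clusters)
    class_sizes = Counter(label for cluster in clusters for label in cluster)
    total = 0.0
    for label, n_i in class_sizes.items():
        best = 0.0
        for cluster in clusters:
            n_ij = cluster.count(label)
            if n_ij == 0:
                continue  # F is 0 for a cluster with no members of this class
            precision = n_ij / len(cluster)
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / N * best
    return total

perfect = [["a", "a"], ["b", "b"]]
mixed = [["a", "b"], ["a", "b"]]

print(f_measure(perfect))
print(f_measure(mixed))
```

A clustering that reproduces the classes exactly scores 1, and the evenly mixed one scores 0.5.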

Quality of Clustering

- A good clustering method
  - Maximizes intra-cluster similarity
  - Minimizes inter-cluster similarity
  - Minimizes Entropy
  - Maximizes Harmonic Mean
- Difficult to achieve all of these simultaneously
  - Maximize some objective function of the above
- An algorithm is better than another if it has better values on most of these measures

K-means Algorithm

- Select K centroids
- Repeat I times or until the centroids do not change
  - Assign each instance to the cluster represented by its nearest centroid
  - Compute new centroids
  - Reassign instances
  - Compute new centroids
  - …
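
The loop above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, the mean as centroid, and initial centroids drawn at random from the data; `kmeans`, `max_iter` and `seed` are names chosen here.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(points, k, max_iter=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed: random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assign each instance to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so centroids no longer change
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On this toy data the two well-separated groups end up in different clusters; which group receives label 0 depends on the random initialization.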

K-Means demo (slides 1/7 to 7/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

Nikos Hourdakis, MSc Thesis

Comments on K-Means (1)

- Generates a flat partition of K clusters
- K is the desired number of clusters and must be known in advance
- Starts with K random cluster centroids
- A centroid is the mean or the median of a group of instances
- The mean rarely corresponds to a real instance

Comments on K-Means (2)

- Up to I=10 iterations
- Keep the clustering that resulted in the best inter/intra similarity, or the final clusters after I iterations
- Complexity O(IKN)
- A repeated application of K-Means for K=2, 4,… can produce a hierarchical clustering

Choosing Centroids for K-means

- Quality of clustering depends on the selection of initial centroids
- Random selection may result in poor convergence rate, or convergence to sub-optimal clusterings.
- Select good initial centroids using a heuristic or the results of another method
- Buckshot algorithm

Incremental K-Means

- Update each centroid during each iteration after each point is assigned to a cluster rather than at the end of each iteration
- Reassign instances to clusters at the end of each iteration
- Converges faster than simple K-means
- Usually 2-5 iterations

Bisecting K-Means

- Starts with a single cluster with all instances
- Select a cluster to split: the largest cluster or the cluster with the lowest intra-cluster similarity
- The selected cluster is split into 2 partitions using K-means (K=2)
- Repeat up to the desired depth h
- Hierarchical clustering
- Complexity O(2hN)
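
A compact sketch of the bisecting procedure follows. It makes two simplifying assumptions that are not in the slides: the cluster chosen for splitting is always the largest one, and the loop stops at a target number of clusters rather than at a depth h. All function names are hypothetical.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def two_means(points, rng, max_iter=10):
    """Split one cluster into two with plain K-means (K=2)."""
    centroids = rng.sample(points, 2)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if squared_distance(p, centroids[0]) <= squared_distance(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in (0, 1):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    left = [p for p, label in zip(points, labels) if label == 0]
    right = [p for p, label in zip(points, labels) if label == 1]
    return left, right

def bisecting_kmeans(points, n_clusters, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # start with a single cluster of all instances
    while len(clusters) < n_clusters:
        clusters.sort(key=len)
        largest = clusters.pop()  # assumed selection rule: largest cluster
        if len(largest) < 2:
            clusters.append(largest)
            break
        left, right = two_means(largest, rng)
        if not left or not right:  # split failed, e.g. duplicate points
            clusters.append(largest)
            break
        clusters.extend([left, right])
    return clusters

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
clusters = bisecting_kmeans(points, n_clusters=2)
print(clusters)
```

Recording each split instead of only the final list would yield the hierarchy mentioned in the slide.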

Agglomerative Clustering

- Compute the similarity matrix between all pairs of instances
- Starting from singleton clusters
- Repeat until a single cluster remains
  - Merge the two most similar clusters
  - Replace them with a single cluster
  - Replace the merged clusters in the matrix and update the similarity matrix
- Complexity $O(N^2)$
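
The merge loop can be sketched as below. For brevity this version recomputes cluster similarities on every pass instead of maintaining a similarity matrix, so it is slower than the complexity quoted above; it also assumes group-average linkage over dot products and stops at a requested number of clusters rather than at a single cluster. Names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def group_average(A, B):
    """Average similarity over all cross-cluster pairs (assumed linkage)."""
    return sum(dot(a, b) for a in A for b in B) / (len(A) * len(B))

def agglomerative(docs, n_clusters, linkage=group_average):
    clusters = [[d] for d in docs]  # start from singleton clusters
    while len(clusters) > n_clusters:
        best_pair, best_sim = None, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if s > best_sim:
                    best_pair, best_sim = (i, j), s
        i, j = best_pair
        merged = clusters[i] + clusters[j]  # merge the two most similar clusters
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

docs = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]]
clusters = agglomerative(docs, n_clusters=2)
print(clusters)
```

Passing a different `linkage` function (for example one based on `max` or `min`) switches the sketch to single or complete link.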

Similarity Matrix

New Similarity Matrix

Single Link

- Selecting the most similar clusters for merging using single link
- Can result in long and thin clusters due to “chaining effect”
- Appropriate in some domains, such as clustering islands

Complete Link

- Selecting the most similar clusters for merging using complete link
- Results in compact, spherical clusters that are preferable

Group Average

- Selecting the most similar clusters for merging using group average
- Fast compromise between single and complete link

Inter Cluster Similarity

- A new cluster is represented by its centroid
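- The document-to-cluster similarity is computed as the cosine between the document $d$ and the cluster centroid $c$: $\mathrm{sim}(d, c) = \frac{d \cdot c}{\lVert d \rVert \, \lVert c \rVert}$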
- The cluster-to-cluster similarity can be computed as single, complete or group average similarity

Buckshot K-Means

- Combines Agglomerative and K-Means
- Agglomerative results in a good clustering solution but has $O(N^2)$ complexity
- Randomly select a sample of $\sqrt{N}$ instances
- Apply Agglomerative on the sample, which takes $O(N)$ time
- Take the centroids of the k-cluster solution as input to K-Means
- Overall complexity is $O(N)$
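
The combination can be sketched as follows. Several details are assumptions made for the sake of a short example: the agglomerative step merges the two clusters with the closest centroids (Euclidean distance), the sample size is the integer square root of N (raised to k if smaller), and K-means runs for at most 10 iterations. All names are hypothetical.

```python
import math
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def agglomerate(points, k):
    """Naive bottom-up clustering of the sample down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: squared_distance(
            mean(clusters[ij[0]]), mean(clusters[ij[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

def kmeans(points, centroids, max_iter=10):
    """K-means started from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

def buckshot(points, k, seed=0):
    rng = random.Random(seed)
    sample_size = max(k, math.isqrt(len(points)))  # about sqrt(N) instances
    sample = rng.sample(points, sample_size)
    seeds = [mean(cluster) for cluster in agglomerate(sample, k)]
    return kmeans(points, seeds)

points = [[float(x), float(y)] for x in (0, 1) for y in (0, 1, 2, 3)]
points += [[p[0] + 100.0, p[1] + 100.0] for p in list(points)]
labels, centroids = buckshot(points, k=2)
print(labels)
```

Because the seeds come from a random sample, the quality of the starting centroids still depends on how representative that sample is.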

Graph Theoretic Methods

- Two documents with similarity > T (threshold) are connected with an edge [Duda&Hart73]
- Clusters: the connected components (or, alternatively, the maximal cliques) of the resulting graph
- Problem: selection of an appropriate threshold T
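
The connected-components variant can be sketched with a simple graph traversal. This example assumes the dot product of unit-length vectors as the similarity and returns clusters as lists of document indices; the function name is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def threshold_clusters(docs, threshold, sim=dot):
    """Connected components of the graph with an edge wherever sim > threshold."""
    n = len(docs)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim(docs[i], docs[j]) > threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(component))
    return clusters

docs = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]]
print(threshold_clusters(docs, threshold=0.9))
print(threshold_clusters(docs, threshold=0.5))
```

With T = 0.9 the documents fall into two clusters; lowering T to 0.5 adds one edge between the groups and everything collapses into a single component, which illustrates how sensitive the result is to the choice of T.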

Information Retrieval Models

Zahn’s method [Zahn71]

Figure: the dashed edge is inconsistent and is deleted

- Find the minimum spanning tree
- For each doc delete edges with length l > lavg
  - E.g., lavg: average distance of its incident edges
- Or remove the longest edge (1 edge removed => 2 clusters, 2 edges removed => 3 clusters)
- Clusters: the connected components of the graph
- http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/1999/clustering/node20.html
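
The sketch below covers only the simpler "remove the longest edges" variant, not the lavg inconsistency test. It assumes Euclidean distances between points, builds the minimum spanning tree with Prim's algorithm, and returns clusters as lists of point indices; the function names are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete graph; returns (length, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        length, i, j = min(
            (math.dist(points[a], points[b]), a, b)
            for a in in_tree for b in range(n) if b not in in_tree
        )
        edges.append((length, i, j))
        in_tree.add(j)
    return edges

def mst_clusters(points, n_clusters):
    edges = sorted(mst_edges(points))
    kept = edges[: len(edges) - (n_clusters - 1)]  # drop the longest edges
    parent = list(range(len(points)))

    def find(x):  # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_clusters(points, n_clusters=2))
```

Removing the single longest tree edge separates the two groups, matching the "1 edge removed => 2 clusters" rule.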

Cluster Searching

- The M-dimensional query vector is compared with the cluster-centroids
- search closest cluster
- retrieve documents with similarity > T
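
The two-step search can be sketched as below, assuming each cluster is stored as a (centroid, documents) pair and cosine similarity is used for both steps. The data and the function names are made up for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def cluster_search(query, clusters, threshold):
    """clusters is a list of (centroid, documents) pairs."""
    # Step 1: pick the cluster whose centroid is most similar to the query.
    _, documents = max(clusters, key=lambda pair: cosine(query, pair[0]))
    # Step 2: within that cluster only, keep documents with similarity > T.
    return [d for d in documents if cosine(query, d) > threshold]

clusters = [
    ([0.8, 0.2], [[1.0, 0.0], [0.6, 0.4]]),
    ([0.2, 0.8], [[0.0, 1.0], [0.4, 0.6]]),
]
hits = cluster_search([1.0, 0.1], clusters, threshold=0.9)
print(hits)
```

Only the documents of the closest cluster are compared with the query, which is where the speed-up over scanning the whole collection comes from.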

Soft Clustering

- Hard clustering: each instance belongs to exactly one cluster
  - Does not allow for uncertainty: in practice an instance may belong to two or more clusters
- Soft clustering is based on probabilities that an instance belongs to each of a set of clusters
  - Probabilities of all categories must sum to 1
  - Expectation Maximization (EM) is the most popular approach

References

- "Searching Multimedia Databases by Content", C. Faloutsos, Kluwer Academic Publishers, 1996
- "A Comparison of Document Clustering Techniques", M. Steinbach, G. Karypis, V. Kumar, KDD Workshop on Text Mining, 2000
- "Data Clustering: A Review", A.K. Jain, M.N. Murty, P.J. Flynn, ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999
- "Algorithms for Clustering Data", A.K. Jain, R.C. Dubes, Prentice-Hall, 1988, ISBN 0-13-022278-X
- "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer", G. Salton, Addison-Wesley, 1989
