
CS728 Clustering the Web Lecture 13


Presentation Transcript


  1. CS728 Clustering the Web, Lecture 13

  2. What is clustering? • Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects • It is the most common form of unsupervised learning • Unsupervised learning = learning from raw data, as opposed to supervised learning, where the correct classification of each example is given • It is a common and important task that finds many applications in Web Science, IR and other places

  3. Why cluster web documents? • Whole web navigation • Better user interface • Improving recall in web search • Better search results • For better navigation of search results • Effective “user recall” will be higher • For speeding up retrieval • Faster search

  4. Yahoo! Tree Hierarchy • [figure: a topic tree rooted at www.yahoo.com/Science … (30), with subtopics such as agriculture, biology, physics, CS, space, and further subtopics such as dairy, crops, agronomy, forestry, botany, cell, evolution, craft, magnetism, relativity, AI, HCI, courses, missions] • CS Research Question: Given a set of related objects (e.g., webpages), find the best tree decomposition. Best: helps a user find/retrieve the object of interest

  5. Scatter/Gather Method for Browsing a Large Collection (SIGIR ’92), Cutting, Karger, Pedersen, and Tukey • Users browse a document collection interactively by selecting subsets of documents that are re-clustered on-the-fly

  6. For improving search recall • Cluster hypothesis - Documents with similar text are related • Therefore, to improve search recall: • Cluster docs in corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • The hope: the query “car” will also return docs containing “automobile” • Because clustering grouped together docs containing “car” with those containing “automobile” (sketched below)
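A minimal sketch of this recall trick, assuming a hypothetical offline clustering represented by `cluster_of` (doc id → cluster id) and `members` (cluster id → set of doc ids):

```python
# Sketch: expand query results with co-clustered documents (cluster hypothesis).
# cluster_of and members are hypothetical outputs of an offline clustering step.
def expand_with_clusters(matched_docs, cluster_of, members):
    """Return the matched docs plus every doc sharing a cluster with one of them."""
    expanded = set(matched_docs)
    for d in matched_docs:
        expanded |= members[cluster_of[d]]
    return expanded

# "car" matches doc 1; doc 2 (containing "automobile") sits in the same cluster.
cluster_of = {1: "A", 2: "A", 3: "B"}
members = {"A": {1, 2}, "B": {3}}
print(expand_with_clusters({1}, cluster_of, members))   # {1, 2}
```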

  7. For better navigation of search results • For grouping search results thematically • clusty.com / Vivisimo

  8. For better navigation of search results • And more visually: Kartoo.com

  9. Defining What Is Good Clustering • Internal criterion: A good clustering will produce high quality clusters in which: • the intra-class (that is, intra-cluster) similarity is high • the inter-class similarity is low • The measured quality of a clustering depends on both the document representation and the similarity measure used • External criterion: The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes • Assessable with gold standard data

  10. External Evaluation of Cluster Quality • Assesses clustering with respect to ground truth • Assume that there are C gold standard classes, while our clustering algorithm produces k clusters π1, π2, …, πk, where cluster πi has ni members • Simple measure: purity, the ratio between the size of the dominant class in cluster πi and the size of cluster πi • Others are the entropy of classes within clusters (or the mutual information between classes and clusters)
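In the usual notation (gold classes c1, …, cC), the per-cluster purity from the slide, together with the standard size-weighted overall purity, is:

```latex
\mathrm{purity}(\pi_i) = \frac{1}{n_i}\,\max_{j}\,|\pi_i \cap c_j|
\qquad\qquad
\mathrm{purity} = \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{purity}(\pi_i)
```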

  11. Purity • [figure: three clusters of 6, 6, and 5 points drawn from three classes] • Cluster I: Purity = (1/6) · max(5, 1, 0) = 5/6 • Cluster II: Purity = (1/6) · max(1, 4, 1) = 4/6 • Cluster III: Purity = (1/5) · max(2, 0, 3) = 3/5

  12. Issues for clustering • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small • In an application, if a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

  13. What makes docs “related”? • Ideal: semantic similarity. • Practical: statistical similarity • We will typically use cosine similarity (closely related to the Pearson correlation when vectors are mean-centered). • Docs as vectors. • For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs. • We will describe algorithms in terms of cosine similarity.

  14. Recall doc as vector • Each doc j is a vector of tfidf values, one component for each term. • Can normalize to unit length. • So we have a vector space • terms are axes - aka features • n docs live in this space • even with stemming, may have 20,000+ dimensions • do we really want to use all terms? • Different from using vector space for search. Why?
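A small illustration of docs as unit-length tf-idf vectors, using scikit-learn (a library choice of mine, not the lecture's); with unit vectors, pairwise cosine similarity is just a matrix of dot products:

```python
# Sketch: tf-idf document vectors normalized to unit length,
# so pairwise cosine similarity reduces to dot products.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the car is fast", "an automobile on the road", "botany and cell biology"]
X = TfidfVectorizer(norm="l2").fit_transform(docs)   # rows are unit-length vectors
cosine = (X @ X.T).toarray()                         # pairwise cosine similarities
print(cosine.round(2))
```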

  15. Intuition • [figure: documents D1–D4 plotted as vectors over term axes t1, t2, t3] • Postulate: Documents that are “close together” in vector space talk about the same things.

  16. Clustering Algorithms • Partitioning “flat” algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k means/medoids clustering • Model based clustering • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive

  17. k-Clustering Algorithms • Given: a set of documents and the number k • Find: a partition into k clusters that optimizes the chosen partitioning criterion • Globally optimal: exhaustively enumerate all partitions • Effective heuristic methods: • Iterative k-means and k-medoids algorithms • Hierarchical methods – stop at level with k parts

  18. How hard is clustering? • One idea is to consider all possible clusterings, and pick the one that has best inter and intra cluster distance properties • Suppose we are given n points, and would like to cluster them into k-clusters • How many possible clusterings?
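For reference, the number of ways to partition n points into k non-empty clusters is the Stirling number of the second kind, which already for modest n and k rules out exhaustive enumeration:

```latex
S(n,k) \;=\; \frac{1}{k!}\sum_{j=0}^{k}(-1)^{j}\binom{k}{j}(k-j)^{n}
\;\approx\; \frac{k^{n}}{k!} \quad \text{for fixed } k \text{ and large } n
```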

  19. Clustering Criteria: Maximum Spacing • [figure: a k = 4 clustering with the spacing between clusters marked] • Spacing between clusters is defined as the minimum distance between any pair of points in different clusters. • Clustering of maximum spacing: given an integer k, find a k-clustering of maximum spacing.

  20. Greedy Clustering Algorithm • Single-link k-clustering algorithm: • Form a graph on the vertex set U of the n points, initially corresponding to n singleton clusters. • Find the closest pair of objects such that each object is in a different cluster, and add an edge between them. • Repeat n−k times, until there are exactly k clusters. • Key observation. This procedure is precisely Kruskal's algorithm for Minimum-Cost Spanning Tree (except we stop when there are k connected components). • Remark. Equivalent to finding an MST and deleting the k−1 most expensive edges.
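A minimal Python sketch of this greedy single-link procedure using a union-find structure (names and the distance callback are my own choices):

```python
import itertools

def single_link_k_clustering(points, k, dist):
    """Greedy (Kruskal-style) clustering: repeatedly merge the closest pair of
    points lying in different clusters, until exactly k clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed in Kruskal's order (cheapest first).
    edges = sorted(itertools.combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    components = len(points)
    for i, j in edges:
        if components == k:          # stop at k connected components
            break
        ri, rj = find(i), find(j)
        if ri != rj:                 # endpoints lie in different clusters: merge
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Toy example on a line: k = 2 separates {0, 1, 2} from {10, 11}.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
print(single_link_k_clustering(pts, 2, lambda a, b: abs(a - b)))
```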

  21. Greedy Clustering Analysis • [figure: clusters Cs, Ct of C and cluster C*r of C*, with points pi, pj and an edge (p, q) crossing from Cs to Ct] • Theorem. Let C* denote the clustering C*1, …, C*k formed by deleting the k−1 most expensive edges of an MST. C* is a k-clustering of max spacing. • Pf. Let C denote some other clustering C1, …, Ck. • The spacing of C* is the length d* of the (k−1)st most expensive edge. • Since C is not C*, there is a pair pi, pj in the same cluster in C*, say C*r, but in different clusters in C, say Cs and Ct. • Some edge (p, q) on the pi–pj path in C*r spans two different clusters in C. • All edges on the pi–pj path have length ≤ d*, since Kruskal chose them. • Spacing of C is ≤ d*, since p and q are in different clusters. ▪

  22. Hierarchical Agglomerative Clustering (HAC) • Greedy is one example of HAC • Assumes a similarity function for determining the similarity of two instances. • Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. • The history of merging forms a binary tree or hierarchy.

  23. A Dendrogram: Hierarchical Clustering • Dendrogram: Decomposes data objects into several levels of nested partitioning (a tree of clusters). • A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

  24. HAC Algorithm • Start with all instances in their own cluster. • Until there is only one cluster: • Among the current clusters, determine the two clusters, ci and cj, that are most similar. • Replace ci and cj with a single cluster ci ∪ cj.
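A compact sketch of this merge loop over a precomputed similarity matrix; for concreteness it uses the single-link update, so each merge only needs a row maximum (names and the toy data are mine):

```python
import numpy as np

def naive_hac(sim):
    """Agglomerative clustering over an n x n similarity matrix.
    Returns the merge history as (cluster_i, cluster_j) index pairs."""
    sim = sim.astype(float)
    np.fill_diagonal(sim, -np.inf)       # a cluster is never merged with itself
    active = set(range(len(sim)))
    merges = []
    while len(active) > 1:
        # Most similar pair among the currently active clusters.
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda p: sim[p])
        merges.append((i, j))
        # Single-link update: the merged cluster's row is the max of the two rows.
        sim[i, :] = np.maximum(sim[i, :], sim[j, :])
        sim[:, i] = sim[i, :]
        sim[i, i] = -np.inf
        active.remove(j)                 # cluster j is absorbed into cluster i
    return merges

# Toy run on four items: items 0,1 and items 2,3 are mutually similar.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(naive_hac(S))                      # [(0, 1), (2, 3), (0, 2)]
```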

  25. Hierarchical Clustering algorithms • Agglomerative (bottom-up): • Start with each document being a single cluster. • Eventually all documents belong to the same cluster. • Divisive (top-down): • Start with all documents belonging to the same cluster. • Eventually each document forms a cluster of its own. • Does not require the number of clusters k in advance • Needs a termination/readout condition • The final state in both the agglomerative and divisive cases (one big cluster, or all singletons) is of no use by itself.

  26. Dendrogram: Document Example • [figure: dendrogram over documents d1–d5, with merges d1,d2 and d4,d5, then d3,d4,d5] • As clusters agglomerate, docs are likely to fall into a hierarchy of “topics” or concepts.

  27. Hierarchical Clustering • Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest? • Max spacing • Measure intercluster distances by distances of nearest pairs. • Euclidean spacing • each cluster has a centroid = average of its points. • Measure intercluster distances by distances of centroids.

  28. “Closest pair” of clusters • Many variants to defining closest pair of clusters • Single-link or max spacing • Similarity of the most similar pair (same as Kruskal’s MST algorithm) • “Center of gravity” • Clusters whose centroids (centers of gravity) are the most similar • Average-link • Average similarity between pairs of elements • Complete-link • Similarity of the “furthest” points, the least similar
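These variants correspond directly to the `method` argument of SciPy's hierarchical clustering, which is a convenient way to compare them (a sketch on toy dense vectors; SciPy is my choice, not the lecture's):

```python
# Sketch: comparing linkage criteria with SciPy's hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)                 # 20 docs, 5 features (toy data)
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)         # merge history (dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
    print(method, labels)
```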

  29. Impact of Choice of Similarity Measure • Single Link Clustering • Can result in “straggly” (long and thin) clusters due to the chaining effect. • Appropriate in some domains, such as clustering islands: “Hawaii clusters” • Uses the min-distance / max-similarity update • After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))

  30. Single Link Example

  31. Complete-Link Clustering • Use the minimum similarity (max distance) of pairs: • Makes “tighter,” more spherical clusters that are sometimes preferable. • After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

  32. Complete Link Example

  33. Computational Complexity • In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²). • In each of the subsequent n−2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters. • Similarities to clusters not involved in the merge are unchanged and can simply be stored. • In order to maintain overall O(n²) performance, computing the similarity to each other cluster must be done in constant time. • Otherwise O(n² log n), or O(n³) if done naively.

  34. Key notion: cluster representative • We want a notion of a representative point in a cluster • Representative should be some sort of “typical” or central point in the cluster, e.g., • point inducing smallest radii to docs in cluster • smallest squared distances, etc. • point that is the “average” of all docs in the cluster • Centroid or center of gravity

  35. Example: n = 6, k = 3, closest pair of centroids • [figure: documents d1–d6 with the centroids marked after the first and second merge steps]

  36. Outliers in centroid computation • Can ignore outliers when computing the centroid. • What is an outlier? • Lots of statistical definitions, e.g., moment of a point to the centroid > M × some cluster moment. • [figure: a cluster with its centroid and a distant outlier point]

  37. Group-Average Clustering • Uses average similarity across all pairs within the merged cluster to measure the similarity of two clusters. • Compromise between single and complete link. • Two options: • Averaged across all ordered pairs in the merged cluster • Averaged over all pairs between the two original clusters • Some previous work has used one of these options; some the other. No clear difference in efficacy

  38. Computing Group Average Similarity • Assume cosine similarity and normalized vectors with unit length. • Always maintain sum of vectors in each cluster. • Compute similarity of clusters in constant time:
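The formula on this slide did not survive the transcript; the standard constant-time expression, assuming unit-length document vectors and writing s(ω) for the maintained vector sum and N for cluster sizes, is:

```latex
\mathrm{sim\text{-}GA}(\omega_i,\omega_j)
  \;=\;
  \frac{\bigl(\vec{s}(\omega_i)+\vec{s}(\omega_j)\bigr)\cdot\bigl(\vec{s}(\omega_i)+\vec{s}(\omega_j)\bigr)\;-\;(N_i+N_j)}
       {(N_i+N_j)\,(N_i+N_j-1)}
```

The numerator is the sum of dot products over all pairs of distinct documents in the merged cluster (subtracting the Ni + Nj self-similarities of unit vectors), and the denominator counts those ordered pairs.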

  39. Medoid As Cluster Representative • The centroid does not have to be a document. • Medoid: A cluster representative that is one of the documents • For example: the document closest to the centroid • One reason this is useful • Consider the representative of a large cluster (>1000 documents) • The centroid of this cluster will be a dense vector • The medoid of this cluster will be a sparse vector • Compare: mean/centroid vs. median/medoid
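A small sketch of the "document closest to the centroid" choice, assuming rows of X are unit-length document vectors:

```python
import numpy as np

def medoid_index(X):
    """Return the index of the document closest (by cosine) to the centroid.
    X: (n_docs, n_terms) array of unit-length document vectors."""
    centroid = X.mean(axis=0)            # dense even if the docs are sparse
    sims = X @ centroid                  # cosine, up to the centroid's norm
    return int(np.argmax(sims))

X = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
print(medoid_index(X))                   # doc 1 lies nearest the centroid
```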

  40. Homework Exercise • Consider different agglomerative clustering methods of n points on a line. Explain how you could avoid n³ distance computations - how many will your scheme use?

  41. Efficiency: “Using approximations” • In standard algorithm, must find closest pair of centroids at each step • Approximation: instead, find nearly closest pair • use some data structure that makes this approximation easier to maintain • simplistic example: maintain closest pair based on distances in projection on a random line • [figure: points projected onto a random line]
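A toy sketch of the random-line idea: project the centroids onto a random direction, sort, and only compare neighbours in that order. This is my own simplification of the slide's picture and can miss the true closest pair:

```python
import numpy as np

def approx_closest_pair(centroids, rng=np.random.default_rng(0)):
    """Approximate closest pair: project onto a random direction, sort,
    and only compare centroids that are adjacent in the projection."""
    direction = rng.normal(size=centroids.shape[1])
    order = np.argsort(centroids @ direction)
    best = None
    for a, b in zip(order[:-1], order[1:]):
        d = np.linalg.norm(centroids[a] - centroids[b])
        if best is None or d < best[0]:
            best = (d, int(a), int(b))
    return best   # (distance, index_a, index_b); may miss the exact closest pair

print(approx_closest_pair(np.random.default_rng(1).random((10, 3))))
```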

  42. The dual space • So far, we clustered docs based on their similarities in term space • For some applications, e.g., topic analysis for inducing navigation structures, can “dualize”: • use docs as axes • represent users or terms as vectors • proximity based on co-occurrence of usage • now clustering users or terms, not docs

  43. Next time • Iterative clustering using K-means • Spectral clustering using eigenvalues
