

  1. Hinrich Schütze and Christina Lioma Lecture 16: Flat Clustering

  2. Outline • Recap • Clustering: Introduction • Clustering in IR • K-means • Evaluation • How many clusters?

  3. Clustering: Definition • (Document) clustering is the process of grouping a set of documents into clusters of similar documents. • Documents within a cluster should be similar. • Documents from different clusters should be dissimilar. • Clustering is the most common form of unsupervised learning. • Unsupervised = there are no labeled or annotated data.

  4. Data set with clear cluster structure • Propose an algorithm for finding the cluster structure in this example

  5. Classification vs. Clustering • Classification: supervised learning • Clustering: unsupervised learning • Classification: Classes are human-defined and part of the input to the learning algorithm. • Clustering: Clusters are inferred from the data without human input. • However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .

  6. Outline • Recap • Clustering: Introduction • Clustering in IR • K-means • Evaluation • How many clusters?

  7. The cluster hypothesis • Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs. • All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis. • Van Rijsbergen’s original wording: “closely associated documents tend to be relevant to the same requests”.

  8. Applications of clustering in IR

  9. Search result clustering for better navigation

  10. Scatter-Gather

  11. Global navigation: Yahoo

  12. Global navigation: MeSH (upper level)

  13. Global navigation: MeSH (lower level)

  14. Navigational hierarchies: Manual vs. automatic creation • Note: Yahoo/MeSH are not examples of clustering. • But they are well-known examples of using a global hierarchy for navigation. • Some examples of global navigation/exploration based on clustering: • Cartia • Themescapes • Google News

  15. Global navigation combined with visualization (1)

  16. Global navigation combined with visualization (2)

  17. Global clustering for navigation: Google News • http://news.google.com

  18. Clustering for improving recall • To improve search recall: • Cluster docs in collection a priori • When a query matches a doc d, also return other docs in the cluster containing d (see the sketch below) • Hope: if we do this, the query “car” will also return docs containing “automobile” • Because the clustering algorithm groups together docs containing “car” with those containing “automobile”. • Both types of documents contain words like “parts”, “dealer”, “mercedes”, “roadtrip”.
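A minimal sketch of this idea, assuming the clustering has already been computed; the names expand_with_clusters, clusters, and cluster_of are illustrative, not from the lecture:

```python
# Hypothetical sketch of cluster-based result expansion.
# cluster_of maps doc id -> cluster id; clusters maps cluster id -> set of doc ids.

def expand_with_clusters(matching_docs, cluster_of, clusters):
    """Return the matching docs plus all other docs in their clusters."""
    expanded = set(matching_docs)
    for d in matching_docs:
        expanded |= clusters[cluster_of[d]]   # add the rest of d's cluster
    return expanded

# Toy example: doc 2 mentions "car", doc 3 mentions "automobile".
# If clustering put them in the same cluster, a query matching doc 2 also returns doc 3.
clusters = {0: {1}, 1: {2, 3}}
cluster_of = {1: 0, 2: 1, 3: 1}
print(expand_with_clusters({2}, cluster_of, clusters))   # {2, 3}
```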

  19. Data set with clear cluster structure • Propose an algorithm for finding the cluster structure in this example

  20. Desiderata for clustering • General goal: put related docs in the same cluster, put unrelated docs in different clusters. • How do we formalize this? • The number of clusters should be appropriate for the data set we are clustering. • Initially, we will assume the number of clusters K is given. • Later: Semiautomatic methods for determining K • Secondary goals in clustering • Avoid very small and very large clusters • Define clusters that are easy to explain to the user • Many others . . .

  21. Flat vs. Hierarchical clustering • Flat algorithms • Usually start with a random (partial) partitioning of docs into groups • Refine iteratively • Main algorithm: K-means • Hierarchical algorithms • Create a hierarchy • Bottom-up, agglomerative • Top-down, divisive

  22. Hard vs. Soft clustering • Hard clustering: Each document belongs to exactly one cluster. • More common and easier to do • Soft clustering: A document can belong to more than one cluster. • Makes more sense for applications like creating browsable hierarchies • You may want to put sneakers in two clusters: • sports apparel • shoes • You can only do that with a soft clustering approach. • We will do flat, hard clustering only in this class. • See IIR 16.5, IIR 17, IIR 18 for soft clustering and hierarchical clustering

  23. Flat algorithms • Flat algorithms compute a partition of N documents into a set of K clusters. • Given: a set of documents and the number K • Find: a partition into K clusters that optimizes the chosen partitioning criterion • Global optimization: exhaustively enumerate all partitions, pick the optimal one • Not tractable: the number of possible partitions grows exponentially in N • Effective heuristic method: K-means algorithm

  24. Outline • Recap • Clustering: Introduction • Clustering in IR • K-means • Evaluation • How many clusters?

  25. K-means • Perhaps the best known clustering algorithm • Simple, works well in many cases • Use as default / baseline for clustering documents

  26. Document representations in clustering • Vector space model • As in vector space classification, we measure relatedness between vectors by Euclidean distance . . . • . . . which is almost equivalent to cosine similarity. • Almost: centroids are not length-normalized.
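A quick numeric check of the “almost equivalent” claim: for length-normalized vectors, the squared Euclidean distance equals 2 − 2·cos(x, y), so ranking by distance and ranking by cosine similarity agree. A small sketch using numpy (not from the lecture):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])

# Length-normalize both vectors.
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

cos = float(x @ y)                        # cosine similarity
euclid_sq = float(np.sum((x - y) ** 2))   # squared Euclidean distance

# For unit vectors: ||x - y||^2 = 2 - 2 * cos(x, y).
print(euclid_sq, 2 - 2 * cos)             # the two values agree
```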

  27. K-means • Each cluster in K-means is defined by a centroid. • Objective/partitioning criterion: minimize the average squared difference from the centroid • Recall the definition of the centroid: μ(ω) = (1/|ω|) Σ_{x ∈ ω} x, where we use ω to denote a cluster. • We try to find the minimum average squared difference by iterating two steps: • reassignment: assign each vector to its closest centroid • recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment • (A runnable sketch follows the algorithm slide below.)

  28. K-means algorithm
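The original slide shows the K-means pseudocode as a figure, which did not survive the transcript. A minimal runnable sketch of the two steps described above, assuming plain numpy, random seed selection, and a fixed number of iterations (the function name and stopping rule are illustrative, not the lecture's exact pseudocode):

```python
import numpy as np

def kmeans(X, K, iterations=10, seed=0):
    """Flat, hard K-means: alternate reassignment and recomputation.

    X: (N, d) array of document vectors; K: number of clusters.
    Returns (centroids, assignment).
    """
    rng = np.random.default_rng(seed)
    # Initialization: pick K documents at random as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(iterations):
        # Reassignment: assign each vector to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)

        # Recomputation: each centroid becomes the mean of its assigned vectors.
        for k in range(K):
            members = X[assignment == k]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)

    return centroids, assignment

# Tiny example: two obvious groups in 2D.
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
centroids, assignment = kmeans(X, K=2)
print(assignment)   # e.g. [0 0 1 1] (cluster labels may be permuted)
```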

  29. Worked Example: Set of points to be clustered

  30. Worked Example: Random selection of initial centroids • Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters
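The points of the worked example are a figure and are not in the transcript; as an illustration of part (ii) of the exercise, a cluster centroid is just the componentwise mean of its members (the coordinates below are made up, not those on the slide):

```python
import numpy as np

# Hypothetical clusters; the slide's actual points are not reproduced here.
cluster_a = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
cluster_b = np.array([[6.0, 5.0], [7.0, 5.0]])

# The centroid of a cluster is the mean of its member vectors.
print(cluster_a.mean(axis=0))   # [1.333... 1.333...]
print(cluster_b.mean(axis=0))   # [6.5 5. ]
```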

  31. Worked Example: Assign points to closest centroid

  32. Worked Example: Assignment

  33. Worked Example: Recompute cluster centroids

  34. Worked Example: Assign points to closest centroid

  35. Worked Example: Assignment

  36. Worked Example: Recompute cluster centroids

  37. Worked Example: Assign points to closest centroid

  38. Worked Example: Assignment

  39. Worked Example: Recompute cluster centroids

  40. Worked Example: Assign points to closest centroid

  41. Worked Example: Assignment

  42. Worked Example: Recompute cluster centroids

  43. Worked Example: Assign points to closest centroid

  44. Worked Example: Assignment

  45. Worked Example: Recompute cluster centroids

  46. Worked Example: Assign points to closest centroid

  47. Worked Example: Assignment

  48. Worked Example: Recompute cluster centroids

  49. Worked Example: Assign points to closest centroid

  50. Worked Example: Assignment
