
Clustering



  1. Clustering • Supervised vs. Unsupervised Learning • Examples of clustering in Web IR • Characteristics of clustering • Clustering algorithms • Cluster Labeling

  2. Supervised vs. Unsupervised Learning • Supervised Learning • Goal: A program that performs a task as well as humans. • TASK – well defined (the target function) • EXPERIENCE – training data provided by a human • PERFORMANCE – error/accuracy on the task • Unsupervised Learning • Goal: To find some kind of structure in the data. • TASK – vaguely defined • No EXPERIENCE • No PERFORMANCE (but there are some evaluation metrics)

  3. What is Clustering? • Clustering is the most common form of Unsupervised Learning • Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects • It can be used in IR: • To improve recall in search applications • For better navigation of search results

  4. Example 1: Improving Recall • Cluster hypothesis - Documents with similar text are related • Thus, when a query matches a document D, also return other documents in the cluster containing D.
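A minimal sketch of how this can be applied at query time, assuming a document-to-cluster assignment is already available (the data structures and names below are hypothetical):

```python
# Cluster-based result expansion: return the matched documents plus the
# other members of their clusters, so related documents are also retrieved.

def expand_with_clusters(matched_docs, doc_to_cluster, cluster_to_docs):
    expanded = set(matched_docs)
    for doc in matched_docs:
        cluster = doc_to_cluster[doc]              # the cluster containing D
        expanded.update(cluster_to_docs[cluster])  # add its co-clustered docs
    return expanded

# Example: the query matched d1 only, but d2 shares a cluster with d1.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1}
cluster_to_docs = {0: {"d1", "d2"}, 1: {"d3"}}
print(expand_with_clusters(["d1"], doc_to_cluster, cluster_to_docs))  # {'d1', 'd2'}
```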

  5. Example 2: Better Navigation

  6. Clustering Characteristics • Flat versus Hierarchical Clustering • Flat means dividing objects into groups (clusters) • Hierarchical means organizing clusters into a subsuming hierarchy • Evaluating Clustering • Internal Criteria • The intra-cluster similarity is high (tightness) • The inter-cluster similarity is low (separateness) • External Criteria • Did we discover the hidden classes? (we need gold standard data for this evaluation)

  7. Clustering for Web IR • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small

  8. Recall documents as vectors • Each doc j is a vector of tfidf values, one component for each term. • Can normalize to unit length. • So we have a vector space • terms are axes - aka features • n docs live in this space • even with stemming, may have 20,000+ dimensions
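As a rough illustration of this representation, here is a sketch using scikit-learn's TfidfVectorizer (assuming scikit-learn is available; the example documents are made up):

```python
# Build unit-length tf-idf vectors: one component per term, one row per doc.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "jaguar runs fast in the jungle",
    "the jaguar car has a fast engine",
    "clustering groups similar documents",
]

vectorizer = TfidfVectorizer(norm="l2")   # l2 normalization -> unit-length vectors
X = vectorizer.fit_transform(docs)        # sparse matrix of shape (n_docs, n_terms)

print(X.shape)                            # terms are the axes (features)
print(vectorizer.get_feature_names_out()[:5])
```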

  9. What makes documents related? • Ideal: semantic similarity. • Practical: statistical similarity • Documents as vectors • We will use cosine similarity, also known as the normalized inner product, and describe the algorithms in terms of it.
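A small numpy sketch of cosine similarity as the normalized inner product (the vectors here are illustrative only):

```python
# Cosine similarity: dot product divided by the product of vector lengths.
import numpy as np

def cosine(u, v):
    # Equals the plain dot product when both vectors are unit length.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 4.0, 0.0])
d3 = np.array([0.0, 0.0, 3.0])

print(cosine(d1, d2))  # 1.0 -> same direction, "about the same things"
print(cosine(d1, d3))  # 0.0 -> no terms in common
```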

  10. Intuition for relatedness • Documents that are “close together” in vector space talk about the same things. (Figure: documents D1–D4 plotted in a two-dimensional term space with axes t1 and t2.)

  11. Clustering Algorithms • Partitioning “flat” algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k-means clustering • Model based clustering (we will not cover it) • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive (we will not cover it)

  12. Partitioning “flat” algorithms • Partitioning method: Construct a partition of n documents into a set of k clusters • Given: a set of documents and the number k • Find: a partition of k clusters that optimizes the chosen partitioning criterion Watch animation of k-means

  13. K-means • Assumes documents are real-valued vectors. • Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x ∈ c} x • Reassignment of instances to clusters is based on distance to the current cluster centroids.

  14. K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until the clustering converges or another stopping criterion is met:
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster cj: sj = μ(cj)
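A compact Python sketch of this loop, assuming Euclidean distance on numpy arrays; it is illustrative rather than a production implementation (empty clusters, for instance, are not handled):

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    rng = np.random.default_rng(rng)
    # Select k random instances as seeds.
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each instance to the cluster whose seed/centroid is nearest.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid (mean) of its cluster.
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_seeds, seeds):   # stop when centroids do not change
            break
        seeds = new_seeds
    return labels, seeds

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
labels, centroids = kmeans(X, k=2, rng=0)
print(labels, centroids)
```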

  15. K-means: Different Issues • When to stop? • When a fixed number of iterations is reached • When centroid positions do not change • Seed Choice • Results can vary based on random seed selection. • Try out multiple starting points. • Example showing sensitivity to seeds (points A–F in the figure): if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
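Because of this sensitivity, a common remedy is to run several random restarts and keep the run with the smallest within-cluster sum of squared distances. A sketch, reusing the kmeans function from the previous example:

```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10):
    best_labels, best_cost = None, np.inf
    for seed in range(n_restarts):
        labels, centroids = kmeans(X, k, rng=seed)
        # Within-cluster sum of squared distances to the centroids.
        cost = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost
```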

  16. Hierarchical clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. (Figure: example dendrogram, with animal splitting into vertebrate {fish, reptile, amphibian, mammal} and invertebrate {worm, insect, crustacean}.)

  17. Hierarchical Agglomerative Clustering (HAC) • We assume there is a similarity function that determines the similarity of two instances. • Algorithm:
Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
Watch animation of HAC
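A minimal sketch of this loop; as the “most similar cluster” criterion it uses cosine similarity between cluster centroids (one of the options discussed on the next slide), and the quadratic search per merge is kept deliberately naive:

```python
import numpy as np

def hac(X):
    clusters = [[i] for i in range(len(X))]   # start: every instance in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters by centroid cosine similarity.
        best, best_sim = None, -np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = X[clusters[a]].mean(axis=0)
                cb = X[clusters[b]].mean(axis=0)
                sim = ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))
                if sim > best_sim:
                    best, best_sim = (a, b), sim
        a, b = best
        merges.append((clusters[a], clusters[b]))
        # Replace ci and cj with the single merged cluster ci ∪ cj.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges   # the sequence of merges defines the dendrogram

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(hac(X))
```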

  18. What is the most similar cluster? • Single-link • Similarity of the most cosine-similar (single-link) • Complete-link • Similarity of the “furthest” points, the least cosine-similar • Group-average agglomerative clustering • Average cosine between pairs of elements • Centroid clustering • Similarity of clusters’ centroids

  19. Single link clustering • 1) Use the maximum similarity of pairs: sim(ci, cj) = max over x ∈ ci, y ∈ cj of sim(x, y) • 2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim(ci ∪ cj, ck) = max( sim(ci, ck), sim(cj, ck) )

  20. Complete link clustering • 1) Use the minimum similarity of pairs: sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y) • 2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim(ci ∪ cj, ck) = min( sim(ci, ck), sim(cj, ck) )
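The single-link and complete-link updates differ only in taking a max or a min of the two old similarities. A sketch, where the similarity bookkeeping (a dict keyed by frozensets of cluster ids) is a hypothetical choice:

```python
def merged_similarity(sim, ci, cj, ck, linkage="single"):
    # Similarity of the merged cluster (ci ∪ cj) to another cluster ck.
    s_ik = sim[frozenset((ci, ck))]
    s_jk = sim[frozenset((cj, ck))]
    if linkage == "single":      # single link: the most similar pair survives
        return max(s_ik, s_jk)
    return min(s_ik, s_jk)       # complete link: the least similar pair

sim = {frozenset((1, 3)): 0.8, frozenset((2, 3)): 0.2}
print(merged_similarity(sim, 1, 2, 3, "single"))    # 0.8
print(merged_similarity(sim, 1, 2, 3, "complete"))  # 0.2
```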

  21. Major issue - labeling • After the clustering algorithm finds clusters, how can they be made useful to the end user? • Need a concise label for each cluster • In search results, say “Animal” or “Car” in the jaguar example. • In topic trees (Yahoo), need navigational cues. • Often done by hand, a posteriori.

  22. How to Label Clusters • Show titles of typical documents • Titles are easy to scan • Authors create them for quick scanning! • But you can only show a few titles, which may not fully represent the cluster • Show words/phrases prominent in the cluster • More likely to fully represent the cluster • Use distinguishing words/phrases • But harder to scan
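A rough sketch of the “prominent words” approach: average the tf-idf weights of a cluster's documents and report the highest-weighted terms (this assumes the X and vectorizer from the earlier TfidfVectorizer sketch):

```python
import numpy as np

def top_terms(X, terms, doc_ids, n=3):
    # Mean tf-idf weight of each term over the cluster's documents.
    centroid = np.asarray(X[doc_ids].mean(axis=0)).ravel()
    return [terms[i] for i in centroid.argsort()[::-1][:n]]

# e.g. top_terms(X, vectorizer.get_feature_names_out(), doc_ids=[0, 1])
```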

  23. Not covered in this lecture • Complexity: • Clustering is computationally expensive; implementations need to balance cost and quality carefully. • How to decide how many clusters are best? • Evaluating the “goodness” of clustering • There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.
