Information Retrieval For the MSc Computer Science Programme Lecture 7 Introduction to Information Retrieval (Manning et al. 2007) Chapter 17 Dell Zhang Birkbeck, University of London
Yahoo! Hierarchy (http://dir.yahoo.com/science)
[Figure: a category tree. Top-level categories: agriculture, biology, physics, CS, space, … (30). Subcategories shown include dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, missions.]
Hierarchical Clustering
• Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents.
• Divisive (top-down)
  • Start with all documents belonging to the same cluster. Eventually each document forms a cluster on its own.
  • Recursively apply a (flat) partitional clustering algorithm, e.g., bisecting kMeans (kMeans with k=2).
• Agglomerative (bottom-up)
  • Start with each document being a single cluster. Eventually all documents belong to the same cluster.
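The divisive (top-down) strategy can be sketched as a recursive 2-way split. This is a minimal illustration, not a reference implementation: the function names, the use of Euclidean distance, and the deterministic seeding (first document plus the document farthest from it) are all assumptions made for the example.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def two_means(docs, iters=20):
    """Split docs, a list of (name, vector) pairs, into two groups (kMeans, k=2)."""
    # Assumed seeding: the first document and the document farthest from it.
    c0 = docs[0][1]
    c1 = max(docs, key=lambda d: math.dist(d[1], c0))[1]
    left, right = docs, []
    for _ in range(iters):
        new_left = [d for d in docs if math.dist(d[1], c0) <= math.dist(d[1], c1)]
        new_right = [d for d in docs if math.dist(d[1], c0) > math.dist(d[1], c1)]
        if not new_left or not new_right:
            break  # keep the last split in which both sides were non-empty
        left, right = new_left, new_right
        c0 = centroid([vec for _name, vec in left])
        c1 = centroid([vec for _name, vec in right])
    return left, right

def divisive(docs):
    """Return the hierarchy as nested pairs; leaves are document names."""
    if len(docs) == 1:
        return docs[0][0]
    left, right = two_means(docs)
    if not right:
        # Identical vectors cannot be separated by distance: split arbitrarily.
        left, right = docs[:1], docs[1:]
    return (divisive(left), divisive(right))
```

With four hypothetical 2-D documents, `divisive([("a", (0.0, 0.0)), ("b", (0.0, 1.0)), ("c", (10.0, 0.0)), ("d", (10.0, 1.0))])` first separates the two distant groups and then splits each group into single documents.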
Dendrogram
• A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
• The number of clusters k is not required in advance.
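Cutting a dendrogram can be sketched as follows, assuming the dendrogram is stored as a list of merges, each recorded as `(doc_a, doc_b, similarity)` where `doc_a` and `doc_b` are any members of the two merged clusters. The representation, the function name `cut_dendrogram`, and the similarity values in the usage note are hypothetical. Only merges at or above the cut level are kept, and the connected components that remain are the clusters.

```python
def cut_dendrogram(n_docs, merges, min_sim):
    """Return the clusters obtained by cutting at similarity level min_sim.

    merges: list of (doc_a, doc_b, similarity); doc ids are 0 .. n_docs-1.
    """
    parent = list(range(n_docs))  # union-find forest over documents

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in merges:
        if sim >= min_sim:  # this merge lies below the cut, so it is kept
            parent[find(a)] = find(b)

    clusters = {}
    for doc in range(n_docs):
        clusters.setdefault(find(doc), []).append(doc)
    return sorted(clusters.values())
```

For the hypothetical merge history `[(0, 1, 0.9), (3, 4, 0.8), (2, 3, 0.6), (0, 2, 0.3)]` over five documents, cutting at 0.7 yields three clusters and cutting at 0.5 yields two, so k follows from the chosen level rather than being fixed in advance.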
Dendrogram – Example Clusters of News Stories: Reuters RCV1
Dendrogram – Example Clusters of Things that People Want: ZEBO
HAC
• Hierarchical Agglomerative Clustering
• Start with each doc in a separate cluster.
• Repeat until there is only one cluster:
  • Among the current clusters, determine the pair of clusters, c_i and c_j, that are most similar (Single-Link, Complete-Link, etc.).
  • Then merge c_i and c_j into a single cluster.
• The history of merging forms a binary tree or hierarchy.
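The loop above can be sketched naively in a few lines. This is an illustrative sketch under stated assumptions: documents are given as non-zero numeric vectors, similarity is cosine, and the cluster similarity is chosen by passing `max` (single-link) or `min` (complete-link) as the hypothetical `linkage` parameter. It recomputes all pairwise similarities in every round, so it is meant for reading, not for large collections.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hac(vectors, linkage=max):
    """Naive HAC; returns the merge history as (cluster_a, cluster_b, similarity).

    Clusters are tuples of document indices.
    linkage=max gives single-link, linkage=min gives complete-link.
    """
    clusters = [(i,) for i in range(len(vectors))]  # each doc starts alone
    merges = []

    def cluster_sim(ca, cb):
        return linkage(cosine(vectors[i], vectors[j]) for i in ca for j in cb)

    while len(clusters) > 1:
        # Find the most similar pair among the current clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merges.append((clusters[x], clusters[y], s))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges
```

The returned merge history is the binary tree: each entry joins two existing nodes into a new parent.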
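Single-Link
• The similarity between a pair of clusters is defined by the single strongest link (i.e., maximum cosine-similarity) between their members:
  $$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\; y \in c_j} \mathrm{sim}(x, y)$$
• After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:
  $$\mathrm{sim}(c_i \cup c_j,\, c_k) = \max\bigl(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k)\bigr)$$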
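The single-link update rule means cluster similarities never have to be recomputed from individual documents: after a merge, the new row of the similarity table is the element-wise maximum of the two old rows. The sketch below illustrates this, assuming a precomputed symmetric document similarity matrix as input; the function name and the matrix values in the usage note are hypothetical.

```python
def single_link_hac(sim):
    """Single-link HAC over a symmetric similarity matrix (list of lists).

    Returns the merge history as (members_a, members_b, similarity).
    """
    n = len(sim)
    # S[a][b] holds the current similarity between clusters a and b.
    S = {i: {j: sim[i][j] for j in range(n) if j != i} for i in range(n)}
    members = {i: [i] for i in range(n)}
    merges = []
    while len(S) > 1:
        # Most similar pair of current clusters (each pair visited once, a < b).
        i, j = max(((a, b) for a in S for b in S[a] if a < b),
                   key=lambda pair: S[pair[0]][pair[1]])
        merges.append((sorted(members[i]), sorted(members[j]), S[i][j]))
        # Update rule: sim(ci U cj, ck) = max(sim(ci, ck), sim(cj, ck)).
        for k in S:
            if k not in (i, j):
                S[i][k] = S[k][i] = max(S[i][k], S[j][k])
                del S[k][j]
        # Cluster j is absorbed into cluster i.
        del S[i][j]
        del S[j]
        members[i] += members.pop(j)
    return merges
```

With a made-up 5-document matrix in which documents 0 and 1 are most similar (0.9), followed by 3 and 4 (0.8), the history comes out as {0,1}, then {3,4}, then {2} joined with {3,4}, and finally everything in one cluster.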
HAC – Example
• As clusters agglomerate, docs are likely to fall into a dendrogram.
[Figure: documents d1 to d5 merged into nested clusters {d1,d2}, {d4,d5}, and {d3,d4,d5}.]
HAC – Example Single-Link
Take Home Message • Single-Link HAC • Dendrogram