Understanding Hierarchical Clustering and Similarity Scores in Data Analysis

The College of Saint Rose
CIS 460 – Search and Information Retrieval
David Goldschmidt, Ph.D.
Clustering techniques {week 03b}
from Programming Collective Intelligence by Toby Segaran, O’Reilly Media, 2007, ISBN 978-0-596-52932-1

Hierarchical clustering (i)
• Hierarchical clustering is an algorithm that groups similar items together
• At each iteration, the two most similar items (or groups) are merged
• For example, given five items A-E:
[Figure: the five items A, B, C, D, and E]

Hierarchical clustering (ii)
• Calculate the distances between all items
• Group the two items that are closest
• Repeat!
[Figure: items A and B grouped as AB; C, D, and E not yet grouped]

Hierarchical clustering (iii)
• How do we compare group AB to other items?
• Use the midpoint of items A and B
[Figure: items A to E with groups AB, DE, and ABC; a midpoint is marked x]

Hierarchical clustering (iv)
• When do we stop?
• When we have a top-level group that includes all items
[Figure: items A to E with groups AB, DE, ABC, and the top-level group ABCDE]
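The merge loop described on these slides can be sketched roughly as follows. This is a minimal illustration, not the course's or the book's implementation: the function name `hierarchical_cluster`, the 2-D coordinates for A to E, and the choice to place each merged group at the midpoint of the two clusters being merged are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def hierarchical_cluster(points):
    """Repeatedly merge the two closest clusters until one remains.

    points maps an item name to an (x, y) tuple.
    Returns the names of the merged groups in the order they were formed.
    """
    # Each cluster is a (name, position) pair; every item starts as its own cluster.
    clusters = [(name, pos) for name, pos in points.items()]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose positions are closest together.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist(pair[0][1], pair[1][1]))
        # Assumption: a merged group sits at the midpoint of the two
        # clusters that were merged.
        midpoint = tuple((p + q) / 2 for p, q in zip(a[1], b[1]))
        name = "".join(sorted(a[0] + b[0]))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append((name, midpoint))
        merges.append(name)
    return merges


# Hypothetical coordinates, chosen only so the example is easy to follow.
items = {"A": (1, 5), "B": (2, 4), "C": (2, 1), "D": (6, 5), "E": (7, 3)}
print(hierarchical_cluster(items))
```

With these made-up coordinates the groups form in the order AB, DE, ABC, ABCDE. Comparing every pair on every pass is simple but slow for large item sets; it is used here only to keep the sketch short.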

Hierarchical clustering (v)
• The hierarchical part is based on the discovery order of clusters
• This diagram is called a dendrogram...
[Figure: dendrogram with leaves A, B, C, D, and E, internal nodes AB, ABC, and DE, and root ABCDE]
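As a rough illustration, the dendrogram on this slide can be written as a nested structure and printed as indented text. The nested-tuple layout and the helper names `label` and `render` are assumptions made for this sketch; the text output also does not show the distances between nodes that the drawn dendrogram conveys.

```python
def label(node):
    """Name a node by concatenating the labels of its leaves."""
    if isinstance(node, str):
        return node
    return "".join(label(child) for child in node)


def render(node, depth=0):
    """Return the tree as a list of text lines, indented by depth."""
    lines = ["  " * depth + label(node)]
    if not isinstance(node, str):
        for child in node:
            lines.extend(render(child, depth + 1))
    return lines


# The tree from the slide: ABCDE splits into ABC and DE, ABC into AB and C.
tree = ((("A", "B"), "C"), ("D", "E"))
print("\n".join(render(tree)))
```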

Hierarchical clustering (vi)
• A dendrogram is a graph (or tree)
• Distances between nodes of the dendrogram show how similar items (or groups) are
• AB is closer (to A and B) than DE is (to D and E), so A and B are more similar than D and E
• How can we define closeness?
[Figure: dendrogram with leaves A, B, C, D, and E, internal nodes AB, ABC, and DE, and root ABCDE]

Similarity scores
• A similarity score compares two distinct elements from a given set
• To measure closeness, we need to calculate a similarity score for each pair of items in the set
• Options include:
  • The Euclidean distance score, which is based on the distance formula in two-dimensional geometry
  • The Pearson correlation score, which is based on fitting data points to a line

Euclidean distance score
• To find the Euclidean distance between two data points, use the distance formula:
  $\text{distance} = \sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}$
• The larger the distance between two items, the less similar they are
• So use the reciprocal of distance as a measure of similarity (but be careful of division by zero)
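A minimal sketch of this score for two-dimensional points might look like the following. The function name `euclidean_similarity` is hypothetical, and adding 1 to the distance before taking the reciprocal is one common way to avoid division by zero; the slide itself does not say which safeguard to use.

```python
from math import sqrt


def euclidean_similarity(p1, p2):
    """Similarity in (0, 1]: 1.0 for identical points, smaller as they move apart."""
    x1, y1 = p1
    x2, y2 = p2
    distance = sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
    # Assumption: add 1 so identical points (distance 0) do not divide by zero.
    return 1.0 / (1.0 + distance)


print(euclidean_similarity((0, 0), (3, 4)))  # distance is 5, so the score is 1/6
print(euclidean_similarity((2, 2), (2, 2)))  # identical points score 1.0
```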

Pearson correlation score (i)
• The Pearson correlation score is derived by determining the best-fit line for a given set
• The best-fit line, on average, comes as close as possible to each item
• The Pearson correlation score is a coefficient measuring the degree to which items are on the best-fit line
[Figure: scatter plot of items on axes v1 and v2]

Pearson correlation score (ii)
• The Pearson correlation score tells us how closely items are correlated to one another
• 1.0 is a perfect match; ~0.0 is no relationship
[Figure: two scatter plots on axes v1 and v2, one with correlation score 0.4 and one with correlation score 0.8]

Pearson correlation score (iii)
• The algorithm is:
  • Calculate sum(v1) and sum(v2)
  • Calculate the sum of the squares of v1 and v2
    • Call them sum1Sq and sum2Sq
  • Calculate the sum of the products of v1 and v2
    • (v1[0] * v2[0]) + (v1[1] * v2[1]) + ... + (v1[n-1] * v2[n-1])
    • Call this pSum
[Figure: scatter plot of items on axes v1 and v2]

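Pearson correlation score (iv)
• Calculate the Pearson score, where n is the number of elements in v1 (and in v2):
  $$r = \frac{\text{pSum} - \dfrac{\text{sum}(v1) \cdot \text{sum}(v2)}{n}}{\sqrt{\left(\text{sum1Sq} - \dfrac{\text{sum}(v1)^2}{n}\right) \cdot \left(\text{sum2Sq} - \dfrac{\text{sum}(v2)^2}{n}\right)}}$$
• Much more complex, but often better than the Euclidean distance score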
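The steps on the last two slides can be sketched in Python roughly as follows. The variable names sum1Sq, sum2Sq, and pSum follow the slides; the function name `pearson` and the choice to return 0.0 when the denominator is zero (for example, when one vector is constant) are assumptions made for this sketch.

```python
from math import sqrt


def pearson(v1, v2):
    """Pearson correlation score for two equal-length lists of numbers."""
    n = len(v1)
    # Simple sums of each vector.
    sum1 = sum(v1)
    sum2 = sum(v2)
    # Sums of the squares.
    sum1Sq = sum(x * x for x in v1)
    sum2Sq = sum(y * y for y in v2)
    # Sum of the products of corresponding elements.
    pSum = sum(x * y for x, y in zip(v1, v2))

    numerator = pSum - (sum1 * sum2) / n
    under_root = (sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n)
    # Assumption: treat a zero (or rounding-negative) denominator as "no relationship".
    if under_root <= 0:
        return 0.0
    return numerator / sqrt(under_root)


print(pearson([1, 2, 3], [2, 4, 6]))  # points lie exactly on a rising line
print(pearson([1, 2, 3], [3, 2, 1]))  # points lie exactly on a falling line
```

The first call gives a score of 1.0 and the second gives -1.0: the formula also produces negative scores when one vector falls as the other rises.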

What next?
• Review the blog-data dendrograms
• Identify any patterns in the data
  • Which blogs are very similar?
  • Which blogs are very different?
• How can these techniques be applied to other types of search?
  • Web search?
  • Enterprise search?
