
4. Ad-hoc I: Hierarchical clustering Hierarchical versus Flat


- 4. Ad-hoc I: Hierarchical clustering
- Hierarchical versus Flat
- Flat methods generate a single partition into k clusters. The number k of clusters has to be determined by the user ahead of time.
- Hierarchical methods generate a hierarchy of partitions, i.e.
- a partition P1 into 1 cluster (the entire collection)
- a partition P2 into 2 clusters
- …
- a partition Pn into n clusters (each object forms its own cluster)
- It is then up to the user to decide which of the partitions reflects actual sub-populations in the data.

[Figure: partitions P4, P3, P2, P1]

Note: A sequence of partitions is called "hierarchical" if each cluster in a given partition is the union of clusters in the next finer partition (the one with one more cluster).

Top: hierarchical sequence of partitions. Bottom: non-hierarchical sequence.
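The nesting condition in the note can be checked mechanically. The sketch below is only an illustration: the function names and the toy partitions are hypothetical, and it assumes all partitions cover the same set of objects and are listed from coarse to fine.

```python
def refines(finer, coarser):
    """True if every cluster of `coarser` is a union of clusters of `finer`.

    For two partitions of the same set this is equivalent to: every cluster
    of `finer` lies inside some cluster of `coarser`.
    """
    return all(any(f <= c for c in coarser) for f in finer)


def is_hierarchical(partitions):
    """`partitions` is ordered coarse to fine, e.g. [P1, P2, ..., Pn]."""
    return all(refines(partitions[i + 1], partitions[i])
               for i in range(len(partitions) - 1))


# Hypothetical toy data: four objects labelled 1..4.
nested = [
    [{1, 2, 3, 4}],
    [{1, 2}, {3, 4}],
    [{1, 2}, {3}, {4}],
    [{1}, {2}, {3}, {4}],
]
not_nested = [
    [{1, 2, 3, 4}],
    [{1, 2}, {3, 4}],
    [{1}, {2, 3}, {4}],  # {2, 3} straddles {1, 2} and {3, 4}
]

print(is_hierarchical(nested))      # True
print(is_hierarchical(not_nested))  # False
```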

- Hierarchical methods again come in two varieties, agglomerative and divisive.
- Agglomerative methods:
- Start with partition Pn, where each object forms its own cluster.
- Merge the two closest clusters, obtaining Pn-1.
- Repeat merge until only one cluster is left.
- Divisive methods
- Start with P1.
- Split the collection into two clusters that are as homogeneous (and as different from each other) as possible.
- Apply splitting procedure recursively to the clusters.

Note:

Agglomerative methods require a rule to decide which clusters to merge. Typically one defines a distance between clusters and then merges the two clusters that are closest.

Divisive methods require a rule for splitting a cluster.

4.1 Hierarchical agglomerative clustering

Need to define a distance d(P,Q) between groups, given a distance measure d(x,y) between observations.

Commonly used distance measures:

1. d1(P,Q) = min d(x,y), for x in P, y in Q ( single linkage )

2. d2(P,Q) = ave d(x,y), for x in P, y in Q ( average linkage )

3. d3(P,Q) = max d(x,y), for x in P, y in Q ( complete linkage )

4. d4(P,Q) = || mean(P) - mean(Q) ||^2 ( centroid method )

5. d5(P,Q) = |P| |Q| / (|P| + |Q|) * || mean(P) - mean(Q) ||^2 ( Ward’s method )

d5 is called Ward’s distance.
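The first three linkages can be plugged into the same merge loop. What follows is a deliberately naive sketch (it recomputes all pairwise cluster distances at every step), assuming Euclidean distance between observations; the names `LINKAGES`, `cluster_distance`, and `agglomerate` are made up for this illustration.

```python
import math

# Each linkage reduces the list of pairwise distances d(x, y), x in P, y in Q.
LINKAGES = {
    "single": min,
    "average": lambda ds: sum(ds) / len(ds),
    "complete": max,
}


def cluster_distance(P, Q, points, linkage):
    """Distance between clusters P and Q (tuples of point indices)."""
    ds = [math.dist(points[i], points[j]) for i in P for j in Q]
    return LINKAGES[linkage](ds)


def agglomerate(points, linkage="single"):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # start with Pn
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters under the chosen linkage.
        candidates = [
            (cluster_distance(clusters[a], clusters[b], points, linkage), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        d, a, b = min(candidates)
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical toy data: four points on a line.
points = [(0, 0), (1, 0), (3, 0), (7, 0)]
for name in ("single", "complete"):
    print(name, [d for _, _, d in agglomerate(points, name)])
# single   [1.0, 2.0, 4.0]
# complete [1.0, 3.0, 7.0]
```

On this toy data the merge order is the same for both linkages, but the merge heights differ, which is what changes the shape of the resulting tree.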

- Motivation for Ward’s distance:
- Let Pk = {P1, …, Pk} be a partition of the observations into k groups.
- Measure goodness of a partition by the sum of squared distances of observations from their cluster means:
- RSS(Pk) = sum over i = 1..k of sum over x in Pi of || x - mean(Pi) ||^2

- Consider all possible (k-1)-partitions obtainable from Pk by a merge
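- Merging the two clusters with the smallest Ward’s distance gives the best of these (k-1)-partitions, because a merge increases the sum of squares by exactly the Ward’s distance between the two merged clusters.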
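A quick numerical check of this motivation, assuming Ward’s distance is the usual size-weighted squared distance between cluster means, |P||Q| / (|P| + |Q|) * ||mean(P) - mean(Q)||^2 (the formula itself is not legible in these slides, so this is an assumption). The helper names and the toy clusters are hypothetical.

```python
def mean(cluster):
    """Coordinate-wise mean of a list of points."""
    n = len(cluster)
    return tuple(sum(x[d] for x in cluster) / n for d in range(len(cluster[0])))


def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rss(cluster):
    """Sum of squared distances of the points from their cluster mean."""
    m = mean(cluster)
    return sum(sq_dist(x, m) for x in cluster)


def ward_distance(P, Q):
    """Assumed definition: |P||Q| / (|P| + |Q|) * ||mean(P) - mean(Q)||^2."""
    return len(P) * len(Q) / (len(P) + len(Q)) * sq_dist(mean(P), mean(Q))


# Hypothetical toy clusters.
P = [(0.0, 0.0), (2.0, 0.0)]
Q = [(5.0, 1.0), (6.0, 3.0), (7.0, 2.0)]

# Increase in the sum of squares caused by merging P and Q.
increase = rss(P + Q) - rss(P) - rss(Q)
print(increase, ward_distance(P, Q))  # the two values agree up to rounding
```

Since the other clusters are untouched by a merge, picking the pair with the smallest Ward’s distance is the same as picking the merge with the smallest increase in the total sum of squares.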

- 4.2 Hierarchical divisive clustering
- There are divisive versions of single linkage, average linkage, and Ward’s method.
- Divisive version of single linkage:
- Compute minimal spanning tree (graph connecting all the objects with smallest total edge length).
- Break longest edge to obtain 2 subtrees, and a corresponding partition of the objects.
- Apply process recursively to the subtrees.
- Agglomerative and divisive versions of single linkage give identical results (more later).
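One split step of this procedure can be sketched as follows, assuming Euclidean distances and using Prim's algorithm for the spanning tree (the slides do not say which algorithm is meant). The function names and the toy points are hypothetical.

```python
import math


def minimal_spanning_tree(points):
    """Prim's algorithm on the complete graph; returns edges (length, i, j)."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Shortest edge leaving the current tree.
        d, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((d, i, j))
        in_tree.add(j)
    return edges


def split_at_longest_edge(nodes, edges):
    """Remove the longest tree edge; return the two resulting node groups."""
    longest = max(edges)
    kept = list(edges)
    kept.remove(longest)
    adjacency = {v: [] for v in nodes}
    for _, i, j in kept:
        adjacency[i].append(j)
        adjacency[j].append(i)
    # Collect everything still connected to one end of the removed edge.
    side, stack = {longest[1]}, [longest[1]]
    while stack:
        for w in adjacency[stack.pop()]:
            if w not in side:
                side.add(w)
                stack.append(w)
    return sorted(side), sorted(set(nodes) - side)


# Hypothetical toy data: two groups on a line, separated by a gap of 8.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
tree = minimal_spanning_tree(points)
print(split_at_longest_edge(range(len(points)), tree))  # ([0, 1, 2], [3, 4])
```

Applying the same split to each group, with the tree edges restricted to that group, gives the recursive procedure described above.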

Divisive version of Ward’s method.

Given cluster R.

Need to find a split of R into 2 groups P, Q to minimize

RSS(P,Q) = sum over x in P of || x - mean(P) ||^2 + sum over x in Q of || x - mean(Q) ||^2

or, equivalently, to maximize Ward’s distance between P and Q.

Note: No computationally feasible method to find optimal P, Q for large |R|. Have to use approximation.

- Iterative algorithm to search for the optimal Ward’s split
- Project observations in R on largest principal component.
- Split at median to obtain initial clusters P, Q.
- Repeat {
- Assign each observation to cluster with closest mean
- Re-compute cluster means
- } Until convergence
- Note:
- Each step reduces RSS(P, Q)
- No guarantee to find optimal partition.
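The iteration above is a 2-means search started from a principal-component split. Below is a minimal pure-Python sketch; finding the leading principal direction by power iteration is a choice made here, not something the slides specify, and all function names and the toy data are hypothetical.

```python
import math


def mean(rows):
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def first_principal_direction(rows, iters=200):
    """Leading eigenvector of the scatter matrix, by power iteration."""
    m = mean(rows)
    centered = [[a - b for a, b in zip(r, m)] for r in rows]
    v = [1.0] * len(m)  # assumption: start vector is not orthogonal to the answer
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(r, v)) for r in centered]           # X v
        w = [sum(p * r[d] for p, r in zip(proj, centered)) for d in range(len(m))]  # X^T X v
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    return m, v


def ward_split(rows, max_iter=100):
    """Return a 0/1 label per observation."""
    m, v = first_principal_direction(rows)
    scores = [sum((a - b) * c for a, b, c in zip(r, m, v)) for r in rows]
    median = sorted(scores)[len(scores) // 2]
    labels = [0 if s < median else 1 for s in scores]  # initial split at the median
    for _ in range(max_iter):
        groups = [[r for r, g in zip(rows, labels) if g == k] for k in (0, 1)]
        if not groups[0] or not groups[1]:
            break
        means = [mean(g) for g in groups]
        # Assign each observation to the cluster with the closest mean.
        new = [0 if sq_dist(r, means[0]) <= sq_dist(r, means[1]) else 1 for r in rows]
        if new == labels:  # converged
            break
        labels = new
    return labels


# Hypothetical toy data: 2 points on the left, 6 on the right, so the
# median split is wrong at first and the iteration has to repair it.
rows = [(0, 0), (1, 0), (10, 0), (11, 0), (10, 1), (11, 1), (12, 0), (12, 1)]
labels = ward_split(rows)
print(labels)
```

Which group is called 0 and which 1 depends on the sign of the principal direction, so only the grouping itself is meaningful.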

Divisive version of average linkage

Algorithm DIANA; see Struyf, Hubert, and Rousseeuw, p. 22.

- 4.3 Dendrograms
- Result of hierarchical clustering can be represented as binary tree:
- Root of tree represents entire collection
- Terminal nodes represent observations
- Each interior node represents a cluster
- Each subtree represents a partition
- Note: The tree defines many more partitions than the n-2 nontrivial ones constructed during the merge (or split) process.
- Note: For HAC methods, the merge order defines a sequence of n subtrees of the full tree. For HDC methods a sequence of subtrees can be defined if there is a figure of merit for each split.

If the distance between daughter clusters is monotonically increasing as we move up the tree, we can draw a dendrogram:

y-coordinate of vertex = distance between daughter clusters.

Point set and corresponding single linkage dendrogram

- Standard method to extract clusters from a dendrogram:
- Pick number of clusters k.
- Cut dendrogram at a level that results in k subtrees.
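For a dendrogram with monotonically increasing merge heights, cutting at a level that leaves k subtrees is the same as replaying only the first n - k merges. The sketch below assumes a hypothetical merge format, (object in first cluster, object in second cluster, height), listed in merge order; it is not the input format of any particular package.

```python
def cut_tree(n, merges, k):
    """Replay the first n - k merges; return a cluster label for each object."""
    parent = list(range(n))  # union-find forest over the n objects

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b, _height in merges[: n - k]:
        parent[find(a)] = find(b)

    # Relabel the roots 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical merge history for 5 objects (heights are nondecreasing).
merges = [(0, 1, 1.0), (3, 4, 1.0), (1, 2, 1.0), (2, 3, 8.0)]
print(cut_tree(5, merges, 2))  # [0, 0, 0, 1, 1]
print(cut_tree(5, merges, 3))  # [0, 0, 1, 2, 2]
```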

- 4.4 Experiment
- Try hierarchical method on unimodal 2D datasets.
- Experiments suggest:
- Except in completely clear-cut situations, tree cutting (“cutree”) is useless for extracting clusters from a dendrogram.
- Complete linkage fails completely for elongated clusters.

- Needed:
- Diagnostics to decide whether the daughters of a dendrogram node really correspond to spatially separated clusters.
- Automatic and manual methods for dendrogram pruning.
- Methods for assigning observations in pruned subtrees to clusters.