
Cluster Analysis



  1. Cluster Analysis Hal Whitehead BIOL4062/5062

  2. What is cluster analysis? • Non-hierarchical cluster analysis • K-means • Hierarchical divisive cluster analysis • Hierarchical agglomerative cluster analysis • Linkage: single, complete, average, … • Cophenetic correlation coefficient • Additive trees • Problems with cluster analyses

  3. What is cluster analysis? • “Classification”: maximize within-cluster homogeneity (similar individuals within a cluster) • “The search for discontinuities”: discontinuities are places to put divisions between clusters

  4. Discontinuities • Discontinuities are generally present in: taxonomy, social organization • Community ecology?

  5. Types of cluster analysis: • Uses: data, dissimilarity, similarity matrix • Non-hierarchical • K-means • Hierarchical • Hierarchical divisive (repeated K-means, network methods) • Hierarchical agglomerative • single linkage, average linkage, ... • Additive trees

  6. Non-hierarchical Clustering Techniques: K-Means • Uses a data matrix with Euclidean distances • Maximizes the between-cluster variance for a given number of clusters • i.e. chooses clusters to maximize the F-ratio in a 1-way MANOVA

  7. K-Means Works iteratively: 1. Choose the number of clusters 2. Assign points to clusters (randomly, or using some other clustering technique) 3. Move each point to the other clusters in turn: does the between-cluster variance increase? 4. Repeat step 3 until no improvement is possible (a sketch of this procedure follows)
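
Read literally, steps 1-4 describe a hill-climbing search over assignments. Below is a minimal NumPy sketch of that procedure; the function name, the random initial assignment, and the use of within-cluster sum of squares (equivalent to maximizing between-cluster variance, since the total is fixed) are my own choices, not taken from the slides.

```python
import numpy as np

def kmeans_swap(X, k, seed=0):
    """Hill-climbing K-means: move single points between clusters
    whenever the move reduces the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))      # step 2: random assignment

    def within_ss(lab):
        return sum(((X[lab == j] - X[lab == j].mean(axis=0)) ** 2).sum()
                   for j in range(k) if np.any(lab == j))

    improved = True
    while improved:                            # step 4: repeat until stable
        improved = False
        for i in range(len(X)):                # step 3: try each point...
            for j in range(k):                 # ...in each other cluster
                if j == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = j
                if within_ss(trial) < within_ss(labels):
                    labels, improved = trial, True
    return labels

# e.g. three clusters in ten random 2-D points
print(kmeans_swap(np.random.default_rng(1).random((10, 2)), 3))
```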

  8. K-means with three clusters

  9. K-means with three clusters

   Variable    Between SS  df  Within SS  df  F-ratio
   X           0.536        2  0.007       7  256.163
   Y           0.541        2  0.050       7   37.566
   ** TOTAL ** 1.078        4  0.058      14

  10. K-means with three clusters

   Cluster 1 of 3 contains 4 cases
   Members             Statistics
   Case     Distance   Variable  Minimum  Mean  Maximum  St.Dev.
   Case 1   0.02       X         0.41     0.45  0.49     0.04
   Case 2   0.11       Y         0.03     0.19  0.27     0.11
   Case 3   0.06
   Case 4   0.05

   Cluster 2 of 3 contains 4 cases
   Members             Statistics
   Case     Distance   Variable  Minimum  Mean  Maximum  St.Dev.
   Case 7   0.06       X         0.11     0.15  0.19     0.03
   Case 8   0.03       Y         0.61     0.70  0.77     0.07
   Case 9   0.02
   Case 10  0.06

   Cluster 3 of 3 contains 2 cases
   Members             Statistics
   Case     Distance   Variable  Minimum  Mean  Maximum  St.Dev.
   Case 5   0.01       X         0.77     0.77  0.78     0.01
   Case 6   0.01       Y         0.33     0.35  0.36     0.02

  11. Disadvantages of K-means • Reaches an optimum, but not necessarily the global optimum • The number of clusters must be chosen before the analysis • How many clusters? (one informal check is sketched below)
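
The slides leave "how many clusters?" open. One common informal check, my addition rather than anything shown here, is to run K-means over a range of k and look for an elbow in the within-cluster sum of squares; a sketch with scikit-learn on placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((50, 2))   # placeholder data matrix

# Within-cluster sum of squares (inertia) for k = 1..8; a sharp
# "elbow" in this curve is one informal way to choose k.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 3))
```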

  12. Example: Sperm whale codas • Patterned series of clicks:  | | | | |  with inter-click intervals ic1, ic2, ic3, ic4 • For 5-click codas: a 681 x 4 data set (four intervals per coda)

  13. 5-click codas (intervals ic1–ic4): 93% of the variance in the first 2 PCs

  14. 5-click codas: K-means with 10 clusters
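
The pipeline of slides 12-14 (inter-click intervals, then principal components, then K-means) can be sketched with scikit-learn. The icis array below is random placeholder data standing in for the real 681 x 4 coda measurements, which are not included here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

icis = np.random.default_rng(1).random((681, 4))   # placeholder for the coda data

pcs = PCA(n_components=2).fit_transform(icis)      # keep the first 2 PCs
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(pcs)
print(np.bincount(labels))                         # size of each of the 10 clusters
```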

  15. Hierarchical Cluster Analysis • Usually represented by a dendrogram or tree-diagram

  16. Hierarchical Cluster Analysis • Hierarchical Divisive Cluster Analysis • Hierarchical Agglomerative Cluster Analysis

  17. Hierarchical Divisive Cluster Analysis • Starts with all units in one cluster and successively splits them • Successive use of K-Means, or some other divisive technique, with n=2 • Either: each time, split the cluster with the greatest sum of squared distances • Or: split every cluster each time • Hierarchical divisive methods are good techniques, but are rarely used outside network analysis (see the sketch below)
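
A minimal sketch of the "repeated K-means" divisive scheme, using the greatest-sum-of-squared-distances rule to choose which cluster to split next; the function name and the use of scikit-learn are my own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    """Divisive clustering by repeated K-means with n=2."""
    clusters = [np.arange(len(X))]        # start with everything in one cluster
    while len(clusters) < n_clusters:
        # split the cluster with the greatest sum of squared distances
        ss = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(ss)))
        half = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[half == 0], idx[half == 1]]
    return clusters

print(bisecting_kmeans(np.random.default_rng(2).random((30, 2)), 4))
```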

  18. Hierarchical Agglomerative Cluster Analysis • Start with each individual unit occupying its own cluster • The clusters are then gradually merged until just one is left • The most common type of cluster analysis

  19. Hierarchical Agglomerative Cluster Analysis • Works on a dissimilarity matrix or a negative similarity matrix (entries may be Euclidean, Penrose, … distances) • At each step: 1. There is a symmetric matrix of dissimilarities between clusters 2. The two clusters with the least dissimilarity are merged 3. The dissimilarity between the new (merged) cluster and all others is calculated • Different techniques do step 3 in different ways, as the worked examples on the following slides show:
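
This merge loop is what standard library routines implement. A minimal sketch with SciPy (my choice of software; none is named on the slides), applied to the A-E dissimilarity matrix of the next slide:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Dissimilarity matrix among units A-E, from slide 20
names = list("ABCDE")
D = np.array([[0.00, 0.35, 0.45, 0.11, 0.22],
              [0.35, 0.00, 0.67, 0.45, 0.56],
              [0.45, 0.67, 0.00, 0.57, 0.78],
              [0.11, 0.45, 0.57, 0.00, 0.19],
              [0.22, 0.56, 0.78, 0.19, 0.00]])

Z = linkage(squareform(D), method="single")  # or "complete", "average"
dendrogram(Z, labels=names)                  # draws the tree (needs matplotlib)
```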

  20. Hierarchical Agglomerative Cluster Analysis
   First link A and D (least dissimilarity, 0.11). How are the new dissimilarities calculated?

   Before merging:
        A     B     C     D     E
   A    0
   B    0.35  0
   C    0.45  0.67  0
   D    0.11  0.45  0.57  0
   E    0.22  0.56  0.78  0.19  0

   After merging A and D:
        AD    B     C     E
   AD   0
   B    ?     0
   C    ?     0.67  0
   E    ?     0.56  0.78  0

  21. Hierarchical Agglomerative Cluster Analysis: Single Linkage
   d(AD,B) = Min{d(A,B), d(D,B)} = Min{0.35, 0.45} = 0.35

   After merging A and D (starting from the matrix on slide 20):
        AD    B     C     E
   AD   0
   B    0.35  0
   C    ?     0.67  0
   E    ?     0.56  0.78  0

  22. Hierarchical Agglomerative Cluster Analysis: Complete Linkage
   d(AD,B) = Max{d(A,B), d(D,B)} = Max{0.35, 0.45} = 0.45

   After merging A and D (starting from the matrix on slide 20):
        AD    B     C     E
   AD   0
   B    0.45  0
   C    ?     0.67  0
   E    ?     0.56  0.78  0

  23. Hierarchical Agglomerative Cluster Analysis: Average Linkage
   d(AD,B) = Mean{d(A,B), d(D,B)} = Mean{0.35, 0.45} = 0.40

   After merging A and D (starting from the matrix on slide 20):
        AD    B     C     E
   AD   0
   B    0.40  0
   C    ?     0.67  0
   E    ?     0.56  0.78  0
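
The three update rules differ only in how they combine d(A,B) and d(D,B); a few lines confirm the numbers on slides 21-23 (the dictionary below is just the slide-20 matrix re-entered by hand):

```python
d = {("A", "B"): 0.35, ("A", "C"): 0.45, ("A", "D"): 0.11, ("A", "E"): 0.22,
     ("B", "C"): 0.67, ("B", "D"): 0.45, ("B", "E"): 0.56,
     ("C", "D"): 0.57, ("C", "E"): 0.78, ("D", "E"): 0.19}

def dist(x, y):
    return d[(x, y)] if (x, y) in d else d[(y, x)]

pair = [dist("A", "B"), dist("D", "B")]
print("single   d(AD,B) =", min(pair))              # 0.35
print("complete d(AD,B) =", max(pair))              # 0.45
print("average  d(AD,B) =", sum(pair) / len(pair))  # 0.40
```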

  24. Hierarchical Agglomerative Cluster Analysis: Centroid Clustering
   (uses the data matrix, or a true distance matrix)
   V1(AD) = Mean{V1(A), V1(D)}

   Before merging:
        V1    V2    V3
   A    0.11  0.75  0.33
   B    0.35  0.99  0.41
   C    0.45  0.67  0.22
   D    0.11  0.71  0.37
   E    0.22  0.56  0.78
   F    0.13  0.14  0.55
   G    0.55  0.90  0.21

   After merging A and D:
        V1    V2    V3
   AD   0.11  0.73  0.35
   B    0.35  0.99  0.41
   C    0.45  0.67  0.22
   E    0.22  0.56  0.78
   F    0.13  0.14  0.55
   G    0.55  0.90  0.21

  25. Hierarchical Agglomerative Cluster Analysis: Ward’s Method • Minimizes the within-cluster sum of squares • Similar to centroid clustering (see the sketch below)
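
Both methods are available in SciPy's linkage routine, which for the centroid and Ward methods should be given the raw observations (they are defined in Euclidean geometry) rather than an arbitrary dissimilarity matrix; a sketch on the V1-V3 data of slide 24:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Data matrix from slide 24 (V1-V3 for units A-G)
X = np.array([[0.11, 0.75, 0.33], [0.35, 0.99, 0.41], [0.45, 0.67, 0.22],
              [0.11, 0.71, 0.37], [0.22, 0.56, 0.78], [0.13, 0.14, 0.55],
              [0.55, 0.90, 0.21]])

Z_centroid = linkage(X, method="centroid")
Z_ward = linkage(X, method="ward")
print(Z_ward)  # each row: the two merged clusters, merge height, new cluster size
```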

  26. 1 1.00 2 0.00 1.00 4 0.53 0.00 1.00 5 0.18 0.05 0.00 1.00 9 0.22 0.09 0.13 0.25 1.00 11 0.36 0.00 0.17 0.40 0.33 1.00 12 0.00 0.37 0.18 0.00 0.13 0.00 1.00 140.74 0.00 0.30 0.20 0.23 0.17 0.00 1.00 150.53 0.00 0.30 0.00 0.36 0.00 0.26 0.56 1.00 19 0.00 0.00 0.17 0.21 0.43 0.32 0.29 0.09 0.091.00 20 0.04 0.00 0.17 0.00 0.14 0.10 0.35 0.00 0.18 0.25 1.00 1 2 4 5 9 11 12 14 15 19 20

  27. Hierarchical Agglomerative Clustering Techniques • Single Linkage: produces “straggly” clusters; not recommended if much experimental error; used in taxonomy; invariant to transformations • Complete Linkage: produces “tight” clusters; not recommended if much experimental error; invariant to transformations • Average Linkage, Centroid, Ward’s: most likely to mimic input clusters; not invariant to transformations of the dissimilarity measure

  28. Cophenetic Correlation Coefficient (CCC) • Correlation between the original dissimilarity matrix and the dissimilarities inferred from the cluster analysis • CCC >~ 0.8 indicates a good match • CCC <~ 0.8: the dendrogram is not a good representation and probably should not be displayed • Use the CCC to choose the best linkage method (highest coefficient) • (The similarity matrix from slide 26 is shown again on this slide)
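
SciPy computes the CCC directly; a sketch on placeholder data, comparing linkage methods by their cophenetic correlation with the input distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).random((20, 3))  # placeholder data
d = pdist(X)                                  # condensed Euclidean distances

# The method with the highest CCC best preserves the input dissimilarities
for method in ("single", "complete", "average", "ward"):
    ccc, _ = cophenet(linkage(d, method=method), d)
    print(method, round(ccc, 3))
```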

  29. (Dendrograms of the same data compared by cophenetic correlation: CCC = 0.77, 0.83, 0.80, 0.75)

  30. Additive trees • Dendrogram in which path lengths represent dissimilarities • Computation is quite complex (a cross between agglomerative techniques and multidimensional scaling) • Good when data are measured as dissimilarities • Often used in taxonomy and genetics

        A    B    C    D    E
   A    .
   B    14   .
   C    6    12   .
   D    81   7    13   .
   E    17   1    6    16   .
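
Neighbour joining is one standard way to fit an additive tree to a dissimilarity matrix (the slides do not say which algorithm produced their tree). A sketch assuming scikit-bio is available, using the slide's A-E matrix symmetrized with a zero diagonal:

```python
import numpy as np
from skbio import DistanceMatrix
from skbio.tree import nj

ids = list("ABCDE")
D = np.array([[ 0, 14,  6, 81, 17],
              [14,  0, 12,  7,  1],
              [ 6, 12,  0, 13,  6],
              [81,  7, 13,  0, 16],
              [17,  1,  6, 16,  0]], dtype=float)

tree = nj(DistanceMatrix(D, ids))  # neighbour-joining additive tree
print(tree.ascii_art())            # tip-to-tip path lengths approximate D
```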

  31. Problems with Cluster Analysis • Are there really biologically meaningful clusters in the data? • Does the dendrogram represent biological reality (web-of-life versus tree-of-life)? • How many clusters to use? (stopping rules are arbitrary) • Which method to use? (the best technique is data-dependent) • Dendrograms become messy with many units

  32. Social Structure of 160 northern bottlenose whales

  33. Clustering Techniques

   Type                         Technique                          Use
   Non-hierarchical             K-Means                            Dividing data sets
   Hierarchical divisive        Repeated K-means                   Good technique on small data sets
                                Network methods                    ...
   Hierarchical agglomerative   Single linkage                     Taxonomy
                                Complete linkage                   Tighter clusters
                                Average linkage, Centroid, Ward's  Usually preferred
   Hierarchical                 Additive trees                     Excellent for displaying dissimilarity;
                                                                   taxonomy, genetics
