
Cluster Analysis



Presentation Transcript


1. Cluster Analysis
• Mark Stamp

2. Cluster Analysis
• Grouping objects in a meaningful way
• Clustered data fits together in some way
• Can help to make sense of (big) data
• Useful technique in many fields
• Many different clustering strategies
• Overview, then details on 2 methods
• K-means: simple and can be effective
• EM clustering: not as simple

3. Intrinsic vs Extrinsic
• Intrinsic clustering relies on unsupervised learning
• No predetermined labels on objects
• Apply analysis directly to the data
• Extrinsic clustering requires category labels
• Requires pre-processing of the data
• Can be viewed as a form of supervised learning

4. Agglomerative vs Divisive
• Agglomerative
• Each object starts in its own cluster
• Clustering merges existing clusters
• A "bottom up" approach
• Divisive
• All objects start in one cluster
• The clustering process splits existing clusters
• A "top down" approach

5. Hierarchical vs Partitional
• Hierarchical clustering
• "Child" and "parent" clusters
• Can be viewed as dendrograms
• Partitional clustering
• Partition objects into disjoint clusters
• No hierarchical relationship
• We consider K-means and EM in detail
• These are both partitional

6. Hierarchical Clustering
• Example of a hierarchical approach...
• start: every point is its own cluster
• while the number of clusters exceeds 1
• find the 2 nearest clusters and merge them
• end while
• OK, but no real theoretical basis
• And some find that "disconcerting"
• Even K-means has some theory behind it
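
As a rough illustration of this merge loop, here is a minimal Python sketch (my own, not from the slides). It assumes a distance function d, such as the ones defined on the next slide, and uses single linkage (the closest pair of points) as one common way to define the "nearest" clusters:

```python
def agglomerative(points, d):
    """Bottom-up clustering: start with every point in its own cluster,
    then repeatedly merge the two nearest clusters until one remains.
    'Nearest' here means single linkage (closest pair of points)."""
    clusters = [[x] for x in points]
    merges = []  # record of each merge, oldest first
    while len(clusters) > 1:
        # find the pair of clusters whose closest points are nearest
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: min(d(x, y)
                                     for x in clusters[p[0]]
                                     for y in clusters[p[1]]))
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] += clusters[j]
        del clusters[j]
    return merges
```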

7. Distance
• Distance between data points?
• Suppose x = (x1,x2,…,xn) and y = (y1,y2,…,yn), where each xi and yi are real numbers
• Euclidean distance is d(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2)
• Manhattan (taxicab) distance is d(x,y) = |x1-y1| + |x2-y2| + … + |xn-yn|
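
Both distances translate directly into code; a minimal Python sketch (the function names are illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """Manhattan (taxicab) distance between two equal-length vectors."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# e.g., euclidean((0, 0), (3, 4)) == 5.0 and manhattan((0, 0), (3, 4)) == 7
```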

8. Distance
[Figure: two points a and b in the plane]
• Euclidean distance: the (red) straight line
• Manhattan distance: the blue or yellow path
• Or any similar right-angle-only path

9. Distance
• Lots and lots more distance measures
• Other examples include
• Mahalanobis distance: takes mean and covariance into account
• Simple substitution distance: a measure of "decryption" distance
• Chi-squared distance: statistical
• Or just about anything you can think of…

10. One Clustering Approach
• Given data points x1,x2,x3,…,xm
• Want to partition them into K clusters
• I.e., each point is in exactly one cluster
• A centroid is specified for each cluster
• Let c1,c2,…,cK denote the current centroids
• Each xi is associated with one centroid
• Let centroid(xi) be the centroid for xi
• If cj = centroid(xi), then xi is in cluster j

11. Clustering
• Two crucial questions
• How to determine the centroids cj?
• How to determine the clusters, that is, how to assign each xi to a centroid?
• But first, what makes a cluster good?
• For now, focus on one individual cluster
• The relationship between clusters comes later…
• What do you think?

12. Distortion
• Intuitively, "compact" clusters are good
• Depends on the data and K, which are given
• And depends on the centroids and the assignment of the xi to clusters (which we can control)
• How to measure this "goodness"?
• Define distortion = Σ d(xi, centroid(xi))
• Where d(x,y) is a distance measure
• Given K, let's try to minimize the distortion
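
This definition is a one-liner in Python; a minimal sketch (the names are illustrative; assign[i] is assumed to give the cluster index of points[i]):

```python
def distortion(points, centroids, assign, d):
    """distortion = sum of d(xi, centroid(xi)) over all points,
    where assign[i] is the centroid index for points[i] and d is
    any distance function (e.g., euclidean above)."""
    return sum(d(x, centroids[assign[i]]) for i, x in enumerate(points))
```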

13. Distortion
[Figure: the same 2-d data clustered two different ways, each with K = 3]
• Which clustering has smaller distortion?
• How to minimize distortion?
• Good question…

14. Distortion
• Note, distortion depends on K
• So, we should probably write distortionK
• Typically, the larger K is, the smaller distortionK is
• Want to minimize distortionK for a fixed K
• The best choice of K is a different issue
• Briefly considered later
• Also consider other measures of goodness
• For now, assume K is given and fixed

15. How to Minimize Distortion?
• Given m data points and K
• Minimize distortion via exhaustive search?
• Try all "m choose K" different cases?
• Too much work for a realistic-size data set
• An approximate solution will have to do
• Finding an exact solution is an NP-complete problem
• Important note: for minimum distortion…
• Each xi is grouped with its nearest centroid
• Each centroid must be the center of its group

16. K-Means
• The previous slide implies that we can improve a suboptimal clustering by either…
• 1. Re-assigning each xi to its nearest centroid, or
• 2. Re-computing the centroids so they are the centers of their groups
• No improvement comes from applying either 1 or 2 more than once in succession
• But alternating the two might be useful
• In fact, that is the K-means algorithm

17. K-Means Algorithm
Given a dataset…
1. Select a value for K (how?)
2. Select initial centroids (how?)
3. Group the data by nearest centroid
4. Recompute the centroids (cluster centers)
5. If there is a "significant" change, then goto 3; else stop
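
A compact Python sketch of steps 2 through 5 (my own, not from the slides). It reuses the euclidean helper from the earlier distance sketch and initializes centroids as a random sample of the data, which is one common choice:

```python
import random

def kmeans(points, K, max_iter=100, tol=1e-9):
    """Alternate 'group by nearest centroid' and 'recompute centroids'
    until the centroids stop moving. Each point is a tuple of floats."""
    centroids = random.sample(points, K)  # step 2 (one common choice)
    for _ in range(max_iter):
        # step 3: group the data by nearest centroid
        clusters = [[] for _ in range(K)]
        for x in points:
            j = min(range(K), key=lambda c: euclidean(x, centroids[c]))
            clusters[j].append(x)
        # step 4: recompute centroids as cluster means
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        # step 5: stop if there is no significant change
        if all(euclidean(a, b) < tol for a, b in zip(centroids, new)):
            break
        centroids = new
    return centroids, clusters
```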

18. K-Means Animation
• A very good animation is here: http://shabal.in/visuals/kmeans/2.html
• Nice animations of the movement of centroids in different cases are here: http://www.ccs.neu.edu/home/kenb/db/examples/059.html (near the bottom of the web page)
• Other?

19. K-Means
• Are we assured of an optimal solution?
• Definitely not
• Why not?
• For one, the initial centroid locations are critical
• There is a (sensitive) dependence on initial conditions
• This is a common issue in iterative processes (HMM training is an example)

20. K-Means Initialization
• Recall, K is the number of clusters
• How to choose K?
• No obvious "best" way to do so
• But K-means is fast
• So trial and error may be OK
• That is, experiment with different values of K
• Similar to choosing N in an HMM
• Is there a better way to choose K?

21. Optimal K?
• Even for trial and error, we need a way to measure the "goodness" of results
• Choosing the optimal K is tricky
• Most intuitive measures will tend to improve for larger K
• But a K that is "too big" may overfit the data
• So, when is K "big enough"?
• But not too big…

22. Schwarz Criterion
[Figure: f(K) plotted against K]
• Choose the K that minimizes f(K) = distortionK + λ d K log m
• Where d is the dimension, m is the number of data points, and λ is ???
• Recall that distortion depends on K
• It tends to decrease as K increases
• Essentially, this adds a penalty as K increases
• Related to the Bayesian Information Criterion (BIC)
• And some other similar things
• We consider the choice of K in more detail later…
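
A small sketch of the criterion as stated; λ is left as a parameter, since the slide leaves its value unspecified:

```python
import math

def f(distortion_K, K, d, m, lam=1.0):
    """f(K) = distortionK + lambda * d * K * log(m), where d is the
    data dimension and m the number of points. The default lam=1.0
    is only a placeholder for the slide's unspecified lambda."""
    return distortion_K + lam * d * K * math.log(m)

# choose K by minimizing f over candidate clusterings, e.g.:
# best_K = min(candidates, key=lambda K: f(distortions[K], K, d, m))
```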

23. K-Means Initialization
• How to choose the initial centroids?
• Again, there is no best way to do this
• Counterexamples exist to any "best" approach
• Often we just choose at random
• Or use uniform/maximum spacing
• Or some variation on this idea
• Other?

24. K-Means Initialization
• In practice, often…
• Try several different choices of K
• For each K, test several initial centroids
• Select the result that is best
• How to measure "best"?
• We look at that next
• May not be very scientific
• But often it's good enough

25. K-Means Variations
• One variation is K-medoids
• The centroid must be an actual data point
• Fuzzy K-means
• In K-means, any data point is in one cluster and not in any other
• In the fuzzy case, a data point can be partly in several different clusters
• "Degree of membership" vs distance
• Many other variations…

26. Measuring Cluster Quality
• How can we judge clustering results?
• In general, that is, not just for K-means
• Compare to typical training/scoring…
• Suppose we test a new scoring method
• E.g., score malware and benign files
• Compute ROC curves, AUC, etc.
• Many tools to measure success/accuracy
• Clustering is different (Why? How?)

27. Clustering Quality
• Clustering is a fishing expedition
• Not sure what we are looking for
• Hoping to find structure, "data discovery"
• If we know the answer, there is no point to clustering
• We might find something that's not there
• Even random data can be clustered
• Some things to consider are on the next slides
• Relative to the data to be clustered

28. Cluster-ability?
• Clustering tendency
• How suitable is the dataset for clustering?
• Which dataset below is cluster-friendly?
• We can always apply clustering…
• …but expect better results in some cases

29. Validation
• External validation
• Compare clusters based on data labels
• Similar to the usual training/scoring scenario
• A good idea if we know something about the data
• Internal validation
• Determine quality based only on the clusters
• E.g., spacing between and within clusters
• This is always applicable

30. It's All Relative
• Comparing clustering results
• That is, compare one clustering result with others for the same dataset
• Can be very useful in practice
• Often, lots of trial and error
• Could enable us to "hill climb" to better clustering results…
• …but we still need a way to quantify things

31. How Many Clusters?
• Optimal number of clusters?
• Already mentioned this with respect to K-means
• But what about the general case?
• I.e., not dependent on the clustering technique
• Can the data tell us how many clusters?
• Or the topology of the clusters?
• Next, we consider relevant measures

32. Internal Validation
• Direct measurement of clusters
• Might call it "topological" validation
• We'll consider the following
• Cluster correlation
• Similarity matrix
• Sum of squares error
• Cohesion and separation
• Silhouette coefficient

33. Cluster Correlation
• Given data x1,x2,…,xm, and clusters, define 2 matrices
• Distance matrix D = {dij}
• Where dij is the distance between xi and xj
• Adjacency matrix A = {aij}
• Where aij is 1 if xi and xj are in the same cluster
• And aij is 0 otherwise
• Now what?
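
Building the two matrices is straightforward; a minimal sketch, again assuming assign[i] gives the cluster index of points[i] and d is a distance function:

```python
def matrices(points, assign, d):
    """Distance matrix D = {dij} and adjacency matrix A = {aij}:
    dij = d(xi, xj); aij = 1 iff xi and xj share a cluster."""
    m = len(points)
    D = [[d(points[i], points[j]) for j in range(m)] for i in range(m)]
    A = [[1 if assign[i] == assign[j] else 0 for j in range(m)]
         for i in range(m)]
    return D, A
```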

34. Cluster Correlation
• Compute the correlation between D and A: rAD = cor(A,D) = cov(A,D) / (σA σD) = Σ(aij - μA)(dij - μD) / sqrt(Σ(aij - μA)^2 Σ(dij - μD)^2)
• Can show that r is between -1 and +1
• If r > 0 then the correlation is positive (and vice versa)
• The magnitude is the strength of the correlation
• High (inverse) correlation implies nearby things are clustered together
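
A sketch of this computation, flattening both matrices and applying the usual Pearson correlation formula:

```python
import math

def correlation(A, D):
    """Pearson correlation rAD between corresponding entries of the
    adjacency matrix A and the distance matrix D."""
    a = [v for row in A for v in row]
    d = [v for row in D for v in row]
    mu_a, mu_d = sum(a) / len(a), sum(d) / len(d)
    cov = sum((ai - mu_a) * (di - mu_d) for ai, di in zip(a, d))
    norm = math.sqrt(sum((ai - mu_a) ** 2 for ai in a) *
                     sum((di - mu_d) ** 2 for di in d))
    return cov / norm
```

Note that with this sign convention a good clustering shows up as a strongly negative rAD, since aij = 1 precisely where the distances dij are small; this is the "inverse" correlation the slide refers to.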

35. Correlation
[Figure: correlation examples]

36. Similarity Matrix
• Form a "similarity matrix"
• Could be based on just about anything
• Typically, the distance matrix D = {dij}, where dij = d(xi,xj)
• Group the rows and columns by cluster
• Draw a heat map of the resulting matrix
• Provides a visual representation of the similarity within clusters (so look at it…)
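
A minimal sketch of this using matplotlib (assuming D and assign as in the earlier sketches); tight clusters show up as low-distance blocks on the diagonal:

```python
import matplotlib.pyplot as plt

def similarity_heatmap(D, assign):
    """Reorder the distance matrix so rows and columns are grouped
    by cluster, then display the result as a heat map."""
    order = sorted(range(len(assign)), key=lambda i: assign[i])
    M = [[D[i][j] for j in order] for i in order]
    plt.imshow(M)
    plt.colorbar(label="distance")
    plt.show()
```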

37. Similarity Matrix
[Figure: similarity matrix examples]
• Better than just looking at the clusters?
• Good for higher dimensions

38. Residual Sum of Squares
• Residual Sum of Squares (RSS)
• Aka Sum of Squared Errors (SSE)
• RSS is the sum of squared "error" terms
• The definition of error depends on the problem
• What is the "error" when clustering?
• Distance from the centroid?
• Then it's the same as distortion
• But we could use other measures instead

39. Cohesion and Separation
• Cluster cohesion
• How "tightly packed" is a cluster
• The more cohesive a cluster, the better
• Cluster separation
• Distance between clusters
• The more separation, the better
• Can we measure these things?
• Yes, easily

40. Notation
• Same notation as for K-means
• Let ci, i=1,2,…,K, be the cluster centroids
• Let x1,x2,…,xm be the data points
• Let centroid(xi) be the centroid of xi
• Clusters are determined by centroids
• The following results apply generally
• Not just for K-means

41. Cohesion
• Lots of measures of cohesion
• The previously defined distortion is useful
• Recall, distortion = Σ d(xi, centroid(xi))
• Or, could use the distances between all pairs of points in a cluster

42. Separation
• Again, many ways to measure this
• Here, using distances to other centroids
• Or the distances between all points in the clusters
• Or the distance from centroids to a "midpoint"
• Or the distance between centroids, or…
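
One concrete instance of each idea from the last two slides, as a sketch; the particular measures chosen here (per-cluster distortion for cohesion, minimum centroid distance for separation) are just two of the many options mentioned above:

```python
def cohesion(cluster, centroid, d):
    """One cohesion measure: total distance from the points of a
    cluster to its centroid (per-cluster distortion); smaller is better."""
    return sum(d(x, centroid) for x in cluster)

def separation(centroids, d):
    """One separation measure: the minimum distance between any pair
    of centroids; larger is better."""
    return min(d(ci, cj)
               for i, ci in enumerate(centroids)
               for cj in centroids[i + 1:])
```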

43. Silhouette Coefficient
• Essentially, combines cohesion and separation into a single number
• Let Ci be the cluster of point xi
• Let a be the average of d(xi,y) over all y in Ci (other than xi itself)
• For each Cj ≠ Ci, let bj be the average of d(xi,y) over y in Cj
• Let b be the minimum of the bj
• Then let S(xi) = (b - a) / max(a,b)
• What the … ?

44. Silhouette Coefficient
[Figure: the idea, for a point xi: a = the average distance to points in xi's own cluster, b = the minimum average distance to another cluster]
• Usually, S(xi) = 1 - a/b

45. Silhouette Coefficient
• For a given point xi…
• Let a be the average distance to points in its cluster
• Let b be the distance to the nearest other cluster (in the sense above)
• Usually, a < b and hence S(xi) = 1 - a/b
• If a is a lot less than b, then S(xi) ≈ 1
• Points inside the cluster are much closer together than to the nearest other cluster (this is good)
• If a is almost the same as b, then S(xi) ≈ 0
• Some other cluster is almost as close as the things inside the cluster (this is bad)
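
Putting the definition into code; a minimal sketch, where own is the cluster containing x (with at least two points) and others is a list of the remaining clusters:

```python
def silhouette(x, own, others, d):
    """S(x) for one point: a is the average distance to the rest of
    x's own cluster; b is the smallest average distance to any other
    cluster; S(x) = (b - a) / max(a, b)."""
    rest = [y for y in own if y is not x]  # exclude x itself
    a = sum(d(x, y) for y in rest) / len(rest)
    b = min(sum(d(x, y) for y in c) / len(c) for c in others)
    return (b - a) / max(a, b)
```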

46. Silhouette Coefficient
• The silhouette coefficient is defined for each point
• The average silhouette coefficient for a cluster
• A measure of how "good" that cluster is
• The average silhouette coefficient over all points
• A measure of the overall clustering "goodness"
• Numerically, what is a good result?
• A rule of thumb is on the next slide

47. Silhouette Coefficient
• Average coefficient (to 2 decimal places)
• 0.71 to 1.00: strong structure found
• 0.51 to 0.70: reasonable structure found
• 0.26 to 0.50: weak or artificial structure
• 0.25 or less: no significant structure
• Bottom line on the silhouette coefficient
• Combines cohesion and separation in one number
• A useful measure of cluster quality

48. External Validation
• "External" implies that we measure quality based on the data in the clusters
• Not relying on cluster topology ("shape")
• Suppose the clustering data is of several different types
• Say, different malware families
• We can compute statistics on the clusters
• We only consider 2 stats here

49. Entropy and Purity
• Entropy
• A standard measure of uncertainty or randomness
• High entropy implies the clusters are less uniform
• Purity
• Another measure of uniformity
• Ideally, a cluster should be more "pure", that is, more uniform

50. Entropy
• Suppose a total of m data elements
• As usual, x1,x2,…,xm
• Denote cluster j as Cj
• Let mj be the number of elements in Cj
• Let mij be the count of type i in cluster Cj
• Compute probabilities based on relative frequencies
• That is, pij = mij / mj
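
A sketch of the per-cluster computation, applying the standard entropy formula -Σ pij log2(pij) to these relative frequencies (log base 2 is a common convention, giving entropy in bits):

```python
import math

def cluster_entropy(counts):
    """Entropy of one cluster Cj, given counts = {type i: mij}.
    Uses pij = mij / mj and skips empty types (0 * log 0 = 0)."""
    m_j = sum(counts.values())
    return -sum((m_ij / m_j) * math.log2(m_ij / m_j)
                for m_ij in counts.values() if m_ij > 0)

# e.g., a cluster with 8 samples of family A and 2 of family B:
# cluster_entropy({"A": 8, "B": 2})  ->  about 0.72 bits
```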
