
Clustering and segmentation



Presentation Transcript


  1. Clustering and segmentation MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining. http://www-users.cs.umn.edu/~kumar/dmbook/

  2. What is Cluster Analysis?
  • Grouping data so that elements in a group will be
    • Similar (or related) to one another
    • Different from (or unrelated to) elements in other groups
  • Distance within clusters is minimized; distance between clusters is maximized
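
To make the two distance claims concrete, here is a minimal sketch (the points are invented, not from the slides) that computes the average distance within one hand-made group and between two groups:

```python
# A toy illustration: the average distance within a tight group of
# 2-D points should be much smaller than the average distance to a
# second, well-separated group. (Data is invented for illustration.)
import numpy as np
from scipy.spatial.distance import cdist

group_a = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
group_b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

within_a = cdist(group_a, group_a).mean()  # small: members are similar
between = cdist(group_a, group_b).mean()   # large: groups are distinct
print(f"avg within-cluster distance:  {within_a:.2f}")
print(f"avg between-cluster distance: {between:.2f}")
```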

  3. Applications

  4. Even more examples

  5. What cluster analysis is NOT Main idea: the clusters must come from the data, not from external specifications. Creating the “buckets” beforehand is categorization, not clustering.

  6. Clusters can be ambiguous How many clusters: two, four, or six? The difference is the threshold you set. How distinct must a cluster be to be its own cluster?

  7. Two clustering techniques

  8. Partitional Clustering Three distinct groups emerge, but some curveballs behave more like splitters, and some splitters look more like fastballs.

  9. Hierarchical Clustering [Figure: a dendrogram over points p1 through p5; a dendrogram is the tree diagram used to represent hierarchical clusters]
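
A minimal sketch (assumed, not the course's code) of building and drawing that kind of dendrogram with SciPy's agglomerative clustering; the five 2-D points are invented stand-ins for p1 through p5:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[0.0, 0.0],   # p1
                   [0.2, 0.1],   # p2
                   [3.0, 3.0],   # p3
                   [3.1, 2.9],   # p4
                   [1.5, 1.4]])  # p5

# Agglomerative merging: nearest groups are joined first, and the
# tree of merges is exactly what a dendrogram displays
tree = linkage(points, method="average")
dendrogram(tree, labels=["p1", "p2", "p3", "p4", "p5"])
plt.show()
```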

  10. Clusters can be ambiguous How many clusters: two, four, or six? The difference is the threshold you set. How distinct must a cluster be to be its own cluster? Adapted from Tan, Steinbach, and Kumar, Introduction to Data Mining (2004).

  11. K-means (partitional) The K-means algorithm is one method for doing partitional clustering (a runnable sketch follows below):
  1. Choose K, the number of clusters.
  2. Select K points as the initial centroids.
  3. Assign all points to clusters based on distance.
  4. Recompute the centroid of each cluster.
  5. Did any center change? If yes, go back to step 3; if no, done.
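
Here is a from-scratch sketch of that loop (an assumed implementation for illustration, not the course's code; it also assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: select K data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each cluster's centroid
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([points[labels == i].mean(axis=0)
                                  for i in range(k)])
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```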

  12. K-Means Demonstration Here is the initial data set

  13. K-Means Demonstration Choose K points as initial centroids

  14. K-Means Demonstration Assign data points according to distance

  15. K-Means Demonstration And re-assign the points

  16. K-Means Demonstration Recalculate the centroids

  17. K-Means Demonstration And keep doing that until you settle on a final set of clusters
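
That settle-until-stable loop is what a library implementation runs for you; a minimal sketch using scikit-learn's KMeans on invented demo data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: three shifted blobs standing in for the demo's points
data = np.random.default_rng(1).normal(size=(60, 2))
data[20:40] += 4        # second group
data[40:] += (8, 0)     # third group

model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(data)
print(model.labels_)           # final cluster assignment per point
print(model.cluster_centers_)  # the settled centroids
print(model.n_iter_)           # assign/recompute rounds until stable
```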

  18. Choosing the initial centroids

  19. Example of Poor Initialization This may “work” mathematically but the clusters don’t make much sense.

  20. Evaluating K-Means Clusters
  • On the previous slide we did it visually, but there is a mathematical test
  • Sum-of-Squares Error (SSE): based on each point's distance to its cluster center. How close does each point get to the center?
  • In symbols: SSE = Σ_i Σ_{x in cluster i} dist(m_i, x)², where m_i is the centroid of cluster i and x is a point assigned to it
  • This just means:
    • In a cluster, compute the distance from each point (x) to the cluster center (m)
    • Square that distance (so sign isn't an issue)
    • Add them all together

  21. Example: Evaluating Clusters
  Cluster 1: distances from its three points to the centroid are 1, 1.3, and 2
  SSE1 = 1² + 1.3² + 2² = 1 + 1.69 + 4 = 6.69
  Cluster 2: distances from its three points to the centroid are 3, 3.3, and 1.5
  SSE2 = 3² + 3.3² + 1.5² = 9 + 10.89 + 2.25 = 22.14
  Reducing SSE within a cluster increases cohesion (we want that)
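
The same arithmetic, checked in a few lines of Python (the distances are the ones given on the slide):

```python
# Square each point's distance to its centroid and sum per cluster
cluster1_distances = [1, 1.3, 2]
cluster2_distances = [3, 3.3, 1.5]

sse1 = sum(d ** 2 for d in cluster1_distances)
sse2 = sum(d ** 2 for d in cluster2_distances)
print(f"SSE1 = {sse1:.2f}")  # 6.69  -> tighter cluster, more cohesion
print(f"SSE2 = {sse2:.2f}")  # 22.14 -> looser cluster, less cohesion
```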

  22. Choosing the best initial centroids
  • There's no single best way to choose initial centroids. So what do you do?
  • Multiple runs
  • Use a sample set of data first, then apply the result to your main data set
  • Select more centroids to start with, then choose the ones that are farthest apart, because those are the most distinct (a sketch of this idea follows below)
  • Pre- and post-processing of the data
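
One way to read "choose the ones that are farthest apart" is a greedy farthest-point pass; this is an assumed illustration, not the course's exact procedure:

```python
import numpy as np

def farthest_point_init(points, k, seed=0):
    """Greedily keep centroids that are far from all chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [points[rng.integers(len(points))]]  # start anywhere
    while len(centroids) < k:
        # each point's distance to its nearest already-chosen centroid
        d = np.linalg.norm(points[:, None] - np.array(centroids)[None],
                           axis=2).min(axis=1)
        centroids.append(points[d.argmax()])  # keep the most distant point
    return np.array(centroids)
```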

  23. Pre-processing: Getting the right centroids
  • "Pre" → get the data ready for analysis (a sketch of both steps follows below)
  • Normalize the data: re-scales the attributes so distances are computed on comparable ranges, reducing the dispersion of data points
    • Rationale: preserves differences while dampening the effect of outliers
  • Remove outliers: reduces the dispersion of data points by removing the atypical data
    • Rationale: they don't represent the population anyway
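
A sketch of the two steps (z-score normalization plus a simple outlier rule; the 3-standard-deviation cutoff is an assumption for illustration):

```python
import numpy as np

def preprocess(points):
    """Normalize to z-scores, then drop rows that look atypical."""
    # Normalize: mean 0, std 1 per attribute, so no attribute dominates
    z = (points - points.mean(axis=0)) / points.std(axis=0)
    # Remove outliers: drop rows with any attribute beyond 3 std devs
    keep = (np.abs(z) < 3).all(axis=1)
    return z[keep]
```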

  24. Post-processing: Getting the right centroids
  • "Post" → interpreting the results of the clustering analysis (a sketch of this inspection follows below)
  • Remove small clusters: they may be outliers
  • Split loose clusters: ones with high SSE that look like they are really two different groups
  • Merge clusters: ones with relatively low SSE that are "close" together
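
A sketch of flagging candidates for those three actions (the size threshold is an assumption; points, labels, and centroids are the outputs of a clustering run such as the kmeans sketch above):

```python
import numpy as np

def inspect_clusters(points, labels, centroids, min_size=5):
    """Print each cluster's size and SSE: tiny clusters may be
    outliers to remove, high-SSE clusters may need splitting, and
    nearby low-SSE clusters are candidates for merging."""
    for i, c in enumerate(centroids):
        members = points[labels == i]
        sse = ((members - c) ** 2).sum()
        note = " <- small: possible outliers" if len(members) < min_size else ""
        print(f"cluster {i}: n={len(members)}, SSE={sse:.2f}{note}")
```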

  25. Limitations of K-Means Clustering The clusters may nevermake sense. In that case, the data may just not be well-suited for clustering!

  26. Similarity between clusters (inter-cluster)
  • Most common: distance between centroids (a sketch follows below)
  • Can also use SSE: look at the distance between cluster 1's points and the other clusters' centroids
  • You'd want to maximize SSE between clusters
  Increasing SSE across clusters increases separation (we want that)
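
A minimal sketch of the "distance between centroids" measure (the two centroids are invented):

```python
import numpy as np
from scipy.spatial.distance import cdist

centroids = np.array([[1.0, 1.0],    # e.g. cluster 1's centroid
                      [5.0, 5.0]])   # e.g. cluster 5's centroid
# Pairwise centroid distances; larger off-diagonal values mean
# greater separation between the clusters
print(cdist(centroids, centroids))
```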

  27. Figuring out if our clusters are good
  • "Good" means: meaningful, useful, provides insight
  • The pitfalls:
    • Poor clusters reveal incorrect associations
    • Poor clusters reveal inconclusive associations
    • There might be room for improvement and we can't tell
  • This is somewhat subjective and depends upon the expectations of the analyst

  28. Cluster validity assessment

  29. The Keys to Successful Clustering
  • We want high cohesion within clusters (minimize differences): low SSE, high correlation
  • And high separation between clusters (maximize differences): high SSE, low correlation
  • We need to choose the right number of clusters and the right initial centroids
    • But there's no easy way to do this
    • It takes a combination of trial and error, knowledge of the problem, and looking at the output
  In SAS, cohesion is measured by root mean square standard deviation... and separation by distance to nearest cluster.
