CURE: An Efficient Clustering Algorithm for Large Databases


Presentation Transcript


  1. CURE: An Efficient Clustering Algorithm for Large Databases Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim Presentation by: Vuk Malbasa For CIS664 Prof. Vasilis Megalooekonomou

  2. Overview • Introduction • Previous Approaches • Drawbacks of previous approaches • CURE: Approach • Enhancements for Large Datasets • Conclusions

  3. Introduction • Clustering problem: Given a set of points, separate them into clusters so that points within a cluster are more similar to each other than to points in different clusters. • Traditional clustering techniques either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers. • CURE is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size. • Each cluster is represented by a fixed number of well-scattered points.

  4. Introduction • CURE is a hierarchical clustering technique, where each partition is nested into the next partition in the sequence. • CURE is an agglomerative algorithm: disjoint clusters are successively merged until only the desired number of clusters remains.

  5. Previous Approaches • At each step in agglomerative clustering, the two clusters merged are those that minimize some distance metric. • This distance metric can be: • Distance between the means of the clusters, dmean • Average distance between all pairs of points across the clusters, dave • Maximal distance between points across the clusters, dmax • Minimal distance between points across the clusters, dmin
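The slides do not include code, but the four metrics above are straightforward to compute. A minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def cluster_distances(a, b):
    """Compute the four inter-cluster distance metrics from the slide
    for two clusters given as (n, d) and (m, d) arrays of points."""
    # all pairwise distances between points of a and points of b: shape (n, m)
    pair = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return {
        "dmean": np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)),  # distance between means
        "dave": pair.mean(),  # average pairwise distance
        "dmax": pair.max(),   # maximal pairwise distance
        "dmin": pair.min(),   # minimal pairwise distance
    }
```

For example, for clusters {(0,0), (1,0)} and {(3,0), (4,0)} this gives dmin = 2, dmax = 4, and dmean = dave = 3.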

  6. Drawbacks of previous approaches • Where clusters vary in size, the dave, dmax and dmean distance metrics will split large clusters into parts. • Non-spherical clusters will be split by dmean. • Clusters connected by outliers will be merged if the dmin metric is used. • None of the stated approaches work well in the presence of non-spherical clusters or outliers.

  7. Drawbacks of previous approaches

  8. CURE: Approach • CURE is positioned between the centroid-based (dave) and all-points (dmin) extremes. • A constant number of well-scattered points is used to capture the shape and extent of a cluster. • The points are shrunk towards the centroid of the cluster by a factor α. • These well-scattered and shrunk points are used as representatives of the cluster.
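The shrunk representatives stand in for the whole cluster when clusters are compared; as slide 10 states, CURE takes the inter-cluster distance to be the distance between the closest pair of representatives. A minimal NumPy sketch of that merge criterion (the function name is my own):

```python
import numpy as np

def rep_distance(reps_a, reps_b):
    """CURE's merge criterion: the distance between two clusters is the
    minimum distance over all pairs of their (shrunk) representative points."""
    return np.min(np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2))
```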

  9. CURE: Approach • The scattered-points approach alleviates the shortcomings of dave and dmin. • Since multiple representatives are used, the splitting of large clusters is avoided. • Multiple representatives allow for discovery of non-spherical clusters. • The shrinking phase affects outliers more than other points, since their distance from the centroid is decreased more than that of regular points.

  10. CURE: Approach • Initially, since all points are in separate clusters, each cluster is defined by the single point it contains. • Clusters are merged until they contain at least c points. • The first scattered point in a cluster is the one farthest from the cluster's centroid. • Each further scattered point is chosen so that its distance from the previously chosen scattered points is maximal. • When c well-scattered points have been selected, they are shrunk by some factor α (r = p + α*(mean - p)). • Once clusters have c representatives, the distance between two clusters is the distance between the closest pair of representatives, one from each cluster. • Every time two clusters are merged, their representatives are re-calculated.
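The selection-and-shrink procedure above can be sketched in NumPy as follows. This is an illustrative reading of the slide, not the paper's implementation; the farthest-point selection and the shrink formula r = p + α·(mean − p) follow the bullets directly:

```python
import numpy as np

def representatives(points, c=4, alpha=0.5):
    """Pick c well-scattered points for one cluster, then shrink them
    toward the centroid by factor alpha, as the slide describes."""
    centroid = points.mean(axis=0)
    # first scattered point: farthest from the centroid
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        # next point maximizes its distance to the nearest already-chosen rep
        chosen = np.array(reps)
        d = np.min(np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2), axis=1)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    # shrink: r = p + alpha * (centroid - p)
    return reps + alpha * (centroid - reps)
```

With α = 1 every representative collapses onto the centroid; with α = 0 the scattered points are kept as-is, so α interpolates between the centroid-based and all-points extremes of slide 8.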

  11. Enhancements for Large Datasets • Random sampling • Filters outliers and allows the dataset to fit into memory • Partitioning • First cluster within each partition, then merge the partially clustered partitions • Labeling Data on Disk • The final labeling phase can be done by nearest-neighbor assignment to the already-chosen cluster representatives • Handling outliers • Outliers are partially eliminated and spread out by random sampling, and are identified because they belong to small clusters that grow slowly
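The labeling step above assigns each on-disk point to the cluster of its nearest representative. A minimal sketch, assuming the representatives of each cluster are available as a list of arrays (names are illustrative):

```python
import numpy as np

def label(points, reps_per_cluster):
    """Assign each point to the cluster whose nearest representative is
    closest. reps_per_cluster: list of (k_i, d) arrays, one per cluster."""
    labels = []
    for p in points:
        # distance from p to each cluster = distance to its nearest representative
        dists = [np.min(np.linalg.norm(reps - p, axis=1)) for reps in reps_per_cluster]
        labels.append(int(np.argmin(dists)))
    return labels
```

Because only the representatives (not all clustered sample points) are consulted, this pass is cheap even when the full dataset is large.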

  12. Conclusions • CURE can identify non-spherical (e.g., ellipsoidal) clusters • CURE is robust to outliers • CURE correctly clusters data with large differences in cluster size • Running time for a low-dimensional dataset with s points is O(s²) • Using partitioning and sampling, CURE can be applied to large datasets

  13. Thanks!

  14. ?