CURE: An Efficient Clustering Algorithm for Large Databases


Presentation Transcript


  1. CURE: An Efficient Clustering Algorithm for Large Databases Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim Presentation by: Vuk Malbasa For CIS664 Prof. Vasilis Megalooekonomou

  2. Overview • Introduction • Previous Approaches • Drawbacks of previous approaches • CURE: Approach • Enhancements for Large Datasets • Conclusions

  3. Introduction • Clustering problem: Given a set of points, separate them into clusters so that points within a cluster are more similar to each other than to points in different clusters. • Traditional clustering techniques either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers. • CURE is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size. • Each cluster is represented by a fixed number of well-scattered points.

  4. Introduction • CURE is a hierarchical clustering technique, where each partition is nested into the next partition in the sequence. • CURE is an agglomerative algorithm: disjoint clusters are successively merged until only the desired number of clusters remains.

  5. Previous Approaches • At each step in agglomerative clustering, the two clusters merged are those that minimize some distance metric. • This distance metric can be: • Distance between the means of the clusters, dmean • Average distance between all pairs of points across the clusters, dave • Maximal distance between points across the clusters, dmax • Minimal distance between points across the clusters, dmin
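The slides do not include code, but the four metrics above are straightforward to compute. A minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def cluster_distances(a, b):
    """Compute the four inter-cluster distance metrics from the slide
    for two clusters given as (n, d) and (m, d) arrays of points."""
    # all pairwise distances between points of a and points of b: shape (n, m)
    pair = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return {
        "dmean": np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)),  # distance between means
        "dave": pair.mean(),  # average pairwise distance
        "dmax": pair.max(),   # maximal pairwise distance
        "dmin": pair.min(),   # minimal pairwise distance
    }
```

For example, for clusters {(0,0), (1,0)} and {(3,0), (4,0)} this gives dmin = 2, dmax = 4, and dmean = dave = 3.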

  6. Drawbacks of previous approaches • Where clusters vary in size, the dave, dmax and dmean distance metrics will split large clusters into parts. • Non-spherical clusters will be split by dmean. • Clusters connected by outliers will be merged if the dmin metric is used. • None of the stated approaches work well in the presence of non-spherical clusters or outliers.

  7. Drawbacks of previous approaches

  8. CURE: Approach • CURE is positioned between the centroid-based (dave) and all-points (dmin) extremes. • A constant number of well-scattered points is used to capture the shape and extent of a cluster. • The points are shrunk towards the centroid of the cluster by a factor α. • These well-scattered and shrunk points are used as representatives of the cluster.
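The shrunk representatives stand in for the whole cluster when clusters are compared; as slide 10 states, CURE takes the inter-cluster distance to be the distance between the closest pair of representatives. A minimal NumPy sketch of that merge criterion (the function name is my own):

```python
import numpy as np

def rep_distance(reps_a, reps_b):
    """CURE's merge criterion: the distance between two clusters is the
    minimum distance over all pairs of their (shrunk) representative points."""
    return np.min(np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2))
```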

  9. CURE: Approach • The scattered-points approach alleviates the shortcomings of dave and dmin. • Since multiple representatives are used, the splitting of large clusters is avoided. • Multiple representatives allow for discovery of non-spherical clusters. • The shrinking phase affects outliers more than other points, since their distance from the centroid is decreased more than that of regular points.

  10. CURE: Approach • Initially, since all points are in separate clusters, each cluster is defined by the single point it contains. • Clusters are merged until they contain at least c points. • The first scattered point in a cluster is the one farthest from the cluster's centroid. • Each further scattered point is chosen so that its distance from the previously chosen scattered points is maximal. • When c well-scattered points have been selected, they are shrunk by some factor α (r = p + α*(mean - p)). • Once clusters have c representatives, the distance between two clusters is the distance between the closest pair of representatives, one from each cluster. • Every time two clusters are merged, their representatives are re-calculated.
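The selection-and-shrink procedure above can be sketched in NumPy as follows. This is an illustrative reading of the slide, not the paper's implementation; the farthest-point selection and the shrink formula r = p + α·(mean − p) follow the bullets directly:

```python
import numpy as np

def representatives(points, c=4, alpha=0.5):
    """Pick c well-scattered points for one cluster, then shrink them
    toward the centroid by factor alpha, as the slide describes."""
    centroid = points.mean(axis=0)
    # first scattered point: farthest from the centroid
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        # next point maximizes its distance to the nearest already-chosen rep
        chosen = np.array(reps)
        d = np.min(np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2), axis=1)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    # shrink: r = p + alpha * (centroid - p)
    return reps + alpha * (centroid - reps)
```

With α = 1 every representative collapses onto the centroid; with α = 0 the scattered points are kept as-is, so α interpolates between the centroid-based and all-points extremes of slide 8.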

  11. Enhancements for Large Datasets • Random sampling • Filters outliers and allows the dataset to fit into memory • Partitioning • First cluster within each partition, then merge the partially clustered partitions • Labeling Data on Disk • The final labeling phase can be done by nearest-neighbor assignment to the already-chosen cluster representatives • Handling outliers • Outliers are partially eliminated and spread out by random sampling, and are identified because they belong to small clusters that grow slowly
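The labeling step above assigns each on-disk point to the cluster of its nearest representative. A minimal sketch, assuming the representatives of each cluster are available as a list of arrays (names are illustrative):

```python
import numpy as np

def label(points, reps_per_cluster):
    """Assign each point to the cluster whose nearest representative is
    closest. reps_per_cluster: list of (k_i, d) arrays, one per cluster."""
    labels = []
    for p in points:
        # distance from p to each cluster = distance to its nearest representative
        dists = [np.min(np.linalg.norm(reps - p, axis=1)) for reps in reps_per_cluster]
        labels.append(int(np.argmin(dists)))
    return labels
```

Because only the representatives (not all clustered sample points) are consulted, this pass is cheap even when the full dataset is large.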

  12. Conclusions • CURE can identify non-spherical (e.g., ellipsoidal) clusters • CURE is robust to outliers • CURE correctly clusters data with large differences in cluster size • Running time for a low-dimensional dataset with s points is O(s²) • Using partitioning and sampling, CURE can be applied to large datasets

  13. Thanks!

  14. ?