

Presentation Transcript


  1. ABOUT ME BHARATH RENGARAJAN PURSUING MY MASTERS IN COMPUTER SCIENCE FALL 2008

  2. CONTENTS • Problems in the traditional clustering method • CURE clustering • Summary • Drawbacks

  3. PROBLEMS IN THE TRADITIONAL CLUSTERING METHOD

  4. PARTITIONAL CLUSTERING ALGORITHM • Attempts to find k-partitions that try to minimize a certain criterion function • The square-error criterion is the most common criterion function used. • Works well for compact, well separated clusters.
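The square-error criterion mentioned above can be sketched as follows; this is an illustrative helper (the function name and signature are not from the slides), summing each point's squared distance to its cluster centroid:

```python
import numpy as np

def square_error(points, labels, k):
    """Sum of squared distances from each point to its cluster's centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in range(k):
        members = points[labels == j]
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total
```

A partitional algorithm such as k-means searches for the k-partition minimizing this quantity.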

  5. PARTITIONAL CLUSTERING ALGORITHM • It can go wrong when the square-error is reduced by splitting some large cluster in order to favor some other group of points.

  6. HIERARCHICAL CLUSTERING ALGORITHMS • This category of clustering methods tries to merge sequences of disjoint clusters into the target k clusters, based on the minimum distance between two clusters. • The distance between clusters can be measured as: • Distance between the cluster means (dmean) • Distance between the two nearest points, one from each cluster (dmin)
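The two inter-cluster distances above can be written as small helpers (illustrative names, not from the slides):

```python
import numpy as np

def d_mean(a, b):
    """Distance between the means (centroids) of two clusters."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def d_min(a, b):
    """Distance between the two nearest points, one from each cluster."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = a[:, None, :] - b[None, :, :]          # all pairwise differences
    return float(np.sqrt((diffs ** 2).sum(axis=2)).min())
```

dmean tends to favor compact spherical clusters, while dmin (single-link) can chain through elongated shapes, which is the behavior the next two slides contrast.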

  7. HIERARCHICAL CLUSTERING ALGORITHMS • Result of dmean :

  8. HIERARCHICAL CLUSTERING ALGORITHMS • Result of dmin :

  9. PROBLEMS: • Traditional clustering mainly favors spherical cluster shapes. • Data in each cluster must be packed compactly together. • Clusters must be separated far enough from one another. • Outliers will greatly disturb the clustering result.

  10. CURE CLUSTERING

  11. CURE CLUSTERING • It is similar to the hierarchical clustering approach, but it uses a sample-point variant as the cluster representative rather than every point in the cluster. • First set a target sample number c. Then we try to select c well-scattered sample points from the cluster. • The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1. • These points are used as the representatives of the clusters and take the place of the points in the dmin cluster-merging approach.
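The two steps above (scattered-point selection and centroid shrinking) can be sketched as follows. This is a minimal illustration: the greedy farthest-point heuristic stands in for the paper's selection procedure, and the function names are assumptions, not from the slides:

```python
import numpy as np

def select_scattered(points, c):
    """Greedily pick c well-scattered points: start with the point farthest
    from the centroid, then repeatedly add the point farthest from all
    points chosen so far."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    chosen = [int(np.argmax(np.linalg.norm(points - centroid, axis=1)))]
    while len(chosen) < min(c, len(points)):
        dists = np.min(
            [np.linalg.norm(points - points[i], axis=1) for i in chosen],
            axis=0)
        chosen.append(int(np.argmax(dists)))
    return points[chosen]

def shrink(reps, centroid, alpha):
    """Move each representative toward the centroid by a fraction alpha."""
    reps = np.asarray(reps, dtype=float)
    return reps + alpha * (np.asarray(centroid, dtype=float) - reps)
```

Shrinking dampens the effect of outliers among the representatives: the farther a representative sits from the centroid, the more it is pulled in.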

  12. CURE CLUSTERING • After each merge, c sample points are selected from the original representatives of the previous clusters to represent the new cluster. • Cluster merging stops when the target of k clusters is reached.

  13. PSEUDO FUNCTION OF CURE
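The pseudocode figure from this slide does not survive in the transcript. A hedged Python sketch of the merge loop described on the previous slides might look like the following; it uses a brute-force closest-pair search and a farthest-from-centroid heuristic for the scattered points, whereas the actual algorithm uses a heap plus a k-d tree and an iterative scattered-point selection:

```python
import numpy as np

def cure_cluster(points, k, c=4, alpha=0.5):
    """CURE-style agglomerative loop (illustrative sketch, O(n^3) as written)."""
    points = np.asarray(points, dtype=float)
    # Each cluster is a pair: (member indices, representative points).
    clusters = [([i], points[i:i + 1].copy()) for i in range(len(points))]
    while len(clusters) > k:
        # Find the pair of clusters whose representatives are closest (dmin).
        best, bi, bj = np.inf, 0, 1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(float(np.linalg.norm(p - q))
                        for p in clusters[i][1] for q in clusters[j][1])
                if d < best:
                    best, bi, bj = d, i, j
        # Merge the two clusters and rebuild their representatives.
        members = clusters[bi][0] + clusters[bj][0]
        pts = points[members]
        centroid = pts.mean(axis=0)
        # Pick up to c scattered points (farthest-from-centroid heuristic).
        order = np.argsort(-np.linalg.norm(pts - centroid, axis=1))[:c]
        reps = pts[order] + alpha * (centroid - pts[order])  # shrink step
        clusters = [cl for idx, cl in enumerate(clusters)
                    if idx not in (bi, bj)]
        clusters.append((members, reps))
    return [sorted(cl[0]) for cl in clusters]
```

On two well-separated groups of points, the loop keeps merging the nearest clusters until only k remain.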

  14. EFFICIENCY OF CURE • The worst-case time complexity is O(n² log n). • The space complexity is O(n), due to the use of a k-d tree and a heap.

  15. RANDOM SAMPLING • When dealing with a large database, we can't store every data point in memory. • Handling cluster merges over a large database also takes a very long time. • We use random sampling to reduce both the time complexity and the memory usage. • With random sampling, there is a trade-off between accuracy and efficiency.
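The sampling step itself is simple; a minimal sketch (the function name and the fraction parameter are illustrative, not from the slides):

```python
import random

def draw_sample(data, fraction, seed=0):
    """Uniform random sample of the data set. A larger fraction gives a more
    accurate clustering at the cost of more time and memory."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = max(1, int(len(data) * fraction))
    return rng.sample(data, n)
```

The clustering then runs entirely on the sample, and the remaining points are assigned afterwards in the labeling phase.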

  16. OUTLIER ELIMINATION We can introduce outlier elimination by two methods. • Random sampling: with random sampling, most outlier points are filtered out. • Outlier elimination: since outliers do not form a compact group, their clusters grow in size very slowly during the merge stage. We therefore invoke an elimination procedure during merging, so that clusters with only 1 or 2 data points are removed from the cluster list.
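The second method amounts to dropping undersized clusters partway through the merge stage; a one-line sketch (illustrative name, with the 1-to-2-point threshold from the slide as the default):

```python
def eliminate_outliers(clusters, min_size=3):
    """Drop slow-growing clusters (1 or 2 points) as presumed outliers.
    Each cluster is a list of its member points."""
    return [c for c in clusters if len(c) >= min_size]
```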

  17. DATA LABELING • Because a random sample is used, we need to label every remaining data point back to its proper cluster group. • Each data point is assigned to the cluster group with a representative point nearest to the data point.
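The labeling rule can be sketched as a nearest-representative lookup (illustrative helper, not from the slides); using the shrunk representatives rather than the centroids is what lets non-spherical clusters claim their points:

```python
import numpy as np

def label_points(data, representatives):
    """Assign each point to the cluster owning the nearest representative.
    `representatives` is a list of arrays, one array of points per cluster."""
    data = np.asarray(data, dtype=float)
    labels = []
    for x in data:
        dists = [float(np.linalg.norm(np.asarray(reps, dtype=float) - x,
                                      axis=1).min())
                 for reps in representatives]
        labels.append(int(np.argmin(dists)))
    return labels
```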

  18. OVERVIEW OF CURE: Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data in disk

  19. SENSITIVITY TO SHRINK PARAMETER (α)

  20. SENSITIVITY TO NO. OF REPRESENTATIVE POINTS (c)

  21. SENSITIVITY TO THE NO. OF PARTITIONS

  22. SUMMARY

  23. SUMMARY • CURE can effectively detect the proper shape of a cluster with the help of scattered representative points and centroid shrinking. • CURE can reduce computation time with random sampling. • CURE can effectively remove outliers. • The quality and effectiveness of CURE can be tuned by varying the parameters (shrink factor α, number of partitions p, and representative count c) to adapt to different input data sets.

  24. DRAWBACKS

  25. DRAWBACKS • The clusters it finds are still limited to somewhat standard shapes. • Too many parameters are involved.
