
A New Gravitational Clustering Algorithm



  1. A New Gravitational Clustering Algorithm Jonatan Gomez, Dipankar Dasgupta, Olfa Nasraoui

  2. Outline • Introduction • Background • Proposed Algorithm • Analysis

  3. Introduction • Many clustering techniques assume that the data set follows a certain distribution and is free of noise • In the presence of noise, techniques based on a least-squares estimate (k-means, fuzzy k-means) break down • Most clustering algorithms require the number of clusters to be specified in advance • The authors propose a novel, robust, unsupervised clustering technique based on Newton’s Law of Gravitation and Newton’s second law of motion

  4. Introduction • Gravitational concepts have been applied to cluster visualization and analysis before • Properties of Wright’s Gravitational Clustering [2]: • The new position of a particle is computed from the remaining particles • When two particles come close enough they merge • The maximum movement of particles per iteration is capped • The algorithm terminates when only one particle remains • Improvements over Wright: speed, robustness, and automatic determination of the number of clusters

  5. Background • Newton’s Laws of Motion • If acceleration is constant:
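
The equation itself is not reproduced in this transcript; the standard constant-acceleration kinematics relation the slide refers to is:

```latex
% Position after a time step \Delta t under constant acceleration a,
% starting from position x(t) with velocity v(t):
x(t + \Delta t) = x(t) + v(t)\,\Delta t + \tfrac{1}{2}\,a\,(\Delta t)^2
```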

  6. Background • Newton’s Law of Gravitation
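
Again the formula is not in the transcript; the standard statement of the law, for two point masses m_1 and m_2 separated by a distance d, is:

```latex
% Newton's law of gravitation: attractive force between masses m_1 and m_2
% separated by a distance d, with gravitational constant G.
F = G\,\frac{m_1 m_2}{d^2}
```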

  7. Background • Optimal Disjoint Set Union-Find Structure • A disjoint set Union-Find structure supports three operations: • MAKESET(X) FIND(X) UNION(X,Y) • With union by rank and path compression, any sequence of m Union and Find operations on n elements takes O((m + n) α(n)) time, where α is the very slowly growing inverse Ackermann function, i.e. effectively O(m + n) in practice
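
Below is a minimal Python sketch of such a Union-Find structure with union by rank and path compression, the standard optimizations behind the near-linear bound above; the class and method names are illustrative, not taken from the paper.

```python
class DisjointSet:
    """Union-Find with union by rank and path compression."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def makeset(self, x):
        # Create a singleton set containing x.
        self.parent[x] = x
        self.rank[x] = 0

    def find(self, x):
        # Follow parent pointers to the root, compressing the path along the way.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        # Merge the sets containing x and y, attaching the shallower tree to the deeper one.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```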

  8. Proposed Algorithm • Ideas behind applying the gravitational law: • A data point exerts a stronger gravitational force on data points in the same cluster than on data points in other clusters, so the points in a cluster move toward the center of that cluster • For a noise point, the gravitational forces acting on it are so small that it barely moves, so noise points are not assigned to any cluster

  9. Proposed Algorithm • A simplified equation is used to move point x according to the gravitational field of point y • The velocity is taken to be zero at all times • G is reduced after each iteration to prevent the “big crunch”
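
The equation is shown as an image on the slide and is not reproduced here. Under the stated simplifications (unit masses, zero velocity, unit time step), combining the two laws from the Background slides gives a displacement of the following form; treat this as a reconstruction, not the paper's exact notation:

```latex
% Displacement of point x due to the gravitational pull of point y,
% assuming m_1 = m_2 = 1, v = 0, and \Delta t = 1:
\Delta x \;=\; \tfrac{1}{2}\,a \;=\; \frac{G}{2}\,\frac{y - x}{\lVert y - x \rVert^{3}}
```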

  10. Proposed Algorithm
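
The algorithm on this slide is likewise not reproduced in the transcript. The sketch below is an illustrative reconstruction assembled from the surrounding slides: each point is moved by the simplified gravitational update toward a randomly chosen partner, two points are merged (via the Union-Find structure from the Background sketch) when they come within ε of each other, and G is decayed by ∆G after every iteration. The function name, the single random partner per point, and other details are assumptions, not the authors' exact pseudocode.

```python
import random
import numpy as np

def gravitational_clustering(points, G, delta_g, epsilon, max_iter):
    """Illustrative sketch of a gravitational clustering loop.

    points   : (N, d) array of data points (a copy is moved, the input is untouched)
    G        : initial gravitational constant
    delta_g  : fractional decay of G per iteration (e.g. 0.01)
    epsilon  : merge-distance threshold
    max_iter : number of iterations M
    """
    x = np.array(points, dtype=float)
    n = len(x)
    ds = DisjointSet()                  # Union-Find sketch from the Background section
    for i in range(n):
        ds.makeset(i)

    for _ in range(max_iter):
        for i in range(n):
            # Pick one random partner y and move point i toward it.
            j = random.randrange(n)
            if i == j:
                continue
            d = x[j] - x[i]
            dist = np.linalg.norm(d)
            if dist <= epsilon:
                ds.union(i, j)          # points that come very close are merged
            else:
                x[i] += (G / 2.0) * d / dist ** 3
        G *= (1.0 - delta_g)            # shrink G each iteration to avoid the "big crunch"

    return ds, x
```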

  11. Proposed Algorithm • A threshold is used to extract the valid clusters, i.e. those containing at least a minimum number of points (see the sketch below)
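
A possible sketch of this extraction step, assuming the minimum cluster size is a fraction α of the N points (the helper name and threshold rule are illustrative):

```python
from collections import defaultdict

def extract_clusters(ds, n, alpha):
    # Group point indices by their Union-Find representative.
    groups = defaultdict(list)
    for i in range(n):
        groups[ds.find(i)].append(i)
    # Keep only groups with at least alpha * n points; the rest is treated as noise.
    min_size = int(alpha * n)
    return [members for members in groups.values() if len(members) >= min_size]
```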

  12. Proposed Algorithm • Similarities to Agglomerative Hierarchical Clustering • Differences from Agglomerative Hierarchical Clustering

  13. Proposed Algorithm Comparison to Wright [2]

  14. Experiments Synthetic data

  15. Experiments • Results (over 10 trials) • Parameters: M = 500, G = 7×10⁻⁶, ∆G = 0.01, ε = 10⁻⁴ • k-Means and Fuzzy k-Means given 150 iterations

  16. Experiments Clusters found by the G-algorithm

  17. Experiments Clusters found by the G-algorithm (noise removed)

  18. Experiments Movement of points over iterations

  19. Experiments • Scalability (average of 50 trials for each percentage) • The entire data set is not needed to obtain good results

  20. Experiments Sensitivity to α Use α = 0.03

  21. Experiments • Sensitivity to G • Too large => one cluster • Too small => no clusters • No universal value; it depends on the data set

  22. Experiments • Sensitivity to ∆G • Too large => no clusters • Too small => one cluster • Best value ≈ 0.01 based on experiments

  23. Experiments • Sensitivity to ε • Too large => one cluster

  24. Experiments • Real data set • Intrusion detection benchmark data set • 42 attributes, 33 numerical, N = 492,021 • 2 classes – no intrusion (19.3%) and intrusion (80.7%) • Use only the numerical attributes • Use only 1% of the data (chosen randomly) • Parameter settings • G = 1×10⁻⁴ (based on testing) • ∆G = 0.01 • α = 0.03 • ε = 1×10⁻⁶ • M = 100

  25. Experiments • Clustering-Classification Strategy • Assign to each cluster the majority class among the training records that belong to it • An unknown data point is assigned to the class of the closest cluster (cluster centers are used to compute the distance)
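
A minimal sketch of this strategy, assuming cluster centers are the means of their member points; the function names and data layout are illustrative:

```python
import numpy as np
from collections import Counter

def label_clusters(clusters, train_points, train_labels):
    # Give each cluster the majority class of the training records it contains,
    # and use the mean of its members as the cluster center.
    centers, labels = [], []
    for members in clusters:
        centers.append(train_points[members].mean(axis=0))
        labels.append(Counter(train_labels[i] for i in members).most_common(1)[0][0])
    return np.array(centers), labels

def classify(point, centers, labels):
    # Assign an unknown point to the class of the nearest cluster center.
    distances = np.linalg.norm(centers - point, axis=1)
    return labels[int(np.argmin(distances))]
```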

  26. Experiments Real data set results (over 100 trials)

  27. Conclusions / Future Work • Successfully determines the number of clusters in noisy data sets • Can be used to pre-process data by removing noise • Three of four parameters can be set to constant values • Future Work: • Determine method to automatically set G • Extend to different distance metrics

  28. References • [1] J. Gomez, D. Dasgupta, and O. Nasraoui, “A New Gravitational Clustering Algorithm,” In Proc. of the SIAM Int. Conf. on Data Mining, 2003. • [2] W. E. Wright, “Gravitational Clustering,” Pattern Recognition, 9:151-166, Pergamon Press, 1977.
