SATSCAN VERSUS DBSCAN


  1. SATSCAN VERSUS DBSCAN Presented by: GROUP 7 Gayathri Gandhamuneni & Yumeng Wang

  2. AGENDA • Problem Statement • Motivation / Novelty • Related Work & Our Contributions • Proposed Approach • Key Concepts • Validation • Results • Conclusions • Future Work

  4. PROBLEM STATEMENT • Input: • Two different clustering algorithms (DBSCAN, SaTScan) • Same input dataset • Criteria of comparison • Output: • Result of comparison – data / graph • Constraints: • DBSCAN – no data about efficiency • SaTScan software – only one predefined shape • Objective: • Usage scenarios – which algorithm can be used where?

  6. MOTIVATION / NOVELTY • Different clustering algorithms • Categorized into different types • Comparisons • Algorithms – same category • No systematic way of comparison, biased comparisons • No situation-based comparison – which to use where? • No comparison between DBSCAN & SaTScan

  8. RELATED WORK • Comparison of Clustering Algorithms • Same type of algorithms: • Density based – DBSCAN & OPTICS • Density based – DBSCAN & SNN • Different types of algorithms: • DBSCAN vs K-Means • K-Means (centroid based) vs Hierarchical, Expectation-Maximization (distance based) • Our work – DBSCAN vs SaTScan

  10. PROPOSED APPROACH • Our Approach: • Two different types of clustering algorithms • DBSCAN • SaTScan • Unbiased comparison • Systematic – 3 factors & same input datasets • Shape of the cluster • Statistical significance • Scalability

  11. AGENDA • Problem Statement • Motivation / Novelty • Related Work & Our Contributions • Proposed Approach • Key Concepts • Challenges • Validation • Results • Conclusions • Future Work

  12. KEY CONCEPTS - 1 • Clustering • Task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. • Data Mining, Statistical Analysis & many more fields • Real world Application: • Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones • Field Robotics: For robotic situational awareness to track objects and detect outliers in sensor data

  13. KEY CONCEPTS - 2 Types of Clustering Algorithms • Connectivity based / Hierarchical – Core idea: objects are more related to nearby objects (distance) than to objects farther away • Centroid based – Core idea: clusters are represented by a central vector, which may not necessarily be a member of the data set. Ex: K-Means • Distribution based – Core idea: clusters can be defined as objects belonging most likely to the same distribution • Density based – Core idea: clusters are areas of higher density than the remainder of the data set • ………

  14. KEY CONCEPTS - DBSCAN • Density-based clustering • Arguments • Minimum number of points – MinPts • Radius – Eps • Density = number of points within the specified radius (Eps) • Three types of points • Core point – number of points within Eps ≥ MinPts • Border point – number of points within Eps < MinPts, but the point is in the neighborhood of a core point • Noise point – neither a core point nor a border point
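
The three point types above can be illustrated with a minimal Python sketch. It is not the code used for the experiments in this deck. The function name `classify_points`, the brute-force neighbor search, and the sample coordinates are illustrative assumptions. The sketch also assumes that a point counts itself as a neighbor and that the core test is "at least MinPts", as in the backup-slide description; some texts use a strict "more than MinPts" instead.

```python
from math import dist  # Python 3.8+


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: a point is its own neighbor and the core test is
    len(neighbors) >= min_pts.
    """
    # Brute-force Eps-neighborhoods (O(n^2)); fine for a small illustration.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}

    labels = []
    for i, nbrs in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical toy data: a small square, one point hanging off it, one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here (2, 1) has only two points within Eps (itself and (1, 1)), so it is not a core point, but it lies in the neighborhood of the core point (1, 1) and is therefore a border point.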

  15. EXAMPLE - DBSCAN • Dataset 1 :

  16. DBSCAN RESULTS - 1 DBSCAN output on Dataset 1: Min-Neighbors = 3, Radius = 5 • Number of clusters = 36

  17. DBSCAN RESULTS - 2 DBSCAN output on Dataset 1: Min-Neighbors = 7, Radius = 1 • Number of clusters = 0

  18. DBSCAN RESULTS - 3 DBSCAN output on Dataset 1: Min-Neighbors = 20, Radius = 20 • Number of clusters = 4

  19. KEY CONCEPTS - 3 • SaTScan – Spatial Scan Statistics • Input: • Dataset • Null hypothesis model • Procedure: • Scanning window of a predefined shape • Vary the size of the window • Calculate likelihood ratio => most likely clusters • Test statistical significance (Monte Carlo sampling, 1000 runs) • Output: • Clusters with p-value: significant / primary, insignificant / secondary
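
The procedure above (scan windows of varying size, likelihood ratio, Monte Carlo significance test) can be sketched in plain Python. This is not SaTScan itself and makes several assumptions that are not in the slides: a Bernoulli null model with case/control labels, windows formed from the k nearest points of each center (a simplification of circular windows), a 50% cap on window size, and the hypothetical function names `bernoulli_llr`, `max_llr`, and `scan`. The demo uses `n_sim=99` only to keep it fast; with 999 replications the smallest attainable p-value would be 1/1000 = 0.001.

```python
import math
import random


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """Log likelihood ratio for a window with c cases among n points,
    given C cases among N points overall (high-rate windows only)."""
    if n == 0 or n == N or c / n <= (C - c) / (N - n):
        return 0.0
    inside = _xlogy(c, c / n) + _xlogy(n - c, (n - c) / n)
    outside = (_xlogy(C - c, (C - c) / (N - n))
               + _xlogy(N - n - (C - c), (N - n - (C - c)) / (N - n)))
    null = _xlogy(C, C / N) + _xlogy(N - C, (N - C) / N)
    return inside + outside - null


def max_llr(orders, is_case, max_frac=0.5):
    """Return (best LLR, center index, window size) over all windows."""
    N, C = len(is_case), sum(is_case)
    best = (0.0, None, 0)
    for center, order in enumerate(orders):
        c = 0
        # Grow the window one nearest neighbor at a time.
        for n, j in enumerate(order[: int(N * max_frac)], start=1):
            c += is_case[j]
            llr = bernoulli_llr(c, n, C, N)
            if llr > best[0]:
                best = (llr, center, n)
    return best


def scan(points, is_case, n_sim=999, seed=0):
    """Most likely cluster plus its Monte Carlo p-value."""
    rng = random.Random(seed)
    # For each center, point indices sorted by distance (center itself first).
    orders = [sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
              for p in points]
    observed = max_llr(orders, is_case)
    labels = list(is_case)
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(labels)  # null hypothesis: case labels are exchangeable
        if max_llr(orders, labels)[0] >= observed[0]:
            exceed += 1
    return observed, (exceed + 1) / (n_sim + 1)


# Hypothetical data: scattered background points plus a tight group of cases.
rng = random.Random(1)
background = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(30)]
hotspot = [(rng.gauss(2, 0.3), rng.gauss(2, 0.3)) for _ in range(8)]
points = background + hotspot
is_case = [int(rng.random() < 0.2) for _ in background] + [1] * len(hotspot)

(llr, center, size), p_value = scan(points, is_case, n_sim=99)
print(f"max LLR = {llr:.2f}, window of {size} points, p = {p_value:.3f}")
```

The p-value is the rank of the observed maximum likelihood ratio among the simulated maxima, which is why it depends on the number of Monte Carlo runs.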

  21. CHALLENGES • Tuning parameters – DBSCAN • Manual tuning to detect clusters • Hard to set correct parameters • Design of appropriate datasets • To demonstrate the criteria of comparison

  23. VALIDATION • Experiment • Assumptions based on theory • Designing datasets and running experiment • Able to validate them with results

  25. CLUSTER SHAPE - DBSCAN vs SaTScan

  26. CLUSTER SHAPE - DBSCAN vs SaTScan

  27. STATISTICAL SIGNIFICANCE • CSR (complete spatial randomness) dataset – 1000 points

  28. STATISTICAL SIGNIFICANCE • CSR (complete spatial randomness) dataset – 2000 points
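
A CSR dataset of the kind named on these two slides can be approximated by drawing points independently and uniformly over a rectangle. The sketch below is an assumption about how such data might be generated, not the generator used for the slides. The function name `csr_points`, the 100 x 100 extent, and the seed are hypothetical.

```python
import random


def csr_points(n, width=100.0, height=100.0, seed=None):
    """Draw n points independently and uniformly over a width x height rectangle."""
    rng = random.Random(seed)
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]


small = csr_points(1000, seed=42)
large = csr_points(2000, seed=42)
print(len(small), len(large))  # 1000 2000
```

Because such data contains no true clusters by construction, anything a clustering algorithm reports on it is a chance artifact, which is what makes it useful for checking statistical significance.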

  29. RUNTIME – Number of Points – DBSCAN

  30. RUNTIME – Number of Points – SaTScan

  31. RUNTIME – Number of Clusters – DBSCAN vs SaTScan • Same clusters!! • Datasets: 3000 points

  32. RUNTIME – Number of Clusters – DBSCAN

  33. RUNTIME – Number of Clusters – SaTScan

  35. CONCLUSIONS

  37. FUTURE WORK • Same project – Real World Datasets • Run more instances of the experiments • Control over parameters • Compare with other types of clustering algorithms

  38. QUESTIONS?

  39. BACKUP SLIDE - 1 • DBSCAN requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbor points within distance eps of the starting point. • If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively. • If the number of neighbors is less than minPts, the point is marked as noise. • If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
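
The steps described on this slide can be sketched as a small pure-Python routine. It is a minimal illustration, not the implementation used in the experiments. The names `dbscan`, `region_query`, and `NOISE` are assumptions, the neighbor search is brute force, and an explicit queue replaces the recursion mentioned above so that large clusters do not hit Python's recursion limit.

```python
from collections import deque
from math import dist  # Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE (-1) marks noise points."""

    def region_query(i):
        # All points within eps of point i, including i itself (O(n) scan).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled as a border point later
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:  # only core points expand the cluster
                queue.extend(j_neighbors)
    return labels


# Hypothetical toy data: two small groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10),
          (20, 20), (20, 21), (21, 20)]
print(dbscan(points, eps=1.0, min_pts=3))
# [0, 0, 0, 0, 0, -1, 1, 1, 1]
```

Note that a point first marked as noise can still join a cluster later if it turns out to lie within Eps of a core point, which is the border-point case.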

  40. BACKUP SLIDE 2 - CONCLUSIONS • DBSCAN works • Same density clusters • Number of clusters not known beforehand • Different shaped clusters • All points need not be in clusters – noise concept is present • DBSCAN doesn’t work • Varying density clusters • Quality of DBSCAN depends on Epsilon and its distance measure – with Euclidean distance, high-dimensional data suffers from the curse of dimensionality • TO DO

  41. CLUSTER SHAPE - DBSCAN Results

  42. SHAPE - SATSCAN RESULTS p-value: 0.001
