
Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute

A Unified Framework Supporting Interactive Exploration of Density-Based Clusters In Streaming Windows. Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute EDBT 2010, Submitted.


Presentation Transcript


  1. A Unified Framework Supporting Interactive Exploration of Density-Based Clusters In Streaming Windows Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute EDBT 2010, Submitted This work is supported under NSF grants CCF-0811510, IIS-0119276, IIS-0414380.

  2. What are Density-Based Clusters? • Clusters that are defined by individual data points (tuples) and their local “neighborhood”. • How are they different from K-median style clustering? [Figure: example clusterings with labels Cluster 1 through Cluster 4.]

  3. Formal Definition • Core Object: has more than θ^cnt neighbors within distance θ^range of it. • Edge Object: not a core object, but a neighbor of a core object. • Noise: not a core object and not a neighbor of any core object. • A Density-Based Cluster (DB-Cluster) is a maximum group of connected core objects and the edge objects attached to them. [Figure: example data points labeled by number.]
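The definitions on this slide can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the algorithm from the talk. The function names `classify` and `db_clusters`, the Euclidean distance, and the choice not to count an object as its own neighbor are all assumptions made for the example.

```python
from math import dist


def classify(points, theta_range, theta_cnt):
    """Label every point index as 'core', 'edge' or 'noise' (brute force)."""
    n = len(points)
    # Assumption: a point is not counted as its own neighbor.
    neighbors = {
        i: [j for j in range(n)
            if j != i and dist(points[i], points[j]) <= theta_range]
        for i in range(n)
    }
    # Core object: more than theta_cnt neighbors within theta_range.
    core = {i for i in range(n) if len(neighbors[i]) > theta_cnt}
    labels = {}
    for i in range(n):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbors[i]):
            labels[i] = "edge"   # not core, but a neighbor of a core object
        else:
            labels[i] = "noise"  # not core and not next to any core object
    return labels, neighbors


def db_clusters(points, theta_range, theta_cnt):
    """Group connected core objects and attach their edge objects."""
    labels, neighbors = classify(points, theta_range, theta_cnt)
    clusters, seen = [], set()
    for start in range(len(points)):
        if labels[start] != "core" or start in seen:
            continue
        members, stack = set(), [start]
        seen.add(start)
        while stack:  # flood fill over connected core objects
            c = stack.pop()
            members.add(c)
            for j in neighbors[c]:
                if labels[j] == "core" and j not in seen:
                    seen.add(j)
                    stack.append(j)
                elif labels[j] == "edge":
                    members.add(j)
        clusters.append(members)
    return labels, clusters


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels, clusters = db_clusters(points, theta_range=1.0, theta_cnt=1)
```

In this toy data set the four corner points are core objects, the point at (2, 1) is an edge object attached to them, and the point at (10, 10) is noise.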

  4. Cluster Detection in Sliding Windows • Template Density-Based Clustering Query Over Sliding Windows, with a pattern-specific part and a window-specific part. [Figure: windows W1 and W2 over a stream of tuples numbered 1 to 8.]

  5. Application Examples • Stock market: are there intensive-transaction areas in the last hour of transactions? Transaction info flows in from the stock market, and clusters are reported to stock analysts. • Battlefield: where are the main clusters formed by enemy warcraft? Position info flows in from the battlefield, and clusters are reported to the commander.

  6. State-of-Art • Existing algorithms for density-based clustering queries over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09]. • Extra-N becomes inefficient as the win/slide ratio increases. • No evolution semantics have been defined for how density-based clusters change over time. • No existing system allows interactive exploration of density-based clusters in streaming windows.

  7. Goals • A more efficient density-based clustering algorithm over streams. • An evolution semantics that intuitively explains cluster changes. • A visualized pattern space allowing interactive exploration of clusters.

  8. Review: Existing Algorithm, Extra-N • In highly dynamic streaming environments: • Re-computation. • Incremental cluster maintenance. • Extra-N [Yang09] proposed a hybrid neighbor relationship (neighborship) mechanism to represent cluster structure. • Maintain “Exact Neighborships” (neighbor lists) for non-core objects. • Maintain “Abstract Neighborships” (cluster memberships) for core objects. • A general concept of “Predicted View” is applied to efficiently update the cluster structure. • Key: a compact and easily maintainable cluster representation.

  9. Concept of Predicted Views [Figure: panels Current View of W0 and Predicted Views of W1, W2 and W3 over tuples 1 to 16, covered by windows W0 to W3. Window size=16, slide size=4, time=1.]
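The idea behind this figure can be sketched as follows: each future window gets a "predicted view" containing only the already-known tuples that will still be alive in it. The function name `predicted_views`, the count-based windows, and the numbering scheme (window W_k covers tuple ids k*slide+1 through k*slide+win) are assumptions chosen to match the figure's example, not details confirmed by the talk.

```python
def predicted_views(known_tuples, win, slide, current):
    """Map window index -> known tuple ids that fall inside that window.

    Assumption: windows are count-based, and window W_k covers tuple ids
    k*slide + 1 .. k*slide + win.
    """
    views = {}
    k = current
    while True:
        lo, hi = k * slide + 1, k * slide + win
        members = [t for t in known_tuples if lo <= t <= hi]
        if not members:      # no known tuple survives into this window
            break
        views[k] = members
        k += 1
    return views


# Tuples 1..16 have arrived; W0 is the current window.
before = predicted_views(range(1, 17), win=16, slide=4, current=0)

# After one slide, tuples 17..20 have arrived and tuples 1..4 have expired.
after = predicted_views(range(5, 21), win=16, slide=4, current=1)
```

With window size 16 and slide size 4, `before` holds one current view (W0) and three predicted views (W1 to W3), so the number of views kept at any time is win/slide. After the slide, the view of W0 is gone and a new predicted view of W4 appears, holding only the new tuples 17 to 20.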

  10. Update Predicted Views [Figure: after the window slides, with panels Expired View of W0, Current View of W1, and Predicted Views of W2, W3 and W4. New data points 17 to 20 arrive, and windows W1 to W4 span tuples 5 to 20. Window size=16, slide size=4, time=1.]

  11. Inefficiency of Extra-N • When the win/slide ratio increases (for example, win=10000, slide=10), a large number of predicted views need to be maintained independently. • This places a heavy burden on both CPU and memory resources. [Figure: slide size vs. window size.]

  12. Proposed Solution: IWIN • Is there any relationship between the clusters identified?

  13. “Growth Property” among DB-Cluster Sets • If every cluster Ci in Clu_Set1 is “contained” by one cluster in Clu_Set2, Clu_Set2 is a “Growth” of Clu_Set1. [Figure: clusters c4, c5 and c6 growing, shown under Independent Cluster Structure Storage and under Hierarchical Cluster Structure Storage.]
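The growth property can be checked directly on sets of object ids. The sketch below reads “contained” as the subset relation and applies the condition to every cluster in Clu_Set1. Both readings are assumptions, as are the function name `is_growth` and the example object ids.

```python
def is_growth(clu_set1, clu_set2):
    """True if every cluster in clu_set1 is a subset of some cluster in clu_set2."""
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)


# Hypothetical cluster sets, each cluster given as a set of object ids.
clu_set1 = [{1, 2}, {3}, {5, 6}]
clu_set2 = [{1, 2, 3, 4}, {5, 6, 7}]

grows = is_growth(clu_set1, clu_set2)       # Clu_Set2 is a growth of Clu_Set1
grows_back = is_growth(clu_set2, clu_set1)  # the reverse does not hold
```

When the property holds, the smaller cluster set can be stored as children of the clusters that contain them, which is one way to picture the hierarchical storage named on the slide.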

  14. Integrated vs. Independent Maintenance of Predicted Views • IWIN: integrated maintenance • Extra-N: independent maintenance

  15. Benefits of Integrated Maintenance • Benefits for Memory Resources: The memory space needed to store the cluster sets identified by multiple queries in QG is independent of |QG|. • Benefits for Computational Resources: Multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently. • IWIN outperforms Extra-N in both CPU and memory utilization.

  16. Goals • A more efficient density-based clustering algorithm over streams. • An evolution semantics that intuitively explains cluster changes. • A visualized pattern space allowing interactive exploration of clusters.

  17. Why do we need evolution semantics? • Analysts need to know how clusters change over time. • This is hard to observe by looking at the clusters only (even with visualization). • History: Did any clusters merge? • Now: Are there any new clusters? • Future: Will any cluster break up shortly?

  18. Proposed Semantics • Single Step Evolutions: • birth • termination • split • merge • preserve/expand/shrink • Multi Step Evolutions: • split-expand • split-merge • shrink-split

  19. How to Compute • Extract Predicted Evolution (before window slide) • Update Evolution (after window slide) [Figure labels: split, preserve, shrink, preserve.]
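One way to picture the single-step evolution types is to compare two cluster sets, for example the clusters of the current view against those of the next predicted view. The sketch below is a hypothetical illustration, not the computation used in the talk. It links clusters that share at least one object, and it tells preserve, expand and shrink apart by member count. The function name `single_step_evolutions` and both rules are assumptions. Multi-step evolutions are not covered.

```python
def single_step_evolutions(old, new):
    """Classify evolutions between two cluster sets.

    old, new: dict mapping a cluster name to a set of object ids.
    Returns a list of (kind, old_names, new_names) tuples.
    Assumption: two clusters are related if they share at least one object.
    """
    succ = {o: [n for n in new if old[o] & new[n]] for o in old}
    pred = {n: [o for o in old if old[o] & new[n]] for n in new}
    events = []
    for o, news in succ.items():
        if not news:
            events.append(("termination", [o], []))
        elif len(news) > 1:
            events.append(("split", [o], news))
    for n, olds in pred.items():
        if not olds:
            events.append(("birth", [], [n]))
        elif len(olds) > 1:
            events.append(("merge", olds, [n]))
        elif len(succ[olds[0]]) == 1:
            # One-to-one link. Assumption: compare by member count.
            before, after = len(old[olds[0]]), len(new[n])
            kind = ("preserve" if after == before
                    else "expand" if after > before else "shrink")
            events.append((kind, olds, [n]))
    return events


old = {"a": {1, 2, 3}, "b": {4, 5, 6, 7}, "c": {8, 9},
       "d": {10, 11}, "e": {12, 13}}
new = {"x": {1, 2, 3, 20}, "y": {4, 5}, "z": {6, 7},
       "w": {8, 9, 10, 11}, "v": {30, 31}}
events = single_step_evolutions(old, new)
```

In this made-up example, cluster a expands into x, b splits into y and z, c and d merge into w, e terminates, and v is born.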

  20. Conclusion for Proposed Semantics • Intuitively describes cluster evolution over time. • Easily maintainable: can be computed on-the-fly during cluster maintenance.

  21. Goals • A more efficient density-based clustering algorithm over streams. • An evolution semantics that intuitively explains cluster changes. • A visualized pattern space allowing interactive exploration of clusters.

  22. Outline • What is Neighbor-Based Pattern Detection • State-of-Art • Potential Solutions & Their Inefficiency • Proposed Solution: Extra-N • Experimental Study • Conclusion

  23. Why needed? • Analysts need to navigate along the time axis to learn the current state, review the history, and predict the near future. • Example: how are the two clusters in the current window related to those detected 30 minutes back? • Analysts need to study the clusters and their evolution at different abstraction levels. • Example: for routine traffic monitoring, only the positions of major clusters need to be reported; when an accident happens, specific information about cluster members needs to be reported.

  24. Proposed Pattern Space

  25. Evaluation for IWIN • Alternative Methods: • Incremental DBSCAN [Ester98] • Extra-N [Yang09] • IWIN • Real Streaming Data: • GMTI data recording information about moving vehicles [Mitre08]. • STT data recording stock transactions from NYSE [INETATS08]. • Measurements: • Average processing time for each tuple. • Memory footprint.

  26. Evaluation for IWIN

  27. Case Study 1

  28. Case Study 2

  29. Conclusion • Presented the first unified framework supporting interactive exploration of density-based clusters in streaming windows. • Designed a more efficient density-based clustering algorithm, IWIN. • Defined the first evolution semantics for density-based clusters. • Our experimental study confirms both the efficiency and the effectiveness of our proposed framework.

  30. Future work • Support multiple queries. • Support other pattern types, such as outliers, association rules… • Support pattern storage and match. • More?

  31. The End Thanks
