Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute

A Unified Framework Supporting Interactive Exploration of Density-Based Clusters in Streaming Windows
EDBT 2010, Submitted
This work is supported under NSF grants CCF-0811510, IIS-0119276, IIS-0414380.

What are Density-Based Clusters?
• Clusters that are defined by individual data points (tuples) and their local “neighborhoods”.
• How are they different from K-median style clustering?
[Figure: example clusterings, with clusters labeled Cluster 1 through Cluster 4]

Formal Definition
• Core Object: has more than θ^cnt neighbors within distance θ^range of it.
• Edge Object: not a core object, but a neighbor of a core object.
• Noise: not a core object and not a neighbor of any core object.
• A Density-Based Cluster (DB-Cluster) is a maximal group of connected core objects and the edge objects attached to them.
[Figure: example data points numbered 1 to 17 illustrating core objects, edge objects and noise]
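
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function names, the brute-force neighbor search, and the choice to treat a point at exactly distance θ^range as a neighbor are all assumptions made for the example.

```python
from math import dist

def classify(points, theta_range, theta_cnt):
    """Label each point 'core', 'edge' or 'noise' (hypothetical helper)."""
    n = len(points)
    # neighbors[i]: indices of the other points within theta_range of point i
    # (inclusive boundary is an assumption of this sketch).
    neighbors = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= theta_range]
        for i in range(n)
    ]
    # Core object: more than theta_cnt neighbors.
    core = {i for i in range(n) if len(neighbors[i]) > theta_cnt}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("edge")   # not core, but a neighbor of a core object
        else:
            labels.append("noise")  # not core and no core neighbor
    return labels, neighbors

def db_clusters(points, theta_range, theta_cnt):
    """Group connected core objects and attach their edge objects."""
    labels, neighbors = classify(points, theta_range, theta_cnt)
    clusters, seen = [], set()
    for start in range(len(points)):
        if labels[start] != "core" or start in seen:
            continue
        cluster, stack = set(), [start]
        seen.add(start)
        while stack:                      # only core objects are ever pushed
            i = stack.pop()
            cluster.add(i)
            for j in neighbors[i]:
                if labels[j] == "core" and j not in seen:
                    seen.add(j)
                    stack.append(j)
                elif labels[j] == "edge":
                    cluster.add(j)        # edge objects attach, but do not expand
        clusters.append(cluster)
    return labels, clusters

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (10, 10)]
labels, clusters = db_clusters(points, theta_range=1.5, theta_cnt=2)
print(labels)                         # ['core', 'core', 'core', 'core', 'edge', 'noise']
print([sorted(c) for c in clusters])  # [[0, 1, 2, 3, 4]]
```

The quadratic neighbor search keeps the sketch short; a streaming system would use an index and incremental maintenance instead.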

Cluster Detection in Sliding Windows
[Figure: two sliding windows, W1 and W2, over a stream of tuples numbered 1 to 8]
[Figure: template of a Density-Based Clustering Query Over Sliding Windows, with pattern-specific and window-specific parameters]

Application Examples:
• Stock market: are there intensive-transaction areas in the last hour of transactions? (transaction info → clusters → stock analysts)
• Battlefield: where are the main clusters formed by enemy warcraft? (position info → clusters → commander)

State-of-Art
• Existing algorithms for density-based clustering queries over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09].
• Extra-N becomes inefficient as the win/slide ratio increases, that is, as the slide shrinks relative to the window.
• No evolution semantics have been defined for how density-based clusters change over time.
• No existing system allows interactive exploration of density-based clusters in streaming windows.

Goals
• A more efficient density-based clustering algorithm over streams.
• An evolution semantics that intuitively explains cluster changes.
• A visualized pattern space allowing interactive exploration of clusters.

Review: Existing Algorithm Extra-N
• In highly dynamic streaming environments:
  • Re-computation.
  • Incremental cluster maintenance.
• Extra-N [Yang09] proposed a hybrid neighbor relationship (neighborship) mechanism to represent the cluster structure:
  • Maintain “Exact Neighborships” (neighbor lists) for non-core objects.
  • Maintain “Abstract Neighborships” (cluster memberships) for core objects.
• A general concept of “Predicted View” is applied to update the cluster structure efficiently.
• Key: a compact and easily maintainable cluster representation.

Concept of Predicted Views
[Figure: Current View of W0 and Predicted Views of W1, W2 and W3 over data points 1 to 16; window size = 16, slide size = 4, time = 1; timeline of points 1 to 16 with windows W0 to W3]
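
As a rough illustration of the idea, the sketch below lists which of the currently known data points would still be alive in each upcoming window. It assumes count-based windows and points that expire in arrival order; the function name `predicted_views` and the dictionary layout are hypothetical, and the sketch only tracks membership, not the cluster structure that Extra-N keeps per view.

```python
def predicted_views(arrival_ids, win, slide):
    """Map k -> ids of current points still alive in the k-th upcoming window.

    k = 0 is the current window. Assumes count-based windows in which the
    oldest `slide` points expire at every slide (an assumption of this sketch).
    """
    views = {}
    k = 0
    while k * slide < win:
        # After k slides, the oldest k * slide points have expired.
        views[k] = arrival_ids[k * slide:]
        k += 1
    return views

views = predicted_views(list(range(1, 17)), win=16, slide=4)
for k, ids in views.items():
    print(f"W{k}: points {ids[0]}..{ids[-1]}")
# W0: points 1..16
# W1: points 5..16
# W2: points 9..16
# W3: points 13..16
```

With window size 16 and slide size 4, this yields four views, W0 to W3, which matches the window labels on the slide.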

Update Predicted Views
[Figure: after the window slides, the view of W0 expires, the view of W1 becomes the current view, the new data points 17 to 20 are added to the views of W1, W2 and W3, and a new Predicted View of W4 is created; window size = 16, slide size = 4; timeline of points 5 to 20 with windows W1 to W4]

Inefficiency of Extra-N
• When the win/slide ratio increases (for example, win = 10000 and slide = 10), a large number of predicted views need to be maintained independently.
• This places a heavy burden on both CPU and memory resources.
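
A quick back-of-the-envelope calculation shows the scale of the problem. It assumes that one view is kept for every window a data point can take part in, which is consistent with the earlier example (window size 16, slide size 4, views W0 to W3); the helper name `num_views` is hypothetical.

```python
import math

def num_views(win, slide):
    """Number of views kept at once, assuming one view per overlapping window."""
    return math.ceil(win / slide)

print(num_views(16, 4))       # 4    (W0..W3 in the earlier example)
print(num_views(10000, 10))   # 1000
```

Under this assumption, the example on this slide requires about a thousand views to be updated at every slide.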

Proposed Solution: IWIN
• Is there any relationship among the cluster sets identified in the different predicted views?

“Growth Property” among DB-Cluster Sets
• If every cluster Ci in Clu_Set1 is “contained” by some cluster in Clu_Set2, then Clu_Set2 is a “Growth” of Clu_Set1.
[Figure: clusters c4, c5 and c6 under Independent Cluster Structure Storage versus Hierarchical Cluster Structure Storage]
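
The growth relation can be checked directly if clusters are represented as sets of member ids. This is a minimal sketch that reads “contained” as the subset relation on cluster members, which is an assumption; the function name `is_growth` and the sample data are made up for illustration.

```python
def is_growth(clu_set1, clu_set2):
    """True if every cluster in clu_set1 is a subset of some cluster in clu_set2."""
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)

# Hypothetical cluster sets, each cluster given as a set of point ids.
clu_set1 = [{13, 14}, {15, 16}]
clu_set2 = [{9, 13, 14, 15, 16}]

print(is_growth(clu_set1, clu_set2))  # True: both small clusters sit inside the large one
print(is_growth(clu_set2, clu_set1))  # False: the large cluster fits in neither small one
```

When the relation holds, the smaller clusters can be stored as children of the clusters that contain them, which is the intuition behind a hierarchical rather than independent storage of cluster sets.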

Integrated vs. Independent Maintenance of Predicted Views
• IWIN: integrated maintenance
• Extra-N: independent maintenance

Benefits of Integrated Maintenance
• Benefits for memory resources: the memory needed to store the cluster sets identified by multiple queries in QG is independent of |QG|.
• Benefits for computational resources: the multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently.
• IWIN outperforms Extra-N in both CPU and memory utilization.

Why do we need evolution semantics?
• Analysts need to know how clusters change over time.
• This is hard to observe by looking at the clusters alone (even with visualization).
• Example questions from a commander:
  • History: Did any clusters merge?
  • Now: Are there any new clusters?
  • Future: Is any cluster about to break apart?

Proposed Semantics
• Single-Step Evolutions:
  • birth
  • termination
  • split
  • merge
  • preserve / expand / shrink
• Multi-Step Evolutions:
  • split-expand
  • split-merge
  • shrink-split
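
To make the single-step evolution types concrete, the sketch below labels the changes between the cluster sets of two consecutive windows. The slides do not give formal definitions, so the rules used here are assumptions: clusters are matched by shared members, and a one-to-one match is labeled expand, shrink or preserve by comparing cluster sizes. The function name and data layout are hypothetical, and multi-step evolutions are not covered.

```python
def single_step_evolutions(old, new):
    """old, new: dict of cluster name -> set of member ids.

    Returns a list of (label, old_names, new_names) tuples.
    Matching rule (assumed): two clusters are related if they share a member.
    """
    succ = {o: [n for n in new if old[o] & new[n]] for o in old}
    pred = {n: [o for o in old if old[o] & new[n]] for n in new}
    events = []
    for n, olds in pred.items():
        if not olds:
            events.append(("birth", [], [n]))
        elif len(olds) > 1:
            events.append(("merge", olds, [n]))
    for o, news in succ.items():
        if not news:
            events.append(("termination", [o], []))
        elif len(news) > 1:
            events.append(("split", [o], news))
        elif len(pred[news[0]]) == 1:
            # One-to-one match: compare sizes (an assumption of this sketch).
            n = news[0]
            if len(new[n]) > len(old[o]):
                label = "expand"
            elif len(new[n]) < len(old[o]):
                label = "shrink"
            else:
                label = "preserve"
            events.append((label, [o], [n]))
    return events

old = {"a": {1, 2, 3, 4}, "b": {5, 6}, "c": {7, 8}, "d": {20, 21}}
new = {"x": {1, 2}, "y": {3, 4}, "z": {5, 6, 7, 8, 9}, "w": {30, 31}}
for event in single_step_evolutions(old, new):
    print(event)
# ('merge', ['b', 'c'], ['z'])
# ('birth', [], ['w'])
# ('split', ['a'], ['x', 'y'])
# ('termination', ['d'], [])
```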

How to Compute
• Extract the predicted evolution (before the window slides).
• Update the evolution (after the window slides).
[Figure: example clusters labeled split, preserve, shrink, preserve]

Conclusion for Proposed Semantics
• Intuitively describes cluster evolution over time.
• Easily maintainable: can be computed on the fly during cluster maintenance.

Outline
• What is Neighbor-Based Pattern Detection
• State-of-Art
• Potential Solutions & Their Inefficiency
• Proposed Solution: Extra-N
• Experimental Study
• Conclusion

Why is interactive exploration needed?
• Analysts need to navigate along the time axis to learn the current state, review the history, and predict the near future.
  • Example: how are the two clusters in the current window related to those detected 30 minutes earlier?
• Analysts need to study the clusters and their evolution at different abstraction levels.
  • Example: for routine traffic monitoring, only the positions of major clusters need to be reported; when an accident happens, specific information about cluster members needs to be reported.

Proposed Pattern Space

Evaluation for IWIN
• Alternative Methods:
  • Incremental DBSCAN [Ester98]
  • Extra-N [Yang09]
  • IWIN
• Real Streaming Data:
  • GMTI data recording information about moving vehicles [Mitre08].
  • STT data recording stock transactions from NYSE [INETATS08].
• Measurements:
  • Average processing time for each tuple.
  • Memory footprint.

Evaluation for IWIN

Case Study 1

Case Study 2

Conclusion
• Presented the first unified framework supporting interactive exploration of density-based clusters in streaming windows.
• Designed a more efficient density-based clustering algorithm, IWIN.
• Defined the first evolution semantics for density-based clusters.
• Our experimental study confirms both the efficiency and the effectiveness of the proposed framework.

Future work
• Support multiple queries.
• Support other pattern types, such as outliers, association rules…
• Support pattern storage and matching.
• More?

The End Thanks
