A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data

A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data Di Yang, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute VLDB 2009, Lyon, France Chandi This work is supported under NSF grants CCF-0811510, IIS-0119276, IIS-0414380.

Motivation: data streams are everywhere Are there any patterns in transactions over past hour? patterns transaction info Stock Market Stock Analysts Where are the main clusters formed by enemy warcraft? position info patterns Battlefield 2 Commander

Motivation: pattern mining requests tend to be parameterized • Example 1: give me the stocks that dropped significantly in the most recent transactions. • Example 2: give me the majorclusters formed by enemy warcraft. with in last 10,30, or 60 minutes. 10%, 30% or 50% to the original price size: n war-crafts density: m war-crafts / mi 2

Motivation: best parameters settings are hard to determine Clusters formed by boom carriers need to be updated every10 seconds I only care about the clusters that are formed by more than 20 warcraft Clusters formed by fighter planes need to be updated every 5 seconds I need info for any cluster sized 5 or higher Problem: A lot of similar queries, yet with different parameter settings, how to answer them efficiently. Multiple analysts may raise multiple queries with different parameter settings. ? Parameter settings? I probably know . But, can I try different combinations of them? A single analyst may raise multiple queries with different parameter settings .

Outline • Motivation • Problem Definition & State-of-Art • Basic: Single Query Processing Strategies. • Proposed Solution: Multi-Query Sharing. • Experimental Study • Conclusion 5

range θ cnt θ Target Pattern Type • Why: popular and well know, arbitraryshapes, allow unclassified (noise), deterministic… 1 Core Object: has no less than neighbors in distance from it. 16 2 14 4 9 6 17 12 5 Edge Object: not core object but a neighbor of a core object. 8 13 7 15 Noise: not core object and not a neighbor of any core object. A Density-Based Cluster (DB-Cluster) is a maximum group of connected core objects and the edge objects attached to them

Clustering in Sliding Windows Over Stream W1 W2 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 • Applications include: • monitoring congestion (cluster)in traffic • looking for intensive transaction areas (cluster) in stock trades • identifying malicious attacks (cluster) in network

Problem Definition • Input:a query group QG with multiple density-based clustering queries querying on the same input stream but with arbitrary parameter settings. • Goal: to minimize both the average processing time and the peak memory space needed by the system. Pattern-specific Window-specific Template Density-Based Clustering Query Over Sliding Windows 8

State-Of-Art • Previous work on multiple query sharing concentrates on traditional SPJ and aggregation operators [Arasu04][Hammad03][Wang06]. • No existing solution for multiple query optimization for complex pattern mining requests, such as clustering, in streaming environments. • Existing algorithms for single density-based clustering query over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09]. • Executing algorithms above independently for each query (naive solution ) is not scalable.

Outline • Motivation • Problem Definition & State-of-Art • Basic: Single Query Processing Strategies. • Proposed Solution: Multi-Query Sharing. • Experimental Study • Conclusion

Density-Based Clustering on Data Streams • In highly dynamic streaming environments: • Re-computation. • Incremental cluster maintenance. • Extra-N proposed a hybrid neighbor relationship (neighborship) mechanism to represent cluster structure. • Maintain “Exact Neighborships” (neighbor lists) for none-core objects. • Maintain “Abstract Neighborships” (cluster memberships) for core objects. • A general concept of “Predicted View” is applied to efficiently update the cluster structure. —key: a compact and easy-maintainable cluster representation.

Concept of Predicted Views 9 3 9 3 9 9 2 2 14 13 14 13 14 13 14 13 6 6 6 12 12 12 12 5 5 5 8 8 8 11 11 11 11 7 7 7 1 1 15 15 15 15 10 10 10 10 16 16 16 16 4 4 Current View of W0 Predicted View of W1 Predicted View of W2 Predicted View of W3 window size=16, slide size=4, time=1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 W0 W1 W2 W3

Update Predicted Views 9 3 18 18 18 18 9 9 2 14 13 14 13 14 13 14 13 19 19 19 19 6 6 12 12 12 5 5 8 8 17 11 11 7 7 17 17 11 17 1 15 15 15 15 10 10 10 20 20 16 20 20 16 16 16 4 Predicted View of W2 Predicted View of W3 Expired View of W0 Current View of W1 Predicted View of W4 window size=16, slide size=4, time=1 New Data Points 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 W1 W2 W3 W4

Outline • Motivation • Problem Definition & State-of-Art • Basic: Single Query Processing Strategies. • Proposed Solution: Multiple Query Sharing. • Experimental Study • Conclusion

cnt range θ θ Pattern-Specific Parameters Window-Specific Parameters

cnt θ θ range Arbitrary Pattern-Specific Parameter Case-- arbitrary , fixed • Anyrelationship between the cluster sets identified by them?

“Growth Property” among DB-cluster Sets Grow c6 c5 c4 c6 c5 c4 If any cluster Ci in Clu_Set1 is “contained” by one cluster in Clu_Set2, Clu_Set2 is a “Growth” of Clu_Set1 . Independent Cluster Structure Storage Hierarchical Cluster Structure Storage

Benefits of Hierarchical Cluster Structure • Benefits for Memory Resources: Memory space needed by storing cluster sets identified by multiple queries in QG is independent from |QG|. • Benefits for Computational Resources: Multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently.

θ θ cnt range cnt θ θ range Arbitrary Pattern-Specific Parameter Case-- arbitrary , fixed • Growth property transitively holds among the cluster sets identified by multiple queries with arbitrary and same .

cnt θ θ range Arbitrary Pattern-Specific Parameter Case-- arbitrary , fixed cnt θ IntView_ • Idea: Growth property transitively holds. • Solution: A single integrated representation of predicted views:

θ θ range cnt cnt range θ θ Arbitrary Pattern-Specific Parameter Case-- Fixed , Arbitrary • Growth property transitively holds among the cluster sets identified by multiple queries with arbitrary and same .

range θ θ cnt Arbitrary Pattern-Specific Parameter Case-- arbitrary , fixed range θ IntView_ • Idea: Growth property transitively holds. • Solution: A single integrated representation of predicted views:

θ θ range range θ θ cnt cnt range θ θ cnt Arbitrary Pattern-Specific Parameter Case-- arbitrary , arbitrary θ IntView_ • Growth property holds ,if ≥ and ≤ . • An “Predicted View Tree” for all queries. Cluster set by Q1 Cluster set by Q2 Qj. Qi. grow grow Qj. Qi. Cluster set by Q5 Cluster set by Q4 grow grow Cluster set by Q3

Arbitrary Window-Specific Parameter Case-- arbitrarywin, fixed slide • In this case, maintaining a single query will be sufficient to answer all. • The predicted views for Qi with largest win cover all needed. Answer for Q4 at T=16 Answer for Q1 at T=16 Answer for Q2 at T=16 Answer for Q3 at T=16 Shared Shared Shared Q1.win=16, slide=4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 W0 W1 W2 W3 Time W0 Q2.win=12, slide=4 Q3.win=8 , slide=4 W0 Q4.win=4 , slide=4 W0

Arbitrary Window-Specific Parameter Case-- arbitraryslide, arbitrary win • Use a single meta query with largest window size and adaptive slide sizeto represent queries. Grow

General Case-- arbitraryall four parameters θ IntView_ • Our proposed techniques • for arbitrary pattern parameter cases (intra-window-optimization) • for arbitrary window parameter cases (inter-window-optimization) are orthogonal to each other. • Final integrated structure for QG. IntView_W IntView

Experimental Study • Alternative Methods: • Incremental DBSCAN [Ester98] • Incremental DBSCAN with rqs (range query search sharing) • Extra-N [Yang09] • Extra-N with rqs (range query search sharing) • Chandi (our solution) • Real Streaming Data: • GMTI data recording information about moving vehicles [Mitre08]. • STT data recording stock transactions from NYSE [INETATS08]. • Measurements: • Average processing time for each tuple. • Memory footprint. Chandi

Evaluation for Performance Arbitrary Pattern Parameter Case Arbitrary Window Parameter Case Arbitrary Pattern Parameter Case Arbitrary Window Parameter Case

Evaluation for Performance Arbitrary All Four Parameter Cases

Conclusion: • First effort of applying multi-query optimization techniques on complex mining requests. • General principles proposed, such asincremental pattern representation and meta query strategy, applicable to other pattern types. • First full-sharing framework, called Chandi, for multiple density-based clustering queries over streaming windows, • Experimental study shows Chandihas excellent efficiency and scalability. Future Work: • Other pattern types: other cluster types, outliers, association rules… • Pattern management. • Visualization and interaction.

The End Thanks Chandi

A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data

A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data

Presentation Transcript

Sampling From a Moving Window Over Streaming Data

Data Mining Over a Large, Dynamic Network

Streaming Queries over Streaming Data

Mining Serial Episode Rules with Time Lags over Multiple Data Streams

Frequent Pattern Mining in Data Streams

Strategy Pattern

Queries over Streaming Sensor Data

Data Mining and Access Pattern Discovery

Strategy Pattern

Pattern Directed Mining Of Sequence Data

Multiple Aggregations over Data Stream

Multiple Aggregations Over Data Streams

TIPS FOR A SHARED LLL STRATEGY

Tango: distributed data structures over a shared log

Mining Serial Episode Rules with Time Lags over Multiple Data Streams

Streaming Queries over Streaming Data

Strategy Execution

Data Mining over Hidden Data Sources

Multiple Aggregations Over Data Streams

Streaming Pattern Discovery in Multiple Time-Series

Machine Learning, Pattern Recognition, Data Mining

Queries Over Streaming Sensor Data