ARGUS: A Prototype Stream Anomaly Monitoring System

Thesis Proposal
Chun Jin
Thesis Committee: Jaime Carbonell (Chair), Christopher Olston, Jamie Callan, Phil Hayes (DYNAMiX Technologies)

Thesis Statement
• Stream Anomaly Monitoring Systems (SAMS) are an important sub-class of stream applications. The difficulty arises from the very large data volumes and the large number of queries the system must handle.
• Propose an approach for SAMS that implements incremental evaluation schemes with an adapted Rete algorithm on a traditional DBMS platform and exploits SAMS characteristics to optimize query evaluation.
• Demonstrate how the approach and its improvements can lead to a simple and fast implementation of an effective and efficient SAMS.
Chun Jin Carnegie Mellon

Outline
• Motivation
• My ARGUS Approach
• Current Work Status
  • Current System
  • Preliminary Results
• Proposed Work and Timeline

Stream Processing
• Stream Processing Applications
  • Network Traffic Analysis and Router Configuration
  • Internet Services
  • Sensor Data Analysis
  • Anomaly Detection
• Stream Processing Projects
  • STREAM, TelegraphCQ, Aurora
  • NiagaraCQ, OpenCQ, WebCQ
  • Gigascope, Tribeca
  • Tapestry, Alert, Tukwila, etc.

Stream Anomaly Monitoring Systems (SAMS)
• A SAMS monitors structured data streams for anomalies or potential hazards.
• Continuous queries may number in the thousands or tens of thousands.
• Daily stream volumes may exceed millions of records.
• A SAMS query is rarely satisfied (very high selectivity).

SAMS Dataflow
[Figure: SAMS dataflow. Components shown: Data Streams (FedWire Money Transfers, Patient Records), Stream Anomaly Monitoring System, Queries, Storage, Alerts, Analyst.]

Query Example 4
• Suppose that for every big transaction of type code 1000, the analyst wants to check whether the money stayed in the bank or left within ten days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query raises an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money onward within ten days of that transaction using an intermediate bank.

SQL Query for Example 4
    FROM transaction r1, transaction r2, transaction r3
    WHERE r2.type_code = 1000
      AND r3.type_code = 1000
      AND r1.type_code = 1000
      AND r1.amount > 1000000
      AND r1.rbank_aba = r2.sbank_aba
      AND r1.benef_account = r2.orig_account
      AND r2.amount > 0.5 * r1.amount
      AND r1.tran_date <= r2.tran_date
      AND r2.tran_date <= r1.tran_date + 10
      AND r2.rbank_aba = r3.sbank_aba
      AND r2.benef_account = r3.orig_account
      AND r2.amount = r3.amount
      AND r2.tran_date <= r3.tran_date
      AND r3.tran_date <= r2.tran_date + 10;

ARGUS as a Prototype SAMS
• Implement the Adapted Rete Algorithm on a traditional DBMS platform.
  • Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.
• The SAMS assumption of very-high-selectivity queries over very-large-volume data justifies the use of Rete and necessitates some unique improvements:
  • Transitivity Inference (Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97)
  • Predicate Set Evaluation and Materialization
  • Partial Rete (materialization skipping)
  • Complex Common Computation Identification for Sharing
  • Intermingled Sharing and Optimization processing

ARGUS System Architecture
[Figure: ARGUS system architecture. Components shown: Data Streams, Data Tables, Intermediate Tables, Query Table, Do_queries, Query Scheduler, Rete Network Generator, Rete Networks, Stream Anomaly Monitoring, Analyst, Identified Threats.]

ReteGenerator Architecture
[Figure: ReteGenerator architecture. Components shown: SQL Queries, ReteGen Manager, Query Rewriter, Transitivity Inference, Sharing Topology Checker (Check Topology), History-based Rete Optimizer, History-based Cost Estimating, System Catalog, Topology Table, Counter Table, Update Tables, Register Rete Networks.]

Selected ARGUS Topics
• Adapted Rete Algorithm
  • ReteGenerator translates a query into a Rete network that is wrapped as a stored procedure.
  • The procedure implements the Adapted Rete Algorithm, which performs the incremental evaluation.
• Transitivity Inference
• Rete Optimization
• Computation Sharing

Adapted Rete Algorithm (Selection)
• n and m are the old data sets.
• Δn and Δm are the new, much smaller incremental data sets.
• Selection σ
  • σ(n + Δn) = σ(n) + σ(Δn)
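
The selection identity above can be illustrated with a minimal sketch. The list-based data, the `select` helper, and the `is_large` predicate are hypothetical names chosen for illustration; ARGUS itself runs inside a DBMS as stored procedures, which this sketch does not attempt to reproduce.

```python
def select(rows, predicate):
    """Return the rows that satisfy the predicate (a plain selection)."""
    return [row for row in rows if predicate(row)]


# Hypothetical data: amounts already processed (n) and newly arrived (delta_n).
n = [500, 2_000_000, 750, 1_500_000]
delta_n = [3_000_000, 100]


def is_large(amount):
    return amount > 1_000_000


# Materialized result of the selection over the old data.
materialized = select(n, is_large)

# Incremental step: evaluate the predicate on the new rows only, then append.
new_results = select(delta_n, is_large)
materialized = materialized + new_results

# The incremental result matches a full re-evaluation over n + delta_n.
assert materialized == select(n + delta_n, is_large)
```

Only `delta_n` is scanned in the incremental step, so the work is proportional to the size of the new data rather than the full history.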

Adapted Rete Algorithm (Join)
• Join ⋈
  • (n + Δn) ⋈ (m + Δm) = n ⋈ m + Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm
  • n ⋈ m is the old results; the remaining three terms are the new incremental results.
• When Δn and Δm are very small compared to n and m, the time complexity of the incremental join is O(n+m).
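
As a rough illustration of the join decomposition, the sketch below computes the three incremental terms separately and checks them against a full re-join. The nested-loop `join` helper, the `(account, amount)` tuples, and the `same_account` predicate are assumptions made for the example, not part of ARGUS.

```python
def join(left, right, predicate):
    """Nested-loop join: all (l, r) pairs that satisfy the join predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


# Hypothetical rows of the form (account, amount).
n = [("A", 10), ("B", 20)]
m = [("A", 1), ("C", 3)]
delta_n = [("C", 30)]
delta_m = [("B", 2)]


def same_account(l, r):
    return l[0] == r[0]


# Old results: already materialized before the new data arrived.
old_results = join(n, m, same_account)

# New incremental results: the three terms that involve at least one delta.
new_results = (
    join(delta_n, m, same_account)
    + join(n, delta_m, same_account)
    + join(delta_n, delta_m, same_account)
)

# Old plus incremental results equal the join over the full, updated inputs.
full = join(n + delta_n, m + delta_m, same_account)
assert sorted(old_results + new_results) == sorted(full)
```

The term `join(n, m, ...)` is never recomputed; it is the materialized intermediate result that makes the incremental step cheap.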

Incremental Evaluation in Rete: Example 4
[Figure: Rete network for Example 4 over the DataTable, with aliases r1, r2, r3.]
• Selection nodes: r1 (type_code = 1000, amount > 1000000); r2 (type_code = 1000); r3 (type_code = 1000)
• Join node on r1 and r2: r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount * 0.5, r1.tran_date <= r2.tran_date, r2.tran_date <= r1.tran_date + 10
• Join node on r2 and r3: r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date <= r3.tran_date, r3.tran_date <= r2.tran_date + 10

Complex Queries
• A continuous query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms.
• Each SQL term is mapped to a sub-Rete network.
• These sub-Rete networks are then connected to form the statement-level sub-networks.
• The statement-level sub-networks are in turn connected, based on the view references, to form the final query-level Rete network.

Transitivity Inference
• Explore the transitivity properties of comparison operators to derive hidden, highly selective selection predicates.
• Highly selective selection predicates can significantly improve performance because they may produce very small intermediate results. Subsequent joins can then be performed very quickly on the materialized intermediate results.

Transitivity Inference Example
• Given:
  • r1.amount > 1000000
  • r2.amount > r1.amount * 0.5
  • r3.amount = r2.amount
• r1.amount > 1000000 is highly selective on r1.
• We can infer the highly selective predicates:
  • r2.amount > 500000
  • r3.amount > 500000
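
The inference in this example can be sketched as a small fixed-point propagation of lower bounds. This is an illustrative sketch only: the `(target, factor, source)` encoding and the `infer_lower_bounds` function are hypothetical, and it assumes positive factors, strict lower bounds, and no cycles among the relations.

```python
# Known constant lower bounds, read as "column > value".
lower_bounds = {"r1.amount": 1_000_000}

# Relations (target, factor, source), read as target > factor * source
# or target = factor * source. Either way a strict lower bound on the
# source carries over to the target when the factor is positive.
relations = [
    ("r2.amount", 0.5, "r1.amount"),  # r2.amount > r1.amount * 0.5
    ("r3.amount", 1.0, "r2.amount"),  # r3.amount = r2.amount
]


def infer_lower_bounds(lower_bounds, relations):
    """Propagate lower bounds along the relations until nothing changes."""
    inferred = dict(lower_bounds)
    changed = True
    while changed:
        changed = False
        for target, factor, source in relations:
            if source in inferred:
                candidate = factor * inferred[source]
                if target not in inferred or candidate > inferred[target]:
                    inferred[target] = candidate
                    changed = True
    return inferred


result = infer_lower_bounds(lower_bounds, relations)
# result == {'r1.amount': 1000000, 'r2.amount': 500000.0, 'r3.amount': 500000.0}
```

The derived bounds correspond to the inferred predicates r2.amount > 500000 and r3.amount > 500000, which can then be applied as selections before the joins.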

Rete Optimization
[Figure: Rete optimization. Components shown: SQL Query, Join Graph, Join Enumerator, Active List, History-based Cost Estimator, DB, History-based Rete Optimizer, StructureBuilder, Update Tables, Rete network.]

Join Graph Example
[Figure: join graph over tables 1 to 4, with edges labeled by the join predicates P(1,2), P(1,3), P(2,3), and P(3,4).]

History-based Cost Estimator
• Run sub-plans on historical data to estimate the costs of the sub-plans on future data.
  • Assume the same data distribution in the past and the future.
  • Apply heuristic functions to avoid estimating extremely high-cost sub-plans.
• Justification for the History-based Cost Estimator:
  • A query is compiled and optimized once and executed many times.
  • It is tolerable to spend more time on the one-time optimization.
  • Accurate cost estimates pay off as queries are run repeatedly.
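
The idea of running sub-plans on historical data can be sketched in a simplified form. Here a sub-plan is reduced to an ordered list of row filters and the cost is the total number of intermediate rows produced; both simplifications, along with the names `estimate_cost`, `history`, `plan_a`, and `plan_b`, are assumptions for illustration and do not reflect how ARGUS measures the cost of join sub-plans.

```python
def estimate_cost(sub_plan, historical_rows):
    """Run a sub-plan (an ordered list of row filters) on historical data and
    return the total number of intermediate rows it produces."""
    cost = 0
    rows = historical_rows
    for predicate in sub_plan:
        rows = [row for row in rows if predicate(row)]
        cost += len(rows)
    return cost


# Hypothetical historical data: four type-1000 rows and one type-2000 row.
history = [
    {"type_code": 1000, "amount": 500},
    {"type_code": 1000, "amount": 2_000_000},
    {"type_code": 1000, "amount": 900},
    {"type_code": 1000, "amount": 700},
    {"type_code": 2000, "amount": 50},
]


def by_type(row):
    return row["type_code"] == 1000


def by_amount(row):
    return row["amount"] > 1_000_000


plan_a = [by_type, by_amount]  # intermediate sizes 4, then 1
plan_b = [by_amount, by_type]  # intermediate sizes 1, then 1

cost_a = estimate_cost(plan_a, history)
cost_b = estimate_cost(plan_b, history)

# Pick the plan whose historical cost is lowest.
best_plan = min((plan_a, plan_b), key=lambda plan: estimate_cost(plan, history))
```

Under the stated assumption that past and future data share the same distribution, the plan that was cheapest on the history is taken as the cheapest for future data.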

Computation Sharing
• Predicate Indexing
• Extended predicate set operations
• Sharing Algorithm

Predicate Indexing
• Predicate Indexing Concepts:
  • Equivalent Predicate: p1 ≡ p2 iff ∀D, p1(D) = p2(D)
  • Equivalent Predicate Class
  • Canonical Predicate Form
• Predicates are converted into their canonical forms and stored as records in tables.
• Searching for a predicate becomes data retrieval from tables.
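
A minimal sketch of predicate indexing, under strong simplifying assumptions: predicates are restricted to `left op right` comparisons between two column names, the canonical form orders the operands alphabetically, and an in-memory SQLite table stands in for the system tables. The table name `predicate_index`, the node names, and the helper functions are all hypothetical.

```python
import sqlite3

# Operator to use when the two operands of a comparison are swapped.
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "="}


def canonical(left, op, right):
    """Order operands alphabetically, flipping the operator when they swap."""
    if left > right:
        left, right, op = right, left, FLIP[op]
    return (left, op, right)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predicate_index ("
    "lhs TEXT, op TEXT, rhs TEXT, node TEXT, PRIMARY KEY (lhs, op, rhs))"
)


def register(pred, node):
    """Store the canonical form of a predicate with the node that evaluates it."""
    conn.execute(
        "INSERT INTO predicate_index VALUES (?, ?, ?, ?)",
        (*canonical(*pred), node),
    )


def lookup(pred):
    """Find the node for any predicate equivalent to pred, or None."""
    row = conn.execute(
        "SELECT node FROM predicate_index WHERE lhs = ? AND op = ? AND rhs = ?",
        canonical(*pred),
    ).fetchone()
    return row[0] if row else None


register(("r1.rbank_aba", "=", "r2.sbank_aba"), "join_node_1")
register(("r1.tran_date", "<=", "r2.tran_date"), "join_node_2")

# Equivalent predicates written differently map to the same record.
found_eq = lookup(("r2.sbank_aba", "=", "r1.rbank_aba"))
found_cmp = lookup(("r2.tran_date", ">=", "r1.tran_date"))
missing = lookup(("r2.amount", "=", "r3.amount"))
```

Because every member of an equivalence class is reduced to one canonical record, checking whether a predicate has already been computed is a single keyed lookup.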

Relationship between Predicate Sets and Their Result Tuple Sets
• Predicate Set: a set of conjunctive predicates.
• Its Result Tuple Set: the set of database tuples that satisfy all the predicates of the Predicate Set.
• For a fixed database status D, there is a mapping from a predicate set P to its result tuple set S_D(P):
  • S_D: P → S_D(P)
• Predicate sets and their result tuple sets are complementary:
  • Predicates are filters on data items.
  • The more predicates, the fewer result tuples.

Extending Predicate Set Operations
• Defined on predicate sets
• Definitions are justified by the relationships among the corresponding result tuple sets
• Important for common computation identification

Semantic Subset ⊆≡
• Given two predicate sets P1 and P2, we say that P1 is a semantic subset of P2, denoted P1 ⊆≡ P2, if for any database status D we have S_D(P1) ⊇ S_D(P2).

Semantic Subset Example
• p1: t1.a > 1, p2: t1.a > 2
• P1 = {p1}, P2 = {p2}
• S(P1) ⊇ S(P2), so P1 ⊆≡ P2.
• Why? P2 is semantically equivalent to {p1, p2}: any tuple with t1.a > 2 also satisfies t1.a > 1.
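
The example can be checked with a small sketch restricted to predicates of the form `column > constant`. The `(column, bound)` encoding and the `is_semantic_subset` function are hypothetical; the general definition covers arbitrary predicates, which this sketch does not handle.

```python
def is_semantic_subset(p1, p2):
    """True if every predicate in p1 is implied by some predicate in p2.

    Each predicate is a (column, bound) pair read as "column > bound".
    "column > b2" implies "column > b1" exactly when b2 >= b1.
    """
    return all(
        any(col2 == col1 and bound2 >= bound1 for col2, bound2 in p2)
        for col1, bound1 in p1
    )


P1 = {("t1.a", 1)}  # p1: t1.a > 1
P2 = {("t1.a", 2)}  # p2: t1.a > 2

# P1 is a semantic subset of P2, but not the other way around.
assert is_semantic_subset(P1, P2)
assert not is_semantic_subset(P2, P1)

# P2 and {p1, p2} are semantically equivalent: each is a semantic subset
# of the other.
both = P1 | P2
assert is_semantic_subset(P2, both) and is_semantic_subset(both, P2)
```

Semantic subset follows implication between predicates rather than literal set membership, which is why {p1} counts as a subset of {p2} even though the two sets share no element.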

Sharing Types
[Figure: Rete network fragments over tables T1 and T2 illustrating the sharing types: Non-change, Add-only, Reconstruction, and Selection. Node labels shown: POT1, POT2, POJ, PNJ, PFJ, PNT1-POT1, PNT2-POT2, PNJ-POJ, POJ-PFJ.]

Sharing Algorithm Overview
• Non-change sharing.
• Add-only sharing.
• Optimizing the remaining query.
• Reconstruction and selection sharing.
• Constructing the remaining Rete network based on the optimized plan with possible sharing.

Current Work Status
• A preliminary system
  • Database
  • A preliminary ReteGenerator
    • With the Adapted Rete Algorithm and Transitivity Inference
    • Will be expanded to incorporate optimization, computation sharing, incremental aggregation, etc.
• A preliminary evaluation
  • A full evaluation will be conducted on the complete system in the future.

Preliminary Evaluation: Queries and Data
• 7 queries on a synthesized FedWire money transfer database of 320006 records.
• Two data conditions:
  • Data1: old data is the first 300000 records; new data is the remaining 20006 records. The new data triggers alerts.
  • Data2: old data is the first 300000 records; new data is the next 20000 records. The new data does not trigger alerts.

Preliminary Results
[Figure: bar chart of execution time (s), 0 to 50, for queries Q1 to Q7. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2. Rete is run with Transitivity Inference.]

Transitivity Inference
[Figure: two bar charts of execution time (s) on Data1 and Data2, one for Q2 and one for Q4. Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]

Partial Rete Generation
[Figure: bar chart of execution time (s), 0 to 50, on Data1 and Data2. Series: Partial Rete, Rete, SQL.]
• Q4 assumes Transitivity Inference is not applicable.

Proposed Work
• System Design and Implementation
• System Evaluation

System Design and Implementation
• Rete Optimization (in progress) (05–08/2004)
• Computation Sharing (planned) (07–11/2004)
• Incremental Aggregation (planned) (12/2004–02/2005)
• Constraint Exploiting (optional) (04–05/2005)
• Transitivity Inference Enhancements (optional) (06–08/2005)
• Automatic Index Selection (optional) (09–12/2005)

System Evaluation
• Data Collection (12/2004–01/2005)
• Query Generation (12/2004–01/2005)
• Simulation and Evaluation (02–05/2005)
  • Single SQL vs. Single Rete, Multiple SQL vs. Multiple Shared Optimized Rete
  • Single Non-optimized Rete vs. Single Optimized Rete
  • Multiple Non-shared Optimized Rete vs. Multiple Shared Optimized Rete
  • Non-incremental Aggregation vs. Incremental Aggregation

Evaluation: Data Collection
• FedWire Money Transfer Transactions
  • Synthesized 0.5M records.
  • Plan to generate 0.5M more.
  • 23 attributes/record
• Massachusetts Medical Data
  • Real 1.6M records (sanitized)
  • 70 attributes/record
  • In-patient admission and discharge records.
  • Expand to 10M.

Evaluation: Queries
• Currently 7 queries on FedWire and 3 queries on Medical.
• Plan to extend to 20-40 queries for each domain.
• Further extend the query sets with:
  • Similar predicates matching different constants
  • Join predicate sets that have non-empty intersections
  • The same where_clauses but different groupby_clauses
  • The same where_clauses and groupby_clauses but different aggregation operators

Timeline
• System Design and Implementation (Required): 03/2004 – 02/2005
• System Implementation (Optional): 04/2005 – 12/2005
• Evaluation on Required Parts: 12/2004 – 05/2005
• Thesis Writing and Defense: 06/2005 – 03/2006
  • Thesis Writing: 06 – 12/2005
  • Thesis Finalizing: 01 – 03/2006
  • Defense: 02 or 03/2006

ARGUS Summary
• Implement the incremental evaluation schemes with the Adapted Rete Algorithm on a traditional DBMS platform.
• To deal with very-large-volume data, exploit the very-high-selectivity query property for optimization:
  • Transitivity Inference
  • Predicate Set Evaluation and Materialization
  • Partial Rete (materialization skipping)
  • Complex Common Computation Identification for Sharing
  • Intermingled Sharing and Optimization processing

Thank you! Questions and Comments?
