
Adaptivity in continuous query systems



  1. Adaptivity in continuous query systems Luis A. Sotomayor & Zhiguo Xu Professor Carlo Zaniolo CS240B - Spring 2003

  2. Outline • Introduction • Adapting to the “burstiness” of data streams by using a smart operator scheduling strategy • Adapting to high volumes of data streamed by multiple data sources through the use of “adaptive filters” • Conclusion Sotomayor - Xu

  3. Introduction • Two distinguishing characteristics of data streams: • Volume of data is extremely high • Decisions are made in close to real time • Traditional solutions are impractical • Data cannot be stored in static databases for offline querying • Importance of data streams is due to variety of applications Sotomayor - Xu

  4. Applications of data streams • Network monitoring • Intrusion detection systems • Fraud detection • Financial monitoring • E-commerce • Sensor networks Sotomayor - Xu

  5. Research efforts • Large number of applications has led to many efforts seeking to construct full-fledged DSMS • Efforts have concentrated on issues of • System architectures • Query languages • Algorithm efficiency • Issues such as efficient resource allocation and communication overhead have received less attention Sotomayor - Xu

  6. Importance of adaptivity • DSMS deal with multiple long-running continuous queries • Data streams do not usually arrive at a regular rate • Considerable “burstiness” and variation over time • Environmental conditions in which queries are executed frequently differ from the conditions for which the query plans were generated • DSMS may face an increasing number of data sources and therefore an increased volume of traffic Sotomayor - Xu

  7. The “Chain” operator scheduling strategy

  8. The classic solution • Buffer the backlog of unprocessed tuples • Work through them during periods of light load • Problem: • A heavy load could exceed physical memory (causing paging to disk) • The memory used for these backlogs therefore has to be minimized Sotomayor - Xu

  9. Finding a better solution • Claim: the operator scheduling strategy can have a significant impact on run-time resource consumption • Use an operator scheduling strategy that will minimize the amount of memory used during query execution • I.e. reduce the size of the backlogs Sotomayor - Xu

  10. Chain scheduling • A near optimal operator scheduling strategy • Outperforms competing operator scheduling strategies • Strategy concentrates on • Single stream queries involving • Selection • Projection • Foreign-key joins with stored relations • Sliding window queries over multiple streams Sotomayor - Xu

  11. The model • Query execution is conceptualized as a data flow diagram (a directed acyclic graph) • Nodes correspond to pipelined operators • Edges represent compositions of operators • An edge from A to B indicates the output of operator A is the input to operator B • Another interpretation: an edge represents an input queue that buffers the output of A before it is input to B Sotomayor - Xu

  12. An example • Suppose the query is SELECT Name FROM EmployeeStream WHERE ID = ‘12345’; • Operators are • Projection (SELECT …) • Selection (WHERE …) • [Figure: operator path: input stream → Select → Project → output stream] Sotomayor - Xu
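To make the operator-path model concrete, here is a small illustrative sketch (not from the slides; the generator-based pipeline and the field names are assumptions) of the example query as a selection followed by a projection:

```python
# Illustrative sketch of the example query as an operator path: selection
# (WHERE ID = '12345') followed by projection (SELECT Name), modeled as
# generators connected by their input queues.

def selection(stream, pred):
    for tuple_ in stream:
        if pred(tuple_):
            yield tuple_                          # selectivity < 1: fewer tuples flow on

def projection(stream, fields):
    for tuple_ in stream:
        yield {f: tuple_[f] for f in fields}      # tuples shrink in size

employee_stream = [{"ID": "12345", "Name": "Ada"}, {"ID": "99999", "Name": "Bob"}]
operator_path = projection(selection(employee_stream, lambda t: t["ID"] == "12345"),
                           ["Name"])
print(list(operator_path))                        # [{'Name': 'Ada'}]
```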

  13. Main ideas • Operators are thought of as filters • Operate on a set of tuples • Produce s tuples in return • s = the selectivity of the operator • If s = 0.2, we can interpret the value in two ways • Out of every 10 tuples, the operator outputs 2 tuples • If the input requires 1 unit of memory, the output will require 0.2 units of memory Sotomayor - Xu

  14. Example • Consider an operator path with two operators O1 and O2 • Assume that O1 takes one unit of time to process a tuple and that its selectivity is 0.2 • Assume that O2 takes one unit of time to process 0.2 tuples and that its selectivity is 0 • I.e. O2’s output leaves the system, so no memory is consumed after it Sotomayor - Xu

  15. Example (cont) • Now consider two strategies • FIFO • A tuple is passed through both operators in two consecutive time units • No other tuples are processed during that time • Greedy strategy • If there is a tuple buffered before O1 then it is operated on using one time unit • Otherwise if there are tuples buffered before O2, 0.2 tuples are processed using 1 time unit Sotomayor - Xu

  16. Example (cont) • Memory usage • Need to consider the growth or reduction of data as it travels along the operator path • Memory in use at each time step under the two strategies:

    Time   Greedy scheduling   FIFO scheduling
      0          1                   1
      1          1.2                 1.2
      2          1.4                 2.0
      3          1.6                 2.2
      4          1.8                 3.0
      5          2.0                 3.2
      6          2.2                 4.0

  Sotomayor - Xu
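The following minimal simulation, a sketch under the example's assumptions (one unit-size tuple arrives per time unit; memory is sampled right after each arrival, before that time unit's processing), reproduces the figures in the table; the function and variable names are illustrative:

```python
# O1 takes 1 time unit per tuple, selectivity 0.2; O2 takes 1 time unit per
# 0.2-unit batch and consumes its input (selectivity 0 from the memory view).

def simulate(strategy, steps=7):
    q1 = 0.0   # memory units buffered before O1
    q2 = 0.0   # memory units buffered before O2
    usage = []
    for t in range(steps):
        q1 += 1.0                          # new unit-size tuple arrives
        usage.append(round(q1 + q2, 1))
        if strategy == "greedy":
            if q1 >= 1.0:                  # prefer O1: biggest size drop per time unit
                q1 -= 1.0
                q2 += 0.2
            elif q2 > 0:
                q2 -= 0.2                  # O2 discards 0.2 units per time unit
        elif strategy == "fifo":
            # each tuple is pushed through O1 then O2 in consecutive time units
            if t % 2 == 0 and q1 >= 1.0:
                q1 -= 1.0
                q2 += 0.2
            elif q2 > 0:
                q2 -= 0.2
    return usage

print("Greedy:", simulate("greedy"))  # [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2]
print("FIFO:  ", simulate("fifo"))    # [1.0, 1.2, 2.0, 2.2, 3.0, 3.2, 4.0]
```

The numbers illustrate the point of the example: Greedy grows the backlog much more slowly because it always drains the input queue of the size-reducing operator O1, while FIFO lets unprocessed tuples accumulate.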

  17. Progress charts • Behavior of data is captured by progress charts • Each point represents an operator • The i-th operator takes (t_i – t_{i-1}) units of time to process a tuple of size s_{i-1} • The result is a tuple of size s_i Sotomayor - Xu

  18. Progress charts (cont) • We can define selectivity as the relative drop in tuple size across an operator • In other words, the selectivity of operator i equals s_i / s_{i-1} Sotomayor - Xu

  19. The lower envelope • Consider some point (t, s) on the progress chart • Imagine a line from this point to every operator point (t_i, s_i) to its right • The operator point whose line has the steepest (most negative) slope is called the “steepest descent operator point” Sotomayor - Xu

  20. The lower envelope (cont) • By starting at the first point (t_0, s_0) and repeatedly jumping to the steepest descent operator point, we trace out the lower envelope P’ of a progress chart P • Notice that the segments become progressively less steep (their descent rates are non-increasing) Sotomayor - Xu

  21. The lower envelope (cont) • So what is it? • A way to find which segments of the operator path yield the biggest drops in tuple size • It allows us to consider changes in selectivity across groups of operators • We call these groups “chains” Sotomayor - Xu

  22. “Chain” scheduling • Chain assigns each operator a priority equal to the steepness of the lower envelope segment to which it belongs • At any time • Out of all the operators with tuples in their input queues, the one with the highest priority is chosen • When there are “ties,” the operator with the oldest tuples is chosen (based on arrival time) Sotomayor - Xu
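As an illustration of how the lower envelope and the resulting priorities could be derived, here is a small sketch; the progress-chart points are made up and the functions are assumptions for illustration, not taken from the paper:

```python
# Progress chart points (t_i, s_i): cumulative processing time vs. tuple size.
# Assumes strictly increasing t_i along the operator path.

def lower_envelope(points):
    """Return the indices of the points on the lower envelope P'."""
    envelope = [0]
    i = 0
    while i < len(points) - 1:
        t_i, s_i = points[i]
        # steepest descent operator point: most negative slope to the right
        j = min(range(i + 1, len(points)),
                key=lambda k: (points[k][1] - s_i) / (points[k][0] - t_i))
        envelope.append(j)
        i = j
    return envelope

def chain_priorities(points):
    """Priority of each operator, taken as the steepness (negated slope) of
    its lower-envelope segment; higher-priority operators run first."""
    env = lower_envelope(points)
    priorities = {}
    for a, b in zip(env, env[1:]):
        slope = (points[b][1] - points[a][1]) / (points[b][0] - points[a][0])
        for op in range(a + 1, b + 1):      # operators lying on this segment
            priorities[op] = -slope         # steeper descent => higher priority
    return priorities

points = [(0, 1.0), (1, 0.9), (2, 0.2), (3, 0.2), (4, 0.0)]  # hypothetical path
print(chain_priorities(points))             # {1: 0.4, 2: 0.4, 3: 0.1, 4: 0.1}
```

In this made-up chart, operators 1 and 2 lie on the first, steepest segment and therefore get the higher priority, so Chain prefers the portion of the path that reduces tuple size fastest per unit of processing time.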

  23. The Chain strategy along the progress chart • Tuples don’t actually move along the lower envelope • They instead move along the operator path • When tuples move along the actual progress chart P rather than the lower envelope P’, Chain’s memory requirements are only slightly greater than the lower-envelope ideal Sotomayor - Xu

  24. Multiple stream queries • Queries that have at least one tuple-based sliding window join between two streams Sotomayor - Xu

  25. Multiple stream query execution • The query is first broken up into parallel operator paths • [Figure: plan for a sliding-window join of streams R and S, split into two operator paths that share the join operator] Sotomayor - Xu

  26. Experimental results • Compared the performance of Chain, FIFO, Greedy, and Round-Robin • 2 data sets (network data) • Synthetic data set • Real data set • Queries used IP addresses and packet sizes in selection and projection predicates Sotomayor - Xu

  27. Experiment: single stream queries (4 operators) • Query: • 4 operators • Third operator is very selective • In between two less selective operators Sotomayor - Xu

  28. Experiment results Sotomayor - Xu

  29. Multiple stream experiment • Three simultaneous queries • A sliding window join • Two single stream queries with selectivities less than one • Results show Chain outperforms other strategies by a large margin Sotomayor - Xu

  30. Multiple stream experiment results Sotomayor - Xu

  31. Summary • Proved that the choice of operator scheduling strategy has a significant impact on resource consumption • Proved that the Chain scheduling strategy outperforms competing strategies • Future work • Latency and starvation issues • Consider query plans that change over time • Consider the sharing of computation and memory in query plans Sotomayor - Xu

  32. “Adaptive filters” for continuous queries over distributed data streams Sotomayor - Xu

  33. What’s the problem? • Distributed data sources continuously stream updates to a centralized processor where continuous queries are evaluated • Because of the high volume of data updates, the communication overhead jeopardizes system performance • E.g. path latency computed by monitoring queuing latency at routers: the volume of monitoring traffic from routers may exceed that of normal traffic • Can we reduce the communication overhead to make continuous queries based on multiple data streams feasible and efficient? Sotomayor - Xu

  34. Important observations • Exact precision for continuous queries is not always needed • E.g. path latency application: <= 5 ms of accuracy • Approximate answers of sufficient precision can usually be computed from a small fraction of the input stream. • E.g. average network traffic volume received by all hosts within the organization • The precision constraint for queries may change over time. • E.g. more precise traffic volume needed in face of attack Sotomayor - Xu

  35. Overview of Approach • Reduce communication overhead at the cost of query precision • Quantitative precision constraints are specified with the continuous queries • Bounded approximate answer [L, H] • Precision constraint δ: 0 ≤ H – L ≤ δ • Filters installed at the remote data sources by the stream processor • Filter at data object O’s source: a bound [L_O, H_O] of width W_O centered around the most recent numeric update V Sotomayor - Xu
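A minimal sketch of a source-side filter under this scheme, with illustrative class and method names: an update is streamed to the coordinator only when the new value falls outside the current bound, which is then re-centered on the reported value.

```python
# Hedged sketch of a source-side filter; not the paper's code.

class SourceFilter:
    def __init__(self, width, value):
        self.width = width
        self.center = value          # most recently reported value V

    def on_update(self, value):
        lo = self.center - self.width / 2
        hi = self.center + self.width / 2
        if lo <= value <= hi:
            return None              # suppressed: coordinator's cached bound still holds
        self.center = value          # re-center the bound and stream the update
        return value

f = SourceFilter(width=10.0, value=50.0)
print([f.on_update(v) for v in (53.0, 47.0, 58.0, 60.0)])  # [None, None, 58.0, None]
```

Because the coordinator keeps a mirror of each bound, suppressed updates never reach the network while the cached bound [L_O, H_O] still covers the true value.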

  36. Naive filtering policy • Uniform allocation • E.g. a single CQ: AVG(O1, O2, …, On) • Precision constraint δ → each filter gets a bound of width δ • The wider a bound, the more restrictive the filter (fewer updates are streamed) and consequently the more imprecise the query answers • Cons • Multiple CQs may be issued over the same object; if the smallest requested bound width is chosen for its filter, the resulting higher update stream rate may be wasted, since only a few CQs need that precision • Data update rates and magnitudes are not taken into account Sotomayor - Xu

  37. System structure • Data source • Filters • Stream coordinator • Precision manager • Bound cache • CQ evaluator Sotomayor - Xu

  38. System structure Sotomayor - Xu

  39. Adaptive filter setting algorithm • Goal: set bound widths for stream filters adaptively to reduce communication costs while guaranteeing the precision constraints of CQs • AVG queries analyzed only • Queries Q1, Q2, …, Qm over sets S1, S2, …, Sm, where each S_j is a subset of a set of n data objects O1, O2, …, On • Query result Q_j: the bounded answer [(Σ L_i)/|S_j|, (Σ H_i)/|S_j|] over the objects O_i in S_j • Precision constraint: the answer bound width (Σ W_i)/|S_j| must not exceed δ_j • Basic idea: • Implicit bound width shrinking • Explicit bound width growing Sotomayor - Xu
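As a small worked check of the AVG case reconstructed above (the answer bound is the element-wise average of the per-object bounds, so its width is the average bound width), the coordinator can verify a precision constraint from its cached bounds; the values below are made up:

```python
# Illustrative precision check for an AVG query over cached bounds [L_i, H_i].

def avg_answer(bounds):
    """Bounded answer [L, H] for AVG over the given per-object bounds."""
    n = len(bounds)
    lo = sum(l for l, _ in bounds) / n
    hi = sum(h for _, h in bounds) / n
    return lo, hi

def meets_precision(bounds, delta):
    lo, hi = avg_answer(bounds)
    return (hi - lo) <= delta            # answer width = average bound width

bounds = [(9.0, 13.0), (20.0, 22.0), (4.0, 6.0)]   # widths 4, 2, 2
print(avg_answer(bounds))                # (11.0, 13.666...)
print(meets_precision(bounds, 3.0))      # True: average width is about 2.67
```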

  40. Bound shrinking • Filtering bound width Wi for object Oi • Maintained both at the central stream coordinator and at the source filter • Wi ← Wi · (1 – S) every Γ time units • Γ: adjustment period • S: shrink percentage Sotomayor - Xu

  41. Bound growing • Burden score Bi: the degree to which an object is contributing to the overall communication cost due to streamed updates • Computed from Ci, the communication cost for Oi; Wi, the current bound width; and Ni, the number of updates of Oi received by the stream coordinator in the last Γ time units • Intuitively, a frequently updated object with a narrow bound carries a high burden • Burden target: the lowest overall burden required of the objects in the query in order to meet the precision constraint at all times Sotomayor - Xu

  42. Bound growing (Cont) • Burden deviation: the degree to which an object is “over-burdened” with respect to the burden targets of the queries that access it • Queried objects are considered in order of decreasing deviation, and each object is assigned the maximum possible bound growth when it is considered Sotomayor - Xu

  43. Bound growing (Summary) • Each object is assigned a burden score • Each query is assigned a burden target by either averaging burden scores or invoking an iterative linear solver • Each object is assigned a deviation value based on the difference between its burden score and the burden targets of the queries that access it • The objects are considered in order of decreasing deviation, and each object is assigned the maximum possible bound growth when it is considered Sotomayor - Xu
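For the single-query case, a simplified sketch of one adjustment round (shrink, recompute targets, grow) might look as follows; the equal-share growth policy and the parameter values are assumptions for illustration, not the paper's exact "maximum possible growth" rule:

```python
# Simplified adjustment round for a single AVG query with precision constraint
# delta over n objects. The shrink step follows the slides
# (W_i <- W_i * (1 - S) every adjustment period); burden scores are inputs.

def adjustment_round(widths, burden_scores, delta, shrink_pct=0.05):
    n = len(widths)
    # 1. Implicit shrinking: every bound width shrinks by the shrink percentage.
    widths = [w * (1 - shrink_pct) for w in widths]

    # 2. For AVG, the precision constraint lets the widths sum to at most
    #    n * delta; shrinking frees up slack that can be handed back out.
    slack = n * delta - sum(widths)

    # 3. Burden target = average burden score; an object deviates if its
    #    score exceeds the target.
    target = sum(burden_scores) / n
    over = [i for i, b in enumerate(burden_scores) if b > target]

    # 4. Grow the bounds of the over-burdened objects (equal shares here,
    #    as an illustrative policy).
    if over and slack > 0:
        for i in over:
            widths[i] += slack / len(over)
    return widths

# Object 1 is heavily burdened, so its bound absorbs the freed slack.
print(adjustment_round([1.0, 1.0, 2.0], [5.0, 1.0, 3.0], delta=1.5))
# [1.65, 0.95, 1.9]
```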

  44. Burden Target Computation • Single AVG query Qk over every object O1, …, On • Set B1 = B2 = … = Bn = Tk • Or: Tk = (B1 + B2 + … + Bn) / n, the average of the current burden scores • Intuitive explanation behind this formula • Objects having higher than average burden scores will be given a higher priority for bound width growth to lower their burden scores • Objects having lower than average burden scores will shrink by default, thereby raising their burden scores Sotomayor - Xu

  45. Burden Target Computation (Cont) • Multiple queries over different sets of objects • θi,j: the portion of object Oi’s burden score corresponding to query Qj, with the portions summing to Bi • The goal for adjusting burden scores in the presence of overlapping queries is to have the burden score Bi of each object Oi equal the sum of the burden targets of the queries over Oi • Burden targets Tj are then obtained from this system of equations with the iterative linear solver mentioned earlier Sotomayor - Xu

  46. Validation against optimized strategy • The adaptive bound width setting algorithm converges on bounds that are on par with those selected by an optimizer. Sotomayor - Xu

  47. Implementation and experimental validation • Single query Sotomayor - Xu

  48. Implementation and experimental validation • Multiple queries Sotomayor - Xu

  49. Summary • Trade the precision of query results for lower communication costs. • The specification of precision for continuous queries • Adaptive filters • Future work • How imprecision propagates through more complex query plans • Develop appropriate optimization techniques for adapting remote filter predicates in more complex environments Sotomayor - Xu

  50. Conclusion • The problem • DSMS must consider the high volume as well as the “burstiness” of data streams • Effectiveness of systems depends on being able to gracefully adapt to environmental conditions (I.e. resource availability) • Two different approaches for adaptivity • Minimizing the amount of memory at all times • Controlling the amount of data sent from multiple data sources Sotomayor - Xu
