Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding: Samples in More Complex Query Graphs. Yan-Nei Law* and Carlo Zaniolo, Computer Science Dept., UCLA; *Bioinformatics Institute, Singapore
Continuous Aggregate Queries on Data Streams: Sampling & Load Shedding • Only random samples are available for computing aggregate queries, because of • limitations of remote sensors or transmission lines, or • load-shedding policies invoked when overloads occur • When overloads occur (e.g., due to a burst of arrivals) we can • drop queries altogether, or • sample the input---much preferable • Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows • Can we improve answer accuracy with minimal overhead?
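The slides do not show load-shedder code; a minimal sketch of the "sample the input" option, assuming Bernoulli sampling (each tuple kept independently with the shedder's sampling rate; all names are illustrative):

```python
import random

def shed(stream, sampling_rate, seed=0):
    """Bernoulli load shedder: keep each tuple independently
    with probability `sampling_rate`, drop the rest."""
    rng = random.Random(seed)
    for tup in stream:
        if rng.random() < sampling_rate:
            yield tup

# Keep roughly 10% of a bursty input of 10,000 tuples.
sample = list(shed(range(10000), 0.10))
```

Downstream aggregates then see only this sparse sample, which is exactly the setting the rest of the talk addresses.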
General Architecture • Basic idea: • optimize the sampling rates of the load shedders for accurate answers. • Previous work [BDM04]: • Find an error bound for each aggregate query. • Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints. • Only works for SUM and COUNT • No error model provided • (Figure: data streams S1 … Sn pass through load shedders into a query network of operators terminating in aggregates ∑.)
A New Approach • Exploit the correlation between answers at different points in time • Example: sensor data [VAA04, AVA04] • Objective: the current answer can be adjusted by the past answers such that: • low sampling rate → current answer less accurate → more dependent on history • high sampling rate → current answer more accurate → less dependent on history • We propose a Bayesian quality-enhancement module that achieves this objective automatically and reduces the uncertainty of the approximate answers. • A larger class of queries is considered: SUM, COUNT, AVG, quantiles.
Our Model • The observed answer à is computed from random samples of the complete stream, taken at sampling rate P. • We propose a Bayesian method that obtains an improved answer by combining • the observed answer, • the error model, and • the history of the answer. • (Figure: the load shedder samples stream Si at rate P; the query network's aggregate answer à feeds a Quality Enhancement Module, which combines it with history to produce the improved answer.)
Error Model of the Aggregate Answers • Ã – approximate answer obtained from a random sample taken at sampling rate P • Key result: error model for SUM, COUNT, AVG, quantiles (asymptotically normal): • SUM: Ã ~ N(A, ((1−P)/P) Σᵢ xᵢ²), where A is the true sum of the tuple values xᵢ • COUNT: Ñ ~ N(N, N(1−P)/P), where N is the true count • AVG: approximately normal, as the ratio of the SUM and COUNT estimates • p-th quantile [B86]: x̃ₚ ~ N(xₚ, p(1−p)/(n f(xₚ)²)), where n is the sample size, xₚ = F⁻¹(p), F is the cumulative distribution, and f = F′ is the density function
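The SUM error model can be checked empirically. A sketch, assuming the usual scale-up estimator for Bernoulli sampling (sampled sum divided by the rate P); the simulation below compares the empirical variance of many such estimates against the model variance ((1−P)/P) Σ xᵢ²:

```python
import random

def estimate_sum(xs, p, seed):
    """SUM estimate from a Bernoulli sample with rate p:
    scale the sampled sum up by 1/p (unbiased)."""
    rng = random.Random(seed)
    return sum(x for x in xs if rng.random() < p) / p

xs = list(range(1, 101))            # true SUM = 5050
p = 0.2
estimates = [estimate_sum(xs, p, s) for s in range(2000)]
mean_est = sum(estimates) / len(estimates)
var_est = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
model_var = (1 - p) / p * sum(x * x for x in xs)   # ((1-P)/P) * sum(x_i^2)
```

Across the 2000 trials the estimator's mean is close to the true sum and its empirical variance tracks the model variance, which is what lets the later Bayesian module treat à as a normal observation with known variance.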
Use of the Error Model • Derive accuracy estimates for a larger class of queries, for optimizing the load-shedding policy • Idea: minimize the variance of each query • Enhance the quality of the query answer, on the basis of statistical information derived from the past.
Learning the Prior Distribution from the Past • Statistical information on the answers: • spatial – e.g., readings from neighboring sensors • temporal – e.g., the past answers {xᵢ} • Model the distribution of the answer by a normal distribution: • by MLE, pdf ~ N(μ, σ²) where μ = Σᵢ xᵢ/n and σ² = Σᵢ (xᵢ − μ)²/n; only μ and σ need to be stored. • Requires only a minimal amount of computation time. • Assumes that there is no concept change.
Observations • Reduced uncertainty: the posterior variance is smaller than both the prior variance σ² and the observation variance • Compromise between the prior and the observed answer: • large observation variance → Ã less accurate → improved answer more dependent on the prior mean μ • small observation variance → Ã more accurate → less dependent on μ • An uncertain prior (i.e., large σ) will not have much effect on the improved answer.
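These observations follow from the standard normal-normal Bayesian update, which is presumably what the quality-enhancement module computes; a sketch (function name illustrative):

```python
def combine(prior_mu, prior_var, obs, obs_var):
    """Posterior of a normal prior N(prior_mu, prior_var) after a
    normal observation N(obs, obs_var): a precision-weighted average."""
    post_var = prior_var * obs_var / (prior_var + obs_var)
    post_mu = (obs_var * prior_mu + prior_var * obs) / (prior_var + obs_var)
    return post_mu, post_var

# Accurate observation (small obs_var): answer follows the observation.
mu1, var1 = combine(100.0, 25.0, 110.0, 1.0)
# Noisy observation (large obs_var): answer leans on the prior.
mu2, var2 = combine(100.0, 25.0, 110.0, 400.0)
```

Note that `post_var` is always below both input variances (reduced uncertainty), and a very large `prior_var` makes the posterior essentially equal to the observation (an uncertain prior has little effect).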
Generalizing to Mining Functions: K-means is a Generalized AVG Query • (Plots: relative error for the first and for the second cluster mean.)
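The sense in which k-means is a generalized AVG query: each Lloyd iteration recomputes every centroid as the average of its cluster, so the AVG error model (and hence the Bayesian correction) applies per cluster. A minimal one-dimensional sketch (names illustrative):

```python
def kmeans_step(points, centroids):
    """One Lloyd iteration: assign each point to its nearest
    centroid, then recompute each centroid as the cluster AVG.
    Each centroid update is an AVG query over its cluster."""
    clusters = [[] for _ in centroids]
    for x in points:
        j = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
        clusters[j].append(x)
    return [sum(c) / len(c) if c else m
            for c, m in zip(clusters, centroids)]

pts = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = kmeans_step(pts, [0.0, 9.0])
# centroids == [2.0, 11.0] -- each is an AVG over its cluster
```

Under load shedding, each of these per-cluster averages is computed from a sample, so each centroid inherits the AVG error model and can be improved by its own prior.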
Quantiles (dataset with concept drifts) • (Plot: average relative error for every quantile.)
Changing Distribution • The corrections are effective for distributions besides normal ones • Changing distributions (a.k.a. concept changes) can be easily detected; we used a two-sample test • When a change is detected, the old prior is dropped and a new prior is constructed
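The slides do not say which two-sample test was used; as an illustration only, a simple two-sample z-test on the means of a reference window and a recent window, where a large |z| would trigger dropping the old prior:

```python
import math

def two_sample_z(xs, ys):
    """Two-sample z statistic for a difference in means; |z| above a
    threshold (e.g. 3) signals a distribution change, at which point
    the old prior is discarded and a new one is started."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (my - mx) / math.sqrt(vx / nx + vy / ny)

stable  = [10.0, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9, 10.3]
drifted = [14.9, 15.2, 15.1, 14.8, 15.0, 15.3, 14.7, 15.1]
z = two_sample_z(stable, drifted)   # large |z| => concept change
```

Whatever test is used, the control flow is the same: no change detected, keep updating the prior; change detected, reset it.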
Minimal Overheads • Computation costs introduced by: • calculating the posterior distribution • detecting changes • (Table: time in ms for each query.)
Summary • Proposed a Bayesian quality-enhancement method for approximate aggregates in the presence of sampling. • Our method: • works for ordered statistics and data-mining functions as well as traditional aggregates, and • handles concept changes in the data streams.