
Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding



Presentation Transcript


  1. Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding Samples in more complex query graphs Yan-Nei Law* and Carlo Zaniolo Computer Science Dept. UCLA *Bioinformatics Institute Singapore

  2. Continuous Aggregate Queries on Data Streams: Sampling & Load Shedding
  • Only random samples are available for computing aggregate queries, because of:
  • limitations of remote sensors or transmission lines, or
  • load-shedding policies implemented when overloads occur.
  • When overloads occur (e.g., due to a burst of arrivals) we can:
  • drop queries altogether, or
  • sample the input – much preferable.
  • Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows.
  • Can we improve answer accuracy with minimal overhead?
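The "sample the input" alternative can be sketched as a Bernoulli load shedder that keeps each tuple independently with the configured sampling rate; the function name `shed_load` and its parameters are illustrative, not from the paper:

```python
import random

def shed_load(window, sampling_rate, seed=None):
    """Bernoulli load shedder: keep each tuple independently with
    probability `sampling_rate`, drop the rest. A fixed seed makes
    the sketch reproducible."""
    rng = random.Random(seed)
    return [x for x in window if rng.random() < sampling_rate]

window = list(range(100))
sample = shed_load(window, 0.2, seed=42)   # roughly 20% of the tuples survive
```

Downstream aggregates then see only `sample`, which is why the answers must be scaled and error-modeled as in the following slides.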

  3. General Architecture
  • Basic idea: optimize the sampling rates of the load shedders for accurate answers.
  • Previous work [BDM04]:
  • Find an error bound for each aggregate query.
  • Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.
  • Only works for SUM and COUNT.
  • No error model provided.
  [Figure: streams S1 … Sn feed a query network in which each stream Si passes through a load shedder and a query operator before reaching an aggregate (∑).]

  4. A New Approach
  • Exploit correlation between answers at different points in time.
  • Example: sensor data [VAA04, AVA04].
  • Objective: the current answer can be adjusted by the past answers such that:
  • low sampling rate → current answer less accurate → more dependent on history;
  • high sampling rate → current answer more accurate → less dependent on history.
  • We propose a Bayesian quality enhancement module that achieves this objective automatically and reduces the uncertainty of the approximate answers.
  • A larger class of queries will be considered: SUM, COUNT, AVG, quantiles.

  5. Our Model
  • The observed answer Ã is computed from random samples of the complete stream with sampling rate P.
  • We propose a Bayesian method to obtain the improved answer by combining:
  • the observed answer,
  • the error model, and
  • the history of the answer.
  [Figure: streams S1 … Sn feed the query network (load shedders, query operators, aggregates ∑); each observed answer Ã and its sampling rate P pass, together with the history, through the Quality Enhancement Module to produce the improved answer.]

  6. Error Model of the Aggregate Answers
  • Ã – approximate answer obtained from a random sample with sampling rate P over a window of N tuples.
  • Key result: error model for SUM, COUNT, AVG and quantiles (standard sampling-theory forms):
  • SUM: Ã ≈ N(Σᵢ xᵢ, ((1−P)/P) · Σᵢ xᵢ²)
  • COUNT: Ã ≈ N(N, N(1−P)/P)
  • AVG: Ã ≈ N(x̄, σ²(1−P)/(NP)) (delta-method approximation)
  • p-th quantile [B86]: Ã ≈ N(F⁻¹(p), p(1−p)/(n · f(F⁻¹(p))²)), where F is the cumulative distribution function, f = F′ is the density function, and n is the sample size.
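Under these error models, the point estimates for SUM and COUNT are obtained by scaling the sampled values by 1/P, and a plug-in variance for SUM can be computed from the sample itself. A minimal sketch, with illustrative function names:

```python
def estimate_sum(sample, p):
    """Unbiased SUM estimate from a Bernoulli sample with
    inclusion probability p (Horvitz-Thompson scaling by 1/p)."""
    return sum(sample) / p

def estimate_count(sample, p):
    """Unbiased COUNT estimate: scale the sample size by 1/p."""
    return len(sample) / p

def estimate_avg(sample):
    """AVG as the ratio of the two scaled estimators; 1/p cancels."""
    return sum(sample) / len(sample)

def sum_variance_estimate(sample, p):
    """Plug-in estimate of Var(SUM) = ((1-p)/p) * sum of x_i^2 over the
    full window, estimated from the sample as ((1-p)/p^2) * sum x_i^2."""
    return (1 - p) / p**2 * sum(x * x for x in sample)
```

For example, `estimate_sum([1, 2, 3], 0.5)` yields 12.0: the observed sum 6 scaled up by 1/0.5.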

  7. Use of the Error Model
  • Derive accuracy estimates for a larger class of queries to optimize the load-shedding policy.
  • Idea: minimize the variance of each query.
  • Enhance the quality of the query answer on the basis of statistical information derived from the past.

  8. Learning the Prior Distribution from the Past
  • Statistical information on the answers:
  • Spatial – e.g., readings from the neighbors.
  • Temporal – e.g., the past answers {xᵢ}.
  • Model the distribution of the answer by a normal distribution:
  • By MLE, pdf ~ N(μ, σ²), where μ = Σᵢ xᵢ/n and σ² = Σᵢ (xᵢ − μ)²/n; only μ and σ need to be stored.
  • Requires only a minimal amount of computation time.
  • Assumes that there is no concept change.
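The two stored quantities can be maintained incrementally as each past answer arrives, so the prior costs O(1) time and space per update. A sketch using Welford's running-moments update (the class and method names are assumptions, not from the slides):

```python
class NormalPrior:
    """Running Normal prior N(mu, sigma^2) over the past answers;
    stores only the count, mean and sum of squared deviations
    (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)

    @property
    def var(self):
        """MLE variance sigma^2 = sum((x_i - mu)^2) / n."""
        return self.m2 / self.n if self.n else float("inf")
```

Feeding it the past answers 10, 12, 11, 13 gives μ = 11.5 and σ² = 1.25, matching the batch MLE formulas on the slide.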

  9. Observations
  • Reduced uncertainty: the variance of the improved answer is smaller than that of both the prior and the observed answer.
  • Compromise between the prior and the observed answer:
  • large observation variance → less accurate Ã → more dependent on the prior mean μ;
  • small observation variance → more accurate Ã → less dependent on μ.
  • An uncertain prior (i.e., large σ) will not have much effect on the improved answer.
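The compromise described above is the usual precision-weighted combination of two Normals (conjugate Normal update). A minimal sketch, with an illustrative function name:

```python
def improved_answer(a_obs, var_obs, mu_prior, var_prior):
    """Combine the observed answer N(a_obs, var_obs) with the prior
    N(mu_prior, var_prior). The weight on the observation grows as
    its variance shrinks, and the posterior variance is smaller
    than both inputs."""
    w = var_prior / (var_prior + var_obs)          # weight on the observation
    post_mean = w * a_obs + (1 - w) * mu_prior
    post_var = var_prior * var_obs / (var_prior + var_obs)
    return post_mean, post_var
```

With equally uncertain inputs, `improved_answer(10, 1, 8, 1)` returns the midpoint (9.0, 0.5); shrinking `var_obs` pulls the result toward the observation, as the slide describes.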

  10. Generalizing to Mining Functions: K-means is a Generalized AVG Query
  [Charts: relative error for the first mean and for the second mean.]

  11. Quantiles (dataset with concept drifts)
  [Chart: average relative error for every quantile.]

  12. Changing Distribution
  • Corrections are effective for distributions besides normal ones.
  • Changing distributions (a.k.a. concept changes) can be easily detected; we used a two-sample test.
  • The old prior is then dropped and a new prior is constructed.
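The slides do not specify which two-sample test was used; as an illustrative stand-in, a crude two-sample z-test on the means of an old and a new window of answers can flag a shift:

```python
import math

def mean_shift_detected(old, new, z_threshold=1.96):
    """Two-sample z-test on the means: flag a concept change when the
    standardized difference of the window means exceeds the threshold.
    A stand-in for the paper's unspecified two-sample test."""
    def stats(xs):
        n = len(xs)
        m = sum(xs) / n
        v = sum((x - m) ** 2 for x in xs) / (n - 1)   # sample variance
        return n, m, v

    n1, m1, v1 = stats(old)
    n2, m2, v2 = stats(new)
    se = math.sqrt(v1 / n1 + v2 / n2)
    return abs(m1 - m2) > z_threshold * se
```

When a change is flagged, the accumulated prior statistics (μ, σ) are reset and rebuilt from the new window, as the slide describes.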

  13. Minimal Overheads
  • Computation costs introduced by:
  • calculating the posterior distribution, and
  • detecting changes.
  [Chart: time (in ms) for each query.]

  14. Summary
  • Proposed a Bayesian quality enhancement method for approximate aggregates in the presence of sampling.
  • Our method:
  • works for ordered statistics and data mining functions as well as traditional aggregates, and
  • handles concept changes in the data streams.
