
Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure

Iosif Lazaridis, Sharad Mehrotra. University of California, Irvine. SIGMOD 2001, Santa Barbara. Talk outline: Aggregate Queries; Motivation for Approximate Answering; Multi-Resolution Aggregate Tree (MRA-Tree)


Presentation Transcript


  1. Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure Iosif Lazaridis, Sharad Mehrotra University of California, Irvine SIGMOD 2001, Santa Barbara

  2. Talk Outline • Aggregate Queries • Motivation for Approximate Answering • Multi-Resolution Aggregate Tree (MRA-Tree) • Progressive Algorithm with Error Bounds • Experimental Evaluation • Summary and Future Work

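  3. Aggregate Queries • [Figure: dataset S with point values 9, 6, 3, 8, 2, 7; the query region Q covers the points with values 2, 7, 6] • minQ = 2 • maxQ = 7 • countQ = 3 • sumQ = 2+7+6 = 15 • avgQ = 15/3 = 5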

  4. Evaluating Aggregate Queries • Exact answering • Scan all points of D checking each against Q • Retrieve points in Q via a multi-dimensional index on D • Both linear/index scan can be very expensive • Approximate answering • Many applications (selectivity estimation, data analysis, visualization) do not require exact answers
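The linear-scan option for exact answering can be sketched as follows. This is a minimal illustration, not code from the talk: the `Point` and `Rect` types and the point coordinates are hypothetical, and only the six values (9, 6, 3, 8, 2, 7) and the expected aggregates come from the Aggregate Queries slide.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: float
    y: float
    value: float


class Rect(NamedTuple):
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float

    def contains_point(self, p: Point) -> bool:
        return self.x_lo <= p.x <= self.x_hi and self.y_lo <= p.y <= self.y_hi


def exact_aggregates(points, q):
    """Linear scan: check every point of D against Q."""
    hits = [p.value for p in points if q.contains_point(p)]
    if not hits:
        return {"min": None, "max": None, "count": 0, "sum": 0, "avg": None}
    total = sum(hits)
    return {
        "min": min(hits),
        "max": max(hits),
        "count": len(hits),
        "sum": total,
        "avg": total / len(hits),
    }


# Hypothetical coordinates, chosen so that Q covers the values 6, 2 and 7.
D = [
    Point(1, 8, 9), Point(4, 5, 6), Point(8, 8, 3),
    Point(9, 2, 8), Point(3, 3, 2), Point(5, 4, 7),
]
Q = Rect(2, 2, 6, 6)
result = exact_aggregates(D, Q)
print(result)
```

Every point is touched regardless of how small Q is, which is the cost the approximate techniques aim to avoid.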

  5. Motivating Examples • "How many tanks 10 miles from me?" • "My boss needs to see the income aggregates in 10 minutes!"

  6. Techniques for Approximate Aggregate Queries • Online estimation (Interactive) • Sampling • Offline estimation (Data Synopsis) • Sampling, Histograms, Wavelets • Our Technique: • Online estimator via a scan of a modified multi-dimensional index (MRA-Tree) • Allows incremental tradeoff of accuracy for response time, with guaranteed error bounds

  7. Multi-Resolution Aggregate Tree (MRA-Tree) • An MRA-Tree can be instantiated with any of the popular multi-dimensional index trees (R-Tree, quadtree, Hybrid tree, etc.) • A non-leaf node contains (for each of its subtrees) four aggregates {MIN,MAX,COUNT,SUM} • A leaf node contains the actual data points • Tree operations are identical with those of the plain (non-MRA) tree with the consideration that aggregates must be maintained
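One way to picture the node layout is the sketch below. It is an assumption-laden illustration rather than the authors' implementation: the class names `Agg`, `Leaf` and `NonLeaf` are hypothetical, spatial bounding boxes are omitted for brevity, and the data values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agg:
    """The four aggregates kept per subtree: MIN, MAX, COUNT, SUM."""
    minimum: float
    maximum: float
    count: int
    total: float

    @staticmethod
    def of_values(values):
        return Agg(min(values), max(values), len(values), sum(values))

    @staticmethod
    def merge(aggs):
        aggs = list(aggs)
        return Agg(
            min(a.minimum for a in aggs),
            max(a.maximum for a in aggs),
            sum(a.count for a in aggs),
            sum(a.total for a in aggs),
        )


@dataclass
class Leaf:
    values: List[float]  # a leaf holds the actual data points

    def agg(self) -> Agg:
        return Agg.of_values(self.values)


@dataclass
class NonLeaf:
    children: list
    child_aggs: List[Agg] = field(init=False)

    def __post_init__(self):
        # One aggregate record per subtree, computed when the node is built.
        self.child_aggs = [c.agg() for c in self.children]

    def agg(self) -> Agg:
        return Agg.merge(self.child_aggs)


# Hypothetical data values.
root = NonLeaf([Leaf([4, 5, 2]), Leaf([9, 6])])
print(root.child_aggs)
print(root.agg())
```

In a real index the aggregates would be updated incrementally on insert and delete rather than recomputed at construction time, as done here for simplicity.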

  8. MRA-Tree Example • [Figure: a non-leaf node storing (min, max, count, sum) for each of its subtrees, above the leaf nodes that hold the data points]

  9. Progressive Algorithm Outline • We want • Best answer for given time • Shortest time for given precision of the answer • Refine an answer at will, trading time for precision • How we achieve it • Do a prioritized traversal of nodes of the MRA-tree • Maintain an estimate of the answer E(aggQ) • Maintain a 100% confidence interval I = [L, H], such that L ≤ aggQ ≤ H

  10. Generic Algorithm (1) • [Figure: the four possible relationships between the query Q and a node N: Q is contained in N, Q contains N, Q partially overlaps N, Q and N are disjoint] • Two sets of nodes: • NP (partial contribution to the query) • NC (complete contribution)

  11. Generic Algorithm (2) • Initialize NP with the root • At each iteration: remove one node N from NP and, for each child Nchild of N: • discard Nchild if it is disjoint with Q • insert Nchild into NP if Q is contained in or partially overlaps Nchild • “insert” Nchild into NC if Q contains Nchild (we only need to maintain aggNC)
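The iteration above can be sketched for COUNT as follows. Several simplifications are assumptions of this sketch, not part of the talk: boxes are 1-D closed intervals, data points are modelled as degenerate nodes with count 1 (so a point is always either contained in Q or disjoint from it), and the node picked from NP is simply the last one inserted rather than the result of a real traversal policy. The interval uses the COUNT bounds L = countNC and H = countNC + countNP.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float]  # 1-D closed interval [lo, hi] (simplifying assumption)


@dataclass
class Node:
    box: Box
    count: int
    children: List["Node"] = field(default_factory=list)


def pt(x: float) -> Node:
    """A data point modelled as a degenerate node with count 1."""
    return Node((x, x), 1)


def disjoint(a: Box, b: Box) -> bool:
    return a[1] < b[0] or b[1] < a[0]


def contains(outer: Box, inner: Box) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def refine_step(q: Box, np_set: List[Node], count_nc: int) -> int:
    """Remove one node from NP, classify its children, return the new countNC."""
    node = np_set.pop()  # placeholder policy: last inserted node
    for child in node.children:
        if disjoint(child.box, q):
            continue                 # discard
        if contains(q, child.box):
            count_nc += child.count  # "insert" into NC: only the aggregate is kept
        else:
            np_set.append(child)     # Q is contained in or partially overlaps child
    return count_nc


def progressive_count(root: Node, q: Box):
    """Return the interval [L, H] for COUNT after each iteration."""
    np_set, count_nc = [root], 0
    history = []
    while np_set:
        count_nc = refine_step(q, np_set, count_nc)
        history.append((count_nc, count_nc + sum(n.count for n in np_set)))
    return history


left = Node((0, 4), 3, [pt(1), pt(2), pt(4)])
right = Node((6, 10), 3, [pt(6), pt(7), pt(10)])
root = Node((0, 10), 6, [left, right])
history = progressive_count(root, (3, 8))
print(history)
```

The interval narrows with each iteration and collapses to the exact count once NP is empty; stopping the loop early yields an approximate answer with a guaranteed bound.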

  12. Generic Algorithm (3) • To instantiate the algorithm for {MIN,MAX,COUNT,SUM,AVG}: • Error Bounds. • Interval I = [L, H]: L ≤ aggQ ≤ H • Traversal Policy. • Which node from NP to explore next? Minimize |I| • Estimation. • Provide an estimate of the answer: E(aggQ) • [Figure legend: node in NP, node in NC]

  13. MIN (and MAX) • Interval • minNC = min {4, 5} = 4 • minNP = min {3, 9} = 3 • L = min {minNC, minNP} = 3 • H = minNC = 4 • hence, I = [3, 4] • Estimate • Lower bound: E(minQ) = L = 3 • Traversal • Choose N ∈ NP: minN = minNP
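The MIN bounds and traversal rule above translate into a few lines. The function names and the `(name, min)` tuple representation of nodes are hypothetical; returning infinity when NC or NP is empty is an assumption of this sketch.

```python
import math


def min_interval(nc_mins, np_mins):
    """Interval [L, H] for MIN, given the MIN aggregates of nodes in NC and NP."""
    min_nc = min(nc_mins, default=math.inf)
    min_np = min(np_mins, default=math.inf)
    low = min(min_nc, min_np)   # L = min {minNC, minNP}
    high = min_nc               # H = minNC
    return low, high


def next_node_for_min(np_nodes):
    """Traversal policy: pick the node of NP whose MIN equals minNP."""
    return min(np_nodes, key=lambda node: node[1])


interval = min_interval([4, 5], [3, 9])
chosen = next_node_for_min([("a", 3), ("b", 9)])
print(interval, chosen)
```

Exploring the node that holds minNP is the only move that can raise L, which is why it is the choice that shrinks the interval.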

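  14. COUNT (and SUM) • [Figure: nodes in NC with counts 9 and 6; nodes in NP with counts 8 (25% overlap with Q) and 10 (20% overlap with Q)] • Interval • countNC = 9+6 = 15 • countNP = 8+10 = 18 • L = countNC = 15 • H = countNC + countNP = 33 • hence, I = [15, 33] • Estimate • E(countQ) = L + 0.25·8 + 0.2·10 = 19 • Traversal • Choose N ∈ NP: countN ≥ countM, ∀M ∈ NP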
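The COUNT arithmetic above can be checked with a short sketch. The function names are hypothetical, and reading the 25% and 20% figures as the fraction of each partial node that overlaps Q (with points assumed to be spread uniformly inside the node) is an interpretation, not something the transcript states.

```python
def count_interval(nc_counts, np_counts):
    """Interval [L, H] for COUNT: L = countNC, H = countNC + countNP."""
    low = sum(nc_counts)
    high = low + sum(np_counts)
    return low, high


def count_estimate(nc_counts, np_nodes):
    """E(countQ): countNC plus each partial node's count scaled by its overlap.

    np_nodes holds (count, overlap_fraction) pairs.
    """
    return sum(nc_counts) + sum(count * frac for count, frac in np_nodes)


def next_node_for_count(np_nodes):
    """Traversal policy: pick the node of NP with the largest count."""
    return max(np_nodes, key=lambda node: node[0])


partial = [(8, 0.25), (10, 0.2)]
interval = count_interval([9, 6], [c for c, _ in partial])
estimate = count_estimate([9, 6], partial)
chosen = next_node_for_count(partial)
print(interval, estimate, chosen)
```

Picking the largest partial node removes the biggest single term from H - L, so it is the greedy step toward a narrower interval.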

  15. AVG • Node aggregates (min, max, count, sum): • A: 5, 10, 5, 35 • B: –, –, 10, 55 • Distribution of values in A: {5, 5, 5, 10, 10} • Interval • Current avgNC = 55/10 = 5.5 • Maximum possible (Q picks up only the two 10s of A): (55+2·10) / (10+2) = 6.25 • Minimum possible (Q picks up only the three 5s of A): (55+3·5) / (10+3) = 5.38 • hence, I = [5.38, 6.25] • Estimate • E(avgQ) = E(sumQ) / E(countQ) • Traversal • max countN • max {(maxN − avgNC), (avgNC − minN)}
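The AVG interval in this example can be verified by brute force over every subset of A's values that Q might pick up. This is only a check of the slide's arithmetic under the assumption that A's value distribution is known: it is exponential in the node size and is not the bound computation used by the authors, since an MRA-tree node stores only the four aggregates. The function name is hypothetical.

```python
from itertools import combinations


def avg_bounds(sum_nc, count_nc, partial_values):
    """Smallest and largest average reachable by adding any subset of
    partial_values to the fully contained part (sum_nc, count_nc).

    Assumes count_nc > 0 so the empty subset is well defined.
    """
    candidates = []
    for k in range(len(partial_values) + 1):
        for subset in combinations(partial_values, k):
            candidates.append((sum_nc + sum(subset)) / (count_nc + k))
    return min(candidates), max(candidates)


low, high = avg_bounds(55, 10, [5, 5, 5, 10, 10])
print(round(low, 2), round(high, 2))
```

The extremes are reached by adding only the values below the current average (the three 5s) or only those above it (the two 10s), matching the two cases on the slide.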

  16. Experiments • Synthetic datasets 2-4D • Real datasets: 2D spatial (USGS) and 4D (UCI KDD Forest Cover) • MRA-quadtree and MRA-Rtree indices • We study • MRA-tree vs. “plain” tree • MRA-tree vs. online sampling • Accuracy of estimation • Scalability with database size

  17. MRA-Quadtree (Nodes Visited)

  18. MRA-Quadtree (Error Reduction) Absolute Relative Error =

  19. MRA-Rtree (2D, USGS) I/O Performance DB Size =

  20. Estimation vs. Maximum Error (4D, Forest Cover, sel. 16% / axis)

  21. MRA-Rtree vs. Online Sampling: Estimation Accuracy (4D, Forest Cover)

  22. Database Size (3D Synthetic, exact, 10% spatial sel.)

  23. Summary • MRA-Tree is a modified multi-dimensional index for approximate answering of aggregate queries • For exact answer • faster than “plain” index • Advantages over offline estimators • Progressively improving answers • Error bounds • Advantages over sampling • Better estimate for same I/O • Algorithm scales gracefully with database size

  24. Future Work (QUASAR Project, UC Irvine) • Scalability with high dimensionality, by using a dedicated high-D index structure • Scalability in high update rate environments • Approximate query processing of general SQL queries using dedicated data structures, similar to MRA-tree
