
Space-Efficient Online Computation of Quantile Summaries

Space-Efficient Online Computation of Quantile Summaries. SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery. Outline. Introduction The summary data structure Operation and algorithm Tree representation Analysis and experimental result Conclusion. Introduction.


Presentation Transcript


  1. Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery

  2. Outline • Introduction • The summary data structure • Operation and algorithm • Tree representation • Analysis and experimental result • Conclusion

  3. Introduction • Space-efficient computation of quantile summaries of very large data sets in a single pass. • Quantile queries: given a quantile ψ, return the value whose rank is ψN.

  4. N = 16, after sorting: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12 • The 0.5-quantile returns the element of rank 8 (0.5 × 16), which is 8 • The 0.75-quantile returns the element of rank 12 (0.75 × 16), which is 10
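
The exact (non-streaming) version of this query can be sketched in a few lines of Python. The function name `exact_quantile` is illustrative only, and rounding the rank up with `ceil` is an assumption for cases where ψN is not an integer.

```python
import math

def exact_quantile(data, psi):
    """Return the element of rank ceil(psi * N) (1-based) after sorting."""
    ordered = sorted(data)
    rank = math.ceil(psi * len(ordered))
    return ordered[rank - 1]

# The 16 values from the slide.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
print(exact_quantile(data, 0.5))   # element of rank 8
print(exact_quantile(data, 0.75))  # element of rank 12
```

This needs all N values in memory and a full sort, which is what the summary structure in the rest of the talk avoids.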

  5. Requirements • Explicit and tunable a priori guarantees on the precision of the approximation • As small a memory footprint as possible • Online: single pass over the data • Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations. • Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order).

  6. ε-approximate • A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r′ is guaranteed to be within the interval [r − εN, r + εN] • Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank is within [40, 60]
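
The definition reduces to a one-line check. The helper below is a hypothetical illustration (its name and signature are not from the talk) that tests whether an answer of known true rank is acceptable for a target rank r.

```python
def is_eps_approximate(true_rank, target_rank, eps, n):
    """True if true_rank lies in [target_rank - eps*n, target_rank + eps*n]."""
    return abs(true_rank - target_rank) <= eps * n

# Slide example: N = 100, 0.5-quantile (target rank 50), eps = 0.1.
print(is_eps_approximate(40, 50, 0.1, 100))
print(is_eps_approximate(61, 50, 0.1, 100))
```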

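  7. The Summary Data Structure • Let r_min(v) and r_max(v) denote the lower and upper bounds on the rank of v • The summary S stores tuples t_i = (v_i, g_i, Δ_i), ordered by value, where g_i = r_min(v_i) − r_min(v_{i−1}) and Δ_i = r_max(v_i) − r_min(v_i)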
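
As a minimal sketch of how rank bounds follow from the stored tuples: r_min is a running sum of the g values, and r_max adds Δ on top. The names `GKTuple` and `rank_bounds` are illustrative. The three tuples are the fragment shown on the example slides, with each {g, Δ} pair matched to its value by checking the arithmetic against the printed rank intervals; the offset `base=486` is inferred from those numbers (501 − 15), since the slides show only part of a larger summary.

```python
from typing import List, NamedTuple, Tuple

class GKTuple(NamedTuple):
    v: int      # observed value
    g: int      # r_min(v_i) - r_min(v_{i-1})
    delta: int  # r_max(v_i) - r_min(v_i)

def rank_bounds(tuples: List[GKTuple], base: int = 0) -> List[Tuple[int, int]]:
    """Return (r_min, r_max) per tuple; base is r_min of the tuple before the first."""
    bounds = []
    r_min = base
    for t in tuples:
        r_min += t.g
        bounds.append((r_min, r_min + t.delta))
    return bounds

fragment = [GKTuple(192, 15, 2), GKTuple(201, 28, 7), GKTuple(204, 10, 1)]
print(rank_bounds(fragment, base=486))
```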

  8. Example (ε = .01, N = 1750) • Tuples, written as v: {g, Δ}, [r_min, r_max] • 192: {15, 2}, [501, 503] • 201: {28, 7}, [529, 536] • 204: {10, 1}, [539, 540]

  9. Query • Sketch S is ε-approximate; that is, for each ψ ∈ (0, 1] there is a (v_i, r_min(v_i), r_max(v_i)) in S such that r − εN ≤ r_min(v_i) and r_max(v_i) ≤ r + εN, where r = ⌈ψN⌉ • v_i is our answer for the ψ-quantile

  10. Corollary • If at any time n the summary S(n) satisfies the property that max_i (g_i + Δ_i) ≤ 2εn, then we can answer any ψ-quantile query to within εn precision.

  11. Overview of Summary Data Structure (ε = .01, N = 1800) • Quantile ψ = .29? Compute r = ψN = 522 and choose the best v_i • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 201: {28, 7}, [529, 536]; 204: {10, 1}, [539, 540]
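
A minimal sketch of the lookup on this slide, assuming the query rule is "return the first v_i whose rank interval lies within εN of the target rank r". The function name `query` and the (value, r_min, r_max) entry layout are illustrative, and the value-to-interval pairing is read off the slide's numbers.

```python
import math

def query(entries, psi, eps, n):
    """entries: (value, r_min, r_max) triples sorted by value."""
    r = math.ceil(psi * n)
    for v, r_min, r_max in entries:
        # Accept v if its whole rank interval lies within eps*n of r.
        if r - r_min <= eps * n and r_max - r <= eps * n:
            return v
    return None  # no stored value is close enough

entries = [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
print(query(entries, psi=0.29, eps=0.01, n=1800))
```

With r = 522 and εN = 18, the entry for 192 is rejected because 522 − 501 = 21 exceeds 18, while 201 satisfies both conditions.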

  12. Overview of Summary Data Structure (ε = .01, N = 1800, 2εN = 36) • If r_max(v_i) − r_min(v_{i−1}) ≤ 2εN for every i, then the summary is ε-approximate. • Our goal: always maintain this property. • Tuple formulation of this rule: g_i + Δ_i ≤ 2εN • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 201: {28, 7}, [529, 536]; 204: {10, 1}, [539, 540]

  13. Overview of Summary Data Structure (ε = .01, N = 1800, 2εN = 36) • Goal: always maintain an ε-approximate summary: r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN • Insert new observations into summary • New observation: 197 • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 201: {28, 7}, [529, 536]; 204: {10, 1}, [539, 540]

  14. Overview of Summary Data Structure (ε = .01, N = 1800, 2εN = 36) • Goal: always maintain an ε-approximate summary: r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN • Insert new observations into summary • 197 falls between 192 and 201, with rank bounds [502, 536] • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 201: {28, 7}, [529, 536]; 204: {10, 1}, [539, 540]

  15. Overview of Summary Data Structure (ε = .01, N = 1801, 2εN = 36.02) • Goal: always maintain an ε-approximate summary: r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN • Insert new observations into summary • Insert the new tuple before the ith tuple: g_new = 1; Δ_new = g_i + Δ_i − 1 • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 197: {1, 34}, [502, 536]; 201: {28, 7}, [530, 537]; 204: {10, 1}, [540, 541]

  16. Overview of Summary Data Structure (ε = .01, N = 1801, 2εN = 36.02) • Goal: always maintain an ε-approximate summary: r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN • Insert new observations into summary • Delete all “superfluous” entries. • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 197: {1, 34}, [502, 536]; 201: {28, 7}, [530, 537]; 204: {10, 1}, [540, 541]

  17. Overview of Summary Data Structure (ε = .01, N = 1801, 2εN = 36.02) • Goal: always maintain an ε-approximate summary: r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN • Insert new observations into summary • Delete all “superfluous” entries. • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 197: {1, 34} (being deleted); 201: {28, 7}, [530, 537]; 204: {10, 1}, [540, 541]

  18. Overview of Summary Data Structure (ε = .01, N = 1801, 2εN = 36.02) • Goal: always maintain an ε-approximate summary: r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN • Insert new observations into summary • Delete all “superfluous” entries: g_i = g_i + g_{i−1} • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 201: {29, 7}, [530, 537]; 204: {10, 1}, [540, 541]

  19. Overview of Summary Data Structure (ε = .01, N = 1801, 2εN = 36.02) • Insert: g_new = 1; Δ_new = g_i + Δ_i − 1 • Delete: g_i = g_i + g_{i−1} • Tuples, written as v: {g, Δ}, [r_min, r_max]: 192: {15, 2}, [501, 503]; 201: {29, 7}, [530, 537]; 204: {10, 1}, [540, 541]
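
The insert and delete rules on these slides can be replayed on the three-tuple fragment. This is a sketch under stated assumptions: tuples are plain (v, g, Δ) triples, the function names are illustrative, and giving a new minimum or maximum Δ = 0 is an assumption not shown on the slides.

```python
def insert(tuples, v):
    """Insert v before the first tuple with a larger value."""
    i = next((k for k, t in enumerate(tuples) if t[0] > v), len(tuples))
    if i == 0 or i == len(tuples):
        delta = 0  # assumption: a new extreme value has an exactly known rank
    else:
        _, g_i, d_i = tuples[i]
        delta = g_i + d_i - 1  # slide rule: delta_new = g_i + delta_i - 1
    return tuples[:i] + [(v, 1, delta)] + tuples[i:]

def delete(tuples, i):
    """Delete tuple i, folding its g into its right-hand neighbour."""
    v_next, g_next, d_next = tuples[i + 1]
    merged = (v_next, tuples[i][1] + g_next, d_next)
    return tuples[:i] + [merged] + tuples[i + 2:]

summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
after_insert = insert(summary, 197)      # 197 arrives, N becomes 1801
after_delete = delete(after_insert, 1)   # 197 is merged into 201
print(after_insert)
print(after_delete)
```

After the merge, the tuple for 201 has g + Δ = 29 + 7 = 36, which still fits under 2εN = 36.02.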

  20. Terminology • Full tuple: a tuple is full if g_i + Δ_i = 2εN • Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one • Capacity: the number of observations that can be counted by g_i before the tuple becomes full (= 2εN − Δ_i) • General strategy: delete tuples with small capacity and preserve tuples with large capacity.

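  21. Operations • Insert(v): find the smallest i such that v_{i−1} ≤ v < v_i, and insert the tuple (v, 1, g_i + Δ_i − 1) between t_{i−1} and t_i • Delete(v_i): to delete (v_i, g_i, Δ_i) from S, replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1}) by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}) • Compress(): from right to left, merge all mergeable pairs.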

  22. GK Algorithm • To add the (n+1)st observation, v, to summary S(n): a yes/no test decides whether COMPRESS() runs first (yes) or is skipped (no); INSERT(v) follows in either case
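
A simplified end-to-end sketch of this loop in Python follows. Several choices here are assumptions rather than the talk's exact method: COMPRESS is triggered every 1/(2ε) observations (the schedule used in the Greenwald-Khanna paper), a pair is merged whenever the result does not exceed 2εn (the band and subtree conditions from the tree representation are left out), and a new minimum or maximum is stored with Δ = 0. The class and method names are illustrative.

```python
import bisect
import math
import random

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [v, g, delta], kept sorted by v

    def insert(self, v):
        period = max(1, int(1 / (2 * self.eps)))
        if self.n > 0 and self.n % period == 0:
            self.compress()
        values = [t[0] for t in self.tuples]
        i = bisect.bisect_right(values, v)
        if i == 0 or i == len(self.tuples):
            delta = 0  # assumption: extremes have exactly known rank
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = 2 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:  # index 0 (minimum) and the last tuple (maximum) are kept
            g = self.tuples[i][1]
            nxt = self.tuples[i + 1]
            if g + nxt[1] + nxt[2] <= limit:  # merge must not overfill nxt
                nxt[1] += g
                del self.tuples[i]
            i -= 1

    def query(self, psi):
        r = math.ceil(psi * self.n)
        bound = self.eps * self.n
        r_min = 0
        for v, g, delta in self.tuples:
            r_min += g
            if r - r_min <= bound and (r_min + delta) - r <= bound:
                return v
        return self.tuples[-1][0]

# Usage: a shuffled stream of 1..10000, so the true rank of x is x itself.
random.seed(0)
stream = list(range(1, 10001))
random.shuffle(stream)
summary = GKSummary(eps=0.01)
for x in stream:
    summary.insert(x)
median = summary.query(0.5)
print(len(summary.tuples), median)
```

Rebuilding the list of values on every insert keeps the sketch short but costs time linear in the summary size; a real implementation would search the tuples directly.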

  23. Tree Representation (ε = .001, N = 7,000, 2εN = 14) • Group tuples with similar capacities into bands • First (least index) node to the right with higher capacity band becomes parent. • Band table (Δ-range / capacity / band): 0–7 / 8–15 / 3; 8–11 / 4–7 / 2; 12–13 / 2–3 / 1; 14 / 1 / 0 • [Figure: tuples drawn as nodes, each labeled with its band]

  24. Tree Representation (same band table and rules; next frame of the tree figure)

  25. Tree Representation (same band table and rules; next frame of the tree figure)

  26. Tree Representation (same band table and rules; final frame of the tree figure, rooted at R)
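
The band table and the parent rule can be sketched as follows. The formulas are inferred from the table on these slides, not taken from the paper's formal band definition: capacity = 2εN − Δ + 1 and band = ⌊log₂(capacity)⌋ reproduce every row for 2εN = 14. The band list passed to `parents` is a made-up example, because the node labels of the tree figure did not survive in order.

```python
def capacity(delta, p):
    """Capacity as it appears in the slide's table, with p = 2*eps*N."""
    return p - delta + 1

def band(delta, p):
    """floor(log2(capacity)), computed with integer arithmetic."""
    return capacity(delta, p).bit_length() - 1

def parents(bands):
    """For each node, the least index to its right with a higher band (None = root R)."""
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

p = 14  # 2 * 0.001 * 7000
print([(d, capacity(d, p), band(d, p)) for d in (0, 7, 8, 11, 12, 13, 14)])
print(parents([0, 1, 0, 2, 1, 3]))  # hypothetical band sequence
```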

  27. Operation (compress) • General strategy: delete tuples with small capacity and preserve tuples with large capacity. • 1) Deletion cannot leave descendants unmerged; it must delete entire subtrees • 2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity. • 3) Deletion cannot create an over-full tuple (i.e., one with g + Δ > ⌊2εN⌋)

  28. Analysis • Theorem: at any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn)
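
To get a feel for the size of this bound, the sketch below evaluates it numerically. It assumes the bound is (11/(2ε)) · log₂(2εn), the form stated in the Greenwald-Khanna paper, with the logarithm taken to base 2; the function name is illustrative.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case number of stored tuples, assuming (11 / (2*eps)) * log2(2*eps*n)."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)

for n in (10**5, 10**6, 10**7):
    print(n, round(gk_tuple_bound(0.01, n)))
```

The bound grows only logarithmically in n, so a tenfold longer stream adds a fixed number of tuples rather than multiplying the summary size.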

  29. Experimental Result • Measurement: • |S| • Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles • Optimal max observed ε • Compared 3 algorithms • MRL • Preallocated (1/3 the number of stored observations as MRL) • Adaptive: allocate a new quantile only when observed error is about to exceed desired ε

  30. Conclusion • Better worst-case behavior than previous algorithms • It does not require a priori knowledge of the parameter N

  31. Any questions?
