
Online Computation and Continuous Maintaining of Quantile Summaries

Online Computation and Continuous Maintaining of Quantile Summaries. Tian Xia Database Lab @ CCIS Northeastern University April 16, 2004. References. M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD , pages 58-66, 2001.



  1. Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database Lab @ CCIS Northeastern University April 16, 2004

  2. References • M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, pages 58-66, 2001. • X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. In ICDE, pages 362-373, 2004

  3. Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model

  4. Problem Definitions • φ-Quantile: A φ-quantile (φ ∈ (0,1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉. • Quantile Query: Given φ, find the data element with rank ⌈φN⌉ among all elements in the stream. • Variation: the N most recent elements (sliding window model). • ε-approximate: Find an element whose rank lies within the interval [r - εN, r + εN], where r = ⌈φN⌉.

  5. Example of A Quantile Query • Stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12. • The 0.5-quantile is the element of rank 8, which is 8. • A 0.25-approximate 0.5-quantile query returns one of the elements in {4,5,6,7,8,9,10}.
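The two answers in this example can be checked by brute force. The sketch below is not a streaming algorithm: it sorts the whole sequence, takes the rank as ⌈φN⌉ (the ceiling is an assumption about the rounding), and uses the hypothetical helper names `exact_quantile` and `approx_answers`.

```python
import math

# The 16-element example stream, in arrival order t0..t15.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def exact_quantile(data, phi):
    """Return the element of rank ceil(phi * N); ranks start at 1."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

def approx_answers(data, phi, eps):
    """Return every distinct value whose rank lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    lo, hi = r - eps * n, r + eps * n
    return sorted({v for rank, v in enumerate(ordered, start=1) if lo <= rank <= hi})

print(exact_quantile(stream, 0.5))
print(approx_answers(stream, 0.5, 0.25))
```

Any value in the second result is an acceptable answer to the 0.25-approximate query, which is what lets a summary discard most of the stream.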

  6. Why Approximation? • Munro and Paterson (Theoretical Computer Science, 1980) showed that any algorithm that exactly computes the φ-quantile of N data elements in p passes requires Ω(N^(1/p)) space. • Approximate quantile techniques are necessary to achieve sub-linear space.

  7. Quantile Summary • Quantile Summary: A small number of objects from the input data sequence, which a quantile estimator can use to answer quantile queries. • Other summary methods for large data sets include the average, the standard deviation, histograms, counting sketches (FM-sketch), etc.

  8. Properties of A Good Quantile Estimator • Provide tunable and explicit a priori guarantees on the precision of the approximation, e.g. it is ε-approximate. • Data independent. • Use as small a memory footprint as possible, which includes temporary storage.

  9. Previous Work • Manku, Rajagopalan, and Lindsay (SIGMOD, 1998) proposed a single-pass algorithm that constructs an ε-approximate quantile summary. • Space complexity: O((1/ε) log²(εN)). • It requires advance knowledge of N, the size of the data set, so it does not work in a data stream environment.

  10. Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model

  11. Contributions of GK-algorithm • Dynamically adjusts the quantile summary with the growth of N, the total number of data elements in the data stream. • Space complexity is reduced to O((1/ε) log(εN)).

  12. Assumptions • A new data element arrives after each unit of time. • n denotes both the number of elements in the data sequence and the current time. • A data element is represented by its value v. • r_min(v) and r_max(v) denote respectively the lower and upper bounds on the actual rank r of v among the elements seen so far.

  13. The Summary Data Structure • GK-algorithm maintains a summary data structure S = S(n) at any point in time n. • S(n) consists of an ordered (non-decreasing) sequence of tuples that correspond to a subset of the elements seen thus far.

  14. The Summary Data Structure • S = {t_0, t_1, …, t_{s-1}}, where t_i = (v_i, g_i, Δ_i). • v_i is the value of one of the elements seen so far. • g_i = r_min(v_i) - r_min(v_{i-1}) • Δ_i = r_max(v_i) - r_min(v_i) • v_0 and v_{s-1} always correspond to the minimum and the maximum elements seen so far.

  15. The Summary Data Structure • Given g_i = r_min(v_i) - r_min(v_{i-1}) and Δ_i = r_max(v_i) - r_min(v_i), • r_min(v_i) = Σ_{j≤i} g_j • r_max(v_i) = Σ_{j≤i} g_j + Δ_i • g_i + Δ_i - 1 is an upper bound on the total number of elements that may have fallen between v_{i-1} and v_i. • r_min(v_{s-1}) = Σ_i g_i = n.

  16. Example of A Quantile Summary • Stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples. • For clarity, re-write the tuples of the above summary in the form t_i = (v_i, r_min(v_i), r_max(v_i)) as follows: {(1,1,1), (2,2,9), (3,3,10), (4,4,10), (10,10,10), (12,16,16)}.
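The rewrite from (v, g, Δ) to (v, r_min, r_max) is a running sum over the g values. A minimal sketch, with `rank_bounds` as a hypothetical helper name:

```python
from itertools import accumulate

def rank_bounds(summary):
    """Convert (v, g, delta) tuples into (v, rmin, rmax) triples.

    rmin is the running sum of g; rmax is rmin + delta.
    """
    rmins = accumulate(g for _, g, _ in summary)
    return [(v, rmin, rmin + delta) for (v, _, delta), rmin in zip(summary, rmins)]

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
print(rank_bounds(summary))
```

The last r_min equals the number of elements seen (16 here), matching the identity that the g values sum to n.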

  17. Error Rate? • PROPOSITION 1: Given a quantile summary S, a φ-quantile can always be identified to within an error of max_i(g_i + Δ_i)/2. • COROLLARY 1: If at any time n the summary S(n) satisfies the property that max_i(g_i + Δ_i) ≤ 2εn, then we can answer any φ-quantile query to within εn precision.

  18. QUANTILE () • QUANTILE(): To compute an -approximate -quantile from the summary S(n) after n data elements, compute the rank r=n. Find i such that both r rmin(vi) n and rmax(vi) r n, return vi. • i.e. rn rmin(vi)  rmax(vi)  r  n

  19. Example of A Quantile Summary • Stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is 0.25-approximate with respect to the data stream. • A 0.25-approximate 0.5-quantile query returns the element (4,1,6) or (10,6,0).
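The QUANTILE rule can be sketched as a single scan that accumulates r_min on the fly. The function names are hypothetical, and the rank is taken as ⌈φn⌉ (the rounding is an assumption).

```python
import math

def quantile_candidates(summary, n, phi, eps):
    """Return every v_i with r - rmin(v_i) <= eps*n and rmax(v_i) - r <= eps*n."""
    r = math.ceil(phi * n)
    rmin = 0
    hits = []
    for v, g, delta in summary:
        rmin += g                 # rmin(v_i) is the running sum of g
        rmax = rmin + delta       # rmax(v_i) = rmin(v_i) + delta_i
        if r - rmin <= eps * n and rmax - r <= eps * n:
            hits.append(v)
    return hits

def quantile(summary, n, phi, eps):
    """Return the first qualifying value, or None if the summary is too coarse."""
    hits = quantile_candidates(summary, n, phi, eps)
    return hits[0] if hits else None

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
print(quantile_candidates(summary, n=16, phi=0.5, eps=0.25))
```

For the example summary the qualifying values are 4 and 10; both lie in the acceptable set {4, ..., 10} of the brute-force example.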

  20. Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model

  21. How does their algorithm work? • Insert a tuple in the summary corresponding to each new incoming element. • Periodically sweep over the summary to "merge" some of the tuples into their neighbors. • The merging keeps the space requirement bounded. • At all times max_i(g_i + Δ_i) ≤ 2εn. • What to merge & How to merge?

  22. INSERT (v) • INSERT(v): Find the smallest i such that v_{i-1} ≤ v < v_i, and insert the tuple (v, 1, ⌊2εn⌋) between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the new maximum element seen, then insert (v, 1, 0).

  23. Example of INSERT • Elements inserted, in order: 12 (t0), 6 (t8), 10 (t3), 1 (t4). • S={(12, 1, 0)}, n=1 • S={(6, 1, 0), (12, 1, 0)}, n=2 • S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3 • S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4
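A minimal sketch of INSERT that replays this example. It assumes ε = 0.25 (the example does not state it, but the value is consistent with the Δ shown), takes n as the number of elements seen before the insertion, and treats a value equal to the current maximum as a new maximum. The name `insert` and the list-of-tuples layout are illustrative choices.

```python
import math
from bisect import bisect_right

def insert(summary, v, n, eps):
    """Insert v into summary, a list of (v, g, delta) tuples sorted by v.

    n is the number of elements seen before v arrives.
    """
    pos = bisect_right([t[0] for t in summary], v)
    # A new minimum or maximum has an exactly known rank, so delta is 0.
    is_extreme = pos == 0 or pos == len(summary)
    delta = 0 if is_extreme else math.floor(2 * eps * n)
    summary.insert(pos, (v, 1, delta))

S = []
for n, v in enumerate([12, 6, 10, 1]):
    insert(S, v, n, eps=0.25)
print(S)
```

`bisect_right` places a value after any equal values already present, which matches the condition v_{i-1} ≤ v < v_i.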

  24. Merge • Space will increase with insertions. • Intuitively, two tuples (v_i, g_i, Δ_i) and (v_j, g_j, Δ_j) can be merged into a new tuple (v_k, g_k, Δ_k), as long as g_k + Δ_k ≤ 2εn. • An individual tuple is full if g_k + Δ_k ≥ 2εn. • Capacity and Band are introduced.

  25. Capacity and Band • The capacity of a tuple is the maximum number of elements that can be counted by g_i before the tuple becomes full (g_i ≤ 2εn - Δ_i). • The merge phase will free up space by merging tuples with small capacities into tuples with similar or larger capacities. • Bands: Roughly speaking, divide the Δs into bands that lie between elements of (0, ½·2εn, ¾·2εn, …, ((2^i - 1)/2^i)·2εn, …, 2εn - 1, 2εn). • The larger the capacity (that is, the smaller the Δ), the larger the band.

  26. Example of A Quantile Summary • Stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples. • (2,1,7) and (3,1,7) are in the lowest band. (1,1,0), (10,6,0) and (12,6,0) are in the highest band.

  27. Band • Strictly: given α from 1 to ⌈log 2εn⌉ and p = ⌊2εn⌋, band α is the set of all Δ such that p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1)). • If two Δs are ever in the same band, they never appear in different bands as n increases. • In band 0, Δ = p = ⌊2εn⌋. • A tree structure is imposed to facilitate merges between bands.
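The band definition can be evaluated directly by trying α = 1, 2, … until Δ falls inside the interval. This is a sketch under one stated assumption: the loop is allowed to run past ⌈log 2εn⌉ when necessary, so that Δ = 0 always lands in the highest band. The function name `band` is illustrative.

```python
def band(delta, p):
    """Return the band index of delta, where p = floor(2 * eps * n).

    Band 0 holds only delta == p; otherwise alpha is the index with
    p - 2^alpha - (p mod 2^alpha) < delta <= p - 2^(alpha-1) - (p mod 2^(alpha-1)).
    """
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1  # assumption: keep going until delta is covered

# n = 16, eps = 0.25 gives p = 8.
print([band(d, 8) for d in (8, 7, 6, 5, 4, 0)])
```

With p = 8, Δ = 7 falls in band 1 and Δ = 6 in band 2, while Δ = 0 gets the largest index, in line with the example where (2,1,7) and (3,1,7) sit in the lowest band present.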

  28. Tree Representation • Given a summary S = {t_0, t_1, …, t_{s-1}}, the tree T associated with S contains a node V_i for each t_i and a special root node R. • The parent of a node V_i is the node V_j such that j is the least index greater than i with band(t_j) > band(t_i). If no such index exists, R is the parent.

  29. Tree Representation • [Figure: the tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0).] • PROPOSITION 3: The children of any node in T are always arranged in non-increasing order of band in S. • PROPOSITION 4: For any node V, the set of all its descendants in T forms a contiguous segment in S.
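The parent rule can be sketched as a scan to the right for the first tuple in a strictly higher band. The band values used below, [4, 1, 1, 2, 4, 4], are an assumption: they come from evaluating the band formula with p = 8 for the six example tuples, with Δ = 0 placed in the highest band. `parents` is a hypothetical helper name.

```python
def parents(bands):
    """For each tuple index i, return the index of its parent, or None for the root R."""
    result = []
    for i, b in enumerate(bands):
        # Least index j > i whose band is strictly greater than band(t_i).
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

# Bands of (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0) for p = 8.
bands = [4, 1, 1, 2, 4, 4]
print(parents(bands))
```

Under these band values, (2,1,7) and (3,1,7) hang under (4,1,6), which hangs under (10,6,0); the remaining tuples are children of R. The descendants of (10,6,0) occupy indices 1 to 3, a contiguous segment as Proposition 4 states.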

  30. Merge Actually • GK-algorithm merges a node and all its descendants into either its parent node or its right sibling. • The tuple that results after the merge must not be full, i.e. g_i + Δ_i < 2εn. • The operation is called COMPRESS().

  31. COMPRESS ( ) • The operation COMPRESS tries to merge a node and all its descendants into either its parent node or its right sibling.
      COMPRESS()
        for i from s-2 to 0 do
          if ((BAND(Δ_i, 2εn) ≤ BAND(Δ_{i+1}, 2εn)) && (g*_i + g_{i+1} + Δ_{i+1} < 2εn)) then
            DELETE all descendants of t_i and the tuple t_i itself;
          end if
        end for
      end COMPRESS
      g*_i denotes the sum of the g-values of the tuple t_i and all its descendants in T.

  32. DELETE (v_i) • DELETE(v_i): To delete the tuple (v_i, g_i, Δ_i) from S, replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1}) by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}), and decrement s.

  33. Example of COMPRESS and DELETE • Stream so far (t0 to t5): 12, 10, 11, 10, 1, 10. • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, s=6, n=6 • Compress tuples (11, 1, 1) and (12, 1, 0) into a new tuple (12, 2, 0). • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, s=5, n=6
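DELETE on its own is a small list operation: the deleted tuple's g is added to its right neighbour, whose value and Δ are kept. A sketch with the hypothetical name `delete`, applied to the state in this example:

```python
def delete(summary, i):
    """Merge tuple i into its right neighbour i+1 and return the new summary."""
    v_next, g_next, delta_next = summary[i + 1]
    merged = (v_next, summary[i][1] + g_next, delta_next)
    return summary[:i] + [merged] + summary[i + 2:]

S = [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)]
print(delete(S, 4))  # removes (11, 1, 1)
```

The sum of the g values is unchanged by the merge, so r_min of the last tuple still equals n.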

  34. Pseudo-Code for the whole algorithm
      Initial State
        S ← ∅; s ← 0; n ← 0;
      Algorithm
        To add the (n+1)st element, v, to summary S(n):
        if (n ≡ 0 mod 1/(2ε)) then
          COMPRESS();
        end if
        INSERT(v);
        n = n + 1;

  35. A Complete Example • Stream so far (t0 to t6): 12, 10, 11, 10, 1, 10, 11. • S={(10, 1, 0), (12, 1, 0)}, n=2 • S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4 • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, n=6, s=6 • Perform compress when t6 comes. • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, n=6, s=5

  36. A Complete Example • Stream so far (t0 to t8): 12, 10, 11, 10, 1, 10, 11, 9, 6. • S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 3), (12, 2, 0)}, n=8, s=7 • Perform compress when t8 comes. • S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)}, n=8, s=5

  37. A Complete Example • Stream (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14, s=5 • Perform compress. • S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4 • Finally: S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=16, s=6
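The complete example can be replayed with a compact sketch that combines INSERT, COMPRESS and DELETE. It is an illustration under stated assumptions, not the authors' implementation: ε = 0.25 (so COMPRESS runs every 1/(2ε) = 2 arrivals), n counts the elements seen before the current one, the sweep stops at index 1 so the minimum tuple is never deleted, and Δ = 0 is placed in the highest band. The class name `GKSummary` and its methods are hypothetical.

```python
import math
from bisect import bisect_right

def band(delta, p):
    """Band index of delta for p = floor(2*eps*n); delta == p is band 0."""
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [v, g, delta] lists, sorted by v
        self.period = max(1, math.floor(1 / (2 * eps)))

    def add(self, v):
        if self.n % self.period == 0:
            self.compress()
        pos = bisect_right([t[0] for t in self.tuples], v)
        extreme = pos == 0 or pos == len(self.tuples)  # new min or max
        delta = 0 if extreme else math.floor(2 * self.eps * self.n)
        self.tuples.insert(pos, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = 2 * self.eps * self.n
        p = math.floor(limit)
        t = self.tuples
        i = len(t) - 2
        while i >= 1:  # assumption: index 0 (the minimum) is kept
            b = band(t[i][2], p)
            lo = i
            # Descendants of t[i]: the contiguous run to its left with smaller band.
            while lo - 1 >= 1 and band(t[lo - 1][2], p) < b:
                lo -= 1
            g_star = sum(x[1] for x in t[lo:i + 1])
            right = t[i + 1]
            if b <= band(right[2], p) and g_star + right[1] + right[2] < limit:
                right[1] += g_star   # DELETE: fold the g values into the neighbour
                del t[lo:i + 1]
                i = lo - 1
            else:
                i -= 1

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
gk = GKSummary(eps=0.25)
for v in stream:
    gk.add(v)
print([tuple(t) for t in gk.tuples])
```

Under these assumptions the replay ends with the six tuples shown as the final state, and the states after the compress steps at n=6 and n=8 also agree with the walk-through.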

  38. Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model

  39. Band Property • Observe that the number of bands and the number of elements in a band determine the space complexity. • PROPOSITION 2: At any point in time n and for any α ≥ 1, band_α(n) contains either 2^α or 2^(α-1) distinct values of Δ. • Since no more than 1/(2ε) elements with any given Δ are inserted, band α is a summary of at most 2^α/(2ε) elements in the stream.

  40. LEMMAs • LEMMA 3: At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value of α. • Only a small number of nodes can have a child with band α. See Proposition 3.

  41. LEMMAs • A full pair of tuples (t_{i-1}, t_i): band(t_{i-1}) ≤ band(t_i). The tuple t_{i-1} is the left partner and t_i is the right partner in this full pair. • LEMMA 4: At any time n and for any given α, there are at most 4/ε tuples from band_α(n) that are right partners in a full tuple pair.

  42. Full Pair Example • [Figure: the tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0).] • {(2,1,7), (3,1,7)} is a full pair. • {(1,1,0), (2,1,7)} is not a full pair. • (2,1,7) can only be a left partner!

  43. Space Efficiency • Any band_α(n) node either is a right partner of a full pair, or can only be a left partner. • By Proposition 3, a band_α(n) node that can only be a left partner occurs only once for every parent of nodes from band_α(n). • By Lemmas 3 and 4, the number of nodes in any band is bounded by 3/(2ε) + 4/ε = 11/(2ε).

  44. Space Efficiency • The number of bands is at most ⌈log(2εn)⌉ + 1 (band 0 plus bands 1 to ⌈log(2εn)⌉). • THEOREM: At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn). • GK-algorithm's space complexity is O((1/ε) log(εN)).
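To get a feel for the bound in the theorem, it can be evaluated numerically. The sketch assumes the logarithm is base 2 (consistent with the powers of two in the band definition); the values ε = 0.01 and n = 10^6 are illustrative only, and `gk_tuple_bound` is a hypothetical name.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n) from the theorem."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)

bound = gk_tuple_bound(0.01, 10**6)
print(f"at most about {bound:.0f} tuples for 1,000,000 elements")
```

For these illustrative values the bound is a few thousand tuples, far below the million elements in the stream, and it grows only logarithmically as n increases.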

  45. Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model

  46. Sliding Window Model • Under the sliding window model, a summary is maintained for the N most recently seen data elements. • Eliminating exactly the out-dated elements requires O(N) space. • Lin et al. (ICDE 2004) proposed a space-efficient one-pass summary algorithm for the sliding window model. Their underlying summary algorithm is the GK-algorithm.

  47. n-of-N Model • A summary is maintained for the N most recently seen data elements. However, quantile queries can be issued against any n ≤ N. That is, for any φ ∈ (0,1] and any n ≤ N, we can return φ-quantiles among the n most recent elements in a data stream seen so far. • Lin et al. (ICDE 2004) proposed a one-pass summary algorithm that combines the EH partitioning technique (Datar et al., ACM-SIAM 2002) with the GK-algorithm to solve the n-of-N model.

  48. Example of n-of-N model • Stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • Assume the sliding window is 16 in an n-of-N model. A quantile query can be answered for any 1 ≤ n ≤ 16. • The 0.5-quantile is 6 for n=12 and 3 for n=4. • FYI: The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
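The n-of-N answers in this example can be checked by brute force over the last n elements. This is an exact computation that keeps the whole window, not the space-efficient algorithm of Lin et al.; `recent_quantile` is a hypothetical name and the rank is taken as ⌈φn⌉.

```python
import math

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def recent_quantile(data, n, phi):
    """Exact phi-quantile among the n most recent elements of data."""
    window = sorted(data[-n:])
    return window[math.ceil(phi * n) - 1]

for n in (16, 12, 4):
    print(n, recent_quantile(stream, n, 0.5))
```

Storing the full window costs O(N) space, which is the cost the summary-based approach is designed to avoid.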

  49. Thank you!
