
Space-Efficient Online Computation of Quantile Summaries


Presentation Transcript


  1. Space-Efficient Online Computation of Quantile Summaries. Michael Greenwald & Sanjeev Khanna, University of Pennsylvania. Presented by Nir Levy

  2. Introduction • The problem: we are given a very large data set and we wish to compute Φ-quantiles over it in a single pass, using a space-efficient computation. • Def: The Φ-quantile of an ordered sequence of N data items is the value with rank ΦN (the element in the ΦN-th position). • We are going to see an online algorithm for computing ε-approximate quantile summaries of a very large data sequence. • Def: An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN. • Def: A quantile summary consists of a small number of points from the input data sequence, and uses those stored points to give approximate responses to any arbitrary quantile query.

  3. Introduction cont… • EXAMPLE • Input data: 14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4. Ordered (descending): 19, 15, 14, 14, 12, 12, 11, 9, 8, 6, 5, 4, 4, 3, 2, 1. Rank: 1, 2, 3, …, 16. What is the 2nd biggest number? (15). What is the 25% quantile? 16·0.25 = 4, so the 4th number (14). Summary: 19, 14, 11, 6, 4, 1 with ranks 1, 4, 7, 10, 13, 16. Using the summary: what is the 2nd biggest number? Rank 2 is approximated by rank 1, giving 19. What is the 25% quantile? 16·0.25 = 4, and rank 4 is stored, giving 14.

  4. Quantile estimation for Database Applications • Estimate the size of intermediate results, allowing query optimizers to estimate the cost of competing plans for resolving database queries. • Partition data into roughly equal partitions for parallel databases. • Prevent expensive and incorrect queries from being issued, by estimating result sizes and giving feedback to the users. • Characterize the distribution of real-world data sets for database users.

  5. Properties Desirable properties for quantile estimators: 1. Provide tunable and explicit guarantees on the precision of the approximation. That is, for any given rank r, an ε-approximate quantile summary returns a value whose rank r' is guaranteed to be within the interval [r − εN, r + εN]. 2. Be data independent. That is, neither affected by the arrival order nor by the distribution of the values, and not requiring a priori knowledge of the size of the dataset. 3. Execute in a single pass over the data. 4. Have as small a memory footprint as possible (this also applies to temporary storage used during the computation).

  6. Previous Work • Manku, Rajagopalan and Lindsay (MRL) presented a single-pass, ε-approximate quantile summary algorithm that requires O((1/ε)·log²(εN)) space, but it needs advance knowledge of N (otherwise it only provides a probabilistic guarantee on the precision). • Gibbons, Matias and Poosala presented a multiple-pass algorithm with a probabilistic guarantee. • Munro and Paterson showed that any algorithm that exactly computes the Φ-quantile in only p passes requires Ω(N^(1/p)) space.

  7. This algorithm • Presents a worst-case space requirement of O((1/ε)·log(εN)), thus improving upon the previous best result of O((1/ε)·log²(εN)). • In contrast to earlier algorithms, this algorithm doesn't require a priori knowledge of the length of the input sequence. • It is based on a novel data structure that effectively maintains the range of possible ranks for each quantile that it stores. • The behavior relies on the fact that no input sequence can be "bad" across the entire distribution; that is, the input sequence cannot keep presenting new observations that must be stored without allowing old stored observations to be deleted.

  8. The Data Structure • Assume w.l.o.g. that a new observation arrives after each unit of time. • Denote by n the number of observations seen so far, which also serves as the current time. • Denote by ε the given precision requirement. • Denote by S = S(n) the summary data structure at any time; S(n) consists of an ordered sequence of elements corresponding to a subset of the observations seen thus far. • For each observation v in S, maintain an implicit bound on the minimum and the maximum possible rank of v among the first n observations, denoted Rmin(v) and Rmax(v).

  9. Data structure cont… • More formally, let S(n) be the sequence of tuples t_0, t_1, …, t_{s−1}, where t_i = (v_i, g_i, Δ_i): v_i – one of the elements of the data stream; g_i = Rmin(v_i) − Rmin(v_{i−1}); Δ_i = Rmax(v_i) − Rmin(v_i). • ∑_{j≤i} g_j = [Rmin(v_i) − Rmin(v_{i−1})] + [Rmin(v_{i−1}) − Rmin(v_{i−2})] + … + [Rmin(v_1) − Rmin(v_0)] + Rmin(v_0) = Rmin(v_i) (a telescoping sum). • (∑_{j≤i} g_j) + Δ_i = Rmin(v_i) + Rmax(v_i) − Rmin(v_i) = Rmax(v_i).
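Below is a minimal Python sketch of the structure described on this slide; the names (Tuple, rmin, rmax, summary) are illustrative and not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    v: float    # an observation kept in the summary
    g: int      # g_i = Rmin(v_i) - Rmin(v_{i-1})
    delta: int  # Delta_i = Rmax(v_i) - Rmin(v_i)

def rmin(summary, i):
    """Rmin(v_i) = sum of g_j for j <= i (the telescoping sum above)."""
    return sum(t.g for t in summary[:i + 1])

def rmax(summary, i):
    """Rmax(v_i) = Rmin(v_i) + Delta_i."""
    return rmin(summary, i) + summary[i].delta
```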

  10. Data structure cont… • At all times ensure that v_0 and v_{s−1} correspond to the minimum and maximum elements seen so far. • g_i + Δ_i − 1 is an upper bound on the total number of observations that may have fallen between v_{i−1} and v_i. • ∑_i g_i = n, the total number of observations seen so far.

  11. Answering Quantile Queries • Proposition 1: Given a quantile summary S in the above form, a Φ-quantile can always be identified to within an error of max_i (g_i + Δ_i)/2. Proof: let r = Φn and let e = max_i (g_i + Δ_i)/2. Search for an index i such that r − e ≤ Rmin(v_i) and Rmax(v_i) ≤ r + e; then v_i approximates the Φ-quantile within the claimed error bound.

  12. Answering Quantile Queries cont… All that is left to show is that such an index i must always exist. Consider the case r > n − e: we have Rmin(v_{s−1}) = Rmax(v_{s−1}) = n, and therefore i = s − 1 is valid. Otherwise r ≤ n − e. Choose the smallest j such that Rmax(v_j) > r + e; it follows that Rmin(v_{j−1}) ≥ r − e, since if Rmin(v_{j−1}) < r − e we would get g_j + Δ_j = Rmax(v_j) − Rmin(v_{j−1}) > (r + e) − (r − e) = 2e, a contradiction to the assumption that e = max_i (g_i + Δ_i)/2.

  13. Answering Quantile Queries cont… • By the choice of j we have Rmax(v_{j−1}) ≤ r + e, therefore i = j − 1 is an index with the desired property. • Corollary 1: if at any time n the summary S(n) satisfies the property that max_i (g_i + Δ_i) ≤ 2εn, then we can answer any Φ-quantile query to within εn precision.
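A small Python sketch of answering a Φ-quantile query along the lines of Proposition 1 / Corollary 1. It assumes the summary is kept as a list of (v, g, Δ) triples ordered by value, as in the earlier sketch, and that max_i(g_i + Δ_i) ≤ 2εn holds.

```python
def quantile_query(summary, phi, n, eps):
    """Return a stored value whose rank is within eps*n of the target rank r = phi*n."""
    r = phi * n
    rmin = 0
    for (v, g, delta) in summary:
        rmin += g
        rmax = rmin + delta
        if r - rmin <= eps * n and rmax - r <= eps * n:
            return v
    # Unreachable when Corollary 1 holds (the maximum element always qualifies
    # for ranks close to n); kept as a safe fallback.
    return summary[-1][0]
```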

  14. Data structure cont… At a high level: • On a new observation – insert into the summary a tuple corresponding to this observation. • Periodically, perform a sweep over the summary to "merge" some of the tuples into their neighbors so as to free space. • Maintain several conditions in order to bound the space used by S at any time. • By Corollary 1 it suffices to ensure that at all times max_i (g_i + Δ_i) ≤ 2εn. • Def: An individual tuple is full if g_i + Δ_i = 2εn. • Def: The capacity of an individual tuple is the maximum number of observations that can be counted by g_i before the tuple becomes full.

  15. BANDS • General strategy: delete tuples with small capacities and preserve tuples with large capacities. • In the merge phase, free up space by merging tuples with small capacities into tuples with "similar" or larger capacities. • We say that two tuples t_i and t_j have similar capacities if log capacity(t_i) ≈ log capacity(t_j). • This notion of similarity partitions the possible values of Δ into bands. • We try to divide the Δ's into bands that lie between the boundaries 0, (1/2)(2εn), (3/4)(2εn), …, ((2^i − 1)/2^i)(2εn), …, 2εn − 1, 2εn. • These boundaries correspond to capacities of 2εn, εn, (1/2)εn, …, (1/2^i)εn, …, 8, 4, 2, 1.

  16. BANDS cont… • Define band_α to be the set of all Δ such that: p − 2^α − (p mod 2^α) < Δ ≤ p − 2^(α−1) − (p mod 2^(α−1)), where p = 2εn and α = 1 … log(2εn). • The above definition ensures that if two Δ's are ever in the same band, they never appear in different bands as n increases. • Define band_0 simply to be {p}. • Consider the first 1/(2ε) observations, with Δ = 0, to be in a band of their own.
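A minimal Python sketch of the band definition above, looping over α and checking the interval bounds directly; the function name and the handling of the Δ = 0 special case for the first 1/(2ε) observations are illustrative assumptions.

```python
import math

def band(delta, p):
    """Band of a Delta value for p = floor(2*eps*n), per the definition above (assumes p >= 1)."""
    if delta == p:
        return 0                               # band_0 contains only Delta = p
    max_alpha = int(math.ceil(math.log2(p)))   # alpha ranges over 1 .. log(2*eps*n)
    for alpha in range(1, max_alpha + 1):
        lower = p - 2 ** alpha - (p % 2 ** alpha)
        upper = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
        if lower < delta <= upper:
            return alpha
    # Delta values (in particular Delta = 0 from the earliest observations) that
    # fall outside every interval are treated as a band of their own, above all others.
    return max_alpha + 1
```

For example, with ε = 1/8 and n = 28 (so p = 7), band(0, 7) = 3, band(3, 7) = 2, band(6, 7) = 1 and band(7, 7) = 0.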

  17. BANDS cont… • Example • Consider ε = 1/8. The Δ value associated with the n-th observation:
Δ = 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6
n = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28

  18. BANDS cont…

  19. BANDS cont… • Proposition 2: at any point in time n and for any α ≥ 1, band_α(n) contains either 2^(α−1) or 2^α distinct values of Δ. PROOF: according to the upper and lower bounds of band_α, 2εn − 2^α − (2εn mod 2^α) < Δ ≤ 2εn − 2^(α−1) − (2εn mod 2^(α−1)). If (2εn mod 2^α) < 2^(α−1), then (2εn mod 2^α) = (2εn mod 2^(α−1)), so |band_α| = 2^α − 2^(α−1) = 2^(α−1) distinct values of Δ. If (2εn mod 2^α) ≥ 2^(α−1), then (2εn mod 2^α) = 2^(α−1) + (2εn mod 2^(α−1)), so |band_α| = 2^(α−1) + 2^(α−1) = 2^α distinct values of Δ.

  20. A tree representation • For S = t_0, t_1, …, t_{s−1}, impose a tree structure T over the tuples of S. • Assign a special root node R. • For every tuple t_i assign a node V_i. • The parent of a node V_i is the node V_j such that j is the least index greater than i with band(t_j) > band(t_i). If no such j exists, then set R to be the parent. • All children (and all descendants) of a given node V_i have Δ values larger than Δ_i.
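A small Python sketch of the parent rule above; it takes the band value of each tuple (computed, e.g., with the band() sketch earlier) and returns the parent index of each node, with None standing for the root R. The function name is illustrative.

```python
def tree_parents(bands):
    """bands[i] = band(t_i, n); the parent of node i is the closest j > i with a larger band."""
    parents = []
    for i in range(len(bands)):
        parent = None                      # None plays the role of the root R
        for j in range(i + 1, len(bands)):
            if bands[j] > bands[i]:
                parent = j
                break
        parents.append(parent)
    return parents
```

For example, tree_parents([1, 3, 2, 5]) returns [1, 3, 3, None].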

  21. A tree representation • Proposition 3: the children of any node in T are always arranged in non-increasing order of band in S. • Proposition 4: for any node V, the set of all its descendants in T forms a contiguous segment in S.

  22. Operations • Goal: compute an ε-approximate Φ-quantile from S(n) after n observations. • During the operations we wish to maintain the correct relationship between g_i, Δ_i, Rmin and Rmax. • QUANTILE(Φ): compute the rank r = Φn, find i such that r − Rmin(v_i) ≤ εn and Rmax(v_i) − r ≤ εn, and return v_i. • INSERT(v): find the smallest i such that v_{i−1} ≤ v < v_i, and insert the tuple (v, 1, 2εn) between t_{i−1} and t_i. If v is the new minimum or maximum seen, insert (v, 1, 0) instead.

  23. Operations Cont… • INSERT(v) maintains the correct relationship between g_i, Δ_i, Rmin and Rmax: • If v is inserted before v_i, the value of Rmin(v) may be as small as Rmin(v_{i−1}) + 1; similarly, Rmax(v) may be as large as the current Rmax(v_i), so the gap Rmax(v) − Rmin(v) is bounded by 2εn. • Note that Rmin(v_i) and Rmax(v_i) each increase by 1 after the insertion.
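A Python sketch of INSERT(v) under the conventions above; here the summary is kept as a list of [value, g, delta] entries ordered by value, and new extreme values get Δ = 0. The helper name and the list representation are illustrative.

```python
import math

def insert(summary, v, n, eps):
    """Insert observation v into the summary after n observations have been seen."""
    if not summary or v < summary[0][0]:
        summary.insert(0, [v, 1, 0])               # new minimum seen so far
    elif v >= summary[-1][0]:
        summary.append([v, 1, 0])                  # new maximum seen so far
    else:
        delta = math.floor(2 * eps * n)            # Delta for an interior insertion
        i = next(k for k, t in enumerate(summary) if v < t[0])
        summary.insert(i, [v, 1, delta])           # goes between t_{i-1} and t_i
```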

  24. Operations Cont… • DELETE(v_i): replace the tuples (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1}) with the single tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}). • Deleting v_i has no effect on Rmin(v_{i+1}) and Rmax(v_{i+1}), so the operation should simply preserve them. • The relationship between Rmin(v_{i+1}) and Rmax(v_{i+1}) is preserved as long as Δ_{i+1} is unchanged. • Since Rmin(v_{i+1}) = ∑_{j≤i+1} g_j and we deleted g_i, we must increase g_{i+1} by g_i to keep Rmin(v_{i+1}) unchanged.
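A Python sketch of DELETE(v_i), continuing the list-of-[value, g, delta] representation from the INSERT sketch: the tuple at index i is folded into its right neighbour so that Rmin and Rmax of the neighbour stay unchanged.

```python
def delete(summary, i):
    """Merge tuple i into tuple i+1: the neighbour's g absorbs g_i, its Delta is kept."""
    summary[i + 1][1] += summary[i][1]   # g_{i+1} += g_i, preserving Rmin(v_{i+1})
    del summary[i]                       # Delta_{i+1} is untouched
```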

  25. COMPRESS • The operation COMPRESS tries to merge a node and all its descendants into either its parent node or its right sibling (by deleting them). • During COMPRESS we must ensure that the tuple that results from the merging is not full. • Two adjacent tuples t_i, t_{i+1} are mergeable if the resulting tuple is not full and band(t_i, n) ≤ band(t_{i+1}, n). • Note that a pair of tuples that is not mergeable at some point in time may become mergeable at a later point, as the term 2εn increases over time. • Let g_i* denote the sum of the g-values of tuple t_i and all its descendants in T.

  26. Operations Cont… • COMPRESS():
for i from s − 2 down to 0 do
  if BAND(Δ_i, 2εn) ≤ BAND(Δ_{i+1}, 2εn) and g_i* + g_{i+1} + Δ_{i+1} < 2εn then
    delete all descendants of t_i and the tuple t_i itself
  end if
end for
• COMPRESS inspects tuples from right (highest index) to left. It first combines children (and their entire subtrees of descendants) into their parents, and combines a node into its right sibling only when the parent is already full.
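A Python sketch of COMPRESS along the lines of the pseudocode above, again on a list of [value, g, delta] entries. It uses Proposition 4: the descendants of t_i form a contiguous block of tuples immediately to its left, so g_i* can be accumulated by scanning left while the band stays smaller. The bands list (one band value per tuple, e.g. from the band() sketch) and the exact bookkeeping are illustrative assumptions, not the paper's code.

```python
import math

def compress(summary, bands, n, eps):
    """Merge tuples (with their tree descendants) into their right neighbours when possible."""
    p = math.floor(2 * eps * n)
    i = len(summary) - 2
    while i >= 1:                                  # keep the minimum tuple at index 0
        # accumulate g_i* over t_i and its descendants (contiguous block to the left)
        start, g_star = i, summary[i][1]
        while start - 1 >= 1 and bands[start - 1] < bands[i]:
            start -= 1
            g_star += summary[start][1]
        if (bands[i] <= bands[i + 1]
                and g_star + summary[i + 1][1] + summary[i + 1][2] < p):
            summary[i + 1][1] += g_star            # right neighbour absorbs the counts
            del summary[start:i + 1]
            del bands[start:i + 1]
            i = start - 1
        else:
            i -= 1
```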

  27. Operations Cont… • Initial state: S ← ∅; s = 0; n = 0. • Algorithm: to add the (n+1)-st observation v to the summary S(n):
if n ≡ 0 (mod 1/(2ε)) then
  COMPRESS();
end if
INSERT(v);
n = n + 1;
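Putting the pieces together, a Python sketch of the overall loop on this slide, using the insert(), compress() and band() sketches from the previous slides; build_summary is an illustrative name, not from the paper.

```python
import math

def build_summary(stream, eps):
    """One pass over the data: COMPRESS every 1/(2*eps) observations, then INSERT."""
    summary, n = [], 0
    period = int(1 / (2 * eps))
    for v in stream:
        if n > 0 and n % period == 0:
            p = math.floor(2 * eps * n)
            bands = [band(t[2], p) for t in summary]
            compress(summary, bands, n, eps)
        insert(summary, v, n, eps)
        n += 1
    return summary, n
```

Together with quantile_query() from the earlier sketch, this gives an end-to-end ε-approximate quantile summary of the stream.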

  28. Analysis • The INSERT and COMPRESS operations always ensure that g_i + Δ_i ≤ 2εn. • We will now see that the total number of tuples in the summary S(n) is bounded by (11/(2ε))·log(2εn). • Def: coverage – we say that a tuple t_i in S(n) covers an observation v at time n if either the tuple for v had been directly merged into t_i, or a tuple t that covered v has been merged into t_i. • A tuple always covers itself. • It is easy to see that the number of observations covered by t_i is exactly g_i = g_i(n).

  29. Analysis Cont… • Lemma 1: At no point in time does a tuple from band α cover an observation from a band > α. • Lemma 2: At any point in time n, and for any integer α, the total number of observations covered cumulatively by all tuples with band values in [0..α] is bounded by 2^α/ε. • Lemma 3: At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value α. That is, there are at most 3/(2ε) parents of nodes from band_α(n).

  30. Analysis Cont… • PROOF of Lemma 3 • Let m_min, m_max denote the earliest and the latest times at which a node from band_α could be seen: • m_min = (2εn − 2^α − (2εn mod 2^α))/(2ε) • m_max = (2εn − 2^(α−1) − (2εn mod 2^(α−1)))/(2ε) • Choose a child-parent pair (V_i, V_j) with V_j from band_α. • Since V_j exists, we can show that at time m_j (when V_j showed up) we had g_i(m_j) + Δ_i < 2εm_j.

  31. Analysis Cont… • All the pairs (v'_i, v'_j) involve distinct observations. • The number of observations that came after m_min is n − m_min. • Since m_j is at most m_max, we get that the number of such pairs is at most (n − m_min)/(2ε(n − m_max)) ≤ 3/(2ε).

  32. Analysis Cont… • Def: Given a full pair of tuples (t_{i−1}, t_i), we say that t_{i−1} is the left partner and t_i is the right partner in this full pair. • Lemma 4: At any time n and for any given α, there are at most 4/ε tuples from band_α(n) that are right partners in a full tuple pair. • PROOF • Let t_i, t_{i+1}, …, t_{i+p−1} be the longest contiguous segment of tuples from band_α(n) in S(n). • Since they survived the COMPRESS operation, it must be the case that g*_{j−1} + g_j + Δ_j ≥ 2εn for all i ≤ j < i + p.

  33. Analysis Cont… • Summing over all j: • according to Lemma 2 the first term is bounded by 2^(α+1)/ε, • the second term is bounded by p(2εn − 2^(α−1)). • Combining the two bounds we get p < 4/ε.

  34. Analysis Cont… • For non-contiguous segments, simply apply the above summation over all such segments. • Lemma 5: At any time n and for any given α, the maximum number of tuples possible from band_α(n) is 11/(2ε). • Proof: each node of band_α(n) is either 1. a right partner in a full pair, 2. a left partner in a full pair, or 3. not a participant in any full pair. • The first case is bounded by 4/ε (Lemma 4). • The last two cases are together bounded by 3/(2ε). • The claim follows.

  35. Analysis Cont… • Theorem 1: At any time n, the number of tuples stored in S(n) is at most (11/(2ε))·log(2εn). • PROOF • There are at most 1 + log(2εn) bands at time n. • Summing Lemma 5's bound over their sizes gives at most (11/(2ε))·log(2εn).

  36. Experiments results • The experiments were done on 3 different classes of input data: 1. Hard case – the data sequence is constructed in an adversarial manner; that is, the next observation is always placed in the largest current "gap" of the quantile summary. 2. Sorted input data – the data arrives in sorted order. 3. Random input data – each datum is selected (without replacement) from a uniform distribution over the remaining elements of the data set.

  37. Experiments results cont… • Sorted and random input data are used following the MRL experimental setup. • Random input data gives insight into the behavior of the algorithm on "average" inputs. • In general, the algorithm used less space than indicated by the analysis, and it turned out to require less space than MRL.

  38. Experiments results cont… • For each case there are 2 different kinds of experiments: 1. Adaptive – the regular algorithm (with a slight variation). 2. Pre-allocated – uses the same space as MRL. • We will see that in the latter case the observed error is significantly better than that of MRL. • Differences in the algorithm used for the experiments: 1. An observation is inserted as a tuple (v, 1, g_i + Δ_i − 1) and not (v, 1, 2εn); the latter is used strictly to simplify the theoretical analysis. 2. Rather than running COMPRESS after every 1/(2ε) observations, for each observation inserted one tuple was deleted when possible; if no tuple could be deleted without making its successor full, the size of S grew by 1.

  39. Experiments results cont… • The following measurements are applied: 1. The maximum space used to produce the summary – counting the number of stored tuples (multiplied by 3 for comparison with MRL, to account for the Rmin and Rmax values stored with each tuple). 2. The observed precision of the results.

  40. Experiments results cont… • HARD INPUT • The required number of stored quantiles is approximately a factor of 11 below the worst-case bound of the analysis. • The algorithm almost always requires less space than MRL. • The only exception is at ε = 0.001 and N = 10^5, where MRL requires less space.

  41. Experiments results cont… • SORTED INPUT • Fix ε = 0.001 and construct summaries of sorted sequences of sizes 10^5, 10^6 and 10^7. • Sample 15 quantiles at (q_i/16)·N for q_i = 1..15, and compute the maximum error over all possible quantile queries. • Compare 3 algorithms: 1. MRL – pre-allocates the storage required by MRL as a function of N and ε. 2. Pre-allocated – uses 1/3 as many stored quantiles as MRL. 3. Adaptive – storage is allocated for a new quantile only if no quantile could be deleted without exceeding a precision of 0.001n.

  42. Experiments results cont… • |S| – the number of stored quantiles needed to achieve the desired precision. • Max ε – the maximum error over all possible quantile queries on the summary. • The remaining rows list the approximation error of the response to the query for the (q_i/16)-th quantile.

  43. Experiments results cont… • RANDOM INPUT • Same measurements as for the sorted input (same ε and sequence lengths). • Each experiment is run 50 times, reporting the max, min, mean and standard deviation for every measurement.

  44. Experiments results cont…

  45. Experiments results cont…

  46. Conclusions • The algorithm improves upon the earlier results in two significant ways: 1. It improves the space complexity by a factor of Ω(log(εN)). 2. It doesn't require a priori knowledge of the parameter N; that is, it allocates more space dynamically as the data sequence grows in size.
