How to Summarize the Universe: Dynamic Maintenance of Quantiles

By: Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss.


Presentation Transcript


  1. How to Summarize the Universe: Dynamic Maintenance of Quantiles. By: Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss

  2. Quantiles • Median, quartiles, … • The general case: • Uses • Statistics • Estimating result set size • Partitioning • …

  3. Computing static quantiles • Blum, Floyd, Pratt, Rivest & Tarjan • Find the i’th element • Comparison based • Similar to QuickSort • O(n) – worst case time

  4. Problems with massive data sets • O(n) time – not good enough… • O(n) space – usually not affordable • Dynamic environment • Cancellations are especially troublesome • Usually recomputed periodically • May be very inaccurate until recomputed • Some kind of approximation is the only choice!

  5. Common approaches • Deterministically chosen sample • Randomization – probability of failure • Maintaining a backing sample • Wavelets • Most of the above approaches work well for the incremental case, but deletions may cause inaccuracy.

  6. GK – Greenwald-Khanna (‘01) • Fill the available memory with values. • Maintain rank ranges on the values in memory. • When a new value is inserted, kick a value out of memory. • Insert-only algorithm • Can be extended to support deletes (“GK2”). • Maintain two instances – one for insertions and one for deletions.

  7. Maintenance of Equi-Depth Histograms (using a backing sample) • Gibbons, Matias, Poosala –’97 • Scan the dataset and choose values for the sample using the “reservoir” method. • Treat insertions as a “continuous” scan. • When a deletion from the sample is necessary – rescan only if the number of items drops below a specified minimum. • Works well for a mostly-insertions environment.

  8. The authors’ main result • The RSS algorithm • RSS – Random Subset Sum • Space – polylogarithmic in universe size • Proportional time • A priori guarantee of accuracy within a user-specified error ε, with a user-specified probability of failure δ.

  9. Some formalism… • The universe: U = {0, …, |U|-1} • Number of tuples in data set: ||A|| = N • Data set can be thought of as an array: A[i] – number of tuples with value i • Our goal for computing φ-quantiles – find a j_k such that:

  10. Some assumptions • The universe’s size is known • Later we’ll throw that assumption away • Update = Delete + Insert

  11. Computing quantiles • Let’s say A[i] is known for every i. • Easy to maintain through updates • Summing up array items ? • Not a very good complexity…

  12. Computing quantiles (cont.) • We need a method of reducing summation overhead. • We should be able to compute any sum of items in A in logarithmic time. • The solution: Keeping computed sums of intervals.

  13. Dyadic intervals - defined • Atomic dyadic interval – a single point. • $I_{j,k} = [\,k \cdot 2^{\log|U| - j},\ (k+1) \cdot 2^{\log|U| - j} - 1\,]$ • j – resolution level • Example (|U| = 8, writing I(j,k) for $I_{j,k}$): level 3 (the points 0, 1, …, 7): I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7); level 2: I(2,0) I(2,1) I(2,2) I(2,3); level 1: I(1,0) I(1,1); level 0: I(0,0)
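
The definition on slide 13 can be checked with a few lines of Python. This is a minimal sketch: the function name `dyadic_interval` and the parameter `log_u` (standing for log|U|) are illustrative and do not come from the slides, and |U| is assumed to be a power of two.

```python
def dyadic_interval(j: int, k: int, log_u: int) -> tuple[int, int]:
    """Return the inclusive bounds of I(j, k) in a universe of size 2**log_u."""
    if not 0 <= j <= log_u:
        raise ValueError("resolution level out of range")
    if not 0 <= k < (1 << j):
        raise ValueError("interval index out of range")
    width = 1 << (log_u - j)  # 2^(log|U| - j) points per interval at level j
    return k * width, (k + 1) * width - 1


# Reproduce the |U| = 8 example: one row of intervals per resolution level.
for j in range(4):
    print(j, [dyadic_interval(j, k, 3) for k in range(1 << j)])
```

Level 0 yields the whole universe and level log|U| yields the single points, matching the example.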

  14. Computing an arbitrary interval • Let’s say we have sums for all dyadic intervals as in the above example. • We want to compute A[0,6]. • A[0,6] = I(1,0) + I(2,2) + I(3,6)
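
The decomposition used on slide 14 can be sketched as a greedy walk from the left end of the range. The function name `dyadic_cover` and the `(j, k)` tuple format are assumptions made for illustration; the slides do not prescribe a particular procedure.

```python
def dyadic_cover(lo: int, hi: int, log_u: int) -> list[tuple[int, int]]:
    """Split the inclusive range [lo, hi] into disjoint dyadic intervals (j, k)."""
    if not 0 <= lo <= hi < (1 << log_u):
        raise ValueError("range outside the universe")
    cover = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo; position 0 is aligned to everything.
        size = lo & -lo if lo else 1 << log_u
        # Shrink the block until it fits inside what is left of [lo, hi].
        while size > hi - lo + 1:
            size >>= 1
        j = log_u - (size.bit_length() - 1)  # resolution level of a block of this size
        cover.append((j, lo // size))
        lo += size
    return cover


print(dyadic_cover(0, 6, 3))
```

For the slide's example the result is `[(1, 0), (2, 2), (3, 6)]`, i.e. I(1,0), I(2,2) and I(3,6).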

  15. Dyadic intervals - observations • Log(|U|) + 1 resolution levels • 2|U| - 1 dyadic intervals altogether • O(|U|) space needed to keep them all • O(log(|U|)) time needed to compute any arbitrary interval.

  16. Computing quantiles (Cont.) • We can now efficiently compute any arbitrary interval in A. • The k-th φ-quantile, for any k, can be computed thus: • We need a j_k s.t.: A[0, j_k) < kφN ≤ A[0, j_k + 1) • Use binary search to find it!
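
The exact (non-sketched) version of slides 11 to 16 can be put together as follows. This is a sketch under assumptions: the class name `DyadicSums` and its methods are hypothetical, every dyadic sum is stored explicitly (the O(|U|) space that the next slides set out to avoid), and the right-hand comparison is taken as non-strict so that a j_k always exists.

```python
class DyadicSums:
    """Exact counts for every dyadic interval of a universe of size 2**log_u."""

    def __init__(self, log_u: int) -> None:
        self.log_u = log_u
        # sums[j][k] holds the number of tuples whose value lies in I(j, k).
        self.sums = [[0] * (1 << j) for j in range(log_u + 1)]

    def update(self, value: int, delta: int) -> None:
        """Insert (delta > 0) or delete (delta < 0): one counter per level changes."""
        for j in range(self.log_u + 1):
            self.sums[j][value >> (self.log_u - j)] += delta

    def prefix(self, end: int) -> int:
        """Return A[0, end), the number of tuples with value < end."""
        total, start = 0, 0
        for j in range(self.log_u + 1):
            width = 1 << (self.log_u - j)
            if end - start >= width:  # a whole level-j interval fits
                total += self.sums[j][start // width]
                start += width
        return total

    def quantile(self, phi: float, k: int) -> int:
        """Binary-search the j_k with A[0, j_k) < k*phi*N <= A[0, j_k + 1)."""
        target = k * phi * self.sums[0][0]  # sums[0][0] is N
        lo, hi = 0, (1 << self.log_u) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.prefix(mid + 1) >= target:
                hi = mid
            else:
                lo = mid + 1
        return lo


ds = DyadicSums(3)
for v in [1, 1, 2, 5, 5, 5, 7, 7]:
    ds.update(v, +1)
print(ds.prefix(7), ds.quantile(0.5, 1))  # A[0,6] and the median
```

Each prefix sum touches at most one interval per level, so a quantile query costs O(log²|U|) counter reads in this exact setting.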

  17. But… • Keeping O(|U|) of data presents a real space complexity problem. • We need a way of estimating A[i] on demand. • … And also of estimating any dyadic interval on demand.

  18. Introducing random sets • Let S be a random set of values from U. • Each value has a probability of ½ of being in S. • Expectation of the number of items in S is ½|U|.

  19. Random subset sums • Define ||A_S|| as the number of items in A with values in S. • Expectation of ||A_S|| is ½||A|| = ½N. • Now consider only subsets S containing a certain value i.

  20. Random subset sums (cont.) • Suppose we keep a number of random sets S, each containing random values from U – each with probability ½. • We maintain ||A_S|| for each such set. • Easy to maintain during updates. • How can we now estimate A[i]?

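  21. Random subset sums (cont.) • We can estimate A[i] for any i with: A[i] ≈ 2||A_S|| - ||A||, using sets S that contain i • Proof: given i ∈ S, every other value is in S with probability ½, so the expectation of ||A_S|| is A[i] + ½(N - A[i]) = ½(N + A[i]), and the expectation of 2||A_S|| - ||A|| is A[i]. • The authors prove that repeating the process O(1/ε²) times yields the required accuracy.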
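
The expectation behind the estimator on slide 21 can be verified exhaustively on a toy universe. The sketch below averages 2||A_S|| - ||A|| over every subset S that contains i, which is the same as letting each other value join S independently with probability ½. The function name and the sample counts are made up for illustration.

```python
from itertools import product


def exact_expectation(counts: list[int], i: int) -> float:
    """Average 2*||A_S|| - ||A|| over every subset S of the universe that contains i."""
    n = sum(counts)
    others = [v for v in range(len(counts)) if v != i]
    estimates = []
    for choice in product((False, True), repeat=len(others)):
        in_s = {i} | {v for v, chosen in zip(others, choice) if chosen}
        a_s = sum(counts[v] for v in in_s)  # ||A_S|| for this particular S
        estimates.append(2 * a_s - n)
    return sum(estimates) / len(estimates)


counts = [3, 0, 5, 1, 0, 2, 7, 4]  # A[i] for a universe of size 8
print([exact_expectation(counts, i) for i in range(len(counts))])
```

The printed averages equal the true counts, which is the unbiasedness claim; the variance of a single estimate is what the O(1/ε²) repetitions control.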

  22. Random subset sums (cont.) • We can also estimate any dyadic interval I(j,k) using the same method. • Improvement: We can compute the sums for dyadic intervals from a certain level. • We can now estimate any arbitrary interval in the universe…

  23. Space Considerations • Keeping a set of expected size ½|U| is still O(|U|). • We need a method of “keeping” a set without actually keeping it… • The technique: instead of sets, keep random seeds of size O(log|U|) bits and compute whether a given i ∈ S on demand.

  24. Extended Hamming Code • Used for generating the random sets. • Provides sufficient “randomness” • For example: • |U| = 8 • Seed size: log|U|+1 = 4 bits • G(seed, i) = seed × i-th column of the code’s generator matrix
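
A seed-based membership test in the spirit of slide 24 might look like the sketch below. It assumes that column i of the generator matrix is a leading 1 followed by the binary digits of i, and that the product is taken over GF(2); the slide does not show the matrix, so treat that layout, and the name `in_set`, as assumptions.

```python
def in_set(seed: int, i: int, log_u: int) -> bool:
    """Return True if value i belongs to the pseudo-random set named by seed."""
    # Assumed column i of the generator matrix: a leading 1, then the bits of i.
    column = (1 << log_u) | i
    # G(seed, i) = seed . column over GF(2), i.e. the parity of the bitwise AND.
    return bin(seed & column).count("1") % 2 == 1


LOG_U = 3  # |U| = 8, so seeds have log|U| + 1 = 4 bits
for seed in range(1 << (LOG_U + 1)):
    members = [i for i in range(1 << LOG_U) if in_set(seed, i, LOG_U)]
    print(f"{seed:04b}", members)
```

Under this layout, any two distinct values fall into the same set for exactly a quarter of the seeds, so memberships are pairwise independent when the seed is drawn uniformly, while only log|U| + 1 bits are stored per set.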

  25. RSS Algorithm Summary • To compute a dyadic interval: • Compute 2||A_S|| - ||A|| for sets containing the given dyadic interval. • To compute an arbitrary interval: • Write it as a disjoint union of dyadic intervals, estimate them and take a median over possible results (simplified). • To compute the quantiles: • Use binary search and compute the intervals until found.
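
A toy version of the point-estimate part of this summary is sketched below. It is deliberately simplified and is not the authors' implementation: the random sets are stored as explicit membership tables rather than seeds, only single values (atomic dyadic intervals) are estimated, and the class name, the `groups` and `sets_per_group` parameters and the median-of-means combination are assumptions chosen for illustration.

```python
import random
from statistics import mean, median


class SubsetSumSketch:
    """Keeps ||A_S|| for many random sets S and estimates A[i] from them."""

    def __init__(self, universe_size: int, groups: int, sets_per_group: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.n = 0  # ||A||, the current number of tuples
        # members[g][s][v] is True when value v belongs to set s of group g.
        self.members = [[[rng.random() < 0.5 for _ in range(universe_size)]
                         for _ in range(sets_per_group)] for _ in range(groups)]
        self.sums = [[0] * sets_per_group for _ in range(groups)]

    def update(self, value: int, delta: int) -> None:
        """Insert (delta > 0) or delete (delta < 0) tuples with the given value."""
        self.n += delta
        for g, group in enumerate(self.members):
            for s, table in enumerate(group):
                if table[value]:
                    self.sums[g][s] += delta

    def estimate(self, i: int) -> float:
        """Median over groups of the mean of 2*||A_S|| - ||A|| over sets containing i."""
        group_means = []
        for g, group in enumerate(self.members):
            ests = [2 * self.sums[g][s] - self.n
                    for s, table in enumerate(group) if table[i]]
            if ests:
                group_means.append(mean(ests))
        return median(group_means)


sketch = SubsetSumSketch(universe_size=64, groups=5, sets_per_group=200)
for value in range(64):
    sketch.update(value, 10)   # ten tuples of every value
sketch.update(5, 990)          # value 5 now has 1000 tuples
print(sketch.estimate(5))      # should land near 1000
sketch.update(5, -1000)        # delete every tuple with value 5
print(sketch.estimate(5))      # should land near 0
```

Because each counter is a plain sum, a deletion exactly cancels the matching insertion, which is why the order of updates does not affect the result.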

  26. Algorithm Complexity Claim • The RSS algorithm’s space complexity (for t quantile queries): • Time complexity for inserts, deletes and computing each quantile on demand is proportional to the space used.

  27. Proof Outline • Declare random variable • X_k = 2||A_{I_k}|| if I_k is in S and 0 otherwise • X – Sum of all X_k’s in a certain set • Y – Sum of all X’s in a given interval • Z – A number of repetitions of X.

  28. Proof Outline (Cont.) • In a similar fashion to previous slides, show that Y and ||A|| can be used to compute ||A_I||. • Compute the variance. • Use Chebyshev’s and then Chernoff’s inequalities, together with the computed variance, to achieve the required result.

  29. What If U Is Unknown? • In practice, the universe U is not always known. • Predict a range [0, u-1] for U. • Given an inserted (or updated) value i s.t. (i > u-1), add another instance of RSS with range [u, u²-1], and so on… • Estimating dyadic intervals can be done in a single instance of RSS. • Increased cost factor: log2log(|U|).
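
The range-growing rule on slide 29 can be sketched as follows. The slide only spells out the first two ranges, so the continuation (each new upper bound is the square of the previous one) and the function name `instance_ranges` are assumptions.

```python
def instance_ranges(u: int, value: int) -> list[tuple[int, int]]:
    """List the RSS instance ranges needed to cover value, starting from [0, u - 1]."""
    if u < 2 or value < 0:
        raise ValueError("need u >= 2 and a non-negative value")
    ranges = [(0, u - 1)]
    bound = u
    while value > ranges[-1][1]:
        # Assumed pattern: [u, u^2 - 1], then [u^2, u^4 - 1], and so on.
        ranges.append((bound, bound * bound - 1))
        bound *= bound
    return ranges


print(instance_ranges(4, 100))
```

Since the bound squares at every step, the number of instances grows only doubly-logarithmically in the largest value seen.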

  30. Some RSS Properties • RSS may return as a quantile a value which is not really in the dataset. • Order of insertions and deletions does not affect result and accuracy. • Can be parallelized quite easily (as long as random subsets are pre-agreed).

  31. Experimental Results • Experiments • Static artificial dataset • Dynamic artificial dataset • Dynamic real dataset • Participants • Naïve[l] • RSS[l] • GK • GK2 – an improvement for GK

  32. Static Artificial Dataset • |U| = 2^20 • Compute 15 quantiles at position (1/16)k for k = 1,2,…,15. • 3 different distributions • Uniform • Zipf • Normal[m,v] • Algorithm used: RSS[7] (11K footprint).

  33. Errors for Zipf data

  34. Errors for Normal[U/2, U/50] Distribution

  35. Dynamic Artificial Dataset • Insert N=104,858 items from uniform dist. D1=Uni[1,U], U=2^20. • Insert αN more items from uniform dist. D2=Uni[U/2-U/32, U/2+U/32]. • Delete all values from the first insertion. • Parameter α controls the mass of the second insertion with respect to the first.
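
The three phases of this workload can be generated with a short script. This is only a sketch of the setup as described on the slide: the function name, the `("insert", value)` / `("delete", value)` tuple format, the rounding of αN and the seeding are assumptions, not details taken from the experiments.

```python
import random


def dynamic_workload(n: int, alpha: float, log_u: int = 20, seed: int = 0) -> list[tuple[str, int]]:
    """Build the insert / insert / delete sequence described on the slide."""
    rng = random.Random(seed)
    u = 1 << log_u
    first = [rng.randint(1, u) for _ in range(n)]              # D1 = Uni[1, U]
    lo, hi = u // 2 - u // 32, u // 2 + u // 32
    second = [rng.randint(lo, hi) for _ in range(int(alpha * n))]  # D2, narrow band
    ops = [("insert", v) for v in first]
    ops += [("insert", v) for v in second]
    ops += [("delete", v) for v in first]  # remove everything from phase one
    return ops


ops = dynamic_workload(n=104_858, alpha=0.5)
print(len(ops), ops[0], ops[-1])
```

After the final phase only the D2 items remain, so every surviving value lies in the narrow band around U/2, which is what makes the test hard for summaries built mostly from the first insertion.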

  36. Dynamic Artificial Dataset Results

  37. Dynamic Real Dataset • Based on true Call Detail Records (CDRs) from AT&T. • Dataset used includes 4.42 million CDRs covering a period of 18 hours. • Objective: find the median length of current calls. • Probe for estimates every 10,000 records. • Algorithm used: RSS[6] (4K footprint).

  38. Number of Active Phone Calls Over Time

  39. Error in Computation of Median Over Time

  40. Average Error for Last 50 Snapshots, For Deciles

  41. Conclusions – RSS • Algorithm for maintaining dynamic quantiles. • Works well (within a user-defined precision) both for insertions AND deletions. • Polylogarithmic (in universe size) in space and time complexities.

  42. Thanks for listening !
