
CS 361 Lecture 5



Presentation Transcript


  1. CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku (manku@cs.stanford.edu)

  2. Frequency-Related Problems ... Given domain values 1..20 with some frequency distribution:
  • Top-k most frequent elements
  • Find all elements with frequency > 0.1%
  • Find elements that occupy 0.1% of the tail
  • What is the frequency of element 3?
  • What is the total frequency of elements between 8 and 14?
  • How many elements have non-zero frequency?
  • Mean + variance? Median?
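All of these queries are trivial when the whole frequency table fits in memory; the streaming algorithms in this lecture matter when it does not. As a baseline, a minimal Python sketch over a small hypothetical stream (the data is illustrative, not from the slides):

```python
from collections import Counter

# Hypothetical stream of domain values in 1..20.
stream = [3, 8, 3, 14, 3, 9, 8, 20, 3, 14, 8, 3]
counts = Counter(stream)

# Top-k most frequent elements (here k = 2).
top2 = counts.most_common(2)

# All elements with frequency above 10% of the stream length.
threshold = 0.10 * len(stream)
heavy = [x for x, c in counts.items() if c > threshold]

# Total frequency of elements between 8 and 14.
range_count = sum(c for x, c in counts.items() if 8 <= x <= 14)

# Number of elements with non-zero frequency.
distinct = len(counts)
```

With the stream above, `top2` is `[(3, 5), (8, 3)]`, `heavy` contains 3, 8 and 14, `range_count` is 6, and `distinct` is 5.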

  3. Types of Histograms ... (Figures: count per bucket over domain values 1..20.)
  • Equi-Depth Histograms. Idea: select buckets such that counts per bucket are equal.
  • V-Optimal Histograms. Idea: select buckets to minimize frequency variance within buckets.
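A minimal sketch of the equi-depth idea (not from the slides; function name and data are illustrative): sort the data and cut it so that each bucket receives an equal share of the points.

```python
def equi_depth_boundaries(data, num_buckets):
    """Bucket boundary values for an equi-depth histogram: one boundary
    after every n/num_buckets-th element of the sorted data."""
    s = sorted(data)
    n = len(s)
    return [s[(i * n) // num_buckets] for i in range(1, num_buckets)]

data = [1, 2, 2, 3, 5, 8, 8, 9, 12, 15, 15, 20]   # hypothetical values
boundaries = equi_depth_boundaries(data, 4)        # 3 boundaries -> 4 buckets
```

Here `boundaries` comes out as `[3, 8, 15]`: each of the four buckets covers three of the twelve points.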

  4. Histograms: Applications • One Dimensional Data • Database Query Optimization [Selinger78] • Selectivity estimation • Parallel Sorting [DNS91] [NowSort97] • Jim Gray’s sorting benchmark • [PIH96] [Poo97] introduced a taxonomy, algorithms, etc. • Multidimensional Data • OLTP: not much use (independent attribute assumption) • OLAP & Mining: yeah

  5. Finding the Median ...
  • Exact median in main memory: O(n) time [BFPRT 73]
  • Exact median in one pass: n/2 memory [Pohl 68]
  • Exact median in p passes: O(n^(1/p)) memory [MP 80]; 2 passes need O(sqrt(n))
  How about an approximate median?
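For contrast with the streaming algorithms that follow, an in-memory selection sketch: randomized quickselect, expected O(n) time, a simple stand-in for the deterministic [BFPRT 73] procedure (this code is illustrative, not the slides' algorithm).

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (0-indexed).
    Expected O(n) time with a random pivot."""
    pivot = random.choice(xs)
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    if k < len(less):
        return quickselect(less, k)
    if k < len(less) + len(equal):
        return pivot
    return quickselect([x for x in xs if x > pivot],
                       k - len(less) - len(equal))

xs = [9, 1, 8, 2, 7, 3, 6, 4, 5]
median = quickselect(xs, len(xs) // 2)   # median of 1..9 is 5
```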

  6. Approximate Medians & Quantiles
  • ε-approximate median: typical ε = 0.01 (1%).
  • Multiple equi-spaced ε-approximate quantiles = equi-depth histogram.
  • φ-quantile: the element with rank φN, where 0 < φ < 1 (φ = 0.5 means the median).
  • ε-approximate φ-quantile: any element with rank (φ ± ε)N.
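The definition can be checked directly. A small Python sketch (names are illustrative): an element qualifies if some rank it can occupy in sorted order falls inside [(φ-ε)N, (φ+ε)N].

```python
def is_approx_quantile(x, data, phi, eps):
    """Is x an eps-approximate phi-quantile of data?"""
    n = len(data)
    lo = sum(1 for y in data if y < x)    # smallest rank x can occupy, minus 1
    hi = sum(1 for y in data if y <= x)   # largest rank x can occupy
    # The rank interval [lo + 1, hi] must intersect [(phi - eps)n, (phi + eps)n].
    return lo + 1 <= (phi + eps) * n and hi >= (phi - eps) * n

data = list(range(1, 101))                         # ranks 1..100
ok = is_approx_quantile(50, data, 0.5, 0.01)       # rank 50 lies in [49, 51]
bad = is_approx_quantile(60, data, 0.5, 0.01)      # rank 60 lies outside
```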

  7. Plan for Today ...
  • Greenwald-Khanna Algorithm for arbitrary-length streams
  • Sampling-based algorithms for arbitrary-length streams
  • Munro-Paterson Algorithm for fixed N, and its generalization
  • Randomized algorithm for fixed N
  • Randomized algorithm for arbitrary-length streams

  8. Data Distribution Assumptions ... None: the input sequence of ranks is arbitrary (e.g., warehouse data).

  9. Munro-Paterson Algorithm [MP 80]
  Input: N and ε. Use b buffers, each of size k; memory = bk. The buffers form a binary tree (the figure shows b = 4, with buffer depths 4, 3, 3, 2, 2, 2, 2, 1, ..., 1).
  How do we collapse two sorted buffers into one? Merge them, then pick alternate elements.
  Minimize bk subject to the following constraints:
  • Number of elements in leaves: k 2^b > N
  • Max relative error in rank: b/2k < ε
  This gives b ≈ log(εN) and k ≈ (1/ε) log(εN), so memory = bk ≈ (1/ε) log^2(εN).
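The collapse step just described ("merge, pick alternate elements") can be sketched in a few lines of Python:

```python
from heapq import merge

def collapse(buf_a, buf_b):
    """Collapse two sorted buffers of equal size into one:
    merge them, then keep alternate elements of the merged order."""
    merged = list(merge(buf_a, buf_b))
    return merged[::2]   # positions 0, 2, 4, ...

a = [1, 4, 9, 16]
b = [2, 3, 10, 20]
out = collapse(a, b)     # merged [1,2,3,4,9,10,16,20] -> [1, 3, 9, 16]
```

In practice the starting offset (even vs. odd positions) is alternated between collapses so the rank error stays balanced rather than one-sided; the sketch always keeps even positions.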

  10. Error Propagation ... Top-down analysis. (Figure: buffers with 'S' elements known smaller than the target, 'L' known larger, and '?' uncertain, traced from depth d to depth d+1.) If a buffer at depth d contains x '?' elements, then its two children at depth d+1 contain at most 2x+1 '?' elements in total.

  11. Error Propagation at Depth 0 ... (Figure: the depth-0 buffer, containing the median M among 'S' and 'L' elements, traced down to its two children at depth 1.)

  12. Error Propagation at Depth 1 ... (Figure: '?' elements at depth 1 traced down to the buffers at depth 2.)

  13. Error Propagation at Depth 2 ... (Figure: '?' elements at depth 2 traced down to the buffers at depth 3.)

  14. Error Propagation ... (Figure repeats slide 10.) If a buffer at depth d contains x '?' elements, then its two children at depth d+1 contain at most 2x+1 '?' elements in total.

  15. Error Propagation, Level by Level
  Munro-Paterson [1980]: b buffers, each of size k; memory = bk (the figure shows b = 4, with buffer depths 0; 1, 1; 2, 2, 2, 2; 3, ..., 3).
  • Number of elements at depth d = k 2^d.
  • Let the number of '?' elements at depth d be X. Then the fraction of '?' elements at depth d is f = X / (k 2^d).
  • The number of '?' elements at depth d+1 is at most 2X + 2^d, so the fraction at depth d+1 is f' <= (2X + 2^d) / (k 2^(d+1)) = f + 1/2k.
  • So the increase in fractional error in rank is 1/2k per level.
  The fractional error in rank at depth 0 is 0 and the max depth is b, so the total fractional error is <= b/2k. Constraint 2: b/2k < ε.

  16. Generalized Munro-Paterson [MRL 98]
  Each buffer now has a 'weight' associated with it (figure: b = 5 buffers).
  How do we collapse buffers with different weights?

  17. Generalized Collapse ...
  Example with k = 5: three input buffers with weights 2, 3, and 1 are collapsed into one output buffer with weight 6 containing 6, 10, 15, 27, 35.
  Conceptually, each element is replicated as many times as its buffer's weight, the replicated sequences are merged, and every 6th position of the merged sequence is picked for the output.
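A sketch of the weighted collapse just described, assuming sorted input buffers given as (buffer, weight) pairs. The "replicate by weight, take every W-th position" rule is from the slide; the choice of starting the picks near position W/2 is an illustrative offset, and the slides' exact offset rule may differ.

```python
def generalized_collapse(buffers, k):
    """Collapse sorted buffers with weights into one buffer of size k.
    Each element conceptually occupies weight-many consecutive positions
    of the merged sequence; every W-th position is picked (W = total weight)."""
    W = sum(w for _, w in buffers)
    merged = sorted((x, w) for buf, w in buffers for x in buf)
    out, pos, next_pick = [], 0, W // 2        # first pick near position W/2
    for x, w in merged:
        pos += w                               # x occupies positions (pos-w, pos]
        while len(out) < k and next_pick <= pos:
            out.append(x)
            next_pick += W
    return out, W

# Hypothetical inputs: weight-2 and weight-3 buffers, k = 4.
out, W = generalized_collapse([([6, 12, 31, 37], 2), ([5, 8, 10, 35], 3)], 4)
```

Here `W` is 5 and `out` is `[5, 8, 12, 35]`: the 20 replicated positions yield one output element per 5 positions.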

  18. Analysis of Generalized Munro-Paterson
  The memory bound for Generalized Munro-Paterson matches that of Munro-Paterson asymptotically, but with a smaller constant.

  19. Reservoir Sampling [Vitter 85]
  Maintain a uniform sample of size s over an input sequence of length N.
  Approximate median = median of the sample.
  If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, the answer is an ε-approximate median.
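Vitter's Algorithm R maintains the invariant that after i elements have arrived, each of them is in the sample with probability s/i. A minimal Python sketch:

```python
import random

def reservoir_sample(stream, s):
    """Uniform random sample of size s from a stream of unknown length
    (Vitter's Algorithm R)."""
    sample = []
    for i, x in enumerate(stream):
        if i < s:
            sample.append(x)              # fill the reservoir first
        else:
            j = random.randrange(i + 1)   # uniform in [0, i]
            if j < s:
                sample[j] = x             # replace with probability s/(i+1)
    return sample

sample = reservoir_sample(range(1000), 10)
```

The approximate median of the stream is then the median of `sample`.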

  20. "Non-Reservoir" Sampling
  Choose 1 out of every N/s successive elements. (Figure: one element picked from each block of N/s stream elements.) At the end of the stream, the sample size is s.
  Approximate median = median of the sample.
  If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, the answer is an ε-approximate median.
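A sketch of the block sampler, assuming s divides N (function name is illustrative):

```python
import random

def one_in_block_sample(stream, N, s):
    """Known-length stream: pick one element uniformly at random from
    each successive block of N // s elements."""
    block = N // s
    offset = random.randrange(block)      # which position to take in this block
    sample = []
    for i, x in enumerate(stream):
        if i % block == offset:
            sample.append(x)
        if i % block == block - 1:        # block finished: draw a fresh offset
            offset = random.randrange(block)
    return sample

sample = one_in_block_sample(range(20), 20, 4)   # 4 blocks of 5 elements
```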

  21. Non-uniform Sampling ...
  Choose s out of the first s elements (weight 1), s out of the next 2s elements (weight 2), s out of the next 4s elements (weight 4), s out of the next 8s elements (weight 8), and so on.
  At the end of the stream, the sample size is O(s log(N/s)).
  Approximate median = weighted median of the sample.
  If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, the answer is an ε-approximate median.
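A sketch of the epoch-doubling sampler and the weighted median it feeds (epoch sizes s, 2s, 4s, ... with weights 1, 2, 4, ...; function names are illustrative):

```python
import random

def nonuniform_sample(stream, s):
    """Keep s uniform samples per epoch; epoch i spans s * 2**i elements
    and its samples carry weight 2**i."""
    samples, epoch, buf = [], 0, []
    for x in stream:
        buf.append(x)
        if len(buf) == s * 2 ** epoch:                 # epoch complete
            samples += [(y, 2 ** epoch) for y in random.sample(buf, s)]
            epoch, buf = epoch + 1, []
    if buf:                                            # partial final epoch
        samples += [(y, 2 ** epoch)
                    for y in random.sample(buf, min(s, len(buf)))]
    return samples

def weighted_median(pairs):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(pairs)
    half = sum(w for _, w in pairs) / 2.0
    acc = 0
    for x, w in pairs:
        acc += w
        if acc >= half:
            return x

pairs = nonuniform_sample(range(28), 4)   # epochs of 4, 8, 16 elements
approx_median = weighted_median(pairs)
```

With s = 4 and 28 stream elements, the three epochs contribute 12 weighted samples with weights 1, 2, and 4.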

  22. Sampling + Generalized Munro-Paterson [MRL 98]
  Given ε and δ: for a stream of known length N (advance knowledge of N), use "1-in-N/s" sampling to choose s samples; for a stream of unknown length, use reservoir sampling to maintain s samples.
  Then either compute the exact median of the samples (memory = s), or feed the samples to Generalized Munro-Paterson and compute an ε-approximate median of the samples in less memory.
  Either way, the output is an ε-approximate median with probability at least 1-δ.

  23. Unknown-N Algorithm [MRL 99]
  Stream of unknown length, given ε and δ: non-uniform sampling feeds a modified deterministic algorithm for approximate medians.
  Output is an ε-approximate median with probability at least 1-δ.

  24. Non-uniform Sampling ... (Figure repeats slide 21: s samples kept per epoch, with epoch sizes s, 2s, 4s, 8s, ... and weights 1, 2, 4, 8, ...) At the end of the stream, the sample size is O(s log(N/s)); the approximate median is the weighted median of the sample.

  25. Modified Deterministic Algorithm ...
  b buffers, each of size k. The non-uniform samples enter the buffer tree at increasing heights: 2s elements with weight W = 1, then s elements each with W = 2, 4, 8, ..., 2^(L-h), at heights h, h+1, h+2, h+3, ..., L, where L is the highest level and h is the height of the tree.
  Compute the approximate median of the weighted samples.

  26. Modified Munro-Paterson Algorithm
  (As on the previous slide, with H = highest level and b = height of the tree.) b buffers, each of size k; the weighted samples enter at heights b, b+1, b+2, b+3, ..., H: 2s elements with W = 1, then s elements each with W = 2, 4, 8, ..., 2^(H-b).
  Compute the approximate median of the weighted samples.

  27. Error Analysis ...
  b buffers, each of size k; total height = b + h, where b is the height of the small tree and h is the number of levels of weighted samples above it.
  The increase in fractional error in rank is 1/2k per level, so the total fractional error is <= (b + h)/2k.

  28. Error Analysis contd...
  Minimize bk subject to the following constraints:
  • Number of elements in leaves: k 2^b > s, where s = O((1/ε^2) log(1/δ))
  • Max fractional error in rank: b/2k < (1-α) ε, where α ε is the share of the error budget spent on sampling
  This gives b = O(log(εs)) and k = O((1/ε) log(εs)), so memory = bk = O((1/ε) log^2(εs)): almost the same as before.

  29. Summary of Algorithms ...
  • Reservoir Sampling [Vitter 85]: probabilistic
  • Munro-Paterson [MP 80]: deterministic; requires advance knowledge of N
  • Generalized Munro-Paterson [MRL 98]: deterministic; requires advance knowledge of N
  • Sampling + Generalized MP [MRL 98]: probabilistic; requires advance knowledge of N
  • Non-uniform Sampling + GMP [MRL 99]: probabilistic
  • Greenwald & Khanna [GK 01]: deterministic

  30. List of Papers ...
  [Hoeffding63] W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables", Journal of the American Statistical Association, pp. 13-30, 1963.
  [MP80] J. I. Munro and M. S. Paterson, "Selection and Sorting with Limited Storage", Theoretical Computer Science, 12:315-323, 1980.
  [Vit85] J. S. Vitter, "Random Sampling with a Reservoir", ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
  [MRL98] G. S. Manku, S. Rajagopalan and B. G. Lindsay, "Approximate Medians and other Quantiles in One Pass and with Limited Memory", ACM SIGMOD 1998, pp. 426-435.
  [MRL99] G. S. Manku, S. Rajagopalan and B. G. Lindsay, "Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets", ACM SIGMOD 1999, pp. 251-262.
  [GK01] M. Greenwald and S. Khanna, "Space-Efficient Online Computation of Quantile Summaries", ACM SIGMOD 2001, pp. 58-66.
