
CS 361 Lecture 5



Presentation Transcript


  1. CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku (manku@cs.stanford.edu)

  2. Frequency-Related Problems ... Given domain values 1..20 with some frequency distribution:
  • Top-k most frequent elements
  • Find all elements with frequency > 0.1%
  • Find elements that occupy 0.1% of the tail
  • What is the frequency of element 3?
  • What is the total frequency of elements between 8 and 14?
  • How many elements have non-zero frequency?
  • Mean + variance? Median?
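All of these queries are trivial when the whole frequency table fits in memory; the streaming algorithms in this lecture matter when it does not. As a baseline, a minimal Python sketch over a small hypothetical stream (the data is illustrative, not from the slides):

```python
from collections import Counter

# Hypothetical stream of domain values in 1..20.
stream = [3, 8, 3, 14, 3, 9, 8, 20, 3, 14, 8, 3]
counts = Counter(stream)

# Top-k most frequent elements (here k = 2).
top2 = counts.most_common(2)

# All elements with frequency above 10% of the stream length.
threshold = 0.10 * len(stream)
heavy = [x for x, c in counts.items() if c > threshold]

# Total frequency of elements between 8 and 14.
range_count = sum(c for x, c in counts.items() if 8 <= x <= 14)

# Number of elements with non-zero frequency.
distinct = len(counts)
```

With the stream above, `top2` is `[(3, 5), (8, 3)]`, `heavy` contains 3, 8 and 14, `range_count` is 6, and `distinct` is 5.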

  3. Types of Histograms ... (Figures: count per bucket over domain values 1..20.)
  • Equi-Depth Histograms. Idea: select buckets such that counts per bucket are equal.
  • V-Optimal Histograms. Idea: select buckets to minimize frequency variance within buckets.
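A minimal sketch of the equi-depth idea (not from the slides; function name and data are illustrative): sort the data and cut it so that each bucket receives an equal share of the points.

```python
def equi_depth_boundaries(data, num_buckets):
    """Bucket boundary values for an equi-depth histogram: one boundary
    after every n/num_buckets-th element of the sorted data."""
    s = sorted(data)
    n = len(s)
    return [s[(i * n) // num_buckets] for i in range(1, num_buckets)]

data = [1, 2, 2, 3, 5, 8, 8, 9, 12, 15, 15, 20]   # hypothetical values
boundaries = equi_depth_boundaries(data, 4)        # 3 boundaries -> 4 buckets
```

Here `boundaries` comes out as `[3, 8, 15]`: each of the four buckets covers three of the twelve points.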

  4. Histograms: Applications • One Dimensional Data • Database Query Optimization [Selinger78] • Selectivity estimation • Parallel Sorting [DNS91] [NowSort97] • Jim Gray’s sorting benchmark • [PIH96] [Poo97] introduced a taxonomy, algorithms, etc. • Multidimensional Data • OLTP: not much use (independent attribute assumption) • OLAP & Mining: yeah

  5. Finding the Median ...
  • Exact median in main memory: O(n) time [BFPRT 73]
  • Exact median in one pass: n/2 memory [Pohl 68]
  • Exact median in p passes: O(n^(1/p)) memory [MP 80]; 2 passes need O(sqrt(n))
  How about an approximate median?
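For contrast with the streaming algorithms that follow, an in-memory selection sketch: randomized quickselect, expected O(n) time, a simple stand-in for the deterministic [BFPRT 73] procedure (this code is illustrative, not the slides' algorithm).

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (0-indexed).
    Expected O(n) time with a random pivot."""
    pivot = random.choice(xs)
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    if k < len(less):
        return quickselect(less, k)
    if k < len(less) + len(equal):
        return pivot
    return quickselect([x for x in xs if x > pivot],
                       k - len(less) - len(equal))

xs = [9, 1, 8, 2, 7, 3, 6, 4, 5]
median = quickselect(xs, len(xs) // 2)   # median of 1..9 is 5
```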

  6. Approximate Medians & Quantiles
  • ε-approximate median: typical ε = 0.01 (1%).
  • Multiple equi-spaced ε-approximate quantiles = equi-depth histogram.
  • φ-quantile: the element with rank φN, where 0 < φ < 1 (φ = 0.5 means the median).
  • ε-approximate φ-quantile: any element with rank (φ ± ε)N.
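The definition can be checked directly. A small Python sketch (names are illustrative): an element qualifies if some rank it can occupy in sorted order falls inside [(φ-ε)N, (φ+ε)N].

```python
def is_approx_quantile(x, data, phi, eps):
    """Is x an eps-approximate phi-quantile of data?"""
    n = len(data)
    lo = sum(1 for y in data if y < x)    # smallest rank x can occupy, minus 1
    hi = sum(1 for y in data if y <= x)   # largest rank x can occupy
    # The rank interval [lo + 1, hi] must intersect [(phi - eps)n, (phi + eps)n].
    return lo + 1 <= (phi + eps) * n and hi >= (phi - eps) * n

data = list(range(1, 101))                         # ranks 1..100
ok = is_approx_quantile(50, data, 0.5, 0.01)       # rank 50 lies in [49, 51]
bad = is_approx_quantile(60, data, 0.5, 0.01)      # rank 60 lies outside
```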

  7. Plan for Today ...
  • Greenwald-Khanna Algorithm for arbitrary-length streams
  • Sampling-based algorithms for arbitrary-length streams
  • Munro-Paterson Algorithm for fixed N, and its generalization
  • Randomized algorithm for fixed N
  • Randomized algorithm for arbitrary-length streams

  8. Data Distribution Assumptions ... None: the input sequence of ranks is arbitrary (e.g., warehouse data).

  9. Munro-Paterson Algorithm [MP 80]
  Input: N and ε. Use b buffers, each of size k; memory = bk. The buffers form a binary tree (the figure shows b = 4, with buffer depths 4, 3, 3, 2, 2, 2, 2, 1, ..., 1).
  How do we collapse two sorted buffers into one? Merge them, then pick alternate elements.
  Minimize bk subject to the following constraints:
  • Number of elements in leaves: k 2^b > N
  • Max relative error in rank: b/2k < ε
  This gives b ≈ log(εN) and k ≈ (1/ε) log(εN), so memory = bk ≈ (1/ε) log^2(εN).
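The collapse step just described ("merge, pick alternate elements") can be sketched in a few lines of Python:

```python
from heapq import merge

def collapse(buf_a, buf_b):
    """Collapse two sorted buffers of equal size into one:
    merge them, then keep alternate elements of the merged order."""
    merged = list(merge(buf_a, buf_b))
    return merged[::2]   # positions 0, 2, 4, ...

a = [1, 4, 9, 16]
b = [2, 3, 10, 20]
out = collapse(a, b)     # merged [1,2,3,4,9,10,16,20] -> [1, 3, 9, 16]
```

In practice the starting offset (even vs. odd positions) is alternated between collapses so the rank error stays balanced rather than one-sided; the sketch always keeps even positions.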

  10. Error Propagation ... Top-down analysis. (Figure: buffers with 'S' elements known smaller than the target, 'L' known larger, and '?' uncertain, traced from depth d to depth d+1.) If a buffer at depth d contains x '?' elements, then its two children at depth d+1 contain at most 2x+1 '?' elements in total.

  11. Error Propagation at Depth 0 ... (Figure: the depth-0 buffer, containing the median M among 'S' and 'L' elements, traced down to its two children at depth 1.)

  12. Error Propagation at Depth 1 ... (Figure: '?' elements at depth 1 traced down to the buffers at depth 2.)

  13. Error Propagation at Depth 2 ... (Figure: '?' elements at depth 2 traced down to the buffers at depth 3.)

  14. Error Propagation ... (Figure repeats slide 10.) If a buffer at depth d contains x '?' elements, then its two children at depth d+1 contain at most 2x+1 '?' elements in total.

  15. Error Propagation, Level by Level
  Munro-Paterson [1980]: b buffers, each of size k; memory = bk (the figure shows b = 4, with buffer depths 0; 1, 1; 2, 2, 2, 2; 3, ..., 3).
  • Number of elements at depth d = k 2^d.
  • Let the number of '?' elements at depth d be X. Then the fraction of '?' elements at depth d is f = X / (k 2^d).
  • The number of '?' elements at depth d+1 is at most 2X + 2^d, so the fraction at depth d+1 is f' <= (2X + 2^d) / (k 2^(d+1)) = f + 1/2k.
  • So the increase in fractional error in rank is 1/2k per level.
  The fractional error in rank at depth 0 is 0 and the max depth is b, so the total fractional error is <= b/2k. Constraint 2: b/2k < ε.

  16. Generalized Munro-Paterson [MRL 98]
  Each buffer now has a 'weight' associated with it (figure: b = 5 buffers).
  How do we collapse buffers with different weights?

  17. Generalized Collapse ...
  Example with k = 5: three input buffers with weights 2, 3, and 1 are collapsed into one output buffer with weight 6 containing 6, 10, 15, 27, 35.
  Conceptually, each element is replicated as many times as its buffer's weight, the replicated sequences are merged, and every 6th position of the merged sequence is picked for the output.
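A sketch of the weighted collapse just described, assuming sorted input buffers given as (buffer, weight) pairs. The "replicate by weight, take every W-th position" rule is from the slide; the choice of starting the picks near position W/2 is an illustrative offset, and the slides' exact offset rule may differ.

```python
def generalized_collapse(buffers, k):
    """Collapse sorted buffers with weights into one buffer of size k.
    Each element conceptually occupies weight-many consecutive positions
    of the merged sequence; every W-th position is picked (W = total weight)."""
    W = sum(w for _, w in buffers)
    merged = sorted((x, w) for buf, w in buffers for x in buf)
    out, pos, next_pick = [], 0, W // 2        # first pick near position W/2
    for x, w in merged:
        pos += w                               # x occupies positions (pos-w, pos]
        while len(out) < k and next_pick <= pos:
            out.append(x)
            next_pick += W
    return out, W

# Hypothetical inputs: weight-2 and weight-3 buffers, k = 4.
out, W = generalized_collapse([([6, 12, 31, 37], 2), ([5, 8, 10, 35], 3)], 4)
```

Here `W` is 5 and `out` is `[5, 8, 12, 35]`: the 20 replicated positions yield one output element per 5 positions.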

  18. Analysis of Generalized Munro-Paterson
  The memory bound for Generalized Munro-Paterson matches that of Munro-Paterson asymptotically, but with a smaller constant.

  19. Reservoir Sampling [Vitter 85]
  Maintain a uniform sample of size s over an input sequence of length N.
  Approximate median = median of the sample.
  If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, the answer is an ε-approximate median.
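Vitter's Algorithm R maintains the invariant that after i elements have arrived, each of them is in the sample with probability s/i. A minimal Python sketch:

```python
import random

def reservoir_sample(stream, s):
    """Uniform random sample of size s from a stream of unknown length
    (Vitter's Algorithm R)."""
    sample = []
    for i, x in enumerate(stream):
        if i < s:
            sample.append(x)              # fill the reservoir first
        else:
            j = random.randrange(i + 1)   # uniform in [0, i]
            if j < s:
                sample[j] = x             # replace with probability s/(i+1)
    return sample

sample = reservoir_sample(range(1000), 10)
```

The approximate median of the stream is then the median of `sample`.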

  20. "Non-Reservoir" Sampling
  Choose 1 out of every N/s successive elements. (Figure: one element picked from each block of N/s stream elements.) At the end of the stream, the sample size is s.
  Approximate median = median of the sample.
  If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, the answer is an ε-approximate median.
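A sketch of the block sampler, assuming s divides N (function name is illustrative):

```python
import random

def one_in_block_sample(stream, N, s):
    """Known-length stream: pick one element uniformly at random from
    each successive block of N // s elements."""
    block = N // s
    offset = random.randrange(block)      # which position to take in this block
    sample = []
    for i, x in enumerate(stream):
        if i % block == offset:
            sample.append(x)
        if i % block == block - 1:        # block finished: draw a fresh offset
            offset = random.randrange(block)
    return sample

sample = one_in_block_sample(range(20), 20, 4)   # 4 blocks of 5 elements
```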

  21. Non-uniform Sampling ...
  Choose s out of the first s elements (weight 1), s out of the next 2s elements (weight 2), s out of the next 4s elements (weight 4), s out of the next 8s elements (weight 8), and so on.
  At the end of the stream, the sample size is O(s log(N/s)).
  Approximate median = weighted median of the sample.
  If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, the answer is an ε-approximate median.
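A sketch of the epoch-doubling sampler and the weighted median it feeds (epoch sizes s, 2s, 4s, ... with weights 1, 2, 4, ...; function names are illustrative):

```python
import random

def nonuniform_sample(stream, s):
    """Keep s uniform samples per epoch; epoch i spans s * 2**i elements
    and its samples carry weight 2**i."""
    samples, epoch, buf = [], 0, []
    for x in stream:
        buf.append(x)
        if len(buf) == s * 2 ** epoch:                 # epoch complete
            samples += [(y, 2 ** epoch) for y in random.sample(buf, s)]
            epoch, buf = epoch + 1, []
    if buf:                                            # partial final epoch
        samples += [(y, 2 ** epoch)
                    for y in random.sample(buf, min(s, len(buf)))]
    return samples

def weighted_median(pairs):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(pairs)
    half = sum(w for _, w in pairs) / 2.0
    acc = 0
    for x, w in pairs:
        acc += w
        if acc >= half:
            return x

pairs = nonuniform_sample(range(28), 4)   # epochs of 4, 8, 16 elements
approx_median = weighted_median(pairs)
```

With s = 4 and 28 stream elements, the three epochs contribute 12 weighted samples with weights 1, 2, and 4.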

  22. Sampling + Generalized Munro-Paterson [MRL 98]
  Given ε and δ: for a stream of known length N (advance knowledge of N), use "1-in-N/s" sampling to choose s samples; for a stream of unknown length, use reservoir sampling to maintain s samples.
  Then either compute the exact median of the samples (memory = s), or feed the samples to Generalized Munro-Paterson and compute an ε-approximate median of the samples in less memory.
  Either way, the output is an ε-approximate median with probability at least 1-δ.

  23. Unknown-N Algorithm [MRL 99]
  Stream of unknown length, given ε and δ: non-uniform sampling feeds a modified deterministic algorithm for approximate medians.
  Output is an ε-approximate median with probability at least 1-δ.

  24. Non-uniform Sampling ... (Figure repeats slide 21: s samples kept per epoch, with epoch sizes s, 2s, 4s, 8s, ... and weights 1, 2, 4, 8, ...) At the end of the stream, the sample size is O(s log(N/s)); the approximate median is the weighted median of the sample.

  25. Modified Deterministic Algorithm ...
  b buffers, each of size k. The non-uniform samples enter the buffer tree at increasing heights: 2s elements with weight W = 1, then s elements each with W = 2, 4, 8, ..., 2^(L-h), at heights h, h+1, h+2, h+3, ..., L, where L is the highest level and h is the height of the tree.
  Compute the approximate median of the weighted samples.

  26. Modified Munro-Paterson Algorithm
  (As on the previous slide, with H = highest level and b = height of the tree.) b buffers, each of size k; the weighted samples enter at heights b, b+1, b+2, b+3, ..., H: 2s elements with W = 1, then s elements each with W = 2, 4, 8, ..., 2^(H-b).
  Compute the approximate median of the weighted samples.

  27. Error Analysis ...
  b buffers, each of size k; total height = b + h, where b is the height of the small tree and h is the number of levels of weighted samples above it.
  The increase in fractional error in rank is 1/2k per level, so the total fractional error is <= (b + h)/2k.

  28. Error Analysis contd...
  Minimize bk subject to the following constraints:
  • Number of elements in leaves: k 2^b > s, where s = O((1/ε^2) log(1/δ))
  • Max fractional error in rank: b/2k < (1-α) ε, where α ε is the share of the error budget spent on sampling
  This gives b = O(log(εs)) and k = O((1/ε) log(εs)), so memory = bk = O((1/ε) log^2(εs)): almost the same as before.

  29. Summary of Algorithms ...
  • Reservoir Sampling [Vitter 85]: probabilistic
  • Munro-Paterson [MP 80]: deterministic; requires advance knowledge of N
  • Generalized Munro-Paterson [MRL 98]: deterministic; requires advance knowledge of N
  • Sampling + Generalized MP [MRL 98]: probabilistic; requires advance knowledge of N
  • Non-uniform Sampling + GMP [MRL 99]: probabilistic
  • Greenwald & Khanna [GK 01]: deterministic

  30. List of Papers ...
  [Hoeffding63] W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables", Journal of the American Statistical Association, pp. 13-30, 1963.
  [MP80] J. I. Munro and M. S. Paterson, "Selection and Sorting with Limited Storage", Theoretical Computer Science, 12:315-323, 1980.
  [Vit85] J. S. Vitter, "Random Sampling with a Reservoir", ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
  [MRL98] G. S. Manku, S. Rajagopalan and B. G. Lindsay, "Approximate Medians and other Quantiles in One Pass and with Limited Memory", ACM SIGMOD 1998, pp. 426-435.
  [MRL99] G. S. Manku, S. Rajagopalan and B. G. Lindsay, "Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets", ACM SIGMOD 1999, pp. 251-262.
  [GK01] M. Greenwald and S. Khanna, "Space-Efficient Online Computation of Quantile Summaries", ACM SIGMOD 2001, pp. 58-66.
