Finding Frequent Items in Data Streams

Moses Charikar (Princeton University, Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers University, Google Inc.). Presented by Amir Rothschild.

Presenting: • A 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. • The algorithm achieves especially good space bounds for Zipfian distributions. • A 2-pass algorithm for estimating the items with the largest change in frequency between two data streams.

Definitions: • Data stream: S = q1, q2, …, qn, where each qi is an object from O = {o1, …, om}. • Object oi appears ni times in S. • Order the oi so that n1 ≥ n2 ≥ … ≥ nm. • fi = ni/n

The first problem: • FindApproxTop(S,k,ε) • Input: stream S, integer k, real ε. • Output: k elements from S such that: • every element Oi in the output has ni > (1-ε)nk. • The output contains every item with ni > (1+ε)nk.

Clarifications: • This is not the problem discussed last week! • Sampling algorithm does not give any bounds for this version of the problem.

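Hash functions • We say that h is a pairwise independent hash function if h is chosen uniformly at random from a family H of functions from O to a range R, so that for every two distinct objects x ≠ y and every two values a, b in R: Pr[h(x) = a and h(y) = b] = 1/|R|^2.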
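One common way to get hash functions of this kind is the linear construction h(x) = ((a*x + c) mod p) mod b over a prime p. The sketch below is illustrative only: the names `make_bucket_hash` and `make_sign_hash`, the choice of prime, and the assumption that objects are integer keys in [0, P) are not from the slides. Reducing mod b makes the family only approximately pairwise independent.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2**31 - 1; keys are assumed to be ints in [0, P)

def make_bucket_hash(b, rng):
    """Draw h(x) = ((a*x + c) mod P) mod b from the linear family (hypothetical helper)."""
    a = rng.randrange(1, P)  # random multiplier, nonzero
    c = rng.randrange(0, P)  # random offset
    def h(x):
        return ((a * x + c) % P) % b
    return h

def make_sign_hash(rng):
    """Draw a sign hash s: keys -> {+1, -1} by mapping a 2-bucket hash to signs."""
    bit = make_bucket_hash(2, rng)
    def s(x):
        return 1 if bit(x) == 1 else -1
    return s
```

Each call with a fresh random draw gives a new member of the family, so t calls give t hash functions chosen independently of one another.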

Let’s start with some intuition… • Idea: • Let s be a hash function from objects to {+1,-1}, and let c be a counter. • For each qi in the stream, update c += s(qi) • Estimate ni by c*s(oi) • (since E[c*s(oi)] = ni)
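The single-counter idea can be tried out in a few lines. This is a minimal sketch, not the presenters' code: the function name `single_counter_estimate` is hypothetical, and the sign function is simulated with fully random signs stored in a dictionary (stronger than the pairwise independence the slides ask for, and not space-efficient).

```python
import random

def single_counter_estimate(stream, target, seed=0):
    """Estimate the count of `target` with one counter c and one sign function s."""
    rng = random.Random(seed)
    signs = {}  # lazily drawn random sign per object (assumption: stands in for s)

    def s(obj):
        if obj not in signs:
            signs[obj] = rng.choice((-1, 1))
        return signs[obj]

    c = 0
    for q in stream:
        c += s(q)          # update c += s(q) for every stream element
    return c * s(target)   # estimate of the target's count
```

With two distinct objects of counts 5 and 3, the estimate for the first is 5 + 3 or 5 - 3 depending on whether the two signs agree, which already hints at how noisy a single counter is.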

Realization

Claim: E[c*s(Oi)] = ni. • Proof: • For each element Oj other than Oi, s(Oj)*s(Oi)=-1 w.p. 1/2 and s(Oj)*s(Oi)=+1 w.p. 1/2. • So Oj adds +nj to the estimate w.p. 1/2 and -nj w.p. 1/2, and so has no influence on the expectation. • Oi, on the other hand, adds +ni to the estimate w.p. 1 (since s(Oi)*s(Oi)=+1). • So the expectation (average) is +ni.
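Written out as a calculation, the argument looks as follows. This is a worked derivation under the assumption that s is pairwise independent and unbiased; the variance line is added here to connect with the next slide and is not taken from the slides.

```latex
\begin{align*}
c &= \sum_{j=1}^{m} n_j\, s(o_j), \\
\mathbb{E}\bigl[c \cdot s(o_i)\bigr]
  &= n_i\, \mathbb{E}\bigl[s(o_i)^2\bigr]
   + \sum_{j \neq i} n_j\, \mathbb{E}\bigl[s(o_j)\, s(o_i)\bigr]
   = n_i + 0 = n_i, \\
\operatorname{Var}\bigl[c \cdot s(o_i)\bigr]
  &= \mathbb{E}\Bigl[\Bigl(\sum_{j \neq i} n_j\, s(o_j)\, s(o_i)\Bigr)^{2}\Bigr]
   = \sum_{j \neq i} n_j^{2}.
\end{align*}
```

The cross terms in the variance vanish because s(o_j) s(o_i) s(o_k) s(o_i) = s(o_j) s(o_k), whose expectation is 0 for j ≠ k.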

That’s not enough: • The variance is very high. • O(m) objects have estimates that are wrong by more than the variance.

First attempt to fix the algorithm… • t independent hash functions Sj • t different counters Cj • For each element qi in the stream: for each j in {1,2,…,t} do Cj += Sj(qi) • Take the mean or the median of the estimates Cj*Sj(oi) to estimate ni.

Still not enough • Collisions with high-frequency elements like O1 can spoil most estimates of lower-frequency elements, such as Ok.

The solution! • Divide & Conquer: • Don’t let each element update every counter. • More precisely: replace each counter with a hash table of b counters, and have each item update only one counter per hash table.

Let’s start working… Presenting the CountSketch algorithm…

CountSketch data structure (figure): t hash tables T1,…,Tt, each with b buckets, indexed by the hash functions h1,…,ht, together with the sign hash functions S1,…,St.

The CountSketch data structure • Define the CountSketch d.s. as follows: • Let t and b be parameters with values determined later. • h1,…,ht – hash functions O -> {1,2,…,b}. • T1,…,Tt – arrays of b counters. • S1,…,St – hash functions from objects O to {+1,-1}. • From now on, define: hi[oj] := Ti[hi(oj)]

The d.s. supports 2 operations: • Add(q): for each i in {1,…,t}, hi[q] += Si(q). • Estimate(q): return the median over i of hi[q]*Si(q). • Why median and not mean? • To show that the median is close to the true count, it is enough to show that more than ½ of the estimates are good. • The mean, on the other hand, is very sensitive to outliers.
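The two operations can be sketched as a small class. This is a minimal illustration under stated assumptions, not a reference implementation: the class and method names are hypothetical, and a salted BLAKE2 digest stands in for the pairwise independent hash functions hi and Si (one digest per row supplies both the bucket and the sign).

```python
import hashlib
from statistics import median

class CountSketch:
    """t rows of b counters; row i uses its own bucket hash and sign hash."""

    def __init__(self, t, b):
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]

    def _bucket_and_sign(self, i, q):
        # Assumption: a salted digest replaces the hash functions h_i and S_i.
        digest = hashlib.blake2b(repr(q).encode(), digest_size=8,
                                 salt=i.to_bytes(8, "little")).digest()
        v = int.from_bytes(digest, "little")
        return (v >> 1) % self.b, (1 if v & 1 else -1)

    def add(self, q):
        # Add(q): each row updates exactly one counter by the row's sign for q.
        for i in range(self.t):
            bucket, sign = self._bucket_and_sign(i, q)
            self.tables[i][bucket] += sign

    def estimate(self, q):
        # Estimate(q): median over the rows of counter * sign.
        per_row = []
        for i in range(self.t):
            bucket, sign = self._bucket_and_sign(i, q)
            per_row.append(self.tables[i][bucket] * sign)
        return median(per_row)
```

Choosing an odd t keeps the median an integer; with an even t, `statistics.median` averages the two middle values.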

Finally, the algorithm: • Keep a CountSketch d.s. C, and a heap of the top k elements. • Given a data stream q1,…,qn: • For each j=1,…,n: • C.Add(qj); • If qj is in the heap, increment its count. • Else, if C.Estimate(qj) > smallest estimated count in the heap, add qj to the heap. • (If the heap is full, evict the object with the smallest estimated count from it.)
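The loop above can be sketched end to end as follows. Several things here are assumptions for illustration: the names `find_approx_top` and `CountSketch`, the default values of t and b, the salted-digest hashing, and the use of a plain dictionary with a linear scan for the minimum in place of a real heap.

```python
import hashlib
from statistics import median

class CountSketch:
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]

    def _bucket_and_sign(self, i, q):
        # Assumption: a salted digest stands in for the hash functions h_i and S_i.
        d = hashlib.blake2b(repr(q).encode(), digest_size=8,
                            salt=i.to_bytes(8, "little")).digest()
        v = int.from_bytes(d, "little")
        return (v >> 1) % self.b, (1 if v & 1 else -1)

    def add(self, q):
        for i in range(self.t):
            bucket, sign = self._bucket_and_sign(i, q)
            self.tables[i][bucket] += sign

    def estimate(self, q):
        rows = (self._bucket_and_sign(i, q) for i in range(self.t))
        return median(self.tables[i][bk] * sg for i, (bk, sg) in enumerate(rows))

def find_approx_top(stream, k, t=5, b=64):
    """One pass over the stream; returns up to k candidate frequent items."""
    sketch = CountSketch(t, b)
    top = {}  # item -> estimated count (a dict replaces the heap for brevity)
    for q in stream:
        sketch.add(q)
        if q in top:
            top[q] += 1                      # tracked item: increment its count
            continue
        est = sketch.estimate(q)
        if len(top) < k:
            top[q] = est                     # room left: just start tracking q
        else:
            victim = min(top, key=top.get)   # smallest estimated count
            if est > top[victim]:
                del top[victim]              # evict it and track q instead
                top[q] = est
    return sorted(top, key=top.get, reverse=True)
```

A call such as `find_approx_top(stream, k=10)` returns the tracked items ordered by estimated count; a real implementation would keep a min-heap so that finding the eviction candidate does not cost O(k).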

And now for the hard part: Algorithm analysis

Definitions

Claims & Proofs

The CountSketch algorithm space complexity:

Zipfian distribution Analysis of the CountSketch algorithm for Zipfian distribution

Zipfian distribution • Zipfian(z): ni = c/i^z for some constant c. • This distribution is very common in human languages (useful in search engines).

(Figure: plot of Prq(oi=q).)

Observations • In the output, the k most frequent elements can only be preceded by elements j with nj > (1-ε)nk. • => Choosing l instead of k, so that nl+1 < (1-ε)nk, will ensure that our list includes the k most frequent elements.
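To make the choice of l concrete, the sketch below builds Zipfian counts ni = c/i^z and finds the smallest l ≥ k with nl+1 < (1-ε)nk. The function names `zipf_counts` and `choose_l` and the sample parameters are hypothetical, chosen only for illustration.

```python
def zipf_counts(m, z, c):
    """Counts n_i = c / i**z for i = 1..m (already sorted in decreasing order)."""
    return [c / i ** z for i in range(1, m + 1)]

def choose_l(counts, k, eps):
    """Smallest l >= k such that n_{l+1} < (1 - eps) * n_k.

    `counts` must be sorted in decreasing order; counts[l] holds n_{l+1}.
    Returns len(counts) if no such l exists.
    """
    threshold = (1 - eps) * counts[k - 1]
    for l in range(k, len(counts)):
        if counts[l] < threshold:
            return l
    return len(counts)
```

For example, with counts 12, 6, 4, 3 (z = 1, c = 12), k = 2 and ε = 0.4, the threshold is 3.6, so n3 = 4 is still too large and l = 3 is the first value that works.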

Analysis for Zipfian distribution • For this distribution the space complexity of the algorithm is where:

Proof of the space bounds: Part 1, l=O(k)

Proof of the space bounds: Part 2

Comparison of space requirements for random sampling vs. our algorithm

Yet another algorithm which uses CountSketch d.s. Finding items with largest frequency change

The problem • Let no(S) be the number of occurrences of o in S. • Given 2 streams S1, S2, find the items o such that |no(S1) - no(S2)| is maximal. • 2-pass algorithm.

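The algorithm – first pass • First pass – only update the counters: • For each q in S1 and each i in {1,…,t}: hi[q] -= Si(q). • For each q in S2 and each i in {1,…,t}: hi[q] += Si(q).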

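The algorithm – second pass • Pass over S1 and S2 and: • For each q, compute the estimate as the median over i of hi[q]*Si(q). • Maintain a set A of the l objects encountered with the largest absolute estimates. • For each object in A, maintain exact counts of its occurrences in S1 and in S2.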

Explanation • Though A can change, items once removed are never added back. • Thus exact counts can be maintained for all objects currently in A. • Space bounds for this algorithm are similar to those of the former, with each count ni replaced by the absolute change in the count of oi between S1 and S2.
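The two passes might be sketched as below. The formulas on these slides did not survive, so the details are assumptions: the first pass subtracts signs for S1 and adds them for S2, the second pass keeps a set A of the l items with the largest absolute estimates plus exact counts. The name `max_change`, the defaults for t and b, and the salted-digest hashing are all hypothetical.

```python
import hashlib
from statistics import median

def _bucket_and_sign(i, q, b):
    # Assumption: a salted digest stands in for the hash functions h_i and S_i.
    d = hashlib.blake2b(repr(q).encode(), digest_size=8,
                        salt=i.to_bytes(8, "little")).digest()
    v = int.from_bytes(d, "little")
    return (v >> 1) % b, (1 if v & 1 else -1)

def max_change(s1, s2, l, t=5, b=64):
    """Return up to l (item, exact change) pairs, largest absolute change first."""
    tables = [[0] * b for _ in range(t)]

    # First pass: only update the counters (-sign for S1, +sign for S2).
    for stream, direction in ((s1, -1), (s2, +1)):
        for q in stream:
            for i in range(t):
                bucket, sign = _bucket_and_sign(i, q, b)
                tables[i][bucket] += direction * sign

    def abs_estimate(q):
        rows = (_bucket_and_sign(i, q, b) for i in range(t))
        return abs(median(tables[i][bk] * sg for i, (bk, sg) in enumerate(rows)))

    # Second pass: the sketch is now fixed, so each item's estimate never changes.
    A = {}  # item -> [abs estimate, exact count in S1, exact count in S2]
    for idx, stream in ((1, s1), (2, s2)):
        for q in stream:
            if q not in A:
                est = abs_estimate(q)
                if len(A) >= l:
                    victim = min(A, key=lambda o: A[o][0])
                    if est <= A[victim][0]:
                        continue          # rejected now, so never admitted later
                    del A[victim]
                A[q] = [est, 0, 0]
            A[q][idx] += 1                # exact count for a tracked item

    changes = [(o, c2 - c1) for o, (_, c1, c2) in A.items()]
    return sorted(changes, key=lambda pair: -abs(pair[1]))
```

Because the estimates are frozen after the first pass, the admission threshold only rises, so an item is either admitted at its first occurrence or never; that is why the counts kept for items in A are exact.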
