Finding Frequent Items in Data Streams

Moses Charikar Princeton Un., Google Inc.

Kevin Chen UC Berkeley, Google Inc.

Martin Farach-Colton Rutgers Un., Google Inc.

Presented by Amir Rothschild


Presenting:

  • 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space.

  • The algorithm achieves especially good space bounds for Zipfian distributions.

  • 2-pass algorithm for estimating the items with the largest change in frequency between two data streams.


Definitions:

  • Data stream: S = q1, q2, …, qn

  • where each qi ∈ O = {o1, …, om}

  • Object oi appears ni times in S.

  • Order the oi so that n1 ≥ n2 ≥ … ≥ nm

  • fi = ni/n


The first problem:

  • FindApproxTop(S,k,ε)

    • Input: stream S, int k, real ε.

    • Output: k elements from S such that:

      • for every element oi in the output: ni > (1-ε)nk

      • the output contains every item oi with: ni > (1+ε)nk

[Figure: item counts n1, n2, …, nk]

Clarifications:

  • This is not the problem discussed last week!

  • The sampling algorithm does not give any bounds for this version of the problem.


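Hash functions

  • We say that h is a pairwise independent hash function if h is chosen uniformly at random from a family H such that, for every two distinct objects x ≠ y and every two values u, v in the range of h:

    Pr[h(x)=u and h(y)=v] = Pr[h(x)=u] · Pr[h(y)=v]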
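
As an illustration, one textbook pairwise independent family (an assumption here; the slides do not name a specific family) is H = {x -> (a*x + b) mod p} for a prime p, with a and b drawn uniformly from {0, …, p-1}. The sketch below enumerates the whole family for a small p and counts how often each pair of hash values occurs for two fixed inputs.

```python
from itertools import product

def make_hash(a, b, p):
    """One member of the family H = {x -> (a*x + b) mod p}."""
    return lambda x: (a * x + b) % p

def joint_counts(x, y, p):
    """Over every h in H, count how often (h(x), h(y)) takes each value pair."""
    counts = {}
    for a, b in product(range(p), repeat=2):
        h = make_hash(a, b, p)
        pair = (h(x), h(y))
        counts[pair] = counts.get(pair, 0) + 1
    return counts
```

For distinct x and y below p, every one of the p² value pairs appears exactly once, which is the pairwise independence property for this family.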


Let’s start with some intuition…

  • Idea:

  • Let s be a hash function from objects to {+1,-1}, and let c be a counter.

  • For each qi in the stream, update c += s(qi)

  • Estimate ni = c*s(oi)

  • (since E[c*s(oi)] = ni)

[Figure: a single counter c with its sign hash s]
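
The idea above can be sketched as follows. To show the expectation exactly, this example averages the estimate over every possible sign function on a tiny universe; that enumeration is an illustrative device (the function names are hypothetical), not something a streaming algorithm would do.

```python
from itertools import product

def estimate_with_sign_hash(stream, signs, target):
    """Run the one-counter scheme for a fixed sign assignment `signs`."""
    c = 0
    for q in stream:
        c += signs[q]            # c += s(q)
    return c * signs[target]     # estimate of the target's count

def average_estimate(stream, target):
    """Average the estimate over every possible sign function s."""
    objects = sorted(set(stream))
    assignments = list(product([+1, -1], repeat=len(objects)))
    total = 0
    for values in assignments:
        signs = dict(zip(objects, values))
        total += estimate_with_sign_hash(stream, signs, target)
    return total / len(assignments)
```

A single sign function can be far off (with all signs +1 the estimate is just the stream length), but the average over all sign functions equals the true count.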



Claim:

  • For each element oj other than oi, s(oj)*s(oi) = -1 w.p. 1/2 and s(oj)*s(oi) = +1 w.p. 1/2.

  • So oj adds +nj to the counter w.p. 1/2 and -nj w.p. 1/2, and so has no influence on the expectation.

  • oi, on the other hand, adds +ni to the counter w.p. 1 (since s(oi)*s(oi) = +1).

  • So the expectation (average) is +ni.

  • Proof: E[c*s(oi)] = Σj nj*E[s(oj)*s(oi)] = ni, because E[s(oj)*s(oi)] = 0 for every j ≠ i and s(oi)*s(oi) = 1.


That’s not enough:

  • The variance is very high.

  • O(m) objects have estimates that are wrong by more than the variance.
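
One way to make "very high" concrete, assuming the sign hash s is pairwise independent (the slides do not show this formula; it is a standard calculation): write the estimate as ni plus noise and note that the cross terms have zero expectation.

```latex
c \cdot s(o_i) = n_i + \sum_{j \neq i} n_j \, s(o_j) \, s(o_i),
\qquad
\mathrm{Var}\left[ c \cdot s(o_i) \right] = \sum_{j \neq i} n_j^2 .
```

So the standard deviation can be on the order of n1, which is far larger than the count of a low-frequency item.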


First attempt to fix the algorithm…

  • t independent hash functions Sj

  • t different counters Cj

  • For each element qi in the stream:

    For each j in {1,2,…,t} do

    Cj += Sj(qi)

  • Take the mean or the median of the estimates Cj*Sj(oi) to estimate ni.
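
A rough sketch of this first fix, using the median. The sign hashes here are simulated by memoizing a random ±1 per object, which is an assumption made for brevity: it uses memory proportional to the number of distinct objects, so it illustrates the estimator rather than a space-efficient implementation.

```python
import random
from statistics import median

def make_sign_hash(rng):
    """Hypothetical sign hash: remembers a random +1/-1 for each object."""
    table = {}
    def s(obj):
        if obj not in table:
            table[obj] = rng.choice((+1, -1))
        return table[obj]
    return s

def estimate_count(stream, target, t=5, seed=0):
    """Keep t counters Cj with sign hashes Sj; return the median estimate."""
    rng = random.Random(seed)
    sign_hashes = [make_sign_hash(rng) for _ in range(t)]
    counters = [0] * t
    for q in stream:
        for j in range(t):
            counters[j] += sign_hashes[j](q)      # Cj += Sj(q)
    return median(counters[j] * sign_hashes[j](target) for j in range(t))
```

With only two distinct objects x and y, each single estimate of x is either nx + ny or nx - ny, which shows how one heavy object perturbs every counter.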

[Figure: six counters C1, …, C6, each paired with its own sign hash S1, …, S6]


Still not enough

  • Collisions with high-frequency elements such as o1 can spoil most estimates of lower-frequency elements such as ok.


The solution!

  • Divide & Conquer:

  • Don’t let each element update every counter.

  • More precisely: replace each counter with a hash table of b counters, and have each item update only one counter per hash table.

[Figure: counter Ci with sign hash Si is replaced by hash table Ti, indexed by bucket hash hi]



Let’s start working…

Presenting the CountSketch algorithm…


CountSketch data structure

[Figure: t hash tables T1, …, Tt with b buckets each; table Ti is indexed by bucket hash hi and paired with sign hash Si]

The CountSketch data structure

  • Define the CountSketch d.s. as follows:

  • Let t and b be parameters with values determined later.

  • h1,…,ht – hash functions from objects O to {1,2,…,b}.

  • T1,…,Tt – arrays of b counters.

  • S1,…,St – hash functions from objects O to {+1,-1}.

  • From now on, define: hi[oj] := Ti[hi(oj)]


The d.s. supports 2 operations:

  • Add(q): for each i in {1,…,t}, update hi[q] += Si(q)

  • Estimate(q): return the median over i of hi[q]*Si(q)

  • Why median and not mean?

  • In order to show the median is close to reality it’s enough to show that ½ of the estimates are good.

  • The mean, on the other hand, is very sensitive to outliers.
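
A minimal sketch of the data structure with its two operations. The hash functions are an assumption: each row uses a linear hash (a*x + c) mod P for the bucket and another for the sign, with P = 2^31 - 1, and Python's built-in `hash` turns items into integers. The slides leave the choice of hash family and the values of t and b open.

```python
import random
from statistics import median

class CountSketch:
    P = 2_147_483_647  # assumed prime modulus (2^31 - 1)

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]          # T1..Tt
        # One (a, c) pair per row for the bucket hash hi and the sign hash Si.
        self.bucket_params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                              for _ in range(t)]
        self.sign_params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                            for _ in range(t)]

    def _key(self, q):
        return hash(q) % self.P

    def _bucket(self, i, q):
        a, c = self.bucket_params[i]
        return ((a * self._key(q) + c) % self.P) % self.b

    def _sign(self, i, q):
        a, c = self.sign_params[i]
        return 1 if ((a * self._key(q) + c) % self.P) % 2 == 0 else -1

    def add(self, q):
        """Add(q): hi[q] += Si(q) for every row i."""
        for i in range(self.t):
            self.tables[i][self._bucket(i, q)] += self._sign(i, q)

    def estimate(self, q):
        """Estimate(q): median over rows of hi[q] * Si(q)."""
        return median(self.tables[i][self._bucket(i, q)] * self._sign(i, q)
                      for i in range(self.t))
```

Because each row's estimate for an item is its true count plus signed contributions from colliding items only, an item seen 1000 times alongside ten singletons is always estimated within 10 of its true count, whatever the hash parameters.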


Finally, the algorithm:

  • Keep a CountSketch d.s. C, and a heap of the top k elements.

  • Given a data stream q1,…,qn:

  • For each j=1,…,n:

    • C.Add(qj);

    • If qj is in the heap, increment its count.

    • Else, if C.Estimate(qj) > smallest estimated count in the heap, add qj to the heap.

    • (If the heap is full, evict the object with the smallest estimated count.)
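
The loop above can be sketched as follows. Two simplifications are assumptions of this example: a plain dict scanned with `min` stands in for the heap (O(k) per miss instead of O(log k)), and the sketch uses illustrative linear hash functions modulo a prime. The names `make_sketch` and `approx_top` are hypothetical.

```python
import random
from statistics import median

P = 2_147_483_647  # assumed prime modulus for the illustrative hash functions

def make_sketch(t, b, seed=0):
    """Return (add, estimate) closures over t tables of b counters."""
    rng = random.Random(seed)
    params = [[rng.randrange(1, P) for _ in range(4)] for _ in range(t)]
    tables = [[0] * b for _ in range(t)]

    def cell_and_sign(i, q):
        a1, c1, a2, c2 = params[i]
        x = hash(q) % P
        bucket = ((a1 * x + c1) % P) % b
        sign = 1 if ((a2 * x + c2) % P) % 2 == 0 else -1
        return bucket, sign

    def add(q):
        for i in range(t):
            bucket, sign = cell_and_sign(i, q)
            tables[i][bucket] += sign

    def estimate(q):
        rows = []
        for i in range(t):
            bucket, sign = cell_and_sign(i, q)
            rows.append(tables[i][bucket] * sign)
        return median(rows)

    return add, estimate

def approx_top(stream, k, t=5, b=64, seed=0):
    add, estimate = make_sketch(t, b, seed)
    tracked = {}  # item -> estimated count; stands in for the heap
    for q in stream:
        add(q)
        if q in tracked:
            tracked[q] += 1                      # already tracked: increment
        elif len(tracked) < k:
            tracked[q] = estimate(q)             # room left: just insert
        else:
            smallest = min(tracked, key=tracked.get)
            est = estimate(q)
            if est > tracked[smallest]:          # beats the current minimum
                del tracked[smallest]
                tracked[q] = est
    return sorted(tracked.items(), key=lambda kv: kv[1], reverse=True)
```

The result never holds more than k items, and an item's tracked count starts from its sketch estimate at insertion time and is exact from then on.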


And now for the hard part:

Analysis of the algorithm






Zipfian distribution

Analysis of the CountSketch algorithm for Zipfian distribution


Zipfian distribution

  • Zipfian(z): ni = c / i^z for some constant c.

  • This distribution is very common in human languages (useful in search engines).
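
For a feel of how quickly such counts decay, here is a small sketch that assumes the Zipfian form ni = c / i^z, with the i-th most frequent object having a count proportional to 1/i^z. The function name and the use of real-valued (unrounded) counts are choices made for this example.

```python
def zipfian_counts(m, z, c):
    """Return [n1, ..., nm] with ni = c / i**z (real-valued, not rounded)."""
    return [c / i ** z for i in range(1, m + 1)]
```

With z = 1 and c = 12 the first four counts are 12, 6, 4 and 3. Larger z concentrates more of the stream in the first few objects.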


[Figure: plot with axis label Prq(oi=q)]


Observations

  • The k most frequent elements can only be preceded by elements j with nj > (1-ε)nk

  • => Choosing l instead of k so that nl+1 < (1-ε)nk (where nl+1 is the (l+1)-th largest count) ensures that our list includes the k most frequent elements.

[Figure: item counts n1, n2, …, nk, …, nl+1 in decreasing order]


Analysis for Zipfian distribution

  • For this distribution the space complexity of the algorithm is where:





Yet another algorithm which uses the CountSketch d.s.

Finding items with largest frequency change


The problem

  • Let n(o, S) be the number of occurrences of o in S.

  • Given 2 streams S1, S2, find the items o such that |n(o, S1) - n(o, S2)| is maximal.

  • 2-pass algorithm.


The algorithm – first pass

  • First pass – only update the counters: for each q in S1 and each i in {1,…,t}, hi[q] -= Si(q); for each q in S2 and each i in {1,…,t}, hi[q] += Si(q).


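The algorithm – second pass

  • Pass over S1 and S2 and:

    • for each item q, compute its estimated change as the median over i of hi[q]*Si(q);

    • maintain a set A of the l items with the largest estimated change in absolute value;

    • keep exact counts of occurrences in S1 and in S2 for every item currently in A;

    • at the end, report the k items in A with the largest exact change.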


Explanation

  • Though A can change, items once removed are never added back.

  • Thus exact counts can be maintained for all objects currently in A.

  • Space bounds for this algorithm are similar to those of the former, with the count of each item replaced by the absolute change in its count between S1 and S2.
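
A rough sketch of the two passes under stated assumptions: the first pass subtracts signs for S1 and adds them for S2; the second pass keeps a candidate set A of at most l items ranked by the absolute value of their sketch estimate, with exact per-stream counts for members of A. The hash functions, the parameter names k and l, and the function `max_change` are illustrative choices, not taken from the slides.

```python
import random
from statistics import median

P = 2_147_483_647  # assumed prime modulus for the illustrative hash functions

def max_change(s1, s2, k, l, t=5, b=64, seed=0):
    rng = random.Random(seed)
    params = [[rng.randrange(1, P) for _ in range(4)] for _ in range(t)]
    tables = [[0] * b for _ in range(t)]

    def cell_and_sign(i, q):
        a1, c1, a2, c2 = params[i]
        x = hash(q) % P
        bucket = ((a1 * x + c1) % P) % b
        sign = 1 if ((a2 * x + c2) % P) % 2 == 0 else -1
        return bucket, sign

    def update(q, delta):
        for i in range(t):
            bucket, sign = cell_and_sign(i, q)
            tables[i][bucket] += delta * sign

    def estimated_change(q):
        rows = []
        for i in range(t):
            bucket, sign = cell_and_sign(i, q)
            rows.append(tables[i][bucket] * sign)
        return abs(median(rows))

    # First pass: only update the counters.
    for q in s1:
        update(q, -1)
    for q in s2:
        update(q, +1)

    # Second pass: estimates are now fixed, so an evicted item never returns.
    A = {}  # item -> [exact count in s1, exact count in s2]
    for which, stream in enumerate((s1, s2)):
        for q in stream:
            if q not in A:
                if len(A) < l:
                    A[q] = [0, 0]
                else:
                    weakest = min(A, key=estimated_change)
                    if estimated_change(q) > estimated_change(weakest):
                        del A[weakest]
                        A[q] = [0, 0]
            if q in A:
                A[q][which] += 1

    exact = [(q, abs(c2 - c1)) for q, (c1, c2) in A.items()]
    return sorted(exact, key=lambda kv: kv[1], reverse=True)[:k]
```

Because the sketch no longer changes during the second pass, the admission threshold for A only rises. That is why an item admitted to A is admitted at its first occurrence and its counts are exact.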

