

### Finding Frequent Items in Data Streams


Moses Charikar Princeton Un., Google Inc.

Kevin Chen UC Berkeley, Google Inc.

Martin Farach-Colton Rutgers Un., Google Inc.

Presented by Amir Rothschild

Presenting:

- 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space.
- The algorithm achieves especially good space bounds for Zipfian distribution
- 2-pass algorithm for estimating the items with the largest change in frequency between two data streams.

Definitions:

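- Data stream: $S = q_1, q_2, \dots, q_n$
- where each $q_i \in O = \{o_1, \dots, o_m\}$
- Object $o_i$ appears $n_i$ times in S.
- Order the $o_i$ so that $n_1 \ge n_2 \ge \dots \ge n_m$
- $f_i = n_i/n$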


The first problem: FindApproxTop(S,k,ε)

- Input: stream S, int k, real ε.
- Output: k elements from S such that:
- for every element $o_i$ in the output: $n_i > (1-\varepsilon)\,n_k$
- the output contains every item $o_i$ with: $n_i > (1+\varepsilon)\,n_k$
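To make the first output condition concrete, here is a minimal Python sketch of a checker. It computes exact counts offline, so it is only a test aid and not a streaming algorithm. The name `is_valid_output` is made up for this illustration, and the stream is assumed to contain at least k distinct objects.

```python
from collections import Counter

def is_valid_output(stream, k, eps, output):
    """Check that output has k distinct items, each with n_i > (1 - eps) * n_k."""
    counts = Counter(stream)                        # exact n_i for every object
    ranked = sorted(counts.values(), reverse=True)  # n_1 >= n_2 >= ...
    n_k = ranked[k - 1]                             # count of the k-th most frequent object
    return (len(set(output)) == k == len(output)
            and all(counts[o] > (1 - eps) * n_k for o in output))
```

For the stream `aaaabbbccd` with k=2 and ε=0.4 the threshold is 0.6 · 3 = 1.8, so `["a", "c"]` is accepted even though c is not among the true top 2, while `["a", "d"]` is rejected.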

Clarifications:

- This is not the problem discussed last week!
- Sampling algorithm does not give any bounds for this version of the problem.

Hash functions

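- We say that h is a pairwise independent hash function if h is chosen uniformly at random from a family H of functions with range R, so that for every two distinct objects $x \ne y$ and every two values $a, b \in R$: $\Pr_{h \in H}[h(x) = a \wedge h(y) = b] = 1/|R|^2$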
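One standard family with this property is the set of affine maps modulo a prime. The Python sketch below assumes integer keys in {0,…,p-1} and range R = {0,…,p-1}; the function names are illustrative and not taken from the slides. `pair_counts` enumerates the whole family so the definition can be checked directly for a small p.

```python
import itertools
import random

def affine_hash(a, b, p):
    """One member of the family H = {x -> (a*x + b) mod p}, for a prime p."""
    return lambda x: (a * x + b) % p

def random_hash(rng, p):
    """Choose h uniformly at random from H by drawing a and b from Z_p."""
    return affine_hash(rng.randrange(p), rng.randrange(p), p)

def pair_counts(x, y, p):
    """Count, over all of H, how often each value pair (h(x), h(y)) occurs."""
    counts = {}
    for a, b in itertools.product(range(p), repeat=2):
        h = affine_hash(a, b, p)
        key = (h(x), h(y))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

For p = 5 and the keys 1 and 3, each of the 25 value pairs is produced by exactly one of the 25 functions, which is the 1/|R|² probability in the definition. Reducing the result further to b buckets or to a sign in {+1,-1} makes the family only approximately pairwise independent.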


Let’s start with some intuition…

- Idea:
- Let s be a hash function from objects to {+1,-1}, and let c be a counter.
- For each qi in the stream, update c += s(qi)

- Estimate ni=c*s(oi)
- (since $E[c \cdot s(o_i)] = n_i$)

Claim: $E[c \cdot s(o_i)] = n_i$

- Proof:
- For each element oj other than oi, s(oj)*s(oi)=-1 w.p. 1/2 and s(oj)*s(oi)=+1 w.p. 1/2.
- So oj contributes +nj to c*s(oi) w.p. 1/2 and -nj w.p. 1/2, and so has no influence on the expectation.
- oi, on the other hand, contributes +ni w.p. 1 (since s(oi)*s(oi)=+1).
- So the expectation (average) is +ni.
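The single-counter estimate can be checked exactly on a tiny example by averaging over every possible sign function. This Python sketch treats s as uniformly random over all sign assignments, which is stronger than pairwise independence; the counts are made-up illustration data.

```python
import itertools

def single_counter_estimates(counts, target):
    """Return c * s(target) for every possible sign function s."""
    objects = sorted(counts)
    estimates = []
    for signs in itertools.product((+1, -1), repeat=len(objects)):
        s = dict(zip(objects, signs))
        c = sum(s[o] * n for o, n in counts.items())  # counter after the whole stream
        estimates.append(c * s[target])
    return estimates

counts = {"a": 5, "b": 3, "c": 2}        # hypothetical n_i values
est = single_counter_estimates(counts, "b")
mean = sum(est) / len(est)               # average over all 8 sign functions
```

Here `mean` is 3.0, the true count of b, but the individual estimates range from -4 to 10, so a single counter on its own is a poor estimator.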

That’s not enough:

- The variance is very high.
- O(m) objects have estimates that are wrong by more than the variance.

First attempt to fix the algorithm…

- t independent hash functions Sj
- t different counters Cj
- For each element qi in the stream:
- For each j in {1,2,…,t} do: Cj += Sj(qi)

- Take the mean or the median of the estimates Cj*Sj(oi) to estimate ni.
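A minimal Python sketch of this first attempt follows. The sign functions are simulated with a salted BLAKE2 digest, which is an assumption made for illustration and stands in for t independently chosen hash functions Sj; the function names are hypothetical.

```python
import hashlib
from statistics import median

def sign(item, j):
    """Stand-in for the sign hash S_j: maps item to +1 or -1."""
    d = hashlib.blake2b(repr(item).encode(), digest_size=1,
                        salt=j.to_bytes(8, "little")).digest()
    return 1 if d[0] & 1 else -1

def run_counters(stream, t):
    """Maintain t counters; every element updates every counter."""
    counters = [0] * t
    for q in stream:
        for j in range(t):
            counters[j] += sign(q, j)
    return counters

def estimate(counters, item):
    """Median of the t per-counter estimates C_j * S_j(item)."""
    return median(c * sign(item, j) for j, c in enumerate(counters))
```

Each per-counter estimate equals the true count plus or minus the counts of all other objects, so for a stream with 100 copies of one item and 5 other elements the estimate for that item is always within 5 of 100.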


Still not enough

- Collisions with high-frequency elements such as o1 can spoil most estimates of lower-frequency elements such as ok.

The solution !!!

- Divide & Conquer:
- Don’t let each element update every counter.
- More precisely: replace each counter with a hash table of b counters, and have each item update only one counter per hash table.


Presenting the CountSketch algorithm…

The CountSketch data structure

- Define the CountSketch d.s. as follows:
- Let t and b be parameters with values determined later.
- h1,…,ht – hash functions O -> {1,2,…,b}.
- T1,…,Tt – arrays of b counters.
- S1,…,St – hash functions from objects O to {+1,-1}.
- From now on, define : hi[oj] := Ti[hi(oj)]

The d.s. supports 2 operations:

- Add(q): for each i in {1,2,…,t}, hi[q] += Si(q)
- Estimate(q): return the median over i of { hi[q] * Si(q) }

- Why median and not mean?
- In order to show the median is close to reality it’s enough to show that ½ of the estimates are good.
- The mean on the other hand is very sensitive to outliers.
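The data structure and its two operations can be sketched in Python as below. This is an illustration under assumptions: the hash functions hi and Si are simulated with salted BLAKE2 digests rather than drawn from a pairwise independent family, and buckets are indexed from 0 instead of 1.

```python
import hashlib
from statistics import median

class CountSketch:
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]   # T_1..T_t, b counters each

    def _hash(self, item, salt):
        """Stable 64-bit hash of item under an integer salt (stand-in hash)."""
        raw = hashlib.blake2b(repr(item).encode(), digest_size=8,
                              salt=salt.to_bytes(8, "little")).digest()
        return int.from_bytes(raw, "little")

    def _bucket(self, i, item):
        """h_i(item): bucket index in {0..b-1}."""
        return self._hash(item, 2 * i) % self.b

    def _sign(self, i, item):
        """S_i(item): +1 or -1."""
        return 1 if self._hash(item, 2 * i + 1) & 1 else -1

    def add(self, item):
        for i in range(self.t):
            self.tables[i][self._bucket(i, item)] += self._sign(i, item)

    def estimate(self, item):
        return median(self.tables[i][self._bucket(i, item)] * self._sign(i, item)
                      for i in range(self.t))
```

Each row's estimate is off by at most the total count of the other items that land in the same bucket, so an item seen 1000 times among 10 other elements is always estimated within 10 of 1000. Using an odd t keeps the median an integer.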

Finally, the algorithm:

- Keep a CountSketch d.s. C, and a heap of the top k elements.
- Given a data stream q1,…,qn:
- For each j=1,…,n:
- C.Add(qj);
- If qj is in the heap, increment its count.
- Else, If C.Estimate(qj) > smallest estimated count in the heap, add qj to the heap.
- (If the heap is full evict the object with the smallest estimated count from it)
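The loop above can be sketched in Python as follows. Several things here are assumptions made for illustration: a plain dict with a linear-time minimum stands in for the heap, an item is always admitted while fewer than k items are tracked, the default values of t and b are arbitrary, and the hash functions are simulated with salted BLAKE2 digests.

```python
import hashlib
from statistics import median

def _h(item, salt):
    """Stand-in hash (assumption): stable 64-bit value for (item, salt)."""
    raw = hashlib.blake2b(repr(item).encode(), digest_size=8,
                          salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(raw, "little")

class CountSketch:
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]

    def _cell(self, i, q):
        """Bucket index h_i(q) and sign S_i(q) for row i."""
        return _h(q, 2 * i) % self.b, (1 if _h(q, 2 * i + 1) & 1 else -1)

    def add(self, q):
        for i in range(self.t):
            bucket, sign = self._cell(i, q)
            self.tables[i][bucket] += sign

    def estimate(self, q):
        rows = []
        for i in range(self.t):
            bucket, sign = self._cell(i, q)
            rows.append(self.tables[i][bucket] * sign)
        return median(rows)

def find_approx_top(stream, k, t=5, b=64):
    sketch = CountSketch(t, b)
    top = {}                               # item -> estimated count ("heap")
    for q in stream:
        sketch.add(q)
        if q in top:
            top[q] += 1                    # tracked item: just increment
            continue
        est = sketch.estimate(q)
        if len(top) < k:
            top[q] = est                   # room left: admit directly
        else:
            smallest = min(top, key=top.get)
            if est > top[smallest]:        # evict the smallest estimate
                del top[smallest]
                top[q] = est
    return sorted(top, key=top.get, reverse=True)
```

A real implementation would use a heap with an index so that the minimum and the increments cost O(log k) instead of O(k).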

Algorithms analysis

Analysis of the CountSketch algorithm for Zipfian distribution

Zipfian distribution

- Zipfian(z): $n_i = \dfrac{c}{i^z}$ for some constant c.
- This distribution is very common in human languages (useful in search engines).

Observations

- The k most frequent elements can only be preceded by elements j with $n_j > (1-\varepsilon)\,n_k$
- => Choosing l instead of k so that $n_{l+1} < (1-\varepsilon)\,n_k$ will ensure that our list will include the k most frequent elements.
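For a Zipfian stream the required l can be computed directly from the condition on n_{l+1}. The Python sketch below uses idealized real-valued counts n_i = c / i^z; the function names and the parameter values in the example are illustrative assumptions.

```python
def zipf_counts(m, z, c):
    """Idealized Zipfian counts n_i = c / i**z for i = 1..m."""
    return [c / i**z for i in range(1, m + 1)]

def choose_l(counts, k, eps):
    """Smallest l >= k with n_{l+1} < (1 - eps) * n_k (counts sorted descending)."""
    threshold = (1 - eps) * counts[k - 1]    # counts[k - 1] is n_k
    l = k
    while l < len(counts) and counts[l] >= threshold:   # counts[l] is n_{l+1}
        l += 1
    return l

counts = zipf_counts(m=100, z=1.0, c=1000.0)
l = choose_l(counts, k=10, eps=0.2)
```

With z = 1, k = 10 and ε = 0.2 the threshold is 80, and n_13 ≈ 76.9 is the first count below it, so l = 12. In closed form the condition is l + 1 > k(1-ε)^(-1/z), which gives 12.5 here.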


Analysis for Zipfian distribution

- For this distribution the space complexity of the algorithm is where:

Finding items with largest frequency change

The problem

- Let $n_o^S$ be the number of occurrences of o in S.
- Given 2 streams S1,S2 find the items o such that $|n_o^{S_1} - n_o^{S_2}|$ is maximal.
- 2-pass algorithm.

The algorithm – first pass

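- First pass – only update the counters: for each q in S1 and each i in {1,…,t}, hi[q] -= Si(q); for each q in S2 and each i in {1,…,t}, hi[q] += Si(q).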

The algorithm – second pass

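- Pass over S1 and S2 and:
- Maintain a set A of the l objects encountered with the largest estimated change, i.e. the largest absolute value of the median over i of { hi[q] * Si(q) }.
- For every object in A, maintain an exact count of its occurrences in S1 and in S2.
- When the pass is done, report the k objects in A with the largest exact change.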

Explanation

- Though A can change, items once removed are never added back.
- Thus accurate exact counts can be maintained for all objects currently in A.
- Space bounds for this algorithm are similar to those of the former with replaced by
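The two passes can be sketched in Python as below, under stated assumptions: occurrences in S1 are subtracted from the sketch and occurrences in S2 are added, the candidate set A is a dict ranked by the absolute sketch estimate, and the hash functions are simulated with salted BLAKE2 digests. The function name and the default t and b are hypothetical.

```python
import hashlib
from statistics import median

def _h(item, salt):
    """Stand-in hash (assumption): stable 64-bit value for (item, salt)."""
    raw = hashlib.blake2b(repr(item).encode(), digest_size=8,
                          salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(raw, "little")

class CountSketch:
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]

    def _cell(self, i, q):
        return _h(q, 2 * i) % self.b, (1 if _h(q, 2 * i + 1) & 1 else -1)

    def add(self, q, weight):
        for i in range(self.t):
            bucket, sign = self._cell(i, q)
            self.tables[i][bucket] += weight * sign

    def estimate(self, q):
        cells = [self._cell(i, q) for i in range(self.t)]
        return median(self.tables[i][bkt] * sgn for i, (bkt, sgn) in enumerate(cells))

def largest_changes(s1, s2, k, l, t=5, b=64):
    sketch = CountSketch(t, b)
    for q in s1:                          # first pass: counters only
        sketch.add(q, -1)
    for q in s2:
        sketch.add(q, +1)

    exact = {}                            # A: object -> [count in S1, count in S2]
    score = {}                            # object -> |estimated change|
    for idx, stream in enumerate((s1, s2)):   # second pass
        for q in stream:
            if q not in exact:
                est = abs(sketch.estimate(q))
                if len(exact) >= l:
                    weakest = min(score, key=score.get)
                    if est <= score[weakest]:
                        continue          # q is not a candidate
                    del exact[weakest], score[weakest]
                exact[q], score[q] = [0, 0], est
            exact[q][idx] += 1
    change = {q: c2 - c1 for q, (c1, c2) in exact.items()}
    return sorted(change.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
```

Because the sketch is frozen during the second pass, the smallest score in A never decreases, so an evicted or rejected object can never enter later. That is why counting from an object's first encounter gives exact counts for everything still in A.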
