What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically

What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically. G. Cormode and S. Muthukrishnan, Rutgers University. ACM Principles of Database Systems 2003; ACM Transactions on Database Systems 2005. Introduction: find “hot” items, where the set of hot items changes over time.

What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically

G. Cormode and S. Muthukrishnan

Rutgers University

ACM Principles of Database Systems 2003

ACM Transactions on Database Systems 2005


Introduction

  • Find “hot” items, but the set of hot items will change over time

  • Applications: caching, load balancing, sensor networks, data mining, etc.

  • Prior work usually handles “insert” only; this paper also takes “delete” into account


Prior work

[Figure: a stream of 0/1 elements, each labeled with its arrival time (20 to 46)]

  • Stream with sliding window (*)

  • Flajolet-Martin approach (*)

    • Estimate number of distinct elements

  • Majority voting algorithm

    • Use only one counter to identify the majority item

  • Lossy counting
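The one-counter majority idea listed above can be sketched as follows. This assumes the slide refers to the classic majority vote algorithm usually attributed to Boyer and Moore; the function name `majority_candidate` and the sample stream are illustrative only. The sketch handles insertions only, which is the limitation the paper addresses.

```python
def majority_candidate(stream):
    """Return the only possible majority item using one candidate and one counter."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            # No surviving candidate: adopt the current item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # A different item cancels one vote of the candidate.
            count -= 1
    return candidate


stream = [3, 1, 3, 2, 3, 3, 1]
candidate = majority_candidate(stream)
# The candidate is only guaranteed correct if a majority exists,
# so a second pass over the data is needed to confirm it.
is_majority = stream.count(candidate) > len(stream) // 2
print(candidate, is_majority)  # 3 True
```

Because the counter discards information about cancelled items, there is no obvious way to undo an earlier insertion, which is why deletions need a different structure.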


* http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385


Contribution of the paper

  • Dynamically maintain the hot items

    • Both insert and delete transactions are supported

  • Randomized algorithm

    • Use hash table

    • Use randomization so that an omniscient adversary cannot defeat the algorithm

  • Small space required

  • Short processing time


Finding the majority item

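  • Keep log2(m) + 1 counters

    • C0: counts how many items are “live” (inserted and not yet deleted)

    • Cj (j ≠ 0): incremented when x is inserted and decremented when x is deleted, if bit(x, j) = 1

    • Search: if there is a majority, it is given by $x = \sum_{j=1}^{\log_2 m} 2^{j-1} \cdot [C_j > C_0/2]$, where $[\cdot]$ is 1 if the condition holds and 0 otherwise

    • No false negatives, but false positives are possible: when no item is a majority, the search still outputs some item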
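A minimal sketch of the bit-counter scheme above, supporting both inserts and deletes. It assumes item identifiers are integers in 0..m-1 and that counter j tracks the bit with value 2^(j-1); the class name `MajorityTracker` and its methods are hypothetical names chosen for illustration.

```python
class MajorityTracker:
    """Counter 0 plus one counter per bit of the item identifier."""

    def __init__(self, m):
        # Number of bits needed for identifiers 0..m-1 (assumption: ids start at 0).
        self.bits = max(1, (m - 1).bit_length())
        self.c = [0] * (self.bits + 1)

    def update(self, x, delta):
        """delta = +1 for an insert of x, -1 for a delete of x."""
        self.c[0] += delta
        for j in range(1, self.bits + 1):
            if (x >> (j - 1)) & 1:
                self.c[j] += delta

    def majority(self):
        """Rebuild the identifier bit by bit: bit j is set if Cj > C0 / 2."""
        x = 0
        for j in range(1, self.bits + 1):
            if 2 * self.c[j] > self.c[0]:
                x += 1 << (j - 1)
        return x


tracker = MajorityTracker(m=8)
for x in (2, 2, 2, 7):
    tracker.update(x, +1)
tracker.update(7, -1)            # deletions are handled like inserts
print(tracker.majority())        # 2: item 2 holds all three live copies

no_majority = MajorityTracker(m=8)
for x in (1, 2, 3):
    no_majority.update(x, +1)
print(no_majority.majority())    # 3, although 3 is only one of three live items
```

The second run shows the false-positive case: with no majority present, the per-bit tests can still assemble an identifier, so the output has to be treated as a candidate only.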



Example

[Figure: majority search over a space of 8 items, with Counter 0 and three bit counters: Counter 1 (2^0), Counter 2 (2^1), Counter 3 (2^2). After the updates, each bit counter is compared with (Counter 0)/2.]

Find majority: x = 0 + 2^1 + 0 = 2

False positive is possible!


Finding hot items

  • Sequence with length n

  • Item identifiers: 1..m

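  • nx(t): (# of inserts of x) − (# of deletes of x) before time t

  • fx(t): relative frequency of x, $f_x(t) = n_x(t) / \sum_{y=1}^{m} n_y(t)$

  • Hot item: given k, x is hot if fx(t) > 1/(k+1); at most k items can be hot at the same time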
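The definitions above can be checked with an exact, non-streaming baseline that keeps one counter per item. This is only a reference computation, not the paper's method; the function name `exact_hot_items` and the `("insert" | "delete")` transaction encoding are assumptions made for the example.

```python
from collections import Counter


def exact_hot_items(transactions, k):
    """Return the items x with f_x > 1/(k+1) after applying all transactions."""
    counts = Counter()
    for item, kind in transactions:
        counts[item] += 1 if kind == "insert" else -1
    n = sum(counts.values())  # total number of live items
    # f_x > 1/(k+1) is equivalent to n_x * (k+1) > n; this avoids division.
    return {x for x, nx in counts.items() if nx * (k + 1) > n}


log = [(1, "insert"), (1, "insert"), (1, "insert"),
       (2, "insert"), (3, "insert"), (3, "delete")]
print(exact_hot_items(log, k=3))  # {1}: n = 4, so an item needs more than 1 live copy
```

This baseline needs space proportional to the number of distinct items, which is what the small-space structure in the following slides avoids.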


Process item insert or delete

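  • Divide the items into groups using universal hash functions

  • Initialize c[0..2Tk][0..log m] = 0, n = 0

    • T: # of hash functions; each one splits the items into 2k groups

    • k: frequency threshold (fx(t) > 1/(k+1))

    • n: running count of live items

  • for all (i, transType) do

      if (transType == insert) n = n + 1
      else n = n - 1
      for x = 1 to T do
        index = 2k(x-1) + hash_x(i)   // hash_x(i) uniformly distributed over 0..2k-1
        UpdateCounters(i, transType, c[index])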
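The group lookup in the loop above could be sketched with a standard universal hash family of the form ((a·i + b) mod P) mod 2k. The prime `P`, the helper names `make_hashes` and `row_indices`, and the 0-based numbering of the hash functions are assumptions for illustration; the slide only states that the hash is uniformly distributed.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; assumes item identifiers are below P


def make_hashes(T, seed=None):
    """Draw T independent (a, b) pairs, one per hash function."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(T)]


def row_indices(item, hashes, k):
    """Return, for each hash function, the row of c[...] that item falls into."""
    width = 2 * k  # number of groups per hash function
    return [
        x * width + ((a * item + b) % P) % width
        for x, (a, b) in enumerate(hashes)
    ]


hashes = make_hashes(T=3, seed=42)
rows = row_indices(item=1234, hashes=hashes, k=5)
print(rows)  # one row per hash function, the x-th within [10*x, 10*x + 9]
```

Each update touches exactly T rows, so the processing time per transaction does not depend on how many items have been seen.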


Find hot sets

  • for i = 0 to 2Tk - 1 do   // for each group

      if c[i][0] > n/(k+1)
        position = 0; t = 1
        for j = 1 to log m do
          if (c[i][j] > n/(k+1))
            position = position + t
          t = t*2
        output(position)

Similar to the algorithm to find the majority
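Putting the update and search steps together, a compact sketch might look like the following. It follows the slides' outline under several assumptions: identifiers are integers in 0..m-1, the hash family is ((a·i + b) mod P) mod 2k, and T is rounded up to an integer. The class name `HotItemsSketch` and its method names are hypothetical, and the output is a candidate set that may contain false positives.

```python
import math
import random


class HotItemsSketch:
    P = 2_147_483_647  # prime modulus for the hash family (assumption)

    def __init__(self, m, k, delta, seed=None):
        rng = random.Random(seed)
        self.k = k
        self.width = 2 * k                                # groups per hash function
        self.T = max(1, math.ceil(math.log2(k / delta)))  # number of hash functions
        self.bits = max(1, (m - 1).bit_length())
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(self.T)]
        # One row of bits + 1 counters per group.
        self.c = [[0] * (self.bits + 1) for _ in range(self.T * self.width)]
        self.n = 0                                        # live items overall

    def update(self, item, delta):
        """delta = +1 for an insert, -1 for a delete."""
        self.n += delta
        for x, (a, b) in enumerate(self.hashes):
            row = self.c[x * self.width + ((a * item + b) % self.P) % self.width]
            row[0] += delta
            for j in range(1, self.bits + 1):
                if (item >> (j - 1)) & 1:
                    row[j] += delta

    def hot_candidates(self):
        """Decode one candidate from every group whose count exceeds n/(k+1)."""
        out = set()
        for row in self.c:
            if row[0] * (self.k + 1) > self.n:
                position = 0
                for j in range(1, self.bits + 1):
                    if row[j] * (self.k + 1) > self.n:
                        position += 1 << (j - 1)
                out.add(position)
        return out


sketch = HotItemsSketch(m=1024, k=4, delta=0.01, seed=0)
for item, copies in [(7, 60), (300, 30), (12, 10)]:
    for _ in range(copies):
        sketch.update(item, +1)
for _ in range(10):
    sketch.update(12, -1)   # item 12 is fully deleted again
candidates = sketch.hot_candidates()
print(sorted(candidates))
```

A group that holds two heavy items decodes to a mix of their bits, which is one source of false positives; repeating the grouping with T independent hash functions makes it likely that each hot item is isolated in at least one group.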


Error probability

  • With hash range |h| ≥ 2k (groups per hash function) and T = log2(k/δ) hash functions, the algorithm outputs all hot items with probability at least 1 − δ

    • Details of the proof (*,**)
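To get a feel for the sizes involved, the parameter choice above can be turned into a short calculation. The value k = 50 is taken from the experiments slide; δ = 0.01 and m = 2^20 are assumed values for illustration, and rounding T up to an integer is also an assumption. The function name `sketch_parameters` is hypothetical.

```python
import math


def sketch_parameters(k, delta, m):
    """Return (T, groups per hash function, total number of counters)."""
    T = math.ceil(math.log2(k / delta))
    groups = 2 * k
    counters_per_group = math.ceil(math.log2(m)) + 1
    return T, groups, T * groups * counters_per_group


print(sketch_parameters(k=50, delta=0.01, m=2**20))  # (13, 100, 27300)
```

Under these assumptions the structure needs about 27,300 counters regardless of how many transactions are processed.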

* J. L. Carter and M. N. Wegman, “Universal classes of hash functions”, Journal of Computer and System Sciences, 1979

** The two papers presented here (PODS 2003 and TODS 2005)


Experiments

  • Synthetic data:

    • Uniformly distributed inserts

    • Zipf-distributed inserts

    • Uniformly distributed deletes

    • 1,000,000 items

    • k=50 (hot items: f>1/(k+1))

  • Real data:

    • Telephone connections (from AT&T)

    • 3.5 million transactions

    • Every 100,000 transactions, query (src, dest) pairs with frequency greater than 1%


Results of synthetic data

  • Recall: proportion of the true hot items that are found by the method

  • Precision: proportion of the items output by the method that are truly hot
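The two measures above can be computed directly from the reported set and the true set of hot items. The function name `precision_recall` and the convention of returning 1.0 for an empty denominator are assumptions for this sketch; the sample sets are made up.

```python
def precision_recall(reported, truly_hot):
    """Return (precision, recall) for a reported set against the true hot items."""
    reported, truly_hot = set(reported), set(truly_hot)
    hits = len(reported & truly_hot)
    # Convention (assumption): an empty denominator counts as a perfect score.
    precision = hits / len(reported) if reported else 1.0
    recall = hits / len(truly_hot) if truly_hot else 1.0
    return precision, recall


# Both hot items are found (recall 1.0), plus one false positive (precision 2/3).
precision, recall = precision_recall(reported={7, 300, 303}, truly_hot={7, 300})
print(precision, recall)
```

A method with no false negatives always has recall 1.0, so precision is the measure that separates such methods.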



Conclusion

  • Propose a new method for identifying hot items

  • Cope with dynamic datasets

