
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically

What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically. G. Cormode and S. Muthukrishnan, Rutgers University. ACM Principles of Database Systems 2003; ACM Transactions on Database Systems 2005. Introduction: find “hot” items, but the set of hot items will change over time.


Presentation Transcript


  1. What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically. G. Cormode and S. Muthukrishnan, Rutgers University. ACM Principles of Database Systems 2003; ACM Transactions on Database Systems 2005

  2. Introduction
     • Find “hot” items, but the set of hot items will change over time
     • Applications: caching, load balancing, sensor networks, data mining, etc.
     • Prior work usually focuses on “insert” only; this paper also takes “delete” into account

  3. Prior works
     • Stream with sliding window (*)
     • Flajolet-Martin approach (*)
         • Estimate number of distinct elements
     • Majority voting algorithm
         • Use only one counter to identify the majority item
     • Lossy counting
     [Figure: a 0/1 stream, with axes “Arrival time” (20 to 46) and “Elements”]
     * http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385
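As an illustration of the “only one counter” idea, here is a minimal Python sketch of the classic majority vote algorithm commonly attributed to Boyer and Moore. It is assumed, not confirmed by the slides, that this is the algorithm the bullet refers to; the function name is hypothetical.

```python
def majority_candidate(stream):
    """Return the only item that can be a strict majority of an insert-only stream.

    Uses one candidate and one counter. The result is only a candidate:
    a second pass is needed to confirm that it really is a majority.
    """
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            # No surviving candidate: adopt the current item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # A different item cancels one vote of the candidate.
            count -= 1
    return candidate
```

This approach handles inserts only, which is the limitation the paper sets out to remove.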

  4. Contribution of the paper
     • Dynamically maintain the hot items
         • Both insert and delete transactions are supported
     • Randomized algorithm
         • Use hash table
         • Use randomness to confuse an omniscient adversary
     • Small space required
     • Short processing time

  5. Finding the majority item
     • Keep log2(m) + 1 counters
     • C_0: counts how many items are “live”
     • C_j (j ≠ 0): incremented on insert and decremented on delete of x if bit(x, j) = 1
     • Search: if there is a majority item, its set bits are exactly the j for which C_j > C_0/2
     • No false negatives, but false positives are possible
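The counter scheme on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions: items are integers in 0..m-1 (the slides use identifiers 1..m), bits are numbered from 0, and the class and method names are hypothetical.

```python
class MajorityTracker:
    """Tracks a majority candidate under inserts and deletes."""

    def __init__(self, m):
        # Number of bits needed to write any item in 0..m-1 (assumption).
        self.bits = max(1, (m - 1).bit_length())
        self.live = 0                        # C_0: number of live items
        self.bit_counts = [0] * self.bits    # C_1..C_bits, one per bit position

    def _update(self, x, delta):
        self.live += delta
        for j in range(self.bits):
            if (x >> j) & 1:
                self.bit_counts[j] += delta

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)

    def candidate(self):
        # Set bit j when more than half of the live items have that bit set.
        x = 0
        for j, count in enumerate(self.bit_counts):
            if count > self.live / 2:
                x |= 1 << j
        return x
```

If a majority item exists, `candidate()` returns it. Without a majority the result can be wrong: inserting 1, 2 and 3 once each yields the candidate 3, although no item holds more than half of the live items.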

  6. Algorithms to find the majority element in a sequence of updates

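  7. Example
     • Space of 8 items
     • Counters: Counter 0, Counter 1 (2^0), Counter 2 (2^1), Counter 3 (2^2)
     • Test for each of Counters 1 to 3: is its count > (Counter 0)/2?
     • Find majority: x = 0 + 2^1 + 0 = 2
     • False positive is possible!
     [Figure: values 1 2 2 2 7 2 4 6 and the counter diagram]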

  8. Finding hot items
     • Sequence with length n
     • Item identifiers: 1..m
     • n_x(t): (# of inserts of x) - (# of deletes of x) before time t
     • f_x(t) = n_x(t) / Σ_{y=1..m} n_y(t)
     • Hot item: given k, f_x(t) > 1/(k+1)
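The definitions above can be checked with an exact, brute-force computation. The following Python sketch keeps one counter per item, which is what the paper's method avoids; it is only meant to make the definition concrete. The transaction format `(item, "insert" | "delete")` and the function name are assumptions.

```python
from collections import Counter

def exact_hot_items(transactions, k):
    """Return the set of items x with f_x > 1/(k+1) after all transactions."""
    counts = Counter()
    for item, kind in transactions:
        counts[item] += 1 if kind == "insert" else -1
    total = sum(counts.values())   # sum of n_y over all items y
    if total <= 0:
        return set()
    # n_x / total > 1/(k+1), written without division.
    return {x for x, n_x in counts.items() if n_x * (k + 1) > total}
```

For example, with net counts a=3, b=1, c=1 and k=2, only `a` exceeds the 1/3 threshold. At most k items can be hot at any time, since each hot item holds more than a 1/(k+1) share.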

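  9. Process Item (insert or delete)
     • Classify items into groups using universal hash functions
     • Initialize c[0..2Tk][0..log m] = 0, n = 0
     • T: # of hash functions; each splits the items into 2k groups, giving 2Tk groups in total
     • k: frequency threshold (f_x(t) > 1/(k+1))
     • n: current number of live items
     • for all (i, transType) do
         if (transType == insert) n = n + 1 else n = n - 1
         for x = 1 to T do
           index = hash(x, i)  // x-th hash function applied to item i, uniformly distributed
           UpdateCounters(i, transType, c[index])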

  10. Find hot sets
     • for i = 1 to 2Tk do  // for each group
         if c[i][0] ≥ n/(k+1)
           position = 0; t = 1
           for j = 1 to log m do
             if (c[i][j] ≥ n/(k+1)) position = position + t
             t = t*2
           output(position)
     • Similar to the algorithm to find the majority
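The update and search steps from the last two slides can be combined into one small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: items are integers in 0..m-1, each of the T hash functions is a hypothetical `(a*x + b) mod p mod 2k` function with random `a` and `b`, and the threshold test is the strict `>` from the hot-item definition.

```python
import random

class HotItemsSketch:
    PRIME = (1 << 61) - 1   # a Mersenne prime, assumed larger than any item id

    def __init__(self, m, k, T, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.width = 2 * k                        # groups per hash function
        self.bits = max(1, (m - 1).bit_length())  # bits per item identifier
        self.n = 0                                # number of live items
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(T)]
        # c[t][g][0] counts the live items in group g of hash function t;
        # c[t][g][j + 1] counts those whose bit j is set.
        self.c = [[[0] * (self.bits + 1) for _ in range(self.width)]
                  for _ in range(T)]

    def _group(self, t, item):
        a, b = self.hashes[t]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, delta):
        """delta = +1 for an insert, -1 for a delete."""
        self.n += delta
        for t in range(len(self.hashes)):
            counters = self.c[t][self._group(t, item)]
            counters[0] += delta
            for j in range(self.bits):
                if (item >> j) & 1:
                    counters[j + 1] += delta

    def candidates(self):
        """Decode one candidate from every group heavy enough to hold a hot item."""
        found = set()
        for table in self.c:
            for counters in table:
                if counters[0] * (self.k + 1) > self.n:
                    x = 0
                    for j in range(self.bits):
                        if counters[j + 1] * (self.k + 1) > self.n:
                            x |= 1 << j
                    found.add(x)
        return found
```

A group that holds one hot item and little else decodes to that item. A group with heavy collisions can decode to a wrong identifier, which is why several independent hash functions are used and why the output may contain false positives.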

  11. Error probability
     • Choosing hash range size |h| ≥ 2k and T = log2(k/δ), the algorithm outputs all hot items with probability at least 1-δ
     • Details of the proof (*, **)
     * Universal classes of hash functions, J. Comput. Syst. Sci. 1979
     ** the two papers currently presented
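To get a feel for the space these parameter choices imply, the following Python sketch computes the hash range, the number of hash functions and the total number of counters. Rounding T up to an integer and counting bits via `bit_length` are assumptions; the function name is hypothetical.

```python
import math

def sketch_parameters(k, delta, m):
    """Return (hash range |h|, number of hash functions T, total counters)."""
    width = 2 * k                                  # smallest |h| allowed by |h| >= 2k
    T = math.ceil(math.log2(k / delta))            # T = log2(k/delta), rounded up
    counters_per_group = max(1, (m - 1).bit_length()) + 1   # log2(m) bit counters + 1 total
    return width, T, T * width * counters_per_group
```

For k = 50, δ = 0.01 and m = 2^20 this gives |h| = 100, T = 13 and 27,300 counters, which is far fewer than one counter per possible item.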

  12. Experiments
     • Synthetic data:
         • Uniform inserts
         • Zipf inserts
         • Uniform deletes
         • 1,000,000 items
         • k=50 (hot items: f > 1/(k+1))
     • Real data:
         • Telephone connections (from AT&T)
         • 3.5 million transactions
         • Every 100,000 transactions, query (src, dest) pairs with frequency greater than 1%

  13. Results of synthetic data
     • Recall: proportion of the hot items that are found by the method
     • Precision: proportion of the items identified by the algorithm that are hot items
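The two measures can be computed from the reported set and the true set of hot items. A minimal Python sketch follows; returning 1.0 when a denominator is empty is a convention assumed here, not something the slides specify.

```python
def recall_precision(reported, truly_hot):
    """Return (recall, precision) of a reported set against the true hot items."""
    reported, truly_hot = set(reported), set(truly_hot)
    hits = len(reported & truly_hot)
    # Recall: share of the true hot items that were reported.
    recall = hits / len(truly_hot) if truly_hot else 1.0
    # Precision: share of the reported items that are truly hot.
    precision = hits / len(reported) if reported else 1.0
    return recall, precision
```

For instance, reporting {1, 2, 3, 4} when the hot items are {1, 2, 5} gives recall 2/3 and precision 1/2.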

  14. Results of real data

  15. Conclusion
     • Proposes a new method for identifying hot items
     • Copes with dynamic datasets
