
Approximate Frequency Counts over Data Streams

Approximate Frequency Counts over Data Streams. Gurmeet Singh Manku, Rajesh Motwani. Proceedings of the 28th VLDB Conference, 2002. Presenter: 吳建良. Motivation: in some new applications, data arrive as a continuous "stream", and the sheer volume of a stream over its lifetime is huge.



Presentation Transcript


  1. Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajesh Motwani Proceedings of the 28th VLDB Conference, 2002 Presenter: 吳建良

  2. Motivation • In some new applications, data come as a continuous “stream” • The sheer volume of a stream over its lifetime is huge • Response times of queries should be small • Examples: • Network traffic measurements • Market data

  3. Network Traffic Management • Frequent items: frequent-flow identification at an IP router • short-term monitoring • long-term management • [Figure callout: "ALERT: RED flow exceeds 1% of all traffic through me, check it!"]

  4. Mining Market Data • Frequent itemsets at a supermarket: store layout, catalog design, … • Example findings among 100 million records: (1) at least 1% of customers buy both beer and diapers at the same time; (2) 51% of customers who buy beer also buy diapers!

  5. Challenges • Single pass • Limited memory (network management) • Enumeration of itemsets (market data mining)

  6. General Solution • Data streams feed a stream processing engine, which keeps a summary in memory and returns (approximate) answers to queries

  7. Approximate Algorithms • Two proposed algorithms for frequent items: • Sticky Sampling • Lossy Counting • One proposed algorithm for frequent itemsets: • Extended Lossy Counting for frequent itemsets

  8. Properties of the proposed algorithms • All item(set)s whose true frequency exceeds sN are output • No item(set) whose true frequency is less than (s − ε)N is output • Estimated frequencies are less than the true frequencies by at most εN

  9. Sticky Sampling Algorithm • User input includes three parameters, namely: • Support threshold s • Error parameter ε • Probability of failure δ • Counts are kept in a data structure S • Each entry in S has the form (e, f), where: • e is the item • f is the estimated frequency of e in the stream • When queried about the frequent items, all entries (e, f) such that f ≥ (s − ε)N are output • N denotes the current length of the stream

  10. Sticky Sampling Algorithm (cont'd) • [Example figure: the stream is sampled into an initially empty S]

  11. Sticky Sampling Algorithm (cont'd) • When to prune S: at each sampling-rate change • 1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1 • 2. e ← next item; N ← N + 1 • 3. if (e, f) exists in S do • increment the count f • 4. else if random(0,1) < 1/r do • insert (e, 1) into S • endif • 5. if N = 2t · 2^n do • r ← 2r • Prune(S) • endif • 6. Goto 2 • S: the set of all counts, e: item, N: current length of stream, r: sampling rate, t: (1/ε) log(1/(sδ))

  12. Sticky Sampling Algorithm: Prune S • function Prune(S) • for every entry (e, f) in S do • while random(0,1) < 0.5 and f > 0 do • f ← f − 1 • if f = 0 do • remove the entry from S • endif
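To make slides 9–12 concrete, here is a minimal Python sketch of Sticky Sampling. The function names and the `frequent_items` helper are my own; the rate schedule follows the slide: the first 2t items are sampled at rate r = 1, the next 2t at r = 2, the next 4t at r = 4, and so on.

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, rng=random.random):
    """Sketch of Sticky Sampling. Returns (S, N): the counts and
    the stream length. S holds entries e -> estimated frequency f."""
    t = math.ceil((1.0 / eps) * math.log(1.0 / (s * delta)))
    S = {}             # item -> estimated frequency f
    r = 1              # current sampling rate
    boundary = 2 * t   # next point where the rate doubles
    N = 0
    for e in stream:
        N += 1
        if e in S:
            S[e] += 1
        elif rng() < 1.0 / r:          # sample new items with probability 1/r
            S[e] = 1
        if N == boundary:              # sampling rate changes: r doubles
            r *= 2
            boundary += r * t          # boundaries fall at 2t, 4t, 8t, ...
            for item in list(S):       # prune: diminish counts by coin tosses
                while rng() < 0.5 and S[item] > 0:
                    S[item] -= 1
                if S[item] == 0:
                    del S[item]
    return S, N

def frequent_items(S, N, s, eps):
    # answer a query: report all entries with f >= (s - eps) * N
    return {e for e, f in S.items() if f >= (s - eps) * N}
```

Note that while the stream is shorter than 2t items the sampling rate stays at 1, so every item is counted exactly; randomness only affects longer streams.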

  13. Lossy Counting Algorithm • The incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions • The current bucket id is denoted bcurrent = ⌈N/w⌉ • fe: the true frequency of e in the stream • Counts are kept in a data structure D • Each entry in D has the form (e, f, Δ), where: • e is the item • f is the estimated frequency of e in the stream • Δ is the maximum possible error in f

  14. Lossy Counting Algorithm (cont'd) • Example: ε = 0.2, w = 5, N = 17, bcurrent = 4 • Stream: Bucket 1 = A B C A B, Bucket 2 = E A C C D, Bucket 3 = D A B E D, Bucket 4 (so far) = F C • After Bucket 1: D = (A,2,0) (B,2,0) (C,1,0); prune D → (A,2,0) (B,2,0) • After Bucket 2: D = (A,3,0) (B,2,0) (C,2,1) (E,1,1) (D,1,1); prune D → (A,3,0) (C,2,1) • After Bucket 3: D = (A,4,0) (B,1,2) (C,2,1) (D,2,2) (E,1,2); prune D → (A,4,0) (D,2,2) • Current state: D = (A,4,0) (C,1,3) (D,2,2) (F,1,3)

  15. Lossy Counting Algorithm (cont'd) • When to prune D: at each bucket boundary • 1. D ← ∅; N ← 0 • 2. w ← ⌈1/ε⌉; bcurrent ← 1 • 3. e ← next item; N ← N + 1 • 4. if (e, f, Δ) exists in D do • f ← f + 1 • 5. else do • insert (e, 1, bcurrent − 1) into D • endif • 6. if N mod w = 0 do • prune(D, bcurrent) • bcurrent ← bcurrent + 1 • endif • 7. Goto 3 • D: the set of all counts, N: current length of stream, e: item, w: bucket width, bcurrent: current bucket id

  16. Lossy Counting Algorithm: prune D • function prune(D, bcurrent) • for each entry (e, f, Δ) in D do • if f + Δ ≤ bcurrent do • remove the entry from D • endif
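The pseudocode on slides 13–16 can be sketched in Python as follows (a toy version; the function names are my own):

```python
import math

def lossy_counting(stream, eps):
    """Sketch of Lossy Counting. D maps item e -> (f, delta):
    the estimated frequency and the maximum possible undercount.
    The stream is split into buckets of width w = ceil(1/eps),
    and D is pruned at every bucket boundary."""
    w = math.ceil(1.0 / eps)   # bucket width
    D = {}
    N = 0
    b_current = 1
    for e in stream:
        N += 1
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            D[e] = (1, b_current - 1)   # new entry with maximal error
        if N % w == 0:                  # bucket boundary: prune D
            D = {e: (f, d) for e, (f, d) in D.items()
                 if f + d > b_current}
            b_current += 1
    return D, N

def frequent_items(D, N, s, eps):
    # answer a query: report all items with f >= (s - eps) * N
    return {e for e, (f, d) in D.items() if f >= (s - eps) * N}
```

Running `lossy_counting(list("ABCABEACCDDABEDFC"), 0.2)` replays slide 14's stream and reproduces its final state: entries (A,4,0), (C,1,3), (D,2,2), (F,1,3) with N = 17.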

  17. Lossy Counting Algorithm (cont'd) • Four Lemmas • Lemma 1: whenever deletions occur, bcurrent ≤ εN • Lemma 2: whenever an entry (e, f, Δ) gets deleted, fe ≤ bcurrent • Lemma 3: if e does not appear in D, then fe ≤ εN • Lemma 4: if (e, f, Δ) ∈ D, then f ≤ fe ≤ f + εN

  18. Extended Lossy Counting for Frequent Itemsets • The incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions • Counts are kept in a data structure D • Multiple buckets (β of them, say) are processed in a batch • Each entry in D has the form (set, f, Δ), where: • set is the itemset • f is the approximate frequency of set in the stream • Δ is the maximum possible error in f

  19. Extended Lossy Counting for Frequent Itemsets (cont'd) • [Figure: β = 3 buckets (Bucket 1, Bucket 2, Bucket 3) are read into main memory at one time]

  20. Overview of the algorithm • D is updated by the operations UPDATE_SET and NEW_SET • UPDATE_SET updates and deletes entries in D • For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry • If an updated entry satisfies f + Δ ≤ bcurrent, the entry is removed from D • NEW_SET inserts new entries into D • If a set set has frequency f ≥ β in the batch and set does not occur in D, create a new entry (set, f, bcurrent − β)
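The batch update above can be sketched in Python. This is a deliberately naive version: instead of the paper's Buffer/Trie/SetGen machinery (slides 21–22), it enumerates every subset of every transaction, so it is only practical for short transactions; all names are my own.

```python
import math
from itertools import combinations

def lossy_counting_itemsets(transactions, eps, beta):
    """Sketch of extended Lossy Counting for itemsets.
    Processes beta buckets of w = ceil(1/eps) transactions per batch.
    (Transactions in a trailing partial batch are left unprocessed.)"""
    w = math.ceil(1.0 / eps)
    D = {}                        # frozenset -> (f, delta)
    b_current = 0
    batch = []
    for i, t in enumerate(transactions, 1):
        batch.append(frozenset(t))
        if i % (beta * w) == 0:   # a full batch of beta buckets is in memory
            b_current += beta
            _process_batch(D, batch, b_current, beta)
            batch = []
    return D

def _process_batch(D, batch, b_current, beta):
    # naive SetGen: count every subset occurring in the batch
    counts = {}
    for t in batch:
        for k in range(1, len(t) + 1):
            for sub in combinations(sorted(t), k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1
    # UPDATE_SET: update existing entries, then prune
    for key, (f, d) in list(D.items()):
        f += counts.pop(key, 0)
        if f + d <= b_current:
            del D[key]
        else:
            D[key] = (f, d)
    # NEW_SET: insert sets with batch frequency >= beta
    for key, c in counts.items():
        if c >= beta:
            D[key] = (c, b_current - beta)
```

For example, with eps = 0.5 (w = 2), beta = 2, and the four transactions ab, ab, ac, a, the single batch yields entries for a, b, and ab, while the infrequent c and ac are never inserted.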

  21. Implementation • Challenges: • Not to enumerate all subsets of a transaction • Data structure must be compact for better space efficiency • 3 major modules: • Buffer • Trie • SetGen

  22. Implementation(cont’d) • Buffer: repeatedly reads in a batch of buckets of transactions, into available main memory • Trie: maintains the data structure D • SetGen: generates subsets of item-id’s along with their frequency counts in the current batch • Not all possible subsets need to be generated • If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered

  23. Example • Main memory holds Bucket 3 = {ACE, BCD, AB} and Bucket 4 = {ABC, AD, BCE} • SetGen output per transaction: ACE: AC, A, C; BCD: BC, B, C; AB: AB, A, B; ABC: AB, AC, BC, A, B, C; AD: A; BCE: BC, B, C • D before: (A,5,0) (B,3,0) (C,3,0) (D,2,0) (AB,2,0) (AC,3,0) (AD,2,0) (BC,2,0) • After UPDATE_SET: D = (A,9,0) (B,7,0) (C,7,0) (AC,5,0) (BC,5,0) • NEW_SET: add (AB,2,2) into D

  24. Experiments • IBM synthetic dataset T10.I4.1000K: N = 1 million, avg. transaction size = 10, input size = 49 MB • IBM synthetic dataset T15.I6.1000K: N = 1 million, avg. transaction size = 15, input size = 69 MB • Frequent word pairs in 100K web documents: N = 100K, avg. transaction size = 134, input size = 54 MB • Frequent word pairs in 806K Reuters news reports: N = 806K, avg. transaction size = 61, input size = 210 MB

  25. Varying support s and BUFFER B • [Charts: running time in seconds vs. BUFFER size in MB, for support thresholds s from 0.001 to 0.020; IBM 1M transactions and Reuters 806K docs] • Fixed: stream length N • Varying: BUFFER size B and support threshold s

  26. Varying length N and support s • [Charts: running time in seconds vs. stream length in thousands, for s = 0.001, 0.002, 0.004; IBM 1M transactions and Reuters 806K docs] • Fixed: BUFFER size B • Varying: stream length N and support threshold s

  27. Varying BUFFER B and support s • [Charts: running time in seconds vs. support threshold s, for B = 4, 16, 28, 40 MB; IBM 1M transactions and Reuters 806K docs] • Fixed: stream length N • Varying: BUFFER size B and support threshold s

  28. Comparison with fast A-priori Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.

  29. Number of counters • Sticky Sampling, expected: (2/ε) log(1/(sδ)) • Lossy Counting, worst case: (1/ε) log(εN) • [Charts: number of counters vs. N (stream length) and vs. log10 of N, with support s = 1% and error ε = 0.1%]
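For intuition, the two space bounds can be evaluated numerically. A sketch: natural log is assumed, and the failure probability δ = 0.01% is an illustrative choice not given on the slide.

```python
import math

def sticky_sampling_counters(s, eps, delta):
    # expected number of entries: (2/eps) * log(1/(s*delta))
    # note: independent of the stream length N
    return (2.0 / eps) * math.log(1.0 / (s * delta))

def lossy_counting_counters(eps, N):
    # worst-case number of entries: (1/eps) * log(eps*N)
    # note: grows (slowly) with the stream length N
    return (1.0 / eps) * math.log(eps * N)

# slide parameters: s = 1%, eps = 0.1%; delta = 0.01% is an assumed value
print(sticky_sampling_counters(0.01, 0.001, 0.0001))
print(lossy_counting_counters(0.001, 10**6))
```

This matches the shape of the chart: Sticky Sampling's expected counter count is a flat line in N, while Lossy Counting's worst-case bound keeps growing logarithmically.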
