Approximate Frequency Counts over Data Streams

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku (Standford) Rajeev Motwani (Standford) Presented by Michal Spivak November, 2003

Stream The Problem… Identify all elements whose current frequency exceeds support threshold s = 0.1%.

Stream Related problem… Identify all subsets of items whose current frequency exceeds s=0.1%

Purpose of this paper Present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams with the following advantages: • Simple • Low memory footprint • Output is approximate but guaranteed not to exceed a user specified error parameter. • Can be deployed for streams of singleton items and handle streams of variable sized sets of items.

Overview • Introduction • Frequency counting applications • Problem definition • Algorithm for Frequent Items • Algorithm for Frequent Sets of Items • Experimental results • Summary

Introduction

Motivating examples • Iceberg Query Perform an aggregate function over an attribute and eliminate those below some threshold. • Association RulesRequire computation of frequent itemsets. • Iceberg DatacubesGroup by’s of a CUBE operator whose aggregate frequency exceeds threshold • Traffic measurement Require identification of flows that exceed a certain fraction of total traffic

What’s out there today… • Algorithms that compute exact results • Attempt to minimize number of data passes (best algorithms take two passes). Problems when adapted to streams: • Only one pass is allowed. • Results are expected to be available with short response time. • Fail to provide any a-priori guarantee on the quality of their output.

Why Streams?Streams vs. Stored data • Volume of a stream over its lifetime can be huge • Queries for streams require timely answers, response times need to be small As a result it is not possible to store the stream as an entirety.

Frequency counting applications

Existing applications for the following problems • Iceberg Query Perform an aggregate function over an attribute and eliminate those below some threshold. • Association RulesRequire computation of frequent itemsets. • Iceberg DatacubesGroup by’s of a CUBE operator whose aggregate frequency exceeds threshold • Traffic measurement Require identification of flows that exceed a certain fraction of total traffic

Iceberg QueriesIdentify aggregates that exceed a user-specified threshold r One of the published algorithms to compute iceberg queries efficiently uses repeated hashing over multiple passes.* Basic Idea: • In the first pass a set of counters is maintained • Each incoming item is hashed to one of the counters which is incremented • These counters are then compressed to a bitmap, with a 1 denoting large counter value • In the second pass exact frequencies are maintained for only those elements that hash to a counter whose bitmap value is 1 This algorithm is difficult to adapt for streams because it requires two passes * M. FANG, N. SHIVAKUMAR, H. GARCIA-MOLINA,R. MOTWANI, AND J. ULLMAN. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299–310, 1998.

Association Rules Definitions • Transaction – subset of items drawn from I, the universe of all Items. • Itemset X I has support s if X occurs as a subset at least a fraction - s of all transactions • Associations rules over a set of transactions are of the form X=>Y, where X and Y are subsets of I such that X∩Y = 0 and XUY has support exceeding a user specified threshold s. • Confidence of a rule X => Y is the value support(XUY) / support(X) U|

Example - Market basket analysis • For support = 50%, confidence = 50%, we have the following rules • 1 => 3 with 50% support and 66% confidence • 3 => 1 with 50% support and 100% confidence

Reduce to computing frequent itemsets For support = 50%, confidence = 50% • For the rule 1 => 3: • Support = Support({1, 3}) = 50% • Confidence = Support({1,3})/Support({1}) = 66%

Toivonen’s algorithm • Based on sampling of the data stream. • Basically, in the first pass, frequencies are computed for samples of the stream, and in the second pass these the validity of these items is determined.Can be adapted for data stream • Problems:- false negatives occur because the error in frequency counts is two sided- for small values of e, the number of samples is enormous ~ 1/e (100 million samples)

Network flow identification • Flow – sequence of transport layer packets that share the same source+destination addresses • Estan and Verghese proposed an algorithm for this identifying flows that exceed a certain threshold.The algorithm is a combination of repeated hashing and sampling, similar to those for iceberg queries. • Algorithm presented in this paper is directly applicable to the problem of network flow identification. It beats the algorithm in terms of space and requirements.

Problem definition

Problem Definition • Algorithm accepts two user-specified parameters- support threshold s E (0,1)- error parameter εE (0,1)- ε << s • N – length of stream (i.e no. of tuples seen so far) • Itemset – set of items • Denote item(set) to be item or itemset • At any point of time, the algorithm can be asked to produce a list of item(set)s along with their estimated frequency.

Approximation guarantees • All item(set)s whose true frequency exceeds sN are output. There are no false negatives. • No item(set)s whose true frequency is less than (s- ε(N is output. • Estimated frequencies are less than true frequencies by at mostεN

Input Example • S = 0.1% • ε as a rule of thumb, should be set to one-tenth or one-twentieth of s. ε = 0.01% • As per property 1, ALL elements with frequency exceeding 0.1% will be output. • As per property 2, NO element with frequency below 0.09% will be output • Elements between 0.09% and 0.1% may or may not be output. Those that “make their way” are false positives • As per property 3, all individual frequencies are less than their true frequencies by at most 0.01%

Problem Definition cont… • An algorithm maintains an ε-deficient synopsis if its output satisifies the aforementioned properties • Goal:to devise algorithms to support ε-deficient synopsis using as little main memory as possible

The Algorithms for frequent Items Sticky Sampling Lossy Counting

Stream Sticky Sampling Algorithm 34 15 30 28 31 41 23 35 19  Create counters by sampling

Notations… • Data structure S - set of entries of the form (e,f) • f – estimates the frequency of an element e. • r – sampling rate. Sampling an element with rate = r means we select the element with probablity = 1/r

Sticky Sampling cont… • Initially – S is empty, r = 1. • For each incoming element eif (e exists in S) increment corresponding felse { sample element with rate r if (sampled) add entry (e,1) to S else ignore }

The sampling rate • Let t = 1/ ε log(s-1-1) ( = probability of failure) • First 2t elements are sampled at rate=1 • The next 2t elements at rate=2 • The next 4t elements at rate=4 and so on…

Sticky Sampling cont… Whenever the sampling rate r changes: for each entry (e,f) in S repeat { toss an unbiased coin if (toss is not successful) diminsh f by one if (f == 0) { delete entry from S break }} until toss is successful

Sticky Sampling cont… • The number of unsuccessful coin tosses folows a geometric distribution. • Effectively, after each rate change S is transformed to exactly the state it would have been in, if the new rate had been used from the beginning. • When a user requests a list of items with threshold s, the output are those entries in S where f ≥ (s –ε)N

Theorem 1 Sticky Sampling computes an ε-deficient synopsis with probability at least 1 -  using at most 2/ ε log(s-1-1) expected number of entries.

Theorem 1 - proof • First 2t elements find their way into S • When r ≥ 2 N = rt + rt` ( t`E [1,t) ) => 1/r ≥ t/N • Error in frequency corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e.the probability that this length exceeds εN is at most (1 – 1/r)εN < (1 – t/N)-εN < e-εt • Number of elements with f > s is no more than 1/s => the probability that the estimate for any of them is deficient by εN is at most e-εt/s

Theorem 1 – proof cont… • Probability of failure should be at most . This yieldse-εt/s <  • t ≥ 1/ ε log(s-1-1) since the space requirements are 2t, the space bound follows…

Sticky Sampling summary • The algorithm name is called sticky sampling because S sweeps over the stream like a magnet, attracting all elements which already have an entry in S • The space complexity is independent of N • The idea of maintaining samples was first presented by Gibbons and Matias who used it to solve the top-k problem. • This algorithm is different in that the sampling rate r increases logarithmically to produce ALL items with frequency > s, not just the top k

bucket 1 bucket 3 bucket 2 Lossy Counting Divide the stream into buckets Keep exact counters for items in the buckets Prune entrys at bucket boundaries

Lossy Counting cont… • A deterministic algorithm that computes frequency counts over a stream of singleitem transactions, satisfying the guarantees outlined in Section 3 using at most 1/εlog(εN) space where N denotes the current length of the stream. • The user specifies two parameters:- support s- error ε

Definitions • The incoming stream is conceptually divided into buckets of width w = ceil(1/e) • Buckets are labeled with bucket ids, starting from 1 • Denote the current bucket id by bcurrent whose value is ceil(N/w) • Denote feto be the true frequency of an element e in the stream seen so far • Data stucture D is a set of entries of the form (e,f,D)

The algorithm • Initially D is empty • Receive element eif (e exists in D) increment its frequency (f) by 1else create a new entry (e, 1, bcurrent– 1) • If it bucket boundary prune D by the following the rule:(e,f,D) is deleted if f + D≤ bcurrent • When the user requests a list of items with threshold s, output those entries in D where f ≥ (s –ε)N

Some algorithm facts • For an entry (e,f,D) f represents the exact frequency count for e ever since it was inserted into D. • The value D is the maximum number of times e could have occurred in the first bcurrent– 1 buckets ( this value is exactly bcurrent– 1) • Once a value is inserted into D its value D is unchanged

Frequency Counts + First Bucket Lossy counting in action D is Empty At window boundary, remove entries that for themf+D ≤bcurrent

Frequency Counts + Next Bucket Lossy counting in action cont At window boundary, remove entries that for themf+D ≤bcurrent

Lemma 1 • Whenver deletions occur, bcurrent≤eN Proof: N = bcurrentw + neN = bcurrent + eneN ≥ bcurrent

Lemma 2 • Whenever an entry (e,f,D) gets deleted fe≤ bcurrent Proof by induction • Base case: bcurrent = 1 (e,f,D) is deleted only if f = 1 Thus fe ≤ bcurrent (fe = f) • Induction step: - Consider (e,f,D) that gets deleted for some bcurrent > 1. - This entry was inserted when bucket D+1 was being processed. - It was deleted at late as the time as bucket D became full. - By induction the true frequency for e was no more than D. - f is the true frequency of e since it was inserted.- fe ≤ f+D combined with the deletion rule f+D ≤ bcurrent => fe ≤ bcurrent

Lemma 3 • If e does not appear in D, then fe≤eN Proof: If the lemma is true for an element e whenever it gets deleted, it is true for all other N also.From lemmas 1, 2 we infer that fe ≤ eN whenever it gets deleted.

Lemma 4 • If (e,f,D) E D, then f ≤ fe≤ f + eN Proof: If D=0, f=fe. Otherwisee was possibly deleted in the first D buckets.From lemma 2 fe ≤ f+DD ≤ bcurrent– 1 ≤ eN Conclusionf ≤ fe≤ f + eN

Lossy Counting cont… • Lemma 3 shows that all elements whose true frequency exceed eN have entries in D • Lemma 4 shows that the estimated frequency of all such elements are accurate to within eN => D correctly maintains an e-deficient synopsis

Theorem 2 • Lossy counting computes an e-deficient synopsis using at most 1/elog(eN) entries

Theorem 2 - proof • Let B = bcurrent • di– denote the number of entries in D whose bucket id is B - i + 1 (iE[1,B]) • e corresponding to di must occur at least i times in buckets B-i+1 through B, otherwise it would have been deleted • We get the following constraint:(1) Sidi ≤ jw for j = 1,2,…B. i = 1..j

Theorem 2 – proof • The following inequality can be proved by induction:Sdi ≤ Sw/ifor j = 1,2,…B i = 1..j • |D| = Sdi for i = 1..B • From the above inequality |D| ≤ Sw/i ≤ 1/elogB = 1/elog(eN) z

No of entries Sticky Sampling vs. Lossy counting Log10 of N (stream length) Support s = 1% Error ε = 0.1%

No of entries N (stream length) Sticky Sampling vs.Lossy counting cont… Kinks in the curve for sticky sampling correspond to re-sampling Kinks in the curve for lossy counting correspond to bucket boundaries

Approximate Frequency Counts over Data Streams

Approximate Frequency Counts over Data Streams

Presentation Transcript

Managing Data Streams

Data Streams

Frequency Counts

Approximate Data Exchange

Approximate Counting of Cycles in Streams

Data Streams

Data Streams

Mining Data Streams

Approximate Counts and Quantiles over Sliding Windows

Continuous Queries over Data Streams

Continuously Maintaining Order Statistics Over Data Streams

Approximate Query Processing (AQP) in Data Streams

Multiple Aggregations Over Data Streams

Optimal Approximations of the Frequency Moments of Data Streams

Data Streams

Approximate Selection Queries over Imprecise Data

Adaptive Frequency Counting over Bursty Data Streams

dQUOB: SQL queries over data streams

Multiple Aggregations Over Data Streams

Approximate Frequency Counts over Data Streams