Introduction to Data Stream. DBG@UNSW. Acknowledgement. Some of the slides are modified from. Nikos Koudas (Toronto U) Minos Garofalakis ( yahoo! research) Divesh Srivastava (AT & T) S. Muthukrishnan (Rutgers) Georges Hébrail (ENST Paris). Outline.
Introduction to Data Stream DBG@UNSW
Acknowledgement
Some of the slides are modified from
• Nikos Koudas (Toronto U)
• Minos Garofalakis (Yahoo! Research)
• Divesh Srivastava (AT&T)
• S. Muthukrishnan (Rutgers)
• Georges Hébrail (ENST Paris)
Outline
• Introduction and Applications
• Data Stream Management System
• Modeling for Data Streams
• Approximation Technique
• Data Stream Computation
What is a data stream?
• Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
• Structured records (not audio or video data)
• Massive volumes of data; records arrive at a high rate
Data stream applications
• Transactional data streams: log interactions between entities
  • Credit card: purchases by consumers from merchants
  • Telecommunications: phone calls by callers to dialed parties
  • Web: accesses by clients of resources at servers
• Measurement data streams: monitor evolution of entity states
  • IP network: traffic at router interfaces
  • Sensor networks: physical phenomena, road traffic
  • Earth climate: temperature, moisture at weather stations
Application: Network Management
• Network management involves monitoring and configuring network hardware and software to ensure smooth operation
• Quickly detect faults, congestion, attacks
• QoS
• Load balancing, improve utilization of network resources
[Figure: network supervision center]
Network management (more details)
• Traffic estimation
  • What fraction of network IP addresses are active?
  • How many bytes were sent between a pair of IP addresses?
  • List the top 100 IP addresses in terms of traffic
• Traffic analysis
  • What is the average duration of an IP session?
  • What is the median of the number of bytes in each IP session?
• Fraud detection
  • List all sessions that transmitted more than 1000 bytes
  • Identify all sessions whose duration was more than twice the normal
• Security/Denial of Service
  • List all IP addresses that have witnessed a sudden spike in traffic
  • Identify IP addresses involved in more than 1000 sessions
Application: stock monitoring
• Stream of price and sales volume of stocks over time
• Technical analysis/charting for stock investors
• Support trading decisions
  • Notify me when some stock goes up by at least 5%.
  • Notify me when the price of any stock increases monotonically for ≥ 40 min.
Challenges
• Massive in volume, or even infinite
  • AT&T long-distance: ~300M call tuples/day
  • AT&T IP backbone: ~50B IP flows/day
• Rapid arrival rate
• Real-time monitoring (response) required
• Continuous queries
Outline
• Introduction and Applications
• Data Stream Management System
• Modeling for Data Streams
• Approximation Technique
• Data Stream Computation
DSMS
DBMS: Data Base Management System
• Data model (relational)
• Data is stored on disk
• SQL language
  • Creating structures
  • Inserting/updating/deleting data
  • Retrieving data (query)
• Good performance even with large volumes of data
DSMS: Data Stream Management System
• Data model (streams and permanent relations)
• Permanent relations are stored on disk, but streams are processed on the fly
• SQL-like query language
  • Standard SQL on permanent relations
  • Extended SQL on streams with windowing features
• New paradigm of queries (continuous queries)
• Tools for capturing input streams and producing output streams
• Good performance: optimization of computer resources
Existing DSMS
Principal specialized DSMSs
• Gigascope and Hancock: AT&T
  • Network monitoring
  • Analysis of telecommunication calls
• NiagaraCQ: University of Wisconsin-Madison
  • Large number of continuous queries on web content (XML-QL)
• Tradebot (finance), StatStream (statistics)
Principal general-purpose DSMSs
• STREAM: Stanford University
• TelegraphCQ: UC Berkeley
• Aurora: Brown University, MIT, Brandeis
Sensor networks
• Cougar: Cornell University
• TinyDB: UC Berkeley
STREAM from Stanford
[Figure: STREAM architecture. Labels: Input streams, Register Query, DSMS, Streamed Result, Stored Result, Scratch Store, Archive, Stored Relations]
STREAM (cont.)
• General-purpose DSMS for streams and stored data
• Relational (unlikely to change)
• Centralized server model (likely to change)
• Single-threaded and parallel versions
• Declarative language for registering continuous queries (CQL)
• Query optimization with good memory management
• Approximate answers with synopsis management
STREAM (cont.): Some Implementation Issues
• Designed to cope with:
  • Stream rates that may be high, variable, bursty
  • Continuous query loads that may be high, volatile
• Primary coping techniques
  • Continuous self-monitoring and reoptimization
  • Graceful approximation as necessary
  • Careful resource allocation and use
Outline
• Introduction and Applications
• Data Stream Management System
• Modeling for Data Streams
• Approximation Technique
• Data Stream Computation
Models for data streams
• Structure of a stream
  • Infinite sequence of items (elements)
  • One item: structured information, i.e. a tuple or object
  • Same structure for all items in a stream
• Timestamping
  • Explicit (date field in data)
  • Implicit (timestamp given when items arrive)
• Representation of time
  • Physical (date)
  • Logical (integer)
Models for data streams (cont.)
• One-dimensional array A[1…N] with values A[i] all initially zero
• Signal is implicitly represented via a stream of updates
• The j-th update is <k, c[j]>, implying A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0)
• Goal: compute functions on A[ ] subject to
  • Small space
  • Fast processing of updates
  • Fast function computation
  • …
Models for data streams (cont.)
• Time-Series Model
  • Only the j-th update updates A[j] (i.e., A[j] := c[j])
• Cash-Register Model
  • c[j] is always >= 0 (i.e., increment-only)
  • Typically c[j] = 1, so we see a multi-set of items in one pass
• Turnstile Model
  • Most general streaming model
  • c[j] can be >= 0 or < 0 (i.e., increment or decrement)
• Problem difficulty varies depending on the model
  • E.g., MIN/MAX in Time-Series vs. Turnstile!
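
The three update models can be sketched in a few lines of Python. This is only an illustration: the function name `apply_updates`, the `model` strings, and the use of a sparse dictionary for A are assumptions made here, not part of the slides.

```python
from collections import defaultdict

def apply_updates(updates, model="turnstile"):
    """Apply a stream of <k, c> updates to a sparse array A (all entries start at 0).

    `model` is a hypothetical switch between the three models on the slide.
    """
    A = defaultdict(int)
    for j, (k, c) in enumerate(updates, start=1):
        if model == "time-series":
            # Only the j-th update touches A[j]; the key k is ignored here.
            A[j] = c
        elif model == "cash-register":
            # Increment-only: negative updates are not allowed.
            if c < 0:
                raise ValueError("cash-register updates must be >= 0")
            A[k] += c
        elif model == "turnstile":
            # Most general model: increments and decrements.
            A[k] += c
        else:
            raise ValueError(f"unknown model: {model}")
    return dict(A)

# Turnstile example: the item 2 is inserted and later deleted.
stream = [(3, 1), (1, 1), (2, 1), (4, 1), (2, -1), (3, 1), (5, 1)]
frequencies = apply_updates(stream, model="turnstile")
```

Feeding the same stream to the cash-register model raises an error at the deletion, which is one way to see why the turnstile model is the harder setting.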
Windowing
• Applying queries/mining tasks to the whole stream (from beginning to current time)
• Applying queries/mining tasks to a portion of the stream
[Figure: timeline t from the beginning of the stream to the current date, with a window on the stream]
Windowing (cont.)
Definition of windows of interest on streams
• Fixed windows: September 2007
• Sliding windows: last 3 hours (n of N window)
• Landmark windows: from September 1st, 2007
Window specification
• Physical time: last 3 hours
• Logical time: last 1000 items
Refreshing rate
• Rate of producing results (every item, every 10 items, every minute, …)
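
As a rough illustration of the two window specifications, the sketch below keeps a sliding window by logical time (last n items) and by physical time (last `span` time units). The class names `CountWindow` and `TimeWindow` and the numeric timestamps are assumptions for this example only.

```python
from collections import deque

class CountWindow:
    """Sliding window by logical time: keeps the last n items."""
    def __init__(self, n):
        self.items = deque(maxlen=n)  # old items fall out automatically

    def add(self, x):
        self.items.append(x)

    def average(self):
        return sum(self.items) / len(self.items)

class TimeWindow:
    """Sliding window by physical time: keeps items newer than `span` time units."""
    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, value), in arrival order

    def add(self, ts, x):
        self.items.append((ts, x))
        # Evict everything that is `span` or more time units older than ts.
        while self.items and self.items[0][0] <= ts - self.span:
            self.items.popleft()

    def values(self):
        return [v for _, v in self.items]

cw = CountWindow(3)
for x in [1, 2, 3, 4, 5]:
    cw.add(x)

tw = TimeWindow(span=10)
for ts, x in [(0, "a"), (5, "b"), (12, "c")]:
    tw.add(ts, x)
```

The refreshing rate is then a separate decision: a result can be produced after every `add`, or only every k items or every minute.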
Computation Model
[Figure: data streams enter a stream processing engine that keeps a synopsis in memory and outputs an (approximate) answer]
• Stream processing requirements
  • Single pass: each record is examined at most once
  • Bounded storage: limited memory (M) for storing the synopsis
  • Real-time: per-record processing time (to maintain the synopsis) must be low
  • Data independent: no a priori knowledge required about the data set (size, range, distribution, order)
Outline
• Introduction and Applications
• Data Stream Management System
• Modeling for Data Streams
• Approximation Technique
• Data Stream Computation
Approximation
• The exact answer is too expensive to compute
  • May need more memory than is affordable (distinct count, median)
  • May take too long to complete
• An approximate answer is acceptable in many applications
  • ε-approximate answers (absolute error / relative error)
  • For example, with exact answer E = 100 and relative error ε = 0.1, any answer in [90, 110] is acceptable
• Only a small amount of memory is needed
• Computed very quickly
• Error is guaranteed to be small
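
The E = 100, ε = 0.1 example can be checked mechanically. The two helper names below are hypothetical; they only spell out the difference between a relative and an absolute error bound.

```python
def within_relative_error(estimate, exact, eps):
    """True if |estimate - exact| <= eps * |exact|."""
    return abs(estimate - exact) <= eps * abs(exact)

def within_absolute_error(estimate, exact, eps):
    """True if |estimate - exact| <= eps."""
    return abs(estimate - exact) <= eps

# With E = 100 and eps = 0.1, a relative-error answer may be off by up to 10.
ok = within_relative_error(95, 100, 0.1)
too_far = within_relative_error(111, 100, 0.1)
```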
Approximation (cont.)
• Deterministic approximate methods
  • Deterministic algorithms carefully control the error.
• Non-deterministic approximate methods
  • Randomization, sampling, etc.
  • Provide a good approximation with high probability.
Basic synopses
Basic stream synopses computation
• Samples: answering queries using samples. Reservoir sampling, inverse sampling
• Histograms: equi-depth histograms, on-line quantile computation
• Wavelets: Haar-wavelet histogram construction & maintenance
• Sketches: AMS sketch, CM sketch, FM sketch
Sampling
• Idea: a small random sample S of the data often represents all the data well
• For a fast approximate answer, apply a “modified” query to S
• Example: select agg from R (n = 12)
  • If agg is avg, return the average of the elements in S
  • Number of odd elements?
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
answer: 11.5
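
One common way to maintain such a sample in a single pass is reservoir sampling. The sketch below is a minimal version of the classic algorithm (each of the first n items ends up in the sample with probability k/n), followed by the two "modified" queries from the example. The function names and the `rng` parameter are assumptions of this illustration.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(n)      # uniform integer in [0, n)
            if j < k:                 # happens with probability k/n
                sample[j] = x         # replace a random slot
    return sample

def estimate_avg(sample):
    """avg over the stream is estimated by the avg over the sample."""
    return sum(sample) / len(sample)

def estimate_count(sample, n, predicate):
    """count over n stream items: scale the sample fraction by n."""
    hits = sum(1 for x in sample if predicate(x))
    return n * hits / len(sample)

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
S = [9, 5, 1, 8]                      # the sample shown on the slide
avg_estimate = estimate_avg(S)
odd_estimate = estimate_count(S, len(stream), lambda x: x % 2 == 1)
fresh_sample = reservoir_sample(stream, 4, random.Random(0))
```

Different runs give different samples, so the estimates are random variables; that is what the probabilistic guarantees on the following slides are about.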
Probabilistic Guarantees
• Example: the actual answer is within 11.5 ± 1 with prob 0.9
• Randomized algorithms: the answer returned is a specially-built random variable.
• Use tail inequalities to give probabilistic bounds on the returned answer
  • Markov Inequality
  • Chebyshev’s Inequality
  • Chernoff/Hoeffding Bound
Basic Tools: Tail Inequalities
[Figure: probability distribution with the tail probability marked]
• General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)
• Basic inequalities: let X be a random variable with expectation $\mu$ and variance $\mathrm{Var}[X]$. Then for any $\epsilon > 0$:
  • Markov (for $X \ge 0$): $\Pr[X \ge \epsilon] \le \mu / \epsilon$
  • Chebyshev: $\Pr[|X - \mu| \ge \epsilon] \le \mathrm{Var}[X] / \epsilon^2$
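Tail Inequalities (cont.)
• Hoeffding’s Inequality: let $X_1, \ldots, X_m$ be independent random variables with $0 \le X_i \le r$. Let $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$ and let $\mu$ be the expectation of $\bar{X}$. Then, for any $\epsilon > 0$: $\Pr[|\bar{X} - \mu| \ge \epsilon] \le 2\exp(-2m\epsilon^2 / r^2)$
• Chernoff Bound (… )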
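
To get a feel for the numbers, the sketch below evaluates Hoeffding's bound in its standard two-sided form, Pr[|X̄ − μ| ≥ ε] ≤ 2·exp(−2mε²/r²), and inverts it to find how many samples are enough for a target failure probability δ. The function names are hypothetical.

```python
import math

def hoeffding_bound(m, eps, r=1.0):
    """Upper bound on Pr[|mean - mu| >= eps] for m independent samples in [0, r]."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * eps ** 2 / r ** 2))

def samples_needed(eps, delta, r=1.0):
    """Smallest m for which the bound above is at most delta."""
    return math.ceil(r ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

# Values in [0, 1], error 0.1, failure probability 0.1.
m = samples_needed(eps=0.1, delta=0.1)
bound = hoeffding_bound(m, eps=0.1)
```

Note that the required sample size does not depend on the length of the stream, only on ε, δ and the value range r.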
Histogram
• Histograms approximate the frequency distribution of element values in a stream
• A histogram (typically) consists of
  • A partitioning of element domain values into buckets
  • A count per bucket B (of the number of elements in B)
[Figure: histogram; x-axis: domain values 1 to 20, y-axis: count for bucket]
Wavelet
• Wavelets: mathematical tool for hierarchical decomposition of functions/signals
• Haar wavelets: simplest wavelet, easy to understand and implement
  • Recursive pairwise averaging and differencing at different resolutions

Resolution | Averages                 | Detail Coefficients
3          | [2, 2, 0, 2, 3, 5, 4, 4] | ----
2          | [2, 1, 4, 4]             | [0, -1, -1, 0]
1          | [1.5, 4]                 | [0.5, 0]
0          | [2.75]                   | [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
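
The pairwise averaging and differencing can be written as a short loop. This is a sketch using the unnormalized convention of the example (average = (a+b)/2, detail = (a−b)/2); the function name `haar_decompose` is an assumption, and the input length is assumed to be a power of two.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition of a list whose length is a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones.
        details = level_details + details
    return averages + details

coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
```

Keeping only the largest coefficients (and treating the rest as zero) is what turns this decomposition into a compact synopsis.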
Outline
• Introduction and Applications
• Data Stream Management System
• Modeling for Data Streams
• Approximation Technique
• Data Stream Computation
Frequency Related Problems
• Top-k most frequent elements
• Find elements that occupy 0.1% of the tail
• Find all elements with frequency > 0.1%
• What is the frequency of element 3?
• What is the total frequency of elements between 8 and 14?
• How many elements have non-zero frequency? (distinct number)
[Figure: frequency histogram over elements 1 to 20]
An Old Chestnut: Majority
• A sequence of N items.
• You have constant memory.
• In one pass, decide if some item is in the majority (occurs > N/2 times).
Stream: 2 9 9 9 7 6 4 9 9 9 3 9
N = 12; item 9 is the majority
Any idea?
Misra-Gries Algorithm (‘82)
• Keep a counter and an ID.
• If the new item is the same as the stored ID, increment the counter.
• Otherwise, decrement the counter.
• If the counter is 0 when an item arrives, store that item with count = 1.
• If the counter is > 0 at the end, its item is the only candidate for majority.
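
The counter-and-ID scheme fits in a few lines. The sketch below follows the steps on the slide; the names `majority_candidate` and `is_majority` are made up for this example, and the second function is the extra pass needed to confirm that the candidate really is a majority.

```python
def majority_candidate(stream):
    """One pass, constant memory: return the only possible majority item (or None)."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # empty counter: store the new item
        elif x == candidate:
            count += 1                # same as stored ID
        else:
            count -= 1                # different item cancels one occurrence
    return candidate if count > 0 else None

def is_majority(stream, candidate):
    """Second pass: verify that the candidate occurs more than N/2 times."""
    stream = list(stream)
    return sum(1 for x in stream if x == candidate) > len(stream) / 2

items = [2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
candidate = majority_candidate(items)
confirmed = is_majority(items, candidate)
```

If no item is in the majority, the first pass can still return a candidate, which is why the verification pass matters.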
A generalization: Frequent Items (Karp 03)
Find k items, each occurring at least N/(k+1) times.
[Figure: table of k slots, each holding an ID (ID1, ID2, …, IDk) and a count]
• Algorithm:
  • Maintain k items and their counters.
  • If the next item x is one of the k, increment its counter.
  • Else if some counter is zero, put x there with count = 1.
  • Else (all counters non-zero), decrement all k counters.
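
A minimal sketch of the k-counter algorithm is given below. It stores the counters in a dictionary and deletes a counter when it reaches zero, which plays the role of a free "zero counter" slot; the function name `frequent_items` is an assumption.

```python
def frequent_items(stream, k):
    """One pass with at most k counters.

    Every item occurring more than N/(k+1) times is guaranteed to be among
    the returned keys; other returned keys may be false positives.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1              # x is one of the k monitored items
        elif len(counters) < k:
            counters[x] = 1               # a free (zero) counter exists
        else:
            for key in list(counters):    # all counters in use: decrement all
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]     # frees the slot
    return counters

items = [2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
result = frequent_items(items, k=2)
```

With k = 1 this reduces to the single counter-and-ID majority algorithm.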
Frequent Elements: Analysis
• A frequent item’s count is decremented only when all k counters are in use; each such step erases k+1 items (the arriving item plus one occurrence from each of the k counters).
• So there can be at most N/(k+1) decrement steps. If x occurs > N/(k+1) times, it cannot be completely erased.
• Similarly, x must get inserted at some point, because there are not enough items to keep it away.
Problem of False Positives
• False positives in the Misra-Gries (MG) algorithm
  • It identifies all true heavy hitters, but not all reported items are necessarily heavy hitters.
  • How can we tell if the non-zero counters correspond to true heavy hitters or not?
• A second pass is needed to verify.
• False positives are problematic if heavy hitters are used for billing or punishment.
• What guarantees can we achieve in one pass?
Approximation Guarantees
• Find heavy hitters with a guaranteed approximation error [MM02]
  • Manku-Motwani (Lossy Counting)
• Suppose you want φ-heavy hitters: items with frequency > φN (φ is the support threshold, written s in the Lossy Counting slides)
• An approximation parameter ε, where ε << φ (e.g., φ = .01 and ε = .0001; φ = 1% and ε = .01%)
• Identify all items with frequency > φN
• No reported item has frequency < (φ - ε)N
• The algorithm uses O((1/ε) log(εN)) memory
MM02 Algorithm 1: Lossy Counting
• Step 1: divide the stream into ‘windows’
• Window size W is a function of support s (specified later)
[Figure: stream split into Window 1, Window 2, Window 3]
Lossy Counting in Action ...
[Figure: frequency counts (initially empty) + first window]
• At the window boundary, decrement all counters by 1
Lossy Counting continued ...
[Figure: frequency counts + next window]
• At the window boundary, decrement all counters by 1
Error Analysis
• How much do we undercount?
• If the current size of the stream is N and the window size is W = 1/ε, then #windows = εN
• Each counter is decremented at most once per window, so frequency error ≤ #windows = εN
• Rule of thumb: set ε = 10% of support s
• Example: given support frequency s = 1%, set error frequency ε = 0.1%
Putting it all together…
• Output: elements with counter values exceeding (s-ε)N
• Approximation guarantees
  • Frequencies underestimated by at most εN
  • No false negatives
  • False positives have true frequency at least (s-ε)N
• How many counters do we need?
  • Worst-case bound: (1/ε) log(εN) counters
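
The window-based description above can be sketched as follows. This follows the simplified form on the slides (count everything, decrement all counters at each window boundary, drop counters that reach zero) rather than the exact bookkeeping of the Manku-Motwani paper; the function names are assumptions, and the output test uses >= (s-ε)N so that an item with true frequency exactly sN is never missed.

```python
import math

def lossy_count(stream, eps):
    """Simplified lossy counting. Returns (counters, N)."""
    window = math.ceil(1 / eps)           # window size W = 1/eps
    counters = {}
    n = 0
    for x in stream:
        n += 1
        counters[x] = counters.get(x, 0) + 1
        if n % window == 0:               # window boundary
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, n

def heavy_hitters(counters, n, s, eps):
    """Report items whose (undercounted) frequency is at least (s - eps) * n."""
    return {x for x, c in counters.items() if c >= (s - eps) * n}

# 100 items: "a" appears 3 times in every window of 10, all others are distinct.
stream = ["a" if i % 10 < 3 else i for i in range(100)]
counters, n = lossy_count(stream, eps=0.1)
reported = heavy_hitters(counters, n, s=0.2, eps=0.1)
```

In this run "a" truly occurs 30 times and its counter ends at 20, an undercount of 10 = εN, which matches the error analysis.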
Frequent items (Turnstile)
Data stream: 3, 1, 2, 4, -2, 3, 5, . . .
[Figure: bar chart of the frequencies f(1), f(2), f(3), f(4), f(5)]
• Ask for f(1) = ? f(4) = ?
  • AMS-based algorithm
  • Count-Min sketch
AMS (sketch) based algorithm
• Key intuition: use randomized linear projections of f() to define a random variable Z such that, for a given element i, $E[Z \cdot \xi_i] = f_i$; similarly, $E[Z \cdot \xi_j] = f_j$
• Basic idea: define a family of 4-wise independent {-1, +1} random variables $\xi_i$ with $\Pr[\xi_i = +1] = \Pr[\xi_i = -1] = 1/2$
• Let $Z = \sum_i f_i \xi_i$
• So $E[Z \cdot \xi_i] = f_i$, because $E[\xi_i \xi_j] = 0$ for $i \ne j$ and $\xi_i^2 = 1$
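AMS cont.
• Keep an array of w × d counters Z[i, j]
• Use d hash functions $h_1, \ldots, h_d$ to map an element x to [1..w]
• Update for element a: Z[i, $h_i(a)$] += $\xi_i(a)$ for each row i
• Est($f_a$) = median over i of ( Z[i, $h_i(a)$] · $\xi_i(a)$ )
[Figure: element a hashed by h1(a) … hd(a) into an array of d rows and w columns]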
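
A rough Python sketch of this w × d structure with random ±1 signs is shown below. It is an illustration under loud assumptions: the class name `SignedSketch` is made up, and the per-row bucket and sign functions are built from Python's built-in `hash` with random salts, which is convenient but is not a 4-wise independent family.

```python
import random
from statistics import median

class SignedSketch:
    """d rows of w counters; each update adds c times a random sign per row."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # Salts stand in for d independent hash and sign functions (assumption).
        self.bucket_salt = [rng.getrandbits(64) for _ in range(d)]
        self.sign_salt = [rng.getrandbits(64) for _ in range(d)]

    def _bucket(self, i, x):
        return hash((self.bucket_salt[i], x)) % self.w

    def _sign(self, i, x):
        return 1 if hash((self.sign_salt[i], x)) & 1 else -1

    def update(self, x, c=1):
        """Turnstile update <x, c>; c may be negative."""
        for i in range(self.d):
            self.table[i][self._bucket(i, x)] += c * self._sign(i, x)

    def estimate(self, x):
        """Median over rows of (counter * sign)."""
        return median(
            self.table[i][self._bucket(i, x)] * self._sign(i, x)
            for i in range(self.d)
        )

sketch = SignedSketch(w=16, d=5)
sketch.update("a", 3)
sketch.update("a", -1)      # deletions are allowed in the turnstile model
estimate_a = sketch.estimate("a")
```

With several distinct items, colliding items add noise of random sign to a counter, and the median over rows is what keeps the estimate close to the true frequency.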
The Count Min (CM) Sketch
• Simple sketch idea; can be used for point queries ($f_i$), range queries, quantiles, join size estimation
• Creates a small summary as an array C of w × d counters
• Uses d hash functions to map each element to [1..w]
• Related to the Bloom filter technique
[Figure: w × d counter array; the expressions given for w and d are not recoverable]
CM Sketch Structure
• Each element $x_i$ is mapped to one counter per row
• C[k, $h_k(x_i)$] = C[k, $h_k(x_i)$] + 1 (-1 for deletion), or + c[j] if the incoming update is <j, c[j]>
• Estimate A[j] by taking $\min_k$ C[k, $h_k(j)$]
[Figure: element $x_i$ hashed by h1(xi) … hd(xi) into an array of d rows and w columns, adding +1 to one counter per row]
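
The update and min-estimate rules can be sketched as follows. The class name `CountMinSketch` and the salted use of Python's built-in `hash` as the d hash functions are assumptions of this example; a common choice in the literature is w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, but here w and d are simply passed in.

```python
import random

class CountMinSketch:
    """Array C of d rows and w columns; one counter per row is touched per update."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.C = [[0] * w for _ in range(d)]
        # One salt per row stands in for the hash functions h_1..h_d (assumption).
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _h(self, k, x):
        return hash((self.salts[k], x)) % self.w

    def update(self, x, c=1):
        """Process an update <x, c>."""
        for k in range(self.d):
            self.C[k][self._h(k, x)] += c

    def estimate(self, x):
        """Point query: minimum over rows of the counter x maps to."""
        return min(self.C[k][self._h(k, x)] for k in range(self.d))

cm = CountMinSketch(w=64, d=4)
stream = [3, 1, 2, 4, 3, 5]
for item in stream:
    cm.update(item)
estimate_3 = cm.estimate(3)
```

When all updates are non-negative, every counter that x maps to holds at least the true count of x, so the minimum never underestimates; collisions can only push it upward.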