Introduction to Data Stream

Introduction to Data Stream DBG@UNSW introduction to data stream DBG@UNSW

Acknowledgement
Some of the slides are modified from
• Nikos Koudas (Toronto U)
• Minos Garofalakis (Yahoo! Research)
• Divesh Srivastava (AT&T)
• S. Muthukrishnan (Rutgers)
• Georges Hébrail (ENST Paris)

Outline
• Introduction and Applications
• Data Stream Management System
• Modeling for Data Streams
• Approximation Technique
• Data Stream Computation

What is a data stream?
• Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
• Structured records ≠ audio or video data
• Massive volumes of data, records arrive at a high rate

Data stream applications
• Transactional data streams: log interactions between entities
  • Credit card: purchases by consumers from merchants
  • Telecommunications: phone calls by callers to dialed parties
  • Web: accesses by clients of resources at servers
• Measurement data streams: monitor evolution of entity states
  • IP network: traffic at router interfaces
  • Sensor networks: physical phenomena, road traffic
  • Earth climate: temperature, moisture at weather stations

Applications: Network Management
[Figure: network supervision center]
Network management involves monitoring and configuring network hardware and software to ensure smooth operation
• Quickly detect faults, congestion, attacks
• QoS
• Load balancing, improve utilization of network resources

Applications: Network Management (more details)
• Traffic estimation
  • What fraction of network IP addresses are active?
  • How many bytes were sent between a pair of IP addresses?
  • List the top 100 IP addresses in terms of traffic
• Traffic analysis
  • What is the average duration of an IP session?
  • What is the median of the number of bytes in each IP session?
• Fraud detection
  • List all sessions that transmitted more than 1000 bytes
  • Identify all sessions whose duration was more than twice the normal
• Security/Denial of Service
  • List all IP addresses that have witnessed a sudden spike in traffic
  • Identify IP addresses involved in more than 1000 sessions

Application: stock monitoring
• Stream of price and sales volume of stocks over time
• Technical analysis/charting for stock investors
• Support trading decisions
  • Notify me when some stock goes up by at least 5%.
  • Notify me when the price of any stock increases monotonically for ≥ 40 min.

Challenges
• Massive in volume or even infinite
  • AT&T long-distance: ~300M call tuples/day
  • AT&T IP backbone: ~50B IP flows/day
• Rapid arrival rate
• Real-time monitoring (response) required
• Continuous queries

DSMS
DBMS - Data Base Management System
• Data model (relational)
• Data is stored on disk
• SQL language
  • Creating structures
  • Inserting/updating/deleting data
  • Retrieving data (query)
• Good performance even with large volumes of data
DSMS - Data Stream Management System
• Data model (streams and permanent relations)
• Permanent relations are stored on disk, but streams are processed on the fly
• SQL-like query language
  • Standard SQL on permanent relations
  • Extended SQL on streams with windowing features
  • New paradigm of queries (continuous queries)
• Tools for capturing input streams and producing output streams
• Good performance: optimization of computer resources

Existing DSMS
Principal specialized DSMSs
• Gigascope and Hancock: AT&T
  • Network monitoring
  • Analysis of telecommunication calls
• NiagaraCQ: University of Wisconsin-Madison
  • Large number of continuous queries on web content (XML-QL)
• Tradebot (finance), StatStream (statistics)
Principal general-purpose DSMSs
• STREAM: Stanford University
• TelegraphCQ: University of California, Berkeley
• Aurora: Brown University, MIT, Brandeis
Sensor networks
• Cougar: Cornell University
• TinyDB: University of California, Berkeley

STREAM from Stanford
[Figure: DSMS architecture. Labels: Register Query, Input streams, Streamed Result, Stored Result, Scratch Store, Archive, Stored Relations]

STREAM (cont.)
• General-purpose DSMS for streams and stored data
• Relational (unlikely to change)
• Centralized server model (likely to change)
• Single-threaded and parallel versions
• Declarative language for registering continuous queries (CQL)
• Query optimization with good memory management
• Approximate answer with synopses management

STREAM (cont.)
Some Implementation Issues
• Designed to cope with:
  • Stream rates that may be high, variable, bursty
  • Continuous query loads that may be high, volatile
• Primary coping techniques
  • Continuous self-monitoring and reoptimization
  • Graceful approximation as necessary
  • Careful resource allocation and use

Models for data streams
• Structure of a stream
  • Infinite sequence of items (elements)
  • One item: structured information, i.e. tuple or object
  • Same structure for all items in a stream
• Timestamping
  • Explicit (date field in data)
  • Implicit (timestamp given when items arrive)
• Representation of time
  • Physical (date)
  • Logical (integer)

Models for data streams (cont.)
• One-dimensional array A[1…N] with values A[i] all initially zero
• Signal is implicitly represented via a stream of updates
• j-th update is <k, c[j]>, implying A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0)
• Goal: Compute functions on A[ ] subject to
  • Small space
  • Fast processing of updates
  • Fast function computation
  • …

Models for data streams (cont.)
• Time-Series Model
  • Only the j-th update updates A[j] (i.e., A[j] := c[j])
• Cash-Register Model
  • c[j] is always >= 0 (i.e., increment-only)
  • Typically, c[j] = 1, so we see a multi-set of items in one pass
• Turnstile Model
  • Most general streaming model
  • c[j] can be >= 0 or < 0 (i.e., increment or decrement)
Problem difficulty varies depending on the model
• E.g., MIN/MAX in Time-Series vs. Turnstile!
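
The three update models can be made concrete with a short Python sketch. The function name `replay` and the model labels are illustrative assumptions, not part of the slides; the sketch simply replays a list of <k, c> updates against the array A under each model's rule.

```python
def replay(updates, n, model):
    """Replay a stream of (k, c) updates on A[1..n] under the given model.

    `replay` and the model names are hypothetical helpers for illustration.
    """
    A = [0] * (n + 1)  # 1-based indexing as on the slide; A[0] is unused
    for j, (k, c) in enumerate(updates, start=1):
        if model == "time-series":
            if k != j:
                raise ValueError("time-series: the j-th update must target A[j]")
            A[j] = c                      # A[j] := c[j]
        elif model == "cash-register":
            if c < 0:
                raise ValueError("cash-register: increments only")
            A[k] += c                     # A[k] := A[k] + c[j], c[j] >= 0
        elif model == "turnstile":
            A[k] += c                     # increments and decrements allowed
        else:
            raise ValueError("unknown model")
    return A[1:]


# Turnstile stream 3, 1, 2, 4, -2, 3, 5 written as (item, change) pairs.
updates = [(3, 1), (1, 1), (2, 1), (4, 1), (2, -1), (3, 1), (5, 1)]
print(replay(updates, 5, "turnstile"))
```

The same update list is rejected under the cash-register model because it contains a deletion, which is one way to see why the turnstile model is the most general of the three.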

Window on the stream
[Figure: timeline t from the beginning of the stream to the current date]
• Applying queries/mining tasks to the whole stream (from beginning to current time)
• Windowing: applying queries/mining tasks to a portion of the stream

Windowing (cont.)
Definition of windows of interest on streams
• Fixed windows: September 2007
• Sliding windows: last 3 hours (n of N window)
• Landmark windows: from September 1st, 2007
Window specification
• Physical time: last 3 hours
• Logical time: last 1000 items
Refreshing rate
• Rate of producing results (every item, every 10 items, every minute, …)
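
As a rough illustration of a sliding window in logical time with a refreshing rate, the Python sketch below keeps the last `window_size` items and reports their average every `refresh_every` items. The function and parameter names are assumptions made for this example.

```python
from collections import deque


def sliding_window_averages(stream, window_size, refresh_every=1):
    """Average over the last `window_size` items, emitted every `refresh_every` items."""
    window = deque(maxlen=window_size)  # logical-time window: last N items
    total = 0
    results = []
    for n, x in enumerate(stream, start=1):
        if len(window) == window_size:
            total -= window[0]          # oldest item is about to be evicted
        window.append(x)
        total += x
        if n % refresh_every == 0:      # refreshing rate
            results.append(total / len(window))
    return results


print(sliding_window_averages([1, 2, 3, 4, 5, 6], window_size=3, refresh_every=2))
# → [1.5, 3.0, 5.0]
```

A physical-time window (e.g. last 3 hours) would evict by timestamp instead of by count; the running-total idea stays the same.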

Computation Model
[Figure: data streams enter a stream processing engine that keeps a synopsis in memory and outputs an (approximate) answer]
• Stream processing requirements
  • Single pass: each record is examined at most once
  • Bounded storage: limited memory (M) for storing the synopsis
  • Real-time: per-record processing time (to maintain the synopsis) must be low
  • Data independent: no a priori knowledge required about the data set (size, range, distribution, order)

Approximation
• Exact answer is too expensive to compute
  • May need more memory than is affordable (distinct, median)
  • May take too long to complete
• Approximate answer is acceptable in many applications
  • ε-approximate answers [absolute error / relative error]
  • Example: if the exact answer is E = 100 and ε = 0.1 (relative error), any answer in [90, 110] is acceptable
  • Only a small amount of memory is needed
  • Computed very quickly
  • Error is guaranteed to be small

Approximation (cont.)
• Deterministic approximate methods
  • Deterministic algorithms carefully control the error.
• Non-deterministic approximate methods
  • Randomization, sampling, etc. Provide a good approximation with high probability.

Basic synopses
Basic stream synopses computation
• Samples: answering queries using samples. Reservoir sampling, inverse sampling
• Histograms: equi-depth histograms, on-line quantile computation
• Wavelets: Haar-wavelet histogram construction & maintenance
• Sketch: AMS Sketch, CM Sketch, FM Sketch

Sampling
• Idea: a small random sample S of the data often represents all the data well
• For a fast approximate answer, apply a “modified” query to S
• Example: select agg from R (n = 12)
  • Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
  • Sample S: 9 5 1 8
  • If agg is avg, return the average of the elements in S: 23/4 = 5.75
  • Number of odd elements? Scale the fraction of odd elements in S by n: 12 × 3/4 = 9
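
Reservoir sampling, named among the basic synopses, is one standard way to maintain such a sample in a single pass. The sketch below is a minimal Python version (Algorithm R); the function names and the `rng` parameter are assumptions for this example, and the two estimators follow the "modified query" idea from the slide.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform in [0, n)
            if j < k:                    # happens with probability k/n
                sample[j] = item
    return sample


def estimate_avg(sample):
    """Estimate the stream average by the sample average."""
    return sum(sample) / len(sample)


def estimate_odd_count(sample, n):
    """Estimate the number of odd elements by scaling the sample fraction to n."""
    return n * sum(1 for x in sample if x % 2 == 1) / len(sample)


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
S = reservoir_sample(stream, 4, random.Random(0))
print(S, estimate_avg(S), estimate_odd_count(S, len(stream)))
```

For the fixed sample 9 5 1 8 from the slide, the two estimators give 5.75 and 9 respectively; a different random sample gives different estimates, which is what the probabilistic guarantees are meant to bound.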

Probabilistic Guarantees
• Example: actual answer is within 11.5 ± 1 with prob ≥ 0.9
• Randomized algorithms: the answer returned is a specially-built random variable.
• Use tail inequalities to give probabilistic bounds on the returned answer
  • Markov Inequality
  • Chebyshev’s Inequality
  • Chernoff/Hoeffding Bound

Basic Tools: Tail Inequalities
[Figure: probability distribution with the tail probability marked]
• General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)
• Basic Inequalities: Let X be a random variable with expectation $\mu$ and variance Var[X]. Then for any $\epsilon > 0$:
  • Markov (for $X \ge 0$): $\Pr(X \ge \epsilon) \le \mu / \epsilon$
  • Chebyshev: $\Pr(|X - \mu| \ge \epsilon) \le \mathrm{Var}[X] / \epsilon^2$

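Tail Inequalities (cont.)
• Hoeffding’s Inequality: Let $X_1, \ldots, X_m$ be independent random variables with $0 \le X_i \le r$. Let $\bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i$ and let $\mu$ be the expectation of $\bar{X}$. Then, for any $\epsilon > 0$: $\Pr(|\bar{X} - \mu| \ge \epsilon) \le 2 \exp(-2 m \epsilon^2 / r^2)$
• Chernoff Bound (… )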
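
To see how such a bound is used in practice, the Python sketch below assumes the standard two-sided form of Hoeffding's inequality for variables in [0, r], Pr(|X̄ − μ| ≥ ε) ≤ 2·exp(−2mε²/r²), and solves it for the number of samples m needed to reach a failure probability δ. The function names and the parameter δ are introduced here for illustration.

```python
import math


def hoeffding_bound(m, r, eps):
    """Upper bound on Pr(|mean - mu| >= eps) for m independent samples in [0, r]."""
    return 2 * math.exp(-2 * m * eps * eps / (r * r))


def hoeffding_sample_size(r, eps, delta):
    """Smallest m for which the bound above is at most delta."""
    return math.ceil(r * r * math.log(2 / delta) / (2 * eps * eps))


m = hoeffding_sample_size(r=1, eps=0.1, delta=0.1)
print(m, hoeffding_bound(m, 1, 0.1))
```

Note that the required sample size depends on ε, δ and the range r, but not on the length of the stream, which is what makes sampling attractive as a synopsis.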

Histogram
Histograms approximate the frequency distribution of element values in a stream
A histogram (typically) consists of
• A partitioning of element domain values into buckets
• A count per bucket B (of the number of elements in B)
[Figure: count per bucket plotted over domain values 1 to 20]
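
A minimal Python sketch of the bucket-and-count idea follows. It uses equal-width buckets over an integer domain purely as the simplest partitioning; the slides do not prescribe this choice, and the function name and parameters are assumptions.

```python
def equi_width_histogram(stream, lo, hi, width):
    """Count stream elements per bucket over the integer domain lo..hi."""
    num_buckets = -(-(hi - lo + 1) // width)   # ceiling division
    counts = [0] * num_buckets
    for v in stream:
        if lo <= v <= hi:
            counts[(v - lo) // width] += 1     # one pass, one counter update per item
    return counts


# Domain 1..20 split into buckets [1-5], [6-10], [11-15], [16-20].
print(equi_width_histogram([1, 2, 2, 7, 8, 14, 20, 20], lo=1, hi=20, width=5))
# → [3, 2, 1, 2]
```

An equi-depth histogram would instead choose bucket boundaries so that each bucket holds roughly the same count, which requires quantile information about the stream.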

Wavelet
• Wavelets: mathematical tool for hierarchical decomposition of functions/signals
• Haar wavelets: simplest wavelet, easy to understand and implement
  • Recursive pairwise averaging and differencing at different resolutions
Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]
Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
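
The recursive pairwise averaging and differencing can be sketched in a few lines of Python. The function name is an assumption, the input length is assumed to be a power of two, and the detail coefficient of a pair (a, b) is taken as (a − b)/2, which matches the numbers in the example above.

```python
def haar_decompose(signal):
    """Haar wavelet decomposition by repeated pairwise averaging and differencing.

    Assumes len(signal) is a power of two.
    """
    averages = list(signal)
    details = []                       # kept coarsest-first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details   # coarser levels go in front
    return averages + details          # overall average, then details


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# → [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```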

Frequency Related Problems
[Figure: frequency histogram over domain values 1 to 20]
• Top-k most frequent elements
• Find all elements with frequency > 0.1%
• Find elements that occupy 0.1% of the tail
• What is the frequency of element 3?
• What is the total frequency of elements between 8 and 14?
• How many elements have non-zero frequency? (number of distinct elements)

An Old Chestnut: Majority
• A sequence of N items.
• You have constant memory.
• In one pass, decide if some item is in majority (occurs > N/2 times)?
Example stream: 2 9 9 9 7 6 4 9 9 9 3 9 (N = 12; item 9 is the majority)
Any idea?

Misra-Gries Algorithm (‘82)
• Keep one counter and one stored ID.
• If the counter is 0, store the new item with count = 1.
• Otherwise, if the new item is the same as the stored ID, increment the counter.
• Otherwise, decrement the counter.
• At the end, if counter > 0, then its item is the only candidate for majority.
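
The single-counter scheme can be sketched in Python as follows. The function names are assumptions; the second function is the extra verification pass, since the first pass only yields a candidate.

```python
def majority_candidate(stream):
    """One pass, constant memory: return the only possible majority item (or None)."""
    candidate, counter = None, 0
    for item in stream:
        if counter == 0:
            candidate, counter = item, 1   # store the new item with count = 1
        elif item == candidate:
            counter += 1
        else:
            counter -= 1
    return candidate if counter > 0 else None


def is_majority(items, candidate):
    """Second pass: check that the candidate really occurs more than N/2 times."""
    return candidate is not None and items.count(candidate) > len(items) / 2


stream = [2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
c = majority_candidate(stream)
print(c, is_majority(stream, c))
# → 9 True
```

On a stream with no majority, such as 1 2 3, the first pass still returns a candidate (3), which is why the verification pass is needed when a guarantee is required.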

A generalization: Frequent Items (Karp 03)
[Figure: table of k slots, each holding an ID (ID1, ID2, …, IDk) and its count]
Find k items, each occurring at least N/(k+1) times.
• Algorithm:
  • Maintain k items and their counters.
  • If the next item x is one of the k, increment its counter.
  • Else if some counter is zero, put x there with count = 1.
  • Else (all counters non-zero) decrement all k counters.
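
A compact Python sketch of the k-counter algorithm follows. Using a dictionary in which absent keys play the role of zero counters is an implementation choice made here, not something the slide specifies.

```python
def frequent_items(stream, k):
    """Maintain at most k (item, counter) pairs in one pass over the stream."""
    counters = {}                          # a missing key stands for a zero counter
    for x in stream:
        if x in counters:
            counters[x] += 1               # x is one of the k tracked items
        elif len(counters) < k:
            counters[x] = 1                # a free (zero) counter is available
        else:
            for key in list(counters):     # all counters non-zero: decrement all
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


print(frequent_items([2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9], k=2))
# → {9: 5, 3: 1}
```

With k = 2 and N = 12, any item occurring more than N/(k+1) = 4 times must survive; item 9 (7 occurrences) does. Item 3 also remains with a non-zero counter although it occurs only once, so surviving items are candidates rather than confirmed frequent items.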

Frequent Elements: Analysis
• A frequent item’s count is decremented only if all counters are full: the decrement erases k+1 items.
• If x occurs > N/(k+1) times, then it cannot be completely erased.
• Similarly, x must get inserted at some point, because there are not enough items to keep it away.

Problem of False Positives
• False positives in the Misra-Gries (MG) algorithm
  • It identifies all true heavy hitters, but not all reported items are necessarily heavy hitters.
  • How can we tell if the non-zero counters correspond to true heavy hitters or not?
• A second pass is needed to verify.
• False positives are problematic if heavy hitters are used for billing or punishment.
• What guarantees can we achieve in one pass?

Approximation Guarantees
• Find heavy hitters with a guaranteed approximation error [MM02]
  • Manku-Motwani (Lossy Counting)
• Suppose you want φ-heavy hitters: items with freq > φN
• An approximation parameter ε, where ε << φ (e.g., φ = .01 and ε = .0001; φ = 1% and ε = .01%)
• Identify all items with frequency > φN
• No reported item has frequency < (φ - ε)N
• The algorithm uses O(1/ε log(εN)) memory

MM02 Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
[Figure: stream split into Window 1, Window 2, Window 3]
Window-size W is a function of support s – specify later…

Lossy Counting in Action ...
[Figure: empty frequency counts + first window]
At window boundary, decrement all counters by 1

Lossy Counting continued ...
[Figure: frequency counts + next window]
At window boundary, decrement all counters by 1

Error Analysis
How much do we undercount?
• If current size of stream = N and window-size W = 1/ε, then frequency error ≤ #windows = εN
• Rule of thumb: set ε = 10% of support s
• Example: given support frequency s = 1%, set error frequency ε = 0.1%

Putting it all together…
Output: elements with counter values exceeding (s-ε)N
Approximation guarantees
• Frequencies underestimated by at most εN
• No false negatives
• False positives have true frequency at least (s–ε)N
How many counters do we need?
• Worst case bound: 1/ε log εN counters
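
The window-based description above can be sketched in Python as follows. This follows the simplified form shown on these slides (count every item, decrement all counters at each window boundary, report counters above (s − ε)N); the function name and the choice W = ⌈1/ε⌉ for non-integer 1/ε are assumptions of this example.

```python
import math


def lossy_counting(stream, s, eps):
    """Simplified lossy counting: report items whose counter exceeds (s - eps) * N."""
    window_size = math.ceil(1 / eps)       # W = 1/eps
    counters = {}
    n = 0
    for x in stream:
        n += 1
        counters[x] = counters.get(x, 0) + 1
        if n % window_size == 0:           # window boundary
            for key in list(counters):
                counters[key] -= 1         # decrement all counters by 1
                if counters[key] == 0:
                    del counters[key]      # drop empty counters to save space
    threshold = (s - eps) * n
    return {x: c for x, c in counters.items() if c > threshold}


print(lossy_counting(list("abacaada"), s=0.5, eps=0.25))
# → {'a': 3}
```

In this run N = 8 and W = 4, so there are two window boundaries: item a occurs 5 times and is reported with counter 3, an undercount of 2 = εN, consistent with the error analysis.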

Frequent items (Turnstile)
Data stream: 3, 1, 2, 4, -2, 3, 5, . . .
[Figure: bar chart of the frequencies f(1), f(2), f(3), f(4), f(5)]
• Ask for f(1) = ? f(4) = ?
  • AMS based algorithm
  • Count Min sketch

AMS (sketch) based algorithm
• Key intuition: use randomized linear projections of f() to define a random variable Z such that, for a given element i, $E(Z \cdot \xi_i) = A[i] = f_i$; similarly, $E(Z \cdot \xi_j) = f_j$
• Basic idea: define a family of 4-wise independent {-1, +1} random variables $\xi_i$ with $\Pr[\xi_i = +1] = \Pr[\xi_i = -1] = 1/2$
• Let $Z = \sum_i f_i \xi_i$
• So $E(Z \cdot \xi_i) = f_i$, because $E(\xi_i \xi_j) = 0$ for $i \ne j$ and $\xi_i^2 = 1$

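AMS cont.
• Keep an array of w × d counters Z[i, j]
• Use d hash functions $h_1, \ldots, h_d$ to map element x to [1..w]
[Figure: element a hashed by h1(a), …, hd(a) into one counter in each of the d rows of width w]
• Update for element a: $Z[i, h_i(a)] \mathrel{+}= \xi_i(a)$
• Estimate: $\mathrm{Est}(f_a) = \mathrm{median}_i \left( Z[i, h_i(a)] \cdot \xi_i(a) \right)$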
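
A small Python sketch of this kind of structure is given below, under explicit assumptions: each row uses a ±1 sign for the element when updating, and the estimate is the median over rows of the counter multiplied by that sign. The class name, the hash construction ((a·x + b) mod p) and the seed are all illustrative; in particular, the sign hashes here are only pairwise independent, a simplification of the 4-wise independent family mentioned above, and elements are assumed to be non-negative integers.

```python
import random
import statistics


class SignedSketch:
    """d rows of w counters; each update adds c times a +/-1 sign in every row."""

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.p = 2_147_483_647  # prime 2^31 - 1 used by the illustrative hashes
        self.bucket_params = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(d)]
        self.sign_params = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(d)]
        self.Z = [[0] * w for _ in range(d)]

    def _bucket(self, i, x):
        a, b = self.bucket_params[i]
        return ((a * x + b) % self.p) % self.w   # 0-based column for row i

    def _sign(self, i, x):
        a, b = self.sign_params[i]
        return 1 if ((a * x + b) % self.p) % 2 == 0 else -1

    def update(self, x, c=1):
        for i in range(self.d):
            self.Z[i][self._bucket(i, x)] += c * self._sign(i, x)

    def estimate(self, x):
        return statistics.median(
            self.Z[i][self._bucket(i, x)] * self._sign(i, x) for i in range(self.d)
        )


sk = SignedSketch(w=64, d=5)
for item, change in [(3, 1), (1, 1), (2, 1), (4, 1), (2, -1), (3, 1), (5, 1)]:
    sk.update(item, change)
print(sk.estimate(3))
```

Because colliding elements contribute with random signs, their contributions tend to cancel, and the median over rows damps the effect of an unlucky row.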

The Count Min (CM) Sketch
• Simple sketch idea, can be used for point queries (f_i), range queries, quantiles, join size estimation
• Creates a small summary as an array of w × d counters C
• Use d hash functions to map each element to [1..w]
• Similar to the Bloom filter technique

CM Sketch Structure
[Figure: element x_i hashed by h_1(x_i), …, h_d(x_i) into one counter in each of the d rows of width w; each selected counter receives +1]
• Each element x_i is mapped to one counter per row
• C[k, h_k(x_i)] = C[k, h_k(x_i)] + 1 (-1 for deletion), or + c[j] if the incoming update is <j, c[j]>
• Estimate A[j] by taking min_k C[k, h_k(j)]
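
The update and estimate rules can be sketched in Python as below. The class name, the seed and the hash construction ((a·x + b) mod p) mod w are assumptions made for this example; elements are assumed to be non-negative integers and columns are 0-based rather than [1..w].

```python
import random


class CountMinSketch:
    """Array of d rows by w counters with one hash function per row."""

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.p = 2_147_483_647  # prime 2^31 - 1 used by the illustrative hashes
        self.params = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, k, x):
        a, b = self.params[k]
        return ((a * x + b) % self.p) % self.w

    def update(self, x, c=1):
        # One counter per row: C[k, h_k(x)] += c (c = -1 for a deletion).
        for k in range(self.d):
            self.C[k][self._h(k, x)] += c

    def estimate(self, x):
        # Point query: minimum over the d counters that x maps to.
        return min(self.C[k][self._h(k, x)] for k in range(self.d))


cm = CountMinSketch(w=64, d=4)
for item, change in [(3, 1), (1, 1), (2, 1), (4, 1), (2, -1), (3, 1), (5, 1)]:
    cm.update(item, change)
print(cm.estimate(1), cm.estimate(4))
```

As long as every true frequency stays non-negative, each counter can only overcount its element because of collisions, so the minimum over rows never falls below the true frequency.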
