## Introduction to Data Stream


Acknowledgement

Some of the slides are modified from

- Nikos Koudas (University of Toronto)
- Minos Garofalakis (Yahoo! Research)
- Divesh Srivastava (AT&T)
- S. Muthukrishnan (Rutgers)
- Georges Hébrail (ENST Paris)

introduction to data stream DBG@UNSW

Outline

- Introduction and Applications
- Data Stream Management System
- Modeling for Data Streams
- Approximation Technique
- Data Stream Computation


What is a data stream?

- Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
- Structured records audio or video data
- Massive volumes of data, records arrive at a high rate


Data stream applications

- Transactional data streams: log interactions between entities
  - Credit card: purchases by consumers from merchants
  - Telecommunications: phone calls by callers to dialed parties
  - Web: accesses by clients of resources at servers
- Measurement data streams: monitor evolution of entity states
  - IP network: traffic at router interfaces
  - Sensor networks: physical phenomena, road traffic
  - Earth climate: temperature, moisture at weather stations


Applications: Network Management

Network management involves monitoring and configuring network hardware and software to ensure smooth operation.

- Quickly detect faults, congestion, attacks
- QoS (quality of service)
- Load balancing, improve utilization of network resources


( more details )

- Traffic estimation
- What fraction of network IP addresses are active?
- How many bytes were sent between a pair of IP addresses?
- List the top 100 IP addresses in terms of traffic
- Traffic analysis
- What is the average duration of an IP session?
- What is the median of the number of bytes in each IP session?
- Fraud detection
- List all sessions that transmitted more than 1000 bytes
- Identify all sessions whose duration was more than twice the normal
- Security/Denial of Service
- List all IP addresses that have witnessed a sudden spike in traffic
- Identify IP addresses involved in more than 1000 sessions


Application: Stock Monitoring

- Stream of price and sales volume of stocks over time
- Technical analysis/charting for stock investors
- Support trading decisions

- Notify me when some stock goes up by at least 5%.
- Notify me when the price of any stock increases monotonically for ≥ 40 min.


Challenges

- Massive in volume or even infinite
- AT&T long-distance: ~300M call tuples/day
- AT&T IP backbone: ~50B IP flows/day
- Rapid arrival rate
- Real-time monitoring (response) required
- Continuous query


Outline

- Introduction and Applications
- Data Stream Management System
- Modeling for Data Streams
- Approximation Technique
- Data Stream Computation


DSMS

DBMS - Database Management System

- Data model (relational)
- Data is stored on disk
- SQL language
- Creating structures
- Inserting/updating/deleting data
- Retrieving data (query)
- Good performance even with large volumes of data

DSMS - Data Stream Management System

- Data model (streams and permanent relations)
- Permanent relations are stored on disk, but streams are processed on the fly
- SQL-like query language
- Standard SQL on permanent relations
- Extended SQL on streams with windowing features
- New paradigm of queries (continuous queries)
- Tools for capturing input streams and producing output streams
- Good performance: optimization of computer resources


Existing DSMS

Principal specialized DSMS’s

- Gigascope and Hancock : AT&T
- Network monitoring
- Analysis of telecommunication calls
- NiagaraCQ : University of Wisconsin-Madison
- Large number of continuous queries on web content (XML-QL)
- Tradebot (finance), Statstream (statistics)

Principal general-purpose DSMS’s

- STREAM: Stanford University
- TelegraphCQ: University of California, Berkeley
- Aurora : Brown University, MIT, Brandeis

Sensor network

- Cougar : Cornell University
- TinyDB: University of California, Berkeley


STREAM from Stanford

[Figure: STREAM architecture. Labels: Input streams, Register Query, DSMS, Result, Stored Result, Archive, Scratch Store, Stored Relations.]

STREAM ( cont. )

- General-purpose DSMS for streams and stored data
- Relational (unlikely to change)
- Centralized server model (likely to change)
- Single-threaded and parallel versions
- Declarative language for registering continuous queries (CQL)
- Query optimization with good memory management
- Approximate answer with synopses management


STREAM ( cont. )

Some Implementation Issues

- Designed to cope with:
- Stream rates that may be high, variable, bursty
- Continuous query loads that may be high, volatile
- Primary coping techniques
- Continuous self-monitoring and reoptimization
- Graceful approximation as necessary
- Careful resource allocation and use


Outline

- Introduction and Applications
- Data Stream Management System
- Modeling for Data Streams
- Approximation Technique
- Data Stream Computation


Models for data streams

- Structure of a stream
- Infinite sequence of items (elements)
- One item: structured information, i.e. tuple or object
- Same structure for all items in a stream
- Timestamping
- Explicit ( date field in data )
- Implicit ( timestamp given when items arrive )
- Representation of time
- Physical (date)
- Logical (integer)


Models for data streams (cont.)

- One-dimensional array A[1…N] with values A[i] all initially zero
- Signal is implicitly represented via a stream of updates
- j-th update is <k, c[j]>, implying A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0)

- Goal: Compute functions on A[ ] subject to
- Small space
- Fast processing of updates
- Fast function computation
- …


Models for data streams (cont.)

- Time-Series Model
- Only the j-th update updates A[j] (i.e., A[j] := c[j])

- Cash-Register Model
- c[j] is always >= 0 (i.e., increment-only)
- Typically, c[j]=1, so we see a multi-set of items in one pass
- Turnstile Model
- Most general streaming model
- c[j] can be >=0 or <0 (i.e., increment or decrement)

Problem difficulty varies depending on the model

- E.g., MIN/MAX in Time-Series vs. Turnstile!
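
The three update models can be illustrated with a short Python sketch. The function name `apply_updates` and the `model` strings are hypothetical, chosen only for this illustration; the array A[1…N] is stored 0-based.

```python
def apply_updates(n, updates, model="turnstile"):
    """Apply a stream of <k, c> updates to A[1..n] under one of the stream models."""
    a = [0] * n
    for j, (k, c) in enumerate(updates, start=1):
        if model == "time-series":
            # Only the j-th update may touch A[j], and it overwrites the value.
            if k != j:
                raise ValueError("time-series: j-th update must target A[j]")
            a[k - 1] = c
        elif model == "cash-register":
            # Increment-only updates.
            if c < 0:
                raise ValueError("cash-register: increments only")
            a[k - 1] += c
        else:
            # Turnstile: increments and decrements are both allowed.
            a[k - 1] += c
    return a


# Insert 3, 1, 2, 4, delete 2, insert 3, 5 (turnstile).
signal = apply_updates(5, [(3, 1), (1, 1), (2, 1), (4, 1), (2, -1), (3, 1), (5, 1)])
```

Under the turnstile model a deletion can remove the current maximum, which is one reason MIN/MAX is harder there than in the time-series model.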


Windowing

- Applying queries/mining tasks to the whole stream (from beginning to current time)
- Applying queries/mining tasks to a portion of the stream

[Figure: timeline t from the beginning of the stream to the current date.]


Windowing ( cont.)

Definition of windows of interest on streams

- Fixed windows: September 2007
- Sliding windows: last 3 hours ( n of N window )
- Landmark windows: from September 1st, 2007

Window specification

- Physical time: last 3 hours
- Logical time: last 1000 items

Refreshing rate

- Rate of producing results (every item, every 10 items, every minute, …)
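
As a minimal sketch of a logical-time sliding window with a refreshing rate, the following Python function keeps the last `window_size` items and emits the window average every `refresh_every` items. The function and parameter names are assumptions made for this example, not part of any DSMS.

```python
from collections import deque


def sliding_averages(stream, window_size, refresh_every=1):
    """Average over the last `window_size` items, reported every `refresh_every` items."""
    window = deque(maxlen=window_size)
    total = 0
    results = []
    for count, x in enumerate(stream, start=1):
        if len(window) == window_size:
            total -= window[0]  # this item is evicted by the append below
        window.append(x)
        total += x
        if count % refresh_every == 0:
            results.append(total / len(window))
    return results


averages = sliding_averages([9, 3, 5, 2, 7, 1], window_size=3, refresh_every=2)
```

A physical-time window would evict by timestamp instead of by item count.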


Computation Model

[Figure: data streams enter a stream processing engine, which outputs an (approximate) answer.]

Stream processing requirements

- Single pass: each record is examined at most once
- Bounded storage: limited memory (M) for storing the synopsis
- Real-time: per-record processing time (to maintain the synopsis) must be low
- Data independent: no a priori knowledge required about the data set (size, range, distribution, order)


- Introduction and Applications
- Data Stream Management System
- Modeling for Data Streams
- Approximation Technique
- Data Stream Computation


Approximation

- Exact answer is too expensive to compute
- May need too large memory to afford ( distinct, median )
- May need too long time to complete
- Approximate answer is acceptable in many applications

ε-approximate answers (absolute error or relative error)

For example, with relative error: if the exact answer is E = 100 and ε = 0.1, then any answer in [90, 110] is acceptable.

- Only small size of memory is needed
- Compute very quickly
- Error is guaranteed to be small


Approximation (cont.)

- Deterministic approximate methods
- Deterministic algorithms carefully control the error.
- Non-deterministic approximate methods
- Randomization, sampling, etc. provide a good approximation with high probability.


Basic synopses

Basic stream synopses computation

- Samples: answering queries using samples. Reservoir sampling, inverse sampling
- Histograms: Equi-depth histograms, On-line quantile computation
- Wavelets: Haar-wavelet histogram construction & maintenance
- Sketch : AMS Sketch, CM Sketch, FM Sketch
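
Reservoir sampling, the first synopsis listed above, can be sketched in a few lines of Python. This is the textbook one-pass version that keeps a uniform sample of k items; the function name and the optional `rng` parameter are assumptions for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # uniform over 0..i, inclusive
            if j < k:
                sample[j] = item  # item i is kept with probability k / (i + 1)
    return sample
```

Each item is examined once and only k items are stored, which matches the single-pass and bounded-storage requirements.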


Sampling

- Idea: A small random sample S of the data often represents all the data well
- For a fast approximate answer, apply “modified” query to S
- Example: select agg from R (n=12)
- If agg is avg, return average of the elements in S
- Number of odd elements?

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

Sample S: 9 5 1 8

answer: 11.5


Probabilistic Guarantees

- Example: Actual answer is within 11.5 ± 1 with prob 0.9
- Randomized algorithms: the answer returned is a specially built random variable.
- Use Tail Inequalities to give probabilistic bounds on returned answer
- Markov Inequality
- Chebyshev’s Inequality
- Chernoff/Hoeffding Bound


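Basic Tools: Tail Inequalities

- General bounds on tail probability of a random variable (that is, probability that a random variable deviates far from its expectation)
- Basic Inequalities: Let X be a random variable with expectation $\mu$ and variance $\mathrm{Var}[X]$. Then for any $\epsilon > 0$:

Markov (for non-negative X): $\Pr[X \ge (1+\epsilon)\mu] \le \frac{1}{1+\epsilon}$

Chebyshev: $\Pr[|X - \mu| \ge \mu\epsilon] \le \frac{\mathrm{Var}[X]}{\mu^2\epsilon^2}$

[Figure: a distribution with its tail probability marked.]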


Tail Inequalities ( cont.)

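- Hoeffding’s Inequality: Let $X_1, \ldots, X_m$ be independent random variables with $0 \le X_i \le r$. Let $\bar{X} = \frac{1}{m}\sum_i X_i$ and $\mu$ be the expectation of $\bar{X}$. Then, for any $\epsilon > 0$: $\Pr[|\bar{X} - \mu| \ge \epsilon] \le 2\exp\left(-\frac{2m\epsilon^2}{r^2}\right)$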

- Chernoff Bound (… )
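
To make the Hoeffding bound concrete, the sketch below evaluates the standard two-sided form, Pr[|mean - μ| ≥ ε] ≤ 2·exp(-2mε²/r²), and inverts it to get a sample size. The function names are hypothetical and the numbers are only an illustration.

```python
import math


def hoeffding_bound(m, eps, r=1.0):
    """Upper bound on Pr[|mean - mu| >= eps] for m independent values in [0, r]."""
    return 2 * math.exp(-2 * m * eps ** 2 / r ** 2)


def samples_needed(eps, delta, r=1.0):
    """Smallest m for which the bound above is at most delta."""
    return math.ceil(r ** 2 * math.log(2 / delta) / (2 * eps ** 2))


m = samples_needed(eps=0.05, delta=0.1)
print(m, hoeffding_bound(m, eps=0.05))
```

The required sample size depends only on ε, δ and the value range r, not on the length of the stream.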


Histogram

Histograms approximate the frequency distribution of element values in a stream

A histogram (typically) consists of

- A partitioning of element domain values into buckets
- A count per bucket B (of the number of elements in B)

[Figure: bar chart of the count per bucket over domain values 1 to 20.]


Wavelet

Wavelets: Mathematical tool for hierarchical decomposition of functions/signals

Haar wavelets: Simplest wavelet, easy to understand and implement

- Recursive pairwise averaging and differencing at different resolutions

| Resolution | Averages | Detail Coefficients |
|---|---|---|
| 3 | [2, 2, 0, 2, 3, 5, 4, 4] | ---- |
| 2 | [2, 1, 4, 4] | [0, -1, -1, 0] |
| 1 | [1.5, 4] | [0.5, 0] |
| 0 | [2.75] | [-1.25] |

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
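
The recursive pairwise averaging and differencing can be sketched in Python as follows. The function name `haar_decompose` is an assumption for this example, and the input length is assumed to be a power of two.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition: overall average, then details coarse to fine."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]  # pairwise differencing
        averages = [(a + b) / 2 for a, b in pairs]       # pairwise averaging
        details = level_details + details                # coarser levels go first
    return averages + details


coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# coeffs == [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The result matches the decomposition of the slide's example signal.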


- Introduction and Applications
- Data Stream Management System
- Modeling for Data Streams
- Approximation Technique
- Data Stream Computation


Frequency Related Problems

- Top-k most frequent elements
- Find all elements with frequency > 0.1%
- What is the frequency of element 3?
- What is the total frequency of elements between 8 and 14?
- Which elements occupy 0.1% of the tail?
- How many elements have non-zero frequency? (number of distinct elements)

[Figure: frequency histogram over domain values 1 to 20.]


An Old Chestnut: Majority

- A sequence of N items.
- You have constant memory.
- In one pass, decide if some item is in majority (occurs > N/2 times)?

Example (items shown on the slide): 9 9 9 7 6 4 9 9 9 3 9

N = 12; item 9 is the majority

Any ideas?


Misra-Gries Algorithm (‘82)

- Keep a counter and an ID.
- If the counter is 0, store the new item as the ID with count = 1.
- Otherwise, if the new item is the same as the stored ID, increment the counter.
- Otherwise, decrement the counter.
- At the end, if the counter > 0, its item is the only candidate for majority.
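
A minimal Python sketch of this single-counter scheme, using the items recoverable from the example slide. The function name is hypothetical.

```python
def majority_candidate(stream):
    """One pass, constant memory: return the only possible majority item, or None."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1  # store the new item
        elif item == candidate:
            count += 1
        else:
            count -= 1
    return candidate if count > 0 else None


stream = [9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
print(majority_candidate(stream))
```

The returned item is only a candidate: for the stream 1, 2, 3 the function returns 3 although no item is in the majority, so a second pass is needed to confirm.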


A generalization: Frequent Items (Karp 03)

Find k items, each occurring at least N/(k+1) times.

[Figure: table of k entries ID1, ID2, ..., IDk, each with a count.]

- Algorithm:
- Maintain k items, and their counters.
- If next item x is one of the k, increment its counter.
- Else if some counter is zero, put x there with count = 1
- Else (all counters non-zero) decrement all k counters
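
The k-counter algorithm can be sketched in Python with a dictionary, where a deleted key plays the role of a zero counter. The function name is an assumption for this illustration.

```python
def frequent_items(stream, k):
    """Keep at most k counters; every item occurring > N/(k+1) times survives."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1  # a free (zero) counter is available
        else:
            # All k counters are non-zero: decrement every one of them.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


result = frequent_items([1, 1, 1, 2, 2, 3, 4, 1, 2, 1], k=2)
```

Here N = 10 and k = 2, so any item occurring more than N/(k+1) ≈ 3.3 times (item 1, with 5 occurrences) must remain in `result`; item 2 also remains although it occurs only 3 times.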


Frequent Elements: Analysis

- A frequent item’s count is decremented only when all k counters are occupied; each such step erases k+1 items (one from each of the k counters, plus the new item).
- There can be at most N/(k+1) such steps, so if x occurs > N/(k+1) times, it cannot be completely erased.
- Similarly, x must get inserted at some point, because there are not enough items to keep it away.


Problem of False Positives

- False positives in Misra-Gries(MG) algorithm
- It identifies all true heavy hitters, but not all reported items are necessarily heavy hitters.
- How can we tell if the non-zero counters correspond to true heavy hitters or not?
- A second pass is needed to verify.
- False positives are problematic if heavy hitters are used for billing or punishment.
- What guarantees can we achieve in one pass?


Approximation Guarantees

- Find heavy hitters with a guaranteed approximation error [MM02]
- Manku-Motwani (Lossy Counting)
- Suppose you want φ-heavy hitters: items with freq > φN
- An approximation parameter ε, where ε << φ (e.g., φ = .01 and ε = .0001; φ = 1% and ε = .01%)
- Identify all items with frequency > φN
- No reported item has frequency < (φ - ε)N
- The algorithm uses O(1/ε log(εN)) memory


MM02 Algorithm 1: Lossy Counting

Step 1: Divide the stream into ‘windows’

[Figure: the stream divided into consecutive windows (Window 2 and Window 3 are labelled).]

Window-size W is a function of support s – specify later…


Lossy Counting in Action ...

[Figure: the counts start empty; the frequencies of the first window are added to them.]

At window boundary, decrement all counters by 1


Lossy Counting continued ...

[Figure: the frequencies of the next window are added to the existing counts.]

At window boundary, decrement all counters by 1

Error Analysis

How much do we undercount?

If the current size of the stream is N and the window size is W = 1/ε, then #windows = εN. Each window boundary decrements a counter by at most 1, so frequency error ≤ #windows = εN.

Rule of thumb: set ε = 10% of support s

Example: given support frequency s = 1%, set error frequency ε = 0.1%


Putting it all together…

Output:

- Elements with counter values exceeding (s-ε)N

Approximation guarantees:

- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least (s–ε)N

How many counters do we need?

- Worst case bound: 1/ε log εN counters
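
The window-based description of Lossy Counting can be sketched in Python as below. This follows the simplified slide version (add counts, then decrement every counter by 1 at each window boundary); the function names are hypothetical, and the published MM02 algorithm tracks a per-entry error term that this sketch omits.

```python
import math


def lossy_counting(stream, epsilon):
    """Count items, decrementing all counters by 1 at every window boundary."""
    window_size = math.ceil(1 / epsilon)
    counts = {}
    n = 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % window_size == 0:  # window boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts, n


def heavy_hitters(counts, n, support, epsilon):
    """Report elements whose counter exceeds (s - epsilon) * N."""
    return {x for x, c in counts.items() if c > (support - epsilon) * n}


counts, n = lossy_counting(["a", "b", "a", "c", "a", "d", "a", "e"], epsilon=0.25)
```

In this run the true frequency of `a` is 4 and its counter is 2, an undercount of exactly εN = 2, consistent with the error analysis.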


Frequent items (Turnstile)

Data stream: 3, 1, 2, 4, -2, 3, 5, . . .

[Figure: frequency bars f(1), f(2), f(3), f(4), f(5).]

- Ask for f(1) = ? f(4) = ?
- AMS based algorithm
- Count Min sketch.


AMS (sketch) based algorithm

Key Intuition: Use randomized linear projections of f() to define a random variable Z such that, for a given element A[i],

$E(Z \cdot \xi_i) = ||A[i]|| = f_i$

Similarly, $E(Z \cdot \xi_j) = f_j$.

Basic Idea:

Define a family of 4-wise independent {-1, +1} random variables $\xi_i$ with

$\Pr[\xi_i = +1] = \Pr[\xi_i = -1] = 1/2$

Let $Z = \sum_i f_i \xi_i$

So $E(Z \cdot \xi_i) = f_i$, because $E(\xi_i^2) = 1$ and $E(\xi_i \xi_j) = 0$ for $i \ne j$.

[Figure: example.]


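AMS cont.

- Keep an array of w × d counters Z[i, j]
- Use d hash functions to map element x to [1..w]
- Update for element a: $Z[i, h_i(a)] \mathrel{+}= \xi_i(a)$ in each row i
- Estimate: $\mathrm{Est}(f_a) = \mathrm{median}_i \left( Z[i, h_i(a)] \cdot \xi_i(a) \right)$

[Figure: w × d counter array; element a is hashed by h1(a), ..., hd(a) to one counter in each of the d rows.]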


The Count Min (CM) Sketch

- Simple sketch idea, can be used for point queries (f_i), range queries, quantiles, join size estimation
- Creates a small summary as an array of w × d counters C
- Use d hash functions to map element to [1..w]
- w = O(1/ε) and d = O(log(1/δ)), giving a summary of size O(1/ε log 1/δ)
- Related to the Bloom filter technique


CM Sketch Structure

- Each element xi is mapped to one counter per row
- C[k, hk(xi)] = C[k, hk(xi)] + 1 (-1 for deletion), or + c[j] if the incoming update is <j, c[j]>
- Estimate A[j] by taking min_k C[k, hk(j)]

[Figure: w × d counter array; xi is hashed by h1(xi), ..., hd(xi) and +1 is added to one counter in each of the d rows.]
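
A minimal Count-Min sketch in Python, assuming integer items and hash functions of the form ((a·x + b) mod p) mod w with randomly drawn a and b. The class name, the prime, and the `seed` parameter are choices made for this illustration only.

```python
import random


class CountMinSketch:
    PRIME = 2_147_483_647  # 2^31 - 1, assumed larger than any item value

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row defines that row's hash function.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]

    def _column(self, k, item):
        a, b = self.params[k]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, c=1):
        """Add c (negative for a deletion) to one counter per row."""
        for k, row in enumerate(self.rows):
            row[self._column(k, item)] += c

    def estimate(self, item):
        """Point query: the minimum over the item's counters."""
        return min(row[self._column(k, item)] for k, row in enumerate(self.rows))
```

When all true frequencies stay non-negative, every counter is at least the true frequency of any item hashed to it, so the minimum never underestimates.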


CM Sketch Summary

- CM sketch guarantees approximation error on point queries less than ε||A||_1 in size O(1/ε log 1/δ)
- Probability of more error is less than δ (the guarantee holds with probability at least 1-δ)
- How to extend to Range Query ?


Distinct Value Estimation ( F0 )

- Problem: Find the number of distinct values in a stream of values with domain [0,...,N-1]
- Zeroth frequency moment
- Statistics: number of species or classes in a population
- Important for query optimizers
- Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
- Hard problem for random sampling!
- Must sample almost the entire table to guarantee the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!

Data stream: 3 0 5 3 0 1 7 5 1 0 3 7

Number of distinct values: 5


FM Sketch

- Assume a hash function h(x) that maps incoming values x in [0,…, N-1] uniformly across [0,…, 2^L-1], where L = O(logN)
- Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
- A value x is mapped to lsb(h(x))
- Maintain Hash Sketch = BITMAP array of L bits, initialized to 0
- For each incoming value x, set BITMAP[ lsb(h(x)) ] = 1
- Prob[ lsb(h(x)) = i ] = ?

Example: x = 5, h(x) = 101100, lsb(h(x)) = 2

| Position | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|
| BITMAP | 0 | 0 | 0 | 1 | 0 | 0 |


FM Sketch (cont.)

- By uniformity through h(x): each value sets BITMAP[k] with probability 1/2^(k+1)
- Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], . . .
- Let R = position of rightmost zero in BITMAP
- Use as indicator of log(d)
- [FM85] prove that E[R] = log(φd), where φ = .7735
- Estimate d = 2^R/φ

[Figure: BITMAP from position L-1 down to 0: zeros at positions >> log(d), a fringe of 0/1s around log(d), ones at positions << log(d).]
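
A single-bitmap FM sketch can be written in Python as follows. The use of SHA-256 truncated to 32 bits as the hash function h and the names `lsb` and `fm_estimate` are assumptions for this example; with one bitmap the estimate is coarse, since it can only take values of the form 2^R/φ.

```python
import hashlib

PHI = 0.77351  # correction constant from [FM85]
NUM_BITS = 32  # L: length of the bitmap


def lsb(y):
    """Position of the least-significant 1 bit of y (y > 0)."""
    return (y & -y).bit_length() - 1


def fm_estimate(stream):
    """Estimate the number of distinct values with one FM bitmap."""
    bitmap = [0] * NUM_BITS
    for x in stream:
        digest = hashlib.sha256(str(x).encode()).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        if h != 0:  # h == 0 has no 1 bit to record
            bitmap[lsb(h)] = 1
    # R = position of the rightmost (lowest-index) zero in the bitmap.
    r = bitmap.index(0) if 0 in bitmap else NUM_BITS
    return 2 ** r / PHI
```

Duplicates hash to the same position, so they never change the bitmap; this is why the sketch counts distinct values rather than stream length.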


Boosting the Accuracy

[Figure: m bitmaps, BITMAP 1, ..., BITMAP m.]

Approximation with probability at least 1-δ


FM (cont.)

- Find out the number of distinct objects with value smaller than V?

[Figure: stream of (object, value) pairs such as (X1,8), (X5,8), (X6,2), (X7,11), (X3,5), (X8,2) hashed into bitmaps B1, B2, whose cells hold values instead of bits; a cell is read as 0 for v=4 and as 1 for v=10.]


Distinct Sampling

- Applications: click streams, inverse distribution
- [Gib01] Use FM-like hash function h() for each streaming value x
- Prob[ h(x) = k ] = 1/2^(k+1)
- Key Invariant: “All values with h(x) >= level (and only these) are in the distinct sample”


Distinct Sampling ( cont. )

DistinctSampling(B, r)

    // B = space bound, r = tuple-reservoir size for each distinct value
    level = 0; S = ∅
    for each new tuple t do
        let x = value of DISTINCT target attribute in t
        if h(x) >= level then   // x belongs in the distinct sample
            use t to update the reservoir sample of tuples for x
        if |S| >= B then        // out of space
            evict from S all tuples with h(target-attribute-value) = level
            set level = level + 1
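
The pseudocode above can be turned into a runnable Python sketch under a few assumptions: h is taken to be the position of the least-significant 1 bit of a truncated SHA-256 hash, |S| is counted as the total number of stored tuples, and the eviction step is repeated in a loop until space is available. All function and parameter names are hypothetical.

```python
import hashlib
import random


def level_hash(x):
    """FM-like hash: returns level k with probability about 1/2^(k+1)."""
    h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")
    return (h & -h).bit_length() - 1 if h else 64


def distinct_sampling(tuples, key, space_bound, reservoir_size, seed=0):
    rng = random.Random(seed)
    level = 0
    sample = {}  # distinct value -> [tuples seen for it, reservoir of tuples]
    stored = 0   # |S|: total number of tuples currently stored
    for t in tuples:
        x = key(t)
        if level_hash(x) >= level:  # x belongs in the distinct sample
            entry = sample.setdefault(x, [0, []])
            entry[0] += 1
            if len(entry[1]) < reservoir_size:
                entry[1].append(t)
                stored += 1
            else:
                j = rng.randrange(entry[0])  # reservoir sampling per value
                if j < reservoir_size:
                    entry[1][j] = t
        while stored >= space_bound and sample:  # out of space
            for value in [v for v in sample if level_hash(v) == level]:
                stored -= len(sample.pop(value)[1])
            level += 1
    return sample, level
```

Because each level keeps a value with probability about 1/2^level, `len(sample) * 2 ** level` gives a rough estimate of the number of distinct values.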


Reference

- [FM85] P. Flajolet, G. Martin. Probabilistic Counting. Foundations of Computer Science, 1985.
- [GMPS02] M. Datar, A. Gionis, P. Indyk, R. Motwani. Maintaining stream statistics over sliding windows. SODA 2002.
- [MM02] G. Manku, R. Motwani. Approximate frequency counts over data streams. VLDB 2002.
- [Karp03] R. Karp, C. Papadimitriou, S. Shenker. A simple algorithm for finding frequent elements in streams and bags. TODS 2003.
- [Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. VLDB 2001.
