
Introduction to Data Stream

DBG@UNSW

introduction to data stream DBG@UNSW

Acknowledgement

Some of the slides are modified from

  • Nikos Koudas (Toronto U)
  • Minos Garofalakis (Yahoo! Research)
  • Divesh Srivastava (AT&T)
  • S. Muthukrishnan (Rutgers)
  • Georges Hébrail (ENST Paris)

Outline
  • Introduction and Applications
  • Data Stream Management System
  • Modeling for Data Streams
  • Approximation Technique
  • Data Stream Computation

What is a data stream?
  • Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
  • Structured records, as opposed to audio or video data
  • Massive volumes of data, records arrive at a high rate

Data stream applications
  • Transactional data streams: log interactions between entities
    • Credit card: purchases by consumers from merchants
    • Telecommunications: phone calls by callers to dialed parties
    • Web: accesses by clients of resources at servers
  • Measurement data streams: monitor evolution of entity states
    • IP network: traffic at router interfaces
    • Sensor networks: physical phenomena, road traffic
    • Earth climate: temperature, moisture at weather stations


Applications: Network Management

[Figure: network supervision center]

Network management involves monitoring and configuring network hardware and software to ensure smooth operation.

  • Quickly detect faults, congestion, attacks
  • Quality of service (QoS)
  • Load balancing, improve utilization of network resources

Network Management (more details)
  • Traffic estimation
    • What fraction of network IP addresses are active?
    • How many bytes were sent between a pair of IP addresses?
    • List the top 100 IP addresses in terms of traffic
  • Traffic analysis
    • What is the average duration of an IP session?
    • What is the median of the number of bytes in each IP session?
  • Fraud detection
    • List all sessions that transmitted more than 1000 bytes
    • Identify all sessions whose duration was more than twice the normal
  • Security/Denial of Service
    • List all IP addresses that have witnessed a sudden spike in traffic
    • Identify IP addresses involved in more than 1000 sessions

Application: Stock Monitoring
  • Stream of price and sales volume of stocks over time
  • Technical analysis/charting for stock investors
  • Support trading decisions
  • Notify me when some stock goes up by at least 5%.
  • Notify me when the price of any stock increases monotonically for ≥ 40 min.

Challenges
  • Massive in volume or even infinite
    • AT&T long-distance: ~300M call tuples/day
    • AT&T IP backbone: ~50B IP flows/day
  • Rapid arrival rate
  • Real-time monitoring (response) required
  • Continuous queries

Outline
  • Introduction and Applications
  • Data Stream Management System
  • Modeling for Data Streams
  • Approximation Technique
  • Data Stream Computation

DSMS

DBMS - Data Base Management System

  • Data model (relational)
  • Data is stored on disk
  • SQL language
    • Creating structures
    • Inserting/updating/deleting data
    • Retrieving data (query)
  • Good performance even with large volumes of data

DSMS - Data Stream Management System

  • Data model (streams and permanent relations)
  • Permanent relations are stored on disk, but streams are processed on the fly
  • SQL-like query language
    • Standard SQL on permanent relations
    • Extended SQL on streams with windowing features
    • New paradigm of queries (continuous queries)
  • Tools for capturing input streams and producing output streams
  • Good performance: optimization of computer resources

Existing DSMS

Principal specialized DSMSs

  • Gigascope and Hancock: AT&T
    • Network monitoring
    • Analysis of telecommunication calls
  • NiagaraCQ: University of Wisconsin-Madison
    • Large number of continuous queries on web content (XML-QL)
  • Tradebot (finance), StatStream (statistics)

Principal general-purpose DSMSs

  • STREAM: Stanford University
  • TelegraphCQ: University of California, Berkeley
  • Aurora: Brown University, MIT, Brandeis

Sensor networks

  • Cougar: Cornell University
  • TinyDB: University of California, Berkeley


STREAM from Stanford

[Figure: STREAM architecture. Labels: Register Query, Input streams, DSMS, Streamed Result, Stored Result, Archive, Scratch Store, Stored Relations]

STREAM (cont.)
  • General-purpose DSMS for streams and stored data
  • Relational (unlikely to change)
  • Centralized server model (likely to change)
    • Single-threaded and parallel versions
  • Declarative language for registering continuous queries (CQL)
  • Query optimization with good memory management
  • Approximate answers with synopsis management

STREAM (cont.)

Some Implementation Issues

  • Designed to cope with:
    • Stream rates that may be high, variable, bursty
    • Continuous query loads that may be high, volatile
  • Primary coping techniques
    • Continuous self-monitoring and reoptimization
    • Graceful approximation as necessary
    • Careful resource allocation and use

Outline
  • Introduction and Applications
  • Data Stream Management System
  • Modeling for Data Streams
  • Approximation Technique
  • Data Stream Computation

Models for data streams
  • Structure of a stream
    • Infinite sequence of items (elements)
    • One item: structured information, i.e., a tuple or an object
    • Same structure for all items in a stream
    • Timestamping
      • Explicit (date field in data)
      • Implicit (timestamp given when items arrive)
    • Representation of time
      • Physical (date)
      • Logical (integer)

Models for data streams (cont.)
  • One-dimensional array A[1…N] with values A[i] all initially zero
  • Signal is implicitly represented via a stream of updates
    • j-th update is <k, c[j]>, implying A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0)

  • Goal: Compute functions on A[ ] subject to
    • Small space
    • Fast processing of updates
    • Fast function computation


Models for data streams (cont.)

  • Time-Series Model
    • Only the j-th update updates A[j] (i.e., A[j] := c[j])
  • Cash-Register Model
    • c[j] is always >= 0 (i.e., increment-only)
    • Typically, c[j] = 1, so we see a multi-set of items in one pass
  • Turnstile Model
    • Most general streaming model
    • c[j] can be >= 0 or < 0 (i.e., increment or decrement)
  • Problem difficulty varies depending on the model
    • E.g., MIN/MAX in Time-Series vs. Turnstile!
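The three update models can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `apply_updates`, the `model` strings, and the 0-based indexing (the slides use A[1…N]) are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Update = Tuple[int, int]  # (k, c): the update <k, c[j]>


def apply_updates(n: int, updates: Iterable[Update], model: str = "turnstile") -> List[int]:
    """Rebuild the signal A[0..n-1] from a stream of updates under one model."""
    a = [0] * n
    for j, (k, c) in enumerate(updates):
        if model == "time-series":
            # The j-th update may only touch A[j], and it sets the value.
            if k != j:
                raise ValueError("time-series update must target index j")
            a[j] = c
        elif model == "cash-register":
            # Increment-only: negative updates are not allowed.
            if c < 0:
                raise ValueError("cash-register updates are increment-only")
            a[k] += c
        elif model == "turnstile":
            # Most general: increments and decrements are both allowed.
            a[k] += c
        else:
            raise ValueError(f"unknown model: {model}")
    return a
```

Note that this keeps the whole array A in memory, which a real stream algorithm cannot afford; it only makes the semantics of each model concrete.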


Windowing

  • Applying queries/mining tasks to the whole stream (from beginning to current time)
  • Applying queries/mining tasks to a portion of the stream

[Figure: timeline t from the beginning of the stream to the current date, with a window on the stream]

Windowing (cont.)

Definition of windows of interest on streams

  • Fixed windows: September 2007
  • Sliding windows: last 3 hours (n of N window)
  • Landmark windows: from September 1st, 2007

Window specification

  • Physical time: last 3 hours
  • Logical time: last 1000 items

Refreshing rate

  • Rate of producing results (every item, every 10 items, every minute, …)
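As a rough illustration of a logical-time sliding window combined with a refreshing rate, the sketch below reports the mean of the last `window` items once every `refresh` arrivals. The function name and both parameter names are hypothetical, chosen only for this example.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_average(stream: Iterable[float], window: int, refresh: int) -> Iterator[float]:
    """Yield the mean of the last `window` items once every `refresh` arrivals."""
    buf: Deque[float] = deque()
    total = 0.0
    for n, x in enumerate(stream, start=1):
        buf.append(x)
        total += x
        if len(buf) > window:
            # The oldest item falls out of the window.
            total -= buf.popleft()
        if n % refresh == 0:
            yield total / len(buf)
```

For example, `list(sliding_average([1, 2, 3, 4, 5, 6], window=3, refresh=2))` produces one result after every second item, each computed over at most the last three items.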


Computation Model

[Figure: data streams enter a stream processing engine, which maintains a synopsis in memory and outputs an (approximate) answer]

  • Stream processing requirements
    • Single pass: each record is examined at most once
    • Bounded storage: limited memory (M) for storing the synopsis
    • Real-time: per-record processing time (to maintain the synopsis) must be low
    • Data independent: no a priori knowledge required about the data set (size, range, distribution, order)

Outline
  • Introduction and Applications
  • Data Stream Management System
  • Modeling for Data Streams
  • Approximation Technique
  • Data Stream Computation

Approximation
  • Exact answer is too expensive to compute
    • May need more memory than is affordable (distinct count, median)
    • May take too long to complete
  • Approximate answer is acceptable in many applications
  • ε-approximate answers (absolute error or relative error)
    • Example: for exact answer E = 100 and relative error ε = 0.1, any answer in [90, 110] is acceptable
    • Only a small amount of memory is needed
    • Computed very quickly
    • Error is guaranteed to be small

Approximation (cont.)
  • Deterministic approximate methods
    • Deterministic algorithms carefully control the error.
  • Non-deterministic approximate methods
    • Randomization, sampling, etc. provide a good approximation with high probability.

Basic synopses

Basic stream synopses computation

  • Samples: answering queries using samples; reservoir sampling, inverse sampling
  • Histograms: equi-depth histograms, on-line quantile computation
  • Wavelets: Haar-wavelet histogram construction and maintenance
  • Sketches: AMS sketch, CM sketch, FM sketch

introduction to data stream DBG@UNSW

sampling
Sampling
  • Idea: A small random sample S of the data often represents all the data well
    • For a fast approximate answer, apply a “modified” query to S
    • Example: select agg from R (n=12)
    • If agg is avg, return the average of the elements in S
    • Number of odd elements?

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

Sample S: 9 5 1 8

answer: 11.5
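One common way to maintain such a sample S in a single pass is reservoir sampling, which is listed among the basic synopses. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, and the two estimators are simple scale-up estimators written for this example rather than taken from the slides.

```python
import random
from typing import Callable, Iterable, List, Optional


def reservoir_sample(stream: Iterable[int], k: int,
                     rng: Optional[random.Random] = None) -> List[int]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    sample: List[int] = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(n)      # uniform in [0, n)
            if j < k:                 # happens with probability k/n
                sample[j] = x
    return sample


def estimate_avg(sample: List[int]) -> float:
    """Approximate avg over the stream by the sample average."""
    return sum(sample) / len(sample)


def estimate_count(sample: List[int], n: int, pred: Callable[[int], bool]) -> float:
    """Approximate how many of the n stream items satisfy pred by scaling up."""
    return n * sum(1 for x in sample if pred(x)) / len(sample)
```

The estimates vary from run to run because the sample is random; passing a seeded `random.Random` makes a run repeatable.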

Probabilistic Guarantees
  • Example: Actual answer is within 11.5 ± 1 with prob ≥ 0.9
  • Randomized algorithms: the answer returned is a specially-built random variable.
  • Use Tail Inequalities to give probabilistic bounds on returned answer
    • Markov Inequality
    • Chebyshev’s Inequality
    • Chernoff/Hoeffding Bound


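Basic Tools: Tail Inequalities

[Figure: probability distribution with its tail probability marked]

  • General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)
  • Basic Inequalities: Let X be a random variable with expectation $\mu$ and variance Var[X]. Then for any $\varepsilon > 0$:
    • Markov (for $X \ge 0$): $\Pr[X \ge \varepsilon] \le \mu / \varepsilon$
    • Chebyshev: $\Pr[|X - \mu| \ge \mu\varepsilon] \le \mathrm{Var}[X] / (\mu^2 \varepsilon^2)$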

introduction to data stream DBG@UNSW

tail inequalities cont
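Tail Inequalities (cont.)
  • Hoeffding’s Inequality: Let X1, ..., Xm be independent random variables with 0 <= Xi <= r. Let $\bar{X} = \frac{1}{m}\sum_i X_i$ and let $\mu$ be the expectation of $\bar{X}$. Then, for any $\varepsilon > 0$: $\Pr[|\bar{X} - \mu| \ge \varepsilon] \le 2\exp(-2m\varepsilon^2 / r^2)$
  • Chernoff Bound (… )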
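To make the use of such a bound concrete, the sketch below takes the standard two-sided form of Hoeffding's inequality, Pr[|X̄ − μ| ≥ ε] ≤ 2 exp(−2mε²/r²), and solves it for the number of samples m that pushes the failure probability below a target δ. The function names and the parameter `delta` are assumptions introduced for this example.

```python
import math


def hoeffding_bound(m: int, r: float, eps: float) -> float:
    """Upper bound on Pr[|mean - mu| >= eps] for m samples with values in [0, r]."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * eps ** 2 / r ** 2))


def hoeffding_sample_size(r: float, eps: float, delta: float) -> int:
    """Smallest m for which the bound above is at most delta."""
    # Solve 2 * exp(-2 m eps^2 / r^2) <= delta for m.
    return math.ceil(r ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For values in [0, 1], an additive error of 0.1, and a failure probability of 0.1, `hoeffding_sample_size(1.0, 0.1, 0.1)` gives the sample size that this bound requires, independent of the stream length.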

Histogram

Histograms approximate the frequency distribution of element values in a stream

A histogram (typically) consists of

  • A partitioning of element domain values into buckets
  • A count per bucket B (of the number of elements in B)

[Figure: bar chart of the count per bucket over domain values 1 to 20]
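A bucketed histogram of this kind can be sketched as follows. The example uses equi-width buckets for simplicity (not the equi-depth variant named earlier), and it answers range-count queries by assuming values are spread uniformly inside each bucket; the class and method names are hypothetical.

```python
from typing import List


class EquiWidthHistogram:
    """Counts per bucket over an integer domain [lo, hi] split into equal-width buckets."""

    def __init__(self, lo: int, hi: int, num_buckets: int) -> None:
        if (hi - lo + 1) % num_buckets != 0:
            raise ValueError("domain size must be a multiple of num_buckets")
        self.lo = lo
        self.width = (hi - lo + 1) // num_buckets
        self.counts: List[int] = [0] * num_buckets

    def add(self, v: int) -> None:
        """Process one stream element: bump the count of its bucket."""
        self.counts[(v - self.lo) // self.width] += 1

    def estimate_range(self, a: int, b: int) -> float:
        """Estimate the number of elements with value in [a, b]."""
        total = 0.0
        for i, c in enumerate(self.counts):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width - 1
            overlap = min(b, b_hi) - max(a, b_lo) + 1
            if overlap > 0:
                # Assume the bucket's elements are uniform over its values.
                total += c * overlap / self.width
        return total
```

Queries aligned with bucket boundaries are answered exactly; queries that cut through a bucket are only as good as the uniformity assumption.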


Wavelet

Wavelets: Mathematical tool for hierarchical decomposition of functions/signals

Haar wavelets: Simplest wavelet, easy to understand and implement

  • Recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
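The recursive pairwise averaging and differencing can be written out directly. The sketch below is one straightforward way to compute the (unnormalized) Haar decomposition, assuming the input length is a power of two; the function name is hypothetical.

```python
from typing import List


def haar_decompose(values: List[float]) -> List[float]:
    """Return [overall average, detail coefficients from coarsest to finest]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    details: List[float] = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Details of the coarser level go in front of the finer ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details
```

Applied to the signal [2, 2, 0, 2, 3, 5, 4, 4], it reproduces the decomposition [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].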

Outline
  • Introduction and Applications
  • Data Stream Management System
  • Modeling for Data Streams
  • Approximation Technique
  • Data Stream Computation


Frequency Related Problems

  • Top-k most frequent elements
  • Find all elements with frequency > 0.1%
  • Find elements that occupy 0.1% of the tail
  • What is the frequency of element 3?
  • What is the total frequency of elements between 8 and 14?
  • How many elements have non-zero frequency? (distinct count)

[Figure: frequency histogram over domain values 1 to 20, annotated with the questions above]


An Old Chestnut: Majority
  • A sequence of N items.
  • You have constant memory.
  • In one pass, decide if some item is in the majority (occurs > N/2 times).

Sequence: 2 9 9 9 7 6 4 9 9 9 3 9

N = 12; item 9 is the majority

Any idea?

Misra-Gries Algorithm (‘82)
  • A counter and an ID.
    • If the counter is 0, store the new item with count = 1.
    • Otherwise, if the new item is the same as the stored ID, increment the counter.
    • Otherwise, decrement the counter.
  • If counter > 0 at the end, then its item is the only candidate for majority.
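The single-counter scheme fits in a few lines. The sketch below assumes the usual ordering of the checks (an empty counter is tested first); the function name is hypothetical. It returns only a candidate: whether that candidate really is a majority item still has to be verified separately.

```python
from typing import Hashable, Iterable, Optional


def majority_candidate(stream: Iterable[Hashable]) -> Optional[Hashable]:
    """One pass, constant memory: return the only possible majority item, or None."""
    candidate: Optional[Hashable] = None
    count = 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # empty counter: adopt the new item
        elif x == candidate:
            count += 1
        else:
            count -= 1                # a different item cancels one occurrence
    return candidate if count > 0 else None
```

On the sequence 2 9 9 9 7 6 4 9 9 9 3 9 the candidate is 9. On a sequence such as 1 2 3 the candidate is 3 even though no item is a majority, which is why the result is only a candidate.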


A generalization: Frequent Items (Karp 03)

[Figure: table of k entries, each holding an ID (ID1, ID2, ..., IDk) and its count]

Find k items, each occurring at least N/(k+1) times.

  • Algorithm:
    • Maintain k items, and their counters.
    • If next item x is one of the k, increment its counter.
    • Else if some counter is zero, put x there with count = 1
    • Else (all counters non-zero) decrement all k counters
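The k-counter algorithm can be sketched with a dictionary, where a missing key plays the role of a zero counter. The function name is hypothetical, and dropping counters that reach zero is an implementation choice made for this example.

```python
from typing import Dict, Hashable, Iterable


def frequent_items(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Keep at most k counters; every item occurring > N/(k+1) times survives."""
    counters: Dict[Hashable, int] = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1               # a free (zero) counter is available
        else:
            # All k counters are in use: decrement every one of them.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

With k = 2 on the sequence 2 9 9 9 7 6 4 9 9 9 3 9, item 9 (7 occurrences, more than N/(k+1) = 4) is retained, while item 3 is also present at the end despite occurring once, so the surviving counters are candidates rather than confirmed frequent items.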

Frequent Elements: Analysis
  • A frequent item’s count is decremented only when all k counters are in use and a new item arrives: each such decrement step erases k+1 items (one from each counter, plus the new item).
  • If x occurs > N/(k+1) times, then it cannot be completely erased.
  • Similarly, x must get inserted at some point, because there are not enough items to keep it away.

Problem of False Positives
  • False positives in the Misra-Gries (MG) algorithm
    • It identifies all true heavy hitters, but not all reported items are necessarily heavy hitters.
    • How can we tell if the non-zero counters correspond to true heavy hitters or not?
  • A second pass is needed to verify.
  • False positives are problematic if heavy hitters are used for billing or punishment.
  • What guarantees can we achieve in one pass?

Approximation Guarantees
  • Find heavy hitters with a guaranteed approximation error [MM02]
  • Manku-Motwani (Lossy Counting)
    • Suppose you want φ-heavy hitters: items with freq > φN
    • An approximation parameter ε, where ε << φ (e.g., φ = .01 and ε = .0001; φ = 1% and ε = .01%)
    • Identify all items with frequency > φN
    • No reported item has frequency < (φ - ε)N
  • The algorithm uses O((1/ε) log(εN)) memory


MM02 Algorithm 1: Lossy Counting

Step 1: Divide the stream into ‘windows’

[Figure: stream split into consecutive windows: Window 1, Window 2, Window 3]

Window size W is a function of support s – specified later…


Lossy Counting in Action ...

[Figure: frequency counts, initially empty, plus the items of the first window]

At window boundary, decrement all counters by 1


Lossy Counting continued ...

[Figure: current frequency counts plus the items of the next window]

At window boundary, decrement all counters by 1

Error Analysis

How much do we undercount?

If the current size of the stream is N and the window size is W = 1/ε, then #windows = εN, so

frequency error ≤ #windows = εN

Rule of thumb: set ε = 10% of support s

Example: given support frequency s = 1%, set error frequency ε = 0.1%

Putting it all together…

Output:

  • Elements with counter values exceeding (s-ε)N

Approximation guarantees

  • Frequencies underestimated by at most εN
  • No false negatives
  • False positives have true frequency at least (s-ε)N

How many counters do we need?

  • Worst case bound: (1/ε) log(εN) counters
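The windowed version of lossy counting described on these slides can be sketched as below: count every item, and at each window boundary decrement all counters by 1, dropping those that reach zero. The function names are hypothetical, the window size is taken as ⌈1/ε⌉, and the output test uses `>=` rather than a strict comparison so that an item sitting exactly on the threshold is not missed; these are choices made for this example.

```python
import math
from typing import Dict, Hashable, Iterable, Set, Tuple


def lossy_count(stream: Iterable[Hashable], eps: float) -> Tuple[Dict[Hashable, int], int]:
    """Return (counters, N) after one pass with windows of size ceil(1/eps)."""
    window = math.ceil(1 / eps)
    counts: Dict[Hashable, int] = {}
    n = 0
    for x in stream:
        n += 1
        counts[x] = counts.get(x, 0) + 1
        if n % window == 0:
            # Window boundary: decrement all counters, drop the empty ones.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts, n


def heavy_hitters(counts: Dict[Hashable, int], n: int, s: float, eps: float) -> Set[Hashable]:
    """Report items whose counter reaches (s - eps) * N."""
    return {x for x, c in counts.items() if c >= (s - eps) * n}
```

Each counter is decremented at most once per window, so with εN windows no frequency is undercounted by more than εN.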


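Frequent items (Turnstile)

Data stream: 3, 1, 2, 4, -2, 3, 5, . . . (a negative entry deletes one occurrence of that item)

[Figure: bar chart of the frequencies after these updates: f(1) = 1, f(2) = 0, f(3) = 2, f(4) = 1, f(5) = 1]

  • Ask for f(1) = ? f(4) = ?
    • AMS-based algorithm
    • Count-Min sketch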

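AMS (sketch) based algorithm

Key Intuition: Use randomized linear projections of f() to define a random variable Z such that, for a given element i,

$E(Z) = A[i] = f_i$

Similarly, for element j (with Z defined for j), we have $E(Z) = f_j$.

Basic Idea:

  • Define a family of 4-wise independent {-1, +1} random variables $\{\xi_k : k = 1, \ldots, N\}$ with $\Pr[\xi_k = +1] = \Pr[\xi_k = -1] = 1/2$
  • Let $Z = \xi_i \sum_k f_k \xi_k$
  • So $E(Z) = f_i E(\xi_i^2) + \sum_{k \ne i} f_k E(\xi_i \xi_k) = f_i$, since $\xi_i^2 = 1$ and $E(\xi_i \xi_k) = 0$ for $k \ne i$

[Figure: example]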

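AMS (cont.)
  • Keep an array of w × d counters Z[i, j]
  • Use d hash functions h1, ..., hd to map element x to [1..w]
  • Update for an arriving element a: $Z[i, h_i(a)] \mathrel{+}= \xi_i(a)$ for each row i
  • Estimate: $\mathrm{Est}(f_a) = \mathrm{median}_i \left( \xi_i(a) \cdot Z[i, h_i(a)] \right)$

[Figure: d × w array of counters; element a is hashed by h1(a), ..., hd(a) to one counter per row]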
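A sketch of this structure, often called a Count Sketch, is shown below under several assumptions: items are non-negative integers, bucket hashes are random linear functions modulo a large prime, and the ±1 signs come from the parity of a random degree-3 polynomial modulo that prime (a practical stand-in for a 4-wise independent family, not an exact one). The class name, `seed` parameter, and helper names are hypothetical.

```python
import random
import statistics
from typing import List

_PRIME = (1 << 61) - 1  # a Mersenne prime larger than the item ids used here


class CountSketch:
    """d rows of w signed counters; point queries use the median over rows."""

    def __init__(self, w: int, d: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.z: List[List[int]] = [[0] * w for _ in range(d)]
        # Per row: 2 coefficients for the bucket hash, 4 for the sign hash.
        self.bucket_coef = [[rng.randrange(1, _PRIME), rng.randrange(_PRIME)]
                            for _ in range(d)]
        self.sign_coef = [[rng.randrange(_PRIME) for _ in range(4)]
                          for _ in range(d)]

    @staticmethod
    def _poly(coef: List[int], x: int) -> int:
        acc = 0
        for c in coef:                      # Horner evaluation modulo the prime
            acc = (acc * x + c) % _PRIME
        return acc

    def _bucket(self, i: int, x: int) -> int:
        return self._poly(self.bucket_coef[i], x) % self.w

    def _sign(self, i: int, x: int) -> int:
        return 1 if self._poly(self.sign_coef[i], x) & 1 else -1

    def update(self, x: int, c: int = 1) -> None:
        """Turnstile update <x, c>: c may be negative."""
        for i in range(self.d):
            self.z[i][self._bucket(i, x)] += self._sign(i, x) * c

    def estimate(self, x: int) -> float:
        """Median over rows of sign * counter."""
        return statistics.median(
            self._sign(i, x) * self.z[i][self._bucket(i, x)] for i in range(self.d)
        )
```

Because every update is linear, deleting an item exactly undoes its insertion, which is what makes the structure usable in the turnstile model.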

The Count Min (CM) Sketch
  • Simple sketch idea, can be used for point queries (fi), range queries, quantiles, join size estimation
  • Creates a small summary as an array of w × d counters C
  • Use d hash functions to map elements to [1..w]
  • Similar to the Bloom filter technique

[Figure: array of counters with w = O(1/ε) columns and d = O(log(1/δ)) rows]


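CM Sketch Structure
  • Each element xi is mapped to one counter per row
  • C[k, hk(xi)] = C[k, hk(xi)] + 1 (-1 for deletion), or + c[j] if the incoming update is <j, c[j]>
  • Estimate A[j] by taking min_k C[k, hk(j)]

[Figure: d × w array of counters; element xi is hashed by h1(xi), ..., hd(xi) to one counter per row, and each of those counters gets +1]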
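The update and query rules translate almost line for line into code. The sketch below assumes non-negative integer items and uses random linear hash functions modulo a large prime; the class name and `seed` parameter are hypothetical.

```python
import random
from typing import List, Tuple

_PRIME = (1 << 61) - 1  # a Mersenne prime larger than the item ids used here


class CountMinSketch:
    """d rows of w counters; a point query returns the minimum over rows."""

    def __init__(self, w: int, d: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.c: List[List[int]] = [[0] * w for _ in range(d)]
        self.ab: List[Tuple[int, int]] = [
            (rng.randrange(1, _PRIME), rng.randrange(_PRIME)) for _ in range(d)
        ]

    def _h(self, k: int, x: int) -> int:
        a, b = self.ab[k]
        return ((a * x + b) % _PRIME) % self.w

    def update(self, x: int, c: int = 1) -> None:
        """Add c to one counter per row (c = -1 for a deletion)."""
        for k in range(self.d):
            self.c[k][self._h(k, x)] += c

    def estimate(self, x: int) -> int:
        """min over rows of C[k, h_k(x)]."""
        return min(self.c[k][self._h(k, x)] for k in range(self.d))
```

As long as all true frequencies stay non-negative, collisions can only add to a counter, so the estimate never falls below the true frequency.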

CM Sketch Summary
  • CM sketch guarantees approximation error on point queries less than ε||A||_1 in size O((1/ε) log(1/δ))
    • Probability of more error is less than δ
  • How to extend to range queries?

Distinct Value Estimation (F0)
  • Problem: Find the number of distinct values in a stream of values with domain [0,...,N-1]
    • Zeroth frequency moment
    • Statistics: number of species or classes in a population
    • Important for query optimizers
    • Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
  • Hard problem for random sampling!
    • Must sample almost the entire table to guarantee the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!

Data stream: 3 0 5 3 0 1 7 5 1 0 3 7

Number of distinct values: 5


FM Sketch
  • Assume a hash function h(x) that maps incoming values x in [0,…, N-1] uniformly across [0,…, 2^L-1], where L = O(log N)
  • Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
    • A value x is mapped to lsb(h(x))
  • Maintain Hash Sketch = BITMAP array of L bits, initialized to 0
    • For each incoming value x, set BITMAP[ lsb(h(x)) ] = 1
  • Prob[ lsb(h(x)) = i ] = ?

Example: x = 5, h(x) = 101100, lsb(h(x)) = 2

Position: 5 4 3 2 1 0
BITMAP:   0 0 0 1 0 0


FM Sketch (cont.)
  • By uniformity through h(x): each value x sets BITMAP[k] with probability $1/2^{k+1}$
    • Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], . . .
  • Let R = position of rightmost zero in BITMAP
    • Use as indicator of log(d)
  • [FM85] prove that E[R] = $\log_2(\phi d)$, where $\phi = 0.7735$
    • Estimate d = $2^R / \phi$

[Figure: BITMAP from position L-1 down to 0: all 0s at positions >> log(d), a fringe of 0/1s around log(d), all 1s at positions << log(d)]
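A single-bitmap FM sketch can be sketched as follows. The use of SHA-256 as the hash function, the default L = 32, and the class and function names are assumptions made for this example; the estimate uses the [FM85] rule 2^R / φ with φ ≈ 0.77351, where R is the position of the lowest-order zero bit.

```python
import hashlib
from typing import List

PHI = 0.77351  # correction constant from [FM85]


def lsb(y: int) -> int:
    """Position of the least-significant 1 bit of y (requires y > 0)."""
    return (y & -y).bit_length() - 1


class FMSketch:
    """One BITMAP of L bits; duplicates of a value never change it."""

    def __init__(self, L: int = 32) -> None:
        self.L = L
        self.bitmap: List[int] = [0] * L

    def _hash(self, x: object) -> int:
        digest = hashlib.sha256(str(x).encode()).digest()
        return int.from_bytes(digest[:8], "big") % (1 << self.L)

    def add(self, x: object) -> None:
        h = self._hash(x)
        if h:  # h == 0 has no 1 bit; this happens with probability 2^-L
            self.bitmap[lsb(h)] = 1

    def estimate(self) -> float:
        # R = position of the rightmost (lowest-order) zero in BITMAP.
        r = self.bitmap.index(0) if 0 in self.bitmap else self.L
        return 2 ** r / PHI
```

A single bitmap gives only a coarse, power-of-two style estimate; its main property is that the sketch depends on the set of distinct values, not on how often each one repeats.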

Boosting the Accuracy

[Figure: m independent bitmaps, BITMAP 1 through BITMAP m]

Approximation with probability at least 1 - δ


FM (cont.)
  • Find the number of distinct objects with value smaller than V?

[Figure: objects with values (X1,8), (X3,5), (X4,3), (X5,8), (X6,2), (X7,11), (X8,2) and bitmaps B1, B2; annotations: "0, for v=4" and "1, for v=10"]

Distinct Sampling
  • Applications: click streams, inverse distribution.
  • [Gib01] Use an FM-like hash function h() for each streaming value x
    • Prob[ h(x) = k ] = $1/2^{k+1}$
  • Key Invariant: “All values with h(x) >= level (and only these) are in the distinct sample”

Distinct Sampling (cont.)

DistinctSampling( B, r )
  // B = space bound, r = tuple-reservoir size for each distinct value
  level = 0; S = ∅
  for each new tuple t do
    let x = value of DISTINCT target attribute in t
    if h(x) >= level then // x belongs in the distinct sample
      use t to update the reservoir sample of tuples for x
    if |S| >= B then // out of space
      evict from S all tuples with h(target-attribute-value) = level
      set level = level + 1
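A simplified Python rendering of this procedure is sketched below. It departs from the pseudocode in labeled ways: it keeps an occurrence count per distinct value instead of a reservoir of r tuples, it measures |S| as the number of distinct values held, it repeats the eviction step until there is room, and it derives the FM-like level from SHA-256. All function names are hypothetical, and the scale-up estimate |S| · 2^level is the usual companion of this invariant rather than something stated on the slide.

```python
import hashlib
from typing import Dict, Hashable, Iterable, Tuple


def level_of(x: Hashable, max_level: int = 63) -> int:
    """FM-like hash: returns k with probability about 2^-(k+1)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    # Force bit max_level to 1 so that a least-significant 1 bit always exists.
    h = int.from_bytes(digest[:8], "big") | (1 << max_level)
    return (h & -h).bit_length() - 1


def distinct_sample(values: Iterable[Hashable], B: int) -> Tuple[Dict[Hashable, int], int]:
    """Return (sample, level); sample holds exactly the values with level_of >= level."""
    level = 0
    sample: Dict[Hashable, int] = {}  # value -> occurrences seen while sampled
    for x in values:
        if level_of(x) >= level:      # x belongs in the distinct sample
            sample[x] = sample.get(x, 0) + 1
            while len(sample) >= B:   # out of space
                # Evict the values hashed to the current level, then raise it.
                sample = {v: c for v, c in sample.items() if level_of(v) > level}
                level += 1
    return sample, level


def estimate_distinct(sample: Dict[Hashable, int], level: int) -> int:
    """Each sampled value stands for about 2^level distinct values."""
    return len(sample) * 2 ** level
```

Because membership depends only on the hash of the value, repeated occurrences of a value never change which values are in the sample.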

Reference
  • [FM85] P. Flajolet, G. Martin. Probabilistic Counting. Foundations of Computer Science 1985.
  • [GMPS02] M. Datar, A. Gionis, P. Indyk, R. Motwani. Maintaining stream statistics over sliding windows. SODA 2002.
  • [MM02] G. Manku, R. Motwani. Approximate frequency counts over data streams. VLDB 2002.
  • [Karp03] R. Karp, C. Papadimitriou, S. Shenker. A simple algorithm for finding frequent elements in streams and bags. TODS 2003.
  • [Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. VLDB 2001.
