Data mining on streams

Christos Faloutsos

CMU

C. Faloutsos


THANK YOU!

  • Prof. Panos Ipeirotis

  • Julia Mills

Outline

  • Problem and motivation

  • Single-sequence mining: AWSOM

  • Co-evolving sequences: SPIRIT

  • Lag correlations: BRAID

  • Conclusions

Problem definition - example

Each sensor collects data (x1, x2, …, xt, …)

Problem definition

  • Given: one or more sequences

    x1 , x2 , … , xt , …

    (y1, y2, … , yt, …

    … )

  • Find

    • patterns; correlations; outliers

    • incrementally!

Limitations / Challenges

Find patterns using a method that is

  • nimble: limited resources

    • Memory

    • Bandwidth, power, CPU

  • incremental: on-line, ‘any-time’ response

    • single pass (‘you get to see it only once’)

  • automatic: no human intervention

    • e.g., in remote environments

Application domains

  • Sensor devices

    • Temperature, weather measurements

    • Road traffic data

    • Geological observations

    • Patient physiological data

  • Embedded devices

    • Network routers

    • Intelligent (active) disks

Motivation - Applications (cont’d)

  • ‘Smart house’

    • sensors monitor temperature, humidity, air quality

  • video surveillance

Motivation - Applications (cont’d)

  • civil/automobile infrastructure

    • bridge vibrations [Oppenheim+02]

    • road conditions / traffic monitoring

Motivation - Applications (cont’d)

  • Weather, environment/anti-pollution

    • volcano monitoring

    • air/water pollutant monitoring

Motivation - Applications (cont’d)

  • Computer systems

    • ‘Active Disks’ (buffering, prefetching)

    • web servers (ditto)

    • network traffic monitoring

    • ...

InteMon w/ Evan Hoke, Jimeng Sun

self-* PetaByte data center at CMU


Outline

  • Problem and motivation

  • Single-sequence mining: AWSOM

  • Co-evolving sequences: SPIRIT

  • Lag correlations: BRAID

  • conclusions

Single sequence mining - AWSOM

with Spiros Papadimitriou (CMU -> IBM)

Anthony Brockwell (CMU/Stat)

Problem definition

  • Semi-infinite streams of values (time series) x1, x2, …, xt, …

  • Find patterns, forecasts, outliers…

[Figure: example time series annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “‘Noise’??”]


Requirements / Goals

  • Adapt and handle arbitrary periodic components

    and

  • nimble (limited resources, single pass)

  • on-line, any-time

  • automatic (no human intervention/tuning)

Overview

  • Introduction / Related work

  • Background

  • Main idea

  • Experimental results

Wavelets: Example – Haar transform

[Figure: Haar decomposition of the signal xt into coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and the “constant” term V4,1, each plotted against t; axes: time and frequency]


Wavelets: Why we like them

  • Wavelets compress many real signals well:

    • Image compression and processing

    • Vision

    • Astronomy, seismology, …

  • Wavelet coefficients can be updated as new points arrive
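The claim that wavelet coefficients can be updated as new points arrive can be illustrated with a minimal sketch. It assumes the orthonormal Haar transform with the sign convention (a - b)/√2 for details; the names `haar_transform` and `StreamingHaar` are made up for this example and are not taken from the talk.

```python
import math


def haar_transform(x):
    """Batch orthonormal Haar transform of a list whose length is a power of two.

    Returns (details, smooth): details[j] holds the level-(j+1) detail
    coefficients, smooth is the single remaining scaling coefficient.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    current = list(x)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        current = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, current[0]


class StreamingHaar:
    """Emit the same coefficients one point at a time.

    Only one unpaired smooth value is kept per level, so memory grows
    logarithmically with the number of points seen.
    """

    def __init__(self):
        self.pending = []  # pending[j]: unpaired smooth value at level j, or None

    def push(self, value):
        """Consume one point; return the (level, detail) pairs it completes."""
        emitted = []
        level = 0
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                self.pending[level] = value  # wait for the partner value
                return emitted
            a, b = self.pending[level], value
            self.pending[level] = None
            emitted.append((level + 1, (a - b) / math.sqrt(2)))
            value = (a + b) / math.sqrt(2)  # carry the smooth part upward
            level += 1
```

Each call to `push` touches only the levels whose pairs complete at that moment, which is what makes a single pass over the stream sufficient.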


Overview

  • Introduction / Related work

  • Background

  • Main idea

  • Experimental results

AWSOM

[Figure: the Haar coefficients of xt (W1,1 … W1,4, W2,1, W2,2, W3,1, V4,1), each plotted against t and arranged along time and frequency axes]


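AWSOM - idea

[Figure: wavelet coefficients $W_{l,t-2}$, $W_{l,t-1}$, $W_{l,t}$ at one level and $W_{l',t'-2}$, $W_{l',t'-1}$, $W_{l',t'}$ at another]

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$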


More details…

  • Update of wavelet coefficients (incremental)

  • Update of linear models (incremental; RLS)

  • Feature selection (single-pass)

    • Not all correlations are significant

    • Throw away the insignificant ones (“noise”)
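The incremental update of the linear models via RLS (recursive least squares) can be sketched as follows. This is a generic textbook RLS written for illustration, not the authors' code; the class name, the `delta` initialization and the `forgetting` parameter are assumptions.

```python
class RecursiveLeastSquares:
    """Fit y ≈ beta · u one observation at a time, without storing old data."""

    def __init__(self, k, delta=1e3, forgetting=1.0):
        self.k = k
        self.lam = forgetting  # 1.0 means no forgetting (assumed default)
        self.beta = [0.0] * k
        # P approximates the inverse Gram matrix of the regressors;
        # a large delta means a weak prior on beta.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, u):
        return sum(b * v for b, v in zip(self.beta, u))

    def update(self, u, y):
        """Absorb one (regressors, target) pair in O(k^2); return the prior error."""
        k = self.k
        Pu = [sum(self.P[i][j] * u[j] for j in range(k)) for i in range(k)]
        denom = self.lam + sum(u[i] * Pu[i] for i in range(k))
        gain = [v / denom for v in Pu]
        err = y - self.predict(u)
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # P is symmetric, so u^T P equals (P u)^T.
        self.P = [[(self.P[i][j] - gain[i] * Pu[j]) / self.lam for j in range(k)]
                  for i in range(k)]
        return err
```

For a model with two regression coefficients, each new wavelet coefficient would be passed as `update([w_prev1, w_prev2], w_new)`; the cost per update depends only on k, not on the number of points seen so far.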


Complexity

  • Model update

    Space: $O(\lg N + m k^2) \approx O(\lg N)$

    Time: $O(k^2) \approx O(1)$

    Where

    • N: number of points (so far)

    • k: number of regression coefficients; fixed

    • m: number of linear models; $O(\lg N)$


Overview

  • Introduction / Related work

  • Background

  • Main idea

  • Experimental results

Results - Synthetic data

[Figure panels: AWSOM, AR, Seasonal AR]

  • Triangle pulse

  • Mix (sine + square)

  • AR captures wrong trend (or none)

  • Seasonal AR estimation fails

Results - Real data

  • Automobile traffic

    • Daily periodicity

    • Bursty “noise” at smaller scales

  • AR fails to capture any trend

  • Seasonal AR estimation fails

Results - real data

  • Sunspot intensity

    • Slightly time-varying “period”

  • AR captures wrong trend

  • Seasonal ARIMA

    • wrong downward trend, despite help by human!

Conclusions

  • Adapt and handle arbitrary periodic components

    and

  • nimble

    Limited memory (logarithmic)

    Constant-time update

  • on-line, any-time

    Single pass over the data

  • automatic: No human intervention/tuning

Outline

  • Problem and motivation

  • Single-sequence mining: AWSOM

  • Co-evolving sequences: SPIRIT

  • Lag correlations: BRAID

  • conclusions

Part 2

SPIRIT: Mining co-evolving streams

[Papadimitriou, Sun, Faloutsos, VLDB05]

Motivation

  • Eg., chlorine concentration in water distribution network

Motivation

[Figure: chlorine concentrations in a water distribution network across Phases 1 to 3, labeled “normal operation” and “major leak”, with separate curves for sensors near the leak and sensors away from the leak]

May have hundreds of measurements, but it is unlikely they are completely unrelated!


Motivation

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: actual measurements (n streams of chlorine concentrations) next to the k hidden variable(s), in three steps: k = 1 for Phase 1, k = 2 for Phases 1 and 2, k = 1 for Phases 1 to 3]


Goals

  • Discover “hidden” (latent) variables for:

    • Summarization of main trends for users

    • Efficient forecasting, spotting outliers/anomalies

      and the usual:

  • nimble: Limited memory requirements

  • on-line, any-time: (single pass etc)

  • automatic: No special parameters to tune

Related work: Stream mining

  • Stream SVD [Guha, Gunopulos, Koudas / KDD03]

  • StatStream [Zhu, Shasha / VLDB02]

  • Clustering

    [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE],

    [Lin, Vlachos, Keogh, Gunopulos / EDBT04],

  • Classification

    [Wang, Fan, et al/KDD03], [Hulten,Spencer,Domingos/KDD01]

Related work: Stream mining

  • Piecewise approximations

    [Palpanas, Vlachos, Keogh, et al / ICDE 2004]

  • Queries on streams

    [Dobra, Garofalakis, Gehrke, et al / SIGMOD02],

    [Madden, Franklin, Hellerstein, et al / OSDI02],

    [Considine, Li, Kollios, et al / ICDE04],

    [Hammad, Aref, Elmagarmid / SSDBM03]

Overview: Part 2

  • Method

  • Experiments

  • Conclusions & Other work

Stream correlations

  • Step 1: How to capture correlations?

  • Step 2: How to do it incrementally, when we have a very large number of points?

  • Step 3: How to dynamically adjust the number of hidden variables?

1. How to capture correlations?

[Figure: temperature over time for the first sensor (t1) and the second sensor (t2), both on a 20°C to 30°C axis]


1. How to capture correlations

Correlations: Let’s take a closer look at the first three value-pairs…

[Figure: scatter plot of Temperature t1 vs. Temperature t2, both axes from 20°C to 30°C]


1. How to capture correlations

First three lie (almost) on a line in the space of value-pairs…

  • O(n) numbers for the slope, and

  • One number for each value-pair (offset on line)

offset = “hidden variable”

[Figure: scatter plot of Temperature t1 vs. Temperature t2 (20°C to 30°C) with the points for time=1, time=2 and time=3 marked]


1. How to capture correlations

Other pairs also follow the same pattern: they lie (approximately) on this line

[Figure: scatter plot of Temperature t1 vs. Temperature t2, both axes from 20°C to 30°C]


Stream correlations

  • Step 1: How to capture correlations?

  • Step 2: How to do it incrementally, when we have a very large number of points?

  • Step 3: How to dynamically adjust the number of hidden variables?

Incremental updates

[Figure: scatter plot of Temperature T1 vs. Temperature T2 (both axes from 20°C to 30°C) with the “error” of a point relative to the line marked]

  • Algorithm runs in O(n) where n = # of streams

  • no need to access old data


Stream correlations: Principal Component Analysis (PCA)

  • The “line” is the first principal component (PC)

  • This line is optimal: it minimizes the sum of squared projection errors
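For two streams, the first principal component and the per-pair offsets (the “hidden variable”) have a closed form. The sketch below is a batch illustration only; it assumes the data are centered on their mean first, which the slides do not state, and the function name is hypothetical.

```python
import math


def first_principal_component(pairs):
    """Return (mean, direction, offsets) for a list of (t1, t2) value-pairs.

    direction is the unit vector of the line minimizing the sum of squared
    projection errors; offsets[i] is the position of pair i along that line.
    """
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    # Angle of the major axis of the 2x2 scatter matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    w = (math.cos(theta), math.sin(theta))
    offsets = [(p[0] - mx) * w[0] + (p[1] - my) * w[1] for p in pairs]
    return (mx, my), w, offsets
```

Each pair is then summarized by one number (its offset), and `mean + offset * direction` reconstructs it up to the projection error.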

2. Incremental update: Given number of hidden variables k

  • Assuming k is known

  • We know how to update the slope

For each new point x and for i = 1, …, k :

  • $y_i := w_i^T x$ (projection onto $w_i$)

  • $d_i \leftarrow d_i + y_i^2$ (energy $\propto$ $i$-th eigenvalue)

  • $e_i := x - y_i w_i$ (error)

  • $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)

  • $x \leftarrow x - y_i w_i$ (repeat with remainder)

[Figure: a point x, its projection y1 onto w1, the error e1, and the updated w1]
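The per-point update listed on this slide can be written out as a short sketch. The function name, the list-of-lists representation of the directions and the optional `forgetting` factor are assumptions made for illustration; with `forgetting=1.0` the code follows the five steps exactly as transcribed.

```python
def spirit_update(x, W, d, forgetting=1.0):
    """Apply one new point x (length n) to k tracked directions.

    W: list of k direction vectors (each of length n), updated in place.
    d: list of k positive energy estimates, updated in place.
    Returns the k hidden-variable values y for this point.
    """
    x = list(x)
    y = []
    for i in range(len(W)):
        w = W[i]
        yi = sum(wj * xj for wj, xj in zip(w, x))            # projection onto w_i
        d[i] = forgetting * d[i] + yi * yi                   # energy estimate
        e = [xj - yi * wj for xj, wj in zip(x, w)]           # projection error
        W[i] = [wj + (yi / d[i]) * ej for wj, ej in zip(w, e)]  # update estimate
        x = [xj - yi * wj for xj, wj in zip(x, W[i])]        # remainder for next i
        y.append(yi)
    return y
```

A possible initialization (again an assumption) is `W = [[1.0, 0.0, ...], [0.0, 1.0, ...], ...]` with each `d[i]` set to a small positive number. The work per point is proportional to n times k, and no old data is revisited.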


Stream correlations

  • Step 1: How to capture correlations?

  • Step 2: How to do it incrementally, when we have a very large number of points?

  • Step 3: How to dynamically adjust k, the number of hidden variables?

Answer

  • When the reconstruction accuracy is too low (say, <95%)

  • then introduce another hidden variable (k++)

  • [How to initialize its values: tricky]
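One way to make “reconstruction accuracy” concrete is to compare the energy retained by the hidden variables with the total energy of the input. The sketch below assumes that reading; the class name and the way a new variable would be initialized (which the slide calls tricky) are not covered.

```python
class EnergyTracker:
    """Track how much of the input energy the hidden variables retain."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # the slide's "say, 95%"
        self.total = 0.0            # running sum of squared input values
        self.retained = 0.0         # running sum of squared hidden-variable values

    def observe(self, x, y):
        """x: the n measurements of one time tick; y: the k hidden-variable values."""
        self.total += sum(v * v for v in x)
        self.retained += sum(v * v for v in y)

    def needs_another_variable(self):
        """True when retained energy falls below the threshold (then k++)."""
        return self.total > 0 and self.retained < self.threshold * self.total
```

The caller would check `needs_another_variable()` after each tick and, if it returns True, add one more direction to the set being tracked.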

Missing values

[Figure: scatter plot of Temperature T1 vs. Temperature T2 (20°C to 30°C). Given only t1, all possible value pairs lie on a line; the best guess is its intersection with the correlation line, shown next to the true values (pair)]


Forecasting

  • Assume we want to forecast the next value for a particular stream (e.g. auto-regression)

[Figure: n streams, with the next value of one stream marked “?”]

Forecasting

  • Option 1: One complex model per stream

    • Next value = function of previous values on all streams

    • Captures correlations

    • Too costly! [ ~ O(n^3) ]


Forecasting

  • Option 1: One complex model per stream

  • Option 2: One simple model per stream

    • Next value = function of previous value on same stream

    • Worse accuracy, but maybe acceptable

    • But, still need n models

Forecasting

  • Hidden variables: only k simple models

  • k << n, and they already capture correlations

  • Efficiency & robustness

[Figure: n streams summarized by k hidden vars]


Time/space requirements: Incremental PCA

O(nk) space (total) and time (per tuple), i.e.,

  • Independent of # points

  • Linear w.r.t. # streams (n)

  • Linear w.r.t. # hidden variables (k)

    In fact,

  • Can be done in real time

Overview: Part 2

  • Method

  • Experiments

  • Conclusions & Other work

Experiments: Chlorine concentration

[Figure: measurements vs. reconstruction]

  • 166 streams

  • 2 hidden variables (~4% error)

[CMU Civil Engineering]


Experiments: Chlorine concentration

[Figure: the hidden variables]

  • Both capture global, periodic pattern

  • Second: ~ first, but phase-shifted

  • Can express any phase-shift…

[CMU Civil Engineering]


Experiments: Light measurements

[Figure: measurement vs. reconstruction]

  • 54 sensors

  • 2-4 hidden variables (~6% error)


Experiments: Light measurements

[Figure: hidden variables, two of them labeled “intermittent”]

  • 1 & 2: main trend (as before)

  • 3 & 4: potential anomalies and outliers


Conclusions

SPIRIT:

  • Discovers hidden variables for

    • Summarization of main trends for users

    • Efficient forecasting, spotting outliers/anomalies

  • Incremental, real time computation

  • nimble: With limited memory

  • automatic: No special parameters to tune

Outline

  • Problem and motivation

  • Single-sequence mining: AWSOM

  • Co-evolving sequences: SPIRIT

  • Lag correlations: BRAID

  • Conclusions

Part 3: BRAID: Discovering Lag Correlations in Multiple Streams

Yasushi Sakurai,

Spiros Papadimitriou,

Christos Faloutsos

SIGMOD’05

Lag Correlations

  • Examples

    • A decrease in interest rates typically precedes an increase in house sales by a few months

    • Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later

Lag Correlations

These sequences are correlated with lag l=1300 time-ticks

  • Example of lag-correlated sequences

CCF (Cross-Correlation Function)

Lag Correlations

  • Example of lag-correlated sequences

  • how to compute it

    • quickly

    • cheaply

    • incrementally

CCF (Cross-Correlation Function)
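As a baseline for what “quickly, cheaply, incrementally” has to improve on, here is a naive CCF computed from scratch for each lag. The convention that lag l pairs x[t - l] with y[t] (so Y trails X) is an assumption for this sketch, as are the function names.

```python
import math


def cross_correlation(x, y, lag):
    """Pearson correlation between x delayed by `lag` and y (pairs x[t-lag], y[t])."""
    n = len(x)
    if len(y) != n:
        raise ValueError("sequences must have equal length")
    if not 0 <= lag <= n - 2:
        raise ValueError("lag must leave at least two overlapping points")
    xs, ys = x[: n - lag], y[lag:]
    m = n - lag
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant segment has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def best_lag(x, y, max_lag):
    """Lag in 0..max_lag with the highest correlation (rescans all the data)."""
    return max(range(max_lag + 1), key=lambda l: cross_correlation(x, y, l))
```

Every call rescans the whole overlap, so evaluating all lags costs time quadratic in the sequence length and requires keeping both full sequences in memory.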

Challenging Problems

  • Problem definitions

    • Given two co-evolving sequences X and Y, determine

      • Whether there is a lag correlation

      • If yes, what is the lag length l

    • Given k numerical sequences, X1,…,Xk, report

      • Which pairs have a lag correlation

      • The corresponding lag for each pair

Our solution

  • Ideal characteristics:

    • ‘Any-time’ processing, and fast

      Computation time per time tick is constant

    • Nimble

      Memory space requirement is sub-linear in the sequence length

    • Accurate

      Approximation introduces small error

Related Work

  • Sequence indexing

    • Agrawal et al. (FODO 1993)

    • Faloutsos et al. (SIGMOD 1994)

    • Keogh et al. (SIGMOD 2001)

  • Compression (wavelet and random projections)

    • Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)

    • Dobra et al.(SIGMOD 2002), Ganguly et al.(SIGMOD 2003)

  • Data Stream Management

    • Abadi et al. (VLDB Journal 2003)

    • Motwani et al. (CIDR 2003)

    • Chandrasekaran et al. (CIDR 2003)

    • Cranor et al. (SIGMOD 2003)

Related Work

  • Pattern discovery

    • Clustering for data streams

      Guha et al. (TKDE 2003)

    • Monitoring multiple streams

      Zhu et al. (VLDB 2002)

    • Forecasting

      Yi et al. (ICDE 2000)

      Papadimitriou et al. (VLDB 2003)

  • None of the previously published methods focuses on the lag-correlation problem

Overview

  • Introduction / Related work

  • Background

  • Main ideas

  • Theoretical analysis

  • Experimental results

Main Idea (1)

  • Incremental computation

    • Sufficient statistics

      • Sum of X :

      • Square sum of X :

      • Inner-product for X and the shifted Y :

    • Compute R(l) incrementally:

      • Covariance of X and Y:

      • Variance of X:
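The formulas on this slide did not survive in the transcript, so the sketch below shows one standard way to maintain such sufficient statistics for a single fixed lag. It is an illustration under that assumption, not the paper's exact definitions; the class name is made up, and this exact version still buffers the last `lag` values of X.

```python
import math
from collections import deque


class LagCorrelation:
    """Maintain sums so that the correlation at one lag is available at any time."""

    def __init__(self, lag):
        self.lag = lag
        self.buffer = deque(maxlen=lag + 1)  # holds x[t-lag] .. x[t]
        self.m = 0                           # number of (x[t-lag], y[t]) pairs seen
        self.sx = self.sy = 0.0              # sums of X and of the shifted Y
        self.sxx = self.syy = 0.0            # square sums
        self.sxy = 0.0                       # inner product of X and the shifted Y

    def update(self, x_t, y_t):
        """Constant-time update for one new time tick."""
        self.buffer.append(x_t)
        if len(self.buffer) <= self.lag:
            return  # not enough history yet to form a pair
        x_old = self.buffer[0]
        self.m += 1
        self.sx += x_old
        self.sy += y_t
        self.sxx += x_old * x_old
        self.syy += y_t * y_t
        self.sxy += x_old * y_t

    def correlation(self):
        """R(lag) computed from the sums only."""
        m = self.m
        if m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / m
        var_x = self.sxx - self.sx * self.sx / m
        var_y = self.syy - self.sy * self.sy / m
        if var_x <= 0 or var_y <= 0:
            return 0.0
        return cov / math.sqrt(var_x * var_y)
```

Covariance and variance follow from the five running sums, so no pass over old data is needed when a new point arrives.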

Main Idea (2)

  • Sequence smoothing

    • Means of windows for each level

    • Sufficient statistics computed from the means

    • CCF computed from the sufficient statistics

    • But, it allows a partial redundancy

[Figure: smoothing windows by level (starting at h=0) over time up to t=n, next to a plot of correlation versus lag]


Main Idea (3)

  • Geometric lag probing

    • Use colored windows

    • Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}

    • Use a cubic spline to interpolate

[Figure: windows by level (starting at h=0) over time up to t=n, next to a plot of correlation versus lag]


Overview

  • Introduction / Related work

  • Background

  • Main ideas

  • Theoretical analysis

  • Experimental results

Experimental results

  • Setup

    • Intel Xeon 2.8GHz, 1GB memory, Linux

    • Datasets:

      Sines, SpikeTrains, Humidity, Light, Temperature,

      Kursk, Sunspots

    • Enhanced BRAID, b=16

  • Evaluation

    • Estimation error of lag correlations

    • Computation time

Detecting Lag Correlations (2)

BRAID closely estimates the correlation coefficients

  • SpikeTrains

CCF (Cross-Correlation Function)

Detecting Lag Correlations (3)

BRAID closely estimates the correlation coefficients

  • Humidity

CCF (Cross-Correlation Function)

Detecting Lag Correlations (4)

BRAID closely estimates the correlation coefficients

  • Light

CCF (Cross-Correlation Function)

Detecting Lag Correlations (5)

BRAID closely estimates the correlation coefficients

  • Kursk

CCF (Cross-Correlation Function)

Estimation Error

Datasets       Lag correlation      Estimation error (%)
               Naive     BRAID
Sines          716       716        0.000
SpikeTrains    2841      2830       0.387
Humidity       3842      3855       0.338
Light          567       570        0.529
Kursk          1463      1472       0.615
Sunspots       1156      1168       1.038

  • Largest relative error is about 1%


Performance

  • Almost linear w.r.t. sequence length

  • Up to 40,000 times faster

Group Lag Correlations

  • Two correlated pairs from 55 Temperature sequences

  • Each sensor is located in a different place

[Figures: sequences #47 and #48 with the estimation of their CCF; sequences #16 and #19 with the estimation of their CCF]

Conclusions

Automatic lag correlation detection on stream data

  • incremental – online, ‘any-time’

  • nimble

    • O(log n) space, O(1) time to update the statistics

    • Up to 40,000 times faster than the naive implementation

  • Accurate

    • Detecting the correct lag within 1% relative error or less

Overall Conclusions

  • Mining streaming numerical data: challenging!

  • Extensions: streaming matrix data (e.g., network traffic matrix)

[Figure: traffic matrix with axes IP-source, IP-destination, and time]

Thank you

  • christos <at> cs.cmu.edu

  • www.cs.cmu.edu/~christos

  • [InteMon demo]
