
### Part 3: BRAID: Discovering Lag Correlations in Multiple Streams

Outline

- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions

C. Faloutsos

Problem definition

- Given: one or more sequences
x1, x2, …, xt, …
(y1, y2, …, yt, …; …)

- Find
- patterns; correlations; outliers
- incrementally!

Limitations / Challenges

Find patterns using a method that is

- nimble: limited resources (memory, bandwidth, power, CPU)
- incremental: on-line, ‘any-time’ response; single pass (‘you get to see it only once’)
- automatic: no human intervention, e.g., in remote environments

Application domains

- Sensor devices
- Temperature, weather measurements
- Road traffic data
- Geological observations
- Patient physiological data

- Embedded devices
- Network routers
- Intelligent (active) disks


Motivation - Applications (cont’d)

- ‘Smart house’
- sensors monitor temperature, humidity, air quality

- video surveillance


Motivation - Applications (cont’d)

- civil/automobile infrastructure
- bridge vibrations [Oppenheim+02]
- road conditions / traffic monitoring


Motivation - Applications (cont’d)

- Weather, environment/anti-pollution
- volcano monitoring
- air/water pollutant monitoring


Motivation - Applications (cont’d)

- Computer systems
- ‘Active Disks’ (buffering, prefetching)
- web servers (ditto)
- network traffic monitoring
- ...


Outline

- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions

Single sequence mining - AWSOM

with Spiros Papadimitriou (CMU -> IBM)

Anthony Brockwell (CMU/Stat)


Problem definition

- Semi-infinite streams of values (time series) x1, x2, …, xt, …
- Find patterns, forecasts, outliers…

[Figure: two example time series, annotated “Periodicity? (twice daily)” and “Periodicity? (daily)”]

Requirements / Goals

- Adapt and handle arbitrary periodic components
and

- nimble (limited resources, single pass)
- on-line, any-time
- automatic (no human intervention/tuning)


Wavelets: Example, Haar transform

[Figure: Haar decomposition of xt into detail coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and the “constant” V4,1, laid out on time (horizontal) and frequency (vertical) axes]

Wavelets: Why we like them

- Wavelets compress many real signals well, e.g., in image compression and processing, vision, astronomy, seismology, …
- Wavelet coefficients can be updated as new points arrive
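To make the Haar decomposition concrete, here is a minimal batch sketch in Python. It assumes the orthonormal scaling (division by √2), a signal length that is a power of two, and the sign convention "first minus second"; none of these choices is stated on the slides. A streaming version would keep one pending value per level, which is what makes the logarithmic-memory update possible.

```python
import numpy as np

def haar_transform(x):
    """Haar decomposition of a signal whose length is a power of two.

    Returns (details, smooth): details[j] holds the detail coefficients of
    level j + 1 (finest level first), smooth is the final "constant" term.
    """
    smooth = np.asarray(x, dtype=float)
    details = []
    while len(smooth) > 1:
        pairs = smooth.reshape(-1, 2)
        # Orthonormal Haar step: scaled differences become the details,
        # scaled sums become the input of the next (coarser) level.
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))
        smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    return details, float(smooth[0])
```

For 8 input points this yields 4, 2, and 1 detail coefficients plus one smooth value, and the sum of squared coefficients equals the energy of the signal.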

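AWSOM - idea

Model each wavelet coefficient as a linear combination of earlier coefficients at the same level:

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$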

More details…

- Update of wavelet coefficients (incremental)
- Update of linear models (incremental; RLS)
- Feature selection (single-pass)
- Not all correlations are significant
- Throw away the insignificant ones (“noise”)
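The slide names RLS (recursive least squares) as the way the linear models are updated. The following is a generic textbook RLS sketch, not AWSOM's actual code: the class name, the `delta` initialization, and the absence of a forgetting factor are all assumptions made for illustration. Each update costs O(k²) for k regression coefficients and never revisits old data.

```python
import numpy as np

class RecursiveLeastSquares:
    """Textbook RLS for y ~ w . x, updated one sample at a time."""

    def __init__(self, k, delta=1e6):
        self.w = np.zeros(k)          # current coefficient estimate
        self.P = delta * np.eye(k)    # inverse-covariance estimate (large = weak prior)

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (1.0 + x @ Px)
        # Correct the coefficients by the prediction error on the new sample.
        self.w = self.w + gain * (y - self.w @ x)
        # Rank-one downdate of P (P is symmetric, so P x x^T P = outer(Px, Px)).
        self.P = self.P - np.outer(gain, Px)
        return self.w
```

In an AWSOM-like setting, `x` would hold the previous wavelet coefficients of one level and `y` the newest coefficient.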

Complexity

- Model update
Space: $O(\lg N + mk^2) \approx O(\lg N)$

Time: $O(k^2) \approx O(1)$

Where

- N: number of points (so far)
- k: number of regression coefficients; fixed
- m: number of linear models; $O(\lg N)$

Results - Synthetic data

[Figure: forecasts by AWSOM, AR, and Seasonal AR]

- Triangle pulse
- Mix (sine + square)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails

Results - Real data

- Automobile traffic
- Daily periodicity
- Bursty “noise” at smaller scales

- AR fails to capture any trend
- Seasonal AR estimation fails


Results - real data

- Sunspot intensity
- Slightly time-varying “period”

- AR captures wrong trend
- Seasonal ARIMA: wrong downward trend, despite help by human!

Conclusions

- Adapt and handle arbitrary periodic components
and

- nimble: limited memory (logarithmic), constant-time update
- on-line, any-time: single pass over the data
- automatic: no human intervention/tuning

Outline

- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions

Motivation

[Figure: water distribution network; chlorine concentrations over time (Phase 2, Phase 3) during normal operation]

May have hundreds of measurements, but it is unlikely they are completely unrelated!

Motivation

[Figure: water distribution network; chlorine concentrations (Phase 2, Phase 3) for sensors near the leak and sensors away from the leak, during normal operation and after a major leak]


Motivation

[Figure: chlorine concentrations in Phase 1; actual measurements (n streams) and k hidden variable(s), k = 1]

We would like to discover a few “hidden (latent) variables” that summarize the key trends

Motivation

[Figure: chlorine concentrations in Phase 1 and Phase 2; actual measurements (n streams) and k hidden variable(s), k = 2]

We would like to discover a few “hidden (latent) variables” that summarize the key trends

Motivation

[Figure: chlorine concentrations in Phase 1, Phase 2, and Phase 3; actual measurements (n streams) and k hidden variable(s), k = 1]

We would like to discover a few “hidden (latent) variables” that summarize the key trends

Goals

- Discover “hidden” (latent) variables for:
- Summarization of main trends for users
- Efficient forecasting, spotting outliers/anomalies
and the usual:

- nimble: Limited memory requirements
- on-line, any-time: (single pass etc)
- automatic: No special parameters to tune


Related work: Stream mining

- Stream SVD [Guha, Gunopulos, Koudas / KDD03]
- StatStream [Zhu, Shasha / VLDB02]
- Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
- Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]

Related work: Stream mining

- Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004]
- Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]
- …

Stream correlations

- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust the number of hidden variables?


1. How to capture correlations?

[Figure: temperature over time for the first sensor and the second sensor; axis range 20°C to 30°C]

1. How to capture correlations

Correlations:

Let’s take a closer look at the first three value-pairs…

[Figure: scatter plot of value-pairs, Temperature t1 (horizontal) vs. Temperature t2 (vertical), both 20°C to 30°C; the points for time=1 and time=2 are labeled]

1. How to capture correlations

First three lie (almost) on a line in the space of value-pairs…

- O(n) numbers for the slope, and
- One number for each value-pair (offset on line); offset = “hidden variable”

[Figure: scatter plot with the fitted line, Temperature t1 (horizontal) vs. Temperature t2 (vertical), both 20°C to 30°C]

1. How to capture correlations

Other pairs also follow the same pattern: they lie (approximately) on this line

[Figure: scatter plot of all value-pairs around the line, Temperature t1 (horizontal) vs. Temperature t2 (vertical), both 20°C to 30°C]

Stream correlations

- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust the number of hidden variables?


Incremental updates

- Algorithm runs in O(n), where n = # of streams
- no need to access old data

[Figure: scatter plot over Temperature T1, 20°C to 30°C, with the projection error of a new point marked]

Stream correlations: Principal Component Analysis (PCA)

- The “line” is the first principal component (PC)
- This line is optimal: it minimizes the sum of squared projection errors
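As a batch illustration of the two bullets above, the sketch below finds the first principal component of a set of value-pairs with NumPy's SVD. Centering the data on its mean is an assumption made here for clarity; the slides do not say whether the line passes through the mean or through the origin. The function name is hypothetical.

```python
import numpy as np

def first_principal_component(points):
    """Return (mean, direction, offsets) for the best-fit line through the points.

    direction is a unit vector (the first PC); offsets[t] is the position of
    point t along that line, i.e. the value of the single hidden variable.
    """
    X = np.asarray(points, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    offsets = centered @ direction
    return mean, direction, offsets
```

Each point is then approximated by `mean + offsets[t] * direction`: n numbers for the line, plus one number per value-pair.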

2. Incremental update: given number of hidden variables k

[Figure: a new point, its projection y1 onto w1, the error e1, and the updated w1]

- Assuming k is known
- We know how to update the slope

For each new point x and for i = 1, …, k:

- $y_i := w_i^T x$ (projection onto $w_i$)
- $d_i \leftarrow d_i + y_i^2$ (energy, proportional to the i-th eigenvalue)
- $e_i := x - y_i w_i$ (error)
- $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
- $x \leftarrow x - y_i w_i$ (repeat with remainder)
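The five update steps on this slide translate almost line by line into code. The sketch below is one possible reading: the function name, the in-place arrays, and the small positive initial value for the energies are assumptions, and no forgetting factor is applied because the slide shows none.

```python
import numpy as np

def update_hidden_variables(W, d, x):
    """Process one new point x (length n) and return the k hidden-variable values.

    W: float array of shape (k, n), row i is the current estimate w_i (updated in place)
    d: float array of shape (k,), running energy per direction (updated in place)
    """
    remainder = np.array(x, dtype=float)
    y = np.zeros(len(W))
    for i in range(len(W)):
        y[i] = W[i] @ remainder                  # projection onto w_i
        d[i] += y[i] ** 2                        # energy
        error = remainder - y[i] * W[i]          # error
        W[i] += (y[i] / d[i]) * error            # update estimate
        remainder = remainder - y[i] * W[i]      # repeat with remainder
    return y
```

A typical (assumed) initialization is `W = np.eye(k, n)` and `d = np.full(k, 1e-3)`. The cost per point is O(nk), and old points are never touched again.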

Stream correlations

- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust k, the number of hidden variables?


Answer

- When the reconstruction accuracy is too low (say, <95%)
- then introduce another hidden variable (k++)
- [How to initialize its values: tricky]
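One way to read the rule above is as an energy test: compare the energy retained by the hidden variables with the total energy of the incoming points. The sketch below assumes that reading, and the function name and arguments are hypothetical. It leaves out the part the slide calls tricky, namely how to initialize the new hidden variable.

```python
def adjust_k(total_energy, retained_energy, k, n, threshold=0.95):
    """Return the number of hidden variables to use from now on.

    total_energy:    running sum of squared norms of the points seen so far
    retained_energy: running sum of squared hidden-variable values
    k, n:            current number of hidden variables, number of streams
    """
    too_inaccurate = total_energy > 0 and retained_energy < threshold * total_energy
    if too_inaccurate and k < n:
        return k + 1   # introduce another hidden variable
    return k
```

For example, with 90% of the energy retained the rule asks for one more hidden variable, while at 97% it keeps k unchanged.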


Missing values

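[Figure: scatter plot, Temperature T1 (horizontal) vs. Temperature T2 (vertical), both 20°C to 30°C. Given only t1, all possible value pairs lie on a vertical line; the best guess, given the correlations, is the intersection of that line with the correlation line, close to the true values (pair)]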
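The "intersection" idea can be sketched for a single hidden variable: use the known entries to estimate the offset on the line, then read the missing entries off the line. This sketch assumes the line passes through the origin and that missing values are marked with NaN; both are illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def fill_missing(w, x):
    """Fill the NaN entries of x using one hidden variable along direction w."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    known = ~np.isnan(x)
    # Least-squares offset y such that y * w matches the known entries.
    y = (w[known] @ x[known]) / (w[known] @ w[known])
    filled = x.copy()
    filled[~known] = y * w[~known]
    return filled
```

With direction `[0.6, 0.8]` and only the first value known to be 3.0, the offset is 5 and the second value is estimated as 4.0.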

Forecasting

- Assume we want to forecast the next value for a particular stream (e.g. auto-regression)

[Figure: n streams, with the next value of one stream marked “?”]

Forecasting

- Option 1: One complex model per stream
- Next value = function of previous values on all streams
- Captures correlations
- Too costly! [~ $O(n^3)$]

Forecasting

- Option 1: One complex model per stream
- Option 2: One simple model per stream
- Next value = function of previous value on same stream
- Worse accuracy, but maybe acceptable
- But, still need n models


Forecasting

- Only k simple models (on the hidden variables)
- Efficiency & robustness: k << n, and the hidden variables already capture correlations

[Figure: n streams summarized by k hidden variables]

Time/space requirements: Incremental PCA

O(nk) space (total) and time (per tuple), i.e.,

- Independent of # points
- Linear w.r.t. # streams (n)
- Linear w.r.t. # hidden variables (k)
In fact,

- Can be done in real time


Experiments: Chlorine concentration

- 166 streams
- 2 hidden variables (~4% error)

[Figure: measurements and their reconstruction]

[CMU Civil Engineering]

Experiments: Chlorine concentration

- Both capture global, periodic pattern
- Second: ~ first, but phase-shifted
- Can express any phase-shift…

[Figure: the two hidden variables]

[CMU Civil Engineering]

Experiments: Light measurements

- 54 sensors
- 2-4 hidden variables (~6% error)

[Figure: measurement and reconstruction]

Experiments: Light measurements

- 1 & 2: main trend (as before)
- 3 & 4: potential anomalies and outliers

[Figure: the hidden variables; two of them are labeled “intermittent”]

Conclusions

SPIRIT:

- Discovers hidden variables for
- Summarization of main trends for users
- Efficient forecasting, spotting outliers/anomalies

- Incremental, real time computation
- nimble: With limited memory
- automatic: No special parameters to tune


Outline

- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions


Yasushi Sakurai,

Spiros Papadimitriou,

Christos Faloutsos

SIGMOD’05


Lag Correlations

- Examples
- A decrease in interest rates typically precedes an increase in house sales by a few months
- Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later

Lag Correlations

- Example of lag-correlated sequences

These sequences are correlated with lag l=1300 time-ticks

[Figure: the two sequences and their CCF (Cross-Correlation Function)]

Lag Correlations

- Example of lag-correlated sequences
- How to compute it quickly, cheaply, and incrementally?

[Figure: CCF (Cross-Correlation Function)]

Challenging Problems

- Problem definitions
- Given two co-evolving sequences X and Y, determine
- Whether there is a lag correlation
- If yes, what is the lag length l

- Given k numerical sequences, X1, …, Xk, report
- Which pairs have a lag correlation
- The corresponding lag for each pair
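For reference, the first problem can be solved naively by computing the correlation at every candidate lag and taking the maximum. The sketch below is that brute-force baseline, not BRAID itself; the convention that X lags behind Y (x at time t is compared with y at time t − l) and the function names are assumptions.

```python
import numpy as np

def lag_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t - lag] over their overlap (lag >= 0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return np.corrcoef(x[lag:], y[:n - lag])[0, 1]

def best_lag(x, y, max_lag):
    """Return (lag with the highest correlation, list of correlations per lag)."""
    scores = [lag_correlation(x, y, lag) for lag in range(max_lag + 1)]
    return int(np.argmax(scores)), scores
```

This costs time proportional to the sequence length for every lag and needs the whole history, which is exactly what an any-time, nimble method has to avoid.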

Our solution

- Ideal characteristics:
- ‘Any-time’ processing, and fast: computation time per time tick is constant
- Nimble: memory space requirement is sub-linear in the sequence length
- Accurate: approximation introduces small error

Related Work

- Sequence indexing
- Agrawal et al. (FODO 1993)
- Faloutsos et al. (SIGMOD 1994)
- Keogh et al. (SIGMOD 2001)

- Compression (wavelet and random projections)
- Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)
- Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003)

- Data Stream Management
- Abadi et al. (VLDB Journal 2003)
- Motwani et al. (CIDR 2003)
- Chandrasekaran et al. (CIDR 2003)
- Cranor et al. (SIGMOD 2003)


Related Work

- Pattern discovery
- Clustering for data streams: Guha et al. (TKDE 2003)
- Monitoring multiple streams: Zhu et al. (VLDB 2002)
- Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)

- None of the previously published methods focuses on this problem

Overview

- Introduction / Related work
- Background
- Main ideas
- Theoretical analysis
- Experimental results


Main Idea (1)

- Incremental computation
- Sufficient statistics
- Sum of X
- Square sum of X
- Inner-product for X and the shifted Y

- Compute R(l) incrementally from
- Covariance of X and Y
- Variance of X
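The formulas on this slide did not survive extraction, so the sketch below uses the standard definitions as an assumption: with running sums S_x, S_y, S_xx, S_yy, and S_xy over the n − l overlapping pairs (x_t, y_{t−l}), the covariance is S_xy − S_x·S_y/(n − l), the variance of X is S_xx − S_x²/(n − l), and R(l) is the covariance divided by the square root of the product of the two variances. The class name is hypothetical, and the buffer of the last l values of Y is a simplification; it is this memory cost that BRAID's smoothing is meant to remove.

```python
import math
from collections import deque

class LagCorrelationTracker:
    """Maintains R(lag) between x[t] and y[t - lag] with O(1) work per time tick."""

    def __init__(self, lag):
        self.lag = lag
        self.y_buffer = deque(maxlen=lag + 1)   # holds y[t - lag], ..., y[t]
        self.count = 0
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x_t, y_t):
        self.y_buffer.append(y_t)
        if len(self.y_buffer) < self.lag + 1:
            return                               # no partner y[t - lag] yet
        y_old = self.y_buffer[0]
        self.count += 1
        self.sx += x_t
        self.sy += y_old
        self.sxx += x_t * x_t
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        m = self.count
        cov = self.sxy - self.sx * self.sy / m
        var_x = self.sxx - self.sx ** 2 / m
        var_y = self.syy - self.sy ** 2 / m
        return cov / math.sqrt(var_x * var_y)
```

Calling `update` once per time tick and `correlation` at any time gives the ‘any-time’ behavior for one fixed lag.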

Main Idea (2)

- Sequence smoothing
- Means of windows for each level
- Sufficient statistics computed from the means
- CCF computed from the sufficient statistics
- But, it allows a partial redundancy

[Figure: windows per level, starting at h=0, along the time axis up to t=n; resulting correlation vs. lag]

Main Idea (3)

- Geometric lag probing
- Use colored windows
- Keep track of only a geometric progression of the lag values: $l = \{0, 1, 2, 4, 8, \dots, 2^h, \dots\}$
- Use a cubic spline to interpolate

[Figure: windows per level, starting at h=0, along the time axis up to t=n; resulting correlation vs. lag]
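A rough sketch of geometric lag probing follows. It computes the correlation only at lags 0, 1, 2, 4, 8, … and fills in the lags in between by interpolation. Two simplifications are assumptions of this sketch and differ from the slide: the correlations are computed directly from the raw sequences rather than from smoothed windows, and plain linear interpolation (`np.interp`) stands in for the cubic spline.

```python
import numpy as np

def geometric_lags(max_lag):
    """Return [0, 1, 2, 4, 8, ...] up to max_lag."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags

def estimate_ccf(x, y, max_lag):
    """Estimate the CCF on every lag up to the largest probed lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    probes = geometric_lags(max_lag)
    # Correlation between x[t] and y[t - lag], evaluated only at the probes.
    values = [np.corrcoef(x[lag:], y[:n - lag])[0, 1] for lag in probes]
    all_lags = np.arange(probes[-1] + 1)
    # Linear interpolation between probes (stand-in for the cubic spline).
    return all_lags, np.interp(all_lags, probes, values)
```

Only O(log max_lag) lags are tracked, so the number of statistics to maintain grows logarithmically with the largest lag of interest.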

Overview

- Introduction / Related work
- Background
- Main ideas
- Theoretical analysis
- Experimental results


Experimental results

- Setup
- Intel Xeon 2.8GHz, 1GB memory, Linux
- Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots

- Enhanced BRAID, b=16

- Evaluation
- Estimation error of lag correlations
- Computation time


Detecting Lag Correlations (2)

- SpikeTrains

BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

Detecting Lag Correlations (3)

- Humidity

BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

Detecting Lag Correlations (4)

- Light

BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

Detecting Lag Correlations (5)

- Kursk

BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

Estimation Error

| Dataset | Lag correlation (Naive) | Lag correlation (BRAID) | Estimation error (%) |
|---|---|---|---|
| Sines | 716 | 716 | 0.000 |
| SpikeTrains | 2841 | 2830 | 0.387 |
| Humidity | 3842 | 3855 | 0.338 |
| Light | 567 | 570 | 0.529 |
| Kursk | 1463 | 1472 | 0.615 |
| Sunspots | 1156 | 1168 | 1.038 |

- Largest relative error is about 1%
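The error column is consistent with the relative difference between the two detected lags, |BRAID − naive| / naive, expressed in percent. That formula is an inference from the numbers, not something the slide states. A small check, using only the values from the table:

```python
# (naive lag, BRAID lag) per dataset, copied from the table above.
detected_lags = {
    "Sines": (716, 716),
    "SpikeTrains": (2841, 2830),
    "Humidity": (3842, 3855),
    "Light": (567, 570),
    "Kursk": (1463, 1472),
    "Sunspots": (1156, 1168),
}

def relative_error_percent(naive, braid):
    """Relative difference of the BRAID lag from the naive lag, in percent."""
    return abs(braid - naive) / naive * 100.0

errors = {
    name: round(relative_error_percent(naive, braid), 3)
    for name, (naive, braid) in detected_lags.items()
}
```

For Sunspots, for example, 12 / 1156 gives about 1.038%, the largest value in the table.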

Group Lag Correlations

- Two correlated pairs from 55 Temperature sequences
- Each sensor is located in a different place

[Figure: sequences #16, #19, #47, and #48; estimation of CCF of #47 and #48; estimation of CCF of #16 and #19]

Conclusions

Automatic lag correlation detection on stream data

- incremental – online, ‘any-time’
- nimble
- O(log n) space, O(1) time to update the statistics
- Up to 40,000 times faster than the naive implementation

- Accurate
- Detecting the correct lag within 1% relative error or less


Overall Conclusions

- Mining streaming numerical data: challenging!
- Extensions: streaming matrix data (e.g., network traffic matrix)

[Figure: traffic matrix with axes IP-source, IP-destination, and time]
