data mining on streams n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Data Mining on Streams PowerPoint Presentation
Download Presentation
Data Mining on Streams

Loading in 2 Seconds...

play fullscreen
1 / 123

Data Mining on Streams - PowerPoint PPT Presentation


  • 130 Views
  • Uploaded on

Data Mining on Streams. Christos Faloutsos CMU. Dr. Deepay Chakrabarti (Yahoo) Dr. Spiros Papadimitriou (IBM) Prof. Byoung-Kee Yi (Pohang U.). Prof. Dimitris Gunopulos (UCR) Dr. Mengzhi Wang (Google). Thanks. For more info:. 3h tutorial, at:

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Data Mining on Streams' - luna


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
data mining on streams

Data Mining on Streams

Christos Faloutsos

CMU

thanks
Dr. Deepay Chakrabarti (Yahoo)

Dr. Spiros Papadimitriou (IBM)

Prof. Byoung-Kee Yi (Pohang U.)

Prof. Dimitris Gunopulos (UCR)

Dr. Mengzhi Wang (Google)

Thanks

(c) C. Faloutsos, 2006

for more info
For more info:

3h tutorial, at:

http://www.cs.cmu.edu/~christos/TALKS/EDBT04-tut/faloutsos-edbt04.pdf

(c) C. Faloutsos, 2006

outline
Outline
  • Motivation
  • Similarity Search and Indexing
  • DSP (Digital Signal Processing)
  • Linear Forecasting
  • Bursty traffic - fractals and multifractals
  • Non-linear forecasting
  • Conclusions

(c) C. Faloutsos, 2006

problem definition
Problem definition
  • Given: one or more sequences

x1 , x2 , … , xt , …

(y1, y2, … , yt, …

… )

  • Find
    • similar sequences; forecasts
    • patterns; clusters; outliers

(c) C. Faloutsos, 2006

motivation applications
Motivation - Applications
  • Financial, sales, economic series
  • Medical
    • ECGs +; blood pressure etc monitoring
    • reactions to new drugs
    • elder care

(c) C. Faloutsos, 2006

motivation applications cont d
Motivation - Applications (cont’d)
  • ‘Smart house’
    • sensors monitor temperature, humidity, air quality
  • video surveillance

(c) C. Faloutsos, 2006

motivation applications cont d1
Motivation - Applications (cont’d)
  • civil/automobile infrastructure
    • bridge vibrations [Oppenheim+02]
    • road conditions / traffic monitoring

(c) C. Faloutsos, 2006

motivation applications cont d2
Motivation - Applications (cont’d)
  • Weather, environment/anti-pollution
    • volcano monitoring
    • air/water pollutant monitoring

(c) C. Faloutsos, 2006

motivation applications cont d3
Motivation - Applications (cont’d)
  • Computer systems
    • ‘Active Disks’ (buffering, prefetching)
    • web servers (ditto)
    • network traffic monitoring
    • ...

(c) C. Faloutsos, 2006

problem 1
Problem #1:

Goal: given a signal (e.g.., #packets over time)

Find: patterns, periodicities, and/or compress

count

lynx caught per year

(packets per day;

temperature per day)

year

(c) C. Faloutsos, 2006

problem 2 forecast
Problem#2: Forecast

Given xt, xt-1, …, forecast xt+1

90

80

70

60

Number of packets sent

??

50

40

30

20

10

0

1

3

5

7

9

11

Time Tick

(c) C. Faloutsos, 2006

problem 2 similarity search
Problem#2’: Similarity search

E.g.., Find a 3-tick pattern, similar to the last one

90

80

70

60

Number of packets sent

??

50

40

30

20

10

0

1

3

5

7

9

11

Time Tick

(c) C. Faloutsos, 2006

differences from dsp stat
Differences from DSP/Stat
  • Semi-infinite streams
    • we need on-line, ‘any-time’ algorithms
  • Can not afford human intervention
    • need automatic methods
  • sensors have limited memory / processing / transmitting power
    • need for (lossy) compression

(c) C. Faloutsos, 2006

important observations
Important observations

Patterns, rules, forecasting and similarity indexing are closely related:

  • To do forecasting, we need
    • to find patterns/rules
    • to find similar settings in the past
  • to find outliers, we need to have forecasts
    • (outlier = too far away from our forecast)

(c) C. Faloutsos, 2006

important topics not in this tutorial
Important topics NOT in this tutorial:
  • Continuous queries
    • [Babu+Widom ] [Gehrke+] [Madden+]
  • Categorical data streams
    • [Hatonen+96]
  • Outlier detection (discontinuities)
    • [Breunig+00]

(c) C. Faloutsos, 2006

outline1
Outline
  • Motivation
  • Similarity Search and Indexing
  • DSP
  • Linear Forecasting
  • Bursty traffic - fractals and multifractals
  • Non-linear forecasting
  • Conclusions

(c) C. Faloutsos, 2006

outline2
Outline
  • Motivation
  • Similarity Search and Indexing
    • distance functions: Euclidean;Time-warping
    • indexing
    • feature extraction
  • DSP
  • ...

(c) C. Faloutsos, 2006

euclidean and lp

...

Euclidean and Lp
  • L1: city-block = Manhattan
  • L2 = Euclidean
  • L

(c) C. Faloutsos, 2006

slide20

$price

$price

$price

1

1

1

365

365

365

day

day

day

distance function: by expert

(c) C. Faloutsos, 2006

idea gemini
Idea: ‘GEMINI’

E.g.., ‘find stocks similar to MSFT’

Seq. scanning: too slow

How to accelerate the search?

[Faloutsos96]

(c) C. Faloutsos, 2006

gemini pictorially

S1

F(S1)

1

365

1

365

F(Sn)

day

day

Sn

‘GEMINI’ - Pictorially

eg,. std

eg, avg

(c) C. Faloutsos, 2006

gemini
GEMINI

Solution: Quick-and-dirty' filter:

  • extract n features (numbers, eg., avg., etc.)
  • map into a point in n-d feature space
  • organize points with off-the-shelf spatial access method (‘SAM’)
  • discard false alarms

(c) C. Faloutsos, 2006

examples of gemini
Examples of GEMINI
  • Time sequences: DFT (up to 100 times faster) [SIGMOD94];
  • [Kanellakis+], [Mendelzon+]

(c) C. Faloutsos, 2006

examples of gemini1
Examples of GEMINI

Even on other-than-sequence data:

  • Images (QBIC) [JIIS94]
  • tumor-like shapes [VLDB96]
  • video [Informedia + S-R-trees]
  • automobile part shapes [Kriegel+97]

(c) C. Faloutsos, 2006

indexing sams
Indexing - SAMs

Q: How do Spatial Access Methods (SAMs) work?

A: they group nearby points (or regions) together, on nearby disk pages, and answer spatial queries quickly (‘range queries’, ‘nearest neighbor’ queries etc)

For example:

(c) C. Faloutsos, 2006

r trees

Skip

R-trees
  • [Guttman84] eg., w/ fanout 4: group nearby rectangles to parent MBRs; each group -> disk page

I

C

A

G

H

F

B

J

E

D

(c) C. Faloutsos, 2006

r trees1

A

H

D

F

G

B

E

I

C

J

Skip

R-trees
  • eg., w/ fanout 4:

P1

P3

I

C

A

G

H

F

B

J

E

P4

P2

D

(c) C. Faloutsos, 2006

r trees2

P1

A

D

H

F

P2

G

B

E

I

P3

C

J

P4

Skip

R-trees
  • eg., w/ fanout 4:

P1

P3

I

C

A

G

H

F

B

J

E

P4

P2

D

(c) C. Faloutsos, 2006

r trees range search

P1

A

H

D

F

P2

G

B

E

I

P3

C

J

P4

Skip

R-trees - range search?

P1

P3

I

C

A

G

H

F

B

J

E

P4

P2

D

(c) C. Faloutsos, 2006

r trees range search1

P1

A

H

D

F

P2

G

B

E

I

P3

C

J

P4

Skip

R-trees - range search?

P1

P3

I

C

A

G

H

F

B

J

E

P4

P2

D

(c) C. Faloutsos, 2006

conclusions
Conclusions
  • Fast indexing: through GEMINI
    • feature extraction and
    • (off the shelf) Spatial Access Methods [Gaede+98]

(c) C. Faloutsos, 2006

outline3
Outline
  • Motivation
  • Similarity Search and Indexing
    • distance functions
    • indexing
    • feature extraction
  • DSP
  • ...

(c) C. Faloutsos, 2006

outline4
Outline
  • Motivation
  • Similarity Search and Indexing
    • distance functions
    • indexing
    • feature extraction
      • DFT, DWT, DCT (data independent)
      • SVD, etc (data dependent)
      • MDS, FastMap

(c) C. Faloutsos, 2006

dft and cousins
DFT and cousins

very good for compressing real signals

more details on DFT/DCT/DWT: later

(c) C. Faloutsos, 2006

feature extraction
Feature extraction
  • SVD (finds hidden/latent variables)
  • Random projections (works surprisingly well!)

(c) C. Faloutsos, 2006

conclusions practitioner s guide
Conclusions - Practitioner’s guide

Similarity search in time sequences

1) establish/choose distance (Euclidean, time-warping,…)

2) extract features (SVD, DWT, MDS), and use a SAM (R-tree/variant) or a Metric Tree (M-tree)

2’)for high intrinsic dimensionalities, consider sequential scan (it might win…)

(c) C. Faloutsos, 2006

books
Books
  • William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for SVD)
  • C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to SVD, and GEMINI)

(c) C. Faloutsos, 2006

references
References
  • Agrawal, R., K.-I. Lin, et al. (Sept. 1995). Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. Proc. of VLDB, Zurich, Switzerland.
  • Babu, S. and J. Widom (2001). “Continuous Queries over Data Streams.” SIGMOD Record 30(3): 109-120.
  • Breunig, M. M., H.-P. Kriegel, et al. (2000). LOF: Identifying Density-Based Local Outliers. SIGMOD Conference, Dallas,TX.
  • Berry, Michael: http://www.cs.utk.edu/~lsi/

(c) C. Faloutsos, 2006

references1
References
  • Ciaccia, P., M. Patella, et al. (1997). M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB.
  • Foltz, P. W. and S. T. Dumais (Dec. 1992). “Personalized Information Delivery: An Analysis of Information Filtering Methods.” Comm. of ACM (CACM) 35(12): 51-60.
  • Guttman, A. (June 1984). R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. ACM SIGMOD, Boston, Mass.

(c) C. Faloutsos, 2006

references2
References
  • Gaede, V. and O. Guenther (1998). “Multidimensional Access Methods.” Computing Surveys 30(2): 170-231.
  • Gehrke, J. E., F. Korn, et al. (May 2001). On Computing Correlated Aggregates Over Continual Data Streams. ACM Sigmod, Santa Barbara, California.

(c) C. Faloutsos, 2006

references3
References
  • Gunopulos, D. and G. Das (2001). Time Series Similarity Measures and Time Series Indexing. SIGMOD Conference, Santa Barbara, CA.
  • Hatonen, K., M. Klemettinen, et al. (1996). Knowledge Discovery from Telecommunication Network Alarm Databases. ICDE, New Orleans, Louisiana.
  • Jolliffe, I. T. (1986). Principal Component Analysis, Springer Verlag.

(c) C. Faloutsos, 2006

references4
References
  • Keogh, E. J., K. Chakrabarti, et al. (2001). Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. SIGMOD Conference, Santa Barbara, CA.
  • Eamonn J. Keogh, Stefano Lonardi, Chotirat (Ann) Ratanamahatana: Towards parameter-free data mining. KDD 2004: 206-215
  • Kobla, V., D. S. Doermann, et al. (Nov. 1997). VideoTrails: Representing and Visualizing Structure in Video Sequences. ACM Multimedia 97, Seattle, WA.

(c) C. Faloutsos, 2006

references5
References
  • Oppenheim, I. J., A. Jain, et al. (March 2002). A MEMS Ultrasonic Transducer for Resident Monitoring of Steel Structures. SPIE Smart Structures Conference SS05, San Diego.
  • Papadimitriou, C. H., P. Raghavan, et al. (1998). Latent Semantic Indexing: A Probabilistic Analysis. PODS, Seattle, WA.
  • Rabiner, L. and B.-H. Juang (1993). Fundamentals of Speech Recognition, Prentice Hall.

(c) C. Faloutsos, 2006

references6
References
  • Traina, C., A. Traina, et al. (October 2000). Fast feature selection using the fractal dimension,. XV Brazilian Symposium on Databases (SBBD), Paraiba, Brazil.

(c) C. Faloutsos, 2006

references7
References
  • Dennis Shasha and Yunyue Zhu High Performance Discovery in Time Series: Techniques and Case Studies Springer 2004
  • Yunyue Zhu, Dennis Shasha ``StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time'‘ VLDB, August, 2002. pp. 358-369.
  • Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD, June 2003, San Diego, CA.

(c) C. Faloutsos, 2006

slide47

Part 2:

DSP (Digital

Signal Processing)

(c) C. Faloutsos, 2006

outline5
Outline
  • Motivation
  • Similarity Search and Indexing
  • DSP (DFT, DWT)
  • Linear Forecasting
  • Bursty traffic - fractals and multifractals
  • Non-linear forecasting
  • Conclusions

(c) C. Faloutsos, 2006

outline6
Outline
  • DFT
    • Definition of DFT and properties
    • how to read the DFT spectrum
  • DWT
    • Definition of DWT and properties
    • how to read the DWT scalogram

(c) C. Faloutsos, 2006

introduction problem 1
Introduction - Problem#1

Goal: given a signal (eg., packets over time)

Find: patterns and/or compress

count

lynx caught per year

(packets per day;

automobiles per hour)

year

(c) C. Faloutsos, 2006

dft amplitude spectrum
DFT: Amplitude spectrum

Amplitude:

count

Ampl.

freq=0

freq=12

year

Freq.

(c) C. Faloutsos, 2006

dft amplitude spectrum1
DFT: Amplitude spectrum

count

Ampl.

freq=0

freq=12

year

Freq.

(c) C. Faloutsos, 2006

dft amplitude spectrum2
DFT: Amplitude spectrum

count

Ampl.

freq=0

freq=12

year

Freq.

(c) C. Faloutsos, 2006

wavelets dwt
Wavelets - DWT
  • DFT is great - but, how about compressing a spike?

value

time

(c) C. Faloutsos, 2006

wavelets dwt1
Wavelets - DWT
  • DFT is great - but, how about compressing a spike?
  • A: Terrible - all DFT coefficients needed!

value

Ampl.

time

Freq.

(c) C. Faloutsos, 2006

wavelets dwt2
Wavelets - DWT
  • DFT is great - but, how about compressing a spike?
  • A: Terrible - all DFT coefficients needed!

value

time

(c) C. Faloutsos, 2006

wavelets dwt3

value

time

Wavelets - DWT
  • Similarly, DFT suffers on short-duration waves (eg., baritone, silence, soprano)

(c) C. Faloutsos, 2006

wavelets dwt4

value

time

Wavelets - DWT
  • Solution#1: Short window Fourier transform (SWFT)
  • But: how short should be the window?

freq

time

(c) C. Faloutsos, 2006

wavelets dwt5
Wavelets - DWT
  • Answer: multiple window sizes! -> DWT

Time

domain

DWT

SWFT

DFT

freq

time

(c) C. Faloutsos, 2006

haar wavelets
Haar Wavelets
  • subtract sum of left half from right half
  • repeat recursively for quarters, eight-ths, ...

(c) C. Faloutsos, 2006

wavelets construction

Skip

Wavelets - construction

x0 x1 x2 x3 x4 x5 x6 x7

(c) C. Faloutsos, 2006

wavelets construction1

Skip

Wavelets - construction

x0 x1 x2 x3 x4 x5 x6 x7

s1,0

.......

level 1

d1,1

s1,1

d1,0

+

-

(c) C. Faloutsos, 2006

wavelets construction2

Skip

Wavelets - construction

x0 x1 x2 x3 x4 x5 x6 x7

s2,0

level 2

d2,0

s1,0

.......

d1,1

s1,1

d1,0

+

-

(c) C. Faloutsos, 2006

wavelets construction3

Skip

Wavelets - construction

etc ...

x0 x1 x2 x3 x4 x5 x6 x7

s2,0

d2,0

s1,0

.......

d1,1

s1,1

d1,0

+

-

(c) C. Faloutsos, 2006

wavelets construction4

Skip

Wavelets - construction

Q: map each coefficient

on the time-freq. plane

f

x0 x1 x2 x3 x4 x5 x6 x7

s2,0

d2,0

t

s1,0

.......

d1,1

s1,1

d1,0

+

-

(c) C. Faloutsos, 2006

wavelets construction5

Skip

Wavelets - construction

Q: map each coefficient

on the time-freq. plane

f

x0 x1 x2 x3 x4 x5 x6 x7

s2,0

d2,0

t

s1,0

.......

d1,1

s1,1

d1,0

+

-

(c) C. Faloutsos, 2006

haar wavelets code
#!/usr/bin/perl5

# expects a file with numbers

# and prints the dwt transform

# The number of time-ticks should be a power of 2

# USAGE

# haar.pl <fname>

my @vals=();

my @smooth; # the smooth component of the signal

my @diff; # the high-freq. component

# collect the values into the array @val

while(<>){

@vals = ( @vals , split );

}

my $len = scalar(@vals);

my $half = int($len/2);

while($half >= 1 ){

for(my $i=0; $i< $half; $i++){

$diff [$i] = ($vals[2*$i] - $vals[2*$i + 1] )/ sqrt(2);

print "\t", $diff[$i];

$smooth [$i] = ($vals[2*$i] + $vals[2*$i + 1] )/ sqrt(2);

}

print "\n";

@vals = @smooth;

$half = int($half/2);

}

print "\t", $vals[0], "\n" ; # the final, smooth component

Haar wavelets - code

(c) C. Faloutsos, 2006

wavelets construction6
Wavelets - construction

Observation1:

‘+’ can be some weighted addition

‘-’ is the corresponding weighted difference (‘Quadrature mirror filters’)

Observation2: unlike DFT/DCT,

there are *many* wavelet bases: Haar, Daubechies-4, Daubechies-6, Coifman, Morlet, Gabor, ...

(c) C. Faloutsos, 2006

wavelets how do they look like
Wavelets - how do they look like?
  • E.g., Daubechies-4

(c) C. Faloutsos, 2006

wavelets how do they look like1
Wavelets - how do they look like?
  • E.g., Daubechies-4

?

?

(c) C. Faloutsos, 2006

wavelets how do they look like2
Wavelets - how do they look like?
  • E.g., Daubechies-4

(c) C. Faloutsos, 2006

outline7
Outline
  • Motivation
  • Similarity Search and Indexing
  • DSP
    • DFT
    • DWT
      • Definition of DWT and properties
      • how to read the DWT scalogram

(c) C. Faloutsos, 2006

wavelets drill 1

f

value

t

time

Wavelets - Drill#1:
  • Q: baritone/silence/soprano - DWT?

(c) C. Faloutsos, 2006

wavelets drill 11

value

time

Wavelets - Drill#1:
  • Q: baritone/soprano - DWT?

f

t

(c) C. Faloutsos, 2006

wavelets drill 2

f

t

Wavelets - Drill#2:
  • Q: spike - DWT?

(c) C. Faloutsos, 2006

wavelets drill 21
Wavelets - Drill#2:
  • Q: spike - DWT?

0.00 0.00 0.71 0.00

0.00 0.50

-0.35

0.35

f

t

(c) C. Faloutsos, 2006

wavelets drill 3
Wavelets - Drill#3:
  • Q: weekly + daily periodicity, + spike - DWT?

f

t

(c) C. Faloutsos, 2006

wavelets drill 31
Wavelets - Drill#3:
  • Q: weekly + daily periodicity, + spike - DWT?

f

t

(c) C. Faloutsos, 2006

wavelets drill 32
Wavelets - Drill#3:
  • Q: weekly + daily periodicity, + spike - DWT?

f

t

(c) C. Faloutsos, 2006

wavelets drill 33
Wavelets - Drill#3:
  • Q: weekly + daily periodicity, + spike - DWT?

f

t

(c) C. Faloutsos, 2006

wavelets drill 34
Wavelets - Drill#3:
  • Q: weekly + daily periodicity, + spike - DWT?

f

t

(c) C. Faloutsos, 2006

wavelets drill 35

f

t

Wavelets - Drill#3:
  • Q: DFT?

DFT

DWT

f

t

(c) C. Faloutsos, 2006

advantages of wavelets
Advantages of Wavelets
  • Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
  • fast to compute (usually: O(n)!)
  • very good for ‘spikes’
  • mammalian eye and ear: Gabor wavelets

(c) C. Faloutsos, 2006

overall conclusions
Overall Conclusions
  • DFT, DCT spot periodicities
  • DWT : multi-resolution - matches processing of mammalian ear/eye better
  • All three: powerful tools for compression, pattern detection in real signals
  • All three: included in math packages
    • (matlab, ‘R’, mathematica, … - often in spreadsheets!)

(c) C. Faloutsos, 2006

overall conclusions1
Overall Conclusions
  • DWT : very suitable for self-similar traffic
  • DWT: used for summarization of streams [Gilbert+01], db histograms etc

(c) C. Faloutsos, 2006

resources software and urls
Resources - software and urls
  • http://www.dsptutor.freeuk.com/jsanalyser/FFTSpectrumAnalyser.html : Nice java applets for FFT
  • http://www.relisoft.com/freeware/freq.html voice frequency analyzer (needs microphone)

(c) C. Faloutsos, 2006

resources software and urls1
Resources: software and urls
  • xwpl: open source wavelet package from Yale, with excellent GUI
  • http://monet.me.ic.ac.uk/people/gavin/java/waveletDemos.html : wavelets and scalograms

(c) C. Faloutsos, 2006

books1
Books
  • William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
  • C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

(c) C. Faloutsos, 2006

additional reading
Additional Reading
  • [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001

(c) C. Faloutsos, 2006

slide90

Part 3:

Linear Forecasting

skip to end

(c) C. Faloutsos, 2006

outline8
Outline
  • Motivation
  • Similarity Search and Indexing
  • DSP
  • Linear Forecasting
  • Bursty traffic - fractals and multifractals
  • Non-linear forecasting
  • Conclusions

(c) C. Faloutsos, 2006

forecasting
Forecasting

"Prediction is very difficult, especially about the future." - Nils Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html

(c) C. Faloutsos, 2006

outline9
Outline
  • Motivation
  • ...
  • Linear Forecasting
    • Auto-regression: Least Squares; RLS
    • Co-evolving time sequences
    • Examples
    • Conclusions

(c) C. Faloutsos, 2006

problem 2 forecast1
Problem#2: Forecast
  • Example: give xt-1, xt-2, …, forecast xt

90

80

70

60

Number of packets sent

??

50

40

30

20

10

0

1

3

5

7

9

11

Time Tick

(c) C. Faloutsos, 2006

forecasting preprocessing
Forecasting: Preprocessing

MANUALLY:

remove trends spot periodicities

7 days

time

time

(c) C. Faloutsos, 2006

problem 2 forecast2

90

80

70

??

60

50

40

30

20

10

0

1

3

5

7

9

11

Time Tick

Problem#2: Forecast
  • Solution: try to express

xt

as a linear function of the past: xt-2, xt-2, …,

(up to a window of w)

Formally:

(c) C. Faloutsos, 2006

linear auto regression
Linear Auto Regression:

85

‘lag-plot’

80

75

70

65

Number of packets sent (t)

60

55

50

45

40

15

25

35

45

Number of packets sent (t-1)

  • lag w=1
  • Dependent variable = # of packets sent (S[t])
  • Independent variable = # of packets sent (S[t-1])

(c) C. Faloutsos, 2006

more details
More details:
  • Q1: Can it work with window w>1?
  • A1: YES! (we’ll fit a hyper-plane, then!)

xt

xt-1

xt-2

(c) C. Faloutsos, 2006

how to choose w
goal: capture arbitrary periodicities

with NO human intervention

on a semi-infinite stream

How to choose ‘w’?

(c) C. Faloutsos, 2006

answer
‘AWSOM’ (Arbitrary Window Stream fOrecasting Method) [Papadimitriou+, vldb2003]

idea: do AR on each wavelet level

in detail:

Answer:

(c) C. Faloutsos, 2006

awsom

W1,3

t

W1,1

W1,4

W1,2

t

t

t

t

frequency

W2,1

W2,2

=

t

t

W3,1

t

V4,1

t

time

AWSOM

xt

(c) C. Faloutsos, 2006

awsom1

W1,3

t

W1,1

W1,4

W1,2

t

t

t

t

frequency

W2,1

W2,2

t

t

W3,1

t

V4,1

t

time

AWSOM

xt

(c) C. Faloutsos, 2006

awsom idea

Wl,t-2

Wl,t-1

Wl,t

Wl’,t’-2

Wl’,t’-1

AWSOM - idea

Wl,t l,1Wl,t-1l,2Wl,t-2 …

Wl’,t’ l’,1Wl’,t’-1l’,2Wl’,t’-2 …

Wl’,t’

(c) C. Faloutsos, 2006

more details1
More details…
  • Update of wavelet coefficients
  • Update of linear models
  • Feature selection
    • Not all correlations are significant
    • Throw away the insignificant ones (“noise”)

(incremental)

(incremental; RLS)

(single-pass)

(c) C. Faloutsos, 2006

results synthetic data
Results - Synthetic data

AWSOM

AR

Seasonal AR

  • Triangle pulse
  • Mix (sine + square)
  • AR captures wrong trend (or none)
  • Seasonal AR estimation fails

(c) C. Faloutsos, 2006

results real data
Results - Real data
  • Automobile traffic
    • Daily periodicity
    • Bursty “noise” at smaller scales
  • AR fails to capture any trend
  • Seasonal AR estimation fails

(c) C. Faloutsos, 2006

results real data1
Results - real data
  • Sunspot intensity
    • Slightly time-varying “period”
  • AR captures wrong trend
  • Seasonal ARIMA
    • wrong downward trend, despite help by human!

(c) C. Faloutsos, 2006

complexity

Skip

Complexity
  • Model update

Space:OlgN + mk2  OlgN

Time:Ok2  O1

  • Where
    • N: number of points (so far)
    • k: number of regression coefficients; fixed
    • m: number of linear models; OlgN

(c) C. Faloutsos, 2006

conclusions practitioner s guide1
Conclusions - Practitioner’s guide
  • AR(IMA) methodology: prevailing method for linear forecasting
  • Brilliant method of Recursive Least Squares for fast, incremental estimation.
  • See [Box-Jenkins]
  • recently: AWSOM (no human intervention)

(c) C. Faloutsos, 2006

resources software and urls2
Resources: software and urls
  • MUSCLES: Prof. Byoung-Kee Yi:

http://www.postech.ac.kr/~bkyi/

or christos@cs.cmu.edu

  • free-ware: ‘R’ for stat. analysis

(clone of Splus)

http://cran.r-project.org/

(c) C. Faloutsos, 2006

books2
Books
  • George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)
  • Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

(c) C. Faloutsos, 2006

additional reading1
Additional Reading
  • [Papadimitriou+ vldb2003] Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003
  • [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)

(c) C. Faloutsos, 2006

outline10
Outline
  • Motivation
  • Similarity Search and Indexing
  • DSP (Digital Signal Processing)
  • Linear Forecasting
  • Bursty traffic - fractals and multifractals
  • Non-linear forecasting
  • On-going projects and Conclusions

(c) C. Faloutsos, 2006

on going projects
On-going projects
  • Lag correlations (BRAID, [SIGMOD’05])
  • Streaming SVD (SPIRIT, [VLDB’05])

http://warsteiner.db.cs.cmu.edu/

http://warsteiner.db.cs.cmu.edu/demo/intemon.jsp

  • tensor analysis ([KDD’06])

IP-to

t=0

IP-from

(c) C. Faloutsos, 2006

on going projects1
On-going projects
  • Lag correlations (BRAID, [SIGMOD’05])
  • Streaming SVD (SPIRIT, [VLDB’05])

http://warsteiner.db.cs.cmu.edu/

http://warsteiner.db.cs.cmu.edu/demo/intemon.jsp

  • tensor analysis ([KDD’06])

t=2

t=1

t=0

(c) C. Faloutsos, 2006

ongoing projects ref s
Ongoing projects - ref’s
  • [BRAID] Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos: BRAID: Stream Mining through Group Lag Correlations. SIGMOD 2005: 599-610, Baltimore, MD, USA.
  • [SPIRIT] Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos: Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005: 697-708, Trodheim, Norway.
  • [Tensors] Jimeng Sun Dacheng Tao Christos Faloutsos Beyond Streams and Graphs: Dynamic Tensor Analysis KDD 2006, Philadelphia, PA, USA.

(c) C. Faloutsos, 2006

overall conclusions2
Overall conclusions
  • Similarity search: Euclidean/time-warping; feature extraction and SAMs

(c) C. Faloutsos, 2006

overall conclusions3
Overall conclusions
  • Similarity search: Euclidean/time-warping; feature extraction and SAMs
  • Signal processing: DWT is a powerful tool

(c) C. Faloutsos, 2006

overall conclusions4
Overall conclusions
  • Similarity search: Euclidean/time-warping; feature extraction and SAMs
  • Signal processing: DWT is a powerful tool
  • Linear Forecasting: AR (Box-Jenkins) methodology; AWSOM

(c) C. Faloutsos, 2006

overall conclusions5
Overall conclusions
  • Similarity search: Euclidean/time-warping; feature extraction and SAMs
  • Signal processing: DWT is a powerful tool
  • Linear Forecasting: AR (Box-Jenkins) methodology; AWSOM
  • Bursty traffic: multifractals (80-20 ‘law’)

(c) C. Faloutsos, 2006

overall conclusions6
Overall conclusions
  • Similarity search: Euclidean/time-warping; feature extraction and SAMs
  • Signal processing: DWT is a powerful tool
  • Linear Forecasting: AR (Box-Jenkins) methodology; AWSOM
  • Bursty traffic: multifractals (80-20 ‘law’)
  • Non-linear forecasting: lag-plots (Takens)

(c) C. Faloutsos, 2006

take home messages
‘Take home’ messages
  • Hard, but desirable query for sensor data: ‘find patterns / outliers’
  • We need fast, automated such tools
    • Many great tools exist (DWT, ARIMA, …)
    • some are readily usable; others need to be made scalable / single pass/ automatic

(c) C. Faloutsos, 2006

slide123

THANK YOU!

For code, papers, questions etc:

christos <at> cs.cmu.edu

www.cs.cmu.edu/~christos

(c) C. Faloutsos, 2006