Sensor data mining and forecasting

Christos Faloutsos

CMU

[email protected]


Outline

Problem definition - motivation

Linear forecasting - AR and AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos


Problem definition

  • Given: one or more sequences

    $x_1, x_2, \ldots, x_t, \ldots$

    ($y_1, y_2, \ldots, y_t, \ldots$, …)

  • Find

    • forecasts; patterns

    • clusters; outliers


Motivation - Applications

  • Financial, sales, economic series

  • Medical

    • ECGs +; blood pressure etc monitoring

    • reactions to new drugs

    • elderly care


Motivation - Applications (cont’d)

  • ‘Smart house’

    • sensors monitor temperature, humidity, air quality

  • video surveillance


Motivation - Applications (cont’d)

  • civil/automobile infrastructure

    • bridge vibrations [Oppenheim+02]

    • road conditions / traffic monitoring


Stream Data: automobile traffic

[Figure: automobile traffic; # cars (axis range 0 to 2000) vs. time]


Motivation - Applications (cont’d)

  • Weather, environment/anti-pollution

    • volcano monitoring

    • air/water pollutant monitoring


Stream Data: Sunspots

[Figure: # sunspots per month vs. time]


Motivation - Applications (cont’d)

  • Computer systems

    • ‘Active Disks’ (buffering, prefetching)

    • web servers (ditto)

    • network traffic monitoring

    • ...


Stream Data: Disk accesses

[Figure: # bytes vs. time]


Settings & Applications

  • One or more sensors, collecting time-series data


Settings & Applications

Each sensor collects data (x1, x2, …, xt, …)


Settings & Applications

Sensors ‘report’ to a central site


Settings & Applications

Problem #1:

Finding patterns in a single time sequence


Settings & Applications

Problem #2:

Finding patterns in many time sequences


Problem #1:

Goal: given a signal (e.g., #packets over time)

Find: patterns, periodicities, and/or compress

[Figure: count vs. year; lynx caught per year (similarly: packets per day; temperature per day)]


Problem#1’: Forecast

Given $x_t, x_{t-1}, \ldots$, forecast $x_{t+1}$

[Figure: number of packets sent vs. time tick (ticks 1 to 11); the next value is marked ‘??’]


Problem #2:

  • Given: A set of correlated time sequences

  • Forecast ‘Sent(t)’


Differences from DSP/Stat

  • Semi-infinite streams

    • we need on-line, ‘any-time’ algorithms

  • Cannot afford human intervention

    • need automatic methods

  • sensors have limited memory / processing / transmitting power

    • need for (lossy) compression


Important observations

Patterns, rules, compression and forecasting are closely related:

  • To do forecasting, we need

    • to find patterns/rules

  • good rules help us compress

  • to find outliers, we need to have forecasts

    • (outlier = too far away from our forecast)


Pictorial outline of the talk


Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc


Mini intro to A.R.


Forecasting

"Prediction is very difficult, especially about the future." - Niels Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html


Problem#1’: Forecast

  • Example: given $x_{t-1}, x_{t-2}, \ldots$, forecast $x_t$

[Figure: number of packets sent vs. time tick (ticks 1 to 11); the next value is marked ‘??’]


Linear Regression: idea

[Figure: scatter plot, body height vs. body weight]

  • express what we don’t know (= ‘dependent variable’)

  • as a linear function of what we know (= ‘indep. variable(s)’)


Linear Auto Regression:


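Problem#1’: Forecast

  • Solution: try to express $x_t$ as a linear function of the past: $x_{t-1}, x_{t-2}, \ldots$ (up to a window of $w$)

    Formally: $x_t \approx a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_w x_{t-w} + \text{noise}$

[Figure: number of packets sent vs. time tick (ticks 1 to 11); the next value is marked ‘??’]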
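A minimal sketch of this windowed linear fit, assuming plain batch least squares over all past windows (the slides do not specify the solver). The helper names `solve`, `fit_ar` and `forecast_next`, and the noise-free toy series, are illustrative assumptions and not part of the original talk.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def fit_ar(x, w):
    """Least-squares fit of x[t] ~ a[0]*x[t-1] + ... + a[w-1]*x[t-w]."""
    rows = [[x[t - j] for j in range(1, w + 1)] for t in range(w, len(x))]
    ys = [x[t] for t in range(w, len(x))]
    # Normal equations: (X^T X) a = X^T y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(w)] for i in range(w)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(w)]
    return solve(XtX, Xty)


def forecast_next(x, a):
    """One-step forecast from the last len(a) values (most recent first)."""
    return sum(a[j] * x[-1 - j] for j in range(len(a)))


# Toy, noise-free series (an assumption): x[t] = 1.5*x[t-1] - 0.8*x[t-2]
x = [1.0, 2.0]
for _ in range(50):
    x.append(1.5 * x[-1] - 0.8 * x[-2])

a = fit_ar(x, w=2)          # recovers coefficients close to [1.5, -0.8]
pred = forecast_next(x, a)  # forecast for the next, unseen tick
```

With real, noisy data the recovered coefficients are only approximate, and the window `w` has to be chosen by hand here.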


Linear Auto Regression:

[Figure: ‘lag-plot’; number of packets sent (t) vs. number of packets sent (t-1)]

  • lag w=1

  • Dependent variable = # of packets sent (S[t])

  • Independent variable = # of packets sent (S[t-1])


More details:

  • Q1: Can it work with window w>1?

  • A1: YES!

[Figure: 3-d lag plot with axes $x_t$, $x_{t-1}$, $x_{t-2}$]


More details:

  • Q1: Can it work with window w>1?

  • A1: YES! (we’ll fit a hyper-plane, then!)

[Figure: 3-d lag plot with axes $x_t$, $x_{t-1}$, $x_{t-2}$]


Even more details

  • Q2: Can we estimate the coefficients a incrementally?

  • A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)

  • Q3: can we ‘down-weight’ older samples?

  • A3: yes (RLS does that easily!)
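One way to sketch the RLS update, including the down-weighting of older samples. The class name `RLS`, the forgetting factor `lam` and the initial scale `delta` are assumptions chosen for illustration; see [Yi+00] for the authors' own formulation.

```python
class RLS:
    """Recursive least squares with forgetting factor lam (0 < lam <= 1)."""

    def __init__(self, k, lam=1.0, delta=1e6):
        self.a = [0.0] * k  # current coefficient estimate
        # P tracks the inverse of the (weighted) X^T X; start with a large diagonal.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]
        self.lam = lam

    def update(self, x, y):
        """Absorb one sample (x: list of k inputs, y: target); return the prior error."""
        k = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - sum(self.a[i] * x[i] for i in range(k))
        self.a = [self.a[i] + gain[i] * err for i in range(k)]
        # P <- (P - gain * x^T P) / lam; P is symmetric, so x^T P equals Px transposed.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(k)]
                  for i in range(k)]
        return err


# Stream the same kind of toy AR(2) series (an assumption) one tick at a time.
x = [1.0, 2.0]
for _ in range(50):
    x.append(1.5 * x[-1] - 0.8 * x[-2])

model = RLS(k=2)
for t in range(2, len(x)):
    model.update([x[t - 1], x[t - 2]], x[t])
```

Each update costs O(k^2) and needs no access to earlier samples. Setting `lam` below 1 makes the weight of a sample decay geometrically with its age.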


Mini intro to A.R.


How to choose ‘w’?

  • goal: capture arbitrary periodicities with NO human intervention on a semi-infinite stream


Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc


Problem:

  • in a train of spikes (128 ticks apart)

  • any AR with window w < 128 will fail

    What to do, then?


Answer (intuition)

  • Do a Wavelet transform (~ short window DFT)

  • look for patterns in every frequency


Intuition

  • Why NOT use the short window Fourier transform (SWFT)?

  • A: how short should the window be?

[Figure: time-frequency plane (freq vs. time) with a fixed window of width w’]


Wavelets

  • main idea: variable-length window!

[Figure: time-frequency plane (f vs. t) for wavelets]


Advantages of Wavelets

  • Better compression (better RMSE with same number of coefficients - used in JPEG-2000)

  • fast to compute (usually: O(n)!)

  • very good for ‘spikes’

  • mammalian eye and ear: Gabor wavelets
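To illustrate the O(n) cost and the behaviour on spikes, here is a sketch using the Haar wavelet. Haar is the simplest wavelet and is an assumption on our part, since the slides do not name a specific basis. The function name `haar_dwt` and the toy signal are likewise illustrative.

```python
import math


def haar_dwt(x):
    """Full Haar decomposition of a sequence whose length is a power of 2.

    Returns (smooth, details): `smooth` is the single coarsest coefficient and
    `details` holds the detail coefficients per level, finest level first.
    """
    s = [float(v) for v in x]
    details = []
    while len(s) > 1:
        pairs = list(zip(s[0::2], s[1::2]))
        details.append([(p - q) / math.sqrt(2) for p, q in pairs])  # differences
        s = [(p + q) / math.sqrt(2) for p, q in pairs]              # averages
    return s[0], details


# A single spike in an otherwise silent signal
x = [0, 0, 0, 0, 0, 8, 0, 0]
smooth, details = haar_dwt(x)

# Only one detail coefficient per level is touched by the spike.
nonzero = sum(1 for level in details for d in level if abs(d) > 1e-12)
energy = smooth ** 2 + sum(d * d for level in details for d in level)
```

Each pass halves the sequence, so the total work is n/2 + n/4 + ... which stays below n pairwise operations. The transform is orthonormal, so the coefficients carry the same energy as the input.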


Wavelets - intuition:

  • Q: baritone/silence/soprano - DWT?

[Figure: the signal (value vs. time) and its time-frequency plot (f vs. t)]


Wavelets - intuition:

  • Q: baritone/soprano - DWT?

[Figure: the signal (value vs. time) and its time-frequency plot (f vs. t)]


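AWSOM

[Figure: the signal $x_t$ shown as a tiling of wavelet coefficients in the time-frequency plane: $W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}$; $W_{2,1}, W_{2,2}$; $W_{3,1}$; and $V_{4,1}$]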


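AWSOM - idea

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \cdots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \cdots$

[Figure: wavelet coefficients $W_{l,t-2}, W_{l,t-1}, W_{l,t}$ at level $l$ and $W_{l',t'-2}, W_{l',t'-1}, W_{l',t'}$ at level $l'$]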


More details…

  • Update of wavelet coefficients (incremental)

  • Update of linear models (incremental; RLS)

  • Feature selection (single-pass)

    • Not all correlations are significant

    • Throw away the insignificant ones (“noise”)


Results - Synthetic data

  • Triangle pulse

  • Mix (sine + square)

  • AR captures wrong trend (or none)

  • Seasonal AR estimation fails

[Figure: forecasts by AWSOM, AR and Seasonal AR]


Results - Real data

  • Automobile traffic

    • Daily periodicity

    • Bursty “noise” at smaller scales

  • AR fails to capture any trend

  • Seasonal AR estimation fails


Results - real data

  • Sunspot intensity

    • Slightly time-varying “period”

  • AR captures wrong trend

  • Seasonal ARIMA

    • wrong downward trend, despite help by human!


Complexity

  • Model update

    Space: $O(\lg N + m k^2) \approx O(\lg N)$

    Time: $O(k^2) \approx O(1)$

  • Where

    • $N$: number of points (so far)

    • $k$: number of regression coefficients; fixed

    • $m$: number of linear models; $O(\lg N)$


Conclusions

  • AWSOM: Automatic, ‘hands-off’ traffic modeling (first of its kind!)


Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc


Co-Evolving Time Sequences

  • Given: A set of correlated time sequences

  • Forecast ‘Repeated(t)’

??


Solution:

Q: what should we do?


Solution:

Least Squares, with

  • Dep. Variable: Repeated(t)

  • Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1), ...

  • (named: ‘MUSCLES’ [Yi+00])
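The regression above can be sketched as follows. This is a batch least-squares illustration under stated assumptions, not the authors' implementation, which [Yi+00] describes together with Recursive Least Squares. The names `solve` and `muscles_fit`, and the synthetic `sent`, `lost` and `repeated` sequences, are hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def muscles_fit(target, others, w):
    """Regress target[t] on the last w values of every sequence, target included."""
    seqs = others + [target]
    rows, ys = [], []
    for t in range(w, len(target)):
        rows.append([s[t - j] for s in seqs for j in range(1, w + 1)])
        ys.append(target[t])
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(XtX, Xty)


# Synthetic data (an assumption): Repeated(t) = 0.5*Sent(t-1) + 2*Lost(t-1)
sent = [(7 * t) % 11 for t in range(60)]
lost = [(3 * t * t) % 5 for t in range(60)]
repeated = [0.0] + [0.5 * sent[t - 1] + 2.0 * lost[t - 1] for t in range(1, 60)]

# Coefficient order: Sent(t-1), Lost(t-1), Repeated(t-1)
coef = muscles_fit(repeated, [sent, lost], w=1)
```

On this toy data the fit puts essentially all the weight on the other two sequences and none on the target's own past, which is the kind of cross-sequence correlation a single-sequence AR model cannot use.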


Examples - Experiments

  • Datasets

    • Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)

    • AT&T WorldNet internet usage (several data streams; 980 time-ticks)

  • Measures of success

    • Accuracy : Root Mean Square Error (RMSE)


Accuracy - “Modem”

MUSCLES outperforms AR & “yesterday”


Accuracy - “Internet”

  • MUSCLES consistently outperforms AR & “yesterday”


Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc


Detailed Outline

  • Non-linear forecasting

    • Problem

    • Idea

    • How-to

    • Experiments

    • Conclusions


Recall: Problem #1

Given a time series $\{x_t\}$, predict its future course, that is, $x_{t+1}, x_{t+2}, \ldots$

[Figure: value vs. time]


How to forecast?

  • ARIMA - but: linearity assumption

  • ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92]


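General Intuition (Lag Plot)

  • Lag = 1, k = 4 NN

  • Find the 4 nearest neighbors (4-NN) of the new point in the lag plot

  • Interpolate these to get the final prediction

[Figure: lag plot, $x_t$ vs. $x_{t-1}$, showing the new point and its 4-NN]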


Questions:

  • Q1: How to choose lag L?

  • Q2: How to choose k (the # of NN)?

  • Q3: How to interpolate?

  • Q4: why should this work at all?


Q1: Choosing lag L

  • Manually (16, in award winning system by [Sauer94])

  • Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]


Fractal Dimensions

  • FD = intrinsic dimensionality

Embedding dimensionality = 3

Intrinsic dimensionality = 1


Fractal Dimensions

  • FD = intrinsic dimensionality

[Figure: log(# pairs) vs. log(r)]
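A small sketch of how such a plot can be turned into a number, assuming it is the usual correlation-integral plot: count the pairs of points within distance r for several radii and take the slope of log(# pairs) against log(r). The function name `correlation_dimension`, the radii and the test data are illustrative assumptions.

```python
import math


def correlation_dimension(points, radii):
    """Estimate intrinsic dimension as the slope of log(#pairs) vs. log(r)."""
    n = len(points)
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(i + 1, n)]
    logs = []
    for r in radii:
        pairs = sum(1 for d in dists if d <= r)  # pairs within distance r
        if pairs > 0:
            logs.append((math.log(r), math.log(pairs)))
    # Least-squares slope through the (log r, log #pairs) points
    mx = sum(u for u, _ in logs) / len(logs)
    my = sum(v for _, v in logs) / len(logs)
    num = sum((u - mx) * (v - my) for u, v in logs)
    den = sum((u - mx) ** 2 for u, _ in logs)
    return num / den


# 300 points on a straight line embedded in 3-d space:
# embedding dimensionality 3, intrinsic dimensionality 1.
line = [(t / 300, 2 * t / 300, -t / 300) for t in range(300)]
fd = correlation_dimension(line, radii=[0.05, 0.1, 0.2, 0.4])
```

The estimate comes out close to 1 rather than 3. The all-pairs distance computation is quadratic in the number of points, which is fine for a sketch but not for a long stream.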


Intuition

  • The Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$

  • Its lag plot for lag = 1

[Figure: time plot (x(t) vs. time) and lag plot (x(t) vs. x(t-1))]


Intuition

[Figure: lag plots with axes x(t), x(t-1) (lag = 1) and x(t), x(t-1), x(t-2) (lag = 2)]


Intuition

  • The FD vs L plot does flatten out

  • L(opt) = 1

[Figure: fractal dimension vs. lag]


Proposed Method

  • Use Fractal Dimensions to find the optimal lag length L(opt)

[Figure: fractal dimension vs. lag (L); choose the lag at which the curve flattens out to within epsilon]


Q2: Choosing number of neighbors k

  • Manually (typically ~ 1-10)


Q3: How to interpolate?

How do we interpolate between the k nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop with distance - how?)
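Options A3.1 and A3.2 can be sketched as below. The inverse-distance weighting is only one possible answer to the "how?" on the slide, and the function name `knn_forecast`, the parameter a = 3.9 and the noise-free logistic series are assumptions made for illustration.

```python
def knn_forecast(x, lag, k, weighted=False):
    """Forecast the value following x[-1] from the k nearest delay vectors."""
    query = x[-lag:]
    candidates = []
    for t in range(lag, len(x)):
        vec = x[t - lag:t]  # delay vector that preceded x[t]
        d = sum((u - v) ** 2 for u, v in zip(vec, query)) ** 0.5
        candidates.append((d, x[t]))
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    if not weighted:
        # A3.1: plain average of the neighbors' successors
        return sum(v for _, v in nearest) / len(nearest)
    # A3.2: weights drop with distance (inverse distance, one possible choice)
    wts = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(wts, nearest)) / sum(wts)


# Noise-free logistic parabola with a = 3.9 (assumed value)
series = [0.3]
for _ in range(999):
    series.append(3.9 * series[-1] * (1.0 - series[-1]))

history, truth = series[:-1], series[-1]
plain = knn_forecast(history, lag=1, k=4)
wavg = knn_forecast(history, lag=1, k=4, weighted=True)
```

Both variants scan the whole history for every forecast, so a spatial index would be needed for long streams.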


Q3: How to interpolate?

A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)

[Figure: lag plot, $x_t$ vs. $x_{t-1}$]


Q4: Any theory behind it?

A4: YES!


Theoretical foundation

  • Based on the “Takens’ Theorem” [Takens81]

  • which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)


Theoretical foundation

Example: Lotka-Volterra equations

$dH/dt = rH - aHP$

$dP/dt = bHP - mP$

$H$ is the count of prey (e.g., hare); $P$ is the count of predators (e.g., lynx)

Suppose only $P(t)$ is observed ($t = 1, 2, \ldots$).

[Figure: P vs. H]


Theoretical foundation

  • But the delay vector space is a faithful reconstruction of the internal system state

  • So prediction in delay vector space is as good as prediction in state space

[Figure: state space (P vs. H) and delay vector space (P(t) vs. P(t-1))]


Detailed Outline

  • Non-linear forecasting

    • Problem

    • Idea

    • How-to

    • Experiments

    • Conclusions


Datasets

Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$

Models population of flies [R. May/1976]

[Figure: time plot (x(t) vs. time) and lag-plot]


Datasets

Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$

Models population of flies [R. May/1976]

[Figure: time plot (x(t) vs. time) and lag-plot]

ARIMA: fails
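A rough illustration of why a linear model struggles here, using a linear AR(1) fit with intercept as a simple stand-in for the ARIMA family and a 1-nearest-neighbor forecast in the lag plot as the non-linear alternative. The value a = 3.9, the noise-free series, the train/test split and all function names are assumptions made for this sketch.

```python
def logistic_series(n, a=3.9, x0=0.3):
    """Noise-free logistic parabola x[t] = a * x[t-1] * (1 - x[t-1])."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(a * xs[-1] * (1.0 - xs[-1]))
    return xs


def rmse(errs):
    return (sum(e * e for e in errs) / len(errs)) ** 0.5


x = logistic_series(1200)
train, test_start = x[:1000], 1000
prev, nxt = train[:-1], train[1:]  # (x[t-1], x[t]) pairs, i.e. the lag plot

# Linear AR(1) with intercept: x[t] ~ c0 + c1 * x[t-1]
mp = sum(prev) / len(prev)
mn = sum(nxt) / len(nxt)
c1 = (sum((p - mp) * (q - mn) for p, q in zip(prev, nxt))
      / sum((p - mp) ** 2 for p in prev))
c0 = mn - c1 * mp


def nn_forecast(value):
    """Successor of the training point closest to `value` in the lag plot."""
    best = min(range(len(prev)), key=lambda i: abs(prev[i] - value))
    return nxt[best]


lin_err = [x[t] - (c0 + c1 * x[t - 1]) for t in range(test_start, len(x))]
nn_err = [x[t] - nn_forecast(x[t - 1]) for t in range(test_start, len(x))]
lin_rmse, nn_rmse = rmse(lin_err), rmse(nn_err)
```

The lag plot is a parabola, so no single straight line through it can predict well, while a neighbor lookup follows the curve.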


Logistic Parabola

[Figure: value vs. timesteps, with a marker at the point where our prediction starts]


Logistic Parabola

Comparison of prediction to correct values

[Figure: value vs. timesteps]


Datasets

LORENZ: Models convection currents in the air

$dx/dt = a(y - x)$

$dy/dt = x(b - z) - y$

$dz/dt = xy - cz$

[Figure: value over time]


LORENZ

Comparison of prediction to correct values

[Figure: value vs. timesteps]


Datasets

  • LASER: fluctuations in a Laser over time (used in Santa Fe competition)

[Figure: value vs. time]


Laser

Comparison of prediction to correct values

[Figure: value vs. timesteps]


Conclusions

  • Lag plots for non-linear forecasting (Takens’ theorem)

  • suitable for ‘chaotic’ signals


Additional projects at CMU

  • Graph/Network mining

  • spatio-temporal mining - outliers


Graph/network mining

  • Internet; web; gnutella P2P networks

  • Q: Any pattern?

  • Q: how to generate ‘realistic’ topologies?

  • Q: how to define/verify realism?


Patterns?

  • avg degree is, say 3.3

  • pick a node at random - what is the degree you expect it to have?

[Figure: count vs. degree, with ‘avg: 3.3’ marked and the shape of the curve left as a question]


Patterns?

  • avg degree is, say 3.3

  • pick a node at random - what is the degree you expect it to have?

  • A: 1!!

[Figure: count vs. degree, with ‘avg: 3.3’ marked]


Patterns?

  • A: Power laws!

[Figure: log(count) vs. log {(out) degree}]


Other ‘laws’?

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]


RMAT, to generate realistic graphs

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]


Epidemic threshold?

  • on a real graph, will a (computer / biological) virus die out? Given:

    • beta: probability that an infected node will infect its neighbor, and

    • delta: probability that an infected node will recover

[Figure: example cases labeled NO, MAYBE and YES]


Epidemic threshold?

  • on a real graph, will a (computer / biological) virus die out? Given:

    • beta: probability that an infected node will infect its neighbor, and

    • delta: probability that an infected node will recover

  • A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
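A sketch of how this check could look in practice. The slide only states that the answer depends on the largest eigenvalue; the concrete condition used below, that the virus dies out when beta/delta is below 1/lambda_1, is our reading of [Wang+03] and should be treated as an assumption here. The function names and the small star graph are illustrative.

```python
def largest_eigenvalue(adj, iters=500):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix.

    Uses power iteration on (A + I); the shift avoids the oscillation that
    plain power iteration shows on bipartite graphs such as stars.
    """
    n = len(adj)
    v = [1.0 / n ** 0.5] * n  # unit-length starting vector
    lam = 0.0
    for _ in range(iters):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w[i] * v[i] for i in range(n))  # Rayleigh quotient of A + I
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return lam - 1.0  # undo the shift


def dies_out(adj, beta, delta):
    """Assumed threshold condition: beta / delta < 1 / lambda_1."""
    return beta / delta < 1.0 / largest_eigenvalue(adj)


# Star graph: node 0 connected to four leaves; its largest eigenvalue is 2.
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
lam1 = largest_eigenvalue(star)
```

For large sparse graphs one would store adjacency lists rather than a dense matrix, but the iteration itself stays the same.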


Additional projects

  • Graph mining

  • spatio-temporal mining - outliers


Outliers - ‘LOCI’


Outliers - ‘LOCI’

  • finds outliers quickly, with no human intervention


Conclusions

  • AWSOM for automatic, linear forecasting

  • MUSCLES for co-evolving sequences

  • F4 for non-linear forecasting

  • Graph/Network topology: power laws and generators; epidemic threshold

  • LOCI for outlier detection


Conclusions

  • Overarching theme: automatic discovery of patterns (outliers/rules) in

    • time sequences (sensors/streams)

    • graphs (computer/social networks)

    • multimedia (video, motion capture data etc)

      www.cs.cmu.edu/~christos

      [email protected]


Books

  • William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)

  • C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)


Books

  • George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)

  • Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.


Resources: software and urls

  • MUSCLES: Prof. Byoung-Kee Yi:

    http://www.postech.ac.kr/~bkyi/

    or [email protected]

  • AWSOM & LOCI: [email protected]

  • F4, RMAT: [email protected]


Additional Reading

  • [Chakrabarti+02] Deepayan Chakrabarti and Christos Faloutsos: F4: Large-Scale Automated Forecasting using Fractals. CIKM 2002, Washington DC, Nov. 2002.

  • [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994:161-172

  • [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001


Additional Reading

  • Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining. VLDB 2003, Berlin, Germany, Sept. 2003.

  • Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos: LOCI: Fast Outlier Detection Using the Local Correlation Integral. ICDE 2003, Bangalore, India, March 5 - March 8, 2003.

  • [Sauer94] Sauer, T. (1994). Time series prediction using delay coordinate embedding. In the book by Weigend and Gershenfeld (below), Addison-Wesley.


Additional Reading

  • [Takens81] Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.

  • [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint. 22nd Symposium on Reliable Distributed Computing (SRDS2003), Florence, Italy, Oct. 6-8, 2003.


Additional Reading

  • Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)

  • [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)
