Sensor data mining and forecasting

Christos Faloutsos

CMU

christos@cs.cmu.edu

Problem definition - motivation

Linear forecasting - AR and AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

- Given: one or more sequences
x1, x2, …, xt, …

(y1, y2, …, yt, …; …)

- Find
- forecasts; patterns
- clusters; outliers

- Financial, sales, economic series
- Medical
- ECGs +; blood pressure etc monitoring
- reactions to new drugs
- elderly care

- ‘Smart house’
- sensors monitor temperature, humidity, air quality

- video surveillance

- civil/automobile infrastructure
- bridge vibrations [Oppenheim+02]
- road conditions / traffic monitoring

[Figure: automobile traffic, # cars (0 to 2000) vs. time]

- Weather, environment/anti-pollution
- volcano monitoring
- air/water pollutant monitoring

[Figure: # sunspots per month vs. time]

- Computer systems
- ‘Active Disks’ (buffering, prefetching)
- web servers (ditto)
- network traffic monitoring
- ...

[Figure: # bytes vs. time]

- One or more sensors, collecting time-series data

Each sensor collects data (x1, x2, …, xt, …)

Sensors ‘report’ to a central site

Problem #1: Finding patterns in a single time sequence

Problem #2: Finding patterns in many time sequences

Goal: given a signal (e.g., #packets over time)

Find: patterns, periodicities, and/or compress

[Figure: lynx caught per year (similarly: packets per day; temperature per day), count vs. year]

Given xt, xt-1, …, forecast xt+1

[Figure: number of packets sent vs. time tick; the value to forecast is marked ‘??’]

- Given: A set of correlated time sequences
- Forecast ‘Sent(t)’

- Semi-infinite streams
- we need on-line, ‘any-time’ algorithms

- Cannot afford human intervention
- need automatic methods

- sensors have limited memory / processing / transmitting power
- need for (lossy) compression

Patterns, rules, compression and forecasting are closely related:

- To do forecasting, we need
- to find patterns/rules

- good rules help us compress
- to find outliers, we need to have forecasts
- (outlier = too far away from our forecast)

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

"Prediction is very difficult, especially about the future." - Niels Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html

- Example: given xt-1, xt-2, …, forecast xt

[Figure: number of packets sent vs. time tick; the value to forecast is marked ‘??’]

[Figure: scatter plot of body height vs. body weight]

- express what we don’t know (= ‘dependent variable’)
- as a linear function of what we know (= ‘indep. variable(s)’)

[Figure: value vs. time tick; the value to forecast is marked ‘??’]

- Solution: try to express xt as a linear function of the past: xt-1, xt-2, …, xt-w (up to a window of w)

Formally: $x_t \approx a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_w x_{t-w} + \text{noise}$
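
As a rough illustration of expressing xt as a linear function of its past, the sketch below fits the coefficients by ordinary least squares over a window of w values. It assumes NumPy is available; the names `fit_ar` and `forecast_next` and the toy series are invented for this example and are not taken from the slides.

```python
import numpy as np

def fit_ar(x, w):
    """Least-squares fit of x[t] ~ a[0]*x[t-1] + ... + a[w-1]*x[t-w]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Column i holds x[t-1-i] for every t in w..n-1 (most recent lag first).
    X = np.column_stack([x[w - 1 - i : n - 1 - i] for i in range(w)])
    y = x[w:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def forecast_next(x, a):
    """One-step-ahead forecast from the last len(a) values of x."""
    recent = np.asarray(x[-len(a):], dtype=float)[::-1]  # x[t-1], x[t-2], ...
    return float(recent @ a)

if __name__ == "__main__":
    # Hypothetical series that follows x[t] = 0.5*x[t-1] + 0.3*x[t-2] exactly.
    series = [1.0, 2.0]
    for _ in range(30):
        series.append(0.5 * series[-1] + 0.3 * series[-2])
    coeffs = fit_ar(series, w=2)
    print("coefficients:", coeffs)
    print("forecast:", forecast_next(series, coeffs))
```

This batch fit re-reads the whole history, so it is only a starting point for a semi-infinite stream.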

[Figure: ‘lag-plot’ of number of packets sent (t) vs. number of packets sent (t-1)]

- lag w=1
- Dependent variable = # of packets sent (S[t])
- Independent variable = # of packets sent (S[t-1])

- Q1: Can it work with window w>1?
- A1: YES! (we’ll fit a hyper-plane, then!)

[Figure: lag plot with axes xt, xt-1, xt-2]

- Q2: Can we estimate the coefficients a incrementally?
- A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)
- Q3: can we ‘down-weight’ older samples?
- A3: yes (RLS does that easily!)
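
To make the incremental update concrete, here is a minimal sketch of textbook Recursive Least Squares with an exponential forgetting factor. It is not code from [Chen+94] or [Yi+00]; the class name, the `forgetting` parameter, and the initial value `delta` are assumptions chosen for illustration, and NumPy is assumed to be installed.

```python
import numpy as np

class RecursiveLeastSquares:
    """Incremental least squares with exponential forgetting (textbook RLS)."""

    def __init__(self, n_features, forgetting=1.0, delta=1000.0):
        self.a = np.zeros(n_features)        # current coefficient estimate
        self.P = delta * np.eye(n_features)  # large P = weak initial prior (assumed value)
        self.forgetting = forgetting         # 1.0 keeps all history; < 1 down-weights old samples

    def update(self, features, target):
        """Absorb one sample; return the prediction error made before the update."""
        x = np.asarray(features, dtype=float)
        Px = self.P @ x
        gain = Px / (self.forgetting + x @ Px)
        error = target - self.a @ x
        self.a = self.a + gain * error
        # P is symmetric, so gain * (x^T P) equals outer(gain, P x).
        self.P = (self.P - np.outer(gain, Px)) / self.forgetting
        return error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_a = np.array([2.0, -1.0])  # hypothetical coefficients to recover
    model = RecursiveLeastSquares(2, forgetting=0.98)
    for _ in range(300):
        f = rng.standard_normal(2)
        model.update(f, f @ true_a)
    print(model.a)
```

Each update costs O(w^2) for w coefficients and needs no access to old samples, which is what makes the method attractive on streams.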

Goal: capture arbitrary periodicities, with NO human intervention, on a semi-infinite stream


- in a train of spikes (128 ticks apart)
- any AR with window w < 128 will fail
What to do, then?

- Do a Wavelet transform (~ short window DFT)
- look for patterns in every frequency

- Why NOT use the short window Fourier transform (SWFT)?
- A: how short should the window be?

[Figure: time-frequency plane (time vs. freq) tiled with a fixed window width w’]

Main idea: variable-length window!

[Figure: time (t) vs. frequency (f) plane]

- Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
- fast to compute (usually: O(n)!)
- very good for ‘spikes’
- mammalian eye and ear: Gabor wavelets
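
The O(n) cost can be seen in a small sketch of the Haar transform, the simplest wavelet. The slides do not say which wavelet family is used, so Haar is an assumption here, and `haar_dwt` is a name made up for this example. Each level halves the data, so the total work is n/2 + n/4 + … < n pair operations.

```python
def haar_dwt(x):
    """Full orthonormal Haar decomposition of a power-of-two-length sequence.

    Returns (details, approximation); details[0] is the finest level.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = 2 ** -0.5
    approx = [float(v) for v in x]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) * s for a, b in pairs])  # local differences
        approx = [(a + b) * s for a, b in pairs]         # local (scaled) sums
    return details, approx[0]

if __name__ == "__main__":
    # A single spike: only the few coefficients that overlap it are non-zero.
    spike = [0, 0, 0, 0, 0, 9, 0, 0]
    details, approx = haar_dwt(spike)
    for level, coeffs in enumerate(details, start=1):
        print("level", level, coeffs)
    print("approximation", approx)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared samples, which is why dropping small coefficients gives a controlled (lossy) compression error.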

[Figure: signal (value vs. time) and its time-frequency plane (t vs. f)]

- Q: baritone/silence/soprano - DWT?

[Figure: signal (value vs. time) and its time-frequency plane (t vs. f)]

- Q: baritone/soprano - DWT?

[Figure: xt decomposed into wavelet coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1, laid out by time and frequency]

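[Figure: wavelet coefficients Wl,t-2, Wl,t-1, Wl,t at level l, and Wl’,t’-2, Wl’,t’-1, Wl’,t’ at level l’]

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$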

- Update of wavelet coefficients (incremental)
- Update of linear models (incremental; RLS)
- Feature selection (single-pass)
- Not all correlations are significant
- Throw away the insignificant ones (“noise”)

[Figure: forecasts by AWSOM, AR, and Seasonal AR]

- Triangle pulse
- Mix (sine + square)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails

- Automobile traffic
- Daily periodicity
- Bursty “noise” at smaller scales

- AR fails to capture any trend
- Seasonal AR estimation fails

- Sunspot intensity
- Slightly time-varying “period”

- AR captures wrong trend
- Seasonal ARIMA
- wrong downward trend, despite help by human!

- Model update
Space: $O(\lg N + m k^2) \approx O(\lg N)$

Time: $O(k^2) \approx O(1)$

- Where
- N: number of points (so far)
- k: number of regression coefficients; fixed
- m: number of linear models; $O(\lg N)$

- AWSOM: Automatic, ‘hands-off’ traffic modeling (first of its kind!)

- Given: A set of correlated time sequences
- Forecast ‘Repeated(t)’

Q: what should we do?

Least Squares, with

- Dep. Variable: Repeated(t)
- Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1), ...
- (named: ‘MUSCLES’ [Yi+00])
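
The regression set-up listed above can be sketched as follows: every sequence contributes its last w values as independent variables, and the target sequence at time t is the dependent variable. This is an illustrative batch version using NumPy, not the implementation from [Yi+00]; the function names and the synthetic data are assumptions made for the example.

```python
import numpy as np

def lagged_matrix(series, target, w):
    """Build the regression problem for one target sequence.

    series: dict mapping a name to a 1-D sequence (all the same length).
    Returns (X, y, labels), where each column of X is one lagged sequence.
    """
    n = len(series[target])
    columns, labels = [], []
    for name in sorted(series):
        s = np.asarray(series[name], dtype=float)
        for lag in range(1, w + 1):
            columns.append(s[w - lag : n - lag])  # s[t-lag] for t = w..n-1
            labels.append(f"{name}(t-{lag})")
    y = np.asarray(series[target], dtype=float)[w:]
    return np.column_stack(columns), y, labels

def fit_coevolving(series, target, w):
    """Least-squares coefficients, keyed by the lagged variable they multiply."""
    X, y, labels = lagged_matrix(series, target, w)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(labels, coef))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sent = rng.standard_normal(200)
    lost = rng.standard_normal(200)
    # Hypothetical dependency: repeats follow the previous tick's traffic.
    repeated = np.zeros(200)
    repeated[1:] = 0.5 * sent[:-1] + 2.0 * lost[:-1]
    data = {"sent": sent, "lost": lost, "repeated": repeated}
    for label, value in fit_coevolving(data, "repeated", w=1).items():
        print(label, round(float(value), 3))
```

On a stream, the batch `lstsq` call would be replaced by an incremental solver such as RLS, fed one row of the same lagged matrix per time tick.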

- Datasets
- Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)
- AT&T WorldNet internet usage (several data streams; 980 time-ticks)

- Measures of success
- Accuracy : Root Mean Square Error (RMSE)
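
For reference, a small sketch of the accuracy measure. The ‘yesterday’ baseline is interpreted here as "predict that x[t] equals x[t-1]"; that reading, and both function names, are assumptions made for this example.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def yesterday_rmse(x):
    """RMSE of the naive baseline that forecasts x[t] by x[t-1]."""
    return rmse(list(x[1:]), list(x[:-1]))

if __name__ == "__main__":
    packets = [50, 52, 47, 60, 58]  # hypothetical counts per time tick
    print(yesterday_rmse(packets))
```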

- MUSCLES consistently outperforms AR & “yesterday”

- Non-linear forecasting
- Problem
- Idea
- How-to
- Experiments
- Conclusions

Given a time series {xt}, predict its future course, that is, xt+1, xt+2, ...

[Figure: value vs. time]

- ARIMA - but: linearity assumption
- ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92]

[Figure: lag plot (xt vs. xt-1) with lag = 1 and k = 4 nearest neighbors (4-NN) of the new point; interpolate these to get the final prediction]

- Q1: How to choose lag L?
- Q2: How to choose k (the # of NN)?
- Q3: How to interpolate?
- Q4: why should this work at all?

- Q1: How to choose lag L?
- Manually (16, in the award-winning system by [Sauer94])
- Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]

- FD = intrinsic dimensionality

Embedding dimensionality = 3

Intrinsic dimensionality = 1

[Figure: log(# pairs) vs. log(r)]

The Logistic Parabola $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$

[Figure: x(t) vs. time]

- Its lag plot for lag = 1

[Figure: lag plots of x(t) vs. x(t-1), and of x(t) vs. x(t-1) and x(t-2)]

[Figure: fractal dimension (FD) vs. lag (L)]

- The FD vs L plot does flatten out
- L(opt) = 1

[Figure: fractal dimension vs. lag (L); choose the lag at which the curve flattens to within epsilon]

- Use Fractal Dimensions to find the optimal lag length L(opt)
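
A rough sketch of this idea: estimate the fractal (correlation) dimension of each lag plot as the slope of log(# pairs within distance r) versus log(r), then stop increasing L once the estimate changes by less than epsilon. The quadratic pair counting, the stopping rule, and all names below are simplifying assumptions for illustration; they are not the algorithm of [Chakrabarti+02].

```python
import math

def delay_vectors(x, lag):
    """Points of the lag plot: (x[t-lag], ..., x[t-1], x[t])."""
    return [tuple(x[t - lag : t + 1]) for t in range(lag, len(x))]

def correlation_dimension(points, radii):
    """Least-squares slope of log(#pairs within r) against log(r)."""
    dists = [
        max(abs(a - b) for a, b in zip(p, q))
        for i, p in enumerate(points)
        for q in points[i + 1 :]
    ]
    logs = []
    for r in radii:
        pairs = sum(1 for d in dists if d <= r)
        if pairs:
            logs.append((math.log(r), math.log(pairs)))
    if len(logs) < 2:
        raise ValueError("too few radii with at least one pair")
    mean_u = sum(u for u, _ in logs) / len(logs)
    mean_v = sum(v for _, v in logs) / len(logs)
    num = sum((u - mean_u) * (v - mean_v) for u, v in logs)
    den = sum((u - mean_u) ** 2 for u, _ in logs)
    return num / den

def choose_lag(x, max_lag, radii, epsilon=0.1):
    """Smallest lag after which the dimension estimate moves by < epsilon."""
    previous = None
    for lag in range(1, max_lag + 1):
        fd = correlation_dimension(delay_vectors(x, lag), radii)
        if previous is not None and abs(fd - previous) < epsilon:
            return lag - 1
        previous = fd
    return max_lag

if __name__ == "__main__":
    # Noise-free logistic parabola with a hypothetical a = 3.8.
    x = [0.3]
    for _ in range(299):
        x.append(3.8 * x[-1] * (1 - x[-1]))
    print(choose_lag(x, max_lag=3, radii=[0.02, 0.04, 0.08, 0.16]))
```

Counting all pairs costs O(n^2) time, so this version is only suitable for short series.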

- Q2: How to choose k (the # of NN)?
- Manually (typically ~ 1-10)

How do we interpolate between the k nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop with distance - how?)
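
The first two interpolation options can be sketched in a few lines. The code finds the k delay vectors closest to the most recent one and combines their successors, either by a plain average or by a weighted average. The inverse-distance weighting is just one possible answer to the "how?" above and is an assumption, as are the function name and the toy series.

```python
def knn_forecast(x, lag, k, weighted=False):
    """Forecast the value following x[-1] from its k nearest delay vectors."""
    query = x[-lag:]
    candidates = []
    for t in range(lag, len(x)):
        past = x[t - lag : t]  # delay vector whose successor x[t] is known
        dist = sum((p - q) ** 2 for p, q in zip(past, query)) ** 0.5
        candidates.append((dist, x[t]))
    nearest = sorted(candidates, key=lambda c: c[0])[:k]
    if not weighted:
        return sum(v for _, v in nearest) / len(nearest)
    # Assumed weighting scheme: weights fall off as 1 / distance.
    weights = [1.0 / (d + 1e-12) for d, _ in nearest]
    total = sum(w * v for w, (_, v) in zip(weights, nearest))
    return total / sum(weights)

if __name__ == "__main__":
    # Noise-free logistic parabola with a hypothetical a = 3.8.
    x = [0.3]
    for _ in range(1999):
        x.append(3.8 * x[-1] * (1 - x[-1]))
    truth = 3.8 * x[-1] * (1 - x[-1])
    print("average: ", knn_forecast(x, lag=1, k=4))
    print("weighted:", knn_forecast(x, lag=1, k=4, weighted=True))
    print("truth:   ", truth)
```

The linear scan over all past delay vectors is kept for clarity; a spatial index would be the natural replacement on long series.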

A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)

[Figure: lag plot of xt vs. xt-1]

A4: YES!

- Based on the “Takens’ Theorem” [Takens81]
- which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)

Example: Lotka-Volterra equations

$dH/dt = rH - aHP$

$dP/dt = bHP - mP$

H is the count of prey (e.g., hare); P is the count of predators (e.g., lynx)

Suppose only P(t) is observed (t=1, 2, …).

- But the delay vector space is a faithful reconstruction of the internal system state
- So prediction in delay vector space is as good as prediction in state space

[Figure: delay vector space, P(t) vs. P(t-1)]

Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$. Models population of flies [R. May/1976]

[Figure: x(t) vs. time, and its lag-plot]

ARIMA: fails

[Figure: value vs. timesteps, marking where our prediction starts]

[Figure: comparison of prediction to correct values; value vs. timesteps]

LORENZ: Models convection currents in the air

dx / dt = a (y - x)

dy / dt = x (b - z) - y

dz / dt = xy - c z
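
For readers who want to generate such a signal, the sketch below integrates the three equations with a classical fourth-order Runge-Kutta step. The slides do not give a, b, c, the step size, or the initial state; the defaults used here (a = 10, b = 28, c = 8/3, dt = 0.01, start at (1, 1, 1)) are the commonly quoted chaotic setting and are assumptions, as are the function names.

```python
def lorenz_derivative(state, a=10.0, b=28.0, c=8.0 / 3.0):
    """Right-hand side of the Lorenz equations for state = (x, y, z)."""
    x, y, z = state
    return (a * (y - x), x * (b - z) - y, x * y - c * z)

def rk4_step(state, dt, **params):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    def shifted(k, scale):
        return tuple(s + scale * ki for s, ki in zip(state, k))

    k1 = lorenz_derivative(state, **params)
    k2 = lorenz_derivative(shifted(k1, dt / 2), **params)
    k3 = lorenz_derivative(shifted(k2, dt / 2), **params)
    k4 = lorenz_derivative(shifted(k3, dt), **params)
    return tuple(
        s + dt / 6 * (p + 2 * q + 2 * r + u)
        for s, p, q, r, u in zip(state, k1, k2, k3, k4)
    )

def simulate(state=(1.0, 1.0, 1.0), dt=0.01, steps=1000, **params):
    """Return the list of states visited, including the initial one."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, dt, **params)
        trajectory.append(state)
    return trajectory

if __name__ == "__main__":
    # Observe only the x coordinate, as a forecasting test signal.
    signal = [s[0] for s in simulate(steps=2000)]
    print(signal[:5])
```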

[Figure: comparison of prediction to correct values; value vs. timesteps]

- LASER: fluctuations in a Laser over time (used in Santa Fe competition)

[Figure: value vs. time]

[Figure: comparison of prediction to correct values; value vs. timesteps]

- Lag plots for non-linear forecasting (Takens’ theorem)
- suitable for ‘chaotic’ signals

- Graph/Network mining
- spatio-temporal mining - outliers

- Internet; web; gnutella P2P networks
- Q: Any pattern?
- Q: how to generate ‘realistic’ topologies?
- Q: how to define/verify realism?

- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!

[Figure: count vs. degree, with the average (3.3) marked]

- A: Power laws!

[Figure: log(count) vs. log {(out) degree}]

[Figure panels: Effective Diameter; Count vs Indegree; Count vs Outdegree; Hop-plot; Stress; “Network value”; Eigenvalue vs Rank]

- On a real graph, will a (computer / biological) virus die out, given
- beta: the probability that an infected node will infect its neighbor, and
- delta: the probability that an infected node will recover?
- NO / MAYBE / YES

- A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
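
A small sketch of how such a check might look. The slide only states that the answer depends on the largest eigenvalue; the specific condition used below, that the infection dies out when beta/delta is below 1/lambda_1, is the threshold as reported in [Wang+03] and should be read as an assumption here. The function names and the example graph are invented for illustration, and NumPy is assumed.

```python
import numpy as np

def largest_eigenvalue(adjacency):
    """Largest eigenvalue of a symmetric (undirected) adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    if not np.allclose(A, A.T):
        raise ValueError("this sketch expects an undirected graph")
    return float(np.linalg.eigvalsh(A)[-1])  # eigvalsh sorts ascending

def dies_out(adjacency, beta, delta):
    """Assumed rule: the virus dies out when beta/delta < 1/lambda_1."""
    return beta / delta < 1.0 / largest_eigenvalue(adjacency)

if __name__ == "__main__":
    # Star graph: node 0 is connected to four leaves.
    star = np.zeros((5, 5))
    star[0, 1:] = 1
    star[1:, 0] = 1
    print(largest_eigenvalue(star))
    print(dies_out(star, beta=0.1, delta=0.5))
    print(dies_out(star, beta=0.3, delta=0.5))
```

Note that the check uses only the topology and the ratio beta/delta, not the number of initially infected nodes.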

- Graph mining
- spatio-temporal mining - outliers

LOCI: finds outliers quickly, with no human intervention

- AWSOM for automatic, linear forecasting
- MUSCLES for co-evolving sequences
- F4 for non-linear forecasting
- Graph/Network topology: power laws and generators; epidemic threshold
- LOCI for outlier detection

- Overarching theme: automatic discovery of patterns (outliers/rules) in
- time sequences (sensors/streams)
- graphs (computer/social networks)
- multimedia (video, motion capture data etc)

www.cs.cmu.edu/~christos

christos@cs.cmu.edu

- William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
- C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

- George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)
- Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

- MUSCLES: Prof. Byoung-Kee Yi:
http://www.postech.ac.kr/~bkyi/

or christos@cs.cmu.edu

- AWSOM & LOCI: spapadim@cs.cmu.edu
- F4, RMAT: deepay@cs.cmu.edu

- [Chakrabarti+02] Deepay Chakrabarti and Christos Faloutsos F4: Large-Scale Automated Forecasting using Fractals CIKM 2002, Washington DC, Nov. 2002.
- [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994:161-172
- [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001

- Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003
- Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos LOCI: Fast Outlier Detection Using the Local Correlation Integral ICDE 2003, Bangalore, India, March 5 - March 8, 2003.
- Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.

- Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.
- Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint 22nd Symposium on Reliable Distributed Computing (SRDS2003) Florence, Italy, Oct. 6-8, 2003

- Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)
- [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)
