

Sensor data mining and forecasting

Christos Faloutsos

CMU

christos@cs.cmu.edu


Outline

Problem definition - motivation

Linear forecasting - AR and AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos


Problem definition

  • Given: one or more sequences

    x_1, x_2, …, x_t, …

    (y_1, y_2, …, y_t, …)

  • Find

    • forecasts; patterns

    • clusters; outliers



Motivation - Applications

  • Financial, sales, economic series

  • Medical

    • ECGs +; blood pressure etc monitoring

    • reactions to new drugs

    • elderly care



Motivation - Applications (cont’d)

  • ‘Smart house’

    • sensors monitor temperature, humidity, air quality

  • video surveillance



Motivation - Applications (cont’d)

  • civil/automobile infrastructure

    • bridge vibrations [Oppenheim+02]

    • road conditions / traffic monitoring



Automobile traffic

[Figure: Stream Data: automobile traffic. y-axis: # cars (0 to 2000); x-axis: time]


Motivation - Applications (cont’d)

  • Weather, environment/anti-pollution

    • volcano monitoring

    • air/water pollutant monitoring



Stream Data: Sunspots

[Figure: # sunspots per month vs. time]


Motivation - Applications (cont’d)

  • Computer systems

    • ‘Active Disks’ (buffering, prefetching)

    • web servers (ditto)

    • network traffic monitoring

    • ...



Stream Data: Disk accesses

[Figure: # bytes vs. time]


Settings & Applications

  • One or more sensors, collecting time-series data

  • Each sensor collects data (x_1, x_2, …, x_t, …)

  • Sensors ‘report’ to a central site

  • Problem #1: finding patterns in a single time sequence

  • Problem #2: finding patterns in many time sequences


Problem #1:

Goal: given a signal (e.g., #packets over time)

Find: patterns, periodicities, and/or compress

[Figure: lynx caught per year (other examples: packets per day; temperature per day). y-axis: count; x-axis: year]


Problem #1’: Forecast

Given x_t, x_{t-1}, …, forecast x_{t+1}

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the next value is marked ‘??’]


Problem #2:

  • Given: A set of correlated time sequences

  • Forecast ‘Sent(t)’


Differences from DSP/Stat

  • Semi-infinite streams

    • we need on-line, ‘any-time’ algorithms

  • Cannot afford human intervention

    • need automatic methods

  • sensors have limited memory / processing / transmitting power

    • need for (lossy) compression



Important observations

Patterns, rules, compression and forecasting are closely related:

  • To do forecasting, we need

    • to find patterns/rules

  • good rules help us compress

  • to find outliers, we need to have forecasts

    • (outlier = too far away from our forecast)
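The last point can be sketched in a few lines of Python. This is only an illustration, not a method from the talk: the name `flag_outliers`, the warm-up length, the 3-sigma threshold and the ‘yesterday’ forecaster (predict the previous value) are all assumptions.

```python
import statistics

def flag_outliers(x, forecast, warmup=5, k=3.0):
    """Return the ticks whose forecast error exceeds k standard deviations
    of the errors seen so far (hypothetical helper, for illustration)."""
    errors, outliers = [], []
    for t in range(1, len(x)):
        err = x[t] - forecast(x[:t])
        if len(errors) >= warmup:
            sd = statistics.pstdev(errors)
            if sd > 0 and abs(err) > k * sd:
                outliers.append(t)
                continue  # keep the outlier out of the error statistics
        errors.append(err)
    return outliers

series = [10, 11, 10, 12, 11, 10, 11, 12, 50, 11, 10, 11]
# 'yesterday' forecaster: predict that the next value equals the last one
flagged = flag_outliers(series, forecast=lambda past: past[-1])
print(flagged)
```

On this toy series the spike at tick 8 is flagged, and so is tick 9, because the ‘yesterday’ forecast for tick 9 is the spike itself. A better forecaster would reduce such echoes.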



Pictorial outline of the talk



Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc



Mini intro to A.R.



Forecasting

"Prediction is very difficult, especially about the future." - Niels Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html


Problem #1’: Forecast

  • Example: given x_{t-1}, x_{t-2}, …, forecast x_t

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the next value is marked ‘??’]


Linear Regression: idea

[Figure: scatter plot of body height (40 to 85) vs. body weight (15 to 45)]

  • express what we don’t know (= ‘dependent variable’)

  • as a linear function of what we know (= ‘indep. variable(s)’)


Linear Auto Regression:

Problem #1’: Forecast

  • Solution: try to express x_t as a linear function of the past: x_{t-1}, x_{t-2}, … (up to a window of w)

    Formally: x_t ≈ a_1 x_{t-1} + … + a_w x_{t-w} + noise

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the next value is marked ‘??’]
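A minimal sketch of fitting such a model by ordinary least squares, in plain Python. The function names (`solve`, `fit_ar`, `forecast`) and the toy series are assumptions made for illustration; the talk does not prescribe an implementation.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ar(x, w):
    """Least-squares AR(w): x[t] ~ a[0]*x[t-1] + ... + a[w-1]*x[t-w]."""
    rows = [[x[t - i] for i in range(1, w + 1)] for t in range(w, len(x))]
    ys = x[w:]
    # Normal equations: (X^T X) a = X^T y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(w)] for i in range(w)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(w)]
    return solve(XtX, Xty)

def forecast(x, a):
    """One-step-ahead forecast from the last len(a) values of x."""
    return sum(ai * x[-i] for i, ai in enumerate(a, start=1))

# Toy series that obeys x[t] = 1.5*x[t-1] - 0.9*x[t-2] exactly (no noise)
series = [1.0, 2.0]
for _ in range(30):
    series.append(1.5 * series[-1] - 0.9 * series[-2])

a = fit_ar(series, w=2)
next_value = forecast(series, a)
```

Because the toy series has no noise, the fit recovers the generating coefficients up to rounding. On real data the residual plays the role of the noise term.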


Linear Auto Regression:

[Figure: ‘lag-plot’ of number of packets sent at time t (40 to 85) vs. number of packets sent at time t-1 (15 to 45)]

  • lag w=1

  • Dependent variable = # of packets sent (S[t])

  • Independent variable = # of packets sent (S[t-1])


More details:

  • Q1: Can it work with window w>1?

  • A1: YES! (we’ll fit a hyper-plane, then!)

[Figure: 3-d plot with axes x_t, x_{t-1}, x_{t-2}]


Even more details

  • Q2: Can we estimate the coefficients a incrementally?

  • A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)

  • Q3: can we ‘down-weight’ older samples?

  • A3: yes (RLS does that easily!)
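The two answers above can be sketched with the textbook form of Recursive Least Squares with exponential forgetting. This is not necessarily the exact variant in [Chen+94] or [Yi+00]; the class name `RLS`, the forgetting factor `lam` and the initial scale `delta` are assumptions for illustration.

```python
class RLS:
    """Recursive Least Squares with forgetting factor lam (lam=1: no forgetting)."""

    def __init__(self, w, lam=1.0, delta=1e6):
        self.lam = lam
        self.a = [0.0] * w  # current coefficient estimates
        # P approximates the inverse of the (weighted) X^T X matrix
        self.P = [[delta if i == j else 0.0 for j in range(w)] for i in range(w)]

    def update(self, phi, y):
        """Fold in one sample (phi = past values, y = current value)."""
        w = len(phi)
        Pphi = [sum(self.P[i][j] * phi[j] for j in range(w)) for i in range(w)]
        denom = self.lam + sum(phi[i] * Pphi[i] for i in range(w))
        gain = [v / denom for v in Pphi]
        err = y - sum(ai * pi for ai, pi in zip(self.a, phi))
        self.a = [ai + g * err for ai, g in zip(self.a, gain)]
        # Dividing by lam < 1 down-weights older samples
        self.P = [[(self.P[i][j] - gain[i] * Pphi[j]) / self.lam for j in range(w)]
                  for i in range(w)]
        return err

# Same toy process as an AR(2) series: x[t] = 1.5*x[t-1] - 0.9*x[t-2]
series = [1.0, 2.0]
for _ in range(40):
    series.append(1.5 * series[-1] - 0.9 * series[-2])

rls = RLS(w=2, lam=0.98)
for t in range(2, len(series)):
    rls.update([series[t - 1], series[t - 2]], series[t])
```

Each update touches only the w-by-w matrix P, so the cost per new sample is independent of the stream length, which is what makes the method usable on semi-infinite streams.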



Mini intro to A.R.

  • goal: capture arbitrary periodicities, with NO human intervention, on a semi-infinite stream

  • How to choose ‘w’?


Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc



Problem:

  • in a train of spikes (128 ticks apart)

  • any AR with window w < 128 will fail

    What to do, then?



Answer (intuition)

  • Do a Wavelet transform (~ short window DFT)

  • look for patterns in every frequency



Intuition

  • Why NOT use the short window Fourier transform (SWFT)?

  • A: how short should the window be?

[Figure: time-frequency plane with a window of width w’; axes: freq vs. time]


main idea: variable-length window!

[Figure: wavelets in the time-frequency plane; axes: f vs. t]


Advantages of Wavelets

  • Better compression (better RMSE with same number of coefficients - used in JPEG-2000)

  • fast to compute (usually: O(n)!)

  • very good for ‘spikes’

  • mammalian eye and ear: Gabor wavelets
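To make the O(n) cost and the behavior on spikes concrete, here is a small sketch of the Haar wavelet transform. Haar is assumed only because it is the simplest wavelet; the talk does not say which wavelet family is used, and the function name `haar_dwt` is hypothetical.

```python
import math

def haar_dwt(x):
    """Full Haar decomposition of a sequence whose length is a power of two.

    Returns (details, smooth): details[0] holds the finest-scale coefficients,
    details[-1] the coarsest, and smooth is the final scaling coefficient.
    """
    s = list(x)
    details = []
    while len(s) > 1:
        # Each level halves the length, so total work is n + n/2 + ... = O(n)
        d = [(s[i] - s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
        s = [(s[i] + s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
        details.append(d)
    return details, s[0]

# A single spike in an otherwise silent signal
signal = [0, 0, 0, 0, 0, 4, 0, 0]
details, smooth = haar_dwt(signal)

nonzero = sum(1 for level in details for c in level if abs(c) > 1e-12)
energy = sum(c * c for level in details for c in level) + smooth * smooth
```

For this spike only one detail coefficient per level is non-zero, and the orthonormal scaling by sqrt(2) preserves the signal energy, which is why a few coefficients suffice to describe spiky signals.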



Wavelets - intuition:

  • Q: baritone/silence/soprano - DWT?

[Figure: signal (value vs. time) and its time-frequency plot (f vs. t)]

Wavelets - intuition:

  • Q: baritone/soprano - DWT?

[Figure: signal (value vs. time) and its time-frequency plot (f vs. t)]


AWSOM

[Figure: the signal x_t decomposed into wavelet coefficients on the time-frequency plane, from the finest level to the coarsest: W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}; W_{2,1}, W_{2,2}; W_{3,1}; V_{4,1}. Axes: frequency vs. time]


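AWSOM - idea

W_{l,t} ≈ β_{l,1} W_{l,t-1} + β_{l,2} W_{l,t-2} + …

W_{l’,t’} ≈ β_{l’,1} W_{l’,t’-1} + β_{l’,2} W_{l’,t’-2} + …

[Figure: coefficients W_{l,t-2}, W_{l,t-1}, W_{l,t} at level l, and W_{l’,t’-2}, W_{l’,t’-1}, W_{l’,t’} at level l’]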


More details…

  • Update of wavelet coefficients (incremental)

  • Update of linear models (incremental; RLS)

  • Feature selection (single-pass)

    • Not all correlations are significant

    • Throw away the insignificant ones (“noise”)


Results - Synthetic data

[Figure: forecasts by AWSOM, AR and Seasonal AR on each dataset]

  • Triangle pulse

  • Mix (sine + square)

  • AR captures wrong trend (or none)

  • Seasonal AR estimation fails


Results - Real data

  • Automobile traffic

    • Daily periodicity

    • Bursty “noise” at smaller scales

  • AR fails to capture any trend

  • Seasonal AR estimation fails



Results - real data

  • Sunspot intensity

    • Slightly time-varying “period”

  • AR captures wrong trend

  • Seasonal ARIMA

    • wrong downward trend, despite help by human!



Complexity

  • Model update

    Space: O(lg N + m k^2) ≈ O(lg N)

    Time: O(k^2) ≈ O(1)

  • Where

    • N: number of points (so far)

    • k: number of regression coefficients; fixed

    • m: number of linear models; O(lg N)


Conclusions

  • AWSOM: Automatic, ‘hands-off’ traffic modeling (first of its kind!)



Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc



Co-Evolving Time Sequences

  • Given: A set of correlated time sequences

  • Forecast ‘Repeated(t)’


Solution:

Q: what should we do?

A: Least Squares, with

  • Dep. Variable: Repeated(t)

  • Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1), ...

  • (named: ‘MUSCLES’ [Yi+00])
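A rough sketch of this setup, using batch least squares in place of the incremental RLS described in [Yi+00]. The function names and the synthetic ‘sent’, ‘lost’ and ‘repeated’ data are assumptions; only the choice of dependent and independent variables follows the slide.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def muscles_fit(target, others, w):
    """Regress target[t] on the last w values of every sequence (target included).

    Coefficient order: lags 1..w of each sequence in `others`, then of `target`.
    """
    seqs = others + [target]
    rows = [[s[t - i] for s in seqs for i in range(1, w + 1)]
            for t in range(w, len(target))]
    ys = target[w:]
    n = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(XtX, Xty)

# Synthetic data: repeated(t) = 0.5*sent(t-1) + 2*lost(t-1), by construction
rng = random.Random(0)
sent = [rng.random() for _ in range(60)]
lost = [rng.random() for _ in range(60)]
repeated = [0.0] + [0.5 * sent[t - 1] + 2.0 * lost[t - 1] for t in range(1, 60)]

coef = muscles_fit(repeated, [sent, lost], w=1)
```

Here the fit should assign weight to Sent(t-1) and Lost(t-1) and essentially none to Repeated(t-1), since that is how the toy data was built.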



Examples - Experiments

  • Datasets

    • Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)

    • AT&T WorldNet internet usage (several data streams; 980 time-ticks)

  • Measures of success

    • Accuracy : Root Mean Square Error (RMSE)



Accuracy - “Modem”

MUSCLES outperforms AR & “yesterday”



Accuracy - “Internet”

  • MUSCLES consistently outperforms AR & “yesterday”



Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc



Detailed Outline

  • Non-linear forecasting

    • Problem

    • Idea

    • How-to

    • Experiments

    • Conclusions



Recall: Problem #1

Given a time series {x_t}, predict its future course, that is, x_{t+1}, x_{t+2}, ...

[Figure: value vs. time]


How to forecast?

  • ARIMA - but: linearity assumption

  • ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92]



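General Intuition (Lag Plot)

  • Lag = 1, k = 4 NN

  • Find the 4 nearest neighbors (4-NN) of the new point in the lag plot

  • Interpolate these to get the final prediction

[Figure: lag plot of x_t vs. x_{t-1}, showing the new point and its 4-NN]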
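The lag-plot intuition can be sketched as a nearest-neighbor forecaster. This is an illustration under simple assumptions, not the F4 implementation: neighbors are found by brute force, and the interpolation is a plain average. The function name `knn_forecast` and the logistic-map test series with a = 3.8 are hypothetical choices.

```python
def knn_forecast(x, lag=1, k=4):
    """Predict the next value of x: find the k past delay vectors closest to
    the most recent one and average the values that followed them."""
    query = x[-lag:]
    candidates = []
    for t in range(lag, len(x)):
        vec = x[t - lag:t]  # delay vector (x[t-lag], ..., x[t-1])
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        candidates.append((dist, x[t]))  # x[t] is what followed that vector
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(v for _, v in nearest) / len(nearest)

# Noise-free logistic parabola as a test signal
x = [0.3]
for _ in range(1000):
    x.append(3.8 * x[-1] * (1 - x[-1]))

history, truth = x[:-1], x[-1]
pred = knn_forecast(history, lag=1, k=4)
```

No linearity assumption is involved: the forecaster only needs points that were in a similar state in the past, which is why it can follow a parabola-shaped lag plot that a linear model cannot.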


Questions:

  • Q1: How to choose lag L?

  • Q2: How to choose k (the # of NN)?

  • Q3: How to interpolate?

  • Q4: why should this work at all?



Q1: Choosing lag L

  • Manually (16, in award winning system by [Sauer94])

  • Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]



Fractal Dimensions

  • FD = intrinsic dimensionality

[Figure: example with embedding dimensionality = 3 and intrinsic dimensionality = 1]

[Figure: log(# pairs) vs. log(r)]
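The log(# pairs) vs. log(r) plot suggests one common way to estimate the fractal dimension: count the pairs of points within distance r for several radii and take the slope on log-log axes. The sketch below assumes this pair-count (correlation integral) estimate with a brute-force O(n^2) count; the function names and the test data are hypothetical.

```python
import math

def pair_count(points, r):
    """Number of point pairs at Euclidean distance <= r (brute force)."""
    n = len(points)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if math.dist(points[i], points[j]) <= r)

def fractal_dimension(points, radii):
    """Least-squares slope of log(pair count) against log(r)."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(pair_count(points, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Points on a straight line embedded in 3-d space:
# embedding dimensionality 3, intrinsic dimensionality 1
line = [(i / 200, 2 * i / 200, -i / 200) for i in range(200)]
fd = fractal_dimension(line, radii=[0.05, 0.1, 0.2, 0.4])
```

The estimate comes out close to 1 rather than 3, matching the intrinsic dimensionality of the line. The radii must stay well below the extent of the data set, otherwise edge effects pull the slope down.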


Intuition

The Logistic Parabola x_t = a x_{t-1} (1 - x_{t-1}) + noise

[Figure: x(t) vs. time]

  • Its lag plot for lag = 1

[Figure: lag plot of x(t) vs. x(t-1)]


Intuition

[Figure: lag plots of x(t) vs. x(t-1), and of x(t) vs. x(t-1) and x(t-2)]


Intuition

  • The FD vs L plot does flatten out

  • L(opt) = 1

[Figure: fractal dimension vs. lag]


Proposed Method

  • Use Fractal Dimensions to find the optimal lag length L(opt)

[Figure: fractal dimension vs. lag (L), with a tolerance ‘epsilon’ and the chosen lag marked ‘Choose this’]


Q2: Choosing number of neighbors k

  • Manually (typically ~ 1-10)



Q3: How to interpolate?

How do we interpolate between the k nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop with distance - how?)



Q3: How to interpolate?

A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)

[Figure: lag plot of x_t vs. x_{t-1}]


Q4: Any theory behind it?

A4: YES!



Theoretical foundation

  • Based on the “Takens’ Theorem” [Takens81]

  • which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)



Theoretical foundation

Example: Lotka-Volterra equations

dH/dt = r H – a H*P

dP/dt = b H*P – m P

H is count of prey (e.g., hare)

P is count of predators (e.g., lynx)

Suppose only P(t) is observed (t=1, 2, …).

[Figure: P vs. H]


Theoretical foundation

  • But the delay vector space is a faithful reconstruction of the internal system state

  • So prediction in delay vector space is as good as prediction in state space

[Figure: state space (P vs. H) and delay vector space (P(t) vs. P(t-1))]


Detailed Outline

  • Non-linear forecasting

    • Problem

    • Idea

    • How-to

    • Experiments

    • Conclusions



Datasets

Logistic Parabola: x_t = a x_{t-1} (1 - x_{t-1}) + noise

Models population of flies [R. May/1976]

[Figure: x(t) vs. time, and its lag-plot]

ARIMA: fails


Logistic Parabola

[Figure: value vs. timesteps, marked ‘Our Prediction from here’]

Logistic Parabola

[Figure: comparison of prediction to correct values; value vs. timesteps]


Datasets

LORENZ: Models convection currents in the air

dx / dt = a (y - x)

dy / dt = x (b - z) - y

dz / dt = xy - c z
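For readers who want to generate such a series, the three equations can be integrated numerically. The sketch below uses a fourth-order Runge-Kutta step; the parameter values a = 10, b = 28, c = 8/3 (the classic chaotic setting), the step size and the starting point are assumptions, since the slide does not give them.

```python
def lorenz_deriv(s, a=10.0, b=28.0, c=8.0 / 3.0):
    """Right-hand side of the Lorenz equations as written on the slide."""
    x, y, z = s
    return (a * (y - x), x * (b - z) - y, x * y - c * z)

def rk4_step(s, dt):
    """Advance the state s by one Runge-Kutta (RK4) step of size dt."""
    def shift(base, k, f):
        return tuple(bi + f * ki for bi, ki in zip(base, k))
    k1 = lorenz_deriv(s)
    k2 = lorenz_deriv(shift(s, k1, dt / 2))
    k3 = lorenz_deriv(shift(s, k2, dt / 2))
    k4 = lorenz_deriv(shift(s, k3, dt))
    return tuple(si + dt / 6 * (p + 2 * q + 2 * r + u)
                 for si, p, q, r, u in zip(s, k1, k2, k3, k4))

def simulate(n, dt=0.01, start=(1.0, 1.0, 1.0)):
    """Return n+1 states, starting from `start`."""
    traj = [start]
    for _ in range(n):
        traj.append(rk4_step(traj[-1], dt))
    return traj

# Keep only x(t): the forecasting task sees a single observed variable
xs = [p[0] for p in simulate(2000)]
```

Observing only x(t) mirrors the setting of the talk, where the other state variables (y, z) are hidden and must be recovered implicitly through delay vectors.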



LORENZ

[Figure: comparison of prediction to correct values; value vs. timesteps]


Datasets

  • LASER: fluctuations in a Laser over time (used in Santa Fe competition)

[Figure: value vs. time]


Laser

[Figure: comparison of prediction to correct values; value vs. timesteps]


Conclusions

  • Lag plots for non-linear forecasting (Takens’ theorem)

  • suitable for ‘chaotic’ signals



Additional projects at CMU

  • Graph/Network mining

  • spatio-temporal mining - outliers



Graph/network mining

  • Internet; web; gnutella P2P networks

  • Q: Any pattern?

  • Q: how to generate ‘realistic’ topologies?

  • Q: how to define/verify realism?



Patterns?

  • avg degree is, say 3.3

  • pick a node at random - what is the degree you expect it to have?

  • A: 1!!

[Figure: count vs. degree, with the average (3.3) marked]


Patterns?

  • A: Power laws!

[Figure: log(count) vs. log {(out) degree}]


Other ‘laws’?

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]

RMAT, to generate realistic graphs

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]


Epidemic threshold?

  • on a real graph, will a (computer / biological) virus die out? (given

    • beta: probability that an infected node will infect its neighbor and

    • delta: probability that an infected node will recover)

[Figure: possible answers: NO, MAYBE, YES]

  • A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
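As a hedged illustration: [Wang+03] reports that the infection dies out when beta/delta is below 1/lambda_1, where lambda_1 is the largest eigenvalue of the adjacency matrix. The sketch below estimates lambda_1 by power iteration on a small star graph; the function names, the graph and the beta/delta values are assumptions.

```python
import math

def largest_eigenvalue(adj, iters=500):
    """Largest eigenvalue of a symmetric adjacency matrix (connected graph)."""
    n = len(adj)
    v = [1.0] * n
    for _ in range(iters):
        # Multiply by (A + I): the shift avoids oscillation on bipartite graphs
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    Av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(Av, v))  # Rayleigh quotient v^T A v

def dies_out(beta, delta, adj):
    """True if beta/delta is below the threshold 1/lambda_1."""
    return beta / delta < 1.0 / largest_eigenvalue(adj)

# Star graph: node 0 is connected to four leaves; lambda_1 = sqrt(4) = 2
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]

lam1 = largest_eigenvalue(star)
weak_virus = dies_out(beta=0.1, delta=0.4, adj=star)    # 0.25 < 0.5
strong_virus = dies_out(beta=0.3, delta=0.4, adj=star)  # 0.75 > 0.5
```

The threshold depends on the graph only through lambda_1, so two networks with the same average degree can behave very differently if one has a few high-degree hubs.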



Additional projects

  • Graph mining

  • spatio-temporal mining - outliers



Outliers - ‘LOCI’

  • finds outliers quickly, with no human intervention


Conclusions

  • AWSOM for automatic, linear forecasting

  • MUSCLES for co-evolving sequences

  • F4 for non-linear forecasting

  • Graph/Network topology: power laws and generators; epidemic threshold

  • LOCI for outlier detection



Conclusions

  • Overarching theme: automatic discovery of patterns (outliers/rules) in

    • time sequences (sensors/streams)

    • graphs (computer/social networks)

    • multimedia (video, motion capture data etc)

      www.cs.cmu.edu/~christos

      christos@cs.cmu.edu



Books

  • William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)

  • C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)



Books

  • George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)

  • Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.



Resources: software and urls

  • MUSCLES: Prof. Byoung-Kee Yi:

    http://www.postech.ac.kr/~bkyi/

    or christos@cs.cmu.edu

  • AWSOM & LOCI: spapadim@cs.cmu.edu

  • F4, RMAT: deepay@cs.cmu.edu



Additional Reading

  • [Chakrabarti+02] Deepay Chakrabarti and Christos Faloutsos F4: Large-Scale Automated Forecasting using Fractals CIKM 2002, Washington DC, Nov. 2002.

  • [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994:161-172

  • [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001



Additional Reading

  • Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003

  • Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos LOCI: Fast Outlier Detection Using the Local Correlation Integral ICDE 2003, Bangalore, India, March 5 - March 8, 2003.

  • Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.



Additional Reading

  • Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.

  • Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint 22nd Symposium on Reliable Distributed Computing (SRDS2003) Florence, Italy, Oct. 6-8, 2003



Additional Reading

  • Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)

  • [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)
