Sensor data mining and forecasting


Christos Faloutsos, CMU

Outline

Problem definition - motivation

Linear forecasting - AR and AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

Problem definition
  • Given: one or more sequences

x1, x2, …, xt, …

(y1, y2, …, yt, …; …)

  • Find
    • forecasts; patterns
    • clusters; outliers

Motivation - Applications
  • Financial, sales, economic series
  • Medical
    • ECGs +; blood pressure etc monitoring
    • reactions to new drugs
    • elderly care

Motivation - Applications (cont’d)
  • ‘Smart house’
    • sensors monitor temperature, humidity, air quality
  • video surveillance

Motivation - Applications (cont’d)
  • civil/automobile infrastructure
    • bridge vibrations [Oppenheim+02]
    • road conditions / traffic monitoring

Stream Data: automobile traffic

[Figure: # cars (axis from 0 to 2000) vs. time]

Motivation - Applications (cont’d)
  • Weather, environment/anti-pollution
    • volcano monitoring
    • air/water pollutant monitoring

Stream Data: Sunspots

[Figure: #sunspots per month vs. time]

Motivation - Applications (cont’d)
  • Computer systems
    • ‘Active Disks’ (buffering, prefetching)
    • web servers (ditto)
    • network traffic monitoring
    • ...

Stream Data: Disk accesses

[Figure: #bytes vs. time]

Settings & Applications
  • One or more sensors, collecting time-series data
  • Each sensor collects data (x1, x2, …, xt, …)
  • Sensors ‘report’ to a central site
  • Problem #1: Finding patterns in a single time sequence
  • Problem #2: Finding patterns in many time sequences

Problem #1:

Goal: given a signal (e.g., #packets over time)

Find: patterns, periodicities, and/or compress

[Figure: count vs. year; lynx caught per year (similarly: packets per day; temperature per day)]

Problem #1’: Forecast

Given xt, xt-1, …, forecast xt+1

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the value to forecast is marked ‘??’]

Problem #2:
  • Given: a set of correlated time sequences
  • Forecast ‘Sent(t)’

Differences from DSP/Stat
  • Semi-infinite streams
    • we need on-line, ‘any-time’ algorithms
  • Cannot afford human intervention
    • need automatic methods
  • sensors have limited memory / processing / transmitting power
    • need for (lossy) compression

Important observations

Patterns, rules, compression and forecasting are closely related:

  • To do forecasting, we need
    • to find patterns/rules
  • good rules help us compress
  • to find outliers, we need to have forecasts
    • (outlier = too far away from our forecast)

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Mini intro to A.R.

Forecasting

"Prediction is very difficult, especially about the future." - Niels Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html

Problem #1’: Forecast
  • Example: given xt-1, xt-2, …, forecast xt

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the value to forecast is marked ‘??’]

Linear Regression: idea

[Figure: body height (40 to 85) vs. body weight (15 to 45)]

  • express what we don’t know (= ‘dependent variable’)
  • as a linear function of what we know (= ‘indep. variable(s)’)
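
The idea above can be sketched with a closed-form least-squares fit of one dependent variable on one independent variable. This is a minimal illustration, not code from the talk: the function name `fit_line` and the (weight, height) pairs are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept (one independent variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (body weight, body height) pairs, for illustration only.
weights = [15.0, 25.0, 35.0, 45.0]
heights = [48.0, 60.0, 71.0, 83.0]
slope, intercept = fit_line(weights, heights)

# Express what we don't know (height) as a linear function of what we know (weight).
predicted_height = slope * 30.0 + intercept
```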

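Problem #1’: Forecast

[Figure: values (0 to 90) vs. time tick (1 to 11); the value to forecast is marked ‘??’]

  • Solution: try to express xt as a linear function of the past: xt-1, xt-2, …, (up to a window of w)

Formally: $x_t \approx a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_w x_{t-w} + \text{noise}$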
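
A minimal sketch of fitting such a model, assuming the usual AR(w) form x_t ≈ a_1 x_{t-1} + … + a_w x_{t-w} with no intercept. The helper names (`solve_linear_system`, `fit_ar`, `forecast_next`) and the toy series are illustrative assumptions; the fit solves the normal equations directly, which is fine for small w.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ar(series, w):
    """Least-squares AR(w) coefficients a_1..a_w via the normal equations."""
    rows = [[series[t - i] for i in range(1, w + 1)] for t in range(w, len(series))]
    targets = series[w:]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(w)] for i in range(w)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(w)]
    return solve_linear_system(xtx, xty)


def forecast_next(series, coeffs):
    """One-step forecast from the most recent len(coeffs) values."""
    return sum(a * series[-1 - i] for i, a in enumerate(coeffs))


# Toy noise-free series generated by x_t = x_{t-1} - x_{t-2} (period 6).
series = [1.0, 2.0]
for _ in range(30):
    series.append(series[-1] - series[-2])

coeffs = fit_ar(series, w=2)
next_value = forecast_next(series, coeffs)
```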

Linear Auto Regression:

[Figure: ‘lag-plot’ of number of packets sent (t) (40 to 85) vs. number of packets sent (t-1) (15 to 45)]

  • lag w=1
  • Dependent variable = # of packets sent (S[t])
  • Independent variable = # of packets sent (S[t-1])

More details:
  • Q1: Can it work with window w>1?
  • A1: YES! (we’ll fit a hyper-plane, then!)

[Figure: 3-d plot with axes xt, xt-1, xt-2]

Even more details
  • Q2: Can we estimate a incrementally?
  • A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)
  • Q3: can we ‘down-weight’ older samples?
  • A3: yes (RLS does that easily!)
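
A rough sketch of textbook Recursive Least Squares with an exponential forgetting factor, which is one standard way to ‘down-weight’ older samples. The class name, the `delta` initialisation, and the toy stream are assumptions for illustration; see [Chen+94] or [Yi+00] for the formulations the talk refers to.

```python
class RecursiveLeastSquares:
    """Textbook RLS: update regression coefficients one sample at a time."""

    def __init__(self, n_features, forgetting=1.0, delta=1000.0):
        self.lam = forgetting  # 1.0 = no forgetting; < 1.0 down-weights old samples
        self.coeffs = [0.0] * n_features
        # P tracks the inverse of the (weighted) X^T X matrix; start large and diagonal.
        self.p = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, x):
        return sum(a * xi for a, xi in zip(self.coeffs, x))

    def update(self, x, y):
        n = len(x)
        px = [sum(self.p[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * px[i] for i in range(n))
        gain = [v / denom for v in px]
        error = y - self.predict(x)
        self.coeffs = [a + g * error for a, g in zip(self.coeffs, gain)]
        # P <- (P - gain * x^T P) / lam; x^T P equals px^T because P is symmetric.
        self.p = [[(self.p[i][j] - gain[i] * px[j]) / self.lam for j in range(n)]
                  for i in range(n)]


# Stream a toy series x_t = x_{t-1} - x_{t-2} through an AR(2) model, one tick at a time.
series = [1.0, 2.0]
for _ in range(60):
    series.append(series[-1] - series[-2])

rls = RecursiveLeastSquares(n_features=2, forgetting=0.98)
for t in range(2, len(series)):
    rls.update([series[t - 1], series[t - 2]], series[t])
```

Each update costs O(k^2) for k coefficients and never revisits old samples, which is what makes the method suitable for semi-infinite streams.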

How to choose ‘w’?

Goal: capture arbitrary periodicities, with NO human intervention, on a semi-infinite stream

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Problem:
  • in a train of spikes (128 ticks apart)
  • any AR with window w < 128 will fail

What to do, then?

Answer (intuition)
  • Do a Wavelet transform (~ short window DFT)
  • look for patterns in every frequency

Intuition
  • Why NOT use the short window Fourier transform (SWFT)?
  • A: how short should the window be?

[Figure: freq vs. time, with window width w’]

Advantages of Wavelets
  • Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
  • fast to compute (usually: O(n)!)
  • very good for ‘spikes’
  • mammalian eye and ear: Gabor wavelets
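
To make the O(n) claim concrete, here is a small sketch of the Haar transform, the simplest wavelet. It is only an illustration (the talk does not say which wavelet family it uses), and the name `haar_dwt` is made up. Each pass halves the number of values, so the total work is n/2 + n/4 + … < n pair operations.

```python
import math


def haar_dwt(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns (details, smooth): details[0] holds the finest-scale detail
    coefficients, details[-1] the coarsest; smooth is the final average term.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    smooth = list(signal)
    details = []
    scale = 1.0 / math.sqrt(2.0)  # orthonormal scaling, so energy is preserved
    while len(smooth) > 1:
        pairs = [(smooth[i], smooth[i + 1]) for i in range(0, len(smooth), 2)]
        details.append([(a - b) * scale for a, b in pairs])
        smooth = [(a + b) * scale for a, b in pairs]
    return details, smooth[0]


# A single spike: only one detail coefficient per level is non-zero.
spike = [0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]
details, smooth = haar_dwt(spike)
nonzero_per_level = [sum(1 for c in level if c != 0.0) for level in details]
```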

Wavelets - intuition:
  • Q: baritone/silence/soprano - DWT?

[Figure: value vs. time (t), and frequency (f) vs. time]

Wavelets - intuition:
  • Q: baritone/soprano - DWT?

[Figure: value vs. time (t), and frequency (f) vs. time]

AWSOM

[Figure: the signal xt shown as coefficients on a frequency vs. time grid: W1,1, W1,2, W1,3, W1,4; W2,1, W2,2; W3,1; V4,1]

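AWSOM - idea

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$

[Figure: coefficients Wl,t-2, Wl,t-1, Wl,t at level l, and Wl’,t’-2, Wl’,t’-1, Wl’,t’ at level l’]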

More details…
  • Update of wavelet coefficients (incremental)
  • Update of linear models (incremental; RLS)
  • Feature selection (single-pass)
    • Not all correlations are significant
    • Throw away the insignificant ones (“noise”)

Results - Synthetic data

[Figure panels: AWSOM; AR; Seasonal AR]

  • Triangle pulse
  • Mix (sine + square)
  • AR captures wrong trend (or none)
  • Seasonal AR estimation fails

Results - Real data
  • Automobile traffic
    • Daily periodicity
    • Bursty “noise” at smaller scales
  • AR fails to capture any trend
  • Seasonal AR estimation fails

Results - real data
  • Sunspot intensity
    • Slightly time-varying “period”
  • AR captures wrong trend
  • Seasonal ARIMA
    • wrong downward trend, despite help by human!

Complexity
  • Model update

Space: $O(\lg N + m k^2) \approx O(\lg N)$

Time: $O(k^2) \approx O(1)$

  • Where
    • N: number of points (so far)
    • k: number of regression coefficients; fixed
    • m: number of linear models; $O(\lg N)$

Conclusions
  • AWSOM: Automatic, ‘hands-off’ traffic modeling (first of its kind!)

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Co-Evolving Time Sequences
  • Given: a set of correlated time sequences
  • Forecast ‘Repeated(t)’

Solution:

Q: what should we do?


Least Squares, with

  • Dep. Variable: Repeated(t)
  • Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1), ...
  • (named: ‘MUSCLES’ [Yi+00])
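
The setup above can be sketched as an ordinary least-squares problem whose rows hold the past w values of every sequence. This is a batch illustration under made-up data, not the MUSCLES implementation of [Yi+00] (which updates incrementally with RLS); the helper names and toy sequences are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def build_rows(target, predictors, w):
    """One regression row per tick t: the w most recent past values of each predictor."""
    rows, ys = [], []
    for t in range(w, len(target)):
        row = []
        for seq in predictors:
            row.extend(seq[t - lag] for lag in range(1, w + 1))
        rows.append(row)
        ys.append(target[t])
    return rows, ys


def least_squares(rows, ys):
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    return solve_linear_system(xtx, xty)


# Made-up toy data in which Repeated(t) = 0.1 * Sent(t-1) + 0.5 * Lost(t-1).
sent = [10.0 + (2 * t) % 5 for t in range(40)]
lost = [float((3 * t) % 4) for t in range(40)]
repeated = [0.0] + [0.1 * sent[t - 1] + 0.5 * lost[t - 1] for t in range(1, 40)]

rows, ys = build_rows(repeated, [sent, lost, repeated], w=1)
coeffs = least_squares(rows, ys)  # order: Sent(t-1), Lost(t-1), Repeated(t-1)
```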

Examples - Experiments
  • Datasets
    • Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)
    • AT&T WorldNet internet usage (several data streams; 980 time-ticks)
  • Measures of success
    • Accuracy : Root Mean Square Error (RMSE)
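
For reference, a small sketch of RMSE together with the “yesterday” baseline that the accuracy comparisons mention, taken here to mean forecasting x_t by x_{t-1}. The series values are made up.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def yesterday_forecasts(series):
    """Baseline: forecast x_t by x_{t-1}; returns forecasts for series[1:]."""
    return series[:-1]


series = [10.0, 12.0, 11.0, 15.0, 14.0]  # made-up values
baseline_error = rmse(series[1:], yesterday_forecasts(series))
```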

Accuracy - “Modem”

MUSCLES outperforms AR & “yesterday”

Accuracy - “Internet”
  • MUSCLES consistently outperforms AR & “yesterday”

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Detailed Outline
  • Non-linear forecasting
    • Problem
    • Idea
    • How-to
    • Experiments
    • Conclusions

Recall: Problem #1

Given a time series {xt}, predict its future course, that is, xt+1, xt+2, ...

[Figure: value vs. time]

How to forecast?
  • ARIMA - but: linearity assumption
  • ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92]

General Intuition (Lag Plot)

Lag = 1, k = 4 NN

[Figure: lag plot of xt vs. xt-1; for a new point, find its 4 nearest neighbors (4-NN) and interpolate these to get the final prediction]
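
A minimal sketch of this lag-plot idea: take the last `lag` values as the query, find the k most similar delay vectors in the history, and average the values that followed them. The function names and the test signal (a noise-free logistic parabola with a = 3.8) are illustrative assumptions; plain averaging is the simplest of several possible interpolation choices.

```python
def delay_vectors(series, lag):
    """Pairs (vector of `lag` consecutive values, the value that followed it)."""
    return [(series[t - lag:t], series[t]) for t in range(lag, len(series))]


def knn_forecast(series, lag=1, k=4):
    """Forecast the next value by averaging what followed the k nearest delay vectors."""
    query = series[-lag:]
    history = delay_vectors(series, lag)

    def sq_dist(vector):
        return sum((a - b) ** 2 for a, b in zip(vector, query))

    nearest = sorted(history, key=lambda pair: sq_dist(pair[0]))[:k]
    return sum(following for _, following in nearest) / len(nearest)


# Noise-free logistic parabola x_t = a * x_{t-1} * (1 - x_{t-1}); a = 3.8 is assumed.
a = 3.8
series = [0.3]
for _ in range(999):
    series.append(a * series[-1] * (1.0 - series[-1]))

true_next = a * series[-1] * (1.0 - series[-1])
predicted_next = knn_forecast(series, lag=1, k=4)
```

The sort makes each forecast O(n log n); a spatial index would be the natural replacement on long streams.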

Questions:
  • Q1: How to choose lag L?
  • Q2: How to choose k (the # of NN)?
  • Q3: How to interpolate?
  • Q4: why should this work at all?

Q1: Choosing lag L
  • Manually (16, in award winning system by [Sauer94])
  • Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]

Fractal Dimensions
  • FD = intrinsic dimensionality

Embedding dimensionality = 3

Intrinsic dimensionality = 1

[Figure: log(# pairs) vs. log(r)]
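
One common way to estimate an intrinsic (‘fractal’) dimension, matching the axes of the plot above, is the slope of log(# pairs within distance r) against log(r). The sketch below uses only two radii and a brute-force pair count, so it is an illustration rather than the method used in the talk; the function names and the test points are assumptions.

```python
import math


def count_close_pairs(points, radius):
    """Number of point pairs whose Euclidean distance is at most `radius`."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                count += 1
    return count


def correlation_dimension(points, r_small, r_large):
    """Slope of log(# pairs within r) vs. log(r), estimated from two radii."""
    n_small = count_close_pairs(points, r_small)
    n_large = count_close_pairs(points, r_large)
    return (math.log(n_large) - math.log(n_small)) / (math.log(r_large) - math.log(r_small))


# 200 evenly spaced points on a straight line in 3-d:
# embedding dimensionality 3, intrinsic dimensionality 1.
points = [(t / 3.0, 2.0 * t / 3.0, 2.0 * t / 3.0)
          for t in (i / 199.0 for i in range(200))]
fd = correlation_dimension(points, r_small=0.05, r_large=0.2)
```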

Intuition

The Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$

[Figure: x(t) vs. time]

  • Its lag plot for lag = 1

[Figure: x(t) vs. x(t-1)]

Intuition

[Figure: lag plots with axes x(t), x(t-1) and with axes x(t), x(t-1), x(t-2)]

Intuition
  • The FD vs L plot does flatten out
  • L(opt) = 1

[Figure: fractal dimension vs. lag]

Proposed Method
  • Use Fractal Dimensions to find the optimal lag length L(opt)

[Figure: fractal dimension vs. lag (L), with a tolerance epsilon and the chosen lag marked ‘Choose this’]

Q2: Choosing number of neighbors k
  • Manually (typically ~ 1-10)

Q3: How to interpolate?

How do we interpolate between the k nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop with distance - how?)

A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)

[Figure: lag plot of xt vs. xt-1]

Q4: Any theory behind it?

A4: YES!

Theoretical foundation
  • Based on the “Takens’ Theorem” [Takens81]
  • which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)

Theoretical foundation

Example: Lotka-Volterra equations

dH/dt = r H – a H*P
dP/dt = b H*P – m P

H is count of prey (e.g., hare)
P is count of predators (e.g., lynx)

[Figure: P vs. H]

Suppose only P(t) is observed (t=1, 2, …).
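
To make the example concrete, the sketch below integrates the two equations with a fourth-order Runge-Kutta step and then keeps only P, sampled once per time unit, to form delay pairs (P(t-1), P(t)). All parameter values, initial counts, and the step size are assumptions chosen for illustration.

```python
def simulate_lotka_volterra(h0, p0, r, a, b, m, dt, steps):
    """Integrate dH/dt = r H - a H P, dP/dt = b H P - m P with 4th-order Runge-Kutta."""

    def deriv(h, p):
        return r * h - a * h * p, b * h * p - m * p

    h, p = h0, p0
    trajectory = [(h, p)]
    for _ in range(steps):
        k1 = deriv(h, p)
        k2 = deriv(h + 0.5 * dt * k1[0], p + 0.5 * dt * k1[1])
        k3 = deriv(h + 0.5 * dt * k2[0], p + 0.5 * dt * k2[1])
        k4 = deriv(h + dt * k3[0], p + dt * k3[1])
        h += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        p += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        trajectory.append((h, p))
    return trajectory


# Assumed parameters and starting counts, for illustration only.
trajectory = simulate_lotka_volterra(
    h0=10.0, p0=5.0, r=1.0, a=0.1, b=0.075, m=1.5, dt=0.01, steps=2000
)

# Pretend only the predator count is observed, once per time unit.
observed_p = [p for _, p in trajectory[::100]]
delay_pairs = list(zip(observed_p[:-1], observed_p[1:]))  # (P(t-1), P(t))
```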

  • But the delay vector space is a faithful reconstruction of the internal system state
  • So prediction in delay vector space is as good as prediction in state space

[Figure: P vs. H, and P(t) vs. P(t-1)]

Detailed Outline
  • Non-linear forecasting
    • Problem
    • Idea
    • How-to
    • Experiments
    • Conclusions

Datasets

Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$
Models population of flies [R. May/1976]

[Figure: x(t) vs. time, and its lag-plot]

ARIMA: fails

Logistic Parabola

[Figure: value vs. timesteps, with the point marked ‘Our Prediction from here’]

[Figure: Logistic Parabola, comparison of prediction to correct values; value vs. timesteps]

Datasets

LORENZ: Models convection currents in the air

dx / dt = a (y - x)

dy / dt = x (b - z) - y

dz / dt = xy - c z
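
A short sketch that generates a series from these three equations with a fourth-order Runge-Kutta step and records x. The slide does not give parameter values; a = 10, b = 28, c = 8/3 are the commonly used chaotic setting and are an assumption here, as are the step size and starting state.

```python
def lorenz_series(a=10.0, b=28.0, c=8.0 / 3.0, dt=0.01, steps=5000,
                  state=(1.0, 1.0, 1.0)):
    """Return the x-coordinate of the Lorenz system sampled every dt."""

    def deriv(s):
        x, y, z = s
        return (a * (y - x), x * (b - z) - y, x * y - c * z)

    def rk4_step(s):
        k1 = deriv(s)
        k2 = deriv(tuple(v + 0.5 * dt * k for v, k in zip(s, k1)))
        k3 = deriv(tuple(v + 0.5 * dt * k for v, k in zip(s, k2)))
        k4 = deriv(tuple(v + dt * k for v, k in zip(s, k3)))
        return tuple(
            v + dt * (p + 2 * q + 2 * r + w) / 6.0
            for v, p, q, r, w in zip(s, k1, k2, k3, k4)
        )

    xs = []
    for _ in range(steps):
        state = rk4_step(state)
        xs.append(state[0])
    return xs


xs = lorenz_series()
```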

LORENZ

[Figure: comparison of prediction to correct values; value vs. timesteps]

Datasets
  • LASER: fluctuations in a Laser over time (used in Santa Fe competition)

[Figure: value vs. time]

Laser

[Figure: comparison of prediction to correct values; value vs. timesteps]

Conclusions
  • Lag plots for non-linear forecasting (Takens’ theorem)
  • suitable for ‘chaotic’ signals

Additional projects at CMU
  • Graph/Network mining
  • spatio-temporal mining - outliers

Graph/network mining
  • Internet; web; gnutella P2P networks
  • Q: Any pattern?
  • Q: how to generate ‘realistic’ topologies?
  • Q: how to define/verify realism?

Patterns?
  • avg degree is, say 3.3
  • pick a node at random - what is the degree you expect it to have?
  • A: 1!!

[Figure: count vs. degree, with the average (3.3) marked]

Patterns?
  • A: Power laws!

[Figure: log(count) vs. log {(out) degree}]

Other ‘laws’?

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]

RMAT, to generate realistic graphs

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]

Epidemic threshold?
  • on a real graph, will a (computer / biological) virus die out? (given
    • beta: probability that an infected node will infect its neighbor and
    • delta: probability that an infected node will recover)

[Figure: cases labeled NO, MAYBE, YES]

  • A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
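
A hedged sketch of how such a check could look, assuming the condition reported in [Wang+03] is that the virus dies out when beta/delta is below 1/λ1, with λ1 the largest eigenvalue of the adjacency matrix. The function names and the example graph are made up; the eigenvalue is found by power iteration, which assumes a connected, undirected graph.

```python
def largest_eigenvalue(adjacency, iterations=200):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix by power iteration.

    Iterating on (A + I) instead of A avoids the oscillation that plain power
    iteration shows on bipartite graphs; 1 is subtracted again at the end.
    """
    n = len(adjacency)
    v = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        w = [v[i] + sum(adjacency[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm_w = sum(x * x for x in w) ** 0.5
        norm_v = sum(x * x for x in v) ** 0.5
        value = norm_w / norm_v
        v = [x / norm_w for x in w]
    return value - 1.0


def virus_dies_out(adjacency, beta, delta):
    """Assumed threshold condition: beta / delta < 1 / lambda_1."""
    return beta / delta < 1.0 / largest_eigenvalue(adjacency)


# Star graph: node 0 connected to four leaves; its largest eigenvalue is 2.
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
lambda_1 = largest_eigenvalue(star)
```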

Additional projects
  • Graph mining
  • spatio-temporal mining - outliers

Outliers - ‘LOCI’

finds outliers quickly, with no human intervention

Conclusions
  • AWSOM for automatic, linear forecasting
  • MUSCLES for co-evolving sequences
  • F4 for non-linear forecasting
  • Graph/Network topology: power laws and generators; epidemic threshold
  • LOCI for outlier detection

  • Overarching theme: automatic discovery of patterns (outliers/rules) in
    • time sequences (sensors/streams)
    • graphs (computer/social networks)
    • multimedia (video, motion capture data etc)

www.cs.cmu.edu/~christos

[email protected]

Books
  • William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
  • C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

  • George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)
  • Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

Resources: software and urls
  • MUSCLES: Prof. Byoung-Kee Yi:

http://www.postech.ac.kr/~bkyi/

or [email protected]

Additional Reading
  • [Chakrabarti+02] Deepay Chakrabarti and Christos Faloutsos F4: Large-Scale Automated Forecasting using Fractals CIKM 2002, Washington DC, Nov. 2002.
  • [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994:161-172
  • [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001

  • Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003
  • Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos LOCI: Fast Outlier Detection Using the Local Correlation Integral ICDE 2003, Bangalore, India, March 5 - March 8, 2003.
  • Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.

  • Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.
  • Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint 22nd Symposium on Reliable Distributed Computing (SRDS2003) Florence, Italy, Oct. 6-8, 2003

  • Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)
  • [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)
