- By
**oni** - Follow User

- 77 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Sensor data mining and forecasting' - oni

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Outline

Outline

Problem definition - motivation

Linear forecasting - AR and AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

Problem definition

- Given: one or more sequences

x1 , x2 , … , xt , …

(y1, y2, … , yt, …

… )

- Find
- forecasts; patterns
- clusters; outliers

C. Faloutsos

Motivation - Applications

- Financial, sales, economic series
- Medical
- ECGs +; blood pressure etc monitoring
- reactions to new drugs
- elderly care

C. Faloutsos

Motivation - Applications (cont’d)

- ‘Smart house’
- sensors monitor temperature, humidity, air quality
- video surveillance

C. Faloutsos

Motivation - Applications (cont’d)

- civil/automobile infrastructure
- bridge vibrations [Oppenheim+02]
- road conditions / traffic monitoring

C. Faloutsos

Automobile traffic

2000

1800

1600

1400

1200

1000

800

600

400

200

0

Stream Data: automobile traffic# cars

time

C. Faloutsos

Motivation - Applications (cont’d)

- Weather, environment/anti-pollution
- volcano monitoring
- air/water pollutant monitoring

C. Faloutsos

Motivation - Applications (cont’d)

- Computer systems
- ‘Active Disks’ (buffering, prefetching)
- web servers (ditto)
- network traffic monitoring
- ...

C. Faloutsos

Problem #1:

Goal: given a signal (eg., #packets over time)

Find: patterns, periodicities, and/or compress

count

lynx caught per year

(packets per day;

temperature per day)

year

C. Faloutsos

Problem#1’: Forecast

Given xt, xt-1, …, forecast xt+1

90

80

70

60

Number of packets sent

??

50

40

30

20

10

0

1

3

5

7

9

11

Time Tick

C. Faloutsos

Differences from DSP/Stat

- Semi-infinite streams
- we need on-line, ‘any-time’ algorithms
- Can not afford human intervention
- need automatic methods
- sensors have limited memory / processing / transmitting power
- need for (lossy) compression

C. Faloutsos

Important observations

Patterns, rules, compression and forecasting are closely related:

- To do forecasting, we need
- to find patterns/rules
- good rules help us compress
- to find outliers, we need to have forecasts
- (outlier = too far away from our forecast)

C. Faloutsos

Pictorial outline of the talk

C. Faloutsos

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

Mini intro to A.R.

C. Faloutsos

Forecasting

"Prediction is very difficult, especially about the future." - Nils Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html

C. Faloutsos

Problem#1’: Forecast

- Example: give xt-1, xt-2, …, forecast xt

90

80

70

60

Number of packets sent

??

50

40

30

20

10

0

1

3

5

7

9

11

Time Tick

C. Faloutsos

Linear Regression: idea

85

Body height

80

75

70

65

60

55

50

45

40

15

25

35

45

Body weight

- express what we don’t know (= ‘dependent variable’)
- as a linear function of what we know (= ‘indep. variable(s)’)

C. Faloutsos

Linear Auto Regression:

C. Faloutsos

90

80

70

??

60

50

40

30

20

10

0

1

3

5

7

9

11

Time Tick

Problem#1’: Forecast- Solution: try to express

xt

as a linear function of the past: xt-2, xt-2, …,

(up to a window of w)

Formally:

C. Faloutsos

Linear Auto Regression:

85

‘lag-plot’

80

75

70

65

Number of packets sent (t)

60

55

50

45

40

15

25

35

45

Number of packets sent (t-1)

- lag w=1
- Dependent variable = # of packets sent (S[t])
- Independent variable = # of packets sent (S[t-1])

C. Faloutsos

More details:

- Q1: Can it work with window w>1?
- A1: YES! (we’ll fit a hyper-plane, then!)

xt

xt-1

xt-2

C. Faloutsos

More details:

- Q1: Can it work with window w>1?
- A1: YES! (we’ll fit a hyper-plane, then!)

xt

xt-1

xt-2

C. Faloutsos

Even more details

- Q2: Can we estimate a incrementally?
- A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)
- Q3: can we ‘down-weight’ older samples?
- A3: yes (RLS does that easily!)

C. Faloutsos

Mini intro to A.R.

C. Faloutsos

goal: capture arbitrary periodicities

with NO human intervention

on a semi-infinite stream

How to choose ‘w’?C. Faloutsos

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

Problem:

- in a train of spikes (128 ticks apart)
- any AR with window w < 128 will fail

What to do, then?

C. Faloutsos

Answer (intuition)

- Do a Wavelet transform (~ short window DFT)
- look for patterns in every frequency

C. Faloutsos

Intuition

- Why NOT use the short window Fourier transform (SWFT)?
- A: how short should be the window?

freq

time

w’

C. Faloutsos

Advantages of Wavelets

- Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
- fast to compute (usually: O(n)!)
- very good for ‘spikes’
- mammalian eye and ear: Gabor wavelets

C. Faloutsos

Wl,t-2

Wl,t-1

Wl,t

Wl’,t’-2

Wl’,t’-1

AWSOM - ideaWl,t l,1Wl,t-1l,2Wl,t-2 …

Wl’,t’ l’,1Wl’,t’-1l’,2Wl’,t’-2 …

Wl’,t’

C. Faloutsos

More details…

- Update of wavelet coefficients
- Update of linear models
- Feature selection
- Not all correlations are significant
- Throw away the insignificant ones (“noise”)

(incremental)

(incremental; RLS)

(single-pass)

C. Faloutsos

Results - Synthetic data

AWSOM

AR

Seasonal AR

- Triangle pulse
- Mix (sine + square)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails

C. Faloutsos

Results - Real data

- Automobile traffic
- Daily periodicity
- Bursty “noise” at smaller scales
- AR fails to capture any trend
- Seasonal AR estimation fails

C. Faloutsos

Results - real data

- Sunspot intensity
- Slightly time-varying “period”
- AR captures wrong trend
- Seasonal ARIMA
- wrong downward trend, despite help by human!

C. Faloutsos

Complexity

- Model update

Space:OlgN + mk2 OlgN

Time:Ok2 O1

- Where
- N: number of points (so far)
- k: number of regression coefficients; fixed
- m: number of linear models; OlgN

C. Faloutsos

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

Co-Evolving Time Sequences

- Given: A set of correlatedtime sequences
- Forecast ‘Repeated(t)’

??

C. Faloutsos

Solution:

Least Squares, with

- Dep. Variable: Repeated(t)
- Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1), ...
- (named: ‘MUSCLES’ [Yi+00])

C. Faloutsos

Examples - Experiments

- Datasets
- Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)
- AT&T WorldNet internet usage (several data streams; 980 time-ticks)
- Measures of success
- Accuracy : Root Mean Square Error (RMSE)

C. Faloutsos

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

Recall: Problem #1

Value

Time

Given a time series {xt}, predict its future course, that is, xt+1, xt+2, ...

C. Faloutsos

How to forecast?

- ARIMA - but: linearity assumption
- ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92]

C. Faloutsos

Interpolate these…

To get the final prediction

4-NN

New Point

General Intuition (Lag Plot)Lag = 1,k = 4 NN

xt

xt-1

C. Faloutsos

Questions:

- Q1: How to choose lag L?
- Q2: How to choose k (the # of NN)?
- Q3: How to interpolate?
- Q4: why should this work at all?

C. Faloutsos

Q1: Choosing lag L

- Manually (16, in award winning system by [Sauer94])
- Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]

C. Faloutsos

Fractal Dimensions

- FD = intrinsic dimensionality

Embedding dimensionality = 3

Intrinsic dimensionality = 1

C. Faloutsos

Fractal Dimension

epsilon

Choose this

Lag (L)

Proposed Method- Use Fractal Dimensions to find the optimal lag length L(opt)

C. Faloutsos

Q3: How to interpolate?

How do we interpolate between thek nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop with distance - how?)

C. Faloutsos

A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)Q3: How to interpolate?

xt

Xt-1

C. Faloutsos

Theoretical foundation

- Based on the “Takens’ Theorem” [Takens81]
- which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)

C. Faloutsos

P

H

Skip

Theoretical foundationExample: Lotka-Volterra equations

dH/dt = r H – a H*P dP/dt = b H*P – m P

H is count of prey (e.g., hare)P is count of predators (e.g., lynx)

Suppose only P(t) is observed (t=1, 2, …).

C. Faloutsos

P

H

Skip

Theoretical foundation- But the delay vector space is a faithful reconstruction of the internal system state
- So prediction in delay vector space is as good as prediction in state space

P(t)

P(t-1)

C. Faloutsos

Conclusions

- Lag plots for non-linear forecasting (Takens’ theorem)
- suitable for ‘chaotic’ signals

C. Faloutsos

Graph/network mining

- Internet; web; gnutella P2P networks
- Q: Any pattern?
- Q: how to generate ‘realistic’ topologies?
- Q: how to define/verify realism?

C. Faloutsos

Patterns?

- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?

count

?

avg: 3.3

degree

C. Faloutsos

Patterns?

- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!

count

avg: 3.3

degree

C. Faloutsos

Patterns?

- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!

count

avg: 3.3

degree

C. Faloutsos

Effective DiameterOther ‘laws’?

Count vs Indegree

Count vs Outdegree

Hop-plot

Stress

“Network value”

Eigenvalue vs Rank

C. Faloutsos

Effective DiameterRMAT, to generate realistic graphs

Count vs Indegree

Count vs Outdegree

Hop-plot

Stress

“Network value”

Eigenvalue vs Rank

C. Faloutsos

Epidemic threshold?

- one a real graph, will a (computer / biological) virus die out? (given
- beta: probability that an infected node will infect its neighbor and
- delta: probability that an infected node will recover

NO

MAYBE

YES

C. Faloutsos

Epidemic threshold?

- one a real graph, will a (computer / biological) virus die out? (given
- beta: probability that an infected node will infect its neighbor and
- delta: probability that an infected node will recover
- A: depends on largest eigenvalue of adjacency matrix! [Wang+03]

C. Faloutsos

Outliers - ‘LOCI’

C. Faloutsos

Conclusions

- AWSOM for automatic, linear forecasting
- MUSCLES for co-evolving sequences
- F4 for non-linear forecasting
- Graph/Network topology: power laws and generators; epidemic threshold
- LOCI for outlier detection

C. Faloutsos

Conclusions

- Overarching theme: automatic discovery of patterns (outliers/rules) in
- time sequences (sensors/streams)
- graphs (computer/social networks)
- multimedia (video, motion capture data etc)

www.cs.cmu.edu/~christos

C. Faloutsos

Books

- William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
- C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

C. Faloutsos

Books

- George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)
- Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

C. Faloutsos

Resources: software and urls

- MUSCLES: Prof. Byoung-Kee Yi:

http://www.postech.ac.kr/~bkyi/

- AWSOM & LOCI: [email protected]
- F4, RMAT: [email protected]

C. Faloutsos

Additional Reading

- [Chakrabarti+02] Deepay Chakrabarti and Christos Faloutsos F4: Large-Scale Automated Forecasting using Fractals CIKM 2002, Washington DC, Nov. 2002.
- [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994:161-172
- [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001

C. Faloutsos

Additional Reading

- Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003
- Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos LOCI: Fast Outlier Detection Using the Local Correlation Integral ICDE 2003, Bangalore, India, March 5 - March 8, 2003.
- Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.

C. Faloutsos

Additional Reading

- Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.
- Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint 22nd Symposium on Reliable Distributed Computing (SRDS2003) Florence, Italy, Oct. 6-8, 2003

C. Faloutsos

Additional Reading

- Weigend, A. S. and N. A. Gerschenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)
- [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)

C. Faloutsos

Download Presentation

Connecting to Server..