Sensor data mining and forecasting

Presentation Transcript

Outline

Problem definition - motivation

Linear forecasting - AR and AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

C. Faloutsos

Problem definition

- Given: one or more sequences $x_1, x_2, \ldots, x_t, \ldots$ (and possibly $y_1, y_2, \ldots, y_t, \ldots$, and so on)
- Find
- forecasts; patterns
- clusters; outliers

Motivation - Applications

- Financial, sales, economic series
- Medical
- ECGs +; blood pressure etc monitoring
- reactions to new drugs
- elderly care

Motivation - Applications (cont’d)

- ‘Smart house’
- sensors monitor temperature, humidity, air quality

- video surveillance

Motivation - Applications (cont’d)

- civil/automobile infrastructure
- bridge vibrations [Oppenheim+02]
- road conditions / traffic monitoring

Stream Data: automobile traffic

[Figure: # cars (0 to 2000) vs. time]

Motivation - Applications (cont’d)

- Weather, environment/anti-pollution
- volcano monitoring
- air/water pollutant monitoring

Motivation - Applications (cont’d)

- Computer systems
- ‘Active Disks’ (buffering, prefetching)
- web servers (ditto)
- network traffic monitoring
- ...

Problem #1:

Goal: given a signal (e.g., #packets over time)

Find: patterns, periodicities, and/or compress

[Figure: count vs. year for lynx caught per year (similarly: packets per day; temperature per day)]

Problem #1’: Forecast

Given $x_t, x_{t-1}, \ldots$, forecast $x_{t+1}$

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the value to forecast is marked "??"]

Differences from DSP/Stat

- Semi-infinite streams
- we need on-line, ‘any-time’ algorithms

- Cannot afford human intervention
- need automatic methods

- Sensors have limited memory / processing / transmitting power
- need for (lossy) compression

Important observations

Patterns, rules, compression and forecasting are closely related:

- To do forecasting, we need
- to find patterns/rules

- good rules help us compress
- to find outliers, we need to have forecasts
- (outlier = too far away from our forecast)
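
The last bullet can be sketched in a few lines: an outlier is a point whose forecast error is unusually large. This is only an illustration, assuming NumPy; the function name `flag_outliers` and the "more than `n_sigmas` standard deviations of the residual" rule are assumptions, since the slides do not say how far is "too far".

```python
import numpy as np

def flag_outliers(actual, forecast, n_sigmas=3.0):
    """Mark points whose forecast error exceeds n_sigmas residual std-devs.

    The threshold rule is an assumption made for this sketch.
    """
    residual = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    scale = residual.std()
    if scale == 0.0:
        # Perfect forecasts: nothing is "too far away".
        return np.zeros(residual.shape, dtype=bool)
    return np.abs(residual) > n_sigmas * scale

# Toy data: forecasts are all zero, the signal wiggles by 0.1, one spike of 5.
forecast = np.zeros(50)
actual = 0.1 * (-1.0) ** np.arange(50)
actual[20] = 5.0
flags = flag_outliers(actual, forecast)
```

Any forecaster from the rest of the talk could supply `forecast`; the better the forecasts, the tighter the residuals and the easier the outliers stand out.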

Pictorial outline of the talk

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Mini intro to A.R.

Forecasting

"Prediction is very difficult, especially about the future." - Niels Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html

Problem #1’: Forecast

- Example: given $x_{t-1}, x_{t-2}, \ldots$, forecast $x_t$

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the value to forecast is marked "??"]

Linear Regression: idea

- express what we don’t know (= ‘dependent variable’)
- as a linear function of what we know (= ‘indep. variable(s)’)

[Figure: body height (40 to 85) vs. body weight (15 to 45)]

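Problem #1’: Forecast

- Solution: try to express $x_t$ as a linear function of the past: $x_{t-1}, x_{t-2}, \ldots$ (up to a window of $w$)

Formally:

$$x_t \approx a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_w x_{t-w} + \text{noise}$$

[Figure: series value (0 to 80) vs. time tick (1 to 11); the value to forecast is marked "??"]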

Linear Auto Regression:

- lag w=1
- Dependent variable = # of packets sent (S[t])
- Independent variable = # of packets sent (S[t-1])

[Figure: ‘lag-plot’ of number of packets sent at time t (40 to 85) vs. number of packets sent at time t-1 (15 to 45)]

More details:

- Q1: Can it work with window w>1?
- A1: YES! (we’ll fit a hyper-plane, then!)

[Figure: hyper-plane fitted over the axes $x_t$, $x_{t-1}$, $x_{t-2}$]
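The hyper-plane fit for a window w > 1 can be sketched as an ordinary least-squares problem. This is a minimal illustration, assuming NumPy; the names `fit_ar` and `forecast_next` are hypothetical and the batch solver stands in for whatever the original system used.

```python
import numpy as np

def fit_ar(series, w):
    """Least-squares fit of x_t ~ a_1 x_{t-1} + ... + a_w x_{t-w}."""
    x = np.asarray(series, dtype=float)
    # Column `lag` holds x_{t-lag} for every t = w .. len(x)-1.
    X = np.column_stack([x[w - lag:len(x) - lag] for lag in range(1, w + 1)])
    a, *_ = np.linalg.lstsq(X, x[w:], rcond=None)
    return a

def forecast_next(series, a):
    """One-step forecast from the w most recent values."""
    x = np.asarray(series, dtype=float)
    w = len(a)
    recent = x[-1:-w - 1:-1]  # (x_t, x_{t-1}, ..., x_{t-w+1})
    return float(recent @ a)

# Toy series that obeys x_t = 1.5 x_{t-1} - 0.7 x_{t-2} exactly.
series = [1.0, 2.0]
for _ in range(28):
    series.append(1.5 * series[-1] - 0.7 * series[-2])

a = fit_ar(series, w=2)
prediction = forecast_next(series, a)
```

With w = 1 this reduces to fitting a line through the lag-plot; with w = 2 it is the plane through the points $(x_{t-1}, x_{t-2}, x_t)$.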

Even more details

- Q2: Can we estimate the regression coefficients a incrementally?
- A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)
- Q3: Can we ‘down-weight’ older samples?
- A3: Yes (RLS does that easily!)
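A2 and A3 can be illustrated with the textbook RLS recursion with an exponential forgetting factor. This is a sketch assuming NumPy, not the code from [Chen+94] or [Yi+00]; the class name, the `forgetting` parameter and the `init_scale` starting value are choices made for the example.

```python
import numpy as np

class RecursiveLeastSquares:
    """Textbook RLS; forgetting < 1 down-weights older samples."""

    def __init__(self, n_coeffs, forgetting=1.0, init_scale=1e6):
        self.a = np.zeros(n_coeffs)               # current coefficient estimate
        self.P = np.eye(n_coeffs) * init_scale    # inverse-covariance estimate
        self.forgetting = forgetting

    def update(self, x, y):
        """Fold in one sample (x, y) in O(k^2) time, k = number of coefficients."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.forgetting + x @ Px)
        self.a = self.a + gain * (y - x @ self.a)
        self.P = (self.P - np.outer(gain, Px)) / self.forgetting
        return self.a

# Toy stream whose true coefficients change half-way through.
rng = np.random.default_rng(0)
rls = RecursiveLeastSquares(2, forgetting=0.95)
for _ in range(300):
    x = rng.normal(size=2)
    rls.update(x, x @ np.array([2.0, -3.0]))
first_estimate = rls.a.copy()
for _ in range(300):
    x = rng.normal(size=2)
    rls.update(x, x @ np.array([-1.0, 4.0]))
second_estimate = rls.a.copy()
```

Because old samples decay geometrically, the estimate tracks the new coefficients after the change without any refit from scratch.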

Mini intro to A.R.

How to choose ‘w’?

- goal: capture arbitrary periodicities
- with NO human intervention
- on a semi-infinite stream

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Problem:

- in a train of spikes (128 ticks apart)
- any AR with window w < 128 will fail
What to do, then?

Answer (intuition)

- Do a Wavelet transform (~ short window DFT)
- look for patterns in every frequency

Intuition

- Why NOT use the short window Fourier transform (SWFT)?
- A: how short should the window be?

[Figure: freq vs. time, with window width w’ marked]

Advantages of Wavelets

- Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
- fast to compute (usually: O(n)!)
- very good for ‘spikes’
- mammalian eye and ear: Gabor wavelets
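
The O(n) cost and the good behaviour on spikes can be seen with the simplest wavelet. The sketch below uses the orthonormal Haar transform, assuming NumPy; Haar is chosen only because it is short to write, and the slides do not say which wavelet family was used.

```python
import numpy as np

def haar_dwt(x):
    """Full orthonormal Haar decomposition of a signal of length 2^m.

    Each level touches half as many values as the previous one,
    so the total work is n + n/2 + n/4 + ... = O(n).
    """
    approx = np.asarray(x, dtype=float)
    n = approx.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)         # smoothed signal
    return approx, details  # details[0] is the finest level

def haar_idwt(approx, details):
    """Invert haar_dwt."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx

# A single spike in 16 samples.
signal = np.zeros(16)
signal[5] = 1.0
approx, details = haar_dwt(signal)
nonzero = int(np.count_nonzero(approx)) + sum(int(np.count_nonzero(d)) for d in details)
restored = haar_idwt(approx, details)
```

The spike touches only one coefficient per level, so it is described by about lg n numbers, whereas its Fourier transform spreads over every frequency.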

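AWSOM - idea

$$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \cdots$$

$$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \cdots$$

[Figure: the wavelet coefficients $W_{l,t-2}, W_{l,t-1}, W_{l,t}$ and $W_{l',t'-2}, W_{l',t'-1}, W_{l',t'}$ marked at their levels]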

More details…

- Update of wavelet coefficients (incremental)
- Update of linear models (incremental; RLS)
- Feature selection (single-pass)
- Not all correlations are significant
- Throw away the insignificant ones (“noise”)

Results - Synthetic data

- Triangle pulse
- Mix (sine + square)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails

[Figure panels: AWSOM; AR; Seasonal AR]

Results - Real data

- Automobile traffic
- Daily periodicity
- Bursty “noise” at smaller scales

- AR fails to capture any trend
- Seasonal AR estimation fails

Results - real data

- Sunspot intensity
- Slightly time-varying “period”

- AR captures wrong trend
- Seasonal ARIMA
- wrong downward trend, despite help by human!

Complexity

- Model update
- Space: $O(\lg N + m k^2) \approx O(\lg N)$
- Time: $O(k^2) \approx O(1)$
- Where
- N: number of points (so far)
- k: number of regression coefficients; fixed
- m: number of linear models; $O(\lg N)$

Outline

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Co-Evolving Time Sequences

- Given: A set of correlated time sequences
- Forecast ‘Repeated(t)’

[Figure: the next value of ‘Repeated’ is marked "??"]

Solution:

Least Squares, with

- Dep. Variable: Repeated(t)
- Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1), ...
- (named: ‘MUSCLES’ [Yi+00])
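
The regression above can be sketched as one least-squares problem over the lagged values of all sequences. This is a batch illustration assuming NumPy, not the implementation from [Yi+00], which updates the fit incrementally with RLS; the function names and the synthetic `sent` / `lost` / `repeated` data are made up for the example.

```python
import numpy as np

def lagged_design(sequences, w):
    """One column per (sequence, lag) pair, one row per time t = w .. T-1."""
    seqs = [np.asarray(s, dtype=float) for s in sequences]
    T = len(seqs[0])
    return np.column_stack(
        [s[w - lag:T - lag] for s in seqs for lag in range(1, w + 1)]
    )

def muscles_fit(target, sequences, w):
    """Regress target(t) on the previous w values of every sequence."""
    X = lagged_design(sequences, w)
    y = np.asarray(target, dtype=float)[w:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def muscles_forecast(sequences, coeffs, w):
    """Forecast the target one tick past the end of the data."""
    seqs = [np.asarray(s, dtype=float) for s in sequences]
    T = len(seqs[0])
    row = np.array([s[T - lag] for s in seqs for lag in range(1, w + 1)])
    return float(row @ coeffs)

# Synthetic data: Repeated(t) = 0.5 * Sent(t-1) + 0.2 * Lost(t-1).
rng = np.random.default_rng(0)
sent = rng.poisson(50, size=200).astype(float)
lost = rng.poisson(5, size=200).astype(float)
repeated = np.zeros(200)
repeated[1:] = 0.5 * sent[:-1] + 0.2 * lost[:-1]

w = 1
coeffs = muscles_fit(repeated, [sent, lost, repeated], w)
next_repeated = muscles_forecast([sent, lost, repeated], coeffs, w)
```

The fitted coefficients show which sequences actually help: here the weight on `repeated`'s own past comes out near zero.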

Examples - Experiments

- Datasets
- Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)
- AT&T WorldNet internet usage (several data streams; 980 time-ticks)

- Measures of success
- Accuracy : Root Mean Square Error (RMSE)

Problem definition - motivation

Linear forecasting

AR

AWSOM

Coevolving series - MUSCLES

Fractal forecasting - F4

Other projects

graph modeling, outliers etc

Recall: Problem #1

Given a time series $\{x_t\}$, predict its future course, that is, $x_{t+1}, x_{t+2}, \ldots$

[Figure: Value vs. Time]

How to forecast?

- ARIMA - but: linearity assumption
- ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92]

General Intuition (Lag Plot)

Lag = 1, k = 4 NN

[Figure: lag plot of $x_t$ vs. $x_{t-1}$; the 4 nearest neighbors (4-NN) of the new point are interpolated to get the final prediction]
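The lag-plot intuition can be sketched as a nearest-neighbor lookup over delay vectors. This is a minimal illustration assuming NumPy; `knn_lag_forecast` is a hypothetical name, and the plain average of the neighbors' successors is only the simplest interpolation choice.

```python
import numpy as np

def knn_lag_forecast(series, lag=1, k=4):
    """Forecast the next value from the k nearest delay vectors."""
    x = np.asarray(series, dtype=float)
    # Each delay vector (x[t-lag], ..., x[t-1]) is paired with its successor x[t].
    vectors = np.array([x[t - lag:t] for t in range(lag, len(x))])
    successors = x[lag:]
    query = x[-lag:]                       # the "new point" in the lag plot
    dist = np.linalg.norm(vectors - query, axis=1)
    nearest = np.argsort(dist)[:k]
    return float(successors[nearest].mean())

# Toy periodic series: after a 4 the next value is always 1.
series = [1.0, 2.0, 3.0, 4.0] * 10
prediction = knn_lag_forecast(series, lag=1, k=4)
```

Nothing here assumes linearity: the forecast only relies on similar pasts having similar futures.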

Questions:

- Q1: How to choose lag L?
- Q2: How to choose k (the # of NN)?
- Q3: How to interpolate?
- Q4: why should this work at all?

Q1: Choosing lag L

- Manually (16, in award winning system by [Sauer94])
- Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]

Fractal Dimensions

- FD = intrinsic dimensionality

Embedding dimensionality = 3

Intrinsic dimensionality = 1
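
One common way to estimate an intrinsic dimension is the slope of the correlation integral on a log-log plot. The sketch below assumes NumPy and uses that estimator on a straight line embedded in 3-d; the slides do not spell out which estimator the authors use, so `correlation_dimension` and the chosen radii are illustrative only.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Slope of log(fraction of pairs closer than r) against log(r)."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_dist = dist[np.triu_indices(len(pts), k=1)]   # each pair counted once
    fractions = np.array([(pair_dist < r).mean() for r in radii])
    slope, _intercept = np.polyfit(np.log(radii), np.log(fractions), 1)
    return float(slope)

# 500 points on a unit-length line segment embedded in 3-d space.
t = np.linspace(0.0, 1.0, 500)
line_in_3d = np.column_stack([t, t, t]) / np.sqrt(3.0)
radii = np.logspace(-1.5, -0.7, 8)
fd = correlation_dimension(line_in_3d, radii)
```

Here the embedding dimensionality is 3 but the estimate comes out close to 1. For Q1, the same function could be applied to the delay vectors for L = 1, 2, 3, … until the estimate stops growing.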

Intuition

The Logistic Parabola: $x_t = a x_{t-1}(1 - x_{t-1}) + \text{noise}$

- Its lag plot for lag = 1

[Figure: X(t) vs. time, and the lag plot of X(t) vs. X(t-1)]

Proposed Method

- Use Fractal Dimensions to find the optimal lag length L(opt)

[Figure: x-axis Lag (L); the lag marked "Choose this" is where the curve changes by less than epsilon]

Q3: How to interpolate?

How do we interpolate between the k nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop with distance - how?)

A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)

[Figure: lag plot of $x_t$ vs. $x_{t-1}$]

Theoretical foundation

- Based on the “Takens’ Theorem” [Takens81]
- which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)

Theoretical foundation

Example: Lotka-Volterra equations

$$\frac{dH}{dt} = rH - aHP, \qquad \frac{dP}{dt} = bHP - mP$$

H is count of prey (e.g., hare)

P is count of predators (e.g., lynx)

Suppose only P(t) is observed (t=1, 2, …).

[Figure: P vs. H]
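To make the setting concrete, the sketch below integrates the Lotka-Volterra equations, keeps only the predator counts, and builds lag-1 delay vectors from them. It assumes NumPy; the parameter values, the initial state, the step size and the hand-written Runge-Kutta integrator are all choices made for this example, not values from the slides.

```python
import numpy as np

def lotka_volterra(state, r=1.0, a=0.1, b=0.02, m=0.5):
    """Right-hand side of dH/dt = rH - aHP, dP/dt = bHP - mP."""
    H, P = state
    return np.array([r * H - a * H * P, b * H * P - m * P])

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def observe_predators(n_samples=200, steps_per_sample=10, dt=0.01):
    """Simulate (H, P) but record only P, as if H were never measured."""
    state = np.array([30.0, 5.0])  # hypothetical initial hare and lynx counts
    observed = []
    for _ in range(n_samples):
        for _ in range(steps_per_sample):
            state = rk4_step(lotka_volterra, state, dt)
        observed.append(state[1])
    return np.array(observed)

P = observe_predators()
# Delay vectors (P(t-1), P(t)): the coordinates of the lag plot.
delay_vectors = np.column_stack([P[:-1], P[1:]])
```

Plotting the two columns of `delay_vectors` against each other gives a closed curve, much like the hidden (H, P) orbit, even though H was never recorded.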

Theoretical foundation

- But the delay vector space is a faithful reconstruction of the internal system state
- So prediction in delay vector space is as good as prediction in state space

[Figure: state space (P vs. H) and delay vector space (P(t) vs. P(t-1))]

Detailed Outline

- Non-linear forecasting
- Problem
- Idea
- How-to
- Experiments
- Conclusions

Datasets

Logistic Parabola: $x_t = a x_{t-1}(1 - x_{t-1}) + \text{noise}$

Models population of flies [R. May/1976]

ARIMA: fails

[Figure: x(t) vs. time, and the lag-plot]
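A small experiment shows why a linear model struggles here. The sketch assumes NumPy, sets the noise term to zero, and picks a = 3.8 and x0 = 0.3 as illustrative values; it compares a linear AR(1) fit (a much simpler stand-in for ARIMA) with a 1-nearest-neighbor lookup in the lag plot.

```python
import numpy as np

def logistic_series(n, a=3.8, x0=0.3):
    """x_t = a * x_{t-1} * (1 - x_{t-1}), noise term left out."""
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = a * x[t - 1] * (1.0 - x[t - 1])
    return x

def rmse(pred, actual):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))

x = logistic_series(1200)
train, test = x[:1000], x[1000:]

# Linear baseline: x_t ~ slope * x_{t-1} + intercept, fitted by least squares.
A = np.column_stack([train[:-1], np.ones(len(train) - 1)])
(slope, intercept), *_ = np.linalg.lstsq(A, train[1:], rcond=None)
ar_pred = slope * test[:-1] + intercept

# Lag-plot forecast: copy the successor of the closest past value.
prev, nxt = train[:-1], train[1:]
nn_pred = np.array([nxt[np.argmin(np.abs(prev - v))] for v in test[:-1]])

ar_rmse = rmse(ar_pred, test[1:])
nn_rmse = rmse(nn_pred, test[1:])
```

The lag plot is a parabola, so a straight line through it explains little of the variation, while the neighbor lookup follows the curve closely.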

Logistic Parabola

[Figure: Value vs. Timesteps, with the point marked "Our Prediction from here"]

Logistic Parabola

Comparison of prediction to correct values

[Figure: Value vs. Timesteps]

Datasets

LORENZ: Models convection currents in the air

$$\frac{dx}{dt} = a(y - x), \qquad \frac{dy}{dt} = x(b - z) - y, \qquad \frac{dz}{dt} = xy - cz$$

[Figure: Value of the series]

LORENZ

Comparison of prediction to correct values

[Figure: Value vs. Timesteps]

Datasets

- LASER: fluctuations in a Laser over time (used in Santa Fe competition)

[Figure: Value vs. Time]

Laser

Comparison of prediction to correct values

[Figure: Value vs. Timesteps]

Conclusions

- Lag plots for non-linear forecasting (Takens’ theorem)
- suitable for ‘chaotic’ signals

Additional projects at CMU

- Graph/Network mining
- spatio-temporal mining - outliers

Graph/network mining

- Internet; web; Gnutella P2P networks
- Q: Any pattern?
- Q: how to generate ‘realistic’ topologies?
- Q: how to define/verify realism?

Patterns?

- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!

[Figure: count vs. degree, with the average (3.3) marked]

Patterns?

- A: Power laws!

[Figure: log(count) vs. log {(out) degree}]

Other ‘laws’?

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]

RMAT, to generate realistic graphs

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Effective Diameter; Stress; “Network value”; Eigenvalue vs Rank]

Epidemic threshold?

- on a real graph, will a (computer / biological) virus die out? (given
- beta: probability that an infected node will infect its neighbor and
- delta: probability that an infected node will recover)
- NO? MAYBE? YES?
- A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
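The eigenvalue condition can be sketched in a few lines. The slide only says that the answer depends on the largest eigenvalue; the specific rule used below, that the infection dies out when beta/delta is below 1/λ1, is the threshold reported in [Wang+03] and is stated here as an assumption of the sketch. It assumes NumPy and an undirected graph (symmetric adjacency matrix); the function names are hypothetical.

```python
import numpy as np

def epidemic_threshold(adjacency):
    """1 / (largest eigenvalue) of a symmetric adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    largest = np.linalg.eigvalsh(A).max()  # eigvalsh: real eigenvalues of a symmetric matrix
    return 1.0 / largest

def dies_out(adjacency, beta, delta):
    """Assumed rule from [Wang+03]: die-out when beta/delta < 1/lambda_1."""
    return beta / delta < epidemic_threshold(adjacency)

# Complete graph on 4 nodes: largest eigenvalue is 3, so the threshold is 1/3.
K4 = np.ones((4, 4)) - np.eye(4)
threshold = epidemic_threshold(K4)
mild = dies_out(K4, beta=0.1, delta=0.5)      # 0.2 < 1/3
virulent = dies_out(K4, beta=0.3, delta=0.5)  # 0.6 > 1/3
```

Better-connected graphs have a larger leading eigenvalue and therefore a lower threshold, so the same virus survives more easily on them.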

Additional projects

- Graph mining
- spatio-temporal mining - outliers

Outliers - ‘LOCI’

- finds outliers quickly, with no human intervention

Conclusions

- AWSOM for automatic, linear forecasting
- MUSCLES for co-evolving sequences
- F4 for non-linear forecasting
- Graph/Network topology: power laws and generators; epidemic threshold
- LOCI for outlier detection

Conclusions (cont’d)

- Overarching theme: automatic discovery of patterns (outliers/rules) in
- time sequences (sensors/streams)
- graphs (computer/social networks)
- multimedia (video, motion capture data etc.)

www.cs.cmu.edu/~christos

Books

- William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
- C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

Books (cont’d)

- George E.P. Box, Gwilym M. Jenkins and Gregory C. Reinsel: Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)
- P. J. Brockwell and R. A. Davis: Time Series: Theory and Methods, Springer Verlag, New York, 1987.

Resources: software and urls

- MUSCLES: Prof. Byoung-Kee Yi:
http://www.postech.ac.kr/~bkyi/

- AWSOM & LOCI: [email protected]
- F4, RMAT: [email protected]

Additional Reading

- [Chakrabarti+02] Deepayan Chakrabarti and Christos Faloutsos: F4: Large-Scale Automated Forecasting using Fractals. CIKM 2002, Washington DC, Nov. 2002.
- [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994: 161-172
- [Gilbert+01] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan and Martin Strauss: Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB 2001

Additional Reading (cont’d)

- Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining. VLDB 2003, Berlin, Germany, Sept. 2003
- Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos: LOCI: Fast Outlier Detection Using the Local Correlation Integral. ICDE 2003, Bangalore, India, March 5 - March 8, 2003.
- [Sauer94] Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.

Additional Reading (cont’d)

- [Takens81] Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.
- [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint. 22nd Symposium on Reliable Distributed Computing (SRDS2003), Florence, Italy, Oct. 6-8, 2003

Additional Reading (cont’d)

- Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)
- [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)
