
StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time



StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. Pankaj Kumar Madhukar, Rakesh Kumar Singh, Puspendra Kumar. Project Instructor: Prof. P.K. Reddy.


Presentation Transcript



StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time

Pankaj Kumar Madhukar

Rakesh Kumar Singh

Puspendra Kumar

Project Instructor:

Prof P.K.Reddy



Goal

  • Given tens of thousands of high-speed time series data streams, detect high-value correlations, both synchronized and time-lagged, over sliding windows in real time.

  • Real time

    • high update frequency of the data stream

    • fixed response time, online



Our approach

  • Naive algorithm

    • N : number of streams

    • w : size of sliding window

    • Either space O(N) and time O(N^2 w), or space O(N^2) and time O(N^2).

  • Suppose that the streams are updated every second.

    • With a Pentium 4 PC, the exact computing method can only monitor 700 streams with a delay of 2 minutes.

  • Our Approach

    • Using Discrete Fourier Transform to approximate correlation

    • Using grid structure to filter out unlikely pairs

    • Our approach can monitor 10,000 streams with a delay of 2 minutes.
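
As a point of reference for the naive algorithm above, here is a minimal Python sketch of the exact method: it recomputes the Pearson correlation of every pair of sliding windows, which costs O(N^2 w) per report. The function names, the threshold parameter, and the handling of constant windows are assumptions made for illustration; they are not taken from the slides.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Exact Pearson correlation of two equal-length windows, O(w) time."""
    w = len(x)
    mean_x = sum(x) / w
    mean_y = sum(y) / w
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant windows as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def naive_correlated_pairs(windows, threshold):
    """Check all N(N-1)/2 pairs of windows: O(N^2 * w) work per report."""
    pairs = []
    for i, j in combinations(range(len(windows)), 2):
        if pearson(windows[i], windows[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```

Calling `naive_correlated_pairs(windows, 0.9)` on a list of equal-length windows returns the index pairs whose correlation reaches the threshold. The quadratic pair loop is the cost that the DFT digests and the grid structure are meant to avoid.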



Roadmap

  • Goal

  • StatStream

    • Data Structure

    • Correlation Approximation

    • Grid structure

  • Empirical study

  • Future work



Stream synoptic data structure

  • Three level time interval hierarchy

    • Time point, Basic window, Sliding window

  • Basic window (the key to our technique)

    • The computation for basic window i must finish by the end of the basic window i+1

    • The basic window time is the system response time.

  • Digests

    • Basic window digests: sum, DFT coefficients

    • Sliding window digests: sum, DFT coefficients
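
The three-level hierarchy can be sketched as follows. This is a simplified illustration that keeps only the sum digest (the DFT coefficients are omitted); the class name `StreamDigest` and its parameters are hypothetical, not taken from the slides. The sliding-window sum is maintained from basic-window sums, so raw time points can be discarded once their basic window is complete.

```python
from collections import deque


class StreamDigest:
    """Per-stream sums kept at basic-window granularity (hypothetical class)."""

    def __init__(self, basic_size, num_basic):
        self.basic_size = basic_size               # b: time points per basic window
        self.current = []                          # points of the unfinished basic window
        self.basic_sums = deque(maxlen=num_basic)  # one sum per finished basic window
        self.sliding_sum = 0.0                     # sum over the whole sliding window

    def append(self, value):
        """Add one time point; return True when a basic window was completed."""
        self.current.append(value)
        if len(self.current) < self.basic_size:
            return False
        new_sum = sum(self.current)
        if len(self.basic_sums) == self.basic_sums.maxlen:
            self.sliding_sum -= self.basic_sums[0]  # oldest basic window expires
        self.basic_sums.append(new_sum)             # deque drops the oldest entry
        self.sliding_sum += new_sum
        self.current = []
        return True
```

With `StreamDigest(basic_size=2, num_basic=2)`, the sliding window covers four time points and is refreshed once every two points, which mirrors the idea that the basic window length sets the response time.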





Synchronized Correlation Uses Basic Windows

  • Inner-product of aligned basic windows

[Figure: Stream x and Stream y, each divided into basic windows within the sliding window]
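
One way to read "inner product of aligned basic windows" is that the sliding-window correlation can be assembled from per-basic-window statistics. The sketch below assumes each basic window keeps the sums, sums of squares, and the inner product of the aligned points; the function names are hypothetical, and the exact set of statistics kept by StatStream is not stated on the slide.

```python
import math


def basic_window_stats(x, y):
    """Statistics for one aligned pair of basic windows."""
    return (
        sum(x),
        sum(y),
        sum(a * a for a in x),
        sum(b * b for b in y),
        sum(a * b for a, b in zip(x, y)),  # inner product of the aligned windows
        len(x),
    )


def sliding_correlation(x, y, b):
    """Pearson correlation over the sliding window, built from basic windows of size b."""
    stats = [basic_window_stats(x[i:i + b], y[i:i + b]) for i in range(0, len(x), b)]
    sx, sy, sxx, syy, sxy, n = (sum(column) for column in zip(*stats))
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    if var_x <= 0 or var_y <= 0:
        return 0.0  # assumption: treat constant windows as uncorrelated
    return cov / math.sqrt(var_x * var_y)
```

Because the per-window statistics simply add up, sliding the window by one basic window only requires adding the newest statistics and subtracting the oldest ones.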


[Figure: time series x1..x8 and y1..y8 with basis functions f1, f2, f3 sampled at time points 1 to 8]

Approximate Synchronized Correlation

  • Approximate with an orthogonal function family (e.g. DFT)

  • Inner product of the time series ≈ inner product of the digests

  • The time and space complexity is reduced from O(b) to O(n).

    • b : size of basic window

    • n : size of the digests (n<<b)

  • e.g. 120 time points reduce to 4 digests
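
The following sketch illustrates the approximation on a single window, assuming the usual construction: each series is normalized to zero mean and unit length, so that correlation equals 1 - d^2/2 for the Euclidean distance d, and d is then estimated from the first n coefficients of a unitary DFT. The function names are hypothetical, the DFT is computed directly rather than incrementally from basic windows, and n is assumed to be smaller than half the window length.

```python
import cmath
import math


def normalize(x):
    """Shift to zero mean and scale to unit Euclidean length (non-constant input)."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def dft_digest(x_hat, n):
    """First n non-DC coefficients of the unitary DFT (the DC term is 0 here)."""
    w = len(x_hat)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / w) for t, v in enumerate(x_hat))
        / math.sqrt(w)
        for k in range(1, n + 1)
    ]


def approx_correlation(dx, dy):
    """Estimate correlation from two digests of equal size n < w/2."""
    # Real input: each kept coefficient has a conjugate twin, hence the factor 2.
    d2 = 2 * sum(abs(a - b) ** 2 for a, b in zip(dx, dy))
    return 1 - d2 / 2
```

Since the kept coefficients can only underestimate the true distance, this estimate never falls below the exact correlation. That makes it safe to use as a filter: a pair rejected by the estimate cannot be a truly correlated pair.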




Approximate Lagged Correlation

  • Inner-product with unaligned windows

  • The time complexity is reduced from O(b) to O(n^2), as opposed to O(n) for synchronized correlation.





Grid Structure (to avoid checking all pairs)

  • The DFT coefficients yield a vector.

  • High correlation => closeness in the vector space

    • We can use a grid structure and look only in the neighborhood of each stream; this returns a superset of the highly correlated pairs.
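
A minimal sketch of the grid filter, under the assumption that each stream is represented by a short real-valued feature vector (for example, the real and imaginary parts of its first DFT coefficients) and that the cell side equals the distance threshold. The function name `candidate_pairs` and the dictionary input are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def candidate_pairs(features, cell_size):
    """features: dict mapping stream id -> tuple of floats (all the same length).

    Returns every pair of streams lying in the same or adjacent grid cells.
    Any two vectors within distance cell_size of each other are guaranteed
    to be included, so the result is a superset of the close pairs.
    """
    grid = defaultdict(list)
    for sid, vec in features.items():
        cell = tuple(math.floor(v / cell_size) for v in vec)
        grid[cell].append(sid)

    pairs = set()
    for cell, members in grid.items():
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for a in members:
                for b in grid.get(neighbor, ()):  # .get avoids creating empty cells
                    if a < b:
                        pairs.add((a, b))
    return pairs
```

Only the returned candidates need an actual correlation check. The number of neighboring cells grows as 3^d with the vector dimension d, so this sketch is practical only for short feature vectors.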





Empirical Study

  • Response time

    • Exact (naïve method): T = k0 · b · N^2



Empirical Study

  • DFT-grid:

    • Updating digests: T1 = k1 · b · N

    • Detecting correlation: T2 = k2 · N^2
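
Taking the cost models on these slides at face value, the ratio of the exact response time to the DFT-grid response time can be worked out as below. The constants k0, k1, and k2 are the ones named on the slides; their values are not given, so this only shows the shape of the speedup.

```latex
\frac{T_{\text{exact}}}{T_1 + T_2}
  = \frac{k_0 \, b \, N^2}{k_1 \, b \, N + k_2 \, N^2}
  = \frac{k_0 \, b}{k_1 \, b / N + k_2}
  \;\longrightarrow\; \frac{k_0 \, b}{k_2} \quad (N \to \infty)
```

Under this model the advantage over the exact method grows with the number of streams N and, for large N, is proportional to the basic window size b.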



Empirical Study (cont.)

  • Approximation errors

    • Larger digests, a larger sliding window, and a smaller basic window give a better approximation.

    • The approximation errors are small for the stock data.

  • Precision: the quality of the grid structure





Future work

  • Algorithmic:

    • dynamic clustering of streams

    • outlier detection

      • a stream that becomes less correlated with the other streams in its cluster.

  • Applications:

    • Data-intensive applications requiring correlation among many streams.

    • Network Traffic Monitoring:

      • Unusually high correlation between two links in a network might suggest an anomaly.

    • Medical Time Series:

      • High correlation between two regions of the human brain during fMRI testing might suggest a functional connection.

    • A domain-specific definition of correlation might be more appropriate.

      • E.g., in fMRI time series, detrending before correlating.

