
BRAID: Stream Mining through Group Lag Correlations

Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos. SIGMOD 2005.


Presentation Transcript


  1. BRAID: Stream Mining through Group Lag Correlations • Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos • SIGMOD 2005

  2. Introduction • Lag correlations: for example, higher amounts of fluoride in water → fewer dental cavities some years later • Goal: monitor multiple numerical streams and determine which pairs are correlated with a lag, and what the value of that lag is

  3. Introduction • Given k numerical sequences X1, …, Xk, report all pairs Xi and Xj such that Xi follows Xj with lag l

  4. Introduction

  5. Introduction • This paper proposes BRAID, which handles data streams of semi-infinite length • Any-time processing, and fast • Nimble • Accurate • Small resource consumption

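  6. Proposed method • Data stream X: {x1, …, xt, …, xn}, where xn is the most recent value • R(0): correlation of X and Y, which have the same length n, at zero lag • Correlation coefficient ρ: $\rho = \frac{\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})}{\sqrt{\sum_{t=1}^{n}(x_t-\bar{x})^2}\,\sqrt{\sum_{t=1}^{n}(y_t-\bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of X and Y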
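As a minimal sketch of the zero-lag case, the correlation coefficient of two equal-length sequences can be computed directly from its definition. The function name `pearson` and the choice to return 0.0 for a constant sequence are assumptions made for this illustration, not something the slides specify.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products and centered squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant sequence as uncorrelated
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(x, y)` on the full sequences corresponds to R(0); the lagged case applies the same formula to the overlapping part only.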

  7. Proposed method • For lag l, consider only the common part of X and the shifted Y, which spans n − l time ticks

  8. Proposed method

  9. Proposed method • R(l): correlation coefficient when X is delayed by l • Score at lag l:

  10. Proposed method • For large lags, l ≈ n, the original and shifted time sequences overlap in too few points for R(l) to be meaningful • Restrict the maximum lag m to n/2

  11. Proposed method • Naive solution: • At time n, access all values of X and Y and compute R(l) for every lag l (= 0, 1, …) • Choose the earliest maximum score above the threshold r, or report that there is no lag • The proposed solution is based on three major steps
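The naive solution can be sketched as follows. Two points are assumptions, because the transcript does not preserve the formulas: the score is taken to be |R(l)|, and "earliest maximum" is read as the earliest local maximum of the score. The names `lag_corr` and `naive_lag` are hypothetical.

```python
import math

def lag_corr(x, y, lag):
    """R(lag): correlation of x[lag:] with y[:n-lag], i.e. x delayed by lag."""
    xs, ys = x[lag:], y[:len(y) - lag]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: constant overlap counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def naive_lag(x, y, threshold):
    """Earliest lag whose score is a local maximum above threshold, else None."""
    max_lag = len(x) // 2  # maximum lag m = n/2, as on the previous slide
    # Assumed score: absolute value of the lagged correlation.
    scores = [abs(lag_corr(x, y, lag)) for lag in range(max_lag + 1)]
    for lag, score in enumerate(scores):
        left = scores[lag - 1] if lag > 0 else -1.0
        right = scores[lag + 1] if lag < max_lag else -1.0
        if score > threshold and score >= left and score >= right:
            return lag
    return None
```

Every call rescans both sequences for every lag, so the cost per query grows quadratically with n. That is the cost the sufficient statistics and the sparse lag probing are meant to avoid.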

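  12. Proposed method • Some sufficient statistics are needed so that R can be computed easily • $S_x(i,j) = \sum_{t=i}^{j} x_t$: sum of X over ticks i to j, e.g. $S_x(1,n)$ over the full length n • $S_{xx}(i,j) = \sum_{t=i}^{j} x_t^2$: sum of squares of X over the same range • $S_{xy}(l) = \sum_{t=l+1}^{n} x_t\,y_{t-l}$: inner product of X and Y shifted by lag l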

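  13. Proposed method • R(l) is obtained as: $R(l) = \frac{C(l)}{\sqrt{V_x(l+1,n)\,V_y(1,n-l)}}$, where $C(l) = S_{xy}(l) - \frac{S_x(l+1,n)\,S_y(1,n-l)}{n-l}$, $V_x(l+1,n) = S_{xx}(l+1,n) - \frac{S_x(l+1,n)^2}{n-l}$, and $V_y(1,n-l) = S_{yy}(1,n-l) - \frac{S_y(1,n-l)^2}{n-l}$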
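A sketch of how R(l) for one fixed lag could be maintained incrementally from running sums, so that no past values of X need to be rescanned. The class name `LagStats` and its layout are assumptions for illustration. This simple version still buffers the last l values of Y, which is the space cost the smoothing levels later address.

```python
import math
from collections import deque

class LagStats:
    """Running sums for R(lag) over the pairs (x_t, y_{t-lag})."""

    def __init__(self, lag):
        self.lag = lag
        self.recent_y = deque(maxlen=lag + 1)  # holds y_{t-lag} .. y_t
        self.count = 0                         # number of pairs, n - lag
        self.sx = self.sxx = 0.0               # sum and sum of squares of x
        self.sy = self.syy = 0.0               # sum and sum of squares of y
        self.sxy = 0.0                         # sum of x_t * y_{t-lag}

    def update(self, x_t, y_t):
        self.recent_y.append(y_t)
        if len(self.recent_y) < self.lag + 1:
            return  # y_{t-lag} does not exist yet
        y_old = self.recent_y[0]
        self.count += 1
        self.sx += x_t
        self.sxx += x_t * x_t
        self.sy += y_old
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        m = self.count
        if m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / m
        var_x = self.sxx - self.sx ** 2 / m
        var_y = self.syy - self.sy ** 2 / m
        if var_x <= 0 or var_y <= 0:
            return 0.0  # assumption: degenerate variance counts as uncorrelated
        return cov / math.sqrt(var_x * var_y)
```

Each `update` costs constant time, and `correlation` can be called at any tick, which matches the any-time requirement stated in the introduction.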

  14. Proposed method • R(l) can be estimated at any point in time by keeping track of only five sufficient statistics • It still takes linear time to compute the whole cross-correlation function between two sequences

  15. Proposed method • Keep track of only a geometric progression of lag values: l = 0, 1, 2, 4, …, 2^i, … • Only O(log n) lags to track, instead of the O(n) that the naive solution requires • However, the space required still grows linearly with the length n
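The geometric progression of lags can be enumerated with a short helper. The function name `geometric_lags` is hypothetical, and the cap at n/2 follows the maximum-lag restriction from slide 10.

```python
def geometric_lags(n):
    """Lags 0, 1, 2, 4, ... up to the maximum lag m = n // 2."""
    max_lag = n // 2
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags

print(geometric_lags(64))  # [0, 1, 2, 4, 8, 16, 32]
```

For n = 2^20 this yields 21 lags, compared with the roughly half a million lags that the naive solution probes.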

  16. Proposed method • To compute R(l) at any time, a sliding window of size l must be kept; with m = n/2 this needs O(n) space • Instead of operating only on the original time sequence, also compute smoothed versions of it by averaging over non-overlapping windows

  17. Proposed method • Window size: powers of g = 2 • X: original time sequence • A_x^h: smoothed version with windows of length 2^h • A_x^0: the original sequence; A_x^1: consists of n/2 ticks; etc. • The sufficient statistics of A_x^h need to be updated only every 2^h time ticks • At time n, O(log n) levels are needed, and the sufficient statistics are computed for each level
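A batch sketch of the smoothing hierarchy, assuming that each level averages non-overlapping pairs of the level below, which gives windows of length 2^h at level h. The function name `smooth_levels` is hypothetical. A streaming implementation would instead emit one new level-h value every 2^h ticks.

```python
def smooth_levels(x):
    """Return [A^0, A^1, ...]; A^h averages non-overlapping windows of length 2**h."""
    levels = [list(x)]  # level 0 is the original sequence
    while len(levels[-1]) >= 2:
        prev = levels[-1]
        # Average adjacent pairs; a trailing incomplete window is dropped.
        levels.append([(prev[i] + prev[i + 1]) / 2
                       for i in range(0, len(prev) - 1, 2)])
    return levels

for h, level in enumerate(smooth_levels([1, 3, 5, 7, 9, 11, 13, 15])):
    print(h, level)
# 0 [1, 3, 5, 7, 9, 11, 13, 15]
# 1 [2.0, 6.0, 10.0, 14.0]
# 2 [4.0, 12.0]
# 3 [8.0]
```

Because level h has about n / 2^h values, a lag of 2^h in the original sequence corresponds to a lag of one tick at level h, so only a constant amount of recent history is needed per level.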

  18. Proposed method • In contrast to small lags, the larger lags are probed sparsely • Use a cubic spline to interpolate the missing correlation coefficients
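To illustrate the interpolation step, here is a small pure-Python cubic spline through the probed lags. The slides only say "cubic spline"; the natural boundary condition (zero second derivative at both ends) and the name `natural_cubic_spline` are assumptions of this sketch.

```python
import bisect

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    if n < 1:
        raise ValueError("need at least two knots")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the second-derivative terms.
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution gives the polynomial coefficients per interval.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the interval containing x, clamped to the first and last one.
        j = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate
```

In use, `xs` would be the probed lags (for example 0, 1, 2, 4, 8, 16) and `ys` the correlation values estimated at those lags; the returned function then gives an estimate at any lag in between.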

  19. Proposed method • A_x^h(t): window average at time tick t for level h • A_x^0(t) ≡ x_t

  20. Proposed method • Sufficient statistics:

  21. Enhanced BRAID • For two sequences of size ≈ 2^20, BRAID requires about 5 · log2(2^20) = 5 · 20 = 100 floating-point numbers, about 800 bytes • When more memory is available, propose a solution that probes more lags but still uses O(log n) space • Use a mix of arithmetic and geometric probing
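The memory estimate on this slide can be reproduced with a few lines of arithmetic. The 8 bytes per number is an assumption inferred from "100 float numbers, about 800 bytes", and the function name `braid_memory_bytes` is hypothetical.

```python
def braid_memory_bytes(n, stats_per_level=5, bytes_per_float=8):
    """Rough footprint: five sufficient statistics per smoothing level."""
    levels = (n - 1).bit_length()  # ceil(log2(n)) for n >= 1
    floats = stats_per_level * levels
    return floats, floats * bytes_per_float

print(braid_memory_bytes(2 ** 20))  # (100, 800)
```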

  22. Enhanced BRAID • Basic BRAID uses only one window at each smoothing level • Propose using b > 1 windows per level, e.g. b = 4 • The earlier algorithm corresponds to b = 1; the exception is the bottom level, which has 2b coefficients • When computing R(l), use a mixture of geometric and arithmetic progressions:

  23. Enhanced BRAID • Example of enhanced BRAID with b = 4 • With b = 1, the enhanced algorithm is the same as the earlier algorithm
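The exact probing formula is not preserved in this transcript, so the following is only one plausible reading of "a mixture of geometric and arithmetic progressions". It assumes that the bottom level probes lags 0 to 2b − 1 with step 1, and that each level h ≥ 1 probes b lags from b·2^h to (2b − 1)·2^h with step 2^h. The function name `enhanced_lags` is hypothetical.

```python
def enhanced_lags(max_lag, b=4):
    """Probed lags under the assumed scheme: arithmetic within a level,
    geometric across levels."""
    # Bottom level: 2b consecutive lags (assumption based on slide 22).
    lags = list(range(min(2 * b, max_lag + 1)))
    h = 1
    while b * 2 ** h <= max_lag:
        step = 2 ** h
        for k in range(b, 2 * b):
            lag = k * step
            if lag <= max_lag:
                lags.append(lag)
        h += 1
    return lags

print(enhanced_lags(32, b=4))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32]
print(enhanced_lags(16, b=1))
# [0, 1, 2, 4, 8, 16]
```

Under this reading, b = 1 reproduces the purely geometric lags 0, 1, 2, 4, …, which agrees with the statement that b = 1 gives the earlier algorithm. The number of probed lags stays at roughly b per level, so space remains O(log n) for a fixed b.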

  24. Conclusion • Proposed BRAID to detect lag correlations in streaming data • At any time • Low resource consumption • High accuracy

  25. Thank you very much~
