
# Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping

##### Presentation Transcript

1. Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping • Presented by John Clark, March 24, 2014

2. Paper Summary • Describes algorithmic changes to existing Dynamic Time Warping (DTW) calculations to increase search efficiency • Focuses upon time-series data but demonstrates application to related data mining problems • Allows unprecedented levels of data to be searched quickly • Attempts to correct erroneous belief that DTW is too slow for general data mining

3. Background • Time Series Data and Queries • Example Time Series • Dynamic Time Warping (DTW)

4. Time Series Data and Queries • Time Series (T) • An ordered list of data points: $T = t_1, t_2, \ldots, t_m$ • Contains shorter subsequences • Subsequence ($T_{i,k}$) • Contiguous subset of time series data • Starts at position $i$ of original series $T$ with length $k$ • $T_{i,k} = t_i, t_{i+1}, \ldots, t_{i+k-1}$, where $1 \le i \le m-k+1$ • Candidate Subsequence (C) • Subsequence of T to match against known query • $|C| = k$ • Query (Q) • Time series input • $|Q| = n$ • Euclidean Distance (ED) • Distance between Q and C where $|Q| = |C|$ • $ED(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
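The definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `subsequence` and `euclidean_distance` are made up for this example, and the 1-based position `i` from the slide is mapped onto Python's 0-based indexing.

```python
import math


def subsequence(t, i, k):
    """Return T_{i,k}: the k points of t starting at 1-based position i."""
    if not 1 <= i <= len(t) - k + 1:
        raise ValueError("i must satisfy 1 <= i <= m - k + 1")
    return t[i - 1 : i - 1 + k]


def euclidean_distance(q, c):
    """ED(Q, C) for two sequences of equal length."""
    if len(q) != len(c):
        raise ValueError("ED requires |Q| = |C|")
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))


t = [1, 2, 3, 4, 5]
print(subsequence(t, 2, 3))                # T_{2,3}
print(euclidean_distance([0, 0], [3, 4]))  # classic 3-4-5 triangle
```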

5. Example Time Series • Medical Data: EEG and ECG • Financial Data: stock prices, financial transactions • Web Data: clickstreams • Misc. Data: Video and audio sequences

6. Dynamic Time Warping (DTW) • ED is a one-to-one mapping of two sequences • DTW allows for non-linear mapping between two sequences • Formulation • Construct an $n \times n$ matrix • $(i, j)$ element $= ED(q_i, c_j)$ for points $q_i$ and $c_j$ • Apply path constraint: Sakoe-Chiba Band • Find warping path • Warping Path (P) • Contiguous set of matrix elements that defines a mapping between Q and C • $p_t = (i, j)_t$ • $P = p_1, p_2, \ldots, p_t, \ldots, p_T$, where $n \le T \le 2n - 1$
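A minimal dynamic-programming sketch of the formulation above, assuming equal-length sequences, squared point-wise cost, and a Sakoe-Chiba band given as a half-width `r` in samples. The name `dtw_distance` and the parameter `r` are choices made for this example, not taken from the slides.

```python
import math


def dtw_distance(q, c, r):
    """DTW between equal-length q and c, restricted to cells with |i - j| <= r."""
    n = len(q)
    if len(c) != n:
        raise ValueError("this sketch assumes |Q| = |C|")
    inf = math.inf
    # cost[i][j]: cheapest cumulative squared cost of aligning q[:i] with c[:j]
    cost = [[inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][n])


q = [0, 0, 1, 2, 1, 0]
c = [0, 1, 2, 1, 0, 0]          # same bump, shifted one step earlier
print(dtw_distance(q, c, r=1))  # the band allows the one-step shift
print(dtw_distance(q, c, r=0))  # r = 0 forces the diagonal, i.e. plain ED
```

With `r = 0` the only admissible path is the diagonal, which is why the second call reduces to the Euclidean distance.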

7. Paper • Claims • Assumptions • Known Optimizations • UCR Suite • Experiments • Additional Applications • Conclusions

8. Claims • Time series data mining bottleneck is similarity search time • Most time series work plateaus at millions of objects • Large datasets can be searched exactly with DTW more quickly than current state-of-the-art Euclidean distance search algorithms • The authors' tests used the largest set of time series data to date • Design applicable to other mining problems • Allow real-time monitoring • DTW myths abound • Exact search is faster than any current approximate or indexed searches

9. Assumptions • Time Series Subsequences must be Normalized • Dynamic Time Warping is the best measure • No known distance measure better than DTW after search of 800 papers • Arbitrary Query Lengths cannot be Indexed • No known techniques support similarity search of arbitrary lengths in billion+ datasets • There exist data mining problems for which we are willing to wait several hours for an answer

10. Time Series Subsequences must be Normalized • Intuitive idea but not always implemented • Example: analysis of video frames • normalized analysis error rate: 0.087 • non-normalized analysis error rates when offset and scaling of +/- 10% applied: 0.326 and 0.193 • analysis error rate off by at least 50% for offset/scale of +/- 5% using real data
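To illustrate why normalization matters, the sketch below z-normalizes a sequence and an offset-and-scaled copy of it; after normalization the two coincide, so any distance between them is zero. The helper name `z_normalize` and the handling of constant sequences (mapped to all zeros) are assumptions made for this example.

```python
import math


def z_normalize(x):
    """Shift x to zero mean and scale it to unit standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        # Assumption for this sketch: a constant sequence maps to all zeros.
        return [0.0] * n
    return [(v - mean) / std for v in x]


a = [1.0, 2.0, 3.0, 4.0]
b = [2 * v + 10 for v in a]  # same shape, different offset and scale
print(z_normalize(a))
print(z_normalize(b))
```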

11. Known Optimizations • Using Squared Distance • removes expensive square root computation without changing relative rankings • Lower Bounding • Early Abandoning of ED and LB_Keogh • Early Abandoning of DTW • Exploiting Multicores • linear speedup

12. Lower Bounding • Speed up sequential search by computing a lower bound and pruning unpromising candidates • LB_Kim (modified): O(1) • LB_Keogh: O(n)
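Both bounds can be sketched briefly. The version of LB_Kim below is a simplified variant that uses only the first and last points (it assumes sequences of length at least 2), and the LB_Keogh envelope is built with a band half-width `r`; all function names here are hypothetical. Both return squared values, so they bound the squared DTW distance.

```python
def envelope(x, r):
    """Lower and upper envelope of x for a warping band of half-width r."""
    n = len(x)
    lower = [min(x[max(0, i - r) : i + r + 1]) for i in range(n)]
    upper = [max(x[max(0, i - r) : i + r + 1]) for i in range(n)]
    return lower, upper


def lb_kim_fl(q, c):
    """O(1) bound: the first and last points must always be aligned with each other."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2


def lb_keogh(c, lower, upper):
    """O(n) bound: squared distance from c to the nearest edge of the envelope."""
    total = 0.0
    for ci, lo, up in zip(c, lower, upper):
        if ci > up:
            total += (ci - up) ** 2
        elif ci < lo:
            total += (lo - ci) ** 2
    return total


q = [0, 0, 1, 2, 1, 0]
lower, upper = envelope(q, r=1)
far = [5, 5, 5, 5, 5, 5]
print(lb_kim_fl(q, far), lb_keogh(far, lower, upper))
```

A candidate is pruned without running DTW whenever either bound already exceeds the best distance found so far.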

13. Early Abandoning of ED and LB_Keogh • Include a best-so-far (BSF) value to aid in early termination • If sum of squared differences exceeds BSF, terminate computation
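A small sketch of the early-abandoning idea for squared ED. The function name and the choice to return `math.inf` together with the number of points examined are assumptions for this illustration.

```python
import math


def early_abandon_sq_ed(q, c, bsf):
    """Squared ED that stops as soon as the running sum exceeds bsf.

    Returns (distance, points_examined); distance is math.inf when abandoned.
    """
    total = 0.0
    for steps, (qi, ci) in enumerate(zip(q, c), start=1):
        total += (qi - ci) ** 2
        if total > bsf:
            return math.inf, steps
    return total, len(q)


print(early_abandon_sq_ed([0, 0, 0], [1, 1, 1], bsf=1.5))   # abandoned after 2 points
print(early_abandon_sq_ed([0, 0, 0], [1, 1, 1], bsf=10.0))  # runs to completion
```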

14. Early Abandoning of DTW • Compute a full LB_Keogh lower bound • Compute DTW incrementally to form a new lower bound • intermediate lower bound = DTW($Q_{1:k}$, $C_{1:k}$) + LB_Keogh($Q_{k+1:n}$, $C_{k+1:n}$) • DTW($Q_{1:n}$, $C_{1:n}$) >= intermediate lower bound • If (BSF < intermediate lower bound), abandon DTW
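One way to realise this is sketched below, under several assumptions: squared costs, a band half-width `r`, and a precomputed array `tail_lb` where `tail_lb[j]` is the LB_Keogh contribution of the candidate points from index `j` onward. After each row of the DTW matrix, the cheapest partial path plus the bound for the still-unmatched candidate points is compared with the best-so-far. The names `keogh_tail` and `dtw_early_abandon` are hypothetical.

```python
import math


def keogh_tail(q, c, r):
    """tail[j] = sum of per-point LB_Keogh contributions of c[j:] (tail[n] = 0)."""
    n = len(q)
    tail = [0.0] * (n + 1)
    for j in range(n - 1, -1, -1):
        window = q[max(0, j - r) : j + r + 1]
        up, lo = max(window), min(window)
        d = (c[j] - up) ** 2 if c[j] > up else (lo - c[j]) ** 2 if c[j] < lo else 0.0
        tail[j] = tail[j + 1] + d
    return tail


def dtw_early_abandon(q, c, r, bsf, tail_lb):
    """Squared banded DTW; returns math.inf once the running lower bound exceeds bsf."""
    n = len(q)
    inf = math.inf
    prev = [inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [inf] * (n + 1)
        lo, hi = max(1, i - r), min(n, i + r)
        for j in range(lo, hi + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        # Row i has touched c[0:hi]; the points c[hi:] still have to be matched.
        if min(cur[lo : hi + 1]) + tail_lb[hi] > bsf:
            return inf
        prev = cur
    return prev[n]


q = [0, 0, 1, 2, 1, 0]
far = [5, 5, 5, 5, 5, 5]
tail = keogh_tail(q, far, r=1)
print(dtw_early_abandon(q, far, 1, bsf=10.0, tail_lb=tail))      # abandoned in row 1
print(dtw_early_abandon(q, far, 1, bsf=math.inf, tail_lb=tail))  # full computation
```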

15. UCR Suite • Early Abandoning Z-normalization • Reordering Early Abandoning • Reversing Query/Data Role in LB_Keogh • Cascading Lower Bounds

16. Early Abandoning Z-Normalization • Normalization takes longer than computing the Euclidean Distance • Approach: interleave early abandoning of ED or LB_Keogh with online Z-normalization
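The interleaving can be sketched as a subsequence search that keeps running sums for the sliding window, derives the mean and standard deviation in O(1) per step, and normalizes each candidate point only when it is about to be compared. This is a simplified illustration: the function names are invented, near-constant windows fall back to a standard deviation of 1, and the running sums are never re-initialised, which a production implementation would do periodically to limit floating-point drift.

```python
import math


def z_normalize(x):
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / std for v in x]


def search_znorm_ed(t, q_norm):
    """Best-matching window of t for an already z-normalized query.

    Returns (0-based start index, squared ED of the best match).
    """
    n = len(q_norm)
    s = sum(t[:n])                  # running sum of the current window
    s2 = sum(v * v for v in t[:n])  # running sum of squares
    bsf, best = math.inf, -1
    for start in range(len(t) - n + 1):
        if start > 0:  # slide the window by one point in O(1)
            out, new = t[start - 1], t[start + n - 1]
            s += new - out
            s2 += new * new - out * out
        mean = s / n
        var = s2 / n - mean * mean
        std = math.sqrt(var) if var > 1e-12 else 1.0
        total = 0.0
        for k in range(n):
            x = (t[start + k] - mean) / std  # normalize just in time
            total += (x - q_norm[k]) ** 2
            if total >= bsf:                 # abandon: remaining points are never normalized
                break
        else:
            bsf, best = total, start
    return best, bsf


t = [0, 0, 0, 10, 12, 14, 12, 10, 0, 0]
q = z_normalize([1, 2, 3, 2, 1])
print(search_znorm_ed(t, q))  # the scaled, shifted copy starts at index 3
```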

17. Reordering Early Abandoning • Traditionally compute distance / normalization in time-series order (left to right) • Approach • sort the indices based on absolute values of Z-normalized Q • compute distance / normalization with new order
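A short sketch of the reordering step, assuming the indices are visited from the largest absolute value of the z-normalized query to the smallest (the points farthest from the mean are expected to contribute most to the distance). Function names are hypothetical.

```python
import math


def abandon_order(q_norm):
    """Indices of q_norm sorted by decreasing absolute value."""
    return sorted(range(len(q_norm)), key=lambda i: abs(q_norm[i]), reverse=True)


def ordered_sq_ed(q, c, order, bsf):
    """Squared ED accumulated in the given index order, with early abandoning.

    Returns (distance, points_examined); distance is math.inf when abandoned.
    """
    total = 0.0
    for steps, i in enumerate(order, start=1):
        total += (q[i] - c[i]) ** 2
        if total > bsf:
            return math.inf, steps
    return total, len(order)


q = [0.0, 0.1, -2.0, 0.2, 1.7]
c = [0.0, 0.0, 0.0, 0.0, 0.0]
order = abandon_order(q)
print(order)
print(ordered_sq_ed(q, c, order, bsf=3.0))           # reordered
print(ordered_sq_ed(q, c, list(range(5)), bsf=3.0))  # left to right
```

In this toy case the reordered scan abandons after one point, while the left-to-right scan needs three.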

18. Reversing the Query/Data Role in LB_Keogh • Normally LB_Keogh is computed around the query • only needs to be done once, which saves time and space • Proposal: compute lower bound on the candidate in a “just-in-time” fashion • calculate only if all other lower bounds fail to prune • removes space overhead • increased time overhead offset by increased pruning of full DTW calculations

19. Cascading Lower Bounds • Multiple options for lower bounds • LB_KimFL, LB_KeoghEQ, LB_KeoghEC, Early Abandoning DTW • The suggestion is to use all of them in a cascading fashion to maximize the amount of pruning • Can prune more than 99.9999% of DTW calculations
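The cascade can be sketched as a sequence of increasingly expensive checks, each of which may reject the candidate before the next one runs. In this illustration LB_KeoghEQ uses an envelope around the query (computed once) and LB_KeoghEC uses an envelope around the candidate (computed just in time); LB_KimFL is simplified to the first and last points, the final stage is a plain banded DTW rather than an early-abandoning one, and all names are hypothetical.

```python
import math


def envelope(x, r):
    n = len(x)
    lower = [min(x[max(0, i - r) : i + r + 1]) for i in range(n)]
    upper = [max(x[max(0, i - r) : i + r + 1]) for i in range(n)]
    return lower, upper


def lb_kim_fl(q, c):
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2


def lb_keogh(x, lower, upper):
    return sum(
        (v - up) ** 2 if v > up else (lo - v) ** 2 if v < lo else 0.0
        for v, lo, up in zip(x, lower, upper)
    )


def dtw_sq(q, c, r):
    n = len(q)
    cost = [[math.inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][n]


def cascade(q, c, r, bsf, q_env):
    """Return (squared distance or math.inf, name of the stage that decided)."""
    if lb_kim_fl(q, c) > bsf:
        return math.inf, "LB_KimFL"
    if lb_keogh(c, *q_env) > bsf:   # envelope around the query, built once
        return math.inf, "LB_KeoghEQ"
    c_env = envelope(c, r)          # envelope around the candidate, built only now
    if lb_keogh(q, *c_env) > bsf:
        return math.inf, "LB_KeoghEC"
    return dtw_sq(q, c, r), "DTW"


q = [0, 0, 1, 2, 1, 0]
q_env = envelope(q, 1)
for cand, bsf in [([5] * 6, 10), ([0, 5, 5, 5, 5, 0], 10),
                  ([0, 0, 0, 1, 0, 0], 0.5), ([0, 1, 2, 1, 0, 0], 10)]:
    print(cascade(q, cand, 1, bsf, q_env))
```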

20. Experiments • Tests • Random Walk Baseline • Supporting Long Queries: EEG • Supporting Very Long Queries: DNA • Realtime Medical and Gesture Data • Algorithms • Naive • Z-norm, ED / DTW at each step • State-of-the-art (SOTA) • Z-norm, early abandoning, LB_Keogh for DTW • UCR Suite • all speedups • GOd’s ALgorithm (GOAL) • only maintains mean and std. dev. online, O(1) per step • lower bound on fastest possible time

21. Random Walk Results

22. EEG Results

23. DNA Results

24. Real-time Medical and Gesture Data • 8,518,554,188 ECG datapoints sampled at 256 Hz

25. Application of UCR to Existing Mining Algorithms

26. Paper’s Discussion and Conclusions • Focused on fast sequential search • Believed to be faster than all known indexing searches • Shown that UCR-DTW is faster than all current Euclidean Distance Searches (SOTA-ED) • Reason: O(n) normalization step for each subsequence for ED; UCR-DTW weighted average is less than O(n) • Compare UCR method to recent SOTA embedding-based DTW search called EBSM

27. EBSM Comparison to UCR-DTW

28. Conclusions • Well written • easy to follow • clear distinction and explanation of modifications • thorough experimentation with available source and pseudo-code • Not terribly innovative but very effective • additions are straightforward and surprisingly intuitive • execution / integration of components make this algorithm stand out • Deferred a lot of explanations and theory of existing components to cited papers