Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping

**Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping**

Presented by John Clark, March 24, 2014

**Paper Summary**

• Describes algorithmic changes to existing Dynamic Time Warping (DTW) calculations to increase search efficiency
• Focuses on time-series data but demonstrates application to related data mining problems
• Allows unprecedented volumes of data to be searched quickly
• Attempts to correct the erroneous belief that DTW is too slow for general data mining

**Background**

• Time Series Data and Queries
• Example Time Series
• Dynamic Time Warping (DTW)

**Time Series Data and Queries**

• Time Series (T)
  • An ordered list of data points: T = t1, t2, ..., tm
  • Contains shorter subsequences
• Subsequence (Ti,k)
  • A contiguous subset of the time series
  • Starts at position i of the original series T, with length k
  • Ti,k = ti, ti+1, ..., ti+k-1, where 1 <= i <= m - k + 1
• Candidate Subsequence (C)
  • Subsequence of T to match against a known query
  • |C| = k
• Query (Q)
  • Time series input
  • |Q| = n
• Euclidean Distance (ED)
  • Distance between Q and C where |Q| = |C|
  • ED(Q, C) = sqrt( sum_{i=1..n} (qi - ci)^2 )

**Example Time Series**

• Medical data: EEG and ECG
• Financial data: stock prices, financial transactions
• Web data: clickstreams
• Misc. data: video and audio sequences

**Dynamic Time Warping (DTW)**

• ED is a one-to-one mapping between two sequences
• DTW allows a non-linear mapping between two sequences
• Formulation
  • Construct an n x n matrix whose (i, j) element is ED(qi, cj) for points qi and cj
  • Apply a path constraint: the Sakoe-Chiba band
  • Find the warping path
• Warping Path (P)
  • A contiguous set of matrix elements defining a mapping between Q and C
  • pt = (i, j)t; P = p1, p2, ..., pt, ..., pT, where n <= T <= 2n - 1

**Paper**

• Claims
• Assumptions
• Known Optimizations
• UCR Suite
• Experiments
• Additional Applications
• Conclusions

**Claims**

• The time series data mining bottleneck is similarity search time
• Most time series work plateaus at millions of objects
• Large datasets can be searched exactly with DTW more quickly than with current state-of-the-art Euclidean distance search algorithms
• The authors' tests used the largest set of time series data ever
• The design is applicable to other mining problems
• Allows real-time monitoring
• DTW myths abound
• Exact search is faster than any current approximate or indexed search

**Assumptions**

• Time series subsequences must be normalized
• Dynamic Time Warping is the best measure
  • No known distance measure beats DTW, after a search of 800 papers
• Arbitrary query lengths cannot be indexed
  • No known technique supports similarity search of arbitrary lengths in billion+ datasets
• There exist data mining problems we are willing to wait several hours to answer

**Time Series Subsequences must be Normalized**

• Intuitive idea, but not always implemented
• Example: analysis of video frames
  • Normalized analysis error rate: 0.087
  • Non-normalized error rates with offset and scaling of +/- 10% applied: 0.326 and 0.193
  • Error rate off by at least 50% for an offset/scale of +/- 5% on real data

**Known Optimizations**

• Using squared distance
  • Removes the expensive square root computation without changing relative rankings
• Lower bounding
• Early abandoning of ED and LB_Keogh
• Early abandoning of DTW
• Exploiting multicores
  • Linear speedup

**Lower Bounding**

• Speed up sequential search by computing a cheap lower bound and pruning unpromising candidates
• LB_Kim (modified): O(1)
• LB_Keogh: O(n)

**Early Abandoning of ED and LB_Keogh**

• Maintain a best-so-far (BSF) value to aid early termination
• If the running sum of squared differences exceeds the BSF, terminate the computation

**Early Abandoning of DTW**

• Compute a full LB_Keogh lower bound
• Compute DTW incrementally to form a new lower bound
  • Intermediate lower bound = DTW(Q1:k, C1:k) + LB_Keogh(Qk+1:n, Ck+1:n)
  • DTW(Q1:n, C1:n) >= intermediate lower bound
• If BSF < intermediate lower bound, abandon the DTW computation

**UCR Suite**

• Early abandoning Z-normalization
• Reordering early abandoning
• Reversing the query/data role in LB_Keogh
• Cascading lower bounds

**Early Abandoning Z-Normalization**

• Normalization takes longer than computing the Euclidean distance
• Approach: interleave early abandoning of ED or LB_Keogh with online Z-normalization

**Reordering Early Abandoning**

• Traditionally, distance/normalization is computed in time-series order (left to right)
• Approach
  • Sort the indices by the absolute values of the Z-normalized Q
  • Compute distance/normalization in the new order

**Reversing the Query/Data Role in LB_Keogh**

• Normally, LB_Keogh is computed around the query
  • Only needs to be done once, saving time and space
• Proposal: also compute a lower bound around the candidate, in a "just-in-time" fashion
  • Calculated only if all other lower bounds fail to prune
  • Removes the space overhead
  • The added time overhead is offset by increased pruning of full DTW calculations

**Cascading Lower Bounds**

• Multiple options for lower bounds: LB_KimFL, LB_KeoghEQ, LB_KeoghEC, early abandoning DTW
• The suggestion is to use all of them in a cascade to maximize the amount of pruning
• Can prune more than 99.9999% of DTW calculations

**Experiments**

• Tests
  • Random walk baseline
  • Supporting long queries: EEG
  • Supporting very long queries: DNA
  • Real-time medical and gesture data
• Algorithms
  • Naive: Z-norm, ED/DTW at each step
  • State-of-the-art (SOTA): Z-norm, early abandoning, LB_Keogh for DTW
  • UCR Suite: all speedups
  • God's Algorithm (GOAL): only maintains the mean and std. dev. online in O(1); a lower bound on the fastest possible time

**Real-time Medical and Gesture Data**

• 8,518,554,188 ECG datapoints sampled at 256 Hz

**Paper's Discussion and Conclusions**

• Focused on fast sequential search
• Believed to be faster than all known indexed searches
• Shown that UCR-DTW is faster than all current Euclidean distance searches (SOTA-ED)
  • Reason: ED requires an O(n) normalization step for each subsequence, while UCR-DTW's weighted-average cost is less than O(n)
• Compares the UCR method to a recent SOTA embedding-based DTW search called EBSM

**Conclusions**

• Well written
  • Easy to follow
  • Clear distinction and explanation of the modifications
  • Thorough experimentation, with source code and pseudo-code available
• Not terribly innovative, but very effective
  • The additions are straightforward and surprisingly intuitive
  • The execution and integration of the components make this algorithm stand out
• Defers much of the explanation and theory of existing components to cited papers
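To make the DTW formulation in the slides concrete, here is a minimal Python sketch of band-constrained DTW (the matrix-plus-Sakoe-Chiba-band recurrence described above). The function name and structure are illustrative, not taken from the paper's released code:

```python
import math

def dtw(q, c, w):
    """DTW distance between equal-length sequences q and c, restricted
    to a Sakoe-Chiba band of half-width w around the diagonal."""
    n = len(q)
    INF = float("inf")
    # (n+1) x (n+1) cumulative-cost matrix; D[i][j] = cost of aligning q[:i] with c[:j]
    D = [[INF] * (n + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - w), min(n, i + w)  # band constraint on j
        for j in range(lo, hi + 1):
            d = (q[i - 1] - c[j - 1]) ** 2     # squared distance, per the paper's optimization
            D[i][j] = d + min(D[i - 1][j],     # insertion
                              D[i][j - 1],     # deletion
                              D[i - 1][j - 1]) # match
    return math.sqrt(D[n][n])
```

Note that with `w = 0` the band forces the diagonal path and the result degenerates to the Euclidean distance, which is one way to see that DTW generalizes ED.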
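The early-abandoning ED idea above can be sketched as follows. This is a simplified illustration: the paper maintains the running mean and standard deviation incrementally across the sliding window, whereas here they are computed up front and only the per-point normalization is interleaved with the abandoning distance loop. Function names are my own:

```python
import math

def znorm(s):
    """Z-normalize a sequence to zero mean and unit variance."""
    m = sum(s) / len(s)
    sd = math.sqrt(sum(x * x for x in s) / len(s) - m * m)
    return [(x - m) / sd for x in s]

def early_abandon_ed(q_norm, c, best_so_far):
    """ED between an already-normalized query and a raw candidate,
    abandoning as soon as the partial sum exceeds best_so_far.
    Assumes c is not constant (std > 0)."""
    n = len(c)
    mean = sum(c) / n
    std = math.sqrt(sum(x * x for x in c) / n - mean * mean)
    bsf_sq = best_so_far ** 2        # compare squared sums, avoiding sqrt per step
    total = 0.0
    for i in range(n):
        x = (c[i] - mean) / std      # normalize one point at a time, on demand
        total += (x - q_norm[i]) ** 2
        if total > bsf_sq:           # early abandon: cannot beat the best-so-far
            return float("inf")
    return math.sqrt(total)
```

The paper's reordering optimization would additionally visit `i` in decreasing order of `abs(q_norm[i])` rather than left to right, so large contributions accumulate first and abandoning happens sooner.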
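Finally, a sketch of the LB_Keogh lower bound used throughout the cascade: build an upper/lower envelope around the query under the Sakoe-Chiba band and sum the squared excursions of the candidate outside that envelope. For clarity this version recomputes the envelope per point rather than precomputing it once, as an efficient implementation would:

```python
def lb_keogh(q_norm, c_norm, w):
    """LB_Keogh lower bound on squared DTW(q, c) under a band of half-width w.
    Both sequences are assumed z-normalized and of equal length."""
    n = len(q_norm)
    total = 0.0
    for i in range(n):
        window = q_norm[max(0, i - w):min(n, i + w + 1)]
        lo, hi = min(window), max(window)  # envelope around the query at position i
        x = c_norm[i]
        if x > hi:                         # candidate escapes above the envelope
            total += (x - hi) ** 2
        elif x < lo:                       # candidate escapes below the envelope
            total += (lo - x) ** 2
    return total                           # squared bound; compare against squared BSF
```

Swapping the argument roles (`lb_keogh(c_norm, q_norm, w)`) gives the paper's "reversed" just-in-time bound around the candidate, and in the cascade this bound is checked (cheapest first) before any full DTW computation is attempted.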