Online Event-driven Subsequence Matching over Financial Data Streams

Online Event-driven Subsequence Matching over Financial Data Streams. Huanmei Wu, Betty Salzberg, Donghui Zhang. SIGMOD 2004. Outline: Introduction, Motivation, Data Stream Processing, Subsequence Matching, Trend Prediction, Performance, Conclusion.

Presentation Transcript


  1. Online Event-driven Subsequence Matching over Financial Data Streams Huanmei Wu, Betty Salzberg, Donghui Zhang SIGMOD 2004

  2. Outline • Introduction • Motivation • Data Stream Processing • Subsequence Matching • Trend Prediction • Performance • Conclusion

  3. Introduction • Subsequence matching finds subsequences in the large data sequences stored in a database that are similar to a given query sequence • It is important in data mining tasks such as trend prediction, pattern recognition, dynamic clustering of multiple data streams, and rule discovery

  4. Motivation • Existing techniques for time series subsequence matching focus on finding similarity between an online query subsequence and a traditional database

  5. Motivation (cont.) [Figure: price against time for two sequences, S1 with end points 1 to 5 and S2 with end points 1' to 5'] • Subsequence similarity over financial data streams has unique properties • Zigzag shape of the piecewise linear representation (PLR) • The relative position of end points is important • Price change (amplitude) is more important than time interval

  6. Data Stream Processing (1) Aggregation • Piecewise linear representation (PLR) requires a unique value for each time interval • Aggregation of the raw data • Filter out the noise before further data processing [Figure: aggregated data stream]
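
As a rough illustration of the aggregation step, the sketch below collapses raw ticks into one value per fixed-length time interval. The function name `aggregate_ticks`, the `(time, price)` tuple layout, and the choice of the mean as the aggregate are all assumptions made for this example; the slides do not say which aggregate is used.

```python
def aggregate_ticks(ticks, interval):
    """Collapse raw (time, price) ticks into one value per time interval.

    Assumption: each interval is represented by the mean of its ticks.
    Ticks are expected in time order.
    """
    buckets = {}
    for t, price in ticks:
        key = int(t // interval)          # index of the interval containing t
        buckets.setdefault(key, []).append(price)
    # One (interval_start_time, aggregated_price) pair per non-empty interval.
    return [(k * interval, sum(v) / len(v)) for k, v in buckets.items()]
```

For example, `aggregate_ticks([(0, 10.0), (1, 12.0), (5, 20.0)], 5)` groups the first two ticks into the interval starting at 0 and the third into the interval starting at 5.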

  7. Data Stream Processing (2) Smoothing • Moving average: widely used in financial markets • X(i) is the value at period i, for i = 1, 2, ..., n, where n is the number of periods • MAp(i) is the p-interval moving average time series, which assigns equal weight to every point in the averaging interval
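
An equal-weight p-interval moving average is the mean of the p most recent values. A minimal sketch follows; the function name `moving_average` and the choice to emit no value until p points are available are assumptions of this example.

```python
def moving_average(x, p):
    """Return the p-interval simple moving average of the list x.

    Entry k of the result averages x[k], ..., x[k + p - 1], so the result
    has len(x) - p + 1 entries (no value is produced until p points exist).
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if len(x) < p:
        return []
    window_sum = sum(x[:p])
    out = [window_sum / p]
    for i in range(p, len(x)):
        window_sum += x[i] - x[i - p]     # slide the window by one point
        out.append(window_sum / p)
    return out
```

Keeping a running sum makes each update constant time, which suits a stream where values arrive one at a time.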

  8. Data Stream Processing (3) Smoothing • Bollinger Band Percent (%b)

  9. Data Stream Processing (4) Smoothing • Bollinger Band Percent (%b) • %b is a normalized value of the real price, between -1 and 2 [Figure: %b data stream]
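
The sketch below computes %b under the commonly used definition, which is an assumption here and may differ from the variant used in the slides: the bands are the p-interval moving average plus and minus k standard deviations, and %b = (price - lower band) / (upper band - lower band). The defaults `p=20` and `k=2.0` are conventional choices, not values taken from the slides.

```python
from statistics import mean, pstdev

def percent_b(prices, p=20, k=2.0):
    """Return %b for each index i >= p - 1, using the last p prices.

    Assumed definition: bands are MA_p +/- k population standard deviations
    and %b = (price - lower) / (upper - lower).
    """
    out = []
    for i in range(p - 1, len(prices)):
        window = prices[i - p + 1 : i + 1]
        ma, sd = mean(window), pstdev(window)
        lower, upper = ma - k * sd, ma + k * sd
        if upper == lower:                # flat window: the bands collapse
            out.append(0.5)               # assumption: treat as mid-band
        else:
            out.append((prices[i] - lower) / (upper - lower))
    return out
```

Under this assumed definition, a price inside the bands gives a value between 0 and 1, and a price outside them gives a value below 0 or above 1.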

  10. Data Stream Processing (5) Smoothing • Segmentation over %b is more suitable than segmentation directly over the raw price data stream • %b has a smoothed trend similar to the price movement • %b is a normalized value of the real price, between -1 and 2, which allows uniform segmentation criteria • %b is very sensitive to price changes and reflects them accurately without delay

  11. Data Stream Processing (6) Segmentation • A PLR may not be in a zigzag shape • Segmentation finds the end points of the PLR, which are the points at which the trend changes dramatically • All other points are considered noise and should be eliminated

  12. Data Stream Processing (7) Segmentation over %b [Figure: price (x) against time t, showing points 1 to 13 in a sliding window, with upper end point Pi and current point Pj] • In the current sliding window, where Pj(Xj, tj) is the current point, Pi(Xi, ti) is an upper end point if: • Xi = max(X values of the current sliding window) • Xi > Xj + ε, where ε is the given error threshold • Pi(Xi, ti) is the last point satisfying the two conditions above
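
The three conditions for an upper end point can be checked directly on the contents of the sliding window. The sketch below is one possible reading; the function name `find_upper_end_point`, the `(x, t)` tuple layout, and the convention that the last window entry is the current point are assumptions of this example. A lower end point test would mirror it with the minimum and the opposite inequality.

```python
def find_upper_end_point(window, threshold):
    """Return the index of an upper end point in the window, or None.

    window is a list of (x, t) points; the last entry is the current point Pj.
    threshold is the error threshold from the slide.
    """
    if len(window) < 2:
        return None
    xs = [x for x, _ in window]
    x_max, x_cur = max(xs), xs[-1]
    if x_max <= x_cur + threshold:        # second condition fails
        return None
    # Third condition: take the last point that attains the window maximum.
    for i in range(len(xs) - 1, -1, -1):
        if xs[i] == x_max:
            return i
```

With a window of x values 0.2, 0.9, 0.9, 0.5 and a threshold of 0.3, the current value 0.5 has dropped more than 0.3 below the maximum 0.9, so the later of the two 0.9 points is reported.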

  13. Data Stream Processing (8) Segmentation over %b • Delay time: the difference between the actual time of an end point and the time at which it is identified as an end point • A smaller error threshold ε reduces the delay time but produces a larger number of short line segments, some of which may still be noise • A larger ε decreases the number of line segments but lengthens the delay, and some useful information will be filtered out

  14. Data Stream Processing (9) Pruning • Pruning is the process of removing noise-like line segments • Segmentation finds potential end points using a smaller threshold ε_s, which gives a shorter delay time • Noise introduced by the small ε_s is then filtered out by pruning
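
To make the idea of pruning concrete, here is a simplified sketch that removes interior line segments whose amplitude falls below a pruning threshold. The removal rule, the function name `prune_short_segments`, and the parameter `prune_threshold` are illustrative assumptions; the slides do not spell out the pruning procedure.

```python
def prune_short_segments(xs, prune_threshold):
    """Drop interior segments whose amplitude is below prune_threshold.

    xs holds the values of alternating upper/lower end points. Removing both
    end points of a short segment merges its neighbours and keeps the
    upper/lower alternation. This rule is an illustrative assumption.
    """
    pts = list(xs)
    changed = True
    while changed and len(pts) > 3:
        changed = False
        for i in range(1, len(pts) - 2):          # interior segments only
            if abs(pts[i + 1] - pts[i]) < prune_threshold:
                del pts[i : i + 2]                # remove both end points
                changed = True
                break
    return pts
```

Deleting the two end points of a short segment, rather than one, is what keeps upper and lower points alternating after the merge.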

  15. Data Stream Processing (10) Online segmentation and pruning • ε_s: a smaller threshold for segmentation over %b • ε_pb: a larger threshold for pruning over %b • ε_pd: a threshold for pruning over the raw stream data

  16. Subsequence Similarity (1) Subsequence Permutation • S = {(X1, t1), (X2, t2), …, (Xn, tn)} • Separate upper and lower points: S' = {[(X1, t1), (X3, t3), …, (Xn-1, tn-1)], [(X2, t2), (X4, t4), …, (Xn, tn)]} • Sort each group separately by X value: S'' = {[(Xi1, ti1), (Xi3, ti3), …, (Xi(n-1), ti(n-1))], [(Xi2, ti2), (Xi4, ti4), …, (Xin, tin)]} • The subsequence permutation is {i1, i3, …, in-1, i2, i4, …, in}
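
The permutation construction above can be sketched in a few lines. The function name `subsequence_permutation` is hypothetical, and two details are assumptions of this example because the slide leaves them open: sorting is ascending, and ties are broken by original position.

```python
def subsequence_permutation(xs):
    """Return the permutation of 1-based end-point indices.

    xs[0], xs[2], ... form one group (e.g. upper points) and xs[1], xs[3], ...
    the other; each group is sorted by value and the index lists are joined.
    Assumption: sorting is ascending with ties broken by position.
    """
    odd = sorted(range(0, len(xs), 2), key=lambda i: xs[i])
    even = sorted(range(1, len(xs), 2), key=lambda i: xs[i])
    return [i + 1 for i in odd + even]
```

For the X values 5, 1, 7, 3, 6, 2, points 1, 3, 5 sort to the order 1, 5, 3 and points 2, 4, 6 sort to 2, 6, 4, so the permutation is {1, 5, 3, 2, 6, 4}.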

  17. Subsequence Similarity (2) New similarity measure • S1 = {(X1, t1), (X2, t2), …, (Xn, tn)} and S2 = {(X1', t1'), (X2', t2'), …, (Xn', tn')} • S1 and S2 are similar if they satisfy the following two conditions: • S1 and S2 have the same permutation • $d(S_1, S_2) < \delta$, where $d(S_1, S_2) = \alpha \sum_{i=1}^{n-1} \bigl|\, |X_{i+1} - X_i| - |X'_{i+1} - X'_i| \,\bigr| + \beta \sum_{i=1}^{n-1} \bigl| (t_{i+1} - t_i) - (t'_{i+1} - t'_i) \bigr|$ and $\alpha, \beta, \delta \ge 0$ are user-defined parameters
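
The two similarity conditions can be combined into a single check, sketched below. The function names and the parameter names `alpha`, `beta`, and `max_distance` are labels chosen for this example (two non-negative weights and a distance threshold); the ascending sort used for the permutation is likewise an assumption.

```python
def permutation(xs):
    """1-based indices of odd and even positions, each sorted by value."""
    odd = sorted(range(0, len(xs), 2), key=lambda i: xs[i])
    even = sorted(range(1, len(xs), 2), key=lambda i: xs[i])
    return [i + 1 for i in odd + even]


def distance(s1, s2, alpha=1.0, beta=1.0):
    """Weighted sum of amplitude and time-interval differences per segment."""
    amp = tim = 0.0
    for i in range(len(s1) - 1):
        (x0, t0), (x1, t1) = s1[i], s1[i + 1]
        (y0, u0), (y1, u1) = s2[i], s2[i + 1]
        amp += abs(abs(x1 - x0) - abs(y1 - y0))   # amplitude difference
        tim += abs((t1 - t0) - (u1 - u0))         # time-interval difference
    return alpha * amp + beta * tim


def is_similar(s1, s2, alpha=1.0, beta=1.0, max_distance=1.0):
    """True if both conditions hold: same permutation and small distance."""
    if len(s1) != len(s2):
        return False
    same_perm = permutation([x for x, _ in s1]) == permutation([x for x, _ in s2])
    return same_perm and distance(s1, s2, alpha, beta) < max_distance
```

Setting `beta` lower than `alpha` reflects the earlier point that price change matters more than time interval for financial streams.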

  18. Subsequence Similarity (3) Special cases • If any pair of upper points (or lower points) in a query subsequence is closer than a predefined threshold, the query subsequence is considered to have two permutations • Subsequences matching either of the two possible permutations are searched

  19. Subsequence Similarity (4) Event-driven subsequence matching • An event occurs when a new potential end point is identified and no pruning is needed • The query subsequence consists of the most recent n fixed and potential end points • The search algorithm finds subsequences in the historical data that are similar to the query subsequence [Figure: price against time t (t5 to t40), with end points 1 to 4]
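
As a sketch of what happens when an event fires, the function below takes the most recent n end points as the query and scans the earlier end points for similar windows. The plain linear scan is a stand-in chosen for brevity, not the search structure of the original work; the function name `find_similar_subsequences` and the caller-supplied predicate `is_similar` are assumptions of this example, as is the requirement that end points strictly alternate between upper and lower.

```python
def find_similar_subsequences(end_points, n, is_similar):
    """Linear scan: compare the latest n end points with earlier windows.

    end_points is the list of (x, t) end points identified so far; the query
    is end_points[-n:]. Returns the start indices of matching historical
    windows that do not overlap the query. Only windows that start on the
    same upper/lower phase as the query are considered.
    """
    total = len(end_points)
    if n <= 0 or total < 2 * n:
        return []
    query = end_points[-n:]
    first = (total - n) % 2               # same parity as the query start
    matches = []
    for start in range(first, total - 2 * n + 1, 2):
        candidate = end_points[start : start + n]
        if is_similar(query, candidate):
            matches.append(start)
    return matches
```

Calling this only when a new potential end point appears, instead of at every tick or fixed time period, is what makes the matching event-driven.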

  20. Trend Prediction (1) Definition of trend

  21. Trend Prediction (2) Prediction • The subsequence similarity search returns a list of end points, each of which is the last end point of one retrieved subsequence

  22. Performance (1) Similarity measure [Figure: correctness (%) of five similarity measures: Perm+Amp, Perm+Euc, Euc Only, Amp Only, Perm Only]

  23. Performance (2) Event-driven vs. fixed time periods [Figure: relative CPU cost (%) and correctness (%) of event-driven matching compared with fixed time periods FT1, FT5, FT10, FT15, FT20, FT25, FT30]

  24. Conclusion • Finding the trend of E by computing the distance between E and Ek may lose important information
