1 / 156

15-826: Multimedia Databases and Data Mining

15-826: Multimedia Databases and Data Mining. Lecture #27: Time series mining and forecasting Christos Faloutsos. Must-Read Material.

napper
Download Presentation

15-826: Multimedia Databases and Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 15-826: Multimedia Databasesand Data Mining Lecture #27: Time series mining and forecasting Christos Faloutsos

  2. Must-Read Material • Byong-Kee Yi, Nikolaos D. Sidiropoulos, Theodore Johnson, H.V. Jagadish, Christos Faloutsos and Alex Biliris, Online Data Mining for Co-Evolving Time Sequences, ICDE, Feb 2000. • Chungmin Melvin Chen and Nick Roussopoulos, Adaptive Selectivity Estimation Using Query Feedbacks, SIGMOD 1994 (c) C. Faloutsos, 2017

  3. Deepay Chakrabarti (UT-Austin) Spiros Papadimitriou (Rutgers) Prof. Byoung-Kee Yi (Samsung) Thanks (c) C. Faloutsos, 2017

  4. Outline • Motivation • Similarity search – distance functions • Linear Forecasting • Bursty traffic - fractals and multifractals • Non-linear forecasting • Conclusions (c) C. Faloutsos, 2017

  5. Problem definition • Given: one or more sequences x1 , x2 , … , xt , … (y1, y2, … , yt, … … ) • Find • similar sequences; forecasts • patterns; clusters; outliers (c) C. Faloutsos, 2017

  6. Motivation - Applications • Financial, sales, economic series • Medical • ECGs +; blood pressure etc monitoring • reactions to new drugs • elderly care (c) C. Faloutsos, 2017

  7. Motivation - Applications (cont’d) • ‘Smart house’ • sensors monitor temperature, humidity, air quality • video surveillance (c) C. Faloutsos, 2017

  8. Motivation - Applications (cont’d) • civil/automobile infrastructure • bridge vibrations [Oppenheim+02] • road conditions / traffic monitoring (c) C. Faloutsos, 2017

  9. Motivation - Applications (cont’d) • Weather, environment/anti-pollution • volcano monitoring • air/water pollutant monitoring (c) C. Faloutsos, 2017

  10. Motivation - Applications (cont’d) • Computer systems • ‘Active Disks’ (buffering, prefetching) • web servers (ditto) • network traffic monitoring • ... (c) C. Faloutsos, 2017

  11. Stream Data: Disk accesses #bytes time (c) C. Faloutsos, 2017

  12. Problem #1: Goal: given a signal (e.g.., #packets over time) Find: patterns, periodicities, and/or compress count lynx caught per year (packets per day; temperature per day) year (c) C. Faloutsos, 2017

  13. Problem#2: Forecast Given xt, xt-1, …, forecast xt+1 90 80 70 60 Number of packets sent ?? 50 40 30 20 10 0 1 3 5 7 9 11 Time Tick (c) C. Faloutsos, 2017

  14. Problem#2’: Similarity search E.g.., Find a 3-tick pattern, similar to the last one 90 80 70 60 Number of packets sent ?? 50 40 30 20 10 0 1 3 5 7 9 11 Time Tick (c) C. Faloutsos, 2017

  15. Problem #3: • Given: A set of correlatedtime sequences • Forecast ‘Sent(t)’ (c) C. Faloutsos, 2017

  16. Important observations Patterns, rules, forecasting and similarity indexing are closely related: • To do forecasting, we need • to find patterns/rules • to find similar settings in the past • to find outliers, we need to have forecasts • (outlier = too far away from our forecast) (c) C. Faloutsos, 2017

  17. Outline • Motivation • Similarity Search and Indexing • Linear Forecasting • Bursty traffic - fractals and multifractals • Non-linear forecasting • Conclusions (c) C. Faloutsos, 2017

  18. Outline • Motivation • Similarity search and distance functions • Euclidean • Time-warping • ... (c) C. Faloutsos, 2017

  19. Importance of distance functions Subtle, but absolutely necessary: • A ‘must’ for similarity indexing (-> forecasting) • A ‘must’ for clustering Two major families • Euclidean and Lp norms • Time warping and variations (c) C. Faloutsos, 2017

  20. ... Euclidean and Lp • L1: city-block = Manhattan • L2 = Euclidean • L (c) C. Faloutsos, 2017

  21. Day-n Day-2 ... Day-1 Observation #1 • Time sequence -> n-d vector (c) C. Faloutsos, 2017

  22. Observation #2 Euclidean distance is closely related to cosine similarity dot product ‘cross-correlation’ function Day-n Day-2 ... Day-1 (c) C. Faloutsos, 2017

  23. Time Warping allow accelerations - decelerations (with or w/o penalty) THEN compute the (Euclidean) distance (+ penalty) related to the string-editing distance (c) C. Faloutsos, 2017

  24. Time Warping ‘stutters’: (c) C. Faloutsos, 2017

  25. Time warping Q: how to compute it? A: dynamic programming D( i, j ) = cost to match prefix of length i of first sequence x with prefix of length j of second sequence y (c) C. Faloutsos, 2017

  26. Full-text scanning • Approximate matching - string editing distance: d( ‘survey’, ‘surgery’) = 2 = min # of insertions, deletions, substitutions to transform the first string into the second SURVEY SURGERY (c) C. Faloutsos, 2017

  27. Time warping Thus, with no penalty for stutter, for sequences x1, x2, …, xi,; y1, y2, …, yj no stutter x-stutter y-stutter (c) C. Faloutsos, 2017

  28. Time warping VERY SIMILAR to the string-editing distance no stutter x-stutter y-stutter (c) C. Faloutsos, 2017

  29. Full-text scanning if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1) else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion ) (c) C. Faloutsos, 2017

  30. Time warping VERY SIMILAR to the string-editing distance Time-warping String editing 1 + cost(i-1, j-1) // sub. 1 + cost(i, j-1) // del. 1 + cost(i-1, j) // ins. cost(i,j) = min (c) C. Faloutsos, 2017

  31. Time warping • Complexity: O(M*N) - quadratic on the length of the strings • Many variations (penalty for stutters; limit on the number/percentage of stutters; …) • popular in voice processing [Rabiner + Juang] (c) C. Faloutsos, 2017

  32. Other Distance functions piece-wise linear/flat approx.; compare pieces [Keogh+01] [Faloutsos+97] ‘cepstrum’ (for voice [Rabiner+Juang]) do DFT; take log of amplitude; do DFT again! Allow for small gaps [Agrawal+95] See tutorial by [Gunopulos + Das, SIGMOD01] (c) C. Faloutsos, 2017

  33. Other Distance functions In [Keogh+, KDD’04]: parameter-free, MDL based (c) C. Faloutsos, 2017

  34. Conclusions Prevailing distances: Euclidean and time-warping (c) C. Faloutsos, 2017

  35. Outline • Motivation • Similarity search and distance functions • Linear Forecasting • Bursty traffic - fractals and multifractals • Non-linear forecasting • Conclusions (c) C. Faloutsos, 2017

  36. Linear Forecasting (c) C. Faloutsos, 2017

  37. Forecasting "Prediction is very difficult, especially about the future." - Nils Bohr http://www.hfac.uh.edu/MediaFutures/thoughts.html (c) C. Faloutsos, 2017

  38. Outline • Motivation • ... • Linear Forecasting • Auto-regression: Least Squares; RLS • Co-evolving time sequences • Examples • Conclusions (c) C. Faloutsos, 2017

  39. Reference [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares) (c) C. Faloutsos, 2017

  40. Problem#2: Forecast • Example: give xt-1, xt-2, …, forecast xt 90 80 70 60 Number of packets sent ?? 50 40 30 20 10 0 1 3 5 7 9 11 Time Tick (c) C. Faloutsos, 2017

  41. Forecasting: Preprocessing MANUALLY: remove trends spot periodicities 7 days time time (c) C. Faloutsos, 2017

  42. 90 80 70 ?? 60 50 40 30 20 10 0 1 3 5 7 9 11 Time Tick Problem#2: Forecast • Solution: try to express xt as a linear function of the past: xt-2, xt-2, …, (up to a window of w) Formally: (c) C. Faloutsos, 2017

  43. (Problem: Back-cast; interpolate) • Solution - interpolate: try to express xt as a linear function of the past AND the future: xt+1, xt+2, … xt+wfuture;xt-1, … xt-wpast (up to windows of wpast , wfuture) • EXACTLY the same algo’s ?? 90 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 Time Tick (c) C. Faloutsos, 2017

  44. Linear Regression: idea 85 Body height 80 75 70 65 60 55 50 45 40 15 25 35 45 Body weight • express what we don’t know (= ‘dependent variable’) • as a linear function of what we know (= ‘indep. variable(s)’) (c) C. Faloutsos, 2017

  45. Linear Auto Regression: (c) C. Faloutsos, 2017

  46. Linear Auto Regression: 85 ‘lag-plot’ 80 75 70 65 Number of packets sent (t) 60 55 50 45 40 15 25 35 45 Number of packets sent (t-1) • lag w=1 • Dependent variable = # of packets sent (S[t]) • Independent variable = # of packets sent (S[t-1]) (c) C. Faloutsos, 2017

  47. Outline • Motivation • ... • Linear Forecasting • Auto-regression: Least Squares; RLS • Co-evolving time sequences • Examples • Conclusions (c) C. Faloutsos, 2017

  48. More details: • Q1: Can it work with window w>1? • A1: YES! xt xt-1 xt-2 (c) C. Faloutsos, 2017

  49. More details: • Q1: Can it work with window w>1? • A1: YES! (we’ll fit a hyper-plane, then!) xt xt-1 xt-2 (c) C. Faloutsos, 2017

  50. More details: • Q1: Can it work with window w>1? • A1: YES! (we’ll fit a hyper-plane, then!) xt xt-1 xt-2 (c) C. Faloutsos, 2017

More Related