Data Mining on Streams

Presentation Transcript


  1. Data Mining on Streams Christos Faloutsos CMU C. Faloutsos

  2. THANK YOU! • Prof. Panos Ipeirotis • Julia Mills

  3. Outline • Problem and motivation • Single-sequence mining: AWSOM • Co-evolving sequences: SPIRIT • Lag correlations: BRAID • Conclusions

  4. Problem definition - example Each sensor collects data (x1, x2, …, xt, …)

  5. Problem definition • Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …) • Find • patterns; correlations; outliers • incrementally!

  6. Limitations / Challenges • Find patterns using a method that is • nimble: limited resources (memory; bandwidth, power, CPU) • incremental: on-line, ‘any-time’ response; single pass (‘you get to see it only once’) • automatic: no human intervention, e.g., in remote environments
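
As an illustration of the single-pass, limited-memory constraint listed on this slide, the following minimal sketch keeps a running Pearson correlation between two streams in constant memory, seeing each value-pair only once. The class name `RunningCorrelation` and its methods are hypothetical and are not taken from the talk.

```python
import math


class RunningCorrelation:
    """Pearson correlation of two streams, updated one value-pair at a time."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # running sum of squared deviations of x
        self.m2_y = 0.0   # running sum of squared deviations of y
        self.c_xy = 0.0   # running sum of co-deviations of x and y

    def update(self, x, y):
        """Consume one value-pair; O(1) time and memory."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean of x
        dy = y - self.mean_y          # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        """Current estimate; NaN until both streams have shown some variation."""
        if self.m2_x == 0.0 or self.m2_y == 0.0:
            return float("nan")
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)
```

The estimate is available after any number of points, which is the 'any-time' property the slide asks for.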

  7. Application domains • Sensor devices • Temperature, weather measurements • Road traffic data • Geological observations • Patient physiological data • Embedded devices • Network routers • Intelligent (active) disks

  8. Motivation - Applications (cont’d) • ‘Smart house’ • sensors monitor temperature, humidity, air quality • video surveillance

  9. Motivation - Applications (cont’d) • civil/automobile infrastructure • bridge vibrations [Oppenheim+02] • road conditions / traffic monitoring

  10. Motivation - Applications (cont’d) • Weather, environment/anti-pollution • volcano monitoring • air/water pollutant monitoring

  11. Motivation - Applications (cont’d) • Computer systems • ‘Active Disks’ (buffering, prefetching) • web servers (ditto) • network traffic monitoring • ...

  12. InteMon w/ Evan Hoke, Jimeng Sun • self-* PetaByte data center at CMU

  13. Outline • Problem and motivation • Single-sequence mining: AWSOM • Co-evolving sequences: SPIRIT • Lag correlations: BRAID • Conclusions

  14. Single sequence mining - AWSOM • with Spiros Papadimitriou (CMU -> IBM) and Anthony Brockwell (CMU/Stat)

  15. Problem definition • Semi-infinite streams of values (time series) x1, x2, …, xt, … • Find patterns, forecasts, outliers… [Figure annotations: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]

  16. Requirements / Goals • Adapt to and handle arbitrary periodic components, and be • nimble (limited resources, single pass) • on-line, any-time • automatic (no human intervention/tuning)

  17. Overview • Introduction / Related work • Background • Main idea • Experimental results

  18. Wavelets: Example - Haar transform [Figure: Haar decomposition of xt into coefficients W1,1, W1,2, W1,3, W1,4 / W2,1, W2,2 / W3,1 / V4,1 (“constant”); axes: time, frequency]

  19. Wavelets: Why we like them • Wavelets compress many real signals well: • Image compression and processing • Vision • Astronomy, seismology, … • Wavelet coefficients can be updated as new points arrive
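
To make the claim that wavelet coefficients can be updated as new points arrive concrete, here is a minimal sketch of a streaming Haar transform. It keeps at most one unpaired smooth value per level, so its state grows as O(lg N). The class name `StreamingHaar`, the orthonormal 1/sqrt(2) scaling, and the sign convention (left minus right) are assumptions made for illustration, not details taken from the slides.

```python
import math


class StreamingHaar:
    """Emit Haar detail coefficients as points arrive, holding O(lg N) state."""

    def __init__(self):
        # pending[l] is the unpaired smooth value waiting at level l, or None.
        self.pending = []

    def update(self, x):
        """Consume one point; return the (level, detail) pairs it completes."""
        emitted = []
        level = 0
        value = x
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet at this level: store the value and stop.
                self.pending[level] = value
                return emitted
            left = self.pending[level]
            self.pending[level] = None
            detail = (left - value) / math.sqrt(2)   # difference of the pair
            value = (left + value) / math.sqrt(2)    # smooth, passed one level up
            level += 1
            emitted.append((level, detail))
```

Every second point completes a level-1 coefficient, every fourth point also completes a level-2 coefficient, and so on; a constant signal produces details that are all zero.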

  20. Overview • Introduction / Related work • Background • Main idea • Experimental results

  21. AWSOM [Figure: xt and its Haar coefficients W1,1, W1,2, W1,3, W1,4 / W2,1, W2,2 / W3,1 / V4,1; axes: time, frequency]

  22. AWSOM [Figure: same decomposition as on the previous slide; axes: time, frequency]

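  23. AWSOM - idea • $W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$ • $W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$ [Figure: coefficients $W_{l,t-2}, W_{l,t-1}, W_{l,t}$ at level $l$ and $W_{l',t'-2}, W_{l',t'-1}, W_{l',t'}$ at level $l'$]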

  24. More details… • Update of wavelet coefficients (incremental) • Update of linear models (incremental; RLS) • Feature selection (single-pass) • Not all correlations are significant • Throw away the insignificant ones (“noise”)
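
The slide names RLS (recursive least squares) for the incremental update of the linear models. The following is a minimal textbook RLS sketch, assuming no forgetting factor and a large initial value `delta` on the diagonal of the inverse-covariance estimate; the class name and parameters are hypothetical, and this is not claimed to be the exact variant used in AWSOM. Each update touches a k-by-k matrix once, so it costs O(k^2) time for k regression coefficients.

```python
class RecursiveLeastSquares:
    """Incrementally fit y ~ w . x, one (x, y) observation at a time."""

    def __init__(self, k, delta=1e6):
        self.w = [0.0] * k
        # P approximates the inverse of X^T X; a large delta means a weak prior.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        k = len(self.w)
        # Px = P x, computed once and reused below.
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)          # prediction error before the update
        for i in range(k):
            self.w[i] += gain[i] * err
            for j in range(k):
                # Rank-one downdate; P stays symmetric, so x^T P equals Px^T.
                self.P[i][j] -= gain[i] * Px[j]
```

In the setting of the slides, `x` would hold the previous wavelet coefficients used as regressors and `y` the newly arrived coefficient.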

  25. Complexity • Model update • Space: $O(\lg N + mk^2) \approx O(\lg N)$ • Time: $O(k^2) \approx O(1)$ • Where • N: number of points (so far) • k: number of regression coefficients; fixed • m: number of linear models; $O(\lg N)$

  26. Overview • Introduction / Related work • Background • Main idea • Experimental results

  27. Results - Synthetic data • Triangle pulse • Mix (sine + square) • AR captures wrong trend (or none) • Seasonal AR estimation fails [Figure panels: AWSOM, AR, Seasonal AR]

  28. Results - Real data • Automobile traffic • Daily periodicity • Bursty “noise” at smaller scales • AR fails to capture any trend • Seasonal AR estimation fails

  29. Results - real data • Sunspot intensity • Slightly time-varying “period” • AR captures wrong trend • Seasonal ARIMA • wrong downward trend, despite help by human!

  30. Conclusions • Adapt to and handle arbitrary periodic components, and be • nimble: limited memory (logarithmic); constant-time update • on-line, any-time: single pass over the data • automatic: no human intervention/tuning

  31. Outline • Problem and motivation • Single-sequence mining: AWSOM • Co-evolving sequences: SPIRIT • Lag correlations: BRAID • Conclusions

  32. Part 2 SPIRIT: Mining co-evolving streams [Papadimitriou, Sun, Faloutsos, VLDB05]

  33. Motivation • E.g., chlorine concentration in water distribution network

  34. Motivation • May have hundreds of measurements, but it is unlikely they are completely unrelated! [Figure: chlorine concentrations in the water distribution network over Phases 1-3; label: normal operation]

  35. Motivation [Figure: chlorine concentrations in the water distribution network over Phases 1-3; labels: normal operation, major leak, sensors near leak, sensors away from leak]

  36. Motivation [Figure: same as on the previous slide]

  37. Motivation • We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: actual measurements (n streams) of chlorine concentrations vs. k hidden variable(s); Phase 1; k = 1]

  38. Motivation • We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: actual measurements (n streams) of chlorine concentrations vs. k hidden variable(s); Phases 1-2; k = 2]

  39. Motivation • We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: actual measurements (n streams) of chlorine concentrations vs. k hidden variable(s); Phases 1-3; k = 1]

  40. Goals • Discover “hidden” (latent) variables for: • Summarization of main trends for users • Efficient forecasting, spotting outliers/anomalies and the usual: • nimble: Limited memory requirements • on-line, any-time: (single pass etc) • automatic: No special parameters to tune

  41. Related work: Stream mining • Stream SVD [Guha, Gunopulos, Koudas / KDD03] • StatStream [Zhu, Shasha / VLDB02] • Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04] • Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]

  42. Related work: Stream mining • Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004] • Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03] • …

  43. Overview: Part 2 • Method • Experiments • Conclusions & Other work

  44. Stream correlations • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust the number of hidden variables?

  45. 1. How to capture correlations? [Figure: temperature t1 of the first sensor over time; axis marks at 20°C and 30°C]

  46. 1. How to capture correlations? [Figure: first sensor and second sensor (temperature t2) over time; axis marks at 20°C and 30°C]

  47. 1. How to capture correlations • Correlations: Let’s take a closer look at the first three value-pairs… [Figure: temperature t2 vs. temperature t1; both axes marked at 20°C and 30°C]

  48. 1. How to capture correlations • First three (time=1, time=2, time=3) lie (almost) on a line in the space of value-pairs… • offset = “hidden variable” • O(n) numbers for the slope, and • One number for each value-pair (offset on line) [Figure: temperature t2 vs. temperature t1; both axes marked at 20°C and 30°C]

  49. 1. How to capture correlations • Other pairs also follow the same pattern: they lie (approximately) on this line [Figure: temperature t2 vs. temperature t1; both axes marked at 20°C and 30°C]
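
The two-sensor picture on the preceding slides can be sketched in a few lines: find the line that best fits the value-pairs, then describe each pair by a single number, its offset along that line. The sketch below is a plain batch computation over a list of pairs, assuming mean-centered data and the closed-form principal direction of a 2x2 covariance matrix; it is not the incremental method that Step 2 of the talk asks for, and all function names are hypothetical.

```python
import math


def principal_direction(pairs):
    """Return (mean, unit direction) of the best-fit line through 2-D points."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in pairs) / n
    c = sum((p[1] - my) ** 2 for p in pairs) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in pairs) / n
    # Angle of the eigenvector that belongs to the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return (mx, my), (math.cos(theta), math.sin(theta))


def hidden_variable(pair, mean, direction):
    """Offset of one value-pair along the line: one number per time tick."""
    return (pair[0] - mean[0]) * direction[0] + (pair[1] - mean[1]) * direction[1]


def reconstruct(offset, mean, direction):
    """Approximate the original value-pair from its offset on the line."""
    return (mean[0] + offset * direction[0], mean[1] + offset * direction[1])
```

With two sensors the line is described by a handful of numbers (mean and direction), and each time tick then needs only one stored offset instead of two readings.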

  50. Stream correlations • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust the number of hidden variables?
