
Data Mining on Streams


Presentation Transcript


  1. Data Mining on Streams Christos Faloutsos CMU

  2. Thanks: Dr. Deepay Chakrabarti (Yahoo); Dr. Spiros Papadimitriou (IBM); Prof. Byoung-Kee Yi (Pohang U.); Prof. Dimitris Gunopulos (UCR); Dr. Mengzhi Wang (Google). (c) C. Faloutsos, 2006

  3. For more info: 3h tutorial, at: http://www.cs.cmu.edu/~christos/TALKS/EDBT04-tut/faloutsos-edbt04.pdf

  4. Outline • Motivation • Similarity Search and Indexing • DSP (Digital Signal Processing) • Linear Forecasting • Bursty traffic - fractals and multifractals • Non-linear forecasting • Conclusions

  5. Problem definition • Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …) • Find • similar sequences; forecasts • patterns; clusters; outliers

  6. Motivation - Applications • Financial, sales, economic series • Medical • ECGs +; blood pressure etc. monitoring • reactions to new drugs • elder care

  7. Motivation - Applications (cont’d) • ‘Smart house’ • sensors monitor temperature, humidity, air quality • video surveillance

  8. Motivation - Applications (cont’d) • civil/automobile infrastructure • bridge vibrations [Oppenheim+02] • road conditions / traffic monitoring

  9. Motivation - Applications (cont’d) • Weather, environment/anti-pollution • volcano monitoring • air/water pollutant monitoring

  10. Motivation - Applications (cont’d) • Computer systems • ‘Active Disks’ (buffering, prefetching) • web servers (ditto) • network traffic monitoring • ...

  11. Problem #1: Goal: given a signal (e.g., #packets over time) Find: patterns, periodicities, and/or compress [Figure: count vs. year; lynx caught per year (also: packets per day; temperature per day)]

  12. Problem#2: Forecast Given xt, xt-1, …, forecast xt+1 [Figure: number of packets sent (0-90) vs. time tick (1-11); the next value is marked ‘??’]

  13. Problem#2’: Similarity search E.g., find a 3-tick pattern similar to the last one [Figure: number of packets sent (0-90) vs. time tick (1-11); the next value is marked ‘??’]
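A minimal sketch of the query on slide 13, assuming Euclidean distance between windows and a plain sequential scan. The function name `most_similar_window` and the sample packet counts are hypothetical, chosen only for illustration; the slides do not prescribe this procedure.

```python
import math

def most_similar_window(series, w=3):
    """Return (start, distance) of the earlier length-w window that is
    closest, in Euclidean distance, to the last w values of `series`."""
    if len(series) < 2 * w:
        raise ValueError("need at least two non-overlapping windows")
    query = series[-w:]
    best_start, best_dist = None, math.inf
    # Only consider windows that end before the query window starts.
    for start in range(len(series) - 2 * w + 1):
        window = series[start:start + w]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, query)))
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

# Hypothetical packet counts per time tick (not taken from the slide).
packets = [10, 20, 30, 60, 50, 40, 70, 80, 11, 21, 29]
start, dist = most_similar_window(packets, w=3)
print(start, round(dist, 3))
```

The values that followed the best-matching window in the past are natural candidates for what comes next, which is one way to see why similarity search and forecasting are related.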

  14. Differences from DSP/Stat • Semi-infinite streams • we need on-line, ‘any-time’ algorithms • Cannot afford human intervention • need automatic methods • sensors have limited memory / processing / transmitting power • need for (lossy) compression

  15. Important observations Patterns, rules, forecasting and similarity indexing are closely related: • To do forecasting, we need • to find patterns/rules • to find similar settings in the past • To find outliers, we need to have forecasts • (outlier = too far away from our forecast)

  16. Important topics NOT in this tutorial: • Continuous queries • [Babu+Widom] [Gehrke+] [Madden+] • Categorical data streams • [Hatonen+96] • Outlier detection (discontinuities) • [Breunig+00]

  17. Outline • Motivation • Similarity Search and Indexing • DSP • Linear Forecasting • Bursty traffic - fractals and multifractals • Non-linear forecasting • Conclusions

  18. Outline • Motivation • Similarity Search and Indexing • distance functions: Euclidean; time-warping • indexing • feature extraction • DSP • ...

  19. ... Euclidean and Lp • L1: city-block = Manhattan • L2 = Euclidean • L
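As an illustration of the Lp family named on slide 19, here is a small sketch. The function name `lp_distance` and the sample vectors are assumptions made for this example. The `p = inf` case is the standard limit of the family (the maximum coordinate difference) and is included here as general background, not as something the surviving slide text states.

```python
import math

def lp_distance(x, y, p=2):
    """Lp distance between two equal-length sequences (p >= 1, or math.inf)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)  # limit of Lp as p grows: largest coordinate gap
    return sum(d ** p for d in diffs) ** (1.0 / p)

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(lp_distance(x, y, 1))         # city-block / Manhattan
print(lp_distance(x, y, 2))         # Euclidean
print(lp_distance(x, y, math.inf))  # maximum difference
```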

  20. Distance function: by expert [Figure: three plots of $price vs. day, days 1-365]

  21. Idea: ‘GEMINI’ E.g., ‘find stocks similar to MSFT’ Seq. scanning: too slow How to accelerate the search? [Faloutsos96]

  22. ‘GEMINI’ - Pictorially [Figure: sequences S1 … Sn ($price vs. day, days 1-365) mapped to points F(S1) … F(Sn) in feature space; feature axes e.g. avg and std]

  23. GEMINI Solution: ‘Quick-and-dirty’ filter: • extract n features (numbers, e.g., avg., etc.) • map into a point in n-d feature space • organize points with off-the-shelf spatial access method (‘SAM’) • discard false alarms
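The filter-and-refine steps above can be sketched as follows. This is only an illustration under stated assumptions: the features are the average and (population) standard deviation, as in the picture on slide 22. They are scaled by sqrt(n), which is my addition so that the feature-space distance never exceeds the true Euclidean distance and the filter cannot drop a true match. A linear scan over feature points stands in for the spatial access method. All names and data are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def features(seq):
    """Map a sequence to a 2-d point: sqrt(n) * (mean, population std)."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)
    scale = math.sqrt(n)  # assumption: scaling makes the filter a lower bound
    return (scale * mean, scale * std)

def gemini_range_query(database, query, radius):
    """Return (candidates, answers) for 'all sequences within radius of query'."""
    q_feat = features(query)
    # In practice the feature points are computed once and stored in a SAM;
    # a plain scan over them is used here to keep the sketch short.
    index = [(features(s), s) for s in database]
    # Step 1, quick-and-dirty filter in feature space.
    candidates = [s for f, s in index if euclidean(f, q_feat) <= radius]
    # Step 2, discard false alarms using the true distance.
    answers = [s for s in candidates if euclidean(s, query) <= radius]
    return candidates, answers

stocks = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 10, 10, 10], [4, 3, 2, 1]]
candidates, answers = gemini_range_query(stocks, [1, 2, 3, 4], radius=1.5)
print(candidates)
print(answers)
```

In this toy data the reversed sequence `[4, 3, 2, 1]` has the same average and standard deviation as the query, so it passes the filter as a false alarm and is removed in the refinement step.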

  24. Examples of GEMINI • Time sequences: DFT (up to 100 times faster) [SIGMOD94]; • [Kanellakis+], [Mendelzon+]

  25. Examples of GEMINI Even on other-than-sequence data: • Images (QBIC) [JIIS94] • tumor-like shapes [VLDB96] • video [Informedia + S-R-trees] • automobile part shapes [Kriegel+97]

  26. Indexing - SAMs Q: How do Spatial Access Methods (SAMs) work? A: they group nearby points (or regions) together, on nearby disk pages, and answer spatial queries quickly (‘range queries’, ‘nearest neighbor’ queries etc.) For example:

  27. Skip R-trees • [Guttman84] e.g., w/ fanout 4: group nearby rectangles to parent MBRs; each group -> disk page [Figure: rectangles A-J in the plane]

  28. Skip R-trees • e.g., w/ fanout 4: [Figure: rectangles A-J grouped into parent MBRs P1-P4; leaf entries listed alongside]

  29. Skip R-trees • e.g., w/ fanout 4: [Figure: rectangles A-J grouped into parent MBRs P1-P4, shown spatially and as a tree]

  30. Skip R-trees - range search? [Figure: rectangles A-J grouped into parent MBRs P1-P4, shown spatially and as a tree]

  31. Skip R-trees - range search? [Figure: rectangles A-J grouped into parent MBRs P1-P4, shown spatially and as a tree]
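To make the range-search question on the R-tree slides concrete, here is a simplified two-level sketch, not Guttman's algorithm. It indexes points instead of rectangles and forms the pages by sorting and chunking, where a real R-tree uses insertion and node-split heuristics. The function names and the data are hypothetical. What it does share with an R-tree is the pruning idea: a page is read only if its MBR intersects the query rectangle.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of 2-d points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, p):
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

def build_pages(points, fanout=4):
    """Naive grouping (assumption): sort the points, then cut into pages."""
    ordered = sorted(points)
    pages = []
    for i in range(0, len(ordered), fanout):
        group = ordered[i:i + fanout]
        pages.append((mbr(group), group))
    return pages

def range_search(pages, query):
    """Return (points inside query, number of pages actually read)."""
    hits, pages_read = [], 0
    for box, group in pages:
        if not intersects(box, query):
            continue  # prune: nothing on this page can be in the query
        pages_read += 1
        hits.extend(p for p in group if contains(query, p))
    return hits, pages_read

points = [(1, 1), (2, 3), (3, 2), (2, 2), (8, 8), (9, 7), (7, 9), (8, 9),
          (15, 1), (16, 2)]
pages = build_pages(points, fanout=4)
hits, pages_read = range_search(pages, (6, 6, 8.5, 8.5))
print(hits, pages_read, "of", len(pages), "pages read")
```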

  32. Conclusions • Fast indexing: through GEMINI • feature extraction and • (off the shelf) Spatial Access Methods [Gaede+98]

  33. Outline • Motivation • Similarity Search and Indexing • distance functions • indexing • feature extraction • DSP • ...

  34. Outline • Motivation • Similarity Search and Indexing • distance functions • indexing • feature extraction • DFT, DWT, DCT (data independent) • SVD, etc. (data dependent) • MDS, FastMap

  35. DFT and cousins • very good for compressing real signals • more details on DFT/DCT/DWT: later
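A small sketch of DFT-based feature extraction, assuming an orthonormal DFT (scaled by 1/sqrt(n)) and keeping only the first k coefficients as features. With that scaling, Parseval's theorem says the full transform preserves Euclidean distance, so the distance over any subset of coefficients can only be smaller or equal. The O(n^2) transform, the function names and the sample data are illustrative assumptions.

```python
import cmath
import math

def dft(x):
    """Orthonormal discrete Fourier transform (direct O(n^2) evaluation)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]

def dft_features(x, k=2):
    """Keep the first k DFT coefficients as the feature vector."""
    return dft(x)[:k]

def distance(a, b):
    """Euclidean distance; works for real and complex entries."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
y = [1.5, 2.5, 2.0, 4.0, 3.5, 1.0, 1.0, 0.5]
true_dist = distance(x, y)
feat_dist = distance(dft_features(x), dft_features(y))
print(feat_dist <= true_dist + 1e-9)
```

For signals whose energy sits in the low frequencies, the first few coefficients capture most of the distance, which is what makes them useful as a filter.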

  36. Feature extraction • SVD (finds hidden/latent variables) • Random projections (works surprisingly well!)
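A minimal sketch of a random projection, assuming the common Gaussian construction: each entry of the k-by-n matrix is drawn from a normal distribution with variance 1/k, so squared lengths are preserved in expectation. The slide does not say which variant the author has in mind; the names, the seed and the data here are hypothetical.

```python
import math
import random

def random_projection_matrix(n, k, seed=0):
    """k x n matrix with i.i.d. N(0, 1/k) entries (one common choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(n)]
            for _ in range(k)]

def project(matrix, x):
    """Map an n-d sequence to k features by a matrix-vector product."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]

R = random_projection_matrix(n=8, k=3, seed=1)
x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
y = [1.5, 2.5, 2.0, 4.0, 3.5, 1.0, 1.0, 0.5]
print(project(R, x))
print(project(R, y))
```

Unlike the DFT truncation, distances after a random projection are only approximately preserved, so a projected distance may overestimate the true one.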

  37. Conclusions - Practitioner’s guide Similarity search in time sequences 1) establish/choose distance (Euclidean, time-warping, …) 2) extract features (SVD, DWT, MDS), and use a SAM (R-tree/variant) or a Metric Tree (M-tree) 2’) for high intrinsic dimensionalities, consider sequential scan (it might win…)

  38. Books • William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for SVD) • C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to SVD, and GEMINI)

  39. References • Agrawal, R., K.-I. Lin, et al. (Sept. 1995). Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. Proc. of VLDB, Zurich, Switzerland. • Babu, S. and J. Widom (2001). “Continuous Queries over Data Streams.” SIGMOD Record 30(3): 109-120. • Breunig, M. M., H.-P. Kriegel, et al. (2000). LOF: Identifying Density-Based Local Outliers. SIGMOD Conference, Dallas, TX. • Berry, Michael: http://www.cs.utk.edu/~lsi/

  40. References • Ciaccia, P., M. Patella, et al. (1997). M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB. • Foltz, P. W. and S. T. Dumais (Dec. 1992). “Personalized Information Delivery: An Analysis of Information Filtering Methods.” Comm. of ACM (CACM) 35(12): 51-60. • Guttman, A. (June 1984). R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. ACM SIGMOD, Boston, Mass.

  41. References • Gaede, V. and O. Guenther (1998). “Multidimensional Access Methods.” Computing Surveys 30(2): 170-231. • Gehrke, J. E., F. Korn, et al. (May 2001). On Computing Correlated Aggregates Over Continual Data Streams. ACM SIGMOD, Santa Barbara, California.

  42. References • Gunopulos, D. and G. Das (2001). Time Series Similarity Measures and Time Series Indexing. SIGMOD Conference, Santa Barbara, CA. • Hatonen, K., M. Klemettinen, et al. (1996). Knowledge Discovery from Telecommunication Network Alarm Databases. ICDE, New Orleans, Louisiana. • Jolliffe, I. T. (1986). Principal Component Analysis, Springer Verlag.

  43. References • Keogh, E. J., K. Chakrabarti, et al. (2001). Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. SIGMOD Conference, Santa Barbara, CA. • Keogh, E. J., S. Lonardi, and C. A. Ratanamahatana (2004). Towards Parameter-Free Data Mining. KDD 2004: 206-215. • Kobla, V., D. S. Doermann, et al. (Nov. 1997). VideoTrails: Representing and Visualizing Structure in Video Sequences. ACM Multimedia 97, Seattle, WA.

  44. References • Oppenheim, I. J., A. Jain, et al. (March 2002). A MEMS Ultrasonic Transducer for Resident Monitoring of Steel Structures. SPIE Smart Structures Conference SS05, San Diego. • Papadimitriou, C. H., P. Raghavan, et al. (1998). Latent Semantic Indexing: A Probabilistic Analysis. PODS, Seattle, WA. • Rabiner, L. and B.-H. Juang (1993). Fundamentals of Speech Recognition, Prentice Hall.

  45. References • Traina, C., A. Traina, et al. (October 2000). Fast Feature Selection Using the Fractal Dimension. XV Brazilian Symposium on Databases (SBBD), Paraiba, Brazil.

  46. References • Shasha, D. and Y. Zhu (2004). High Performance Discovery in Time Series: Techniques and Case Studies. Springer. • Zhu, Y. and D. Shasha (August 2002). “StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time.” VLDB, pp. 358-369. • Madden, S. R., M. J. Franklin, J. M. Hellerstein, and W. Hong (June 2003). The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD, San Diego, CA.

  47. Part 2: DSP (Digital Signal Processing)

  48. Outline • Motivation • Similarity Search and Indexing • DSP (DFT, DWT) • Linear Forecasting • Bursty traffic - fractals and multifractals • Non-linear forecasting • Conclusions

  49. Outline • DFT • Definition of DFT and properties • how to read the DFT spectrum • DWT • Definition of DWT and properties • how to read the DWT scalogram

  50. Introduction - Problem#1 Goal: given a signal (e.g., packets over time) Find: patterns and/or compress [Figure: count vs. year; lynx caught per year (also: packets per day; automobiles per hour)]
