
Data Mining Meets Systems: Tools and Case Studies

Data Mining Meets Systems: Tools and Case Studies. Christos Faloutsos, SCS CMU. Thanks: Spiros Papadimitriou (CMU -> IBM), Mengzhi Wang (CMU -> Google), Jimeng Sun (CMU -> IBM). Outline: Problem 1: workload characterization; Problem 2: self-* monitoring; Problem 3: BGP mining.


Presentation Transcript


  1. Data Mining Meets Systems: Tools and Case Studies • Christos Faloutsos, SCS CMU

  2. Thanks • Spiros Papadimitriou (CMU -> IBM) • Mengzhi Wang (CMU -> Google) • Jimeng Sun (CMU -> IBM) C. Faloutsos

  3. Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop) • Tools: fractals, SVD, wavelets, tensors, PageRank

  4. Problem #1 • Goal: given a signal (e.g., #bytes over time) • Find: patterns, periodicities, and/or compress • [Figure: #bytes per 30' vs. time; similar signals: packets per day, earthquakes per year]

  5. Problem #1 • model bursty traffic • generate realistic traces • (Poisson does not work) • [Figure: #bytes vs. time, with a Poisson trace for comparison]

  6. Motivation • predict queue length distributions (e.g., to give probabilistic guarantees) • “learn” traffic, for buffering, prefetching, ‘active disks’, web servers

  7. Q: any ‘pattern’? • Not Poisson • spike; silence; more spikes; more silence… • any rules? • [Figure: #bytes vs. time]

  8. Solution: self-similarity • [Figure: two plots of #bytes vs. time]

  9. But: • Q1: How to generate realistic traces; extrapolate? • Q2: How to estimate the model parameters?

  10. Approach • Q1: How to generate a sequence that is • bursty • self-similar • and has queue length distributions similar to those of the real trace?

  11. Approach • A: ‘binomial multifractal’ [Wang+02] • ~ 80-20 ‘law’: • 80% of bytes/queries etc. in the first half • repeat recursively • b: bias factor (e.g., 80%)

  12. binary multifractals • [Figure: 20/80 split]

  13. binary multifractals • [Figure: 20/80 split]
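
The recursive 80-20 split described on slide 11 can be sketched in a few lines. This is a minimal illustration, not the generator from [Wang+02]: the function name `b_model` and its parameters are hypothetical, and the sketch always gives the fraction b to the first half, as the slide states. A practical generator would presumably vary which half receives the larger share; that variation is not shown here.

```python
def b_model(total, levels, b=0.8):
    """Spread `total` units over 2**levels time slots with bias factor b.

    Hypothetical helper: at every level each interval is cut in two and
    a fraction b of its volume goes to the first half, 1 - b to the second.
    """
    if not 0.5 <= b <= 1.0:
        raise ValueError("bias factor b is expected in [0.5, 1.0]")
    trace = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in trace:
            first = volume * b  # share of the first half
            finer.extend([first, volume - first])
        trace = finer
    return trace


if __name__ == "__main__":
    # Two levels of splitting: four time slots whose volumes sum to 100.
    print(b_model(total=100.0, levels=2, b=0.8))
```

With b = 0.5 every slot receives the same volume (a uniform trace); as b approaches 1 almost everything collapses into a single slot.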

  14. Parameter estimation • Q2: How to estimate the bias factor b?

  15. Parameter estimation • Q2: How to estimate the bias factor b? • A: MANY ways [Crovella+96] • Hurst exponent • variance plot • even DFT amplitude spectrum! (‘periodogram’) • More robust: ‘entropy plot’ [Wang+02] • Reference: Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002

  16. Entropy plot • Rationale: • burstiness: inverse of uniformity • entropy measures uniformity of a distribution • find entropy at several granularities, to see whether/how our distribution is close to uniform.

  17. Entropy plot • [Figure: one split; p1, p2 = % of bytes in each half] • Entropy E(n) after n levels of splits • n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$

  18. Entropy plot • [Figure: two levels of splits; $p_{2,1}$, $p_{2,2}$, $p_{2,3}$, $p_{2,4}$ = % of bytes in each quarter] • Entropy E(n) after n levels of splits • n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$ • n=2: $E(2) = -\sum_i p_{2,i} \log_2 p_{2,i}$
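
The entropy-at-each-level computation above can be sketched as follows. The function names `entropy_plot` and `fit_slope` are hypothetical, the trace length is assumed to be a power of two, and the slope is obtained with an ordinary least-squares line, which is one reasonable choice rather than a method stated on the slides.

```python
import math


def entropy_plot(trace):
    """Return [E(0), E(1), ..., E(n)] for a trace of 2**n non-negative values.

    E(k) is the entropy (in bits) of the fraction of the total volume that
    falls into each of the 2**k equal-length intervals.
    """
    size = len(trace)
    if size == 0 or size & (size - 1):
        raise ValueError("trace length must be a power of two")
    total = float(sum(trace))
    if total <= 0:
        raise ValueError("trace must have positive total volume")
    level = [v / total for v in trace]  # finest granularity first
    entropies = []
    while True:
        entropies.append(-sum(p * math.log2(p) for p in level if p > 0))
        if len(level) == 1:
            break
        # Merge neighbouring intervals to move one level coarser.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    entropies.reverse()  # index k now holds E(k)
    return entropies


def fit_slope(entropies):
    """Least-squares slope of E(k) against the level index k."""
    count = len(entropies)
    if count < 2:
        raise ValueError("need at least two levels to fit a slope")
    mean_x = (count - 1) / 2.0
    mean_y = sum(entropies) / count
    num = sum((k - mean_x) * (e - mean_y) for k, e in enumerate(entropies))
    den = sum((k - mean_x) ** 2 for k in range(count))
    return num / den


if __name__ == "__main__":
    print(fit_slope(entropy_plot([1, 1, 1, 1, 1, 1, 1, 1])))  # uniform trace
    print(fit_slope(entropy_plot([8, 0, 0, 0, 0, 0, 0, 0])))  # single spike
```

A perfectly uniform trace gains one bit of entropy per level, while a trace with all its volume in one slot stays at zero entropy at every level.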

  19. Real traffic • Has linear entropy plot (-> self-similar) • [Figure: entropy E(n) vs. # of levels (n); slope 0.73]

  20. Observation - intuition • slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit • uniform dataset: slope = 1 • multi-point: slope = 0 • [Figure: entropy E(n) vs. # of levels (n); slope 0.73]

  21. Some more entropy plots • Poisson vs. real • real: slope = 0.73 • Poisson: slope ~ 1 -> uniformly distributed

  22. B-model • b-model traffic gives a perfectly linear plot • Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$ • Fitting: do entropy plot; get slope; solve for b • [Figure: E(n) vs. n]
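
The "solve for b" step has no closed form, but the slope in the lemma is the binary entropy of b, which decreases from 1 to 0 as b goes from 0.5 to 1, so a bisection search is enough. The sketch below is one way to do it; `binary_entropy` and `fit_bias` are hypothetical names, and restricting b to [0.5, 1] is an assumption (b and 1 - b give the same slope).

```python
import math


def binary_entropy(b):
    """Slope predicted by the lemma for bias factor b."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def fit_bias(slope, tol=1e-12):
    """Invert the lemma by bisection: find b in [0.5, 1] with that slope."""
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope of an entropy plot is expected in [0, 1]")
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > slope:
            lo = mid  # entropy still too high: b must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Slope reported for the real traffic on slide 19.
    print(fit_bias(0.73))
```

For the slope of 0.73 shown for real traffic, the fitted bias lands just under 0.8, in line with the 80-20 intuition.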

  23. Experimental setup • Disk traces (from HP [Wilkes 93]) • web traces from LBL http://repository.cs.vt.edu/ lbl-conn-7.tar.Z

  24. Model validation • Linear entropy plots • Bias factors b: 0.6-0.8 • smallest b / smoothest: nntp traffic

  25. Web traffic - results • LBL, NCDF of queue lengths (log-log scales) • [Figure: Prob(> l) vs. queue length l]
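
A curve of Prob(> l) against queue length l can be produced from any trace by pushing it through a simple queue. The slides do not say which queueing model was used, so the sketch below assumes a discrete-time, single-server queue with a constant service rate; `queue_lengths` and `ccdf` are hypothetical helper names.

```python
def queue_lengths(arrivals, service_rate):
    """Backlog after each time slot for a constant-rate server (assumed model)."""
    backlog = 0.0
    history = []
    for arrived in arrivals:
        # Add the new work, drain up to service_rate, never go below zero.
        backlog = max(0.0, backlog + arrived - service_rate)
        history.append(backlog)
    return history


def ccdf(values, thresholds):
    """Fraction of values strictly greater than each threshold: Prob(> l)."""
    count = len(values)
    return [sum(1 for v in values if v > t) / count for t in thresholds]


if __name__ == "__main__":
    backlog = queue_lengths([5, 0, 0, 3], service_rate=2)
    print(backlog)
    print(ccdf(backlog, thresholds=[0, 1, 2]))
```

Running the same two functions on a real trace and on a synthetic one gives two curves that can be compared on log-log axes.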

  26. Conclusions • Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic

  27. Books • Fractals: Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)

  28. Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop)

  29. Clusters/data center monitoring • Monitor correlations of multiple measurements • Automatically flag anomalous behavior • InteMon: intelligent monitoring system • warsteiner.db.cs.cmu.edu/demo/intemon.jsp

  30. Publication • Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006

  31. Under the hood: SVD • Singular Value Decomposition • Done incrementally • Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005, Trondheim, Norway.

  32. Singular Value Decomposition (SVD) • SVD (~LSI ~ KL ~ PCA ~ spectral analysis...) • LSI: S. Dumais; M. Berry • KL: e.g., Duda+Hart • PCA: e.g., Jolliffe • Details: [Press+] • [Figure: u of CPU1 vs. u of CPU2; points at t=1, t=2]

  33. Singular Value Decomposition (SVD) • SVD (~LSI ~ KL ~ PCA ~ spectral analysis...) • [Figure: u of CPU1 vs. u of CPU2; points at t=1, t=2]

  34. Singular Value Decomposition (SVD) • SVD (~LSI ~ KL ~ PCA ~ spectral analysis...) • [Figure: u of CPU1 vs. u of CPU2; points at t=1, t=2]

  35. Singular Value Decomposition (SVD) • SVD (~LSI ~ KL ~ PCA ~ spectral analysis...) • [Figure: u of CPU1 vs. u of CPU2; points at t=1, t=2]
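
The CPU1/CPU2 scatter plot suggests what the first singular vector captures: the single direction along which the measurements vary together. The sketch below finds that direction with plain power iteration on the Gram matrix. It is a batch computation swapped in for illustration, not the incremental method cited on slide 31, and `first_singular_direction` plus the sample readings are hypothetical.

```python
import math


def first_singular_direction(rows, iters=200):
    """Leading right singular vector and singular value of a small matrix.

    `rows` holds one measurement vector per time tick (e.g. [cpu1, cpu2]).
    Uses power iteration on A^T A; assumes the start vector is not
    orthogonal to the leading direction.
    """
    dim = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(dim)]
            for i in range(dim)]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [sum(gram[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero data: no direction to find
        v = [x / norm for x in w]
    # Singular value = length of the data projected onto v.
    sigma = math.sqrt(sum(sum(r[j] * v[j] for j in range(dim)) ** 2
                          for r in rows))
    return v, sigma


if __name__ == "__main__":
    # Made-up readings in which CPU2 always runs at twice the load of CPU1.
    readings = [(0.2, 0.4), (0.4, 0.8), (0.1, 0.2)]
    direction, strength = first_singular_direction(readings)
    print(direction, strength)
```

When the two series are perfectly correlated, as in the made-up readings, one direction explains all of the data; a new reading that falls far from that line is a natural candidate for closer inspection.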

  36. Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop)

  37. BGP updates • With: • Aditya Prakash (CMU) • Michalis Faloutsos (UC Riverside) • Nicholas Valler (UC Riverside) • Dave Andersen (CMU)

  38. Tool #0: Time plot • Time series: #updates per 600s, Washington Router, 09/2004-09/2006

  39. Tool #0: Time plot • Observation #1: Missing values • Observation #2: Bursty

  40. Tool #1: Wavelets

  41. Wavelets - DWT • Short window Fourier transform (SWFT) • But: how short should the window be? • [Figure: value vs. time; freq vs. time]

  42. Wavelets - DWT • Answer: multiple window sizes! -> DWT • [Figure: freq vs. time panels: time domain, DFT, SWFT, DWT]

  43. Haar Wavelets • subtract sum of left half from right half • repeat recursively for quarters, eighths, ...
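
The Haar recipe can be sketched with pairwise sums and differences applied level by level. The version below uses the orthonormal scaling (divide by the square root of 2 at each step) and the right-minus-left sign from the slide; both are conventions that vary between texts. `haar_dwt` is a hypothetical name and the input length is assumed to be a power of two.

```python
import math


def haar_dwt(signal):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns (approximation, details), where details[0] holds the finest
    scale and details[-1] the single coarsest coefficient.
    """
    size = len(signal)
    if size == 0 or size & (size - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail: right neighbour minus left neighbour, rescaled.
        details.append([(right - left) / root2 for left, right in pairs])
        # Smooth version: rescaled sum, passed on to the next coarser level.
        approx = [(left + right) / root2 for left, right in pairs]
    return approx[0], details


if __name__ == "__main__":
    smooth, detail_levels = haar_dwt([0, 0, 4, 4])
    print(smooth, detail_levels)
```

For a step signal such as [0, 0, 4, 4] the fine-scale details vanish and the coarsest detail carries the jump between the two halves. With this scaling the squared coefficients sum to the squared values of the original signal, which is what makes "energy" per scale a meaningful quantity.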

  44. ‘Tornado Plot’ for Washington Router • Dark areas correspond to high energy • [Figure: frequency (low to high) vs. time]

  45. Tornado Plot: Wavelet Transform for Washington Router 09/2004-09/2006, all coefficients and detail levels 1-12 • Observations: • 1. Obvious spikes (E1): tornados that “touch down” • 2. Prolonged spikes (E2 and E3): coarser scales have high values but finer scales do not • 3. Intermittent waves (E4 and E5): high-energy entries at nearby scales correspond to local periodic motion

  46. Magnification of updates on 28th Aug. 2005 • E2: prolonged spike, a sustained period of relatively high activity • [Figure: # updates vs. time]

  47. Tool #2: logarithms

  48. Tool #2: logarithms • Prominent ‘clothesline’ at ~50 updates per 600 secs • Culprit IP addresses: 192.211.42.0/24, 216.109.38.0/24, 207.157.115.0/24 • All from Alabama (Supercomputing Center)!

  49. Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop) • Tools: fractals, SVD, wavelets, tensors, PageRank

  50. Main point • Two-way street: • <- DM can use such infrastructures to find patterns • -> DM can help such systems/networks etc. to become self-healing, self-adjusting, ‘self-*’ • Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes
