Exploring Data Mining Techniques for Performance Evaluation and Traffic Analysis
This document presents fundamental insights into data mining applications addressing key issues like workload characterization, self-monitoring, BGP mining, and large graph analysis. It highlights innovative solutions including self-similar traffic modeling using multifractals and entropy plots for understanding bursty traffic. Along with case studies showcasing the real-world applicability of these techniques, the paper discusses advancements in continuous mining of sensor data and the challenges of generating realistic trace models.
Presentation Transcript
Data Mining Meets Systems: Tools and Case Studies. Christos Faloutsos, SCS, CMU
Thanks: Spiros Papadimitriou (CMU -> IBM), Mengzhi Wang (CMU -> Google), Jimeng Sun (CMU -> IBM) C. Faloutsos
Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop) • Tools: fractals, SVD, wavelets, tensors, PageRank
Problem #1 • Goal: given a signal (e.g., #bytes over time) • Find: patterns, periodicities, and/or compress [Figure: #bytes per 30’ vs. time; other examples: packets per day, earthquakes per year]
Problem #1 • model bursty traffic • generate realistic traces • (Poisson does not work) [Figure: # bytes vs. time for a Poisson trace]
Motivation • predict queue length distributions (e.g., to give probabilistic guarantees) • “learn” traffic, for buffering, prefetching, ‘active disks’, web servers
Q: any ‘pattern’? • Not Poisson • spike; silence; more spikes; more silence… • any rules? [Figure: # bytes vs. time]
solution: self-similarity [Figure: two plots of # bytes vs. time]
But: • Q1: How to generate realistic traces; extrapolate? • Q2: How to estimate the model parameters?
Approach • Q1: How to generate a sequence that is • bursty • self-similar • and has similar queue length distributions
Approach • A: ‘binomial multifractal’ [Wang+02] • ~ 80-20 ‘law’: • 80% of bytes/queries etc. on first half • repeat recursively • b: bias factor (e.g., 80%)
binary multifractals [Figure: an interval split into two halves labeled 20 and 80, then split again recursively]
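A minimal Python sketch of the recursive 80/20 split described above; it is not the authors' code from [Wang+02]. The function name `b_model_trace` and its parameters are assumptions made for illustration. By default the heavier share always goes to the first half, as on the slide; `shuffle=True` (also an assumption) picks the heavier side at random at each split.

```python
import random

def b_model_trace(total, b, levels, shuffle=False, seed=0):
    """Split `total` recursively into 2**levels intervals with bias b."""
    rng = random.Random(seed)
    trace = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in trace:
            heavy, light = volume * b, volume * (1.0 - b)
            # Hypothetical option: flip a coin to decide which half is heavy.
            if shuffle and rng.random() < 0.5:
                finer.extend((light, heavy))
            else:
                finer.extend((heavy, light))
        trace = finer
    return trace

# Example: 1000 bytes spread over 2**3 = 8 time slots with b = 0.8.
print(b_model_trace(1000.0, 0.8, 3))
```

With b = 0.5 the trace is flat; as b approaches 1 it becomes burstier.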
Parameter estimation • Q2: How to estimate the bias factor b? • A: MANY ways [Crovella+96] • Hurst exponent • variance plot • even DFT amplitude spectrum! (‘periodogram’) • More robust: ‘entropy plot’ [Wang+02] • Reference: Mengzhi Wang, Tara Madhyastha, Ngai Hang Chan, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002.
Entropy plot • Rationale: • burstiness: inverse of uniformity • entropy measures uniformity of a distribution • find entropy at several granularities, to see whether/how our distribution is close to uniform.
Entropy plot • Entropy E(n) after n levels of splits • n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$, where $p_1$, $p_2$ are the fractions of bytes in the two halves • n=2: $E(2) = -\sum_{i=1}^{4} p_{2,i} \log_2 p_{2,i}$, where $p_{2,i}$ is the fraction of bytes in the i-th quarter
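The definition above can be sketched in Python as follows. The name `entropy_plot` and the power-of-two length requirement are assumptions made for this illustration. The function starts at the finest level and merges neighbouring intervals to reach the coarser ones, then returns E(0), E(1), ..., E(n).

```python
import math

def entropy_plot(trace):
    """Return [E(0), E(1), ..., E(n)] for a trace of length 2**n.

    E(k) is the entropy of the byte fractions after k levels of splits,
    i.e. over 2**k equal-width intervals.
    """
    size = len(trace)
    if size == 0 or size & (size - 1):
        raise ValueError("trace length must be a power of two")
    total = float(sum(trace))
    if total <= 0:
        raise ValueError("trace must contain some traffic")
    entropies = []
    current = list(trace)
    while True:
        fractions = [v / total for v in current if v > 0]
        entropies.append(0.0 - sum(p * math.log2(p) for p in fractions))
        if len(current) == 1:
            break
        # Merge neighbouring intervals to move one level coarser.
        current = [current[i] + current[i + 1]
                   for i in range(0, len(current), 2)]
    entropies.reverse()  # index k now holds E(k)
    return entropies

print(entropy_plot([80, 20]))
```

A uniform trace gives E(k) = k (slope 1); a trace with all bytes in a single interval gives E(k) = 0 at every level (slope 0).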
Real traffic • Has linear entropy plot (-> self-similar) [Figure: entropy E(n) vs. # of levels (n); the fitted line is labeled 0.73]
Observation - intuition • slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit • uniform dataset: slope = 1 • multi-point: slope = 0 [Figure: entropy E(n) vs. # of levels (n); the fitted line is labeled 0.73]
Some more entropy plots: • Poisson vs. real • Poisson: slope = ~1 -> uniformly distributed [Figure: the two lines are labeled 0.73 and 1]
B-model • b-model traffic gives perfectly linear plot • Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$ • Fitting: do entropy plot; get slope; solve for b [Figure: E(n) vs. n]
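The last fitting step, solving the lemma for b, has no closed form, so one option is a numeric search. The sketch below uses bisection; the names `binary_entropy` and `bias_from_slope` are assumptions, and the search is restricted to b in [0.5, 1), which loses nothing because b and 1-b give the same slope.

```python
import math

def binary_entropy(b):
    """Slope from the lemma: -b log2 b - (1-b) log2 (1-b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)

def bias_from_slope(slope, tol=1e-12):
    """Invert the lemma by bisection on b in [0.5, 1]."""
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope must lie in [0, 1]")
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The slope decreases as b grows from 0.5 towards 1.
        if binary_entropy(mid) > slope:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(bias_from_slope(0.73))
```

For a slope of 0.73 this gives a bias factor between 0.79 and 0.80, i.e. close to the 80-20 ‘law’.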
Experimental setup • Disk traces (from HP [Wilkes 93]) • web traces from LBL http://repository.cs.vt.edu/ lbl-conn-7.tar.Z
Model validation • Linear entropy plots • Bias factors b: 0.6-0.8 • smallest b / smoothest: nntp traffic
Web traffic - results • LBL, NCDF of queue lengths (log-log scales) [Figure: Prob(> l) vs. queue length l]
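One way to obtain a curve like this is to feed a trace through a simple queue and tabulate Prob(> l). The sketch below assumes a single queue drained at a constant rate per time tick; that queue model, and the names `queue_lengths` and `ccdf`, are assumptions for illustration and are not taken from the talk.

```python
def queue_lengths(arrivals, service_rate):
    """Queue backlog after each tick for a constant-rate server."""
    backlog = 0.0
    lengths = []
    for arrived in arrivals:
        # Work left over after serving up to `service_rate` this tick.
        backlog = max(0.0, backlog + arrived - service_rate)
        lengths.append(backlog)
    return lengths

def ccdf(values, thresholds):
    """Empirical Prob(value > l) for each threshold l."""
    count = len(values)
    return [sum(1 for v in values if v > l) / count for l in thresholds]

lengths = queue_lengths([3, 0, 0, 5], service_rate=2)
print(lengths, ccdf(lengths, [0, 2]))
```

Running the same two functions on a real trace and on a synthetic one gives two curves that can be compared on log-log scales.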
Conclusions • Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic
Books • Fractals: Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)
Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop)
Clusters/data center monitoring • Monitor correlations of multiple measurements • Automatically flag anomalous behavior • InteMon: intelligent monitoring system • warsteiner.db.cs.cmu.edu/demo/intemon.jsp
Publication: Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006.
Under the hood: SVD • Singular Value Decomposition • Done incrementally • Reference: Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005, Trondheim, Norway.
Singular Value Decomposition (SVD) • SVD (~LSI ~ KL ~ PCA ~ spectral analysis...) • LSI: S. Dumais; M. Berry • KL: e.g., Duda+Hart • PCA: e.g., Jolliffe • Details: [Press+] [Figure: scatter plot of u of CPU1 vs. u of CPU2, one point per time tick (t=1, t=2, ...)]
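To make the two-CPU picture concrete, here is a small batch sketch that finds the dominant direction of a set of (CPU1, CPU2) measurements by power iteration. It is not the incremental algorithm cited in the talk; the names `first_singular_vector` and `residual` are assumptions, the data is not mean-centered, and the uniform starting vector assumes non-negative measurements.

```python
import math

def first_singular_vector(rows, iterations=100):
    """Power iteration on X^T X, where each row of X is one time tick.

    Returns (v, sigma): the unit-length first right singular vector and
    the corresponding singular value.
    """
    dims = len(rows[0])
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        proj = [sum(x * w for x, w in zip(row, v)) for row in rows]  # X v
        nxt = [sum(p * row[j] for p, row in zip(proj, rows))
               for j in range(dims)]                                 # X^T (X v)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return v, 0.0
        v = [x / norm for x in nxt]
    proj = [sum(x * w for x, w in zip(row, v)) for row in rows]
    return v, math.sqrt(sum(p * p for p in proj))

def residual(row, v):
    """Distance of one measurement from the line spanned by v."""
    p = sum(x * w for x, w in zip(row, v))
    return math.sqrt(sum((x - p * w) ** 2 for x, w in zip(row, v)))

ticks = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
v, sigma = first_singular_vector(ticks)
print(v, sigma, residual((2.0, -1.0), v))
```

A tick with a large residual deviates from the usual correlation between the two CPUs, which is the kind of point a monitoring system might flag.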
Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop)
BGP updates • With: Aditya Prakash (CMU), Michalis Faloutsos (UC Riverside), Nicholas Valler (UC Riverside), Dave Andersen (CMU)
Tool #0: Time plot [Figure: time series of #updates per 600 s, Washington router, 09/2004-09/2006]
Tool #0: Time plot • Observation #1: Missing values • Observation #2: Bursty
Tool #1: Wavelets
Wavelets - DWT • Short window Fourier transform (SWFT) • But: how short should the window be? [Figures: value vs. time; freq vs. time]
Wavelets - DWT • Answer: multiple window sizes! -> DWT [Figure: freq vs. time panels for the time domain, DWT, SWFT, and DFT]
Haar Wavelets • subtract sum of left half from right half • repeat recursively for quarters, eighths, ...
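The recursion above can be sketched in Python. This is the textbook unnormalized Haar transform, computed bottom-up with pairwise averages and differences; the name `haar_transform` and the scaling (dividing by 2 at each level) are assumptions, since the talk does not say which normalization it uses.

```python
def haar_transform(signal):
    """Return (overall average, detail coefficients from coarse to fine)."""
    size = len(signal)
    if size == 0 or size & (size - 1):
        raise ValueError("signal length must be a power of two")
    current = [float(x) for x in signal]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2.0 for i in pairs]
        # Right neighbour minus left neighbour, as on the slide.
        diffs = [(current[i + 1] - current[i]) / 2.0 for i in pairs]
        details.insert(0, diffs)  # coarser levels end up first
        current = averages
    return current[0], details

print(haar_transform([1, 3, 5, 11]))
```

With this scaling the coarsest coefficient equals (sum of right half - sum of left half) / n, and the following levels do the same for quarters, eighths, and so on.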
‘Tornado Plot’ for Washington Router: dark areas correspond to high energy [Figure axes: time; low freq. to high freq.]
Tornado Plot: Wavelet Transform for Washington Router 09/2004-09/2006, all coefficients and detail levels 1-12 • Observations: • 1. Obvious Spikes (E1): tornados that “touch down” • 2. Prolonged Spikes (E2 and E3): when coarser scales have high values but finer scales do not • 3. Intermittent Waves (E4 and E5): high-energy entries at nearby scales correspond to local periodic motion
Magnification of updates on 28th Aug. 2005 • E2: Prolonged Spike, a sustained period of relatively high activity [Figure: # updates vs. time]
Tool #2: logarithms
Tool #2: logarithms • Prominent ‘clothesline’ at ~50 updates per 600 secs. • Culprit IP addresses: 192.211.42.0/24, 216.109.38.0/24, 207.157.115.0/24 • All from Alabama (Supercomputing Center)!
Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & Hadoop) • Tools: fractals, SVD, wavelets, tensors, PageRank
Main point • Two-way street: • <- DM can use such infrastructures to find patterns • -> DM can help such systems/networks etc. to become self-healing, self-adjusting, ‘self-*’ • Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes