
End-to-end Anomalous Event Detection in Production Networks

Les Cottrell, Connie Logg, Felipe Haro, Mahesh Chhaparia (SLAC), Maxim Grigoriev (FNAL), Mark Sandford (Loughborough University). Site visit by Thomas Ndousse, April 27, 2005.


Presentation Transcript


  1. End-to-end Anomalous Event Detection in Production Networks Les Cottrell, Connie Logg, Felipe Haro, Mahesh Chhaparia (SLAC), Maxim Grigoriev (FNAL), Mark Sandford (Loughborough University) Site Visit by Thomas Ndousse April 27, 2005 http://www.slac.stanford.edu/grp/scs/net/talk05/anomaly-apr05.ppt Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)

  2. Outline • Why? • Input data • How? • First approaches • The real world • Results • Conclusions & Futures

  3. Uses of Techniques • Automated problem identification: • Alerts for network administrators, e.g. bandwidth changes in time-series, iperf, SNMP • Alerts for systems people, e.g. OS/host metrics • Anomalies for security • Forecasts (a by-product of the techniques) for Grid middleware, e.g. replica manager, data placement

  4. Data • Uses packet pair dispersion of 20 packets to provide: capacity, cross-traffic, and available bandwidth • At 3 minute intervals • Very noisy time-series data • [Chart: capacity, moving-averaged over 1 hour]
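As a concrete illustration (not from the talk), a minimal sketch of the smoothing the capacity chart implies: a 1-hour moving average over the 3-minute packet-pair samples. The function name and window size are assumptions.

    # Hypothetical sketch: smooth the noisy 3-minute packet-pair series with a
    # 1-hour moving average (20 x 3-minute samples), as the capacity chart suggests.
    import numpy as np

    def moving_average(samples, window=20):
        """Simple boxcar average; returns a series shortened by window-1 points."""
        kernel = np.ones(window) / window
        return np.convolve(np.asarray(samples, dtype=float), kernel, mode="valid")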

  5. Plateau, most intuitive • For each observation: • If it lies outside the history-buffer mean m_h ± b*s_h, add it to the trigger buffer • Else add it to the history buffer and remove the oldest point from the trigger buffer • When the trigger buffer holds > t points, a trigger is issued • If (m_h - m_t)/m_h > D and ≥ 90% of the trigger points arrived in the last T minutes, declare an event • Then move the trigger buffer into the history buffer • Parameters: history length = 1 day, trigger length t = 3 hours, b = 2 standard deviations • [Chart: observations with event marked; trigger % full; history mean; history mean - 2*stdev]
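A minimal sketch of this plateau logic, assuming 3-minute samples so that 1 day of history is roughly 480 points and 3 hours of trigger roughly 60. The class and method names and the default threshold D are illustrative, and the "90% of the trigger in the last T minutes" check is omitted for brevity.

    # Sketch of the plateau detector described on this slide (illustrative names;
    # the 90%-within-T-minutes check is omitted).
    from collections import deque
    import statistics

    class PlateauDetector:
        def __init__(self, history_len=480, trigger_len=60, b=2.0, D=0.3):
            # 480 x 3-minute samples ~ 1 day of history, 60 ~ 3 hours of trigger
            self.history = deque(maxlen=history_len)
            self.trigger = deque(maxlen=trigger_len)
            self.b, self.D = b, D

        def update(self, x):
            """Feed one observation; return True when an event is declared."""
            if len(self.history) < 2:
                self.history.append(x)
                return False
            m_h = statistics.mean(self.history)
            s_h = statistics.stdev(self.history)
            if abs(x - m_h) > self.b * s_h:
                self.trigger.append(x)              # outside m_h +/- b*s_h
            else:
                self.history.append(x)              # normal point goes to history
                if self.trigger:
                    self.trigger.popleft()          # drop the oldest trigger point
            if len(self.trigger) == self.trigger.maxlen:
                m_t = statistics.mean(self.trigger)
                if (m_h - m_t) / m_h > self.D:      # relative drop exceeds threshold D
                    self.history.extend(self.trigger)   # move trigger buffer to history
                    self.trigger.clear()
                    return True
            return False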

  6. K-S • For each observation: compare the previous 100 observations with the next 100 observations • Compare the maximum vertical difference between their CDFs • How does it differ from the difference expected for random CDFs? • Expressed as a % difference • [Chart: comparison of K-S with Plateau]
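A sketch of the two-window comparison, using scipy's two-sample K-S statistic for the maximum vertical CDF difference; the 100-point windows and the ~70% event threshold follow the slides, while the function name is illustrative.

    # Sketch: percent K-S difference between the 100 points before and after index i.
    import numpy as np
    from scipy.stats import ks_2samp

    def ks_difference(series, i, window=100):
        before = np.asarray(series[i - window:i], dtype=float)
        after = np.asarray(series[i:i + window], dtype=float)
        if len(before) < window or len(after) < window:
            return 0.0                                # not enough data on one side
        stat, _pvalue = ks_2samp(before, after)       # max vertical CDF difference, 0..1
        return 100.0 * stat                           # as a percentage; event if > ~70%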

  7. Compare • Results from K-S & Plateau are very similar, using a K-S threshold of 70% • The current Plateau only finds negative changes • Useful to see when the condition returns to normal • K-S is implemented in C and executes faster than Plateau (in Perl); this depends on the parameters • K-S is more formalized • Plateau and K-S work well for non-seasonal observations (e.g. small day/night changes)

  8. Seasons & false alerts • Congestion on Monday following a quiet weekend causes a high forecast, gives an alert • Also a history buffer of not a day causes History mean to be out of sync with observations

  9. Effect on events • A change (drop) in bandwidth between 19:00 & 20:00 causes more anomalous events around this time

  10. Seasonal Changes • Use the Holt-Winters (H-W) technique: • Uses a triple exponentially weighted moving average • EWMA(i) = α * Obs(i) + (1 - α) * EWMA(i-1) • Three terms, each with its own parameter (α, β, γ), that take into account local smoothing, long-term seasonal smoothing, and trends
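A sketch of one additive Holt-Winters update in the NIST-style formulation, with alpha, beta and gamma standing in for the three smoothing parameters on this slide; the function and variable names are illustrative.

    # Sketch of one additive Holt-Winters update step (alpha = local smoothing,
    # beta = trend, gamma = seasonal); `season` is the seasonal index from one
    # season (e.g. one week) ago for this time slot.
    def holt_winters_step(y, level, trend, season, alpha, beta, gamma):
        forecast = level + trend + season                 # one-step-ahead forecast F_t
        new_level = alpha * (y - season) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        new_season = gamma * (y - new_level) + (1 - gamma) * season
        return new_level, new_trend, new_season, forecast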

  11. Example • Local smoothing: 99% weight for the last 24 hours • Linear trend: 50% for the last 24 hours • Seasonal: 99% for the last week • Within an 80 minute window, 80% of points outside the deviation envelope ≡ event • [Chart: observations, forecast and deviations over a weekend and weekdays]
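A sketch of the windowed event rule above: flag an event when 80% of the points in an (approximately) 80-minute window fall outside the deviation envelope. The envelope width factor k is an assumption; the slide does not give it.

    # Sketch: with 3-minute samples an 80-minute window is ~27 points; an event is
    # flagged when at least 80% of them lie outside forecast +/- k * deviation.
    import numpy as np

    def envelope_event(obs, forecast, deviation, k=2.0, frac=0.8):
        obs, forecast, deviation = (np.asarray(a, dtype=float)
                                    for a in (obs, forecast, deviation))
        outside = np.abs(obs - forecast) > k * deviation
        return outside.mean() >= frac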

  12. Evaluation • Created a library of time series for 100 days from June through Sep 2004 for 40 hosts • Analyzed using Plateau and saved all events where the trigger buffer filled (no filters on the size of the step) • 23 hosts had 120 candidate events • Event types: steps; diurnal changes; congestion from cron jobs, bandwidth tests, flash crowds • Classified the ~120 events as to whether they were interesting: a large, sharp drop in bandwidth, persisting for >> 3 hrs

  13. Results • K-S shows similar results to Plateau • As the parameters are adjusted to reduce false positives, the number of missed events increases • E.g. for Plateau with a trigger buffer of 3 hrs filled to 90% in < 220 minutes and a history buffer of 1 day, the effect of the threshold D = (m_h - m_t)/m_h • [Chart: Plateau (b = 2) vs. K-S with ±100 observations]

  14. Conclusions • A few paths (10%) have strong seasonal effects • Plateau & K-S work well if there are only weak seasonal effects • K-S detects both step-downs and step-ups, and also gives an accurate time estimate of the event (good for correlations) • H-W is promising for seasonal effects, but: • It is more complex and requires more parameters, which may not be easy to estimate • It requires regularly spaced data (an interpolation step) • CPU time can depend critically on the parameters chosen, e.g. increasing the K-S range from ±100 to say ±400 increases CPU time by a factor of 14 • H-W works, but we still need to quantify its effectiveness • Looking at PCA to evaluate multiple metrics (e.g. fwd & bwd traffic, RTT) and multiple paths simultaneously

  15. Future Work • Improve the event detection technique for the Holt-Winters (H-W) method: • We tried applying K-S to the residuals of the H-W technique, but this did not work well • Next we plan to apply Plateau to the H-W residuals • Future development of PCA: • Enable looking at multiple measurements simultaneously • E.g. RTT, loss, capacity ...; multiple routes • Neural networks to interpolate heavyweight/infrequent measurements from lightweight, more frequent ones

  16. More information • SLAC Plateau implementation • www.acm.org/sigs/sigcomm/sigcomm2004/workshop_papers/nts26-logg1.pdf • SLAC H-W implementation • www-iepm.slac.stanford.edu/monitoring/forecast/hw.html • Eng. Statistics Handbook • http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc435.htm • Comparison Between Mark Burgess Method & KS • http://www-iepm.slac.stanford.edu/monitoring/forecast/ksvsmb/ksvsmb.htm

  17. Diurnal Variation • People arriving at work between 19:00 & 20:00 PDT (7:00 & 8:00 PK time) cause a sudden drop in dynamic capacity

  18. H-W Implementation • Needs regularly spaced data (else going back one season is difficult and gets out of sync): • Interpolate the data: select a bin size • Average the points in each bin • If there are no points in a first-week bin, take data from future weeks • For following weeks, missing data bins are filled from the previous week • Initial values for the smoothing taken from the NIST "Engineering Statistics Handbook" • Choose the parameters by minimizing (1/N) Σ_t (F_t - y_t)^2 • F_t = forecast for time t as a function of the parameters, y_t = observation at time t
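A sketch of the parameter choice: minimize the mean squared one-step forecast error (1/N) Σ (F_t - y_t)^2 by a simple grid search over (alpha, beta, gamma). The initialization here is a simplified stand-in for the NIST handbook recipe, and the grid values are illustrative.

    # Sketch: grid-search the Holt-Winters smoothing parameters by minimizing the
    # mean squared one-step forecast error; assumes at least two full seasons of
    # regularly binned data.
    import itertools
    import numpy as np

    def hw_mse(y, m, alpha, beta, gamma):
        """Mean squared one-step error of additive Holt-Winters on y, season length m."""
        y = np.asarray(y, dtype=float)
        level = y[:m].mean()
        trend = (y[m:2 * m].mean() - y[:m].mean()) / m      # crude initial trend
        season = list(y[:m] - level)                        # crude initial seasonal indices
        sq_err = []
        for t in range(m, len(y)):
            forecast = level + trend + season[t % m]        # F_t
            sq_err.append((forecast - y[t]) ** 2)
            new_level = alpha * (y[t] - season[t % m]) + (1 - alpha) * (level + trend)
            trend = beta * (new_level - level) + (1 - beta) * trend
            season[t % m] = gamma * (y[t] - new_level) + (1 - gamma) * season[t % m]
            level = new_level
        return float(np.mean(sq_err))

    def fit_parameters(y, m, grid=np.linspace(0.1, 0.9, 9)):
        """Return the (alpha, beta, gamma) triple with the smallest error."""
        return min(itertools.product(grid, repeat=3), key=lambda p: hw_mse(y, m, *p))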

  19. H-W Implementation • Three implementations evaluated (two new): • FNAL (Maxim Grigoriev): the inspiration for evaluating this method • Part of RRD (Brutlag): limited control over what it produces and how it works • SLAC: implemented the NIST formulation, with a different formulation/parameter values from Brutlag/FNAL; also added minimization of the sum of squares to get the parameters

  20. Events • Can look at the residuals (F_t - y_t), or χ^2 • Could use K-S or Plateau on the residuals, or on the locally smoothed component (i.e. after removing long-term seasonal effects)
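For instance, reusing the ks_difference sketch from the K-S slide, the first option could look roughly like this; the function and array names are illustrative.

    # Sketch: apply the K-S window comparison to the H-W residuals F_t - y_t.
    def residual_events(observations, forecasts, threshold=70.0, window=100):
        residuals = [f - y for f, y in zip(forecasts, observations)]
        return [i for i in range(window, len(residuals) - window)
                if ks_difference(residuals, i, window) > threshold]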

  21. Mark Burgess Method • A two-dimensional time-series approach used to derive a periodic, adaptive threshold for service-level anomaly detection • An iterative algorithm is applied to the history on this periodic timescale to provide a smooth roll-off in the significance of the data with time • This method was originally designed to detect anomalous behavior on a single host

  22. Compare with K-S • Iperf from SLAC to Caltech, Feb & Mar 05 • The K-S technique works very well for long-term anomalous variations in Internet end-to-end traffic • The Mark Burgess technique detects an anomaly for each and every unwanted large spike/variation (in real time) • [Charts: K-S result; Mark Burgess technique result]

  23. PCA • PCA is a coordinate transformation that maps a given set of data points onto new axes, called the principal axes or principal components • For network anomaly detection, PCA divides the data into a normal and an abnormal subspace • Procedure: • Arrange the data into matrix form • Zero-mean the matrix data • Calculate the covariance matrix • Calculate the principal components • Applying (I - P P^T)(data matrix) yields the result, where P is the matrix of principal components
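A sketch of that residual computation, assuming rows of the data matrix are metrics and columns are time points; the number of principal components kept for the normal subspace is an assumption.

    # Sketch: project the zero-meaned data onto the normal subspace spanned by the
    # top principal components and keep (I - P P^T) X as the abnormal part.
    import numpy as np

    def pca_residual(X, n_components=3):
        """X: metrics x time matrix. Returns a per-time anomaly score."""
        Xc = X - X.mean(axis=1, keepdims=True)        # zero-mean each metric
        cov = np.cov(Xc)                              # covariance matrix (metrics x metrics)
        _eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
        P = eigvecs[:, -n_components:]                # top principal components
        residual = (np.eye(X.shape[0]) - P @ P.T) @ Xc
        return np.linalg.norm(residual, axis=0)       # large values suggest anomalies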

  24. PCA Results • [Chart: PCA results on SLAC-BINP (June-Sep 2004); anomalous events due to a 10% rise in dbcap and a 10% rise in RTT] • Caught all the events that were detected by H-W, Plateau and K-S • Can work on multiple parameters • Tested PCA on six routes so far: SLAC-FZK, SLAC-DESY, SLAC-CALTECH, SLAC-NIIT, SLAC-BINP, SLAC-UMICH
