Lessons Learned Monitoring

Les Cottrell, SLAC

ESnet R&D Advisory Workshop April 23, 2007

Arlington, Virginia

Partially funded by DOE and by Internet2

Uses of Measurements
  • Automated problem identification & troubleshooting:
    • Alerts for network administrators, e.g.
      • Baselines, bandwidth changes in time-series, iperf, SNMP
    • Alerts for systems people
      • OS/Host metrics
  • Forecasts for Grid Middleware, e.g. replica manager, data placement
  • Engineering, planning, SLA (set & verify), expectations
  • Also (not addressed here):
    • Security: spot anomalies, intrusion detection
    • Accounting
History
  • PingER (1994), IEPM-BW (2001)
    • E2E, active, regular measurements from the end user's view
    • All hosts owned by individual sites
    • Core mainly centrally designed & developed (homogeneous); contributions from FNAL, GATech, NIIT (close collaboration)
  • Know why you are monitoring:
    • Network trouble management, planning, auditing/setting SLAs, and Grid forecasting are very different goals, though they may use the same measurements
PingER (1994)
  • PingER project started (1995) to measure network performance for the US, European and Japanese HEP communities; now mainly R&E sites
  • Extended this century to measure Digital Divide:
    • Collaboration with ICTP Science Dissemination Unit http://sdu.ictp.it
    • ICFA/SCIC: http://icfa-scic.web.cern.ch/ICFA-SCIC/
  • >120 countries (99% world’s connected population)
  • >35 monitor sites in 14 countries
  • Uses the ubiquitous ping facility
  • Monitors 44 sites in S. Asia
  • Most extensive active E2E monitoring in the world
PingER Design Details
  • PingER Design (1994: no web services, RRD, security not a big thing, etc.)
    • Simple, no remote software (ping everywhere), no probe development, monitor host install 0.5 day effort for sys-admin
    • Data centrally gathered, archived and analyzed, so the hard jobs (archiving, analysis, viz) do NOT require distribution; only one copy
    • Database is flat ASCII files (raw data and analyzed data, one file per host-pair per day); compression saves a factor of 6 (90 GB); see the sketch below
    • Data available via the web (lots of use, some uses unexpected, analysis often done with Excel)
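
As an illustration of this flat-file design, here is a minimal Python sketch of reading such an archive; the directory layout and record fields are hypothetical stand-ins for "one ASCII file per host-pair per day", not the actual PingER formats.

```python
# Sketch of reading a PingER-style flat-file archive (hypothetical layout).
import gzip
from pathlib import Path

ARCHIVE = Path("/data/pinger")  # hypothetical archive root

def daily_records(monitor, remote, day):
    """Yield one record per ping sample from the file for this pair/day."""
    path = ARCHIVE / monitor / remote / f"{day}.txt.gz"   # file/pair/day
    with gzip.open(path, "rt") as f:
        for line in f:
            ts, sent, rcvd, rtt_min, rtt_avg, rtt_max = line.split()
            yield ts, int(sent), int(rcvd), float(rtt_avg)

def daily_loss_pct(monitor, remote, day):
    """Packet loss (%) for one monitor/remote pair on one day."""
    sent = rcvd = 0
    for _, s, r, _ in daily_records(monitor, remote, day):
        sent, rcvd = sent + s, rcvd + r
    return 100.0 * (sent - rcvd) / sent if sent else None

print(daily_loss_pct("slac.stanford.edu", "cern.ch", "2007-04-23"))
```
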
PingER Lessons
  • Measurement code rewritten twice, once to add extra data, once to document (perldoc) / parameterize / simplify installation
  • Gathering code (uses LYNX or FTP) pulls data into the archive; no major mods in 10 years
  • Most of development: for download, analyze data, viz, manage
    • New ways to use the data (jitter, out-of-order, duplicates, derived throughput, MOS) all required studying the data, then implementing and integrating
    • Dirty data (pathologies not related to the network) require filtering or filling before analysis
  • Had to develop an easy make/install download, instructions and a FAQ; new installs still require communication:
    • pre-reqs, name registration, getting cron jobs running, getting the web server running, unblocking, clarifying documentation (often for non-native English speakers)
  • Documentation (tutorials, help, FAQs), publicity (brochures, papers, maps, presentations/travel), get funding/proposals
  • Monitor the availability of (developed tools to simplify/automate):
    • monitor sites (hosts stop working: security blocks, hosts replaced, site forgets); nudge contacts
    • critical remote sites (beacons); choosing a new one automatically updates the monitor sites
  • Validate/update metadata (name, address, institute, lat/long, contact …) in the database (needs easy updating)
IEPM-BW (2001)
  • 40 target hosts in 13 countries
  • Bottlenecks vary from 0.5Mbits/s to 1Gbits/s
  • Traverse ~50 ASes, 15 major Internet providers
  • 5 targets at PoPs, rest at end sites
  • Added Sunnyvale for UltraLight
  • Covers all USATLAS tier 0, 1, 2 sites
  • Recently added FZK, QAU
  • Main author (Connie Logg) retired
IEPM Design Details
  • IEPM-BW (2001):
    • More focused (than PingER), fewer sites (e.g. BaBar collaborators), more intense, more probe tools (iperf, thrulay, pathload, traceroute, owamp, bbftp …), more flexibility
    • Complete code set (measurement, archive, analyze, viz) at each monitoring site. Data distributed.
    • Needs dedicated host
    • Remote sites need code installed
      • Originally executed remotely via ssh; still needed code installed
        • Security, accounts (require training), recovery problems
    • Major changes with time:
      • Use servers rather than ssh for remote hosts
      • Use MySQL for configuration databases rather than configuration held in Perl scripts
      • Provide management tools for configuration data etc.
      • Add/replace probes
IEPM Lessons (1)
  • Problems & recommendations:
    • Need right versions of mysql, gnuplot, perl (and modules) installed on hosts
    • All possible failure modes of the probe tools need to be understood and accommodated
    • Timeout everything, clean up hung processes (see the sketch after this list)
    • Keep logfiles for a day or so for debugging
    • Review how processes are running, e.g. with Netflow (mainly manual)
    • Scheduling:
      • don't run file transfer, iperf, thrulay and pathload at the same time on the same path
      • Limit the duration and frequency of intensive probes so they do not impact the network
    • Hosts lose disks, upgrade OSes, lose DNS, upgrade applications (e.g. gnuplot), zap the IEPM database, etc.
      • Need backup
    • Have a local host as a target for sanity checks (e.g. to spot monitoring-host-based issues)
    • Monitor monitoring host load (e.g. Ganglia, Nagios…)
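
A minimal sketch of two of the recommendations above: time out every probe and serialize intensive probes per path, so iperf, thrulay and pathload never run concurrently on the same path. The tool invocation, lock path and timeout value are illustrative.

```python
# Run one probe under a per-path lock, killing it if it hangs.
import fcntl
import subprocess

def run_probe(cmd, path_lockfile, timeout_s=120):
    with open(path_lockfile, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # one intensive probe per path
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return None   # run() has already killed the hung process

result = run_probe(["iperf", "-c", "remote.example.org", "-t", "20"],
                   "/var/lock/iepm-remote.example.org.lock")
```
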
IEPM Lessons (2)
  • Different paths need different probes (performance and interest related)
  • Experiences with probes (a lot of work to understand, analyze & compare):
    • Owamp vs ping: owamp needs a server and accurate time; ping gives only round-trip but is available everywhere, though it may be blocked
    • Traceroute: need to analyze significance of results
    • Packet pair separation:
      • ABwE noisy, inaccurate especially on Gbps paths
        • Pathchirp better, pathload best (most intense, approaches iperf); problems at 10 Gbps; look at pathneck
    • TCP:
      • thrulay gives more information and is more manageable than iperf
      • need to keep TCP buffers optimized/updated
    • File transfer
      • Disk-to-disk comes close to iperf/thrulay
      • disk tests measure the file/disk system, not the network, but the end user cares about exactly that
  • Adding new hosts still not easy
Other Lessons
  • Traceroute is no good for layers 2 & 1
  • Packet-pair dispersion outstrips clock granularity at 10 Gbps
  • Forecasting is hard if the path is congested; need to account for diurnal etc. variations
  • A net admin cannot review thousands of graphs each day:
    • need event detection, alert notification, and diagnosis assistance
  • Comparing QoS vs best effort requires adding path reservation
  • Keeping TCP buffer parameters optimized is difficult
  • Network & configurations not static
  • Passive/Netflow valuable, complementary
PerfSONAR
  • Our future focus (for us, the 3rd generation):
  • Open source, open community
    • Both end users (LHC, GATech, SLAC, Delaware) and network providers (ESnet, I2, GEANT, Eu NRENs, Brazil, …)
    • Many developers from multiple fields
    • Requires from the get-go: shared code, documentation, collaboration
    • Hopefully not as dependent on the funding of a single team, so more persistent?
  • Transparent gathering and storage of measurements, both from NRENs and end users
  • Sharing of information across autonomous domains
    • Uses standard formats
    • More comprehensive view
    • AAA to provide protection of sensitive data
    • Reduces debugging time
      • Access to multiple components of the path
      • No need to play telephone tag
  • Currently mainly middleware, needs:
    • Data mining and viz
    • Topology also at layers 1 & 2
    • Forecasting
    • Event detection and event diagnosis
E.g. Using Active IEPM-BW measurements
  • Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. HEP tiered model
  • Makes regular measurements with probe tools
    • ping (RTT, connectivity), owamp (one-way delay), traceroute (routes)
    • pathchirp, pathload (available bandwidth)
    • iperf (single & multi-stream), thrulay (achievable throughput)
    • supports bbftp, bbcp (file-transfer applications, not just the network)
      • Looking at GridFTP, but it is complex, requiring certificate renewal
    • Choice of probes depends on importance of path, e.g.
      • For major paths (tier 0, 1 & some 2) use full suite
      • For tier 3 use just ping and traceroute
  • Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV to about 40 remote sites
    • http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html
IEPM-BW Measurement Topology
  • 40 target hosts in 13 countries
  • Bottlenecks vary from 0.5Mbits/s to 1Gbits/s
  • Traverse ~50 ASes, 15 major Internet providers
  • 5 targets at PoPs, rest at end sites

  • Added Sunnyvale for UltraLight
  • Adding FZK Karlsruhe

[Topology map; labels include Taiwan and TWAREN]

Probes: Ping/traceroute
  • Ping still useful
    • Is path connected/node reachable?
    • RTT, jitter, loss
    • Great for low performance links (e.g. Digital Divide), e.g. AMP (NLANR)/PingER (SLAC)
    • Nothing to install, but ping may be blocked
  • OWAMP/I2 similar but One Way
    • But needs a server installed at the other end and good timers
    • Now built into IEPM-BW
  • Traceroute
    • Needs good visualization (traceanal/SLAC)
    • Not usable for dedicated λ / layer 1 or 2 paths
      • However, still want to know the topology of the paths
Probes: Packet Pair Dispersion

[Diagram: a packet pair is sent with the minimum spacing; the bottleneck sets the spacing, which is then preserved on the higher-speed links downstream. Used by pathload, pathchirp and ABwE to estimate available bandwidth.]

  • Send packets with a known separation
  • See how the separation changes due to the bottleneck
  • Can be minimally intrusive on the network, e.g. ABwE uses only 20 packets/direction and is fast, < 1 sec (see the capacity sketch after this list)
  • From a PAM paper, pathchirp is more accurate than ABwE, but
    • takes ten times as long (10s vs 1s)
    • generates more network traffic (~factor of 10)
      • Pathload is a factor of 10 more again
    • http://www.pam2005.org/PDF/34310310.pdf
  • IEPM-BW now supports ABwE, Pathchirp, Pathload
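
The arithmetic behind the technique, as a minimal sketch: the bottleneck serializes back-to-back packets, so the spacing measured at the receiver estimates the bottleneck capacity. The packet size and spacing here are illustrative.

```python
PACKET_BITS = 1500 * 8    # probe packet size in bits

def bottleneck_capacity_bps(recv_spacing_s):
    """Capacity estimate from the measured inter-packet spacing."""
    return PACKET_BITS / recv_spacing_s

# 12 us of dispersion implies a ~1 Gbit/s bottleneck
print(bottleneck_capacity_bps(12e-6) / 1e9, "Gbit/s")
```
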
BUT…
  • Packet-pair dispersion relies on accurate timing of the inter-packet separation
    • At > 1 Gbps this is getting beyond the resolution of Unix clocks (see the arithmetic after this list)
    • AND 10GE NICs are offloading function
      • Coalescing interrupts, Large Send & Receive Offload, TOE
      • Need to work with TOE vendors
        • Turn off offload (Neterion supports multiple channels, can eliminate offload to get more accurate timing in host)
        • Do timing in NICs
        • No standards for interfaces
  • Possibly use packet trains, e.g. pathneck
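
Illustrative arithmetic for the timing problem: the wire time of a 1500-byte packet shrinks to about a microsecond at 10 Gbits/s, at or below the granularity of typical Unix timestamps.

```python
# Back-to-back spacing of 1500-byte packets vs line rate.
for gbps in (1, 10):
    spacing_us = 1500 * 8 / (gbps * 1e9) * 1e6
    print(f"{gbps:>2} Gbit/s: {spacing_us:.1f} us between packets")
# 1 Gbit/s: 12.0 us;  10 Gbit/s: 1.2 us
```
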
Achievable Throughput
  • Use TCP or UDP to send as much data as possible, memory to memory, from source to destination (a bare-bones sketch follows this list)
  • Tools: iperf (bwctl/I2), netperf, thrulay (from Stas Shalunov/I2), udpmon …
  • Pseudo file copy: bbcp also has a memory-to-memory mode to avoid disk/file problems
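
A bare-bones sketch of what such a test does: stream bytes from memory to memory over TCP for a fixed time and report the rate. It assumes a discard-style receiver is already listening at the (hypothetical) HOST:PORT; real tools such as iperf and thrulay add multiple streams, window tuning and richer reporting.

```python
import socket
import time

HOST, PORT = "remote.example.org", 5001   # hypothetical target
DURATION = 20                             # seconds
buf = b"\x00" * (64 * 1024)               # data sent from memory

with socket.create_connection((HOST, PORT)) as s:
    sent, start = 0, time.time()
    while time.time() - start < DURATION:
        s.sendall(buf)
        sent += len(buf)
    elapsed = time.time() - start

print(f"{sent * 8 / elapsed / 1e6:.1f} Mbit/s achievable throughput")
```
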
BUT…
  • At 10 Gbits/s on a transatlantic path, slow start takes over 6 seconds
    • To get 90% of the measurement in congestion avoidance, need to measure for ~1 minute (~52.5 GBytes at 7 Gbits/s, today's typical performance; see the arithmetic after this list)
  • Needs scheduling to scale, even then …
  • It’s not disk-to-disk or application-to-application
    • So also use bbcp, bbftp, or GridFTP
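
The arithmetic above, worked through with illustrative values (6 s of slow start, a 10% budget for it, and 7 Gbits/s of achievable throughput):

```python
slow_start_s = 6                       # observed on a 10GE transatlantic path
measure_s = slow_start_s / 0.10        # keep slow start to 10% of the test
rate_bps = 7e9                         # today's typical performance
gbytes = rate_bps * measure_s / 8 / 1e9
print(f"measure for {measure_s:.0f} s, shipping ~{gbytes:.1f} GBytes")
# -> measure for 60 s, shipping ~52.5 GBytes
```
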
AND …
  • For testbeds such as UltraLight, UltraScienceNet etc. one has to reserve the path
    • So the measurement infrastructure needs the capability to reserve the path (hence an API to the reservation application)
    • OSCARS from ESnet is developing a web-services interface (http://www.es.net/oscars/):
      • For lightweight probes, have a “persistent” reservation
      • For more intrusive probes, reserve just before making the measurement
Examples of real data

[Time-series examples:
Caltech, thrulay, 0-800 Mbps, Nov 05 to Mar 06: misconfigured windows, a new path, very noisy, seasonal (daily & weekly) effects.
UToronto, iperf, 0-250 Mbps, Nov 05 to Jan 06.
UTDallas, pathchirp/thrulay/iperf, 0-120 Mbps, Mar-10-06 to Mar-20-06: some effects are seasonal, others are not; events may affect multiple metrics.]

  • Events can be caused by host or site congestion
  • Few route changes result in bandwidth changes (~20%)
  • Many significant events are not associated with route changes (~50%)
Scatter plots & histograms

Scatter plots: quickly identify correlations between metrics

Histograms: quickly identify variability or multimodality

[Example scatter plots: thrulay vs pathchirp and iperf throughput (Mbps), and thrulay throughput (Mbps) vs RTT (ms); example histograms of pathchirp and thrulay throughput (Mbits/s)]


Changes in network topology (BGP) can result in dramatic changes in performance.

[Figure: snapshot of a traceroute summary table, with samples of traceroute trees generated from the table hour by hour; one remote host was reached via Los-Nettos (100 Mbps).]

Notes:

1. Caltech misrouted via the Los-Nettos 100 Mbps commercial network 14:00-17:00

2. ESnet/GEANT working on routes from 2:00 to 14:00

3. A previous occurrence went unnoticed for 2 months

4. Next step is to auto-detect and notify

[Figure: ABwE measurements (Mbits/s), one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, showing a drop in performance when the route moved from the original path SLAC-CENIC-Caltech to SLAC-ESnet-Los-Nettos (100 Mbps)-Caltech, then back to the original path; the changes were detected by IEPM-Iperf and ABwE. The plot shows the dynamic bandwidth capacity (DBC), the cross-traffic (XT), and Available BW = DBC - XT; the ESnet-Los-Nettos segment in the path is 100 Mbits/s.]

On the other hand
  • Route changes may affect the RTT (shown in yellow)
  • Yet have no noticeable effect on available bandwidth or throughput

[Plot: available bandwidth, achievable throughput, and route changes over time]

However…
  • Elegant graphics are great for understanding problems, BUT:
    • Can be thousands of graphs to look at (many site pairs, many devices, many metrics)
    • Need automated problem recognition AND diagnosis
  • So developing tools to reliably detect significant, persistent changes in performance
    • Initially using a simple plateau algorithm to detect step changes (sketched below)
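
A minimal sketch of a plateau-style step detector in this spirit; the window lengths and threshold are made-up illustrations, not the production IEPM parameters.

```python
# Flag points where a short recent window drops well below the baseline.
from statistics import mean, stdev

def plateau_events(series, baseline_n=50, recent_n=10, k=2.0):
    """Return indices where the recent mean is k sigma below baseline."""
    events = []
    for i in range(baseline_n + recent_n, len(series)):
        base = series[i - baseline_n - recent_n : i - recent_n]
        recent = series[i - recent_n : i]
        if mean(recent) < mean(base) - k * stdev(base):
            events.append(i)
    return events
```
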
Seasonal Effects on Events
  • Bandwidth drops between 19:00 & 22:00 Pacific Time (7:00-10:00am Pakistan time)
  • This causes more anomalous events around this time
Forecasting
  • Over-provisioned paths should have a pretty flat time series
    • Short/local term smoothing
    • Long term linear trends
    • Seasonal smoothing
  • But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths
  • Use Holt-Winters triple exponentially weighted moving averages (a minimal sketch follows)
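
A minimal additive Holt-Winters sketch for such series; the smoothing constants and the season length (24 samples, i.e. a diurnal cycle at one sample/hour) are illustrative, not the production values.

```python
def holt_winters(y, season=24, alpha=0.2, beta=0.05, gamma=0.1):
    """One-step-ahead forecasts; y must cover at least one season."""
    level, trend = y[0], 0.0
    seasonal = [y[i] - y[0] for i in range(season)]   # naive initialization
    forecasts = []
    for t, obs in enumerate(y):
        s = seasonal[t % season]
        forecasts.append(level + trend + s)           # forecast before update
        last_level = level
        level = alpha * (obs - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % season] = gamma * (obs - level) + (1 - gamma) * s
    return forecasts
```
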
Experimental Alerting
  • Have false positives down to a reasonable level (a few per week), so now sending alerts to developers
  • Events are saved in a database
  • Links to traceroutes, event analysis, time series
Passive
  • Active monitoring
    • Pro: regularly spaced data on known paths; can measure on demand
    • Con: adds traffic to the network, can interfere with real data and measurements
  • What about Passive?
Netflow et al.
  • Switch identifies a flow by src/dst ports and protocol
  • Cuts record for each flow:
    • src, dst, ports, protocol, TOS, start, end time
  • Collect the records and analyze (see the top-talkers sketch after this list)
  • Can be a lot of data to collect each day; needs a lot of CPU
    • Hundreds of MBytes to GBytes
  • No intrusive traffic; sees real traffic, real collaborators, real applications
  • No accounts/pwds/certs/keys
  • No reservations etc
  • Characterize traffic: top talkers, applications, flow lengths etc.
  • LHC-OPN requires edge routers to provide Netflow data
  • Internet2 backbone
    • http://netflow.internet2.edu/weekly/
  • SLAC:
    • www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
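
A sketch of the collect-and-analyze step, ranking top talkers by bytes; the CSV field names are a generic illustration, not a particular Netflow export format.

```python
import csv
from collections import Counter

talkers = Counter()
with open("flows.csv") as f:              # hypothetical flow-record export
    for rec in csv.DictReader(f):         # fields: src, dst, ports, bytes, ...
        talkers[rec["src"]] += int(rec["bytes"])

for src, nbytes in talkers.most_common(10):
    print(f"{src:30s} {nbytes / 1e9:8.2f} GB")
```
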
Typical day’s flows
  • Very much work in progress
  • Look at SLAC border
  • Typical day:
    • ~28K flows/day
    • ~75 sites with > 100 KB bulk-data flows
    • A few hundred flows > 1 GByte
  • Collect records for several weeks
  • Filter for the 40 major collaborator sites, big (> 100 KBytes) flows, and bulk-transport apps/ports (bbcp, bbftp, iperf, thrulay, scp, ftp …)
  • Divide by remote site, aggregate parallel streams (see the sketch after this list)
  • Look at throughput distribution
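
A sketch of the filtering and aggregation steps: keep big flows, group them by remote site and time window so the parallel streams of one transfer are counted together, and derive per-site throughputs. Field names and the window size are illustrative.

```python
from collections import defaultdict

def site_throughputs_mbps(flows, window_s=60):
    """flows: dicts with keys site, start (epoch s), bytes."""
    buckets = defaultdict(lambda: defaultdict(int))   # site -> window -> bytes
    for fl in flows:
        if fl["bytes"] < 100_000:         # keep only > 100 KByte flows
            continue
        win = int(fl["start"]) // window_s
        buckets[fl["site"]][win] += fl["bytes"]       # merge parallel streams
    return {site: [8 * b / window_s / 1e6 for b in wins.values()]
            for site, wins in buckets.items()}
```
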
Netflow et al.
  • Peaks at known capacities and RTTs
    • RTTs might suggest windows are not optimized: peaks at the default OS window size (BW = Window/RTT; worked example below)
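
The rule of thumb behind those peaks, worked through for a hypothetical untuned host: a 64 KByte default window across a 150 ms path caps TCP at roughly 3.5 Mbits/s.

```python
window_bytes, rtt_s = 64 * 1024, 0.150
print(f"{window_bytes * 8 / rtt_s / 1e6:.1f} Mbit/s")   # ~3.5 Mbit/s
```
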
How many sites have enough flows?
  • In May ’05 found 15 sites at the SLAC border with > 1440 flows (one per 30 minutes over a month)
    • Maybe enough for time-series forecasting of seasonal effects
  • Three sites (Caltech, BNL, CERN) were actively monitored
  • Rest were “free”
  • Only 10% of sites show big seasonal effects in active measurements
  • The remainder need fewer flows
  • So promising
Mining data for sites
  • Real application use (bbftp) over 4 months
  • Gives a rough idea of throughput (and confidence) for 14 sites seen from SLAC
Multi months
  • bbcp SLAC to Padova

[Plot: bbcp throughput from SLAC to Padova over several months]

  • Fairly stable with time, but large variance
  • Many non-network-related factors
Netflow limitations
  • Use of dynamic ports makes it harder to identify the application
    • GridFTP, bbcp, bbftp can use fixed ports (but may not)
    • P2P often uses dynamic ports
    • Discriminate type of flow based on headers (not relying on ports)
      • Types: bulk data, interactive …
      • Discriminators: inter-arrival time, length of flow, packet length, volume of flow
      • Use machine learning/neural nets to cluster flows (see the sketch after this list)
      • E.g. http://www.pam2004.org/papers/166.pdf
  • Aggregation of parallel flows (needs care, but not difficult)
  • Can be used to give performance forecasts
    • Unclear if it can be used for detecting steps in performance
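
A toy sketch of clustering flows by the discriminators listed above rather than by port, here with plain k-means over scaled features; the feature values are made up, and the cited PAM work uses more careful features and methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per flow: inter-arrival time (s), flow length (s),
# mean packet length (bytes), flow volume (bytes). Values are made up.
features = np.array([
    [0.200, 300.0,   80.0, 2.0e5],   # interactive-looking
    [0.001,  45.0, 1450.0, 6.5e8],   # bulk-data-looking
    [0.002,  60.0, 1400.0, 9.0e8],
    [0.300, 500.0,   90.0, 3.0e5],
])
X = StandardScaler().fit_transform(np.log1p(features))
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)   # e.g. [0 1 1 0]: bulk vs interactive clusters
```
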
Conclusions
  • Some tools fail at higher speeds
  • Throughputs often depend on non-network factors:
    • Host: interface speeds (DSL, 10 Mbps Ethernet, wireless), loads, resource congestion
    • Configurations (window sizes, hosts, number of parallel streams)
    • Applications (disk/file vs mem-to-mem)
  • Looking at distributions by site, often multi-modal
  • Predictions may have large standard deviations
  • Need automated assist to diagnose events
In Progress
  • Working on Netflow viz (currently at BNL & SLAC), then will work with other LHC sites to deploy
  • Add support for pathneck
  • Look at other forecasters: e.g. ARMA/ARIMA, maybe Kalman filters, neural nets
  • Working on diagnosis of events
    • Multi-metrics, multi-paths
  • Signed collaborative agreement with Internet2 to collaborate with PerfSONAR
    • Provide web services access to IEPM data
    • Provide analysis, forecasting and event detection for PerfSONAR data
    • Use PerfSONAR (e.g. router) data for diagnosis
    • Provide viz of PerfSONAR route information
    • Apply to LHCnet
    • Look at layer 1 & 2 information
Questions, More information
  • Comparisons of Active Infrastructures:
    • www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html
  • Some active public measurement infrastructures:
    • www-iepm.slac.stanford.edu/
    • www-iepm.slac.stanford.edu/pinger/
    • e2epi.internet2.edu/owamp/
    • amp.nlanr.net/
  • Monitoring tools
    • www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
    • www.caida.org/tools/
    • Google for iperf, thrulay, bwctl, pathload, pathchirp
  • Event detection
    • www.slac.stanford.edu/grp/scs/net/papers/noms/noms14224-122705-d.doc
Outline
  • Deployment, keeping in sync, management, timeouts, killing hung processes, host OS/env differences
  • Implementation:
    • MySQL DBs for data and configuration (host, tools, plotting etc.) info
    • Scheduler (prevents measurement backlog)
    • Log files, analyzed for troubles
    • Local target as a sanity check on the monitor