
Terapaths Monitoring DWMI: Datagrid Wide area Monitoring Infrastructure

Terapaths Monitoring DWMI: Datagrid Wide area Monitoring Infrastructure. Les Cottrell & Yee-Ting Li, SLAC US-LHC End-To-End Networking Meeting, FNAL October 25, 2006. Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM), and by Internet2. Active E2E Monitoring.



Presentation Transcript


  1. Terapaths Monitoring DWMI: Datagrid Wide area Monitoring Infrastructure • Les Cottrell & Yee-Ting Li, SLAC • US-LHC End-To-End Networking Meeting, FNAL, October 25, 2006 • Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM), and by Internet2

  2. Active E2E Monitoring

  3. Active IEPM-BW measurements • Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. HEP tiered model • Makes regular measurements with probe tools • ping (RTT, connectivity), owamp (one-way delay), traceroute (routes) • pathchirp, pathload (available bandwidth) • iperf (one & multi-stream), thrulay (achievable throughput) • Supports bbftp, bbcp (file transfer applications, not network) • Looking at GridFTP, but it is complex and requires renewing certificates • Choice of probes depends on importance of path, e.g. • For major paths (tier 0, 1 & some 2) use full suite • For tier 3 use just ping and traceroute • Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV to about 40 remote sites • http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html
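The tier-based choice of probes on slide 3 can be pictured as a small lookup. This is only an illustrative sketch, not IEPM-BW's actual configuration: the names `FULL_SUITE`, `BASIC_SUITE`, `probes_for_path` and the `is_major` flag are hypothetical, and treating a non-major tier 2 path like tier 3 is an assumption, since the slide only says "some" tier 2 paths get the full suite.

```python
# Hypothetical sketch of tier-based probe selection (names are illustrative).
# Tool names are taken from the slide; grouping them as one "full suite" is an assumption.
FULL_SUITE = (
    "ping", "owamp", "traceroute",   # RTT/connectivity, one-way delay, routes
    "pathchirp", "pathload",         # available bandwidth
    "iperf", "thrulay",              # achievable throughput
    "bbftp", "bbcp",                 # file transfer applications
)
BASIC_SUITE = ("ping", "traceroute")


def probes_for_path(tier, is_major=False):
    """Return the probe tools to run for a path to a site of the given tier.

    Tier 0 and 1 paths always get the full suite; a tier 2 path gets it only
    when flagged as major (assumption); everything else gets the basic suite.
    """
    if tier in (0, 1) or (tier == 2 and is_major):
        return FULL_SUITE
    return BASIC_SUITE
```

For example, `probes_for_path(3)` yields just ping and traceroute, while `probes_for_path(2, is_major=True)` yields the full list.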

  4. IEPM-BW Measurement Topology • 40 target hosts in 13 countries • Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s • Traverse ~50 ASes, 15 major Internet providers • 5 targets at PoPs, rest at end sites • Added Sunnyvale for UltraLight • Covers all USATLAS tier 0, 1, 2 sites • Adding FZK Karlsruhe

  5. Top page

  6. Visualization & Forecasting in Real World

  7. Examples of real data • Features seen in the time series: misconfigured windows, new path, very noisy, seasonal effects (daily & weekly) • Some are seasonal, others are not • Events may affect multiple metrics • Events can be caused by host or site congestion • Few route changes result in bandwidth changes (~20%) • Many significant events are not associated with route changes (~50%) • [Figures: time series for Caltech (thrulay, 0 to 800 Mbps, Nov05 to Mar06), UToronto (miperf, 0 to 250 Mbps, Nov05 to Jan06) and UTDallas (pathchirp, thrulay, iperf, 0 to 120 Mbps, Mar-10-06 to Mar-20-06)]

  8. Scatter plots & histograms • Scatter plots: quickly identify correlations between metrics • Histograms: quickly identify variability or multimodality • [Figures: scatter plots and histograms for thrulay, pathchirp and iperf; axis labels: Thrulay (Mbps), RTT (ms), Pathchirp & iperf (Mbps), Throughput (Mbits/s)]

  9. Changes in network topology (BGP) can result in dramatic changes in performance • Notes: 1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00 2. ESnet/GEANT working on routes from 2:00 to 14:00 3. A previous occurrence went unnoticed for 2 months 4. Next step is to auto-detect and notify • Drop in performance when the path changed from the original SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech, followed by a return to the original path • Changes detected by IEPM-Iperf and ABwE • Available BW = (DBC - XT), where DBC is the dynamic BW capacity and XT is the cross-traffic • [Figures: snapshot of traceroute summary table (hour vs. remote host); samples of traceroute trees generated from the table; ABwE measurements in Mbits/s, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, with the ESnet-LosNettos segment (100 Mbits/s) in the path]

  10. However… • Elegant graphics are great to understand problems BUT: • Can be thousands of graphs to look at (many site pairs, many devices, many metrics) • Need automated problem recognition AND diagnosis • So developing tools to reliably detect significant, persistent changes in performance • Initially using simple plateau algorithm to detect step changes • Holt-Winters for forecasting if seasonal effects
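To make "detect significant, persistent changes" on slide 10 concrete, here is a minimal sketch of a plateau-style step detector. It is not the authors' implementation: the history/trigger buffer scheme, the function name `detect_step_drops` and all parameter names and default values are assumptions chosen for illustration. A sample counts as suspicious when it falls more than `sensitivity` standard deviations below the mean of recent history, and a drop is declared only when `trigger_len` consecutive samples are suspicious and their mean is at least `min_rel_drop` below the baseline.

```python
from statistics import mean, pstdev


def detect_step_drops(series, history_len=20, trigger_len=5,
                      sensitivity=2.0, min_rel_drop=0.2):
    """Return the indices at which a persistent drop in `series` begins.

    Illustrative sketch only; buffer sizes and thresholds are assumptions.
    """
    history = list(series[:history_len])  # recent "normal" samples (baseline)
    trigger = []                          # consecutive suspicious samples
    events = []
    for i in range(history_len, len(series)):
        x = series[i]
        mu = mean(history)
        sigma = pstdev(history)
        if x < mu - sensitivity * sigma:
            trigger.append(x)
            if len(trigger) == trigger_len:
                # Persistent change: report it only if it is large enough.
                if mu > 0 and (mu - mean(trigger)) / mu >= min_rel_drop:
                    events.append(i - trigger_len + 1)
                # Either way, the new plateau becomes the baseline.
                history = trigger
                trigger = []
        else:
            # A normal sample ends any run of suspicious samples.
            trigger = []
            history = (history + [x])[-history_len:]
    return events
```

With a throughput series that sits at 900 Mbits/s for 30 samples and then stays at 100 Mbits/s, the sketch reports one event at index 30; a single low sample followed by a return to 900 Mbits/s produces no event, which is the point of requiring persistence.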

  11. Seasonal Effects on events • Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time) • Causes more anomalous events around this time

  12. Forecasting • Over-provisioned paths should have pretty flat time series • Short/local term smoothing • Long term linear trends • Seasonal smoothing • But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths • Use Holt-Winters triple exponential weighted moving averages
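The three smoothing components named on slide 12 (short-term level, long-term linear trend, seasonal term) map onto the three update equations of Holt-Winters. The sketch below uses the textbook additive form; the slides do not say which variant, initialisation or smoothing constants IEPM-BW uses, so the function name `holt_winters_additive`, the initialisation from the first two seasons, and the default `alpha`, `beta`, `gamma` values are all assumptions.

```python
def holt_winters_additive(series, season_len, alpha=0.5, beta=0.1, gamma=0.1):
    """Return one-step-ahead forecasts for series[season_len:].

    Additive Holt-Winters sketch: forecast = level + trend + seasonal term.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons to initialise")

    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    # Average per-sample change between the first two seasons.
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonal = [x - level for x in first]

    forecasts = []
    for t in range(season_len, len(series)):
        s = seasonal[t % season_len]
        forecasts.append(level + trend + s)    # prediction made before seeing series[t]
        x = series[t]
        new_level = alpha * (x - s) + (1 - alpha) * (level + trend)    # short-term smoothing
        trend = beta * (new_level - level) + (1 - beta) * trend        # long-term linear trend
        seasonal[t % season_len] = gamma * (x - new_level) + (1 - gamma) * s  # seasonal smoothing
        level = new_level
    return forecasts
```

A measurement could then be treated as a candidate event when it deviates from its forecast by more than a chosen tolerance; on a perfectly periodic series the forecasts reproduce the observations.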

  13. Experimental Alerting • Have false positives down to reasonable level (few per week), so sending alerts to developers • Saved in database • Links to traceroutes, event analysis, time-series

  14. Passive • Active monitoring • Pro: regularly spaced data on known paths, can make on-demand • Con: adds data to network, can interfere with real data and measurements • What about Passive?

  15. Netflow et al. • Switch identifies flow by src/dst ports, protocol • Cuts record for each flow: • src, dst, ports, protocol, QoS, start, end time • Collect records and analyze • Can be a lot of data to collect each day, needs a lot of CPU • Hundreds of MBytes to GBytes • No intrusive traffic; real traffic, collaborators, applications • No accounts/pwds/certs/keys • No reservations etc. • Characterize traffic: top talkers, applications, flow lengths etc.
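As an illustration of "collect records and analyze" on slide 15, the sketch below ranks top talkers from a list of flow records. The `FlowRecord` class and `top_talkers` function are hypothetical, not a real Netflow export format or collector API; the fields follow the slide's list, except that the `nbytes` byte counter is an assumption added so there is something to rank by.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical flow record with the fields listed on the slide."""
    src: str
    dst: str
    src_port: int
    dst_port: int
    protocol: str
    qos: int
    start: float   # seconds since epoch
    end: float     # seconds since epoch
    nbytes: int    # assumption: each record carries a byte count


def top_talkers(records, n=3):
    """Return the n source hosts that sent the most bytes, largest first."""
    totals = Counter()
    for r in records:
        totals[r.src] += r.nbytes
    return totals.most_common(n)
```

The same pattern, keyed on `dst_port` or on flow duration (`end - start`) instead of `src`, would give the per-application and flow-length summaries the slide mentions.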

  16. Application to LHCnet • LHC-OPN requires edge routers to provide Netflow data • SLAC developing Netflow visualization at BNL • Allows selection of destinations, services • Displays time series, tables, pie charts, spider plots • Will port to other LHC-OPN sites • [Screenshot annotations: choose aggregation, choose services]

  17. Netflow limitations • Use of dynamic ports makes it harder to identify the application • GridFTP, bbcp, bbftp can use fixed ports (but may not) • P2P often uses dynamic ports • Discriminate type of flow based on headers (not relying on ports) • Types: bulk data, interactive … • Discriminators: inter-arrival time, length of flow, packet length, volume of flow • Use machine learning/neural nets to cluster flows • E.g. http://www.pam2004.org/papers/166.pdf • Aggregation of parallel flows (needs care, but not difficult) • Can use for giving performance forecast • Unclear if can use for detecting steps in performance
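One way to picture the "aggregation of parallel flows" mentioned on slide 17: flows between the same host pair whose time intervals overlap, or nearly overlap, are merged into one logical transfer. The slides do not describe the actual rule, so the merging criterion, the `max_gap` parameter and the tuple layout below are all assumptions.

```python
def aggregate_parallel_flows(flows, max_gap=1.0):
    """Merge overlapping flows between the same hosts into logical transfers.

    `flows` is an iterable of (src, dst, start, end, nbytes) tuples
    (hypothetical layout). Returns a dict mapping (src, dst) to a list of
    transfers, each a dict with start, end, nbytes and streams.
    """
    transfers_by_pair = {}
    # Process in start-time order so each flow only needs comparing with the
    # most recent transfer for its host pair.
    for src, dst, start, end, nbytes in sorted(flows, key=lambda f: f[2]):
        transfers = transfers_by_pair.setdefault((src, dst), [])
        if transfers and start <= transfers[-1]["end"] + max_gap:
            current = transfers[-1]
            current["end"] = max(current["end"], end)
            current["nbytes"] += nbytes
            current["streams"] += 1
        else:
            transfers.append(
                {"start": start, "end": end, "nbytes": nbytes, "streams": 1}
            )
    return transfers_by_pair


def throughput_mbps(transfer):
    """Aggregate throughput of one merged transfer in Mbits/s."""
    duration = transfer["end"] - transfer["start"]
    return transfer["nbytes"] * 8 / duration / 1e6 if duration > 0 else 0.0
```

The care the slide asks for shows up in the choice of `max_gap`: too small and a multi-stream transfer is split into pieces, too large and unrelated back-to-back transfers are merged.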

  18. perfSONAR (pS) • See Joe Metzger’s talk later today • SLAC/IEPM formally joined pS • De-centralised management of pS allows us to concentrate more on analysis rather than deployment/maintenance, provides a rich source of data to analyze, leverages IEPM group skills • perfSONAR allows transparent data access • Bring US HEP influence closer to pS • Make iepm-bw data available via pS infrastructure • Porting of our analysis tools to work with pS • Test perfSONAR APIs • Provide useful features such as analysis, visualization, event detection, alerting and diagnosis • pS enables the unification of both end-to-end and router metric representation • Worry about finding correlations for diagnosis rather than determining ‘how’ to gather the data

  19. Questions, More information • Comparisons of Active Infrastructures: • www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html • Some active public measurement infrastructures: • www-iepm.slac.stanford.edu/ • www-iepm.slac.stanford.edu/pinger/ • e2epi.internet2.edu/owamp/ • amp.nlanr.net/ # No longer funded • Monitoring tools • www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html • www.caida.org/tools/ • Google for iperf, thrulay, bwctl, pathload, pathchirp, pathneck • Event detection • www.slac.stanford.edu/grp/scs/net/papers/noms/noms14224-122705-d.doc • Netflow: • Internet 2 backbone • http://netflow.internet2.edu/weekly/ • SLAC: • www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html • BNL (SLAC developed, work in progress) • http://iepmbw.bnl.org/netflow/index.html
