
SLAC and PerfSONAR


Yee-Ting Li

PerfSONAR developers workshop

October 2006

SLAC IEPM
  • SLAC used to be primarily a High Energy Particle Physics institute
  • Now beginning to diversify into other sciences
    • Photon Science (SSRL and LCLS)
    • Impact to chemistry and molecular biology
  • First US-based web page was at SLAC!
  • Internet End-to-end Performance Monitoring Group
    • Focus on problem detection and long term performance/trend analysis
    • Origins in PingER monitoring
    • Currently deploying more intrusive IEPM-BW tests
  • PingER project started (1995) to measure network performance for the US, European and Japanese HEP community
  • Extended this century to measure the Digital Divide
  • Last year added monitoring sites in S. Africa, Pakistan & India
  • Uses ICMP to determine:
    • RTT
    • Loss
    • Connectivity
    • Derived TCP throughput, i.e. proportional to 1/sqrt(loss)
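
The derived-throughput bullet above can be illustrated with a minimal sketch. It assumes the widely used Mathis-style approximation, throughput ≈ C · MSS / (RTT · sqrt(loss)). The slide only states the 1/sqrt(loss) dependence, so the function name, the default MSS of 1460 bytes and the constant `c` are assumptions for illustration, not PingER's actual code.

```python
import math

def derived_tcp_throughput(rtt_ms, loss_fraction, mss_bytes=1460, c=1.0):
    """Estimate TCP throughput in bits/s from ping RTT and packet loss.

    Hypothetical helper: uses throughput ~ c * MSS / (RTT * sqrt(loss)).
    Returns None when the estimate is undefined (no loss or no RTT).
    """
    if loss_fraction <= 0 or rtt_ms <= 0:
        return None
    rtt_s = rtt_ms / 1000.0
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_fraction))

# Example: 100 ms RTT with 1% loss
estimate = derived_tcp_throughput(100, 0.01)
```

With zero measured loss the formula diverges, which is why the sketch returns `None` rather than a number in that case.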
PingER: Deployment
  • ~120 countries
  • 99% world’s connected population
  • 35 monitor sites in 14 countries
  • Over 600 nodes currently being monitored worldwide
PingER: Digital Divide

Years behind Europe:
  • 6 yrs: Russia, Latin America
  • 7 yrs: Mid-East, SE Asia
  • 10 yrs: South
  • 11 yrs: Cent. Asia
  • 12 yrs: Africa

IEPM-BW
  • Developed as an exhibit for SC2001
  • Conducts tests using various tools
    • Achievable BW: Iperf, thrulay
    • Estimated BW: pathchirp, pathload, abwe
    • File Transfer: bbcp, bbftp, gridftp
    • Latency/Loss: ping, traceroute, owamp
  • MySQL backend with Web-based front end
  • Collection of scripts to:
    • Start/stop daemons
    • Conduct analysis (and produce web-accessible graphs)
    • Forecasting and Event detection (and notification)
IEPM-BW: Deployment
  • Monitors running at CERN, SLAC, FNAL, BNL, Caltech and Taiwan, testing to about 40 remote sites (in a semi-mesh)
  • 40 target hosts in 13 countries
  • Bottlenecks vary from 0.5 Mbit/s to 10 Gbit/s
  • Paths traverse ~50 ASes and 15 major Internet providers
  • 5 targets at PoPs, rest at end sites
IEPM-BW: Presentation
  • Timeseries plots
IEPM-BW: Event Detection
  • Automated problem identification:
    • Administrators cannot review hundreds of graphs each day
    • Alerts for network administrators
      • Changes in time-series, loss, latency, iperf, SNMP
    • Alerts for systems people
      • OS/Host metrics
    • Anomalies for security
  • Anomalous event detection
    • A run of missing measurements (network outage?)
    • Determine that something ‘wrong’ has happened: the measured value differs significantly from the expected value
  • Forecasts
    • Given trends in previous measurements, determine what is within tolerance of being ‘okay’
Event Detection: Plateau
  • Circular (history) buffer of observations
  • Define a trigger buffer of results
    • Trigger buffer fills when an observation deviates significantly from the mean of the history buffer
  • Event occurs when the trigger buffer exceeds a fill threshold
  • Filters:
    • If (m_h - m_t) / m_h > D and the trigger buffer is 90% full within the last T minutes, raise an event (m_h, m_t: means of the history and trigger buffers)
    • Move trigger buffer to history buffer
  • Plot annotations: history mean; history mean - 2 * stdev; trigger % full
  • Parameters: history length = 1 day; t = trigger length = 3 hours; standard deviations = 2
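
The plateau steps above can be sketched roughly as follows. This is an illustrative reading of the slide, not the IEPM-BW implementation: the class name, the buffer sizes (given in samples, since the sampling interval is not stated), the `min_drop` threshold standing in for D, and the choice to drain one trigger entry on each normal observation are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev

class PlateauDetector:
    """Hypothetical sketch of a plateau (step-down) detector."""

    def __init__(self, history_len=48, trigger_len=6, n_stdev=2.0,
                 trigger_fill=0.9, min_drop=0.3):
        self.history = deque(maxlen=history_len)  # circular buffer
        self.trigger = []                         # suspicious observations
        self.trigger_len = trigger_len
        self.n_stdev = n_stdev
        self.trigger_fill = trigger_fill
        self.min_drop = min_drop                  # plays the role of D

    def update(self, value):
        """Feed one observation; return True when an event is raised."""
        if len(self.history) < 2:
            self.history.append(value)
            return False
        m_h = mean(self.history)
        s_h = pstdev(self.history)
        if value < m_h - self.n_stdev * s_h:
            # Deviates significantly below the history mean.
            self.trigger.append(value)
        else:
            self.history.append(value)
            if self.trigger:
                self.trigger.pop(0)  # assumption: normal data drains the trigger
            return False
        if len(self.trigger) < self.trigger_fill * self.trigger_len:
            return False
        m_t = mean(self.trigger)
        if m_h != 0 and (m_h - m_t) / m_h > self.min_drop:
            # Event: the new level becomes part of the history.
            self.history.extend(self.trigger)
            self.trigger.clear()
            return True
        return False

detector = PlateauDetector(history_len=20, trigger_len=5)
samples = [100, 102] * 10 + [50] * 5
flags = [detector.update(v) for v in samples]
```

Moving the trigger buffer into the history after an event means a sustained drop is reported once and then treated as the new baseline.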
Event Detection: K-S
  • For each observation: compare the previous 100 observations with the next 100 observations
    • Compare the vertical difference in CDFs
    • How does it differ from random CDFs
    • Expressed as % difference
    • Define threshold for % difference
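
A minimal sketch of the windowed K-S comparison described above, in plain Python. The function names, the small window used in the example and the 90% threshold are illustrative assumptions. The statistic is the maximum vertical distance between the two empirical CDFs, expressed here as a percentage.

```python
from bisect import bisect_right

def ks_percent(before, after):
    """Maximum vertical distance between two empirical CDFs, in percent."""
    a, b = sorted(before), sorted(after)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return 100.0 * d

def ks_events(series, window=100, threshold_pct=90.0):
    """Indices where the windows before and after differ by more than the threshold."""
    events = []
    for i in range(window, len(series) - window + 1):
        diff = ks_percent(series[i - window:i], series[i:i + window])
        if diff > threshold_pct:
            events.append(i)
    return events

# Example: a step change halfway through a short series
step = [100] * 10 + [50] * 10
change_points = ks_events(step, window=5, threshold_pct=90.0)
```

Because each test needs `window` future observations, detection with this approach lags the change by the window length.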
Event Detection: Holt-Winters
  • Use Holt-Winters (H-W) technique:
    • Uses triple exponential weighted moving average
    • Three parameters (α, β, γ) that take into account local smoothing, long-term seasonal smoothing, and trends respectively
  • Choose parameters by minimizing (1/N) Σ (F_t - y_t)^2
    • F_t = forecast for time t as a function of the parameters, y_t = observation at time t
  • H-W is a forecasting technique; need to complement with a method to identify events
    • If a percentage of residuals are outside twice the EWMA of absolute deviation, then generate event (HWE)
    • Apply Plateau on H-W residuals (PHR) and K-S on H-W residuals (KHR)
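
The forecasting and residual test above can be sketched as follows. This assumes the additive seasonal form of Holt-Winters with a simple initialisation from the first two periods. The weights are named by role (`level_w`, `season_w`, `trend_w`) rather than by symbol, and the window and fraction in `hw_events` are illustrative guesses at "a percentage of residuals"; none of this is taken from the IEPM-BW code.

```python
def hw_residuals(y, period, level_w=0.5, season_w=0.1, trend_w=0.1):
    """One-step-ahead residuals y_t - F_t from additive Holt-Winters."""
    if len(y) < 2 * period:
        raise ValueError("need at least two seasonal periods")
    first, second = y[:period], y[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / period ** 2
    season = [v - level for v in first]
    residuals = []
    for t in range(period, len(y)):
        s = season[t % period]
        forecast = level + trend + s                  # F_t
        residuals.append(y[t] - forecast)
        new_level = level_w * (y[t] - s) + (1 - level_w) * (level + trend)
        trend = trend_w * (new_level - level) + (1 - trend_w) * trend
        season[t % period] = season_w * (y[t] - new_level) + (1 - season_w) * s
        level = new_level
    return residuals

def mse(residuals):
    """Mean squared forecast error, the quantity minimised when tuning."""
    return sum(r * r for r in residuals) / len(residuals)

def hw_events(residuals, dev_w=0.1, window=10, fraction=0.5):
    """Indices where enough recent residuals exceed twice the EWMA of |residual|."""
    events, flags = [], []
    dev = abs(residuals[0])  # running EWMA of absolute deviation
    for t, r in enumerate(residuals):
        flags.append(abs(r) > 2 * dev)
        recent = flags[-window:]
        if len(recent) == window and sum(recent) / window >= fraction:
            events.append(t)
        dev = dev_w * abs(r) + (1 - dev_w) * dev
    return events

# Example: a perfectly periodic series gives near-zero residuals
periodic = [10, 20, 30, 20] * 5
res = hw_residuals(periodic, period=4)
```

The residual list returned by `hw_residuals` could equally be fed to a plateau or K-S style detector, which is the idea behind the PHR and KHR variants named on the slide.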
Event Diagnosis
  • Once we get alert(s) of Events, how do we correlate to diagnose problems?
  • Define heuristics of ‘effect and cause’
    • Define probabilities to pin-point the location of the problem
  • First pass: narrows down to where the problem occurs on a high level
    • End-host or network?
  • Next step: define heuristics for the location of problems in a network path and subsystems on hosts
    • Interrogate using tools such as pS, ganglia, nagios
    • Cross-correlate with other measurements (e.g. meshed traceroutes)
  • De-centralised network monitoring
    • Reduces overhead for us at IEPM to gather network statistics
  • Unified access to network information
    • Should enable easier methods to gather and use the network information
    • However, not all sites may provide the most useful information for our purposes
      • Define/recommend a base set of MPs? (e.g. ping, traceroute, port up?…)
  • Middleware platform
    • Therefore requires applications to prove usefulness of design
    • Alarm services (event detection), trend analysis etc.
PerfSONAR Interests to SLAC/IEPM
  • More statistics allow us to better understand Internet performance
  • Event Diagnosis - pS enables easier gathering of network performance data
    • Backbone and End-to-end allows us to corroborate suspicions
    • First need event detection in order to identify where problems are seen
  • Grid software development
    • SLAC will become an LHC ATLAS Tier-2 site
    • Network Services
      • Use of network metrics to help replica management, light path reservations etc.
PerfSONAR Questions
  • Test and possibly extend NMWG schemas to support the metrics that we are interested in
  • Interface for recurring and scheduled test initialisation
    • Waiting on AAA?
    • Conflicting tests?
  • Porting of our visualisation and analysis tools
    • Currently decoupling and modularising analysis tools from the IEPM-BW infrastructure
    • API
      • Input: use NMWG/pS
      • Output: Extend perfSONAR API to support ‘alerts’?
  • Access patterns for data:
    • We are more interested in gathering large windows of data rather than individual results
    • Too slow to gather data dynamically?
    • Should we cache data locally for our analysis?
PerfSONAR: Installation
  • Java Version
  • Relatively easy; however, I have worked with Java and web services in the past
  • Documentation could do with more detail
    • What are all the ‘extra’ packages actually for? E.g. eXist
    • Had to install them separately; why couldn’t the perfSONAR install do that?
    • List of prerequisites/requirements
      • Machine types
      • Security requirements/Ports opened etc
PerfSONAR: SQL-MA
  • Idea was to create an IEPM-BW MA
    • Provide extra characteristics
    • Easiest way to enable NMWG compliant reports
    • Tests NMWG for our purposes
  • SQL-MA
    • All data currently in MySQL tables!
    • Installation problems
      • Different snapshots give different errors!
      • Difficult to get help due to time-zone differences
      • Security policies at SLAC prevent quick and easy access to non-SLAC users
    • Class diagrams seem to make sense
      • Will report on how easy it is to actually get it working!
PerfSONAR: Security Issues
  • SLAC (DOE) does not allow us to run application servers individually (e.g. ports are blocked)
  • We are currently deploying pS on a ‘community’ tomcat installation
  • Running two instances of tomcat for LS and MA is not possible for us
  • SLAC has a ‘prove that you need it’ attitude to allow external access to network data
  • De-centralised management of pS allows us to concentrate more on analysis rather than deployment/maintenance
  • IEPM would like specific tools that have proven to be the most useful for diagnosis
    • Latency (connectivity) and traceroute
    • Extend to other metrics such as throughput etc.
  • PerfSONAR allows transparent data access
    • pS enables the unification of both end-to-end and router metric representation
      • Worry about finding correlations for diagnosis rather than determining ‘how’ to gather the data
    • Porting of our analysis tools
      • Test perfSONAR APIs
      • Provide useful features such as event detection, other UI4 examples etc