SLAC and PerfSONAR

SLAC and PerfSONAR. Yee-Ting Li PerfSONAR developers workshop October 2006. SLAC IEPM. SLAC used to be primarily a High Energy Particle Physics institute Now beginning to diverge into other science’s Photon Science (SSRL and LCLS) Impact to chemistry and molecular biology

SLAC and PerfSONAR


Yee-Ting Li

PerfSONAR developers workshop

October 2006

SLAC IEPM

  • SLAC used to be primarily a High Energy Particle Physics institute

  • Now beginning to diversify into other sciences

    • Photon Science (SSRL and LCLS)

    • Impact to chemistry and molecular biology

  • The first US-based web page was at SLAC!

  • Internet End-to-end Performance Monitoring Group

    • Focus on problem detection and long term performance/trend analysis

    • Origins in PingER monitoring

    • Currently deploying more intrusive IEPM-BW tests


PingER

  • PingER project originally (1995) measured network performance for the US, European and Japanese HEP communities

  • Extended this century to measure the Digital Divide

  • Last year added monitoring sites in S. Africa, Pakistan & India

  • Uses ICMP to determine:

    • RTT

    • Loss

    • Connectivity

    • Derived TCP throughput, i.e. proportional to 1/sqrt(loss)
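
The 1/sqrt(loss) relation can be illustrated with a small sketch. The slide gives only the loss dependence; the code below assumes the widely used Mathis et al. approximation, throughput ≈ (MSS / RTT) * C / sqrt(loss), with C = sqrt(3/2) and a 1460-byte MSS. The function name and its defaults are hypothetical and are not taken from PingER itself.

```python
import math

# Constant from the Mathis et al. model (assumption, roughly 1.22).
MATHIS_C = math.sqrt(3.0 / 2.0)


def derived_tcp_throughput_bps(rtt_s: float, loss: float, mss_bytes: int = 1460) -> float:
    """Estimate TCP throughput in bits/s from ping-style RTT and loss.

    rtt_s     -- round-trip time in seconds
    loss      -- packet loss as a fraction in (0, 1]
    mss_bytes -- assumed maximum segment size
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss <= 1:
        raise ValueError("loss must be a fraction in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (MATHIS_C / math.sqrt(loss))


# Example: 150 ms RTT and 1% loss.
estimate = derived_tcp_throughput_bps(0.150, 0.01)
```

Because loss enters as 1/sqrt(loss), cutting the loss rate by a factor of four doubles the derived throughput. The estimate is undefined for zero measured loss, so a real deployment would need a policy for that case.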

PingER: Deployment

  • ~120 countries

  • 99% of the world’s connected population

  • 35 monitor sites in 14 countries

  • Over 600 nodes currently being monitored worldwide

PingER: Digital Divide

Behind Europe:

  • 6 Yrs: Russia, Latin America

  • 7 Yrs: Mid-East, SE Asia

  • 10 Yrs: South

  • 11 Yrs: Cent. Asia

  • 12 Yrs: Africa

IEPM-BW

  • Developed as an exhibit for SC2001

  • Conducts tests using various tools

    • Achievable BW: Iperf, thrulay

    • Estimated BW: pathchirp, pathload, abwe

    • File Transfer: bbcp, bbftp, gridftp

    • Latency/Loss: ping, traceroute, owamp

  • MySQL backend with Web-based front end

  • Collection of scripts to:

    • Start/stop daemons

    • Conduct analysis (and produce web-accessible graphs)

    • Forecasting and Event detection (and notification)

IEPM-BW: Deployment

  • Running at CERN, SLAC, FNAL, BNL, Caltech and Taiwan, testing to about 40 remote sites (in a semi-mesh)

  • 40 target hosts in 13 countries

  • Bottlenecks vary from 0.5 Mbit/s to 10 Gbit/s

  • Traverse ~50 ASes, 15 major Internet providers

  • 5 targets at PoPs, rest at end sites

IEPM-BW: Presentation

  • Timeseries plots

IEPM-BW: Presentation

  • Diurnal Plots

IEPM-BW: Presentation

  • CDF Diagrams

IEPM-BW: Event Detection

  • Automated problem identification:

    • Administrators cannot review hundreds of graphs each day

    • Alerts for network administrators

      • Changes in time-series, loss, latency, iperf, SNMP

    • Alerts for systems people

      • OS/Host metrics

    • Anomalies for security

  • Anomalous event detection

    • A run of missing measurements (network outage?)

    • Determine that something ‘wrong’ has happened; measured value significantly differs from expected value

  • Forecasts

    • Given trends in previous measurements, determine what is within tolerance of being ‘okay’

Event Detection: Plateau

[Figure labels: Trigger % full; History mean]

  • Circular buffer of observations

  • Define trigger buffer of results

    • Buffer fills if an observation deviates significantly from mean of circular buffer

  • Event occurs when trigger buffer exceeds threshold

  • Filters:

    • Check if (mh - mt) / mh > D and 90% trigger in last T mins; if so, have a trigger

    • Move trigger buffer to history buffer

[Figure label: History mean - 2 * stdev]

  • Settings: history length = 1 day, t = trigger length = 3 hours, standard deviations = 2
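
The plateau scheme described on this slide can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the IEPM-BW implementation: the class name `PlateauDetector`, its parameter names, the use of observation counts rather than time windows, the rule that a normal observation drains one entry from the trigger buffer, and the choice to flag only drops below the mean are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class PlateauDetector:
    """Simplified plateau detector: a history buffer plus a trigger buffer."""

    def __init__(self, history_len=100, trigger_len=10, n_std=2.0,
                 trigger_fill=0.9, min_drop=0.1, min_history=10):
        self.history = deque(maxlen=history_len)  # circular buffer of observations
        self.trigger = deque()                    # observations that deviated
        self.n_std = n_std                        # deviations below the mean that count
        self.min_drop = min_drop                  # D in (mh - mt) / mh > D
        self.min_history = min_history            # warm-up before testing starts
        # Number of deviating observations needed before the filter is checked.
        self.needed = max(1, round(trigger_fill * trigger_len))

    def update(self, x):
        """Feed one observation; return True if an event is declared."""
        if len(self.history) < self.min_history:
            self.history.append(x)
            return False
        m_h = mean(self.history)
        s_h = pstdev(self.history)
        if x < m_h - self.n_std * s_h:
            self.trigger.append(x)        # significant deviation from history mean
        else:
            self.history.append(x)        # normal observation
            if self.trigger:
                self.trigger.popleft()    # assumed rule: slowly drain the trigger buffer
        if len(self.trigger) < self.needed:
            return False
        m_t = mean(self.trigger)
        event = m_h > 0 and (m_h - m_t) / m_h > self.min_drop
        # Move the trigger buffer into the history so the new level becomes the norm.
        self.history.extend(self.trigger)
        self.trigger.clear()
        return event
```

Feeding the detector a stable series followed by a sustained drop yields a single event once enough low observations have accumulated. Moving the trigger buffer into the history afterwards means a persistent new level is reported once rather than continuously.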

Event Detection: K-S

  • For each observation: compare the previous 100 observations with the next 100 observations

    • Compare the vertical difference in CDFs

    • How does it differ from random CDFs

    • Expressed as % difference

    • Define threshold for % difference
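
As a rough illustration of the sliding-window comparison above, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest vertical distance between two empirical CDFs) for the windows before and after each observation and reports it as a percentage. The function names, the default window and the threshold are assumptions for illustration. The comparison against random CDFs mentioned on the slide (a significance test) is not implemented here.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest vertical distance between the empirical CDFs of a and b (0..1)."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)   # empirical CDF of a at x
        fb = bisect_right(sb, x) / len(sb)   # empirical CDF of b at x
        d = max(d, abs(fa - fb))
    return d


def ks_scan(series, window=100, threshold_pct=50.0):
    """Return (index, percent difference) where the windows around index differ."""
    hits = []
    for i in range(window, len(series) - window + 1):
        before = series[i - window:i]
        after = series[i:i + window]
        pct = 100.0 * ks_statistic(before, after)
        if pct >= threshold_pct:
            hits.append((i, pct))
    return hits
```

Because adjacent windows overlap heavily, a real change usually produces a cluster of hits. Taking the index with the largest percentage difference in each cluster is one simple way to locate the change point.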

Event Detection: Holt-Winters

  • Use Holt-Winters (H-W) technique:

    • Uses triple exponential weighted moving average

    • Three parameters (α, β, γ) that take into account local smoothing, long-term seasonal smoothing, and trends respectively.

  • Choose parameters by minimizing (1/N) Σ (F_t - y_t)^2

    • F_t = forecast for time t as a function of the parameters, y_t = observation at time t

  • H-W is a forecasting technique; need to complement with a method to identify events

    • If a percentage of residuals are outside twice the EWMA of absolute deviation, then generate event (HWE)

    • Apply Plateau on H-W residuals (PHR) and K-S on H-W residuals (KHR)
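
A minimal sketch of the forecasting and residual test described above, assuming the additive form of Holt-Winters with one-step-ahead forecasts. It follows the common textbook convention (alpha = level, beta = trend, gamma = seasonal), which may not match the ordering of the parameters on the slide. The function names, the initialisation from the first two seasons, and the `window`, `fraction` and `dev_weight` parameters are hypothetical.

```python
def holt_winters_forecasts(y, period, alpha, beta, gamma):
    """One-step-ahead additive Holt-Winters forecasts for y[period:]."""
    if len(y) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    first = sum(y[:period]) / period
    second = sum(y[period:2 * period]) / period
    level = first
    trend = (second - first) / period
    season = [v - first for v in y[:period]]
    forecasts = []
    for t in range(period, len(y)):
        s = season[t % period]
        forecasts.append(level + trend + s)          # F_t, made before seeing y[t]
        new_level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (y[t] - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts


def mse(y, period, alpha, beta, gamma):
    """(1/N) * sum((F_t - y_t)^2): the quantity minimised when choosing parameters."""
    f = holt_winters_forecasts(y, period, alpha, beta, gamma)
    obs = y[period:]
    return sum((ft - yt) ** 2 for ft, yt in zip(f, obs)) / len(obs)


def hw_events(y, period, alpha, beta, gamma, window=12, fraction=0.8, dev_weight=0.1):
    """Indices where enough recent residuals exceed twice the EWMA of |residual|."""
    forecasts = holt_winters_forecasts(y, period, alpha, beta, gamma)
    dev = None            # EWMA of absolute deviation
    flags = []            # whether each residual fell outside the band
    events = []
    for k, (f, obs) in enumerate(zip(forecasts, y[period:])):
        r = abs(obs - f)
        flags.append(dev is not None and r > 2 * dev)
        recent = flags[-window:]
        if len(recent) == window and sum(recent) / window >= fraction:
            events.append(period + k)
        dev = r if dev is None else dev_weight * r + (1 - dev_weight) * dev
    return events
```

Parameters could be chosen by evaluating `mse` over a grid of (alpha, beta, gamma) values and keeping the minimum. Since the forecast adapts quickly to a new level, the residual test tends to flag only the start of a change, which is one reason to combine it with a separate event detector.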

Event Diagnosis

  • Once we get alert(s) of Events, how do we correlate to diagnose problems?

  • Define heuristics of ‘effect and cause’

    • Define probabilities to pin-point the location of the problem

  • First pass: narrows down to where the problem occurs on a high level

    • End-host or network?

  • Next step: define heuristics for the location of problems in a network path and subsystems on hosts

    • Interrogate using tools such as pS, ganglia, nagios

    • Cross correlate with other measurements (e.g. meshed traceroutes)


  • De-centralised network monitoring

    • Reduces overhead for us at IEPM to gather network statistics

  • Unified access to network information

    • Should enable easier methods to gather and use the network information

    • However, not all sites may provide the most useful information for our purposes

      • Define/recommend a base set of MPs? (e.g. ping, traceroute, port up?…)

  • Middleware platform

    • Therefore requires applications to prove usefulness of design

    • Alarm services (event detection), trend analysis etc.

PerfSONAR Interests to SLAC/IEPM

  • More statistics allow us to better understand Internet performance

  • Event Diagnosis - pS enables easier gathering of network performance data

    • Backbone and End-to-end allows us to corroborate suspicions

    • First need event detection in order to identify where problems are seen

  • Grid software development

    • SLAC will become an LHC ATLAS Tier-2 site

    • Network Services

      • Use of network metrics to help replica management, light path reservations etc.

PerfSONAR Questions

  • Test and possibly extend NMWG schemas to support the metrics that we are interested in

  • Interface for recurring and scheduled test initialisation

    • Waiting on AAA?

    • Conflicting tests?

  • Porting of our visualisation and analysis tools

    • Currently untying and modularising analysis tools from IEPM-BW infrastructure

    • API

      • Input: use NMWG/pS

      • Output: Extend perfSONAR API to support ‘alerts’?

  • Access patterns for data:

    • We are more interested in gathering large windows of data rather than individual results

    • Too slow to gather data dynamically?

    • Should we cache data locally for our analysis?

PerfSONAR: Installation

  • Java Version

  • Relatively easy; however, I have worked with Java and web services in the past

  • Documentation could do with more detail

    • What are all the ‘extra’ packages actually for? e.g. eXist

    • Had to install separately; why couldn’t the perfSONAR install do that?

    • List of prerequisites/requirements

      • Machine types

      • Security requirements/Ports opened etc

PerfSONAR: SQL-MA

  • Idea was to create an IEPM-BW MA

    • Provide extra characteristics

    • Easiest way to enable NMWG compliant reports

    • Tests NMWG for our purposes

  • SQL-MA

    • All data currently in MySQL tables!

    • Installation problems

      • Different snapshots give different errors!

      • Difficult to get help due to time-zone differences

      • Security policies at SLAC prevent quick and easy access to non-SLAC users

    • Class diagrams seem to make sense

      • Will report on how easy it is to actually get it working!

PerfSONAR: Security Issues

  • SLAC (DOE) does not allow us to run application servers individually (e.g. ports are blocked)

  • We are currently deploying pS on a ‘community’ Tomcat installation

  • Running two instances of tomcat for LS and MA is not possible for us

  • SLAC has a ‘prove that you need it’ attitude to allow external access to network data


  • De-centralised management of pS allows us to concentrate more on analysis rather than deployment/maintenance

  • IEPM would like specific tools that have proven to be the most useful for diagnosis

    • Latency (connectivity) and traceroute

    • Extend to other metrics such as throughput etc.

  • PerfSONAR allows transparent data access

    • pS enables the unification of both end-to-end and router metric representation

      • Worry about finding correlations for diagnosis rather than determine ‘how’ to gather the data.

    • Porting of our analysis tools

      • Test perfSONAR APIs

      • Provide useful features such as event detection, other UI4 examples etc