
IEPM-BW

Warren Matthews (SLAC)

Presented at the UCL Monitoring Infrastructure Workshop, London, May 15-16, 2003.



Overview / Goals

  • IEPM-BW monitoring and results

  • Other measurements

  • Publishing

  • Troubleshooting Tools

  • Further work



IEPM-BW

  • SLAC package for monitoring and analysis

  • Currently 10 monitoring sites

    • SLAC, FNAL, GATech (SOX), INFN (Milan), NIKHEF, APAN (Japan)

    • Manchester, UMich, UCL, Internet2

    • 2-36 targets


[Map slide: IEPM-BW monitoring sites and remote targets across ESnet, Abilene, CalREN, JAnet, Geant, CESnet, SOX and APAN, including SLAC, FNAL, KEK, CERN, NIKHEF, TRIUMF, IN2P3, RAL, DL, BNL, JLAB, ORNL, ANL, LANL, NERSC, RIKEN, INFN-Milan, INFN-Roma, CALTECH, SDSC, UTDallas, UFL, UMich, Rice, NCSA, UManc, UCL and Internet2.]


Measurement Engine

  • Ping, Traceroute

  • Iperf, Bbftp, Bbcp (mem and disk)

  • Abwe

  • Gridftp, UDPmon

  • Web100

  • Passive (netflow)



Other Projects (U.S.)

  • PingER (SLAC, FNAL)

    • eJDS (SLAC, ICTP)

  • AMP (NLANR)

  • NIMI (ICIR, PSC)

    • MAGGIE (ICIR, PSC, SLAC, LBL, ANL)

  • NASA, SCNM (LBL)

  • Surveyor (Internet2)

  • E2e PI and PIPES (Internet2)

  • Also SLAC has a RIPE-TT box



Publishing

  • Web Service

    • SOAP::Lite perl module

    • Python

    • Java

  • NMWG

  • OGSA



Publishing

  • NMWG Properties document

    • Path.delay.roundtrip (Demo)

    • Hop.bandwidth.capacity (tracespeed)

  • Guthrie (demo)

    • Almost 1000 nodes in database

    • PingER Networks

  • Arena



Advisor

Screenshot taken from the talk by Jim Ferguson at the e2e workshop, Miami Feb 2003.



MonaLisa

  • Front-end visualization

  • Vital component for development of the LHC Computing Model

  • JINI/JAVA and WSDL/SOAP

  • demo



Troubleshooting

  • RIPE-TT Testbox Alarm

  • AMP Automatic Event Detection

  • Our approach is based on diurnal changes



Diurnal Changes (1/4)

  • Either performance varies during the day

  • Or it doesn’t

  • No variation is the special case of variation=0



Diurnal Changes (2/4)

  • Either performance (within the bin) is variable

  • Or it isn’t

  • No variation is the special case of variation=0



Diurnal Changes (3/4)

  • Parameterize performance in terms of hour and variability within that hourly bin

  • Measurements can be classified in terms of how they differ from historical value

  • Recent problems are flagged due to difference from historical value

  • Compare to measurement in previous bin to reduce false-positives



Diurnal Changes (4/4)

  • Calculate the median and standard deviation of the last five measurements in the bin

    • e.g. Monday 7pm-8pm

  • “Concerned” if latest measurement is more than 1 s.d. from median

  • “Alarmed” if latest measurement is more than 2 s.d. from median (a minimal sketch of this classification follows)
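
A minimal sketch of this classifier (not the production IEPM-BW code; the classify helper and the sample throughputs are made up for illustration). It takes the latest value plus the last five values from the same hourly bin and flags Concern or Alarm when the latest value lies more than one or two standard deviations from the bin median; the previous-bin comparison used to reduce false positives is omitted for brevity.

#!/usr/local/bin/perl
# Hedged sketch of the diurnal-bin classifier, not the production code.
use strict;
use warnings;
use List::Util qw(sum);

# $latest is the newest measurement; @history holds the last five
# measurements from the same hourly bin (e.g. Monday 7pm-8pm).
sub classify {
    my ($latest, @history) = @_;
    my @sorted = sort { $a <=> $b } @history;
    my $median = $sorted[ int(@sorted / 2) ];                 # middle of five values
    my $mean   = sum(@history) / @history;
    my $sd     = sqrt( sum(map { ($_ - $mean) ** 2 } @history) / @history );
    my $dev    = abs($latest - $median);
    return 'Alarm'   if $dev > 2 * $sd;                       # > 2 s.d. from median
    return 'Concern' if $dev > 1 * $sd;                       # > 1 s.d. from median
    return 'Within boundaries';
}

# Example with made-up iperf throughputs (Mbits/s): a sudden drop to 0.51
# against a history around 150-170 comes out as an Alarm.
print classify(0.51, 150.2, 162.7, 148.9, 171.3, 155.0), "\n";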



Trouble Detection

$ tail maggie.log

04/28/2003 14:58:47 (1:14) gnt4 0.51 Alarm (AThresh=38.33)

04/28/2003 16:25:45 (1:16) gnt4 3.83 Concern (CThresh=87.08)

04/28/2003 17:55:21 (1:17) gnt4 169.57 Within boundaries

(Log fields, left to right: date and time, bin, node, throughput from iperf, status)

  • Only write to the log if an alarm is triggered
  • Keep writing to the log until the alarm is cleared (a sketch of this policy follows)
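
A minimal sketch of that logging policy, assuming the status values from the previous slides (the log_result helper and the %in_alarm bookkeeping are illustrative, not the real maggie.log writer; the threshold fields shown in the real log are omitted):

# Sketch only: stay silent while a node is healthy, start logging on the
# first Concern/Alarm, keep logging every result for that node until it
# returns to "Within boundaries", then stop.
use strict;
use warnings;
use POSIX qw(strftime);

my %in_alarm;    # node name => 1 while a concern/alarm is being tracked

sub log_result {
    my ($node, $bin, $value, $status) = @_;
    my $triggered = $status ne 'Within boundaries';
    return unless $triggered or $in_alarm{$node};             # nothing to report
    open my $fh, '>>', 'maggie.log' or die "maggie.log: $!";
    printf $fh "%s (%s) %s %.2f %s\n",
        strftime('%m/%d/%Y %H:%M:%S', localtime), $bin, $node, $value, $status;
    close $fh;
    if ($triggered) { $in_alarm{$node} = 1 }                  # alarm raised or ongoing
    else            { delete $in_alarm{$node} }               # alarm cleared
}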



Trouble Status

  • Tempted to make color-coded web page

    • All the hard work still left to do

    • Use knowledge to see common point of failure

    • Production table would be >> 36x700

  • Instead figure out where to flag



Net Rat

  • Alarm System

    • Multiple tools

    • Multiple measurement points

      • Cross reference

    • Trigger further measurements

    • Starting point for human intervention

    • Informant database

      • hop.performance

  • No measurement is ‘authoritative’

    • Cannot even take a single measurement at face value



Limitations

  • Could be over an hour before alarm is generated

  • More frequent measurements impact the network and measurements overlap

  • Low-impact tools allow finer-grained measurement



Where next?

  • GLUE, OGSA, CIM

  • Work with Other Projects

    • Publishing and troubleshooting

    • Discovery

    • Security



Toward a Monitoring Infrastructure

  • Certainly the need

    • DOE Science Community

    • Japanese Earth Simulator

    • Grid

    • Troubleshooting / E2Epi

  • Many of the ingredients

    • Many monitoring projects

    • PIPES

    • MAGGIE



Summary

“It is widely believed that a ubiquitous monitoring infrastructure is required”.


Links

This talk

IEPM-BW

PingER

ABwE

AMP

NIMI

MAGGIE

RIPE-TT

Surveyor

E2E PI

SLAC Web Services

GGF NMWG

Arena

Monalisa

Advisor

TroubleShooting




Credits

  • Les Cottrell

  • Connie Logg, Jerrod Williams

  • Jiri Navratil

  • Fabrizio Coccetti

  • Brian Tierney

  • Frank Nagy, Maxim Grigoriev

  • Eric Boyd, Jeff Boote

  • Vern Paxson, Andy Adams

  • Iosif Legrand

  • Jim Ferguson, Steve Englehart

  • Local admins and other volunteers

  • DoE/MICS



Demos

  • This is the output from the “Publishing” Demo on slide 9.

$ more soap_client.pl

#!/usr/local/bin/perl

use SOAP::Lite;

print SOAP::Lite

-> service('http://www-iepm.slac.stanford.edu/tools/soap/wsdl/profile_0002.wsdl')

-> hopBandwidthCapacity("brdr.slac.stanford.edu:i2-gateway.stanford.edu");

$ ./soap_client.pl

1000Mb



Demos

  • This is the output from the “tracespeed” demo on slide 9.

$ ./tracespeed thunderbird.internet2.edu

0 doris 10Mb

1 core (134.79.122.32) 1000Mb

2 brdr (134.79.235.45) 1000Mb

3 i2-gateway.stanford.edu (192.68.191.83) No Data.

4 stan.pos.calren2.net (171.64.1.213) No Data.

5 sunv--stan.pos.calren2.net (198.32.249.73) No Data.

6 abilene--qsv.pos.calren2.net (198.32.249.162) No Data.

7 kscyng-snvang.abilene.ucaid.edu (198.32.8.103) No Data.

8 iplsng-kscyng.abilene.ucaid.edu (198.32.8.80) No Data.

9 so-0-2-0x1.aa1.mich.net (192.122.183.9) No Data.

10 so-0-0-0x0.ucaid2.mich.net (198.108.90.118) No Data.

11 thunderbird.internet2.edu (207.75.164.95) No Data.



Aside: NetRat (1/5)

  • If last measurement was within 1 s.d.

    • Mark each hop as good

    • Hop.performance = good

  • If last measurement was “Concern”

    • Mark each hop as acceptable

  • If last measurement was an “Alarm”

    • Mark each hop as poor (see the sketch below)
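
A minimal sketch of this marking rule (the %HOP_MARK table, mark_path and the %hop_performance hash are illustrative names standing in for the informant database, not the actual NetRat code):

use strict;
use warnings;

# Map the status of the last end-to-end measurement onto a per-hop mark.
my %HOP_MARK = (
    'Within boundaries' => 'good',
    'Concern'           => 'acceptable',
    'Alarm'             => 'poor',
);

my %hop_performance;    # stands in for the hop.performance informant database

# Apply the same mark to every hop on the traceroute path for that measurement.
sub mark_path {
    my ($status, @hops) = @_;
    my $mark = $HOP_MARK{$status} or return;
    $hop_performance{$_} = $mark for @hops;
}

# Example: an alarmed measurement marks each hop on its (made-up) path as poor.
mark_path('Alarm', 'slac-border', 'esnet-snv', 'us-eu-link', 'geant-mil');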



Aside: NetRat (2/5)

  • Measurement generates an alarm

  • Set each hop.performance = poor



Aside: NetRat (3/5)

  • Other measurements from same site do not generate alarms.

  • Set each hop.performance = good

  • Immediately rules out a problem in the local LAN or on the host machine



Aside: NetRat (4/5)

  • Different site monitors same target

  • No alarm is generated

  • Set each hop.performance = good

  • Pinpoints a possible problem in the intermediate network (cross-referencing sketched below)
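
A minimal sketch of the cross-referencing in slides 2/5-4/5 (suspect_hops and all hop names are made up for illustration): hops that also appear on a path whose measurement did not alarm are exonerated, so the remaining suspects are the hops used only by the failing path, i.e. the intermediate network.

use strict;
use warnings;

# Given the traceroute path of the alarmed measurement and the paths of
# measurements that did not alarm, return the hops seen only on the bad path.
sub suspect_hops {
    my ($alarmed_path, @good_paths) = @_;                 # array refs of hop names
    my %good = map { $_ => 1 } map { @$_ } @good_paths;
    return grep { !$good{$_} } @$alarmed_path;
}

# Example: the path to gnt4 alarms, but a path from the same site to another
# target and a path from a different site to gnt4 do not, leaving only the
# shared wide-area link as a suspect.
my @suspects = suspect_hops(
    [ 'slac-border', 'esnet-snv', 'us-eu-link', 'geant-mil', 'gnt4' ],   # Alarm
    [ 'slac-border', 'esnet-snv', 'esnet-chi',  'fnal' ],                # no alarm, same site
    [ 'ucl-border',  'janet-lon', 'geant-mil',  'gnt4' ],                # no alarm, other site
);
print "@suspects\n";    # prints "us-eu-link"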

