Internet Monitoring - Results
Les Cottrell & Warren Matthews, SLAC <cottrell@slac.stanford.edu> <warrenm@slac.stanford.edu>
Presented at the XIWT Meeting, San Francisco, Mar 1998
http://www.slac.stanford.edu/grp/scs/net/talk/xiwt-mar98/
Partially funded by MICS joint SLAC/LBL proposal on Internet End-to-end Performance Monitoring (IEPM)
\\pcbackup\users\cottrell\xiwg\xiwt-mar98.ppt
Outline of Talk
• What, why & how are we (ESnet/HENP community) measuring?
• What PingER measurement reports are available and what do they show?
  • Short, intermediate & long term
• Traffic volume & traceroute measurements
• Summary
  • Deployment/development, Internet performance, next steps
  • Collaborations
Why go to the effort?
• Apparent quality of the Internet is getting worse as its size and demands increase
• Internet woefully under-measured & under-instrumented
• Internet very diverse: no single path is typical
• Users need:
  • Realistic expectations, planning information
  • Guidelines for setting SLAs
  • Information to help in identifying problems
  • Help to decide where to apply resources
Why the focus on Ping
• “Universally available”, easy to understand
• No software for clients to install
• Low network impact
• Select hosts carefully (concerns over routers, loaded hosts, etc.)
• Provides end-to-end loss, response time, reachability and unpredictability (user view vs. network infrastructure view)
Importance of Response Time
• Time is the scarcest and most valuable commodity
• Studies in the late 1970s and early 1980s showed the economic value of rapid response time:
  • 0-0.4 s: high-productivity interactive response
  • 0.4-2 s: fully interactive regime
  • 2-12 s: sporadically interactive regime
  • 12-600 s: break-in-contact regime
  • >600 s: batch regime
• Complaints increase rapidly above a threshold of around 4-5 s
• Voice has a threshold of around 100 ms
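The regime boundaries above amount to a simple lookup table. The sketch below is illustrative only: the function name `response_regime` is hypothetical, and because the slide does not say which side of a boundary a value such as exactly 0.4 s belongs to, it assumes each boundary value falls into the slower regime.

```python
import bisect

# Upper bounds (seconds) of each regime, taken from the slide.
# Assumption: a value exactly equal to a bound belongs to the next (slower) regime.
_BOUNDS = [0.4, 2.0, 12.0, 600.0]
_REGIMES = [
    "high-productivity interactive",
    "fully interactive",
    "sporadically interactive",
    "break in contact",
    "batch",
]


def response_regime(seconds: float) -> str:
    """Map a response time in seconds to the regime named on the slide."""
    if seconds < 0:
        raise ValueError("response time cannot be negative")
    # bisect_right puts a value equal to a bound into the higher regime.
    return _REGIMES[bisect.bisect_right(_BOUNDS, seconds)]


if __name__ == "__main__":
    for t in (0.1, 0.4, 5.0, 30.0, 900.0):
        print(f"{t:>6} s -> {response_regime(t)}")
```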
Perception of Packet Loss
• Above 4-6% packet loss, video conferencing becomes irritating and non-native speakers become unable to communicate.
• Long delays of 4 seconds or more, occurring at a frequency of 4-5% or more, are also irritating for interactive activities such as telnet and X Windows.
• Above 10-12% packet loss, there is an unacceptable level of back-to-back packet loss and extremely long timeouts; connections start to break and video conferencing is unusable.
Our Main Metric is Ping
• “Universally available”, easy to understand
• No software for clients to install
• Low network impact
• Select hosts carefully (concerns over routers, loaded hosts, etc.)
• Provides loss, response time, reachability, unpredictability
• Provides useful real-world measures
Ping Response vs Web Response
[Figure: HTTP GET response (ms) vs. minimum ping response (ms)]
Method
• Measurement
  • Each collection site keeps a list of remote hosts to ping at sites it is interested in
  • Every 30 minutes, ping each remote host with 11 × 100-byte pings followed by 10 × 1000-byte pings
  • Minimum separation of pings is 1 second; timeout is 20 seconds
  • Throw away the first ping
  • Measure response time, packet loss, and host unreachability (no answer to any ping)
• Record loss and min/avg/max response time and make them available
• Poisson sampling & median measurement are in beta
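The per-sample reduction described above might look like the following minimal Python sketch. It assumes the raw results for one host and one packet size are already available as round-trip times in milliseconds, with `None` for a ping that got no reply within the 20 s timeout. It also assumes, as one reading of the slide, that only the very first ping of each cycle (the first 100-byte ping) is thrown away, so the 1000-byte batch is summarized with `discard_first=False`. The `PingSummary` and `summarize` names are hypothetical; this is not the PingER code itself. The median is included because the slide lists median measurement as in beta.

```python
import statistics
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PingSummary:
    sent: int                  # pings counted (after any discard)
    received: int              # counted pings that got a reply
    loss_pct: float
    unreachable: bool          # True when no ping at all was answered
    min_ms: Optional[float]
    avg_ms: Optional[float]
    max_ms: Optional[float]
    median_ms: Optional[float]


def summarize(rtts_ms: Sequence[Optional[float]],
              discard_first: bool = True) -> PingSummary:
    """Reduce one batch of pings (one host, one packet size) to summary numbers.

    rtts_ms has one entry per ping sent, in order: an RTT in ms, or None if no
    reply arrived before the timeout.
    """
    counted = list(rtts_ms[1:] if discard_first else rtts_ms)
    if not counted:
        raise ValueError("no pings left to summarize")
    replies = [r for r in counted if r is not None]
    loss_pct = 100.0 * (len(counted) - len(replies)) / len(counted)
    unreachable = all(r is None for r in rtts_ms)  # no answer to any ping
    if not replies:
        return PingSummary(len(counted), 0, loss_pct, unreachable,
                           None, None, None, None)
    return PingSummary(
        sent=len(counted),
        received=len(replies),
        loss_pct=loss_pct,
        unreachable=unreachable,
        min_ms=min(replies),
        avg_ms=statistics.fmean(replies),
        max_ms=max(replies),
        median_ms=statistics.median(replies),
    )
```

For the 11 × 100-byte batch, passing all 11 results leaves 10 counted pings after the discard, matching the 10 × 1000-byte batch.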
Architecture
[Figure: architecture diagram with remote hosts pinged by collecting sites, analysis/archive sites (e.g. HEPNRC, SLAC), a cache, and WWW reports & data served over HTTP]
Long Term Reports
• Currently only available from SLAC
• Tabular reports generated automatically by SAS
• Monthly averages:
  • Response time and packet loss for prime time (SLAC)
  • Quiescent frequency
  • Reachability & unpredictability
Monthly Packet Loss
[Table: monthly packet loss per link, sorted to show the worst at the top and colored for quality; linked to an Excel version and a 180-day plot]
Graphical Analysis
• Use Excel, manually or with macros, for more detailed analysis:
  • Graphs
  • Means, medians, standard deviations, distributions, percentiles
  • Fits
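The same summary statistics can also be scripted rather than computed in Excel. A minimal sketch using only Python's standard `statistics` module (Python 3.8 or later); this is an alternative to the slide's Excel workflow, not the authors' tooling, and `describe` is a hypothetical helper name.

```python
import statistics
from typing import Dict, Iterable, Sequence


def describe(values: Iterable[float],
             percentiles: Sequence[int] = (5, 25, 75, 95)) -> Dict[str, float]:
    """Mean, median, sample standard deviation and selected percentiles.

    Needs at least two values (stdev and quantiles both require that).
    """
    data = sorted(values)
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99;
    # method="inclusive" treats the data as the whole population of interest.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    result = {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
    }
    for p in percentiles:
        result[f"p{p}"] = cuts[p - 1]
    return result
```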
Ranked Packet Loss for 3 Months
[Figure: ranked packet loss over three months; labeled sites include Stanford, Rome, UK and Cincinnati]
Sawtooth Effect
[Figure annotations: "2 × capacity (+2 Mbps)", "Added 45 Mbps (quadrupled capacity)", "3 × capacity + 9 Mbps", "Holiday"]
RAL Last 180 Days Plot
• Lines are simple cubic spline fits to aid the eye
• Upper green and black points are response time in ms
• Red & blue are weekday loss; cyan is weekend loss
• Note the weekend/weekday differences
• Note the Christmas/New Year lull
• Also note the quick onset of saturation at the end of August and in September
Ping Response & Loss between HEPNRC & Manchester, Dec 1997 to Jan 1998
Italian sites look similar to each other
Representative International HENP Site Loss, Jan-95 through Nov-97
• Note the RL (UK) saw-tooths as UK-US bandwidth is added (Apr-96, Feb-97, Aug-97)
Aggregation
• Group measurements, for example:
  • By area (e.g. N. America E, N. America W, W. Europe/Japan, others; by country)
  • Trans-oceanic links
  • Separation, e.g. number of hops, time zones crossed, IXPs crossed
  • ISP (ESnet, vBNS/I2, ...)
  • By monitoring site
  • One site seen from multiple sites
  • Common interest/affiliation (XIWT, HENP, ...)
  • User selectable
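Each of these aggregations reduces to "assign a group key, then summarize each group". The sketch below assumes a hypothetical flat record of (monitoring site, remote host, loss %) and uses the median as the group summary; both choices are illustrative, not how PingER actually stores or aggregates its data.

```python
import statistics
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record layout: (monitoring_site, remote_host, loss_pct).
Measurement = Tuple[str, str, float]


def aggregate(measurements: Iterable[Measurement],
              key: Callable[[Measurement], str]) -> Dict[str, float]:
    """Median packet loss per group, where `key` assigns each record a group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for m in measurements:
        groups[key(m)].append(m[2])
    return {group: statistics.median(losses) for group, losses in groups.items()}
```

Grouping by monitoring site is `aggregate(data, key=lambda m: m[0])`; a rough per-country grouping could key on the remote host's top-level domain, though US hosts under `.edu` or `.gov` would need an explicit lookup table.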
Group Selection for Ping Loss Plots
• Allow wild cards
• Allow pre-selected groups
• In beta test
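Wildcard and pre-selected group selection can be sketched with Python's `fnmatch` module, which implements shell-style patterns such as `*.infn.it`. The `PRESET_GROUPS` table and `select_hosts` function are hypothetical, and the domain patterns are examples rather than PingER's actual group definitions.

```python
import fnmatch
from typing import Dict, Iterable, List

# Hypothetical pre-selected groups; real definitions would come from site config.
PRESET_GROUPS: Dict[str, List[str]] = {
    "italy": ["*.infn.it"],
    "uk": ["*.ac.uk"],
}


def select_hosts(hosts: Iterable[str], selector: str) -> List[str]:
    """Return hosts matching a preset group name or a shell-style wildcard."""
    # A known group name expands to its patterns; anything else is a pattern.
    patterns = PRESET_GROUPS.get(selector.lower(), [selector])
    # Lower-case both sides: host names are case-insensitive, but fnmatch is
    # case-sensitive on POSIX systems.
    return [h for h in hosts
            if any(fnmatch.fnmatch(h.lower(), p.lower()) for p in patterns)]
```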
Group Response Time, Jan-95 to Nov-97
• Improved by between 1 and 2.5% per month
• Response time and loss show similar improvements
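A rate quoted as a percentage per month implies a compound (exponential) trend. One way to estimate such a rate, sketched here as an assumption since the slide does not say how its figures were fitted, is a least-squares line through the logarithms of the monthly values; `monthly_rate` is a hypothetical name.

```python
import math
from typing import Sequence


def monthly_rate(values: Sequence[float]) -> float:
    """Fit values[t] ~ A * (1 + r)**t by least squares on log(values); return r.

    t is the month index 0, 1, 2, ...; a negative r means the quantity is
    shrinking, which for response time or loss is an improvement.
    """
    n = len(values)
    if n < 2 or any(v <= 0 for v in values):
        raise ValueError("need at least two positive monthly values")
    ts = range(n)
    ys = [math.log(v) for v in values]
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
             / sum((t - t_mean) ** 2 for t in ts))
    return math.exp(slope) - 1
```

Jan-95 to Nov-97 spans 34 monthly steps, so a 1%/month improvement compounds to about 0.99^34 ≈ 0.71 of the starting level, and 2.5%/month to about 0.975^34 ≈ 0.42.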
Network Quiescence
• Frequency of zero packet loss (over all hours, not restricted to prime time)
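Quiescence, together with the reachability reported in the long-term tables, can be computed from per-sample loss figures. This sketch assumes one loss percentage per 30-minute sample and treats a sample with 100% loss as the host being unreachable (no answer to any ping); the function names are hypothetical. Keeping unreachable samples in the quiescence denominator is also an assumption.

```python
from typing import Sequence


def _check(loss_pcts: Sequence[float]) -> None:
    if not loss_pcts:
        raise ValueError("no samples")


def quiescence(loss_pcts: Sequence[float]) -> float:
    """Fraction of samples with zero packet loss (all hours, no prime-time cut)."""
    _check(loss_pcts)
    return sum(1 for x in loss_pcts if x == 0) / len(loss_pcts)


def reachability(loss_pcts: Sequence[float]) -> float:
    """Fraction of samples in which at least one ping was answered."""
    _check(loss_pcts)
    return sum(1 for x in loss_pcts if x < 100) / len(loss_pcts)
```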
Ping Loss Quality
• Want a quick-to-grasp indicator of link quality
• Loss is the most sensitive indicator
  • Loss of a packet requires a ~4 s TCP retry timeout
  • IBM studies on the economic value of response time showed a threshold around 4-5 s where complaints increase
• 0-1% = Good; 1-2.5% = Acceptable
• 2.5-5% = Poor; 5-12% = Very Poor
• > 12% = Bad
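The quality bands above, plus the "worst at top" ordering used in the monthly packet-loss tables, can be sketched as follows. The slide's ranges share their endpoints, so this sketch assumes each upper bound is inclusive (exactly 1% is Good, exactly 12% is Very Poor), which is consistent with "> 12% = Bad"; `loss_quality` and `worst_first` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Inclusive upper bound (percent loss) of each band, from the slide.
_BANDS: List[Tuple[float, str]] = [
    (1.0, "Good"),
    (2.5, "Acceptable"),
    (5.0, "Poor"),
    (12.0, "Very Poor"),
]


def loss_quality(loss_pct: float) -> str:
    """Map a packet-loss percentage to the slide's quality category."""
    for upper, label in _BANDS:
        if loss_pct <= upper:
            return label
    return "Bad"


def worst_first(monthly_loss: Dict[str, float]) -> List[Tuple[str, float, str]]:
    """Sort links by loss, worst at top, tagging each with its quality band."""
    ranked = sorted(monthly_loss.items(), key=lambda kv: kv[1], reverse=True)
    return [(link, loss, loss_quality(loss)) for link, loss in ranked]
```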
Quality Distributions
• ESnet median is good quality
• All other groups are poor or very poor
• Critical to have good peering
Traffic Growth
• Read out of the external router
• Exponential growth of 2.5-6% per month
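To put 2.5-6% monthly growth in perspective: if traffic grows by a fraction r per month, it doubles after T_2 months, where

```latex
\begin{aligned}
(1+r)^{T_2} &= 2 \;\Longrightarrow\; T_2 = \frac{\ln 2}{\ln(1+r)} \\
r = 0.025 &: \quad T_2 = \frac{0.6931}{0.02469} \approx 28.1\ \text{months} \\
r = 0.06  &: \quad T_2 = \frac{0.6931}{0.05827} \approx 11.9\ \text{months}
\end{aligned}
```

So at these rates the traffic doubles roughly every 1 to 2.3 years.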
Traffic Volume for Germany (DFN)
[Figure: DFN T1 utilization for 15 Jan ‘98 (5-min averages); green = to US, blue = from US]
[Figure: # of 2-min periods in Dec-96 with peak utilization > y %, to US and from US (# samples)]
Traffic Volume for ESnet Italy Link
[Figure: INFN T1 link utilization for 16 Jan ‘98]
[Figure: # of 2-min periods in Dec-96 with peak utilization > y %, from US and to US]
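The "# of 2 min periods with peak utilization > y %" curves are a complementary cumulative count. A minimal sketch, assuming the router readout has already been reduced to one peak-utilization percentage per 2-minute period; the function name and input layout are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def periods_above(peaks_pct: Sequence[float],
                  thresholds: Iterable[float]) -> Dict[float, int]:
    """For each threshold y, count the periods whose peak utilization exceeds y %.

    peaks_pct holds one peak-utilization value (percent of link capacity) per
    measurement period, e.g. per 2-minute interval over a month.
    """
    return {y: sum(1 for p in peaks_pct if p > y) for y in thresholds}
```

December 1996 has 31 × 720 = 22,320 two-minute periods, so for y = 0 the count is at most that number per direction.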
Traceroute
• Reverse traceroute servers
  • Provide traceroute from the Web server to the client
  • Available at about 30 HENP & ESnet sites
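A reverse traceroute server is essentially a CGI program that runs traceroute back toward whoever requested the page. The sketch below is hypothetical and is not the code deployed at the HENP/ESnet sites: it assumes a Unix `traceroute` binary on the PATH and relies on the standard CGI `REMOTE_ADDR` variable for the client's address. The address is validated and passed as an argument list, never through a shell, so a request cannot inject commands.

```python
#!/usr/bin/env python3
"""Minimal reverse-traceroute CGI sketch (hypothetical)."""
import ipaddress
import os
import subprocess
from typing import List


def traceroute_command(client_addr: str) -> List[str]:
    """Build a traceroute command toward the requesting client.

    Raises ValueError if client_addr is not a valid IP address.
    """
    addr = ipaddress.ip_address(client_addr)
    return ["traceroute", str(addr)]


def main() -> None:
    # CGI servers pass the client's address in the REMOTE_ADDR variable.
    cmd = traceroute_command(os.environ.get("REMOTE_ADDR", ""))
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    print("Content-Type: text/plain")
    print()  # blank line ends the CGI headers
    print(result.stdout or result.stderr)


if __name__ == "__main__":
    main()
```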
TracePing
• Multiple routes seen
Traceroute From TRIUMF
• Reverse traceroute servers
• Traceping
• TopologyMap
  • Ellipses show nodes on the route
  • Open ellipse is the measurement node
  • Blue ellipse: not reachable
  • Keeping history
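A topology map like TRIUMF's can be built by merging many traceroute paths from one measurement node into a next-hop graph. This is a minimal sketch under assumptions: each path is a list of hop addresses with `None` for a hop that did not answer, the names are hypothetical, and a silent hop simply breaks the chain rather than being guessed across.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence, Set

# One traceroute result: hop addresses in order; None marks a silent hop ("*").
Path = Sequence[Optional[str]]


def build_topology(source: str, paths: Sequence[Path]) -> Dict[str, Set[str]]:
    """Merge traceroute paths from one measurement node into a next-hop map."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    for path in paths:
        prev: Optional[str] = source
        for hop in path:
            if hop is None:
                # Unknown hop: we cannot link across it, so restart after it.
                prev = None
                continue
            if prev is not None:
                edges[prev].add(hop)
            prev = hop
    return dict(edges)
```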
Summary: Deployment & Development
• ESnet/HENP has 14 collection sites in 8 countries collecting data on > 500 links involving 22 countries
• 600MB/month/link, 6 bps/link, 0.25 FTE @ analysis site, 1.5-2.5 FTE on analysis
• HEPNRC gathering, archiving
• Long term reports being ported to HEPNRC from SLAC
• Long term analysis today requires a tool like SAS
  • Cost of a SAS (or Oracle) license is a problem for the analysis site
• XIWT/IPWT has deployed ~6 collection sites using PingER tools
Summary: Internet Performance
• Performance within ESnet is good
• Performance between ESnet & other sites is poor to very poor on average
  • One of the main causes is congestion points, so peering is critical
• ESnet traffic accepted from major HENP labs is growing by 2.5-6% per month
• Response time is improving by 1-2% per month
• Packet loss between SLAC & other sites is improving by 3% per month
Summary: Internet Performance (continued)
• Links to sites outside N. America vary from good (KEK) to bad
• Some of the bad sites are to be expected, e.g. FSU, China, Czech Republic; there are also some surprises, such as the UK
• CERN, France, Germany acceptable to poor
Summary: Next Steps
• Improve tools:
  • Deploy Poisson sampling & median measurements
  • Extend MapPing & bring it to production (work with NLANR), port traceping to Unix, extend deployment of the traceroute topology map
  • Make long term reports at the analysis site available & understandable
  • Get group defining/selection going
• Look at & compare site performance seen from multiple sites
• Look at new visualization techniques
• Look into prediction (extrapolations; develop models, configure and validate them with data)
• Pursue IETF Surveyor & NIMI deployment
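Poisson sampling replaces the fixed 30-minute schedule with exponentially distributed gaps, so that measurements cannot lock into step with periodic network behavior; RFC 2330, the IPPM framework, discusses this approach. A minimal sketch of generating such a schedule, with a hypothetical function name and an assumed 30-minute mean interval in the usage:

```python
import random
from typing import List, Optional


def poisson_schedule(mean_interval_s: float, duration_s: float,
                     seed: Optional[int] = None) -> List[float]:
    """Return measurement start times (seconds from 0) forming a Poisson process.

    Gaps between samples are drawn from an exponential distribution with the
    given mean, so sample times do not align with periodic events the way a
    fixed schedule can.
    """
    rng = random.Random(seed)           # seedable for reproducible schedules
    rate = 1.0 / mean_interval_s
    times: List[float] = []
    t = rng.expovariate(rate)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate)
    return times
```

With `poisson_schedule(1800, 30 * 86400)` a month yields about 1,440 samples on average, the same as the fixed 30-minute schedule, but at irregular times.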
National Internet Measurement Infrastructure (NIMI)
• Secure, scalable infrastructure for scheduling monitoring and gathering data
• Minimal amount of human intervention
• Inexpensive probe built on a PC FreeBSD platform
• Dynamic: can add/modify measurement suites; initially includes:
  • Traceroute
  • TReno: measures bulk transfer throughput
  • Poip: one-way ping
Asymmetric One-way Delays
[Figure: one-way loss (0-20% scale) and delay (0-300 ms scale) panels, U Chicago to Advanced and Advanced to U Chicago]
Summary: Collaboration
• Lots of collaboration:
  • SLAC & HEPNRC
  • 14 collection sites, ~400 remote sites
  • Collection site tools: CERN & CNAF/ICFA
  • Oxford/TracePing
  • MapPing/MAPNet/NLANR
  • TRIUMF traceroute topology map
  • NIMI/LBNL & Surveyor/IETF
  • XIWT/IPWT
  • Talks at IETF, XIWT, ICFA, ESCC ...
Summary: How to Join
• Collection site needs:
  • perl5 & an HTTP server
  • Install timeping & pingdata (needs only cgi-bin access, not root)
  • Decide on links to monitor
• Get an analysis site to retrieve the data & generate graphs, or at least get connectivity.pl & ping_data_plot.pl
• Need volunteers to work on analysis scripts (some of it will require SAS); also need Java applets for visualization
More Information
• ICFA Monitoring WG home page (links to status report, meeting notes, how to access data, and code)
  • http://www.slac.stanford.edu/xorg/icfa/ntf/home.html
• WAN Monitoring at SLAC has lots of links
  • http://www.slac.stanford.edu/comp/net/wan-mon.html
• Tutorial on WAN Monitoring
  • http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
• MapPing Tool:
  • http://www.slac.stanford.edu/~warrenm/work/java/newjava/mapping.html
• NIMI
  • http://www.psc.edu/~mahdavi/nimi_paper/NIMI.html