
End-to-end performance: issues and suggestions


Presentation Transcript


  1. End-to-end performance: issues and suggestions
TERENA 5th NRENs and Grids Workshop, Paris, June 2007
Mark Leese

  2. Talk Emphasis
• monALISA = a monitoring tool/framework
• DANTE = a network operator
• EGEE-II = a Grid
• Mark = a pseudo-Grid end user
• I'm not a real user, but I look at the issues from their viewpoint:
  • Large Hadron Collider in the UK (GridPP)
  • UK e-Science
  • OGF
• Aimed at a mixed audience (NRENs and Grid users), so some network/Grid things you will already know… Zzzzzzzzzzzz :)

  3. Contents
Just two things:
• What makes the Grid different to other network users, wrt performance?
• What are the end-to-end performance (monitoring) issues? Any suggestions?
If the links in the presentation don't work, they are listed again on the last three slides.

  4. 1. What makes the Grid different to other network users, wrt performance?

  5. The Grid
The Grid is all about:
• Sharing resources:
  • the obvious, e.g. databases
  • the specialised, e.g. remotely controlled telescopes
  • and new ideas, e.g. CPU time
  • co-allocating resources to a task to remove the limitations of the individual resources
  • most basic analogy: you can move house faster if you have two vans
• Sharing resources which are geographically distributed
• Sharing resources efficiently
  • optimisation: selecting the "best" resources for the job

  6. The Grid
[Diagram] Grid applications (process TBs of Particle Physics data from CERN detectors; obtain radio astronomy images from remote telescopes; analyse the human genome) run on top of middleware, which sits between those applications and the OS of the underlying resources (Storage Element, Chemical DB, Compute Elements), all joined by the network(s). Image courtesy of NRAO/AUI.

  7. The Grid
[Same diagram as the previous slide, with two points added:]
• Get apps running on the "right" resources (wherever they are)
• Make disparate compute resources into a coherent whole

  8. Optimisation
It's a little like the checkout counters in a supermarket:
• There is a line of 10 checkouts to which you can take your big shopping basket
• Two checkouts you cannot use. They are for people with "five items or less" – caisse express
• Another two checkouts cannot be used. They are reserved for something else (the staff's lunch break)
• Six left: how big is each queue, and how long will it take each person to exit the queue (how many items in each basket)?
If you choose wrong, you get delayed! You miss the train, you get home late, your partner has given your dinner to the dog.
To take the analogy to extremes: hopefully your basket does not have a broken wheel :)

  9. Scheduling
• Grid job = the basic unit of work
• SEs provide storage resources and access to mass storage systems
• CEs provide processing power, e.g. a cluster of Worker Nodes (PC farm)
• Scheduling = deciding when a job will run, and with which resources
• Typically there will be many CEs capable of running a job
  • If a CE already has lots of jobs queued, you would like to use another
• File replication = proven technique for improving data access
  • Distribute multiple copies of the same file across a Grid
  • Increases the number of CEs with good network connectivity to the data
  • Extreme example: Pisa→Roma or Pisa→Fermilab?
• So, typically there may also be several SEs holding the required data

  10. Network Aware Scheduling (i)
• So we have a set of CEs {a,b,c,…} and SEs {x,y,z,…} capable of running a job
• We want a node from each list such that the job will complete the fastest (a minimal selection sketch follows below)
• Take account of:
  • capability of CEs
  • size and number of jobs already waiting (queued) at CEs
  • performance of the network link for each CE-SE combination
• Further complicated by the compute/data intensity of the job:
  • computationally intensive job: lots of maths
  • data intensive job: lots and lots and lots of data
  • do we pull the data to the job or push the job to the data?
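The selection idea can be made concrete with a minimal Python sketch. This is illustrative only – not any middleware's actual scheduler – and every name and number in it (ce_info, measured_mbps, the crude queue estimate) is a hypothetical example: score each CE-SE combination using CE queue/compute estimates plus the monitored throughput of that CE-SE path, and pick the pair with the lowest estimated completion time.

  # Illustrative sketch of network-aware CE/SE selection.
  # All data structures and numbers are hypothetical examples.

  job_cpu_hours = 2.0          # estimated compute demand of the job
  input_size_gb = 40.0         # size of input data to be staged

  ce_info = {                  # CE -> (relative speed factor, queued jobs, avg job hours)
      "ce-a": (1.0, 12, 1.5),
      "ce-b": (0.8,  2, 1.5),
  }
  se_replicas = ["se-x", "se-y"]           # SEs holding a replica of the input data
  measured_mbps = {                        # (CE, SE) -> recent measured throughput
      ("ce-a", "se-x"): 500.0,
      ("ce-a", "se-y"): 200.0,
      ("ce-b", "se-x"):  80.0,
      ("ce-b", "se-y"): 450.0,
  }

  def estimate_completion_hours(ce, se):
      speed, queued, avg_hours = ce_info[ce]
      queue_wait = queued * avg_hours / max(speed, 0.1)   # crude queue estimate
      run_time = job_cpu_hours / speed
      transfer = (input_size_gb * 8000.0) / measured_mbps[(ce, se)] / 3600.0  # Mb / Mbps -> hours
      return queue_wait + run_time + transfer

  best = min(((ce, se) for ce in ce_info for se in se_replicas),
             key=lambda pair: estimate_completion_hours(*pair))
  print("best CE/SE pair:", best)

Whether you pull the data to the job or push the job to the data only changes which term dominates the estimate; the structure of the decision stays the same.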

  11. Network Aware Scheduling (ii)
• In Utopia we would know about the current state of the network, and any future reserved bandwidth
• In reality we could use monitored network performance to make an estimate
• It's not perfect, but patterns (diurnal variation, chronic poor performance…) can be identified
• The following slides show iperf tests between dedicated test nodes at LHC sites in the UK (GridPP's gridmon infrastructure)

  12. Network Aware Scheduling (iii.a)
• Transfer at 00:00, yes. Transfer at 12:00, no. There's a big difference between 500 and 200 Mbps for data intensive jobs!

  13. Network Aware Scheduling (iii.b)
• RAL Tier-2 → Tier-1: local transfers are likely the best performers

  14. Network Aware Scheduling (iii.c)
• Here, you have absolutely no idea what performance you would get → avoid
• Summary: ignore the network at your peril :)

  15. Network Aware Scheduling (iv)
• Two good papers to read:
  • B. Volckaert, P. Thysebaert, M. De Leenheer, F. De Turck, B. Dhoedt, P. Demeester, "Network Aware Scheduling in Grids"
  • Richard McClatchey, Ashiq Anjum, Heinz Stockinger, Arshad Ali, Ian Willers, Michael Thomas, "Data Intensive and Network Aware (DIANA) Grid Scheduling"
• We don't consider potential uses in more detail (job placement, replica selection) because we don't know if it will happen!

  16. Network Aware Scheduling (v)
• There are some negative feelings:
  • "The network is not a problem. Over-provisioning will always keep us ahead. Either that or fibre and GigE everywhere"
  • The Report of the International Grid Performance Workshop 2005 concluded that "Performance simply is not on the critical path for many application projects. Applications that struggle to get code to execute correctly simply do not consider whether they are using resources efficiently or achieving good performance"
  • Personal experience suggests that there is so much to think about elsewhere that the network is often the last thing to be considered
  • Right now, Grid apps rely on the network being good, with no real checks
• And by way of real life indications…
• EDG WP7 developed a "network cost function":
  • Returned the cost of variable-size file transfers between source and destination Grid elements
  • Based on periodic (WP7) iperf measurements
• Used by the WP2 Replica Optimization Service:
  • job placement: where to start a job so that it is as close as possible to the required data
  • replica selection: from where to fetch the closest replica once a job had started
• EDG was not a production Grid, and the work was not taken forward

  17. Network Aware Scheduling (vi)
• In EGEE…
• Tommaso Coviello and Tiziana Ferrari proposed to use network performance data from EGEE-JRA4:
  CompletionTime(CEi) = JobExecutionTime + max(InputDataTransferTime, QueueTime)
  • estimate file transfer times based on throughput (see the sketch below)
  • reject paths exhibiting packet loss
  • SE selection refined by preferring SEs reached over low-congestion links (jitter the suggested test)
• Some prototype work, but not taken forward:
  • QueueTime found to be unreliable
  • Data for 100 paths required within 0.2 seconds of receiving a request
  • The Grid Information Service was not ready to hold the data
  • a problem for JRA4's Web Service interface (WS: accessible but slow)
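A hedged sketch of the transfer-time estimate and the loss-based path rejection described above. The function names and the loss threshold are assumptions for illustration, not JRA4's actual code:

  # Sketch of CompletionTime(CEi) = JobExecutionTime + max(InputDataTransferTime, QueueTime),
  # with paths showing packet loss rejected up front. Thresholds are illustrative.

  LOSS_REJECT_THRESHOLD = 0.001     # reject paths with >0.1% measured packet loss (assumed value)

  def transfer_time_s(file_size_gb, measured_throughput_mbps, measured_loss):
      """Estimate input-data transfer time from monitored throughput; None = path rejected."""
      if measured_loss > LOSS_REJECT_THRESHOLD:
          return None                               # lossy path: do not consider this SE
      return (file_size_gb * 8000.0) / measured_throughput_mbps

  def completion_time_s(job_execution_s, queue_time_s, file_size_gb, thr_mbps, loss):
      t_xfer = transfer_time_s(file_size_gb, thr_mbps, loss)
      if t_xfer is None:
          return None
      # transfer can overlap with queueing, hence the max() in the formula
      return job_execution_s + max(t_xfer, queue_time_s)

  print(completion_time_s(job_execution_s=1800, queue_time_s=600,
                          file_size_gb=10, thr_mbps=400, loss=0.0))   # -> 2400.0 s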

  18. Network Aware Scheduling (vii)
• In WLCG/EGEE (if I understand correctly)…
• The "close SE" approach is applied:
  • Each CE must have a "close" SE: the node with the "best" access for data retrieval from that CE
  • These relationships are statically defined in the Grid's Information Service, which provides information about the Grid resources and their status

  $ lcg-infosites --vo dteam closeSE
  Name of the CE: g02.phy.bg.ac.yu:2119/blah-pbs-dteam
  se.phy.bg.ac.yu
  Name of the CE: fangorn.man.poznan.pl:2119/jobmanager-lcgpbs-dteam
  se1.egee.man.poznan.pl
  se2.egee.man.poznan.pl

  19. Network Aware Scheduling (viii)
• To run a job, the user submits a job description in JDL (Job Description Language) format
  • It defines which executable to run, any parameters, input data (Grid files) etc.
• A match-making process then takes place to identify a CE to execute the job:
  • identify all CEs which can run the job, i.e. match the user's requirements (JDL), and are "close" to an SE holding the required input Grid files
  • select the CE with the highest rank
  • by default, rank = an estimate of the time interval between the job being submitted and execution actually beginning
  • a function of the number of running and queued jobs at each CE
  • see the gLite User Guide for more info (a simplified sketch of this default selection follows below)
• As already stated, the presence of replicas of the data increases the number of CEs "close" to the data which can potentially execute the job
• But decisions are still made on the static declaration of "close" SEs
• Users are able to re-write the site selection code themselves
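By contrast with the network-aware sketch earlier, here is a simplified sketch of the default match-making just described: filter on requirements and the statically declared "close" SEs, then rank purely on an estimated wait time, with no network input at all. Names and numbers are made up; see the gLite User Guide for the real behaviour.

  # Simplified match-making: no network awareness, rank comes only from queue estimates.
  # All values are hypothetical.

  close_se = {                       # statically declared "close" SE per CE
      "ce-a": "se-x",
      "ce-b": "se-y",
  }
  estimated_wait_s = {"ce-a": 5400, "ce-b": 300}   # time until a new job would start running
  replicas_of_input = {"se-x", "se-z"}             # SEs holding the job's input files

  def meets_requirements(ce):
      return True                    # stand-in for matching the JDL Requirements expression

  candidates = [ce for ce in close_se
                if meets_requirements(ce) and close_se[ce] in replicas_of_input]

  # default rank: prefer the CE expected to start the job soonest
  best_ce = max(candidates, key=lambda ce: -estimated_wait_s[ce])
  print("selected CE:", best_ce)   # ce-a here, simply because ce-b's close SE has no replica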

  20. Difference 1
So, difference 1… The Grid may use network performance data to improve its decision making

  21. Difference 2
Difference 2… The Grid will exercise the network

  22. Qualitative View
• By its very nature…
  • sharing lots of resources to build powerful "systems"…
  • to process complex, large data sets…
  • in geographically distributed teams
  • some in real-time, e.g. visualisation
  • so far there have been lots of "embarrassingly parallel" problems (completely independent tasks which can be executed in parallel), but what about tasks requiring inter-processor communication (MPI, Message Passing Interface)?
• …= a lot of data moving across the network:
  • high bandwidth
  • low latency
  • stable and guaranteed transmission rates

  23. Quantitative View (i)
• The Large Hadron Collider is a collection of four experiments based at CERN (ALICE, ATLAS, CMS and LHCb) that will monitor the collision of accelerated particles
• ≈ 15 Petabytes of data generated every year
• Around 100,000 standard CPUs required to process it
• GridPP (UK) is contributing the equivalent of 10,000 PCs

  24. Quantitative View (ii)
• My understanding is that the LHC, when operational, will be pushing out 700 Mbytes/s (≈ 5.6 Gbps) from the Tier-0 to each Tier-1
  • 11 Tier-1s, linked to CERN with a 10 Gbps Optical Private Network
  • So no problems there
• Additional variable flows ≤ 4 Gbps are expected between the Tier-1s
• What about Tier-1s to Tier-2s?
  • > 150 Tier-2s, 18 in the UK
  • Tier-1s and Tier-2s currently linked by standard research networks
  • Are you going to commission dedicated fibres or lambdas for each?

  25. Quantitative View (iii)

  26. Rolls Royce Networks
• Lots of projects are working on adding extra intelligence into the network, and/or interfacing Grid applications with the network control plane for auto-provisioning of dedicated bandwidth:
  • Cisco's Network Based On-demand/Grid System (NBGS)
  • The NAREGI project
  • Enlightened Computing
  • http://www.g-lambda.net/
• These are still development projects
• Can fibre/lambdas be provided for all that need it?
  • Even if the £$€ are provided, is there a temptation to spend them on CPU power instead?
• May still fall victim to end-system and "last mile" (e.g. firewall) problems

  27. Is the Grid a lot of Hype?
• It's good to be sceptical about things. Every four years people say England will win the World Cup/Coupe du Monde ;-)
• The Grid is ambitious…
• …but so was the "World Wide Wait"
• Now everyone loves the Web, and it has become important to people:
  • Internet banking, online shopping (flights, holidays, music, supermarket…), e-Government etc. etc.
  • MySpace, Facebook, YouTube
• The Web also drove investment in the Net infrastructure, and as a result it can now support video conferencing, VoIP etc.

  28. Summary of Differences
• Network Operations: We can safely say that greater demands will be placed on the network:
  • massive datasets, 1000s of networked "resources"
  • geographically distributed: Long Fat Networks
  • high bandwidth, high availability, low latency
  • networks will need to be debugged for efficiency
• Network Intelligence: The Grid may want to consume network performance data to improve its decision making

  29. 2. What are the end-to-end performance (monitoring) issues?

  30. The Overall Issue
• We have seen that the Grid could use network performance data for decision making…
• …but we don't know whether it will
• As a result, we concentrate on debugging the network for Grid users

  31. End-to-End?
• When I say "end-to-end" I mean PC-PC, not PoP to PoP or similar
• Core and Metro Area are normally fine
• Most problems are in the last mile:
  • End-system:
    • NIC
    • disc
    • TCP config
    • poor cabling
    • the application itself (e.g. older versions of scp)
    • I could go on for ever ("no, please don't!")
  • Site firewall
  • Off-site connections

  32. So Many Issues
• Beyond the basics of which tests to run, and how to control/schedule them, there are too many end-to-end performance issues to consider when monitoring. Here, I mention a few and make some suggestions.
• TCP performance
  • Parallel TCP streams
  • Different data transfer protocols (e.g. GridFTP vs HTTP)
  • New protocols, e.g. DDCP
  • TCP/IP is ubiquitous so we stick with it – we can't necessarily wait for new protocols and network architectures
• Measurement types
  • active vs passive
  • capture logs of real GridFTP transfers… is there Grid Information Service support?
  • can we monitor Grid workflows in real-time?
• Too many test paths. Can we plug in to VO data to test only the required paths?

  33. Over-Provisioning
Q: Okay, so why don't we just throw some more bandwidth at the problem? Upgrade the links.
A: For want of a more interesting term to make sure you're still paying attention, this is what I call the Heroin Effect…
• You start off with a little, but that's not really doing it for you; it's not solving the problem. So you keep increasing the dose, yet it's never as good as you thought it would be.
• By analogy, you keep buying more and more bandwidth to take you to new highs, but it's never quite as good as you thought it would be
• Simple over-provisioning is not sufficient:
  • It doesn't address the key issue of end-to-end performance
  • The network backbone in most cases is genuinely not the source of the problem
  • The last mile (campus network → end-user system → your app) is often the cause of the problem: firewall, wiring, hard disc, application and many more potential culprits
  • Also, if simple over-provisioning were a total solution, there would not be so much other work going on, e.g. protocol research (high speed TCPs)

  34. Let's Put Fibre Everywhere (1)
• Fibre is cheaper than it was, but for large deployments it's still expensive
• We can see the benefits of fibre with the UKLight infrastructure and the ESLEA exploitation project, but it still doesn't address the end-to-end issue. Take a real-life ESLEA example (thanks to ESLEA for the figures)…
  • The UK wanted to transfer data from FermiLab (Chicago) to UCL for analysis by physicists, before returning the results
  • datasets currently 1-50 TB
  • 50 TB would take > 6 months on the production net, or one week at 700 Mbps (see the back-of-envelope sketch below)
  • So a 1 Gbps circuit-switched light path was provisioned
  • Result = disc-to-disc transfers @ 250 Mbps, just 1/4 of the theoretical max
  • Tests revealed a problem at an end site
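The figures above are easy to sanity-check with a small sketch. The 700 Mbps and 250 Mbps rates are the ones quoted on the slide; the ~25 Mbps "production net" sustained rate is an assumption chosen purely to illustrate the "> 6 months" claim.

  # Back-of-envelope: how long does a transfer take at a given sustained rate?
  def transfer_days(terabytes, sustained_mbps):
      megabits = terabytes * 8_000_000           # 1 TB = 8,000,000 Mb (decimal units)
      return megabits / sustained_mbps / 86_400  # 86,400 seconds per day

  print(transfer_days(50, 700))   # ~6.6 days at 700 Mbps -> "one week"
  print(transfer_days(50, 25))    # ~185 days at an assumed ~25 Mbps sustained -> "> 6 months"
  print(transfer_days(50, 250))   # ~18.5 days at the 250 Mbps actually achieved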

  35. Let's Put Fibre Everywhere (2)
• UCL: RealityGrid, for modelling complex condensed matter systems: computational steering, visualisation
  • Test node: 2 × 1.8 GHz Athlon, 4 GB, GigE, CentOS
• DL: HPCx supercomputer
  • Test node: 3 GHz P4, 2 GB, GigE, Scientific Linux
• RTT is always 9 ms
• TCP bandwidth is, errr....

  36. Mark's Tips
• There are lots of tools, frameworks and infrastructures out there.
  • Massive list at http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  • Pick something that works for you – it's a balance of:
    • ongoing administration
    • deployment effort (e.g. persuading remote sites to install tools and allow you to run tests)
    • how intrusive the tests are
• Start your investigations in the last mile
• Do put real data over the network
  • you can send 1 ping a second forever and see 10^-8 loss
  • you then run an iperf test and the performance is terrible
• Keep historic data: things change
  • you will want to look back, and you will want points of reference
• When you see a problem, follow it up and get information
  • Not only is the problem fixed, but you get to demonstrate why this is useful, which helps with deployment, support, growing the user base…
• Remember the social aspects – persistent but patient :)

  37. Suggestions: Tools and Techniques
• Start with the local host. As you would expect:
  • uname
  • netstat
  • ifconfig (watch error counters etc.)
• LISA (Localhost Information Service Agent)
  • a component of MonALISA
  • almost complete system monitoring (load, CPU, memory, disk, disk I/O, paging, processes, network traffic and connectivity...)
• Check everything:
  • TCP configuration (a small sketch of what to look at follows below)
  • machine load
  • disc (SAS, SATA, nasty old IDE?)
• If TCP is the problem, what UDP rates can you achieve?
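For the "check your TCP configuration" step, a small sketch that dumps the Linux TCP buffer settings most relevant to wide-area throughput. The /proc paths are the standard Linux locations; what counts as "big enough" depends on your bandwidth-delay product.

  # Print the Linux TCP settings most relevant to wide-area throughput.
  # Run on the end host; compare the max buffer sizes with your bandwidth-delay product.
  from pathlib import Path

  settings = [
      "/proc/sys/net/core/rmem_max",
      "/proc/sys/net/core/wmem_max",
      "/proc/sys/net/ipv4/tcp_rmem",        # min / default / max receive buffer
      "/proc/sys/net/ipv4/tcp_wmem",        # min / default / max send buffer
      "/proc/sys/net/ipv4/tcp_congestion_control",
  ]

  for path in settings:
      p = Path(path)
      value = p.read_text().strip() if p.exists() else "<not present>"
      print(f"{path}: {value}")

  # Example: a 1 Gbps path with 20 ms RTT needs roughly 1e9/8 * 0.02 = 2.5 MB of buffer.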

  38. Suggestions: Tools and Techniques
• ping is still useful, but you need to send much faster than 1 per second, and for a long time… 10^-8 loss
• "back of envelope" calculation: on Saturday I ran a 10 sec iperf test which transferred 624 MB in 480,000 packets. So ≈ 1.3 KB per packet
  • 1 loss every 100,000,000 packets ≈ 128 GB transferred before a loss causes your transfer rate to drop (the arithmetic is sketched below)
• can use the Synack tool (sparingly) if ICMP is blocked
• traceroute and reverse traceroutes: regularly measuring the routes to your most important collaborators is very useful
  • dedicated monitoring boxes are useful here because they may be allowed (firewalls etc.) for ICMP
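The arithmetic behind the "much faster than 1 per second" point, as a sketch. The packet size is taken from the iperf run quoted on the slide; everything else follows from it.

  # Why 1 ping/second cannot see a 1-in-10^8 loss rate in any reasonable time.
  packets_per_loss = 100_000_000
  bytes_per_packet = 624e6 / 480_000        # ~1.3 kB, from the quoted 10 s iperf run

  # At 1 ping per second, the expected time to the first lost packet:
  print(packets_per_loss / 86_400 / 365, "years")           # ~3.2 years

  # Data moved before a bulk transfer hits one loss at that rate:
  print(packets_per_loss * bytes_per_packet / 1e9, "GB")    # ~130 GB (the slide's ~128 GB)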

  39. Suggestions: Tools and Techniques
• As we will see, time series data is probably the most useful
  • When did your problems start? When did things change?
  • Unfortunately, this relies on there being proximity between your paths/devices and ones for which there is available data
• If you suspect the problem is in the core, you may be able to find the problem router (or its rough location) through so-called "looking glass" servers: statistics of network operator performance
• ping and iperf are very useful here… but be wary:
  • In May 2004, Les Cottrell (SLAC) said… "As measured by NetFlow, 25% of the traffic on Abilene is iperf and ping type traffic"

  40. Suggestions: Tools and Techniques
• Thrulay is an iperf-like tool for measuring TCP and UDP bandwidth
  • useful because it also gives you the RTT seen by the transfer, not ping/traceroute's estimate
• Two "detective" type tools:
• Tom Dunnigan and Rich Carlson's Network Diagnostic Tool (NDT)
  • client-server
  • useful because the client can be lightweight: a Java applet, runs in a Web browser on most systems
  • a command line client (compile and install) is also available
  • public servers (Linux boxes with Web100 kernels), although I think only one outside the US (thank you SWITCH)
  • detects problems, makes suggestions: duplex problems, TCP tuning amongst others
• The SURFnet Detective

  41. Suggestions: Tools and Techniques
[Screenshot: NDT's suggestion]

  42. Suggestions: Tools and Techniques
We could do these but don't, because there's too much data to process/correlate:
• Cisco NetFlow data – routers record details of all traffic "flows" which they see:
  • src and dest IP addresses and ports
  • start and end time
  • amount of traffic transferred
• Parsing firewall logs (a parsing sketch follows below):

  [root@gridmon2 ~]# iperf -c hepgrid7.ph.liv.ac.uk
  ------------------------------------------------------------
  Client connecting to hepgrid7.ph.liv.ac.uk, TCP port 5001
  TCP window size: 16.0 KByte (default)
  ------------------------------------------------------------
  [ 3] local 193.62.125.96 port 58316 connected with 138.253.178.107 port 5001
  [ 3]  0.0-10.0 sec  873 MBytes  732 Mbits/sec

  Jun 10 22:12:58: NetScreen device_id=gw-fw system-notification-00257(traffic): start_time="2007-06-10 22:15:55" duration=22 service=tcp/port:5001 src zone=ESC-DMZ dst zone=Untrust action=Permit sent=948533470 rcvd=40793960 src=<hidden> dst=<hidden> src_port=58316 dst_port=5001 session_id=995619

• Not wholly accurate (22 secs, not 10) and it ignores overheads, but it can be used for relative comparisons
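A sketch of how a firewall log could be used "relative", as the slide suggests: pull the sent bytes and duration out of a NetScreen-style line with a regular expression and turn them into an approximate rate. The regex only targets the example line above; real logs vary.

  # Rough throughput estimate from a NetScreen-style traffic log line.
  # Only intended for relative comparisons, as noted on the slide.
  import re

  log_line = ('Jun 10 22:12:58: NetScreen device_id=gw-fw '
              'system-notification-00257(traffic): start_time="2007-06-10 22:15:55" '
              'duration=22 service=tcp/port:5001 sent=948533470 rcvd=40793960 '
              'src_port=58316 dst_port=5001')

  m = re.search(r'duration=(\d+).*?sent=(\d+)', log_line)
  if m:
      duration_s, sent_bytes = int(m.group(1)), int(m.group(2))
      mbps = sent_bytes * 8 / duration_s / 1e6
      print(f"~{mbps:.0f} Mbit/s over {duration_s} s")

The 948,533,470 bytes over the logged 22 seconds come out around 345 Mbit/s, well below iperf's own 732 Mbit/s over 10 seconds, which is exactly why the slide says to treat such figures as relative rather than absolute.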

  43. Suggestions: Tools and Techniques
• SNMP data is (understandably) impossible to obtain for non-networkers
  • Sharing data with the OGF NM-WG XML schemas may improve things
• And now some quick examples from gridmon:
  • Dedicated boxes
  • Same spec, OS, configuration – makes life a lot easier (comparing like for like)
  • If running regular tests, get the results into an SQL database – fast, repeatable queries (a minimal sketch follows below)
• If no dedicated boxes are available, deploy a box for:
  • either the best performance possible
  • or something representative of systems at that end-site
• Sorry, no end-system examples here – we configured the boxes ourselves ;-)
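For "get the results into an SQL database", a minimal sqlite3 sketch. The schema and values are illustrative, not gridmon's real database; the point is simply the fast, repeatable time-series queries it enables.

  # Minimal example: store iperf results and pull back a time series for one path.
  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE tcp_tests (ts TEXT, src TEXT, dst TEXT, mbps REAL)")
  db.executemany("INSERT INTO tcp_tests VALUES (?, ?, ?, ?)", [
      ("2007-06-09T00:00", "glasgow", "edinburgh", 520.0),
      ("2007-06-09T12:00", "glasgow", "edinburgh", 190.0),
      ("2007-06-10T00:00", "glasgow", "edinburgh", 80.0),
  ])

  rows = db.execute("""SELECT ts, mbps FROM tcp_tests
                       WHERE src='glasgow' AND dst='edinburgh'
                       ORDER BY ts""").fetchall()
  for ts, mbps in rows:
      print(ts, mbps)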

  44. Example 1
• Glasgow running transfer tests to Edinburgh over the weekend of 28-29th October
• Experiencing poor rates (80 Mbps)
• 1st thing: despite transferring just 80 Mbps, residual TCP bandwidth drops by ≈ 400 Mbps
• Warning bells

  45. Example 1
• Traceroute data reveals a suspect router…

  traceroute to gridmon.epcc.ed.ac.uk (129.215.175.71), 30 hops max, 38 byte packets
   1  194.36.1.1 (194.36.1.1)  0.941 ms  0.882 ms  0.815 ms
   2  130.209.2.1 (130.209.2.1)  0.875 ms  0.831 ms  0.830 ms
   3  130.209.2.118 (130.209.2.118)  60.415 ms  55.453 ms  31.327 ms
   4  glasgowpop-ge1-2-glasgowuni-ge1-1-v152.clyde.net.uk (194.81.62.153)  32.420 ms  34.404 ms  29.424 ms
   5  glasgow-bar.ja.net (146.97.40.57)  43.467 ms  52.298 ms  39.349 ms
   6  po9-0.glas-scr.ja.net (146.97.35.53)  45.856 ms  44.445 ms  41.388 ms
   7  po3-0.edin-scr.ja.net (146.97.33.62)  51.509 ms  63.493 ms  31.435 ms
   8  po0-0.edinburgh-bar.ja.net (146.97.35.62)  22.454 ms  25.412 ms  31.381 ms
   9  146.97.40.122 (146.97.40.122)  44.602 ms  42.494 ms  35.492 ms
  10  gridmon.epcc.ed.ac.uk (129.215.175.71)  33.515 ms  34.623 ms  37.694 ms

  46. Example 1
• The reverse route confirms it. Traceroutes are normal until we hit the suspect router…

  traceroute to gppmon-gla.scotgrid.ac.uk (194.36.1.56), 30 hops max, 38 byte packets
   1  vlan175.srif-kb1.net.ed.ac.uk (129.215.175.126)  0.435 ms  0.387 ms  0.380 ms
   2  edinburgh-bar.ja.net (146.97.40.121)  0.357 ms  0.329 ms  0.322 ms
   3  po9-0.edin-scr.ja.net (146.97.35.61)  0.564 ms  0.485 ms  0.485 ms
   4  po3-0.glas-scr.ja.net (146.97.33.61)  1.656 ms  1.511 ms  1.499 ms
   5  po0-0.glasgow-bar.ja.net (146.97.35.54)  1.850 ms  1.352 ms  1.422 ms
   6  146.97.40.58 (146.97.40.58)  1.679 ms  1.661 ms  1.569 ms
   7  glasgowuni-ge1-1-glasgowpop-ge1-2-v152.clyde.net.uk (194.81.62.154)  1.796 ms  1.677 ms  1.646 ms
   8  130.209.2.117 (130.209.2.117)  31.197 ms  34.615 ms  29.121 ms
   9  130.209.2.2 (130.209.2.2)  32.814 ms  32.158 ms  32.145 ms
  10  gppmon-gla.scotgrid.ac.uk (194.36.1.56)  41.634 ms  37.555 ms  24.635 ms

• Graphs and traceroutes provide evidence for further investigation

  47. Example 1
• Further investigation revealed that the router had exhausted its CAM space
  • <see the next slide if you want to know what this is>
• In simple terms, the router was forced to switch in software
• Because a particular lookup in a routing/switching/access table was not being hardware accelerated, problems were caused under certain flow conditions
• The solution: the CAM dynamic database was re-optimised (to free up CAM space) and the unit began switching in hardware again

  48. Example 1
• CAM = Content-Addressable Memory
• A hardware (fast) implementation of an associative array:
  • a data word (not a memory address!) is used to access it
  • the CAM searches its entire contents to see if the data word is stored
  • if the word is found, the CAM returns a list of one or more corresponding storage addresses, or other data associated with those storage addresses
• CAM memory is used for switching and routing, e.g. Ethernet switches store learned MAC addresses and their associated switch port in CAM:

  MAC Address      Located on Port
  -------------    ---------------
  000039-0643f5    26
  000089-01af9a     5
  000102-162346    16

• When an Ethernet frame arrives at the switch with a destination address of 000089-01af9a, the switch searches its CAM for that address
• The CAM returns "5", so the switch sends this Ethernet frame out on port 5 (see the lookup sketch below)
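In software terms the CAM behaves like a dictionary keyed on the data word. A two-line sketch of the MAC-to-port lookup described above, using the table from the slide:

  # The learned-address table from the slide, expressed as an associative lookup.
  cam = {"000039-0643f5": 26, "000089-01af9a": 5, "000102-162346": 16}

  frame_dst = "000089-01af9a"
  print("forward out of port", cam.get(frame_dst, "flood to all ports"))   # -> 5

The point of the real CAM is that this lookup happens in hardware in a single step; once the table overflowed in Example 1, the router had to do the equivalent work in software, with the performance consequences shown earlier.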

  49. Example 2
• Local departmental firewall reconfigured to switch off strict checking of TCP sequence numbers
• Potential minefield: SACK etc.

  50. Example 3
• Almost constant 33% UDP packet loss
• Fatal to most/all applications using UDP
• Occasional dip to 0%
