Interactions Between Networks, Protocols & Applications


Presentation Transcript


  1. Interactions Between Networks, Protocols & Applications (HPCN-RG). Richard Hughes-Jones, OGF20, Manchester, May 2007

  2. ESLEA and UKLight at SC|05

  3. Reverse TCP: ESLEA and UKLight
     • 6 * 1 Gbit transatlantic Ethernet layer-2 paths (UKLight + NLR)
     • Disk-to-disk transfers with bbcp, Seattle to UK
     • TCP buffer and application set to give ~850 Mbit/s (see the buffer-sizing sketch below)
     • One stream of data: 840 Mbit/s
     • Stream of UDP VLBI data, UK to Seattle: 620 Mbit/s
     • No packet loss; worked well
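
     Sizing the socket buffer to roughly one bandwidth-delay product is what "set the TCP buffer to give ~850 Mbit/s" amounts to. Below is a minimal Python sketch of that calculation, not bbcp itself; the 160 ms round-trip time is an assumed figure for the Seattle-UK path, since the slide does not state it.

         # Minimal sketch (not bbcp): size the TCP socket buffers to the
         # bandwidth-delay product before a bulk transfer.
         # The 160 ms RTT and 850 Mbit/s target are assumptions for illustration.
         import socket

         TARGET_RATE_BPS = 850e6   # desired throughput in bits per second (assumed)
         RTT_SECONDS = 0.160       # assumed transatlantic round-trip time

         bdp_bytes = int(TARGET_RATE_BPS * RTT_SECONDS / 8)   # ~17 Mbytes

         sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
         # Request send/receive buffers of one BDP; the kernel may clamp the
         # values to net.core.wmem_max / net.core.rmem_max.
         sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
         sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
         print("Requested socket buffers of", bdp_bytes, "bytes (about one BDP)")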

  4. SC|05 HEP: Moving data with bbcp
     • What is the end-host doing with your network protocol? Look at the PCI-X buses:
     • PCI-X bus with the RAID controller: reads from disk for 44 ms in every 100 ms
     • PCI-X bus with the Ethernet NIC: writes to the network for 72 ms
     • 3Ware 9000 controller, RAID0; 1 Gbit Ethernet link; 2.4 GHz dual Xeon; ~660 Mbit/s
     • Power is needed in the end hosts
     • Careful application design matters

  5. SC2004: Disk-to-Disk bbftp
     • bbftp file transfer program uses TCP/IP
     • UKLight path: London - Chicago - London; PCs: Supermicro + 3Ware RAID0
     • MTU 1500 bytes; socket size 22 Mbytes; RTT 177 ms; SACK off
     • Move a 2 Gbyte file and record Web100 plots
     • Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
     • Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
     • Disk-TCP-Disk at 1 Gbit/s works!
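
     The 22 Mbyte socket size quoted above matches the bandwidth-delay product of the path; a quick check, assuming the target rate is the 1 Gbit/s line rate:

         \mathrm{BDP} = C \times \mathrm{RTT} = 1\,\mathrm{Gbit/s} \times 0.177\,\mathrm{s} \approx 177\,\mathrm{Mbit} \approx 22\,\mathrm{Mbytes}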

  6. Network & Disk Interactions
     • Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
     • 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0
     • Six 74.3 Gbyte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
     • Measure memory-to-RAID0 transfer rates (and % CPU in kernel mode) with and without UDP traffic (a simplified write test is sketched below):
     • Disk write alone: 1735 Mbit/s
     • Disk write + 1500-byte-MTU UDP: 1218 Mbit/s, a drop of 30%
     • Disk write + 9000-byte-MTU UDP: 1400 Mbit/s, a drop of 19%
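
     A much-simplified analogue of the memory-to-RAID0 write measurement, not the original test code: it writes a fixed amount of data from a memory buffer to a file on the RAID volume and reports the rate; the output path is a placeholder.

         # Simplified analogue of the memory-to-RAID0 write test (not the
         # original harness). Writes 2 Gbytes from a memory buffer to disk
         # and reports the achieved rate. OUT_PATH is a placeholder.
         import os
         import time

         OUT_PATH = "/raid0/testfile.bin"     # hypothetical RAID0 mount point
         BLOCK = bytes(8 * 1024 * 1024)       # 8 Mbyte buffer of zeros
         TOTAL_BYTES = 2 * 1024**3            # 2 Gbytes in total

         written = 0
         start = time.time()
         with open(OUT_PATH, "wb") as f:
             while written < TOTAL_BYTES:
                 f.write(BLOCK)
                 written += len(BLOCK)
             f.flush()
             os.fsync(f.fileno())             # make sure the data reaches the disks
         elapsed = time.time() - start

         print("Disk write: %.0f Mbit/s" % (written * 8 / elapsed / 1e6))
         # Repeat while a UDP stream loads the NIC to reproduce the
         # interaction shown on the slide (a 19-30% drop in write rate).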

  7. Remote Computing Farms in the ATLAS TDAQ Experiment

  8. ATLAS Remote Farms – Network Connectivity

  9. ATLAS Remote Computing: Application Protocol
     [Message-sequence diagram: SFI, SFO and the Event Filter Daemon (EFD); request event, send event data, process event, request buffer, send OK, send processed event; request-response time histogram]
     • Event request: the EFD requests an event from the SFI; the SFI replies with the event (~2 Mbytes)
     • Processing of the event
     • Return of the computation: the EF asks the SFO for buffer space, the SFO sends OK, and the EF transfers the results of the computation
     • tcpmon: an instrumented TCP request-response program that emulates the EFD-to-SFI communication (a simplified analogue is sketched below)
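
     A minimal request-response client in the spirit of those tcpmon measurements (it is not the tcpmon tool): it sends a small request, reads a fixed-size response and prints the request-response time. The host and port are placeholders; the 64-byte request and ~2 Mbyte response sizes follow the slide.

         # Minimal EFD-to-SFI style request-response client (not tcpmon).
         # HOST and PORT are placeholders for an SFI emulator.
         import socket
         import time

         HOST, PORT = "sfi.example.org", 5000
         REQUEST = b"R" * 64                   # 64-byte request
         RESPONSE_BYTES = 2 * 1024 * 1024      # ~2 Mbyte event

         with socket.create_connection((HOST, PORT)) as sock:
             for event in range(10):
                 t0 = time.time()
                 sock.sendall(REQUEST)
                 received = 0
                 while received < RESPONSE_BYTES:
                     chunk = sock.recv(65536)
                     if not chunk:
                         raise ConnectionError("server closed the connection")
                     received += len(chunk)
                 print("event %d: request-response time %.1f ms"
                       % (event, (time.time() - t0) * 1e3))
                 time.sleep(0.05)              # 50 ms processing pause, as in the tests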

  10. TCP Activity, Manchester-CERN Request-Response
     • Round-trip time 20 ms
     • 64-byte request (green), 1 Mbyte response (blue)
     • TCP is in slow start: the 1st event takes 19 RTT, or ~380 ms
     • The TCP congestion window gets reset on each request: the stack follows RFC 2581 & RFC 2861 and reduces cwnd after inactivity (see the note below)
     • Even after 10 s, each response takes 13 RTT, or ~260 ms
     • Achievable transfer throughput: 120 Mbit/s
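
     On Linux the RFC 2861 behaviour described here can be switched off with the net.ipv4.tcp_slow_start_after_idle sysctl. The sketch below only shows the current value and, when run with root privilege, clears it; whether to do so is a local policy decision, not a recommendation from the slide.

         # Show, and optionally disable, the RFC 2861 cwnd reduction after idle.
         SYSCTL = "/proc/sys/net/ipv4/tcp_slow_start_after_idle"

         with open(SYSCTL) as f:
             print("tcp_slow_start_after_idle =", f.read().strip())

         try:
             with open(SYSCTL, "w") as f:
                 f.write("0\n")                # keep cwnd across idle periods
             print("cwnd will now be kept across idle periods")
         except PermissionError:
             print("need root privilege to change this setting")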

  11. TCP Activity, Manchester-CERN Request-Response, TCP stack with no cwnd reduction
     • Round-trip time 20 ms
     • 64-byte request (green), 1 Mbyte response (blue)
     • TCP starts in slow start: the 1st event takes 19 RTT, or ~380 ms
     • The TCP congestion window grows nicely: responses take 3 round trips early on, then 2 RTT after ~1.5 s
     • Rate ~10 events/s (with a 50 ms wait)
     • Achievable transfer throughput grows to 800 Mbit/s
     • Data is transferred WHEN the application requires the data

  12. TCP Activity, Alberta-CERN Request-Response, TCP stack with no cwnd reduction
     • Round-trip time 150 ms
     • 64-byte request (green), 1 Mbyte response (blue)
     • TCP starts in slow start: the 1st event takes 11 RTT, or ~1.67 s
     • The TCP congestion window is in slow start until ~1.8 s, then congestion avoidance
     • Responses take 2 RTT after ~2.5 s
     • Rate 2.2 events/s (with a 50 ms wait)
     • Achievable transfer throughput grows slowly from 250 to 800 Mbit/s
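
     A rough consistency check, my arithmetic rather than anything on the slides: a 1 Mbyte response is about N = 1 Mbyte / 1448 bytes ≈ 724 segments. In pure slow start from an initial window of W = 2 segments the window grows by a factor b per round trip, with b between 1.5 (delayed ACKs) and 2, so delivering the response needs roughly

         n \approx \log_b\!\left(\frac{N}{W}\right) \approx 9 \text{ to } 15 \text{ round trips}

     which, together with connection set-up and the request itself, is broadly consistent with the 11-19 RTTs observed for the first event on these two paths.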

  13. Moving Constant Bit-Rate Data in Real Time for Very Long Baseline Interferometry. Stephen Kershaw, Ralph Spencer, Matt Strong, Simon Casey, Richard Hughes-Jones, The University of Manchester

  14. What is VLBI?
     • Resolution is set by the baseline; sensitivity by the bandwidth and the integration time
     • Bandwidth B is as important as time τ: we can use as many Gigabits as we can get!
     • The VLBI signal wave front is sampled at each telescope; the data wave front is sent over the network to the correlator
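
     The claim that bandwidth matters as much as observing time comes from the standard radiometer relation (general radio-astronomy background, not spelled out on the slide): the noise in the measured flux density falls as

         \Delta S \;\propto\; \frac{1}{\sqrt{B\,\tau}}

     so doubling the recorded bandwidth B improves the sensitivity as much as doubling the integration time τ, which is why e-VLBI wants every Gbit/s the network can deliver.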

  15. European e-VLBI Test Topology
     [Network map: telescopes at Onsala, Sweden (Chalmers University of Technology, Gothenburg), Metsähovi, Finland, Jodrell Bank, UK, Toruń, Poland, and Medicina, Italy, connected to the correlator at Dwingeloo, Netherlands; the links include Gbit links, 2 * 1 Gbit links and a dedicated DWDM link]

  16. CBR Test Setup

  17. CBR over TCP: can TCP deliver the data on time?
     • When there is packet loss, TCP decreases the rate
     • [Plot: throughput with a TCP buffer of 0.9 MB (the BDP), RTT 15.2 ms]
     • [Plot: effect of loss rate on message arrival time, TCP buffer 1.8 MB (the BDP), RTT 27 ms]
     • The concern is timely arrival of the data

  18. [Schematic: message arrival time vs. message number / time. After a packet loss the arrival time falls behind the expected arrival time at the constant bit rate, producing a delay in the stream until resynchronisation]

  19. CBR over TCP – Large TCP Buffer
     • Message size: 1448 bytes
     • Data rate: 525 Mbit/s
     • Route: Manchester - JIVE; RTT 15.2 ms
     • TCP buffer 160 MB
     • Drop 1 in 1.12 million packets
     • Throughput increases: peak throughput ~734 Mbit/s, minimum throughput ~252 Mbit/s

  20. CBR over TCP – Message Delay
     • Message size: 1448 bytes
     • Data rate: 525 Mbit/s
     • Route: Manchester - JIVE; RTT 15.2 ms
     • TCP buffer 160 MB
     • Drop 1 in 1.12 million packets
     • Peak delay ~2.5 s
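
     A quick consistency check (my arithmetic, not stated on the slide): a 160 Mbyte TCP buffer holds

         \frac{160\,\mathrm{Mbytes} \times 8\,\mathrm{bit/byte}}{525\,\mathrm{Mbit/s}} \approx 2.4\,\mathrm{s}

     of data at the constant bit rate, which is in line with the ~2.5 s peak delay observed after the loss.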

  21. [No transcript text for this slide]

  22. Summary & Conclusions
     • Standard TCP is not optimal for high-throughput, long-distance links
     • Packet loss is a killer for TCP
     • Check campus links and equipment, and the access links to the backbones
     • Users need to collaborate with the campus network teams and the DANTE PERT
     • New stacks are stable and give better response and performance
     • You still need to set the TCP buffer sizes! Check other kernel settings, e.g. the window-scale maximum (a quick check is sketched below)
     • Watch for "TCP Stack implementation Enhancements"
     • TCP tries to be fair: a large MTU has an advantage; short distances (small RTT) have an advantage
     • TCP does not share bandwidth well with other streams
     • The end hosts themselves matter: plenty of CPU power is required for the TCP/IP stack as well as the application
     • Packets can be lost in the IP stack due to lack of processing power
     • The interaction between hardware, protocol processing and the disk sub-system is complex
     • Application architecture and implementation are also important
     • The TCP protocol dynamics strongly influence the behaviour of the application
     • Users are now able to perform sustained 1 Gbit/s transfers
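
     For "set the TCP buffer sizes" and "check other kernel settings", a small sketch that reads the usual Linux limits; the sysctl names are standard Linux, but the values to set depend on the bandwidth-delay product of your own paths and are not given by the slide.

         # Print the Linux kernel limits that cap the TCP buffer sizes.
         KNOBS = [
             "/proc/sys/net/core/rmem_max",            # hard cap on SO_RCVBUF
             "/proc/sys/net/core/wmem_max",            # hard cap on SO_SNDBUF
             "/proc/sys/net/ipv4/tcp_rmem",            # min / default / max receive buffer
             "/proc/sys/net/ipv4/tcp_wmem",            # min / default / max send buffer
             "/proc/sys/net/ipv4/tcp_window_scaling",  # must be 1 for windows > 64 kbytes
         ]

         for path in KNOBS:
             try:
                 with open(path) as f:
                     print(path, "=", f.read().strip())
             except FileNotFoundError:
                 print(path, "is not present on this kernel")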

  23. Any Questions?

  24. Network switch limits behaviour
     • End-to-end UDP packets from udpmon (a simplified analogue is sketched below)
     • Only 700 Mbit/s throughput
     • Lots of packet loss
     • The packet-loss distribution shows the throughput is limited
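
     udpmon itself is not reproduced here; the sketch below is a much-simplified sender/receiver pair in the same spirit: the sender numbers each datagram, the receiver counts sequence gaps to estimate the loss and reports the achieved rate. The port, packet size and packet count are placeholder values.

         # Much-simplified udpmon-style UDP throughput and loss test.
         # Run "recv" on one host, then "send <receiver-ip>" on the other.
         import socket
         import struct
         import sys
         import time

         PORT, PKT_SIZE, NUM_PKTS = 14159, 1472, 100000

         def send(dest):
             sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
             payload = b"\0" * (PKT_SIZE - 4)
             start = time.time()
             for seq in range(NUM_PKTS):
                 sock.sendto(struct.pack("!I", seq) + payload, (dest, PORT))
             rate = NUM_PKTS * PKT_SIZE * 8 / (time.time() - start) / 1e6
             print("sent %d packets at %.0f Mbit/s" % (NUM_PKTS, rate))

         def recv():
             sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
             sock.bind(("", PORT))
             sock.settimeout(5.0)              # stop 5 s after the last packet
             got, last_seq, first_t, last_t = 0, -1, None, None
             try:
                 while True:
                     data, _ = sock.recvfrom(PKT_SIZE)
                     now = time.time()
                     first_t = first_t or now
                     last_t = now
                     got += 1
                     last_seq = struct.unpack("!I", data[:4])[0]
             except socket.timeout:
                 pass
             sent = last_seq + 1               # assumes the last packet arrived
             rate = got * PKT_SIZE * 8 / max(last_t - first_t, 1e-9) / 1e6
             print("received %d of %d packets (%.4f%% lost) at %.0f Mbit/s"
                   % (got, sent, 100.0 * (sent - got) / max(sent, 1), rate))

         if __name__ == "__main__":
             send(sys.argv[2]) if sys.argv[1] == "send" else recv()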

  25. LightPath Topologies

  26. Switched LightPaths [1]
     • Lightpaths are a fixed point-to-point path or circuit
     • Optical links (with FEC) have a BER of 10^-16, i.e. a packet loss rate of 10^-12, or about 1 loss in 160 days
     • In SJ5, lightpaths are known as Bandwidth Channels
     • Lab-to-lab lightpaths: many applications share the path; classic congestion points; TCP stream sharing and recovery; advanced TCP stacks
     • Host-to-host lightpath: one application; no congestion; advanced TCP stacks for large delay-bandwidth products

  27. Switched LightPaths [2]
     • User-controlled lightpaths
     • Grid scheduling of CPUs & network
     • Many application flows, but no congestion on each path
     • Lightweight framing possible
     • Some applications suffer when using TCP and may prefer UDP, DCCP, XCP, …
     • E.g. with e-VLBI the data wave front gets distorted and the correlation fails

  28. Test of TCP Sharing: Methodology (1 Gbit/s) – Les Cottrell & RHJ, PFLDnet 2005
     [Diagram: SLAC to CERN path with a bottleneck; iperf or UDT generating the TCP/UDP traffic, plus ICMP/ping probes at 1/s, in 4-minute and 2-minute regions]
     • Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
     • Used iperf/TCP and UDT/UDP to generate traffic
     • Each run was 16 minutes, in 7 regions

  29. TCP Reno, single stream, SLAC to CERN – Les Cottrell & RHJ, PFLDnet 2005
     • Low performance on fast, long-distance paths
     • AIMD: add a = 1 packet to cwnd per RTT; decrease cwnd by a factor b = 0.5 on congestion
     • Net effect: it recovers slowly and does not effectively use the available bandwidth, so throughput is poor
     • Unequal sharing
     • [Plot observations: congestion has a dramatic effect; recovery is slow; the RTT increases when the flow achieves its best throughput; remaining flows do not take up the slack when a flow is removed; need to increase the recovery rate]
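
     How slow "recovers slowly" is can be estimated with standard TCP arithmetic (my numbers, not the slide's): at rate C with round-trip time RTT and segment size MSS, the congestion window before a loss is about C·RTT/MSS segments; after the window is halved, Reno adds one segment per RTT, so recovery from a single loss takes roughly

         t_{\mathrm{recover}} \approx \frac{C \cdot \mathrm{RTT}^2}{2\,\mathrm{MSS}}

     For the SLAC-CERN path (C = 1 Gbit/s, RTT = 180 ms, MSS = 1500 bytes) this is about 7,500 RTTs, i.e. around 22 minutes.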

  30. Hamilton TCP, SLAC-CERN
     • One of the best performers: throughput is high
     • Big effects on the RTT when a flow achieves its best throughput
     • Flows share equally: two flows share equally
     • Appears to need more than one flow to achieve the best throughput
     • More than 2 flows appears less stable
