TCP/IP on High Bandwidth Long Distance Paths
or
So TCP works … but still the users ask: Where is my throughput?

Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” and look for Haystack

5th Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory

Layers & IP

The Network Layer 3: IP
  • IP Layer properties:
    • Provides best-effort delivery
    • It is unreliable:
      • Packets may be lost
      • Duplicated
      • Delivered out of order
    • Connectionless
    • Provides logical addresses
    • Provides routing
    • Demultiplexes data on the protocol number

The Internet Datagram

[Figure: IPv4 header layout (20 bytes), bit positions 0, 4, 8, 16, 19, 24, 31 – Vers, Hlen, Type of service, Total length; Identification, Flags, Fragment offset; TTL, Protocol, Header checksum; Source IP address; Destination IP address; IP Options (if any), Padding. In the frame, the IP header sits between the frame header and the transport data, with the FCS at the end.]

IP Datagram Format (cont.)
  • Type of Service – TOS: now being used for QoS
  • Total length: length of the datagram in bytes, includes header and data
  • Time to live – TTL: specifies how long the datagram is allowed to remain in the Internet
    • Routers decrement by 1
    • When TTL = 0 router discards datagram
    • Prevents infinite loops
  • Protocol: specifies the format of the data area
    • Protocol numbers administered by central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
  • Source & destination IP address: (32 bits each) contain IP address of sender and intended recipient
  • Options: (variable length) Mainly used to record a route, or timestamps, or specify routing
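
To make the header layout concrete, here is a minimal sketch (not from the original slides) that unpacks the fixed 20-byte IPv4 header with Python's struct module, using the field names from the diagram above.

```python
import struct

def parse_ipv4_header(packet: bytes) -> dict:
    """Unpack the fixed 20-byte IPv4 header described above."""
    vers_hlen, tos, total_len, ident, flags_frag, ttl, proto, csum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "version": vers_hlen >> 4,                    # Vers (4 bits)
        "header_len_bytes": (vers_hlen & 0x0F) * 4,   # Hlen is in 32-bit words
        "tos": tos,
        "total_length": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                            # e.g. ICMP=1, TCP=6, UDP=17
        "header_checksum": csum,
        "source": ".".join(str(b) for b in src),
        "destination": ".".join(str(b) for b in dst),
    }
```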

The Transport Layer 4: UDP
  • UDP provides:
    • Connectionless service over IP
      • No setup / teardown
      • One packet at a time
    • Minimal overhead – high performance
    • Provides best effort delivery
    • It is unreliable:
      • Packet may be lost
      • Duplicated
      • Out of order
    • Application is responsible for
      • Data reliability
      • Flow control
      • Error handling
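
To illustrate how minimal this service is, a hedged sketch of a UDP sender and receiver in Python; as the bullets above note, any loss, duplication or re-ordering is left entirely to the application (port 5001 is just an example).

```python
import socket

def udp_receiver(port: int = 5001) -> None:
    """Read datagrams; nothing guarantees arrival, uniqueness or order."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, addr = sock.recvfrom(65535)          # one datagram per call
        print(f"got {len(data)} bytes from {addr}")

def udp_sender(host: str, port: int = 5001, n: int = 10) -> None:
    """No connection setup or teardown: just send packets one at a time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for i in range(n):
        sock.sendto(f"frame {i}".encode(), (host, port))
```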

UDP Datagram Format

[Figure: UDP header (8 bytes), bit positions 0, 8, 16, 24, 31 – Source port, Destination port, UDP message length, Checksum (optional). In the frame, the UDP header and application data follow the frame header and IP header, with the FCS at the end.]

  • Source/destination port: port numbers identify the sending & receiving processes
    • Port number & IP address allow any application on the Internet to be uniquely identified
    • Ports can be static or dynamic
      • Static (< 1024): assigned centrally, known as well-known ports
      • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65,535)

The Transport Layer 4: TCP
  • TCP (RFC 793, RFC 1122) provides:
    • Connection-orientated service over IP
      • During setup the two ends agree on details
      • Explicit teardown
      • Multiple connections allowed
    • Reliable end-to-end Byte Stream delivery over unreliable network
    • It takes care of:
      • Lost packets
      • Duplicated packets
      • Out of order packets
    • TCP provides
      • Data buffering
      • Flow control
      • Error detection & handling
      • Limits network congestion

The TCP Segment Format

[Figure: TCP header layout (20 bytes plus options), bit positions 0, 4, 8, 10, 16, 24, 31 – Source port, Destination port; Sequence number; Acknowledgement number; Hlen, Resv, Code, Window; Checksum, Urgent ptr; Options (if any), Padding. In the frame, the TCP header and application data follow the frame header and IP header, with the FCS at the end.]

TCP Segment Format – cont.
  • Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
  • Sequence number: first byte in this segment from the sender’s byte stream
  • Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
  • Code: used to determine the segment purpose, e.g. SYN, ACK, FIN, URG
  • Window: advertises how much data this station is willing to accept; can depend on the buffer space remaining
  • Options: used for window scaling, SACK, timestamps, maximum segment size etc.

TCP – Providing Reliability

[Figure: time-sequence diagram between sender and receiver – segment n (sequence 1024, length 1024) is answered by ACK 2048 one RTT later; segment n+1 (sequence 2048, length 1024) is answered by ACK 3072 one RTT later.]
  • Positive acknowledgement (ACK) of each received segment
    • Sender keeps record of each segment sent
    • Sender awaits an ACK – “I am ready to receive byte 2048 and beyond”
    • Sender starts timer when it sends segment – so can re-transmit
  • Inefficient – sender has to wait

Flow Control: Sender – Congestion Window

[Figure: the sender’s sliding window (TCP cwnd) over its byte stream – data sent and ACKed; sent data buffered waiting for an ACK; unsent data that may be transmitted immediately; data waiting for the window to open, up to where the application writes. A received ACK advances the trailing edge, the sending host advances a marker as data is transmitted, and the receiver’s advertised window advances the leading edge.]
  • Uses the congestion window, cwnd, a sliding window to control the data flow
    • Byte count giving the highest byte that can be sent without an ACK
    • Transmit buffer size and advertised receive buffer size are important
    • An ACK gives the next sequence number to receive AND the available space in the receive buffer
    • A timer is kept for each packet

Flow Control: Receiver – Lost Data

[Figure: the receiver’s window over the byte stream – data given to the application (the application reads here); data ACKed but not yet given to the user; a gap of lost data; data received but not ACKed. The last ACK given marks the next byte expected (expected sequence number); the receiver’s advertised window advances its leading edge as the window slides.]
  • If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent carrying the expected sequence number

How It Works: TCP Slowstart

[Figure: CWND vs time – slow start (exponential increase) until packet loss; congestion avoidance (linear increase); after a timeout, retransmit and slow start again.]
  • Probe the network - get a rough estimate of the optimal congestion window size
  • The larger the window size, the higher the throughput
    • Throughput = Window size / Round-trip Time
  • Exponentially increase the congestion window size until a packet is lost
    • cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
      • Send 1st packet, get 1 ACK: increase cwnd to 2
      • Send 2 packets, get 2 ACKs: increase cwnd to 4
      • Time to reach cwnd size W: T_W = RTT × log2(W) (not exactly slow!)
    • Rate doubles each RTT (see the sketch below)
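
A small sketch of the growth just described, assuming one ACK per segment, no delayed ACKs and no loss; it reproduces the doubling per RTT and the T_W = RTT × log2(W) estimate.

```python
import math

def time_to_reach_window(w_segments: int, rtt_s: float) -> float:
    """RTTs needed for cwnd to grow from 1 segment to W, doubling each RTT."""
    return rtt_s * math.log2(w_segments)

print(f"{time_to_reach_window(1000, 0.1):.2f} s")   # ~1 s for a 1000-segment window at 100 ms rtt

def slow_start(rtts: int) -> None:
    cwnd = 1
    for r in range(rtts):
        print(f"RTT {r}: cwnd = {cwnd} segments")
        cwnd *= 2    # each ACKed segment adds one segment, so cwnd doubles per RTT
```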

TCP Slowstart Animated

Toby Rodwell, DANTE

  • Growth of CWND related to RTT
  • (Most important in the Congestion Avoidance phase)

[Animation: CWND at the source grows 1 → 2 → 4 as ACKs return from the sink.]

How It Works: TCP Congestion Avoidance

[Figure: CWND vs time – slow start (exponential increase) until packet loss; congestion avoidance (linear increase); after a timeout, retransmit and slow start again.]

  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
    • cwnd increased by 1 segment per RTT
    • cwnd increased by 1/cwnd for each ACK – a linear increase in rate
  • TCP takes packet loss as an indication of congestion!
  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
    • Standard TCP halves cwnd (a factor of 0.5)
    • The slow start to congestion avoidance transition is determined by ssthresh (see the AIMD sketch below)
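
A minimal sketch of the per-ACK and per-loss updates just described (Reno-style AIMD, windows in segments, a = 1, b = ½); real stacks add many refinements.

```python
def on_ack(cwnd: float, ssthresh: float) -> float:
    """cwnd after one ACK: exponential in slow start, ~1 segment per RTT afterwards."""
    if cwnd < ssthresh:
        return cwnd + 1.0         # slow start: +1 segment per ACK
    return cwnd + 1.0 / cwnd      # congestion avoidance: +1/cwnd per ACK

def on_loss(cwnd: float) -> tuple[float, float]:
    """Multiplicative decrease: halve the window; ssthresh marks the phase boundary."""
    ssthresh = cwnd / 2
    return ssthresh, ssthresh     # (new cwnd, new ssthresh)
```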

TCP Fast Retransmit & Recovery
  • Duplicate ACKs are due to lost segments or segments out of order.
  • Fast Retransmit: If the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected)
    • Sender re-transmits the missing segment
      • Set ssthresh to 0.5*cwnd – so enter congestion avoidance phase
      • Set cwnd = (0.5*cwnd +3 ) – the 3 dup ACKs
      • Increase cwnd by 1 segment when get duplicate ACKs
      • Keep sending new data if allowed by cwnd
      • Set cwnd to half original value on new ACK
    • no need to go into “slow start” again
  • At the steady state, cwnd oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again
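
A hedged sketch of the bookkeeping the slide describes (fast retransmit / fast recovery in the NewReno style, windows in segments); the retransmission hook is hypothetical, and timeouts and SACK are omitted.

```python
DUP_ACK_THRESHOLD = 3

class FastRecovery:
    def __init__(self, cwnd: float) -> None:
        self.cwnd, self.ssthresh = cwnd, cwnd
        self.dup_acks, self.in_recovery = 0, False

    def on_dup_ack(self) -> None:
        self.dup_acks += 1
        if self.dup_acks == DUP_ACK_THRESHOLD and not self.in_recovery:
            self.ssthresh = 0.5 * self.cwnd   # enter congestion avoidance afterwards
            self.cwnd = self.ssthresh + 3     # inflate by the 3 dup ACKs
            self.in_recovery = True           # re-transmit the missing segment here
        elif self.in_recovery:
            self.cwnd += 1                    # keep sending new data if allowed by cwnd

    def on_new_ack(self) -> None:
        if self.in_recovery:
            self.cwnd = self.ssthresh         # deflate to half the original value
            self.in_recovery = False
        self.dup_acks = 0
```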

TCP: Simple Tuning – Filling the Pipe

[Figure: sender/receiver timeline – segment time on wire = bits in segment / BW; the ACK returns after one RTT.]
  • Remember, TCP has to hold a copy of data in flight
  • Optimal (TCP buffer) window size depends on:
    • Bandwidth end to end, i.e. min(BWlinks) AKA bottleneck bandwidth
    • Round Trip Time (RTT)
  • The number of bytes in flight to fill the entire path:
    • Bandwidth × Delay Product: BDP = RTT × BW
    • Sizing the buffer to the BDP can increase achievable throughput by orders of magnitude (worked example below)
  • Windows are also used for flow control
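
A worked sketch of the arithmetic above; the 1 Gbit/s rate and the rtt values are illustrative examples.

```python
def bdp_bytes(bandwidth_bit_s: float, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes in flight needed to fill the pipe."""
    return bandwidth_bit_s * rtt_s / 8

for name, rtt in [("UK, 6 ms", 0.006), ("Europe, 25 ms", 0.025), ("USA, 150 ms", 0.150)]:
    print(f"{name}: TCP buffer >= {bdp_bytes(1e9, rtt) / 1e6:.1f} MBytes at 1 Gbit/s")
# ~0.8, ~3.1 and ~18.8 MBytes -- far larger than typical default socket buffers
```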

Standard TCP (Reno) – What’s the problem?
  • TCP has 2 phases:
    • Slowstart: probe the network to estimate the available bandwidth. Exponential growth.
    • Congestion Avoidance: the main data transfer phase – the transfer rate grows “slowly”.
  • AIMD and High Bandwidth – Long Distance networks

Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.

    • For each ACK in an RTT without loss:
      cwnd → cwnd + a/cwnd   (Additive Increase, a = 1)
    • For each window experiencing loss:
      cwnd → cwnd − b·cwnd   (Multiplicative Decrease, b = ½)
  • Packet loss is a killer!!

TCP (Reno) – Details of problem #1
  • Time for TCP to recover its throughput from 1 lost 1500 byte packet grows rapidly with rtt; for an rtt of ~200 ms @ 1 Gbit/s it runs to tens of minutes
  • Example recovery times @ 1 Gbit/s:
    • UK, rtt 6 ms: 1.6 s
    • Europe, rtt 25 ms: 26 s
    • USA, rtt 150 ms: 28 min
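
A back-of-envelope sketch of where figures like these come from, assuming plain AIMD recovery (cwnd halves on loss and regains one MSS per RTT); it reproduces the 6 ms and 25 ms entries above.

```python
def recovery_time_s(bandwidth_bit_s: float, rtt_s: float, mss_bytes: int = 1500) -> float:
    """Time for AIMD to rebuild a halved window: (full window / 2) segments, one per RTT."""
    window_segments = bandwidth_bit_s * rtt_s / (8 * mss_bytes)
    return (window_segments / 2) * rtt_s

print(f"{recovery_time_s(1e9, 0.006):.1f} s")   # ~1.5 s  (UK, 6 ms)
print(f"{recovery_time_s(1e9, 0.025):.1f} s")   # ~26 s   (Europe, 25 ms)
```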

Investigation of new TCP Stacks
  • The AIMD Algorithm – Standard TCP (Reno)
    • For each ACK in an RTT without loss:
      cwnd → cwnd + a/cwnd   (Additive Increase, a = 1)
    • For each window experiencing loss:
      cwnd → cwnd − b·cwnd   (Multiplicative Decrease, b = ½)
  • High Speed TCP
    a and b vary depending on the current cwnd, using a table
    • a increases more rapidly with larger cwnd – returns to the ‘optimal’ cwnd size sooner for the network path
    • b decreases less aggressively and, as a consequence, so does the cwnd; the effect is that there is not such a decrease in throughput
  • Scalable TCP
    a and b are fixed adjustments for the increase and decrease of cwnd
    • a = 1/100 – the increase is greater than for TCP Reno
    • b = 1/8 – the decrease on loss is less than for TCP Reno
    • Scalable over any link speed (Reno vs Scalable sketched below)
  • Fast TCP
    Uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
  • HSTCP-LP, H-TCP, BiC-TCP
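
A minimal sketch contrasting the Reno and Scalable TCP rules listed above (windows in segments; the Scalable constants a = 1/100 and b = 1/8 are those quoted on the slide).

```python
def reno_update(cwnd: float, loss: bool) -> float:
    return cwnd * 0.5 if loss else cwnd + 1.0 / cwnd     # a = 1 per RTT, b = 1/2 on loss

def scalable_update(cwnd: float, loss: bool) -> float:
    return cwnd * (1 - 0.125) if loss else cwnd + 0.01   # a = 1/100 per ACK, b = 1/8 on loss
```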

Let’s check out this theory about new TCP stacks

  • Does it matter?
  • Does it work?

Problem #1: Packet Loss – Is it important?

Packet Loss with New TCP Stacks
  • TCP Response Function
    • Throughput vs loss rate – the further to the right, the faster the recovery
    • Packets are dropped in the kernel
  • [Plots: MB-NG, rtt 6 ms; DataTAG, rtt 120 ms]
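
For reference (not on the slide itself), the standard Reno response function that such throughput-vs-loss plots are usually compared against is the Mathis et al. estimate, with MSS the segment size and p the packet loss probability:

$$\text{Throughput} \;\approx\; \frac{MSS}{RTT}\cdot\frac{1.22}{\sqrt{p}}$$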

Packet Loss and new TCP Stacks
  • TCP Response Function
    • UKLight London-Chicago-London rtt 177 ms
    • 2.6.6 Kernel
    • Agreement with theory is good
    • Some new stacks are good at high loss rates

High Throughput Demonstrations

[Diagram: London (Chicago) to Manchester, rtt 6.2 ms (Geneva, rtt 128 ms) – dual Xeon 2.2 GHz hosts lon01 and man03 on 1 GEth, through Cisco 7609 and Cisco GSR routers over the 2.5 Gbit SDH MB-NG core. Send data with TCP, drop packets, monitor TCP with Web100.]

High Performance TCP – MB-NG
  • Drop 1 in 25,000
  • rtt 6.2 ms
  • Recover in 1.6 s
  • [Web100 plots for Standard, HighSpeed and Scalable TCP]

High Performance TCP – DataTAG
  • Different TCP stacks tested on the DataTAG Network
  • rtt 128 ms
  • Drop 1 in 10^6
  • High-Speed
    • Rapid recovery
  • Scalable
    • Very fast recovery
  • Standard
    • Recovery would take ~ 20 mins

FAST Demo via OMNInet and DataTAG

[Diagram: FAST demo topology – workstations and a FAST display at NU-E (Leverone) connect with 2 x GE to a Nortel Passport 8600; 10GE links and photonic switches carry the OMNInet layer 2 path to StarLight Chicago; a second Nortel Passport 8600 joins the DataTAG layer 2/3 path over OC-48, with Alcatel 1670 equipment and Cisco 7609 routers at CERN and CalTech, to workstations at CERN Geneva about 7,000 km away; San Diego is also shown. FAST demo: Cheng Jin, David Wei (Caltech); A. Adriaanse, C. Jin, D. Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); S. Ravot (Caltech/CERN).]

FAST TCP vs newReno

[Plots: traffic flow on Channel #1 (newReno), utilization 70%; traffic flow on Channel #2 (FAST), utilization 90%.]

Problem #2: Is TCP fair?

Look at Round Trip Times & Maximum Transfer Unit

MTU and Fairness

[Diagram: hosts #1 and #2 at CERN (GVA), each on 1 GE into a GbE switch; a POS 2.5 Gbps bottleneck between routers R to Starlight (Chi); receiving hosts #1 and #2, each on 1 GE.]
  • Two TCP streams share a 1 Gb/s bottleneck
  • RTT = 117 ms
  • MTU = 3000 bytes: average throughput over a period of 7000 s = 243 Mb/s
  • MTU = 9000 bytes: average throughput over a period of 7000 s = 464 Mb/s
  • Link utilization: 70.7%

Sylvain Ravot DataTag 2003

RTT and Fairness

[Diagram: hosts at CERN (GVA) on 1 GE into a GbE switch; POS 2.5 Gb/s, POS 10 Gb/s and 10GE links between routers R via Starlight (Chi) to Sunnyvale; the two streams, CERN–Starlight and CERN–Sunnyvale, share the 1 GE bottleneck towards the receiving hosts.]
  • Two TCP streams share a 1 Gb/s bottleneck
  • CERN <-> Sunnyvale: RTT = 181 ms; average throughput over a period of 7000 s = 202 Mb/s
  • CERN <-> Starlight: RTT = 117 ms; average throughput over a period of 7000 s = 514 Mb/s
  • MTU = 9000 bytes
  • Link utilization = 71.6%

Sylvain Ravot DataTag 2003

Problem #n: Do TCP Flows Share the Bandwidth?

Test of TCP Sharing: Methodology (1 Gbit/s)

[Diagram: iperf TCP/UDP traffic from SLAC through a bottleneck towards Caltech/UFL/CERN (iperf or UDT at the far end), with 1/s ICMP/ping traffic alongside; traffic is scheduled in 4-minute and 2-minute regions.]
  • Chose 3 paths from SLAC (California)
    • Caltech (10ms), Univ Florida (80ms), CERN (180ms)
  • Used iperf/TCP and UDT/UDP to generate traffic
  • Each run was 16 minutes, in 7 regions

Les Cottrell PFLDnet 2005

TCP Reno Single Stream

[Plot annotations: congestion has a dramatic effect and recovery is slow; RTT increases when the flow achieves its best throughput; remaining flows do not take up the slack when a flow is removed; increase recovery rate. Les Cottrell, PFLDnet 2005.]
  • Low performance on fast long distance paths
    • AIMD (add a=1 pkt to cwnd / RTT, decrease cwnd by factor b=0.5 in congestion)
    • Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput
    • Unequal sharing

SLAC to CERN

Fast

  • As well as packet loss, FAST uses RTT to detect congestion
    • RTT is very stable: σ(RTT) ~ 9 ms vs 37±0.14 ms for the others
  • [Plots, SLAC-CERN: the 2nd flow never gets an equal share of the bandwidth; big drops in throughput take several seconds to recover from.]

Hamilton TCP

[Plots, SLAC-CERN: two flows share equally; appears to need >1 flow to achieve the best throughput; >2 flows appears less stable.]
  • One of the best performers
    • Throughput is high
    • Big effects on RTT when achieves best throughput
    • Flows share equally

Problem #n+1: To SACK or not to SACK?

The SACK Algorithm

[Plots: HS-TCP with standard SACKs vs updated SACKs, rtt 150 ms; test host a Dell 1650, 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000 NIC. Doug Leith, Yee-Ting Li.]
  • SACK rationale
    • Non-contiguous blocks of data can be ACKed
    • The sender retransmits just the lost packets
    • Helps when multiple packets are lost in one TCP window
  • The SACK processing is inefficient for large bandwidth-delay products
    • The sender write queue (a linked list) is walked for:
      • Each SACK block
      • Marking lost packets
      • Re-transmission
    • Processing takes so long that the input queue becomes full
    • Timeouts result

SACK …
  • Look into what’s happening at the algorithmic level with web100:
  • Strange hiccups in cwnd – the only correlation is with SACK arrivals

Scalable TCP on MB-NG with 200 Mbit/s CBR background

Yee-Ting Li

Real Applications on Real Networks
  • Disk-2-disk applications on real networks
    • Memory-2-memory tests
    • Transatlantic disk-2-disk at Gigabit speeds
    • HEP&VLBI at SC|05
  • Remote Computing Farms
    • The effect of TCP
    • The effect of distance
  • Radio Astronomy e-VLBI
    • Leave for the talk later in the meeting

iperf Throughput + Web100
  • SuperMicro on MB-NG network
    • HighSpeed TCP
    • Line speed 940 Mbit/s
    • DupACKs? < 10 (expect ~400)
  • BaBar on production network
    • Standard TCP
    • 425 Mbit/s
    • DupACKs 350-400 – re-transmits

Applications: Throughput Mbit/s
  • HighSpeed TCP
  • 2 GByte file, RAID5
  • SuperMicro + SuperJANET
  • bbcp
  • bbftp
  • Apache
  • GridFTP
  • Previous work used RAID0 (not disk limited)

Transatlantic Disk to Disk Transfers

With UKLight

SuperComputing 2004

bbftp: What Else Is Going On?

Scalable TCP

  • SuperMicro + SuperJANET
    • Instantaneous 0 - 550 Mbit/s
  • BaBar + SuperJANET
    • Instantaneous 200 - 600 Mbit/s
    • Disk-mem ~590 Mbit/s – remember the end host
  • Congestion window – duplicate ACK
  • Throughput variation not TCP related?
    • Disk speed / bus transfer
    • Application architecture

SC2004 UKLIGHT Overview

[Diagram: SC2004 SLAC and Caltech booths (Cisco 6509, Caltech 7600, UltraLight IP) connect over the NLR lambda NLR-PITT-STAR-10GE-16 to Chicago Starlight; UKLight 10G and SURFnet/EuroLink 10G circuits run via Amsterdam to ULCC UKLight; four 1 GE channels reach the MB-NG 7600 OSR in Manchester and two 1 GE channels reach the UCL network / UCL HEP; MB-NG managed bandwidth.]

Transatlantic Ethernet: TCP Throughput Tests
  • Supermicro X5DPE-G2 PCs
  • Dual 2.9 GHz Xeon CPU, FSB 533 MHz
  • 1500 byte MTU
  • 2.6.6 Linux Kernel
  • Memory-memory TCP throughput
  • Standard TCP
  • Wire rate throughput of 940 Mbit/s
  • First 10 sec
  • Work in progress to study:
    • Implementation detail
    • Advanced stacks
    • Effect of packet loss
    • Sharing

SC2004 Disk-Disk bbftp
  • bbftp file transfer program uses TCP/IP
  • UKLight: Path:- London-Chicago-London; PCs:- Supermicro +3Ware RAID0
  • MTU 1500 bytes; Socket size 22 Mbytes; rtt 177ms; SACK off
  • Move a 2 GByte file
  • Web100 plots:
  • Standard TCP
  • Average 825 Mbit/s
  • (bbcp: 670 Mbit/s)
  • Scalable TCP
  • Average 875 Mbit/s
  • (bbcp: 701 Mbit/s, ~4.5 s of overhead)
  • Disk-TCP-Disk at 1Gbit/s

Network & Disk Interactions (work in progress)

[Plots of disk write rate and % CPU in kernel mode:
  • Disk write alone: 1735 Mbit/s
  • Disk write + 1500 MTU UDP: 1218 Mbit/s – a drop of 30%
  • Disk write + 9000 MTU UDP: 1400 Mbit/s – a drop of 19%]
  • Hosts:
    • Supermicro X5DPE-G2 motherboards
    • dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
    • 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0
    • six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
  • Measure memory to RAID0 transfer rates with & without UDP traffic

Transatlantic Transfers

With UKLight

SuperComputing 2005

ESLEA and UKLight
  • 6 × 1 Gbit transatlantic Ethernet layer 2 paths, UKLight + NLR
  • Disk-to-disk transfers with bbcp
    • Seattle to UK
    • Set the TCP buffer and application to give ~850 Mbit/s
    • One stream of data, 840-620 Mbit/s
  • Stream UDP VLBI data
    • UK to Seattle
    • 620 Mbit/s
  • [Plot: reverse TCP traffic]

SC|05 – SLAC 10 Gigabit Ethernet
  • 2 Lightpaths:
    • Routed over ESnet
    • Layer 2 over Ultra Science Net
  • 6 Sun V20Z systems per λ
    • 3 Transmit 3 Receive
  • dCache remote disk data access
    • 100 processes per node
    • Node sends or receives
    • One data stream 20-30 Mbit/s
  • Used Neterion NICs & Chelsio TOE
  • Data also sent to StorCloud using fibre channel links
  • Traffic on the 10 GE link for 2 nodes: 3-4 Gbit per node, 8.5-9 Gbit on the trunk

Remote Computing Farms in the ATLAS TDAQ Experiment

ATLAS Remote Farms – Network Connectivity

ATLAS Application Protocol

[Diagram: message sequence between the Event Filter Daemon (EFD) and the SFI/SFO – request event, send event data, process event, request buffer, send OK, send processed event; the request-response time is histogrammed.]
  • Event Request
    • EFD requests an event from SFI
    • SFI replies with the event ~2Mbytes
  • Processing of event
  • Return of computation
    • EF asks SFO for buffer space
    • SFO sends OK
    • EF transfers results of the computation
  • tcpmon - instrumented TCP request-response program emulates the Event Filter EFD to SFI communication.

tcpmon: TCP Activity Manc-CERN Req-Resp
  • Round trip time 20 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP in slow start
  • 1st event takes 19 rtt or ~380 ms
  • The TCP congestion window gets re-set on each request
    • TCP stack RFC 2581 & RFC 2861: reduction of cwnd after inactivity
    • Even after 10 s, each response takes 13 rtt or ~260 ms
  • Transfer achievable throughput 120 Mbit/s
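
One common knob associated with this cwnd collapse after idle on Linux senders is net.ipv4.tcp_slow_start_after_idle; a hedged sketch of inspecting (and, with root, disabling) it follows. This is offered as an illustration, not as the exact tuning used in the talk.

```python
from pathlib import Path

knob = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")
print("tcp_slow_start_after_idle =", knob.read_text().strip())   # 1 = collapse cwnd after idle
# knob.write_text("0")   # uncomment (as root) to keep cwnd across idle periods on this host
```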

tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
  • Round trip time 20 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP starts in slow start
  • 1st event takes 19 rtt or ~380 ms
  • The TCP congestion window grows nicely
  • A response takes 2 rtt after ~1.5 s
  • Rate ~10/s (with 50 ms wait)
  • Transfer achievable throughput grows to 800 Mbit/s
  • Data is transferred WHEN the application requires the data

tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
  • Round trip time 150 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP starts in slow start
  • 1st event takes 11 rtt or ~1.67 s
  • The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
  • A response arrives in 2 rtt after ~2.5 s
  • Rate 2.2/s (with 50 ms wait)
  • Transfer achievable throughput grows slowly from 250 to 800 Mbit/s

Summary & Conclusions
  • Standard TCP is not optimum for high throughput, long distance links
  • Packet loss is a killer for TCP
    • Check on campus links & equipment, and access links to backbones
    • Users need to collaborate with the campus network teams
    • Dante PERT
  • New stacks are stable and give better response & performance
    • Still need to set the TCP buffer sizes! (see the example below)
    • Check other kernel settings, e.g. window-scale maximum
    • Watch for “TCP stack implementation enhancements”
  • TCP tries to be fair
    • Large MTU has an advantage
    • Short distances (small RTT) have an advantage
  • TCP does not share bandwidth well with other streams
  • The end hosts themselves
    • Plenty of CPU power is required for the TCP/IP stack as well as the application
    • Packets can be lost in the IP stack due to lack of processing power
    • The interaction between HW, protocol processing, and the disk sub-system is complex
  • Application architecture & implementation are also important
    • The TCP protocol dynamics strongly influence the behaviour of the application
  • Users are now able to perform sustained 1 Gbit/s transfers
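
A hedged sketch of setting per-socket TCP buffers from an application, sized to the bandwidth-delay product discussed earlier; the Linux limits net.core.rmem_max / net.core.wmem_max must also be raised or the request is capped (the 1 Gbit/s, 120 ms figures are illustrative).

```python
import socket

def make_tuned_socket(bandwidth_bit_s: float, rtt_s: float) -> socket.socket:
    """TCP socket with send/receive buffers sized to the path's bandwidth*delay product."""
    bdp = int(bandwidth_bit_s * rtt_s / 8)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp)
    return s

sock = make_tuned_socket(1e9, 0.120)    # ~15 MByte buffers for 1 Gbit/s at 120 ms rtt
```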

More Information Some URLs 1
  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site:http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC tests:
    http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
    “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards” FGCS Special issue 2004
    http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons:“Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks” Journal of Grid Computing 2004
  • PFLDnet http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT http://www.geant2.net/server/show/nav.00d00h002

More Information Some URLs 2
  • Lectures, tutorials etc. on TCP/IP:
    • www.nv.cc.va.us/home/joney/tcp_ip.htm
    • www.cs.pdx.edu/~jrb/tcpip.lectures.html
    • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
    • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
    • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
    • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia
    • http://www.freesoft.org/CIE/index.htm
  • TCP/IP Resources
    • www.private.org.il/tcpip_rl.html
  • Understanding IP addresses
    • http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122)
    • ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc (RFC 1010)
    • http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Any Questions?

Backup Slides

Latency Measurements
  • UDP/IP packets sent between back-to-back systems
    • Processed in a similar manner to TCP/IP
    • Not subject to flow control & congestion avoidance algorithms
    • Used UDPmon test program
  • Latency
  • Round trip times measured using Request-Response UDP frames
  • Latency as a function of frame size
    • Slope is given by:
      • Mem-mem copy(s) + pci + Gig Ethernet + pci + mem-mem copy(s)
    • Intercept indicates: processing times + HW latencies
  • Histograms of ‘singleton’ measurements
  • Tells us about:
    • Behavior of the IP stack
    • The way the HW operates
    • Interrupt coalescence
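
A small sketch of extracting the slope and intercept from latency-vs-frame-size data as described above; the frame sizes and latencies here are made-up numbers for illustration, and numpy is assumed to be available.

```python
import numpy as np

sizes = np.array([64, 512, 1024, 1472])            # frame size in bytes (illustrative)
latency_us = np.array([62.0, 71.0, 82.5, 92.6])    # measured latency in microseconds (illustrative)

slope, intercept = np.polyfit(sizes, latency_us, 1)
print(f"slope: {slope * 1000:.1f} ns/byte   (memory copies + PCI + Gig Ethernet wire time)")
print(f"intercept: {intercept:.1f} us        (protocol processing + HW latencies)")
```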

Throughput Measurements

[Diagram: UDP test protocol between sender and receiver – zero stats / OK done; send a given number of n-byte data frames at regular intervals (wait time between packets), recording time to send, time to receive and the inter-packet time histogram; signal end of test / OK done; get remote statistics: number received, number lost + loss pattern, number out-of-order, CPU load & number of interrupts, 1-way delay.]
  • UDP Throughput
  • Send a controlled stream of UDP frames spaced at regular intervals
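
A hedged sketch in the spirit of such a test (not the UDPmon code itself): send fixed-size UDP frames at a chosen spacing and report the offered rate; the receiver counts frames and inspects sequence numbers for loss and re-ordering.

```python
import socket, time

def udp_stream(host: str, port: int, n_frames: int = 1000,
               frame_bytes: int = 1472, wait_s: float = 0.0001) -> None:
    """Send n_frames datagrams of frame_bytes, spaced wait_s apart; print the offered rate."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytearray(frame_bytes)
    start = time.perf_counter()
    for seq in range(n_frames):
        payload[:4] = seq.to_bytes(4, "big")   # sequence number for loss / ordering checks
        sock.sendto(payload, (host, port))
        time.sleep(wait_s)                     # inter-packet wait time
    elapsed = time.perf_counter() - start
    print(f"offered rate: {n_frames * frame_bytes * 8 / elapsed / 1e6:.1f} Mbit/s")
```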

PCI Bus & Gigabit Ethernet Activity

[Diagram of possible bottlenecks: in each PC, the CPU, memory, chipset and NIC sit on the PCI bus; PCI probe cards and a Gigabit Ethernet fibre probe feed a logic analyser with display.]
  • PCI Activity
  • Logic Analyzer with
    • PCI Probe cards in sending PC
    • Gigabit Ethernet Fiber Probe Card
    • PCI Probe cards in receiving PC

Network switch limits behaviour
  • End-to-end UDP packets from udpmon
    • Only 700 Mbit/s throughput
    • Lots of packet loss
    • The packet loss distribution shows the throughput is limited

“Server Quality” Motherboards
  • SuperMicro P4DP8-2G (P4DP6)
  • Dual Xeon
  • 400/533 MHz front side bus
  • 6 PCI / PCI-X slots
  • 4 independent PCI buses
    • 64 bit 66 MHz PCI
    • 100 MHz PCI-X
    • 133 MHz PCI-X
  • Dual Gigabit Ethernet
  • Adaptec AIC-7899W dual channel SCSI
  • UDMA/100 bus master/EIDE channels
    • data transfer rates of 100 MB/sec burst

“Server Quality” Motherboards
  • Boston/Supermicro H8DAR
  • Two Dual Core Opterons
  • 200 MHz DDR Memory
    • Theory BW: 6.4Gbit
  • HyperTransport
  • 2 independent PCI buses
    • 133 MHz PCI-X
  • 2 Gigabit Ethernet
  • SATA
  • ( PCI-e )

10 Gigabit Ethernet: UDP Throughput
  • 1500 byte MTU gives ~ 2 Gbit/s
  • Used 16144 byte MTU max user length 16080
  • DataTAG Supermicro PCs
  • Dual 2.2 GHz Xeon CPU, FSB 400 MHz
  • PCI-X mmrbc 512 bytes
  • Wire rate throughput of 2.9 Gbit/s
  • CERN OpenLab HP Itanium PCs
  • Dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz
  • PCI-X mmrbc 4096 bytes
  • Wire rate of 5.7 Gbit/s
  • SLAC Dell PCs
  • Dual 3.0 GHz Xeon CPU, FSB 533 MHz
  • PCI-X mmrbc 4096 bytes
  • Wire rate of 5.4 Gbit/s

10 Gigabit Ethernet: Tuning PCI-X

  • 16080 byte packets every 200 µs
  • Intel PRO/10GbE LR Adapter
  • PCI-X bus occupancy vs mmrbc
    • Measured times
    • Times based on PCI-X times from the logic analyser
    • Expected throughput ~7 Gbit/s
    • Measured 5.7 Gbit/s
  • [Plots: PCI-X sequences (CSR access, data transfer, interrupt & CSR update) for mmrbc of 512, 1024, 2048 and 4096 bytes; 5.7 Gbit/s reached at mmrbc 4096 bytes]

Congestion control: ACK clocking

End Hosts & NICs: CERN-nat-Manc

  • Use UDP packets to characterise the host, NIC & network
    • SuperMicro P4DP8 motherboard
    • Dual Xeon 2.2 GHz CPU
    • 400 MHz system bus
    • 64 bit 66 MHz PCI / 133 MHz PCI-X bus
  • [Plots: throughput, packet loss, re-order, and request-response latency]
  • The network can sustain 1 Gbps of UDP traffic
  • The average server can lose smaller packets
  • Packet loss is caused by lack of power in the PC receiving the traffic
  • Out-of-order packets are due to WAN routers
  • Lightpaths look like extended LANs and have no re-ordering

tcpdump / tcptrace
  • tcpdump: dump all TCP header information for a specified source/destination
    • ftp://ftp.ee.lbl.gov/
  • tcptrace: format tcpdump output for analysis using xplot
    • http://www.tcptrace.org/
    • NLANR TCP Testrig : Nice wrapper for tcpdump and tcptrace tools
      • http://www.ncne.nlanr.net/TCP/testrig/
  • Sample use:

tcpdump -s 100 -w /tmp/tcpdump.out host hostname

tcptrace -Sl /tmp/tcpdump.out

xplot /tmp/a2b_tsg.xpl

tcptrace and xplot
  • The X axis is time
  • The Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in

Zoomed In View
  • Green Line: ACK values received from the receiver
  • Yellow Line tracks the receive window advertised from the receiver
  • Green Ticks track the duplicate ACKs received.
  • Yellow Ticks track the window advertisements that were the same as the last advertisement.
  • White Arrows represent segments sent.
  • Red Arrows (R) represent retransmitted segments

TCP Slow Start
