TCP/IP on High Bandwidth Long Distance Paths
or
So TCP works … but still the users ask: Where is my throughput?

Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” and look for Haystack

5th Annual e-VLBI Workshop, 17-20 September 2006, Haystack Observatory

Layers & IP

The Network Layer 3: IP
  • IP Layer properties:
    • Provides best-effort delivery
    • It is unreliable:
      • Packets may be lost
      • Duplicated
      • Delivered out of order
    • Connectionless
    • Provides logical addresses
    • Provides routing
    • Demultiplexes data on the protocol number

The Internet Datagram

[Figure: IPv4 header layout (20 bytes), bit positions 0, 4, 8, 16, 19, 24, 31 – Vers, Hlen, Type of service, Total length; Identification, Flags, Fragment offset; TTL, Protocol, Header checksum; Source IP address; Destination IP address; IP Options (if any), Padding. In the frame, the IP header sits between the frame header and the transport data, with the FCS at the end.]

IP Datagram Format (cont.)
  • Type of Service – TOS: now being used for QoS
  • Total length: length of the datagram in bytes, includes header and data
  • Time to live – TTL: specifies how long the datagram is allowed to remain in the Internet
    • Routers decrement by 1
    • When TTL = 0 router discards datagram
    • Prevents infinite loops
  • Protocol: specifies the format of the data area
    • Protocol numbers administered by central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
  • Source & destination IP address: (32 bits each) contain IP address of sender and intended recipient
  • Options: (variable length) Mainly used to record a route, or timestamps, or specify routing
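
To make the header layout concrete, here is a minimal sketch (not from the original slides) that unpacks the fixed 20-byte IPv4 header with Python's struct module, using the field names from the diagram above.

```python
import struct

def parse_ipv4_header(packet: bytes) -> dict:
    """Unpack the fixed 20-byte IPv4 header described above."""
    vers_hlen, tos, total_len, ident, flags_frag, ttl, proto, csum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "version": vers_hlen >> 4,                    # Vers (4 bits)
        "header_len_bytes": (vers_hlen & 0x0F) * 4,   # Hlen is in 32-bit words
        "tos": tos,
        "total_length": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                            # e.g. ICMP=1, TCP=6, UDP=17
        "header_checksum": csum,
        "source": ".".join(str(b) for b in src),
        "destination": ".".join(str(b) for b in dst),
    }
```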

The Transport Layer 4: UDP
  • UDP provides:
    • Connectionless service over IP
      • No setup / teardown
      • One packet at a time
    • Minimal overhead – high performance
    • Provides best effort delivery
    • It is unreliable:
      • Packet may be lost
      • Duplicated
      • Out of order
    • Application is responsible for
      • Data reliability
      • Flow control
      • Error handling
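
To illustrate how minimal this service is, a hedged sketch of a UDP sender and receiver in Python; as the bullets above note, any loss, duplication or re-ordering is left entirely to the application (port 5001 is just an example).

```python
import socket

def udp_receiver(port: int = 5001) -> None:
    """Read datagrams; nothing guarantees arrival, uniqueness or order."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, addr = sock.recvfrom(65535)          # one datagram per call
        print(f"got {len(data)} bytes from {addr}")

def udp_sender(host: str, port: int = 5001, n: int = 10) -> None:
    """No connection setup or teardown: just send packets one at a time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for i in range(n):
        sock.sendto(f"frame {i}".encode(), (host, port))
```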

UDP Datagram Format

[Figure: UDP header (8 bytes), bit positions 0, 8, 16, 24, 31 – Source port, Destination port, UDP message length, Checksum (optional). In the frame, the UDP header and application data follow the frame header and IP header, with the FCS at the end.]

  • Source/destination port: port numbers identify the sending & receiving processes
    • Port number & IP address allow any application on the Internet to be uniquely identified
    • Ports can be static or dynamic
      • Static (< 1024): assigned centrally, known as well-known ports
      • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65,535)

The Transport Layer 4: TCP
  • TCP (RFC 793, RFC 1122) provides:
    • Connection-orientated service over IP
      • During setup the two ends agree on details
      • Explicit teardown
      • Multiple connections allowed
    • Reliable end-to-end Byte Stream delivery over unreliable network
    • It takes care of:
      • Lost packets
      • Duplicated packets
      • Out of order packets
    • TCP provides
      • Data buffering
      • Flow control
      • Error detection & handling
      • Limits network congestion

The TCP Segment Format

[Figure: TCP header layout (20 bytes plus options), bit positions 0, 4, 8, 10, 16, 24, 31 – Source port, Destination port; Sequence number; Acknowledgement number; Hlen, Resv, Code, Window; Checksum, Urgent ptr; Options (if any), Padding. In the frame, the TCP header and application data follow the frame header and IP header, with the FCS at the end.]

TCP Segment Format – cont.
  • Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
  • Sequence number: first byte in this segment from the sender’s byte stream
  • Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
  • Code: used to determine the segment purpose, e.g. SYN, ACK, FIN, URG
  • Window: advertises how much data this station is willing to accept; can depend on the buffer space remaining
  • Options: used for window scaling, SACK, timestamps, maximum segment size etc.

TCP – Providing Reliability

[Figure: time-sequence diagram between sender and receiver – segment n (sequence 1024, length 1024) is answered by ACK 2048 one RTT later; segment n+1 (sequence 2048, length 1024) is answered by ACK 3072 one RTT later.]
  • Positive acknowledgement (ACK) of each received segment
    • Sender keeps record of each segment sent
    • Sender awaits an ACK – “I am ready to receive byte 2048 and beyond”
    • Sender starts timer when it sends segment – so can re-transmit
  • Inefficient – sender has to wait

Flow Control: Sender – Congestion Window

[Figure: the sender’s sliding window (TCP cwnd) over its byte stream – data sent and ACKed; sent data buffered waiting for an ACK; unsent data that may be transmitted immediately; data waiting for the window to open, up to where the application writes. A received ACK advances the trailing edge, the sending host advances a marker as data is transmitted, and the receiver’s advertised window advances the leading edge.]
  • Uses the congestion window, cwnd, a sliding window to control the data flow
    • Byte count giving the highest byte that can be sent without an ACK
    • Transmit buffer size and advertised receive buffer size are important
    • An ACK gives the next sequence number to receive AND the available space in the receive buffer
    • A timer is kept for each packet

Flow Control: Receiver – Lost Data

[Figure: the receiver’s window over the byte stream – data given to the application (the application reads here); data ACKed but not yet given to the user; a gap of lost data; data received but not ACKed. The last ACK given marks the next byte expected (expected sequence number); the receiver’s advertised window advances its leading edge as the window slides.]
  • If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent carrying the expected sequence number

How It Works: TCP Slowstart

[Figure: CWND vs time – slow start (exponential increase) until packet loss; congestion avoidance (linear increase); after a timeout, retransmit and slow start again.]
  • Probe the network - get a rough estimate of the optimal congestion window size
  • The larger the window size, the higher the throughput
    • Throughput = Window size / Round-trip Time
  • Exponentially increase the congestion window size until a packet is lost
    • cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
      • Send 1st packet, get 1 ACK: increase cwnd to 2
      • Send 2 packets, get 2 ACKs: increase cwnd to 4
      • Time to reach cwnd size W: T_W = RTT × log2(W) (not exactly slow!)
    • Rate doubles each RTT (see the sketch below)
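
A small sketch of the growth just described, assuming one ACK per segment, no delayed ACKs and no loss; it reproduces the doubling per RTT and the T_W = RTT × log2(W) estimate.

```python
import math

def time_to_reach_window(w_segments: int, rtt_s: float) -> float:
    """RTTs needed for cwnd to grow from 1 segment to W, doubling each RTT."""
    return rtt_s * math.log2(w_segments)

print(f"{time_to_reach_window(1000, 0.1):.2f} s")   # ~1 s for a 1000-segment window at 100 ms rtt

def slow_start(rtts: int) -> None:
    cwnd = 1
    for r in range(rtts):
        print(f"RTT {r}: cwnd = {cwnd} segments")
        cwnd *= 2    # each ACKed segment adds one segment, so cwnd doubles per RTT
```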

TCP Slowstart Animated

Toby Rodwell, DANTE

  • Growth of CWND related to RTT
  • (Most important in the Congestion Avoidance phase)

[Animation: CWND at the source grows 1 → 2 → 4 as ACKs return from the sink.]

How It Works: TCP Congestion Avoidance

[Figure: CWND vs time – slow start (exponential increase) until packet loss; congestion avoidance (linear increase); after a timeout, retransmit and slow start again.]

  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
    • cwnd increased by 1 segment per RTT
    • cwnd increased by 1/cwnd for each ACK – a linear increase in rate
  • TCP takes packet loss as an indication of congestion!
  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
    • Standard TCP halves cwnd (a factor of 0.5)
    • The slow start to congestion avoidance transition is determined by ssthresh (see the AIMD sketch below)
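
A minimal sketch of the per-ACK and per-loss updates just described (Reno-style AIMD, windows in segments, a = 1, b = ½); real stacks add many refinements.

```python
def on_ack(cwnd: float, ssthresh: float) -> float:
    """cwnd after one ACK: exponential in slow start, ~1 segment per RTT afterwards."""
    if cwnd < ssthresh:
        return cwnd + 1.0         # slow start: +1 segment per ACK
    return cwnd + 1.0 / cwnd      # congestion avoidance: +1/cwnd per ACK

def on_loss(cwnd: float) -> tuple[float, float]:
    """Multiplicative decrease: halve the window; ssthresh marks the phase boundary."""
    ssthresh = cwnd / 2
    return ssthresh, ssthresh     # (new cwnd, new ssthresh)
```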

TCP Fast Retransmit & Recovery
  • Duplicate ACKs are due to lost segments or segments out of order.
  • Fast Retransmit: If the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected)
    • Sender re-transmits the missing segment
      • Set ssthresh to 0.5*cwnd – so enter congestion avoidance phase
      • Set cwnd = (0.5*cwnd +3 ) – the 3 dup ACKs
      • Increase cwnd by 1 segment when get duplicate ACKs
      • Keep sending new data if allowed by cwnd
      • Set cwnd to half original value on new ACK
    • no need to go into “slow start” again
  • At the steady state, cwnd oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again
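
A hedged sketch of the bookkeeping the slide describes (fast retransmit / fast recovery in the NewReno style, windows in segments); the retransmission hook is hypothetical, and timeouts and SACK are omitted.

```python
DUP_ACK_THRESHOLD = 3

class FastRecovery:
    def __init__(self, cwnd: float) -> None:
        self.cwnd, self.ssthresh = cwnd, cwnd
        self.dup_acks, self.in_recovery = 0, False

    def on_dup_ack(self) -> None:
        self.dup_acks += 1
        if self.dup_acks == DUP_ACK_THRESHOLD and not self.in_recovery:
            self.ssthresh = 0.5 * self.cwnd   # enter congestion avoidance afterwards
            self.cwnd = self.ssthresh + 3     # inflate by the 3 dup ACKs
            self.in_recovery = True           # re-transmit the missing segment here
        elif self.in_recovery:
            self.cwnd += 1                    # keep sending new data if allowed by cwnd

    def on_new_ack(self) -> None:
        if self.in_recovery:
            self.cwnd = self.ssthresh         # deflate to half the original value
            self.in_recovery = False
        self.dup_acks = 0
```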

TCP: Simple Tuning – Filling the Pipe

[Figure: sender/receiver timeline – segment time on wire = bits in segment / BW; the ACK returns after one RTT.]
  • Remember, TCP has to hold a copy of data in flight
  • Optimal (TCP buffer) window size depends on:
    • Bandwidth end to end, i.e. min(BWlinks) AKA bottleneck bandwidth
    • Round Trip Time (RTT)
  • The number of bytes in flight to fill the entire path:
    • Bandwidth × Delay Product: BDP = RTT × BW
    • Sizing the buffer to the BDP can increase achievable throughput by orders of magnitude (worked example below)
  • Windows are also used for flow control
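
A worked sketch of the arithmetic above; the 1 Gbit/s rate and the rtt values are illustrative examples.

```python
def bdp_bytes(bandwidth_bit_s: float, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes in flight needed to fill the pipe."""
    return bandwidth_bit_s * rtt_s / 8

for name, rtt in [("UK, 6 ms", 0.006), ("Europe, 25 ms", 0.025), ("USA, 150 ms", 0.150)]:
    print(f"{name}: TCP buffer >= {bdp_bytes(1e9, rtt) / 1e6:.1f} MBytes at 1 Gbit/s")
# ~0.8, ~3.1 and ~18.8 MBytes -- far larger than typical default socket buffers
```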

Standard TCP (Reno) – What’s the problem?
  • TCP has 2 phases:
    • Slowstart: probe the network to estimate the available bandwidth. Exponential growth.
    • Congestion Avoidance: the main data transfer phase – the transfer rate grows “slowly”.
  • AIMD and High Bandwidth – Long Distance networks

Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.

    • For each ACK in an RTT without loss:
      cwnd → cwnd + a/cwnd   (Additive Increase, a = 1)
    • For each window experiencing loss:
      cwnd → cwnd − b·cwnd   (Multiplicative Decrease, b = ½)
  • Packet loss is a killer!!

TCP (Reno) – Details of problem #1
  • Time for TCP to recover its throughput from 1 lost 1500 byte packet grows rapidly with rtt; for an rtt of ~200 ms @ 1 Gbit/s it runs to tens of minutes
  • Example recovery times @ 1 Gbit/s:
    • UK, rtt 6 ms: 1.6 s
    • Europe, rtt 25 ms: 26 s
    • USA, rtt 150 ms: 28 min
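
A back-of-envelope sketch of where figures like these come from, assuming plain AIMD recovery (cwnd halves on loss and regains one MSS per RTT); it reproduces the 6 ms and 25 ms entries above.

```python
def recovery_time_s(bandwidth_bit_s: float, rtt_s: float, mss_bytes: int = 1500) -> float:
    """Time for AIMD to rebuild a halved window: (full window / 2) segments, one per RTT."""
    window_segments = bandwidth_bit_s * rtt_s / (8 * mss_bytes)
    return (window_segments / 2) * rtt_s

print(f"{recovery_time_s(1e9, 0.006):.1f} s")   # ~1.5 s  (UK, 6 ms)
print(f"{recovery_time_s(1e9, 0.025):.1f} s")   # ~26 s   (Europe, 25 ms)
```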

Investigation of new TCP Stacks
  • The AIMD Algorithm – Standard TCP (Reno)
    • For each ACK in an RTT without loss:
      cwnd → cwnd + a/cwnd   (Additive Increase, a = 1)
    • For each window experiencing loss:
      cwnd → cwnd − b·cwnd   (Multiplicative Decrease, b = ½)
  • High Speed TCP
    a and b vary depending on the current cwnd, using a table
    • a increases more rapidly with larger cwnd – returns to the ‘optimal’ cwnd size sooner for the network path
    • b decreases less aggressively and, as a consequence, so does the cwnd; the effect is that there is not such a decrease in throughput
  • Scalable TCP
    a and b are fixed adjustments for the increase and decrease of cwnd
    • a = 1/100 – the increase is greater than for TCP Reno
    • b = 1/8 – the decrease on loss is less than for TCP Reno
    • Scalable over any link speed (Reno vs Scalable sketched below)
  • Fast TCP
    Uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
  • HSTCP-LP, H-TCP, BiC-TCP
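
A minimal sketch contrasting the Reno and Scalable TCP rules listed above (windows in segments; the Scalable constants a = 1/100 and b = 1/8 are those quoted on the slide).

```python
def reno_update(cwnd: float, loss: bool) -> float:
    return cwnd * 0.5 if loss else cwnd + 1.0 / cwnd     # a = 1 per RTT, b = 1/2 on loss

def scalable_update(cwnd: float, loss: bool) -> float:
    return cwnd * (1 - 0.125) if loss else cwnd + 0.01   # a = 1/100 per ACK, b = 1/8 on loss
```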

Let’s check out this theory about new TCP stacks

  • Does it matter?
  • Does it work?

Problem #1: Packet Loss – Is it important?

Packet Loss with New TCP Stacks
  • TCP Response Function
    • Throughput vs loss rate – the further to the right, the faster the recovery
    • Packets are dropped in the kernel
  • [Plots: MB-NG, rtt 6 ms; DataTAG, rtt 120 ms]
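
For reference (not on the slide itself), the standard Reno response function that such throughput-vs-loss plots are usually compared against is the Mathis et al. estimate, with MSS the segment size and p the packet loss probability:

$$\text{Throughput} \;\approx\; \frac{MSS}{RTT}\cdot\frac{1.22}{\sqrt{p}}$$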

Packet Loss and new TCP Stacks
  • TCP Response Function
    • UKLight London-Chicago-London rtt 177 ms
    • 2.6.6 Kernel
    • Agreement with theory is good
    • Some new stacks are good at high loss rates

High Throughput Demonstrations

[Diagram: London (Chicago) to Manchester, rtt 6.2 ms (Geneva, rtt 128 ms) – dual Xeon 2.2 GHz hosts lon01 and man03 on 1 GEth, through Cisco 7609 and Cisco GSR routers over the 2.5 Gbit SDH MB-NG core. Send data with TCP, drop packets, monitor TCP with Web100.]

High Performance TCP – MB-NG
  • Drop 1 in 25,000
  • rtt 6.2 ms
  • Recover in 1.6 s
  • [Web100 plots for Standard, HighSpeed and Scalable TCP]

High Performance TCP – DataTAG
  • Different TCP stacks tested on the DataTAG Network
  • rtt 128 ms
  • Drop 1 in 10^6
  • High-Speed
    • Rapid recovery
  • Scalable
    • Very fast recovery
  • Standard
    • Recovery would take ~ 20 mins

FAST Demo via OMNInet and DataTAG

[Diagram: FAST demo topology – workstations and a FAST display at NU-E (Leverone) connect with 2 x GE to a Nortel Passport 8600; 10GE links and photonic switches carry the OMNInet layer 2 path to StarLight Chicago; a second Nortel Passport 8600 joins the DataTAG layer 2/3 path over OC-48, with Alcatel 1670 equipment and Cisco 7609 routers at CERN and CalTech, to workstations at CERN Geneva about 7,000 km away; San Diego is also shown. FAST demo: Cheng Jin, David Wei (Caltech); A. Adriaanse, C. Jin, D. Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); S. Ravot (Caltech/CERN).]

FAST TCP vs newReno

[Plots: traffic flow on Channel #1 (newReno), utilization 70%; traffic flow on Channel #2 (FAST), utilization 90%.]

Problem #2: Is TCP fair?

Look at Round Trip Times & Maximum Transfer Unit

MTU and Fairness

[Diagram: hosts #1 and #2 at CERN (GVA), each on 1 GE into a GbE switch; a POS 2.5 Gbps bottleneck between routers R to Starlight (Chi); receiving hosts #1 and #2, each on 1 GE.]
  • Two TCP streams share a 1 Gb/s bottleneck
  • RTT = 117 ms
  • MTU = 3000 bytes: average throughput over a period of 7000 s = 243 Mb/s
  • MTU = 9000 bytes: average throughput over a period of 7000 s = 464 Mb/s
  • Link utilization: 70.7%

Sylvain Ravot DataTag 2003

RTT and Fairness

[Diagram: hosts at CERN (GVA) on 1 GE into a GbE switch; POS 2.5 Gb/s, POS 10 Gb/s and 10GE links between routers R via Starlight (Chi) to Sunnyvale; the two streams, CERN–Starlight and CERN–Sunnyvale, share the 1 GE bottleneck towards the receiving hosts.]
  • Two TCP streams share a 1 Gb/s bottleneck
  • CERN <-> Sunnyvale: RTT = 181 ms; average throughput over a period of 7000 s = 202 Mb/s
  • CERN <-> Starlight: RTT = 117 ms; average throughput over a period of 7000 s = 514 Mb/s
  • MTU = 9000 bytes
  • Link utilization = 71.6%

Sylvain Ravot DataTag 2003

Problem #n: Do TCP Flows Share the Bandwidth?

Test of TCP Sharing: Methodology (1 Gbit/s)

[Diagram: iperf TCP/UDP traffic from SLAC through a bottleneck towards Caltech/UFL/CERN (iperf or UDT at the far end), with 1/s ICMP/ping traffic alongside; traffic is scheduled in 4-minute and 2-minute regions.]
  • Chose 3 paths from SLAC (California)
    • Caltech (10ms), Univ Florida (80ms), CERN (180ms)
  • Used iperf/TCP and UDT/UDP to generate traffic
  • Each run was 16 minutes, in 7 regions

Les Cottrell PFLDnet 2005

TCP Reno Single Stream

[Plot annotations: congestion has a dramatic effect and recovery is slow; RTT increases when the flow achieves its best throughput; remaining flows do not take up the slack when a flow is removed; increase recovery rate. Les Cottrell, PFLDnet 2005.]
  • Low performance on fast long distance paths
    • AIMD (add a=1 pkt to cwnd / RTT, decrease cwnd by factor b=0.5 in congestion)
    • Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput
    • Unequal sharing

SLAC to CERN

Fast

  • As well as packet loss, FAST uses RTT to detect congestion
    • RTT is very stable: σ(RTT) ~ 9 ms vs 37±0.14 ms for the others
  • [Plots, SLAC-CERN: the 2nd flow never gets an equal share of the bandwidth; big drops in throughput take several seconds to recover from.]

Hamilton TCP

[Plots, SLAC-CERN: two flows share equally; appears to need >1 flow to achieve the best throughput; >2 flows appears less stable.]
  • One of the best performers
    • Throughput is high
    • Big effects on RTT when achieves best throughput
    • Flows share equally

Problem #n+1: To SACK or not to SACK?

The SACK Algorithm

[Plots: HS-TCP with standard SACKs vs updated SACKs, rtt 150 ms; test host a Dell 1650, 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000 NIC. Doug Leith, Yee-Ting Li.]
  • SACK rationale
    • Non-contiguous blocks of data can be ACKed
    • The sender retransmits just the lost packets
    • Helps when multiple packets are lost in one TCP window
  • The SACK processing is inefficient for large bandwidth-delay products
    • The sender write queue (a linked list) is walked for:
      • Each SACK block
      • Marking lost packets
      • Re-transmission
    • Processing takes so long that the input queue becomes full
    • Timeouts result

SACK …
  • Look into what’s happening at the algorithmic level with web100:
  • Strange hiccups in cwnd – the only correlation is with SACK arrivals

Scalable TCP on MB-NG with 200 Mbit/s CBR background

Yee-Ting Li

Real Applications on Real Networks
  • Disk-2-disk applications on real networks
    • Memory-2-memory tests
    • Transatlantic disk-2-disk at Gigabit speeds
    • HEP&VLBI at SC|05
  • Remote Computing Farms
    • The effect of TCP
    • The effect of distance
  • Radio Astronomy e-VLBI
    • Leave for the talk later in the meeting

iperf Throughput + Web100
  • SuperMicro on MB-NG network
    • HighSpeed TCP
    • Line speed 940 Mbit/s
    • DupACKs? < 10 (expect ~400)
  • BaBar on production network
    • Standard TCP
    • 425 Mbit/s
    • DupACKs 350-400 – re-transmits

Applications: Throughput Mbit/s
  • HighSpeed TCP
  • 2 GByte file, RAID5
  • SuperMicro + SuperJANET
  • bbcp
  • bbftp
  • Apache
  • GridFTP
  • Previous work used RAID0 (not disk limited)

Transatlantic Disk to Disk Transfers

With UKLight

SuperComputing 2004

bbftp: What Else Is Going On?

Scalable TCP

  • SuperMicro + SuperJANET
    • Instantaneous 0 - 550 Mbit/s
  • BaBar + SuperJANET
    • Instantaneous 200 - 600 Mbit/s
    • Disk-mem ~590 Mbit/s – remember the end host
  • Congestion window – duplicate ACK
  • Throughput variation not TCP related?
    • Disk speed / bus transfer
    • Application architecture

SC2004 UKLIGHT Overview

[Diagram: SC2004 SLAC and Caltech booths (Cisco 6509, Caltech 7600, UltraLight IP) connect over the NLR lambda NLR-PITT-STAR-10GE-16 to Chicago Starlight; UKLight 10G and SURFnet/EuroLink 10G circuits run via Amsterdam to ULCC UKLight; four 1 GE channels reach the MB-NG 7600 OSR in Manchester and two 1 GE channels reach the UCL network / UCL HEP; MB-NG managed bandwidth.]

Transatlantic Ethernet: TCP Throughput Tests
  • Supermicro X5DPE-G2 PCs
  • Dual 2.9 GHz Xeon CPU, FSB 533 MHz
  • 1500 byte MTU
  • 2.6.6 Linux Kernel
  • Memory-memory TCP throughput
  • Standard TCP
  • Wire rate throughput of 940 Mbit/s
  • First 10 sec
  • Work in progress to study:
    • Implementation detail
    • Advanced stacks
    • Effect of packet loss
    • Sharing

SC2004 Disk-Disk bbftp
  • bbftp file transfer program uses TCP/IP
  • UKLight: Path:- London-Chicago-London; PCs:- Supermicro +3Ware RAID0
  • MTU 1500 bytes; Socket size 22 Mbytes; rtt 177ms; SACK off
  • Move a 2 GByte file
  • Web100 plots:
  • Standard TCP
  • Average 825 Mbit/s
  • (bbcp: 670 Mbit/s)
  • Scalable TCP
  • Average 875 Mbit/s
  • (bbcp: 701 Mbit/s, ~4.5 s of overhead)
  • Disk-TCP-Disk at 1Gbit/s

Network & Disk Interactions (work in progress)

[Plots of disk write rate and % CPU in kernel mode:
  • Disk write alone: 1735 Mbit/s
  • Disk write + 1500 MTU UDP: 1218 Mbit/s – a drop of 30%
  • Disk write + 9000 MTU UDP: 1400 Mbit/s – a drop of 19%]
  • Hosts:
    • Supermicro X5DPE-G2 motherboards
    • dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
    • 3Ware 8506-8 controller on 133 MHz PCI-X bus configured as RAID0
    • six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
  • Measure memory to RAID0 transfer rates with & without UDP traffic

Transatlantic Transfers

With UKLight

SuperComputing 2005

ESLEA and UKLight
  • 6 × 1 Gbit transatlantic Ethernet layer 2 paths, UKLight + NLR
  • Disk-to-disk transfers with bbcp
    • Seattle to UK
    • Set the TCP buffer and application to give ~850 Mbit/s
    • One stream of data, 840-620 Mbit/s
  • Stream UDP VLBI data
    • UK to Seattle
    • 620 Mbit/s
  • [Plot: reverse TCP traffic]

SC|05 – SLAC 10 Gigabit Ethernet
  • 2 Lightpaths:
    • Routed over ESnet
    • Layer 2 over Ultra Science Net
  • 6 Sun V20Z systems per λ
    • 3 Transmit 3 Receive
  • dCache remote disk data access
    • 100 processes per node
    • Node sends or receives
    • One data stream 20-30 Mbit/s
  • Used Neterion NICs & Chelsio TOE
  • Data also sent to StorCloud using fibre channel links
  • Traffic on the 10 GE link for 2 nodes: 3-4 Gbit per node, 8.5-9 Gbit on the trunk

Remote Computing Farms in the ATLAS TDAQ Experiment

ATLAS Remote Farms – Network Connectivity

ATLAS Application Protocol

[Diagram: message sequence between the Event Filter Daemon (EFD) and the SFI/SFO – request event, send event data, process event, request buffer, send OK, send processed event; the request-response time is histogrammed.]
  • Event Request
    • EFD requests an event from SFI
    • SFI replies with the event ~2Mbytes
  • Processing of event
  • Return of computation
    • EF asks SFO for buffer space
    • SFO sends OK
    • EF transfers results of the computation
  • tcpmon - instrumented TCP request-response program emulates the Event Filter EFD to SFI communication.

tcpmon: TCP Activity Manc-CERN Req-Resp
  • Round trip time 20 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP in slow start
  • 1st event takes 19 rtt or ~380 ms
  • The TCP congestion window gets re-set on each request
    • TCP stack RFC 2581 & RFC 2861: reduction of cwnd after inactivity
    • Even after 10 s, each response takes 13 rtt or ~260 ms
  • Transfer achievable throughput 120 Mbit/s
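
One common knob associated with this cwnd collapse after idle on Linux senders is net.ipv4.tcp_slow_start_after_idle; a hedged sketch of inspecting (and, with root, disabling) it follows. This is offered as an illustration, not as the exact tuning used in the talk.

```python
from pathlib import Path

knob = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")
print("tcp_slow_start_after_idle =", knob.read_text().strip())   # 1 = collapse cwnd after idle
# knob.write_text("0")   # uncomment (as root) to keep cwnd across idle periods on this host
```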

tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
  • Round trip time 20 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP starts in slow start
  • 1st event takes 19 rtt or ~380 ms
  • The TCP congestion window grows nicely
  • A response takes 2 rtt after ~1.5 s
  • Rate ~10/s (with 50 ms wait)
  • Transfer achievable throughput grows to 800 Mbit/s
  • Data is transferred WHEN the application requires the data

tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
  • Round trip time 150 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP starts in slow start
  • 1st event takes 11 rtt or ~1.67 s
  • The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
  • A response arrives in 2 rtt after ~2.5 s
  • Rate 2.2/s (with 50 ms wait)
  • Transfer achievable throughput grows slowly from 250 to 800 Mbit/s

Summary & Conclusions
  • Standard TCP is not optimum for high throughput, long distance links
  • Packet loss is a killer for TCP
    • Check on campus links & equipment, and access links to backbones
    • Users need to collaborate with the campus network teams
    • Dante PERT
  • New stacks are stable and give better response & performance
    • Still need to set the TCP buffer sizes! (see the example below)
    • Check other kernel settings, e.g. window-scale maximum
    • Watch for “TCP stack implementation enhancements”
  • TCP tries to be fair
    • Large MTU has an advantage
    • Short distances (small RTT) have an advantage
  • TCP does not share bandwidth well with other streams
  • The end hosts themselves
    • Plenty of CPU power is required for the TCP/IP stack as well as the application
    • Packets can be lost in the IP stack due to lack of processing power
    • The interaction between HW, protocol processing, and the disk sub-system is complex
  • Application architecture & implementation are also important
    • The TCP protocol dynamics strongly influence the behaviour of the application
  • Users are now able to perform sustained 1 Gbit/s transfers
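
A hedged sketch of setting per-socket TCP buffers from an application, sized to the bandwidth-delay product discussed earlier; the Linux limits net.core.rmem_max / net.core.wmem_max must also be raised or the request is capped (the 1 Gbit/s, 120 ms figures are illustrative).

```python
import socket

def make_tuned_socket(bandwidth_bit_s: float, rtt_s: float) -> socket.socket:
    """TCP socket with send/receive buffers sized to the path's bandwidth*delay product."""
    bdp = int(bandwidth_bit_s * rtt_s / 8)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp)
    return s

sock = make_tuned_socket(1e9, 0.120)    # ~15 MByte buffers for 1 Gbit/s at 120 ms rtt
```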

More Information Some URLs 1
  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site:http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC tests:
    http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
    “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards” FGCS Special issue 2004
    http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons:“Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks” Journal of Grid Computing 2004
  • PFLDnet http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT http://www.geant2.net/server/show/nav.00d00h002

More Information Some URLs 2
  • Lectures, tutorials etc. on TCP/IP:
    • www.nv.cc.va.us/home/joney/tcp_ip.htm
    • www.cs.pdx.edu/~jrb/tcpip.lectures.html
    • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
    • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
    • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
    • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia
    • http://www.freesoft.org/CIE/index.htm
  • TCP/IP Resources
    • www.private.org.il/tcpip_rl.html
  • Understanding IP addresses
    • http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122)
    • ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc (RFC 1010)
    • http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Any Questions?

Backup Slides

Latency Measurements
  • UDP/IP packets sent between back-to-back systems
    • Processed in a similar manner to TCP/IP
    • Not subject to flow control & congestion avoidance algorithms
    • Used UDPmon test program
  • Latency
  • Round trip times measured using Request-Response UDP frames
  • Latency as a function of frame size
    • Slope is given by:
      • Mem-mem copy(s) + pci + Gig Ethernet + pci + mem-mem copy(s)
    • Intercept indicates: processing times + HW latencies
  • Histograms of ‘singleton’ measurements
  • Tells us about:
    • Behavior of the IP stack
    • The way the HW operates
    • Interrupt coalescence
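
A small sketch of extracting the slope and intercept from latency-vs-frame-size data as described above; the frame sizes and latencies here are made-up numbers for illustration, and numpy is assumed to be available.

```python
import numpy as np

sizes = np.array([64, 512, 1024, 1472])            # frame size in bytes (illustrative)
latency_us = np.array([62.0, 71.0, 82.5, 92.6])    # measured latency in microseconds (illustrative)

slope, intercept = np.polyfit(sizes, latency_us, 1)
print(f"slope: {slope * 1000:.1f} ns/byte   (memory copies + PCI + Gig Ethernet wire time)")
print(f"intercept: {intercept:.1f} us        (protocol processing + HW latencies)")
```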

Throughput Measurements

[Diagram: UDP test protocol between sender and receiver – zero stats / OK done; send a given number of n-byte data frames at regular intervals (wait time between packets), recording time to send, time to receive and the inter-packet time histogram; signal end of test / OK done; get remote statistics: number received, number lost + loss pattern, number out-of-order, CPU load & number of interrupts, 1-way delay.]
  • UDP Throughput
  • Send a controlled stream of UDP frames spaced at regular intervals
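
A hedged sketch in the spirit of such a test (not the UDPmon code itself): send fixed-size UDP frames at a chosen spacing and report the offered rate; the receiver counts frames and inspects sequence numbers for loss and re-ordering.

```python
import socket, time

def udp_stream(host: str, port: int, n_frames: int = 1000,
               frame_bytes: int = 1472, wait_s: float = 0.0001) -> None:
    """Send n_frames datagrams of frame_bytes, spaced wait_s apart; print the offered rate."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytearray(frame_bytes)
    start = time.perf_counter()
    for seq in range(n_frames):
        payload[:4] = seq.to_bytes(4, "big")   # sequence number for loss / ordering checks
        sock.sendto(payload, (host, port))
        time.sleep(wait_s)                     # inter-packet wait time
    elapsed = time.perf_counter() - start
    print(f"offered rate: {n_frames * frame_bytes * 8 / elapsed / 1e6:.1f} Mbit/s")
```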

PCI Bus & Gigabit Ethernet Activity

[Diagram of possible bottlenecks: in each PC, the CPU, memory, chipset and NIC sit on the PCI bus; PCI probe cards and a Gigabit Ethernet fibre probe feed a logic analyser with display.]
  • PCI Activity
  • Logic Analyzer with
    • PCI Probe cards in sending PC
    • Gigabit Ethernet Fiber Probe Card
    • PCI Probe cards in receiving PC

Network switch limits behaviour
  • End-to-end UDP packets from udpmon
    • Only 700 Mbit/s throughput
    • Lots of packet loss
    • The packet loss distribution shows the throughput is limited

“Server Quality” Motherboards
  • SuperMicro P4DP8-2G (P4DP6)
  • Dual Xeon
  • 400/533 MHz front side bus
  • 6 PCI / PCI-X slots
  • 4 independent PCI buses
    • 64 bit 66 MHz PCI
    • 100 MHz PCI-X
    • 133 MHz PCI-X
  • Dual Gigabit Ethernet
  • Adaptec AIC-7899W dual channel SCSI
  • UDMA/100 bus master/EIDE channels
    • data transfer rates of 100 MB/sec burst

“Server Quality” Motherboards
  • Boston/Supermicro H8DAR
  • Two Dual Core Opterons
  • 200 MHz DDR Memory
    • Theory BW: 6.4Gbit
  • HyperTransport
  • 2 independent PCI buses
    • 133 MHz PCI-X
  • 2 Gigabit Ethernet
  • SATA
  • ( PCI-e )

10 Gigabit Ethernet: UDP Throughput
  • 1500 byte MTU gives ~ 2 Gbit/s
  • Used 16144 byte MTU max user length 16080
  • DataTAG Supermicro PCs
  • Dual 2.2 GHz Xeon CPU, FSB 400 MHz
  • PCI-X mmrbc 512 bytes
  • Wire rate throughput of 2.9 Gbit/s
  • CERN OpenLab HP Itanium PCs
  • Dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz
  • PCI-X mmrbc 4096 bytes
  • Wire rate of 5.7 Gbit/s
  • SLAC Dell PCs
  • Dual 3.0 GHz Xeon CPU, FSB 533 MHz
  • PCI-X mmrbc 4096 bytes
  • Wire rate of 5.4 Gbit/s

10 Gigabit Ethernet: Tuning PCI-X

  • 16080 byte packets every 200 µs
  • Intel PRO/10GbE LR Adapter
  • PCI-X bus occupancy vs mmrbc
    • Measured times
    • Times based on PCI-X times from the logic analyser
    • Expected throughput ~7 Gbit/s
    • Measured 5.7 Gbit/s
  • [Plots: PCI-X sequences (CSR access, data transfer, interrupt & CSR update) for mmrbc of 512, 1024, 2048 and 4096 bytes; 5.7 Gbit/s reached at mmrbc 4096 bytes]

Congestion control: ACK clocking

End Hosts & NICs: CERN-nat-Manc

  • Use UDP packets to characterise the host, NIC & network
    • SuperMicro P4DP8 motherboard
    • Dual Xeon 2.2 GHz CPU
    • 400 MHz system bus
    • 64 bit 66 MHz PCI / 133 MHz PCI-X bus
  • [Plots: throughput, packet loss, re-order, and request-response latency]
  • The network can sustain 1 Gbps of UDP traffic
  • The average server can lose smaller packets
  • Packet loss is caused by lack of power in the PC receiving the traffic
  • Out-of-order packets are due to WAN routers
  • Lightpaths look like extended LANs and have no re-ordering

tcpdump / tcptrace
  • tcpdump: dump all TCP header information for a specified source/destination
    • ftp://ftp.ee.lbl.gov/
  • tcptrace: format tcpdump output for analysis using xplot
    • http://www.tcptrace.org/
    • NLANR TCP Testrig : Nice wrapper for tcpdump and tcptrace tools
      • http://www.ncne.nlanr.net/TCP/testrig/
  • Sample use:

tcpdump -s 100 -w /tmp/tcpdump.out host hostname

tcptrace -Sl /tmp/tcpdump.out

xplot /tmp/a2b_tsg.xpl

tcptrace and xplot
  • The X axis is time
  • The Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in

Zoomed In View
  • Green Line: ACK values received from the receiver
  • Yellow Line tracks the receive window advertised from the receiver
  • Green Ticks track the duplicate ACKs received.
  • Yellow Ticks track the window advertisements that were the same as the last advertisement.
  • White Arrows represent segments sent.
  • Red Arrows (R) represent retransmitted segments

TCP Slow Start
