TCP End-To-End Congestion Control Wanida Putthividhya Dept. of Computer Science Iowa State University Jan, 27th 2002 (May, 25th 2001)
Contents : - TCP Congestion Control Concepts - TCP Flavors
TCP Congestion Control
- Avoid ‘congestion collapse’: “The severe drop in network throughput caused by congestion”
- Obey a ‘packet conservation’ principle: “In equilibrium, a new packet is not put into the network until an old packet leaves”
- A collection of collaborating mechanisms: Slow-Start, Accurate Retransmission Timeout Estimation, Congestion Avoidance, Fast Retransmit, Fast Recovery, Selective Acknowledgement
TCP Basics
- Congestion Window (cwnd): “A TCP state variable that limits the amount of data a TCP can send”; “The window at the sender site controlled by congestion control and avoidance algorithms”
- Advertised Window (Receiver Window): “The available buffer size at the receiver site”
- Sender’s maximum window (maxwin): “min(cwnd, advertised window)”
- Sender’s usable window: “maxwin - unacknowledged segments”
- TCP maintains a Retransmission Timer for each packet, say x, that has been sent and not yet acknowledged. If the ACK for packet x does not reach the sender before its timer expires, packet x is assumed to be lost and the sender retransmits it.
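The two window definitions above can be sketched in a few lines of Python. The function name `usable_window` and the clamp at zero are assumptions made for illustration; the slide only gives the two expressions.

```python
def usable_window(cwnd: int, advertised_window: int, unacked: int) -> int:
    """Number of segments the sender may still inject.

    maxwin = min(cwnd, advertised window); usable = maxwin - unacked.
    Clamping at 0 is an assumption: the slide states only the difference.
    """
    maxwin = min(cwnd, advertised_window)  # sender's maximum window
    return max(maxwin - unacked, 0)        # sender's usable window
```

For example, with cwnd = 8, an advertised window of 16 and 5 unacknowledged segments, the sender may send 3 more segments.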
Self-Clocking
- The ‘packet conservation’ property can be expressed as: “The sender uses ACKs as a ‘clock’ to strobe new packets into the network”
- “The sender is able to inject a new data packet into the network only if it receives an ACK from the receiver.” So, the protocol is self-clocking!
- However, how is the clock started? The problem is: “An ACK is generated when the receiver receives a data packet correctly” and “To make the system robust, a data packet is injected into the network only when an ACK triggers the sender to do so”
- Answer: “A new algorithm called ‘Slow-Start’ has been introduced to gradually increase the amount of data in transit”
[Figure: sender and receiver connected through a bottleneck link, with the packet spacings Pb, Pr and the ACK spacings Ar, Ab, As marked]
- Pb: the minimum packet spacing (the inter-packet interval) on the bottleneck link
- Pr: the receiver’s network packet spacing [Pb = Pr]
- Ar: the spacing between ACKs on the receiver’s network [if the processing time is the same for all packets, Pb = Pr = Ar]
- Ab: the ACK spacing on the bottleneck link
- As: the ACK spacing on the sender’s network [As = Pb]
Getting to Equilibrium: Slow-Start Algorithm
- When starting, or when restarting after a loss, initialize ‘cwnd’ to 1: cwnd = 1
- Every time the sender sends data packets, it sends: min(cwnd, advertised window) - # unacked packets
- Upon receiving an ACK for new data, increase the congestion window by one: cwnd = cwnd + 1
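A minimal sketch of why the per-ACK rule cwnd = cwnd + 1 doubles the window every round trip. It assumes an idealized path (no loss, one ACK per packet, a receiver window that never limits the sender); the function name `slow_start_rounds` is hypothetical.

```python
from typing import List


def slow_start_rounds(rounds: int) -> List[int]:
    """Return cwnd at the start of each round trip under slow-start.

    Assumes no loss and exactly one ACK per packet sent in the round.
    """
    cwnd = 1
    history = [cwnd]
    for _ in range(rounds):
        acks = cwnd            # every packet of this round is ACKed
        for _ in range(acks):
            cwnd += 1          # slow-start: one more segment per ACK
        history.append(cwnd)
    return history
```

Calling `slow_start_rounds(3)` yields the window sizes 1, 2, 4, 8.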
[Figure: slow-start timeline, one row per round trip (0R to 3R), horizontal axis in packet times. 0R: packet 1; 1R: packets 2-3; 2R: packets 4-7; 3R: packets 8-15]
- However, slow-start is not that slow at opening the sender’s congestion window: “Let W be the window size (in packets) and RTT the round-trip time. It takes time $RTT \cdot \log_2 W$ to open the congestion window from 1 to W.”
- Therefore, the window opens fast enough to have a negligible effect on performance
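A short worked version of the bound, assuming the window doubles once per round trip; the numbers W = 64 and RTT = 100 ms are illustrative values, not taken from the slides.

```latex
\text{window after } k \text{ round trips} = 2^{k}
\quad\Longrightarrow\quad
2^{k} \ge W \iff k \ge \log_2 W
\\[4pt]
W = 64,\ \mathrm{RTT} = 100\,\mathrm{ms}
\;\Longrightarrow\;
\mathrm{RTT} \cdot \log_2 W = 100\,\mathrm{ms} \times 6 = 600\,\mathrm{ms}
```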
Conservation at equilibrium: round-trip timing
- Once data is flowing reliably, a sender that injects a new packet before an old packet has exited must be suffering a failure of its retransmission timer
- TCP sets the retransmission timer for each packet in terms of the RTT (wait at least one RTT before retransmitting!)
- RTT estimate too short => unnecessary retransmissions; RTT estimate too long => low throughput
- What model should be used to estimate the RTT? “The estimated RTT must adapt to the condition of the network, but not too fast and not too slow”
- Initial RTO estimator:
$\text{New RTT} = \alpha \cdot \text{old RTT} + (1 - \alpha) \cdot M$
where M: a round-trip time measurement from the most recently acked data packet (Round Trip Sample); $\alpha$: a filter gain constant with a suggested value of 0.9
$\text{RTO} = \beta \cdot \text{New RTT}$
where $\beta$: accounts for RTT variation, with a suggested value of 2
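A minimal sketch of the initial RTO estimator, using the suggested gain of 0.9 and variation factor of 2. The function name `update_rto` and the parameter names `alpha` and `beta` are labels chosen here for the two constants.

```python
from typing import Tuple


def update_rto(old_rtt: float, sample: float,
               alpha: float = 0.9, beta: float = 2.0) -> Tuple[float, float]:
    """Return (new smoothed RTT, RTO) after one round-trip sample M.

    new_rtt = alpha * old_rtt + (1 - alpha) * sample
    rto     = beta * new_rtt
    """
    new_rtt = alpha * old_rtt + (1.0 - alpha) * sample
    return new_rtt, beta * new_rtt
```

With an old estimate of 100 ms and a sample of 200 ms, the estimate moves only to about 110 ms, which shows how slowly a gain of 0.9 reacts to a sudden change.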
Acknowledgement Ambiguity phenomenon
- How can Round Trip Samples be measured accurately? A complication arises because a TCP acknowledgement refers to the data received, not to the specific datagram instance that carried the data
[Figure: two A-to-B timelines, each showing an original transmission, a retransmission and an ACK; in one the sample RTT is measured from the original transmission, in the other from the retransmission]
- Karn’s RTO estimator
Accounts for the Acknowledgement Ambiguity phenomenon
Combines the initial RTO estimator with a timer back-off strategy
As usual, to compute an initial timeout value, use the formulas:
$\text{New RTT} = \alpha \cdot \text{old RTT} + (1 - \alpha) \cdot M$
$\text{RTO} = \beta \cdot \text{New RTT}$
If the timer expires and causes a retransmission, TCP does not take an RTT sample for that segment; instead it keeps backing off the timeout on each retransmission by the formula
$\text{New RTO} = \gamma \cdot \text{old RTO}$
until it can successfully transfer a segment. The suggested value for $\gamma$ is 2
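A sketch of Karn’s rule as described above: samples from retransmitted segments are discarded, and each timeout multiplies the RTO by a back-off factor. The class name `KarnTimer`, its method names, and the parameter names `alpha`, `beta`, `gamma` are hypothetical labels; the values 0.9, 2 and 2 are the suggested ones.

```python
class KarnTimer:
    """Initial RTO estimator plus Karn's back-off (illustrative only)."""

    def __init__(self, rtt: float, alpha: float = 0.9,
                 beta: float = 2.0, gamma: float = 2.0) -> None:
        self.rtt = rtt
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.rto = beta * rtt

    def on_timeout(self) -> None:
        # Back off: New RTO = gamma * old RTO
        self.rto *= self.gamma

    def on_ack(self, sample: float, retransmitted: bool) -> None:
        if retransmitted:
            # Ambiguous ACK: take no sample, keep the backed-off RTO
            return
        self.rtt = self.alpha * self.rtt + (1.0 - self.alpha) * sample
        self.rto = self.beta * self.rtt
```

The backed-off RTO stays in force until an ACK arrives for a segment that was sent only once, at which point the timeout is recomputed from the fresh sample.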
- Jacobson’s RTO estimator
Key Observations:
At high load, there is a wide range of variation in delay
Queuing theory suggests that, with the formula $\text{RTO} = \beta \cdot \text{New RTT}$ and $\beta$ limited to the suggested value of 2, the RTO estimate can adapt to loads of at most 30%
Solution: Estimate both the average round-trip time and its variance, and use the estimated variance in place of the constant $\beta$
$\text{DIFF} = \text{SAMPLE} - \text{old RTT}$
$\text{Smoothed RTT} = \text{old RTT} + \delta \cdot \text{DIFF}$
$\text{DEV} = \text{old DEV} + \rho \cdot (|\text{DIFF}| - \text{old DEV})$
$\text{Timeout} = \text{Smoothed RTT} + \eta \cdot \text{DEV}$
where
DEV: the estimated mean deviation
$\delta$: a fraction between 0 and 1 that controls how quickly the new sample affects the weighted average (Smoothed RTT)
$\rho$: a fraction between 0 and 1 that controls how quickly the new sample affects the mean deviation
$\eta$: a factor that controls how much the deviation affects the RTO (the suggested value of $\eta$ is 4)
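A minimal sketch of the mean-deviation estimator above. The function name `jacobson_update` and the parameter names `delta`, `rho`, `eta` are labels chosen here; the defaults delta = 1/8 and rho = 1/4 are assumed values commonly used in practice, not values given on the slide, which only suggests 4 for the deviation factor.

```python
from typing import Tuple


def jacobson_update(old_rtt: float, old_dev: float, sample: float,
                    delta: float = 0.125, rho: float = 0.25,
                    eta: float = 4.0) -> Tuple[float, float, float]:
    """Return (smoothed RTT, mean deviation, timeout) after one sample."""
    diff = sample - old_rtt
    smoothed = old_rtt + delta * diff             # weighted average
    dev = old_dev + rho * (abs(diff) - old_dev)   # mean deviation
    timeout = smoothed + eta * dev
    return smoothed, dev, timeout
```

Starting from an RTT of 100 ms with a deviation of 10 ms, a single 140 ms sample raises the timeout to 175 ms: the deviation term reacts to the jump far more strongly than the smoothed average does.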
Adapting to the path: Congestion Avoidance
- Use a coarse-grained timeout to indicate congestion in the network
- If a loss occurs (timeout) when cwnd = W, the network can absorb up to W segments, so set cwnd to 0.5 * W (multiplicative decrease)
- Upon receiving an ACK, increase cwnd by 1/cwnd (additive increase)
Review: Congestion control algorithms must obey the “Packet Conservation Principle”.
* To reach the equilibrium state and get high utilization of the network bandwidth without bombarding the network with a big burst, USE the ‘Slow-Start’ algorithm
* To maintain the equilibrium state (not injecting a new packet into the network until an old packet has been taken out), USE only unambiguous samples to measure RTT (Karn’s algorithm) and USE an accurate model to calculate RTO (Jacobson’s model)
* To adapt to the network condition, USE a mechanism to detect the occurrence of loss (coarse-grained timeout) and USE congestion avoidance to avoid exceeding the available bandwidth
The combined slow-start with congestion avoidance algorithm
- If a packet is dropped, we lose self-clocking
- We need to implement both algorithms together to avoid losing packets as much as we can
- Use 2 state variables: cwnd: the congestion window at the sender site; ssthresh: the threshold used to switch between the two algorithms
- The sender always sends min(cwnd, advertised window) - # unacked packets
- The algorithm starts with slow-start; on a timeout:
    ssthresh = cwnd/2
    cwnd = 1
- Now, upon receiving an ACK:
    if (cwnd < ssthresh)
        cwnd += 1 ;        /* implement slow-start */
    else
        cwnd += 1/cwnd ;   /* implement congestion avoidance */
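The combined rule can be sketched as a small state machine in Python. The class name `CongestionWindow` and the initial ssthresh value are assumptions for illustration (the slides do not state an initial threshold), and cwnd is kept as a float so the 1/cwnd increments are visible.

```python
class CongestionWindow:
    """Slow-start plus congestion avoidance, counted in segments."""

    def __init__(self, ssthresh: float = 64.0) -> None:
        self.cwnd = 1.0           # start in slow-start
        self.ssthresh = ssthresh  # initial value is an assumption

    def on_ack(self) -> None:
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0              # slow-start: +1 per ACK
        else:
            self.cwnd += 1.0 / self.cwnd  # congestion avoidance: +1/cwnd per ACK

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2.0   # remember half the window
        self.cwnd = 1.0                   # restart slow-start
```

With ssthresh = 4, three ACKs take cwnd from 1 to 4, and the fourth ACK adds only 0.25, which is the switch from exponential to linear growth.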
Slow-Start and Congestion Avoidance
[Figure: sender/receiver timeline, packets #0 to #28. The window grows (1), (2), (4), (8), (14) under slow-start; the receiver returns ACK #12 followed by duplicate ACKs #12; on timeout the sender retransmits #13, sets ssthresh = 15/2 = 7 and cwnd = 1 (“start slow-start again”), and after ACK #26 continues with #27, #28, ...]
[Figure, continued: packets #29 to #49. At window (7) the sender “enters congestion avoidance” and the window grows to (8), (8.125); on timeout the sender retransmits #41, sets ssthresh = 8/2 = 4 and cwnd = 1 (“start slow-start again”), receives ACK #47, and “enters congestion avoidance” again at window (4)]
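The congestion window for slow-start/congestion avoidance algorithm
[Figure: congestion window versus time. At each timeout the window drops from its current value (W1, then W2) to 1, and the new threshold is 0.5 W1 (then 0.5 W2)]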
Impacts of timeout - Timeout can cause sender to: Slow-start Retransmit a portion of window (possibly large) - Employ duplicate ACKs to signal the sender Fast Retransmit : use a number of duplicate ACKs to signal the sender about the packet loss (shorten the idle time for waiting for the timeout) Fast Recovery : advance congestion window more aggressively to reach high utilization faster
Fast Retransmit - Duplicate ACKs can be caused by: Segment Dropped Segment Re-ordering - TCP receiver should send an immediate duplicate ACK when an out-of-order segment arrives - TCP receiver should send an immediate ACK when an incoming segment fills in all or part of a gap in the sequence space.
- Assuming that segment re-ordering is infrequent, the TCP sender uses the receipt of 3 duplicate ACKs as an indication that a segment has been lost
- “3 duplicate ACKs” means 4 identical ACKs without the arrival of any other intervening ACK packets
- Set ssthresh = 0.5 * current cwnd, cwnd = 1, and retransmit the dropped segment before the timeout
- Wait for a non-duplicate ACK and continue with slow-start
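A minimal sketch of the duplicate-ACK trigger: the fourth identical ACK in a row (the third duplicate) signals Fast Retransmit. The class name `DupAckDetector` and its method are hypothetical, and ACKs are represented simply by the number they carry.

```python
from typing import Optional


class DupAckDetector:
    """Signals Fast Retransmit on the third duplicate ACK."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.last_ack: Optional[int] = None
        self.dup_count = 0

    def on_ack(self, ack_no: int) -> bool:
        """Return True exactly when this ACK is the threshold-th duplicate."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            return self.dup_count == self.threshold
        # A different ACK number resets the duplicate count
        self.last_ack = ack_no
        self.dup_count = 0
        return False
```

Feeding the ACK sequence 12, 13, 13, 13, 13, 13 triggers on the fifth ACK, which is the fourth copy of ACK 13.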
- Fast Retransmit removes the idle time the sender spends waiting for the coarse-grained timeout, since the sender can retransmit the dropped segment upon receiving the third duplicate ACK
- However, throughput still suffers because the sender has to enter slow-start every time a retransmission occurs
- Moreover, Fast Retransmit causes unnecessary retransmissions when multiple drops occur in a single window
Fast Recovery
- Key Observation: A duplicate ACK is caused by the receipt of a segment at the receiver site. In other words, each duplicate ACK corresponds to one segment leaving the network, so the duplicate ACKs can be used to clock the sending of segments
- Solution: If n duplicate ACKs arrive at the sender, advance cwnd by n
Fast Retransmit & Fast Recovery
- Upon receiving the third duplicate ACK of segment X:
Retransmit segment N (Fast Retransmit)
Set ssthresh = 0.5 * current cwnd
Set cwnd = ssthresh + 3 (Fast Recovery)
- After that, upon receiving each further duplicate ACK, inflate the congestion window by one
- If the sender’s usable window allows, send a new data segment
- Upon receiving a non-duplicate ACK, exit Fast Recovery: set cwnd = ssthresh (the value set on entering Fast Recovery) and continue with congestion avoidance
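The window arithmetic of Fast Retransmit and Fast Recovery can be sketched as follows. The class name `RenoSender`, its method names and the string return value are hypothetical; the sketch tracks only cwnd and ssthresh and leaves out actual segment transmission and window growth outside recovery.

```python
from typing import Optional


class RenoSender:
    """Window bookkeeping for Fast Retransmit / Fast Recovery only."""

    def __init__(self, cwnd: float) -> None:
        self.cwnd = float(cwnd)
        self.ssthresh = float("inf")  # placeholder until the first loss
        self.dup_acks = 0
        self.in_recovery = False

    def on_dup_ack(self) -> Optional[str]:
        self.dup_acks += 1
        if not self.in_recovery and self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2.0   # halve the window
            self.cwnd = self.ssthresh + 3.0   # account for the 3 dup ACKs
            self.in_recovery = True
            return "retransmit"               # Fast Retransmit
        if self.in_recovery:
            self.cwnd += 1.0                  # inflate by one per dup ACK
        return None

    def on_new_ack(self) -> None:
        self.dup_acks = 0
        if self.in_recovery:
            self.cwnd = self.ssthresh         # deflate and exit recovery
            self.in_recovery = False
```

Starting from cwnd = 16, the third duplicate ACK sets ssthresh = 8 and cwnd = 11; two more duplicates inflate cwnd to 13, and the next non-duplicate ACK deflates it to 8.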
- Fast Recovery improves throughput considerably, since duplicate ACKs are used to clock the sending of segments
- However, it suffers badly when multiple drops occur in a single window. Throughput drops dramatically, especially when there are 3 non-consecutive drops in a window
Modified Fast Recovery (Conservative version)
- Key Observation: Fast Recovery suffers from multiple drops since it has to enter Fast Recovery several times
- Solution: Change the sender’s behavior during Fast Recovery when a partial ACK is received. A partial ACK is one that acknowledges some but not all of the segments that were outstanding at the start of the Fast Recovery period
In the original Fast Recovery, partial ACKs cause TCP sender to exit Fast Recovery by deflating the congestion window back to the size of ssthresh In the modified Fast Recovery, partial ACKs do not take TCP sender out of Fast Recovery Instead, partial ACKs received during Fast Recovery trigger the sender to retransmit the segment immediately following the acknowledged segment TCP sender remains in Fast Recovery until all of the data outstanding when Fast Recovery was initiated has been acknowledged
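A minimal sketch of the modified decision on each non-duplicate ACK received during Fast Recovery. The function name `on_ack_in_recovery`, the parameter `recover` (the highest segment outstanding when Fast Recovery began) and the tuple return value are hypothetical; segments are numbered as in the slides, so an ACK carries the number of the last segment it acknowledges.

```python
from typing import Optional, Tuple


def on_ack_in_recovery(ack_no: int, recover: int) -> Tuple[str, Optional[int]]:
    """Decide what a non-duplicate ACK means during Fast Recovery.

    Full ACK (covers everything outstanding at entry): exit recovery.
    Partial ACK: stay in recovery and retransmit the segment that
    immediately follows the acknowledged one.
    """
    if ack_no >= recover:
        return ("exit", None)
    return ("retransmit", ack_no + 1)
```

For instance, if segments up to 28 were outstanding when recovery began, an ACK for 20 is partial and leads to retransmitting 21, while an ACK for 28 ends recovery.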
Selective Acknowledgement (SACK)
- The TCP receiver provides more information about hole(s) in the sequence buffer to the sender
- The SACK option field contains a number of SACK blocks, where each SACK block reports a non-contiguous set of data that has been received and queued
- The 1st block is required to report the most recently received segment
- The additional SACK blocks repeat the most recently reported SACK blocks
- The minimum number of SACK blocks in the SACK option field is two. It can have more than two blocks depending on the other option fields implemented in TCP
- The simulation referenced by this presentation assumed three blocks in the SACK option field
- The SACK TCP sender enters Fast Recovery upon receiving the 3rd duplicate ACK of a certain segment. As in regular Fast Recovery, the sender cuts cwnd in half and retransmits the dropped segment
- During Fast Recovery, SACK maintains a variable, named ‘pipe’, representing the estimated number of segments outstanding in the path
- The sender also maintains a data structure, called the ‘scoreboard’, which remembers acknowledgements from previous SACK options
- The sender only sends new or retransmitted data when “pipe < cwnd”
- ‘pipe’ is incremented by one when the sender either sends a new segment or retransmits an old one
- ‘pipe’ is decremented by one when the sender receives a dup ACK with a SACK option reporting that new data has been received at the receiver
- Upon receiving a partial ACK, ‘pipe’ is decremented by two
- The sender exits Fast Recovery when it receives a recovery acknowledgement covering all data that was outstanding when it entered Fast Recovery
- When the sender is allowed to send a segment, it retransmits the next segment inferred to be missing. If there is no such segment and the advertised window is sufficiently large, the sender sends a new packet
- When the retransmitted packet is itself dropped, the TCP sender detects the drop with the RTO, retransmits the dropped segment and then slow-starts
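The ‘pipe’ bookkeeping can be sketched as a small counter. The class name `SackPipe` and its method names are hypothetical, and the scoreboard that decides which segment to send next is left out.

```python
class SackPipe:
    """Tracks the estimated number of segments in the path during recovery."""

    def __init__(self, cwnd: int, pipe: int) -> None:
        self.cwnd = cwnd
        self.pipe = pipe

    def can_send(self) -> bool:
        return self.pipe < self.cwnd   # send only when pipe < cwnd

    def on_send(self) -> None:
        self.pipe += 1                 # new segment or retransmission

    def on_dup_ack_with_new_sack(self) -> None:
        self.pipe -= 1                 # one segment has left the path

    def on_partial_ack(self) -> None:
        self.pipe -= 2                 # partial ACK: decrement by two
```

With cwnd = 4 and pipe = 5 the sender is blocked; one SACK-bearing dup ACK brings pipe to 4 (still blocked), and a partial ACK brings it to 2, after which the sender may transmit again.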
TCP Flavors
- Tahoe, Reno, New-Reno, Vegas
- TCP Tahoe (distributed with 4.3 BSD Unix) includes:
Slow-start (exponential increase of the congestion window)
Congestion Avoidance (additive increase)
Fast Retransmit (use 3 dup ACKs)
- TCP Reno (1990) includes:
All mechanisms in Tahoe
Fast Recovery (governing transmission after the lost segment is retransmitted)
Delayed Acknowledgement (to avoid silly window syndrome)
- TCP New Reno:
Makes a small change in responding to partial ACKs during Fast Recovery
Tahoe: 1 drop
[Figure: sender/receiver timeline, packets #0 to #41. The window grows (1), (2), (4), (8), (15) with ACK #1-#2, ACK #3-#6, ACK #7-#13; after 3 dup ACKs #13 the sender “enters fast retransmit”, retransmits (Retx #13), sets ssthresh = 15/2 = 7 and cwnd = 1, and receives further dup ACKs up to the 14th dup ACK #13; after ACK #28 it “continues with slow-start” through windows (2), (4), then “enters congestion avoidance” at (7), (8)]
Reno : 1 drop
[Figure: sender/receiver timeline, packets #0 to #43. The window grows (1), (2), (4), (8), (15); after 3 dup ACKs #13 the sender “enters fast recovery” with ssthresh = 15/2 = 7 (cwnd = 7); the 4th to 14th dup ACKs #13 inflate the window from (11) to (21), allowing new segments #29 to #35 to be sent; on ACK #28 the sender “exits fast recovery” with ssthresh = 7 (cwnd = 7) and continues with congestion avoidance through windows (8), (9)]
Tahoe: 2 drops
[Figure: sender/receiver timeline, packets #0 to #17. The window grows (1), (2), (4), (8); after 3 dup ACKs #6 the sender “enters fast retransmit”, retransmits (Retx #7), sets ssthresh = 8/2 = 4 and cwnd = 1, and continues with slow-start; after ACK #8 it retransmits #9 and sends #10 at window (2); ACK #14 and a 1st dup ACK #14 follow, and the sender “enters congestion avoidance” at (4), (4.67) with ACK #15-#17]
Reno : 2 drops (causing “retransmission timeout”)
[Figure: sender/receiver timeline, packets #0 to #18. The window grows (1), (2), (4), (8); after 3 dup ACKs #6 the sender “enters fast recovery” with ssthresh = 8/2 = 4 (cwnd = 4) and retransmits (Retx #7); the 4th to 6th dup ACKs #6 inflate the window to (8), (9), (10); on ACK #8 the sender “exits fast recovery” with ssthresh = 4 (cwnd = 4) but cannot send more data since the outstanding no. of segments is 8; only the 1st and 2nd dup ACKs #8 arrive, so the sender waits for the timeout, then “enters slow-start” with Retx #9 (cwnd = 1), followed by ACK #16 (2) and ACK #17-#18 (4)]
Reno : 2 drops (causing “two successive Fast Recovery”)
[Figure: sender/receiver timeline, packets #0 to #45. The window grows (1), (2), (4), (8), (15); after 3 dup ACKs #13 the sender “enters fast recovery” with ssthresh = 15/2 = 7 (cwnd = 7) and retransmits (Retx #14); the 4th to 13th dup ACKs #13 inflate the window from (11) to (20), allowing new segments #29 to #34; on ACK #27 the sender “exits fast recovery” with ssthresh = 7 (cwnd = 7)]
[Figure, continued: after 3 dup ACKs #27 the sender “enters fast recovery” a second time with ssthresh = 7/2 = 3 (cwnd = 3) and retransmits (Retx #28); the 4th to 6th dup ACKs #27 inflate the window to (7), (8), (9); on ACK #34 the sender “exits fast recovery” with ssthresh = 3 (cwnd = 3) and continues with congestion avoidance, with ACK #35 to ACK #41 growing the window to (4), (5)]