Congestion management for data centers ieee 802 1 ethernet standard
1 / 60

Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard - PowerPoint PPT Presentation

  • Uploaded on

Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard. Balaji Prabhakar Departments of EE and CS Stanford University. Background. Data Centers see the true convergence of L3 and L2 transport

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard' - nixie

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Congestion management for data centers ieee 802 1 ethernet standard l.jpg

Congestion Management for Data Centers:IEEE 802.1 Ethernet Standard

Balaji Prabhakar

Departments of EE and CS

Stanford University

Slide2 l.jpg


  • Data Centers see the true convergence of L3 and L2 transport

    • While TCP is the dominant L3 transport protocol, and a significant amount of L2 traffic uses it, there is other L2 traffic; notably, storage and media

    • This, and other reasons, have prompted the IEEE 802.1 standards body to develop an Ethernet congestion management standard

    • In this lecture, we shall see the development of the QCN (Quantized Congestion Notification) algorithm for standardization in the IEEE 802.1 Data Center Bridging standards

    • We will also review the technical background of congestion control research

  • The lecture has 3 parts

    • A brief overview of the relevant congestion control background

    • A description of the QCN algorithm and its performance

    • The Averaging Principle: A new control-theoretic idea underlying the QCN and BIC-TCP algorithms which stabilizes them when loop delays increase; very useful for operating high-speed links with shallow buffers---the situation in 10+ Gbps Ethernets

Managing congestion l.jpg
Managing Congestion

  • Congestion is a standard feature of networked systems; in data networks,

    • Congestion occurs when links are oversubscribed when traffic and/or link bandwidth changes

    • A congestion notification mechanism allows switches/routers to directly control the rate of the ultimate sources of the traffic

  • We’ve been involved in developing QCN (for Quantized Congestion Notification) for standardization in the Data Center Bridging track of the IEEE 802.1 Ethernet standards

    • For deployment in 10 (and 40 and 100) Gbps Data Center Ethernets

  • Complete information on the QCN algorithm (p-code, draft of standard, detailed simulations of lots of scenarios) available at

Congestion control in the internet l.jpg
Congestion control in the Internet

  • In the Internet

    • Queue management schemes (e.g. RED) at the links signal congestion by either dropping or marking packets using ECN

    • TCP at end-systems uses these signals to vary the sending rate

    • There exists a rich history of algorithm development, control-theoretic analysis and detailed simulation of queue management schemes and congestion control algorithms for the Internet

      • Jacobson, Floyd et al, Kelly et al, Low et al, Srikant et al, Misra et al, Katabi et al …

  • TCP is excellent, so why look for another algorithm?

    • There is other traffic on Ethernet than TCP; so, native Ethernet congestion management is needed

    • TCP’s “one size fits all” approach makes it too conservative for high bandwidth-delay product networks

    • A hardware-based algorithm is needed for the very high speeds of operation encountered in 10, 40 and 100 Gbps

    • Ethernet and the Internet have very different operating conditions

Switched ethernet vs the internet l.jpg
Switched Ethernet vs. the Internet

  • Some significant differences …

    • No per-packet acks in Ethernet, unlike in the Internet

      • Not possible to know round trip time!

      • So congestion must be signaled to the source by switches

      • Algorithm not automatically self-clocked (like TCP)

    • Links can be paused; i.e. packets may not be dropped

    • No sequence numbering of L2 packets

    • Sources do not start transmission gently (like TCP slow-start); they can potentially come on at the full line rate of 10Gbps

    • Ethernet switch buffers are much smaller than router buffers (100s of KBs vs 100s of MBs)

    • Most importantly, algorithm should be simple enough to be implemented completely in hardware

  • Note: The QCN algorithm we have developed has Internet relatives; notably BIC-TCP at the source and the REM/PI controllers at switches

Slide6 l.jpg

L2 Transport: IEEE 802.1

  • IEEE 802.1 Data Center Bridging standards: Enhancements to Ethernet

    • Reliable delivery (802.1Qbb): Link-level flow control (PAUSE) prevents congestion drops

    • Ethernet congestion management (802.1Qau): Prevents congestion spreading due to PAUSE

  • Consequences

    • Hardware-friendly algorithms: can operate on 10—100Gbps links

    • Partial offload of CPU: no packet retransmissions

    • Corruption losses require abort/restart; 10G over copper uses short cables to keep low BER

    • PAUSE absorption buffers: proportional to bdwdth x delay of links, high memory bandwidth

    • NOTE: Recent work addresses the last two points; this is not covered in the course

Pause absorption buffers

Congestion spreading



Stability l.jpg

  • Congestion control algorithms aim to

    • deliver high throughput, maintain low latencies/backlogs, be fair to all flows, be simple to implement and easy to deploy

  • Performance is related to stability of control loop

    • “Stability” refers to the non-oscillatory or non-exploding behavior of congestion control loops. In real terms, stability refers to the non-oscillatory behavior of the queues at the switch.

      • If the switch buffers are short, oscillating queues can overflow (hence drop packets/pause the link) or underflow (hence lose utilization)

      • In either case, links cannot be fully utilized, throughput is lost, flow transfers take longer

      • So stability is an important property, especially for networks with high bandwidth-delay products operating with shallow buffers

Unit step response of the network l.jpg
Unit step response of the network

  • The control loops are not easy to analyze

    • They are described by non-linear, delay differential equations which are usually impossible to analyze

    • So linearized analyses are performed using Nyquist or Bode theory

  • Is linearized analysis useful?

    • Yes! It is not difficult to know if a zero-delay non-linear system is stable. As the delay increases, linearization can be used to tell if the system is stable for delay (or number of sources) in some range; i.e. we get sufficient conditions

  • The above stability theory is essentially studying the “unit step response” of a network

    • Apply many “infinitely long flows” at time 0 and see how long the network takes to settle them to the correct collective and individual rate; the first is about throughput, the second is about fairness

Tcp red a basic control loop l.jpg
TCP--RED: A basic control loop









RED: Drop probability, p, increases as

the congestion level goes up

TCP: Slow start +

Congestion avoidance

Congestion avoidance: AIMD

No loss: increase window by 1;

Pkt loss: cut window by half

Tcp red l.jpg

  • Two ways to analyze and understand this control loop

    • Simulations: ns-2

    • Theory: Delay-differential equations

  • ns-2: A widely used event-driven simulator for the Internet

    • Very detailed and accurate

    • Different types of transport protocols: TCP, UDP, …

    • Router mechanisms and algorithms: RED, DRR, …

    • Web traffic: sessions, flows, power law flow sizes, …

    • Different types of network: wired, wireless, satellite, mobility,…

The simulation setup l.jpg
The simulation setup


# of TCP flows




# of TCP flows



















# of TCP flows








Tcp red analytical model l.jpg
TCP--RED: Analytical model









TCP Control

RED Control

Tcp red analytical model15 l.jpg
TCP--RED: Analytical model




W: window size; RTT: round trip time; C: link capacity

q: queue length; qa: ave queue length p: drop probability

*By V. Misra, W. Dong and D. Towsley at SIGCOMM 2000

*Fluid model concept originated by F. Kelly, A. Maullo and D. Tan at Jour. Oper. Res. Society, 1998

Accuracy of analytical model l.jpg
Accuracy of analytical model

Recall the ns-2 simulation from earlier: Delay at Link 1

Why are the diff eqn models so accurate l.jpg
Why are the Diff Eqn models so accurate?

  • They’ve been developed in Physics, where they are called Mean Field Models

  • The main idea

    • very difficult to model large-scale systems: there are simply too many events, too many random quantities

    • but, it is quite easy to model the mean or average behavior of such systems

    • interestingly, when the size of the system grows, its behavior gets closer and closer to that predicted by the mean-field model!

    • physicists have been exploiting this feature to model large magnetic materials, gases, etc.

    • just as a few electrons/particles don’t have a very big influence on a system, so is Internet resource usage not heavily influenced by a few packets: aggregates matter more

Tcp red stability analysis l.jpg
TCP--RED: Stability analysis

  • Given the differential equations, in principle one can figure out whether the TCP--RED control loop is stable

  • However, the differential equations are very complicated

    • 3rd or 4th order, nonlinear, with delays

    • There is no general theory, specific case treatments exist

  • “Linearize and analyze”

    • Linearize equations around the (unique) operating point

    • Analyze resultant linear, delay-differential equations using Nyquist or Bode theory

  • End result:

    • Design stable control loops

    • Determine stability conditions (RTT limits, number of users, etc)

    • Obtain control loop parameters: gains, drop functions, …

Instability of tcp red l.jpg
Instability of TCP--RED

  • As the bandwidth-delay-product increases, the TCP--RED control loop becomes unstable

  • Parameters: 50 sources, link capacity = 9000 pkts/sec, TCP--RED

  • Source: S. Low et. al. Infocom 2002

Summary l.jpg

  • We saw a very brief overview of research on the analysis of congestion control systems

  • As loop lags increase, the control loop becomes very oscillatory

    • This is true of any control scheme, not just congestion control schemes

    • In networks, oscillatory queue sizes tend to underflow buffers, causing to a loss of throughput; especially true for high BDP networks with shallow buffers

    • This has led to much research on developing algorithms for high BDP networks; e.g. High-Speed TCP, XCP, RCP, Scalable TCP, BIC-TCP, etc

    • We shall return to this later, after describing the QCN algorithm we have developed for the IEEE 802.1 standard

Slide23 l.jpg

Quantized Congestion Notification (QCN):

Congestion control for Ethernet

Joint work with:

Mohammad Alizadeh, BerkAtikoglu and Abdul Kabbani, Stanford University

AshvinLakshmikantha, Broadcom

Rong Pan, Cisco Systems

Mick Seaman, Chair, Security Group; Ex-Chair, Interworking Group, IEEE 802.1

Overview l.jpg

  • The description of QCN is brief, restricted to the main points of the algorithm

    • A fuller description is available at the IEEE 802.1 Data Center Bridging Task Group’s website, including extensive simulations and pseudo-code

  • We will describe the congestion control loop

    • How is congestion measured at the switches?

    • What is the signal? And, how does the switch send it? (Remember there are no per-packet acks in Ethernet)

    • What does the source do when it receives a congestion signal?

  • Terminology:

    • Congestion Point: Where congestion occurs, mainly switches

    • Reaction Point: Source of traffic, mainly rate limiters in Ethernet NICs

Qcn congestion point dynamics l.jpg
QCN: Congestion Point Dynamics

  • Consider the single-source, single-switch loop below

  • Congestion Point (Switch) Dynamics: Sample packets, compute feedback (Fb), quantize Fb to 6 bits, and reflect only negative Fb values back to Reaction Point with a probability proportional to Fb.






Fb = -(Q-Qeq+ w . dQ/dt)

= -(queue offset + w.rate offset)



Qcn reaction point l.jpg
QCN: Reaction Point


Target Rate







Current Rate


Congestion message recd

  • Source (reaction point): Transmit regular Ethernet frames. When congestion message arrives:

    • Multiplicative Decrease:

    • Fast Recovery similar to BIC-TCP: gives high performance in high bandwidth-delay product networks, while being very simple.

    • Active Probing

Fast Recovery

Active Probing

Timer supported qcn l.jpg
Timer-supported QCN

  • Byte-Counter

    • 5 cycles of FR (150KB per cycle)

    • AI cycles afterwards (75KB per cycle)

    • Fb < 0 sends timer to FR


  • RL

    • In FR if both byte-ctr and timer in FR

    • In AI if only one of byte-ctr or timer in AI

    • In HAI if both byte-ctr and timer in AI

  • Note: RL goes to HAI only after 500 pkts have been sent



  • Timer

    • 5 cycles of FR (T msec per cycle)

    • AI cycles afterwards (T/2 msec/cycle)

    • Fb < 0 sends timer to FR

Simulations basic case l.jpg
Simulations: Basic Case

  • Parameters

    • 10 sources share a 10 G link, whose capacity drops to 0.5G during 2-4 secs

    • Max offered rate per source: 1.05G

    • RTT = 50 usec

    • Buffer size = 100 pkts (150KB); Qeq = 22

    • T = 10 msecs

    • RAI = 5 Mbps

    • RHAI = 50 Mbps

10 G

10 G

Source 1

Source 2


Source 10

Recovery time l.jpg
Recovery Time

Recovery time = 80 msec

Fluid model for qcn l.jpg
Fluid Model for QCN

P = Φ(Fb)

  • Assume N flows pass through a single queue at a switch. State variables are TRi(t), CRi(t), q(t), p(t).




Accuracy equations vs ns 2 simulations l.jpg
Accuracy:Equations vs. ns-2 simulations

N = 10, RTT = 100 us

N = 100, RTT = 500 us

N = 10, RTT = 1 ms

N = 10, RTT = 2 ms

Summary32 l.jpg

  • The algorithm has been extensively tested in deployment scenarios of interest

    • Esp. interoperability with link-level PAUSE and TCP

    • All presentations are available at the IEEE 802.1 website:

  • The theoretical development is interesting, but most notably because QCN (and BIC-TCP) display strong stability in the face of increasing lags, or, equivalently in high bandwidth-delay product networks

  • While attempting to understand why these schemes perform so well, we have uncovered a method for improving the stability of any congestion control scheme; we present this next

Background to the ap l.jpg
Background to the AP

  • When the lags in a control loop increase, the system becomes oscillatory and eventually becomes unstable

  • Feedback compensation is applied to restore stability; the two main flavors of feedback compensation in are:

    • Determine lags (round trip times), apply the correct “gains” for the loop to be stable (e.g. XCP, RCP, FAST).

    • Include higher order queue derivatives in the congestion information fed back to the source (e.g. REM/PI, BCN).

    • Method 1 is not suitable for us, we don’t know RTTs in Ethernet

    • Method 2 requires a change to the switch implementation

  • The Averaging Principle is a different method

    • It is suited to Ethernet where round trip times are unavailable

    • It doesn’t need more feedback, hence switch implementations don’t have to change

    • QCN and BIC-TCP already turn out to employ it

The averaging principle ap l.jpg
The Averaging Principle (AP)

  • A source in a congestion control loop is instructed by the network to decrease or increase its sending rate (randomly) periodically

  • AP: a source obeys the network whenever instructed to change rate, and then voluntarily performs averaging as below

TR = Target Rate

CR = Current Rate

Recall qcn does 5 steps of averaging l.jpg
Recall: QCN does 5 steps of Averaging



Target Rate






Current Rate


Congestion message recd

  • The Fast Recovery portion of QCN, there are 5 steps of averaging

  • In fact, QCN and BIC-TCP are the Ave Prin applied to TCP!

Active Probing

Applying the ap rcp rate control protocol dukkipatti and mckeown l.jpg
Applying the APRCP: Rate Control ProtocolDukkipatti and McKeown

  • A router computes an upper bound R on the rate of all flows traversing it.

  • R recomputed every T (= 10) msec as follows:

  • Where

    • d0: Round trip time estimate (set constant= 10 msec in our case)‏

    • C: link capacity (= 2.4 Gbps)

    • Q: Current queue size at the switch

    • y(t): incoming rate

    • α = 0.1

    • ß = 1

  • A flow chooses the smallest advertised rate on its path.

  • We consider a scenario where 10 RCP sources share a single link.

Ap rcp stability l.jpg
AP-RCP Stability

RTT = 60 msec

RTT = 65 msec

Ap rcp stability cont d l.jpg
AP-RCP Stability cont’d

RTT = 120 msec

RTT = 130 msec

Ap rcp stability cont d40 l.jpg
AP-RCP Stability cont’d

RTT = 230 msec

RTT = 240 msec

Understanding the ap l.jpg
Understanding the AP

  • As mentioned earlier, the two major flavors of feedback compensation are:

    • Determine lags, chose appropriate gains

    • Feedback higher derivatives of state

  • We prove that the AP is sense equivalent to both of the above!

    • This is great because we don’t need to change network routers and switches

    • And the AP is really very easy to apply; no lag-dependent optimizations of gain parameters needed

Ap equivalence single source case l.jpg
AP Equivalence: Single Source Case


does AP




0.5 Fb + 0.25 T dFb/dt

  • Systems 1 and 2 are discrete-time models for an AP enabled source, and a regular source respectively.

  • Main Result: Systems 1 and 2 are algebraically equivalent. That is, given identical input sequences, they produce identical output sequences.

    • Therefore the AP is equivalent to adding a derivative to the feedback and reducing the gain!

    • Thus, the AP does both known forms of feedback compensation without knowing RTTs or changing switch implementations

Ap rcp vs pd rcp l.jpg

RTT = 120 msec

RTT = 130 msec

A generic control example l.jpg
A Generic Control Example

  • As an example, we consider the plant transfer function:

    P(s) = (s+1)/(s3+1.6s2+0.8s+0.6)

Step response basic ap no delay l.jpg
Step ResponseBasic AP, No Delay

Step response basic ap delay 8 seconds l.jpg
Step ResponseBasic AP, Delay = 8 seconds

Step response two step ap delay 14 seconds l.jpg
Step Response Two-step AP, Delay = 14 seconds

Step response two step ap delay 25 seconds l.jpg
Step Response Two-step AP, Delay = 25 seconds

Two-step AP is even more stable than

Basic AP

Summary of ap l.jpg
Summary of AP

  • The AP is a simple method for making many control loops (not just congestion control loops) more robust to increasing lags

  • Gives a clear understanding as to the reason why the BIC-TCP and QCN algorithms have such good delay tolerance: they do averaging repeatedly

    • There is a theorem which deals explicitly with the QCN-type loop

  • Variations of the basic principle are possible; i.e. average more than once, average by more than half-way, etc

    • The theory is fairly complete in these cases

Slide51 l.jpg

Background: TCP Buffer Sizing

  • Standard “rule of thumb”:

    • Single TCP flow: Bandwidth × Delay worth of buffering needed for 100 % utilization.

  • Recent result (Appenzellar et al.):

    • For N >> 1 TCP flows: Bdwdth x Delay/sqrt(N) amount of buffering is enough.

    • The essence of this result is that when many flows combine, the Variance of the net sending rate decreases:

  • Buffer sizing problem is challenging in data centers:

    • Typically, only a small number of flows are active on each path. (N is small)

    • Ethernet switches are typically built with shallow buffers to keep costs down.

Slide52 l.jpg

Example: Simulation Setup


  • 10 Gig Ethernet

  • Switch buffer is 150 Kbytes deep.

  • We compare TCP and QCN for various # of flows, and RTTs.

Slide53 l.jpg

TCP vs QCN (N = 1, RTT = 120 μs)



Throughput = 99.5%

Standard Deviation = 265.4 Mbps

Throughput = 99.5%

Standard Deviation = 13.8 Mbps

Slide54 l.jpg

TCP vs QCN (N = 1, RTT = 250 μs)



Throughput = 95.5%

Standard Deviation = 782.7 Mbps

Throughput = 99.5%

Standard Deviation = 33.3 Mbps

Slide55 l.jpg

TCP vs QCN (N = 1, RTT = 500 μs)



Throughput = 88%

Standard Deviation = 1249.7 Mbps

Throughput = 99.5%

Standard Deviation = 95.4 Mbps

Slide56 l.jpg

TCP vs QCN (N = 10, RTT = 120 μs)



Throughput = 99.5%

Standard Deviation = 625.1 Mbps

Throughput = 99.5%

Standard Deviation = 25.1 Mbps

Slide57 l.jpg

TCP vs QCN (N = 10, RTT = 250 μs)



Throughput = 95.5%

Standard Deviation = 981 Mbps

Throughput = 99.5%

Standard Deviation = 27.2 Mbps

Slide58 l.jpg

TCP vs QCN (N = 10, RTT = 500 μs)



Throughput = 89%

Standard Deviation = 1311.4 Mbps

Throughput = 99.5%

Standard Deviation = 170.5 Mbps

Slide59 l.jpg

QCN and shallow buffers

  • In contrast to TCP, QCN is stable with shallow buffers, even with few sources.

  • Why?

    • Recall that buffer requirements are closely related to sending rate variance:

      Buffer size = C x Var(R1) x Bdwdth x Delay/ sqrt(N)

  • TCP:

    • Good performance for large N, since the denominator is large.

  • QCN:

    • Good performance for all N, since the numerator is small.

  • Thus, averaging reduces the variance of a source’s sending rate

    • This is a stochastic interpretation of the Averaging Principle’s success in keeping stability with shallow buffers

Conclusions l.jpg

  • We have seen the background, development and analysis of a congestion control scheme for the IEEE 802.1 Ethernet standard

  • The QCN algorithm is

    • More stable with respect to control loop delays

    • Requires much smaller buffers than TCP

    • Easy to build in hardware

  • The Averaging Principle is interesting and we’re exploring its use in nonlinear control systems