Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard

Balaji Prabhakar

Departments of EE and CS

Stanford University


Background

  • Data Centers see the true convergence of L3 and L2 transport
    • While TCP is the dominant L3 transport protocol and carries much of the traffic on L2 networks, there is other L2 traffic; notably, storage and media
    • This, among other reasons, has prompted the IEEE 802.1 standards body to develop an Ethernet congestion management standard
    • In this lecture, we shall see the development of the QCN (Quantized Congestion Notification) algorithm for standardization in the IEEE 802.1 Data Center Bridging standards
    • We will also review the technical background of congestion control research
  • The lecture has 3 parts
    • A brief overview of the relevant congestion control background
    • A description of the QCN algorithm and its performance
    • The Averaging Principle: A new control-theoretic idea underlying the QCN and BIC-TCP algorithms which stabilizes them when loop delays increase; very useful for operating high-speed links with shallow buffers---the situation in 10+ Gbps Ethernets
Managing Congestion
  • Congestion is a standard feature of networked systems; in data networks,
    • Congestion occurs when links become oversubscribed as traffic and/or link bandwidth changes
    • A congestion notification mechanism allows switches/routers to directly control the rate of the ultimate sources of the traffic
  • We’ve been involved in developing QCN (for Quantized Congestion Notification) for standardization in the Data Center Bridging track of the IEEE 802.1 Ethernet standards
    • For deployment in 10 (and 40 and 100) Gbps Data Center Ethernets
  • Complete information on the QCN algorithm (p-code, draft of the standard, detailed simulations of many scenarios) is available at the IEEE 802.1 Data Center Bridging Task Group website
Congestion control in the Internet
  • In the Internet
    • Queue management schemes (e.g. RED) at the links signal congestion by either dropping or marking packets using ECN
    • TCP at end-systems uses these signals to vary the sending rate
    • There exists a rich history of algorithm development, control-theoretic analysis and detailed simulation of queue management schemes and congestion control algorithms for the Internet
      • Jacobson, Floyd et al, Kelly et al, Low et al, Srikant et al, Misra et al, Katabi et al …
  • TCP is excellent, so why look for another algorithm?
    • There is traffic on Ethernet other than TCP; so native Ethernet congestion management is needed
    • TCP’s “one size fits all” approach makes it too conservative for high bandwidth-delay product networks
    • A hardware-based algorithm is needed for the very high speeds of operation encountered in 10, 40 and 100 Gbps
    • Ethernet and the Internet have very different operating conditions
Switched Ethernet vs. the Internet
  • Some significant differences …
    • No per-packet acks in Ethernet, unlike in the Internet
      • Not possible to know round trip time!
      • So congestion must be signaled to the source by switches
      • Algorithm not automatically self-clocked (like TCP)
    • Links can be paused; i.e. packets may not be dropped
    • No sequence numbering of L2 packets
    • Sources do not start transmission gently (like TCP slow-start); they can potentially come on at the full line rate of 10Gbps
    • Ethernet switch buffers are much smaller than router buffers (100s of KBs vs 100s of MBs)
    • Most importantly, algorithm should be simple enough to be implemented completely in hardware
  • Note: The QCN algorithm we have developed has Internet relatives; notably BIC-TCP at the source and the REM/PI controllers at switches

L2 Transport: IEEE 802.1

  • IEEE 802.1 Data Center Bridging standards: Enhancements to Ethernet
    • Reliable delivery (802.1Qbb): Link-level flow control (PAUSE) prevents congestion drops
    • Ethernet congestion management (802.1Qau): Prevents congestion spreading due to PAUSE
  • Consequences
    • Hardware-friendly algorithms: can operate on 10—100Gbps links
    • Partial offload of CPU: no packet retransmissions
    • Corruption losses require abort/restart; 10G over copper uses short cables to keep low BER
    • PAUSE absorption buffers: proportional to bdwdth x delay of links, high memory bandwidth
    • NOTE: Recent work addresses the last two points; this is not covered in the course

[Figure: congestion spreading in a switched network, and the PAUSE absorption buffers it requires]

Stability
  • Congestion control algorithms aim to
    • deliver high throughput, maintain low latencies/backlogs, be fair to all flows, be simple to implement and easy to deploy
  • Performance is related to stability of control loop
    • “Stability” refers to the non-oscillatory or non-exploding behavior of congestion control loops. In real terms, stability refers to the non-oscillatory behavior of the queues at the switch.
      • If the switch buffers are short, oscillating queues can overflow (hence drop packets/pause the link) or underflow (hence lose utilization)
      • In either case, links cannot be fully utilized, throughput is lost, flow transfers take longer
      • So stability is an important property, especially for networks with high bandwidth-delay products operating with shallow buffers
Unit step response of the network
  • The control loops are not easy to analyze
    • They are described by non-linear, delay differential equations which are usually impossible to analyze
    • So linearized analyses are performed using Nyquist or Bode theory
  • Is linearized analysis useful?
    • Yes! It is not difficult to know if a zero-delay non-linear system is stable. As the delay increases, linearization can be used to tell if the system is stable for delay (or number of sources) in some range; i.e. we get sufficient conditions
  • The above stability theory is essentially studying the “unit step response” of a network
    • Apply many “infinitely long flows” at time 0 and see how long the network takes to settle them to the correct collective and individual rate; the first is about throughput, the second is about fairness
TCP--RED: A basic control loop

[Figure: TCP--RED control loop. Several TCP sources feed a RED queue; the drop probability p rises with the average queue length qavg between the thresholds minth and maxth]

RED: the drop probability p increases as the congestion level goes up.
TCP: slow start + congestion avoidance (AIMD); no loss: increase the window by 1, packet loss: cut the window in half.
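To make the two halves of this loop concrete, here is a minimal Python sketch of a linear RED drop profile and the AIMD window update; the threshold and p_max values are assumptions for illustration, not taken from the slides.

```python
def red_drop_prob(q_avg, min_th=20.0, max_th=80.0, p_max=0.1):
    """RED: drop/mark probability rises linearly from 0 to p_max as the
    average queue length q_avg moves from min_th to max_th (values assumed)."""
    if q_avg < min_th:
        return 0.0
    if q_avg > max_th:
        return 1.0
    return p_max * (q_avg - min_th) / (max_th - min_th)

def tcp_aimd_update(cwnd, loss):
    """TCP congestion avoidance (AIMD): increase the window by 1 when there is
    no loss, cut it in half on a loss."""
    return max(1.0, cwnd / 2.0) if loss else cwnd + 1.0
```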

TCP--RED
  • Two ways to analyze and understand this control loop
    • Simulations: ns-2
    • Theory: Delay-differential equations
  • ns-2: A widely used event-driven simulator for the Internet
    • Very detailed and accurate
    • Different types of transport protocols: TCP, UDP, …
    • Router mechanisms and algorithms: RED, DRR, …
    • Web traffic: sessions, flows, power law flow sizes, …
    • Different types of network: wired, wireless, satellite, mobility,…
The simulation setup

[Figure: simulation setup. Three groups of TCP flows (grp1 and grp2 with up to 1200 flows each, grp3 with up to 600 flows), whose numbers vary over time, share two 100 Mbps links with RED queues]

TCP--RED: Analytical model

[Figure: block diagram of the TCP--RED control loop. The TCP control law sets the sending rate, which feeds the queue q served at capacity C; the RED control law maps the queue into a drop probability p, which is fed back to the sources through a time delay]

TCP--RED: Analytical model

Users (TCP window dynamics), in the standard fluid model:
    dW/dt = 1/RTT − W(t)·W(t − RTT)·p(t − RTT) / (2·RTT)

Network (queue dynamics):
    dq/dt = N·W(t)/RTT − C

W: window size; RTT: round trip time; C: link capacity
q: queue length; qa: average queue length; p: drop probability

*By V. Misra, W. Gong and D. Towsley at SIGCOMM 2000
*Fluid model concept originated by F. Kelly, A. Maulloo and D. Tan in the Journal of the Operational Research Society, 1998
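The delay-differential equations above can be explored numerically. Below is a minimal Euler-integration sketch of this TCP--RED fluid model; the RED profile, time step and all parameter values are assumptions for illustration, and the delayed terms are approximated with a simple history buffer.

```python
import collections

def red_profile(q, min_th=50.0, max_th=500.0, p_max=0.1):
    """Linear RED drop profile on the queue length (values assumed)."""
    if q <= min_th:
        return 0.0
    if q >= max_th:
        return p_max
    return p_max * (q - min_th) / (max_th - min_th)

def simulate_tcp_red(N=50, C=9000.0, rtt0=0.05, T=50.0, dt=0.001):
    """Euler integration of dW/dt = 1/R - W(t) W(t-R) p(t-R) / (2 R) and
    dq/dt = N W(t)/R - C, with R = rtt0 + q/C (units: packets, seconds)."""
    W, q = 1.0, 0.0
    hist = collections.deque()          # per-step history of (W, p) for the delayed terms
    queue_trace = []
    for _ in range(int(T / dt)):
        R = rtt0 + q / C                # current round-trip time
        lag = max(1, int(R / dt))
        hist.append((W, red_profile(q)))
        W_d, p_d = hist[0] if len(hist) <= lag else hist[-lag]
        dW = 1.0 / R - W * W_d * p_d / (2.0 * R)
        dq = N * W / R - C
        W = max(W + dW * dt, 1.0)
        q = max(q + dq * dt, 0.0)
        queue_trace.append(q)
    return queue_trace

trace = simulate_tcp_red()
print("final queue: %.1f pkts" % trace[-1])   # oscillations grow as the bandwidth-delay product grows
```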

Accuracy of analytical model

Recall the ns-2 simulation from earlier: Delay at Link 1

Why are the Diff Eqn models so accurate?
  • They’ve been developed in Physics, where they are called Mean Field Models
  • The main idea
    • very difficult to model large-scale systems: there are simply too many events, too many random quantities
    • but, it is quite easy to model the mean or average behavior of such systems
    • interestingly, when the size of the system grows, its behavior gets closer and closer to that predicted by the mean-field model!
    • physicists have been exploiting this feature to model large magnetic materials, gases, etc.
    • just as a few electrons/particles don’t have a very big influence on a system, so is Internet resource usage not heavily influenced by a few packets: aggregates matter more
TCP--RED: Stability analysis
  • Given the differential equations, in principle one can figure out whether the TCP--RED control loop is stable
  • However, the differential equations are very complicated
    • 3rd or 4th order, nonlinear, with delays
    • There is no general theory, specific case treatments exist
  • “Linearize and analyze”
    • Linearize equations around the (unique) operating point
    • Analyze resultant linear, delay-differential equations using Nyquist or Bode theory
  • End result:
    • Design stable control loops
    • Determine stability conditions (RTT limits, number of users, etc)
    • Obtain control loop parameters: gains, drop functions, …
Instability of TCP--RED
  • As the bandwidth-delay-product increases, the TCP--RED control loop becomes unstable
  • Parameters: 50 sources, link capacity = 9000 pkts/sec, TCP--RED
  • Source: S. Low et al., Infocom 2002
Summary
  • We saw a very brief overview of research on the analysis of congestion control systems
  • As loop lags increase, the control loop becomes very oscillatory
    • This is true of any control scheme, not just congestion control schemes
    • In networks, oscillatory queue sizes tend to underflow buffers, causing a loss of throughput; this is especially true for high BDP networks with shallow buffers
    • This has led to much research on developing algorithms for high BDP networks; e.g. High-Speed TCP, XCP, RCP, Scalable TCP, BIC-TCP, etc
    • We shall return to this later, after describing the QCN algorithm we have developed for the IEEE 802.1 standard

Quantized Congestion Notification (QCN):

Congestion control for Ethernet

Joint work with:

Mohammad Alizadeh, Berk Atikoglu and Abdul Kabbani, Stanford University

Ashvin Lakshmikantha, Broadcom

Rong Pan, Cisco Systems

Mick Seaman, Chair, Security Group; Ex-Chair, Interworking Group, IEEE 802.1

Overview
  • The description of QCN is brief, restricted to the main points of the algorithm
    • A fuller description is available at the IEEE 802.1 Data Center Bridging Task Group’s website, including extensive simulations and pseudo-code
  • We will describe the congestion control loop
    • How is congestion measured at the switches?
    • What is the signal? And, how does the switch send it? (Remember there are no per-packet acks in Ethernet)
    • What does the source do when it receives a congestion signal?
  • Terminology:
    • Congestion Point: Where congestion occurs, mainly switches
    • Reaction Point: Source of traffic, mainly rate limiters in Ethernet NICs
QCN: Congestion Point Dynamics
  • Consider the single-source, single-switch loop below
  • Congestion Point (Switch) Dynamics: Sample packets, compute feedback (Fb), quantize Fb to 6 bits, and reflect only negative Fb values back to the Reaction Point with a probability proportional to |Fb| (sketched in code below).

[Figure: a single source sending through a switch queue with equilibrium occupancy Qeq; the reflection probability rises from Pmin to Pmax with |Fb|]

Fb = −(Q − Qeq + w·dQ/dt) = −(queue offset + w·rate offset)
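A minimal sketch of this congestion-point logic in Python is shown below; the packet sampling itself is left out, w = 2 and the Pmin value are assumed defaults (the 10% and 63 endpoints come from the Φ(Fb) curve shown later), and this is an illustration rather than the standard's pseudo-code.

```python
import random

FB_MAX = 63                  # |Fb| quantized to 6 bits
P_MIN, P_MAX = 0.01, 0.10    # reflection probability range (Pmin assumed; 10% max)

def congestion_point_feedback(q, q_old, q_eq=22, w=2.0):
    """On a sampled packet, compute Fb = -(Q - Qeq + w * dQ), where dQ is the
    queue growth since the previous sample.  Only negative Fb (congestion) is
    reflected to the reaction point, with probability increasing in |Fb|.
    Returns the quantized |Fb| to send, or None if no message is generated."""
    fb = -((q - q_eq) + w * (q - q_old))
    if fb >= 0:
        return None                             # no congestion: send nothing
    fb_q = min(FB_MAX, int(abs(fb)))            # 6-bit quantization
    p_reflect = P_MIN + (P_MAX - P_MIN) * fb_q / FB_MAX
    return fb_q if random.random() < p_reflect else None
```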

QCN: Reaction Point

[Figure: reaction-point rate evolution over time. When a congestion message is received, the Target Rate (TR) is set to the pre-cut Current Rate (CR) and CR is cut; in Fast Recovery the gap to TR is halved each cycle (Rd, Rd/2, Rd/4, Rd/8)]

  • Source (reaction point): Transmit regular Ethernet frames. When a congestion message arrives (sketched in code below):
    • Multiplicative Decrease: TR ← CR, then CR ← CR·(1 − Gd·|Fb|), with Gd chosen so the largest cut is 50%
    • Fast Recovery, similar to BIC-TCP: repeatedly halve the gap between CR and TR; gives high performance in high bandwidth-delay product networks, while being very simple
    • Active Probing: once recovered, probe for extra bandwidth by increasing TR in steps

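Here is a minimal sketch of the reaction-point behaviour described above; Gd = 1/128 (so that Gd·63 ≈ 1/2) and the 5 Mbps Active-Increase step follow the parameterization used in the simulations, but the class itself is an illustration, not the standard's pseudo-code.

```python
class QCNReactionPoint:
    """Rate limiter at the source: multiplicative decrease on a congestion
    message, then BIC-like Fast Recovery that halves the gap to the target
    rate each cycle, then Active Increase (probing) afterwards."""

    def __init__(self, line_rate_bps, gd=1.0 / 128, r_ai_bps=5e6):
        self.cr = line_rate_bps     # Current Rate
        self.tr = line_rate_bps     # Target Rate
        self.gd = gd                # decrease gain (gd * 63 ~ 0.5, i.e. at most a 50% cut)
        self.r_ai = r_ai_bps        # Active-Increase step
        self.fr_cycles_done = 0

    def on_congestion_message(self, fb_q):
        self.tr = self.cr                       # remember the pre-cut rate
        self.cr *= (1.0 - self.gd * fb_q)       # multiplicative decrease
        self.fr_cycles_done = 0                 # restart Fast Recovery

    def on_cycle_complete(self):
        """Called when a byte-counter/timer cycle completes (see the next slide)."""
        if self.fr_cycles_done < 5:             # Fast Recovery: 5 averaging cycles
            self.cr = (self.cr + self.tr) / 2.0
            self.fr_cycles_done += 1
        else:                                   # Active Increase / probing
            self.tr += self.r_ai
            self.cr = (self.cr + self.tr) / 2.0
```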

Timer-supported QCN
  • Byte-Counter
    • 5 cycles of FR (150KB per cycle)
    • AI cycles afterwards (75KB per cycle)
    • Fb < 0 (a congestion message) sends the byte-counter to FR

  • RL
    • In FR if both byte-ctr and timer in FR
    • In AI if only one of byte-ctr or timer in AI
    • In HAI if both byte-ctr and timer in AI
  • Note: RL goes to HAI only after 500 pkts have been sent

  • Timer
    • 5 cycles of FR (T msec per cycle)
    • AI cycles afterwards (T/2 msec/cycle)
    • Fb < 0 sends timer to FR
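The combination rule for the rate-limiter (RL) state can be written directly from the bullets above; a small sketch, with the 500-packet condition treated as a simple threshold check (an interpretation of the note, not the standard's pseudo-code):

```python
def rl_state(byte_ctr_state, timer_state, pkts_sent):
    """Combine the byte-counter and timer states ('FR' or 'AI') into the RL
    state: FR if both are in FR, AI if only one has reached AI, and HAI only
    when both are in AI and at least 500 packets have been sent."""
    in_ai = [byte_ctr_state == "AI", timer_state == "AI"]
    if not any(in_ai):
        return "FR"
    if all(in_ai) and pkts_sent >= 500:
        return "HAI"
    return "AI"
```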
Simulations: Basic Case
  • Parameters (collected into a configuration sketch below the figure)
    • 10 sources share a 10 G link, whose capacity drops to 0.5G during 2-4 secs
    • Max offered rate per source: 1.05G
    • RTT = 50 usec
    • Buffer size = 100 pkts (150KB); Qeq = 22
    • T = 10 msecs
    • RAI = 5 Mbps
    • RHAI = 50 Mbps

[Figure: 10 sources connected through 10 G links to a bottleneck whose capacity drops to 0.5 G]
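For convenience, the parameters listed above can be collected into a single configuration block for a simulation script; the dictionary keys below are assumed names, while the values are the ones on this slide.

```python
# Basic-case simulation parameters (key names are assumptions).
BASIC_CASE = {
    "n_sources": 10,
    "link_capacity_bps": 10e9,
    "reduced_capacity_bps": 0.5e9,       # bottleneck drops to 0.5 G...
    "reduction_interval_s": (2.0, 4.0),  # ...during seconds 2-4
    "max_offered_rate_bps": 1.05e9,      # per source
    "rtt_s": 50e-6,
    "buffer_pkts": 100,                  # 150 KB
    "q_eq_pkts": 22,
    "timer_T_s": 10e-3,
    "r_ai_bps": 5e6,
    "r_hai_bps": 50e6,
}
```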

Recovery Time

Recovery time = 80 msec

Fluid Model for QCN

[Figure: the reflection probability p = Φ(Fb), rising to 10% at |Fb| = 63]

  • Assume N flows pass through a single queue at a switch. State variables are TRi(t), CRi(t), q(t), p(t).

Accuracy: Equations vs. ns-2 simulations

[Figure: fluid-model predictions vs. ns-2 simulations for four cases: N = 10, RTT = 100 μs; N = 100, RTT = 500 μs; N = 10, RTT = 1 ms; N = 10, RTT = 2 ms]

Summary
  • The algorithm has been extensively tested in deployment scenarios of interest
    • Esp. interoperability with link-level PAUSE and TCP
    • All presentations are available at the IEEE 802.1 website:
  • The theoretical development is interesting, most notably because QCN (and BIC-TCP) display strong stability in the face of increasing lags or, equivalently, in high bandwidth-delay product networks
  • While attempting to understand why these schemes perform so well, we have uncovered a method for improving the stability of any congestion control scheme; we present this next
Background to the AP
  • When the lags in a control loop increase, the system becomes oscillatory and eventually becomes unstable
  • Feedback compensation is applied to restore stability; the two main flavors of feedback compensation are:
    • Determine lags (round trip times), apply the correct “gains” for the loop to be stable (e.g. XCP, RCP, FAST).
    • Include higher order queue derivatives in the congestion information fed back to the source (e.g. REM/PI, BCN).
    • Method 1 is not suitable for us, since we don’t know RTTs in Ethernet
    • Method 2 requires a change to the switch implementation
  • The Averaging Principle is a different method
    • It is suited to Ethernet where round trip times are unavailable
    • It doesn’t need more feedback, hence switch implementations don’t have to change
    • QCN and BIC-TCP already turn out to employ it
The Averaging Principle (AP)
  • A source in a congestion control loop is instructed by the network to decrease or increase its sending rate (randomly) periodically
  • AP: a source obeys the network whenever instructed to change rate, and then voluntarily performs averaging as below

TR = Target Rate

CR = Current Rate
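One plausible reading of this rule in code, as a sketch (the exact averaging schedule from the figure is not recoverable here): after obeying each instructed rate change, the source voluntarily moves its rate half-way back toward the rate it held before the change.

```python
def ap_source(instructed_rates, r0=1.0):
    """Apply the Averaging Principle to a sequence of instructed rates.
    Returns (obeyed_rate, averaged_rate) pairs; the averaged rate is the one
    carried into the next period."""
    out, r = [], r0
    for new_rate in instructed_rates:
        prev = r
        r = new_rate                  # obey the network's instruction
        r = (r + prev) / 2.0          # then voluntarily average half-way back
        out.append((new_rate, r))
    return out

# Example: the network alternately halves and restores the rate
print(ap_source([0.5, 1.0, 0.5, 1.0]))
```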

Recall: QCN does 5 steps of Averaging

[Figure: the reaction-point rate plot again. After a congestion message, CR is cut, TR holds the pre-cut rate, and Fast Recovery halves the gap (Rd, Rd/2, Rd/4, Rd/8) on each cycle]

  • In the Fast Recovery portion of QCN, there are 5 steps of averaging
  • In fact, QCN and BIC-TCP are the Averaging Principle applied to TCP!

Applying the AP to RCP: Rate Control Protocol (Dukkipati and McKeown)
  • A router computes an upper bound R on the rate of all flows traversing it.
  • R recomputed every T (= 10) msec as follows:

    R(t) = R(t − T) · [ 1 + (T/d0) · ( α·(C − y(t)) − β·q(t)/d0 ) / C ]

Where
    • d0: round trip time estimate (set constant = 10 msec in our case)
    • C: link capacity (= 2.4 Gbps)
    • q(t): current queue size at the switch
    • y(t): incoming rate
    • α = 0.1
    • β = 1
  • A flow chooses the smallest advertised rate on its path.
  • We consider a scenario where 10 RCP sources share a single link.
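A small sketch of this update and of an AP-enabled RCP source is given below; the clamp on R is an assumption added for robustness, not part of the algorithm as stated.

```python
def rcp_update(R_prev, y, q_bits, C=2.4e9, d0=0.010, T=0.010, alpha=0.1, beta=1.0):
    """One RCP recomputation of the advertised rate R (every T seconds),
    following the update formula above.  y is the measured incoming rate in
    bps, q_bits the current queue size in bits."""
    R = R_prev * (1.0 + (T / d0) * (alpha * (C - y) - beta * q_bits / d0) / C)
    return min(max(R, 0.001 * C), C)    # assumed clamp: keep R positive and <= C

def ap_rcp_source(prev_rate, advertised_rate):
    """AP applied at an RCP source: obey the advertised rate, then move
    half-way back toward the previous rate before the next advertisement."""
    return (advertised_rate + prev_rate) / 2.0
```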
AP-RCP Stability

[Figure: RTT = 60 msec and RTT = 65 msec]

AP-RCP Stability cont’d

[Figure: RTT = 120 msec and RTT = 130 msec]

AP-RCP Stability cont’d

[Figure: RTT = 230 msec and RTT = 240 msec]

Understanding the AP
  • As mentioned earlier, the two major flavors of feedback compensation are:
    • Determine lags, choose appropriate gains
    • Feedback higher derivatives of state
  • We prove that the AP is, in a sense, equivalent to both of the above!
    • This is great because we don’t need to change network routers and switches
    • And the AP is really very easy to apply; no lag-dependent optimizations of gain parameters needed
AP Equivalence: Single Source Case

[Figure: System 1 is a source that performs the AP, driven by feedback Fb; System 2 is a regular source driven by 0.5·Fb + 0.25·T·dFb/dt]

  • Systems 1 and 2 are discrete-time models for an AP enabled source, and a regular source respectively.
  • Main Result: Systems 1 and 2 are algebraically equivalent. That is, given identical input sequences, they produce identical output sequences.
    • Therefore the AP is equivalent to adding a derivative to the feedback and reducing the gain!
    • Thus, the AP does both known forms of feedback compensation without knowing RTTs or changing switch implementations
AP-RCP vs PD-RCP

[Figure: AP-RCP vs. PD-RCP at RTT = 120 msec and RTT = 130 msec]

A Generic Control Example
  • As an example, we consider the plant transfer function:

P(s) = (s + 1) / (s^3 + 1.6 s^2 + 0.8 s + 0.6)

Step Response: Two-step AP, Delay = 25 seconds

[Figure: step response of the loop with the two-step AP at a delay of 25 seconds]

The two-step AP is even more stable than the basic AP.

Summary of AP
  • The AP is a simple method for making many control loops (not just congestion control loops) more robust to increasing lags
  • Gives a clear understanding of why the BIC-TCP and QCN algorithms have such good delay tolerance: they do averaging repeatedly
    • There is a theorem which deals explicitly with the QCN-type loop
    • Variations of the basic principle are possible; e.g. average more than once, average by more than half-way, etc.
    • The theory is fairly complete in these cases

Background: TCP Buffer Sizing

  • Standard “rule of thumb”:
    • Single TCP flow: Bandwidth × Delay worth of buffering needed for 100 % utilization.
  • Recent result (Appenzeller et al.):
    • For N >> 1 TCP flows: Bandwidth × Delay / sqrt(N) amount of buffering is enough (see the worked example below).
    • The essence of this result is that when many flows combine, the variance of the net sending rate decreases.
  • Buffer sizing problem is challenging in data centers:
    • Typically, only a small number of flows are active on each path. (N is small)
    • Ethernet switches are typically built with shallow buffers to keep costs down.
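A quick worked example of these two sizing rules in Python, under assumed data-center-like values (10 Gbps links, 500 μs RTT):

```python
import math

def buffer_needed_bytes(bandwidth_bps, rtt_s, n_flows):
    """Rule-of-thumb TCP buffer sizing: Bandwidth x Delay for a single flow,
    Bandwidth x Delay / sqrt(N) for N >> 1 flows (Appenzeller et al.)."""
    bdp_bytes = bandwidth_bps * rtt_s / 8.0
    return bdp_bytes if n_flows <= 1 else bdp_bytes / math.sqrt(n_flows)

# 10 Gbps, 500 us RTT (assumed values)
print("N = 1  : %.0f KB" % (buffer_needed_bytes(10e9, 500e-6, 1) / 1e3))    # ~625 KB
print("N = 100: %.1f KB" % (buffer_needed_bytes(10e9, 500e-6, 100) / 1e3))  # ~62.5 KB
```

With only a handful of flows, the required buffer stays near the full bandwidth-delay product, which is why the 150 KB switch buffer in the next example is a real constraint for TCP.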

Example: Simulation Setup


  • 10 Gig Ethernet
  • Switch buffer is 150 Kbytes deep.
  • We compare TCP and QCN for various # of flows, and RTTs.
TCP vs QCN (N = 1, RTT = 120 μs)

TCP: Throughput = 99.5%, Standard Deviation = 265.4 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 13.8 Mbps

TCP vs QCN (N = 1, RTT = 250 μs)

TCP: Throughput = 95.5%, Standard Deviation = 782.7 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 33.3 Mbps

TCP vs QCN (N = 1, RTT = 500 μs)

TCP: Throughput = 88%, Standard Deviation = 1249.7 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 95.4 Mbps

TCP vs QCN (N = 10, RTT = 120 μs)

TCP: Throughput = 99.5%, Standard Deviation = 625.1 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 25.1 Mbps

TCP vs QCN (N = 10, RTT = 250 μs)

TCP: Throughput = 95.5%, Standard Deviation = 981 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 27.2 Mbps

TCP vs QCN (N = 10, RTT = 500 μs)

TCP: Throughput = 89%, Standard Deviation = 1311.4 Mbps
QCN: Throughput = 99.5%, Standard Deviation = 170.5 Mbps


QCN and shallow buffers

  • In contrast to TCP, QCN is stable with shallow buffers, even with few sources.
  • Why?
    • Recall that buffer requirements are closely related to sending rate variance:

Buffer size = C × Var(R1) × Bandwidth × Delay / sqrt(N)

  • TCP:
    • Good performance for large N, since the denominator is large.
  • QCN:
    • Good performance for all N, since the numerator is small.
  • Thus, averaging reduces the variance of a source’s sending rate
    • This is a stochastic interpretation of the Averaging Principle’s success in keeping stability with shallow buffers
Conclusions
  • We have seen the background, development and analysis of a congestion control scheme for the IEEE 802.1 Ethernet standard
  • The QCN algorithm
    • Is more stable with respect to control loop delays
    • Requires much smaller buffers than TCP
    • Is easy to build in hardware
  • The Averaging Principle is interesting and we’re exploring its use in nonlinear control systems