Tutorial Survey of LL-FC Methods for Datacenter Ethernet 101 Flow Control

M. Gusat

Contributors: Ton Engbersen, Cyriel Minkenberg, Ronald Luijten and Clark Jeffries

26 Sept. 2006

IBM Zurich Research Lab

Outline
  • Part I
    • Requirements of datacenter link-level flow control (LL-FC)
    • Brief survey of top 3 LL-FC methods
      • PAUSE, a.k.a. On/Off grants
      • Credit
      • Rate
    • Baseline performance evaluation
  • Part II
    • Selectivity and scope of LL-FC
      • per what? (LL-FC's resolution)
Req'ts of the '.3x' Successor: Next Generation of Ethernet Flow Control for Datacenters
  • Lossless operation
    • No-drop expectation of datacenter apps (storage, IPC)
    • Low latency
  • Selective
    • Discrimination granularity: link, prio/VL, VLAN, VC, flow...?
    • Scope: backpressure upstream one hop, k hops, e2e...?
  • Simple...
    • PAUSE-compatible!!

Generic LL-FC System
  • One link with 2 adjacent buffers: TX (SRC) and RX (DST)
    • Round trip time (RTT) per link is system’s time constant
  • LL-FC issues:
    • link traversal (channel Bw allocation)
    • RX buffer allocation
    • pairwise communication between the channel's terminations
      • signaling overhead (PAUSE, credit, rate commands)
    • backpressure (BP):
      • increase / decrease injections
      • stop and restart protocol

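The RTT as the system's time constant can be made concrete with one small calculation: the data in flight on a link, and hence the minimum RX buffering any lossless scheme must cover, is bandwidth times round-trip time. A minimal sketch in Python; the link parameters below are illustrative assumptions, not values from the slides:

    # Bandwidth-delay product: the amount of data "in flight" per link.
    # Any lossless LL-FC scheme must absorb at least this much after it
    # signals "stop": the stop takes RTT/2 to arrive, and already-launched
    # data keeps arriving for the other RTT/2.
    def in_flight_bytes(bandwidth_bps: float, rtt_s: float) -> float:
        return bandwidth_bps * rtt_s / 8.0

    # Example: 10 Gb/s datacenter link, 1 us round trip (cable + logic).
    print(in_flight_bytes(10e9, 1e-6))  # 1250 bytes in flight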

FC-Basics: PAUSE (On/Off Grants)

[Figure: TX queues, fed by the Xbar, send over the data link into the RX buffer, which drains to OQs and downstream links. When RX occupancy crosses the threshold ("overrun"), the RX sends STOP on the FC return path; once occupancy falls back below the threshold, it sends GO. PAUSE BP semantics: STOP / GO / STOP...]

* Note: Selectivity and granularity of FC domains are not considered here.
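The STOP/GO loop in the figure can be sketched in a few lines of Python. This is a minimal illustrative model, not the 802.3x state machine: time is slotted, the buffer size and drain rate are assumptions, and the forward data latency is folded into the command return path, so a command issued at slot t takes effect at slot t + RTT.

    # Minimal on/off (PAUSE) flow-control model over one link.
    RTT = 4                      # round trip in slots (assumed)
    M = 2 * RTT + 1              # RX buffer size (see the sizing slides below)
    STOP_THRESHOLD = M - RTT     # leave RTT packets of overrun headroom

    rx_occupancy = 0
    commands = [None] * RTT      # STOP/GO commands in flight back to the TX
    tx_stopped = False

    for t in range(40):
        cmd = commands.pop(0)            # command issued RTT slots ago arrives
        if cmd is not None:
            tx_stopped = (cmd == "STOP")
        if not tx_stopped:               # TX injects one packet per slot
            rx_occupancy += 1
            assert rx_occupancy <= M, "overrun => drop"
        if t % 2 == 0 and rx_occupancy > 0:
            rx_occupancy -= 1            # slow receiver: drains every 2nd slot
        # RX decides STOP above the threshold, GO below it.
        commands.append("STOP" if rx_occupancy >= STOP_THRESHOLD else "GO")

Note how restart bubbles arise: after a STOP takes effect the RX keeps draining, but the GO it sends needs another RTT to reach the TX, so with too little memory the buffer can run empty while the link idles.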

FC-Basics: Credits

[Figure: credit-based FC over the same Xbar / data link / RX buffer setup as the PAUSE figure; the TX spends one credit per packet sent, and the RX returns credits on the FC return path as buffer locations free up.]

* Note: Selectivity and granularity of FC domains are not considered here.
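The credit scheme admits an equally small sketch, again with assumed parameters. The invariant that makes it lossless at any RTT is visible directly: the TX only sends while holding a credit, and credits map one-to-one onto RX buffer locations.

    # Minimal credit-based flow-control model over one link.
    RTT = 4                        # round trip in slots (assumed)
    M = 1                          # RX buffer locations == initial TX credits

    credits = M
    in_flight = []                 # (return_slot, n) credits travelling back
    sent = 0

    for t in range(100):
        credits += sum(n for (due, n) in in_flight if due == t)
        in_flight = [(due, n) for (due, n) in in_flight if due != t]
        if credits > 0:            # may send only while holding a credit
            credits -= 1
            sent += 1
            # RX frees the location and returns the credit; the send plus
            # the credit return together take one RTT.
            in_flight.append((t + RTT, 1))

    print(sent / 100)              # ~0.25: one credit loops once per RTT

With M = 1 and RTT = 4 this reproduces the 25% utilisation derived on the next slide.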

Correctness: Min. Memory for "No Drop"
  • "Minimum": to operate lossless => O(RTT_link)
    • Credit: 1 credit = 1 memory location
    • Grant: 5 (= RTT+1) memory locations
  • Credits
    • Under full load the single credit loops constantly between RX and TX; with RTT = 4 => max. performance = f(up-link utilisation) = 25%
  • Grants
    • Determined by slow restart: once the last packet has left the RX queue, it takes an RTT until the next packet arrives
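The 25% figure generalizes: with M credits, at most M packets can be outstanding per round trip, so credit throughput is min(1, M/RTT) of line rate. A one-line sketch of the relation (only the slide's RTT = 4 is assumed):

    # Credit-FC throughput as a fraction of line rate.
    def credit_throughput(m_credits: int, rtt: int) -> float:
        return min(1.0, m_credits / rtt)        # at most M packets per RTT

    for m in (1, 2, 4, 5):
        print(m, credit_throughput(m, rtt=4))   # 0.25, 0.5, 1.0, 1.0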
PAUSE vs. Credit @ M = RTT+1
  • "Equivalent" = 'fair' comparison
    • Credit scheme: 5 credits = 5 memory locations
    • Grant scheme: 5 (= RTT+1) memory locations
  • Performance loss for PAUSE/Grants is due to the lack of underflow protection: if M < 2*RTT the link is not work-conserving (pipeline bubbles on restart)
  • For performance equivalent to credits, PAUSE requires M = 9 (= 2*RTT+1).
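The sizing rules from these two slides can be collected in one place; a small sketch of the arithmetic as I read it (credits fill the pipe at M = RTT, while PAUSE needs RTT of overrun headroom above the stop threshold plus RTT of underrun headroom below it):

    # Minimum RX memory (in packets) for lossless AND work-conserving
    # operation, per the sizing arguments above.
    def min_mem_credit(rtt: int) -> int:
        return rtt               # M = RTT credits keep the link busy

    def min_mem_pause(rtt: int) -> int:
        return 2 * rtt + 1       # headroom against overrun and underrun

    print(min_mem_credit(4), min_mem_pause(4))   # 4 9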

FC-Basics: Rate
  • RX queue Qi = 1 (full capacity).
  • Max. flow (input arrivals) during one timestep (Δt = 1) is 1/8.
  • Goal: update the TX probability Ti of any sending node during the time interval [t, t+1) to obtain the new Ti applied during [t+1, t+2).
  • Algorithm for obtaining Ti(t+1) from Ti(t) ... =>
  • Initially the offered rate from source0 was set to .100 and from source1 to .025. All other processing rates were .125. Hence all queues show low occupancy.
  • At timestep 20, the flow rate to the sink was reduced to .050, causing a congestion level in Queue2 of .125/.050 = 2.5 times the processing capacity.
  • Results: the average queue occupancies are .23 to .25, except Q3 = .13. The source flows are treated about equally and their long-term sum is about .050 (optimal).
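The slide elides the actual Ti update rule (the "... =>" points to a follow-on chart), so the sketch below is a generic stand-in, not the presentation's algorithm: an AIMD-style controller that backs off multiplicatively when the observed queue exceeds a set point and probes additively otherwise. All constants are assumptions.

    # Illustrative rate update; NOT the algorithm from the talk.
    def next_tx_prob(t_i: float, queue_occupancy: float,
                     setpoint: float = 0.25,
                     inc: float = 0.01, dec: float = 0.5) -> float:
        if queue_occupancy > setpoint:
            return max(0.0, t_i * dec)   # congestion: back off multiplicatively
        return min(1.0, t_i + inc)       # headroom: probe additively

    # Per timestep: observe the queue during [t, t+1), compute Ti(t+1),
    # then transmit with probability Ti(t+1) during [t+1, t+2).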
Conclusion Part I: Which Scheme Is "Better"?
  • PAUSE
    • + simple
    • + scalable (lower signalling overhead)
    • - 2x M size required
  • Credits (absolute or incremental)
    • + always lossless, independent of the RTT and memory size
    • + adopted by virtually all modern ICTNs (IBA, PCIe, FC, HT, ...)
    • - not trivial for buffer-sharing
    • - protocol reliability
    • - scalability
    • At equal M = RTT, credits show 30+% higher Tput vs. PAUSE
    • *Note: Stability of both was formally proven here
  • Rate: in-between PAUSE and credits
    • + adopted in adapters
    • + potential good match for BCN (e2e CM)
    • - complexity (cheap fast bridges)

Part II: Selectivity and Scope of LL-FC: "Per-Prio/VL PAUSE"
  • The FC-ed 'link' could be a
    • physical channel (e.g. 802.3x)
    • virtual lane (VL, e.g. IBA 2-16 VLs)
    • virtual channel (VC, larger figure)
    • ...
  • Per-Prio/VL PAUSE is the often-proposed "PAUSE v2.0"...
  • Yet, is it good enough for the next decade of datacenter Ethernet?
  • Evaluation of IBA vs. PCIe/AS vs. NextGen-Bridge (Prizma CI)
Already Implemented in IBA (and other ICTNs...)
  • IBA has 15 FC-ed VLs for QoS
    • SL-to-VL mapping is performed per hop, according to capabilities
  • However, IBA doesn’t have VOQ-selective LL-FC
    • “selective” = per switch (virtual) output port
  • So what?
    • Hogging: a.k.a. buffer monopolization, HOL1-blocking, output queue lockup, single-stage congestion, saturation tree (k=0)
  • How can we prove that hogging really occurs in IBA?
    • A. Back-of-the-envelope reasoning
    • B. Analytical modeling of stability and work-conservation (papers available)
    • C. Comparative simulations: IBA, PCI-AS etc. (next slides)
IBA SE Hogging Scenario
  • Simulation: parallel backup to a RAID across an IBA switch
    • TX / SRC
      • 16 independent IBA sources, e.g. 16 “producer” CPU/threads
      • SRC behavior: greedy, using any communication model (UD)
      • SL: BE service discipline on a single VL
        • (the other VLs suffer from their own)
    • Fabrics (single stage)
      • 16x16 IBA generic SE
      • 16x16 PCI-AS switch
      • 16x16 Prizma CI switch
    • RX / DST
      • 16 HDD “consumers”
      • t0 : initially each HDD sinks data at full 1x (100%)
      • tsim : during simulation HDD[0] enters thermal recalibration or sector remapping; consequently
          • HDD[0] progressively slows down its incoming link throughput: 90, 80,..., 10%
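The hogging this scenario exposes can be reproduced with a toy model. A minimal sketch (all parameters are assumptions, not taken from the presentation's simulator): 16 greedy sources share one RX buffer whose LL-FC checks only total occupancy, with no per-destination (VOQ) selectivity. When HDD[0] slows to 10%, its backlog monopolizes the shared buffer and drags the whole fabric down toward the slow consumer's rate.

    import random

    # Toy single-stage hogging model: one shared RX buffer and on/off FC
    # with no per-destination (VOQ) selectivity. Parameters are assumed.
    N = 16                  # ports: 16 greedy sources -> 16 HDD consumers
    SHARED_BUF = 64         # shared buffer, in packets
    SLOW_RATE = 0.10        # HDD[0] has slowed to 10% of line rate
    SLOTS = 20_000

    buf = [0] * N           # per-destination backlog in the shared buffer
    delivered = [0] * N

    for t in range(SLOTS):
        # Inject: every source offers one packet per slot; the
        # non-selective FC checks only TOTAL occupancy, so a full buffer
        # pauses all sources alike. Rotate the order for fairness.
        for k in range(N):
            dst = (t + k) % N
            if sum(buf) < SHARED_BUF:
                buf[dst] += 1
        # Drain: HDD[0] at SLOW_RATE, all other consumers at line rate.
        for dst in range(N):
            rate = SLOW_RATE if dst == 0 else 1.0
            if buf[dst] > 0 and random.random() < rate:
                buf[dst] -= 1
                delivered[dst] += 1

    print("per-port throughput:", sum(delivered) / (N * SLOTS))
    # Tends toward ~0.1: HDD[0]'s backlog hogs the shared buffer, so ALL
    # flows get throttled to the slowest consumer's rate ("transistor
    # effect"); VOQ-selective FC would keep the 15 fast flows near 1.0.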
First: Friendly Bernoulli Traffic
  • 2 sources (A, B) sending @ (12x + 4x) to 16 * 1x end nodes (C..R)

[Fig. from IBA Spec: fabric topology with end nodes C..R]

[Plot: aggregate throughput vs. link 0 throughput reduction; achievable performance vs. actual IBA performance, the gap being the throughput loss]

Myths and Fallacies about Hogging
  • Isn’t IBA’s static rate control sufficient?
  • No, because it is STATIC
  • IBA’s VLs are sufficient...?!
  • No.
    • VLs and ports are orthogonal dimensions of LL-FC
      • 1. VLs are for SL and QoS => VLs are assigned to prios, not ports!
      • 2. Max. no. of VLs = 15 << max (SE_degree x SL) = 4K
  • Can the SE buffer partitioning solve hogging, blocking and sat_trees, at least in single SE systems?
  • No.
    • 1. Partitioning makes sense only w/ Status-based FC (per bridge output port - see PCIe/AS SBFC);
      • IBA doesn’t have a native Status-based FC
    • 2. Sizing becomes the issue => we need dedication per I and O ports
      • M = O( SL * max{RTT, MTU} * N^2 ), a very large number!
      • Academic papers and theoretical dissertations prove stability and work-conservation, but the amounts of required M are large
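To see why this M blows up, plug in plausible numbers; the values below are illustrative assumptions (not from the slides) for a 64-port bridge:

    # Dedicated-buffer sizing per the O(SL * max{RTT, MTU} * N^2) bound.
    # All values are assumptions chosen only to show the order of magnitude.
    SL = 16            # service levels
    MTU = 2048         # bytes; assume MTU dominates the RTT*bandwidth term
    N = 64             # bridge ports

    M = SL * MTU * N * N
    print(f"{M / 2**20:.0f} MiB")   # 128 MiB of on-chip buffering: too costly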
Conclusion Part II: Selectivity and Scope of LL-FC
  • Despite 16 VLs, IBA/DCE is exposed to the "transistor effect": any single flow can modulate the aggregate Tput of all the others
  • Hogging (HOL1-blocking) requires a solution even for the smallest IBA/DCE system (single hop)
  • Prios/VL and VOQ/VC are 2 orthogonal dimensions of LL-FC
    • Q: QoS violation as price of 'non-blocking' LL-FC?
  • Possible granularities of LL-FC queuing domains:
    • A. In single-hop fabrics, CM can also serve as LL-FC
    • B. Introduce VOQ-FC: intermediate coarser grain
      • no. VCs = max{VOQ} * max{VL} = 64..4096 x 2..16 <= 64K VCs
    • Alternative: 802.1p (map prios to 8 VLs) + .1q (map VLANs to 4K VCs)?
      • Was proposed in 802.3ar...

LL-FC Between Two Bridges

[Figure: Switch[k]'s TX Port[k,j] faces Switch[k+1]'s RX Port[k+1,i]. On the TX side, VOQ[1]..VOQ[n] feed a TX scheduler and a TX unit that issues "send packet"; on the RX side, an RX management unit (buffer allocation) watches the RX buffer, and LL-FC tokens flow back along the "return path of LL-FC token" to the TX's LL-FC reception block.]
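The figure suggests the VOQ-FC handshake argued for in Part II. Below is a minimal sketch of per-VOQ (per downstream output port) credit accounting between two bridges; the names and structure are mine, chosen to mirror the figure's blocks, not taken from an actual implementation:

    # Per-VOQ credit accounting between Switch[k] (TX) and Switch[k+1] (RX).
    # Unlike plain PAUSE, a stalled VOQ blocks only its own output port.
    N_VOQ = 16
    CREDITS_PER_VOQ = 4           # dedicated RX buffer slots per VOQ (assumed)

    class TxPort:
        def __init__(self):
            self.credits = [CREDITS_PER_VOQ] * N_VOQ
            self.voq = [[] for _ in range(N_VOQ)]   # queued packets per VOQ

        def schedule(self):
            """TX scheduler: send from any VOQ holding both a packet and
            a credit; a hogging VOQ simply becomes ineligible."""
            for q in range(N_VOQ):
                if self.voq[q] and self.credits[q] > 0:
                    self.credits[q] -= 1
                    return q, self.voq[q].pop(0)    # "send packet"
            return None

        def on_fc_token(self, q):
            """LL-FC reception: RX freed one buffer slot of VOQ q."""
            self.credits[q] += 1

When the downstream port behind VOQ 0 stalls, VOQ 0 runs out of credits and the scheduler keeps serving VOQs 1..15; this is exactly the selective behavior that plain, link-wide PAUSE cannot express.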

ad