E2CM updates
IEEE 802.1 Interim @ Geneva
Cyriel Minkenberg & Mitch Gusat
IBM Research GmbH, Zurich
May 29, 2007
Outline
• Summary of E2CM proposal
  • How it works
  • What has changed
• New E2CM performance results
  • Managing across a non-CM domain
  • Performance in fat tree topology
  • Mixed link speeds (1G/10G)
Refresher: E2CM Operation
• Probing is triggered by BCN frames; only rate-limited flows are probed
• Insert one probe every X KB of data sent per flow, e.g. X = 75 KB
• Probes traverse the network in-band: the objective is to observe the real current queuing delay
• Variant: continuous probing (used here)
• Per flow, BCN and probes employ the same rate limiter
  • Control per-flow (probe) as well as per-queue (BCN) occupancy
• CPID of probes = destination MAC
  • The rate limiter is associated with the CPID from which the last negative feedback was received
  • Increment only on probes from the associated CPID
• Parameters relating to probes may be set differently (in particular Qeq,flow, Qmax,flow, Gd,flow, Gi,flow)

[Diagram: src → Switch 1 → Switch 2 → Switch 3 → dst, with callouts:]
• Qeq exceeded at a switch: send BCN to source
• BCN arrives at source: install rate limiter, inject probe w/ timestamp
• Probe arrives at dst: insert timestamp, return probe to source
• Probe arrives at source: path occupancy computed, AIMD control applied using the same rate limiter
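The last two callouts above can be sketched in a few lines. This is an illustrative assumption, not the proposal's actual frame handling: the exact probe format and occupancy computation live in the referenced E2CM document, and `path_occupancy`, `needs_decrease`, and the RTT-based estimate are hypothetical names chosen here.

```python
# Hypothetical sketch of turning a returned probe into a path-occupancy
# estimate (assumption: occupancy is inferred from the probe's round trip
# minus the zero-load round trip, scaled by the link rate).

LINK_RATE_BPS = 10e9 / 8   # 10 Gb/s link expressed in bytes/s

def path_occupancy(probe_rtt_s, base_rtt_s, link_rate=LINK_RATE_BPS):
    """Estimate queued bytes along the path from one probe round trip.

    Queuing delay = measured RTT minus the zero-load RTT (propagation +
    transmission); multiplying by the link rate converts it to bytes.
    """
    queuing_delay = max(0.0, probe_rtt_s - base_rtt_s)
    return queuing_delay * link_rate

def needs_decrease(occupancy_bytes, qeq_flow=7.5e3):
    """Per-flow decision against Qeq,flow (7.5 KB in this deck):
    occupancy above the per-flow equilibrium calls for a rate decrease."""
    return occupancy_bytes > qeq_flow
```

With a 6 us queuing delay on a 10G path, the estimate is 7.5 KB of queued data, exactly the per-flow equilibrium Qeq,flow used later in the parameter tables.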
Synergies
• “Added value” of E2CM
  • Fair and stable rate allocation
    • Fine granularity owing to per-flow end-to-end probing
  • Improved initial-response and queue-convergence speed
  • Transparent to network
    • Purely end-to-end, no (additional) burden on bridges
• “Added value” of ECM
  • Fast initial response
    • Feedback travels straight back to source
  • Capped aggregate queue length for large-degree hotspots
    • Controls sum of per-flow queue occupancies
Modifications since March proposal
• See also au-sim-ZRL-E2CM-src-based-r1.2.pdf
Coexistence of CM and non-CM domains
• The concern has been raised that an end-to-end scheme requires global deployment
• We consider the case where a non-CM switch exists in the path of the congesting flows
• CM messages are terminated at the edge of the domain
  • Cannot relay notifications across the non-CM domain
  • Cannot control congestion inside the non-CM domain
• Non-CM (legacy) bridge behavior
  • Does not generate or interpret any CM notifications
  • Can it relay CM notifications as regular frames?
    • May depend on the bridge implementation
    • The next results make this assumption
Managing across a non-CM domain

[Diagram: nodes 1-5 feed switches 1-3 (CM domain), which connect through switch 4 (non-CM domain) to switch 5 (CM domain) and on to nodes 6 and 7; all links loaded at 100%]

• Switches 1, 2, 3 & 5 are in congestion-managed domains; switch 4 is in a non-congestion-managed domain
• Four hot flows of 10 Gb/s each from nodes 1, 2, 3, 4 to node 6 (hotspot)
• One cold (lukewarm) flow of 10 Gb/s from node 5 to node 7
• Max-min fair allocation provides 2.0 Gb/s to each flow
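The max-min fair rates quoted on this and the later mixed-speed slides follow from progressive filling. A minimal sketch, not part of the simulator, just a check of the quoted numbers:

```python
def max_min_shares(demands, capacity):
    """Progressive filling (water-filling): flows whose demand is below
    the current fair level receive their demand; the remaining flows
    split the leftover capacity equally."""
    alloc = {}
    remaining = dict(demands)
    cap = float(capacity)
    while remaining:
        level = cap / len(remaining)
        bounded = {f: d for f, d in remaining.items() if d <= level}
        if not bounded:
            # No flow is demand-limited: everyone gets the fair level.
            for f in remaining:
                alloc[f] = level
            break
        for f, d in bounded.items():
            alloc[f] = d
            cap -= d
            del remaining[f]
    return alloc

# Five 10 Gb/s flows (four hot + one cold) share the 10 Gb/s hotspot link:
shares = max_min_shares({f"flow{i}": 10.0 for i in range(5)}, 10.0)
# Each flow ends up with 2.0 Gb/s, as stated on the slide.
```

The same routine reproduces the mixed-speed figures later in the deck, e.g. ten 0.5 Gb/s hot flows on a 1 Gb/s (10%-rate) hotspot each get 0.1 Gb/s = 12.5 MB/s.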
Simulation Setup & Parameters

Traffic
• Mean flow size = [1’500, 60’000] B
• Geometric flow size distribution
• Source stops sending at T = 1.0 s
• Simulation runs to completion (no frames left in the system)

Scenario
• See previous slide

Switch
• Radix N = 2, 3, 4
• M = 150 KB/port
• Link time of flight = 1 us
• Partitioned memory per input, shared among all outputs
  • No limit on per-output memory usage
• PAUSE enabled or disabled
  • Applied on a per-input basis based on local high/low watermarks
  • watermark_high = 141.5 KB, watermark_low = 131.5 KB
  • If disabled, frames are dropped when the input partition is full

Adapter
• Per-node virtual output queuing, round-robin scheduling
• No limit on number of rate limiters
• Ingress buffer size = unlimited, round-robin VOQ service
• Egress buffer size = 150 KB
• PAUSE enabled
  • watermark_high = 141.5 KB, watermark_low = 131.5 KB

ECM
• W = 2.0
• Qeq = 37.5 KB (= M/4)
• Gd = 0.5 / ((2*W+1)*Qeq)
• Gi0 = (Rlink / Runit) / ((2*W+1)*Qeq)
• Gi = 0.1 * Gi0
• Psample = 2% (on average 1 sample every 75 KB)
• Runit = Rmin = 1 Mb/s
• BCN_MAX enabled, threshold = 150 KB
• BCN(0,0) disabled
• Drift enabled (1 Mb/s every 10 ms)

E2CM (per-flow)
• Continuous probing
• Wflow = 2.0
• Qeq,flow = 7.5 KB
• Gd,flow = 0.5 / ((2*W+1)*Qeq,flow)
• Gi,flow = 0.01 * (Rlink / Runit) / ((2*W+1)*Qeq,flow)
• Psample = 2% (on average 1 sample every 75 KB)
• Runit = Rmin = 1 Mb/s
• BCN_MAXflow enabled, threshold = 30 KB
• BCN(0,0)flow disabled
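The gain formulas in the table can be evaluated directly. A sketch under two assumptions: the Gi0 expression divides by (2W+1)·Qeq (matching the per-flow formula), and the rate limiter follows the usual BCN-style AIMD rule (multiplicative decrease on negative feedback, additive increase on positive); the function names are illustrative.

```python
# Parameter values copied from the table above.
W, QEQ = 2.0, 37.5e3            # per-queue (BCN) settings
W_FLOW, QEQ_FLOW = 2.0, 7.5e3   # per-flow (probe) settings
R_LINK, R_UNIT, R_MIN = 10e9, 1e6, 1e6

def gains(w, qeq, gi_scale, r_link=R_LINK, r_unit=R_UNIT):
    """Gd = 0.5 / ((2W+1)*Qeq); Gi = scale * (Rlink/Runit) / ((2W+1)*Qeq)."""
    denom = (2.0 * w + 1.0) * qeq
    return 0.5 / denom, gi_scale * (r_link / r_unit) / denom

GD, GI = gains(W, QEQ, gi_scale=0.1)
GD_FLOW, GI_FLOW = gains(W_FLOW, QEQ_FLOW, gi_scale=0.01)

def aimd(rate, fb, gd=GD, gi=GI):
    """Assumed BCN-style rate-limiter update, clamped to [Rmin, Rlink]."""
    if fb < 0:
        rate *= 1.0 + gd * fb      # multiplicative decrease
    else:
        rate += gi * fb * R_UNIT   # additive increase
    return min(max(rate, R_MIN), R_LINK)
```

Because Qeq,flow is five times smaller than Qeq, the per-flow decrease gain Gd,flow comes out larger than the per-queue Gd, i.e. the probe loop reacts more aggressively per byte of excess occupancy.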
E2CM: Per-flow throughput
[Figure: per-flow throughput over time, Bernoulli vs. bursty traffic, PAUSE disabled vs. enabled; max-min fair rates marked for reference]
E2CM: Per-node throughput
[Figure: per-node throughput over time, Bernoulli vs. bursty traffic, PAUSE disabled vs. enabled; max-min fair rates marked for reference]
E2CM: Switch queue length
[Figure: switch queue length over time, Bernoulli vs. bursty traffic, PAUSE disabled vs. enabled; stable OQ level marked]
Frame drops, flow completions, FCT
• Mean FCT is longer w/ PAUSE
  • All flows are accounted for (w/o PAUSE not all flows completed)
  • The absence of PAUSE heavily skews the results; in particular, the hot flows have much longer FCT w/ PAUSE
• Cold flow FCT is independent of burst size!
• Load compression: flows wait for a long time in the adapter before being injected
  • FCT is dominated by adapter latency
  • Cold traffic also traverses the hotspot and therefore suffers from compression
Fat tree network
• Fat trees enable scaling to arbitrarily large networks with constant (full) bisection bandwidth
• We use static, destination-based, shortest-path routing
• For more details on construction and routing see au-sim-ZRL-fat-tree-build-and-route-r1.0.pdf
Fat tree: Folded representation

[Diagram: 16-node, 3-level folded fat tree with the spine at level 2, up/down links between levels 0-2, and its unfolding to a 5-stage Benes network; switches labeled (stageID, switchID), stages 0-4, nodes 0-15 on the left and right edges]

Conventions
• N = no. of bidirectional ports per switch
• L = no. of levels (folded)
• S = no. of stages = 2L-1 (unfolded)
• M = no. of end nodes = N*(N/2)^(L-1)
• Switches are labeled (stageID, switchID), with stageID ∈ [0, S-1] and switchID ∈ [0, (N/2)^(L-1) - 1]
• Number of switches per stage = (N/2)^(L-1)
• Total number of switches = (2L-1)*(N/2)^(L-1)
• Nodes are connected at the left and right edges
  • Left nodes are numbered 0 through M/2-1
  • Right nodes are numbered M/2 through M-1
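The conventions above fully determine the network sizes. A small sketch evaluating them, checked against the 16-node (3-level) and 32-node (4-level) configurations simulated next:

```python
def fat_tree_dims(n, l):
    """Sizes of a folded fat tree per the slide's conventions:
    radix-n switches, l folded levels, unfolded to s = 2l-1 stages."""
    per_stage = (n // 2) ** (l - 1)   # (N/2)^(L-1) switches per stage
    return {
        "end_nodes": n * per_stage,         # M = N*(N/2)^(L-1)
        "stages": 2 * l - 1,                # S = 2L-1
        "switches_per_stage": per_stage,
        "total_switches": (2 * l - 1) * per_stage,
    }

# N=4, L=3: 16 end nodes, 5 stages, 4 switches/stage, 20 switches total,
# matching the diagram above. N=4, L=4 gives the 32-node case.
```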
Simulation Setup & Parameters

Traffic
• Mean flow size = [1’500, 60’000] B
• Geometric flow size distribution
• Uniform destination distribution (except self)
• Mean load = 50%
• Source stops sending at T = 1.0 s
• Simulation runs to completion

Scenario
• 16-node (3-level) and 32-node (4-level) fat tree networks
• Output-generated hotspot (rate reduction to 10% of link rate) on port 1 from 0.1 to 0.5 s

Switch
• Radix N = 4
• M = 150 KB/port
• Link time of flight = 1 us
• Partitioned memory per input, shared among all outputs
  • No limit on per-output memory usage
• PAUSE enabled or disabled
  • Applied on a per-input basis based on local high/low watermarks
  • watermark_high = 141.5 KB, watermark_low = 131.5 KB
  • If disabled, frames are dropped when the input partition is full

Adapter
• Per-node virtual output queuing, round-robin scheduling
• No limit on number of rate limiters
• Ingress buffer size = unlimited, round-robin VOQ service
• Egress buffer size = 150 KB
• PAUSE enabled
  • watermark_high = 141.5 KB, watermark_low = 131.5 KB

ECM
• W = 2.0
• Qeq = 37.5 KB (= M/4)
• Gd = 0.5 / ((2*W+1)*Qeq)
• Gi0 = (Rlink / Runit) / ((2*W+1)*Qeq)
• Gi = 0.1 * Gi0
• Psample = 2% (on average 1 sample every 75 KB)
• Runit = Rmin = 1 Mb/s
• BCN_MAX enabled, threshold = 150 KB
• BCN(0,0) en-/disabled, threshold = 300 KB
• Drift enabled (1 Mb/s every 10 ms)

E2CM (per-flow)
• Continuous probing
• Wflow = 2.0
• Qeq,flow = 7.5 KB
• Gd,flow = 0.5 / ((2*W+1)*Qeq,flow)
• Gi,flow = 0.01 * (Rlink / Runit) / ((2*W+1)*Qeq,flow)
• Psample = 2% (on average 1 sample every 75 KB)
• Runit = Rmin = 1 Mb/s
• BCN_MAXflow enabled, threshold = 30 KB
• BCN(0,0)flow en-/disabled, threshold = 60 KB
E2CM fat tree results: 16 nodes, 3 levels
[Figure: aggregate throughput and hot queue length over time, Bernoulli vs. bursty traffic]
E2CM fat tree results: 32 nodes, 4 levels
[Figure: aggregate throughput and hot queue length over time, Bernoulli vs. bursty traffic]
Frame drops, completed flows, FCT
[Figure: frame drops, completed flows, and flow completion times for the 16-node and 32-node fat trees]
Mixed link speeds

Output-generated hotspot (service rate = 10%)
[Diagram: nodes 1-10 → 1G links → Switch 1 → 10G link → Switch 2 → node 11]
• Nodes 1-10 are connected via 1G adapters and links
• Switch 1 has ten 1G ports and one 10G port to switch 2, which has two 10G ports
• Shared-memory switches create more serious congestion
• Ten hot flows of 0.5 Gb/s each from nodes 1-10 to node 11 (hotspot)
• Node 11 sends uniformly at 5 Gb/s (cold)
• Max-min fair shares: 12.5 MB/s for [1-10] → 11

Input-generated hotspot
• Same topology as above
• One hot flow of 5.0 Gb/s from node 11 to node 1 (hotspot)
• Nodes 1-10 send uniformly at 0.5 Gb/s (cold)
• Max-min fair shares: 62.5 MB/s for 11 → 1 and 6.25 MB/s for [2-10] → 1
E2CM mixed speed: output-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled vs. enabled]
E2CM mixed speed: input-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled vs. enabled]
Probing mixed speed: output-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled vs. enabled; perfect bandwidth sharing achieved]
Probing mixed speed: input-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled vs. enabled]
Conclusions
• FCT is dominated by adapter latency for rate-limited flows
• E2CM can manage across non-CM domains
  • Even a hotspot within a non-CM domain can be controlled
  • Need to ensure that CM notifications can traverse non-CM domains: they have to look like valid frames to non-CM bridges
• E2CM works excellently in multi-level fat tree topologies
• E2CM also copes well with mixed-speed networks
• Continuous probing improves E2CM’s overall performance
  • In low-degree hotspot scenarios, probing alone appears to be sufficient to control congestion