
E2CM updates: IEEE 802.1 Interim @ Geneva

This update summarizes the E2CM proposal for managing congestion across non-CM domains and presents new performance results for fat tree topologies and mixed link speeds (1G/10G).


Presentation Transcript


  1. E2CM updates, IEEE 802.1 Interim @ Geneva. Cyriel Minkenberg & Mitch Gusat, IBM Research GmbH, Zurich, May 29, 2007

  2. Outline
  • Summary of E2CM proposal
    • How it works
    • What has changed
  • New E2CM performance results
    • Managing across a non-CM domain
    • Performance in fat tree topology
    • Mixed link speeds (1G/10G)

  3. Refresher: E2CM operation
  • Probing is triggered by BCN frames; only rate-limited flows are probed
  • Insert one probe every X KB of data sent per flow, e.g. X = 75 KB
  • Probes traverse the network in band: the objective is to observe the real, current queuing delay
  • Variant: continuous probing (used here)
  • Per flow, BCN and probes employ the same rate limiter
    • Control per-flow (probe) as well as per-queue (BCN) occupancy
  • CPID of probes = destination MAC
    • Rate limiter is associated with the CPID from which the last negative feedback was received
    • Increment only on probes from the associated CPID
  • Parameters relating to probes may be set differently (in particular Qeq,flow, Qmax,flow, Gd,flow, Gi,flow); see the sketch below
  [Diagram: src -> Switch 1 -> Switch 2 -> Switch 3 -> dst, annotated with the control sequence: Qeq exceeded at a switch -> send BCN to source; BCN arrives at source -> install rate limiter, inject probe w/ timestamp; probe arrives at dst -> insert timestamp, return probe to source; probe arrives back at source -> path occupancy computed, AIMD control applied using the same rate limiter.]
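
A minimal sketch of the rate-limiter update referenced above, in Python. The feedback formula Fb = -(Qoff + W*Qdelta) and the AIMD update rules are the usual ECM/BCN-style equations and are an assumption here; the deck itself only lists the gain parameters (W, Gd, Gi, Qeq). The same function would be invoked with per-queue parameters for BCN feedback and with per-flow parameters (Qeq,flow, Gd,flow, Gi,flow) for returned probes, since both drive the same rate limiter.

```python
# Hedged sketch; parameter names (W, Gd, Gi, Qeq, Runit, Rmin, Rlink) follow the
# deck's notation, but the feedback formula itself is assumed, not quoted from it.

def apply_feedback(rate, q_len, q_old, *, q_eq, w, gd, gi, r_unit, r_min, r_link):
    """Update a rate limiter from one congestion sample (BCN or returned probe)."""
    q_off = q_len - q_eq          # offset of (per-queue or per-flow) occupancy from setpoint
    q_delta = q_len - q_old       # how fast the occupancy is growing or draining
    fb = -(q_off + w * q_delta)   # negative => congestion, positive => headroom

    if fb < 0:
        rate *= (1.0 + gd * fb)   # multiplicative decrease (fb is negative)
    else:
        rate += gi * fb * r_unit  # additive increase
    return min(max(rate, r_min), r_link)
```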

  4. Synergies
  • "Added value" of E2CM
    • Fair and stable rate allocation: fine granularity owing to per-flow end-to-end probing
    • Improved initial-response and queue-convergence speeds
    • Transparent to network: purely end-to-end, no (additional) burden on bridges
  • "Added value" of ECM
    • Fast initial response: feedback travels straight back to the source
    • Capped aggregate queue length for large-degree hotspots: controls the sum of per-flow queue occupancies

  5. Modifications since the March proposal. See also au-sim-ZRL-E2CM-src-based-r1.2.pdf

  6. Coexistence of CM and non-CM domains
  • Concern has been raised that an end-to-end scheme requires global deployment
  • We consider the case where a non-CM switch exists in the path of the congesting flows
    • CM messages are terminated at the edge of the domain
    • Cannot relay notifications across the non-CM domain
    • Cannot control congestion inside the non-CM domain
  • Non-CM (legacy) bridge behavior
    • Does not generate or interpret any CM notifications
    • Can it relay CM notifications as regular frames? May depend on the bridge implementation
    • The next results make this assumption

  7. Managing across a non-CM domain
  • Switches 1, 2, 3 & 5 are in congestion-managed domains; switch 4 is in a non-congestion-managed domain
  • Four hot flows of 10 Gb/s each from nodes 1, 2, 3, 4 to node 6 (hotspot)
  • One cold (lukewarm) flow of 10 Gb/s from node 5 to node 7
  • Max-min fair allocation provides 2.0 Gb/s to each flow (see the sketch below)
  [Diagram: nodes 1-5 inject at 100% through switches 1-3 (CM domain) and switch 4 (non-CM domain) toward switch 5 (CM domain), which feeds nodes 6 and 7.]
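
A hedged sketch of how the 2.0 Gb/s figure follows from max-min fairness: assuming all five 10 Gb/s flows share one 10 Gb/s bottleneck link on the way to switch 5 (the exact link is only inferred from the figure), water-filling gives each flow 10/5 = 2 Gb/s. Flow names below are illustrative.

```python
def max_min_fair(demands, capacity):
    """Water-filling max-min fair allocation on a single bottleneck link."""
    alloc = {f: 0.0 for f in demands}
    active = set(demands)
    remaining = capacity
    while active:
        share = remaining / len(active)
        satisfied = {f for f in active if demands[f] <= share}
        if not satisfied:                 # every remaining flow is bottlenecked here
            for f in active:
                alloc[f] = share
            break
        for f in satisfied:               # small flows get exactly what they ask for
            alloc[f] = demands[f]
            remaining -= demands[f]
        active -= satisfied
    return alloc

# Four hot flows (nodes 1-4 -> node 6) plus one cold flow (node 5 -> node 7),
# each offered at 10 Gb/s, on a shared 10 Gb/s link:
print(max_min_fair({f: 10.0 for f in ("h1", "h2", "h3", "h4", "cold")}, 10.0))
# -> 2.0 Gb/s for every flow, matching the slide
```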

  8. Simulation setup & parameters
  Traffic
  • Mean flow size = [1,500, 60,000] B, geometric flow size distribution
  • Sources stop sending at T = 1.0 s; simulation runs to completion (no frames left in the system)
  Scenario
  • See previous slide
  Switch
  • Radix N = 2, 3, 4; M = 150 KB/port; link time of flight = 1 us
  • Partitioned memory per input, shared among all outputs; no limit on per-output memory usage
  • PAUSE enabled or disabled, applied on a per-input basis based on local high/low watermarks
    • watermark_high = 141.5 KB, watermark_low = 131.5 KB
    • If disabled, frames are dropped when the input partition is full
  Adapter
  • Per-node virtual output queuing, round-robin scheduling; no limit on the number of rate limiters
  • Ingress buffer size = unlimited, round-robin VOQ service; egress buffer size = 150 KB
  • PAUSE enabled: watermark_high = 141.5 KB, watermark_low = 131.5 KB
  ECM
  • W = 2.0, Qeq = 37.5 KB (= M/4)
  • Gd = 0.5 / ((2*W+1)*Qeq)
  • Gi0 = (Rlink / Runit) * ((2*W+1)*Qeq), Gi = 0.1 * Gi0 (evaluated numerically in the sketch below)
  • Psample = 2% (on average 1 sample every 75 KB)
  • Runit = Rmin = 1 Mb/s
  • BCN_MAX enabled, threshold = 150 KB; BCN(0,0) disabled
  • Drift enabled (1 Mb/s every 10 ms)
  E2CM (per-flow)
  • Continuous probing
  • Wflow = 2.0, Qeq,flow = 7.5 KB
  • Gd,flow = 0.5 / ((2*W+1)*Qeq,flow)
  • Gi,flow = 0.01 * (Rlink / Runit) / ((2*W+1)*Qeq,flow)
  • Psample = 2% (on average 1 sample every 75 KB)
  • Runit = Rmin = 1 Mb/s
  • BCN_MAXflow enabled, threshold = 30 KB; BCN(0,0)flow disabled
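
As a quick sanity check, the sketch below plugs the table's numbers into its gain formulas, using the slide's own notation (W, Qeq, Rlink, Runit). The expressions are copied as written; the KB-to-byte convention and the value Rlink/Runit = 10 Gb/s / 1 Mb/s = 10,000 are assumptions.

```python
# Gain parameters from the table above, evaluated as written (assumed units: bytes).
KB = 1_000

W, Q_EQ            = 2.0, 37.5 * KB        # per-queue (ECM) settings
W_FLOW, Q_EQ_FLOW  = 2.0, 7.5 * KB         # per-flow (E2CM) settings
R_LINK_OVER_R_UNIT = 10_000                # 10 Gb/s link rate / 1 Mb/s rate unit (assumed)

Gd      = 0.5 / ((2 * W + 1) * Q_EQ)                                  # decrease gain
Gi      = 0.1 * R_LINK_OVER_R_UNIT * ((2 * W + 1) * Q_EQ)             # Gi = 0.1 * Gi0
Gd_flow = 0.5 / ((2 * W_FLOW + 1) * Q_EQ_FLOW)
Gi_flow = 0.01 * R_LINK_OVER_R_UNIT / ((2 * W_FLOW + 1) * Q_EQ_FLOW)

print(Gd, Gi, Gd_flow, Gi_flow)
```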

  9. E2CM: Per-flow throughput. [Plots: Bernoulli and bursty traffic, with PAUSE disabled and enabled; max-min fair rates indicated.]

  10. E2CM: Per-node throughput. [Plots: Bernoulli and bursty traffic, with PAUSE disabled and enabled; max-min fair rates indicated.]

  11. E2CM: Switch queue length. [Plots: Bernoulli and bursty traffic, with PAUSE disabled and enabled; stable OQ level indicated.]

  12. Frame drops, flow completions, FCT
  • Mean FCT is longer w/ PAUSE
    • All flows are accounted for (w/o PAUSE not all flows completed)
    • Absence of PAUSE heavily skews the results, in particular for hot flows, hence the much longer FCT w/ PAUSE
  • Cold flow FCT is independent of burst size!
  • Load compression: flows wait for a long time in the adapter before being injected
    • FCT is dominated by adapter latency
    • Cold traffic also traverses the hotspot and therefore suffers from compression

  13. Fat tree network
  • Fat trees enable scaling to arbitrarily large networks with constant (full) bisection bandwidth
  • We use static, destination-based, shortest-path routing (an illustrative sketch follows below)
  • For more details on construction and routing see: au-sim-ZRL-fat-tree-build-and-route-r1.0.pdf
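
The construction and routing details live in the referenced report; purely as an illustration of what "static, destination-based" means (no per-packet or per-flow randomization), one common realization picks the upward port at each level from digits of the destination ID. This sketch is not taken from the report.

```python
# Illustrative only: the upward output port is a fixed function of the destination
# ID, so routes are static and destination-based; the downward path in a fat tree
# is then determined uniquely by the destination.

def up_port(dest_id: int, level: int, half_radix: int) -> int:
    """Pick the upward output port at 'level' purely from the destination ID."""
    return (dest_id // (half_radix ** level)) % half_radix
```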

  14. Fat tree: Folded representation
  • Switches are labeled (stageID, switchID): stageID in [0, S-1], switchID in [0, (N/2)^(L-1) - 1]
  • Conventions:
    • N = no. of bidirectional ports per switch
    • L = no. of levels (folded); S = no. of stages = 2L-1 (unfolded)
    • M = no. of end nodes = N*(N/2)^(L-1)
    • Number of switches per stage = (N/2)^(L-1); total number of switches = (2L-1)*(N/2)^(L-1)
  • Nodes are connected at the left and right edges: left nodes are numbered 0 through M/2-1, right nodes M/2 through M-1
  [Diagrams: a folded 3-level fat tree (levels 0-2, spine at level 2, up/down links) with switches (0,0)-(2,3) and nodes 0-15, and its unfolded Benes-equivalent with stages 0-4 and left/right node edges. The counting formulas are checked in the sketch below.]
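
A small helper evaluating the counting formulas above, checked against the two networks simulated later (16 nodes / 3 levels and 32 nodes / 4 levels); N, L, M and S follow the slide's conventions.

```python
def fat_tree_size(n_ports: int, levels: int):
    """Return (end nodes M, stages S, switches per stage, total switches)."""
    per_stage = (n_ports // 2) ** (levels - 1)   # (N/2)^(L-1)
    m = n_ports * per_stage                      # M = N * (N/2)^(L-1)
    s = 2 * levels - 1                           # S = 2L-1 (unfolded)
    return m, s, per_stage, s * per_stage

print(fat_tree_size(4, 3))   # -> (16, 5, 4, 20)
print(fat_tree_size(4, 4))   # -> (32, 7, 8, 56)
```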

  15. Simulation setup & parameters
  Traffic
  • Mean flow size = [1,500, 60,000] B, geometric flow size distribution
  • Uniform destination distribution (except self), mean load = 50%
  • Sources stop sending at T = 1.0 s; simulation runs to completion
  Scenario
  • 16-node (3-level) and 32-node (4-level) fat tree networks
  • Output-generated hotspot (rate reduction to 10% of link rate) on port 1 from 0.1 to 0.5 s
  Switch
  • Radix N = 4; M = 150 KB/port; link time of flight = 1 us
  • Partitioned memory per input, shared among all outputs; no limit on per-output memory usage
  • PAUSE enabled or disabled, applied on a per-input basis based on local high/low watermarks
    • watermark_high = 141.5 KB, watermark_low = 131.5 KB
    • If disabled, frames are dropped when the input partition is full
  Adapter
  • Per-node virtual output queuing, round-robin scheduling; no limit on the number of rate limiters
  • Ingress buffer size = unlimited, round-robin VOQ service; egress buffer size = 150 KB
  • PAUSE enabled: watermark_high = 141.5 KB, watermark_low = 131.5 KB
  ECM
  • W = 2.0, Qeq = 37.5 KB (= M/4)
  • Gd = 0.5 / ((2*W+1)*Qeq)
  • Gi0 = (Rlink / Runit) * ((2*W+1)*Qeq), Gi = 0.1 * Gi0
  • Psample = 2% (on average 1 sample every 75 KB)
  • Runit = Rmin = 1 Mb/s
  • BCN_MAX enabled, threshold = 150 KB; BCN(0,0) en-/disabled, threshold = 300 KB
  • Drift enabled (1 Mb/s every 10 ms)
  E2CM (per-flow)
  • Continuous probing
  • Wflow = 2.0, Qeq,flow = 7.5 KB
  • Gd,flow = 0.5 / ((2*W+1)*Qeq,flow)
  • Gi,flow = 0.01 * (Rlink / Runit) / ((2*W+1)*Qeq,flow)
  • Psample = 2% (on average 1 sample every 75 KB)
  • Runit = Rmin = 1 Mb/s
  • BCN_MAXflow enabled, threshold = 30 KB; BCN(0,0)flow en-/disabled, threshold = 60 KB

  16. E2CM fat tree results: 16 nodes, 3 levels. [Plots: aggregate throughput and hot-queue length for Bernoulli and bursty traffic.]

  17. E2CM fat tree results: 32 nodes, 4 levels. [Plots: aggregate throughput and hot-queue length for Bernoulli and bursty traffic.]

  18. Frame drops, completed flows, FCT. [Results for the 16-node and 32-node networks.]

  19. Mixed link speeds
  • Nodes 1-10 are connected via 1G adapters and links
  • Switch 1 has ten 1G ports and one 10G port to switch 2, which has two 10G ports
  • Shared-memory switches, which create more serious congestion
  Output-generated hotspot (service rate = 10%)
  • Ten hot flows of 0.5 Gb/s each from nodes 1-10 to node 11 (hotspot)
  • Node 11 sends uniformly at 5 Gb/s (cold)
  • Max-min fair shares: 12.5 MB/s for [1-10] -> 11 (see the sketch below)
  Input-generated hotspot (same topology as above)
  • One hot flow of 5.0 Gb/s from node 11 to node 1 (hotspot)
  • Nodes 1-10 send uniformly at 0.5 Gb/s (cold)
  • Max-min fair shares: 62.5 MB/s for 11 -> 1 and 6.25 MB/s for [2-10] -> 1
  [Diagrams: nodes 1-10 (1G) attach to switch 1, which connects via 10G to switch 2 and node 11; link loads of 50% are indicated.]
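
A back-of-the-envelope check of the output-generated fair share quoted above, assuming 1 Gb/s = 125 MB/s: the hotspot's effective capacity is 10% of 10 Gb/s = 1 Gb/s, split equally over the ten hot flows.

```python
# Output-generated hotspot: ten 1G senders share node 11's throttled 10G link.
MB_PER_S_PER_GBPS = 125                       # 1 Gb/s expressed in MB/s (assumed convention)

hotspot_capacity_gbps = 0.10 * 10             # service rate reduced to 10% of 10 Gb/s
per_flow_gbps = hotspot_capacity_gbps / 10    # ten hot flows, nodes 1-10 -> node 11
print(per_flow_gbps * MB_PER_S_PER_GBPS)      # -> 12.5 MB/s, as on the slide
```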

  20. E2CM mixed speed: output-generated HS. [Plots: per-node and per-flow throughput with PAUSE disabled and enabled.]

  21. E2CM mixed speed: input-generated HS. [Plots: per-node and per-flow throughput with PAUSE disabled and enabled.]

  22. Probing mixed speed: output-generated HS. [Plots: per-node and per-flow throughput with PAUSE disabled and enabled; perfect bandwidth sharing indicated.]

  23. Probing mixed speed: input-generated HS. [Plots: per-node and per-flow throughput with PAUSE disabled and enabled.]

  24. Conclusions
  • FCT is dominated by adapter latency for rate-limited flows
  • E2CM can manage across non-CM domains
    • Even a hotspot within a non-CM domain can be controlled
    • Need to ensure that CM notifications can traverse non-CM domains: they have to look like valid frames to non-CM bridges
  • E2CM works excellently in multi-level fat tree topologies
  • E2CM also copes well with mixed-speed networks
  • Continuous probing improves E2CM's overall performance
    • In low-degree hotspot scenarios, probing alone appears to be sufficient to control congestion
