1 / 17

End-to-End Congestion Control for InfiniBand

End-to-End Congestion Control for InfiniBand. Jose Renato Santos, Yoshio Turner, John Janakiraman. HP Labs. Outline. Motivation: Unique System Area Network (SAN) characteristics require new congestion control approach Proposed approach appropriate for SANs: ECN packet marking

rowdy
Download Presentation

End-to-End Congestion Control for InfiniBand

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. End-to-End Congestion Control for InfiniBand Jose Renato Santos, Yoshio Turner, John Janakiraman HP Labs IEEE Infocom 2003

  2. Outline • Motivation: Unique System Area Network (SAN) characteristics require new congestion control approach • Proposed approach appropriate for SANs: • ECN packet marking • Source response: rate control with window limit • Focus: Design of source response functions • New convergence conditions, design methodology • New functions: LIPD and FIMD • Performance Evaluation: LIPD, FIMD, AIMD • Conclusions IEEE Infocom 2003

  3. System Area Networks Characteristics • InfiniBand example: Industry standard server interconnect – 2Gb/s(1x) to 24Gb/s(12x) links • Characteristics: congestion control implications • No packet dropping  Need network support for detecting congestion • Low network latency (tens of ns cut-through switching)  Simple logic for hardware implementation • Low buffer capacity at switches (e.g., 2KB input buffer stores only four 512-byte packets)  TCP window mechanism inadequate (narrow operational range) • Input-buffered switches  Alternative congestion detection mechanisms IEEE Infocom 2003

  4. local flows (L) remote flows (R) Inter-switch Link congested link SWITCH A SWITCH B victim flow Problem: Congestion Spreading Flow not using congested link suffers performance degradation (victim flow) Simulation (R=L=10) • Remote flowsuse only 30% of inter-switch link bandwidth • Contention for root link full buffer prevents victim flow from using remaining inter-switch link bandwidth non-congested link Link BW: 8 Gb/s (4x link) Packet Size: 2 KB Buffer Size: 4 packets/port (8 KB) Buffer Org.: Input port IEEE Infocom 2003

  5. Our Congestion Control Approach • Explicit Congestion Notification (ECN) for input-buffered switches • Source adjusts packet injection according to network feedback encoded in ECN returned via ACK • Combines window and rate control • New source response functions more efficient than AIMD IEEE Infocom 2003

  6. Source Response:Rate Control with Window Limit • Window Control • Self-clocked, bounds switch buffer utilization • Narrow operational range (window=2 uses all bandwidth in idle network) • Window=1 is too large if # flows > # buffer slots • Rate Control • Low buffer util. possible (< 1 packet per flow) • Wide operational range • Not self-clocked • Proposed Approach: Rate control with a fixed window limit (w=1) IEEE Infocom 2003

  7. Designing Rate Control Functions • Definition: When source receives ACK Decrease rate on marked ACK: rnew = fdec(r) Increase rate on unmarked ACK: rnew = finc(r) • fdec(r) and finc(r) should provide : • Congestion avoidance • High network bandwidth utilization • Fair allocation of bandwidth among flows • Develop new sufficient conditions for fdec(r) & finc(r) • Exploit differences in packet marking rates across flows to relax conditions • Requires novel time-based formulation IEEE Infocom 2003

  8. Avoiding Congested State • Steady state: flow rate oscillates around optimal value in alternating phases of rate decrease and increase • Want to avoid time in congested state • Magnitude of response to marked ACK is larger or equal to magnitude of response to unmarked ACK Congestion Avoidance Condition: finc(fdec(r))  r IEEE Infocom 2003

  9. Fairness Convergence • [Chiu/Jain 1989][Bansal/Balakrishnan 2001] developed convergence conditions assuming all flows receive feedback and adjust rates synchronously • Each increase/decrease cycle must improve fairness • Observation: In congested state, the mean number of marked packets for a flow is proportional to the flow rate. • bias promotes flow rate fairness • Enables weaker fairness convergence condition • Benefit: fairness with faster rate recovery IEEE Infocom 2003

  10. Finc(t) Rate as function of time (in absence of marks) . r . rate decrease step fdec(r) Trec(r) t time Fairness Convergence Relax condition: rate decrease-increase cycles need only maintain fairness in the synchronous case • If two flows receive marks, lower rate flow should recover earlier than or in the same timeas higher rate flow Fairness Convergence Condition: Trec(r1)  Trec(r2) for r1 < r2 IEEE Infocom 2003

  11. Maximizing Bandwidth Utilization • Goal: as flows depart, remaining flows should recover rate quickly to maximize utilization • Fastest recovery: use limiting cases of conditions • Congestion Avoidance Condition finc(fdec(r))  r Use finc(fdec(r)) = r for minimum rate Rmin • Fairness Convergence Condition Trec(r1)  Trec(r2) Use Trec(r1) = Trec(r2) for higher rates Maximum Bandwidth Utilization Condition: Trec(r) = 1/ Rmin for all r IEEE Infocom 2003

  12. Finc(t) Finc(t) r finc(r) decrease step rate rate increase step fdec(r) r 1/r Trec=1/Rmin tr t t time time Design Methodology: Choose fdec(r), find finc(r) satisfying conditions Use fdec(r) to derive Finc(t): Finc(t) = fdec(Finc(t + Trec)), Trec=1/Rmin Use Finc(t) to find finc(r): finc(r ) = Finc(tr+1/r) where Finc(tr) = r IEEE Infocom 2003

  13. New Response Functions • Fast Increase Multiplicative Decrease (FIMD): • Decrease function: fdecfimd(r) = r/m, constant m>1 (same as AIMD) • Increase function:fincfimd(r) = r · mRmin/r • Much faster rate recovery than AIMD • Linear Inter-Packet Delay (LIPD): • Decrease function: increases inter-packet delay (ipd) by 1 packet transmission time r = Rmax/(ipd+1) • Increase function: finclipd(r) = r/(1- Rmin/Rmax) • Large decreases at high rate, small decreases at low rate • Simple Implementation: e.g., table lookup IEEE Infocom 2003

  14. Increase Behavior Over Time: FIMD, AIMD, LIPD 1 0.9 0.8 FIMD (m=2) AIMD (m=2) 0.7 LIPD 0.6 normalized rate 0.5 0.4 0.3 0.2 0.1 0 2K 10K 20K 30K 40K 50K 60K 65K time (units of packet transmission time) Finc(t) IEEE Infocom 2003

  15. IL RL Performance: Source Response Functions LIPD AIMD 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 normalized rate 0.6 normalized rate 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 4 5 6 7 8 9 10 11 4 5 6 7 8 9 10 11 buffer size (packets) buffer size (packets) FIMD root link (RL) local flows (LF) 1 0.9 inter-switch link (IL) remote flows (RF) 0.8 0.7 normalized rate 0.6 0.5 0.4 0.3 0.2 0.1 0 4 5 6 7 8 9 10 11 buffer size (packets) IEEE Infocom 2003

  16. Conclusions • Proposed/Evaluated congestion control approach appropriate for unique characteristics of SANs such as InfiniBand • ECN applicable to modern input-queued switches • Source response: rate control w/ window limit • Derived new relaxed conditions for source response function convergence  functions with fast bandwidth reclamation • Based on observation of packet marking bias • Two examples: FIMD/LIPD outperform AIMD • Future extensions: • Hybrid window-rate control (allow w > 1) • Evaluation with richer traffic patterns/topologies IEEE Infocom 2003

  17. IEEE Infocom 2003

More Related