
DARD: Distributed Adaptive Routing for Datacenter Networks


Presentation Transcript


  1. DARD: Distributed Adaptive Routing for Datacenter Networks Xin Wu, Xiaowei Yang

  2. Multiple equal-cost paths in DCN [diagram: fat-tree with core, Agg, and ToR layers; src and dst sit in different pods] • Scale-out topology -> horizontal expansion -> more paths

  3. Suboptimal scheduling -> hot spot [diagram: flows src1 -> dst1 and src2 -> dst2 sharing a link and creating a hot spot] • Unavoidable intra-datacenter traffic • Common services: DNS, search, storage • Auto-scaling: dynamic application instances

  4. To prevent hot spots • Distributed • ECMP & VL2: flow-level hashing in switches • Centralized • Hedera: compute optimal scheduling in ONE server [design-space chart: Distributed = robust but not efficient; Centralized = efficient but not robust]

  5. Goal: practical, efficient, robust • Practical • Using well-proven technologies • Efficient • Close to optimal traffic scheduling • Robust • No single point of failure [design-space chart: Centralized = efficient but not robust; Distributed = robust but not efficient; DARD aims to be distributed, robust, and efficient]

  6. Contributions • Explore the possibility of distributed yet close-to-optimal flow scheduling in DCNs. • A working implementation on a testbed. • A proven upper bound on convergence.

  7. Intuition: minimize the maximum number of flows via a link [diagram: three flows, src1 -> dst1, src2 -> dst2, src3 -> dst3] Step 0: maximum # of flows via a link = 3

  8. Intuition: minimize the maximum number of flows via a link [diagram: same flows, one moved to a less loaded path] Step 1: maximum # of flows via a link = 2

  9. Intuition: minimize the maximum number of flows via a link [diagram: same flows after another reroute] Step 2: maximum # of flows via a link = 1
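The objective behind slides 7-9 can be made concrete with a small sketch (my own illustration, not the authors' code): treat each flow as the list of links it traverses and minimize the maximum number of flows that share any single link.

    from collections import Counter

    def max_flows_per_link(flow_paths):
        # flow_paths: dict mapping flow_id -> list of link IDs the flow traverses
        link_load = Counter()
        for path in flow_paths.values():
            link_load.update(path)
        return max(link_load.values()) if link_load else 0

    # Toy instance mirroring the slides: moving one flow off the shared link
    # lowers the bottleneck from 3 flows to 2.
    flows = {
        "f1": ["tor1-agg1", "agg1-core1"],
        "f2": ["tor1-agg1", "agg1-core1"],
        "f3": ["tor1-agg1", "agg1-core1"],
    }
    print(max_flows_per_link(flows))           # 3
    flows["f3"] = ["tor1-agg2", "agg2-core3"]  # reroute one flow
    print(max_flows_per_link(flows))           # 2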

  10. Architecture • Control loop runs on every server independently: monitor network states -> compute next scheduling -> change the flow's path
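A minimal sketch of that control loop (function names, data shapes, and the cycle-length handling are placeholders, not DARD's actual code); the random extra delay anticipates slide 16.

    import random
    import time

    CONTROL_CYCLE = 10.0  # seconds; slide 20 reports one cycle of roughly 10 s

    def control_loop(monitor_network_states, compute_next_scheduling, change_flow_path):
        # Runs independently on every server: monitor -> compute -> change path.
        while True:
            states = monitor_network_states()              # per-link flow counts and bandwidth
            for flow_id, new_path in compute_next_scheduling(states):
                change_flow_path(flow_id, new_path)        # switch the flow's src-dst address pair
            # a random extra delay desynchronizes servers and helps prevent oscillation
            time.sleep(CONTROL_CYCLE + random.uniform(0.0, CONTROL_CYCLE))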

  11. Monitor network states • src asks the switches for the #_of_flows and bandwidth of each link to dst. • src assembles the link states to identify the most and least congested paths to dst.
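One way to turn those per-link reports into the "most and least congested path" (a sketch under the assumption that a path's congestion is the flow count on its most loaded link; the slide says DARD also collects bandwidth):

    def rank_paths(paths, link_flow_count):
        # paths: dict path_id -> list of link IDs; link_flow_count: dict link -> # of flows
        congestion = {
            pid: max(link_flow_count.get(link, 0) for link in links)
            for pid, links in paths.items()
        }
        p_busy = max(congestion, key=congestion.get)   # most congested path to dst
        p_free = min(congestion, key=congestion.get)   # least congested path to dst
        return p_busy, p_free, congestion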

  12. Distributed computation • Runs on every server:
    for each dst {
        Pbusy: the most congested path from src to dst;
        Pfree: the least congested path from src to dst;
        if (moving one flow from Pbusy to Pfree won't create a path more congested than Pbusy)
            move one flow from Pbusy to Pfree;
    }
  • The number of steps to convergence is bounded.
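The same pseudocode as a runnable Python sketch (my rendering; the data layout is an assumption, and congestion_of(path, extra=0) is a hypothetical helper that returns the flow count on the path's most loaded link, counting `extra` additional flows placed on it):

    def pick_flow_moves(flows_by_dst, paths_to, congestion_of):
        # flows_by_dst: dict dst -> {flow_id: current path_id}
        # paths_to:     dict dst -> list of candidate path_ids
        moves = []
        for dst, flows in flows_by_dst.items():
            candidates = paths_to[dst]
            p_busy = max(candidates, key=congestion_of)    # most congested path to dst
            p_free = min(candidates, key=congestion_of)    # least congested path to dst
            on_busy = [f for f, p in flows.items() if p == p_busy]
            # move one flow only if p_free, carrying one more flow, would still be
            # less congested than p_busy is now (slide 12's condition)
            if on_busy and congestion_of(p_free, 1) < congestion_of(p_busy):
                moves.append((on_busy[0], p_free))
        return moves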

  13. Change path: using a different src-dst pair
  [diagram: fat-tree with core1-core4 owning 1.0.0.0/8-4.0.0.0/8, aggregation switches owning the /16 sub-prefixes (e.g., agg1: 1.1.0.0/16, 2.1.0.0/16), and ToR switches owning the /24 sub-prefixes; src holds the aliases 1.1.1.2, 2.1.1.2, 3.1.1.2, 4.1.1.2 and dst holds 1.2.1.2, 2.2.1.2, 3.2.1.2, 4.2.1.2]
  agg1's down-hill table (dst -> next hop): 1.1.1.0/24 -> tor1; 1.1.2.0/24 -> tor2; 2.1.1.0/24 -> tor1; 2.1.2.0/24 -> tor2
  agg1's up-hill table (src -> next hop): 1.0.0.0/8 -> core1; 2.0.0.0/8 -> core2
  • A src-dst address pair uniquely encodes a path
  • Forwarding tables are static
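A sketch of the encoding idea, using the addresses shown on the slide (with the /8 prefixes above, the first octet of the chosen pair selects the core; the helper itself is hypothetical):

    SRC_ALIASES = ["1.1.1.2", "2.1.1.2", "3.1.1.2", "4.1.1.2"]  # src's IP aliases, one per core
    DST_ALIASES = ["1.2.1.2", "2.2.1.2", "3.2.1.2", "4.2.1.2"]  # dst's IP aliases, one per core

    def address_pair_for_core(core_index):
        # Pick the (src, dst) address pair whose /8 prefix belongs to core<core_index> (1-4).
        return SRC_ALIASES[core_index - 1], DST_ALIASES[core_index - 1]

    print(address_pair_for_core(1))  # ('1.1.1.2', '1.2.1.2') -> forwarded via core1
    print(address_pair_for_core(3))  # ('3.1.1.2', '3.2.1.2') -> forwarded via core3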

  14. Forwarding example: E2 -> E1
  Packet header: src: 1.2.1.2, dst: 1.1.1.2
  [diagram: the packet travels from E2 (1.2.1.2) through core1 and agg1 down to E1 (1.1.1.2)]
  agg1's down-hill table (dst -> next hop): 1.1.1.0/24 -> tor1; 1.1.2.0/24 -> tor2; 2.1.1.0/24 -> tor1; 2.1.2.0/24 -> tor2
  agg1's up-hill table (src -> next hop): 1.0.0.0/8 -> core1; 2.0.0.0/8 -> core2

  15. Forwarding example: E1 -> E2
  Packet header: src: 1.1.1.2, dst: 1.2.1.2
  [diagram: the packet travels from E1 (1.1.1.2) up through tor1 and agg1 to core1, then down to E2 (1.2.1.2)]
  agg1's down-hill table (dst -> next hop): 1.1.1.0/24 -> tor1; 1.1.2.0/24 -> tor2; 2.1.1.0/24 -> tor1; 2.1.2.0/24 -> tor2
  agg1's up-hill table (src -> next hop): 1.0.0.0/8 -> core1; 2.0.0.0/8 -> core2
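The two examples suggest a two-stage lookup at agg1; the sketch below encodes that reading (the match order and table layout are my assumptions): try the down-hill table on the destination first, and fall back to the up-hill table keyed on the source prefix.

    import ipaddress

    DOWN_HILL = {  # dst prefix -> next hop (agg1's table from the slides)
        "1.1.1.0/24": "tor1", "1.1.2.0/24": "tor2",
        "2.1.1.0/24": "tor1", "2.1.2.0/24": "tor2",
    }
    UP_HILL = {    # src prefix -> next hop (agg1's table from the slides)
        "1.0.0.0/8": "core1", "2.0.0.0/8": "core2",
    }

    def next_hop(src, dst):
        for prefix, hop in DOWN_HILL.items():      # destination in this pod: go down
            if ipaddress.ip_address(dst) in ipaddress.ip_network(prefix):
                return hop
        for prefix, hop in UP_HILL.items():        # otherwise: go up, core chosen by src prefix
            if ipaddress.ip_address(src) in ipaddress.ip_network(prefix):
                return hop
        raise LookupError("no route")

    print(next_hop("1.2.1.2", "1.1.1.2"))  # 'tor1'  (slide 14: E2 -> E1, going down)
    print(next_hop("1.1.1.2", "1.2.1.2"))  # 'core1' (slide 15: E1 -> E2, going up)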

  16. Randomness: prevent path oscillation • Add a random time interval to the control cycle

  17. Implementation • DeterLab testbed • 16-end-host fat-tree • Monitoring: OpenFlow API • Computation: daemon on end hosts • One NIC, multiple addresses: IP alias • Static routes: OpenFlow forwarding table • Multipath: IP-in-IP encapsulation • ns-2 simulator • For different & larger topologies

  18. DARD fully utilizes the bisection bandwidth • Simulation, 1024-end-host fat-tree • pVLB: periodic flow-level VLB [chart: bisection bandwidth (Gbps) across traffic patterns]

  19. DARD improves large file transfer time • Testbed, 16-end-host fat-tree [chart: DARD vs. ECMP improvement against # of new files per second, for inter-pod dominant, intra-pod dominant, and random traffic patterns]

  20. DARD converges in 2~3 control cycles • Simulation, 1024-end-host fat-tree, static traffic patterns • One control cycle ≈ 10 seconds [chart: convergence time (seconds) for inter-pod dominant, intra-pod dominant, and random traffic patterns]

  21. Randomness prevents path oscillation • Simulation, 128-end-host fat-tree [chart: number of times a flow switches its path, for intra-pod dominant, inter-pod dominant, and random traffic patterns]

  22. DARD's control overhead is bounded by the topology • control_traffic = #_of_servers x #_of_switches • Simulation, 128-end-host fat-tree [chart: control traffic (MB/s) against # of simultaneous flows, DARD vs. Hedera]
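A worked instance of that bound (the fat-tree sizes follow the standard k-ary fat-tree formulas, k^3/4 servers and 5k^2/4 switches; the result is just the per-cycle count of src-to-switch link-state exchanges implied by the formula, not a measured figure):

    def fat_tree_sizes(k):
        # Standard k-ary fat-tree: k^3/4 servers, 5k^2/4 switches.
        return k**3 // 4, 5 * k**2 // 4

    servers, switches = fat_tree_sizes(8)          # the 128-end-host fat-tree from the simulation
    print(servers, switches, servers * switches)   # 128 80 10240 exchanges per control cycle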

  23. Conclusion • DARD: Distributed Adaptive Routing for Datacenters • Practical: well-proven end-host-based technologies • Efficient: close to optimal traffic scheduling • Robust: no single point of failure [diagram: control loop: monitor network states -> compute next scheduling -> change the flow's path]

  24. Thank You! Questions and comments: xinwu@cs.duke.edu
