
Extreme Networking Achieving Nonstop Network Operation Under Extreme Operating Conditions


  1. Extreme Networking: Achieving Nonstop Network Operation Under Extreme Operating Conditions
  Jon Turner, jst@cs.wustl.edu, http://www.arl.wustl.edu/arl

  2. Project Overview
  • Motivation
    • data networks have become a mission-critical resource
    • networks are often subject to extreme traffic conditions
    • need to design networks for worst-case conditions
    • technology advances are making extreme defenses practical
  • Extreme network services
    • Lightweight Flow Setup (LFS)
    • Network Access Service (NAS)
    • Distributed Tree Service (DTS)
  • Key router technology components
    • Super-Scalable Packet Scheduling (SPS)
    • Dynamic Queues with Auto-aggregation (DQA)
    • Scalable Distributed Queueing (SDQ)

  3. Extreme Router Architecture
  [Figure: router block diagram. A control processor (system mgmt., route table config., signalling) attaches to a scalable switch fabric; input and output port processors perform flow/route lookup (looking up the route or state for reserved flows) and run distributed queueing control (traffic isolation, protection of reserved flows).]

  4. Prototype Extreme Router
  [Figure: prototype organization. An ATM switch core with input/output port processors (IPP/OPP) and a control processor; each port carries a Field Programmable Port Extender (FPX: a reprogrammable application device plus a network interface device) and a Smart Port Card (SPC: Pentium with cache, North Bridge, APIC, and system FPGA; memories shown: 128 MB SDRAM, 4 MB SRAM, 64 MB); transmission interfaces (TI) connect the external links.]

  5. Distributed Queueing
  [Figure: inputs and outputs around the switch fabric, with a queue per output at each input port; each port has a scheduler and a routing block, plus a transmission interface (TI), and a control processor oversees the system. Periodic queue-length reports circulate among the ports, and each scheduler paces its queues according to their backlog share.]

  6. Is Distributed Queueing Necessary?
  • ATM switches generally do not do it.
    • switch is engineered with small speedup (typically 2:1)
    • with well-regulated traffic, do not expect >2:1 overload
  • Overloads are more likely in IP networks.
    • limited route diversity makes congested links common
    • route selection not guided by session bandwidth needs
    • routing changes cause rapid shifts in traffic
    • crude, slow congestion control mechanism
    • no protection from malicious users
  • Challenges
    • prevent congestion while avoiding output “underflow”
    • scalability: target 1000×10 Gb/s systems
    • support fair queueing and reserved-flow queueing

  7. Basic Distributed Queueing Algorithm
  • Goal: avoid switch congestion and output queue underflow.
  • Let hi(i,j) be input i’s share of the input-side backlog destined for output j.
    • can avoid switch congestion by sending from input i to output j at rate LS·hi(i,j), where L is the external link rate and S is the switch speedup
  • Let lo(i,j) be input i’s share of the total backlog for output j.
    • can avoid underflow of the queue at output j by sending from input i to output j at rate L·lo(i,j)
    • this works if L·(lo(i,1) + ··· + lo(i,n)) ≤ LS for all i
  • Let wt(i,j) be the ratio of lo(i,j) to lo(i,1) + ··· + lo(i,n).
  • Let rate(i,j) = LS·min{wt(i,j), hi(i,j)} (see the sketch below).
  • Note: the algorithm avoids congestion and, for large enough S, avoids underflow.
    • what is the smallest value of S for which underflow cannot occur?
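
  Below is a minimal Python sketch of this rate computation; it is not from the slides. It assumes one concrete bookkeeping: in_backlog[i][j] is the backlog queued at input i for output j, and out_backlog[j] is the backlog already queued at output j, so that "total backlog for output j" is their sum. All names are illustrative.

    # Illustrative sketch; data layout and names are assumptions, not from the slides.
    def basic_rates(in_backlog, out_backlog, L, S):
        """Compute rate(i,j) = L*S*min{wt(i,j), hi(i,j)} per the basic algorithm."""
        n = len(in_backlog)
        hi = [[0.0] * n for _ in range(n)]
        lo = [[0.0] * n for _ in range(n)]
        for j in range(n):
            # input-side backlog for output j, and total backlog including output j's queue
            in_sum = sum(in_backlog[i][j] for i in range(n))
            total = in_sum + out_backlog[j]
            for i in range(n):
                hi[i][j] = in_backlog[i][j] / in_sum if in_sum else 0.0  # hi(i,j)
                lo[i][j] = in_backlog[i][j] / total if total else 0.0   # lo(i,j)
        rates = [[0.0] * n for _ in range(n)]
        for i in range(n):
            lo_sum = sum(lo[i])  # lo(i,1) + ... + lo(i,n)
            for j in range(n):
                wt = lo[i][j] / lo_sum if lo_sum else 0.0  # wt(i,j)
                rates[i][j] = L * S * min(wt, hi[i][j])
        return rates

  Taking the min of the two shares enforces the congestion bound and the underflow bound at once; in practice each input would recompute its row from the periodic queue-length reports.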

  8. Stress Test
  [Figure: stress-test traffic pattern; the number of inputs and outputs used, and the length of the “phases”, can be varied.]

  9. Stress Test Simulation - Min Rates
  [Figure: minimum-rate plot; annotations mark the critical rate and the first and second phases.]

  10. Stress Test - Actual Rates
  [Figure: actual-rate plot, again marking the critical rate and the first and second phases; it shows under-use of input bandwidth.]

  11. Stress Test - Input Queue Lengths
  [Figure: input queue lengths; input-side backlog for the final output implies underflow.]

  12. Stress Test - Output Queue Lengths
  [Figure: output queue lengths; a persistent output-side backlog is caused by the earlier dip in forwarding rate.]

  13. Improving Basic Algorithm
  • Basic algorithm does not always make full use of the available input bandwidth.
    • it does not reallocate bandwidth that is “sacrificed” by queues that are “output limited”
    • extend the algorithm to reallocate it
  • Revised rate allocation at input i (a runnable sketch follows):
      R = SL
      repeat n times:
          let j be the unassigned queue with the smallest ratio hi(i,j)/lo(i,j)
          let wt(i,j) = lo(i,j) / (sum of lo(i,q) over unassigned queues q)
          rate(i,j) = min{R·wt(i,j), SL·hi(i,j)}
          R = R - rate(i,j)
  • Plus other refinements.
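
  A direct Python rendering of the revised loop for one input i; hi[j] and lo[j] hold the shares hi(i,j) and lo(i,j) for that input, and the function name is mine.

    # Illustrative sketch of the revised allocation at a single input.
    def revised_rates(hi, lo, L, S):
        """Reallocate bandwidth sacrificed by output-limited queues."""
        n = len(hi)
        rate = [0.0] * n
        unassigned = set(range(n))
        R = S * L  # input bandwidth not yet assigned
        for _ in range(n):
            # unassigned queue with the smallest ratio hi(i,j)/lo(i,j)
            j = min(unassigned,
                    key=lambda q: hi[q] / lo[q] if lo[q] else float("inf"))
            lo_sum = sum(lo[q] for q in unassigned)
            wt = lo[j] / lo_sum if lo_sum else 0.0  # wt(i,j) over unassigned queues
            rate[j] = min(R * wt, S * L * hi[j])
            R -= rate[j]           # leftover bandwidth flows to later queues
            unassigned.remove(j)
        return rate

  Processing queues in increasing hi/lo order means output-limited queues (capped by the SL·hi term) give up bandwidth first, and the residue R is redistributed to the queues that can still use it.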

  14. Performance Gain - Allocated Rates
  [Figure: allocated-rate plot showing full use of input bandwidth; bandwidth is preallocated to idle outputs.]

  15. Performance Gain - Min Rates
  [Figure: minimum-rate plot for the improved algorithm, with the critical rate marked.]

  16. Worst-Case Min Rate Sums

  17. Results for Random Bursty Traffic
  [Figure: lost link capacity is negligible for speedups greater than 1.2.]

  18. Extending for Fair Queueing
  • Fair queueing gives each flow an equal share of a congested link.
    • limits the impact of “greedy” users on others
    • improves performance of congestion control mechanisms, reducing queueing delays and packet loss
  • Partial solution
    • per-flow queues with a packet scheduler at each output
    • provides fairness when there is no significant input-side queueing
  • Better solution
    • per-flow input and output queues
    • distributed queueing controls the rates of the per-output schedulers at the inputs
    • bandwidth allocated by number of backlogged queues

  19. Fair Distributed Queueing
  [Figure: each input keeps a separate queue set for each output (to output 1, to output 2, …, to output n) feeding the switch fabric; distributed queueing controls the rate of each queue set.]
  • Periodic update messages contain information on both backlog and the number of backlogged queues (a possible message shape is sketched below).
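
  One plausible shape for such an update message; the slides specify only the two payloads, so the field names and types here are hypothetical.

    # Illustrative sketch; field names are assumptions, not from the slides.
    from dataclasses import dataclass

    @dataclass
    class DQUpdate:
        input_port: int               # reporting input i
        backlog: list[int]            # backlog[j]: backlog at i destined for output j
        backlogged_queues: list[int]  # Q(i,j): backlogged flow queues at i for output j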

  20. Fair Distributed Queueing Algorithm
  • Same objectives as before, plus fairness.
    • each backlogged queue gets an equal share of a congested output
    • so, allocate bandwidth according to the number of backlogged queues
  • Let Q(i,j) be the number of backlogged queues at input i for output j.
  • Let hi(i,j) = Q(i,j)/(Q(1,j) + ··· + Q(n,j)).
    • can avoid switch congestion by ensuring rate(i,j) ≤ LS·hi(i,j)
  • Let need(j) be the total input-side share of the backlog for output j.
  • Let lo(i,j) = need(j)·Q(i,j)/(Q(1,j) + ··· + Q(n,j)).
    • can avoid underflow by ensuring rate(i,j) ≥ L·lo(i,j)
    • this works if L·(lo(i,1) + ··· + lo(i,n)) ≤ LS for all i
  • Use the same rate allocation as before with the modified lo and hi (sketched below).
  • For weighted fair queueing, redefine Q(i,j) to be the total weight of the backlogged queues at input i for output j.
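
  A sketch of the modified shares, which would feed the same allocation loop as revised_rates above; Q[i][j] counts backlogged queues and need[j] is the total input-side share of the backlog for output j. The function name is mine.

    # Illustrative sketch; names are assumptions, not from the slides.
    def fair_shares(Q, need):
        """hi(i,j) = Q(i,j)/sum_k Q(k,j);  lo(i,j) = need(j)*Q(i,j)/sum_k Q(k,j)."""
        n = len(Q)
        hi = [[0.0] * n for _ in range(n)]
        lo = [[0.0] * n for _ in range(n)]
        for j in range(n):
            q_sum = sum(Q[i][j] for i in range(n))  # Q(1,j) + ... + Q(n,j)
            for i in range(n):
                share = Q[i][j] / q_sum if q_sum else 0.0
                hi[i][j] = share
                lo[i][j] = need[j] * share
        return hi, lo

  For the weighted variant, each entry of Q would hold the total weight of the backlogged queues rather than their count, leaving the rest of the computation unchanged.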

  21. Summary
  • Growing reliance on data networks creates higher expectations: reliability, consistent performance.
    • design for the worst case - constructive paranoia
    • extreme defenses can be practical
  • Distributed queueing is a key component of scalable extreme routers.
    • with a small speedup, it prevents congestion (always) and underflow (almost always) while ensuring fairness (mostly)
    • it increases latency and complexity
  • Current reconfigurable hardware capabilities
    • 67K elementary logic cells (LUT+FF) plus 2.5 Mb of SRAM
    • over 1K I/O pads, high-speed I/Os (>500 MHz)
    • enables experimental implementation of complex features
