Devavrat Shah Stanford University

Randomization and Heavy Traffic Theory: New approaches to the design and analysis of switch algorithms Devavrat Shah Stanford University

Network algorithms • Algorithms implemented in networks, e.g. in • switches/routers scheduling algorithms routing lookup packet classification • memory/buffer managers maintaining statistics active queue management bandwidth partitioning • load balancers randomized load balancing • web caches eviction schemes placement of caches in a network

Network algorithms: challenges • Time constraint: Need to make complicated decisions very quickly • e.g. line speeds in the Internet core 10Gbps (soon to be 40Gbps) • packets arrive roughly every 50ns (roughly 10ns) • Limited computational resources • due to rigid space and heat dissipation constraints • Algorithms need to be very simple so as to be implementable • but simple algorithms may perform poorly, if not well-designed

An illustration • Consider the following simple system Capacity =1 • Algorithms • serve a randomly chosen queue: very easy • longest queue first (LQF): complicated to implement • Performance • serve at random: does not give maximum throughput • LQF: max throughput and minimizes the maximum backlog

Input queued switch • Multiple queues and servers

Input queued switch Not allowed Crossbar fabric 1 2 3 1 2 3 • Crossbar constraints • each input can connect to at most one output • each output can connect to at most one input

Switch scheduling Crossbar fabric 1 2 3 1 2 3 • Crossbar constraints • each input can connect to at most one output • each output can connect to at most one input

Notation and definitions 1 2 3 1 2 3

Useful performance metric • Throughput • an algorithm is stable (or delivers 100% throughput)if for any admissible arrival, the average backlog is bounded; i.e. • Average delay or average backlog (queue-size)

Schedule ´ Bipartite graph matching Schedule or Matching 19 3 4 21 1 18 7

Scheduling algorithms 19 3 4 21 1 18 7 • iSLIP: Cisco’s GSR 12000 Series • Nick McKeown • Wave Front Arbiter • Tamir and Chi • Parallel Iterative Matching • Anderson, Owicki, Saxe, Thacker Practical Maximal Matchings  Not stable

Scheduling algorithms 19 19 18 1 7 Max Wt Matching Max Size Matching 19 3 4 21 1 18 7 Practical Maximal Matchings  Stable (Tassiulas-Ephremides 92, McKeown et. al. 96, Dai-Prabhakar 00)  Not stable  Not stable (McKeown-Ananthram-Walrand 96)

The Maximum Weight Matching Algorithm • MWM: performance • throughput: stable(Tassiulas-Ephremides 92; McKeown et al 96; Dai-Prabhakar 00) • backlogs: very low on average(Leonardi et al 01; Shah-Kopikare 02) • MWM: implementation • has cubic worst-case complexity (approx. 27,000 iterations for a 30-port switch) • MWM algorithms involve backtracking: i.e. edges laid down in one iteration may be removed in a subsequent iteration • algorithm not amenable to pipelining

Switch algorithms 19 19 18 1 7 Better performance Max Wt Matching Maximal matching Max Size Matching Easier to implement Not stable Not stable Stable and low backlogs

Summarizing… • Need algorithms that are simple and perform well • can process packets economically at line speeds • deliver a high throughput • and low latencies/backlogs • Need to develop methods for analyzing the performance of these algorithms • traditional methods aren’t good enough

Rest of the talk • A simple, high performance scheduling algorithm • Randomized approximation of Max Wt Matching (joint work with P. Giaccone and B. Prabhakar) • A method for analyzing backlogs based on heavy traffic theory (joint work with D. Wischik) • A new approach to switch scheduling • Randomized edge coloring (joint work with G. Aggrawal, R. Motwani and A. Zhu)

Algorithm IRandomized approximationto the Max Wt Matching algorithm

Randomized approximation to MWM • Consider the following randomized approximation: • At every time • - sample d matchings independently and uniformly • - use the heaviest of these d matchings to schedule packets • Ideally we would like to use a small value of d. However,… Theorem. Even with d = N, this algorithm is not stable. In fact, when d=N, the throughput · 1 – e-1¼ 0.63. (Giaccone-Prabhakar-Shah 02)

Tassiulas’ algorithm Previous schedule S(t-1) Random Matching R(t) Next time MAX Current schedule S(t)

Tassiulas’ algorithm MAX 10 50 10 40 30 10 70 10 60 20 S(t-1) R(t) W(R(t))=150 W(S(t-1))=160 S(t)

Performance of Tassiulas’ algorithm Theorem (Tassiulas 98): The above scheme is stable under any admissible Bernoulli IID inputs.

Backlogs under Tassiulas’ algorithm

Reducing backlogs: the Merge operation Merge 30v/s120 130 v/s30 10 50 40 10 30 10 70 10 60 20 S(t-1) R(t) W(R(t))=150 W(S(t-1))=160

Reducing backlogs: the Merge operation Merge 10 50 40 10 30 10 70 10 60 20 S(t-1) R(t) W(R(t))=150 W(S(t-1))=160 W(S(t)) = 250

Performance of Merge algorithm Theorem (GPS): The Merge scheme is stable under any admissible Bernoulli IID inputs.

Merge v/s Max

Use arrival information: Serena 23 7 47 89 3 11 31 2 97 5 S(t-1) The arrival graph W(S(t-1))=209

Use arrival information: Serena 23 47 89 3 11 31 2 97 5 S(t-1) The arrival graph W(S(t-1))=209

Use arrival information: Serena 23 Merge 0 23 89 3 31 97 S(t) W(S(t))=243 23 47 89 3 11 31 97 6 S(t-1) W=121 W(S(t-1))=209

Performance of Serena algorithm Theorem (GPS): The Serena algorithm is stable under any admissible Bernoulli IID inputs.

Backlogs under Serena

Summary of main ideas and results • Randomization alone is not enough • Memory (using past sample) is very powerful • Exploiting problem structure: the Merge operation • Using arrival information • We obtained an algorithm • whose performance is close to that of MWM • and is very simple to implement

A new method for analyzing backlogsvia Heavy Traffic Theory

Some context… • We have seen the design of a simple, high throughput switch scheduling algorithm • The backlog/delay induced by an algorithm is another very important measure of its performance • For example, recall Tassiulas’ scheme

Delay or backlog analysis • Delay analysis is harder than throughput analysis because • throughput measures the long-term efficiency of an algorithm • whereas, the delay (or backlog) induced by an algorithm is a measure of its instantaneous efficiency • It is also harder to design low delay schedulers

Current delay analysis methods are not suitable • Queuing theory • requires knowledge of service time distribution • but this depends on the scheduling algorithm • service distribution is too complicated to determine • Large deviations: very successful for output-queued switch • N-port OQ switch can be “decomposed” into N independent 1-port switches • but an N-port input-queued switch is not decomposable • Stochastic coupling arguments • effective for establishing: Delay(Alg A) < Delay(Alg B) (without saying how much better) • very hard to make for input-queued switches

Our approach a • Use the machinery of heavy traffic theory (developed by J. M. Harrison, M. Bramson, R. Williams and many others) • Focus on the following intriguing observation • consider the MWM-a algorithm: edge(i,j) has weight Qija • Keslassy-McKeown 01 have observed that the average delay increases with a

Why this is worth focusing on • The MWM class of algorithms are the most analyzed • huge body of literature on throughput analysis • hardly anything known about their delay properties • in particular, no explanation of the Keslassy-McKeown observation • Note the singularity at a=0 in the K-M observation • “continuity” implies that delay should be the smallest when a=0 • however, MWM-0 is the Maximum Size Matching algorithm • MSM is known to be unstable; i.e. delays are infinite when a=0 ! (McKeown-Ananthram-Walrand 96) • Question: if MSM is not the optimum, who is? • Heavy traffic theory helps answer these questions

Heavy traffic theory • Consider a network with N queues • as some parameter r ! 1 • let the load on the system approach its capacity (hence the name HT) • speed up time by a factor r2, and scale down space by a factor r; i.e. quantity of interest: Qi(r2t)/r = Qir(t), say • as r !1, we’re looking at the zoomed out system both in time and space • this type of scaling occurs in the Central Limit Theorem time Original system 1/r r2 Scaled system

Heavy traffic theory • Consider a network with N queues • as some parameter r ! 1 • let the load on the system approach its capacity (hence the name HT) • speed up time by a factor r2, and scale down space by a factor r; i.e. quantity of interest: Qi(r2t)/r = Qir(t), say • as r !1, we’re looking at the zoomed out system both in time and space • this type of scaling occurs in the Central Limit Theorem • Main result: As r !1 • (Q1r(t),…,QNr(t)) ! (Q11(t),…,QN1(t)) = Q1(t), which is a multidimensional Reflected Brownian Motion (RBM) (such limiting results are independent of details of distribution) w

State space collapse: dimension reduction • Q1(t) can roam in lower dimensional space, say of dim K < N

State space collapse: dimension reduction • Q1(t) can roam in lower dimensional space, say of dim K < N • The HT limit procedure for LQF system • load: lr=l1r+L+lNr = 1-r-1 ! 1 • state: Qr(t)=(Q1r(t),L,QNr(t)) ! Q1(t) = (Q11(t),L, QN1(t)) • State space collapse • because of the LQF policy, Q11(t) = L = QN1(t) • or, the dimension of the state space is just 1, not N Q1r(t) l1r lNr Capacity =1 QNr(t)

State space collapse: performance • More importantly, the state space induced by an IQ switch scheduling algorithm gives an indication of its performance; i.e. State space(Alg A) ½ State space(Alg B) ) Alg B is better than Alg A • So , to resolve the Keslassy-McKeown observation, we need to • determine the state space of MWM-a, and • show • State space(MWM-a1) ½ State space(MWM-a2) if a1 > a2

Switch: state space collapse 17 4 ? • The dimension of the state space is N2, at most • what is the dimension due to state space collapse under MWM? • Given that MWM tries to equalize the weights of all matchings, • can we determine the minimum number of queue-sizes we need to know in order to work out the size of all the N2 queues • Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know. • Consider an example of 3-port switch 19 17 3 4 MWM=28 6

Switch: state space collapse 19 17 3 4 MWM=28 6 19 17 4 7 ? • The dimension of the state space is N2, at most • what is the dimension due to state space collapse under MWM? • Given that MWM tries to equalize the weights of all matchings, • can we determine the minimum number of queue-sizes we need to know in order to work out the size of all the N2 queues • Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know. • Consider an example of 3-port switch 7

Switch: state space collapse 19 17 3 4 MWM=28 6 4 • The dimension of the state space is N2, at most • what is the dimension due to state space collapse under MWM? • Given that MWM tries to equalize the weights of all matchings, • can we determine the minimum number of queue-sizes we need to know in order to work out the size of all the N2 queues • Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know. • Consider an example of 3-port switch 7 5

Switch: state space collapse 19 17 3 4 MWM=28 6 ? • The dimension of the state space is N2, at most • what is the dimension due to state space collapse under MWM? • Given that MWM tries to equalize the weights of all matchings, • can we determine the minimum number of queue-sizes we need to know in order to work out the size of all the N2 queues • Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know. • Consider an example of 3-port switch 7 5

Switch: state space collapse 19 17 3 4 MWM=28 6 18 • The dimension of the state space is N2, at most • what is the dimension due to state space collapse under MWM? • Given that MWM tries to equalize the weights of all matchings, • can we determine the minimum number of queue-sizes we need to know in order to work out the size of all the N2 queues • Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know. • Consider an example of 3-port switch 7 5

Devavrat Shah Stanford University

Devavrat Shah Stanford University

Presentation Transcript

Stanford University Tom Andriacchi

Daniel Schwartz Stanford University

Monica Lam Stanford University

Stanford University

Microsoft Research Stanford University

Stanford University

Stanford University

Martin Kay Stanford University

Scott Fendorf Stanford University

Maria Giovanna Dainotti Stanford University, Stanford, USA

Stanford University

Daniel Schwartz Stanford University

Stanford University, SLAC, NIIT

Stanford University

Andrea Goldsmith Stanford University

Wimax for Stanford University

Stanford University

Microsoft Research Stanford University

Kyle Shen Stanford University

Stanford University MRSEC 0213618

Stanford University Mobile Strategy

Stanford University