Towards Simple, High-performance Input-Queued Switch Schedulers

Devavrat Shah

Stanford University

Joint work with

Paolo Giaccone and Balaji Prabhakar

Berkeley, Dec 5

- Description of input-queued switches
- Scheduling
  - the problem
  - some history

- Simple, high-performance schedulers
  - Laura
  - Serena
  - Apsara

- Conclusions

- N inputs, N outputs (in the figure, N = 3)
- Time is slotted
  - at most one packet can arrive per time-slot at each input

- Equal-sized cells/packets
- Buffers only at inputs
- Use a crossbar for switching packets

- The crossbar is defined by these constraints: in each time-slot
  - only one packet can be transferred to each output
  - only one packet can be transferred from each input

- The scheduling problem: subject to the above constraints, find a matching of inputs and outputs
  - i.e. determine which output will receive a packet from which input in each time-slot
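The crossbar constraints can be made concrete with a small sketch. It assumes a schedule is stored as a tuple `match`, where `match[i]` is the output served by input `i`, and that only full matchings (permutations) are considered. The helper names and the queue-length matrix `q` are made up for illustration and do not come from the talk.

```python
def is_matching(match, n):
    """True if every input is paired with a distinct output (a permutation of 0..n-1)."""
    return len(match) == n and sorted(match) == list(range(n))

def weight(match, q):
    """Total weight of a schedule, using q[i][j] (e.g. a VOQ length) as the edge weight."""
    return sum(q[i][j] for i, j in enumerate(match))

# Hypothetical 3 x 3 queue-length matrix: q[i][j] = packets at input i for output j.
q = [[20, 3, 0],
     [2, 30, 0],
     [0, 0, 25]]

valid = (0, 1, 2)      # input 0 -> output 0, input 1 -> output 1, input 2 -> output 2
invalid = (0, 0, 2)    # two inputs send to output 0, which the crossbar forbids

print(is_matching(valid, 3), is_matching(invalid, 3), weight(valid, q))
```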

- [Karol et al. 1987] Throughput is limited due to head-of-line blocking (limited to 58% for Bernoulli IID uniform traffic)
- [Tamir 1989] Observed that with “Virtual Output Queues” (VOQs) head-of-line blocking is eliminated.

[Figure: N x N input-queued switch with virtual output queues. Labels: arrivals A11(t), ..., ANN(t); queue occupancies L11(t), ..., LNN(t); schedule S(t); departures D1(t), ..., DN(t).]

[Anderson et al. 1993] Finding a schedule is equivalent to finding a matching in the bipartite graph induced by the input and output nodes

[Figure: bipartite graph of inputs and outputs with edge weights 20, 3, 2, 30 and 25.]

[McKeown et al. 1995] (a) Maximum size match does not give 100% throughput. (b) But maximum weight match can, where the weight can be the queue length or the age of a cell

[Figure: MWM on the bipartite graph, with edges of weight 20, 30 and 25.]

- Maximum weight matching (MWM)
  - 100% throughput
  - provable delay bounds for i.i.d. Bernoulli admissible traffic
  - but finding the MWM is like solving a network-flow problem, whose complexity is too high for high-speed networks
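To make the cost of MWM tangible, here is a minimal brute-force sketch that searches all N! matchings. It is only an illustration for tiny N, not the network-flow method mentioned above. The function names and the weight matrix `q` are assumptions made for this example.

```python
from itertools import permutations

def weight(match, q):
    """Sum of q[i][match[i]] over all inputs i."""
    return sum(q[i][j] for i, j in enumerate(match))

def mwm_brute_force(q):
    """Return the heaviest of all N! matchings (feasible only for very small N)."""
    n = len(q)
    return max(permutations(range(n)), key=lambda m: weight(m, q))

# Hypothetical 3 x 3 weights (queue lengths).
q = [[20, 3, 0],
     [2, 30, 0],
     [0, 0, 25]]

best = mwm_brute_force(q)
print(best, weight(best, q))
```

The search space grows as N!, so even N = 32 is far out of reach for this approach, which is the motivation for the approximations that follow.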

- We seek to approximate maximum weight matching
- Our goal:
  - obtain an easily implementable approximation to MWM that performs competitively with MWM

- Two performance measures
  - throughput
  - delay

- We first consider simple approximations to MWM that deliver 100% throughput (i.e. stability), and then deal with delay

- Randomization
  - well-known method for simplifying implementation

- Using information in packet arrivals
  - since queue-sizes grow due to arrivals, and arrival times are a source of randomness

- Hardware parallelism
  - yields an efficient search procedure

- The main idea of randomized algorithms is
  - to simplify the decision-making process by basing decisions upon a small, randomly chosen sample from the state rather than upon the complete state

- Find the oldest person from a population of 1 billion
- Deterministic algorithm: linear search
  - has a complexity of 1 billion

- A randomized version: find the oldest of 30 randomly chosen people
  - has a complexity of 30 (ignoring complexity of random sampling)

- Performance
  - linear search will find the absolute oldest person (rank = 1)
  - if R is the person found by the randomized algorithm, we can make statements like
    P(R has rank < 100 million) > 0.95
  - thus, we can say that the performance of the randomized algorithm is very good with high probability
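The 0.95 figure can be checked with a one-line calculation. The sketch assumes the 30 people are sampled independently and uniformly, so each sample misses the top 10% (rank < 100 million out of 1 billion) with probability 0.9. With a population this large, sampling without replacement gives nearly the same value. The function name is hypothetical.

```python
def prob_best_in_top_fraction(fraction, samples):
    """P(at least one of `samples` uniform draws lands in the top `fraction`)."""
    return 1.0 - (1.0 - fraction) ** samples

# Top 100 million out of 1 billion is the top 10%; 30 samples are drawn.
p = prob_best_in_top_fraction(0.1, 30)
print(round(p, 4))
```

The oldest of the 30 sampled people has rank below 100 million exactly when at least one sample falls in the top 10%, which is why the complement of 0.9^30 is the right quantity.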

- Often, we want to perform some operation iteratively
- Example: find the oldest person each year
- Say in 2001 you choose 30 people at random
- and store the identity of the oldest person in memory
- in 2002 you choose 29 new people at random
- let R be the oldest person from these 29 + 1 = 30 people
P(R has rank < 100 million)

or, P(R has rank < 50 million)

- Choose d matchings at random and use the heaviest one as the schedule
- Ideally we would like to have small d. However:
- Theorem: Even with d = N this algorithm doesn’t yield 100% throughput!
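A minimal sketch of this "best of d random matchings" scheduler follows. It assumes matchings are full permutations drawn uniformly at random and that the weight of a matching is the sum of the selected queue lengths. All names and the matrix `q` are illustrative.

```python
import random

def weight(match, q):
    """Sum of q[i][match[i]] over all inputs i."""
    return sum(q[i][j] for i, j in enumerate(match))

def random_matching(n, rng):
    """Uniformly random permutation: input i is connected to output perm[i]."""
    perm = list(range(n))
    rng.shuffle(perm)
    return tuple(perm)

def best_of_d(q, d, rng):
    """Draw d random matchings and keep the heaviest one as the schedule."""
    n = len(q)
    candidates = [random_matching(n, rng) for _ in range(d)]
    return max(candidates, key=lambda m: weight(m, q))

q = [[20, 3, 0],
     [2, 30, 0],
     [0, 0, 25]]
schedule = best_of_d(q, d=3, rng=random.Random(0))
print(schedule, weight(schedule, q))
```

Nothing is carried over between time slots here, which is the property the memory-based algorithms below change.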

- Switch size: 32 x 32
- Input traffic (shown for a 4 x 4 switch)
  - Bernoulli i.i.d. inputs
  - diagonal load matrix:
    - normalized load = x + y < 1
    - x = 2y
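The load matrix itself is not reproduced in this transcript, so the following is only a guess at its shape. It assumes "diagonal" means that input i sends rate x to output i and rate y to output (i + 1) mod N, with x = 2y and x + y equal to the normalized load. The function name and that placement of y are assumptions, not taken from the slides.

```python
def diagonal_load(n, load):
    """Build an n x n arrival-rate matrix with x on the diagonal and y next to it."""
    y = load / 3.0          # from x = 2y and x + y = load
    x = 2.0 * y
    lam = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lam[i][i] = x                 # assumed: rate x to the "own" output
        lam[i][(i + 1) % n] = y       # assumed: rate y to the next output
    return lam

lam = diagonal_load(4, 0.9)
for row in lam:
    print(row)
```

Under this assumption every row and every column sums to the normalized load, so the traffic is admissible whenever x + y < 1.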

- The state of the switch changes due to arrivals & departures
  - Between consecutive time slots, a queue’s length can change by at most 1
  - hence a heavy matching tends to stay heavy

- Therefore
  - “remembering” a heavy matching should help in improving the performance

- [Tassiulas 1998] proposed the following algorithm based on this observation:
  - let S(t-1) be the matching used at time t-1
  - let R(t) be a matching chosen uniformly at random
  - and let S(t) be the heavier of R(t) and S(t-1)

- This gives 100% throughput!
  - note the boost in throughput is due to the use of memory

- But, delays are very large
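The three steps of the algorithm can be sketched as follows. For simplicity the queue-length matrix `q` is held fixed across iterations, whereas in a real switch it changes every time slot. The names and data are illustrative assumptions.

```python
import random

def weight(match, q):
    """Sum of q[i][match[i]] over all inputs i."""
    return sum(q[i][j] for i, j in enumerate(match))

def random_matching(n, rng):
    """Matching chosen uniformly at random, as a permutation tuple."""
    perm = list(range(n))
    rng.shuffle(perm)
    return tuple(perm)

def tassiulas_step(prev, q, rng):
    """S(t) is the heavier of a fresh random matching R(t) and the remembered S(t-1)."""
    r = random_matching(len(q), rng)
    return r if weight(r, q) > weight(prev, q) else prev

q = [[20, 3, 0],
     [2, 30, 0],
     [0, 0, 25]]
rng = random.Random(1)
s = (2, 1, 0)
history = [s]
for _ in range(20):
    s = tassiulas_step(s, q, rng)
    history.append(s)
print(history[-1], weight(history[-1], q))
```

With a fixed `q` the weight of the remembered schedule can never decrease, which illustrates the role of memory.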

- Let G be a fully-connected graph where each node is one of the N! possible schedules
- Construct a Hamiltonian walk, H(t), on G
  - H(t) cycles through the nodes of G

- At any time t
  - let R(t) = H(t mod N!)
  - and let S(t) be the heavier of R(t) and S(t-1)
  - this also has 100% throughput, but delays are large
    (derandomization will be useful later)
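A minimal sketch of the derandomized variant is given below. Because G is fully connected, any fixed ordering of the N! matchings is a Hamiltonian walk. The sketch simply uses the lexicographic order produced by `itertools.permutations`, which is an arbitrary choice made here and not necessarily the walk used by the authors. The weights are held fixed for the demonstration.

```python
from itertools import cycle, permutations

def weight(match, q):
    """Sum of q[i][match[i]] over all inputs i."""
    return sum(q[i][j] for i, j in enumerate(match))

def hamiltonian_walk(n):
    """Endless iterator visiting all n! matchings in a fixed cyclic order (small n only)."""
    return cycle(permutations(range(n)))

def derandomized_step(prev, h, q):
    """S(t) is the heavier of the walk's current matching H(t) and S(t-1)."""
    return h if weight(h, q) > weight(prev, q) else prev

q = [[20, 3, 0],
     [2, 30, 0],
     [0, 0, 25]]
walk = hamiltonian_walk(3)
s = (2, 1, 0)
for _ in range(6):          # 3! = 6 steps visit every matching once
    s = derandomized_step(s, next(walk), q)
print(s, weight(s, q))
```

With fixed weights, one full pass of the walk is guaranteed to reach the maximum weight matching, though it takes N! slots to do so.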

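- Lemma: Consider an IQ switch with Bernoulli i.i.d. inputs. Let B be a matching algorithm which ensures W_B(t) >= W*(t) - c for every t, for some constant c, where W_B(t) is the weight of B's matching and W*(t) is the weight of the MWM at time t. Then B is stable.
- Theorem: W_DER(t) >= W*(t) - 2N · N!. Therefore, the derandomized algorithm (DER) is stable.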

- These simple approximations of MWM yield 100% throughput, but delays are large
- To obtain good delays we’ll present three different algorithms which use the following features:
  - selective remembrance -- Laura
  - information in the arrivals -- Serena
  - hardware parallelism -- Apsara

[Figure: S(t-1) and R(t) feed a COMP block; its output S(t) is used as S(t-1) in the next time slot.]

Tassiulas

- COMP = Maximum
- R(t): uniform sample

Laura

- COMP = Merge, which picks the best edges of the two matchings
- R(t): non-uniform sample

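Merging Procedure

[Figure: merging S(t-1), with W(S(t-1)) = 160, and R, with W(R) = 150. The union of the two matchings splits into two cycles; the weight of the S(t-1) edges minus the weight of the R edges is 10 - 40 + 10 - 30 + 10 - 50 = -90 on the first cycle and 70 - 10 + 60 - 20 = 100 on the second. Taking the heavier set of edges on each cycle gives S(t) with W(S(t)) = 250.]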
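The merging procedure can be sketched as follows, assuming both inputs are full matchings stored as permutation tuples. The union of two such matchings decomposes into alternating cycles, and on each cycle the sketch keeps whichever matching's edges are heavier. The layout of the example edges is hypothetical; only the weights (160, 150 and 250) are taken from the slide.

```python
def weight(match, q):
    """Sum of q[i][match[i]] over all inputs i."""
    return sum(q[i][j] for i, j in enumerate(match))

def merge(s, r, q):
    """On every cycle of the union of s and r, keep the heavier matching's edges."""
    n = len(s)
    r_inv = [0] * n                      # r_inv[j] = input that r connects to output j
    for i, j in enumerate(r):
        r_inv[j] = i
    merged = [None] * n
    for start in range(n):
        if merged[start] is not None:
            continue
        cyc, i = [], start               # walk: input -(s)-> output -(r)-> input ...
        while True:
            cyc.append(i)
            i = r_inv[s[i]]
            if i == start:
                break
        w_s = sum(q[k][s[k]] for k in cyc)
        w_r = sum(q[k][r[k]] for k in cyc)
        chosen = s if w_s >= w_r else r
        for k in cyc:
            merged[k] = chosen[k]
    return tuple(merged)

# Hypothetical 5 x 5 weights arranged to reproduce the totals on the slide.
q = [[10, 40, 0, 0, 0],
     [0, 10, 30, 0, 0],
     [50, 0, 10, 0, 0],
     [0, 0, 0, 70, 10],
     [0, 0, 0, 20, 60]]
s_prev = (0, 1, 2, 3, 4)                 # W = 160
r = (1, 2, 0, 4, 3)                      # W = 150
s_new = merge(s_prev, r, q)
print(s_new, weight(s_new, q))
```

The merged matching is at least as heavy as either input, and can be strictly heavier than both, as in this example.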

- Theorem:
- LAURA is stable under any admissible Bernoulli i.i.d. input traffic.

- Switch size: N = 32
- Length of VOQ: QMAX = 10000
- Comparison with
  - iSLIP, iLQF, MUCS, RPA and MWM

- Traffic matrices
  - uniform
  - diagonal
  - sparse
  - logdiagonal

SERENA

- Since an increase in queue sizes is due to arrivals
- and arrivals are a source of randomness
- use arrivals to generate a random matching

[Figure: S(t-1) and R(t), the matching generated using arrivals, feed a Merge block; its output S(t) is used as S(t-1) in the next time slot.]
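One way to turn a slot's arrivals into a full matching R(t) is sketched below. The slides do not spell out the rule, so the details are assumptions: when several inputs receive a packet for the same output, the edge with the larger queue length is kept, and leftover inputs and outputs are then paired in index order. The function name and data are illustrative.

```python
def arrival_matching(arrivals, q):
    """Build a full matching from this slot's arrivals.

    arrivals[i] is the output of the packet that arrived at input i, or None.
    """
    n = len(q)
    best_input = {}                          # output -> input kept for that output
    for i, j in enumerate(arrivals):
        if j is None:
            continue
        if j not in best_input or q[i][j] > q[best_input[j]][j]:
            best_input[j] = i                # assumed rule: keep the heavier edge
    match = [None] * n
    for j, i in best_input.items():
        match[i] = j
    free_inputs = [i for i in range(n) if match[i] is None]
    free_outputs = [j for j in range(n) if j not in best_input]
    for i, j in zip(free_inputs, free_outputs):
        match[i] = j                         # assumed rule: complete in index order
    return tuple(match)

# Hypothetical 4 x 4 example: inputs 0 and 1 both receive a packet for output 2.
q = [[0, 0, 5, 0],
     [0, 0, 9, 0],
     [0, 0, 0, 0],
     [1, 0, 0, 0]]
arrivals = [2, 2, None, 0]
r = arrival_matching(arrivals, q)
print(r)
```

The resulting R(t) would then be merged with S(t-1), as in the block diagram.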

Merging Procedure

[Figure: SERENA merging example. The arrival graph (labeled Arr-R) yields a matching R with W(R) = 121; merging R with S(t-1), which has W(S(t-1)) = 209, gives S(t) with W(S(t)) = 243.]

Theorem:

- SERENA achieves 100% throughput under any admissible i.i.d. Bernoulli traffic pattern

- One way to obtain MWM is to search the space of all N! matchings
- A natural approximation: If S(t-1) is the current matching, then S(t) is the heaviest matching in a “neighborhood” of S(t-1)
- It turns out that there is a convenient way of defining neighbors (both for theory and for practice)

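Example: 3 x 3 switch

[Figure: a matching S(t) and its neighbors.]

- Neighbors differ from S(t) in only two edges (for all values of N)
- Neighbors are generated in parallel

[Figure: the neighbors N1, N2, ..., Nk of S(t-1), the Hamiltonian walk H(t), and S(t-1) feed a MAX block; its output S(t) is used as S(t-1) in the next time slot.]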
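A minimal sequential sketch of one Apsara step is given below. It assumes a neighbor is obtained by swapping the outputs of two inputs, which changes exactly two edges and gives N(N-1)/2 neighbors (3 for a 3 x 3 switch). The hardware would evaluate the neighbors in parallel, while this sketch simply loops over them. The names and data are illustrative.

```python
from itertools import combinations

def weight(match, q):
    """Sum of q[i][match[i]] over all inputs i."""
    return sum(q[i][j] for i, j in enumerate(match))

def neighbors(s):
    """All matchings obtained from s by swapping the outputs of two inputs."""
    result = []
    for a, b in combinations(range(len(s)), 2):
        nb = list(s)
        nb[a], nb[b] = nb[b], nb[a]
        result.append(tuple(nb))
    return result

def apsara_step(prev, h, q):
    """S(t) is the heaviest among S(t-1), its neighbors, and the walk's matching H(t)."""
    candidates = [prev, h] + neighbors(prev)
    return max(candidates, key=lambda m: weight(m, q))

q = [[20, 3, 0],
     [2, 30, 0],
     [0, 0, 25]]
prev = (1, 0, 2)            # weight 3 + 2 + 25 = 30
h = (2, 1, 0)               # hypothetical value of the Hamiltonian walk at this slot
s = apsara_step(prev, h, q)
print(neighbors(prev), s, weight(s, q))
```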

- Theorem: Apsara is stable under any admissible i.i.d. Bernoulli traffic.
  (stability due to Hamiltonian matching)

- Also, note that W(S(t)) >= W(S(t-1),t)
- Theorem: If W(S(t)) = W(S(t-1),t) then
  W(S(t)) >= 0.5 W*(t)

  (this is not enough to ensure stability)

- The Apsara algorithm searches over neighbors in parallel
- If space is limited to K modules, then search over a randomly chosen subset of size K from all neighbors
- And there are other (good) deterministic ways of searching a smaller neighborhood of matchings

- We have presented novel scheduling algorithms for input-queued switches
  - Laura
  - Serena
  - Apsara

- They are simple to implement and perform competitively with respect to the Maximum Weight Matching algorithm

- L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input-queued switches,” Proc. INFOCOM, 1998.
- D. Shah, P. Giaccone and B. Prabhakar, “An efficient randomized algorithm for input-queued switch scheduling,” Proc. Hot Interconnects, 2001.
- P. Giaccone, D. Shah and B. Prabhakar, “An Implementable Parallel Scheduler for Input-Queued Switches,” Proc. Hot Interconnects, 2001.
- P. Giaccone, B. Prabhakar and D. Shah, “Towards simple and efficient scheduler for high-aggregate IQ switches,” submitted to INFOCOM 2002.
- R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.