
Can we make these scheduling algorithms simpler? Using a Simpler Architecture



  1. Can we make these scheduling algorithms simpler? Using a Simpler Architecture

  2. Buffered Crossbar Switches • A buffered crossbar switch is a switch with a buffered fabric (memory inside the crossbar). • A pure buffered crossbar switch architecture has buffering only inside the fabric and none anywhere else. • Due to the HoL blocking problem, VOQs are used on the input side.

  3. Buffered Crossbar Architecture [diagram: N input cards, each holding VOQs 1..N and an arbiter; data and flow-control paths into a buffered crossbar fabric; output cards 1..N, each with its own arbiter]

  4. Scheduling Process • Scheduling is divided into three steps: • Input scheduling:each input selects in a certain way one cell from the HoL of an eligible queue and sends it to the corresponding internal buffer. • Output scheduling: each output selects in a certain way from all internally buffered cells in the crossbar to be delivered to the output port. • Delivery notifying:for each delivered cell, inform the corresponding input of the internal buffer status.
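The three steps above can be sketched in a few lines, assuming single-cell internal buffers (the data layout and all names here are illustrative, not from the slides):

```python
# Minimal sketch of one time slot of buffered-crossbar scheduling.
N = 2  # ports

voq = [[["a"], []], [[], ["b"]]]       # voq[i][j]: cells at input i for output j
xbuf = [[None] * N for _ in range(N)]  # xbuf[i][j]: internal crossbar buffer

def input_schedule():
    # Step 1: each input moves one HoL cell into a free internal buffer.
    for i in range(N):
        for j in range(N):
            if voq[i][j] and xbuf[i][j] is None:
                xbuf[i][j] = voq[i][j].pop(0)
                break

def output_schedule():
    # Step 2: each output drains one internally buffered cell.
    # Step 3: freeing the slot models the delivery notification to the input.
    delivered = []
    for j in range(N):
        for i in range(N):
            if xbuf[i][j] is not None:
                delivered.append(xbuf[i][j])
                xbuf[i][j] = None
                break
    return delivered

input_schedule()
assert output_schedule() == ["a", "b"]
```

Note how the input and output loops never consult each other: that independence is exactly what the next slide lists as the architecture's main advantage.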

  5. Advantages • Total independence between input and output arbiters (distributed design; 1/N complexity compared to centralized schedulers) • The performance of the switch is much better (because there is much less output contention) – a combination of IQ and OQ switches • Disadvantage: the crossbar is more complicated

  6. I/O Contention Resolution [diagram: a 4x4 switch, inputs 1-4 contending for outputs 1-4]

  7. I/O Contention Resolution (continued) [diagram: the same 4x4 switch after contention is resolved]

  8. The Round Robin Algorithm • InRr-OutRr • Input scheduling: InRr (round-robin) – Each input selects the next eligible VOQ, based on its highest-priority pointer, and sends its HoL packet to the internal buffer. • Output scheduling: OutRr (round-robin) – Each output selects the next nonempty internal buffer, based on its highest-priority pointer, and sends its cell to the output link.
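The highest-priority-pointer rule shared by InRr and OutRr can be sketched as a small helper (the name `rr_pick` is mine):

```python
def rr_pick(eligible, ptr, n):
    """Return the first eligible index at or after ptr, wrapping around,
    or None if nothing is eligible. This is the round-robin rule used by
    both the input (InRr) and output (OutRr) arbiters."""
    for off in range(n):
        k = (ptr + off) % n
        if eligible[k]:
            return k
    return None

# Nonempty VOQs at positions 1 and 3; pointer at 2 picks 3, pointer at 0 picks 1.
assert rr_pick([False, True, False, True], 2, 4) == 3
assert rr_pick([False, True, False, True], 0, 4) == 1
```

After a grant, each arbiter would advance its pointer to one position beyond the selected index, as the later slides describe.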

  9. Input Scheduling (InRr) [diagram: a 4x4 example; each input's round-robin pointer picks the next eligible VOQ]

  10. Output Scheduling (OutRr) [diagram: same 4x4 example; each output's round-robin pointer picks the next nonempty internal buffer]

  11. Output Pointer Update + Notification Delivery [diagram: output pointers advance one position beyond the served buffers, and the inputs are notified of the freed slots]

  12. Performance Study • Delay/throughput under Bernoulli uniform and bursty uniform traffic • Stability performance

  13. 32x32 Switch under Bernoulli Uniform Traffic [plot: average delay vs. normalized load (0.3-1.0) for OQ, RR-RR, 1-SLIP, and 4-SLIP under Bernoulli uniform arrivals]

  14. 32x32 Switch under Bursty Uniform Traffic [plot: average delay vs. normalized load (0.3-1.0) for OQ, RR-RR, 1-SLIP, and 4-SLIP under bursty uniform arrivals]

  15. Scheduling Process • Because the arbitration is simple: • We can afford to have algorithms based on weights for example (LQF, OCF). • We can afford to have algorithms that provide QoS

  16. Buffered Crossbar Solution: Scheduler • The algorithm MVF-RR is composed of two parts: • Input scheduler – MVF (most vacancies first) Each input selects the column of internal buffers (destined to the same output) where there are most vacancies (non-full buffers). • Output scheduler – Round-robin Each output chooses the internal buffer which appears next on its static round-robin schedule from the highest priority one and updates the pointer to 1 location beyond the chosen one.
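The MVF selection rule can be sketched as follows, assuming each input keeps a per-output count of vacant internal buffers (the names are mine):

```python
def mvf_select(vacancies, nonempty_voq):
    """MVF input scheduler (sketch): among outputs with a nonempty VOQ,
    pick the one whose internal-buffer column has the most vacancies."""
    best = None
    for j, free in enumerate(vacancies):
        if nonempty_voq[j] and (best is None or free > vacancies[best]):
            best = j
    return best

# Outputs 0 and 2 have waiting cells; output 2's column has more free buffers.
assert mvf_select([3, 5, 4], [True, False, True]) == 2
```

The intuition is to steer cells toward the least congested columns, keeping the output schedulers busy.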

  17. Buffered Crossbar Solution: Scheduler • The algorithm ECF-RR is composed of two parts: • Input scheduler – ECF (empty column first) • Each input selects the first empty column of internal buffers (destined to the same output). If there is no empty column, it selects one on a round-robin basis. • Output scheduler – Round-robin • Each output chooses the internal buffer which appears next on its static round-robin schedule from the highest-priority one and updates the pointer to 1 location beyond the chosen one.
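The ECF rule, sketched under the same illustrative layout (per-output occupancy counts; names are assumptions):

```python
def ecf_select(occupied, nonempty_voq, rr_ptr, n):
    """ECF input scheduler (sketch): prefer the first fully empty column
    with traffic waiting; otherwise fall back to round-robin from rr_ptr."""
    for j in range(n):
        if nonempty_voq[j] and occupied[j] == 0:
            return j  # empty column first
    for off in range(n):
        k = (rr_ptr + off) % n
        if nonempty_voq[k]:
            return k  # round-robin fallback
    return None

# Column 1 is empty and has traffic, so it is chosen ahead of the RR pointer.
assert ecf_select([2, 0, 1], [True, True, True], 2, 3) == 1
```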

  18. Buffered Crossbar Solution: Scheduler • The algorithm RR-REMOVE is composed of two parts: • Input scheduler – Round-robin (with remove-request signal sending) • Each input chooses the nonempty VOQ which appears next on its static round-robin schedule from the highest-priority one and updates the pointer to 1 location beyond the chosen one. It then sends out at most one remove-request signal to the outputs. • Output scheduler – REMOVE • For each output, if it receives any remove-request signals, it chooses one of them based on its highest-priority pointer and removes the cell. If no signal is received, it does simple round-robin arbitration.
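The output-side REMOVE step can be sketched as follows (a simplification; the actual signal wiring and pointer updates are abstracted away, and the names are mine):

```python
def remove_arbiter(requests, hp_ptr, nonempty_buffers, n):
    """REMOVE output scheduler (sketch): serve a remove-request if any
    arrived, chosen from the highest-priority pointer; otherwise do plain
    round-robin over nonempty internal buffers."""
    pool = requests if any(requests) else nonempty_buffers
    for off in range(n):
        k = (hp_ptr + off) % n
        if pool[k]:
            return k
    return None

# A remove-request from input 3 wins over merely nonempty buffers at 0 and 1.
assert remove_arbiter([False, False, False, True], 0,
                      [True, True, False, False], 4) == 3
```

Giving remove-requests priority lets inputs actively clear the internal buffers they want freed.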

  19. Buffered Crossbar Solution: Scheduler • The algorithm ECF-REMOVE is composed of two parts: • Input scheduler – ECF (with remove-request signal sending) • Each input selects the first empty column of internal buffers (destined to the same output). If there is no empty column, it selects one on a round-robin basis. It then sends out at most one remove-request signal to the outputs. • Output scheduler – REMOVE • For each output, if it receives any remove-request signals, it chooses one of them based on its highest-priority pointer and removes the cell. If no signal is received, it does simple round-robin arbitration.

  20. Hardware Implementation of ECF-RR: An Input Scheduling Block [diagram: round-robin arbiters with a highest-priority pointer; grant and any-grant signals feed selectors 0..N-1, which produce the arbitration results]

  21. Performance Evaluation: Simulation Study Uniform Traffic

  22. Performance Evaluation: Simulation Study ECF-REMOVE over RR-RR

  23. Performance Evaluation: Simulation Study Bursty Traffic

  24. Performance Evaluation: Simulation Study ECF-REMOVE over RR-RR

  25. Performance Evaluation: Simulation Study Hotspot Traffic

  26. Performance Evaluation: Simulation Study ECF-REMOVE over RR-RR

  27. Quality of Service Mechanisms for Switches/Routers and the Internet

  28. Recap • High-Performance Switch Design • We need scalable switch fabrics – crossbar, bit-sliced crossbar, Clos networks. • We need to solve the memory bandwidth problem • Our conclusion is to go for input-queued switches • We need to use VOQs instead of FIFO queues • For these switches to function at high speed, we need efficient and practically implementable scheduling/arbitration algorithms

  29. Algorithms for VOQ Switching • We analyzed several algorithms for matching inputs and outputs • Maximum size matching: these are based on bipartite maximum matching, which can be solved using max-flow techniques in O(N^2.5) • These are not practical for high-speed implementations • They are stable (100% throughput) for uniform traffic • They are not stable for non-uniform traffic • Maximal size matching: these try to approximate maximum size matching • PIM, iSLIP, SRR, etc. • These are practical – can be executed in parallel in O(log N) or even O(1) • They are stable for uniform traffic and unstable for non-uniform traffic
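One request-grant-accept iteration of an iSLIP-style maximal matcher can be sketched as follows (pointer updates are omitted and all names are illustrative; this is a simplification, not the full iSLIP specification):

```python
def rga_iteration(requests, grant_ptr, accept_ptr, n):
    """One request-grant-accept round. requests[i][j] is True when
    input i has a cell for output j; grant_ptr/accept_ptr are the
    per-output and per-input round-robin pointers."""
    def rr_first(flags, ptr):
        for off in range(n):
            k = (ptr + off) % n
            if flags[k]:
                return k
        return None

    # Grant: each output offers a grant to one requesting input.
    grants = [rr_first([requests[i][j] for i in range(n)], grant_ptr[j])
              for j in range(n)]
    # Accept: each input accepts at most one of the grants it received.
    match = {}
    for i in range(n):
        offered = [grants[j] == i for j in range(n)]
        j = rr_first(offered, accept_ptr[i])
        if j is not None:
            match[i] = j
    return match  # input -> output

# 2x2, full request matrix; desynchronized grant pointers pair everyone off.
assert rga_iteration([[True, True], [True, True]], [0, 1], [0, 0], 2) == {0: 0, 1: 1}
```

Running further iterations over the still-unmatched ports is what makes the final matching maximal.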

  30. Algorithms for VOQ Switching • Maximum weight matching: these are maximum matchings based on weights such as queue length (LQF, LPF) or age of cell (OCF), with a complexity of O(N^3 log N) • These are not practical for high-speed implementations – much more difficult to implement than maximum size matching • They are stable (100% throughput) under any admissible traffic • Maximal weight matching: these try to approximate maximum weight matching, using an RGA mechanism like iSLIP • iLQF, iLPF, iOCF, etc. • These are "somewhat" practical – can be executed in parallel in O(log N) or even O(1) like iSLIP, BUT the arbiters are much more complex to build • They have recently been shown to be stable under any admissible traffic

  31. Algorithms for VOQ Switching • Randomized algorithms • They try, in a smart way, to approximate maximum weight matching while avoiding an iterative process • They are stable under any admissible traffic • Their time complexity is small (depending on the algorithm) • Their hardware complexity is as yet untested • No schedulers – must deal with mis-sequencing of packets • Distributed schedulers – buffered crossbars • Two important points to remember • The time complexity of an algorithm is not a "true" indication of its hardware implementation • 100% throughput does not mean low delay • "Weak" vs. "strong" stability

  32. VOQ Algorithms and Delay • But delay is key • Because users don't care about throughput alone • They care (more) about delays • Delay = QoS (= $ for the network operator) • Why is delay difficult to approach theoretically? • Mainly because it is a statistical quantity • It depends on the traffic statistics at the inputs • It depends on the particular scheduling algorithm used • The last point makes it difficult to analyze delays in IQ switches • For example, in VOQ switches it is almost impossible to give any guarantees on delay • All you can hope for is high throughput and a bounded queue length – bounded average delay (but even the bound on the queue length is beyond the control of the algorithm – we cannot require, say, that a queue never grow longer than 10)

  33. VOQ Algorithms and Delay [diagram: links 1-4, each shown with ingress and egress sides] • This does not mean that we cannot have an algorithm that does that; it means none exists at the moment • For exactly this reason, almost all quality-of-service schemes (whether for delay or bandwidth guarantees) assume an output-queued switch

  34. VOQ Algorithms and Delay • WHY: Because an OQ switch has no “fabric” scheduling/arbitration algorithm. • Delay simply depends on traffic statistics • Researchers have shown that you can provide a lot of QoS algorithms (like WFQ) using a single server and based on the traffic statistics • But, OQ switches are extremely expensive to build • Memory bandwidth requirement is very high • These QoS scheduling algorithms have little practical significance for scalable and high-performance switches/routers.

  35. Output Queueing: The "ideal" [diagram: arriving cells flow directly through the fabric into per-output queues]

  36. How to get good delay cheaply? • Enter speedup… • The fabric speedup for an IQ switch equals 1 (memory bandwidth = 2) • The fabric speedup for an OQ switch equals N (memory bandwidth = N + 1) • Suppose we consider switches with fabric speedup S, 1 < S << N • Such a switch will require buffers both at the input and the output – call these combined input- and output-queued (CIOQ) switches • Such switches could help if… • With very small values of S • We get the performance – both delay and throughput – of an OQ switch

  37. A CIOQ switch • Consists of • An (internally non-blocking, e.g. crossbar) fabric with speedup S > 1 • Input and output buffers • A scheduler to determine matchings

  38. A CIOQ switch • For concreteness, suppose S = 2. The operation of the switch consists of • Transferring no more than 2 cells from (to) each input (output) • Logically, we will think of each time slot as consisting of two phases • Arrivals to (departures from) switch occur at most once per time slot • The transfer of cells from inputs to outputs can occur in each phase
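The two-phase time slot for S = 2 can be sketched as follows (a sketch with illustrative names; the per-phase matchings are given as inputs here, since computing them is the scheduler's job):

```python
S = 2  # fabric speedup: transfer phases per time slot

def time_slot(inputs, outputs, matchings):
    """Run one CIOQ time slot: up to S transfer phases, each moving at
    most one cell per matched (input, output) pair across the fabric.
    matchings[p] maps input index -> output index for phase p."""
    for phase in range(S):
        for i, j in matchings[phase].items():
            if inputs[i]:
                outputs[j].append(inputs[i].pop(0))

inputs = [["x1", "x2"], ["y1"]]
outputs = [[], []]
# Phase 1 matches 0->0 and 1->1; phase 2 matches 0->1.
time_slot(inputs, outputs, [{0: 0, 1: 1}, {0: 1}])
assert outputs == [["x1"], ["y1", "x2"]]
```

Arrivals and departures would bracket this: at most one cell arrives per input and one departs per output in each slot, while the fabric itself runs S times.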

  39. Using Speedup [diagram: with speedup 2, cells cross the fabric in both phases of a time slot]

  40. Performance of CIOQ switches • Now that we have a higher speedup, do we get a handle on delay? • Can we say something about delay (e.g., every packet from a given flow should be delivered within 15 msec)? • There is one way of doing this: competitive analysis • The idea is to compete with the performance of an OQ switch

  41. Intuition • Speedup = 1: fabric throughput = 0.58, average input queue = too large • Speedup = 2: fabric throughput = 1.16, average input queue = 6.25

  42. Intuition (continued) • Speedup = 3: fabric throughput = 1.74, average input queue = 1.35 • Speedup = 4: fabric throughput = 2.32, average input queue = 0.75

  43. Performance of CIOQ switches • The setup • Under arbitrary, but identical, inputs (packet by packet) • Is it possible to replace an OQ switch by a CIOQ switch and schedule the CIOQ switch so that the outputs are identical, packet by packet – to exactly mimic an OQ switch? • If yes, what is the scheduling algorithm?

  44. What is exact mimicking? Apply the same inputs to an OQ and a CIOQ switch • packet by packet Obtain the same outputs • packet by packet

  45. What is exact mimicking? Why is a speedup of N not necessary? It is useless to bring all packets to the output if they then just wait there; packets need to reach the output only just before they leave.

  46. Consequences • Suppose, for now, that a CIOQ switch is competitive with respect to an OQ switch. Then • We get perfect emulation of an OQ switch • This means we inherit all its throughput and delay properties • Most importantly – all QoS scheduling algorithms originally designed for OQ switches can be used directly on a CIOQ switch • But at the cost of introducing a scheduling algorithm – which is the key

  47. Emulating OQ Switches with CIOQ • Consider an N x N switch with (integer) speedup S > 1 • We’re going to see if this switch can emulate an OQ switch • We’ll apply the same inputs, cell-by-cell, to both switches • We’ll assume that the OQ switch sends out packets in FIFO order • And we’ll see if the CIOQ switch can match cells on the output side

  48. Key concept: Urgency • Urgency of a cell at any time = its departure time from the (shadow) OQ switch - current time • It basically indicates the time at which this cell would depart the OQ switch • This value is decremented after each time slot • When the value reaches 0, the cell must depart (it is at the HoL of its output queue)
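The bookkeeping is tiny (a sketch):

```python
def urgency(departure_slot, now):
    """Urgency = departure time in the shadow OQ switch minus current time.
    It shrinks by one every slot as `now` advances."""
    return departure_slot - now

# A cell due to leave the shadow OQ switch at slot 7:
assert urgency(7, 4) == 3   # three slots left to cross the CIOQ fabric
assert urgency(7, 7) == 0   # at the HoL of its output queue: must depart now
```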

  49. Key concept: Urgency • Algorithm: Most Urgent Cell First (MUCF). In each "phase": • Outputs try to get their most urgent cells from inputs. • Each input grants to the output whose cell is most urgent; in case of ties, output i takes priority over output i + k. • Loser outputs try to obtain their next most urgent cell from another (unmatched) input. • When no more matchings are possible, cells are transferred.

  50. Key concept: Urgency – Example • At the beginning of phase 1, both outputs 1 and 2 request input 1 to obtain their most urgent cells • Since there is a tie, input 1 grants to output 1 (the grant goes to the lowest port number) • Output 2 proceeds to get its next most urgent cell (from input 2, with urgency 3)
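One MUCF phase, greedily simplified to mirror the example above (the data layout and names are mine, and the iterative request/grant exchange is collapsed into a single greedy pass over outputs, lowest port number first):

```python
def mucf_phase(cells, n):
    """Simplified MUCF phase: cells[i][j] is the urgency of input i's HoL
    cell for output j, or None if there is no such cell. Lower urgency =
    more urgent. Returns a matching as output -> input."""
    matched_in, matched_out = set(), {}
    for j in range(n):  # lower output numbers pick first, realizing the tie rule
        best = None
        for i in range(n):
            if i in matched_in or cells[i][j] is None:
                continue
            if best is None or cells[i][j] < cells[best][j]:
                best = i
        if best is not None:
            matched_in.add(best)
            matched_out[j] = best
    return matched_out

# Both outputs want input 0 with urgency 1; output 0 wins the tie and
# output 1 falls back to its next most urgent cell (input 1, urgency 3).
assert mucf_phase([[1, 1], [None, 3]], 2) == {0: 0, 1: 1}
```

This mirrors the slide's example (with 0-indexed ports): the tied outputs resolve in favor of the lower port, and the loser takes its next most urgent cell from an unmatched input.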
