Packet Arbitration in VoQ switches and Others and QoS
Recap • High-Performance Switch Design • We need scalable switch fabrics – crossbar, bit-sliced crossbar, Clos networks. • We need to solve the memory bandwidth problem • Our conclusion is to go for input-queued switches • We need to use VOQs instead of FIFO queues • For these switches to function at high speed, we need efficient and practically implementable scheduling/arbitration algorithms
Switch core architecture [Figure labels: port processors (Port #1 ... Port #256), optics, LCS protocol, crossbar, scheduler, cell data, request, grant/credit.]
Algorithms for VOQ Switching • We analyzed several algorithms for matching inputs and outputs • Maximum size matching: these are based on bipartite maximum matching, which can be solved using max-flow techniques in O(N^2.5) • These are not practical for high-speed implementations • They are stable (100% throughput) for uniform traffic • They are not stable for non-uniform traffic • Maximal size matching: they try to approximate maximum size matching • PIM, iSLIP, SRR, etc. • These are practical: they can be executed in parallel in O(log N) or even O(1) • They are stable for uniform traffic and unstable for non-uniform traffic
Algorithms for VOQ Switching • Maximum weight matching: these are maximum matchings based on weights such as queue length (LQF, LPF) or age of cell (OCF), with a complexity of O(N^3 log N) • These are not practical for high-speed implementations. They are much more difficult to implement than maximum size matching • They are stable (100% throughput) under any admissible traffic • Maximal weight matching: they try to approximate maximum weight matching. They use an RGA (request-grant-accept) mechanism like iSLIP • iLQF, iLPF, iOCF, etc. • These are "somewhat" practical: they can be executed in parallel in O(log N) or even O(1) like iSLIP, BUT the arbiters are much more complex to build
Algorithms for VOQ Switching • Randomized algorithms • They try to approximate maximum weight matching in a smart way while avoiding an iterative process • They are stable under any admissible traffic • Their time complexity is small (depending on the algorithm) • Their hardware complexity is as yet untested.
Remember: Two Successive Scaling Problems • OQ routers: • + work-conserving (QoS) • - memory bandwidth = (N+1)R • IQ routers: • + memory bandwidth = 2R • - arbitration complexity (bipartite matching)
IQ Arbitration Complexity • Today: 64 ports at 10Gbps, 64-byte cells. • Arbitration Time = 64 bytes / 10Gbps = 51.2ns • Request/Grant Communication BW = 17.5Gbps • Scaling to 160Gbps: • Arbitration Time = 3.2ns • Request/Grant Communication BW = 280Gbps • Two main alternatives for scaling: • Increase cell size • Eliminate arbitration
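The arbitration-time figures above follow from dividing the cell size by the line rate. A minimal sketch of that arithmetic is below; the function name is hypothetical, and the linear scaling of the request/grant bandwidth is an assumption that happens to match the slide's 17.5Gbps and 280Gbps figures (the slide does not say how 17.5Gbps itself is derived).

```python
def arbitration_time_ns(cell_bytes: int, line_rate_gbps: float) -> float:
    """Time to send one cell at the line rate: the budget for one arbitration decision."""
    # bits / (Gbit/s) gives nanoseconds
    return cell_bytes * 8 / line_rate_gbps


print(arbitration_time_ns(64, 10))    # 51.2
print(arbitration_time_ns(64, 160))   # 3.2

# Assumption: request/grant bandwidth grows in proportion to the line rate.
scaled_bw_gbps = 17.5 * (160 / 10)
print(scaled_bw_gbps)                 # 280.0
```

Increasing the cell size raises the numerator and so buys back arbitration time, which is the first of the two scaling alternatives listed.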
Desirable Characteristics for Router Architecture • Ideal: OQ • 100% throughput • Minimum delay • Maintains packet order • Necessary: able to regularly connect any input to any output • What if the world was perfect? Assume Bernoulli iid uniform arrival traffic...
Round-Robin Scheduling • Uniform & non-bursty traffic => 100% throughput • Problem: traffic is non-uniform & bursty
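As an illustration of round-robin scheduling without arbitration, the sketch below builds a fixed cyclic schedule in which input i is connected to output (i + t) mod N in slot t. The function name and the exact shift pattern are assumptions for illustration; the slides only state that any input must be regularly connected to any output.

```python
def cyclic_schedule(n: int) -> list[list[int]]:
    """Row t lists, for each input i, the output it is connected to in slot t."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]


sched = cyclic_schedule(4)
for t, perm in enumerate(sched):
    pairs = ", ".join(f"{i}->{o}" for i, o in enumerate(perm))
    print(f"slot {t}: {pairs}")
```

Every slot is a conflict-free permutation, and over N slots each input visits every output exactly once, so each input-output pair gets a 1/N share of the line rate. That is sufficient only when traffic is uniform and non-bursty, which is the limitation the slide points out.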
Two-Stage Switch (I) [Figure labels: External Inputs, Internal Inputs, External Outputs (1..N each); First Round-Robin; Second Round-Robin.]
Two-Stage Switch (I) [Figure labels: Load Balancing; External Inputs, Internal Inputs, External Outputs (1..N each); First Round-Robin; Second Round-Robin.]
Two-Stage Switch Characteristics • 100% throughput • Problem: unbounded mis-sequencing [Figure labels: External Inputs, Internal Inputs, External Outputs (1..N each); Cyclic Shift; Cyclic Shift; cells numbered 1 and 2.]
Two-Stage Switch (II) • New: N^3 instead of N^2
Expanding VOQ Structure • Solution: expand VOQ structure by distinguishing among switch inputs
What is being done in practice (Cisco, for example) • They want schedulers that achieve 100% throughput and very low delay (like MWM) • They want it to be as simple as iSLIP in terms of hardware implementation • Is there any solution to this?
Typical Performance of iSLIP-like Algorithms [Figure: PIM with 4 iterations.]
Can we make these scheduling algorithms simpler? Using a Simpler Architecture
Buffered Crossbar Switches • A buffered crossbar switch is a switch with a buffered fabric (memory inside the crossbar). • A pure buffered crossbar switch architecture has buffering only inside the fabric and none anywhere else. • Because of the HOL blocking problem, VOQs are used on the input side.
Buffered Crossbar Architecture [Figure labels: Input Cards (queues 1..N each) with an arbiter per input; Data; Flow Control; an arbiter per output; Output Cards 1, 2, ..., N.]
Scheduling Process • Scheduling is divided into three steps: • Input scheduling: each input selects in a certain way one cell from the HoL of an eligible queue and sends it to the corresponding internal buffer. • Output scheduling: each output selects in a certain way from all internally buffered cells in the crossbar to be delivered to the output port. • Delivery notifying: for each delivered cell, inform the corresponding input of the internal buffer status.
Advantages • Total independence between input and output arbiters (distributed design) (1/N complexity as compared to centralized schedulers) • Performance of Switch is much better (because there is much less output contention) – a combination of IQ and OQ switches • Disadvantage: Crossbar is more complicated
I/O Contention Resolution [Figure: inputs 1 to 4 and outputs 1 to 4.]
The Round Robin Algorithm • InRr-OutRr • Input scheduling: InRr (Round-Robin) - Each input selects the next eligible VOQ, based on its highest priority pointer, and sends its HoL packet to the internal buffer. • Output scheduling: OutRr (Round-Robin) - Each output selects the next nonempty internal buffer, based on its highest priority pointer, and sends it to the output link.
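The three steps (input scheduling, output scheduling, delivery notification) can be sketched for one time slot as follows. This is a minimal model under stated assumptions, not the authors' implementation: each crosspoint buffer holds one cell, a cell written into a buffer may be delivered in the same slot, pointers advance to one position beyond the chosen one, and all names (`rr_pick`, `rr_rr_slot`, `voq`, `xbuf`) are hypothetical.

```python
def rr_pick(eligible, pointer):
    """Return the first eligible index at or after `pointer` (cyclically), or None."""
    n = len(eligible)
    for k in range(n):
        idx = (pointer + k) % n
        if eligible[idx]:
            return idx
    return None


def rr_rr_slot(voq, xbuf, in_ptr, out_ptr):
    """Run one RR-RR time slot; state is updated in place.

    voq[i][j]  : number of cells queued at input i for output j
    xbuf[i][j] : True if crosspoint buffer (i, j) holds a cell (assumed size 1)
    Returns the list of (input, output) pairs delivered in this slot.
    """
    n = len(voq)
    # Input scheduling (InRr): an eligible VOQ is non-empty and its buffer is free.
    for i in range(n):
        eligible = [voq[i][j] > 0 and not xbuf[i][j] for j in range(n)]
        j = rr_pick(eligible, in_ptr[i])
        if j is not None:
            voq[i][j] -= 1
            xbuf[i][j] = True
            in_ptr[i] = (j + 1) % n
    # Output scheduling (OutRr): serve the next non-empty buffer in the column.
    delivered = []
    for j in range(n):
        occupied = [xbuf[i][j] for i in range(n)]
        i = rr_pick(occupied, out_ptr[j])
        if i is not None:
            xbuf[i][j] = False  # delivery notification: input i sees (i, j) free again
            out_ptr[j] = (i + 1) % n
            delivered.append((i, j))
    return delivered


# Two inputs both hold one cell for output 0.
voq = [[1, 0], [1, 0]]
xbuf = [[False, False], [False, False]]
in_ptr, out_ptr = [0, 0], [0, 0]
print(rr_rr_slot(voq, xbuf, in_ptr, out_ptr))  # [(0, 0)]
print(rr_rr_slot(voq, xbuf, in_ptr, out_ptr))  # [(1, 0)]
```

Note that the input and output loops never consult each other's decisions: the crosspoint buffers absorb the output contention, which is why each arbiter can run independently.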
Input Scheduling (InRr) [Figure: inputs 1 to 4 and outputs 1 to 4, each arbiter shown with a round-robin pointer over positions 1 to 4.]
Output Scheduling (OutRr) [Figure: inputs 1 to 4 and outputs 1 to 4, each arbiter shown with a round-robin pointer over positions 1 to 4.]
Output Pointers Update + Notification Delivery [Figure: inputs 1 to 4 and outputs 1 to 4, each arbiter shown with a round-robin pointer over positions 1 to 4.]
Performance study • Delay/throughput under Bernoulli uniform and bursty uniform traffic • Stability performance
32x32 Switch under Bernoulli Uniform Traffic [Figure: average delay (log scale) versus normalized load (0.3 to 1) under Bernoulli uniform arrivals; curves: OQ, RR-RR, 1-SLIP, 4-SLIP.]
32x32 Switch under Bursty Uniform Traffic [Figure: average delay (log scale) versus normalized load (0.3 to 1) under bursty uniform arrivals; curves: OQ, RR-RR, 1-SLIP, 4-SLIP.]
Scheduling Process • Because the arbitration is simple: • We can afford to have algorithms based on weights (for example LQF, OCF). • We can afford to have algorithms that provide QoS
Buffered Crossbar Solution: Scheduler • The algorithm MVF-RR is composed of two parts: • Input scheduler – MVF (most vacancies first) • Each input selects the column of internal buffers (destined to the same output) that has the most vacancies (non-full buffers). • Output scheduler – Round-robin • Each output chooses the internal buffer which appears next on its static round-robin schedule from the highest priority one and updates the pointer to 1 location beyond the chosen one.
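A minimal sketch of the MVF input decision, assuming one-cell crosspoint buffers, that only non-empty VOQs whose own crosspoint buffer is free are eligible, and that ties are broken toward the lowest output index (the slides do not specify a tie-break). The function and variable names are hypothetical.

```python
def mvf_select(i, voq, xbuf):
    """Pick the output for input i whose buffer column has the most vacancies.

    voq[i][j]  : number of cells queued at input i for output j
    xbuf[r][j] : True if crosspoint buffer (r, j) is full
    Returns the chosen output index, or None if input i has no eligible VOQ.
    """
    n = len(voq)
    best_j, best_vacancies = None, -1
    for j in range(n):
        if voq[i][j] == 0 or xbuf[i][j]:
            continue  # nothing to send, or this input's own buffer is full
        vacancies = sum(1 for r in range(n) if not xbuf[r][j])
        if vacancies > best_vacancies:  # strict '>' keeps the lowest index on ties
            best_j, best_vacancies = j, vacancies
    return best_j


xbuf = [
    [False, False, False],
    [True,  True,  False],
    [True,  False, False],
]
voq = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
print(mvf_select(0, voq, xbuf))  # 2 (column 2 has three vacancies)
```

Counting vacancies per column requires each input to see the occupancy of buffers fed by other inputs, which is more state than the plain round-robin input arbiter needs.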
Buffered Crossbar Solution: Scheduler • The algorithm ECF-RR is composed of two parts: • Input scheduler – ECF (empty column first) • Each input selects first empty column of internal buffers (destined to the same output). If there is no empty column, it selects on a round-robin basis. • Output scheduler – Round-robin • Each output chooses the internal buffer which appears next on its static round-robin schedule from the highest priority one and updates the pointer to 1 location beyond the chosen one.
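The ECF input decision can be sketched in the same style. Assumptions: crosspoint buffers hold one cell, "first empty column" means the lowest-indexed eligible output whose column is entirely empty, and the round-robin fallback starts at the input's pointer. All names are hypothetical.

```python
def ecf_select(i, voq, xbuf, ptr):
    """Pick an output for input i: first empty column, else round-robin from ptr.

    voq[i][j]  : number of cells queued at input i for output j
    xbuf[r][j] : True if crosspoint buffer (r, j) is full
    Returns the chosen output index, or None if input i has no eligible VOQ.
    """
    n = len(voq)
    eligible = [voq[i][j] > 0 and not xbuf[i][j] for j in range(n)]
    # Prefer an output that currently has no buffered cell at all,
    # since that output would otherwise sit idle.
    for j in range(n):
        if eligible[j] and not any(xbuf[r][j] for r in range(n)):
            return j
    # Fallback: ordinary round-robin over the eligible VOQs.
    for k in range(n):
        j = (ptr + k) % n
        if eligible[j]:
            return j
    return None


xbuf = [
    [False, False, False],
    [True,  False, False],
    [False, False, True],
]
voq = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
print(ecf_select(0, voq, xbuf, ptr=0))  # 1 (column 1 is the only empty column)
```

The caller would still advance the round-robin pointer after a fallback choice, as in the round-robin description above.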
Buffered Crossbar Solution: Scheduler • The algorithm RR-REMOVE is composed of two parts: • Input scheduler – Round-robin (with remove-request signal sending) • Each input chooses the non-empty VOQ which appears next on its static round-robin schedule from the highest priority one and updates the pointer to 1 location beyond the chosen one. It then sends out at most one remove-request signal to outputs • Output scheduler – REMOVE • For each output, if it receives any remove-request signals, it chooses one of them based on its highest priority pointer and removes the cell. If no signal is received, it does simple round-robin arbitration.
Buffered Crossbar Solution: Scheduler • The algorithm ECF-REMOVE is composed of two parts: • Input scheduler – ECF (with remove-request signal sending) • Each input selects first empty column of internal buffers (destined to the same output). If there is no empty column, it selects on a round-robin basis. It then sends out at most one remove-request signal to outputs • Output scheduler – REMOVE • For each output, if it receives any remove-request signals, it chooses one of them based on its highest priority pointer and removes the cell. If no signal is received, it does simple round-robin arbitration.
Hardware Implementation of ECF-RR: An Input Scheduling Block [Figure labels: two round-robin arbiters, highest priority pointer, any grant, grants, selectors 0 to N-1, arbitration results.]
Performance Evaluation: Simulation Study • Uniform Traffic
Performance Evaluation: Simulation Study • ECF-REMOVE over RR-RR
Performance Evaluation: Simulation Study • Bursty Traffic
Performance Evaluation: Simulation Study • ECF-REMOVE over RR-RR
Performance Evaluation: Simulation Study • Hotspot Traffic
Performance Evaluation: Simulation Study • ECF-REMOVE over RR-RR
Quality of Service Mechanisms for Switches/Routers and the Internet
VOQ Algorithms and Delay • But delay is key • Because users don't care about throughput alone • They care (more) about delays • Delay = QoS (= $ for the network operator) • Why is delay difficult to approach theoretically? • Mainly because it is a statistical quantity • It depends on the traffic statistics at the inputs • It depends on the particular scheduling algorithm used • The last point makes it difficult to analyze delays in input-queued switches • For example, in VOQ switches it is almost impossible to give any guarantees on delay.
VOQ Algorithms and Delay • This does not mean that no algorithm can do that; it means that none exists at this moment. • For this exact reason, almost all quality of service schemes (whether for delay or bandwidth guarantees) assume that you have an output-queued switch [Figure labels: ingress and egress of Links 1 to 4.]
QoS Router [Figure labels: classifiers, policers, per-flow queues, queue management, schedulers, shapers.]