
The Parallel Packet Switch


Presentation Transcript


  1. The Parallel Packet Switch. Sundar Iyer, Amr Awadallah & Nick McKeown, High Performance Networking Group, Stanford University. Web site: http://klamath.stanford.edu/fjr

  2. Contents • Motivation • Introduction • Key Ideas • Speedup, Concentration, Constraints • Centralized Algorithm • Theorems, Results & Summary • Motivation for a Distributed Algorithm • Concepts • Independence, Trade-Off, Request Duplication • Performance of DPA • Conclusions & Future Work

  3. Motivation. "I want an ideal switch" • To build: • a switch whose memories run slower than the line rate • a switch with a highly scalable architecture • an extremely high-speed packet switch with extremely high line rates • a switch that supports quality of service (QoS) • a switch that offers redundancy

  4. Architecture Alternatives. [Figure: a three-axis chart. X axis: Ease of Implementation; Y axis: QoS Support; Z axis: Memory Speeds (1x, 2x, Nx). Output-queued, CIOQ, and input-queued switches each occupy a different corner of the trade-off; the PPS aims at the "Ideal" point.] • An Ideal Switch: • its memory runs at lower than line-rate speeds • it supports QoS • it is easy to implement

  5. What is a Parallel Packet Switch? A parallel packet switch (PPS) comprises multiple identical lower-speed packet switches operating independently and in parallel. An incoming stream of packets is spread, packet by packet, by a demultiplexer across the slower packet switches, then recombined by a multiplexer at the output.
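
The spread-and-recombine idea can be made concrete with a small sketch. The round-robin spreading rule, the sequence-number tags, and all names below are illustrative assumptions; the slides do not prescribe a particular demultiplexing policy.

```python
from collections import deque

K = 3  # number of internal lower-speed switches (layers)

def demultiplex(packets, k=K):
    """Spread an incoming packet stream, packet by packet, over k layers."""
    layers = [deque() for _ in range(k)]
    for seq, pkt in enumerate(packets):
        layers[seq % k].append((seq, pkt))  # tag with a sequence number
    return layers

def multiplex(layers):
    """Recombine the layers' packets into the original arrival order."""
    merged = sorted(p for layer in layers for p in layer)
    return [pkt for _, pkt in merged]

packets = ["A", "B", "C", "D", "E"]
assert multiplex(demultiplex(packets)) == packets
```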

  6. Key Ideas in a Parallel Packet Switch • Key concept: "inverse multiplexing" • Buffering occurs only in the internal switches! • By choosing a large value of k, we would like to arbitrarily reduce the memory speeds within the switch. Can such a switch work "ideally"? Can it give the advantages of an output-queued switch? What should the multiplexer and demultiplexer do? Doesn't the switch trivially behave well?

  7. Definitions • Output-Queued Switch • A switch in which arriving packets are placed immediately in queues at the output, where they contend with other packets destined to the same output while waiting their turn to depart. • "We would like to perform as well as an output-queued switch." • Mimic (black-box model) • Two different switches are said to mimic each other if, under identical inputs, identical packets depart from each switch at the same time. • Work Conserving • A system is said to be work conserving if its outputs never idle unnecessarily. • "If you've got something to do, do it now!"

  8. Ideal Scenario. [Figure: a PPS with N = 4 external ports at line rate R and k = 3 internal output-queued switches. Each demultiplexer feeds the three layers over R/3 links and each multiplexer collects from them over R/3 links. Packets destined to output port two are spread evenly, one per layer, so output two receives them at its full rate R.]

  9. Potential Pitfalls: Concentration. "Concentration is when a large number of cells destined to the same output are concentrated on a small fraction of internal layers." [Figure: the same N = 4, k = 3 PPS, but the cells destined to output port two are concentrated on layer 2, whose internal link would have to carry 2R/3 even though it runs at only R/3.]

  10. [Figure: a worked example on the k = 3 PPS with inputs A, B, and C. The left panels show cells arriving at t = 0 and t = 1; the right panels show cells departing at t = 0' and t = 1'. Each cell is labeled with its output port and the layer chosen for it: C1: A, 1; C2: A, 2; C3: A, 1; C4: B, 2; C5: B, 2.] Can concentration always be avoided?

  11. Link Constraints • Input Link Constraint (ILC): an external input port is constrained to send a cell to a specific layer at most once every ceil(k/S) time slots. • This constraint is due to the switch architecture: the link from a demultiplexer to a layer runs slower than the external line rate. • Each arriving cell must adhere to this constraint. • Output Link Constraint (OLC): a similar constraint exists for each output port. [Figure: a demultiplexer with a speedup of 2 and 10 links; the sets of busy links after t = 4 and after t = 5.]
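
A minimal sketch of the ILC bookkeeping, assuming a per-layer record of when this input last started a cell (an illustrative representation, not the authors' data structure). The k = 10, S = 2 example matches the figure above: each layer link may be reused only every ceil(10/2) = 5 slots.

```python
import math

def ilc_ok(last_start, layer, now, k, S):
    """True if the input may start a cell to `layer` at time slot `now`:
    the layer must not have been used in the last ceil(k/S) slots."""
    gap = math.ceil(k / S)
    return now - last_start.get(layer, -gap) >= gap

last_start = {}   # layer -> slot at which this input last started a cell
k, S = 10, 2      # 10 layers, speedup 2 -> one cell per layer per 5 slots
assert ilc_ok(last_start, 0, 0, k, S)
last_start[0] = 0
assert not ilc_ok(last_start, 0, 4, k, S)   # link still busy
assert ilc_ok(last_start, 0, 5, k, S)       # ceil(k/S) slots have elapsed
```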

  12. AIL and AOL Sets • Available Input Link set: AIL(i, n) is the set of layers to which external input port i can start sending a cell in time slot n. • It is the set of layers to which input i has not started sending a cell within the last ceil(k/S) time slots. • AIL(i, n) evolves over time. • AIL(i, n) is full when no cells have arrived at input i for ceil(k/S) time slots. • Available Output Link set: AOL(j, n') is the set of layers that can send a cell to external output j at time slot n' in the future. • It is the set of layers that have not started sending a new cell to external output j in the last ceil(k/S) time slots before n'. • AOL(j, n') evolves with time and with the cells sent to output j. • AOL(j, n') is never full as long as there are cells in the system destined to output j.
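
Given that bookkeeping, AIL(i, n) follows directly: it is the set of layers whose last start at input i is at least ceil(k/S) slots old. A sketch, with AOL(j, n') maintained symmetrically per output (names again illustrative):

```python
import math

def ail(last_start, now, k, S):
    """Layers input i has not started a cell to in the last ceil(k/S) slots."""
    gap = math.ceil(k / S)
    return {l for l in range(k)
            if now - last_start.get(l, -gap) >= gap}

last_start = {0: 4, 3: 2}              # hypothetical recent starts by input i
print(ail(last_start, 5, k=10, S=2))   # layers 0 and 3 are still busy
```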

  13. Bounding AIL and AOL • Lemma 1: |AIL(i, n)| >= k - ceil(k/S) + 1 • Lemma 2: |AOL(j, n')| >= k - ceil(k/S) + 1 • Intuition: an input starts at most one cell per slot, so at most ceil(k/S) - 1 layers can still be busy from recent starts, leaving at least k - ceil(k/S) + 1 layers available. For example, with k = 10 and S = 2, |AIL(i, n)| >= 10 - 5 + 1 = 6. [Figure: of the k links at a demultiplexer, at most ceil(k/S) - 1 are unavailable at t = n; the remaining k - ceil(k/S) + 1 form AIL(i, n).]

  14. Theorems • Theorem 1 (sufficiency): if a PPS guarantees that each arriving cell is allocated to a layer l such that l ∈ AIL(i, n) and l ∈ AOL(j, n') (i.e., if it meets both the ILC and the OLC), then the switch is work conserving. [Figure: Venn diagram of AIL(i, n) and AOL(j, n'); the cell goes to a layer in the intersection.] • Theorem 2 (sufficiency): a speedup of 2k/(k+2) is sufficient for a PPS to meet both the input and output link constraints for every cell. • Corollary: a PPS is work conserving if S >= 2k/(k+2).
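
Theorem 1's per-cell decision is just a set intersection. A sketch under an assumed set-of-integers representation of AIL and AOL:

```python
def choose_layer(ail_set, aol_set):
    """Pick a layer meeting both the ILC and the OLC (Theorem 1)."""
    candidates = ail_set & aol_set
    if not candidates:
        raise RuntimeError("no common layer; is the speedup < 2k/(k+2)?")
    return min(candidates)   # any member works; pick deterministically

# With k = 10 and S = 2, at most ceil(k/S) - 1 = 4 layers are excluded on
# each side (Lemmas 1 and 2), so |AIL|, |AOL| >= 6 and the two sets must
# overlap somewhere:
ail_set = {0, 1, 2, 3, 4, 5}
aol_set = {4, 5, 6, 7, 8, 9}
assert choose_layer(ail_set, aol_set) == 4
```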

  15. Theorems (contd.) • Theorem 3 (sufficiency): a PPS can exactly mimic an FCFS-OQ switch with a speedup of 2k/(k+2). Analogy to a Clos network?

  16. Summary of Results • CPA: Centralized PPS Algorithm • Each input maintains its AIL set. • The AIL sets are broadcast to a central scheduler. • CPA computes the intersection of AIL and AOL. • CPA timestamps the cells. • The cells are output in order of the global timestamp. • If the speedup S >= 2, then: • CPA is work conserving • CPA is perfectly load balancing • CPA can perfectly mimic an FCFS-OQ switch
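
A compact sketch of CPA's per-cell assignment step, assuming the global timestamping described above; the re-insertion of a layer into AIL/AOL after ceil(k/S) slots is elided, and all names are illustrative. Outputs later release cells in timestamp order, which is what lets CPA mimic an FCFS-OQ switch.

```python
import itertools

stamp = itertools.count()            # global FCFS timestamp source

def cpa_assign(i, j, ail, aol):
    """Assign cell (input i -> output j) a layer and a global timestamp."""
    layer = min(ail[i] & aol[j])     # a layer meeting both ILC and OLC
    ail[i].discard(layer)            # link now busy; the AIL/AOL
    aol[j].discard(layer)            # bookkeeping re-adds it later
    return layer, next(stamp)

k = 4
ail = {0: set(range(k))}
aol = {1: set(range(k))}
print(cpa_assign(0, 1, ail, aol))    # -> (0, 0): layer 0, timestamp 0
```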

  17. Motivation for a Distributed Solution • The centralized algorithm is not practical: • N sequential decisions must be made • each decision is a set intersection • it does not scale with N, the number of input ports • Ideally, we would like a distributed algorithm in which each input makes its decision independently. • Caveats • a totally distributed solution leads to concentration • a speedup of k might be required

  18. Potential Pitfall. "If inputs act independently, the PPS can immediately become non-work-conserving." Remedies: • decrease the number of inputs that request simultaneously • give the scheduler choice • increase the speedup appropriately

  19. DPA: Distributed PPS Algorithm • Inputs are partitioned into k groups of size floor(N/k). • There are N schedulers: • one for each output • each maintains AOL(j, n') • There are ceil(N/k) scheduling stages: • Broadcast phase • Request phase: • each input requests a layer that satisfies the ILC and the OLC (primary request) • each input also requests a duplicate layer (duplicate request), given by the duplication function • Grant phase: • the scheduler grants each input one of its two requests (a sketch of this phase follows below)
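
A sketch of the grant phase for one scheduling stage of one output's scheduler. The slides say only that each input is granted one of its two requests; the less-loaded-layer tie-break below, like every name here, is an assumption:

```python
def dpa_grant(requests, k):
    """requests: list of (input, primary_layer, duplicate_layer or None).
    Grant each input one of its requested layers, balancing layer load."""
    load = {l: 0 for l in range(k)}
    grants = {}
    for inp, primary, duplicate in requests:
        options = [primary] if duplicate is None else [primary, duplicate]
        choice = min(options, key=lambda l: load[l])  # prefer lighter layer
        grants[inp] = choice
        load[choice] += 1
    return grants

# Inputs 1, 3, 4 all want layer 0; duplicate requests spread them out:
print(dpa_grant([(1, 0, 1), (3, 0, 2), (4, 0, None)], k=3))
```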

  20. The Duplicate Request Function • Input i ∈ group g • The primary request is to layer l • l' is the duplicate request layer • k is the number of layers • l' = (l + g) mod k • "Inputs belonging to group k do not send duplicate requests": for g = k, l' = (l + k) mod k = l, the primary layer itself.
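
The function transcribes directly (groups numbered 1..k as on the slide):

```python
def duplicate_layer(l, g, k):
    """Duplicate request for an input in group g whose primary request is l."""
    l_dup = (l + g) % k
    return None if l_dup == l else l_dup   # group k: (l + k) mod k == l

k = 3
print(duplicate_layer(0, 1, k))   # -> 1
print(duplicate_layer(2, 2, k))   # -> 1
print(duplicate_layer(0, 3, k))   # -> None: group k sends no duplicate
```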

  21. Key Idea: Duplicate Requests. [Figure: the N = 4, k = 3 PPS with inputs A-D attached over R/k links; cells C1, C2, C3, and C4, all destined to output B, arrive together.] • Group 1 = {1, 2}; Group 2 = {3}; Group 3 = {4} • Inputs 1, 3, and 4 participate in the first scheduling stage • Input 4 belongs to group 3 (= k) and therefore does not duplicate

  22. Understanding the Scheduling Stage in DPA • A set of x nodes can pack at most x(x - 1) + 1 request tuples. • A set of x request tuples spans at least ceil(sqrt(x)) layers. • Hence the maximum number of requests that need to be granted to a single layer in a given scheduling stage is bounded by ceil(sqrt(k)). So a speedup of around sqrt(k) suffices? (See the numeric check below.)
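
A quick numeric check of that counting argument, assuming the packing bound just stated: r request tuples must span at least the smallest x with x(x - 1) + 1 >= r, and that x is never smaller than ceil(sqrt(r)).

```python
import math

def min_layers_spanned(r):
    """Smallest x with x*(x - 1) + 1 >= r, per the packing bound above."""
    x = 1
    while x * (x - 1) + 1 < r:
        x += 1
    return x

for r in (1, 2, 5, 10, 17, 50):
    assert min_layers_spanned(r) >= math.ceil(math.sqrt(r))
    print(r, min_layers_spanned(r), math.ceil(math.sqrt(r)))
```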

  23. DPA Results • Fact 1 (work conservation: necessary condition for a PPS): for the PPS to be work conserving, no more than S cells may be scheduled to depart from the same layer in a given window of k time slots. • Fact 2 (work conservation: sufficiency for DPA): if in any scheduling stage we present only layers that have fewer than S - ceil(sqrt(k)) cells belonging to the present k-slot window in the AOL, then DPA always remains work conserving. • Fact 3: we must ensure that there always exist two layers l and l' such that: • l ∈ AIL and l ∈ AOL • l' is the duplicate of l • l' ∈ AIL and l' ∈ AOL • A speedup of S suffices, where: • S > ceil(sqrt(k)) + 3 for k > 16 • S > ceil(sqrt(k)) + 4 for k > 2

  24. Conclusions & Future Work • CPA is not practical. • DPA has to be made simpler. • Extend the results to cover non-FIFO QoS policies in a PPS. • Study multicasting in a PPS.

  25. Questions, please!
