
A Deficit Round Robin 20MB/s Layer 2 Switch

Muraleedhara Navada, Francois Labonte


Presentation Transcript


  1. A Deficit Round Robin 20MB/s Layer 2 Switch. Muraleedhara Navada, Francois Labonte

  2. Fairness in Switches
     • How to provide fair bandwidth allocation at the output link?
     • Simple FIFO favors greedy flows
     • Separate flows into FIFOs at output
     • Bit by bit fair queuing
     • Weighted Fair Queuing allows different weights for flows
     • Packetized Weighted Fair Queuing (aka PGPS) calculates a departure time for each packet
     [Figure: output queued switch; round-robin bit by bit allocation]

  3. Deficit Round Robin
     • Packetized Weighted Fair Queuing is complicated to implement
     • Deficit Round Robin keeps track of credits for each flow
     • A flow sends according to its credits
     • Credits are added according to the flow's weight
     • Essentially PWFQ at a coarser level
     [Figure: per-flow credits over time]
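The credit bookkeeping described on this slide can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware design from the presentation: the function name `drr_schedule`, the flow names, the packet lengths, and the quantum of 75 are all assumptions chosen for the example.

```python
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """Return the (flow, packet_length) send order over a number of DRR rounds.

    queues: dict mapping flow -> deque of packet lengths in bytes
    quanta: dict mapping flow -> credits added per round (the flow's weight)
    """
    deficit = {flow: 0 for flow in queues}
    sent = []
    for _ in range(rounds):
        for flow, q in queues.items():
            if not q:
                deficit[flow] = 0  # an idle flow does not bank credits
                continue
            deficit[flow] += quanta[flow]
            # Send head-of-line packets while the credits cover them.
            while q and q[0] <= deficit[flow]:
                length = q.popleft()
                deficit[flow] -= length
                sent.append((flow, length))
            if not q:
                deficit[flow] = 0  # queue drained: leftover credits are dropped
    return sent

# Hypothetical traffic: flow A has three packets, flow B has two.
queues = {"A": deque([50, 100, 50]), "B": deque([150, 50])}
quanta = {"A": 75, "B": 75}
order = drr_schedule(queues, quanta, rounds=4)
```

With equal quanta, flow B's 150-byte packet has to wait until B has accumulated two rounds of credits, which is how a large packet is prevented from taking more than its share.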

  4. NetFPGA System
     • 8 Port 10MB/s duplex ethernet
     • Control FPGA (CFPGA) handles physical interface (MAC)
     • Our design targets both the User FPGAs (UFPGA)
     [Figure: block diagram with CFPGA, 10MB/s Ethernet, UFPGA0, UFPGA1 and 1MB SRAM blocks]

  5. Design Considerations
     • 4 MACs behind each of the 8 ports
     • Each flow is a unique Source Address – Destination Address pair
     • ~1024 flows
     • Split across FPGAs
     • Each UFPGA reads incoming packets from different ports (0-3 and 4-7)
     • Tradeoff between memory storage and fairness across all flows
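Treating each Source Address / Destination Address pair as a flow amounts to a table lookup that hands out flow IDs on first sight. The sketch below is an assumption about how such an assignment could look in software; the class name `FlowTable`, the sequential ID policy, and the behaviour when the table is full (returning `None`) are hypothetical and not taken from the slides. The capacity of 512 assumes the ~1024 flows are split evenly across the two UFPGAs.

```python
MAX_FLOWS = 512  # assumed per-FPGA share of the ~1024 flows

class FlowTable:
    """Maps (source MAC, destination MAC) pairs to small integer flow IDs."""

    def __init__(self, max_flows=MAX_FLOWS):
        self.max_flows = max_flows
        self.ids = {}  # (src, dst) -> flow ID

    def flow_id(self, src, dst):
        key = (src, dst)
        if key not in self.ids:
            if len(self.ids) >= self.max_flows:
                return None  # table full: hypothetical policy, caller decides
            self.ids[key] = len(self.ids)  # next unused ID
        return self.ids[key]

table = FlowTable()
first = table.flow_id("00:00:00:00:00:01", "00:00:00:00:00:02")
reverse = table.flow_id("00:00:00:00:00:02", "00:00:00:00:00:01")
again = table.flow_id("00:00:00:00:00:01", "00:00:00:00:00:02")
```

Note that the pair is ordered, so traffic in the reverse direction counts as a separate flow.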

  6. Memory Buffer Allocation
     • Static partitioning of 1MB SRAM across 512 flows gives 2 kbytes per flow, less than 2 max size packets
     • Need more dynamic allocation
     • Segments: smaller size means less fragmentation, but more pointer and list handling overhead
     • 128 bytes was chosen
     • Keep free segments list
     • Save on-chip only pointer to head and tail of each flow
     [Figure: memory layout holding packets P1 to P6]
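The segment scheme above (128-byte segments, a free list, and only head and tail pointers kept per flow) can be illustrated with a small linked-list allocator. This is a sketch under assumptions: the class name `SegmentMemory`, the drop-when-full policy, and the use of Python lists in place of SRAM pointers are illustrative only. With 1 MB of SRAM and 128-byte segments there are 8192 segments in total.

```python
SEGMENT_BYTES = 128
SRAM_BYTES = 1 << 20  # 1 MB

class SegmentMemory:
    """Linked lists of fixed-size segments: one free list plus one list per flow."""

    def __init__(self, num_segments=SRAM_BYTES // SEGMENT_BYTES):
        # next_ptr[i] is the segment that follows i in whichever list holds i.
        self.next_ptr = list(range(1, num_segments)) + [None]
        self.free_head = 0 if num_segments else None
        self.free_count = num_segments
        self.head = {}  # flow -> first segment (the state kept "on-chip")
        self.tail = {}  # flow -> last segment

    def enqueue(self, flow, length):
        """Append a packet of `length` bytes to a flow; False if memory is full."""
        needed = -(-length // SEGMENT_BYTES)  # ceiling division
        if needed > self.free_count:
            return False  # hypothetical policy: drop the packet
        for _ in range(needed):
            seg = self.free_head  # pop a segment from the free list
            self.free_head = self.next_ptr[seg]
            self.free_count -= 1
            self.next_ptr[seg] = None
            if flow in self.tail:
                self.next_ptr[self.tail[flow]] = seg  # link after current tail
            else:
                self.head[flow] = seg
            self.tail[flow] = seg
        return True

    def dequeue(self, flow, length):
        """Return the segments of the flow's head packet to the free list."""
        for _ in range(-(-length // SEGMENT_BYTES)):
            seg = self.head[flow]
            nxt = self.next_ptr[seg]
            self.next_ptr[seg] = self.free_head  # push onto the free list
            self.free_head = seg
            self.free_count += 1
            if nxt is None:
                del self.head[flow], self.tail[flow]
            else:
                self.head[flow] = nxt

mem = SegmentMemory()
accepted = mem.enqueue("flow0", 1518)  # a maximum-size Ethernet frame
used = 8192 - mem.free_count
```

A 1518-byte frame occupies 12 segments (1536 bytes), so at most 127 bytes are wasted per packet, whereas a static 2-kbyte partition could not hold two such frames at all.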

  7. MAC Address Learning
     • Instead of being told which MAC addresses belong to which port, the switch learns them from the source address of incoming packets
     • Note that our split FPGA design (reading from different ports) requires the two FPGAs to communicate the MACs they have learned to each other
     • When the destination MAC is not learned yet, broadcast (send to all other ports)
     • So MAC learning implies broadcast capability
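The learn-then-forward behaviour on this slide can be sketched as a single table keyed by MAC address. The class name `MacTable` and the string MAC addresses are assumptions for illustration; the exchange of learned addresses between the two FPGAs is not modelled here and is represented by one shared table.

```python
NUM_PORTS = 8

class MacTable:
    """Learns which port each source MAC was seen on and picks output ports."""

    def __init__(self):
        self.port_of = {}  # MAC address -> port where it was last seen

    def forward(self, src, dst, in_port):
        """Learn `src`, then return the list of ports the packet goes out on."""
        self.port_of[src] = in_port
        out = self.port_of.get(dst)
        if out is None:
            # Destination not learned yet: broadcast to all other ports.
            return [p for p in range(NUM_PORTS) if p != in_port]
        # Known destination; nothing to send if it sits on the arrival port.
        return [] if out == in_port else [out]

table = MacTable()
first = table.forward("aa", "bb", in_port=0)   # "bb" unknown: broadcast
reply = table.forward("bb", "aa", in_port=3)   # "aa" was learned on port 0
second = table.forward("aa", "bb", in_port=0)  # "bb" is now known on port 3
```

The first packet to an unknown destination is flooded, and the reply teaches the switch where that destination lives, so later packets go out on one port only.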

  8. Read Operation
     [Block diagram: Master Control, MAC Learning, Flow Assignment, CFPGA Interface, Control Handler, Packet Memory Manager, DRR Engine, 1 MB SRAM. Signals: Share SA; Read, port; DA, SA; Flow ID; Flow Tail; Length, ptr]

  9. Write Operation
     [Block diagram: Master Control, MAC Learning, Flow Assignment, CFPGA Interface, Control Handler, Packet Memory Manager, DRR Engine, 1 MB SRAM. Signals: Write, port; Port REQ; Port GNT; Data Ready; Head, length; Next head, length, latency]

  10. DRR Engine
     • How to handle 512 flows and stay work conserving:
        • Only one flow active at any time
        • DRR allocation happens on dequeuing
        • FIFOs contain the next flow to be serviced for each port
     • Statistics per flow:
        • Weight
        • Latency
        • Bytes sent
        • Packets sent
        • Packets active
     [Figure: flow data held in a 512 x 160 bits SRAM, feeding Port 0 FIFO through Port 7 FIFO]
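One way to read "DRR allocation happens on dequeuing" with a FIFO of flows awaiting service is the active-list form of Deficit Round Robin, sketched here for a single output port. This is an interpretation, not the presentation's hardware: the class name `PortScheduler`, the choice to grant credits when a flow joins or rejoins the FIFO, and the example traffic are all assumptions. Quanta must be positive for `dequeue` to terminate.

```python
from collections import deque

class PortScheduler:
    """Active-list DRR for one output port; credits are handled on dequeue."""

    def __init__(self, quantum):
        self.quantum = quantum  # flow -> credits granted per visit (weight)
        self.packets = {}       # flow -> deque of queued packet lengths
        self.deficit = {}       # flow -> current credits
        self.active = deque()   # FIFO of backlogged flows, next to serve first

    def enqueue(self, flow, length):
        q = self.packets.setdefault(flow, deque())
        if not q:
            # Flow becomes backlogged: join the FIFO with one quantum.
            self.active.append(flow)
            self.deficit[flow] = self.quantum[flow]
        q.append(length)

    def dequeue(self):
        """Return (flow, length) of the next packet to send, or None if idle."""
        while self.active:
            flow = self.active[0]
            q = self.packets[flow]
            if q[0] <= self.deficit[flow]:
                length = q.popleft()
                self.deficit[flow] -= length
                if not q:
                    self.active.popleft()   # drained: leave the FIFO
                    self.deficit[flow] = 0  # and forfeit leftover credits
                return flow, length
            # Credits exhausted: grant the next quantum and go to the back.
            self.deficit[flow] += self.quantum[flow]
            self.active.rotate(-1)
        return None

port = PortScheduler({"A": 75, "B": 75})
for flow, length in [("A", 50), ("A", 100), ("A", 50), ("B", 150), ("B", 50)]:
    port.enqueue(flow, length)
sent = [port.dequeue() for _ in range(6)]
```

Because the scheduler only ever inspects the flow at the front of the FIFO, the per-packet work does not grow with the number of flows, and the port never idles while any flow is backlogged.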

  11. Conclusion
     • A Deficit Round Robin Switch with 1k flows has been implemented
     • Provides dynamic memory buffer allocation, MAC learning and broadcast
     • Parallel design split across 2 chips
     • Gathers statistics on flows
