Switch EECS 252 – Spring 2006 RAMP Blue Project

Presentation Transcript


  1. Switch EECS 252 – Spring 2006 RAMP Blue Project Jue Sun and Gary Voronel Electrical Engineering and Computer Sciences University of California, Berkeley May 1, 2006

  2. Outline • Goal of switch • Implementation • Performance • Future implementation • Current state of project • Project experience CS252-s06, Project Presentation

  3. One Piece of the Puzzle • Main goal of RAMP Blue is to build a large-scale system • To do useful work, processors must be able to communicate • Therefore, we need an interconnection network CS252-s06, Project Presentation

  4. Implementation Goals • Support communication between all processors in the system • Flexible hardware allowing parameterization of global system constants, especially the number of Microblaze cores per FPGA • Minimal resource utilization • High throughput • Low latency • Simple, homogeneous hardware • Simple software interface CS252-s06, Project Presentation

  5. Hardware Design Constraints • RAMP Blue will be implemented on the BEE2 • 4 user FPGAs per BEE2 board • 2 LVCMOS links for FPGA-to-FPGA communication • Relatively low latency (2 or 3 cycles) • Throughput: more than 64 bits • 16 MGT links per board (4 per FPGA) for board-to-board communication • Relatively high latency (20 or more cycles) • Throughput: 32 or 64 bits • To achieve the lowest latency possible, we limit packet routes to at most 1 MGT link • 16 Microblaze cores per FPGA (64 per board) • Depending on resource utilization, the number of cores per FPGA may need to be reduced CS252-s06, Project Presentation

  6. Physical Topology • Topology is fixed and homogeneous throughout the system • Each FPGA is directly connected to 2 other FPGAs on the same board and to 4 other boards • The number of cores per FPGA is the same on every FPGA • Each board has a direct connection to every other board in the system (maximum of 17 boards) • BOARD n connects to BOARD 16 through MGT n • With 16 cores per FPGA, 17 boards support 1088 processors! CS252-s06, Project Presentation

  7. Board Level Connectivity CS252-s06, Project Presentation

  8. FPGA Level Connectivity For clarity, configuration shown is with 4 Microblaze cores per FPGA CS252-s06, Project Presentation

  9. Switch Fabric Specifications • Crossbar switch with maximal connectivity • Every Microblaze can access every other Microblaze on the same FPGA directly • Every Microblaze can access both LVCMOS links • Every Microblaze can access all FPGA-local MGT links • Buffering on inputs and outputs • Store-and-forward buffers for Microblazes to decrease complexity and simplify the software interface • Cut-through buffers for LVCMOS links • MGT links are wrapped XAUI cores that already have internal buffers CS252-s06, Project Presentation

  10. Microblaze Level Connectivity For clarity, configuration shown is with 4 Microblaze cores per FPGA CS252-s06, Project Presentation

  11. Switch Overall CS252-s06, Project Presentation

  12. Scheduler • If two ports want to send to the same port at the same time, the requesting port with the lowest number is allowed to send first • Other control logic (not shown here) implements the protocol between the switch and the buffers CS252-s06, Project Presentation
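
A minimal C model of this fixed-priority arbitration is sketched below. The real scheduler is hardware; the port count, types, and function name here are assumptions used only for illustration.

    #include <stdint.h>

    #define NUM_PORTS 22  /* e.g. 16 MB + 4 MGT + 2 LVCMOS ports on one FPGA */

    /* request[i] is nonzero when input port i wants the contested output port */
    int arbitrate(const uint8_t request[NUM_PORTS])
    {
        for (int i = 0; i < NUM_PORTS; i++) {
            if (request[i])
                return i;        /* lowest-numbered requester is granted first */
        }
        return -1;               /* no requests this cycle */
    }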

  21. Source Routing • Fixed topology allows for straightforward source routing implementation • Destination routing would be more robust, but would require significantly more resources and greater complexity • Packet header is extremely simple: just a concatenated sequence of hops • Minimal hardware required to determine next hop and adjust the header at every hop (zero LUTs used – can’t get better than that!) • The next hop is encoded in the lowest bits of the header • To adjust the header, the hardware must simply shift out the lowest bits CS252-s06, Project Presentation

  22. Source Routing – Hop Encoding • Need 5 bits to represent each hop • Must be able to encode 16 cores per FPGA + 4 MGT links + 2 LVCMOS links = 22 total encodings (+ 1 for a FIN code) • If 8 or fewer cores per FPGA are used, then each hop can be represented using only 4 bits (the hardware supports parameterization of the hop encoding width) • Maximum of 6 hops based on the physical topology • MGT links are constrained to 1 hop per route • Therefore, the worst-case route is: LVCMOS → LVCMOS → MGT → LVCMOS → LVCMOS → MB • Hop encoding allows the header to fit into 1 word • 6 hops x 5 bits/hop = 30 bits CS252-s06, Project Presentation
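
The next-hop extraction and header adjustment described above can be sketched in C as follows. This is a software model only: HOP_BITS follows the 5-bit encoding on this slide, and the function names are hypothetical.

    #include <stdint.h>

    #define HOP_BITS 5                      /* 22 port codes + FIN fit in 5 bits */
    #define HOP_MASK ((1u << HOP_BITS) - 1)

    static inline unsigned next_hop(uint32_t header)
    {
        return header & HOP_MASK;           /* port to forward the packet to */
    }

    static inline uint32_t consume_hop(uint32_t header)
    {
        return header >> HOP_BITS;          /* shift the completed hop out */
    }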

  24. Source Routing – Global Naming • Processors are globally named • Necessary to reach the goal of a simple software interface • If there are 16 cores per FPGA with 4 FPGAs per board and 17 total boards, then the processors are numbered 0 – 1087 • The naming scheme scales down with fewer cores • Necessary to support parameterization of global system constants (especially the number of cores per FPGA) • If there are 4 cores per FPGA with 4 FPGAs per board and 17 total boards, then the processors are numbered 0 – 271 • An invalid processor number triggers an error at the software level • Again, this supports a simple software interface • Ensures that only packets with valid headers enter the network CS252-s06, Project Presentation
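
A sketch of how a global processor number could be decomposed into (board, FPGA, core) is shown below. The slides fix only the totals (e.g. 0 – 1087 for 16 cores per FPGA, 4 FPGAs per board, 17 boards), so the exact ordering used here is an assumption.

    /* Illustrative decomposition of a global processor number.  Core-major
     * ordering within an FPGA and FPGA-major ordering within a board are
     * assumptions for this sketch. */
    #define CORES_PER_FPGA  16   /* parameterizable global constant */
    #define FPGAS_PER_BOARD  4
    #define NUM_BOARDS      17

    struct proc_id { int board, fpga, core; };

    static int decode_proc(int global, struct proc_id *out)
    {
        int per_board = CORES_PER_FPGA * FPGAS_PER_BOARD;
        if (global < 0 || global >= per_board * NUM_BOARDS)
            return -1;                      /* invalid number: error at software level */
        out->board = global / per_board;
        out->fpga  = (global % per_board) / CORES_PER_FPGA;
        out->core  = global % CORES_PER_FPGA;
        return 0;
    }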

  25. Source Routing Example • For simplicity, let’s assume there are 4 cores per FPGA • Let’s send from processor #10 to processor #24 (representative of the worst-case path) CS252-s06, Project Presentation

  27. Source Routing Example • Destination core is on a different board, so packet must first be routed from the source FPGA (FPGA 2) to the FPGA that is connected to the destination board (which is FPGA 0) • This requires 2 hops over the LEFT LVCMOS link CS252-s06, Project Presentation

  30. Source Routing Example • Once at the proper FPGA, packet can be sent across the MGT link to an FPGA on the destination board CS252-s06, Project Presentation

  32. Source Routing Example • Then, the packet must be routed to the destination FPGA, which requires 2 more LVCMOS hops CS252-s06, Project Presentation

  35. Source Routing Example • Finally, the packet must be forwarded to the destination Microblaze core CS252-s06, Project Presentation

  37. Source Routing Example • Each arrowhead represents a hop – it takes 5 hops to reach the destination FPGA • One more hop sends the packet to the destination Microblaze core, totaling 6 hops in the worst case CS252-s06, Project Presentation
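
The route from this example can be written out as a small worked sketch. The hop codes below are hypothetical (the real encodings are a hardware parameter); with 4 cores per FPGA each hop needs only 4 bits.

    #define EX_HOP_BITS 4   /* 4 cores per FPGA, so 4 bits per hop suffice */

    /* hypothetical port codes on an FPGA-local switch */
    enum { HOP_MB0 = 0, HOP_LVCMOS_LEFT = 4, HOP_MGT0 = 6 };

    /* pack hops so the first hop ends up in the lowest bits of the header */
    static unsigned build_route(const unsigned *hops, int n)
    {
        unsigned header = 0;
        for (int i = n - 1; i >= 0; i--)
            header = (header << EX_HOP_BITS) | hops[i];
        return header;
    }

    /* route from the example: LVCMOS, LVCMOS, MGT, LVCMOS, LVCMOS, MB
     * 6 hops x 4 bits/hop = 24 bits, comfortably inside one 32-bit word */
    static const unsigned example_route[6] = {
        HOP_LVCMOS_LEFT, HOP_LVCMOS_LEFT, HOP_MGT0,
        HOP_LVCMOS_LEFT, HOP_LVCMOS_LEFT, HOP_MB0
    };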

  38. Source Routing – 17th Board • To support the 17th board, each board communicates with the 17th board through the MGT link matching its own board number CS252-s06, Project Presentation

  40. Source Routing – 17th Board • For example, for BOARD 0 to send to BOARD 16, it sends over MGT 0 CS252-s06, Project Presentation

  41. Microblaze Interface • Store and forward • Connected to the FSL bus for now • Essentially double buffered • MB FSL read/write speed is extremely slow compared to the switch delay – even at the highest optimization level with the most efficient code, it takes 48 cycles to write one value to the FSL bus! • Example: send from an MB to an LVCMOS link, loop back over the LVCMOS link, and return to the MB CS252-s06, Project Presentation
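
A minimal sketch of pushing one fixed-length packet to the switch over FSL is shown below, assuming the standard Xilinx MicroBlaze putfsl macro from mb_interface.h, FSL channel 0, and an assumed packet length. Each putfsl is a blocking word write, which is where the 48-cycles-per-word figure above applies.

    #include <mb_interface.h>

    #define PACKET_WORDS 16   /* fixed packet length in 32-bit words (assumed) */

    void send_packet_fsl(const unsigned int *pkt)
    {
        int i;
        for (i = 0; i < PACKET_WORDS; i++)
            putfsl(pkt[i], 0);   /* blocking 32-bit write to FSL channel 0 */
    }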

  42. LVCMOS Interface • 2 cycles of latency • Two buses connecting 2 FPGAs, usable for anything • The control bus and data bus are wired straight through the LVCMOS pins, except that the data_full/free signal is asserted 2 cycles before the buffer is actually full CS252-s06, Project Presentation

  43. XAUI Interface • Much simplified because the XAUI core has an internal buffer • Essentially just some control signals • The interface has recently changed, so this is still in progress CS252-s06, Project Presentation

  44. Software Interface • Simple interface to send and receive data • int send(int src, int dest, byte *buf, int len) • Copies len bytes of buf into the local outgoing Buffer Unit • Constructs the source route from the src MB core to the dest MB core • Blocks until all data is copied • Returns the number of bytes sent or -1 on error • Receive is invoked from an interrupt • int recv(byte *buf, int len) • Copies len bytes into buf from the local incoming Buffer Unit • Blocks until all data is received • Returns the number of bytes received or -1 on error CS252-s06, Project Presentation
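
A usage sketch of this interface is given below; the processor numbers, message size, and wrapper function names are hypothetical.

    typedef unsigned char byte;

    extern int send(int src, int dest, byte *buf, int len);
    extern int recv(byte *buf, int len);

    #define MY_ID 10          /* this core's global processor number (assumed) */

    static byte msg[64];

    void example_send(void)
    {
        if (send(MY_ID, 24, msg, (int)sizeof msg) < 0) {
            /* handle error: invalid destination or failed transfer */
        }
    }

    /* invoked from the receive interrupt handler */
    void example_recv(void)
    {
        static byte in[64];
        if (recv(in, (int)sizeof in) < 0) {
            /* handle error */
        }
    }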

  45. Simplifications • A fixed packet length simplifies the control hardware • A packet fits completely into every buffer in the system, so the entire packet can be transferred from hop to hop • Once transmission starts from an MB output buffer, it is not interrupted until the packet reaches the destination MB input buffer • Store-and-forward implementation of the MB buffers CS252-s06, Project Presentation

  46. Performance • Latency1 ≈ 48 × packet length cycles to write the packet into the FSL bus • Latency2 ≈ 2 × packet length cycles waiting for the MB buffer to fill • Latency3 ≈ 2 cycles for the switch transmission • Latency4 ≈ 48 × packet length cycles to read the packet from the FSL bus • Bandwidth = 32 bits/cycle or 64 bits/cycle (the current FSL does not support 64 bits) CS252-s06, Project Presentation

  47. Utilization on BEE2 • Note: Measured with a switch connecting 8 ports: 2 MB and 2 LVCMOS links, but no XAUI. All buffers are 32 bits wide and 16 words deep. CS252-s06, Project Presentation

  48. Future Implementation • Switch topology change • Allow variable packet length – using the control bit in the FSL • DMA • 4 MBs share a DMA engine CS252-s06, Project Presentation

  49. “Associated Switch” CS252-s06, Project Presentation

  50. Clustered Organization • Microblaze cores are organized into clusters • Since there are 4 DIMMs on the BEE2, split into 4 clusters • A NIC coordinates data transfers for all MBs in a cluster • Faster transfers for MBs in the same cluster because they use DMA • Faster overall transfers because data copying is done in hardware • Only 4 bits per hop are needed now, but an extra hop is required CS252-s06, Project Presentation
