RAMP Blue: A Message Passing Multi-Processor System on the BEE2

  1. RAMP Blue: A Message Passing Multi-Processor System on the BEE2
  Andrew Schultz and Alex Krasnov
  P.I. John Wawrzynek

  2. Introduction
  • RAMP Blue is an initial design driver for the RAMP project, with the goal of building an extremely large (~1000-node) multiprocessor system from soft-core processors on a cluster of BEE2 modules
  • The goal of RAMP Blue is to experiment and learn lessons about building large-scale computer architectures on FPGA platforms for emulation, not for performance
  • The current system has 256 cores on an 8-module cluster (8 modules × 4 user FPGAs × 8 cores per FPGA) running full parallel benchmarks, with an easy upgrade to 512 cores on 16 modules once the boards are available

  3. RDF: RAMP Design Framework
  • All designs implemented within RAMP are known as the target system and must follow restrictions defined by the RDF
  • All designs are composed of units with well-defined ports that communicate with each other over uni-directional, point-to-point channels
  • A unit can be as simple as a single logic gate, but is more often a larger block such as a CPU core or cache
  • Timing between units is completely decoupled by the channel (see the sketch below)
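
  As one way to picture the channel abstraction, the following is a minimal software sketch of an RDF-style channel, assuming a simple bounded FIFO between exactly two units; the depth, word type, and function names are illustrative choices, not part of the RDF specification.

      /* Hypothetical software model of an RDF-style channel: a bounded,
       * uni-directional FIFO between exactly two units.  Each unit only
       * sees "can I send?" / "can I receive?", so unit timing is decoupled. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>

      #define CHAN_DEPTH 16            /* illustrative buffering, not from the slides */

      typedef struct {
          uint32_t buf[CHAN_DEPTH];
          unsigned head, tail, count;
      } rdf_channel;

      static void chan_init(rdf_channel *c)
      {
          memset(c, 0, sizeof *c);
      }

      /* Sender side: returns false if the channel is full (back-pressure). */
      static bool chan_send(rdf_channel *c, uint32_t word)
      {
          if (c->count == CHAN_DEPTH)
              return false;
          c->buf[c->tail] = word;
          c->tail = (c->tail + 1) % CHAN_DEPTH;
          c->count++;
          return true;
      }

      /* Receiver side: returns false if no data is available yet. */
      static bool chan_recv(rdf_channel *c, uint32_t *word)
      {
          if (c->count == 0)
              return false;
          *word = c->buf[c->head];
          c->head = (c->head + 1) % CHAN_DEPTH;
          c->count--;
          return true;
      }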

  4. RAMP Blue Goals
  • RAMP Blue is a sibling to the other RAMP design driver projects:
    1) RAMP Red: port of an existing transactional cache system to FPGA PowerPC cores
    2) RAMP Blue: message-passing multiprocessor system built from an existing, FPGA-optimized soft core (MicroBlaze)
    3) RAMP White: cache-coherent multiprocessor system built from a full-featured soft core
  • Blue is also intended to run off-the-shelf, message-passing scientific codes and benchmarks (providing existing tests and a basis for comparison)
  • The main goal is to fit as many cores as possible in the system while retaining the ability to reliably run code and change system parameters

  5. RAMP Blue Requirements
  • Built from existing tools (RDL was not available at the time) but fits the RDF guidelines for future integration
  • Requires design and implementation of gateware and software to run MicroBlaze with uClinux on the BEE2 modules
  • Sharing of the DDR2 memory system
  • Communication and bootstrapping from the user FPGAs
  • Debugging and control from the control FPGA
  • New on-chip network for MicroBlaze-to-MicroBlaze communication
  • Communication on-chip, FPGA to FPGA on a board, and board to board
  • Completely new double-precision floating-point unit for scientific codes

  6. MicroBlaze Characteristics
  • 3-stage, RISC-like architecture designed for implementation on FPGAs
  • Takes advantage of FPGA-specific features (e.g. fast carry chains) and works around FPGA shortcomings (e.g. lack of CAMs in the cache)
  • Maximum clock rate of 100 MHz (~0.5 MIPS/MHz) on Virtex-II Pro FPGAs
  • Split I- and D-caches with configurable size, direct-mapped
  • Fast hardware multiplier/divider, optional hardware barrel shifter
  • Configurable hardware debugging support (watchpoints/breakpoints)
  • Several peripheral interface bus options
  • GCC tool chain support and the ability to run uClinux

  7. Node Architecture

  8. Memory System
  • Requires sharing each memory channel among a configurable number of MicroBlaze cores
  • No coherence: each DIMM is statically partitioned among the cores, and bank management keeps the cores from interfering with one another (see the sketch below)
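
  A minimal sketch of the static partitioning idea, using the numbers quoted elsewhere in the deck (two 1 GB DIMMs and 8 cores per user FPGA, 256 MB per core); the helper below is illustrative and not the actual memory-controller gateware.

      /* Illustrative address-partitioning helper, not the actual gateware.
       * Assumes the deck's numbers: two 1 GB DIMMs per user FPGA, 8 cores
       * per FPGA, 256 MB per core, with no coherence between partitions. */
      #include <stdint.h>

      #define CORES_PER_FPGA   8
      #define CORES_PER_DIMM   4                     /* 8 cores / 2 DIMMs */
      #define PARTITION_BYTES  (256u * 1024 * 1024)  /* 256 MB per core   */

      typedef struct {
          unsigned dimm;       /* which DIMM the core's memory lives on */
          uint32_t base;       /* partition base within that DIMM       */
          uint32_t limit;      /* first byte past the partition         */
      } mem_partition;

      /* Map a core id (0..7) to its private slice of DRAM. */
      static mem_partition core_partition(unsigned core_id)
      {
          mem_partition p;
          p.dimm  = core_id / CORES_PER_DIMM;
          p.base  = (core_id % CORES_PER_DIMM) * PARTITION_BYTES;
          p.limit = p.base + PARTITION_BYTES;
          return p;
      }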

  9. Control Communication
  • A communication channel from the control PowerPC to each individual MicroBlaze is required for bootstrapping and debugging
  • Gateware provides a general-purpose, low-speed network
  • Software provides character and Ethernet abstractions on top of the channel (see the sketch below)
  • The kernel is sent over the channel and file systems can be mounted
  • A console channel allows debugging messages and control
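
  As an illustration of the character abstraction, the sketch below polls a hypothetical memory-mapped status/data register pair for the low-speed control channel; the addresses, bit names, and functions are assumptions made for this example, not the real gateware interface.

      /* Minimal sketch of a character (console) abstraction over the
       * low-speed control channel.  CTRL_CHAN_STATUS/DATA and TX_FULL
       * are hypothetical names chosen for illustration; the real
       * gateware defines its own register map. */
      #include <stdint.h>

      #define CTRL_CHAN_STATUS  (*(volatile uint32_t *)0xF0000000u)  /* hypothetical */
      #define CTRL_CHAN_DATA    (*(volatile uint32_t *)0xF0000004u)  /* hypothetical */
      #define TX_FULL           0x1u

      /* Blocking write of one console character toward the control PowerPC. */
      static void console_putc(char c)
      {
          while (CTRL_CHAN_STATUS & TX_FULL)
              ;                       /* spin until the channel can take a byte */
          CTRL_CHAN_DATA = (uint32_t)(unsigned char)c;
      }

      static void console_puts(const char *s)
      {
          while (*s)
              console_putc(*s++);
      }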

  10. Double Precision FPU
  • Due to the size of a double-precision FPU, sharing it is crucial to meeting the resource budget
  • The shared FPU works much like reservation stations in a microarchitecture, with each MicroBlaze issuing operations into its own slot (see the sketch below)
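
  The reservation-station analogy can be pictured with a small behavioral model: each core owns a request slot, the FPU services ready slots in round-robin order, and cores poll for their results. This is a software sketch of the idea under those assumptions, not the actual shared-FPU gateware or its arbitration policy.

      /* Behavioral C model of the shared-FPU idea (not the gateware). */
      #include <stdbool.h>

      enum fpu_op { FPU_ADD, FPU_MUL };

      typedef struct {
          bool   busy;      /* operands issued, result not yet consumed */
          bool   done;      /* result is valid                          */
          enum fpu_op op;
          double a, b, result;
      } fpu_slot;

      #define NUM_CORES 8
      static fpu_slot slots[NUM_CORES];

      /* Core side: issue an operation into this core's slot. */
      static void fpu_issue(int core, enum fpu_op op, double a, double b)
      {
          fpu_slot *s = &slots[core];
          s->op = op; s->a = a; s->b = b;
          s->done = false;
          s->busy = true;
      }

      /* FPU side: one service step, scanning slots round-robin. */
      static void fpu_step(void)
      {
          static int next = 0;
          for (int i = 0; i < NUM_CORES; i++) {
              fpu_slot *s = &slots[(next + i) % NUM_CORES];
              if (s->busy && !s->done) {
                  s->result = (s->op == FPU_ADD) ? s->a + s->b : s->a * s->b;
                  s->done = true;
                  next = (next + i + 1) % NUM_CORES;
                  break;
              }
          }
      }

      /* Core side: poll for completion and collect the result. */
      static bool fpu_poll(int core, double *out)
      {
          fpu_slot *s = &slots[core];
          if (!s->done)
              return false;
          *out = s->result;
          s->busy = false;
          return true;
      }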

  11. Network Characteristics
  • The interconnect must fit within the RDF model
  • The network interface uses simple FSL channels, currently PIO but it could be DMA
  • Source routing between nodes (non-adaptive, not tolerant of link failures)
  • The only links that could physically fail are the board-to-board XAUI links
  • The interconnect topology is a full crossbar on chip with an all-to-all connection of board-to-board links
  • The longest path between nodes is four on-board links and one off-board link
  • Packets are encapsulated Ethernet frames with source-routing information prepended (see the sketch below)
  • Virtual cut-through flow control with virtual channels for deadlock avoidance
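
  A rough picture of such a packet, with the source route prepended to an encapsulated Ethernet frame; the field names, widths, and hop limit below are assumptions for illustration rather than the actual RAMP Blue header format.

      /* Illustrative packet layout only; the key idea from the slide is
       * a source route prepended to an otherwise ordinary Ethernet frame. */
      #include <stdint.h>

      #define MAX_HOPS 8   /* four on-board links + one off-board, with margin */

      struct route_header {
          uint8_t num_hops;            /* hops remaining on the source route      */
          uint8_t vc;                  /* virtual channel, for deadlock avoidance */
          uint8_t hops[MAX_HOPS];      /* output port to take at each switch      */
      };

      struct eth_frame {
          uint8_t  dst_mac[6];
          uint8_t  src_mac[6];
          uint16_t ethertype;
          uint8_t  payload[1500];      /* standard Ethernet MTU */
      };

      struct ramp_packet {
          struct route_header route;   /* consumed hop by hop by the switches */
          struct eth_frame    frame;   /* delivered unchanged to the endpoint */
      };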

  12. Network Implementation

  13. Multiple Cores
  • Scaling up to multiple cores per FPGA is primarily constrained by resources
  • The current evaluation cluster implements 8 cores/FPGA using roughly 85% of the slices (but only slightly more than half of the LUTs/FFs)
  • Sixteen cores would fit on each FPGA without the infrastructure (switch, memory, etc.); 10–12 is the maximum depending on options
  • Options include hardware accelerators, cache size, FPU timing, etc.

  14. Test Cluster
  • Sixteen BEE2 modules with 8 cores per user FPGA and two 1 GB DDR2 DIMMs per user FPGA
  • The overall cluster has 512 cores (16 modules × 4 user FPGAs × 8 cores); by scaling up to 12 cores per FPGA and also utilizing the control FPGA, it is realistic to achieve 960 cores (16 modules × 5 FPGAs × 12 cores) in the cluster

  15. Software
  • Each node in the cluster boots its own copy of uClinux and mounts its file system from an external NFS server
  • The Unified Parallel C (UPC) parallel programming framework was ported to uClinux
  • The main porting effort with UPC is adapting its communication layer, GASNet, to the platform; this was avoided by using GASNet's existing UDP transport
  • Floating-point integration is achieved by modifying GCC's soft-float (SoftFPU) support to emit code that interacts with the shared FPU (see the sketch below)
  • The UPC versions of the NAS Parallel Benchmarks run on the cluster
  • Only class “S” benchmarks can be run due to the limited memory (256 MB/node)
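
  As a sketch of the idea, the soft-float double-add entry point used by GCC (__adddf3) could be redirected to the shared FPU over an FSL channel as below; the FSL wrapper functions and the command encoding are hypothetical, and the actual port modifies the compiler's code generation rather than replacing the library routine.

      /* Sketch: ship double-add operands to the shared FPU over FSL
       * instead of emulating the addition in software. */
      #include <stdint.h>
      #include <string.h>

      extern void fsl_put_word(uint32_t word);   /* hypothetical FSL write */
      extern uint32_t fsl_get_word(void);        /* hypothetical FSL read  */

      #define FPU_CMD_ADD_D 0x1u                 /* hypothetical opcode    */

      double __adddf3(double a, double b)
      {
          uint32_t wa[2], wb[2], wr[2];
          double r;

          memcpy(wa, &a, sizeof a);              /* reinterpret doubles as word pairs */
          memcpy(wb, &b, sizeof b);

          fsl_put_word(FPU_CMD_ADD_D);           /* opcode, then operands */
          fsl_put_word(wa[0]); fsl_put_word(wa[1]);
          fsl_put_word(wb[0]); fsl_put_word(wb[1]);

          wr[0] = fsl_get_word();                /* blocking read of the result */
          wr[1] = fsl_get_word();
          memcpy(&r, wr, sizeof r);
          return r;
      }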

  16. Performance
  • Performance is not the key metric of success for RAMP Blue
  • While improving performance is a secondary goal, the primary goal is the ability to reliably implement a system over a wide range of parameters and meet timing closure within the desired resources
  • Performance analysis points out bottlenecks for incremental improvement in future RAMP infrastructure designs
  • Analysis of the node-to-node network shows that software (i.e. the network interface) is the primary bottleneck; finer-grained analysis is forthcoming with the RDL port
  • Just for the heck of it, NAS Parallel Benchmark performance numbers were also measured

  17. Implementation Issues
  • Building such a large design exposed several insidious bugs in both hardware and gateware
  • MicroBlaze bugs in both the gateware and the GCC toolchain required a good deal of time to track down (race conditions, OS bugs, GCC backend bugs)
  • Memory errors with large designs on the BEE2 are still not completely understood; they probably stem from noise on the power plane increasing clock jitter
  • Lack of debugging insight and long recompile times greatly hindered progress
  • Building a large cluster exposed bugs caused by variation between BEE2 boards
  • Multiple layers of user control (FPGA, processor, I/O, software) all contribute to uncertainty in operation

  18. Conclusion
  • RAMP Blue represents the first steps toward a robust library of RAMP infrastructure for building more complicated parallel systems
  • Much of the RAMP Blue gateware is directly applicable to future systems
  • Many important lessons were learned about the required debugging/insight capabilities
  • New bugs and reliability issues were exposed in the BEE2 platform and gateware, helping to influence future RAMP hardware platforms and the characteristics needed for robust software/gateware infrastructure
  • RAMP Blue also represents the largest soft-core, FPGA-based computing system ever built and demonstrates the incredible research flexibility such systems allow
  • The ability to literally tweak the hardware interfaces and parameters provides a “research sandbox” for exciting new possibilities
  • E.g. add DMA and RDMA to the networking, arbitrarily tweak the network topology, and experiment with system-level paging and coherence
