
The Stanford FLASH Multiprocessor



Presentation Transcript


  1. The Stanford FLASH Multiprocessor J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, J. Hennessy. In Proceedings of the 21st Annual International Symposium on Computer Architecture (ISCA) 1994. Henry Cook CS258 4/2/2008

  2. Goals • Support both cache-coherent shared memory and message passing • Not just either/or, but both at the same time • Design a custom node controller • Build actual hardware • 256-node target, scaling to thousands of nodes • (64 nodes were actually built)

  3. Big Idea • The principal difference between CCSM and MP is the protocol for transferring data • The overall machine structure is the same • The functions performed by the node controller are also the same • By making the controller a special-purpose protocol processor, FLASH can leverage the flexibility of software while reducing its overheads

  4. System Architecture • Each node's processor is a single MIPS chip • MAGIC contains an embedded protocol processor • Different message types are processed by different software handlers (see the dispatch sketch below)
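
A minimal C++ sketch of this dispatch idea (the header fields, handler names, and table below are illustrative assumptions, not MAGIC's actual layout):

```cpp
#include <cstdio>

// Hypothetical message header; the real MAGIC header format differs.
struct MsgHeader {
    unsigned type;  // message type selects the handler
    unsigned addr;  // cache-line address the message refers to
    unsigned src;   // originating node
};

using HandlerFn = void (*)(const MsgHeader&);

void handle_read_request(const MsgHeader& h)  { std::printf("read  %#x from node %u\n", h.addr, h.src); }
void handle_write_request(const MsgHeader& h) { std::printf("write %#x from node %u\n", h.addr, h.src); }
void handle_reply(const MsgHeader& h)         { std::printf("reply %#x for node %u\n", h.addr, h.src); }

// The message type indexes a table of software handlers, so changing the
// protocol means swapping table entries rather than respinning hardware.
constexpr HandlerFn kHandlerTable[] = {
    handle_read_request,   // type 0
    handle_write_request,  // type 1
    handle_reply,          // type 2
};

void dispatch(const MsgHeader& h) { kHandlerTable[h.type](h); }

int main() {
    dispatch({0, 0x1000, 3});  // a read request from node 3
    dispatch({2, 0x1000, 3});  // the matching reply
}
```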

  5. Protocols - CCSM • Directory-based, with a dynamic pointer allocation structure (sketched below) • Similar to the DASH protocol • Separate request and reply networks to eliminate cycles • Handlers must yield if they cannot run to completion
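
A hedged sketch of the dynamic pointer allocation structure (simplified; all names and field widths are assumptions): each memory line gets a small directory header, and sharers are recorded as a linked list of pointers drawn from a shared pool, so storage is not reserved per line for the worst-case sharer count.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint16_t kNil = 0xFFFF;

struct Link { uint16_t node; uint16_t next; };  // one sharer pointer

struct DirHeader {
    bool     dirty = false;  // some node holds an exclusive copy
    uint16_t first = kNil;   // head of this line's sharer list (pool index)
};

struct Directory {
    std::vector<DirHeader> headers;  // one header per memory line
    std::vector<Link>      pool;     // shared pool of link entries
    uint16_t freeHead = kNil;

    Directory(std::size_t lines, std::size_t poolSize) : headers(lines), pool(poolSize) {
        for (std::size_t i = 0; i < poolSize; ++i)  // thread the free list
            pool[i].next = (i + 1 < poolSize) ? uint16_t(i + 1) : kNil;
        freeHead = 0;
    }

    void addSharer(std::size_t line, uint16_t node) {
        uint16_t idx = freeHead;                  // pop a link from the free list
        freeHead = pool[idx].next;
        pool[idx] = {node, headers[line].first};  // push onto the line's list
        headers[line].first = idx;
    }
    // Invalidation would walk the list, send one message per link, and
    // return the links to the free list (omitted here).
};

int main() {
    Directory dir(1024, 4096);
    dir.addSharer(7, 3);  // node 3 now shares line 7
    return dir.headers[7].first == 0 ? 0 : 1;
}
```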

  6. Protocols - MP • Long messages vs. short messages • Block transfer vs. synchronization • User-level parameters are passed to a transfer handler running on MAGIC • When all components of a user message have arrived at the destination, a reception handler is invoked (see the sketch below)
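
A sketch of the long-message path (transfer and reception handlers are the slide's concepts; the function signatures and chunk size below are assumptions, and memcpy stands in for network hops):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>

constexpr std::size_t kLineBytes = 128;  // data moves in cache-line-sized chunks

// Reception handler: runs on the destination MAGIC once every component
// of the user's message has arrived.
void reception_handler(void* dst, std::size_t len) {
    std::printf("message of %zu bytes delivered at %p\n", len, dst);
}

// Transfer handler: invoked on MAGIC with user-level parameters
// (source buffer, destination, length).
void transfer_handler(const void* src, void* dst, std::size_t len) {
    std::size_t sent = 0;
    while (sent < len) {
        std::size_t chunk = std::min(len - sent, kLineBytes);
        std::memcpy(static_cast<char*>(dst) + sent,
                    static_cast<const char*>(src) + sent, chunk);  // one "network hop"
        sent += chunk;
    }
    reception_handler(dst, len);  // all components have arrived
}

int main() {
    char src[512] = "payload", dst[512] = {};
    transfer_handler(src, dst, sizeof src);
}
```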

  7. Protocols - Extensions • Just change the handlers • Emulate COMA attraction memories • Implement synchronization primitives as MAGIC handlers (sketched below) • Short message support similar to active messages • Not user-level active messages
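
As one concrete example, a hedged sketch of a synchronization primitive done as a handler at the word's home node (the name and calling convention are assumptions): because the update runs inside the home controller's handler, contenders serialize there instead of bouncing the cache line between processors.

```cpp
#include <cstdint>

// Fetch-and-increment performed where the word lives; the single protocol
// processor runs one handler at a time, so no lock is needed for this
// read-modify-write.
uint32_t fetch_and_inc_handler(uint32_t* word_at_home) {
    uint32_t old = *word_at_home;
    *word_at_home = old + 1;
    return old;  // the old value is returned in the reply message
}

int main() {
    uint32_t counter = 0;
    return (fetch_and_inc_handler(&counter) == 0 && counter == 1) ? 0 : 1;
}
```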

  8. MAGIC Architecture • Separation and specialization • Use hardwired data movement logic for speed • Use control logic that runs software protocols for flexibility

  9. MAGIC Architecture • Must operate quickly enough to avoid becoming the bottleneck • Hardware-based speculative message dispatch to the PP • Separate hardware for message sends (see the sketch below)
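
A structural sketch of the split control/data paths (the class shape and buffer sizes are illustrative; the paper's point is that hardwired logic moves payloads while only headers reach the protocol processor):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>

struct Header { int type; uint32_t addr; };

struct Magic {
    // Hardwired data path: payloads are parked in numbered staging buffers.
    std::array<std::array<uint8_t, 128>, 16> dataBuffers{};

    void dispatch(const Header& h, int bufIdx) {  // stand-in for a PP handler
        std::printf("PP handles type %d, payload in buffer %d\n", h.type, bufIdx);
    }

    void receive(const Header& h, const uint8_t* payload, int bufIdx) {
        // Data path: stage the payload; the PP never copies message data.
        if (payload)
            std::copy(payload, payload + 128, dataBuffers[bufIdx].begin());
        // Control path: hand the header to the PP. In the real hardware
        // this dispatch is speculative, so the handler can be starting
        // before the full message has arrived.
        dispatch(h, bufIdx);
    }
};

int main() {
    Magic m;
    uint8_t line[128] = {};
    m.receive({1, 0x2000}, line, 0);
}
```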

  10. Protocol Processor • Implements a subset of the DLX ISA with extensions for common protocol operations (illustrated below) • Statically scheduled dual-issue superscalar • No interrupts, exceptions, address translation, or interlocks
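
One flavor of operation extensions like these typically accelerate is bitfield manipulation on packed directory words; a sketch of what the single-instruction versions compute (field positions are invented for illustration):

```cpp
#include <cstdint>

constexpr unsigned kPtrShift = 4, kPtrBits = 16;  // hypothetical field layout

// On the PP each of these would be one "field extract" / "field insert"
// style instruction instead of a shift-and-mask sequence.
inline uint64_t extract(uint64_t word, unsigned shift, unsigned bits) {
    return (word >> shift) & ((1ULL << bits) - 1);
}

inline uint64_t insert(uint64_t word, unsigned shift, unsigned bits, uint64_t v) {
    uint64_t mask = ((1ULL << bits) - 1) << shift;
    return (word & ~mask) | ((v << shift) & mask);
}

int main() {
    uint64_t dirWord = 0;
    dirWord = insert(dirWord, kPtrShift, kPtrBits, 42);  // first sharer = node 42
    return extract(dirWord, kPtrShift, kPtrBits) == 42 ? 0 : 1;
}
```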

  11. Performance • How do latencies look for local read misses? • Derived from the Verilog model • Assuming PP cache hits

  12. Performance • PP occupancy is the longest sub-operation in handling each message • Block transfer bandwidth of 300-400 MB/s
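
For scale (a back-of-the-envelope calculation, not a number from the slides): at 300-400 MB/s, moving an 8 KB block takes roughly 8192 B / (300-400 MB/s) ≈ 20-27 µs.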

  13. Testbed • System-level simulator written in C++ • Protocol verifier and SPLASH benchmarks • Verilog description of MAGIC • N-1 simulated nodes plus one Verilog node

  14. Retrospective • The flexibility of software handlers is key • Supporting multiple protocols • Scaling the architecture to multiple machine sizes • Debugging protocol operation • Building hardware in an academic environment is difficult and time-consuming
