
The NUMAchine Multiprocessor






  1. The NUMAchine Multiprocessor ICPP 2000

  2. Outline Presentation Overview • Architecture • System Overview • Key Features • Fast ring routing • Hardware Cache Coherence • Memory Model: Sequential Consistency • Simulation Studies • Ring performance • Network Cache performance • Coherence overhead • Prototype Performance • Hardware Status • Conclusion

  3. Arch:Sys System Architecture • Hierarchical ring network, based on clusters (NUMAchine’s ‘Stations’), which are themselves bus-based SMPs
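
A minimal sketch of this two-level hierarchy follows. All sizes, names, and the hop logic are illustrative assumptions, not the prototype’s actual parameters:

```c
/* Hypothetical sketch of NUMAchine's two-level hierarchy: bus-based SMP
 * Stations sit on local rings, and the local rings are joined by a
 * Central Ring. All sizes below are illustrative assumptions. */
#include <stdio.h>

#define LOCAL_RINGS        4    /* local rings on the Central Ring (assumed) */
#define STATIONS_PER_RING  4    /* Stations per local ring (assumed)         */
#define CPUS_PER_STATION   4    /* processors on each Station bus (assumed)  */

typedef struct {
    int ring;   /* which local ring this Station sits on   */
    int slot;   /* the Station's position around that ring */
} station_id;

int main(void)
{
    station_id home = { 0, 1 }, requester = { 2, 3 };
    int cpus = LOCAL_RINGS * STATIONS_PER_RING * CPUS_PER_STATION;

    printf("%d processors total\n", cpus);
    /* Traffic between Stations on different local rings must climb to
     * the Central Ring and descend again; same-ring traffic stays local. */
    printf("remote access crosses Central Ring: %s\n",
           home.ring != requester.ring ? "yes" : "no");
    return 0;
}
```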

  4. Arch:Features NUMAchine’s Key Features • Hierarchical rings • Allow for very fast and simple routing • Provide good support for broadcast and multicast • Hardware Cache Coherence • Hierarchical, directory-based, CC-NUMA system • Writeback/Invalidate protocol, designed to use the broadcast/ordering properties of rings • Sequentially Consistent Memory Model • The most intuitive model for programmers trained on uniprocessors • Simple, low cost, but with good flexibility, scalability and performance

  5. Arch:Fmask Fast Ring Routing: Filtermasks • Fast ring routing is achieved by the use of Filtermasks (i.e., simple bit-masks) to store cache-line location information (imprecision reduces directory storage requirements) • These Filtermasks are used directly by the routing hardware in the ring interfaces
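
As a rough illustration, a Filtermask can be modelled as one bit per Station that is only ever OR-ed in, so it names a conservative superset of the Stations holding a line. The bit width and merge rule below are assumptions, not the hardware encoding:

```c
/* Hedged sketch of Filtermask routing: one bit per Station (width
 * assumed), merged by OR, and consulted directly by a ring interface
 * to decide whether a packet concerns its Station. */
#include <stdint.h>

typedef uint16_t fmask_t;                 /* one bit per Station (assumed) */

#define FMASK_BIT(station)  ((fmask_t)1u << (station))

/* Record that 'station' may hold a copy. Bits are only ever set, never
 * cleared on silent ejection, so the mask is imprecise but safe: it may
 * include Stations without a copy, but never misses one that has it. */
static fmask_t fmask_add(fmask_t m, int station)
{
    return m | FMASK_BIT(station);
}

/* Ring-interface test: take the packet off the ring only if this
 * Station's bit is set in the packet's Filtermask. */
static int fmask_targets(fmask_t m, int station)
{
    return (m & FMASK_BIT(station)) != 0;
}
```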

  6. CC Hardware Cache Coherence • Hierarchical, directory-based, writeback/invalidate • Directory entries are stored in the per-Station memory (the ‘home’ location) and cached in the network interfaces (hence the name, Network Cache) • The Network Cache stores both the remotely cached directory information and the cache lines themselves, and allows the network interface to perform coherence operations locally (on-Station), avoiding remote accesses to the home directory • Filtermasks indicate which Stations (i.e., clusters) may have a copy of a cache line (with the fuzziness due to the imprecise nature of the Filtermasks) • Processor Masks are used only within a Station, to indicate which particular caches may contain a copy (with the fuzziness here due to Shared lines that may have been silently ejected)
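
A hedged sketch of the directory state described above, with assumed state names, mask widths, and line size: home memory keeps the Station-level Filtermask, while within a Station a Processor Mask names the caches that may hold the line.

```c
/* Illustrative directory entry for the writeback/invalidate protocol.
 * State names, mask widths, and the 64-byte line size are assumptions. */
#include <stdint.h>

typedef enum { DIR_INVALID, DIR_SHARED, DIR_DIRTY } dir_state_t;

typedef struct {
    dir_state_t state;
    uint16_t    station_mask;  /* Filtermask: Stations that *may* hold a copy */
    uint8_t     proc_mask;     /* within a Station: caches that *may* hold it;
                                  imprecise because Shared lines can be
                                  silently ejected without telling the home */
} dir_entry_t;

/* The Network Cache holds the same directory information plus the line
 * data itself, letting the network interface resolve many coherence
 * actions on-Station instead of going back to the home directory. */
typedef struct {
    dir_entry_t dir;
    uint8_t     data[64];      /* cache-line payload (size assumed) */
} nc_entry_t;
```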

  7. SC Memory Model: Sequential Consistency • The most intuitive model for the typical programmer: increases the usability of the system • Easily supported by NUMAchine’s ring network: the only change necessary is to force invalidates to pass through a global ‘sequencing point’ on the ring, increasing the average invalidation latency by 2 ring hops (40 ns with our default 50 MHz rings)
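
The quoted 40 ns follows directly if one ring hop costs one 20 ns cycle at 50 MHz, an assumption this small check makes explicit:

```c
/* Back-of-the-envelope check of the sequencing-point cost: assuming one
 * ring hop per 50 MHz ring cycle, the detour through the global
 * sequencing point adds 2 hops to the average invalidation latency. */
#include <stdio.h>

int main(void)
{
    const double ring_mhz   = 50.0;
    const double ns_per_hop = 1000.0 / ring_mhz;  /* 20 ns per hop (assumed) */
    const int    extra_hops = 2;                  /* via the sequencing point */

    printf("added invalidation latency: %.0f ns\n", extra_hops * ns_per_hop);
    return 0;
}
```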

  8. SS:RP1 Simulation Studies: Ring Performance 1 • Use the SPLASH-2 benchmark suite and a cycle-accurate hardware simulator with full modeling of the coherence protocol • Applications with high communication-to-computation ratios (e.g. FFT, Radix) show high utilizations, particularly in the Central Ring (indicating that a faster Central Ring would help)

  9. SS:RP2 Simulation Studies: Ring Performance 2 • Maximum and average ring interface queue depths indicate network congestion, which correlates with bursty traffic • Large differences between the maximum and average values indicate large variability in burst size

  10. SS:NC Simulation Studies: Network Cache • Graphs show a measure of the Network Cache’s effect by looking at the hit rate (i.e., the reduction in remote data and coherence traffic) • By categorizing the hits by the coherence directory state, we also see where the benefits come from: caching shared data, or reducing invalidations and coherence traffic
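
The categorization amounts to tallying Network Cache hits by the directory state of the line at hit time; the counters below are a hypothetical reconstruction of that bookkeeping, not the simulator’s actual instrumentation:

```c
/* Hypothetical hit-rate bookkeeping: count every Network Cache access,
 * and bin the hits by the line's directory state so shared-data reuse
 * can be separated from avoided invalidation/coherence traffic. */
enum nc_state { NC_INVALID, NC_SHARED, NC_DIRTY, NC_STATE_COUNT };

static unsigned long nc_hits[NC_STATE_COUNT];
static unsigned long nc_accesses;

static void nc_record_access(enum nc_state state, int hit)
{
    nc_accesses++;
    if (hit)
        nc_hits[state]++;
}

/* Hit rate per state = nc_hits[s] / nc_accesses; summed over the states
 * it gives the overall reduction in remote data and coherence traffic. */
```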

  11. SS:CO Simulation Studies: Coherence Overhead • Measure the overhead due to cache coherence by allowing all writes to succeed immediately without checking cache-line state, and comparing against runs with the full cache coherence protocol in place (both using infinite-capacity Network Caches to avoid measurement noise due to capacity effects) • Results indicate that in many cases it is basic data locality and/or poor parallelizability that are impeding performance, not cache coherence
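
A sketch of the two simulator configurations being compared, reconstructed from the description above (the field names are assumptions):

```c
/* Hypothetical simulator configuration for the two runs compared here:
 * the baseline disables coherence checks (all writes succeed at once),
 * and both runs use infinite Network Caches to remove capacity effects. */
struct sim_config {
    int coherence_enabled;     /* 0: writes succeed without state checks */
    int nc_infinite_capacity;  /* 1 in both runs, isolating protocol cost */
};

static const struct sim_config no_coherence   = { 0, 1 };
static const struct sim_config full_coherence = { 1, 1 };
```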

  12. PP Prototype Performance • Speedups from the hardware prototype, compared against estimates from the simulator

  13. Status Hardware Prototype Status • Fully operational running the custom Tornado OS, with a 32-processor system shown below

  14. Fin Conclusion • 4- and 8-way SMPs are fast becoming commodity items • The NUMAchine project has shown that a simple, cost-effective, CC-NUMA multiprocessor can be built using these SMP building blocks and a simple ring network, and still achieve good performance and scalability • In the medium-scale range (a few tens to hundreds of processors), rings are a good choice for a multiprocessor interconnect • We have demonstrated an efficient hardware cache coherence scheme, which is designed to make use of the natural ordering and broadcast capabilities of rings • NUMAchine’s architecture efficiently supports a sequentially consistent memory model, which we feel is essential for increasing the ease of use and programmability of multiprocessors

  15. Ack Acknowledgments: The NUMAchine Team • Hardware • Prof. Zvonko Vranesic • Prof. Stephen Brown • Robin Grindley (SOMA Networks) • Alex Grbic • Prof. Zeljko Zilic (McGill) • Steve Caranci (Altera) • Derek DeVries (OANDA) • Guy Lemieux • Kelvin Loveless (GNNettest) • Prof. Sinisa Srbljic (Zagreb) • Paul McHardy • Mitch Gusat (IBM) • Operating Systems • Prof. Michael Stumm • Orran Krieger (IBM) • Ben Gamsa • Jonathon Appavoo • Robert Ho • Compilers • Prof. Tarek Abdelrahman • Prof. Naraig Manjikian (Queen’s) • Applications • Prof. Ken Sevcik
