
Presentation Transcript


  1. CS 258 Parallel Computer Architecture, Lecture 15.1. DASH: Directory Architecture for Shared Memory: Implementation, Cost, Performance. Daniel Lenoski et al., “The DASH Prototype: Implementation and Performance”, Proceedings of the International Symposium on Computer Architecture, 1992. March 17, 2008, Rhishikesh Limaye.

  2. DASH objectives • Demonstrate a large-scale shared-memory multiprocessor using directory-based cache coherence. • Prototype with 16-64 processors. • The argument: for both performance and programmability, a parallel architecture should • scale to 100s-1000s of processors, • have high-performance individual processors, and • have a single shared address space.

  3. Two-level architecture • Cluster: • Uses bus-based shared memory with snoopy cache coherence • 4 processors per cluster • Inter-cluster: • Scalable interconnect network • Directory-based cache coherence
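To make the two levels concrete, here is a minimal C sketch of how the prototype's topology could be modeled; the type and field names (dash_system, cluster, and so on) are illustrative, not taken from the paper.

    #include <stddef.h>
    #include <stdint.h>

    #define PROCS_PER_CLUSTER 4    /* snoopy bus within a cluster */
    #define MAX_CLUSTERS      16   /* prototype scales to 16 clusters = 64 CPUs */

    /* One bus-based SMP node: coherence inside it is pure snooping. */
    struct cluster {
        int      id;
        uint32_t proc_ids[PROCS_PER_CLUSTER];
        /* portion of the globally shared address space whose home is here */
        uint8_t *local_memory;
        size_t   local_memory_bytes;
    };

    /* The full machine: clusters tied together by a scalable network;
       coherence across clusters is directory-based (see slide 5). */
    struct dash_system {
        struct cluster clusters[MAX_CLUSTERS];
        int            num_clusters;
    };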

  4. Cluster level • Minor modifications to an off-the-shelf Silicon Graphics 4D/340 system • 4 MIPS R3000 processors + 4 R3010 floating-point coprocessors • L1 write-through, L2 write-back. • Cache coherence: • MESI (Illinois protocol) • Cache-to-cache transfers: good for remote locations already cached within the cluster • L1 cache is write-through => inclusion between L1 and L2 (see the state sketch below) • Pipelined bus with a maximum bandwidth of 64MB/s.
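As a reminder of the intra-cluster protocol, here is a minimal sketch of the MESI (Illinois) line states kept per L2 cache line; this is the standard state set, not code from DASH.

    /* MESI line states used by the snoopy protocol inside a cluster. */
    enum mesi_state {
        MESI_INVALID,    /* not present / stale                                  */
        MESI_SHARED,     /* clean; other caches in the cluster may hold it too   */
        MESI_EXCLUSIVE,  /* clean, only copy; can be written without a bus cycle */
        MESI_MODIFIED    /* dirty, only copy; must supply data on a snooped read */
    };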

  5. Inter-cluster directory protocol • Three states per 16B memory block: invalid, shared, dirty. • Memory is distributed across clusters. • Directory bits: • Simple full-bit-vector scheme: 1 bit per cluster + 1 dirty bit (see the sketch below). • This is fine for the prototype, which has at most 16 clusters; it should be replaced by a limited-pointer or sparse directory for more clusters. • Replies are sent directly between clusters rather than through the home cluster • i.e. invalidation acks are collected at the requesting node, not at the home node of the memory location.
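A minimal C sketch of such a directory entry, assuming the prototype's 16-cluster full-bit-vector organization (names and the helper function are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* One directory entry per 16-byte memory block in the home cluster:
       16 presence bits + 1 dirty bit = the 17 bits quoted on slide 6. */
    struct dir_entry {
        uint16_t sharers;   /* bit i set => cluster i may hold a shared copy   */
        bool     dirty;     /* set => exactly one cluster holds the block dirty */
    };

    /* On a write request, the home cluster sends invalidations to every other
       sharer; the acknowledgements go straight to the requester, which counts
       them, rather than funneling through the home cluster. */
    static int count_invalidations_needed(const struct dir_entry *e, int requester)
    {
        int n = 0;
        for (int c = 0; c < 16; c++)
            if (((e->sharers >> c) & 1) && c != requester)
                n++;
        return n;
    }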

  6. Extra hardware for directory For each cluster, we have the following: • Directory bits: DRAM • 17 bits per 16-byte cache line • Directory controller: snoops every bus transaction within the cluster, accesses the directory bits, and takes action. • Reply controller • Remote access cache (RAC): SRAM, 128KB, 16B lines • Snoops remote accesses on the local bus • Stores the state of ongoing remote accesses made by local processors • Lockup-free: handles multiple outstanding requests • QUESTION: what happens if two remote requests collide in this direct-mapped cache? (see the sketch below) • Pseudo-CPU: • handles requests for local memory from remote nodes by issuing them on the local bus. • Performance monitor
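A minimal sketch of what a direct-mapped RAC lookup could look like; the collision handling shown (the later request stalls or retries until the entry retires) is an assumption, not something stated in the slide.

    #include <stdbool.h>
    #include <stdint.h>

    #define RAC_LINE_BYTES 16
    #define RAC_LINES      (128 * 1024 / RAC_LINE_BYTES)   /* 8K lines */

    struct rac_entry {
        bool     busy;   /* an outstanding remote request occupies this line */
        uint32_t tag;    /* upper address bits of the block being tracked    */
    };

    static struct rac_entry rac[RAC_LINES];

    /* Returns true if a new remote request can allocate an entry.  If another
       outstanding request maps to the same line (a collision), we assume the
       later request must stall/retry until that line frees. */
    static bool rac_try_allocate(uint32_t block_addr)
    {
        uint32_t idx = (block_addr / RAC_LINE_BYTES) % RAC_LINES;
        uint32_t tag = (block_addr / RAC_LINE_BYTES) / RAC_LINES;
        if (rac[idx].busy && rac[idx].tag != tag)
            return false;              /* collision: caller must retry later */
        rac[idx].busy = true;
        rac[idx].tag  = tag;
        return true;
    }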

  7. Memory performance • 4-level memory hierarchy: (1) processor caches (L1, L2), (2) local cluster (other L2s in the cluster + local memory), (3) home cluster (the directory home of the address), (4) remote cluster (holding a dirty copy).

  8. Hardware cost of directory • [table 2 in the paper] • 13.7% DRAM – directory bits (a rough sanity check follows below) • For larger systems, a sparse representation is needed. • 10% SRAM – remote access cache • 20% logic gates – controllers and network interfaces • Clustering is important: • With uniprocessor nodes, directory logic would be 44%. • Compare to message passing: • Message passing needs about 10% extra logic and ~0 extra memory. • Thus, hardware coherence costs about 10% more logic and 10% more memory. • It is later argued that the performance improvement is much greater than this ~10% cost: 3-4X.
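A rough sanity check of the DRAM figure, assuming the 17 directory bits per 16-byte (128-bit) line from slide 6; any gap to the reported 13.7% presumably comes from details not given in the slides.

    \[
    \frac{\text{directory bits}}{\text{data bits}}
      = \frac{17}{16 \times 8}
      = \frac{17}{128}
      \approx 13.3\%
    \]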

  9. Performance monitor • Configurable events. • SRAM-based counters: • 2 banks of 16K x 32 SRAM. • Addressed by events (i.e. event0, event1, event2, ... form address bits 0, 1, 2, ...) • Thus, each bank can track log2(16K) = 14 event signals (see the sketch below). • Trace buffer made of DRAM: • Can store 2M memory ops. • With software support, can log all memory operations.
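A minimal sketch of the event-addressed counting scheme, assuming the 14 event signals of one bank concatenate into an SRAM address whose 32-bit counter is incremented; names are illustrative.

    #include <stdint.h>

    #define EVENTS_PER_BANK 14
    #define BANK_DEPTH      (1u << EVENTS_PER_BANK)   /* 16K counters */

    static uint32_t bank0[BANK_DEPTH];   /* 16K x 32 SRAM modeled as an array */

    /* The 14 event wires form the SRAM address, so every observed combination
       of events gets its own 32-bit occurrence counter. */
    static void count_events(const uint8_t event[EVENTS_PER_BANK])
    {
        uint32_t addr = 0;
        for (int i = 0; i < EVENTS_PER_BANK; i++)
            addr |= (uint32_t)(event[i] & 1) << i;   /* event i -> address bit i */
        bank0[addr]++;
    }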

  10. Performance results • 9 applications • Good speed-up on 5 of them, without optimizing specifically for DASH. • MP3D has bad locality; PSIM4 is an enhanced version of MP3D. • Cholesky: with more processors the task granularity becomes too fine, unless the problem size is increased unreasonably. • Note the dip after P = 4, the point at which references start going off-cluster.

  11. Detailed study of 3 applications • What to read from tables 4, 5, 6: • Water and LocusRoute have an equal fraction of local reads, yet Water scales well and LocusRoute does not. • Remote caching works: • Water and LocusRoute make a remote reference every 20 and 11 instructions respectively, but the busy pclks between processor stalls are 506 and 181.

  12. Conclusions • Locality is still important, because remote latencies are higher. • However, for many applications, natural locality is enough (Barnes-Hut, Radiosity, Water). • Thus, good speed-ups can be achieved without a more difficult programming model (e.g., message passing). • For higher performance, one does have to worry about the extended memory hierarchy, but only for critical data structures. This is analogous to the uniprocessor-world argument of caches vs. scratchpad memories/stream buffers.
