
Presentation Transcript


  1. CS 258 Parallel Computer Architecture, Lecture 15.1. DASH: Directory Architecture for Shared Memory: Implementation, Cost, Performance. Daniel Lenoski et al., “The DASH Prototype: Implementation and Performance”, Proceedings of the International Symposium on Computer Architecture, 1992. March 17, 2008, Rhishikesh Limaye.

  2. DASH objectives • Demonstrate a large-scale shared-memory multiprocessor using directory-based cache coherence. • Prototype with 16-64 processors. • The argument: for both performance and programmability, a parallel architecture should • scale to 100s-1000s of processors, • have high-performance individual processors, and • have a single shared address space.

  3. Two-level architecture • Cluster: • Uses bus-based shared memory with snoopy cache coherence • 4 processors per cluster • Inter-cluster: • Scalable interconnect network • Directory-based cache coherence
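To make the two levels concrete, here is a minimal C sketch of how the prototype's topology could be modeled; the type and field names (dash_system, cluster, and so on) are illustrative, not taken from the paper.

    #include <stddef.h>
    #include <stdint.h>

    #define PROCS_PER_CLUSTER 4    /* snoopy bus within a cluster */
    #define MAX_CLUSTERS      16   /* prototype scales to 16 clusters = 64 CPUs */

    /* One bus-based SMP node: coherence inside it is pure snooping. */
    struct cluster {
        int      id;
        uint32_t proc_ids[PROCS_PER_CLUSTER];
        /* portion of the globally shared address space whose home is here */
        uint8_t *local_memory;
        size_t   local_memory_bytes;
    };

    /* The full machine: clusters tied together by a scalable network;
       coherence across clusters is directory-based (see slide 5). */
    struct dash_system {
        struct cluster clusters[MAX_CLUSTERS];
        int            num_clusters;
    };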

  4. Cluster level • Minor modifications to an off-the-shelf Silicon Graphics 4D/340 system • 4 MIPS R3000 processors + 4 R3010 floating-point coprocessors • L1 write-through, L2 write-back. • Cache coherence: • MESI (Illinois protocol) • Cache-to-cache transfers: good for remote locations already cached within the cluster • L1 cache is write-through => inclusion between L1 and L2 (see the state sketch below) • Pipelined bus with a maximum bandwidth of 64MB/s.
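As a reminder of the intra-cluster protocol, here is a minimal sketch of the MESI (Illinois) line states kept per L2 cache line; this is the standard state set, not code from DASH.

    /* MESI line states used by the snoopy protocol inside a cluster. */
    enum mesi_state {
        MESI_INVALID,    /* not present / stale                                  */
        MESI_SHARED,     /* clean; other caches in the cluster may hold it too   */
        MESI_EXCLUSIVE,  /* clean, only copy; can be written without a bus cycle */
        MESI_MODIFIED    /* dirty, only copy; must supply data on a snooped read */
    };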

  5. Inter-cluster directory protocol • Three states per 16B memory block: invalid, shared, dirty. • Memory is distributed across clusters. • Directory bits: • Simple full-bit-vector scheme: 1 bit per cluster + 1 dirty bit (see the sketch below). • This is fine for the prototype, which has at most 16 clusters; it should be replaced by a limited-pointer or sparse directory for more clusters. • Replies are sent directly between clusters rather than through the home cluster • i.e. invalidation acks are collected at the requesting node, not at the home node of the memory location.
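A minimal C sketch of such a directory entry, assuming the prototype's 16-cluster full-bit-vector organization (names and the helper function are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* One directory entry per 16-byte memory block in the home cluster:
       16 presence bits + 1 dirty bit = the 17 bits quoted on slide 6. */
    struct dir_entry {
        uint16_t sharers;   /* bit i set => cluster i may hold a shared copy   */
        bool     dirty;     /* set => exactly one cluster holds the block dirty */
    };

    /* On a write request, the home cluster sends invalidations to every other
       sharer; the acknowledgements go straight to the requester, which counts
       them, rather than funneling through the home cluster. */
    static int count_invalidations_needed(const struct dir_entry *e, int requester)
    {
        int n = 0;
        for (int c = 0; c < 16; c++)
            if (((e->sharers >> c) & 1) && c != requester)
                n++;
        return n;
    }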

  6. Extra hardware for directory For each cluster, we have the following: • Directory bits: DRAM • 17 bits per 16-byte cache line • Directory controller: snoops every bus transaction within the cluster, accesses the directory bits, and takes action. • Reply controller • Remote access cache (RAC): SRAM, 128KB, 16B lines • Snoops remote accesses on the local bus • Stores the state of ongoing remote accesses made by local processors • Lockup-free: handles multiple outstanding requests • QUESTION: what happens if two remote requests collide in this direct-mapped cache? (see the sketch below) • Pseudo-CPU: • handles requests for local memory from remote nodes by issuing them on the local bus. • Performance monitor
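A minimal sketch of what a direct-mapped RAC lookup could look like; the collision handling shown (the later request stalls or retries until the entry retires) is an assumption, not something stated in the slide.

    #include <stdbool.h>
    #include <stdint.h>

    #define RAC_LINE_BYTES 16
    #define RAC_LINES      (128 * 1024 / RAC_LINE_BYTES)   /* 8K lines */

    struct rac_entry {
        bool     busy;   /* an outstanding remote request occupies this line */
        uint32_t tag;    /* upper address bits of the block being tracked    */
    };

    static struct rac_entry rac[RAC_LINES];

    /* Returns true if a new remote request can allocate an entry.  If another
       outstanding request maps to the same line (a collision), we assume the
       later request must stall/retry until that line frees. */
    static bool rac_try_allocate(uint32_t block_addr)
    {
        uint32_t idx = (block_addr / RAC_LINE_BYTES) % RAC_LINES;
        uint32_t tag = (block_addr / RAC_LINE_BYTES) / RAC_LINES;
        if (rac[idx].busy && rac[idx].tag != tag)
            return false;              /* collision: caller must retry later */
        rac[idx].busy = true;
        rac[idx].tag  = tag;
        return true;
    }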

  7. Memory performance • 4-level memory hierarchy: (1) processor caches (L1, L2), (2) local cluster (other L2s in the cluster + local memory), (3) home cluster (the directory home of the address), (4) remote cluster (holding a dirty copy).

  8. Hardware cost of directory • [table 2 in the paper] • 13.7% DRAM – directory bits (a rough sanity check follows below) • For larger systems, a sparse representation is needed. • 10% SRAM – remote access cache • 20% logic gates – controllers and network interfaces • Clustering is important: • With uniprocessor nodes, directory logic would be 44%. • Compare to message passing: • Message passing needs about 10% extra logic and ~0 extra memory. • Thus, hardware coherence costs about 10% more logic and 10% more memory. • It is later argued that the performance improvement is much greater than this ~10% cost: 3-4X.
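A rough sanity check of the DRAM figure, assuming the 17 directory bits per 16-byte (128-bit) line from slide 6; any gap to the reported 13.7% presumably comes from details not given in the slides.

    \[
    \frac{\text{directory bits}}{\text{data bits}}
      = \frac{17}{16 \times 8}
      = \frac{17}{128}
      \approx 13.3\%
    \]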

  9. Performance monitor • Configurable events. • SRAM-based counters: • 2 banks of 16K x 32 SRAM. • Addressed by events (i.e. event0, event1, event2, ... form address bits 0, 1, 2, ...) • Thus, each bank can track log2(16K) = 14 event signals (see the sketch below). • Trace buffer made of DRAM: • Can store 2M memory ops. • With software support, can log all memory operations.
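A minimal sketch of the event-addressed counting scheme, assuming the 14 event signals of one bank concatenate into an SRAM address whose 32-bit counter is incremented; names are illustrative.

    #include <stdint.h>

    #define EVENTS_PER_BANK 14
    #define BANK_DEPTH      (1u << EVENTS_PER_BANK)   /* 16K counters */

    static uint32_t bank0[BANK_DEPTH];   /* 16K x 32 SRAM modeled as an array */

    /* The 14 event wires form the SRAM address, so every observed combination
       of events gets its own 32-bit occurrence counter. */
    static void count_events(const uint8_t event[EVENTS_PER_BANK])
    {
        uint32_t addr = 0;
        for (int i = 0; i < EVENTS_PER_BANK; i++)
            addr |= (uint32_t)(event[i] & 1) << i;   /* event i -> address bit i */
        bank0[addr]++;
    }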

  10. Performance results • 9 applications • Good speed-up on 5 of them, without optimizing specifically for DASH. • MP3D has bad locality; PSIM4 is an enhanced version of MP3D. • Cholesky: with more processors the task granularity becomes too fine, unless the problem size is increased unreasonably. • Note the dip after P = 4, the point at which references start going off-cluster.

  11. Detailed study of 3 applications • What to read from tables 4, 5, 6: • Water and LocusRoute have an equal fraction of local reads, yet Water scales well and LocusRoute does not. • Remote caching works: • Water and LocusRoute make a remote reference every 20 and 11 instructions respectively, but the busy pclks between processor stalls are 506 and 181.

  12. Conclusions • Locality is still important, because remote latencies are higher. • However, for many applications, natural locality is enough (Barnes-Hut, Radiosity, Water). • Thus, good speed-ups can be achieved without a more difficult programming model (e.g., message passing). • For higher performance, one does have to worry about the extended memory hierarchy, but only for critical data structures. This is analogous to the uniprocessor-world argument of caches vs. scratchpad memories/stream buffers.
