Software Techniques for Distributed Shared Memory
Zoran Radovic, zoran.radovic@it.uu.se
Dissertation Seminar, 18 November 2005, Auditorium Minus, Museum Gustavianum

Outline
• NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture

Vasaloppet: “Contention Problem in Sweden”
• Traditional cross-country ski race, 90 km
[Figure labels: CS; 85.6533 km to go…]

Spin Locks under Contention
• IF (more contention) THEN less efficient CS …
• “The more important the slower it runs…”
[Figure: Critical Section (CS) cost vs. amount of contention; curves: spin locks, spin locks with backoff]

Queue-based Locks
• IF (more contention) THEN constant CS cost …
[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks with backoff, queue-based locks]

This Dissertation
• IF (more contention) THEN more efficient CS …
• “The more important the faster it runs…”
[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks with backoff, queue-based locks, NUCA locks]

NUCA Locks (Basic Idea)
1) Reduce traffic: one CPU per node is testing…
2) Improve lock handover
3) More efficient CS: local traffic is cheaper
[Figure: nodes, each with processors (P), caches ($) and memory, connected by a switch; per-CPU labels: Lock/Unlock, Test]

The HBO Lock (the simplest HBO)
Creates Communication Affinity
• What do we need?
  • node_id
  • Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
• lock-acquire:
  • If the lock-value is in the state FREE:
    • The node_id is CAS-ed into the lock location
  • Else: 2 cases
    • The lock is “local”: spin with small backoff
    • The lock is “remote”: spin with large backoff
• Simple but fairly effective…

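The lock-acquire steps above can be sketched in C11 as follows. This is a minimal illustration, not the dissertation's code: the FREE encoding (-1), the two fixed backoff constants, the busy-wait delay, and all `hbo_*` names are assumptions made for the example. The only ingredients taken from the slide are the node_id, the CAS of node_id into a FREE lock word, and the choice between a small and a large backoff depending on whether the holder is local or remote.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed encoding: node ids are >= 0, so -1 can stand for FREE. */
#define HBO_FREE (-1)

/* Hypothetical backoff lengths in spin iterations; not from the talk. */
enum { HBO_BACKOFF_LOCAL = 16, HBO_BACKOFF_REMOTE = 1024 };

/* The lock word holds HBO_FREE or the node id of the current holder. */
typedef struct { atomic_int owner; } hbo_lock_t;

void hbo_init(hbo_lock_t *l) { atomic_init(&l->owner, HBO_FREE); }

/* One attempt: CAS(lock, FREE, node_id).
 * Returns HBO_FREE on success, otherwise the holder's node id. */
int hbo_try_acquire(hbo_lock_t *l, int node_id) {
    assert(node_id >= 0);
    int expected = HBO_FREE;
    if (atomic_compare_exchange_strong(&l->owner, &expected, node_id))
        return HBO_FREE;
    return expected; /* on failure, the CAS stored the observed value here */
}

/* Small backoff if the holder is on our node, large if it is remote. */
int hbo_backoff_for(int holder, int node_id) {
    return holder == node_id ? HBO_BACKOFF_LOCAL : HBO_BACKOFF_REMOTE;
}

/* Crude busy-wait delay, used only for illustration. */
static void hbo_spin(int iterations) {
    for (volatile int i = 0; i < iterations; i++) { }
}

void hbo_acquire(hbo_lock_t *l, int node_id) {
    int holder;
    while ((holder = hbo_try_acquire(l, node_id)) != HBO_FREE)
        hbo_spin(hbo_backoff_for(holder, node_id));
}

void hbo_release(hbo_lock_t *l) { atomic_store(&l->owner, HBO_FREE); }
```

Because remote waiters retry less often than local ones, a released lock tends to be picked up by a CPU on the same node, which is the communication affinity the slide refers to. The same property is why fairness becomes a question.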
Performance Results
Realistic microbenchmark, 2-node WildFire, 28 CPUs
[Figure; labels: 14, 14, WF, “Fairness?”]

Fairness Study
Realistic microbenchmark, 2-node WildFire, 28 CPUs
[Figure; axis label: t]

Application Performance
28-processor runs
[Figure; annotation: ≈ 4x]

Total Traffic: Raytrace
[Figure]

HBO Locks inside Linux Kernel
• Patch provided by Silicon Graphics, Inc.
• Linux-IA64 kernel implementation, May 2005
• Page-fault handler runs 3x faster
• 60 processors

Outline
• NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture

The DSZOOM Proposal
• Run entire protocol in requesting-processor
  • No protocol agent communication!
• Assumes user-level remote memory access
  • put, get, and atomics [InfiniBand]
• Fine-grain memory protocols (e.g., 64 bytes)
• Hardware-like memory models [Shasta, Blizzard, Sirocco]

“Squeezing” Protocols into Binaries…
Binary/Assembler level instrumentation

Original Program and DSZOOM Program (same excerpt shown for both):
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...

Original load:
    ld [%o1 + 64], %o0

Instrumented load with Fast-path Protocol Code:
    ld [%o1 + 64], %o0
    mov 255, %g6
    and %g6, %o0, %g6
    cmp %g6, 170
    bne 0x24450
    nop

Slow-path Protocol Code (C-code)

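Read as C, the instrumented load sequence above does roughly the following: perform the load, mask the low byte of the loaded value, and compare it with the constant 170. Anything else continues with the program; a match falls through to the slow path. The sketch below mirrors only that control flow. The function-pointer hook, the demo slow path, and its return value are hypothetical stand-ins, since the slides do not show the slow-path C code.

```c
#include <assert.h>
#include <stdint.h>

/* The constant the fast path compares against (cmp %g6, 170). */
#define DSZ_MAGIC_BYTE 170u

/* Hypothetical slow-path hook: takes the address and returns valid data. */
typedef uint32_t (*dsz_slow_path_fn)(const uint32_t *addr);

/* Fast-path check wrapped around a single load. */
uint32_t dsz_checked_load(const uint32_t *addr, dsz_slow_path_fn slow_path) {
    assert(addr != 0 && slow_path != 0);
    uint32_t value = *addr;                    /* ld  [%o1 + 64], %o0        */
    if ((value & 255u) != DSZ_MAGIC_BYTE)      /* mov, and, cmp, bne         */
        return value;                          /* common case: data is valid */
    return slow_path(addr);                    /* fall through to slow path  */
}

/* Demo stand-in for the slow path: counts calls, returns a fixed value. */
int dsz_slow_path_calls = 0;
uint32_t dsz_demo_slow_path(const uint32_t *addr) {
    (void)addr;
    dsz_slow_path_calls++;
    return 42u;
}
```

A check of this kind cannot distinguish invalid data from ordinary data that happens to end in the byte 170, so in this sketch the slow path is assumed to be responsible for telling the two apart.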
Write Permission Caching
• Problem: store instrumentation relies on locking
  • More complex instrumentation
• Solution: write permission cache (WPC)
  • Small and fast software-managed cache
  • Keeps write permissions
• The WPC idea:
  • Exploit store locality
  • Dynamically reduce the number of memory references in store checking code

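One way to picture the WPC idea is a tiny direct-mapped table of coherence-unit numbers that the thread already holds write permission for. A store first looks there, and only a miss pays for the locked permission check. The sketch below is an illustration under assumptions: the table size, the 64-byte unit, the direct-mapped organization, and every `wpc_*` name are invented for the example, and the slow path is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes: 4 entries, 64-byte coherence units. */
enum { WPC_ENTRIES = 4, WPC_LINE_BYTES = 64 };
#define WPC_EMPTY UINTPTR_MAX

typedef struct {
    uintptr_t line[WPC_ENTRIES]; /* unit numbers we hold write permission for */
    unsigned long slow_checks;   /* how often the expensive path was taken    */
} wpc_t;

void wpc_init(wpc_t *c) {
    for (int i = 0; i < WPC_ENTRIES; i++)
        c->line[i] = WPC_EMPTY;
    c->slow_checks = 0;
}

/* Stand-in for the locked permission acquisition; it only counts calls. */
static void wpc_acquire_write_permission(wpc_t *c, uintptr_t line) {
    (void)line;
    c->slow_checks++;
}

/* Store check: returns true on a WPC hit (no protocol work needed). */
bool wpc_store_check(wpc_t *c, uintptr_t addr) {
    uintptr_t line = addr / WPC_LINE_BYTES;
    unsigned slot = (unsigned)(line % WPC_ENTRIES);
    if (c->line[slot] == line)
        return true;                       /* permission already cached */
    wpc_acquire_write_permission(c, line); /* slow path on a miss       */
    c->line[slot] = line;                  /* replaces the old entry; a real
                                              protocol would also have to give
                                              up the evicted permission
                                              (assumption, not modeled here) */
    return false;
}
```

With store locality, consecutive stores to the same 64-byte unit hit in the table, so the slow check runs once per unit rather than once per store.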
Other “Features”
• Two kinds of protocols
  • Invalidate
  • Update
• Many optimizations
  • Instrumentation scheduling (update and invalidate)
  • Instrumentation batching (invalidate)
  • WPC-based write batching (update)
  • WPC-based dirty-data filtering (update)
  • Private-data filtering (update)
  • # of WPC entries (update and invalidate)
  • Coherence unit size (update and invalidate)

Coherence Flags and Profiling
• Coherence flags
  • Similar to optimization flags of compilers
  • Possible scenario: gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c
• Execution profiling
  • Similar to profile feedback of compilers
  • Helps find appropriate coherence flag settings
  • Low overhead implementation in DSZOOM
    • Less than 30 percent overhead
    • Works for both small and large input sets

DSZOOM Results
2-node WildFire, 16 CPUs
[Figure; annotations: 1.45x, 1.11x]

Outline
• NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture

Instrumentation Drawbacks
[Figure: the Original Program / DSZOOM Program instrumentation example from “Squeezing” Protocols into Binaries…, with Fast-path Protocol Code and Slow-path Protocol Code (C-code)]
• Binary transparency?
• Run-time execution overhead

Trap-Based Memory Architectures
• Basic idea
  • Detect fine-grained coherence violations in hardware
  • Trigger a coherence trap when one occurs
  • Maintain coherence by software protocols
• No memory system modifications
• Minimal processor modifications
• Binary Transparency
  • No need to instrument binaries/applications

TMA Lite
Proof-of-concept Implementation
• Load permission check
  • Hardware implementation of software check
  • Predefined “magic-value” convention
• Store permission check
  • Hardware WPC
  • Can be seen as a very small cache
  • Operates on virtual addresses
  • Accessed in parallel with the data TLB

TMA Lite Performance
[TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]
[Figure; annotations: 1.75x, 1.01x]

Topics not Presented
• RH lock algorithm
  • Controlled (un)fairness
• HBO_GT and HBO_GT_SD algorithms
  • Global throttling and starvation detection
• DSZOOM implementation details
  • Instrumentation challenges; scheduling, batching, etc.
  • Bandwidth filtering techniques; dirty- & private-data
• Innovative TMA simulation tricks
  • Low-level “good days” hacks
  • Reusing Simics checkpoints
