
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
By T. E. Anderson

Presented by Ashish Jha

PSU SP 2010 CS-510

05/20/2010


Agenda

  • Preview of an SMP single-bus-based system

    • Cache ($) protocol and the bus

  • What is a Lock?

    • Usage and operations in a CS

  • What is Spin-Lock?

    • Usage and operations in a CS

  • Problems with Spin-Locks on SMP systems

  • Methods to improve Spin-Lock performance in both SW & HW

  • Summary


Preview: SMP Arch

  • Shared Bus

    • Coherent, Consistent and Contended Memory

  • Snoopy Invalidation based Cache Coherence Protocol

    • Guarantees Atomicity of a memory operation

  • Sources of Contention

    • Bus

    • Memory Modules

[Diagram: CPUs 0..N, each with a private L1D and last-level cache (LN$), connected by a shared bus (BSQ). T1: a LD of [M1] brings the line into one cache in Exclusive state. T2: a LD by another CPU makes the line Shared in both. T3: a ST to [M1] makes the line Modified in the writer's cache and Invalid in the others – the invalidation-based coherence protocol at work.]


What is a Lock?

  • An instruction defined and exposed by the ISA

    • To achieve “exclusive” access to memory

  • A lock is an “atomic” RMW operation

    • The uArch guarantees atomicity

      • achieved via the cache coherence protocol

  • Used to implement a Critical Section

    • A block of code with “Exclusive” access

  • Examples

    • TSL – Test-and-Set-Lock

    • CAS – Compare-and-Swap
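As an illustrative aside (not from the slides), both primitives map directly onto C11 atomics; the helper names try_tsl and try_cas below are hypothetical, a minimal sketch only:

#include <stdatomic.h>
#include <stdbool.h>

/* TSL: atomically write BUSY (set) and report whether it was already BUSY. */
static bool try_tsl(atomic_flag *lock)
{
    /* returns false if the flag was clear (CLEAN), i.e. we just acquired the lock */
    return atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

/* CAS: atomically replace *word with `desired` only if it still equals `expected`. */
static bool try_cas(atomic_int *word, int expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        word, &expected, desired,
        memory_order_acquire, memory_order_relaxed);
}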


Lock Operation

[Flowchart: Reg = TSL [M1]

  • If the $ line is not already Exclusive/Modified locally, the TSL is a local $ miss and causes a bus transaction: the line is invalidated in the other CPU's cache (remote $ hit) or in memory, and M1 ends up in the local $ in Modified state

  • The TSL always writes BUSY to M1

    • If M1 was CLEAN: Reg=CLEAN – GOT LOCK!

    • If M1 was already BUSY: Reg=BUSY – NO LOCK]


Critical Section Using a Lock

  • Simple, Intuitive and Elegant

Reg = TSL [M1]        // [M1] = BUSY
if Reg == CLEAN       // got the lock
    Execute CS
    [M1] = CLEAN      // un-lock
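A minimal C sketch of this one-shot (try once) pattern, assuming a C11 atomic_flag as the lock word; purely illustrative:

#include <stdatomic.h>

static atomic_flag M1 = ATOMIC_FLAG_INIT;                      /* CLEAN */

void try_critical_section(void)
{
    if (!atomic_flag_test_and_set_explicit(&M1, memory_order_acquire)) {
        /* got the lock: execute the critical section */
        /* ... CS ... */
        atomic_flag_clear_explicit(&M1, memory_order_release); /* un-lock */
    }
}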


Critical Section Using a Spin-Lock

  • Spin on Test-and-Set

  • Yet again, Simple, Intuitive and Elegant

do
    Reg = TSL [M1]    // [M1] = BUSY
while Reg == BUSY     // spin-lock: retry while BUSY
// got the lock
Execute CS
[M1] = CLEAN          // un-lock
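In C, a minimal sketch of spin on test-and-set using C11 atomics; the names spin_lock/spin_unlock and the use of atomic_flag are assumptions, not from the slides:

#include <stdatomic.h>

static atomic_flag M1 = ATOMIC_FLAG_INIT;                     /* CLEAN */

void spin_lock(void)
{
    /* every iteration is an atomic RMW (a write), hence a bus transaction */
    while (atomic_flag_test_and_set_explicit(&M1, memory_order_acquire))
        ;                                                     /* still BUSY: keep spinning */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&M1, memory_order_release);    /* [M1] = CLEAN */
}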


Problem with Spin-Locks

  • A lock is an RMW operation

    • A “simple?” store op

  • Works well for uniprocessor (UP) to few-core environments… see next slide…



Spin-Lock in a Many-Core Environment

  • Severe Contention on the Bus, with Traffic from

    • Snoops

    • Invalidations

    • Regular Requests

  • Contended Memory module

    • Data requested by different CPUs residing in the same module

[Diagram: CPUs 0..N on the shared bus (BSQ), each with its own L1D and LN$. T1: CPU 0 acquires the lock with TSL[M1]. T2: CPU 1 spins with TSL[M1]. T3: the holder releases ([M1]=CLEAN) while CPUs N-2 and N-1 keep spinning with TSL[M1] and CPU N issues unrelated requests to [M2]; the lock line ping-pongs between Modified and Invalid across the L1Ds, and the pending requests pile up in the bus/memory-module queues (Q0–Q3).]


Spin-Lock in a Many-Core Environment (Cont'd)

  • An avalanche effect on bus and memory-module contention as

    • the number of CPUs grows – impacts scalability

      • More snoop and coherence traffic, with a ping-pong effect on locks – unsuccessful test-and-sets and invalidations

      • More starvation – the lock has been released, but acquiring it is delayed further by bus contention

    • Requests conflicting in the same memory module

    • Top it off with SW bugs

      • Locks and/or regular requests conflicting in the same cache line (CL)

  • Suppose lock latency was 20 core clocks

    • The bus runs as much as 10x slower

      • Now the latency to acquire the lock could grow to 10x as many core clocks, or more



A Better Spin-Lock

  • Spin on Read (Test-and-Test-and-Set)

    • A bit better, as long as the lock is not modified while spinning on the cached value

      • Doesn't hold up as the number of CPUs scales

        • Same set of problems as before – many invalidations due to the TSL

do
    while [M1] == BUSY     // spin on lock RD (cached copy)
        ;
    Reg = TSL [M1]         // [M1] = BUSY
while Reg == BUSY
// got the lock
Execute CS
[M1] = CLEAN               // un-lock
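A sketch of spin on read (test-and-test-and-set) with C11 atomics; it uses an int lock word (0 = CLEAN, 1 = BUSY) because the read-spin needs a plain load. All names are illustrative:

#include <stdatomic.h>

static atomic_int M1 = 0;                                 /* 0 = CLEAN, 1 = BUSY */

void ttas_lock(void)
{
    for (;;) {
        /* spin on read: hits the locally cached copy, no bus traffic while BUSY */
        while (atomic_load_explicit(&M1, memory_order_relaxed) != 0)
            ;
        /* lock looked CLEAN: attempt the real TSL (atomic RMW) */
        if (atomic_exchange_explicit(&M1, 1, memory_order_acquire) == 0)
            return;                                       /* got the lock */
    }
}

void ttas_unlock(void)
{
    atomic_store_explicit(&M1, 0, memory_order_release);  /* [M1] = CLEAN */
}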


Verify Through Tests

  • Spin-lock latency and performance with small and large amounts of contention

  • Results confirm

    • Sharp degradation in performance for spin on test-and-set as the number of CPUs is scaled

      • Spin on read is slightly better

    • Both methods degrade badly (scale poorly) as CPUs are added

      • Peak performance is never reached – time to quiesce is almost linear in CPU count, hurting communication BW

[Figure: time to quiesce for spin on read (µsec). SOURCE: figures copied from the paper.]

  • 20-CPU Sequent Symmetry Model B SMP

    • Write-back invalidation-based caches

    • Shared bus – the same bus for lock and regular requests

  • Lock acquire + release = 5.6 usec

  • Measured: elapsed time for the CPUs to execute the CS 1M times

    • Each CPU loops: wait for the lock, do the CS, release, then delay for a randomly selected time


What Can Be Done?

  • Can Spin-Lock performance be improved by

    • SW

      • Any efficient algorithm for busy locks?

    • HW

      • Any more complex HW needed?


SW Improvement #1a: Delay the TSL

Two flows compared: the spin on Test-and-Test-and-Set spin-lock, and the spin on Test, Delay, then Test-and-Set spin-lock (shown below the bullets)

  • By delaying the TSL

    • Reduce the number of invalidations and bus contention

  • The delay could be set

    • Statically – delay slots for each processor, which could be prioritized

    • Dynamically – as in CSMA networks – exponential back-off

  • Performance good with

    • Short delay and few spinners

    • Long delay and many spinners

The left flow is the spin on Test-and-Test-and-Set from the previous slide; the delayed variant inserts a DELAY before the TSL:

do
    while [M1] == BUSY     // spin on lock RD
        ;
    DELAY                  // delay before the TSL
    Reg = TSL [M1]         // [M1] = BUSY
while Reg == BUSY
// got the lock
Execute CS
[M1] = CLEAN               // un-lock
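A sketch of the delay-before-TSL variant with CSMA-style exponential back-off, again with an int lock word; the busy-delay loop, initial delay, and cap are arbitrary illustrative choices:

#include <stdatomic.h>

static atomic_int M1 = 0;                                 /* 0 = CLEAN, 1 = BUSY */

static void delay_for(unsigned n)                         /* crude busy delay, illustrative */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void backoff_lock(void)
{
    unsigned delay = 8;                                   /* initial back-off (arbitrary) */
    for (;;) {
        while (atomic_load_explicit(&M1, memory_order_relaxed) != 0)
            ;                                             /* spin on read */
        delay_for(delay);                                 /* DELAY before the TSL */
        if (atomic_exchange_explicit(&M1, 1, memory_order_acquire) == 0)
            return;                                       /* got the lock */
        if (delay < 1024)
            delay *= 2;                                   /* exponential back-off after a failed TSL */
    }
}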


SW Improvement #1b: Delay After Each Lock Access

  • Delay after each Lock access

    • Check lock less frequently

      • TSL – fewer misses due to invalidation, less bus contention

      • Lock RD – fewer misses due to invalidation, less bus contention

  • Good for architectures with no caches

    • Avoids overflowing the communication (bus, network) BW

Spin on Test-and-Test-and-Set vs. Delay on Test and Delay on Test-and-Set:

do
    while [M1] == BUSY     // lock RD
        DELAY              // 1. DELAY after each lock RD, before the TSL
    Reg = TSL [M1]         // [M1] = BUSY
    if Reg == BUSY
        DELAY              // 2. DELAY after a failed TSL
while Reg == BUSY
// got the lock
Execute CS
[M1] = CLEAN               // un-lock
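A sketch of the delay-after-each-access variant: the lock word is simply polled less frequently. The fixed per-poll delay of 64 iterations and all names are arbitrary illustrative values:

#include <stdatomic.h>

static atomic_int M1 = 0;                                 /* 0 = CLEAN, 1 = BUSY */

static void delay_for(unsigned n)                         /* crude busy delay, illustrative */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void delayed_poll_lock(void)
{
    for (;;) {
        while (atomic_load_explicit(&M1, memory_order_relaxed) != 0)
            delay_for(64);                                /* DELAY after each lock read */
        if (atomic_exchange_explicit(&M1, 1, memory_order_acquire) == 0)
            return;                                       /* got the lock */
        delay_for(64);                                    /* DELAY after a failed TSL too */
    }
}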


SW Improvement #2: Queuing

  • To resolve contention

    • Delay uses time

    • Queue uses space

  • Queue Implementation

    • Basic

      • Allocate a slot for each waiting CPU in a queue

        • Requires insertion and deletion – atomic ops

          • Not good for small CSs

    • Efficient

      • Each CPU gets a unique seq# – one atomic op

      • On releasing the lock, the current CPU activates the one with the next seq# – no atomic op

  • Q Performance

    • Works well (offers low contention) for bus-based and network-based architectures with invalidation-based caches

    • Less valuable for bus-based architectures with no caches, since each CPU's polling still contends on the bus

    • Increased lock latency under low contention, due to the overhead of obtaining the lock

    • Preemption of the CPU holding the lock can further starve the CPUs waiting on it – pass the token before switching out

    • A centralized queue becomes a bottleneck as the number of CPUs increases – solutions include dividing the queue between nodes, etc.

Array of 0 to (N-1) slots (“queues”)

  • One slot for each CPU, each in a separate CL

  • Each CPU spins on its own slot – continuous polling, but no coherence traffic

  • No atomic TSL needed to spin – the lock is your slot (one slot array per lock!)

  • CPU 0 gets the lock; when unlocking, it passes the token to another CPU (e.g., CPU 5)

    • Requires an atomic TSL on that slot

    • Some criterion picks the “another”, e.g., priority or FIFO
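A sketch of an array-based queue lock in the spirit of this slide: one flag slot per CPU, each padded to its own cache line, one atomic fetch-and-increment to take a sequence number, and a plain store to pass the token. MAX_CPUS, the 64-byte line size, and all names are assumptions for illustration:

#include <stdatomic.h>

#define MAX_CPUS   32                              /* assumed max number of spinners */
#define LINE_SIZE  64                              /* assumed cache-line size */

struct slot {
    atomic_int has_lock;                           /* 1 = this slot holds the token */
    char pad[LINE_SIZE - sizeof(atomic_int)];      /* keep each slot in its own cache line */
};

static struct slot slots[MAX_CPUS];
static atomic_uint next_seq;                       /* sequence-number dispenser */

void queue_lock_init(void)
{
    atomic_store_explicit(&slots[0].has_lock, 1, memory_order_relaxed);  /* slot 0 starts with the token */
}

unsigned queue_lock(void)                          /* returns my slot, needed for unlock */
{
    /* one atomic op: take a unique sequence number, i.e. a slot of my own */
    unsigned me = atomic_fetch_add_explicit(&next_seq, 1, memory_order_relaxed) % MAX_CPUS;
    /* spin only on my own slot: no ping-pong on a shared lock line */
    while (!atomic_load_explicit(&slots[me].has_lock, memory_order_acquire))
        ;
    atomic_store_explicit(&slots[me].has_lock, 0, memory_order_relaxed);  /* reset slot for reuse */
    return me;
}

void queue_unlock(unsigned me)
{
    /* pass the token to the next waiter (FIFO): a plain store, no atomic RMW */
    atomic_store_explicit(&slots[(me + 1) % MAX_CPUS].has_lock, 1, memory_order_release);
}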


SW Improvements: Test Results

  • 20-CPU Sequent Symmetry Model B

  • Static and dynamic delay = 0–15 usec

  • TSL = 1 usec

  • No atomic increment; the queue uses an explicit lock with backoff to access the seq#

  • Each CPU loops 1M/#P times to acquire, do the CS, release, and compute

    • Measured: spin-waiting overhead (sec) in executing the benchmark

  • At low CPU count (low contention)

    • The queue has high latency due to the lock overhead

  • At high CPU count

    • The queue performs best

    • Back-off performs slightly worse than static delays

SOURCE: Figure copied from paper


HW Solutions

  • Separate Bus for Lock and Regular memory requests

    • As in the Sequent Balance

      • Regular req follows invalidation based $ coherence

      • Lock req follows distributed-write based $ coherence

  • Expensive solution

    • Little benefit to apps that don't spend much time spin-waiting

    • How to manage if the two buses are slower


HW Solution: Multistage Interconnection Network CPUs

  • NUMA-type architecture

    • An “SMP view” formed as a “combination of memory” across the “nodes”

  • Collapse all simultaneous requests for a single lock from a node into one

    • The value would be the same for all of them anyway

    • Saves contention BW

      • But the gain could be offset by the increased latency of the “combining switches”

        • Could be beaten by a normal network with backoff or queuing

  • HW queue

    • Such as one maintained by the cache controller

      • Uses the same method as SW to pass the token to the next CPU

      • One proposal by Goodman et al. combines HW and SW to maintain the queue

    • A HW implementation, though complex, could be faster


HW Solution: Single-Bus CPUs

  • A single bus has a ping-pong problem, with constant invalidations even when the lock isn't available

    • Largely due to the “atomic” nature of RMW lock instructions

  • Minimize invalidations by restricting them to the cases where the value has really changed

    • Makes sense, and solves the problem while spinning on read

    • However, there is still an invalidation when the lock is finally released

      • The cache miss by each spinning CPU, and the further failed TSLs, consume BW

        • Time to quiesce is reduced but not fully eliminated

  • Special handling of read requests by improving the snooping and coherence protocol

    • Broadcast on a read, which could eliminate duplicate read misses

      • The first read after an invalidation (such as making the lock available) fulfills the further read requests for the same lock

        • Requires implementing fully distributed write coherence

  • Special handling of test-and-set requests in the cache and bus controllers

    • If it doesn't increase bus or cache cycle time, it should be better than SW queuing or backoff

  • None of the methods is shown to achieve ideal performance as measured and tested on Symmetry

    • The difficulty is knowing which type of atomic instruction is making a request

      • The type is only known and computed in the core

        • The cache and the bus see everything as nothing other than a “request”

      • The ability to pass such control signals along with requests could help achieve this


Summary

    • Spin locks are a common method for achieving mutually exclusive access to a shared data structure

    • Multi-core CPUs are increasingly common

      • Spin-lock performance degrades as the number of spinning CPUs increases

    • Efficient methods in both SW and HW can be used to mitigate the performance degradation

      • SW

        • SW queuing

          • Performs best at high contention

        • Ethernet style backoff

          • Performs best at low contention

      • HW

        • For multistage-interconnection-network CPUs, HW queuing/combining at a node for requests of one type could help reduce contention

        • For bus-based SMP CPUs, intelligent snooping could be implemented to reduce bus traffic

    • The recommendations for improving spin-lock performance (above) look promising

      • But only on small benchmarks

      • Benefits to “real” workloads are an open question