
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors, by T. E. Anderson


Presentation Transcript


  1. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors, by T. E. Anderson. Presented by Ashish Jha, PSU SP 2010, CS-510, 05/20/2010

  2. Agenda
  • Preview of an SMP single-bus-based system
    • Cache ($) coherence protocol and the bus
  • What is a Lock?
    • Usage and operations in a critical section (CS)
  • What is a Spin-Lock?
    • Usage and operations in a CS
  • Problems with spin-locks on SMP systems
  • Methods to improve spin-lock performance in both SW & HW
  • Summary

  3. Preview: SMP Arch
  • Shared bus
  • Coherent, consistent and contended memory
  • Snoopy invalidation-based cache coherence protocol
    • Guarantees atomicity of a memory operation
  • Sources of contention
    • Bus
    • Memory modules
  [Diagram: N CPUs, each with a private L1D and last-level cache (LN$), on a shared bus (BSQ). T1: a CPU loads M1 and its line becomes Exclusive; T2: a second CPU loads M1 and both copies become Shared; T3: another CPU stores to M1, making its line Modified and invalidating the other copies.]

  4. What is a Lock?
  • An instruction defined and exposed by the ISA to achieve “exclusive” access to memory
  • A lock is an “atomic” RMW (read-modify-write) operation
    • The uArch guarantees atomicity, achieved via the cache coherence protocol
  • Used to implement a critical section: a block of code with “exclusive” access
  • Examples
    • TSL – Test-and-Set Lock
    • CAS – Compare-and-Swap
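As a concrete illustration (not part of the original slides), the TSL primitive described above can be approximated in portable C11: atomic_flag_test_and_set is an atomic RMW that sets the flag and returns its previous value, which is the test-and-set behavior the slide names. The type and function names below are illustrative, a minimal sketch rather than the paper's code.

  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct { atomic_flag busy; } tsl_lock_t;   /* illustrative name; initialize with ATOMIC_FLAG_INIT */

  /* Atomic RMW: set the flag to BUSY and report whether it was CLEAN before.
     Returns true if we acquired the lock. */
  static bool tsl_try_acquire(tsl_lock_t *l) {
      return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
  }

  static void tsl_release(tsl_lock_t *l) {
      atomic_flag_clear_explicit(&l->busy, memory_order_release);
  }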

  5. Lock Operation: Reg = TSL [M1]
  • If the line holding M1 is already Exclusive/Modified in the local cache, the RMW completes locally (no bus transaction)
  • Otherwise it is a local cache miss: a bus transaction invalidates the line in other CPUs’ caches (remote miss) or fetches it from memory, and M1 ends up Modified in the local cache
  • If M1 was CLEAN: set M1=BUSY, Reg=CLEAN – GOT LOCK!
  • If M1 was already BUSY: M1 is written BUSY again, Reg=BUSY – NO LOCK

  6. Critical Section using Lock
  • Simple, intuitive and elegant
  Reg = TSL [M1]     //[M1]=BUSY
  Reg = CLEAN?       //Got Lock
  Execute CS
  [M1] = CLEAN       //Un-Lock

  7. Critical Section using Spin-Lock
  • Spin on Test-and-Set
  • Yet again, simple, intuitive and elegant
  Spin-Lock: Reg = TSL [M1]    //[M1]=BUSY
  Reg = ?  BUSY: spin (retry the TSL).  CLEAN: got the lock
  Execute CS
  [M1] = CLEAN       //Un-Lock
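A minimal C11 sketch of the spin-on-test-and-set critical section on slides 6-7, assuming a single global lock word; names are illustrative. Note that every iteration of the loop is an atomic RMW, which is exactly what causes the traffic problems discussed on the following slides.

  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;        /* clear = CLEAN, set = BUSY */

  void spin_lock_critical_section(void) {
      /* Spin on test-and-set: retry the atomic RMW until it returns CLEAN. */
      while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
          ;                                          /* Reg = BUSY: keep spinning */

      /* ... execute the critical section ... */

      atomic_flag_clear_explicit(&lock, memory_order_release);   /* [M1] = CLEAN (un-lock) */
  }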

  8. Problem with Spin-Lock?
  • A lock is an RMW operation
    • A “simple?” store op
  • Works well from uniprocessor up to few-core environments… next slide…
  (Same spin-on-test-and-set loop as slide 7.)

  9. Spin-Lock in Many-Core Env.
  • Severe contention on the bus, with traffic from
    • Snoops
    • Invalidations
    • Regular requests
  • Contended memory module
    • Data requested by different CPUs residing in the same module
  [Diagram: CPU 0 acquires the lock with TSL[M1] at T1; other CPUs spin with TSL[M1] at T2/T3 while CPU 0 tries to release ([M1]=CLEAN at T3) and one CPU issues unrelated requests to M2; the M1 line ping-pongs between Modified and Invalid across the L1D/LN$ caches and requests queue up at the bus (BSQ).]

  10. Spin-Lock in Many-Core Env. Cont’d
  • An avalanche effect on bus & memory-module contention with
    • More CPUs – impacts scalability
    • More snoop and coherency traffic, with a ping-pong effect on locks – unsuccessful test-and-sets and invalidations
    • More starvation – the lock has been released, but acquiring it is delayed further by contention on the bus
    • Requests conflicting in the same memory module
    • Top it off with SW bugs – locks and/or regular requests conflicting in the same cache line
  • Suppose lock latency were 20 core clocks; the bus runs as much as 10x slower, so the latency to acquire the lock could grow by 10x (to roughly 200 core clocks) or more
  [Diagram: same scenario as slide 9 – spinning TSLs keep invalidating the M1 line across all caches while CPU 0 tries to release it.]

  11. A better Spin-Lock
  • Spin on Read (Test-and-Test-and-Set)
  • A bit better, as long as the lock is not modified while spinning on the cached value
  • Doesn’t hold up for long as the number of CPUs is scaled
    • Same set of problems as before – lots of invalidations due to TSL
  Flow: spin on lock RD while [M1]=BUSY; when it reads CLEAN, Reg = TSL [M1]; if Reg=BUSY, go back to spinning on the read; if Reg=CLEAN, got the lock; Execute CS; [M1] = CLEAN //Un-Lock
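A C11 sketch of spin-on-read (test-and-test-and-set), assuming an integer lock word so the waiting loop can be a plain load of the locally cached value; the atomic RMW is attempted only when the lock looks CLEAN. Names and memory orders are illustrative, not from the paper.

  #include <stdatomic.h>

  static atomic_int lock = 0;                        /* 0 = CLEAN, 1 = BUSY */

  void ttas_acquire(void) {
      for (;;) {
          /* Spin on read: hits the local cache, no bus traffic while BUSY. */
          while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
              ;
          /* Lock looked CLEAN: now try the atomic test-and-set. */
          if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
              return;                                /* got the lock */
          /* Another CPU won the race; its TSL invalidated our copy – spin again. */
      }
  }

  void ttas_release(void) {
      atomic_store_explicit(&lock, 0, memory_order_release);
  }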

  12. Verify through Tests
  • Spin-lock latency and performance with small and large amounts of contention
  • Setup: 20-CPU Symmetric Model B SMP; WB-invalidate caches; shared bus (the same bus for lock and regular requests); lock acquire-release = 5.6 usec; metric is the elapsed time for the CPUs to execute the CS 1M times; each CPU loops: wait for lock, do CS, release, then delay for a randomly selected time
  • Results confirm
    • Sharp degradation in performance for spin on test-and-set as the number of CPUs is scaled
    • Spin on read is slightly better
    • Both methods degrade badly (scale poorly) as CPUs are added
    • Peak performance is never reached – time to quiesce is almost linear in CPU count, hurting communication BW
  [Figures copied from the paper: benchmark elapsed time vs. number of CPUs, and time to quiesce for spin on read (usec).]

  13. What can be done? Can spin-lock performance be improved by
  • SW – is there an efficient algorithm for busy locks?
  • HW – is more complex HW needed?

  14. SW Impr. #1a: Delay TSL
  • From “spin on test and test-and-set” to “spin on test and delay before the test-and-set”
  • By delaying the TSL, reduce the number of invalidations and bus contention
  • The delay could be set
    • Statically – delay slots for each processor, which could be prioritized
    • Dynamically – as in CSMA networks – exponential back-off
  • Performance is good with
    • Short delay and few spinners
    • Long delay and many spinners
  Flow: spin on lock RD while [M1]=BUSY; when CLEAN, DELAY before the TSL; Reg = TSL [M1]; if BUSY, go back to spinning; if CLEAN, got the lock; Execute CS; [M1] = CLEAN //Un-Lock
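A sketch of the dynamic (exponential back-off) variant in C11, building on the test-and-test-and-set lock above; the initial delay, the cap, and the crude busy-wait loop are illustrative placeholders for platform-specific tuning, not values from the paper.

  #include <stdatomic.h>

  static atomic_int lock = 0;                        /* 0 = CLEAN, 1 = BUSY */

  static void cpu_delay(unsigned iters) {
      for (volatile unsigned i = 0; i < iters; i++)
          ;                                          /* crude busy-wait; a real port would use a pause hint */
  }

  void backoff_acquire(void) {
      unsigned delay = 8;                            /* illustrative starting back-off */
      for (;;) {
          while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
              ;                                      /* spin on read while BUSY */
          cpu_delay(delay);                          /* DELAY before the test-and-set */
          if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
              return;                                /* got the lock */
          if (delay < 1024)
              delay *= 2;                            /* dynamic, CSMA-style exponential back-off */
      }
  }

Delaying only after a failed test-and-set, rather than before every attempt, is a common refinement that avoids penalizing the uncontended case.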

  15. SW Impr. #1b: Delay after each Lock access
  • Delay after each lock access, i.e. check the lock less frequently
    • TSL – fewer misses due to invalidation, less bus contention
    • Lock RD – fewer misses due to invalidation, less bus contention
  • Good for architectures with no caches, where continuous polling would otherwise overflow the communication (bus, NW) BW
  Flow: on each lock RD that finds [M1]=BUSY, (1) DELAY after the lock RD before the TSL, and (2) DELAY after a failed TSL; Reg = TSL [M1]; if CLEAN, got the lock; Execute CS; [M1] = CLEAN //Un-Lock
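A sketch of delaying after each lock access, useful on architectures without per-CPU caches where every poll is a bus or network transaction; the poll interval and helper names are illustrative assumptions.

  #include <stdatomic.h>

  static atomic_int lock = 0;                        /* 0 = CLEAN, 1 = BUSY */
  enum { POLL_INTERVAL = 200 };                      /* illustrative pause between lock accesses */

  static void cpu_delay(unsigned iters) {
      for (volatile unsigned i = 0; i < iters; i++)
          ;
  }

  void delayed_poll_acquire(void) {
      for (;;) {
          /* Delay after each lock read: check the lock less frequently. */
          while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
              cpu_delay(POLL_INTERVAL);
          if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
              return;                                /* got the lock */
          cpu_delay(POLL_INTERVAL);                  /* delay after a failed test-and-set too */
      }
  }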

  16. SW Impr. #2: Queuing
  • To resolve contention: delay uses time, a queue uses space
  • Queue implementation
    • Basic: allocate a slot for each waiting CPU in a queue; requires insertion and deletion – atomic ops; not good for a small CS
    • Efficient: each CPU gets a unique seq# – one atomic op; on completing the lock, the current CPU activates the one with the next seq# – no atomic op
  • Diagram notes: 0 to (N-1) slots – a slot for each CPU, each in a separate cache line; CPUs spin on their own slot, so continuous polling causes no coherence traffic; no atomic TSL while waiting – the lock is your slot (a slot per lock); CPU 0 gets the lock and, on unlocking, passes the token to another CPU (e.g. 5), which requires an atomic TSL on that slot and some criterion for “another”, e.g. priority or FIFO
  • Queue performance
    • Works well (offers low contention) for bus-based architectures and NW-based architectures with invalidation
    • Less valuable for bus-based architectures with no caches, as there is still contention on the bus from each CPU’s polling
    • Increased lock latency under low contention, due to the overhead of attaining the lock
    • Preemption of the CPU holding the lock could further starve the CPUs waiting on it – pass the token before switching out
    • A centralized queue becomes a bottleneck as the number of CPUs increases – solutions include dividing the queue between nodes, etc.
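A C11 sketch of the “efficient” array-based queue lock described above: one atomic fetch-and-increment per acquire, each CPU spinning on its own padded slot, and a release that needs no atomic RMW. The slot count, padding, and names are illustrative assumptions; the array must have at least as many slots as concurrently contending CPUs.

  #include <stdatomic.h>

  #define NCPUS 32                                   /* illustrative upper bound on contenders */
  #define CACHE_LINE 64

  struct slot {
      atomic_int has_lock;
      char pad[CACHE_LINE - sizeof(atomic_int)];     /* one slot per cache line, so spinning stays local */
  };

  static struct slot slots[NCPUS];                   /* zero-initialized: nobody holds the lock yet */
  static atomic_uint next_seq = 0;

  void queue_lock_init(void) {
      atomic_store(&slots[0].has_lock, 1);           /* the first seq# acquires immediately */
  }

  unsigned queue_acquire(void) {
      /* One atomic op: take a unique sequence number. */
      unsigned my = atomic_fetch_add_explicit(&next_seq, 1, memory_order_relaxed) % NCPUS;
      /* Spin only on our own slot – no test-and-set while waiting. */
      while (!atomic_load_explicit(&slots[my].has_lock, memory_order_acquire))
          ;
      return my;                                     /* pass this to queue_release */
  }

  void queue_release(unsigned my) {
      atomic_store_explicit(&slots[my].has_lock, 0, memory_order_relaxed);
      /* Pass the token to the CPU holding the next sequence number. */
      atomic_store_explicit(&slots[(my + 1) % NCPUS].has_lock, 1, memory_order_release);
  }

The slide’s point that release requires no atomic op shows up here as two plain stores; call queue_lock_init once before use.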

  17. SW Impr.: Test Results
  • Setup: 20-CPU Symmetric Model B; static & dynamic delay = 0-15 usec; TSL = 1 usec; no atomic increment available, so the queue uses an explicit lock with back-off to access the seq#; each CPU loops 1M/#P times to acquire, do the CS, release and compute
  • Metric: spin-waiting overhead (sec) in executing the benchmark
  • At low CPU count (low contention): queuing has high latency due to the lock overhead
  • At high CPU count: queuing performs best; back-off performs slightly worse than static delays
  [Figure copied from the paper: spin-waiting overhead vs. number of CPUs for each alternative.]

  18. HW Solutions
  • A separate bus for lock and regular memory requests, as in the Balance
    • Regular requests follow invalidation-based cache coherence
    • Lock requests follow distributed-write-based cache coherence
  • An expensive solution, with little benefit to apps that don’t spend much time spin-waiting
  • Open question: how to manage things if the two buses are slower

  19. HW Sol. – Multistage Interconnection NW CPUs
  • NUMA-type architecture: an “SMP view” as a “combination of memory” across the “nodes”
  • Collapse all simultaneous requests for a single lock from a node into one
    • The value would be the same for all of those requests anyway
    • Saves contention BW, but the gain could be offset by the increased latency of the “combining switches”
    • Could be defeated by a normal NW with back-off or queuing
  • HW queue
    • Such as one maintained by the cache controller
    • Uses the same method as SW to pass the token to the next CPU
    • One proposal by Goodman et al. combines HW and SW to maintain the queue
    • A HW implementation, though complex, could be faster

  20. HW Sol. – Single Bus CPUs
  • The single bus had a ping-pong problem with constant invalidations even when the lock wasn’t available – much of it due to the “atomic” nature of RMW lock instructions
  • Minimize invalidations by restricting them to when the value has really changed
    • Makes sense and solves the problem when spinning on read
    • However, there would still be invalidations when the lock is finally released: the cache miss by each spinning CPU and the further failed TSLs consume BW
    • Time to quiesce is reduced but not fully eliminated
  • Special handling of read requests by improving the snooping and coherence protocol
    • Broadcast on a read, which could eliminate duplicate read misses – the first read after an invalidation (such as making the lock available) fulfills further read requests on the same lock
    • Requires implementing fully distributed write-coherence
  • Special handling of test-and-set requests in the cache and bus controllers
    • If it doesn’t increase bus or cache cycle time, it should be better than SW queuing or back-off
  • None of the methods achieve ideal performance as measured and tested on the Symmetry
  • The difficulty is knowing the type of atomic instruction making a request
    • The type is only known and computed in the core; the cache and the bus see everything as nothing more than a “request”
    • The ability to pass such control signals along with requests could help achieve the purpose

  21. Summary
  • Spin locks are a common method to achieve mutually exclusive access to a shared data structure
  • Multi-core CPUs are increasingly common, and spin-lock performance degrades as the number of spinning CPUs increases
  • Efficient methods in both SW and HW can be implemented to salvage the performance degradation
    • SW
      • SW queuing – performs best at high contention
      • Ethernet-style back-off – performs best at low contention
    • HW
      • For multistage-NW CPUs, HW queuing at a node to combine requests of one type could help save contention
      • For SMP bus-based CPUs, intelligent snooping could be implemented to reduce bus traffic
  • The recommendations for spin-lock performance (as above) look promising on small benchmarks; the benefit to “real” workloads is an open question
