
The Perf. of Spin Lock Alternatives for Shared-Memory Multiprocessors
By T. E. Anderson

Presented by Ashish Jha

PSU SP 2010 CS-510

05/20/2010

Agenda
  • Preview of an SMP single-bus-based system
    • $ (cache) protocol and the Bus
  • What is a Lock?
    • Usage and operations in a CS
  • What is Spin-Lock?
    • Usage and operations in a CS
  • Problems with Spin-Locks on SMP systems
  • Methods to improve Spin-Lock performance in both SW & HW
  • Summary
Preview: SMP Arch
  • Shared Bus
    • Coherent, Consistent and Contended Memory
  • Snoopy Invalidation based Cache Coherence Protocol
    • Guarantees Atomicity of a memory operation
  • Sources of Contention
    • Bus
    • Memory Modules

[Diagram: CPU 0 … CPU N, each with an L1D and a lower-level cache (LN$), share one bus (BSQ) to memory. Example accesses to a line M1 – T1: LD reg=[M1], T2: LD reg=[M1] (on another CPU), T3: ST [M1]=reg – move the line through the Invalid / Shared / Exclusive / Modified coherence states in the caches involved.]

What is a Lock?
  • Instruction defined and exposed by the ISA
    • To achieve “Exclusive” access to memory
  • A Lock is an “Atomic” RMW (Read-Modify-Write) operation
    • uArch guarantees Atomicity
      • achieved via Cache Coherence Protocol
  • Used to implement a Critical Section
    • A block of code with “Exclusive” access
  • Examples
    • TSL – Test-and-Set Lock
    • CAS – Compare-and-Swap
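A minimal sketch (not from the slides) of what these two RMW primitives look like from software, assuming C11 <stdatomic.h>; the names tsl() and cas_acquire() are illustrative:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag m1_flag = ATOMIC_FLAG_INIT;   // lock word for TSL: clear = CLEAN, set = BUSY
    _Atomic int  m1_word = 0;                 // lock word for CAS: 0 = CLEAN, 1 = BUSY

    // TSL: atomically set the flag and return its previous value
    // (true means it was already BUSY).
    bool tsl(void) {
        return atomic_flag_test_and_set(&m1_flag);
    }

    // CAS: atomically change 0 -> 1; returns true only if the exchange
    // happened (i.e. we acquired the lock).
    bool cas_acquire(void) {
        int expected = 0;
        return atomic_compare_exchange_strong(&m1_word, &expected, 1);
    }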
Lock Operation

[Flowchart of Reg = TSL [M1]:
  If the $ line is already Exclusive/Modified locally, no Bus Tx is needed.
  Otherwise it is a local $ miss → Bus Tx: depending on whether it is also a remote $ miss, either the other CPU's $ line or the memory copy of the line is invalidated; M1 ends up in the local $ in the Modified state.
  If M1 was CLEAN: Set M1=BUSY, Reg=CLEAN → GOT LOCK!
  Else: Reg=BUSY → NO LOCK.]

Critical Section using Lock
  • Simple, Intuitive and Elegant

Reg = TSL [M1]            // [M1]=BUSY after the TSL
if Reg = CLEAN:           // Got Lock
    Execute CS
    [M1] = CLEAN          // Un-Lock
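A minimal sketch of the same try-once flow in code, assuming C11 atomics; the shared counter and names are illustrative:

    #include <stdatomic.h>

    atomic_flag m1 = ATOMIC_FLAG_INIT;    // [M1]: clear = CLEAN, set = BUSY
    int shared_counter;                   // data guarded by the CS

    void try_once(void) {
        if (!atomic_flag_test_and_set(&m1)) {   // Reg = TSL [M1]; was CLEAN?
            shared_counter++;                   // Execute CS
            atomic_flag_clear(&m1);             // [M1] = CLEAN, Un-Lock
        }
        // else: Reg = BUSY, we did not get the lock this time
    }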

Critical Section using Spin-Lock
  • Spin on Test-and-Set
  • Yet again, Simple, Intuitive and Elegant

Spin-Lock:
    Reg = TSL [M1]        // [M1]=BUSY after the TSL
    if Reg = BUSY: repeat the TSL (spin)
    // Reg = CLEAN: Got Lock
Execute CS
[M1] = CLEAN              // Un-Lock
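A minimal C sketch of this spin-on-test-and-set lock, assuming C11 atomics; names are illustrative:

    #include <stdatomic.h>

    atomic_flag m1 = ATOMIC_FLAG_INIT;    // [M1]: clear = CLEAN, set = BUSY

    void spin_lock(void) {
        // Spin on TSL: every failed iteration is still an atomic RMW on [M1].
        while (atomic_flag_test_and_set(&m1)) {
            /* Reg = BUSY: keep spinning */
        }
        // Reg = CLEAN: got the lock
    }

    void spin_unlock(void) {
        atomic_flag_clear(&m1);           // [M1] = CLEAN
    }

Every failed test-and-set here is still a bus RMW, which is exactly the contention problem the following slides walk through.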

Problem with Spin-Lock?
  • A Lock is an RMW operation
    • A “simple?” Store op
  • Works well for UP (uniprocessor) up to few-core environments … next slide …

[Flow repeated from the previous slide: spin on Reg = TSL [M1] while Reg = BUSY; on CLEAN, Execute CS, then [M1] = CLEAN to Un-Lock.]

Spin-Lock in Many-Core Env.
  • Severe Contention on the Bus, with Traffic from
    • Snoops
    • Invalidations
    • Regular Requests
  • Contended Memory module
    • Data requested by different CPUs residing in the same module

[Diagram: CPU 0 acquires the lock (T1: TSL[M1]) while CPU 1 starts spinning (T2: TSL[M1]). At T3, CPU 0 releases ([M1]=CLEAN), CPUs N-2 and N spin with TSL[M1], and other CPUs issue unrelated requests (TSL[M2], reg=[M2]). The competing TSLs keep M1 bouncing between Modified and Invalid in the L1D/LN$ caches, and requests pile up in the bus queue (BSQ, Q0–Q3) and at the memory modules.]

Spin-Lock in Many-Core Env. Cont’d
  • An avalanche effect on Bus & Mem Module contention with
    • a larger # of CPUs – impacts scalability
      • More snoop and coherency traffic, with a ping-pong effect on locks – unsuccessful test&sets and invalidations
      • More starvation – the lock has been released, but re-acquiring it is further delayed by contention on the bus
    • Requests conflicting in the same mem module
    • Top it off with SW bugs
      • Locks and/or regular requests conflicting in the same CL (cache line)
  • Suppose lock latency is 20 Core Clks
    • The Bus runs as much as 10x slower
      • So the latency to acquire the lock could grow to 10x as many Core clks, or more

[Diagram repeated from the previous slide: the spinning TSL[M1] requests from multiple CPUs keep the lock's line ping-ponging between Modified and Invalid across the L1D/LN$ caches while the bus queue (BSQ) fills with snoop, invalidation and regular traffic.]

A better Spin-Lock
  • Spin on Read (Test-and-Test-and-Set)
    • A bit better, as long as the Lock is not modified while spinning on the cached value
      • Doesn’t hold up as the # of CPUs is scaled
        • Same set of problems as before – lots of invalidations due to TSL

Spin on Lock RD:
    while [M1] = BUSY: spin on the cached copy (read only)
Reg = TSL [M1]            // [M1]=BUSY after the TSL
if Reg = BUSY: go back to spinning on the Lock RD
// Reg = CLEAN: Got Lock
Execute CS
[M1] = CLEAN              // Un-Lock
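A minimal C sketch of this spin-on-read (test-and-test-and-set) lock, assuming C11 atomics; names are illustrative:

    #include <stdatomic.h>

    _Atomic int m1 = 0;                      // 0 = CLEAN, 1 = BUSY

    void ttas_lock(void) {
        for (;;) {
            // Test: spin on a plain read, which hits the local cache
            // until the holder's store invalidates the line.
            while (atomic_load(&m1) != 0) { /* spin */ }
            // Test-and-set: only now issue the bus RMW.
            int expected = 0;
            if (atomic_compare_exchange_strong(&m1, &expected, 1))
                return;                      // got the lock
        }
    }

    void ttas_unlock(void) {
        atomic_store(&m1, 0);                // [M1] = CLEAN
    }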

Verify through Tests
  • Spin Lock latency and perf with small and large amounts of contention
  • The results confirm
    • Sharp degradation in perf for spin on test-and-set as the # of CPUs is scaled
      • Spin on read is slightly better
    • Both methods degrade badly (scale poorly) as CPUs are added
      • Peak perf is never reached – time to quiesce is almost linear with CPU count, hurting communication BW

[SOURCE: Figures copied from the paper – spin-lock performance under contention, and time to quiesce for spin on read (usec).]

  • 20-CPU Symmetry Model B SMP
    • WB-Invalidate (write-back, invalidation-based) $
    • Shared Bus – the same bus carries Lock and regular requests
  • Lock acquire-release = 5.6 usec
  • Measured: elapsed time for the CPUs to execute the CS 1M times
    • Each CPU loops: wait for the lock, do the CS, release, then delay for a randomly selected time
What can be done?
  • Can Spin-Lock performance be improved by
    • SW
      • Any efficient algorithm for busy locks?
    • HW
      • Any more complex HW needed?
SW Impr. #1a: Delay TSL

Compared: Spin on Test -and- Test-and-Set Spin-Lock vs. Spin on Test -and- Delay before Test-and-Set Spin-Lock

  • By delaying the TSL
    • Reduces the # of Invalidations and Bus Contention
  • The Delay could be set
    • Statically – delay slots for each processor, could be prioritized
    • Dynamically – as in CSMA networks – exponential back-off (see the sketch after this slide)
  • Performance good with
    • Short delay and few spinners
    • Long delay and many spinners

[Two flows compared. Spin on Test-and-Test-and-Set: spin on the Lock RD while [M1]=BUSY, then Reg = TSL [M1] (which sets [M1]=BUSY); if Reg=BUSY go back to spinning on the read, if Reg=CLEAN the lock is acquired, Execute CS, then [M1] = CLEAN to Un-Lock. With Delay: identical, except a DELAY is inserted after the read sees the lock free and before the TSL, so fewer CPUs hit the bus with failing TSLs at the instant of release.]
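A minimal C sketch of the dynamic variant (exponential back-off before the TSL), assuming C11 atomics; the delay loop, starting value and cap are illustrative:

    #include <stdatomic.h>

    _Atomic int m1 = 0;                       // 0 = CLEAN, 1 = BUSY

    static void delay_cycles(unsigned n) {
        for (volatile unsigned i = 0; i < n; i++) { }   // crude busy-wait
    }

    void backoff_lock(void) {
        unsigned delay = 16;                  // illustrative starting back-off
        for (;;) {
            while (atomic_load(&m1) != 0) { } // spin on Lock RD
            delay_cycles(delay);              // DELAY before the TSL
            int expected = 0;
            if (atomic_compare_exchange_strong(&m1, &expected, 1))
                return;                       // got the lock
            if (delay < 16384) delay *= 2;    // TSL failed: back off harder (CSMA-style)
        }
    }

    void backoff_unlock(void) {
        atomic_store(&m1, 0);                 // [M1] = CLEAN
    }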

SW Impr. #1b: Delay after ea. Lock access
  • Delay after each Lock access
    • Check lock less frequently
      • TSL – fewer misses due to invalidation, less bus contention
      • Lock RD – fewer misses due to invalidation, less bus contention
  • Good for architectures with no caches
    • Avoids overflowing the communication (Bus, NW) BW (see the sketch after this slide)

Compared: Spin on Test -and- Test-and-Set Spin-Lock vs. Delay on Test -and- Delay on Test-and-Set Spin-Lock

[Two flows compared. The left flow is the spin-on-read lock from before: spin on the Lock RD while [M1]=BUSY, then Reg = TSL [M1], retry on BUSY, Execute CS on CLEAN, [M1] = CLEAN to Un-Lock. The right flow inserts a DELAY (1) after each Lock RD and (2) after each failed TSL, so the lock is polled less frequently; the acquire / CS / Un-Lock steps are otherwise unchanged.]
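A minimal C sketch of delaying after each lock access (the variant aimed at architectures without caches to absorb the reads), assuming C11 atomics; the fixed delay is illustrative:

    #include <stdatomic.h>

    _Atomic int m1 = 0;                       // 0 = CLEAN, 1 = BUSY

    static void delay_cycles(unsigned n) {
        for (volatile unsigned i = 0; i < n; i++) { }
    }

    void polled_lock(void) {
        for (;;) {
            while (atomic_load(&m1) != 0)
                delay_cycles(64);             // 1. DELAY after each Lock RD
            int expected = 0;
            if (atomic_compare_exchange_strong(&m1, &expected, 1))
                return;                       // got the lock
            delay_cycles(64);                 // 2. DELAY after a failed TSL
        }
    }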

SW Impr. #2: Queuing
  • To resolve contention
    • Delay uses time
    • Queue uses space
  • Queue Implementation
    • Basic
      • Allocate slot for each waiting CPU in a queue
        • Requires insertion and deletion – atomic op’s
          • Not good for small CS
    • Efficient
      • Each CPU gets a unique seq# – atomic op
      • On completing the CS, the current CPU activates the one with the next seq# – no atomic op (see the sketch after this slide)
  • Q Performance
      • Works well (offers low contention) for bus based arch and NW based arch with invalidation
      • Less valuable for bus-based arch with no caches, as there is still contention on the bus from polling by each CPU
      • Increased Lock latency under low contention, due to the overhead of attaining the lock
      • Preemption of the CPU holding the lock could further starve the CPUs waiting on the lock – pass the token before switching out
      • A centralized Q becomes a bottleneck as the # of CPUs increases – solutions include dividing the Q between nodes, etc.

  • 0 to (N-1) Q slots
    • //slot for each CPU, in a separate CL
    • //continuous polling, no coherence Traffic
  • Each CPU spins on its own slot
    • //no atomic TSL – the Lock is your slot
    • //a set of slots for each Lock!
  • CPU (0) gets the lock; on unlocking, CPU (0) passes the token to another CPU (e.g. 5)
    • //requires an atomic TSL on that slot
    • //some criteria for “another”, e.g. priority, FIFO
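A minimal C sketch of such a queue lock (each CPU spins on its own cache-line-sized slot; the unique seq# comes from one atomic fetch-and-add). NCPUS, the 64-byte line size and the FIFO hand-off by plain store are illustrative assumptions; the slide's variant uses an atomic TSL on the target slot instead.

    #include <stdatomic.h>

    #define NCPUS 32                          // assumed max # of contending CPUs
    #define CL    64                          // assumed cache-line size

    // One flag per slot, padded so each slot sits in its own CL.
    static struct { _Atomic int has_lock; char pad[CL - sizeof(_Atomic int)]; }
        slot[NCPUS] = { [0] = { .has_lock = 1 } };   // slot 0 starts holding the free token

    static _Atomic unsigned next_seq = 0;     // one atomic op per acquire

    unsigned queue_lock(void) {
        unsigned me = atomic_fetch_add(&next_seq, 1) % NCPUS;  // my unique seq#
        while (atomic_load(&slot[me].has_lock) == 0) { }       // spin on my own slot
        atomic_store(&slot[me].has_lock, 0);  // consume the token so the slot is reusable
        return me;                            // remember the slot for the release
    }

    void queue_unlock(unsigned me) {
        // Pass the token to the next seq# (FIFO) – a plain store, no atomic RMW.
        atomic_store(&slot[(me + 1) % NCPUS].has_lock, 1);
    }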
SW Impr.: Test Results
  • 20-CPU Symmetry Model B
  • Static & Dynamic Delay = 0-15 usec
  • TSL = 1 usec
  • No atomic increment, so the Q uses an explicit lock w/ backoff to access the seq #
  • Each CPU loops 1M/#P times to acquire, do the CS, release and compute
    • Metric: spin-waiting overhead (sec) in executing the benchmark
  • At low CPU count (low contention)
    • Queue has high latency due to lock overhead
  • At high CPU count
    • Queue performs best
    • back-off performs slightly worse than static delays

SOURCE: Figure copied from paper

HW Solutions
  • Separate Bus for Lock and Regular memory requests
    • As in Balance
      • Regular req follows invalidation based $ coherence
      • Lock req follows distributed-write based $ coherence
  • Expensive solution
    • Little benefit to Apps which don’t spend much time spin-waiting
    • How to manage it if the two buses are slower?
HW Sol. – Multistage interconnect NW CPU
  • NUMA type of arch
    • “SMP view” as a “combination of memory” across the “nodes”
  • Collapse all simultaneous req’s for a single lock from a node into one
    • The value would be the same for all of those requests anyway
    • Saves contention BW
      • But performance could be offset by increased latency of “combining switches”
        • Could be defeated by normal NW with backoff or queuing
  • HW queue
    • Such as maintained by the cache controller
      • Uses same method as SW to pass token to next CPU
      • One proposal by Goodman et al. combines HW and SW to maintain the queue
    • A HW implementation, though complex, could be faster
HW Sol. – Single Bus CPU
  • The Single Bus had a ping-pong problem, with constant invalidations even when the lock wasn’t available
    • Largely due to the “atomic” nature of RMW Lock instructions
  • Minimize invalidations by restricting them to the cases where the value has really changed
    • makes sense and solves problem when spinning on read
    • However, there would still be invalidations when the lock is finally released
      • A cache miss by each spinning CPU and the further failed TSLs consume BW
        • Time to quiesce reduced but not fully eliminated
  • Special handling of Read requests by improving snooping and coherence protocol
      • Broadcast on a Read which could eliminate duplicate read misses
        • The first read after an invalidation (such as making the lock available) will fulfill further read requests on the same lock
          • Requires implementing fully distributed write-coherence
  • Special handling of test-and-set requests in cache and bus controllers
    • If it doesn’t increase bus or cache cycle time, then it should be better than SW queuing or backoff
  • None of the methods achieves ideal perf as measured and tested on the Symmetry
    • The difficulty is knowing the type of atomic instruction making a request
      • The type is only known and computed in the Core
        • The Cache and the Bus see everything as nothing other than a “request”
      • Ability to pass such control signals along with requests could help achieve the purpose
Summary
  • Spin Locks are a common method to achieve mutually exclusive access to a shared data structure
  • Multi-Core CPUs are increasingly common
    • Spin Lock performance degrades as the # of spinning CPUs increases
  • Efficient methods in both SW & HW could be implemented to mitigate the performance degradation
    • SW
      • SW queuing
        • Performs best at high contention
      • Ethernet style backoff
        • Performs best at low contention
    • HW
      • For multi-stage NW CPU, HW queuing at a node to combine requests of one type could help save contention
      • For SMP bus based CPU, intelligent snooping could be implemented to reduce bus traffic
  • The recommendations for improving Spin-Lock performance (above) look promising
    • Albeit on small benchmarks
    • The benefit to “real” workloads is an open question