
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
By T. E. Anderson

Presented by Ashish Jha

PSU SP 2010 CS-510

05/20/2010


Agenda

  • Preview of an SMP single-bus-based system

    • Cache ($) protocol and the bus

  • What is a Lock?

    • Usage and operations in a CS

  • What is Spin-Lock?

    • Usage and operations in a CS

  • Problems with Spin-Locks on SMP systems

  • Methods to improve Spin-Lock performance in both SW & HW

  • Summary


Preview: SMP Arch

  • Shared Bus

    • Coherent, Consistent and Contended Memory

  • Snoopy Invalidation based Cache Coherence Protocol

    • Guarantees Atomicity of a memory operation

  • Sources of Contention

    • Bus

    • Memory Modules

[Diagram: CPUs 0..N, each with a private L1D and last-level cache (LN$), connected by a shared bus (BSQ). T1: a LD of [M1] brings the line into one cache in Exclusive state. T2: a LD by another CPU makes the line Shared in both. T3: a ST to [M1] makes the line Modified in the writer's cache and Invalid in the others – the invalidation-based coherence protocol at work.]


What is a Lock?

  • An instruction defined and exposed by the ISA

    • To achieve “exclusive” access to memory

  • A lock is an “atomic” RMW operation

    • The uArch guarantees atomicity

      • achieved via the cache coherence protocol

  • Used to implement a Critical Section

    • A block of code with “Exclusive” access

  • Examples

    • TSL – Test-and-Set-Lock

    • CAS – Compare-and-Swap
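As an illustrative aside (not from the slides), both primitives map directly onto C11 atomics; the helper names try_tsl and try_cas below are hypothetical, a minimal sketch only:

#include <stdatomic.h>
#include <stdbool.h>

/* TSL: atomically write BUSY (set) and report whether it was already BUSY. */
static bool try_tsl(atomic_flag *lock)
{
    /* returns false if the flag was clear (CLEAN), i.e. we just acquired the lock */
    return atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

/* CAS: atomically replace *word with `desired` only if it still equals `expected`. */
static bool try_cas(atomic_int *word, int expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        word, &expected, desired,
        memory_order_acquire, memory_order_relaxed);
}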


Lock Operation

[Flowchart: Reg = TSL [M1]

  • If the $ line is not already Exclusive/Modified locally, the TSL is a local $ miss and causes a bus transaction: the line is invalidated in the other CPU's cache (remote $ hit) or in memory, and M1 ends up in the local $ in Modified state

  • The TSL always writes BUSY to M1

    • If M1 was CLEAN: Reg=CLEAN – GOT LOCK!

    • If M1 was already BUSY: Reg=BUSY – NO LOCK]


Critical Section Using a Lock

  • Simple, Intuitive and Elegant

Reg = TSL [M1]        // [M1] = BUSY
if Reg == CLEAN       // got the lock
    Execute CS
    [M1] = CLEAN      // un-lock
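A minimal C sketch of this one-shot (try once) pattern, assuming a C11 atomic_flag as the lock word; purely illustrative:

#include <stdatomic.h>

static atomic_flag M1 = ATOMIC_FLAG_INIT;                      /* CLEAN */

void try_critical_section(void)
{
    if (!atomic_flag_test_and_set_explicit(&M1, memory_order_acquire)) {
        /* got the lock: execute the critical section */
        /* ... CS ... */
        atomic_flag_clear_explicit(&M1, memory_order_release); /* un-lock */
    }
}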


Critical Section Using a Spin-Lock

  • Spin on Test-and-Set

  • Yet again, Simple, Intuitive and Elegant

do
    Reg = TSL [M1]    // [M1] = BUSY
while Reg == BUSY     // spin-lock: retry while BUSY
// got the lock
Execute CS
[M1] = CLEAN          // un-lock
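In C, a minimal sketch of spin on test-and-set using C11 atomics; the names spin_lock/spin_unlock and the use of atomic_flag are assumptions, not from the slides:

#include <stdatomic.h>

static atomic_flag M1 = ATOMIC_FLAG_INIT;                     /* CLEAN */

void spin_lock(void)
{
    /* every iteration is an atomic RMW (a write), hence a bus transaction */
    while (atomic_flag_test_and_set_explicit(&M1, memory_order_acquire))
        ;                                                     /* still BUSY: keep spinning */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&M1, memory_order_release);    /* [M1] = CLEAN */
}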


Problem with Spin-Locks

  • A lock is an RMW operation

    • A “simple?” store op

  • Works well for uniprocessor (UP) to few-core environments… see next slide…



Spin-Lock in a Many-Core Environment

  • Severe Contention on the Bus, with Traffic from

    • Snoops

    • Invalidations

    • Regular Requests

  • Contended Memory module

    • Data requested by different CPUs residing in the same module

[Diagram: CPUs 0..N on the shared bus (BSQ), each with its own L1D and LN$. T1: CPU 0 acquires the lock with TSL[M1]. T2: CPU 1 spins with TSL[M1]. T3: the holder releases ([M1]=CLEAN) while CPUs N-2 and N-1 keep spinning with TSL[M1] and CPU N issues unrelated requests to [M2]; the lock line ping-pongs between Modified and Invalid across the L1Ds, and the pending requests pile up in the bus/memory-module queues (Q0–Q3).]


Spin-Lock in a Many-Core Environment (Cont'd)

  • An avalanche effect on bus and memory-module contention as

    • the number of CPUs grows – impacts scalability

      • More snoop and coherence traffic, with a ping-pong effect on locks – unsuccessful test-and-sets and invalidations

      • More starvation – the lock has been released, but acquiring it is delayed further by bus contention

    • Requests conflicting in the same memory module

    • Top it off with SW bugs

      • Locks and/or regular requests conflicting in the same cache line (CL)

  • Suppose lock latency was 20 core clocks

    • The bus runs as much as 10x slower

      • Now the latency to acquire the lock could grow to 10x as many core clocks, or more



A Better Spin-Lock

  • Spin on Read (Test-and-Test-and-Set)

    • A bit better, as long as the lock is not modified while spinning on the cached value

      • Doesn't hold up as the number of CPUs scales

        • Same set of problems as before – many invalidations due to the TSL

do
    while [M1] == BUSY     // spin on lock RD (cached copy)
        ;
    Reg = TSL [M1]         // [M1] = BUSY
while Reg == BUSY
// got the lock
Execute CS
[M1] = CLEAN               // un-lock
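A sketch of spin on read (test-and-test-and-set) with C11 atomics; it uses an int lock word (0 = CLEAN, 1 = BUSY) because the read-spin needs a plain load. All names are illustrative:

#include <stdatomic.h>

static atomic_int M1 = 0;                                 /* 0 = CLEAN, 1 = BUSY */

void ttas_lock(void)
{
    for (;;) {
        /* spin on read: hits the locally cached copy, no bus traffic while BUSY */
        while (atomic_load_explicit(&M1, memory_order_relaxed) != 0)
            ;
        /* lock looked CLEAN: attempt the real TSL (atomic RMW) */
        if (atomic_exchange_explicit(&M1, 1, memory_order_acquire) == 0)
            return;                                       /* got the lock */
    }
}

void ttas_unlock(void)
{
    atomic_store_explicit(&M1, 0, memory_order_release);  /* [M1] = CLEAN */
}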


Verify Through Tests

  • Spin-lock latency and performance with small and large amounts of contention

  • Results confirm

    • Sharp degradation in performance for spin on test-and-set as the number of CPUs is scaled

      • Spin on read is slightly better

    • Both methods degrade badly (scale poorly) as CPUs are added

      • Peak performance is never reached – time to quiesce is almost linear in CPU count, hurting communication BW

[Figure: time to quiesce for spin on read (µsec). SOURCE: figures copied from the paper.]

  • 20-CPU Sequent Symmetry Model B SMP

    • Write-back invalidation-based caches

    • Shared bus – the same bus for lock and regular requests

  • Lock acquire + release = 5.6 usec

  • Measured: elapsed time for the CPUs to execute the CS 1M times

    • Each CPU loops: wait for the lock, do the CS, release, then delay for a randomly selected time


What Can Be Done?

  • Can Spin-Lock performance be improved by

    • SW

      • Any efficient algorithm for busy locks?

    • HW

      • Any more complex HW needed?


SW Improvement #1a: Delay the TSL

Two flows compared: the spin on Test-and-Test-and-Set spin-lock, and the spin on Test, Delay, then Test-and-Set spin-lock (shown below the bullets)

  • By delaying the TSL

    • Reduce the number of invalidations and bus contention

  • The delay could be set

    • Statically – delay slots for each processor, which could be prioritized

    • Dynamically – as in CSMA networks – exponential back-off

  • Performance good with

    • Short delay and few spinners

    • Long delay and many spinners

The left flow is the spin on Test-and-Test-and-Set from the previous slide; the delayed variant inserts a DELAY before the TSL:

do
    while [M1] == BUSY     // spin on lock RD
        ;
    DELAY                  // delay before the TSL
    Reg = TSL [M1]         // [M1] = BUSY
while Reg == BUSY
// got the lock
Execute CS
[M1] = CLEAN               // un-lock
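A sketch of the delay-before-TSL variant with CSMA-style exponential back-off, again with an int lock word; the busy-delay loop, initial delay, and cap are arbitrary illustrative choices:

#include <stdatomic.h>

static atomic_int M1 = 0;                                 /* 0 = CLEAN, 1 = BUSY */

static void delay_for(unsigned n)                         /* crude busy delay, illustrative */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void backoff_lock(void)
{
    unsigned delay = 8;                                   /* initial back-off (arbitrary) */
    for (;;) {
        while (atomic_load_explicit(&M1, memory_order_relaxed) != 0)
            ;                                             /* spin on read */
        delay_for(delay);                                 /* DELAY before the TSL */
        if (atomic_exchange_explicit(&M1, 1, memory_order_acquire) == 0)
            return;                                       /* got the lock */
        if (delay < 1024)
            delay *= 2;                                   /* exponential back-off after a failed TSL */
    }
}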


SW Improvement #1b: Delay After Each Lock Access

  • Delay after each Lock access

    • Check lock less frequently

      • TSL – fewer misses due to invalidation, less bus contention

      • Lock RD – fewer misses due to invalidation, less bus contention

  • Good for architectures with no caches

    • Avoids overflowing the communication (bus, network) BW

Spin on Test-and-Test-and-Set vs. Delay on Test and Delay on Test-and-Set:

do
    while [M1] == BUSY     // lock RD
        DELAY              // 1. DELAY after each lock RD, before the TSL
    Reg = TSL [M1]         // [M1] = BUSY
    if Reg == BUSY
        DELAY              // 2. DELAY after a failed TSL
while Reg == BUSY
// got the lock
Execute CS
[M1] = CLEAN               // un-lock
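A sketch of the delay-after-each-access variant: the lock word is simply polled less frequently. The fixed per-poll delay of 64 iterations and all names are arbitrary illustrative values:

#include <stdatomic.h>

static atomic_int M1 = 0;                                 /* 0 = CLEAN, 1 = BUSY */

static void delay_for(unsigned n)                         /* crude busy delay, illustrative */
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

void delayed_poll_lock(void)
{
    for (;;) {
        while (atomic_load_explicit(&M1, memory_order_relaxed) != 0)
            delay_for(64);                                /* DELAY after each lock read */
        if (atomic_exchange_explicit(&M1, 1, memory_order_acquire) == 0)
            return;                                       /* got the lock */
        delay_for(64);                                    /* DELAY after a failed TSL too */
    }
}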


SW Improvement #2: Queuing

  • To resolve contention

    • Delay uses time

    • Queue uses space

  • Queue Implementation

    • Basic

      • Allocate a slot for each waiting CPU in a queue

        • Requires insertion and deletion – atomic ops

          • Not good for small CSs

    • Efficient

      • Each CPU gets a unique seq# – one atomic op

      • On releasing the lock, the current CPU activates the one with the next seq# – no atomic op

  • Q Performance

    • Works well (offers low contention) for bus-based and network-based architectures with invalidation-based caches

    • Less valuable for bus-based architectures with no caches, since each CPU's polling still contends on the bus

    • Increased lock latency under low contention, due to the overhead of obtaining the lock

    • Preemption of the CPU holding the lock can further starve the CPUs waiting on it – pass the token before switching out

    • A centralized queue becomes a bottleneck as the number of CPUs increases – solutions include dividing the queue between nodes, etc.

Array of 0 to (N-1) slots (“queues”)

  • One slot for each CPU, each in a separate CL

  • Each CPU spins on its own slot – continuous polling, but no coherence traffic

  • No atomic TSL needed to spin – the lock is your slot (one slot array per lock!)

  • CPU 0 gets the lock; when unlocking, it passes the token to another CPU (e.g., CPU 5)

    • Requires an atomic TSL on that slot

    • Some criterion picks the “another”, e.g., priority or FIFO
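A sketch of an array-based queue lock in the spirit of this slide: one flag slot per CPU, each padded to its own cache line, one atomic fetch-and-increment to take a sequence number, and a plain store to pass the token. MAX_CPUS, the 64-byte line size, and all names are assumptions for illustration:

#include <stdatomic.h>

#define MAX_CPUS   32                              /* assumed max number of spinners */
#define LINE_SIZE  64                              /* assumed cache-line size */

struct slot {
    atomic_int has_lock;                           /* 1 = this slot holds the token */
    char pad[LINE_SIZE - sizeof(atomic_int)];      /* keep each slot in its own cache line */
};

static struct slot slots[MAX_CPUS];
static atomic_uint next_seq;                       /* sequence-number dispenser */

void queue_lock_init(void)
{
    atomic_store_explicit(&slots[0].has_lock, 1, memory_order_relaxed);  /* slot 0 starts with the token */
}

unsigned queue_lock(void)                          /* returns my slot, needed for unlock */
{
    /* one atomic op: take a unique sequence number, i.e. a slot of my own */
    unsigned me = atomic_fetch_add_explicit(&next_seq, 1, memory_order_relaxed) % MAX_CPUS;
    /* spin only on my own slot: no ping-pong on a shared lock line */
    while (!atomic_load_explicit(&slots[me].has_lock, memory_order_acquire))
        ;
    atomic_store_explicit(&slots[me].has_lock, 0, memory_order_relaxed);  /* reset slot for reuse */
    return me;
}

void queue_unlock(unsigned me)
{
    /* pass the token to the next waiter (FIFO): a plain store, no atomic RMW */
    atomic_store_explicit(&slots[(me + 1) % MAX_CPUS].has_lock, 1, memory_order_release);
}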


SW Improvements: Test Results

  • 20-CPU Sequent Symmetry Model B

  • Static and dynamic delay = 0–15 usec

  • TSL = 1 usec

  • No atomic increment; the queue uses an explicit lock with backoff to access the seq#

  • Each CPU loops 1M/#P times to acquire, do the CS, release, and compute

    • Measured: spin-waiting overhead (sec) in executing the benchmark

  • At low CPU count (low contention)

    • The queue has high latency due to the lock overhead

  • At high CPU count

    • The queue performs best

    • Back-off performs slightly worse than static delays

SOURCE: Figure copied from paper


HW Solutions

  • Separate Bus for Lock and Regular memory requests

    • As in the Sequent Balance

      • Regular req follows invalidation based $ coherence

      • Lock req follows distributed-write based $ coherence

  • Expensive solution

    • Little benefit to apps that don't spend much time spin-waiting

    • How to manage if the two buses are slower


HW Solution: Multistage Interconnection Network CPUs

  • NUMA-type architecture

    • An “SMP view” formed as a “combination of memory” across the “nodes”

  • Collapse all simultaneous requests for a single lock from a node into one

    • The value would be the same for all of them anyway

    • Saves contention BW

      • But the gain could be offset by the increased latency of the “combining switches”

        • Could be beaten by a normal network with backoff or queuing

  • HW queue

    • Such as one maintained by the cache controller

      • Uses the same method as SW to pass the token to the next CPU

      • One proposal by Goodman et al. combines HW and SW to maintain the queue

    • A HW implementation, though complex, could be faster


HW Solution: Single-Bus CPUs

  • A single bus has a ping-pong problem, with constant invalidations even when the lock isn't available

    • Largely due to the “atomic” nature of RMW lock instructions

  • Minimize invalidations by restricting them to the cases where the value has really changed

    • Makes sense, and solves the problem while spinning on read

    • However, there is still an invalidation when the lock is finally released

      • The cache miss by each spinning CPU, and the further failed TSLs, consume BW

        • Time to quiesce is reduced but not fully eliminated

  • Special handling of read requests by improving the snooping and coherence protocol

    • Broadcast on a read, which could eliminate duplicate read misses

      • The first read after an invalidation (such as making the lock available) fulfills the further read requests for the same lock

        • Requires implementing fully distributed write coherence

  • Special handling of test-and-set requests in the cache and bus controllers

    • If it doesn't increase bus or cache cycle time, it should be better than SW queuing or backoff

  • None of the methods is shown to achieve ideal performance as measured and tested on Symmetry

    • The difficulty is knowing which type of atomic instruction is making a request

      • The type is only known and computed in the core

        • The cache and the bus see everything as nothing other than a “request”

      • The ability to pass such control signals along with requests could help achieve this


Summary

    • Spin locks are a common method for achieving mutually exclusive access to a shared data structure

    • Multi-core CPUs are increasingly common

      • Spin-lock performance degrades as the number of spinning CPUs increases

    • Efficient methods in both SW and HW can be used to mitigate the performance degradation

      • SW

        • SW queuing

          • Performs best at high contention

        • Ethernet style backoff

          • Performs best at low contention

      • HW

        • For multistage-interconnection-network CPUs, HW queuing/combining at a node for requests of one type could help reduce contention

        • For bus-based SMP CPUs, intelligent snooping could be implemented to reduce bus traffic

    • The recommendations for improving spin-lock performance (above) look promising

      • But only on small benchmarks

      • Benefits to “real” workloads are an open question