
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
By T. E. Anderson

Presented by Ashish Jha

PSU SP 2010 CS-510

05/20/2010


Agenda

  • Preview of an SMP single-bus-based system

    • The cache ($) protocol and the bus

  • What is a Lock?

    • Usage and operations in a CS

  • What is Spin-Lock?

    • Usage and operations in a CS

  • Problems with Spin-Locks on SMP systems

  • Methods to improve Spin-Lock performance in both SW & HW

  • Summary



Preview: SMP Arch

  • Shared Bus

    • Coherent, Consistent and Contended Memory

  • Snoopy Invalidation based Cache Coherence Protocol

    • Guarantees Atomicity of a memory operation

  • Sources of Contention

    • Bus

    • Memory Modules

[Diagram: CPU 0 … CPU N, each with a private L1D cache, share one bus (BSQ) to memory. T1: CPU 0 LD reg=[M1] – its line becomes Exclusive; T2: another CPU LD reg=[M1] – both copies become Shared; T3: ST [M1]=reg – the writer's line becomes Modified and all other copies are Invalidated.]



What is a Lock?

  • Instruction defined and exposed by the ISA

    • To achieve “Exclusive” access to memory

  • Lock is an “Atomic” RMW operation

    • uArch guarantees Atomicity

      • achieved via Cache Coherence Protocol

  • Used to implement a Critical Section

    • A block of code with “Exclusive” access

  • Examples

    • TSL – Test-and-Set Lock

    • CAS – Compare-and-Swap (both sketched in C below)
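
To make the two primitives concrete, here is a minimal C11 sketch (not from the slides); `tsl` and `cas_acquire` are illustrative names, with `atomic_exchange` standing in for a TSL instruction and `atomic_compare_exchange_strong` for CAS:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CLEAN 0   /* lock is free */
#define BUSY  1   /* lock is held */

/* TSL: atomically write BUSY and return the previous value. */
static inline int tsl(atomic_int *lock) {
    return atomic_exchange(lock, BUSY);
}

/* CAS: set *lock to BUSY only if it is currently CLEAN; true on success. */
static inline bool cas_acquire(atomic_int *lock) {
    int expected = CLEAN;
    return atomic_compare_exchange_strong(lock, &expected, BUSY);
}
```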



Lock Operation

[Flowchart: Reg = TSL [M1]. If the line is already Exclusive/Modified in the local cache, the RMW proceeds locally; otherwise a bus transaction is issued, the copy in any remote cache (or in memory) is invalidated, and M1 is brought into the local cache in Modified state. M1 is then set to BUSY and the old value is returned in Reg: Reg = CLEAN means GOT LOCK, Reg = BUSY means NO LOCK.]



Critical Section using Lock

  • Simple, Intuitive and Elegant

[Flowchart: Reg = TSL [M1]  // [M1] = BUSY; if Reg = CLEAN → got lock → Execute CS → [M1] = CLEAN  // Un-Lock]



Critical Section using Spin-Lock

  • Spin on Test-and-Set

  • Yet again, Simple, Intuitive and Elegant

[Flowchart: Reg = TSL [M1]  // [M1] = BUSY; if Reg = BUSY → spin (repeat the TSL); if Reg = CLEAN → got lock → Execute CS → [M1] = CLEAN  // Un-Lock]
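
A minimal C11 sketch of the spin-on-test-and-set lock in the flowchart above; the CLEAN/BUSY encoding follows the slides, and the function names are illustrative:

```c
#include <stdatomic.h>

#define CLEAN 0
#define BUSY  1

/* Spin on test-and-set: every iteration is an atomic RMW (a bus transaction). */
void spin_lock(atomic_int *lock) {
    while (atomic_exchange(lock, BUSY) == BUSY)
        ;                       /* Reg = TSL [M1]; spin while Reg == BUSY */
}

void spin_unlock(atomic_int *lock) {
    atomic_store(lock, CLEAN);  /* [M1] = CLEAN */
}
```

A critical section is then simply spin_lock(&l); … CS … ; spin_unlock(&l);, exactly the sequence in the flowchart.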



Problem with Spin-Lock?

  • A Lock is an RMW operation

    • A “simple?” store op

  • Works well in a UP (uniprocessor) to few-core environment…next slide…

[Same spin-on-test-and-set flowchart as the previous slide.]



Spin-Lock in Many-Core Env.

  • Severe Contention on the Bus, with Traffic from

    • Snoops

    • Invalidations

    • Regular Requests

  • Contended Memory module

    • Data requested by different CPUs residing in the same module

[Diagram: T1: CPU 0 acquires the lock with TSL[M1]; T2: CPU 1 spins with TSL[M1]; T3: the holder releases ([M1]=CLEAN) while CPUs N-1 and N-2 keep spinning with TSL[M1] and CPU N issues ordinary requests to M2. The M1 line bounces between Modified and Invalid in the L1D caches, and the bus/memory-module request queues (Q0–Q3, BSQ) fill with the competing requests.]



Spin-Lock in Many-Core Env. Cont’d

  • An avalanche effect on bus & memory-module contention with

    • more CPUs – impacts scalability

      • More snoop and coherence traffic, with a ping-pong effect on locks – unsuccessful test&sets and invalidations

      • More starvation – the lock has been released, but acquiring it is further delayed by contention on the bus

    • Requests conflicting in the same memory module

    • Top it off with SW bugs

      • Locks and/or regular requests conflicting in the same cache line (CL)

  • Suppose lock latency were 20 core clocks

    • The bus runs as much as 10x slower

      • so the latency to acquire the lock could increase by 10x or more

[Same diagram as the previous slide: the spinning CPUs ping-pong the M1 line while the bus queue fills.]



A better Spin-Lock

  • Spin on Read (Test-and-Test-and-Set)

    • A bit better, as long as the lock is not modified while spinning on the cached value

      • Doesn't hold up for long as the number of CPUs is scaled

        • Same set of problems as before – lots of invalidations due to TSL (see the C sketch after the flowchart)

[Flowchart: spin on lock read – while [M1] = BUSY, keep re-reading the cached copy; when [M1] looks CLEAN, issue Reg = TSL [M1]; if Reg = BUSY, go back to spinning on reads; if Reg = CLEAN → got lock → Execute CS → [M1] = CLEAN  // Un-Lock]
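
A C11 sketch of spin-on-read (test-and-test-and-set): plain loads spin in the local cache, and the atomic TSL is attempted only when the lock looks free. Names are illustrative, not from the paper:

```c
#include <stdatomic.h>

#define CLEAN 0
#define BUSY  1

/* Test-and-test-and-set: spin with plain loads on the cached copy,
   and issue the expensive atomic TSL only when the lock looks free. */
void ttas_lock(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) == BUSY)          /* spin on read, no bus traffic */
            ;                                      /* while the line stays cached  */
        if (atomic_exchange(lock, BUSY) == CLEAN)  /* TSL only when it looked free */
            return;                                /* got lock */
    }
}
```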



Verify through Tests

  • Measure spin-lock latency and performance with small and large amounts of contention

  • Results confirm

    • Sharp degradation in performance for spin on test-and-set as the number of CPUs is scaled

      • Spin on read is slightly better

    • Both methods degrade badly (scale poorly) as CPUs are added

      • Peak performance is never reached – time to quiesce is almost linear in the CPU count, hurting communication BW

[Figures copied from the paper: spin-lock performance, and time to quiesce for spin on read (µsec), vs. number of processors]

  • 20-CPU Symmetry Model B SMP

    • WB-invalidate $

    • Shared bus – one and the same bus for lock and regular requests

  • Lock acquire-release = 5.6 µsec

  • Measured: elapsed time for the CPUs to execute the CS 1M times

    • Each CPU loops: wait for lock, do CS, release, and delay for a randomly selected time (a rough benchmark sketch follows)
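
A rough sketch of the per-CPU benchmark loop described above (the paper's actual harness is not shown on the slide; the delay bound and names are illustrative assumptions):

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int lock;                 /* 0 = CLEAN, 1 = BUSY */
static volatile long shared_counter;    /* stand-in for the critical section */

/* One CPU's benchmark loop: wait for lock, run a tiny CS, release,
   then delay for a randomly chosen time before trying again. */
void benchmark_loop(int my_iterations) {
    for (int i = 0; i < my_iterations; i++) {
        while (atomic_exchange(&lock, 1) == 1)           /* wait for lock */
            ;
        shared_counter++;                                /* critical section */
        atomic_store(&lock, 0);                          /* release */
        for (volatile int d = rand() % 1000; d > 0; d--)
            ;                                            /* random think time */
    }
}
```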



What can be done?

  • Can Spin-Lock performance be improved by

    • SW

      • Any efficient algorithm for busy locks?

    • HW

      • Any more complex HW needed?



SW Impr. #1a: Delay TSL

Flowcharts below: spin-on-test-and-test-and-set spin-lock (left) vs. spin on test with a delay before test-and-set (right)

  • By delaying the TSL

    • Reduces the number of invalidations and bus contention

  • The delay could be set

    • Statically – a delay slot for each processor, could be prioritized

    • Dynamically – as in CSMA networks – exponential back-off (sketched in C after the flowcharts)

  • Performance good with

    • Short delay and few spinners

    • Long delay and many spinners

[Flowcharts: both variants spin on lock reads while [M1] = BUSY. When the lock looks free, the left variant issues Reg = TSL [M1] immediately; the right variant DELAYs before the TSL. In both, Reg = BUSY sends the CPU back to spinning, Reg = CLEAN means the lock is acquired, then Execute CS and [M1] = CLEAN to un-lock.]
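
A C11 sketch of delaying the TSL with Ethernet-style exponential back-off; the initial and maximum delays are illustrative assumptions, not the paper's constants:

```c
#include <stdatomic.h>

#define CLEAN     0
#define BUSY      1
#define MAX_DELAY 1024   /* illustrative back-off cap */

void backoff_lock(atomic_int *lock) {
    int delay = 1;
    for (;;) {
        while (atomic_load(lock) == BUSY)          /* spin on read */
            ;
        for (volatile int i = 0; i < delay; i++)   /* DELAY before the TSL so all     */
            ;                                      /* spinners don't hit the bus at once */
        if (atomic_exchange(lock, BUSY) == CLEAN)  /* test-and-set */
            return;
        if (delay < MAX_DELAY)                     /* exponential (Ethernet-style) back-off */
            delay *= 2;
    }
}
```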



SW Impr. #1b: Delay after each Lock access

  • Delay after each Lock access

    • Check lock less frequently

      • TSL – fewer misses due to invalidation, less bus contention

      • Lock RD – fewer misses due to invalidation, less bus contention

  • Good for architectures with no caches

    • where polling would otherwise overflow communication (bus, NW) BW (a C sketch follows the flowcharts)

Flowcharts below: spin-on-test-and-test-and-set spin-lock (left) vs. delay-on-test and delay-on-test-and-set spin-lock (right)

[Flowcharts: left – spin on lock read until [M1] is no longer BUSY, then Reg = TSL [M1]. Right – insert a DELAY (1) after each lock read, before the TSL, and (2) after each failed TSL. In both, Reg = BUSY returns to polling; Reg = CLEAN means the lock is acquired, then Execute CS and [M1] = CLEAN to un-lock.]
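
A C11 sketch of #1b, pausing after each access so the lock is polled less frequently; the delay constant is an illustrative assumption:

```c
#include <stdatomic.h>

#define CLEAN      0
#define BUSY       1
#define POLL_DELAY 100   /* illustrative pause between lock accesses */

/* Delay after each lock access: the lock is checked less frequently,
   useful where every poll crosses the bus/network (no caches). */
void delay_poll_lock(atomic_int *lock) {
    for (;;) {
        if (atomic_load(lock) == CLEAN &&
            atomic_exchange(lock, BUSY) == CLEAN)
            return;                                   /* got lock */
        for (volatile int i = 0; i < POLL_DELAY; i++)
            ;                                         /* delay before the next access */
    }
}
```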



SW Impr. #2: Queuing

  • To resolve contention

    • Delay uses time

    • Queue uses space

  • Queue Implementation

    • Basic

      • Allocate slot for each waiting CPU in a queue

        • Requires insertion and deletion – atomic op’s

          • Not good for small CS

    • Efficient

      • Each CPU gets a unique seq# – one atomic op

      • On completing the lock, the current CPU activates the one with the next seq# – no atomic op

  • Q Performance

    • Works well (offers low contention) for bus-based and NW-based architectures with invalidation-based caches

    • Less valuable for bus-based architectures with no caches, as each CPU's polling still contends on the bus

    • Increased lock latency under low contention, due to the overhead of attaining the lock

    • Preemption of the CPU holding the lock could further starve the CPUs waiting on it – pass the token on before switching out

    • A centralized Q becomes a bottleneck as the number of CPUs increases – solutions include dividing the Q between nodes, etc.

[Diagram: queues 0 to (N-1) – a slot for each CPU (and a slot per lock), each in a separate CL, so continuous polling generates no coherence traffic. Each CPU spins on its own slot – no atomic TSL needed, the lock is your slot. CPU 0 gets the lock; on unlocking, CPU 0 passes the token to another CPU (e.g. 5) – that requires an atomic TSL on that slot and some criterion for picking "another", e.g. priority or FIFO. A C sketch of this scheme follows.]
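
A minimal C11 sketch in the spirit of the array-based queueing lock described above: each waiter takes a sequence number with one atomic fetch-and-increment and spins on its own slot (one slot per cache line); on release, the holder passes the token to the next slot. The slot count, padding, and names are assumptions for illustration:

```c
#include <stdatomic.h>

#define MAX_CPUS   32    /* illustrative upper bound on waiters */
#define CACHE_LINE 64    /* assumed cache-line size */

enum { MUST_WAIT = 0, HAS_LOCK = 1 };

struct qlock {
    struct {
        atomic_int flag;
        char pad[CACHE_LINE - sizeof(atomic_int)];   /* one slot per cache line */
    } slot[MAX_CPUS];
    atomic_uint next_seq;                            /* sequence-number counter */
};

void qlock_init(struct qlock *q) {
    for (int i = 0; i < MAX_CPUS; i++)
        atomic_store(&q->slot[i].flag, MUST_WAIT);
    atomic_store(&q->slot[0].flag, HAS_LOCK);        /* first ticket wins */
    atomic_store(&q->next_seq, 0);
}

unsigned qlock_acquire(struct qlock *q) {
    unsigned my = atomic_fetch_add(&q->next_seq, 1) % MAX_CPUS;  /* the one atomic op */
    while (atomic_load(&q->slot[my].flag) == MUST_WAIT)          /* spin on own slot only */
        ;
    return my;                                       /* caller keeps its slot number */
}

void qlock_release(struct qlock *q, unsigned my) {
    atomic_store(&q->slot[my].flag, MUST_WAIT);                  /* reset slot for reuse */
    atomic_store(&q->slot[(my + 1) % MAX_CPUS].flag, HAS_LOCK);  /* pass token to next waiter */
}
```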



SW Impr.: Test Results

  • 20-CPU Symmetry Model B

  • Static & dynamic delay = 0–15 µsec

  • TSL = 1 µsec

  • No atomic increment instruction, so the Q uses an explicit lock w/ back-off to access the seq #

  • Each CPU loops 1M/#P times to acquire, do the CS, release, and compute

    • Measured: spin-waiting overhead (sec) in executing the benchmark

  • At low CPU count (low contention)

    • Queue has high latency due to lock overhead

  • At high CPU count

    • Queue performs best

    • back-off performs slightly worse than static delays

SOURCE: Figure copied from paper



HW Solutions

  • Separate Bus for Lock and Regular memory requests

    • As in the (Sequent) Balance

      • Regular requests follow invalidation-based $ coherence

      • Lock requests follow distributed-write-based $ coherence

  • Expensive solution

    • Little benefit to Apps which don’t spend much time spin-waiting

    • How to manage if the two buses are slower



HW Sol. – Multistage interconnect NW CPU

  • NUMA type of arch

    • “SMP view” as a “combination of memory” across the “nodes”

  • Collapse all simultaneous requests for a single lock from a node into one

    • The value would be the same for all of them anyway

    • Saves contention BW

      • But performance could be offset by increased latency of “combining switches”

        • Could be defeated by normal NW with backoff or queuing

  • HW queue

    • Such as maintained by the cache controller

      • Uses same method as SW to pass token to next CPU

      • One proposal by Goodman et al. combines HW and SW to maintain the queue

    • A HW implementation, though complex, could be faster



HW Sol. – Single Bus CPU

  • The single bus had a ping-pong problem, with constant invalidations even when the lock wasn't available

    • Largely due to the “atomic” nature of RMW lock instructions

  • Minimize invalidations by restricting them to cases where the value has really changed

    • Makes sense and solves the problem when spinning on read

    • However, there would still be an invalidation when the lock is finally released

      • The cache miss by each spinning CPU, and further failed TSLs, consume BW

        • Time to quiesce reduced but not fully eliminated

  • Special handling of Read requests by improving snooping and coherence protocol

    • Broadcast on a Read which could eliminate duplicate read misses

        • The first read after an invalidation (such as making the lock available) will fulfill further read requests on the same lock

        • Requires implementing fully distributed write-coherence

  • Special handling of test-and-set requests in cache and bus controllers

    • If it doesn’t increase bus or cache cycle time, then it should be better than SW queuing or backoff

  • None of the methods achieves ideal performance as measured and tested on Symmetry

    • The difficulty is knowing the type of atomic instruction making a request

      • The type is only known and computed in the Core

        • The cache and the bus see everything as nothing other than a “request”

      • Ability to pass such control signals along with requests could help achieve the purpose


Summary

  • Spin locks are a common method to achieve mutually exclusive access to a shared data structure

  • Multi-core CPUs are increasingly common

    • Spin-lock performance degrades as the number of spinning CPUs increases

  • Efficient methods in both SW & HW can be implemented to mitigate the performance degradation

    • SW

      • SW queuing

        • Performs best at high contention

      • Ethernet-style back-off

        • Performs best at low contention

    • HW

      • For multi-stage NW CPUs, HW queuing at a node to combine requests of one type could help save contention

      • For SMP bus-based CPUs, intelligent snooping could be implemented to reduce bus traffic

  • The recommendations for spin-lock performance (above) look promising

    • Albeit on small benchmarks

    • Benefits to “real” workloads are an open question

