The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Paper: Thomas E. Anderson
Presentation: Emerson Murphy-Hill
Slide 1 (of 23)

Introduction • Shared Memory Multiprocessors • Mutual exclusion required • Almost always hardware primitives provided • Direct mutual exclusion • Mutual exclusion through locking • Interest here: short critical regions, spin locks • The problem: spinning processors cost communication bandwidth – how can we cut it? Slide 2 (of 23)

Range of Architectures • Two dimensions: • Interconnect type (multistage network or bus) • Cache type • So six architectures considered: • Multistage network without private caches • Multistage network, invalidation-based cache coherence using remote directories (RD) • Bus without coherent private caches • Bus with snoopy write-through invalidation-based cache coherence • Bus with snoopy write-back invalidation-based cache coherence • Bus with snoopy distributed-write cache coherence • Architectures generally provide an atomic read-modify-write instruction Slide 3 (of 23)

Why Spinlocks are Slow • Tradeoff: frequent polling gets you the lock faster, but slows everyone else down • Latency is an issue: a more complicated spin-lock algorithm adds overhead to acquiring the lock Slide 4 (of 23)

A Spin-Waiting Algorithm
• Spin on Test-and-Set
    while(TestAndSet(lock) = BUSY);
    <critical section>
    lock := CLEAR;
• Slow, because:
  • Lock holder must contend with non-lock holders
  • Spinning requests slow other requests
Slide 5 (of 23)
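
The spin-on-Test-and-Set loop above can be sketched in C++ as follows. This is a minimal illustration, not code from the paper: it assumes `std::atomic<bool>::exchange` as a stand-in for the hardware Test-and-Set, and `TasLock` and `count_with_tas_lock` are hypothetical names introduced here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Spin on test-and-set: every probe is an atomic read-modify-write,
// so every waiting thread generates interconnect traffic on every probe.
class TasLock {
    std::atomic<bool> busy_{false};  // false = CLEAR, true = BUSY

public:
    void lock() {
        // exchange(true) plays the role of TestAndSet: it writes BUSY and
        // returns the previous value. A previous value of true means the
        // lock was already held, so try again.
        while (busy_.exchange(true, std::memory_order_acquire)) {
        }
    }

    void unlock() {
        busy_.store(false, std::memory_order_release);  // lock := CLEAR
    }
};

// Hypothetical driver: `threads` workers each increment a shared counter
// `iters` times inside the critical section; the final count is returned.
long count_with_tas_lock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TasLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // <critical section>
                lock.unlock();
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return counter;
}
```

If mutual exclusion holds, `count_with_tas_lock(4, 1000)` returns 4000; a lost update would show up as a smaller count.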

Another Spin-Waiting Algorithm
• Spin on Read (Test-and-Test-and-Set)
    while(lock=BUSY or TestAndSet(lock)=BUSY);
    <critical section>
    lock := CLEAR;
• For architectures with per-processor caches
• Like the previous one, but reads cause no network/bus communication
• For short critical sections this is slow, because the time to quiesce (until all waiting processors are back to spinning on cached copies) dominates
Slide 6 (of 23)
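
A hedged C++ sketch of spin on read follows. As an assumption for illustration only, the lock also counts its Test-and-Set attempts so the number of read-modify-write operations can be inspected; that counter, `TtasLock`, and `run_ttas` are hypothetical names, and the counter is itself an extra atomic operation that a real lock would not carry.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-test-and-set: wait on plain reads, and only issue the
// read-modify-write once the lock has been observed to be free.
class TtasLock {
    std::atomic<bool> busy_{false};
    std::atomic<long> tas_attempts_{0};  // instrumentation only (assumption)

public:
    void lock() {
        for (;;) {
            // lock = BUSY: with coherent caches this loop reads a local copy.
            while (busy_.load(std::memory_order_relaxed)) {
            }
            // The lock looked free, so now try the real test-and-set.
            tas_attempts_.fetch_add(1, std::memory_order_relaxed);
            if (!busy_.exchange(true, std::memory_order_acquire)) return;
        }
    }

    void unlock() { busy_.store(false, std::memory_order_release); }

    long tas_attempts() const { return tas_attempts_.load(); }
};

struct TtasResult {
    long counter;       // increments performed under the lock
    long tas_attempts;  // test-and-set operations issued in total
};

// Hypothetical driver: each worker increments the counter `iters` times.
TtasResult run_ttas(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TtasLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // <critical section>
                lock.unlock();
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return {counter, lock.tas_attempts()};
}
```

Every acquisition needs at least one Test-and-Set, so `tas_attempts` is never below `counter`; the gap between the two is the number of failed attempts, which is the traffic the rest of the talk tries to reduce.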

Reasons Why Quiescence is Slow • Elapsed time between Read and Test-and-Set • All cached copies of a lock are invalidated on a Test-and-Set, even if the test fails • Invalidation-based cache coherence requires O(P) bus/network cycles, because the written value has to be propagated to every processor separately (the same value each time!) Slide 7 (of 23)

Validation Slide 8 (of 23)

Validation (a bit more) Slide 9 (of 23)

Now, Speed it Up… • Author presents 5 alternative approaches • Interesting approach – 4 are based on the observation that communication during spin waiting is like CSMA (Ethernet) networking protocols Slide 10 (of 23)

1/5: Static Delay on Lock Release • When a processor notices the lock has been released, it waits a fixed amount of time before trying a Test-and-Set • Each processor is assigned a static delay (slot) • Good performance with: • Few slots when few processors are spinning • Many slots when many processors are spinning Slide 11 (of 23)
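
One way to picture static slots is the C++ sketch below. It is an assumption-laden illustration: `StaticSlotLock`, the `my_slot` parameter, and `count_with_slots` are hypothetical, and `std::this_thread::sleep_for` stands in for a timed spin delay, so real delays are much coarser than the requested slot time.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Static delay on lock release: after noticing the release, a waiter sits
// out its own fixed slot before it is allowed to try the test-and-set.
class StaticSlotLock {
    std::atomic<bool> busy_{false};
    std::chrono::nanoseconds slot_time_;

public:
    explicit StaticSlotLock(std::chrono::nanoseconds slot_time)
        : slot_time_(slot_time) {}

    void lock(int my_slot) {
        assert(my_slot >= 0);
        for (;;) {
            // Spin on read until the lock is seen to be free.
            while (busy_.load(std::memory_order_relaxed)) {
            }
            // Wait out the slot assigned to this thread (slot 0 waits 0).
            std::this_thread::sleep_for(slot_time_ * my_slot);
            // If an earlier slot took the lock meanwhile, go back to spinning
            // without issuing a test-and-set at all.
            if (busy_.load(std::memory_order_relaxed)) continue;
            if (!busy_.exchange(true, std::memory_order_acquire)) return;
        }
    }

    void unlock() { busy_.store(false, std::memory_order_release); }
};

// Hypothetical driver: thread t uses slot t and increments `iters` times.
long count_with_slots(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    StaticSlotLock lock(std::chrono::microseconds(1));
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters, t] {
            for (int i = 0; i < iters; ++i) {
                lock.lock(t);
                ++counter;  // <critical section>
                lock.unlock();
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return counter;
}
```

Note that the thread in slot 0 never delays, so under contention it tends to win repeatedly; the sketch shows the mechanism, not a fair assignment of slots.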

2/5: Backoff on Lock Release • Like Ethernet backoff • Wait a small amount of time between Read and Test-and-Set • If processor collides with another processor, it backs off for a greater random interval • Indirectly, processors base backoff interval on the number of spinning processors • But… Slide 12 (of 23)

More on Backoff… • Processors should not change their mean delay if another processor acquires the lock • Maximum time to delay should be bounded • Initial delay on arrival should be a fraction of the last delay Slide 13 (of 23)
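
The three rules above can be sketched in C++ roughly as follows. The specifics are assumptions chosen for illustration, not values from the paper: the window doubles on a collision, the "fraction" on arrival is one half, delays are drawn uniformly from 0 to the window in microseconds, and `BackoffWindow`, `BackoffLock`, and `count_with_backoff` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <random>
#include <thread>
#include <vector>

// Per-thread backoff state for backoff on lock release.
class BackoffWindow {
    unsigned window_ = 1;  // current upper bound on the random delay
    unsigned max_window_;  // rule: the maximum delay is bounded
    std::minstd_rand rng_;

public:
    BackoffWindow(unsigned max_window, unsigned seed)
        : max_window_(max_window), rng_(seed) {
        assert(max_window >= 1);
    }
    // Rule: start each new acquisition at a fraction (here 1/2) of the last.
    void on_arrival() { window_ = std::max(1u, window_ / 2); }
    // A failed test-and-set is a collision: widen the window, up to the cap.
    void on_collision() { window_ = std::min(max_window_, window_ * 2); }
    unsigned next_delay() {
        return std::uniform_int_distribution<unsigned>(0, window_)(rng_);
    }
    unsigned window() const { return window_; }
};

class BackoffLock {
    std::atomic<bool> busy_{false};

public:
    void lock(BackoffWindow& backoff) {
        backoff.on_arrival();
        for (;;) {
            while (busy_.load(std::memory_order_relaxed)) {
            }  // spin on read
            std::this_thread::sleep_for(
                std::chrono::microseconds(backoff.next_delay()));
            // Rule: if someone else acquired the lock during our delay, that
            // is not a collision, so the window is left unchanged.
            if (busy_.load(std::memory_order_relaxed)) continue;
            if (!busy_.exchange(true, std::memory_order_acquire)) return;
            backoff.on_collision();
        }
    }
    void unlock() { busy_.store(false, std::memory_order_release); }
};

// Hypothetical driver: each worker keeps its own BackoffWindow.
long count_with_backoff(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    BackoffLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters, t] {
            BackoffWindow backoff(64, static_cast<unsigned>(t) + 1);
            for (int i = 0; i < iters; ++i) {
                lock.lock(backoff);
                ++counter;  // <critical section>
                lock.unlock();
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return counter;
}
```

Keeping the window in a per-thread object rather than in the lock means that waiting threads share no backoff state and need no extra communication to adjust it.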

3/5: Static Delay before Reference
    while(lock=BUSY or TestAndSet(lock)=BUSY)
      delay();
    <critical section>
• Here you just check the lock less often
• Good when:
  • Checking frequently, and few other spinners
  • Checking infrequently, many spinners
Slide 14 (of 23)

4/5: Backoff before Reference
    while(lock=BUSY or TestAndSet(lock)=BUSY)
    begin
      delay(interval);
      interval := interval + randomBackoff();
    end;
    <critical section>
• Analogous to backoff on lock release
• Both dynamic and static backoff are bad when the critical section is long: they just keep backing off while the lock is being held
Slide 15 (of 23)

5/5: Queue
• Can’t estimate backoff by counting the waiting processes, and can’t keep an explicit process queue (maintaining either is just as slow as the lock!)
• This author’s contribution (finally):
    Init    flags[0] := HAS_LOCK;
            flags[1..P-1] := MUST_WAIT;
            queueLast := 0;
    Lock    myPlace := ReadAndIncrement(queueLast);
            while(flags[myPlace mod P]=MUST_WAIT);
            <critical section>
    Unlock  flags[myPlace mod P] := MUST_WAIT;
            flags[(myPlace+1) mod P] := HAS_LOCK;
Slide 16 (of 23)
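
The Init/Lock/Unlock steps above map onto C++ roughly as in this sketch. Several details are assumptions added for illustration: `fetch_add` stands in for ReadAndIncrement, each flag is padded to an assumed 64-byte cache line, the wait loop calls `std::this_thread::yield()` so the example behaves on machines with fewer cores than threads, and `ArrayQueueLock` and `count_with_queue_lock` are hypothetical names. The sketch also requires that no more than P threads use the lock at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class ArrayQueueLock {
    // One flag per queue position; true = HAS_LOCK, false = MUST_WAIT.
    struct alignas(64) Slot {
        std::atomic<bool> has_lock{false};
    };
    std::vector<Slot> flags_;                  // flags[0..P-1]
    std::atomic<std::size_t> queue_last_{0};   // queueLast

public:
    explicit ArrayQueueLock(std::size_t max_threads) : flags_(max_threads) {
        assert(max_threads > 0);
        flags_[0].has_lock.store(true);  // Init: flags[0] := HAS_LOCK
    }

    // Returns myPlace, which the caller must hand back to unlock().
    std::size_t lock() {
        // myPlace := ReadAndIncrement(queueLast), reduced mod P.
        std::size_t my_place =
            queue_last_.fetch_add(1, std::memory_order_relaxed) % flags_.size();
        // Each waiter polls only its own flag.
        while (!flags_[my_place].has_lock.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // assumption: not part of the slide
        }
        return my_place;
    }

    void unlock(std::size_t my_place) {
        // Reset our own flag first, then hand the lock to the next position.
        flags_[my_place].has_lock.store(false, std::memory_order_relaxed);
        flags_[(my_place + 1) % flags_.size()].has_lock.store(
            true, std::memory_order_release);
    }
};

// Hypothetical driver: the lock is sized for exactly `threads` workers.
long count_with_queue_lock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    ArrayQueueLock lock(static_cast<std::size_t>(threads));
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                std::size_t place = lock.lock();
                ++counter;  // <critical section>
                lock.unlock(place);
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return counter;
}
```

Because a release writes only the next waiter's flag, a handoff disturbs one spinning thread instead of all of them, at the price of an extra atomic increment on every acquisition.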

More on Queuing • Works especially well for multistage networks – each flag can be on a separate module, so a single memory location isn’t saturated with requests • Works less well if there’s a bus without cache coherence, because we still have the problem that each process has to poll for a single value in one place • Lock latency is increased (overhead), so poor performance when there’s no contention Slide 17 (of 23)

Benchmark Spin-lock Alternatives Slide 18 (of 23)

Overhead vs. Number of Slots Slide 19 (of 23)

Spin-waiting Overhead for a Burst Slide 20 (of 23)

Network Hardware Solutions • Combining Networks • Multiple paths to same memory location • Hardware Queuing • Eliminates polling across the network • Goodman’s Queue Links • Stores the name of the next processor in the queue directly in each processor’s cache • Eliminates need for memory access for queuing Slide 21 (of 23)

Bus Hardware Solutions • Invalidate cache copies ONLY when Test-and-Set succeeds • Read broadcast • Whenever some other processor reads a value which I know is invalid, I get a copy of that value too (piggyback) • Eliminates the cascade of read-misses • Special handling of Test-and-Set • Cache and bus controllers don’t mess with the bus if the lock is busy • Essentially, doesn’t do a test-and-set so long as there is a possibility it might fail Slide 22 (of 23)

Conclusions • Spin-locking performance doesn’t scale • A variant of Ethernet backoff has good results when there is little lock contention • Queuing (parallelizing lock handoff) has good results when there are waiting processors • A little supportive hardware goes a long way towards a healthy multiprocessor relationship Slide 23 (of 23)
