The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Presentation Transcript


  1. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
     Paper: Thomas E. Anderson
     Presentation: Emerson Murphy-Hill

  2. Introduction
     • Shared-memory multiprocessors
     • Mutual exclusion required
     • Hardware primitives are almost always provided, for either:
       • Direct mutual exclusion
       • Mutual exclusion through locking
     • Interest here: short critical regions, spin locks
     • The problem: spinning processors consume communication bandwidth. How can we cut that cost?

  3. Range of Architectures
     • Two dimensions:
       • Interconnect type (multistage network or bus)
       • Cache type
     • So six architectures considered:
       • Multistage network without private caches
       • Multistage network with invalidation-based cache coherence using remote directories (RD)
       • Bus without coherent private caches
       • Bus with snoopy write-through invalidation-based cache coherence
       • Bus with snoopy write-back invalidation-based cache coherence
       • Bus with snoopy distributed-write cache coherence
     • Architectures generally provide an atomic read-modify-write instruction

  4. Why Spinlocks are Slow
     • Tradeoff: frequent polling gets you the lock sooner once it is released, but slows everyone else down
     • Latency is an issue: a more complicated spinlock algorithm adds some overhead to each acquisition

  5. A Spin-Waiting Algorithm
     • Spin on Test-and-Set
         while (TestAndSet(lock) = BUSY);
         <critical section>
         lock := CLEAR;
     • Slow, because:
       • The lock holder must contend with the spinning non-holders
       • Spinning requests slow other requests
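
The spin-on-Test-and-Set loop above could be written in C11 roughly as follows. This is a minimal sketch, not the paper's code: the `tas_*` names are hypothetical, and `atomic_exchange_explicit` stands in for the hardware Test-and-Set instruction. The memory orderings are my assumption about what a lock needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; the slide only gives pseudocode. */
typedef struct {
    atomic_bool busy;   /* false = CLEAR, true = BUSY */
} tas_lock;

static void tas_init(tas_lock *l) {
    atomic_init(&l->busy, false);
}

/* One atomic read-modify-write: mark the lock BUSY and return the old value. */
static bool tas_test_and_set(tas_lock *l) {
    return atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

static void tas_acquire(tas_lock *l) {
    /* Every iteration is a write to the shared location, even when it fails. */
    while (tas_test_and_set(l)) {
        /* spin */
    }
}

static void tas_release(tas_lock *l) {
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

A caller would bracket the critical section with `tas_acquire(&l)` and `tas_release(&l)`. Note that the release store competes with the spinners' exchanges for the same location, which is the contention the slide describes.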

  6. Another Spin-Waiting Algorithm
     • Spin on Read (Test-and-Test-and-Set)
         while (lock = BUSY or TestAndSet(lock) = BUSY);
         <critical section>
         lock := CLEAR;
     • For architectures with per-processor caches
     • Like the previous algorithm, but the spinning reads cause no network/bus communication
     • For short critical sections this is still slow, because the time to quiesce (until all processors have resumed spinning) dominates
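
A minimal C11 sketch of spin on read, under the same assumptions as any software rendering of the slide: the `ttas_*` names are hypothetical, and `atomic_exchange_explicit` plays the role of Test-and-Set. The only change from plain Test-and-Set spinning is the ordinary load that guards the exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool busy;   /* false = CLEAR, true = BUSY */
} ttas_lock;

static void ttas_init(ttas_lock *l) {
    atomic_init(&l->busy, false);
}

/* One pass of: lock = BUSY or TestAndSet(lock) = BUSY. Returns true on success. */
static bool ttas_try_acquire(ttas_lock *l) {
    /* Plain read first: on a coherent cache this hits locally while the lock is held. */
    if (atomic_load_explicit(&l->busy, memory_order_relaxed))
        return false;
    /* The lock looked free, so pay for the atomic read-modify-write. */
    return !atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

static void ttas_acquire(ttas_lock *l) {
    while (!ttas_try_acquire(l)) {
        /* spin */
    }
}

static void ttas_release(ttas_lock *l) {
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```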

  7. Reasons Why Quiescence is Slow
     • Time elapses between the Read and the Test-and-Set, so several processors can see the lock free and all attempt a Test-and-Set
     • All cached copies of a lock are invalidated on a Test-and-Set, even if the test fails
     • Invalidation-based cache coherence requires O(P) bus/network cycles, because the written value has to be propagated to each of the P processors separately, even though every one of them gets the same value

  8. Validation

  9. Validation (a bit more)

  10. Now, Speed it Up…
      • Author presents 5 alternative approaches
      • Four of them are based on the observation that communication during spin waiting resembles CSMA (Ethernet) networking protocols

  11. 1/5: Static Delay on Lock Release
      • When a processor notices the lock has been released, it waits a fixed amount of time before trying a Test-and-Set
      • Each processor is assigned a static delay (slot)
      • Good performance depends on matching the number of slots to the load:
        • Few slots when there are few spinning processors
        • Many slots when there are many spinning processors
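
One way to sketch the static-slot scheme in C11. The names (`slot_*`), the busy-loop delay, and the `slot_len` unit are all hypothetical. Re-checking the lock after the delay, so that a processor in a later slot goes back to spinning without issuing a Test-and-Set, is my reading of the scheme rather than something the slide states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool busy;   /* false = CLEAR, true = BUSY */
} slot_lock;

static void slot_init(slot_lock *l) {
    atomic_init(&l->busy, false);
}

/* Burn roughly `iterations` loop steps without touching shared memory. */
static void slot_pause(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) {
    }
}

/* `slot` is this processor's statically assigned slot number and `slot_len`
 * the length of one slot in loop iterations; both are hypothetical knobs. */
static void slot_acquire(slot_lock *l, unsigned slot, unsigned slot_len) {
    for (;;) {
        /* Spin on read until the lock is seen to be released. */
        while (atomic_load_explicit(&l->busy, memory_order_relaxed)) {
        }
        /* Wait out our own slot before touching the lock again. */
        slot_pause(slot * slot_len);
        /* If an earlier slot already took the lock, resume spinning. */
        if (atomic_load_explicit(&l->busy, memory_order_relaxed))
            continue;
        if (!atomic_exchange_explicit(&l->busy, true, memory_order_acquire))
            return;
    }
}

static void slot_release(slot_lock *l) {
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

With few slots, several processors share a slot and collide; with many slots, the lock sits idle while the winner waits out its delay. That is the tuning problem the slide points at.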

  12. 2/5: Backoff on Lock Release
      • Like Ethernet backoff
      • Wait a small amount of time between the Read and the Test-and-Set
      • If a processor collides with another processor, it backs off for a greater random interval
      • Indirectly, processors base their backoff interval on the number of spinning processors
      • But…

  13. More on Backoff…
      • A processor should not change its mean delay just because another processor acquired the lock
      • The maximum delay should be bounded
      • The initial delay on arrival should be a fraction of the last delay
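
The backoff rules from these two slides can be combined into one sketch. Everything concrete here is an assumption for illustration: the `bor_*` names, the bounds `BOR_MIN_MEAN` and `BOR_MAX_MEAN`, doubling as the growth rule, halving as the "fraction of the last delay", and the xorshift generator used for the random interval.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { BOR_MIN_MEAN = 4, BOR_MAX_MEAN = 1024 };   /* hypothetical bounds */

typedef struct { atomic_bool busy; } bor_lock;

typedef struct {
    unsigned mean;   /* current mean delay, in loop iterations */
    uint32_t rng;    /* per-processor xorshift state, must be nonzero */
} bor_state;

static void bor_init(bor_lock *l) { atomic_init(&l->busy, false); }

static uint32_t bor_rand(bor_state *s) {
    uint32_t x = s->rng;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return s->rng = x;
}

static void bor_pause(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) {
    }
}

static void bor_acquire(bor_lock *l, bor_state *s) {
    assert(s->rng != 0);
    /* Initial delay on arrival: a fraction (here half) of the last delay. */
    s->mean /= 2;
    if (s->mean < BOR_MIN_MEAN) s->mean = BOR_MIN_MEAN;
    for (;;) {
        while (atomic_load_explicit(&l->busy, memory_order_relaxed)) {
        }
        /* Random wait between the read and the Test-and-Set. */
        bor_pause(1 + bor_rand(s) % s->mean);
        /* Someone else took the lock during our delay: mean stays unchanged. */
        if (atomic_load_explicit(&l->busy, memory_order_relaxed))
            continue;
        if (!atomic_exchange_explicit(&l->busy, true, memory_order_acquire))
            return;
        /* Collision (failed Test-and-Set): back off further, but bounded. */
        s->mean *= 2;
        if (s->mean > BOR_MAX_MEAN) s->mean = BOR_MAX_MEAN;
    }
}

static void bor_release(bor_lock *l) {
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

Each processor keeps its own `bor_state` across acquisitions, so the delay it learned under the last burst of contention carries over, scaled down, to the next arrival.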

  14. 3/5: Static Delay before Reference
          while (lock = BUSY or TestAndSet(lock) = BUSY)
              delay();
          <critical section>
      • Here you just check the lock less often
      • Good when:
        • Checking frequently, with few other spinners
        • Checking infrequently, with many spinners
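
The pseudocode on this slide maps almost directly onto C11. In this sketch the `sdb_*` names and the busy-loop `sdb_pause` are hypothetical, and the fixed `delay` argument is a tuning knob the caller picks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool busy; } sdb_lock;

static void sdb_init(sdb_lock *l) { atomic_init(&l->busy, false); }

static void sdb_pause(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) {
    }
}

/* Poll the lock, sleeping a fixed `delay` between every pair of references. */
static void sdb_acquire(sdb_lock *l, unsigned delay) {
    while (atomic_load_explicit(&l->busy, memory_order_relaxed) ||
           atomic_exchange_explicit(&l->busy, true, memory_order_acquire)) {
        sdb_pause(delay);
    }
}

static void sdb_release(sdb_lock *l) {
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

Unlike the delay-on-release variants, this one also throttles the reads, so it can help on machines where even a read of the lock crosses the bus or network.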

  15. 4/5: Backoff before Reference
          while (lock = BUSY or TestAndSet(lock) = BUSY) {
              pause(delay);
              delay += randomBackoff();
          }
          <critical section>
      • Analogous to backoff on lock release
      • Both dynamic and static backoff do badly when the critical section is long: processors just keep backing off while the lock is being held
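
A sketch of backoff before each reference, with one poll factored out so the growth of the delay is easy to see. The `bbr_*` names, the bounds, the xorshift generator, and the choice to grow the delay by a random amount up to its current value are assumptions made for illustration, not details from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { BBR_MIN_DELAY = 1, BBR_MAX_DELAY = 4096 };   /* hypothetical bounds */

typedef struct { atomic_bool busy; } bbr_lock;

typedef struct {
    unsigned delay;  /* current delay between references, in loop iterations */
    uint32_t rng;    /* per-processor xorshift state, must be nonzero */
} bbr_state;

static void bbr_init(bbr_lock *l) { atomic_init(&l->busy, false); }

static uint32_t bbr_rand(bbr_state *s) {
    uint32_t x = s->rng;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return s->rng = x;
}

static void bbr_pause(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) {
    }
}

/* One reference to the lock. On failure, wait and then lengthen the delay. */
static bool bbr_poll(bbr_lock *l, bbr_state *s) {
    assert(s->rng != 0 && s->delay >= BBR_MIN_DELAY);
    if (!atomic_load_explicit(&l->busy, memory_order_relaxed) &&
        !atomic_exchange_explicit(&l->busy, true, memory_order_acquire))
        return true;
    bbr_pause(s->delay);
    s->delay += 1 + bbr_rand(s) % s->delay;          /* delay += randomBackoff() */
    if (s->delay > BBR_MAX_DELAY) s->delay = BBR_MAX_DELAY;
    return false;
}

static void bbr_acquire(bbr_lock *l, bbr_state *s) {
    s->delay = BBR_MIN_DELAY;
    while (!bbr_poll(l, s)) {
    }
}

static void bbr_release(bbr_lock *l) {
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

Because the delay grows on every failed reference regardless of why it failed, a long critical section drives every waiter toward `BBR_MAX_DELAY`, which is the weakness noted on the slide.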

  16. 5/5: Queue
      • We can’t estimate backoff from the number of waiting processes, and we can’t keep an ordinary queue of waiting processes (maintaining it would be just as slow as the lock!)
      • This author’s contribution (finally):
          Init
            flags[0] := HAS_LOCK;
            flags[1..P-1] := MUST_WAIT;
            queueLast := 0;
          Lock
            myPlace := ReadAndIncrement(queueLast);
            while (flags[myPlace mod P] = MUST_WAIT);
          <critical section>
          Unlock
            flags[myPlace mod P] := MUST_WAIT;
            flags[(myPlace+1) mod P] := HAS_LOCK;
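
The queue lock pseudocode translates to C11 along these lines. This is a sketch under stated assumptions: `atomic_fetch_add_explicit` stands in for ReadAndIncrement, the `aq_*` names are hypothetical, `AQ_P` (the maximum number of processors) and `AQ_LINE` (the cache line size) are made-up constants, and padding each flag to its own line is my addition so that each waiter spins on a separate location.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define AQ_P    8    /* hypothetical processor count; at most AQ_P waiters at once */
#define AQ_LINE 64   /* assumed cache line size in bytes */

/* A power-of-two AQ_P keeps `place % AQ_P` consistent when the counter wraps. */
_Static_assert((AQ_P & (AQ_P - 1)) == 0, "AQ_P must be a power of two");

typedef struct {
    _Alignas(AQ_LINE) atomic_bool has_lock;   /* false = MUST_WAIT, true = HAS_LOCK */
} aq_flag;

typedef struct {
    aq_flag flags[AQ_P];
    atomic_uint queue_last;
} aq_lock;

static void aq_init(aq_lock *l) {
    atomic_init(&l->flags[0].has_lock, true);
    for (unsigned i = 1; i < AQ_P; i++)
        atomic_init(&l->flags[i].has_lock, false);
    atomic_init(&l->queue_last, 0);
}

/* Take a place in line, then spin only on that place's flag. */
static unsigned aq_acquire(aq_lock *l) {
    unsigned my_place =
        atomic_fetch_add_explicit(&l->queue_last, 1, memory_order_relaxed);
    while (!atomic_load_explicit(&l->flags[my_place % AQ_P].has_lock,
                                 memory_order_acquire)) {
    }
    return my_place;
}

/* Reset our flag for reuse, then hand the lock to the next place in line. */
static void aq_release(aq_lock *l, unsigned my_place) {
    assert(atomic_load_explicit(&l->flags[my_place % AQ_P].has_lock,
                                memory_order_relaxed));
    atomic_store_explicit(&l->flags[my_place % AQ_P].has_lock, false,
                          memory_order_relaxed);
    atomic_store_explicit(&l->flags[(my_place + 1) % AQ_P].has_lock, true,
                          memory_order_release);
}
```

The value returned by `aq_acquire` must be passed back to `aq_release`. A release writes one flag that exactly one waiter is watching, so the handoff costs a constant number of transfers instead of waking every spinner.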

  17. More on Queuing
      • Works especially well for multistage networks: each flag can be on a separate memory module, so no single memory location is saturated with requests
      • Works less well on a bus without cache coherence, because each waiting process still has to poll for its value over the one shared bus
      • Lock latency is increased (overhead), so performance is poor when there’s no contention

  18. Benchmark Spin-lock Alternatives

  19. Overhead vs. Number of Slots

  20. Spin-waiting Overhead for a Burst

  21. Network Hardware Solutions
      • Combining Networks
        • Multiple paths to same memory location
      • Hardware Queuing
        • Eliminates polling across the network
      • Goodman’s Queue Links
        • Stores the name of the next processor in the queue directly in each processor’s cache
        • Eliminates need for memory access for queuing

  22. Bus Hardware Solutions
      • Invalidate cache copies ONLY when Test-and-Set succeeds
      • Read broadcast
        • Whenever some other processor reads a value for which my cached copy is invalid, I get a copy of that value too (piggyback)
        • Eliminates the cascade of read-misses
      • Special handling of Test-and-Set
        • Cache and bus controllers don’t use the bus if the lock is busy
        • Essentially, the Test-and-Set isn’t performed so long as there is a possibility it might fail

  23. Conclusions
      • Spin-locking performance doesn’t scale
      • A variant of Ethernet backoff has good results when there is little lock contention
      • Queuing (parallelizing lock handoff) has good results when there are waiting processors
      • A little supportive hardware goes a long way towards a healthy multiprocessor relationship
