Practical Concerns for Scalable Synchronization

  1. Practical Concerns for Scalable Synchronization Jonathan Walpole (PSU) Paul McKenney (IBM) Tom Hart (University of Toronto)

  2. “Life is just one darned thing after another” - Elbert Hubbard

  3. “Multiprocessing is just one darned thing before, after or simultaneously with another”

  4. “Synchronization is about imposing order”

  5. The problem – race conditions • “i++” is dangerous if “i” is global • It compiles to three instructions: load %1,i / inc %1 / store %1,i [Diagram: CPU 0 and CPU 1 each poised to run this sequence against the same memory location i]

  6. The problem – race conditions [Diagram: both CPUs execute load %1,i and read the same value i]

  7. The problem – race conditions [Diagram: both CPUs execute inc %1, each holding i+1 in its own register]

  8. The problem – race conditions [Diagram: both CPUs execute store %1,i; memory ends up at i+1, so one of the two increments is lost]

  9. The solution – critical sections • Classic multiprocessor solution: spinlocks • CPU 1 waits for CPU 0 to release the lock • Counts are accurate, but locks are not free! spin_lock(&mylock); i++; spin_unlock(&mylock);
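For concreteness, a minimal user-space rendering of slides 5–9 (a sketch using POSIX spinlocks in place of the kernel's spin_lock(); the lock name mylock follows the slide):

    #include <pthread.h>
    #include <stdio.h>

    static long i = 0;                        /* shared global counter */
    static pthread_spinlock_t mylock;

    static void *worker(void *arg)
    {
        for (int n = 0; n < 1000000; n++) {
            pthread_spin_lock(&mylock);       /* without this, concurrent load/inc/store
                                                 sequences interleave and updates are lost */
            i++;
            pthread_spin_unlock(&mylock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_spin_init(&mylock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("i = %ld\n", i);               /* always 2000000 with the lock;
                                                 typically less if the locking is removed */
        return 0;
    }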

  10. Critical-section efficiency • Lock acquisition (Ta), critical section (Tc), lock release (Tr) • Critical-section efficiency = Tc / (Ta + Tc + Tr) • Ignoring lock contention and cache conflicts in the critical section
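To make the formula concrete: on the POWER4 numbers shown later in the deck, a local lock round trip (Ta + Tr) costs about 1057 normal-instruction times (slide 20), so a critical section of, say, 100 instructions (a figure assumed here purely for illustration) yields an efficiency of roughly 100 / (100 + 1057) ≈ 9%. Over 90% of the time goes to the lock itself.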

  11. Critical-section efficiency [Graph: critical-section efficiency vs. critical-section size]

  12. Performance of normal instructions

  13. What’s going on? • Taller memory hierarchies • Memory speeds have not kept up with CPU speeds • 1984: no caches needed, since instructions were slower than memory accesses • 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses

  14. Why does this matter? • Synchronization implies sharing data across CPUs • normal instructions tend to hit in top-level cache • synchronization operations tend to miss • Synchronization requires a consistent view of data • between cache and memory • across multiple CPUs • requires CPU-CPU communication • Synchronization instructions see memory latency!

  15. … but that’s not all! • Longer pipelines • 1984: Many clocks per instruction • 2005: Many instructions per clock, 20-stage pipelines • Out of order execution • Keeps the pipelines full • Must not reorder the critical section before its lock! • Synchronization instructions stall the pipeline!

  16. Reordering means weak memory consistency • Memory barriers – additional synchronization instructions are needed to manage reordering
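A minimal C11 sketch (not from the slides) of the barrier idea: the release store orders the data write before the flag, and the acquire load orders the data read after it, so the pair defeats hardware reordering exactly where it matters.

    #include <assert.h>
    #include <stdatomic.h>

    int data;                   /* plain shared data */
    _Atomic int ready;          /* flag guarding it */

    void producer(void)
    {
        data = 42;              /* must become visible before the flag does */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    void consumer(void)
    {
        if (atomic_load_explicit(&ready, memory_order_acquire))
            assert(data == 42); /* guaranteed by the release/acquire pairing */
    }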

  17. What is the cost of all this? (costs are relative to a normal instruction)

      Operation                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
      Normal instruction               1.0                   1.0

  18. Atomic increment

      Operation                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
      Normal instruction               1.0                   1.0
      Atomic increment                 183.1                 402.3

  19. Memory barriers

      Operation                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
      Normal instruction               1.0                   1.0
      Atomic increment                 183.1                 402.3
      SMP write memory barrier         328.6                 0.0
      Read memory barrier              328.9                 402.3
      Write memory barrier             400.9                 0.0

      (The Xeon's 0.0 entries reflect x86's ordered stores: write barriers compile to nothing there.)

  20. Lock acquisition/release with LL/SC

      Operation                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
      Normal instruction               1.0                   1.0
      Atomic increment                 183.1                 402.3
      SMP write memory barrier         328.6                 0.0
      Read memory barrier              328.9                 402.3
      Write memory barrier             400.9                 0.0
      Local lock round trip            1057.5                1138.8

  21. Compare & swap unknown values (NBS)

      Operation                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
      Normal instruction               1.0                   1.0
      Atomic increment                 183.1                 402.3
      SMP write memory barrier         328.6                 0.0
      Read memory barrier              328.9                 402.3
      Write memory barrier             400.9                 0.0
      Local lock round trip            1057.5                1138.8
      CAS cache transfer & invalidate  247.1                 847.1

  22. Compare & swap known values (spinlocks)

      Operation                        1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
      Normal instruction               1.0                   1.0
      Atomic increment                 183.1                 402.3
      SMP write memory barrier         328.6                 0.0
      Read memory barrier              328.9                 402.3
      Write memory barrier             400.9                 0.0
      Local lock round trip            1057.5                1138.8
      CAS cache transfer & invalidate  247.1                 847.1
      CAS blind cache transfer         257.1                 993.9

  23. The net result? • 1984: Lock contention was the main issue • 2005: Critical section efficiency is a key issue • Even if the lock is always free when you try to acquire it, performance can still suck!

  24. How has this affected OS design? • Multiprocessor OS designers search for “scalable” synchronization strategies • reader-writer locking instead of global locking • data locking and partitioning • Per-CPU reader-writer locking • Non-blocking synchronization • The “common case” is read-mostly access to linked lists and hash-tables • asymmetric strategies favouring readers are good

  25. Review - Global locking • A symmetric approach (also called “code locking”) • A critical section of code is guarded by a lock • Only one thread at a time can hold the lock • Examples include • Monitors • Java “synchronized” on global object • Linux spin_lock() on global spinlock_t • Global locking doesn’t scale due to lock contention!

  26. Review - Reader-writer locking • Many readers can concurrently hold the lock • Writers exclude readers and other writers • The result? • No lock contention in read-mostly scenarios • So it should scale well, right? • … wrong!
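For reference, the classic reader-writer interface in POSIX form (a sketch; the list and its traversal are placeholders):

    #include <pthread.h>

    struct node { struct node *next; /* ... payload ... */ };

    static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;
    static struct node *list_head;

    void reader(void)
    {
        pthread_rwlock_rdlock(&list_lock);   /* many readers may hold this concurrently */
        /* ... traverse list_head ... */
        pthread_rwlock_unlock(&list_lock);
    }

    void writer(struct node *n)
    {
        pthread_rwlock_wrlock(&list_lock);   /* excludes readers and other writers */
        n->next = list_head;
        list_head = n;
        pthread_rwlock_unlock(&list_lock);
    }

Note that even an uncontended rdlock/unlock pair atomically updates the shared lock word, which is precisely the critical-section-efficiency cost the next slide illustrates.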

  27. Scalability of reader/writer locking [Timeline: CPUs 0 and 1 alternate; each critical section is preceded by a read-acquire and a memory barrier as the lock cache line moves between CPUs] • Reader/writer locking does not scale due to critical section efficiency!

  28. Review - Data locking • A lock per data item instead of one per collection • Per-hash-bucket locking for hash tables • CPUs acquire locks for different hash chains in parallel • CPUs incur memory-latency and pipeline-flush overheads in parallel • Data locking improves scalability by executing critical section overhead in parallel
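A sketch of per-hash-bucket locking (structure and names are illustrative):

    #include <pthread.h>

    #define NBUCKETS 256

    struct node { struct node *next; unsigned key; /* ... value ... */ };

    static struct bucket {
        pthread_spinlock_t lock;   /* one lock per hash chain */
        struct node *chain;
    } table[NBUCKETS];             /* locks initialized elsewhere via pthread_spin_init() */

    void insert(struct node *n)
    {
        struct bucket *b = &table[n->key % NBUCKETS];
        pthread_spin_lock(&b->lock);    /* CPUs hashing to different chains proceed in parallel */
        n->next = b->chain;
        b->chain = n;
        pthread_spin_unlock(&b->lock);
    }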

  29. Review - Per-CPU reader-writer locking • One lock per CPU (called brlock in Linux) • Readers acquire their own CPU’s lock • Writers acquire all CPU’s locks • In read-only workloads CPUs never exchange locks • no memory latency is incurred • Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition for read-mostly scenarios
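A simplified sketch of the idea (the real Linux brlock differs in detail; NR_CPUS and the cache-line padding are assumptions):

    #include <pthread.h>

    #define NR_CPUS 8

    /* one reader-writer lock per CPU, each padded onto its own cache line(s);
       initialized elsewhere via pthread_rwlock_init() */
    static struct percpu_lock {
        pthread_rwlock_t lock;
    } __attribute__((aligned(128))) brlock[NR_CPUS];

    void br_read_lock(int cpu)   { pthread_rwlock_rdlock(&brlock[cpu].lock); } /* local line only */
    void br_read_unlock(int cpu) { pthread_rwlock_unlock(&brlock[cpu].lock); }

    void br_write_lock(void)     /* writers pay the full cost: every CPU's lock */
    {
        for (int i = 0; i < NR_CPUS; i++)
            pthread_rwlock_wrlock(&brlock[i].lock);
    }

    void br_write_unlock(void)
    {
        for (int i = 0; i < NR_CPUS; i++)
            pthread_rwlock_unlock(&brlock[i].lock);
    }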

  30. Scalability comparison • Expected scalability on read-mostly workloads • Global locking – poor due to lock contention • R/W locking – poor due to critical section efficiency • Data locking – better? • R/W data locking – better still? • Per-CPU R/W locking – the best we can do?

  31. Actual scalability [Graph: scalability of locking strategies using read-only workloads in a hash-table benchmark] • Measurements taken on a 4-CPU 700 MHz Pentium III system • Similar results are obtained on more recent CPUs

  32. Scalability on 1.45 GHz POWER4 CPUs

  33. Performance at different update fractions on 8 1.45 GHz POWER4 CPUs

  34. What are the lessons so far? • Avoid lock contention! • Avoid synchronization instructions! • … especially in the read path!

  35. How about non-blocking synchronization? • Basic idea – copy & flip pointer (no locks!) • Read a pointer to a data item • Create a private copy of the item to update in place • Swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer • CAS fails if current pointer not equal to initial value • Retry on failure • NBS should enable fast reads … in theory!
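A C11 sketch of copy-and-flip-pointer (the item type and the copy/update helpers are hypothetical):

    #include <stdatomic.h>

    struct item;
    struct item *copy_item(struct item *old);    /* hypothetical: allocate and copy */
    void update_copy(struct item *copy);         /* hypothetical: edit the private copy */

    static _Atomic(struct item *) head;

    void nbs_update(void)
    {
        struct item *oldp, *newp;
        do {
            oldp = atomic_load(&head);           /* read the current pointer */
            newp = copy_item(oldp);              /* private copy, updated in place */
            update_copy(newp);
        } while (!atomic_compare_exchange_weak(&head, &oldp, newp));
        /* CAS fails, and we retry, if head changed since we read it
           (a failed iteration leaks newp in this sketch) */
        /* oldp is now unreachable -- but when is it safe to free it? */
    }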

  36. Problems with NBS in practice • Reusing memory causes problems • Readers holding references can be hijacked during data structure traversals when memory is reclaimed • Readers see inconsistent data structures when memory is reused • How and when should memory be reclaimed?

  37. Immediate reclamation? • In practice, readers must either • Use LL/SC to test if pointers have changed, or • Verify that version numbers associated with data structures have not changed (2 memory barriers) • Synchronization instructions slow NBS readers!
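A sketch of the version-number check in seqlock style (names invented; the unsynchronized read of shared_data is simplified away here): the two barriers per attempt are exactly what slows NBS readers.

    #include <stdatomic.h>

    static _Atomic unsigned version;   /* writers make this odd, update, then make it even */
    static int shared_data;            /* the protected value, reduced to one int here */

    int read_validated(void)
    {
        unsigned v;
        int copy;
        do {
            do {                       /* spin while a writer is mid-update */
                v = atomic_load_explicit(&version, memory_order_acquire);   /* barrier 1 */
            } while (v & 1);
            copy = shared_data;
            atomic_thread_fence(memory_order_acquire);                      /* barrier 2 */
        } while (v != atomic_load_explicit(&version, memory_order_relaxed));
        return copy;                   /* version unchanged across the read: copy is consistent */
    }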

  38. Reader-friendly solutions • Never reclaim memory? • Type-stable memory? • Needs free pool per data structure type • Readers can still be hijacked to the free pool • Exposes OS to denial of service attacks • Ideally, defer reclaiming memory until it's safe! • Defer reclamation of a data item until references to it are no longer held by any thread

  39. How should we defer reclamation? • Wait for a while then delete? • … but how long should you wait? • Maintain reference counts or per-CPU hazard pointers on data? • Requires synchronization in read path! • Challenge – deferring destruction without using synchronization instructions in the read path
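For comparison, the hazard-pointer read path (per-thread here for simplicity; Michael's SMR scheme) shows where that read-side synchronization lands:

    #include <stdatomic.h>

    #define MAX_THREADS 8

    struct node;
    static _Atomic(struct node *) head;                  /* entry to the shared structure */
    static _Atomic(struct node *) hazard[MAX_THREADS];   /* one published pointer per thread */

    struct node *acquire_ref(int tid)
    {
        struct node *p;
        do {
            p = atomic_load(&head);
            atomic_store(&hazard[tid], p);   /* publish "I am using p"; the seq_cst store
                                                doubles as the full barrier the scheme needs */
        } while (p != atomic_load(&head));   /* re-validate: retry if p was replaced meanwhile */
        return p;   /* reclaimers must skip p until hazard[tid] is cleared */
    }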

  40. Quiescent-state-based reclamation • Coding convention: • Don’t allow a quiescent state to occur in a read-side critical section • Reclamation strategy: • Only reclaim data after all CPUs in the system have passed through a quiescent state • Example quiescent states: • Context switch in non-preemptive kernel • Yield in preemptive kernel • Return from system call …

  41. Coding conventions for readers • Delineate read-side critical section • Compiles to nothing on most architectures • Don’t hold references outside critical sections • Re-traverse data structure to pick up reference • Don’t yield the CPU during critical sections • Don’t voluntarily yield • Don’t block, don’t leave the kernel …

  42. Overview of the basic idea • Writers create new versions • Using locking or NBS to synchronize with each other • Register call-backs to destroy old versions when safe • Call-backs are deferred and memory reclaimed in batches • Readers do not use synchronization • While they hold a reference to a version it will not be destroyed • Completion of read-side critical sections inferred from observation of quiescent states
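In Linux-kernel terms (a sketch: struct foo, gp and the update logic are illustrative, but rcu_read_lock(), rcu_dereference(), rcu_assign_pointer() and call_rcu() are the real API):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct foo {
        int a;
        struct rcu_head rcu;
    };
    static struct foo __rcu *gp;   /* RCU-protected global pointer (assumed non-NULL below) */

    static void free_foo_cb(struct rcu_head *head)
    {
        kfree(container_of(head, struct foo, rcu));
    }

    /* Reader: no locks, no atomic instructions; in a non-preemptive kernel
       rcu_read_lock()/rcu_read_unlock() compile to nothing */
    static int reader(void)
    {
        int a;
        rcu_read_lock();                    /* delineate the read-side critical section */
        a = rcu_dereference(gp)->a;         /* this version cannot be freed while we hold it */
        rcu_read_unlock();
        return a;
    }

    /* Writer: create a new version, publish it, defer destruction of the old one */
    static void writer(int a, spinlock_t *lock)
    {
        struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);  /* error handling omitted */
        struct foo *oldp;

        newp->a = a;
        spin_lock(lock);                    /* writers still synchronize with each other */
        oldp = rcu_dereference_protected(gp, lockdep_is_held(lock));
        rcu_assign_pointer(gp, newp);       /* publish the new version */
        spin_unlock(lock);
        call_rcu(&oldp->rcu, free_foo_cb);  /* reclaimed only after a grace period */
    }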

  43. Context switch as a quiescent state [Timeline: CPU 0 runs a series of RCU read-side critical sections separated by context switches while CPU 1 removes an element. A critical section running when the element is removed may hold a reference to the old version; after CPU 0's next context switch it can't hold a reference, although RCU can't tell until it observes that switch]

  44. Grace periods [Timeline: as on the previous slide, but CPU 1 deletes an element. The grace period extends from the deletion until every CPU has passed through a context switch; only when the grace period ends may the old element be reclaimed]

  45. Quiescent states and grace periods • Example quiescent states • Context switch (non-preemptive kernels) • Voluntary context switch (preemptive kernels) • Kernel entry/exit • Blocking call • Grace periods • A period during which every CPU has gone through a quiescent state

  46. Efficient implementation • Choosing good quiescent states • Occur anyway • Easy to count • Not too frequent or infrequent • Recording and dispatching call-backs • Minimize inter-CPU communication • Maintain per-CPU queues of call-backs • Two queues – waiting for grace period start and end

  47. RCU's data structures [Diagram: global state is a CPU bitmask and a grace-period number; each CPU keeps a counter, a counter snapshot, its own grace-period number, and two callback queues – 'next' callbacks queued by call_rcu() and waiting for the end of the previous grace period (if any), and 'current' callbacks waiting for the end of the current grace period]
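A hypothetical C rendering of the structures named on the slide (all field names invented for illustration):

    #define NR_CPUS 8

    struct rcu_callback {
        struct rcu_callback *next;
        void (*func)(struct rcu_callback *cb);
    };

    /* global state */
    static unsigned long rcu_cpu_bitmask;  /* CPUs that have not yet passed a quiescent state */
    static long rcu_global_gp;             /* global grace-period number */

    /* per-CPU state, kept local to minimize inter-CPU communication (slide 46) */
    static struct rcu_percpu {
        long counter;                      /* bumped at each quiescent state */
        long counter_snapshot;             /* value sampled when the grace period began */
        long gp_number;                    /* grace period this CPU has observed */
        struct rcu_callback *next_cbs;     /* queued by call_rcu(); wait for the next GP */
        struct rcu_callback *curr_cbs;     /* invoked at the end of the current GP */
    } rcu_percpu[NR_CPUS];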

  48. RCU implementations • DYNIX/ptx RCU (data center) • Linux • Multiple implementations (in 2.5 and 2.6 kernels) • Preemptible and nonpreemptible • Tornado/K42 “generations” • Preemptive kernel • Helped generalize usage

  49. Experimental results • How do different combinations of RCU, SMR, NBS and Locking compare? • Hash table mini-benchmark running on 1.45 GHz POWER4 system with 8 CPUs • Various workloads • Read/update fraction • Hash table size • Memory constraints • Number of CPUs

  50. Scalability with working set in cache
