
Programming Multi-Core Systems




  1. Programming Multi-Core Systems • Feng Zhou (周枫), NetEase (网易公司) • 2007-12-28, Tsinghua University (清华大学) • http://zhoufeng.net

  2. Last Week • A trend in software: “The Movement to Safe Languages” • Improving dependability of system software without using safe languages

  3. A Trend in Hardware “The Movement to Multi-core Processors” • Originates from inability to increase processor clock rate • Profound impact on architecture, OS and applications Topic today: How to program multi-core systems effectively?

  4. Why Multi-Core? • Three sources of CPU performance gains in the last 30 years: (1) clock speed, (2) execution optimization (fewer cycles per instruction), (3) cache • All 3 are concurrency-agnostic • Now: 1 has stalled, 2 is slowing down, 3 still helps • 1: No 4 GHz Intel CPUs • 2: Still improves with each new micro-architecture (e.g. Core vs. NetBurst)

  5. CPU Speed History From “Computer Architecture: A Quantitative Approach”, 4th edition, 2007

  6. Reality Today • Near-term performance drivers: (1) multicore, (2) hardware threads (a.k.a. Simultaneous Multi-Threading), (3) cache • 1 and 2 are useful only when software is concurrent • “Concurrency is the next revolution in how we write software” --- “The Free Lunch is Over”

  7. Costs/Problems of Concurrency • Overhead of locks, message passing… • Not all programs are parallelizable • Programming concurrently is HARD • Complex concepts: mutex, read-write lock, queue… • Correct synchronization: race, deadlocks… • Getting speed-up Our focus today

  8. Potential Multi-Core Apps

  9. Roadmap • Overview of multi-core computing • Overview of transactional programming • Case: Transactions for safe/managed languages • Case: Transactions for languages like C

  10. Status Quo in Synchronization • Current mechanism: manual locking • Organization: lock for each shared structure • Usage: (block) → acquire → access → release • Correctness issues • Under-locking → data races • Acquires in different orders → deadlock • Performance issues • Difficult to find right granularity • Overhead of acquiring vs. allowed concurrency
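The "acquires in different orders" problem on the slide above can be avoided by hand if every thread takes locks in one fixed order. The following is a minimal sketch of that discipline, not code from the talk: the `account_t` type and the `lock_both`, `unlock_both` and `transfer` helpers are hypothetical names, and ordering by address is just one possible convention.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared structure: one lock per structure, as on the slide. */
typedef struct {
    pthread_mutex_t lock;
    long balance;
} account_t;

/* Acquire both locks in a fixed (address) order, so two threads that
 * transfer in opposite directions cannot deadlock on each other. */
static void lock_both(account_t *a, account_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) { account_t *t = a; a = b; b = t; }
    pthread_mutex_lock(&a->lock);
    if (a != b) pthread_mutex_lock(&b->lock);
}

static void unlock_both(account_t *a, account_t *b) {
    pthread_mutex_unlock(&a->lock);
    if (a != b) pthread_mutex_unlock(&b->lock);
}

/* (block) -> acquire -> access -> release. Returns 1 on success,
 * 0 if the source account has insufficient funds. */
static int transfer(account_t *from, account_t *to, long amount) {
    int ok = 0;
    assert(amount >= 0);
    lock_both(from, to);
    if (from->balance >= amount) {
        from->balance -= amount;
        to->balance += amount;
        ok = 1;
    }
    unlock_both(from, to);
    return ok;
}
```

The ordering rule has to be followed at every call site, which is exactly the burden the rest of the talk tries to remove.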

  11. Transactions / Atomic Sections • Databases have provided automatic concurrency control for 30 years: ACID transactions • Vision: • Atomicity • Isolation • Serialization only on conflicts • (optional) Rollback/abort support • Question: Is it possible to provide database transaction semantics to general programming?

  12. Transactions vs. Manual Locks • Manual locking issues: • Under-locking • Acquires in different orders • Blocking • Conservative serialization • How transactions help: • No explicit locks • No ordering • Can cancel transactions • Serialization only on conflicts Transactions: Potentially simpler and more efficient

  13. Design Space • Hardware Transactional Memory vs. software TM • Granularity: object, word, block • Update method • Deferred: discard private copy on aborts • Direct: control access to data, erase update on aborts • Concurrency control • Pessimistic: prevent conflicts by locking • Optimistic: assume no conflict and retry if one occurs • …

  14. Hard Issues • Communications, or side-effects • File I/O • Database accesses • Interaction with other abstractions • Garbage collection • Virtual memory • Work with existing languages, synchronization primitives, …

  15. Why software TM? • More flexible • Easier to modify and evolve • Integrate better with existing systems and languages • Not limited to fixed-size hardware structures, e.g. caches

  16. Proposed Language Support
      // Insert into a doubly-linked list atomically
      atomic {
        newNode->prev = node;
        newNode->next = node->next;
        node->next->prev = newNode;
        node->next = newNode;
      }
      Guard condition:
      atomic (queueSize > 0) {
        // remove item from queue and use it
      }

  17. Transactions for Managed/Safe Languages

  18. Intro • Optimizing Memory Transactions, Harris et al., PLDI ’06 • STM system from MSR Cambridge, for MSIL (.NET) • Design features • Object-level granularity • Direct update • Version numbers to track reads • 2-phase locking to track writes • Compiler optimizations

  19. First-gen STM • Very expensive (e.g. Harris+Fraser, OOPSLA ’03): • Every load and store instruction logs information into a thread-local log • A store instruction writes the log only • A load instruction consults the log first • At the end of the block: validate the log, and atomically commit it to shared memory
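The deferred-update scheme on the slide above can be sketched in a few lines of C. This is an illustration under stated assumptions, not the Harris+Fraser implementation: `tx_log_t`, `tx_store`, `tx_load`, `tx_commit` and `tx_abort` are hypothetical names, the log has a fixed capacity, and validation and the atomicity of the commit step are left out.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAP 16

/* One buffered store: the target address and the value it should get. */
typedef struct { int *addr; int value; } log_entry_t;

/* Thread-local transaction log (hypothetical layout). */
typedef struct { log_entry_t entries[LOG_CAP]; size_t len; } tx_log_t;

/* A store writes the log only; shared memory is untouched until commit. */
static void tx_store(tx_log_t *tx, int *addr, int value) {
    for (size_t i = 0; i < tx->len; i++) {
        if (tx->entries[i].addr == addr) {
            tx->entries[i].value = value;
            return;
        }
    }
    assert(tx->len < LOG_CAP);
    tx->entries[tx->len].addr = addr;
    tx->entries[tx->len].value = value;
    tx->len++;
}

/* A load consults the log first, so the transaction sees its own writes. */
static int tx_load(const tx_log_t *tx, const int *addr) {
    for (size_t i = 0; i < tx->len; i++) {
        if (tx->entries[i].addr == addr) return tx->entries[i].value;
    }
    return *addr;
}

/* Commit copies the buffered values to shared memory. A real STM would
 * validate the log first and make this step atomic; both are omitted. */
static void tx_commit(tx_log_t *tx) {
    for (size_t i = 0; i < tx->len; i++) {
        *tx->entries[i].addr = tx->entries[i].value;
    }
    tx->len = 0;
}

/* Abort just discards the private copy. */
static void tx_abort(tx_log_t *tx) { tx->len = 0; }
```

Every load pays for a log lookup even when the transaction has written nothing, which hints at why the slide calls this design very expensive.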

  20. Direct Update STM • Augment objects with (i) a lock, (ii) a version number • Transactional write: • Lock objects before they are written to (abort if another thread has that lock) • Log the overwritten data – we need it to restore the heap in case of retry, transaction abort, or a conflict with a concurrent thread • Make the update in place to the object

  21. Transactional read: • Log the object’s version number • Read from the object itself • Commit: • Check the version numbers of objects we’ve read • Increment the version numbers of objects we’ve written, unlocking them
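The write path described above (lock, log the old value, update in place) can be sketched as follows. This is a single-threaded illustration with hypothetical names (`obj_t`, `tx_t`, `tx_write`, `tx_abort`, `tx_commit_writes`); a real implementation would take the lock with an atomic compare-and-swap and would also validate reads before committing.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER 0
#define UNDO_CAP 16

/* Object augmented with (i) a lock (owner id) and (ii) a version number. */
typedef struct { int owner; unsigned version; int val; } obj_t;

/* Undo-log entry: which object was overwritten and what it held. */
typedef struct { obj_t *obj; int old_val; } undo_entry_t;

typedef struct {
    int id;                       /* nonzero transaction id */
    undo_entry_t undo[UNDO_CAP];
    size_t undo_len;
} tx_t;

/* Transactional write: lock, log the overwritten data, update in place.
 * Returns 0 if another transaction holds the lock (caller should abort). */
static int tx_write(tx_t *tx, obj_t *o, int new_val) {
    if (o->owner != NO_OWNER && o->owner != tx->id) return 0;
    o->owner = tx->id;
    assert(tx->undo_len < UNDO_CAP);
    tx->undo[tx->undo_len].obj = o;
    tx->undo[tx->undo_len].old_val = o->val;
    tx->undo_len++;
    o->val = new_val;
    return 1;
}

/* Abort: replay the undo log newest-first to restore the heap, then
 * unlock. The version number is left unchanged. */
static void tx_abort(tx_t *tx) {
    while (tx->undo_len > 0) {
        undo_entry_t *e = &tx->undo[--tx->undo_len];
        e->obj->val = e->old_val;
        e->obj->owner = NO_OWNER;
    }
}

/* Commit, write side only: bump the version of each written object
 * once and unlock it. */
static void tx_commit_writes(tx_t *tx) {
    for (size_t i = 0; i < tx->undo_len; i++) {
        obj_t *o = tx->undo[i].obj;
        if (o->owner == tx->id) {
            o->version++;
            o->owner = NO_OWNER;
        }
    }
    tx->undo_len = 0;
}
```

Because updates go straight to the object, a committing transaction has nothing to copy; the cost moves to the abort path, which has to walk the undo log.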

  22. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: ver = 100, val = 10
      c2: ver = 200, val = 40
      T1’s log: (empty)
      T2’s log: (empty)

  23. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: ver = 100, val = 10
      c2: ver = 200, val = 40
      T1’s log: c1.ver=100
      T2’s log: (empty)
      T1 reads from c1: logs that it saw version 100

  24. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: ver = 100, val = 10
      c2: ver = 200, val = 40
      T1’s log: c1.ver=100
      T2’s log: c1.ver=100
      T2 also reads from c1: logs that it saw version 100

  25. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: ver = 100, val = 10
      c2: ver = 200, val = 40
      T1’s log: c1.ver=100, c2.ver=200
      T2’s log: c1.ver=100
      Suppose T1 now reads from c2, sees it at version 200

  26. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: locked:T2, val = 10
      c2: ver = 200, val = 40
      T1’s log: c1.ver=100
      T2’s log: c1.ver=100; lock: c1, 100
      Before updating c1, thread T2 must lock it: record old version number

  27. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: locked:T2, val = 11
      c2: ver = 200, val = 40
      T1’s log: c1.ver=100
      T2’s log: c1.ver=100; lock: c1, 100; c1.val=10
      (1) Before updating c1.val, thread T2 must log the data it’s going to overwrite
      (2) After logging the old value, T2 makes its update in place to c1

  28. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: ver = 101, val = 11
      c2: ver = 200, val = 40
      T1’s log: c1.ver=100
      T2’s log: c1.ver=100; lock: c1, 100; c1.val=10
      (1) Check the version we locked matches the version we previously read
      (2) T2’s transaction commits successfully. Unlock the object, installing the new version number

  29. Example: contention between transactions
      Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
      Thread T2: atomic { t = c1.val; t++; c1.val = t; }
      c1: ver = 101, val = 11
      c2: ver = 200, val = 40
      T1’s log: c1.ver=100
      T2’s log: c1.ver=100; lock: c1, 100; c1.val=10
      (1) T1 attempts to commit. Check the versions it read are still up-to-date.
      (2) Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run.
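The interleaving in this example can be replayed deterministically in one thread to see why T1 fails validation. The sketch below is an assumption-laden illustration, not the MSR implementation: all type and function names are hypothetical, locking is not atomic, and undo logging and rollback on a failed commit are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER 0
#define LOG_CAP 8

typedef struct { int owner; unsigned version; int val; } obj_t;

/* Read-log entry: the object and the version seen when it was read. */
typedef struct { obj_t *obj; unsigned seen; } read_entry_t;

typedef struct {
    int id;                                  /* nonzero transaction id */
    read_entry_t reads[LOG_CAP]; size_t nreads;
    obj_t *writes[LOG_CAP];      size_t nwrites;
} tx_t;

/* Transactional read: log the version, then read the object itself. */
static int tx_read(tx_t *tx, obj_t *o) {
    assert(tx->nreads < LOG_CAP);
    tx->reads[tx->nreads].obj = o;
    tx->reads[tx->nreads].seen = o->version;
    tx->nreads++;
    return o->val;
}

/* Transactional write, simplified: lock and update in place. */
static int tx_write(tx_t *tx, obj_t *o, int v) {
    if (o->owner != NO_OWNER && o->owner != tx->id) return 0;
    if (o->owner == NO_OWNER) {
        assert(tx->nwrites < LOG_CAP);
        o->owner = tx->id;
        tx->writes[tx->nwrites++] = o;
    }
    o->val = v;
    return 1;
}

/* Commit: every version read must be unchanged and not locked by another
 * transaction; then bump and unlock the written objects. Returns 0 when
 * the transaction must be aborted and re-run. */
static int tx_commit(tx_t *tx) {
    for (size_t i = 0; i < tx->nreads; i++) {
        obj_t *o = tx->reads[i].obj;
        if (o->version != tx->reads[i].seen) return 0;
        if (o->owner != NO_OWNER && o->owner != tx->id) return 0;
    }
    for (size_t i = 0; i < tx->nwrites; i++) {
        tx->writes[i]->version++;
        tx->writes[i]->owner = NO_OWNER;
    }
    tx->nreads = tx->nwrites = 0;
    return 1;
}

/* Replays the example: returns 1 if T2 commits and T1 fails validation. */
static int replay_example(void) {
    obj_t c1 = { NO_OWNER, 100, 10 }, c2 = { NO_OWNER, 200, 40 };
    tx_t t1 = { .id = 1 }, t2 = { .id = 2 };
    int t = 0;
    t += tx_read(&t1, &c1);        /* T1 logs c1.ver=100 */
    int u = tx_read(&t2, &c1);     /* T2 logs c1.ver=100 */
    t += tx_read(&t1, &c2);        /* T1 logs c2.ver=200 */
    u++;
    tx_write(&t2, &c1, u);         /* T2 locks c1, writes 11 in place */
    int t2_ok = tx_commit(&t2);    /* c1 goes to version 101 */
    int t1_ok = tx_commit(&t1);    /* c1.ver=100 is stale: abort */
    return t2_ok && !t1_ok && c1.val == 11 && c1.version == 101 && t == 50;
}
```

T1 computed t = 50 from a consistent snapshot, yet it still aborts: validation checks versions, not values, so any committed write to an object in the read log forces a re-run.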

  30. Compiler integration • We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL) • OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples) • OpenForUpdate – before the first time we update an object • LogOldValue – before the first time we write to a given field
      Source code: atomic { … t += n.value; n = n.next; … }
      Basic intermediate code: OpenForRead(n); t = n.value; OpenForRead(n); n = n.next;
      Optimized intermediate code: OpenForRead(n); t = n.value; n = n.next;
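The step from basic to optimized intermediate code amounts to removing an OpenForRead on an object that is already open. A toy version of such a pass over straight-line code might look like this; the `op_t` encoding and the `elim_redundant_opens` function are invented for illustration and say nothing about how the real compiler represents its intermediate code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VARS 8

typedef enum { OP_OPEN_FOR_READ, OP_LOAD_FIELD } op_kind_t;

/* Toy instruction. OP_OPEN_FOR_READ opens the object named by variable
 * `obj` (dst is unused, set to -1). OP_LOAD_FIELD is `dst = obj.field`. */
typedef struct { op_kind_t kind; int obj; int dst; } op_t;

/* Drop OpenForRead(v) when v was already opened earlier in this
 * straight-line block and has not been reassigned since. Compacts the
 * array in place and returns the new length. */
static size_t elim_redundant_opens(op_t *code, size_t n) {
    int opened[MAX_VARS] = {0};
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        op_t op = code[i];
        assert(op.obj >= 0 && op.obj < MAX_VARS);
        if (op.kind == OP_OPEN_FOR_READ) {
            if (opened[op.obj]) continue;   /* redundant: skip it */
            opened[op.obj] = 1;
        } else {
            assert(op.dst >= 0 && op.dst < MAX_VARS);
            /* dst may now name a different object, so it must be
             * reopened before its next read. */
            opened[op.dst] = 0;
        }
        code[out++] = op;
    }
    return out;
}
```

On the slide's sequence the second OpenForRead(n) is removed, but an OpenForRead(n) placed after `n = n.next` would be kept, because n then refers to the next node.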

  31. Runtime integration – garbage collection
      Log of a running atomic { … } block: obj1.field = old; obj2.ver = 100; obj3.locked @ 100
      (1) GC runs while some threads are in atomic blocks

  32. Runtime integration – garbage collection
      Log of a running atomic { … } block: obj1.field = old; obj2.ver = 100; obj3.locked @ 100
      (1) GC runs while some threads are in atomic blocks
      (2) GC visits the heap as normal – retaining objects that are needed if the blocks succeed

  33. Runtime integration – garbage collection
      Log of a running atomic { … } block: obj1.field = old; obj2.ver = 100; obj3.locked @ 100
      (1) GC runs while some threads are in atomic blocks
      (2) GC visits the heap as normal – retaining objects that are needed if the blocks succeed
      (3) GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back

  34. Runtime integration – garbage collection
      Log of a running atomic { … } block: obj1.field = old; obj2.ver = 100; obj3.locked @ 100
      (1) GC runs while some threads are in atomic blocks
      (2) GC visits the heap as normal – retaining objects that are needed if the blocks succeed
      (3) GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back
      (4) Discard log entries for unreachable objects: they’re dead whether or not the block succeeds

  35. Results: Against Previous Work • Normalised execution time (sequential baseline = 1.00x): • Coarse-grained locking: 1.13x • Direct-update STM + compiler integration: 1.46x • Direct-update STM: 2.04x • Fine-grained locking: 2.57x • Harris+Fraser WSTM: 5.69x • Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535 • Scalable to multicore

  36. Scalability (µ-benchmark) [Plot: microseconds per operation vs. #threads for coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM + compiler integration]

  37. Results: long running tests [Bar chart: normalised execution time on the benchmarks tree, skip, go, merge-sort and xlisp; series: original application (no tx), direct-update STM, run-time filtering, compile-time optimizations; off-scale bar labels: 10.8, 73, 162]

  38. Summary • A pure software implementation can perform well and scale to large transactions • Direct update • Pessimistic & optimistic CC • Compiler support & optimizations • Still need a better understanding of realistic workload distributions

  39. Atomic Sections for C-like Languages

  40. Intro • Autolocker: Synchronization Inference for Atomic Sections, Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL 2006 • Question: Can we have language support for atomic sections for languages like C? • No object meta-data • No type safety • No garbage collection • Answer: Let the programmer help a bit (w/ annotations) • And solve a simpler problem: NO aborts!

  41. Autolocker: C + Atomic Sections • Shared data is protected by annotated locks • Threads access shared data in atomic sections • Code runs as if a single lock protects all atomic sections • Threads never deadlock (due to Autolocker) • Threads never race for protected data • How can we implement these semantics?
      mutex m;
      int shared_var protected_by(m);
      atomic { ... x = shared_var; ... }

  42. Autolocker Transformation • Autolocker is a source-to-source transformation
      C code: mutex m1, m2; int x protected_by(m1); int y protected_by(m2); atomic { x = 3; y = 2; }
      Autolocker code: int m1, m2; int x; int y; begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2; end_atomic();

  43. Autolocker Transformation • Autolocker is a source-to-source transformation • Atomic sections can be nested arbitrarily • The nesting level is tracked at runtime
      C code: mutex m1, m2; int x protected_by(m1); int y protected_by(m2); atomic { x = 3; y = 2; }
      Autolocker code: int m1, m2; int x; int y; begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2; end_atomic();

  44. Autolocker Transformation • Autolocker is a source-to-source transformation • Locks are acquired as needed • Lock acquisitions are reentrant
      C code: mutex m1, m2; int x protected_by(m1); int y protected_by(m2); atomic { x = 3; y = 2; }
      Autolocker code: int m1, m2; int x; int y; begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2; end_atomic();

  45. Autolocker Transformation • Autolocker is a source-to-source transformation • Locks are released when the outermost section ends • Strict two-phase locking: guarantees atomicity
      C code: mutex m1, m2; int x protected_by(m1); int y protected_by(m2); atomic { x = 3; y = 2; }
      Autolocker code: int m1, m2; int x; int y; begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2; end_atomic();

  46. Autolocker Transformation • Autolocker is a source-to-source transformation • Locks are acquired in a global order • Acquiring locks in order will never lead to deadlock
      C code: mutex m1, m2; int x protected_by(m1); int y protected_by(m2); atomic { x = 3; y = 2; }
      Autolocker code: int m1, m2; int x; int y; begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2; end_atomic();
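The runtime behaviour that the transformed code relies on (nesting count, reentrant acquire, release at the end of the outermost section) can be sketched as below. The function names follow the slides, but the bodies are a guess at a minimal runtime rather than Autolocker's actual one: the `al_lock_t` type and its `rank` field are hypothetical, and the global order is only checked with an assertion here, whereas the slides describe Autolocker as inserting acquisitions in order itself.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_HELD 16

/* Hypothetical lock type: a mutex plus its rank in the global order. */
typedef struct { pthread_mutex_t mu; int rank; } al_lock_t;

/* Per-thread state: nesting depth and the locks held so far. */
static _Thread_local int depth = 0;
static _Thread_local al_lock_t *held[MAX_HELD];
static _Thread_local int nheld = 0;

/* Sections nest arbitrarily; only the nesting level is tracked. */
static void begin_atomic(void) { depth++; }

/* Reentrant acquire: a lock already held by this thread is a no-op. */
static void acquire(al_lock_t *l) {
    assert(depth > 0);
    for (int i = 0; i < nheld; i++) {
        if (held[i] == l) return;
    }
    /* Global order: a new lock must rank above every lock already held.
     * Locks are taken in increasing rank, so the last one is the highest. */
    assert(nheld == 0 || held[nheld - 1]->rank < l->rank);
    assert(nheld < MAX_HELD);
    pthread_mutex_lock(&l->mu);
    held[nheld++] = l;
}

/* Strict two-phase locking: nothing is released until the outermost
 * section ends, and then everything is released at once. */
static void end_atomic(void) {
    assert(depth > 0);
    if (--depth > 0) return;
    while (nheld > 0) {
        pthread_mutex_unlock(&held[--nheld]->mu);
    }
}
```

With m1 ranked below m2, the sequence begin_atomic(); acquire(&m1); acquire(&m2); end_atomic(); from the slides passes the order check, while acquiring m2 before m1 would trip the assertion.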

  47. Outline • Introduction • semantics • usage model and benefits • Autolocker algorithm • match locks to data • order lock acquisitions • insert lock acquisitions • Related work • Experimental evaluation • Conclusion

  48. Autolocker Usage Model • Typical process for writing threaded software: • Start here: one coarse-grained lock, lots of contention, little parallelism • Finish here: many fine-grained locks, low contention, high parallelism • Linux kernel evolved to SMP this way • Autolocker helps you program in this style [Diagram legend: shared data, lock, threads]

  49. Granularity and Autolocker • In Autolocker, annotations control performance: • int shared_var protected_by(kernel_lock); • int shared_var protected_by(fs->lock); • Simpler than fixing all uses of shared_var • Changing annotations won’t introduce bugs • no deadlocks • no new race conditions

  50. Outline • Introduction • semantics • usage model and benefits • Autolocker algorithm • match locks to data • order lock acquisitions • insert lock acquisitions • Related work • Experimental evaluation • Conclusion
