
Compiler and Runtime Support for Efficient Software Transactional Memory



Presentation Transcript


  1. Compiler and Runtime Support for Efficient Software Transactional Memory • Vijay Menon • Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

  2. Motivation • Multi-core architectures are mainstream • Software concurrency needed for scalability • Concurrent programming is hard • Difficult to reason about shared data • Traditional mechanism: Lock-based Synchronization • Hard to use • Must be fine-grain for scalability • Deadlocks • Not easily composable • New Solution: Transactional Memory (TM) • Simpler programming model: Atomicity, Consistency, Isolation • No deadlocks • Composability • Optimistic concurrency • Analogy • GC : Memory allocation ≈ TM : Mutual exclusion

  3. Composability
         class Bank {
           ConcurrentHashMap accounts;
           …
           void deposit(String name, int amount) {
             synchronized (accounts) {
               int balance = accounts.get(name);   // Get the current balance
               balance = balance + amount;         // Increment it
               accounts.put(name, balance);        // Set the new balance
             }
           }
           …
         }
     • Thread-safe – but no scaling • ConcurrentHashMap (Java 5/JSR 166) does not help • Performance requires redesign from scratch & fine-grain locking
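To make the "redesign" point concrete, here is a minimal sketch of what a hand-written, lock-free deposit could look like on top of `ConcurrentHashMap`. It is not from the slides: the class name `CasBank` and the `runDemo` helper are hypothetical. It relies only on `putIfAbsent` and the three-argument `replace`, both part of `ConcurrentMap` since Java 5. The single `get`/`put` pair has to become an explicit retry loop, which is the kind of restructuring the slide is warning about.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical lock-free variant of the Bank example (not the authors' code).
class CasBank {
    private final ConcurrentHashMap<String, Integer> accounts = new ConcurrentHashMap<>();

    void deposit(String name, int amount) {
        while (true) {
            Integer old = accounts.get(name);
            if (old == null) {
                // No account yet: succeed only if nobody created it meanwhile.
                if (accounts.putIfAbsent(name, amount) == null) return;
            } else if (accounts.replace(name, old, old + amount)) {
                // replace() succeeds only if the balance is still 'old'.
                return;
            }
            // Another thread won the race: re-read and retry.
        }
    }

    int balance(String name) {
        Integer b = accounts.get(name);
        return b == null ? 0 : b;
    }

    // Runs 'threads' workers that each deposit 1 unit 'perThread' times
    // into one account and returns the final balance.
    static int runDemo(int threads, int perThread) {
        CasBank bank = new CasBank();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) bank.deposit("alice", 1);
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return bank.balance("alice");
    }
}
```

Calling `CasBank.runDemo(4, 1000)` should yield a balance of 4000 with no lost updates. The loop only covers a single account, though; an operation touching two accounts (a transfer, say) could not be composed from two such deposits without further redesign.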

  4. Transactional solution
         class Bank {
           HashMap accounts;
           …
           void deposit(String name, int amount) {
             atomic {
               int balance = accounts.get(name);   // Get the current balance
               balance = balance + amount;         // Increment it
               accounts.put(name, balance);        // Set the new balance
             }
           }
           …
         }
     Underlying system provides: • isolation (thread safety) • optimistic concurrency

  5. Transactions are Composable • Chart: scalability on 16-way 2.2 GHz Xeon system

  6. Our System • A Java Software Transactional Memory (STM) System • Pure software implementation • Language extensions in Java • Integrated with JVM & JIT • Novel Features • Rich transactional language constructs in Java • Efficient, first class nested transactions • RISC-like STM API • Compiler optimizations • Per-type word and object level conflict detection • Complete GC support

  7. System Overview • Pipeline: Transactional Java → (Polyglot) → Java + STM API → (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code • Runtime: ORP VM, McRT STM

  8. Transactional Java • Java + new language constructs: • Atomic: execute block atomically • atomic {S} • Retry: block until alternate path possible • atomic {… retry;…} • Orelse: compose alternate atomic blocks • atomic {S1} orelse {S2} … orelse {Sn} • Tryatomic: atomic with escape hatch • tryatomic {S} catch(TxnFailed e) {…} • When: conditionally atomic region • when (condition) {S} • Builds on prior research • Concurrent Haskell, CAML, CILK, Java • HPCS languages: Fortress, Chapel, X10

  9. Transactional Java → Java
     Transactional Java:
         atomic {
           S;
         }
     STM API: txnStart[Nested], txnCommit[Nested], txnAbortNested, txnUserRetry, ...
     Standard Java + STM API:
         while(true) {
           TxnHandle th = txnStart();
           try {
             S’;
             break;
           } finally {
             if(!txnCommit(th))
               continue;
           }
         }
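The translated loop depends on a subtle Java rule: a `continue` inside `finally` discards the pending `break`, so the body runs again whenever the commit fails. The sketch below exercises that control flow against a stub runtime. Only the names `txnStart`, `txnCommit` and `TxnHandle` come from the slide; the class `RetryLoopSketch`, its fields, and the stub bodies are assumptions made for illustration and do not perform real conflict detection.

```java
// Hypothetical stand-in for the STM runtime: the stub rejects a fixed
// number of commits so the retry path can be observed.
class RetryLoopSketch {
    static final class TxnHandle { }

    private int commitsToFail;   // how many commits the stub should reject
    int attempts = 0;            // how many times the transaction body has run

    RetryLoopSketch(int commitsToFail) {
        this.commitsToFail = commitsToFail;
    }

    TxnHandle txnStart() {
        return new TxnHandle();
    }

    boolean txnCommit(TxnHandle th) {
        if (commitsToFail > 0) {
            commitsToFail--;
            return false;        // simulated validation failure
        }
        return true;
    }

    // Same shape as the slide's translation of atomic { S; }.
    @SuppressWarnings("finally")
    void runAtomic(Runnable body) {
        while (true) {
            TxnHandle th = txnStart();
            try {
                attempts++;
                body.run();
                break;           // leave the loop if the commit below succeeds
            } finally {
                if (!txnCommit(th))
                    continue;    // overrides the pending break: re-execute
            }
        }
    }
}
```

With `new RetryLoopSketch(2)`, a call to `runAtomic` runs the body three times: two rejected commits followed by a successful one. The same rule means an exception thrown by the body is also discarded when the commit fails, and propagates only from an attempt whose commit succeeds.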

  10. JVM STM support • On-demand cloning of methods called inside transactions • Garbage collection support • Enumeration of refs in read set, write set & undo log • Extra transaction record field in each object • Supports both word & object granularity • Native method invocation throws exception inside transaction • Some intrinsic functions allowed • Runtime STM API • Wrapper around McRT-STM API • Polyglot / StarJIT automatically generates calls to API

  11. Background: McRT-STM • STM for • C / C++ (PPoPP 2006) • Java (PLDI 2006) • Writes: • strict two-phase locking • update in place • undo on abort • Reads: • versioning • validation before commit • Granularity per type • Object-level : small objects • Word-level : large arrays • Benefits • Fast memory accesses (no buffering / object wrapping) • Minimal copying (no cloning for large objects) • Compatible with existing types & libraries

  12. Ensuring Atomicity: Novel Combination • + In place updates • + Fast commits • + Fast reads • + Caching effects • + Avoids lock operations • Quantitative results in PPoPP’06

  13. McRT-STM: Example • STM read & write barriers before accessing memory inside transactions • STM tracks accesses & detects data conflicts
      Source:
          …
          atomic {
            B = A + 5;
          }
          …
      With STM barriers:
          …
          stmStart();
          temp = stmRd(A);
          stmWr(B, temp + 5);
          stmCommit();
          …

  14. Transaction Record • Pointer-sized record per object / word • Two states: • Shared (low bit is 1) • Read-only / multiple readers • Value is version number (odd) • Exclusive • Write-only / single owner • Value is thread transaction descriptor (4-byte aligned) • Mapping • Object : slot in object • Field : hashed index into global record table
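The two-state encoding can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the McRT-STM implementation: a `long` inside an `AtomicLong` stands in for the pointer-sized record, `ownerId * 4` simulates a 4-byte-aligned descriptor address (always even), and stepping the version by 2 on release is assumed so that versions stay odd. The names `TxRecordSketch`, `tryAcquire` and `release` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

class TxRecordSketch {
    // Odd value  = shared: the value is a version number.
    // Even value = exclusive: the value identifies the owning transaction
    //              (simulated as ownerId * 4, like an aligned pointer).
    private final AtomicLong word = new AtomicLong(1);

    static boolean isShared(long w) {
        return (w & 1L) == 1L;   // the low bit distinguishes the two states
    }

    long read() {
        return word.get();
    }

    // Tries to move from the observed shared version to exclusive ownership.
    // Fails if the record changed since it was observed.
    boolean tryAcquire(long observedVersion, long ownerId) {
        if (!isShared(observedVersion)) return false;
        return word.compareAndSet(observedVersion, ownerId * 4);
    }

    // Releases ownership and publishes the next (odd) version number.
    void release(long ownerId, long previousVersion) {
        word.compareAndSet(ownerId * 4, previousVersion + 2);
    }
}
```

Because one word carries both states, a reader can tell with a single load and a bit test whether the data is currently owned by a writer, and which version it would need to revalidate later.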

  15. Transaction Record: Example • Every data item has an associated transaction record • Example class: class Foo { int x; int y; } • Object granularity: extra transaction record field (TxR) in each object, alongside vtbl, x, y • Word granularity: object words hash into table of TxRs (TxR1, TxR2, TxR3, …, TxRn) • Hash is f(obj.hash, offset)
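The slide leaves the function f unspecified. The sketch below shows one plausible shape for a word-granularity lookup, purely as an assumption: the table size, the mixing steps, and the names `TxRecordTableSketch` and `recordIndex` are all hypothetical.

```java
class TxRecordTableSketch {
    // Power-of-two table size so the index can be taken with a mask.
    // The actual size used by the system is not given in the slides.
    static final int TABLE_SIZE = 1 << 10;

    // Hypothetical f(obj.hash, offset): combine the object's hash with the
    // field offset, fold the high bits down, and mask into the table range.
    static int recordIndex(int objHash, int offset) {
        int h = objHash * 31 + offset;
        h ^= (h >>> 16);
        return h & (TABLE_SIZE - 1);
    }
}
```

With any function of this kind, two unrelated words can land on the same record. Such a collision can only produce a false conflict, not a missed one, since both words are then guarded by the same record.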

  16. Transaction Descriptor • Descriptor per thread • Info for version validation, lock release, undo on abort, … • Read and Write set : {<Ti, Ni>} • Ti: transaction record • Ni: version number • Undo log : {<Ai, Oi, Vi, Ki>} • Ai: field / element address • Oi: containing object (or null for static) • Vi: original value • Ki: type tag (for garbage collection) • In atomic region • Read operation appends to read set • Write operation appends to write set and undo log • GC enumerates read/write/undo logs
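The undo log is what makes update-in-place safe: each write first records the original value, and an abort replays the log newest-first so that the oldest recorded value for an address is the one left in memory. A minimal sketch follows, with simplifying assumptions: an `int[]` plus index stands in for the address Ai, the containing object Oi and type tag Ki are dropped because nothing here interacts with a garbage collector, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class UndoLogSketch {
    // One entry per write: where the write went and what was there before.
    static final class Entry {
        final int[] container;
        final int index;
        final int originalValue;

        Entry(int[] container, int index, int originalValue) {
            this.container = container;
            this.index = index;
            this.originalValue = originalValue;
        }
    }

    private final Deque<Entry> log = new ArrayDeque<>();

    // In-place write: log the old value first, then update memory.
    void write(int[] container, int index, int newValue) {
        log.push(new Entry(container, index, container[index]));
        container[index] = newValue;
    }

    // Abort: replay newest-first so the oldest logged value wins.
    void abort() {
        while (!log.isEmpty()) {
            Entry e = log.pop();
            e.container[e.index] = e.originalValue;
        }
    }

    // Commit: the in-place values are already final; just drop the log.
    void commit() {
        log.clear();
    }
}
```

Writing the same slot twice and then aborting restores the value from before the first write, which would not hold if the log were replayed oldest-first.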

  17. McRT-STM: Example
          class Foo { int x; int y; };
          Foo bar, foo;
      T1:
          atomic {
            t = foo.x;
            bar.x = t;
            t = foo.y;
            bar.y = t;
          }
      T2:
          atomic {
            t1 = bar.x;
            t2 = bar.y;
          }
      • T1 copies foo into bar • T2 reads bar, but should not see intermediate values

  18. McRT-STM: Example
      T1:
          stmStart();
          t = stmRd(foo.x);
          stmWr(bar.x,t);
          t = stmRd(foo.y);
          stmWr(bar.y,t);
          stmCommit();
      T2:
          stmStart();
          t1 = stmRd(bar.x);
          t2 = stmRd(bar.y);
          stmCommit();
      • T1 copies foo into bar • T2 reads bar, but should not see intermediate values

  19. McRT-STM: Example
      T1:
          stmStart();
          t = stmRd(foo.x);
          stmWr(bar.x,t);
          t = stmRd(foo.y);
          stmWr(bar.y,t);
          stmCommit();
      T2:
          stmStart();
          t1 = stmRd(bar.x);
          t2 = stmRd(bar.y);
          stmCommit();
      • foo: hdr 3, x = 9, y = 7 • bar: hdr 5, then T1, then 7; x = 0 then 9, y = 0 then 7 • T1 logs: Reads <foo, 3> <foo, 3>; Writes <bar, 5>; Undo <bar.x, 0> <bar.y, 0> • T2 waits while T1 owns bar • T2 Reads: <bar, 5>, later <bar, 7>; outcomes shown: Abort, Commit • T2 should read [0, 0] or should read [9, 7]
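The walk-through can be replayed with a small single-threaded simulation that interleaves the two transactions by hand. This is a sketch under loud assumptions: plain fields replace the atomic compare-and-swap a real STM needs, undo on abort is not modeled (a writer whose validation fails simply throws), versions advance by 2 on commit to match the 5 to 7 step on the slide, and the names `openForRead` and `openForWrite` are borrowed loosely from the `txnOpenForRead` / `txnOpenForWrite` operations shown later in the deck.

```java
import java.util.ArrayList;
import java.util.List;

// Single-threaded simulation of versioned reads and locked writes.
class VersionedStmSketch {
    static final class Obj {
        long txRecord;   // odd = shared version number, even = owning transaction id
        int x, y;

        Obj(long version, int x, int y) {
            this.txRecord = version;
            this.x = x;
            this.y = y;
        }
    }

    static final class Txn {
        final long id;   // even and nonzero, standing in for a descriptor address
        final List<Obj> readObjs = new ArrayList<>();
        final List<Long> readVersions = new ArrayList<>();
        final List<Obj> writeObjs = new ArrayList<>();
        final List<Long> writeVersions = new ArrayList<>();

        Txn(long id) {
            this.id = id;
        }

        // Read barrier: log <record, version>. Returns false when another
        // transaction owns the object (the "T2 waits" case).
        boolean openForRead(Obj o) {
            if (o.txRecord == id) return true;
            if ((o.txRecord & 1L) == 0L) return false;
            readObjs.add(o);
            readVersions.add(o.txRecord);
            return true;
        }

        // Write barrier: take exclusive ownership, remembering the old version.
        boolean openForWrite(Obj o) {
            if (o.txRecord == id) return true;
            if ((o.txRecord & 1L) == 0L) return false;
            writeObjs.add(o);
            writeVersions.add(o.txRecord);
            o.txRecord = id;
            return true;
        }

        // Version to validate against: the saved one if we own o ourselves.
        private long currentVersion(Obj o) {
            int w = writeObjs.indexOf(o);
            return w >= 0 ? writeVersions.get(w) : o.txRecord;
        }

        // Validate every logged read, then release writes at version + 2.
        boolean commit() {
            for (int i = 0; i < readObjs.size(); i++) {
                if (currentVersion(readObjs.get(i)) != readVersions.get(i)) {
                    if (!writeObjs.isEmpty())
                        throw new UnsupportedOperationException("undo on abort is not modeled");
                    return false;
                }
            }
            for (int i = 0; i < writeObjs.size(); i++)
                writeObjs.get(i).txRecord = writeVersions.get(i) + 2;
            return true;
        }
    }
}
```

Replaying the slide with `foo = new Obj(3, 9, 7)` and `bar = new Obj(5, 0, 0)`: if T2 opens `bar` for reading before T1 acquires it, T2 logs `<bar, 5>`; once T1 owns `bar`, T2's next `openForRead(bar)` returns false; after T1 commits, `bar.txRecord` is 7, so T2's commit fails validation against its logged 5, and a fresh T2 attempt reads [9, 7] and commits.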

  20. Early Results: Overhead breakdown • Time breakdown on single processor • STM read & validation overheads dominate → Good optimization targets

  21. System Overview • Pipeline: Transactional Java → (Polyglot) → Java + STM API → (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code • Runtime: ORP VM, McRT STM

  22. Leveraging the JIT • StarJIT: High-performance dynamic compiler • Identifies transactional regions in Java+STM code • Differentiates top-level and nested transactions • Inserts read/write barriers in transactional code • Maps STM API to first class opcodes in STIR • Good compiler representation → greater optimization opportunities

  23. Representing Read/Write Barriers
      Source:
          atomic {
            a.x = t1
            a.y = t2
            if(a.z == 0) {
              a.x = 0
              a.z = t3
            }
          }
      With traditional barriers:
          …
          stmWr(&a.x, t1)
          stmWr(&a.y, t2)
          if(stmRd(&a.z) == 0) {
            stmWr(&a.x, 0);
            stmWr(&a.z, t3)
          }
      • Traditional barriers hide redundant locking/logging

  24. An STM IR for Optimization
      Source:
          atomic {
            a.x = t1
            a.y = t2
            if(a.z == 0) {
              a.x = 0
              a.z = t3
            }
          }
      Redundancies exposed:
          txnOpenForWrite(a)
          txnLogObjectInt(&a.x, a)
          a.x = t1
          txnOpenForWrite(a)
          txnLogObjectInt(&a.y, a)
          a.y = t2
          txnOpenForRead(a)
          if(a.z == 0) {
            txnOpenForWrite(a)
            txnLogObjectInt(&a.x, a)
            a.x = 0
            txnOpenForWrite(a)
            txnLogObjectInt(&a.z, a)
            a.z = t3
          }

  25. Optimized Code
      Source:
          atomic {
            a.x = t1
            a.y = t2
            if(a.z == 0) {
              a.x = 0
              a.z = t3
            }
          }
      Optimized:
          txnOpenForWrite(a)
          txnLogObjectInt(&a.x, a)
          a.x = t1
          txnLogObjectInt(&a.y, a)
          a.y = t2
          if(a.z == 0) {
            a.x = 0
            txnLogObjectInt(&a.z, a)
            a.z = t3
          }
      • Fewer & cheaper STM operations
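To put a number on "fewer", the two code shapes from slides 24 and 25 can be run against counting stubs. This is only an illustration: the stubs do no locking or logging, `txnLogField(a, "x")` is a hypothetical stand-in for `txnLogObjectInt(&a.x, a)` because Java cannot take a field's address, and in the real system the transformation is done by the JIT on its IR rather than by hand.

```java
class BarrierCountSketch {
    static final class Obj {
        int x, y, z;
    }

    int stmOps = 0;   // counts every STM operation executed

    // Stubs: they only count calls, standing in for the real barriers.
    void txnOpenForWrite(Obj o) { stmOps++; }
    void txnOpenForRead(Obj o) { stmOps++; }
    void txnLogField(Obj o, String field) { stmOps++; }

    // Slide 24 shape: every access carries its own open and log operations.
    void unoptimized(Obj a, int t1, int t2, int t3) {
        txnOpenForWrite(a); txnLogField(a, "x"); a.x = t1;
        txnOpenForWrite(a); txnLogField(a, "y"); a.y = t2;
        txnOpenForRead(a);
        if (a.z == 0) {
            txnOpenForWrite(a); txnLogField(a, "x"); a.x = 0;
            txnOpenForWrite(a); txnLogField(a, "z"); a.z = t3;
        }
    }

    // Slide 25 shape: one open covers the object, each field is logged once.
    void optimized(Obj a, int t1, int t2, int t3) {
        txnOpenForWrite(a);
        txnLogField(a, "x"); a.x = t1;
        txnLogField(a, "y"); a.y = t2;
        if (a.z == 0) {
            a.x = 0;                          // a.x was already logged above
            txnLogField(a, "z"); a.z = t3;
        }
    }
}
```

When the branch is taken, the unoptimized shape executes 9 STM operations and the optimized shape 4, while both leave the object in the same final state. The open-for-read disappears because the object is already open for writing at that point.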

  26. Compiler Optimizations for Transactions • Standard optimizations • CSE, Dead-code-elimination, … • Careful IR representation exposes opportunities and enables optimizations with almost no modifications • Subtle in presence of nesting • STM-specific optimizations • Immutable field / class detection & barrier removal (vtable/String) • Transaction-local object detection & barrier removal • Partial inlining of STM fast paths to eliminate call overhead

  27. Experiments • 16-way 2.2 GHz Xeon with 16 GB shared memory • L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four) • Workloads • Hashtable, Binary tree, OO7 (OODBMS) • Mix of gets, in-place updates, insertions, and removals • Object-level conflict detection by default • Word / mixed where beneficial

  28. Effectiveness of Compiler Optimizations • 1P overheads over thread-unsafe baseline • Prior STMs typically incur ~2x on 1P • With compiler optimizations: • < 40% over no concurrency control • < 30% over synchronization

  29. Scalability: Java HashMap Shootout • Unsafe (java.util.HashMap) • Thread-unsafe w/o Concurrency Control • Synchronized • Coarse-grain synchronization via SynchronizedMap wrapper • Concurrent (java.util.concurrent.ConcurrentHashMap) • Multi-year effort: JSR 166 -> Java 5 • Optimized for concurrent gets (no locking) • For updates, divides bucket array into 16 segments (size / locking) • Atomic • Transactional version via “AtomicMap” wrapper • Atomic Prime • Transactional version with minor hand optimization • Tracks size per segment à la ConcurrentHashMap • Execution • 10,000,000 operations / 200,000 elements • Defaults: load factor, threshold, concurrency level

  30. Scalability: 100% Gets • Atomic wrapper is competitive with ConcurrentHashMap • Effect of compiler optimizations scales

  31. Scalability: 20% Gets / 80% Updates • ConcurrentHashMap thrashes on 16 segments • Atomic still scales

  32. 20% Inserts and Removes • Atomic conflicts on entire bucket array • The array is an object

  33. 20% Inserts and Removes: Word-Level • We still conflict on the single size field in java.util.HashMap

  34. 20% Inserts and Removes: Atomic Prime • Atomic Prime tracks size / segment – lowering bottleneck • No degradation, modest performance gain

  35. 20% Inserts and Removes: Mixed-Level • Mixed-level preserves wins & reduces overheads • word-level for arrays • object-level for non-arrays

  36. Scalability: java.util.TreeMap • 100% Gets • 80% Gets • Results similar to HashMap

  37. Scalability: OO7 – 80% Reads • Operations & traversal over synthetic database • “Coarse” atomic is competitive with medium-grain synchronization

  38. Key Takeaways • Optimistic reads + pessimistic writes is a sweet spot • Compiler optimizations significantly reduce STM overhead • 20-40% over thread-unsafe • 10-30% over synchronized • Simple atomic wrappers sometimes good enough • Minor modifications give performance competitive with complex fine-grain synchronization • Word-level conflict detection is crucial for large arrays • Mixed-level conflict detection provides best of both

  39. Research challenges • Performance • Compiler optimizations • Hardware support • Dealing with contention • Semantics • I/O & communication • Strong atomicity • Nested parallelism • Open transactions • Debugging & performance analysis tools • System integration

  40. Conclusions • Rich transactional language constructs in Java • Efficient, first class nested transactions • RISC-like STM API • Compiler optimizations • Per-type word and object level conflict detection • Complete GC support
