Compiler and Runtime Supportfor EfficientSoftware Transactional Memory

Vijay Menon

Programming Systems Lab

Ali-Reza Adl-Tabatabai, Brian T. Lewis,

Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

Motivation

Locks are hard to get right

  • Programmability vs. scalability

Transactional memory is an appealing alternative

  • Simpler programming model
  • Stronger guarantees
    • Atomicity, Consistency, Isolation
    • Deadlock avoidance
  • Closer to programmer intent
  • Scalable implementations

Questions

  • How to lower TM overheads – particularly in software?
  • How to balance granularity / scalability?
Our System
  • Java Software Transactional Memory (STM) System
      • Pure software implementation (McRT-STM – PPoPP ’06)
      • Language extensions in Java (Polyglot)
      • Integrated with JVM & JIT (ORP & StarJIT)
  • Novel Features
      • Rich transactional language constructs in Java
      • Efficient, first class nested transactions
      • Complete GC support
      • RISC-like STM API / IR
      • Compiler optimizations
      • Per-type word and object level conflict detection
Transactional Java

atomic {
  S;
}

Other Language Constructs

Built on prior research:

  • retry (STM Haskell, …)
  • orelse (STM Haskell)
  • tryatomic (Fortress)
  • when (X10, …)

Transactional Java → Java: Standard Java + STM API

while (true) {
  TxnHandle th = txnStart();
  try {
    S';
    break;
  } finally {
    if (!txnCommit(th))
      continue;
  }
}
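The expansion can be exercised with a minimal, single-threaded sketch. TxnHandle, txnStart, and txnCommit here are stand-ins for the real STM runtime, not its actual API; the stub fails the first commit on purpose to show how the while/continue structure re-executes the body:

```java
// Sketch of the atomic { S; } -> Java + STM API expansion.
// The STM calls are stubs: txnCommit fails once, forcing one retry.
public class AtomicExpansion {
    static int commitFailures = 1;                 // stub: fail the first commit

    static Object txnStart() { return new Object(); }          // stub handle
    static boolean txnCommit(Object th) { return commitFailures-- <= 0; }

    public static void main(String[] args) {
        int executions = 0;
        while (true) {                             // compiler-generated retry loop
            Object th = txnStart();
            try {
                executions++;                      // S', the instrumented body
                break;
            } finally {
                if (!txnCommit(th))
                    continue;                      // commit failed: re-execute body
            }
        }
        System.out.println(executions);            // body ran twice: one abort, one commit
    }
}
```

Note how `continue` in the `finally` block discards the pending `break`, which is exactly what makes a failed commit restart the transaction body.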
Tight integration with JVM & JIT
  • StarJIT & ORP
    • On-demand cloning of methods (Harris ’03)
    • Identifies transactional regions in Java+STM code
    • Inserts read/write barriers in transactional code
    • Maps STM API to first class opcodes in StarJIT IR (STIR)

Good compiler representation → greater optimization opportunities

Representing Read/Write Barriers

atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

With traditional barriers:

stmWr(&a.x, t1);
stmWr(&a.y, t2);
if (stmRd(&a.z) == 0) {
  stmWr(&a.x, 0);
  stmWr(&a.z, t3);
}

Traditional barriers hide redundant locking/logging.

An STM IR for Optimization

Redundancies exposed:

atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnOpenForWrite(a);
txnLogObjectInt(&a.y, a);
a.y = t2;
txnOpenForRead(a);
if (a.z == 0) {
  txnOpenForWrite(a);
  txnLogObjectInt(&a.x, a);
  a.x = 0;
  txnOpenForWrite(a);
  txnLogObjectInt(&a.z, a);
  a.z = t3;
}
Optimized Code

atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnLogObjectInt(&a.y, a);
a.y = t2;
if (a.z == 0) {
  a.x = 0;
  txnLogObjectInt(&a.z, a);
  a.z = t3;
}

Fewer & cheaper STM operations

Compiler Optimizations for Transactions
  • Standard optimizations
    • CSE, Dead-code-elimination, …
    • Careful IR representation exposes opportunities and enables optimizations with almost no modifications
    • Subtle in presence of nesting
  • STM-specific optimizations
    • Immutable field / class detection & barrier removal (vtable/String)
    • Transaction-local object detection & barrier removal
    • Partial inlining of STM fast paths to eliminate call overhead
McRT-STM
  • PPoPP 2006 (Saha et al.)
    • C / C++ STM
    • Pessimistic Writes:
      • strict two-phase locking
      • update in place
      • undo on abort
    • Optimistic Reads:
      • versioning
      • validation before commit
    • Benefits
      • Fast memory accesses (no buffering / object wrapping)
      • Minimal copying (no cloning for large objects)
      • Compatible with existing types & libraries

Similar STMs: Ennals (FastSTM), Harris et al. (PLDI ’06)
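The optimistic-read side can be sketched in a few lines of Java. The names (stmRead, validate) and the single-threaded simulation are illustrative assumptions, not McRT-STM's actual code; the point is only the mechanism: log the version seen at each read, then check that no logged version has changed before committing:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of optimistic reads: version at read time is logged,
// and commit validates that no logged version has changed.
public class OptimisticReads {
    static final class TxnRecord { int version = 1; }  // odd => shared

    static final class ReadEntry {
        final TxnRecord rec;
        final int versionSeen;
        ReadEntry(TxnRecord rec, int versionSeen) {
            this.rec = rec;
            this.versionSeen = versionSeen;
        }
    }

    static final List<ReadEntry> readLog = new ArrayList<>();

    static int stmRead(TxnRecord rec, int value) {
        readLog.add(new ReadEntry(rec, rec.version));  // remember version at read
        return value;
    }

    static boolean validate() {                        // commit-time validation
        for (ReadEntry e : readLog)
            if (e.rec.version != e.versionSeen)        // another txn committed a write
                return false;
        return true;
    }

    public static void main(String[] args) {
        TxnRecord r = new TxnRecord();
        stmRead(r, 42);
        System.out.println(validate());                // no conflicting write yet
        r.version += 2;                                // simulate another txn's commit
        System.out.println(validate());                // version changed: must abort
    }
}
```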

STM Data Structures
  • Per-thread:
    • Transaction Descriptor
      • Per-thread info for version validation, acquired locks, rollback
      • Maintained in Read / Write / Undo logs
    • Transaction Memento
      • Checkpoint of logs for nesting / partial rollback
  • Per-data:
    • Transaction Record
      • Pointer-sized field guarding a set of shared data
      • Transactional state of data
        • Shared: Version number (odd)
        • Exclusive: Owner’s transaction descriptor (even / aligned)
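The odd/even trick above can be illustrated with a small sketch. The long-typed word and the helper names are assumptions standing in for the pointer-sized field: an odd value is a version number (data shared), while an even, aligned value is a pointer to the owning transaction descriptor (data held exclusively):

```java
// Sketch of the transaction-record word encoding:
// low bit set  -> version number (shared),
// low bit clear -> aligned owner-descriptor pointer (exclusive).
public class TxnRecordWord {
    static boolean isShared(long txr)    { return (txr & 1) == 1; }  // odd => version
    static boolean isExclusive(long txr) { return (txr & 1) == 0; }  // even => owner

    public static void main(String[] args) {
        long sharedWord = 7L;        // version 7 (odd)
        long ownerWord  = 0x1000L;   // aligned descriptor address (even)
        System.out.println(isShared(sharedWord));
        System.out.println(isExclusive(ownerWord));
    }
}
```

Because alignment guarantees descriptor pointers are even, a single pointer-sized field can serve both roles with no extra storage.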
Mapping Data to Transaction Record

  • Every data item has an associated transaction record

[Figure: two layouts of class Foo { int x; int y; } — one with the TxR embedded after the vtbl, one with fields hashing into a table TxR1…TxRn]

  • Object granularity: the transaction record is embedded in the object
  • Word granularity: object words hash into a table of TxRs; the hash is f(obj.hash, offset)
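The word-granularity mapping can be sketched as follows. The slide only states that the hash is f(obj.hash, offset); the concrete function below is an assumption for illustration, as are the table size and names:

```java
// Sketch of word-granularity mapping: each (object, field offset) pair
// hashes into a global table of transaction records.
public class WordGranularityMap {
    static final int TABLE_SIZE = 1 << 10;   // assumed size, power of two

    // One possible f(obj.hash, offset); the real function is unspecified.
    static int txrIndex(Object obj, int fieldOffset) {
        return (System.identityHashCode(obj) ^ (fieldOffset * 31)) & (TABLE_SIZE - 1);
    }

    public static void main(String[] args) {
        Object foo = new Object();
        int ix = txrIndex(foo, 0);   // index of TxR guarding foo.x
        int iy = txrIndex(foo, 4);   // index of TxR guarding foo.y
        // distinct words of one object map to distinct records here,
        // which is what reduces false sharing vs. object granularity
        System.out.println(ix != iy && ix < TABLE_SIZE && iy < TABLE_SIZE);
    }
}
```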

Granularity of Conflict Detection

  • Object-level
    • Cheaper operation
    • Exposes CSE opportunities
    • Lower overhead on 1P
  • Word-level
    • Reduces false sharing
    • Better scalability
  • Mix & match
    • Per-type basis
    • E.g., word-level for arrays, object-level for non-arrays

False-sharing example:

// Thread 1
a.x = …
a.y = …

// Thread 2
… = … a.z …
Experiments
  • 16-way 2.2 GHz Xeon with 16 GB shared memory
    • L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four)
  • Workloads
    • Hashtable, Binary tree, OO7 (OODBMS)
      • Mix of gets, in-place updates, insertions, and removals
    • Object-level conflict detection by default
      • Word / mixed where beneficial
Effectiveness of Compiler Optimizations
  • 1P overheads over thread-unsafe baseline

Prior STMs typically incur ~2x on 1P

With compiler optimizations:

  • < 40% over no concurrency control
  • < 30% over synchronization

Scalability: Java HashMap Shootout
  • Unsafe (java.util.HashMap)
    • Thread-unsafe, no concurrency control
  • Synchronized
    • Coarse-grain synchronization via SynchronizedMap wrapper
  • Concurrent (java.util.concurrent.ConcurrentHashMap)
    • Multi-year effort: JSR 166 → Java 5
    • Optimized for concurrent gets (no locking)
    • For updates, divides bucket array into 16 segments (size / locking)
  • Atomic
    • Transactional version via “AtomicMap” wrapper
  • Atomic Prime
    • Transactional version with minor hand optimization
      • Tracks size per segment à la ConcurrentHashMap
  • Execution
    • 10,000,000 operations / 200,000 elements
    • Defaults: load factor, threshold, concurrency level
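A plausible shape for the “AtomicMap” wrapper, in Transactional-Java pseudocode (the atomic keyword is this system's language extension, not standard Java; this is a sketch of the wrapper idea, not the evaluated implementation):

```java
// Pseudocode: each java.util.Map operation is wrapped in an atomic block,
// analogous to how SynchronizedMap wraps each operation in synchronized.
class AtomicMap<K, V> implements Map<K, V> {
    private final Map<K, V> m;
    AtomicMap(Map<K, V> m) { this.m = m; }

    public V get(Object key) {
        atomic { return m.get(key); }
    }

    public V put(K key, V value) {
        atomic { return m.put(key, value); }
    }

    // ... remaining Map methods wrapped the same way
}
```

Unlike the SynchronizedMap wrapper, the atomic blocks let non-conflicting operations proceed concurrently, which is why the wrapper can compete with hand-tuned fine-grain locking.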
Scalability: 100% Gets

Atomic wrapper is competitive with ConcurrentHashMap

The effect of the compiler optimizations scales

Scalability: 20% Gets / 80% Updates

ConcurrentHashMap thrashes on 16 segments

Atomic still scales

20% Inserts and Removes

Atomic conflicts on the entire bucket array (the array is a single object)

20% Inserts and Removes: Word-Level

We still conflict on the single size field in java.util.HashMap

20% Inserts and Removes: Atomic Prime

Atomic Prime tracks size per segment, lowering the bottleneck

No degradation, modest performance gain

20% Inserts and Removes: Mixed-Level
  • Mixed-level preserves wins & reduces overheads
    • Word-level for arrays
    • Object-level for non-arrays
Key Takeaways
  • Optimistic reads + pessimistic writes is a nice sweet spot
  • Compiler optimizations significantly reduce STM overhead
    • 20–40% over thread-unsafe
    • 10–30% over synchronized
  • Simple atomic wrappers are sometimes good enough
  • Minor modifications give performance competitive with complex fine-grain synchronization
  • Word-level contention is crucial for large arrays
  • Mixed contention provides the best of both
Novel Contributions
  • Rich transactional language constructs in Java
  • Efficient, first class nested transactions
  • Complete GC support
  • RISC-like STM API
  • Compiler optimizations
  • Per-type word and object level conflict detection