Compiler and Runtime Support for Efficient Software Transactional Memory

Vijay Menon

Ali-Reza Adl-Tabatabai, Brian T. Lewis,

Brian R. Murphy, Bratin Saha, Tatiana Shpeisman

Motivation
  • Multi-core architectures are mainstream
      • Software concurrency needed for scalability
      • Concurrent programming is hard
      • Difficult to reason about shared data
  • Traditional mechanism: Lock-based Synchronization
      • Hard to use
      • Must be fine-grain for scalability
      • Deadlocks
      • Not easily composable
  • New Solution: Transactional Memory (TM)
      • Simpler programming model: Atomicity, Consistency, Isolation
      • No deadlocks
      • Composability
      • Optimistic concurrency
      • Analogy
        • GC : Memory allocation ≈ TM : Mutual exclusion
Composability

    class Bank {
        ConcurrentHashMap accounts;

        void deposit(String name, int amount) {
            synchronized (accounts) {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
    }

  • Thread-safe – but no scaling
    • ConcurrentHashMap (Java 5/JSR 166) does not help
    • Performance requires redesign from scratch & fine-grain locking
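
The coarse-lock version above can be made runnable with a few additions. The sketch below assumes generic type parameters, a zero default balance for new accounts, and a hypothetical `runDeposits` driver; none of these appear on the slide. It shows why the outer lock is still needed: `get` followed by `put` is a read-modify-write, and `ConcurrentHashMap` only makes each individual call thread-safe.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Runnable variant of the slide's Bank; the driver is a hypothetical harness. */
final class CoarseLockBank {
    private final ConcurrentHashMap<String, Integer> accounts = new ConcurrentHashMap<>();

    void deposit(String name, int amount) {
        // get + put must appear atomic together, so one lock covers both calls.
        synchronized (accounts) {
            int balance = accounts.getOrDefault(name, 0); // current balance (assumed 0 if new)
            balance = balance + amount;                    // increment it
            accounts.put(name, balance);                   // set the new balance
        }
    }

    int balance(String name) {
        return accounts.getOrDefault(name, 0);
    }

    /** Hypothetical driver: several threads each deposit 1 into the same account. */
    static int runDeposits(int threads, int depositsPerThread) {
        CoarseLockBank bank = new CoarseLockBank();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < depositsPerThread; j++) {
                    bank.deposit("alice", 1);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return bank.balance("alice");
    }

    public static void main(String[] args) {
        System.out.println(runDeposits(4, 1000));
    }
}
```

Every deposit serializes on the single `accounts` monitor, which is the "thread-safe but no scaling" behavior the slide describes.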
Transactional solution
    class Bank {
        HashMap accounts;

        void deposit(String name, int amount) {
            atomic {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
    }

  • Underlying system provides:
    • isolation (thread safety)
    • optimistic concurrency
Transactions are Composable

Scalability on 16-way 2.2 GHz Xeon System

Our System
  • A Java Software Transactional Memory (STM) System
      • Pure software implementation
      • Language extensions in Java
      • Integrated with JVM & JIT
  • Novel Features
      • Rich transactional language constructs in Java
      • Efficient, first-class nested transactions
      • RISC-like STM API
      • Compiler optimizations
      • Per-type word and object level conflict detection
      • Complete GC support
System Overview

[Diagram: compilation and runtime pipeline]
  • Polyglot: Transactional Java → Java + STM API
  • StarJIT: Transactional STIR → Optimized T-STIR → Native Code
  • Runtime: ORP VM, McRT STM

Transactional Java
  • Java + new language constructs:
    • Atomic: execute block atomically
      • atomic {S}
    • Retry: block until alternate path possible
      • atomic {… retry;…}
    • Orelse: compose alternate atomic blocks
      • atomic {S1} orelse{S2} … orelse{Sn}
    • Tryatomic: atomic with escape hatch
      • tryatomic {S} catch(TxnFailed e) {…}
    • When: conditionally atomic region
      • when (condition) {S}
  • Builds on prior research
    • Concurrent Haskell, CAML, CILK, Java
    • HPCS languages: Fortress, Chapel, X10

Transactional Java → Java

Transactional Java:

    atomic {
        S;
    }

Standard Java + STM API:

    while (true) {
        TxnHandle th = txnStart();
        try {
            S’;
            break;
        } finally {
            if (!txnCommit(th))
                continue;
        }
    }

STM API:
  • txnStart[Nested]
  • txnCommit[Nested]
  • txnAbortNested
  • txnUserRetry
  • ...
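
The control flow of the generated retry loop can be exercised with a small stand-in. In the sketch below, `txnStart`, `txnCommit`, and `TxnHandle` are toy stubs written for illustration, not the real STM API: the stub commit simply fails a configurable number of times, and nothing is rolled back. The point is the `continue` inside `finally`, which discards the pending `break` and re-runs the body when the commit fails.

```java
/** Toy model of the retry loop; txnStart/txnCommit are stubs, not the real STM API. */
final class RetryLoopSketch {
    /** Hypothetical handle: remembers which attempt it belongs to. */
    static final class TxnHandle {
        final int attempt;
        TxnHandle(int attempt) { this.attempt = attempt; }
    }

    private final int injectedFailures; // number of commits that fail before one succeeds
    private int attempts = 0;

    RetryLoopSketch(int injectedFailures) {
        this.injectedFailures = injectedFailures;
    }

    TxnHandle txnStart() {
        return new TxnHandle(++attempts);
    }

    /** Stub commit: pretends validation fails for the first injectedFailures attempts. */
    boolean txnCommit(TxnHandle th) {
        return th.attempt > injectedFailures;
    }

    /** Same shape as the translated loop; returns how many attempts were needed. */
    int runAtomically(Runnable body) {
        while (true) {
            TxnHandle th = txnStart();
            try {
                body.run();
                break;            // leaves the loop only if the commit below succeeds
            } finally {
                if (!txnCommit(th))
                    continue;     // discards the pending break and retries
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        int[] bodyRuns = {0};
        int attempts = new RetryLoopSketch(2).runAtomically(() -> bodyRuns[0]++);
        System.out.println(attempts + " attempts, body ran " + bodyRuns[0] + " times");
    }
}
```

Because the commit sits in `finally`, it also runs when the body throws. With this shape a failed commit retries the body, while a successful commit lets the exception propagate.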
JVM STM support
  • On-demand cloning of methods called inside transactions
  • Garbage collection support
    • Enumeration of refs in read set, write set & undo log
  • Extra transaction record field in each object
    • Supports both word & object granularity
  • Native method invocation throws exception inside transaction
    • Some intrinsic functions allowed
  • Runtime STM API
    • Wrapper around McRT-STM API
    • Polyglot / StarJIT automatically generates calls to API
Background: McRT-STM

  • STM for
    • C / C++ (PPoPP 2006)
    • Java (PLDI 2006)
  • Writes:
    • strict two-phase locking
    • update in place
    • undo on abort
  • Reads:
    • versioning
    • validation before commit
  • Granularity per type
    • Object-level : small objects
    • Word-level : large arrays
  • Benefits
    • Fast memory accesses (no buffering / object wrapping)
    • Minimal copying (no cloning for large objects)
    • Compatible with existing types & libraries
Ensuring Atomicity: Novel Combination

+ In place updates
+ Fast commits
+ Fast reads
+ Caching effects
+ Avoids lock operations

Quantitative results in PPoPP’06


McRT-STM: Example
  • STM read & write barriers before accessing memory inside transactions
  • STM tracks accesses & detects data conflicts

    atomic {
        B = A + 5;
    }

    stmStart();
    temp = stmRd(A);
    stmWr(B, temp + 5);
    stmCommit();

Transaction Record
  • Pointer-sized record per object / word
  • Two states:
    • Shared (low bit is 1)
      • Read-only / multiple readers
      • Value is version number (odd)
    • Exclusive
      • Write-only / single owner
      • Value is thread transaction descriptor (4-byte aligned)
  • Mapping
    • Object : slot in object
    • Field : hashed index into global record table
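
The two record states can be told apart with a single bit test. The sketch below is an illustration under assumptions: a 64-bit `long` stands in for the pointer-sized word, and a descriptor "pointer" is modeled as an integer id shifted left by two bits so that its low bits are zero, mimicking 4-byte alignment. The class and method names are hypothetical.

```java
/** Toy encoding of a transaction record in one word (layout assumed for illustration). */
final class TxRecordSketch {
    /** Shared state: low bit is 1, so the value is an odd version number. */
    static boolean isShared(long rec) {
        return (rec & 1L) == 1L;
    }

    /** Exclusive state: the value is an aligned descriptor reference, so the low bit is 0. */
    static boolean isExclusive(long rec) {
        return !isShared(rec);
    }

    /** Version numbers stay odd, so a new version is the old one plus 2. */
    static long nextVersion(long version) {
        if (!isShared(version)) {
            throw new IllegalArgumentException("not a version number: " + version);
        }
        return version + 2;
    }

    /** Stand-in for a 4-byte-aligned descriptor pointer: id << 2 has two zero low bits. */
    static long ownedBy(int descriptorId) {
        return ((long) descriptorId) << 2;
    }

    /** Recovers the owning descriptor id from an exclusive record. */
    static int owner(long rec) {
        if (isShared(rec)) {
            throw new IllegalArgumentException("record is shared, no owner");
        }
        return (int) (rec >>> 2);
    }

    public static void main(String[] args) {
        long rec = 5;                       // shared, version 5
        rec = ownedBy(3);                   // writer 3 acquires the record
        System.out.println(isExclusive(rec) + " owner=" + owner(rec));
        rec = nextVersion(5);               // writer commits and publishes version 7
        System.out.println(isShared(rec) + " version=" + rec);
    }
}
```

Packing both states into one word lets a reader decide with one load whether it holds a version to log or an owner to wait for.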
Transaction Record: Example
  • Every data item has an associated transaction record
  • Object granularity: extra transaction record field in each object

        class Foo {
            int x;
            int y;
        }

  • Word granularity: object words hash into table of TxRs; hash is f(obj.hash, offset)

        class Foo {
            int x;
            int y;
        }

[Figure: object layouts (vtbl, TxR or hash, x, y) and the table of transaction records TxR1, TxR2, TxR3, …, TxRn]

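
The word-granularity mapping f(obj.hash, offset) can be sketched as a plain hash into a fixed-size table. The table size and the mixing function below are assumptions chosen for illustration; the slides only state that the index is a function of the object's hash and the field offset.

```java
/** Toy mapping from (object hash, field offset) to a transaction record index. */
final class TxRecordMappingSketch {
    static final int TABLE_SIZE = 1024; // assumed power-of-two size of the global record table

    /** Word granularity: f(obj.hash, offset) selects a slot in the global table. */
    static int wordRecordIndex(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset; // assumed mixing function, not from the slides
        h ^= (h >>> 16);                    // fold high bits into the low bits
        return h & (TABLE_SIZE - 1);        // mask is valid because TABLE_SIZE is a power of two
    }

    public static void main(String[] args) {
        int objHash = 0x1234;               // hypothetical per-object hash
        System.out.println("x -> TxR" + wordRecordIndex(objHash, 8));
        System.out.println("y -> TxR" + wordRecordIndex(objHash, 12));
    }
}
```

Two fields of the same object usually land on different records, so writers to `x` and `y` need not conflict. Unrelated words can still collide on one record, which costs false conflicts rather than correctness.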

Transaction Descriptor
  • Descriptor per thread
      • Info for version validation, lock release, undo on abort, …
  • Read and Write set : {<Ti, Ni>}
      • Ti: transaction record
      • Ni: version number
  • Undo log : {<Ai, Oi, Vi, Ki>}
      • Ai: field / element address
      • Oi: containing object (or null for static)
      • Vi: original value
      • Ki: type tag (for garbage collection)
  • In atomic region
      • Read operation appends to read set
      • Write operation appends to write set and undo log
      • GC enumerates read/write/undo logs
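
The descriptor's logs can be sketched as three append-only lists. The version below is a simplification under stated assumptions: memory is modeled as an `int[]` "heap" indexed by address, transaction records are plain integer ids, and the containing object (Oi) and type tag (Ki) are omitted because they matter only to the garbage collector. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy per-thread transaction descriptor with read set, write set, and undo log. */
final class TxnDescriptorSketch {
    /** <Ti, Ni>: transaction record id and the version number observed. */
    static final class SetEntry {
        final int record;
        final long version;
        SetEntry(int record, long version) { this.record = record; this.version = version; }
    }

    /** <Ai, Vi>: address (index into the toy heap) and original value; Oi and Ki omitted. */
    static final class UndoEntry {
        final int address;
        final int oldValue;
        UndoEntry(int address, int oldValue) { this.address = address; this.oldValue = oldValue; }
    }

    final List<SetEntry> readSet = new ArrayList<>();
    final List<SetEntry> writeSet = new ArrayList<>();
    final List<UndoEntry> undoLog = new ArrayList<>();

    /** A read appends to the read set. */
    void logRead(int record, long version) {
        readSet.add(new SetEntry(record, version));
    }

    /** A write appends to the write set and undo log, then updates memory in place. */
    void logWrite(int[] heap, int record, long version, int address, int newValue) {
        writeSet.add(new SetEntry(record, version));
        undoLog.add(new UndoEntry(address, heap[address]));
        heap[address] = newValue;
    }

    /** Abort: replay the undo log newest-first so the oldest saved value wins. */
    void abort(int[] heap) {
        for (int i = undoLog.size() - 1; i >= 0; i--) {
            UndoEntry u = undoLog.get(i);
            heap[u.address] = u.oldValue;
        }
        readSet.clear();
        writeSet.clear();
        undoLog.clear();
    }
}
```

Replaying in reverse order matters when one address is written twice in a transaction: the last entry applied is the first one logged, which holds the pre-transaction value.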
McRT-STM: Example

    class Foo {
        int x;
        int y;
    }

    Foo bar, foo;

T1:

    atomic {
        t = foo.x;
        bar.x = t;
        t = foo.y;
        bar.y = t;
    }

T2:

    atomic {
        t1 = bar.x;
        t2 = bar.y;
    }

  • T1 copies foo into bar
  • T2 reads bar, but should not see intermediate values
McRT-STM: Example

T1:

    stmStart();
    t = stmRd(foo.x);
    stmWr(bar.x,t);
    t = stmRd(foo.y);
    stmWr(bar.y,t);
    stmCommit();

T2:

    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();

  • T1 copies foo into bar
  • T2 reads bar, but should not see intermediate values
McRT-STM: Example

T1:

    stmStart();
    t = stmRd(foo.x);
    stmWr(bar.x,t);
    t = stmRd(foo.y);
    stmWr(bar.y,t);
    stmCommit();

T2:

    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();

[Animation: transaction records and logs while T1 and T2 run]
  • foo: hdr = 3, x = 9, y = 7
  • bar: hdr = 5, then T1, then 7; x = 0, then 9; y = 0, then 7
  • T1 reads: <foo, 3>; writes: <bar, 5>; undo: <bar.y, 0>, <bar.x, 0>
  • T2 reads: <bar, 5> (Abort) or <bar, 7> (Commit); T2 waits
  • T2 should read [0, 0] or should read [9,7]
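
The abort-and-retry outcome for T2 can be replayed deterministically in a single thread. The sketch below is a simplified model, not McRT-STM code: the interleaving is scripted by hand, the record is a plain `long` field instead of an atomically updated word, and the case where T2 finds the record owned and has to wait is left out. The names `Obj`, `copyAndCommit`, and `readBarWithInterference` are hypothetical.

```java
/** Single-threaded replay of the T1/T2 example using version validation. */
final class VersionValidationSketch {
    static final class Obj {
        long txRecord;   // odd = version number, even = owned by a writer
        int x, y;
        Obj(long version, int x, int y) { this.txRecord = version; this.x = x; this.y = y; }
    }

    static final long OWNER_T1 = 4; // even stand-in for T1's descriptor pointer

    /** T1: copy foo into bar with in-place updates, then publish the next version. */
    static void copyAndCommit(Obj foo, Obj bar) {
        long oldVersion = bar.txRecord;
        bar.txRecord = OWNER_T1;        // acquire bar exclusively (a CAS in a real STM)
        bar.x = foo.x;
        bar.y = foo.y;
        bar.txRecord = oldVersion + 2;  // release with the next odd version
    }

    /** A reader may commit only if the record still holds the version it logged. */
    static boolean validate(Obj o, long loggedVersion) {
        return o.txRecord == loggedVersion;
    }

    /** T2 reads bar.x, T1 commits in between, T2 reads bar.y; returns what T2 commits. */
    static int[] readBarWithInterference(Obj foo, Obj bar) {
        boolean interfere = true;
        while (true) {
            long logged = bar.txRecord;     // read set entry <bar, version>
            int t1 = bar.x;
            if (interfere) {                // scripted interleaving: T1 runs here, once
                copyAndCommit(foo, bar);
                interfere = false;
            }
            int t2 = bar.y;
            if (validate(bar, logged)) {
                return new int[] {t1, t2};
            }
            // Validation failed: the pair may be inconsistent, so discard it and retry.
        }
    }

    public static void main(String[] args) {
        Obj foo = new Obj(3, 9, 7);
        Obj bar = new Obj(5, 0, 0);
        int[] seen = readBarWithInterference(foo, bar);
        System.out.println("[" + seen[0] + ", " + seen[1] + "] at version " + bar.txRecord);
    }
}
```

On the first attempt T2 logs version 5 and would observe the mixed pair [0, 7]. Validation sees version 7 instead, so that attempt is discarded and the retry commits [9, 7].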

Early Results: Overhead breakdown
  • Time breakdown on single processor
  • STM read & validation overheads dominate

→ Good optimization targets

System Overview

[Diagram: compilation and runtime pipeline]
  • Polyglot: Transactional Java → Java + STM API
  • StarJIT: Transactional STIR → Optimized T-STIR → Native Code
  • Runtime: ORP VM, McRT STM


Leveraging the JIT
  • StarJIT: High-performance dynamic compiler
    • Identifies transactional regions in Java+STM code
    • Differentiates top-level and nested transactions
    • Inserts read/write barriers in transactional code
    • Maps STM API to first class opcodes in STIR

Good compiler representation → greater optimization opportunities

Representing Read/Write Barriers

Source:

    atomic {
        a.x = t1
        a.y = t2
        if (a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }

With traditional barriers:

    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if (stmRd(&a.z) == 0) {
        stmWr(&a.x, 0)
        stmWr(&a.z, t3)
    }

Traditional barriers hide redundant locking/logging

An STM IR for Optimization

Source:

    atomic {
        a.x = t1
        a.y = t2
        if (a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }

Redundancies exposed:

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if (a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Optimized Code

Source:

    atomic {
        a.x = t1
        a.y = t2
        if (a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }

Optimized:

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if (a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }

Fewer & cheaper STM operations
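
The effect of this optimization can be imitated with a toy pass over a straight-line list of operations. The sketch below is not StarJIT's algorithm: it assumes a flat sequence with no control flow (the path through the `if` body), encodes each operation as a `"kind target"` string, and drops an operation when an earlier one on the same path already covers it. A real compiler would rely on dominance information and would have to account for nested transactions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy redundancy elimination over straight-line STM operations ("kind target" strings). */
final class BarrierEliminationSketch {
    static List<String> eliminate(List<String> ops) {
        Set<String> openForWrite = new HashSet<>(); // objects already opened for writing
        Set<String> openForRead = new HashSet<>();  // objects already opened for reading
        Set<String> logged = new HashSet<>();       // addresses already in the undo log
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            String[] parts = op.split(" ");
            String kind = parts[0];
            String target = parts[1];
            boolean keep;
            switch (kind) {
                case "txnOpenForWrite":
                    keep = openForWrite.add(target);
                    break;
                case "txnOpenForRead":
                    // An object opened for write is already safe to read.
                    keep = !openForWrite.contains(target) && openForRead.add(target);
                    break;
                case "txnLogObjectInt":
                    // The first log entry already holds the original value.
                    keep = logged.add(target);
                    break;
                default:
                    keep = true; // plain loads and stores pass through unchanged
            }
            if (keep) {
                out.add(op);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ops = new ArrayList<>();
        ops.add("txnOpenForWrite a");
        ops.add("txnLogObjectInt a.x");
        ops.add("store a.x");
        ops.add("txnOpenForRead a");
        ops.add("txnOpenForWrite a");
        ops.add("txnLogObjectInt a.x");
        ops.add("store a.x");
        for (String op : eliminate(ops)) {
            System.out.println(op);
        }
    }
}
```

Fed the thirteen operations from the "Redundancies exposed" listing, the pass keeps eight, in the same order as the optimized listing.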

Compiler Optimizations for Transactions
  • Standard optimizations
    • CSE, Dead-code-elimination, …
    • Careful IR representation exposes opportunities and enables optimizations with almost no modifications
    • Subtle in presence of nesting
  • STM-specific optimizations
    • Immutable field / class detection & barrier removal (vtable/String)
    • Transaction-local object detection & barrier removal
    • Partial inlining of STM fast paths to eliminate call overhead
Experiments
  • 16-way 2.2 GHz Xeon with 16 GB shared memory
    • L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four)
  • Workloads
    • Hashtable, Binary tree, OO7 (OODBMS)
      • Mix of gets, in-place updates, insertions, and removals
    • Object-level conflict detection by default
      • Word / mixed where beneficial
Effectiveness of Compiler Optimizations
  • 1P overheads over thread-unsafe baseline
  • Prior STMs typically incur ~2x on 1P
  • With compiler optimizations:
    • < 40% over no concurrency control
    • < 30% over synchronization

Scalability: Java HashMap Shootout
  • Unsafe (java.util.HashMap)
      • Thread-unsafe w/o Concurrency Control
  • Synchronized
      • Coarse-grain synchronization via SynchronizedMap wrapper
  • Concurrent (java.util.concurrent.ConcurrentHashMap)
      • Multi-year effort: JSR 166 -> Java 5
      • Optimized for concurrent gets (no locking)
      • For updates, divides bucket array into 16 segments (size / locking)
  • Atomic
      • Transactional version via “AtomicMap” wrapper
  • Atomic Prime
      • Transactional version with minor hand optimization
        • Tracks size per segment à la ConcurrentHashMap
  • Execution
      • 10,000,000 operations / 200,000 elements
      • Defaults: load factor, threshold, concurrency level
Scalability: 100% Gets

Atomic wrapper is competitive with ConcurrentHashMap

Effect of compiler optimizations scales

Scalability: 20% Gets / 80% Updates

ConcurrentHashMap thrashes on 16 segments

Atomic still scales

20% Inserts and Removes

Atomic conflicts on entire bucket array

- The array is an object

20% Inserts and Removes: Word-Level

We still conflict on the single size field in java.util.HashMap

20% Inserts and Removes: Atomic Prime

Atomic Prime tracks size / segment – lowering bottleneck

No degradation, modest performance gain
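
The per-segment size idea can be sketched as a small array of counters in place of one shared `size` field. This is an illustration of the idea only, not the Atomic Prime code: the class name, the hash spreading, and the segment count are assumptions, and the class is not thread-safe on its own. In the setting of the talk the updates would run inside an atomic block, and the array would presumably need word-level conflict detection so that different slots do not conflict.

```java
/** Toy striped size counter: one count per segment instead of a single size field. */
final class SegmentedSizeSketch {
    private final int[] segmentSizes;

    SegmentedSizeSketch(int segments) {
        if (segments <= 0 || (segments & (segments - 1)) != 0) {
            throw new IllegalArgumentException("segments must be a power of two");
        }
        this.segmentSizes = new int[segments];
    }

    /** Picks a segment from the key's hash (spreading function assumed). */
    int segmentFor(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return h & (segmentSizes.length - 1);
    }

    /** An insert touches only the counter of the key's own segment. */
    void recordInsert(Object key) {
        segmentSizes[segmentFor(key)]++;
    }

    /** A removal touches only the counter of the key's own segment. */
    void recordRemove(Object key) {
        segmentSizes[segmentFor(key)]--;
    }

    /** The total size is computed on demand by summing all segments. */
    int size() {
        int total = 0;
        for (int s : segmentSizes) {
            total += s;
        }
        return total;
    }
}
```

Inserts and removes on keys in different segments no longer write the same word. The cost moves to `size()`, which now reads every segment.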

20% Inserts and Removes: Mixed-Level
  • Mixed-level preserves wins & reduces overheads
    • word-level for arrays
    • object-level for non-arrays
Scalability: java.util.TreeMap

100% Gets

80% Gets

Results similar to HashMap

Scalability: OO7 – 80% Reads

Operations & traversal over synthetic database

“Coarse” atomic is competitive with medium-grain synchronization

Key Takeaways
  • Optimistic reads + pessimistic writes is a nice sweet spot
  • Compiler optimizations significantly reduce STM overhead
    • 20-40% over thread-unsafe
    • 10-30% over synchronized
  • Simple atomic wrappers are sometimes good enough
  • Minor modifications give performance competitive with complex fine-grain synchronization
  • Word-level conflict detection is crucial for large arrays
  • Mixed-level conflict detection provides the best of both
Research challenges
  • Performance
      • Compiler optimizations
      • Hardware support
      • Dealing with contention
  • Semantics
      • I/O & communication
      • Strong atomicity
      • Nested parallelism
      • Open transactions
  • Debugging & performance analysis tools
  • System integration
Conclusions
  • Rich transactional language constructs in Java
  • Efficient, first-class nested transactions
  • RISC-like STM API
  • Compiler optimizations
  • Per-type word and object level conflict detection
  • Complete GC support