Compiler and Runtime Support for Efficient Software Transactional Memory

Vijay Menon

Ali-Reza Adl-Tabatabai, Brian T. Lewis,

Brian R. Murphy, Bratin Saha, Tatiana Shpeisman


Motivation

  • Multi-core architectures are mainstream

    • Software concurrency needed for scalability

  • Concurrent programming is hard

    • Difficult to reason about shared data

  • Traditional mechanism: Lock-based Synchronization

    • Hard to use

    • Must be fine-grain for scalability

    • Deadlocks

    • Not easily composable

  • New Solution: Transactional Memory (TM)

    • Simpler programming model: Atomicity, Consistency, Isolation

    • No deadlocks

    • Composability

    • Optimistic concurrency

    • Analogy

      • GC : Memory allocation ≈ TM : Mutual exclusion


    Composability

    class Bank {
        ConcurrentHashMap<String, Integer> accounts;

        void deposit(String name, int amount) {
            synchronized (accounts) {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
    }

    • Thread-safe – but no scaling

      • ConcurrentHashMap (Java 5 / JSR 166) does not help

      • Performance requires a redesign from scratch & fine-grain locking


    Transactional Solution

    class Bank {
        HashMap<String, Integer> accounts;

        void deposit(String name, int amount) {
            atomic {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
    }

    The underlying system provides:

      • isolation (thread safety)

      • optimistic concurrency


    Transactions are Composable

    Scalability on 16-way 2.2 GHz Xeon System


    Our System

    • A Java Software Transactional Memory (STM) System

      • Pure software implementation

      • Language extensions in Java

      • Integrated with JVM & JIT

  • Novel Features

    • Rich transactional language constructs in Java

    • Efficient, first-class nested transactions

    • RISC-like STM API

    • Compiler optimizations

    • Per-type word- and object-level conflict detection

    • Complete GC support


    System Overview

    [Diagram: Polyglot translates Transactional Java into Java + the STM API; StarJIT compiles this into Transactional STIR, optimizes it to Optimized T-STIR, and emits native code; programs run on the ORP VM with the McRT STM runtime.]


    Transactional Java

    • Java + new language constructs:

      • Atomic: execute block atomically

        • atomic {S}

      • Retry: block until alternate path possible

        • atomic {… retry;…}

      • Orelse: compose alternate atomic blocks

        • atomic {S1} orelse{S2} … orelse{Sn}

      • Tryatomic: atomic with escape hatch

        • tryatomic {S} catch(TxnFailed e) {…}

      • When: conditionally atomic region

        • when (condition) {S}

    • Builds on prior research

      Concurrent Haskell, CAML, CILK, Java

      HPCS languages: Fortress, Chapel, X10
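
    As a brief, hypothetical sketch (not an example from the paper) of how these constructs compose: a thread takes an item from whichever of two queues becomes non-empty, blocking via retry until one of the alternatives can commit. Queue here is an illustrative type.

        Object takeEither(Queue q1, Queue q2) {
            atomic {
                if (q1.isEmpty()) retry;   // block until q1 has an element ...
                return q1.remove();
            } orelse {
                if (q2.isEmpty()) retry;   // ... or fall back to q2
                return q2.remove();
            }
        }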


    Transactional Java → Java

    Transactional Java:

        atomic {
            S;
        }

    STM API:

        txnStart[Nested]
        txnCommit[Nested]
        txnAbortNested
        txnUserRetry
        ...

    Standard Java + STM API:

        while (true) {
            TxnHandle th = txnStart();
            try {
                S';
                break;
            } finally {
                if (!txnCommit(th))
                    continue;
            }
        }


    JVM STM support

    • On-demand cloning of methods called inside transactions

    • Garbage collection support

      • Enumeration of refs in read set, write set & undo log

    • Extra transaction record field in each object

      • Supports both word & object granularity

    • Native method invocation throws exception inside transaction

      • Some intrinsic functions allowed

    • Runtime STM API

      • Wrapper around McRT-STM API

      • Polyglot / StarJIT automatically generates calls to API
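
    As an illustrative sketch of the first bullet above (hypothetical names, using the stmRd notation of the following slides): calls reached from inside a transaction are redirected to an on-demand clone that contains the STM barriers, while non-transactional callers keep the original code.

        int getBalance(Account a)     { return a.balance; }          // original, barrier-free version
        int getBalance_txn(Account a) { return stmRd(a.balance); }   // clone invoked inside transactions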


    Background: McRT-STM

    STM for

    • C / C++ (PPoPP 2006)

    • Java (PLDI 2006)

    • Writes:

      • strict two-phase locking

      • update in place

      • undo on abort

    • Reads:

      • versioning

      • validation before commit

    • Granularity per type

      • Object-level : small objects

      • Word-level : large arrays

    • Benefits

      • Fast memory accesses (no buffering / object wrapping)

      • Minimal copying (no cloning for large objects)

      • Compatible with existing types & libraries
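
    A minimal, self-contained Java sketch of the read side described above (all names are assumptions, not the McRT-STM API): a read records the current version number of a transaction record, and the read set is validated at commit time. The write side would additionally lock the record and append an undo entry before updating in place.

        import java.util.ArrayList;
        import java.util.List;

        final class StmReadSketch {
            static final class Record { volatile long value = 1; }   // odd value = version number
            static final class ReadEntry {
                final Record rec; final long version;
                ReadEntry(Record r, long v) { rec = r; version = v; }
            }

            final List<ReadEntry> readSet = new ArrayList<>();

            long openForRead(Record r) {
                long v = r.value;                     // current (odd) version number
                readSet.add(new ReadEntry(r, v));     // remember it for commit-time validation
                return v;
            }

            boolean validateReads() {                 // run just before commit
                for (ReadEntry e : readSet)
                    if (e.rec.value != e.version)     // record changed or is locked: conflict
                        return false;                 // caller aborts and rolls back its undo log
                return true;
            }
        }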


    Ensuring Atomicity: Novel Combination

    Combining pessimistic, in-place writes with optimistic, versioned reads gives:

    + In-place updates
    + Fast commits
    + Fast reads
    + Caching effects
    + Avoids lock operations

    Quantitative results in PPoPP ’06


    McRT-STM: Example

    • STM read & write barriers before accessing memory inside transactions

    • STM tracks accesses & detects data conflicts

    atomic {
        B = A + 5;
    }

    becomes:

        stmStart();
        temp = stmRd(A);
        stmWr(B, temp + 5);
        stmCommit();


    Transaction Record

    • Pointer-sized record per object / word

    • Two states:

      • Shared (low bit is 1)

        • Read-only / multiple readers

        • Value is version number (odd)

      • Exclusive

        • Write-only / single owner

        • Value is thread transaction descriptor (4-byte aligned)

    • Mapping

      • Object : slot in object

      • Field : hashed index into global record table
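
    A minimal sketch (assumed names) of how the two states above fit in one pointer-sized word: the low bit distinguishes an odd version number (shared) from a 4-byte-aligned descriptor address (exclusive).

        final class TxnRecordSketch {
            volatile long value = 1;                              // starts shared, at version 1 (odd)

            boolean isShared()    { return (value & 1L) != 0; }  // odd value => version number
            boolean isExclusive() { return (value & 1L) == 0; }  // aligned address => owner's descriptor
            long versionNumber()  { return value; }              // meaningful only when shared
        }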


    Transaction Record: Example

    • Every data item has an associated transaction record

    • Object granularity: each object carries an extra transaction record field (TxR) next to its vtable

        class Foo {     // plus a hidden TxR slot in the object
            int x;
            int y;
        }

    • Word granularity: the object’s words hash into a global table of transaction records TxR1 … TxRn, where the hash is f(obj.hash, offset)
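
    For the word-granularity case, a hypothetical sketch of that mapping: each (object, field offset) pair indexes one global table of transaction-record words, i.e. TxR = table[f(obj.hash, offset)]. The table size and hash function here are assumptions.

        final class TxnRecordTable {
            private static final int SIZE = 1 << 20;               // assumed power-of-two table size
            private static final long[] RECORDS = new long[SIZE];  // one transaction-record word per slot

            static int slotFor(Object obj, int fieldOffset) {
                int h = System.identityHashCode(obj) * 31 + fieldOffset;  // f(obj.hash, offset)
                return h & (SIZE - 1);
            }

            static long recordFor(Object obj, int fieldOffset) {
                return RECORDS[slotFor(obj, fieldOffset)];
            }
        }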


    Transaction Descriptor

    • Descriptor per thread

      • Info for version validation, lock release, undo on abort, …

  • Read and Write set : {<Ti, Ni>}

    • Ti: transaction record

    • Ni: version number

  • Undo log : {<Ai, Oi, Vi, Ki>}

    • Ai: field / element address

    • Oi: containing object (or null for static)

    • Vi: original value

    • Ki: type tag (for garbage collection)

  • In atomic region

    • Read operation appends read set

    • Write operation appends write set and undo log

    • GC enumerates read/write/undo logs
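
    A minimal data-structure sketch of the per-thread descriptor above (assumed names and Java types): read and write sets of <transaction record, version> pairs, plus an undo log whose entries carry the address, containing object, original value, and type tag, so that both abort and GC can walk them.

        import java.util.ArrayList;
        import java.util.List;

        final class TxnDescriptorSketch {
            static final class SetEntry  { Object record; long version; }   // <Ti, Ni>
            static final class UndoEntry {
                long fieldAddress;   // Ai: field / element address
                Object container;    // Oi: containing object, or null for a static
                long originalValue;  // Vi: value to restore on abort
                byte typeTag;        // Ki: type tag, so the GC can scan this entry
            }

            final List<SetEntry>  readSet  = new ArrayList<>();
            final List<SetEntry>  writeSet = new ArrayList<>();
            final List<UndoEntry> undoLog  = new ArrayList<>();
        }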


    McRT-STM: Example

        class Foo {
            int x;
            int y;
        }
        Foo bar, foo;

    T1:
        atomic {
            t = foo.x;
            bar.x = t;
            t = foo.y;
            bar.y = t;
        }

    T2:
        atomic {
            t1 = bar.x;
            t2 = bar.y;
        }

    • T1 copies foo into bar
    • T2 reads bar, but should not see intermediate values


    McRT-STM: Example (after barrier insertion)

    T1:
        stmStart();
        t = stmRd(foo.x);
        stmWr(bar.x, t);
        t = stmRd(foo.y);
        stmWr(bar.y, t);
        stmCommit();

    T2:
        stmStart();
        t1 = stmRd(bar.x);
        t2 = stmRd(bar.y);
        stmCommit();

    • T1 copies foo into bar
    • T2 reads bar, but should not see intermediate values


    McRT-STM: Example (execution)

    [Diagram: foo holds x = 9, y = 7 with transaction record version 3; bar holds x = 0, y = 0 with version 5.]

    • T1’s reads record <foo, 3> in its read set; its writes acquire bar’s transaction record (write set <bar, 5>) and append the original values <bar.x, 0> and <bar.y, 0> to the undo log before updating in place.
    • T2 waits while T1 holds bar exclusively.
    • If T1 commits, bar’s version advances (here to 7) and T2’s reads record <bar, 7>, observing [9, 7]; if T1 aborts, the undo log restores [0, 0], the record returns to version 5, and T2 records <bar, 5>.
    • Either way, T2 should read [0, 0] or [9, 7], never an intermediate state.


    Early Results: Overhead Breakdown

    • Time breakdown on single processor

    • STM read & validation overheads dominate

       ⇒ Good optimization targets


    System Overview (revisited)

    [Diagram, as before: Polyglot translates Transactional Java into Java + the STM API; StarJIT compiles this into Transactional STIR, optimizes it to Optimized T-STIR, and emits native code; programs run on the ORP VM with the McRT STM runtime.]


    Leveraging the JIT

    • StarJIT: High-performance dynamic compiler

      • Identifies transactional regions in Java+STM code

      • Differentiates top-level and nested transactions

      • Inserts read/write barriers in transactional code

      • Maps STM API to first-class opcodes in STIR

        Good compiler representation →

        greater optimization opportunities


    Representing Read/Write Barriers

    Source:

        atomic {
            a.x = t1
            a.y = t2
            if (a.z == 0) {
                a.x = 0
                a.z = t3
            }
        }

    With traditional barriers:

        stmWr(&a.x, t1)
        stmWr(&a.y, t2)
        if (stmRd(&a.z) == 0) {
            stmWr(&a.x, 0)
            stmWr(&a.z, t3)
        }

    Traditional barriers hide redundant locking/logging


    An STM IR for Optimization

    Source:

        atomic {
            a.x = t1
            a.y = t2
            if (a.z == 0) {
                a.x = 0
                a.z = t3
            }
        }

    Redundancies exposed:

        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = t1
        txnOpenForWrite(a)
        txnLogObjectInt(&a.y, a)
        a.y = t2
        txnOpenForRead(a)
        if (a.z == 0) {
            txnOpenForWrite(a)
            txnLogObjectInt(&a.x, a)
            a.x = 0
            txnOpenForWrite(a)
            txnLogObjectInt(&a.z, a)
            a.z = t3
        }


    Optimized Code

    Source:

        atomic {
            a.x = t1
            a.y = t2
            if (a.z == 0) {
                a.x = 0
                a.z = t3
            }
        }

    After optimization:

        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = t1
        txnLogObjectInt(&a.y, a)
        a.y = t2
        if (a.z == 0) {
            a.x = 0
            txnLogObjectInt(&a.z, a)
            a.z = t3
        }

    Fewer & cheaper STM operations


    Compiler Optimizations for Transactions

    • Standard optimizations

      • CSE, Dead-code-elimination, …

      • Careful IR representation exposes opportunities and enables optimizations with almost no modifications

      • Subtle in presence of nesting

    • STM-specific optimizations

      • Immutable field / class detection & barrier removal (vtable/String)

      • Transaction-local object detection & barrier removal

      • Partial inlining of STM fast paths to eliminate call overhead
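
    As an illustrative sketch of transaction-local object detection (written in the IR notation of the previous slides; not the compiler's actual output, and Point / g are hypothetical names): an object allocated inside the transaction is invisible to other threads and becomes garbage if the transaction aborts, so its writes need neither txnOpenForWrite nor txnLogObjectInt; only the write that publishes it to shared state keeps its barriers.

        atomic {
            Point p = new Point();          // transaction-local allocation
            p.x = 1;                        // no open / log barriers needed
            p.y = 2;                        // no open / log barriers needed
            txnOpenForWrite(g)              // the escaping write to shared object g
            txnLogObjectInt(&g.last, g)     //   still needs its barriers
            g.last = p;
        }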


    Experiments

    • 16-way 2.2 GHz Xeon with 16 GB shared memory

      • L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four)

    • Workloads

      • Hashtable, Binary tree, OO7 (OODBMS)

        • Mix of gets, in-place updates, insertions, and removals

      • Object-level conflict detection by default

        • Word / mixed where beneficial
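
    As a hypothetical sketch of that operation mix (method names, ratios, and key range here are assumptions, not the benchmark's actual code): each thread issues a blend of gets, in-place updates, insertions, and removals against a shared map.

        import java.util.Map;
        import java.util.Random;

        final class WorkloadSketch {
            static void run(Map<Integer, Integer> map, double getFrac, double updateFrac,
                            int ops, int keyRange, long seed) {
                Random rnd = new Random(seed);
                for (int i = 0; i < ops; i++) {
                    int key = rnd.nextInt(keyRange);
                    double p = rnd.nextDouble();
                    if (p < getFrac) {
                        map.get(key);                           // read-only get
                    } else if (p < getFrac + updateFrac) {
                        Integer v = map.get(key);               // in-place update
                        map.put(key, v == null ? 1 : v + 1);
                    } else if (rnd.nextBoolean()) {
                        map.put(key, key);                      // insertion
                    } else {
                        map.remove(key);                        // removal
                    }
                }
            }
        }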


    Effect of Compiler Optimizations

    • 1P overheads over thread-unsafe baseline

    Prior STMs typically incur ~2x on 1P

    With compiler optimizations:

      • < 40% over no concurrency control

      • < 30% over synchronization


    Scalability: Java HashMap Shootout

    • Unsafe (java.util.HashMap)

      • Thread-unsafe, w/o concurrency control

    • Synchronized

      • Coarse-grain synchronization via the SynchronizedMap wrapper

    • Concurrent (java.util.concurrent.ConcurrentHashMap)

      • Multi-year effort: JSR 166 → Java 5

      • Optimized for concurrent gets (no locking)

      • For updates, divides the bucket array into 16 segments (size / locking)

    • Atomic

      • Transactional version via an “AtomicMap” wrapper (a sketch follows below)

    • Atomic Prime

      • Transactional version with minor hand optimization

        • Tracks size per segment, à la ConcurrentHashMap

  • Execution

    • 10,000,000 operations / 200,000 elements

    • Defaults: load factor, threshold, concurrency level
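
    A hypothetical sketch of what such an “AtomicMap” wrapper could look like in the paper's Transactional Java syntax (only a few methods shown; the class and method shapes are assumptions): each operation on a plain java.util.HashMap is simply wrapped in an atomic block.

        class AtomicMap<K, V> {
            private final HashMap<K, V> m = new HashMap<K, V>();

            public V get(Object key)      { atomic { return m.get(key); } }
            public V put(K key, V value)  { atomic { return m.put(key, value); } }
            public V remove(Object key)   { atomic { return m.remove(key); } }
            public int size()             { atomic { return m.size(); } }
            // ... remaining Map methods delegate in the same way ...
        }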


    Scalability: 100% Gets

    • Atomic wrapper is competitive with ConcurrentHashMap

    • Effect of compiler optimizations scales


    Scalability: 20% Gets / 80% Updates

    ConcurrentHashMap thrashes on 16 segments

    Atomic still scales


    20% Inserts and Removes

    Atomic conflicts on entire bucket array

    - The array is an object


    20% Inserts and Removes: Word-Level

    We still conflict on the single size field in java.util.HashMap


    20% Inserts and Removes: Atomic Prime

    Atomic Prime tracks size / segment – lowering bottleneck

    No degradation, modest performance gain


    20% Inserts and Removes: Mixed-Level

    • Mixed-level preserves wins & reduces overheads

    • word-level for arrays

    • object-level for non-arrays


    Scalability: java.util.TreeMap

    100% Gets

    80% Gets

    Results similar to HashMap


    Scalability: OO7 – 80% Reads

    Operations & traversal over synthetic database

    “Coarse” atomic is competitive with medium-grain synchronization


    Key Takeaways

    • Optimistic reads + pessimistic writes is a nice sweet spot

    • Compiler optimizations significantly reduce STM overhead

      • 20–40% over thread-unsafe

      • 10–30% over synchronized

    • Simple atomic wrappers are sometimes good enough

      • Minor modifications give performance competitive with complex fine-grain synchronization

    • Word-level contention is crucial for large arrays

      • Mixed contention provides the best of both


    Research Challenges

    • Performance

      • Compiler optimizations

      • Hardware support

      • Dealing with contention

  • Semantics

    • I/O & communication

    • Strong atomicity

    • Nested parallelism

    • Open transactions

  • Debugging & performance analysis tools

  • System integration


    Conclusions

    • Rich transactional language constructs in Java

    • Efficient, first-class nested transactions

    • RISC-like STM API

    • Compiler optimizations

    • Per-type word- and object-level conflict detection

    • Complete GC support