Compiler and Runtime Support for Efficient Software Transactional Memory

Vijay Menon

Ali-Reza Adl-Tabatabai, Brian T. Lewis,

Brian R. Murphy, Bratin Saha, Tatiana Shpeisman


Motivation
Motivation

  • Multi-core architectures are mainstream

    • Software concurrency needed for scalability

  • Concurrent programming is hard

    • Difficult to reason about shared data

  • Traditional mechanism: Lock-based Synchronization

    • Hard to use

    • Must be fine-grain for scalability

    • Deadlocks

    • Not easily composable

  • New Solution: Transactional Memory (TM)

    • Simpler programming model: Atomicity, Consistency, Isolation

    • No deadlocks

    • Composability

    • Optimistic concurrency

    • Analogy

      • GC : Memory allocation ≈ TM : Mutual exclusion


Composability

    class Bank {
        ConcurrentHashMap<String, Integer> accounts;

        void deposit(String name, int amount) {
            synchronized (accounts) {
                int balance = accounts.get(name);   // Get the current balance
                balance = balance + amount;         // Increment it
                accounts.put(name, balance);        // Set the new balance
            }
        }
    }

    • Thread-safe – but no scaling

      • ConcurrentHashMap (Java 5/JSR 166) does not help

      • Performance requires redesign from scratch & fine-grain locking


Transactional Solution

    class Bank {
        HashMap<String, Integer> accounts;

        void deposit(String name, int amount) {
            atomic {
                int balance = accounts.get(name);   // Get the current balance
                balance = balance + amount;         // Increment it
                accounts.put(name, balance);        // Set the new balance
            }
        }
    }

      Underlying system provides:

      • isolation (thread safety)

      • optimistic concurrency
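
      Because atomic blocks compose, a higher-level operation can simply reuse deposit without knowing how it synchronizes. A minimal sketch in Transactional Java (transfer and withdraw are hypothetical additions, not from the slides):

          void transfer(String from, String to, int amount) {
              atomic {                        // outer transaction
                  withdraw(from, amount);     // contains its own atomic block -> nested transaction
                  deposit(to, amount);        // contains its own atomic block -> nested transaction
              }
          }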


Transactions are Composable

    Scalability on 16-way 2.2 GHz Xeon System


Our System

    • A Java Software Transactional Memory (STM) System

      • Pure software implementation

      • Language extensions in Java

      • Integrated with JVM & JIT

  • Novel Features

    • Rich transactional language constructs in Java

    • Efficient, first class nested transactions

    • RISC-like STM API

    • Compiler optimizations

    • Per-type word and object level conflict detection

    • Complete GC support


System Overview

    [Flow diagram] Polyglot translates Transactional Java into Java + STM API;
    StarJIT compiles that into Transactional STIR, optimizes it into Optimized
    T-STIR, and emits Native Code; the ORP VM and the McRT STM runtime sit
    underneath.

Transactional Java

    • Java + new language constructs:

      • Atomic: execute block atomically

        • atomic {S}

      • Retry: block until alternate path possible

        • atomic {… retry;…}

      • Orelse: compose alternate atomic blocks

        • atomic {S1} orelse{S2} … orelse{Sn}

      • Tryatomic: atomic with escape hatch

        • tryatomic {S} catch(TxnFailed e) {…}

      • When: conditionally atomic region

        • when (condition) {S}

    • Builds on prior research

      Concurrent Haskell, CAML, CILK, Java

      HPCS languages: Fortress, Chapel, X10
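
      As an illustration of how these constructs might combine (a sketch only; BoundedBuffer and its methods are hypothetical, not from the slides):

          int takeOrDefault(BoundedBuffer b, int dflt) {
              atomic {
                  if (b.isEmpty())
                      retry;                 // block until an alternate path is possible
                  return b.removeFirst();
              } orelse {
                  return dflt;               // alternate atomic block tried if the first retries
              }
          }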


Transactional Java → Java

    Transactional Java:

        atomic {
            S;
        }

    STM API:

        txnStart[Nested]
        txnCommit[Nested]
        txnAbortNested
        txnUserRetry
        ...

    Standard Java + STM API:

        while (true) {
            TxnHandle th = txnStart();
            try {
                S';
                break;
            } finally {
                if (!txnCommit(th))
                    continue;
            }
        }


JVM STM Support

    • On-demand cloning of methods called inside transactions

    • Garbage collection support

      • Enumeration of refs in read set, write set & undo log

    • Extra transaction record field in each object

      • Supports both word & object granularity

    • Native method invocation throws exception inside transaction

      • Some intrinsic functions allowed

    • Runtime STM API

      • Wrapper around McRT-STM API

      • Polyglot / StarJIT automatically generates calls to API


Background: McRT-STM

    STM for

    • C / C++ (PPoPP 2006)

    • Java (PLDI 2006)

    • Writes:

      • strict two-phase locking

      • update in place

      • undo on abort

    • Reads:

      • versioning

      • validation before commit

    • Granularity per type

      • Object-level : small objects

      • Word-level : large arrays

    • Benefits

      • Fast memory accesses (no buffering / object wrapping)

      • Minimal copying (no cloning for large objects)

      • Compatible with existing types & libraries
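
      As a rough sketch of the optimistic read side (not the actual McRT-STM code; TxnRecord and ReadSet are illustrative), a read logs the version it observed and commit revalidates those versions:

          final class TxnRecord { volatile long value; }   // odd value = version number

          final class ReadSet {
              private final java.util.ArrayList<TxnRecord> records  = new java.util.ArrayList<TxnRecord>();
              private final java.util.ArrayList<Long>      versions = new java.util.ArrayList<Long>();

              void logRead(TxnRecord rec) {                 // read barrier: remember the version seen
                  records.add(rec);
                  versions.add(rec.value);
              }

              boolean validate() {                          // at commit: any changed version => conflict
                  for (int i = 0; i < records.size(); i++)
                      if (records.get(i).value != versions.get(i))
                          return false;
                  return true;
              }
          }

      Writes, in contrast, lock the transaction record, update in place, and log the old value so it can be undone on abort.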


Ensuring Atomicity: Novel Combination

    • Pessimistic (locking) writes + optimistic (versioned) reads

      • + In-place updates

      • + Fast commits

      • + Fast reads

      • + Caching effects

      • + Avoids lock operations

    • Quantitative results in PPoPP'06


McRT-STM: Example

    • STM read & write barriers before accessing memory inside transactions

    • STM tracks accesses & detects data conflicts

    atomic {
        B = A + 5;
    }

    becomes:

    stmStart();
    temp = stmRd(A);
    stmWr(B, temp + 5);
    stmCommit();


Transaction Record

    • Pointer-sized record per object / word

    • Two states:

      • Shared (low bit is 1)

        • Read-only / multiple readers

        • Value is version number (odd)

      • Exclusive

        • Write-only / single owner

        • Value is thread transaction descriptor (4-byte aligned)

    • Mapping

      • Object : slot in object

      • Field : hashed index into global record table
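
      In code, the two states could be distinguished by the low bit (a sketch; the helper names are not part of the McRT-STM API):

          // Shared: low bit is 1, the value is an odd version number.
          // Exclusive: low bit is 0, the value is the (4-byte aligned) owning
          //            transaction descriptor pointer.
          static boolean isShared(long txnRecord)    { return (txnRecord & 1L) != 0; }
          static boolean isExclusive(long txnRecord) { return (txnRecord & 1L) == 0; }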


Transaction Record: Example

    • Every data item has an associated transaction record

    class Foo {
        int x;
        int y;
    }

    • Object granularity: an extra transaction record field (TxR) is added to
      each object, alongside its vtbl and hash fields.

    • Word granularity: object words hash into a global table of transaction
      records TxR1 … TxRn; the hash is f(obj.hash, offset).
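
      For word granularity, the mapping could be implemented along these lines (a sketch; only f(obj.hash, offset) comes from the slides, the table size and mixing function are assumptions):

          static final int TABLE_SIZE = 1 << 20;              // global table of transaction records
          static final long[] txnRecords = new long[TABLE_SIZE];

          static int recordIndex(Object obj, int fieldOffset) {
              int h = System.identityHashCode(obj) ^ (fieldOffset * 0x9E3779B9);
              return (h & 0x7FFFFFFF) % TABLE_SIZE;           // distinct fields may collide (false conflicts)
          }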


Transaction Descriptor

    • Descriptor per thread

      • Info for version validation, lock release, undo on abort, …

  • Read and Write set : {<Ti, Ni>}

    • Ti: transaction record

    • Ni: version number

  • Undo log : {<Ai, Oi, Vi, Ki>}

    • Ai: field / element address

    • Oi: containing object (or null for static)

    • Vi: original value

    • Ki: type tag (for garbage collection)

  • In an atomic region

    • A read appends to the read set

    • A write appends to the write set and the undo log

    • GC enumerates the read set, write set, and undo log
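
      In Java terms, the per-thread descriptor might be organized along these lines (a sketch; the class and field names are illustrative, the tuple shapes follow the bullets above):

          // Per-thread transaction descriptor: read set, write set, and undo log.
          final class TxnDescriptor {
              static final class SetEntry {        // <Ti, Ni>
                  Object txnRecord;                // Ti: the transaction record touched
                  long   version;                  // Ni: version number observed
              }
              static final class UndoEntry {       // <Ai, Oi, Vi, Ki>
                  long   address;                  // Ai: field / element address
                  Object container;                // Oi: containing object (null for static)
                  long   oldValue;                 // Vi: original value, restored on abort
                  int    typeTag;                  // Ki: type tag so GC can scan the old value
              }
              final java.util.ArrayList<SetEntry>  readSet  = new java.util.ArrayList<SetEntry>();
              final java.util.ArrayList<SetEntry>  writeSet = new java.util.ArrayList<SetEntry>();
              final java.util.ArrayList<UndoEntry> undoLog  = new java.util.ArrayList<UndoEntry>();
          }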


McRT-STM: Example

    class Foo {
        int x;
        int y;
    }
    Foo bar, foo;

    T1:
        atomic {
            t = foo.x;
            bar.x = t;
            t = foo.y;
            bar.y = t;
        }

    T2:
        atomic {
            t1 = bar.x;
            t2 = bar.y;
        }

    • T1 copies foo into bar
    • T2 reads bar, but should not see intermediate values


McRT-STM: Example (with barriers)

    T1:
        stmStart();
        t = stmRd(foo.x);
        stmWr(bar.x, t);
        t = stmRd(foo.y);
        stmWr(bar.y, t);
        stmCommit();

    T2:
        stmStart();
        t1 = stmRd(bar.x);
        t2 = stmRd(bar.y);
        stmCommit();

    • T1 copies foo into bar
    • T2 reads bar, but should not see intermediate values


McRT-STM: Example (execution)

    [Figure: the T1/T2 barrier code above, animated over the heap. foo holds
    (x = 9, y = 7); bar starts at (x = 0, y = 0). T1's descriptor accumulates a
    read set (<foo, 3>), a write set (<bar, 5>), and an undo log (<bar.x, 0>,
    <bar.y, 0>). T2's stmRd(bar.x) waits while T1 holds bar's transaction
    record exclusively; if T1 aborts, the undo log restores bar and T2 logs
    <bar, 5>, and if T1 commits, bar becomes (9, 7) and T2 logs <bar, 7>.]

    • T2 should read [0, 0] or [9, 7], never a mix of the two


Early Results: Overhead Breakdown

    • Time breakdown on a single processor

    • STM read & validation overheads dominate

      → Good optimization targets


System Overview (revisited)

    [Flow diagram repeated: Transactional Java → (Polyglot) → Java + STM API →
    (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code, running
    on the ORP VM with the McRT STM runtime.]


Leveraging the JIT

    • StarJIT: High-performance dynamic compiler

      • Identifies transactional regions in Java+STM code

      • Differentiates top-level and nested transactions

      • Inserts read/write barriers in transactional code

      • Maps STM API to first class opcodes in STIR

        Good compiler representation →

        greater optimization opportunities


Representing Read/Write Barriers

    atomic {
        a.x = t1;
        a.y = t2;
        if (a.z == 0) {
            a.x = 0;
            a.z = t3;
        }
    }

    becomes:

    stmWr(&a.x, t1);
    stmWr(&a.y, t2);
    if (stmRd(&a.z) == 0) {
        stmWr(&a.x, 0);
        stmWr(&a.z, t3);
    }

    Traditional barriers hide redundant locking/logging


An STM IR for Optimization

    atomic {
        a.x = t1;
        a.y = t2;
        if (a.z == 0) {
            a.x = 0;
            a.z = t3;
        }
    }

    Redundancies exposed:

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if (a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }


Optimized Code

    atomic {
        a.x = t1;
        a.y = t2;
        if (a.z == 0) {
            a.x = 0;
            a.z = t3;
        }
    }

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if (a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }

    Fewer & cheaper STM operations


Compiler Optimizations for Transactions

    • Standard optimizations

      • CSE, Dead-code-elimination, …

      • Careful IR representation exposes opportunities and enables optimizations with almost no modifications

      • Subtle in presence of nesting

    • STM-specific optimizations

      • Immutable field / class detection & barrier removal (vtable/String)

      • Transaction-local object detection & barrier removal

      • Partial inlining of STM fast paths to eliminate call overhead
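
      As an illustration, partial inlining of the read-barrier fast path might look roughly like this (a sketch, not the code StarJIT actually emits; TxnRecord and the helper methods are placeholders):

          final class TxnRecord { volatile long value; }        // odd value = version number

          final class ReadBarrier {
              // Fast path inlined at each read: an unlocked (shared) record only
              // needs its version logged; the contended case calls the runtime.
              static void openForRead(TxnRecord rec) {
                  long v = rec.value;
                  if ((v & 1L) != 0) {
                      logVersion(rec, v);                       // common case, no call overhead
                  } else {
                      openForReadContended(rec);                // out-of-line runtime slow path
                  }
              }
              private static void logVersion(TxnRecord rec, long v)  { /* append <rec, v> to read set */ }
              private static void openForReadContended(TxnRecord rec) { /* wait / conflict handling */ }
          }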


Experiments

    • 16-way 2.2 GHz Xeon with 16 GB shared memory

      • L1: 8KB, L2: 512KB, L3: 2MB, L4: 64MB (shared per four processors)

    • Workloads

      • Hashtable, Binary tree, OO7 (OODBMS)

        • Mix of gets, in-place updates, insertions, and removals

      • Object-level conflict detection by default

        • Word / mixed where beneficial
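
      For concreteness, a driver in the spirit of these workloads might look like the following (hypothetical harness code, not the actual benchmark):

          // Each thread performs a fixed mix of gets, in-place updates,
          // insertions, and removals against a shared map.
          static void worker(java.util.Map<Integer, Integer> map, java.util.Random rnd,
                             int ops, int keyRange, int getPct, int updatePct) {
              for (int i = 0; i < ops; i++) {
                  int key = rnd.nextInt(keyRange);
                  int op  = rnd.nextInt(100);
                  if (op < getPct)                   map.get(key);      // read
                  else if (op < getPct + updatePct)  map.put(key, i);   // in-place update
                  else if ((op & 1) == 0)            map.put(key, i);   // insert
                  else                               map.remove(key);   // remove
              }
          }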


Effect of Compiler Optimizations

    • 1P overheads over thread-unsafe baseline

    Prior STMs typically incur ~2x on 1P

    With compiler optimizations:

    - < 40% over no concurrency control

    - < 30% over synchronization


Scalability: Java HashMap Shootout

    • Unsafe (java.util.HashMap)
        • Thread-unsafe w/o concurrency control
    • Synchronized
        • Coarse-grain synchronization via the SynchronizedMap wrapper
    • Concurrent (java.util.concurrent.ConcurrentHashMap)
        • Multi-year effort: JSR 166 -> Java 5
        • Optimized for concurrent gets (no locking)
        • For updates, divides the bucket array into 16 segments (size / locking)
    • Atomic
        • Transactional version via an "AtomicMap" wrapper (sketched after this list)
    • Atomic Prime
        • Transactional version with minor hand optimization
        • Tracks size per segment a la ConcurrentHashMap
    • Execution
        • 10,000,000 operations / 200,000 elements
        • Defaults: load factor, threshold, concurrency level
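
    The AtomicMap wrapper itself is not shown in the deck; a minimal sketch of what such an adapter could look like in Transactional Java (hypothetical code):

        // Every operation is wrapped in an atomic block, so the underlying
        // java.util.HashMap needs no locking of its own.
        class AtomicMap<K, V> {
            private final java.util.HashMap<K, V> m = new java.util.HashMap<K, V>();

            V get(K key)          { atomic { return m.get(key); } }
            V put(K key, V value) { atomic { return m.put(key, value); } }
            V remove(K key)       { atomic { return m.remove(key); } }
            int size()            { atomic { return m.size(); } }
        }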


Scalability: 100% Gets

    Atomic wrapper is competitive with ConcurrentHashMap

    The effect of compiler optimizations also scales


Scalability: 20% Gets / 80% Updates

    ConcurrentHashMap thrashes on 16 segments

    Atomic still scales


20% Inserts and Removes

    Atomic conflicts on entire bucket array

    - The array is an object


20% Inserts and Removes: Word-Level

    We still conflict on the single size field in java.util.HashMap


20% Inserts and Removes: Atomic Prime

    Atomic Prime tracks size per segment, lowering the bottleneck

    No degradation, modest performance gain


20% Inserts and Removes: Mixed-Level

    • Mixed-level preserves wins & reduces overheads

    • word-level for arrays

    • object-level for non-arrays


Scalability: java.util.TreeMap

    100% Gets

    80% Gets

    Results similar to HashMap


Scalability: OO7 – 80% Reads

    Operations & traversal over synthetic database

    “Coarse” atomic is competitive with medium-grain synchronization


Key Takeaways

    • Optimistic reads + pessimistic writes is a nice sweet spot

    • Compiler optimizations significantly reduce STM overhead

      • 20-40% over thread-unsafe

      • 10-30% over synchronized

    • Simple atomic wrappers are sometimes good enough

      • Minor modifications give performance competitive with complex fine-grain synchronization

    • Word-level contention is crucial for large arrays

      • Mixed contention provides the best of both


Research Challenges

    • Performance

      • Compiler optimizations

      • Hardware support

      • Dealing with contention

  • Semantics

    • I/O & communication

    • Strong atomicity

    • Nested parallelism

    • Open transactions

  • Debugging & performance analysis tools

  • System integration


Conclusions

    • Rich transactional language constructs in Java

    • Efficient, first class nested transactions

    • RISC-like STM API

    • Compiler optimizations

    • Per-type word and object level conflict detection

    • Complete GC support

