Hardware Transactional Memory

Royi Maimon and Merav Havuv, 27/5/2007

References • M. Herlihy and J. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures" • C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, Sean Lie, "Unbounded Transactional Memory" • Hammond, Wong, Chen, Carlstrom, Davis (Jun 2004), "Transactional Memory Coherence and Consistency"

Today • What are transactions? • What is Hardware Transactional Memory? • Various implementations of HTM

Outline • Lock-Free • Hardware Transactional Memory (HTM) • Transactions • Cache coherence protocol • General Implementation • Simulation • UTM • LTM • TCC (briefly) • Conclusions

Lock-free • A shared data structure is lock-free if its operations do not require mutual exclusion. • If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object.

Lock-free (cont) • Lock-free data structures avoid common problems associated with conventional locking techniques in highly concurrent systems: • Priority inversion • Convoying: occurs when a process holding a lock is descheduled, so that other processes capable of running may be unable to progress • Deadlock

Priority inversion • Priority inversion occurs when a lower-priority process is preempted while holding a lock needed by higher-priority processes.

Deadlock • Deadlock: two or more processes wait indefinitely for an event that can be caused only by one of the waiting processes. • Let S and Q be two resources that P0 and P1 acquire in opposite orders:
    P0          P1
    Lock(S)     Lock(Q)
    Lock(Q)     Lock(S)

What is a transaction? • A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts • If a transaction commits, all the loads and stores appear to have executed atomically • If a transaction aborts, none of its stores take effect • A transaction's operations are not visible to other processes until it commits

Transaction properties • A transaction satisfies the following properties: • Serializability • Atomicity • This is a simplified version of the traditional database ACID properties (Atomicity, Consistency, Isolation, and Durability)

Transactional Memory • A new multiprocessor architecture • The goal: implementing lock-free synchronization that is • efficient • easier to use than conventional techniques based on mutual exclusion • Implemented by straightforward extensions to multiprocessor cache-coherence protocols.

An Example
• Locks:
    /* always lock the lower index first, so two transfers cannot deadlock */
    if (i < j) { a = i; b = j; } else { a = j; b = i; }
    Lock(L[a]); Lock(L[b]);
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    Unlock(L[b]); Unlock(L[a]);
• Transactional Memory:
    StartTransaction;
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    EndTransaction;

Transactional Memory • Transactions execute in commit order • [Figure: timeline of three concurrent transactions A, B and C issuing loads and stores. One transaction stores to 0xbeef and commits; a transaction that had already loaded 0xbeef detects a violation and re-executes its load with the new data.]

Cache-Coherence Protocol • When multiprocessing, each processor may have its own memory cache that is separate from the shared memory • A cache-coherence protocol manages the caches of a multiprocessor system so that: • No data is lost • No data is overwritten before it is transferred from a cache to the target memory.

The Problem (Cache-Coherence) • The problem can be solved in either of two ways: • directory-based • snooping system

Snoopy Cache • All caches watch (snoop) the activity on a global bus to determine whether they have a copy of the block of data that is requested on the bus.

Directory-based • The data being shared is placed in a common directory that maintains the coherence between caches. • The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. • When an entry is changed the directory either updates or invalidates the other caches with that entry.

How Does It Work? The following primitive instructions for accessing memory are provided: • Load-transactional (LT): reads the value of a shared memory location into a private register. • Load-transactional-exclusive (LTX): like LT, but "hinting" that the location is likely to be modified. • Store-transactional (ST): tentatively writes a value from a private register to a shared memory location. • Commit (COMMIT): attempts to make the transaction's tentative changes permanent. • Abort (ABORT): discards the transaction's tentative changes. • Validate (VALIDATE): tests the current transaction status.

Some definitions • Read set: the set of locations read by LT in a transaction • Write set: the set of locations accessed by LTX or ST in a transaction • Data set (footprint): the union of the read and write sets • A set of values in memory is inconsistent if it couldn't have been produced by any serial execution of transactions

Intended Use • Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would: (1) use LT or LTX to read from a set of locations, (2) use VALIDATE to check that the values read are consistent, (3) use ST to modify a set of locations, (4) use COMMIT to make the changes permanent. • If either the VALIDATE or the COMMIT fails, the process returns to step (1).
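The read, validate, modify, commit loop above can be sketched in plain C. This is only a single-threaded software stand-in, not the hardware mechanism: the write buffer `tx`, the bound `TX_MAX_WRITES`, and the helpers `tx_conflict` and `tx_add` are hypothetical names introduced for illustration, and a conflict from another processor is simulated by calling `tx_conflict` by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_MAX_WRITES 8   /* assumed bound on the write set */

/* Hypothetical stand-in for the transactional cache: buffered tentative writes. */
static struct {
    unsigned *addr[TX_MAX_WRITES];
    unsigned  value[TX_MAX_WRITES];
    size_t    count;
    bool      aborted;
} tx;

static void tx_reset(void) { tx.count = 0; tx.aborted = false; }

/* Simulates a conflicting access by another processor. */
void tx_conflict(void) { tx.aborted = true; }

/* LT: read a location, preferring this transaction's own tentative write. */
unsigned LT(unsigned *p)
{
    assert(p != NULL);
    for (size_t i = tx.count; i-- > 0; )
        if (tx.addr[i] == p)
            return tx.value[i];
    return *p;
}

/* ST: buffer a tentative write; memory is untouched until COMMIT. */
void ST(unsigned *p, unsigned v)
{
    assert(p != NULL);
    if (tx.count == TX_MAX_WRITES) { tx.aborted = true; return; }
    tx.addr[tx.count] = p;
    tx.value[tx.count] = v;
    tx.count++;
}

/* VALIDATE: report the transaction status; on failure discard tentative writes. */
bool VALIDATE(void)
{
    if (!tx.aborted) return true;
    tx_reset();
    return false;
}

/* COMMIT: publish all tentative writes, or discard them if aborted. */
bool COMMIT(void)
{
    bool ok = !tx.aborted;
    if (ok)
        for (size_t i = 0; i < tx.count; i++)
            *tx.addr[i] = tx.value[i];
    tx_reset();
    return ok;
}

/* The intended-use loop: add delta to *counter, retrying until COMMIT succeeds. */
unsigned tx_add(unsigned *counter, unsigned delta)
{
    for (;;) {
        unsigned old = LT(counter);        /* step 1: read */
        if (!VALIDATE()) continue;         /* step 2: validate, else retry */
        ST(counter, old + delta);          /* step 3: tentative write */
        if (COMMIT()) return old + delta;  /* step 4: commit, else retry */
    }
}
```

Calling `tx_add(&c, 3)` on a counter holding 5 leaves 8 in memory only after COMMIT succeeds; a buffered ST that is followed by a simulated conflict leaves memory unchanged.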

Implementation • Transactional memory is implemented by modifying standard multiprocessor cache-coherence protocols • We describe here how to extend a "snoopy" cache protocol for a shared bus to support transactional memory • Our transactions are short-lived activities with relatively small data sets.

The basic idea • Any protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost • Once a transaction conflict is detected, it can be resolved in a variety of ways

Implementation • Each processor maintains two caches • Regular cache for non-transactional operations • Transactional cache for transactional operations. It holds all the tentative writes, without propagating them to other processors or to main memory (until commit) • Why use two caches?

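Cache line states • Each cache line (regular or transactional) has one of the following states, as in Herlihy and Moss: INVALID, VALID (possibly shared, not modified), DIRTY (not shared, modified), RESERVED (not shared, not modified) • The transactional cache extends these states with a transactional tag: EMPTY (contains no data), NORMAL (contains committed data), XCOMMIT (discard on commit), XABORT (discard on abort)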

Cleanup • When the transactional cache needs space for a new entry, it searches in this order: • first for an EMPTY entry • if none is found, for a NORMAL entry • finally for an XCOMMIT entry.
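The search order above can be sketched as a small C function over the tags of a transactional cache. The names `tx_tag` and `pick_replacement` are hypothetical, and the sketch covers only the choice of entry; any write-back the hardware would need before reusing an entry is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EMPTY, NORMAL, XCOMMIT, XABORT } tx_tag;

/* Return the index of the entry to reuse, or -1 if only XABORT entries remain.
 * Preference order follows the slide: EMPTY, then NORMAL, then XCOMMIT. */
int pick_replacement(const tx_tag *tags, size_t n)
{
    static const tx_tag preference[] = { EMPTY, NORMAL, XCOMMIT };
    assert(tags != NULL || n == 0);
    for (size_t p = 0; p < sizeof preference / sizeof preference[0]; p++)
        for (size_t i = 0; i < n; i++)
            if (tags[i] == preference[p])
                return (int)i;
    return -1;
}
```

XABORT entries are never chosen here because they hold the running transaction's tentative values; what happens when nothing else is left is not covered by the slide, so the sketch simply reports -1.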

Processor actions • Each processor maintains two flags: • The transaction active (TACTIVE) flag: indicates whether a transaction is in progress • The transaction status (TSTATUS) flag: indicates whether that transaction is active (True) or aborted (False) • Non-transactional operations behave exactly as in the original cache-coherence protocol

Example: LT operation
(1) Look for an XABORT entry. Found? Return its value.
(2) Not found? Look for a NORMAL entry. Found? Change it to XABORT and allocate another XCOMMIT entry.
(3) Not found? Cache miss: ask to read this block from the shared memory.
    • Successful read: create two entries, XABORT and XCOMMIT.
    • Unsuccessful read: abort the transaction: TSTATUS=FALSE, drop XABORT entries, and set all XCOMMIT entries to NORMAL.
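The LT flow above can be traced with a small software model of the transactional cache. This is a sketch under stated assumptions, not the hardware design: the cache size `TC_SIZE`, the struct layout, and the convention that a NULL `bus_reply` models a BUSY (unsuccessful) bus read are all hypothetical, and running out of free entries is simply treated as an abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TC_SIZE 8   /* assumed number of transactional cache entries */

enum tag { EMPTY, NORMAL, XCOMMIT, XABORT };
struct entry  { enum tag tag; unsigned addr; unsigned value; };
struct tcache { struct entry e[TC_SIZE]; bool tstatus; };

static struct entry *find(struct tcache *c, unsigned addr, enum tag t)
{
    for (size_t i = 0; i < TC_SIZE; i++)
        if (c->e[i].tag == t && c->e[i].addr == addr)
            return &c->e[i];
    return NULL;
}

static struct entry *alloc(struct tcache *c, enum tag t, unsigned addr, unsigned v)
{
    for (size_t i = 0; i < TC_SIZE; i++)
        if (c->e[i].tag == EMPTY) {
            c->e[i] = (struct entry){ t, addr, v };
            return &c->e[i];
        }
    return NULL;
}

/* Abort: TSTATUS=FALSE, drop XABORT entries, XCOMMIT entries become NORMAL. */
static void abort_tx(struct tcache *c)
{
    c->tstatus = false;
    for (size_t i = 0; i < TC_SIZE; i++) {
        if (c->e[i].tag == XABORT)       c->e[i].tag = EMPTY;
        else if (c->e[i].tag == XCOMMIT) c->e[i].tag = NORMAL;
    }
}

/* LT: returns true and stores the value in *out, or false if the transaction aborted.
 * bus_reply points at the value memory would return; NULL models a BUSY reply. */
bool lt(struct tcache *c, unsigned addr, const unsigned *bus_reply, unsigned *out)
{
    assert(c != NULL && out != NULL);

    struct entry *x = find(c, addr, XABORT);          /* step 1 */
    if (x) { *out = x->value; return true; }

    struct entry *n = find(c, addr, NORMAL);          /* step 2 */
    if (n) {
        if (!alloc(c, XCOMMIT, addr, n->value)) { abort_tx(c); return false; }
        n->tag = XABORT;
        *out = n->value;
        return true;
    }

    if (!bus_reply) { abort_tx(c); return false; }    /* step 3: unsuccessful read */
    if (!alloc(c, XCOMMIT, addr, *bus_reply) ||       /* step 3: successful read */
        !alloc(c, XABORT, addr, *bus_reply)) { abort_tx(c); return false; }
    *out = *bus_reply;
    return true;
}
```

A first `lt` on an empty cache takes the miss path and creates the XCOMMIT/XABORT pair; a second `lt` on the same address hits the XABORT entry without touching the bus.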

Snoopy cache actions: • Both the regular cache and the transactional cache snoop on the bus. • A cache ignores any bus cycles for lines not in that cache. • The transactional cache's behavior: • If TSTATUS=False, or if the operation isn't transactional, the cache acts just like the regular cache, except that it ignores entries with a transactional tag other than NORMAL • On an LT from another CPU, if the state is VALID, the cache returns the value; for all other transactional operations it returns BUSY

Simulation • We'll see example code for the producer/consumer algorithm using the transactional memory architecture. • The simulation runs on both cache-coherence protocols: snoopy and directory-based. • The simulation uses 32 processors • The simulation finishes when 2^16 operations have completed.

Part Of Producer/Consumer Code
    typedef struct {
        Word deqs;               /* holds the head's index */
        Word enqs;               /* holds the tail's index */
        Word items[QUEUE_SIZE];
    } queue;

    unsigned queue_deq(queue *q) {
        unsigned head, tail, result;
        unsigned backoff = BACKOFF_MIN;
        unsigned wait;
        while (1) {
            result = QUEUE_EMPTY;
            tail = LTX(&q->enqs);
            head = LTX(&q->deqs);
            if (head != tail) {                  /* queue not empty? */
                result = LT(&q->items[head % QUEUE_SIZE]);
                ST(&q->deqs, head + 1);          /* advance counter */
            }
            if (COMMIT()) break;
            /* abort => exponential backoff before retrying */
            wait = random() % (1 << backoff);
            while (wait--);
            if (backoff < BACKOFF_MAX) backoff++;
        }
        return result;
    }

The results:

So Far: • In both HTM and STM, transactions shouldn't touch many memory locations • There is a (small) bound on a transaction's footprint • In addition, there is a duration limit.

Unbounded Transactional Memory (UTM) • UTM – new thesis: supports transactions of arbitrary footprint and duration. • The UTM architecture allows: • transactions as large as virtual memory • transactions of unlimited duration • transactions which can migrate between processors • UTM supports semantics for nested transactions • In contrast to previous HTM implementations, UTM is optimized for transactions below a certain size but still operates correctly for larger transactions

The Goal of UTM • The primary goals: • make concurrent programming easier • reduce implementation overhead • Why do we want unbounded TM? • Neither programmers nor compilers can easily cope with an imposed hard limit on transaction size.

UTM architecture • The transaction log: a data structure that maintains bookkeeping information for a transaction • Why is it needed? • It enables transactions to survive time-slice interrupts • It enables process migration from one processor to another.

Two new instructions • All the programmer must specify is where a transaction begins and ends • XBEGIN pc: Begin a new transaction. The entry point to an abort handler is specified by pc. If the transaction must fail, roll back processor and memory state to what it was when XBEGIN was executed, and jump to pc. We can think of an XBEGIN instruction as a conditional branch to the abort handler. • XEND: End the current transaction. If XEND completes, the transaction is committed and appears atomic. • Nested transactions are subsumed into the outer transaction.

Transaction Semantics
    L1: XBEGIN L1
        ADD R1, R1, R1
        ST 1000, R1
        XEND
    L2: XBEGIN L2
        ADD R1, R1, R1
        ST 2000, R1
        XEND
• Two transactions: • "A" (the first block) has an abort handler at L1 • "B" (the second block) has an abort handler at L2 • Here the abort handler is a very simplistic retry: each handler address points back at its own XBEGIN.

Register renaming • A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but there is no data transmitted between Inst1 and Inst2. • If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered. • This technique, which eliminates name dependences in registers, is called register renaming. • Register renaming can be done statically (by the compiler) or dynamically (by the hardware).

Rolling back processor state • After an XBEGIN instruction, we take a snapshot of the rename table • To keep track of busy registers, we maintain an S (saved) bit for each physical register to indicate which registers are part of the active transaction, and we include the S bits with every rename-table snapshot • An active transaction's abort handler address, nesting depth, and snapshot are part of its transactional state.

Memory State • UTM represents the set of active transactions with a single data structure held in system memory, the x-state (short for "transaction state").

Xstate Implementation
• The x-state contains a transaction log for each active transaction in the system.
• Each log consists of:
    • A commit record, which maintains the transaction's status: pending, committed, or aborted
    • A vector of log entries, each corresponding to a memory block that the transaction has read or written. The entry provides:
        • A pointer to the block
        • The block's old value (for rollback)
        • A pointer to the commit record
        • Pointers that form a linked list of all entries in all transaction logs that refer to the same block (the reader list)

Xstate Implementation (Cont) • The final part of the x-state consists of a log pointer and a read-write (RW) bit for each memory block
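The x-state layout described above maps naturally onto a few C structs. The sketch below is an illustration only: the field names, the block size `BLOCK_WORDS`, the log capacity `LOG_MAX`, and the helpers `tx_log_block` and `tx_rollback` are hypothetical, and the reader list is declared but never linked. It shows the one behaviour the slides state directly, that the saved old value is what makes rollback possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_WORDS 4    /* assumed block size, in words */
#define LOG_MAX     16   /* assumed capacity of one transaction log */

enum tx_status { PENDING, COMMITTED, ABORTED };

struct commit_record { enum tx_status status; };

struct log_entry {
    unsigned             *block;                   /* pointer to the block */
    unsigned              old_value[BLOCK_WORDS];  /* old value, for rollback */
    struct commit_record *commit;                  /* pointer to the commit record */
    struct log_entry     *next_reader;             /* reader list (unused in this sketch) */
};

struct tx_log {
    struct commit_record commit;
    struct log_entry     entries[LOG_MAX];
    size_t               count;
};

/* Per-block part of the x-state: a log pointer and an RW bit. */
struct block_meta {
    struct log_entry *log_ptr;
    bool              rw;
};

/* Record a block's current contents before the transaction modifies it. */
bool tx_log_block(struct tx_log *tlog, unsigned *block)
{
    assert(tlog != NULL && block != NULL);
    if (tlog->count == LOG_MAX)
        return false;
    struct log_entry *e = &tlog->entries[tlog->count++];
    e->block = block;
    memcpy(e->old_value, block, sizeof e->old_value);
    e->commit = &tlog->commit;
    e->next_reader = NULL;
    return true;
}

/* Abort: mark the commit record and restore old values, newest entry first. */
void tx_rollback(struct tx_log *tlog)
{
    assert(tlog != NULL);
    tlog->commit.status = ABORTED;
    for (size_t i = tlog->count; i-- > 0; ) {
        struct log_entry *e = &tlog->entries[i];
        memcpy(e->block, e->old_value, sizeof e->old_value);
    }
    tlog->count = 0;
}
```

Logging a block, overwriting it, and then calling `tx_rollback` restores the original words and leaves the commit record in the ABORTED state.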

X-state Data Structure • [Figure: the x-state next to application memory. Each transaction log holds a commit record and its log entries (marked R or W); each entry holds the old value, a block pointer, a reader list pointer and a commit record pointer. Each block in application memory carries an RW bit and a log pointer.]

More on x-state • When a processor references a block that is already part of a pending transaction, the system checks the RW bit and log pointer to determine the correct action: • use the old value • use the new value • abort the transaction
