Transactional Memory

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Traditional Software Scaling
[Figure: speedup of user code on a traditional uniprocessor over time: 1.8x, 3.6x, 7x]
Time: Moore’s law

Multicore Software Scaling
[Figure: speedup of user code on multicore: 1.8x, 3.6x, 7x]
Unfortunately, not so simple…

Real-World Multicore Scaling
[Figure: speedup of user code on multicore: 1.8x, 2x, 2.9x]
Parallelization and synchronization require great care…

Why? Amdahl’s Law
Speedup = 1/(ParallelPart/N + SequentialPart)
• Pay for N = 8 cores
• SequentialPart = 25%
• Speedup = only 2.9 times!
As the number of cores grows, the effect of the 25% becomes more acute: 2.3x on 4 cores, 2.9x on 8, 3.4x on 16, 3.7x on 32…
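
The formula above can be checked numerically. The following is a minimal sketch; the class and method names are illustrative and not from the slides. With SequentialPart = 25% it gives roughly the slide's figures for 4, 8, 16 and 32 cores, and the speedup can never reach 1/0.25 = 4 however many cores are added.

```java
// Minimal sketch of Amdahl's law; names are illustrative assumptions.
final class Amdahl {
    /** Speedup on n cores when the fraction `sequential` of the work cannot be parallelized. */
    static double speedup(int n, double sequential) {
        double parallel = 1.0 - sequential;
        return 1.0 / (parallel / n + sequential);
    }

    public static void main(String[] args) {
        // Same core counts as on the slide, with a 25% sequential part.
        for (int n : new int[] {4, 8, 16, 32}) {
            System.out.printf("%2d cores: %.1fx%n", n, speedup(n, 0.25));
        }
    }
}
```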

Shared Data Structures
[Figure: cores (c) accessing shared data structures, coarse-grained vs. fine-grained; in both cases 75% unshared, 25% shared]
• The reason we get only 2.9x speedup
• Fine-grained parallelism has a huge performance benefit

A FIFO Queue
[Figure: queue with Head and Tail pointers; Dequeue() => a at the head, Enqueue(d) at the tail]

A Concurrent FIFO Queue
[Figure: queue a, b, c, d between Head and Tail, guarded by one object lock; P: Dequeue() => a, Q: Enqueue(d)]
• Simple code, easy to prove correct
• Contention and sequential bottleneck
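
The single-object-lock design can be sketched in a few lines of Java. `CoarseQueue` is a hypothetical name, and the use of `ArrayDeque` as backing store is an assumption made for brevity. Every operation takes the same monitor, which is what makes the code easy to reason about and also what creates the sequential bottleneck.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: one object lock (the monitor of `this`) guards the whole queue.
final class CoarseQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();

    /** Appends x at the tail. */
    public synchronized void enqueue(T x) {
        items.addLast(x);
    }

    /** Removes and returns the item at the head, or null if the queue is empty. */
    public synchronized T dequeue() {
        return items.pollFirst();
    }
}
```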

Fine-Grained Locks
[Figure: queue a, b, c, d between Head and Tail; P: Dequeue() => a, Q: Enqueue(d)]
• Finer granularity, more complex code
• Verification nightmare: worry about deadlock, livelock…

Fine-Grained Locks
[Figure: queue with Head and Tail close together; P: Dequeue() => a, Q: Enqueue(b)]
• Complex boundary cases: empty queue, last item
• Worry how to acquire multiple locks
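
One well-known way to handle these boundary cases is the two-lock queue of Michael and Scott, sketched below. It is offered as an illustration and is not necessarily the design the slides have in mind; all names are assumptions. A sentinel node keeps Head and Tail from ever touching the same field, so enqueuers and dequeuers each need only one lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a two-lock queue with a sentinel node (Michael & Scott style).
final class TwoLockQueue<T> {
    private static final class Node<T> {
        final T value;
        volatile Node<T> next; // written under tailLock, read under headLock
        Node(T value) { this.value = value; }
    }

    private final ReentrantLock headLock = new ReentrantLock();
    private final ReentrantLock tailLock = new ReentrantLock();
    private Node<T> head; // guarded by headLock; always the sentinel
    private Node<T> tail; // guarded by tailLock; last node in the list

    TwoLockQueue() {
        Node<T> sentinel = new Node<>(null);
        head = sentinel;
        tail = sentinel;
    }

    void enqueue(T x) {
        Node<T> node = new Node<>(x);
        tailLock.lock();
        try {
            tail.next = node; // link after the current last node
            tail = node;
        } finally {
            tailLock.unlock();
        }
    }

    /** Returns the oldest item, or null if the queue is empty. */
    T dequeue() {
        headLock.lock();
        try {
            Node<T> first = head.next;
            if (first == null) return null; // empty: only the sentinel remains
            head = first;                   // the removed node becomes the new sentinel
            return first.value;
        } finally {
            headLock.unlock();
        }
    }
}
```

Even in this small sketch the correctness argument depends on details (the sentinel, the `volatile` link) that are easy to get wrong, which is the slide's point.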

Moreover: Locking Relies on Conventions
• Relation between lock bit and object bits
• Exists only in programmer’s mind
Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul):
    /*
     * When a locked buffer is visible to the I/O layer
     * BH_Launder is set. This means before unlocking
     * we must clear BH_Launder,mb() on alpha and then
     * clear BH_Lock, so no reader can see BH_Launder set
     * on an unlocked buffer and then risk to deadlock.
     */

Lock-Free (JDK 1.5+)
[Figure: queue a, b, c, d between Head and Tail; P: Dequeue() => a, Q: Enqueue(d)]
• Even finer granularity, even more complex code
• Worry about starvation, subtle bugs, difficulty of modification…
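
The "JDK 1.5+" label presumably refers to the `java.util.concurrent` package introduced in Java 5, whose `ConcurrentLinkedQueue` is a non-blocking queue. The sketch below only shows client usage under that assumption; `LockFreeDemo` is a hypothetical name. The complexity the slide warns about lives inside the library class, not in this calling code.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Usage sketch: the lock-free algorithm itself is hidden inside the JDK class.
final class LockFreeDemo {
    static String demo() {
        ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<>();
        for (String s : new String[] {"a", "b", "c"}) {
            q.offer(s);
        }
        q.offer("d");    // Q: Enqueue(d)
        return q.poll(); // P: Dequeue() => a
    }
}
```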

Composing Objects
[Figure: two queues, each with Head and Tail; P: Dequeue(Q1,a) followed by Enqueue(Q2,a)]
• Complex: move data atomically between structures
• More than twice the worry…

Transactional Memory
[Figure: queue a, b, c, d between Head and Tail; P: Dequeue() => a, Q: Enqueue(d)]
• Great performance, simple code
• Don’t worry about deadlock, livelock, subtle bugs, etc…

Promise of Transactional Memory
[Figure: queue with Head and Tail close together; P: Dequeue() => a, Q: Enqueue(d)]
• Don’t worry which locks need to cover which variables when…
• TM deals with boundary cases under the hood

Composing Objects
[Figure: two queues, each with Head and Tail; P: Dequeue(Q1,a) followed by Enqueue(Q2,a)]
• Will be easy to modify multiple structures atomically
• Provide composability…

The Transactional Manifesto
• Current practice inadequate to meet the multicore challenge
• Research agenda
  • Replace locking with a transactional API
  • Design languages to support this model
  • Implement the run-time to be fast enough

Transactions
• Atomic
  • Commit: takes effect
  • Abort: effects rolled back
    • Usually retried
• Serializable
  • Appear to happen in one-at-a-time order

Atomic Blocks
    atomic {
        x.remove(3);
        y.add(3);
    }
    atomic {
        y = null;
    }
No data race

Designing a FIFO Queue
Write sequential code:
    public void LeftEnq(item x) {
        Qnode q = new Qnode(x);
        q.left = this.left;
        this.left.right = q;
        this.left = q;
    }

Designing a FIFO Queue
Enclose in atomic block:
    public void LeftEnq(item x) {
        atomic {
            Qnode q = new Qnode(x);
            q.left = this.left;
            this.left.right = q;
            this.left = q;
        }
    }

Warning
• Not always this simple
  • Conditional waits
  • Enhanced concurrency
  • Complex patterns
• But often it is
  • Works for sadistic homework

Composition
    public void Transfer(Queue<T> q1, Queue<T> q2) {
        atomic {
            T x = q1.deq();
            q2.enq(x);
        }
    }
Trivial or what?
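
Java has no `atomic` keyword, so the following sketch emulates the block with a single global reentrant lock. This is a deliberately crude stand-in, not how a real TM runs: it provides atomicity and composability (nested blocks re-enter the same lock) but no concurrency between transactions and no rollback. `Atomic`, `TransferDemo` and `transfer` are hypothetical names, and the sketch assumes that every access to the queues goes through `Atomic.run`.

```java
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Crude emulation of `atomic { ... }`: one global lock, no rollback.
final class Atomic {
    private static final ReentrantLock GLOBAL = new ReentrantLock();

    static void run(Runnable body) {
        GLOBAL.lock(); // reentrant, so atomic blocks can nest and compose
        try {
            body.run();
        } finally {
            GLOBAL.unlock();
        }
    }
}

final class TransferDemo {
    /** Moves the head of q1 to the tail of q2 as one indivisible step. */
    static <T> void transfer(Queue<T> q1, Queue<T> q2) {
        Atomic.run(() -> {
            T x = q1.poll();
            if (x != null) {
                q2.add(x);
            }
        });
    }
}
```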

Roll Back
    public T LeftDeq() {
        atomic {
            if (this.left == null)
                retry;
            …
        }
    }
Roll back transaction and restart when something changes
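
The closest everyday Java analogy to `retry` is a guarded wait on a monitor: the caller gives up, sleeps until another thread changes the state, and then re-evaluates the condition. The sketch below uses that lock-based analogy only to convey the "restart when something changes" behavior; it is not a transaction and rolls nothing back. `GuardedQueue` and its method names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Lock-based analogy for `retry`: wait until the queue changes, then re-check.
final class GuardedQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();

    synchronized void rightEnq(T x) {
        items.addLast(x);
        notifyAll(); // "something changed": wake any waiting dequeuers
    }

    synchronized T leftDeq() {
        while (items.isEmpty()) { // plays the role of `if (this.left == null) retry;`
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return items.removeFirst();
    }
}
```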

OrElse Composition
    atomic {
        x = q1.deq();
    } orElse {
        x = q2.deq();
    }
• Run 1st method. If it retries …
• Run 2nd method. If it retries …
• Entire statement retries

Transactional Memory
• Software transactional memory (STM)
• Hardware transactional memory (HTM)
• Hybrid transactional memory (HyTM): try in hardware and fall back to software if unsuccessful

Hardware versus Software
• Do we need hardware at all?
  • Analogies:
    • Virtual memory: yes!
    • Garbage collection: no!
  • Probably do need HW for performance
• Do we need software?
  • Policy issues don’t make sense for hardware

Transactional Consistency
• Memory transactions are collections of reads and writes executed atomically
• Transactions should maintain internal and external consistency
  • External: with respect to the interleavings of other transactions
  • Internal: the transaction itself should operate on a consistent state

External Consistency
Invariant: x = 2y
• Transaction A: Write x (4 becomes 8), Write y (2 becomes 4)
• Transaction B: Read x, Read y, Compute z = 1/(x-y) = 1/2
[Figure: application memory holding x and y]

Simple Lock-Based STM
• STMs come in different forms
  • Lock-based
  • Lock-free
• Here we will describe a simple lock-based STM

Synchronization
• Transaction keeps
  • Read set: locations & values read
  • Write set: locations & values to be written
• Deferred update
  • Changes installed at commit
• Lazy conflict detection
  • Conflicts detected at commit

STM: Transactional Locking
[Figure: application memory, with a map from memory locations to an array of versioned write-locks, each holding a version number (V#)]

Reading an Object
[Figure: Mem and Locks arrays; each lock holds a V#]
• If not already locked, put V# & value in RS

To Write an Object
[Figure: Mem and Locks arrays; each lock holds a V#]
• Add V# and new value to WS

To Commit
[Figure: Mem and Locks arrays; locked entries X and Y move from V# to V#+1]
• Acquire W locks
• Check V#s unchanged in RS & WS
• Install new values
• Increment V#s
• Release …
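
The read, write and commit steps can be put together in a small sketch. Everything here is an assumption made for illustration: memory is an array of ints, each location has one lock word encoding `(V# << 1) | lockedBit`, and the names `SimpleStm`, `Tx` and `Aborted` are hypothetical. It follows the slides' deferred-update, commit-time-validation scheme and makes no attempt at performance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of a simple lock-based STM over an int array; not a library API.
final class SimpleStm {
    static final class Aborted extends RuntimeException {}

    final AtomicIntegerArray mem;   // application memory
    final AtomicIntegerArray locks; // per location: (V# << 1) | lockedBit

    SimpleStm(int size) {
        mem = new AtomicIntegerArray(size);
        locks = new AtomicIntegerArray(size);
    }

    Tx begin() { return new Tx(); }

    final class Tx {
        private final Map<Integer, Integer> seen = new HashMap<>();       // location -> lock word observed (RS and WS)
        private final TreeMap<Integer, Integer> writes = new TreeMap<>(); // location -> new value (WS)

        int read(int i) {
            Integer pending = writes.get(i);
            if (pending != null) return pending; // read our own deferred write
            int before = locks.get(i);
            int value = mem.get(i);
            int after = locks.get(i);
            if ((before & 1) != 0 || before != after) throw new Aborted(); // locked or changed
            note(i, before);
            return value;
        }

        void write(int i, int value) {
            int word = locks.get(i);
            if ((word & 1) != 0) throw new Aborted();
            note(i, word);
            writes.put(i, value); // deferred update: memory is untouched until commit
        }

        private void note(int i, int word) {
            Integer old = seen.putIfAbsent(i, word);
            if (old != null && old != word) throw new Aborted(); // V# moved since first access
        }

        void commit() {
            List<Integer> held = new ArrayList<>();
            try {
                for (int i : writes.keySet()) { // acquire W locks; the CAS also checks V# unchanged
                    int word = seen.get(i);
                    if (!locks.compareAndSet(i, word, word | 1)) throw new Aborted();
                    held.add(i);
                }
                for (Map.Entry<Integer, Integer> e : seen.entrySet()) { // validate the rest of RS
                    if (writes.containsKey(e.getKey())) continue;
                    if (locks.get(e.getKey()) != e.getValue()) throw new Aborted();
                }
                for (Map.Entry<Integer, Integer> w : writes.entrySet()) {
                    mem.set(w.getKey(), w.getValue()); // install new values
                }
                for (int i : held) locks.set(i, seen.get(i) + 2); // V# + 1, lock bit cleared
                held.clear();
            } finally {
                for (int i : held) locks.set(i, seen.get(i)); // abort path: release, V# unchanged
            }
        }
    }
}
```

A caller would typically wrap `begin()` ... `commit()` in a loop that catches `Aborted` and starts over. Note that reads are only validated against each other at commit time, so a transaction can observe a mix of old and new values before it learns that it must abort.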

Problem: Internal Inconsistency
• A zombie is a currently active transaction that is destined to abort because it saw an inconsistent state
• If zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us

Internal Consistency
Invariant: x = 2y
• Transaction B: Read x = 4
• Transaction A: Write x (4 becomes 8; kills B), Write y (2 becomes 4)
• Transaction B (zombie): Read y = 4, Compute z = 1/(x-y)
• DIV by 0 ERROR
[Figure: application memory holding x and y]
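
The interleaving can be replayed deterministically in a single thread to see why the eventual abort comes too late. This is only a sequential re-enactment with plain local variables, not a concurrent program; `ZombieDemo` and `replay` are hypothetical names.

```java
// Sequential replay of the zombie interleaving from the slide.
final class ZombieDemo {
    /** Returns true if transaction B hits a divide-by-zero before it could abort. */
    static boolean replay() {
        int x = 4, y = 2;  // shared state, invariant x == 2 * y
        int bx = x;        // B: read x = 4
        x = 8;             // A: write x
        y = 4;             // A: write y; A commits and B is now a zombie
        int by = y;        // B: read y = 4, inconsistent with bx
        try {
            int z = 1 / (bx - by); // B: 1 / (4 - 4)
            return z == Integer.MIN_VALUE; // not reached for these inputs
        } catch (ArithmeticException e) {
            return true;   // the error fires before any commit-time validation
        }
    }
}
```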

Solution: The “Global Clock”
• Have one shared global clock
  • Incremented by (small subset of) writing transactions
  • Read by all transactions
  • Used to validate that state worked on is always consistent

Read-Only Transactions
[Figure: Mem and Locks arrays with sample version numbers; shared version clock = 100, private read version (RV) = 100]
• Copy V Clock to RV
• Read lock, V#
• Read mem
• Check unlocked
• Recheck V# unchanged
• Check V# ≤ RV
Reads form a snapshot of memory. No read set!

Regular Transactions
[Figure: Mem and Locks arrays with sample version numbers; shared version clock = 100, private read version (RV) = 100]
• Copy V Clock to RV
• On read/write, check:
  • Unlocked
  • V# ≤ RV
• Add to R/W set

Regular Transactions
[Figure: locations x and y are locked and then stamped with the new version; shared version clock advances from 100 to 101; private read version (RV) = 100]
• Acquire locks
• WV = F&Inc(V Clock)
• Check each V# ≤ RV
• Update memory
• Set write V#s to WV
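
A global-clock variant can be sketched along the same lines. All names (`ClockStm`, `Tx`, `Aborted`) and the int-array memory model are assumptions. One point is an interpretation rather than something the slide states: the sketch takes WV to be the clock value after the increment (as in the published TL2 algorithm), so that a transaction whose RV was sampled before the commit always sees V# > RV on the newly written locations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of a global-clock STM over an int array; not a library API.
final class ClockStm {
    static final class Aborted extends RuntimeException {}

    final AtomicInteger clock = new AtomicInteger(); // shared version clock
    final AtomicIntegerArray mem;                    // application memory
    final AtomicIntegerArray locks;                  // per location: (V# << 1) | lockedBit

    ClockStm(int size) {
        mem = new AtomicIntegerArray(size);
        locks = new AtomicIntegerArray(size);
    }

    Tx begin() { return new Tx(clock.get()); } // copy V Clock to RV

    final class Tx {
        private final int rv;
        private final Set<Integer> touched = new HashSet<>();             // R/W set locations
        private final TreeMap<Integer, Integer> writes = new TreeMap<>(); // location -> new value

        private Tx(int rv) { this.rv = rv; }

        private void check(int word) { // unlocked and V# <= RV
            if ((word & 1) != 0 || (word >>> 1) > rv) throw new Aborted();
        }

        int read(int i) {
            Integer pending = writes.get(i);
            if (pending != null) return pending;
            int before = locks.get(i);
            int value = mem.get(i);
            check(before);
            if (locks.get(i) != before) throw new Aborted(); // recheck V# unchanged
            touched.add(i);
            return value;
        }

        void write(int i, int value) {
            check(locks.get(i));
            touched.add(i);
            writes.put(i, value);
        }

        void commit() {
            List<Integer> held = new ArrayList<>();
            try {
                for (int i : writes.keySet()) { // acquire locks
                    int w = locks.get(i);
                    if ((w & 1) != 0 || !locks.compareAndSet(i, w, w | 1)) throw new Aborted();
                    held.add(i);
                }
                int wv = clock.incrementAndGet(); // assumption: WV is the post-increment value
                for (int i : touched) {           // check each V# <= RV
                    int w = locks.get(i);
                    boolean lockedByOther = (w & 1) != 0 && !writes.containsKey(i);
                    if (lockedByOther || (w >>> 1) > rv) throw new Aborted();
                }
                for (Map.Entry<Integer, Integer> e : writes.entrySet()) {
                    mem.set(e.getKey(), e.getValue()); // update memory
                }
                for (int i : held) locks.set(i, wv << 1); // set write V#s to WV and release
                held.clear();
            } finally {
                for (int i : held) locks.set(i, locks.get(i) & ~1); // abort path: just unlock
            }
        }
    }
}
```

With this scheme a transaction that read x before a conflicting commit fails at its very next read of y, because y's V# now exceeds its RV. It therefore never gets to compute on an inconsistent pair.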

Hardware Transactional Memory
• Exploit cache coherence
  • Already almost does it
    • Invalidation
    • Consistency checking
• Speculative execution
  • Branch prediction = optimistic synch!

HW Transactional Memory
[Figure: caches connected to memory by an interconnect; an active transaction reads, and its cache entry is tagged T]

Transactional Memory
[Figure: caches and memory; two active transactions, a read, cache entries tagged T]

Transactional Memory
[Figure: caches and memory; one transaction committed, the others active, cache entries tagged T]

Transactional Memory
[Figure: caches and memory; a write, one transaction committed and one active, cache entries tagged D and T]

Rewind
[Figure: caches and memory; a write, one transaction aborted and two active, cache entries tagged D and T]

Transaction Commit
• At commit point
  • If no cache conflicts, we win.
• Mark transactional entries
  • Read-only: valid
  • Modified: dirty (eventually written back)
• That’s all, folks!
  • Except for a few details …
