Software Transactional Memory: Where Do We Come From? What Are We? Where Are We Going?

Nir Shavit, Tel-Aviv University and Sun Labs

Traditional Software Scaling (figure): on a traditional uniprocessor, user code speeds up over time with Moore’s law (1.8x, 3.6x, 7x).

Multicore Software Scaling (figure): ideally, user code on multicore would speed up the same way (1.8x, 3.6x, 7x). Unfortunately, not so simple…

Real-World Multicore Scaling (figure): user code on multicore speeds up only 1.8x, 2x, 2.9x. Parallelization and synchronization require great care…

Why? Amdahl’s Law: Speedup = 1/(ParallelPart/N + SequentialPart). Pay for N = 8 cores with SequentialPart = 25%, and the speedup is only 2.9 times! As the number of cores grows, the effect of the 25% becomes more acute: 2.3x on 4 cores, 2.9x on 8, 3.4x on 16, 3.7x on 32…
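The figures on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function name `amdahl_speedup` is ours, not from the talk.

```python
def amdahl_speedup(sequential_part: float, n_cores: int) -> float:
    """Speedup = 1 / (ParallelPart/N + SequentialPart)."""
    parallel_part = 1.0 - sequential_part
    return 1.0 / (parallel_part / n_cores + sequential_part)


# SequentialPart = 25%, as on the slide.
for n in (4, 8, 16, 32):
    print(n, "cores:", round(amdahl_speedup(0.25, n), 1))
```

With a 25% sequential part the speedup can never exceed 4, however many cores are bought.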

The reason we get only 2.9x speedup: shared data structures (figure: coarse grained vs. fine grained, each with 25% shared and 75% unshared). Fine-grained parallelism has a huge performance benefit.

A FIFO Queue (figure: items a, b, c, d between Head and Tail). Dequeue() => a, Enqueue(d)

A Concurrent FIFO Queue (figure: one object lock over a, b, c, d; P: Dequeue() => a, Q: Enqueue(d)). Simple code, easy to prove correct. Contention and sequential bottleneck.
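A minimal sketch of the object-lock queue, assuming Python threads; the class name `CoarseLockQueue` is hypothetical. Every operation takes the same lock, which is what makes the code easy to reason about and also what makes it a sequential bottleneck.

```python
import threading
from collections import deque


class CoarseLockQueue:
    """FIFO queue guarded by a single object-wide lock."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def enqueue(self, value):
        with self._lock:              # enqueuers and dequeuers all serialize here
            self._items.append(value)

    def dequeue(self):
        with self._lock:
            return self._items.popleft() if self._items else None
```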

Fine Grain Locks (figure: P: Dequeue() => a at the Head, Q: Enqueue(d) at the Tail). Finer granularity, more complex code. Verification nightmare: worry about deadlock, livelock…

Fine Grain Locks (figure: P: Dequeue() => a, Q: Enqueue(b)). Complex boundary cases: empty queue, last item. Worry how to acquire multiple locks.
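One common way to handle these boundary cases is a two-lock queue with a sentinel node, so that enqueuers and dequeuers never touch the same pointer. The sketch below is that general design, not necessarily the one pictured on the slide; all names are hypothetical.

```python
import threading


class _Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None


class TwoLockQueue:
    """FIFO queue with separate head and tail locks and a sentinel node."""

    def __init__(self):
        sentinel = _Node()
        self._head = sentinel                 # always points at the sentinel
        self._tail = sentinel                 # last node (the sentinel when empty)
        self._head_lock = threading.Lock()
        self._tail_lock = threading.Lock()

    def enqueue(self, value):
        node = _Node(value)
        with self._tail_lock:                 # only enqueuers contend here
            self._tail.next = node
            self._tail = node

    def dequeue(self):
        with self._head_lock:                 # only dequeuers contend here
            first = self._head.next
            if first is None:                 # empty queue: just the sentinel
                return None
            self._head = first                # 'first' becomes the new sentinel
            value, first.value = first.value, None
            return value
```

Even in this small case the correctness argument is subtler than for the single lock: it relies on the sentinel to keep the empty and last-item cases from needing both locks at once.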

Lock-Free (JDK 1.5+) (figure: P: Dequeue() => a, Q: Enqueue(d)). Even finer granularity, even more complex code. Worry about starvation, subtle bugs, difficulty of modification…

Real Applications (figure: P: Dequeue(Q1,a), Enqueue(Q2,a)). Complex: move data atomically between structures. More than twice the worry…

Transactional Memory [HerlihyMoss93]

Promise of Transactional Memory (figure: P: Dequeue() => a, Q: Enqueue(d)). Great performance, simple code. Don’t worry about deadlock, livelock, subtle bugs, etc…

Promise of Transactional Memory (figure: P: Dequeue() => a, Q: Enqueue(d)). Don’t worry which locks need to cover which variables when… TM deals with boundary cases under the hood.

For Real Applications (figure: P: Dequeue(Q1,a), Enqueue(Q2,a)). Will be easy to modify multiple structures atomically. Provide serializability…

Using Transactional Memory
enqueue (Q, newnode) {
    Q.tail->next = newnode
    Q.tail = newnode
}

Using Transactional Memory
enqueue (Q, newnode) {
    atomic {
        Q.tail->next = newnode
        Q.tail = newnode
    }
}

Transactions Will Solve Many of Locks’ Problems
• No need to think about what needs to be locked, what does not, and at what granularity
• No worry about deadlocks and livelocks
• No need to think about read-sharing
• Can compose concurrent objects in a way that is safe and scalable
But there are problems! Performance…?

Hardware TM [HerlihyMoss93]
• Hardware transactions of 20-30… but not ~1000 instructions long
• Different machines… expect different hardware support
• Hardware is not flexible… abort policies, retry policies, all application dependent…

Software Transactional Memory [ShavitTouitou94]
• The semantics of hardware transactions… today
• Tomorrow: serve as a standard interface to hardware
• Allow extending hardware features when they arrive
• Still, we need to have reasonable performance… Today’s focus…

The Brief History of STM (timeline figure; years shown: 1994, 1997, 2003, 2004, 2005, 2006; each system is marked lock-free, obstruction-free, or lock-based)
Systems shown: Soft Trans (Ananian, Rinard), Meta Trans (Herlihy, Shavit), T-Monitor (Jagannathan…), Trans Support TM (Moir), AtomJava (Hindman…), WSTM (Fraser, Harris), OSTM (Fraser, Harris), ASTM (Marathe et al), STM (Shavit, Touitou), Lock-OSTM (Ennals), DSTM (Herlihy et al), McTM (Saha et al), TL1/2 (Dice, Shavit), HybridTM (Moir)
2007-9… New lock-based STMs from IBM, Intel, Sun, Microsoft

As Good As Fine Grained Locking
Postulate (i.e. take it or leave it): If we could implement fine-grained locking with the same simplicity as coarse-grained locking, we would never think of building a transactional memory.
Implication: Let’s try to provide STMs that get as close as possible to hand-crafted fine-grained locking.

Transactional Consistency
• Memory transactions are collections of reads and writes executed atomically
• Transactions should maintain internal and external consistency
• External: with respect to the interleavings of other transactions.
• Internal: the transaction itself should operate on a consistent state.

External Consistency
Invariant: x = 2y. Application memory: X goes 4 -> 8, Y goes 2 -> 4.
Transaction A: Write x, Write y
Transaction B: Read x, Read y, Compute z = 1/(x-y) = 1/4

Locking STM Design Choices (figure: application memory is mapped to an array of versioned write-locks, each holding a version number V#)
PS = Lock per Stripe (separate array of locks)
PO = Lock per Object (embedded in object)
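To make the PS option concrete, here is a small sketch of a versioned write-lock word and an address-to-stripe map. The bit layout (lock bit in the lowest bit, version above it), the table size, the stripe width, and all function names are assumptions for illustration, not details given in the talk.

```python
NUM_STRIPES = 1 << 10        # assumed size of the separate lock array
STRIPE_BYTES = 64            # assumed width of one memory stripe


def stripe_index(address: int) -> int:
    """PS mode: map a memory address to its slot in the lock array."""
    return (address // STRIPE_BYTES) % NUM_STRIPES


def pack(version: int, locked: bool) -> int:
    """Build a lock word: version number plus a one-bit write lock."""
    return (version << 1) | int(locked)


def version_of(word: int) -> int:
    return word >> 1


def is_locked(word: int) -> bool:
    return bool(word & 1)
```

In PO mode the same word would simply live inside each object instead of in a separate array, trading the mapping step for an extra field per object.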

Encounter Order Locking (Undo Log) [Ennals, Saha, Harris, TinySTM…] (figure: Mem and Locks columns with locations X and Y; blue code does not change memory, red does)
• To Read: load lock + location
• Check unlocked, add to Read-Set
• To Write: lock location, store value
• Add old value to undo-set
• Validate read-set v#’s unchanged
• Release each lock with v#+1
Quick read of values freshly written by the reading transaction
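The steps above can be sketched in Python as follows. This is a single-threaded illustration of the bookkeeping only: the names `Cell`, `EncounterTx`, and `Abort` are hypothetical, lock acquisition would be a CAS in a real STM, and the atomic "load lock + location" read is glossed over.

```python
class Abort(Exception):
    """Raised when a transaction must roll back and retry."""


class Cell:
    """One memory location plus its versioned write-lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.owner = None             # transaction holding the lock, if any


class EncounterTx:
    def __init__(self):
        self.read_set = {}            # cell -> version seen at first read
        self.undo_log = []            # (cell, old value) for each locked cell

    def read(self, cell):
        if cell.owner is self:
            return cell.value         # quick read of our own fresh write
        if cell.owner is not None:
            self.abort()              # locked by another transaction
        self.read_set.setdefault(cell, cell.version)
        return cell.value

    def write(self, cell, value):
        if cell.owner is not self:
            if cell.owner is not None:
                self.abort()
            cell.owner = self         # lock on first encounter
            self.undo_log.append((cell, cell.value))
        cell.value = value            # store in place

    def commit(self):
        for cell, seen in self.read_set.items():
            if cell.version != seen or cell.owner not in (None, self):
                self.abort()          # a read location changed under us
        for cell, _ in self.undo_log:
            cell.version += 1         # release each lock with v#+1
            cell.owner = None

    def abort(self):
        for cell, old in reversed(self.undo_log):
            cell.value = old          # undo the in-place writes
            cell.owner = None         # release without bumping the version
        self.undo_log.clear()
        raise Abort()
```

Because writes go straight to memory, a transaction reads its own writes at no extra cost, but locks are held from first write until commit.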

Commit Time Locking (Write Log) [TL, TL2] (figure: Mem and Locks columns with locations X and Y)
• To Read: load lock + location
• Location in write-set? (Bloom Filter)
• Check unlocked, add to Read-Set
• To Write: add value to write set
• Acquire Locks
• Validate read/write v#’s unchanged
• Release each lock with v#+1
Hold locks for very short duration
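A comparable single-threaded sketch of commit-time locking, under the same caveats: `Cell`, `CommitTx`, and `Abort` are hypothetical names, a plain dict stands in for the Bloom-filtered write set, and for brevity only the read set is validated (an assumption of this sketch; the slide validates read and write versions).

```python
class Abort(Exception):
    """Raised when a transaction must roll back and retry."""


class Cell:
    """One memory location plus its versioned write-lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.locked = False


class CommitTx:
    def __init__(self):
        self.read_set = {}            # cell -> version seen at first read
        self.write_set = {}           # cell -> value to install at commit

    def read(self, cell):
        if cell in self.write_set:    # location in write-set?
            return self.write_set[cell]
        if cell.locked:
            raise Abort()
        self.read_set.setdefault(cell, cell.version)
        return cell.value

    def write(self, cell, value):
        self.write_set[cell] = value  # memory is untouched until commit

    def commit(self):
        acquired = []
        try:
            for cell in self.write_set:           # acquire locks
                if cell.locked:
                    raise Abort()
                cell.locked = True
                acquired.append(cell)
            for cell, seen in self.read_set.items():
                foreign_lock = cell.locked and cell not in self.write_set
                if cell.version != seen or foreign_lock:
                    raise Abort()                 # read-set v# changed
        except Abort:
            for cell in acquired:
                cell.locked = False               # nothing was written yet
            raise
        for cell, value in self.write_set.items():
            cell.value = value
            cell.version += 1                     # release each lock with v#+1
            cell.locked = False
```

Locks are held only inside `commit`, which is the "very short duration" the slide refers to; the price is the write-set lookup on every read.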

COM vs. ENC, High Load (figure): Red-Black Tree, 20% Delete, 20% Update, 60% Lookup; curves for Hand, COM, ENC, Lock.

COM vs. ENC, Low Load (figure): Red-Black Tree, 5% Delete, 5% Update, 90% Lookup; curves for Hand, COM, ENC, Lock.

Subliminal Cut Technion 2008

Problem: Internal Inconsistency
• A zombie is a currently active transaction that is destined to abort because it saw an inconsistent state
• If zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us

Internal Inconsistency
Invariant: x = 2y. Application memory: X = 4, Y = 2.
Transaction B: Read x = 4
Transaction A: Write x (4 -> 8), Write y (2 -> 4)
Transaction B: Read y = 4 {trans is zombie}, Compute z = 1/(x-y): DIV by 0 ERROR
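The interleaving on this slide can be replayed deterministically. The snippet below is just that replay in plain Python, with no STM involved; the variable names are ours.

```python
x, y = 4, 2                      # invariant: x == 2 * y

seen_x = x                       # Transaction B: Read x = 4
x, y = 8, 4                      # Transaction A: Write x, Write y (invariant holds)
seen_y = y                       # Transaction B: Read y = 4, B is now a zombie

try:
    z = 1 / (seen_x - seen_y)    # B computes on a state that never existed
    outcome = "ok"
except ZeroDivisionError:
    outcome = "DIV by 0 ERROR"

print(outcome)
```

Both committed states satisfy the invariant; the fault comes only from B mixing a value from before A with a value from after it.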

Past Approaches
• Design STMs that allow internal inconsistency.
• To detect zombies, introduce validation into user code at fixed intervals or in loops; use traps, OS support
• Still, there are cases where zombies cannot be detected: infinite loops in user code…

Global Clock [TL2/SnapIsolation] [DiceShalevShavit06/RiegelFelberFetzer06]
• Have a shared global version clock
• Incremented by writing transactions (as infrequently as possible)
• Read by all transactions
• Used to validate that the state viewed by a transaction is always consistent

TL2 Version Clock: Read-Only Trans (figure: Mem and Locks columns; shared VClock = 100, private RV = 100)
• RV <- VClock
• To Read: read lock, read mem, read lock; check unlocked, unchanged, and v# <= RV
• Commit. Reads form a snapshot of memory. No read set!
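A small sketch of the read-only path, assuming a `Cell` that carries a version and a lock flag. All names (`GlobalClock`, `Cell`, `tl2_read`, `read_only_tx`, `Abort`) are hypothetical, and the loads are ordinary Python attribute reads rather than ordered memory operations.

```python
class Abort(Exception):
    """Raised when a read cannot be part of a consistent snapshot."""


class GlobalClock:
    def __init__(self, value=0):
        self.value = value


class Cell:
    def __init__(self, value, version=0):
        self.value = value
        self.version = version
        self.locked = False


def tl2_read(cell, rv):
    """Read lock, read mem, read lock; check unlocked, unchanged, v# <= RV."""
    pre_version, pre_locked = cell.version, cell.locked
    value = cell.value
    if pre_locked or cell.locked or cell.version != pre_version:
        raise Abort()
    if pre_version > rv:              # written after our snapshot time
        raise Abort()
    return value


def read_only_tx(clock, body):
    rv = clock.value                  # RV <- VClock
    return body(lambda cell: tl2_read(cell, rv))   # no read set is kept
```

For example, `read_only_tx(clock, lambda read: read(x) - read(y))` either returns a result computed from one consistent snapshot or raises `Abort`; there is nothing to validate at commit.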

TL2 Version Clock: Writing Trans (figure: Mem and Locks columns; RV = 100; VClock shown at 100, 120, 121; locations X and Y are released with version 121)
• RV <- VClock
• To Read/Write: check unlocked and v# <= RV, then add to Read/Write-Set
• Acquire Locks
• WV = F&I(VClock)
• Validate each v# <= RV
• Release locks with v# <- WV
Commit: Reads + Inc + Writes = serializable
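The writing path can be sketched the same way. This is a single-threaded illustration with hypothetical names (`GlobalClock`, `Cell`, `Tl2Tx`, `Abort`); the fetch-and-increment is a plain addition here, and the sketch checks versions on reads only, which is a simplification of the slide's "Read/Write" step.

```python
class Abort(Exception):
    """Raised when a transaction must retry."""


class GlobalClock:
    def __init__(self, value=0):
        self.value = value


class Cell:
    def __init__(self, value, version=0):
        self.value = value
        self.version = version
        self.locked = False


class Tl2Tx:
    def __init__(self, clock):
        self.clock = clock
        self.rv = clock.value         # RV <- VClock
        self.read_set = []
        self.write_set = {}           # cell -> value to install at commit

    def read(self, cell):
        if cell in self.write_set:
            return self.write_set[cell]
        if cell.locked or cell.version > self.rv:
            raise Abort()
        self.read_set.append(cell)
        return cell.value

    def write(self, cell, value):
        self.write_set[cell] = value

    def commit(self):
        acquired = []
        try:
            for cell in self.write_set:           # acquire locks
                if cell.locked:
                    raise Abort()
                cell.locked = True
                acquired.append(cell)
            self.clock.value += 1                 # WV = F&I(VClock)
            wv = self.clock.value
            for cell in self.read_set:            # validate each v# <= RV
                foreign_lock = cell.locked and cell not in self.write_set
                if cell.version > self.rv or foreign_lock:
                    raise Abort()
        except Abort:
            for cell in acquired:
                cell.locked = False
            raise
        for cell, value in self.write_set.items():
            cell.value = value
            cell.version = wv                     # release with v# <- WV
            cell.locked = False
```

With RV = 100 and the clock at 120 when the transaction commits, both written cells end up stamped 121, matching the numbers in the figure.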

How we learned to stop worrying and love the clock
Version clock rate is a progress concern, not a safety concern, so…
• (GV4) if the CAS to increment VClock fails, use the VClock value set by the winner
• (GV5) use WV = VClock + 2; inc VClock on abort
• (GV7) localized clocks… [AvniShavit08]
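A sketch of the GV4 idea, assuming a clock object with a compare-and-swap. Python has no hardware CAS, so a lock emulates it here; `VersionClock` and `next_write_version_gv4` are hypothetical names.

```python
import threading


class VersionClock:
    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()     # stands in for a hardware CAS

    def read(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def next_write_version_gv4(clock):
    """Try one CAS; if it fails, adopt the value the winner installed."""
    seen = clock.read()
    if clock.compare_and_swap(seen, seen + 1):
        return seen + 1
    return clock.read()                   # no retry loop: share the winner's value
```

Two committing transactions may end up with the same write version. That is acceptable here because, as the slide puts it, the clock rate is a progress concern rather than a safety concern.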

Uncontended Large Red-Black Tree (figure): 5% Delete, 5% Update, 90% Lookup; curves for Hand-crafted, TL/PO, TL2/PO, Ennals, TL/PS, TL2/PS, FraserHarris Lock-free.

Contended Small RB-Tree (figure): 30% Delete, 30% Update, 40% Lookup; curves for TL/PO, TL2/PO, Ennals.

Implicit Privatization [Menon et al]
• In real apps: often want to “privatize” data
• Then operate on it non-transactionally
• Many STMs (like TL2) are based on “Invisible Readers”
• Invisible readers are a problem if we want implicit privatization…

Privatization Pathology (figure: list a, b, c, d)
P privatizes node b, then modifies it non-transactionally
P: atomically{ a.next = c; }
   // b is private
   b.value = 0;

Privatization Pathology (figure: list a, b, c, d)
Invisible reader Q cannot detect non-transactional modification to node b
P: atomically{ a.next = c; }
   // b is private
   b.value = 0;
Q: atomically{ tmp = a.next; foo = (1/tmp.value) }
Q: divide by 0 error
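The bad interleaving can be replayed step by step in plain Python. No STM is involved; the comments mark which thread would perform each step, and the `Node` class and the node values are ours.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


d = Node(4)
c = Node(3, d)
b = Node(2, c)
a = Node(1, b)

tmp = a.next              # Q, inside its transaction: tmp = a.next (node b)
a.next = c                # P: atomically{ a.next = c; }, b is now private
b.value = 0               # P: non-transactional write, no lock or version touched

try:
    foo = 1 / tmp.value   # Q still holds b and has no way to notice the change
    outcome = "ok"
except ZeroDivisionError:
    outcome = "divide by 0 error"

print(outcome)
```

Q's read left no trace that P could check before touching b, which is exactly what "invisible" means here.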

Visible Readers
• Use read-write locks. Transactions also lock to read.
• Privatization is immediate…
• But RW-locks will make us burn in coherence traffic hell: CAS to increment/decrement reader-count
• Which is why we had invisible readers in the first place. Or is it?

Read-Write Bytelocks [DiceShavit09] (figure: application memory is mapped to an array of read-write byte-locks)
• A new read-write lock for multicores
• Common case: no CAS, only store + membar to read
• Claim: on modern multicores, the cost of coherent stores is not too bad…

The ByteLock Lock Record (figure: a single cache line holding wrtid, one byte per slot 1 to 5, and rdcnt)
• Writer ID
• Visible readers:
• Reader count for unslotted threads (traditional): CAS to increment and decrement
• Reader array for slotted threads: array of atomically addressable bytes, 48 or 112 slots, write + membar to modify slots

ByteLock Write (figure: writer i in wrtid, reader slots 1 to 5, rdcnt)
Writers wait till readers drain out. Writer i: CAS, then spin until all 0. Intel, AMD, Sun read 8 or 16 bytes at a time.

Slotted Read (figure: wrtid, reader slots 1 to 5, rdcnt; reader i’s slot goes 1 -> 0 on release)
Readers give preference to writers. Slotted reader i: No writer? Read Mem. Release: simple store. On Intel, AMD, Sun, store to byte + membar is very fast.

Slotted Read Slow-Path (figure: writer i in wrtid, reader slots 1 to 5, rdcnt)
Readers give preference to writers. Slotted reader i: if non-0, retry; spin until non-0, then retry. On Intel, AMD, Sun, store to byte + membar is very fast.

Unslotted Read (figure: writer i in wrtid, reader slots 1 to 5, rdcnt going 3 -> 4)
Unslotted readers behave as in a traditional RW lock. Unslotted reader i: CAS to increment rdcnt; if wrtid is non-0, decrement using CAS and wait for the writer to go away.
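Pulling the ByteLock slides together, here is a hedged sketch of the lock record and its three paths. It is not the implementation from [DiceShavit09]: method names are hypothetical, a Python lock emulates CAS, memory barriers are omitted, and each `try_` method returns False where a real thread would spin (here, a reader backs off while a writer is present) and retry.

```python
import threading


class ByteLock:
    def __init__(self, slots=48):
        self.wrtid = 0                     # 0 means "no writer"
        self.slots = bytearray(slots)      # one byte per slotted reader
        self.rdcnt = 0                     # count of unslotted readers
        self._cas = threading.Lock()       # stands in for hardware CAS

    # Writer: CAS its id into wrtid, then wait until readers drain out.
    def try_write_acquire(self, thread_id):
        with self._cas:
            if self.wrtid != 0:
                return False
            self.wrtid = thread_id
        return True

    def readers_drained(self):
        return not any(self.slots) and self.rdcnt == 0

    def write_release(self):
        self.wrtid = 0

    # Slotted reader: plain store to its own byte, no CAS in the common case.
    def try_slotted_read_acquire(self, slot):
        self.slots[slot] = 1               # store (+ membar on real hardware)
        if self.wrtid == 0:
            return True
        self.slots[slot] = 0               # give preference to the writer
        return False

    def slotted_read_release(self, slot):
        self.slots[slot] = 0               # release: simple store

    # Unslotted reader: traditional reader count, updated with CAS.
    def try_unslotted_read_acquire(self):
        with self._cas:
            self.rdcnt += 1
        if self.wrtid == 0:
            return True
        with self._cas:
            self.rdcnt -= 1                # back off while a writer is present
        return False

    def unslotted_read_release(self):
        with self._cas:
            self.rdcnt -= 1
```

A writer would call `try_write_acquire` and then poll `readers_drained()` before touching memory. Slotted readers never contend with one another because each writes only its own byte.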
