CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Fall 2011, Prof. Jennifer Welch. Distributed Shared Memory: a model for inter-process communication that provides the illusion of shared variables on top of message passing.


Presentation Transcript


  1. Set 16: Distributed Shared Memory CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS Fall 2011 Prof. Jennifer Welch

  2. Distributed Shared Memory • A model for inter-process communication • Provides illusion of shared variables on top of message passing • Shared memory is often considered a more convenient programming platform than message passing • Formally, give a simulation of the shared memory model on top of the message passing model • We'll consider the special case of • no failures • only read/write variables to be simulated Set 16: Distributed Shared Memory

  3. The Simulation [Figure: layered architecture. Users of read/write shared memory issue read/write invocations and receive return/ack responses from the Shared Memory layer; that layer is simulated by processes alg0 … algn-1, which communicate via send/recv over the underlying Message Passing System.] Set 16: Distributed Shared Memory

  4. Shared Memory Issues • A process invokes a shared memory operation (read or write) at some time • The simulation algorithm running on the same node executes some code, possibly involving exchanges of messages • Eventually the simulation algorithm informs the process of the result of the shared memory operation. • So shared memory operations are not instantaneous! • Operations (invoked by different processes) can overlap • What values should be returned by operations that overlap other operations? • defined by a memory consistency condition Set 16: Distributed Shared Memory

  5. Sequential Specifications • Each shared object has a sequential specification: specifies behavior of object in the absence of concurrency. • Object supports operations • invocations • matching responses • Set of sequences of operations that are legal Set 16: Distributed Shared Memory

  6. Sequential Spec for R/W Registers • Each operation has two parts, invocation and response • Read operation has invocation readi(X) and response returni(X,v) (subscript i indicates proc.) • Write operation has invocation writei(X,v) and response acki(X) (subscript i indicates proc.) • A sequence of operations is legal iff each read returns the value of the latest preceding write. • Ex: [write0(X,3) ack0(X)] [read1(X) return1(X,3)] Set 16: Distributed Shared Memory
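To make the legality condition concrete, here is a minimal sketch (my illustration; the slides give no code) that checks whether a sequence of complete read/write operations is legal, i.e., every read returns the value of the latest preceding write to the same register, or the initial value if there is none. The tuple encoding of operations is an assumption made just for this sketch.

```python
# Sketch: legality check for a sequence of complete read/write operations.
# Each operation is a tuple (kind, proc, var, value), e.g. ("write", 0, "X", 3)
# or ("read", 1, "X", 3).  Registers are assumed to hold 0 initially.

def is_legal(ops, initial=0):
    latest = {}                            # var -> value of latest preceding write
    for kind, proc, var, value in ops:
        if kind == "write":
            latest[var] = value            # remember the most recent write
        elif value != latest.get(var, initial):
            return False                   # read did not return the latest value
    return True

# The slide's example: write0(X,3) followed by read1(X) returning 3 is legal.
assert is_legal([("write", 0, "X", 3), ("read", 1, "X", 3)])
assert not is_legal([("write", 0, "X", 3), ("read", 1, "X", 0)])
```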

  7. Memory Consistency Conditions • Consistency conditions tie together the sequential specification with what happens in the presence of concurrency. • We will study two well-known conditions: • linearizability • sequential consistency • We will only consider read/write registers, in the absence of failures. Set 16: Distributed Shared Memory

  8. Definition of Linearizability • Suppose σ is a sequence of invocations and responses for a set of operations. • an invocation is not necessarily immediately followed by its matching response; there can be concurrent, overlapping ops • σ is linearizable if there exists a permutation π of all the operations in σ (now each invocation is immediately followed by its matching response) s.t. • π|X is legal (satisfies sequential spec) for all vars X, and • if the response of operation O1 occurs in σ before the invocation of operation O2, then O1 occurs in π before O2 (π respects the real-time order of non-overlapping operations in σ). Set 16: Distributed Shared Memory
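The definition translates directly into a brute-force check; the sketch below (my illustration, feasible only for tiny histories) searches for a permutation π that is legal per variable and respects the real-time order of non-overlapping operations. The record layout, with explicit invocation and response times, is an assumption of this sketch.

```python
# Sketch: brute-force linearizability check for complete read/write operations.
# Each operation: (proc, kind, var, value, inv_time, resp_time).
from itertools import permutations

def legal(seq, initial=0):
    """Each read returns the value of the latest preceding write to its variable."""
    latest = {}
    for proc, kind, var, value, *_ in seq:
        if kind == "write":
            latest[var] = value
        elif value != latest.get(var, initial):
            return False
    return True

def respects_real_time(seq):
    """An operation that responded before another was invoked must not follow it."""
    return not any(o2[5] < o1[4]
                   for i, o1 in enumerate(seq) for o2 in seq[i + 1:])

def is_linearizable(ops):
    return any(legal(p) and respects_real_time(p) for p in permutations(ops))

# Two overlapping operations: write(X,1) over [0,2] and read(X)=1 over [1,3].
ops = [(0, "write", "X", 1, 0.0, 2.0), (1, "read", "X", 1, 1.0, 3.0)]
assert is_linearizable(ops)
```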

  9. Linearizability Examples Suppose there are two shared variables, X and Y, both initially 0. [Figure: p0 performs write(X,1) and then read(Y), which returns 1; p1 performs write(Y,1) and then read(X), which returns 1.] Is this sequence linearizable? Yes: e.g., order the operations write(X,1), write(Y,1), read(Y), read(X). What if p1's read returns 0? No: p0's write(X,1) completes before p1's read of X begins, so that read must return 1. Set 16: Distributed Shared Memory

  10. Definition of Sequential Consistency • Suppose σ is a sequence of invocations and responses for some set of operations. • σ is sequentially consistent if there exists a permutation π of all the operations in σ s.t. • π|X is legal (satisfies sequential spec) for all vars X, and • if the response of operation O1 occurs in σ before the invocation of operation O2 at the same process, then O1 occurs in π before O2 (π respects the real-time order of operations by the same process in σ). Set 16: Distributed Shared Memory
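Compared with the linearizability check sketched above, only the ordering constraint changes: the permutation must preserve the order of operations issued by the same process rather than the real-time order across processes. A hypothetical drop-in replacement for respects_real_time in that sketch (it relies on the Correct Interaction assumption that operations of one process never overlap):

```python
# Sketch: the sequential-consistency ordering constraint.  An operation that
# responded before another was invoked AT THE SAME PROCESS must not follow it.
def respects_process_order(seq):
    return not any(o2[0] == o1[0] and o2[5] < o1[4]
                   for i, o1 in enumerate(seq) for o2 in seq[i + 1:])
```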

  11. Sequential Consistency Examples Suppose there are two shared variables, X and Y, both initially 0. [Figure: p0 performs write(X,1) and then read(Y), which returns 1; p1 performs write(Y,1) and then read(X), which returns 0.] Is this sequence sequentially consistent? Yes: order the operations write(Y,1), read(X), write(X,1), read(Y). What if p0's read returns 0? No: then each read would have to precede the other process's write in the permutation, which contradicts the write-then-read order at both processes. Set 16: Distributed Shared Memory

  12. Specification of Linearizable Shared Memory Comm. System • Inputs are invocations on the shared objects • Outputs are responses from the shared objects • A sequence σ is in the allowable set iff • Correct Interaction: each proc. alternates invocations and matching responses • Liveness: each invocation has a matching response • Linearizability: σ is linearizable Set 16: Distributed Shared Memory

  13. Specification of Sequentially Consistent Shared Memory • Inputs are invocations on the shared objects • Outputs are responses from the shared objects • A sequence σ is in the allowable set iff • Correct Interaction: each proc. alternates invocations and matching responses • Liveness: each invocation has a matching response • Sequential Consistency: σ is sequentially consistent Set 16: Distributed Shared Memory

  14. Algorithm to Implement Linearizable Shared Memory • Uses totally ordered broadcast as the underlying communication system. • Each proc keeps a replica for each shared variable • When read request arrives: • send bcast msg containing request • when own bcast msg arrives, return value in local replica • When write request arrives: • send bcast msg containing request • upon receipt, each proc updates its replica's value • when own bcast msg arrives, respond with ack Set 16: Distributed Shared Memory
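A minimal Python sketch of how this algorithm could be organized is shown below. The broadcast interface (a to_bc_send method plus an on_deliver callback) and the class and method names are assumptions made for illustration; the slides describe the algorithm only in prose. At most one operation per process is outstanding at a time (Correct Interaction).

```python
# Sketch of the linearizability algorithm on top of totally ordered broadcast.
# Any broadcast service that delivers every message to every proc in one
# common total order can play the role of self.bcast.

class LinearizableDSM:
    def __init__(self, proc_id, to_broadcast):
        self.id = proc_id
        self.bcast = to_broadcast        # assumed to provide to_bc_send(msg)
        self.replica = {}                # local copy of every shared variable
        self.pending = None              # the operation waiting for its own bcast

    # ---- invocations from the local user ----
    def read(self, var, respond):
        self.pending = ("read", var, respond)
        self.bcast.to_bc_send(("read", self.id, var, None))

    def write(self, var, value, respond):
        self.pending = ("write", var, respond)
        self.bcast.to_bc_send(("write", self.id, var, value))

    # ---- delivery of totally ordered broadcast messages ----
    def on_deliver(self, msg):
        kind, sender, var, value = msg
        if kind == "write":
            self.replica[var] = value    # every proc applies writes in bcast order
        if sender == self.id:            # our own bcast came back: respond now
            op, var_p, respond = self.pending
            self.pending = None
            if op == "read":
                respond(self.replica.get(var_p, 0))   # value in local replica
            else:
                respond("ack")
```

Note that a read's own broadcast carries no data; it only fixes the read's place in the total order, which is exactly the point revisited on slides 18-19 below.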

  15. The Simulation [Figure: the layered architecture as before, but the bottom layer is now Totally Ordered Broadcast. Users of read/write shared memory issue read/write invocations and receive return/ack responses from the Shared Memory layer; the simulating processes alg0 … algn-1 communicate via to-bc-send and to-bc-recv.] Set 16: Distributed Shared Memory

  16. Correctness of Linearizability Algorithm • Consider any admissible execution α of the algorithm in which • the underlying totally ordered broadcast behaves properly • users interact properly (alternate invocations and responses) • Show that σ, the restriction of α to the events of the top interface, satisfies Liveness and Linearizability. Set 16: Distributed Shared Memory

  17. Correctness of Linearizability Algorithm • Liveness (every invocation has a response): By the Liveness property of the underlying totally ordered broadcast. • Linearizability: Define the permutation π of the operations to be the order in which the corresponding broadcasts are received. • π is legal: because all the operations are consistently ordered by the TO bcast. • π respects real-time order of operations: if O1 finishes before O2 begins, O1's bcast is ordered before O2's bcast. Set 16: Distributed Shared Memory

  18. Why is Read Bcast Needed? • The bcast done for a read causes no changes to any replicas, just delays the response to the read. • Why is it needed? • Let's see what happens if we remove it. Set 16: Distributed Shared Memory

  19. Why Read Bcast is Needed [Figure: p1 invokes write(1) and does to-bc-send; p0 receives the broadcast, updates its replica, and then reads, returning 1; p2, which has not yet received the broadcast, reads later (after p0's read has completed) and returns 0. Not linearizable!] Set 16: Distributed Shared Memory

  20. Algorithm for Sequential Consistency • The linearizability algorithm, without doing a bcast for reads: • Uses totally ordered broadcast as the underlying communication system. • Each proc keeps a replica for each shared variable • When read request arrives: • immediately return the value stored in the local replica • When write request arrives: • send bcast msg containing request • upon receipt, each proc updates its replica's value • when own bcast msg arrives, respond with ack Set 16: Distributed Shared Memory
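Relative to the linearizability sketch above, only the read path changes: it answers immediately from the local replica and sends nothing. A standalone sketch under the same assumed broadcast interface:

```python
# Sketch of the sequential-consistency algorithm: writes still go through
# totally ordered broadcast, but reads are answered locally and immediately.
class SeqConsistentDSM:
    def __init__(self, proc_id, to_broadcast):
        self.id = proc_id
        self.bcast = to_broadcast
        self.replica = {}
        self.pending_ack = None          # respond callback of our outstanding write

    def read(self, var, respond):
        respond(self.replica.get(var, 0))          # local: no broadcast, no waiting

    def write(self, var, value, respond):
        self.pending_ack = respond
        self.bcast.to_bc_send(("write", self.id, var, value))

    def on_deliver(self, msg):
        _, sender, var, value = msg
        self.replica[var] = value                  # apply writes in bcast order
        if sender == self.id:                      # our own write came back: ack it
            ack, self.pending_ack = self.pending_ack, None
            ack("ack")
```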

  21. Correctness of SC Algorithm Lemma (9.3): The local copies at each proc. take on all the values appearing in write operations, in the same order, which preserves the order of non-overlapping writes - implies per-process order of writes is preserved Lemma (9.4): If pi writes Y and later reads X, then pi's update of its local copy of Y (on behalf of that write) precedes its read of its local copy of X (on behalf of that read). Set 16: Distributed Shared Memory

  22. Correctness of the SC Algorithm (Theorem 9.5) Why does SC hold? • Given any admissible execution α, we must come up with a permutation π of the shared memory operations that is • legal and • respects the per-proc. ordering of operations Set 16: Distributed Shared Memory

  23. The Permutation π • Insert all writes into π in their to-bcast order. • Consider each read R in α in the order of invocation: • suppose R is a read by pi of X • place R in π immediately after the later of • the operation by pi that immediately precedes R in α, and • the write that R "read from" (caused the latest update of pi's local copy of X preceding the response for R) Set 16: Distributed Shared Memory
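The construction can be stated as a small procedure; the sketch below is my paraphrase of the rule, assuming each read record already knows which write it read from and which operation by the same process immediately precedes it in α.

```python
# Sketch: building the permutation pi used in the SC correctness proof.
# writes: write operations, listed in to-bcast order.
# reads:  read operations in invocation order; each read r carries
#   r.read_from       -- the write it read from (None if it read the initial value)
#   r.prev_same_proc  -- the previous operation by the same process (None if first)

def build_permutation(writes, reads):
    pi = list(writes)                    # step 1: all writes, in to-bcast order
    for r in reads:                      # step 2: reads, in invocation order
        # index of the later of r's same-process predecessor and the write
        # r read from; r goes immediately after that operation.
        anchor = max((pi.index(op)
                      for op in (r.prev_same_proc, r.read_from) if op is not None),
                     default=-1)
        pi.insert(anchor + 1, r)
    return pi
```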

  24. Permutation Example [Figure: p2 performs write(1) and then a read that returns 1; p1 performs write(2); p0 performs a read that returns 2. Both writes go through to-bc-send, with write(1) delivered before write(2). The resulting permutation π is: p2's write(1), p2's read returning 1, p1's write(2), p0's read returning 2.] Set 16: Distributed Shared Memory

  25. Permutation π Respects Per Proc. Ordering For a specific proc: • Relative ordering of two writes is preserved by Lemma 9.3 • Relative ordering of two reads is preserved by the construction of π • If write W precedes read R in exec. α, then W precedes R in π by construction • Suppose read R precedes write W in α. Show the same is true in π. Set 16: Distributed Shared Memory

  26. Permutation π Respects Ordering • Suppose in contradiction that R and W are swapped in π: α: … R' … R … W …  π: … W … W' … R' … R … • There is a read R' by pi that equals or precedes R in π • There is a write W' that equals W or follows W in the to-bcast order • And R' "reads from" W'. • But: • R' finishes before W starts in α, and • updates are done to local replicas in to-bcast order (Lemma 9.3), so the update for W' does not precede the update for W • so R' cannot read from W', contradiction. Set 16: Distributed Shared Memory

  27. Permutation π is Legal • Consider some read R of X by pi and some write W s.t. R reads from W in α. • Suppose in contradiction that some other write W' to X falls between W and R in π: π: … W … W' … R … • Why does R follow W' in π? Set 16: Distributed Shared Memory

  28. Permutation π is Legal Case 1: W' is also by pi. Then R follows W' in α as well, since π preserves the ordering of operations by the same process. • Update for W at pi precedes update for W' at pi (Lemma 9.3). • Since W' is by pi and finishes before R starts, the update for W' at pi precedes pi's local read for R. • Thus R does not read from W, contradiction. Set 16: Distributed Shared Memory

  29. Permutation π is Legal Case 2: W' is not by pi. Then R follows W' in π due to some operation O, also by pi, s.t. • O precedes R in α, and • O is placed between W' and R in π: π: … W … W' … O … R … Consider the earliest such O. • Case 2.1: O is a write (not necessarily to X). • update for W' at pi precedes update for O at pi (Lemma 9.3) • update for O at pi precedes pi's local read for R (Lemma 9.4) • So R does not read from W, contradiction. Set 16: Distributed Shared Memory

  30. Permutation π is Legal π: … W … W' … O … R … • Case 2.2: O is a read. • By construction of π, O must read X and in fact read from W' (otherwise O would not be after W') • Update for W at pi precedes update for W' at pi (Lemma 9.3). • Update for W' at pi precedes the local read for O at pi (otherwise O would not read from W'). • Thus R cannot read from W, contradiction. Set 16: Distributed Shared Memory

  31. Performance of SC Algorithm • Read operations are implemented "locally", without requiring any inter-process communication. • Thus reads can be viewed as "fast": time between invocation and response is only that needed for some local computation. • Time for a write is time for delivery of one totally ordered broadcast (depends on how to-bcast is implemented). Set 16: Distributed Shared Memory

  32. Alternative SC Algorithm • It is possible to have an algorithm that implements sequentially consistent shared memory on top of totally ordered broadcast that has reverse performance: • writes are local/fast (even though bcasts are sent, don't wait for them to be received) • reads can require waiting for some bcasts to be received • Like the previous SC algorithm, this one does not implement linearizable shared memory. Set 16: Distributed Shared Memory
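One standard way to obtain this reversed trade-off (a sketch of my own, under the same assumed broadcast interface as earlier; the slide states only the performance properties) is to acknowledge a write as soon as its broadcast is sent, and to make a read wait until all of the process's own outstanding writes have been delivered back to it.

```python
# Sketch of a "fast write" sequentially consistent algorithm: writes are
# acknowledged immediately; reads wait for the process's own pending writes.
class FastWriteSC:
    def __init__(self, proc_id, to_broadcast):
        self.id = proc_id
        self.bcast = to_broadcast
        self.replica = {}
        self.num_pending = 0             # own writes broadcast but not yet delivered
        self.waiting_read = None         # (var, respond) of a read that must wait

    def write(self, var, value, respond):
        self.num_pending += 1
        self.bcast.to_bc_send(("write", self.id, var, value))
        respond("ack")                               # local/fast: ack right away

    def read(self, var, respond):
        if self.num_pending == 0:
            respond(self.replica.get(var, 0))        # nothing outstanding: answer now
        else:
            self.waiting_read = (var, respond)       # wait for own writes to arrive

    def on_deliver(self, msg):
        _, sender, var, value = msg
        self.replica[var] = value                    # apply writes in bcast order
        if sender == self.id:
            self.num_pending -= 1
            if self.num_pending == 0 and self.waiting_read is not None:
                var_w, respond = self.waiting_read
                self.waiting_read = None
                respond(self.replica.get(var_w, 0))
```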

  33. Time Complexity for DSM Algorithms • One complexity measure of interest for DSM algorithms is how long it takes for operations to complete. • The linearizability algorithm required D time for both reads and writes, where D is the maximum time for a totally-ordered broadcast message to be received. • The sequential consistency algorithm required D time for writes and 0 time for reads, since we are assuming time for local computation is negligible. • Can we do better? To answer this question, we need some kind of timing model. Set 16: Distributed Shared Memory

  34. Timing Model • Assume the underlying communication system is the point-to-point message passing system (not totally ordered broadcast). • Assume that every message has delay in the range [d-u,d]. • Claim:Totally ordered broadcast can be implemented in this model so that D, the maximum time for delivery, is O(d). Set 16: Distributed Shared Memory

  35. Time and Clocks in Layered Model • Timed execution: associate an occurrence time with each node input event. • Times of other events are "inherited" from time of triggering node input • recall assumption that local processing time is negligible. • Model hardware clocks as before: run at same rate as real time, but not synchronized • Notions of view, timed view, shifting are same: • Shifting Lemma still holds (relates h/w clocks and msg delays between original and shifted execs) Set 16: Distributed Shared Memory

  36. Lower Bound for SC Let Tread = worst-case time for a read to complete Let Twrite = worst-case time for a write to complete Theorem (9.7): In any simulation of sequentially consistent shared memory on top of point-to-point message passing, Tread + Twrite ≥ d. Set 16: Distributed Shared Memory

  37. SC Lower Bound Proof • Consider any SC simulation with Tread + Twrite < d. • Let X and Y be two shared variables, both initially 0. • Let α0 be an admissible execution whose top layer behavior is write0(X,1) ack0(X) read0(Y) return0(Y,0) • write begins at time 0, read ends before time d • every msg has delay d • Why does α0 exist? • The alg. must respond correctly to any sequence of invocations. • Suppose the user at p0 wants to do a write, immediately followed by a read. • By SC, the read must return 0. • By assumption, the total elapsed time is less than d. Set 16: Distributed Shared Memory

  38. SC Lower Bound Proof [Figure: timeline of α0 from time 0 to d. p0 performs write(X,1) and then read(Y), which returns 0, all completing before time d; p1 takes no steps.] Set 16: Distributed Shared Memory

  39. SC Lower Bound Proof • Similarly, let α1 be an admissible execution whose top layer behavior is write1(Y,1) ack1(Y) read1(X) return1(X,0) • write begins at time 0, read ends before time d • every msg has delay d • α1 exists for a similar reason. Set 16: Distributed Shared Memory

  40. SC Lower Bound Proof [Figure: timelines of α0 (p0 performs write(X,1) then read(Y) returning 0; p1 takes no steps) and α1 (p1 performs write(Y,1) then read(X) returning 0; p0 takes no steps), each completing before time d.] Set 16: Distributed Shared Memory

  41. SC Lower Bound Proof • Now merge p0's timed view in α0 with p1's timed view in α1 to create admissible execution α'. • (This works because every message has delay d and both processes finish their operations before time d, i.e., before any message is received.) • But α' is not SC, contradiction! Set 16: Distributed Shared Memory

  42. SC Lower Bound Proof [Figure: timelines of α0 (p0: write(X,1) then read(Y) returning 0), α1 (p1: write(Y,1) then read(X) returning 0), and the merged execution α', in which p0 behaves as in α0 and p1 behaves as in α1, all before time d. Not SC - contradiction!] Set 16: Distributed Shared Memory

  43. Linearizability Write Lower Bound Theorem (9.8): In any simulation of linearizable shared memory on top of point-to-point message passing, Twrite ≥ u/2. Proof: Consider any linearizable simulation with Twrite < u/2. • Let α be an admissible exec. whose top layer behavior is: p1 writes 1 to X, p2 writes 2 to X, p0 reads 2 from X • Shift α to create an admissible exec. in which p1's and p2's writes are swapped, causing p0's read to violate linearizability. Set 16: Distributed Shared Memory

  44. Linearizability Write Lower Bound [Figure: the execution α, which is admissible and linearizable. Starting at time 0, p1 performs write 1 and then p2 performs write 2, each completing in less than u/2; p0 then reads 2. Below the timeline, the message delay pattern used to make α admissible (each delay is d - u/2, d, or d - u).] Set 16: Distributed Shared Memory

  45. Linearizability Write Lower Bound [Figure: shift p1 later by u/2 and p2 earlier by u/2. Every message delay stays within [d - u, d] (the new delay pattern is shown), so the shifted execution is admissible. But now p2's write 2 finishes before p1's write 1 begins, while p0 still reads 2: not linearizable, contradiction!] Set 16: Distributed Shared Memory

  46. Linearizability Read Lower Bound • Approach is similar to the write lower bound. • Assume in contradiction there is an algorithm with Tread < u/4. • Identify a particular execution: • fix a pattern of read and write invocations, occurring at particular times • fix the pattern of message delays • Shift this execution to get one that is • still admissible • but not linearizable Set 16: Distributed Shared Memory

  47. Linearizability Read Lower Bound Original execution: • p1 reads X and gets 0 (old value). • Then p0 starts writing 1 to X. • When the write is done, p0 reads X and gets 1 (new value). • Also, during the write, p1 and p2 alternate reading X. • At some point, the reads stop getting the old value (0) and start getting the new value (1) Set 16: Distributed Shared Memory

  48. Linearizability Read Lower Bound • Set all delays in this execution to be d - u/2. • Now shift p2 earlier by u/2. • Verify that result is still admissible (every delay either stays the same or becomes d or d - u). • But in shifted execution, sequence of values read is 0, 0, …, 0, 1, 0, 1, 1, …, 1 not linearizable! Set 16: Distributed Shared Memory

  49. Linearizability Read Lower Bound [Figure: p0 performs write 1 while p1 and p2 alternately read X, shown before and after shifting p2 earlier by u/2. In the original execution the values read change from 0s to 1s exactly once; in the shifted execution a 0 appears after a 1 in the read sequence, which is not linearizable.] Set 16: Distributed Shared Memory
