
Distributed Algorithms (22903)

Distributed Algorithms (22903). Shared objects: linearizability, wait-freedom and simulations. Lecturer: Danny Hendler. Most of this presentation is based on the book “Distributed Computing” by Hagit Attiya & Jennifer Welch. Some slides are based on presentations of Nir Shavit.


Presentation Transcript


  1. Distributed Algorithms (22903) Shared objects: linearizability, wait-freedom and simulations Lecturer: Danny Hendler Most of this presentation is based on the book “Distributed Computing” by Hagit Attiya & Jennifer Welch. Some slides are based on presentations of Nir Shavit.

  2. Shared Objects (cont’d) • Each object has a state • Usually given by a set of shared memory fields • Objects may be implemented from simpler base objects. • Each object supports a set of operations • Only way to manipulate state • E.g. – a shared counter supports the fetch&increment operation.

  3. Shared Objects Correctness • Correctness of a sequential counter • fetch&increment, applied to a counter with value v, returns v and increments the counter’s value to (v+1). • Values returned by consecutive operations: 0, 1, 2, … But how do we define the correctness of a shared counter?
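The sequential correctness condition above can be sketched as a small Python class (the class name and method spelling are illustrative; the slides write the operation as fetch&increment):

```python
class SequentialCounter:
    """Sequential specification of a counter: fetch&increment applied to a
    counter with value v returns v and sets the counter's value to v + 1."""

    def __init__(self):
        self._value = 0

    def fetch_and_increment(self):
        v = self._value
        self._value = v + 1
        return v

c = SequentialCounter()
print([c.fetch_and_increment() for _ in range(4)])  # consecutive operations return 0, 1, 2, 3
```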

  4. Shared Objects Correctness (cont’d) [Timeline diagram: two processes issue overlapping invocation/response pairs — fetch&inc operations on a counter and q.enq(x), q.deq(y), q.enq(y), q.deq(x) on a queue.] There is only a partial order between operations!

  5. Shared Objects Correctness (cont’d) An invocation calls an operation on an object. [Diagram: the invocation c.f&i() annotated with its parts — object c, method f&i, arguments ().]

  6. Shared Objects Correctness (cont’d) An object returns the response of the operation. [Diagram: the response c:12 — object c, returned value 12.]

  7. Shared Objects Correctness (cont’d) A sequential object history is a sequence of matching invocations and responses on the object. Example: a sequential history of a queue: q.enq(3) q:void, q.enq(7) q:void, q.deq() q:3, q.deq() q:7

  8. Shared Objects Correctness (cont’d) Sequential specification The correct behavior of the object in the absence of concurrency. A set of legal sequential object histories. Example: the sequential spec of a counter: H0: (empty) H1: c.f&i() c:0 H2: c.f&i() c:0, c.f&i() c:1 H3: c.f&i() c:0, c.f&i() c:1, c.f&i() c:2 H4: c.f&i() c:0, c.f&i() c:1, c.f&i() c:2, c.f&i() c:3 ...

  9. Shared Objects Correctness (cont’d) • Linearizability • An execution is linearizable if, for each object o, there exists a permutation π of the operations on o such that • π is a sequential history of o • π preserves the partial (real-time) order of the execution.

  10. Example: linearizable [Timeline: q.enq(x), q.deq(y), q.enq(y), q.deq(x) overlap in time; a legal sequential ordering consistent with real time exists.]

  11. Example: not linearizable [Timeline: q.enq(x) completes before q.enq(y) begins; q.deq(y) also runs. Since x is enqueued before y in real time, no legal FIFO history can dequeue y here.]

  12. Example: linearizable [Timeline: q.enq(x) and q.deq(x) overlap; ordering q.enq(x) before q.deq(x) gives a legal sequential history.]

  13. Example: multiple orders OK — linearizable [Timeline: q.enq(x), q.deq(y), q.enq(y), q.deq(x) overlap so that more than one legal sequential ordering is consistent with real time.]

  14. Wait-freedom An algorithm is wait-free if every operation terminates after performing some finite number of events. Wait-freedom implies that there is no use of locks (no mutual exclusion). Thus the problems inherent to locks are avoided: • Deadlock • Priority inversion

  15. Wait-free linearizable implementations Example: the sequential spec of a register: H0: (empty) H1: r.read() r:init H2: r.write(v1) r:ack H3: r.write(v1) r:ack, r.read() r:v1, r.read() r:v1 H4: r.write(v1) r:ack, r.write(v2) r:ack, r.read() r:v2 ... Read returns the value written by the last Write (or the initial value if there were no preceding writes).

  16. Wait-free (linearizable) register simulations (each register type is simulated from the one below it): multi-reader/multi-writer register; multi-reader/single-writer register; (multi-valued) single-reader/single-writer register; binary single-reader/single-writer register

  17. A wait-free (linearizable) implementation of a single-writer-single-reader (SRSW) multi-valued register from binary SRSW registers Initially all of B[0]…B[k-1] equal 0 except B[i]=1, where i is the initial value of R. Read(R): Return the index of the single entry of B that equals 1. Write(R, v): Write 1 to B[v], then clear the entry corresponding to the previous value (if other than v). Would the above implementation of a k-valued register (initialized to i) work? No!
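A minimal Python sketch of this (flawed) construction, modeling each binary SRSW register as a list cell (the class name and the bound k are illustrative). It behaves correctly when operations run one at a time; the interleaving on the next slide is what breaks it:

```python
K = 4  # number of values the register can hold (illustrative)

class NaiveMultiValued:
    def __init__(self, init):
        self.B = [0] * K            # binary SRSW registers, modeled as list cells
        self.B[init] = 1
        self._prev = init           # writer's record of the previously written value

    def read(self):
        # Return the index of the (single) entry of B that equals 1.
        return self.B.index(1)

    def write(self, v):
        self.B[v] = 1               # set the new value's bit first...
        if self._prev != v:
            self.B[self._prev] = 0  # ...then clear the previous value's bit
        self._prev = v

r = NaiveMultiValued(init=3)
r.write(2)
print(r.read())  # 2 in this sequential run; concurrent runs need not be linearizable
```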

  18. An example of a non-linearizable execution Initially B[0]=B[1]=B[2]=0 and B[3]=1. Write(1) performs B[1]:=1 then B[3]:=0; Write(2) follows with B[2]:=1 then B[1]:=0. The first Read scans B[0]=0 and B[1]=0 before Write(1) sets B[1], then reads B[2]=1 after Write(2) sets it, and returns 2. The second Read then scans B[0]=0 and reads B[1]=1 before Write(2) clears it, and returns 1. Write(1) precedes Write(2) AND Read(2) precedes Read(1). This is not linearizable!

  19. A Wait-free Linearizable Implementation Initially B[v]=1 and all other entries equal 0, where v is the initial value of R. • Read(R) • i:=0 • while B[i]=0 do i:=i+1 • up:=i, v:=i • for i:=up-1 downto 0 do • if B[i]=1 then v:=i • return v • Write(R,v) • B[v]:=1 • for i:=v-1 downto 0 do B[i]:=0 • return ack
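The Read/Write pseudocode above translates directly to Python; as before, list cells stand in for the binary SRSW registers and the bound k is an assumed parameter:

```python
K = 8  # number of values (illustrative)

class MultiValuedSRSW:
    def __init__(self, init):
        self.B = [0] * K
        self.B[init] = 1

    def read(self):
        i = 0
        while self.B[i] == 0:            # upward scan: find a set entry
            i += 1
        up = v = i
        for i in range(up - 1, -1, -1):  # downward scan: keep the lowest set entry seen
            if self.B[i] == 1:
                v = i
        return v

    def write(self, v):
        self.B[v] = 1                    # set the new value's bit
        for i in range(v - 1, -1, -1):   # clear all lower entries, top-down
            self.B[i] = 0

r = MultiValuedSRSW(init=3)
r.write(5)
r.write(1)
print(r.read())  # 1 — the lowest set entry wins
```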

  20. The linearization order Writes are linearized first, in their execution order: Write1(R,1), Write2(R,4), Write3(R,3), Write4(R,1). All reads that read from a specific write are linearized after it, in their real-time order. Resulting order: Read1(R, init), Write1(R, 1), Write2(R, 4), Read2(R, 4), Read3(R, 4), Write3(R, 3), Read4(R, 3), Write4(R, 1), Read5(R, 1).

  21. Correctness proof for the SRSW multi-valued register simulation

  22. Illustration for Lemma 1 [Diagram: array B with entries u < v; entry v set to 1 by write W, entry u written by an earlier write W1.]

  23. Illustration for Lemma 1 (cont’d) [Diagram: as before, with a further write W2 writing entry v2; entries v (written by W), v1 (by W1) and v2 (by W2) shown in B.]

  24. Illustration for Lemma 2, Case 1: v’ ≤ v [Diagram: execution E with W(v) followed by R, and the permutation π; R returns v’ ≤ v, written by W’(v’).]

  25. Illustration for Lemma 2 (cont’d), Case 2: v’ > v [Diagram: execution E with W(v) followed by R; R returns v’ > v, written by W’(v’); a later write W’’(x) also appears in π.] From Lemma 1, R returns a value written by an operation that does not precede W!

  26. Illustration for Lemma 3, Case 1: v1 = v2 [Diagram: execution E with R1 preceding R2; R1 reads from W1(v1), R2 reads from W2(v2), where v1 = v2; in π, R1 is placed before R2.]

  27. Illustration for Lemma 3 (cont’d), Case 2: v1 > v2 [Diagram: execution E with R1 preceding R2; R1 reads v1 from W1, R2 reads v2 from W2.] Since R1 precedes R2 and R2 reads from W2, R1 must see 1 in v2 when scanning down.

  28. Illustration for Lemma 3 (cont’d), Case 3: v1 < v2 [Diagram: execution E with writes W1(v1), W3(v3), W2(v2) and reads R1 preceding R2; R1 reads v1, R2 reads v2.] From Lemma 1, R2 returns a value written by an operation no sooner than W3!

  29. A wait-free implementation of a (multi-valued) multi-reader register from (multi-valued) SRSW registers.

  30. Would this work? SRSW Val[i]: The value written by the writer for reader pi • Read(R) by pi • return Val[i] • Write(R,v) • for i:=0 to n-1 do Val[i]:=v • return ack Is the algorithm wait-free? Yes. Is the algorithm linearizable? Nope!
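As a Python sketch (names illustrative), this construction is just a fan-out of SRSW registers; it is wait-free, but the counterexample on the next slide shows it is not linearizable:

```python
N = 2  # number of readers (illustrative)

class NaiveMultiReader:
    def __init__(self, init=0):
        self.Val = [init] * N   # Val[i]: SRSW register written by the writer, read by reader p_i

    def read(self, i):          # performed by reader p_i
        return self.Val[i]

    def write(self, v):         # performed by the single writer, one reader's register at a time
        for i in range(N):
            self.Val[i] = v

r = NaiveMultiReader()
r.write(1)
print(r.read(0), r.read(1))  # 1 1 once the write has completed
```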

  31. An example of a non-linearizable execution Initially Val[0]=Val[1]=0. The writer Pw performs Write(1): it writes 1 to Val[0], and only later writes 1 to Val[1]. In between, P0 reads Val[0] and returns 1, and then P1 reads Val[1] and returns 0. Read(1) precedes Read(0). This is not linearizable!

  32. A proof that no such simulation is possible, unless some readers…write!

  33. A wait-free implementation of a (multi-valued) multi-reader register from (multi-valued) SRSW registers. • Data structures used • Values are pairs of the form: <val, sequence-number>. • Sequence-numbers are ever increasing. • Val[i]: The value written by pw for reader pi, for 1 ≤ i ≤ n • Report[i,j]: The value returned by the most recent read operation performed by pi; written by pi and read by pj, 1 ≤ i,j ≤ n.

  34. A wait-free implementation of a multi-reader register from SRSW registers (cont’d). Initially Report[i,j]=Val[i]=(v0, 0), where v0 is R’s initial value. • Read(R) ; performed by process pr • (v[0],s[0]):=Val[r] ; most recent value returned by writer • for i:=1 to n do (v[i],s[i]):=Report[i,r] ; most recent value reported to pr by reader pi • Let j be such that s[j]=max{s[0], s[1], …, s[n]} • for i:=1 to n do Report[r,i]:=(v[j],s[j]) ; pr reports to all readers • return v[j] • Write(R,v) ; performed by the single writer • seq:=seq+1 • for i:=1 to n do Val[i]:=(v,seq) • return ack
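A Python sketch of this algorithm (list cells model the SRSW registers; n and the class name are illustrative). Each read both picks the freshest pair it can see and reports it to all readers, which is what restores linearizability:

```python
N = 3  # number of readers (illustrative)

class MultiReaderRegister:
    def __init__(self, init=0):
        self.Val = [(init, 0)] * N                          # Val[i]: (value, seq) written for reader p_i
        self.Report = [[(init, 0)] * N for _ in range(N)]   # Report[i][j]: p_i's last result, read by p_j
        self.seq = 0                                        # writer's local sequence number

    def read(self, r):                                      # performed by reader p_r
        candidates = [self.Val[r]] + [self.Report[i][r] for i in range(N)]
        v, s = max(candidates, key=lambda pair: pair[1])    # freshest (value, seq) pair seen
        for i in range(N):
            self.Report[r][i] = (v, s)                      # p_r reports it to all readers
        return v

    def write(self, v):                                     # performed by the single writer
        self.seq += 1
        for i in range(N):
            self.Val[i] = (v, self.seq)

m = MultiReaderRegister()
m.write(7)
print(m.read(0), m.read(2))  # 7 7
```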

  35. The linearization order Writes are linearized first, in their execution order: Write(v1, 1), Write(v2, 2), Write(v3, 3), Write(v4, 4). Reads are considered in increasing order of response and placed after the write with the same sequence number. Resulting order: Read1(init, 0), Write(v1, 1), Read2(v1, 1), Write(v2, 2), Read3(v2, 2), Read4(v2, 2), Write(v3, 3), Write(v4, 4), Read5(v4, 4).

  36. A wait-free Implementation of a multi-reader-multi-writer register from multi-reader-single-writer registers

  37. A wait-free Implementation of a MRMW register from MRSW registers. • Data structures used • Values are pairs of the form: <val, sequence-number>. • Sequence-numbers are ever increasing. • TS[i]: The vector timestamp of writer pi, for 0 ≤ i ≤ m-1. Written by pi and read by all writers. • Val[i]: The latest value written by writer pi, for 0 ≤ i ≤ m-1, together with the vector timestamp associated with that value. Written by pi and read by all n readers.

  38. Concurrent timestamps • Provide a total order for write operations • The total order respects the partial order of write operations • Timestamp implemented as vectors • Ordered by lexicographic order • Each writer increments its vector entry
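The lexicographic order on these vectors is exactly Python's tuple comparison, so the ordering can be sanity-checked directly (the stamp values below are illustrative, in the style of the example that follows):

```python
# Vector timestamps in the order the writers obtain them.
stamps = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 2, 1)]

# Python tuples compare lexicographically, so sorting reproduces the same order.
print(stamps == sorted(stamps))  # True
```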

  39. Concurrent timestamps example [Diagram: three writers; TS[1]=TS[2]=TS[3]=<0,0,0>; Writer 3 assembles the new timestamp <0,0,1>.] Order: <0,0,0>

  40. Concurrent timestamps example [Diagram: TS[1]=<1,0,0>, TS[2]=TS[3]=<0,0,0>; Writer 2 assembles the new timestamp <1,1,0>.] Order: <0,0,0> <1,0,0>

  41. Concurrent timestamps example [Diagram: TS[1]=<1,0,0>, TS[2]=<1,1,0>, TS[3]=<0,0,0>; Writer 2 assembles the new timestamp <1,2,1>, Writer 3 assembles <1,1,1>.] Order: <0,0,0> <1,0,0> <1,1,0>

  42. A wait-free implementation of a MRMW register from MRSW registers. Initially TS[i]=<0,0,…,0> and Val[i] equals the initial value of R • Read(R) ; performed by reader pr • for i:=0 to m-1 do (v[i], t[i]):=Val[i] ; v and t are local • Let j be such that t[j]=max{t[0],…,t[m-1]} ; Lexicographic max • return v[j] • Write(R,v) ; performed by writer pw • ts:=NewCTS() ; Writer pw obtains a new vector timestamp • Val[w]:=(v,ts) • return ack • Procedure NewCTS() ; called by writer pw • for i:=0 to m-1 do • lts[i]:=TS[i].i ; extract the i’th entry from TS of the i’th writer • lts[w]:=lts[w]+1 ; Increment own entry • TS[w]:=lts ; write pw’s new timestamp • return lts
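The slide's algorithm as a Python sketch (m writers; list cells model the MRSW registers; parameter values and names are illustrative). Tuples make the lexicographic-max step a one-liner:

```python
M = 2      # number of writers (illustrative)
INIT = 0   # initial value of R (illustrative)

class MultiWriterRegister:
    def __init__(self):
        self.TS = [(0,) * M for _ in range(M)]               # TS[i]: writer p_i's vector timestamp
        self.Val = [(INIT, (0,) * M)] * M                    # Val[i]: (value, timestamp) by writer p_i

    def read(self):                                          # performed by any reader
        value, ts = max(self.Val, key=lambda pair: pair[1])  # lexicographic max timestamp
        return value

    def new_cts(self, w):                                    # NewCTS, called by writer p_w
        lts = [self.TS[i][i] for i in range(M)]              # extract the i'th entry of writer i's TS
        lts[w] += 1                                          # increment own entry
        self.TS[w] = tuple(lts)                              # publish p_w's new timestamp
        return tuple(lts)

    def write(self, w, v):                                   # performed by writer p_w
        self.Val[w] = (v, self.new_cts(w))

r = MultiWriterRegister()
r.write(0, 5)    # writer 0 writes 5 with timestamp (1, 0)
r.write(1, 9)    # writer 1 writes 9 with timestamp (1, 1)
print(r.read())  # 9 — (1, 1) dominates (1, 0) lexicographically
```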

  43. The linearization order Writes are linearized first, by timestamp order: Write(v1, <1,0>), Write(v2, <1,1>), Write(v3, <1,2>), Write(v4, <2,2>). Reads are considered in increasing order of response and placed after the write with the same timestamp. Resulting order: Read1(init, <0,0>), Read2(init, <0,0>), Write(v1, <1,0>), Write(v2, <1,1>), Read3(v2, <1,1>), Read4(v2, <1,1>), Write(v3, <1,2>), Write(v4, <2,2>), Read5(v4, <2,2>).
