Flexible Update Propagation for Weakly Consistent Replication

Karin Petersen, Mike K. Spreitzer, Douglas B. Terry, Marvin M. Theimer and Alan J. Demers • Presented by: Ryan Huebsch • CS294-4 P2P Systems, 10/13/03

Outline • Anti-Entropy • Goals • Data Structures • Ordering • The Algorithm • Creation and Retirement • Discussion • Performance • P2P discussion/questions

Anti-Entropy • Entropy - a process of degradation or running down or a trend to disorder. • Bring 2 replicas up-to-date • Three Major Design Decisions • Pairwise communication between replicas • Exchange of update operations • Ordered propagation of operations

Goals • Support for arbitrary communication topologies • Operation over low-bandwidth networks • Incremental progress • Eventual consistency • Efficient storage management • Light-weight management of dynamic replica sets • Arbitrary policy choices

Data Structures • Replica: • Database • Write Log • Server: • Clock • V, O • CSN, OSN
[Figure: a replica's database and write log. Writes below CSN are committed; writes below OSN have been truncated from the log. For each server A, V holds the highest A.Clock that is in the log, and O holds the highest A.Clock that has been truncated …]
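The structures on this slide can be sketched roughly as follows. This is a minimal illustration, not Bayou's implementation: the class names `Write` and `Replica`, the `op` payload field, and the `accept` method are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

INFINITY = float("inf")  # CSN carried by a tentative (uncommitted) write


@dataclass
class Write:
    csn: float         # commit sequence number; INFINITY until committed
    accept_stamp: int  # clock value of the server that accepted the write
    server_id: str     # id of the accepting server
    op: str = ""       # hypothetical payload, only for illustration


@dataclass
class Replica:
    server_id: str
    clock: int = 0
    database: Dict[str, str] = field(default_factory=dict)
    log: List[Write] = field(default_factory=list)
    V: Dict[str, int] = field(default_factory=dict)  # highest stamp per server in the log
    O: Dict[str, int] = field(default_factory=dict)  # highest stamp per server truncated
    CSN: int = 0  # highest commit sequence number known
    OSN: int = 0  # highest commit sequence number truncated from the log

    def accept(self, op: str) -> Write:
        """Accept a new local write: stamp it, log it, and update V."""
        self.clock += 1
        w = Write(INFINITY, self.clock, self.server_id, op)
        self.log.append(w)
        self.V[self.server_id] = self.clock
        return w


replica = Replica("A")
first = replica.accept("x=1")
second = replica.accept("y=2")
```

Because of the prefix property, one integer per server in V is enough to describe exactly which writes a replica holds.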

Orderings • Prefix Property • If R has write Wi that was accepted by server X, it has all writes X accepted before Wi • Stable (Committed Order) • Decided by the primary replica • Assigns the final CSN, which is < infinity (tentative writes carry a CSN of infinity) • New CSN is propagated to nodes • Accept Order • Total order over the writes accepted by a particular server; only a partial order over all writes in the system • Accept stamp – logical or real-time clock

Orderings, continued • Causal-Accept Order • Accept-stamp is a logical clock • Clock is advanced when a write with a higher accept-stamp is received (through anti-entropy) • Provides better chances of a node seeing the same database from different servers • If two servers have the same writes, even uncommitted ones, they hold them in the same order
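The clock rule for causal-accept order can be sketched as below. The class `LogicalClock` and its method names are hypothetical; the only behavior taken from the slide is that the clock advances when a write with a higher accept-stamp arrives through anti-entropy.

```python
class LogicalClock:
    """Minimal logical clock used to produce accept-stamps (illustrative only)."""

    def __init__(self) -> None:
        self.value = 0

    def stamp_local_write(self) -> int:
        # A locally accepted write gets the next clock value.
        self.value += 1
        return self.value

    def observe(self, accept_stamp: int) -> None:
        # A write received through anti-entropy with a higher stamp
        # pulls the local clock forward; lower stamps are ignored.
        if accept_stamp > self.value:
            self.value = accept_stamp


clock = LogicalClock()
local_first = clock.stamp_local_write()   # first local write
clock.observe(10)                         # remote write with stamp 10 arrives
local_second = clock.stamp_local_write()  # stamped after the remote write
clock.observe(3)                          # older remote write, no effect
```

Any write accepted after a remote write has been seen therefore carries a larger stamp than that remote write, which is what makes the accept order respect causality.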

The Algorithm (Quick Version) • R is being updated by S • S retrieves R.V and R.CSN • STEP 1: Decide if a full transfer is needed
IF (S.OSN > R.CSN) THEN   [S does not have enough log to update R incrementally]
    Roll back S's database to the state corresponding to S.O   [Remove all writes that S has a log for]
    OutputDatabase(S.DB)
    OutputVector(S.O)
    OutputOSN(S.OSN)   [R now has the same database as S and has truncated its write log to the same point]
END

The Algorithm, continued • Step 2: Bring R up-to-date with remaining committed writes
IF R.CSN < S.CSN THEN   [R is missing committed writes]
    w = first committed write in S.log after R.CSN
    WHILE (w) DO
        IF w.accept-stamp <= R.V(w.server-id) THEN   [Check R's vector to see if it has the write]
            OutputCommitNotification(w)
        ELSE
            OutputWrite(w)
        END
        w = next committed write in S.log
    END
END

The Algorithm, continued • Step 3: Bring R up-to-date with remaining uncommitted writes
w = first tentative write in S.log
WHILE (w) DO
    IF R.V(w.server-id) < w.accept-stamp THEN   [Check R's vector to see if it has the write]
        OutputWrite(w)
    END
    w = next write in S.log
END
• Step 4: Finish Up
OutputCSN(S.CSN)
OutputVector(S.V)
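The four steps can be put together in a sender-side sketch such as the one below. It is a simplified illustration under several assumptions: the names `Sender`, `db_at_O`, and `anti_entropy` are hypothetical, messages are returned as a list of tuples instead of being sent over a network, and after a full transfer the sender is assumed to treat R as being at S.O and S.OSN for the remaining steps.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

INF = float("inf")  # CSN of a tentative write


@dataclass(frozen=True)
class Write:
    csn: float
    accept_stamp: int
    server_id: str


@dataclass
class Sender:
    log: List[Write]
    V: Dict[str, int]
    O: Dict[str, int] = field(default_factory=dict)
    CSN: int = 0
    OSN: int = 0
    db_at_O: str = "<database state at S.O>"  # placeholder for the rolled-back database


def anti_entropy(S: Sender, R_V: Dict[str, int], R_CSN: int) -> List[Tuple]:
    out: List[Tuple] = []
    # Step 1: full transfer if S has truncated committed writes that R lacks.
    if S.OSN > R_CSN:
        out += [("database", S.db_at_O), ("vector", dict(S.O)), ("osn", S.OSN)]
        R_V, R_CSN = dict(S.O), S.OSN  # assumption: R is now at S's truncation point
    # Step 2: committed writes that R does not yet know as committed, in CSN order.
    if R_CSN < S.CSN:
        committed = sorted((w for w in S.log if R_CSN < w.csn < INF), key=lambda w: w.csn)
        for w in committed:
            if w.accept_stamp <= R_V.get(w.server_id, 0):
                # R already holds the write tentatively: send only the commit notice.
                out.append(("commit", w.accept_stamp, w.server_id, w.csn))
            else:
                out.append(("write", w))
    # Step 3: tentative writes that R's vector does not cover, in log order.
    for w in S.log:
        if w.csn == INF and R_V.get(w.server_id, 0) < w.accept_stamp:
            out.append(("write", w))
    # Step 4: finish up.
    out += [("csn", S.CSN), ("vector", dict(S.V))]
    return out


# R holds A's first write tentatively and nothing else.
S = Sender(log=[Write(1, 1, "A"), Write(2, 1, "B"), Write(INF, 2, "A")],
           V={"A": 2, "B": 1}, CSN=2)
msgs = anti_entropy(S, R_V={"A": 1}, R_CSN=0)
```

In this example R receives a commit notification for A's first write, the full write from B, A's tentative second write, and finally S's CSN and version vector. Since every message is useful on its own, an interrupted session still leaves R further along, which is the incremental-progress goal.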

Creation and Retirement • Treated just like a write (elegant) • Si is trying to join via server Sk • Sk creates a new write • <infinity, Tk,i, Sk> • Si's server id is <Tk,i, Sk> • Si sets its clock to Tk,i + 1 • Notice the new server id is globally unique, recursive, and could be long • The write is propagated to other nodes through anti-entropy

Creation and Retirement, continued • Server S is updating server R • Server S.V has an entry for server Si (<Tk,i, Sk>), while R.V does not. • 2 Cases: • R has not seen the creation of Si • Then R.V(Sk) < Tk,i • R has seen both the creation and the retirement of Si, and S has not yet seen the retirement • Then R.V(Sk) >= Tk,i • Why? Creation/retirement is recorded as a normal write, thus the prefix property will hold. • Recursive naming helps too: if Sk has retired, R can still trace back and decide the proper state. This is explained as the virtual CompleteV in the paper.
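The two-case test can be sketched as a small function. The name `classify_missing_entry` and the returned strings are hypothetical, and the sketch assumes Sk itself still has an entry in R.V; the recursive case where Sk is also missing (handled by CompleteV in the paper) is not covered.

```python
from typing import Dict


def classify_missing_entry(R_V: Dict[str, int], creator_id: str, creation_stamp: int) -> str:
    """S.V has an entry for Si = <creation_stamp, creator_id>, but R.V does not.

    Decide why, using only R's version vector entry for the creating server Sk.
    Assumes creator_id is present in R_V (no recursive lookup).
    """
    if R_V[creator_id] < creation_stamp:
        # R has not received Sk's creation write for Si yet.
        return "R has not seen the creation of Si"
    # R holds the creation write (prefix property), so the missing entry
    # means R has also processed Si's retirement.
    return "R has seen the retirement of Si"


not_created = classify_missing_entry({"Sk": 5}, "Sk", 7)
retired = classify_missing_entry({"Sk": 9}, "Sk", 7)
```

In the first case S should send Si's writes to R; in the second, S is the one that is behind and R should not have the entry recreated.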

Discussion

Discussion, continued • Most properties are not special in themselves; the combination is novel • Different decisions are mostly independent • Ideas can be applied to other systems (other than Bayou) • Security • Use certificates to ensure the user is allowed to make an update • Not much detail given • Used later on as an excuse for high overheads • Lots of policy decisions to be made • When to reconcile, with whom, when to truncate the log

Performance • 1316 bytes of update overhead • 520 bytes for certificate • Network transfer most significant cost

Performance, continued • Hard to know if the numbers are good, nothing to compare them to • Would have been nice to see a larger deployment and measure propagation delay, consistency, etc.

P2P? • Is Anti-Entropy applicable to P2P systems? • Review the goals… arbitrary topology, low b/w, aggressive storage management… • There is a centralized component (the serializer)… is this okay? • Can it handle failures/churn? • Security, what happens if there is a faulty node?
