Computing with Byzantine Shared Memory

Computing with Byzantine Shared Memory Topics in Reliable Distributed Systems Fall 2004-2005, Idit Keidar

Sources • Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory • Abraham, Chockler, Keidar, Malkhi, PODC04 • Optimal Resilience Wait-Free Storage from Byzantine Components: Inherent Costs and Solutions • Chockler, Keidar, Malkhi, FuDiCo II, June 2004. • Some slides stolen from Chockler and Malkhi

Internet Information Services Internet

Internet Information Services Network storage [Fleet (Hebrew U and Lucent) Agile (GTech), Coca (Cornell), SBQL (UTexas), others] web

Internet Information Services Peer-to-peer storage [Oceanstore, etc.] web

Storage Area Network (SAN)

Large-Scale Deployment • Clients come and go • cannot be expected to be around • hence, direct communication infeasible • Communication among storage nodes is also infeasible • In SAN impossible • Thin (scalable) servers, more logic at clients • Some of the storage nodes can be compromised by a malicious adversary

Fault/Security Scenarios • Client crash, unresponsiveness • Access control for clients • access restricted to data they own • if bypassed, what can we do? • A malicious client can mess up the data anyway • Servers entrusted with others’ data • compromised servers a bigger problem • Byzantine faults possible

Assumptions • Any number of client failures (crash) • no Byzantine faults (assume access control) • Threshold (t-out-of-n) of storage components (per object) can be faulty • Byzantine, unresponsive: NR-Arbitrary Faults • Monitoring/reconfiguration service • Each object has enough healthy replicas • Naming service: e.g., DHT

NS ha={X,Y,Z,A} Example: Internet-Wide Store Y X Z Client: ha=lookup(a); read(ha); write(ha); … A

Data-Centric Replication No server-to-server comm NR-arbitrary failures No client-to-client comm Bounded, unknown number of clients

Some System Examples • Fleet, HUJI and Lucent • Rosebud, MIT • Agile, GTech • Coca, Cornell • SBQL, UTexas • OceanStore, Berkley • Pasis, CMU • others

Our Focus: Foundations • Wanted: generic services useful for applications • Reliable R/W registers • Consensus • Which register semantics can be supported? • Atomic, regular, safe • Termination conditions • At which cost? • Failure resilience • Communication rounds • Memory consumption • By understanding tradeoffs, system designers will be able to intelligently decide what to choose

Formal Model • Asynchronous shared memory • servers=shared objects, clients=processes • Shared objects may experience NR-arbitrary failures [JCT98] • faulty object can respond with arbitrary value or fail to respond • Processes can fail by crashing

Previous Work I: Wait-Free Constructions • Safe register: A read that does not overlap a write returns the last register’s value • Malkhi & Reiter 98: n > 4t • One round read/write, unbounded timestamps • Jayanti, Chandra & Toueg 98: n > 5t • One round read/write, no timestamps • Self-construction: implement safe register from collection of fault-prone safe registers

x = 7 t = 1 x = 2 t = 5 x = 0 t = 0 x = 7 t = 1 x = 7, t = 1 x = 7 Byzantine Quorum Systems: Example [Malkhi and Reiter 98] • At most one server can be penetrated

x = 7 t = 1 x = 0 t = 0 x = 0 t = 0 x = 7 t = 1 x = 7, t = 1 x = 7 Byzantine Quorum Systems: Example [Malkhi and Reiter 98] • Why timestamps?

Previous Work II: Optimal Failure Resilience (n>3t) • Synchronous system [Bazzi, DC’00] • Safe register • Servers communicate with each other (finish operations for faulty clients) [Martin, Alvisi & Dahlin, DISC’02] • MWMR atomic register • Reliable clients [Attiya & Bar-Or, SRDS’03] • Missing: optimal-resilience data-centric asynchronous implementations resilient to process failures

Optimal Resilience: The Challenge n=4, t=1 (v0,0) (v0,0) (v0,0) delayed (v, 1) The writer The reader

Optimal Resilience: The Challenge (v,1) (v,1) (v0,0) ack The writer The reader

Optimal Resilience: The Challenge (v,1) (v,1) (v0,0) delayed The writer The reader

Optimal Resilience: The Challenge Cannot return v0 (v,1) (v,1) (v0,0) (v0,0) (v0,0) ? (v,1) The writer The reader

Optimal Resilience: The Challenge Cannot return v1 (v0,0) (v0,0) (v0,0) No write happened (v0,0) (v0,0) ? (v,1) The writer The reader

Reliable Writer Solution: Wait (v,1) (v,1) (v0,0) (v0,0) (v0,0) (v,1) (v,1) (v,1) The writer The reader

Faulty Writer Scenario Cannot wait! (v, 1) (v0,0) (v0,0) (v, 1) ? (v,1) The writer The reader

What Does This Mean? Is a solution with n=3t+1 impossible?

Lower Bound for Optimal Resilience • The reader cannot return any value, and cannot wait for more, so what can it do? • Invoke more rounds! • Will this help? • No! Exactly the same thing can happen in every round. • Conclusion?

Write Lower Bound For 1W1R binary safe register (weakest meaningful object type) construction from any base object type if n ≤ 4t and processes can crash emulating Write in one round is impossible • To emulate Write, there must be at least one base object on which two operations are invoked

pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v,1 w=v0,0 pw=v,1 w=v,1 pw=v,1 w=v0,0 pw=v,1 w=v,1 Two Rounds Save the Day! pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 (v, 1) The writer The reader

Writer Fails During Pre-Write Can return v0 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v,1 w=v0,0 (v, 1) The writer The reader

Writer Fails During Write Can wait to hear more v,1 pw=v0,0 w=v0,0 pw=v,1 w=v0,0 pw=v,1 w=v,1 (v, 1) The writer The reader

Write Never Happened Can wait to hear more without v,1 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v,1 w=v,1 The writer The reader

Can Read Always Complete in One Round?

Overlapping Write Scenario Cannot wait pw=v,1 w=v,1 pw=v0,0 w=v0,0 pw=v0,0 w=v0,0 pw=v,1 w=v0,0 pw=v,1 w=v0,0 pw=v,1 w=v0,0 (v, 1) (v0,0) (v0,0) (v,1) The writer The reader

Solution? • More read rounds • By the next round, all objects responding to the Read will have seen the first phase of the Write • By causality • But what if there are additional overlapping writes? • Safe register implementation can deduce in the next round (in general, in min(f+2,t+1) rounds) that there is an overlapping write; returns arbitrary value

Determining Value to Return • Read values (from w field) are candidates to return • If there are 2t+1 responses without v (in either pw or w) then v is no longer a candidate • Either was not written before Read, or some Write overlaps read • If the highest timestamped candidate occurs t+1 times, it can be returned • If the candidates set is empty, return any value • There must be an overlapping Write • Each round, wait for at least one more response than in previous rounds • After min(t+1,f+2) rounds, it will be possible to return

Summary: Wait-Free Safe Register Construction for n>3t • Invokes a bounded number of Read rounds • Optimal round complexity • Constant storage • Unbounded timestamps • Read takes two rounds (after synchronization) in (eventually) synchronous runs

Summary: Lower Bounds for n>3t and Fault-Prone Clients • 1W1R binary safe register construction: • WRITE: 2 communication rounds are necessary irrespective of base object types • Termination: obstruction freedom • READ: min(t+1,f+2) communication rounds are necessary irrespective of base object types if readers do not modify base objects • Termination: Every write terminates, read terminates if eventually runs in isolation (Finite-Write Termination)

A Note on Safe Registers • Wait-free safe registers are too weak to be directly useful by applications • They are good for lower bounds • Bounded constructions of wait-free atomic registers from safe registers known, but… • Complicated • Add communication rounds and memory • Too costly to deploy in a distributed setting

Registers with Stronger Semantics • Can we come up with an efficient, direct construction of a regular/atomic register? • Yes, if we are ready to compromise wait freedom

Termination Conditions • Wait-Freedom: • Every operation must complete within a finite number of steps regardless of steps of other processes • Lock-Freedom: • If there are concurrent operations, at least one of them must complete • Obstruction- Freedom: • If one process runs alone (without interference) for sufficiently long, it must complete its operation

Notes • Wait-freedom often hard to achieve • Lock-freedom and obstruction-freedom are achieved by simpler algorithms

What is the Right Condition? • Ultimately, we want to solve consensus • For state machine replication • Want it to be wait-free • Alas, impossible in asynchronous systems • We need a leader-oracle in order to solve consensus in asynchronous shared memory • ensures that at most one process writes (proposes values), while others only read

forever() if leader (by leader oracle) then b ← choose new unique rank write b,v,n,^ to xi read all xj’s if some xj.bal > b then continue if all read xj.val’s = then v ← my initial value else v ← read val with highest num n ← b write b,v,n,^ to xi read all xj’s if some xi.bal >b then continue write b,v,n,v to xi return v else /* non-leader */ read xld where ld is leader if xld .dec ≠ bot then return xld .dec Reminder: Disk Paxos Observation: Only leader writes;others just read.Leader stops writing upon deciding

What is the Right Condition? (Cont’d) • Will Lock-Freedom help? • No, because as long as the leader does not succeed to write, no decision is possible • Is Wait-Freedom really needed?

Finite-Writes Termination • Finite-Writes Termination (FW-termination) • Every write (by a correct process) eventually returns • Every read (by a correct process) eventually returns unless infinitely many writes are invoked • Wait-free consensus with an  leader oracle is solvable with FW-terminating regular 1WMR registers

Finite-Writes Termination

Consensus w/ FW-Terminating Registers • We observe that existing wait-free shared-memory failure-detector-based () consensus algorithms* work with FW-Terminating registers • Lo and Hadzilacos’96 • Gafni and Lamport’00 *with small modifications

Back to the Register Emulation • To implement a regular FW-Termination register, Read keeps invoking rounds until the highest timestamped value appears in t+1 responses • When overlapping Writes stop occurring, Read eventually returns • Number of rounds unbounded by construction

The Complete System

Computing with Byzantine Shared Memory