Computing with Byzantine Shared Memory
This presentation covers how to implement reliable storage in the presence of Byzantine faults, focusing on optimal resilience (n > 3t) with Byzantine shared memory. It discusses Byzantine Disk Paxos, wait-free storage constructions, and non-responsive arbitrary (NR-arbitrary) faults. Example systems include Fleet, Rosebud, and OceanStore. The setting is data-centric: clients communicate only with storage servers. Within it, the talk shows how to obtain reliable read/write registers and consensus, and what this costs in failure resilience, communication rounds, and memory.
Presentation Transcript
Computing with Byzantine Shared Memory Topics in Reliable Distributed Systems Fall 2004-2005, Idit Keidar
Sources • Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory • Abraham, Chockler, Keidar, Malkhi, PODC04 • Optimal Resilience Wait-Free Storage from Byzantine Components: Inherent Costs and Solutions • Chockler, Keidar, Malkhi, FuDiCo II, June 2004. • Some slides stolen from Chockler and Malkhi
Internet Information Services • [Figure label: Internet]
Internet Information Services • Network storage [Fleet (Hebrew U and Lucent), Agile (GTech), Coca (Cornell), SBQL (UTexas), others] • [Figure label: web]
Internet Information Services • Peer-to-peer storage [Oceanstore, etc.] • [Figure label: web]
Large-Scale Deployment • Clients come and go • cannot be expected to be around • hence, direct communication infeasible • Communication among storage nodes is also infeasible • In SAN impossible • Thin (scalable) servers, more logic at clients • Some of the storage nodes can be compromised by a malicious adversary
Fault/Security Scenarios • Client crash, unresponsiveness • Access control for clients • access restricted to data they own • if bypassed, what can we do? • A malicious client can mess up the data anyway • Servers entrusted with others’ data • compromised servers a bigger problem • Byzantine faults possible
Assumptions • Any number of client failures (crash) • no Byzantine faults (assume access control) • Threshold (t-out-of-n) of storage components (per object) can be faulty • Byzantine, unresponsive: NR-Arbitrary Faults • Monitoring/reconfiguration service • Each object has enough healthy replicas • Naming service: e.g., DHT
Example: Internet-Wide Store • Naming service (NS): ha={X,Y,Z,A} • Client: ha=lookup(a); read(ha); write(ha); … • [Figure: storage nodes X, Y, Z, A]
Data-Centric Replication No server-to-server comm NR-arbitrary failures No client-to-client comm Bounded, unknown number of clients
Some System Examples • Fleet, HUJI and Lucent • Rosebud, MIT • Agile, GTech • Coca, Cornell • SBQL, UTexas • OceanStore, Berkeley • Pasis, CMU • others
Our Focus: Foundations • Wanted: generic services useful for applications • Reliable R/W registers • Consensus • Which register semantics can be supported? • Atomic, regular, safe • Termination conditions • At which cost? • Failure resilience • Communication rounds • Memory consumption • By understanding tradeoffs, system designers will be able to intelligently decide what to choose
Formal Model • Asynchronous shared memory • servers=shared objects, clients=processes • Shared objects may experience NR-arbitrary failures [JCT98] • faulty object can respond with arbitrary value or fail to respond • Processes can fail by crashing
Previous Work I: Wait-Free Constructions • Safe register: A read that does not overlap a write returns the last value written to the register • Malkhi & Reiter 98: n > 4t • One round read/write, unbounded timestamps • Jayanti, Chandra & Toueg 98: n > 5t • One round read/write, no timestamps • Self-construction: implement safe register from collection of fault-prone safe registers
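One way to see where an n > 4t requirement for one-round operations can come from is a counting argument over reply sets. The following is only a sketch, under the assumption (not stated on the slide) that every operation waits for n − t replies; Q_w and Q_r are hypothetical names for the sets of objects that answered a Write and a later Read.

```latex
\begin{align*}
|Q_w \cap Q_r| &\ge (n - t) + (n - t) - n = n - 2t, \\
\#\{\text{correct objects in } Q_w \cap Q_r\} &\ge (n - 2t) - t = n - 3t, \\
n - 3t \ge t + 1 &\iff n > 4t .
\end{align*}
```

The last line asks for t + 1 correct objects that saw the Write, so that the written value is reported by more objects than the t faulty ones could account for.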
Byzantine Quorum Systems: Example [Malkhi and Reiter 98] • At most one server can be penetrated • [Figure labels, value x and timestamp t: x = 7 t = 1; x = 2 t = 5; x = 0 t = 0; x = 7 t = 1; x = 7, t = 1; x = 7]
Byzantine Quorum Systems: Example [Malkhi and Reiter 98] • Why timestamps? • [Figure labels, value x and timestamp t: x = 7 t = 1; x = 0 t = 0; x = 0 t = 0; x = 7 t = 1; x = 7, t = 1; x = 7]
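The read rule behind these two examples can be sketched in a few lines. This is a minimal illustration, not the authors' code: `masking_read` is a hypothetical name, replies are modeled as (value, timestamp) tuples, and `t` here is the fault threshold (the figures use "t" for the timestamp). The reply lists are loosely modeled on the figure labels.

```python
from collections import Counter

def masking_read(replies, t, default=None):
    """Choose a return value from one round of (value, timestamp) replies.

    A pair is trusted only if at least t+1 objects report it, because at
    most t reporters can be faulty. Among trusted pairs, the one with the
    highest timestamp wins.
    """
    counts = Counter(replies)
    vouched = [pair for pair, c in counts.items() if c >= t + 1]
    if not vouched:
        # Safe-register semantics allow any value when a write overlaps.
        return default
    value, _ts = max(vouched, key=lambda pair: pair[1])
    return value

# A faulty object reports a bogus pair (2, 5); it is outvoted.
print(masking_read([(7, 1), (2, 5), (0, 0), (7, 1)], t=1))  # prints 7
```

With replies `[(7, 1), (0, 0), (0, 0), (7, 1)]` both pairs are vouched for by two objects, and only the timestamp tells the reader that 7 is the newer value.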
Previous Work II: Optimal Failure Resilience (n>3t) • Synchronous system [Bazzi, DC’00] • Safe register • Servers communicate with each other (finish operations for faulty clients) [Martin, Alvisi & Dahlin, DISC’02] • MWMR atomic register • Reliable clients [Attiya & Bar-Or, SRDS’03] • Missing: optimal-resilience data-centric asynchronous implementations resilient to process failures
Optimal Resilience: The Challenge • n=4, t=1 • [Figure labels: (v0,0) (v0,0) (v0,0) delayed; (v, 1); the writer; the reader]
Optimal Resilience: The Challenge • [Figure labels: (v,1) (v,1) (v0,0); ack; the writer; the reader]
Optimal Resilience: The Challenge • [Figure labels: (v,1) (v,1) (v0,0) delayed; the writer; the reader]
Optimal Resilience: The Challenge • Cannot return v0 • [Figure labels: (v,1) (v,1) (v0,0); (v0,0) (v0,0) ? (v,1); the writer; the reader]
Optimal Resilience: The Challenge • Cannot return v1 • No write happened • [Figure labels: (v0,0) (v0,0) (v0,0); (v0,0) (v0,0) ? (v,1); the writer; the reader]
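The two "cannot return" scenarios above can be replayed in a small simulation to show that the reader's view is identical in both. This is an illustrative sketch only: the object indices, the choice of which object is Byzantine or delayed, and the helper `reader_view` are assumptions made for the example, with n=4 and t=1 as on the first slide.

```python
V0, V = ("v0", 0), ("v", 1)

def reader_view(states, byzantine, lie, silent):
    """Replies the reader collects in one round.

    Every object except `silent` answers; the Byzantine object answers
    with `lie` instead of its real state. Sorted, since arrival order
    carries no information.
    """
    return sorted(
        lie if i == byzantine else state
        for i, state in enumerate(states)
        if i != silent
    )

# Run A: the Write of (v,1) completed on objects 0-2; object 3 missed it.
# Object 1 is Byzantine and reports the old value; object 0 is delayed.
view_a = reader_view([V, V, V, V0], byzantine=1, lie=V0, silent=0)

# Run B: no Write happened. Object 2 is Byzantine and invents (v,1);
# object 0 is delayed.
view_b = reader_view([V0, V0, V0, V0], byzantine=2, lie=V, silent=0)

print(view_a == view_b)  # prints True
```

In run A the reader must return v and in run B it must return v0, yet it sees the same three replies in both, so no one-round rule over n − t replies can be right in both runs.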
Reliable Writer Solution: Wait • [Figure labels: (v,1) (v,1) (v0,0); (v0,0) (v0,0) (v,1) (v,1) (v,1); the writer; the reader]
Faulty Writer Scenario • Cannot wait! • [Figure labels: (v, 1); (v0,0) (v0,0) (v, 1); ? (v,1); the writer; the reader]
What Does This Mean? Is a solution with n=3t+1 impossible?
Lower Bound for Optimal Resilience • The reader cannot return any value, and cannot wait for more, so what can it do? • Invoke more rounds! • Will this help? • No! Exactly the same thing can happen in every round. • Conclusion?
Write Lower Bound • For a 1W1R binary safe register (weakest meaningful object type) constructed from any base object type: if n ≤ 4t and processes can crash, emulating Write in one round is impossible • To emulate Write, there must be at least one base object on which two operations are invoked
Two Rounds Save the Day! • [Figure labels, object states over time: pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v,1; pw=v,1 w=v0,0; pw=v,1 w=v,1; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; (v, 1); the writer; the reader]
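The pw/w labels suggest a Write that first records the new pair in a pre-write field and only then in the write field. A minimal in-memory sketch of that two-round pattern follows; `BaseObject`, `two_round_write`, and the `reachable` parameter are hypothetical names, and the rule that each round needs n − t acknowledgements is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class BaseObject:
    pw: tuple = ("v0", 0)  # pre-written (value, timestamp)
    w: tuple = ("v0", 0)   # written (value, timestamp)

def two_round_write(objects, value, ts, reachable, t):
    """Pre-write round, then write round.

    `reachable` is the set of object indices that respond in this run.
    Each round counts acknowledgements and needs n - t of them.
    """
    n = len(objects)
    for field in ("pw", "w"):
        acks = 0
        for i in reachable:
            setattr(objects[i], field, (value, ts))
            acks += 1
        if acks < n - t:
            # A real writer would keep waiting here; the sketch gives up.
            return False
    return True

objs = [BaseObject() for _ in range(4)]
done = two_round_write(objs, "v", 1, reachable={0, 1, 2}, t=1)
```

After this run, objects 0 to 2 hold pw=(v,1) and w=(v,1), while object 3 still holds the initial pair in both fields, which matches the mix of states a reader may encounter.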
Writer Fails During Pre-Write • Can return v0 • [Figure labels: pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v0,0; (v, 1); the writer; the reader]
Writer Fails During Write • Can wait to hear more v,1 • [Figure labels: pw=v0,0 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v,1; (v, 1); the writer; the reader]
Write Never Happened • Can wait to hear more without v,1 • [Figure labels: pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v,1; the writer; the reader]
Overlapping Write Scenario • Cannot wait • [Figure labels: pw=v,1 w=v,1; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v0,0; (v, 1) (v0,0) (v0,0) (v,1); the writer; the reader]
Solution? • More read rounds • By the next round, all objects responding to the Read will have seen the first phase of the Write • By causality • But what if there are additional overlapping writes? • Safe register implementation can deduce in the next round (in general, in min(f+2,t+1) rounds) that there is an overlapping write; returns arbitrary value
Determining Value to Return • Read values (from w field) are candidates to return • If there are 2t+1 responses without v (in either pw or w) then v is no longer a candidate • Either was not written before Read, or some Write overlaps read • If the highest timestamped candidate occurs t+1 times, it can be returned • If the candidates set is empty, return any value • There must be an overlapping Write • Each round, wait for at least one more response than in previous rounds • After min(t+1,f+2) rounds, it will be possible to return
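The rules on this slide can be sketched as a single decision step over the replies gathered so far. This is a hedged illustration rather than the published algorithm: `decide` is a hypothetical name, each reply is a (pw, w) pair of (value, timestamp) tuples, and counting an occurrence of a candidate in either field toward the t+1 threshold is an assumption of this sketch (the slide does not say which field is counted). The sketch also evaluates one snapshot and does not model the round structure.

```python
def decide(replies, t):
    """Return ("return", value), ("any", None), or ("wait", None)."""
    def seen(c):
        # Number of replies that mention c in pw or in w.
        return sum(1 for pw, w in replies if c in (pw, w))

    # Values read from the w field are the candidates.
    candidates = {w for _pw, w in replies}
    # A candidate unknown to 2t+1 responders is dropped.
    candidates = {c for c in candidates
                  if len(replies) - seen(c) < 2 * t + 1}
    if not candidates:
        return ("any", None)  # some Write must overlap the Read
    top = max(candidates, key=lambda c: c[1])
    if seen(top) >= t + 1:
        return ("return", top[0])
    return ("wait", None)

V0, V = ("v0", 0), ("v", 1)
# Writer failed during pre-write: only one object has pw=(v,1).
print(decide([(V0, V0), (V0, V0), (V, V0)], t=1))  # prints ('return', 'v0')
```

With a faulty object reporting pw=w=(v,1) for a write that never happened, the same function answers "wait" after three replies and returns v0 once a fourth reply without (v,1) arrives.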
Summary: Wait-Free Safe Register Construction for n>3t • Invokes a bounded number of Read rounds • Optimal round complexity • Constant storage • Unbounded timestamps • Read takes two rounds (after synchronization) in (eventually) synchronous runs
Summary: Lower Bounds for n>3t and Fault-Prone Clients • 1W1R binary safe register construction: • WRITE: 2 communication rounds are necessary irrespective of base object types • Termination: obstruction freedom • READ: min(t+1,f+2) communication rounds are necessary irrespective of base object types if readers do not modify base objects • Termination: Every write terminates, read terminates if eventually runs in isolation (Finite-Write Termination)
A Note on Safe Registers • Wait-free safe registers are too weak to be directly useful to applications • They are good for lower bounds • Bounded constructions of wait-free atomic registers from safe registers are known, but… • Complicated • Add communication rounds and memory • Too costly to deploy in a distributed setting
Registers with Stronger Semantics • Can we come up with an efficient, direct construction of a regular/atomic register? • Yes, if we are ready to compromise wait freedom
Termination Conditions • Wait-Freedom: • Every operation must complete within a finite number of steps regardless of steps of other processes • Lock-Freedom: • If there are concurrent operations, at least one of them must complete • Obstruction-Freedom: • If one process runs alone (without interference) for sufficiently long, it must complete its operation
Notes • Wait-freedom often hard to achieve • Lock-freedom and obstruction-freedom are achieved by simpler algorithms
What is the Right Condition? • Ultimately, we want to solve consensus • For state machine replication • Want it to be wait-free • Alas, impossible in asynchronous systems • We need a leader-oracle in order to solve consensus in asynchronous shared memory • ensures that at most one process writes (proposes values), while others only read
Reminder: Disk Paxos
/* each block xi has fields bal, val, num, dec */
forever()
  if leader (by leader oracle) then
    b ← choose new unique rank
    write b,v,n,⊥ to xi
    read all xj’s
    if some xj.bal > b then continue
    if all read xj.val’s = ⊥ then v ← my initial value
    else v ← read val with highest num
    n ← b
    write b,v,n,⊥ to xi
    read all xj’s
    if some xj.bal > b then continue
    write b,v,n,v to xi
    return v
  else /* non-leader */
    read xld where ld is leader
    if xld.dec ≠ ⊥ then return xld.dec
Observation: Only leader writes; others just read. Leader stops writing upon deciding.
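To make the control flow concrete, one pass of the leader's loop body can be simulated over in-memory blocks. This is a rough sketch under simplifying assumptions: the blocks are plain Python objects that never fail, "read all xj's" is modeled as an instantaneous scan, and `Block` and `leader_attempt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    bal: int = 0                # highest rank this process has used
    val: Optional[str] = None   # current proposal (None stands for bottom)
    num: int = 0                # rank at which val was adopted
    dec: Optional[str] = None   # decided value, if any

def leader_attempt(x, i, b, initial):
    """One pass of the leader's loop for process i with rank b.

    Returns the decided value, or None if a higher rank was observed
    (where the pseudocode would `continue` with a new rank).
    """
    x[i] = Block(b, x[i].val, x[i].num, None)
    if any(blk.bal > b for blk in x):
        return None
    voted = [blk for blk in x if blk.val is not None]
    # Adopt the value with the highest num, else use our own input.
    v = max(voted, key=lambda blk: blk.num).val if voted else initial
    x[i] = Block(b, v, b, None)
    if any(blk.bal > b for blk in x):
        return None
    x[i] = Block(b, v, b, v)
    return v

x = [Block() for _ in range(3)]
first = leader_attempt(x, 0, 1, "a")    # decides its own input
second = leader_attempt(x, 1, 2, "b")   # must adopt the earlier value
stale = leader_attempt(x, 2, 1, "c")    # sees rank 2 and backs off
```

In this run `first` and `second` are both "a" and `stale` is None: a later leader adopts the value already written with the highest num, and a leader with an outdated rank gives up the pass.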
What is the Right Condition? (Cont’d) • Will Lock-Freedom help? • No, because as long as the leader does not succeed in writing, no decision is possible • Is Wait-Freedom really needed?
Finite-Writes Termination • Finite-Writes Termination (FW-termination) • Every write (by a correct process) eventually returns • Every read (by a correct process) eventually returns unless infinitely many writes are invoked • Wait-free consensus with a leader oracle is solvable with FW-terminating regular 1WMR registers
Consensus w/ FW-Terminating Registers • We observe that existing wait-free shared-memory failure-detector-based (Ω) consensus algorithms* work with FW-Terminating registers • Lo and Hadzilacos’96 • Gafni and Lamport’00 • *with small modifications
Back to the Register Emulation • To implement a regular FW-Termination register, Read keeps invoking rounds until the highest timestamped value appears in t+1 responses • When overlapping Writes stop occurring, Read eventually returns • Number of rounds unbounded by construction