
Computing with Byzantine Shared Memory

Computing with Byzantine Shared Memory. Topics in Reliable Distributed Systems Fall 2004-2005, Idit Keidar. Sources. Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory Abraham, Chockler, Keidar, Malkhi, PODC04


Presentation Transcript


  1. Computing with Byzantine Shared Memory Topics in Reliable Distributed Systems Fall 2004-2005, Idit Keidar

  2. Sources • Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory • Abraham, Chockler, Keidar, Malkhi, PODC04 • Optimal Resilience Wait-Free Storage from Byzantine Components: Inherent Costs and Solutions • Chockler, Keidar, Malkhi, FuDiCo II, June 2004. • Some slides stolen from Chockler and Malkhi

  3. Internet Information Services Internet

  4. Internet Information Services Network storage [Fleet (Hebrew U and Lucent) Agile (GTech), Coca (Cornell), SBQL (UTexas), others] web

  5. Internet Information Services Peer-to-peer storage [Oceanstore, etc.] web

  6. Storage Area Network (SAN)

  7. Large-Scale Deployment • Clients come and go • cannot be expected to be around • hence, direct communication infeasible • Communication among storage nodes is also infeasible • In SAN impossible • Thin (scalable) servers, more logic at clients • Some of the storage nodes can be compromised by a malicious adversary

  8. Fault/Security Scenarios • Client crash, unresponsiveness • Access control for clients • access restricted to data they own • if bypassed, what can we do? • A malicious client can mess up the data anyway • Servers entrusted with others’ data • compromised servers a bigger problem • Byzantine faults possible

  9. Assumptions • Any number of client failures (crash) • no Byzantine faults (assume access control) • Threshold (t-out-of-n) of storage components (per object) can be faulty • Byzantine, unresponsive: NR-Arbitrary Faults • Monitoring/reconfiguration service • Each object has enough healthy replicas • Naming service: e.g., DHT

  10. Example: Internet-Wide Store • Client: ha=lookup(a); read(ha); write(ha); … [Figure labels: NS (naming service) with ha={X,Y,Z,A}; storage nodes X, Y, Z, A]

  11. Data-Centric Replication No server-to-server comm NR-arbitrary failures No client-to-client comm Bounded, unknown number of clients

  12. Some System Examples • Fleet, HUJI and Lucent • Rosebud, MIT • Agile, GTech • Coca, Cornell • SBQL, UTexas • OceanStore, Berkeley • Pasis, CMU • others

  13. Our Focus: Foundations • Wanted: generic services useful for applications • Reliable R/W registers • Consensus • Which register semantics can be supported? • Atomic, regular, safe • Termination conditions • At which cost? • Failure resilience • Communication rounds • Memory consumption • By understanding tradeoffs, system designers will be able to intelligently decide what to choose

  14. Formal Model • Asynchronous shared memory • servers=shared objects, clients=processes • Shared objects may experience NR-arbitrary failures [JCT98] • faulty object can respond with arbitrary value or fail to respond • Processes can fail by crashing

  15. Previous Work I: Wait-Free Constructions • Safe register: A read that does not overlap a write returns the register’s last written value • Malkhi & Reiter 98: n > 4t • One round read/write, unbounded timestamps • Jayanti, Chandra & Toueg 98: n > 5t • One round read/write, no timestamps • Self-construction: implement safe register from collection of fault-prone safe registers

  16. Byzantine Quorum Systems: Example [Malkhi and Reiter 98] • At most one server can be penetrated [Figure labels: x = 7, t = 1; x = 2, t = 5; x = 0, t = 0; x = 7, t = 1; x = 7, t = 1; x = 7]

  17. Byzantine Quorum Systems: Example [Malkhi and Reiter 98] • Why timestamps? [Figure labels: x = 7, t = 1; x = 0, t = 0; x = 0, t = 0; x = 7, t = 1; x = 7, t = 1; x = 7]
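The two quorum examples suggest a simple read rule, sketched here in Python under assumptions the slides do not spell out: the reader collects (value, timestamp) replies, trusts only pairs reported by at least t+1 servers (so at least one correct server vouches for them), and returns the trusted pair with the highest timestamp. The name `masking_read` and the reply encoding are hypothetical.

```python
from collections import Counter

def masking_read(replies, t, default=None):
    """Return the highest-timestamped (value, ts) pair reported by >= t+1 servers.

    replies: list of (value, timestamp) pairs, one per responding server.
    t: assumed bound on the number of Byzantine servers.
    """
    counts = Counter(replies)
    # A pair reported by at most t servers could be entirely fabricated.
    vouched = [pair for pair, c in counts.items() if c >= t + 1]
    if not vouched:
        return default
    # Timestamps break the tie between a stale value and the latest one.
    return max(vouched, key=lambda pair: pair[1])

# Slide 16: one server reports a forged (2, 5); it is outvoted.
print(masking_read([(7, 1), (2, 5), (0, 0), (7, 1)], t=1))
# Slide 17: old and new values both have t+1 votes; the timestamp decides.
print(masking_read([(7, 1), (0, 0), (0, 0), (7, 1)], t=1))
```

Both calls return `(7, 1)`. Without the timestamp, the second case would leave the reader unable to tell the stale value 0 from the current value 7.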

  18. Previous Work II: Optimal Failure Resilience (n>3t) • Synchronous system [Bazzi, DC’00] • Safe register • Servers communicate with each other (finish operations for faulty clients) [Martin, Alvisi & Dahlin, DISC’02] • MWMR atomic register • Reliable clients [Attiya & Bar-Or, SRDS’03] • Missing: optimal-resilience data-centric asynchronous implementations resilient to process failures

  19. Optimal Resilience: The Challenge • n=4, t=1 [Figure labels: (v0,0); (v0,0); (v0,0); delayed; (v,1); The writer; The reader]

  20. Optimal Resilience: The Challenge [Figure labels: (v,1); (v,1); (v0,0); ack; The writer; The reader]

  21. Optimal Resilience: The Challenge [Figure labels: (v,1); (v,1); (v0,0); delayed; The writer; The reader]

  22. Optimal Resilience: The Challenge • Cannot return v0 [Figure labels: (v,1); (v,1); (v0,0); (v0,0); (v0,0); ?; (v,1); The writer; The reader]

  23. Optimal Resilience: The Challenge • Cannot return v1 • No write happened [Figure labels: (v0,0); (v0,0); (v0,0); (v0,0); (v0,0); ?; (v,1); The writer; The reader]
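The dilemma in slides 22 and 23 is an indistinguishability argument, which the following Python sketch makes concrete for n=4, t=1. The object numbering, the choice of which object is Byzantine, and the helper `reader_view` are assumptions made for illustration; the slides show only the figure labels.

```python
V0, V = ("v0", 0), ("v", 1)

def reader_view(stored, forged, responders):
    """Replies seen by the reader, as an order-insensitive sorted list.

    stored: (value, ts) pair held by each object.
    forged: replies a Byzantine object substitutes for its stored pair.
    responders: the n-t objects that answer before the reader stops waiting.
    """
    return sorted(forged.get(obj, stored[obj]) for obj in responders)

# Scenario A (assumed): the write of (v,1) completed on objects 0-2.
# Object 2 is Byzantine and claims it still holds (v0,0); object 1 is delayed.
view_a = reader_view({0: V, 1: V, 2: V, 3: V0}, {2: V0}, responders=[0, 2, 3])

# Scenario B (assumed): no write happened. Object 0 is Byzantine and
# forges (v,1); object 1 is again delayed.
view_b = reader_view({0: V0, 1: V0, 2: V0, 3: V0}, {0: V}, responders=[0, 2, 3])

print(view_a == view_b)  # True
```

A safe register must return v in scenario A and v0 in scenario B, yet the reader sees the same three replies in both, so no rule based on a single round of n-t replies can be correct.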

  24. Reliable Writer Solution: Wait [Figure labels: (v,1); (v,1); (v0,0); (v0,0); (v0,0); (v,1); (v,1); (v,1); The writer; The reader]

  25. Faulty Writer Scenario • Cannot wait! [Figure labels: (v,1); (v0,0); (v0,0); (v,1); ?; (v,1); The writer; The reader]

  26. What Does This Mean? Is a solution with n=3t+1 impossible?

  27. Lower Bound for Optimal Resilience • The reader cannot return any value, and cannot wait for more, so what can it do? • Invoke more rounds! • Will this help? • No! Exactly the same thing can happen in every round. • Conclusion?

  28. Write Lower Bound • For any construction of a 1W1R binary safe register (the weakest meaningful object type) from base objects of any type: if n ≤ 4t and processes can crash, emulating Write in one round is impossible • To emulate Write, there must be at least one base object on which two operations are invoked

  29. Two Rounds Save the Day! [Figure labels: pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v,1; pw=v,1 w=v0,0; pw=v,1 w=v,1; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; (v,1); The writer; The reader]

  30. Writer Fails During Pre-Write • Can return v0 [Figure labels: pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v0,0; (v,1); The writer; The reader]

  31. Writer Fails During Write • Can wait to hear more v,1 [Figure labels: pw=v0,0 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v,1; (v,1); The writer; The reader]

  32. Write Never Happened • Can wait to hear more without v,1 [Figure labels: pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v,1; The writer; The reader]
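The pw/w labels in slides 29 to 32 suggest a write that runs in two rounds: a pre-write that sets the pw field, then a write that sets the w field. The sketch below simulates this on in-memory objects. It assumes, beyond what the slides state, that the writer proceeds once n-t objects have acknowledged a round; `BaseObject`, `two_round_write`, and the `responsive` parameter are hypothetical names.

```python
from dataclasses import dataclass

INITIAL = ("v0", 0)

@dataclass
class BaseObject:
    pw: tuple = INITIAL  # pre-written (value, timestamp)
    w: tuple = INITIAL   # written (value, timestamp)

def two_round_write(objects, value, ts, t, responsive):
    """Pre-write, then write, on the objects that respond (assumed >= n-t)."""
    quorum = len(objects) - t
    if len(responsive) < quorum:
        raise RuntimeError("fewer than n-t objects respond; the writer would keep waiting")
    for i in responsive:          # round 1: pre-write
        objects[i].pw = (value, ts)
    for i in responsive:          # round 2: write
        objects[i].w = (value, ts)

objects = [BaseObject() for _ in range(4)]  # n = 4, t = 1
two_round_write(objects, "v", 1, t=1, responsive=[0, 1, 2])
```

Because round 2 starts only after round 1 finished, any object whose w field holds (v,1) implies that n-t objects already hold (v,1) in pw. A writer that crashes mid-way leaves exactly the intermediate states pictured in slides 30 and 31.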

  33. Can Read Always Complete in One Round?

  34. Overlapping Write Scenario • Cannot wait [Figure labels: pw=v,1 w=v,1; pw=v0,0 w=v0,0; pw=v0,0 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v0,0; pw=v,1 w=v0,0; (v,1); (v0,0); (v0,0); (v,1); The writer; The reader]

  35. Solution? • More read rounds • By the next round, all objects responding to the Read will have seen the first phase of the Write • By causality • But what if there are additional overlapping writes? • Safe register implementation can deduce in the next round (in general, in min(f+2,t+1) rounds) that there is an overlapping write; returns arbitrary value

  36. Determining Value to Return • Read values (from w field) are candidates to return • If there are 2t+1 responses without v (in either pw or w) then v is no longer a candidate • Either was not written before Read, or some Write overlaps read • If the highest timestamped candidate occurs t+1 times, it can be returned • If the candidates set is empty, return any value • There must be an overlapping Write • Each round, wait for at least one more response than in previous rounds • After min(t+1,f+2) rounds, it will be possible to return
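The rules on slide 36 can be sketched as one evaluation step that the reader repeats after each round. This is an interpretation, not the authors' code: it assumes candidates are seeded from the w fields of the first replies, and that "occurs t+1 times" counts a reply that carries the pair in either pw or w. The names `initial_candidates` and `decide` are hypothetical.

```python
def initial_candidates(responses):
    """Candidates are the (value, ts) pairs read from the w field."""
    return {w for _pw, w in responses}

def decide(responses, candidates, t, any_value):
    """Apply the slide-36 rules to all replies gathered so far.

    responses: list of (pw, w) pairs, one per object heard from.
    Returns (done, value, live_candidates).
    """
    # Drop a candidate once 2t+1 replies carry it in neither pw nor w.
    live = {c for c in candidates
            if sum(1 for pw, w in responses if c not in (pw, w)) < 2 * t + 1}
    if not live:
        # Some write must overlap the read, so a safe register may return anything.
        return True, any_value, live
    top = max(live, key=lambda c: c[1])
    support = sum(1 for pw, w in responses if top in (pw, w))
    if support >= t + 1:
        return True, top[0], live
    return False, None, live  # wait for at least one more reply
```

For the "write never happened" case with t=1: after replies `[(V, V), (V0, V0), (V0, V0)]`, where `V = ("v", 1)` is forged and `V0 = ("v0", 0)`, the step cannot decide; once a fourth reply `(V0, V0)` arrives, (v,1) is eliminated and v0 is returned.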

  37. Summary: Wait-Free Safe Register Construction for n>3t • Invokes a bounded number of Read rounds • Optimal round complexity • Constant storage • Unbounded timestamps • Read takes two rounds (after synchronization) in (eventually) synchronous runs

  38. Summary: Lower Bounds for n>3t and Fault-Prone Clients • 1W1R binary safe register construction: • WRITE: 2 communication rounds are necessary irrespective of base object types • Termination: obstruction freedom • READ: min(t+1,f+2) communication rounds are necessary irrespective of base object types if readers do not modify base objects • Termination: Every write terminates, read terminates if eventually runs in isolation (Finite-Write Termination)

  39. A Note on Safe Registers • Wait-free safe registers are too weak to be directly useful to applications • They are good for lower bounds • Bounded constructions of wait-free atomic registers from safe registers are known, but… • Complicated • Add communication rounds and memory • Too costly to deploy in a distributed setting

  40. Registers with Stronger Semantics • Can we come up with an efficient, direct construction of a regular/atomic register? • Yes, if we are ready to compromise wait freedom

  41. Termination Conditions • Wait-Freedom: • Every operation must complete within a finite number of steps regardless of steps of other processes • Lock-Freedom: • If there are concurrent operations, at least one of them must complete • Obstruction-Freedom: • If one process runs alone (without interference) for sufficiently long, it must complete its operation

  42. Notes • Wait-freedom often hard to achieve • Lock-freedom and obstruction-freedom are achieved by simpler algorithms

  43. What is the Right Condition? • Ultimately, we want to solve consensus • For state machine replication • Want it to be wait-free • Alas, impossible in asynchronous systems • We need a leader-oracle in order to solve consensus in asynchronous shared memory • ensures that at most one process writes (proposes values), while others only read

  44. Reminder: Disk Paxos
      forever()
        if leader (by leader oracle) then
          b ← choose new unique rank
          write ⟨b, v, n, ⊥⟩ to xi
          read all xj’s
          if some xj.bal > b then continue
          if all read xj.val’s = ⊥ then v ← my initial value
          else v ← read val with highest num
          n ← b
          write ⟨b, v, n, ⊥⟩ to xi
          read all xj’s
          if some xj.bal > b then continue
          write ⟨b, v, n, v⟩ to xi
          return v
        else /* non-leader */
          read xld where ld is leader
          if xld.dec ≠ ⊥ then return xld.dec
      Observation: Only the leader writes; others just read. The leader stops writing upon deciding.
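One iteration of the leader branch of the Disk Paxos loop can be simulated sequentially in Python. This is a minimal sketch under simplifying assumptions: registers are entries of an in-memory list, nothing fails, ranks are supplied by the caller, and `None` stands in for ⊥. `Block`, `leader_attempt`, and `follower_read` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    bal: int = 0                 # highest rank the owner has started
    val: Optional[str] = None    # value the owner currently proposes
    num: int = 0                 # rank at which val was adopted
    dec: Optional[str] = None    # decision, once reached

def leader_attempt(x, i, b, initial):
    """Run one loop iteration for leader i with rank b.

    Returns the decided value, or None where the pseudocode says `continue`.
    """
    x[i] = Block(b, x[i].val, x[i].num, None)          # announce rank b
    if any(blk.bal > b for blk in x):                  # a higher rank exists
        return None
    seen = [blk for blk in x if blk.val is not None]
    v = max(seen, key=lambda blk: blk.num).val if seen else initial
    x[i] = Block(b, v, b, None)                        # adopt v at rank b
    if any(blk.bal > b for blk in x):
        return None
    x[i] = Block(b, v, b, v)                           # record the decision
    return v

def follower_read(x, ld):
    """Non-leader branch: return the leader's decision, or None if undecided."""
    return x[ld].dec

x = [Block(), Block(), Block()]
print(leader_attempt(x, 0, 1, "a"))   # decides its own initial value
print(leader_attempt(x, 1, 2, "b"))   # adopts the value already chosen
print(leader_attempt(x, 2, 1, "c"))   # rank too low, must retry
```

The three calls print `a`, `a`, and `None`: a later leader with a higher rank adopts the value it reads rather than its own, which is what keeps decisions consistent.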

  45. What is the Right Condition? (Cont’d) • Will Lock-Freedom help? • No, because as long as the leader does not succeed in writing, no decision is possible • Is Wait-Freedom really needed?

  46. Finite-Writes Termination • Finite-Writes Termination (FW-termination) • Every write (by a correct process) eventually returns • Every read (by a correct process) eventually returns unless infinitely many writes are invoked • Wait-free consensus with an Ω leader oracle is solvable with FW-terminating regular 1WMR registers

  47. Finite-Writes Termination

  48. Consensus w/ FW-Terminating Registers • We observe that existing wait-free shared-memory failure-detector-based (Ω) consensus algorithms* work with FW-Terminating registers • Lo and Hadzilacos’96 • Gafni and Lamport’00 *with small modifications

  49. Back to the Register Emulation • To implement a regular FW-Termination register, Read keeps invoking rounds until the highest timestamped value appears in t+1 responses • When overlapping Writes stop occurring, Read eventually returns • Number of rounds unbounded by construction

  50. The Complete System
