
Replication techniques Primary-backup, RSM, Paxos






Presentation Transcript


  1. Replication techniques: Primary-backup, RSM, Paxos (Jinyang Li)

  2. Fault tolerance => replication • How to recover a single node from power failure? • Option 1: wait for reboot. Data is durable, but service is temporarily unavailable • Option 2: use multiple nodes to provide service. Another node takes over to provide service

  3. Replicated state machine (RSM) • RSM is a general replication method • Lab 6: apply RSM to the lock service • RSM rules: • All replicas start in the same initial state • Every replica applies operations in the same order • All operations must be deterministic • All replicas end up in the same state
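The rules above are easy to see in a toy example. Below is a minimal sketch (mine, not the course's code) of two replicas that start in the same state and apply the same deterministic operations in the same order, and so end up in the same state:

```python
# Minimal RSM illustration: same initial state + same deterministic ops in the
# same order => same final state on every replica.

class Replica:
    def __init__(self):
        self.state = {}                  # same initial state on every replica

    def apply(self, op):
        # op is deterministic: its effect depends only on current state + op
        kind, key, val = op
        if kind == "put":
            self.state[key] = val
        elif kind == "append":
            self.state[key] = self.state.get(key, "") + val

log = [("put", "x", "a"), ("append", "x", "b"), ("put", "y", "c")]

r1, r2 = Replica(), Replica()
for op in log:                           # every replica applies ops in the same order
    r1.apply(op)
    r2.apply(op)

assert r1.state == r2.state              # all replicas end up in the same state
```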

  4. RSM • [Diagram: two clients issue concurrent requests opA and opB to a set of replicas, which may receive them in different orders] • How to maintain a single order in the face of concurrent client requests?

  5. RSM: primary/backup • [Diagram: clients send opA and opB to the primary, which forwards them in a single order to the backup] • Primary/backup ensures a single order of ops: • Primary orders operations • Backups execute operations in that order
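A small hypothetical sketch of the primary/backup ordering described on this slide: the primary assigns sequence numbers to concurrent client requests and forwards them, and the backup executes strictly in sequence order even if messages arrive out of order. The class and method names are illustrative, not Lab 6's API.

```python
# Primary/backup sketch: the primary imposes a single order; the backup replays it.

class Backup:
    def __init__(self):
        self.next_seq = 1
        self.pending = {}
        self.state = {}

    def replicate(self, seq, op):
        self.pending[seq] = op
        # execute strictly in sequence order, even if (seq, op) arrives out of order
        while self.next_seq in self.pending:
            key, val = self.pending.pop(self.next_seq)
            self.state[key] = val
            self.next_seq += 1

class Primary:
    def __init__(self, backup):
        self.seq = 0
        self.backup = backup
        self.state = {}

    def handle(self, op):
        self.seq += 1                        # primary picks the global order
        key, val = op
        self.state[key] = val                # execute locally
        self.backup.replicate(self.seq, op)  # forward (seq, op) to the backup

b = Backup()
p = Primary(b)
p.handle(("x", 1))   # opA
p.handle(("y", 2))   # opB
assert p.state == b.state
```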

  6. Case study: Hypervisor [Bressoud and Schneider] • Hypervisor’s goal: fault tolerance • Banks, telephone exchanges, NASA need fault-tolerant computing • CPUs are most likely to fail due to complexity • Hypervisor: primary/backup replication • If primary fails, backup takes over • Caveat: assumes failure detection is perfect

  7. Hypervisor replicates at VM-level • Why replicate at the VM level? • Hardware fault-tolerant machines were big in the 80s • A software solution is more economical than a hardware one • Replicating at the O/S level is messy (many interfaces) • Replicating at the app level requires programmer effort (web developers are not used to writing RSM code) • Replicating at the VM level has a cleaner interface (and no need to change the O/S or apps) • Primary and backup execute the same sequence of machine instructions

  8. A strawman design • [Diagram: two machines, each with its own memory] • Two identical machines • Same initial memory/disk content • Start executing on both machines • Will they perform the same computation?

  9. A strawman design • Property: nodes executing the same sequence of ops see the same effect if the ops are deterministic • What are deterministic operations? • ADD, MUL etc. • Read time-of-day register, cycle counter, privilege level? • Read memory? • Read disk? • Interrupt timing? • External input devices (network, keyboard) • External output (network, display)

  10. Deal with I/O devices • Strawman replicates disks at both machines • Problem: hardware might not behave identically (e.g. fail at different sectors) • [Diagram: primary and backup, each with memory, connected by Ethernet and sharing I/O devices on a SCSI bus] • Hypervisor connects the I/O devices to both machines (shared SCSI bus) • Only primary reads/writes to devices • Primary sends read values to backup • Only primary handles interrupts from h/w • Primary sends interrupts to backup

  11. Hypervisor executes in epochs • How to execute interrupts at the same time on both nodes? • Strawman: execute instructions one at a time • Backup waits for primary to send interrupts at the end of each instruction • Very slow… • Hypervisor executes in epochs • CPU h/w interrupts every N instructions (so both nodes stop at the same point) • Primary delays all interrupts till the end of an epoch • Primary sends all interrupts to backup
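The epoch mechanism can be illustrated with a rough sketch (my own illustration, not Bressoud and Schneider's code): interrupts arriving mid-epoch are buffered and delivered only at the epoch boundary, and the same batch is forwarded so the backup delivers it at the same instruction count.

```python
# Epoch-based interrupt delivery sketch. EPOCH plays the role of the hardware
# counter that interrupts every N instructions on both machines.

EPOCH = 4

def run_primary(instructions, arriving_interrupts, to_backup):
    buffered = []
    for i, instr in enumerate(instructions):
        instr()                                           # deterministic execution
        buffered.extend(arriving_interrupts.get(i, []))   # delay, don't deliver yet
        if (i + 1) % EPOCH == 0:
            to_backup.append((i + 1, list(buffered)))     # backup delivers at same point
            for irq in buffered:
                handle(irq)                               # deliver at epoch boundary
            buffered.clear()

def handle(irq):
    print("interrupt", irq, "delivered at epoch boundary")

# Example: an interrupt arriving after instruction 1 is delivered only after
# instruction 4, the end of the first epoch, on both primary and backup.
channel_to_backup = []
run_primary([lambda: None] * 8, {1: ["disk-read-done"]}, channel_to_backup)
```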

  12. Hypervisor failover • If primary fails, backup must handle I/O • How does backup know primary has failed? • How does backup know if primary has handled I/O writes in the last epoch? • Relies on the O/S to re-try the I/O • Device needs to support repeated ops • OK for disk writes/reads • OK for network (TCP will figure it out) • How about keyboard, printer, ATM cash machine?

  13. Hypervisor implementation • Hypervisor needs to trap every non-deterministic instruction • Time-of-day register • HP TLB replacement • HP branch-and-link instruction • Memory-mapped I/O loads/stores • Performance penalty is reasonable • A factor-of-two slowdown • How about its performance on modern hardware?

  14. Caveats in Hypervisor • What if the network between primary and backup fails? • Primary is still running • Backup becomes a new primary • Two primaries at the same time! • Can timeouts detect failures correctly? • Pings from backup to primary are lost • Pings from backup to primary are delayed

  15. Paxos: general approach • One (or more) node decides to be the leader • Leader proposes a value and solicits acceptance from others • Leader announces the result or tries again

  16. Paxos: fault tolerant agreement • Paxos lets all nodes agree on the same value despite node failures, network failures and delays • Extremely useful: • e.g. Nodes agree that X is the primary • e.g. Nodes agree that the last instruction I is executed

  17. Paxos requirement • Correctness: • All nodes agree on the same value • The agreed value X has been proposed by some node • Fault-tolerance: • If fewer than N/2 nodes fail, the remaining nodes should eventually reach agreement w.h.p. • Liveness is not guaranteed

  18. Why is agreement hard? • What if >1 nodes become leaders simultaneously? • What if there is a network partition? • What if a leader crashes in the middle of solicitation? • What if a leader crashes after deciding but before announcing results? • What if the new leader proposes different values than already decided value?

  19. Paxos setup • Each node proposes a value for a particular view • Agreement is of type vid=<XYZ> • Each node runs as a proposer, acceptor or learner

  20. Strawman • Designate a single node X as acceptor (e.g. one with smallest id) • Each proposer sends its value to X • X decides on one of the values • X announces its decision to all learners • Problem? • Failure of the single acceptor halts decision • Need multiple acceptors!

  21. Strawman 2: multiple acceptors • Each proposer (leader) proposes its value to all acceptors • Each acceptor accepts the first proposal it receives • If the leader receives positive replies from a majority of acceptors, it decides on its own value • There is at most 1 majority, hence only a single value is decided • Leader sends the decided value to all learners • Problem: • What if multiple leaders propose simultaneously, so that no value is accepted by a majority?

  22. Paxos solution • Proposals are ordered by proposal # • Each acceptor may accept multiple proposals • If a proposal with value v is chosen, all higher proposals have value v

  23. Paxos operation: node state • Each node maintains: • na, va: highest proposal # and its corresponding value accepted by me • nh: highest proposal # seen • myn: my proposal # in current Paxos
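As a running illustration for the next few slides, here is one way to encode that per-node state in a short sketch. Proposal numbers are represented as (round, node_id) pairs so every node can generate unique, totally ordered numbers; that encoding is my assumption, not something the slides specify.

```python
# Per-node Paxos state: na/va, nh, and myn from the slide.

from dataclasses import dataclass

NULL = (0, 0)  # "no proposal seen/accepted yet"

@dataclass
class PaxosState:
    node_id: int
    n_a: tuple = NULL      # highest proposal # accepted by me
    v_a: object = None     # value of that accepted proposal
    n_h: tuple = NULL      # highest proposal # seen in any message
    my_n: tuple = NULL     # my proposal # in the current Paxos round

    def new_proposal(self):
        # choose my_n > n_h: bump the round, tie-break by node id
        self.my_n = (self.n_h[0] + 1, self.node_id)
        return self.my_n
```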

  24. Paxos operation: 3P protocol • Phase 1 (Prepare) • A node decides to be leader (and propose) • Leader chooses myn > nh • Leader sends <prepare, myn> to all nodes • Upon receiving <prepare, n>: if n < nh, reply <prepare-reject>; else set nh = n and reply <prepare-ok, na, va>
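Continuing the sketch above, Phase 1 might look like this; the direct function calls stand in for the RPCs a real system would use.

```python
# Phase 1 (Prepare), assuming the PaxosState class sketched above.

def on_prepare(acceptor, n):
    """Handler for <prepare, n> on an acceptor."""
    if n < acceptor.n_h:
        return ("prepare-reject",)
    acceptor.n_h = n                                # promise: ignore lower proposals
    return ("prepare-ok", acceptor.n_a, acceptor.v_a)

def phase1(leader, acceptors):
    """Leader: choose my_n > n_h and send <prepare, my_n> to all nodes."""
    n = leader.new_proposal()
    replies = [on_prepare(a, n) for a in acceptors]
    return n, [r for r in replies if r[0] == "prepare-ok"]
```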

  25. Paxos operation • Phase 2 (Accept): • If leader gets prepare-ok from a majority: V = the non-null value corresponding to the highest na received; if V = null, the leader can pick any V; send <accept, myn, V> to all nodes • If leader fails to get a majority of prepare-ok: delay and restart Paxos • Upon receiving <accept, n, V>: if n >= nh, set na = n, va = V and reply <accept-ok>; else reply <accept-reject>
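Phase 2 in the same sketch. Note the acceptor accepts when n >= nh: the leader's own prepare already raised nh to n on the acceptors that replied prepare-ok, so a strict > would wrongly reject the leader's accept.

```python
# Phase 2 (Accept), continuing the Phase 1 sketch above.

def choose_value(prepare_oks, my_value):
    """Leader: adopt the value of the highest-numbered accepted proposal, if any."""
    accepted = [(n_a, v_a) for _, n_a, v_a in prepare_oks if v_a is not None]
    return max(accepted)[1] if accepted else my_value   # no prior value: pick any V

def on_accept(acceptor, n, v):
    """Handler for <accept, n, v> on an acceptor."""
    if n >= acceptor.n_h:
        acceptor.n_h = n
        acceptor.n_a, acceptor.v_a = n, v
        return ("accept-ok",)
    return ("accept-reject",)
```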

  26. Paxos operation • Phase 3 (Decide) • If leader gets accept-ok from a majority • Send <decide, va> to all nodes • If leader fails to get accept-ok from a majority • Delay and restart Paxos
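Phase 3 and a leader-side driver tying the phases together, still in the same sketch; retries, timeouts and real message passing are omitted.

```python
# Leader-side driver across all three phases, assuming the phase1, choose_value
# and on_accept sketches above.

def run_paxos(leader, acceptors, my_value):
    n, oks = phase1(leader, acceptors)
    if len(oks) <= len(acceptors) // 2:
        return None                                   # no majority: delay, restart Paxos
    v = choose_value(oks, my_value)
    accept_oks = [on_accept(a, n, v) for a in acceptors]
    if sum(r[0] == "accept-ok" for r in accept_oks) <= len(acceptors) // 2:
        return None                                   # no majority: delay, restart Paxos
    # Phase 3: send <decide, v> to all nodes (learners simply record v)
    return v

# Example matching the slide's three-node scenario: N1 leads and val1 is decided.
nodes = [PaxosState(node_id=i) for i in range(3)]
assert run_paxos(nodes[1], nodes, "val1") == "val1"
```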

  27. Paxos operation: an example • [Diagram: three nodes N0, N1, N2, all starting with nh = na = va = null] • N1 becomes leader and sends <prepare, N1:1> to N0 and N2 • Both reply <prepare-ok, na = va = null> • N1 sends <accept, N1:1, val1>; N0 and N2 set na = N1:1, va = val1 and reply <accept-ok> • N1 sends <decide, val1> to N0 and N2

  28. Paxos properties • When is the value V chosen? • When the leader receives prepare-ok from a majority and proposes V • When a majority of nodes accept V • When the leader receives accept-ok from a majority for value V

  29. Understanding Paxos • What if more than one leader is active? • Suppose two leaders use different proposal numbers, N0:10 and N1:11 • Can both leaders see a majority of prepare-ok?

  30. Understanding Paxos • What if leader fails while sending accept? • What if a node fails after receiving accept? • If it doesn’t restart, ? • If it reboots, ? • What if a node fails after sending prepare-ok? • If it reboots, ?

  31. Using Paxos for RSM • Fault-tolerant RSM requires consistent replica membership • Membership: <primary, backups> • RSM goes through a series of membership changes: <vid-0, primary, backups>, <vid-1, primary, backups>, … • Use Paxos to agree on the <primary, backups> for a particular vid
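Under the same sketch, agreeing on the membership of a particular view is just a Paxos round whose value is the <primary, backups> tuple; the names below are illustrative, not the lab's actual API.

```python
# View-change sketch: reuse the run_paxos driver above to decide the membership
# of view `vid`.

views = {1: ("N1", ())}                        # static initial config: vid1 = N1

def propose_view(leader, acceptors, vid, primary, backups):
    decided = run_paxos(leader, acceptors, (primary, tuple(backups)))
    if decided is not None:
        views[vid] = decided                   # e.g. views[2] = ("N1", ("N2",))
    return decided
```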

  32. Lab5: Using Paxos for RSM • All nodes start with the static config vid1: N1 • N2 joins: a majority in vid1 (N1) accepts vid2: N1,N2 • N3 joins: a majority in vid2 (N1,N2) accepts vid3: N1,N2,N3 • N3 fails: a majority in vid3 (N1,N2,N3) accepts vid4: N1,N2

  33. Lab5: Using Paxos for RSM • [Diagram: N1 and N2 already have vid2: N1,N2; N3 only knows vid1: N1] • N3 joins and sends <prepare, vid2, N3:1> to N1 • N1 replies oldview: vid2 = N1,N2 has already been decided

  34. Lab5: Using Paxos for RSM • [Diagram: N1 and N2 have vid2: N1,N2; N3 now knows vid2] • N3 retries with <prepare, vid3, N3:1> sent to both N1 and N2

  35. Lab6: re-configurable RSM • Use RSM to replicate lock_server • Primary in each view assigns a viewstamp to each client request • Viewstamp is a tuple (vid:seqno) • (0:0)(0:1)(0:2)(0:3)(1:0)(1:1)(1:2)(2:0)(2:1) • All replicas execute client requests in viewstamp order
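Since a viewstamp is just a (vid, seqno) pair, representing it as a tuple gives exactly the order shown on this slide (a small sketch, not lab code):

```python
# Viewstamp ordering: tuple comparison already orders all ops of view 0 before
# all ops of view 1, and so on.

viewstamps = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1)]
assert viewstamps == sorted(viewstamps)   # replicas execute requests in this order
```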

  36. Lab6: Viewstamp replication • To execute an op with viewstamp vs, a replica must have executed all ops < vs • A replica can transfer state from another node to ensure its state reflects the execution of all ops < vs
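A sketch of this rule (all names illustrative, not the lab's API): a replica executes the op with viewstamp vs only if it has executed everything below vs; otherwise it first copies state from an up-to-date peer.

```python
# Execute-in-viewstamp-order with state transfer as the fallback. `replica` and
# `peer` are hypothetical objects with .state, .next_vs and .apply(op).

def execute(replica, peer, vs, op):
    if vs != replica.next_vs:                 # a gap: some op < vs not yet executed
        replica.state = dict(peer.state)      # state transfer from an up-to-date node
        replica.next_vs = peer.next_vs
    replica.apply(op)                         # now safe to execute in order
    vid, seqno = vs
    replica.next_vs = (vid, seqno + 1)        # (view changes reset seqno; omitted here)
```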

  37. Lab5: Using Paxos for RSM • [Diagram: N1 is in vid1: N1 with largest viewstamp myvs = (1:50); N2 joins] • N2 joins: Prepare, accept, decide etc. agree on the new view • N1 transfers state to bring N2 up-to-date (so N2's state reflects all ops up to (1:50))
