Decentralized Consensus Algorithms in Distributed Systems

State Machines Sabina Petride

General Problems • Consensus • a particular problem • algorithms and different formulations • correctness and time analysis • Application To Data Replication • replica coordination • group membership; reintegration • unique identifiers using logical/real clocks

The Paxos Parliament And The Consensus Problem • The Paxos Parliament • determine the law of the land, defined by the sequence of decrees passed • each legislator had his own ledger with decrees, their unique numbers and their contents • entries in ledgers could not be modified or deleted • legislators could leave the court for very long periods of time and return later • communication only by messengers (who could lose a message or deliver it many times) • Requirements • consistency of the ledgers • progress to ensure that some decree will eventually be passed • The Synod • basically, the same problem as with the Parliament, except that a single decree had to be passed • the group of priests/legislators asked to vote for a decree was called the quorum

This can be modelled as a consensus problem: • Agreement: no two ledgers should contain different decrees with the same number (no conflicts among ledgers) • Validity: any decree should be written in the standard form • Termination (the progress condition) • Agreement and validity are guaranteed, and progress is possible, if three conditions are satisfied: • B1 Each ballot has a unique number. • B2 The quorums of any two ballots have at least one priest in common. • B3 For every ballot B, if any priest in B's quorum has voted in an earlier ballot, then the decree of B equals the decree of the latest of those earlier ballots.
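Conditions B1 to B3 can be checked mechanically over a finite set of ballots. The following is a minimal sketch, not part of the original slides: the `Ballot` record, its field names, and the sample decrees are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Ballot:
    number: int              # ballot number (B1: must be unique)
    decree: str              # decree proposed in this ballot
    quorum: FrozenSet[str]   # priests asked to vote
    voters: FrozenSet[str]   # priests who actually voted


def satisfies_b1_b3(ballots: List[Ballot]) -> bool:
    """Return True if the given ballots satisfy B1, B2 and B3."""
    numbers = [b.number for b in ballots]
    if len(numbers) != len(set(numbers)):            # B1: unique numbers
        return False
    for a in ballots:                                # B2: quorums intersect
        for b in ballots:
            if not (a.quorum & b.quorum):
                return False
    for b in ballots:                                # B3: follow latest earlier vote
        earlier = [e for e in ballots
                   if e.number < b.number and (e.voters & b.quorum)]
        if earlier:
            latest = max(earlier, key=lambda e: e.number)
            if b.decree != latest.decree:
                return False
    return True


# Hypothetical sample data: five priests named A to E.
b1 = Ballot(1, "olive tax", frozenset("ABC"), frozenset("A"))
b2 = Ballot(2, "olive tax", frozenset("ACD"), frozenset("ACD"))
bad = Ballot(3, "wine tax", frozenset("CDE"), frozenset())
```

Here `[b1, b2]` passes the check, while adding `bad` violates B3, because priests C and D voted for "olive tax" in ballot 2.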

Assumptions About The System • partially synchronous distributed system in which processes take actions within l time and messages are delivered within d time • the system does not necessarily exhibit this “normal” timing behavior • each process has a direct communication channel with each other process • allowed failures: • timing failures (the bounds l and d can occasionally be exceeded) • loss, duplication or reordering of messages • process stopping • some stable storage is needed • process recovery is considered

The Synod Algorithm (1) Priest p chooses a new ballot number b. p sends message NextBallot(b) to some set of priests. (2) When a priest q receives a NextBallot(b), he checks the notes in the back of his ledger and determines the vote v with the largest ballot number less than b that he has cast. If such a vote doesn’t exist, then a default value null(q) is used. q sends p a LastVote(b,v) message. (3) After p receives a LastVote(b,v) message from all the priests in a majority set Q, he initiates a new ballot with number b, quorum Q, and decree chosen according to B3. p records the new ballot and sends BeginBallot(b,d) to Q. (4) If q receives BeginBallot(b,d) and decides to vote, then he records the vote in the back of his ledger and sends Voted(b,q) to p.

(5) If p has received a Voted(b,q) from all q in Q, then he writes d in his ledger and sends Success(d) to all priests. (6) After receiving Success(d), a priest enters d in his ledger.
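The six steps can be traced in a small failure-free simulation. This is a sketch under stated assumptions, not the presenter's implementation: messages are replaced by direct method calls, the names `Priest`, `run_ballot`, `promised` and `n_total` are hypothetical, and a priest simply ignores a NextBallot whose number is not larger than one he has already answered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vote = Tuple[int, str]  # (ballot number, decree)


@dataclass
class Priest:
    name: str
    last_vote: Optional[Vote] = None  # latest vote cast (back of the ledger)
    promised: int = -1                # highest ballot answered or voted in
    ledger: Optional[str] = None      # decree entered after Success

    def on_next_ballot(self, b: int) -> Tuple[bool, Optional[Vote]]:
        # Step 2. Assumption: stale NextBallot messages are ignored, so the
        # stored vote always has a ballot number smaller than b.
        if b <= self.promised:
            return False, None
        self.promised = b
        return True, self.last_vote   # None plays the role of null(q)

    def on_begin_ballot(self, b: int, decree: str) -> bool:
        # Step 4: refuse if a reply was already sent for a higher ballot.
        if b < self.promised:
            return False
        self.promised = b
        self.last_vote = (b, decree)
        return True


def run_ballot(b: int, proposal: str, reachable: List[Priest],
               n_total: int) -> Optional[str]:
    """Run one ballot among the reachable priests; return the decree or None."""
    quorum, votes = [], []
    for q in reachable:                              # steps 1 and 2
        ok, vote = q.on_next_ballot(b)
        if ok:
            quorum.append(q)
            votes.append(vote)
    if len(quorum) <= n_total // 2:                  # step 3 needs a majority
        return None
    prior = [v for v in votes if v is not None]
    decree = max(prior)[1] if prior else proposal    # choose decree per B3
    acks = [q.on_begin_ballot(b, decree) for q in quorum]   # step 4
    if not all(acks):                                # step 5 needs all of Q
        return None
    for q in reachable:                              # steps 5 and 6
        q.ledger = decree
    return decree
```

With five priests, a first ballot among three of them passes its own proposal; a later ballot whose quorum overlaps the first one is forced by B3 to pass the same decree, whatever it originally wanted to propose.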

Notes on The Synod Algorithm • to maintain B1, each ballot has to receive a unique number; this can be done by • having each priest note the ballots in his ledger • partitioning the set of possible ballot numbers among the priests (later we will talk about different implementations) • a priest should not cast a vote after receiving BeginBallot(b,d) if he has already sent a LastVote(b’,v’) message for some other ballot with v’.bal<b<b’. It follows that • a priest must record: • the number of every ballot he has initiated • every vote he has cast • every LastVote message he has sent
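One common way to partition ballot numbers among priests is to use pairs ordered lexicographically. The sketch below assumes each priest has a unique integer id; `BallotNumber` and `next_ballot` are hypothetical names, not taken from the slides.

```python
from typing import NamedTuple


class BallotNumber(NamedTuple):
    counter: int     # chosen locally by the initiating priest
    priest_id: int   # unique per priest, so no two priests share a number


def next_ballot(priest_id: int, highest_seen: BallotNumber) -> BallotNumber:
    """Return a ballot number owned by priest_id and larger than highest_seen."""
    return BallotNumber(highest_seen.counter + 1, priest_id)
```

Tuples compare field by field, so two priests that start from the same `highest_seen` still obtain distinct, totally ordered numbers, which is what B1 requires.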

Stating The Problem in Terms of State Machines • a state machine consists of • state variables (encoded in states) • commands (which transform the states) • each command is implemented by a deterministic program and its execution is atomic with respect to other commands • clock I/O automaton: specific state machine devised by Lynch and Tuttle for modelling, verifying, and analyzing time-based systems

Clock I/O Automata An I/O time automaton A consists of • a set of states: states(A) • a nonempty set start(A) of start states • a set of actions partitioned into input, output, internal, and time-passage actions and specified in the signature of A • a transition relation steps(A), a subset of states(A) × acts(A) × states(A). No input action can be blocked: for every state s and every input action a, there is a state s’ such that (s,a,s’) is a step of A. A time-passage action (t) models the passage of real time t. A special real variable Clock is included in each state to model the local clock of the process. Clock need not track real time.

The Synod Algorithm In Terms Of Clock GTA • The Distributed Setting • relation with the Paxos problem: • priest/process • law book/state • passing a decree/executing a command • complete network of n processes with unique identifiers in a totally ordered set known by all processes • clock GT automata are used to model both processes and channels; each automaton has a local clock and the local clock for a channel is used to detect timing failures • The Algorithm • idea: propose values until one of them is accepted by a majority of processes • any process may propose a value by initiating a round for that value; it becomes the leader of that round • the leader and the other processes are agents

(1) The leader sends a Collect message to all agents • (2) If an agent receives a Collect message and it is already committed to a round with a bigger round number, it sends an OldRound message; otherwise, it sends a Last message with its information about rounds previously conducted. • (3) If the leader receives more than n/2 Last messages, it initiates a new round and sends to all agents a Begin message. • (4) If an agent receives the Begin message and is committed to a round with a bigger round number, it sends an OldRound message; otherwise, it accepts the value proposed and responds with an Accept message. • (5) If the leader receives more than n/2 Accept messages, then the round is successful and its own output value is the value of the round. • (6) The leader broadcasts the reached decision. • Notes: • the set of agents from which Last (Accept) messages are received is the info-quorum (accepting-quorum)
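The agent side of this exchange, including the OldRound reply, can be sketched as follows. Everything here is illustrative: messages are modelled as plain tuples, the names `Agent`, `lead_round`, `commit`, `last_round` and `last_value` are assumptions, and round numbers are taken to be positive integers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    commit: int = 0                   # highest round this agent is committed to
    last_round: int = 0               # last round in which a value was accepted
    last_value: Optional[str] = None  # value accepted in last_round

    def on_collect(self, r: int):
        # Step 2: reject rounds older than the one already committed to.
        if self.commit > r:
            return ("OldRound", r, self.commit)
        self.commit = r
        return ("Last", r, self.last_round, self.last_value)

    def on_begin(self, r: int, value: str):
        # Step 4: same check, otherwise accept the proposed value.
        if self.commit > r:
            return ("OldRound", r, self.commit)
        self.commit = r
        self.last_round, self.last_value = r, value
        return ("Accept", r)


def lead_round(r: int, initial: str, agents: List[Agent], n: int) -> Optional[str]:
    """Leader side of one round over the reachable agents; n is the system size."""
    lasts = [m for m in (a.on_collect(r) for a in agents) if m[0] == "Last"]
    if len(lasts) <= n / 2:                      # step 3: info-quorum needed
        return None
    newest = max(lasts, key=lambda m: m[2])      # reply with the latest round
    value = newest[3] if newest[3] is not None else initial
    accepts = [m for m in (a.on_begin(r, value) for a in agents) if m[0] == "Accept"]
    return value if len(accepts) > n / 2 else None   # step 5: accepting-quorum
```

A leader that starts a round with a smaller number than one already in progress receives only OldRound replies and gives up, while a later round adopts the value accepted earlier instead of its own initial value.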

Implementation (1)

BPLEADER(i) (clock GTA running the leader at process i) • Input: NewRound(i), Leader(i), NotLeader(i), Receive(m)(j,i) for m = Last, Accept, Success, OldRound • Output: Send(m)(j,i) for m = Collect, Begin; BeginCast(i); RndSuccess(v)(i) • Internal: Collect(i), GatherLast(i), ... • Time-passage: ...

Implementation (2)

BPAGENT(i) (clock GTA running an agent at process i) • Input: Receive(m)(j,i) for m = Collect, Begin • Output: Send(m)(j,i) for m = Last, Accept, OldRound • Internal: LastAccept(i), Accept(i), ... • Time-passage: ...

Correctness Proof • execution fragment: sequence of states followed by actions in steps according to the automaton • problem specification: set of allowable behaviors (behavior = sequence of external actions from an execution fragment) • an automaton A solves the problem if each of its behaviors is contained in the problem specification • safety properties: must hold in every state of a computation • liveness properties: specify events that must eventually be performed

Safety/Liveness Properties • safety property: in any execution of the system agreement and validity are guaranteed • liveness property: under some conditions, termination is guaranteed • an execution fragment is nice if • no loss or duplication takes place • at each time-passage action the local clock is incremented by the elapsed real time • every process is either stopped or alive • a majority of processes are alive • Theorem: If a nice execution fragment starts in a reachable state, has a unique leader, and lasts for more than 16l+8nl+9d time units, then by time 16l+8nl+9d the leader has reached a decision. Note: proofs are based on invariants.

Other Results On Time Performance • If a nice execution fragment starts in a reachable state and lasts more than 24l+10nl+13d, then: • the leader decides by time 21l+8nl+11d and at most 8n messages are sent • all alive processes decide by time 24l+10nl+13d and at most 2n additional messages are sent
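To get a feel for these bounds, they can be evaluated for concrete parameters. The helper below is only arithmetic on the formulas quoted above, reading a term such as 8nl as 8·n·l; the function name and the sample values (l = 1, d = 10, n = 5, in arbitrary time units) are assumptions for illustration.

```python
def time_bound(a: int, b: int, c: int, l: int, n: int, d: int) -> int:
    """Evaluate a bound of the form a*l + b*n*l + c*d."""
    return a * l + b * n * l + c * d


l, d, n = 1, 10, 5  # hypothetical step bound, delivery bound, number of processes

unique_leader_bound = time_bound(16, 8, 9, l, n, d)    # 16l + 8nl + 9d
leader_bound = time_bound(21, 8, 11, l, n, d)          # 21l + 8nl + 11d
all_alive_bound = time_bound(24, 10, 13, l, n, d)      # 24l + 10nl + 13d
```

For these sample values the three bounds evaluate to 146, 171 and 204 time units; the terms in d dominate as soon as message delay is much larger than process step time.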

Generalization Of The Synod Protocol: MULTIPAXOS • consensus has to be reached on a sequence of values • for each value we run BASICPAXOS • the automata used for each instance of the algorithm are like the automata in BASICPAXOS, except that an additional parameter (the index of the proposed value) is present in each action • concurrency: several leaders may concurrently initiate rounds and these rounds are carried out concurrently • several leaders initiating values concurrently is an important difference between the Paxos algorithm and the three-phase commit protocol

Data Replication • problem: providing distributed and concurrent access to data objects • simple implementation: maintain the object at a single process accessed by multiple clients • some disadvantages: • does not scale well as the number of clients increases • not fault-tolerant • other solution: data replication • servers are replicated: each server runs the same state machine • clients make requests which are redirected to specific servers

Replica Coordination (1) • Requirements • requests should be processed by state machines one at a time • the order of processing should be consistent with potential causality • outputs: determined only by the sequence of requests, independent of time or any other activity in the system • Replica coordination • agreement: every nonfaulty state machine replica receives every request • order: every nonfaulty state machine replica processes the requests it receives in the same relative order • issues to be considered: fault-tolerance and reconfiguration • MULTIPAXOS: possible solution to the problem

Replica Coordination (2) MULTIPAXOS For Replica Coordination • each process in the system maintains a copy of the data object • a client requests an update operation • a process proposes the operation in an instance of MULTIPAXOS • after some time, the update operation is the output value of the instance of MULTIPAXOS • the leader of the round updates its local copy; because of correctness, all the alive processes update their copies, too • a report to the client is given • a client requests a read operation • the request is immediately satisfied based on the local copy • Note: • a majority is needed to achieve consistency -> majority voting • a unique leader is required to achieve termination -> primary copy replication
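The update path can be illustrated with a replica that applies decided operations strictly in instance order and answers reads from its local copy. This sketch assumes the consensus layer is external and simply calls `decide(index, delta)` once instance `index` has produced its output; the integer-valued object and the names `Replica`, `decide` and `read` are hypothetical.

```python
class Replica:
    """Local copy of a replicated integer object (illustrative only)."""

    def __init__(self) -> None:
        self.value = 0        # local copy of the data object
        self.decided = {}     # instance index -> decided update (an increment)
        self.next_index = 0   # next instance whose output may be applied

    def decide(self, index: int, delta: int) -> None:
        # Record the output of one consensus instance, then apply every
        # update that is now contiguous with what was already applied.
        self.decided[index] = delta
        while self.next_index in self.decided:
            self.value += self.decided[self.next_index]
            self.next_index += 1

    def read(self) -> int:
        # Reads are served immediately from the local copy.
        return self.value
```

If the output of instance 1 arrives before that of instance 0, the replica holds it back until instance 0 is decided, so every replica applies the same sequence of updates.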

Replica Coordination (3) Order and Stability • unique identifiers for requests (total order) • implementation: a replica next processes the stable request with the smallest unique identifier (stable request: no request from a correct client and with a lower uid can be subsequently delivered to that state machine) • using logical clocks to ensure order and stability: • each process has a local counter • local counter is incremented after each event at that process • each message sent is timestamped with the local clock • upon receipt of a message, the local clock of the receiver becomes 1+maximum of timestamp and local clock • a uid for each event is given by appending a fixed-length bit string (encoding the process id) to the counter value of the process where the event takes place • using real clocks to ensure order and stability • assumptions: • clocks are synchronized to within less than the minimum message delivery time • a request r will be received by every correct process no later than uid(r)+Δ • stability test: a request r is stable at a state machine if the local clock reads time t and t>uid(r)+Δ
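The logical-clock rules and the real-clock stability test translate almost directly into code. The sketch below is a minimal illustration: the class name, the 8-bit width reserved for the process id, and the helper `is_stable` are assumptions rather than anything prescribed by the slides.

```python
class LogicalClock:
    """Per-process logical clock that also produces unique identifiers."""

    def __init__(self, pid: int, pid_bits: int = 8) -> None:
        if not 0 <= pid < (1 << pid_bits):
            raise ValueError("pid does not fit in pid_bits")
        self.pid = pid
        self.pid_bits = pid_bits
        self.counter = 0

    def uid(self) -> int:
        # Append the fixed-length process id to the current counter value.
        return (self.counter << self.pid_bits) | self.pid

    def local_event(self) -> int:
        self.counter += 1
        return self.uid()

    def send(self) -> int:
        # Sending is an event; the message carries the new counter value.
        self.counter += 1
        return self.counter

    def receive(self, timestamp: int) -> int:
        self.counter = 1 + max(timestamp, self.counter)
        return self.uid()


def is_stable(local_time: float, request_uid_time: float, delta: float) -> bool:
    """Real-clock stability test: t > uid(r) + delta."""
    return local_time > request_uid_time + delta
```

Because the counter occupies the high-order bits, a receive event always gets a larger uid than the matching send, and events with equal counters at different processes are ordered by process id.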

Replica Coordination (4) Reconfiguration • at time t there are P(t) processes, F(t) faulty • necessary condition for correct output: • P(t)>2F(t) if Byzantine failures are possible (a majority must be non-faulty) • P(t)>F(t) if only fail-stop failures • system described by 3 sets: clients (C), state machines (S), and output devices (O); information about them stored in state variables and changed by commands • C and O make periodic queries -> better to share processors • messages sent by S always contain information about future reconfiguration -> permanent communication S<->C and S<->O • requests to change a configuration of the system are made by a failure/recovery detector mechanism

Replica Coordination (6) Integrating A Repaired Object • goal: integrate element e at request r • notation: e[r] is the state a non-faulty system element e should be in after processing all the requests up to r • if processors are fail-stop and logical clocks are implemented, then the cooperation of only one state machine replica is needed (if the sm has not failed, then it is correct, and because of consensus among replicas, its information on the system is correct and complete with respect to other sm) -> the sm used should have access to enough information • implementation: e[r] is sent to e before the output produced by processing any request with uid larger than uid(r) • e in O: e[r] usually is device-specific setup information • can be stored in state variables of sm • e in C: e[r] usually based on sensor values read • use information from C to sm

Replica Coordination (7) Integrating A Repaired State Machine • try to use the algorithm: sm sends to e the values of all its state variables before the output produced by processing any request with uid larger than uid(r) ... problem: some client request might be received by sm after sending e[r], but delivered to e before its repair • solution: sm must relay to e requests received from clients • how long: as soon as e has received a request directly from a client c, requests from the same c with larger uid need not be relayed to e • so, e should inform sm of the uid of requests received directly from c • algorithm: (1) sm sends e the values of its state variables and copies of pending requests (2) sm sends to e every subsequent request r received from client c s.t. uid(r)<uid(rc) (rc is the first request e has directly received from c, after e restarted)
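Step (2) of the algorithm comes down to a per-client test at sm. The following sketch shows only that bookkeeping; the class and method names are hypothetical, and it assumes e reports uid(rc) to sm once per client.

```python
from typing import Dict


class RelayState:
    """State kept by a working replica sm while integrating repaired replica e."""

    def __init__(self) -> None:
        # client id -> uid(rc), the first request e received directly from it
        self.first_direct: Dict[str, int] = {}

    def e_reports_direct(self, client: str, uid: int) -> None:
        # Only the first report per client matters.
        self.first_direct.setdefault(client, uid)

    def should_relay(self, client: str, uid: int) -> bool:
        # Relay while e has not heard from the client directly, or when the
        # request is older than the first one e received directly.
        rc = self.first_direct.get(client)
        return rc is None or uid < rc
```

Until e reports a direct request from client c, sm relays everything from c; afterwards only requests with uid smaller than uid(rc) are still relayed.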
