Principles of Reliable Distributed Systems Lecture 8: Paxos

Principles of Reliable Distributed Systems, Lecture 8: Paxos • Spring 2008 • Prof. Idit Keidar

Material • Paxos Made Simple, Leslie Lamport, ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 18-25.

Issues in the Real World I/III • Problem: Sometimes messages take longer than expected • Solution 1: Use longer timeouts • Slow • Solution 2: Assume asynchrony • Impossible - FLP • Solution 3: Assume eventual synchrony or unreliable failure detectors • See last week – MR Algorithm

Reminder: MR in “Normal” (Failure-Free Suspicion-Free) Runs • [Figure: round-1 message flow among processes 1, 2, ..., n. Labels in order: (1, v1); all have est = v1; (1, v1); (decide, v1); all decide v1]

On MR’s Performance • The algorithm can take unbounded time • What if no failures occur? • Is this inevitable? • Can we say more than “decision is reached eventually”?

Performance Metric • Number of communication steps in well-behaved runs • Well-behaved: • No failures • Stable (synchronous) from the beginning • With failure detector: no false suspicions • Motivation: common case

MR’s Running Time in Well-Behaved Runs • In round 1: • Coord is correct, not suspected by any process • All processes decide at the end of phase two • Decision in two communication steps • Halting (stopping) takes three steps • How much in synchronous model? • 2 Rounds for decision in Uniform Consensus • No performance penalty for indulgence!

Back to Last Week’s Example • Example network: • 99% of packets arrive within 10 µsec • Upper bound of 1000 µsec on message latency • Now we can choose a timeout of 10 µsec, without violating safety! • Most of the time, the algorithm will be just as fast as a synchronous uniform consensus algorithm • We did pay a price in resilience, though

Issues in the Real World II/III • Problem: Sometimes messages are lost • Solution 1: Use retransmissions • In case of transient partitions, a huge backlog can build up – catching up may take forever • More congestion, long message delays for extensive periods • Solution 2: Allow message loss • Impossible - 2 Generals • Solution 3: Assume eventually reliable links • That’s what we’ll do today

Issues in the Real World III/III • Problem: Processes may crash and later recover (aka crash-recovery model) • Solution 1: Store information on stable storage (disk) and retrieve it upon recovery • What happens to messages arriving when they’re down? • See previous slide

MR and Unreliable Links • From MR Algorithm Phase II: wait for (r, est) from n-t processes • Transient message loss violates liveness • What if we move to the next round in case we can’t get n-t responses for too long? • Notice the next line in MR: if any non-⊥ value e received then val ← e

What If MR Didn’t Wait … • [Figure: message flow among processes 1, 2, ..., n when a process moves on without waiting. Labels: (1, v1); decide v1; (1, ⊥); est = ⊥; no waiting; no change of val2; (2, v2); will decide v2]

What Do We Want? • Do not get stuck in a round (like MR does) • Move on upon timeout • Move on upon hearing that others moved on • But, a new leader before proposing a decision value must learn any possibly decided value (must check with a majority)

Paxos: Main Principles • Use “leader election” module • If you think you’re leader, you can start a new “ballot” • Paxos name for a round • Always join the newest ballot you hear about • Leave old ballots in the middle if you need to • Two phases: • First learn outcomes of previous ballots from a majority • Then propose a new value, and get a majority to endorse it

Leader Election Failure Detector • Ω – Leader • Outputs one trusted process • From some point, all correct processes trust the same correct process • Can easily implement ◊S • Is the weakest for consensus [Chandra, Hadzilacos, Toueg 96]

Ω Implementations • Easiest: use ◊P implementation • In eventual synchrony model • Output lowest id non-suspected process • Ω is implementable also in some situations where ◊P isn’t • Optimizations possible • Choose “best connected”, strongest, etc.
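
A minimal sketch of the "output lowest id non-suspected process" rule, assuming some ◊P-style module already supplies the set of suspected processes. The function name `omega_leader` and its parameters are illustrative, not part of the lecture.

```python
from typing import Iterable, Set


def omega_leader(process_ids: Iterable[int], suspected: Set[int]) -> int:
    """Return the lowest-id process that is not currently suspected.

    `suspected` is assumed to come from a failure detector such as an
    eventually perfect one; this function only applies the selection rule.
    """
    trusted = [p for p in process_ids if p not in suspected]
    if not trusted:
        raise ValueError("every process is suspected")
    return min(trusted)


# Processes 1 and 3 are suspected, so process 2 is trusted as leader.
print(omega_leader([1, 2, 3, 4], suspected={1, 3}))
```

Once the suspicion sets at all correct processes stop changing and agree, every correct process computes the same leader, which is the property the slide asks of the leader election module.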

Paxos: The Practicality • Overcomes message loss without retransmitting entire message history • Tolerates crash and recovery • Does not rotate through dead coordinators • Used in replicated file systems • Frangipani – DEC, early 90s • Nowadays Microsoft

The Part-Time Parliament [Lamport 88,98,01] • Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament’s protocol provides a new way of implementing the state-machine approach to the design of distributed systems.

Annotation of TOCS 98 Paper • This submission was recently discovered behind a filing cabinet in the TOCS editorial office. • …the author is currently doing field work in the Greek isles and cannot be reached … • The author appears to be an archeologist with only a passing interest in computer science. • This is unfortunate; even though the obscure ancient Paxon civilization he describes is of little interest to most computer scientists, its legislative system is an excellent model for how to implement a distributed computer system in an asynchronous environment.

The Setting • The data (ledger) is replicated at n processes (legislators) • Operations (decrees) should be invoked (recorded) at each replica (ledger) in the same order • Processes (legislators) can fail (leave the parliament) • At least a majority of processes (legislators) must be up (present in the parliament) in order to make progress (pass decrees) • Why majority?
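
One way to approach the "Why majority?" question is to check by brute force that any two majorities of n processes share at least one member, so information accepted by one majority is always seen by the next. The sketch below is an illustration only; the helper names `majorities` and `all_majorities_intersect` are made up for this example.

```python
from itertools import combinations


def majorities(n):
    """All subsets of processes {1..n} containing more than n/2 members."""
    procs = range(1, n + 1)
    need = n // 2 + 1
    return [set(c) for k in range(need, n + 1) for c in combinations(procs, k)]


def all_majorities_intersect(n):
    """True if every pair of majorities has a common process."""
    quorums = majorities(n)
    return all(a & b for a in quorums for b in quorums)


print(all_majorities_intersect(5))

# Two halves of an even-sized system need not intersect, so "half" is not enough.
print({1, 2} & {3, 4})
```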

Eventually Reliable Links • There is a time after which every message sent by a correct process to a correct process eventually arrives • Old messages are not retransmitted • Usual failure-detector-based algorithms (like MR) do not work • Homework question

The Paxos (Paxos) Atomic Broadcast Algorithm • Leader based: each process has an estimate of who is the current leader • To order an operation, a process sends it to its current leader • The leader sequences the operation and launches a Consensus algorithm (Synod) to fix the agreement

The (Synod) Consensus Algorithm • Solves non-terminating consensus in asynchronous system • Or consensus in a partial synchrony system • Or consensus using an Ω failure detector • Overcomes transient crashes & recoveries and message loss • Can be modeled as just message loss

The Consensus Algorithm Structure • Two phases • Leader contacts a majority in each phase • There may be multiple concurrent leaders • Ballots distinguish among values proposed by different leaders • Unique, locally monotonically increasing • Correspond to rounds of ◊S-based algorithms [MR] • Processes respond only to leader with highest ballot seen so far

Ballot Numbers • Pairs ⟨num, process id⟩ • ⟨n1, p1⟩ > ⟨n2, p2⟩ • If n1 > n2 • Or n1 = n2 and p1 > p2 • Leader p chooses unique, locally monotonically increasing ballot number • If latest known ballot is ⟨n, q⟩ then p chooses ⟨n+1, p⟩
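
The ordering on ballot numbers is exactly lexicographic order on pairs, which Python tuples already provide. A small sketch, assuming ballots are represented as `(num, process_id)` tuples; the name `next_ballot` is hypothetical.

```python
from typing import Tuple

Ballot = Tuple[int, int]  # (num, process id)


def next_ballot(latest_known: Ballot, my_id: int) -> Ballot:
    """Choose a ballot larger than the latest known one, tagged with my id."""
    num, _ = latest_known
    return (num + 1, my_id)


# Tuple comparison is lexicographic: num first, then process id.
print((2, 1) > (1, 3))   # higher num wins
print((1, 3) > (1, 2))   # equal num, higher process id wins
print(next_ballot((4, 7), my_id=2))
```

Because the process id is part of the pair, two leaders can never choose the same ballot number, even if they pick the same `num`.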

The Two Phases of Paxos • Phase 1: prepare • If you trust yourself by Ω (believe you are the leader) • Choose new unique ballot number • Learn outcome of all smaller ballots from majority • Phase 2: accept • Leader proposes a value with its ballot number • Leader gets majority to accept its proposal • A value accepted by a majority can be decided

Paxos - Variables • BallotNum_i, initially ⟨0,0⟩ • Latest ballot p_i took part in (phase 1) • AcceptNum_i, initially ⟨0,0⟩ • Latest ballot p_i accepted a value in (phase 2) • AcceptVal_i, initially ⊥ • Latest accepted value (phase 2)

Phase I: Prepare - Leader • Periodically, until decision is reached do: if leader (by Ω) then BallotNum ← ⟨BallotNum.num+1, myId⟩; send (“prepare”, BallotNum) to all • Goal: contact other processes, ask them to join this ballot, and get information about possible past decisions

Phase I: Prepare - Cohort • Upon receive (“prepare”, bal) from i: if bal ≥ BallotNum then BallotNum ← bal; send (“ack”, bal, AcceptNum, AcceptVal) to i • The test on bal: this is a higher ballot than my current, I better join it • The “ack”: this is a promise not to accept ballots smaller than bal in the future • AcceptNum, AcceptVal: tell the leader about my latest accepted value and what ballot it was accepted in
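
The cohort's prepare handler can be sketched as follows. This is an illustration under assumptions: ballots are `(num, process_id)` tuples, `None` stands in for ⊥, the join test is taken to be bal ≥ BallotNum, and the handler returns the message to send instead of using a real network. The class name `Cohort` is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]  # (num, process id)


@dataclass
class Cohort:
    ballot_num: Ballot = (0, 0)        # latest ballot joined (phase 1)
    accept_num: Ballot = (0, 0)        # latest ballot in which a value was accepted
    accept_val: Optional[Any] = None   # None plays the role of the "no value" symbol

    def on_prepare(self, bal: Ballot):
        """Join ballot `bal` if it is not older than the current one."""
        if bal >= self.ballot_num:
            self.ballot_num = bal
            # The ack reports the latest accepted value and its ballot.
            return ("ack", bal, self.accept_num, self.accept_val)
        return None  # stale prepare: ignore it


c = Cohort()
print(c.on_prepare((1, 1)))   # joins ballot (1, 1) and acks
print(c.on_prepare((0, 5)))   # older ballot, ignored
```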

Phase II: Accept - Leader • Upon receive (“ack”, BallotNum, b, val) from n-t: if all vals = ⊥ then myVal = initial value else myVal = received val with highest b; send (“accept”, BallotNum, myVal) to all /* proposal */ • The value accepted in the highest ballot might have been decided, I better propose this value
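
The leader's value-selection rule can be sketched as a pure function over the acks it collected. Assumptions for illustration: each ack is reduced to an `(accept_num, accept_val)` pair, `None` stands in for ⊥, and the name `choose_value` is made up.

```python
def choose_value(acks, initial_value):
    """Pick the value to propose from n-t acks.

    acks: list of (accept_num, accept_val) pairs, where accept_num is a
    (num, process_id) ballot and accept_val is None if nothing was accepted.
    """
    accepted = [(b, v) for (b, v) in acks if v is not None]
    if not accepted:
        # No process in this majority accepted anything: free to propose our own value.
        return initial_value
    # Otherwise adopt the value accepted in the highest ballot.
    highest_ballot, value = max(accepted, key=lambda bv: bv[0])
    return value


print(choose_value([((0, 0), None), ((0, 0), None)], "mine"))
print(choose_value([((1, 2), "a"), ((2, 1), "b"), ((0, 0), None)], "mine"))
```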

Phase II: Accept - Cohort • Upon receive (“accept”, b, v): if b ≥ BallotNum then AcceptNum ← b; AcceptVal ← v /* accept proposal */; send (“accept”, b, v) to all (first time only) • The test b ≥ BallotNum: this is not from an old ballot
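
A sketch of the cohort's accept handler, under the same illustrative conventions as before: ballots are tuples, `None` stands in for ⊥, and the handler returns the message to broadcast (or `None`) rather than sending it. The names `AcceptState` and `on_accept` are hypothetical, and "first time only" is tracked per ballot.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Set, Tuple

Ballot = Tuple[int, int]  # (num, process id)


@dataclass
class AcceptState:
    ballot_num: Ballot = (0, 0)
    accept_num: Ballot = (0, 0)
    accept_val: Optional[Any] = None
    echoed: Set[Ballot] = field(default_factory=set)  # ballots already re-broadcast


def on_accept(state: AcceptState, b: Ballot, v: Any):
    """Accept proposal (b, v) unless it comes from an old ballot."""
    if b < state.ballot_num:
        return None                      # promised not to accept smaller ballots
    state.accept_num, state.accept_val = b, v
    if b in state.echoed:
        return None                      # already echoed this proposal once
    state.echoed.add(b)
    return ("accept", b, v)              # to be sent to all


s = AcceptState(ballot_num=(2, 1))
print(on_accept(s, (1, 3), "old"))   # rejected: ballot (1, 3) is older than (2, 1)
print(on_accept(s, (2, 1), "v"))     # accepted and echoed
```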

Paxos – Deciding • Upon receive (“accept”, b, v) from n-t: decide v; periodically send (“decide”, v) to all • Upon receive (“decide”, v): decide v • Why don’t we ever “return”?
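
The decision rule amounts to counting distinct senders of the same (“accept”, b, v) message until n-t is reached. A minimal sketch, assuming values are hashable and that duplicate messages from one sender must not be counted twice; the class name `Decider` is illustrative.

```python
from collections import defaultdict


class Decider:
    def __init__(self, n, t):
        self.threshold = n - t
        self.senders = defaultdict(set)   # (ballot, value) -> ids that sent "accept"
        self.decision = None

    def on_accept(self, sender, b, v):
        """Record an ("accept", b, v) from `sender`; decide once n-t are seen."""
        self.senders[(b, v)].add(sender)
        if self.decision is None and len(self.senders[(b, v)]) >= self.threshold:
            self.decision = v
        return self.decision

    def on_decide(self, v):
        """Adopt a decision announced by another process."""
        if self.decision is None:
            self.decision = v
        return self.decision


d = Decider(n=3, t=1)
print(d.on_accept(1, (1, 1), "v1"))   # one sender so far: no decision
print(d.on_accept(1, (1, 1), "v1"))   # duplicate from the same sender: still none
print(d.on_accept(2, (1, 1), "v1"))   # second distinct sender reaches n-t = 2
```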

In Failure-Free Synchronous Runs • [Figure: message flow among processes 1, 2, ..., n. Labels in order: (“prepare”, ⟨1,1⟩); (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥); (“accept”, ⟨1,1⟩, v1); (“accept”, ⟨1,1⟩, v1); decide v1] • Simple Ω implementation always trusts process 1

Correctness: Agreement • Follows from Lemma 1: If a proposal (“accept”, b, v) is sent by a majority, then for every sent proposal (“accept”, b’, v’) with b’ > b, it holds that v’ = v.

Proving Agreement Using Lemma 1 • Let v be a decided value. The first process that decides v receives n-t accept messages for v with some ballot b, i.e., (“accept”, b, v) is sent by a majority. • No other value is sent with an “accept” message with the same b. Why? • Let (“accept”, b1, v1) be the proposal with the lowest ballot number (b1) sent by n-t • By Lemma 1, v1 is the only possible decision value

To Prove Lemma 1 • Use Lemma 2 (invariant): If a proposal (“accept”, b, v) is sent, then there is a set S consisting of a majority such that either • no p ∈ S accepts a proposal ranked less than b (all vals = ⊥) or • v is the value of the highest-ranked proposal among proposals ranked less than b accepted by processes in S (myVal = received val with highest b).

What Makes Lemma 2 Hold • A process accepts a proposal numbered b only if it has not responded to a prepare request having a number greater than b • The “ack” response to “prepare” is a promise not to accept lower-ballot proposals in the future • The leader uses “ack” messages from a majority in choosing the proposed value

Termination • Assume no loss for a moment • Once there is one correct leader – • It eventually chooses the highest ballot number • No other process becomes a leader with a higher ballot • All correct processes “ack” its prepare message and “accept” its accept message and decide

What About Message Loss? • Does not block in case of a lost message • Phase 1 can start with new rank even if previous attempts never ended • Conditional liveness: If n-t correct processes including the leader can communicate with each other then they eventually decide • Holds with eventually reliable links

Performance? • [Figure: message flow among processes 1, 2, ..., n. Labels in order: (“prepare”, ⟨1,1⟩); (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥); (“accept”, ⟨1,1⟩, v1); (“accept”, ⟨1,1⟩, v1)] • Why is this phase needed? • 4 communication steps in well-behaved runs • Compared to 2 for MR

Optimization • Allow process 1 (only!) to skip Phase 1 • Initialize BallotNum to ⟨1,1⟩ • Propose its own initial value • 2 steps in failure-free synchronous runs • Like MR • 2 steps for repeated invocations with the same leader • Common case

Atomic Broadcast by Running A Sequence of Consensus Instances

The Setting • Data is replicated at n servers • Operations are initiated by clients • Operations need to be performed at all correct servers in the same order • State-machine replication

Client-Server Interaction • Leader-based: each process (client/server) has an estimate of who is the current leader • A client sends a request to its current leader • The leader launches the Paxos consensus algorithm to agree upon the order of the request • The leader sends the response to the client

Failure-Free Message Flow • [Figure: message flow between client C and servers S1, S2, ..., Sn. Labels: request; Phase 1: (“prepare”), (“ack”); Phase 2: (“accept”); response]

Observation • In Phase 1, no consensus values are sent: • Leader chooses largest unique ballot number • Gets a majority to “vote” for this ballot number • Learns the outcome of all smaller ballots from this majority • In Phase 2, leader proposes either its own initial value or latest value it learned in Phase 1

Message Flow: Take 2 • [Figure: message flow between client C and servers S1, S2, ..., Sn. Labels: Phase 1: (“prepare”), (“ack”); request; Phase 2: (“accept”); response]

Optimization • Run Phase 1 only when the leader changes • Phase 1 is called “view change” or “recovery mode” • Phase 2 is the “normal mode” • Each message includes BallotNum (from the last Phase 1) and ReqNum • e.g., ReqNum = 7 when we’re trying to agree what the 7th operation to invoke on the state machine should be • Respond only to messages with the “right” BallotNum

Paxos Atomic Broadcast: Normal Mode • Upon receive (“request”, v) from client: if (I am not the leader) then forward to leader; else /* propose v as request number n */ ReqNum ← ReqNum + 1; send (“accept”, BallotNum, ReqNum, v) to all • Upon receive (“accept”, b, n, v) with b = BallotNum: /* accept proposal for request number n */ AcceptNum[n] ← b; AcceptVal[n] ← v; send (“accept”, b, n, v) to all (first time only)
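
Normal mode can be sketched as two handlers on a replica that already knows the leader and the BallotNum from the last Phase 1. This is an illustration only: the class `Replica`, the returned message tuples, and the `("forward", ...)` marker are assumptions, and nothing is actually sent over a network.

```python
class Replica:
    def __init__(self, my_id, leader_id, ballot_num):
        self.my_id = my_id
        self.leader_id = leader_id
        self.ballot_num = ballot_num   # fixed by the last Phase 1
        self.req_num = 0               # last request number sequenced by this leader
        self.accept_num = {}           # request number -> ballot
        self.accept_val = {}           # request number -> value
        self._echoed = set()           # (ballot, request number) already re-broadcast

    def on_request(self, v):
        """Client request: forward it, or sequence it if I am the leader."""
        if self.my_id != self.leader_id:
            return ("forward", self.leader_id, v)
        self.req_num += 1
        return ("accept", self.ballot_num, self.req_num, v)

    def on_accept(self, b, n, v):
        """Accept the proposal for request number n if the ballot matches."""
        if b != self.ballot_num:
            return None                # not the "right" BallotNum
        self.accept_num[n] = b
        self.accept_val[n] = v
        if (b, n) in self._echoed:
            return None                # first time only
        self._echoed.add((b, n))
        return ("accept", b, n, v)


leader = Replica(my_id=1, leader_id=1, ballot_num=(1, 1))
follower = Replica(my_id=2, leader_id=1, ballot_num=(1, 1))
msg = leader.on_request("op")
print(msg)
print(follower.on_accept(*msg[1:]))
```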

Recovery Mode • The new leader must learn the outcome of all the pending requests that have smaller BallotNums • The “ack” messages include AcceptNums and AcceptVals of all pending requests • For all pending requests, the leader sends “accept” messages • What if there are holes? • e.g., leader learns of request number 13 and not of 12 • fill in the gaps with dummy “do nothing” requests
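
The hole-filling step can be sketched as merging the per-request information from the acks and padding missing request numbers with a dummy request. Assumptions for illustration: each ack is a dict from request number to `(accept_num, accept_val)`, `first_pending` is the lowest undecided request number, and the names `recover_log` and `NOOP` are made up.

```python
NOOP = "do nothing"  # hypothetical placeholder for a dummy request


def recover_log(acks, first_pending):
    """Return {request number: value} the new leader should re-propose.

    acks: list of dicts mapping request number -> (accept_num, accept_val),
    one dict per process in the majority that answered the prepare.
    """
    best = {}
    for ack in acks:
        for n, (b, v) in ack.items():
            # Keep the value accepted in the highest ballot for each request.
            if n not in best or b > best[n][0]:
                best[n] = (b, v)
    if not best:
        return {}
    last = max(best)
    return {n: (best[n][1] if n in best else NOOP)
            for n in range(first_pending, last + 1)}


acks = [
    {11: ((1, 1), "a"), 13: ((1, 1), "c")},
    {11: ((2, 2), "a2")},
]
# Request 12 was never learned, so it is filled with the dummy request.
print(recover_log(acks, first_pending=11))
```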
