## LINF 2345 Leader election and consensus with crash and Byzantine failures


Seif Haridi, Peter Van Roy

**Overview**
- Synchronous systems with crash failures
- Leader election in rings
- Fault-tolerant consensus

**Leader Election in Rings**

**Background: Rings**
- The ring topology is a circular arrangement of nodes, often used as a control topology in distributed computations
- A graph is regular if all nodes have the same degree
- A ring is an undirected, connected, regular graph of degree 2
- Equivalently, G is a ring if there is a one-to-one mapping of V to {0, …, n-1} such that the neighbors of node i are nodes i-1 and i+1 (modulo n)

**The Leader Election Problem**
- A situation where a group of processors must select a leader among them
- Simplifies coordination and helps in achieving fault tolerance
  - e.g. the coordinator in two- and three-phase commit
- Represents a general class of symmetry-breaking problems
  - e.g. deadlock removal

**The Leader Election Problem**
- An algorithm solves the leader election problem if:
  - The terminated states are partitioned into elected and non-elected states
  - Once a processor enters an elected (resp. non-elected) state, its transition function will only move it to another (or the same) elected (resp. non-elected) state
  - In every admissible execution, exactly one processor enters an elected state and all others enter a non-elected state

**The Leader Election Problem: Rings**
- In fact we have already seen an election algorithm, in the previous section on arbitrary network topologies
- For rings:
  - Edges go between pi and pi+1 (addition modulo n), for all i, 0 ≤ i ≤ n-1
  - Processors have a consistent notion of left (clockwise) and right (anticlockwise)
- (Figure: a simple oriented ring on p0, p1, p2, with left/right port labels 1 and 2)
**Anonymous Rings**
- A leader election algorithm for a ring is anonymous if every processor has the same state machine
  - This implies that processors do not have unique identifiers
- An algorithm is uniform if it does not use the value n, the number of processors
- Otherwise the algorithm is nonuniform:
  - For each size n there is a state machine, but it might be different for different sizes n

**Anonymous Rings: Impossibility Results**
- Main result: there is no anonymous leader election algorithm for ring systems
- The result can be stated more comprehensively as: there is no nonuniform anonymous algorithm for leader election in synchronous rings
- An impossibility result for synchronous systems implies the same impossibility result for asynchronous systems. Why?
- An impossibility result for nonuniform algorithms implies the same for uniform algorithms. Why?

**Anonymous Rings: Impossibility Results**
- Why does the synchronous impossibility result carry over to asynchronous systems?
  - Answer: an admissible execution in a synchronous system is also an admissible execution in an asynchronous system
  - Therefore there is always at least one admissible execution of any asynchronous algorithm that does not satisfy the correctness condition of a leader election algorithm
- Why does the nonuniform impossibility result carry over to uniform algorithms?
  - Answer: if there were a uniform algorithm, it could be used as a nonuniform algorithm

**Asynchronous Rings**
- Processors have unique identifiers, which can be arbitrary natural numbers
- For each pi there is a variable idi, initialized to the identifier of pi
- We specify a ring by listing the processors starting from the one with the smallest identifier
- Each processor pi, 0 ≤ i ≤ n-1, is assigned idi
- (Figure: a ring with processor/identifier pairs (p0, 0), (p1, 10), (p2, 5), (p3, 97))
**Asynchronous Rings: An O(n²) Algorithm**
- Each processor sends a message with its id to its left neighbor, and waits for messages from its right neighbor
- When a processor pi receives a message m, it checks the id of m:
  - If m.id > pi.id, pi forwards m to its own left neighbor
  - Otherwise the message is consumed
- A processor pk that receives a message with its own id declares itself the leader, and sends a termination message to its left neighbor
- A processor that receives a termination message forwards it to the left and terminates as a non-leader

**Asynchronous Rings: An O(n²) Algorithm**
- The algorithm never sends more than O(n²) messages
  - O(n²) means c·n² is an upper bound, for some constant c
- The processor with the lowest id may forward n messages, plus one termination message
- There is an admissible execution in which the algorithm sends Θ(n²) messages
  - Θ(n²) means c₁·n² is an upper bound and c₂·n² is a lower bound, for some constants c₁ and c₂

**Asynchronous Rings: An O(n²) Algorithm**
- Example (an execution):
  - The message of the processor with identifier i is sent exactly i+1 times
  - Plus n termination messages
  - The total is ∑ from i=0 to n-1 of (i+1), plus n, i.e. n(n+1)/2 + n = Θ(n²)
- (Figure: a ring with identifiers n-1, n-2, …, 2, 1, 0 in order around the ring)

**Asynchronous Rings: An O(n log n) Algorithm**
- The k-neighborhood of a processor pi is the set of processors up to distance k from pi in the ring (to the left and to the right)
- The algorithm operates in phases, starting at 0:
  - In phase k, a processor tries to be the winner of that phase
  - To be a phase-k winner it must have the largest id in its 2^k-neighborhood
  - Only winners of phase k continue to phase k+1
- At the end only one processor survives, and it is elected as the leader
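Before turning to the details of the phased algorithm, the earlier O(n²) algorithm can be made concrete with a minimal simulation (a sketch; the function name and queue representation are illustrative choices, not part of the slides):

```python
from collections import deque

def elect_leader(ids):
    """Simulate the O(n^2) ring election: each processor sends its id to its
    left neighbor; a message is forwarded only if its id exceeds the
    receiver's id, and a processor that receives its own id is the leader.
    Returns (leader id, number of id messages sent)."""
    n = len(ids)
    # channels[i] holds id messages in transit from processor i
    # to its left neighbor (i+1 mod n); each processor starts by sending its id
    channels = [deque([ids[i]]) for i in range(n)]
    messages = n  # count the n initial sends
    leader = None
    while leader is None:
        for i in range(n):
            if not channels[i]:
                continue
            m = channels[i].popleft()
            j = (i + 1) % n  # the left neighbor receives
            if m == ids[j]:
                leader = ids[j]  # own id came back around the ring
                break
            if m > ids[j]:
                channels[j].append(m)  # forward the larger id
                messages += 1
            # otherwise the message is swallowed
    return leader, messages
```

As expected, the maximum identifier wins; termination messages are not counted in this sketch. Rings with identifiers in decreasing order around the ring realize the Θ(n²) worst case described above.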
**Asynchronous Rings: An O(n log n) Algorithm**
- In phase 0, each processor pi attempts to be a phase-0 winner:
  - pi sends a ⟨probe, idi⟩ message to its 1-neighborhood
  - If the id of a neighbor receiving the probe is greater than idi, the message is swallowed
  - Otherwise the neighbor sends back a reply message
  - If pi receives a reply message from both its neighbors, it becomes a phase-0 winner and continues with phase 1

**Asynchronous Rings: An O(n log n) Algorithm**
- In phase k, each processor pi that is a phase-(k-1) winner sends probe messages to its 2^k-neighborhood
- Each message traverses up to 2^k processors, one by one
- A probe is forwarded by a processor if the processor's id is smaller than the probe's id and it is not the last processor of the neighborhood

**Asynchronous Rings: An O(n log n) Algorithm**
- If the probe is not swallowed, the last processor of the neighborhood sends back a reply
- If pi receives reply messages from both directions, it becomes a phase-k winner and continues with phase k+1
- A processor that receives its own probe declares itself the leader and sends a termination message around the ring

**Asynchronous Rings: An O(n log n) Algorithm**
- (Figure: a ring p1, …, p9; p1, p3, p5, p7 are the phase-0 winners; p1 and p5 are the phase-1 winners)

**Asynchronous Rings: An O(n log n) Algorithm: Messages**
- ⟨probe, id, k, d⟩ and ⟨reply, id, k⟩, where:
  - id: identifier of the originating processor
  - k: integer, the phase number
  - d: integer, a hop counter

**Asynchronous Rings: An O(n log n) Algorithm**
- Initially: asleep = true
- Upon receiving no message:
  - if asleep then
    - asleep := false
    - send ⟨probe, id, 0, 1⟩ to left and right

**Asynchronous Rings: An O(n log n) Algorithm**
- Upon receiving ⟨probe, j, k, d⟩ from left (resp. right):
  - if j = id then terminate as the leader
  - if j > id and d < 2^k then send ⟨probe, j, k, d+1⟩ to right (resp. left)
  - if j > id and d = 2^k then send ⟨reply, j, k⟩ to left (resp. right)
**Asynchronous Rings: An O(n log n) Algorithm**
- Upon receiving ⟨reply, j, k⟩ from left (resp. right):
  - if j ≠ id then send ⟨reply, j, k⟩ to right (resp. left)
  - else, if a ⟨reply, j, k⟩ has already been received from the right (resp. left):
    - send ⟨probe, id, k+1, 1⟩ to left and right

**Fault-Tolerant Consensus**

**Fault-Tolerant Consensus: Overview**
- We study problems that arise when a distributed system is unreliable, i.e. processors may behave incorrectly
- The consensus problem requires processors to agree on a common output based on their (possibly conflicting) inputs
- Types of failures:
  - Crash failure: a processor stops operating
  - Byzantine failure: a processor behaves arbitrarily (also known as malicious failure)

**Fault-Tolerant Consensus: Overview**
- Synchronous systems:
  - To solve consensus with Byzantine failures, fewer than a third of the processors may behave arbitrarily
  - We will show one algorithm in detail, which uses the optimal number of rounds but has exponential message complexity
  - More sophisticated algorithms are possible, for example with polynomial message complexity

**Fault-Tolerant Consensus: Overview**
- Asynchronous message-passing systems:
  - The consensus problem cannot be solved by deterministic algorithms, neither for crash nor for Byzantine failures
  - This is a famous impossibility result, first proved in 1985 by Fischer, Lynch, and Paterson
  - How do we get around this impossibility? We can introduce a synchrony assumption, or we can make the algorithm randomized (probabilistic)
  - Both approaches can be practical, but each has its limitations

**Synchronous Systems with Crash Failures**
- Assumptions:
  - The communication graph is complete, i.e. a clique
  - Communication links are fully reliable
- In a reliable synchronous system, an execution consists of rounds
  - Each round consists of the delivery of all messages pending in the outbuf variables, followed by one computation step by each processor
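The round structure just described can be sketched as a small simulator loop (a hedged illustration; the `outbuf`/`step` interface is my own naming, echoing the slides' outbuf variables):

```python
def run_rounds(processors, num_rounds):
    """Synchronous execution: in each round, first deliver all messages
    pending in the outbuf variables, then let every processor take one
    computation step.

    processors: list of objects with
      .outbuf : list of (destination index, message) pairs
      .step(inbox) : one computation step, given the delivered messages
    """
    for _ in range(num_rounds):
        # 1. Deliver all pending messages
        inboxes = [[] for _ in processors]
        for p in processors:
            for dest, msg in p.outbuf:
                inboxes[dest].append(msg)
            p.outbuf.clear()
        # 2. One computation step per processor
        for i, p in enumerate(processors):
            p.step(inboxes[i])
```

Crash failures would be modeled on top of this loop by delivering only a subset of a failing processor's outbuf in its last round.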
**Synchronous Systems with Crash Failures**
- An f-resilient system is a system in which up to f processors may fail
- Execution in an f-resilient system:
  - There exists a subset F of at most f processors, the faulty processors (possibly different in different executions)
  - Each round contains exactly one computation event for every processor not in F, and at most one computation event for every processor in F

**Synchronous Systems with Crash Failures**
- Execution in an f-resilient system (continued):
  - If a processor in F has no computation event in some round, then it has no computation event in any subsequent round
  - In the last round in which a faulty processor has a computation event, an arbitrary subset of its outgoing messages is delivered

**Synchronous Systems with Crash Failures**
- A clean failure is a situation where either all or none of the processor's messages are delivered in its last step
- Consensus is easy and efficient if all failures are clean
- We have to deal with non-clean failures
  - As we shall see, this is what makes the algorithm expensive

**The Consensus Problem**
- Each pi has a special component xi, called the input, and yi, called the output
- Initially:
  - each xi contains a value from some well-ordered set
  - yi is undefined
- A solution to the consensus problem must satisfy the following conditions: termination, agreement, and validity

**The Consensus Problem**
- Termination: in every admissible execution, yi is eventually assigned a value, for every nonfaulty processor pi
- Agreement: in every execution, if yi and yj are assigned, then yi = yj, for all nonfaulty processors pi and pj
- Validity: in every execution, if yi is assigned v for some value v at a nonfaulty processor pi, then there exists a processor pj such that xj = v
**Simple Algorithm**
- Needs f+1 rounds
- Every processor maintains a set of values it knows to exist in the system
  - Initially this set contains only its own input value
- In later rounds:
  - A processor updates its set by adding new values received from other processors
  - And broadcasts any new additions
- In round f+1, each processor decides on the smallest value in its set

**Simple Algorithm: Consensus in the Presence of Crash Failures**
- Code for processor pi; initially V = {x}
- In round k, 1 ≤ k ≤ f+1:
  - send { v ∈ V : pi has not already sent v } to all processors
  - receive Sj from pj, 0 ≤ j ≤ n-1, j ≠ i
  - add all received values to V
  - if k = f+1 then y := min(V)

**Illustration of the Algorithm (f = 3)**
- The algorithm requires f+1 = 4 rounds, and tolerates f = 3 crash failures
- (Figure: processors p0, …, p4 with input values x, across rounds 1-4)

**Illustration of the Algorithm (f = 3)**
- p2 and p4 survive; the others crash, one at a time
- p2 and p4 end up with the value x
- (Figure: the value x propagating to p2 and p4 across rounds 1-4)

**How the Algorithm Works**
- Why is one round not enough? Hint: non-clean failures!
- In the execution of the previous slides, the value x is sent across only one link instead of all links, because the sending processor fails non-cleanly
- We need enough rounds to cover the possibility of a non-clean failure in each round

**Synchronous Systems with Byzantine Failures**
- We want to reach agreement in spite of malicious processors
- In an execution of an f-resilient Byzantine system, at most f processors are faulty
- In a computation step of a faulty processor, its state and the messages it sends are completely unconstrained
- A faulty processor may also mimic the behavior of a crashed processor
**The Consensus Problem**
- The problem statement is the same as for crash failures:
  - Termination: in every admissible execution, yi is eventually assigned a value, for every nonfaulty processor pi
  - Agreement: in every execution, if yi and yj are assigned, then yi = yj, for all nonfaulty processors pi and pj
  - Validity: in every execution, if yi is assigned v for some value v at a nonfaulty processor pi, then there exists a processor pj such that xj = v

**Lower Bound on the Number of Faulty Processors**
- If a third or more of the processors can be Byzantine, then consensus cannot be reached
- In particular, in a system with three processors of which one is Byzantine, there is no algorithm that solves the consensus problem

**Three-Processor System**
- Assume there is a 3-processor algorithm A that solves the Byzantine agreement problem when one processor is faulty
- Take two copies of A and configure them into a hexagonal system S, with processors 1, 2, 3, 1', 2', 3' around the hexagon

**Three-Processor System**
- The input value for processors 1, 2, and 3 is 0
- The input value for processors 1', 2', and 3' is 1
- (Figure: the hexagon S built from two copies of the triangle A)

**Three-Processor System**
- S is a synchronous system; each processor runs its algorithm from the triangle system A
- Each processor in S knows its neighbors and is unaware of the other nodes
- We expect S to exhibit a well-defined behavior on this input
- Observe, however, that S does not solve the consensus problem
- Consider the resulting execution of S (an infinite synchronous execution)

**Execution from the Point of View of Processors 2 and 3**
- Processors 2 and 3 see processor 1 as faulty; since A is a consensus algorithm, they both decide 0 in the execution of S
- (Figure: the hexagon with processors 2 and 3 highlighted, and 1 playing the role of the faulty processor)
**Execution from the Point of View of Processors 1' and 2'**
- Processors 1' and 2' see processor 3 as faulty; since A is a consensus algorithm, they both decide 1 in the execution of S
- (Figure: the hexagon with processors 1' and 2' highlighted, and 3 playing the role of the faulty processor)

**Execution from the Point of View of Processors 1' and 3**
- Processors 1' and 3 see processor 2 as faulty; since A is a consensus algorithm, they must both decide on the same output value in the execution of S
- This is not possible, since they have already decided differently: 3 decided 0 and 1' decided 1
- A contradiction! Therefore A does not exist

**Consensus Algorithm 1**
- Takes exactly f+1 rounds
- Requires n ≥ 3f+1
- The algorithm has two stages:
  - First, information is gathered by communication among the processors
  - Second, each processor computes its decision value locally

**Information Gathering Phase**
- The information held by each processor is represented as a tree in which each path from the root to a leaf contains f+2 nodes (height f+1)
- Nodes are labeled by sequences of processor names:
  - The root is labeled by the empty sequence
  - Let the label of an internal node v be (i1, i2, …, ir); then for each i, 0 ≤ i ≤ n-1, not in the sequence, v has a child labeled (i1, i2, …, ir, i)
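As a small illustration of this labeling (a sketch; the function is mine, the slides only define the tree), the labels of the information-gathering tree can be enumerated recursively:

```python
def info_tree_labels(n, f):
    """Enumerate the labels of the information-gathering tree for n
    processors and height f+1: the root is the empty sequence, and each
    internal node (i1,...,ir) has a child (i1,...,ir,i) for every processor
    i not already in the sequence. Leaves sit at depth f+1, so every
    root-to-leaf path contains f+2 nodes."""
    labels = []
    def expand(label):
        labels.append(label)
        if len(label) == f + 1:   # leaf reached
            return
        for i in range(n):
            if i not in label:    # processor names never repeat in a label
                expand(label + (i,))
    expand(())
    return labels
```

For example, with n = 4 and f = 1 the tree has 1 root, 4 children, and 4·3 = 12 grandchildren, i.e. 17 labels in total, and every root-to-leaf path contains f+2 = 3 nodes.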