Network Algorithms Presenter - Kurchi Subhra Hazra
Agenda • Basic Algorithms such as Leader Election • Consensus in Distributed Systems • Replication and Fault Tolerance in Distributed Systems • GFS as an example of a Distributed System
Network Algorithms • A distributed system is a collection of entities that • are each autonomous, asynchronous and failure-prone • communicate through unreliable channels • and co-operate to perform some common function • Network algorithms enable such distributed systems to perform these "common functions" effectively
Global State in Distributed Systems • We want to estimate a "consistent" state of a distributed system • Required for determining whether the system is deadlocked or terminated, and for debugging • Two approaches: • 1. Centralized - all processes and channels report to a central process • 2. Distributed - Chandy-Lamport Algorithm
Chandy-Lamport Algorithm • Based on marker messages M • On receiving M over channel c:
If own state is not yet recorded:
a) Record own state
b) Start recording the state of all incoming channels
c) Send marker messages on all outgoing channels
Else:
a) Record the state of c (the messages that arrived on c while recording)
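The marker rule above can be made concrete with a small sketch in Python. This is a minimal illustration, not code from the lecture: the SnapshotProcess class, its send callback and the channel identifiers are all hypothetical.

```python
# Minimal sketch of the Chandy-Lamport marker rule (names are illustrative).
MARKER = "MARKER"

class SnapshotProcess:
    def __init__(self, pid, outgoing, send):
        self.pid = pid
        self.outgoing = outgoing        # processes we have outgoing channels to
        self.send = send                # send(dest_id, message) - assumed transport
        self.recorded_state = None      # local state captured for the snapshot
        self.channel_state = {}         # incoming channel -> messages recorded in transit
        self.recording = set()          # incoming channels still being recorded

    def initiate_snapshot(self, local_state, incoming):
        self._record(local_state, incoming)

    def on_message(self, channel, msg, local_state, incoming):
        if msg == MARKER:
            if self.recorded_state is None:
                # First marker seen: record own state, start recording other channels.
                self._record(local_state, incoming)
            # The channel the marker arrived on is done; its recorded state is final.
            self.recording.discard(channel)
        elif channel in self.recording:
            # Ordinary message received while recording: it was in transit in the snapshot.
            self.channel_state[channel].append(msg)

    def _record(self, local_state, incoming):
        self.recorded_state = local_state
        self.recording = set(incoming)
        self.channel_state = {c: [] for c in incoming}
        for dest in self.outgoing:
            self.send(dest, MARKER)     # propagate markers on all outgoing channels
```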
Chandy-Lamport Algorithm - Example (space-time diagram of processes P1, P2, P3; channel Chij runs from Pi to Pj)
1. P1 initiates snapshot: records its state (S1); sends Markers to P2 & P3; turns on recording for channels Ch21 and Ch31
2. P2 receives Marker over Ch12, records its state (S2), sets state(Ch12) = {}, sends Marker to P1 & P3, turns on recording for channel Ch32
3. P1 receives Marker over Ch21, sets state(Ch21) = {a}
4. P3 receives Marker over Ch13, records its state (S3), sets state(Ch13) = {}, sends Marker to P1 & P2, turns on recording for channel Ch23
5. P2 receives Marker over Ch32, sets state(Ch32) = {b}
6. P3 receives Marker over Ch23, sets state(Ch23) = {}
7. P1 receives Marker over Ch31, sets state(Ch31) = {}
Taken from CS 425/UIUC/Fall 2009
Leader Election • Suppose you want to: elect a master server out of n servers, or elect a co-ordinator among different mobile systems • Common leader election algorithms: Ring Election, Bully Election • Two requirements: • Safety (the process with the best attribute is elected) • Liveness (the election terminates)
Ring Election • Processes are organized in a logical ring • The initiator sends an election message clockwise to the next process in the ring, carrying its own id and attribute value • Each process receiving the election message checks it: • If its own attribute value is greater, it replaces the candidate in the message with its own id and attribute and forwards it • If its own attribute value is less, it simply passes the message on • If the message carries its own id (the attribute values are equal), it declares itself the leader and passes on an "elected" message, as in the sketch below. What happens when a node fails?
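A minimal sketch of the forwarding rule just described (Chang-Roberts style), assuming each process knows its clockwise neighbour; the RingProcess class and the send_clockwise callback are illustrative, not from the original slides.

```python
# Sketch of the ring election rule described above; names are illustrative.
class RingProcess:
    def __init__(self, pid, attr, send_clockwise):
        self.pid = pid
        self.attr = attr                        # attribute used to pick the leader
        self.send_clockwise = send_clockwise    # assumed: delivers to the next process in the ring

    def start_election(self):
        self.send_clockwise(("ELECTION", self.attr, self.pid))

    def on_election(self, msg_attr, msg_id):
        if msg_id == self.pid:
            # The message came all the way around: we hold the best attribute -> leader.
            self.send_clockwise(("ELECTED", self.pid))
        elif (msg_attr, msg_id) > (self.attr, self.pid):
            # The candidate in the message is better: forward it unchanged.
            self.send_clockwise(("ELECTION", msg_attr, msg_id))
        else:
            # Our attribute is better: replace the candidate with ourselves.
            self.send_clockwise(("ELECTION", self.attr, self.pid))
```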
Ring Election - Example Taken from CS 425/UIUC/Fall 2009
Ring Election - Example Taken from CS 425/UIUC/Fall 2009
Bully Algorithm Best case and worst case scenarios Taken from CS 425/UIUC/Fall 2009
Consensus • A set of n processes/systems attempts to "agree" on some information • Each Pi begins in the undecided state and proposes a value vi ∈ D • The Pi communicate by exchanging values • Each Pi sets its decision value di and enters the decided state • Requirements: 1. Termination: eventually all correct processes decide, i.e., each correct process sets its decision variable 2. Agreement: the decision value of all correct processes is the same 3. Integrity: if all correct processes proposed v, then any correct decided process has di = v
2 Phase Commit Protocol • Useful in distributed transactions to perform an atomic commit • Atomic commit: a set of distinct changes applied as a single operation • Suppose A transfers $300 from A's account to B's account: • A = A - 300 • B = B + 300 • Both updates must be applied together to keep the accounts consistent; a sketch of the two phases follows.
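A minimal sketch of the two phases from the co-ordinator's point of view; the participant objects and their can_commit/do_commit/do_abort methods are assumed purely for illustration, and timeout handling is omitted.

```python
# Sketch of a two-phase commit co-ordinator; the participant API is assumed for illustration.
def two_phase_commit(participants, transaction):
    # Phase 1 (voting): ask every participant whether it can commit.
    votes = [p.can_commit(transaction) for p in participants]

    # Phase 2 (completion): commit only if every vote was yes, otherwise abort.
    if all(votes):
        for p in participants:
            p.do_commit(transaction)      # point of no return once sent
        return "COMMITTED"
    else:
        for p in participants:
            p.do_abort(transaction)
        return "ABORTED"
```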
2 Phase Commit Protocol • What happens if the co-ordinator and a participant fail after doCommit?
Issue with 2PC • The co-ordinator sends canCommit? to participants A and B
Issue with 2PC • A and B both reply Yes
Issue with 2PC • The co-ordinator sends doCommit; the co-ordinator and A then crash while B commits • A new co-ordinator cannot know whether A had committed.
3 Phase Commit Protocol (3PC) • Uses an additional stage (preCommit)
3PC Cont… • Message flow: the co-ordinator sends canCommit, then preCommit, then commit to Cohorts 1-3; the cohorts acknowledge (ack) each phase before the next one starts, and finally commit.
3PC Cont… • Why is this better? • 2PC: execute the transaction when everyone is willing to COMMIT it • 3PC: execute the transaction when everyone knows it will COMMIT (http://www.coralcdn.org/07wi-cs244b/notes/l4d.txt) • But 3PC is expensive • Timeouts triggered by slow machines • A sketch of the extra stage follows.
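A sketch of how the extra preCommit stage fits in, extending the 2PC sketch above; again, the participant API is assumed and failure/timeout handling is omitted.

```python
# Sketch of a three-phase commit co-ordinator with the extra preCommit stage (API assumed).
def three_phase_commit(participants, transaction):
    # Phase 1: canCommit? - collect votes.
    if not all(p.can_commit(transaction) for p in participants):
        for p in participants:
            p.do_abort(transaction)
        return "ABORTED"

    # Phase 2: preCommit - every participant now knows the outcome will be commit,
    # so a new co-ordinator can safely finish the protocol after a crash.
    for p in participants:
        p.pre_commit(transaction)

    # Phase 3: doCommit - actually apply the transaction.
    for p in participants:
        p.do_commit(transaction)
    return "COMMITTED"
```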
Paxos Protocol • A consensus algorithm • Important safety conditions: • Only one value is chosen • Only a proposed value is chosen • Important liveness conditions: • Some proposed value is eventually chosen • Once a value is chosen, a process can eventually learn it • Nodes act as Proposers, Acceptors and Learners
Paxos Protocol – Phase 1 • The proposer selects a number n for its proposal of value v and sends a Prepare message to the acceptors • Each acceptor acknowledges with the highest proposal number it has seen • What about an acceptor that does not respond? A majority of acceptors is enough.
Paxos Protocol – Phase 2 • Once a majority of acceptors have acknowledged proposal n, the proposer sends an Accept request for proposal n with value v to the acceptors.
Paxos Protocol – Phase 2 • The acceptors accept the proposal: a majority of acceptors agree on proposal n with value v • What if v is null? (If no acceptor reported a previously accepted value, the proposer is free to choose its own value.) A sketch of the acceptor side of both phases follows.
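A minimal sketch of the acceptor side of the two phases, assuming simple tuple-shaped messages; the Acceptor class and message names are illustrative, and the proposer's majority counting is not shown.

```python
# Sketch of a Paxos acceptor handling the two phases; message formats are illustrative.
class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest proposal number promised so far
        self.accepted_n = -1      # proposal number of the last accepted value
        self.accepted_v = None    # last accepted value, if any

    def on_prepare(self, n):
        # Phase 1: promise not to accept proposals numbered below n, and report any
        # value already accepted so the proposer must reuse it instead of its own.
        if n > self.promised_n:
            self.promised_n = n
            return ("PROMISE", n, self.accepted_n, self.accepted_v)
        return ("NACK", self.promised_n)

    def on_accept(self, n, v):
        # Phase 2: accept (n, v) unless a higher-numbered prepare has been promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("ACCEPTED", n, v)
        return ("NACK", self.promised_n)
```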
Paxos Protocol Cont… • What if an arbitrary number of proposers is allowed? • Two proposers P and Q can keep starting new rounds with higher proposal numbers (n1 in round 1, n2 in round 2, …), each invalidating the other's earlier proposal at the acceptors.
Paxos Protocol Cont… • What if an arbitrary number of proposers is allowed? P and Q keep overtaking each other (n3, n4, … over further rounds) and no value is ever chosen • To ensure progress, use a distinguished proposer.
Paxos Protocol Cont… • Some issues: • How do we choose the proposer? • How do we ensure a unique n? • The protocol is expensive • No primary if a distinguished proposer is used • Originally used by the Paxons to run their part-time parliament
Replication • Replication is important for: • Fault tolerance • Load balancing • Increased availability • Requirements: • Transparency • Consistency
Failure in Distributed Systems • An important consideration in every design decision • Fault detectors should be: • Complete – able to detect a fault whenever one occurs • Accurate – does not raise false positives (see the sketch below)
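A sketch of a simple heartbeat/timeout detector that illustrates the trade-off: it is complete (a crashed process stops sending heartbeats and is eventually suspected) but not perfectly accurate (a slow process or network can be falsely suspected). The class name and default timeout are assumptions for illustration.

```python
import time

# Sketch of a timeout-based failure detector; the timeout value is an assumption.
class HeartbeatDetector:
    def __init__(self, timeout_secs=5.0, clock=time.monotonic):
        self.timeout = timeout_secs     # larger -> fewer false positives, slower detection
        self.clock = clock
        self.last_heard = {}            # process id -> time of last heartbeat

    def on_heartbeat(self, pid):
        self.last_heard[pid] = self.clock()

    def suspected(self, pid):
        # Complete: a crashed process is eventually suspected (its heartbeats stop).
        # Not perfectly accurate: a delayed heartbeat can trigger a false positive.
        last = self.last_heard.get(pid)
        return last is None or (self.clock() - last) > self.timeout
```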
Byzantine Faults • Arbitrary messages and transitions • Cause: e.g., software bugs, malicious attacks • Byzantine Agreement Problem: “Can a set of concurrent processes achieve coordination in spite of the faulty behavior of some of them?” • Concurrent processes could be replicas in distributed systems
Practical Byzantine Fault Tolerance (PBFT) • Replication algorithm that is able to tolerate Byzantine faults • Useful for software faults • Why "Practical"? -> it can be used in an asynchronous environment like the Internet • Important assumptions: • At most f of the 3f + 1 replicas can be faulty • All replicas start in the same state • Failures are independent – practical?
PBFT Cont… • Message flow: request → pre-prepare → prepare → commit → reply, between client C and replicas R1–R4 (R1 is the primary replica) • A replica is prepared after accepting 2f matching prepares, and executes the request after 2f+1 commits • The client blocks and waits for f+1 matching replies
PBFT Cont… • The algorithm provides • -> Safety: by guaranteeing linearizability; pre-prepare and prepare ensure a total order on requests • -> Liveness: by providing for view change when the primary replica fails (here, some synchrony is assumed) • How do we know a priori the value of f?
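A small helper, purely illustrative, that spells out the quorum thresholds implied by the message flow above for n = 3f + 1 replicas.

```python
# Quorum thresholds for PBFT with n = 3f + 1 replicas (illustrative helper).
def pbft_thresholds(f):
    n = 3 * f + 1
    return {
        "replicas": n,                # total replicas needed to tolerate f Byzantine faults
        "prepares_needed": 2 * f,     # matching prepares (plus the pre-prepare) to become prepared
        "commits_needed": 2 * f + 1,  # matching commits before executing the request
        "client_replies": f + 1,      # matching replies the client waits for
    }

# Example: tolerating one faulty replica requires 4 replicas in total.
print(pbft_thresholds(1))
```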
Google File System • Revisited traditional file system design: 1. Component failures are the norm 2. Multi-GB files are common 3. Files are mutated by appending new data 4. Relaxed consistency model
GFS Architecture (architecture diagram) • Leader election / replication • The master maintains metadata: namespace, chunk metadata, etc.
GFS – Design Issues • Single master • Rationale: keep things simple • Problems: increasing volume of underlying storage -> increase in metadata; growing client load -> the master server became a bottleneck • Current: multiple masters per data center • Ref: http://queue.acm.org/detail.cfm?id=1594206
GFS Design Issues • Replication of chunks • Replication across racks – the default replication count is 3 • Allowing concurrent changes to the same file -> in retrospect, they would rather have a single writer • The primary replica serializes mutations to chunks – they do not use any of the consensus protocols before applying mutations to the chunks • Ref: http://queue.acm.org/detail.cfm?id=1594206