Distributed Error-Confinement

Shay Kutten (Technion), with Yossi Azar (Tel Aviv U.) and Boaz Patt-Shamir (Tel Aviv U.)

Talk Overview
(1) (Confinement in) the context of self stabilization
(2) What is error confinement?
(3) The new “agility” measure for fault tolerance
(4) The new “core-bootstrapping” idea for algorithms
(5) Optimization question and answer for “core” construction
(6) Additional results: practical considerations, building blocks, lower bound
(7) Generalizations, open problems

“Self Stabilization” versus Error Confinement
- Error confinement can be studied in the context of any fault model.
- We study error confinement in the context of “self stabilization” (explained below), since if we manage to handle a severe kind of fault, handling other faults may be easier.

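Common model for distributed algorithms
- Nodes A, B, … have unique node IDs; each node has a state (e.g. X=3).
- Nodes are connected by weighted links and exchange messages.
- No shared memory.
- Time complexity: sending a message over a link = one time unit (at most, for an asynchronous network).
[Figure: network of nodes A, B, C, D, E, with labels for node name, state, link weight, and message.]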

“Self Stabilization”
- Node's state: the values of all its variables
- Global state: the states of all nodes
- Legal states: a set of global states (those desired by the algorithm designers)
- Stabilization:
  legality: starting from any global state, eventually the state is legal
  closure: starting from a legal global state, no illegal state is reached (except by faults)

“Self Stabilization” (cont.) A “fault” means starting in an illegal state. Only the state may be faulty, not the program!

Self Stabilization example: Token passing
Legality:
- Exactly ONE node has the token
- The token circulates (by messages)
[Figure, shown over several animation steps: nodes A, B, C, D, E; the token moves from node to node.]

Self Stabilization Problem Example: Token passing
Legality:
- Exactly ONE node has the token
- The token circulates
A fault brings the system to an illegal global state.
[Figure: nodes A, B, C, D, E; after the fault, two tokens are present.]
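The two stabilization properties can be tried out on a small simulation. The slides do not say which token-passing protocol they have in mind, so the sketch below swaps in a well-known one, Dijkstra's K-state token ring, on five nodes arranged in a ring. The function names, the random scheduler, and the choice K=6 are assumptions made for illustration only.

```python
import random

def token_holders(x):
    """Return the nodes that currently hold a token in Dijkstra's K-state ring."""
    n = len(x)
    # Node 0 holds a token when its counter equals that of the last node.
    holders = [0] if x[0] == x[n - 1] else []
    # Any other node holds a token when it differs from its predecessor.
    holders += [i for i in range(1, n) if x[i] != x[i - 1]]
    return holders

def move(x, i, k):
    """Token holder i makes its move, which hands the token to its successor."""
    if i == 0:
        x[0] = (x[0] + 1) % k   # node 0 increments its counter modulo K
    else:
        x[i] = x[i - 1]         # other nodes copy their predecessor

def run(x, k, steps, seed=0):
    """Hypothetical scheduler: activate one randomly chosen token holder per step."""
    rng = random.Random(seed)
    for _ in range(steps):
        move(x, rng.choice(token_holders(x)), k)
    return x

if __name__ == "__main__":
    state = [3, 0, 5, 5, 1]        # arbitrary (faulty) start: several tokens
    print(token_holders(state))    # more than one holder: illegal global state
    run(state, k=6, steps=200)
    print(token_holders(state))    # a single holder once the ring has stabilized
```

Starting from any counter values, the ring reaches a state with exactly one token (legality), and every later move keeps exactly one token (closure).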


Motivation: “error propagation” (example)
Internet routing: node A computes its shortest path to C based on messages from S.
(1) Assume no fault:
- Message from S to A: “distance 7 to C”
- A computes: my distance to C via S: 7+4=11
[Figure: nodes C, S, A, B, with labels 7 and 4 and an arrow “Traffic to C”.]

Motivation: “error propagation” (example)
(2) With a fault (at B): a state-corrupting fault (the adversary modifies data memory) makes B hold “distance 0 to C”.
- Message from S to A: “distance 7 to C”
- A computes: my distance to C via S: 7+4=11
[Figure: nodes C, S, A, B, with labels 7, 4, and 2 and an arrow “Traffic to C”.]

Recall: a state-corrupting fault (as in self stabilization) is not malicious! It is just a one-time change of memory content (the adversary modifies data memory). Here, B's memory now says “distance 0 to C”.


Motivation: “error propagation” (example)
(3) B's fault propagated to A:
- Message from S to A: “distance 7 to C”
- Faulty B claims: “distance 0 to C”
- A computes: my distance to C via B: 2+0=2

Motivation: “error propagation” (example)
(4) Traffic to C is sent the wrong way as a result of the fault propagation: B's fault propagated to A, which now computes “my distance to C via B: 2+0=2”.
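The arithmetic of the routing example can be replayed in a few lines. This is only a sketch of a generic distance-vector choice, not the actual ARPANET code. The link weights 4 (A to S) and 2 (A to B) are read off the sums 7+4 and 2+0 in the slides; B's honest distance to C is not given there, so the value 20 below is made up.

```python
def best_route(link_weight, advertised):
    """Pick the neighbor minimizing link weight + advertised distance.

    link_weight: neighbor -> weight of the link from A to that neighbor
    advertised:  neighbor -> distance to C that the neighbor claims
    """
    costs = {nbr: link_weight[nbr] + advertised[nbr] for nbr in advertised}
    next_hop = min(costs, key=costs.get)
    return next_hop, costs[next_hop]

# Link weights inferred from the slides' sums (7+4 and 2+0).
link_weight = {"S": 4, "B": 2}

# (1) No fault. B's honest distance (20) is a made-up value.
print(best_route(link_weight, {"S": 7, "B": 20}))   # ('S', 11)

# (3) State-corrupting fault: B now claims distance 0 to C.
print(best_route(link_weight, {"S": 7, "B": 0}))    # ('B', 2)
```

A single corrupted value at B is enough to change the output of the non-faulty node A, which is exactly the propagation that error confinement is meant to rule out.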

This is, actually, how the Internet (then called “ARPANET”) crashed in 1980: a faulty node claimed “I have distance 0 to everybody”.

“Error confinement”: a non-faulty node A outputs only correct output (or no output at all). Sounds impossible?
- Faulty B claims: “distance 0 to C”
- A: “I do not believe you!”
- A's output (to routing): my distance to C via S: 7+4=11

Error Confinement (Formally)
• Π: problem specification, P: protocol.
• P solves Π with error confinement if, for any execution of P with behavior β (possibly containing a state-corrupting fault), there exists a behavior β' (one allowed by Π) such that for all non-faulty nodes v: β'_v = β_v (behavior: ignoring time).
• (“Stabilization” deals also with faulty nodes.)
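One way to read the definition is as a check over behaviors. The sketch below assumes, purely for illustration, that the specification is given as an explicit list of legal behaviors and that a behavior maps each node to its sequence of outputs (with time ignored). Neither representation comes from the slides, and `is_confined` is a hypothetical name.

```python
def is_confined(beta, legal_behaviors, faulty):
    """True if some legal behavior agrees with beta on every non-faulty node."""
    nonfaulty = [v for v in beta if v not in faulty]
    return any(
        all(legal.get(v) == beta[v] for v in nonfaulty)
        for legal in legal_behaviors
    )

# Hypothetical specification: every node outputs the single value 11.
legal = [{"S": (11,), "A": (11,), "B": (11,)}]

# Only the faulty node B deviates: the error is confined.
print(is_confined({"S": (11,), "A": (11,), "B": (0,)}, legal, faulty={"B"}))   # True

# The non-faulty node A outputs a wrong value: confinement is violated.
print(is_confined({"S": (11,), "A": (2,), "B": (0,)}, legal, faulty={"B"}))    # False
```

The outputs of faulty nodes are not compared at all, which matches the remark that stabilization, unlike confinement, also deals with faulty nodes.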


Introducing a new measure of fault resilience: the resilience of a protocol is smaller at first
The environment (e.g. user) gives input to S at time t0.
[Figure: time axis marked t0, t1, t2; network of nodes C, S, A, B, D.]

The resilience of a protocol is smaller at first (cont.)
The environment (e.g. user) gives input to S at time t0. If the adversary changes the state of S at time tf, shortly after the input, then the input is lost forever.
[Figure: time axis marked t0, tf, t1, t2; network of nodes C, S, A, B, D.]

The resilience of a protocol grows with time
However, a fault, even in S, can be tolerated if it “waits” until after S has distributed the input value.
[Figure, shown over several animation steps: time axis marked t0, t1, t2, tf; S receives the input and distributes it in the network of nodes C, S, A, B, D.]

The resilience of a protocol grows with time (cont.)
To destroy the replicated value, the adversary needs to hit more nodes at time tf > t1 > t0.
[Figure: network of nodes C, S, A, B, D at times t0 and t1.]

The resilience continues to grow with time
If no faults occurred by some later time t3, then the input is replicated even further.
[Figure: network of nodes C, S, A, B, D at times t1, t2, t3.]

The resilience continues to grow with time (cont.)
The later the faults, the more faults can be tolerated.

Time-Space Cone
The later the faults, the more faults can be tolerated, if the protocol is designed to be robust.
[Figure: copies of the network (nodes C, S, A, B, D) at times t1, t2, t3 along the time axis.]

A “narrow” cone: a LESS fault-tolerant algorithm
Slower replication: fewer nodes offer help.

A “wider” cone: a MORE fault-tolerant algorithm
Replication to more nodes, faster.

So, recovery of corrupted values is theoretically possible for an adversary that is constrained according to a space-time cone. But what is the algorithm that does the recovery?

Constraining faults: Agility
• c-constrained environment: an environment that generates faults tf time units after the input (c ≤ 1) only in a minority of the |Ball_S(c·tf)| nodes, where Ball_S(c·tf) is the set of nodes within distance c·tf of S.
• Algorithm with agility c: a broadcast algorithm that guarantees error confinement against c-constrained environments.
[Figure: node set V with the ball of radius c·tf around S.]
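The constraint on the environment can be sketched as a small check. The reading used here is an assumption: at time tf, all faulty nodes must lie inside the ball of radius c·tf around S and must form a strict minority of that ball. The slides give the node names but not the topology, so the graph below (links S-A, S-B, A-C, B-D, each of unit length) is hypothetical, as are the function names.

```python
from collections import deque

def ball(graph, source, radius):
    """Nodes within distance `radius` of `source` (BFS, unit-length links)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] + 1 > radius:    # neighbors of u would fall outside the ball
            continue
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def is_c_constrained(graph, source, c, t_f, faulty):
    """Assumed reading: faults only inside Ball(c*t_f), and a strict minority of it."""
    b = ball(graph, source, c * t_f)
    faulty = set(faulty)
    return faulty <= b and 2 * len(faulty) < len(b)

# Hypothetical topology on the slides' node names.
graph = {"S": ["A", "B"], "A": ["S", "C"], "B": ["S", "D"], "C": ["A"], "D": ["B"]}

print(is_c_constrained(graph, "S", 1, 0, {"S"}))             # False: ball is just {S}
print(is_c_constrained(graph, "S", 1, 1, {"S"}))             # True: 1 fault in a ball of 3
print(is_c_constrained(graph, "S", 1, 2, {"S", "A", "B"}))   # False: 3 faults in a ball of 5
```

With a smaller c the ball grows more slowly, so the adversary has to wait longer before it may corrupt the same number of nodes.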

An algorithm's “agility” measures the rate at which the constraint on the adversary can be lifted.


The message resides at some nodes we term “core”

A node can join the core once it has “made sure” that it heard the votes of all core nodes

and even the fault can be corrected
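The slides do not spell out how a joining node combines the votes it hears. A natural guess, given that faults are restricted to a minority, is a strict-majority rule, which is what the sketch below assumes. The function name `join_core` and the example values are hypothetical.

```python
from collections import Counter

def join_core(votes):
    """Value a joining node adopts after hearing the votes of all core nodes.

    votes: core node -> value it reports.
    Returns the strict-majority value, or None when there is no strict
    majority (the node then produces no output rather than a wrong one).
    """
    if not votes:
        return None
    value, count = Counter(votes.values()).most_common(1)[0]
    return value if 2 * count > len(votes) else None

# Core of three nodes; S's copy was corrupted after it distributed the value.
votes = {"S": 0, "A": 11, "B": 11}
adopted = join_core(votes)
print(adopted)             # 11

# Assumed correction step: the faulty core node re-adopts the majority value.
votes["S"] = adopted
print(votes)               # {'S': 11, 'A': 11, 'B': 11}
```

Returning no value when there is no majority mirrors the requirement that a non-faulty node outputs only correct output, or no output at all.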
