
Error Confinement: New Measures for Fault Tolerance and Core Bootstrapping

This talk discusses error confinement in distributed systems. It introduces a new fault-tolerance measure, agility, and the core-bootstrapping idea for algorithm design, then poses and answers an optimization question about core construction.


Presentation Transcript


  1. Distributed Error-Confinement. Shay Kutten (Technion), with Boaz Patt-Shamir (Tel Aviv U.) and Yossi Azar (Tel Aviv U.)

  2. Talk Overview: (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for algorithms. (4) An optimization question and answer for “core” construction.

  3. Motivation: “error propagation” (example). Internet routing: node A computes its shortest path to C based on messages from S. (1) Assume no fault. Message from S to A: “distance 7 to C”. A: “My distance to C via S: 7+4=11”. Traffic to C goes via S. [Figure: nodes C, S, A, B; S is at distance 7 from C; link A-S costs 4.]

  4. Motivation: “error propagation” (example). (2) With a fault (at B): a state-corrupting fault (the adversary modifies data memory) leaves B holding “distance 0 to C”. Message from S to A: “distance 7 to C”. A: “My distance to C via S: 7+4=11”. [Figure: nodes C, S, A, B; link A-S costs 4, link A-B costs 2.]

  5. State-corrupting fault (self-stabilization): not malicious! Just a one-time change of memory content (the adversary modifies data memory). B now holds “distance 0 to C”.

  6. Motivation: “error propagation” (example). (2) With a fault (at B): B holds the faulty “distance 0 to C”. Message from S to A: “distance 7 to C”. A: “My distance to C via S: 7+4=11”.

  7. Motivation: “error propagation” (example). (3) B’s fault propagates to A. Message from S to A: “distance 7 to C”; faulty B claims “distance 0 to C”. A: “My distance to C via B: 2+0=2”.

  8. Motivation: “error propagation” (example). B’s fault has propagated to A, which now holds “My distance to C via B: 2+0=2”. (4) Traffic to C is sent the wrong way as a result of the fault propagation.
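The propagation in the example above can be reproduced with a few lines of Python. This is only a minimal sketch of a distance-vector style choice at node A, not the protocol from the talk: the function name `best_route` is hypothetical, and B's honest advertisement (13) is an assumption, since the slides give only B's corrupted value.

```python
def best_route(link_cost, advertised):
    """Pick the neighbor that minimizes link cost + advertised distance."""
    candidates = {n: link_cost[n] + advertised[n] for n in advertised}
    via = min(candidates, key=candidates.get)
    return via, candidates[via]


# Costs of A's links, taken from the example (A-S: 4, A-B: 2).
link_cost = {"S": 4, "B": 2}

# No fault: S advertises 7. B's honest value is NOT on the slides;
# 13 is an assumed placeholder that is worse than going via S.
print(best_route(link_cost, {"S": 7, "B": 13}))

# State-corrupting fault at B: B now advertises distance 0 to C.
print(best_route(link_cost, {"S": 7, "B": 0}))
```

With the honest values A routes via S at distance 11; once B's memory is corrupted, A trusts the advertisement blindly and switches to B at distance 2, which is exactly the wrong-way traffic described on the slide.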

  9. This is, actually, how the Internet (then called “ARPANET”) crashed in 1980: a faulty node announced “I have distance 0 to everybody”. [Figure: nodes S, A, B, C, D.]

  10. “Error confinement”: a non-faulty node A outputs only correct output (or no output at all). Sounds impossible? Faulty B claims “distance 0 to C”; A answers “I do not believe you!” and outputs (to routing): “My distance to C via S: 7+4=11”.

  11. Error Confinement (formally) • Π: problem specification, P: protocol. • P solves Π with error confinement if, for any execution of P with behavior β (possibly containing a state-corrupting fault), there exists a behavior β′ satisfying Π such that for all non-faulty nodes v: β′_v = β_v. • (Behavior: ignoring time. “Stabilization” deals also with faulty nodes.)

  12. Talk Overview: (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for algorithms. (4) An optimization question and answer for “core” construction.

  13. Introducing a new measure of fault resilience. The resilience of a protocol is smaller at first: the environment (e.g. a user) gives the input to S at time t0. [Figure: time axis with t0, t1, t2; nodes S, A, B, C, D at time t0.]

  14. The resilience of a protocol is smaller at first (cont.). The environment (e.g. a user) gives the input to S at time t0. If the adversary changes the state of S at time tf, shortly after the input...

  15. The resilience of a protocol is smaller at first (cont.). The environment (e.g. a user) gives the input to S at time t0. If the adversary changes the state of S at time tf, shortly after the input, then the input is lost forever.

  16. The resilience of a protocol grows with time. However, a fault, even in S, can be tolerated if it waits until after S has distributed the input value. [Figure: time axis with t0, t1, t2, tf; the input reaches S at t0; nodes S, A, B, C, D.]

  17. The resilience of a protocol grows with time (cont.). However, a fault, even in S, can be tolerated if it waits until after S has distributed the input value. [Figure: S distributes the input to the other nodes.]

  18. The resilience of a protocol grows with time. A fault, even in S, can be tolerated if it “waits” until after S has distributed the input value. [Figure: S distributes the input to the other nodes.]

  19. The resilience of a protocol grows with time. A fault, even in S, can be tolerated if it “waits” until after S has distributed the input value. [Figure: S distributes the input to the other nodes.]

  20. The resilience of a protocol grows with time. To destroy the replicated value, the adversary needs to hit more nodes at time tf > t1 > t0. [Figure: nodes S, A, B, C, D at times t0 and t1; the input reaches S at t0.]

  21. The resilience continues to grow with time. If no faults have occurred by some later time t3, then the input is replicated even further. [Figure: nodes S, A, B, C, D at times t1, t2, t3.]

  22. The resilience continues to grow with time. The later the faults, the more faults can be tolerated. [Figure: nodes S, A, B, C, D at times t1, t2, t3; fault at tf.]

  23. Time-space cone. The later the faults, the more faults can be tolerated, if the protocol is designed to be robust. [Figure: nodes S, A, B, C, D at times t1, t2, t3, forming a cone rooted at S.]

  24. A “narrow” cone: a LESS fault-tolerant algorithm. Slower replication: fewer nodes offer help.

  25. A “wider” cone: a more fault-tolerant algorithm. Faster replication to more nodes.

  26. So, recovery of corrupted values is theoretically possible for an adversary that is constrained according to a space-time cone. But what is the algorithm that does the recovery?

  27. Constraining faults: agility • c-constrained environment (c ≤ 1): an environment that generates faults tf time units after the input only in a minority of the nodes of Ball_S(c·tf). • Algorithm with agility c: a broadcast algorithm that guarantees error confinement against c-constrained environments. [Figure: Ball_S(c·tf), the ball of radius c·tf around S inside the node set V.]
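To make the constraint concrete, here is a small Python sketch that checks whether a given set of faults is permitted in a c-constrained environment. It rests on several assumptions that are not in the slides: distances are hop counts in an unweighted graph, "minority" means strictly fewer than half of the ball, and faults must lie inside the ball. The names `ball` and `fault_allowed` and the five-node topology are hypothetical.

```python
from collections import deque


def ball(graph, source, radius):
    """Nodes within `radius` hops of `source`, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] + 1 > radius:  # neighbors of u would fall outside the ball
            continue
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def fault_allowed(graph, source, c, t_f, faulty):
    """True if `faulty` is a strict minority of Ball_source(c * t_f).

    Assumed reading of the constraint: faults lie inside the ball and
    hit fewer than half of its nodes.
    """
    b = ball(graph, source, c * t_f)
    faulty = set(faulty)
    return faulty <= b and 2 * len(faulty) < len(b)


# Hypothetical topology: S in the middle, A and B adjacent, C and D behind them.
graph = {"S": ["A", "B"], "A": ["S", "C"], "B": ["S", "D"],
         "C": ["A"], "D": ["B"]}

print(fault_allowed(graph, "S", 0.5, 1, {"S"}))       # ball is {S} only
print(fault_allowed(graph, "S", 0.5, 2, {"S"}))       # ball is {S, A, B}
print(fault_allowed(graph, "S", 0.5, 4, {"S", "A"}))  # ball is all five nodes
```

Under these assumptions, a fault in S right after the input is forbidden because the ball still contains only S, while later the same fault is permitted because it hits a minority of a larger ball. This matches the earlier observation that resilience is small at first and grows with time.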

  28. An algorithm’s “agility” measures the rate at which the constraint on the adversary can be lifted. [Figure: time-space cone rooted at S, labeled “Agility”.]

  29. Talk Overview: (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for algorithms. (4) An optimization question and answer for “core” construction.

  30. The message resides at some nodes, which we term the “core”.

  31. A node can join the core when it “made sure” it heard the votes of all core nodes

  32. A node can join the core when it “made sure” it heard the votes of all core nodes

  33. A node can join the core when it “made sure” it heard the votes of all core nodes

  34. A node can join the core when it “made sure” it heard the votes of all core nodes

  35. and even the fault can be corrected

  36. and even the fault can be corrected

  37. Let us view again the join of one node

  38. If the core is such that the adversary’s constraint allows it to hit only a minority of the core… then the message passes to the new node correctly.

  39. If the core is such that the adversary’s constraint allows it to hit only a minority of the core… then the message passes to the new node correctly. (Disclaimer: any connection to actual historical rivalry is coincidental.)

  40. If the core is such that the adversary’s constraint allows it to hit only a minority of the core… then the message passes to the new node correctly. (Disclaimer: any connection to actual historical rivalry is coincidental.)
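The join step can be sketched as a majority vote over the values reported by the core. This is an illustrative reading of "heard the votes of all core nodes" combined with "only a minority of the core" is hit, not the algorithm from the talk. The function names `join_core` and `repair_core`, and the rule of refusing to join without a strict majority, are assumptions.

```python
from collections import Counter


def join_core(core_votes):
    """Value a joining node adopts after hearing every core node's vote.

    Returns None (the node stays silent) when no value has a strict majority.
    """
    if not core_votes:
        return None
    value, count = Counter(core_votes.values()).most_common(1)[0]
    return value if 2 * count > len(core_votes) else None


def repair_core(core_votes):
    """Overwrite minority votes with the majority value, if one exists."""
    majority = join_core(core_votes)
    if majority is None:
        return dict(core_votes)
    return {node: majority for node in core_votes}


# Hypothetical core of three nodes; B was hit by a state-corrupting fault.
votes = {"S": "msg", "A": "msg", "B": "corrupted"}
print(join_core(votes))    # the new node adopts the majority value
print(repair_core(votes))  # and the faulty core node can be corrected
```

As long as the adversary hits fewer than half of the core, the majority value is the original message, so the new node joins with the correct value and the corrupted core member can be overwritten.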

  41. “Error confinement”: a non-faulty node A outputs only correct output (or no output at all). Faulty B claims “distance 0 to C”; A answers “I do not believe you!” and outputs (to routing): “My distance to C via S: 7+4=11”. [Figure: nodes C, S, A, B, D.]

  42. and even the fault can be corrected

  43. When the core grows, the algorithm can withstand more faults.
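A short Python sketch of why a larger core helps, assuming (as a reading of the earlier slides, not a statement from the talk) that the value survives whenever strictly fewer than half of the core nodes are hit. The function name `max_tolerated_faults` is hypothetical.

```python
def max_tolerated_faults(core_size):
    """Largest number of hit core nodes that still leaves a strict majority intact."""
    return max(0, (core_size - 1) // 2)


# The bound grows with the core: a single-node core tolerates nothing.
for size in (1, 3, 5, 9):
    print(size, max_tolerated_faults(size))
```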
