
Fault-containment in Weakly Stabilizing Systems



Presentation Transcript


  1. Fault-containment in Weakly Stabilizing Systems. Anurag Dasgupta, Sukumar Ghosh, Xin Xiao. University of Iowa.

  2. Preview
• Weak stabilization (Gouda 2001) guarantees reachability and closure of the legal configuration. Once "stable", if there is a minor perturbation, apparently no recovery guarantee exists, let alone a guarantee of "efficient recovery".
• We take a weakly stabilizing leader election algorithm and add fault-containment to it.

  3. Our contributions
• An exercise in adding fault-containment to a weakly stabilizing leader election algorithm on a line topology. Processes are anonymous.
• Containment time = O(1) from all single failures
• lim m→∞ (contamination number) is O(1) (precisely 4), where m is a tuning parameter. (Contamination number = maximum number of non-faulty processes that change their states during recovery.)

  4. The big picture

  5. Model and Notations
Consider n processes in a line topology.
N(i) = neighbors of process i
Variable: P(i) ∈ N(i) ∪ {⊥} (parent of i)
Macro: C(i) = {q ∈ N(i) : P(q) = i} (children of i)
Predicate: Leader(i) ≡ (P(i) = ⊥)
Legal configuration:
• For exactly one process i: P(i) = ⊥
• ∀j ≠ i: P(j) = k ⇒ P(k) ≠ j
(Figure: a line of nodes annotated with C(i), node i, P(i), and the leader.)
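For concreteness, here is a minimal Python rendering of this model (our sketch, not from the slides; the list representation and None for ⊥ are our choices):

    # Minimal sketch of the slide-5 model. Assumptions not in the slide:
    # processes are indexed 0..n-1 on the line, and the parent pointer is
    # a list P with None standing for the bottom symbol.
    BOTTOM = None

    def neighbors(i, n):
        """N(i): neighbors of process i on a line of n processes."""
        return [j for j in (i - 1, i + 1) if 0 <= j < n]

    def children(i, P, n):
        """C(i) = {q in N(i) : P(q) = i}."""
        return [q for q in neighbors(i, n) if P[q] == i]

    def is_leader(i, P):
        """Leader(i) holds iff P(i) is bottom."""
        return P[i] is BOTTOM

    def is_legal(P, n):
        """Exactly one leader, and no pair of mutual parents."""
        if sum(is_leader(i, P) for i in range(n)) != 1:
            return False
        return all(P[P[j]] != j for j in range(n) if not is_leader(j, P))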

  6. Model and Notations (continued)
(The figure of the line with C(i), node i, P(i), and the leader repeats here.)
• Shared memory model and a central scheduler
• Weak fairness of the scheduler
• Guarded action by a process: g → A
• A computation is a sequence of (global) states and state transitions
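This execution model can be sketched as follows (our Python illustration; the random pick is one way to realize a weakly fair central scheduler, and matches the randomized scheduler the deck uses from slide 10 onward):

    import random

    # Sketch of the execution model: a central scheduler repeatedly lets one
    # process fire one enabled guarded action g -> A; a computation is the
    # resulting sequence of global states.

    def run(step, state, n, max_steps=10_000):
        """step(i, state) fires one enabled action at process i and returns
        True, or returns False if no guard of i is enabled."""
        for _ in range(max_steps):
            order = random.sample(range(n), n)       # randomized pick
            if not any(step(i, state) for i in order):
                return state                         # nothing enabled: stable
        return state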

  7. Stabilization
A stable (or legal) configuration satisfies a predicate LC defined in terms of the primary variables p that are observable by the application. However, fault-containment often needs secondary variables (a.k.a. auxiliary or state variables) s. Thus:
• Local state of process i = (pi, si)
• Global state of the system = (p, s), where p = the set of all pi and s = the set of all si
• (p, s) ∈ LC ≡ (p ∈ LCp) ∧ (s ∈ LCs)

  8. Definitions
• Containment time is the maximum time needed to establish LCp from a 1-faulty configuration.
• Containment in space means that the primary variables of only O(1) processes change their state during recovery from any 1-faulty configuration.
• Fault-gap is the time to reach LC (both LCp and LCs) from any 1-faulty configuration.
(Figure: a timeline marking when LCp and LCs are restored; the fault gap spans until both hold.)

  9. Weakly stabilizing leader election
We start from the weakly stabilizing leader election algorithm by Devismes, Tixeuil, Yamashita [ICDCS 2007], and then modify it to add fault-containment. Here is the DTY algorithm for an array of processes.
DTY algorithm: program for any process i in the array
Guarded actions:
R1 :: ¬Leader(i) ∧ (N(i) = C(i)) → be a leader
R2 :: ¬Leader(i) ∧ (N(i) \ (C(i) ∪ {P(i)}) ≠ ∅) → switch parent
R3 :: Leader(i) ∧ (N(i) ≠ C(i)) → parent := k, where k ∉ C(i)
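A runnable sketch of the three DTY rules (our reconstruction on top of the helpers above, not the authors' code; the choice among eligible neighbors is arbitrary here, and "switch parent" is read as adopting a neighbor outside C(i) ∪ {P(i)}):

    # Sketch of the DTY guarded actions, built on the earlier model sketch.

    def dty_step(i, P, n):
        """Fire one enabled DTY rule at process i; return True on a move."""
        N = neighbors(i, n)
        C = children(i, P, n)
        if P[i] is not BOTTOM:
            if set(N) == set(C):
                P[i] = BOTTOM                      # R1: all neighbors are children
                return True
            eligible = [k for k in N if k not in C and k != P[i]]
            if eligible:
                P[i] = random.choice(eligible)     # R2: switch parent
                return True
        else:
            eligible = [k for k in N if k not in C]
            if eligible:
                P[i] = random.choice(eligible)     # R3: leader adopts k not in C(i)
                return True
        return False

Driving it with the scheduler sketch, e.g. run(lambda i, P: dty_step(i, P, len(P)), P, n), exhibits the probability-1 convergence that weak stabilization promises under a randomized scheduler.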

  10. Effect of a single failure
With a randomized scheduler, the weakly stabilizing system recovers to a legal configuration with probability 1. However, if a single failure occurs, the recovery time can be as large as n (by an argument similar to the Gambler's ruin problem). For fault-containment, we need something better: we bias the randomized scheduler to achieve our goal. The technique is borrowed from [Dasgupta, Ghosh, Xiao: SSS 2007]; here we show that it is powerful enough to solve a larger class of problems.

  11. Biasing a random scheduler
For fault-containment, each process i uses a secondary variable x(i). A node i updates its primary variable P(i) when the following conditions hold:
1. The guard involving the primary variables is true
2. The randomized scheduler chooses i
3. x(i) ≥ x(k), where k ∈ N(i) is the neighbor involved in the action

  12. Biasing a random scheduler (continued)
After the action, x(i) is incremented as x(i) := max{x(q) : q ∈ N(i)} + m, where m ∈ Z+ (call this update x(i); m is a tuning parameter). When x(i) < x(k) but conditions 1-2 hold, the primary variable P(i) remains unchanged; only x(i) is incremented by 1 (call this increment x(i)).
(Figure, with m = 5 and neighbors j, i, k holding x(j) = 8, x(i) = 10, x(k) = 7: an UPDATE of x(i) sets x(i) := max(8, 7) + 5 = 13, while an INCREMENT of x(k) merely raises x(k) from 7 to 8.)
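In code, the two moves look like this (our sketch, reusing neighbors from above; M plays the role of the tuning parameter m):

    # Sketch of the two secondary-variable moves on this slide.
    M = 5   # the tuning parameter m; 5 is the slide's example value

    def update_x(i, x, n):
        """update x(i): leap past every neighbor by the margin m."""
        x[i] = max(x[q] for q in neighbors(i, n)) + M

    def increment_x(i, x):
        """increment x(i): creep up by 1 while i lacks priority."""
        x[i] += 1

    # The slide's example with x(j)=8, x(i)=10, x(k)=7 on a 3-node line:
    x = [8, 10, 7]
    update_x(1, x, 3)      # x becomes [8, 13, 7]: max(8, 7) + 5 = 13
    # increment_x(2, x) would instead turn x(k) = 7 into 8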

  13. The Algorithm
Algorithm 1 (containment): program for process i
Guarded actions:
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
R2 :: (P(i) = ⊥) ∧ (∃k ∈ N(i) \ C(i)) → P(i) := k
R3a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ (x(i) ≥ x(k)) → P(i) := k; update x(i)
R3b :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ (x(i) < x(k)) → increment x(i)
R4a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ (x(i) ≥ x(k)) → P(i) := k
R4b :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ (x(i) < x(k)) → increment x(i)
R5 :: (P(i) = j) ∧ (P(j) = ⊥) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) → P(i) := k
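The whole program can be sketched as follows (our reconstruction, not the authors' code; the slide's guards can overlap, and we resolve overlaps with the fixed priority R1, R2, R5, R3, R4, which is one admissible scheduler choice; we also read "P(i) := k" as requiring k ≠ P(i), which the slide leaves implicit):

    # Sketch of Algorithm 1, reusing the helpers from the earlier sketches.

    def containment_step(i, P, x, n):
        """Fire one enabled rule of Algorithm 1 at process i; report which."""
        N = neighbors(i, n)
        C = children(i, P, n)
        # Candidate new parents; k = P(i) is excluded (our assumption).
        others = [k for k in N if k != P[i] and P[k] is not BOTTOM and P[k] != i]
        leaders = [k for k in N if k != P[i] and P[k] is BOTTOM]
        if P[i] is not BOTTOM and set(N) == set(C):
            P[i] = BOTTOM                            # R1: become the leader
            return "R1"
        if P[i] is BOTTOM:
            ks = [k for k in N if k not in C]
            if ks:
                P[i] = ks[0]                         # R2: leader steps down
                return "R2"
            return None
        if P[P[i]] is BOTTOM and others:
            P[i] = others[0]                         # R5: parent is the leader
            return "R5"
        if others:
            k = others[0]
            if x[i] >= x[k]:
                P[i] = k
                update_x(i, x, n)                    # R3a: move and jump x
                return "R3a"
            increment_x(i, x)                        # R3b: wait, creep x up
            return "R3b"
        if leaders:
            k = leaders[0]
            if x[i] >= x[k]:
                P[i] = k                             # R4a: attach to a leader
                return "R4a"
            increment_x(i, x)                        # R4b
            return "R4b"
        return None

Plugged into the run loop sketched earlier, this replays the case analyses of slides 15-18.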

  14. Analysis of containment
Consider six cases:
1. Fault at the leader
2. Fault at distance-1 from the leader
3. Fault at distance-2 from the leader
4. Fault at distance-3 from the leader
5. Fault at distance-4 from the leader
6. Fault at distance-5 or greater from the leader

  15. Case 1: fault at the leader node
(Figure: an 8-node line, nodes 0-7, replayed move by move; the corrupted state is marked.)
Moves: R1 applied by node 5; then R1 applied by node 4: node 4 is the new leader.
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
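This case can be replayed with the sketches above (a hypothetical driver; the 8-node line, right-pointing orientation, and fault injection are our choices, not the slide's):

    # Hypothetical demo: corrupt the leader of a stable 8-node line and let
    # Algorithm 1 recover (uses the run/containment_step sketches above).
    n = 8
    P = [1, 2, 3, 4, 5, 6, 7, BOTTOM]   # legal: each node points right, 7 leads
    x = [0] * n
    P[7] = 6                            # single fault at the leader
    run(lambda i, s: containment_step(i, s[0], s[1], n) is not None, (P, x), n)
    print(P.index(BOTTOM))              # a leader exists again (node 6 or 7)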

  16. Case 2: fault at distance-1 from the leader node
(Figure: an 8-node line, nodes 0-7, replayed move by move; the faulty node is marked.)
Moves: R1 applied by node 3; then R2 applied by node 5.
R2 :: (P(i) = ⊥) ∧ (∃k ∈ N(i) \ C(i)) → P(i) := k

  17. Case 5: fault at distance-4 from the leader node
(Figure: an 8-node line, nodes 0-7, replayed move by move.)
Moves: R4a by node 2 (x(2) > x(1)); R3a by node 3 (x(3) > x(2)); R5 by node 4; R2 by node 5; then stable.
Non-faulty processes up to distance 4 from the faulty node are affected.
R4a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ (x(i) ≥ x(k)) → P(i) := k

  18. Case 6: fault at distance ≥ 5 from the leader node
(Figure: an 8-node line, nodes 0-7, with the current leader marked, replayed move by move.)
Moves: R4a by node 2 (x(2) > x(1)); R3a by node 3 (x(3) > x(2), x(4)); R3a by node 3 and R5 by node 2; R2 by node 1; recovery complete.
With a high m, it is difficult for node 4 to change its parent, but node 3 can do so easily.

  19. Fault-containment in space
Theorem 1. As m → ∞, the effect of a single failure is restricted to within distance-4 of the faulty process, i.e., the algorithm is spatially fault-containing.
Proof idea. Exhaustive case-by-case analysis. The worst case occurs when a node at distance-4 from the leader fails, as shown earlier.

  20. Fault-containment in time
Theorem 2. The expected number of steps needed to contain a single fault is independent of n. Hence the algorithm is fault-containing in time.
Proof idea. Case-by-case analysis. When a node beyond distance-4 from the leader fails, its impact on the time complexity remains unchanged.

  21. Fault-containment in time
Case 1: the leader fails.
(Figure: an 8-node line with the faulty leader marked.)
Recovery is completed in a single move, regardless of whether node 3 or node 4 executes the move.
Case 2: a node i at distance-1 from the leader fails.
• P(i) becomes ⊥: recovery is completed in one step
• P(i) switches to a new parent: recovery time = 2 + ∑_{n=1}^{∞} n/2^n = 4
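The closed form of this series follows from a standard geometric-series manipulation (our worked step, in LaTeX):

    \sum_{n=1}^{\infty} \frac{n}{2^n}
      = \sum_{n=1}^{\infty} \sum_{j=n}^{\infty} \frac{1}{2^j}
      = \sum_{n=1}^{\infty} \frac{1}{2^{n-1}}
      = 2,
    \qquad \text{hence} \qquad
    2 + \sum_{n=1}^{\infty} \frac{n}{2^n} = 2 + 2 = 4 .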

  22. Fault-containment in time
(Table: summary of expected containment times, split by whether the faulty node's P(i) becomes ⊥ or P(i) switches to a new parent.)
Thus, the expected containment time is O(1).

  23. Another proof of convergence
Theorem 3. The proposed algorithm recovers from all single faults to a legal configuration in O(1) expected time.
Proof (using the martingale convergence theorem).
A martingale is a sequence of random variables X1, X2, X3, ... such that, for all n:
• E(|Xn|) < ∞, and
• E(Xn+1 | X1, ..., Xn) = Xn (for a super-martingale, replace = with ≤; for a sub-martingale, replace = with ≥).
We use the following corollary of the martingale convergence theorem:
Corollary. If Xn ≥ 0 is a super-martingale, then as n → ∞, Xn converges to a limit X with probability 1, and E(X) ≤ E(X0).

  24. Proof of convergence (continued)
Let Xi be the number of processes with enabled guards in step i. After 0 or 1 failure, X can be 0, 2, or 3 (by exhaustive enumeration).
• When Xi = 0: Xi+1 = 0 (already stable).
• When Xi = 2: E(Xi+1) = 1/2 · 0 + 1/2 · 2 = 1 ≤ 2.
• When Xi = 3: E(Xi+1) = 1/3 · 0 + 1/3 · 2 + 1/3 · 4 = 2 ≤ 3.
Thus X1, X2, X3, ... is a super-martingale. By the Corollary, Xn converges with probability 1 to some X with E(X) ≤ E(X0). Since X is non-negative by definition, Xn converges to 0 with probability 1, and the system stabilizes.

  25. Proof idea of weak stabilization
Correspondence between the rules of the DTY algorithm and ours:
• DTY R1 ↔ our R1
• DTY R2 ↔ our R2
• DTY R3 ↔ our R3, R4, R5 (these execute the same action, P(i) := k, as in DTY, but their guards are biased differently)
The DTY algorithm is weakly stabilizing, and so is ours.

  26. Stabilization from multiple failures
Theorem 4. When m → ∞, the expected recovery time from multiple failures is O(1) if the faults occur at distance 9 or more apart.
Proof sketch. Since the contamination number is 4, each fault affects non-faulty processes only up to distance 4; faults at distance ≥ 9 therefore have disjoint contamination regions, so no non-faulty process is influenced by both failures.
(Figure: two faults on the line, each with a contamination radius of 4.)

  27. Conclusion
1. With increasing m, the containment in space is tighter, but stabilization from arbitrary initial configurations slows down.
2. LCs = true, so the system is ready to deal with the next single failure as soon as LCp holds. This reduces the fault-gap and increases system availability.
3. The unbounded secondary variable x can be bounded using the technique discussed in the [Dasgupta, Ghosh, Xiao, SSS 2007] paper.
4. It is possible to extend this algorithm to a tree topology (but we do not do it here).
