
Fault-containment in Weakly Stabilizing Systems



Presentation Transcript


  1. Fault-containment in Weakly Stabilizing Systems. Anurag Dasgupta, Sukumar Ghosh, Xin Xiao. University of Iowa.

  2. Preview
• Weak stabilization (Gouda 2001) guarantees reachability and closure of the legal configuration. Once "stable", if there is a minor perturbation, apparently no recovery guarantee exists, let alone a guarantee of "efficient recovery".
• We take a weakly stabilizing leader election algorithm and add fault-containment to it.

  3. Our contributions
• An exercise in adding fault-containment to a weakly stabilizing leader election algorithm on a line topology. Processes are anonymous.
• Containment time = O(1) from all single failures
• lim m→∞ (contamination number) is O(1) (precisely 4), where m is a tuning parameter. (Contamination number = maximum number of non-faulty processes that change their states during recovery.)

  4. The big picture

  5. Model and Notations
Consider n processes in a line topology.
N(i) = neighbors of process i
Variable: P(i) ∈ N(i) ∪ {⊥} (parent of i)
Macro: C(i) = {q ∈ N(i) : P(q) = i} (children of i)
Predicate: Leader(i) ≡ (P(i) = ⊥)
Legal configuration:
• For exactly one process i: P(i) = ⊥
• ∀j ≠ i: P(j) = k ⇒ P(k) ≠ j
(Figure: a line of nodes annotated with C(i), node i, P(i), and the leader.)
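For concreteness, here is a minimal Python rendering of this model (our sketch, not from the slides; the list representation and None for ⊥ are our choices):

    # Minimal sketch of the slide-5 model. Assumptions not in the slide:
    # processes are indexed 0..n-1 on the line, and the parent pointer is
    # a list P with None standing for the bottom symbol.
    BOTTOM = None

    def neighbors(i, n):
        """N(i): neighbors of process i on a line of n processes."""
        return [j for j in (i - 1, i + 1) if 0 <= j < n]

    def children(i, P, n):
        """C(i) = {q in N(i) : P(q) = i}."""
        return [q for q in neighbors(i, n) if P[q] == i]

    def is_leader(i, P):
        """Leader(i) holds iff P(i) is bottom."""
        return P[i] is BOTTOM

    def is_legal(P, n):
        """Exactly one leader, and no pair of mutual parents."""
        if sum(is_leader(i, P) for i in range(n)) != 1:
            return False
        return all(P[P[j]] != j for j in range(n) if not is_leader(j, P))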

  6. Model and Notations (continued)
(The figure of the line with C(i), node i, P(i), and the leader repeats here.)
• Shared memory model and a central scheduler
• Weak fairness of the scheduler
• Guarded action by a process: g → A
• A computation is a sequence of (global) states and state transitions
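This execution model can be sketched as follows (our Python illustration; the random pick is one way to realize a weakly fair central scheduler, and matches the randomized scheduler the deck uses from slide 10 onward):

    import random

    # Sketch of the execution model: a central scheduler repeatedly lets one
    # process fire one enabled guarded action g -> A; a computation is the
    # resulting sequence of global states.

    def run(step, state, n, max_steps=10_000):
        """step(i, state) fires one enabled action at process i and returns
        True, or returns False if no guard of i is enabled."""
        for _ in range(max_steps):
            order = random.sample(range(n), n)       # randomized pick
            if not any(step(i, state) for i in order):
                return state                         # nothing enabled: stable
        return state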

  7. Stabilization
A stable (or legal) configuration satisfies a predicate LC defined in terms of the primary variables p that are observable by the application. However, fault-containment often needs secondary variables (a.k.a. auxiliary or state variables) s. Thus:
• Local state of process i = (pi, si)
• Global state of the system = (p, s), where p = the set of all pi and s = the set of all si
• (p, s) ∈ LC ≡ (p ∈ LCp) ∧ (s ∈ LCs)

  8. Definitions
• Containment time is the maximum time needed to establish LCp from a 1-faulty configuration.
• Containment in space means that the primary variables of only O(1) processes change their state during recovery from any 1-faulty configuration.
• Fault-gap is the time to reach LC (both LCp and LCs) from any 1-faulty configuration.
(Figure: a timeline marking when LCp and LCs are restored; the fault gap spans until both hold.)

  9. Weakly stabilizing leader election
We start from the weakly stabilizing leader election algorithm by Devismes, Tixeuil, Yamashita [ICDCS 2007], and then modify it to add fault-containment. Here is the DTY algorithm for an array of processes.
DTY algorithm: program for any process i in the array
Guarded actions:
R1 :: ¬Leader(i) ∧ (N(i) = C(i)) → be a leader
R2 :: ¬Leader(i) ∧ (N(i) \ (C(i) ∪ {P(i)}) ≠ ∅) → switch parent
R3 :: Leader(i) ∧ (N(i) ≠ C(i)) → parent := k, where k ∉ C(i)
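A runnable sketch of the three DTY rules (our reconstruction on top of the helpers above, not the authors' code; the choice among eligible neighbors is arbitrary here, and "switch parent" is read as adopting a neighbor outside C(i) ∪ {P(i)}):

    # Sketch of the DTY guarded actions, built on the earlier model sketch.

    def dty_step(i, P, n):
        """Fire one enabled DTY rule at process i; return True on a move."""
        N = neighbors(i, n)
        C = children(i, P, n)
        if P[i] is not BOTTOM:
            if set(N) == set(C):
                P[i] = BOTTOM                      # R1: all neighbors are children
                return True
            eligible = [k for k in N if k not in C and k != P[i]]
            if eligible:
                P[i] = random.choice(eligible)     # R2: switch parent
                return True
        else:
            eligible = [k for k in N if k not in C]
            if eligible:
                P[i] = random.choice(eligible)     # R3: leader adopts k not in C(i)
                return True
        return False

Driving it with the scheduler sketch, e.g. run(lambda i, P: dty_step(i, P, len(P)), P, n), exhibits the probability-1 convergence that weak stabilization promises under a randomized scheduler.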

  10. Effect of a single failure
With a randomized scheduler, the weakly stabilizing system recovers to a legal configuration with probability 1. However, if a single failure occurs, the recovery time can be as large as n (by an argument similar to the Gambler's ruin problem). For fault-containment, we need something better: we bias the randomized scheduler to achieve our goal. The technique is borrowed from [Dasgupta, Ghosh, Xiao: SSS 2007]; here we show that it is powerful enough to solve a larger class of problems.

  11. Biasing a random scheduler
For fault-containment, each process i uses a secondary variable x(i). A node i updates its primary variable P(i) when the following conditions hold:
1. The guard involving the primary variables is true
2. The randomized scheduler chooses i
3. x(i) ≥ x(k), where k ∈ N(i) is the neighbor involved in the action

  12. Biasing a random scheduler (continued)
After the action, x(i) is incremented as x(i) := max{x(q) : q ∈ N(i)} + m, where m ∈ Z+ (call this update x(i); m is a tuning parameter). When x(i) < x(k) but conditions 1-2 hold, the primary variable P(i) remains unchanged; only x(i) is incremented by 1 (call this increment x(i)).
(Figure, with m = 5 and neighbors j, i, k holding x(j) = 8, x(i) = 10, x(k) = 7: an UPDATE of x(i) sets x(i) := max(8, 7) + 5 = 13, while an INCREMENT of x(k) merely raises x(k) from 7 to 8.)
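In code, the two moves look like this (our sketch, reusing neighbors from above; M plays the role of the tuning parameter m):

    # Sketch of the two secondary-variable moves on this slide.
    M = 5   # the tuning parameter m; 5 is the slide's example value

    def update_x(i, x, n):
        """update x(i): leap past every neighbor by the margin m."""
        x[i] = max(x[q] for q in neighbors(i, n)) + M

    def increment_x(i, x):
        """increment x(i): creep up by 1 while i lacks priority."""
        x[i] += 1

    # The slide's example with x(j)=8, x(i)=10, x(k)=7 on a 3-node line:
    x = [8, 10, 7]
    update_x(1, x, 3)      # x becomes [8, 13, 7]: max(8, 7) + 5 = 13
    # increment_x(2, x) would instead turn x(k) = 7 into 8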

  13. The Algorithm
Algorithm 1 (containment): program for process i
Guarded actions:
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
R2 :: (P(i) = ⊥) ∧ (∃k ∈ N(i) \ C(i)) → P(i) := k
R3a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ (x(i) ≥ x(k)) → P(i) := k; update x(i)
R3b :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ (x(i) < x(k)) → increment x(i)
R4a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ (x(i) ≥ x(k)) → P(i) := k
R4b :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ (x(i) < x(k)) → increment x(i)
R5 :: (P(i) = j) ∧ (P(j) = ⊥) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) → P(i) := k
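The whole program can be sketched as follows (our reconstruction, not the authors' code; the slide's guards can overlap, and we resolve overlaps with the fixed priority R1, R2, R5, R3, R4, which is one admissible scheduler choice; we also read "P(i) := k" as requiring k ≠ P(i), which the slide leaves implicit):

    # Sketch of Algorithm 1, reusing the helpers from the earlier sketches.

    def containment_step(i, P, x, n):
        """Fire one enabled rule of Algorithm 1 at process i; report which."""
        N = neighbors(i, n)
        C = children(i, P, n)
        # Candidate new parents; k = P(i) is excluded (our assumption).
        others = [k for k in N if k != P[i] and P[k] is not BOTTOM and P[k] != i]
        leaders = [k for k in N if k != P[i] and P[k] is BOTTOM]
        if P[i] is not BOTTOM and set(N) == set(C):
            P[i] = BOTTOM                            # R1: become the leader
            return "R1"
        if P[i] is BOTTOM:
            ks = [k for k in N if k not in C]
            if ks:
                P[i] = ks[0]                         # R2: leader steps down
                return "R2"
            return None
        if P[P[i]] is BOTTOM and others:
            P[i] = others[0]                         # R5: parent is the leader
            return "R5"
        if others:
            k = others[0]
            if x[i] >= x[k]:
                P[i] = k
                update_x(i, x, n)                    # R3a: move and jump x
                return "R3a"
            increment_x(i, x)                        # R3b: wait, creep x up
            return "R3b"
        if leaders:
            k = leaders[0]
            if x[i] >= x[k]:
                P[i] = k                             # R4a: attach to a leader
                return "R4a"
            increment_x(i, x)                        # R4b
            return "R4b"
        return None

Plugged into the run loop sketched earlier, this replays the case analyses of slides 15-18.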

  14. Analysis of containment
Consider six cases:
1. Fault at the leader
2. Fault at distance-1 from the leader
3. Fault at distance-2 from the leader
4. Fault at distance-3 from the leader
5. Fault at distance-4 from the leader
6. Fault at distance-5 or greater from the leader

  15. Case 1: fault at the leader node
(Figure: an 8-node line, nodes 0-7, replayed move by move; the corrupted state is marked.)
Moves: R1 applied by node 5; then R1 applied by node 4: node 4 is the new leader.
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
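This case can be replayed with the sketches above (a hypothetical driver; the 8-node line, right-pointing orientation, and fault injection are our choices, not the slide's):

    # Hypothetical demo: corrupt the leader of a stable 8-node line and let
    # Algorithm 1 recover (uses the run/containment_step sketches above).
    n = 8
    P = [1, 2, 3, 4, 5, 6, 7, BOTTOM]   # legal: each node points right, 7 leads
    x = [0] * n
    P[7] = 6                            # single fault at the leader
    run(lambda i, s: containment_step(i, s[0], s[1], n) is not None, (P, x), n)
    print(P.index(BOTTOM))              # a leader exists again (node 6 or 7)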

  16. Case 2: fault at distance-1 from the leader node
(Figure: an 8-node line, nodes 0-7, replayed move by move; the faulty node is marked.)
Moves: R1 applied by node 3; then R2 applied by node 5.
R2 :: (P(i) = ⊥) ∧ (∃k ∈ N(i) \ C(i)) → P(i) := k

  17. Case 5: fault at distance-4 from the leader node
(Figure: an 8-node line, nodes 0-7, replayed move by move.)
Moves: R4a by node 2 (x(2) > x(1)); R3a by node 3 (x(3) > x(2)); R5 by node 4; R2 by node 5; then stable.
Non-faulty processes up to distance 4 from the faulty node are affected.
R4a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ (x(i) ≥ x(k)) → P(i) := k

  18. Case 6: fault at distance ≥ 5 from the leader node
(Figure: an 8-node line, nodes 0-7, with the current leader marked, replayed move by move.)
Moves: R4a by node 2 (x(2) > x(1)); R3a by node 3 (x(3) > x(2), x(4)); R3a by node 3 and R5 by node 2; R2 by node 1; recovery complete.
With a high m, it is difficult for node 4 to change its parent, but node 3 can do so easily.

  19. Fault-containment in space
Theorem 1. As m → ∞, the effect of a single failure is restricted to within distance-4 of the faulty process, i.e., the algorithm is spatially fault-containing.
Proof idea. Exhaustive case-by-case analysis. The worst case occurs when a node at distance-4 from the leader fails, as shown earlier.

  20. Fault-containment in time
Theorem 2. The expected number of steps needed to contain a single fault is independent of n. Hence the algorithm is fault-containing in time.
Proof idea. Case-by-case analysis. When a node beyond distance-4 from the leader fails, its impact on the time complexity remains unchanged.

  21. Fault-containment in time
Case 1: the leader fails.
(Figure: an 8-node line with the faulty leader marked.)
Recovery is completed in a single move, regardless of whether node 3 or node 4 executes the move.
Case 2: a node i at distance-1 from the leader fails.
• P(i) becomes ⊥: recovery is completed in one step
• P(i) switches to a new parent: recovery time = 2 + ∑_{n=1}^{∞} n/2^n = 4
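The closed form of this series follows from a standard geometric-series manipulation (our worked step, in LaTeX):

    \sum_{n=1}^{\infty} \frac{n}{2^n}
      = \sum_{n=1}^{\infty} \sum_{j=n}^{\infty} \frac{1}{2^j}
      = \sum_{n=1}^{\infty} \frac{1}{2^{n-1}}
      = 2,
    \qquad \text{hence} \qquad
    2 + \sum_{n=1}^{\infty} \frac{n}{2^n} = 2 + 2 = 4 .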

  22. Fault-containment in time
(Table: summary of expected containment times, split by whether the faulty node's P(i) becomes ⊥ or P(i) switches to a new parent.)
Thus, the expected containment time is O(1).

  23. Another proof of convergence
Theorem 3. The proposed algorithm recovers from all single faults to a legal configuration in O(1) expected time.
Proof (using the martingale convergence theorem).
A martingale is a sequence of random variables X1, X2, X3, ... such that, for all n:
• E(|Xn|) < ∞, and
• E(Xn+1 | X1, ..., Xn) = Xn (for a super-martingale, replace = with ≤; for a sub-martingale, replace = with ≥).
We use the following corollary of the martingale convergence theorem:
Corollary. If Xn ≥ 0 is a super-martingale, then as n → ∞, Xn converges to a limit X with probability 1, and E(X) ≤ E(X0).

  24. Proof of convergence (continued)
Let Xi be the number of processes with enabled guards in step i. After 0 or 1 failure, X can be 0, 2, or 3 (by exhaustive enumeration).
• When Xi = 0: Xi+1 = 0 (already stable).
• When Xi = 2: E(Xi+1) = 1/2 · 0 + 1/2 · 2 = 1 ≤ 2.
• When Xi = 3: E(Xi+1) = 1/3 · 0 + 1/3 · 2 + 1/3 · 4 = 2 ≤ 3.
Thus X1, X2, X3, ... is a super-martingale. By the Corollary, Xn converges with probability 1 to some X with E(X) ≤ E(X0). Since X is non-negative by definition, Xn converges to 0 with probability 1, and the system stabilizes.

  25. Proof idea of weak stabilization
Correspondence between the rules of the DTY algorithm and ours:
• DTY R1 ↔ our R1
• DTY R2 ↔ our R2
• DTY R3 ↔ our R3, R4, R5 (these execute the same action, P(i) := k, as in DTY, but their guards are biased differently)
The DTY algorithm is weakly stabilizing, and so is ours.

  26. Stabilization from multiple failures
Theorem 4. When m → ∞, the expected recovery time from multiple failures is O(1) if the faults occur at distance 9 or more apart.
Proof sketch. Since the contamination number is 4, each fault affects non-faulty processes only up to distance 4; faults at distance ≥ 9 therefore have disjoint contamination regions, so no non-faulty process is influenced by both failures.
(Figure: two faults on the line, each with a contamination radius of 4.)

  27. Conclusion
1. With increasing m, the containment in space is tighter, but stabilization from arbitrary initial configurations slows down.
2. LCs = true, so the system is ready to deal with the next single failure as soon as LCp holds. This reduces the fault-gap and increases system availability.
3. The unbounded secondary variable x can be bounded using the technique discussed in the [Dasgupta, Ghosh, Xiao, SSS 2007] paper.
4. It is possible to extend this algorithm to a tree topology (but we do not do it here).
