“Revisiting Fault Diagnosis Agreement in a New Territory”

“Revisiting Fault Diagnosis Agreement in a New Territory” • S. C. Wang and K. Q. Yan • Operating Systems Review, April 2004, pp. 41–61. • An extension of the Byzantine Generals algorithm, and hot off the press

Agreement Problem • In the Byzantine Generals problem, a commanding general issues an “order” and all loyal lieutenant generals must agree on the same order. • A related subproblem is the consensus problem: each processor has its own initial value and must communicate with all other processors so that the healthy processors reach a common value.

Consensus constraints • All the healthy processors agree on the common value (Consensus) • If there exists a common initial value v_i among ALL the processors, then all the healthy processors must agree on v_i • Most protocols for solving Byzantine Agreement or consensus are fault-masking protocols: they come to consensus without the fault affecting the outcome.
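The two constraints can be written as a small checker over the outcome of one run. This is only a sketch: the function name `satisfies_consensus` and the dictionary-based representation of initial and decided values are assumptions made for illustration, not something taken from the paper.

```python
def satisfies_consensus(initial, decided, healthy):
    """Check the two consensus constraints for one run.

    initial: dict mapping every processor to its initial value
    decided: dict mapping every processor to the value it decided on
    healthy: set of healthy processor ids (hypothetical representation)
    """
    healthy_decisions = {decided[p] for p in healthy}

    # Constraint 1: all healthy processors agree on one common value.
    agreement = len(healthy_decisions) == 1

    # Constraint 2: if ALL processors share an initial value v,
    # the healthy processors must agree on that v.
    initial_values = set(initial.values())
    if len(initial_values) == 1:
        validity = healthy_decisions == initial_values
    else:
        validity = True  # nothing to check when initial values differ

    return agreement and validity
```

For example, with three processors that all start at 0, a run in which the two healthy processors decide 0 passes, while a run in which they both decide 1 fails the second constraint even though they agree with each other.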

Fault Diagnosis Agreement (FDA) • The goal is to make each healthy processor able to detect and locate the faulty components in the distributed system • ALL the healthy processors identify the common set of faulty components in the process of reaching consensus (Agreement) • No healthy component is falsely detected as faulty by any healthy processor (Fairness)
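The Agreement and Fairness conditions can be stated as a check on what each healthy processor reports. A minimal sketch follows; the name `satisfies_fda` and the representation of reports as sets of component ids are assumptions for illustration.

```python
def satisfies_fda(detected, healthy_processors, healthy_components):
    """Check the FDA Agreement and Fairness conditions.

    detected: dict mapping a processor to the set of components
              it reports as faulty (hypothetical representation)
    healthy_processors: ids of the healthy processors
    healthy_components: set of components that are actually healthy
    """
    reports = [frozenset(detected[p]) for p in healthy_processors]

    # Agreement: every healthy processor reports the same faulty set.
    agreement = len(set(reports)) <= 1

    # Fairness: no healthy component appears in any healthy processor's report.
    fairness = all(not (report & set(healthy_components)) for report in reports)

    return agreement and fairness
```

Components can be anything hashable, for instance processor ids or pairs such as `(1, 4)` naming the link between two processors.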

Paper assumes dual failure mode on the network • Most previous papers assume that the faulty components are processors only and that the network is fault-free. • Here we assume that the processors are fault-free and that the network may have a fault. • Also, most other papers assume that the fault is malicious only. Here we assume dual failure: • Malicious faults (a random value is sent), and • Dormant faults (no value/crash or a stuck-at value is sent). • Assume that a healthy processor can detect components with dormant faults.

Assumptions • A synchronous distributed system whose processors are reliable during the protocol execution • Faults such as a crash, a stuck-at value, noise, or an intruder may interfere with message transmission • n-processor fully connected network, with m malicious faults, d dormant faults, m <= ceiling[(n-d-3)/2]
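To get a feel for the fault bound, it can be evaluated for small networks. The sketch below simply computes the inequality as stated on this slide; the helper names `max_malicious` and `within_bound` are hypothetical.

```python
import math


def max_malicious(n, d):
    """Largest m permitted by m <= ceiling[(n - d - 3) / 2]."""
    return math.ceil((n - d - 3) / 2)


def within_bound(n, m, d):
    """True if m malicious and d dormant faults satisfy the stated bound."""
    return m >= 0 and d >= 0 and m <= max_malicious(n, d)


print(max_malicious(5, 1))    # 1
print(within_bound(5, 2, 1))  # False
```

With n = 5 processors and d = 1 dormant fault, the bound allows one malicious fault but not two.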

Dual Fault Detection Consensus (DFDC) Algorithm • Three phases: • Message exchange phase • Decision making phase • Fault detection phase • The message exchange phase and the decision making phase are (similar to) OM(1) in the Byzantine Generals paper. This results in a matrix of information at each processor, MAT_i, which is used to construct a majority vector, MAJ_i
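The step from MAT_i to MAJ_i can be sketched as a position-wise majority vote. The slide does not say how the matrix is laid out, so the code assumes that row k of MAT_i holds the vector relayed by processor k and that the vote is taken down each column; both the layout and the helper names are assumptions, not the paper's definition.

```python
from collections import Counter


def majority(values, default=None):
    """Return the value held by a strict majority of `values`, else `default`."""
    if not values:
        return default
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def majority_vector(mat):
    """Column-wise majority of a matrix given as a list of rows.

    Assumed layout: mat[k][j] is what processor k relayed about processor j.
    """
    return [majority(list(column)) for column in zip(*mat)]


mat_1 = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
print(majority_vector(mat_1))  # [0, 0, 0, 1, 1]
```

The matrix `mat_1` is made-up data for a five-processor run; a few corrupted entries are outvoted in every column, so the majority vector recovers one value per processor.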

Fault detection phase • Each processor sends its MAT_i to every other processor. Each healthy processor i then uses the received matrices to find the faults: • Take the majority value in each position of the matrix to get FDMAT_i • If no majority exists for the (i,j)th position, use the negative value of the (i,j)th position of the MAT_j that was sent
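The construction of FDMAT can be sketched directly from the two rules above. This is a literal reading of the slide under stated assumptions: entries are binary (0 or 1), so the "negative value" is taken to be 1 minus the entry; `mats[k]` stands for the MAT received from processor k, with zero-based indices; and the function name `build_fdmat` is hypothetical. How the faults are then read off FDMAT is not covered here.

```python
from collections import Counter


def build_fdmat(mats):
    """Build FDMAT from the matrices received from all n processors.

    mats[k][i][j] is entry (i, j) of the MAT sent by processor k.
    Entries are assumed to be 0 or 1.
    """
    n = len(mats)
    fdmat = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            votes = [mats[k][i][j] for k in range(n)]
            value, count = Counter(votes).most_common(1)[0]
            if count > n / 2:
                # A strict majority exists for this position.
                fdmat[i][j] = value
            else:
                # No majority: negate the entry of the MAT sent by processor j.
                fdmat[i][j] = 1 - mats[j][i][j]
    return fdmat
```

For two processors sending `[[0, 1], [1, 1]]` and `[[0, 0], [1, 1]]`, only position (0, 1) has no majority, so it becomes the negation of the second matrix's entry there, and the result is `[[0, 1], [1, 1]]`.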

[Figure: example network of five processors with initial values P1=0, P2=0, P3=0, P4=1, P5=1, with malicious faulty and dormant faulty components marked]

[Figure: the same example, showing MAT_1 and MAJ_1]

[Figure: the same example, showing MAT_2,3 and MAJ_2,3]

[Figure: fault detection phase with processor P1, showing the MATs received from P1, P2, P3, P4 and P5 and the resulting FDMAT]
