
Joint work with Michel Hurfin (IRISA), Carole Delporte-Gallet (LIAFA)

Designing Modular Services in the Scattered Byzantine Failure Model*. Emmanuelle Anceaume (IRISA / CNRS). Joint work with Michel Hurfin (IRISA), Carole Delporte-Gallet (LIAFA), Hugues Fauconnier (LIAFA), Gérard Le Lann (INRIA). *This work has been supported by the French Space Agency.


Presentation Transcript


  1. Designing Modular Services in the Scattered Byzantine Failure Model* Emmanuelle Anceaume (IRISA / CNRS). Joint work with Michel Hurfin (IRISA), Carole Delporte-Gallet (LIAFA), Hugues Fauconnier (LIAFA), Gérard Le Lann (INRIA). *This work has been supported by the French Space Agency.

  2. Fault Tolerant Distributed Applications • Fault tolerance is a critical issue • To tolerate failures (from benign to malign ones) physical redundancy is mandatory • Increase the overall reliability of the computing system • Replication techniques and agreement algorithms are needed

  3. Correct / Faulty Processes • Classically, the set of redundant processes is partitioned into two categories: • Correct processes: behave according to their specification during the whole application • Faulty processes: all the others • To correctly design fault-tolerant applications, a maximal subset of faulty processes must be assumed • Once these faulty processes have failed, no more failures can be tolerated

  4. Context of the study • Space domain context • Radiation and power supply glitches may cause transient faults (e.g., bit flips) in electronic systems • Running times of the applications are extremely long • Drastic limitations are imposed on the computer system • Most of the failures are recoverable and accidental • Physical phenomena can arbitrarily affect the behavior of a processor (altering the executed code, registers, …) • Checking procedures or reconfiguration are available → operational state, which may not be semantically correct

  5. Outline • Formalize the scattered byzantine failure model • Solve the clock synchronization problem and the timed atomic broadcast problem in this model • Characterize the post-fault period, i.e., the minimal period of time needed for a processor to recover • For non-atomic services, characterize the fore-fault period, i.e., the completion time of the service

  6. The Scattered Byzantine Failure Model • A processor can alternate correct and faulty periods • Good period: a processor behaves according to its specification • Faulty period: a processor behaves arbitrarily • No limitation on the number of faulty processors • Frees the application designer from the recurrent question • “What happens if the quorum of processors that were supposed to fail is exceeded ?” • Extension of the classical byzantine failure model

  7. Model of the System (1) • Computational model: • Finite set of processes {p1, …, pn} modeled as automata • Synchronous • Durations of computation steps are bounded • Local hardware clock with a bounded drift rate ρ with respect to real time: $(1+\rho)^{-1}(t_2-t_1) \le R_i(t_2)-R_i(t_1) \le (1+\rho)(t_2-t_1)$ • Transmission delays are upper bounded (δ) • Communication links are reliable • The communication network does not lose, falsify, or duplicate messages
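The drift bound on the hardware clock can be checked numerically. The following is a minimal sketch, assuming the drift rate is written `rho` and that clock readings and real times are plain floats; the function name is hypothetical and not part of the presentation.

```python
def within_drift_envelope(r1, r2, t1, t2, rho):
    """Check (t2 - t1) / (1 + rho) <= R(t2) - R(t1) <= (1 + rho) * (t2 - t1).

    r1, r2: hardware clock readings at real times t1 <= t2.
    rho:    assumed bound on the drift rate (hypothetical parameter name).
    """
    if t2 < t1:
        raise ValueError("expected t1 <= t2")
    elapsed_real = t2 - t1
    elapsed_clock = r2 - r1
    return elapsed_real / (1 + rho) <= elapsed_clock <= (1 + rho) * elapsed_real


# A clock that gained 5 ms over 100 s stays inside a 1e-4 envelope;
# a clock that gained a full second does not.
print(within_drift_envelope(0.0, 100.005, 0.0, 100.0, 1e-4))
print(within_drift_envelope(0.0, 101.0, 0.0, 100.0, 1e-4))
```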

  8. Model of the System (2) • Scattered byzantine failure model: • At any time, all processes can alternate correct and faulty periods • At any time, at most t processes are in a faulty period [Figure: timelines of processes p1, p2, …, pn, each alternating correct and faulty periods]
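The constraint that at most t processes are in a faulty period at any time, while every process may be faulty at some point, can be illustrated with a sweep over interval endpoints. This is a sketch under assumptions: faulty periods are given as half-open intervals [start, end), the periods of one process do not overlap each other, and all names are hypothetical.

```python
def max_simultaneously_faulty(faulty_periods):
    """Largest number of processes that are in a faulty period at the same instant.

    faulty_periods: dict mapping a process name to a list of (start, end)
    half-open intervals [start, end) during which that process is faulty.
    """
    events = []
    for periods in faulty_periods.values():
        for start, end in periods:
            events.append((start, 1))    # a faulty period begins
            events.append((end, -1))     # a faulty period ends
    # At equal times, ends (-1) sort before starts (+1): intervals are half-open.
    events.sort()
    current = worst = 0
    for _, change in events:
        current += change
        worst = max(worst, current)
    return worst


def respects_bound(faulty_periods, t):
    """True if at most t processes are faulty at any time."""
    return max_simultaneously_faulty(faulty_periods) <= t


# Every process is faulty at some point, yet never more than two at once.
timeline = {"p1": [(0, 2), (5, 7)], "p2": [(2, 4)], "p3": [(6, 8)]}
print(max_simultaneously_faulty(timeline))
```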

  9. Scattered Byzantine Failure Model: faulty periods [Figure: layered services, with the clock synchronization service at level k-1 and the atomic broadcast service at level k; a faulty period is made of a bad period followed by a post-fault period] • Bad period: byzantine failures • A bad period ends when an operational state is reached • Post-fault period: from operational state to safe state • Consistent with the state of correct processes • Purge of logs, validity of critical variables • Maximal duration $D_p^s$ is computable

  10. Scattered Byzantine Failure Model: correct periods [Figure: layered services, with the clock synchronization service at level k-1 and the atomic broadcast service at level k; a correct period is made of a good period and a fore-fault period] • To distinguish exactly completed activities from uncompleted ones, a correct period is split into: • Good period • Fore-fault period: reflects the worst-case execution time (WCET) of a long-lasting service s ($D_f^s$) • Ensures the completion of a service • Maximal duration $D_f^s$ is computable
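One way to read the two preceding slides is that, for a given service, the good period is what remains between two faults once the post-fault period has elapsed after recovery and the fore-fault period is reserved before the next fault. The sketch below encodes that reading only; the function and parameter names are hypothetical, and the interpretation itself is an assumption rather than a statement from the presentation.

```python
def good_period(operational_at, next_fault_at, d_post, d_fore):
    """Return (start, end) of the good period, or None if it is empty.

    operational_at: time at which the process reaches an operational state
                    (end of its bad period).
    next_fault_at:  time at which its next faulty period begins.
    d_post:         maximal post-fault duration of the service (assumed known).
    d_fore:         maximal fore-fault duration of the service (assumed known).
    """
    start = operational_at + d_post   # the post-fault period must elapse first
    end = next_fault_at - d_fore      # the last d_fore time units are the fore-fault period
    return (start, end) if start < end else None


print(good_period(10.0, 100.0, 5.0, 20.0))
print(good_period(10.0, 30.0, 5.0, 20.0))
```

When the time between recovery and the next fault is shorter than the sum of the two durations, the sketch returns `None`, meaning no activity started in that window is guaranteed to complete.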

  11. Clock Synchronization Service (1) • Makes it possible to overcome the effects of drifts and failures • Guarantees that: • The maximal deviation between all logical clocks is bounded. Agreement property: there is a constant Dmax such that $|C_i(t) - C_j(t)| \le D_{max}$ • Logical clocks are within a linear envelope of real time. Accuracy property: there exists a constant γ such that $t/(1+\gamma) + a \le C_i(t) \le (1+\gamma)t + b$ • A process is in a bad period if it deviates from its algorithm or if the rate of drift ρ of its physical clock is not bounded
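The agreement and accuracy properties can be phrased as two small predicates over sampled clock values. This is a sketch, assuming the accuracy constant is called `gamma`, that `a` and `b` are the offsets of the linear envelope, and that only clocks of processes in correct periods are passed in; all names are hypothetical.

```python
def agreement_holds(clock_values, d_max):
    """True if all logical clocks sampled at the same instant differ by at most d_max."""
    return max(clock_values) - min(clock_values) <= d_max


def accuracy_holds(clock_value, t, gamma, a, b):
    """True if a logical clock reading at real time t lies in the linear envelope
    t / (1 + gamma) + a <= C(t) <= (1 + gamma) * t + b."""
    return t / (1 + gamma) + a <= clock_value <= (1 + gamma) * t + b


# Three clocks sampled at real time t = 1000.
clocks = [1000.2, 1000.5, 999.9]
print(agreement_holds(clocks, d_max=1.0))
print(all(accuracy_holds(c, 1000.0, gamma=1e-3, a=-1.0, b=1.0) for c in clocks))
```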

  12. Clock Synchronization Service (2) • Principles of the algorithm of Srikanth and Toueg [ST87] • Classical failure model
      At process i:
        if C(t) = kP: send (Sync-init, k) to all the processes
        upon receipt of (Sync-init, k) from t+1 processes: relay (Sync-echo, k) to all processes
        if 2t+1 (Sync-echo, k, j) have been received: accept (Synchro, k)
        if accept(Synchro, k): C(t) := kP + α
      Constraints, with ρ the clock drift bound and δ the transmission delay bound: $\alpha \ge ((1+\rho)D_{max} + 2\delta)(1+\rho)$ • $D_{max} \ge (P(1+\rho) + 2\delta)\,dr + 2\delta(1+\rho)$ • $P > 2\delta(1+\rho) + D_{max}$
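The threshold logic of one resynchronization round can be sketched as a small per-process state object. This is an illustration of the two thresholds stated on the slide (t+1 init messages before relaying, 2t+1 echoes before accepting), not the authors' implementation: message transport, timers, and the recovery extensions are left out, senders are counted once each, and the names `RoundState`, `period`, and `alpha` (the adjustment added to kP on acceptance) are assumptions.

```python
class RoundState:
    """State kept by one process for resynchronization round k."""

    def __init__(self, t, period, alpha):
        self.t = t                # bound on the number of faulty processes
        self.period = period      # resynchronization period P
        self.alpha = alpha        # assumed name for the adjustment added on acceptance
        self.init_from = set()    # distinct senders of (Sync-init, k)
        self.echo_from = set()    # distinct senders of (Sync-echo, k)
        self.echo_sent = False
        self.accepted = False

    def on_init(self, sender):
        """Record an init message; return True if an echo must now be relayed."""
        self.init_from.add(sender)
        if not self.echo_sent and len(self.init_from) >= self.t + 1:
            self.echo_sent = True
            return True
        return False

    def on_echo(self, sender, k):
        """Record an echo; return the new clock value kP + alpha on acceptance, else None."""
        self.echo_from.add(sender)
        if not self.accepted and len(self.echo_from) >= 2 * self.t + 1:
            self.accepted = True
            return k * self.period + self.alpha
        return None


state = RoundState(t=1, period=10.0, alpha=0.5)
print([state.on_init(s) for s in ("p1", "p1", "p2")])
print([state.on_echo(s, 3) for s in ("p1", "p2", "p3")])
```

Counting distinct senders in sets means a single faulty process cannot reach a threshold by repeating the same message.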

  13. Clock Synchronization Service (3) • Extension of this algorithm to ensure that: • Local structures of the processes in correct periods are never corrupted by the recovering processes • Faulty processes recover by synchronizing their local clocks within a bounded delay (i.e., Dp is bounded)

  14. Clock Synchronization Service (4) [Callouts on the slide: Validity test 1, Validity test 2, Clean local structures]
      if C(t) = kP: broadcast (Sync, k, i) to all the other processes
      if (Sync, m, j) is received at time T = C(t) from l:
        if (l = j) and ([…] - Dmax)(1+ρ) ≤ T - mP(1+ρ) ≤ ([…] + Dmax)(1+ρ) then relay this message to all the processes, otherwise discard it
        else if (l ≠ j) then add (sync, m, j, l) to Buff-rec
      if ([…] l': l' ≠ l s.t. (sync, m, j, l') ∈ Buff-rec) then add (sync, m, j) to Buff-accepted
      if ([…] j': j' ≠ j s.t. (sync, m, j') ∈ Buff-accepted) and (k […] m+1) then
        C(t) := mP + α
        Buff-rec := Buff-accepted := Ø
        k := k+1

  15. Clock Synchronization Service (5) • Proposition: suppose that process p recovers an operational state at time t (i.e., enters a post-fault period at time t); then by time $t + 2((P-\alpha)(1+\rho)+2\delta)$, p is resynchronized with all the processes in correct periods • Post-fault period duration = $2((P-\alpha)(1+\rho)+2\delta)$ time units • Fore-fault period duration = 0 time units • As in [ST87], the algorithm achieves optimal accuracy
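To get a feel for the size of the post-fault period, the bound of the proposition can be evaluated for sample parameters. The sketch reads the bound as 2((P - alpha)(1 + rho) + 2 delta); the roles given here to `alpha` (adjustment applied at resynchronization), `rho` (drift bound), and `delta` (transmission delay bound) are assumptions, and the numeric values are purely illustrative.

```python
def post_fault_duration(period, alpha, rho, delta):
    """Evaluate 2 * ((P - alpha) * (1 + rho) + 2 * delta).

    period: resynchronization period P
    alpha:  assumed adjustment applied when a round is accepted
    rho:    assumed bound on the clock drift rate
    delta:  assumed bound on message transmission delays
    """
    return 2 * ((period - alpha) * (1 + rho) + 2 * delta)


# Illustrative values only: P = 10 s, alpha = 0.5 s, no drift, delta = 0.25 s.
print(post_fault_duration(10.0, 0.5, 0.0, 0.25))
```

Under this reading the recovery delay is dominated by the period P: a recovering process needs roughly two resynchronization rounds, so shortening P shortens recovery at the cost of more synchronization traffic.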

  16. Δ-Atomic Broadcast • Powerful communication paradigm • Agreement on the set of received messages and their order • 2 primitives: broadcast and deliver • Revisited properties • Validity: if process pi broadcasts (m,i) at time t during its good period, then every process in a good period at time t delivers (m,i) exactly once during the corresponding correct period • Agreement: if process pj delivers (m,i) at time t during its good period, then every process in a good period at time t delivers (m,i) • Δ-timeliness: if process pi delivers (m,j) at time t during its good period, then pj broadcast (m,j) between time t-Δ and t • Total order: if two processes pi and pj deliver two messages (m,k1) and (m,k2) during a correct period, then both messages are delivered in the same order by pi and pj
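Two of the revisited properties lend themselves to simple checks over recorded delivery logs. The sketch below assumes each log is the list of messages a process delivered during one correct period, in delivery order and without repeats, and writes the timeliness bound as `big_delta`; the helper names are hypothetical.

```python
def same_relative_order(log_a, log_b):
    """Total order check: messages delivered by both processes appear in the same order."""
    common = set(log_a) & set(log_b)
    return [m for m in log_a if m in common] == [m for m in log_b if m in common]


def timely(broadcast_time, delivery_time, big_delta):
    """Timeliness check: the message was broadcast within big_delta before its delivery."""
    return delivery_time - big_delta <= broadcast_time <= delivery_time


# Messages are (payload, sender) pairs, as on the slide.
log_pi = [("a", 1), ("b", 2), ("c", 3)]
log_pj = [("b", 2), ("c", 3)]          # pj recovered after ("a", 1) was delivered
print(same_relative_order(log_pi, log_pj))
print(timely(broadcast_time=10.0, delivery_time=12.0, big_delta=3.0))
```

Comparing only the messages common to both logs reflects that a process which was in a faulty period may legitimately have missed some deliveries.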

  17. Δ-Atomic Broadcast [Figure: layered services, with the clock synchronization service at level k-1 and the atomic broadcast service at level k; correct periods are split into good and fore-fault periods; the figure is annotated with the expressions […](1+ρ) = 2δ(1+ρ) + 2δ + 3Dmax and 2((P-α)(1+ρ) + 2δ)]

  18. Conclusion and future work • Byzantine recovery problem: • Formalization of the scattered byzantine failure model • Two fundamental agreement problems: • Clock synchronization problem • Timed bounded atomic broadcast problem • Revisited their specifications • Designed simple and efficient solutions, and computed Df and Dp • Future work: • Designing independent services • Asynchronous model: self-stabilization techniques?
