This paper presents a comprehensive study on tolerating communication and processor failures in distributed real-time systems. It outlines the modeling of such systems, focusing on the fault model that addresses both communication and processor fault tolerances. By leveraging active and passive replication strategies, the research proposes effective techniques for enhancing system resilience. The conclusion discusses future work, including implementation into the SynDEx tool and simulations to demonstrate the efficacy of the proposed methods. This work aims to reduce overhead in failure recovery.
POPART Rhône-Alpes
Tolerating Communication and Processor Failures in Distributed Real-Time Systems
Hamoudi Kalla, Alain Girault and Yves Sorel
Grenoble, November 13, 2003
Outline
• Introduction
• Modeling distributed real-time systems
• The Fault model
• Related work
• Processor fault tolerance
• Communication fault tolerance
• Conclusion and future work
Introduction
High level program → Compiler → Model of the algorithm
Inputs to the heuristic: model of the algorithm, architecture specification, distribution constraints, execution times, real-time constraints, failure specification
Distribution and scheduling fault-tolerant heuristic → Fault-tolerant distributed static schedule → Code generator → Fault-tolerant distributed code
Modeling distributed real-time systems
• Algorithm Model
[Figure: algorithm graph with operations I1, I2, A, B, C and O]
« I1 and I2 » are input operations (sensors)
« O » is the output operation (actuator)
« A, B and C » are computation operations
Modeling distributed real-time systems
• Architecture Model
[Figure: processors P1, P2 and P3 connected by buses B1 and B2; each processor contains a computation unit, memory and co-processors]
« P1, P2 and P3 » are processors
« B1 and B2 » are communication buses
The Fault Model
• Tolerating a fixed number of fail-silent processor faults.
• Tolerating a fixed number of fail-silent bus faults, either complete or partial.
[Figure: three copies of the architecture (P1, P2, P3, B1, B2) illustrating processor faults, partial bus faults and complete bus faults]
Problem
• Find a distributed schedule of the algorithm on the architecture that tolerates both processor and communication failures.
[Figure: the algorithm graph (I1, I2, A, B, C, O) scheduled onto the architecture (P1, P2, P3, B1, B2)]
Related Work (1)
• Time-Triggered Architecture (TTA): active replication of operations and communications. (20 years = 100 master's theses and 25 doctoral theses)
• Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.
Related Work (2)
Time-Triggered Architecture (TTA):
• Processor fault tolerance: k replicas (copies) of each operation are actively allocated to separate processors.
• Communication fault tolerance: k’ replicas (copies) of each communication are actively allocated to separate buses.
Related Work (3)
Forward Error Correction (FEC):
• Processor fault tolerance: k replicas (copies) of each operation are actively or passively allocated to separate processors.
• Communication fault tolerance: first, each communication is encoded by the FEC code into k’ messages carrying redundant information. Next, the k’ messages are actively allocated to separate buses.
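To make the FEC idea concrete, here is a minimal sketch that assumes the simplest possible code: a single XOR parity block. The function names (`fec_encode`, `fec_decode`) and the choice of code are illustrative assumptions, not taken from the slides. A single parity block recovers at most one lost message; stronger codes such as Reed-Solomon recover more.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def fec_encode(data: bytes, n_data: int) -> List[bytes]:
    """Split data into n_data equal blocks and append one XOR parity block.

    The result has n_data + 1 messages, one per bus (hypothetical mapping).
    """
    size = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(size * n_data, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    return blocks + [reduce(xor_bytes, blocks)]


def fec_decode(received: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the data when at most one message (None) was lost."""
    missing = [i for i, m in enumerate(received) if m is None]
    if len(missing) > 1:
        raise ValueError("single-parity code recovers at most one lost message")
    if missing:
        present = [m for m in received if m is not None]
        received = list(received)
        # XOR of all surviving messages equals the missing one.
        received[missing[0]] = reduce(xor_bytes, present)
    return b"".join(received[:-1])[:length]
```

With three data blocks and one parity block, the receiver can rebuild the data after the loss of any single bus without asking for a retransmission.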
Processor fault tolerance
• Use active software replication of operations: each operation is replicated on k+1 different processors to tolerate k processor failures.
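The following sketch illustrates active replication under fail-silent processors. It assumes that at least one replica of each operation must survive, so k failures require k + 1 replicas. The round-robin placement and the names `replicate` and `tolerates` are hypothetical; the actual heuristic also accounts for execution times and real-time constraints, which are ignored here.

```python
from itertools import combinations


def replicate(operations, processors, k):
    """Place each operation on k + 1 distinct processors (round-robin)."""
    if k + 1 > len(processors):
        raise ValueError("need at least k + 1 processors")
    placement = {}
    for i, op in enumerate(operations):
        placement[op] = {
            processors[(i + j) % len(processors)] for j in range(k + 1)
        }
    return placement


def tolerates(placement, processors, k):
    """True if every operation keeps a live replica under any k failures."""
    for failed in combinations(processors, k):
        failed = set(failed)
        # An operation is lost when all of its replicas sit on failed processors.
        if any(replicas <= failed for replicas in placement.values()):
            return False
    return True


ops = ["I1", "I2", "A", "B", "C", "O"]
procs = ["P1", "P2", "P3"]
placement = replicate(ops, procs, k=1)
```

Here `tolerates(placement, procs, 1)` holds, while `tolerates(placement, procs, 2)` does not: two replicas per operation cannot survive two processor failures.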
Communication fault tolerance (1)
• Use passive software replication of communications, which needs a « watchdog timer ».
• Split each data communication into k messages (data fragmentation).
Communication fault tolerance (2)
• Use passive software replication of communications, which needs a « watchdog timer ».
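A minimal sketch of why passive replication needs a watchdog: only the primary copy of a message is sent, and the backup bus is used only after a timeout reveals the failure. The function `send_passive`, its parameters and the abstract time units are assumptions made for illustration; the simulation uses a logical clock rather than real timers.

```python
def send_passive(buses, failed, transfer_time, timeout):
    """Try buses in order, switching to the next one when the watchdog expires.

    Returns (bus_used, completion_time) in abstract time units.
    """
    if timeout < transfer_time:
        raise ValueError("timeout shorter than a transfer causes false alarms")
    clock = 0
    for bus in buses:
        if bus not in failed:
            return bus, clock + transfer_time
        # Fail-silent bus: the failure is noticed only when the watchdog expires.
        clock += timeout
    raise RuntimeError("all buses failed")
```

Without failures the message completes after one transfer time; each failed bus adds one full timeout before the backup is tried, which is the recovery overhead that passive replication pays in exchange for sending a single copy in the fault-free case.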
Communication fault tolerance (3)
• Split each data communication into k messages (data fragmentation).
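Data fragmentation can be sketched as follows, assuming one fragment per bus and near-equal fragment sizes. The helper names (`fragment`, `reassemble`, `lost_fragments`) are hypothetical, and the slides do not specify how fragments are sized or mapped to buses.

```python
from typing import Iterable, List, Set


def fragment(data: bytes, k: int) -> List[bytes]:
    """Split data into k consecutive fragments of near-equal size."""
    size = -(-len(data) // k)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(k)]


def reassemble(fragments: Iterable[bytes]) -> bytes:
    """Concatenate the fragments in order to rebuild the data."""
    return b"".join(fragments)


def lost_fragments(buses: List[str], failed: Set[str]) -> List[int]:
    """Indices of fragments whose bus failed (fragment i travels on buses[i])."""
    return [i for i, bus in enumerate(buses) if bus in failed]
```

For example, `fragment(b"0123456789", 2)` gives two 5-byte fragments, one for B1 and one for B2. If B2 fails, only fragment 1 is missing, so under this assumed mapping only half of the data needs to be sent again.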
Communication fault tolerance (4)
Why fragment the data of a communication?
• To distinguish between complete and partial communication faults.
Communication fault tolerance (5)
Why fragment the data of a communication?
• To enable rapid recovery from processor and bus failures.
Recovery from failures (1) • Processor fault
Recovery from failures (2) • Partial bus fault
Recovery from failures (3) • Complete bus fault
Conclusion and future work
Result
• A new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures.
Future work
• Implementation of the proposed method in the SynDEx tool.
• Simulations.