Application Level Fault Tolerance and Detection

Principal Investigators: C. Mani Krishna and Israel Koren. Presented by: Eric Ciocca. Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003.

Presentation Transcript


  1. Application Level Fault Tolerance and Detection • Principal Investigators: C. Mani Krishna, Israel Koren • Presented by: Eric Ciocca • Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003

  2. What is ALFTD? • Application Level Fault Tolerance and Detection • ALFTD complements existing system- or algorithm-level fault tolerance by leveraging information available only at the application level • Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance • ALFTD may be used alone or to supplement other fault detection schemes

  3. ALFTD Overview • Application Level Fault Tolerance and Detection allows a system to survive both data faults and system (instruction/hardware) faults • System faults cause a process to eventually cease functioning • Data faults cause a process to continue running with incorrect results • ALFTD is scalable • The level of fault tolerance can be traded off against the time overhead invested

  4. Principles of ALFTD [Figure: Nodes 1 to 4, with primaries P1 to P4 and secondaries S1 to S4 distributed across them] • To provide system fault tolerance, every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node’s work (S, secondary) • If a fault should corrupt a process, the corresponding secondary of that task will still produce output, albeit at a lower (but acceptable) quality
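The placement described on this slide can be sketched roughly as follows. This is only an illustration, not the OTIS code: the ring layout (node i hosts the secondary for the next node, wrapping around) and the names `place_tasks` and `surviving_outputs` are assumptions made for the example. The slide says only that each node runs a scaled-down copy of a neighboring node's work.

```python
def place_tasks(num_nodes):
    """Return {node: (primary, secondary)} for a ring of nodes.

    Assumption: node i hosts the secondary for its right-hand neighbor.
    """
    return {
        node: (f"P{node + 1}", f"S{(node + 1) % num_nodes + 1}")
        for node in range(num_nodes)
    }


def surviving_outputs(placement, failed_nodes):
    """Map each task number to the process that still produces its output."""
    num_nodes = len(placement)
    outputs = {}
    for node in range(num_nodes):
        task = node + 1
        if node not in failed_nodes:
            outputs[task] = f"P{task} (full quality)"
            continue
        # Under the ring assumption, this task's secondary runs on the
        # left-hand neighbor of the failed node.
        host = (node - 1) % num_nodes
        if host not in failed_nodes:
            outputs[task] = f"S{task} (reduced quality)"
        else:
            outputs[task] = "lost"
    return outputs
```

With four nodes and a fault on the second node, task 2 is still covered by its secondary on the first node. If two adjacent nodes fail, the task whose secondary host also failed is lost, which is the coverage limit of a single-neighbor layout.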

  5. Principles of ALFTD • The secondary processes can be scaled down by: reducing the resolution of input data; reducing the precision of calculations; or heuristically predicting results from previous iterations’ output • In some applications the secondary can be run optionally, on an as-needed basis: if the corresponding primary is approaching a deadline miss; if the corresponding primary has been incapacitated; or if the corresponding primary has produced faulty data • If faults are infrequent, an optional secondary will incur very little additional overhead
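The three conditions for running an optional secondary can be written as a small decision function. This is a sketch under stated assumptions: the `PrimaryStatus` fields, the function name, and the `deadline_margin` threshold (treating 90% of the deadline as "approaching a miss") are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class PrimaryStatus:
    alive: bool           # False if the primary has been incapacitated
    elapsed: float        # time spent on the current task so far
    deadline: float       # time allowed for the task
    output_suspect: bool  # True if the primary's output looks faulty


def should_run_secondary(status, deadline_margin=0.9):
    """Decide whether the optional secondary needs to run for this task."""
    if not status.alive:
        return True  # primary incapacitated
    if status.elapsed >= deadline_margin * status.deadline:
        return True  # primary is approaching a deadline miss
    return status.output_suspect  # primary produced questionable data
```

In the fault-free case all three checks fail and the secondary never runs, which is why an optional secondary adds little overhead when faults are infrequent.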

  6. ALFTD in OTIS • ALFTD was implemented in OTIS (Orbital Thermal Imaging Spectrometer) to test its viability as a fault tolerance and detection scheme • OTIS, part of the REE (Remote Exploration and Experimentation) program group from JPL, is intended to run on orbiting satellites • OTIS processes radiation data of a geographic area from a sensor array [input] and produces temperature and emissivity data [output]

  7. OTIS Structure [Figure: MPI, master (M), slaves (S), and OUTPUT, with steps numbered 1 to 5] 1. MPI starts 2. MPI starts slave and master processes 3. Master sends tasks 4. Slave calculations 5. Slave returns results

  8. ALFTD in OTIS (cont’d) • ALFTD is suited for remote applications: • As a software-based fault handling mechanism, it requires no extra hardware • The scaled secondaries require less power than full software redundancy • In OTIS, and other applications, ALFTD is passive, requiring extra runtime only when a fault occurs

  9. ALFTD OTIS Structure [Figure: MPI, master (M), primaries P1 to P3, secondaries S1 to S3, and OUTPUT, with steps numbered 1 to 5] 1. MPI starts 2. MPI starts master and slaves, primary and secondary processes 3. Master sends tasks 4. Slave calculations 5. Slave returns results

  10. Secondaries in OTIS • The secondary required for ALFTD is implemented to be functionally similar to the primary • Secondary scaling occurs through resolution reduction • OTIS’ “natural” temperature data input exhibits spatial locality • Points not directly calculated can be estimated by interpolating between calculated points • Secondary processes have been tested at 20%-50% of the primary calculation overhead • While 50% affords better quality, 20% has less overhead
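Resolution reduction followed by interpolation can be illustrated with a short sketch. The slides do not give the interpolation scheme used in OTIS, so linear interpolation between computed rows is an assumption here, as are the function names and the choice to always compute the last row so that every skipped row has a computed neighbor on both sides.

```python
def secondary_rows(num_rows, fraction):
    """Pick the row indices a scaled-down secondary computes directly."""
    step = max(1, round(1 / fraction))
    rows = list(range(0, num_rows, step))
    if rows[-1] != num_rows - 1:
        rows.append(num_rows - 1)  # bracket the final gap (assumption)
    return rows


def interpolate_rows(computed, num_rows):
    """Fill skipped rows by linear interpolation between computed rows.

    `computed` maps a row index to the list of values for that row and
    must include the first and last rows.
    """
    known = sorted(computed)
    full = []
    for row in range(num_rows):
        if row in computed:
            full.append(list(computed[row]))
            continue
        lo = max(k for k in known if k < row)  # nearest computed row above
        hi = min(k for k in known if k > row)  # nearest computed row below
        w = (row - lo) / (hi - lo)
        full.append(
            [(1 - w) * a + w * b for a, b in zip(computed[lo], computed[hi])]
        )
    return full
```

For example, `secondary_rows(10, 0.5)` selects every second row plus the last one, and `interpolate_rows` reconstructs the rows in between. Because the extra final row is computed directly, the actual work can be slightly above the nominal fraction.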

  11. Example of Secondary Resolution (ALFTD compensation for 10 rows in a sample dataset) [Figure panels: 100%, 50%, 33%, and 25% secondary resolution]

  12. Fault Detection • Output filters on the primary data determine when secondary validation is required • Output filters are created to check for application-specific trends in data • Aberrations from normal data characteristics can be considered to be the product of potentially faulty processes • OTIS relies on natural temperature characteristics to detect potentially faulty data: spatial locality (temperature changes gradually over small areas) and absolute bounds (temperature should not exceed certain values)
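The two output filters named on this slide, spatial locality and absolute bounds, might look like the following in code. This is a hedged sketch rather than the OTIS filter: the function name, the one-dimensional row layout, and the threshold parameters `lower`, `upper`, and `max_step` are all assumptions, and the presentation does not state the values it used.

```python
def find_suspect_points(row, lower, upper, max_step):
    """Return indices in one row of temperatures that fail either filter."""
    suspects = set()
    # Absolute bounds: a temperature outside [lower, upper] is suspect.
    for i, value in enumerate(row):
        if not (lower <= value <= upper):
            suspects.add(i)
    # Spatial locality: a jump larger than max_step between neighbors
    # flags both points, since either one could be the faulty value.
    for i in range(1, len(row)):
        if abs(row[i] - row[i - 1]) > max_step:
            suspects.update((i - 1, i))
    return sorted(suspects)
```

A non-empty result would be the signal to run the secondary for validation. Flagging both sides of a large jump is a deliberately conservative choice: the filter cannot tell on its own which neighbor is wrong.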

  13. Fault Detection (cont’d) • After the secondary has been run to validate a primary’s results, the “better” data is chosen according to the following logic grid: [Table: logic grid, one axis labeled “Secondary Results”]

  14. Data Sets • Three data sets were chosen for their interesting characteristics

  15. Fault Tolerance Results: “Spots” • Fault Tolerance with injected faults in “Spots”

  16. Fault Tolerance Results: “Spots” (cont’d) [Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]

  17. Fault Tolerance Results: “Blob” • Fault Tolerance with injected faults in “Blob”

  18. Fault Tolerance Results: “Blob” (cont’d) [Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]

  19. Fault Tolerance Results: “Stripe” [Figure: difference plots of faulty output versus faultless output. Panels: no ALFTD; 25%, 33%, and 50% ALFTD computation overhead. Scale: no error to max error]

  20. Fault Tolerance Results: “Stripe” (cont’d) [Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]

  21. Conclusion / Future Work • ALFTD has been shown to be a cost-effective alternative to full redundancy • Improvements to the scheme will increase fault coverage and decrease secondary calculation overhead • OTIS has general application characteristics that will make its implementation a springboard to other, similar programs • ALFTD should continue to be effective in any program that has predictable data characteristics

  22. Thank You! • For additional information, please contact: Eric Ciocca (eciocca@ecs.umass.edu), Israel Koren (koren@euler.ecs.umass.edu), or C. Mani Krishna (krishna@ecs.umass.edu)
