Comprehensive Report on 2002 Fault Tolerance Workshop for Scalable Systems

Report on 2002 Fault Tolerance Workshop
Patricia D. Hough
Computational Sciences and Mathematics Research Department
Sandia National Laboratories

Motivation
• Large COTS systems are prone to failures
  • Lots of parts; complex configurations
  • Applications stress the systems
  • Few options for application survival
• University resources are untapped
  • DOE researchers unfamiliar with fault tolerance experts
  • University researchers unfamiliar with DOE problem domain
Goal: Bring laboratory and university researchers together to educate each other and discuss issues associated with scalable fault tolerance.

Basic Info
• June 10-11, 2002 in Albuquerque, NM
• ~40 attendees
  • Cornell, Denison, Florida, Houston, Indiana, LANL, LLNL, MSTI, SNL, Tennessee, UT Austin
  • Interest exceeded capacity
• Organized by Patty Hough (SNL), Tom Bressoud (Denison), and Lee Ward (SNL)
• Sponsored by the CSRI

Agenda
• 11 invited talks + 2 hours focused discussion on:
  • Application descriptions and needs
  • System monitoring
  • MPI fault tolerance
  • Traditional approaches with a twist
• Topics not covered
  • Checkpoint-free algorithms
  • Preventative measures
  • System services
  • Migration
  • Redistribution
  • Validation
  • Run-time environments

Conclusions
• MPI support is needed
  • Programming model needs to be considered
  • Balance research with timely delivery of capabilities
• New ideas are needed
  • Leverage hardware
  • More systematic, integrated approach
• There are still outstanding issues
  • Transparency vs. intrusiveness
  • Can traditional approaches be made scalable?
• Workshop was a great success!

For more information… http://csmr.ca.sandia.gov/projects/ftalgs.html
