Summary of workshop on scalable fault tolerance addressing challenges in large COTS systems, educating researchers, and fostering collaboration between universities and laboratories.
Report on 2002 Fault Tolerance Workshop
Patricia D. Hough
Computational Sciences and Mathematics Research Department
Sandia National Laboratories
Motivation
• Large COTS systems are prone to failures
  • Lots of parts; complex configurations
  • Applications stress the systems
  • Few options for application survival
• University resources are untapped
  • DOE researchers unfamiliar with fault tolerance experts
  • University researchers unfamiliar with DOE problem domain
Goal: Bring laboratory and university researchers together to educate each other and discuss issues associated with scalable fault tolerance.
Basic Info
• June 10-11, 2002 in Albuquerque, NM
• ~40 attendees
• Cornell, Denison, Florida, Houston, Indiana, LANL, LLNL, MSTI, SNL, Tennessee, UT Austin
• Interest exceeded capacity
• Organized by Patty Hough (SNL), Tom Bressoud (Denison), and Lee Ward (SNL)
• Sponsored by the CSRI
Agenda
• 11 invited talks + 2 hours focused discussion on:
  • Application descriptions and needs
  • System monitoring
  • MPI fault tolerance
  • Traditional approaches with a twist
• Topics not covered
  • Checkpoint-free algorithms
  • Preventative measures
  • System services
  • Migration
  • Redistribution
  • Validation
  • Run-time environments
Conclusions
• MPI support is needed
  • Programming model needs to be considered
  • Balance research with timely delivery of capabilities
• New ideas are needed
  • Leverage hardware
  • More systematic, integrated approach
• There are still outstanding issues
  • Transparency vs. intrusiveness
  • Can traditional approaches be made scalable?
• Workshop was a great success!
For more information… http://csmr.ca.sandia.gov/projects/ftalgs.html