
A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale


  1. A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale
  Gengbin Zheng, Xiang Ni, Laxmikant V. Kale
  Parallel Programming Lab, University of Illinois at Urbana-Champaign
  Charm++ Workshop 2012

  2. Motivation
  • As machines grow in size, MTBF decreases
    • Jaguar averaged 2.33 failures/day from 2008 to 2010
    • Applications have to tolerate faults
  • Challenges for exascale:
    • Disk-based checkpointing (to reliable NFS storage) is slow
    • System-level checkpointing can be expensive
    • Scalable checkpoint/restart can be a communication-intensive process
    • Job schedulers prevent fault-tolerance support in the runtime

  3. Motivation (cont.)
  • Applications on future exascale machines need fast, low-cost, and scalable fault-tolerance support
  • Previous work:
    • Double in-memory checkpoint/restart scheme
    • In the production version of Charm++ since 2004

  4. Double In-memory Checkpoint/Restart Protocol
  [Diagram: objects A–J are distributed across PE0–PE3, and each object keeps two checkpoint copies (checkpoint 1 and checkpoint 2) on two different PEs. When PE1 crashes (one processor lost), its objects are restored on the surviving PEs from the remaining in-memory copies. A sketch of the buddy placement follows.]
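
A minimal sketch of the buddy-placement idea behind this protocol, assuming a simple ring placement and hypothetical names (buddyOf, PEStore); this is an illustration, not the Charm++ implementation:

    // Minimal sketch of double in-memory checkpointing (illustration only).
    // Each object's serialized state is stored on its home PE and on a buddy
    // PE; if one PE crashes, every object it hosted is recovered from the
    // buddy's in-memory copy, with no access to reliable disk.
    #include <cassert>
    #include <map>
    #include <string>
    #include <vector>

    using Checkpoint = std::string;                    // stand-in for serialized object state

    int buddyOf(int pe, int numPEs) { return (pe + 1) % numPEs; }   // assumed ring placement

    struct PEStore {
        std::map<int, Checkpoint> copy1;  // checkpoints of objects homed on this PE
        std::map<int, Checkpoint> copy2;  // checkpoints received from the buddy PE
    };

    int main() {
        const int numPEs = 4;
        std::vector<PEStore> pes(numPEs);

        // Checkpoint phase: object i lives on PE i % numPEs; store one copy
        // locally and send a second copy to the buddy PE.
        for (int obj = 0; obj < 10; ++obj) {
            int home = obj % numPEs;
            Checkpoint ckpt = "state-of-object-" + std::to_string(obj);
            pes[home].copy1[obj] = ckpt;
            pes[buddyOf(home, numPEs)].copy2[obj] = ckpt;
        }

        // PE1 crashes: its objects are recovered from the buddy's copies.
        const int failed = 1;
        for (int obj = 0; obj < 10; ++obj) {
            if (obj % numPEs != failed) continue;
            const PEStore &buddy = pes[buddyOf(failed, numPEs)];
            assert(buddy.copy2.count(obj));            // second copy survived the crash
        }
        return 0;
    }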

  5. Runtime Support for FT
  • Automatically checkpointing threads
    • Including stack and heap (isomalloc)
  • User helper functions (see the PUP sketch below)
    • To pack and unpack data
    • Checkpointing only the live variables
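
In Charm++, such pack/unpack helpers are typically written with the PUP framework. A minimal sketch follows; the Particle class and its fields are hypothetical, and only the live variables are packed:

    // Minimal sketch of a user-provided pack/unpack helper in Charm++'s PUP
    // style (the Particle class and its members are hypothetical).
    #include "pup.h"           // Charm++ PUP framework
    #include <vector>

    class Particle {
    public:
        double x, y, z;               // live state worth checkpointing
        double vx, vy, vz;
        std::vector<double> forces;   // recomputed each step, so not packed

        void pup(PUP::er &p) {
            p | x;  p | y;  p | z;
            p | vx; p | vy; p | vz;
            // forces are rebuilt after restart, so they are skipped here
        }
    };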

  6. Local Disk-Based Protocol
  • Double in-memory checkpointing
    • Memory concern
    • Pick a checkpointing time when the global state is small
      • MD, N-body, quantum chemistry
  • Double in-disk checkpointing (see the sketch below)
    • Makes use of local disk (or SSD)
    • Also does not rely on any reliable shared storage
    • Useful for applications with a very large memory footprint
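
A minimal sketch of the local-disk variant, assuming a node-local scratch path; relative to the in-memory scheme, only the destination of the two checkpoint copies changes:

    // Minimal sketch of the local-disk variant (illustration only): the same
    // two checkpoint copies as in the in-memory scheme, but each copy is
    // written to node-local disk or SSD instead of being kept in RAM, trading
    // checkpoint speed for a smaller memory footprint. The path is assumed.
    #include <fstream>
    #include <string>

    void writeLocalCheckpoint(int objId, int copy, const std::string &state) {
        // Node-local scratch space; no shared/reliable file system is used.
        std::string path = "/tmp/ckpt_obj" + std::to_string(objId) +
                           "_copy" + std::to_string(copy) + ".bin";
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        out.write(state.data(), static_cast<std::streamsize>(state.size()));
    }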

  7. Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing

  8. Previous Results: Restart with Load Balancing
  LeanMD, ApoA1 benchmark, 128 processors

  9. Previous Results: Recovery Performance
  • 10 crashes
  • 128 processors
  • Checkpoint every 10 time steps

  10. FT on MPI-based Charm++
  • Practical challenge: the job scheduler
    • The job scheduler kills the entire job when a process fails
    • MPI-based Charm++ is portable across major supercomputers
  • A fault-injection scheme in the MPI machine layer
    • DieNow(): the MPI process stops responding
  • Fault detection by keep-alive messages (see the sketch below)
  • Spare processors replace failed ones
  • Demonstrated on 64K cores of a BG/P machine
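
A minimal sketch of timeout-based keep-alive detection, with an assumed detection threshold; it illustrates the idea only and is not the Charm++ MPI machine-layer code:

    // Minimal sketch of keep-alive fault detection (illustration only). Each
    // processor records a heartbeat from its buddy; if no heartbeat arrives
    // within a timeout, the buddy is declared failed and a spare processor
    // can take its place.
    #include <chrono>
    #include <iostream>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    struct BuddyMonitor {
        Clock::time_point lastHeartbeat = Clock::now();
        std::chrono::milliseconds timeout{200};   // assumed detection threshold

        void heartbeat() { lastHeartbeat = Clock::now(); }  // called per keep-alive message
        bool suspectedFailed() const { return Clock::now() - lastHeartbeat > timeout; }
    };

    int main() {
        BuddyMonitor monitor;
        monitor.heartbeat();                                          // buddy is alive
        std::this_thread::sleep_for(std::chrono::milliseconds(300));  // buddy stops responding (DieNow-style fault)
        if (monitor.suspectedFailed())
            std::cout << "buddy declared failed; promote a spare processor\n";
        return 0;
    }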

  11. Performance at Large Scale

  12. Optimization for Scalability
  • Communication bottlenecks
    • Checkpoint/restart takes O(P) time
  • Optimizations:
    • Collectives (barriers)
      • Replace the O(P) barrier with a tree-based barrier
    • Stale message handling
      • Epoch numbers
      • A phase to discard stale messages as quickly as possible (see the sketch below)
    • Small messages
      • Streaming optimization
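
A minimal sketch of epoch-based stale-message handling, with hypothetical names (Envelope, Scheduler); messages stamped with a pre-crash epoch are dropped on arrival so recovery is not slowed by old traffic. The tree-based barrier and streaming optimizations are not shown.

    // Minimal sketch of epoch-based stale-message filtering (illustration
    // only). Every restart bumps a global epoch number; a message carrying an
    // older epoch belongs to the pre-crash execution and is discarded instead
    // of being processed.
    #include <cstdint>
    #include <iostream>

    struct Envelope {
        uint32_t epoch;     // epoch in which the message was sent
        int      payload;   // application data (stand-in)
    };

    struct Scheduler {
        uint32_t currentEpoch = 0;

        void onRestart() { ++currentEpoch; }   // a restart advances the epoch

        void deliver(const Envelope &msg) {
            if (msg.epoch < currentEpoch) {
                // Stale message from before the crash: drop it immediately.
                return;
            }
            std::cout << "processing message " << msg.payload << "\n";
        }
    };

    int main() {
        Scheduler sched;
        Envelope before{sched.currentEpoch, 1};
        sched.onRestart();                     // a crash and restart happen here
        Envelope after{sched.currentEpoch, 2};
        sched.deliver(before);                 // discarded: stale epoch
        sched.deliver(after);                  // processed
        return 0;
    }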

  13. LeanMD Checkpoint Time before/after Optimization

  14. Checkpoint Time for Jacobi/AMPI (Kraken)

  15. LeanMD Restart Time

  16. Conclusions and Future Work
  • With the optimizations, in-memory checkpointing is scalable towards exascale
  • A short paper has been accepted at the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012)
  • Future work:
    • Non-blocking checkpointing
