Fault Tolerance in Charm++

  1. Fault Tolerance in Charm++ • Gengbin Zheng • 10/11/2005 • Parallel Programming Lab, University of Illinois at Urbana-Champaign

  2. Motivation • As machines grow in size • MTBF decreases • Applications have to tolerate faults • Applications need fast, low cost and scalable fault tolerance support • Fault tolerant runtime for: • Charm++ • Adaptive MPI

  3. Outline • Disk Checkpoint/Restart • FTC-Charm++ • in-memory checkpoint/restart • Proactive Fault Tolerance • FTL-Charm++ • message logging

  4. Disk Checkpoint/Restart

  5. Checkpoint/Restart • Simplest scheme for application fault tolerance • A long-running application periodically saves its state to disk at a chosen point • Coordinated checkpointing strategy (barrier) • State information is saved in a directory of your choosing • The application data is checkpointed by invoking the pup routine of every object • Restore also uses pup, so no additional application code is needed (pup is all you need)
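A minimal sketch of such a pup routine for a hypothetical chare (the class name and fields below are illustrative, not from the talk):

    // Hypothetical chare: the same pup routine serializes state at
    // checkpoint time and restores it on restart or migration.
    class Block : public CBase_Block {
      int step;       // current iteration number
      int n;          // number of local elements
      double *data;   // local simulation state
    public:
      Block() : step(0), n(0), data(NULL) {}
      Block(CkMigrateMessage *m) {}        // required migration constructor
      void pup(PUP::er &p) {
        CBase_Block::pup(p);               // pup the superclass first
        p | step;
        p | n;
        if (p.isUnpacking()) data = new double[n];  // allocate on restore
        PUParray(p, data, n);              // pack or unpack the array
      }
    };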

  6. Checkpointing a Job • In Charm++, use: • void CkStartCheckpoint(char* dirname, const CkCallback& cb) • Called on one processor; the callback cb is invoked when the checkpoint is complete • In AMPI, use: • MPI_Checkpoint(<dir>); • Collective call; returns when checkpoint is complete
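As a sketch, a driver chare might trigger a checkpoint every few hundred iterations (Main, mainProxy, and resumeFromCheckpoint are illustrative names, not part of the Charm++ API):

    // Hypothetical main chare: checkpoint every 100 steps into "log/",
    // then continue from resumeFromCheckpoint() once it completes.
    void Main::nextStep() {
      if (step % 100 == 0) {
        CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
        CkStartCheckpoint("log", cb);
      } else {
        resumeFromCheckpoint();
      }
    }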

  7. Restart Job from Checkpoint • The charmrun option ++restart <dir> is used to restart • ./charmrun +p4 ./pgm ++restart log • The number of processors need not be the same as in the original run • Parallel objects are redistributed as needed

  8. FTC-Charm++ In-Memory Checkpoint/Restart

  9. Disk vs. In-memory Scheme • Disk checkpointing has several drawbacks • Needs user intervention to restart a job • Assumes reliable storage (disk) • Disk I/O is slow • In-memory checkpoint/restart scheme • Online version of the previous scheme • Low impact on fault-free execution • Provides fast and automatic restart • Does not rely on extra processors • Maintains execution efficiency after restart • Does not rely on any fault-free component • Does not assume stable storage

  10. Overview • Coordinated checkpointing scheme • Simple, low overhead on fault-free execution • Targets iterative scientific applications • Double checkpointing • Tolerates one failure at a time • In-memory (diskless) checkpointing • Efficient for applications with a small memory footprint • When there are no extra processors, the program continues to run on the remaining processors • Load balancing for restart

  11. Checkpoint Protocol • Similar to the previous scheme • Coordinated checkpointing strategy • Programmers decide what to checkpoint • void CkStartMemCheckpoint(CkCallback &cb) • Each object packs its data and sends it to two different (buddy) processors
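Usage mirrors the disk-based call; a minimal sketch (Main, mainProxy, and afterCheckpoint are illustrative names):

    // Hypothetical: start a double in-memory checkpoint; execution
    // resumes at Main::afterCheckpoint() once both buddy copies exist.
    CkCallback cb(CkIndex_Main::afterCheckpoint(), mainProxy);
    CkStartMemCheckpoint(cb);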

  12. Restart Protocol • Initiated by the failure of a physical processor • Every object rolls back to the state preserved in the most recent checkpoint • Combined with the load balancer to sustain performance after restart

  13. Checkpoint/Restart Protocol
[Diagram: objects A through J are checkpointed twice, each copy on a different buddy among PE0-PE3 (checkpoint 1 and checkpoint 2); when PE1 crashes, its objects are restored on the remaining three processors from the surviving checkpoint copies.]

  14. Local Disk-Based Protocol • Double in-memory checkpointing raises a memory concern • Pick a checkpointing time where the global state is small • Double in-disk checkpointing • Makes use of the local disk • Also does not rely on any reliable storage • Useful for applications with a very big memory footprint

  15. Compiling FTC-Charm++ • Build Charm++ with the "syncft" option • ./build charm++ net-linux syncft -O • The command-line switch +ftc_disk selects disk-based rather than in-memory checkpointing: • charmrun ./pgm +ftc_disk

  16. Performance Evaluation • IA-32 Linux cluster at NCSA • 512 dual 1 GHz Intel Pentium III processors • 1.5 GB of RAM each • Connected by both Myrinet and 100 Mbit Ethernet

  17. Performance Comparisons with Traditional Disk-based Checkpointing

  18. Recovery Performance • Molecular Dynamics Simulation application - LeanMD • Apoa1 benchmark (92K atoms) • 128 processors • Crash simulated by killing processes • No backup processors • With load balancing

  19. Performance Improvement with Load Balancing (LeanMD, Apoa1, 128 processors)

  20. Recovery Performance • 10 crashes • 128 processors • Checkpoint every 10 time steps

  21. LeanMD with Apoa1 benchmark • 90K atoms • 8498 objects

  22. Proactive Fault Tolerance

  23. Motivation • The run-time typically reacts to a failure after it happens • Instead, proactively migrate off a processor that is about to fail • Modern hardware supports early fault indication • SMART protocol, motherboard temperature sensors, Myrinet interface cards • Possible to build a mechanism for fault prediction

  24. Requirements • Response time should be as low as possible • No new processes should be required • Collective operations should still work • Efficiency loss should be proportional to computing power loss

  25. System • Application is warned of an impending fault via a signal • Processor, memory and interconnect should continue to work correctly for some time after the warning • Run-time ensures that the application continues to run on the remaining processors even if one processor crashes

  26. Solution Design • Migrate Charm++ objects off the warned processor • Point-to-point message delivery should continue to work • Collective operations should cope with the possible loss of multiple processors • Modify the runtime system's reduction tree to remove the warned processor • A minimal number of processors should be affected • The runtime system should remain load balanced after a processor has been evacuated

  27. Proactive FT: Current Status • Support for multiple faults ready; currently testing support for simultaneous faults • Faults simulated via a signal sent to the process • Current version fully integrated into Charm++ and AMPI • Example: sweep3d (MPI code) on NCSA's tungsten
[Figure: processor utilization for sweep3d shown three ways: original utilization, utilization after a fault, and utilization after load balancing.]

  28. How to Use • Part of the default version of Charm++ • No extra compiler flags required • The evacuation code does not execute until a warning arrives • Any detection system can be plugged in • Can send a signal (USR1) to the process on the compute node • Can call a method (CkDecideEvacPe) to evacuate a processor • Works with any Charm++ or AMPI program • For AMPI, must be used with -memory isomalloc
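As a sketch, an external monitor could deliver the warning signal, while an in-process detector could call the evacuation hook directly (readBoardTemperature and TEMP_THRESHOLD are hypothetical; CkDecideEvacPe is the method named on the slide):

    // From outside the job: warn the Charm++ process on a suspect node.
    //   kill -USR1 <pid>

    // From inside: a hypothetical detector polled periodically.
    void checkNodeHealth() {
      if (readBoardTemperature() > TEMP_THRESHOLD)   // hypothetical sensor read
        CkDecideEvacPe();   // evacuate this processor's objects
    }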

  29. FTL-Charm++ Message Logging

  30. Motivation • Checkpointing is not fully automatic • Coordinated checkpointing is expensive • Checkpoint/rollback doesn't scale • All nodes are rolled back just because one crashed • Even nodes independent of the crashed node are restarted

  31. Design • Message logging • Sender-side message logging • Asynchronous checkpoints • Each processor has a buddy processor • Stores its checkpoint in the buddy's memory • Checkpoints on its own (no barrier)

  32. Message to Remote Chares
[Diagram: chare P (sender) sends a ticket request <Sender, SN> to chare Q (receiver); Q replies with <SN, TN>; P then sends <SN, TN, Message>.]
• If <Sender, SN> has been seen earlier, the corresponding TN is marked as received • Otherwise create a new TN and store the <Sender, SN, TN> entry
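A sketch of the receiver-side ticketing logic described above (the data structures and names are illustrative, not the actual FTL-Charm++ implementation):

    #include <map>
    #include <utility>

    // Hypothetical receiver-side table: ticket numbers already issued,
    // keyed by (sender id, sender sequence number).
    std::map<std::pair<int,int>, int> ticketTable;
    int nextTN = 0;

    // On a ticket request <Sender, SN>: reuse the old TN if this pair
    // was seen before (e.g., a resend after a crash), else issue a new
    // one and store the <Sender, SN, TN> entry.
    int handleTicketRequest(int sender, int SN) {
      std::pair<int,int> key(sender, SN);
      std::map<std::pair<int,int>, int>::iterator it = ticketTable.find(key);
      if (it != ticketTable.end())
        return it->second;      // duplicate request: same TN as before
      int TN = nextTN++;
      ticketTable[key] = TN;
      return TN;
    }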

  33. Status • Most of Charm++ and AMPI has been ported • Support for migration has not yet been implemented in the fault-tolerant protocol • Parallel restart not yet implemented • Not yet in the Charm++ main branch

  34. Thank You! Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/ Parallel Programming Lab at the University of Illinois
