Checkpointing and Recovery

harry

Presentation Transcript


  1. Checkpointing and Recovery

  2. Purpose • Consider a long-running application • Regularly checkpoint the application • Expensive task • In case of failure, restore to the previous checkpoint • What happens in the case of a distributed application? • One (or more) processes fail • Restoration to the previous checkpoint should be done consistently
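
A minimal single-process sketch of the checkpoint-and-restore idea above. The function name, the `every` and `fail_at` parameters, and the toy summation workload are all assumptions made for illustration; the checkpoint is held in memory rather than on stable storage.

```python
import copy

def run_with_checkpoints(n_steps, every, fail_at=None):
    """Sum 0..n_steps-1, taking a checkpoint every `every` steps.

    `fail_at` (hypothetical) simulates one crash just before that step.
    Returns (result, steps_executed) so the redone work is visible.
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)        # initial checkpoint
    executed = 0
    failed = False
    while state["step"] < n_steps:
        if fail_at is not None and not failed and state["step"] == fail_at:
            failed = True
            state = copy.deepcopy(checkpoint)    # restore the previous checkpoint
            continue
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % every == 0:
            checkpoint = copy.deepcopy(state)    # regular checkpoint
    return state["total"], executed
```

With `every=3` and a crash before step 8, only steps 6 and 7 are redone, because the last checkpoint was taken after step 5.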

  3. Examples

  4. What to Save? • Depends on the application • Could be as simple as just the program counter • Could be the state of the entire process, including messages received, etc.

  5. Stable Storage • Checkpoints must survive failure of processes (including failure during a disk write) • A simple approach for stable storage
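
One common way to keep a checkpoint file intact when the process dies in the middle of a write is to write to a temporary file and rename it into place. This is a sketch of that technique, not necessarily the approach the slide had in mind; the function names and the JSON encoding are assumptions, and it does not protect against failure of the disk itself.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write `state` so that `path` always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # force the data to disk before renaming
        os.replace(tmp, path)         # atomic rename: old or new file, never half
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise

def load_checkpoint(path):
    """Read back the most recent complete checkpoint."""
    with open(path) as f:
        return json.load(f)
```

A crash before `os.replace` leaves the previous checkpoint untouched; a crash after it leaves the new one.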

  6. Approaches • Asynchronous • The local checkpoints at different processes are taken independently • Synchronous • The local checkpoints at different processes are coordinated • The checkpoints need not be taken at the same time

  7. Asynchronous Checkpointing • Problem • Domino effect [Figure; label: Failed process]

  8. Other Issues with Asynchronous Checkpointing • Useless checkpoints • Need for garbage collection • Recovery requires significant coordination

  9. Asynchronous Checkpointing (Continued) • Identify dependency between different checkpoint intervals • This information is stored along with checkpoints in a stable storage • When a process repairs, it requests this information from others to determine the need for rollback
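
The rollback decision described above can be sketched as a fixed-point computation over the recorded message dependencies. This is a generic rollback-propagation sketch, not a specific published algorithm; the `Message` type, the function name, and the convention that interval x of a process ends at its checkpoint x (with checkpoint 0 as the initial state) are assumptions.

```python
from typing import NamedTuple

class Message(NamedTuple):
    sender: int
    send_interval: int    # sent during interval x of the sender
    receiver: int
    recv_interval: int    # received during interval y of the receiver

def recovery_line(latest, messages):
    """Return, per process, the checkpoint index to restore.

    `latest[i]` is the most recent checkpoint of process i. A message is an
    orphan if its send is undone but its receive is kept; the receiver must
    then roll back to before the receive.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_undone = m.send_interval > line[m.sender]
            recv_kept = m.recv_interval <= line[m.receiver]
            if send_undone and recv_kept:
                line[m.receiver] = m.recv_interval - 1
                changed = True
    return line
```

With crossing messages between two processes, each rollback can undo a send that forces the other process further back, which is the domino effect: the line may end at the initial states.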

  10. Two Examples of Asynchronous Checkpointing • Bhargava and Lian • Wang et al.

  11. Algorithm by Bhargava and Lian • Draw an edge from c_{i,x} to c_{j,y} if either • i = j and y = x+1 • i ≠ j and a message m is sent from I_{i,x} and received in I_{j,y} • Where I_{i,x} is the interval between c_{i,x-1} and c_{i,x} • Rollback recovery line used for recovery as well as garbage collection
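
The two edge rules on this slide can be written down directly. The sketch below only builds the edge set; it does not compute the recovery line. The function name and the input encoding (a dict of last checkpoint indices and message tuples `(i, x, j, y)`) are assumptions, and every checkpoint a message refers to is assumed to exist.

```python
def checkpoint_graph(last_checkpoint, messages):
    """Build the edge set of the checkpoint graph.

    last_checkpoint[i]: highest checkpoint index of process i (starting at 0).
    messages: tuples (i, x, j, y), sent in interval x of process i and
              received in interval y of process j.
    Nodes are (process, checkpoint index) pairs.
    """
    edges = set()
    # Rule 1: consecutive checkpoints of the same process (i = j, y = x + 1).
    for i, last in last_checkpoint.items():
        for x in range(last):
            edges.add(((i, x), (i, x + 1)))
    # Rule 2: a message between different processes (i != j).
    for i, x, j, y in messages:
        if i != j:
            edges.add(((i, x), (j, y)))
    return edges
```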

  12. Algorithm by Wang et al. • Difference • If a message sent from I_{i,x} is received in I_{j,y}, then draw an edge from c_{i,x-1} to c_{j,y} • Recovery line obtained is similar to that by Bhargava and Lian • Advantage • Number of useful checkpoints is at most N(N+1)/2 • This can be shown by counting the checkpoints that are ahead of the recovery line

  13. Coordinated Checkpointing • Using diffusing computation • How can we use diffusing computation to obtain a consistent snapshot?

  14. Algorithm by Tamir and Sequin • Blocking checkpoint • A coordinator decides when a checkpoint is taken • Coordinator sends a request message to all • Each process • Stops executing • Flushes the channels • Takes a tentative checkpoint • Replies to coordinator • When all processes send replies, the coordinator asks them to change it to a permanent checkpoint
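
The message flow on this slide can be sketched as an in-process simulation. The class and method names are assumptions, channel flushing is not modelled, and resuming execution after the permanent checkpoint is an assumption that the slide does not state.

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.running = True
        self.tentative = None
        self.permanent = None

    def on_request(self):
        self.running = False                  # stop executing
        # Channel flushing is assumed to happen here (not modelled).
        self.tentative = dict(self.state)     # take a tentative checkpoint
        return "ready"                        # reply to the coordinator

    def on_commit(self):
        # Turn the tentative checkpoint into the permanent one.
        self.permanent, self.tentative = self.tentative, None
        self.running = True                   # assumed: resume afterwards


def coordinated_checkpoint(processes):
    """Coordinator: request, collect all replies, then make permanent."""
    replies = [p.on_request() for p in processes]
    if all(r == "ready" for r in replies):
        for p in processes:
            p.on_commit()
        return True
    return False
```

In this sketch a process holds at most two checkpoints at any moment: the last permanent one and the tentative one that is about to replace it.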

  15. Algorithm by Tamir and Sequin • How many checkpoints need to be stored per process?

  16. Tamir and Sequin assume fully connected graph? • How would you do it if it was not fully connected? • Use diffusing computation • Each node stops its 'original computation' when it propagates the diffusing computation • Each node takes a tentative checkpoint at completion • Channel flushing achieved in between

  17. Checkpointing in Timed Systems • What if clocks are perfectly synchronized?

  18. Checkpointing in Timed Systems • What if clocks are loosely synchronized? • Max clock drift, ε, is known • All processes take a checkpoint at a fixed (local) time • After the checkpoint, a process does not send any messages for 2ε • The set of local checkpoints is guaranteed to be consistent
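
The send rule above amounts to a local-clock test. In this sketch the maximum clock drift is called `eps` and the silent period is taken to be twice that value; the function name and the parameter names are assumptions.

```python
def may_send(local_time, checkpoint_time, eps):
    """Return True if a process may send a message at `local_time`.

    checkpoint_time: the fixed local time at which every process checkpoints.
    eps: known maximum clock drift (assumed name).
    """
    if local_time < checkpoint_time:
        return True                                   # before the checkpoint
    return local_time >= checkpoint_time + 2 * eps    # silent for 2 * eps
```

Messages that the application produces during the silent period would have to be buffered or delayed; how to do that is left open here.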

  19. Minimal Checkpoint Coordination • Approach by Koo and Toueg • Require processes to take a checkpoint only if they have to
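
The "only if they have to" idea can be illustrated by computing which processes an initiator drags into a checkpoint: those it has received messages from since its last checkpoint, and transitively the ones they received from. This is a simplified sketch of that dependency closure only; it omits the tentative/permanent phases and message labels of the full Koo and Toueg protocol, and the function name and the `received_from` encoding are assumptions.

```python
from collections import deque

def must_checkpoint(initiator, received_from):
    """Return the set of processes that need to take a checkpoint.

    received_from[p]: processes from which p has received messages
                      since p's last checkpoint.
    """
    needed = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for q in received_from.get(p, ()):
            if q not in needed:
                needed.add(q)
                queue.append(q)
    return needed
```

A process that only sent messages to the initiator's dependencies, or that is unrelated to them, is left alone.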

  20. Logging Protocols • Pessimistic • Optimistic • Causal

  21. Concept of Logging • If a restarted process is guaranteed to behave as it did before the failure, then other processes need not be aborted • Log non-deterministic events

  22. Definitions • Depend(m) • Processes that depend on m • Stable(m) • m stored on stable storage • Log(m) • Processes that have logged m • C • Set of failed processes

  23. Pessimistic Protocols • Not Stable(m) => |Depend(m)| = 0 • What if • Not Stable(m) => |Depend(m)| <= 1
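
The first condition says that nothing may depend on a message until that message is on stable storage. A minimal sketch of a receiver that enforces this by logging before delivery follows; the class name, the integer-sum application state, and the plain list that stands in for stable storage are assumptions.

```python
class PessimisticReceiver:
    def __init__(self, stable_log):
        self.stable_log = stable_log    # stands in for stable storage
        self.total = 0                  # toy application state

    def deliver(self, m):
        self.stable_log.append(m)       # make m stable first ...
        self._apply(m)                  # ... only then let state depend on m

    def _apply(self, m):
        self.total += m

    @classmethod
    def recover(cls, stable_log):
        """Rebuild the state after a crash by replaying the logged messages."""
        p = cls(stable_log)
        for m in stable_log:
            p._apply(m)
        return p
```

Because every delivered message is already logged, a restarted receiver replays the same sequence and reaches the same state, so no other process has to roll back. The cost is a stable-storage write on every delivery.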

  24. Causal Protocols • Save m in the volatile memory of other processes • Ensure • Depend(m) ⊆ Log(m)
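
Reading the condition as "every process that depends on m also holds a copy of m", it can be checked with plain set operations, together with a test for orphans given a set C of failed processes. The function names and the dict-of-sets encoding are assumptions, and copies on stable storage are not modelled.

```python
def causal_invariant_holds(depend, log):
    """True if Depend(m) is a subset of Log(m) for every message m."""
    return all(depend[m] <= log.get(m, set()) for m in depend)

def orphans(depend, log, failed):
    """Surviving processes that depend on a message whose copies are all lost.

    depend[m], log[m]: sets of process names; failed: the set C.
    """
    result = set()
    for m, deps in depend.items():
        if log.get(m, set()) <= failed:     # every logged copy of m was lost
            result |= deps - failed         # survivors that still depend on m
    return result
```

When the invariant holds, any surviving process that depends on m is itself in Log(m), so m cannot be lost and no orphan arises.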
