Uniprocessor Checkpointing

Uniprocessor Checkpointing. CS 717 – Fall 2001, 9/25/01. The Need to Save State. Many of the FT systems we have discussed need a way to restart processes from previous points in their computation. A checkpoint is just a ‘snapshot’ of a process (or system) at a certain point in time.


Presentation Transcript


  1. Uniprocessor Checkpointing CS 717 – Fall 2001 9/25/01

  2. The Need to Save State • Many of the FT systems we have discussed need a way to restart processes from previous points in their computation • A checkpoint is just a ‘snapshot’ of a process (or system) at a certain point in time • A checkpointing system provides a way to take these snapshots, and to restart from them

  3. Types of Ckpt Systems • Kernel Level • OS supports ckpt & recovery • Transparent to the application and developer • User Level • Application linked against (user) library • Library functions perform ckpt and recovery • Transparent to application • Limitations (cannot restore PID, PPID, etc.) • Application Level • Applications coded to ckpt themselves, and to restart from a checkpoint

  4. Comparison of Levels • Kernel & User (System) Level • Easy to add checkpointing to existing code • Works with (almost) any programs • General, ‘coarse’, approach • Application Level • Could require complete re-write, or extensive modifications • Specific, ‘fine-grained’ solutions

  5. System Level Checkpointing • Libckpt (1994) • Plank, Beck, Kingsley (UTK), Li (Princeton) • User level library for UNIX

  6. Libckpt • User Level Checkpoint Library • Goals • Transparent • Requires minimal modifications to code and re-linking • Low Overhead • Automatic optimizations to reduce ckpt file size • Allow user directed checkpointing

  7. Libckpt Overview • Taking the ‘snapshot’ • Suspend the process • Write process’ memory and registers to a file • Recovery • Reload executable from original file • Reconstruct memory and register state from checkpoint file
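The save/reconstruct cycle on the slide above can be sketched for a single memory region. This is only an illustration, not libckpt's code: `save_region` and `restore_region` are hypothetical names, and a real checkpoint also has to capture registers and every segment, not one buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write `len` bytes starting at `region` to `path`.
 * Returns 0 on success, -1 on any I/O error. */
int save_region(const char *path, const void *region, size_t len) {
    assert(path != NULL && region != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(region, 1, len, f);
    int close_rc = fclose(f);            /* fclose flushes buffered data */
    return (written == len && close_rc == 0) ? 0 : -1;
}

/* Hypothetical helper: overwrite `region` with the `len` bytes saved in
 * `path`. Returns 0 on success, -1 if the file is missing or too short. */
int restore_region(const char *path, void *region, size_t len) {
    assert(path != NULL && region != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(region, 1, len, f);
    fclose(f);
    return (got == len) ? 0 : -1;
}
```

Calling `save_region`, letting the program overwrite the buffer, and then calling `restore_region` brings the buffer back to its contents at the time of the save.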

  8. Libckpt Operation • Application main() is re-named ckpt_target() • Library main() checks if in restore mode (specified using command line option); otherwise reads checkpoint parameters from file
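A minimal sketch of the mode check described above, assuming the restore option is the literal argument `=recover` mentioned on a later slide. `wants_recovery` is a hypothetical name, and how libckpt actually parses its arguments is not shown in the slides.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any command line argument is exactly
 * "=recover", 0 otherwise. argv[0] (the program name) is skipped. */
int wants_recovery(int argc, char *const argv[]) {
    assert(argc >= 0);
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "=recover") == 0) {
            return 1;
        }
    }
    return 0;
}
```

A library-supplied main() could branch on this result: run the restore path when it returns 1, otherwise call the renamed application entry point ckpt_target().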

  9. Libckpt Operation (2) • main() sets a timer to interrupt application every n seconds • On signal • Uses setjmp to record registers, pc, etc. • Writes the stack and heap segments to file • Resumes application code
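The periodic interrupt can be sketched with POSIX `setitimer` and `sigaction`. This is an assumption about mechanism, not libckpt's source: the names `start_ckpt_timer` and `take_pending_request` are hypothetical, and unlike the slide (where the handler itself writes the checkpoint) this handler only records that a checkpoint is due.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Set by the signal handler, cleared by take_pending_request(). */
static volatile sig_atomic_t ckpt_requested = 0;

static void on_alarm(int signo) {
    (void)signo;
    ckpt_requested = 1;   /* only async-signal-safe work in the handler */
}

/* Arrange for SIGALRM every `seconds` seconds. Returns 0 on success. */
int start_ckpt_timer(long seconds) {
    assert(seconds > 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0) return -1;

    struct itimerval tv;
    tv.it_interval.tv_sec = seconds;   /* period between later signals */
    tv.it_interval.tv_usec = 0;
    tv.it_value = tv.it_interval;      /* delay before the first signal */
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/* Returns 1 (and clears the flag) if a timer signal arrived since the
 * last call, 0 otherwise. */
int take_pending_request(void) {
    int pending = ckpt_requested;
    ckpt_requested = 0;
    return pending;
}
```

Deferring the work to a flag keeps the handler trivial; doing the write inside the handler, as the slide describes, avoids having to poll but restricts what the handler may safely call.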

  10. Libckpt Operation (3) • If the application is started with =recover as a command line option • Application begins, recovering Text segments • Open checkpoint file • Recover heap from file • Recover stack from file • Restore registers (using longjmp)

  11. Virtual Address Space (listed from high addresses down to 0)
        Bottom of Stack
          Stack
        SP
        sbrk(0)
          Heap
        &edata
          Data (Static)
        &etext
          Text
        0
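The boundary markers in the layout above can be read from a running process. A minimal sketch, assuming Linux with glibc, where the linker provides the symbols `etext` and `edata` and `sbrk(0)` returns the current end of the heap; `struct layout` and `probe_layout` are hypothetical names.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Linker-provided symbols on Linux: their ADDRESSES mark the end of the
 * text segment and the end of the initialized data segment. */
extern char etext, edata;

struct layout {
    uintptr_t text_end;    /* &etext */
    uintptr_t data_end;    /* &edata */
    uintptr_t heap_end;    /* sbrk(0), the current program break */
    uintptr_t stack_here;  /* address of a local: somewhere in the stack */
};

struct layout probe_layout(void) {
    int local = 0;
    void *brk_now = sbrk(0);
    assert(brk_now != (void *)-1);   /* sbrk reports failure as (void *)-1 */

    struct layout l;
    l.text_end = (uintptr_t)&etext;
    l.data_end = (uintptr_t)&edata;
    l.heap_end = (uintptr_t)brk_now;
    l.stack_here = (uintptr_t)&local;
    return l;
}
```

On a typical Linux process the four values come out in increasing order, which matches the picture: text, then static data, then heap, with the stack far above.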

  12. Checkpoint And Recovery Algorithms

      main()
        if (recovery)
          restore stack
          restore heap
          pos = top of stack
          longjmp(pos, 1)        // restore regs.
        else
          run usual code

      signal_handler()
        jmp_buf pos
        if (setjmp(pos) == 0)    // saved regs. in known position on stack
          write stack
          write heap
        else                     // process recovered
          return
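The two-way return of `setjmp` is what lets one piece of code serve as both the checkpoint path and the recovery path. The sketch below shows only that control flow inside a single process; the `longjmp` stands in for a restart, and no stack or heap is written to a file. The function name is hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf pos;

/* Runs the "checkpoint" branch, then jumps back as a recovery would,
 * and reports which branches executed: bit 0 = checkpoint path,
 * bit 1 = recovery path. */
int simulate_checkpoint_and_recovery(void) {
    /* volatile: modified between setjmp and longjmp, so it must not be
     * cached in a register that longjmp would roll back. */
    volatile int phases = 0;

    if (setjmp(pos) == 0) {
        /* First return from setjmp: registers are now saved in `pos`.
         * A real checkpointer would write the stack and heap here. */
        phases |= 1;
        longjmp(pos, 1);          /* stand-in for "restart and restore" */
    } else {
        /* Second return, reached via longjmp: the process "recovered". */
        assert(phases == 1);
        phases |= 2;
    }
    return phases;
}
```

In a real restart the jump buffer lives in the restored stack image, which is why the stack and heap must be put back before `longjmp` is called.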

  13. Illustration
      Checkpoint: call stack is main() -> user_main() -> fun1() -> fun2() -> signal handler • save regs on stack • save stack to file • save heap to file • resume
      Recovery: call stack is main() -> restore() • restore stack • restore heap • take jump

  14. Optimization: Incremental Checkpointing • Observation: between taking two checkpoints, only a portion of the memory has actually been changed • Optimization: save only what has been changed since last ckpt, the rest can be read from previous ckpts

  15. Taking Incremental Ckpts. • After taking a ckpt (and after init.), set protection on all pages to ‘read-only’ • Write to page will cause a protection violation • Libckpt library catches that signal, and sets page protection to ‘read-write’, page is marked as dirty • When writing checkpoint file, only write dirty pages
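The dirty-page tracking above can be sketched with `mprotect` and a SIGSEGV handler. This is an illustration under assumptions, not libckpt's implementation: it assumes Linux, tracks one small anonymous mapping rather than the whole process, and all `tracker_*` names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4

static char *region;                           /* the tracked memory */
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];    /* one flag per page */

/* Write fault on a read-only page: mark it dirty and make it writable.
 * mprotect is not on the POSIX async-signal-safe list; calling it here
 * is an assumption of this sketch. */
static void on_write_fault(int signo, siginfo_t *info, void *ctx) {
    (void)signo; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + NPAGES * page_size) {
        _exit(1);                              /* a genuine crash */
    }
    size_t page = (addr - base) / page_size;
    dirty[page] = 1;
    mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the region and install the fault handler. Returns 0 on success. */
int tracker_init(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Call right after a checkpoint: forget old dirty bits, go read-only. */
int tracker_arm(void) {
    assert(region != NULL);
    for (int i = 0; i < NPAGES; i++) dirty[i] = 0;
    return mprotect(region, NPAGES * page_size, PROT_READ);
}

/* Number of pages the next incremental checkpoint would have to write. */
int tracker_dirty_count(void) {
    int n = 0;
    for (int i = 0; i < NPAGES; i++) n += dirty[i];
    return n;
}

char *tracker_region(void) { return region; }
```

After `tracker_arm()`, the first write to each page takes one fault; later writes to the same page run at full speed, so the cost is per dirtied page, not per write.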

  16. Drawbacks to Incremental Ckpt • Required to keep multiple copies of the checkpoint file • On recovery, will unnecessarily restore old copies of data

  17. Optimization: Asynchronous Checkpointing • Observation: the process must be suspended while the checkpoint file is written • Optimization: a separate thread can write the checkpoint file while the main thread continues

  18. Asynchronous Checkpointing • Make a copy of the process space • 2nd thread writes the copy to disk • 1st thread continues without halting

  19. Asynchronous Checkpointing(2) • Unix fork() provides the necessary behavior • When about to take ckpt, process forks • OS makes a complete copy of the original process’ space • Clone writes ckpt file, then dies • Original continues computing
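A minimal sketch of the fork-based scheme above, assuming POSIX. It checkpoints a single buffer rather than the whole address space, and `checkpoint_async` and `checkpoint_wait` are hypothetical names. The child sees the buffer as it was at the moment of the fork, so the parent can keep modifying it immediately.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that writes `len` bytes at `state` to `path`, then exits.
 * Returns the child's pid in the parent, or -1 if fork failed. */
pid_t checkpoint_async(const char *path, const void *state, size_t len) {
    assert(path != NULL && state != NULL);
    fflush(NULL);                 /* don't duplicate pending stdio output */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: its copy of `state` is frozen at fork time. */
        FILE *f = fopen(path, "wb");
        int ok = (f != NULL) && (fwrite(state, 1, len, f) == len);
        if (f != NULL && fclose(f) != 0) ok = 0;
        _exit(ok ? 0 : 1);        /* "clone writes ckpt file, then dies" */
    }
    return pid;                   /* parent: continue computing */
}

/* Reap the writer. Returns 0 if the checkpoint file was written. */
int checkpoint_wait(pid_t pid) {
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

On systems where fork() is copy-on-write, only the pages the parent touches while the child is writing are actually duplicated.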

  20. Copy-On-Write Checkpointing • Like asynchronous checkpointing, but only copy a page if the two versions are about to differ • Some (most?) OSes implement fork() in this manner, so the benefit is automatic

  21. Checkpoint Compression • Use a standard data compression algorithm to shrink the size of the checkpoint file • Only improves overhead if the speed of compression is faster than the speed of disk writes, and compression is significant • “For uniprocessor checkpointing, this is not the case” • Not implemented in libckpt
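The condition on the slide can be made concrete with a simple cost model. The model is an assumption of this sketch, not something from the slides: compression and disk writes do not overlap, `ratio` is compressed size divided by original size, and both bandwidths use the same units. Under it, compression helps only when size/comp_bw + ratio*size/disk_bw < size/disk_bw.

```c
#include <assert.h>

/* Returns 1 if compressing first is faster than writing the raw
 * checkpoint, under the no-overlap model described above.
 *   disk_bw : disk write bandwidth (e.g. MB/s)
 *   comp_bw : compressor throughput on input data (same units)
 *   ratio   : compressed size / original size, in (0, 1]          */
int compression_helps(double disk_bw, double comp_bw, double ratio) {
    assert(disk_bw > 0.0 && comp_bw > 0.0);
    assert(ratio > 0.0 && ratio <= 1.0);

    /* Time per unit of checkpoint data in each scheme. */
    double plain_time = 1.0 / disk_bw;
    double compressed_time = 1.0 / comp_bw + ratio / disk_bw;
    return compressed_time < plain_time;
}
```

For example, with a 10 MB/s disk, a 50 MB/s compressor and a ratio of 0.5 the compressed path wins; with a 100 MB/s disk and the same compressor it loses, because the compressor alone is slower than writing the raw data.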

  22. User Directed Checkpointing • As described so far, libckpt is (almost) entirely transparent to the programmer • Compare to application level checkpoint requiring extensive code changes • Is there a middle ground? • Libckpt allows programmers to annotate application code with directives that guide the checkpointing

  23. Memory Exclusion • Certain areas of memory can be excluded from the checkpoint • Dead memory – will never be read or written • Clean memory – values have not changed since previous checkpoint • Incremental Ckpt provides clean memory opt. at a coarse level (page size) • Only writing the ‘active’ areas of the stack and heap provides dead memory opt.

  24. User Directed Memory Exclusion • Libckpt provides the app. programmer with two functions • exclude_bytes(ptr, length, usage) • Specify an area of memory to exclude from future checkpoints • include_bytes(ptr, length) • Add a previously excluded area of memory to future checkpoints

  25. Clean Memory • If mem is clean • exclude_bytes(mem, …, CKPT_READONLY) • Include mem in the next checkpoint, but exclude it from all subsequent ones • Cannot write to mem until after the call to include_bytes(mem) • On recovery, restore the last saved version of mem

  26. Clean Memory: Example

      for (…) {
        A = init_A()
        exclude_bytes(A, …, CKPT_READONLY)
        do_stuff(A)              // assuming A does not change
        include_bytes(A, …)
      }

  27. Dead Memory • If mem is dead • exclude_bytes(mem, …, CKPT_DEAD) • Do not checkpoint mem • Cannot read mem until after include_bytes(mem) • Will not restore mem

  28. Dead Memory: Example

      for (…) {
        A = init_A()
        do_stuff(A)
        exclude_bytes(A, …, CKPT_DEAD)
        do_other_stuff()         // assumes will not read A
        include_bytes(A, …)
      }
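The bookkeeping behind the two exclusion modes can be modeled with a small table. This is a toy model of the semantics stated on the slides, not libckpt's implementation; every `toy_*` and `TOY_*` name is hypothetical, and regions are matched by exact pointer and length only.

```c
#include <assert.h>
#include <stddef.h>

enum toy_usage { TOY_READONLY, TOY_DEAD };

struct toy_region {
    const void *ptr;
    size_t len;
    enum toy_usage usage;
    int saved_once;   /* READONLY regions are written one more time */
    int active;
};

#define TOY_MAX 16
static struct toy_region table[TOY_MAX];

static struct toy_region *toy_find(const void *ptr, size_t len) {
    for (int i = 0; i < TOY_MAX; i++) {
        if (table[i].active && table[i].ptr == ptr && table[i].len == len)
            return &table[i];
    }
    return NULL;
}

/* Exclude a region from future checkpoints. Returns 0, or -1 if full. */
int toy_exclude_bytes(const void *ptr, size_t len, enum toy_usage usage) {
    assert(ptr != NULL && len > 0);
    for (int i = 0; i < TOY_MAX; i++) {
        if (!table[i].active) {
            table[i] = (struct toy_region){ptr, len, usage, 0, 1};
            return 0;
        }
    }
    return -1;
}

/* Put a previously excluded region back. Returns 0, or -1 if unknown. */
int toy_include_bytes(const void *ptr, size_t len) {
    struct toy_region *r = toy_find(ptr, len);
    if (r == NULL) return -1;
    r->active = 0;
    return 0;
}

/* Asked once per region at each checkpoint: must it be written? */
int toy_should_write(const void *ptr, size_t len) {
    struct toy_region *r = toy_find(ptr, len);
    if (r == NULL) return 1;                 /* not excluded: always write */
    if (r->usage == TOY_DEAD) return 0;      /* dead: never written */
    if (r->saved_once) return 0;             /* clean: already on disk */
    r->saved_once = 1;                       /* clean: write this once */
    return 1;
}
```

The model makes the difference visible: a read-only region costs one more write and then nothing, while a dead region costs nothing at all until it is included again.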

  29. Using Memory Exclusion • There can be a dramatic reduction in the size of the checkpoint file • Must be used very carefully • Inadvertently excluding a live region from a checkpoint could cause erroneous behavior on restart

  30. Synchronous Checkpointing • At different points in the program’s execution the amount of ‘live’ state varies widely • The stack might be much smaller (shallower call graph) • Heap items might have been de-allocated • Regions of memory might be dead or clean

  31. Synchronous Ckpt (2) • If checkpoints are taken at times where there is relatively little live state, the checkpoint file size (and overhead) will be smaller • Allow user to specify where in a program a checkpoint should be taken • Independent of timers (signals)

  32. Sync. Ckpt. Example

      for (…) {
        checkpoint_here()
        A = malloc(…)
        do_stuff(A)
        free(A)
      }

  33. Synchronous Ckpt (3) • To avoid checkpointing too frequently, mintime parameter specifies the minimal amount of time between two checkpoints • If checkpoint_here() is called less than mintime seconds after the last checkpoint, the call is ignored

  34. Synchronous Ckpt (4) • To ensure that checkpoints are taken frequently enough to be of use, maxtime parameter specifies the maximum time allowed to elapse between two checkpoints • If maxtime passes, an asynchronous checkpoint is taken

  35. Combining Mem. Exclusion and Sync. Checkpointing

      Original program:

      main() {
        D = malloc
        f = file
        while (!done) {
          D = read(f)
          perform_calc(D)
          output_result()
        }
      }

      With memory exclusion and a synchronous checkpoint:

      ckpt_target() {
        D = malloc
        f = file
        while (!done) {
          D = read(f)
          perform_calc(D)
          output_result()
          exclude_bytes(D, …, CKPT_DEAD)
          checkpoint_here()
          include_bytes(D, …)
        }
      }
