Checkpoint Based Recovery from Power Failures

Christopher Sutardja

Emil Stefanov

Goals
  • Consistent checkpoint
    • A consistent snapshot of memory for a specific time in the past.
  • Safe even under power failure
    • The checkpoint is never “in transition”
  • Small storage overhead
    • Not much more than double the memory.
  • Low performance overhead
    • Should not stall the processor for too long.
  • Scalable
    • Scales well in large core networks such as meshes.
Related Work
  • On the feasibility of incremental checkpointing for scientific computing by J. Sancho et al.
    • Speculates about the future role of checkpointing in parallel machines.
    • As the number of processing nodes grows exponentially, the failure of at least one node becomes much more likely.
    • Error correction codes and other redundancies would introduce too much overhead when used alone.
    • As a result, research on checkpoint recovery is growing in importance.
Related Work
  • Modular Checkpointing for Atomicity by L. Ziarek et al.
    • Introduces an abstraction called stabilizers to make checkpointing easier.
    • Targets message-passing machines
      • Makes consistent checkpointing more challenging.
Related Work
  • SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery by D. Sorin et al.
    • Explores the concept of checkpointing in logical time.
    • Multiple checkpoints.
    • Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
    • Low execution overhead.
    • Not safe from power failures.
Related Work
  • ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors by M. Prvulovic et al.
    • Explores different ways of rollback recovery in shared-memory multiprocessor systems. Considers:
      • the scope of the checkpoint
      • memory
      • checkpointing mechanism.
    • Achieves about 6% checkpointing overhead.
    • Not safe from power failures.
    • Not geared towards non-volatile memory: requires fast writes.
Related Work
  • Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory by Chin Wu et al.
    • As Flash Memory becomes cheaper and denser, the uses for Flash increase.
    • Uses flash for recovering file systems.
    • Yet another use of flash for recovery.
    • Uses a log-based method to accelerate remounting after a system crash by minimizing the amount of information that has to be changed upon reboot.
[Diagram: Core, L1, L2, and four Memory Controller / DRAM pairs.]

[Diagram: the same components with checkpointing hardware added. Labels: Checkpoint Coordinator; a DRAM Checkpointer beside each of the four Memory Controller / DRAM pairs, containing an Address Decoder and four Buffer, Log, and Checkpoint blocks; a Cache Checkpoint Controller for each of L1 and L2; Checkpoint A and Checkpoint B storage for the Core, L1, and L2.]

Checkpointing Techniques
  • For Caches and Cores:
    • Each cache/core has two flash storage areas adjacent to it.
      • One holds the previous checkpoint.
      • One holds the current checkpoint.
    • During a checkpoint, the cache/core internal state is copied to flash storage.
  • For DRAM:
    • The checkpointing system snoops on DRAM.
    • DRAM changes are continuously logged to flash memory.
    • A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
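
The DRAM logging path described above can be sketched as a small software model. This is only an illustration under stated assumptions, not the authors' hardware design: the class name `DramCheckpointer`, the single bounded buffer (the slides describe a chain of parallel buffers), and the `stalls` counter are all hypothetical. It shows why a buffer that is drained in the background rarely forces a write to wait.

```python
from collections import deque


class DramCheckpointer:
    """Toy model: snoops DRAM writes and logs them to flash through a buffer.

    Hypothetical names; a single buffer stands in for the chain of
    parallel buffers mentioned in the slides.
    """

    def __init__(self, buffer_capacity=4):
        self.buffer = deque()            # volatile staging buffer
        self.buffer_capacity = buffer_capacity
        self.flash_log = []              # stands in for the append-only flash log
        self.stalls = 0                  # writes that had to wait for a drain

    def snoop_write(self, address, value):
        """Record one observed DRAM write."""
        if len(self.buffer) >= self.buffer_capacity:
            # Buffer full: the write must wait until one entry reaches flash.
            self.stalls += 1
            self.drain(1)
        self.buffer.append((address, value))

    def drain(self, max_entries=None):
        """Move up to max_entries buffered writes to the flash log, oldest first."""
        count = len(self.buffer)
        if max_entries is not None:
            count = min(max_entries, count)
        for _ in range(count):
            self.flash_log.append(self.buffer.popleft())
        return count
```

In this model a stall happens only when writes arrive faster than `drain` is called, which is the situation the buffering is meant to make rare.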
Responsibilities of the Main Components
  • Checkpoint Coordinator
    • Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.
  • DRAM Checkpointer
    • Continuously logs DRAM changes.
    • Checkpoints when instructed by the coordinator.
  • Cache Checkpoint Controller
    • Checkpoints the adjacent cache when instructed by the coordinator.
Steps for Checkpointing (1 of 2)
  • The coordinator sets the checkpoint signal to 1.
  • In parallel, each:
    • Core:
      • Pauses processing instructions.
      • Copies internal state to flash memory.
    • Cache Checkpoint Controller:
      • Copies cache internal state to flash memory (data is copied one line at a time).
    • DRAM Checkpointer:
      • Flushes buffer to flash log.
      • Notifies checkpoint coordinator that the buffer has been flushed.
Steps for Checkpointing (2 of 2)
  • The coordinator sets the checkpoint signal to 0.
  • In parallel, each:
    • Core:
      • Flips flash memory bit to indicate the new checkpoint buffer.
    • Cache Checkpoint Controller:
      • Flips flash memory bit to indicate the new checkpoint buffer.
    • DRAM Checkpointer:
      • Marks checkpoint boundary in flash log.
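
The two steps above, as they apply to cores and caches, can be sketched in software. This is a minimal model under assumptions: `DoubleBufferedStore`, `run_checkpoint`, and the `fail_before_flip` flag are hypothetical names introduced for illustration, and the sketch only simulates a power loss between the copy phase and the flip phase. It shows why the previous checkpoint stays intact until the bit is flipped.

```python
class DoubleBufferedStore:
    """Two flash slots plus one bit naming the slot that holds the valid checkpoint."""

    def __init__(self):
        self.slots = [None, None]
        self.valid_bit = 0

    def write_inactive(self, state):
        # Copy state into the slot that is NOT currently valid,
        # so the valid checkpoint is never overwritten.
        self.slots[1 - self.valid_bit] = dict(state)

    def flip(self):
        # A single-bit update makes the newly written slot the valid one.
        self.valid_bit = 1 - self.valid_bit

    def valid_checkpoint(self):
        return self.slots[self.valid_bit]


def run_checkpoint(stores, states, fail_before_flip=False):
    """Run one checkpoint over all components; return True if it committed."""
    # Step 1 (checkpoint signal = 1): every component copies its state.
    for store, state in zip(stores, states):
        store.write_inactive(state)
    if fail_before_flip:
        return False  # simulated power failure before any bit is flipped
    # Step 2 (checkpoint signal = 0): every component flips its bit.
    for store in stores:
        store.flip()
    return True
```

For example, after one committed checkpoint, calling `run_checkpoint(..., fail_before_flip=True)` leaves `valid_checkpoint()` returning the earlier state for every store. A failure partway through the flips is not modeled here.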
[Diagram: Core, L1, and L2, each with Checkpoint A and Checkpoint B storage; L1 and L2 each have a Cache Checkpoint Controller. Eight blocks are labeled "F".]

[Diagram: DRAM Checkpointer detail. Labels: Address Decoder; Buffered Changes; four Buffer, Log, and Checkpoint blocks; a log region marked "start" and "end" holding Previous Checkpoint Changes and Next Checkpoint Changes; Previous Checkpoint (random access).]

Recovering
  • Determining which Checkpoint to use
    • The system checks which checkpoint is the most recent.
    • If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
  • Restoring Previous State
    • Each architectural register is rewritten.
    • Each cache is rewritten from its adjacent flash buffer (one cache line at a time).
    • Main memory is recovered.
    • Take advantage of pipelined writes if available.
  • Resume Execution
    • Resume program counter
    • Notify the CPUs that the system is restoring from a checkpoint (single bit).
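
The main-memory part of recovery can be sketched as a log replay. The slides do not specify the log format, so everything here is an assumption made for illustration: the `("write", address, value)` and `("boundary",)` entry shapes, the `recover_dram` function, and the idea that memory is rebuilt from a stored image plus the logged changes up to the last boundary. Changes logged after the last boundary belong to a checkpoint that was still in progress and are discarded.

```python
BOUNDARY = "boundary"   # hypothetical marker written when a checkpoint completes
WRITE = "write"         # hypothetical entry type for one logged DRAM change


def recover_dram(base_image, flash_log):
    """Rebuild memory contents as of the last completed checkpoint.

    base_image: dict mapping address -> value for the stored checkpoint image.
    flash_log:  list of ("write", address, value) and ("boundary",) entries,
                oldest first.
    """
    # Find the last completed checkpoint boundary in the log.
    last_boundary = -1
    for index, entry in enumerate(flash_log):
        if entry[0] == BOUNDARY:
            last_boundary = index

    # Replay only the changes recorded up to that boundary.
    memory = dict(base_image)
    for entry in flash_log[: last_boundary + 1]:
        if entry[0] == WRITE:
            _, address, value = entry
            memory[address] = value
    return memory
```

If the log contains no boundary at all, the function returns the stored image unchanged, which corresponds to falling back to the older checkpoint.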