
SafetyNet

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery. Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood. March 31st, 2006. Target: systems where availability is crucial.


Presentation Transcript


  1. SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery. Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood. March 31st, 2006

  2. Target: • Systems where availability is crucial • SMP commercial servers: application services, database management systems Motivation: • Higher performance => smaller feature sizes => lower reliability • The cost of the fault-tolerance solution is important

  3. Approach and Challenges • Decouple: • Local fault detection - ECC, timeouts, etc. • Lightweight, global fault recovery - SafetyNet • Challenges for lightweight recovery schemes: • Amount of storage (checkpoint logs) • Maintaining a consistent global recovery point • Advancing the global recovery point

  4. SafetyNet: High-Level View • Maintain per-processor checkpoints: • One globally validated recovery point • Multiple coordinated checkpoints pending validation • Identified by a global logical timestamp • Fault detected => recover state to the global recovery point
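
The checkpoint bookkeeping on this slide can be sketched for a single node as follows. This is a minimal illustration, not the hardware design: the class name `NodeCheckpoints`, its methods, and the use of a register dictionary are all assumptions made for the example.

```python
class NodeCheckpoints:
    """Toy model of one node: one validated recovery point plus
    newer checkpoints that are still pending validation."""

    def __init__(self, initial_registers):
        self.registers = dict(initial_registers)
        self.current_cn = 0          # checkpoint number (logical timestamp)
        self.recovery_point = 0      # newest globally validated checkpoint
        self.checkpoints = {0: dict(initial_registers)}  # CN -> snapshot

    def take_checkpoint(self):
        # Snapshot the registers once, at the start of the new checkpoint.
        self.current_cn += 1
        self.checkpoints[self.current_cn] = dict(self.registers)

    def validate(self, cn):
        # Advance the recovery point and free older checkpoints.
        if cn not in self.checkpoints or cn < self.recovery_point:
            raise ValueError("cannot validate checkpoint %r" % cn)
        self.recovery_point = cn
        self.checkpoints = {k: v for k, v in self.checkpoints.items() if k >= cn}

    def recover(self):
        # Fault detected: roll back to the validated recovery point and
        # drop every checkpoint that was still pending validation.
        snapshot = self.checkpoints[self.recovery_point]
        self.registers = dict(snapshot)
        self.checkpoints = {self.recovery_point: snapshot}
        self.current_cn = self.recovery_point


node = NodeCheckpoints({"pc": 0})
node.registers["pc"] = 10
node.take_checkpoint()        # checkpoint 1 captures pc = 10
node.registers["pc"] = 20
node.take_checkpoint()        # checkpoint 2 captures pc = 20 (pending)
node.validate(1)              # checkpoint 1 becomes the recovery point
node.registers["pc"] = 30
node.recover()                # fault: state returns to checkpoint 1
```

Keeping several pending checkpoints lets execution continue while older checkpoints are still being validated.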

  5. Solutions: Storage • Checkpoint architectural state: • Registers: • Shadow registers or cached copies • Copy once at the beginning of each checkpoint • Memory and caches: • Checkpoint Log Buffers (CLBs) • Incrementally log stores and ownership changes • Log only the first update per block per checkpoint
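
The "log only the first update per block per checkpoint" rule can be sketched in software as below. This is a hedged illustration, not the actual CLB hardware: `CheckpointLogBuffer`, its fields, and the dictionary-based memory are hypothetical names chosen for the example, and ownership changes are not modeled.

```python
class CheckpointLogBuffer:
    """Toy CLB: saves a block's old value the first time the block is
    written in each checkpoint interval."""

    def __init__(self):
        self.memory = {}       # block address -> current value
        self.block_cn = {}     # block address -> interval of its last logged update
        self.log = []          # entries: (checkpoint number, address, old value)
        self.current_cn = 1    # current checkpoint interval

    def store(self, addr, value):
        # Later stores to the same block in the same interval are not logged:
        # the first entry already holds the value needed to undo them all.
        if self.block_cn.get(addr) != self.current_cn:
            self.log.append((self.current_cn, addr, self.memory.get(addr)))
            self.block_cn[addr] = self.current_cn
        self.memory[addr] = value

    def new_checkpoint(self):
        self.current_cn += 1


clb = CheckpointLogBuffer()
clb.memory["A"] = 0
clb.store("A", 1)      # first update in interval 1: logged
clb.store("A", 2)      # second update in interval 1: not logged
clb.new_checkpoint()
clb.store("A", 3)      # first update in interval 2: logged
```

Under this rule, log storage grows with the number of distinct blocks touched per interval rather than with the number of stores.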

  6. Solution: Global Coherence • Logical time base: • All nodes agree on the checkpoint interval in which each coherence transaction occurs • Loosely synchronous checkpoint clock • Maintain a per-block checkpoint number (CN)
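
One way to see why a loosely synchronous clock can serve as a logical time base is the small model below. It assumes (this is not stated on the slide) that the clock skew between two nodes is smaller than the minimum message latency between them, so a message is never received in an earlier checkpoint interval than the one in which it was sent. The function names and the integer time units are hypothetical.

```python
def checkpoint_number(true_time, skew, interval):
    """Checkpoint interval seen by a node whose clock is offset by `skew`."""
    return (true_time + skew) // interval


def receiver_never_behind(send_time, latency, sender_skew, receiver_skew, interval):
    """True if the receiver's interval at arrival is not older than the
    sender's interval at departure."""
    cn_send = checkpoint_number(send_time, sender_skew, interval)
    cn_recv = checkpoint_number(send_time + latency, receiver_skew, interval)
    return cn_recv >= cn_send


# Skew of 5 time units between the two nodes, checkpoint interval of 100.
safe = all(receiver_never_behind(t, 10, 5, 0, 100) for t in range(1000))
unsafe = all(receiver_never_behind(t, 2, 5, 0, 100) for t in range(1000))
```

In this model `safe` holds because the latency (10) exceeds the skew (5), while `unsafe` fails because a latency of 2 lets a message cross an interval boundary backwards.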

  7. Solution: Global Recovery Point • Checkpoint validation: • All nodes agree that execution up to that point is error free • Broadcast the new recovery point checkpoint number • Restart: • Drain the interconnection network • Discard in-progress coherence state • Processors: restore the register checkpoint • Memory: undo actions in Checkpoint Log Buffers (CLBs) • Caches: undo actions in CLBs
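
Validation and the undo step of restart can be sketched as follows. This is a simplified software model under stated assumptions: each node reports the newest checkpoint it has verified as error free, the recovery point is the minimum of those reports, and a log entry `(cn, addr, old_value)` records the value a block held before its first update in interval `cn`. The function names and log layout are hypothetical, and draining the network and restoring registers are not modeled.

```python
def new_recovery_point(verified_cn_per_node):
    """Newest checkpoint that every node reports as error free."""
    return min(verified_cn_per_node.values())


def undo_log(memory, log, recovery_cn):
    """Roll memory back to the start of interval `recovery_cn` by undoing
    logged updates newest first; returns the log entries that remain."""
    for cn, addr, old_value in reversed(log):
        if cn >= recovery_cn:
            if old_value is None:
                memory.pop(addr, None)   # block did not exist before the update
            else:
                memory[addr] = old_value
    return [entry for entry in log if entry[0] < recovery_cn]


# Interval 1 wrote A (old value 0); interval 2 wrote A again and created B.
memory = {"A": 2, "B": 9}
log = [(1, "A", 0), (2, "A", 1), (2, "B", None)]

recovery_cn = new_recovery_point({"p0": 3, "p1": 2, "mem": 4})
remaining = undo_log(memory, log, recovery_cn)
```

Undoing entries newest first matters when a block is logged in several intervals, because the oldest undone entry must be the last one applied.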

  8. Evaluation: Performance Impact

  9. Evaluation: Sensitivity

  10. Evaluation: Sensitivity (Cont)

  11. Questions • Why is having a coordinated checkpoint important? • Why broadcast the recovery point checkpoint number twice: • when advancing the recovery point • when triggering recovery? • Why a sequentially consistent model? • Is the scheme valid for processor consistency? • Is this a good idea? Has it caught on?
