Availability in cmps
1 / 16

Availability in CMPs - PowerPoint PPT Presentation

  • Uploaded on

Availability in CMPs. By Eric Hill Pranay Koka. Motivation. RAS is an important feature for commercial servers Server downtime is equivalent to lost money Investigate the feasibility of previously proposed SMP availability schemes in the context of CMPs. Availability.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Availability in CMPs' - lindley

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Availability in cmps

Availability in CMPs


Eric Hill

Pranay Koka


  • RAS is an important feature for commercial servers

    • Server downtime is equivalent to lost money

  • Investigate the feasibility of previously proposed SMP availability schemes in the context of CMPs


  • Focused on Backward Error Recovery (BER) schemes

    • System periodically checkpoints state

    • Roll back to previously validated checkpoint upon fault detection

  • Design Choices

    • When to checkpoint

    • How to checkpoint

    • Where to store checkpointed state

Ber taxonomy when
BER Taxonomy (When)

  • Checkpoint Consistency

    • Global

      • Processors synchronize in physical time to create checkpoints

      • Synchronization latency could be expensive

    • Coordinated Local

      • Processors synchronize in logical time to create checkpoints

    • Uncoordinated Local

      • Processors create checkpoints without synchronization

Ber taxonomy how
BER Taxonomy (How)

  • Flushing

    • At checkpoint creation, processor and cache state is dumped to memory

  • Incremental Logging

    • Within a checkpoint interval, changes to cache state are recorded. Processor state is logged at checkpoint interval boundary

Ber taxonomy where
BER Taxonomy (Where)

  • On-Chip

    • Consumes transistors which could be used for caching or computation

    • Logging/Restoring of checkpoint state easier

  • Off-Chip

    • No storage overhead on actual CMP

    • Logging/Restoring of state consumes bandwidth


  • Global Checkpoint Consistency

    • Synchronization time should be reduced for a single CMP

  • Flushing of data

    • Consumes bandwidth

  • Off-chip checkpoint storage


  • Coordinated Local Checkpointing

    • Global logical clock distributed across system

  • Incremental Logging of Data

    • Incrementally record changes in cache state

  • On chip storage

    • Consumes transistors which could be used for processors/cache


  • Estimation of CLB size for SafetyNet in a CMP

    • Placed counters in ruby to record update-actions

    • CLB overhead = (CLB entries created/interval) * (5 outstanding checkpoints) * (72 bytes per entry)

  • Estimation of ReVive flushing overhead

    • Counted how many cache lines need to be flushed per interval

  • Estimation of I/O buffering overhead

    • Measured disk and network I/O for 2 workloads by modifying Linux kernel

Design assumptions
Design Assumptions

  • Assumed inclusive 2 Level memory hierarchy with write-back L1 & L2 caches

    • Logging must be done at both levels of memory hierarchy

  • Write-through L1 caches only require logging at 1 level

    • L2 has knowledge of every store performed in private L1s

  • Exclusion also requires logging at both levels

Safetynet results
SafetyNet Results

16 MB L2 cache – 100K cycle checkpoint interval, 25 transactions

Safetynet results1
SafetyNet Results

  • L2 CLB overhead is normalized to transactions

  • CLB size calculation is related to the CLB overhead per checkpoint interval

    • As more processors are added, throughput (transactions/interval) increases

  • Though the overhead cost per transaction decreases as more processors are added to a CMP, the absolute cost is still increasing


  • CLBs are sized to handle bursty, not average case

    • Worst case was 4.5 MB of overhead, for zeus 16 proc workload

    • Resized L2 cache to 12 MB by removing way of associativity

    • Resulted in 11.75% reduction in IPC

  • Large number of cache lines need to be flushed by ReVive

    • Should suffer more performance degradation than SafetyNet

What about i o
What About I/O?

  • Conventional wisdom – Buffer I/O

  • Buffering requirements are a function of validation latency.

  • I/O data rates on a 1p system


  • SafetyNet should perform better than ReVive

    • Storage overhead is still a problem

  • For small CMPs I/O buffering should not be a problem

    • More workloads still need to be studied

    • This may not be true for CMPs with large numbers of processors