
Transient Fault Detection and Recovery via Simultaneous Multithreading

Nevroz ŞEN

26/04/2007

AGENDA

  • Introduction & Motivation

  • SMT, SRT & SRTR

  • Fault Detection via SMT (SRT)

  • Fault Recovery via SMT (SRTR)

  • Conclusion



INTRODUCTION

  • Transient Faults:

    • Faults that persist for a “short” duration

    • Caused by cosmic rays (e.g., neutrons)

    • Charge or discharge internal nodes of logic or SRAM cells; high-frequency crosstalk is another source

  • Solution

    • No practical solution to absorb cosmic rays

      • 1 fault per 1000 computers per year (estimated fault rate)

  • The future is worse

    • Smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins



  • Fault tolerant systems use redundancy to improve reliability:

    • Time redundancy: separate executions

    • Space redundancy: separate physical copies of resources

      • DMR/TMR

    • Data redundancy

      • ECC

      • Parity



  • Simultaneous Multithreading improves the performance of a processor by allowing multiple independent threads to execute simultaneously (same cycle) in different functional units

  • Use the replication the threads already provide to run two copies of the same program, so that faults can be detected by comparing their outputs (a minimal sketch follows below)
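
A minimal sketch (not from the original slides) of that idea, with compute() standing in for an arbitrary program and fault_mask modeling a transient bit flip:

    def compute(x):
        # stand-in for the program both redundant copies execute
        return (x * 3 + 7) & 0xFFFFFFFF

    def run_redundant(x, fault_mask=0):
        leading = compute(x)
        trailing = compute(x) ^ fault_mask   # fault_mask models a transient bit flip
        if leading != trailing:
            raise RuntimeError("transient fault detected: outputs differ")
        return leading

    print(run_redundant(42))                  # fault-free run: value is usable
    try:
        run_redundant(42, fault_mask=1 << 5)  # single bit flip in one copy
    except RuntimeError as err:
        print(err)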


[Figure: conventional fault detection – two replicated microprocessors running in cycle-by-cycle lockstep, each executing R1 ← (R2); outside the replicated pair, memory is covered by ECC, the RAID array by parity, and ServerNet by CRC]

Replicated Microprocessors + Cycle-by-Cycle Lockstepping


[Figure: the same system with two replicated threads on one SMT processor in place of the two microprocessors; memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC]

Replicated Threads + Cycle-by-Cycle Lockstepping ???



  • Less hardware compared to replicated microprocessors

    • SMT needs ~5% more hardware over uniprocessor

    • SRT adds very little hardware overhead to existing SMT

  • Better performance than complete replication

    • Better use of resources

  • Lower cost

    • Avoids complete replication

    • Market volume of SMT & SRT

Motivation - Challenges


  • Cycle-by-cycle output comparison and input replication (cycle-by-cycle lockstepping) is problematic for SMT:

    • Equivalent instructions from different threads might execute in different cycles

    • Equivalent instructions from different threads might execute in a different order with respect to other instructions in the same thread

  • Precise scheduling of the threads is crucial, but is perturbed by:

    • Branch misprediction

    • Cache miss









Simultaneous Multithreading (SMT)

SMT, SRT & SRTR


  • SRT: Simultaneous & Redundantly Threaded Processor

  • SRT = SMT + Fault Detection

  • SRTR: Simultaneous & Redundantly Threaded Processor with Recovery

  • SRTR = SRT + Fault Recovery


Fault Detection via SMT - SRT

  • Sphere of Replication (SoR)

  • Output comparison

  • Input replication

  • Performance Optimizations for SRT

  • Simulation Results

SRT - Sphere of Replication (SoR)

Logical boundary of redundant execution within a system

Components inside the sphere are protected against faults using replication

External components must use other means of fault tolerance (parity, ECC, etc.)

Its size matters:

Error detection latency

Stored-state size



SRT - Sphere of Replication (SoR) for SRT

Excludes instruction and data caches

Alternate SoRs are possible (e.g., exclude the register file)

Output Comparison


  • Compare & validate outputs before sending them outside the SoR, catching faults before they propagate to the rest of the system

  • No need to compare every instruction: an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, so checking only stores suffices (a minimal sketch follows this list)

  • Check:

    • Addresses and data for stores from the redundant threads: both comparison and validation happen at commit time

    • Addresses for uncached loads from the redundant threads

    • Addresses for cached loads from the redundant threads: not required

  • Other output comparisons depend on where the SoR boundary is drawn
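
A hedged sketch of these checking rules; the function names and (address, data) tuples are illustrative, not the paper's notation:

    def check_store(lead, trail):
        # lead/trail are (address, data) pairs from the redundant threads;
        # both must match before the store leaves the SoR at commit time
        return lead == trail

    def check_uncached_load(lead_addr, trail_addr):
        # uncached loads have side effects outside the SoR, so their
        # addresses are compared before the access is issued
        return lead_addr == trail_addr

    # Cached load addresses need no check: a faulty address produces a wrong
    # value that propagates to a later store, where the store check catches it.

    assert check_store((0x1000, 0xDEAD), (0x1000, 0xDEAD))        # matches: store may leave the SoR
    assert not check_store((0x1000, 0xDEAD), (0x1000, 0xBEEF))    # data mismatch: fault detected
    assert check_uncached_load(0x8000, 0x8000)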

Output Comparison - Store Queue

[Figure: the leading and trailing threads' store queues, each holding a sequence of stores ending in Store: R1 → (R2); matching head entries are compared and then sent to the data cache]


  • Bottleneck if store queue is shared

  • Separate per-thread store queues boost performance

Input Replication


  • Replicate & deliver the same inputs (coming from outside the SoR) to both redundant copies. To do this:

  • Instructions: Assume no self-modification. No check

  • Cached load data:

    • Active Load Address Buffer

    • Load Value Queue

  • Uncached load data:

    • Synchronize when comparing addresses that leave the SoR

    • When data returns, replicate the value for the two threads

  • External Interrupts:

    • Stall lead thread and deliver interrupt synchronously

    • Record interrupt delivery point and deliver later


Input Replication - Active Load Address Buffer (ALAB)

  • Delays a cache block's replacement or invalidation until after the retirement of the trailing thread's load (the policy is sketched below)

  • Counter tracks trailing thread’s outstanding loads

  • When a cache block is about to be replaced:

    • The ALAB is searched for an entry matching the block’s address

    • If counter != 0 then:

      • Do not replace or invalidate until the trailing thread is done

      • Set the pending-invalidate bit

    • Otherwise, replace / invalidate
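
A behavioral sketch of this policy, assuming one outstanding-load counter and a pending-invalidate bit per tracked block (class and method names are illustrative):

    class ALABEntry:
        def __init__(self):
            self.outstanding = 0              # leading loads not yet repeated by the trailing thread
            self.pending_invalidate = False

    class ALAB:
        def __init__(self):
            self.entries = {}                 # block address -> ALABEntry

        def leading_load(self, block):
            self.entries.setdefault(block, ALABEntry()).outstanding += 1

        def trailing_load(self, block):
            entry = self.entries[block]
            entry.outstanding -= 1
            if entry.outstanding == 0 and entry.pending_invalidate:
                del self.entries[block]       # deferred replacement may now proceed

        def try_replace(self, block):
            entry = self.entries.get(block)
            if entry and entry.outstanding != 0:
                entry.pending_invalidate = True   # defer until the trailing thread catches up
                return False
            self.entries.pop(block, None)
            return True

    alab = ALAB()
    alab.leading_load(0x40)
    print(alab.try_replace(0x40))   # False: a trailing load is still outstanding
    alab.trailing_load(0x40)
    print(alab.try_replace(0x40))   # True: the block may be replaced or invalidated now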

Input Replication - Load Value Queue (LVQ)


[Figure: Load Value Queue – the leading thread's load R1 ← (R2) probes the cache and places the address and value in the LVQ; the trailing thread's load R1 ← (R2) reads the value from the LVQ instead of the cache]

  • An alternative to the ALAB – simpler

  • Pre-designated leading & trailing threads

  • Protected by ECC



  • Advantages over the ALAB (a minimal sketch follows this list):

    • Reduces the pressure on data cache ports

    • Accelerates detection of faulty load addresses

    • Simpler design
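
A minimal behavioral sketch of the LVQ, assuming an in-order FIFO of (address, value) pairs filled only by the leading thread (class and method names are illustrative):

    from collections import deque

    class LVQ:
        def __init__(self):
            self.queue = deque()

        def leading_load(self, addr, cache):
            value = cache[addr]               # only the leading thread probes the cache
            self.queue.append((addr, value))
            return value

        def trailing_load(self, addr):
            lead_addr, value = self.queue.popleft()
            if lead_addr != addr:
                raise RuntimeError("fault detected: load address mismatch")
            return value                      # replicated input; no cache port used

    cache = {0x100: 7}
    lvq = LVQ()
    assert lvq.leading_load(0x100, cache) == 7
    assert lvq.trailing_load(0x100) == 7      # same value delivered to both threads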


Performance Optimizations for SRT

  • Idea: use one thread to improve the cache and branch prediction behavior of the other thread. Two techniques:

    • Slack Fetch

      • Maintains a constant slack of instructions between the threads (see the fetch-policy sketch below)

      • Prevents the trailing thread from seeing mispredictions and cache misses

    • Branch Outcome Queue (BOQ)
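
An illustrative fetch policy (not the paper's exact mechanism) for Slack Fetch: the trailing thread is fetched only while the leading thread stays at least slack instructions ahead, where slack is a parameter of the sketch:

    def choose_fetch_thread(lead_fetched, trail_fetched, slack=256):
        # lead_fetched / trail_fetched: instructions fetched so far by each thread
        if lead_fetched - trail_fetched < slack:
            return "leading"      # let the leading thread build up slack
        return "trailing"         # the trailer runs behind resolved branches and a warm cache

    print(choose_fetch_thread(100, 0))     # leading
    print(choose_fetch_thread(400, 100))   # trailing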









Performance Optimizations for SRT - Branch Outcome Queue (BOQ)

  • Sends committed branch outcomes (branch PCs and outcomes) from the leading thread to the trailing thread (sketched below)

  • In the fetch stage, the trailing thread uses the head of the queue like a branch target buffer
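
A small sketch of the BOQ, assuming a FIFO of committed (PC, taken, target) tuples; the trailing thread's fetch consumes the head entry instead of consulting a branch predictor:

    from collections import deque

    boq = deque()

    def leading_commit_branch(pc, taken, target):
        boq.append((pc, taken, target))        # only committed outcomes, never wrong-path speculation

    def trailing_fetch_next(pc, fallthrough_pc):
        head_pc, taken, target = boq.popleft()
        assert head_pc == pc                   # both threads see the same instruction stream
        return target if taken else fallthrough_pc

    leading_commit_branch(0x400, True, 0x480)
    print(hex(trailing_fetch_next(0x400, 0x404)))   # 0x480: the trailing thread never mispredicts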

Simulation Results

Simulation Environment:

Modified SimpleScalar “sim-outorder”

Long front-end pipeline because of out-of-order nature and SMT

Simple approximation of trace cache

Used 11 SPEC95 benchmarks


Simulation Results

ORH: On-Chip Replicated Hardware

ORH-Dual -> two pipelines, each with half the resources

SMT-Dual -> Replicated threads with no detection hardware


Simulation Results - Slack Fetch & Branch Outcome Queue

Up to 27% performance improvement with SF, BOQ, and SF + BOQ

Performance is better with a slack of 256 instructions than with 32 or 128

The slack prevents the trailing thread from wasting resources by speculating


Simulation Results - Input Replication

Very low performance degradation with a 64-entry ALAB or LVQ

On average, a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5%, respectively.


Simulation Results - Overall

Comparison with ORH-Dual

SRT processor configuration: slack fetch of 256 instructions, a 128-entry BOQ, a 64-entry store buffer, and a 64-entry LVQ

Average 16% and maximum 29% performance advantage over a lockstepping processor with the “same” hardware



Fault Recovery via SMT (SRTR)

  • What is wrong with SRT: A leading non-store instruction may commit before the check for the fault occurs

    • Relies on the trailing thread to trigger the detection

    • However, an SRT processor still works well in a fail-fast architecture, where detection alone suffices

    • A faulty instruction cannot be undone once the instruction commits.


Fault Recovery via SMT (SRTR) - Motivation

  • In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection.

  • In contrast, SRTR must not allow any leading instruction to commit before checking occurs.

  • SRTR exploits the time between completion and commit of a leading instruction, checking the results as soon as the trailing instruction completes

  • In SPEC95, complete to commit takes about 29 cycles

  • This short slack has some implications:

    • Leading thread provides branch predictions

    • The StB (store buffer), LVQ, and BOQ need to handle mispredictions


Fault Recovery via SMT (SRTR) - Motivation

  • The leading thread provides the trailing thread with branch predictions instead of outcomes (as in SRT).

  • A Register Value Queue (RVQ) stores register values and other information needed for checking instructions, avoiding bandwidth pressure on the register file.

  • Dependence-Based Checking Elision (DBCE) is introduced to reduce the number of checks

  • Recovery uses the traditional rollback ability of modern pipelines


SRTR Additions to SMT

[Figure: SRTR additions to the SMT pipeline]

predQ: Prediction Queue

LVQ: Load Value Queue

CVs: Commit Vectors

AL: Active List

RVQ: Register Value Queue

SRTR - Active List (AL) & LVQ


  • Leading and trailing instructions occupy the same positions in their ALs (private to each thread)

    • They may enter their ALs and become ready to commit at different times

  • The LVQ has to be modified to allow speculative loads

    • The Shadow Active List (SAL) holds pointers to LVQ entries

    • A trailing load might issue before the leading load

    • Branches place the LVQ tail pointer in the SAL

    • On a misprediction, the LVQ tail is rolled back using the pointer saved in the SAL

SRTR - predQ


  • The leading thread places predicted PCs in the predQ

  • Similar to the BOQ, but holds predictions instead of outcomes

  • Using the predQ, the two threads fetch essentially the same instructions

  • When the leading thread detects a misprediction, it clears the predQ (sketched below)

  • ECC protected
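
A sketch of how the predQ differs from the BOQ, assuming it holds (PC, predicted target) pairs and is flushed when the leading thread detects a misprediction (names are illustrative):

    from collections import deque

    predq = deque()

    def leading_predict(pc, predicted_target):
        predq.append((pc, predicted_target))   # predictions, not resolved outcomes

    def leading_detects_mispredict():
        predq.clear()                          # wrong-path predictions must not reach the trailing thread

    def trailing_fetch():
        return predq.popleft() if predq else None

    leading_predict(0x400, 0x480)
    leading_detects_mispredict()               # the prediction turned out to be wrong
    print(trailing_fetch())                    # None: the trailer refetches along the corrected path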

SRTR - RVQ & CV


  • SRTR performs its check when the trailing instruction completes

  • The Register Value Queue (RVQ) is used to store register values for checking, avoiding pressure on the register file

    • RVQ entries are allocated when instructions enter the AL

    • Pointers to the RVQ entries are placed in the SAL to facilitate their search

  • If the check succeeds, the entries in the commit vector (CV) are set to checked-ok and the instructions commit (sketched below)

  • If the check fails, the CV entries are set to failed

    • Rollback occurs when the failed entries reach the head of the AL
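
A behavioral sketch of this check path, with the RVQ and CV modeled as dictionaries indexed by active-list position (names and states are illustrative):

    UNCHECKED, CHECKED_OK, FAILED = "unchecked", "checked-ok", "failed"

    rvq = {}    # AL index -> leading thread's result value
    cv = {}     # AL index -> check state

    def leading_writeback(al_index, value):
        rvq[al_index] = value                  # value parked in the RVQ after writeback
        cv[al_index] = UNCHECKED

    def trailing_complete(al_index, value):
        cv[al_index] = CHECKED_OK if rvq[al_index] == value else FAILED

    def try_commit(al_index):
        state = cv[al_index]
        if state == CHECKED_OK:
            return "commit"
        if state == FAILED:
            return "rollback"                  # squash and re-execute from this instruction
        return "wait"                          # trailing counterpart has not completed yet

    leading_writeback(3, 99)
    print(try_commit(3))       # wait
    trailing_complete(3, 99)
    print(try_commit(3))       # commit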


SRTR - Pipeline

  • After the leading instruction writes back its result, it enters the fault-check stage

  • The leading instruction puts its value in the RVQ using the pointer from the SAL.

  • The trailing instructions also use the SAL to obtain their RVQ pointers and find their leading counterparts' values

SRTR - DBCE


  • SRTR uses a separate structure, the register value queue (RVQ),to store register values and other information necessary for checking of instructions, avoiding bandwidth pressure on the register file.

  • Checking every instruction puts bandwidth pressure on the RVQ

  • The DBCE (Dependence-Based Checking Elision) scheme reduces the number of checks and, thereby, the RVQ bandwidth demand.



  • Idea:

    • Faults propagate through dependent instructions

    • Exploits register dependence chains so that only the last instruction in a chain uses the RVQ, and has the leading and trailing values checked.



  • If the last instruction's check succeeds, the previous instructions in the chain can commit (sketched below)

  • If the check fails, all the instructions in the chain are marked as having failed and the earliest instruction in the chain triggers a rollback.
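
A toy sketch of the DBCE idea: only the last value of a register dependence chain is compared, yet a fault anywhere in the chain still surfaces there (the chain values below are illustrative):

    def check_chain(leading_values, trailing_values):
        # leading_values / trailing_values: results of the chained instructions,
        # oldest first; only the final values go through the RVQ check
        if leading_values[-1] == trailing_values[-1]:
            return "commit every instruction in the chain"
        # otherwise mark all chain instructions failed; the earliest triggers rollback
        return "rollback from the earliest instruction in the chain"

    # dependent chain: r1 = 5; r2 = r1 + 1; r3 = r2 * 2
    print(check_chain([5, 6, 12], [5, 6, 12]))    # fault-free
    print(check_chain([5, 6, 12], [5, 7, 14]))    # fault in the middle still caught at the end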


SRTR - Performance

  • Detection performance compared between SRT & SRTR

  • Better results in the interaction between branch mispredictions and slack.

  • Better than SRT by 1%-7%


  • SRTR's average performance peaks at a slack of 32



CONCLUSION

  • A more efficient way to detect transient faults is presented

  • The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared.

    • Key concepts defined: LVQ, ALAB, Slack Fetch, and BOQ

  • An SRT processor can provide higher performance than an equivalently sized on-chip hardware-replicated solution.

  • SRT can be extended for fault recovery (SRTR)

References


  • T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading,” Proc. 29th Annual Int'l Symposium on Computer Architecture (ISCA), May 2002.

  • S. K. Reinhardt and S. S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” Proc. 27th Annual Int'l Symposium on Computer Architecture (ISCA), pages 25–36, June 2000.

  • Eric Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors,” Proc. 29th Int'l Symposium on Fault-Tolerant Computing (FTCS), 1999.

  • S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” Proc. 29th Annual Int'l Symposium on Computer Architecture (ISCA), 2002.
