Transient fault detection and recovery via simultaneous multithreading
Download
1 / 42

Transient Fault Detection and Recovery via Simultaneous Multithreading - PowerPoint PPT Presentation


  • 129 Views
  • Uploaded on
  • Presentation posted in: General

Transient Fault Detection and Recovery via Simultaneous Multithreading. Nevroz ŞEN 26/04/2007. AGENDA. Introduction & Motivation SMT , SRT & SRTR Fault Detection via SMT (SRT) Fault Recovery via SMT (SRTR) Conclusion. INTRODUCTION. Transient Faults:

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha

Download Presentation

Transient Fault Detection and Recovery via Simultaneous Multithreading

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Transient Fault Detection and Recovery via Simultaneous Multithreading

Nevroz ŞEN

26/04/2007


AGENDA

  • Introduction & Motivation

  • SMT, SRT & SRTR

  • Fault Detection via SMT (SRT)

  • Fault Recovery via SMT (SRTR)

  • Conclusion


INTRODUCTION

  • Transient Faults:

    • Faults that persist for a “short” duration

    • Caused bycosmic rays (e.g., neutrons)

    • Charges and/or discharges internal nodes of logic or SRAM Cells – High Frequency crosstalk

  • Solution

    • No practical solution to absorb cosmic rays

      • 1 fault per 1000 computers per year (estimated fault rate)

  • Future is worse

    • Smaller feature size, reduce voltage, higher transistor count, reduced noise margin


INTRODUCTION

  • Fault tolerant systems use redundancy to improve reliability:

    • Time redundancy: seperate executions

    • Space redundancy: seperate physical copies of resources

      • DMR/TMR

    • Data redundancy

      • ECC

      • Parity


MOTIVATION

  • Simultaneous Multithreading improves the performance of a processor by allowing multiple independent threads to execute simultaneously (same cycle) in different functional units

  • Use the replication provided by the different threads to run two copies of the same program so we are able to detect errors


R1  (R2)

R1  (R2)

microprocessor

microprocessor

Output

Comparison

Input

Replication

Memory covered by ECC

RAID array covered by parity

Servernet covered by CRC

MOTIVATION

Replicated Microprocessors + Cycle-by-Cycle Lockstepping


R1  (R2)

R1  (R2)

Thread

Thread

Output

Comparison

Input

Replication

Memory covered by ECC

RAID array covered by parity

Servernet covered by CRC

MOTIVATION

Replicated Threads + Cycle-by-Cycle Lockstepping ???


MOTIVATION

  • Less hardware compared to replicated microprocessors

    • SMT needs ~5% more hardware over uniprocessor

    • SRT adds very little hardware overhead to existing SMT

  • Better performance than complete replication

    • Better use of resources

  • Lower cost

    • Avoids complete replication

    • Market volume of SMT & SRT


MOTIVATION - CHALLENGES

  • Cycle-by-cycle output comparison and input replication (Cycle-by-Cycle Lockstepping);

    • Equivalent instructions from different threads might execute in different cycles

    • Equivalent instructions from different threads might execute in different order with respect to other instructions in the same thread

  • Precise scheduling of the threads is crucial

    • Branch misprediction

    • Cache miss


Thread1

Thread2

Instruction

Scheduler

Functional

Units

SMT – SRT - SRTR

Simultaneous Multithreading (SMT)


SMT – SRT - SRTR

  • SRT: Simultaneous & Redundantly Threaded Processor

  • SRT = SMT + Fault Detection

  • SRTR: Simultaneous & Redundantly Threaded Processor with Recovery

  • SRTR = SRT + Fault Recovery


Fault Detection via SMT - SRT

  • Sphere of Replication (SoR)

  • Output comparison

  • Input replication

  • Performance Optimizations for SRT

  • Simulation Results


Logical boundary of redundant execution within a system

Components inside sphere are protected against faults using replication

External components must use other means of fault tolerance (parity, ECC, etc.)

Its size matters:

Error detection latency

Stored-state size

SRT - Sphere of Replication (SoR)


SRT - Sphere of Replication (SoR)for SRT

Excludes instruction and data caches

Alternates SoRs possible (e.g., exclude register file)


OUTPUT COMPARISION

  • Compare & validate output before sending it outside the SoR - Catch faults before propagating to rest of system

  • No need to compare every instruction; Incorrect value caused by a fault propagatesthrough computations and is eventually consumed by a store,checking only stores suffices.

  • Check;

    • Address and data for stores from redundant threads. Both comparison and validation at commit time

    • Address for uncached load from redundant threads

    • Address for cached load from redundant threads: not required

  • Other output comparison based on the boundary of an SoR


Store: ...

Store

Queue

Store: ...

Store: ...

Store: ...

Store: R1  (R2)

To Data Cache

Output

Comparison

Store: ...

Store: R1  (R2)

OUTPUT COMPARISION – Store Queue

  • Bottleneck if store queue is shared

  • •Separate per-thread store queues boost performance


INPUT REPLICATION

  • Replicate & deliver same input (coming from outside SoR) to redundant copies. To do this;

  • Instructions: Assume no self-modification. No check

  • Cached load data:

    • Active Load Address Buffer

    • Load Value Queue

  • Uncached load data:

    • Synchronize when comparing addresses that leave the SoR

    • When data returns, replicate the value for the two threads

  • External Interrupts:

    • Stall lead thread and deliver interrupt synchronously

    • Record interrupt delivery point and deliver later


INPUT REPLICATION – Active Load Address Buffer (ALAB)

  • Delays a cache block’s replacement or invalidation after the retirement of the trailing load

  • Counter tracks trailing thread’s outstanding loads

  • When a cache block is about to be replaced:

    • The ALAB is searched for an entry matching the block’s address

    • If counter != 0 then:

      • Do not replace nor invalidate until trailing thread is done

      • Set the pending-invalidate bit

    • Else replace - invalidate


add

load R1(R2)

sub

probe cache

add

load R1  (R2)

sub

INPUT REPLICATION – Load Value Queue (LVQ)

  • An alternative to ALAB – Simpler

  • Pre-designated leading & trailing threads

  • Protected by ECC

LVQ

Leading Thread

Trailing Thread


INPUT REPLICATION – Load Value Queue (LVQ)

  • Advantages over ALAB;

    • Reduces the pressure on data cache ports

    • Accelerate fault detection of faulty addresses

    • Simple design


Performance Optimizations for SRT

  • Idea: Using one thread to improve cache and branch prediction behavior for the other thread. Two techniques;

    • Slack Fetch

      • Maintains a constant slack of instructionsbetween the threads

      • Prevents the trailing thread from seeing mispredictionsand cache misses

    • Branch Outcome Queue (BOQ)


BOQ

Dispatch

Decode

Commit

Fetch

Execute

Data Cache

Performance Optimizations for SRT - Branch Outcome Queue (BOQ)

  • Sends the outcomes of the committed branch outcomes(branch PCs and outcomes) to the trailing thread

  • In the fetch stage trailing thread uses the head of queue like a branch target buffer


Simulation Environment:

Modified Simplescalar “sim-outorder”

Long front-end pipeline because of out-of-order nature and SMT

Simple approximation of trace cache

Used 11 SPEC95 benchmarks

Simulation Results


ORH: On-Chip Replicated Hardware

ORH-Dual -> two pipelines, each with half the resources

SMT- Dual -> Replicated threads with no detection hardware

Simulation Results


Max 27% performance improvements for SF, BOQ, and SF + BOQ

Performance better withslack of 256 instructionsover 32 or 128

Prevents trailing threadfrom wasting resourcesby speculating

Simulation Results - Slack Fetch & Branch Outcome Queue


Very low performance degradation for 64- entry ALAB or LVQ

On average a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5% respectively.

Simulation Results - Input Replication


Comparison with ORH- Dual

SRT processor: 256 slack fetch, BOQ with 128 entries, 64-entry store buffer, and 64-entry LVQ

Average: 16% Maksimum: %29 over a lockstepping processor with the “same” hardware

Simulation Results - Overall


Fault Recovery via SMT (SRTR)

  • What is wrong with SRT: A leading non-store instruction may commit before the check for the fault occurs

    • Relies on the trailing thread to trigger the detection

    • However, an SRTR processor works well in a fail-fast architecture

    • A faulty instruction cannot be undone once theinstruction commits.


Fault Recovery via SMT (SRTR) - Motivation

  • In SRT, a leading instructionmay commit before the check for faults occurs, relying on thetrailing thread to trigger detection.

  • In contrast, SRTR must notallow any leading instruction to commit before checkingoccurs,

  • SRTR uses the time between the completion and commit time of leading instruction and checks the results as soon as the trailing completes

  • In SPEC95, complete to commit takes about 29 cycles

  • This short slack has some implications:

    • Leading thread provides branch predictions

    • The StB, LVQ and BOQ need to handle mispredictions


Fault Recovery via SMT (SRTR) - Motivation

  • Leading thread provides the trailing thread with branch predictions instead of outcomes (SRT).

  • Register value queue (RVQ), to store register values and other information necessary for checking of instructions, avoiding bandwidth pressure on the register file.

  • Dependence-based checking elision (DBCE) to reduce the number of checks is developed

  • Recovery via traditional rollback ability of modern pipelines


SRTR Additions to SMT

SRTR Addition to MST

Predq : Prediction Queue

LVQ : Load Value Queue

CVs : Commit Vectors

AL: Active List

RVQ: Register Value Queue


SRTR – AL & LVQ

  • Leading and trailing instructions occupy the same positions in their ALs (private for each thread)

    • May enter their AL and become ready to commit them at different times

  • The LVQ has to be modified to allow speculative loads

    • The Shadow Active List holds pointers to LVQ entries

    • A trailing load might issue before the leading load

    • Branches place the LVQ tail pointer in the SAL

    • The LVQ’s tail pointer points to the LVQ has to be rolled back in a misprediction


SRTR – PREDQ

  • Leading thread places predicted PC

  • Similar to BOQ but only holds predictions instead of outcomes

  • Using the predQ, the two threads fetch essentially the same instructions

  • On a misprediction detection leading clears the predQ

  • ECC protected


SRTR – RVQ & CV

  • SRTR checks when the trailing instruction completes

  • The Register Value Queue is used to store register values for checking, avoiding pressure on the register file

    • RVQ entries are allocated when instruction enter the AL

    • Pointers to the RVQ entries are placed in the SAL to facilitate their search

  • If check succeeds, the entries in the CV vector are set to checked-ok and comitted

  • If check fails, the entries in the CV vectors are set to failed

    • Rollback done when entries in head of AL


SRTR - Pipeline

  • After the leading instruction writes its result back, it enters thefault-check stage

  • The leading instruction puts its value in the RVQ using the pointer from the SAL.

  • The trailing instructions also use the SAL to obtain theirRVQ pointers and find their leading counterparts’


SRTR – DBCE

  • SRTR uses a separate structure, the register value queue (RVQ),to store register values and other information necessary for checking of instructions, avoiding bandwidth pressure on the register file.

  • Check each inst brings BW pressure on RVQ

  • DBCE (Dependence Based Checking Elision) scheme reduce the number of checks, and thereby, the RVQ bandwidth demand.


SRTR – DBCE

  • Idea:

    • Faults propagate through dependent instructions

    • Exploits register dependence chains so that only the last instruction in a chain uses the RVQ, and has the leading and trailing values checked.


SRTR – DBCE

  • If the last instruction check succeeds,commit previous ones

  • If the check fails, all the instructions in the chain are marked as having failed and the earliest instruction in the chain triggers a rollback.


SRTR - Performance

  • Detection performance between SRT & SRTR

  • Better results in the interaction between branch mispredictions and slack.

  • Better than SRT between %1-%7


SRTR - Performance

  • SRTR’s average performance peaks at aslack of 32


CONCLUSION

  • A more efficient way to detect Transient Faults is presented

  • Thetrailing thread repeats the computation performed by the leadingthread, and the values produced by the two threads arecompared.

    • Defined some concepts: LVQ, ALAB, Slack Fetch and BOQ

  • An SRT processor can provide higher performance then an equivalently sized on-chip HW replicated solution.

  • SRT can be extended for fault recovery-SRTR


REFERANCES

  • T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng, “Transient Fault Recovery using Simultaneous Multithreading,” Proc. 29th Annual Int’l Symp. on Computer Architecture, May 2002.

  • S. K. Reinhardt and S. S. Mukherjee. Transient-fault detection via simultaneous multithreading. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 25–36, June 2000.

  • Eric Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessor,” Proceedings of Fault-Tolerant Computing Systems (FTCS), 1999.

  • S.S.Mukherjee, M.Kontz, & S.K.Reinhardt, “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” International Symposium on Computer Architecture (ISCA), 2002


ad
  • Login