CS 7810    Lecture 25
DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. T. Austin. Proceedings of MICRO-32, November 1999.




Redundancy

  • If a processor’s output is error-prone, reliability can be provided with redundancy

[Figure: block diagram with Input Program, Primary Core, Checker Core, and Verify & Commit]


Redundancy

  • If a processor’s output is error-prone, reliability can be provided with redundancy

[Figure: block diagram with Input Program, Primary Core, two Checker Cores, and Verify & Commit]

One checker can detect errors. For recovery, we may need another checker or some other form of redundancy.


Why Redundancy?

  • Soft errors: a high-energy particle can strike a device and deposit enough charge to flip the stored value

    • Cosmic rays

    • Alpha particles

[Figure: block diagram with Input Program, Primary Core, Checker Core, and Verify & Commit]


Why Redundancy?

  • Soft errors: voltage spikes or noise

    • Crosstalk

    • di/dt

    • Lower voltages

[Figure: block diagram with Input Program, Primary Core, Checker Core, and Verify & Commit]


Why Redundancy?

  • Allows unverified or aggressively clocked primary cores

    • Functionally incorrect core: some corner case slips through

    • Electrically incorrect core: high temperature causes a circuit to not meet the timing constraint

[Figure: block diagram with Input Program, Primary Core, Checker Core, and Verify & Commit]


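DIVA Microarchitecture

[Figure: pipeline diagram with BPred, I-$, Dec/Ren, IQ, Rename Regs, ALU, D-$, and Arch Regs]

Example: LR3 + LR7 → LR15, with operand values 4 and 8 and result 12

  • Storage check: read LR3 and LR7 from the architected registers and confirm that they equal 4 and 8

  • ALU check: add 4 + 8 and confirm that the result equals 12

  • If both checks succeed, write 12 into LR15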
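The two checks in this example can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the checker design from the paper: the `CommitEntry` record, the `check_and_commit` function, and the dictionary used as the architected register file are all hypothetical names, and only an add instruction is modeled.

```python
from dataclasses import dataclass


@dataclass
class CommitEntry:
    """What the primary core reports for one add instruction (hypothetical layout)."""
    src1: int    # logical source register numbers
    src2: int
    dest: int    # logical destination register number
    val1: int    # operand values the primary core used
    val2: int
    result: int  # result the primary core computed


def check_and_commit(entry, arch_regs):
    """Run the storage and ALU checks; write the result only if both pass."""
    # Storage check: the operands must match the architected register file.
    storage_ok = (arch_regs[entry.src1] == entry.val1
                  and arch_regs[entry.src2] == entry.val2)
    # ALU check: recompute the add on the reported operand values.
    alu_ok = (entry.val1 + entry.val2 == entry.result)
    if storage_ok and alu_ok:
        arch_regs[entry.dest] = entry.result
        return True
    return False


# The slide's example: LR3 = 4, LR7 = 8, LR3 + LR7 -> LR15.
arch_regs = {3: 4, 7: 8, 15: 0}
good = CommitEntry(src1=3, src2=7, dest=15, val1=4, val2=8, result=12)
committed = check_and_commit(good, arch_regs)

# A faulty primary-core result (13 instead of 12) fails the ALU check.
bad = CommitEntry(src1=3, src2=7, dest=15, val1=4, val2=8, result=13)
rejected = not check_and_commit(bad, arch_regs)
```

Because each check uses only the values reported with the instruction plus architected state, neither check has to wait on an earlier instruction's check.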


Microarchitecture Details

  • Instructions are fed to the checker in order during commit

  • The logic and storage checks detect errors in the ALUs and datapath

  • The checker core is a simple in-order pipeline that is easy to design and verify

  • An error in an earlier stage (e.g., LR3 instead of LR2) can be detected by also adding a rename/decode stage to the checker

  • The in-order checker core has no stalls (it needs a bypass for the register file): there are no data dependences, cache misses, or branch mispredicts

  • Contention for the register file and data cache can degrade the primary thread


Recovery

  • The architected register file and data cache are ECC protected; when an error is detected, it is assumed that the checker and the architected state are correct

  • The primary core is restarted from the faulting instruction

  • A fault in the primary core may result in deadlock, e.g., the instruction that produces R5 is waiting for R5 to be produced (instead of R4)

  • In that case, a timeout in the checker signals an error
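A minimal sketch of the timeout idea, assuming the checker simply counts cycles in which the primary core delivers nothing to commit. The slide does not say how the timeout is implemented; the per-cycle `commit_stream`, the `timeout_cycles` threshold, and the event tuples are illustrative assumptions.

```python
def run_checker(commit_stream, timeout_cycles):
    """Scan a per-cycle commit stream and flag long gaps as errors.

    commit_stream: one item per cycle, either an instruction id or None
                   when the primary core delivers nothing that cycle.
    timeout_cycles: hypothetical number of consecutive idle cycles after
                    which the checker signals an error.
    Returns a list of ("commit", id) and ("timeout", cycle) events.
    """
    events = []
    idle = 0
    for cycle, item in enumerate(commit_stream):
        if item is None:
            idle += 1
            if idle >= timeout_cycles:
                # Treat the stall as a fault; recovery would restart the
                # primary core from the next uncommitted instruction.
                events.append(("timeout", cycle))
                idle = 0
        else:
            idle = 0
            events.append(("commit", item))
    return events


# The primary core stalls for three cycles after committing i0.
events = run_checker(["i0", None, None, None, "i1"], timeout_cycles=3)
```

Here `events` records the commit of `i0`, a timeout in cycle 3, and then the commit of `i1` once the core resumes.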


Redundant Multi-Threading

  • Execute two threads in parallel (CMP or SMT); each thread maintains its own register state

  • Threads execute as in a conventional processor, except

    • trailing thread commits after verifying result

    • leading thread commits stores to a buffer; these get written to cache/memory only after verification

    • load values of the leading thread are sent to trailing thread, so trailing thread never accesses data cache

    • branch outcomes are also sent to trailing thread

[Figure: Leading Thread and Trailing Thread, with arrows labeled "Reg results, load values, branch outcomes" and "Store values"]
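The store buffering and load-value forwarding described above can be sketched with a toy interpreter. This is a simplified illustration under stated assumptions, not the RMT hardware: the three-operation instruction format, the `run` and `rmt_execute` functions, and the `fault_at` fault-injection parameter are hypothetical. The program is straight-line (so no branch outcomes are forwarded) and never loads an address it stored to earlier.

```python
def run(program, load_source, fault_at=None):
    """Execute a straight-line program; return (stores, loaded values)."""
    regs, stores, loads = {}, [], []
    for pc, (op, a, b, c) in enumerate(program):
        if op == "load":            # regs[a] = value at address b
            value = load_source(b)
            loads.append(value)
            regs[a] = value
        elif op == "add":           # regs[a] = regs[b] + regs[c]
            regs[a] = regs[b] + regs[c]
            if pc == fault_at:
                regs[a] ^= 1        # injected single-bit fault
        elif op == "store":         # buffer (address b, value regs[a])
            stores.append((b, regs[a]))
    return stores, loads


def rmt_execute(program, memory, fault_at=None):
    """Release buffered stores to memory only after the trailing thread agrees."""
    # Leading thread: reads memory and buffers its stores.
    lead_stores, load_values = run(program, lambda addr: memory[addr], fault_at)
    # Trailing thread: consumes forwarded load values, never touches memory.
    forwarded = iter(load_values)
    trail_stores, _ = run(program, lambda addr: next(forwarded))
    for lead, trail in zip(lead_stores, trail_stores):
        if lead != trail:
            return False            # mismatch detected; store is not released
        addr, value = lead
        memory[addr] = value
    return True


program = [("load", "r1", 0, None), ("load", "r2", 1, None),
           ("add", "r3", "r1", "r2"), ("store", "r3", 2, None)]

clean_mem = {0: 4, 1: 8, 2: 0}
clean_ok = rmt_execute(program, clean_mem)

faulty_mem = {0: 4, 1: 8, 2: 0}
faulty_ok = rmt_execute(program, faulty_mem, fault_at=2)
```

In the fault-free run the store of 12 reaches `clean_mem[2]`; with a fault injected into the leading thread's add, the comparison fails and `faulty_mem[2]` keeps its old value.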


Fault Model

  • A single error in either core can be detected

  • Since loads are not replicated, the load/store datapath must be ECC protected

  • For recovery, a second checker thread is required

  • ECC in the checker register file will enable recovery in most cases without a second checker


RMT on SMT/CMP

+ SMT does not require inter-core traffic – values can be read from the shared register file/data cache

– Single-thread performance may be degraded

– Each redundant instruction executes on a high-power pipeline

+ Trailing CMP core can be a simple in-order processor → low power/area overheads

+ Trailing core’s frequency can be independently controlled

+ Heterogeneous CMP where cores can be dynamically employed for throughput/reliability

+ Lower probability for errors


Parallelization of Trailing Thread

[Figure: one Sequential Thread compared with Parallel Threads 1 to 4]

Is it more power-efficient to execute the verification thread in parallel?


Parallelization of Trailing Thread

[Figure: one Sequential Thread compared with Parallel Threads 1 to 4]

If the trailing cores are frequency-scaled, dynamic power does not change, but leakage power increases.

If the trailing cores are frequency-and-voltage scaled, dynamic power decreases, and leakage power increases.
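A back-of-the-envelope model makes these two statements concrete. It is a sketch under stated assumptions rather than anything from the slides: throughput is held constant by running N trailing cores at 1/N of the original frequency, dynamic power per core is taken as C·V²·f, and leakage per core is taken as proportional to supply voltage. The function name, the constants, and the 0.7 scaled voltage are hypothetical.

```python
def trailing_power(n_cores, freq, vdd, cap=1.0, leak_per_volt=0.1):
    """Return (dynamic, leakage) power for n_cores identical trailing cores.

    Assumed models (normalized units):
      dynamic per core = cap * vdd**2 * freq
      leakage per core = leak_per_volt * vdd
    """
    dynamic = n_cores * cap * vdd ** 2 * freq
    leakage = n_cores * leak_per_volt * vdd
    return dynamic, leakage


# One sequential trailing core at full frequency and voltage.
seq = trailing_power(n_cores=1, freq=1.0, vdd=1.0)

# Four parallel cores, each at one quarter of the frequency.
par_f = trailing_power(n_cores=4, freq=0.25, vdd=1.0)

# Same, but with the supply voltage also lowered (0.7 is a made-up value).
par_fv = trailing_power(n_cores=4, freq=0.25, vdd=0.7)
```

With frequency scaling alone, the factor of N in core count cancels the 1/N in frequency, so total dynamic power matches the sequential case while leakage grows with the number of powered cores. Lowering the voltage as well shrinks the V² term, but in this model four cores still leak more than one.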



Acronyms!!

  • MTTF & MTBF: mean time to failure and mean time between failures

  • Errors are either SDC (silent data corruption) or DUE (detected unrecoverable errors)

  • Many errors get masked:

  • ACE bits: these bits are required for architecturally correct execution

  • un-ACE bits: these bits do not affect the final output

  • AVF: architectural vulnerability factor (the percentage of time/space that a structure holds ACE state)
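Read as a fraction of bit-cycles, the AVF definition above can be computed directly. This is a minimal sketch assuming we already know how many ACE bits a structure holds in each cycle; the `avf` function and the sample occupancy numbers are made up for illustration.

```python
def avf(ace_bits_per_cycle, total_bits):
    """Fraction of a structure's bit-cycles that hold ACE state.

    ace_bits_per_cycle: number of ACE bits resident in each observed cycle.
    total_bits: capacity of the structure in bits.
    """
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or total_bits <= 0:
        raise ValueError("need at least one cycle and a positive bit count")
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


# A 64-bit structure observed for four cycles: full, half full, empty, half full.
example_avf = avf([64, 32, 0, 32], total_bits=64)
```

In this example 128 of the 256 bit-cycles hold ACE state, so the AVF is 0.5; a fault striking the structure at a random time and position would matter about half the time.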


Partial Coverage

  • RMT covers faults in the entire core (almost!)

  • If that is too expensive, provide error coverage in specific structures to reduce error probabilities

  • Are there ways to ensure that an instruction spends less time in architecturally vulnerable structures?

