CS 7810 Lecture 25

DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. T. Austin. Proceedings of MICRO-32, November 1999.

Presentation Transcript
slide1

CS 7810 Lecture 25

DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design

T. Austin

Proceedings of MICRO-32

November 1999

slide2

Redundancy

  • If a processor’s output is error-prone, reliability can be provided with redundancy

[Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

slide3

Redundancy

  • If a processor’s output is error-prone, reliability can be provided with redundancy

[Diagram labels: Input Program; Primary Core; Checker Core; Checker Core; Verify & Commit]

One checker can detect errors. For recovery, we may need another checker or some other form of redundancy.
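The difference between detection and recovery can be sketched in a few lines. This is an illustrative model only, not taken from the paper: the function names `detect` and `vote` are hypothetical, and a simple majority vote stands in for "some other form of redundancy".

```python
from collections import Counter
from typing import Optional, Sequence


def detect(primary: int, checker: int) -> bool:
    """Return True when the two redundant results disagree (error detected)."""
    return primary != checker


def vote(results: Sequence[int]) -> Optional[int]:
    """Majority vote over redundant results.

    Returns the majority value, or None when no strict majority exists
    (the error is detected but cannot be corrected).
    """
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


# Primary core plus one checker: a mismatch is visible, but not which side is wrong.
print(detect(12, 13), vote([12, 13]))
# Primary core plus two checkers: the majority value can be committed.
print(vote([12, 13, 12]))
```

With two copies a disagreement has no tie-breaker, which is why recovery needs a third copy or a structure that is trusted by construction.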

slide4

Why Redundancy?

  • Soft Errors: A high-energy particle can strike a device and deposit enough charge to flip the stored value
  • Cosmic rays
  • Alpha particles

[Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

slide5

Why Redundancy?

  • Soft Errors: voltage spikes or noise
  • Crosstalk
  • di/dt
  • Lower voltages

[Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

slide6

Why Redundancy?

  • Allows unverified or aggressively clocked primary cores
  • Functionally incorrect core: some corner case slips through
  • Electrically incorrect core: high temperature causes a circuit to not meet the timing constraint

[Diagram labels: Input Program; Primary Core; Checker Core; Verify & Commit]

slide7

DIVA Microarchitecture

[Diagram labels: BPred; I-$; Dec/Ren; IQ; Rename Regs; ALU; D-$; Arch Regs]

Example: LR3 + LR7 → LR15, with values 4 + 8 → 12

  • Storage Check: read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8
  • ALU Check: add 4 + 8 and confirm it equals 12
  • If both checks succeed, write 12 into LR15
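The two checks on this slide can be sketched as follows. This is a minimal model under stated assumptions, not the paper's implementation: the `CompletedInstr` record and `check_and_commit` function are hypothetical names, only addition is modelled, and the architected register file is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CompletedInstr:
    """What the primary core hands to the checker for one add instruction."""
    src1: str
    src2: str
    dest: str
    val1: int    # operand values the primary core claims it used
    val2: int
    result: int  # result the primary core claims it computed


def check_and_commit(instr: CompletedInstr, arch_regs: Dict[str, int]) -> bool:
    """Run the storage check and the ALU check; commit only if both pass."""
    # Storage check: the claimed operands must match the architected state.
    storage_ok = (arch_regs[instr.src1] == instr.val1
                  and arch_regs[instr.src2] == instr.val2)
    # ALU check: recompute the operation on the claimed operands.
    alu_ok = instr.val1 + instr.val2 == instr.result
    if storage_ok and alu_ok:
        arch_regs[instr.dest] = instr.result
        return True
    return False


regs = {"LR3": 4, "LR7": 8, "LR15": 0}
ok = check_and_commit(CompletedInstr("LR3", "LR7", "LR15", 4, 8, 12), regs)
print(ok, regs["LR15"])
```

Because the operands arrive with the instruction, the two checks do not depend on each other and the checker never waits on an earlier result.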

slide8

Microarchitecture Details

  • Instructions are fed to the checker in order during commit
  • The logic and storage checks detect errors in ALUs and the datapath
  • The checker core is a simple in-order pipeline that is easy to design and verify
  • An error in an earlier stage (LR3 instead of LR2) can be detected by also adding a ren/decode stage to the checker
  • In-order core has no stalls (need bypass for register file): no data dependences, cache misses, or branch mispredicts
  • Contention for the register file and data cache can degrade the primary thread
slide9

Recovery

  • The architected register file and data cache are ECC protected – when an error is detected, it is assumed that checker and architected state are correct
  • Primary core is re-started from faulting instruction
  • A fault in the primary core may result in deadlock: e.g. instruction that produces R5 is waiting for R5 to be produced (instead of R4)
  • A timeout in the checker signals an error
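A rough sketch of the recovery policy above, including the timeout that breaks a deadlocked primary core. Everything here is an assumption made for illustration: the `run_checker` function, the per-cycle `arrivals` list, and the event tuples are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

Arrival = Optional[Tuple[int, bool]]  # (pc, passed_checks), or None if nothing arrived


def run_checker(arrivals: List[Arrival], timeout: int) -> List[Tuple[str, int]]:
    """Walk through one arrival per cycle and record what the checker does.

    Events are ("commit", pc), ("restart", pc) when a check fails, and
    ("timeout", cycle) when no instruction arrives for `timeout` cycles in a row.
    """
    events: List[Tuple[str, int]] = []
    idle = 0
    for cycle, item in enumerate(arrivals):
        if item is None:
            idle += 1
            if idle >= timeout:
                # The primary core has stopped making progress: treat as an error.
                events.append(("timeout", cycle))
                idle = 0
            continue
        idle = 0
        pc, passed = item
        if passed:
            events.append(("commit", pc))
        else:
            # Architected state is trusted, so restart at the faulting instruction.
            events.append(("restart", pc))
    return events


trace = [(0, True), (4, False), None, None, None, (4, True)]
print(run_checker(trace, timeout=3))
```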
slide10

Redundant Multi-Threading

  • Execute two threads in parallel (CMP or SMT) – each thread maintains its own register state
  • Threads execute as in a conventional processor, except
    • trailing thread commits after verifying result
    • leading thread commits stores to a buffer – these get written to cache/memory only after verification
    • load values of the leading thread are sent to trailing thread, so trailing thread never accesses data cache
    • branch outcomes are also sent to trailing thread

[Diagram labels: Leading Thread; Trailing Thread; Reg results, load values, branch outcomes; Store values]
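The load-forwarding and store-buffering rules above might be modelled like this. It is a simplified sketch: the `RedundantPair` class and its method names are hypothetical, memory is a dictionary, and register-result and branch-outcome forwarding are left out.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class RedundantPair:
    """Leading/trailing thread pair sharing one memory (illustrative only)."""

    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory
        self.load_queue: Deque[int] = deque()                 # leading -> trailing load values
        self.store_buffer: Deque[Tuple[int, int]] = deque()   # unverified leading stores

    def leading_load(self, addr: int) -> int:
        # Only the leading thread touches the data cache; the value is forwarded.
        value = self.memory.get(addr, 0)
        self.load_queue.append(value)
        return value

    def leading_store(self, addr: int, value: int) -> None:
        # Stores wait in the buffer until the trailing thread verifies them.
        self.store_buffer.append((addr, value))

    def trailing_load(self) -> int:
        # The trailing thread consumes the forwarded value instead of reading memory.
        return self.load_queue.popleft()

    def trailing_store(self, addr: int, value: int) -> bool:
        # Compare against the oldest buffered store; write memory only on a match.
        if self.store_buffer.popleft() != (addr, value):
            return False
        self.memory[addr] = value
        return True


mem = {0x10: 5}
pair = RedundantPair(mem)
pair.leading_store(0x20, pair.leading_load(0x10) + 1)
print(0x20 in mem)                                  # store not yet visible
print(pair.trailing_store(0x20, pair.trailing_load() + 1), mem[0x20])
```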

slide11

Fault Model

  • A single error in either core can be detected
  • Since loads are not replicated, the load/store datapath must be ECC protected
  • For recovery, a second checker thread is required
  • ECC in the checker register file will enable recovery in most cases without a second checker
slide12

RMT on SMT/CMP

+ SMT does not require inter-core traffic – values can be read from shared register file/data cache

– Single thread performance may be degraded

– Each redundant instr executes on high-power pipeline

+ Trailing CMP core can be a simple in-order processor → low power/area overheads

+ Trailing core’s frequency can be independently controlled

+ Heterogeneous CMP where cores can be dynamically employed for throughput/reliability

+ Lower probability for errors

slide13

Parallelization of Trailing Thread

[Diagram labels: Sequential Thread; Parallel Thread 1; Parallel Thread 2; Parallel Thread 3; Parallel Thread 4]

Is it more power-efficient to execute the verification thread in parallel?

slide14

Parallelization of Trailing Thread

[Diagram labels: Sequential Thread; Parallel Thread 1; Parallel Thread 2; Parallel Thread 3; Parallel Thread 4]

If the trailing cores are frequency-scaled, dynamic power does not change, but leakage power increases.

If the trailing cores are frequency-and-voltage scaled, dynamic power decreases, and leakage power increases.
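A back-of-the-envelope check of these two statements, using the usual first-order model in which dynamic power scales with C·V²·f. The leakage model (supply voltage times a fixed leakage current per core), the normalized units, and the 0.7 scaled voltage are assumptions chosen purely for illustration, as is the premise that N cores at f/N match the throughput of one core at f.

```python
def dynamic_power(n_cores: int, freq: float, volt: float, alpha_c: float = 1.0) -> float:
    """Total dynamic power: n_cores * (activity * capacitance) * V^2 * f."""
    return n_cores * alpha_c * volt ** 2 * freq


def leakage_power(n_cores: int, volt: float, i_leak: float = 0.1) -> float:
    """Total leakage, assuming a fixed leakage current per powered-on core."""
    return n_cores * volt * i_leak


# Normalized units: one sequential trailing core at f = 1.0, V = 1.0.
sequential = (dynamic_power(1, 1.0, 1.0), leakage_power(1, 1.0))
# Four parallel trailing cores, frequency scaled to f / 4.
freq_scaled = (dynamic_power(4, 0.25, 1.0), leakage_power(4, 1.0))
# Four parallel trailing cores, frequency and voltage scaled (0.7 is arbitrary).
dvfs_scaled = (dynamic_power(4, 0.25, 0.7), leakage_power(4, 0.7))

for name, (dyn, leak) in [("sequential", sequential),
                          ("4 cores, f/4", freq_scaled),
                          ("4 cores, f/4 and lower V", dvfs_scaled)]:
    print(f"{name}: dynamic={dyn:.2f} leakage={leak:.2f}")
```

Under this model the four slower cores burn the same dynamic power as one fast core until the voltage is also lowered, while total leakage grows because four cores are powered on instead of one.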

slide16

Acronyms!!

  • MTTF & MTBF: Mean time to/between failures
  • Errors are either SDC (silent data corruption) or DUE (detected unrecoverable errors)
  • Many errors get masked:
  • ACE bits: these bits are required for architecturally correct execution
  • un-ACE bits: these bits do not affect the final output
  • AVF: architecture vulnerability factor (the percentage of time/space that a structure holds ACE state)
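The AVF definition can be made concrete with a small calculation. This is a sketch under assumptions: the `avf` function and its per-cycle occupancy input are hypothetical, and the sample numbers are made up for illustration.

```python
from typing import Sequence


def avf(ace_bits_per_cycle: Sequence[int], total_bits: int) -> float:
    """Fraction of a structure's bit-cycles that hold ACE state.

    ace_bits_per_cycle[c] is the number of bits holding ACE state in cycle c;
    total_bits is the capacity of the structure.
    """
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


# A 64-bit structure observed for four cycles (made-up occupancy).
print(avf([64, 32, 0, 32], total_bits=64))
```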
slide17

Partial Coverage

  • RMT covers faults in the entire core (almost!)
  • If that is too expensive, provide error coverage in specific structures to reduce error probabilities
  • Are there ways to ensure that an instruction spends less time in architecturally vulnerable structures?