architectural vulnerability factor does a soft error matter n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Architectural Vulnerability Factor (does a soft error matter?) PowerPoint Presentation
Download Presentation
Architectural Vulnerability Factor (does a soft error matter?)

Loading in 2 Seconds...

play fullscreen
1 / 27

Architectural Vulnerability Factor (does a soft error matter?) - PowerPoint PPT Presentation


  • 128 Views
  • Uploaded on

Architectural Vulnerability Factor (does a soft error matter?). Arijit Biswas Shubu Mukherjee Intel Corporation July, 2009. Outline . Background Terminology & Metrics AVF Computation Techniques AVF Results & Summary. What is a soft error? Strike Changes State of a Single Bit. 0. 1.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Architectural Vulnerability Factor (does a soft error matter?)' - lawrencia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
architectural vulnerability factor does a soft error matter
Architectural Vulnerability Factor(does a soft error matter?)

Arijit BiswasShubu Mukherjee

Intel Corporation

July, 2009

outline
Outline
  • Background
  • Terminology & Metrics
  • AVF Computation Techniques
  • AVF Results & Summary
slide4

source

drain

Impact of Neutron Strike on a Si Device

neutron strike

Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device

+

+

-

+

+

-

-

-

Transistor Device

  • Secondary source of upsets: alpha particles from packaging
slide5

p

p

n

n

p

n

n

p

n

p

n

Earth’s Surface

Cosmic Rays Come From Deep Space

  • Neutron flux is higher at higher altitudes
slide6

Impact of Elevation

Figure 8, Ziegler, et al., “IBM experiments in soft fails in computer electronics (1978 - 1994),” IBM J. of R. & D., Vol. 40, No. 1, Jan. 1996.

  • 3x-5x increase in Denver at 5,000 feet
  • 100x increase in airplanes at 30,000+ feet
  • No practical shielding exists (eg. >10 ft. of concrete)
vulnerable bits growing with moore s law
# Vulnerable Bits Growing with Moore’s Law
  • Aggressive designs have significantly higher number of vulnerable latches
  • Additional soft errors on RAM cells, static logic, & dynamic logic
  • Higher soft error rate in multiprocessor systems
    • If left unprotected, a data center with hundreds to thousands of such systems may encounter data corruptions on a weekly or daily basis
    • Especially troubling for HPC systems
  • Even detected errors can be a concern for HPC systems
    • High rate of detection without recovery could cripple HPC systems
      • System halt or app-kill
    • HPC systems must consider recovery and failover mechanisms
outline1
Outline
  • Background
  • Terminology & Metrics
  • AVF Computation Techniques
  • AVF Results & Summary
slide9

Bit

Read?

Bit has error protection

benign fault

no error

benign fault

no error

benign fault

no error

Does bit matter?

Does bit matter?

False Detected Unrecoverable Error

True Detected Unrecoverable Error

Silent Data Corruption

Strike on a bit (e.g., in register file)

Particle Strike

Causes Bit Flip!

no

yes

Detection &

Correction

no

Detection

only

yes

no

no

yes

SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error

metrics
Metrics
  • Interval-based
    • MTTF = Mean Time to Failure
    • MTTR = Mean Time to Repair
    • MTBF = Mean Time Between Failures = MTTF + MTTR
    • Availability = MTTF / MTBF
  • Rate-based
    • FIT = Failure in Time = 1 failure in a billion hours
    • SER FIT is broken into SDC FIT and DUE FIT
  • Total Soft Error FIT =

(for each vulnerable device i) (circuit-level soft error ratei * AVFi)

  • AVF = fraction of faults that become user-visible errors
architectural vulnerability factor does a bit matter
Architectural Vulnerability FactorDoes a bit matter?
  • Branch Predictor
    • Doesn’t matter at all (AVF = 0%)
  • Program Counter
    • Almost always matters (AVF ~ 100%)
  • Computing AVF for complex structures
    • Statistical Fault Injection
    • ACE Analysis
outline2
Outline
  • Background
  • Terminology & Metrics
  • AVF Computation Techniques
  • AVF Results & Summary
slide13

Statistical Fault Injection (SFI) into RTL

Simulate Strike on Latch

Logic

0

1

output

0

Does Fault Propagate to Architectural State

AVF (crudely) = number of errors in arch. state / number of injected faults

SFI into RTL good for latches, but very difficult for arch. structures

 SFI is extremely compute intensive as many fault simulations are needed in order to provide statistically significant results

architecturally correct execution ace
Architecturally Correct Execution (ACE)

Program Input

  • ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
  • Anything else (un-ACE path) can be derated away

Dynamically

Dead

Instruction

Program Outputs

mapping ace un ace instructions to the instruction queue

NOP

Prefetch

ACE Inst

Wrong-

Path

Inst

Invalid

Mapping ACE & un-ACE Instructions to the Instruction Queue

ACEInst

Architectural un-ACE

Micro-architectural un-ACE

SER & AVF are properties of a bit in hardware so instruction ACE-ness must be mapped to the hardware level

ace lifetime analysis 1 e g write through data cache
ACE Lifetime Analysis (1)(e.g., write-through data cache)
  • Idle is unACE
  • Assuming all time intervals are equal
  • For 3/5 of the lifetime the bit is valid
  • Gives a measure of the structure’s utilization
    • Number of useful bits
    • Amount of time useful bits are resident in structure
    • Valid for a particular trace

Fill

Read

Read

Evict

Idle

Valid

Valid

Valid

Idle

Write-through Data Cache

ace lifetime analysis 2 e g write through data cache

Fill

Read

Read

Evict

Idle

Idle

Write-through Data Cache

ACE Lifetime Analysis (2)(e.g., write-through data cache)
  • Valid is not necessarily ACE
  • ACE % = AVF = 2/5 = 40%
  • Example Lifetime Components
    • ACE: fill-to-read, read-to-read
    • unACE: idle, read-to-evict, write-to-evict
ace lifetime analysis 3 e g write through data cache
ACE Lifetime Analysis (3)(e.g., write-through data cache)
  • Data ACEness is a function of instruction ACEness
  • Second Read is by an unACE instruction
  • AVF = 1/5 = 20%

Fill

Read

Read

Evict

Idle

Idle

Write-through Data Cache

computing avf of address based structures tags
Computing AVF of address-based structures (tags)
  • A fault associated with a tag that is nominally associated with a particular instruction can impact the correct execution of a different independent instruction
  • False Negatives only error if writeback is necessary
    • Uses standard lifetime analysis
  • False Positives always result in error
    • Need bit-level analysis
false positive
False Positive
  • Expected Tag Miss, but got Hit – Error
  • How do you compute the AVF?

Incoming Address

Tag Address

1

0

0

1

1

0

0

0

  • Expect:

MISS

Tag Address

Incoming Address

1

0

0

1

1

0

0

1

  • Acquire:

HIT

hamming distance 1 analysis
Hamming-Distance-1 Analysis
  • Assuming a single-bit error model
  • Now we can use lifetime analysis on the identified bit(s)

Tag Array

101010

Hamming-Distance-1 Match

Incoming Address

001010

111010

000001

111000

Hamming-Distance-1 Match

010101

111111

outline3
Outline
  • Background
  • Terminology & Metrics
  • AVF Computation Techniques
  • AVF Results & Summary
  • Summary
instruction queue
Instruction Queue

IA64-like

SPEC 2K

ACE percentage = AVF = 29%

slide24

Summary

  • Soft Errors: real problem today
    • Culprits: neutrons from deep space & alpha from packaging
  • Major problem in next few technology generations
    • Problem scales with Moore’s Law, die size, & system size
    • HPC systems are particularly affected due to size
  • Industry looking for cost-effective solutions
    • AVF analysis is the first step towards accurately estimating soft error rates at the architectural level
    • ACE Lifetime Analysis
    • Hamming-distance-one analysis
data avfs average
Data AVFs (Average)

IA64-like

SPEC 2K

  • STB AVF lower due to large idle component and bytemasks
  • DTB AVF higher due to high average utilization
  • Dcache(WB) AVF higher than Dcache(WT) since dirty bytes still ACE after last read
tag avfs average
Tag AVFs (Average)

IA64-like

SPEC 2K

  • Tag AVFs are low for DTB and DCache (WT)
    • Only Hamming-Distance-1 matches contribute to ACE time
  • Tag AVFs higher than data for STB and DCache (WB)
    • Dynamically dead tags are still ACE for dirty bytes