
Architectural Vulnerability Factor (does a soft error matter?)

Arijit Biswas, Shubu Mukherjee. Intel Corporation, July 2009.



  1. Architectural Vulnerability Factor (does a soft error matter?) • Arijit Biswas, Shubu Mukherjee • Intel Corporation • July 2009

  2. Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary

  3. What is a soft error? • Strike changes the state of a single bit (e.g., 0 becomes 1)

  4. Impact of Neutron Strike on a Si Device • Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device • Secondary source of upsets: alpha particles from packaging • [Figure: neutron strike on a transistor device, showing the source, the drain, and the released + and - charges]

  5. Cosmic Rays Come From Deep Space • Neutron flux is higher at higher altitudes • [Figure: particles labeled p and n descending toward Earth’s surface]

  6. Impact of Elevation • 3x-5x increase in Denver at 5,000 feet • 100x increase in airplanes at 30,000+ feet • No practical shielding exists (it would take, e.g., >10 ft. of concrete) • Source: Figure 8, Ziegler, et al., “IBM experiments in soft fails in computer electronics (1978 - 1994),” IBM J. of R. & D., Vol. 40, No. 1, Jan. 1996.

  7. # Vulnerable Bits Growing with Moore’s Law • Aggressive designs have a significantly higher number of vulnerable latches • Additional soft errors on RAM cells, static logic, & dynamic logic • Higher soft error rate in multiprocessor systems • If left unprotected, a data center with hundreds to thousands of such systems may encounter data corruptions on a weekly or daily basis • Especially troubling for HPC systems • Even detected errors can be a concern for HPC systems • A high rate of detection without recovery could cripple HPC systems • System halt or app-kill • HPC systems must consider recovery and failover mechanisms

  8. Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary

  9. Strike on a bit (e.g., in register file) • Particle strike causes bit flip! • Bit read? no: benign fault, no error; yes: check whether the bit has error protection • Detection & correction: benign fault, no error • Detection only: does bit matter? yes: True Detected Unrecoverable Error; no: False Detected Unrecoverable Error • No protection: does bit matter? yes: Silent Data Corruption; no: benign fault, no error • SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error
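
The decision flow on this slide can be written as a small function. This is only a sketch: the names `Protection` and `classify_strike` are hypothetical, and the branch order is a reading of the flattened flowchart rather than something the slide spells out in text.

```python
from enum import Enum


class Protection(Enum):
    """Error protection on the struck bit (hypothetical names)."""
    NONE = "none"
    DETECT_ONLY = "detection only"
    DETECT_AND_CORRECT = "detection & correction"


def classify_strike(bit_read: bool, protection: Protection, bit_matters: bool) -> str:
    """Classify a single bit flip by walking the slide's decision flow."""
    if not bit_read:
        # A flipped bit that is never read cannot affect anything.
        return "benign fault"
    if protection is Protection.DETECT_AND_CORRECT:
        # The error is detected and repaired before it is used.
        return "benign fault"
    if protection is Protection.DETECT_ONLY:
        # Detection without correction always raises a DUE; it is a
        # "false" DUE when the bit would not have mattered anyway.
        return "true DUE" if bit_matters else "false DUE"
    # Unprotected bit: the error is silent if it matters at all.
    return "SDC" if bit_matters else "benign fault"


print(classify_strike(True, Protection.NONE, True))
print(classify_strike(True, Protection.DETECT_ONLY, False))
```

Note that adding detection alone turns every read of a flipped bit into a DUE, including the flips that would otherwise have been benign. That is where the false DUE category comes from.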

  10. Metrics • Interval-based • MTTF = Mean Time to Failure • MTTR = Mean Time to Repair • MTBF = Mean Time Between Failures = MTTF + MTTR • Availability = MTTF / MTBF • Rate-based • FIT = Failure in Time; 1 FIT = 1 failure in a billion hours • SER FIT is broken into SDC FIT and DUE FIT • Total Soft Error FIT, summed over each vulnerable device $i$: $\text{Total Soft Error FIT} = \sum_i (\text{circuit-level soft error rate}_i \times \text{AVF}_i)$ • AVF = fraction of faults that become user-visible errors
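
As a rough illustration of how these metrics combine, the sketch below computes availability and the AVF-weighted FIT sum. The function names and the raw FIT values are hypothetical; only the AVF figures (0%, about 100%, 29%) are taken from other slides in this deck.

```python
def mtbf_hours(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF."""
    return mttf_hours / mtbf_hours(mttf_hours, mttr_hours)


def total_soft_error_fit(devices) -> float:
    """Sum raw (circuit-level) FIT times AVF over all vulnerable devices.

    `devices` is an iterable of (name, raw_fit, avf) tuples.
    """
    return sum(raw_fit * avf for _name, raw_fit, avf in devices)


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a FIT rate to MTTF, using 1 FIT = 1 failure per 1e9 hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive to have a finite MTTF")
    return 1e9 / fit


# Raw FIT values are made up for illustration; AVFs follow the slides.
example_devices = [
    ("branch predictor", 100.0, 0.00),
    ("program counter", 10.0, 1.00),
    ("instruction queue", 200.0, 0.29),
]

fit = total_soft_error_fit(example_devices)
print(f"total FIT: {fit:.1f}, MTTF: {fit_to_mttf_hours(fit):.3e} hours")
```

The branch predictor contributes nothing to the total despite having a sizeable raw rate in this example, which is why the sum weights each device by its AVF instead of using raw rates alone.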

  11. Architectural Vulnerability Factor: Does a bit matter? • Branch Predictor • Doesn’t matter at all (AVF = 0%) • Program Counter • Almost always matters (AVF ~ 100%) • Computing AVF for complex structures • Statistical Fault Injection • ACE Analysis

  12. Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary

  13. Statistical Fault Injection (SFI) into RTL • Simulate a strike on a latch • Does the fault propagate through the logic to architectural state? • AVF (crudely) = number of errors in arch. state / number of injected faults • SFI into RTL is good for latches, but very difficult for arch. structures • SFI is extremely compute intensive, as many fault simulations are needed to provide statistically significant results
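
The crude AVF ratio, and the reason so many injections are needed, can be seen in a toy Monte Carlo sketch. Everything here is an assumption made for illustration: `toy_outcome` stands in for real RTL simulation, the latch is 8 bits wide with only its low 4 bits reaching architectural state, and the error bar uses a normal approximation.

```python
import math
import random


def toy_outcome(latch_bits: int) -> int:
    """Hypothetical downstream logic: only the low 4 of 8 bits are consumed."""
    return latch_bits & 0x0F


def estimate_avf(num_injections: int, seed: int = 0):
    """Return (avf_estimate, 95% confidence half-width) from random bit flips."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(num_injections):
        golden = rng.randrange(256)            # fault-free latch contents
        faulty = golden ^ (1 << rng.randrange(8))  # flip one random bit
        if toy_outcome(faulty) != toy_outcome(golden):
            errors += 1                        # fault reached "arch. state"
    avf = errors / num_injections
    half_width = 1.96 * math.sqrt(avf * (1.0 - avf) / num_injections)
    return avf, half_width


for n in (100, 10_000):
    avf, half_width = estimate_avf(n)
    print(f"{n} injections: AVF ~ {avf:.3f} +/- {half_width:.3f}")
```

The half-width shrinks only with the square root of the number of injections, so each extra digit of precision costs about 100 times more fault simulations.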

  14. Architecturally Correct Execution (ACE) • ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine) • Anything else (un-ACE path) can be derated away • [Figure: data flow graph from Program Input to Program Outputs, with a Dynamically Dead Instruction marked]
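
One way to picture a dynamically dead instruction is with a backward liveness pass over a straight-line trace. The sketch below is an assumption-laden simplification: it tracks registers only (no memory, no control flow), and `find_dynamically_dead` and the trace encoding are hypothetical names, not the authors' tooling.

```python
def find_dynamically_dead(trace, output_regs):
    """Return indices of instructions whose results never reach an output.

    `trace` is a list of (dest_reg, [source_regs]) in program order.
    `output_regs` are the registers that hold the program outputs.
    """
    live = set(output_regs)  # registers whose current value is still needed
    dead = []
    for index in range(len(trace) - 1, -1, -1):
        dest, sources = trace[index]
        if dest in live:
            # This write feeds something on the ACE path.
            live.discard(dest)
            live.update(sources)
        else:
            # Result is overwritten or never used: un-ACE. Its sources are
            # not marked live, so producers that only feed it are dead too.
            dead.append(index)
    return sorted(dead)


example_trace = [
    ("r1", []),      # 0: r1 = load input
    ("r4", ["r1"]),  # 1: r4 = r1 - 3   (only feeds instruction 2)
    ("r2", ["r4"]),  # 2: r2 = r4 + 1   (overwritten by 4 before any read)
    ("r3", ["r1"]),  # 3: r3 = r1 * 2
    ("r2", ["r3"]),  # 4: r2 = r3 + 5   (program output)
]

print(find_dynamically_dead(example_trace, ["r2"]))
```

In this trace, instruction 2 is dead because its result is overwritten, and instruction 1 is dead transitively because its only consumer is dead.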

  15. Mapping ACE & un-ACE Instructions to the Instruction Queue • SER & AVF are properties of a bit in hardware, so instruction ACE-ness must be mapped to the hardware level • [Figure: instruction queue entries NOP, Prefetch, ACE Inst, Wrong-Path Inst, Invalid, ACE Inst, each labeled as ACE, Architectural un-ACE, or Micro-architectural un-ACE]

  16. ACE Lifetime Analysis (1) (e.g., write-through data cache) • Idle is unACE • Assuming all time intervals are equal • For 3/5 of the lifetime the bit is valid • Gives a measure of the structure’s utilization • Number of useful bits • Amount of time useful bits are resident in structure • Valid for a particular trace • [Figure: write-through data cache timeline: Idle, Fill, Valid, Read, Valid, Read, Valid, Evict, Idle]

  17. ACE Lifetime Analysis (2) (e.g., write-through data cache) • Valid is not necessarily ACE • ACE % = AVF = 2/5 = 40% • Example Lifetime Components • ACE: fill-to-read, read-to-read • unACE: idle, read-to-evict, write-to-evict • [Figure: write-through data cache timeline: Idle, Fill, Read, Read, Evict, Idle]

  18. ACE Lifetime Analysis (3) (e.g., write-through data cache) • Data ACEness is a function of instruction ACEness • Second Read is by an unACE instruction • AVF = 1/5 = 20% • [Figure: write-through data cache timeline: Idle, Fill, Read, Read, Evict, Idle]
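
The three lifetime slides can be reproduced with a short sketch. It assumes a write-through cache (so eviction is never ACE), five equal time intervals with events at times 1 through 4, and counts ACE time from a fill to the last read by an ACE instruction. The function name and event encoding are hypothetical.

```python
def analyze_lifetime(events, total_time):
    """Return (valid_fraction, ace_fraction) for one bit over a trace.

    `events` is a time-ordered list of (time, kind, reader_is_ace), where
    kind is "fill", "read", or "evict". The flag only matters for reads.
    """
    valid_time = 0.0
    ace_time = 0.0
    fill_time = None        # start of the current residency, if any
    last_ace_read = None    # latest ACE read in the current residency
    for time, kind, reader_is_ace in events:
        if kind == "fill":
            fill_time = time
            last_ace_read = None
        elif kind == "read" and fill_time is not None and reader_is_ace:
            last_ace_read = time
        elif kind == "evict" and fill_time is not None:
            valid_time += time - fill_time
            if last_ace_read is not None:
                # The bit had to be correct from fill up to the last ACE read.
                ace_time += last_ace_read - fill_time
            fill_time = None
    return valid_time / total_time, ace_time / total_time


both_reads_ace = [(1, "fill", False), (2, "read", True),
                  (3, "read", True), (4, "evict", False)]
second_read_unace = [(1, "fill", False), (2, "read", True),
                     (3, "read", False), (4, "evict", False)]

print(analyze_lifetime(both_reads_ace, 5))
print(analyze_lifetime(second_read_unace, 5))
```

The first trace gives a valid fraction of 3/5 and an ACE fraction of 2/5, matching slides 16 and 17. In the second trace the un-ACE second read drops the ACE fraction to 1/5, matching slide 18.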

  19. Computing AVF of address-based structures (tags) • A fault in a tag that is nominally associated with a particular instruction can impact the correct execution of a different, independent instruction • False negatives cause an error only if writeback is necessary • Uses standard lifetime analysis • False positives always result in an error • Need bit-level analysis

  20. False Positive • Expected Tag Miss, but got Hit: Error • How do you compute the AVF? • Expect MISS: incoming address 1001, tag address 1000 • Acquire HIT: tag address 1001, incoming address 1001

  21. Hamming-Distance-1 Analysis • Assuming a single-bit error model • Now we can use lifetime analysis on the identified bit(s) • [Figure: Tag Array and Incoming Address; the value 101010 is compared against 001010, 111010, 000001, 111000, 010101, and 111111, and the two Hamming-Distance-1 matches are 001010 and 111010]
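
Under the single-bit error model, only values that differ in exactly one bit can turn a miss into a false hit, and the differing bit is the one to feed into lifetime analysis. The sketch below finds those pairs for the bit patterns on this slide. The names `reference` and `candidates` are deliberately neutral, because the extracted slide text does not make clear which side is the tag array and which is the incoming address.

```python
def distance_one_matches(reference: str, candidates):
    """Return (candidate, bit_index) for candidates one bit away from reference.

    Values are equal-length bit strings. bit_index counts from the left,
    in the order the bits are written.
    """
    matches = []
    for candidate in candidates:
        if len(candidate) != len(reference):
            raise ValueError("bit strings must have equal length")
        differing = [i for i, (r, c) in enumerate(zip(reference, candidate))
                     if r != c]
        if len(differing) == 1:
            # A single flip of this bit would make the two values match.
            matches.append((candidate, differing[0]))
    return matches


slide_values = ["001010", "111010", "000001", "111000", "010101", "111111"]
print(distance_one_matches("101010", slide_values))
```

For these values the function reports 001010 (differing in the leftmost bit) and 111010 (differing in the second bit), which agrees with the two matches marked on the slide. All other bits of the compared entries can be treated as un-ACE with respect to false positives.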

  22. Outline • Background • Terminology & Metrics • AVF Computation Techniques • AVF Results & Summary • Summary

  23. Instruction Queue • IA64-like, SPEC 2K • ACE percentage = AVF = 29%

  24. Summary • Soft Errors: real problem today • Culprits: neutrons from deep space & alpha from packaging • Major problem in next few technology generations • Problem scales with Moore’s Law, die size, & system size • HPC systems are particularly affected due to size • Industry looking for cost-effective solutions • AVF analysis is the first step towards accurately estimating soft error rates at the architectural level • ACE Lifetime Analysis • Hamming-distance-one analysis

  25. Backups

  26. Data AVFs (Average) • IA64-like, SPEC 2K • STB AVF is lower due to a large idle component and bytemasks • DTB AVF is higher due to high average utilization • Dcache (WB) AVF is higher than Dcache (WT) since dirty bytes are still ACE after the last read

  27. Tag AVFs (Average) • IA64-like, SPEC 2K • Tag AVFs are low for DTB and DCache (WT) • Only Hamming-Distance-1 matches contribute to ACE time • Tag AVFs are higher than data AVFs for STB and DCache (WB) • Dynamically dead tags are still ACE for dirty bytes
