Microprocessor Reliability

Microprocessor Reliability. Robert Pawlowski, ECE 570 – 2/19/2013. Reliability: involves different aspects of a processor that can affect performance and functionality. Ultimately can reduce the lifetime of the processor. Issues typically manifest themselves at the device level.

Presentation Transcript


  1. Microprocessor Reliability Robert Pawlowski ECE 570 – 2/19/2013

  2. Reliability • Involves different aspects of a processor that can affect performance and functionality. • Ultimately can reduce the lifetime of the processor. • Issues typically manifest themselves at the device level. • Solutions can be implemented at multiple design levels.

  3. Why the concern? • Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variability. • Gate length/doping concentration variations • Temperature • Supply voltage droops • This decreases processor yield • Decreasing device sizes → Increased effect of external issues

  4. Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact

  5. Processor Error Classification • Hard Errors will result in permanent processor failure. • Processor lifetime is inversely proportional to hard error rate. • Soft Errors do not permanently damage the device.
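The inverse relationship between lifetime and hard error rate can be made concrete under the common assumption of a constant failure rate, where the mean time to failure (MTTF) is the reciprocal of the rate. The sketch below is illustrative only: the FIT unit (failures per 10^9 device-hours) is standard, but the function names and the example rates are assumptions, not values from the slides.

```python
HOURS_PER_YEAR = 24 * 365


def mttf_hours(fit: float) -> float:
    """MTTF in hours for a constant failure rate given in FIT.

    1 FIT = 1 failure per 1e9 device-hours, so MTTF = 1e9 / FIT.
    """
    if fit <= 0:
        raise ValueError("fit must be positive")
    return 1e9 / fit


def mttf_years(fit: float) -> float:
    """Same quantity expressed in years (365-day years, an assumption)."""
    return mttf_hours(fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical rates: doubling the failure rate halves the lifetime.
    for rate in (1000.0, 2000.0):
        print(rate, "FIT ->", round(mttf_years(rate), 1), "years")
```

Doubling the rate halves the result, which is the inverse proportionality the slide states.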

  6. Hard Errors • Extrinsic failures • Caused by process and manufacturing defects • Occur with decreasing rate over time • No impact from micro-architecture • Intrinsic failures • Related to processor wear-out • Occur with increasing rate over time • Related to wafer packaging, process parameters, and processor design.
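The two failure classes differ in how their rate changes over time: decreasing for extrinsic failures, increasing for intrinsic wear-out. One common way to model both trends is a Weibull hazard function, where a shape parameter below 1 gives a falling rate and a shape above 1 gives a rising rate. The Weibull model and all parameter values here are assumptions chosen for illustration; the slides do not name a specific distribution.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k / s) * (t / s) ** (k - 1).

    shape < 1: rate falls with time (early-life / extrinsic failures)
    shape = 1: constant rate
    shape > 1: rate rises with time (wear-out / intrinsic failures)
    """
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    # Hypothetical parameters, time in arbitrary units.
    for t in (1.0, 5.0, 10.0):
        extrinsic = weibull_hazard(t, shape=0.5, scale=10.0)
        intrinsic = weibull_hazard(t, shape=3.0, scale=10.0)
        print(t, round(extrinsic, 4), round(intrinsic, 4))
```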

  7. Hard Errors

  8. Soft Errors • Occur in both memory and logic • External radiation main issue in memory • Alpha particles • High energy neutrons • Thermal neutrons • Different causes of transient errors in logic • External radiation • Supply voltage droop • Power supply fluctuations • Ground bounce, cross-talk • Process variation, temperature • Affect delay of computational paths

  9. Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact

  10. Radiation-Induced Soft Errors • Ionized particle strike causing a state change • No permanent damage (unlike a hard error) • Combinational logic – Single Event Transients (SET) • Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU) • Three causes of soft errors • Alpha particles • Thermal neutrons • High-energy neutrons

  11. Alpha Particles • Emitted from impurities in packaging materials. • Create electron-hole pairs through direct ionization • Range for a 10 MeV particle < 100 µm • Typical energy 4-9 MeV • Improved manufacturing trends → Reduced effect • Purified materials • Shielding layers

  12. Neutrons • Result of cosmic ray reactions with the atmosphere • High-energy neutrons react with chip materials. • Concrete is the only effective shielding material • Flux is 1.4x lower per foot of thickness
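The shielding figure on slide 12 can be turned into a quick estimate. The sketch below assumes the 1.4x reduction compounds with each additional foot of concrete (exponential attenuation), which is a reading of the slide and not something it states explicitly; the function name and the example flux value are hypothetical.

```python
def attenuated_flux(flux: float, concrete_ft: float,
                    factor_per_ft: float = 1.4) -> float:
    """Neutron flux remaining behind `concrete_ft` feet of concrete.

    Assumes each foot divides the flux by `factor_per_ft`
    (1.4 is the figure quoted on the slide).
    """
    if concrete_ft < 0:
        raise ValueError("thickness cannot be negative")
    return flux / factor_per_ft ** concrete_ft


if __name__ == "__main__":
    # Hypothetical incoming flux of 100 (arbitrary units).
    for feet in (0, 1, 2, 5):
        print(feet, "ft ->", round(attenuated_flux(100.0, feet), 2))
```

Under this assumption, even several feet of concrete leave a substantial fraction of the flux, which is why shielding alone is not a practical fix.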

  13. Neutrons • Thermal neutrons (<< 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer. • Produce ionized particles that can cause soft errors • Solution → Remove BPSG from advanced processes • Mostly solved – SEUs still found in 45 nm and 90 nm processes

  14. Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact

  15. Device-level solutions • Larger device sizes → Larger capacitance • Increases the amount of charge necessary to flip a bit (critical charge) • Multiple VT design • Sensitivity to variation at low VDD may limit effectiveness. • Body biasing is also common to both radiation hardening and variation tolerance
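The link between device size, capacitance, and critical charge can be illustrated with the first-order approximation Q_crit ≈ C_node × V_DD. This is a simplification assumed here for illustration: it ignores the restoring current of the cell and the shape of the injected current pulse. The function name and the capacitance and voltage values are hypothetical.

```python
def critical_charge_fc(node_capacitance_ff: float, vdd_volts: float) -> float:
    """First-order critical charge in femtocoulombs.

    Q_crit ~= C_node * V_DD, with C in femtofarads and V in volts
    (fF * V = fC). Ignores restoring current and pulse shape.
    """
    if node_capacitance_ff < 0 or vdd_volts < 0:
        raise ValueError("capacitance and voltage must be non-negative")
    return node_capacitance_ff * vdd_volts


if __name__ == "__main__":
    # Hypothetical storage nodes: a larger device and a lower supply.
    print(critical_charge_fc(2.0, 1.0))  # baseline
    print(critical_charge_fc(4.0, 1.0))  # larger device, more charge needed
    print(critical_charge_fc(2.0, 0.5))  # low-VDD operation, less charge needed
```

The same approximation shows why low-VDD operation works against hardening: lowering the supply reduces the charge needed to upset the node.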

  16. Circuit-level solutions • DICE cell • Used for SRAM, FFs, latches • Built-in current sensors on supply lines of memory cells.

  17. Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact

  18. Modular redundancy • Dual Modular Redundancy • Triple Modular Redundancy
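The difference between the two schemes is that DMR can only detect a mismatch, while TMR can mask a single faulty copy through majority voting. A minimal behavioral sketch, with module outputs modeled as Python integers; the function names are illustrative and not taken from the slides.

```python
def dmr_mismatch(a: int, b: int) -> bool:
    """Dual modular redundancy: two copies can detect, but not correct."""
    return a != b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Triple modular redundancy: bitwise majority of three copies.

    Each output bit is 1 when at least two of the three inputs have a 1
    in that position, so any single faulty copy is outvoted.
    """
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    good = 0b1010
    faulty = good ^ 0b1000           # one copy suffers a bit flip
    print(dmr_mismatch(good, faulty))        # error detected
    print(bin(tmr_vote(good, good, faulty)))  # error masked
```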

  19. Redundant Circuits • Redundancy increases area/power • DMR/TMR in sub/near-VT • Timing variation between circuits increases • Utilization of redundant lanes for parallel operation can increase throughput at low-VDD

  20. Self-Checking Circuits • Partition circuit into smaller blocks • Error checker for each block • Use error detection codes • Berger codes • Arithmetic codes • Increases circuit delay for error computation
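As an illustration of one of the codes named above, a Berger code appends the binary count of zeros in the information bits. Because a unidirectional error (all flips 1→0, or all 0→1) moves the zero count and the stored check value in opposite directions, every such error is detected. The sketch below represents words as bit strings for readability; that representation and the function names are assumptions for illustration.

```python
def berger_encode(info: str) -> str:
    """Append Berger check bits: the count of zeros in `info`, in binary.

    For k information bits the check field needs ceil(log2(k + 1)) bits,
    which equals k.bit_length().
    """
    width = len(info).bit_length()
    return info + format(info.count("0"), f"0{width}b")


def berger_valid(word: str, k: int) -> bool:
    """Recompute the zero count of the first k bits and compare."""
    info, check = word[:k], word[k:]
    return int(check, 2) == info.count("0")


if __name__ == "__main__":
    word = berger_encode("1011")
    print(word, berger_valid(word, 4))
    corrupted = "0010" + word[4:]    # two 1 -> 0 flips in the info bits
    print(corrupted, berger_valid(corrupted, 4))
```

The check is a recount plus a comparison, which is the extra delay the slide attributes to error computation.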

  21. Circuit-Level Speculation • Uses approximated circuit implementation • Goal is to reduce critical path

  22. Tunable Replica Circuits • Mirrors delay of critical path • Monitors for errors over voltage/frequency changes

  23. Timing Speculation • Razor timing error detection • Designed for transient faults • Effective against SETs and SBUs on flip-flops • Requires error recovery
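The core idea behind Razor-style detection is to sample the same data twice: once in the main flip-flop at the clock edge and once in a shadow latch on a delayed clock, then flag an error when the two disagree. The following is a purely behavioral sketch under simplifying assumptions (a single data transition per cycle, no setup/hold or metastability effects, no short-path constraint); the function and parameter names are hypothetical.

```python
from typing import Tuple


def razor_sample(old: int, new: int, arrival: float,
                 t_clk: float, shadow_delay: float) -> Tuple[int, int, bool]:
    """Return (main_ff, shadow_latch, error) for one clock cycle.

    `arrival` is when the data input settles from `old` to `new`.
    The main flip-flop samples at `t_clk`; the shadow latch samples at
    `t_clk + shadow_delay`. A mismatch signals a timing error, and the
    shadow value is the one to restore during recovery.
    """
    main_ff = new if arrival <= t_clk else old
    shadow = new if arrival <= t_clk + shadow_delay else old
    return main_ff, shadow, main_ff != shadow


if __name__ == "__main__":
    print(razor_sample(0, 1, arrival=0.9, t_clk=1.0, shadow_delay=0.5))
    print(razor_sample(0, 1, arrival=1.2, t_clk=1.0, shadow_delay=0.5))
```

In this model a late-arriving value is caught only if it settles within the shadow window; anything later escapes detection, which bounds how aggressively voltage or frequency can be pushed.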

  24. Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact

  25. Error Recovery Options in Scalar Processors • Clock Gating: • Global error signal • Clock gating • 1-cycle penalty

  26. Error Recovery Options in Scalar Processors • Multiple Issue: • Error signals propagated to control unit • Instructions must be flushed • Error instruction then replayed • 2N-cycle penalty

  27. Error Recovery Options in Scalar Processors • Counter-flow pipelining • Micro-rollback

  28. Error correcting codes for memories • Most common is Hamming code • Check bits stored when data written • Identifies error and erroneous bit position
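The write and read steps described on slide 28 can be sketched with the smallest practical instance, Hamming(7,4): three check bits are computed when four data bits are written, and on read the recomputed parities form a syndrome equal to the position of a single flipped bit. Real memories use wider codes; the list-of-bits representation and function names here are illustrative assumptions.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit word (positions 1..7).

    Parity bits sit at positions 1, 2, 4; data bits at 3, 5, 6, 7.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word: List[int]) -> Tuple[List[int], int]:
    """Return (corrected data bits, error position or 0 if none)."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome = 1-based bit position
    if position:
        c[position - 1] ^= 1          # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    stored[4] ^= 1                    # simulate a single-bit upset
    print(hamming74_decode(stored))
```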

  29. Error correcting codes for memories • Single-bit ECC adds area/power and delay • Low VDD → Increased delay • Hybrid VDD operation will reduce delay • Overhead increases for multi-bit ECC • Increased memory density → higher probability of MBU • Current research: increase in ratio of MBU to total SER in sub-VT
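The area overhead of single-bit ECC can be estimated from the standard Hamming condition: r check bits can protect m data bits when 2^r ≥ m + r + 1. Adding one more overall parity bit gives the usual single-error-correct, double-error-detect (SECDED) variant. The sketch below only counts bits; it says nothing about delay or power, and the function names are illustrative.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    if data_bits < 1:
        raise ValueError("data_bits must be at least 1")
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        extra = secded_check_bits(width)
        print(width, "data bits:", extra, "check bits,",
              f"{100 * extra / width:.1f}% overhead")
```

The relative overhead shrinks as the word widens, which is one reason ECC is usually applied to wide memory words.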

  30. Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact

  31. System-Level Impact • Soft errors can have a large effect on processor functionality • Increasing issue with further device scaling • All methods of error detection/correction are costly • Need to be added to system blocks wisely • SEU distribution • Effects of process variation

  32. System-Level Impact • How to determine what blocks have the highest system-level impact? • Mostly through simulation • For radiation: all-encompassing • Includes fault injection @ circuit level • Different models have been developed • ReStore – University of Illinois at Urbana-Champaign • Focuses on system level effect of radiation-induced errors • RAMP – IBM • Directed more towards hard-errors and processor failure.
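The idea of simulation with fault injection can be sketched at a very small scale: flip each bit of a block's input in turn and record how often the flip changes the block's output. Blocks where most flips propagate are stronger candidates for protection than blocks that mask them. This toy example is not ReStore or RAMP, and it injects faults into a Python function instead of a circuit; the names and the example functions are hypothetical.

```python
from typing import Callable


def flip_bit(value: int, bit: int) -> int:
    """Inject a single-bit upset at position `bit`."""
    return value ^ (1 << bit)


def propagation_rate(block: Callable[[int], int],
                     value: int, width: int) -> float:
    """Fraction of single-bit input flips that change the block's output."""
    golden = block(value)                      # fault-free reference run
    visible = sum(
        1 for bit in range(width)
        if block(flip_bit(value, bit)) != golden
    )
    return visible / width


if __name__ == "__main__":
    # Hypothetical blocks: one passes data through, one uses only 4 bits.
    print(propagation_rate(lambda x: x, 0xA5, 8))
    print(propagation_rate(lambda x: x & 0x0F, 0xA5, 8))
```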

  33. Questions?