
Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report

Shubu Mukherjee, Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*. Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation


Presentation Transcript


  1. Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report • Shubu Mukherjee, Joel Emer, Tryggve Fossum, & Steven K. Reinhardt* • Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation • 10th IEEE Pacific Rim International Symposium on Dependable Computing, French Polynesia, March 3-5, 2004 • * Also, University of Michigan, Ann Arbor

  2. Summary • SECDED ECC (single error correction, double error detection) • commonly used in on-chip caches • interleaving converts spatial multi-bit errors to multiple single bit errors • Scrubbing • periodically read cache blocks and correct all single bit errors • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors • Our conclusion: given detected error target of 10 year MTTF • Scrubbing necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

  3. Origin of Cosmic Rays [Figure: shower of protons (p) and neutrons (n) descending to the Earth's surface] • Cosmic rays come from deep space

  4. Impact of Neutron Strike on a Si Device [Figure: neutron strike on a transistor device, showing source, drain, and the released + and - charges] • Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device • Secondary source of upsets: alpha particles from packaging

  5. Strike Changes State of a Single Bit [Figure: one bit flipped between 0 and 1] • Example solution • Error correction codes (ECC) for single bit correction • Overhead = 7 bits for 64 bits of data

  6. Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error [Figure: two adjacent bits flipped between 0 and 1] • Example solution • SECDED ECC (single error correction, double error detection) • 8 bits of code per 64 bits of data • Interleaving for the more general case …

  7. Interleaving [Figure: row of bits in which physically adjacent bits carry different markers; legend: X = covered with single ECC code, + = covered with different ECC code] • Interleaving converts • spatial multi-bit error → multiple single bit errors
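The effect of interleaving can be illustrated with a small sketch. It assumes a simple round-robin layout in which physical bit `p` of a row is protected by codeword `p % ways`; the function names and the layout are illustrative assumptions, not taken from the slides.

```python
# Hypothetical layout: physical bit p in a row belongs to codeword p % ways.
def codeword_of(bit_position: int, ways: int) -> int:
    """Return the index of the ECC codeword that protects a physical bit."""
    return bit_position % ways


def errors_per_codeword(struck_bits, ways: int):
    """Count how many struck physical bits land in each ECC codeword."""
    counts = [0] * ways
    for p in struck_bits:
        counts[codeword_of(p, ways)] += 1
    return counts


def burst_is_correctable(start: int, length: int, ways: int) -> bool:
    """True if a burst of adjacent flips leaves at most one error per codeword,
    which is the condition for single-error correction to repair all of them."""
    counts = errors_per_codeword(range(start, start + length), ways)
    return max(counts) <= 1
```

With `ways = 1` (no interleaving) a two-bit burst puts both errors in one codeword. With `ways = 2` the same burst becomes one single bit error in each of two codewords. In this model a burst is correctable as long as it is no longer than `ways`.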

  8. Two Separate Strikes on Different Bits: Temporal Double Bit Errors [Figure: one strike at cycle 100 and a second strike on a different bit at cycle 1,000,000] • SECDED ECC (single error correction, double error detection) • could detect error, but cannot correct the error • if errors accumulate • single bit correctable error becomes a double bit detectable error

  9. Solutions for Temporal Double Bit Errors • Natural Effects • whenever a processor reads a cache block, we can correct the single bit error • check for errors when cache blocks are replaced from the cache • More Powerful ECC • SECDED ECC requires 8 bits per 64 bits • 7 bits for single bit correction • 8th bit for double bit detection • Overhead = 13% • ECC with two bit correction requires 12 bits per 64 bits • Overhead = 19% • Scrubbing • Periodically read memory and correct all single bit errors • Disallows accumulation of temporal double bit errors • Standard technique in main memories (DRAMs) • Our calculations (later) will assume the worst case for soft errors • cache blocks don’t get scrubbed naturally
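The 7-bit and 8-bit figures quoted for 64 data bits follow from the Hamming bound for single error correction. The sketch below is a minimal check of that arithmetic; the function names are illustrative. The 12-bit figure for two-bit correction is taken from the slide as given and is not derived here.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1.

    The r check bits must name every possible single-bit error position
    (data or check bit) plus the "no error" case.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SECDED adds one overall parity bit on top of the SEC check bits."""
    return sec_check_bits(data_bits) + 1


def overhead_percent(check_bits: int, data_bits: int) -> float:
    """Storage overhead of the check bits relative to the data bits."""
    return 100.0 * check_bits / data_bits


print(sec_check_bits(64))        # check bits for single error correction
print(secded_check_bits(64))     # check bits for SECDED
print(overhead_percent(8, 64))   # SECDED overhead
print(overhead_percent(12, 64))  # overhead using the slide's 12-bit figure
```

The two overheads come out at 12.5% and 18.75%, which the slide rounds to 13% and 19%.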

  10. Memory Hierarchy of a Processor [Figure: CPU, L1 cache (kilobytes), L2 cache (megabytes), main memory (gigabytes)] • Do we need to scrub on-chip caches? • depends on the size of these caches

  11. Detected Unrecoverable Error (DUE) • Interval-based • MTTF = Mean Time to Failure • E.g., goal = 10 years MTTF for application crash • Bossen, IRPS 2002 • Rate-based • FIT = Failure in Time = 1 failure in a billion hours • 10 year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT = 11,415 FIT • Hypothetical example: Cache: 62 FIT + IQ: 100 FIT + FU: 58 FIT = total of 220 FIT
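The conversion between the interval-based and rate-based views is a single division. A minimal sketch, assuming a 365-day year as on the slide (the function and constant names are illustrative):

```python
HOURS_PER_YEAR = 24 * 365   # the slide uses a 365-day year
FIT_HOURS = 1e9             # 1 FIT = 1 failure per billion device-hours


def mttf_years_to_fit(mttf_years: float) -> float:
    """Convert a mean time to failure in years into a failure rate in FIT."""
    return FIT_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """Convert a failure rate in FIT into a mean time to failure in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


print(mttf_years_to_fit(10))  # FIT budget for a 10 year MTTF goal
```

Because FIT rates of independent components add, a FIT budget can be compared directly against the sum of per-structure rates, which is why the rate-based form is convenient for design targets.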

  12. MTTF calculations: probabilities • 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits • $Q$ = # quadwords in cache memory • $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single bit errors, followed by a double bit error on the $n$th strike • First strike (any quadword), probability = $Q/Q$ • Second strike (same quadword), probability = $1/Q$ • $P_d[1] = 0$ • $P_d[2] = (Q/Q) \cdot (1/Q) = 1/Q$

  13. MTTF calculations: probabilities • 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits • $Q$ = # quadwords in cache memory • $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single bit errors, followed by a double bit error on the $n$th strike • First strike (any quadword), probability = $Q/Q$ • Second strike (a different quadword), probability = $(Q-1)/Q$ • Third strike (one of the two quadwords already struck), probability = $2/Q$ • $P_d[3] = (Q/Q) \cdot [(Q-1)/Q] \cdot (2/Q) = [(Q-1)/Q] \cdot (2/Q)$

  14. MTTF calculations: probabilities • 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits • $Q$ = # quadwords in cache memory • $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single bit errors, followed by a double bit error on the $n$th strike • $P_d[1] = 0$ • $P_d[2] = 1/Q$ • $P_d[3] = [(Q-1)/Q] \cdot [2/Q]$ • $P_d[4] = [(Q-1)/Q] \cdot [(Q-2)/Q] \cdot [3/Q]$ • … • $P_d[n] = [(Q-1)/Q] \cdot [(Q-2)/Q] \cdot [(Q-3)/Q] \cdots [(Q-n+2)/Q] \cdot [(n-1)/Q]$
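The general term can be checked numerically. The sketch below evaluates $P_d[n]$ with exact rational arithmetic for a small, hypothetical $Q$ and confirms that the probabilities over $n = 2, \ldots, Q+1$ sum to 1 (after $Q+1$ strikes some quadword must have been hit twice). The function name `pd` is an illustrative choice.

```python
from fractions import Fraction


def pd(n: int, q: int) -> Fraction:
    """Probability that the n-th strike is the first to hit an already-struck quadword.

    Strikes 2..n-1 must each land in a fresh quadword, giving the factors
    (q-1)/q, (q-2)/q, ..., (q-n+2)/q; strike n must land in one of the
    n-1 quadwords already struck, giving the factor (n-1)/q.
    """
    if n < 2 or n > q + 1:
        return Fraction(0)
    prob = Fraction(1)
    for i in range(1, n - 1):
        prob *= Fraction(q - i, q)
    return prob * Fraction(n - 1, q)


q = 8  # small hypothetical cache of 8 quadwords, for illustration only
print([pd(n, q) for n in range(1, 5)])
print(sum(pd(n, q) for n in range(2, q + 2)))  # total probability
```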

  15. MTTF calculations: Equation • $M$ = mean # of single bit errors to get a double bit error = expected value of random variable with $P_d[n]$ as the probability distribution function • $M$ can be easily generated using a computer program • MTTF (double bit error) = $M \times$ MTTF (single bit error) • For a 32 megabyte cache & FIT/bit = 0.001 [Normand 1996, Tosaka 1996] • MTTF (double bit error) = $M \times$ MTTF (single bit error) = $2567 \times (1 / \text{Cache FIT}) = 2567 \times \left(10^9 / (0.001 \times 2^{22} \times 72 \times 24 \times 365)\right)$ = 970 years • Saleh et al.'s 1990 closed-form equation • MTTF (double bit error) = $[1 / (72 f)] \cdot \sqrt{\pi / (2Q)}$ = 970 years, $f$ = FIT/bit
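One way such a program might look is sketched below. It is not the authors' program: it computes the expected value through the identity $E[N] = \sum_{k \ge 0} P(N > k)$ rather than summing $n \cdot P_d[n]$ directly, and the function names and the `1e-18` cutoff are assumptions made for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365


def mean_strikes_to_double_bit_error(q: int) -> float:
    """Expected number of strikes until some quadword has been struck twice."""
    total = 0.0
    survive = 1.0  # P(N > k): the first k strikes all hit distinct quadwords
    k = 0
    while survive > 1e-18:         # remaining tail is negligible below this
        total += survive
        survive *= (q - k) / q     # P(N > k+1) = P(N > k) * (q - k) / q
        k += 1
    return total


def mttf_double_bit_years(cache_bytes: int, fit_per_bit: float) -> float:
    """M * MTTF(single bit error), with 72 stored bits per 8-byte quadword."""
    q = cache_bytes // 8
    cache_fit = fit_per_bit * q * 72
    mttf_single_years = 1e9 / (cache_fit * HOURS_PER_YEAR)
    return mean_strikes_to_double_bit_error(q) * mttf_single_years


def closed_form_years(cache_bytes: int, fit_per_bit: float) -> float:
    """The closed form from the slide, with f converted to failures per bit-year."""
    q = cache_bytes // 8
    f_per_year = fit_per_bit * 1e-9 * HOURS_PER_YEAR
    return math.sqrt(math.pi / (2 * q)) / (72 * f_per_year)


cache = 32 * 2**20  # 32 megabytes, i.e. 2**22 quadwords
print(mean_strikes_to_double_bit_error(cache // 8))
print(mttf_double_bit_years(cache, 0.001))
print(closed_form_years(cache, 0.001))
```

For the 32 megabyte case this sketch gives $M$ close to 2567 and an MTTF close to 970 years by both routes, matching the figures on the slide.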

  16. Temporal Double Bit MTTF Variations with Cache Size • FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996) • higher at higher altitudes (e.g., 3-5x at 1.5km in Denver) • Temporal double bit error has very small contribution to DUE rate • compared to a goal of 10 years DUE MTTF

  17. MTTF with Scrubbing [Figure: timeline divided into consecutive scrubbing intervals of length I] • $I$ = scrubbing interval, scrub at the end of each interval $I$ • $N$ = # scrubbing intervals to reach MTTF = expected value of random variable with probability distribution function $(1-p_f)^N \cdot p_f$, where $p_f$ = probability of a temporal double bit error at the end of an interval • Assuming 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), scrub once a year ($I$ = 1 year) • MTTF (double bit error) = $N \times I = 2281 \times 1$ = 2281 years • Saleh et al. 1990 closed-form equation • $2 / [Q \cdot I \cdot (72 f)^2]$ = 2341 years, $f$ = FIT/bit
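The scrubbing estimate can be sketched in two ways: the closed form quoted on the slide, and a simple interval model. The interval model is an assumption made here for illustration: strikes on each quadword are treated as a Poisson process, $p_f$ is the chance that any quadword collects two or more strikes within one interval, and the mean number of intervals is taken as $1/p_f$. The transcript does not show how the slide's 2281 was obtained, so this sketch is not expected to reproduce that value exactly.

```python
import math

HOURS_PER_YEAR = 24 * 365


def strike_rate_per_quadword_year(fit_per_bit: float) -> float:
    """Strikes per year on one 72-bit quadword (1 FIT = 1e-9 failures/hour)."""
    return 72 * fit_per_bit * 1e-9 * HOURS_PER_YEAR


def closed_form_scrubbed_mttf_years(q: int, fit_per_bit: float,
                                    interval_years: float) -> float:
    """The slide's closed form 2 / [Q * I * (72 f)^2], with f per bit-year."""
    rate = strike_rate_per_quadword_year(fit_per_bit)
    return 2.0 / (q * interval_years * rate ** 2)


def poisson_scrubbed_mttf_years(q: int, fit_per_bit: float,
                                interval_years: float) -> float:
    """Assumed model: MTTF = I / p_f with Poisson strikes per quadword."""
    mu = strike_rate_per_quadword_year(fit_per_bit) * interval_years
    # P(one quadword sees 0 or 1 strikes) = (1 + mu) * exp(-mu); use logs for accuracy.
    log_no_double = q * (math.log1p(mu) - mu)
    p_fail = -math.expm1(log_no_double)  # 1 - P(no quadword sees >= 2 strikes)
    return interval_years / p_fail


q = (16 * 2**30) // 8  # 16 GB cache, i.e. 2**31 quadwords
print(closed_form_scrubbed_mttf_years(q, 0.001, 1.0))
print(poisson_scrubbed_mttf_years(q, 0.001, 1.0))
```

Both routes land near 2341 years for the 16 GB, once-a-year case, in line with the closed-form value on the slide and within a few percent of the 2281 years obtained from the interval calculation.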

  18. Impact of Scrubbing on Temporal Double Bit MTTF (16 Gigabyte Cache) • FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996) • higher at higher altitudes (e.g., 3-5x at 1.5km in Denver) • For 16 gigabytes of cache, scrubbing can help • compared to a DUE MTTF goal of 10 years

  19. Summary • SECDED ECC (single error correction, double error detection) • commonly used in on-chip caches • interleaving converts spatial multi-bit errors to multiple single bit errors • Scrubbing • periodically read cache blocks and correct all single bit errors • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors • Our conclusion: given detected error target of 10 year MTTF • Scrubbing necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

  20. BACKUPS

  21. Raw soft error rate: 0.001 – 0.010 FIT/bit • Y. Tosaka, S. Satoh, K. Suzuki, T. Sugii, H. Ehara, G. A. Woffinden, and S. A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” Symposium on VLSI Technology, Digest of Technical Papers, 1996. • E. Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.
