Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report

Shubu Mukherjee

Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*

Fault Aware Computing Technology (FACT) Group

Massachusetts Microprocessor Design Center, Intel Corporation

10th IEEE International Symposium Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004

* Also, University of Michigan, Ann Arbor


Summary

  • SECDED ECC (single error correction, double error detection)
    • commonly used in on-chip caches
    • interleaving converts spatial multi-bit errors to multiple single bit errors
  • Scrubbing
    • periodically read cache blocks and correct all single bit errors
    • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
  • Our conclusion, given a detected-error target of 10 years MTTF
    • Scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

Origin of Cosmic Rays

[Figure: protons (p) and neutrons (n) above the Earth’s surface]

  • Cosmic rays come from deep space

Impact of Neutron Strike on a Si Device

[Figure: neutron strike on a transistor device, showing source, drain, and released + and - charges]

Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device

  • Secondary source of upsets: alpha particles from packaging

Strike Changes State of a Single Bit

[Figure: a single stored bit changing value (0, 1)]

  • Example Solution
    • Error correction codes (ECC) for single bit correction
    • Overhead = 7 bits for 64 bits of data

Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error

[Figure: two adjacent stored bits changing value]

  • Example solution
    • SECDED ECC (single error correction, double error detection)
      • 8 bits of code per 64 bits of data
    • Interleaving for the more general case …
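
The check-bit counts quoted above (7 bits for single bit correction, 8 bits for SECDED, per 64 bits of data) can be reproduced with the standard Hamming bound. The slides do not name the code used, so the following is a minimal sketch assuming a Hamming-style code; the function names are illustrative only.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (enough syndromes to name
    every single-bit error position plus the error-free case)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Assumed construction: a Hamming code plus one overall parity bit
    for double error detection."""
    return hamming_check_bits(data_bits) + 1


sec = hamming_check_bits(64)
secded = secded_check_bits(64)
print(f"SEC: {sec} check bits, SECDED: {secded} check bits per 64 data bits")
print(f"SECDED overhead: {secded / 64:.1%}")
```

Under this assumption a 64-bit quadword is stored as 72 bits, which is the unit the later MTTF calculations use.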

Interleaving bits

[Figure: a row of physically adjacent bits marked X, +, /, and 0 in an alternating pattern]

X = covered with single ECC code

+ = covered with different ECC code

  • Interleaving converts
    • spatial multi-bit error → multiple single bit errors
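
A small sketch of why interleaving helps, under an assumed layout in which physical bit i of a row belongs to ECC code word i mod ways. The layout and the names `codeword_of` and `errors_per_codeword` are hypothetical, chosen only to illustrate the idea.

```python
def codeword_of(physical_bit: int, ways: int) -> int:
    """Assumed layout: adjacent physical bits rotate through `ways` code words."""
    return physical_bit % ways


def errors_per_codeword(struck_bits, ways):
    """Count how many struck bits land in each ECC code word."""
    counts = {}
    for bit in struck_bits:
        cw = codeword_of(bit, ways)
        counts[cw] = counts.get(cw, 0) + 1
    return counts


burst = [10, 11]  # one strike upsetting two physically adjacent bits

# No interleaving: both upsets fall in the same code word (a double bit error).
print(errors_per_codeword(burst, ways=1))
# 2-way interleaving: each code word sees one upset, which SECDED can correct.
print(errors_per_codeword(burst, ways=2))
```

In this model a burst of up to `ways` adjacent bits leaves at most one error per code word.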

Two Separate Strikes on Different Bits: Temporal Double Bit Errors

[Figure: one strike at cycle 100 and a second strike on a different bit at cycle 1,000,000]

  • SECDED ECC (single error correction, double error detection)
    • can detect a double bit error, but cannot correct it
    • if errors accumulate
      • a correctable single bit error becomes a detectable (but uncorrectable) double bit error

Solutions for Temporal Double Bit Errors
  • Natural Effects
    • whenever a processor reads a cache block, we can correct the single bit error
    • check for errors when cache blocks are replaced from the cache
  • More Powerful ECC
    • SECDED ECC requires 8 bits per 64 bits
      • 7 bits for single bit correction
      • 8th bit for double bit detection
      • Overhead = 13%
    • ECC with two bit correction requires 12 bits per 64 bits
      • Overhead = 19%
  • Scrubbing
    • Periodically read memory and correct all single bit errors
    • Prevents single bit errors from accumulating into temporal double bit errors
    • Standard technique in main memories (DRAMs)
    • Our calculations (later) will assume the worst case for soft errors
      • cache blocks don’t get scrubbed naturally
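
The scrubbing idea above can be sketched with a toy model. This is not the hardware mechanism; it only tracks which bits of each block are currently flipped and treats one flipped bit as correctable and two or more as a detected unrecoverable error. The class `EccCache` and its methods are hypothetical.

```python
class EccCache:
    """Toy model: each block records the set of bit positions currently flipped."""

    def __init__(self, num_blocks: int):
        self.flipped = [set() for _ in range(num_blocks)]

    def strike(self, block: int, bit: int) -> None:
        # A particle strike flips one bit (a repeat hit flips it back).
        self.flipped[block].symmetric_difference_update({bit})

    def scrub(self):
        """Read every block; correct single bit errors, report double bit errors."""
        corrected, uncorrectable = 0, []
        for index, bits in enumerate(self.flipped):
            if len(bits) == 1:
                bits.clear()          # SECDED corrects the single flipped bit
                corrected += 1
            elif len(bits) >= 2:
                uncorrectable.append(index)  # simplification: treated as detected only
        return corrected, uncorrectable


scrubbed = EccCache(num_blocks=4)
scrubbed.strike(1, 5)
print(scrubbed.scrub())   # first error removed before the second strike arrives
scrubbed.strike(1, 9)
print(scrubbed.scrub())

unscrubbed = EccCache(num_blocks=4)
unscrubbed.strike(1, 5)
unscrubbed.strike(1, 9)   # errors accumulate in block 1
print(unscrubbed.scrub())
```

The same two strikes are harmless when a scrub pass runs between them and become a temporal double bit error when it does not.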

Memory Hierarchy of a Processor

[Figure: CPU → L1 Cache (kilobytes) → L2 Cache (megabytes) → Main Memory (gigabytes)]

  • Do we need to scrub on-chip caches?
    • depends on the size of these caches



Detected Unrecoverable Error (DUE)
  • Interval-based
    • MTTF = Mean Time to Failure
    • E.g., goal = 10 years MTTF for application crash
      • Bossen, IRPS 2002
  • Rate-based
    • FIT = Failure in Time = 1 failure in a billion hours
    • 10 year MTTF = $10^{9} / (24 \times 365 \times 10)$ FIT = 11,415 FIT

[Figure: Hypothetical Example: Cache: 62 FIT + IQ: 100 FIT + FU: 58 FIT, Total of 210 FIT]
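
The FIT and MTTF definitions above convert into each other directly. A minimal sketch, with illustrative function names, that reproduces the 10-year figure:

```python
HOURS_PER_YEAR = 24 * 365
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per billion hours


def mttf_years_to_fit(mttf_years: float) -> float:
    """Failure rate in FIT that corresponds to a given MTTF in years."""
    return HOURS_PER_FIT_UNIT / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """MTTF in years that corresponds to a given failure rate in FIT."""
    return HOURS_PER_FIT_UNIT / (fit * HOURS_PER_YEAR)


print(int(mttf_years_to_fit(10)))  # truncated, as on the slide
```

Because FIT rates of independent structures add, a chip-level FIT budget can be compared against this number directly.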


MTTF calculations: probabilities

[Figure: First Strike, Probability = Q / Q; Second Strike, Probability = 1 / Q]

  • 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
  • Q = # quadwords in cache memory
  • Pd[n] = probability that a sequence of n strikes causes n – 1 single bit errors, followed by a double bit error on the nth strike
  • Pd[1] = 0
  • Pd[2] = 1 / Q

Pd[2] = (Q/Q) * (1/Q) = 1/Q

MTTF calculations: probabilities

[Figure: First Strike, Probability = Q / Q; Second Strike, Probability = (Q-1) / Q; Third Strike, Probability = 2/Q]

  • 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
  • Q = # quadwords in cache memory
  • Pd[n] = probability that a sequence of n strikes causes n – 1 single bit errors, followed by a double bit error on the nth strike
  • Pd[3] = [ (Q-1)/Q ] * [2/Q]

Pd[3] = (Q/Q) * ((Q-1)/Q) * (2/Q)


MTTF calculations: probabilities

  • 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
  • Q = # quadwords in cache memory
  • Pd[n] = probability that a sequence of n strikes causes n – 1 single bit errors, followed by a double bit error on the nth strike
  • Pd[1] = 0
  • Pd[2] = 1 / Q
  • Pd[3] = [ (Q-1)/Q ] * [2/Q]
  • Pd[4] = [ (Q-1)/Q ] * [ (Q-2)/Q ] * [3/Q]
  • Pd[n] = [ (Q-1)/Q ] * [ (Q-2)/Q ] * [ (Q-3)/Q ] * … * [ (Q-n+2)/Q ] * [ (n-1)/Q ]
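
The general term Pd[n] can be evaluated directly from the product above. A minimal sketch (the function name `pd` is illustrative) that also checks that the probabilities over n = 2 … Q+1 sum to 1, since after Q single bit errors the next strike must hit an already-struck quadword:

```python
def pd(n: int, q: int) -> float:
    """Probability that the first double bit error occurs on the n-th strike,
    for a cache of q quadwords with strikes landing uniformly at random."""
    if n < 2:
        return 0.0
    p = 1.0
    # Strikes 2 .. n-1 must each land on a quadword that has no error yet.
    for k in range(1, n - 1):
        p *= (q - k) / q
    # Strike n lands on one of the n-1 quadwords that already hold an error.
    return p * (n - 1) / q


q = 8
for n in range(1, 5):
    print(n, pd(n, q))
print("sum:", sum(pd(n, q) for n in range(1, q + 2)))
```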

MTTF calculations: Equation
  • M = mean # of single bit errors to get a double bit error = expected value of the random variable with Pd[n] as its probability distribution function
  • M can be easily generated using a computer program
  • MTTF (double bit error) = M * MTTF (single bit error)
  • For a 32 megabyte cache & FIT/bit = 0.001 [Normand 1996, Tosaka 1996]
    • $\mathrm{MTTF}(\text{double bit error}) = 2567 \times \frac{1}{\text{Cache FIT}} = 2567 \times \frac{10^{9}}{0.001 \times 2^{22} \times 72 \times 24 \times 365} = 970\ \text{years}$
  • Saleh, et al.’s, 1990 closed form equation
    • $\mathrm{MTTF}(\text{double bit error}) = \frac{1}{72 f} \sqrt{\frac{\pi}{2Q}} = 970\ \text{years}$, f = FIT/bit

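
The slides say M is generated by a computer program but do not show it. The following is one possible sketch, not the authors' program: it accumulates n * Pd[n] term by term and stops once the remaining probability mass is negligible (the cutoff `tol` is an assumption of this sketch).

```python
HOURS_PER_YEAR = 24 * 365


def mean_strikes_to_double_bit_error(q: int, tol: float = 1e-18) -> float:
    """Expected value M of the distribution Pd[n] for a cache of q quadwords."""
    mean = 0.0
    fresh = 1.0  # probability that the first n-1 strikes hit distinct quadwords
    n = 2
    while fresh > tol and n <= q + 1:
        p_n = fresh * (n - 1) / q       # Pd[n]
        mean += n * p_n
        fresh *= (q - (n - 1)) / q      # strike n also lands on a fresh quadword
        n += 1
    return mean


q = 2 ** 22                              # 32 megabytes / 8 bytes per quadword
fit_per_bit = 0.001
cache_fit = fit_per_bit * q * 72         # 72 stored bits per quadword
single_bit_mttf_years = 1e9 / cache_fit / HOURS_PER_YEAR

m = mean_strikes_to_double_bit_error(q)
print(f"M = {m:.1f}")
print(f"MTTF (double bit error) = {m * single_bit_mttf_years:.0f} years")
```

The loop runs for only a few tens of thousands of iterations at this cache size, because the probability of avoiding a repeat quadword decays quickly with n.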


Temporal Double Bit MTTF Variations with Cache Size
  • FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
    • higher at higher altitudes (e.g., 3-5x at 1.5km in Denver)
  • Temporal double bit errors make a very small contribution to the DUE rate
    • compared to a goal of 10 years DUE MTTF
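
The original plot for this slide is not part of the transcript. As a rough stand-in, the closed form MTTF = [1 / (72 * f)] * sqrt(π / 2Q) can be swept over cache sizes for the quoted FIT/bit range. The cache sizes chosen below are illustrative and are not taken from the slide.

```python
import math

HOURS_PER_YEAR = 24 * 365
MB = 2 ** 20


def double_bit_mttf_years(cache_bytes: int, fit_per_bit: float) -> float:
    """Closed-form temporal double bit MTTF without scrubbing."""
    q = cache_bytes // 8                  # quadwords of 64 data bits
    f = fit_per_bit * 1e-9                # upsets per bit per hour
    hours = (1 / (72 * f)) * math.sqrt(math.pi / (2 * q))
    return hours / HOURS_PER_YEAR


for size_mb in (1, 32, 1024, 16 * 1024):
    worst = double_bit_mttf_years(size_mb * MB, 0.01)
    best = double_bit_mttf_years(size_mb * MB, 0.001)
    print(f"{size_mb:>6} MB: {worst:10.1f} to {best:10.1f} years")
```

Since the MTTF falls with the square root of the cache size, only very large caches approach the 10-year goal in this model.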

MTTF with Scrubbing

[Figure: timeline divided into consecutive scrubbing intervals, each of length I]

  • I = scrubbing interval, scrub at the end of each interval I
  • N = # scrubbing intervals to reach MTTF = expected value of the random variable with probability distribution function $(1 - p_f)^{N} \, p_f$, where $p_f$ = probability of a temporal double bit error at the end of an interval

Assuming 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), scrub once a year (I = 1 year)

  • MTTF(double bit error) = N * I = 2281 * 1 = 2281 years
  • Saleh, et al. 1990 closed form equation
    • $\frac{2}{Q \times I \times (f \times 72)^{2}} = 2341\ \text{years}$, f = FIT/bit
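
One way to read the closed form is as the mean of the geometric distribution above with a small per-interval failure probability. The sketch below assumes that pf is approximately Q * (r * I)^2 / 2, where r = 72 * f is the upset rate of one quadword; this approximation is an assumption of the sketch, not something stated on the slide, but it reduces algebraically to 2 / [Q * I * (f * 72)^2].

```python
HOURS_PER_YEAR = 24 * 365


def scrubbed_mttf_years(cache_bytes: int, fit_per_bit: float,
                        interval_years: float) -> float:
    """Approximate temporal double bit MTTF when scrubbing every interval."""
    q = cache_bytes // 8                       # quadwords
    r = 72 * fit_per_bit * 1e-9                # upsets per quadword per hour
    interval_hours = interval_years * HOURS_PER_YEAR
    # Assumed small-probability approximation: chance that some quadword
    # collects two upsets within a single scrubbing interval.
    p_f = q * (r * interval_hours) ** 2 / 2
    expected_intervals = 1 / p_f               # mean of the geometric distribution
    return expected_intervals * interval_years


GB = 2 ** 30
print(f"{scrubbed_mttf_years(16 * GB, 0.001, 1):.0f} years")        # yearly scrub
print(f"{scrubbed_mttf_years(16 * GB, 0.001, 1 / 365):.0f} years")  # daily scrub
```

In this approximation the MTTF is inversely proportional to the scrubbing interval, so halving the interval doubles the MTTF.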

Impact of Scrubbing on Temporal Double Bit MTTF

[Figure: 16 Gigabyte Cache]

  • FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
    • higher at higher altitudes (e.g., 3-5x at 1.5km in Denver)
  • For 16 gigabytes of cache, scrubbing can help
    • compared to a DUE MTTF goal of 10 years

Summary

  • SECDED ECC (single error correction, double error detection)
    • commonly used in on-chip caches
    • interleaving converts spatial multi-bit errors to multiple single bit errors
  • Scrubbing
    • periodically read cache blocks and correct all single bit errors
    • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
  • Our conclusion, given a detected-error target of 10 years MTTF
    • Scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

Raw soft error rate: 0.001 – 0.010 FIT/bit
  • Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G.A. Woffinden, and S.A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” Symposium on VLSI Technology, Digest of Technical Papers, 1996.
  • Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.