Strategies for Fault Detection in Electronic Design Automation

CS137: Electronic Design Automation Day 9: October 17, 2005 Fault Detection

Today • Faults in Logic • Error Detection Schemes • Optimization Problem

Problem • Gates, wires, memories: • built out of physical media • may fail

Device Physics • Represent a 1 or 0 with charge • On a gate, in a memory • Charge may be disrupted • α-particle (other ionizing particles) • Ground bounce • Noise coupling • Tunneling • Thermal noise • Behavior of individual electrons is statistical

DRAMs • Small cells • Store charge dynamically on capacitor • Store about 50,000 electrons • Must be refreshed • Data leaks away through parasitic resistance • α-particle can be 1,000,000 carriers?

System Reliability • Each device fails with probability $P_{fail}$ • Have $N$ components in system • All must work for device to work • $P_{sys} = (1-P_{fail})^N$

System Reliability • If $N P_{fail} \ll 1$ • $N P_{fail}$ dominates higher order terms…

System Reliability • $P_{sysfail} \approx N \times P_{fail}$
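The first-order approximation can be checked numerically. The sketch below compares the exact system failure probability $1-(1-P_{fail})^N$ with $N P_{fail}$; the function names are illustrative, and the two parameter pairs are the ones used in the later examples in these slides.

```python
import math

def p_sys_fail_exact(n: float, p_fail: float) -> float:
    """Exact probability that at least one of n independent components fails."""
    # log1p/expm1 keep precision when p_fail is far below machine epsilon.
    return -math.expm1(n * math.log1p(-p_fail))

def p_sys_fail_approx(n: float, p_fail: float) -> float:
    """First-order approximation, only valid when n * p_fail << 1."""
    return n * p_fail

if __name__ == "__main__":
    for n, p in [(1e10, 1e-20), (3e10, 1e-11)]:
        print(n, p, p_sys_fail_exact(n, p), p_sys_fail_approx(n, p))
```

For the first pair the two values agree closely; for the second, where $N P_{fail} = 0.3$, the approximation visibly overestimates.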

Modern System • 100 million to 1 billion transistors • Not to mention wiring… • > GHz = > 1 billion transitions / sec. • $N = 10^{18}$ per second…

As we scale? • $N$ increases • Charge/gate decreases • Fewer electrons • Higher probability they wander • Greater variability in behavior • Voltage levels decrease • Smaller barriers • Greater variability in device parameters • → $P_{fail}$ increases

Exacerbated at Nanoscale • Small numbers of dopants (10s) • High variability • Small numbers of electrons (10-1000s?) • High variability • Highly susceptible to noise • Small number of molecules • May break, decay…

What do we do about it? • Tolerate faulty components • Detect faults • Not do anything bad • Try it again • If the error is statistically unlikely, there is a high likelihood it won't recur • …Focus on detection…

Detect Faults • Key Idea: redundancy • Include enough redundancy in computation • Can tell that an error occurred

What kind of redundancy can we use? • Multiple copies of logic • Compute something about result • Parity on number of outputs • Count of number of 1’s in output

Error Detection

What do we protect against? • Any n errors • Worst-case selection of errors

Single Error Detection • If $P_{fail}$ small: • No error: $(1-P_{fail})^N \approx 1 - N P_{fail}$ • One error: $N P_{fail} (1-P_{fail})^{N-1} \approx N P_{fail}$ • Two errors: $[N(N-1)/2] \, P_{fail}^2 \, (1-P_{fail})^{N-2}$ • Probability of an error going undetected • For: $N P_{fail} \ll 1$ • Goes from $N P_{fail}$ • to $\approx (N P_{fail})^2$
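The step from the binomial terms to the $(N P_{fail})^2$ estimate can be spelled out as follows. This is a sketch that assumes independent failures with identical probability, which the slides appear to use implicitly.

```latex
\begin{align*}
P(k \text{ errors}) &= \binom{N}{k} P_{fail}^{k} (1-P_{fail})^{N-k} \\
P(\text{undetected}) &\approx P(2 \text{ errors})
  = \frac{N(N-1)}{2} P_{fail}^{2} (1-P_{fail})^{N-2}
  \approx \frac{(N P_{fail})^{2}}{2}
  \qquad (N \gg 1,\ N P_{fail} \ll 1)
\end{align*}
```

The slides drop the factor of 1/2 and quote $(N P_{fail})^2$ as the order of magnitude.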

Single Error Detection (Example) • Probability of an error going undetected • For: $N P_{fail} \ll 1$ • Goes from $N P_{fail}$ • to $\approx (N P_{fail})^2$ • $N = 10^{10}$, $P_{fail} = 10^{-20}$ • $N P_{fail} = 10^{-10} \ll 1$ → ~$10^{10}$ cycles MTTF • Mean Time To Failure • At 1 GHz = 10 s • $(N P_{fail})^2 = 10^{-20}$ → $10^{20}$ cycles MTTUF • Mean Time To Undetected Fault • $10^{11}$ s ≈ 3000 years

Detection Overhead • …but: Correction and detection circuitry increase circuit size. • $N_{detect} > N_{logic}$ • $N_{detect} = c N_{logic}$ • Probability of an error going undetected • Goes from $N P_{fail}$ • to $\approx (c N P_{fail})^2$ • To come out ahead, want: $c^2 \ll 1/(N P_{fail})$ • $c = 3$, $N = 10^{10}$, $P_{fail} = 10^{-20}$ • $(c N P_{fail})^2 = 9 \times 10^{-20}$ → ~$10^{19}$ cycles MTTUF • $10^{10}$ s ≈ 300 years

Detection Overhead • …but: Correction and detection circuitry increase circuit size. • $N_{detect} > N_{logic}$ • $N_{detect} = c N_{logic}$ • Probability of an error going undetected • Goes from $N P_{fail}$ • to $\approx (c N P_{fail})^2$ • To come out ahead, want: $c^2 \ll 1/(N P_{fail})$ • $c = 3$, $N = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$ • $N P_{fail} = 0.3$ • $(c N P_{fail})^2 = 0.81$ → worse • Neither workable!
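The arithmetic in the last three slides can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, the 1 GHz clock is the one quoted in the single-error example, and the undetected-error probability is the slides' order-of-magnitude estimate rather than an exact binomial tail.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def undetected_probability(c: float, n_logic: float, p_fail: float) -> float:
    """Order-of-magnitude per-cycle probability of a double error
    among the c * n_logic devices of the protected design."""
    return (c * n_logic * p_fail) ** 2

def mean_time_seconds(p_per_cycle: float, clock_hz: float = 1e9) -> float:
    """Mean time until an event with the given per-cycle probability."""
    return 1.0 / p_per_cycle / clock_hz

if __name__ == "__main__":
    # (c, N, P_fail) triples taken from the slides; c = 1 is the overhead-free case.
    for c, n, p in [(1, 1e10, 1e-20), (3, 1e10, 1e-20), (3, 3e10, 1e-11)]:
        unprotected = n * p
        protected = undetected_probability(c, n, p)
        years = mean_time_seconds(protected) / SECONDS_PER_YEAR
        print(f"c={c} N={n:g} Pfail={p:g}: "
              f"{unprotected:.3g} -> {protected:.3g}, MTTUF ~ {years:.3g} years")
```

The last case shows why detection alone does not help when $N P_{fail}$ is not small: the protected design fails undetected more often than the unprotected one fails at all.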

Reliability Tuning • Want $N P_{fail}$ small • Want: $(c N P_{fail})^2$ very small • Idea: • Guard subsystems independently • Make $N_s$ suitably small • Smaller probability there is a double error localized in this small subsystem • That is: as long as compartmentalization guarantees very small $(c N_s P_{fail})^2$: • can reduce to single detect case.

Guarding Subsystems

Composing Subsystems • $P_{sysundetect} = (N_{sys}/N_s) \, P_{subundetect}$ • $P_{subundetect} = (c N_s P_{fail})^2$ • $P_{sysundetect} = (N_{sys}/N_s) \, (c N_s P_{fail})^2$ • $P_{sysundetect} = N_{sys} \times N_s \times (c P_{fail})^2$ • Extremes: • $N_s = N_{sys}$: no benefit • $N_s = 1$: maximum benefit, a factor of $N_{sys}$ [in practice $c = f(N_s)$]

Composing Subsystems • $P_{sysundetect} = N_{sys} \times N_s \times (c P_{fail})^2$ • Example: $c = 3$, $N_{sys} = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$ • $N_s = 10^3$ • $3 \times 10^{10} \times 10^3 \times (3 \times 10^{-11})^2$ • $3^3 \times 10^{-9} \approx 3 \times 10^{-8}$ ($\ll 0.81$) • Still < 1 s MTTUF …
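A short sketch of the composition formula, useful for trying other subsystem sizes. The function name is illustrative, and $c$ is held constant here even though the slides note that in practice $c = f(N_s)$; the 1 GHz clock is carried over from the earlier example.

```python
def p_sys_undetect(n_sys: float, n_s: float, c: float, p_fail: float) -> float:
    """Per-cycle probability of an undetected error when the system is split
    into n_sys / n_s independently guarded subsystems of n_s devices each."""
    n_subsystems = n_sys / n_s
    p_sub_undetect = (c * n_s * p_fail) ** 2
    return n_subsystems * p_sub_undetect

if __name__ == "__main__":
    n_sys, c, p_fail = 3e10, 3, 1e-11      # values from the slide example
    for n_s in (n_sys, 1e6, 1e3, 1):
        p = p_sys_undetect(n_sys, n_s, c, p_fail)
        mttuf_s = 1.0 / p / 1e9            # assumes a 1 GHz clock
        print(f"Ns={n_s:g}: P={p:.3g} per cycle, MTTUF ~ {mttuf_s:.3g} s")
```

With $N_s = 10^3$ this gives about $2.7 \times 10^{-8}$ per cycle, matching the slide's $3^3 \times 10^{-9}$, and setting $N_s = N_{sys}$ recovers the unguarded 0.81.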

Problem • Motivates problem: • Generate logic capable of detecting any single error

Terminology • Fault-secure: system never produces incorrect code word • Either produces correct result • Or detects the error • Self-testing: for every fault, there is some input that produces an incorrect code word • That detects the error

Terminology • Totally Self Checking: system is both fault-secure and self-testing.

Duplication Detects any single fault (even in checker)
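Duplicate-and-compare can be illustrated with a toy fault-injection experiment. The sketch below is an assumption-laden example, not the circuit from the slide: it uses a hypothetical 1-bit full adder, injects single stuck-at faults into one copy only, and models the comparator as fault-free (the slide's claim about faults in the checker is not modelled).

```python
from itertools import product

NODES = ("t", "s", "out0", "out1")

def full_adder(a, b, c, fault=None):
    """Toy 1-bit full adder; fault is (node_name, stuck_value) or None."""
    def node(name, value):
        # Override the node's value if it is the stuck-at fault site.
        if fault is not None and fault[0] == name:
            return fault[1]
        return value
    t = node("t", a & b)
    s = node("s", a ^ b)
    out0 = node("out0", s ^ c)          # sum
    out1 = node("out1", t | (s & c))    # carry
    return out0, out1

def duplicate_and_compare(a, b, c, fault=None):
    """Run a possibly faulty copy and a clean copy; flag any mismatch."""
    primary = full_adder(a, b, c, fault)
    shadow = full_adder(a, b, c)
    return primary, primary != shadow

def check_all_single_faults():
    """Return (fault_secure, self_testing) over all single stuck-at faults."""
    fault_secure = True
    self_testing = True
    for fault in product(NODES, (0, 1)):
        detected_somewhere = False
        for a, b, c in product((0, 1), repeat=3):
            golden = full_adder(a, b, c)
            result, error = duplicate_and_compare(a, b, c, fault)
            if result != golden and not error:
                fault_secure = False    # wrong output slipped through
            detected_somewhere |= error
        self_testing &= detected_somewhere
    return fault_secure, self_testing
```

Calling `check_all_single_faults()` exercises both properties from the Terminology slides for this toy circuit.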

Duplication • $N$ original gates • Duplicate: $+N$ • $O$ outputs • $O$ xors • $(O/2) \times 2 \times 2$ ors • Total $3O$ gates • Total: $2N + 3O$ • $O < N$ • $2 < c < 5$

Duplication • Total: $2N + 3O$ • $O < N$ • Rent's Rule: $O \sim k N^p$ • $p < 1$ • Total: $2N + 3kN^p$ • $c(N) = 2 + 3k/N^{(1-p)}$ • $N$ small → 5 • $N$ large → 2
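The trend in $c(N)$ is easy to tabulate. In this sketch the Rent parameters $k = 1$ and $p = 0.6$ are illustrative assumptions, not values from the lecture, and the function name is hypothetical.

```python
def duplication_overhead(n: float, k: float = 1.0, p: float = 0.6) -> float:
    """Overhead factor c(N) = (2N + 3O) / N with O = k * N**p (Rent's Rule)."""
    outputs = k * n ** p
    return (2 * n + 3 * outputs) / n

if __name__ == "__main__":
    for n in (1, 10, 1e3, 1e6, 1e9):
        print(f"N={n:g}: c={duplication_overhead(n):.4f}")
```

With these assumed parameters the overhead starts at 5 for a single gate and approaches 2 as the checker cost is amortized over a larger block.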

Duplication with PLA Logic Duplicate

PLA Duplication • $N$ product terms in original • $N$ in duplicate • $2O$ product terms for matching • $O \le N$ • $2 < c < 4$

Can we do better? • Seems like overkill to compute twice?

Idea • Encode so outputs have some checkable property • E.g. parity

Will this work? Original Logic Extra cubes for parity parity

Problem • Single fault may produce multiple output errors

How Fix? • How do we fix this?

No Logic Sharing • No sharing • Single fault affects single output
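The failure mode and the fix can be shown on a small example. The two-output circuit below is hypothetical: both outputs depend on the node `a & b`, and the fault is modelled as a bit flip on that node. The parity predictor is assumed to be separate, fault-free logic.

```python
from itertools import product

def outputs_shared(a, b, c, d, flip_shared=False):
    """Both outputs reuse one copy of (a & b); a fault there hits both."""
    shared = (a & b) ^ int(flip_shared)
    return shared ^ c, shared ^ d

def outputs_unshared(a, b, c, d, flip_copy0=False):
    """Each output has a private copy of (a & b); a fault hits one output."""
    copy0 = (a & b) ^ int(flip_copy0)
    copy1 = a & b
    return copy0 ^ c, copy1 ^ d

def predicted_parity(a, b, c, d):
    """Separate parity logic: (ab ^ c) ^ (ab ^ d) simplifies to c ^ d."""
    return c ^ d

def parity_error(outs, parity):
    """True when the observed output parity disagrees with the prediction."""
    return (outs[0] ^ outs[1]) != parity

if __name__ == "__main__":
    for a, b, c, d in product((0, 1), repeat=4):
        p = predicted_parity(a, b, c, d)
        shared_seen = parity_error(outputs_shared(a, b, c, d, True), p)
        unshared_seen = parity_error(outputs_unshared(a, b, c, d, True), p)
        print(a, b, c, d, "shared detected:", shared_seen,
              "unshared detected:", unshared_seen)
```

With sharing, the fault flips both outputs and the parity check never fires; without sharing, the same fault flips one output and is always caught.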

Parity Checking • To check parity • Need xor tree on outputs/parity • $[(O+1)/2] \times 2 \times 2 = 2(O+1)$ xors • For PLA • xor would blow up • Wrap multiple times • 2 product terms per xor • $4O$ product terms

nanoPLA Wrapped xor Note: two planes here just for buffering/inversion

Better or Worse than Dual? (not including checking)

Can we allow sharing? • When?

Multiple Parity Groups • Can share with different parity groups • Common error flagged in both groups
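A minimal sketch of the idea, using a hypothetical three-output circuit in which outputs 0 and 1 share the node `a & b` and are therefore placed in different parity groups. The grouping, the circuit, and the bit-flip fault model are all assumptions for illustration; the expected parities stand in for independent, fault-free parity-prediction logic.

```python
from functools import reduce
from itertools import product
from operator import xor

# Outputs 0 and 1 share logic, so they must not sit in the same group.
GROUPS = ((0, 2), (1,))

def outputs(a, b, c, d, flip_shared=False):
    """Three outputs; the first two both depend on the shared node (a & b)."""
    shared = (a & b) ^ int(flip_shared)
    return shared ^ c, shared ^ d, c & d

def group_parities(outs):
    """Parity of the outputs within each group."""
    return tuple(reduce(xor, (outs[i] for i in group)) for group in GROUPS)

def flagged_groups(a, b, c, d):
    """Which groups see a parity mismatch when the shared node is faulty."""
    expected = group_parities(outputs(a, b, c, d))
    observed = group_parities(outputs(a, b, c, d, flip_shared=True))
    return tuple(e != o for e, o in zip(expected, observed))

if __name__ == "__main__":
    for bits in product((0, 1), repeat=4):
        print(bits, flagged_groups(*bits))
```

Each group sees exactly one flipped output, so the common error is flagged in both groups, whereas a single parity bit over all three outputs would see two flips and miss it.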

Multi-Parity Group Compare (AMD) (not including checking)

Best Results from Winter 2004 CS137 (not including checking)

Better or Worse than Dual? • Typical results from Mitra [ITC2002] • Multi-level gate mapping to LSI std. cell library (parity here includes multiple parity)

Admin • Assignment #2 due Friday • Wednesday reading online • Friday reading handout

Big Ideas • Low-level physics imperfect • Statistical, noisy • Larger number of devices → greater likelihood of faults • Redundancy • Self-checking circuits
