1 / 40

Assertion-Based Microarchitecture Design for Improved Fault Tolerance

Assertion-Based Microarchitecture Design for Improved Fault Tolerance. Vimal K. Reddy Ahmed S. Al-Zawawi, Eric Rotenberg. Center for Embedded Systems Research (CESR) Department of Electrical & Computer Engineering North Carolina State University. Motivation. Technology scaling

lois-nieves
Download Presentation

Assertion-Based Microarchitecture Design for Improved Fault Tolerance

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Assertion-Based Microarchitecture Design for Improved Fault Tolerance Vimal K. Reddy Ahmed S. Al-Zawawi, Eric Rotenberg Center for Embedded Systems Research (CESR) Department of Electrical & Computer Engineering North Carolina State University

  2. Motivation • Technology scaling • Smaller, faster transistors • Transistors more susceptible to transient faults • How to build reliable processors using unreliable transistors?

  3. Motivation No fault tolerance, correct operation No fault tolerance, erroneous operation Inefficient fault tolerance, correct operation Due to high overheads of inefficient fault tolerance Performance Susceptible to transient faults Technology Scaling Need efficient fault tolerance solutions

  4. Redundant Multithreading (RMT) • Duplicate a program and compare outcomes to detect transient faults • Positives • Simple and straightforward • General solution • Complete fault coverage • Negatives • High overhead

  5. RMT using an extra processor Program Processor Processor Power Cost ==

  6. RMT using simultaneous multithreading PROGRAM Processor Pipeline stages Performance Power == ==

  7. Alternate solution: Targeted fault checks • Add regimen of fault checks • Specific to logic block • Arbitrary latches: Robust latch design • Arbitrary gates: Self-checking logic • FSM: Self-checking FSM designs • ALU: Self-checking ALUs, RESO • Storage and buses: Parity, ECC • Positives • No overhead of duplicating the program • Negatives • Not general, i.e., many types of checks needed

  8. Our contribution: Microarchitectural Assertions • Novel class of fault checks • Key Idea: Confirm µarch. “truths” within processor pipeline • Catch-all checks for microarchitecture mechanisms • Positives • Broad coverage (catch-all checks) • Very low-overhead solution (no redundant execution)

  9. Processor == == Spectrum of fault checks Architecture Level (RMT) Processor

  10. parity/ECC robust flip-flop == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == Processor self-check ALU self-check FSM == == == == Spectrum of fault checks Architecture Level Logic Circuit Level (RMT) Processor

  11. Processor == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == Processor == == == == == Spectrum of fault checks Architecture Level Logic Circuit Level (RMT) Processor

  12. Processor Processor == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == μarch assert μarch assert Processor == == == == == == == Spectrum of fault checks Microarchitecture Level Architecture Level Logic Circuit Level (RMT)

  13. Examples of Microarchitectural Assertions • Register Name Authentication (RNA) • Aims to detect faults in renaming unit • Asserts consistencies in renaming unit • Exploits inherent redundancy in renaming structures • Asserts expected physical register states • Timestamp-based Assertion Checking (TAC) • Aims to detect faults in issue unit • Asserts sequential order among data dependent instructions • Uses timestamps

  14. p4 p4 p3 p3 p2 p2 r1 r2 r3 r1 r2 r3 p1 Freelist Program I1 (r1) I1 (r1) p1 I2 (r1) Rename Logic I3 (r1) I1 (r1:p1) Rename Map Table p1 I1 (r1:p1) old_r1:p5 p5 p6 p7 Reorder Buffer I1 (r1:p1) old_r1:p5 p5 p1 Architecture Map Table I1 retires p6 p7

  15. r1 r2 r3 r1 r2 r3 p5 p4 p3 p2 Freelist Program I2 (r1) p2 I2 (r1) Rename Logic I3 (r1) I2 (r1:p2) Rename Map Table p1 p2 I2 (r1:p2) old_r1:p1 p5 p6 p7 Reorder Buffer I2 (r1:p2) old_r1:p1 p1 Architecture Map Table p6 p7

  16. r1 r2 r3 r1 r2 r3 p5 p4 p3 Freelist Program Rename Logic I3 (r1) Rename Map Table p1 p2 p5 p6 p7 Reorder Buffer Observation r1’s old == r1’s arch mapping mapping I2 (r1:p2) old_r1:p1 p1 Architecture Map Table p6 p7

  17. r1 r2 r3 r1 r2 r3 p5 p4 p3 Freelist Program Rename Logic I3 (r1) Rename Map Table p1 p6 p2 p5 p6 p7 Reorder Buffer I2 (r1:p2) old_r1:p1 RNA prev mapping check old_r1 == arch_r1? p1 Architecture Map Table p1 == p1? p6 p7

  18. r1 r2 r3 r1 r2 r3 p5 p4 Freelist Program p3 I3 (r1) Rename Logic I3 (r1:p3) Rename Map Table p1 p6 p3 I3 (r1:p3) old_r1:p6 p5 p6 p7 Reorder Buffer I3 (r1:p3) old_r1:p6 I2 (r1:p2) old_r1:p1 p1 Architecture Map Table p6 p7

  19. p5 p5 p4 p4 r1 r2 r3 r1 r2 r3 Freelist Program Rename Logic Rename Map Table p1 p6 p3 p5 p6 p7 Reorder Buffer I3 (r1:p3) old_r1:p6 I2 (r1:p2) old_r1:p1 RNA prev mapping check old_r1 == arch_r1? p1 p2 Architecture Map Table p1 == p1? p6 p7 I2 retires

  20. r1 r2 r3 r1 r2 r3 p1 p5 p4 Freelist Program Rename Logic Rename Map Table p1 p6 p3 p5 p6 p7 Reorder Buffer I3 (r1:p3) old_r1:p6 RNA prev mapping check old_r1 == arch_r1? p2 Architecture Map Table p6 == p2? p6 p7 FAULT DETECTED At I3 retirement

  21. r1 r2 r3 p4 p3 p2 p1 Program Freelist I1 (r1) I1 (r1) p1 I2 (r1) Rename Logic I1 (r1:p6) Rename Map Table p5 p6 I1 (r1:p6 old_r1:p5) r1 p6 r2 p7 r3 Reorder Buffer I1 (r1:p6) old_r1:p5 p5 Architecture Map Table p6 p7

  22. r1 r2 r3 p4 p3 p2 Program Freelist I2 (r1) p2 I2 (r1) Rename Logic I2 (r1:p2) Rename Map Table p5 p2 p6 I2 (r1:p2 old_r1:p6) r1 p6 r2 p7 r3 Reorder Buffer I2 (r1:p2) old_r1:p6 I1 (r1:p6) old_r1:p5 p5 Architecture Map Table p6 p7

  23. p4 p4 p3 p3 r1 r2 r3 Program Freelist Rename Logic Rename Map Table p5 p2 p6 r1 p6 r2 p7 r3 Reorder Buffer I2 (r1:p2) old_r1:p6 RNA prev mapping check I1 (r1:p6) old_r1:p5 old_r1 == arch_r1? p5 p6 p5==p5? Architecture Map Table p6 p7 I1 retires

  24. p4 p4 p3 p3 r1 r2 r3 Program Freelist Rename Logic Rename Map Table p5 p2 p6 r1 p6 r2 p7 r3 Reorder Buffer I2 (r1:p2) old_r1:p6 RNA prev mapping check old_r1 == arch_r1? p6 p6==p6? Architecture Map Table p6 FAULT NOT DETECTED p7 At I2 retirement

  25. p4 p3 p2 I1 I1 I1 I1 I1 (r1:p1 ) (r1:p1 ) (r1:p1 ) (r1 ) (r1 ) I3 (r1 ) p1 p1 p2 p2 p3 p3 p4 p4 p1 Program Freelist p1 I2 (r2 ) Rename Logic Rename Map Table p1 r1 r2 r3 Issue Queue issue Free bit unset FU FU FU FU writeback Rdy Free Register File 0 0 1 Rdy/Free Bit Array 0 1 0 1 0 1

  26. I2 I2 I2 I2 I1 I2 (r2 ) (r2 ) (r2:p2 ) (r1:p1 ) (r2:p2 ) (r2:p2 ) I3 (r1 ) p1 p1 p2 p2 p3 p3 p4 p4 p4 p3 p2 Program Freelist p2 Rename Logic Rename Map Table p1 r1 p2 r2 r3 Issue Queue issue FU FU FU FU writeback Rdy Free Register File 0 0 Rdy/Free Bit Array 0 0 1 0 1 0 1

  27. p4 p3 I2 I1 (r2:p2 ) (r1:p1 ) I3 (r1 ) p1 p1 p2 p2 p3 p3 p4 p4 Program Freelist Rename Logic Observation Ready bit == 0 (not executed) Free bit == 0 (not in freelist) Rename Map Table p1 r1 p2 r2 r3 Issue Queue issue Ready bit set FU FU FU FU writeback Rdy Free XXX Register File 0 1 0 XXX Rdy/Free Bit Array 0 1 0 0 1 0 1

  28. I3 I3 I3 I3 I3 (r1 ) (r1 ) (r1:p2) (r1:p2) (r1:p2) p1 p1 p2 p2 p3 p3 p4 p4 p4 p3 Program Freelist p3 Rename Logic RNA writeback check Free of p2 == 0? Ready of p2 == 0? Rename Map Table p2 p1 r1 p2 r2 FAULT DETECTED r3 Issue Queue issue FU FU FU FU writeback Rdy Free XXX Register File 1 0 XXX Rdy/Free Bit Array 1 0 0 0 1 0 1

  29. RNA summary • RNA previous mapping check • Detects faults in renaming structures • Rename map table, Architecture map table • Branch checkpoint tables • Active list (renaming state) • RNA writeback state check • Detects faults in renaming logic and freelist

  30. Source renaming • Pure source renaming faults undetected • E.g., fault in source renaming logic • Researching solutions similar to RNA • However, faults causing deadlock are detectable • Faulty source name causing cyclic dependency • Faulty source name points to unpopped freelist entry • Other faults that cause phantom producers • Use watchdog timer to detect deadlocks

  31. Timestamp-based Assertion Checking (TAC) • Confirm data dependent instructions issued sequentially • Assign timestamps to instructions at issue • At retirement, confirm instruction timestamp greater than producers’ timestamps • TAC check Instruction timestamp >= Producer’s timestamp + latency • Faults on checking logic can only cause false alarms

  32. I1 (r1) I2 (r3) I3 (r1, r1) I5 (r3, r3) I4 (r1, r1) I6 (r3, r3) r1 r2 r3 Scheduler Dataflow Issue Queue I6 I5 I4 I3 I2 I1 issue FU FU 3 5 6 0 2 4 1 Timestamp Counter writeback I6 I5 I4 I3 I2 I1 Reorder Buffer 5 2 4 0 1 3 TS 1 1 1 1 1 Lat. 1 TAC check Instr TS >= Prod TS + Prod Lat T retire H TS Lat. 0 0 Architectural State 0 0 0 0

  33. I1 (r1) I2 (r3) I3 (r1, r1) I5 (r3, r3) I4 (r1, r1) I6 (r3, r3) 4 0 1 3 1 1 1 1 H H H H H H H r1 I3: 4 >= 1 + 1 I2: 0 >= 0 + 0 I1: 1 >= 0 + 0 4 1 1 1 r2 r3 0 1 Scheduler Dataflow Issue Queue issue FU FU writeback I6 I5 I4 I3 I2 I1 Reorder Buffer TAC check 5 2 TS 1 1 Lat. retire T TS Lat. 0 0 Architectural State I4: 3 >= 4 + 1 0 0 FAULT DETECTED 0 0

  34. Experiments • Randomly injected faults in timing simulator • 1000 faults per benchmark • Faults target issue and rename state • Simulation ends 1 million cycles following fault injection • Observations • Fault detected by an assert (Assert) or not (Undet) • Fault corrupts architectural state (SDC) or not (Masked) • Possible outcomes • Assert+SDC • Undet+SDC • Assert+Masked • Undet+Masked

  35. TAC - Fault Injection Experiments • Type of faults injected • Ready bits prematurely set • Speculatively issued cache-missing load dependents not reissued

  36. TAC - Fault Injection Experiments 80% of faults cause SDC 20% of faults are masked All SDC faults are detected Zero undetected SDC faults

  37. RNA - Fault Injection Experiments • Faults injected • Flip bits of an entry in architecture map table • Flip bits of an entry in rename map table • Flip bits of an entry in freelist • Flip destination register bits at dispatch • Watchdog timer included to detect deadlocks • Additional possible outcomes • Assert+Wdog : RNA detected a future deadlock • Undet+Wdog : Deadlock undetected by RNA (possibly would have been caught by RNA in future)

  38. RNA - Fault Injection Experiments 68% SDC faults 24% deadlocks 8% masked faults 88% of SDC faults detected by RNA 12% of SDC faults undetected 25% of deadlocks detected by RNA beforehand 75% of deadlocks not detected by RNA (but detected by watchdog)

  39. RNA - Fault outcome distribution “dest” faults cause deadlocks because of phantom producers Deadlock blocks retirement Thus, RNA check can’t complete “arch_map” faults cause most undetected SDC Long live register range System trap before RNA check can complete

  40. Processor Processor == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == μarch assert μarch assert Processor == == == == == == == Conclusions Microarchitecture Level Architecture Level Logic Circuit Level (RMT)

More Related