
Design and Simulation of an EM-Fault-Tolerant Processor with Micro-Rollback, Control-Flow Checking and ECC

This paper presents a design and simulation of an EM-fault-tolerant processor that incorporates micro-rollback, control-flow checking, and ECC to detect and classify computer system behavioral errors caused by EM-induced faults.


Presentation Transcript


  1. Design and Simulation of an EM-Fault-Tolerant Processor with Micro-Rollback, Control-Flow Checking and ECC. Franco Trovo, Shantanu Dutt & Hasan Arslan, Univ. of Illinois at Chicago

  2. Outline • Goals • Solution Adopted • Control Flow Checking • Hamming encoding on the buses • Instruction Micro rollback • Motorola 68040 and VHDL description • Simulation results • Conclusion

  3. Assumptions/Scenarios of Past FD/FT Work • Past Work on general fault detection: • Random single (sometimes double) faults • Deterministic faults • Types of faults: permanent, transient, intermittent; intermittent type not generally tackled • Past Work on EM-induced faults: • No how/why/what analysis and classification of computer failure due to EM interference

  4. Broad Goals of Our Work • Will determine and classify the following types of computer system behavioral errors (i.e., program errors) caused by different patterns, extents, durations and locations of EM-type faults: • Control flow errors -- incorrect sequence of instruction execution. Causes: address gen. error, memory faults, bus faults • Data errors. Causes: computation errors, memory & bus faults • Termination errors (hung processor & crashes). Causes: C.U. transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero, spurious interrupts • Note: Error types are NOT mutually exclusive • Provide recipes for FT and reliable operation

  5. In This Work • Will detect • Control flow errors -- incorrect sequence of instruction execution. Causes: address gen. error, memory faults, bus faults • Raw bus errors using ECC • Provide a FT mechanism using these detections for reliable operation


  7. FD/FT Solutions • Fault Detection: • Control flow checking (CFC) via concurrent error detection using a watchdog (WD) processor • Hamming ECC (2-error detecting) on data & address buses • Fault Tolerance: • Instruction micro rollback triggered by • Hamming ECC • WD-monitored CFC
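To illustrate how a Hamming code can serve as a 2-error detector, here is a minimal sketch using Hamming(7,4) in detection-only mode: any nonzero syndrome is reported as an error and no correction is attempted. The slides do not give the code length or bus width used in the actual design, so the 4-bit data width and the names `encode`, `syndrome` and `has_error` are assumptions made for illustration.

```python
# Hypothetical sketch: Hamming(7,4) used for detection only.
# The code parameters used on the 68040 buses are not given in the slides.

DATA_POSITIONS = (3, 5, 6, 7)    # 1-indexed codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # power-of-two positions hold parity bits


def syndrome(codeword):
    """XOR of the 1-indexed positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value into a 7-bit Hamming codeword (list of 0/1)."""
    word = [0] * 8                      # index 0 unused, so indices match positions
    for k, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> k) & 1
    s = syndrome(word[1:])              # syndrome of the data bits alone
    for p in PARITY_POSITIONS:
        word[p] = 1 if s & p else 0     # pick parity bits so the total syndrome is 0
    return word[1:]


def has_error(codeword):
    """Detection only: any nonzero syndrome is flagged."""
    return syndrome(codeword) != 0


code = encode(0b1011)
corrupted = list(code)
corrupted[2] ^= 1                       # flip position 3
corrupted[5] ^= 1                       # flip position 6
print(has_error(code), has_error(corrupted))
```

Flipping two distinct positions i and j changes the syndrome by i XOR j, which is never zero, so every double error is caught. Some triple errors (for example positions 1, 2 and 3) cancel out and go undetected.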

  8. General Structure of a System with a Watchdog [Figure: block diagram with a main processor and main memory on the data bus and address bus, plus a watchdog processor that performs various checks (CFC, address, etc.)]

  9. General Structure of a WD-Monitored System with On-Chip Cache [Figure: block diagram with labels CPU, Cache, WD, MM, data bus and address bus]

  10. Control Flow Checking [Mahmood, et al., IEEE TC’88] • Hybrid solution for detecting wrong block sequence execution • Starting from a program, it extracts a Control Flow Graph • Example program: Block A; If cond1 then Block B; if cond2 then Block D else Block E; Else Block C; End if; Block F • Each node is associated with a block of branch-free instructions + branch at end • Each edge is associated w/ a possible branch between two blocks [Figure: control flow graph with nodes A, B, C, D, E, F]
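The control flow graph for the example program can be written down as an adjacency map. The sketch below reads the edges off the pseudocode on the slide and checks an executed block sequence against them. The checker function and its name are illustrative assumptions, not the watchdog logic from the presentation.

```python
# Control flow graph for the example program (A; if cond1 then B, then D or E; else C; F).
# Edges are read off the slide's pseudocode; the checker is an illustrative assumption.
CFG = {
    "A": {"B", "C"},
    "B": {"D", "E"},
    "C": {"F"},
    "D": {"F"},
    "E": {"F"},
    "F": set(),
}


def first_illegal_transition(trace, cfg=CFG):
    """Return the first (src, dst) pair not allowed by the CFG, or None if all are legal."""
    for src, dst in zip(trace, trace[1:]):
        if dst not in cfg.get(src, set()):
            return (src, dst)
    return None


print(first_illegal_transition(["A", "B", "E", "F"]))   # legal path
print(first_illegal_transition(["A", "C", "D", "F"]))   # C -> D is not an edge
```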

  11. Control Flow Checking • Block: branch-free set of instructions • Signature: information added to a block in order to distinguish it from other blocks [Figure: block augmentation and signature insertion. An original block (a branch-free set of instructions ending in a branch) is shown next to its augmented form: BLOCK sign, jump-free set of instructions, JUMP sign 1 (sign of 1st branch), JUMP sign 2 (sign of 2nd branch), branch. The control flow graph with nodes A to F is shown alongside.]

  12. CFC Implemented State Diagram [Figure: watchdog state diagram drawn next to the augmented block layout (BLOCK sign, jump-free set of instructions, JUMP sign 1, JUMP sign 2, branch) and the control flow graph A to F. States: Reset, BeginBlock, GET1S, MiddleBlock, GET2S, Signature 1, Signature 2, Branch. Decisions: Header?; Header Sign Eq. Bra Signatures?; Computed Sign. Eq. Header Sign?; No Branch signs. Error outcomes: Error Wrong Bra; Error Wrong Jump or Faulted Signature; Error Wrong Computed Signature; Error Signature Expected.]
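The two signature comparisons named in the state diagram (header signature against the previous block's branch signatures, then computed signature against the header signature) can be sketched roughly as follows. The XOR signature function, the 16-bit word size, the example instruction words and all names here are assumptions; the slides do not say how signatures are computed.

```python
from functools import reduce


def block_signature(instructions):
    """Toy signature: XOR of 16-bit instruction words (an assumption, not the paper's function)."""
    return reduce(lambda a, b: a ^ b, instructions, 0) & 0xFFFF


def check_block(expected_signs, header_sign, instructions):
    """Mimic the two watchdog comparisons for one block.

    expected_signs: branch signatures stored at the end of the previous block,
                    or None right after reset when nothing is expected yet.
    Returns an error string, or None if both checks pass.
    """
    if expected_signs is not None and header_sign not in expected_signs:
        return "wrong jump or faulted signature"
    if block_signature(instructions) != header_sign:
        return "wrong computed signature"
    return None


block = [0x303C, 0x000F, 0xD040]            # hypothetical instruction words
sign = block_signature(block)
print(check_block({sign, 0x1234}, sign, block))
print(check_block({0x1234}, sign, block))
print(check_block({sign}, sign, [0x303C, 0x000E, 0xD040]))
```

The first call passes both checks, the second models a jump to a block that the previous block could not legally reach, and the third models an instruction word corrupted in transit.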

  13. Micro Rollback [Tamir, et al., IEEE TC’90] • Individual state registers (RAM based) • Register file, caches, main memory (DWB based)

  14. Support for Micro Rollback for Register File - example • MOVE 0000, D0 • ADD 000F, D0 • MOVE 0001, A3 (f) • SUB 0002, D0 • …

  15. Support for Micro Rollback for Register File - example • MOVE 0000, D0 • ADD 000F, D0 • MOVE 0001, A3 (f) • SUB 0002, D0 • Micro rollback 2 levels • … [Figure: six-entry buffer (value / register / valid bit): 0000 / D0 / 1; the other five entries are empty (XXXX / XX / 0)]

  16. Support for Micro Rollback for Register File - example • MOVE 0000, D0 • ADD 000F, D0 • MOVE 0001, A3 (f) • SUB 0002, D0 • Micro rollback 2 levels • … [Figure: six-entry buffer (value / register / valid bit): 0000 / D0 / 1; 000F / D0 / 1; the other four entries are empty (XXXX / XX / 0)]

  17. Support for Micro Rollback for Register File - example • MOVE 0000, D0 • ADD 000F, D0 • MOVE 0001, A3 (f) • SUB 0002, D0 • Micro rollback 2 levels • … [Figure: six-entry buffer (value / register / valid bit): 0000 / D0 / 1; 000F / D0 / 1; 0101 / A3 / 1; the other three entries are empty (XXXX / XX / 0)]

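  18. Support for Micro Rollback for Register File - example • MOVE 0000, D0 • ADD 000F, D0 • MOVE 0001, A3 (f) • SUB 0002, D0 • Micro rollback 2 levels • … [Figure: six-entry buffer (value / register / valid bit): 0000 / D0 / 1; 000F / D0 / 1; 0101 / A3 / 1; 000D / D0 / 1; the other two entries are empty (XXXX / XX / 0)]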

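  19. Support for Micro Rollback for Register File - example • MOVE 0000, D0 • ADD 000F, D0 • MOVE 0001, A3 (f) • SUB 0002, D0 • Micro rollback 2 levels • … [Figure: six-entry buffer (value / register / valid bit): 0000 / D0 / 1; 000F / D0 / 1; 0101 / A3 / 0; 000D / D0 / 0; the other two entries are empty (XXXX / XX / 0)]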

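  20. Support for Micro Rollback for Register File - example • MOVE 0000, D0 • ADD 000F, D0 • MOVE 0001, A3 (f) • SUB 0002, D0 • … [Figure: six-entry buffer (value / register / valid bit): 0000 / D0 / 1; 000F / D0 / 1; 0001 / A3 / 1; 000D / D0 / 0; the other two entries are empty (XXXX / XX / 0)]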
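The register-file example in slides 14 to 20 can be replayed with a small software model: writes are held in a buffer of pending entries, and a rollback of n levels discards the n most recent ones. This is a sketch under stated assumptions rather than the hardware design. The class name, the six-entry depth (chosen to match the six slots visible in the figures) and the commit-on-overflow policy are all hypothetical.

```python
class RollbackRegisterFile:
    """Register file fronted by a small buffer of pending writes (hypothetical model)."""

    def __init__(self, depth=6):
        self.depth = depth
        self.committed = {}   # register name -> value already written back
        self.pending = []     # (register, value) pairs, oldest first

    def write(self, reg, value):
        if len(self.pending) == self.depth:
            # The oldest entry leaves the buffer and can no longer be undone.
            old_reg, old_value = self.pending.pop(0)
            self.committed[old_reg] = old_value
        self.pending.append((reg, value))

    def read(self, reg):
        for r, v in reversed(self.pending):   # newest pending write wins
            if r == reg:
                return v
        return self.committed.get(reg)

    def rollback(self, levels):
        if levels > len(self.pending):
            raise ValueError("cannot roll back past the buffer")
        if levels:
            del self.pending[-levels:]


rf = RollbackRegisterFile()
rf.write("D0", 0x0000)                      # MOVE 0000, D0
rf.write("D0", rf.read("D0") + 0x000F)      # ADD 000F, D0
rf.write("A3", 0x0101)                      # MOVE 0001, A3 arrives faulted as 0101
rf.write("D0", rf.read("D0") - 0x0002)      # SUB 0002, D0
rf.rollback(2)                              # undo the faulted MOVE and the SUB
rf.write("A3", 0x0001)                      # re-execute MOVE 0001, A3
rf.write("D0", rf.read("D0") - 0x0002)      # re-execute SUB 0002, D0
print(hex(rf.read("A3")), hex(rf.read("D0")))
```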

  21. CFC with Micro Rollback - Priority • Two concurrent fault detection techniques can request a micro rollback from the processor • They generally request different numbers of rollback levels • Which technique should have priority in case of simultaneous detection by both HC and WD? • We assign the priority to the Hamming code • Reason: shorter jump backs • Although a rationale exists for WD priority [Figure: HC and WD both send rollback requests to the MRB unit, labeled uRB=1 and uRB=3]
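The priority rule on this slide amounts to a small arbiter in front of the micro-rollback unit. A minimal sketch follows; the function name and the use of `None` for "no request" are assumptions.

```python
def select_rollback(hc_levels=None, wd_levels=None):
    """Pick which rollback request to honor.

    hc_levels / wd_levels: number of levels requested by the Hamming code checker
    and by the watchdog, or None if that detector raised no request.
    Returns (source, levels), or None if nothing was requested.
    """
    if hc_levels is not None:
        return ("HC", hc_levels)    # Hamming code has priority on simultaneous detection
    if wd_levels is not None:
        return ("WD", wd_levels)
    return None


print(select_rollback(hc_levels=1, wd_levels=3))
print(select_rollback(wd_levels=3))
```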

  22. CFC with Instruction Micro Rollback – State Diagram • t = number of times the same error state is encountered • t < t1: urb to BEGIN_BLOCK (1 instr) -- read header sign. again • t1 <= t < t2: urb to “Branch” (2 instr) -- re-exec prev. blk’s branch • t >= t2: urb to MIDDLE BLOCK (3 instr) -- re-read 2 branch signs. of prev blk [Figure: the watchdog state diagram (Reset, Begin Block, GET1S, Middle Block, GET2S, Signature 1, Signature 2, Branch) extended with multiple points of micro rollback. Decisions: Header?; Header Sign Eq. Jump Signatures?; Computed Sign. Eq. Header Sign? Error states: Error Wrong Branch; Error Wrong Branch or Faulted Signatures; Error Wrong Computed Signature. Rollback labels include urb_d = 1, urb_d = 2 (re-execute previous branch), urb_d = 3 and urb_d = bsize; a Hamming code detection is labeled urb_d = 1.]
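The escalation rule on this slide maps the retry count t to a rollback depth. A minimal sketch, assuming hypothetical threshold values t1 = 2 and t2 = 4 (the slides give no numbers) and a hypothetical function name:

```python
def cfc_rollback(t, t1=2, t2=4):
    """Return (levels, target) for the t-th encounter of the same error state.

    t1 and t2 are placeholder thresholds; the presentation does not state their values.
    """
    if t < t1:
        return (1, "BEGIN_BLOCK")     # re-read the header signature
    if t < t2:
        return (2, "Branch")          # re-execute the previous block's branch
    return (3, "MIDDLE_BLOCK")        # re-read the previous block's two branch signatures


for t in range(6):
    print(t, cfc_rollback(t))
```

Rolling back further each time the same error recurs lets a short retry handle a brief fault, while a longer re-execution covers the case where the corrupted value was read earlier than first assumed.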


  24. Improved VHDL Model of 68040 + Watchdog connections [Figure: block diagram with CPU, Instr Cache, Data Cache, BC and WD. Hamming encoders and decoders sit on the buses IDBUS, IABUS1, IABUS2, ODBUS, OABUS1 and OABUS2, and on the address bus and data bus. Control signals: enable, rw, ready. Legend: Hamming code error detect. bits, control lines, data buses.]


  26. Simulation Environment • The Total Fault Injection Time is simply the total duration of the intermittent fault on the bus or buses considered. • The Delay Time is the time that the FG waits before starting the fault injection. • The Period Time is the period of the intermittent fault. • The Fault Time is the duration of the injection of a single fault. [Figure: timing diagram of the Fault Enable signal, marking Start Fault Injection, First Fault Injected and Second Fault Injected, annotated with Delay Time, Fault Time, Period Time and Total Fault Injection Time]
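The four timing parameters defined on this slide describe a periodic fault-enable signal. The sketch below shows one way to evaluate it per clock cycle. It assumes that the Total Fault Injection Time is the length of the window that opens after the Delay Time, and that times are counted in whole clock cycles; the function and parameter names are hypothetical.

```python
def fault_active(t, delay, period, fault_time, total):
    """True if the intermittent fault is being injected at clock cycle t.

    Assumption: the injection window spans [delay, delay + total), and within it
    the fault is on for the first fault_time cycles of every period.
    """
    if t < delay or t >= delay + total:
        return False
    return (t - delay) % period < fault_time


# Example: 5-cycle delay, 10-cycle period, 5-cycle fault time (50% duty cycle), 30-cycle window.
active = [t for t in range(50) if fault_active(t, delay=5, period=10, fault_time=5, total=30)]
print(active)
print(len(active) / 30)   # fraction of the window with the fault on (the duty cycle)
```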

  27. Fault Parameter Values • Simulations run on the model: • Faults injected on all cache buses • Fault types • Random double, triple, quadruple faults • Clustered: 1 cluster of 2 bits, 1 cluster of 4 bits, 2 clusters of 2 bits • Three values of repeat frequency • Low (100 clock cycles = 100 kHz) • Medium (10 clock cycles = 1 MHz) • High (1 clock cycle = 10 MHz) • Three values of duty cycle • 25%: all the simulations • 50%: all except high freq and 4 faults • 75%: all 2 faults and 3 faults middle frequencies

  28. Simulation Results (contd.)

  29. Simulation Results (contd.) • NOTE: • HC has better error coverage for cluster faults • Block sign check (part of CFC) has better err cov for rand faults

  30. Simulation Results (contd.)

  31. Conclusions • Micro-rollback coupled with FD for the first time • Micro-rollable WD state diagram for the first time • More extensive fault patterns than previous work • Good reliability for our FD/FT solutions (correct or fail-safe execution) • 3 faults: 94% low freq, 90% mid freq & 90% high freq • 4 faults: 86% low freq, 80% mid freq & 80% high freq • Average execution time linear with duty cycle and almost quadratic with the fault injection frequency • time ovhd 3 faults: 11% low, 12% med, 64% high freq • time ovhd 4 faults: 16% low, 32% med, 182% high freq • Data buses less tolerant to faults than address buses (faults on the latter cause more CFC errors and so are detected more easily)

  32. Future Work • Introduction of other fault detection techniques as triggers for micro rollback • Lower level fault detection like the micro instruction control flow checking -- can detect internal processor faults • Higher level fault detection like algorithm based fault tolerance (ABFT) for checking data errors -- can detect external & internal faults affecting data
