Explore fault tolerance implementation for single event upsets in synchronous system design. Learn about SEU detection, correction methods, TMR, EDAC techniques, and safe state machines.
A Simplified Approach to Fault Tolerant State Machine Design for Single Event Upsets Melanie Berg
Overview • This presentation describes “Hardened by Design” techniques at a high level of abstraction for FPGA/ASIC logic design • Background • Definition of Fault Tolerance • State Machines • Synchronous Design Theory • Proposed Method of SEU detection • Proposed Method of SEU correction
Definition of Fault Tolerance • Masking or recovering from erroneous conditions in a system once they have been detected • The degree of fault tolerance to implement is defined by your system-level requirements, i.e., what behavior is actually acceptable upon error • Questions that must be answered within the system requirements documentation: • Does your system only need to detect an error? • How quickly must the system respond to an error? • Must your system also correct the error? • Is the system susceptible to more than one error per clock cycle?
Synchronous Design with Asynchronous Events • This discussion focuses on Single Event Upsets (SEUs) in sequential elements (flip-flops) within a synchronous design environment. • The SEU is considered a soft (temporary) error that occurs when a D flip-flop (DFF) is hit by a charged particle. • Configuration or SRAM errors will not be considered. • Although the design is synchronous, it is very important to note that the SEU is an asynchronous event: • This is generally not taken into account • Metastability and unpredictable events can occur • It can invoke a Single Event Functional Interrupt (SEFI)
Common Fault Tolerant Implementation • Triple Modular Redundancy (TMR) is the most commonly implemented solution for SEU tolerance. • Why? Because it is a very simple solution • In many cases it is not implemented correctly • Glitches within the TMR voting logic (due to mitigation across separate clock domains or hazardous combinational logic) must be taken into account in case an SEU occurs near a clock edge • TMR can be very area intensive
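As a rough illustration of the voting step that TMR relies on, the following Python sketch models a bitwise two-of-three majority voter over three redundant copies of a register. The function name `majority` and the sample values are assumptions made for illustration. It is a steady-state behavioral model only: it says nothing about the glitch and clock-domain hazards noted on the slide, which arise in the real combinational logic and its timing.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant register copies."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical 5-bit register value held in all three copies.
good = 0b10110
# One copy takes a single-bit upset.
upset = good ^ 0b00100

# The vote masks the upset regardless of which copy was hit.
assert majority(upset, good, good) == good
assert majority(good, upset, good) == good
assert majority(good, good, upset) == good
```

The area cost is visible even in this model: three copies of every register plus a voter per bit are needed to mask a single upset.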
Proposed EDAC Methodology • Goal: The proposed Error Detection and Correction (EDAC) techniques are: • Targeted at synchronous Finite State Machine designs • Less area intensive than TMR • Glitch free and synchronous: reduces the rate of SEFI • Note: Synchronous design techniques referred to in this presentation are derived from the ASIC industry and are implemented using HDL: • DFF data inputs should not change within the setup and hold window of the DFF; otherwise metastability and unpredictable functionality can occur • Within a synchronous design, metastability will only happen at clock domain crossings, so metastability filters (synchronizers) must be used to protect against these asynchronous events • Synchronous design theory minimizes clock boundary crossings • This is a challenge when SEUs can occur at any point in time, anywhere in the circuit
Synchronous State Machines • A Finite State Machine (FSM) is designed to deterministically transition through a pattern of defined states • A synchronous FSM uses flip-flops to hold its current state, transitions on a clock edge, and only accepts inputs that have been synchronized to the same clock • Generally, FSMs are used as control mechanisms • Concern/Challenge: • If an SEU occurs within an FSM, the entire system can lock up in a state that is unreachable in normal operation: SEFI!
Synchronous State Machines • The structure consists of four major parts: • Inputs • Current State Register • Next State Logic • Output logic
Encoding Schemes • Each state of a FSM must be mapped into some type of encoding (pattern of bits) • Once the state is mapped, it is then considered a defined (legal) state • Unmapped bit patterns are illegal states
Safe State Machines??? • A “Safe” State Machine has been defined as one that: • Has a set of defined states • Can deterministically jump to a defined state if an illegal state has been reached (due to an SEU). • Synthesis tools offer a “Safe” option (demand from our industry):
    TYPE states IS (IDLE, GET_DATA, PROCESS_DATA, SEND_DATA, BAD_DATA);
    SIGNAL current_state, next_state : states;
    attribute SAFE_FSM : Boolean;
    attribute SAFE_FSM of states : type is true;
• However, designers beware! • The synthesis tools' “Safe” option is not deterministic if an SEU occurs near a clock edge!
Binary Encoding: How Safe is the “Safe” Attribute? • If a binary-encoded FSM flips into an illegal (unmapped) state, the safe option will return the FSM to a known state that is defined by the others or default clause • If a binary-encoded FSM flips into another legal (mapped) state, the error will go undetected. • If the FSM is controlling a critical output, this phenomenon can be very detrimental! • How safe is this?
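To make the undetected-flip concern concrete, here is a small Python sketch that enumerates every single-bit upset of a binary-encoded five-state machine and counts how many land in another legal state. The state names come from the VHDL fragment earlier in the deck; the sequential 3-bit code assignment is an assumption, since the actual assignment is chosen by the synthesis tool.

```python
# Assumed sequential binary assignment for the five states (hypothetical;
# a synthesis tool may pick a different mapping).
STATES = {
    "IDLE": 0b000,
    "GET_DATA": 0b001,
    "PROCESS_DATA": 0b010,
    "SEND_DATA": 0b011,
    "BAD_DATA": 0b100,
}
WIDTH = 3


def classify_single_flips(states, width):
    """Return (undetected, detected) counts over all single-bit upsets.

    A flip is 'undetected' when it lands on another legal code, and
    'detected' when it lands on an unmapped code that a safe/others
    clause could catch.
    """
    legal = set(states.values())
    undetected = detected = 0
    for code in legal:
        for bit in range(width):
            flipped = code ^ (1 << bit)
            if flipped in legal:
                undetected += 1
            else:
                detected += 1
    return undetected, detected


undetected, detected = classify_single_flips(STATES, WIDTH)
print(f"undetected: {undetected}, detected: {detected}")
```

Under this assumed assignment, 10 of the 15 possible single-bit upsets move the machine silently into another legal state, which is the case the “Safe” attribute cannot help with.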
One-Hot vs. Binary • There used to be a consensus that Binary is “safer” than One-Hot • This was based on the idea that One-Hot requires more DFFs to implement an FSM and thus has a higher probability of incurring an error • This view has changed! • Most of the community now understands that although One-Hot requires more registers, it has the built-in detection that is necessary for safe design • Binary encoding can lead to a very “un-safe” design
Proposed SEU Error Detection: One-Hot • One-Hot requires that exactly one bit be active high per clock period • If a single upset turns a second bit on (or turns the only active bit off), an error will be detected. • A combinational XNOR (parity check) over the FSM bits is sufficient for SEU detection, even if an SEU occurs near a clock edge • A MUX can be used to transition the current state into a defined “ERROR STATE” if the parity check fails • If the system cannot receive multiple event upsets within one clock period, then the circuitry can never flip (illegally) into a legal state, because any two one-hot encodings differ in two bits
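The parity argument can be checked exhaustively with a short behavioral model. In the Python sketch below, `parity_error` is a hypothetical name for the XNOR reduction over the state bits (the complement of the XOR of all bits), and the 5-bit width is an assumption chosen to match a five-state one-hot machine.

```python
def parity_error(state: int, width: int) -> int:
    """XNOR reduction over the state bits.

    Returns 0 when an odd number of bits is set (as in a legal one-hot
    state) and 1 when an even number is set (error flag).
    """
    ones = bin(state & ((1 << width) - 1)).count("1")
    return 1 - (ones & 1)


WIDTH = 5  # assumed: one flip-flop per state of a five-state FSM

for hot in range(WIDTH):
    legal = 1 << hot
    # A legal one-hot state never raises the error flag.
    assert parity_error(legal, WIDTH) == 0
    # Every single-bit upset leaves zero or two bits set and is flagged.
    for bit in range(WIDTH):
        assert parity_error(legal ^ (1 << bit), WIDTH) == 1
```

The same model also shows the limit stated on the slide: two upsets in one clock period can leave an odd number of bits set (for example 00001 becoming 00010), which passes the parity check.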
FSM SEU Error Correction: Using Companion States • There exist many publications on error correction theory. • None directly address how to correctly implement FSM fault correction using current-day synthesis tools. • Glitch control: synthesis tools will generally produce “glitchy” logic • Synthesis “optimization” algorithms will erase the redundancy necessary for EDAC • The user must sometimes hand instantiate logic • The user must place the necessary attributes to prevent the tool from removing redundant logic.
Error Correction within One Cycle: Using Companion States • We'll base the derivation on a four-state FSM (states A, B, C, and D).
Error Correction within One Cycle: Using Companion States • 1. Find an encoding such that the states have a Hamming distance of at least 3 (at least 3 bits must differ between any two states): • 00000 (state A), • 11100 (state B), • 01111 (state C), • 10011 (state D). • Five bits are necessary to encode a four-state machine in order to achieve the required Hamming distance of three.
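Both claims on this slide can be checked by brute force. The Python sketch below computes the minimum pairwise Hamming distance of the four listed encodings, then searches all 4-bit and 5-bit code sets to see whether four codewords at distance three exist. The helper names (`hamming`, `min_distance`, `code_exists`) are assumptions made for illustration.

```python
from itertools import combinations

# State encodings as listed on the slide.
ENCODING = {"A": 0b00000, "B": 0b11100, "C": 0b01111, "D": 0b10011}


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


def min_distance(codes) -> int:
    """Smallest pairwise Hamming distance within a set of codes."""
    return min(hamming(x, y) for x, y in combinations(codes, 2))


def code_exists(width: int, n_states: int, distance: int) -> bool:
    """Exhaustively test whether n_states codewords of the given width
    can be chosen with pairwise Hamming distance >= distance."""
    return any(
        min_distance(candidate) >= distance
        for candidate in combinations(range(1 << width), n_states)
    )


assert min_distance(ENCODING.values()) == 3
assert not code_exists(4, 4, 3)  # four bits are not enough
assert code_exists(5, 4, 3)      # five bits suffice
```

A minimum distance of three matters because it keeps the single-bit neighborhoods of any two states disjoint, so a one-bit upset can always be attributed to exactly one state.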
Error Correction within One Cycle: Using Companion States • For each encoding, calculate the companion encodings, i.e., every encoding at a Hamming distance of one from it. For example: • Companion encodings for state A (00000) are: • 00001, 00010, 00100, 01000, 10000 • Companion encodings for state B (11100) are: • 11101, 11110, 11000, 10100, 01100
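Companion encodings can be generated mechanically by flipping each bit of a state's code in turn. The Python sketch below does this for the four-state encoding used in the derivation; `companions` is a hypothetical helper name, and bits are flipped from least significant to most significant.

```python
# State encodings used in the derivation.
ENCODING = {"A": 0b00000, "B": 0b11100, "C": 0b01111, "D": 0b10011}
WIDTH = 5


def companions(code: int, width: int) -> list:
    """All codes at Hamming distance one from `code`, as bit strings."""
    return [format(code ^ (1 << bit), f"0{width}b") for bit in range(width)]


for name, code in ENCODING.items():
    print(name, format(code, f"0{WIDTH}b"), companions(code, WIDTH))

# Each state plus its five companions occupies six bit patterns, and no
# pattern is shared between states: 4 * 6 = 24 distinct patterns.
covered = set()
for code in ENCODING.values():
    covered.add(format(code, f"0{WIDTH}b"))
    covered.update(companions(code, WIDTH))
assert len(covered) == 24
```

Of the 32 possible 5-bit patterns, 24 are covered this way; the remaining 8 can only be reached by two or more upsets.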
Error Correction within One Cycle: Using Companion States • When implementing the state machine, state A is encoded as 00000 and then (theoretically) “OR-ed” with all of its companion encodings. This covers all possible single-bit SEUs • Do the same for all other states • Use the output of the “OR-ed” states to determine the next state logic. • Thus, if a bit flips, the companion state will catch it and the FSM will still be able to correctly determine the next state • Be careful! The “OR” logic is more complex than simply using a string of “OR” gates.
Error Correction within One Cycle: Glitch Control • One major issue that is frequently overlooked is SEUs occurring near clock edges • If this occurs, your error checking logic may cause a glitch • Due to routing timing differences, this can cause incorrect values to be latched into the current state registers. • Refer to a Karnaugh map for a glitch-free implementation • The designer may have to hand instantiate the logic if the synthesis tool does not implement the VHDL as expected
Error Correction within One Cycle: Glitch Control • The designer will have to include synthesis directives in order to turn off the tool's “optimization”: • Preserve_driver • Preserve_signal • Always check the gate-level output of the synthesis tool.
Conclusion • This presentation proposes methods of fault tolerant state machine implementation to address potential IC SEU susceptibility. • Be aware of potential glitches due to asynchronous SEUs occurring near a clock edge: • Mitigation techniques must be glitch free! • Mitigation may need a synchronization circuit • Due to metastability and routing delay differences, an SEU can be more catastrophic than expected • Special directives must be used to guide the synthesis tools when implementing fault tolerant redundant logic, because the tools are generally focused on area and speed optimization.