A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs (“Birds-of-a-Feather”)

Carl Carmichael (1), Brendan Bridgford (1), Gary Swift (2), Matt Napier (3)
(1) Xilinx Corporation, San Jose CA
(2) Jet Propulsion Laboratory, Pasadena CA
(3) Sandia National Laboratories, Albuquerque NM

"This work was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration."

"Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology."

XTMR SEU Mitigation
• Xilinx Triple Module Redundancy (XTMR)
• Single-point failures are eliminated by triplication of every logic node (gates and nets).
• XTMR confers SEU and SET immunity.
• XTMR does not protect against SEFIs!
• Any digital design can be XTMRed by:
  • “Triplication” of throughput (combinational and sequential) logic
  • “Triplication” of feedback logic and inserting majority voters
  • Adding redundant I/O (outputs with minority voters)
  • Design cleanup (removing half-latches, SRL16s, etc.)

XTMR State-Machines
• XTMR provides autonomous re-synchronization of the separate redundant domains of a state machine by inserting majority voters at the origin of any registered feedback (“looped”) path.
• When a configuration upset disables one domain, the other two domains continue to operate, providing a correct majority representation of state data and functionality.
• When scrubbing fixes the configuration of the upset domain, the embedded redundant voters automatically correct the state of the upset domain without any external intervention.
• As long as the scrub rate is greater than the upset rate, a single bit upset cannot disturb more than one redundant domain.
[Figures: state machine “Pre-TMR” and “Post-XTMR”]
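
The re-synchronization behavior described on this slide can be illustrated with a small behavioral model. This is only a sketch, not the XTMR netlist transformation itself: the 4-bit counter, the `clock` function, the `upset_domain` parameter, and the "stuck at 0" fault model are all assumptions made for illustration. Each domain takes its feedback from a majority vote of all three state registers, so a repaired domain picks up the correct state on the next clock.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant values."""
    return (a & b) | (b & c) | (a & c)


def clock(states, upset_domain=None, width=4):
    """Advance the three redundant counter copies by one clock edge.

    Every domain computes its next state from the voted feedback value
    rather than from its own register. `upset_domain` (a hypothetical
    knob) marks a domain whose configuration is corrupted; its register
    is modeled as stuck at 0 until scrubbing repairs it.
    """
    mask = (1 << width) - 1
    next_states = []
    for domain in range(3):
        feedback = majority(*states)  # hardware has one voter per domain
        if domain == upset_domain:
            next_states.append(0)
        else:
            next_states.append((feedback + 1) & mask)
    return next_states


def demo():
    """Two good clocks, two clocks with domain 1 upset, then scrubbed."""
    states = [0, 0, 0]
    trace = []
    for upset in (None, None, 1, 1, None, None):
        states = clock(states, upset_domain=upset)
        trace.append(states)
    return trace


if __name__ == "__main__":
    for row in demo():
        print(row)
```

In this model the upset domain rejoins the other two on the first clock after the fault is removed, with no reset or external intervention, which is the property the slide attributes to the voters in the feedback path.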

XTMR Inputs
• Effective SEU mitigation requires triple-redundant input pins for every input signal.
• Failing to triplicate global input signals (clk, rst, etc.) can seriously compromise SEU resistance.
• Triplication of input data paths can be traded for EDAC.
• SEU resistance is sometimes traded off against resource utilization.

XTMR Outputs with Minority Voters
• Outputs can be triplicated, using three pins for each output signal.
• Minority voters monitor each of the triplicated design modules.
• If one module is different from the others, its output pin is driven to High-Z.
• Voters are triplicated.
[Figure: three minority voters, one per redundant domain (TR0, TR1, TR2), each driving its own output pin (P). The convergence point is outside the FPGA, at the trace.]
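
The minority-voter rule on this slide can be sketched as a truth-level model. This is an illustration under assumptions, not the vendor primitive: the function names are hypothetical, `None` stands in for High-Z, and the board trace is modeled as simply taking the value of whichever pins are still driving.

```python
from typing import List, Optional


def minority_voter(own: int, other_a: int, other_b: int) -> Optional[int]:
    """Value this domain drives on its pin, or None for High-Z.

    For single-bit signals, a domain that disagrees with both other
    domains is the minority, so its pin is released.
    """
    if own != other_a and own != other_b:
        return None
    return own


def drive_outputs(tr0: int, tr1: int, tr2: int) -> List[Optional[int]]:
    """One voter per domain, each watching the other two domains."""
    return [
        minority_voter(tr0, tr1, tr2),
        minority_voter(tr1, tr0, tr2),
        minority_voter(tr2, tr0, tr1),
    ]


def trace_level(pins: List[Optional[int]]) -> int:
    """Level at the external convergence point (simplified model)."""
    driven = {p for p in pins if p is not None}
    if len(driven) != 1:
        raise ValueError("contention or floating trace")
    return driven.pop()


if __name__ == "__main__":
    pins = drive_outputs(1, 0, 1)  # domain TR1 is upset
    print(pins, trace_level(pins))
```

With a single upset domain, the two healthy pins keep driving the same level and the faulty pin is released, so the trace never sees contention in this model.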

Previous SEE Test Methodology for Mitigation
• The premise of combining XTMR with scrubbing is that completely removing single-event functional errors from the user logic leaves any user design with an overall error rate determined by the remaining Single Event Functional Interrupts (SEFIs). A successful mitigation test is therefore expected to produce zero errors other than SEFIs.
• Because the effectiveness of TMR depends on errors not accumulating in the configuration, earlier experiments tried to keep the upset rate from exceeding the scrub rate. This methodology had two significant flaws:
  • Testing at such low fluxes is impractical: it requires unreasonably long run times and therefore cannot reach enough fluence for statistically significant data.
  • A zero-error-rate result is not useful for making any calculations or extrapolations.
• These issues raise concerns over the validity of any results.

Improved SEE Test Methodology for Mitigation
• The functional error rate of a mitigated system is expected to depend on the upset rate. The expected relationship predicts the increasing probability of upsetting bit combinations that cause a mitigated (TMR) system to fail, as a function of bit upset rate:

$$\mathrm{MER} = \frac{1}{2}\,\frac{N_B\,C_A}{T_S}\,R_U^{2}$$

  • $\mathrm{MER}$ = Mitigation Error Rate
  • $N_B$ = Number of Relevant Bits
  • $C_A$ = Average Cluster Size
  • $T_S$ = Scrub Time
  • $R_U$ = Upset Rate of Relevant Bits
• Testing at extremely high fluxes, varied over several orders of magnitude, can therefore reveal this functional relationship between mitigation error rate and bit upset rate.
• This function can then be extrapolated to make predictions at the much lower upset rates of Earth orbits.
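
A short numerical sketch makes the practical consequence of the square law visible: each factor of ten in upset rate changes the predicted mitigation error rate by a factor of one hundred. The function name and the parameter values in the example are placeholders chosen only for illustration; they are not measurements from this work.

```python
def mitigation_error_rate(n_b: float, c_a: float, t_s: float, r_u: float) -> float:
    """MER = (1/2) * (N_B * C_A / T_S) * R_U**2, as written on the slide."""
    return 0.5 * (n_b * c_a / t_s) * r_u ** 2


if __name__ == "__main__":
    # Placeholder parameters, NOT values from the experiments.
    n_b, c_a, t_s = 1.0e6, 2.0, 0.5
    for r_u in (1e-3, 1e-2, 1e-1):
        mer = mitigation_error_rate(n_b, c_a, t_s, r_u)
        print(f"R_U = {r_u:.0e}  ->  MER = {mer:.3e}")
```

Because the dependence is a pure power law, it appears as a straight line on a log-log plot, which is what allows data taken at very high flux to be extended down to orbital upset rates.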

Plot Definitions
• Predicted SEFI cross-section
  • Static and dynamic SEE characterization of the Virtex-II FPGA revealed several Single Event Functional Interrupt modes: POR (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06).
  • These combined cross-sections represent the minimum functional error cross-section for a single Virtex-II (XQR2V6000) device on orbit.
• Worst Case Orbital Upset Rate
  • The CREME96 calculation of the worst-case orbital upset rate for an XQR2V6000 is 7,740 bit-errors/day (9E-02 bit-errors/sec) in a GEO orbit at 36,000 km during the worst day of an Anomalously Large Solar Flare, accounting for both heavy ions and protons. In a 40 MeV Kr beam the same upset rate is achieved with a flux of 1.25E-01 p/cm²/s. The equivalent upset rates for all other orbits and solar conditions therefore lie to the LEFT of this line.
• Single Event Functional Interrupts
  • The average cross-section of the SEFIs observed while collecting the data shown in the plot. This cross-section is not flux dependent. Variations from the predicted value are due to the statistical significance of the total fluence accumulated during each test.
• Functional Errors
  • Data plot of the observed events in which the Device Under Test returned an incorrect result. The cross-section is the number of error events divided by the total fluence at the specified flux. TMR denotes that the DUT design was fully mitigated with XTMR and scrubbing. The Unmitigated results were obtained with a functionally identical design without XTMR; scrubbing was still used for the unmitigated test.
• Extrapolation
  • A derived function describing mitigation failure as a function of upset rate. Extending the function predicts functional error cross-sections at worst-case orbital upset rates that are less than the SEFI cross-sections.
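
The arithmetic behind these definitions is easy to check. The sketch below sums the three SEFI-mode cross-sections quoted on the slide, converts the CREME96 daily upset rate to a per-second rate, and defines the events-over-fluence cross-section used for the functional error points. The variable and function names are illustrative choices, and cm² units are assumed for the SEFI cross-sections.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# SEFI mode cross-sections quoted on the slide (cm^2 assumed).
SEFI_MODES = {"POR": 2.5e-6, "SMAP": 1.72e-6, "IOB": 4.2e-6}


def combined_sefi_cross_section() -> float:
    """Sum of the individual SEFI mode cross-sections."""
    return sum(SEFI_MODES.values())


def per_second(rate_per_day: float) -> float:
    """Convert an upset rate in events/day to events/second."""
    return rate_per_day / SECONDS_PER_DAY


def cross_section(error_events: int, fluence: float) -> float:
    """Observed cross-section: error events divided by total fluence."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return error_events / fluence


if __name__ == "__main__":
    print(f"combined SEFI cross-section: {combined_sefi_cross_section():.2e}")
    print(f"worst-case orbital rate: {per_second(7740):.3f} bit-errors/sec")
```

The combined SEFI value works out to 8.42E-06, and 7,740 bit-errors/day is just under 0.09 bit-errors/sec, consistent with the 9E-02 figure quoted above.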

PLOT 1
[Plot: functional error cross-section versus flux. Secondary (top) x-axis: Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Beam: 40 MeV Kr, LET = 22.3 MeV·cm²/mg. Annotations: 36,000 km GEO orbit, worst-day solar flare, 8,000 bit-errors/day; all other orbits.]
• SEFIs drive error rate for all designs and all orbits.
• Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.

PLOT 2
[Plot: functional error cross-section versus flux. Secondary (top) x-axis: Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Beam: 40 MeV Kr, LET = 22.3 MeV·cm²/mg. Annotations: 36,000 km GEO orbit, worst-day solar flare, 8,000 bit-errors/day; all other orbits.]
• SEFIs drive error rate for all designs and all orbits.
• Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.

PLOT 3
[Plot: functional error cross-section versus flux. Secondary (top) x-axis: Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Beam: 40 MeV Kr, LET = 22.3 MeV·cm²/mg. Annotations: 36,000 km GEO orbit, worst-day solar flare, 8,000 bit-errors/day; all other orbits.]
• SEFIs drive error rate for all designs and all orbits.
• Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.

SEE Test Analysis
• The experiments were conducted over a flux range of 7E+00 to 4E+04 p/cm²/s.
• The flux rates have been normalized on the secondary (top) x-axis of the plots to “average bit upsets per scrub cycle” ($R_S$).
• Each experiment demonstrated a drop in failure cross-section over several orders of magnitude, crossing the SEFI cross-section at upset rates that are still several orders of magnitude above worst-case orbital upset rates.
• Extrapolating the data for each experiment shows a mitigation error cross-section at least one order of magnitude below the SEFI cross-section at worst-case orbital upset rates.
• By superposition of the data-fit functions, the total effective mitigated error cross-section is

$$\sigma_{\mathrm{TOTAL}} = \sigma_{\mathrm{BRAM}} + \sigma_{\mathrm{CLB}} + \sigma_{\mathrm{MULT}} + \sigma_{\mathrm{SEFI}}$$

$$\sigma_{\mathrm{TOTAL}} = 5.0\times10^{-8}\,(1.4\,R_S)^{2} + 5.0\times10^{-6}\,(0.7\,R_S)^{0.5} + 1.75\times10^{-6}\,(1.4\,R_S)^{0.35} + 8.42\times10^{-6}\ \ (\mathrm{cm^2})$$

• Therefore, at the worst-case orbital upset rate of 9E-2 upsets/sec ($R_S$ = 4.5E-2 upsets/scrub), the effective total cross-section for functional error is calculated as $\sigma_{\mathrm{TOTAL}}$ = 1.05E-5 cm²/device (orbital worst case).
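
The superposition can be evaluated numerically as a check. The sketch below assumes the trailing parenthesized numbers in each fit term are exponents (2, 0.5, and 0.35), a reading supported by the remark under Future Work that the fits were not all square functions; the function and constant names are illustrative only.

```python
SIGMA_SEFI = 8.42e-6  # combined static SEFI cross-section, cm^2


def sigma_components(r_s: float) -> dict:
    """Per-resource fit terms at r_s average bit upsets per scrub cycle.

    Exponents 2, 0.5 and 0.35 are an assumed reading of the slide.
    """
    return {
        "BRAM": 5.0e-8 * (1.4 * r_s) ** 2,
        "CLB": 5.0e-6 * (0.7 * r_s) ** 0.5,
        "MULT": 1.75e-6 * (1.4 * r_s) ** 0.35,
        "SEFI": SIGMA_SEFI,
    }


def sigma_total(r_s: float) -> float:
    """Total effective functional error cross-section (cm^2/device)."""
    return sum(sigma_components(r_s).values())


if __name__ == "__main__":
    r_s = 4.5e-2  # worst-case orbital upsets per scrub cycle (from the slide)
    total = sigma_total(r_s)
    for name, value in sigma_components(r_s).items():
        print(f"{name:5s} {value:.3e}  ({value / total:.1%} of total)")
    print(f"TOTAL {total:.3e}")
```

Under this reading the total lands in the neighborhood of 1E-5 cm², close to the 1.05E-5 quoted on the slide; a small difference would be expected if the published fit coefficients are rounded. The SEFI term accounts for most of the total, which is the point the analysis makes.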

Conclusions
• Validation of mitigation techniques becomes far more efficient and accurate when the upset-rate dependency of the mitigation method is demonstrated by testing at flux rates that overwhelm the mitigation.
• The static SEFI cross-section is the dominant factor in calculating orbital error rates for any Virtex-II design mitigated with full XTMR and scrubbing.
• Future Work
  • The authors recognize an anomaly in the data-fit functions: not all of them take the form of a square function. This is anticipated to be due to the complexity of the bit clusters in the experimental designs. Additional research is needed to derive the separate coefficients of the MER equation for each design and to explain their functional associations.
