A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs

Carl Carmichael (1), Brendan Bridgford (1), Gary Swift (2), Matt Napier (3)

(1) Xilinx Corporation, San Jose, CA; (2) Jet Propulsion Laboratory, Pasadena, CA; (3) Sandia National Laboratories, Albuquerque, NM


"This work was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration." "Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology."

ABSTRACT

“Xilinx Triple Module Redundancy,” or XTMR, is an SEU mitigation technique and design methodology intended to remove all single points of failure within the configuration control cells and user logic elements, including those in the voting circuitry, as well as preventing the propagation of single event transients, by “triplicating” all inputs, outputs, logic, clock domains and voters. Voters are also inserted on all state logic feedback paths, conferring full SEU and SET immunity while allowing for autonomous re-synchronization of just-reconfigured state logic to the redundant domains.

This paper presents the fundamental philosophy of the XTMR method, the automated implementation of XTMR provided by the new release of the “Xilinx TMRTool”, as well as Single Event Effects testing and analysis of the combined SEU mitigation technique of XTMR and autonomous partial re-configuration (scrubbing).

The SEE test analysis demonstrates that this combined SEU mitigation technique pushes the cross-section for functional error for any design in any orbit to at least one order of magnitude below the established cross-sections for device level Single Event Functional Interrupts (SEFI). This study has the potential to relieve many users of the requirement to perform independent SEE testing on individual design implementations.

XTMR SEU Mitigation

  • Xilinx Triple Module Redundancy (XTMR)

    • Single Point Failures are eliminated by triplication of every logic node (gates & nets).

    • XTMR confers SEU and SET immunity

    • XTMR does not protect against SEFIs!

    • Any digital design can be XTMRed by:

      • “Triplication” of throughput (combinational & sequential) logic

      • “Triplication” of feedback logic and inserting majority voters

      • Adding redundant IO (outputs with minority voters)

      • Design cleanup (removing half-latches, SRL16s, etc.)

XTMR State-Machines

“Pre-TMR”

  • XTMR provides autonomous re-synchronization of the separate redundant domains of a state-machine by inserting majority voters at the origin of any registered feed-back “Looped” path.

  • When a configuration upset disables one domain, the other two domains continue to operate providing a correct majority representation of state data and functionality.

  • When “Scrubbing” fixes the configuration of the upset domain, the embedded redundant voters automatically correct the state of the upset domain without any external intervention.

  • As long as the scrub rate is greater than the upset rate, a single bit upset cannot disturb more than one redundant domain.
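The resynchronization behavior described in these bullets can be illustrated with a small behavioral model. This is only a sketch in Python, not TMRTool output: the 8-bit counter and the names `majority` and `step` are hypothetical, chosen for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote across the three redundant domains."""
    return (a & b) | (b & c) | (a & c)


def step(state: list[int], width: int = 8) -> list[int]:
    """Advance a triplicated counter by one clock.

    Each domain has its own voter on the feedback path, so every domain
    computes its next state from the voted value rather than from its own
    (possibly corrupted) register.
    """
    mask = (1 << width) - 1
    voted = [majority(*state) for _ in range(3)]  # one voter per domain
    return [(v + 1) & mask for v in voted]


state = [5, 5, 5]
state[1] = 0xA7        # an upset corrupts the register of one domain
state = step(state)    # the voters outvote the bad domain on the next clock
print(state)           # all three domains agree again
```

Because the voters sit on the feedback path, the corrupted domain is pulled back into step on the next clock edge without any external reset, which is the property the slide relies on once scrubbing has repaired the configuration.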

“Post-XTMR”

XTMR Inputs

  • Effective SEU Mitigation requires the use of triple redundant input pins for every input signal.

  • Not triplicating global input signals (clk, rst, etc.) can seriously compromise SEU resistance.

  • Triplication of input data paths can be traded for EDAC.

  • SEU resistance is sometimes a trade-off for resource utilization.

XTMR Outputs with Minority Voters

  • Outputs can be triplicated, using three pins for each output signal.

  • Minority voters monitor each of the triplicated design modules.

  • If one module is different from the others, its output pin is driven to High-Z

  • Voters are triplicated
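A minimal behavioral sketch of the minority-voter idea for a single-bit output follows. It is written in Python for illustration only; the function names and the use of `None` to represent High-Z are assumptions, not part of the XTMR implementation.

```python
from typing import Optional


def minority(own: int, other1: int, other2: int) -> bool:
    """True when this domain disagrees with both of the other domains."""
    return own != other1 and own != other2


def drive_pins(tr0: int, tr1: int, tr2: int) -> list[Optional[int]]:
    """Value driven on each of the three output pins; None models High-Z."""
    domains = [tr0, tr1, tr2]
    pins: list[Optional[int]] = []
    for i, own in enumerate(domains):
        others = domains[:i] + domains[i + 1:]
        # Each pin has its own voter, so the voters are triplicated too.
        pins.append(None if minority(own, *others) else own)
    return pins


def trace_value(pins: list[Optional[int]]) -> Optional[int]:
    """Level seen where the three pins converge on the board trace."""
    driven = {p for p in pins if p is not None}
    return driven.pop() if len(driven) == 1 else None


pins = drive_pins(1, 0, 1)   # domain 1 is upset
print(pins, trace_value(pins))
```

With one domain upset, its pin is released and the two agreeing pins drive the trace, so no contention appears at the convergence point.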

[Figure: triplicated output. Three minority voters, one per domain (TR0, TR1, TR2), each paired with an output pin (P). The convergence point is outside the FPGA, at the trace.]

Xilinx TMRTool

  • The Xilinx TMRTool is a graphical application that automates the implementation of XTMR for FPGA designs.

  • The designer is provided the flexibility to selectively apply XTMR to their design at the instance, component, and hierarchical levels.

  • Custom mitigation methods may be employed for specific portions of the design through user-created library macros.

  • Designs are imported from a Xilinx netlist (NGO/NGC) and exported as a single standard EDIF project source.

XTMR SEE Testing

  • Validation of mitigation of architectural resources by superposition.

    • Separate experiments were created to cover the major elements of the Virtex-II architecture:

      • Configurable Logic Block

        • Combinatorial Logic, Sequential Logic, Arithmetics, Multiplexing.

        • Design implementation is an array of state-machines.

      • Multipliers

        • Dedicated 18 x 18 bit multiply function blocks.

        • Design implementation is array of Multiply and Accumulate functions.

      • Block Memories

        • Synchronous Dual Port 18k bit RAM blocks.

        • Design implemented as a single large memory space for high speed store and fetch functions.

      • Input Output Blocks

        • Multi-standard discrete & bi-directional un/registered device IO.

        • Design implemented as feed-thru channels from IOB to IOB.

      • Digital Clock Managers

        • Clock frequency synthesis and phase delay re-alignment.

        • This will be tested in future work.

2V6000 Dynamic SEU Test

[Figure: beam test setup. Labels: BEAM, Thinned DUT, Inside target room, Functional Monitor / Strip Chart, Configuration Monitor / Strip Chart, Front Side, Back Side.]


CLB Test Design

[Figure: CLB test design block diagram. Recoverable labels: DUT (MODULE with counters +1 through +32 and MUX 32x1x32; mod0 through mod15; MUX 32x1x16); SERVICE (FSM, Configuration Manager Core, SelectMAP, Functional Monitor, Error Counters); bus widths 5, 10, 16, and 32.]

CLB Test Functional Description

  • The CLB test “pre-TMR” design consists of 512 (32 bit) counters created as 16 modules of 32 counters per module. Each counter in the module increments by a different value. The output of each module is a multiplex of the 32 counters. The outputs of all the modules are again multiplexed to a single 16 bit bus. A 10 bit address bus is used to select the output of a specific counter and select between the upper and lower 16 bit banks of the 32 bit module outputs.

  • The Xilinx TMRTool software is used to process the design into a fully XTMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.

  • All counters are running continuously. Each counter is selected sequentially for sampling of its current state and operation.

  • For each module sample taken, the actual and expected values are recorded along with sequential count of state errors and the running count of event errors into a strip chart file on the host PC.

  • When counters are observed to be permanently in the wrong state the design is reset to regain a fully functioning test.

  • The final error count is calculated as the number of events in which a counter either lost its state or moved to the wrong state.
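The addressing scheme above (16 modules x 32 counters x 2 banks = 1024 locations, hence 10 address bits) can be sketched as follows. The bit ordering of the address, the increment of `k + 1` for counter `k` (suggested by the "+1 ... +32" labels in the diagram), and the assumption that every counter starts at zero are all illustrative guesses, not details given in the slides.

```python
def decode_address(addr: int) -> tuple[int, int, int]:
    """Split a 10-bit address into (module, counter, bank).

    Assumed layout, MSB to LSB: 4 module bits, 5 counter bits, 1 bank bit.
    """
    bank = addr & 0x1
    counter = (addr >> 1) & 0x1F
    module = (addr >> 6) & 0xF
    return module, counter, bank


def expected_sample(addr: int, cycles: int) -> int:
    """Expected 16-bit sample after `cycles` clocks, for a counter reset to 0."""
    _module, counter, bank = decode_address(addr)
    value = ((counter + 1) * cycles) & 0xFFFFFFFF   # 32-bit counter wraps
    return (value >> 16) & 0xFFFF if bank else value & 0xFFFF


print(decode_address(0b1111_11111_1))
print(expected_sample(62, 3))   # module 0, counter 31, lower bank
```

A functional monitor built this way compares each sampled word against `expected_sample` and logs a mismatch as a state error.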

Multiplier Test Design

[Figure: multiplier test design block diagram. Recoverable labels: DUT (MODULE with three MACs fed by +1 and the constants x1, x10, x11; MUX 3x2x32; mod0 through mod15; MUX 32x1x16); SERVICE (FSM, Configuration Manager Core, SelectMAP, Functional Monitor, Error Counters); bus widths 3, 5, 8, 16, 32, and 36.]

Multiplier Test Functional Description

  • The Multiplier test “pre-TMR” design consists of 48 (18x18x36 bit) Multiply and Accumulate (MAC) blocks created as 16 modules of 3 MACs per module. Each MAC in the module increments by 1 and multiplies by a different constant (1, 10, and 11, respectively). The output of each module is a multiplex of the 3 MACs and a select of the lower 32 bits and upper 4 bits of the 36 bit registered multiplier output. The outputs of all the modules are again multiplexed to a single 16 bit bus. An 8 bit address bus is used to select the output of a specific MAC and select between the upper and lower 16 bit banks of the 32 bit module outputs.

  • The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.

  • All MACs are constantly accumulating. Each MAC is selected sequentially for a periodic sampling of its sequence.

  • For each module sample taken, the actual and expected values are recorded along with sequential count of state errors and the running count of event errors into a strip chart file on the host PC.

  • When MACs are observed to be permanently in the wrong state the design is reset to regain a fully functioning test.

  • The final error count is calculated as the number of events in which a MAC lost its state or produced an incorrect result.

BRAM Test Design

[Figure: BRAM test design block diagram. Recoverable labels: DUT (128k byte RAM with 16-bit DATA and 16-bit ADDRESS, data generated as address - 1); SERVICE (FSM, Configuration Manager Core, SelectMAP, Functional Monitor, Error Counters).]

BRAM Test Functional Description

  • The Block Memory test “pre-TMR” design consists of a single large 128k byte single port memory space created from 64 memory blocks of 16k bits each.

  • The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) for the configuration of the DUT.

  • Separate WRITE and READ routines are executed to all memory address locations. The data is derived from a decrement of the address value. The entire memory space is refreshed with a write operation and then the data is retrieved with a read operation.

  • During the read operation the retrieved data is compared against the expected value.

  • For each data sample taken, the actual and expected values are recorded with the running count of event errors into a strip chart file on the host PC.

  • Each error event is measured for its total word error size in bits: 1, 32, 64, 512, 1024, etc.

  • The final error count is calculated as the number of separate events of word errors.
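The write/read procedure above can be modeled in a few lines. This is a sketch under stated assumptions: the memory is treated as 64K words of 16 bits (inferred from 128k bytes and the 16-bit DATA and ADDRESS buses in the diagram), and the helper names are hypothetical.

```python
WORDS = 1 << 16          # 128k bytes organized as 64K x 16-bit words (assumed)
WORD_MASK = 0xFFFF


def expected(addr: int) -> int:
    """Test pattern: the data word is the address decremented by one."""
    return (addr - 1) & WORD_MASK


def write_pass(mem: list[int]) -> None:
    """Refresh the entire memory space with the test pattern."""
    for addr in range(len(mem)):
        mem[addr] = expected(addr)


def read_pass(mem: list[int]) -> tuple[int, int]:
    """Read every location back; return (words in error, bits in error)."""
    word_errors = bit_errors = 0
    for addr, data in enumerate(mem):
        diff = data ^ expected(addr)
        if diff:
            word_errors += 1
            bit_errors += bin(diff).count("1")
    return word_errors, bit_errors


mem = [0] * WORDS
write_pass(mem)
mem[100] ^= 0b101        # inject a two-bit upset into one word
print(read_pass(mem))
```

Tracking both counts mirrors the slide's distinction between the number of error events and the size of each event in bits.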

Configuration Error Detection and Correction Algorithm

[Flowchart labels: CONFIGURE, READBACK, CHECKPORT, SCRUB, READBACK; stored values PREV CRC, SCRUB CRC, CONFIG CRC; test CRC ERROR = 2.]

  • Configure target FPGA with configuration data stored in the configuration PROM(s).

  • Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Config-CRC”.

  • Perform a Write/Read check on the internal Frame Address Register of target FPGA.

  • Scrub (background refresh) configuration data of target FPGA.

  • Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Rdbk-CRC” and perform bit-for-bit error detection of configuration data.

  • Compare “Rdbk-CRC” with “Config-CRC”.

  • If the CRC values mismatch a second time, assert SEFI_ERROR and RECONFIGURE.
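The steps above can be sketched as a small simulation. This is one reading of the bullets, not the authors' Configuration Manager: `FakeFpga` is a hypothetical stand-in for the target device, the Frame Address Register check is omitted, and because the slides do not name the CRC polynomial, Python's `binascii.crc_hqx` (CRC-CCITT) is used purely as a placeholder 16-bit CRC.

```python
import binascii


class FakeFpga:
    """Hypothetical stand-in for the target FPGA's configuration memory."""

    def __init__(self, golden: bytes):
        self.golden = golden
        self.memory = bytearray(len(golden))
        self.stuck = False   # True models a SEFI: scrubbing has no effect

    def configure(self) -> None:
        """Full configuration from the PROM image; also clears the SEFI."""
        self.memory = bytearray(self.golden)
        self.stuck = False

    def scrub(self) -> None:
        """Background refresh of the configuration data."""
        if not self.stuck:
            self.memory = bytearray(self.golden)

    def readback(self) -> bytes:
        return bytes(self.memory)


def crc16(data: bytes) -> int:
    """Placeholder 16-bit CRC (CRC-CCITT); the real polynomial is not stated."""
    return binascii.crc_hqx(data, 0xFFFF)


def scrub_cycle(fpga: FakeFpga, config_crc: int, crc_errors: int) -> tuple[int, bool]:
    """One scrub and readback pass; returns (crc_errors, sefi_asserted)."""
    fpga.scrub()
    rdbk_crc = crc16(fpga.readback())
    if rdbk_crc == config_crc:
        return 0, False              # configuration is clean again
    crc_errors += 1
    if crc_errors >= 2:              # second mismatch in a row
        fpga.configure()             # SEFI_ERROR: reconfigure the device
        return 0, True
    return crc_errors, False


fpga = FakeFpga(bytes(range(64)))
fpga.configure()
config_crc = crc16(fpga.readback())
fpga.memory[3] ^= 0x01               # ordinary upset: repaired by scrubbing
print(scrub_cycle(fpga, config_crc, 0))
```

An ordinary configuration upset is repaired by the scrub and never reaches the error counter; only a fault that survives two consecutive scrubs is escalated to a full reconfiguration.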

[Flowchart labels: START; DONE (0/1); decisions PREV = SCRUB and CONFIG = SCRUB with YES/NO branches; actions CRC ERROR +1 and CRC ERROR = 0; SEFI exit.]

Previous SEE Test Methodology for Mitigation

  • The claim for the combined mitigation method of XTMR and scrubbing is that completely removing Single Event Functional Errors from the user logic leaves any user design with an overall error rate determined only by the remaining Single Event Functional Interrupts (SEFIs). Therefore, a successful mitigation test is expected to produce zero errors other than SEFIs.

  • Since the effectiveness of TMR is dependent upon no accumulation of errors in the configuration, experiments were attempted to maintain an upset rate that did not exceed the scrub rate. This methodology had two significant flaws:

    • One is the impracticality of testing at such low fluxes: run times become unreasonably long, so the test cannot reach sufficient fluence for statistically significant data.

    • The other flaw is that a zero error rate result is not useful for making any calculations or extrapolations.

  • These issues raise concerns over the validity of any results.

Improved SEE Test Methodology for Mitigation

  • The functional error rate of a mitigated system is expected to depend on the upset rate. The expected relationship predicts the increasing probability of upsetting bit combinations that cause a mitigated (TMR) system to fail, as a function of bit upset rate:

    $\mathrm{MER} = \frac{1}{2}\,\frac{N_B\,C_A}{T_S}\,R_U^{2}$

    • MER = Mitigation Error Rate

    • $N_B$ = Number of Relevant Bits

    • $C_A$ = Average Cluster Size

    • $T_S$ = Scrub Time

    • $R_U$ = Upset Rate of Relevant Bits.

  • Therefore, testing at extremely high fluxes over several orders of magnitude variation can be performed to reveal this functional relationship between mitigation error rate and bit upset rate.

  • This function can then be extrapolated to make predictions at the much lower upset rates of earth orbits.
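The quadratic dependence on upset rate is what makes the extrapolation possible, and a short numerical sketch shows it. The function below reads the slide's formula as one half, times the product of the number of relevant bits and the average cluster size divided by the scrub time, times the square of the upset rate; that reading and all of the input values are assumptions for illustration only.

```python
def mitigation_error_rate(n_bits: float, cluster_size: float,
                          scrub_time: float, upset_rate: float) -> float:
    """MER = (1/2) * (N_B * C_A / T_S) * R_U**2, as read from the slide."""
    return 0.5 * (n_bits * cluster_size / scrub_time) * upset_rate ** 2


# Hypothetical parameters, chosen only to show the scaling behavior.
params = dict(n_bits=1.0e6, cluster_size=2.0, scrub_time=0.5)

high = mitigation_error_rate(upset_rate=1.0e-3, **params)
low = mitigation_error_rate(upset_rate=1.0e-5, **params)

# Two decades less upset rate gives four decades less mitigation error rate.
print(high / low)
```

A measurement taken at a beam flux far above orbital conditions can therefore be scaled down along a slope of two on a log-log plot, provided the data actually follow a square law.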

    Plot Definitions

    • Predicted SEFI cross-section

      • Static and Dynamic SEE Characterization of the Virtex-II FPGA revealed several Single Event Functional Interrupt Modes: POR (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06)

      • These combined cross-sections represent the minimum functional error cross-section for a single Virtex-II (XQR2V6000) device on orbit.

    • Worst Case Orbital Upset Rate

      • CREME96 calculation of the worst case orbital upset rate for a XQR2V6000 is 7,740 bit-errors/day (9E-02 bit-errors/sec) in a GEO orbit at 36,000km during the worst day of an Anomalously Large Solar Flare accounting for both Heavy Ion and Proton. In a 40MeV Kr beam the exact same upset rate is achieved with a Flux of 1.25E-01 p/cm2/s. This denotes that the equivalent upset rates for all other orbits and solar conditions would reside to the LEFT of this line.

    • Single Event Functional Interrupts

      • This is the average cross-section of the observed SEFI(s) while collecting the data represented in the plot. This cross-section is not Flux dependent. Variations from the predicted value are due to statistical significance of the total accumulated fluence during each test.

    • Functional Errors

      • Data plot of the observed events when the Device Under Test returned an incorrect result. Cross-section is determined by the number of error events divided by total fluence at the specified flux. TMR denotes that the DUT design was fully mitigated with XTMR and scrubbing. The Unmitigated results were obtained with a functionally identical design without XTMR; scrubbing was still used for the unmitigated test.

    • Extrapolation

      • A derived function describing mitigation failure as a function of upset rate. Extending the function predicts functional error cross-sections at worst case orbital upset rates to be less than the SEFI cross-sections.
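Two of the numbers in these definitions can be checked with simple arithmetic: the combined SEFI cross-section and the conversion of the worst case orbital upset rate from per day to per second. The sketch below uses only the values quoted above; the variable names are illustrative, and the unit of cm2 is taken from the SEE Test Analysis slide.

```python
# SEFI mode cross-sections quoted above (cm2 per device).
SEFI_MODES = {"POR": 2.5e-6, "SMAP": 1.72e-6, "IOB": 4.2e-6}

# The combined cross-section is the sum of the individual modes.
sefi_total = sum(SEFI_MODES.values())

# Worst case orbital upset rate from the CREME96 calculation.
upsets_per_day = 7740
upsets_per_sec = upsets_per_day / 86_400   # seconds in a day

print(f"combined SEFI cross-section: {sefi_total:.2e} cm2")
print(f"worst case upset rate: {upsets_per_sec:.1e} bit-errors/sec")
```

The sum is 8.42E-06 cm2, and the rate rounds to the 9E-02 bit-errors/sec quoted in the definition.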

    PLOT 1, PLOT 2, PLOT 3

    [Three plots; images not recovered. Top x-axis: Configuration Bit Errors per Scrub Cycle, with ticks from 3.5E-02 to 3.5E+03. Beam: 40 MeV Kr, LET = 22.3 MeV·cm2/mg. Reference line: 36,000 km GEO orbit, worst day solar flare, 8,000 bit-errors/day, with the region for all other orbits marked. Annotations: “SEFIs drive error rate for all designs and all orbits.” “Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.”]

    SEE Test Analysis

    • The experiments were conducted over a flux range of 7E+00 to 4E+04 p/cm2/s.

    • The flux rates have been normalized on the secondary (top) x-axis of the plots to “average bit upsets per scrub cycle” ($R_S$).

    • Each experiment demonstrated a drop in failure cross-section over several orders of magnitude, crossing the SEFI cross-section at upset rates that are still several orders of magnitude above worst case orbital upset rates.

    • Extrapolating this data for each experiment demonstrates a mitigation error cross-section at least one order of magnitude below the SEFI cross-section at worst case orbital upset rates.

    • By superposition of the data fit functions, the total effective mitigated error cross-section is

      $\sigma_{TOTAL} = \sigma_{BRAM} + \sigma_{CLB} + \sigma_{MULT} + \sigma_{SEFI}$

      $\sigma_{TOTAL} = 5.0\times10^{-8}\,(1.4\,R_S)^{2} + 5.0\times10^{-6}\,(0.7\,R_S)^{0.5} + 1.75\times10^{-6}\,(1.4\,R_S)^{0.35} + 8.42\times10^{-6}\ \mathrm{cm^2}$

    • Therefore, at the worst case orbital upset rate of 9E-2 upsets/sec ($R_S$ = 4.5E-2 upsets/scrub), the effective total cross-section for functional error is calculated as:

      $\sigma_{TOTAL} = 1.05\times10^{-5}\ \mathrm{cm^2/device}$ (orbital worst case)
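The superposition can be evaluated numerically. In the sketch below, the trailing parenthesized numbers in the fit functions (2, 0.5, 0.35) are read as exponents; that reading is an assumption, though it agrees with the Future Work remark that not all fits were square functions. The function and variable names are illustrative.

```python
SIGMA_SEFI = 8.42e-6   # cm2, combined SEFI cross-section from the slides


def sigma_components(rs: float) -> dict[str, float]:
    """Fit functions per resource, with rs in bit upsets per scrub cycle."""
    return {
        "BRAM": 5.0e-8 * (1.4 * rs) ** 2,
        "CLB": 5.0e-6 * (0.7 * rs) ** 0.5,
        "MULT": 1.75e-6 * (1.4 * rs) ** 0.35,
        "SEFI": SIGMA_SEFI,
    }


def sigma_total(rs: float) -> float:
    """Total effective cross-section by superposition (cm2 per device)."""
    return sum(sigma_components(rs).values())


rs_worst = 4.5e-2      # worst case orbital upsets per scrub cycle
total = sigma_total(rs_worst)
print(f"total: {total:.2e} cm2")
print(f"SEFI share: {SIGMA_SEFI / total:.0%}")
```

Under this reading the total comes out at roughly 1.0E-5 cm2, close to the 1.05E-5 quoted on the slide, and the SEFI term accounts for most of it, which is the point of the analysis.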

    Conclusions

    • The efficiency and accuracy of validating mitigation techniques are greatly improved by demonstrating the upset rate dependency of the mitigation method, testing at flux rates that overwhelm the mitigation.

    • The static SEFI cross-section is the dominating factor for calculating orbital error rates for any Virtex-II design when mitigated with Full XTMR & Scrubbing.

    • Future Work

      • The authors recognize an anomaly in the data fit functions in that they were not all expressed as a square function. It is anticipated that this is due to the complexity of the bit clusters of the experimental designs. Additional research is called for to derive the separate coefficients for the MER equation for each design and explain their functional associations.
