
Using Software Rules To Enhance FPGA Reliability

Using Software Rules To Enhance FPGA Reliability. Chandru Mirchandani, Lockheed-Martin. September 7-9, 2005. P226-W/MAPLD2005.


Presentation Transcript


  1. Using Software Rules To Enhance FPGA Reliability Chandru Mirchandani Lockheed-Martin September 7-9, 2005 MIRCHANDANI 1 P226-W/MAPLD2005

  2. FPGA Fault Tolerance
     • Historically realized through triple redundancy, error-correcting codes and replicated elements
     • The fault tolerance process is only as good as the tests run to validate its performance, e.g.
        • when invalid data is not ignored due to an inherent fault in the lookup-and-compare sequence
        • the testing was not rigorous enough
        • the testing was not complete
     • Lack of real estate and logic on the device precludes the ideal solution
     • Make educated judgment calls on how much is acceptable and for how long

  3. Reconfiguring FPGAs
     • Replicated circuitry or triple redundancy can be achieved on different devices or on the same device
     • Replicating a complete circuit on the same device will not meet the real-estate constraint and will decrease performance due to routing
     • Replication could still be used to one's advantage if only subsets of the circuit were replicated
     • Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design

  4. Types of Errors
     • Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
     • Transient error: the system recovers from corrupt data and resumes normal operation
     • Permanent fault: fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area
     • For both types of error, the design in VHDL, i.e. the FPGA software, is the key to success
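The transient/permanent rule on this slide can be sketched in a few lines of Python. This is only an illustration of the decision rule as stated, not the authors' implementation: the `ErrorClassifier` class, the `resource_id` key (used here to decide what "first time" means) and the response strings are all hypothetical names introduced for the example.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


class ErrorClassifier:
    """Hypothetical helper: the first detection is transient, a repeat is permanent."""

    def __init__(self):
        # Assumption: "first time" is tracked per resource (e.g. per CLB).
        self._seen = set()

    def classify(self, resource_id):
        if resource_id in self._seen:
            return ErrorKind.PERMANENT
        self._seen.add(resource_id)
        return ErrorKind.TRANSIENT


def respond(kind):
    """Map an error kind to the recovery action described on the slide."""
    if kind is ErrorKind.TRANSIENT:
        return "recover from corrupt data and resume normal operation"
    return "diagnose the damaged resource and choose a new configuration"


if __name__ == "__main__":
    classifier = ErrorClassifier()
    for event in ["clb_3", "clb_7", "clb_3"]:
        kind = classifier.classify(event)
        print(event, kind.value, "->", respond(kind))
```

A real CED scheme would need a policy for clearing the history (for example after a successful reconfiguration); that is left out here because the slides do not describe one.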

  5. Software Reliability
     • Develop Criteria for Design Objective Acceptance
     • Prioritize tasks or functions in order of criticality
     • Develop metrics to measure performance of tasks with respect to constraints
     • Evaluate design options based on measured reliability metrics

  6. Typical Software Options
     [Figure labels: Processor 1, Application A1 (I-ary), Application A1 (II-ary), Processor 2]
     • Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure

  7. Redundant Instances of Software
     • First, detect, contain and recover from faults as soon as possible
     • Where this is not possible, allow control to be passed to the redundant instance within the reliability and availability requirements levied on the system
     • Finally, include language-defined mechanisms to detect and prevent the propagation of errors

  8. Methodology
     • Estimate the reliability based on instruction set and operational usage
     • Re-design critical elements to decrease risk
     • Re-evaluate the risk of failure after a change in critical task design, based on performance and requirements
     • Re-evaluate the reliability based on failure rate
     • Factor in the uncertainty in evaluation

  9. Task Times

  10. FPGA System - Conceptual
     • Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks; each Task is a subsystem

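  11. Task Reliability Block Diagram
     $\exp(-\gamma_h u_h \lambda_{hwi} t)\,\exp(-\gamma_s u_s \lambda_{swi} t)\,\left[1-\left\{1-\exp(-(1-\gamma_h)\lambda_{shwi} t)\,\exp(-(1-\gamma_s)\lambda_{sswi} t)\right\}^{2}\right]$
     [Block diagram labels: AND, OR]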

  12. Definitions

  13. Parameters & Derivations
     • Failure intensity: $\lambda_{shwi} = \lambda_{hwi}\, u_h\, (1-\gamma_h)$
     • Failure intensity: $\lambda_{sswi} = \lambda_{swi}\, u_s\, (1-\gamma_s)$
     • Common cause: $\lambda_{hwi}\, u_h\, \gamma_h$ and $\lambda_{swi}\, u_s\, \gamma_s$
     • Execution time $t$: $e_i \cdot t$
     • $R_{SSi}$: subsystem reliability
     • System reliability: $R_S = R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$
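The parameters on slide 13 can be combined into a small numerical sketch of the block diagram on slide 11. Several readings here are assumptions, since the Definitions slide did not survive in the transcript: `u_h` and `u_s` are taken as hardware and software usage factors, `gamma_h` and `gamma_s` as common cause fractions, and `e_i` as the execution-time fraction of task i. Slide 11 as transcribed multiplies the instance intensities by (1-γ) again; the sketch uses the slide 13 definitions directly, which already contain that factor. All numeric values are hypothetical.

```python
import math


def subsystem_reliability(lam_hw, lam_sw, u_h, u_s, gamma_h, gamma_s, e_i, t):
    """Reliability of one duplex subsystem (assumed reading of slides 11 and 13)."""
    tau = e_i * t  # execution time of this task within mission time t

    # Series block: common cause failures defeat both redundant instances.
    common = math.exp(-(lam_hw * u_h * gamma_h + lam_sw * u_s * gamma_s) * tau)

    # Failure intensities of a single instance, as defined on slide 13.
    lam_shw = lam_hw * u_h * (1.0 - gamma_h)
    lam_ssw = lam_sw * u_s * (1.0 - gamma_s)
    single = math.exp(-(lam_shw + lam_ssw) * tau)

    # Parallel block: the pair fails only if both instances fail.
    duplex = 1.0 - (1.0 - single) ** 2
    return common * duplex


def system_reliability(subsystem_values):
    """R_S as the product of the subsystem reliabilities (series system)."""
    return math.prod(subsystem_values)


if __name__ == "__main__":
    # Hypothetical numbers for the Reading, Parsing and Pre-Processing tasks.
    tasks = [
        dict(lam_hw=1e-6, lam_sw=5e-6, u_h=0.4, u_s=0.6, e_i=0.5),
        dict(lam_hw=1e-6, lam_sw=5e-6, u_h=0.3, u_s=0.5, e_i=0.3),
        dict(lam_hw=1e-6, lam_sw=5e-6, u_h=0.3, u_s=0.4, e_i=0.2),
    ]
    values = [
        subsystem_reliability(gamma_h=0.01, gamma_s=0.5, t=10_000.0, **task)
        for task in tasks
    ]
    print("subsystems:", values)
    print("system:", system_reliability(values))
```

Keeping the common cause term outside the squared bracket is what limits the benefit of redundancy: as γ approaches 1 the duplex behaves like a single instance.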

  14. Extending the Rules
     • The programmed design, be it the original duplex design (duplicated or diverse) or the option for re-configuration, will optimize whatever option is used to enhance fault tolerance
     • For example, in the Reading Task, it is shown that the area usage and operational profile have an effect on the predicted overall reliability of the FPGA-based design
     • Yu and McCluskey state that the designs of the CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm will perform, but the more efficiently or optimally the re-configured design will perform in the event of a permanent failure

  15. Further Extension
     • Greater area usage has a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, the design and its associated code have a greater propensity for failures
     • The common cause fractions used in the paper are relative numbers to illustrate the model
     • With a redundancy of one, the fraction attributed to hardware common cause failure is 1%. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area

  16. Assertions
     • The common cause fractions used in the paper are relative numbers to illustrate the model
     • With a redundancy of one, the fraction attributed to hardware common cause failure is 1%. This implies that there is an equal chance for a common defect in the hardware, in this case the FPGA, to manifest itself anywhere in the active area
     • When implemented on different devices, this fraction drops to ¼% because the physical defects are now almost negligible, and the only common effects are environmental, i.e. temperature, power and external stresses

  17. More Assertions
     • The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. There is very little change from same device to different devices, since the design implemented is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼% variation
     • In the diverse design paradigm, the hardware dependence remains in relatively the same ratio, but the software fractions vary drastically: on the same device the common cause fraction is 50%, and it drops to 10% for diverse designs on different devices
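To see why these fractions matter, the following sketch compares a duplex arrangement under the common cause fractions quoted on slides 16 and 17 (hardware 1% vs ¼%, software 50% vs 10%). It assumes a simple split of one failure intensity into a common cause part and an independent part, in the spirit of the series/parallel block diagram; the function name `duplex_reliability`, the failure intensity and the mission time are hypothetical and chosen only for illustration.

```python
import math


def duplex_reliability(lam, gamma, t):
    """Two redundant instances sharing a common cause fraction gamma (assumed model)."""
    common = math.exp(-gamma * lam * t)          # fails both instances at once
    single = math.exp(-(1.0 - gamma) * lam * t)  # independent part, one instance
    return common * (1.0 - (1.0 - single) ** 2)


if __name__ == "__main__":
    lam, t = 1e-5, 10_000.0  # hypothetical intensity (per hour) and mission time

    # Common cause fractions quoted on the slides.
    cases = {
        "hardware, same device (1%)": 0.01,
        "hardware, different devices (1/4%)": 0.0025,
        "software, diverse design, same device (50%)": 0.50,
        "software, diverse designs, different devices (10%)": 0.10,
    }
    for label, gamma in cases.items():
        print(f"{label}: R = {duplex_reliability(lam, gamma, t):.6f}")
```

With these illustrative inputs, lowering the common cause fraction raises the duplex reliability, which is the qualitative point the slides make about moving to different devices and to diverse designs.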

  18. System Configuration Options

  19. Results

  20. Conclusions
     • Cost and Schedule Slips
     • Development Delays and Costs
     • Adaptive Model
     • Optimization and Design Constraints
     Contact Address: chandru.j.mirchandani@lmco.com
