
Runtime Logic and Interconnect Fault Recovery on Diverse FPGA Architectures


Presentation Transcript


  1. Runtime Logic and Interconnect Fault Recovery on Diverse FPGA Architectures • John Lach (1), William H. Mangione-Smith (1), Miodrag Potkonjak (2) • UCLA Departments of Electrical Engineering (1) and Computer Science (2)

  2. Outline • Motivation • Project goals • General approach • Tiling and redundant interconnect • Architecture applications • Experimental results • Conclusions

  3. Motivation • Long-life and mission-critical applications • Space and remote systems • FPGA applications • Radiation faults • System constraints: power, cost, size

  4. Project Goals • High tolerance of multiple faults: logic; interconnect; radiation and manufacturing/operating imperfections • Reduced system downtime • Transparent • Low area and timing overhead • Low memory requirements • Low design effort • Applicable to diverse FPGA architectures

  5. General Approach • Create and store many different instances of the same design • Each resource is unused in at least one instance • In the face of a fault, an instance is activated that does not use the faulty resource
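
The selection step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each stored instance is described only by the set of resources it uses, and the names `Resource`, `every_resource_spared`, and `select_instance` are hypothetical.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

# Assumption: a resource is identified by the (row, col) of a logic block.
Resource = Tuple[int, int]
# Assumption: an instance is modeled as the set of resources it occupies.
Instance = FrozenSet[Resource]


def every_resource_spared(instances: List[Instance], resources: Set[Resource]) -> bool:
    """Check that each resource is left unused by at least one instance."""
    return all(any(r not in inst for inst in instances) for r in resources)


def select_instance(instances: List[Instance], faulty: Set[Resource]) -> Optional[int]:
    """Return the index of the first instance that avoids all faulty resources."""
    for index, used in enumerate(instances):
        if used.isdisjoint(faulty):
            return index
    return None  # no stored instance avoids every fault


# Toy 2x2 design with one spare block: four instances, each sparing one block.
resources = {(r, c) for r in range(2) for c in range(2)}
instances = [frozenset(resources - {spare}) for spare in sorted(resources)]
```

In this toy setup, a single fault at `(0, 1)` selects the instance that leaves `(0, 1)` unused, while two simultaneous faults leave no usable instance, because each whole-design instance spares only one block.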

  6. General Approach Problem: Tolerance of Multiple Faults • [Figure: a 6x6 LB design with 4 LB area free]

  7. Solution: Fine-Grained Tiling • Partition design into a set of “tiles” • Each tile has some unused resources (area overhead?) • Lock the interface between tiles (tile independence) • Generate instances of each tile: Atomic Fault-Tolerant Blocks (AFTBs) • Each resource is unused in at least one AFTB • In the face of a fault, an AFTB is invoked that does not use the faulty resource
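
Because the tile interfaces are locked, each tile can be recovered on its own. The following Python sketch illustrates that idea under stated assumptions: every tile carries a list of precomputed AFTBs, each modeled as the set of resources it uses, and the names `AFTB` (as a type alias), `make_tile`, and `recover` are hypothetical and not taken from the presentation.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Resource = Tuple[int, int]      # assumption: (row, col) of a logic block
AFTB = FrozenSet[Resource]      # assumption: resources used by one tile instance


def make_tile(top: int, left: int, size: int = 2) -> List[AFTB]:
    """Build toy AFTBs for one tile: each AFTB spares exactly one block."""
    cells = {(top + r, left + c) for r in range(size) for c in range(size)}
    return [frozenset(cells - {spare}) for spare in sorted(cells)]


def recover(tiles: Dict[str, List[AFTB]],
            active: Dict[str, int],
            faults: Set[Resource]) -> Optional[Dict[str, int]]:
    """Pick, per tile, an AFTB that avoids all faults; None if some tile cannot."""
    chosen = dict(active)
    for name, aftbs in tiles.items():
        if aftbs[chosen[name]].isdisjoint(faults):
            continue  # the active AFTB is unaffected, so leave this tile alone
        for index, used in enumerate(aftbs):
            if used.isdisjoint(faults):
                chosen[name] = index
                break
        else:
            return None  # this tile has no AFTB avoiding the faults
    return chosen


tiles = {"A": make_tile(0, 0), "B": make_tile(0, 2)}
active = {"A": 0, "B": 0}
```

With one fault in each tile, for example `recover(tiles, active, {(0, 1), (1, 3)})`, both tiles switch to a different AFTB independently. Two faults inside the same toy tile still fail, since each toy AFTB spares only one block.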

  8. Tiling Example • [Figure: a 6x6 LB design partitioned into four 3x3 tiles] • [Figure: AFTB example]
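
As a small worked example of the partition on this slide, the sketch below maps a logic-block coordinate to the tile that contains it. It assumes square tiles laid out on a regular grid with zero-based coordinates; `tile_of` and `tile_count` are hypothetical helper names.

```python
from typing import Tuple


def tile_of(row: int, col: int, tile_size: int = 3) -> Tuple[int, int]:
    """Return the (tile_row, tile_col) of the tile containing block (row, col)."""
    return (row // tile_size, col // tile_size)


def tile_count(design_size: int = 6, tile_size: int = 3) -> int:
    """Number of square tiles in a square design that divides evenly."""
    per_side = design_size // tile_size
    return per_side * per_side
```

For the 6x6 design with 3x3 tiles, `tile_count()` gives 4, and a fault at block `(4, 1)` falls in tile `(1, 0)`, so only that tile needs a replacement AFTB.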

  9. Design Example (Xilinx XC4000 Device) • [Figures: initial floorplan for PREP 5 benchmark; after tiling and one AFTB identified; after fault detected at (20,3)]

  10. Benefits of Approach • High reliability • Low overhead: physical resources, circuit performance, memory, design effort • Runtime management: tolerates faults on-line, minimizes system downtime • Flexibility: variable timing constraints, resource limitations, and estimated reliability

  11. Interconnect Faults • Tiling tolerates most interconnect faults • Inter-tile interconnect: global, overlapped/segmented • Reserve redundant interconnect

  12. Diverse Architectures • Sanders CSRC device • Xilinx XC4000 family • Altera Flex 10k series

  13. Sanders CSRC: CSLA and Level 1 Routing

  14. Sanders CSRC: Data Pipes and Level 3 Routing

  15. Sanders CSRC: Pre and Post Pipe Portion Fault Configurations

  16. Sanders CSRC: Hierarchical Redundancy

  17. Xilinx XC4000 • [Figures: initial floorplan for PREP 5 benchmark; after tiling and one AFTB identified; after fault detected at (20,3)]

  18. Xilinx XC4000: Inter-Tile Interconnect Fault Recovery

  19. Altera Flex 10k

  20. Altera Flex 10k: Hierarchical Redundancy

  21. Timing and Area Overhead

  22. Reliability Enhancement: Variable Resource Reliability

  23. Reliability Enhancement: Correlated Fault Model with Variable μ

  24. 5000 CLB Design Reliability

  25. Conclusion • Tolerate logic and interconnect faults • High tolerance of multiple faults • Short system downtime: runtime management, transparent • Low area, timing, memory, and effort overhead • Flexible: applications, architectures
