
Space Processor Radiation Mitigation and Validation Techniques for an 1,800 MIPS Processor Board

Robert Hillman, Mark Conrad, Phil Layton, Chad Thibodeau (Maxwell Technologies). Outline: Commercial & Rad-Hard Processor Trends; SCS750 Design Strategy; Radiation Mitigation; Radiation Testing.


Presentation Transcript


  1. Space Processor Radiation Mitigation and Validation Techniques for an 1,800 MIPS Processor Board
     Robert Hillman, Mark Conrad, Phil Layton, Chad Thibodeau, Maxwell Technologies

  2. Outline
     • Commercial & Rad-Hard Processor Trends
     • SCS750 Design Strategy
     • Radiation Mitigation
     • Radiation Testing
     • Radiation Analysis

  3. Commercial vs. Rad-Hard MIPS
     [Chart: processors plotted: PPC750GX, PPC750FX, PPC750, PPC603, 80486. Annotation: Technology Gap > 10 years]

  4. Rad-Hard Upset Trend: Less SEU Immunity
     [Chart: processors plotted: PPC750GX, PPC750FX, PPC750, PPC603, 80486. Annotation: MEAN TIME TO UPSET DECREASING SUBSTANTIALLY!]
     • Mean Time to Upset Decreasing Exponentially Due to Increasing Bits Per Device (Caches)
     • * Assumes 1E-10 Upsets/bit-day
     • Processor Bits Increasing Exponentially, Dominated by L1 and L2 Cache Sizes
     • Error Correction is Needed
     • Rad-Hard is insufficient with increasing number of bits
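The scaling behind this slide can be sketched numerically. The snippet below is an illustration only. It assumes independent upsets per bit at the slide's stated 1E-10 upsets/bit-day, and it counts only the cache data arrays quoted elsewhere in this transcript for the 750FX (32KB I + 32KB D L1, 512KByte L2). Tags, registers, and any error correction are ignored, and the function name is hypothetical.

```python
# Illustrative only: mean time to upset (MTTU) for an unprotected bit count.
UPSETS_PER_BIT_DAY = 1e-10  # assumption stated on the slide

def mean_days_to_upset(num_bits, rate_per_bit_day=UPSETS_PER_BIT_DAY):
    """Expected days between upsets, assuming independent bits."""
    return 1.0 / (num_bits * rate_per_bit_day)

l1_only_bits = (32 + 32) * 1024 * 8        # 32KB I + 32KB D
l1_l2_bits = (32 + 32 + 512) * 1024 * 8    # plus 512KByte L2

print(f"L1 only : {mean_days_to_upset(l1_only_bits):,.0f} days")
print(f"L1 + L2 : {mean_days_to_upset(l1_l2_bits):,.0f} days")
```

Under these assumptions, adding the L2 array shortens the mean time to upset by the ratio of bit counts (a factor of 9 here), which is the trend the slide describes.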

  5. Importance of L2 Cache On-chip
     [Chart: annotations: 1000X; L1 Cache Boundary (32kByte). Series: SCS750 @ 800MHz with 512KByte L2 Cache; RH PowerPC @ 133MHz without L2 Cache]
     L2 Cache on-chip is Critical for High Performance

  6. SCS750 Design Strategy
     Commercial Technology → Mitigation Technique
     • Latest SOI PowerPC @ 800MHz → TMR/Resynch & Scrubbing
     • High Performance SDRAM → Double Device Correct & HW Scrub (Reed-Solomon)
     • On-Board Control Logic → Actel RT-AX Built-in TMR
     Result: SCS750, 1 Uncorrected Error > 300 Years. Better Upset Immunity Than Rad-Hard SBC’s!
     Annotation: ~ 1 Upset Per Day Unacceptable

  7. SCS750 Prototype Engineering Module (PEM) – Early Dev. Platform
     FM Boards will be conduction cooled with class S or equivalent components

  8. Functional Block Diagram

  9. Radiation Tolerance
     • Entire board designed from the ground up to have the highest upset immunity and highest performance
     • All components Latch-Up Immune
        • Component Screening
     • Effective SEU rate (entire board!) < 1E-5/board-day
        • One error every 300 years or better!
        • Reed-Solomon, TMR/Scrubbing, component selection
        • Typical Geosynchronous orbit
     • Total Dose >= 100KRad
        • Component screening and/or shielding
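The two figures quoted for the effective SEU rate can be related with a short unit conversion. This is a minimal sketch, assuming a constant error rate and a 365.25-day year; the function names are hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumed year length for the conversion

def years_between_errors(errors_per_board_day):
    """Mean years between errors for a constant rate per board-day."""
    return 1.0 / errors_per_board_day / DAYS_PER_YEAR

def errors_per_board_day(years):
    """Constant rate per board-day that gives one error per `years`."""
    return 1.0 / (years * DAYS_PER_YEAR)

print(f"{years_between_errors(1e-5):.0f} years at 1E-5/board-day")
print(f"{errors_per_board_day(300):.2e} per board-day for one error in 300 years")
```

A rate of exactly 1E-5/board-day corresponds to roughly 274 years, and one error in 300 years corresponds to roughly 9.1E-6/board-day, so the slide's two figures agree to about one significant digit.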

  10. PowerPC 750FX™ Features
      • 0.13um SOI, Copper interconnect, SiLK
      • Extremely low power, runs at 1.2V – 1.5V
      • 400MHz, ~ 2 Watts each → > 900 MIPS
      • 800MHz, ~ 4 Watts each → > 1,800 MIPS
      • L1 cache 32KB I, 32KB D with parity
      • L2 Cache 512KByte on-chip, built-in SEC/DED EDAC
      • L2 cache runs at full processor clock rate
      • Tags are parity protected
      • Dual PLLs for on-the-fly software control of clock rate
      • Future products are drop-in compatible

  11. Processor Correction Methodology
      • Triple Modular Redundant (TMR) Processing
         • Three processors, each completely isolated
         • No delay in voting (voting and controller all on one chip!)
      • Detection
         • Single processor error: Flags & Isolates Processor
         • Detection of double processor error & auto restart
      • Resynchronization & Scrubbing
         • Saves processor states, Resets, Re-loads processor
         • Proven to correct any single event effect in the processors
         • < 1 ms delay, < 0.1% overhead with 1 second scrub
      Transparent to Application Software!
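The voting and detection steps on this slide can be illustrated in software. The sketch below is not the SCS750 voter, which the slide describes as on-chip hardware. It is a generic 2-of-3 majority vote over integer words, with hypothetical function names, showing how a single disagreeing processor is identified and how a double error is recognized.

```python
def vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)

def classify(outputs):
    """Compare three processor outputs word-for-word.

    Returns (voted_value, fault), where fault is None if all agree,
    the index (0-2) of the single disagreeing processor, or the
    string "double" if no two outputs agree.
    """
    a, b, c = outputs
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    return None, "double"

print(classify((0x5A, 0x5A, 0x5A)))  # all three agree
print(classify((0x5A, 0x7A, 0x5A)))  # processor 1 disagrees and would be isolated
print(classify((0x01, 0x02, 0x03)))  # no majority: double error, restart path
```

In this sketch, a single fault still yields a correct voted value, which is what lets the faulty processor be isolated and resynchronized without the application noticing.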

  12. Processor Correction Snapshot
      [Figure annotations: All three are Resynchronized; Processor 2 is Upset; CPU 2 Held in Reset]

  13. Processor Error Correction Flow
      • Software Transparent
      • No Uncorrected Errors
      • Overhead < 0.1%
      [Flow chart:]
      • Start State: No Errors
      • User Called Resynch → Resynchronize Processors (<1ms delay)
      • Scrub Timer Expired → Resynchronize Processors (<1ms delay)
      • One Processor Error → Isolate Processor & Flag Error → Resynchronize Processors (<1ms delay)
      • Double Processor Error (occurs typically once in >12,000 years) → Restart 3 Processors & Flag double error → Waits for Software Semaphore to Continue
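The flow on this slide can be written as a small event handler. This is a sketch under stated assumptions: the ordering of steps is inferred from the flattened slide text, and the event names, action strings, and `handle` function are hypothetical rather than part of any SCS750 interface.

```python
from enum import Enum

class Event(Enum):
    USER_RESYNC = "user called resynch"
    SCRUB_TIMER = "scrub timer expired"
    SINGLE_ERROR = "one processor error"
    DOUBLE_ERROR = "double processor error"

def handle(event):
    """Return the ordered actions taken for one event, starting from 'No Errors'."""
    if event is Event.DOUBLE_ERROR:
        # Not correctable by voting: restart and wait for software.
        return ["restart 3 processors",
                "flag double error",
                "wait for software semaphore"]
    actions = []
    if event is Event.SINGLE_ERROR:
        actions += ["isolate processor", "flag error"]
    # All remaining paths end in a resynchronization.
    actions.append("resynchronize processors")
    return actions

for ev in Event:
    print(ev.value, "->", handle(ev))
```

In this sketch, the scheduled scrub and a detected single error share the same resynchronization step; only the double-error path involves a restart.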

  14. SDRAM Error Correction Comparison
      [Chart: comparison ranked from Worst to Best]
      Assumptions: 256 Mbyte Memory; Memory Scrub 1 per day
      SCS750 has built-in, programmable hardware memory scrubbing (DMA)

  15. Development Plan
      • Prototype Demonstrated (Sept 02): TMR (Voting), Error Detection, Resynchronization, Radiation Performance
      • Phase I – July 03
      • Phase II – Dec 03
      • SCS750 Proto Engineering
         • Software compatible with flight
         • Same form factor
         • Commercial parts
         • Customer development
      • SCS750 EM – Q2 04
         • Engineering boards will use “E” or “I” (Engineering or Industrial Grade)
      • SCS750 FM – Q4 04
         • Flight board
         • Fully qualified
         • All parts class “S” or equivalent
      • Actel Axcelerator, RT-AX for Flight; Xilinx FPGAs for Prototype Development

  16. Maxwell 750FX vs. JPL 750FX Data
      [Chart labels: (Maxwell); Single Processor Error Comparison; Maxwell (Dynamic) vs. JPL (Register testing)]
      Good Correlation

  17. Texas A&M Cyclotron Facility: SCS750 PEM - Processor Test Setup
      [Chart: Upsets by Processor]
      Good Uniformity

  18. Double Processor Error Rate
      [Chart: Cross section as a function of scrub rate; SCS750 Triple Processor Results]
      Excellent correlation with actual vs. computed!
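One way to see why the double-error rate depends on scrub rate is a first-order coincidence estimate. The model below is an assumption for illustration and not the computation used by the authors. It supposes each processor is upset independently at a constant rate, that an upset can persist until the next scheduled resynchronization, and that the rate times the scrub interval is much less than 1. All input values are hypothetical.

```python
def double_error_rate(single_rate, scrub_interval):
    """Approximate rate of coincident upsets in two of three processors.

    single_rate    : upsets per processor per unit time (hypothetical input)
    scrub_interval : time between resynchronizations, in the same unit
    """
    first_upset_rate = 3 * single_rate          # any one of three processors
    mean_exposure = scrub_interval / 2          # average wait until the next scrub
    p_second = 2 * single_rate * mean_exposure  # either remaining processor
    return first_upset_rate * p_second          # equals 3 * rate**2 * interval

# Hypothetical numbers, chosen only to show the scaling.
slow = double_error_rate(single_rate=1e-3, scrub_interval=1.0)
fast = double_error_rate(single_rate=1e-3, scrub_interval=0.1)
print(slow, fast, slow / fast)
```

In this model the double-error rate falls in proportion to the scrub interval, so scrubbing ten times as often reduces it tenfold.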

  19. Uncorrected Error Rate Comparison
      Notes:
      • Assuming equivalent shielding of 100 mil Aluminum
      • GCR = Galactic Cosmic Ray background at solar minimum
      • DCF = JPL Design Case Flare (at one A.U.) (1 day)
      • * Scrub @ 10/sec, performance overhead of < 1%
      Source: G. Swift (JPL) RADECS 2003
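The overhead figures quoted for scrubbing follow from the resynchronization delay. A minimal sketch, assuming each scrub stalls processing for at most the "< 1 ms" delay given on the processor correction slides; the function name is hypothetical, and the results are upper bounds because the delay is an upper bound.

```python
def scrub_overhead(scrubs_per_second, resync_delay_s=1e-3):
    """Upper bound on the fraction of time spent resynchronizing."""
    return scrubs_per_second * resync_delay_s

print(f"{scrub_overhead(1):.1%}")   # 1 scrub per second
print(f"{scrub_overhead(10):.1%}")  # 10 scrubs per second
```

These bounds match the "< 0.1% overhead with 1 second scrub" and "Scrub @ 10/sec, performance overhead of < 1%" statements in the slides.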

  20. CONCLUSION
      • Under heavy-ion irradiation, the SCS750 upset mitigation scheme detected and corrected all single processor errors.
      • Upon the coincidence of double processor error, the SCS750 successfully recovered with a reboot and reported the double processor error.
      • Power cycling was never required for correction.
      SCS750 architecture using commercial processors gives better upset immunity and higher performance than Rad-Hard Processors
