Advanced Space Computing with System-Level Fault Tolerance

Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida NSF CHREC Center

Outline • Overview • NASA Dependable Multiprocessor • Reconfigurable Fault Tolerance (RFT) • Space Applications • Novel Computing Platforms • RapidIO • Conclusions

Overview • What is advanced space computing? • New concepts, methods, and technologies to enable and deploy high-performance computing in space – for an increasing variety of missions and applications • Why is advanced space computing vital? • On-board data processing • Downlink bandwidth to Earth is extremely limited • Sensor data rates, resolutions, and modes are dramatically increasing • Remote data processing from Earth is no longer viable • Must process sensor data where it is captured, then downlink results • On-board autonomous processing & control • Remote control from Earth is often not viable • Propagation delays and bandwidth limits are insurmountable • Space vehicles and space-delivered vehicles require autonomy • Autonomy requires high-speed computing for decision-making • Why is it difficult to achieve? • Cannot simply strap a Cray to a rocket! • Hazardous radiation environment in space • Platforms with limited power, weight, size, cooling, etc. • Traditional space processing technologies (RadHard) are severely limited • Potential for long mission times with diverse set of needs • Need powerful yet adaptive technologies • Must ensure high levels of reliability and availability

Taxonomy of Fault Tolerance • First, let us define various possible modes/methods of providing fault tolerance (FT) • Many other options beyond simply throwing triple-modular redundancy (TMR) at the problem • Software FT vs. hardware FT concepts largely similar, differences only at implementation level • Radiation-hardening not listed, falls under “prevention” as opposed to detection or correction • Most of these FT modes are currently being used at UF • Temporal and spatial variants possible for many techniques • Figure: FT modes, grouped by whether they Detect and/or Correct or Mask • NMR: N-Modular Redundancy • FT-HLL: Fault-Tolerant HLL (e.g. MPI) • SIFT: Software-Implemented Fault Tolerance • CED: Concurrent Error Detection • CR: Checkpointing & Roll-back • SCP: Self-Checking Pairs • BR: Byzantine Resilience • ABFT: Algorithm-Based Fault-Tolerance • NVP: N-Version Programming • ECC: Error Correction Codes
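As a rough illustration of the difference between detecting and masking in the taxonomy above, the sketch below contrasts a self-checking pair (SCP) with an N-modular redundancy (NMR) voter. It is a minimal Python sketch, not code from the UF projects; the names `nmr_vote`, `scp_check`, and `NoMajorityError` are hypothetical.

```python
from collections import Counter


class NoMajorityError(Exception):
    """Raised when no value is produced by a strict majority of replicas."""


def nmr_vote(outputs):
    """Mask faults by majority vote over N replica outputs.

    Returns (value, disagreement_seen). With N = 3 this is TMR: one
    faulty replica is outvoted by the two that agree.
    """
    if not outputs:
        raise ValueError("need at least one replica output")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise NoMajorityError(f"no strict majority among {outputs!r}")
    return value, count < len(outputs)


def scp_check(a, b):
    """Self-checking pair: a mismatch is detected, but the pair alone
    cannot tell which copy is wrong, so it detects without correcting."""
    return a == b
```

For example, `nmr_vote([7, 7, 9])` returns `(7, True)`: the fault is masked and also reported, whereas `scp_check(7, 9)` can only report that something went wrong.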

NASA Dependable Multiprocessor (DM) • NASA/Honeywell/UF Project • 1st Space Supercomputer • Funded by NASA NMP • In-situ sensor processing • Autonomous control • Speedups of 100 to 1000 • First fault-tolerant, parallel, reconfigurable computer for space • Infrastructure for fault-tolerant, high-speed computing in space • Robust system services • Fault-tolerant MPI services • Application services • FPGA services • Standard design framework • Transparent API to resources for earth & space scientists • Figure: reconfigurable cluster computer with System Controllers A and B (RHPPC), Data Processors #1 to #N (PPC, FPGA), High-Speed Networks A and B, spacecraft interface, and mission-specific devices connecting to instruments

Dependable Multiprocessor • DM System Architecture • Dual system controllers • Redundant radiation-hardened PPC boards • Monitor data processors’ health and communicate with spacecraft • Data processing engines • High-performance, low-power COTS SBCs running Linux • PowerPC with AltiVec capabilities • Optional FPGA co-processor for additional performance • Scalable to 20 data processing nodes • Redundant Interconnect • Dual GigE connections • Automatically switch networks when error is detected • DM Middleware (DMM) • FT System Services • Manages status and health of multiple concurrent jobs • FT Embedded MPI (FEMPI) • Lightweight subset of MPI • Allows fault recovery without restarting an entire parallel application • Application & FPGA Services • Commonly used libraries such as ATLAS, FFTW, GSL • Simplified, generic API for FPGA usage through USURP* • High-Availability Middleware • Framework used to enable health monitoring of cluster * USURP is a standardized interface specification for RC platforms, developed by researchers at UF

DMM Components • Mission Manager (MM) • Controls high-level job deployment • Facilitates replication of lower-level jobs • Spatial or temporal replication • Automatically compares and validates outputs • Monitors real-time deadlines • Enables roll-forward / roll-back when faults occur • Job Manager (JM) • Controls low-level job deployment and scheduling across system • FT Manager (FTM) • Manages low-level system faults (node crash, job crash) • JM Agent (JMA) • Deploys and monitors programs on given node • Provides application “heartbeat” to system controller • Mass Data Store (MDS) • Provides reliable centralized data services • Enables reliable checkpointing
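The heartbeat mechanism listed above can be sketched as a simple timeout check. This is only an illustration under assumed names (`HeartbeatMonitor`, `beat`, `failed_nodes`) and an assumed timeout policy; it is not the DMM interface, which the slides do not describe.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time reported for each node.

    Timestamps are passed in explicitly (seconds) so the logic is easy
    to test; a real agent would read a monotonic clock instead.
    """

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, node, now):
        """Record a heartbeat from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A fault manager could poll `failed_nodes` periodically and hand any reported node to its recovery logic, for example redeploying the job from the last checkpoint.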

Algorithm-Based Fault Tolerance • Commonly refers to matrix coding method that is preserved through certain linear algebra operations • Matrix and vector multiply • Discrete Fourier Transform • Discrete Wavelet Transform • Matrix decomposition: C = AB (LU, QR, Cholesky) • Matrix inversion • Used to detect errors in these operations, and in certain cases allows for error correction • ABFT algorithms integrate with DM through Application Services API • An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF • Uses Hamming encoding • Low overhead due to ABFT • Important aspects of ABFT currently under investigation at UF • Round-off analysis • Coverage analysis • Code types • Encoding and Decoding strategies • Overhead • Figures: Fault-tolerant partial transform; Computation flow of fault-tolerant 2D-FFT; Experimental overhead of fault-tolerant RDP vs. a fault-intolerant version
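To make the idea of a code that is "preserved through" a linear algebra operation concrete, here is a minimal sketch of the classic row/column checksum scheme for matrix multiply. It is not the Hamming-encoded 2D-FFT method mentioned above, whose details are not given in the slides, and all function names are hypothetical. Integer data is used so that equality checks are exact; floating-point data would need a tolerance, which is where round-off analysis comes in.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append a column holding the sum of each row of `b`."""
    return [row + [sum(row)] for row in b]


def check_and_correct(cf):
    """Verify a full-checksum product; fix one bad data element in place.

    Returns "ok", "corrected", or "detected" (error found, not fixed).
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m) if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum tells us how far off the element is.
        cf[i][j] += cf[i][m] - sum(cf[i][:m])
        return "corrected"
    return "detected"
```

Multiplying `add_column_checksum(A)` by `add_row_checksum(B)` yields the product `A·B` with its own checksum row and column already attached, so a single corrupted element shows up as exactly one inconsistent row and one inconsistent column.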

Source Code Transformations • Most science applications are inherently non-fault-tolerant • Require a SIFT framework to improve reliability • Possible to immunize programs against most errors by transforming application source code • Less overhead • More control over FT techniques • Compiler-independent • Integrates with DM system through Application Services API • Custom source-to-source (S2S) transformation tool is currently under development at UF • Accepts C source files as inputs • Generates fault-tolerant versions • Uses a fine-grained NMR-type approach to provide improved reliability and dependability • Provides means of control flow checking (CFC) through software • Minimizes number of undetected errors • Transformation options to be supported by the tool • Variable replication • Function replication • Memory duplication / memory checking • Synchronization intervals • Condition evaluation • Post-evaluation verification • Evaluation using replicated variables • Block protection
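The effect of variable replication with synchronization intervals can be sketched by hand. The S2S tool described above operates on C sources; the Python below only illustrates the shape of a transformed function and is not output of that tool. The names `vote3`, `sum_positive_ft`, and the `upset_at` fault-injection hook are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when all three replicas disagree."""


def vote3(a, b, c):
    """Return the value held by at least two of three replicas."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise FaultDetected("all three replicas differ")


def sum_positive(values):
    """Original, unprotected function."""
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total


def sum_positive_ft(values, sync_interval=4, upset_at=None):
    """Hand-transformed version: `total` is triplicated and voted on
    every `sync_interval` iterations and once more before returning."""
    t0 = t1 = t2 = 0
    for i, v in enumerate(values, start=1):
        # The condition is evaluated once per replica.
        if v > 0:
            t0 += v
        if v > 0:
            t1 += v
        if v > 0:
            t2 += v
        if i == upset_at:
            t1 ^= 0x40  # test-only hook: simulate a bit flip in one replica
        if i % sync_interval == 0:
            t0 = t1 = t2 = vote3(t0, t1, t2)  # synchronization point
    return vote3(t0, t1, t2)
```

A shorter synchronization interval repairs a corrupted replica sooner, which lowers the chance of a second upset accumulating before the vote, at the cost of more voting overhead.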

Reconfigurable Fault Tolerance • GOAL – Research how to take advantage of reconfigurable nature of FPGAs, to provide dynamically-adaptive fault tolerance in RC systems • Leverage partial reconfiguration (PR) where advantageous • Explore virtual architectures to enable PR and reconfigurable fault tolerance (RFT) • MOTIVATION – Why go with fixed/static FT, when performance & reliability can be tuned as needed? • Environmentally-aware & adaptive computing is wave of future • Achieving power savings and/or performance improvement, without sacrificing reliability • CHALLENGES – limitations in concepts and tools, open-ended problem requires innovative solutions • Conventional methods typically based upon radiation-hardened components and/or fault masking via chip-level TMR • Highly-custom nature of FPGA architectures in different systems and apps makes defining a common approach to FT difficult • Figure: Satellite orbits, passing through the Van Allen radiation belt

Virtual Architecture for RFT • Novel concept of adaptable component-level protection (ACP) • Common components within VA • Adaptable protection frame – largely module/design-independent (see figure) • Error Status Register (ESR) for system-level error tracking/handling • Re-synchronization controller or interfaces, for state saving and restoration • Configuration controller, two options • Internal configuration through ICAP • External configuration controller • Benefits of internal protection • Early error detection and handling = faster recovery • Redundancy can be changed into parallelism • PR can be leveraged to provide uninterrupted operation of non-failed components • Challenges of internal protection • Impossible to eliminate single points of failure, may still need higher-level (external) detection and handling • Stronger possibility of fault/error going unnoticed • Single-event functional interrupts (SEFI) are major concern • Figure: VA concept diagram of an FPGA with “sockets” for modules under reconfigurable FT (adaptable component-level protection); configurations shown: no parallel, SCP; 2× parallel, SCP; no parallel, TMR; 4× parallel, single
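The trade between redundancy and parallelism can be illustrated by laying modules out over a fixed set of sockets. This sketch assumes four sockets, which is inferred from the "4× parallel" label in the figure rather than stated in the text, and the function name and layout ordering are hypothetical.

```python
# Replicas required per module in each protection mode.
REPLICAS = {"single": 1, "SCP": 2, "TMR": 3}


def plan_sockets(modules, mode, sockets=4):
    """Assign module replicas to sockets; unused sockets stay BLANK."""
    replicas = REPLICAS[mode]
    needed = replicas * len(modules)
    if needed > sockets:
        raise ValueError(f"{mode} for {len(modules)} module(s) needs {needed} sockets")
    layout = [name for name in modules for _ in range(replicas)]
    return layout + ["BLANK"] * (sockets - needed)
```

Under these assumptions, `plan_sockets(["A"], "TMR")` gives `["A", "A", "A", "BLANK"]`, while relaxing protection to `plan_sockets(["A", "B", "C", "D"], "single")` fills every socket with a different module, which is the sense in which redundancy is converted into parallelism.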

Space Applications • Synthetic Aperture Radar (SAR) • Used to form high-resolution images of Earth’s surface from moving platform in space • Patch-based processing with significant amount of overlap between patch boundaries • Parallelizable on multiple levels of granularity, possible without need for any inter-processor communication (one patch per node) • 2-dimensional data set, can range in size from several hundred Megabytes to Gigabytes • Data set not significantly reduced through course of application • Highly amenable to ABFT; possible application for the Dependable Multiprocessor project

Space Applications • Hyperspectral Imaging (HSI) • Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images • Adjustable to enable real-time processing • Mostly embarrassingly parallel, exception being weight computation (shown in red below) • 3-dimensional data set, reduced through course of application • Auto-correlation sample matrix (ACSM) calculation and beamforming (detection) amenable to ABFT • Suggest NMR for weight computation (weight) • Parallel and multi-FPGA decompositions explored

Space Applications • Cosmic Ray Elimination • Uses image processing techniques to remove artifacts caused by cosmic rays • Image shows pre- and post-processed versions of a Hubble Telescope observation • Images are highly parallelizable, with minimal communication necessary • Main computation: median filtering • Fault-tolerant median filter developed • Other portions of algorithm replicated by hand or S2S translator • Other aerospace-related application kernels • Space-Time Adaptive Processing (STAP) • Ground Moving Target Indicator (GMTI) • Airborne LIDAR • Digital Down Conversion (DDC) • PDF Estimation
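As a reminder of why median filtering suits cosmic-ray removal, the sketch below applies a plain 3×3 median filter to a small image. It is the textbook filter only; the fault-tolerant median filter mentioned above is not described in the slides and is not reproduced here. The function name and the border handling (shrinking the window at image edges) are assumptions.

```python
from statistics import median_low


def median_filter3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood.

    At the borders the window is clipped to the image. `median_low`
    keeps the result an actual pixel value for even-sized windows.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = median_low(window)
    return out
```

An isolated bright pixel, such as a single cosmic-ray hit, is an outlier in every window that contains it, so the median discards it. Each output pixel depends only on a small neighbourhood, which is why the image can be split across nodes with minimal communication.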

Novel Computing Platforms • Fixed multi-core (FMC) devices • Cell: heterogeneous, vector compute engine, 3.2 GHz clock rate, ~70 W max. power consumption • GPU: homogeneous, many (e.g. 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption • Reconfigurable multi-core (RMC) devices • Field-Programmable Object Array (FPOA): heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max. power consumption • Field-Programmable Gate Array (FPGA): heterogeneous, fine-grained processing elements, max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption • Tilera: homogeneous, coarse-grained processing elements (64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption • Element CXi: heterogeneous, coarse-grained processing elements, 200 MHz clock rate, ~1 W max. power consumption • Figure sources: Cell processor block diagram, http://www.research.ibm.com/journal/rd/494/kahle.html; FPOA architecture, http://www.mathstar.com/Architecture.php

RC: Vital Technology for Space • Versatility in space missions (adapts as needs demand) • Fixed archs. burdened with fixed choices, limited tradeoffs • Performance in space missions (speed, power, size, etc.) • e.g. Computational density per Watt (CDW) device metric • FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.) • Parallel Operations – scales up to max. # of adds and mults (# of adds = # of mults) possible • Achievable Frequency – lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA] • Power – scales linearly with resource util; max. power reduced by ratio of achievable freq. to max. freq. [FPGA] • HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI’08) • Results excerpted from pending presentation from CHREC-UF site for HPEC’08 Workshop
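One plausible reading of how the three quantities above combine into the CDW metric is sketched below. The slides do not give the formula, so the way the terms are combined is an assumption, as are the function names; the numbers in the usage note are placeholders, not measured results.

```python
def estimated_power_w(max_power_w, achievable_hz, max_hz, utilization=1.0):
    """Power scaled linearly with resource utilization and reduced by
    the ratio of achievable to maximum frequency (FPGA case)."""
    return max_power_w * utilization * (achievable_hz / max_hz)


def cdw(parallel_ops, achievable_hz, max_hz, max_power_w, utilization=1.0):
    """Computational density per Watt, assumed here to be
    (parallel operations x achievable frequency) / estimated power."""
    ops_per_second = parallel_ops * achievable_hz
    return ops_per_second / estimated_power_w(
        max_power_w, achievable_hz, max_hz, utilization
    )
```

With placeholder inputs of 100 parallel operations, 250 MHz achievable out of 500 MHz maximum, and 30 W maximum power, `estimated_power_w` gives 15 W and `cdw` gives 2.5e10 / 15 operations per second per Watt.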

RapidIO • High-speed embedded system interconnect, replacement for bus-based backplanes • Parallel and serial variants, serial is wave of future • Multiple programming models • Research with RapidIO at UF • Simulative research studying capability of RapidIO-based computing platforms to support space-based radar (SBR) processing • Custom testbed designed and built, for verification of simulation models & experimentation with RapidIO & FPGAs • Figures: Experimental logic analyzer measurements; Visualization of simulated GMTI application progress

Conclusions • Fault tolerance for space should be more than RadHard components & spatial TMR designs • Fixed worst-case designs extremely limited in perf/Watt • Instead, many FT methods & modes can be exploited • Adaptive systems that react to environmental changes • COTS featured inside critical performance path • RadHard for FT management, outside critical perf. path • UF active on many space-related FT issues • NASA Dependable Multiprocessor, CHREC RFT F4-08 • Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc. • Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc. • Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.

2009 IEEE Aerospace Conference • Track 7.12 Dependable Software for High Performance Embedded Computing Platforms • Transient error detection and recovery techniques • Compiler-based fault-tolerant techniques • Algorithm-based fault-tolerant techniques • Tools and techniques for designing reliable software • SIFT management frameworks • Software dependability analysis • Adaptive fault-tolerant techniques • FT applications • Track Chairs • Richard Linderman Richard.Linderman@rl.af.mil • Grzegorz Cieslewski cieslewski@hcs.ufl.edu • Dates • Abstract Submissions Due: July 1st, 2008 • Paper Submissions Due: November 2nd, 2008
