Reliability-Aware OS Support for FPGA-Based Systems

M. Kandemir, G. Chen, and F. Li
Department of Computer Science & Engineering
The Pennsylvania State University, USA
224/MAPLD 2004

Introduction and Acronyms
• Increasing soft-error rates make reliability an important factor in system design
• Our focus: Reliability-aware OS scheduling for FPGA-based systems
• FPGA: Field Programmable Gate Array
• CLB: Configurable Logic Block
• STG: SubTask Graph

The Reconfigurable System
[Figure: a 6x8 array of CLBs (Configurable Logic Blocks) on which Process 1, Process 2, and Process 3 are placed; the interconnects and input-output blocks are omitted]

Improving Reliability
• Traditionally, the OS scheduler schedules parallel executions of multiple processes to maximize FPGA space utilization
• Data dependencies between different processes might prevent the full utilization of FPGA space
• Our approach uses the otherwise idle FPGA space to duplicate processes and improve reliability

Duplicating Processes
[Figure: the CLB array holding Process 1, Process 2, and Process 3 together with a duplicate of Process 1 and a duplicate of Process 3]

Issues in Duplicating Processes
• Tasks (processes) have different criticality
• Each task may require a different amount of FPGA space
• Duplications can cause performance degradation
• We use a QoS parameter to indicate the maximum tolerable performance degradation
• A checker task is scheduled for each duplicated task to check the outputs of the primary task and the duplicate
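The checker idea from the last bullet can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the name run_with_checker is hypothetical, tasks are modeled as plain Python callables rather than FPGA configurations, and "checking the outputs" is assumed to mean an equality comparison. In line with the slides, it detects a mismatch but does not attempt recovery.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_checker(primary: Callable[[], T],
                     duplicate: Callable[[], T]) -> Tuple[T, bool]:
    """Run a primary task and its duplicate, then compare their outputs.

    Returns the primary output and a flag that is True when the two
    outputs disagree. Detection only: nothing is corrected or retried.
    (Hypothetical helper; names are not taken from the slides.)
    """
    primary_out = primary()
    duplicate_out = duplicate()
    # The checker task: any disagreement is reported as a detected error.
    error_detected = primary_out != duplicate_out
    return primary_out, error_detected
```

A caller would treat a True flag as a detected soft error in either copy; the comparison alone cannot tell which of the two outputs is the corrupted one.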

Subtask Graph (STG)
• Each process to be scheduled is represented by a subtask graph
• Each node represents a portion of process code (a subtask) that executes in a single time quantum once it is scheduled
• The jth node of process i is denoted STGij
• An edge from Vi to Vj indicates a data or control dependence from Vi to Vj

Subtask Graph
• Since our processes are extracted from the same application, there might be data dependences between different processes
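One possible in-memory form of a subtask graph is sketched below. It is an assumption made for illustration: the slides do not describe the data structure used, and the class name SubtaskGraph and its methods are hypothetical. A node is a pair (i, j) standing for STGij, and an edge Vi -> Vj records that Vj depends on Vi. Because edges may connect nodes of different processes, one graph object covers the inter-process dependences as well.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

NodeId = Tuple[int, int]  # (process i, subtask j), standing for STGij


@dataclass
class SubtaskGraph:
    # edges[v] is the set of nodes that depend on v (v -> successor).
    edges: Dict[NodeId, Set[NodeId]] = field(default_factory=dict)

    def add_node(self, v: NodeId) -> None:
        self.edges.setdefault(v, set())

    def add_dependence(self, vi: NodeId, vj: NodeId) -> None:
        """Record a data or control dependence from vi to vj."""
        self.add_node(vi)
        self.add_node(vj)
        self.edges[vi].add(vj)

    def ready(self, done: Set[NodeId]) -> List[NodeId]:
        """Subtasks not yet executed whose predecessors have all completed."""
        # A node is blocked while any of its predecessors is unfinished.
        blocked = {vj
                   for vi, succs in self.edges.items() if vi not in done
                   for vj in succs}
        return sorted(v for v in self.edges
                      if v not in done and v not in blocked)
```

The ready list is what a scheduler would pick from in each time quantum; when it is too short to fill the device, the leftover space is what the duplication scheme can exploit.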

Our Approach
• Five steps: annotation, QoS specification, task identification, task ranking, and scheduling
• Task duplication under QoS guarantees
• Current implementation focuses only on error detection

Our Approach
• Annotation step: The application programmer uses annotations to indicate which data structures are critical from the reliability viewpoint.
• QoS specification step: The application programmer also indicates the latency that can be tolerated during application execution as a result of the reliability provided.

Our Approach
• Task identification step: An automatic application code analyzer analyzes the source code and identifies tasks.
• Task ranking step: Tasks are ranked based on how they operate on critical data, and ordered from the most important task to the least important one.
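The ranking step can be illustrated with a small sketch. It assumes that importance is measured by how often a task accesses the annotated critical data structures; the function name rank_tasks, the access-trace input format, and the alphabetical tie-break are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set


def rank_tasks(accesses: Dict[str, List[str]], critical: Set[str]) -> List[str]:
    """Order task names from most to least important.

    accesses maps a task name to the data structures it touches, with one
    list entry per access; critical is the set of annotated structures.
    (Hypothetical input format, chosen only for this illustration.)
    """
    def critical_count(task: str) -> int:
        return sum(1 for name in accesses[task] if name in critical)

    # Descending by access count; ties broken by name for determinism.
    return sorted(accesses, key=lambda t: (-critical_count(t), t))
```

For example, with accesses {"t1": ["key", "buf", "key"], "t2": ["buf"], "t3": ["key"]} and critical set {"key"}, the tasks would be considered for duplication in the order t1, t3, t2.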

Our Approach
• Scheduling step: The OS scheduler is modified so that, whenever there is an opportunity, it duplicates tasks that run on the FPGA device. Whenever the scheduler predicts that the QoS limit is about to be reached, it stops duplicating tasks.
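A simplified sketch of the duplication decision is given below. It is not the authors' scheduler: the real one works dynamically alongside normal scheduling, whereas this example makes a single greedy pass over already-ranked tasks. The names Task and choose_duplicates, the per-task dup_cost estimate, and the additive degradation model are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    clbs: int        # FPGA area one copy needs, in CLBs (assumed unit)
    dup_cost: float  # predicted slowdown if duplicated, as a fraction


def choose_duplicates(ranked: List[Task], free_clbs: int,
                      qos_limit: float) -> List[str]:
    """Pick tasks to duplicate, most important first.

    Stops as soon as the predicted total degradation would exceed
    qos_limit; tasks that do not fit in the free area are skipped.
    """
    chosen: List[str] = []
    degradation = 0.0
    for task in ranked:
        if degradation + task.dup_cost > qos_limit:
            break  # QoS limit about to be reached: stop duplicating
        if task.clbs <= free_clbs:
            free_clbs -= task.clbs
            degradation += task.dup_cost
            chosen.append(task.name)
    return chosen
```

Walking the list in rank order means the space and the QoS budget are spent on the most critical tasks first, which matches the stated goal of providing reliability for critical tasks first.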

Experimental Setup
• An error injection module injects errors with a specified probability
• Two real-life embedded applications: encr and usonic
• The performance of our reliability-aware scheduler is compared with that of a normal Shortest-Job-First scheduler
• Tolerate at most 5% performance degradation
• Rank tasks according to the frequency of accesses to critical data
• Fatal errors: Errors that would lead to a crash of the application
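An error injection module of the kind mentioned in the first bullet might look like the sketch below. The slides give only the fact that errors are injected with a specified probability; the single-bit-flip fault model, the 32-bit word width, and the name inject_error are assumptions made for this example.

```python
import random


def inject_error(value: int, probability: float, rng: random.Random,
                 width: int = 32) -> int:
    """With the given probability, flip one randomly chosen bit of value.

    Models a soft error as a single bit flip in a width-bit word
    (an assumed fault model, not one stated in the slides).
    """
    if rng.random() < probability:
        bit = rng.randrange(width)
        return value ^ (1 << bit)
    return value
```

Passing in a seeded random.Random instance keeps an injection campaign reproducible, so the same error pattern can be replayed against both the baseline and the reliability-aware scheduler.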

Experimental Data

Ongoing Work
• Experimenting with a diverse set of benchmarks
• Implementing task duplication within other types of OS schedulers such as First-Come-First-Served

Conclusion
• The OS scheduler tries to provide reliability through task duplication under QoS guarantees
• Improving FPGA space utilization by duplicating for reliability
• Providing reliability for critical tasks first
• Catching most fatal errors
