ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering
Low Level Fault-Tolerance: Watchdog and Re-execution

Overview
• Introduction
• Watchdog techniques
  • Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
• Re-execution for fault-tolerance
  • Basic techniques: RESO concept, program re-execution, instruction re-execution
  • Case studies: fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
• Summary

Introduction
• References
  • Watchdog - [mahm:88]
  • Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
  • Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.

Introduction (contd.)
• Somewhat higher level than ECC and masking at circuit level
• Bordering between hardware and software (hardware often assisted by software)
• These are some of the very first fault-tolerance methods

Watchdog techniques
[Figure: a processor monitored by a watchdog]
• Key concept
  • The actions of a process or processor are checked by another unit (normally hardware). The checks include whether the process is still active and alive, and whether it is executing an incorrect path.

Watchdog: Timers
[Figure: processor, timer, and error output]
• Check for aliveness
  • Processor resets the timer at certain intervals or on certain conditions
  • Timer raises an error flag if it is not reset before it overruns
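The aliveness check can be illustrated with a small simulation. This is only a sketch under stated assumptions: time is modeled as discrete ticks, and the names `WatchdogTimer`, `kick`, `tick`, and `run` are hypothetical, not taken from the course material.

```python
class WatchdogTimer:
    """Counts down on each tick; the monitored processor must reset ("kick") it in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.error = False  # error flag, latched once the timer overruns

    def kick(self):
        """Called by the processor to show it is still alive."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one time unit; latch the error flag if the timer overruns."""
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            self.error = True


def run(kick_schedule, timeout_ticks):
    """Simulate a run; kick_schedule[i] is True if the processor kicks at step i."""
    wd = WatchdogTimer(timeout_ticks)
    for kicked in kick_schedule:
        if kicked:
            wd.kick()
        wd.tick()
    return wd.error


healthy = run([True, False, True, False, True, False], timeout_ticks=3)
hung = run([True, False, False, False], timeout_ticks=3)
print(healthy, hung)
```

In this sketch a processor that kicks every other tick never trips the flag, while one that stops kicking trips it after three ticks.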

Watchdog: Timers (contd.)
[Figure: Processor A, Processor B, and a timer]
• Check for timeout
  • One processor sends a message and starts a timer; the second processor must reply before the timer expires (hardware/software implementation)
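A software version of the timeout check might look like the following sketch. It assumes the second processor is modeled as a thread and the message channel as a `queue.Queue`; the function names `request_with_timeout`, `healthy_b`, and `dead_b` are hypothetical.

```python
import queue
import threading


def request_with_timeout(responder, message, timeout_s):
    """Send message to responder and wait at most timeout_s seconds for a reply.

    Returns the reply, or None if the timer expires first.
    """
    replies = queue.Queue()
    # Processor B runs in its own thread; it may or may not answer.
    worker = threading.Thread(target=responder, args=(message, replies), daemon=True)
    worker.start()
    try:
        return replies.get(timeout=timeout_s)  # the timer starts here
    except queue.Empty:
        return None  # timeout: treat processor B as faulty or unreachable


def healthy_b(message, replies):
    """Stand-in for a working processor B: acknowledges the message."""
    replies.put("ack:" + message)


def dead_b(message, replies):
    """Stand-in for a failed processor B: never replies."""
    pass
```

Calling `request_with_timeout(dead_b, "ping", 0.05)` returns `None`, which the sender can treat as an error indication or as a cue to repeat the message.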

Watchdog: Timers (contd.)
• Applications
  • Processor control systems (chemical, mechanical and other control systems)
  • Switching systems – messages sent or received often wait a certain length of time before they are repeated
  • Networks – email messages often have timeouts associated with them

Watchdog: Processors
• Architecture – can be complex, but let us consider the following simple architecture (observer)
[Figure: processor and memory connected by a bus (data, address, control), with a watchdog observing the bus]

Watchdog: Processors (contd.)
• What can it achieve?
  • Observe the address bus
  • Can observe the data
  • Can observe instructions
  • Can check the flow of program control
• Need to know what kind of errors can occur to determine the capability of this method

Watchdog: Error models
• Experimental setup to develop error models applicable at this level
  • Processor-memory architecture
  • Inject faults (random errors): in the I/O processor, within the processor (register file, states), within memory
  • Simulate
  • Hardware was also designed to inject such faults and study their impact/behavior

Watchdog: Error models (contd.)
• Conclusions of the studies
  • Program flow could change (branch to no branch, or vice versa)
  • Instruction fetched from data space
  • Access to nonexistent memory space
  • Data fetched from instruction space
  • Illegal instruction
  • Writing in protected area (ROM)
  • 60% of all faults could be detected by monitoring control flow – thus we need to develop methods that are good at monitoring control flow

Watchdog: Control flow checking
• Basic principle
  • Analyze the program and extract control information
    • Branch free intervals
    • Subroutine calls
  • Assign signatures to branch free intervals and provide these signatures to the watchdog processor, which checks them

Watchdog: Control flow checking (contd.)
• A simple example

  Program              Watchdog
  start  ------------> receive start
  branch free code     observe bus continuously to form signature
  check sig X  ------> check X against collected signature
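The start / observe / check sequence of the simple example can be sketched as follows. This is an illustration only: the 16-bit rotate-and-XOR signature, the instruction words, and the names `signature`, `Watchdog`, and `run_interval` are assumptions, not the scheme used in the referenced work.

```python
def signature(words):
    """Fold a sequence of instruction words into a 16-bit signature (rotate-and-XOR)."""
    sig = 0
    for w in words:
        sig = ((sig << 1) | (sig >> 15)) & 0xFFFF  # rotate left by one bit
        sig ^= w & 0xFFFF
    return sig


class Watchdog:
    def __init__(self):
        self.collected = []

    def receive_start(self):
        """Start of a branch free interval: begin a fresh signature."""
        self.collected = []

    def observe(self, word):
        """Record one instruction word seen on the bus."""
        self.collected.append(word)

    def check(self, expected_sig):
        """Compare the run-time signature with the one supplied by the program."""
        return signature(self.collected) == expected_sig


def run_interval(instructions, expected_sig, corrupt_index=None):
    """Execute one branch free interval; optionally flip one bit of one fetched word."""
    wd = Watchdog()
    wd.receive_start()
    for i, word in enumerate(instructions):
        if i == corrupt_index:
            word ^= 0x0004  # hypothetical single-bit error on the bus
        wd.observe(word)
    return wd.check(expected_sig)


block = [0x1A2B, 0x3C4D, 0x5E6F]   # made-up instruction words
compiled_sig = signature(block)     # computed ahead of time, e.g. by the compiler
```

With these definitions, `run_interval(block, compiled_sig)` passes, while `run_interval(block, compiled_sig, corrupt_index=1)` fails the check.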

Watchdog: Control flow checking (contd.)
• Details and variations
  • Structural integrity checking
    • Analyze the program control flow – create a program control flow graph
    • Assign a unique identifier to each node of the graph
    • Provide the control flow graph to the watchdog along with the identifiers
    • In case of branches, the watchdog expects one of several possible identifiers
    • Limitations
      • Performance impact – insertion of special instructions
      • Inability to detect data processing variations (e.g., add changed to sub)
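Structural integrity checking can be sketched as a successor lookup in the control flow graph. The graph, the node identifiers, and the function name `check_trace` below are hypothetical and chosen only for illustration.

```python
# Hypothetical control flow graph: node identifier -> set of legal successor identifiers.
CFG = {
    "A": {"B", "C"},  # conditional branch: either successor is legal
    "B": {"D"},
    "C": {"D"},
    "D": set(),       # exit node
}


def check_trace(cfg, trace):
    """Return the index of the first illegal transition in trace, or None if all are legal."""
    for i in range(1, len(trace)):
        prev, cur = trace[i - 1], trace[i]
        if cur not in cfg.get(prev, set()):
            return i
    return None
```

In this sketch the trace `["A", "D"]` is flagged at index 1 because D is not a successor of A. A branch that goes the wrong way but still lands on a legal successor (A to C instead of A to B) passes the check, as does any error that changes only the data computed inside a node.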

Watchdog: Control flow checking (contd.)
• Details and variations (contd.)
  • Derived signature checking
    • Compiler identifies branch free intervals and generates signatures (such as a checksum) for these intervals
    • At run time these signatures are provided to the watchdog, using tag bits to differentiate between regular instructions and watchdog messages
    • Watchdog monitors the bus, generates its own signatures, and compares them with the signatures captured from the bus (compiled signatures)
    • Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures – when a tag for a signature appears on the bus, the watchdog captures the tag and forces a NOP on the bus for the regular processor

Watchdog: Control flow checking (contd.)
• Details and variations (contd.)
  • Derived signature checking (contd.)
    • Coverage
      • Can detect random errors in instructions in branch free intervals (but aliasing can occur)
    • Overheads
      • Memory width increase due to tag bits
      • Memory increase due to signature insertions
      • Performance impact due to NOPs
    • Solutions
      • Using path signature method – reduces the number of signatures needed
      • Branch address hashing – merge signature and branch address
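Aliasing means that a corrupted instruction sequence produces the same signature as the correct one. The sketch below shows this with a deliberately weak additive checksum; the checksum and the instruction words are made up for illustration, and practical schemes use stronger signatures to make aliasing less likely.

```python
def additive_checksum(words):
    """Sum of instruction words modulo 2**16 (a deliberately weak signature)."""
    return sum(words) & 0xFFFF


original = [0x1200, 0x0034, 0x5000]
reordered = [0x0034, 0x1200, 0x5000]     # two words swapped
compensating = [0x1201, 0x0033, 0x5000]  # two errors that cancel in the sum
single_error = [0x1200, 0x0035, 0x5000]  # one word off by one

aliases = [seq for seq in (reordered, compensating, single_error)
           if additive_checksum(seq) == additive_checksum(original)]
```

Here `aliases` contains the reordered and compensating sequences: both escape detection, while the single error is caught.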

Watchdog: Memory access and assertion checks
• What to do about memory/data errors
  • Use ECC
  • A few other methods using a watchdog
    • Check for nonexistent memory addresses
    • Check for out of range addresses
    • Capability based checking for objects is also possible
  • Assertion based checking and sanity checks using a watchdog (independent hardware) are also possible
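The address and assertion checks can be sketched in software as below. The memory map, the address ranges, and the names `check_access` and `check_assertion` are hypothetical; a watchdog would perform the equivalent checks in independent hardware.

```python
# Hypothetical memory map: (start, end_exclusive, kind).
MEMORY_MAP = [
    (0x0000, 0x4000, "rom"),
    (0x8000, 0xC000, "ram"),
]


def check_access(address, is_write):
    """Return None if the access is legal, else a short error description."""
    for start, end, kind in MEMORY_MAP:
        if start <= address < end:
            if is_write and kind == "rom":
                return "write to protected area"
            return None
    return "nonexistent memory address"


def check_assertion(name, value, low, high):
    """Sanity check: flag a value outside its expected range."""
    if not (low <= value <= high):
        return f"assertion failed: {name}={value} outside [{low}, {high}]"
    return None
```

For example, `check_access(0x5000, False)` reports a nonexistent address because 0x5000 falls in the gap between the two regions of this assumed map.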

Re-execution for fault-tolerance
• Key concept
  • Execute a program/instruction twice (or more times) and then compare the results
  • A time redundancy technique, but if multiple hardware platforms are available, it is a hardware redundancy technique
  • Can detect transient faults, but it can also be employed to detect some permanent faults (see RESO next) even if the same hardware is used
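A minimal sketch of execute-twice-and-compare follows. The transient fault is simulated by a hypothetical `FlakyAdder` that flips one result bit on a chosen call; the names `checked` and `FlakyAdder` are illustrative only.

```python
def checked(compute, *args, runs=2):
    """Run compute several times on the same inputs and compare the results."""
    results = [compute(*args) for _ in range(runs)]
    if any(r != results[0] for r in results[1:]):
        raise RuntimeError(f"mismatch between executions: {results}")
    return results[0]


class FlakyAdder:
    """Adder with a hypothetical transient fault on one chosen invocation."""

    def __init__(self, faulty_call=None):
        self.calls = 0
        self.faulty_call = faulty_call

    def __call__(self, a, b):
        self.calls += 1
        result = a + b
        if self.calls == self.faulty_call:
            result ^= 1  # single bit flip, present only during this call
        return result
```

`checked(FlakyAdder(), 2, 3)` returns 5, while `checked(FlakyAdder(faulty_call=2), 2, 3)` raises because the second execution disagrees with the first. A fault that affects both executions identically would not be caught by this comparison.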

Re-execution: Basic Techniques
• RESO concept
  • Re-execution of an instruction with shifted operands
  • Already discussed early in the course
  • Can detect transient faults
  • Can also detect many permanent faults
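The sketch below illustrates why shifting the operands helps with a permanent fault on the same hardware. It assumes a hypothetical adder whose output bit 2 is stuck at 0; the function names and the fault model are illustrative, and overflow from the shift is ignored because Python integers are unbounded.

```python
STUCK_BIT = 2  # hypothetical permanent fault: output bit 2 of the adder stuck at 0


def adder(a, b, faulty):
    """Add two integers; if faulty, force output bit STUCK_BIT to 0."""
    result = a + b
    if faulty:
        result &= ~(1 << STUCK_BIT)
    return result


def plain_reexecution_detects(a, b, faulty):
    """Re-execute with the same operands and compare."""
    return adder(a, b, faulty) != adder(a, b, faulty)


def reso_detects(a, b, faulty, shift=1):
    """Re-execute with operands shifted left, shift the result back, and compare."""
    first = adder(a, b, faulty)
    second = adder(a << shift, b << shift, faulty) >> shift
    return first != second
```

For operands 5 and 1 the faulty adder returns 2 on the first pass and 4 after the shifted pass, so `reso_detects(5, 1, True)` is true, whereas plain re-execution sees 2 both times and misses the fault. Operands that never exercise the stuck bit in either pass (for example 1 and 0) still go undetected.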

Re-execution: Basic Techniques (contd.)
• Program re-execution
  • Make two copies of the program
  • Execute them serially
    • Can use RESO if the hardware platform is the same for both executions
  • Execute them in parallel if sufficient hardware redundancy is available
  • May take twice as long or twice the hardware
  • When/how to compare: impacts the system complexity
  • Performance impact
    • Serial computation: high latency
    • Parallel computation: complex implementation, and hence possible loss of performance

Re-execution: Basic Techniques (contd.)
• Instruction re-execution – fine grain parallelism
  • Re-execute every instruction on the same or different hardware, depending upon the redundancy available
  • May use RESO if the same hardware is used for instruction re-execution
  • If sufficient resources are available, this method may have little impact on performance

Re-execution: Case studies
• Introduction to case studies
  • CRAY
    • Instruction re-execution
  • SMT architecture
    • Two copies of the program are interleaved as two threads for simultaneous execution
  • Multiscalar architecture
    • Two copies of the program are executed on many processing elements simultaneously
  • Chip multiprocessor
    • With critical value forwarding (DSN-2010)

Re-execution: Case studies (contd.)
• CRAY
  • Instruction re-execution
  • Duplication of instruction in hardware
  • Sufficient resources and pipelining available for re-execution without doubling the execution time
  • Consider a generic fine grain parallel architecture (OH)
  • Consider executing a code segment (OH)
  • Now look at ways of duplicating instructions and executing original and duplicated instructions (OH)
  • Some experimental results

Re-execution: Case studies (contd.)
• AR-SMT
  • High level view of the technique (OH)
  • Concept of execution (Active) streams
  • Re-execution of the instruction stream – Redundant stream
  • Issue of delay buffer length and latency
  • Implementation issues and coverage
  • Performance impact
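The interplay of the active stream, the redundant stream, and the delay buffer can be illustrated with a loose software analogy. This is not the AR-SMT microarchitecture: the function `run_ar_smt`, the per-input granularity, and the fault injection are all assumptions made for the sketch.

```python
from collections import deque


def run_ar_smt(program, inputs, buffer_len, fault_at=None):
    """Run an active stream ahead of a redundant stream.

    Returns the index of the first mismatching result, or None if none is found.
    """
    delay_buffer = deque()  # results from the active stream awaiting verification

    def verify_oldest():
        index, a_result = delay_buffer.popleft()
        r_result = program(inputs[index])  # redundant stream re-executes
        return index if r_result != a_result else None

    for i, x in enumerate(inputs):
        result = program(x)  # active stream
        if i == fault_at:
            result ^= 1  # hypothetical transient fault in the active stream
        if len(delay_buffer) == buffer_len:
            # Buffer full: the active stream waits until the oldest entry is verified.
            bad = verify_oldest()
            if bad is not None:
                return bad
        delay_buffer.append((i, result))

    while delay_buffer:  # drain the remaining entries
        bad = verify_oldest()
        if bad is not None:
            return bad
    return None


def square(x):
    return x * x
```

In this sketch a longer buffer lets the active stream run further ahead, but an error is reported only when the redundant stream reaches the corresponding entry, so detection latency grows with the buffer length.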

Re-execution: Case studies (contd.)
• Multiscalar
  • Concept of control flow graph (OH)
  • Basic architecture (OH)
  • Static division of PUs and performance impact (OH)
  • Dynamic division of PUs and performance impact (OH)

Re-execution: Case studies (contd.)
• Chip Multiprocessor (see slide set)
  • Intro
  • Design overview and concept
  • Evaluation
  • Conclusion

Watchdog and Re-execution: Comments
• Concepts discussed here can be used to design high performance processors
• Performance improvement via speculation
  • Have a very high performance speculative processor
  • Verify the control flow using a watchdog, or use a second processor to fully verify the stream executed by the speculative processor
  • This will lead to a processor with high performance (throughput), albeit with high latency

Summary
• Watchdog
  • Timer
  • Processor
  • Control flow checking
• Re-execution
  • Basic techniques
  • Case studies: CRAY, AR-SMT, Multiscalar
