Self-Stabilization Copes with Soft-Errors

Self-Stabilization Copes with Soft-Errors Shlomi Dolev Dagstuhl 2008

Trustworthy Systems: Why is it So Hard? • Corbató’91: “It almost goes without saying that ambitious systems never quite work as expected” http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ • “You must pay extreme attention to detail here. One wrong bit will make things fail…” http://my.execpc.com/~geezer/os/pm.htm • From Pentium’s manual: “… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition”

Mars Rover - Spirit • …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems… The operating system is Wind River Systems' Vx-Works… • …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended… • …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot http://www.eetimes.com/sys/news/OEG20040220S0046

Linux and Windows do not Stabilize

Self-Stabilization • Self-healing, Self-managing, Self-* • Recovery Oriented Computing [Berkeley, Stanford] • Autonomic Computing [IBM] • Self-Stabilization • Self-Stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra’74]

Well Established Theory !

Self-Stabilization • The combination and type of faults cannot be totally anticipated in on-going systems • Any on-going system must be self-stabilizing (or manually monitored)

First Self-Stabilizing Algorithm: Token Passing [Dij74]

Token Passing
    P1: do forever
        if x1 = xn then
            x1 := (x1 + 1) mod (n + 1)
    Pi (i ≠ 1): do forever
        if xi ≠ xi-1 then
            xi := xi-1

Token Passing Cont. {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; {1; 1; 0; 0; 0}; {1; 1; 1; 0; 0}; {1; 1; 1; 1; 0}; {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; {2; 2; 1; 1; 1}; {2; 2; 2; 1; 1}; {2; 2; 2; 2; 1}; {2; 2; 2; 2; 2} … • Surely works when we start in x1 = x2 = … = xn = 0. • One processor may change a state at a time.

Token Passing: Faults • Transient faults, soft errors, wrong CRC, unexpected temporary severe conditions, etc. • A fault may assign each processor an arbitrary state (in the range of its state space). • For example {3; 4; 4; 1; 0}. • p2, p4, and p5 have tokens! • Will the system ever recover?

Token Passing: Automatic Recovery • Claim: p1 changes state infinitely often. • Otherwise, let s1 be the fixed state of p1, • p2 eventually copies s1 from p1, then • p3 eventually copies s1 from p2, then ... • pn eventually copies s1 from pn-1, then • p1 changes state, a contradiction. • p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...

Token Passing: Automatic Recovery Cont. • In any initial state at least one state is missing, since there are n processors and n+1 possible states; e.g., in {4; 4; 1; 0; 2}, 3 and 5 are missing. • Once p1 reaches a missing state, e.g., 5, all the processors must copy 5 before p1 reads 5 from pn and changes state to 0.
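The recovery argument can be tried out with a small simulation. The following is a minimal sketch, not code from the talk: the function names, the 0-based indexing, and the random choice of which token holder moves next are all assumptions made for illustration.

```python
import random

def privileged(x):
    """Return the 0-based indices of processors that currently hold a token."""
    n = len(x)
    holders = []
    if x[0] == x[n - 1]:
        holders.append(0)            # p1 holds a token when x1 = xn
    for i in range(1, n):
        if x[i] != x[i - 1]:
            holders.append(i)        # pi holds a token when xi != xi-1
    return holders

def step(x, i):
    """Let processor i (0-based) execute one iteration of its loop, in place."""
    n = len(x)
    if i == 0:
        if x[0] == x[n - 1]:
            x[0] = (x[0] + 1) % (n + 1)
    elif x[i] != x[i - 1]:
        x[i] = x[i - 1]

def run_until_single_token(x, rng, max_steps=10_000):
    """Move one randomly chosen token holder at a time until one token is left."""
    for steps in range(max_steps):
        holders = privileged(x)
        if len(holders) == 1:
            return steps
        step(x, rng.choice(holders))
    raise RuntimeError("did not stabilize within max_steps")

state = [3, 4, 4, 1, 0]              # the faulty state used in the slides
print(privileged(state))             # [1, 3, 4], i.e. p2, p4 and p5
run_until_single_token(state, random.Random(0))
print(state, privileged(state))      # exactly one token holder remains
```

Only one processor moves per step, matching the assumption that one processor may change state at a time.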

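Will It Stabilize With mod (n - 2)? • Example with n = 5, mod 3: {0,0,2,1,0} → p1 moves → {1,0,2,1,0} → p5 moves → {1,0,2,1,1} → p4 moves → {1,0,2,2,1} → p3 moves → {1,0,0,2,1} → p2 moves → {1,1,0,2,1} • The final state is the initial state with +1 mod 3 applied to every entry, so the same schedule can repeat forever without converging!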
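The mod 3 counterexample can be replayed mechanically. This is a small sketch with hypothetical helper names (`step`, `token_count`, `replay`); it applies the schedule p1, p5, p4, p3, p2 from the slide and checks that the system returns to a shifted copy of its starting state while several tokens persist.

```python
def step(x, i, k):
    """Processor i (0-based) takes one step; p1 counts modulo k."""
    n = len(x)
    if i == 0:
        if x[0] == x[n - 1]:
            x[0] = (x[0] + 1) % k
    elif x[i] != x[i - 1]:
        x[i] = x[i - 1]

def token_count(x):
    """Number of processors that hold a token in state x."""
    n = len(x)
    return int(x[0] == x[n - 1]) + sum(x[i] != x[i - 1] for i in range(1, n))

def replay(start, schedule, k):
    """Apply the schedule to a copy of start and return every visited state."""
    x = list(start)
    trace = [list(x)]
    for i in schedule:
        step(x, i, k)
        trace.append(list(x))
    return trace

start = [0, 0, 2, 1, 0]
schedule = [0, 4, 3, 2, 1]                     # p1, p5, p4, p3, p2
trace = replay(start, schedule, k=3)
shifted = [(v + 1) % 3 for v in start]
print(trace[-1] == shifted)                    # True
print(min(token_count(s) for s in trace))      # 4
print(replay(start, schedule * 3, k=3)[-1] == start)  # True
```

Three rounds of the schedule bring the ring back to its exact starting state, so this execution never reaches a single-token configuration.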

Is Self-Stabilization a Toy?

Stabilization Stack • Self-Stabilizing Microprocessor [DH04, DH06] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer for Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]

Lower Bound • Analysis of circuit resiliency to Soft-Errors. • Complexity of logical masking analysis.

Masking Soft Errors • Essential for reducing the frequency and duration of stabilization periods. • A scheme is needed for analyzing the expected effect of soft errors on a given circuit. • Current solutions use simulations with fault injection: • The estimation has two-sided error. • No feedback regarding “problematic” portions of the circuit.

Analyzing Soft-Error Resiliency • Scope (motivated by, e.g., pipeline architecture): [Figure: logic gates between input latches and output latches; labels: current time, gate threshold, SEU duration, electrical pulse, logical pulse]

Input Crucial Times (ICT) • Not all soft errors affect the latched result. • Given a gate/latch u, we define ICT(u): • The times at which it is crucial that u receives correct input.

Putting it all together • Compute a topological sort of the circuit graph • For each node (in backward topological order): • If then • Else • Compute for each node u. • = Pr[u is not affected at its crucial time]

Talk Outline • Analysis of circuit resiliency to Soft-Errors. • Complexity of logical masking analysis.

Why is it only a bound? • The above algorithm does not take into account logical masking. • Incorrect computation in the internal gates that does not result in an incorrect output. • Consider the formula below: When : • A formula may favor certain inputs.

Logical Masking Analysis Complexity • Is there an efficient algorithm that considers logical masking? No, unless P=NP: the problem is NP-complete. • Instance: Formula implementing a Boolean function , where each gate g succeeds with probability , and a threshold . • Question: is there an input such that:

NP-Hardness proof. • By reduction from SAT. • Consider a formula . • Note: • If then there exists an input x such that: • Otherwise, for all inputs x:

Towards Hamming Processor Shlomi Dolev, Sergey Frenkel

Error protecting coded operations • Incorporating error correcting schemes in logic/arithmetic processing. • An execution of all logic/arithmetic operations while maintaining Hamming distance. • Traditional approaches deal, in fact, only with protecting the result of the operations. • Given inputs whose accumulated Hamming distance is t, the result is at Hamming distance at most t+e, where e is the number of soft errors that happen during the computation.

Bitwise And Preserving/Reducing the Hamming Distance
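The slide's construction is not reproduced in the text, so the sketch below only illustrates the underlying property for a plain (uncoded) bitwise AND: the Hamming distance between the correct and the corrupted result never exceeds the accumulated distance of the inputs. The function names and the 4-bit width are assumptions chosen so that the check can be exhaustive.

```python
from itertools import product

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def and_distance_bounded(width=4):
    """Exhaustively check d(a & b, a' & b') <= d(a, a') + d(b, b')."""
    values = range(1 << width)
    for a, a_bad, b, b_bad in product(values, repeat=4):
        result_distance = hamming(a & b, a_bad & b_bad)
        input_distance = hamming(a, a_bad) + hamming(b, b_bad)
        if result_distance > input_distance:
            return False
    return True

print(and_distance_bounded())   # True
```

The bound holds because an output bit can differ only at a position where at least one of the two inputs is corrupted.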

Full Adder Preserving/Reducing the Hamming Distance

Comparison with TMR (triple modular redundancy) • The TMR approach is simple, but seems too expensive compared with classical error correcting codes (e.g. Hamming codes, Reed-Solomon codes), which do not duplicate bits but use the Hamming distance between code words to correct errors. • In general, TMR does not mask soft errors, since a soft error may change the result of the last gate of the TMR circuit. • In contrast to the Hamming processor approach, TMR implies a probability of incorrect output even when a single corruption takes place.
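The point about the last gate can be seen in a toy model. This is a minimal sketch under assumed names (`majority`, `tmr`, and the fault parameters are hypothetical): three replicas compute a single-bit function, a majority voter combines them, and a single bit flip is injected either in one replica or at the voter output.

```python
def majority(a, b, c):
    """Bitwise majority of three values."""
    return (a & b) | (a & c) | (b & c)

def tmr(compute, x, replica_fault=None, voter_fault=False):
    """Run compute(x) three times and vote; optionally inject one bit flip."""
    outputs = [compute(x) for _ in range(3)]
    if replica_fault is not None:
        outputs[replica_fault] ^= 1      # soft error inside one replica
    voted = majority(*outputs)
    if voter_fault:
        voted ^= 1                       # soft error at the final voter gate
    return voted

low_bit = lambda x: x & 1                # hypothetical single-bit function

print(tmr(low_bit, 1))                   # 1: fault-free result
print(tmr(low_bit, 1, replica_fault=0))  # 1: the replica fault is masked
print(tmr(low_bit, 1, voter_fault=True)) # 0: the voter fault is not masked
```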

Talk Outline • Analysis of circuit resiliency to Soft-Errors. • Complexity of logical masking analysis. • Methodologies for validating Self-Stabilization property of a microprocessor.

Self-Stabilizing Microprocessor • Self-stabilizing algorithms assume that the microprocessor executes them. • Soft errors may cause the microprocessor to become stuck in a faulty state. • A microprocessor self-stabilizes if: • Started in any internal state, it converges in a finite number of steps into the set of safe states. • Safe states are states from which the microprocessor behaves as it should. • The definition of the desired behavior of the microprocessor is sensitive: • It is a function of the level of information hiding used in the instruction set specification.

Our Test Case – Mic-1 • Presented in Tanenbaum’s book. [Figure: Mic-1 data path with registers MAR, MDR, PC, MBR, SP, LV, CPP, TOS, OPC, H; micro-code controller with MIR and MPC; 1-bit flip-flops Z, N; stack; control, opcode and address lines]

Proving Convergence • The state space of the microprocessor – • Every possible assignment to the machine memory elements (including internal registers). • Safe states – • States in which the microprocessor behaves according to the specification. • Ultra-Safe states – • Subset of the safe states that is easily defined and frequently visited. • A “bad” cycle in the transition graph – • A cycle that does not pass through an ultra-safe state.

Proving Convergence – cont. • We wish to validate that there exists no “bad” cycle in the transition graph of the microprocessor. • Too large! (we must explore the entire graph) • Using an abstraction of the graph that includes the value of the micro-code program counter, we can validate that there are no “bad” cycles. [Figure: transition graph with concrete states a–l grouped into abstract states A–F]
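On a graph small enough to explore, the absence of “bad” cycles can be checked by deleting the ultra-safe states and testing what remains for any cycle. The sketch below assumes an adjacency-list graph in which every state appears as a key; the function name and the toy graph are hypothetical and unrelated to the Mic-1 transition graph.

```python
def has_bad_cycle(graph, ultra_safe):
    """True if some cycle of graph avoids every state in ultra_safe.

    graph maps each state to a list of successor states.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {u: WHITE for u in graph if u not in ultra_safe}

    def visit(u):
        color[u] = GREY                    # u is on the current DFS path
        for v in graph[u]:
            if v in ultra_safe:
                continue                   # paths through ultra-safe states are fine
            if color[v] == GREY:
                return True                # back edge: a cycle without ultra-safe states
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in color)

toy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["e"], "e": ["d", "a"]}
print(has_bad_cycle(toy, {"a"}))        # True: d -> e -> d avoids "a"
print(has_bad_cycle(toy, {"a", "d"}))   # False
```

The same check applies to an abstract graph, provided the abstraction keeps every cycle of the concrete graph visible.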

Implementation Bottleneck • Ask Intel, AMD, IBM to design a self-stabilizing microprocessor… • Technology for converting an off-the-shelf processor to be self-stabilizing [DH06] • Ask Microsoft, IBM, Red Hat to convert existing OS code to be self-stabilizing… • Stabilizing Virtual Machine [DY07]

Enforcing stabilization by resetting • Processors behave correctly after reset • Periodic reset ensures correct behavior • But damages closure… • Need careful solutions

Periodic Reset Monitor • Find a location P in the OS code that is reached at least once every T time units • At P: • Save the necessary information to RAM • Request a reset and loop forever. • The stabilizing watchdog accepts the request and resets the processor • Upon reset: restore the information and continue • The stabilizing watchdog verifies that a reset is performed at least once every T + ε time units

Self-Stabilization Preserving Compiler Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science Ben-Gurion University, Israel Mooly Sagiv, Department of Computer Science Tel Aviv University, Israel

The Gap. • Need a transformation between: • Input program P written in a high-abstraction language, e.g., (D)ASM. • Output program Q in a machine language, say, JVM. • Existing compilers? • P and Q behave the same when started in the initial state. • What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?

Trivial Example • A statement of the form: For each i in {0..9} do f(i) • May be compiled to:
        mov ax, 10
        mov cx, 0
    loop1:
        push cx
        call f
        inc cx
        cmp cx, ax
        jne loop1
• Start with cx=12 inside the loop… • Moreover: Any runtime mechanism can get stuck / inconsistent.
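The effect of starting with cx=12 can be counted with a small model of the loop. This is a sketch under stated assumptions: cx is treated as a 16-bit register that wraps on increment, f has no effect on the registers, and the "jb" variant (exit as soon as cx is not below ax) is added here only for contrast and is not taken from the talk.

```python
def run_loop(cx, ax=10, exit_test="jne", max_iters=100_000):
    """Count calls to f made by the compiled loop when entered with a given cx.

    Models: loop1: push cx; call f; inc cx; cmp cx, ax; <exit_test> loop1
    Returns None if the loop is still running after max_iters iterations.
    """
    calls = 0
    while calls < max_iters:
        calls += 1                        # push cx / call f
        cx = (cx + 1) & 0xFFFF            # inc cx wraps around at 16 bits
        if exit_test == "jne":
            again = cx != ax              # loop while cx differs from ax
        else:
            again = cx < ax               # "jb": loop while cx is below ax
        if not again:
            return calls
    return None

print(run_loop(0))                    # 10: the intended behavior
print(run_loop(12))                   # 65534: cx must wrap around to reach 10
print(run_loop(12, exit_test="jb"))   # 1: leaves the loop after one stray call
```

Under these assumptions the equality test recovers only after tens of thousands of calls to f with out-of-range arguments, which is the kind of behavior a stabilization-preserving compiler has to rule out.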

Stabilization Preserving Compiler – a closer look Ensuring that Q eventually behaves as P: • State space of P • State space of Q

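The Transformation [Figure: a program consisting of variable declarations and rules “upon <condition_1> do <statement_1>” … “upon <condition_n> do <statement_n>” is transformed into code that enforces invariants and runs a scheduler over condition_1 … condition_n, dispatching to Statement_1 … Statement_n]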

Self-Stabilization Preserving Compiler: Summary • Front end of compiler for ASM. • Self-stabilization preserving compiler. • Language with clear semantics from any state. • New demands for a compiler.
