
Self-Stabilization Copes with Soft-Errors

Self-Stabilization Copes with Soft-Errors. Shlomi Dolev, Dagstuhl 2008. Trustworthy Systems: Why is it So Hard? Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected" http://larch-www.lcs.mit.edu:8001/~corbato/turing91/


Presentation Transcript


  1. Self-Stabilization Copes with Soft-Errors. Shlomi Dolev, Dagstuhl 2008

  2. Trustworthy Systems: Why is it So Hard? • Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected" http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ • "You must pay extreme attention to detail here. One wrong bit will make things fail…" http://my.execpc.com/~geezer/os/pm.htm • From Pentium’s manual: "… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition"

  3. Mars Rover - Spirit • …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems… The operating system is Wind River Systems' VxWorks… • …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended… • …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot… http://www.eetimes.com/sys/news/OEG20040220S0046

  4. Linux and Windows do not Stabilize

  5. Self-Stabilization • Self-healing, Self-managing, Self-* • Recovery Oriented Computing [Berkeley, Stanford] • Autonomic Computing [IBM] • Self-Stabilization • Self-Stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra’74]

  6. Well Established Theory !

  7. Self-Stabilization • The combination and type of faults cannot be totally anticipated in on-going systems • Any on-going system must be self-stabilizing (or manually monitored)

  8. First Self-Stabilizing Algorithm: Token Passing [Dij74]

  9. Token Passing
       P1: do forever
             if x_1 = x_n then
               x_1 := (x_1 + 1) mod (n + 1)
       Pi (i ≠ 1): do forever
             if x_i ≠ x_{i-1} then
               x_i := x_{i-1}

  10. Token Passing Cont. {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; {1; 1; 0; 0; 0}; {1; 1; 1; 0; 0}; {1; 1; 1; 1; 0}; {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; {2; 2; 1; 1; 1}; {2; 2; 2; 1; 1}; {2; 2; 2; 2; 1}; {2; 2; 2; 2; 2} … • Surely works when we start in x1 = x2 = … = xn = 0. • One processor may change a state at a time.

  11. Token Passing: Faults • Transient faults, soft errors, wrong CRC, unexpected temporary severe conditions, etc. • Such a fault may assign each processor an arbitrary state (within the range of its state space). • For example {3; 4; 4; 1; 0}. • p2, p4, and p5 have tokens! • Will the system ever recover?

  12. Token Passing: Automatic Recovery • Claim: p1 changes state infinitely often. • Otherwise, let s1 be the fixed state of p1; • p2 eventually copies s1 from p1, then • p3 eventually copies s1 from p2, then ... • pn eventually copies s1 from pn-1, and then • p1 changes state, a contradiction. • p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...

  13. Token Passing: Automatic Recovery Cont. • In any initial configuration at least one state is missing, since n processors hold at most n of the n+1 possible states. For example, in {4; 4; 1; 0; 2}, states 3 and 5 are missing. • Once p1 reaches a missing state, e.g., 5, all the other processors must copy 5 before p1 reads 5 from pn and changes state to 0.
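The recovery argument can be tried out with a small simulation. The following is a minimal sketch in Python, not code from the talk: the function names (`privileged`, `step`, `run_until_single_token`) and the random choice of which token holder moves next are assumptions made for illustration. Processors are indexed from 0, so p1 is index 0.

```python
import random


def privileged(x):
    """Return the 0-based indices of the processors that currently hold a token."""
    n = len(x)
    tokens = [0] if x[0] == x[n - 1] else []
    tokens += [i for i in range(1, n) if x[i] != x[i - 1]]
    return tokens


def step(x, i, modulus):
    """Let processor i execute one iteration of its loop (in place)."""
    if i == 0:
        if x[0] == x[-1]:
            x[0] = (x[0] + 1) % modulus
    elif x[i] != x[i - 1]:
        x[i] = x[i - 1]


def run_until_single_token(x, modulus, rng, max_steps=10_000):
    """Move randomly chosen token holders until exactly one token remains."""
    x = list(x)
    for steps in range(max_steps):
        tokens = privileged(x)
        if len(tokens) == 1:
            return x, steps
        step(x, rng.choice(tokens), modulus)
    raise RuntimeError("did not reach a single-token configuration")


if __name__ == "__main__":
    start = [3, 4, 4, 1, 0]  # the faulty configuration from the Faults slide
    final, steps = run_until_single_token(start, modulus=6, rng=random.Random(0))
    print(start, "->", final, "after", steps, "steps")
```

With n = 5 and modulus n + 1 = 6, every run should end in a configuration with a single token, whichever token holder is picked at each step.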

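  14. Will It Stabilize With mod (n - 2)? • Mod 3 (n = 5); each arrow is labeled with the processor that moves: {0,0,2,1,0} →p1 {1,0,2,1,0} →p5 {1,0,2,1,1} →p4 {1,0,2,2,1} →p3 {1,0,0,2,1} →p2 {1,1,0,2,1} • The last configuration is the first one with +1 mod 3 added to every entry, so the same schedule can repeat forever and the system never stabilizes!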
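The mod 3 execution on this slide can be replayed mechanically. The sketch below is illustrative only (the helper names `has_token`, `move`, and `replay` are assumptions, and processors are indexed from 0). It checks that each scheduled processor really holds a token when it moves and records every configuration visited.

```python
def has_token(x, i):
    """p1 (index 0) holds a token if x1 = xn; any other pi if xi != x(i-1)."""
    return x[0] == x[-1] if i == 0 else x[i] != x[i - 1]


def move(x, i, modulus):
    """Return the configuration after processor i takes its step."""
    assert has_token(x, i), "only a token holder may change state"
    y = list(x)
    y[i] = (y[0] + 1) % modulus if i == 0 else y[i - 1]
    return y


def replay(x, schedule, modulus):
    """Return the list of configurations visited under the given schedule."""
    trace = [list(x)]
    for i in schedule:
        trace.append(move(trace[-1], i, modulus))
    return trace


if __name__ == "__main__":
    one_round = [0, 4, 3, 2, 1]  # p1, p5, p4, p3, p2
    for config in replay([0, 0, 2, 1, 0], one_round * 3, modulus=3):
        tokens = sum(has_token(config, i) for i in range(len(config)))
        print(config, "tokens:", tokens)
```

Repeating the round three times should bring the ring back to its starting configuration, with more than one token present throughout.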

  15. Is Self-Stabilization a Toy?

  16. Stabilization Stack • Self-Stabilizing Microprocessor [DH04, DH06] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer for Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]

  17. Lower Bound • Analysis of circuit resiliency to Soft-Errors. • Complexity of logical masking analysis.

  18. Masking Soft Errors • Essential for reducing the frequency and duration of stabilization periods. • Needed: a scheme for analyzing the expected effect of soft errors on a given circuit. • Current solutions use simulations with fault injection. • The estimation has a two-sided error. • No feedback regarding “problematic” portions of the circuit.

  19. Analyzing Soft-Error Resiliency • Scope (motivated by, e.g., pipeline architecture): [Figure: input latches, logic gates, output latches; labels: current time, gate threshold, SEU duration, electrical pulse, logical pulse]

  20. Input Crucial Times (ICT) • Not all soft errors affect the latched result. • Given a gate/latch u, we define ICT(u): • The times at which it is crucial that u receives correct input.

  21. Putting it all together • Compute a topological sort of the circuit graph • For each node (in backward topological order): • If then • Else • Compute for each node u. • = Pr[u is not affected at its crucial time]

  22. Talk Outline • Analysis of circuit resiliency to Soft-Errors. • Complexity of logical masking analysis.

  23. Why is it only a bound? • The above algorithm does not take logical masking into account: • An incorrect computation in the internal gates that does not result in an incorrect output. • Consider the formula below: When : • A formula may favor certain inputs.

  24. Logical Masking Analysis Complexity • Is there an efficient algorithm that considers logical masking? No, unless P=NP! • Instance: Formula implementing a Boolean function , where each gate g succeeds with probability , and a threshold . • Question: is there an input such that: • The problem is NP-complete.

  25. NP-Hardness proof. • By reduction from SAT. • Consider a formula . • Note: • If then there exists an input x such that: • Otherwise, for all inputs x:

  26. Towards Hamming Processor. Shlomi Dolev, Sergey Frenkel

  27. Error protecting coded operations • Incorporating error correcting schemes in logic/arithmetic processing. • Executing all logic/arithmetic operations while maintaining the Hamming distance. • Traditional approaches, in fact, protect only the result of the operations. • Given inputs whose accumulated Hamming distance is t, the result is at Hamming distance at most t+e, where e is the number of soft errors that occur during the computation.

  28. Bitwise And Preserving/Reducing the Hamming Distance
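As a small illustration of the "accumulated Hamming distance" idea, the sketch below checks exhaustively, for short bit vectors, that a plain bitwise AND never produces an output error larger than the sum of its input errors. This is not the coded AND construction from the talk, whose details are not in this transcript; it only exercises the distance-preservation property on uncoded words. The function names and the default width are assumptions.

```python
from itertools import product


def hamming(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")


def and_bound_holds(width=4):
    """Check d(a & b, a2 & b2) <= d(a, a2) + d(b, b2) for all width-bit words.

    (a, b) play the role of the correct inputs and (a2, b2) the corrupted ones.
    """
    values = range(1 << width)
    return all(
        hamming(a & b, a2 & b2) <= hamming(a, a2) + hamming(b, b2)
        for a, a2, b, b2 in product(values, repeat=4)
    )


if __name__ == "__main__":
    print("bound holds for 4-bit words:", and_bound_holds(4))
```

The bound follows because an output bit can differ only in a position where at least one input bit differs.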

  29. Full Adder Preserving/Reducing the Hamming Distance

  30. Comparison with TMR • The TMR approach is simple, but seems too expensive compared with classical error correcting codes (e.g. Hamming codes, Reed-Solomon codes), which do not duplicate bits but use the Hamming distance between code words to correct errors. • In general, TMR does not mask soft errors, since a soft error may change the result of the last gate of the TMR circuit. • In contrast to the Hamming processor approach, TMR implies a probability of incorrect output even when a single corruption takes place.
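The point about the last gate can be made concrete with a toy model. This is a hedged sketch, not a circuit from the talk: `majority` and `tmr` are illustrative names, each copy computes a single output bit, and a soft error is modeled as one bit flip either in a copy's output or in the voter's output.

```python
def majority(a, b, c):
    """Bitwise majority vote of three values."""
    return (a & b) | (a & c) | (b & c)


def tmr(f, x, faulty_copy=None, voter_fault=False):
    """Run three copies of f on x and vote on the results.

    faulty_copy: index (0..2) of a copy whose 1-bit output is flipped, or None.
    voter_fault: if True, the output of the voter itself (the last gate) is flipped.
    """
    outputs = [f(x) & 1 for _ in range(3)]
    if faulty_copy is not None:
        outputs[faulty_copy] ^= 1
    result = majority(*outputs)
    return result ^ 1 if voter_fault else result


if __name__ == "__main__":
    parity = lambda v: bin(v).count("1") & 1
    print("fault in one copy :", tmr(parity, 0b1011, faulty_copy=1))
    print("fault in the voter:", tmr(parity, 0b1011, voter_fault=True))
```

A single flip in one copy is outvoted, while a single flip at the voter reaches the output unmasked.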

  31. Talk Outline • Analysis of circuit resiliency to Soft-Errors. • Complexity of logical masking analysis. • Methodologies for validating Self-Stabilization property of a microprocessor.

  32. Self-Stabilizing Microprocessor • Self-stabilizing algorithms assume that the microprocessor executes them. • Soft errors may cause the microprocessor to get stuck in a faulty state. • A microprocessor self-stabilizes if: • Started in any internal state, it converges in a finite number of steps into the set of safe states. • Safe states are states from which the microprocessor behaves as it should. • The definition of the desired behavior of the microprocessor is sensitive: • It is a function of the level of information hiding used in the instruction set specification.

  33. Our Test Case – Mic-1 • Presented in Tanenbaum’s book. [Figure: Mic-1 block diagram; labels: data path registers MAR, MDR, PC, MBR, SP, LV, CPP, TOS, OPC, H; micro-code controller with MIR and MPC; stack; 1-bit flip-flops Z, N; op code; address; control; data]

  34. Proving Convergence • The state space of the microprocessor – • Every possible assignment to the machine memory elements (including internal registers). • Safe states – • States in which the microprocessor behaves according to the specification. • Ultra-Safe states – • Subset of the safe states that is easily defined and frequently visited. • A “bad” cycle in the transition graph – • A cycle that does not pass through an ultra-safe state.

  35. Proving Convergence – cont. • We wish to validate that there exists no “bad” cycle in the transition graph of the microprocessor. • Too large! (We must explore the entire graph.) • Using an abstraction of the graph that includes the value of the micro-code program counter, we can validate that there are no “bad” cycles. [Figure: transition graph with nodes labeled a–l and A–F]
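On a graph small enough to hold in memory, the "no bad cycle" check reduces to a standard acyclicity test: delete the ultra-safe states and see whether the remaining graph still contains a cycle. The sketch below is a generic illustration of that check and makes no claim about the tool used in [DH04, DH06]; the adjacency-dict representation and the name `has_bad_cycle` are assumptions.

```python
def has_bad_cycle(transitions, ultra_safe):
    """Return True if some cycle avoids every ultra-safe state.

    transitions: dict mapping each state to an iterable of successor states.
    ultra_safe:  set of states that every cycle is required to pass through.
    """
    # Collect all states that are not ultra-safe (keys and successors alike).
    states = set(transitions)
    for successors in transitions.values():
        states.update(successors)
    states -= set(ultra_safe)

    # Kahn's algorithm on the remaining subgraph: it is acyclic exactly when
    # every state can be removed in topological order.
    indegree = {s: 0 for s in states}
    for s in states:
        for t in transitions.get(s, ()):
            if t in indegree:
                indegree[t] += 1
    ready = [s for s in states if indegree[s] == 0]
    removed = 0
    while ready:
        s = ready.pop()
        removed += 1
        for t in transitions.get(s, ()):
            if t in indegree:
                indegree[t] -= 1
                if indegree[t] == 0:
                    ready.append(t)
    return removed != len(states)


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["d"]}
    print(has_bad_cycle(graph, ultra_safe={"a", "d"}))
    print(has_bad_cycle(graph, ultra_safe={"a"}))
```

The same function could be applied to an abstracted graph, where each node stands for a set of concrete states.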

  36. Implementation Bottleneck • Ask Intel, AMD, IBM to design a self-stabilizing microprocessor… • Technology for converting an off-the-shelf processor into a self-stabilizing one [DH06] • Ask Microsoft, IBM, Red Hat to convert existing OS code to be self-stabilizing… • Stabilizing Virtual Machine [DY07]

  37. Enforcing stabilization by resetting • Processors behave correctly after reset • Periodic reset ensures correct behavior • But damages closure… • Need careful solutions

  38. Periodic Reset Monitor • Find a location P in the OS code that is reached at least once every T time units • At P: • Save necessary information to RAM • Request a reset and loop forever. • Stabilizing watchdog accepts the request and resets the processor • Upon reset: restore information and continue • Stabilizing watchdog verifies that a reset is performed at least once every T + ε time units
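The watchdog's role can be sketched as a counter that forces a reset within a bounded time from any counter value, including a corrupted one. This is a toy model under stated assumptions, not the mechanism of [DH06]: the class name `Watchdog`, the discrete time ticks, and the clamping of the counter are illustrative choices, and `deadline` (standing in for T + ε) is assumed to be stored where soft errors cannot change it.

```python
class Watchdog:
    """Toy stabilizing watchdog: resets on request or when the deadline passes."""

    def __init__(self, deadline):
        self.deadline = deadline  # assumed incorruptible (e.g. hard-wired)
        self.elapsed = 0          # may hold any value after a soft error

    def tick(self, reset_requested):
        """Advance one time unit; return True if the processor is reset now."""
        # Clamp before counting so that a corrupted counter (negative or huge)
        # cannot postpone the next reset beyond `deadline` ticks.
        self.elapsed = min(max(self.elapsed, 0), self.deadline) + 1
        if reset_requested or self.elapsed >= self.deadline:
            self.elapsed = 0
            return True
        return False


if __name__ == "__main__":
    dog = Watchdog(deadline=5)
    dog.elapsed = -1_000_000  # simulate a soft error in the counter
    print([dog.tick(reset_requested=False) for _ in range(6)])
```

Whatever value the counter starts with, a reset is issued within `deadline` ticks, which is the property the last bullet asks the watchdog to guarantee.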

  39. Self-Stabilization Preserving Compiler Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science Ben-Gurion University, Israel Mooly Sagiv, Department of Computer Science Tel Aviv University, Israel

  40. The Gap • Need a transformation between: • Input program P written in a high abstraction language, e.g., (D)ASM. • Output program Q in a machine language, say, JVM. • Existing compilers? • P and Q behave the same when started in the initial state. • What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?

  41. Trivial Example • A statement of the form: For each i in {0..9} do f(i) • May be compiled to:
          mov ax, 10
          mov cx, 0
        loop1:
          push cx
          call f
          inc cx
          cmp cx, ax
          jne loop1
      • Start with cx=12 inside the loop… • Moreover: Any runtime mechanism can get stuck / inconsistent.
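What happens when the loop is entered with cx=12 can be seen by simulating it. The sketch below is a simplified Python model, not a faithful x86 emulation: it assumes that f leaves ax and cx unchanged, it ignores the stack, and it models cx as a 16-bit register that wraps around on increment. The function name `run_loop` is hypothetical.

```python
def run_loop(cx, ax=10, width=16):
    """Simulate the loop body entered with the given cx; return f's arguments."""
    mask = (1 << width) - 1
    calls = []
    while True:
        calls.append(cx)       # push cx / call f
        cx = (cx + 1) & mask   # inc cx (wraps around at 2**width)
        if cx == ax:           # cmp cx, ax / jne loop1 falls through
            break
    return calls


if __name__ == "__main__":
    print("calls from cx=0 :", len(run_loop(0)))
    print("calls from cx=12:", len(run_loop(12)))
```

Started from cx=0, f is called ten times as intended. Started from cx=12, the exit test cx == ax is only met after cx wraps around, so f is called for values far outside {0..9}.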

  42. Stabilization Preserving Compiler – a closer look Ensuring that Q eventually behaves as P: • State space of P • State space of Q

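  43. The Transformation [Figure: a program consisting of variable declarations and rules "upon <condition_1> do <statement_1>" … "upon <condition_n> do <statement_n>" is transformed into code that enforces invariants and a scheduler over condition_1 … condition_n that runs Statement_1 … Statement_n]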

  44. Self-Stabilization Preserving Compiler: Summary • Front end of a compiler for ASM. • Self-stabilization preserving compiler. • Language with clear semantics from any state. • New demands for a compiler.
