
Stabilization Enabling Technology

Stabilization Enabling Technology. Shlomi Dolev. Trustworthy Systems: Why is it So Hard? Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected" http://larch-www.lcs.mit.edu:8001/~corbato/turing91/


Presentation Transcript


  1. Stabilization Enabling Technology Shlomi Dolev

  2. Trustworthy Systems: Why is it So Hard? • Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected" http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ • "You must pay extreme attention to detail here. One wrong bit will make things fail…" http://my.execpc.com/~geezer/os/pm.htm • From Pentium’s manual: "… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition"

  3. Mars Rover - Spirit • …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems…The operating system is Wind River Systems' Vx-Works.. • …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended… • …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot http://www.eetimes.com/sys/news/OEG20040220S0046

  4. Linux and Windows do not Stabilize

  5. Self-Stabilization • Self-healing, Self-managing, Self-* • Recovery Oriented Computing [Berkeley, Stanford] • Autonomic Computing [IBM] • Self-Stabilization • Self-Stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra’74]

  6. Well Established Theory !

  7. Self-Stabilization • The combination and type of faults cannot be totally anticipated in on-going systems • Any on-going system must be self-stabilizing (or manually monitored)

  8. First Self-Stabilizing Algorithm: Token Passing [Dij74]

  9. Token Passing
       P1: do forever
             if x1 = xn then
               x1 := (x1 + 1) mod (n + 1)
       Pi (i ≠ 1): do forever
             if xi ≠ xi-1 then
               xi := xi-1
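A minimal Python sketch of the two rules above may help when following the next slides. It assumes a central scheduler that activates one processor per step; the names `has_token` and `step` are illustrative and not from the talk, and indices are 0-based, so `x[0]` plays the role of x1.

```python
def has_token(x, i):
    """Return True if processor i (0-based; i == 0 is P1) is enabled."""
    n = len(x)
    if i == 0:
        return x[0] == x[n - 1]      # P1 rule: x1 = xn
    return x[i] != x[i - 1]          # Pi rule: xi != x(i-1)


def step(x, i):
    """Return the configuration after processor i runs one loop iteration."""
    n = len(x)
    y = list(x)
    if has_token(x, i):
        if i == 0:
            y[0] = (x[0] + 1) % (n + 1)   # P1 increments modulo n + 1
        else:
            y[i] = x[i - 1]               # Pi copies its predecessor
    return y


if __name__ == "__main__":
    x = [0, 0, 0, 0, 0]
    for _ in range(6):
        holders = [i for i in range(len(x)) if has_token(x, i)]
        print(x, "token at p%d" % (holders[0] + 1))
        x = step(x, holders[0])
```

Starting from the all-zero configuration, exactly one processor is enabled at each step, so the demo always activates that single token holder.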

  10. Token Passing Cont. {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; {1; 1; 0; 0; 0}; {1; 1; 1; 0; 0}; {1; 1; 1; 1; 0}; {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; {2; 2; 1; 1; 1}; {2; 2; 2; 1; 1}; {2; 2; 2; 2; 1}; {2; 2; 2; 2; 2} … • Surely works when we start in x1 = x2 = … = xn = 0. • One processor may change a state at a time.

  11. Token Passing: Faults • Transient faults, soft errors, wrong CRC, unexpected temporary severe conditions, etc. • A fault may assign each processor an arbitrary state (in the range of its state space). • For example {3; 4; 4; 1; 0}. • p2, p4, and p5 all have tokens! • Will the system ever recover?

  12. Token Passing: Automatic Recovery • Claim: p1 changes state infinitely often. • Otherwise, let s1 be the fixed state of p1, • p2 eventually copies s1 from p1, then • p3 eventually copies s1 from p2, then • ... • pn eventually copies s1 from pn-1, then • p1 changes state, a contradiction. • p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...

  13. Token Passing: Automatic Recovery Cont. • In any initial configuration at least one of the n+1 possible states is missing; e.g., in {4; 4; 1; 0; 2}, 3 and 5 are missing. • Once p1 reaches a missing state, e.g., 5, all the processors must copy 5 before p1 reads 5 from pn and changes state to 0.
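The recovery argument can be checked experimentally. The sketch below is an illustration, not code from the talk: it assumes a scheduler that picks a random enabled processor at each step, and the helper names (`enabled`, `fire`, `steps_to_single_token`) are hypothetical.

```python
import random


def enabled(x):
    """Indices (0-based) of the processors that currently hold a token."""
    n = len(x)
    return [i for i in range(n)
            if (i == 0 and x[0] == x[-1]) or (i > 0 and x[i] != x[i - 1])]


def fire(x, i):
    """Execute one move of enabled processor i, modifying x in place."""
    if i == 0:
        x[0] = (x[0] + 1) % (len(x) + 1)
    else:
        x[i] = x[i - 1]


def steps_to_single_token(x, rng, limit=10_000):
    """Run from configuration x until exactly one token remains."""
    x = list(x)
    for t in range(limit):
        holders = enabled(x)
        if len(holders) == 1:
            return t, x
        fire(x, rng.choice(holders))
    raise RuntimeError("did not stabilize within the step limit")


if __name__ == "__main__":
    rng = random.Random(0)
    print([i + 1 for i in enabled([3, 4, 4, 1, 0])])   # -> [2, 4, 5]
    print(steps_to_single_token([3, 4, 4, 1, 0], rng))
```

The first print reproduces the faulty configuration from the Faults slide, where p2, p4 and p5 hold tokens; the second reports how many moves this particular random schedule needed.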

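  14. Will It Stabilize With mod (n - 2)? • Mod 3 (n = 5): {0,0,2,1,0} -p1-> {1,0,2,1,0} -p5-> {1,0,2,1,1} -p4-> {1,0,2,2,1} -p3-> {1,0,0,2,1} -p2-> {1,1,0,2,1} • The last configuration is the first one with +1 mod 3 in every position, so this schedule can repeat forever without stabilizing!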
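The mod 3 counterexample can be replayed mechanically. This is a small sketch under the assumption that the schedule is p1, p5, p4, p3, p2 as listed on the slide; `move` and `tokens` are hypothetical helper names, and `k` is the modulus used by p1.

```python
def move(x, i, k):
    """Return the configuration after enabled processor i (0-based) moves.

    p1 counts modulo k instead of modulo n + 1.
    """
    y = list(x)
    if i == 0:
        assert x[0] == x[-1], "p1 is not enabled"
        y[0] = (x[0] + 1) % k
    else:
        assert x[i] != x[i - 1], "p%d is not enabled" % (i + 1)
        y[i] = x[i - 1]
    return y


def tokens(x):
    """Number of processors holding a token in configuration x."""
    n = len(x)
    return sum(1 for i in range(n)
               if (i == 0 and x[0] == x[-1]) or (i > 0 and x[i] != x[i - 1]))


if __name__ == "__main__":
    k = 3
    start = [0, 0, 2, 1, 0]
    x = start
    for p in (1, 5, 4, 3, 2):                      # schedule from the slide
        x = move(x, p - 1, k)
        print("p%d -> %s, tokens = %d" % (p, x, tokens(x)))
    print(x == [(v + 1) % k for v in start])       # -> True
```

Because the final configuration is a shifted copy of the initial one, the token count never drops below four along this schedule.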

  15. Is Self-Stabilization a Toy?

  16. Stabilization Stack • Self-Stabilizing Microprocessor [DH04, DH06] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]

  17. Implementation Bottleneck • Ask Intel, AMD, IBM to design a self-stabilizing microprocessor… • Technology for converting off-the-shelf processors to be self-stabilizing [DH06] • Ask Microsoft, IBM, Red Hat to convert existing OS code to be self-stabilizing… • Stabilizing Virtual Machine [DY07]

  18. Enforcing stabilization by resetting • Processors behave correctly after reset • Periodic reset ensures correct behavior • But damages closure… • Need careful solutions

  19. Periodic Reset Monitor • Find a location P in OS code that is reached at least once every T time units • At P: • Save necessary information to RAM • Request a reset and loop forever • Stabilizing watchdog accepts request and resets processor • Upon reset: restore information and continue • Stabilizing watchdog verifies that a reset is performed at least once every T + epsilon time units
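The watchdog side of this scheme can be sketched as a bounded countdown. The model below is an assumption made for illustration, not the XScale implementation from the talk: time advances in discrete ticks, `limit` stands in for T + epsilon, and the class and function names are hypothetical. The key point is that the counter is clamped on every tick, so a corrupted counter value cannot postpone the reset.

```python
class Watchdog:
    """Toy watchdog: resets occur at most limit + 1 ticks apart."""

    def __init__(self, limit, counter=0):
        self.limit = limit
        self.counter = counter        # may hold any value after a transient fault

    def tick(self, reset_requested):
        """Advance one time unit; return True if the processor is reset now."""
        # Clamp first, so a corrupted counter cannot postpone the reset.
        if not 0 <= self.counter <= self.limit:
            self.counter = self.limit
        if reset_requested or self.counter == 0:
            self.counter = self.limit
            return True
        self.counter -= 1
        return False


def longest_gap(watchdog, requests):
    """Largest number of ticks between consecutive resets for a request trace."""
    gap = longest = 0
    for requested in requests:
        gap += 1
        if watchdog.tick(requested):
            longest = max(longest, gap)
            gap = 0
    return max(longest, gap)


if __name__ == "__main__":
    # Even with a corrupted counter and an OS that never asks for a reset,
    # the gap between resets stays within limit + 1 ticks.
    print(longest_gap(Watchdog(limit=5, counter=10**9), [False] * 100))
```

The sketch deliberately leaves out the save and restore steps performed by the OS at location P; it only models the guarantee that resets keep happening from any starting state.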

  20. Implementation using Intel XScale core • Used in numerous processors • Network, I/O, Handheld, Cellular etc. • RISC architecture (ARMv5 compatible) • Debug interface • Allows interaction between WD and OS • External debug break used for notifying the upcoming reset

  21. Up to now • Virtual Self-stabilizing processor on top of commercial quality processor • Towards repeating the concept in OSs and VMMs (enforcing configuration and protecting critical operations)

  22. Toward Self-Stabilizing Operating System (SOS) Shlomi Dolev and Reuven Yagel, SAACS’04 Workshop, Zaragoza

  23. Basic Directions • Black-box • Take existing OS (Unix, Windows, RTOS) • Add stabilization layer • Carefully tailoring a tiny kernel • Processor scheduling • Memory management • Device driver • Hosting Byzantine processes

  24. Assumptions • Every configuration (processor/memory) is possible • At least some program code is hardwired (in ROM) and is correct – Harvard Model • Processor: • Instruction manual (e.g., x86/IA-32) defines a transition function. • Self-stabilizing [DH04]

  25. Black Box: Periodic Reset, Re-install and Execute • Watchdog timer (self-stabilizing) • Periodic processor reset • During bootstrap the OS is reinstalled from ROM • Weak self-stabilization: E = (ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …, ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …) • Is it always acceptable? • Alternative: periodically re-install the code only, and add consistency checks and enforcement

  26. Tailored Kernel • Tiny Scheduler • Tiny Memory Manager • Requirements: • Self-stabilizing • Fair • Process stabilization preserving (e.g. validity of P.C. value)

  27. Tiny SOS Scheduler • ~70 lines of real machine assembly code • 16-bit Real mode & 32-bit Protected mode • Standard build and emulation tools (Nasm, ld, Bochs) • Detailed proof of requirement preservation • Excerpt:
        ; increase task
        mov word ax, [currentProc]
        and ax, PROC_MASK
        ...
        ; load task state
        ...
        ; restore ip
        mov ax, [bx+4]
        ; validate ip
        and ax, IP_MASK
        mov word [ss:STACK_TOP], ax
        ; restore general registers
        mov cx, word [bx+12]
        mov dx, word [bx+14]
        mov si, word [bx+16]
        mov di, word [bx+18]
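The masking idea in the excerpt can be illustrated outside assembly. The following Python sketch is a model under stated assumptions, not the scheduler itself: the number of processes is taken to be a power of two so that `PROC_MASK` acts as a modulo, the value of `IP_MASK` is invented for the example, and the function names are hypothetical.

```python
NUM_PROCS = 4                 # assumed power of two, so masking acts as a modulo
PROC_MASK = NUM_PROCS - 1
IP_MASK = 0x0FFF              # assumed size of each process's code segment


def next_proc(current_proc):
    """Advance round-robin; the mask forces any corrupted value back into range."""
    return (current_proc + 1) & PROC_MASK


def restore_ip(saved_ip):
    """Validate a saved instruction pointer by masking it into the code segment."""
    return saved_ip & IP_MASK


def schedule(current_proc, saved_ips, rounds):
    """Yield (process, ip) pairs for `rounds` scheduler invocations."""
    for _ in range(rounds):
        current_proc = next_proc(current_proc)
        yield current_proc, restore_ip(saved_ips[current_proc])


if __name__ == "__main__":
    # Start from a corrupted process index and corrupted saved ip values.
    for proc, ip in schedule(0xBEEF, [0xFFFF, 0x1234, 0x0000, 0xF00F], 8):
        print(proc, hex(ip))
```

Whatever value `currentProc` holds after a fault, the next index is valid and every process is selected once in any four consecutive invocations, which is the fairness requirement from the previous slide in miniature.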

  28. Sketch of Proof • In every execution E, the scheduler code starts executing, and runs from its first instruction to its last, infinitely often • In every execution E of the scheduler each process is executed infinitely often • The self-stabilizing scheduler preserves stabilization of processes.

  29. Talk Outline • Self-Stabilizing Microprocessor [DH06] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]

  30. Self-Stabilization Preserving Compiler Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science Ben-Gurion University, Israel Mooly Sagiv, Department of Computer Science Tel Aviv University, Israel

  31. The Gap • Need a transformation between: • Input program P written in a high abstraction language, e.g., (D)ASM. • Output program Q in a machine language, say, JVM. • Existing compilers? • P and Q behave the same when started in the initial state. • What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?

  32. Trivial Example • A statement of the form: For each i in {0..9} do f(i) • May be compiled to:
        mov ax, 10
        mov cx, 0
      loop1:
        push cx
        call f
        inc cx
        cmp cx, ax
        jne loop1
      • Start with cx=12 inside the loop: the jne exit test fails until cx wraps around, so f is called far more than ten times. • Moreover: Any runtime mechanism can get stuck / inconsistent.
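The effect of starting with cx=12 can be counted with a small model of the loop. This Python sketch assumes 16-bit registers, an uncorrupted ax, and execution entering at the loop head; `run_loop` is a hypothetical helper. The `jb` variant is only an illustration of a more robust exit test and is not claimed to be the compiler technique from the talk.

```python
def run_loop(cx, ax=10, exit_test="jne", max_calls=100_000):
    """Count calls to f when execution starts at the loop head with the given cx."""
    calls = 0
    while calls < max_calls:
        calls += 1                      # push cx / call f
        cx = (cx + 1) & 0xFFFF          # inc cx wraps around at 16 bits
        if exit_test == "jne":
            again = cx != ax            # jne: leaves the loop only on exact equality
        else:
            again = cx < ax             # jb (unsigned): leaves for any cx >= ax
        if not again:
            break
    return calls


if __name__ == "__main__":
    print(run_loop(0))                      # intended behavior: 10 calls
    print(run_loop(12))                     # after the fault: 65534 calls
    print(run_loop(12, exit_test="jb"))     # range test: 1 call
```

With the equality test, the loop only terminates after cx has wrapped all the way around to 10, while a range test bounds the damage from the same corrupted value.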

  33. Stabilization Preserving Compiler – a closer look Ensuring that Q eventually behaves as P: • State space of P • State space of Q

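  34. The Transformation • [Figure: a source program made of variable declarations and rules "upon <condition_1> do <statement_1>" … "upon <condition_n> do <statement_n>" is transformed into a target with an Enforce invariants step, a Scheduler over condition_1 … condition_n, and Statement_1 … Statement_n]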

  35. Self-Stabilization Preserving Compiler: Summary • Front end of compiler for ASM. • Self Stabilization preserving compiler. • Language with clear semantics from any state. • New demands for a compiler.

  36. Talk Outline • Self-Stabilizing Microprocessor [DH04] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]

  37. Self-Stabilization and Evolving Systems • Real world systems cannot be verified exhaustively… • We enforce safety and liveness specifications • Contract between the client, project manager and programmers, that is checked online! • Make sure that the additional (thin) monitoring and recovering layer is self-stabilizing • A change can be made to the implementation/specification to support evolving environments

  38. Self-Stabilizing Recoverer for Eventual Byzantine Software Olga Brukman, Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel Hillel Kolodner, Haifa Research Labs IBM, Israel

  39. Software Contains Bugs • Heisenbugs, corrupt states, leaked resources are common… • Correct and faultless SW is hard • Long-lived running programs, e.g., OS • Usually software is tested when starting from initial state and considering limited time scenarios.

  40. Fault Model Reflecting Reality • Software packages can be trusted to work as required after restart. • Eventual Byzantine software. • System administrators and users use reboot to deal with faults.

  41. Architecture • [Figure: OS Kernel, Middleware, and OMR, with each monitored component annotated by a <Preds,RActs> pair: <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n]

  42. Monitor-Restarter for Process and Subsystem • [Figure: <Pred,RActs>1, <Pred,RActs>2, …]

  43. Restart Actions – Mature Approach • Subsystem waits for completion of a restart of its components. • Restart action may vary, depending on component internal state. • Reschedule • Roll-back • Kill & Restart • Few restart attempts with more drastic restart actions.
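The escalation policy on this slide can be sketched as a loop over restart actions ordered from mild to drastic. Everything below is a hypothetical illustration: the component is modeled as a plain dictionary, and `recover`, the three action functions, and the queue-length predicate are assumptions, not the interface of the recoverer described in the talk.

```python
def recover(component, predicate, actions, attempts_per_action=2):
    """Apply increasingly drastic restart actions until the predicate holds.

    Returns the name of the action that restored the predicate, "none" if it
    already held, or "escalate" if every action failed.
    """
    if predicate(component):
        return "none"
    for name, action in actions:                 # ordered from mild to drastic
        for _ in range(attempts_per_action):
            action(component)
            if predicate(component):
                return name
    return "escalate"


def reschedule(c):
    """Mildest action: leaves the component state untouched."""


def roll_back(c):
    """Restore the last checkpointed queue length."""
    c["queue"] = c["checkpoint"]


def kill_and_restart(c):
    """Most drastic action: start again from the initial state."""
    c.update(queue=0, checkpoint=0)


ACTIONS = [("reschedule", reschedule),
           ("roll_back", roll_back),
           ("kill_and_restart", kill_and_restart)]

if __name__ == "__main__":
    def healthy(c):
        return c["queue"] < 100

    print(recover({"queue": 500, "checkpoint": 10}, healthy, ACTIONS))
    print(recover({"queue": 500, "checkpoint": 400}, healthy, ACTIONS))
```

A caller that receives "escalate" would hand the problem to the enclosing subsystem, in line with the first bullet of the slide.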

  44. Computational Model: rsf-execution • An execution E is an rsf (restart supporting fair) execution iff E is a fair execution in which every subsystem sub_i that is initialised during E respects its specification function ss_i. • Requirement: Every rsf-execution E has a suffix in which the system respects its specification function ss.

  45. Tools for Implementation – Black Box Approach • Software package is a black box. • Package is monitored by recording its I/O (e.g., strace in Linux). • Monitors are independent of specific implementation

  46. Tools for Implementation – Transparent Box Approach • Software package implementation tool is known. • Run-Time Reflection tools are used to monitor and restart the package. • Possible in Java, C++, CORBA, COM.

  47. Practical Experience: Printers Problem • Corrupted PDF, DOC or PS file sent to printing server. • Printer can’t print the file. • This causes retries by the printing server. • Printer is “stuck” on one job. • Predicate for printing server: • Restrict number of retries, try format conversions, send error message to user.

  48. Recovery Oriented Programming Olga Brukman and Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel

  49. Towards Robust Software • Programming • Structural programming, OOD, Design Patterns… • Testing and debugging • Unit testing [JUnit, CppUnit]… • Design By Contract (Eiffel) … • Formal specification languages • ASM, IO Automata, NURPL • Model checking • Online recovery • ROC [PBB02]. • Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software [BDK03] - black box software packages.
