
Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software

Olga Brukman (BGU), Shlomi Dolev (BGU), Elliot K. Kolodner (IBM)


Presentation Transcript


  1. Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software • Olga Brukman (BGU), Shlomi Dolev (BGU), Elliot K. Kolodner (IBM)

  2. Software Contains Bugs • Heisenbugs, corrupt states, leaked resources are common… • Correct, fault-free software is hard to write • Long-lived running programs, e.g., an OS • Software is usually tested starting from an initial state and only over limited-time scenarios.

  3. Fault Model Reflecting Reality • Software packages can be trusted to work as required after restart. • Eventual Byzantine software. • System administrators and users use reboot to deal with faults.

  4. So Reboot… • It does work in practice! • Automatic reboot (e.g., for satellites) • Be careful not to reboot without reason • Do not reboot portions that work OK • Make sure the automatic reboot layer itself works…

  5. Current Research Interest • Automatic recovery, self-managing systems, self-healing systems, evolving systems… • These imply a need for robust and stable systems rather than performance-optimised systems.

  6. Current Research Activity • ROC project, Berkeley-Stanford • Kinesthetics eXtreme, Columbia • Autonomic (holistic) computing, IBM

  7. Related Work: ROC • Hierarchical restart • Minimizing MTTR instead of maximizing MTTF. • Adding a layer that monitors components and restarts them upon failure.

  8. Related Work: ROC – Drawbacks • Limited hierarchies considered • empty graph, tree. • Monitoring by heartbeats • no monitoring of system state and progress. • Monitoring-restarting layer itself may crash.

  9. Our Contribution: Self-Stabilizing Autonomic Recoverer

  10. [Figure: OMR, kernel, OS, and the self-stabilizing monitoring-restarting layer]

  11. [Figure: OMR, kernel, OS, and the self-stabilizing monitoring-restarting layer]

  12. [Figure: OMR, kernel, OS, and the self-stabilizing monitoring-restarting layer]

  13. System’s Genericness [Figure: OS kernel and OMR, with recovery tuples <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n attached to the monitored components]

  14. Subsystem’s Gracious Restart DAG Hierarchy

  15. Detailed Design: Self-Stabilizing Autonomic Recoverer

  16. Monitor-Restarter for Process [Figure: tuples <Pred,RActs>1, <Pred,RActs>2, …]

  17. Monitor-Restarter for Subsystem [Figure: tuples <Pred,RActs>1, <Pred,RActs>2, …]

  18. Restart Actions – Naive Approach

  19. Restart Actions – Naive Approach

  20. Restart Actions – Naive Approach

  21. Restart Actions – Naive Approach

  22. Restart Actions – Mature Approach • A subsystem waits for its components to complete their restart. • The restart action may vary, depending on the component's internal state: reschedule, roll-back, or kill & restart. • A few restart attempts are made, each with a more drastic restart action.
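
The escalation on this slide can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the `Component` class, its `cure` field, and the three action stubs are hypothetical stand-ins for whatever a real monitor-restarter would invoke, and the limit of two attempts per action is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component that recovers only after one specific action."""
    name: str
    cure: str                       # assumption: the action that repairs it
    healthy: bool = False
    log: list = field(default_factory=list)


def make_action(label):
    """Build a (label, action) pair; the action is a stub that records itself."""
    def action(component):
        component.log.append(label)
        if label == component.cure:
            component.healthy = True
    return label, action


# Ordered from least to most drastic, following the slide.
ACTIONS = (
    make_action("reschedule"),
    make_action("roll-back"),
    make_action("kill-and-restart"),
)


def restart_with_escalation(component, actions=ACTIONS, attempts_per_action=2):
    """Try each action a bounded number of times before escalating.

    Returns the label of the action that restored health, or None.
    """
    for label, action in actions:
        for _ in range(attempts_per_action):
            action(component)
            if component.healthy:
                return label
    return None
```

Bounding the attempts per action keeps the recoverer from looping forever on a mild action that cannot help, which is the point of escalating.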

  23. Computational Model: rsf-execution • An execution $E$ is an rsf (restart-supporting fair) execution iff $E$ is a fair execution in which every subsystem $sub_i$ that is initialised during $E$ respects its specification function $ss_i$. • Requirement: every rsf-execution $E$ has a suffix in which the system respects its specification function $ss$.

  24. On-line Safety Assurance • In any execution, safety can be achieved by adding a monitoring layer [DS01]. [Figure: monitor tuples <Pred,RActs>1, <Pred,RActs>2, …]

  25. On-line Liveness Assurance • In any execution $E$ of $|S_i|+1$ or more configurations there exists a sub-execution $E' = c_1, c_2, \ldots, c_j$ in which $state_{sub_i}(c_1) = state_{sub_i}(c_j)$. • If $sub_i$ makes no progress during $E'$, then $E' = E_{circ}$. • If there is an $E_{circ}$, then there is an infinite execution in which liveness does not hold. * $|S_i|$ is the number of possible states of $sub_i$.
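
The pigeonhole argument above suggests a simple on-line check: record the subsystem state together with a progress measure, and flag a repeated state reached with no progress in between. The Python sketch below illustrates this under stated assumptions: the trace format, the non-decreasing `progress` counter, and the function name `find_circular_segment` are all hypothetical and do not appear in the slides.

```python
def find_circular_segment(trace):
    """Look for a candidate E_circ in a recorded trace.

    `trace` is a list of (state, progress) pairs, one per configuration.
    `state` must be hashable; `progress` is assumed to be a non-decreasing
    counter of useful work done by the subsystem.

    Returns (i, j) with i < j such that configurations i and j have the same
    state and the same progress value, or None if no such pair exists.
    Because progress never decreases, equal values at i and j mean that no
    progress was made anywhere between them.
    """
    first_seen = {}
    for j, (state, progress) in enumerate(trace):
        key = (state, progress)
        if key in first_seen:
            return first_seen[key], j
        first_seen[key] = j
    return None
```

For example, `find_circular_segment([("a", 0), ("b", 0), ("a", 0)])` reports the segment from index 0 to index 2, while a trace whose progress counter grows between repeated states is not flagged.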

  26. Liveness Concern

  27. Tools for Autonomic Recoverer Implementation – Black Box Approach • The software package is a black box. • The package is monitored by recording its I/O (e.g., with strace in Linux). • Monitors are independent of the specific implementation.

  28. Tools for Autonomic Recoverer Implementation – Transparent Box Approach • Software package implementation tool is known. • Run-Time Reflection tools are used to monitor and restart the package. • Possible in Java, C++, CORBA, COM.

  29. Global Predicates Distributed Concerns

  30. Self-Stabilization of the System • OMR is self-stabilizing. • Eventually each process will have a monitor. • Each monitor is self-stabilizing as well. • Eventually each process/subsystem is safe. • A corrupted history corrupts the monitor state. • Restarts initialize the history variables. • Eventually the monitor will see a correct history.

  31. Task Example: Mutual Exclusion With Tournament Algorithm [PF77] [Figure: tournament tree with nodes 1 to 3 and processes p1 to p4]

  32. Tournament Algorithm
      Procedure Node(v: integer, side: {0,1})
       1. Want_v[side] := 0
       2. Wait until (Want_v[1-side] = 0 or Priority_v = side)
       3. Want_v[side] := 1
       4. If (Priority_v = 1-side) then
       5.    If (Want_v[1-side] = 1) then goto Line 1
       6. Else wait until (Want_v[1-side] = 0)
       7. If (v = 1)
       8.    <Critical Section>
       9. Else Node(v/2, v mod 2)
      10. Priority_v := 1-side
      11. Wantv[side] := 0
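
To make the control flow of the procedure concrete, here is a small Python simulation, offered as a sketch under several assumptions rather than as the authors' code. Processes are generators, and every `yield` is a point where a seeded random scheduler may switch to another process; the steps are coarser than single register reads (line 2 and lines 4 to 5 each read two variables in one step). The placement of process `pid` at leaf node `n // 2 + pid // 2` on side `pid % 2`, the 0-based process ids, and the names `simulate`, `in_cs` and `max_in_cs` are assumptions of this sketch; `n` must be a power of two.

```python
import random


def node(state, v, side, pid):
    """One pass through Node(v, side); comments give the slide's line numbers."""
    want, priority = state["want"], state["priority"]
    while True:
        want[v][side] = 0                                            # line 1
        yield
        while not (want[v][1 - side] == 0 or priority[v] == side):   # line 2
            yield
        yield
        want[v][side] = 1                                            # line 3
        yield
        if priority[v] == 1 - side:                                  # line 4
            if want[v][1 - side] == 1:                               # line 5
                yield
                continue                                             # goto line 1
        else:
            while want[v][1 - side] != 0:                            # line 6
                yield
        yield
        break
    if v == 1:                                                       # line 7
        state["in_cs"] += 1                                          # line 8
        state["max_in_cs"] = max(state["max_in_cs"], state["in_cs"])
        yield                      # remain in the critical section one step
        state["in_cs"] -= 1
        state["entries"][pid] += 1
    else:
        yield from node(state, v // 2, v % 2, pid)                   # line 9
    priority[v] = 1 - side                                           # line 10
    yield
    want[v][side] = 0                                                # line 11
    yield


def process(state, pid, n, rounds):
    """Enter the critical section `rounds` times, starting from a leaf node."""
    for _ in range(rounds):
        yield from node(state, n // 2 + pid // 2, pid % 2, pid)


def simulate(n=4, rounds=3, seed=0, max_steps=200_000):
    """Interleave n processes with a seeded random scheduler."""
    state = {
        "want": {v: [0, 0] for v in range(1, n)},
        "priority": {v: 0 for v in range(1, n)},
        "in_cs": 0,
        "max_in_cs": 0,
        "entries": [0] * n,
    }
    rng = random.Random(seed)
    procs = {pid: process(state, pid, n, rounds) for pid in range(n)}
    steps = 0
    while procs and steps < max_steps:
        pid = rng.choice(sorted(procs))
        try:
            next(procs[pid])
        except StopIteration:
            del procs[pid]
        steps += 1
    return state
```

After `simulate()`, `max_in_cs` records the largest number of processes that were ever inside the critical section at once, and `entries` counts how often each process got in. A random schedule only samples interleavings, so this illustrates the algorithm and proves nothing.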

  33. Mutual Exclusion Task in Autonomic Recoverer Context [Figure: OS kernel and OMR with recovery tuples <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>ME, …, <Preds,RActs>n; the ME task carries <Preds,RActs>ME]

  34. Processes to Monitor • Tournament process – v • Location process (phantom process) – Priority, Want [Figure: tournament tree with location processes at nodes 1 to 3 and tournament processes p1 to p4]

  35. Subsystems to Monitor [Figure: tournament tree with nodes 1 to 3 and processes p1 to p4]

  36. ME Recovery Tuples: Examples • If there are not exactly N tournament processes, fork tournament processes. • If there are not exactly N-1 location processes, fork location processes. • If a tournament/location process has no monitor-restarter, fork a monitor-restarter.

  37. ME Recovery Tuples: Examples (Cont.) • If processes are not on their correct path to the root node, restart those processes (or their subsystems). • If more than two processes are competing for a location, restart them. • If there is starvation in some node, restart the processes in that node’s subsystem. • If a process is in the critical section for too long, restart the process. • …

  38. Recovery Tuples for ME Task Monitor <MonitorPred, RestartAct>_{ME,N}
      mp1: if |processes(TP)| ≠ N
      ra1: forkProcesses(TP, N)
      mp2: if ps_i in processes(TP) and no monitor(ps_i)
      ra2: forkMonitorRestarter(<MonitorPred,RestartAct>_{ps_i})
      mp3: if |processes(LP)| ≠ N-1
      ra3: forkProcesses(LP, N-1)
      mp4: if lps_i in processes(LP) and no monitor(lps_i)
      ra4: forkMonitorRestarter(<MonitorPred,RestartAct>_{lps_i})
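
The four tuples can be read as a list of (predicate, action) pairs that the task monitor evaluates repeatedly. The Python sketch below shows one such pass over a simulated process table. It is an illustration under assumptions, not the authors' implementation: the dictionary-based table, the generated process names, and the helpers `fork_processes`, `fork_monitor_restarters` and `monitor_pass` are hypothetical, and "forking" only updates the table.

```python
N = 4  # number of tournament processes, as in the N=4 example


def new_table():
    """Simulated process table: tournament (TP), location (LP), monitors."""
    return {"TP": set(), "LP": set(), "monitors": set()}


def fork_processes(table, kind, count):
    """ra1 / ra3: bring the `kind` processes to exactly `count` members."""
    table[kind] = {f"{kind.lower()}{i}" for i in range(1, count + 1)}


def unmonitored(table, kind):
    """Processes of `kind` that currently have no monitor-restarter."""
    return sorted(p for p in table[kind] if p not in table["monitors"])


def fork_monitor_restarters(table, kind):
    """ra2 / ra4: give every unmonitored `kind` process a monitor-restarter."""
    table["monitors"].update(unmonitored(table, kind))


RECOVERY_TUPLES = [
    # (monitor predicate, restart action)
    (lambda t: len(t["TP"]) != N,
     lambda t: fork_processes(t, "TP", N)),            # mp1, ra1
    (lambda t: bool(unmonitored(t, "TP")),
     lambda t: fork_monitor_restarters(t, "TP")),      # mp2, ra2
    (lambda t: len(t["LP"]) != N - 1,
     lambda t: fork_processes(t, "LP", N - 1)),        # mp3, ra3
    (lambda t: bool(unmonitored(t, "LP")),
     lambda t: fork_monitor_restarters(t, "LP")),      # mp4, ra4
]


def monitor_pass(table):
    """Evaluate every tuple once; return the indices of the tuples that fired."""
    fired = []
    for index, (predicate, action) in enumerate(RECOVERY_TUPLES, start=1):
        if predicate(table):
            action(table)
            fired.append(index)
    return fired
```

Starting from an empty table, the first pass fires all four tuples and a second pass fires none, which mirrors the idea that the monitor drives the system towards a state in which every predicate is false.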

  39. ME Task Monitor Goals (N=4) [Figure: ME task monitor over processes p1 to p4]

  40. Recovery Tuples for ME Task Monitor (same tuples mp1/ra1 to mp4/ra4 as on slide 38)

  41. ME Task Monitor Goals (N=4) [Figure: ME task monitor over nodes 1 to 3 and processes p1 to p4]

  42. Recovery Tuples for ME Task Monitor (same tuples mp1/ra1 to mp4/ra4 as on slide 38)

  43. ME Task Monitor Goals (N=4) [Figure: ME task monitor over nodes 1 to 3 and processes p1 to p4]

  44. Lemma “Safety”: • Every rsf-execution $E$ has a suffix $E'$ such that in every configuration $c \in E'$: $\nexists\, p_i, p_j\ (i \neq j) : pc(p_i,1,c)=8 \wedge pc(p_j,1,c)=8$.

  45. Lemma “Liveness”: Every rsf-execution $E$ has infinitely many configurations $c \in E$ such that $pc(p,1,c)=8$ for some process $p$.

  46. Lemma “No starvation”: Every rsf-execution $E$ has a suffix $E'$ such that $\forall p_k\ \exists c_i \in E' : pc(p_k,1,c_i)=8$.

  47. Practical Experience: Printers Problem • A corrupted pdf, doc or ps file is sent to the printing server. • The printer can’t print the file. • This causes retries by the printing server. • The printer is “stuck” on one job. • Predicate for the printing server: restrict the number of retries, try format conversions, send an error message to the user.
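
The recovery policy in the last bullet (bounded retries, then format conversions, then an error message) might look roughly like the Python sketch below. Everything concrete in it is an assumption made for illustration: the job dictionary, the `try_print` callback, the conversion functions, and the default of three retries are hypothetical and do not describe the printing server used in the experiment.

```python
def handle_job(job, try_print, conversions=(), max_retries=3):
    """Print a job with bounded retries, falling back to format conversions.

    job         -- dict with "name", "format" and "data" keys (hypothetical)
    try_print   -- callable(data) -> bool, True when the printer accepts it
    conversions -- iterable of (format_name, convert) pairs tried in order
    Returns ("printed", format, attempts) or ("error", message, attempts).
    """
    # Try the original data first, then each converted form in turn.
    candidates = [(job["format"], lambda data: data)] + list(conversions)
    attempts = 0
    for fmt, convert in candidates:
        data = convert(job["data"])
        for _ in range(max_retries):
            attempts += 1
            if try_print(data):
                return "printed", fmt, attempts
    # Every bounded attempt failed: report to the user instead of retrying.
    return "error", f"cannot print {job['name']}", attempts
```

The point of the bound is that a corrupted file can no longer hold the printer on one job indefinitely: after a fixed number of attempts the server gives up and tells the user.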

  48. Concluding Remarks • The theoretical foundations of self-stabilization and restart techniques could serve as a basis for the new paradigms. • General framework for the design and correctness proof of an autonomic recoverer. • Printers experience coordinated with IBM.
