Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software


Olga Brukman, BGU

Shlomi Dolev, BGU

Elliot K. Kolodner, IBM

Software Contains Bugs
  • Heisenbugs, corrupt states, and leaked resources are common.
  • Correct, fault-free software is hard to write
    • Long-lived running programs, e.g., an OS
  • Software is usually tested only from its initial state and over limited-time scenarios.
Fault Model Reflecting Reality
  • Software packages can be trusted to work as required after restart.
  • Eventual Byzantine software.
  • System administrators and users use reboot to deal with faults.
So Reboot…
  • It does work in practice!
  • Automatic reboot (e.g., for satellites)
    • Be careful not to reboot without reason.
    • Do not reboot portions that work correctly.
    • Make sure the automatic reboot layer itself works.
Current Research Interest
  • Automatic recovery, self-managing systems, self-healing systems, evolving systems…
  • These imply a need for robust and stable systems rather than performance-optimised systems.
Current Research Activity
  • ROC project, Berkeley-Stanford
  • Kinesthetics eXtreme, Columbia
  • Autonomic (holistic) computing, IBM
Related Work: ROC
  • Hierarchical restart
    • Minimizing MTTR instead of maximizing MTTF.
  • Adding a layer that monitors components and restarts them upon failure.
Related Work: ROC – Drawbacks
  • Limited hierarchies considered
    • empty graph, tree.
  • Monitoring by heartbeats
    • no monitoring of system state and progress.
  • Monitoring-restarting layer itself may crash.
System’s Genericness
[Figure: the OS and its Kernel, the OMR, and recovery tuples <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n.]
Restart Actions – Mature Approach
  • A subsystem waits for the restart of its components to complete.
  • The restart action may vary, depending on the component's internal state.
    • Reschedule
    • Roll-back
    • Kill & Restart
  • A few restart attempts, each with a more drastic restart action.
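
The escalation idea above can be sketched as follows. This is only an illustration under stated assumptions: the function escalate, the class ToyComponent, and the health check are hypothetical names, not part of the presented system. The actions are tried in order, from least to most drastic, until the health predicate holds.

```python
from typing import Callable, Sequence


def escalate(actions: Sequence[Callable[[], None]],
             is_healthy: Callable[[], bool]) -> int:
    """Apply restart actions in order until is_healthy() holds.

    Returns the number of actions applied (0 if already healthy),
    or -1 if every action was tried and the component is still unhealthy.
    """
    if is_healthy():
        return 0
    for count, action in enumerate(actions, start=1):
        action()
        if is_healthy():
            return count
    return -1


class ToyComponent:
    """Hypothetical component that recovers only after kill & restart."""

    def __init__(self):
        self.state = "corrupt"
        self.log = []

    def reschedule(self):
        self.log.append("reschedule")

    def roll_back(self):
        self.log.append("roll-back")

    def kill_and_restart(self):
        self.log.append("kill & restart")
        self.state = "initial"  # restart brings the component to its initial state

    def healthy(self):
        return self.state == "initial"
```

A caller would pass the three actions in increasing order of severity, for example escalate([c.reschedule, c.roll_back, c.kill_and_restart], c.healthy).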
Computational Model: rsf-execution
  • An execution E is an rsf (restart-supporting fair) execution iff E is a fair execution in which every subsystem sub_i that is initialised during E respects its specification function ss_i.

Requirement: Every rsf-execution E has a suffix in which the system respects its specification function ss.

On-line Safety Assurance.
  • In any execution, safety can be achieved by adding a monitoring layer [DS01].

[Figure: recovery tuples <Pred,RActs>1 and <Pred,RActs>2.]

On-line Liveness Assurance.
  • In any execution E of |S_i|+1 or more configurations there exists a sub-execution E’ = c_1, c_2, …, c_j in which
    • state_sub_i(c_1) = state_sub_i(c_j)
  • If sub_i makes no progress during E’, then E’ = E_circ.
  • If E_circ exists, then there is an infinite execution in which liveness does not hold.

* |S_i| is the number of possible states of sub_i.
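
The pigeonhole argument above suggests a simple check over a recorded trace. The sketch below is illustrative only: the function name and the notion of a monotone progress counter are assumptions, and a real monitor may observe only part of the subsystem state. It searches for a sub-execution that starts and ends in the same state with no progress in between.

```python
from typing import Hashable, Optional, Sequence, Tuple


def find_circular_subexecution(
    states: Sequence[Hashable],
    progress: Sequence[int],
) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of a sub-execution that begins and ends
    in the same state with no progress made in between, or None.

    states[k]   -- state of the monitored subsystem in configuration k
    progress[k] -- hypothetical non-decreasing counter of completed work
    """
    last_seen = {}  # state -> index of its most recent occurrence
    for j, state in enumerate(states):
        i = last_seen.get(state)
        # Because progress never decreases, comparing with the most
        # recent occurrence of the state is sufficient.
        if i is not None and progress[i] == progress[j]:
            return (i, j)
        last_seen[state] = j
    return None
```

A monitor that finds such a pair could treat it as evidence of a possible liveness violation and trigger a restart action.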

Tools for Autonomic Recoverer Implementation – Black Box Approach
  • The software package is a black box.
  • The package is monitored by recording its I/O (e.g., with strace in Linux).
  • Monitors are independent of the specific implementation.
Tools for Autonomic Recoverer Implementation – Transparent Box Approach
  • The tool used to implement the software package is known.
  • Run-time reflection tools are used to monitor and restart the package.
  • Possible in Java, C++, CORBA, COM.
Self-Stabilization of the System
  • The OMR is self-stabilizing.
    • Eventually each process will have a monitor.
  • Each monitor is self-stabilizing as well.
    • Eventually each process/subsystem is safe.
  • A corrupted history causes monitor state corruption
    • Restarts initialize history variables.
    • Eventually the monitor will see a correct history.
Tournament Algorithm

Procedure Node(v: integer, side: {0,1})

1. Want_v[side] := 0
2. Wait until (Want_v[1-side] = 0 or Priority_v = side)
3. Want_v[side] := 1
4. If (Priority_v = 1-side) then
5.     If (Want_v[1-side] = 1) then goto Line 1
6. Else wait until (Want_v[1-side] = 0)
7. If (v = 1)
8.     <Critical Section>
9. Else Node(v/2, v mod 2)
10. Priority_v := 1-side
11. Want_v[side] := 0

Mutual Exclusion Task in Autonomic Recoverer Context
[Figure: the OS and its Kernel, the OMR, the ME task, and recovery tuples <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n together with <Preds,RActs>ME.]

Processes to Monitor
  • Tournament process – v
  • Location process (phantom process) – Priority, Want

[Figure: tournament processes p1, p2, p3, p4 and location processes 1, 2, 3.]

ME Recovery Tuples: Examples
  • If there are not exactly N tournament processes, fork tournament processes.
  • If there are not exactly N-1 location processes, fork location processes.
  • If there is no monitor-restarter for a tournament/location process, fork a monitor-restarter.
ME Recovery Tuples: Examples (Cont.)
  • If processes are not on their correct path to the root node, restart those processes (or their subsystems).
  • If more than two processes are competing for a location, restart them.
  • If there is starvation in some node, restart the processes in that node’s subsystem.
  • If a process is in the critical section for too long, restart the process.
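
The "correct path to the root" predicate can be sketched as below. The node numbering is an assumption made for illustration, not something the slides state: internal nodes are numbered 1..N-1 as in a binary heap, leaves are N..2N-1, and process i (0-based) starts at leaf N+i. Under that assumption the parent step matches the recursive call Node(v/2, v mod 2) in the tournament procedure.

```python
from typing import List, Tuple


def path_to_root(process_index: int, n: int) -> List[Tuple[int, int]]:
    """(node, side) pairs a process visits on its way up to root node 1.

    Assumes heap numbering: internal nodes 1..n-1, leaves n..2n-1,
    and process i (0-based) starting at leaf n + i.
    """
    if not 0 <= process_index < n:
        raise ValueError("process index out of range")
    v = n + process_index
    path = []
    while v > 1:
        side = v % 2   # which side of the parent this node enters from
        v //= 2        # move to the parent node
        path.append((v, side))
    return path


def on_correct_path(process_index: int, n: int, node: int, side: int) -> bool:
    """Monitor predicate: is (node, side) a legal position for this process?"""
    return (node, side) in path_to_root(process_index, n)
```

A monitor could evaluate on_correct_path for each observed process position and restart the processes for which it returns False.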
Recovery Tuples for ME Task Monitor

<MonitorPred, RestartAct>ME, N

mp1 : if |processes(TP)| ≠ N

ra1 : forkProcesses(TP, N)

mp2 : if ps_i in processes(TP) and no monitor(ps_i)

ra2 : forkMonitorRestarter(<MonitorPred,RestartAct>ps_i)

mp3 : if |processes(LP)| ≠ N-1

ra3 : forkProcesses(LP, N-1)

mp4 : if lps_i in processes(LP) and no monitor(lps_i)

ra4 : forkMonitorRestarter(<MonitorPred,RestartAct>lps_i)
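
The four predicate/action pairs above can be modelled as a list of (predicate, action) functions evaluated against a process table. This is a toy sketch under stated assumptions: ProcessTable, me_recovery_tuples, and monitor_step are hypothetical names, and "forking" is modelled by updating in-memory sets rather than creating real processes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class ProcessTable:
    """Toy stand-in for the OS process table (hypothetical)."""
    tournament: Set[int] = field(default_factory=set)            # TP ids
    location: Set[int] = field(default_factory=set)              # LP ids
    monitored: Set[Tuple[str, int]] = field(default_factory=set)


RecoveryTuple = Tuple[Callable[[ProcessTable], bool],
                      Callable[[ProcessTable], None]]


def me_recovery_tuples(n: int) -> List[RecoveryTuple]:
    def unmonitored(kind, ids):
        return lambda t: any((kind, i) not in t.monitored for i in ids(t))

    def fork_monitors(kind, ids):
        return lambda t: t.monitored.update((kind, i) for i in ids(t))

    tp = lambda t: t.tournament
    lp = lambda t: t.location
    return [
        # mp1 / ra1: exactly N tournament processes must exist
        (lambda t: len(t.tournament) != n,
         lambda t: setattr(t, "tournament", set(range(n)))),
        # mp2 / ra2: every tournament process needs a monitor-restarter
        (unmonitored("TP", tp), fork_monitors("TP", tp)),
        # mp3 / ra3: exactly N-1 location processes must exist
        (lambda t: len(t.location) != n - 1,
         lambda t: setattr(t, "location", set(range(1, n)))),
        # mp4 / ra4: every location process needs a monitor-restarter
        (unmonitored("LP", lp), fork_monitors("LP", lp)),
    ]


def monitor_step(table: ProcessTable, tuples: List[RecoveryTuple]) -> int:
    """Evaluate each predicate once and run its action when it holds.

    Returns the number of restart actions that fired.
    """
    fired = 0
    for predicate, action in tuples:
        if predicate(table):
            action(table)
            fired += 1
    return fired
```

Starting from an empty table, repeated calls to monitor_step reach a state in which no predicate holds, which loosely mirrors the convergence behaviour expected of the monitor.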

Lemma “Safety”:
  • Every rsf-execution E has a suffix E’ such that in every configuration c ∈ E’: ¬∃ p_i ≠ p_j : pc(p_i, 1, c) = 8 ∧ pc(p_j, 1, c) = 8

Lemma “Liveness”:

Every rsf-execution E has infinitely many configurations c ∈ E such that pc(p, 1, c) = 8 for some process p.

Lemma “No starvation”:

Every rsf-execution E has a suffix E’ such that

∀ p_k ∃ c_i ∈ E’: pc(p_k, 1, c_i) = 8.

Practical Experience: Printers Problem
  • A corrupted pdf, doc, or ps file is sent to the printing server.
  • The printer cannot print the file.
  • This causes retries by the printing server
    • The printer is “stuck” on one job.
  • Predicate for the printing server:
    • Restrict the number of retries, try format conversions, send an error message to the user.
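
The printing-server recovery policy above can be sketched as a bounded retry loop. Everything here is a hypothetical illustration: send_to_printer, converters, and notify_user are placeholder callables, not an actual print-server API. The original job is tried a limited number of times, then each format conversion in turn, and finally the user is notified.

```python
from typing import Callable, Iterable


def print_with_recovery(
    job: bytes,
    send_to_printer: Callable[[bytes], bool],
    converters: Iterable[Callable[[bytes], bytes]] = (),
    max_retries: int = 3,
    notify_user: Callable[[str], None] = print,
) -> bool:
    """Try to print a job without letting it block the queue forever."""

    def candidates():
        yield job  # the original file first
        for convert in converters:
            try:
                converted = convert(job)
            except Exception:
                continue  # a failed conversion is skipped
            yield converted

    for candidate in candidates():
        for _ in range(max_retries):  # restrict the number of retries
            if send_to_printer(candidate):
                return True
    notify_user("Print job failed after retries and format conversions.")
    return False
```

Bounding the retries is what keeps one corrupted file from leaving the printer stuck on a single job.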
Concluding Remarks
  • The theoretical foundations of self-stabilization and restart techniques could serve as a basis for the new paradigms.
  • A general framework for the design and correctness proof of an autonomic recoverer.
  • Printers experience coordinated with IBM.