Jordan Adamek Mikhail Nesterenko Sébastien Tixeuil

Evaluating Practical Tolerance Properties of Stabilizing Programs Through Simulation: The Case of Propagation of Information with Feedback
Jordan Adamek, Mikhail Nesterenko, Sébastien Tixeuil
Symposium on Stabilization, Safety, and Security of Distributed Systems
Toronto, Canada, October 2012

Why Simulate Stabilization
• A stabilizing program has to recover from an arbitrary system state.
• To prove the algorithm correct, the designer has to focus on stabilization from degenerate states that are rarely reached in practice.
• Such an exercise tells little about the algorithm’s practical performance.
• Performance evaluations in the area of stabilization are relatively rare. However, they present unique challenges.
What to consider?
• states: randomization is a common answer. Yet uniformly randomized states may be “mild”: they distribute process states evenly and may not represent systemic faults.
• execution models: the model needs to be realistic, yet the results should pertain to the algorithm and not be an artifact of the model.
• parameters: stabilization time is common, yet it often hides the complexity of failure recovery. Other parameters need to be considered.
• We simulate stabilizing PIF and analyze its performance using realistic initial states and three classic execution models, and we compute a number of stabilization parameters.

Outline • PIF algorithm • parameter selection and experiment setup • results • analysis • conclusion

PIF Algorithm
propagation of information with feedback (PIF)
• used to deliver information on rooted trees from root to leaves and get an acknowledgment back
• often considered in stabilization literature; proven ideally- and self-stabilizing [8,9] as well as snap-stabilizing [1]
description
• each process can be in one of three states: idle (i), requesting (rq), replying (rp)
• root initiates a wave by switching from idle to requesting
• each intermediate process p propagates the request to its children Ch.p
• each leaf reflects the wave back by switching from idle to replying
• intermediate processes propagate the reply back to the root
• root waits for replies from all children and repeats the cycle
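The wave described above can be sketched as a small guarded-command program. This is a minimal illustration only: the slides give no guards, so every guard below, the reset rules, and the names `children_of`, `next_state`, and `enabled` are assumptions and not the published algorithm. No stabilization property is claimed for this sketch.

```python
IDLE, REQ, REP = "i", "rq", "rp"

def children_of(parent):
    """Build the child lists Ch.p from a parent map (the root maps to None)."""
    ch = {p: [] for p in parent}
    for p, q in parent.items():
        if q is not None:
            ch[q].append(p)
    return ch

def next_state(p, state, parent, ch):
    """Return the state p would move to, or None if p has no enabled action.

    The guards are illustrative assumptions, not the published algorithm.
    """
    s = state[p]
    kids_are = lambda v: all(state[c] == v for c in ch[p])
    if parent[p] is None:                      # root
        if s == IDLE and kids_are(IDLE):
            return REQ                         # start a new wave
        if s == REQ and kids_are(REP):
            return IDLE                        # feedback received from all children
        if s == REP:
            return IDLE                        # assumed reset: root never replies
        return None
    up = state[parent[p]]
    if s == IDLE and up == REQ and kids_are(IDLE):
        return REQ if ch[p] else REP           # a leaf reflects the wave at once
    if s == REQ and kids_are(REP):
        return REP                             # all children have replied
    if s == REP and up != REQ:
        return IDLE                            # parent has moved on; clean up
    return None

def enabled(state, parent, ch):
    """Map every process with an enabled action to its next state."""
    moves = {p: next_state(p, state, parent, ch) for p in state}
    return {p: s for p, s in moves.items() if s is not None}

# Usage: one wave on the path 0 - 1 - 2 rooted at 0, always firing the
# enabled process with the lowest id (a fixed schedule chosen for the demo).
parent = {0: None, 1: 0, 2: 1}
ch = children_of(parent)
state = {p: IDLE for p in parent}
for _ in range(5):
    moves = enabled(state, parent, ch)
    p = min(moves)
    state[p] = moves[p]
print(state)
```

Keeping the guards in a pure function that returns the next state, instead of mutating it, lets a scheduler inspect the set of enabled processes before deciding which actions to execute.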

Initial State Selection
tree selection
• problem: how to select trees that do not favor a particular topology or shape
• solution: Prüfer sequence: a sequence of n-2 labels uniquely defines one of all possible labeled trees on n nodes
• a random sequence chooses each labeled tree with equal probability
initial state – need to select an initial state, then perturb it by faults of varied extent
• problem: not all states occur with equal probability, ex: root is seldom idle
• solution: start from the idle state, randomly pick a number from a range significantly larger than the system size, run the algorithm fault-free for that number of steps, then induce the fault
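The tree selection step can be sketched with the standard Prüfer decoding procedure. The function names, the use of labels 0..n-1, and the choice of node 0 as root in `root_tree` are assumptions made for illustration; the slides do not say how the root was chosen.

```python
import heapq
import random
from collections import deque

def prufer_to_edges(seq):
    """Decode a Prüfer sequence over labels 0..n-1 into the n-1 tree edges."""
    n = len(seq) + 2
    degree = [1] * n
    for x in seq:
        degree[x] += 1
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for x in seq:
        leaf = heapq.heappop(leaves)   # smallest remaining leaf
        edges.append((leaf, x))
        degree[x] -= 1
        if degree[x] == 1:             # x has just become a leaf
            heapq.heappush(leaves, x)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def random_tree(n, rng):
    """Draw a labeled tree on n >= 2 nodes uniformly at random."""
    return prufer_to_edges([rng.randrange(n) for _ in range(n - 2)])

def root_tree(edges, root=0):
    """Orient the edges away from `root`; returns a parent map (assumed root)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

# Usage: a random 100-process tree, seeded for repeatability.
tree_parent = root_tree(random_tree(100, random.Random(2012)))
```

Because every sequence of n-2 labels corresponds to exactly one labeled tree, drawing each label independently and uniformly gives every labeled tree the same probability.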

Execution Models & Faults
execution models
• problem: the execution model should not appear to favor a particular system or architecture
• solution: selected 3 classic, well-studied execution semantics
• interleaving – randomly execute one enabled action
• power-set – randomly pick the number X of actions to execute, randomly pick the first, exclude its enabled neighbors; continue until X or all enabled actions are selected; execute the selected actions
• synchronous – same as power-set, only continue randomly selecting actions until none remains
faults
• randomly pick a process and randomly select its state. Note: this may have no observable effect if the fault state is the same as the correct state
• all processes are faulty – arbitrary initial state: classic stabilization
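The three selection rules and the fault model can be sketched as follows. This is one reading of the slide, with assumptions: X is drawn uniformly from 1 to the number of enabled processes, k faults hit k distinct processes, and the names `select_actions` and `inject_faults` are hypothetical.

```python
import random

def select_actions(enabled, neighbors, model, rng):
    """Pick the processes whose actions execute in one step.

    enabled   -- iterable of processes that have an enabled action
    neighbors -- dict: process -> iterable of adjacent processes
    model     -- "interleaving", "power-set", or "synchronous"
    """
    if model not in ("interleaving", "power-set", "synchronous"):
        raise ValueError(f"unknown execution model: {model}")
    pool = sorted(enabled)
    if not pool:
        return []
    if model == "interleaving":
        return [rng.choice(pool)]
    # power-set stops after X picks; synchronous continues until none remains.
    limit = rng.randint(1, len(pool)) if model == "power-set" else len(pool)
    candidates = set(pool)
    chosen = []
    while candidates and len(chosen) < limit:
        p = rng.choice(sorted(candidates))
        chosen.append(p)
        candidates.discard(p)
        candidates -= set(neighbors[p])   # exclude enabled neighbors of p
    return chosen

def inject_faults(state, k, values, rng):
    """Return a copy of `state` with k distinct processes given random states.

    A fault may pick the state the process already holds, in which case it
    has no observable effect.
    """
    faulty = dict(state)
    for p in rng.sample(sorted(state), k):
        faulty[p] = rng.choice(values)
    return faulty

# Usage on the path 0 - 1 - 2 - 3 with every process enabled.
rng = random.Random(7)
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
step = select_actions(line, line, "synchronous", rng)
```

Excluding the neighbors of each selected process means no two adjacent processes act in the same step, so each selected guard is evaluated against neighbor states that do not change within that step.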

Experiment Setup
• 100 processes
• avg. tree height 21.6 ± 4.9
• avg. number of leaves 37.5 ± 3.1
• faults varied from one to 100
• ran 1,000 experiments for each fault number

Metrics
• stabilization time – number of execution steps for the algorithm to achieve a legitimate state (a single wave)
• number of actions until stabilization*
• overhead – number of action executions outside the propagation of the correct wave (wait time for interleaving semantics [1])
• longest causality chain* – actions are causally related if executed on the same or a neighbor process
• scale – number of processes in the system
* metrics were not included in published proceedings
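The longest causality chain can be computed in one pass over an execution log. The sketch below assumes a particular reading of the metric: an action depends on every earlier action executed on the same process or on one of its neighbors, and actions executed in the same step are not related to each other. The log format and the name `longest_causality_chain` are hypothetical.

```python
def longest_causality_chain(steps, neighbors):
    """Length of the longest chain of causally related actions.

    steps     -- list of steps; each step lists the processes that acted
    neighbors -- dict: process -> iterable of adjacent processes
    """
    # depth[p] = longest chain ending at the latest action executed on p
    depth = {p: 0 for p in neighbors}
    for acting in steps:
        before = dict(depth)          # actions in one step do not see each other
        for p in acting:
            depth[p] = 1 + max(before[q] for q in [p, *neighbors[p]])
    return max(depth.values(), default=0)

# Usage on the path 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
chain = longest_causality_chain([[0], [1], [2]], path)
```

Tracking only the latest action per process suffices because the chain length recorded for a process never decreases: each new action on a process extends at least that process's own previous chain.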

Stabilization Time

Overhead

Longest Causality Chain

Scale
• interleaving semantics
• varied the system size from 100 to 1000 processes
• fixed % of faults (100% is arbitrary state, classic stabilization)

Analysis
• simulation results present a detailed picture of algorithm behavior
• notes
• effort (overhead, actions, time) rises, then diminishes with fault extent. In a legitimate state, a single fault may launch a spurious wave in the opposite direction; stabilization is then proportional to system size. Further faults tend to break up this wave and accelerate stabilization
• parallel execution semantics (synchronous, power-set) result in greater overhead

Future Research & Conclusion
• the study is not exhaustive: the fault location affects the system differently. We believe that a fault closer to the root has a greater ability to perturb the system state
• engagement with practice provides feedback for stabilization research: designers are induced to consider and address problems of practical import
• in our case – the sparse-fault spurious “counter-wave” was wholly unexpected – may need algorithmic measures to handle it

Thank You Questions?
