
Continuous fault containment and local stabilization in path-vector routing




  1. Continuous fault containment and local stabilization in path-vector routing Hongwei Zhang Anish Arora

  2. Motivation • Studies of fault containment have focused largely on cases where • faults either stop occurring after a certain moment in time • or faults occur with low frequency • In practice, faults may occur with high frequency, and the interval between faults may be shorter than the time the system takes to stabilize • E.g., under the Code Red/Nimda attack (2002), memory overflow caused edge BGP speakers to repeatedly fail-stop and rejoin as often as once every minute • the oscillation propagated far away, in spite of the MRAI timer and RFD

  3. Objectives • Formulate concepts that characterize, and develop mechanisms that achieve the following properties: • in the presence of high-frequency faults the impact of faults is always locally contained • once faults stop occurring the system stabilizes within time that is a function of the degree of fault perturbation • We study these issues in the context of path-vector routing • to simplify the presentation, we first present a solution for continuous fault containment and local stabilization in path-vector routing, then we present the concepts

  4. Outline • Fault propagation in path-vector protocols • CPV • design pattern • protocol • Generic concepts for tolerating high-frequency faults • Analytical & simulation results for CPV • Concluding remarks

  5. Fault propagation in path-vector protocols • [figure: chain of nodes d, e, f, g, h, i with routes [e, d], [f, e, d], [g, f, e, d], [h, g, f, e, d], [i, h, g, f, e, d]; labels "all are affected" and "unaffected?"] • the fresh info (route announcement) always lags behind the obsolete info (route withdrawal)

  6. Outline • Fault propagation in path-vector protocols • CPV • design pattern • protocol • Generic concepts for tolerating high-frequency faults • Analytical & simulation results for CPV • Concluding remarks

  7. Design pattern of CPV • Key idea: to design a mechanism that • enables information regarding a new network state to catch up with and stop the propagation of the information regarding the preceding state (which has become obsolete) • works whether or not faults stop occurring • Parallel diffusing waves (with different propagation speed)

  8. Outline of CPV • Whenever a node j needs to change state, it engages a containment wave cw0 before engaging a new stabilization wave sw1 • so that cw0 stops the previous stabilization wave sw0 from propagating the existing state of j • In the presence of high-frequency faults, another fault f may occur before j executes sw1; there are then two cases • j does not need to change state any more: j engages an undo-containment wave uw0 to stop cw0 • j still needs to change state: j lets cw0 propagate
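
The choice of wave in the outline above can be sketched as a small decision function. This is only an illustration of the two cases on the slide, not the protocol's actual guarded actions; the names `Wave` and `next_wave` and the two boolean inputs are hypothetical.

```python
from enum import Enum
from typing import Optional


class Wave(Enum):
    CONTAINMENT = "cw"
    STABILIZATION = "sw"
    UNDO_CONTAINMENT = "uw"


def next_wave(cw_engaged: bool, needs_change: bool) -> Optional[Wave]:
    """Return the wave node j engages next, or None if j stays idle.

    cw_engaged   -- j has already started a containment wave (cw0)
    needs_change -- j still needs to change state (re-evaluated after any new fault)
    """
    if not cw_engaged:
        # A state change always starts with a containment wave.
        return Wave.CONTAINMENT if needs_change else None
    if needs_change:
        # cw0 is left to propagate; the new stabilization wave sw1 follows it.
        return Wave.STABILIZATION
    # A later fault removed the need to change state: cancel cw0.
    return Wave.UNDO_CONTAINMENT
```

For example, a node that started cw0 and then sees the failed neighbor rejoin before sw1 runs would get `Wave.UNDO_CONTAINMENT`.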

  9. A little more detail • Containment wave • piggybacks the expected next state of a node to its neighbors, so that a neighbor can decide whether to hold an existing SW • is a one-way diffusing process, so that a CW can co-exist with the corresponding SW (which is required to contain continuously occurring faults) • Stabilization wave • takes the predicted state into account when choosing the next-hop • Undo-containment wave • does not introduce new variables

  10. Outline • Fault propagation in path-vector protocols • CPV • design pattern • protocol • Generic concepts for tolerating high-frequency faults • Analytical & simulation results for CPV • Concluding remarks

  11. Protocol CPV • containment wave • timing constraints: ds > α·(dc+U), dc > α·(du+U), du ≥ 0
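
The three timing inequalities on this slide can be checked mechanically. The sketch below assumes ds, dc and du are the delays of the stabilization, containment and undo-containment waves; the transcript does not define α or U, so the values passed for them here are placeholders, not figures from the presentation. The function name is hypothetical.

```python
def timers_satisfy_cpv(ds: float, dc: float, du: float, alpha: float, u: float) -> bool:
    """Check ds > alpha*(dc+U), dc > alpha*(du+U), du >= 0."""
    return ds > alpha * (dc + u) and dc > alpha * (du + u) and du >= 0


# ds, dc, du taken from the simulation setup slide; alpha and u are assumed values.
ok = timers_satisfy_cpv(ds=30, dc=10, du=1, alpha=2, u=1)
```

With these assumed values the check passes (30 > 22 and 10 > 4); raising the assumed α to 3 makes the first inequality fail (30 > 33 is false).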

  12. Action SW (contd.) • loop freedom: a node not in CW does not execute SW if its next-hop has executed CW • nodes not involved in any CW rank higher than those involved in a CW • consider the expected next route of a neighbor, if available via a CW

  13. CPV (contd.): actions CW and UW • Note: we skip the actions for information synchronization between neighbors here

  14. Example revisited • [figure: nodes d, e, f, g, h, i with waves CW1, SW1, CW2, SW2, UW1]

  15. Outline • Fault propagation in path-vector protocols • CPV • design pattern • protocol • Generic concepts for tolerating high-frequency faults • Analytical & simulation results for CPV • Concluding remarks

  16. Generic concepts • Objective: to define concepts that capture the desired system properties in the presence of continuously-occurring faults • Key issue: to differentiate the impact of faults and protocol actions • Concepts defined: • Perturbed vs. contaminated node • Perturbation size & contamination range • F-containment & F-stabilization

  17. Preliminaries • A system history H is a sequence q.0, (e.1, t.1), q.1, (e.2, t.2), …, q.(k-1), (e.k, t.k), q.k, …, of alternating system states and events, where • an event is either the execution of a protocol action or the occurrence of a fault • each state transition “q.(k-1), (e.k, t.k), q.k” means that event e.k at time t.k changes the system state from q.(k-1) to q.k • at every moment in time, at most one event can occur at a node • Given a system history H and a state q.k in H, the history prefix H(q.k) = the subsequence of H that is between q.0 and q.k • A computation is a system history (or its suffix) where no fault occurs
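
One possible way to hold such a history in code is sketched below. The class and field names (`Event`, `History`, `is_fault`) are hypothetical and states are reduced to plain labels; the sketch only mirrors the definitions of history prefix and computation given above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    label: str      # e.k
    time: float     # t.k
    is_fault: bool  # True for a fault, False for a protocol action


@dataclass
class History:
    states: List[str]    # q.0, q.1, ..., q.k
    events: List[Event]  # events[i] leads from states[i] to states[i + 1]

    def prefix(self, k: int) -> "History":
        """H(q.k): the part of the history between q.0 and q.k."""
        return History(self.states[: k + 1], self.events[:k])

    def is_computation(self) -> bool:
        """A computation is a history in which no fault occurs."""
        return not any(e.is_fault for e in self.events)


h = History(
    states=["q0", "q1", "q2"],
    events=[Event("action@e", 1.0, False), Event("fail-stop@d", 2.0, True)],
)
```

Here `h.prefix(1)` contains only the protocol action and is a computation, while `h` itself is not, because its second event is a fault.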

  18. Preliminaries (contd.) • Given a state q.k and H(q.k), a protocol execution E(q.k) is a set of computations each of which specifies a computation C(q.k, E(q.k)) for a different state q.k’ in H(q.k) that is either the initial state or a state reached immediately after a fault occurs • Given q.k, E(q.k), the stabilization set of q.k, S(q.k, E(q.k)), is the set of nodes that need to change state for the system to stabilize from q.k in the absence of faults

  19. Perturbation vs. contamination • Given “q.(k-1), (e, t), q.k” and E(q.k), • the corruption set of e at t: cpt(e, t, E(q.k)) = S(q.k, E(q.k)) \ S(q.(k-1), E(q.k)) • if e is not a state corruption, the correction set of e at t: cct(e, t, E(q.k)) = (S(q.(k-1), E(q.k)) \ S(q.k, E(q.k))) ∩ V.(q.k) • For every node j ∈ cpt(e, t, E(q.k)), • j is perturbed by e if e is a fault • j is contaminated via e if e is the execution of a protocol action • For every node j ∈ cct(e, t, E(q.k)), • j is corrected by e
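
The two sets are plain set differences over stabilization sets, which the sketch below spells out with Python sets. It assumes the operator lost before V.(q.k) in the transcript is an intersection and that V.(q.k) is the set of nodes present at q.k; the function names are hypothetical.

```python
from typing import Set


def corruption_set(s_prev: Set[str], s_curr: Set[str]) -> Set[str]:
    """cpt: nodes that need to change state after the event but did not before."""
    return s_curr - s_prev


def correction_set(s_prev: Set[str], s_curr: Set[str], nodes_curr: Set[str]) -> Set[str]:
    """cct: nodes that no longer need to change state (assumed: and still exist)."""
    return (s_prev - s_curr) & nodes_curr


def classify(event_is_fault: bool) -> str:
    """Label applied to every node in the corruption set of an event."""
    return "perturbed" if event_is_fault else "contaminated"


s_before = {"e", "f"}          # S(q.(k-1), E(q.k))
s_after = {"f", "g"}           # S(q.k, E(q.k))
present = {"e", "f", "g", "h"}  # V.(q.k), assumed node set
newly_affected = corruption_set(s_before, s_after)
newly_fixed = correction_set(s_before, s_after, present)
```

In this made-up example node g enters the stabilization set and node e leaves it, so g is perturbed or contaminated depending on the kind of event, and e is corrected.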

  20. Perturbed vs. contaminated node • a perturbed node remains perturbed until it is corrected by a fault or the system reaches a legitimate state • a contaminated node remains contaminated until it is corrected by a fault or the execution of a protocol action

  21. Example with existing path-vector protocol • [figure: nodes d, e, f, g, h, i marked as perturbed, corrected, or contaminated]

  22. Perturbation size & contamination range • Given q.k, H(q.k), and E(q.k), the perturbation size at q.k, P(q.k, H(q.k), E(q.k)), is the number of perturbed nodes at q.k • The contamination range of a perturbed region S’ at q.k, R(S’, q.k), is the maximum hop-distance from the corresponding set of contaminated nodes to S’
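
The contamination range can be computed with a breadth-first search started from the perturbed region. The sketch below assumes that the hop-distance from a contaminated node to S’ means the distance to the nearest node of S’, and that the topology is given as an adjacency map; both the representation and the function name are illustrative choices.

```python
from collections import deque
from typing import Dict, Iterable, List


def contamination_range(
    adj: Dict[str, List[str]],
    perturbed: Iterable[str],
    contaminated: Iterable[str],
) -> int:
    """Maximum hop-distance from any contaminated node to the perturbed region."""
    perturbed = list(perturbed)
    dist = {n: 0 for n in perturbed}
    queue = deque(perturbed)
    while queue:  # multi-source BFS from the perturbed region S'
        n = queue.popleft()
        for m in adj.get(n, ()):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    # Contaminated nodes unreachable from S' are ignored in this sketch.
    return max((dist[c] for c in contaminated if c in dist), default=0)


# Chain d - e - f - g - h - i, as in the running example.
chain = {
    "d": ["e"], "e": ["d", "f"], "f": ["e", "g"],
    "g": ["f", "h"], "h": ["g", "i"], "i": ["h"],
}
r = contamination_range(chain, perturbed={"d"}, contaminated={"e", "f", "g"})
```

On this chain, with d perturbed and e, f, g contaminated, the range is 3 hops.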

  23. F-containment & F-stabilization • A system is F-containing if and only if for every perturbed region S’ at an arbitrary state q.k, R(S’, q.k) = O(F(|S’|)), where F is a function • A system is F-stabilizing if and only if starting at an arbitrary state q.k with an arbitrary H(q.k) and E(q.k), the system computation is guaranteed to reach a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults, where F is a function

  24. Outline • Fault propagation in path-vector protocols • CPV • design pattern • protocol • Generic concepts for tolerating high-frequency faults • Analytical & simulation results for CPV • Concluding remarks

  25. Analytical results • L = {q: every up node has found its best route at state q} • Properties of CPV • the contamination range R(S’, q.k) of every perturbed region S’ at any state q.k is O(|S’|) • the distance to which a state of a node i propagates is proportional to the time the state lasts • starting at any state q.k with an arbitrary H(q.k) and E(q.k), the system where CPV is used reaches a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults • F is a function reflecting the routing policies used, and is linear if every node chooses a shortest path

  26. Simulation results • SSFNet, a network simulator with standard-conforming protocol implementations • Simulation setup • parameter setup for CPV and BGP • CPV: ds = 30 sec, dc = 10 sec, du = 1 sec • BGP: with MRAI timers (30 seconds) and RFD • Fault scenario: a node repeatedly fail-stops and then rejoins every 30 seconds • Internet-type network topology • the shortest-path-first policy

  27. Contamination range and the number of nodes affected

  28. Time taken to stabilize

  29. Stability adaptiveness

  30. Outline • Fault propagation in path-vector protocols • CPV • design pattern • protocol • Generic concepts for tolerating high-frequency faults • Analytical & simulation results for CPV • Concluding remarks

  31. Concluding remarks • Frequent transient faults do happen (especially when systems work under unexpected conditions) • fault containment and stabilization are desirable as well as possible • Quality of service and system behavior during stabilization • perspectives other than convergence only: time, space, stability, etc. • modeling issues: descriptive, derivative

  32. Low-frequency faults • [figure panels: destination joins; destination fail-stops]
