Micro-loop Prevention Methods

Micro-loop Prevention Methods draft-bryant-shand-lf-conv-frmwk-00.txt draft-zinin-microloop-analysis-00.txt

Outline of Talk • Convergence Strategy and Motivation • Solution Taxonomy • Existing Solution Space • Summary

Traditional convergence strategy • Switch to new as fast as you can independently • Required for failures • Strategy optimized for this case • traffic to failed element is lost • so speed is essential • Used for everything else • common method • traffic can be lost due to loops Fast-Reroute prevents traffic loss due to failure but loops can still cause loss.

Micro-Loop Properties • Independent decisions can cause micro-loops. • Loops may occur between pairs of nodes or cycles of nodes. • Duration depends on relative time to update FIBs. • Implementation differences • Number of affected destinations • Propagation time • Loss due to Loop duration may be longer (an order of magnitude) than Loss during the Fast Reroute failover.

Controlled convergence • Made feasible for failure case by fast reroute • Traffic is not lost so can afford to take time • Can use common method for both failure and management change events • Traditional convergence optimized for failure case without fast-reroute. We can do better… (but keep traditional as safe fall-back for single failure assumption violation.)

Solution taxonomy • Controlled Information flow • Incremental cost change • Controlled Distributed Behavior • Synchronized FIB installation • Ordered FIB changes • Path locking

Method Comparison

Incremental cost change • A change in a link cost of x can only cause loops whose “cyclic” cost is <=x • Minimum cycle is 2 (1 in each direction) • Hence cost change of 1 can never cause a loop. • Where minimum cycle is larger, larger increments can be used. • Once cost reaches cost of alternate path no more loops possible. No Cooperation Required – But Can Take Hours

Synchronized FIB swap • Network synchronized change-over at predetermined time • Signal/determine time to change • Network Synchronized Time (NTP is there) • Either Two FIBs for fast swap • Substantial hardware implications • Or FIB update “fast-enough” from change-over time. • Dependent on NTP Conceptually simple with minimal signalling – NTP dependency & implementation concerns

Ordering by signalling alone • On change, tell old primary neighbors to wait for you • Wait for all neighbors as instructed, install FIB, and tell your old primary neighbors. • Assumes a single non-SRLG failure • Otherwise communication per destination is required No Estimation Required for FIB Compute/Install - Require Reliable Fast Signalling and Non-Trivial Protocol Extensions

Ordered FIB changes • For any isolated link/node change • Determine “safe” ordering for FIB installation • bad news: update from edge to failure, • good news: update from change to edge • Each router computes its “rank” with respect to the change. • Delays for a number of worst-case FIB compute/install times proportional to its rank.

Computing the ordering • Single Reverse SPF rooted at change node • Use old SPT to determine relevant node • For bad news:- count maximum depth of sub-tree below you • For good news:- count maximum hops to change

Delay Proportional to Network Diameter • For Good News, rSPF gives necessary depth. • For Bad News, rSPF is overly pessimistic for some topologies. • Strategies to reduce unnecessary delay • Prune rSPF by only considering the branch across the failure – but still too pessimistic. • Run SPF rooted at edge nodes to correctly prune them – but doesn’t scale. • Compare rSPFs before and after failure • Avoids all micro-loops and requires single FIB install. • Delay dependent on network diameter so may be unacceptable.

Signalling optimization to Reduce delay • Use actual FIB compute/install instead of worst-case • In many cases, actual delay is 0 b/c no change needed. • Signal to parents in rSPF when • Nothing to do, or • Completed FIB changes • Can change FIB when received signal from all children (or when delay expires) • Only an optimization • Loss of signals falls back to delay based

SRLG Concerns • Diverse failures may require mutually incompatible ordering • Different orderings for individual destination sets may help • Need Rules to merge multiple rSPFs

Ordered SPF Summary • No forwarding changes required. • No signalling required at time of change. • Complete prevention of loops for isolated node or link changes. • Requires cooperation from all routers • May delay re-convergence for tens of seconds (unless optional signalling used) • SRLGs require per destination delays and may delay re-convergence more.

Path Locking Framework • Obtain a fixed convergence delay regardless of network. • Avoid ordering issue by providing transitional paths. • Handles SRLGs • Different methods to • Determine/Create transitional paths • Direct traffic to use transitional paths Standard trade-off of complexity versus coverage. • Tunnels for Transitional Paths • Safe Neighbors for Transitional Next-Hops • Marked Packets to Use Transitional Topology • U-turn Packets to Use New Topology

Time-Line of Convergence • Change Discovery Time – At this point, all routers know about the change. Routers install transitional path support. • For some methods, immediately start use of self-determined transitional paths. • Use Transitional Paths Time (1 worst-case FIB compute/install later) – All routers use transitional paths, if available, and new primary next-hops otherwise. • Lock to New Topology Time (1 worst-case FIB compute/install later) – All routers use new primary next-hops. All micro-loops avoided if a transitional path always exists.

Create Tunnels • Requires tunnel computation/creation at topology change • Old topology Locking • Tunnel to the upstream side of the failure • Single tunnel for all affected destinations (if link/node failure). • New topology locking • Tunnel to first unaffected router on new primary path • Tunnels provide a transitional path that can traverse non-supporting routers. • Non-supporting routers can only loop locally originated traffic.

Safe Neighbors • Find a safe neighbor to use as a transitional next-hop. • Safety condition is a neighbor that is loop-free on old topology and a downstream path on new topology. • If two neighboring routers don’t have a safe neighbor, a micro-loop can form on that link. • Analysis of real topologies shows pretty good coverage. • Local micro-loops possible with non-supporting routers.

Typical Coverage

Packet Marking • Can mark packets to force forwarding according to a particular topology. • Topology can be new or old. • All marking starts at the Use Transitional Paths Time • If using new topology, traffic on new topology after 1+ worst-case FIB compute/install delay.

U-turn Packet • Create transitional next-hop by directing U-turn packets to the new primary next-hops. • At Use Transitional Paths Time, send traffic to new primaries (potentially explicitly marked as U-turn packets). • If implicitly determined U-turn packets, doesn’t require marking. • Explicit method for signalling support of U-turns

Lots of Possibilities… • What are important criteria? • Time to be converged • Affects single failure assumption • Network Stability • Ballpark requirement is 10s • Simplicity • Support for SRLGs • No additional mechanisms beyond IP (but coverage may suffer…) • Common additional mechanisms for this and IPFRR advanced methods. • Should also work for LDP

Conclusions & Next Steps • Incremental Cost Change is impractical. • Synchronized FIB Swap – what is the implementation complexity? Implications of coupling NTP to routing? • Ordered SPF – long delay and poor SRLG support. Is that enough to be an issue? • Path Locking • Seem most promising • Many possibilities to get similar results Please send suggestions and comments to the list. This solution set may not be complete.

Micro-loop Prevention Methods