Network Failures and Their Impacts; Fast IGP Routing Convergence and IP Fast Re-Routing (IP FRR)

Outline • Network Failures and Intra-Domain Routing (IGP) • ISP Link Failure Studies • Failure Characteristics, Causes and Impacts • IGP Fast Routing Convergence • Speed up routing convergence after routing changes • IP Fast Re-Routing (IP FRR) • Fast Rerouting Schemes: Failure Insensitive Routing • Other Schemes • Readings: Please do the required readings

Why Networks Fail • Many, many possible reasons and causes • Human Errors • misconfigurations • other mistakes: e.g., “let’s see what that red button does” • Software Bugs • buggy implementation, incompatibility, … • Hardware failures • flaky interfaces, link errors, fiber cuts, router crashes due to CPU overload or running out of memory, … • Malicious attacks • Network Overload • traffic surges causing network congestion, … • Others: e.g., natural disasters, major accidents • E.g., Baltimore tunnel fire, Ohio train accident, …

Understanding Network Link Failure Characteristics • Failure Characteristics Within an ISP network • How often do links/routers fail? • How many? Are they random or correlated? • How long do they last? • … • What about inter-domain or Internet-wide? • What causes BGP to update/withdraw routes? • Destination network down? AS internal failures? BGP session resets? Policy changes? … • How do we measure, detect and analyze network failures? • How do we troubleshoot network failures and perform root-cause analysis? • How do we design more robust and resilient mechanisms?

This Lecture: Focusing on Impact of Failures within an ISP Network • IP networks are becoming the dominant and “converged” information delivery substrate, displacing telephone networks, and eventually cable TV? • Need better “service availability” • Telephone networks: service availability metric: 5 9’s, i.e., 99.999% • What about IP networks? • Effect of IP network failures: • routers lose “reachability”: i.e., no forwarding entries • or existence of transient/permanent forwarding loops • What are the impacts of network failures? • In particular, on VoIP services
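To make the "5 9's" figure concrete, the downtime budget it implies can be worked out directly. The sketch below is plain arithmetic; the function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 365.0) -> float:
    """Maximum downtime (in minutes) permitted over a period at a given availability."""
    return (1.0 - availability) * period_days * 24 * 60

five_nines = allowed_downtime_minutes(0.99999)   # telephone-grade target
three_nines = allowed_downtime_minutes(0.999)    # shown only for comparison

print(f"99.999%: {five_nines:.2f} minutes of downtime per year")
print(f"99.9%:   {three_nines:.1f} minutes of downtime per year")
```

At five nines the yearly budget is only a few minutes, so a single outage lasting tens of minutes would consume several years' worth of allowance.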

Failures Affect Link Loads • Many ISP networks are “over-provisioned” so as to handle network failures: • Many claim: normal load utilization < 50% • But still high variability in link utilization: • Can find a link w/ load > 50% every 15 minutes; > 90% every 8 days

Traffic Potholes or Blackholes • Sprint Measurement Study Anecdote: • Average delay over 5 sec intervals • Traffic was blackholed for more than 10 minutes • It took about 40 minutes for the network to reach a stable state • Root Cause: • Route Misconfigurations!

Routing Loops under Failures • Loops due to link failures/new route advertisements • Measurements from 3 backbone links • 25% of packets caught in a loop in one failure instance • 1% lost due to expired TTL; those that escape have long delays

Sprint Link Failure Study • Link “failures” occur fairly frequently, well spread over time • Inter-PoP links are more stable than intra-PoP links • Many intra-PoP link “failures” are due to planned events, with less impact on traffic due to “full-mesh” intra-PoP topology • Most link failures tend to be transient • Excluding “planned” failures • Most are single link failures • Some are correlated link failures • Link failure characteristics vary from link to link • Depending on the causes of failures, e.g., flaky interfaces, router overloads, fiber cuts, etc. • Impact of link failures • OC48 link down for 6 seconds: 3 million packets may be lost! • significant impact on applications such as VoIP, on-line gaming
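The "3 million packets" figure can be sanity-checked with a back-of-the-envelope calculation. The slide does not state the link load or packet size, so the 50% utilization and 300-byte average packet below are assumptions chosen only to show how a number of that order arises; the function name is likewise illustrative.

```python
OC48_BPS = 2.488e9  # nominal OC-48 line rate, bits per second

def packets_lost(outage_s: float, utilization: float, avg_packet_bytes: float,
                 line_rate_bps: float = OC48_BPS) -> float:
    """Packets that would have crossed the link during the outage."""
    bits_in_flight = line_rate_bps * utilization * outage_s
    return bits_in_flight / (8 * avg_packet_bytes)

# Assumed: link half loaded, 300-byte average packet size.
lost = packets_lost(outage_s=6, utilization=0.5, avg_packet_bytes=300)
print(f"about {lost / 1e6:.1f} million packets")
```

Under these assumptions the result lands close to 3 million; a more heavily loaded link or smaller packets would push it higher.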

Methodology: Integrated Monitoring Sprint Measurement Study [I+02,M+04] • Tier-1 ISP backbone (600+ nodes) • Passive route listener software to collect IS-IS & BGP updates • IPMON passive traffic monitoring & active probes • SONET alarm logs; router configurations and BGP policies

IGP Failure Events • IP link: adjacency between two IS-IS routers • Link Failure: loss of this adjacency • Results shown in the following slides only include • US inter-PoP links (OC48) • Failures less than 24 hrs long

Sprint Study: Link Failure Frequencies

Sprint Study: Duration of Failures

Sprint Study: Failures across Links

Scatter Plot of US Failure Events • Apr. – Nov. 2002

Maintenance (or Planned Failures) • Weekly schedule (Mondays 5am – 2pm UTC): 20% of failures

Examples of Planned Failures • Upgrades • Changing link to higher capacity • Loading new operating system on a router • Swapping out an old interface card • Maintenance • Fixing a flaky optical amplifier • Configuration changes that require a reboot • Responsible for 50% of intradomain failures • Cable intrusions • Construction activities near a fiber

Failure Classification

Anomalies Found in Shaikh04 paper • Intermittent hardware problem • Router periodically losing OSPF adjacencies • Risk of network partition if 2nd failure occurred • External link flaps • Congestion on edge link causing lost messages • Lost adjacency leading to flapping routes • Configuration errors • Two routers assigned the same IP address • Inefficient config leading to duplicate LSAs • Vendor implementation bug • More frequent refreshing of LSAs than specified

Converging After a Failure • Failure detection • Router recognizes an incident link has failed • Failure notification • Router informs other routers about the change • Path re-computation • Routers compute new paths avoiding the link • Forwarding-table update • Routers update their forwarding tables • Data traffic starts to flow over the new path • AT&T, Sprint studies show • convergence time from 100s of milliseconds up to a few seconds • [Figure labels: routing convergence, forwarding convergence]
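The path re-computation step can be sketched as a shortest-path run over the topology with the failed link excluded. The following is a minimal illustration using Dijkstra's algorithm, not any router's actual SPF code; the four-node topology, its weights and the function name are hypothetical.

```python
import heapq

def shortest_paths(graph, source, failed=frozenset()):
    """Dijkstra from source, skipping failed links; returns (dist, first_hop)."""
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if (u, v) in failed or (v, u) in failed:
                continue
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                # The first hop is inherited from u unless u is the source itself.
                first_hop[v] = v if u == source else first_hop[u]
                heapq.heappush(heap, (d + w, v))
    return dist, first_hop

# Hypothetical topology: A-B-D is the short path, A-C-D the longer detour.
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 2},
    "D": {"B": 1, "C": 2},
}

_, before = shortest_paths(graph, "A")
_, after = shortest_paths(graph, "A", failed={("B", "D")})
print("next hop to D before failure:", before["D"])
print("next hop to D after failure: ", after["D"])
```

Every router runs such a computation independently once it learns of the failure, which is why routers can briefly hold different next hops for the same destination.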

All Together: Looking Inside Router • [Figure labels: interface cards, switching fabric, route processor (CPU), OSPF process (LSA processing, LSA flooding, topology view, SPF calculation, FIB update), FIB, forwarding; LSA, LS Ack and data packets]

Bad Things Happen During Convergence • Transient inconsistencies • Routers have different views of the network • Forwarding decisions may be inconsistent • This creates “transient forwarding loops” • Effects on data traffic • Black-hole: packet loss • Loops: packets going in circles • Delay: packets going on very long paths • Out-of-order: new packets arrive before old ones • Want to minimize convergence delay • … and especially the effects on the data traffic

Example: Transient Forwarding Loop (or Micro-loop) • Set of routers disagree • One router acting on old information • Another router acting on new information • [Figure labels: s, d, “Loop!”]
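A micro-loop of this kind can be reproduced with two forwarding tables that disagree. In the hypothetical three-router sketch below, link A-D has failed: A has already switched to the detour via B, while B still points at A. The router names, the tables and the `forward` helper are assumptions made for illustration.

```python
def forward(fibs, src, dst, ttl=8):
    """Follow per-router next hops hop by hop; return (outcome, path taken)."""
    node, path = src, [src]
    while node != dst:
        if ttl == 0:
            return "ttl_expired", path
        nxt = fibs[node].get(dst)
        if nxt is None:
            return "blackholed", path
        node = nxt
        path.append(node)
        ttl -= 1
    return "delivered", path

# During convergence: A acts on new information, B on old information.
transient = {"A": {"D": "B"}, "B": {"D": "A"}}
# After convergence: B has also recomputed and uses its direct link to D.
converged = {"A": {"D": "B"}, "B": {"D": "D"}}

print(forward(transient, "A", "D", ttl=4))
print(forward(converged, "A", "D", ttl=4))
```

In the transient state the packet bounces between A and B until its TTL runs out; once B updates its table the same packet is delivered in two hops.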

Reducing Impact of Link Failures Assuming Traditional Link-State Protocols • Improving convergence time of control/data plane • Reducing timer value for HELLO messages • Can achieve sub-second convergence time • 200 msecs common target, threshold for VoIP quality, do-able! • However, • Still reacts to failure events, can’t prevent packet loops or losses during convergence • may amplify the effect of short “transient” failures that last less than a second • Prevent “micro loops” during transient routing convergence periods • One solution: using “ordered FIB updates” • requires coordination among routers, adds complexity, delays convergence • Dealing with “Planned Failures”?
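The effect of HELLO timers on the detection stage can be estimated with simple arithmetic: a neighbor is declared down after a fixed number of consecutive HELLOs are missed. The timer values below are illustrative assumptions, not figures from the studies cited in this lecture, and the function name is hypothetical.

```python
def worst_case_detection_s(hello_interval_s: float, dead_multiplier: int) -> float:
    """Time until a neighbor is declared down once its HELLOs stop arriving."""
    return hello_interval_s * dead_multiplier

slow = worst_case_detection_s(10.0, 4)   # assumed conservative timers
fast = worst_case_detection_s(0.05, 3)   # assumed aggressive timers

print(f"conservative timers: {slow:.0f} s")
print(f"aggressive timers:   {fast * 1000:.0f} ms")
```

Detection is only the first of the convergence stages, so meeting a 200 ms target overall leaves the remaining stages (flooding, SPF, FIB update) whatever is left of the budget. Aggressive timers also make it more likely that a sub-second glitch is treated as a failure.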

Reducing Impact of Link Failures Using MPLS • Can pre-compute back-up paths • Often done using the “link protection” scheme • For each link, there is an MPLS protection (back-up) path • But • Need to change “forwarding plane” of routers • Many networks don’t have MPLS deployed • Question: Can we perform fast rerouting using “traditional” link state routing protocols without resorting to MPLS?

Fast Re-Routing using Link State Protocols [Nelakuditi et al] • Motivations • Most common link failures are transient single-link failures • Hastily reacting to such failures by LSA flooding may do more harm than good, causing network instability! • Suppress such failures unless they last longer than a threshold • But we want to be able to re-route affected packets along a back-up path, not simply drop them! • FIFR (Failure Insensitive Fast Re-routing): nearly 100% forwarding continuity • prepare for (instead of react to) failures • adapt to changes while ensuring stability • Other Advantages: • no change to forwarding plane • minimal change to routing plane

What is Interface Specific Forwarding? • Interface-independent forwarding • destination → next-hop • Each line card has a copy of the same FIB • Interface-specific forwarding • <incoming interface, destination> → next-hop • Different forwarding entries at each line card • Forwarding operation remains the same
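The difference between the two schemes is only in which table a line card holds, not in how the lookup is done. A minimal sketch, with hypothetical interface names, destinations and next hops:

```python
# Interface-independent forwarding: one table, keyed by destination only.
fib = {"d1": "B", "d2": "C"}

# Interface-specific forwarding: one table per line card (incoming interface),
# so the same destination may map to different next hops.
isf = {
    "if_A": {"d1": "B", "d2": "C"},
    "if_B": {"d1": "C", "d2": "C"},  # packets for d1 that arrive from B avoid B
}

def forward(table: dict, destination: str) -> str:
    """The lookup itself is identical in both schemes."""
    return table[destination]

def forward_isf(incoming: str, destination: str) -> str:
    """Each line card consults only its own table."""
    return forward(isf[incoming], destination)

print(forward(fib, "d1"))           # same answer on every interface
print(forward_isf("if_A", "d1"))    # usual arrival interface
print(forward_isf("if_B", "d1"))    # unusual arrival interface
```

Because the per-packet operation is still a single destination lookup, the forwarding plane needs no change; only the contents of each line card's table differ.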

ISF Enables Local Rerouting • Infer failures based on interface and destination • Find the farthest keylink whose failure would cause a packet to arrive at the unusual interface along the reverse shortest path to the destination • Precompute interface-specific forwarding tables • Avoid the keylink in choosing next hop for a destination • Failure Inferencing based Fast Rerouting • IP fast reroute without explicit routing/tunneling

Illustration: No Failure Scenario

Illustration: Local Rerouting without ISF • Link B – E fails! • New routing table at router B after detecting the failure

Illustration: Local Rerouting with ISF

ISF Table Computation • Infer failed links from packet’s arrival at an interface • Keylink: link whose failure causes a packet to d to arrive at i from j • A link u->v is a candidate keylink if • with u->v, j is a next hop from i to d • without u->v, edge j->i is along the shortest path from u to d • Keylink is the farthest one from i among candidate keylinks • Avoid keylink in choosing the destination’s next hop • next hops to d from i when packet arrives at i from j • Failure inferencing is not done per packet • ISF table entries computed upon link state updates
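The keylink rules above can be turned into a small brute-force computation. This is a sketch under stated assumptions, not the authors' algorithm: it assumes symmetric weights and unique shortest paths, it only examines links on i's normal shortest path to d (so "farthest" is simply the last candidate along that path), and the five-node ring S-A-B-D-C-S with its weights is hypothetical.

```python
import heapq

def shortest_path(links, src, dst, removed=None):
    """Dijkstra over undirected weighted links; returns the node list or None."""
    adj = {}
    for (a, b), w in links.items():
        if removed in ((a, b), (b, a)):
            continue
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def hops(path):
    return list(zip(path, path[1:]))

def keylink(links, i, j, d):
    """Farthest link u->v whose failure makes a packet for d arrive at i from j."""
    normal = shortest_path(links, i, d)
    if len(normal) < 2 or normal[1] != j:
        return None  # j is not i's next hop to d, so arrival from j is usual
    key = None
    for u, v in hops(normal):  # links ordered by distance from i
        detour = shortest_path(links, u, d, removed=(u, v))
        if detour and (j, i) in hops(detour):
            key = (u, v)  # later candidates are farther from i
    return key

def isf_next_hop(links, i, j, d):
    """Next hop at i for packets to d arriving from j, avoiding the keylink."""
    path = shortest_path(links, i, d, removed=keylink(links, i, j, d))
    return path[1] if path and len(path) > 1 else None

links = {("S", "A"): 1, ("A", "B"): 1, ("B", "D"): 1, ("S", "C"): 2, ("C", "D"): 2}
for i, j in [("A", "B"), ("S", "A"), ("C", "S"), ("A", "S")]:
    print(i, "from", j, "-> keylink", keylink(links, i, j, "D"),
          "next hop", isf_next_hop(links, i, j, "D"))
```

In this example, if B-D fails and B sends packets for D back toward A, then A and S each infer B-D as the keylink from the unusual arrival interface and steer the packet around the ring through C, without any LSA being flooded. The loop over `(i, j)` pairs mirrors the point that entries are precomputed per interface when link state changes, not per packet.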

Illustration: ISF Table Computation • [Figure labels: {B-E}, {}, {}, {E-F}] • When no more than one link failure is suppressed in a network with symmetric weights, FIFR always forwards successfully to a destination if a path to it exists

Operations under FIFR

Handling both Link and Node Failures • Infer keynodes instead of keylinks • A node u is a candidate keynode if • with u, j is a next hop from i to d • without u, edge j->i is along the shortest path from the upstream node of u (w.r.t. the path from i to u) to d • Keynode is the farthest one from i among candidates • When no route to destination without a node • Node adjacent to the failure assumes link failure • Non-adjacent nodes treat it as adjacent node failure • May cause loops when destination is indeed not reachable • Protects against non-partitioning single failures

Networks with Asymmetric Link Weights • FIFR can handle asymmetric link weights • By forcing packets to take reverse shortest path • Provided links are bidirectional • Keynode computation based on rSPF • A node u is a candidate keynode if • with u, j is a next hop from i to d • without u, edge i->j is along the shortest path from d to the upstream node of u (w.r.t. the path from i to u) • Keynode is the farthest one from i among candidates • Works with both symmetric and asymmetric weights

Networks with Broadcast Links • FIFR applicable to networks with broadcast links • A broadcast link is modeled with point to point links from/to the designated router • Adjacent failures • Broadcast link failure treated as that of designated router • Non-adjacent failures • Not necessary to know the previous hop of a packet to compute interface-specific keynode per destination • Failure inferencing can be done as before

Summary of FIFR • Fast reroute under any single failure • Without changing/encapsulating IP datagram • May cause loops under multiple failures • With ISF, guaranteed protection against single failures or loop-freedom under multiple failures, but not both • Blacklist-based Interface Specific Forwarding • Needs interface-specific forwarding • Two forwarding entries per destination • O(|E| log²|V|) to compute forwarding entries

Other Approaches • See the optional reading [GRY07] for more details. • Loop-Free Alternate (LFA): fast re-routing only when direct link to (default) next-hop fails • simpler computation – we know exactly which link to remove when computing the new next-hop; but protection is limited • using IP-tunnels, etc. • U-turn: allow protection over multiple hops • Using “Not-Via” Addresses • Multi-topology routing • routers and links (with possibly different link weights) belong to multiple topologies • E.g., a default topology, plus “back-up” topologies with various (assumed) links removed (or new link weights) • packets are “marked” with “topology id” for look-up • IETF Fast Rerouting and MT-Routing Working Groups
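The LFA idea, and why its protection is limited, can be illustrated with the basic loop-free condition from RFC 5286: a neighbor N of S is a loop-free alternate for destination D if dist(N, D) < dist(N, S) + dist(S, D), i.e. N's own shortest path to D does not go back through S. The topology and function names below are hypothetical, and the sketch covers link protection only.

```python
import heapq

def distances(graph, src):
    """Dijkstra: shortest distance from src to every reachable node."""
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def loop_free_alternates(graph, s, d, primary):
    """Neighbors of s (other than the primary next hop) that will not loop back."""
    d_s = distances(graph, s)
    alternates = []
    for n in graph[s]:
        if n == primary:
            continue
        d_n = distances(graph, n)
        if d_n[d] < d_n[s] + d_s[d]:
            alternates.append(n)
    return alternates

# Hypothetical topology: S reaches D via E; N has its own path to D, M does not.
graph = {
    "S": {"E": 1, "N": 1, "M": 1},
    "E": {"S": 1, "D": 1},
    "N": {"S": 1, "D": 2},
    "M": {"S": 1},
    "D": {"E": 1, "N": 2},
}
print(loop_free_alternates(graph, "S", "D", primary="E"))
```

Here N qualifies but M does not, because M can only reach D by sending the packet back to S. When no neighbor satisfies the inequality, LFA offers no protection for that destination, which is the gap the U-turn, Not-Via and multi-topology schemes aim to close.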
