Congestion Avoidance & Control for OSPF Networks (draft-ash-manral-ospf-congestion-control-00.txt)

  1. Congestion Avoidance & Control for OSPF Networks
     (draft-ash-manral-ospf-congestion-control-00.txt)
     Anurag Maunder, Sanera Systems, amaunder@sanera.net
     Jerry Ash, AT&T, gash@att.com
     Gagan Choudhury, AT&T, choudhury@att.com
     Vera Sapozhnikova, AT&T, sapozhnikova@att.com
     Vishwas Manral, NetPlane Systems, vishwasm@netplane.com
     Mostafa Hashem Sherif, AT&T, mhs@att.com

  2. Outline (draft-ash-manral-ospf-congestion-control-00.txt)
     • problem:
       • concerns over scalability of IGP link-state protocols (e.g., OSPF)
       • much evidence that LS protocols cannot recover from large failures & widespread loss of topology database information
         • failure experience
         • vendor analysis
         • simulation & modeling
     • propose protocol mechanisms to address problem
       • throttle LSA updates/retransmissions
       • detect & notify congestion state
       • neighbor nodes throttle LSA updates/retransmissions
       • keep adjacencies up
       • database backup & resynchronization
     • proprietary implementations of mechanisms have improved scalability/stability
       • need standard features for uniform implementation & interoperability
     • issues discussed on list

  3. Background & Motivation
     • failure experience
       • LS routing protocols cannot recover from large ‘flooding storms’
       • triggered by wide range of causes: network failures, bugs, operational errors, etc.
       • flooding storm overwhelms processors, causes database asynchrony & incorrect shortest-path calculation, etc.
       • AT&T has experienced several very large LS protocol failures (4/13/1998, 7/2000, 2/20/2001, described in I-D)
     • vendor analysis of LS protocol recovery from total network failure (loss of all database information in the specified scenario, 400 nodes, etc.)
       • recovery time estimates up to 5.5 hours
       • expectation is that vendor equipment recovery not adequate under large failure scenario
     • network-wide event simulation model [choudhury]
       • medium to large flooding storms cause network to recover with difficulty and/or not recover at all
       • model validated -- results match actual network experience

  4. Failure Experience: AT&T Frame Relay Network, 4/13/98
     • cause & effect
       • administrative error coupled with a software bug
       • result was the loss of all topology database information
       • the link-state protocol then attempted to recover the database with the usual Hello & topology state updates (TSUs)
       • huge overload of control messages kept network down for very long time
     • several problems occurred to prevent the network from recovering properly (based on root-cause analysis)
       • very large number of TSUs being sent to every node to process, causing general processor overload
       • route computation based on incomplete topology recovery; routes generated based on transient, asynchronous topology information & then in need of frequent re-computation
       • inadequate work-queue management to allow processes to complete before more work is put into the process queue
       • inability to access node processors with network management commands due to lack of necessary priority of these messages
     • worked with vendor to make protocol fixes to address problems
       • along the lines suggested in the I-D

  5. Proposed Protocol Mechanisms: Throttle LSA Updates/Retransmissions
     • detect node congestion by
       • length of internal work queues
       • high processor occupancy & long CPU busy times
     • notify congestion state to other nodes
       • use TBD packet to convey congestion signal
     • when a node detects congestion from a neighbor, progressively decrease flooding rate (sketched below), e.g.
       • double LSA_RETRANSMIT_INTERVAL for low congestion
       • quadruple LSA_RETRANSMIT_INTERVAL for high congestion
     • simulation analysis shows proposed mechanisms perform effectively [choudhury]
       • deals better with non-linear failure modes than statistical detection/notification methods
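
A minimal Python sketch of the detect/notify/throttle loop just described. LSA_RETRANSMIT_INTERVAL and the double/quadruple backoff factors come from the slide; the queue-length and CPU thresholds, the three-level classification, and all function names are illustrative assumptions, not part of the draft.

    LSA_RETRANSMIT_INTERVAL = 5.0   # seconds; base retransmission interval

    def local_congestion_level(work_queue_len: int, cpu_busy: float) -> str:
        """Classify this node's congestion from internal work-queue length
        and processor occupancy; the thresholds are placeholders."""
        if work_queue_len > 10_000 or cpu_busy > 0.95:
            return "high"
        if work_queue_len > 1_000 or cpu_busy > 0.80:
            return "low"
        return "none"   # level would be conveyed to neighbors in the TBD packet

    def neighbor_retransmit_interval(signaled_level: str) -> float:
        """Progressively slow LSA retransmissions toward a neighbor that
        has signaled the given congestion level."""
        if signaled_level == "high":
            return 4 * LSA_RETRANSMIT_INTERVAL   # quadruple for high congestion
        if signaled_level == "low":
            return 2 * LSA_RETRANSMIT_INTERVAL   # double for low congestion
        return LSA_RETRANSMIT_INTERVAL           # uncongested: normal rate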

  6. Issues Discussed on List
     • is there a problem (need to prevent catastrophic network collapse)?
       • most seem to agree there is a problem
       • several have observed ‘LSA storms’ & their ill effects
         • storms triggered by hardware failure, software bug, faulty operational practice, etc. -- many different events
         • sometimes network cannot recover
         • unacceptable to operators
       • vendors invited to analyze failure scenario given in draft
         • no response yet
     • how to solve problem?
       • better/smarter implementation/coding of protocol within current specification
         • e.g., ‘never losing an adjacency solves problem’
         • these are proprietary, single-vendor implementation extensions
       • standard protocol extensions
         • for uniform implementation
         • for multi-vendor interoperability
         • already demonstrated with proprietary, single-vendor implementations

  7. Issues Discussed on List
     • what protocol extensions?
       • not just a ‘signaling congestion message on the wire’ but also the response
         • need uniform response to the congestion signal (‘slow down by this much’) to be effective
         • rather than an ‘implementation dependent’ response
         • like the helper-router response to a ‘grace LSA’ from a congested router in hitless restart
     • how to evaluate effectiveness of proposals?
       • expert analysis based on experience
       • simulation
         • a couple of ‘academic’ & ‘shaky simulation’ comments
         • but validated simulations used widely
           • for network design of routing features, NM features, congestion control, etc.
           • for many years
           • many large-scale network design examples (e.g., ‘Dynamic Routing in Telecommunications Networks’, McGraw Hill)
       • ‘white-box’ approach
         • implement & test in the lab
       • expert analysis, simulation, white-box all useful

  8. Issues Discussed at IETF-55 Routing Area Meeting & MPLS WG Meeting
     • box builders’ view:
       • ‘stop intruding into our box’
       • design choices should be made by box builders
       • nothing wrong with current way of building boxes
     • box users’ view:
       • still observe major failures
         • most agree there is a problem (from list discussion)
         • box-builder/vendor analysis shows unacceptable failure response (in draft)
         • box-builders/vendors invited to analyze scenario in draft
       • box builders’ approach doesn’t work to prevent failures
         • boxes need a few critical, standard protocol mechanisms to address the problem
         • have gotten vendors to make proprietary changes to fix problem
       • require standard protocol extensions
         • for uniform implementation
         • for multi-vendor interoperability
       • user requirements need to drive solution to problem

  9. Conclusions
     • problem:
       • concerns over scalability of IGP link-state protocols
       • evidence that LS routing protocols (e.g., OSPF) currently cannot recover from large failures & widespread loss of topology database information
         • problem is flooding, database asynchrony, shortest-path calculation, etc.
         • evidence based on failure experience, vendor analysis, simulation & modeling
     • propose protocol mechanisms to address problem, e.g.
       • throttle LSA updates/retransmissions
       • detect & notify congestion state
       • neighbor nodes throttle LSA updates/retransmissions
     • simulation analysis shows effectiveness of proposed changes [choudhury]
     • propose draft as an OSPF WG document
       • refine/evolve proposed protocol extensions

  10. Backup Slides

  11. Proposed Congestion Control Mechanisms
      • throttle LSA updates/retransmissions
        • detect & notify congestion state
          • congested node signals other nodes to limit rate of LSA messages sent to it
        • neighbor nodes throttle LSA updates/retransmissions
          • automatically reduce rate under congestion
      • keep adjacencies up
      • database backup & resynchronization
        • topology database automatically recovered from loss based on local backup mechanisms
        • allows a node to recover gracefully from local faults on the node
      • prioritized processing of Hello & LSA Ack messages (Choudhury draft)

  12. Keep Adjacencies Up
      • increase adjacency-break interval under congestion
        • goal is to avoid breaking adjacencies by increasing the wait interval for non-receipt of Hello messages
        • if a node detects congestion from a neighbor & no packet is received in NODE_DEAD_INTERVAL, wait an additional ADJACENCY_BREAK_INTERVAL before calling the adjacency down
      • throttle setup of link adjacencies
        • define MAX_ADJACENCY_BUILD_COUNT = maximum number of adjacencies a node can bring up at one time
      (both rules are sketched after this slide)
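
A minimal sketch of both rules, using the timer names from the slide. The concrete values (40-second intervals, a build count of 4) and the function names are illustrative placeholders, not values from the draft.

    NODE_DEAD_INTERVAL = 40.0         # seconds; analogous to OSPF RouterDeadInterval (placeholder value)
    ADJACENCY_BREAK_INTERVAL = 40.0   # additional grace period under congestion (placeholder value)
    MAX_ADJACENCY_BUILD_COUNT = 4     # cap on simultaneous bring-ups (placeholder value)

    def adjacency_should_drop(last_rx: float, neighbor_congested: bool,
                              now: float) -> bool:
        """Declare the adjacency down only after NODE_DEAD_INTERVAL with no
        packet received, extended by ADJACENCY_BREAK_INTERVAL when the
        neighbor has signaled congestion."""
        wait = NODE_DEAD_INTERVAL
        if neighbor_congested:
            wait += ADJACENCY_BREAK_INTERVAL   # give a congested neighbor more time
        return (now - last_rx) > wait

    def may_start_adjacency(currently_forming: int) -> bool:
        """Throttle adjacency setup: allow a new bring-up only while fewer
        than MAX_ADJACENCY_BUILD_COUNT adjacencies are being established."""
        return currently_forming < MAX_ADJACENCY_BUILD_COUNT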

  13. Database Backup & Resynchronization
      • database backup
        • node should provide a local, primary, nonvolatile memory backup [GR-472-CORE]
        • node should back up all non-self-originated LSAs, routing tables, & states of interfaces
        • database should be backed up at least every 5 minutes (sketched after this slide)
        • restoration of data should be completed within 5 minutes of initiation [GR-472-CORE]
      • nodes signal neighbors when ‘safe’ to perform resynchronization procedures
        • based on TBD packet format
      • under resynchronization, node
        • should generate all its own LSAs
        • should receive only LSAs that have changed between time it failed & current time
        • should base its routing on current database, derived as above
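
A minimal sketch of the periodic backup, assuming LSAs are held in a dict keyed by LS ID with a self_originated flag; the file path, the serialization via pickle, and the function names are illustrative assumptions.

    import pickle
    import time

    BACKUP_INTERVAL = 300   # seconds; slide: back up at least every 5 minutes
    BACKUP_PATH = "/var/lib/ospf/db-backup.bin"   # hypothetical location

    def backup_database(lsdb: dict, routes: dict, interfaces: dict) -> None:
        """Snapshot non-self-originated LSAs, routing tables, and interface
        states to local nonvolatile storage (the GR-472-CORE-style backup)."""
        snapshot = {
            # keep only LSAs this node did not originate, per the slide
            "lsas": {lsid: lsa for lsid, lsa in lsdb.items()
                     if not lsa.get("self_originated")},
            "routes": routes,
            "interfaces": interfaces,
            "saved_at": time.time(),
        }
        with open(BACKUP_PATH, "wb") as f:
            pickle.dump(snapshot, f)

    def backup_loop(get_state) -> None:
        """Call backup_database on the current state every BACKUP_INTERVAL."""
        while True:
            backup_database(*get_state())
            time.sleep(BACKUP_INTERVAL)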

  14. Database Backup & Resynchronization
      • database resynchronization
        • propose changes to receiving/transmitting database summary & LSA request packets
        • when in Full state
          • node sends & receives database summary & LSA request packets as if performing database synchronization, i.e., as when the peer data structure is in the Negotiating, Exchanging, & Loading states
        • node informs neighbor when to use resync procedures
        • node supports resync at a neighbor’s request by receiving/transmitting database summary & LSA request packets
      (this state handling is sketched after this slide)
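
A minimal sketch of the proposed packet-acceptance change, using the peer states named on the slide; the enum, the function name, and the resync_agreed flag (standing in for the TBD safe-to-resync signaling) are all assumptions.

    from enum import Enum, auto

    class PeerState(Enum):
        """Peer data-structure states named on the slide (PNNI-style names;
        OSPF's analogues are ExStart/Exchange/Loading/Full)."""
        NEGOTIATING = auto()
        EXCHANGING = auto()
        LOADING = auto()
        FULL = auto()

    def accept_db_summary_or_lsa_request(state: PeerState,
                                         resync_agreed: bool) -> bool:
        """Normally database summary & LSA request packets are processed only
        while synchronizing; the proposal also accepts them in Full state
        once the neighbors have agreed it is safe to resynchronize."""
        if state in (PeerState.NEGOTIATING, PeerState.EXCHANGING,
                     PeerState.LOADING):
            return True
        return state is PeerState.FULL and resync_agreed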

  15. Failure Experience
      • other failures have occurred with similar consequences
        • moderate TSU storm following ATM node upgrade, 7/2000
          • network recovered, with difficulty
        • large TSU storm in ATM network, 2/20/2001 [pappalardo1, pappalardo2]
          • manual procedures required to reduce TSU flooding & stabilize network
          • desirable to automate procedures for TSU flooding reduction under overload
          • worked with vendor to make protocol fixes to address problems
            • along the lines suggested in the I-D
        • other relevant LS-network failures have been reported [cholewka, jander]
      • conclusions
        • LS protocols vulnerable to loss of database information, control overload to re-sync databases, & other failure/overload scenarios
        • networks more vulnerable in absence of adequate protection mechanisms
        • generic problem of LS protocols
          • across a variety of implementations
          • across FR, ATM, & IP-based technologies

  16. Vendor Analysis
      • vendors & service providers asked to analyze LS protocol recovery from total network failure (loss of all database information in the specified scenario)
      • network scenario
        • 400-node network
          • 100 backbone nodes
          • 3 edge nodes per backbone node (edge single-homed)
        • backbone nodes connected to a max of 10 backbone nodes
          • max node adjacency is 13
          • sparse network
        • 101 peer groups
          • 1 backbone peer group with 100 backbone nodes
          • 100 edge peer groups, each with 3 nodes, all homed on the backbone peer group
        • 1,000,000 addresses advertised
      (the counts are checked in the sketch after this slide)
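
A quick worked check of the scenario's node, adjacency, and peer-group counts; every input figure comes from the slide, while the per-node address average at the end is a derived illustration, not a figure from the draft.

    backbone_nodes = 100
    edge_per_backbone = 3                        # single-homed edge nodes
    edge_nodes = backbone_nodes * edge_per_backbone
    total_nodes = backbone_nodes + edge_nodes
    assert total_nodes == 400                    # 400-node network

    max_backbone_neighbors = 10
    max_adjacency = max_backbone_neighbors + edge_per_backbone
    assert max_adjacency == 13                   # 10 backbone peers + 3 edge nodes

    peer_groups = 1 + backbone_nodes             # 1 backbone + 100 edge peer groups
    assert peer_groups == 101

    addresses_advertised = 1_000_000
    print(addresses_advertised / total_nodes)    # 2500.0 addresses per node, on average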

  17. Vendor Analysis
      • projected recovery times
        • Recovery Time Estimate A – 3.5 hours
        • Recovery Time Estimate B – 5-15 minutes
        • Recovery Time Estimate C – 5.5 hours
      • expectation is that vendor equipment recovery not adequate under large failure scenario

  18. Analysis & Modeling
      • various studies published [atmf00-0249, maunder, choudhury]
      • [choudhury] reports a network-wide event simulation model
        • studies impact of a TSU storm
        • captures
          • node congestion
          • propagation delay between nodes
          • retransmissions if TSU not acknowledged within 5 seconds
          • link declared down if Hello delayed beyond “node-dead interval” (aka “inactivity timer” in PNNI, “router-dead interval” in OSPF)
          • link recovery following database synchronization
        • approximates real network behavior & processing times
      • results show
        • dispersion -- number of control packets generated but not processed in at least one node (sketched after this slide)
        • medium to large TSU storms cause network to recover with difficulty and/or not recover at all
        • results match actual network experience
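
A minimal sketch of the dispersion metric as defined on the slide: packets generated but not yet processed by at least one node. The data-structure shapes (packet IDs in sets, keyed per node) are assumptions about how the simulation might track this.

    def dispersion(generated: set[int],
                   processed_by_node: dict[str, set[int]]) -> int:
        """Count control packets generated network-wide that remain
        unprocessed in at least one node."""
        if not processed_by_node:
            return len(generated)
        # packets processed by every node
        fully_processed = set.intersection(*processed_by_node.values())
        return len(generated - fully_processed)

    # e.g., packets 1-3 generated; node "B" still has packet 3 queued:
    print(dispersion({1, 2, 3}, {"A": {1, 2, 3}, "B": {1, 2}}))   # -> 1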

  19. Impact of TSU Storm on Network Stability
