
Cascading Failures in Infrastructure Networks


Presentation Transcript


  1. Cascading Failures in Infrastructure Networks David Alderson Ph.D. Candidate Dept. of Management Science and Engineering Stanford University April 15, 2002 Advisors: William J. Perry, Nicholas Bambos

  2. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  3. Background • Most of the systems we rely on in our daily lives are designed and built as networks • Voice and data communications • Transportation • Energy distribution • Large-scale disruption of such systems can be catastrophic because of our dependence on them • Large-scale failures in these systems • Have already happened • Will continue to happen

  4. Recent Examples • Telecommunications • ATM network outage: AT&T (February 2001) • Frame Relay outage: AT&T (April 1998), MCI (August 1999) • Transportation • Union Pacific Service Crisis (May 1997 – December 1998) • Electric Power • Northeast Blackout (November 1965) • Western Power Outage (August 1996) • All of the above • Baltimore Tunnel Accident (July 2001)

  5. Public Policy • U.S. Government interest from 1996 (and earlier) • Most national infrastructure systems are privately owned and operated • Misalignment between business imperatives (efficiency) and public interest (robustness) • Previously independent networks now tied together through common information infrastructure • Current policy efforts directed toward building new public-private relationships • Policy & Partnership (CIAO) • Law Enforcement & Coordination (NIPC) • Defining new roles (Homeland Security)

  6. Research Questions Broadly: • Is there something about the network structure of these systems that contributes to their vulnerability? More specifically: • What is a cascading failure in the context of an infrastructure network? • What are the mechanisms that cause it? • What can be done to control it? • Can we design networks that are robust to cascading failures? • What are the implications for network-based businesses?

  7. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  8. Union Pacific Railroad • Largest RR in North America • Headquartered in Omaha, Nebraska • 34,000 track miles (west of Mississippi River) • Transporting • Coal, grain, cars, other manifest cargos • 3rd party traffic (e.g. Amtrak passenger trains) • 24x7 Operations: • 1,500+ trains in motion • 300,000+ cars in system • More than $10B in revenue annually

  9. Union Pacific Railroad • Four major resources constraining operations: • Line capacity (# parallel tracks, speed restrictions, etc.) • Terminal capacity (in/out tracks, yard capacity) • Power (locomotives) • Crew (train personnel, yard personnel) • Ongoing control of operations is mainly by: • Dispatchers • Yardmasters • Some centralized coordination, primarily through a predetermined transportation schedule

  10. Union Pacific Railroad • Sources of network disruptions: • Weather (storms, floods, rock slides, tornados, hurricanes, etc.) • Component failures (signal outages, broken wheels/rails, engine failures, etc.) • Derailments (~1 per day on average) • Minor incidents (e.g. crossing accidents) • Evidence for system-wide failures • 1997-1998 Service Crisis • Fundamental operating challenge

  11. UPRR Fundamental Challenge Two conflicting drivers: • Business imperatives necessitate a lean operation that maximizes efficiency and drives the system toward high utilization of available network resources. • An efficient operation that maximizes utilization is very sensitive to disruptions, particularly because of the effects of network congestion.

  12. Railroad Congestion There are several places where congestion may be seen within the railroad: • Line segments • Terminals • Operating Regions • The Entire Railroad Network • (Probably not locomotives or crews) Congestion is related to capacity.

  13. UPRR Capacity Model Concepts Factors Affecting Observed Performance: • Dispatcher / Corridor Manager Expertise • On-Line Incidents / Equipment Failure • Weather • Temporary Speed Restrictions [Figure: empirically derived relationship between line segment velocity and volume (trains per day), illustrating the effect of forcing volume in excess of capacity]

  14. Implications of Congestion Concepts of traffic congestion are important for two key aspects of network operations: • Capacity Planning and Management • Service Restoration In the presence of service interruptions, the objective of Service Restoration is to: • Minimize the propagation across the network of any disturbance caused by a service interruption • Minimize the time to return to fluid operations

  15. Modeling Congestion We can model congestion using standard models from transportation engineering. Define the relationships between: • Number of items in the system (Density) • Average processing rate (Velocity) • Input Rate • Output Rate (Throughput)

  16. Modeling Congestion Velocity vs. Density: • Assume that velocity decreases linearly with the traffic density: v(n) = K (1 − n/N), falling from the free-flow velocity K at n = 0 to zero at the maximum density N. [Figure: velocity v(n) as a straight line from K down to zero at N]

  17. Modeling Congestion Throughput vs. Density: • Throughput = Velocity · Density: λ(n) = n · v(n) = K n (1 − n/N) • Throughput is maximized at n = N/2 with value λ* = KN/4 (λ* = N/4 for K = 1). [Figure: throughput λ(n) as a concave curve peaking at n = N/2]
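
To make the arithmetic concrete, here is a minimal Python sketch of this velocity-density model; the parameter values and function names are illustrative, not from the talk.

```python
import numpy as np

K, N = 1.0, 100.0  # free-flow velocity, maximum density (illustrative values)

def velocity(n):
    """Linear velocity-density relationship: v(n) = K * (1 - n/N)."""
    return K * (1.0 - n / N)

def throughput(n):
    """Throughput = velocity * density."""
    return n * velocity(n)

densities = np.linspace(0.0, N, 1001)
lam = throughput(densities)
print(densities[np.argmax(lam)], lam.max())  # ~N/2 = 50.0, ~K*N/4 = 25.0
```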

  18. Modeling Congestion [Figure: side-by-side plots of velocity vs. density and throughput vs. density, with velocity falling from K and throughput peaking at λ*]

  19. Modeling Congestion • Let p represent the intensity of congestion onset.

  20. Modeling Congestion It is clear that as p → ∞, the throughput λ(n) = n · v(n) becomes nK for n < N: congestion effects disappear until density approaches N.
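
The slides do not spell out the functional family; a natural choice consistent with the p = 1 linear case and the large-p behavior above is v(n) = K (1 − (n/N)^p), which the following sketch (my assumption, with illustrative parameters) uses to show how increasing p delays and sharpens congestion onset.

```python
import numpy as np

K, N = 1.0, 100.0

def velocity(n, p):
    """Assumed family v(n) = K * (1 - (n/N)**p); p tunes congestion onset."""
    return K * (1.0 - (n / N) ** p)

n = np.linspace(0.0, N, 6)  # densities 0, 20, ..., 100
for p in (1, 2, 10, 100):
    print(p, np.round(velocity(n, p), 3))
# p = 1 recovers the linear model; for large p, v(n) stays near K until
# n approaches N, then drops sharply, so throughput ~ n*K over most of [0, N).
```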

  21. Modeling Congestion [Figure: velocity vs. density and throughput vs. density plots]

  22. UP Service Crisis • Initiating Event • 5/97 derailment at a critical train yard outside of Houston • Additionally • Loss of BNSF route that was decommissioned for repairs • Embargo at Laredo interchange point to Mexico • Complicating Factors • UP/SP merger and transition to consolidated operations • Hurricane Danny, fall 1997 • Record rains and floods (esp. Kansas) in 1998 • Operational Issues • Tightly optimized transportation schedule • Traditional service priorities

  23. UP Service Crisis [Map: regions most affected: Houston-Gulf Coast, Southern California, and the Central Corridor (Kansas-Nebraska-Wyoming)] Source: UP Filings with Surface Transportation Board, September 1997 – December 1998

  24. Case Study: Union Pacific Completed Phase 1 of case study: • Understanding of the factors affecting system capacity, system dynamics • Investigation of the 1997-98 Service Crisis • Project definition: detailed study of Sunset Route • Data collection, preliminary analysis for the Sunset Route Ongoing work: • A detailed study of their specific network topology • Development of real-time warning and analysis tools

  25. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  26. Basic Network Concepts • Networks allow the sharing of distributed resources • Resource use ↔ resource load • Total network usage = total network load • Total network load is distributed among the components of the network • Many networking problems are concerned with finding a “good” distribution of load • Resource allocation ↔ load distribution

  27. Infrastructure Networks • Self-protection as an explicit design criterion • Network components themselves are valuable • Expensive • Hard to replace • Long lead times to obtain • Willingness to sacrifice current system performance in exchange for future availability • With protection as an objective, connectivity between neighboring nodes is • Helpful • Harmful

  28. Cascading Failures Cascading failures occur in networks where • Individual network components can fail • When a component fails, the natural dynamics of the system may induce the failure of other components Network components can fail because of • Accident • Internal failure • Attack A cascading failure is not • A single point of failure • The occurrence of multiple concurrent failures • The spread of a virus

  29. Related Work Cascading Failures: • Electric Power: Parrilo et al. (1998), Thorp et al. (2001) • Social Networks: Watts (1999) • Public Policy: Little (2001) Other network research initiatives • “Survivable Networks” • “Fault-Tolerant Networks” Large-Scale Vulnerability • Self-Organized Criticality: Bak (1987), many others • Highly Optimized Tolerance: Carlson and Doyle (1999) • Normal Accidents: Perrow (1999) • Influence Models: Verghese et al. (2001)

  30. Our Approach • Cascading failures in the context of flow networks • conservation of flow within the network • Overloading a resource leads to degraded performance and eventual failure • Network failures are not independent • Flow allocation pattern → resource interdependence • Focus on the dynamics of network operation and control • Design for robustness (not protection)

  31. Taxonomy of Network Flow Models (Reference: Janusz Filipiak) Quantity of Interest → Modeling Approach, from coarse-grained to fine-grained models: • Long-Term Averages → Static Flow Models • Time-Dependent Averages → Fluid Approximations • Averages & Variances → Diffusion Approximations • Probability Distributions → Queueing Models • Event Sequences → Simulation Models Relevant decisions span capacity planning (coarse-grained) through failure & recovery to ongoing operation (processing & routing, fine-grained).

  32. Time Scales in Network Operations Relevant decisions, from short to long time scales: Ongoing Operation (Processing & Routing) → Failure & Recovery → Capacity Planning • Computer Routing: milliseconds to seconds → minutes to hours → days to weeks • Railroad Transportation: minutes to hours → days to weeks → months to years

  33. What Are Network Dynamics? Type of network dynamics and its underlying assumption: • Dynamics OF networks → network topology is CHANGING (e.g. failure & recovery) • Dynamics ON networks → network topology is STATIC

  34. Network Flow Optimization • Original work by Ford and Fulkerson (1956) • One of the most studied areas for optimization • Three main problem types • Shortest path problems • Maximum flow problems • Minimum cost flow problems • Special interpretation for some of the most celebrated results in optimization theory • Broad applicability to a variety of problems

  35. Single Commodity Flow Problem Notation: • N = set of nodes, indexed i = 1, 2, …, N • A = set of arcs, indexed j = 1, 2, …, M • d_i = demand (supply) at node i • f_j = flow along arc j • u_j = capacity of arc j • A = node-arc incidence matrix A set of flows f is feasible if it satisfies the constraints: • A_i f = d_i ∀ i ∈ N (flows balanced at node i, and supply/demand is satisfied) • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j no more than its capacity)
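
As a concrete illustration, the following Python sketch builds a node-arc incidence matrix for a made-up three-node network and checks both feasibility conditions; all names and data are invented for the example.

```python
import numpy as np

# Illustrative network: nodes 0, 1, 2 and arcs (0->1), (0->2), (1->2)
arcs = [(0, 1), (0, 2), (1, 2)]

# Node-arc incidence matrix: +1 where an arc leaves a node, -1 where it enters
A = np.zeros((3, len(arcs)))
for j, (tail, head) in enumerate(arcs):
    A[tail, j], A[head, j] = 1.0, -1.0

d = np.array([2.0, 0.0, -2.0])  # supply 2 at node 0, demand 2 at node 2
u = np.array([1.0, 1.0, 1.0])   # arc capacities
f = np.array([1.0, 1.0, 1.0])   # candidate flow

balanced = np.allclose(A @ f, d)           # A_i f = d_i at every node
within_cap = np.all((0 <= f) & (f <= u))   # 0 <= f_j <= u_j on every arc
print(balanced and within_cap)             # True: f is feasible
```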

  36. Single Commodity Flow Problem [Figure: a network carrying flow λ from source s to sink t] Feasible region, denoted F(λ): • A_i f = d_i ∀ i ∈ N (flows balanced at node i) • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j feasible)

  37. Minimum Cost Problem Let c_j = cost on arc j. [Figure: the s-t network with flow λ] Minimize over f: Σ_{j ∈ A} c_j f_j subject to: • flows balanced at node i • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j feasible)
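
A minimal sketch of this minimum cost problem as a linear program, solved with scipy on the same illustrative three-node network (costs and capacities invented):

```python
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (0, 2), (1, 2)]
A_eq = np.zeros((3, 3))
for j, (tail, head) in enumerate(arcs):
    A_eq[tail, j], A_eq[head, j] = 1.0, -1.0

d = np.array([2.0, 0.0, -2.0])  # ship 2 units from node 0 to node 2
c = np.array([1.0, 3.0, 1.0])   # arc costs c_j
bounds = [(0.0, 2.0)] * 3       # 0 <= f_j <= u_j with u_j = 2

res = linprog(c, A_eq=A_eq, b_eq=d, bounds=bounds)
print(res.x, res.fun)  # routes both units via 0->1->2: f = [2, 0, 2], cost 4
```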

  38. Shortest Path Problem Let costs c_j correspond to “distances”, and set λ = 1. [Figure: the s-t network with one unit of flow] Minimize over f: Σ_{j ∈ A} c_j f_j subject to: • flows balanced at node i • 0 ≤ f_j ≤ u_j = 1 ∀ j ∈ A (flow on arc j feasible)

  39. Maximum Flow Problem [Figure: the s-t network carrying flow λ] Maximize over f: λ subject to: • flows balanced at node i • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j feasible)
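
The maximum flow variant fits the same LP machinery if one adds an artificial return arc t → s and maximizes the flow λ carried on it; a sketch with the same invented data:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative arcs plus an artificial return arc 2 -> 0 carrying lambda
arcs = [(0, 1), (0, 2), (1, 2), (2, 0)]
A_eq = np.zeros((3, 4))
for j, (tail, head) in enumerate(arcs):
    A_eq[tail, j], A_eq[head, j] = 1.0, -1.0

b_eq = np.zeros(3)                   # pure circulation: balance at every node
u = [1.0, 1.0, 1.0, np.inf]          # real capacities; return arc unbounded
c = np.array([0.0, 0.0, 0.0, -1.0])  # maximize lambda = minimize -lambda

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=list(zip([0.0] * 4, u)))
print(res.x[-1])  # max s-t flow = 2.0 for these capacities
```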

  40. Network Optimization Traditional Assumptions: • Complete information • Static network (capacities, demands, topology) • Centralized decision maker Solution obtained from global optimization algorithms Relevant issues: • Computational (time) complexity • Function of problem size (number of inputs) • Based on worst-case data • Parallelization (decomposition) • Synchronization (global clock)

  41. New Challenges Most traditional assumptions no longer hold… • Modern networks are inherently dynamic • Connectivity fluctuates, components fail, growth is ad hoc • Traffic demands/patterns constantly change • Explosive growth → massive size scale • Faster technology → shrinking time scale • Operating decisions are made with incomplete, incorrect information • Claim: A global approach based on static assumptions is no longer viable

  42. Cascading Failures & Flow Networks • In general, we assume that network failures result from violations of network constraints • Node feasibility (flow conservation) • Arc feasibility (arc capacity) • That is, failure ↔ infeasibility • The network topology provides the means by which failures (infeasibilities) propagate • In the optimization context, a cascading failure is a collapse of the feasible region of the optimization problem that results from the interaction of the constraints when a parameter is changed
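
To see “failure as infeasibility” concretely, the sketch below shrinks one arc capacity in the earlier illustrative instance until the feasible region collapses and the LP solver reports infeasibility (all data invented):

```python
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (0, 2), (1, 2)]
A_eq = np.zeros((3, 3))
for j, (tail, head) in enumerate(arcs):
    A_eq[tail, j], A_eq[head, j] = 1.0, -1.0

d = np.array([2.0, 0.0, -2.0])  # must ship 2 units from node 0 to node 2
c = np.zeros(3)                 # feasibility check only, no cost

for u02 in (2.0, 1.0, 0.5):     # progressively degrade arc 0 -> 2
    bounds = [(0.0, 1.0), (0.0, u02), (0.0, 1.0)]
    res = linprog(c, A_eq=A_eq, b_eq=d, bounds=bounds)
    print(u02, res.status)      # status 0 = feasible, 2 = infeasible
```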

  43. Addressing New Challenges • Extend traditional notions of network optimization to model cascading failures in flow networks • Allow for node failures • Include flow dynamics • Consider solution approaches based on • Decentralized control • Local information • Leverage ideas from dual problem formulation • Identify dimensions along which there are explicit tensions and tradeoffs between vulnerability and performance

  44. Dual Problem Formulation Primal Problem: Min c^T f s.t. A f = d, f ≥ 0, f ≤ u Dual Problem: Max π^T d − μ^T u s.t. π^T A − μ^T ≤ c^T, π unrestricted, μ ≥ 0 • Dual variables π, μ have interpretations as prices at nodes and arcs • Natural decomposition as a distributed problem • e.g. Nodes set prices based on local information • Examples: • Kelly, Low, and many others for TCP/IP congestion control • Boyd and Xiao for dual decomposition of the SRRA problem
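
As a sketch, the dual LP for the illustrative minimum cost instance above can be solved directly, and the resulting π and μ read off as node and arc prices (all data invented; strong duality makes the dual optimum equal the primal cost):

```python
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (0, 2), (1, 2)]
A = np.zeros((3, 3))
for j, (tail, head) in enumerate(arcs):
    A[tail, j], A[head, j] = 1.0, -1.0
d = np.array([2.0, 0.0, -2.0])
c = np.array([1.0, 3.0, 1.0])
u = np.array([2.0, 2.0, 2.0])

# Variables x = (pi, mu): max pi.d - mu.u  <=>  min -pi.d + mu.u,
# subject to A^T pi - mu <= c, pi free, mu >= 0
obj = np.concatenate([-d, u])
A_ub = np.hstack([A.T, -np.eye(3)])
bounds = [(None, None)] * 3 + [(0.0, None)] * 3

res = linprog(obj, A_ub=A_ub, b_ub=c, bounds=bounds)
pi, mu = res.x[:3], res.x[3:]
print(pi, mu, -res.fun)  # node prices, arc prices, dual optimum (= 4)
```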

  45. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  46. Node Dynamics • Consider each node as a simple input-output system running in discrete time… • Let n(k) = flow being processed in interval k (the load) • Node dynamics: n(k+1) = n(k) + a(k) − d(k) • Processing capacity yields a state-dependent output d(k) = d(n(k)) (performance) • For a constant input a(k), the sign of a(k) − d(k) indicates how n(k) is changing, and n* is the equilibrium point • Node “fails” if n(k) > N • The system is feasible for a(k) < λ* [Figure: a single node with input a(k), load n(k), and output d(k)]
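
A minimal simulation of these node dynamics, reusing the linear velocity model from earlier as the state-dependent output (all parameters illustrative):

```python
K, N = 1.0, 100.0

def d(n):
    """State-dependent output: throughput n * v(n) under the linear model."""
    return max(n * K * (1.0 - n / N), 0.0)

def simulate(a, steps=500):
    """Iterate n(k+1) = n(k) + a - d(n(k)); report failure once n(k) > N."""
    n = 0.0
    for _ in range(steps):
        n = n + a - d(n)
        if n > N:
            return "failed"
    return round(n, 2)

print(simulate(a=10.0))  # a < lambda* = K*N/4 = 25: settles at equilibrium n*
print(simulate(a=30.0))  # a > lambda*: load grows without bound, node fails
```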

  47. Network Dynamics • The presence of an arc between adjacent nodes couples their behavior: node 1’s output becomes node 2’s input, e.g. a2(k) = min{ d1(n1(k)), u1(k) } • Arc capacities limit both outgoing and incoming flow [Figure: nodes 1 and 2 joined by an arc of capacity u1, with inputs a1(k), a2(k), loads n1(k), n2(k), and state-dependent outputs d1(n1), d2(n2)]

  48. Network Dynamics • The failure of one node can lead to the failure of another • When a node fails, the capacity of its incoming arcs drops effectively to zero (u1 = 0) • The upstream node loses the capacity of the arc • In the absence of control, the upstream node fails too • Result: node failures propagate “upstream”… • Question: How will the network respond to perturbations? [Figure: the two-node system after node 2 fails, with u1 = 0]
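
A sketch of this upstream propagation in a two-node chain under the same assumed dynamics: node 2 is failed by an external perturbation, the arc into it effectively loses its capacity, and node 1 backs up and fails in turn (all numbers illustrative):

```python
K, N = 1.0, 100.0

def d(n, failed):
    """Node throughput under the linear model; zero once the node has failed."""
    return 0.0 if failed else max(n * K * (1.0 - n / N), 0.0)

n1 = n2 = 0.0
failed1 = failed2 = False
a1 = 20.0  # external input to node 1 (below lambda* = 25, so stable alone)
u1 = 25.0  # capacity of the arc from node 1 to node 2

for k in range(400):
    if k == 100:
        failed2 = True  # perturbation: node 2 fails outright
    # The arc limits node 1's shipped flow; a failed node 2 accepts nothing
    out1 = min(d(n1, failed1), 0.0 if failed2 else u1)
    n1 = n1 + a1 - out1
    n2 = n2 + out1 - d(n2, failed2)
    if n1 > N:
        failed1 = True
print(failed1, failed2)  # True True: node 2's failure propagated upstream
```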

  49. Network Robustness Consider the behavior of the network in response to a perturbation to arc capacity: • Does the disturbance lead to a local failure? • Does the failure propagate? • How far does it propagate? Measure the consequences in terms of: • Size of the resulting failure island • Loss of network throughput Key factors: • Flow processing sensitivity to congestion • Network topology • Local routing and flow control policies • Time scales

  50. Congestion Sensitivity In many real network systems, components are sensitive to congestion • Using the aforementioned family of functions, we can tune the sensitivity to congestion • Direct consequences for local dynamics, stability, and control • Tradeoff between system efficiency and fragility • Implications for local behavior [Figure: system performance vs. system load, showing evidence of congestion at high load]
