
Cascading Failures in Infrastructure Networks


Presentation Transcript


  1. Cascading Failures in Infrastructure Networks David Alderson Ph.D. Candidate Dept. of Management Science and Engineering Stanford University April 15, 2002 Advisors: William J. Perry, Nicholas Bambos

  2. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  3. Background • Most of the systems we rely on in our daily lives are designed and built as networks • Voice and data communications • Transportation • Energy distribution • Large-scale disruption of such systems can be catastrophic because of our dependence on them • Large-scale failures in these systems • Have already happened • Will continue to happen

  4. Recent Examples • Telecommunications • ATM network outage: AT&T (February 2001) • Frame Relay outage: AT&T (April 1998), MCI (August 1999) • Transportation • Union Pacific Service Crisis (May 1997 – December 1998) • Electric Power • Northeast Blackout (November 1965) • Western Power Outage (August 1996) • All of the above • Baltimore Tunnel Accident (July 2001)

  5. Public Policy • U.S. Government interest from 1996 (and earlier) • Most national infrastructure systems are privately owned and operated • Misalignment between business imperatives (efficiency) and public interest (robustness) • Previously independent networks now tied together through common information infrastructure • Current policy efforts directed toward building new public-private relationships • Policy & Partnership (CIAO) • Law Enforcement & Coordination (NIPC) • Defining new roles (Homeland Security)

  6. Research Questions Broadly: • Is there something about the network structure of these systems that contributes to their vulnerability? More specifically: • What is a cascading failure in the context of an infrastructure network? • What are the mechanisms that cause it? • What can be done to control it? • Can we design networks that are robust to cascading failures? • What are the implications for network-based businesses?

  7. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  8. Union Pacific Railroad • Largest RR in North America • Headquartered in Omaha, Nebraska • 34,000 track miles (west of Mississippi River) • Transporting • Coal, grain, cars, other manifest cargos • 3rd party traffic (e.g. Amtrak passenger trains) • 24x7 Operations: • 1,500+ trains in motion • 300,000+ cars in system • More than $10B in revenue annually

  9. Union Pacific Railroad • Four major resources constraining operations: • Line capacity (# parallel tracks, speed restrictions, etc.) • Terminal capacity (in/out tracks, yard capacity) • Power (locomotives) • Crew (train personnel, yard personnel) • Ongoing control of operations is mainly by: • Dispatchers • Yardmasters • Some centralized coordination, primarily through a predetermined transportation schedule

  10. Union Pacific Railroad • Sources of network disruptions: • Weather (storms, floods, rock slides, tornados, hurricanes, etc.) • Component failures (signal outages, broken wheels/rails, engine failures, etc.) • Derailments (~1 per day on average) • Minor incidents (e.g. crossing accidents) • Evidence for system-wide failures • 1997-1998 Service Crisis • Fundamental operating challenge

  11. UPRR Fundamental Challenge Two conflicting drivers: • Business imperatives necessitate a lean operation that maximizes efficiency and drives the system toward high utilization of available network resources. • An efficient operation that maximizes utilization is very sensitive to disruptions, particularly because of the effects of network congestion.

  12. Railroad Congestion There are several places where congestion may be seen within the railroad: • Line segments • Terminals • Operating Regions • The Entire Railroad Network • (Probably not locomotives or crews) Congestion is related to capacity.

  13. UPRR Capacity Model Concepts Factors Affecting Observed Performance: • Dispatcher / Corridor Manager Expertise • On-Line Incidents / Equipment Failure • Weather • Temporary Speed Restrictions [Figure: empirically derived relationship between line segment velocity and volume (trains per day), illustrating the effect of forcing volume in excess of capacity]

  14. Implications of Congestion Concepts of traffic congestion are important for two key aspects of network operations: • Capacity Planning and Management • Service Restoration In the presence of service interruptions, the objective of Service Restoration is to: • Minimize the propagation across the network of any disturbance caused by a service interruption • Minimize the time to return to fluid operations

  15. Modeling Congestion We can model congestion using standard models from transportation engineering. Define the relationships between: • Number of items in the system (Density) • Average processing rate (Velocity) • Input Rate • Output Rate (Throughput)

  16. Modeling Congestion Velocity vs. Density: • Assume that velocity decreases linearly with the traffic density: v(n) = K (1 − n/N), falling from the free-flow velocity K at n = 0 to zero at the maximum density N. [Figure: velocity v(n) as a straight line from K down to zero at N]

  17. Modeling Congestion Throughput vs. Density: • Throughput = Velocity · Density: λ(n) = n · v(n) = K n (1 − n/N) • Throughput is maximized at n = N/2 with value λ* = KN/4 (λ* = N/4 for K = 1). [Figure: throughput λ(n) as a concave curve peaking at n = N/2]
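
To make the arithmetic concrete, here is a minimal Python sketch of this velocity-density model; the parameter values and function names are illustrative, not from the talk.

```python
import numpy as np

K, N = 1.0, 100.0  # free-flow velocity, maximum density (illustrative values)

def velocity(n):
    """Linear velocity-density relationship: v(n) = K * (1 - n/N)."""
    return K * (1.0 - n / N)

def throughput(n):
    """Throughput = velocity * density."""
    return n * velocity(n)

densities = np.linspace(0.0, N, 1001)
lam = throughput(densities)
print(densities[np.argmax(lam)], lam.max())  # ~N/2 = 50.0, ~K*N/4 = 25.0
```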

  18. Modeling Congestion [Figure: side-by-side plots of velocity vs. density and throughput vs. density, with velocity falling from K and throughput peaking at λ*]

  19. Modeling Congestion • Let p represent the intensity of congestion onset.

  20. Modeling Congestion It is clear that as p → ∞, the throughput λ(n) = n · v(n) becomes nK for n < N: congestion effects disappear until density approaches N.
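
The slides do not spell out the functional family; a natural choice consistent with the p = 1 linear case and the large-p behavior above is v(n) = K (1 − (n/N)^p), which the following sketch (my assumption, with illustrative parameters) uses to show how increasing p delays and sharpens congestion onset.

```python
import numpy as np

K, N = 1.0, 100.0

def velocity(n, p):
    """Assumed family v(n) = K * (1 - (n/N)**p); p tunes congestion onset."""
    return K * (1.0 - (n / N) ** p)

n = np.linspace(0.0, N, 6)  # densities 0, 20, ..., 100
for p in (1, 2, 10, 100):
    print(p, np.round(velocity(n, p), 3))
# p = 1 recovers the linear model; for large p, v(n) stays near K until
# n approaches N, then drops sharply, so throughput ~ n*K over most of [0, N).
```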

  21. Modeling Congestion [Figure: velocity vs. density and throughput vs. density plots]

  22. UP Service Crisis • Initiating Event • 5/97 derailment at a critical train yard outside of Houston • Additionally • Loss of BNSF route that was decommissioned for repairs • Embargo at Laredo interchange point to Mexico • Complicating Factors • UP/SP merger and transition to consolidated operations • Hurricane Danny, fall 1997 • Record rains and floods (esp. Kansas) in 1998 • Operational Issues • Tightly optimized transportation schedule • Traditional service priorities

  23. UP Service Crisis [Map: regions most affected: Houston-Gulf Coast, Southern California, and the Central Corridor (Kansas-Nebraska-Wyoming)] Source: UP Filings with Surface Transportation Board, September 1997 – December 1998

  24. Case Study: Union Pacific Completed Phase 1 of case study: • Understanding of the factors affecting system capacity, system dynamics • Investigation of the 1997-98 Service Crisis • Project definition: detailed study of Sunset Route • Data collection, preliminary analysis for the Sunset Route Ongoing work: • A detailed study of their specific network topology • Development of real-time warning and analysis tools

  25. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  26. Basic Network Concepts • Networks allow the sharing of distributed resources • Resource use ↔ resource load • Total network usage = total network load • Total network load is distributed among the components of the network • Many networking problems are concerned with finding a “good” distribution of load • Resource allocation ↔ load distribution

  27. Infrastructure Networks • Self-protection as an explicit design criterion • Network components themselves are valuable • Expensive • Hard to replace • Long lead times to obtain • Willingness to sacrifice current system performance in exchange for future availability • With protection as an objective, connectivity between neighboring nodes is • Helpful • Harmful

  28. Cascading Failures Cascading failures occur in networks where • Individual network components can fail • When a component fails, the natural dynamics of the system may induce the failure of other components Network components can fail because of • Accident • Internal failure • Attack A cascading failure is not • A single point of failure • The occurrence of multiple concurrent failures • The spread of a virus

  29. Related Work Cascading Failures: • Electric Power: Parrilo et al. (1998), Thorp et al. (2001) • Social Networks: Watts (1999) • Public Policy: Little (2001) Other network research initiatives • “Survivable Networks” • “Fault-Tolerant Networks” Large-Scale Vulnerability • Self-Organized Criticality: Bak (1987), many others • Highly Optimized Tolerance: Carlson and Doyle (1999) • Normal Accidents: Perrow (1999) • Influence Models: Verghese et al. (2001)

  30. Our Approach • Cascading failures in the context of flow networks • conservation of flow within the network • Overloading a resource leads to degraded performance and eventual failure • Network failures are not independent • Flow allocation pattern → resource interdependence • Focus on the dynamics of network operation and control • Design for robustness (not protection)

  31. Taxonomy of Network Flow Models (Reference: Janusz Filipiak) Quantity of Interest → Modeling Approach, from coarse-grained to fine-grained models: • Long-Term Averages → Static Flow Models • Time-Dependent Averages → Fluid Approximations • Averages & Variances → Diffusion Approximations • Probability Distributions → Queueing Models • Event Sequences → Simulation Models Relevant decisions span capacity planning (coarse-grained) through failure & recovery to ongoing operation (processing & routing, fine-grained).

  32. Time Scales in Network Operations Relevant decisions, from short to long time scales: Ongoing Operation (Processing & Routing) → Failure & Recovery → Capacity Planning • Computer Routing: milliseconds to seconds → minutes to hours → days to weeks • Railroad Transportation: minutes to hours → days to weeks → months to years

  33. What Are Network Dynamics? Type of network dynamics and its underlying assumption: • Dynamics OF networks → network topology is CHANGING (e.g. failure & recovery) • Dynamics ON networks → network topology is STATIC

  34. Network Flow Optimization • Original work by Ford and Fulkerson (1956) • One of the most studied areas for optimization • Three main problem types • Shortest path problems • Maximum flow problems • Minimum cost flow problems • Special interpretation for some of the most celebrated results in optimization theory • Broad applicability to a variety of problems

  35. Single Commodity Flow Problem Notation: • N = set of nodes, indexed i = 1, 2, …, N • A = set of arcs, indexed j = 1, 2, …, M • d_i = demand (supply) at node i • f_j = flow along arc j • u_j = capacity of arc j • A = node-arc incidence matrix A set of flows f is feasible if it satisfies the constraints: • A_i f = d_i ∀ i ∈ N (flows balanced at node i, and supply/demand is satisfied) • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j no more than its capacity)
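
As a concrete illustration, the following Python sketch builds a node-arc incidence matrix for a made-up three-node network and checks both feasibility conditions; all names and data are invented for the example.

```python
import numpy as np

# Illustrative network: nodes 0, 1, 2 and arcs (0->1), (0->2), (1->2)
arcs = [(0, 1), (0, 2), (1, 2)]

# Node-arc incidence matrix: +1 where an arc leaves a node, -1 where it enters
A = np.zeros((3, len(arcs)))
for j, (tail, head) in enumerate(arcs):
    A[tail, j], A[head, j] = 1.0, -1.0

d = np.array([2.0, 0.0, -2.0])  # supply 2 at node 0, demand 2 at node 2
u = np.array([1.0, 1.0, 1.0])   # arc capacities
f = np.array([1.0, 1.0, 1.0])   # candidate flow

balanced = np.allclose(A @ f, d)           # A_i f = d_i at every node
within_cap = np.all((0 <= f) & (f <= u))   # 0 <= f_j <= u_j on every arc
print(balanced and within_cap)             # True: f is feasible
```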

  36. Single Commodity Flow Problem [Figure: a network carrying flow λ from source s to sink t] Feasible region, denoted F(λ): • A_i f = d_i ∀ i ∈ N (flows balanced at node i) • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j feasible)

  37. Minimum Cost Problem Let c_j = cost on arc j. [Figure: the s-t network with flow λ] Minimize over f: Σ_{j ∈ A} c_j f_j subject to: • flows balanced at node i • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j feasible)
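
A minimal sketch of this minimum cost problem as a linear program, solved with scipy on the same illustrative three-node network (costs and capacities invented):

```python
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (0, 2), (1, 2)]
A_eq = np.zeros((3, 3))
for j, (tail, head) in enumerate(arcs):
    A_eq[tail, j], A_eq[head, j] = 1.0, -1.0

d = np.array([2.0, 0.0, -2.0])  # ship 2 units from node 0 to node 2
c = np.array([1.0, 3.0, 1.0])   # arc costs c_j
bounds = [(0.0, 2.0)] * 3       # 0 <= f_j <= u_j with u_j = 2

res = linprog(c, A_eq=A_eq, b_eq=d, bounds=bounds)
print(res.x, res.fun)  # routes both units via 0->1->2: f = [2, 0, 2], cost 4
```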

  38. Shortest Path Problem Let costs c_j correspond to “distances”, and set λ = 1. [Figure: the s-t network with one unit of flow] Minimize over f: Σ_{j ∈ A} c_j f_j subject to: • flows balanced at node i • 0 ≤ f_j ≤ u_j = 1 ∀ j ∈ A (flow on arc j feasible)

  39. Maximum Flow Problem [Figure: the s-t network carrying flow λ] Maximize over f: λ subject to: • flows balanced at node i • 0 ≤ f_j ≤ u_j ∀ j ∈ A (flow on arc j feasible)
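
The maximum flow variant fits the same LP machinery if one adds an artificial return arc t → s and maximizes the flow λ carried on it; a sketch with the same invented data:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative arcs plus an artificial return arc 2 -> 0 carrying lambda
arcs = [(0, 1), (0, 2), (1, 2), (2, 0)]
A_eq = np.zeros((3, 4))
for j, (tail, head) in enumerate(arcs):
    A_eq[tail, j], A_eq[head, j] = 1.0, -1.0

b_eq = np.zeros(3)                   # pure circulation: balance at every node
u = [1.0, 1.0, 1.0, np.inf]          # real capacities; return arc unbounded
c = np.array([0.0, 0.0, 0.0, -1.0])  # maximize lambda = minimize -lambda

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=list(zip([0.0] * 4, u)))
print(res.x[-1])  # max s-t flow = 2.0 for these capacities
```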

  40. Network Optimization Traditional Assumptions: • Complete information • Static network (capacities, demands, topology) • Centralized decision maker Solution obtained from global optimization algorithms Relevant issues: • Computational (time) complexity • Function of problem size (number of inputs) • Based on worst-case data • Parallelization (decomposition) • Synchronization (global clock)

  41. New Challenges Most traditional assumptions no longer hold… • Modern networks are inherently dynamic • Connectivity fluctuates, components fail, growth is ad hoc • Traffic demands/patterns constantly change • Explosive growth → massive size scale • Faster technology → shrinking time scale • Operating decisions are made with incomplete, incorrect information • Claim: A global approach based on static assumptions is no longer viable

  42. Cascading Failures & Flow Networks • In general, we assume that network failures result from violations of network constraints • Node feasibility (flow conservation) • Arc feasibility (arc capacity) • That is, failure ↔ infeasibility • The network topology provides the means by which failures (infeasibilities) propagate • In the optimization context, a cascading failure is a collapse of the feasible region of the optimization problem that results from the interaction of the constraints when a parameter is changed
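
To see “failure as infeasibility” concretely, the sketch below shrinks one arc capacity in the earlier illustrative instance until the feasible region collapses and the LP solver reports infeasibility (all data invented):

```python
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (0, 2), (1, 2)]
A_eq = np.zeros((3, 3))
for j, (tail, head) in enumerate(arcs):
    A_eq[tail, j], A_eq[head, j] = 1.0, -1.0

d = np.array([2.0, 0.0, -2.0])  # must ship 2 units from node 0 to node 2
c = np.zeros(3)                 # feasibility check only, no cost

for u02 in (2.0, 1.0, 0.5):     # progressively degrade arc 0 -> 2
    bounds = [(0.0, 1.0), (0.0, u02), (0.0, 1.0)]
    res = linprog(c, A_eq=A_eq, b_eq=d, bounds=bounds)
    print(u02, res.status)      # status 0 = feasible, 2 = infeasible
```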

  43. Addressing New Challenges • Extend traditional notions of network optimization to model cascading failures in flow networks • Allow for node failures • Include flow dynamics • Consider solution approaches based on • Decentralized control • Local information • Leverage ideas from dual problem formulation • Identify dimensions along which there are explicit tensions and tradeoffs between vulnerability and performance

  44. Dual Problem Formulation Primal Problem: Min c^T f s.t. A f = d, f ≥ 0, f ≤ u Dual Problem: Max π^T d − μ^T u s.t. π^T A − μ^T ≤ c^T, π unrestricted, μ ≥ 0 • Dual variables π, μ have interpretations as prices at nodes and arcs • Natural decomposition as a distributed problem • e.g. Nodes set prices based on local information • Examples: • Kelly, Low, and many others for TCP/IP congestion control • Boyd and Xiao for dual decomposition of the SRRA problem
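
As a sketch, the dual LP for the illustrative minimum cost instance above can be solved directly, and the resulting π and μ read off as node and arc prices (all data invented; strong duality makes the dual optimum equal the primal cost):

```python
import numpy as np
from scipy.optimize import linprog

arcs = [(0, 1), (0, 2), (1, 2)]
A = np.zeros((3, 3))
for j, (tail, head) in enumerate(arcs):
    A[tail, j], A[head, j] = 1.0, -1.0
d = np.array([2.0, 0.0, -2.0])
c = np.array([1.0, 3.0, 1.0])
u = np.array([2.0, 2.0, 2.0])

# Variables x = (pi, mu): max pi.d - mu.u  <=>  min -pi.d + mu.u,
# subject to A^T pi - mu <= c, pi free, mu >= 0
obj = np.concatenate([-d, u])
A_ub = np.hstack([A.T, -np.eye(3)])
bounds = [(None, None)] * 3 + [(0.0, None)] * 3

res = linprog(obj, A_ub=A_ub, b_ub=c, bounds=bounds)
pi, mu = res.x[:3], res.x[3:]
print(pi, mu, -res.fun)  # node prices, arc prices, dual optimum (= 4)
```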

  45. Outline • Background and Motivation • Union Pacific Case Study • Conceptual Framework • Modeling Cascading Failures • Ongoing Work

  46. Node Dynamics • Consider each node as a simple input-output system running in discrete time… • Let n(k) = flow being processed in interval k (the load) • Node dynamics: n(k+1) = n(k) + a(k) − d(k) • Processing capacity yields a state-dependent output d(k) = d(n(k)) (performance) • For a constant input a(k), the sign of a(k) − d(k) indicates how n(k) is changing, and n* is the equilibrium point • Node “fails” if n(k) > N • The system is feasible for a(k) < λ* [Figure: a single node with input a(k), load n(k), and output d(k)]
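
A minimal simulation of these node dynamics, reusing the linear velocity model from earlier as the state-dependent output (all parameters illustrative):

```python
K, N = 1.0, 100.0

def d(n):
    """State-dependent output: throughput n * v(n) under the linear model."""
    return max(n * K * (1.0 - n / N), 0.0)

def simulate(a, steps=500):
    """Iterate n(k+1) = n(k) + a - d(n(k)); report failure once n(k) > N."""
    n = 0.0
    for _ in range(steps):
        n = n + a - d(n)
        if n > N:
            return "failed"
    return round(n, 2)

print(simulate(a=10.0))  # a < lambda* = K*N/4 = 25: settles at equilibrium n*
print(simulate(a=30.0))  # a > lambda*: load grows without bound, node fails
```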

  47. Network Dynamics • The presence of an arc between adjacent nodes couples their behavior: node 1’s output becomes node 2’s input, e.g. a2(k) = min{ d1(n1(k)), u1(k) } • Arc capacities limit both outgoing and incoming flow [Figure: nodes 1 and 2 joined by an arc of capacity u1, with inputs a1(k), a2(k), loads n1(k), n2(k), and state-dependent outputs d1(n1), d2(n2)]

  48. Network Dynamics • The failure of one node can lead to the failure of another • When a node fails, the capacity of its incoming arcs drops effectively to zero (u1 = 0) • The upstream node loses the capacity of the arc • In the absence of control, the upstream node fails too • Result: node failures propagate “upstream”… • Question: How will the network respond to perturbations? [Figure: the two-node system after node 2 fails, with u1 = 0]
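
A sketch of this upstream propagation in a two-node chain under the same assumed dynamics: node 2 is failed by an external perturbation, the arc into it effectively loses its capacity, and node 1 backs up and fails in turn (all numbers illustrative):

```python
K, N = 1.0, 100.0

def d(n, failed):
    """Node throughput under the linear model; zero once the node has failed."""
    return 0.0 if failed else max(n * K * (1.0 - n / N), 0.0)

n1 = n2 = 0.0
failed1 = failed2 = False
a1 = 20.0  # external input to node 1 (below lambda* = 25, so stable alone)
u1 = 25.0  # capacity of the arc from node 1 to node 2

for k in range(400):
    if k == 100:
        failed2 = True  # perturbation: node 2 fails outright
    # The arc limits node 1's shipped flow; a failed node 2 accepts nothing
    out1 = min(d(n1, failed1), 0.0 if failed2 else u1)
    n1 = n1 + a1 - out1
    n2 = n2 + out1 - d(n2, failed2)
    if n1 > N:
        failed1 = True
print(failed1, failed2)  # True True: node 2's failure propagated upstream
```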

  49. Network Robustness Consider the behavior of the network in response to a perturbation to arc capacity: • Does the disturbance lead to a local failure? • Does the failure propagate? • How far does it propagate? Measure the consequences in terms of: • Size of the resulting failure island • Loss of network throughput Key factors: • Flow processing sensitivity to congestion • Network topology • Local routing and flow control policies • Time scales

  50. Congestion Sensitivity In many real network systems, components are sensitive to congestion • Using the aforementioned family of functions, we can tune the sensitivity to congestion • Direct consequences for local dynamics, stability, and control • Tradeoff between system efficiency and fragility • Implications for local behavior [Figure: system performance vs. system load, showing evidence of congestion at high load]
