
Network Resilience: Exploring Cascading Failures



Presentation Transcript


  1. Network Resilience: Exploring Cascading Failures Vishal Misra Columbia University in the City of New York Joint work with Ed Coffman, Zihui Ge and Don Towsley (Umass-Amherst)

  2. Prologue "On Tuesday, September 18, simultaneous with the onset of the propagation phase of the Nimda worm, we observed a BGP storm. This one came on faster, rode the trend higher, and then, just as mysteriously, turned itself off, though much more slowly. Over a period of roughly two hours, starting at about 13:00 GMT (9am EDT), aggregate BGP announcement rates exponentially ramped up by a factor of 25, from 400 per minute to 10,000 per minute, with sustained "gusts" to more than 200,000 per minute. The advertisement rate then decayed gradually over many days, reaching pre-Nimda levels by September 24th." (http://www.renesys.com/projects/bgp_instability) Similar events were observed on July 19th, the day Code Red spread.

  3. Cascading Failures? Conjecture • The viruses started random IP port scanning • Most of these random IP addresses were not among the cached entries of the routing table, causing… • frequent cache misses, and… • in the case of invalid IP addresses, generation of ICMP (router error) messages… • …both of which led to router CPU overload, causing routers to crash • Router failure led to withdrawal announcements by the peers, generating a high level of advertisement traffic • When a router came back up, it required a full state update from its peers, creating a large spike in the load of the peers that provided the state dump • Once the restarted router obtained all the dumps, it dumped its full state to all its peers, creating another spike in load… • Frequent full state dumps led to more CPU overload, more crashes, and the propagation of the cycle…

  4. Outline • Background • Modeling interactions • A Fluid model • Phase transitions • A Birth-Death model • More phase transitions • Insights • Future work

  5. Studies in Cascading Failures • Cascading failures studied extensively in Power Networks (Zaborsky et al.) • Coupling in Power Networks between nodes well understood: e.g. differential equations describe voltage-phasor-load relationships • Coupling in data networks: Routing, Traffic engineering, policy routing, DNS…difficult to model!

  6. Modeling interactions • We model coupling at BGP level • Study the interaction of a clique of BGP routers • Model three different kinds of phenomena: router crash, router repair and full state updates • System essentially forms a mutual aid collective

  7. Clique of routers • Routers form a fully connected graph • All routers are peers of each other • At the AS level, BGP routers form a clique on the order of 540 nodes

  8. A fluid model for interactions • We consider a clique of N nodes • Study the process D of nodes that are down • ks: rate at which a single up node brings up down nodes • kl: rate at which full state updates bring down up nodes • Typically, expect ks >> kl

  9. Drift equations • a(t) = number of arrivals (nodes going down) in [0,t): da(t) = kl·D·(N−D) dt • d(t) = number of departures (repairs) in [0,t): dd(t) = D·((N−D)/D)·ks dt = (N−D)·ks dt • Now consider the drift in the number of down nodes D: dD(t) = da(t) − dd(t) = (N−D)·(kl·D − ks) dt

  10. Dynamics of D • The system shows a phase transition: if D(0) > ks/kl then D(t) → N (total collapse), else D(t) → 0 (full recovery)
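The fluid dynamics can be sketched numerically. Below is a minimal forward-Euler integration of dD/dt = (N − D)(kl·D − ks); the values ks = 0.2 and kl = 0.01 are illustrative assumptions (the slides give only the ratio ks/kl = 20, not absolute rates):

```python
# Forward-Euler integration of the fluid model
#   dD/dt = (N - D) * (kl * D - ks)
# Down nodes are created at rate kl*D*(N-D) (full state dumps overloading
# up nodes) and repaired at rate ks*(N-D).
def simulate(D0, N=100, ks=0.2, kl=0.01, dt=0.01, t_end=200.0):
    D = float(D0)
    t = 0.0
    while t < t_end:
        D += dt * (N - D) * (kl * D - ks)
        D = min(max(D, 0.0), float(N))  # keep D within [0, N]
        t += dt
    return D

threshold = 0.2 / 0.01  # ks / kl = 20
print(simulate(D0=15))  # below threshold: recovers toward 0
print(simulate(D0=25))  # above threshold: cascades toward N
```

Starting just below the threshold the down population drains to zero; starting just above it, the clique collapses, which is the phase transition described on the slide.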

  11. Phase transitions [plot: trajectories of D(t) for N = 100, ks/kl = 20]

  12. Properties of phase transition • Threshold is an absolute quantity rather than a fraction • Cliques with “powerful” (i.e., ks /kl high) nodes do not exhibit cascading failures • Smaller cliques more resistant to phase transitions

  13. A Birth-Death model • Again consider a clique of N nodes • The system state i is the number of down nodes • Transition rates are state dependent [chain diagram: states 0, 1, …, N, with birth rates λ0, λ1, …, λN−1 and death rates μi]

  14. Transient model • Since μN = 0, state N is an absorbing state • The system ends up in state N with probability 1 • Perform transient analysis: compute the mean time to absorption, Wi, starting from state i • Wi is a good indicator of the stability of the system; a low value indicates a propensity to collapse to state N (where all nodes are down) • Physically, interpret Wi as the ability of the system to recover if it ends up in state i through some exogenous process (e.g. attacks)

  15. Solution for Wi • Wi satisfies (λi + μi)·Wi = 1 + λi·Wi+1 + μi·Wi−1, with boundary conditions WN = 0 and λ0·W0 = 1 + λ0·W1

  16. Solution (cont.) • Solving the recurrence backward from the boundary conditions yields a way to compute Wi
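The images carrying the closed-form solution did not survive the transcript. The standard first-passage result for birth-death chains that the recurrence yields is the following (a reconstruction, not the slides' exact notation), where T_i is the expected time to first reach state i+1 from state i:

```latex
T_0 = \frac{1}{\lambda_0}, \qquad
T_i = \frac{1}{\lambda_i} + \frac{\mu_i}{\lambda_i}\, T_{i-1}
      \quad (1 \le i \le N-1),
\qquad
W_i = \sum_{j=i}^{N-1} T_j, \qquad W_N = 0 .
```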

  17. Modeling transition rates • λi = (N−i)·i·kl + ka • ka = ambient traffic load; kl as in the fluid model • μi = (N−i)·ks, with ks as in the fluid model
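With these rates in hand, a sample path of the chain can be simulated directly. The sketch below is a standard Gillespie-style simulation, not code from the talk; the ambient load ka is not specified on the slides, so ka = 0.01 is an assumption, and the repair rate in state 0 is taken as 0 (there are no down nodes to repair), matching the chain diagram:

```python
import random

def simulate_chain(i0, N, ks=1.0, kl=0.01, ka=0.01, max_steps=1_000_000, seed=1):
    """Gillespie-style sample path of the birth-death chain.
    Returns (final_state, elapsed_time)."""
    rng = random.Random(seed)
    i, t = i0, 0.0
    for _ in range(max_steps):
        if i == N:                                # absorbing: all nodes down
            break
        lam = (N - i) * i * kl + ka               # rate of one more node crashing
        mu = (N - i) * ks if i > 0 else 0.0       # repair rate (none in state 0)
        total = lam + mu
        t += rng.expovariate(total)               # exponential holding time
        i += 1 if rng.random() < lam / total else -1
    return i, t

# A clique of N = 200 started above the instability threshold (around
# state ks/kl = 100) cascades quickly to the all-down state:
state, t = simulate_chain(i0=160, N=200)
print(state)
```

Started below the threshold instead, the same simulation would hover near the all-up states for an astronomically long time, which is why the mean time to absorption on the following slides is computed analytically rather than sampled.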

  18. The mean time to absorption • N = 20, ks = 1, kl = 0.01 • System stable: mean time to absorption on the order of 10^26, even if only one node is up

  19. A larger clique • N = 100, ks = 1, kl = 0.01 • System still stable: mean time to absorption on the order of 10^48 if only one node is up

  20. The appearance of phase transitions • N = 200, ks = 1, kl = 0.01 • Mean time to absorption drops from 10^47 to essentially 0 over the span of a few states

  21. Dependence on service rate/load • The transition point shifts right as the ratio ks/kl goes up

  22. Dependence on clique size • The transition point remains roughly the same, but relative stability goes down as N goes up

  23. Early conclusions • Cascading failures are possible in mutual support systems like a BGP clique • The presence of phase transitions depends strongly on system parameters • Clique size is an important factor: larger cliques are more likely to undergo cascading failures

  24. Future work • Refine the model; plug in numbers for the parameters • Look at different topologies • Do more detailed modeling of a single router (fixed point solutions)
