
Sources of Unreliability in Networks



  1. Sources of Unreliability in Networks James Newell CS 598IG – Scattered Systems October 21, 2004

  2. Papers • The Synchronization of Periodic Messages • Internet Routing Instability • Characterising the Use of a Campus Wireless Network

  3. The Synchronization of Periodic Routing Messages Sally Floyd and Van Jacobson IEEE/ACM Transactions on Networking, 1994

  4. Overview • Many sources of periodic network traffic • Router updates • Streaming media applications • Over time, periodic traffic can become synchronized! • Synchronization leads to unbalanced traffic • Packet loss • Increased latency

  5. Examples from the Internet • DECnet's DNA Phase IV routing at LBL (1988) • NEARnet core routers (1992)

  6. Background • Synchronization results from weakly-coupled interactions • Examples • Thai fireflies • Wall clocks • TCP window cycles • External clock synchronization • Client-server models

  7. Router Synchronization • Router updates are periodic • Random fluctuations in the period • Internal fluctuations cause routers to synchronize • External fluctuations break apart synchronized routers • Easy to overlook!

  8. Periodic Messages Model • Algorithm • Router A takes Tc seconds to process its outgoing update • Router B receives the first packet of A's update after Td seconds • If A or B receives the first packet of an update, it processes it in Tc2 seconds • After processing, it resets its timer to a value between Tp - Tr and Tp + Tr [Timeline diagram: A's timer expires; A processes for Tc sec plus an additional Tc2 for an incoming update; after Tc + Tc2 it resets its timer to Tp +/- Tr; A's update arrives at B at time Td]

  9. More on Periodic Message Model • Triggered updates on major changes • Assumptions • No collisions • No lost or retransmitted packets • Similar to real protocols • RIP • IGRP • EGP

  10. Simulations • Initially unsynchronized • Parameters • N = 20 • Tp = 121 sec • Tc = 0.11 sec • Tc2 = 0.11 sec • Tr = 0.1 sec • Td = 0 sec
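
Below is a minimal event-driven sketch of the slide-8 model using the slide-10 parameters. This is my own simplified reconstruction in Python, not the authors' simulator; the random seed, run length, and the end-of-run cluster metric are assumptions made for illustration.

```python
import heapq
import random

# Parameters from slide 10 (seconds); SIM_TIME and the seed are my own choices.
N, Tp, Tc, Tc2, Tr, Td = 20, 121.0, 0.11, 0.11, 0.10, 0.0
SIM_TIME = 200_000.0
rng = random.Random(1)

busy_until = [0.0] * N              # time until router i finishes all processing
last_send = [0.0] * N
events = [(rng.uniform(0, Tp), "expire", i) for i in range(N)]  # unsynchronized start
heapq.heapify(events)

while events:
    t, kind, i = heapq.heappop(events)
    if t > SIM_TIME:
        break
    if kind == "expire":
        # The outgoing update starts once any in-progress processing finishes;
        # this delay is the weak coupling that drags routers into a cluster.
        start = max(t, busy_until[i])
        busy_until[i] = start + Tc
        last_send[i] = start
        for j in range(N):          # full mesh: everyone hears the update after Td
            if j != i:
                heapq.heappush(events, (start + Td, "recv", j))
        heapq.heappush(events, (busy_until[i], "reset", i))
    elif kind == "recv":
        # An incoming update costs Tc2, queued behind whatever i is already doing.
        busy_until[i] = max(t, busy_until[i]) + Tc2
    else:  # "reset": the timer restarts only after *all* queued processing is done
        if busy_until[i] > t:
            heapq.heappush(events, (busy_until[i], "reset", i))
        else:
            heapq.heappush(events, (t + rng.uniform(Tp - Tr, Tp + Tr), "expire", i))

# Crude synchronization metric: the largest group of routers whose final updates
# fall within one cluster-processing width (N * Tc) of each other.
sends = sorted(last_send)
width = N * Tc
largest = max(sum(1 for s in sends if 0 <= s - a <= width) for a in sends)
print(f"largest synchronized cluster at end of run: {largest} of {N}")
```

Whether the run ends synchronized depends on the seed and the run length; lengthening SIM_TIME or shrinking Tr makes the emergent clustering easier to see.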

  11. Analysis • Clusters of synchronized routers form • The largest cluster (size i) dominates • It processes for iTc seconds after its first timer expires • It "jumps" (i - 1)Tc each round on the plot • The largest cluster size characterizes the state of the system • Synchronized groups can merge

  12. Variation of Tr [Simulation panels: Tr = 0.6 Tc, 1.0 Tc, and 1.4 Tc show the transition from unsynchronized to synchronized; Tr = 2.3 Tc, 2.5 Tc, and 2.8 Tc show the transition from synchronized back to unsynchronized]

  13. Markov Chain Model • N nodes that implement the Periodic Messages Model • Each state is the size of the largest cluster • All smaller clusters are assumed to be of size one • The chain moves at most one state per round

  14. Cluster Breakup • Assume Tc < 2Tr + Td • Pi,i-1 = P(Tc < L + Td) = (1 - (Tc - Td)/(2Tr))^i, for 1 < i ≤ N, where L is the gap between the first and second timer expirations within the cluster [Timeline diagram: a cluster of i = 3 resets its timers uniformly in [Tp - Tr, Tp + Tr]; M1 expires first and resets after processing for Tc, before M2 and M3 expire]

  15. Cluster Growth • Cluster_i has a processing time of iTc • Its first timer expires, on average, at Tp - Tr(i - 1)/(i + 1) • Cluster_i "jumps" (i - 1)Tc - Tr(i - 1)/(i + 1) each round compared to single routers • Assume the gap to the next group is roughly exponential with mean Tp/(N - i + 1) • Pi,i+1 = 1 - e^(-((N - i + 1)/Tp)((i - 1)Tc - Tr(i - 1)/(i + 1))), for 1 < i < N • P1,2 is left as a variable (dependent on Tr)

  16. Synchronization • Define f(i) as the expected number of rounds until the cluster size first reaches i, starting from size 1 • See Appendix A for the derivation

  17. Breakup • Define g(i) as the expected number of rounds until the cluster size first reaches i, starting from size N
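
Taken together, the reconstructed transition probabilities from slides 14-15 and a standard birth-death hitting-time recursion are enough to estimate f(i) and g(i) numerically. The Python sketch below is not the paper's Appendix A derivation, and the value of the free parameter P1,2 is an assumption.

```python
import math

# Parameters from slide 10; P_{1,2} is left as a free variable on slide 15,
# so the value here is purely an assumption for illustration.
N, Tp, Tc, Tr, Td = 20, 121.0, 0.11, 0.10, 0.0
P12 = 0.01

def p_up(i):
    """P_{i,i+1}: the largest cluster absorbs one more router (slide 15)."""
    if i == 1:
        return P12
    if i >= N:
        return 0.0
    jump = (i - 1) * Tc - Tr * (i - 1) / (i + 1)
    return 1.0 - math.exp(-((N - i + 1) / Tp) * jump)

def p_down(i):
    """P_{i,i-1}: the cluster loses one router (slide 14)."""
    if i <= 1:
        return 0.0
    return (1.0 - (Tc - Td) / (2 * Tr)) ** i

# E[k]: expected rounds to grow from cluster size k to k+1 (birth-death recursion).
E = {1: 1.0 / p_up(1)}
for k in range(2, N):
    E[k] = (1.0 + p_down(k) * E[k - 1]) / p_up(k)
f = {i: sum(E[k] for k in range(1, i)) for i in range(2, N + 1)}

# D[k]: expected rounds to shrink from cluster size k to k-1.
D = {N: 1.0 / p_down(N)}
for k in range(N - 1, 1, -1):
    D[k] = (1.0 + p_up(k) * D[k + 1]) / p_down(k)
g = {i: sum(D[k] for k in range(i + 1, N + 1)) for i in range(1, N)}

print(f"f({N}) ~ {f[N]:.3g} rounds to full synchronization")
print(f"g(1) ~ {g[1]:.3g} rounds to break apart completely")
```

With one round roughly every Tp = 121 s, these round counts translate into the wall-clock times discussed on the following slides; as slide 18 notes, the Markov model overestimates the simulation results by orders of magnitude, so the numbers are explanatory rather than predictive.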

  18. Evaluation of Analysis • Markov-model estimates are 2-3 orders of magnitude larger than simulation results • A rough approximation (not predictive) • Captures the qualitative behavior (explanatory) • Grossly overestimates for large values of N and Tc

  19. Analysis Results • Choosing Tr as a small multiple of Tc is usually effective at preventing synchronization • The transition to synchronization is abrupt • The paper recommends Tr = Tp / 2 to cover all parameters [Figure: the expected time to synchronize falls abruptly, from roughly 3000 years to about 16 minutes]

  20. Group Size • Steady-state behavior is bimodal • Either almost always unsynchronized • Or almost always synchronized • The addition of just one node can flip the system between modes [Simulation panels: Tr = 0.11 and Tr = 0.30]

  21. Delayed Transmission • In reality, Td ≠ 0 even in low-latency networks • If Td > Tc, little coupling takes place • When 0 < Td < Tc, the strength of the coupling is governed by Tc - Td • Recall Pi,i-1 = (1 - (Tc - Td)/(2Tr))^i
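
As a hedged numeric illustration with the slide-10 parameters (Tc = 0.11 s, Tr = 0.1 s) and the reconstructed formula: with Td = 0, Pi,i-1 = (1 - 0.11/0.2)^i = 0.45^i, whereas with an assumed Td = 0.05 s, Pi,i-1 = (1 - 0.06/0.2)^i = 0.70^i. For a cluster of i = 5 the per-round breakup probability rises from about 1.8% to about 17%, so even a modest propagation delay substantially weakens the coupling.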

  22. Topologies • Assumed mesh model • Model applies to some topologies • Ring • Model breaks for other topologies • Star

  23. Conclusions • Periodic messages from routers can inadvertently synchronize • Emergent behavior with abrupt transition • Synchronization can be overcome • External random component (Tp/2) • Routing timer independent of incoming events • No triggered updates • Account for random bursts of traffic

  24. Discussion • How might the problem have evolved in today's Internet? • Are there better solutions than adding a large random component? • How would the random component affect the performance of the protocol? • How does synchronization happen on WANs where Td can be very large?

  25. Internet Routing Instability Craig Labovitz, G. Robert Malan, and Farnam Jahanian SIGCOMM '97

  26. Overview • Message analysis of inter-domain traffic at major Internet backbones • Rapid changes in node reachability cause network instability • Packet loss • Increased latency • Slower convergence • Connectivity loss!

  27. Internet Background • The Internet is composed of various autonomous systems (ASes) connected by backbones • Each AS has its own administrative and routing policies • AS border routers exchange (peer) routing information about the reachability of IP address blocks (prefixes)

  28. BGP • The Border Gateway Protocol (BGP) is used by ASes to exchange updates • Uses incremental updates • Topology changes • Policy changes • Routes are identified by their ASPATH and prefix • Peer links are built using TCP -> congestion back-off!

  29. BGP Example • Each AS appends itself to the ASPATH when it propagates an update • ASes need to keep a default-free routing table of all visible prefixes [Topology diagram: AS3 originates 110.10.0.0/16, AS4 originates 128.10.0.0/16, AS5 originates 155.10.0.0/16; AS1's table reads 110.10.0.0/16 via AS2 AS3, 128.10.0.0/16 via AS2 AS4, 155.10.0.0/16 via AS2 AS4 AS5]

  30. BGP Routing Updates • Two forms of updates • Announcements of a new path or destination • Withdrawals of earlier announcements • Explicit – using a "withdrawal" message • Implicit – a new announcement replaces the earlier route (e.g., bypassing an AS) • In steady state, updates should occur only for • Local policy changes • Network additions
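
A small Python sketch of how one peer's routing table evolves under announcements, explicit withdrawals, and implicit withdrawals. The prefixes and paths follow slide 29, except the alternate path through "AS6", which is invented for illustration.

```python
# Minimal sketch of one BGP peer's view of routes.
rib = {}  # prefix -> ASPATH (list of AS numbers) learned from this peer

def announce(prefix, as_path):
    """A new announcement for a known prefix implicitly withdraws the old route."""
    implicitly_withdrawn = rib.get(prefix)
    rib[prefix] = as_path
    return implicitly_withdrawn

def withdraw(prefix):
    """An explicit withdrawal removes the route, if one was ever announced."""
    return rib.pop(prefix, None)

# Announcements corresponding to AS1's table on slide 29:
announce("110.10.0.0/16", [2, 3])
announce("128.10.0.0/16", [2, 4])
announce("155.10.0.0/16", [2, 4, 5])

old = announce("128.10.0.0/16", [2, 6, 4])   # implicit withdrawal of the old path
gone = withdraw("155.10.0.0/16")             # explicit withdrawal
print(rib)         # remaining routes
print(old, gone)   # [2, 4]  [2, 4, 5]
```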

  31. Major Findings • The volume of updates is much higher than expected • Pathological and redundant updates dominate routing traffic • Redundant messages are periodic and high-frequency • Update traffic correlates with network usage • Instability cannot be attributed to a small group of ASes or routers • A significant amount of forwarding instability occurs

  32. Gauging Internet Instability • Collected BGP messages at various Internet backbones • Taxonomy of BGP updates • Forwarding instability • Routing policy instability • Redundant updates

  33. Methodology • Logged BGP updates at 5 major US exchange points (e.g., Mae-East) between Jan '96 and Jan '97 • The route servers peer with more than 90% of ISPs

  34. Problems with Instability • Non-convergence of routes • Dropped and out-of-order packets • Increased latency • Increased memory and CPU demands for packet queues • Invalid route caches • Route flapping • BGP keep-alive messages are dropped or delayed • Overloaded routers oscillate between being seen as up and down • This triggers further topology updates -> more route flapping

  35. Instability Mitigation • Route dampening • Ignore updates for routes that exceed a defined instability threshold • Wait a hold-down period T before processing resumes • Legitimate updates can be lost during T • Aggregation • Combine smaller prefixes into a super-prefix • Effective only across planned, cooperative networks • Multi-homed stubs cannot be aggregated well
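
A quick sketch of the aggregation idea using Python's standard ipaddress module; the prefixes are illustrative rather than taken from the paper.

```python
import ipaddress

# Adjacent more-specific prefixes announced together can be collapsed into a
# covering super-prefix, so one route replaces several potentially flapping ones.
more_specifics = [
    ipaddress.ip_network("128.10.0.0/17"),
    ipaddress.ip_network("128.10.128.0/17"),
    ipaddress.ip_network("110.10.0.0/16"),
]
aggregated = list(ipaddress.collapse_addresses(more_specifics))
print(aggregated)  # the two /17s collapse into 128.10.0.0/16; the lone /16 stays as-is
```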

  36. BGP Update Analysis • Taxonomy of routing events • WADiff: route explicitly withdrawn, then a different route is announced • AADiff: route implicitly withdrawn, i.e., replaced by an announcement of a different route • WADup: route explicitly withdrawn, then the same route is re-announced later • AADup: route implicitly withdrawn, then the same route is re-announced (a duplicate announcement) • WWDup: repeated explicit withdrawals • WADiff, AADiff, and WADup reflect instability • WWDup is pathological instability • AADup can be either
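
A sketch of how consecutive update events for one route might be labeled with this taxonomy. The event encoding is my own, and a real classifier (in particular for AADup) would compare all BGP attributes, not just the ASPATH.

```python
# An event is ("A", as_path) for an announcement or ("W", None) for an explicit
# withdrawal; prev_route is the most recently announced path for the prefix.

def classify(prev_kind, prev_route, event):
    kind, path = event
    if kind == "W":
        return "WWDup" if prev_kind == "W" else None  # classified by the *next* event
    if prev_kind == "W":                              # withdrawal followed by announcement
        return "WADup" if path == prev_route else "WADiff"
    return "AADup" if path == prev_route else "AADiff"  # implicit withdrawal

# Example stream for one prefix (paths are illustrative):
events = [("A", [2, 4]), ("A", [2, 6, 4]), ("W", None), ("W", None), ("A", [2, 6, 4])]
prev_kind, prev_route = events[0][0], events[0][1]
for ev in events[1:]:
    label = classify(prev_kind, prev_route, ev)
    print(ev, "->", label or "(pending)")
    if ev[0] == "A":
        prev_route = ev[1]
    prev_kind = ev[0]
# Labels printed: AADiff, (pending), WWDup, WADup
```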

  37. BGP Update Analysis • A typical routing table consists of roughly 45,000 prefixes spanning about 1,300 ASes • Monitored 3 to 6 million updates exchanged each day • Average of 125 updates per network per day • Bursty: sometimes hundreds per second [Figure; WWDup not shown]

  38. Pathology Analysis • The majority of updates are pathological WWDups (0.5 to 6 million per day) • Transmitted by routers that never announced the path ("stateless" BGP) • The problem may be due to a specific type of router or provider • Updates exhibit a period of 30 or 60 seconds • Vendors subsequently fixed stateless BGP • It was not the main source of the additional updates

  39. Possible Pathology Origins • Misconfigured CSU clocks • Clocks can drift • Oscillation between valid and corrupted data • Jittered timers interacting with stateless BGP • Synchronization • Non-jittered timers [see the previous paper] • Improper interaction with interior gateway protocols

  40. Instability Analysis • Focus only on AADiff, WADiff, and WADup • Temporal trends • Highest during normal business hours • High during weekends • Low during the summer break

  41. Fine-grained Analysis • Focus on the Mae-East exchange for the month of August 1996 • Result: no single AS is solely responsible for the instability statistics • ISP A was responsible for a large amount of international traffic • ISP E was going through an infrastructure transition

  42. Fine-grained Analysis • Now focus on a per-route analysis (ASPATH + prefix) • Result: no single route consistently dominates the instability statistics • 20 to 90% of routes (median 75%) had fewer than 10 instability events • 80 to 100% had fewer than 50

  43. Fine-grained Analysis • Temporal properties of update arrivals • Measured the frequency distribution of instability events • The majority (~50%) arrived at either a 30-second or a 60-second interval • Consistent even for legitimate updates
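
A sketch of the kind of inter-arrival analysis described above, binning the gaps between successive events for one route to the nearest second so that spikes at 30 s and 60 s stand out; the timestamps are invented.

```python
from collections import Counter

timestamps = [0.0, 30.1, 60.0, 89.9, 120.2, 180.1, 210.0]  # seconds, one route (made up)
gaps = [round(later - earlier) for earlier, later in zip(timestamps, timestamps[1:])]
for gap, count in Counter(gaps).most_common():
    print(f"{gap:3d} s inter-arrival: {count} event(s)")
```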

  44. Conclusion • Instability continues to be a major problem • Over 99% of update events are redundant • Good: they don't affect routing caches • Bad: the sheer volume can cause outages and delays • Instability cannot be attributed to a few guilty ISPs, routers, or prefix paths • Instability exhibits temporal properties • Correlates with network usage • High-frequency periodicity

  45. Follow-up • From "Origins of Internet Routing Instability" – INFOCOM '99 • June 1996 – 2 million packets per day • June 1998 • Several hundred thousand packets per day • More announcements than withdrawals • The majority are still duplicate announcements • Oscillating routing announcements still occur

  46. Characterising the Use of a Campus Wireless Network David Schwab and Rick Bunt INFOCOM 2004

  47. Overview • Analysis of wireless usage at the University of Saskatchewan • Where • When • How much • Trace allows evaluation of network design principles and plans for future development

  48. Campus Characteristics • 40 buildings on over 363 acres of land • 18,000 students attend the university

  49. Wireless Network Environment • Initial deployment in 2001 with 18 APs • Dispersed through various buildings • Not well advertised • Wireless traffic is routed over a virtual private network with its own subnet • Cisco LEAP authentication is used to provide access to the wireless network

  50. Trace Methodology • Mirrored wireless traffic to a port on a dedicated monitoring computer • Used EtherPeek to log packet data • Used the LEAP server to track authentication data • The trace began Jan 22, 2003 and lasted one week • Data analyzed with Perl scripts
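
A sketch of the sort of per-AP summary such a trace enables, assuming the packet log was exported to a CSV of (timestamp, ap, client_mac, bytes). The file name and column layout are assumptions, and the original analysis used Perl rather than Python.

```python
import csv
from collections import defaultdict

bytes_per_ap = defaultdict(int)
clients_per_ap = defaultdict(set)

# Assumed export format: timestamp, access point name, client MAC, bytes in packet.
with open("wireless_trace.csv", newline="") as f:
    for timestamp, ap, client_mac, nbytes in csv.reader(f):
        bytes_per_ap[ap] += int(nbytes)
        clients_per_ap[ap].add(client_mac)

for ap in sorted(bytes_per_ap, key=bytes_per_ap.get, reverse=True):
    print(f"{ap}: {bytes_per_ap[ap] / 1e6:.1f} MB, {len(clients_per_ap[ap])} clients")
```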
