
A survey of Internet routing reliability

A survey of Internet routing reliability. Presented by Kundan Singh, IRT internal talk, April 9, 2003. Agenda: routing overview; problems (route oscillations, slow convergence, scaling, configuration); effect on VoIP.





Presentation Transcript


  1. A survey of Internet routing reliability
     Presented by Kundan Singh
     IRT internal talk, April 9, 2003

  2. Agenda
     • Routing overview
     • Problems
       • Route oscillations
       • Slow convergence
       • Scaling
       • Configuration
     • Effect on VoIP

  3. Overview of Internet routing
     [Figure: autonomous systems, including AT&T (international provider), MCI, regional providers, a cable modem provider and campuses; labels: OSPF (optimize path), BGP (policy based)]

  4. Border gateway protocol
     • TCP
     • OPEN, UPDATE, KEEPALIVE, NOTIFICATION
     • Hierarchical peering relationship
     • Export
       • all routes to customers
       • only customer and local routes to peers and providers
     • Path-vector
     • Optimal AS path satisfying policy
     [Figure: ASes 0 to 7 with provider, customer, peer and backup links; example AS paths for destination d: 47, 247, 1247, 31247; for destination e: 3125]
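
The export rule and the path-vector step on this slide can be sketched in a few lines. This is a minimal illustration, not router code: the names `Relation`, `should_export` and `prepend_own_as` are hypothetical, and real export policy is configured per session rather than hard-coded.

```python
from enum import Enum


class Relation(Enum):
    """Business relationship of a neighbor AS, or LOCAL for our own routes."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"
    LOCAL = "local"


def should_export(learned_from: Relation, send_to: Relation) -> bool:
    """Export rule from the slide: customers receive all routes; peers and
    providers receive only customer and local routes."""
    if send_to is Relation.LOCAL:
        raise ValueError("send_to must be a neighbor relationship")
    if send_to is Relation.CUSTOMER:
        return True
    return learned_from in (Relation.CUSTOMER, Relation.LOCAL)


def prepend_own_as(own_as: int, as_path: tuple) -> tuple:
    """Path-vector step: reject a looping path, otherwise prepend our AS."""
    if own_as in as_path:
        raise ValueError("AS path already contains this AS (loop)")
    return (own_as,) + tuple(as_path)
```

With the figure's example, AS 2 receiving the path (4, 7) for destination d would advertise (2, 4, 7), matching the labels "d: 47" and "d: 247".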

  5. Route selection
     • Local AS preference
     • AS path length
     • Multi-exit discriminator (MED)
     • Prefer external-BGP over internal-BGP
     • Use internal routing metrics (e.g., OSPF)
     • Use identifier as last tie breaker
     [Figure: AS1 to AS4 with routers B1 to B4, R1, R2, C1 and C2]
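
The selection order on this slide amounts to a lexicographic comparison, which can be sketched as a sort key. The `Route` fields below are illustrative names, not a real router's data model, and the sketch compares MED across all candidates, whereas BGP normally compares MED only among routes learned from the same neighboring AS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    local_pref: int   # higher is better
    as_path: tuple    # shorter is better
    med: int          # lower is better
    is_ebgp: bool     # external-BGP preferred over internal-BGP
    igp_cost: int     # lower internal metric (e.g., OSPF) is better
    router_id: int    # lowest identifier wins the final tie


def selection_key(route: Route) -> tuple:
    """Tuple ordered like the slide's steps; the smallest key is the best route."""
    return (
        -route.local_pref,
        len(route.as_path),
        route.med,
        not route.is_ebgp,   # False sorts before True, so eBGP wins
        route.igp_cost,
        route.router_id,
    )


def best_route(routes):
    """Pick the preferred route among candidates for one prefix."""
    return min(routes, key=selection_key)
```

Because each step only matters when all earlier steps tie, a tuple key expresses the whole decision process without nested conditionals.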

  6. Route oscillation
     • Each AS policy independent
     • Persistent vs transient
     • Not if distance based
     • Solution:
       • Static graph analysis
       • Policy guidelines
       • Dynamic “flap” damping
     [Figure: ASes 0, 1 and 2]
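
A persistent oscillation can be illustrated with a small model. The figure is not recoverable from the transcript, so the sketch below assumes the well-known cyclic-preference example: each AS in a ring prefers the route through its clockwise neighbor over its own direct route, and that route exists only while the neighbor uses its direct route. All names are hypothetical.

```python
from itertools import product


def best_choice(i, state):
    """Preferred choice of AS i given every AS's current choice.

    state[j] is "direct" or "via" (via = through neighbor (j + 1) % n).
    The "via" path is available only while the neighbor routes directly.
    """
    n = len(state)
    neighbor = (i + 1) % n
    return "via" if state[neighbor] == "direct" else "direct"


def stable_states(n):
    """All assignments in which no AS wants to change its choice."""
    return [
        state
        for state in product(("direct", "via"), repeat=n)
        if all(best_choice(i, state) == state[i] for i in range(n))
    ]


def run_synchronous(n, steps):
    """Let every AS re-decide simultaneously, starting from all-direct."""
    state = ("direct",) * n
    history = [state]
    for _ in range(steps):
        state = tuple(best_choice(i, state) for i in range(n))
        history.append(state)
    return history
```

Under these assumptions `stable_states(3)` is empty, so three such ASes can never settle regardless of message ordering, while an even-sized ring such as `stable_states(4)` does have stable assignments. A distance-based metric cannot produce this cycle of preferences, which is the point of the "not if distance based" bullet.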

  7. Static analysis
     • Abstract models:
       • Solvable?
       • Resilience on link failure?
       • Multiple solutions?
       • Sometimes solvable?
     • Does not work
       • NP-complete
       • Relies on Internet routing registries

  8. Policy guidelines
     • MUST
       • Prefer customer over peer/provider
       • Have lowest preference for backup path
       • “avoidance level” increases as path traverses
       • MED must be used across all advertisements
     • Works even on failure and consistent with current practice
     • Limits the policy usage

  9. Convergence in intra-domain
     • IS-IS: millisecond convergence
       • Detect change (hardware, keep-alive)
       • Improved incremental SPF
       • Link “down” immediate, “up” delayed
       • Propagate update before calculating SPF
       • Keep-alive before data packets
       • Detect duplicate updates
     • OSPF stability
       • Sub-second keep-alive
       • Randomization
       • Multiple failures
       • Loss resilience
     • Distance vector
       • Count to infinity

  10. BGP convergence
      [Figure: ASes 0, 1, 2 and destination R]
      • Route tables: ( R, 1R, 2R), (0R, 1R, R), (0R, R, 2R)

  11. BGP convergence
      [Figure: ASes 0, 1, 2 and destination R]
      • Route tables: ( - , 1R, 2R), (0R, 1R, - ), (0R, - , 2R)
      • Updates: 0->1: 01R; 0->2: 01R; 1->0: 10R; 1->2: 10R; 2->0: 20R; 2->1: 20R

  12. BGP convergence
      [Figure: ASes 0, 1, 2 and destination R]
      • Route tables: ( - , 1R, 2R), (01R, 1R, - ), ( - , - , 2R)
      • Updates: 01R, 01R; 1->0: 10R; 1->2: 10R; 1->0: 12R; 1->2: 12R; 2->0: 20R; 2->1: 20R; 2->0: 21R; 2->1: 21R

  13. BGP convergence
      [Figure: ASes 0, 1, 2 and destination R]
      • Route tables: ( - , - , 2R), (01R, 10R, - ), ( - , - , 2R)
      • Updates (W: withdrawal): 0->1: W; 0->2: W; 10R, 10R; 1->0: 12R; 1->2: 12R; 2->0: 20R; 2->1: 20R; 2->0: 21R; 2->1: 21R; 2->0: 201R; 2->1: 201R

  14. BGP convergence
      [Figure: ASes 0, 1, 2 and destination R]
      • After 48 steps: ( - , - , - ), ( - , - , - ), ( - , - , - )
      • MinRouteAdver
        • Applies to announcements
        • In 13 steps
      • Sender side loop detection
        • One step
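
Sender-side loop detection can be sketched as a check made before announcing: if the neighbor's AS already appears in the path, the sender withdraws instead of announcing a path the neighbor would reject anyway. The function name and the use of `None` for a withdrawal are assumptions made for illustration.

```python
def announce_or_withdraw(best_path, sender_as, neighbor_as):
    """Return the AS path to announce to neighbor_as, or None for a withdrawal.

    best_path is the sender's current best path, excluding the sender itself,
    e.g. (1, "R") for the table entry "1R" held by AS 0.
    """
    if best_path is None:
        return None  # no route left, so withdraw
    if neighbor_as in best_path:
        # The neighbor would discard this path as a loop on receipt;
        # withdrawing now avoids one round of path exploration.
        return None
    return (sender_as,) + tuple(best_path)
```

In the three-AS example, AS 0 holding "1R" would announce (0, 1, "R") to AS 2 but send a withdrawal to AS 1.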

  15. BGP convergence [2]
      • Latency due to path exploration
      • Fail-over latency = 30 n
        • Where n = longest backup path length
      • Within 3 min, some oscillations up to 15 min
      • Loss and delay during convergence
      • “up” converges faster than “down”
      • Verified using experiment
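
The fail-over bound on this slide is simple arithmetic. The sketch below assumes the factor 30 is in seconds (the usual default for the MinRouteAdver timer); the slide does not state the unit, and the function name is hypothetical.

```python
MIN_ROUTE_ADVER_SECONDS = 30  # assumed unit for the slide's factor of 30


def failover_latency_seconds(longest_backup_path_len: int) -> int:
    """Slide's estimate: fail-over latency = 30 * n."""
    if longest_backup_path_len < 0:
        raise ValueError("path length must be non-negative")
    return MIN_ROUTE_ADVER_SECONDS * longest_backup_path_len
```

Under that assumption, a longest backup path of 6 ASes gives 180 seconds, which is the same order as the three-minute figure quoted on the slide.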

  16. BGP convergence [3]
      • Path exploration => latency
      • More dense peering => more latency
      • Large providers, better convergence

  17. BGP convergence [4]
      • Route flap damping
        • To avoid excessive flaps, penalize updated routes
        • Penalty decays exponentially
        • “suppression” and “reuse” threshold
        • Worsens convergence
      • Selective damping
        • Do not penalize if path length keeps increasing
        • Attach a preference with route
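
The damping mechanism (a penalty per update, exponential decay, and separate suppression and reuse thresholds) can be sketched as follows. The class name and every numeric default are illustrative assumptions, not values from the talk or from any vendor.

```python
class FlapDamper:
    """Per-route flap damping state with exponential penalty decay."""

    def __init__(self, penalty_per_flap=1000.0, suppress=2000.0,
                 reuse=750.0, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress = suppress      # suppress the route at or above this
        self.reuse = reuse            # release the route once below this
        self.half_life = half_life    # seconds for the penalty to halve
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        """Apply exponential decay for the time elapsed since last update."""
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record one route update (flap) at time `now`, in seconds."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress:
            self.suppressed = True

    def usable(self, now):
        """True if the route may be used and advertised at time `now`."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return not self.suppressed
```

The gap between the two thresholds is what "worsens convergence": once suppressed, a route stays unusable until the penalty decays below the reuse threshold, even if the path has been stable in the meantime. Path exploration after a single failure can generate enough updates to trigger this, which motivates the selective damping bullets.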

  18. BGP convergence [5]
      • 12R and 235R are inconsistent. Prefer directly learnt 235R
      • Distinguish failure from policy change
      • Order of magnitude improvement
      [Figure: ASes 0, 1, 2, 3, 5 and destination R; paths 12R, 2R, 235R]

  19. BGP scaling
      • Full mesh logical connection within an AS
      • Add hierarchy

  20. BGP scaling [2]
      • Route reflector
        • More popular
        • Upgrade only RR
      • Confederations
        • Sub-divide AS
        • Fewer updates, sessions
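
The scaling problem behind these two slides is the quadratic number of internal sessions in a full mesh. A minimal sketch of the session counts, assuming a single-level design in which every client peers with every reflector and the reflectors are fully meshed (function names are hypothetical):

```python
def full_mesh_sessions(n_routers: int) -> int:
    """Internal BGP sessions when every router peers with every other."""
    return n_routers * (n_routers - 1) // 2


def route_reflector_sessions(n_clients: int, n_reflectors: int = 1) -> int:
    """Sessions with one level of route reflectors.

    Assumes each client peers with every reflector and the reflectors
    form a full mesh among themselves.
    """
    return n_clients * n_reflectors + full_mesh_sessions(n_reflectors)
```

For 100 routers, a full mesh needs 4950 sessions, while 98 clients with 2 reflectors need 197.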

  21. BGP scaling [3]
      • May have loop
        • If signaling path is not forwarding path
      [Figure: two route reflectors (RR) with clients C1 and C2 and routes P and Q; labels “Choose P” and “Choose Q”; legend: signaling path, logical BGP session, physical link]

  22. BGP scaling [4]
      • Persistent oscillations possible
      • Modify to pass multiple route information within an AS

  23. BGP stability
      • Initial experiment (’96)
        • 99% redundant updates <= implementation or configuration bug
      • After bug fixes (’97-’98)
        • Well distributed across AS and prefix

  24. BGP stability [2]
      • Inter-domain experiment (’98)
        • 9 months, 9 GB, 55000 routes, 3 ISPs, 15 min filtering
        • 25-35% of routes are 99.99% available
        • 10% of routes are less than 95% available

  25. BGP stability [3]
      • Failure
        • More than 50% have MTTF > 15 days, 75% failed in 30 days
        • Most fail-over/re-route within 2 days (increased since ’94)
      • Repair
        • 40% of route failures repaired in < 10 min, 60% in 30 min
      • Small fraction of routes account for the majority of instability
      • Weekly/daily frequency => congestion possible

  26. BGP stability [4]
      • Backbone routers
        • Interface MTTF 40 days
        • 80% of failures resolved in 2 hr
        • Maintenance, power and PSTN are major causes of outages (approx 16% each)
        • Overall uptime of 99%
      • Popular destinations
        • Quite robust
        • Average duration is less than 20 s => due to convergence
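
MTTF and repair-time figures like these are usually related to availability through the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is only an illustration: the talk does not give a mean time to repair, so treating 2 hours as the MTTR is an assumption, and the function name is hypothetical.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a component is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Interface MTTF from the slide (40 days), with an ASSUMED 2-hour MTTR.
interface_availability = steady_state_availability(40 * 24, 2)
```

With these inputs a single interface comes out near 99.8% available. The slide's 2-hour figure covers only 80% of failures, so the true mean repair time is likely longer and the real number lower.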

  27. BGP under stress
      • Congestion
        • Prioritize routing control messages over data
      • Routing table size
        • AS count, prefix length, multi-home, NAT
        • Effects: number of updates; convergence
        • Configuration, no universal filter
      • Real routers
        • “malloc” failure
        • Cascading effect
        • Prefix limiting option
        • Graceful restart
      • CodeRed/Nimda
        • Quite robust
        • Some features get activated during stress
        • Improper rate limiting
        • Misconfiguration: IGP instability propagated
        • Bugs: duplicate announcements

  28. BGP misconfiguration
      • Failure to summarize, hijack, advertise internal prefix, or policy
      • 200-1200 prefixes each day
      • ¾ of new advertisements are a result of misconfiguration
      • 4% of prefixes affect connectivity
      • Cause
        • Initialization bug (22%), reliance on upstream filtering (14%), from IGP (32%)
        • Bad ACL (34%), prefix based (8%)
      • Conclusion
        • User interface, authentication, consistency verification, transaction semantics for commands

  29. PSTN failures
      • Switch vendors aim for 99.999% availability
      • Network availability varies (domestic US calls > 99.9%)
      • Study in ’97
        • Overload caused 44% of customer-minutes
        • Mostly short outages
        • Human error caused 50% of outages
        • Software only 14%
        • No convergence problem
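
Availability percentages are easier to compare when converted to downtime. A small sketch, assuming a 365-day year (the function name is hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected unavailable minutes per year at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR
```

The 99.999% switch target corresponds to about 5.3 minutes of downtime per year and 99.9% to about 8.8 hours, while the 99% and 98% figures quoted elsewhere in the talk correspond to roughly 3.7 and 7.3 days.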

  30. VoIP
      • Tier-1 backbone (Sprint) has good delay and loss characteristics
      • Average scattered loss 0.19% (mostly single packet loss, use FEC)
      • 99.9% of probes have < 33 ms delay
      • Most burst loss is due to routing problems
      • Customer sites have more problems
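
The slide notes that scattered single-packet loss can be handled with FEC. One simple scheme, shown here purely as an illustration and not as the method used in the cited measurements, is XOR parity over a group of packets: one extra parity packet lets the receiver rebuild any single lost packet in the group. The sketch assumes equal-length packets; the function names are hypothetical.

```python
def xor_parity(packets):
    """Parity packet: byte-wise XOR of all (equal-length) packets in a group."""
    size = len(packets[0])
    if any(len(p) != size for p in packets):
        raise ValueError("all packets in a group must have the same length")
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single missing packet (marked None) from the parity packet.

    Returns (index_of_missing_packet, recovered_bytes).
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one lost packet")
    rebuilt = bytearray(parity)
    for packet in received:
        if packet is not None:
            for i, byte in enumerate(packet):
                rebuilt[i] ^= byte
    return missing[0], bytes(rebuilt)
```

This kind of scheme helps with isolated losses but not with the burst losses the slide attributes to routing problems, since a burst removes several packets from the same group.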

  31. VoIP [2]
      • Outages = more than 300 ms loss
      • More than 23% of losses are outages
      • Outages are similar for different networks
      • Call abortion due to poor quality
      • Net availability = 98%
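
The outage definition on this slide (a loss lasting more than 300 ms) can be applied to a packet trace as follows. The 20 ms packet interval is an assumption typical of VoIP streams, not a value from the talk, and the function name is hypothetical.

```python
def find_outages(received, packet_interval_ms=20, threshold_ms=300):
    """Find runs of consecutive lost packets lasting longer than threshold_ms.

    received: booleans in send order, True if the packet arrived.
    Returns a list of (start_index, lost_packet_count) tuples.
    """
    outages = []
    run_start = None
    # A trailing True sentinel closes any loss run at the end of the trace.
    for i, ok in enumerate(list(received) + [True]):
        if not ok and run_start is None:
            run_start = i
        elif ok and run_start is not None:
            length = i - run_start
            if length * packet_interval_ms > threshold_ms:
                outages.append((run_start, length))
            run_start = None
    return outages
```

With 20 ms packets, 15 consecutive losses (300 ms) do not count as an outage under this definition, but 16 (320 ms) do.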

  32. Future work
      • End system and higher layer protocol reliability and availability
      • Mechanism to reduce effect of outages in VoIP
      • Redundancy of VoIP systems during outages
      • Convergence and scaling of TRIP, which is similar to BGP
