
Towards an Internet that “Never Fails”

Hari Balakrishnan, MIT. Joint work with Nick Feamster, Scott Shenker, and Mythili Vutukuru.



Presentation Transcript


  1. Towards an Internet that “Never Fails”
     Hari Balakrishnan, MIT
     Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru

  2. What We Should Aim Toward
     • Carrier airlines (2002 FAA Fact Book)
       • 41 accidents, 6.7 million flights (five “nines” availability)
     • 911 phone service (1993 NRIC report)
       • 29 minutes downtime per year per line (four “nines” availability)
     • Standard phone service (various sources)
       • 53 minutes downtime per year per line (four “nines” availability)
     • The Internet?
       • One to two “nines”
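
The “nines” figures on this slide can be checked with a little arithmetic. The sketch below is an illustration, not part of the talk: it treats the number of nines as -log10(unavailability), assumes a 365-day year, and follows the slide's analogy of counting an accident per flight as unavailability. The function names are made up for this example. Note that 53 minutes per year comes out just under 4.0, so the slide's labels are rounded.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def nines(unavailability: float) -> float:
    """Return the (fractional) number of nines for a given unavailability.

    Example: an unavailability of 0.01 (99% available) is 2 nines.
    """
    if not 0.0 < unavailability < 1.0:
        raise ValueError("unavailability must be strictly between 0 and 1")
    return -math.log10(unavailability)


def downtime_minutes_per_year(n_nines: float) -> float:
    """Return the yearly downtime, in minutes, allowed by n_nines of availability."""
    return MINUTES_PER_YEAR * 10 ** (-n_nines)


if __name__ == "__main__":
    # Figures taken from the slide.
    cases = {
        "Carrier airlines (41 accidents / 6.7M flights)": 41 / 6.7e6,
        "911 phone service (29 min/year)": 29 / MINUTES_PER_YEAR,
        "Standard phone service (53 min/year)": 53 / MINUTES_PER_YEAR,
    }
    for label, unavailability in cases.items():
        print(f"{label}: {nines(unavailability):.2f} nines")
    print(f"Four nines allows {downtime_minutes_per_year(4):.2f} min/year of downtime")
```

Going from one or two nines to four or five is a change of two to three units on this scale, that is, two to three orders of magnitude less downtime.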

  3. Example Catastrophic Failures
     “…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com, April 25, 1997
     “Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001
     “WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to ‘a route table issue.’” -- cnn.com, October 3, 2002
     “A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).” -- dslreports.com, February 23, 2004

  4. NANOG List Failure “Analysis”
     More than 70% of threads discussing failures related to router configuration or route announcement problems
     Note: Only includes problems openly discussed on this list.

  5. Faults and Failures
     • Fault = Underlying defect in a component that causes it to violate a specification
       • Latent or Active (i.e., cause errors)
     • Unmasked faults (errors) cause failures
     • Failure of subsystem (spec violation) causes fault in system
     • Internet faults occur for complex reasons
       • Hardware, software, protocol, design, implementation, operational faults: could be triggered by malice
     • Internet failure: A cannot communicate with B

  6. Three Directions
     • Configuration as programming
       • Defines BGP behavior
       • Tools to cope with routing complexity
     • Coping with protocol faults: failure-atomic interdomain routing
       • Prefix-based routing considered harmful
     • End-to-end routing
       • Exposing multiple paths to end systems (and stubs)

  7. Today: Reactive Operation
     “What happens if I tweak this policy…?”
     • Problems cause downtime
     • Problems often not immediately apparent

  8. Coping with Complexity
     • View configuration as (distributed) programming
       • Large-scale: over 1M lines of code in some networks
     • Programming tools to reduce fault frequency
       • Static analysis can detect many faults [rcc]
       • Sandboxing to overcome current “stimulus-response” reasoning [FR03]
     • Centralize configuration platform
       • More “intentional” config specs
       • Push configs to routers
       • Push routes to routers [RCP:F+04]
       • Use static analysis and sandboxing tools

  9. Proactive Operation with rcc
     http://nms.csail.mit.edu/rcc
     • Represent complex, distributed configuration
     • Define a correctness specification
     • Map specification to constraints
     [Figure labels: Configure, Detect Faults (rcc), Deploy; Distributed router configurations (Single AS); Normalized Representation; Correctness Specification; Constraints; Faults]

  10. Correctness Specification
     • Path Visibility: Every destination with a usable path has a route advertisement
       • If there exists a path, then there exists a route
       • Example violation: Signaling partition
     • Route Validity: Every route advertisement corresponds to a usable path
       • If there exists a route, then there exists a path
       • Example violation: Routing loop
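
The two properties above are each a one-directional implication between “a path exists” and “a route exists”. The toy model below illustrates that logic only; it is not how rcc works (rcc analyzes router configurations statically). The data model is an assumption made for this sketch: `links` is a directed graph of usable links, `routes` maps each router to the destinations it holds a route advertisement for, and all names are hypothetical.

```python
from collections import deque


def reachable(links, src):
    """Return the set of nodes reachable from src over usable directed links."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def check_spec(links, routes, destinations):
    """Return (path_visibility_violations, route_validity_violations).

    Path visibility is violated when a usable path exists but no route does.
    Route validity is violated when a route exists but no usable path does.
    """
    visibility, validity = [], []
    for router in sorted(set(links) | set(routes)):
        can_reach = reachable(links, router)
        for dest in sorted(destinations):
            if dest == router:
                continue
            has_path = dest in can_reach
            has_route = dest in routes.get(router, set())
            if has_path and not has_route:
                visibility.append((router, dest))
            if has_route and not has_path:
                validity.append((router, dest))
    return visibility, validity


# Hypothetical three-router example.
links = {"r1": ["r2"], "r2": ["dst"], "r3": []}
routes = {"r1": {"dst"}, "r2": set(), "r3": {"dst"}}
visibility, validity = check_spec(links, routes, {"dst"})
print("Path visibility violations:", visibility)
print("Route validity violations:", validity)
```

Here `r2` can reach `dst` but has no route for it (a path visibility violation), while `r3` holds a route for `dst` with no usable path behind it (a route validity violation).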

  11. Results: Faults across 17 ASes
     • Every AS had faults, regardless of network size
     • Most faults can be attributed to distributed configuration
     [Figure panels: Route Validity; Path Visibility]

  12. Three Directions
     • Configuration as programming
       • Tools to cope with routing complexity
     • Coping with protocol faults: failure-atomic interdomain routing
       • Prefix-based routing considered harmful
     • End-to-end routing
       • Exposing multiple paths to end systems

  13. Prefixes are too coarse-grained
     • Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
     • 70% of intra-AS failures not visible in BGP [FABK03]

  14. …but they are also too fine-grained!
     • ~70% of discontiguous prefix pairs from the same AS are announced from the same location
     • Allocation explains about 60% of these cases:
       • Registries often allocate discontiguous address blocks to a single AS on the same day
     • Routes for these prefixes will “flap” together.
       • 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
     Route objects should correspond to an “atom” of hosts that share fate
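
To make “discontiguous” concrete, the sketch below uses Python's standard `ipaddress` module to check that the two example prefixes neither overlap nor sit next to each other, so they cannot be merged into one announcement. The definition of contiguity used here (overlapping or directly adjacent) and the grouping by a location label are assumptions for illustration; the location names are hypothetical and do not come from the talk. The check assumes both prefixes are IPv4.

```python
import ipaddress
from collections import defaultdict


def are_contiguous(prefix_a: str, prefix_b: str) -> bool:
    """Return True if two prefixes overlap or are directly adjacent."""
    net_a = ipaddress.ip_network(prefix_a)
    net_b = ipaddress.ip_network(prefix_b)
    if net_a.overlaps(net_b):
        return True
    low, high = sorted((net_a, net_b), key=lambda n: int(n.network_address))
    # Adjacent when the higher block starts right after the lower block ends.
    return int(low.broadcast_address) + 1 == int(high.network_address)


def group_into_atoms(announcements):
    """Group prefixes that share a (hypothetical) announcement location."""
    atoms = defaultdict(list)
    for prefix, location in announcements:
        atoms[location].append(prefix)
    return dict(atoms)


print(are_contiguous("135.36.0.0/16", "135.12.0.0/14"))

# "site-A" and "site-B" are made-up labels for this example.
atoms = group_into_atoms([
    ("135.36.0.0/16", "site-A"),
    ("135.12.0.0/14", "site-A"),
    ("192.0.2.0/24", "site-B"),
])
print(atoms)
```

Grouping by a shared-fate key such as location is one way to picture an “atom”: prefixes in the same group would be expected to appear and disappear together, regardless of how their address blocks were allocated.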

  15. Proposal: Atomic Interdomain Protocol (AIP)
     • Exterminate prefixes
     • Name “atomic domains” (AD) directly
       • Addressing, forwarding and routing on ADs
       • Like current AS numbers, but finer-grained
       • Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
     • Flat AD IDs can carry cryptographic meaning
       • Self-certifying (hash of public key)
     • End-system addresses have the form [AD : LocalID]
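
A self-certifying identifier can be sketched in a few lines: the AD ID is derived by hashing the domain's public key, so anyone holding the key can recompute the ID and confirm the binding without a third party. The slide specifies only “hash of public key”; the choice of SHA-256, the hex encoding, the placeholder key bytes, and the function names below are all assumptions for illustration.

```python
import hashlib
import hmac


def ad_id_from_public_key(public_key: bytes) -> str:
    """Derive a flat AD identifier as the SHA-256 hex digest of the public key.

    SHA-256 and hex encoding are assumptions; the slide does not name a hash.
    """
    return hashlib.sha256(public_key).hexdigest()


def verify_ad_id(claimed_ad_id: str, public_key: bytes) -> bool:
    """Check that a claimed AD ID really is the hash of the presented key."""
    expected = ad_id_from_public_key(public_key)
    return hmac.compare_digest(claimed_ad_id, expected)


def make_address(ad_id: str, local_id: str) -> str:
    """Format an end-system address in the [AD : LocalID] form from the slide."""
    return f"[{ad_id} : {local_id}]"


# Placeholder bytes standing in for a real encoded public key.
example_key = b"example-public-key-bytes"
ad_id = ad_id_from_public_key(example_key)
print(make_address(ad_id, "host-1"))
print(verify_ad_id(ad_id, example_key))
print(verify_ad_id(ad_id, b"some-other-key"))
```

Because the ID is flat and derived from the key, it carries no topological structure, which is consistent with routing on ADs rather than on aggregatable prefixes.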

  16. Summary
     It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
     It’s possible to get four or five nines of Internet availability, if we:
     • Develop tools to cope with configuration complexity
     • Develop a failure-atomic routing system
     • Expose multiple IP-layer paths to higher layers
