
Internet Routing (COS 598A) Today: Root-Cause Analysis

Jennifer Rexford, http://www.cs.princeton.edu/~jrex/teaching/spring2005 (Tuesdays/Thursdays 11:00am-12:20pm)


Presentation Transcript


  1. Internet Routing (COS 598A) Today: Root-Cause Analysis • Jennifer Rexford • http://www.cs.princeton.edu/~jrex/teaching/spring2005 • Tuesdays/Thursdays 11:00am-12:20pm

  2. Outline • Network troubleshooting • Motivation for network troubleshooting • Investigating from the edge vs. inside • Active probing • Traceroute • Mapping IP addresses to AS numbers • Passive monitoring • Analyzing BGP update streams • Identifying location and cause of routing change • Limitations of the approach

  3. Network Troubleshooting • “Why can’t I reach www.cnn.com?” • “Why is the performance bad?” • [Figure: a user reaching www.cnn.com across the Internet]

  4. Reachability Problems: What Could be Wrong? • End-host problem • Web server down • DNS server down, or misconfigured • Forwarding-path problem • Packet filter or firewall restricting access • Mismatch in Maximum Transmission Unit (MTU) • Routing problem • User or server disconnected from Internet • Blackhole dropping all packets • Persistent loop

  5. Performance Problem: What Could be Wrong? • End-host problems • Overloaded Web server • Overloaded DNS server • Overloaded user machine • Forwarding-path problem • High round-trip time • Link congestion • Routing problem • Long-term routing instability • Transient disruption during convergence

  6. Motivation for Troubleshooting • Improving performance • Detect, diagnose, and fix the problem • Pick a path through another provider • Pick a different path in any overlay network • Establishing accountability • Enforce Service Level Agreements • Rate service providers • Characterizing the Internet • Understand causes of performance problems • Understand challenges of troubleshooting

  7. Troubleshooting Outside vs. Inside • Outside: from network edge (today's topic) • Who: users and researchers, and operators troubleshooting problems outside their network • Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms • Challenges: inference from very limited data • Inside: from inside the network • Who: operators running a network • Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files • Challenges: collecting and joining the data

  8. Active Probing

  9. Pros and Cons of Active Probing • Advantages • Can run from any end system • Measure the actual forwarding path • See black-holes, loops, and delays directly • Disadvantages • Effects of routing changes, not the cause • Current path, not the path used in the past • Requires frequent probes to observe the changes • Shows only properties of round-trip path • Hard to tell if problem is on forward vs. reverse

  10. Traceroute: Measuring the Forwarding Path • Time-To-Live field in IP packet header • Source sends a packet with a TTL of n • Each router along the path decrements the TTL • “TTL exceeded” sent when TTL reaches 0 • Traceroute tool exploits this TTL behavior • Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message • [Figure: probes with TTL=1 and TTL=2 sent from source toward destination, each triggering a “time exceeded” reply]
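The TTL behavior on this slide can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the real traceroute tool: the router names and the `probe` and `traceroute` functions are hypothetical, and no packets are actually sent.

```python
from typing import List, Optional


def probe(path: List[str], ttl: int) -> Optional[str]:
    """Return the router that would send "time exceeded" for a probe
    with the given TTL, or None if the probe gets past every router.

    `path` lists the routers between source and destination, in order
    (a made-up topology used only for illustration).
    """
    for router in path:
        ttl -= 1          # each router along the path decrements the TTL
        if ttl == 0:      # TTL reached 0: this router sends "time exceeded"
            return router
    return None           # the probe reached the destination


def traceroute(path: List[str], max_ttl: int = 30) -> List[str]:
    """Send probes with TTL=1, 2, 3, ... and record who answers."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        reporter = probe(path, ttl)
        if reporter is None:   # destination reached, stop probing
            break
        hops.append(reporter)
    return hops


routers = ["r1", "r2", "r3"]   # hypothetical routers on the forward path
discovered = traceroute(routers)
print(discovered)
```

Each probe reveals exactly one hop, which is why the tool needs one probe per TTL value to reconstruct the whole forward path.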

  11. Example Traceroute Output (Berkeley to CNN)

      Hop  IP address       DNS name
      1    169.229.62.1     inr-daedalus-0.CS.Berkeley.EDU
      2    169.229.59.225   soda-cr-1-1-soda-br-6-2
      3    128.32.255.169   vlan242.inr-202-doecev.Berkeley.EDU
      4    128.32.0.249     gigE6-0-0.inr-666-doecev.Berkeley.EDU
      5    128.32.0.66      qsv-juniper--ucb-gw.calren2.net
      6    209.247.159.109  POS1-0.hsipaccess1.SanJose1.Level3.net
      7    *                ?  (no response from router)
      8    64.159.1.46      ?  (no name resolution)
      9    209.247.9.170    pos8-0.hsa2.Atlanta2.Level3.net
      10   66.185.138.33    pop2-atm-P0-2.atdn.net
      11   *                ?  (no response from router)
      12   66.185.136.17    pop1-atl-P4-0.atdn.net
      13   64.236.16.52     www4.cnn.com

  12. Example Troubleshooting Results • No packets go beyond your gateway • Gateway’s connection to Internet is dead • Traceroute stops at intermediate point • Perhaps a blackhole • Traceroute path has a loop • Transient or persistent forwarding loop • Traceroute shows a very long path • Routing anomaly, route hijacking, etc. • Traceroute shows very long delays • Delay or congestion on forward or reverse path
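The first three symptoms above can be checked mechanically once the traceroute hops are in a list. The following is a rough sketch, assuming hops are given as address strings with `None` for a missing response; the `diagnose` function, its labels, and the documentation-range addresses are all hypothetical.

```python
from typing import List, Optional


def diagnose(hops: List[Optional[str]], destination: str) -> str:
    """Classify a traceroute result using the symptoms listed on the slide.

    `hops` holds the responding address per TTL, or None for "*".
    """
    responders = [h for h in hops if h is not None]

    # A loop shows up as an address that reappears after a different hop.
    seen = set()
    previous = None
    for hop in responders:
        if hop != previous and hop in seen:
            return "forwarding loop"
        seen.add(hop)
        previous = hop

    if responders and responders[-1] == destination:
        return "destination reached"
    if len(responders) <= 1:
        return "no response beyond the gateway"
    return "stops at intermediate point (perhaps a blackhole)"


GW, R1, R2, DST = "192.0.2.1", "192.0.2.2", "192.0.2.3", "198.51.100.7"

loop_case = diagnose([GW, R1, R2, R1, R2], DST)
gateway_case = diagnose([GW, None, None], DST)
blackhole_case = diagnose([GW, R1, None, None], DST)
ok_case = diagnose([GW, None, R1, DST], DST)
print(loop_case, gateway_case, blackhole_case, ok_case, sep="\n")
```

A classifier like this only labels the symptom. As the slide notes, deciding whether the cause sits on the forward or the reverse path needs more information.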

  13. Problems with Traceroute • Missing responses • Routers might not send “Time-Exceeded” • Firewalls may drop the probe packets • “Time-Exceeded” reply may be dropped • Misleading responses • Probes taken while the path is changing • Name not in DNS, or DNS entry misconfigured • Mapping IP addresses • Mapping interfaces to a common router • Mapping interface/router to Autonomous System

  14. Map Traceroute Hops to ASes • Need accurate IP-to-AS mappings (for network equipment).

      Hop  IP address       AS
      1    169.229.62.1     AS25 (Berkeley)
      2    169.229.59.225   AS25 (Berkeley)
      3    128.32.255.169   AS25 (Berkeley)
      4    128.32.0.249     AS25 (Berkeley)
      5    128.32.0.66      AS11423 (Calren)
      6    209.247.159.109  AS3356 (Level3)
      7    *                AS3356 (Level3)
      8    64.159.1.46      AS3356 (Level3)
      9    209.247.9.170    AS3356 (Level3)
      10   66.185.138.33    AS1668 (AOL)
      11   *                AS1668 (AOL)
      12   66.185.136.17    AS1668 (AOL)
      13   64.236.16.52     AS5662 (CNN)
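Once every hop carries an AS number, the AS-level path falls out by collapsing consecutive repeats. A minimal sketch, using the per-hop AS labels from this slide; the `collapse_to_as_path` helper is a hypothetical name, and treating unmapped hops as `None` is an assumption.

```python
from itertools import groupby
from typing import List, Optional

# Per-hop AS numbers as labeled on the slide (13 hops, Berkeley to CNN).
hop_ases = [25, 25, 25, 25, 11423, 3356, 3356, 3356, 3356,
            1668, 1668, 1668, 5662]


def collapse_to_as_path(hops: List[Optional[int]]) -> List[int]:
    """Drop unmapped hops (None) and merge runs of the same AS."""
    mapped = [asn for asn in hops if asn is not None]
    return [asn for asn, _run in groupby(mapped)]


as_path = collapse_to_as_path(hop_ases)
print(as_path)  # [25, 11423, 3356, 1668, 5662]
```

The resulting list is what can later be compared against a BGP AS path for the same destination.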

  15. Candidate Ways to Get IP-to-AS Mapping • Routing address registry • Voluntary public registry such as whois.radb.net • Used by prtraceroute and “NANOG traceroute” • Incomplete and quite out-of-date • Mergers, acquisitions, delegation to customers • Origin AS in BGP paths • Public BGP routing tables such as RouteViews • Used to translate traceroute data to an AS graph • Incomplete and inaccurate… but usually right • Multiple Origin ASes, no mapping, wrong mapping

  16. Example: BGP Table (“show ip bgp” at RouteViews)

         Network          Next Hop         Metric LocPrf Weight Path
      *  3.0.0.0/8        205.215.45.50                       0 4006 701 80 i
      *                   167.142.3.6                         0 5056 701 80 i
      *                   157.22.9.7                          0 715 1 701 80 i
      *                   195.219.96.239                      0 8297 6453 701 80 i
      *                   195.211.29.254                      0 5409 6667 6427 3356 701 80 i
      *>                  12.127.0.249                        0 7018 701 80 i
      *                   213.200.87.254      929             0 3257 701 80 i
      *  9.184.112.0/20   205.215.45.50                       0 4006 6461 3786 i
      *                   195.66.225.254                      0 5459 6461 3786 i
      *>                  203.62.248.4                        0 1221 3786 i
      *                   167.142.3.6                         0 5056 6461 6461 3786 i
      *                   195.219.96.239                      0 8297 6461 3786 i
      *                   195.211.29.254                      0 5409 6461 3786 i

      AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T, AS 3786 is DACOM (Korea), AS 1221 is Telstra
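The origin AS is the last AS in each path, so a table like this one yields a prefix-to-AS map directly. Below is a minimal sketch using the two best routes from the slide and Python's standard `ipaddress` module; the `origin_as` and `lookup` helpers and the queried addresses are hypothetical, and the linear scan stands in for a real longest-prefix-match structure.

```python
import ipaddress
from typing import Optional

# (prefix, AS path) for the two best routes ("*>") shown on the slide.
best_routes = [
    ("3.0.0.0/8", "7018 701 80"),
    ("9.184.112.0/20", "1221 3786"),
]


def origin_as(as_path: str) -> int:
    """The origin AS is the rightmost AS number in the path."""
    return int(as_path.split()[-1])


ip_to_as = {ipaddress.ip_network(prefix): origin_as(path)
            for prefix, path in best_routes}


def lookup(addr: str) -> Optional[int]:
    """Longest-prefix match of an address against the map (linear scan)."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in ip_to_as if ip in net]
    if not matches:
        return None   # no mapping: one of the failure modes on slide 15
    return ip_to_as[max(matches, key=lambda net: net.prefixlen)]


print(lookup("3.1.2.3"))       # 80   (General Electric)
print(lookup("9.184.120.1"))   # 3786 (DACOM)
print(lookup("192.0.2.1"))     # None
```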

  17. Why Would IP-to-AS Mapping Be Wrong? • IP addresses of equipment • Interfaces on the routers, not end hosts • Identifies equipment in routing protocols • Doesn’t need to be globally visible or consistent • Three reasons the mappings may be “wrong” • Addresses of Internet Exchange Points • Sibling ASes that share address space • ASes that don’t announce their addresses • Look at traceroute path vs. BGP AS path • Traceroute path after IP-to-AS mapping • BGP AS path taken from the BGP table

  18. Extra AS due to Internet eXchange Points • IXP: shared place where providers meet • E.g., Mae-East, Mae-West, PAIX • Large number of fan-in and fan-out ASes • Ignore extra traceroute AS hop with high fan-in and fan-out • [Figure: traceroute AS path vs. BGP AS path among ASes A through G, with an extra AS hop in the traceroute path]
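The fan-in and fan-out heuristic can be sketched by counting distinct neighbors on each side of every AS across many traceroute-derived AS paths. This is an illustration only: the letter-named ASes, the threshold of 3, and the helper names are assumptions, not values from the lecture.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def fan_in_out(paths: List[List[str]]) -> Dict[str, Tuple[int, int]]:
    """Count distinct predecessor and successor ASes for every AS."""
    preds: Dict[str, Set[str]] = defaultdict(set)
    succs: Dict[str, Set[str]] = defaultdict(set)
    for path in paths:
        for a, b in zip(path, path[1:]):
            succs[a].add(b)
            preds[b].add(a)
    ases = set(preds) | set(succs)
    return {asn: (len(preds[asn]), len(succs[asn])) for asn in ases}


def likely_ixps(paths: List[List[str]], threshold: int = 3) -> List[str]:
    """ASes whose fan-in and fan-out both reach the (assumed) threshold."""
    return sorted(asn for asn, (fin, fout) in fan_in_out(paths).items()
                  if fin >= threshold and fout >= threshold)


def drop_ases(path: List[str], ignored: Set[str]) -> List[str]:
    """Remove the flagged ASes from a traceroute AS path."""
    return [asn for asn in path if asn not in ignored]


# Hypothetical traceroute AS paths that all cross the same exchange "E".
traceroute_paths = [["A", "E", "F"], ["B", "E", "G"], ["C", "E", "D"]]
ixps = likely_ixps(traceroute_paths)
cleaned = drop_ases(["A", "E", "F"], set(ixps))
print(ixps, cleaned)  # ['E'] ['A', 'F']
```

A large transit provider also has high fan-in and fan-out, so in practice the threshold and any extra checks would need tuning against known exchange points.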

  19. Extra AS due to Sibling ASes • Sibling: organizations with multiple ASes • E.g., Sprint AS 1239 and AS 1791 • One AS numbers its equipment with addresses of another • Merge sibling ASes that “belong together,” treating them as one AS • [Figure: traceroute AS path vs. BGP AS path among ASes A through H, with an extra sibling AS in the traceroute path]
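Merging siblings amounts to rewriting each AS to a canonical representative before comparing paths. A minimal sketch, assuming the Sprint pair from the slide; the choice of 1239 as the canonical number, the `merge_siblings` helper, and the documentation-range neighbors 64500 and 64501 are illustrative.

```python
from typing import Dict, List

# Sibling ASes mapped to one canonical AS (Sprint example from the slide).
sibling_of: Dict[int, int] = {1791: 1239}


def merge_siblings(path: List[int], siblings: Dict[int, int]) -> List[int]:
    """Rewrite siblings to their canonical AS and merge adjacent repeats."""
    merged: List[int] = []
    for asn in path:
        canonical = siblings.get(asn, asn)
        if not merged or merged[-1] != canonical:
            merged.append(canonical)
    return merged


traceroute_as_path = [64500, 1239, 1791, 64501]   # hypothetical paths
bgp_as_path = [64500, 1239, 64501]

merged_path = merge_siblings(traceroute_as_path, sibling_of)
print(merged_path == bgp_as_path)  # True
```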

  20. Unannounced Infrastructure Addresses • C does not announce part of its address space in BGP (e.g., 12.1.2.0/24) • Fix the IP-to-AS map to associate 12.1.2.0/24 with C • [Figure: ASes A, B, and C, the prefix 12.0.0.0/8, and mismatched traceroute and BGP AS paths]
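The fix is a single edit to the map: add the more specific prefix so that longest-prefix match picks C. The sketch below assumes, for illustration only, that the covering 12.0.0.0/8 initially maps to A; the slide residue does not make that attribution explicit, and the `lookup` helper is hypothetical.

```python
import ipaddress
from typing import Dict, Optional

Network = ipaddress.IPv4Network


def lookup(ip_to_as: Dict[Network, str], addr: str) -> Optional[str]:
    """Longest-prefix match of an address against the IP-to-AS map."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in ip_to_as if ip in net]
    if not matches:
        return None
    return ip_to_as[max(matches, key=lambda net: net.prefixlen)]


# Assumption: the covering /8 maps to AS "A" in the initial map.
ip_to_as: Dict[Network, str] = {ipaddress.ip_network("12.0.0.0/8"): "A"}
before = lookup(ip_to_as, "12.1.2.5")

# The edit from the slide: associate 12.1.2.0/24 with C.
ip_to_as[ipaddress.ip_network("12.1.2.0/24")] = "C"
after = lookup(ip_to_as, "12.1.2.5")

print(before, after)  # A C
```

Addresses elsewhere in the /8 keep their original mapping, so the edit is local to the unannounced block.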

  21. Refining Initial IP-to-AS Mapping • Start with initial IP-to-AS mapping • Mapping from BGP tables is usually correct • Good starting point for computing the mapping • Collect many BGP and traceroute paths • Signaling and forwarding AS path usually match • Good way to identify mistakes in IP-to-AS map • Successively refine the IP-to-AS mapping • Find add/change/delete that makes big difference • Base these “edits” on operational realities http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf

  22. Research Areas • Better version of traceroute • Router support for active measurement • IPPM (IP Performance Measurement) • http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html • Peer-to-peer troubleshooting • [Figure: peers answering “Yes” or “No” about whether they can reach www.cnn.com]

  23. Passive Monitoring

  24. Limitations of Active Measurements • Active measurements: traceroute-like tools • Can’t probe in the past • Shows the effect, not the cause • [Figure: topology with User (s), AS 1, AS 2, AS 3, AS 4, and Web Server (d)]

  25. Appealing to Peek Inside • Passive measurements: public BGP data • [Figure: BGP update feeds, data collection (RouteViews, RIPE), data correlation, root cause]

  26. Inspect BGP Routing Changes • Changes in paths to reach destination d • AS 1: “1 3 4” → “1 2 4” • AS 2: “2 4” (no change) • AS 3: “3 4” → “3 1 2 4” • AS 4: “4” (no change) • [Figure: topology with User (s), AS 1, AS 2, AS 3, AS 4, and Web Server (d)]

  27. Idea #1: ASes in Paths Undergoing Change • Key assumption • “The AS responsible for the change appears in the old and/or the new AS path to the destination.” • If an AS has a routing change • All ASes in old and new paths may be responsible • Call these ASes the “suspect set” • Combining across vantage points • Consider all ASes that had a routing change • Perform the intersection across the suspect sets
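Idea #1 reduces to a union per vantage point followed by an intersection across vantage points. A minimal sketch, applied to the path changes listed on the “Inspect BGP Routing Changes” slide; the `suspect_set` and `intersect_suspects` names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Change = Tuple[List[int], List[int]]   # (old AS path, new AS path)


def suspect_set(old_path: List[int], new_path: List[int]) -> Set[int]:
    """All ASes in the old or the new path may be responsible."""
    return set(old_path) | set(new_path)


def intersect_suspects(changes: Dict[int, Change]) -> Set[int]:
    """Intersect the suspect sets of every AS that saw a routing change."""
    sets = [suspect_set(old, new) for old, new in changes.values()]
    return set.intersection(*sets) if sets else set()


# Changes toward destination d, keyed by the observing AS.
changes: Dict[int, Change] = {
    1: ([1, 3, 4], [1, 2, 4]),
    3: ([3, 4], [3, 1, 2, 4]),
}

suspects = intersect_suspects(changes)
print(suspects)  # {1, 2, 3, 4}
```

In this example the intersection does not narrow anything down, which is part of the motivation for the further heuristics in the following slides.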

  28. Idea #2: Excluding ASes in Non-Changing Paths • Key assumption • “If an AS has no routing change, the ASes in the path are not responsible and can be excluded.” • Example • AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4} • AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4} • AS 3: “3 4” (no change): non-suspects {3, 4} • [Figure: topology with AS 1, AS 2, AS 3, and AS 4]
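Combining this exclusion rule with the intersection from Idea #1 on the slide's example leaves a single suspect. The following is a sketch of that combination, not a procedure taken from the lecture; the `remaining_suspects` helper is a hypothetical name.

```python
from typing import List, Set, Tuple

Change = Tuple[List[int], List[int]]   # (old AS path, new AS path)


def remaining_suspects(changed: List[Change],
                       unchanged: List[List[int]]) -> Set[int]:
    """Intersect suspect sets, then remove ASes on non-changing paths."""
    suspect_sets = [set(old) | set(new) for old, new in changed]
    suspects = set.intersection(*suspect_sets) if suspect_sets else set()
    for path in unchanged:
        suspects -= set(path)          # these ASes are assumed innocent
    return suspects


changed = [([1, 2, 4], [1, 2, 3, 4]),   # seen at AS 1
           ([2, 4], [2, 3, 4])]         # seen at AS 2
unchanged = [[3, 4]]                    # seen at AS 3

result = remaining_suspects(changed, unchanged)
print(result)  # {2}
```

The answer is only as good as the key assumption: if the change was in fact caused inside AS 4, excluding the non-changing path hides the real culprit.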

  29. Idea #3: Blaming the ASes in the Better Path • Key assumption • “The better path is the one that contains the AS responsible for the change.” • Example • “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1,2,4} as the suspects (not AS 3) • Heuristics for identifying the “better” path • E.g., the shorter AS path • [Figure: topology with AS 1, AS 2, AS 3, and AS 4]
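With the shorter-AS-path heuristic, the suspects are the ASes on whichever path is shorter. A minimal sketch using the slide's example; the helper names are hypothetical, and preferring the old path on a tie is an arbitrary assumption.

```python
from typing import List, Set


def better_path(old_path: List[int], new_path: List[int]) -> List[int]:
    """Heuristic from the slide: the shorter AS path is the better one.

    Ties go to the old path (an arbitrary choice for this sketch).
    """
    return old_path if len(old_path) <= len(new_path) else new_path


def suspects_in_better_path(old_path: List[int],
                            new_path: List[int]) -> Set[int]:
    """Blame only the ASes that appear on the better path."""
    return set(better_path(old_path, new_path))


suspects = suspects_in_better_path([1, 2, 4], [1, 2, 3, 4])
print(suspects)  # {1, 2, 4}
```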

  30. Idea #4: Combining Across Destinations • Key assumption • “All destinations experiencing routing changes in a short period of time have a common cause.” • Exploiting the observation • Form suspect sets for each destination • Perform intersections of the sets across the destinations
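One way to apply this idea is to bucket routing changes by time, build a suspect set per destination, and intersect across destinations. The sketch below is an assumption-laden illustration: the 60-second window, the event tuple layout, the destination labels, and the AS numbers are all made up.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (timestamp in seconds, destination, old AS path, new AS path)
Event = Tuple[float, str, List[int], List[int]]


def common_suspects(events: List[Event], window_start: float,
                    window_len: float = 60.0) -> Set[int]:
    """Intersect per-destination suspect sets for changes in one window."""
    per_dest: Dict[str, List[Set[int]]] = defaultdict(list)
    for t, dest, old, new in events:
        if window_start <= t < window_start + window_len:
            per_dest[dest].append(set(old) | set(new))
    if not per_dest:
        return set()
    # Suspect set per destination, then the intersection across them.
    dest_sets = [set.intersection(*sets) for sets in per_dest.values()]
    return set.intersection(*dest_sets)


events: List[Event] = [
    (0.0, "d1", [1, 2, 4], [1, 3, 4]),
    (5.0, "d2", [1, 2, 5], [1, 3, 5]),
    (500.0, "d3", [1, 6], [1, 7, 6]),   # outside the window, ignored
]

suspects = common_suspects(events, window_start=0.0)
print(suspects)  # {1, 2, 3}
```

Destinations d1 and d2 change together, so the origin ASes 4 and 5 drop out and the shared upstream ASes remain as candidates.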

  31. Difficulties With Root-Cause Analysis • Misleading BGP routing changes • Responsible AS not on old or new path • Looking across destinations doesn’t resolve • Missing routing changes • Some routers in an AS don’t have a change • Some subnets are not visible in BGP • Some internal changes are not visible in BGP

  32. Misleading BGP Changes • Myth: The AS responsible for the change appears in the old or the new AS path. • Seen at BGP data collection: old: 1,2,8,9,10; new: 1,4,5,6,7,10 • [Figure: topology with nodes 1 through 11]

  33. Misleading BGP Changes • Myth: Looking at routing changes across prefixes resolves causes • Changes for d2, but not for d1 and d3 • [Figure: routers A, B, and C, AS 1, AS 2, AS 3, destinations d1, d2, and d3, and the BGP data collection point]

  34. Missing Routing Changes • Myth: The BGP updates from a single router accurately represent the AS • [Figure: routers A, B, C, and D, AS 1, AS 2, and dst; the BGP data collection point sees “No change”]

  35. Missing Routing Changes • Myth: BGP data from a router accurately represents changes on that router. • [Figure: router A, prefixes 12.1.1.0/24 and 12.1.0.0/16, and the BGP data collection point]

  36. Missing Routing Changes • Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope. • [Figure: routers A, B, C, and D, AS 1, AS 2, dst, and the BGP data collection point]

  37. Hybrid of Active and Passive Monitoring • [Figure: monitors Omni 1 through Omni 4 alongside AS 1 through AS 4, with User (s) and Web Server (d); the reports (i,s,d,t) and (j,s,d,t’) both indicate a failure of link (3,4)]

  38. Research Questions • Understanding if root-cause analysis can work • How many vantage points are needed? • Do the assumptions usually hold? • Can algorithms tolerate occasional violations? • Can some additional information help? • Distributed algorithms for root-cause analysis • Can ASes cooperate in distributed fashion? • How to prevent or detect ASes that cheat? • Do all ASes have to participate? • Other hybrids of active and passive monitoring?

  39. Conclusions • Troubleshooting is important • Detect, diagnose, and fix problems • Accountability and service-level agreements • Troubleshooting is hard • Active measurement (e.g., traceroute) not enough • Root-cause analysis techniques are not enough • New innovation necessary • Hybrid active/passive approaches • Router support for active measurement • Routing protocol extensions for troubleshooting

  40. For Next Time: From Inside an AS • Two papers • “OSPF monitoring: Architecture, design, and deployment experience” • “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network” • Optional reading • Materials from Packet Design and Ipsum Networks • Review only of first paper • Summary • Why accept • Why reject • Future work
