
Internet Routing (COS 598A) Today: Root-Cause Analysis

Jennifer Rexford

http://www.cs.princeton.edu/~jrex/teaching/spring2005

Tuesdays/Thursdays 11:00am-12:20pm


Outline

  • Network troubleshooting

    • Motivation for network troubleshooting

    • Investigating from the edge vs. inside

  • Active probing

    • Traceroute

    • Mapping IP addresses to AS numbers

  • Passive monitoring

    • Analyzing BGP update streams

    • Identifying location and cause of routing change

    • Limitations of the approach


Network Troubleshooting

“Why can’t I reach www.cnn.com?”

“Why is the performance bad?”

Internet

www.cnn.com


Reachability Problems: What Could be Wrong?

  • End-host problem

    • Web server down

    • DNS server down, or misconfigured

  • Forwarding-path problem

    • Packet filter or firewall restricting access

    • Mismatch in Maximum Transmission Unit (MTU)

  • Routing problem

    • User or server disconnected from Internet

    • Blackhole dropping all packets

    • Persistent loop


Performance Problem: What Could be Wrong?

  • End-host problems

    • Overloaded Web server

    • Overloaded DNS server

    • Overloaded user machine

  • Forwarding-path problem

    • High round-trip time

    • Link congestion

  • Routing problem

    • Long-term routing instability

    • Transient disruption during convergence


Motivation for Troubleshooting

  • Improving performance

    • Detect, diagnose, and fix the problem

    • Pick a path through another provider

    • Pick a different path in an overlay network

  • Establishing accountability

    • Enforce Service Level Agreements

    • Rate service providers

  • Characterizing the Internet

    • Understand causes of performance problems

    • Understand challenges of troubleshooting


Troubleshooting Outside vs. Inside

  • Outside: from network edge

    • Who: users and researchers, and operators troubleshooting problems outside their network

    • Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms

    • Challenges: inference from very limited data

  • Inside: from inside the network

    • Who: operators running a network

    • Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files

    • Challenges: collecting and joining the data

Today’s focus: troubleshooting from the outside (network edge)


Active Probing


Pros and Cons of Active Probing

  • Advantages

    • Can run from any end system

    • Measure the actual forwarding path

      • See black-holes, loops, and delays directly

  • Disadvantages

    • Effects of routing changes, not the cause

    • Current path, not the path used in the past

      • Requires frequent probes to observe the changes

    • Shows only properties of round-trip path

      • Hard to tell if problem is on forward vs. reverse


Traceroute: Measuring the Forwarding Path

  • Time-To-Live field in IP packet header

    • Source sends a packet with a TTL of n

    • Each router along the path decrements the TTL

    • “Time exceeded” sent when TTL reaches 0

  • Traceroute tool exploits this TTL behavior

    • Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message

[Figure: source sends probes with TTL=1 and TTL=2 toward the destination; the router where each probe’s TTL expires returns a “time exceeded” message]
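
The TTL mechanism above can be sketched as a small simulation. This is a minimal model, not a real traceroute: it sends no packets, the addresses are hypothetical documentation addresses, and the names `send_probe` and `traceroute` are made up for illustration.

```python
from typing import List


def send_probe(path: List[str], ttl: int) -> str:
    """Return the address that answers a probe sent with the given TTL.

    `path` lists the routers in order, with the destination last.
    Assumes ttl >= 1 and a non-empty path.
    """
    for router in path[:-1]:
        ttl -= 1                # each router decrements the TTL
        if ttl == 0:
            return router       # TTL reached 0: this router sends "time exceeded"
    return path[-1]             # probe survived every router; the destination answers


def traceroute(path: List[str], max_ttl: int = 30) -> List[str]:
    """Probe with TTL = 1, 2, 3, ... and record who answers each probe."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder = send_probe(path, ttl)
        hops.append(responder)
        if responder == path[-1]:   # stop once the destination replies
            break
    return hops


# Hypothetical path: two routers, then the destination.
example_path = ["192.0.2.1", "198.51.100.1", "203.0.113.9"]
print(traceroute(example_path))
```

In this idealized model every router answers, so the recorded hops equal the path itself; a real tool must also cope with the missing and misleading responses discussed later.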


Example Traceroute Output (Berkeley to CNN)

Hop  IP address        DNS name
 1   169.229.62.1      inr-daedalus-0.CS.Berkeley.EDU
 2   169.229.59.225    soda-cr-1-1-soda-br-6-2
 3   128.32.255.169    vlan242.inr-202-doecev.Berkeley.EDU
 4   128.32.0.249      gigE6-0-0.inr-666-doecev.Berkeley.EDU
 5   128.32.0.66       qsv-juniper--ucb-gw.calren2.net
 6   209.247.159.109   POS1-0.hsipaccess1.SanJose1.Level3.net
 7   *                 ?  (no response from router)
 8   64.159.1.46       ?  (no name resolution)
 9   209.247.9.170     pos8-0.hsa2.Atlanta2.Level3.net
10   66.185.138.33     pop2-atm-P0-2.atdn.net
11   *                 ?
12   66.185.136.17     pop1-atl-P4-0.atdn.net
13   64.236.16.52      www4.cnn.com


Example Troubleshooting Results

  • No packets go beyond your gateway

    • Gateway’s connection to Internet is dead

  • Traceroute stops at intermediate point

    • Perhaps a blackhole

  • Traceroute path has a loop

    • Transient or persistent forwarding loop

  • Traceroute shows a very long path

    • Routing anomaly, route hijacking, etc.

  • Traceroute shows very long delays

    • Delay or congestion on forward or reverse path
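
The first three symptoms above can be turned into a rough classifier over a list of responding hops. This is only a sketch under simplifying assumptions: `diagnose` is a hypothetical helper, `None` stands for a hop that did not answer, and a repeated address is taken as evidence of a loop even though other explanations are possible.

```python
from typing import List, Optional


def diagnose(hops: List[Optional[str]], destination: str) -> str:
    """Classify a traceroute result using the symptoms listed on the slide."""
    answered = [h for h in hops if h is not None]
    if len(set(answered)) < len(answered):
        # The same address answered more than once.
        return "forwarding loop"
    if answered and answered[-1] == destination:
        return "destination reached"
    if len(answered) <= 1:
        # At most the gateway answered.
        return "nothing beyond the gateway"
    return "stops at an intermediate point (possible blackhole)"


# Hypothetical traces toward 203.0.113.9.
print(diagnose(["192.0.2.1", None, None], "203.0.113.9"))
print(diagnose(["192.0.2.1", "198.51.100.1", "198.51.100.2",
                "198.51.100.1", "198.51.100.2"], "203.0.113.9"))
```

A single trace cannot separate a transient loop from a persistent one; that distinction needs repeated probes over time.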


Problems with Traceroute

  • Missing responses

    • Routers might not send “Time-Exceeded”

    • Firewalls may drop the probe packets

    • “Time-Exceeded” reply may be dropped

  • Misleading responses

    • Probes taken while the path is changing

    • Name not in DNS, or DNS entry misconfigured

  • Mapping IP addresses

    • Mapping interfaces to a common router

    • Mapping interface/router to Autonomous System


Map Traceroute Hops to ASes

Traceroute output: (hop number, IP), with the AS for each hop

Hop  IP address        AS
 1   169.229.62.1      AS25
 2   169.229.59.225    AS25
 3   128.32.255.169    AS25
 4   128.32.0.249      AS25
 5   128.32.0.66       AS11423
 6   209.247.159.109   AS3356
 7   *                 AS3356
 8   64.159.1.46       AS3356
 9   209.247.9.170     AS3356
10   66.185.138.33     AS1668
11   *                 AS1668
12   66.185.136.17     AS1668
13   64.236.16.52      AS5662

AS-level path: Berkeley (AS25), Calren (AS11423), Level3 (AS3356), AOL (AS1668), CNN (AS5662)

Need accurate IP-to-AS mappings (for network equipment).
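
Once each hop carries an AS number, the AS-level path is obtained by collapsing consecutive hops in the same AS. The sketch below assumes that the thirteen AS labels on the slide line up with the thirteen hops in order; `as_level_path` is a hypothetical helper name.

```python
from itertools import groupby
from typing import List

# AS label per hop, in the order listed on the slide (assumed to match hops 1-13).
HOP_ASES = [25, 25, 25, 25, 11423, 3356, 3356, 3356, 3356, 1668, 1668, 1668, 5662]


def as_level_path(hop_ases: List[int]) -> List[int]:
    """Collapse runs of identical AS numbers into a single AS hop."""
    return [asn for asn, _ in groupby(hop_ases)]


print(as_level_path(HOP_ASES))  # [25, 11423, 3356, 1668, 5662]
```

Only consecutive duplicates are merged, so a path that leaves an AS and later returns to it still shows that AS twice.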


Candidate Ways to Get IP-to-AS Mapping

  • Routing address registry

    • Voluntary public registry such as whois.radb.net

    • Used by prtraceroute and “NANOG traceroute”

    • Incomplete and quite out-of-date

      • Mergers, acquisitions, delegation to customers

  • Origin AS in BGP paths

    • Public BGP routing tables such as RouteViews

    • Used to translate traceroute data to an AS graph

    • Incomplete and inaccurate… but usually right

      • Multiple Origin ASes, no mapping, wrong mapping


Example: BGP Table (“show ip bgp” at RouteViews)

   Network          Next Hop          Metric LocPrf Weight Path
*  3.0.0.0/8        205.215.45.50                        0 4006 701 80 i
*                   167.142.3.6                          0 5056 701 80 i
*                   157.22.9.7                           0 715 1 701 80 i
*                   195.219.96.239                       0 8297 6453 701 80 i
*                   195.211.29.254                       0 5409 6667 6427 3356 701 80 i
*>                  12.127.0.249                         0 7018 701 80 i
*                   213.200.87.254       929             0 3257 701 80 i
*  9.184.112.0/20   205.215.45.50                        0 4006 6461 3786 i
*                   195.66.225.254                       0 5459 6461 3786 i
*>                  203.62.248.4                         0 1221 3786 i
*                   167.142.3.6                          0 5056 6461 6461 3786 i
*                   195.219.96.239                       0 8297 6461 3786 i
*                   195.211.29.254                       0 5409 6461 3786 i

AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T

AS 3786 is DACOM (Korea), AS 1221 is Telstra
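
The origin AS of a prefix is the last AS in each AS path. The sketch below applies that rule to the AS paths in the table above; the function names are hypothetical, and a real tool would parse the full “show ip bgp” output rather than use a hand-built dictionary.

```python
from typing import Dict, List, Set

# AS paths copied from the table above, keyed by prefix.
ROUTES: Dict[str, List[str]] = {
    "3.0.0.0/8": [
        "4006 701 80", "5056 701 80", "715 1 701 80", "8297 6453 701 80",
        "5409 6667 6427 3356 701 80", "7018 701 80", "3257 701 80",
    ],
    "9.184.112.0/20": [
        "4006 6461 3786", "5459 6461 3786", "1221 3786",
        "5056 6461 6461 3786", "8297 6461 3786", "5409 6461 3786",
    ],
}


def origin_as(as_path: str) -> int:
    """The origin AS is the rightmost AS in the path."""
    return int(as_path.split()[-1])


def origins_by_prefix(routes: Dict[str, List[str]]) -> Dict[str, Set[int]]:
    """Collect every origin AS seen for each prefix."""
    return {prefix: {origin_as(p) for p in paths} for prefix, paths in routes.items()}


def multiple_origin_prefixes(routes: Dict[str, List[str]]) -> List[str]:
    """Prefixes announced by more than one origin AS."""
    return [p for p, origins in origins_by_prefix(routes).items() if len(origins) > 1]


print(origins_by_prefix(ROUTES))
print(multiple_origin_prefixes(ROUTES))
```

For these two prefixes every path agrees on the origin (AS 80 and AS 3786), so neither is flagged as having multiple origin ASes.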


Why Would IP-to-AS Mapping Be Wrong?

  • IP addresses of equipment

    • Interfaces on the routers, not end hosts

    • Identifies equipment in routing protocols

    • Doesn’t need to be globally visible or consistent

  • Three reasons the mappings may be “wrong”

    • Addresses of Internet Exchange Points

    • Sibling ASes that share address space

    • ASes that don’t announce their addresses

  • Look at traceroute path vs. BGP AS path

    • Traceroute path after IP-to-AS mapping

    • BGP AS path taken from the BGP table


Extra AS due to Internet eXchange Points

  • IXP: shared place where providers meet

    • E.g., Mae-East, Mae-West, PAIX

    • Large number of fan-in and fan-out ASes

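[Figure: traceroute AS path vs. BGP AS path among ASes A, B, C, E, F, and G; the traceroute AS path shows an extra AS D that does not appear in the BGP AS path]

Ignore extra traceroute AS hop with high fan-in and fan-out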
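
The fan-in/fan-out rule can be sketched as follows. This is one possible reading of the heuristic, not the method from the cited papers: an AS counts as “extra” when it sits in the traceroute AS path between two ASes that are adjacent in the BGP AS path, and the threshold `min_fan`, the function names, and the example paths are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Path = List[str]


def adjacent(path: Path, a: str, b: str) -> bool:
    """True if a is immediately followed by b somewhere in path."""
    return any(x == a and y == b for x, y in zip(path, path[1:]))


def likely_exchange_points(pairs: List[Tuple[Path, Path]], min_fan: int = 2) -> Set[str]:
    """Find extra traceroute ASes with many distinct neighbors on both sides."""
    fan_in: Dict[str, Set[str]] = defaultdict(set)
    fan_out: Dict[str, Set[str]] = defaultdict(set)
    for tr_path, bgp_path in pairs:
        for prev, extra, nxt in zip(tr_path, tr_path[1:], tr_path[2:]):
            # Extra hop: absent from BGP, between two BGP-adjacent ASes.
            if extra not in bgp_path and adjacent(bgp_path, prev, nxt):
                fan_in[extra].add(prev)
                fan_out[extra].add(nxt)
    return {a for a in fan_in
            if len(fan_in[a]) >= min_fan and len(fan_out[a]) >= min_fan}


# Illustrative (traceroute AS path, BGP AS path) pairs using the slide's letters.
pairs = [
    (["A", "D", "E"], ["A", "E"]),
    (["B", "D", "F"], ["B", "F"]),
    (["C", "D", "G"], ["C", "G"]),
]
print(likely_exchange_points(pairs))  # {'D'}
```

An AS flagged this way would be dropped from the traceroute AS paths before comparing them with BGP AS paths.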


Extra AS due to Sibling ASes

  • Sibling: organizations with multiple ASes

    • E.g., Sprint AS 1239 and AS 1791

    • One AS numbers its equipment with addresses of another

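[Figure: traceroute AS path vs. BGP AS path among ASes A, B, C, D, E, F, and G; the traceroute AS path shows an extra AS H that does not appear in the BGP AS path]

Merge sibling ASes that “belong together,” treating them as if they were one AS.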
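
Merging siblings amounts to rewriting each AS number to a canonical representative before comparing paths. In this sketch the choice of AS 1239 as the representative is arbitrary, the neighboring AS numbers 64496 and 64497 are placeholders from the documentation range, and `merge_siblings` is a hypothetical helper.

```python
from itertools import groupby
from typing import Dict, List

# Sibling AS -> canonical AS. The Sprint pair is the slide's example.
SIBLINGS: Dict[int, int] = {1791: 1239}


def merge_siblings(as_path: List[int], siblings: Dict[int, int]) -> List[int]:
    """Rewrite siblings to one AS number, then collapse adjacent repeats."""
    canonical = [siblings.get(asn, asn) for asn in as_path]
    return [asn for asn, _ in groupby(canonical)]


# A traceroute AS path that crosses both Sprint ASes.
print(merge_siblings([64496, 1791, 1239, 64497], SIBLINGS))  # [64496, 1239, 64497]
```

Applying the same rewrite to the BGP AS path keeps the two paths comparable.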


Unannounced Infrastructure Addresses

C does not announce part of its address space in BGP (e.g., 12.1.2.0/24)

Fix the IP-to-AS map to associate 12.1.2.0/24 with C

[Figure: ASes A, B, and C with the prefix 12.0.0.0/8; AS path labels “A C A C” and “A C”, “B A C” and “B C”]
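
The fix works because IP-to-AS lookup uses the longest matching prefix. The sketch below uses Python’s standard `ipaddress` module; the mapping of 12.0.0.0/8 to A is assumed from the figure labels, and the probe address 12.1.2.5 and the helper `lookup` are made up for illustration.

```python
import ipaddress
from typing import Dict, Optional


def lookup(ip: str, table: Dict[str, str]) -> Optional[str]:
    """Return the AS for the longest prefix in `table` that contains `ip`."""
    addr = ipaddress.ip_address(ip)
    best_len, best_as = -1, None
    for prefix, owner in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_as = net.prefixlen, owner
    return best_as


# Initial map from BGP: only the covering /8 is visible (assumed to map to A).
initial = {"12.0.0.0/8": "A"}
# Refined map: add the unannounced /24 and associate it with C.
refined = {**initial, "12.1.2.0/24": "C"}

print(lookup("12.1.2.5", initial))  # A
print(lookup("12.1.2.5", refined))  # C
```

Addresses elsewhere in 12.0.0.0/8 still map to A after the refinement, since only the more specific /24 is overridden.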


Refining Initial IP-to-AS Mapping

  • Start with initial IP-to-AS mapping

    • Mapping from BGP tables is usually correct

    • Good starting point for computing the mapping

  • Collect many BGP and traceroute paths

    • Signaling and forwarding AS path usually match

    • Good way to identify mistakes in IP-to-AS map

  • Successively refine the IP-to-AS mapping

    • Find add/change/delete that makes big difference

    • Base these “edits” on operational realities

http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf

http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf


Research Areas

  • Better version of traceroute

    • Router support for active measurement

    • IPPM (IP Performance Measurement)

    • http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html

  • Peer-to-peer troubleshooting

[Figure: peers answering “Yes” or “No” about reaching www.cnn.com]


Passive Monitoring


Limitations of Active Measurements

  • Active measurements: traceroute-like tools

    • Can’t probe in the past

    • Shows the effect, not the cause

[Figure: user (s) and Web server (d), with AS 1, AS 2, AS 3, and AS 4 between them]


Appealing to Peek Inside

  • Passive measurements: public BGP data

[Figure: BGP update feeds → data collection (RouteViews, RIPE) → data correlation → root cause]


Inspect BGP Routing Changes

  • Changes in paths to reach destination d

    • AS 1: “1 3 4” → “1 2 4”

    • AS 2: “2 4” (no change)

    • AS 3: “3 4” → “3 1 2 4”

    • AS 4: “4” (no change)

[Figure: user (s) and Web server (d), with AS 1, AS 2, AS 3, and AS 4 between them]


Idea #1: ASes in Paths Undergoing Change

  • Key assumption

    • “The AS responsible for the change appears in the old and/or the new AS path to the destination.”

  • If an AS has a routing change

    • All ASes in old and new paths may be responsible

    • Call these ASes the “suspect set”

  • Combining across vantage points

    • Consider all ASes that had a routing change

    • Perform the intersection across the suspect sets
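
Idea #1 can be sketched with plain set operations. The example reuses the path changes for destination d from the “Inspect BGP Routing Changes” slide; `suspect_set` and `common_suspects` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

Change = Tuple[List[int], List[int]]  # (old AS path, new AS path)


def suspect_set(old: List[int], new: List[int]) -> Set[int]:
    """Every AS on the old or the new path may be responsible."""
    return set(old) | set(new)


def common_suspects(changes: Dict[int, Change]) -> Set[int]:
    """Intersect the suspect sets of all vantage points that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes.values()]
    return set.intersection(*sets) if sets else set()


# Vantage points AS 1 and AS 3 changed paths; AS 2 and AS 4 did not.
changes = {
    1: ([1, 3, 4], [1, 2, 4]),
    3: ([3, 4], [3, 1, 2, 4]),
}
print(common_suspects(changes))  # {1, 2, 3, 4}
```

Here both suspect sets are {1, 2, 3, 4}, so the intersection alone does not narrow the cause, which is what motivates the additional ideas.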


Idea #2: Excluding ASes in Non-Changing Paths

  • Key assumption

    • “If an AS has no routing change, the ASes in the path are not responsible and can be excluded.”

  • Example

    • AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4}

    • AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4}

    • AS 3: “3 4” (no change): non-suspects {3, 4}

[Figure: topology with AS 1, AS 2, AS 3, and AS 4]
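
The example above can be worked through in a few lines: intersect the suspect sets of the vantage points that changed, then remove every AS on a path that did not change. The function name `narrow_suspects` is hypothetical.

```python
from typing import List, Set, Tuple

Change = Tuple[List[int], List[int]]  # (old AS path, new AS path)


def narrow_suspects(changed: List[Change], unchanged: List[List[int]]) -> Set[int]:
    """Intersect suspect sets, then exclude ASes on non-changing paths."""
    if not changed:
        return set()
    suspects = set.intersection(*(set(old) | set(new) for old, new in changed))
    cleared = set().union(*(set(path) for path in unchanged))
    return suspects - cleared


changed = [
    ([1, 2, 4], [1, 2, 3, 4]),   # AS 1: suspects {1, 2, 3, 4}
    ([2, 4], [2, 3, 4]),         # AS 2: suspects {2, 3, 4}
]
unchanged = [[3, 4]]             # AS 3: non-suspects {3, 4}

print(narrow_suspects(changed, unchanged))  # {2}
```

The intersection leaves {2, 3, 4}, and excluding the non-suspects {3, 4} leaves AS 2 as the only candidate.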


Idea #3: Blaming the ASes in the Better Path

  • Key assumption

    • “The better path is the one that contains the AS responsible for the change.”

  • Example

    • “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1,2,4} as the suspects (not AS 3)

  • Heuristics for identifying the “better” path

    • E.g., the shorter AS path

[Figure: topology with AS 1, AS 2, AS 3, and AS 4]
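
With the shorter-AS-path heuristic, Idea #3 reduces to a length comparison. In this sketch the tie-breaking rule (blame both paths when lengths are equal) is an assumption, and `blame_better_path` is a hypothetical name.

```python
from typing import List, Set


def blame_better_path(old: List[int], new: List[int]) -> Set[int]:
    """Suspect the ASes on the better path, judged here by AS-path length."""
    if len(old) < len(new):
        return set(old)
    if len(new) < len(old):
        return set(new)
    # Equal length: the heuristic cannot decide, so keep both paths' ASes.
    return set(old) | set(new)


print(blame_better_path([1, 2, 4], [1, 2, 3, 4]))  # {1, 2, 4}
```

AS-path length is only one possible notion of “better”; the actual route choice also depends on each AS’s local policy, which this comparison ignores.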


Idea #4: Combining Across Destinations

  • Key assumption

    • “All destinations experiencing routing changes in a short period of time have a common cause.”

  • Exploiting the observation

    • Form suspect sets for each destination

    • Perform intersections of the sets across the destinations
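
Idea #4 can be sketched by grouping the changes that fall in a short time window and intersecting the per-destination suspect sets. The window length, the `Change` record, and the example events are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Change(NamedTuple):
    time: float            # seconds, on any common clock
    destination: str
    old_path: List[int]
    new_path: List[int]


def common_cause(changes: List[Change], window: float) -> Set[int]:
    """Intersect suspect sets of destinations that changed within `window`
    seconds of the first observed change."""
    if not changes:
        return set()
    start = min(c.time for c in changes)
    per_destination: Dict[str, Set[int]] = {}
    for c in changes:
        if c.time - start <= window:
            per_destination.setdefault(c.destination, set()).update(c.old_path, c.new_path)
    return set.intersection(*per_destination.values())


events = [
    Change(0.0, "d1", [1, 2, 4], [1, 3, 4]),    # suspects {1, 2, 3, 4}
    Change(5.0, "d2", [1, 2, 5], [1, 6, 5]),    # suspects {1, 2, 5, 6}
    Change(900.0, "d3", [7, 8], [7, 9, 8]),     # too late for a 60-second window
]
print(common_cause(events, window=60.0))  # {1, 2}
```

Choosing the window is a trade-off: too short splits one event, while too long merges unrelated events and can empty the intersection.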


Difficulties With Root-Cause Analysis

  • Misleading BGP routing changes

    • Responsible AS not on old or new path

    • Looking across destinations doesn’t resolve

  • Missing routing changes

    • Some routers in an AS don’t have a change

    • Some subnets are not visible in BGP

    • Some internal changes are not visible in BGP


Misleading BGP Changes

Myth: The AS responsible for the change appears in the old or the new AS path.

[Figure: topology with nodes numbered 1 through 11 and a BGP data collection point]

old: 1, 2, 8, 9, 10

new: 1, 4, 5, 6, 7, 10


Misleading BGP Changes

Myth: Looking at routing changes across prefixes resolves causes.

[Figure: AS 1, AS 2, and AS 3 with destinations d1, d2, and d3, nodes A, B, and C, link labels 7, 10, and 12, and a BGP data collection point; changes for d2, but not for d1 and d3]


Missing Routing Changes

Myth: The BGP updates from a single router accurately represent the AS.

[Figure: AS 1 and AS 2 with nodes A, B, C, and D, destination dst, link labels 6, 7, 10, and 12, and a BGP data collection point marked “No change”]


Missing Routing Changes

Myth: BGP data from a router accurately represents changes on that router.

[Figure: node A, prefixes 12.1.1.0/24 and 12.1.0.0/16, and a BGP data collection point]


Missing Routing Changes

Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope.

[Figure: AS 1 and AS 2 with nodes A, B, C, and D, destination dst, link labels 5, 6, 7, 10, and 12, and a BGP data collection point]


Hybrid of Active and Passive Monitoring

[Figure: user (s) and Web server (d), with AS 1, AS 2, AS 3, and AS 4 between them and nodes labeled Omni 1 through Omni 4; points i and j are annotated “(i,s,d,t) failure link (3,4)” and “(j,s,d,t’) failure link (3,4)”]


Research Questions

  • Understanding if root-cause analysis can work

    • How many vantage points are needed?

    • Do the assumptions usually hold?

    • Can algorithms tolerate occasional violations?

    • Can some additional information help?

  • Distributed algorithms for root-cause analysis

    • Can ASes cooperate in distributed fashion?

    • How to prevent or detect ASes that cheat?

    • Do all ASes have to participate?

    • Other hybrids of active and passive monitoring?


Conclusions

  • Troubleshooting is important

    • Detect, diagnose, and fix problems

    • Accountability and service-level agreements

  • Troubleshooting is hard

    • Active measurement (e.g., traceroute) not enough

    • Root-cause analysis techniques are not enough

  • Innovation is necessary

    • Hybrid active/passive approaches

    • Router support for active measurement

    • Routing protocol extensions for troubleshooting


For Next Time: From Inside an AS

  • Two papers

    • “OSPF monitoring: Architecture, design, and deployment experience”

    • “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network”

  • Optional reading

    • Materials from Packet Design and Ipsum Networks

  • Review only the first paper

    • Summary

    • Why accept

    • Why reject

    • Future work

