Internet Routing (COS 598A) Today: Root-Cause Analysis

Jennifer Rexford

http://www.cs.princeton.edu/~jrex/teaching/spring2005

Tuesdays/Thursdays 11:00am-12:20pm


Outline

  • Network troubleshooting

    • Motivation for network troubleshooting

    • Investigating from the edge vs. inside

  • Active probing

    • Traceroute

    • Mapping IP addresses to AS numbers

  • Passive monitoring

    • Analyzing BGP update streams

    • Identifying location and cause of routing change

    • Limitations of the approach


Network Troubleshooting

“Why can’t I reach www.cnn.com?”

“Why is the performance bad?”

[Figure: a user trying to reach www.cnn.com across the Internet.]


Reachability Problems: What Could be Wrong?

  • End-host problem

    • Web server down

    • DNS server down, or misconfigured

  • Forwarding-path problem

    • Packet filter or firewall restricting access

    • Mismatch in Maximum Transmission Unit (MTU)

  • Routing problem

    • User or server disconnected from Internet

    • Blackhole dropping all packets

    • Persistent loop


Performance Problem: What Could be Wrong?

  • End-host problems

    • Overloaded Web server

    • Overloaded DNS server

    • Overloaded user machine

  • Forwarding-path problem

    • High round-trip time

    • Link congestion

  • Routing problem

    • Long-term routing instability

    • Transient disruption during convergence


Motivation for Troubleshooting

  • Improving performance

    • Detect, diagnose, and fix the problem

    • Pick a path through another provider

    • Pick a different path in an overlay network

  • Establishing accountability

    • Enforce Service Level Agreements

    • Rate service providers

  • Characterizing the Internet

    • Understand causes of performance problems

    • Understand challenges of troubleshooting


Troubleshooting Outside vs. Inside

  • Outside: from network edge

    • Who: users and researchers, and operators troubleshooting problems outside their network

    • Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms

    • Challenges: inference from very limited data

  • Inside: from inside the network

    • Who: operators running a network

    • Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files

    • Challenges: collecting and joining the data

Today: troubleshooting from the outside


Active Probing


Pros and Cons of Active Probing

  • Advantages

    • Can run from any end system

    • Measure the actual forwarding path

      • See black-holes, loops, and delays directly

  • Disadvantages

    • Effects of routing changes, not the cause

    • Current path, not the path used in the past

      • Requires frequent probes to observe the changes

    • Shows only properties of round-trip path

      • Hard to tell if problem is on forward vs. reverse


Traceroute: Measuring the Forwarding Path

  • Time-To-Live field in IP packet header

    • Source sends a packet with a TTL of n

    • Each router along the path decrements the TTL

    • “Time exceeded” message sent back to the source when the TTL reaches 0

  • Traceroute tool exploits this TTL behavior

Send packets from source to destination with TTL=1, 2, 3, … and record the source of each “time exceeded” message
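The TTL behavior above can be sketched as a small simulation. This is not a real traceroute (which needs raw sockets and ICMP); it is a toy model in which `path` is an assumed list of router names ending at the destination, and `simulate_traceroute` is a hypothetical helper name.

```python
from typing import List


def simulate_traceroute(path: List[str], max_ttl: int = 30) -> List[str]:
    """Return who answers each probe for TTL = 1, 2, 3, ...

    `path` lists the routers in forwarding order, ending with the
    destination. Toy model: every hop always answers.
    """
    answers = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        responder = None
        for hop in path:
            remaining -= 1          # each hop decrements the TTL
            if remaining == 0:      # TTL expired here, so this hop reports back
                responder = hop
                break
        if responder is None:       # probe outlived the path; nothing to record
            break
        answers.append(responder)
        if responder == path[-1]:   # the destination itself answered; stop probing
            break
    return answers


hops = simulate_traceroute(["gateway", "r1", "r2", "www.example.com"])
print(hops)
```

In this model the probe with TTL=k is answered by the k-th hop, so the list of responders reconstructs the forwarding path one hop at a time.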


Example Traceroute Output (Berkeley to CNN)

Hop  IP address        DNS name
1    169.229.62.1      inr-daedalus-0.CS.Berkeley.EDU
2    169.229.59.225    soda-cr-1-1-soda-br-6-2
3    128.32.255.169    vlan242.inr-202-doecev.Berkeley.EDU
4    128.32.0.249      gigE6-0-0.inr-666-doecev.Berkeley.EDU
5    128.32.0.66       qsv-juniper--ucb-gw.calren2.net
6    209.247.159.109   POS1-0.hsipaccess1.SanJose1.Level3.net
7    *                 ?  (no response from router)
8    64.159.1.46       ?  (no name resolution)
9    209.247.9.170     pos8-0.hsa2.Atlanta2.Level3.net
10   66.185.138.33     pop2-atm-P0-2.atdn.net
11   *                 ?  (no response from router)
12   66.185.136.17     pop1-atl-P4-0.atdn.net
13   64.236.16.52      www4.cnn.com


Example Troubleshooting Results

  • No packets go beyond your gateway

    • Gateway’s connection to Internet is dead

  • Traceroute stops at intermediate point

    • Perhaps a blackhole

  • Traceroute path has a loop

    • Transient or persistent forwarding loop

  • Traceroute shows a very long path

    • Routing anomaly, route hijacking, etc.

  • Traceroute shows very long delays

    • Delay or congestion on forward or reverse path
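A rough sketch of how the first three symptoms could be recognized from a list of hops. The function name `diagnose`, its labels, and the convention that `None` marks an unanswered probe are assumptions for illustration; path length and delay are not modeled, and a repeated address is only a hint of a loop, not proof.

```python
from typing import List, Optional


def diagnose(hops: List[Optional[str]], destination: str, gateway: str) -> str:
    """Classify a traceroute result. `None` marks a probe with no reply."""
    answered = [h for h in hops if h is not None]
    if not answered:
        return "no replies at all"
    if len(answered) != len(set(answered)):
        # The same address answered more than once.
        return "forwarding loop (transient or persistent)"
    if answered[-1] == destination:
        return "destination reached"
    if answered == [gateway]:
        return "no packets beyond the gateway"
    return "stops at an intermediate point (perhaps a blackhole)"


print(diagnose(["gw", "r1", "r2", "r1", "r2"], "dst", "gw"))
print(diagnose(["gw", "r1", None, None], "dst", "gw"))
```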


Problems with Traceroute

  • Missing responses

    • Routers might not send “Time-Exceeded”

    • Firewalls may drop the probe packets

    • “Time-Exceeded” reply may be dropped

  • Misleading responses

    • Probes taken while the path is changing

    • Name not in DNS, or DNS entry misconfigured

  • Mapping IP addresses

    • Mapping interfaces to a common router

    • Mapping interface/router to Autonomous System


Map Traceroute Hops to ASes

Traceroute output: (hop number, IP), with the AS for each hop

Hop  IP address        AS        Network
1    169.229.62.1      AS25      Berkeley
2    169.229.59.225    AS25      Berkeley
3    128.32.255.169    AS25      Berkeley
4    128.32.0.249      AS25      Berkeley
5    128.32.0.66       AS11423   Calren
6    209.247.159.109   AS3356    Level3
7    *                 AS3356    Level3
8    64.159.1.46       AS3356    Level3
9    209.247.9.170     AS3356    Level3
10   66.185.138.33     AS1668    AOL
11   *                 AS1668    AOL
12   66.185.136.17     AS1668    AOL
13   64.236.16.52      AS5662    CNN

Need accurate IP-to-AS mappings (for network equipment).
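Given a per-hop mapping, the AS-level path is obtained by collapsing consecutive hops in the same AS. A minimal sketch using the addresses and AS numbers from this slide; the helper name `as_path` is hypothetical, and unanswered hops (`None`) are simply skipped here.

```python
from itertools import groupby
from typing import Dict, List, Optional

# IP-to-AS entries taken from the slide (Berkeley to CNN).
IP_TO_AS: Dict[str, int] = {
    "169.229.62.1": 25, "169.229.59.225": 25, "128.32.255.169": 25,
    "128.32.0.249": 25, "128.32.0.66": 11423, "209.247.159.109": 3356,
    "64.159.1.46": 3356, "209.247.9.170": 3356, "66.185.138.33": 1668,
    "66.185.136.17": 1668, "64.236.16.52": 5662,
}

# Hops 7 and 11 did not answer, so they are recorded as None.
HOPS: List[Optional[str]] = [
    "169.229.62.1", "169.229.59.225", "128.32.255.169", "128.32.0.249",
    "128.32.0.66", "209.247.159.109", None, "64.159.1.46", "209.247.9.170",
    "66.185.138.33", None, "66.185.136.17", "64.236.16.52",
]


def as_path(hops: List[Optional[str]], ip_to_as: Dict[str, int]) -> List[int]:
    """Map each answering hop to an AS and collapse consecutive repeats."""
    mapped = [ip_to_as.get(ip) for ip in hops if ip is not None]
    known = [asn for asn in mapped if asn is not None]  # drop unmapped addresses
    return [asn for asn, _ in groupby(known)]


print(as_path(HOPS, IP_TO_AS))  # [25, 11423, 3356, 1668, 5662]
```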


Candidate Ways to Get IP-to-AS Mapping

  • Routing address registry

    • Voluntary public registry such as whois.radb.net

    • Used by prtraceroute and “NANOG traceroute”

    • Incomplete and quite out-of-date

      • Mergers, acquisitions, delegation to customers

  • Origin AS in BGP paths

    • Public BGP routing tables such as RouteViews

    • Used to translate traceroute data to an AS graph

    • Incomplete and inaccurate… but usually right

      • Multiple Origin ASes, no mapping, wrong mapping


Example: BGP Table (“show ip bgp” at RouteViews)

   Network          Next Hop         Metric LocPrf Weight Path
*  3.0.0.0/8        205.215.45.50                       0 4006 701 80 i
*                   167.142.3.6                         0 5056 701 80 i
*                   157.22.9.7                          0 715 1 701 80 i
*                   195.219.96.239                      0 8297 6453 701 80 i
*                   195.211.29.254                      0 5409 6667 6427 3356 701 80 i
*>                  12.127.0.249                        0 7018 701 80 i
*                   213.200.87.254      929             0 3257 701 80 i
*  9.184.112.0/20   205.215.45.50                       0 4006 6461 3786 i
*                   195.66.225.254                      0 5459 6461 3786 i
*>                  203.62.248.4                        0 1221 3786 i
*                   167.142.3.6                         0 5056 6461 6461 3786 i
*                   195.219.96.239                      0 8297 6461 3786 i
*                   195.211.29.254                      0 5409 6461 3786 i

AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T

AS 3786 is DACOM (Korea), AS 1221 is Telstra
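The origin AS is the last AS on each path, so a prefix-to-origin map can be read off a table like this one. A small sketch using a few of the rows above; `origin_ases` is a hypothetical helper, and the input format (prefix, AS-path string) is an assumption rather than a real RouteViews parser.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (prefix, AS path) pairs copied from a few rows of the table above.
ROUTES: List[Tuple[str, str]] = [
    ("3.0.0.0/8", "4006 701 80"),
    ("3.0.0.0/8", "7018 701 80"),
    ("9.184.112.0/20", "1221 3786"),
    ("9.184.112.0/20", "5056 6461 6461 3786"),
]


def origin_ases(routes: List[Tuple[str, str]]) -> Dict[str, Set[int]]:
    """Collect the set of origin ASes (last AS in the path) per prefix."""
    origins: Dict[str, Set[int]] = defaultdict(set)
    for prefix, path in routes:
        origins[prefix].add(int(path.split()[-1]))
    return dict(origins)


print(origin_ases(ROUTES))
```

A prefix whose set holds more than one AS would be a case of multiple origin ASes, one of the reasons this mapping can be inaccurate.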


Why Would IP-to-AS Mapping Be Wrong?

  • IP addresses of equipment

    • Interfaces on the routers, not end hosts

    • Identifies equipment in routing protocols

    • Doesn’t need to be globally visible or consistent

  • Three reasons the mappings may be “wrong”

    • Addresses of Internet Exchange Points

    • Sibling ASes that share address space

    • ASes that don’t announce their addresses

  • Look at traceroute path vs. BGP AS path

    • Traceroute path after IP-to-AS mapping

    • BGP AS path taken from the BGP table


Extra AS due to Internet eXchange Points

  • IXP: shared place where providers meet

    • E.g., Mae-East, Mae-West, PAIX

    • Large number of fan-in and fan-out ASes

[Figure: traceroute AS path vs. BGP AS path among ASes A, B, C, E, F, and G; the traceroute path shows an extra AS D.]

Ignore extra traceroute AS hop with high fan-in and fan-out
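One way to picture the fan-in/fan-out test: count, for each AS, how many distinct ASes precede and follow it across many traceroute AS paths. The sketch below is an illustration only; the three sample paths, the threshold of 3, and the helper names are all assumptions, with D standing in for the exchange point.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fan_in_out(paths: List[List[str]]) -> Dict[str, Tuple[int, int]]:
    """Return (fan-in, fan-out) per AS: distinct predecessors and successors."""
    preds, succs = defaultdict(set), defaultdict(set)
    for path in paths:
        for prev, cur in zip(path, path[1:]):
            succs[prev].add(cur)
            preds[cur].add(prev)
    ases = set(preds) | set(succs)
    return {a: (len(preds[a]), len(succs[a])) for a in ases}


def likely_ixps(paths: List[List[str]], threshold: int) -> List[str]:
    """ASes whose fan-in and fan-out both reach the (assumed) threshold."""
    return sorted(a for a, (fin, fout) in fan_in_out(paths).items()
                  if fin >= threshold and fout >= threshold)


# Hypothetical traceroute AS paths that all cross the same extra hop D.
paths = [["A", "D", "E"], ["B", "D", "F"], ["C", "D", "G"]]
ixps = likely_ixps(paths, threshold=3)
cleaned = [[a for a in p if a not in ixps] for p in paths]
print(ixps, cleaned)
```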


Extra AS due to Sibling ASes

  • Siblings: organizations with multiple ASes

    • E.g., Sprint AS 1239 and AS 1791

    • One AS numbers its equipment with addresses of another

[Figure: traceroute AS path vs. BGP AS path among ASes A, B, C, D, E, F, and G; the traceroute path shows an extra AS H.]

Merge sibling ASes that “belong together” and treat them as one AS.


Unannounced Infrastructure Addresses

[Figure: ASes A, B, and C and the prefix 12.0.0.0/8, with the AS paths “A C A C” vs. “A C” and “B A C” vs. “B C”.]

C does not announce part of its address space in BGP (e.g., 12.1.2.0/24)

Fix the IP-to-AS map to associate 12.1.2.0/24 with C
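The effect of the fix can be seen with a longest-prefix-match lookup. In this sketch the covering prefix 12.0.0.0/8 is assumed to map to A, the probe address 12.1.2.5 is made up, and `lookup` is a hypothetical helper built on Python's standard `ipaddress` module.

```python
import ipaddress
from typing import Dict, Optional


def lookup(ip: str, ip_to_as: Dict[ipaddress.IPv4Network, str]) -> Optional[str]:
    """Longest-prefix match: the most specific covering prefix wins."""
    addr = ipaddress.ip_address(ip)
    matches = [(net, asn) for net, asn in ip_to_as.items() if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Assumption: the only announced prefix covering the address maps to A.
table = {ipaddress.ip_network("12.0.0.0/8"): "A"}
before = lookup("12.1.2.5", table)   # falls back to the covering /8

# The fix from the slide: associate 12.1.2.0/24 with C.
table[ipaddress.ip_network("12.1.2.0/24")] = "C"
after = lookup("12.1.2.5", table)    # the more specific /24 now wins

print(before, after)
```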


Refining Initial IP-to-AS Mapping

  • Start with initial IP-to-AS mapping

    • Mapping from BGP tables is usually correct

    • Good starting point for computing the mapping

  • Collect many BGP and traceroute paths

    • Signaling and forwarding AS path usually match

    • Good way to identify mistakes in IP-to-AS map

  • Successively refine the IP-to-AS mapping

    • Find the add/change/delete that makes a big difference

    • Base these “edits” on operational realities

http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf

http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf


Research Areas

  • Better version of traceroute

    • Router support for active measurement

    • IPPM (IP Performance Metrics)

    • http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html

  • Peer-to-peer troubleshooting

[Figure: peers answering “Yes” or “No” about reaching www.cnn.com.]


Passive Monitoring


Limitations of Active Measurements

  • Active measurements: traceroute-like tools

    • Can’t probe in the past

    • Shows the effect, not the cause

[Figure: user (s) and Web server (d) connected through AS 1, AS 2, AS 3, and AS 4.]


Appealing to Peek Inside

  • Passive measurements: public BGP data

[Figure: BGP update feeds → Data Collection (RouteViews, RIPE) → Data Correlation → root cause.]


Inspect BGP Routing Changes

  • Changes in paths to reach destination d

    • AS 1: “1 3 4” → “1 2 4”

    • AS 2: “2 4” (no change)

    • AS 3: “3 4” → “3 1 2 4”

    • AS 4: “4” (no change)

[Figure: user (s) and Web server (d) connected through AS 1, AS 2, AS 3, and AS 4.]


Idea #1: ASes in Paths Undergoing Change

  • Key assumption

    • “The AS responsible for the change appears in the old and/or the new AS path to the destination.”

  • If an AS has a routing change

    • All ASes in old and new paths may be responsible

    • Call these ASes the “suspect set”

  • Combining across vantage points

    • Consider all ASes that had a routing change

    • Perform the intersection across the suspect sets
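A minimal sketch of Idea #1, applied to the two changes from the “Inspect BGP Routing Changes” slide (AS 1: “1 3 4” to “1 2 4”, AS 3: “3 4” to “3 1 2 4”). The function names are hypothetical.

```python
from typing import List, Set, Tuple

Path = List[int]


def suspect_set(old_path: Path, new_path: Path) -> Set[int]:
    """All ASes on the old or the new path may be responsible."""
    return set(old_path) | set(new_path)


def combine(changes: List[Tuple[Path, Path]]) -> Set[int]:
    """Intersect the suspect sets from all vantage points that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes]
    return set.intersection(*sets) if sets else set()


changes = [
    ([1, 3, 4], [1, 2, 4]),     # seen at AS 1
    ([3, 4], [3, 1, 2, 4]),     # seen at AS 3
]
print(combine(changes))  # {1, 2, 3, 4}
```

For this example every AS appears in both suspect sets, so intersection alone does not narrow anything down; that is the gap the next ideas try to close.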


Idea #2: Excluding ASes in Non-Changing Paths

  • Key assumption

    • “If an AS has no routing change, the ASes in the path are not responsible and can be excluded.”

  • Example

    • AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4}

    • AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4}

    • AS 3: “3 4” (no change): non-suspects {3, 4}

[Figure: AS 1, AS 2, AS 3, and AS 4.]
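The example on this slide can be worked through as set arithmetic: intersect the suspect sets, then subtract every AS seen on an unchanged path. A sketch with a hypothetical function name; it only mechanizes the stated assumption and does not check whether that assumption holds.

```python
from typing import List, Set, Tuple

Path = List[int]


def suspects_after_exclusion(changes: List[Tuple[Path, Path]],
                             unchanged: List[Path]) -> Set[int]:
    """Intersect suspect sets, then remove ASes on paths that did not change."""
    sets = [set(old) | set(new) for old, new in changes]
    suspects = set.intersection(*sets) if sets else set()
    for path in unchanged:
        suspects -= set(path)
    return suspects


changes = [
    ([1, 2, 4], [1, 2, 3, 4]),   # AS 1: suspects {1, 2, 3, 4}
    ([2, 4], [2, 3, 4]),         # AS 2: suspects {2, 3, 4}
]
unchanged = [[3, 4]]             # AS 3: non-suspects {3, 4}
print(suspects_after_exclusion(changes, unchanged))  # {2}
```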


Idea #3: Blaming the ASes in the Better Path

  • Key assumption

    • “The better path is the one that contains the AS responsible for the change.”

  • Example

    • “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1,2,4} as the suspects (not AS 3)

  • Heuristics for identifying the “better” path

    • E.g., the shorter AS path

[Figure: AS 1, AS 2, AS 3, and AS 4.]
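A sketch of the shorter-path heuristic. The function name is hypothetical, and the tie rule (keep both paths as suspects when the lengths are equal) is an assumption added here, since the slide does not say what to do in that case.

```python
from typing import List, Set


def better_path_suspects(old_path: List[int], new_path: List[int]) -> Set[int]:
    """Blame the ASes on the better path, taking 'better' to mean shorter."""
    if len(old_path) < len(new_path):
        return set(old_path)
    if len(new_path) < len(old_path):
        return set(new_path)
    # Assumed tie rule: no better path can be named, so keep both.
    return set(old_path) | set(new_path)


print(better_path_suspects([1, 2, 4], [1, 2, 3, 4]))  # {1, 2, 4}
```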


Idea #4: Combining Across Destinations

  • Key assumption

    • “All destinations experiencing routing changes in a short period of time have a common cause.”

  • Exploiting the observation

    • Form suspect sets for each destination

    • Perform intersections of the sets across the destinations
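A sketch of how “a short period of time” might be turned into code: group changes whose timestamps fall within a window, then intersect the suspect sets inside each group. The event format, the 60-second window, the sample data, and the function names are all assumptions for illustration.

```python
from typing import List, Set, Tuple

Event = Tuple[float, str, Set[int]]   # (timestamp, destination, suspect set)


def group_by_time(events: List[Event], window: float) -> List[List[Event]]:
    """Group events that start within `window` seconds of the group's first event."""
    groups: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if groups and event[0] - groups[-1][0][0] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def common_suspects(group: List[Event]) -> Set[int]:
    """Intersect the suspect sets of all destinations in one group."""
    return set.intersection(*(suspects for _, _, suspects in group))


events = [
    (0.0, "d1", {1, 2, 4}),
    (5.0, "d2", {2, 3, 4}),
    (300.0, "d3", {7, 8}),
]
for group in group_by_time(events, window=60.0):
    print([dest for _, dest, _ in group], common_suspects(group))
```

The window size matters: too small and related changes are split apart, too large and unrelated changes are intersected, which can empty the suspect set.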


Difficulties With Root-Cause Analysis

  • Misleading BGP routing changes

    • Responsible AS not on old or new path

    • Looking across destinations doesn’t resolve the cause

  • Missing routing changes

    • Some routers in an AS don’t have a change

    • Some subnets are not visible in BGP

    • Some internal changes are not visible in BGP


Misleading BGP Changes

Myth: The AS responsible for the change appears in the old or the new AS path.

[Figure: topology of nodes 1 through 11 with a BGP data collection point; old path: 1,2,8,9,10; new path: 1,4,5,6,7,10.]


Misleading BGP Changes

Myth: Looking at routing changes across prefixes resolves causes

[Figure: AS 1, AS 2, and AS 3 with destinations d1, d2, and d3 and a BGP data collection point; changes for d2, but not for d1 and d3.]


Missing Routing Changes

Myth: The BGP updates from a single router accurately represent the AS

[Figure: routers A, B, C, and D in AS 1 and AS 2 on the way to dst, with a BGP data collection point and the label “No change”.]


Missing Routing Changes

Myth: BGP data from a router accurately represents changes on that router.

[Figure: router A feeding BGP data collection, with prefixes 12.1.1.0/24 and 12.1.0.0/16.]


Missing Routing Changes

Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope.

[Figure: routers A, B, C, and D in AS 1 and AS 2 on the way to dst, with a BGP data collection point.]


Hybrid of Active and Passive Monitoring

[Figure: user (s) and Web server (d) connected through AS 1, AS 2, AS 3, and AS 4, with Omni 1 through Omni 4 and points i and j; the labels (i,s,d,t) and (j,s,d,t’) are each paired with “failure link (3,4)”.]


Research Questions

  • Understanding if root-cause analysis can work

    • How many vantage points are needed?

    • Do the assumptions usually hold?

    • Can algorithms tolerate occasional violations?

    • Can some additional information help?

  • Distributed algorithms for root-cause analysis

    • Can ASes cooperate in distributed fashion?

    • How to prevent or detect ASes that cheat?

    • Do all ASes have to participate?

    • Other hybrids of active and passive monitoring?


Conclusions

  • Troubleshooting is important

    • Detect, diagnose, and fix problems

    • Accountability and service-level agreements

  • Troubleshooting is hard

    • Active measurement (e.g., traceroute) not enough

    • Root-cause analysis techniques are not enough

  • Innovation is necessary

    • Hybrid active/passive approaches

    • Router support for active measurement

    • Routing protocol extensions for troubleshooting


For Next Time: From Inside an AS

  • Two papers

    • “OSPF monitoring: Architecture, design, and deployment experience”

    • “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network”

  • Optional reading

    • Materials from Packet Design and Ipsum Networks

  • Review of the first paper only

    • Summary

    • Why accept

    • Why reject

    • Future work

