Studying black holes on the internet with hubble
This presentation is the property of its rightful owner.
Sponsored Links
1 / 44

Studying Black Holes on the Internet with Hubble PowerPoint PPT Presentation


  • 65 Views
  • Uploaded on
  • Presentation posted in: General

Studying Black Holes on the Internet with Hubble. Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas Anderson University of Washington NSDI, April 2008 This work partially supported by. Global Reachability.

Download Presentation

Studying Black Holes on the Internet with Hubble

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Studying black holes on the internet with hubble

Studying Black Holes on the Internet with Hubble

Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas Anderson

University of Washington

NSDI, April 2008

This work partially supported by


Global reachability

Global Reachability

  • When an address is reachable from every other address

  • Most basic goal of Internet, especially BGP

    • “There is only one failure, and it is complete partition” Clarke, Design Philosophy of the DARPA Internet Protocols

  • Physical path  BGP path  traffic reaches

  • Black hole: BGP path, but traffic persistently does not reach


Does internet give global reachability

Does Internet give global reachability?

  • From use, seems to usually work

  • Can we assume the protocols just make it work?

  • “Please try to reach my network 194.9.82.0/24 from your networks…. Kindly anyone assist.” Operator on NANOG mailing list, March 2008.


Does internet give global reachability1

Does Internet give global reachability?


Hubble system goal

Hubble System Goal

In real-time on a global scale, automatically monitor long-lasting reachability problems and classify causes

5


Problem seen by hubble on oct 8 2007

Problem Seen by Hubble on Oct. 8, 2007

Fr:XTo:DPing?

Fr:DTo:XPing!

5:09 a.m.

Fr:ZTo:DPing?

5:11 a.m.

  • Target Identification – distributed ping monitors detect when the destination becomes unreachable

6


Problem seen by hubble on oct 8 20071

Problem Seen by Hubble on Oct. 8, 2007

5:13 a.m.

  • Target Identification – distributed ping monitors

  • Reachability analysis – distributed traceroutes determine the extent of unreachability

7


Problem seen by hubble on oct 8 20072

Problem Seen by Hubble on Oct. 8, 2007

  • Target Identification – distributed ping monitors

  • Reachability analysis – distributed traceroutes

  • Problem Classification

    • group failed traceroutes

8


Problem seen by hubble on oct 8 20073

Problem Seen by Hubble on Oct. 8, 2007

Fr:YTo:DPing?

Fr:YTo:DPing?

Fr:DTo:YPing!

Fr:DTo:YPing!

Fr:XTo:DPing?

D to Y works!

D to Z works!

Y to D fails!

Z to D fails!

  • Target Identification – distributed ping monitors

  • Reachability analysis – distributed traceroutes

  • Problem Classification

    • group failed traceroutes

    • spoofed probes to isolate direction of failure

9


Architecture detect problem

Architecture: Detect Problem

Ping prefix to check if still reachable

Every 2 minutes from PlanetLab

Report target after series of failed pings

Maintain BGP tables from RouteViews feeds

Allows IP  AS mapping

Identify prefixes undergoing BGP changes as targets

10


Architecture assess extent of problem

Architecture: Assess Extent of Problem

Traceroutes to gather topological data

Keep probing while problem persists

Every 15 minutes from 35 PlanetLab sites

Analyze which traceroutes reach

BGP table to map addresses to ASes

Alias information to map interfaces to routers

11


Architecture classify problem

Architecture: Classify Problem

To aid operators in diagnosis and repair:

Which ISP contains problem?

Which routers?

Which destinations?

12


Architecture classify problem1

Architecture: Classify Problem

Real-time, automated classification

Find common entity that explains substantial number of failed traceroutes to a prefix

Does not have to explain all failed traceroutes

Not necessarily pinpointing exact problem

13


Classifying with current topology

Classifying with Current Topology

  • Group failed/successful traceroutes by last AS, router

    Example: Router problem

  • No probes reach P through router R

  • Some reach through R’s AS

  • 28% of classified problems

14


Classifying with historical topology

Classifying with Historical Topology

  • Daily probes from PlanetLab to all prefixes

  • Gives baseline view of paths before problems

    Example: “Next hop” problem

  • Paths previously converged on router R

  • Now terminate just before R

  • 14% of classifiedproblems

15


Classifying with direction isolation

Classifying with Direction Isolation

  • Internet paths can be asymmetric

  • Traceroutes only return routers on forward path

    • Might assume last hop is problem

    • Even so, require working reverse path

    • Hard to determine reverse path

  • Isolate forward from reverse to test individually

  • Without node behind problem, use spoofed probes

    • Spoof from S to check forward path from S

    • Spoof as S to check reverse path back to S


Classifying with direction isolation1

Classifying with Direction Isolation

  • Hubble deployment on RON employs spoofed probes

    • 6 of 13 RON permit source spoofing

    • PlanetLab does not support source spoofing

      Example: Multi-homed provider problem

  • Probes through Provider B fail

  • Some reach through Provider A

  • Like Cox/USC

  • 6% of classified problems


Architecture summary of approach

Architecture: Summary of Approach

Synthesis of multiple information sources

Passive monitoring of route advertisements

Active monitoring from distributed vantage points

Historical monitoring data to enable troubleshooting

Topological classification and spoofing point at problem

18


Evaluation

Evaluation

Target Identification

  • How much of the Internet does Hubble monitor?

    Reachability Analysis

  • What percentage of the various paths to a prefix does Hubble analyze?

    Problem Classification

  • How often can Hubble identify a common entity that explains the failed paths to a prefix?

  • How often does spoofing isolate the failure direction?

    For further evaluation, please see the paper.


How much does hubble monitor

How much does Hubble monitor?

Every 2 minutes:

110,000 prefixes

89% of Internet’s edge address space

92% of edge ASes

Origin ASes for 99% of 14M BitTorrent users

20


What of paths does hubble monitor

What % of paths does Hubble monitor?

Compare withBGP paths of447 RIPE peers(downhill ASes)

AT&T

AT&T

Sprint

Sprint

Tier 1

Abilene

Cenic

Cenic

Gigapop

Gigapop

Transit

Intel

UW

WSU

Intel

UT

UM

MIT

Stub

  • PlanetLab’s restricted size and homogeneity limit uphill

  • 90% of our failed traceroutes terminate within 2 AS hops of prefix’s origin


What of paths does hubble monitor1

What % of paths does Hubble monitor?

Compare withBGP paths of447 RIPE peers(downhill ASes)

AT&T

AT&T

Sprint

Sprint

Tier 1

Abilene

Cenic

Cenic

Gigapop

Gigapop

Transit

Intel

UW

WSU

Intel

UT

UM

MIT

Stub

BGP ASes:{ AT&T, Sprint, Gigapop, Cenic, Intel }

Also on Traceroutes: { Sprint, Gigapop, Cenic, Intel }

Coverage for Intel prefix: 4 of 5 downhill ASes = 80%


What of paths does hubble monitor2

What % of paths does Hubble monitor?

Compare withBGP paths of447 RIPE peers(downhill ASes)

AT&T

AT&T

Sprint

Sprint

Tier 1

Abilene

Cenic

Cenic

Gigapop

Gigapop

Transit

Intel

UW

WSU

Intel

UT

UM

MIT

Stub

Overall for prefixes monitored by Hubble

  • For >60% of prefixes, traverse ALL downhill RIPE ASes

  • For 90% of prefixes, traverse more than half the ASes


How often can hubble classify

How often can Hubble classify?

9 classes currently

Based on topology

Point to an AS and/or router

Results from first week of February 2008

Automatically classified 375,775/457,960 (82%) of problems as they occurred

24


How often does spoofing work

How often does spoofing work?

When a RON path works and another does not:

Isolate 68% of failures from spoofing sources

47% forward, 21% reverse

25


How long do black holes last

How long do black holes last?

3 week study starting September 17, 2007

31,000 black holes involving 10,000 prefixes

20% lasted at least 10 hours!

68% were cases of partial reachability

26


How long do black holes last1

How long do black holes last?

3 week study starting September 17, 2007

31,000 black holes involving 10,000 prefixes

20% lasted at least 10 hours!

68% were cases of partial reachability

  • Partial reachability:

  • Can’t be just hardware failure

  • Configuration/ policy

27


Other measurement results

Other Measurement Results

  • Can’t find problems using only BGP updates

    • Only 38% of problems correlate with RouteViews updates

  • Multi-homing may not give resilience against failure

    • 100s of multi-homed prefixes had provider problems like COX/USC, and ALL occurred on path TO prefix

  • Inconsistencies across an AS

    • For an AS responsible for partial reachability, usually some paths work and some do not

  • Path changes accompany failures

    • 3/4 router problems are with routers NOT on baseline path


Conclusions and future work

Conclusions and Future Work

Hubble: working real-time system

Lots of reachability problems, some long lasting

Baseline/ fine-grained data enable problem classification

Spoofing to isolate direction of path failures

http://hubble.cs.washington.edu

Uses iPlane, MaxMind, Google Maps

29


Thanks

Thanks!

http://hubble.cs.washington.edu

30


Long term prospects for spoofing

Long term prospects for spoofing?

Support for spoofing:

  • No complaints about our spoofed probes

  • Can receive spoofed probes at PlanetLab

  • PlanetLab support in future kernels?

  • Router vendor talking to us about router support for measurements

    Alternatives to spoofing:

  • Traceroute servers behind problems

  • End-hosts behind problems

31


Comparison to planetseer osdi 04

Comparison to PlanetSeer [OSDI ‘04]

  • Most similar system

  • Passively monitors CoDeeN clients, probes on anomalies

  • Different and complimentary analysis

PlanetSeer

  • Clients that connected within 15 minutes

  • 43% edge ASes (sum over 3 months)

  • Not problems that prevent access to CDN

Hubble

  • All prefixes every 2 minutes

  • 92% edge Ases (every 2 minutes)

  • All partial or complete reachability problems

32


Characteristics of problems of interest

Characteristics of Problems of Interest

  • Routable prefix present in BGP tables

  • Persistent through 2 rounds of probes

  • Routing infrastructure failures

    • Not simply end-system/end-network failure

    • Judgments based on connectivity to origin AS

  • Not simply source problem

    • Monitor if less than 90% of vantages reach

    • Based on 4 months of probes to 110K prefixes

33


How well does hubble work

How well does Hubble work?

Scale:

  • 89% of the Internet’s edge address space

  • 92% of edge ASes

  • Origin ASes for 99% of 14M BitTorrent users

    Effectiveness:

  • Finds 85% of black holes, 95% of those that last at least 1 hr [compared to pervasive approach]

    Cost:

  • 5.5% of the probes required by pervasive approach

34


Does spoofing work

Does spoofing work?

When 3+ spoofing RON nodes fail to reach:

Isolate all failed paths in 61% of cases

42% forward, 16% reverse, 3% mixed

For 95% of cases, all paths isolate to same direction

35


Provider s unreachable

Provider(s) Unreachable

  • No probes reach even the provider(s) of Origin AS

  • Probes fail in AS upstream

  • 3% of classified problems (1-13% at any point in time)

36


Single homed origin as down

Single-homed Origin AS Down

  • No probes reach single-homed Origin AS

  • Some reach its provider

  • 17% of classified problems (4-37% at any point in time)

37


Multi homed origin as down

Multi-homed Origin AS Down

  • No probes reach multi-homed Origin AS

  • Some reach its provider(s)

  • 9% of classified problems (2-30% at any point in time)

38


Provider as problem for multi homed

Provider AS Problem for Multi-Homed

  • Probes through Provider B fail to reach P

  • Some reach through Provider A

  • 6% of classified problems (1-17% at any point in time)

39


Non provider as problem

Non-Provider AS Problem

  • Probes through Non-Provider C fail

  • Some reach through other ASes

  • 17% of classified problems (1-37% at any point in time)

40


Router problem on known path

Router Problem on Known Path

  • Last hop router Rwas seen on recent paths reaching P

  • No probes reach Pthrough R

  • Some reach through R’s AS

  • 7% of classified problems (1-40% at any point in time)

Historical

Traceroutes

41


Router problem on new path

Router Problem on New Path

  • Last hop router Rnot seen on recent paths reaching P

  • No probes reach Pthrough R

  • Some reach through R’s AS

  • 21% of classified problems (1-40% at any point in time)

42


Next hop problem on known paths

Next Hop Problem on Known Paths

  • No last hop router or AS explains problem

  • Paths previously converged on router R

  • Now terminate just before R

14% of classified problems (1-39% at any point in time)

43


Topological classification results

Topological classification results

Of ones we classify: Overall (range over time)

  • Provider(s) unreachable: 3% (1-13%)

  • Single-homed origin AS down:17% (4-37%)

  • Multi-homed origin AS down: 9% (2-30%)

  • Provider AS problem for multi-homed origin AS: 6% (1-17%)

  • Non-provider AS problem:17% (1-37%)

  • Router problem on old path: 7% (1-40%)

  • Router problem on new path:21% (1-40%)

  • Next hop problem on known paths: 14% (1-39%)

  • Prefix unreachable:22% (7-79%)

44


  • Login