Enhancing User-Level Path Diagnosis in Networks: TULIP's Approach to Fault Detection and Analysis

User-level Internet Path Diagnosis. R. Mahajan, N. Spring, D. Wetherall and T. Anderson

The network is a black box… so what can I do? • We want users to be able to diagnose their paths • Communicate information to the ISP or NOC to improve the network

TULIP: User-level path diagnosis Objectives: Detect performance faults that affect a user’s flows. This involves a measure of the magnitude of the fault (queuing delay, loss) and the localization of the faulty link.

How TULIP does it • Ideal architecture: a packet-based solution. Each router the packet traverses adds information to the packet: a timestamp and the global address of the router’s input interface. Issues: the packet size increases at each hop; a packet loss means the loss of all the information the packet carried; corruption of a packet might lead to incorrect diagnosis data (although most corruption is treated as loss).

Because things are never ideal • A basic architecture is sufficient for data collection. Assets: fixed packet size and sufficient information. Assumption: stationarity of paths (paths between source and destination don’t change too often).

Diagnosis tools in use in TULIP • Out-of-band measurement probes (or TTL-based search): obtain the sample TTL and interface ID • ICMP: router timestamp • IP identifiers: approximation of the per-flow counter

How to detect path loss/reordering Sending two probes to determine the behavior of the remote router
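
The two-probe idea can be sketched as follows. This is a minimal illustration, not TULIP's actual code: it assumes the remote router stamps its replies with an IP identifier that increases with each packet it generates (the "approximation of the per-flow counter" mentioned earlier), and the names `ipid_before` and `classify_pair` are hypothetical.

```python
def ipid_before(a: int, b: int) -> bool:
    """True if 16-bit IP identifier a was generated before b (allows wraparound)."""
    return 0 < ((b - a) & 0xFFFF) < 0x8000


def classify_pair(arrivals):
    """Classify one two-probe measurement.

    arrivals: list of (probe_index, ipid) tuples in the order the replies
    arrived at the sender; probe_index is 1 or 2 (the order probes were sent).
    Assumption: the router's IP-ID counter reflects the order in which it
    received the probes.
    """
    if len(arrivals) < 2:
        return "loss"  # a probe or its reply went missing
    ipid = dict(arrivals)
    if not ipid_before(ipid[1], ipid[2]):
        # The router answered probe 2 first, so the probes were swapped
        # on the way to the router.
        return "forward reordering"
    if arrivals[0][0] == 2:
        # The router saw the probes in order, but the replies were swapped
        # on the way back.
        return "reverse reordering"
    return "in order"
```

For example, `classify_pair([(1, 101), (2, 100)])` reports forward reordering because the reply to the second probe carries the older identifier.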

Packet queuing • ICMP timestamps are used to determine the queuing delay at a router (reported as a median)
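
A minimal sketch of how a median queuing delay could be derived from ICMP timestamps. The baseline subtraction is an assumption made here to cancel the unknown clock offset and propagation delay between sender and router; the function name and the sample format are hypothetical.

```python
from statistics import median


def queuing_delay_ms(samples):
    """Estimate the median queuing delay towards a router, in milliseconds.

    samples: list of (send_ms, router_ms) pairs, where send_ms is the local
    send time of a probe and router_ms is the timestamp the router put in
    its ICMP timestamp reply.
    Assumption: the smallest observed difference corresponds to an empty
    queue, so subtracting it removes clock offset and propagation delay.
    """
    diffs = [router - send for send, router in samples]
    baseline = min(diffs)
    return median(d - baseline for d in diffs)
```

With samples `[(0, 105), (10, 115), (20, 130), (30, 137)]` the differences are 105, 105, 110 and 107 ms, so the estimate is the median of 0, 0, 5 and 2 ms, that is 1.0 ms.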

The TULIP methods • To perform the measurement, TULIP uses two “scanning” methods: • Binary search (reduces diagnostic traffic, but at the cost of diagnosis time) • Parallel search (interleaves measurements to different routers by cycling through them in nodes)
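
The binary search can be sketched as below. This is an illustration under stated assumptions, not the tool's implementation: `fault_seen_up_to` is a hypothetical callback that measures the path from the source to router `h` and reports whether the fault is visible, the fault is assumed to be visible end to end, and it is assumed to stay visible on every longer prefix of the path.

```python
def locate_fault(num_hops, fault_seen_up_to):
    """Return the first hop at which the fault becomes visible.

    num_hops: number of measurable routers on the path (hops 1..num_hops).
    fault_seen_up_to(h): hypothetical measurement of the path prefix ending
    at hop h; True if the fault (loss, reordering, queuing) shows up there.
    """
    lo, hi = 1, num_hops
    while lo < hi:
        mid = (lo + hi) // 2
        if fault_seen_up_to(mid):
            hi = mid        # fault lies at or before mid
        else:
            lo = mid + 1    # prefix up to mid is clean
    return lo
```

Each iteration measures only one router, which is why the traffic stays low while the total diagnosis time grows with the number of sequential measurements.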

Network Load and Diagnosis Time • Because the behavior of a router is relatively stationary, TULIP can provide accurate results with an approximate diagnosis time of 10/30 min. • The load is B/W for binary search and LB/W (lower bound) for parallel search, where L: number of measurable routers, B: bandwidth cost of the probes, W: wait time (usually 1 s)
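
The two load expressions can be compared with a small helper. The numbers in the usage note are purely illustrative and do not come from the slides; the unit of B is assumed to be bytes per measurement.

```python
def probe_load(bandwidth_cost, wait_s, routers=1):
    """Load placed on the network, in units of bandwidth_cost per second.

    bandwidth_cost: B, cost of the probes for one measurement (assumed bytes).
    wait_s: W, wait time between measurements in seconds.
    routers: 1 for binary search (B/W), L for parallel search (L*B/W).
    """
    return routers * bandwidth_cost / wait_s
```

With an illustrative B of 1000 bytes and W of 1 s, `probe_load(1000, 1.0)` gives 1000.0 bytes/s for binary search, while `probe_load(1000, 1.0, routers=12)` gives 12000.0 bytes/s for a parallel search over 12 measurable routers.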

Diagnosing granularity [Figure: example graphs with nodes labeled 1, 1’, 2’, 2 and 1, 2, 3; Rank(G)=2] • The granularity of a path is the weighted average of the lengths of its diagnosable segments.
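
A small sketch of the granularity computation. The slide does not state the weights; this example assumes each diagnosable segment is weighted by its own length, that is, by the share of the path it covers.

```python
def granularity(segment_lengths):
    """Weighted average length of the diagnosable segments of a path.

    segment_lengths: hop counts of the consecutive diagnosable segments.
    Assumption: each segment is weighted by its length, so the result is
    sum(l * l) / sum(l).
    """
    total = sum(segment_lengths)
    return sum(l * l for l in segment_lengths) / total
```

Under this assumption, a 4-hop path split into segments of 3 hops and 1 hop has a granularity of (3·3 + 1·1) / 4 = 2.5 hops, while a path in which every hop is diagnosable has a granularity of 1.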

Granularity varies across measurements • 50% of the paths have a granularity of less than 3 hops (75% less than 4) • TULIP matches an ideal tomography implementation

Validation • Compared results with PlanetLab coupled with a tomography system • Uses a measure, the “rate delta”, defined as the rate at the far end of a segment minus the rate at the near end. Negative values imply a lack of consistency (the values span too large a range)
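
The rate delta and the share of consistent segments can be sketched as follows. The function names and the input format are hypothetical; only the definition (far-end rate minus near-end rate, negative meaning inconsistent) comes from the slide.

```python
def rate_delta(near_rate, far_rate):
    """Rate at the far end of a segment minus the rate at its near end."""
    return far_rate - near_rate


def consistent_fraction(segments):
    """Fraction of segments whose rate delta is non-negative.

    segments: list of (near_rate, far_rate) pairs, e.g. loss or reordering
    rates measured to the two ends of each segment.
    """
    deltas = [rate_delta(near, far) for near, far in segments]
    return sum(d >= 0 for d in deltas) / len(deltas)
```

A fault measured to the far end of a segment should be at least as large as the fault measured to its near end, which is why a negative delta signals an inconsistent measurement.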

Reordering Results • 85% of the results are consistent for the forward path • 75% for the round trip (due to the asymmetric nature of some paths)

Loss results • Again, 85% of the deltas are non-negative • The round-trip counterpart is less affected by asymmetry than in the reordering diagnosis (because loss usually occurs close to the destination)

Queuing Results • ICMP message generation has a poor timestamp resolution (the two medians are within 2 ms of each other: one from tcpdump on PlanetLab and one from TULIP). • The forward path shows that queuing delay is consistent (very few negative values) • The round trip reflects the variability in the return path

The last mile… • The first hops from the user are the bottleneck

Persistence of a fault • We check for how many iterations TULIP yields similar results • 80% of the paths show faults persisting long enough for TULIP to diagnose them (typical time a binary search takes to locate a fault: 6 runs)

Conclusions • Network Operators would be able to diagnose links efficiently • And a user too … if the world was populated entirely by Computer nerds.

Issues… • Multiple TULIP users could reduce the accuracy of the probing method that relies on the per-flow counter • An application doesn’t experience the network the same way an active measurement does (TCP- and application-dependent, as well as flags)

…and possible improvements • Per-flow counter at the router level (unrealistic) • Hash the source address and IPID (per flow) • ICMP timestamps carry a reception time as well as a transmission time (which allows calculating how long the packet is processed at the router)
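
The last point can be illustrated with the standard ICMP Timestamp Reply layout (type 14: originate, receive and transmit timestamps, each in milliseconds since midnight UT). This is a minimal sketch; the function name is hypothetical, and it assumes the router fills in standard-format timestamps.

```python
import struct

MS_PER_DAY = 86_400_000


def router_processing_ms(icmp_reply: bytes) -> int:
    """Time between reception and transmission at the router, in ms.

    icmp_reply: the ICMP message (without the IP header) of a
    Timestamp Reply: type, code, checksum, identifier, sequence number,
    then originate, receive and transmit timestamps (32 bits each).
    """
    (msg_type, _code, _checksum, _ident, _seq,
     _originate, receive, transmit) = struct.unpack("!BBHHHIII", icmp_reply[:20])
    if msg_type != 14:
        raise ValueError("not an ICMP Timestamp Reply")
    # The modulo covers a reply that straddles midnight UT.
    return (transmit - receive) % MS_PER_DAY
```

For a reply with a receive timestamp of 1500 ms and a transmit timestamp of 1503 ms, the function returns 3.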
