Correlations in E2E Network Metrics: Impact on Large Scale Network Monitoring

This paper investigates the correlations between different E2E network metrics and explores the possibility of leveraging these dependencies to lower monitoring costs while maintaining high accuracies. The study analyzes changes in hop, latency, route, and capacity to quantify the correlation and performs a cost vs. accuracy tradeoff analysis. The findings provide insights into optimizing network monitoring strategies.


Presentation Transcript


  1. Correlations in E2E Network Metrics: Impact on Large Scale Network Monitoring • Praveen Yalagandula, Sung-Ju Lee, Puneet Sharma, Sujata Banerjee • HP Labs, Palo Alto • http://networking.hpl.hp.com

  2. Motivation • Large scale E2E network monitoring • Application management, Flow control, Fault Diagnosis, etc. • A key question: What granularity should we measure? • Coarse-grained: lower cost but higher inaccuracy • Fine-grained: lower inaccuracy but higher cost • Observation: Heterogeneity in measurement costs • PING < TRACEROUTE < PATHRATE • Our investigation • Are different E2E network metrics correlated? • Can we leverage such dependencies (if any) to • Lower monitoring cost while maintaining high accuracies?

  3. Our Approach • We consider two correlations in the current work • Changes in Hop and Latency ⇒ Changes in Route • Changes in Route ⇒ Changes in Capacity • We use data from S3 deployment on Planet-Lab • ~2 years of data • E2E measurements: Traceroute and Pathrate (capacity) • On thousands of paths • Perform Cost vs. Accuracy analysis for two cases • Base: Only higher cost measurements are performed • Strategy: • Perform lower cost measurements • If change detected, perform higher cost measurements

  4. State-of-the-art • Correlations assumed by previous systems • GNP, Vivaldi, and other co-ordinate based systems • Correlation in latencies across paths • NetQuest • Correlation between hop changes and route changes • CoDeen • Correlation between route changes and capacity • Our work • Quantify the correlation • Perform accuracy vs cost tradeoff analysis

  5. Outline • Motivation: Quantify & leverage metric correlations • S3: Scalable Sensing Service • Deployment on PlanetLab • Correlations: • Changes in Hop and Latency ⇒ Changes in Route • Changes in Route ⇒ Changes in Capacity • Cost-Accuracy Tradeoff Analysis • Summary and Future work

  6. S3: Architecture • Sensor pods • Collection of sensors • Measure system state from a node’s view • Backplane • Programmable fabric • Connects pods and aggregates measured system state • Inference Engines • Infers O(n²) E2E paths info by measuring few paths • Schedules measurements on pods • Aggregates data on backplane • Applications

  7. Sensor Pod (diagram; only the labels survive) • Configuration & Data Repository • SNMP Agent • Secure Web Interface • Controller • API: query, control, and notification • Sensors: Load, Memory, Capacity, Lossrate, Bandwidth, Latency

  8. S3 Deployment on Planet-Lab • Running since January 2006 • All pair network metrics • Latency: Inferred by Netvigator • Lossrate: Measured using Tulip lossrate tool • Available Bandwidth: Measured using Spruce and PathChirp • Capacity: Measured using Pathrate • Stats: ~14 GB raw data every day, ~1 GB compressed

  9. Two correlations quantified • Changes in hop and latency ⇒ changes in route (HLR)? • PING can be used to measure both hops and latency • Original TTL - Remaining TTL value = Number of hops • A change in the number of hops always means a change in the route • But does a change in the route ⇒ a change in the number of hops? • Obviously NO; but how often does this happen, and how does it affect monitoring accuracy? • Changes in route ⇒ changes in capacity (RC)? • Capacity can change when route is not changed • CAP limits • Especially in PlanetLab • Becoming common in other networks: e.g., cable networks • Same route, but link upgraded or link-level change not visible in IP route • Question: • How often does this happen, and how does it affect monitoring accuracy?
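The hop-count rule on this slide (Original TTL - Remaining TTL) can be sketched as follows. The slides do not say how the original TTL is obtained, so this sketch assumes it is one of the common initial values 64, 128, or 255 and picks the smallest one that is at least the observed TTL. The function name `estimate_hops` is hypothetical.

```python
# Common initial TTL values used by operating systems. This list is an
# assumption of the sketch; the slides do not specify how the original
# TTL is determined.
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(remaining_ttl: int) -> int:
    """Estimate the hop count as (original TTL - remaining TTL).

    The original TTL is guessed as the smallest common initial value
    that is greater than or equal to the TTL seen in the reply.
    """
    if not 0 < remaining_ttl <= 255:
        raise ValueError("TTL must be in the range 1..255")
    original_ttl = next(t for t in COMMON_INITIAL_TTLS if t >= remaining_ttl)
    return original_ttl - remaining_ttl


if __name__ == "__main__":
    for ttl in (52, 115, 240):
        print(ttl, "->", estimate_hops(ttl), "hops")
```

Because the TTL in an echo reply is decremented on the way back, this estimate describes the return path of the probe.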

  10. S3 Dataset • HL ⇒ R • Use Traceroute measurements • Performed at each node to 20 landmark nodes • Landmark nodes (20) chosen across the globe • Performed once every 30 minutes • R ⇒ C • Use Traceroute and Pathrate measurements • Each node performs Pathrate to all other nodes • In a round-robin fashion • Takes about a day (avg.) to complete a round of measurements • We use Pathrate measurements iff (0 < COV < 1)

  11. Defining metric changes • Route changes (R) • R=1: If current route does not match previous sample • Else R=0 • Sometimes routers do not respond: ‘*’ in output • We ignore those hops during the above route change detection • Latency changes (L) • L=1: If current latency differs from the previous sample by p% or more • Else L=0 • We use p=5% for this analysis • Hop changes (H) • H=1: If current number of hops does not match the previous sample • H=0: otherwise
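A minimal sketch of the three change indicators defined on this slide. How unresponsive hops are "ignored" is not spelled out, so this sketch assumes a position-by-position comparison in which a ‘*’ on either side matches anything, and it treats routes of different length as changed. All function names are hypothetical.

```python
from typing import Sequence


def route_changed(prev: Sequence[str], curr: Sequence[str]) -> int:
    """R=1 if the routes differ, skipping hops where a router did not respond."""
    if len(prev) != len(curr):
        return 1  # a different hop count always means a different route
    for a, b in zip(prev, curr):
        if a == "*" or b == "*":
            continue  # unresponsive hop: assumed to match anything
        if a != b:
            return 1
    return 0


def latency_changed(prev_ms: float, curr_ms: float, p: float = 5.0) -> int:
    """L=1 if the current latency differs from the previous by p% or more."""
    if prev_ms <= 0:
        raise ValueError("previous latency must be positive")
    # Cross-multiplied form of |curr - prev| / prev * 100 >= p.
    return int(abs(curr_ms - prev_ms) * 100.0 >= p * prev_ms)


def hops_changed(prev_hops: int, curr_hops: int) -> int:
    """H=1 if the number of hops differs from the previous sample."""
    return int(prev_hops != curr_hops)


if __name__ == "__main__":
    print(route_changed(["10.0.0.1", "10.0.1.1"], ["10.0.0.1", "*"]))
    print(latency_changed(100.0, 106.0))
    print(hops_changed(12, 13))
```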

  12. Case counts • Measurements where route changed but hops did not change ⇒ if we use changes in hops to detect route changes, we will miss these • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route

  13. Case counts • Measurements where route changed but neither hops nor latency changed ⇒ if we use changes in hops and/or latency to detect route changes, we will miss these • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route

  14. Case counts • Overall, these two numbers are small ⇒ changes in hop and latency can be a good indicator of changes in route • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route
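The case-counting step behind slides 12 to 14 can be sketched as below. The slides only say the counts are "averaged across all paths", so this sketch assumes each path contributes the fraction of its samples that fall into each (H, L, R) combination and that these fractions are then averaged with equal weight per path. The function names and sample data are illustrative, not taken from the S3 dataset.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

Case = Tuple[int, int, int]  # (H, L, R) change flags for one sample
ALL_CASES = list(product((0, 1), repeat=3))


def case_fractions(samples: List[Case]) -> Dict[Case, float]:
    """Fraction of one path's samples that fall into each (H, L, R) case."""
    counts = Counter(samples)
    return {case: counts[case] / len(samples) for case in ALL_CASES}


def average_across_paths(paths: Dict[str, List[Case]]) -> Dict[Case, float]:
    """Average the per-path case fractions, weighting every path equally."""
    per_path = [case_fractions(s) for s in paths.values() if s]
    if not per_path:
        raise ValueError("need at least one path with samples")
    return {case: sum(f[case] for f in per_path) / len(per_path)
            for case in ALL_CASES}


if __name__ == "__main__":
    # Made-up flags for two paths, for illustration only.
    paths = {
        "a->b": [(0, 0, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)],
        "c->d": [(0, 0, 0), (0, 0, 0)],
    }
    avg = average_across_paths(paths)
    # Route changes invisible to both hop and latency checks: H=0, L=0, R=1.
    print("missed by hop+latency:", avg[(0, 0, 1)])
```

The cases of interest on the slides are those with R=1 and H=0 (missed by a hop-only check) and with R=1, H=0, L=0 (missed by a hop-and-latency check).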

  15. Cost-Accuracy Tradeoff • What if we perform only PING and then perform Traceroute only when a hop or latency change is observed? • Reduces cost: PING is relatively inexpensive • Increases inaccuracy: Might miss some route changes • Base method: Traceroutes every T seconds • Strategy: • Perform Traceroutes every s·T seconds • We refer to s as the sampling factor • Perform PING every t seconds when a Traceroute is not performed • Further, perform a Traceroute if change in hop/latency is observed
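The strategy above can be illustrated with a small simulation. This is a sketch under several assumptions that are not in the slides: pings run at the same interval as the base traceroutes (t = T), a traceroute costs ten times a ping, cost is reported relative to the base method, and inaccuracy is the fraction of time slots in which the monitor's last known route differs from the true route. `Slot` and `simulate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Slot:
    route: Tuple[str, ...]  # true route during this time slot
    latency_ms: float       # latency a probe would observe in this slot


def simulate(slots: List[Slot], s: int, use_trigger: bool = True,
             ping_cost: float = 1.0, traceroute_cost: float = 10.0,
             p: float = 5.0) -> Tuple[float, float]:
    """Return (cost relative to base, inaccuracy) for sampling factor s.

    With use_trigger=False this is the plain case: traceroutes every s-th
    slot and nothing in between. The cost values are assumed, not measured.
    """
    if s < 1 or not slots:
        raise ValueError("need s >= 1 and at least one slot")
    cost, stale = 0.0, 0
    known_route = None
    last_hops, last_latency = 0, 0.0
    for i, slot in enumerate(slots):
        if i % s == 0:
            cost += traceroute_cost          # scheduled traceroute
            known_route = slot.route
        elif use_trigger:
            cost += ping_cost                # cheap probe in between
            hop_change = len(slot.route) != last_hops
            lat_change = (abs(slot.latency_ms - last_latency) * 100.0
                          >= p * last_latency)
            if hop_change or lat_change:
                cost += traceroute_cost      # triggered traceroute
                known_route = slot.route
        last_hops, last_latency = len(slot.route), slot.latency_ms
        if known_route != slot.route:
            stale += 1                       # monitor holds an outdated route
    return cost / (traceroute_cost * len(slots)), stale / len(slots)


if __name__ == "__main__":
    a, b, c = ("r1", "r2", "r3"), ("r1", "r9", "r3"), ("r1", "r2", "r7", "r3")
    slots = [Slot(r, 10.0) for r in (a, a, b, b, c, c)]
    print("plain   :", simulate(slots, s=3, use_trigger=False))
    print("hop-lat :", simulate(slots, s=3, use_trigger=True))
```

In the made-up trace, the change from route `a` to `b` keeps the hop count and latency the same, so the trigger misses it until the next scheduled traceroute; the change to `c` adds a hop and is caught immediately.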

  16. Cost-Accuracy Tradeoff • Plain case: cost decreases with reduced sampling • Plain case: inaccuracy increases with reduced sampling; if the wrong frequency is chosen, inaccuracy can be very high • Hop & Hop-Lat strategies: inaccuracy stays bounded even when traceroutes are performed only when pings detect a change • Values annotated on the plots: 0.25, 0.08, 0.33, 0.12

  17. Defining capacity changes for a path • Pathrate gives an estimate of capacity (with some error) • Link-Mapping based change detection • Mapped result from Pathrate measurement to one of the several known link types • C=1: If current link type is different from the previous link type • Percent-Change • C=1: If current value is p% or more different from the previous value • We use p=10% for our analysis
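Both capacity-change rules on this slide can be sketched as follows. The slides do not list the known link types or say how an estimate is mapped to one, so this sketch assumes three nominal capacities (10, 100, and 1000 Mbps) and maps each Pathrate estimate to the nearest of them. The table and function names are hypothetical.

```python
# Hypothetical nominal link capacities in Mbps; the slides do not list the
# link types actually used.
KNOWN_LINK_TYPES_MBPS = (10.0, 100.0, 1000.0)


def map_to_link_type(estimate_mbps: float) -> float:
    """Map a capacity estimate to the nearest known link type (assumed rule)."""
    return min(KNOWN_LINK_TYPES_MBPS, key=lambda c: abs(c - estimate_mbps))


def capacity_changed_link_mapping(prev_mbps: float, curr_mbps: float) -> int:
    """C=1 if the mapped link type differs from the previous sample's."""
    return int(map_to_link_type(prev_mbps) != map_to_link_type(curr_mbps))


def capacity_changed_percent(prev_mbps: float, curr_mbps: float,
                             p: float = 10.0) -> int:
    """C=1 if the current value differs from the previous by p% or more."""
    if prev_mbps <= 0:
        raise ValueError("previous capacity must be positive")
    return int(abs(curr_mbps - prev_mbps) * 100.0 >= p * prev_mbps)


if __name__ == "__main__":
    print(capacity_changed_link_mapping(94.3, 97.1))  # same link type
    print(capacity_changed_link_mapping(94.3, 9.6))   # different link type
    print(capacity_changed_percent(100.0, 111.0))     # 11% difference
```

Link mapping absorbs measurement noise around a nominal capacity, while the percent-change rule reacts to any sufficiently large shift, including shifts within one link type.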

  18. Case counts • Averaged across all paths • C: Change in Capacity; R: Change in Route • R & C take the same value in only 63% and 58% of cases ⇒ Modest positive correlation

  19. Cost-Accuracy Tradeoff • Link-Mapping

  20. Cost-Accuracy Tradeoff • Percent-Change

  21. Conclusions and Next Steps • Methodology for correlation quantification • Case counting • Cost-Accuracy tradeoff analysis • Hop & Latency changes ⇒ Route changes • Route changes ⇒ Capacity changes • Promising results in both cases • Low cost measurements can be used to trigger high cost measurements • Further steps • Other correlations: Capacity and Available Bandwidth correlation • Application level inaccuracy aka impact on E2E apps

  22. Ongoing work http://networking.hpl.hp.com/s-cube Email: s-cube@hpl.hp.com
