
Real-time End-to-end Network Monitoring in Large Distributed Systems



  1. Real-time End-to-end Network Monitoring in Large Distributed Systems Han Hee Song, University of Texas at Austin. Joint work with Praveen Yalagandula, Hewlett-Packard Labs

  2. Outline • Introduction • S3 – Scalable Sensing Service • Concurrent n² measurements • Serialized measurements • Network inference • Proposed solution • Resource adaptive network monitoring • Evaluation • Path distribution • Inference accuracy • Adaptive path count management • Summary

  3. S3 – Scalable Sensing Service • Goal • Securely measure real-time e2e network properties • In a network with 10,000s of end hosts • Applications • Content distribution systems • Media streaming systems • Traffic engineering • Overlay routing

  4. Challenges of Concurrent measurements • Problems of concurrent n² measurements • Resource constraints on CPU, memory, network BW [Figures: network usage of different measurement tools; CPU and memory usage of PathChirp]

  5. Challenges of Concurrent measurements • Problems of concurrent n² measurements • Resource constraints on CPU, memory, network BW • Interference on node and network [Figures: response time and latency error of the LossDelay measurement tool]

  6. Challenges - Serialized measurements • Problems of serialized measurements • A single cycle of measurement takes too long to be real-time [Figure: CDF of times taken for a single cycle of measurements on a 500-node PlanetLab topology]

  7. Network Inference • Monitor path performance of a subset of paths, and reconstruct the performance of all other paths • Example • Measurement of additive metrics, e.g. delay, log(1 - loss rate) • Bandwidth measurement

Routing matrix A over links x1..x4 and measured end-host paths b1..b4, A x = b, where

        [ 1 1 0 0 ]        [ x1 ]        [ b1 ]
    A = [ 1 0 1 0 ],   x = [ x2 ],   b = [ b2 ],   rank(A) = 4
        [ 0 1 0 1 ]        [ x3 ]        [ b3 ]
        [ 0 1 1 0 ]        [ x4 ]        [ b4 ]
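The routing-matrix view above can be checked numerically. A minimal sketch, where the per-link delays in x are made-up example values:

```python
import numpy as np

# Routing matrix from the slide: rows are measured paths, columns are links.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
])

# For additive metrics (delay, log(1 - loss rate)), each path observation
# b_i is the sum of the link metrics x_j on that path: A x = b.
x = np.array([2.0, 1.0, 3.0, 0.5])   # hypothetical per-link delays (ms)
b = A @ x                             # resulting per-path delays

print("rank(A) =", np.linalg.matrix_rank(A))  # 4: full rank, b determines x
print("path delays:", b)
```

With full rank, the four path measurements pin down all four link metrics; the interesting case, addressed later in the deck, is when fewer paths are measured than there are unknowns.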

  8. Network Inference • Existing network inference goal • Minimize the number of monitored paths, without considering available resources at the end hosts • Our goal • Build a system that leverages inference techniques and adapts to the resource constraints

  9. Outline • Introduction • S3 – Scalable Sensing Service • Concurrent n² measurements • Serialized measurements • Network inference • Proposed solution • Resource adaptive network monitoring • Evaluation • Path distribution • Inference accuracy • Adaptive path count management • Summary

  10. Resource adaptive network monitoring • Background: NetQuest • Design of experiment • Using Bayesian experimental design, select a subset of paths to measure that maximizes the expected information gain. • Network inference • Using L1-norm minimization, reconstruct the performance of all other paths from the partial, indirect observations. • We extend NetQuest in the following ways • Characterize resource requirements • CPU usage of the LossDelay measurement tool • Continuously monitor available resources • Monitor CPU usage of other ongoing processes • Path selection • Modify the design-of-experiment stage to select paths w.r.t. the available resources • Measurement • Measure selected e2e path properties using the S3 system • Inference • Leverage NetQuest's L1-norm minimization

  11. Resource requirement characterization • Assume CPU usage grows linearly with the number of measurement tool instances • On each node, characterize the amount of CPU used by one instance of the LossDelay measurement tool • Test-run the tool several times • Obtain average CPU usage with the UNIX time command: (user + sys) / real time
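The (user + sys) / real characterization above can be sketched as follows. The command timed here is a stand-in (`sleep`), since the LossDelay binary itself is not part of the deck:

```python
import os
import subprocess
import time

def cpu_fraction(cmd, runs=3):
    """Estimate the CPU fraction one instance of a measurement tool uses
    by averaging (user + sys) / real over several test runs, mirroring
    the UNIX time command. cmd is a placeholder for the actual tool."""
    fractions = []
    for _ in range(runs):
        start_wall = time.monotonic()
        start_cpu = os.times()
        subprocess.run(cmd, check=True)
        end_cpu = os.times()
        real = time.monotonic() - start_wall
        # children_user/children_system accumulate the subprocess's CPU time
        cpu = (end_cpu.children_user - start_cpu.children_user
               + end_cpu.children_system - start_cpu.children_system)
        fractions.append(cpu / real)
    return sum(fractions) / len(fractions)

# Example: sleeping uses essentially no CPU over its wall-clock time.
frac = cpu_fraction(["sleep", "0.2"])
print(f"CPU fraction: {frac:.3f}")
```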

  12. Monitoring resources • On each node, continuously monitor the fraction of CPU used by other processes • The CoTop tool reports CPU usage across all slices • Determine the max number of LossDelay measurements w.r.t. the remaining free CPU and the CPU requirement of LossDelay
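Converting the remaining free CPU into a measurement count reduces to a division; a minimal sketch, where the budget and per-instance cost are hypothetical numbers, not values from the deck:

```python
def max_concurrent_tools(free_cpu_fraction, per_tool_cpu_fraction):
    """Given the fraction of CPU left idle (e.g. derived from CoTop's
    per-slice report) and the characterized per-instance cost of the
    LossDelay tool, return how many instances fit within the budget."""
    if per_tool_cpu_fraction <= 0:
        raise ValueError("per-tool CPU cost must be positive")
    return int(free_cpu_fraction // per_tool_cpu_fraction)

# With a 1.05% CPU budget and a 0.1% per-instance cost, 10 instances fit.
print(max_concurrent_tools(0.0105, 0.001))
```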

  13. Path selection • Greedy search algorithm selecting a set of paths to measure 1. Start with an empty bag of paths 2. Among paths outside the bag, choose and add a path p s.t. (1) adding p does not violate the resource constraints and (2) p maximizes the accuracy gain 3. Repeat step 2 until no more paths can be added without violating the constraints
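The greedy loop above can be sketched as follows. The gain function here is a uniform stand-in for the Bayesian design criterion, and the cost model (one fixed CPU share per endpoint per measurement) is an assumption for illustration:

```python
def greedy_select(candidate_paths, per_path_cost, node_budget, gain):
    """Greedy sketch of constrained path selection. Paths are (src, dst)
    pairs; per_path_cost is the CPU fraction one measurement costs on each
    endpoint; node_budget maps node -> remaining CPU budget; gain(sel, p)
    scores the accuracy improvement of adding p to the selection."""
    selected = []
    remaining = dict(node_budget)
    candidates = set(candidate_paths)
    while True:
        # Keep only paths whose endpoints both still have budget.
        feasible = [p for p in candidates
                    if remaining[p[0]] >= per_path_cost
                    and remaining[p[1]] >= per_path_cost]
        if not feasible:
            break  # step 3: nothing can be added without a violation
        best = max(feasible, key=lambda p: gain(selected, p))
        selected.append(best)
        candidates.remove(best)
        remaining[best[0]] -= per_path_cost
        remaining[best[1]] -= per_path_cost
    return selected

# Toy run: 3 nodes, each with budget for two measurements at 0.001 apiece.
paths = [("a", "b"), ("b", "c"), ("a", "c")]
picked = greedy_select(paths, 0.001,
                       {"a": 0.002, "b": 0.002, "c": 0.002},
                       gain=lambda sel, p: 1.0)  # uniform stand-in score
print(picked)
```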

  14. Measurement • Configure the S3 system to perform measurements • To only the selected destinations • Without incurring on-host or in-network interference.

  15. Inference • Network inference using NetQuest • L1-norm minimization approximately reconstructs all path performances • Based on measured path data and topology information
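A minimal reconstruction sketch for the underdetermined case. NumPy's `lstsq` (minimum-L2-norm solution) is used here only as a dependency-free stand-in for NetQuest's L1-norm minimization, which would be solved as a linear program; the routing matrix and link delays are made-up examples:

```python
import numpy as np

# Underdetermined setup: 3 measured paths over 4 links (rank 3 < 4 unknowns).
A_sel = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
x_true = np.array([2.0, 1.0, 3.0, 0.5])   # hypothetical link delays (ms)
b_sel = A_sel @ x_true                     # delays on the measured paths

# Reconstruct link metrics from the partial observations; lstsq returns
# the minimum-norm solution for a consistent underdetermined system.
x_hat, *_ = np.linalg.lstsq(A_sel, b_sel, rcond=None)

# Predict an unmeasured path (traversing links 2 and 3) from x_hat.
a_new = np.array([0.0, 1.0, 1.0, 0.0])
print("predicted:", a_new @ x_hat, "true:", a_new @ x_true)
```

The reconstruction exactly reproduces the measured paths; accuracy on unmeasured paths depends on which paths were selected, which is what the design-of-experiment stage optimizes.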

  16. Evaluation • Algorithms compared • Resource-oblivious algorithm • Choose a set of paths equal to the rank of the routing matrix. • Measurements that exceed a node's resource constraint are removed afterwards. • Resource-aware algorithm • Schedule paths w.r.t. each node's resource constraints • Note: the total number of paths measured differs between the resource-oblivious and resource-aware algorithms

  17. Evaluation • Evaluation setting • 100 end hosts • Measure loss rate and delay of paths using LossDelay tool • Constrain nodes to use 0.1%, 0.5%, 1%, or 2% of remaining CPU • Simulation & real PlanetLab deployment • Compare path distribution • Measure accuracy of inference • Measure CPU adaptability

  18. Evaluation – Path distribution • Path distribution for a resource constraint of 0.5% available CPU • Fewer paths are scheduled on loaded nodes.

  19. Evaluation – Inference accuracy comparison • Mean Absolute Error (MAE) of inferred path performances • Inference accuracy loss remains small even under stringent constraints

  20. Evaluation – Adaptive path count management • Adaptive path count management graph • Adaptive management reacts to changes in the CPU load

  21. Summary • Conclusion • Real-time end-to-end monitoring system • Monitoring loss and delay metrics using a small fraction of free resources • Future work • Decentralize the path selection algorithm • Based on resource constraints • Inference algorithm • Decentralize the inference load • Leverage other algorithms: GNP, NetVigator • Available bandwidth measurement

  22. Thank you

  23. Backup slides

  24. Network Monitoring • Goal • To measure real-time e2e network properties • In a network with 10,000s of end hosts • Applications • Content distribution systems • Media streaming systems • Traffic engineering • Overlay routing

  25. Challenges - Simultaneous measurements • Resource usage of simultaneous measurements • Loss delay sensor • CPU, memory, bandwidth usage plots • PathChirp • CPU, memory, bandwidth usage plots • Pathrate • CPU, memory, bandwidth usage plots

  26. Challenges of Simultaneous measurements • Interference from simultaneous measurements • Loss delay sensor • Response time, latency error plots • PathChirp • Maximum response time, measurement failure frequency plots • Pathrate • Measurement error plot

  27. Resource adaptive network monitoring • Network monitoring system that adapts the number of active measurements according to the machine and network load • Key tasks overview • Resource requirement characterization • CPU usage of LossDelay measurement tool • Monitoring resources • Monitor CPU usage of other on-going processes • Path selection • Select paths w.r.t. the load on the node and network • Measurement • Measure path properties of only selected e2e paths • Inference • Leverage NetQuest’s L1-norm minimization.

  28. [Architecture diagram: end hosts send measurements to an inference server]

  29. Path selection algorithm contd. • Original NetQuest path selection algorithm

  30. Path selection algorithm contd. • Path selection algorithm with constraints

  31. Topology information gathering • Internet topology is stable for at least a day* • Using the S3 deployment on PlanetLab, perform round-robin traceroute among all end nodes • Once the topology is built, detect changes by checking the remaining TTL in ICMP responses * Zhang, Paxson, Shenker. The stationarity of Internet path properties. ACIRI Technical Report, May 2000
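The TTL-based change check can be sketched as a hop-count comparison. The assumed initial TTL of 64 is a common default but not universal (128 and 255 also occur), so real code would have to infer it per host:

```python
def route_changed(baseline_hops, observed_ttl, initial_ttl=64):
    """Cheap route-change check from the remaining TTL of an ICMP reply.

    The hop count is initial_ttl minus the TTL seen on arrival; if it
    differs from the hop count recorded when the topology was built,
    the path has likely changed and a fresh traceroute is warranted.
    """
    hops = initial_ttl - observed_ttl
    return hops != baseline_hops

# Baseline said 5 hops; a reply arriving with TTL 58 implies 6 hops now.
print(route_changed(5, 58))   # True -> re-run traceroute for this path
```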
