
Challenges in Evaluating Distributed Algorithms




Presentation Transcript


  1. Challenges in Evaluating Distributed Algorithms Idit Keidar The Technion & MIT FuDiCo, Bertinoro, June 2002

  2. Distributed Algorithm Evaluation • Aspects: Performance, reliability, availability, … • Techniques: • Theory: define models, metrics • Simulations: define model, metrics • Experimental: choose environment, patterns, metrics, …

  3. Significance of Metrics/Models • Conclusions depend on choice of model/metric • Algorithms designed to optimize specific metrics • Metrics should be simple to work with • Must make simplifying assumptions

  4. Examples of Simplifying Assumptions • Time complexity: • All messages take equal time • Reliability: • Bounded potential number of failures (t of n), independent of system life span • IID models: • Independent failures (e.g., message loss) • All failures equally likely

  5. Example: Time Complexity • Usual metric: # synchronous rounds • Or asynchronous “steps” • Underlying assumption: • All rounds / steps cost the same

  6. Performance Evaluation of a Communication Round over the Internet Omar Bakr & Idit Keidar To appear in PODC 02

  7. Communication Round • Part of many distributed algorithms, systems • consensus, atomic commit, replication, group membership ... • Hosts send data to each other • potentially to all connected hosts • Common metric for evaluating algorithms: number of rounds

  8. Questions • What is the best way to implement a communication round over the Internet? • decentralized vs. centralized • How long is a communication round over the Internet? • Are all communication rounds the same?

  9. Prediction is Hard • Internet is unpredictable, diverse, … • Different answers for different topologies, different times • Different performance metrics • local running time at a host • overall running time from first start to last finish

  10. “Communication Round” Primitive • Initiated by some host • Propagates data from every host to every other host connected to it • Example implementations: • All-to-all • Leader
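
To make the two example implementations concrete, here is a minimal sketch of their message patterns. The send(src, dst, payload) primitive and the exact round structure (chosen to match the round counts on slide 13) are assumptions for illustration; the actual experiments used TCP connections between hosts.

```python
# Minimal sketch of the two "communication round" implementations named above.
# send(src, dst, payload) is a hypothetical point-to-point primitive; the real
# experiments used TCP. Round structure follows one plausible reading of the slides.

def all_to_all_round(hosts, initiator, data, send):
    # Round 1: the initiator announces the round to everyone.
    for dst in hosts:
        if dst != initiator:
            send(initiator, dst, "start")
    # Round 2: every host sends its data directly to every other host
    # (n*(n-1) data messages in total).
    for src in hosts:
        for dst in hosts:
            if dst != src:
                send(src, dst, data[src])

def leader_round(hosts, leader, data, send):
    # Leader == initiator: leader -> all, all -> leader, leader -> all
    # (3 rounds, roughly 3*(n-1) messages).
    for dst in hosts:
        if dst != leader:
            send(leader, dst, "start")
    for src in hosts:
        if src != leader:
            send(src, leader, data[src])
    aggregate = dict(data)  # leader now holds everyone's data
    for dst in hosts:
        if dst != leader:
            send(leader, dst, aggregate)
```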

  11. Experiment I • 10 hosts: Taiwan, Korea, US academia, ISPs • TCP/IP • Algorithms: • All-to-all • Leader (initiator) • Secondary leader (not initiator) • Periodically initiated at each host: 650 times over 3.5 days

  12. Overall Running Time • Elapsed time from initiation (at initiator) until all hosts terminate • Requires estimating clock differences • Clocks not synchronized, drift • We compute difference over short intervals • Compute 3 different ways • Achieve accuracy within 20 ms. on 90% of runs
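
The slides do not spell out the three estimation methods, so the following is only one standard way to estimate a clock difference from a single short probe exchange (an NTP-style symmetric-delay assumption); query_remote_clock is a hypothetical helper, not part of the original work.

```python
import time

def estimate_offset(query_remote_clock):
    """Estimate remote_clock - local_clock from one probe/reply exchange.
    query_remote_clock() is a hypothetical helper that sends a probe to the
    peer and returns the peer's clock reading for that probe."""
    t_send = time.time()               # local clock when the probe leaves
    t_remote = query_remote_clock()    # peer's clock while handling the probe
    t_recv = time.time()               # local clock when the reply arrives
    # Assume symmetric one-way delays, so the peer's reading corresponds
    # roughly to the midpoint of the round trip.
    return t_remote - (t_send + t_recv) / 2.0

# Keeping each probe's round trip short bounds the error of this estimate,
# which is in the spirit of computing differences "over short intervals".
```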

  13. Teaser: Counting Rounds • All-to-all: 2 • Leader: 3 • Secondary Leader: 4

  14. Counting Overall Running Time From MIT • Ping-measured latencies (IP): • Longest link latency: 240 ms • Longest link to MIT: 150 ms • 150 + 240 = 390 ms • 150 + 150 + 150 = 450 ms
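
One way to read the two sums above (a hedged reconstruction: the notation ℓ(i,j) for the ping-measured latency between hosts i and j is introduced here, not taken from the slide) is that the first matches the two rounds of all-to-all and the second the three rounds through the leader at MIT:

```latex
T_{\text{all-to-all}} \approx \max_i \ell(\mathrm{MIT}, i) + \max_{i,j} \ell(i,j)
                      = 150 + 240 = 390~\text{ms}
\qquad
T_{\text{leader}} \approx 3 \cdot \max_i \ell(\mathrm{MIT}, i)
                  = 3 \cdot 150 = 450~\text{ms}
```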

  15. Measured Running Times: Runs Initiated at MIT (times in milliseconds)

                                 All-to-all        Leader            Sec. Leader
                                 overall   local   overall   local   overall   local
   Average (runs under 2 sec)    811       295     541       335     585       408
   % runs over 2 seconds         55%       3%      13%       6%      9%        3%

  16. What’s Going On? • Very high loss rates on two links • 42% and 37% • Taiwan to two US ISPs • Loss rates on other links up to 8% • Upon loss, TCP’s timeout is big • More than round-trip-time • All-to-all sends messages on lossy links • Often delayed by loss

  17. Distribution of Running Times at MIT (chart, up to 1.3 sec.)

  18. Measured Running Times: Runs Initiated at Taiwan (times in milliseconds)

                                 All-to-all        Leader            Sec. Leader
                                 overall   local   overall   local   overall   local
   Average (runs under 2 sec)    866       645     1120      844     679       607
   % runs over 2 seconds         54%       24%     64%       43%     13%       7%

  19. Distribution of Running Times in Taiwan

  20. Experiment II: Removing Taiwan • Overall running times much better • For every initiator and algorithm, less than 10% over 2 seconds • All-to-all overall still worse than others! • either Leader or Secondary Leader best, depending on initiator • loss rates of 2%–8% are not negligible • all-to-all sends O(n²) messages; suffers • But, all-to-all has best local running times

  21. Probability of Delay due to Loss • If all links had the same latency: • assume 1% loss on all links; 10 hosts (n=10) • Leader sends 3(n-1) = 27 messages • probability of at least one loss: 1 - 0.99^27 ≈ 24% • All-to-all sends n(n-1) = 90 messages • probability of at least one loss: 1 - 0.99^90 ≈ 60% • In reality, links don't have the same latency • only loss on long links matters
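
A few lines reproducing the arithmetic on this slide, under the slide's own assumptions (10 hosts, an i.i.d. 1% loss rate on every link):

```python
# Probability that at least one of m independently sent messages is lost,
# with per-message loss probability p (assumptions taken from the slide).
p, n = 0.01, 10
patterns = {
    "leader":     3 * (n - 1),   # 27 messages
    "all-to-all": n * (n - 1),   # 90 messages
}
for name, m in patterns.items():
    print(f"{name}: P[at least one loss] = {1 - (1 - p) ** m:.0%}")
# -> roughly 24% and 60%, matching the slide
```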

  22. Experiment Conclusions • Message loss causes high variation in TCP link latencies • latency distribution has high variance, heavy tail • Latency distribution determines expected cost of sending O(n) concurrent messages • Constant distribution -> constant • Exponential distribution -> Log(n) [RS94] • Worse for heavier tails • Secondary leader helps • No triangle inequality • Different for overall vs. local running times
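
The distribution-dependent cost can be made precise for the simplest cases (a sketch; X_1, ..., X_n are i.i.d. per-message latencies, and the exponential case is the standard order-statistics fact behind the Log(n) figure cited as [RS94]):

```latex
\text{constant latency } c:\quad \mathbb{E}\Big[\max_{i \le n} X_i\Big] = c
\qquad
\text{exponential, rate } \lambda:\quad
\mathbb{E}\Big[\max_{i \le n} X_i\Big] = \frac{H_n}{\lambda}
  = \frac{1}{\lambda}\sum_{k=1}^{n}\frac{1}{k} \approx \frac{\ln n}{\lambda}
```

Heavier-tailed latency distributions make this expected maximum grow faster than log n, which is the "worse for heavier tails" point above.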

  23. And Now, Back to Theory • Number of rounds not sufficient metric • E.g., One-to-all and all-to-all have different costs • Refine metric: • Say which “kind” of rounds • Lower bound for asynchronous consensus – 2 rounds, but of which kind? • Paxos does it in: • 1 transmission from one to quorum + • 1 transmission from quorum to all
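
One simple way to express the "kind" of a round is by who sends to whom. The sketch below only counts point-to-point messages for the round kinds mentioned on this slide; the majority quorum size q = n//2 + 1 is an assumption for illustration, not stated on the slide.

```python
# Counting messages per "kind" of round: rounds with equal round count
# can still have very different costs.
def messages(senders, receivers):
    return senders * receivers

n = 10
q = n // 2 + 1                                  # assumed majority quorum
print("one-to-all:    ", messages(1, n - 1))    #  9
print("all-to-all:    ", messages(n, n - 1))    # 90
print("one-to-quorum: ", messages(1, q))        #  6 (e.g., one host to a quorum)
print("quorum-to-all: ", messages(q, n - 1))    # 54 (a quorum informing everyone else)
```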

  24. Dalgeval: Distributed Algorithm Evaluation • Goal: Develop realistic ways to evaluate distributed algorithms • Gather data on failures, performance in various environments • Use data to create models for prediction • theoretical analysis and simulations • Experiment to validate prediction • Design algorithms to optimize for correct metrics
