
  1. Exploring Tradeoffs in Failure Detection in P2P Networks Shelley Zhuang, Ion Stoica, Randy Katz HIIT Short Course August 18-20, 2003

  2. Problem Statement • A key challenge in achieving robustness in overlay networks is detecting node failures quickly • Canonical solution: each node periodically pings its neighbors • Propose keep-alive techniques • Study the fundamental limitations and tradeoffs between detection time, control overhead, and probability of false positives

  3. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

  4. Network Model and Assumptions • P2P system with n nodes • Each node A knows d other nodes • Average path length = l • Node up-times T are i.i.d. exponential with failure rate λf • Fail-stop failures • If a neighbor is lost, a node can use another neighbor to route the packet without affecting the path length

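  5. Packet Loss Probability • δ = average time it takes a node to detect that a neighbor has failed • Probability that a node forwards a packet to a neighbor that has failed is $1 - e^{-\lambda_f \delta} \approx \delta \lambda_f$, since by memorylessness $P(T - t \le \delta \mid T \ge t) = P(T \le \delta)$ • Probability that the packet is lost is $p_l \approx l \delta \lambda_f$ • [Figure: pdf of T with the detection window δ marked]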
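The approximations on this slide can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it assumes each of the l hops independently hits a failed neighbor with the per-hop probability above, which the slides do not state explicitly.

```python
import math

def p_forward_to_failed(lambda_f: float, delta: float) -> float:
    """Exact probability that a neighbor failed within the detection window delta."""
    return 1.0 - math.exp(-lambda_f * delta)

def p_packet_lost(lambda_f: float, delta: float, path_len: int) -> float:
    """Probability that at least one hop forwards to a failed neighbor.

    Assumption (not from the slides): hops fail independently.
    """
    per_hop = p_forward_to_failed(lambda_f, delta)
    return 1.0 - (1.0 - per_hop) ** path_len

def p_packet_lost_approx(lambda_f: float, delta: float, path_len: int) -> float:
    """First-order approximation from the slide: l * delta * lambda_f."""
    return path_len * delta * lambda_f

if __name__ == "__main__":
    # Hypothetical numbers: mean up-time of one hour, 10 s detection time, 5 hops.
    lam, delta, hops = 1.0 / 3600.0, 10.0, 5
    print(p_packet_lost(lam, delta, hops), p_packet_lost_approx(lam, delta, hops))
```

The approximation is an upper bound on the exact value and is tight when δλf is small, that is, when detection is fast relative to node lifetimes.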

  6. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

  7. Aliveness Techniques • Baseline • Each node sends a ping message to each of its neighbors every Δ seconds • [Figure: nodes A, B, C, D]
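As a rough illustration of the baseline scheme, the sketch below computes the per-node probe rate and the delay between a failure and the next scheduled ping. The function names are hypothetical, and it assumes a ping sent to a failed node reveals the failure immediately (no probe timeout, no message loss).

```python
import math

def probe_rate(d: int, period: float) -> float:
    """Pings per second sent by one node that probes d neighbors every `period` seconds."""
    return d / period

def detection_delay(fail_time: float, period: float, phase: float = 0.0) -> float:
    """Time from a failure until the next ping, with pings sent at phase + k * period."""
    k = math.ceil((fail_time - phase) / period)
    return phase + k * period - fail_time

if __name__ == "__main__":
    # Averaging over failure times spread evenly across one period gives about period / 2.
    period = 1.0
    delays = [detection_delay(i / 1000.0, period) for i in range(1000)]
    print(sum(delays) / len(delays))
```

Under these assumptions the average detection time of the baseline is about Δ/2, which gives the other techniques a reference point.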

  8. Aliveness Techniques • Information Sharing • Piggyback failures of neighbors in acknowledgement messages • Best case: completely connected graph of degree d • [Figure: nodes A, B, C, D]
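One way to picture information sharing is an acknowledgement that carries the sender's list of suspected failures. The sketch below is a minimal illustration, not the authors' message format: the `Ack` and `Node` types and their fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Ack:
    sender: str
    known_failed: frozenset = frozenset()  # failures piggybacked on the ack

@dataclass
class Node:
    name: str
    neighbors: set
    suspected: set = field(default_factory=set)

    def make_ack(self) -> Ack:
        """Reply to a ping, piggybacking every failure this node currently knows about."""
        return Ack(self.name, frozenset(self.suspected))

    def on_ack(self, ack: Ack) -> None:
        """Adopt reported failures, but only for nodes that are our own neighbors."""
        self.suspected |= ack.known_failed & self.neighbors

if __name__ == "__main__":
    a = Node("A", {"B", "C", "D"})
    b = Node("B", {"A", "C", "D"}, suspected={"D"})
    a.on_ack(b.make_ack())
    print(a.suspected)
```

In this sketch A learns about D's failure from B's ack without waiting for its own next ping to D, which is why a densely connected neighborhood is the best case.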

  9. Aliveness Techniques • Boosting • When a node detects failure of a neighbor, D, it announces to all other nodes that have D as their neighbor • Best case: completely connected graph of degree d • [Figure: nodes A, B, C, D]
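To see why a completely connected neighborhood is the best case for boosting, consider the time until the first of D's d neighbors pings it after it fails. The Monte Carlo sketch below is illustrative only: it assumes the neighbors' ping phases are independent and uniform over the period Δ and that the announcement arrives instantly, neither of which is stated in the slides. Under those assumptions the expected time to the first detection is Δ/(d+1).

```python
import random

def first_detection_delay(period: float, d: int, rng: random.Random) -> float:
    """Delay until the earliest of d neighbors pings the failed node."""
    # Assumption: each neighbor's next ping falls at an independent uniform offset.
    return min(rng.uniform(0.0, period) for _ in range(d))

def mean_first_detection(period: float, d: int, trials: int = 20000, seed: int = 0) -> float:
    """Average first-detection delay over many simulated failures."""
    rng = random.Random(seed)
    total = sum(first_detection_delay(period, d, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for d in (1, 4, 9):
        print(d, mean_first_detection(1.0, d), 1.0 / (d + 1))
```

With d = 1 this reduces to the baseline average of about Δ/2; with more neighbors sharing the detection, every node learns of the failure sooner at the same probing rate.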

  10. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

  11. Performance Evaluation • Case studies • d-regular network • Chord lookup protocol • Chord event driven simulator • Gnutella join/leave trace • Packet loss rate • Control overhead • Planetlab experiments • Planetlab event driven simulator • False positives

  12. Loss Rate – Gnutella • Loss Rate = # Lookup timeouts / # Lookups • 20 lookups per second • Boosting (simple): no additional state

  13. Loss Rate – Gnutella • Wait Tto seconds before deciding that a probe is lost • Require multiple losses before deciding that a neighbor has failed
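The two rules on this slide can be sketched as a small per-neighbor monitor. This is a hedged illustration rather than the evaluated implementation: the class and parameter names are hypothetical, and it assumes "multiple losses" means consecutive lost probes, with any timely ack resetting the count.

```python
from typing import Optional

class NeighborMonitor:
    """Declares a neighbor failed after `loss_threshold` consecutive lost probes."""

    def __init__(self, probe_timeout: float, loss_threshold: int) -> None:
        self.probe_timeout = probe_timeout    # Tto, in seconds
        self.loss_threshold = loss_threshold  # lost probes needed to declare failure
        self.consecutive_losses = 0

    def on_probe_result(self, rtt: Optional[float]) -> bool:
        """Record one probe outcome; rtt is None if no ack ever arrived.

        Returns True once the neighbor is considered failed.
        """
        lost = rtt is None or rtt > self.probe_timeout
        if lost:
            self.consecutive_losses += 1
        else:
            self.consecutive_losses = 0  # a timely ack clears earlier suspicion
        return self.consecutive_losses >= self.loss_threshold

if __name__ == "__main__":
    monitor = NeighborMonitor(probe_timeout=0.5, loss_threshold=3)
    for rtt in (None, None, 0.1, None, None, None):
        print(rtt, monitor.on_probe_result(rtt))
```

Raising either parameter trades slower detection for fewer false positives when probes are lost on a live link.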

  14. Overhead (count) – Gnutella • Constant probing overhead (1 probe/second) • Small difference due to boost messages

  15. Overhead (bps) – Gnutella • Boosting with backpointers: 1.29 times the baseline overhead

  16. Overhead (bps) – Gnutella • Send backpointers every 10 probe acks

  17. False Positive – Planetlab • Propagation of positive information • Most false positives occur at TO = 0, 1, so increase the probe timeout threshold

  18. Overhead (bps) – Planetlab • Overhead from boost messages and positive information correlate with the loss rate

  19. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

  20. Conclusion • Examined three keep-alive techniques in Chord with Gnutella join/leave trace • By carefully designing keep-alive algorithms, it is possible to significantly reduce packet loss probability • Probability of false positive for boosting with backpointer < 0.01 for loss rate ~ 8.6% by propagating positive information and increasing probe timeout threshold

  21. Future Work • Evaluate keep-alive schemes under massive failures and churn • Optimal control resource allocation strategy for a given network topology, failure rate, and load distribution • Other applications of keep-alive techniques?