
# Exploring Tradeoffs in Failure Detection in P2P Networks

##### Presentation Transcript

1. Exploring Tradeoffs in Failure Detection in P2P Networks Shelley Zhuang, Ion Stoica, Randy Katz HIIT Short Course August 18-20, 2003

2. Problem Statement • One of the key challenges to achieving robustness in overlay networks is quickly detecting node failures • Canonical solution: each node periodically pings its neighbors • Propose keep-alive techniques • Study the fundamental limitations and tradeoffs between detection time, control overhead, and probability of false positives

3. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

4. Network Model and Assumptions • P2P system with n nodes • Each node A knows d other nodes (its neighbors) • Average path length = l • Node up-times are i.i.d., T ~ exponential(λf) • Fail-stop failures • If a neighbor is lost, a node can use another neighbor to route the packet without affecting the path length

5. Packet Loss Probability • δ = average time it takes a node to detect that a neighbor has failed • By the memoryless property of the exponential up-time, the probability that a node forwards a packet to a neighbor that has already failed is P(T − t ≤ δ | T > t) = P(T ≤ δ) = 1 − e^(−λf δ) ≈ λf δ • Over a path of average length l, the probability that the packet is lost is pl ≈ l λf δ
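
The per-hop and end-to-end loss probabilities above can be computed directly. A minimal sketch; the function names and the example values (mean up-time, detection time, path length) are illustrative assumptions, not figures from the slides:

```python
import math

def p_forward_to_failed(lam_f: float, delta: float) -> float:
    """P(neighbor fails within the detection window delta): 1 - e^(-lam_f * delta)."""
    return 1.0 - math.exp(-lam_f * delta)

def p_packet_lost(lam_f: float, delta: float, path_len: int) -> float:
    """Packet survives only if all hops forward to a live neighbor;
    for small lam_f * delta this is approximately path_len * lam_f * delta."""
    per_hop = p_forward_to_failed(lam_f, delta)
    return 1.0 - (1.0 - per_hop) ** path_len

# Illustrative example: mean up-time 1 hour, delta = 30 s, 5-hop path.
lam_f, delta, l = 1 / 3600, 30.0, 5
print(p_forward_to_failed(lam_f, delta))  # per-hop probability
print(p_packet_lost(lam_f, delta, l))     # end-to-end loss probability
```

For these values the end-to-end result is close to the first-order approximation l · λf · δ = 5 · 30/3600 ≈ 0.042.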

6. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

7. Aliveness Techniques • Baseline • Each node sends a ping message to each of its neighbors every Δ seconds B C A D
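
Under the baseline scheme, a failure lands at a random point within a ping interval, so the expected detection delay is roughly half the ping period plus the probe timeout. A small Monte Carlo sketch of that reasoning; the function name and parameter values are illustrative assumptions:

```python
import random

def baseline_detection_delay(delta_ping: float, timeout: float,
                             trials: int = 10000, seed: int = 1) -> float:
    """Average time to detect a neighbor failure under periodic pinging.

    The neighbor fails at a uniformly random offset within a ping interval,
    so detection waits for the next scheduled probe plus the probe timeout:
    expected delay ~= delta_ping / 2 + timeout.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        fail_at = rng.uniform(0.0, delta_ping)  # offset into the current interval
        next_ping = delta_ping                  # next probe fires at the interval end
        total += (next_ping - fail_at) + timeout
    return total / trials

print(baseline_detection_delay(delta_ping=1.0, timeout=0.5))  # ~1.0
```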

8. Aliveness Techniques • Information Sharing • Piggyback failures of neighbors in acknowledgement messages • Best case: completely connected graph of degree d B C A D
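
The piggybacking idea can be sketched as follows; the message format and function names are hypothetical, chosen only to illustrate merging piggybacked failure information into local state:

```python
def make_ack(sender_id, detected_failed):
    """Hypothetical ack format: piggyback the sender's set of detected failures."""
    return {"from": sender_id, "failed": set(detected_failed)}

def on_ack(my_failed, my_neighbors, ack):
    """Learn about failed nodes from a neighbor's ack, but only for nodes
    we actually keep as neighbors; returns the newly learned failures."""
    learned = (ack["failed"] & my_neighbors) - my_failed
    my_failed |= learned
    return learned

# A hears from B that D and X have failed; only D is one of A's neighbors.
neighbors = {"B", "C", "D"}
failed = set()
print(on_ack(failed, neighbors, make_ack("B", {"D", "X"})))  # {'D'}
```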

9. Aliveness Techniques • Boosting • When a node detects failure of a neighbor, D, it announces to all other nodes that have D as their neighbor • Best case: completely connected graph of degree d B C A D
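
Boosting can be sketched as a one-shot announcement to every node holding the failed node as a neighbor; the data structures and the `deliver` callback are hypothetical, and the sketch assumes the detector knows the failed node's backpointers:

```python
def boost(detector, failed_node, backpointers, deliver):
    """Announce failed_node to every node that has it as a neighbor.

    backpointers maps a node to the set of nodes that list it as a
    neighbor; deliver(node, msg) is the (assumed) network send primitive.
    """
    for node in backpointers.get(failed_node, set()) - {detector}:
        deliver(node, {"type": "boost", "failed": failed_node, "from": detector})

# A detects D's failure; B and C also have D as a neighbor and get boosted.
received = []
backpointers = {"D": {"A", "B", "C"}}
boost("A", "D", backpointers,
      lambda node, msg: received.append((node, msg["failed"])))
print(sorted(received))  # [('B', 'D'), ('C', 'D')]
```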

10. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

11. Performance Evaluation • Case studies • d-regular network • Chord lookup protocol • Chord event driven simulator • Gnutella join/leave trace • Packet loss rate • Control overhead • Planetlab experiments • Planetlab event driven simulator • False positives

12. Loss Rate – Gnutella • Loss Rate = # Lookup timeouts / # Lookups • 20 lookups per second • Boosting (simple): no additional state

13. Loss Rate – Gnutella • Wait Tto seconds before deciding that a probe is lost • Require multiple probe losses before deciding that a neighbor has failed
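
The rule of requiring multiple consecutive probe losses before declaring a neighbor failed can be sketched as a small detector; the class and parameter names are illustrative assumptions:

```python
class FailureDetector:
    """Declare a neighbor failed only after k consecutive probe losses.

    Each probe has already waited its timeout before being counted as
    lost, so a larger k trades slower detection for fewer false positives.
    """
    def __init__(self, k: int):
        self.k = k
        self.misses = {}  # neighbor -> consecutive probe losses

    def on_probe_result(self, neighbor, acked: bool) -> bool:
        """Record one probe outcome; return True when the neighbor is declared failed."""
        if acked:
            self.misses[neighbor] = 0  # any ack resets the loss streak
            return False
        self.misses[neighbor] = self.misses.get(neighbor, 0) + 1
        return self.misses[neighbor] >= self.k

fd = FailureDetector(k=3)
outcomes = (False, False, True, False, False, False)  # probe acks for neighbor D
results = [fd.on_probe_result("D", acked) for acked in outcomes]
print(results)  # [False, False, False, False, False, True]
```

The lone ack in the middle resets the streak, so only the final run of three losses triggers a failure declaration.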

14. Overhead (count) – Gnutella • Constant probing overhead (1 probe/second) • Small difference due to boost messages

15. Overhead (bps) – Gnutella • Boosting with backpointers incurs 1.29 times the baseline overhead

16. Overhead (bps) – Gnutella • Send backpointers every 10 probe acks

17. False Positive – Planetlab • Propagation of positive information • Most false positives occur at probe timeout thresholds TO = 0 or 1, so increasing the probe timeout threshold reduces them

18. Overhead (bps) – Planetlab • Overhead from boost messages and positive information correlates with the loss rate

19. Outline • Motivation • Network Model and Assumptions • Keep-alive Techniques • Performance Evaluation • Conclusion

20. Conclusion • Examined three keep-alive techniques in Chord with a Gnutella join/leave trace • By carefully designing keep-alive algorithms, it is possible to significantly reduce packet loss probability • Probability of a false positive for boosting with backpointers is < 0.01 at a loss rate of ~8.6%, achieved by propagating positive information and increasing the probe timeout threshold

21. Future Work • Evaluate keep-alive schemes under massive failures and churn • Optimal control resource allocation strategy for a given network topology, failure rate, and load distribution • Other applications of keep-alive techniques?