
An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services


Presentation Transcript


  1. An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services Jingyu Zhou *§, Lingkun Chu*, Tao Yang*§ * Ask Jeeves §University of California at Santa Barbara

  2. Outline • Background & motivation • Membership protocol design • Implementation • Evaluation • Related work • Conclusion

  3. Background • Large-scale 24x7 Internet services • Thousands of machines connected by many level-2 and level-3 switches (e.g. 10,000 at Ask Jeeves) • Multi-tiered architecture with data partitioning and replication • Some machines are frequently unavailable due to failures, operational errors, and scheduled service updates.

  4. Network Topology in Service Clusters • Multiple hosting centers across Internet • In a hosting center • Thousands of nodes • Many level-2 and level-3 switches • Complex switch topology

  5. Motivation • Membership protocol • Yellow page directory – discovery of services and their attributes • Server aliveness – quick fault detection • Challenges • Efficiency • Scalability • Fast detection

  6. Fast Failure Detection is Crucial • Online auction service, even with replication • Failure of one replica: 7s – 12s • Service unavailable: 10s – 13s

  7. Communication Cost for Fast Detection • Communication requirement: propagate to all nodes • Fast detection needs a higher packet rate • High bandwidth • Higher hardware cost • More chances of failure

  8. Design Requirements of Membership Protocol for Large-scale Clusters • Efficient: bandwidth, # of packets • Topology-adaptive: localize traffic within switches • Scalable: scale to tens of thousands of nodes • Fast failure detection and information propagation.

  9. Approaches • Centralized • Easy to implement • Single point of failure, not scalable, extra delay • Distributed • All-to-all broadcast [Shen’01]: doesn’t scale well • Gossip [Renesse’98]: probabilistic guarantee • Ring: slow to handle multiple failures • None of these consider network topology

  10. TAMP: Topology-Adaptive Membership Protocol • Topology-awareness • Form a hierarchical tree according to network topology • Topology-adaptiveness • Network changes: add/remove/move switches • Service changes: add/remove/move nodes • Exploit TTL field in IP packet
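
A minimal sketch, not from the paper, of how multicast traffic can be scoped with the IP TTL field in Python; the group address, port, and TTL values are illustrative placeholders.

    import socket
    import struct

    GROUP, PORT = "239.1.1.1", 5000   # hypothetical multicast channel for one level

    def scoped_sender(ttl):
        """Create a UDP multicast sender whose packets are dropped after `ttl` hops."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # A low TTL keeps membership traffic local to nearby switches; higher
        # levels of the tree use progressively larger TTL values.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
        return sock

    scoped_sender(ttl=1).sendto(b"heartbeat", (GROUP, PORT))   # level-0 traffic stays local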

  11. Hierarchical Tree Formation Algorithm • 1. Form small multicast groups with low TTL values; • 2. Each multicast group elects a leader; • 3. Group leaders form higher-level groups with larger TTL values; • 4. Stop when the max TTL value is reached; otherwise, go to Step 2.
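
An illustrative, self-contained simulation of this level-by-level loop (the names, group size, and TTL doubling are my assumptions, not the paper's); with 9 nodes and groups of 3 it mirrors the example on the next slide.

    MAX_TTL = 8   # assumed top of the TTL ladder

    def form_hierarchy(nodes, group_size=3):
        """Group nodes, pick each group's max id as leader, and promote leaders upward."""
        level, ttl, members, tree = 0, 1, list(nodes), []
        while members:
            groups = [members[i:i + group_size] for i in range(0, len(members), group_size)]
            leaders = [max(g) for g in groups]          # stand-in for a per-group election
            tree.append((level, ttl, groups, leaders))
            if len(leaders) == 1 or ttl >= MAX_TTL:     # single root, or TTL budget exhausted
                break
            members, level, ttl = leaders, level + 1, ttl * 2
        return tree

    for level, ttl, groups, leaders in form_hierarchy(range(9)):
        print(f"level {level} (TTL {ttl}): {groups} -> leaders {leaders}")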

  12. An Example • 3 Level-3 switches with 9 nodes

  13. Node Joining Procedure • Purpose • Find/elect a leader • Exchange membership information • Process • 1. Join a channel and listen; • 2. If a leader exists, stop and bootstrap from the leader; • 3. Otherwise, elect a leader (bully algorithm); • 4. If this node is the leader, increase the channel ID & TTL and go to Step 1.
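
A small, self-contained sketch of this joining loop under stated assumptions: the Channel class, the use of the highest node id as election winner, and the return values are illustrative, not the paper's API.

    class Channel:
        """One multicast channel; `leader` is None until someone is elected."""
        def __init__(self):
            self.members, self.leader = set(), None

    def join(node_id, channels):
        for level, chan in enumerate(channels):      # successive channels = larger TTL scopes
            chan.members.add(node_id)                # 1. join the channel and listen
            if chan.leader is not None:              # 2. a leader already exists:
                return level, chan.leader            #    bootstrap from it and stop
            chan.leader = max(chan.members)          # 3. bully-style election: highest id wins
            if chan.leader != node_id:
                return level, chan.leader
            # 4. this node is the leader: move to the next channel and repeat
        return len(channels) - 1, node_id

    channels = [Channel() for _ in range(3)]
    for n in (1, 2, 3):
        print(f"node {n} settles at (level, leader) = {join(n, channels)}")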

  14. Properties of TAMP • Upward propagation guarantee • A node is always aware of its leader • Messages can always be propagated to nodes at higher levels • Downward propagation guarantee • A node at level i must know the leaders of levels i-1, i-2, …, 0 • Messages can always be propagated to lower-level nodes • Eventual convergence • The view of every node converges

  15. Update protocol when the cluster structure changes • Heartbeats for failure detection • When a leader receives an update, it multicasts the update both up and down the tree
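
A minimal bookkeeping sketch of the heartbeat side, assuming the one-packet-per-second rate and the five-consecutive-loss rule from the evaluation settings; a leader would then multicast any detected change up and down the tree.

    import time

    HEARTBEAT_PERIOD = 1.0   # one heartbeat per second (evaluation setting)
    MISSED_LIMIT = 5         # declared dead after 5 consecutive losses (evaluation setting)

    last_seen = {}           # node id -> timestamp of the last heartbeat received

    def on_heartbeat(node_id):
        last_seen[node_id] = time.time()

    def dead_nodes(now=None):
        """Nodes whose heartbeats have been missing for too long."""
        now = now if now is not None else time.time()
        return [n for n, t in last_seen.items() if now - t > MISSED_LIMIT * HEARTBEAT_PERIOD]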

  16. Fault Tolerance Techniques • Leader failure: backup leader or election • Network partition failure • Time out all nodes managed by a failed leader • Hierarchical timeouts: longer timeouts for higher levels • Packet loss • Leaders exchange deltas since the last update • Piggyback the last three changes
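
A sketch of the hierarchical-timeout idea; the base timeout and growth factor are assumptions chosen only to show that higher levels wait longer before declaring a leader dead.

    BASE_TIMEOUT = 5.0   # assumed level-0 timeout in seconds

    def level_timeout(level, factor=2.0):
        """Higher levels tolerate longer silences before timing out a leader."""
        return BASE_TIMEOUT * factor ** level

    for lvl in range(4):
        print(f"level {lvl}: timeout {level_timeout(lvl):.0f}s")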

  17. Scalability Analysis • Protocols: all-to-all, gossip, and TAMP • Basic performance factors • Failure detection time (T_fail_detect) • View convergence time (T_converge) • Communication cost in terms of bandwidth (B)

  18. Scalability Analysis (Cont.) • Two metrics • BDP = B × T_fail_detect: lower failure detection time at low bandwidth is desired • BCP = B × T_converge: lower convergence time at low bandwidth is desired • n: total # of nodes; k: group size (a constant)
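
Combining these metrics with the growth trends reported later in the evaluation (bandwidth roughly quadratic for all-to-all and gossip, near-linear for TAMP; detection and convergence times roughly constant for all-to-all and TAMP, logarithmic for gossip) suggests, as a rough reading rather than a result quoted from the paper:

    \mathrm{BDP}_{\text{all-to-all}} = O(n^2), \qquad
    \mathrm{BDP}_{\text{gossip}} = O(n^2 \log n), \qquad
    \mathrm{BDP}_{\text{TAMP}} = O(n)

so TAMP is the only one of the three whose bandwidth-detection product stays close to linear in the cluster size.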

  19. Implementation • Inside the Neptune middleware [Shen’01] – programming and runtime support for building cluster-based Internet services • Can be easily coupled with other clustering frameworks

  20. Evaluation: Objectives & Settings • Metrics • Bandwidth • Failure detection time • View convergence time • Hardware settings • 100 dual PIII 1.4GHz nodes • 2 switches connected by a Gigabit switch • Protocol-related settings • Frequency: 1 packet/s • A node is deemed dead after 5 consecutive losses • Gossip mistake probability: 0.1% • # of nodes: 20 – 100 in steps of 20
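
A quick arithmetic check on these settings (my computation, not a number quoted on the slides): with one heartbeat per second and a node declared dead after 5 consecutive losses, the nominal local detection latency is about

    T_{\mathrm{fail\_detect}} \approx 5 \times 1\,\mathrm{s} = 5\,\mathrm{s}

before any propagation delay through the hierarchy, which is consistent with the detection time staying roughly constant as the cluster grows.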

  21. Bandwidth Consumption • All-to-All & Gossip: quadratic increase • TAMP: close to linear

  22. Failure Detection Time • Gossip: log(N) increase • All-to-All & TAMP: constant

  23. View Convergence Time • Gossip: log(N) increase • All-to-All & TAMP: constant

  24. Related Work • Membership & failure detection • [Chandra’96], [Fetzer’99], [Fetzer’01], [Neiger’96], and [Stok’94] • Gossip-style protocols • SCAMP, [Kempe’01], and [Renesse’98] • High-availability system (e.g., HA-Linux, Linux Heartbeat) • Cluster-based network services • TACC, Porcupine, Neptune, Ninja • Resource monitoring: Ganglia, NWS, MDS2

  25. Contributions & Conclusions • TAMP is a highly efficient and scalable membership protocol for very large clusters • Exploits the TTL field in IP packets for a topology-adaptive design • Verified through property analysis and experimentation • Deployed on Ask Jeeves clusters with thousands of machines

  26. Questions?
