
Reliable and Highly Available Distributed Publish/Subscribe Systems


Presentation Transcript


  1. Reliable and Highly Available Distributed Publish/Subscribe Systems. Reza Sherafat and Hans-Arno Jacobsen, University of Toronto. September 2009, Symposium on Reliable Distributed Systems (SRDS'09).

  2. Distributed Publish/Subscribe Systems • Many-to-many communication • High-level operations: “subscribe” and “publish” • Decoupling between sources and sinks • Flexible content-based messaging [Figure: publishers (Pub) and subscribers (Sub) exchanging messages through a pub/sub broker overlay]

  3. Agenda • Existing approaches • δ-Fault-Tolerance • Architecture • Reliable publication delivery protocol • Experimental results

  4. Store-and-Forward • A copy is first preserved on disk and then forwarded • Intermediate hops send an ACK to the previous hop after preserving the message • ACKed copies can be discarded from disk • Upon failure, unacknowledged copies survive on disk and are re-transmitted after recovery • This ensures reliable delivery but may delay messages while the machine is down [Figure: a chain of brokers forwarding a publication hop by hop, with ACKs flowing back to each previous hop]
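A minimal sketch of this store-and-forward pattern is shown below in Java (the language the authors used for their implementation). The class and method names (StoreAndForwardHop, persistToDisk, forwardToNextHop, and so on) are illustrative assumptions, not part of the actual system:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of a store-and-forward hop: persist before forwarding, discard on ACK,
    // re-send unacknowledged copies after recovery. All names are illustrative.
    public class StoreAndForwardHop {

        // In-memory view of the copies still awaiting an ACK from the next hop;
        // after a crash this map would be rebuilt from the copies left on disk.
        private final Map<Long, byte[]> unacked = new ConcurrentHashMap<>();

        // A message arrives from the previous hop.
        public void onMessage(long seq, byte[] payload) {
            persistToDisk(seq, payload);      // 1. preserve a copy on disk first
            unacked.put(seq, payload);
            ackPreviousHop(seq);              // 2. the previous hop may now discard its copy
            forwardToNextHop(seq, payload);   // 3. then forward downstream
        }

        // The next hop confirms it has persisted the message.
        public void onAck(long seq) {
            unacked.remove(seq);
            removeFromDisk(seq);              // ACKed copies can be dismissed from disk
        }

        // After recovering from a crash: unacknowledged copies survive and are re-sent.
        public void onRecovery() {
            unacked.forEach(this::forwardToNextHop);
        }

        // Placeholders for stable storage and networking (not shown here).
        private void persistToDisk(long seq, byte[] payload) { }
        private void removeFromDisk(long seq) { }
        private void ackPreviousHop(long seq) { }
        private void forwardToNextHop(long seq, byte[] payload) { }
    }

The key design point, as stated on the slide, is that the disk write happens before the ACK and before forwarding, so a crash never loses a message that the previous hop has already discarded.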

  5. Mesh-Based Overlay Networks [Snoeren et al., SOSP 2001] • Use a mesh network to concurrently forward messages on disjoint paths • Upon failure, the message is delivered along an alternative route • Pros: minimal impact on delivery delay • Cons: imposes additional traffic and may cause duplicate delivery [Figure: a publication forwarded concurrently along two disjoint paths in a mesh]

  6. Replica-Based Approach [Bhola et al., DSN 2002] • Replicas are grouped into virtual nodes • Replicas have identical routing information • We compare against this approach in the evaluation section [Figure: a virtual node composed of several replica brokers]

  7. Next • Existing approaches • δ-Fault-Tolerance • Architecture • Reliable publication delivery protocol • Experimental results

  8. δ-Fault-Tolerance • In a distributed messaging system: failed brokers may be down for a long time, concurrent failures do occur, and reliable message delivery is essential • Configuration parameter δ: a δ-fault-tolerant pub/sub system ensures reliable delivery under up to δ concurrent crash failures • Reliability means exactly-once delivery of publications to matching subscribers and per-source FIFO ordered message delivery
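As a sketch of what the two reliability properties mean at a subscriber endpoint, the hypothetical checker below tracks one sequence counter per publisher. It only illustrates the properties; it is not the system's actual delivery logic:

    import java.util.HashMap;
    import java.util.Map;

    // Illustration of the two reliability properties: exactly-once delivery and
    // per-source FIFO order, checked with one sequence counter per publisher.
    public class DeliveryChecker {

        private final Map<String, Long> lastDelivered = new HashMap<>(); // publisherId -> last seq

        // Returns true if the publication may be handed to the application now.
        public boolean accept(String publisherId, long seq) {
            long last = lastDelivered.getOrDefault(publisherId, 0L);
            if (seq <= last) {
                return false;   // duplicate: would violate exactly-once delivery
            }
            if (seq != last + 1) {
                return false;   // gap: delivering now would violate per-source FIFO order
            }
            lastDelivered.put(publisherId, seq);
            return true;
        }
    }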

  9. Next • Existing approaches • δ-Fault-Tolerance • Architecture • Reliable publication delivery protocols • Experimental results

  10. Architecture • Brokers are organized in a tree-based overlay network • In our approach, δ-fault-tolerance is closely related to how much brokers know about the broker tree • (δ+1)-neighborhood: the brokers within distance δ+1 • This information is stored in a data structure called the topology map • Topology maps are updated as brokers enter and leave the network [Figure: nested 1-, 2-, and 3-neighborhoods around a broker in the tree]
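One possible shape for such a topology map is sketched below: an adjacency map over nearby brokers plus a bounded breadth-first search that yields a broker's (δ+1)-neighborhood. The representation and names are assumptions for illustration, not the actual data structure:

    import java.util.*;

    // Illustrative topology map: adjacency information about nearby brokers, plus a
    // bounded BFS that yields the (delta+1)-neighborhood of a broker.
    public class TopologyMap {

        private final Map<String, Set<String>> links = new HashMap<>(); // brokerId -> neighbors

        public void addLink(String a, String b) {
            links.computeIfAbsent(a, k -> new HashSet<>()).add(b);
            links.computeIfAbsent(b, k -> new HashSet<>()).add(a);
        }

        // Brokers within maxHops of root, e.g. neighborhood("A", delta + 1).
        public Map<String, Integer> neighborhood(String root, int maxHops) {
            Map<String, Integer> dist = new HashMap<>();
            Deque<String> queue = new ArrayDeque<>();
            dist.put(root, 0);
            queue.add(root);
            while (!queue.isEmpty()) {
                String broker = queue.poll();
                int d = dist.get(broker);
                if (d == maxHops) continue;        // do not expand beyond the bound
                for (String n : links.getOrDefault(broker, Set.of())) {
                    if (!dist.containsKey(n)) {
                        dist.put(n, d + 1);
                        queue.add(n);
                    }
                }
            }
            return dist;                            // brokerId -> hop distance (0..maxHops)
        }
    }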

  11. Join Algorithm • The joining broker connects to a joinpoint • A joinRequest message is sent to the joinpoint • The joinpoint replies with a subset of its topology map • The joinRequest is propagated in the network • Receiving brokers update their topology maps • Confirmation messages propagated from the edge brokers are sent back • When the joining broker receives the confirmation, the join is complete [Figure: joining broker, joinpoint, and the δ- and (δ+1)-neighborhoods involved in the join]
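A sketch of the joinpoint's side of this handshake is shown below. The assumption that the reply contains the brokers within δ hops of the joinpoint, as well as all class and method names, is illustrative only:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Sketch of a joinpoint handling a joinRequest: reply with part of its topology
    // map, record the new broker, and propagate the request so neighbors can update
    // their own maps (confirmations then flow back from the edge brokers).
    public class Joinpoint {

        private final Map<String, Integer> topologyMap = new HashMap<>(); // brokerId -> hop distance
        private final int delta;

        public Joinpoint(int delta) { this.delta = delta; }

        public void onJoinRequest(String joiningBroker) {
            // 1. Reply with a subset of the topology map (assumed here: brokers within delta hops).
            Map<String, Integer> subset = topologyMap.entrySet().stream()
                    .filter(e -> e.getValue() <= delta)
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
            reply(joiningBroker, subset);

            // 2. Record the new broker and propagate the joinRequest; receiving brokers
            //    update their topology maps and the edge brokers send confirmations back.
            topologyMap.put(joiningBroker, 1);
            propagateJoinRequest(joiningBroker);
        }

        // Placeholders for the messaging layer (not shown here).
        private void reply(String broker, Map<String, Integer> subset) { }
        private void propagateJoinRequest(String joiningBroker) { }
    }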

  12. Subscription Routing Information • The subscription routing protocol constructs forwarding paths • Subscription messages encapsulate: pred, conjunctive predicates specifying the client's interests, and from, a broker ID pointing back to the broker δ+1 hops closer to the subscriber • Subscriptions are sent hop by hop throughout the network • Brokers update from as the message is forwarded • Brokers handle confirmation messages as in the join protocol • Confirmed subscriptions are inserted into the subscription routing table [Figure: brokers A, B, C, D, E with δ=2; the from field is updated as the subscription is forwarded]
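The sketch below shows one possible in-memory form of a subscription (its pred and from fields) and of the routing table of confirmed subscriptions. The Java types are assumptions for illustration:

    import java.util.*;
    import java.util.function.Predicate;

    // Illustrative subscription record and routing table. The 'from' field points
    // back to the broker delta+1 hops closer to the subscriber; names and types
    // are assumptions, not the system's actual classes.
    public class SubscriptionRouting {

        public static final class Subscription {
            final String subId;
            final Map<String, Predicate<Object>> pred; // attribute -> conjunct predicate
            String from;                                // brokerId delta+1 hops toward the subscriber

            Subscription(String subId, Map<String, Predicate<Object>> pred, String from) {
                this.subId = subId;
                this.pred = pred;
                this.from = from;
            }

            // True if every conjunct matches the publication's attributes.
            boolean matches(Map<String, Object> publication) {
                return pred.entrySet().stream().allMatch(
                        e -> publication.containsKey(e.getKey())
                             && e.getValue().test(publication.get(e.getKey())));
            }
        }

        // Confirmed subscriptions, keyed by subscription id.
        private final Map<String, Subscription> routingTable = new HashMap<>();

        // Called when a forwarded subscription is confirmed at this broker.
        public void insertConfirmed(Subscription s) {
            routingTable.put(s.subId, s);
        }

        // Collect the 'from' brokers of all subscriptions matching a publication.
        public Set<String> recipientSet(Map<String, Object> publication) {
            Set<String> out = new HashSet<>();
            for (Subscription s : routingTable.values()) {
                if (s.matches(publication)) out.add(s.from);
            }
            return out;
        }
    }

In this reading, recipientSet() collects exactly the from brokers that slide 14 inserts into a publication's recipientSet.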

  13. Next • Existing approaches • δ-Fault-Tolerance • Architecture • Reliable publication forwarding protocols • Experimental results

  14. Publication Forwarding Algorithm (No-Failure Case) • Received publications are placed in a FIFO message queue and kept until processing is complete • Using subscription info, the subscriptions matching the publication are identified • The matching subscriptions' from fields are inserted into the recipientSet • Using the topology map, the publication is sent to the closest available brokers towards the matching subscribers (the outgoingSet) • Receiving downstream brokers forward the publication in the same way until it is delivered to subscribers • Confirmations are received from all downstream brokers • Clean-up: once all confirmations arrive, the publication is discarded from the queue [Figure: broker A's message queue and its (δ+1)-neighborhood, with the publication flowing from upstream to downstream]
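Putting these steps together, a rough Java sketch of the failure-free forwarding loop might look as follows. Here matchSubscriptions, closestAvailableBroker, and send stand in for the matching engine, the topology map, and the broker links; all are assumptions:

    import java.util.*;

    // Sketch of the failure-free forwarding path: queue the publication, send it
    // toward each matching subscriber's side, and discard it once every downstream
    // broker has confirmed. Names are illustrative only.
    public class PublicationForwarder {

        private static final class PendingPub {
            final long seq;
            final Map<String, Object> pub;
            final Set<String> awaitingConfirmation = new HashSet<>();
            PendingPub(long seq, Map<String, Object> pub) { this.seq = seq; this.pub = pub; }
        }

        // FIFO message queue of publications awaiting downstream confirmations.
        private final Map<Long, PendingPub> queue = new LinkedHashMap<>();

        // A publication arrives from upstream.
        public void onPublication(long seq, Map<String, Object> pub) {
            PendingPub pending = new PendingPub(seq, pub);
            queue.put(seq, pending);                                     // keep until fully confirmed
            for (String subscriberSide : matchSubscriptions(pub)) {      // the recipientSet
                String nextHop = closestAvailableBroker(subscriberSide); // from the topology map
                pending.awaitingConfirmation.add(nextHop);               // part of the outgoingSet
                send(nextHop, seq, pub);
            }
        }

        // A downstream broker confirms delivery toward its subscribers.
        public void onConfirmation(long seq, String fromBroker) {
            PendingPub pending = queue.get(seq);
            if (pending == null) return;
            pending.awaitingConfirmation.remove(fromBroker);
            if (pending.awaitingConfirmation.isEmpty()) {
                queue.remove(seq);                                       // clean-up: discard the publication
            }
        }

        // Placeholders for the matching engine, topology map, and links (not shown here).
        private Set<String> matchSubscriptions(Map<String, Object> pub) { return Set.of(); }
        private String closestAvailableBroker(String towardsBroker) { return towardsBroker; }
        private void send(String broker, long seq, Map<String, Object> pub) { }
    }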

  15. Publication Forwarding Algorithm (Failure Case) • Brokers use heartbeats to monitor the availability of their connected peers • Once a failure is detected, the broker reconnects the topology by creating new links to the downstream neighbors of the failed broker • Unconfirmed publications are re-transmitted from the message queue • Subsequent publications are forwarded via the new links, bypassing the failed brokers • Multiple concurrent failures (up to δ) are handled similarly • In the worst case, δ brokers in a row have failed [Figure: broker A re-routes publications around failed downstream brokers]
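A corresponding sketch of the failure-handling step, again with hypothetical names (nextBrokerPastFailure, connect, send) standing in for the topology map and the broker links:

    import java.util.*;

    // Sketch of failover: when heartbeats declare a downstream neighbor failed, pick
    // the next broker past the failure from the topology map, create a new link, and
    // re-transmit every unconfirmed publication over it in FIFO order.
    public class FailureHandler {

        // Unconfirmed publications per downstream neighbor: neighbor -> (seq -> payload).
        private final Map<String, SortedMap<Long, byte[]>> unconfirmed = new HashMap<>();

        public void recordSent(String neighbor, long seq, byte[] pub) {
            unconfirmed.computeIfAbsent(neighbor, k -> new TreeMap<>()).put(seq, pub);
        }

        public void onNeighborFailed(String failedBroker) {
            String bypass = nextBrokerPastFailure(failedBroker);       // downstream neighbor of the failed broker
            connect(bypass);                                           // reconnect the topology
            SortedMap<Long, byte[]> pending = unconfirmed.remove(failedBroker);
            if (pending == null) return;
            unconfirmed.computeIfAbsent(bypass, k -> new TreeMap<>()).putAll(pending);
            pending.forEach((seq, pub) -> send(bypass, seq, pub));     // re-transmit from the queue
            // Subsequent publications are simply forwarded via the new link, bypassing the failure.
        }

        // Placeholders for the topology map and the messaging layer (not shown here).
        private String nextBrokerPastFailure(String failedBroker) { return "bypass-of-" + failedBroker; }
        private void connect(String broker) { }
        private void send(String broker, long seq, byte[] pub) { }
    }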

  16. Eliminating the Need for Confirmation Messages • For each publication sent over a link, a confirmation message is sent back, which increases network traffic • We instead use an aggregated acknowledgement mechanism called depth acknowledgements (DACK) • DACKs eliminate the need for per-publication confirmation messages

  17. Discarding Publications Using DACK Messages • B and C keep track of the highest sequence number they have received and discarded (prefix-based) from A, and periodically report it upstream in DACK messages • Brokers append their own information to DACKs and also relay portions of their neighbors' DACK messages • For each publication, A evaluates the safety conditions for all brokers in the publication's recipientSet • Safety conditions: all intermediate brokers have reported an arrived seq# higher than the publication's seq#, OR some intermediate broker has reported a discarded prefix seq# higher than the publication's seq# (necessary when there are failures) [Figure: brokers A, B, C; DACK messages carrying arrived:{seq(A), …} and discarded:{seq'(A), …} entries flow upstream from C through B to A]
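A sketch of how an upstream broker might evaluate these two safety conditions from the DACK reports it has seen; the data layout and all names are assumptions:

    import java.util.*;

    // Sketch of DACK bookkeeping at an upstream broker: for each downstream broker,
    // remember the highest arrived and highest prefix-discarded sequence numbers it
    // has reported, and purge a publication once the safety conditions hold.
    public class DackTracker {

        private final Map<String, Long> arrived = new HashMap<>();   // broker -> highest arrived seq#
        private final Map<String, Long> discarded = new HashMap<>(); // broker -> discarded prefix seq#

        // Apply a (possibly relayed) DACK entry reported by a downstream broker.
        public void onDack(String broker, long arrivedSeq, long discardedPrefixSeq) {
            arrived.merge(broker, arrivedSeq, Math::max);
            discarded.merge(broker, discardedPrefixSeq, Math::max);
        }

        // Safety check for purging the publication with sequence number 'seq':
        //  (1) every intermediate broker has reported an arrived seq# at or beyond 'seq', OR
        //  (2) some intermediate broker has reported a discarded prefix covering 'seq'
        //      (needed when failures re-route traffic around a broker).
        public boolean safeToDiscard(long seq, Set<String> intermediateBrokers) {
            boolean allArrived = intermediateBrokers.stream()
                    .allMatch(b -> arrived.getOrDefault(b, -1L) >= seq);
            boolean anyDiscarded = intermediateBrokers.stream()
                    .anyMatch(b -> discarded.getOrDefault(b, -1L) >= seq);
            return allArrived || anyDiscarded;
        }
    }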

  18. Next • Existing approaches • δ-Fault-Tolerance • Architecture • Reliable publication forwarding protocols • Experimental results

  19. Experimental Setup • Algorithms implemented in Java • The system runs on a cluster: 21 nodes, each with 4 cores, connected by Gigabit Ethernet • Topology setup (δ=3): 83 brokers, 2600 subscriptions, 26 publishers at varied publication rates • We inject failures at brokers R1, R2, and R3 and perform measurements [Figure: the broker topology with the measured brokers R1, R2, R3 highlighted]

  20. Publication Delivery Delay • Impact of failures on publication delivery delay • Use a stream of publications (10 msg/s) • Measure the delivery delay between publishing and subscribing endpoints • 3 separate runs with different numbers of simultaneous failures • After a short-lived jump, the delivery delay quickly returns to normal • The difference corresponds to the failure-detection timeout [Figure: delivery delay over time for runs with 1, 2, and 3 simultaneous failures]

  21. Change in Load After Failures • A non-faulty broker's load after failures: • Input msg traffic: no change • Output msg traffic: increase • CPU utilization: increase • Output rate/CPU utilization is affected by nearby failures [Figure: input message rate, output message rate, and CPU load over time as R3 fails; R2's input traffic stabilizes at exactly the same rate, its output traffic stabilizes at a slightly higher rate, with spikes after brokers reconnect; R1 sees lower spikes and no change]

  22. Comparison with the Replica-Based Approach • Topology setup: our approach with δ=2 versus the replica-based approach with 2 replicas • We consider the situation after 2 failures (R2 and R3 fail) • We compare the load on R1 after the failures occur • In our approach, the CPU load on R1 is about 30% lower [Figure: the two topologies side by side (our approach vs. virtual nodes of replicas), with R1, R2, R3 marked and the 30% CPU load difference on R1]

  23. Conclusions • Our system delivers a reliable pub/sub service in the face of up to δ concurrent broker failures • We also proposed an optimization: aggregated acknowledgement (DACK) messages that reduce network traffic • Ongoing and future work: explore multi-path forwarding • http://research.msrg.utoronto.ca/Padres/WebHome

  24. Questions? Thanks!

  25. Backup slides …

  26. Sample DACK Propagation and Publication Purging (δ=3) [Animation: ten steps illustrating the first and second safety conditions; legend: direction of pub forwarding, node holds pub in MQ, node discards pub, node receives pub]

  27. Publication Propagation and Purging Using DACK Info (δ=3) [Animation: ten steps showing publication propagation and purging]

  28. Publication Propagation and Purging Using DACK Info with Failures (δ=3)
