
Partition-Tolerant Distributed Publish/Subscribe Systems

Reza Sherafat Kazemzadeh * Hans-Arno Jacobsen, University of Toronto. IEEE SRDS, October 6, 2011.


Presentation Transcript


  1. Partition-Tolerant Distributed Publish/Subscribe Systems. Reza Sherafat Kazemzadeh * Hans-Arno Jacobsen, University of Toronto. IEEE SRDS, October 6, 2011.

  2. Content-Based Publish/Subscribe
     [Figure: stock quote dissemination over a pub/sub broker overlay spanning NY, London, and Toronto, with publishers (P) and subscribers (S). Example subscriptions from Trader 1 and Trader 2: sub = [STOCK=IBM] and sub = [CHANGE>-8%].]

  3. Goals
     Fault tolerance (against concurrent failures):
     • Broker crashes
     • Link failures
     • Recoveries
     Reliability:
     • Publications match subscriptions
     • Per-source in-order delivery
     • Exactly-once delivery (no loss, no duplicates) after some point in time
     Assumptions:
     • Clients are lightweight (the broker network is responsible for reliability)
     • There is a time t after which the system provides guaranteed delivery
     [Figure: publishers (Pub) and subscribers (Sub) connected through a P/S layer providing reliable delivery]

  4. System Architecture
     Tree dissemination networks: one path from source to destination
     • Pros: simple and loop-free; preserves publication order (difficult for non-tree content-based P/S)
     • Cons: trees are highly susceptible to failures
     Primary tree: the initial spanning tree that is formed as brokers join the system
     • Brokers maintain neighborhood knowledge
     • This allows brokers to reconfigure the overlay on the fly after failures
     ∆-neighborhood knowledge: ∆ is a configuration parameter that ensures handling ∆-1 concurrent failures (worst case)
     • Knowledge of other brokers within distance ∆ (join algorithm)
     • Knowledge of routing paths within the neighborhood (subscription propagation algorithm)
     [Figure: 1-, 2-, and 3-neighborhoods of a broker]
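A minimal sketch of what "knowledge of other brokers within distance ∆" amounts to, assuming the primary tree is available as an adjacency dictionary. The function name `delta_neighborhood`, the dictionary representation, and the sample chain of brokers A to F are illustrative assumptions, not part of the talk.

```python
from collections import deque

def delta_neighborhood(tree, broker, delta):
    """Return {neighbor: distance} for brokers within `delta` hops of `broker`.

    `tree` is an adjacency dict {broker: [neighbors]} describing the primary
    tree; this representation is an assumption made for illustration.
    """
    dist = {broker: 0}
    queue = deque([broker])
    while queue:
        node = queue.popleft()
        if dist[node] == delta:
            continue  # do not expand past the neighborhood boundary
        for nxt in tree[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[broker]  # a broker is not part of its own neighborhood
    return dist

# Hypothetical primary tree: a single chain of brokers A - B - C - D - E - F.
chain = ["A", "B", "C", "D", "E", "F"]
tree = {b: [] for b in chain}
for left, right in zip(chain, chain[1:]):
    tree[left].append(right)
    tree[right].append(left)

hood = delta_neighborhood(tree, "C", 2)  # brokers C knows about when delta = 2
```

With ∆ = 2, broker C in this chain knows B and D at one hop and A and E at two hops, which is the knowledge it later relies on to reconnect around a failed neighbor.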

  5. Overview of the Approach. Single chain.

  6. Overlay Management Algorithm: maintains end-to-end connectivity despite failures in the overlay.

  7. Overlay Partitions
     • When the primary tree is set up, brokers communicate with their immediate neighbors in the primary tree through FIFO links.
     • Overlay partitions: a broker crash or link failure creates a “partition”, and the brokers “on the partition” become unreachable from their neighboring brokers.
     • Active connections: at each point in time, a broker tries to maintain a connection to its closest reachable neighbor in the primary tree.
     • Brokers use only active connections.
     [Figure: chain P, F, E, D, C, B, A, S with a failure at D; pid1 = <C, {D}>; active connection to E; brokers on the partition vs. brokers beyond the partition; partition detector.]

  8. Overlay Partitions – 2 Adjacent Failures
     • What if there are more failures, particularly adjacent failures?
     • If ∆ is large enough, the same process can be used for larger partitions.
     [Figure: same chain with failures at D and E; pid1 = <C, {D}> plus pid2 = <C, {D, E}>; active connection to F.]

  9. Overlay Partitions - ∆ Adjacent Failures
     • Worst-case scenario: ∆-neighborhood knowledge is not sufficient to reconnect the overlay.
     • Brokers “on” and “beyond” the partition are unreachable.
     [Figure: same chain with failures at D, E, and F; pid1 = <C, {D}>, pid2 = <C, {D, E}>, plus pid3 = <C, {D, E, F}>; no new active connection.]
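The three partition scenarios (one failure, two adjacent failures, ∆ adjacent failures) can be sketched as a single lookup, assuming a broker simply scans the brokers it knows along one primary-tree branch and connects to the closest one that is still reachable. The function `active_connection`, the tuple encoding of a pid, and the treatment of every failure as "that broker is unreachable" are assumptions for illustration only.

```python
def active_connection(path, failed, delta):
    """Pick the next active connection for the broker at path[0].

    `path` lists brokers along one primary-tree branch, starting with the
    detecting broker. Only the first `delta` brokers after path[0] lie in its
    neighborhood. Returns (target, pid): `target` is the closest reachable
    known broker (None if there is none) and `pid` is
    (detector, brokers on the partition), or None if nothing was bypassed.
    """
    detector, known = path[0], path[1:delta + 1]
    on_partition = []
    for broker in known:
        if broker not in failed:
            pid = (detector, tuple(on_partition)) if on_partition else None
            return broker, pid
        on_partition.append(broker)
    pid = (detector, tuple(on_partition)) if on_partition else None
    return None, pid

# Broker C looking towards D, E, F with delta = 3 (hypothetical values).
path = ["C", "D", "E", "F"]
no_failure = active_connection(path, set(), 3)
one_failure = active_connection(path, {"D"}, 3)
two_failures = active_connection(path, {"D", "E"}, 3)
delta_failures = active_connection(path, {"D", "E", "F"}, 3)
```

In this sketch the last call returns no target, matching the worst case in which ∆ adjacent failures exhaust the ∆-neighborhood and no new active connection can be opened.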

  10. Subscription Propagation Algorithm: how are correct routing tables maintained despite overlay partitions?

  11. Subscription Propagation Algorithm
      • Establishes end-to-end routing state among brokers while taking overlay partitions into account.
      • Subscriptions are dynamically inserted by subscribers and are propagated along branches of the primary tree over active connections.
      • The primary tree is the “basis” for constructing end-to-end forwarding paths.
      • Each subscription contains: SUB = <Id, Predicates, Anchor>
      • Predicates specify the subscriber’s interest, e.g., [STOCK=“IBM”]
      • Anchor is a reference to brokers along the propagation path of the subscription
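As a rough illustration of the SUB = <Id, Predicates, Anchor> triple, the sketch below models a subscription as a small record and matches it against a publication. The field names, the operator table `OPS`, and the encoding of predicates as (operator, value) pairs are assumptions; the slides do not specify a concrete representation.

```python
from dataclasses import dataclass

@dataclass
class Subscription:
    sub_id: str
    predicates: dict  # attribute -> (operator, value); hypothetical encoding
    anchor: str       # broker referenced along the propagation path

# Comparison operators supported by this sketch.
OPS = {
    "=": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}

def matches(sub, publication):
    """True if every predicate of `sub` holds for the publication."""
    for attr, (op, value) in sub.predicates.items():
        if attr not in publication or not OPS[op](publication[attr], value):
            return False
    return True

s1 = Subscription("s1", {"STOCK": ("=", "IBM")}, anchor="A")
s2 = Subscription("s2", {"CHANGE": (">", -8)}, anchor="A")
quote = {"STOCK": "IBM", "CHANGE": -3}
```

Here `quote` matches both subscriptions, while a quote for another stock would match only `s2`, provided its change is above -8.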

  12. Subscription Propagation in Absence of Overlay Partitions
      • The subscription’s anchor field is updated to point to a broker up to ∆ hops closer to the subscriber.
      • Accepting a subscription means adding it to the routing tables.
      • A subscription is accepted (i.e., used for matching) only after confirmations are received.
      • Observation: matching publications are delivered to a subscriber once its local broker accepts the subscription.
      [Figure: subscription s propagates along the chain P, E, D, C, B, A, S, with s.anchor pointing up to ∆ hops back; confirmations (conf) are returned and each broker accepts s (☑).]

  13. Subscription Propagation in Presence of Overlay Partitions
      • Broker B sends s via its active link to bypass the partition and awaits the corresponding confirmation.
      • Once B receives the confirmation and accepts s, it tags the confirmation with the pid of the partitions that s bypassed.
      • Brokers relay this tag in their confirmation messages towards the subscriber’s local broker, which accepts s and stores it along with the tag in its routing table.
      [Figure: chain P, E, D, C, B, A, S in which s travels from B to D over an active link; confirmations tagged with the pid are marked conf*, and the pid tag is also stored along with s (☑*).]
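One way to picture the tagging of confirmations is the simplified chain model below. It assumes that the subscription travels from the subscriber's local broker (`chain[0]`) towards the far end, that a failed broker is bypassed only if the partition holds fewer than ∆ brokers, and that the bypassing broker and every broker closer to the subscriber store the pid tag. The last point is one reading of the slide rather than a confirmed detail, and all names are hypothetical.

```python
def propagate_subscription(chain, failed, delta):
    """Return {broker: frozenset of pid tags stored with the subscription}.

    `chain[0]` is the subscriber's local broker and is assumed to be alive.
    """
    # Forward pass: follow active connections and record bypassed partitions.
    hops = []  # (broker, pid of the partition bypassed right after it, or None)
    i = 0
    while i < len(chain):
        j = i + 1
        while j < len(chain) and chain[j] in failed:
            j += 1
        skipped = tuple(chain[i + 1:j])
        if j < len(chain) and len(skipped) >= delta:
            raise ValueError("partition too large to bypass with this delta")
        pid = (chain[i], skipped) if (skipped and j < len(chain)) else None
        hops.append((chain[i], pid))
        i = j

    # Backward pass: confirmations flow back and accumulate pid tags.
    stored, tags = {}, set()
    for broker, pid in reversed(hops):
        if pid is not None:
            tags.add(pid)
        stored[broker] = frozenset(tags)
    return stored

# Chain from the subscriber side (A) to the publisher side (E); C has failed.
stored = propagate_subscription(["A", "B", "C", "D", "E"], {"C"}, delta=2)
```

In this run, D and E accept the subscription without a tag, while B and A store it together with the tag `("B", ("C",))`, mirroring the ☑ versus ☑* markers on the slide.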

  14. Publication Forwarding Algorithm: how are accepted subscriptions and their partition tags used to achieve reliable delivery?

  15. Publication Forwarding in Absence of Overlay Partitions
      • Forwarding uses only subscriptions that brokers have accepted.
      • Steps in forwarding a publication p:
        • Identify the anchors of accepted subscriptions that match p
        • Determine the active connections towards the matching subscriptions’ anchors
        • Send p on those active connections and wait for confirmations
        • If there are local matching subscribers, deliver to them
        • If no downstream matching subscriber exists, issue a confirmation towards P
        • Once confirmations arrive, discard p and send a conf towards P
      [Figure: publication p is forwarded along the chain P, E, D, C, B, A, S and delivered to local subscribers; confirmations (conf) flow back towards P.]
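The forwarding steps above can be sketched with synchronous method calls standing in for messages and confirmations. The `Broker` class, its attribute names, and the use of `None` as the anchor of a local subscriber are assumptions made for this illustration; a real broker would send asynchronously and handle timeouts.

```python
class Broker:
    def __init__(self, name):
        self.name = name
        self.accepted = []   # (predicate, anchor); anchor None marks a local subscriber
        self.active = {}     # anchor name -> Broker reached over an active connection
        self.delivered = []  # publications handed to local subscribers
        self.pending = []    # publications still awaiting confirmations

    def forward(self, p):
        """Forward p and return True once all downstream confirmations arrive."""
        self.pending.append(p)
        # Step 1: anchors of accepted subscriptions that match p.
        anchors = {anchor for predicate, anchor in self.accepted if predicate(p)}
        # Step 2: active connections towards those anchors (deduplicated).
        targets = {id(self.active[a]): self.active[a] for a in anchors if a is not None}
        # Step 3: send p downstream and collect confirmations.
        confirmations = [b.forward(p) for b in targets.values()]
        # Step 4: deliver to local matching subscribers.
        if None in anchors:
            self.delivered.append(p)
        # Steps 5-6: with no downstream match the list is empty, so confirm at once.
        if all(confirmations):
            self.pending.remove(p)  # p can be discarded
            return True
        return False

def is_ibm(p):
    return p.get("STOCK") == "IBM"

# Hypothetical chain C -> B -> A, with the subscriber attached to A.
a, b, c = Broker("A"), Broker("B"), Broker("C")
a.accepted.append((is_ibm, None))
b.accepted.append((is_ibm, "A"))
b.active["A"] = a
c.accepted.append((is_ibm, "A"))  # anchor may be several hops away
c.active["A"] = b                 # active connection towards that anchor
confirmed = c.forward({"STOCK": "IBM", "PRICE": 120})
```

Note that C's anchor for the subscription is A, yet C sends the publication to B, because B is its active connection in the direction of that anchor.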

  16. Publication Forwarding in Presence of Overlay Partitions
      • Key forwarding invariant to ensure reliability: no stream of publications is delivered to a subscriber after being forwarded by brokers that have not accepted its subscription.
      • Case 1: Sub s has been accepted with no pid. It is safe to bypass intermediate brokers.
      [Figure: on the chain P, E, D, C, B, A, S, publication p bypasses C over the active connection between D and B and is delivered to local subscribers; confirmations (conf) flow back.]

  17. Publication Forwarding (cont’d)
      • Case 2: Sub s has been accepted with some pid.
      • Case 2a: The publisher’s local broker has accepted s, and we ensure that all intermediate forwarding brokers have also done so: it is safe to deliver publications from sources beyond the partition.
      • Depending on when this link was established, either recovery or subscription propagation ensures that C accepts s prior to receiving p.
      [Figure: chain P, E, D, C, B, A, S with publications (p) and confirmations (conf) exchanged over the link between B and D; s is stored with a pid tag (☑*).]

  18. Publication Forwarding (cont’d)
      • Case 2: Subscription s is accepted with some pid tags.
      • Case 2b: The publisher’s broker has not accepted s. It is unsafe to deliver publications from this publisher (invariant).
      [Figure: chain P, E, D, C, B, A, S; publications are tagged with the pid (p*); s was accepted at S with the same pid tag (☑*).]
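The three cases can be summarized as a small decision table. This sketch takes as given a flag saying whether the publisher's broker has accepted the subscription; how the brokers establish that fact at run time (through pid tags on publications) is not modeled here, and the names `Decision`, `classify`, and `may_deliver` are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    CASE_1 = "accepted with no pid: safe to bypass intermediate brokers"
    CASE_2A = "accepted with pid, publisher's broker accepted s: safe to deliver"
    CASE_2B = "accepted with pid, publisher's broker did not accept s: unsafe"

def classify(sub_pid_tags, publisher_broker_accepted):
    """Map a subscription's pid tags and the publisher-side state to a case."""
    if not sub_pid_tags:
        return Decision.CASE_1
    if publisher_broker_accepted:
        return Decision.CASE_2A
    return Decision.CASE_2B

def may_deliver(sub_pid_tags, publisher_broker_accepted):
    """Delivery is allowed in every case except 2b (forwarding invariant)."""
    return classify(sub_pid_tags, publisher_broker_accepted) is not Decision.CASE_2B

pid = ("C", ("D",))  # hypothetical partition identifier
```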

  19. Evaluation: using a mix of simulation and experimental deployments on a large-scale testbed.

  20. Simulation Results
      Size of brokers’ neighborhoods as a function of ∆
      • Network size of 1000
      • Broker fanout of 3
      [Figure: size of ∆-neighborhoods for ∆ = 1, 2, 3, 4]

  21. Impact of Failures on End-to-End Broker Reachability
      • Using a graph simulation tool.
      • Overlay setup: network of 1000 brokers with fanout = 3
      • Failure injection: up to 100 failed brokers; we randomly marked a given number of nodes as failed
      • Measurements: we counted the number of end-to-end broker pairs whose intermediate primary tree path contains ∆ consecutive failed brokers in a chain.
      [Figure: results for ∆ = 1, 2, 3, 4]
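The measurement described above can be approximated with a few lines of graph code. This is a sketch under stated assumptions, not the authors' simulation tool: the overlay is modeled as a complete tree in which broker i has parent (i - 1) // fanout, the count is taken over pairs of non-failed brokers, and the example uses a smaller network (200 brokers, 20 failures) than the 1000 brokers and up to 100 failures reported in the talk, so that it runs quickly.

```python
import random
from itertools import combinations

def ancestors(x, parent):
    """Return [x, parent(x), ..., root]."""
    chain = [x]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain

def path_between(u, v, parent):
    """Brokers on the tree path from u to v, endpoints included."""
    au, av = ancestors(u, parent), ancestors(v, parent)
    common = set(au) & set(av)
    head = [x for x in au if x not in common]  # u up to just below the LCA
    tail = [x for x in av if x not in common]  # v up to just below the LCA
    lca = next(x for x in au if x in common)
    return head + [lca] + tail[::-1]

def longest_failed_run(path, failed):
    best = run = 0
    for broker in path:
        run = run + 1 if broker in failed else 0
        best = max(best, run)
    return best

def disconnected_pairs(n, fanout, failed, delta):
    """Count pairs of live brokers whose path has >= delta consecutive failures."""
    parent = [None] + [(i - 1) // fanout for i in range(1, n)]
    alive = [b for b in range(n) if b not in failed]
    return sum(
        longest_failed_run(path_between(u, v, parent), failed) >= delta
        for u, v in combinations(alive, 2)
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
failed = set(rng.sample(range(200), 20))
counts = {d: disconnected_pairs(200, 3, failed, d) for d in (1, 2, 3, 4)}
```

The counts can only shrink as ∆ grows, since a path with ∆+1 consecutive failures also contains ∆ consecutive failures; how quickly they shrink depends on the topology and the failure pattern.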

  22. Experimental Deployments: Impact of Failures on Pub Delivery
      • 500 brokers deployed on 8-core machines in a cluster
      • Network setup: overlay fanout = 3
      • We measured the aggregate publication delivery count in an interval of 120 s
      • The “Expected” bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers)
      [Figure: delivery counts for ∆ = 1, 2, 3, 4 compared with the Expected bar]

  23. Conclusions
      • We developed a reliable P/S system that tolerates concurrent broker and link failures:
        • Configuration parameter ∆ determines the level of resiliency against failures (in the worst case).
        • Dissemination trees are augmented with neighborhood knowledge.
        • Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures.
      • We studied the performance of the system when the number of failures far exceeds ∆:
        • A small value for ∆ ensures good connectivity.

  24. Questions… Thanks for your attention!

  25. Challenges
      Responsibility on the P/S messaging system
      • Why does the “end-to-end” principle not work?
      • Publishers and subscribers are decoupled and unaware of each other.
      • Routing paths are established by dynamically inserted subscriptions.
      • Subscription propagation is also subject to broker/link failures.
      • Selective delivery makes in-order delivery over redundant paths difficult.
      • Subscribers are only interested in a subset of what is published.
      Approach: subscription propagation algorithm; we use a special form of tree dissemination.

  26. Store-and-Forward
      • A copy is first preserved on disk
      • Intermediate hops send an ACK to the previous hop after preserving the copy
      • ACKed copies can be dismissed from disk
      • Upon failures, unacknowledged copies survive and are re-transmitted after recovery
      • This ensures reliable delivery but may cause delays while the machine is down
      [Figure: publication P forwarded hop by hop “from here” “to here”, with an ack returned at each hop] SRDS'09
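A minimal sketch of the store-and-forward scheme, assuming a dictionary stands in for the on-disk copy and a boolean return value stands in for the ACK. The `Hop` class and its method names are illustrative only.

```python
class Hop:
    def __init__(self, name, next_hop=None):
        self.name = name
        self.next_hop = next_hop
        self.disk = {}  # msg_id -> payload; stands in for persistent storage
        self.up = True

    def receive(self, msg_id, payload):
        """Persist the message, try to pass it on, and ACK the previous hop."""
        if not self.up:
            return False  # no ACK while the machine is down
        self.disk.setdefault(msg_id, payload)  # retransmissions are not duplicated
        self.flush()
        return True

    def flush(self):
        """(Re)transmit unacknowledged copies; dismiss copies once ACKed."""
        if self.next_hop is None:
            return  # the final hop keeps the message for delivery
        for msg_id in list(self.disk):
            if self.next_hop.receive(msg_id, self.disk[msg_id]):
                del self.disk[msg_id]

# Hypothetical three-hop path a -> b -> c with b temporarily down.
c = Hop("c")
b = Hop("b", c)
a = Hop("a", b)
b.up = False
a.receive(1, "quote")
held_while_down = dict(a.disk)  # the copy survives at a while b is down
b.up = True
a.flush()                       # retransmission after recovery
```

The message reaches `c` only after `b` recovers, which illustrates the delay the slide attributes to this approach.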

  27. Mesh-Based Overlay Networks [Snoeren, et al., SOSP 2001]
      • Use a mesh network to concurrently forward messages on disjoint paths
      • Upon failures, the message is delivered using alternative routes
      • Pros: minimal impact on delivery delay
      • Cons: imposes additional traffic and the possibility of duplicate delivery
      [Figure: publication P forwarded “from here” “to here” over disjoint paths]

  28. Replica-based Approach [Bhola, et al., DSN 2002]
      • Replicas are grouped into virtual nodes
      • Replicas have identical routing information
      [Figure: physical machines grouped into a virtual node]

  29. Replica-based Approach [Bhola, et al., DSN 2002]
      • Replicas are grouped into virtual nodes
      • Replicas have identical routing information
      • We compare against this approach
      [Figure: publications (P) forwarded between virtual nodes]

  30. Publication Forwarding (cont’d)
      • Case 2: Sub s has been accepted with some pid.
      • Case 2b (partition barrier): The publisher’s broker has also not accepted s.
      [Figure: chain P, E, D, C, B, A, S with a second subscription r (subscriber R) accepted along the chain (☑r); publications p1 and p2 are labeled by whether they match r and s; p1 is tagged with the pid (p1*); s was accepted at S with the same pid tag (☑*).]

  31. Subscription Propagation with Partitions
      • Partition islands: simply confirm (and accept) subscriptions over available [links] if the partition brokers are reachable from the other side of the partition
      • Intuition:
        • Publications from P may only be lost if they arrive at B
        • But this will not happen since there is no link towards B from F
      • The correctness proof argues on the precedence of acceptance and creation of links
      [Figure: subscriptions and confirmations on the chain P, E, D, C, B, A, S with a lead broker; some brokers will accept during recovery.]

  32. Subscription Propagation with Partition Barriers
      • If a portion of the network that includes publishers is on/beyond a partition barrier, there is no way to communicate the subscription information for the duration of the failures
      • The lead broker “partially confirms” the subscription and tags the confirmation with the partition information
      • Accepting brokers store the partition information along with the subscription
      • This ensures liveness
      [Figure: the lead broker issues a partial confirmation (☑*), which is forwarded towards the subscriber; ∆ hops.]

  33. Publication Forwarding
      • Only accepted subscriptions are stored in the SRT and used for matching
      • At each point in time, a broker has a number of connections to its nearest reachable neighbors
      • This set of active connections may change over time
      Publication forwarding steps:
      • Store the publication in an internal FIFO message queue
      • Match, and compute the set of {from} for the subscriptions that match
      • For each partially confirmed subscription, tag the publication with the partition information
      • Send the publication to the closest reachable neighbors towards {from}
      • Once all confirmations arrive, discard the publication and issue a confirmation towards the publisher
      [Figure: a broker’s publication queue and its (δ+1)-neighborhood, with publishers (P) and subscribers (S)]

  34. Evaluations
      Size of brokers’ neighborhoods as a function of ∆
      [Figure: size of ∆-neighborhoods for ∆ = 1, 2, 3, 4 in two panels: network size of 1000 with broker fanout of 3, and network size of 1000 with broker fanout of 7]

  35. Overlay Links Management
      • Sessions: FIFO communication links between brokers.
      • Active sessions: broker A’s session to B is active if A has no session to another broker C on the path between A and B.
      [Figure: primary tree with ∆ = 2]
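The definition of an active session translates almost directly into a set comprehension. In this sketch, `sessions` is the set of brokers that A currently has sessions to, and `path_to[b]` lists the brokers strictly between A and b on the primary tree; both names and the chain A - B - C - D are assumptions for illustration.

```python
def active_sessions(sessions, path_to):
    """Return the sessions with no other session to a broker on the same path."""
    return {
        b for b in sessions
        if not any(c in sessions for c in path_to[b])
    }

# Hypothetical primary-tree chain A - B - C - D, seen from broker A.
path_to = {"B": [], "C": ["B"], "D": ["B", "C"]}

normal = active_sessions({"B", "C"}, path_to)        # B is closer, so only B is active
after_b_fails = active_sessions({"C", "D"}, path_to)  # session to B is gone
```

Once the session to B disappears, the session to C becomes the active one, while the session to D stays inactive because C lies on the path to D.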

  36. Agenda
      • Challenges of reliability and fault-tolerance in P/S
      • Our approach
      • Topology neighborhood knowledge
      • Subscription propagation
      • Publication forwarding
      • Recovery procedure
      • Evaluation results
      • Conclusions
