
Virtual Synchrony



Presentation Transcript


  1. Virtual Synchrony Krzysztof Ostrowski krzys@cs.cornell.edu

  2. A motivating example [diagram: SENSORS detect DANGER (a WEAPON) and send notifications; a decision is made; the GENERALS issue orders for a well-coordinated response]

  3. Requirements for a distributed system (or a replicated service) Consistent views across components • E.g. vice-generals see the same events as the chief general • Agreement on what messages have been received or delivered • E.g. each general has the same view of the world (consistent state) • Replicas of the distributed service do not diverge • E.g. everyone should have the same view of membership • If a component is unavailable, all others decide up or down together

  4. Requirements for a distributed system (or a replicated service) Consistent actions • E.g. generals don’t contradict each other (don’t issue conflicting orders) • A single service may need to respond to a single request • Responses to independent requests may need to be consistent • But: Consistent ≠ Same (same actions → determinism → no fault tolerance)

  5. System as a set of groups [diagram: client-server groups, a peer group multicasting a decision on an event, and a diffusion group]

  6. A process group • A notion of process group • Members know each other, cooperate • Suspected failures → the group restructures itself, consistently • A failed node is excluded, eventually learns of its failure • A recovered node rejoins • A group maintains a consistent common state • Consistency refers only to members • Membership is a problem on its own... but it CAN be solved

  7. A model of a dynamic process group [diagram: membership views evolve as members crash, recover, and join (e.g. B crashes and later recovers, F joins); state transfers keep the state consistent across views]

  8. The lifecycle of a member (a replica) [state diagram: dead or suspected to be dead → comes up (assumed tabula rasa, all information lost) → alive, but not in group → joins (state is transferred here) → in group, processing requests → fails, becomes unreachable, or unjoins]

  9. The Idea (Roughly) • Take some membership protocol, or an external service • Guarantee consistency in an inductive manner • Start in an identical replicated state • Apply any changes • Atomically, that is either everywhere or nowhere • In the exact same order at all replicas • Consistency of all actions / responses comes as a result • Same events seen: • Rely on ordering + atomicity of failures and message delivery

  10. The Idea (Roughly) • We achieve it by the following primitives: • Lower-level • Create / join / leave a group • Multicasting: FBCAST, CBCAST / ABCAST (the "CATOCS") • Higher-level • Download current state from the existing active replicas • Request / release locks (read-only / read-write) • Update • Read (locally)

  11. Why another approach, though? • We have the whole range of other tools • Transactions: ACID; one-copy serializability with durability • Paxos, Chandra-Toueg (FLP-style consensus schemes) • All kinds of locking schemes, e.g. two-phase locking (2PL) • Virtual Synchrony is a point in the space of solutions • Why are other tools not perfect? • Some are very slow: lots of messages, roundtrip latencies • Some limit asynchrony (e.g. transactions at commit time) • Have us pay a very high cost for features we may not need

  12. A special class of applications • Command / Control • Joint Battlespace Infosphere, telecommunications • Distribution / processing / filtering of data streams • Trading systems, air traffic control systems, stock exchanges, real-time data for banking, risk management • Real-Time Systems • Shop floor process control, medical decision support, power grid • What do they have in common: • Distributed, but coordinated, processing and control • Highly efficient, pipelined distributed data processing

  13. Distributed trading system [diagram: market data feeds supply current pricing to trader clients; pricing DBs hold historical data; analytics run in parallel; a long-haul WAN spooler links Tokyo, London, Zurich, ...] • Availability for historical data • Load balancing and consistent message delivery for price distribution • Parallel execution for analytics

  14. What’s special about these systems? • Need high performance: we must weaken consistency • Data is of a different nature: more dynamic • More relevant online, in context • Storing it persistently often doesn’t make that much sense • Communication-oriented • Online progress: nobody cares about faulty nodes • Faulty nodes can be rebooted • Rebooted nodes are just spare replicas in the common pool

  15. Differences (in a Nutshell)

  16. Back to virtual synchrony • Our plan: • Group membership • Ordered multicast within group membership views • Delivery of new views synchronized with multicast • Higher-level primitives

  17. A process group: joining / leaving [diagram: V1 = {A,B,C}; D sends a request to join, and the group membership protocol sends a new view V2 = {A,B,C,D}; A requests to leave, producing V3 = {B,C,D}] ...OR: a Group Membership Service
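The view sequence on this slide can be written down as a small data structure. A minimal Python sketch (the `View`, `join`, and `leave` names are illustrative, not part of any actual toolkit):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    """A membership view: a totally ordered view id plus a member set.
    All members install views in the same order (V1, V2, V3, ...)."""
    view_id: int
    members: frozenset

def join(view, member):
    """The membership protocol produces the next view with the joiner added."""
    return View(view.view_id + 1, view.members | {member})

def leave(view, member):
    """...or with the leaving (or failed) node removed."""
    return View(view.view_id + 1, view.members - {member})

# The slide's sequence: D joins, then A leaves.
v1 = View(1, frozenset({"A", "B", "C"}))
v2 = join(v1, "D")       # V2 = {A,B,C,D}
v3 = leave(v2, "A")      # V3 = {B,C,D}
print(v3.view_id, sorted(v3.members))   # 3 ['B', 'C', 'D']
```

Making views immutable and totally ordered by id is what lets every member agree on "which view a message was sent in" later on.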

  18. A process group: joining / leaving • How it looks in practice • Application makes a call to the virtual synchrony library • Node communicates with other nodes • Locates the appropriate group, follows the join protocol etc. • Obtains state or whatever it needs (see later) • Application starts communicating, e.g. joins the replica pool [diagram: all the virtual synchrony just fits into the protocol stack: Application / V.S. Module / Network on each node]

  19. A process group: handling failures • We rely on a failure detector (it doesn’t concern us here) • A faulty or unreachable node is simply excluded from the group • Primary partition model: the group cannot split into two active parts [diagram: V1 = {A,B,C,D}; A crashes, "B" realizes that something is wrong with "A" and initiates the membership change protocol, giving V2 = {B,C,D}; A recovers and rejoins, giving V3 = {A,B,C,D}]

  20. Causal delivery and vector clocks [diagram: C multicasts with timestamp (0,0,1,0,0); B, having delivered it, multicasts with (0,1,1,0,0); a process that has not yet delivered C’s message cannot deliver B’s and must delay it] CBCAST = FBCAST + vector clocks
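The delivery rule behind CBCAST fits in a few lines. A Python sketch of the standard vector-clock test (the function name is ours, not the ISIS API): deliver a message only when it is the next one expected from its sender and every message it causally depends on has already been delivered.

```python
def can_deliver(msg_vc, sender, local_vc):
    """CBCAST delivery rule: the message must be the next one from
    `sender`, and must not depend on any message we haven't delivered."""
    for proc, count in msg_vc.items():
        if proc == sender:
            if count != local_vc.get(proc, 0) + 1:
                return False   # not the next message from the sender
        elif count > local_vc.get(proc, 0):
            return False       # missing a causally prior message
    return True

# The slide's scenario: a process still at (0,0,0,0,0) receives B's
# message stamped (0,1,1,0,0), which depends on C's (0,0,1,0,0).
local = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
b_msg = {"A": 0, "B": 1, "C": 1, "D": 0, "E": 0}
print(can_deliver(b_msg, "B", local))   # False: must delay until C's msg
local["C"] = 1                          # after delivering C's message
print(can_deliver(b_msg, "B", local))   # True
```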

  21. What’s great about fbcast / cbcast ? • Can deliver to itself immediately • Asynchronous sender: no need to wait for anything • No need to wait for delivery order on receiver • Can issue many sends in bursts • Since processes are less synchronous... • ...the system is more resilient to failures • Very efficient, overheads are comparable with TCP

  22. Asynchronous pipelining [diagram: the sender never needs to wait, and can send requests at a high rate; buffering may reduce overhead]

  23. What’s to be careful with? • Asynchrony: • Data accumulates in a buffer in the sender • Must put limits on it! • Explicit flushing: send data to the others, force it out of buffers; if it completes, the data is safe; needed as a... • A failure of the sender causes lots of updates to be lost • Sender gets ahead of everybody else... good if the others are doing something that doesn’t conflict • Cannot do multiple conflicting tasks without a form of locking, while ABCAST can

  24. Why use causality? • Sender need not include context for every message • One of the very reasons why we use FIFO delivery, TCP • Could "encode" context in the message, but: costly or complex • Causal delivery simply extends FIFO to multiple processes • Sender knows that the receiver will have received the same msgs • Think of a thread "migrating" between servers, causing msgs

  25. A migrating thread and FIFO analogy [diagram: a thread "migrates" across processes A–E, each hop causing messages; a way to think about the above, which might explain why it is analogous to FIFO]

  26. Why use causality? • Receiver does not need to worry about "gaps" • Could be done by looking at "state", but may be quite hard • Ordering may simply be driven by correctness • Synchronizing after every message could be unacceptable • Reduces inconsistency • Doesn’t prevent it altogether, but it isn’t always necessary • State-level synchronization can thus be done more rarely! • All this said... causality makes most sense in context

  27. Causal vs. total ordering Note: unlike causal ordering, total ordering may require that local delivery be postponed! [diagram: causal but not total ordering — some processes deliver A before E and others E before A; causal and total ordering — all processes deliver them in the same order]

  28. Total ordering: atomic, synchronous [diagram: two executions over processes A–E]

  29. Why total ordering? • State machine approach • Natural, easy to understand, still cheaper than transactions • Guarantees atomicity, which is sometimes desirable • Consider a primary-backup scheme [diagram: requests go to the primary server, traces to the backup server, results are returned]

  30. Implementing totally ordered multicast [diagram: members send the original messages; a coordinator multicasts ordering info; actual message delivery is postponed until the ordering is determined]
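One common way to realize the slide's scheme is a fixed sequencer. A hypothetical Python sketch of the receiver side (class and method names are ours): delivery is postponed until both the message itself and its ordering info have arrived, and all earlier slots are filled.

```python
import heapq

class TotalOrderReceiver:
    """Receiver for sequencer-based totally ordered multicast:
    members multicast payloads, a coordinator multicasts (seq, msg_id)
    ordering info, and delivery happens strictly in sequence order."""
    def __init__(self):
        self.msgs = {}        # msg_id -> payload, awaiting ordering
        self.order = []       # min-heap of (seq, msg_id) ordering info
        self.next_seq = 0     # next sequence number to deliver
        self.delivered = []

    def on_message(self, msg_id, payload):
        self.msgs[msg_id] = payload
        self._try_deliver()

    def on_order(self, seq, msg_id):
        heapq.heappush(self.order, (seq, msg_id))
        self._try_deliver()

    def _try_deliver(self):
        # Deliver while the next slot's ordering info and payload are here.
        while (self.order and self.order[0][0] == self.next_seq
               and self.order[0][1] in self.msgs):
            _, mid = heapq.heappop(self.order)
            self.delivered.append(self.msgs.pop(mid))
            self.next_seq += 1

r = TotalOrderReceiver()
r.on_message("m2", "update-2")
r.on_order(1, "m2")        # postponed: slot 0 is not yet filled
r.on_order(0, "m1")
r.on_message("m1", "update-1")
print(r.delivered)         # ['update-1', 'update-2']
```

Note how "m2" arrives first but is still delivered second: the sequencer, not arrival order, determines delivery order at every receiver.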

  31. Atomicity of failures • Delivery guarantees: among all the surviving processes, delivery is all or none (in both the uniform and nonuniform cases) • Uniform: all or none also if (but not only if) delivered to a crashed node • No guarantees for the newly joined [diagrams: uniform, nonuniform, and wrong behaviors around a CRASH]

  32. Why atomicity of failures? • Reduce complexity • After we hear about a failure, we need to quickly "reconfigure" • We like to think in stable epochs... • During epochs, failures don’t occur • Any failure or a membership change begins a new epoch • Communication does not cross epoch boundaries • System does not begin a new epoch before all messages are either consistently delivered or consistently forgotten • We want to know we got everything the faulty process sent, to completely finish the old epoch and open a "fresh" one

  33. Atomicity: message flushing [diagram: on a failure (a logical partitioning), the survivors retransmit the messages of failed nodes and their own messages while changing membership]

  34. A multi-phase failure-atomic protocol [diagram: Phase 1 — save message; Phase 2 — OK to deliver (this is the point when messages are delivered to the applications); Phase 3 — all have seen; Phase 4 — garbage collect; three of the phases are always present, one is only for uniform atomicity]

  35. Simple tools • Replicated services: locking for updates, state transfer • Divide responsibility for requests: load balancing • Simpler because of all the communication guarantees we get • Work partitioning: subdivide tasks within a group • Simple schemes for fault-tolerance: • Primary-Backup, Coordinator-Cohort

  36. Simple replication: state machine • Replicate all data and actions • Simplest pattern of usage for v.s. groups • Same state everywhere (state transfer) • Updates propagated atomically using ABCAST • Updates applied in the exact same order • Reads or queries: can always be served locally • Not very efficient, updates too synchronous • A little too slow • We try to sneak CBCAST into the picture... • ...and use it for data locking and for updates
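The state-machine pattern above can be sketched in a few lines, assuming ABCAST hands every replica the same updates in the same order (the `ReplicatedCounter` example is ours, for illustration only):

```python
class ReplicatedCounter:
    """State-machine replication sketch: each replica starts from the
    same transferred state and applies deterministic updates in the
    total order produced by ABCAST, so replicas never diverge."""
    def __init__(self, initial_state=0):
        self.value = initial_state        # obtained via state transfer

    def apply(self, update):
        """Called once per update, in ABCAST delivery order."""
        op, amount = update
        if op == "set":
            self.value = amount
        elif op == "add":
            self.value += amount

    def read(self):
        """Reads are served locally: no messages needed."""
        return self.value

# Two replicas delivering the same updates in the same order stay equal.
updates = [("set", 10), ("add", 5), ("add", -3)]
a, b = ReplicatedCounter(), ReplicatedCounter()
for u in updates:
    a.apply(u)
    b.apply(u)
print(a.read(), b.read())   # 12 12
```

Determinism of `apply` is essential: as slide 4 noted, demanding the *same* actions everywhere is exactly what makes this pattern synchronous and a little slow.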

  37. Updates with token-style locking We may not need anything more than just causality... [diagram: members request the lock; the owner grants lock ownership of the shared resource; the new owner performs an update; a CRASH of a member is handled]

  38. Updates with token-style locking [diagram: the token owner multicasts updates; the others must confirm (releasing any read locks) with individual confirmations; lock requests arrive, and the token owner grants the lock in a message]
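The token-passing idea can be sketched without the messaging layer. A minimal Python illustration (class and method names are hypothetical): only the token owner may update, and the grant message, sent via CBCAST, causally follows the owner's updates, so the next owner is guaranteed to have delivered them first.

```python
class TokenLock:
    """Token-style write lock: one owner at a time; waiters are
    queued and granted the token in request order on release."""
    def __init__(self, owner):
        self.owner = owner
        self.waiting = []     # queued lock requests, FIFO

    def request(self, member):
        """A member multicasts a lock request; the owner queues it."""
        if member != self.owner:
            self.waiting.append(member)

    def release(self):
        """Grant the token to the next waiter (the grant message,
        being causal, follows all of the old owner's updates)."""
        if self.waiting:
            self.owner = self.waiting.pop(0)
        return self.owner

lock = TokenLock("A")
lock.request("B")
lock.request("C")
print(lock.release())   # 'B' now owns the token
print(lock.release())   # then 'C'
```

This is why the slide says causality may be all we need: the causal ordering of grant-after-updates replaces the stronger (and costlier) total order of ABCAST for this pattern.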

  39. Multiple locks on unrelated data [diagram: processes A–E holding different locks; one CRASHes]

  40. Replicated services [diagram: a load balancing scheme directing queries and updates]

  41. Replicated services [diagrams: the Primary-Backup Scheme (requests go to the primary server, traces to the backup server, results are returned) and the Coordinator-Cohort Scheme]

  42. Other types of tools • Publish-Subscribe • Every topic as a separate group • subscribe = join the group • publish = multicast • state transfer = load prior postings • Rest of the ISIS toolkit • News, file storage, job scheduling with load sharing, a framework for reactive control apps etc.

  43. Complaints (Cheriton / Skeen) • Other techniques are better: transactions, pub/sub • Depends on applications... we don’t compete with ACID! • A matter of taste • Too many costly features, which we may not need • Indeed: and we need to be able to use them selectively • Stackable microprotocols - ideas picked up on in Horus • End-to-End argument • Here taken to the extreme, could be used against TCP • But indeed, there are overheads: use this stuff wisely

  44. At what level to apply causality? • Communication level (ISIS) • An efficient technique that usually captures all that matters • Speeds up implementation, simplifies design • May not always be most efficient: might sometimes over-order (might lead to incidental causality) • Not a complete or accurate solution, but often just enough • What kinds of causality really matter to us?

  45. At what level to apply causality? • Communication level (ISIS) • May not recognize all sources of causality • Existence of external channels (shared data, external systems) • Semantic ordering, recognized / understood only by applications • Semantics- or State-level • Prescribe ordering by the senders (prescriptive causality) • Timestamping, version numbers

  46. (skip) Overheads • Causality information in messages • With unicast as a dominant communication pattern, it could be a graph, but only in unlikely patterns of communication • With multicast, it’s simply one vector (possibly compressed) • Usually we have only a few active senders and bursty traffic • Buffering • Overhead linear with N, but with a small constant, similar to TCP • Buffering is bounded together with the communication rate • It is needed anyway for failure atomicity (an essential feature) • Can be efficiently traded for control traffic via explicit flushing • Can be greatly reduced by introducing hierarchical structures

  47. (skip) Overheads • Overheads on the critical path • Delays in delivery, but in reality: comparable to those in TCP • Arriving out of order is uncommon, a window of a few messages • Checking / updating causality info + maintaining msg buffers • False (non-semantic) causality: messages unnecessarily delayed • Group membership changes • Require agreement and slow multi-phase protocols • A tension between latency and bandwidth • Do introduce a disruption: delivery of new messages is suppressed • Costly flushing protocols place load on the network

  48. (skip) Overheads • Control traffic • Acknowledgements, 2nd / 3rd phase: additional messages • Not on the critical path, but latency matters, as it affects buffering • Can be piggybacked on other communication • Atomic communication, flush, and membership changes are slowed down to the slowest participant • Heterogeneity • Need sophisticated protocols to avoid overloading nodes • Scalability • Group size: asymmetric load on senders, large timestamps • Number of groups: complex protocols, not easy to combine

  49. Conclusions • Virtual synchrony is an intermediate solution... • ...less than consistency with serializability, better than nothing • Strong enough for some classes of systems • Effective in practice, successfully used in many real systems • Inapplicable or inefficient in database-style settings • Not a monolithic scheme... • Selected features should be used only based on need • At the same time, a complete design paradigm • Causality isn’t so helpful as an isolated feature, but... • ...it is a key piece of a much larger picture
