Reliable Group Communication: a Mathematical Approach

Nancy Lynch

Theory of Distributed Systems

MIT LCS

Kansai chapter, IEEE

July 7, 2000

Dynamic Distributed Systems

  • Modern distributed systems are dynamic.

  • Set of clients participating in an application changes, because of:

    • Network, processor failure, recovery

    • Changing client requirements

  • To cope with changes:

    • Use abstract groups of client processes with changing membership sets.

    • Processes communicate with group members by sending messages to the group as a whole.

Group Communication Services

  • Support management of groups

  • Maintain membership info

  • Manage communication

  • Make guarantees about ordering, reliability of message delivery, e.g.:

    • Best-effort: IP Multicast

    • Strong consistency guarantees: Isis, Transis, Ensemble

  • Hide complexity of coping with changes

This Talk

  • Describe

    • Group communication systems

    • A mathematical approach to designing, modeling, analyzing GC systems.

    • Our accomplishments and ideas for future work.

  • Collaborators:

    Idit Keidar, Alan Fekete, Alex Shvartsman, Roger Khazan, Roberto De Prisco, Jason Hickey, Robert van Renesse, Carl Livadas, Ziv Bar-Joseph, Kyle Ingols, Igor Tarashchanskiy

Talk Outline

I. Background: Group Communication

II. Our Approach

III. Projects and Results

1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication

IV. Future Work

V. Conclusions

The Setting

  • Dynamic distributed system, changing set of participating clients.

  • Applications:

    • Replicated databases, file systems

    • Distributed interactive games

    • Multi-media conferencing, collaborative work


  • Abstract, named groups of client processes, changing membership.

  • Client processes send messages to the group (multicast).

  • Early 80s: Group idea used in replicated data management system designs

  • Late 80s: Separate group communication services.

Group Communication Service

  • Communication middleware

  • Manages group membership, current views

    View = membership set + identifier

  • Manages multicast communication among group members

    • Multicasts respect views

    • Guarantees within each view:

      • Reliability constraints

      • Ordering constraints, e.g., FIFO from each sender, causal, common total order

  • Global service



Group Communication Service

[Figure: the group communication service, with Client A and Client B]

Isis [Birman, Joseph 87]

  • Primary component group membership

  • Several reliable multicast services, different ordering guarantees, e.g.:

    • Atomic Broadcast: Common total order, no gaps

    • Causal Broadcast:

  • When partition is repaired, primary processes send state information to rejoining processes.

  • Virtually Synchronous message delivery

Example: Interactive Game

  • Alice, Bob, Carol, Dan in view {A,B,C,D}

  • Primary component membership

    • {A}{B,C,D} split; only {B,C,D} may continue.

  • Atomic Broadcast

    • A fires, B moves away; need consistent order

Interactive Game

  • Causal Broadcast

    • C sees A enter a room; locks door.

  • Virtual Synchrony

    • {A}{B,C,D} split; B sees A shoot; so do C, D.




  • Replicated data management

    • State machine replication [Lamport 78], [Schneider 90]

    • Atomic Broadcast provides support

    • Same sequence of actions performed everywhere.

    • Example: Interactive game state machine

  • Stock market

  • Air-traffic control
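The state machine replication idea above can be sketched for the game example. This is a minimal Python illustration, not code from the talk: the `GameState` class, its `move` and `fire` actions, and the cell numbers are all hypothetical. It shows that replicas applying the same totally ordered sequence of actions end in the same state, while a replica that applies the actions in a different order can diverge.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Hypothetical replicated game state: player positions and who was hit."""
    position: dict = field(default_factory=lambda: {"A": 0, "B": 1})
    hit: set = field(default_factory=set)

    def apply(self, action):
        kind, player, cell = action
        if kind == "move":
            self.position[player] = cell
        elif kind == "fire":
            # A shot hits every other player currently standing on the target cell.
            for other, where in self.position.items():
                if other != player and where == cell:
                    self.hit.add(other)


def replay(actions):
    """Apply a sequence of actions, in order, to a fresh state."""
    state = GameState()
    for action in actions:
        state.apply(action)
    return state


fire = ("fire", "A", 1)   # A fires at cell 1
move = ("move", "B", 2)   # B moves away to cell 2

replica_1 = replay([fire, move])   # same order at both replicas
replica_2 = replay([fire, move])
diverged = replay([move, fire])    # a replica that saw the opposite order
```

Here `replica_1` and `replica_2` agree that B was hit, whereas `diverged` records no hit, which is why the game needs a common total order such as the one Atomic Broadcast provides.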

Transis [Amir, Dolev, Kramer, Malkhi 92]

  • Partitionable group membership

  • When components merge, processes exchange state information.

  • Virtual synchrony reduces amount of data exchanged.

  • Applications

    • Highly available servers

    • Collaborative computing, e.g. shared whiteboard

    • Video, audio conferences

    • Distributed jam sessions

    • Replicated data management [Keidar, Dolev 96]

Other Systems

  • Totem [Amir, Melliar-Smith, Moser, et al., 95]

    • Transitional views, useful with virtual synchrony

  • Horus [Birman, van Renesse, Maffeis 96]

  • Ensemble [Birman, Hayden 97]

    • Layered architecture

    • Composable building blocks

  • Phoenix, Consul, RMP, Newtop, RELACS,…

  • Partitionable

Service Specifications

  • Precise specifications needed for GC services

    • Help application programmers write programs that use the services correctly, effectively

    • Help system maintainers make changes correctly

    • Safety, performance, fault-tolerance

  • But difficult:

    • Many different services; different guarantees about membership, reliability, ordering

    • Complicated

    • Specs based on implementations might not be optimal for application programmers.

Early Work on GC Service Specs

  • [Ricciardi 92]

  • [Jahanian, Fakhouri, Rajkumar 93]

  • [Moser, Amir, Melliar-Smith, Agrawal 94]

  • [Babaoglu et al. 95, 98]

  • [Friedman, van Renesse 95]

  • [Hiltunen, Schlichting 95]

  • [Dolev, Malkhi, Strong 96]

  • [Cristian 96]

  • [Neiger 96]

  • Impossibility results [Chandra, Hadzilacos, et al. 96]

  • But still difficult…



  • Model everything:

    • Applications

      • Requirements, algorithms

    • Service specs

      • Work backwards, see what the applications need

    • Implementations of the services

  • State, prove correctness theorems:

    • For applications, implementations.

    • Methods: Composition, invariants, simulation relations

  • Analyze performance, fault-tolerance.

  • Layered proofs, analyses




Math Foundation: I/O Automata

  • Nondeterministic state machines

  • Not necessarily finite-state

  • Input/output/internal actions (signature)

  • Transitions, executions, traces

  • System modularity:

    • Composition, respecting traces

    • Levels of abstraction, respecting traces

  • Language-independent, math model

Typical Examples Modeled

  • Distributed algorithms

  • Communication protocols

  • Distributed data management systems

Modeling Style

  • Describe interfaces, behavior

  • Program-like behavior descriptions:

    • Precondition/effect style

    • Pseudocode or IOA language

  • Abstract models for algorithms, services

  • Model several levels of abstraction,

    • High-level, global service specs

    • Detailed distributed algorithms

Modeling Style

  • Very nondeterministic:

    • Constrain only what must be constrained.

    • Simpler

    • Allows alternative implementations

Describing Timing Features

  • TIOAs [Lynch, Vaandrager 93]

    • For describing:

      • Timeout-based algorithms.

      • Clocks, clock synchronization

      • Performance properties

Describing Failures

  • Basic or timed I/O automata, with fail, recover input actions.

  • Included in traces, can use them in specs.

Describing Other Features

  • Probabilistic behavior: PIOAs [Segala 95]

    • For describing:

      • Systems with combination of probabilistic + nondeterministic behavior

      • Randomized distributed algorithms

      • Probabilistic assumptions on environment

  • Dynamic systems: DIOAs [Attie, Lynch 99]

    • For describing:

      • Run-time process creation and destruction

      • Mobility

      • Agent systems [NTT collaboration]

Using I/O Automata (General)

  • Specify systems precisely

  • Validate designs:

    • Simulation

    • State, prove correctness theorems

    • Analyze performance

  • Generate validated code

  • Study theoretical upper and lower bounds

Using I/O Automata for Group Communication Systems

  • Use for global services + distributed algorithms

  • Define safety properties separately from performance/fault-tolerance properties.

    • Safety:

      • Basic I/O automata; trace properties

    • Performance/fault-tolerance:

      • Timed I/O automata with failure actions; timed trace properties


1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication

1. View Synchrony (VS) [Fekete, Lynch, Shvartsman 97, 00]


  • Develop prototypes:

    • Specifications for typical GC services

    • Descriptions for typical GC algorithms

    • Correctness proofs

    • Performance analyses

  • Design simple math foundation for the area.

  • Try out, evaluate our approach.

View Synchrony

What we did:

  • Talked with system developers (Isis, Transis)

  • Defined I/O automaton models for:

    • VS, prototype partitionable GC service

    • TO, non-view-oriented totally ordered bcast service

    • VStoTO, application algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser]

  • Proved correctness

  • Analyzed performance/fault-tolerance.

VStoTO Architecture

TO Broadcast Specification

Delivers messages to everyone, in the same order.

Safety: TO-Machine

Signature:

input: bcast(a,p)

output: brcv(a,p,q)

internal: to-order(a,p)

State:

queue, sequence of (a,p), initially empty

for each p:

  pending[p], sequence of a, initially empty

  next[p], positive integer, initially 1

TO-Machine

Transitions:

bcast(a,p)
  Effect: append a to pending[p]

to-order(a,p)
  Precondition: a is head of pending[p]
  Effect: remove head of pending[p]
          append (a,p) to queue

brcv(a,p,q)
  Precondition: queue[next[q]] = (a,p)
  Effect: next[q] := next[q] + 1
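The TO-Machine above can be sketched as an executable class. This is a minimal Python illustration under stated assumptions, not the authors' IOA code: the class name `TOMachine`, the `*_enabled` helper methods, and the process names in the usage lines are introduced here. Preconditions are modeled as assertions, and `queue` is indexed from 1 as on the slide.

```python
from collections import defaultdict, deque


class TOMachine:
    """Sketch of the TO-Machine: each action is a precondition plus an effect."""

    def __init__(self):
        self.queue = []                      # sequence of (a, p), initially empty
        self.pending = defaultdict(deque)    # pending[p], initially empty
        self.next = defaultdict(lambda: 1)   # next[p], initially 1 (1-based index)

    def bcast(self, a, p):
        # Input action: always enabled.
        self.pending[p].append(a)

    def to_order_enabled(self, a, p):
        # Precondition: a is the head of pending[p].
        return bool(self.pending[p]) and self.pending[p][0] == a

    def to_order(self, a, p):
        assert self.to_order_enabled(a, p), "precondition violated"
        self.pending[p].popleft()
        self.queue.append((a, p))

    def brcv_enabled(self, a, p, q):
        # Precondition: queue[next[q]] = (a, p), with 1-based indexing.
        i = self.next[q]
        return i <= len(self.queue) and self.queue[i - 1] == (a, p)

    def brcv(self, a, p, q):
        assert self.brcv_enabled(a, p, q), "precondition violated"
        self.next[q] += 1


# Usage: two senders broadcast; the machine orders y before x;
# every receiver then sees the same order.
m = TOMachine()
m.bcast("x", "p1")
m.bcast("y", "p2")
m.to_order("y", "p2")
m.to_order("x", "p1")

delivered = {"p1": [], "p2": []}
for q in delivered:
    while m.next[q] <= len(m.queue):
        a, p = m.queue[m.next[q] - 1]
        m.brcv(a, p, q)
        delivered[q].append(a)
```

The nondeterminism of the specification shows up in the usage lines: nothing forces `y` to be ordered before `x`, only that whichever order is chosen is the one every process receives.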


Performance/Fault-Tolerance

TO-Property(b,d,C): If C stabilizes, then soon thereafter (time b), any message sent or received anywhere in C is received everywhere in C, within bounded time (time d).

VS Specification

  • Partitionable view-oriented service

  • Safety: VS-Machine

    • Views presented in consistent order, possible gaps

    • Messages respect views

    • Messages in consistent order

    • Causality

    • Prefix property

    • Safe indication

  • Doesn’t guarantee Virtual Synchrony

  • Like TO-Machine, but per view
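One reading of the "consistent order" and "prefix property" items above is that, within a single view, the sequences delivered to any two processes are prefixes of one common per-view order. The following Python sketch checks that condition on recorded deliveries. It is an assumption-laden illustration: the function names and the shape of the `deliveries` mapping are hypothetical, and the actual VS-Machine states more than this check covers (for example causality and safe indications).

```python
def is_prefix(xs, ys):
    """True if list xs is a prefix of list ys."""
    return len(xs) <= len(ys) and ys[:len(xs)] == xs


def per_view_prefix_ok(deliveries):
    """deliveries maps view id -> {process: messages delivered in that view}.

    Within each view, every pair of delivered sequences must be
    prefix-comparable, so all of them fit one per-view total order.
    """
    for per_process in deliveries.values():
        seqs = list(per_process.values())
        for i, xs in enumerate(seqs):
            for ys in seqs[i + 1:]:
                if not (is_prefix(xs, ys) or is_prefix(ys, xs)):
                    return False
    return True


# q lags behind p in view 1, which is allowed; both agree in view 2.
good = {1: {"p": ["m1", "m2"], "q": ["m1"]}, 2: {"p": ["m3"], "q": ["m3"]}}
# p and q disagree on the order inside view 1.
bad = {1: {"p": ["m1", "m2"], "q": ["m2", "m1"]}}
```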

Performance/Fault-Tolerance

[Figure: newview(v)]

VS-Property(b,d,C): If C stabilizes, then soon thereafter (time b), views known within C become consistent, and messages sent in the final view v are delivered everywhere in C, within bounded time (time d).

VStoTO Algorithm

  • TO must deliver messages in order, no gaps.

  • VS delivers messages in order per view.

  • Problems arise from view changes:

    • Processes moving between views could have different prefixes.

    • Processes could skip views.

  • Algorithm:

    • Real work done in majority views only

    • Processes in majority views totally order messages, and deliver to clients messages that VS has said are safe.

    • At start of new view, processes exchange state, to reconcile progress made in different majority views.
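The "majority view" test that gates the real work can be sketched in a few lines. This Python fragment assumes a fixed, known universe of processes, which the slide does not spell out; the function name `is_majority_view` is hypothetical.

```python
def is_majority_view(members, universe):
    """True if the view contains a strict majority of the universe."""
    return 2 * len(set(members) & set(universe)) > len(universe)


universe = {"A", "B", "C", "D"}

big_side = is_majority_view({"B", "C", "D"}, universe)    # 3 of 4
small_side = is_majority_view({"A"}, universe)            # 1 of 4
even_split = is_majority_view({"A", "B"}, universe)       # 2 of 4, not strict
```

With this definition an even split leaves no side able to make progress, and at most one side of any partition can ever qualify.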

Correctness (Safety) Proof

  • Show composition of VS-Machine and VStoTO machines implements TO-Machine.

  • Trace inclusion

  • Use simulation relation proof:

    • Relate start states, steps of composition to those of TO-Machine

    • Invariants, e.g.:

      Once a message is ordered everywhere in some majority view, its order is determined forever.

  • Checked using PVS theorem-prover, TAME [Archer]



Conditional Performance Analysis

  • Assume VS satisfies VS-Property(b,d,C):

    • If C stabilizes, then within time b, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within time d.

  • And VStoTO satisfies:

    • Simple timing and fault-tolerance assumptions.

  • Then TO satisfies TO-Property(b+d,d,C):

    • If C stabilizes, then within time b+d, any message sent or delivered anywhere in C is delivered everywhere in C, within time d.
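The conditional claim on this slide can be restated as a single implication. The LaTeX below only rewrites the properties named above; the symbol A_VStoTO, standing for the simple timing and fault-tolerance assumptions on VStoTO, is introduced here for illustration and does not appear in the talk.

```latex
\[
\text{VS-Property}(b,\, d,\, C) \;\wedge\; A_{\text{VStoTO}}
\;\Longrightarrow\;
\text{TO-Property}(b + d,\; d,\; C)
\]
```

For instance, with the purely illustrative values b = 2 s and d = 0.5 s, the conclusion would read TO-Property(2.5 s, 0.5 s, C): within 2.5 s of C stabilizing, any message sent or delivered anywhere in C is delivered everywhere in C within 0.5 s.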

Conclusions: VS

  • Models for VS, TO, VStoTO

  • Proofs, performance/f-t analyses

  • Tractable, understandable, modular

  • [PODC 97], [TOCS 00]

  • Follow-on work:

    • Algorithm for VS [Fekete, Lesley]

    • Load balancing using VS [Khazan]

    • Models for other Transis algorithms [Chockler]

  • But: VS is only a prototype; lacks some key features, like Virtual Synchrony

  • Next: Try a real system!

2. Ensemble [Hickey, Lynch, van Renesse 99]


  • Try, evaluate our approach on a real system

  • Develop techniques for modeling, verifying, analyzing more features of GC systems, including Virtual Synchrony

  • Improve on prior methods for system validation


  • Ensemble system [Birman, Hayden 97]

    • Virtual Synchrony

    • Layered design, building blocks

    • Coded in ML [Hayden]

  • Prior verification work for Ensemble and predecessors:

    • Proving local properties using Nuprl [Hickey]

    • [Ricciardi], [Friedman]


  • What we did:

    • Worked with developers

    • Followed VS example

    • Developed global specs for key layers:

      • Virtual Synchrony

      • Total Order with Virtual Synchrony

    • Modeled Ensemble algorithm spanning between layers

    • Attempted proof; found logical error in state exchange algorithm (repaired)

    • Developed models, proofs for repaired system

Conclusions: Ensemble

  • Models for two layers, algorithm

  • Tractable, easily understandable by developers

  • Error, proofs

  • Low-level models similar to actual ML code (4 to 1)

  • [TACAS 99]

  • Follow-on:

    • Same error found in Horus.

    • Incremental models, proofs [Hickey]

  • Next: Use our approach to design new services.

3. Dynamic Views [De Prisco, Fekete, Lynch, Shvartsman 98]


  • Define GC services that cope with both:

    • Long-term changes:

      • Permanent failures, new joins

      • Changes in the “universe” of processes

    • Transient changes

  • Use these to design consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.

Dynamic Views

  • Many applications with strong consistency requirements make progress only in primary views:

    • Consistent replicated data management

    • Totally ordered broadcast

  • Can use static notion of allowable primaries, e.g., majorities of universe, quorums

    • All intersect.

    • Only one exists at a time.

    • Information can flow from each to the next.

  • But: Static notion not good for long-term changes
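The "all intersect" claim for static majorities is easy to check by brute force on a small universe. This Python sketch uses a hypothetical five-process universe and a helper named `majorities`; neither comes from the talk.

```python
from itertools import combinations


def majorities(universe):
    """All subsets of the universe that contain more than half of it."""
    n = len(universe)
    return [set(c)
            for k in range(n // 2 + 1, n + 1)
            for c in combinations(sorted(universe), k)]


universe = {1, 2, 3, 4, 5}
candidates = majorities(universe)

# Every pair of majorities shares at least one process.
all_intersect = all(a & b for a, b in combinations(candidates, 2))
```

Because any two majorities of the same universe overlap, information can flow from each primary to the next; the weakness is that the universe itself is fixed.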

Dynamic Views

  • For long-term changes, want dynamic notion of allowable primaries.

  • E.g., each primary might contain majority of previous:

  • But: Some might not intersect.

    Makes it hard to maintain consistency.
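A small concrete case shows how primaries defined as "a majority of the previous primary" can end up disjoint. The sets and the helper `is_majority_of` in this Python sketch are hypothetical and chosen only to illustrate the slide's point.

```python
def is_majority_of(candidate, previous):
    """True if candidate contains more than half of the previous primary."""
    return 2 * len(candidate & previous) > len(previous)


p0 = {1, 2, 3, 4, 5}      # initial primary
p1 = {1, 2, 3}            # contains a majority of p0
p2 = {1, 2}               # contains a majority of p1
p1_alt = {3, 4, 5}        # also a majority of p0, formed by processes unaware of p2

chain_ok = is_majority_of(p1, p0) and is_majority_of(p2, p1)
alt_ok = is_majority_of(p1_alt, p0)
disjoint = not (p2 & p1_alt)
```

Each step satisfies the local rule, yet `p2` and `p1_alt` share no process, so neither can learn what the other decided.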

Dynamic Views

  • Key problem:

    • Processes may have different opinions about which is the previous primary

    • Could be disjoint.

  • [Yeger-Lotem, Keidar, Dolev 97] algorithm

    • Keeps track of all possible previous primaries.

    • Ensures intersection with all of them.

Dynamic Views

What we did:

  • Defined Dynamic View Service, DVS, based on [YKD]

  • Designed to tolerate long-term failures

  • Membership:

    • Views delivered in consistent order, possible gaps.

    • Ensures new primary intersects all possible previous primaries.

  • Communication:

    • Similar to VS

    • Messages delivered within views

    • Prefix property, safe notifications.
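The membership guarantee above, that a new primary intersects all possible previous primaries, can be sketched as a simple predicate. This Python fragment models only the intersection requirement stated on the slide; the published algorithm may impose a stronger condition, and the name `may_become_primary` and the example sets are hypothetical.

```python
def may_become_primary(candidate, possible_previous):
    """True if the candidate view intersects every possible previous primary."""
    return all(candidate & prev for prev in possible_previous)


# A process is unsure whether the last primary was {1,2,3} or {3,4,5},
# so it tracks both as possible previous primaries.
possible_previous = [{1, 2, 3}, {3, 4, 5}]

rejected = may_become_primary({1, 2}, possible_previous)       # misses {3,4,5}
accepted = may_become_primary({2, 3, 4}, possible_previous)    # meets both
```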

Dynamic Views

  • What we did, cont’d

    • Modeled, proved implementing algorithm

    • Modeled, proved TO-Broadcast application

    • Distributed implementation [Ingols 00]

Handling Transient Failures: Dynamic Configurations

  • Configuration = Set of processes plus structure, e.g., set of quorums, leader,…

  • Application: Highly available consistent replicated data management:

    • Paxos [Lamport], uses leader, quorums

    • [Attiya, Bar-Noy, Dolev], uses read quorums and write quorums

  • Quorums allow flexibility, availability in the face of transient failures.

Dynamic Configurations [De Prisco, Fekete, Lynch, Shvartsman 99, 00]

  • Combine ideas/benefits of

    • Dynamic views, for long-term failures, and

    • Static configurations, for transient failures

  • Idea:

    • Allow configuration to change (reconfiguration).

    • Each configuration satisfies intersection properties with respect to previous configuration

  • Example:

    • Config = (membership set, read quorums, write quorums)

    • Membership set of new configuration contains read quorum and write quorum of previous configuration
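The example rule above can be written as a check between two configurations. This Python sketch is an illustration under assumptions: the `Config` class, the function `valid_successor`, and the concrete quorum sets are hypothetical, and only the single-predecessor rule from the example is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    members: frozenset
    read_quorums: tuple    # tuple of frozensets
    write_quorums: tuple   # tuple of frozensets


def valid_successor(new, old):
    """New membership must contain a read quorum and a write quorum of old."""
    has_read = any(r <= new.members for r in old.read_quorums)
    has_write = any(w <= new.members for w in old.write_quorums)
    return has_read and has_write


fs = frozenset
old = Config(
    members=fs({1, 2, 3, 4}),
    read_quorums=(fs({1, 2}), fs({3, 4})),
    write_quorums=(fs({1, 3}), fs({2, 4})),
)
# Process 4 is replaced by process 5; read quorum {1,2} and write quorum {1,3} survive.
good = Config(fs({1, 2, 3, 5}), (fs({1, 2}),), (fs({1, 3, 5}),))
# Only read quorum {1,2} survives; no write quorum of old is contained.
bad = Config(fs({1, 2, 5}), (fs({1, 2}),), (fs({1, 5}),))

good_ok = valid_successor(good, old)
bad_ok = valid_successor(bad, old)
```

Containing both kinds of quorum means the new configuration can read the latest value written under the old one and can block conflicting writes there.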

Dynamic Configurations

What we did:

  • Defined dynamic configuration service DCS, guaranteeing intersection properties w.r.t. all possible previous configurations.

  • Designed implementing algorithm, extending [YKD]

  • Developed application: Replicated data

    • Dynamic version of [Paxos]

    • Dynamic version of [Attiya, Bar-Noy, Dolev]

    • Tolerate

      • Transient failures, using quorums

      • Longer-term failures, using reconfiguration

Conclusions: Dynamic Views

  • New DVS, DC services for long-term changes in set of processes

  • Applications, implementations

  • Decomposed complex algorithms into tractable pieces:

    • Service specification, implementation, application

    • Static algorithm vs. reconfiguration

  • Couldn’t have done it without the formal framework.

  • [PODC 98], [DISC 99]

4. Scalable Group Communication [Keidar, Khazan 99], [Keidar, Khazan, Lynch, Shvartsman 00]


  • Make GC work in wide area networks

What we did:

  • Defined desired properties for GC services

  • Defined spec for scalable group membership service [Keidar, Sussman, Marzullo, Dolev 00], implemented on small set of membership servers

Scalable Group Communication

What we did, cont’d:

  • Developed new, scalable GC algorithms:

    • Use scalable GM service

    • Multicast implemented on clients

    • Efficient: Algorithm for virtual synchrony uses only one round for state exchange, in parallel with membership service’s agreement on views.

    • Processes can join during reconfiguration.

  • Distributed implementation [Tarashchanskiy]

Scalable GC

What we did, cont’d:

  • Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00]

    • Proof Extension Theorem

  • Developed models, proofs (safety and liveness), using the new methods.





Conclusions: Scalable GC

  • Specs, new algorithms, proofs

  • New incremental proof methods

  • Couldn’t have done it without the formal framework.

  • [ICDCS 99], [ICSE 00]

Future Work

  • Model, analyze GC services, applications

  • Design new GC services

  • Catalog

  • Compare, evaluate GC services

  • Math foundations

  • Theory and Practice

Practical GC Systems: Current Status [Birman 99]

  • Commercial successes:

    • Stock exchange (Zurich, New York)

    • Air-traffic control (France)

  • Problems:

    • Performance, for strong guarantees like Virtual Synchrony

    • Not integrated with object-oriented programming technologies.

  • Trends:

    • Flexible services

    • Weaker guarantees; better performance

    • Integration with OO technologies, allowing programmers to make tradeoffs.

1. Model, Analyze GC Services

  • Analyze performance of our new services: Dynamic views, Scalable GC

  • Applications: Replicated data, games, …

  • Compare predicted, observed performance.

  • Other existing services

2. Design New Services

Total Order + QoS [Bar-Joseph, Keidar, Anker, Lynch]

  • Specs for:

    • Bandwidth reservation service

    • TO Multicast service with QoS (latency, bandwidth)

  • Algorithms implementing TO-QoS using reservation service:

    Algorithm 1: Allows gaps, simple, small added latency

    Algorithm 2: No gaps, more complex, more latency

    Basic services: Consensus, resource allocation, leader election, spanning trees, overlay networks

3. Catalog of GC Services

  • Service specs

  • Property specs [Chockler, Keidar, Vitenberg 99]

  • Implementing algorithms

  • Prototype applications

  • Lower bounds, impossibility results

4. Compare, Evaluate GC Services

  • Study tradeoffs between strength of ordering and reliability guarantees vs. performance

  • Compare GC services with other reliable multicast algorithms:

    • Scalable Reliable Multicast [Floyd, Jacobson, et al. 95]:

      Unreliable GC (IP Multicast) + retransmission protocol

    • Bimodal Multicast [Birman, Hayden, et al. 99]

5. Math Foundations

  • Models:

    • Timing models

      • For timing assumptions, guarantees, QoS

      • For conditional performance analysis

    • Failure models, probabilistic models, process creation models…

    • Combined models

  • Proof methods:

    • Incremental modeling, proof

    • Conditional performance analysis

Conditional Performance Analysis

  • Idea:

    • Make conditional claims about system behavior, under various assumptions about behavior of environment, network.

    • Include timing, performance, failures.

  • Benefits:

    • Formal performance predictions

    • Says when system makes specific guarantees

      • Normal case + failure cases

      • Parameters, sensitivity analysis

    • Composable

    • Get probabilistic claims as corollaries

CP Analysis: Typical Hypotheses

  • Stabilization of underlying network.

  • Limited rate of change.

  • Bounds on message delay.

  • Limited amount of failure (number, density).

  • Limit input arrivals (number, density).

  • Method allows focus on tractable cases.

Example: Reliable Multicast [Livadas, Keidar, Lynch]

  • Specs for IP Mcast, Reliable Mcast services

  • Automaton model for Scalable Reliable Mcast (SRM) protocol [Floyd, Jacobson, et al. 95]

  • Example:

    • Assume bounds on IP-level message loss, processor failures

    • Prove bounds on:

      • Time from client send until all non-failed clients receive.

      • Amount of traffic generated.

SRM Architecture



6. Theory and Practice

  • IOA language, tool support for GC services, algorithms

  • Incremental development methods for algorithms, service specs, proofs, analyses

  • Methods for integrating group communication services with object-oriented programming technologies

V. Conclusions


  • GC services help in programming dynamic distributed systems, though scalability, integration problems remain.

  • Our contributions:

    • Modeling style: Automata + performance properties

    • Techniques: Conditional performance analysis, incremental modeling/proof

    • Models, proofs for key services

    • Discovered errors

    • New services: Dynamic views, scalable GC

  • Mathematical framework makes it possible to design more complex systems correctly.

Future Work

  • Model, analyze GC services, applications

  • Design new services

  • Catalog

  • Compare, evaluate services

  • Math foundations

  • Theory and Practice

Thank you!