Reliable Group Communication: a Mathematical Approach

Nancy Lynch

Theory of Distributed Systems

MIT LCS

Kansai chapter, IEEE

July 7, 2000


Dynamic Distributed Systems

  • Modern distributed systems are dynamic.

  • Set of clients participating in an application changes, because of:

    • Network, processor failure, recovery

    • Changing client requirements

  • To cope with changes:

    • Use abstract groups of client processes with changing membership sets.

    • Processes communicate with group members by sending messages to the group as a whole.


Group Communication Services

  • Support management of groups

  • Maintain membership info

  • Manage communication

  • Make guarantees about ordering, reliability of message delivery, e.g.:

    • Best-effort: IP Multicast

    • Strong consistency guarantees: Isis, Transis, Ensemble

  • Hide complexity of coping with changes


This Talk

  • Describe

    • Group communication systems

    • A mathematical approach to designing, modeling, analyzing GC systems.

    • Our accomplishments and ideas for future work.

  • Collaborators:

    Idit Keidar, Alan Fekete, Alex Shvartsman, Roger Khazan, Roberto De Prisco, Jason Hickey, Robert van Renesse, Carl Livadas, Ziv Bar-Joseph, Kyle Ingols, Igor Tarashchanskiy


Talk Outline

I. Background: Group Communication

II. Our Approach

III. Projects and Results

1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication

IV. Future Work

V. Conclusions


I. Background: Group Communication


The Setting

  • Dynamic distributed system, changing set of participating clients.

  • Applications:

    • Replicated databases, file systems

    • Distributed interactive games

    • Multi-media conferencing, collaborative work


Groups

  • Abstract, named groups of client processes, changing membership.

  • Client processes send messages to the group (multicast).

  • Early 80s: Group idea used in replicated data management system designs

  • Late 80s: Separate group communication services.


Group Communication Service

  • Communication middleware

  • Manages group membership, current views

    View = membership set + identifier

  • Manages multicast communication among group members

    • Multicasts respect views

    • Guarantees within each view:

      • Reliability constraints

      • Ordering constraints, e.g., FIFO from each sender, causal, common total order

  • Global service
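
The definition of a view above (membership set plus identifier) and the rule that multicasts respect views can be sketched in a few lines of Python. This is only an illustration: the integer identifiers, the per-message view tag, and the names View, Message, and respects_view are assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class View:
    """A view: an identifier plus the set of member processes."""
    view_id: int                 # assumed: integer identifiers
    members: FrozenSet[str]


@dataclass(frozen=True)
class Message:
    """A multicast tagged with the view in which it was sent (assumed)."""
    sender: str
    payload: str
    view_id: int


def respects_view(msg: Message, receiver: str, current: View) -> bool:
    """True if delivering msg to receiver is consistent with the current view:
    the message was sent in that view and both endpoints are members of it."""
    return (
        msg.view_id == current.view_id
        and msg.sender in current.members
        and receiver in current.members
    )


v1 = View(1, frozenset({"A", "B"}))
v2 = View(2, frozenset({"A", "B", "C"}))
m = Message(sender="A", payload="hello", view_id=1)

print(respects_view(m, "B", v1))  # True: sent and delivered in view 1
print(respects_view(m, "C", v2))  # False: message belongs to an earlier view
```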


Group Communication Service

[Figure: Client A and Client B attached to the GCS; each client's interface is labeled with the actions mcast, receive, and new-view.]


Isis [Birman, Joseph 87]

  • Primary component group membership

  • Several reliable multicast services, different ordering guarantees, e.g.:

    • Atomic Broadcast: Common total order, no gaps

    • Causal Broadcast:

  • When partition is repaired, primary processes send state information to rejoining processes.

  • Virtually Synchronous message delivery


Example: Interactive Game

  • Alice, Bob, Carol, Dan in view {A,B,C,D}

  • Primary component membership

    • {A}{B,C,D} split; only {B,C,D} may continue.

  • Atomic Broadcast

    • A fires, B moves away; need consistent order.


Interactive Game

  • Causal Broadcast

    • C sees A enter a room; locks door.

  • Virtual Synchrony

    • {A}{BCD} split; B sees A shoot; so do C, D.
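
The Causal Broadcast case above (no player should see C lock the door before seeing A enter) is commonly enforced with vector timestamps. The sketch below uses that standard technique purely as an illustration; the slides do not say how Isis implements causal broadcast, and the function name and the indexing of processes are assumptions.

```python
from typing import List


def can_deliver(msg_clock: List[int], sender: int, local_clock: List[int]) -> bool:
    """Standard vector-clock test for causal delivery.

    msg_clock[k] counts the broadcasts from process k that causally precede
    (or are) this message. The message is deliverable when it is the next
    one expected from its sender and every broadcast it depends on from
    other processes has already been delivered locally.
    """
    for k, count in enumerate(msg_clock):
        if k == sender:
            if count != local_clock[k] + 1:
                return False
        elif count > local_clock[k]:
            return False
    return True


# Processes indexed A=0, B=1, C=2, D=3 (hypothetical numbering).
enter = [1, 0, 0, 0]   # A's first broadcast: "A enters the room"
lock = [1, 0, 1, 0]    # C's first broadcast, sent after C delivered `enter`

d_clock = [0, 0, 0, 0]                 # D has delivered nothing yet
print(can_deliver(lock, 2, d_clock))   # False: `lock` must wait for `enter`
print(can_deliver(enter, 0, d_clock))  # True

d_clock[0] += 1                        # D delivers `enter`
print(can_deliver(lock, 2, d_clock))   # True: causal predecessor now delivered
```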


Applications

  • Replicated data management

    • State machine replication [Lamport 78], [Schneider 90]

    • Atomic Broadcast provides support

    • Same sequence of actions performed everywhere.

    • Example: Interactive game state machine

  • Stock market

  • Air-traffic control
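
The connection between Atomic Broadcast and state machine replication noted above can be sketched as follows: if every replica applies the same totally ordered sequence of actions to a deterministic state machine, all replicas end in the same state. The toy game state and the action names are made up for illustration.

```python
from typing import Dict, List, Tuple

State = Dict[str, object]
Action = Tuple[str, str]  # (player, command)


def apply_action(state: State, action: Action) -> State:
    """Deterministic transition function of a toy game state machine."""
    player, command = action
    new_state = dict(state)
    if command == "move_away":
        new_state[f"{player}_in_range"] = False
    elif command == "fire_at_B" and state.get("B_in_range", True):
        new_state["B_hit"] = True
    return new_state


def run(log: List[Action]) -> State:
    """Apply a log of actions, in order, to the initial state."""
    state: State = {"B_in_range": True, "B_hit": False}
    for action in log:
        state = apply_action(state, action)
    return state


# Atomic Broadcast gives every replica the same log, hence the same state.
agreed_log = [("A", "fire_at_B"), ("B", "move_away")]
replica_1 = run(agreed_log)
replica_2 = run(agreed_log)
print(replica_1 == replica_2)  # True

# Without a common order, replicas may disagree on whether B was hit.
reordered = run([("B", "move_away"), ("A", "fire_at_B")])
print(replica_1["B_hit"], reordered["B_hit"])  # True False
```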


Transis [Amir, Dolev, Kramer, Malkhi 92]

  • Partitionable group membership

  • When components merge, processes exchange state information.

  • Virtual synchrony reduces amount of data exchanged.

  • Applications

    • Highly available servers

    • Collaborative computing, e.g. shared whiteboard

    • Video, audio conferences

    • Distributed jam sessions

    • Replicated data management [Keidar, Dolev 96]


Other Systems

  • Totem [Amir, Melliar-Smith, Moser, et al., 95]

    • Transitional views, useful with virtual synchrony

  • Horus [Birman, van Renesse, Maffeis 96]

  • Ensemble [Birman, Hayden 97]

    • Layered architecture

    • Composable building blocks

  • Phoenix, Consul, RMP, Newtop, RELACS,…

  • Partitionable


Service Specifications

  • Precise specifications needed for GC services

    • Help application programmers write programs that use the services correctly, effectively

    • Help system maintainers make changes correctly

    • Safety, performance, fault-tolerance

  • But difficult:

    • Many different services; different guarantees about membership, reliability, ordering

    • Complicated

    • Specs based on implementations might not be optimal for application programmers.


Early Work on GC Service Specs

  • [Ricciardi 92]

  • [Jahanian, Fakhouri, Rajkumar 93]

  • [Moser, Amir, Melliar-Smith, Agrawal 94]

  • [Babaoglu et al. 95, 98]

  • [Friedman, van Renesse 95]

  • [Hiltunen, Schlichting 95]

  • [Dolev, Malkhi, Strong 96]

  • [Cristian 96]

  • [Neiger 96]

  • Impossibility results [Chandra, Hadzilacos, et al. 96]

  • But still difficult…


II. Our Approach


Approach

  • Model everything:

    • Applications

      • Requirements, algorithms

    • Service specs

      • Work backwards, see what the applications need

    • Implementations of the services

  • State, prove correctness theorems:

    • For applications, implementations.

    • Methods: Composition, invariants, simulation relations

  • Analyze performance, fault-tolerance.

  • Layered proofs, analyses

[Figure: layered diagram with boxes labeled Application, Service, and Algorithm.]


Math Foundation: I/O Automata

  • Nondeterministic state machines

  • Not necessarily finite-state

  • Input/output/internal actions (signature)

  • Transitions, executions, traces

  • System modularity:

    • Composition, respecting traces

    • Levels of abstraction, respecting traces

  • Language-independent, math model


Typical Examples Modeled

  • Distributed algorithms

  • Communication protocols

  • Distributed data management systems


Modeling Style

  • Describe interfaces, behavior

  • Program-like behavior descriptions:

    • Precondition/effect style

    • Pseudocode or IOA language

  • Abstract models for algorithms, services

  • Model several levels of abstraction:

    • High-level, global service specs

    • Detailed distributed algorithms


Modeling Style

  • Very nondeterministic:

    • Constrain only what must be constrained.

    • Simpler

    • Allows alternative implementations


Describing Timing Features

  • TIOAs [Lynch, Vaandrager 93]

    • For describing:

      • Timeout-based algorithms.

      • Clocks, clock synchronization

      • Performance properties


Describing Failures

  • Basic or timed I/O automata, with fail, recover input actions.

  • Included in traces, can use them in specs.


Describing Other Features

  • Probabilistic behavior: PIOAs [Segala 95]

    • For describing:

      • Systems with combination of probabilistic + nondeterministic behavior

      • Randomized distributed algorithms

      • Probabilistic assumptions on environment

  • Dynamic systems: DIOAs [Attie, Lynch 99]

    • For describing:

      • Run-time process creation and destruction

      • Mobility

      • Agent systems [NTT collaboration]


Using I/O Automata (General)

  • Specify systems precisely

  • Validate designs:

    • Simulation

    • State, prove correctness theorems

    • Analyze performance

  • Generate validated code

  • Study theoretical upper and lower bounds


Using I/O Automata for Group Communication Systems

  • Use for global services + distributed algorithms

  • Define safety properties separately from performance/fault-tolerance properties.

    • Safety:

      • Basic I/O automata; trace properties

    • Performance/fault-tolerance:

      • Timed I/O automata with failure actions; timed trace properties


III. Projects and Results


Projects

1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication


1. View Synchrony (VS) [Fekete, Lynch, Shvartsman 97, 00]

Goals:

  • Develop prototypes:

    • Specifications for typical GC services

    • Descriptions for typical GC algorithms

    • Correctness proofs

    • Performance analyses

  • Design simple math foundation for the area.

  • Try out, evaluate our approach.


View Synchrony

What we did:

  • Talked with system developers (Isis, Transis)

  • Defined I/O automaton models for:

    • VS, prototype partitionable GC service

    • TO, non-view-oriented totally ordered bcast service

    • VStoTO, application algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser]

  • Proved correctness

  • Analyzed performance/fault-tolerance.


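VStoTO Architecture

[Figure: VStoTO processes layered over the VS service, together forming TO. Labeled actions: bcast and brcv at the TO interface; gpsnd, gprcv, and newview at the VS interface.]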


TO Broadcast Specification

Delivers messages to everyone, in the same order.

Safety: TO-Machine

Signature:

  input: bcast(a,p)

  output: brcv(a,p,q)

  internal: to-order(a,p)

State:

  queue, sequence of (a,p), initially empty

  for each p:

    pending[p], sequence of a, initially empty

    next[p], positive integer, initially 1


TO-Machine

Transitions:

bcast(a,p)

  Effect:

    append a to pending[p]

to-order(a,p)

  Precondition:

    a is head of pending[p]

  Effect:

    remove head of pending[p]

    append (a,p) to queue

brcv(a,p,q)

  Precondition:

    queue[next[q]] = (a,p)

  Effect:

    next[q] := next[q] + 1
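
The precondition/effect code for TO-Machine translates almost line for line into an executable class. The sketch below is one possible rendering in Python, not the IOA language itself; the class and method names are assumptions, and preconditions are modeled as *_enabled predicates that a scheduler (here, the caller) must check before firing an action.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


class TOMachine:
    """Executable rendering of the TO-Machine precondition/effect code."""

    def __init__(self) -> None:
        self.queue: List[Tuple[Hashable, str]] = []          # sequence of (a, p)
        self.pending: Dict[str, List[Hashable]] = defaultdict(list)
        self.next: Dict[str, int] = defaultdict(lambda: 1)   # 1-based index into queue

    # Input action: always enabled.
    def bcast(self, a: Hashable, p: str) -> None:
        self.pending[p].append(a)

    # Internal action: move the head of pending[p] into the global order.
    def to_order_enabled(self, a: Hashable, p: str) -> bool:
        return bool(self.pending[p]) and self.pending[p][0] == a

    def to_order(self, a: Hashable, p: str) -> None:
        assert self.to_order_enabled(a, p), "precondition violated"
        self.pending[p].pop(0)
        self.queue.append((a, p))

    # Output action: deliver the next queue entry to receiver q.
    def brcv_enabled(self, a: Hashable, p: str, q: str) -> bool:
        i = self.next[q]
        return i <= len(self.queue) and self.queue[i - 1] == (a, p)

    def brcv(self, a: Hashable, p: str, q: str) -> None:
        assert self.brcv_enabled(a, p, q), "precondition violated"
        self.next[q] += 1


m = TOMachine()
m.bcast("x", "p1")
m.bcast("y", "p2")
m.to_order("y", "p2")     # nondeterminism: the machine may order p2's message first
m.to_order("x", "p1")
m.brcv("y", "p2", "p1")   # every receiver sees y before x
m.brcv("y", "p2", "p2")
m.brcv("x", "p1", "p1")
print(m.queue)                          # [('y', 'p2'), ('x', 'p1')]
print(m.brcv_enabled("x", "p1", "p2"))  # True
print(m.brcv_enabled("x", "p1", "p1"))  # False: p1 already received x
```

Because each receiver q only advances next[q] along the single shared queue, all receivers observe the same order with no gaps, which is the safety guarantee the specification states.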


Performance/Fault-Tolerance

TO-Property(b,d,C): If C stabilizes, then soon thereafter (time b), any message sent or received anywhere in C is received everywhere in C, within bounded time (time d).

[Figure: timeline marking stabilize, send, and receive, with intervals b and d.]
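
To make the shape of TO-Property(b,d,C) concrete, the sketch below checks a finite timed trace against one possible reading of it: if a message is sent or received at time t by a member of C, every member of C must receive it by max(t, t_stab + b) + d, where t_stab is the time at which C stabilizes. That formula, the event encoding, and the function name are assumptions made for illustration; the slide gives only the informal statement.

```python
from typing import Dict, Iterable, Set, Tuple

# (time, kind, message, process); kind is "send" or "receive"
Event = Tuple[float, str, str, str]


def satisfies_to_property(
    events: Iterable[Event], group: Set[str], t_stab: float, b: float, d: float
) -> bool:
    """Check an assumed reading of TO-Property(b, d, C) on a finite timed trace."""
    trace = list(events)
    # Earliest time at which each message is sent or received inside the group.
    first_seen: Dict[str, float] = {}
    for time, _kind, msg, proc in trace:
        if proc in group:
            first_seen[msg] = min(time, first_seen.get(msg, time))
    for msg, t in first_seen.items():
        deadline = max(t, t_stab + b) + d
        for proc in group:
            received = any(
                kind == "receive" and m == msg and p == proc and when <= deadline
                for when, kind, m, p in trace
            )
            if not received:
                return False
    return True


C = {"p", "q"}
trace = [
    (12.0, "send", "m1", "p"),
    (12.5, "receive", "m1", "p"),
    (13.0, "receive", "m1", "q"),
]
print(satisfies_to_property(trace, C, t_stab=10.0, b=1.0, d=2.0))  # True
print(satisfies_to_property(trace, C, t_stab=10.0, b=1.0, d=0.5))  # False
```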


VS Specification

  • Partitionable view-oriented service

  • Safety: VS-Machine

    • Views presented in consistent order, possible gaps

    • Messages respect views

    • Messages in consistent order

    • Causality

    • Prefix property

    • Safe indication

  • Doesn’t guarantee Virtual Synchrony

  • Like TO-Machine, but per view


Performance/Fault-Tolerance

VS-Property(b,d,C): If C stabilizes, then soon thereafter (time b), views known within C become consistent, and messages sent in the final view v are delivered everywhere in C, within bounded time (time d).

[Figure: timeline marking stabilize, newview(v), mcast(v), and receive(v), with intervals b and d.]


VStoTO Algorithm

  • TO must deliver messages in order, no gaps.

  • VS delivers messages in order per view.

  • Problems arise from view changes:

    • Processes moving between views could have different prefixes.

    • Processes could skip views.

  • Algorithm:

    • Real work done in majority views only

    • Processes in majority views totally order messages, and deliver to clients messages that VS has said are safe.

    • At start of new view, processes exchange state, to reconcile progress made in different majority views.


Correctness (Safety) Proof

  • Show composition of VS-Machine and VStoTO machines implements TO-Machine.

  • Trace inclusion

  • Use simulation relation proof:

    • Relate start states, steps of composition to those of TO-Machine

    • Invariants, e.g.:

      Once a message is ordered everywhere in some majority view, its order is determined forever.

  • Checked using PVS theorem-prover, TAME [Archer]


Conditional Performance Analysis

  • Assume VS satisfies VS-Property(b,d,C):

    • If C stabilizes, then within time b, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within time d.

  • And VStoTO satisfies:

    • Simple timing and fault-tolerance assumptions.

  • Then TO satisfies TO-Property(b+d,d,C):

    • If C stabilizes, then within time b+d, any message sent or delivered anywhere in C is delivered everywhere in C, within time d.


Conclusions: VS

  • Models for VS, TO, VStoTO

  • Proofs, performance/f-t analyses

  • Tractable, understandable, modular

  • [PODC 97], [TOCS 00]

  • Follow-on work:

    • Algorithm for VS [Fekete, Lesley]

    • Load balancing using VS [Khazan]

    • Models for other Transis algorithms [Chockler]

  • But: VS is only a prototype; lacks some key features, like Virtual Synchrony

  • Next: Try a real system!


2. Ensemble [Hickey, Lynch, van Renesse 99]

Goals:

  • Try, evaluate our approach on a real system

  • Develop techniques for modeling, verifying, analyzing more features of GC systems, including Virtual Synchrony

  • Improve on prior methods for system validation


Ensemble

  • Ensemble system [Birman, Hayden 97]

    • Virtual Synchrony

    • Layered design, building blocks

    • Coded in ML [Hayden]

  • Prior verification work for Ensemble and predecessors:

    • Proving local properties using Nuprl [Hickey]

    • [Ricciardi], [Friedman]


Ensemble

  • What we did:

    • Worked with developers

    • Followed VS example

    • Developed global specs for key layers:

      • Virtual Synchrony

      • Total Order with Virtual Synchrony

    • Modeled Ensemble algorithm spanning between layers

    • Attempted proof; found logical error in state exchange algorithm (repaired)

    • Developed models, proofs for repaired system


Conclusions: Ensemble

  • Models for two layers, algorithm

  • Tractable, easily understandable by developers

  • Error, proofs

  • Low-level models similar to actual ML code (4 to 1)

  • [TACAS 99]

  • Follow-on:

    • Same error found in Horus.

    • Incremental models, proofs [Hickey]

  • Next: Use our approach to design new services.


3. Dynamic Views [De Prisco, Fekete, Lynch, Shvartsman 98]

Goals:

  • Define GC services that cope with both:

    • Long-term changes:

      • Permanent failures, new joins

      • Changes in the “universe” of processes

    • Transient changes

  • Use these to design consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.


Dynamic Views

  • Many applications with strong consistency requirements make progress only in primary views:

    • Consistent replicated data management

    • Totally ordered broadcast

  • Can use static notion of allowable primaries, e.g., majorities of universe, quorums

    • All intersect.

    • Only one exists at a time.

    • Information can flow from each to the next.

  • But: Static notion not good for long-term changes


Dynamic Views

  • For long-term changes, want dynamic notion of allowable primaries.

  • E.g., each primary might contain majority of previous:

  • But: Some might not intersect.

    Makes it hard to maintain consistency.
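
The difficulty described above is easy to reproduce. In the hypothetical history below over processes A to F (the specific sets are chosen for illustration, not taken from the slides), every primary contains a majority of its immediate predecessor, yet the first and last primaries are disjoint.

```python
from typing import FrozenSet, List


def contains_majority_of(new: FrozenSet[str], prev: FrozenSet[str]) -> bool:
    """True if `new` includes more than half of the members of `prev`."""
    return 2 * len(new & prev) > len(prev)


# Hypothetical history of primaries over processes A..F.
primaries: List[FrozenSet[str]] = [
    frozenset("ABC"),
    frozenset("BCD"),
    frozenset("CDE"),
    frozenset("DEF"),
]

# Each primary contains a majority of its immediate predecessor ...
step_ok = all(contains_majority_of(n, p) for p, n in zip(primaries, primaries[1:]))
print(step_ok)  # True

# ... yet the first and last primaries share no process at all.
print(primaries[0] & primaries[-1])  # frozenset()
```

If a process that still believes {A,B,C} is the latest primary meets one that knows about {D,E,F}, no common member exists to carry state from one to the other, which is why a purely local "majority of previous" rule is not enough.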


Dynamic Views

  • Key problem:

    • Processes may have different opinions about which is the previous primary

    • Could be disjoint.

  • [Yeger-Lotem, Keidar, Dolev 97] algorithm

    • Keeps track of all possible previous primaries.

    • Ensures intersection with all of them.


Dynamic Views

What we did:

  • Defined Dynamic View Service, DVS, based on [YKD]

  • Designed to tolerate long-term failures

  • Membership:

    • Views delivered in consistent order, possible gaps.

    • Ensures new primary intersects all possible previous primaries.

  • Communication:

    • Similar to VS

    • Messages delivered within views,

    • Prefix property, safe notifications.


Dynamic Views

  • What we did, cont’d

    • Modeled, proved implementing algorithm

    • Modeled, proved TO-Broadcast application

    • Distributed implementation [Ingols 00]


Handling Transient Failures: Dynamic Configurations

  • Configuration = Set of processes plus structure, e.g., set of quorums, leader,…

  • Application: Highly available consistent replicated data management:

    • Paxos [Lamport], uses leader, quorums

    • [Attiya, Bar-Noy, Dolev], uses read quorums and write quorums

  • Quorums allow flexibility, availability in the face of transient failures.


Dynamic Configurations [De Prisco, Fekete, Lynch, Shvartsman 99, 00]

  • Combine ideas/benefits of

    • Dynamic views, for long-term failures, and

    • Static configurations, for transient failures

  • Idea:

    • Allow configuration to change (reconfiguration).

    • Each configuration satisfies intersection properties with respect to previous configuration

  • Example:

    • Config = (membership set, read quorums, write quorums)

    • Membership set of new configuration contains read quorum and write quorum of previous configuration
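
The example rule above can be written as a simple predicate on pairs of configurations. The sketch below checks only that rule, against a single previous configuration; the Config type, the function name may_follow, and the sample quorum systems are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet

Quorum = FrozenSet[str]


@dataclass(frozen=True)
class Config:
    """Config = (membership set, read quorums, write quorums)."""
    members: FrozenSet[str]
    read_quorums: FrozenSet[Quorum]
    write_quorums: FrozenSet[Quorum]


def may_follow(new: Config, prev: Config) -> bool:
    """Intersection rule from the example: the new membership set must
    contain at least one read quorum and one write quorum of `prev`."""
    has_read = any(q <= new.members for q in prev.read_quorums)
    has_write = any(q <= new.members for q in prev.write_quorums)
    return has_read and has_write


def fs(names: str) -> FrozenSet[str]:
    """Shorthand: fs("AB") is the set of processes {A, B}."""
    return frozenset(names)


prev = Config(
    members=fs("ABCD"),
    read_quorums=frozenset({fs("AB"), fs("CD")}),
    write_quorums=frozenset({fs("ABC"), fs("ABD")}),
)
ok = Config(fs("ABCE"), frozenset({fs("AB")}), frozenset({fs("ABE")}))
bad = Config(fs("ABE"), frozenset({fs("AB")}), frozenset({fs("ABE")}))

print(may_follow(ok, prev))   # True: contains read quorum AB and write quorum ABC
print(may_follow(bad, prev))  # False: contains no write quorum of prev
```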


Dynamic Configurations

What we did:

  • Defined dynamic configuration service DCS, guaranteeing intersection properties w.r.t. all possible previous configurations.

  • Designed implementing algorithm, extending [YKD]

  • Developed application: Replicated data

    • Dynamic version of [Paxos]

    • Dynamic version of [Attiya, Bar-Noy, Dolev]

    • Tolerate

      • Transient failures, using quorums

      • Longer-term failures, using reconfiguration


Conclusions: Dynamic Views

  • New DVS, DC services for long-term changes in set of processes

  • Applications, implementations

  • Decomposed complex algorithms into tractable pieces:

    • Service specification, implementation, application

    • Static algorithm vs. reconfiguration

  • Couldn’t have done it without the formal framework.

  • [PODC 98], [DISC 99]


4. Scalable Group Communication [Keidar, Khazan 99], [Keidar, Khazan, Lynch, Shvartsman 00]

Goal:

  • Make GC work in wide area networks

What we did:

  • Defined desired properties for GC services

  • Defined spec for scalable group membership service [Keidar, Sussman, Marzullo, Dolev 00], implemented on small set of membership servers


Scalable Group Communication

What we did, cont’d:

  • Developed new, scalable GC algorithms:

    • Use scalable GM service

    • Multicast implemented on clients

    • Efficient: Algorithm for virtual synchrony uses only one round for state exchange, in parallel with membership service’s agreement on views.

    • Processes can join during reconfiguration.

  • Distributed implementation [Tarashchanskiy]


Scalable GC

What we did, cont’d:

  • Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00]

    • Proof Extension Theorem

  • Developed models, proofs (safety and liveness), using the new methods.


Conclusions: Scalable GC

  • Specs, new algorithms, proofs

  • New incremental proof methods

  • Couldn’t have done it without the formal framework.

  • [ICDCS 99], [ICSE 00]


IV. Future Work


Future Work

  • Model, analyze GC services, applications

  • Design new GC services

  • Catalog

  • Compare, evaluate GC services

  • Math foundations

  • Theory  Practice


Practical GC Systems: Current Status [Birman 99]

  • Commercial successes:

    • Stock exchange (Zurich, New York)

    • Air-traffic control (France)

  • Problems:

    • Performance, for strong guarantees like Virtual Synchrony

    • Not integrated with object-oriented programming technologies.

  • Trends:

    • Flexible services

    • Weaker guarantees; better performance

    • Integration with OO technologies, allowing programmers to make tradeoffs.


1. Model, Analyze GC Services

  • Analyze performance of our new services: Dynamic views, Scalable GC

  • Implementations

  • Applications: Replicated data, games, …

  • Compare predicted, observed performance.

  • Other existing services


2. Design New Services

Total Order + QoS [Bar-Joseph, Keidar, Anker, L.]

  • Specs for:

    • Bandwidth reservation service

    • TO Multicast service with QoS (latency, bandwidth)

  • Algorithms implementing TO-QoS using reservation service:

    Algorithm 1: Allows gaps, simple, small added latency

    Algorithm 2: No gaps, more complex, more latency

  • Basic services: Consensus, resource allocation, leader election, spanning trees, overlay networks


3. Catalog of GC Services

  • Service specs

  • Property specs [Chockler, Keidar, Vitenberg 99]

  • Implementing algorithms

  • Prototype applications

  • Lower bounds, impossibility results


4. Compare, Evaluate GC Services

  • Study tradeoffs between strength of ordering and reliability guarantees vs. performance

  • Compare GC services with other reliable multicast algorithms:

    • Scalable Reliable Multicast [Floyd, Jacobson, et al. 95]:

      Unreliable GC (IP Multicast) + retransmission protocol

    • Bimodal Multicast [Birman, Hayden, et al. 99]


5. Math Foundations

  • Models:

    • Timing models

      • For timing assumptions, guarantees, QoS

      • For conditional performance analysis

    • Failure models, probabilistic models, process creation models…

    • Combined models

  • Proof methods:

    • Incremental modeling, proof

    • Conditional performance analysis


Conditional Performance Analysis

  • Idea:

    • Make conditional claims about system behavior, under various assumptions about behavior of environment, network.

    • Include timing, performance, failures.

  • Benefits:

    • Formal performance predictions

    • Says when system makes specific guarantees

      • Normal case + failure cases

      • Parameters, sensitivity analysis

    • Composable

    • Get probabilistic claims as corollaries


CP Analysis: Typical Hypotheses

  • Stabilization of underlying network.

  • Limited rate of change.

  • Bounds on message delay.

  • Limited amount of failure (number, density).

  • Limit input arrivals (number, density).

  • Method allows focus on tractable cases.


Example: Reliable Multicast [Livadas, Keidar, Lynch]

  • Specs for IP Mcast, Reliable Mcast services

  • Automaton model for Scalable Reliable Mcast (SRM) protocol [Floyd, Jacobson, et al. 95]

  • Example:

    • Assume bounds on IP-level message loss, processor failures

    • Prove bounds on:

      • Time from client send until all non-failed clients receive.

      • Amount of traffic generated.


SRM Architecture

[Figure: SRM layered over IPMcast.]


6. Theory  Practice

  • IOA language, tool support for GC services, algorithms

  • Incremental development methods for algorithms, service specs, proofs, analyses

  • Methods for integrating group communication services with object-oriented programming technologies


V. Conclusions


Summary

  • GC services help in programming dynamic distributed systems, though scalability, integration problems remain.

  • Our contributions:

    • Modeling style: Automata + performance properties

    • Techniques: Conditional performance analysis, incremental modeling/proof

    • Models, proofs for key services

    • Discovered errors

    • New services: Dynamic views, scalable GC

  • Mathematical framework makes it possible to design more complex systems correctly.


Future Work

  • Model, analyze GC services, applications

  • Design new services

  • Catalog

  • Compare, evaluate services

  • Math foundations

  • Theory  Practice


Thank you!

