Reliable Group Communication: a Mathematical Approach


Nancy Lynch

Theory of Distributed Systems

MIT LCS

Kansai chapter, IEEE

July 7, 2000



Dynamic Distributed Systems

  • Modern distributed systems are dynamic.

  • Set of clients participating in an application changes, because of:

    • Network, processor failure, recovery

    • Changing client requirements

  • To cope with changes:

    • Use abstract groups of client processes with changing membership sets.

    • Processes communicate with group members by sending messages to the group as a whole.



Group Communication Services

  • Support management of groups

  • Maintain membership info

  • Manage communication

  • Make guarantees about ordering, reliability of message delivery, e.g.:

    • Best-effort: IP Multicast

    • Strong consistency guarantees: Isis, Transis, Ensemble

  • Hide complexity of coping with changes



This Talk

  • Describe

    • Group communication systems

    • A mathematical approach to designing, modeling, analyzing GC systems.

    • Our accomplishments and ideas for future work.

  • Collaborators:

    Idit Keidar, Alan Fekete, Alex Shvartsman, Roger Khazan, Roberto De Prisco, Jason Hickey, Robert van Renesse, Carl Livadas, Ziv Bar-Joseph, Kyle Ingols, Igor Tarashchanskiy



Talk Outline

I. Background: Group Communication

II. Our Approach

III. Projects and Results

1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication

IV. Future Work

V. Conclusions



I. Background: Group Communication



The Setting

  • Dynamic distributed system, changing set of participating clients.

  • Applications:

    • Replicated databases, file systems

    • Distributed interactive games

    • Multi-media conferencing, collaborative work



Groups

  • Abstract, named groups of client processes, changing membership.

  • Client processes send messages to the group (multicast).

  • Early 80s: Group idea used in replicated data management system designs

  • Late 80s: Separate group communication services.



Group Communication Service

  • Communication middleware

  • Manages group membership, current views

    View = membership set + identifier

  • Manages multicast communication among group members

    • Multicasts respect views

    • Guarantees within each view:

      • Reliability constraints

      • Ordering constraints, e.g., FIFO from each sender, causal, common total order

  • Global service
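The notion of a view (membership set plus identifier) and of multicasts that respect views can be sketched in a few lines. This is a toy, single-process illustration only: the class name `ToyGroupService`, the integer view identifiers, and the `delivered` bookkeeping are assumptions made for the example, not part of any system named in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """A view pairs an identifier with a membership set."""
    view_id: int
    members: frozenset


class ToyGroupService:
    """Hypothetical, centralized stand-in for a group communication service."""

    def __init__(self, initial_members):
        self.view = View(0, frozenset(initial_members))
        # (view_id, receiver) -> list of (sender, message) in delivery order
        self.delivered = {}

    def new_view(self, members):
        """Install a new view with a fresh, larger identifier."""
        self.view = View(self.view.view_id + 1, frozenset(members))
        return self.view

    def mcast(self, sender, message):
        """Deliver a message to every member of the current view only."""
        if sender not in self.view.members:
            raise ValueError(f"{sender} is not in view {self.view.view_id}")
        for receiver in sorted(self.view.members):
            key = (self.view.view_id, receiver)
            self.delivered.setdefault(key, []).append((sender, message))


gcs = ToyGroupService({"A", "B", "C", "D"})
gcs.mcast("A", "fire")          # delivered in view 0 to A, B, C, D
gcs.new_view({"B", "C", "D"})   # A drops out of the group
gcs.mcast("B", "move")          # delivered in view 1 to B, C, D only
```

Because deliveries are recorded per view, a message sent in view 1 never reaches a process that is not a member of view 1.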

Group Communication Service

[Figure: Client A and Client B connected to the GCS by mcast, receive, and new-view actions.]



Isis [Birman, Joseph 87]

  • Primary component group membership

  • Several reliable multicast services, different ordering guarantees, e.g.:

    • Atomic Broadcast: Common total order, no gaps

    • Causal Broadcast:

  • When partition is repaired, primary processes send state information to rejoining processes.

  • Virtually Synchronous message delivery



Example: Interactive Game

  • Alice, Bob, Carol, Dan in view {A,B,C,D}

  • Primary component membership

    • {A}{B,C,D} split; only {B,C,D} may continue.

  • Atomic Broadcast

    • A fires, B moves away; need consistent order.



Interactive Game

  • Causal Broadcast

    • C sees A enter a room; locks door.

  • Virtual Synchrony

    • {A}{B,C,D} split; B sees A shoot; so do C, D.



Applications

  • Replicated data management

    • State machine replication [Lamport 78], [Schneider 90]

    • Atomic Broadcast provides support

    • Same sequence of actions performed everywhere.

    • Example: Interactive game state machine

  • Stock market

  • Air-traffic control
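The state machine replication idea above (same sequence of actions performed everywhere) can be illustrated with the interactive game. The sketch below is an assumption-laden toy: `GameReplica`, its action tuples, and the hit rule are invented for the example. It shows that replicas applying the same order agree, while a different order can produce a different outcome, which is why the game needs a common total order.

```python
class GameReplica:
    """Deterministic state machine: same action sequence, same final state."""

    def __init__(self):
        self.positions = {"A": 0, "B": 0}
        self.hits = []

    def apply(self, action):
        kind, player, arg = action
        if kind == "move":
            self.positions[player] = arg
        elif kind == "fire":
            # A shot hits every other player standing on the targeted square.
            for other, pos in self.positions.items():
                if other != player and pos == arg:
                    self.hits.append(other)


def run(actions):
    """Apply actions in the given order on a fresh replica."""
    replica = GameReplica()
    for action in actions:
        replica.apply(action)
    return replica.positions, replica.hits


fire = ("fire", "A", 0)   # A fires at square 0
move = ("move", "B", 5)   # B moves away to square 5

same_order_1 = run([fire, move])
same_order_2 = run([fire, move])
other_order = run([move, fire])
```

Here the two replicas that see `fire` before `move` both record a hit on B, while a replica that sees the opposite order records none.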



Transis [Amir, Dolev, Kramer, Malkhi 92]

  • Partitionable group membership

  • When components merge, processes exchange state information.

  • Virtual synchrony reduces amount of data exchanged.

  • Applications

    • Highly available servers

    • Collaborative computing, e.g. shared whiteboard

    • Video, audio conferences

    • Distributed jam sessions

    • Replicated data management [Keidar, Dolev 96]



Other Systems

  • Totem [Amir, Melliar-Smith, Moser, et al., 95]

    • Transitional views, useful with virtual synchrony

  • Horus [Birman, van Renesse, Maffeis 96]

  • Ensemble [Birman, Hayden 97]

    • Layered architecture

    • Composable building blocks

  • Phoenix, Consul, RMP, Newtop, RELACS,…

  • Partitionable



Service Specifications

  • Precise specifications needed for GC services

    • Help application programmers write programs that use the services correctly, effectively

    • Help system maintainers make changes correctly

    • Safety, performance, fault-tolerance

  • But difficult:

    • Many different services; different guarantees about membership, reliability, ordering

    • Complicated

    • Specs based on implementations might not be optimal for application programmers.



Early Work on GC Service Specs

  • [Ricciardi 92]

  • [Jahanian, Fakhouri, Rajkumar 93]

  • [Moser, Amir, Melliar-Smith, Agrawal 94]

  • [Babaoglu et al. 95, 98]

  • [Friedman, van Renesse 95]

  • [Hiltunen, Schlichting 95]

  • [Dolev, Malkhi, Strong 96]

  • [Cristian 96]

  • [Neiger 96]

  • Impossibility results [Chandra, Hadzilacos, et al. 96]

  • But still difficult…



II. Our Approach



Approach


  • Model everything:

    • Applications

      • Requirements, algorithms

    • Service specs

      • Work backwards, see what the applications need

    • Implementations of the services

  • State, prove correctness theorems:

    • For applications, implementations.

    • Methods: Composition, invariants, simulation relations

  • Analyze performance, fault-tolerance.

  • Layered proofs, analyses

[Figure: layered diagram with boxes labeled Application, Service, and Algorithm.]



Math Foundation: I/O Automata

  • Nondeterministic state machines

  • Not necessarily finite-state

  • Input/output/internal actions (signature)

  • Transitions, executions, traces

  • System modularity:

    • Composition, respecting traces

    • Levels of abstraction, respecting traces

  • Language-independent, math model
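To make the vocabulary above concrete (signature, transitions, executions, traces), here is a minimal sketch of one very small automaton, a reliable FIFO channel. The `Channel` class and the `run` helper are hypothetical names chosen for this example; the code is a plain Python reading of the model, not the IOA language or any tool from the project.

```python
class Channel:
    """Toy I/O automaton: a reliable FIFO channel.

    Signature: input send(m), output receive(m), no internal actions.
    """
    inputs = {"send"}
    outputs = {"receive"}

    def __init__(self):
        self.queue = []  # state: messages in transit

    def enabled(self, action, m):
        # Input actions are always enabled; outputs need their precondition.
        if action == "send":
            return True
        return action == "receive" and bool(self.queue) and self.queue[0] == m

    def step(self, action, m):
        if not self.enabled(action, m):
            raise ValueError(f"{action}({m}) is not enabled")
        if action == "send":
            self.queue.append(m)
        else:
            self.queue.pop(0)


def run(automaton, schedule):
    """Execute a schedule and return its trace (external actions only)."""
    trace = []
    for action, m in schedule:
        automaton.step(action, m)
        if action in automaton.inputs | automaton.outputs:
            trace.append((action, m))
    return trace


trace = run(Channel(), [("send", 1), ("send", 2), ("receive", 1), ("receive", 2)])
```

The nondeterminism of the model shows up as freedom in the schedule: any interleaving is allowed as long as each output's precondition holds when it fires.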



Typical Examples Modeled

  • Distributed algorithms

  • Communication protocols

  • Distributed data management systems



Modeling Style

  • Describe interfaces, behavior

  • Program-like behavior descriptions:

    • Precondition/effect style

    • Pseudocode or IOA language

  • Abstract models for algorithms, services

  • Model several levels of abstraction:

    • High-level, global service specs

    • Detailed distributed algorithms



Modeling Style

  • Very nondeterministic:

    • Constrain only what must be constrained.

    • Simpler

    • Allows alternative implementations



Describing Timing Features

  • TIOAs [Lynch, Vaandrager 93]

    • For describing:

      • Timeout-based algorithms.

      • Clocks, clock synchronization

      • Performance properties



Describing Failures

  • Basic or timed I/O automata, with fail, recover input actions.

  • Included in traces, can use them in specs.



Describing Other Features

  • Probabilistic behavior: PIOAs [Segala 95]

    • For describing:

      • Systems with combination of probabilistic + nondeterministic behavior

      • Randomized distributed algorithms

      • Probabilistic assumptions on environment

  • Dynamic systems: DIOAs [Attie, Lynch 99]

    • For describing:

      • Run-time process creation and destruction

      • Mobility

      • Agent systems [NTT collaboration]



Using I/O Automata (General)

  • Specify systems precisely

  • Validate designs:

    • Simulation

    • State, prove correctness theorems

    • Analyze performance

  • Generate validated code

  • Study theoretical upper and lower bounds



Using I/O Automata for Group Communication Systems

  • Use for global services + distributed algorithms

  • Define safety properties separately from performance/fault-tolerance properties.

    • Safety:

      • Basic I/O automata; trace properties

    • Performance/fault-tolerance:

      • Timed I/O automata with failure actions; timed trace properties



III. Projects and Results



Projects

1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication



1. View Synchrony (VS) [Fekete, Lynch, Shvartsman 97, 00]

Goals:

  • Develop prototypes:

    • Specifications for typical GC services

    • Descriptions for typical GC algorithms

    • Correctness proofs

    • Performance analyses

  • Design simple math foundation for the area.

  • Try out, evaluate our approach.



View Synchrony

What we did:

  • Talked with system developers (Isis, Transis)

  • Defined I/O automaton models for:

    • VS, prototype partitionable GC service

    • TO, non-view-oriented totally ordered bcast service

    • VStoTO, application algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser]

  • Proved correctness

  • Analyzed performance/fault-tolerance.


VStoTO Architecture

[Figure: the VStoTO processes offer bcast and brcv at the TO interface and use gpsnd, gprcv, and newview at the VS interface below them.]



TO Broadcast Specification

Delivers messages to everyone, in the same order.

Safety: TO-Machine

Signature:

  input: bcast(a,p)
  output: brcv(a,p,q)
  internal: to-order(a,p)

State:

  queue, sequence of (a,p), initially empty
  for each p:
    pending[p], sequence of a, initially empty
    next[p], positive integer, initially 1

Transitions:

  bcast(a,p)
    Effect:
      append a to pending[p]

  to-order(a,p)
    Precondition:
      a is head of pending[p]
    Effect:
      remove head of pending[p]
      append (a,p) to queue

  brcv(a,p,q)
    Precondition:
      queue[next[q]] = (a,p)
    Effect:
      next[q] := next[q] + 1

TO-Machine
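The precondition/effect description of the TO-Machine translates almost line for line into executable form. The sketch below is one possible Python reading, under the assumptions that processes are identified by strings, that the spec's 1-based `next[p]` indexes a 0-based Python list, and that a violated precondition is reported with `ValueError`; the class name `TOMachine` is chosen for the example.

```python
class TOMachine:
    """Executable reading of the TO-Machine; next[p] is 1-based as in the spec."""

    def __init__(self, processes):
        self.queue = []                              # sequence of (a, p)
        self.pending = {p: [] for p in processes}    # per-sender pending messages
        self.next = {p: 1 for p in processes}        # next queue index to deliver

    def bcast(self, a, p):
        # Input action: always enabled.
        self.pending[p].append(a)

    def to_order(self, a, p):
        # Internal action. Precondition: a is head of pending[p].
        if not self.pending[p] or self.pending[p][0] != a:
            raise ValueError("to-order precondition failed")
        self.pending[p].pop(0)
        self.queue.append((a, p))

    def brcv(self, a, p, q):
        # Output action. Precondition: queue[next[q]] = (a, p).
        i = self.next[q]
        if i > len(self.queue) or self.queue[i - 1] != (a, p):
            raise ValueError("brcv precondition failed")
        self.next[q] = i + 1


to = TOMachine(["p", "q"])
to.bcast("x", "p")
to.bcast("y", "q")
to.to_order("y", "q")      # the machine may order y before x
to.to_order("x", "p")
to.brcv("y", "q", "p")     # p receives y first
to.brcv("y", "q", "q")     # q receives y first as well
to.brcv("x", "p", "q")     # q then receives x
```

The order in which `to_order` fires is left open, which is the nondeterminism the spec intends; once fixed, every receiver consumes the same `queue` from the front, so no receiver can see a gap or a different order.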



Performance/Fault-Tolerance

TO-Property(b,d,C): If C stabilizes, then soon thereafter (time b), any message sent or received anywhere in C is received everywhere in C, within bounded time (time d).

[Figure: timeline marking stabilize, send, and receive, with intervals b and d.]



VS Specification

  • Partitionable view-oriented service

  • Safety: VS-Machine

    • Views presented in consistent order, possible gaps

    • Messages respect views

    • Messages in consistent order

    • Causality

    • Prefix property

    • Safe indication

  • Doesn’t guarantee Virtual Synchrony

  • Like TO-Machine, but per view


Performance/Fault-Tolerance

VS-Property(b,d,C):

If C stabilizes, then soon thereafter (time b), views known within C become consistent, and messages sent in the final view v are delivered everywhere in C, within bounded time (time d).

[Figure: timeline marking stabilize, newview(v), mcast(v), and receive(v), with intervals b and d.]



VStoTO Algorithm

  • TO must deliver messages in order, no gaps.

  • VS delivers messages in order per view.

  • Problems arise from view changes:

    • Processes moving between views could have different prefixes.

    • Processes could skip views.

  • Algorithm:

    • Real work done in majority views only

    • Processes in majority views totally order messages, and deliver to clients messages that VS has said are safe.

    • At start of new view, processes exchange state, to reconcile progress made in different majority views.
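The restriction to majority views can be stated as a one-line test. The sketch below assumes a fixed, known universe of processes and a strict majority; the helper name `is_majority_view` is hypothetical. It covers only the membership test, not the ordering or state-exchange steps of the algorithm.

```python
def is_majority_view(view_members, universe):
    """True if the view contains a strict majority of the fixed universe."""
    return 2 * len(set(view_members) & set(universe)) > len(universe)


universe = {"A", "B", "C", "D"}
left, right = {"A"}, {"B", "C", "D"}   # the {A}{B,C,D} split from the game example
```

Any two strict majorities of the same universe share at least one process, which is what lets state flow from one majority view to the next.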



Correctness (Safety) Proof

  • Show composition of VS-Machine and VStoTO machines implements TO-Machine.

  • Trace inclusion

  • Use simulation relation proof:

    • Relate start states, steps of composition to those of TO-Machine

    • Invariants, e.g.: Once a message is ordered everywhere in some majority view, its order is determined forever.

  • Checked using PVS theorem-prover, TAME [Archer]

[Figure: the composition related to TO.]



Conditional Performance Analysis

  • Assume VS satisfies VS-Property(b,d,C):

    • If C stabilizes, then within time b, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within time d.

  • And VStoTO satisfies:

    • Simple timing and fault-tolerance assumptions.

  • Then TO satisfies TO-Property(b+d,d,C):

    • If C stabilizes, then within time b+d, any message sent or delivered anywhere in C is delivered everywhere in C, within time d.



Conclusions: VS

  • Models for VS, TO, VStoTO

  • Proofs, performance/f-t analyses

  • Tractable, understandable, modular

  • [PODC 97], [TOCS 00]

  • Follow-on work:

    • Algorithm for VS [Fekete, Lesley]

    • Load balancing using VS [Khazan]

    • Models for other Transis algorithms [Chockler]

  • But: VS is only a prototype; lacks some key features, like Virtual Synchrony

  • Next: Try a real system!



2. Ensemble [Hickey, Lynch, van Renesse 99]

Goals:

  • Try, evaluate our approach on a real system

  • Develop techniques for modeling, verifying, analyzing more features of GC systems, including Virtual Synchrony

  • Improve on prior methods for system validation



Ensemble

  • Ensemble system [Birman, Hayden 97]

    • Virtual Synchrony

    • Layered design, building blocks

    • Coded in ML [Hayden]

  • Prior verification work for Ensemble and predecessors:

    • Proving local properties using Nuprl [Hickey]

    • [Ricciardi], [Friedman]



Ensemble

  • What we did:

    • Worked with developers

    • Followed VS example

    • Developed global specs for key layers:

      • Virtual Synchrony

      • Total Order with Virtual Synchrony

    • Modeled Ensemble algorithm spanning between layers

    • Attempted proof; found logical error in state exchange algorithm (repaired)

    • Developed models, proofs for repaired system



Conclusions: Ensemble

  • Models for two layers, algorithm

  • Tractable, easily understandable by developers

  • Error, proofs

  • Low-level models similar to actual ML code (4 to 1)

  • [TACAS 99]

  • Follow-on:

    • Same error found in Horus.

    • Incremental models, proofs [Hickey]

  • Next: Use our approach to design new services.



3. Dynamic Views [De Prisco, Fekete, Lynch, Shvartsman 98]

Goals:

  • Define GC services that cope with both:

    • Long-term changes:

      • Permanent failures, new joins

      • Changes in the “universe” of processes

    • Transient changes

  • Use these to design consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.



Dynamic Views

  • Many applications with strong consistency requirements make progress only in primary views:

    • Consistent replicated data management

    • Totally ordered broadcast

  • Can use static notion of allowable primaries, e.g., majorities of universe, quorums

    • All intersect.

    • Only one exists at a time.

    • Information can flow from each to the next.

  • But: Static notion not good for long-term changes



Dynamic Views

  • For long-term changes, want dynamic notion of allowable primaries.

  • E.g., each primary might contain a majority of the previous one.

  • But: Some might not intersect.

    Makes it hard to maintain consistency.
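A small worked example shows how the "majority of the previous primary" rule can yield primaries that do not intersect. The process names and the particular sequence of views below are invented for illustration, and `is_majority_of` is a hypothetical helper.

```python
def is_majority_of(candidate, previous):
    """True if candidate contains a strict majority of previous's members."""
    return 2 * len(candidate & previous) > len(previous)


p0 = frozenset("ABCDE")   # initial primary
p1 = frozenset("ABC")     # majority of p0
p2 = frozenset("AB")      # majority of p1
# Formed by processes that still regard p0 as the latest primary:
rival = frozenset("CDE")  # also a majority of p0
```

Each step satisfies the rule locally, yet `p2` and `rival` have no process in common, so neither can learn what the other did.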



Dynamic Views

  • Key problem:

    • Processes may have different opinions about which is the previous primary

    • Could be disjoint.

  • [Yeger-Lotem, Keidar, Dolev 97] algorithm

    • Keeps track of all possible previous primaries.

    • Ensures intersection with all of them.
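The idea of checking a candidate against every possible previous primary can be sketched as follows. This is not the published algorithm: the slide only says that intersection is ensured, and the sketch assumes one particular way of ensuring it, namely requiring a strict majority of each possible previous primary. The function name and the example sets are hypothetical.

```python
def may_become_primary(candidate, possible_previous):
    """Accept candidate only if it holds a strict majority of every
    primary that might have been the previous one (assumed rule)."""
    return all(2 * len(candidate & prev) > len(prev) for prev in possible_previous)


p0 = frozenset("ABCDE")
p1 = frozenset("ABC")
# Some processes are unsure whether p1 was ever established, so both count.
possible_previous = [p0, p1]
```

Under this assumed rule, `{A,B}` and `{C,D,E}` are both rejected, while `{A,B,D}` is accepted, so two accepted candidates can no longer be disjoint.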



Dynamic Views

What we did:

  • Defined Dynamic View Service, DVS, based on [YKD]

  • Designed to tolerate long-term failures

  • Membership:

    • Views delivered in consistent order, possible gaps.

    • Ensures new primary intersects all possible previous primaries.

  • Communication:

    • Similar to VS

    • Messages delivered within views

    • Prefix property, safe notifications.



Dynamic Views

  • What we did, cont’d

    • Modeled, proved implementing algorithm

    • Modeled, proved TO-Broadcast application

    • Distributed implementation [Ingols 00]



Handling Transient Failures: Dynamic Configurations

  • Configuration = Set of processes plus structure, e.g., set of quorums, leader,…

  • Application: Highly available consistent replicated data management:

    • Paxos [Lamport], uses leader, quorums

    • [Attiya, Bar-Noy, Dolev], uses read quorums and write quorums

  • Quorums allow flexibility, availability in the face of transient failures.



Dynamic Configurations [De Prisco, Fekete, Lynch, Shvartsman 99, 00]

  • Combine ideas/benefits of

    • Dynamic views, for long-term failures, and

    • Static configurations, for transient failures

  • Idea:

    • Allow configuration to change (reconfiguration).

    • Each configuration satisfies intersection properties with respect to previous configuration

  • Example:

    • Config = (membership set, read quorums, write quorums)

    • Membership set of new configuration contains read quorum and write quorum of previous configuration
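The example condition on successive configurations can be written down directly. In the sketch below, `Config`, `majorities`, `make_config`, and `may_follow` are hypothetical names, and using majorities as both read and write quorums is an assumption made only to get a small concrete quorum system.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Config:
    members: frozenset
    read_quorums: tuple    # tuple of frozensets
    write_quorums: tuple   # tuple of frozensets


def majorities(members):
    """All smallest strict-majority subsets of members (illustrative quorums)."""
    size = len(members) // 2 + 1
    return tuple(frozenset(c) for c in combinations(sorted(members), size))


def make_config(members):
    members = frozenset(members)
    quorums = majorities(members)
    return Config(members, quorums, quorums)


def may_follow(new, old):
    """New membership must contain a read quorum and a write quorum of old."""
    has_read = any(q <= new.members for q in old.read_quorums)
    has_write = any(q <= new.members for q in old.write_quorums)
    return has_read and has_write


old = make_config("ABC")
good = make_config("ABD")   # contains the old quorum {A,B}
bad = make_config("CDE")    # contains no quorum of the old configuration
```

With these definitions `may_follow(good, old)` holds and `may_follow(bad, old)` does not.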



Dynamic Configurations

What we did:

  • Defined dynamic configuration service DCS, guaranteeing intersection properties w.r.t. all possible previous configurations.

  • Designed implementing algorithm, extending [YKD]

  • Developed application: Replicated data

    • Dynamic version of [Paxos]

    • Dynamic version of [Attiya, Bar-Noy, Dolev]

    • Tolerate

      • Transient failures, using quorums

      • Longer-term failures, using reconfiguration



Conclusions: Dynamic Views

  • New DVS, DC services for long-term changes in set of processes

  • Applications, implementations

  • Decomposed complex algorithms into tractable pieces:

    • Service specification, implementation, application

    • Static algorithm vs. reconfiguration

  • Couldn’t have done it without the formal framework.

  • [PODC 98], [DISC 99]


4. Scalable Group Communication [Keidar, Khazan 99], [Keidar, Khazan, Lynch, Shvartsman 00]

Goal:

  • Make GC work in wide area networks

What we did:

  • Defined desired properties for GC services

  • Defined spec for scalable group membership service [Keidar, Sussman, Marzullo, Dolev 00], implemented on small set of membership servers



Scalable Group Communication

What we did, cont’d:

  • Developed new, scalable GC algorithms:

    • Use scalable GM service

    • Multicast implemented on clients

    • Efficient: Algorithm for virtual synchrony uses only one round for state exchange, in parallel with membership service’s agreement on views.

    • Processes can join during reconfiguration.

  • Distributed implementation [Tarashchanskiy]



Scalable GC

What we did, cont’d:

  • Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00]

    • Proof Extension Theorem

  • Developed models, proofs (safety and liveness), using the new methods.

[Figure: diagram relating S, S’, A, and A’.]



Conclusions: Scalable GC

  • Specs, new algorithms, proofs

  • New incremental proof methods

  • Couldn’t have done it without the formal framework.

  • [ICDCS 99], [ICSE 00]



IV. Future Work



Future Work

  • Model, analyze GC services, applications

  • Design new GC services

  • Catalog

  • Compare, evaluate GC services

  • Math foundations

  • Theory and Practice



Practical GC Systems: Current Status [Birman 99]

  • Commercial successes:

    • Stock exchange (Zurich, New York)

    • Air-traffic control (France)

  • Problems:

    • Performance, for strong guarantees like Virtual Synchrony

    • Not integrated with object-oriented programming technologies.

  • Trends:

    • Flexible services

    • Weaker guarantees; better performance

    • Integration with OO technologies, allowing programmers to make tradeoffs.


1. Model, Analyze GC Services

  • Analyze performance of our new services: Dynamic views, Scalable GC

  • Implementations

  • Applications: Replicated data, games, …

  • Compare predicted, observed performance.

  • Other existing services



2. Design New Services

Total Order + QoS [Bar-Joseph, Keidar, Anker, L.]

  • Specs for:

    • Bandwidth reservation service

    • TO Multicast service with QoS (latency, bandwidth)

  • Algorithms implementing TO-QoS using reservation service:

    Algorithm 1: Allows gaps, simple, small added latency

    Algorithm 2: No gaps, more complex, more latency

    Basic services: Consensus, resource allocation, leader election, spanning trees, overlay networks



3. Catalog of GC Services

  • Service specs

  • Property specs [Chockler, Keidar, Vitenberg 99]

  • Implementing algorithms

  • Prototype applications

  • Lower bounds, impossibility results



4. Compare, Evaluate GC Services

  • Study tradeoffs between strength of ordering and reliability guarantees vs. performance

  • Compare GC services with other reliable multicast algorithms:

    • Scalable Reliable Multicast [Floyd, Jacobson, et al. 95]:

      Unreliable GC (IP Multicast) + retransmission protocol

    • Bimodal Multicast [Birman, Hayden, et al. 99]



5. Math Foundations

  • Models:

    • Timing models

      • For timing assumptions, guarantees, QoS

      • For conditional performance analysis

    • Failure models, probabilistic models, process creation models…

    • Combined models

  • Proof methods:

    • Incremental modeling, proof

    • Conditional performance analysis



Conditional Performance Analysis

  • Idea:

    • Make conditional claims about system behavior, under various assumptions about behavior of environment, network.

    • Include timing, performance, failures.

  • Benefits:

    • Formal performance predictions

    • Says when system makes specific guarantees

      • Normal case + failure cases

      • Parameters, sensitivity analysis

    • Composable

    • Get probabilistic claims as corollaries



CP Analysis: Typical Hypotheses

  • Stabilization of underlying network.

  • Limited rate of change.

  • Bounds on message delay.

  • Limited amount of failure (number, density).

  • Limit input arrivals (number, density).

  • Method allows focus on tractable cases.



Example: Reliable Multicast [Livadas, Keidar, Lynch]

  • Specs for IP Mcast, Reliable Mcast services

  • Automaton model for Scalable Reliable Mcast (SRM) protocol [Floyd, Jacobson, et al. 95]

  • Example:

    • Assume bounds on IP-level message loss, processor failures

    • Prove bounds on:

      • Time from client send until all non-failed clients receive.

      • Amount of traffic generated.


SRM Architecture

[Figure: SRM layered over IPMcast.]



6. Theory and Practice

  • IOA language, tool support for GC services, algorithms

  • Incremental development methods for algorithms, service specs, proofs, analyses

  • Methods for integrating group communication services with object-oriented programming technologies



V. Conclusions



Summary

  • GC services help in programming dynamic distributed systems, though scalability, integration problems remain.

  • Our contributions:

    • Modeling style: Automata + performance properties

    • Techniques: Conditional performance analysis, incremental modeling/proof

    • Models, proofs for key services

    • Discovered errors

    • New services: Dynamic views, scalable GC

  • Mathematical framework makes it possible to design more complex systems correctly.



Future Work

  • Model, analyze GC services, applications

  • Design new services

  • Catalog

  • Compare, evaluate services

  • Math foundations

  • Theory and Practice



Thank you!

