Reliable Group Communication: a Mathematical Approach

Nancy Lynch, Theory of Distributed Systems, MIT LCS. Kansai chapter, IEEE, July 7, 2000.

Reliable Group Communication: a Mathematical Approach

Nancy Lynch

Theory of Distributed Systems

MIT LCS

Kansai chapter, IEEE

July 7, 2000

Dynamic Distributed Systems
  • Modern distributed systems are dynamic.
  • Set of clients participating in an application changes, because of:
    • Network, processor failure, recovery
    • Changing client requirements
  • To cope with changes:
    • Use abstract groups of client processes with changing membership sets.
    • Processes communicate with group members by sending messages to the group as a whole.
Group Communication Services
  • Support management of groups
  • Maintain membership info
  • Manage communication
  • Make guarantees about ordering, reliability of message delivery, e.g.:
    • Best-effort: IP Multicast
    • Strong consistency guarantees: Isis, Transis, Ensemble
  • Hide complexity of coping with changes
This Talk
  • Describe
    • Group communication systems
    • A mathematical approach to designing, modeling, analyzing GC systems.
    • Our accomplishments and ideas for future work.
  • Collaborators:

Idit Keidar, Alan Fekete, Alex Shvartsman, Roger Khazan, Roberto De Prisco, Jason Hickey, Robert van Renesse, Carl Livadas, Ziv Bar-Joseph, Kyle Ingols, Igor Tarashchanskiy

Talk Outline

I. Background: Group Communication

II. Our Approach

III. Projects and Results

1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication

IV. Future Work

V. Conclusions

The Setting
  • Dynamic distributed system, changing set of participating clients.
  • Applications:
    • Replicated databases, file systems
    • Distributed interactive games
    • Multi-media conferencing, collaborative work
Groups
  • Abstract, named groups of client processes, changing membership.
  • Client processes send messages to the group (multicast).
  • Early 80s: Group idea used in replicated data management system designs
  • Late 80s: Separate group communication services.
Group Communication Service
  • Communication middleware
  • Manages group membership, current views
    • View = membership set + identifier
  • Manages multicast communication among group members
    • Multicasts respect views
    • Guarantees within each view:
      • Reliability constraints
      • Ordering constraints, e.g., FIFO from each sender, causal, common total order
  • Global service

Group Communication Service

[Figure: GCS with Client A and Client B; interface actions: mcast, receive, new-view]

Isis [Birman, Joseph 87]
  • Primary component group membership
  • Several reliable multicast services, different ordering guarantees, e.g.:
    • Atomic Broadcast: Common total order, no gaps
    • Causal Broadcast:
  • When partition is repaired, primary processes send state information to rejoining processes.
  • Virtually Synchronous message delivery
Example: Interactive Game
  • Alice, Bob, Carol, Dan in view {A,B,C,D}
  • Primary component membership
    • {A}{B,C,D} split; only {B,C,D} may continue.
  • Atomic Broadcast
    • A fires, B moves away; need consistent order.
Interactive Game
  • Causal Broadcast
    • C sees A enter a room; locks door.
  • Virtual Synchrony
    • {A}{BCD} split; B sees A shoot; so do C, D.

Applications
  • Replicated data management
    • State machine replication [Lamport 78], [Schneider 90]
    • Atomic Broadcast provides support
    • Same sequence of actions performed everywhere.
    • Example: Interactive game state machine
  • Stock market
  • Air-traffic control
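The state machine replication point above can be illustrated with a minimal sketch. The `GameReplica` class, its action format, and the room names are hypothetical, chosen only to mirror the interactive game example; the list `total_order` stands in for whatever sequence an atomic broadcast service would deliver.

```python
from dataclasses import dataclass, field


@dataclass
class GameReplica:
    """Hypothetical replica of a tiny game state machine (illustrative names)."""
    positions: dict = field(default_factory=dict)
    hits: list = field(default_factory=list)

    def apply(self, action):
        kind, player, room = action
        if kind == "move":
            self.positions[player] = room
        elif kind == "fire":
            # A shot hits every other player currently in the targeted room.
            for other, where in sorted(self.positions.items()):
                if other != player and where == room:
                    self.hits.append((player, other))


def run(actions):
    """Apply a sequence of actions to a fresh replica and return it."""
    replica = GameReplica()
    for action in actions:
        replica.apply(action)
    return replica


# Stand-in for the common total order delivered by atomic broadcast.
total_order = [("move", "A", "hall"), ("move", "B", "hall"),
               ("fire", "A", "hall"), ("move", "B", "yard")]

replica_1 = run(total_order)
replica_2 = run(total_order)
same_state = replica_1 == replica_2      # same sequence, same state

# If a replica saw B move away before A fired, the outcomes would differ.
reordered = [total_order[0], total_order[1], total_order[3], total_order[2]]
diverged = run(reordered).hits != replica_1.hits
```

The sketch only shows why a consistent order matters; it does not model the broadcast service itself.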
Transis [Amir, Dolev, Kramer, Malkhi 92]
  • Partitionable group membership
  • When components merge, processes exchange state information.
  • Virtual synchrony reduces amount of data exchanged.
  • Applications
    • Highly available servers
    • Collaborative computing, e.g. shared whiteboard
    • Video, audio conferences
    • Distributed jam sessions
    • Replicated data management [Keidar, Dolev 96]
Other Systems
  • Totem [Amir, Melliar-Smith, Moser, et al., 95]
    • Transitional views, useful with virtual synchrony
  • Horus [Birman, van Renesse, Maffeis 96]
  • Ensemble [Birman, Hayden 97]
    • Layered architecture
    • Composable building blocks
  • Phoenix, Consul, RMP, Newtop, RELACS, …
  • Partitionable
Service Specifications
  • Precise specifications needed for GC services
    • Help application programmers write programs that use the services correctly, effectively
    • Help system maintainers make changes correctly
    • Safety, performance, fault-tolerance
  • But difficult:
    • Many different services; different guarantees about membership, reliability, ordering
    • Complicated
    • Specs based on implementations might not be optimal for application programmers.
Early Work on GC Service Specs
  • [Ricciardi 92]
  • [Jahanian, Fakhouri, Rajkumar 93]
  • [Moser, Amir, Melliar-Smith, Agrawal 94]
  • [Babaoglu et al. 95, 98]
  • [Friedman, van Renesse 95]
  • [Hiltunen, Schlichting 95]
  • [Dolev, Malkhi, Strong 96]
  • [Cristian 96]
  • [Neiger 96]
  • Impossibility results [Chandra, Hadzilacos, et al. 96]
  • But still difficult…
Approach

  • Model everything:
    • Applications
      • Requirements, algorithms
    • Service specs
      • Work backwards, see what the applications need
    • Implementations of the services
  • State, prove correctness theorems:
    • For applications, implementations.
    • Methods: Composition, invariants, simulation relations
  • Analyze performance, fault-tolerance.
  • Layered proofs, analyses

[Figure: layered diagram with labels Application, Service, Algorithm]

Math Foundation: I/O Automata
  • Nondeterministic state machines
  • Not necessarily finite-state
  • Input/output/internal actions (signature)
  • Transitions, executions, traces
  • System modularity:
    • Composition, respecting traces
    • Levels of abstraction, respecting traces
  • Language-independent, math model
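As a rough illustration of the ingredients listed above (signature, state, transitions, executions, traces), here is a minimal sketch of one I/O automaton in Python. The `Channel` automaton and the helper `trace` are assumptions made for this example; they are not taken from the talk or from the IOA language.

```python
class Channel:
    """Toy I/O automaton: a reliable FIFO channel (illustrative, not from the talk)."""
    inputs = {"send"}        # input actions: always enabled
    outputs = {"receive"}    # output actions: controlled by the automaton
    internals = set()        # no internal actions in this toy example

    def __init__(self):
        self.queue = []      # state: messages in transit, possibly unbounded

    def enabled(self, action):
        name, msg = action
        if name in self.inputs:
            return True
        # Precondition of receive(msg): msg is at the head of the queue.
        return name == "receive" and bool(self.queue) and self.queue[0] == msg

    def step(self, action):
        if not self.enabled(action):
            raise ValueError(f"{action} is not enabled")
        name, msg = action
        if name == "send":
            self.queue.append(msg)   # effect of send
        else:
            self.queue.pop(0)        # effect of receive


def trace(automaton, actions):
    """Execute the actions in order and return the external ones (the trace)."""
    external = automaton.inputs | automaton.outputs
    observed = []
    for action in actions:
        automaton.step(action)
        if action[0] in external:
            observed.append(action)
    return observed


fifo_trace = trace(Channel(), [("send", 1), ("send", 2), ("receive", 1), ("receive", 2)])

blocked = Channel()
blocked.step(("send", 1))
blocked.step(("send", 2))
can_skip_ahead = blocked.enabled(("receive", 2))   # FIFO order forbids this
```

A real model would also cover nondeterministic choice among enabled actions and composition of several automata; this sketch fixes one execution by hand.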
Typical Examples Modeled
  • Distributed algorithms
  • Communication protocols
  • Distributed data management systems
Modeling Style
  • Describe interfaces, behavior
  • Program-like behavior descriptions:
    • Precondition/effect style
    • Pseudocode or IOA language
  • Abstract models for algorithms, services
  • Model several levels of abstraction:
    • High-level, global service specs
    • Detailed distributed algorithms
Modeling Style
  • Very nondeterministic:
    • Constrain only what must be constrained.
    • Simpler
    • Allows alternative implementations
Describing Timing Features
  • TIOAs [Lynch, Vaandrager 93]
    • For describing:
      • Timeout-based algorithms.
      • Clocks, clock synchronization
      • Performance properties
Describing Failures
  • Basic or timed I/O automata, with fail, recover input actions.
  • Included in traces, can use them in specs.
Describing Other Features
  • Probabilistic behavior: PIOAs [Segala 95]
    • For describing:
      • Systems with combination of probabilistic + nondeterministic behavior
      • Randomized distributed algorithms
      • Probabilistic assumptions on environment
  • Dynamic systems: DIOAs [Attie, Lynch 99]
    • For describing:
      • Run-time process creation and destruction
      • Mobility
      • Agent systems [NTT collaboration]
Using I/O Automata (General)
  • Specify systems precisely
  • Validate designs:
    • Simulation
    • State, prove correctness theorems
    • Analyze performance
  • Generate validated code
  • Study theoretical upper and lower bounds
Using I/O Automata for Group Communication Systems
  • Use for global services + distributed algorithms
  • Define safety properties separately from performance/fault-tolerance properties.
    • Safety:
      • Basic I/O automata; trace properties
    • Performance/fault-tolerance:
      • Timed I/O automata with failure actions; timed trace properties
Projects

1. View Synchrony

2. Ensemble

3. Dynamic Views

4. Scalable Group Communication

1. View Synchrony (VS) [Fekete, Lynch, Shvartsman 97, 00]

Goals:

  • Develop prototypes:
    • Specifications for typical GC services
    • Descriptions for typical GC algorithms
    • Correctness proofs
    • Performance analyses
  • Design simple math foundation for the area.
  • Try out, evaluate our approach.
View Synchrony

What we did:

  • Talked with system developers (Isis, Transis)
  • Defined I/O automaton models for:
    • VS, prototype partitionable GC service
    • TO, non-view-oriented totally ordered bcast service
    • VStoTO, application algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser]
  • Proved correctness
  • Analyzed performance/fault-tolerance.
VStoTO Architecture

[Figure: TO service (bcast, brcv) built from VStoTO processes layered over the VS service (gpsnd, gprcv, newview)]

TO Broadcast Specification

Delivers messages to everyone, in the same order.

Safety: TO-Machine

Signature:
  input: bcast(a,p)
  output: brcv(a,p,q)
  internal: to-order(a,p)

State:
  queue, sequence of (a,p), initially empty
  for each p:
    pending[p], sequence of a, initially empty
    next[p], positive integer, initially 1

Transitions:
  bcast(a,p)
    Effect:
      append a to pending[p]

  to-order(a,p)
    Precondition:
      a is head of pending[p]
    Effect:
      remove head of pending[p]
      append (a,p) to queue

  brcv(a,p,q)
    Precondition:
      queue[next[q]] = (a,p)
    Effect:
      next[q] := next[q] + 1

TO-Machine
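The TO-Machine specification above translates almost line for line into executable form. The following is a minimal sketch, assuming Python lists and dictionaries for the state; the class name `TOMachine` and the `*_enabled` helper methods are introduced here for illustration. Note that the specification indexes `queue` from 1, so the code subtracts 1.

```python
from collections import defaultdict


class TOMachine:
    """Sketch of the TO-Machine safety specification (names are illustrative)."""

    def __init__(self):
        self.queue = []                      # sequence of (a, p), initially empty
        self.pending = defaultdict(list)     # pending[p]: sequence of a
        self.next = defaultdict(lambda: 1)   # next[p]: positive integer, initially 1

    def bcast(self, a, p):
        """Input action: always enabled."""
        self.pending[p].append(a)

    def to_order_enabled(self, a, p):
        return bool(self.pending[p]) and self.pending[p][0] == a

    def to_order(self, a, p):
        """Internal action: move the head of pending[p] into the global order."""
        if not self.to_order_enabled(a, p):
            raise ValueError("precondition of to-order does not hold")
        self.pending[p].pop(0)
        self.queue.append((a, p))

    def brcv_enabled(self, a, p, q):
        i = self.next[q]                     # 1-based position in queue
        return i <= len(self.queue) and self.queue[i - 1] == (a, p)

    def brcv(self, a, p, q):
        """Output action: deliver the next queued message to q."""
        if not self.brcv_enabled(a, p, q):
            raise ValueError("precondition of brcv does not hold")
        self.next[q] += 1


m = TOMachine()
m.bcast("x", "p1")
m.bcast("y", "p2")
m.to_order("y", "p2")        # the machine may order y before x
m.to_order("x", "p1")
m.brcv("y", "p2", "p1")      # p1 receives y first, as every receiver must
```

Because each receiver q advances through the same `queue` one position at a time, every receiver sees the same order with no gaps.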
Performance/Fault-Tolerance

TO-Property(b,d,C): If C stabilizes, then soon thereafter (time b), any message sent or received anywhere in C is received everywhere in C, within bounded time (time d).
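One way to make the informal statement of TO-Property(b,d,C) precise is sketched below. The symbols \ell (the time at which C stabilizes) and t (the time of a send or receive event) are notation introduced here, and the max form is an assumed reading of "soon thereafter", not a quotation of the published definition.

```latex
% Assumed notation: \ell = time at which C stabilizes,
% t = time at which some member of C sends or receives message m.
\[
  \forall m,\ \forall q \in C:\quad
  m \text{ sent or received in } C \text{ at time } t
  \;\Longrightarrow\;
  q \text{ receives } m \text{ by time } \max(t,\ \ell + b) + d .
\]
```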

[Figure: timeline with events stabilize, send, receive and intervals b, d]

VS Specification
  • Partitionable view-oriented service
  • Safety: VS-Machine
    • Views presented in consistent order, possible gaps
    • Messages respect views
    • Messages in consistent order
    • Causality
    • Prefix property
    • Safe indication
  • Doesn’t guarantee Virtual Synchrony
  • Like TO-Machine, but per view
Performance/Fault-Tolerance

VS-Property(b,d,C): If C stabilizes, then soon thereafter (time b), views known within C become consistent, and messages sent in the final view v are delivered everywhere in C, within bounded time (time d).

[Figure: timeline with events stabilize, newview(v), mcast(v), receive(v) and intervals b, d]

VStoTO Algorithm
  • TO must deliver messages in order, no gaps.
  • VS delivers messages in order per view.
  • Problems arise from view changes:
    • Processes moving between views could have different prefixes.
    • Processes could skip views.
  • Algorithm:
    • Real work done in majority views only
    • Processes in majority views totally order messages, and deliver to clients messages that VS has said are safe.
    • At start of new view, processes exchange state, to reconcile progress made in different majority views.
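Two ingredients of the algorithm outline above can be sketched in a few lines: the majority-view test and the rule that only safe messages are handed to clients, in order and without gaps. This is a simplified illustration, not the VStoTO algorithm itself; the function names, the fixed `universe`, and the message labels are assumptions.

```python
def is_majority(view_members, universe):
    """True if the view contains more than half of the universe of processes."""
    return 2 * len(set(view_members) & set(universe)) > len(universe)


def deliverable(ordered, safe, delivered_count):
    """Next messages to hand to the client: the longest run of safe messages
    that directly follows what was already delivered (no gaps allowed)."""
    out = []
    for msg in ordered[delivered_count:]:
        if msg not in safe:
            break
        out.append(msg)
    return out


universe = {"A", "B", "C", "D", "E"}
in_majority = is_majority({"A", "B", "C"}, universe)
in_minority = is_majority({"D", "E"}, universe)

# m3 is safe, but it cannot be delivered before m2 becomes safe.
ready = deliverable(["m1", "m2", "m3"], safe={"m1", "m3"}, delivered_count=0)
```

State exchange at the start of a new view is not covered by this sketch.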
Correctness (Safety) Proof
  • Show composition of VS-Machine and VStoTO machines implements TO-Machine.
  • Trace inclusion
  • Use simulation relation proof:
    • Relate start states, steps of composition to those of TO-Machine
    • Invariants, e.g.: Once a message is ordered everywhere in some majority view, its order is determined forever.

  • Checked using PVS theorem-prover, TAME [Archer]
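For reference, the standard notion of a forward simulation, which is the usual form of a simulation relation for I/O automata, can be written as follows. The labels A and B are introduced here; the talk does not say which variant of simulation was used, so this is a sketch of the common definition rather than the exact one in the proof.

```latex
% A = composition of VS-Machine and the VStoTO automata, B = TO-Machine (labels assumed here).
% R \subseteq \mathrm{states}(A) \times \mathrm{states}(B) is a forward simulation if:
\begin{enumerate}
  \item for every start state $s$ of $A$ there is a start state $u$ of $B$
        with $(s,u) \in R$;
  \item for every step $(s, \pi, s')$ of $A$ and every $u$ with $(s,u) \in R$
        (with $s$ and $u$ reachable), there is an execution fragment $\alpha$ of $B$
        from $u$ to some $u'$ such that
        $\mathrm{trace}(\alpha) = \mathrm{trace}(\pi)$ and $(s',u') \in R$.
\end{enumerate}
% Consequence: \mathrm{traces}(A) \subseteq \mathrm{traces}(B), i.e. trace inclusion.
```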

[Figure: TO above Composition]
Conditional Performance Analysis
  • Assume VS satisfies VS-Property(b,d,C):
    • If C stabilizes, then within time b, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within time d.
  • And VStoTO satisfies:
    • Simple timing and fault-tolerance assumptions.
  • Then TO satisfies TO-Property(b+d,d,C):
    • If C stabilizes, then within time b+d, any message sent or delivered anywhere in C is delivered everywhere in C, within time d.
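The arithmetic behind the conditional claim can be sketched as follows. The parameter mapping (b, d) to (b+d, d) is from the slide; the deadline formula max(t, stabilization time + b) + d is an assumed reading of the properties, and the function names and numeric values are hypothetical.

```python
def to_bounds_from_vs(b, d):
    """Parameters of the TO guarantee derived from VS-Property(b, d, C)."""
    return (b + d, d)


def delivery_deadline(t_event, t_stable, b, d):
    """Latest delivery time under a Property(b, d, C)-style guarantee
    (assumed reading: the guarantee starts b after stabilization)."""
    return max(t_event, t_stable + b) + d


vs_b, vs_d = 2.0, 0.5                      # hypothetical VS bounds, in seconds
to_b, to_d = to_bounds_from_vs(vs_b, vs_d)

# A message sent right at stabilization waits for the guarantee to start.
early = delivery_deadline(t_event=10.0, t_stable=10.0, b=to_b, d=to_d)
# A message sent long after stabilization only pays the delay d.
late = delivery_deadline(t_event=20.0, t_stable=10.0, b=to_b, d=to_d)
```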
Conclusions: VS
  • Models for VS, TO, VStoTO
  • Proofs, performance/f-t analyses
  • Tractable, understandable, modular
  • [PODC 97], [TOCS 00]
  • Follow-on work:
    • Algorithm for VS [Fekete, Lesley]
    • Load balancing using VS [Khazan]
    • Models for other Transis algorithms [Chockler]
  • But: VS is only a prototype; lacks some key features, like Virtual Synchrony
  • Next: Try a real system!
2. Ensemble [Hickey, Lynch, van Renesse 99]

Goals:

  • Try, evaluate our approach on a real system
  • Develop techniques for modeling, verifying, analyzing more features of GC systems, including Virtual Synchrony
  • Improve on prior methods for system validation
Ensemble
  • Ensemble system [Birman, Hayden 97]
    • Virtual Synchrony
    • Layered design, building blocks
    • Coded in ML [Hayden]
  • Prior verification work for Ensemble and predecessors:
    • Proving local properties using Nuprl [Hickey]
    • [Ricciardi], [Friedman]
Ensemble
  • What we did:
    • Worked with developers
    • Followed VS example
    • Developed global specs for key layers:
      • Virtual Synchrony
      • Total Order with Virtual Synchrony
    • Modeled Ensemble algorithm spanning between layers
    • Attempted proof; found logical error in state exchange algorithm (repaired)
    • Developed models, proofs for repaired system
Conclusions: Ensemble
  • Models for two layers, algorithm
  • Tractable, easily understandable by developers
  • Error, proofs
  • Low-level models similar to actual ML code (4 to 1)
  • [TACAS 99]
  • Follow-on:
    • Same error found in Horus.
    • Incremental models, proofs [Hickey]
  • Next: Use our approach to design new services.
3. Dynamic Views [De Prisco, Fekete, Lynch, Shvartsman 98]

Goals:

  • Define GC services that cope with both:
    • Long-term changes:
      • Permanent failures, new joins
      • Changes in the “universe” of processes
    • Transient changes
  • Use these to design consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.

Dynamic Views
  • Many applications with strong consistency requirements make progress only in primary views:
    • Consistent replicated data management
    • Totally ordered broadcast
  • Can use static notion of allowable primaries, e.g., majorities of universe, quorums
    • All intersect.
    • Only one exists at a time.
    • Information can flow from each to the next.
  • But: Static notion not good for long-term changes
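The claim that statically defined primaries all intersect can be checked by brute force for a small universe. A minimal sketch, assuming a fixed universe of five processes and taking "allowable primary" to mean any majority of that universe:

```python
from itertools import combinations


def majorities(universe):
    """All subsets containing more than half of the universe."""
    members = sorted(universe)
    need = len(members) // 2 + 1
    return [set(c)
            for k in range(need, len(members) + 1)
            for c in combinations(members, k)]


universe = {"A", "B", "C", "D", "E"}
quorums = majorities(universe)

# Any two majorities of the same fixed universe share at least one process,
# so information can flow from each primary to the next.
all_intersect = all(q1 & q2 for q1, q2 in combinations(quorums, 2))
```

The guarantee depends on the universe being fixed, which is exactly what long-term changes break.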


Dynamic Views
  • For long-term changes, want dynamic notion of allowable primaries.
  • E.g., each primary might contain majority of previous.
  • But: Some might not intersect. Makes it hard to maintain consistency.
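A small worked example shows how the "majority of previous" rule can produce disjoint primaries. The concrete sets below are hypothetical; the scenario assumes that C, D, and E never learned of the later primaries and still regard `p0` as the previous one.

```python
def contains_majority_of(new, previous):
    """True if `new` includes more than half of the members of `previous`."""
    return 2 * len(new & previous) > len(previous)


p0 = {"A", "B", "C", "D", "E"}
p1 = {"A", "B", "C"}          # contains a majority of p0
p2 = {"A", "B"}               # contains a majority of p1
p1_alt = {"C", "D", "E"}      # also contains a majority of p0

each_step_is_legal = (contains_majority_of(p1, p0)
                      and contains_majority_of(p2, p1)
                      and contains_majority_of(p1_alt, p0))

# Yet p2 and p1_alt share no process, so neither can learn what the other did.
disjoint = not (p2 & p1_alt)
```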

Dynamic Views
  • Key problem:
    • Processes may have different opinions about which is the previous primary
    • Could be disjoint.
  • [Yeger-Lotem, Keidar, Dolev 97] algorithm
    • Keeps track of all possible previous primaries.
    • Ensures intersection with all of them.
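The intersection rule stated above can be expressed as a simple check. This is a sketch of the condition as worded on the slide, not of the [Yeger-Lotem, Keidar, Dolev 97] algorithm; the function name and the example sets are assumptions, and how the set of possible previous primaries is tracked is left out.

```python
def may_become_primary(candidate, possible_previous):
    """True if the candidate view intersects every possible previous primary."""
    return all(candidate & previous for previous in possible_previous)


# Hypothetical record of primaries that may have been formed earlier.
possible_previous = [{"A", "B", "C", "D", "E"}, {"A", "B", "C"}, {"A", "B"}]

rejected = may_become_primary({"C", "D", "E"}, possible_previous)   # misses {A, B}
accepted = may_become_primary({"B", "C", "D"}, possible_previous)   # touches all three
```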
Dynamic Views

What we did:

  • Defined Dynamic View Service, DVS, based on [YKD]
  • Designed to tolerate long-term failures
  • Membership:
    • Views delivered in consistent order, possible gaps.
    • Ensures new primary intersects all possible previous primaries.
  • Communication:
    • Similar to VS
    • Messages delivered within views
    • Prefix property, safe notifications.

Dynamic Views
  • What we did, cont’d
    • Modeled, proved implementing algorithm
    • Modeled, proved TO-Broadcast application
    • Distributed implementation [Ingols 00]
Handling Transient Failures: Dynamic Configurations
  • Configuration = Set of processes plus structure, e.g., set of quorums, leader,…
  • Application: Highly available consistent replicated data management:
    • Paxos [Lamport], uses leader, quorums
    • [Attiya, Bar-Noy, Dolev], uses read quorums and write quorums
  • Quorums allow flexibility, availability in the face of transient failures.
Dynamic Configurations [De Prisco, Fekete, Lynch, Shvartsman 99, 00]
  • Combine ideas/benefits of
    • Dynamic views, for long-term failures, and
    • Static configurations, for transient failures
  • Idea:
    • Allow configuration to change (reconfiguration).
    • Each configuration satisfies intersection properties with respect to previous configuration
  • Example:
    • Config = (membership set, read quorums, write quorums)
    • Membership set of new configuration contains read quorum and write quorum of previous configuration
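The example condition above can be written as a small check. A minimal sketch, assuming a `Config` record with exactly the three components named on the slide; the class, the function `valid_successor`, and the sample quorums are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    members: frozenset
    read_quorums: tuple      # tuple of frozensets
    write_quorums: tuple     # tuple of frozensets


def valid_successor(new, old):
    """True if the new membership set contains some read quorum and some
    write quorum of the previous configuration."""
    has_read = any(q <= new.members for q in old.read_quorums)
    has_write = any(q <= new.members for q in old.write_quorums)
    return has_read and has_write


pairs = (frozenset("AB"), frozenset("BC"), frozenset("AC"))
old = Config(frozenset("ABC"), read_quorums=pairs, write_quorums=pairs)

overlapping = Config(frozenset("BCD"), read_quorums=(), write_quorums=())
detached = Config(frozenset("CDE"), read_quorums=(), write_quorums=())

ok = valid_successor(overlapping, old)     # contains the quorum {B, C}
not_ok = valid_successor(detached, old)    # contains no quorum of old
```

The check covers only one previous configuration; the service described later guarantees the property with respect to all possible previous configurations.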
Dynamic Configurations

What we did:

  • Defined dynamic configuration service DCS, guaranteeing intersection properties w.r.t. all possible previous configurations.
  • Designed implementing algorithm, extending [YKD]
  • Developed application: Replicated data
    • Dynamic version of [Paxos]
    • Dynamic version of [Attiya, Bar-Noy, Dolev]
    • Tolerate
      • Transient failures, using quorums
      • Longer-term failures, using reconfiguration
Conclusions: Dynamic Views
  • New DVS, DC services for long-term changes in set of processes
  • Applications, implementations
  • Decomposed complex algorithms into tractable pieces:
    • Service specification, implementation, application
    • Static algorithm vs. reconfiguration
  • Couldn’t have done it without the formal framework.
  • [PODC 98], [DISC 99]
4. Scalable Group Communication [Keidar, Khazan 99], [Keidar, Khazan, Lynch, Shvartsman 00]

Goal:

  • Make GC work in wide area networks

What we did:

  • Defined desired properties for GC services
  • Defined spec for scalable group membership service [Keidar, Sussman, Marzullo, Dolev 00], implemented on small set of membership servers
Scalable Group Communication

What we did, cont’d:

  • Developed new, scalable GC algorithms:
    • Use scalable GM service
    • Multicast implemented on clients
    • Efficient: Algorithm for virtual synchrony uses only one round for state exchange, in parallel with membership service’s agreement on views.
    • Processes can join during reconfiguration.
  • Distributed implementation [Tarashchanskiy]
Scalable GC

What we did, cont’d:

  • Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00]
    • Proof Extension Theorem
  • Developed models, proofs (safety and liveness), using the new methods.

[Figure: diagram relating S, S’, A, A’]
Conclusions: Scalable GC
  • Specs, new algorithms, proofs
  • New incremental proof methods
  • Couldn’t have done it without the formal framework.
  • [ICDCS 99], [ICSE 00]
Future Work
  • Model, analyze GC services, applications
  • Design new GC services
  • Catalog
  • Compare, evaluate GC services
  • Math foundations
  • Theory  Practice
practical gc systems current status birman 99
Practical GC Systems: Current Status [Birman 99]
  • Commercial successes:
    • Stock exchange (Zurich, New York)
    • Air-traffic control (France)
  • Problems:
    • Performance, for strong guarantees like Virtual Synchrony
    • Not integrated with object-oriented programming technologies.
  • Trends:
    • Flexible services
    • Weaker guarantees; better performance
    • Integration with OO technologies, allowing programmers to make tradeoffs.
1. Model, Analyze GC Services
  • Analyze performance of our new services: Dynamic views, Scalable GC
  • Implementations
  • Applications: Replicated data, games, …
  • Compare predicted, observed performance.
  • Other existing services
2. Design New Services

Total Order + QoS [Bar-Joseph, Keidar, Anker, Lynch]

  • Specs for:
    • Bandwidth reservation service
    • TO Multicast service with QoS (latency, bandwidth)
  • Algorithms implementing TO-QoS using reservation service:
    • Algorithm 1: Allows gaps, simple, small added latency
    • Algorithm 2: No gaps, more complex, more latency

Basic services: Consensus, resource allocation, leader election, spanning trees, overlay networks
3. Catalog of GC Services
  • Service specs
  • Property specs [Chockler, Keidar, Vitenberg 99]
  • Implementing algorithms
  • Prototype applications
  • Lower bounds, impossibility results
4. Compare, Evaluate GC Services
  • Study tradeoffs between strength of ordering and reliability guarantees vs. performance
  • Compare GC services with other reliable multicast algorithms:
    • Scalable Reliable Multicast [Floyd, Jacobson, et al. 95]: Unreliable GC (IP Multicast) + retransmission protocol

    • Bimodal Multicast [Birman, Hayden, et al. 99]
5. Math Foundations
  • Models:
    • Timing models
      • For timing assumptions, guarantees, QoS
      • For conditional performance analysis
    • Failure models, probabilistic models, process creation models…
    • Combined models
  • Proof methods:
    • Incremental modeling, proof
    • Conditional performance analysis
Conditional Performance Analysis
  • Idea:
    • Make conditional claims about system behavior, under various assumptions about behavior of environment, network.
    • Include timing, performance, failures.
  • Benefits:
    • Formal performance predictions
    • Says when system makes specific guarantees
      • Normal case + failure cases
      • Parameters, sensitivity analysis
    • Composable
    • Get probabilistic claims as corollaries
CP Analysis: Typical Hypotheses
  • Stabilization of underlying network.
  • Limited rate of change.
  • Bounds on message delay.
  • Limited amount of failure (number, density).
  • Limit input arrivals (number, density).
  • Method allows focus on tractable cases.
Example: Reliable Multicast [Livadas, Keidar, Lynch]
  • Specs for IP Mcast, Reliable Mcast services
  • Automaton model for Scalable Reliable Mcast (SRM) protocol [Floyd, Jacobson, et al. 95]
  • Example:
    • Assume bounds on IP-level message loss, processor failures
    • Prove bounds on:
      • Time from client send until all non-failed clients receive.
      • Amount of traffic generated.
6. Theory and Practice
  • IOA language, tool support for GC services, algorithms
  • Incremental development methods for algorithms, service specs, proofs, analyses
  • Methods for integrating group communication services with object-oriented programming technologies
Summary
  • GC services help in programming dynamic distributed systems, though scalability, integration problems remain.

  • Our contributions:
    • Modeling style: Automata + performance properties
    • Techniques: Conditional performance analysis, incremental modeling/proof
    • Models, proofs for key services
    • Discovered errors
    • New services: Dynamic views, scalable GC
  • Mathematical framework makes it possible to design more complex systems correctly.
Future Work
  • Model, analyze GC services, applications
  • Design new services
  • Catalog
  • Compare, evaluate services
  • Math foundations
  • Theory  Practice
ad