This presentation discusses RAMBO (Reconfigurable Atomic Memory for Basic Objects), a distributed atomic memory system designed for dynamic networks. It examines the challenges of consensus, quorum systems, and memory access in dynamic environments where nodes can fail and messages may be lost or delayed. The talk covers the RAMBO algorithm's high-level overview, the reconfiguration service it offers, and its implications for achieving atomic consistency with fault tolerance. The insights are based on foundational works and innovations in the field of distributed shared memory.
RAMBO: Reconfigurable Atomic Memory for Dynamic Networks • Seth Gilbert, Nancy Lynch, Alexander Shvartsman • Presenter: Anastasia Braginsky (December 2013) • Slides partially borrowed from Seth Gilbert (DSN ’03) and Edward Bortnikov (talk)
RAMBO name • Reconfigurable Atomic Memory for Basic Objects
Outline • Introduction • Background • Static Quorum Systems • Consensus • RAMBO high level overview • Preliminaries • The RAMBO algorithm • The reconfiguration service • Conclusions
Distributed Shared Memory • [Figure: clients issue Read, Write(7), and Write(0) operations on a shared object]
Atomic Consistency (linearizability) • Definition: Each operation appears to occur at some point between its invocation and response • Sufficient condition: For each object x, all the read and write operations for x can be partially ordered by ≺, so that: • No operation has infinitely many other operations ordered before it • ≺ is consistent with the order of invocations and responses: there are no operations π1, π2 such that π1 completes before π2 starts, yet π2 ≺ π1 • All write operations are ordered with respect to each other and with respect to all the reads • Every read returns the value of the last write preceding it in ≺ (or the initial value if no write precedes it)
Atomic Consistency: example • If op A completes before op B begins, then B returns the results of A • [Figure: Write(7) followed by a Read that returns 7; operations shown: Read, Write(7), Write(0)]
Suggestions? • Central server? • Performance bottleneck • Single point of failure • So multiple servers need to replicate the content • And must not stop the world when some reconfiguration is needed • But now how to find the latest value of a replicated object?
Distributed Networked System • All-to-all connectivity, but messages can be lost, delayed, or re-ordered • No global clock or synchronization mechanism - asynchrony • Nodes can fail • A distributed networked system can be static (fixed set of participating nodes) or dynamic
And if everything fails? • Memory access operations are guaranteed to terminate under certain assumptions • Static: • A majority of replicas needs to be active • Network delays are bounded • Dynamic: • A dynamically changing subset of replicas needs to be active during certain periods • Otherwise… Sorry… • Operations may not terminate
Quorums • [Figure: Read and Write(7) operations contacting quorums of replicas] Dependable Systems and Networks 2003
Dynamic Atomic Memory
Outline • Introduction • Background • Static Quorum Systems • Consensus • RAMBO high level overview • Preliminaries • The RAMBO algorithm • The reconfiguration service • Conclusions
Static Quorum Systems • Upfal and Wigderson (85) • First general scheme for emulating shared memory in a message-passing system • majority sets of readers and writers … • Attiya, Bar-Noy and Dolev (90/95) • Dijkstra Award in 2011 • Including extensions to the original algorithm [N. Lynch and A. Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged broadcast. 1997]
A(ttiya) B(ar-Noy) D(olev) • Algorithm uses replication to achieve fault-tolerance and availability • n nodes • The system tolerates fewer than n/2 crashes (at most ⌈n/2⌉ − 1), so that a majority always remains active
ABD for a single register • Each node i maintains the local value of the register • value_i and tag_i = <seq, pid> • Tags are compared lexicographically • Each new write assigns a unique tag (pid to break ties) • Read and write operations have two phases • Query replicas for information • Propagate information to replicas • In each phase: send to everyone, wait for a majority to respond
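The two-phase scheme on this slide can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not the authors' code: the class names `Tag`, `Replica`, and `ABDRegister` and the `reachable` parameter (the set of replicas that answer in this operation) are hypothetical, and message passing is replaced by direct access to replica state.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Tag:
    seq: int  # compared first
    pid: int  # compared second, so pid only breaks ties


@dataclass
class Replica:
    tag: Tag = Tag(0, 0)
    value: object = None


class ABDRegister:
    """Single register replicated on n nodes (hypothetical sketch)."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.majority = n // 2 + 1

    def _query(self, reachable):
        # Phase 1: collect (tag, value) pairs and keep the largest tag.
        if len(reachable) < self.majority:
            raise RuntimeError("no majority responded; operation cannot terminate")
        pairs = ((self.replicas[i].tag, self.replicas[i].value) for i in reachable)
        return max(pairs, key=lambda tv: tv[0])

    def _propagate(self, reachable, tag, value):
        # Phase 2: replicas adopt the pair only if its tag is newer.
        for i in reachable:
            r = self.replicas[i]
            if tag > r.tag:
                r.tag, r.value = tag, value

    def write(self, pid, value, reachable):
        latest, _ = self._query(reachable)
        self._propagate(reachable, Tag(latest.seq + 1, pid), value)

    def read(self, reachable):
        tag, value = self._query(reachable)
        self._propagate(reachable, tag, value)  # write-back before returning
        return value


reg = ABDRegister(5)
reg.write(pid=1, value=7, reachable={0, 1, 2})
print(reg.read(reachable={2, 3, 4}))  # replica 2 is in both majorities
```

The read's write-back in its second phase is what lets a later read that sees a disjoint-looking majority still find the value.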
Consistency • Two majorities have non-empty intersection • There is at least one node participating in Propagation phase of previous operation and in Query phase of this one • All writes ordered by their tags
Too long waiting for the majority? • Use quorum systems • Quorum is a subset of nodes • Any two quorums intersect • The size of the set can be much less than the majority • The majority-based implementations tolerate crashes of any minority • The quorum-based implementations require that the nodes in at least one quorum do not crash
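To make the "any two quorums intersect" requirement concrete, here is a minimal sketch that checks the property for two constructions. The grid construction (one full row plus one full column) is a standard textbook example chosen for illustration and is not taken from the slides; all function names are hypothetical.

```python
from itertools import combinations


def is_quorum_system(quorums):
    """True if every pair of quorums shares at least one node."""
    return all(set(a) & set(b) for a, b in combinations(quorums, 2))


def majorities(nodes):
    """All smallest majorities of the given nodes."""
    nodes = list(nodes)
    k = len(nodes) // 2 + 1
    return [set(c) for c in combinations(nodes, k)]


def grid_quorums(rows, cols):
    """Quorum (r, c) = every node in row r plus every node in column c."""
    def node(r, c):
        return r * cols + c

    return [
        {node(r, j) for j in range(cols)} | {node(i, c) for i in range(rows)}
        for r in range(rows)
        for c in range(cols)
    ]


grid = grid_quorums(4, 4)
print(is_quorum_system(grid), len(grid[0]))   # quorum size vs. 16 nodes
print(is_quorum_system(majorities(range(5))))
```

In the 4 x 4 grid each quorum has 7 nodes, while a majority of the 16 nodes needs 9; the gap widens as the grid grows.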
Consensus • A set of processes needs to agree on a value • Nodes propose several values for consideration • Any solution must satisfy: • Agreement: no two processes decide on different values • Validity: the value decided was proposed by some node • Termination: all correct processes reach a decision • In an asynchronous system, consensus termination cannot be guaranteed in the presence of even a single process crash • Paxos is an implementation of consensus
Outline • Introduction • Background • Static Quorum Systems • Consensus • RAMBO high level overview • Preliminaries • The RAMBO algorithm • The reconfiguration service
RAMBO multi-reader, multi-writer • Short term: Quorum-based Replication – to provide fault tolerance • Read- and write-quorums collected into configurations • Any quorum configuration can be installed at any time • Long term: Reconfiguration – to cope with changing participants • Participants can join and fail
Rambo • Decouple read/write ops and reconfiguration • fast read/write ops, even if recon is slow • A stable state (no reconfigurations) is similar to the static two-phase ABD, but • Extended for multi-writer registers • Generalized to use quorum systems • New participants can join the service by contacting at least one existing participant
Quorums Reconfigurations • Performed concurrently with any ongoing reads and writes • Multiple reconfigurations can be in progress concurrently • Reconfiguration involves • Introduction of a new configuration • Garbage collection of obsolete configuration(s)
Rambo stabilization • [Figure: timeline labeled "frequent reconfiguration?", "clocks out of synch?", "messages lost?", "messages delayed?", "Network stabilizes", "Rambo stabilizes"]
Three Sub-Protocols • Joiner • Joiner is notified by a device that the device wants to join • The device provides the initial world view (set of devices that this device thinks has already joined) • Joiner contacts this world and retrieves the information necessary for the new device to participate • Reader-Writer: Executing read-write operations and old configurations garbage-collection • Recon: Producing new configurations
Configuration map • Each participant maintains a configuration map – cmap – to store the sequence of configurations • For node i, cmap_i(k) is • the k-th configuration, if that configuration is active • or a notification that this configuration doesn’t yet exist (⊥) • or a notification that this configuration was already garbage collected (±) • This sequence evolves as new configurations are introduced by Recon and as old configurations are garbage collected
CMAP Evolution (± marks a garbage-collected configuration) • . . . • c0 . . . • c0 c1 . . . • c0 c1 c2 . . . • ± c1 c2 . . . • ± ± c2 . . . • ± ± ± c3 . . .
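The evolution of a configuration map can be replayed with a small sketch. The `CMap` class and its method names are hypothetical, and the markers `⊥` (not yet known) and `±` (garbage collected) are represented as plain strings; this only mirrors the states on the slide, not the full gossip-based update rule.

```python
UNDEFINED = "⊥"  # configuration k is not yet known at this node
GCED = "±"       # configuration k was already garbage collected


class CMap:
    """One node's view of the configuration sequence (hypothetical sketch)."""

    def __init__(self, c0):
        self.entries = {0: c0}

    def get(self, k):
        return self.entries.get(k, UNDEFINED)

    def install(self, k, config):
        # Only fill an undefined slot; never resurrect a collected one.
        if self.get(k) == UNDEFINED:
            self.entries[k] = config

    def garbage_collect(self, k):
        # Mark every configuration with index below k as collected.
        for j in range(k):
            self.entries[j] = GCED

    def active(self):
        return [k for k, c in sorted(self.entries.items()) if c != GCED]

    def prefix(self, length):
        return [self.get(k) for k in range(length)]


m = CMap("c0")
m.install(1, "c1")
m.install(2, "c2")
m.garbage_collect(1)
print(m.prefix(4), m.active())
```

Reads and writes would consult `active()` to decide which configurations' quorums they must contact.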
Reader-Writer • Each read or write executes in the context of one or more active configurations (must use all active configurations) • Reads and writes proceed concurrently with ongoing reconfigurations • Two phases • Query phase – information is retrieved from one (or more) read-quorums of all active configurations • Propagate phase – information is updated in one (or more) write-quorums of all active configurations • Garbage-Collection (GC) – removing old configurations • Notifying about old configuration(s) • Propagating information from old configuration to the next
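The rule that a phase must reach a quorum in every active configuration can be sketched as a single predicate. This is a simplification under stated assumptions: configurations are plain dictionaries with hypothetical keys `"read"` and `"write"`, and the set of active configurations is treated as fixed for the duration of the phase, whereas in the real protocol it may grow while the phase is running.

```python
def phase_complete(responders, active_configs, kind):
    """Return True once the responders cover a quorum of every active configuration.

    kind is "read" for the Query phase and "write" for the Propagate phase.
    """
    responders = set(responders)
    return all(
        any(set(q) <= responders for q in cfg[kind])
        for cfg in active_configs
    )


# Two hypothetical configurations; every read-quorum meets every write-quorum.
c1 = {"read": [{1, 2}, {2, 3}], "write": [{1, 2, 3}]}
c2 = {"read": [{4, 5}], "write": [{4, 5, 6}, {5, 6}]}

print(phase_complete({1, 2}, [c1], "read"))            # one active configuration
print(phase_complete({1, 2}, [c1, c2], "read"))        # c2 not yet covered
print(phase_complete({2, 3, 4, 5}, [c1, c2], "read"))  # both covered
```

With a single active configuration the predicate reduces to the static quorum-based check.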
RAMBO Assumptions • Assumptions regarding RAMBO behavior: • Each participant regularly sends gossip messages to the other participants • The initial world views overlap sufficiently, so that every node that has joined the system becomes aware of every other node soon enough • Every configuration remains viable until sufficiently long after the next new configuration is installed • Reconfigurations are not initiated too frequently
Outline • Introduction • Background • Static Quorum Systems • Consensus • RAMBO high level overview • Preliminaries • The RAMBO algorithm • The reconfiguration service
The system • Set of devices communicating via all-to-all asynchronous message-passing network • I : totally ordered set of device identifiers • [Figure: devices numbered 1–7, each labeled "a node or a participant"]
The system • Set of devices communicating via all-to-all asynchronous message-passing network • I : totally ordered set of device identifiers • Nodes may fail by stopping (all components) without warning • [Figure: devices numbered 1–7, each running Joiner, Read-Write, and Recon components]
Shared Memory Read/Write Objects • X : set of object identifiers • For each object x ∈ X, V_x is the set of values that x may take on • (v0)_x – the initial value of object x • (i0)_x – the initial creator of object x, the node that is initially responsible for object x (this responsibility can be delegated) • T = N × I : set of tags, used to order the values written to the system
Configurations • C : set of configuration identifiers • Each identifier c ∈ C is associated with a unique configuration consisting of: • members(c) – a finite subset of I • read-quorums(c) – a set of finite subsets of members(c) • write-quorums(c) – a set of finite subsets of members(c) • For every c ∈ C, for every R ∈ read-quorums(c), and for every W ∈ write-quorums(c): R ∩ W ≠ ∅
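The definition of a configuration translates directly into a small data type with a validity check. The names `Configuration`, `make_config`, and `is_valid` are hypothetical; the check covers only the two conditions stated on the slide (quorums are subsets of the members, and every read-quorum intersects every write-quorum).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    members: frozenset
    read_quorums: frozenset   # a set of frozensets of members
    write_quorums: frozenset  # a set of frozensets of members

    def is_valid(self):
        all_quorums = self.read_quorums | self.write_quorums
        within_members = all(q <= self.members for q in all_quorums)
        intersecting = all(
            r & w for r in self.read_quorums for w in self.write_quorums
        )
        return within_members and intersecting


def make_config(members, read_quorums, write_quorums):
    """Convenience constructor accepting ordinary sets and lists."""
    return Configuration(
        frozenset(members),
        frozenset(frozenset(q) for q in read_quorums),
        frozenset(frozenset(q) for q in write_quorums),
    )


c = make_config({1, 2, 3}, [{1, 2}, {2, 3}], [{1, 2, 3}, {2}])
print(c.is_valid())
```

Note that two read-quorums are not required to intersect each other; only read/write pairs are.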
RAMBO API • Domains: • I = set of Nodes • V = set of Values • C = set of Configurations • Inputs and Outputs are all asynchronous, per node i ∈ I and object x ∈ X • Input (Request): • Join(J) // J – initial world view • Read • Write(v) • Recon(c, c’) // reconfiguration request • Fail • Output (Response): • Join-ack • Read-ack(v) • Write-ack • Recon-ack // request has been processed • Report(c) // new configuration
Requests’ Well-Formedness • No requests after fail • Each client issues at most one join request and waits for acknowledgement before any further requests • Before issuing a new read/write/recon wait for previous acknowledgment • Each client issues at most one recon(*,c) request (configuration identifiers are unique) • Client can request reconfiguration from c to c’ only if c was installed and all members of c’ have already joined
Responses’ Well-Formedness • No responses after fail • Responses come only in reply to requests
Reconfiguration service API • Domains: • I = set of Nodes • V = set of Values • C = set of Configurations • Inputs and Outputs are all asynchronous, per node i ∈ I and object x ∈ X • Input (Request): • Join • Recon(c, c’) • Request-config(k) // the client has learned of every configuration preceding k • Fail • Output (Response): • Join-ack • Recon-ack • New-config(c, k) // the kth configuration has been agreed upon • Report(c)
Recon Service Specification • Recon • Chooses configurations • Tells members of the previous and new configuration. • Informs Reader-Writer components (new-config). • Behavior (assuming well-formedness): • Agreement: Two configs never assigned to same k. • Validity: Any announced new-config was previously requested by someone. • No duplication: No configuration is assigned to more than one k.
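The three behavioral properties can be checked mechanically over a recorded trace. The function below is a hypothetical sketch: `requested` is assumed to hold the identifiers that appeared as the target of some recon request, and `announced` the list of new-config(c, k) events observed across all nodes (the same event may appear more than once, since several nodes are told).

```python
def check_recon_trace(requested, announced):
    """Return "ok" or the name of the first violated property."""
    config_at = {}  # index k -> configuration announced for k
    index_of = {}   # configuration c -> index it was assigned to
    for c, k in announced:
        if config_at.setdefault(k, c) != c:
            return "agreement violated"       # two configs for the same k
        if c not in requested:
            return "validity violated"        # announced but never requested
        if index_of.setdefault(c, k) != k:
            return "no-duplication violated"  # same config at two indices
    return "ok"


requested = {"c1", "c2"}
print(check_recon_trace(requested, [("c1", 1), ("c1", 1), ("c2", 2)]))
print(check_recon_trace(requested, [("c1", 1), ("c2", 1)]))
```

The initial configuration is not the result of a recon request, so a caller that also records it would need to add it to `requested` first.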
Outline • Introduction • Background • Static Quorum Systems • Consensus • RAMBO high level overview • Preliminaries • The RAMBO algorithm • The reconfiguration service
Suppress explicit mention of x • The shared memory is described as the composition of a separate implementation for each object x ∈ X • V, v0, c0, and i0 as shorthand for • V_x, (v0)_x, (c0)_x, and (i0)_x
Joiner automaton state • status ∈ {idle, joining, active, failed}, initially idle • others-status, a mapping from Recon and Reader-Writer to {idle, joining, active}, initially everywhere idle • initial-world (iw) ⊆ I, initially ∅
Joiner automaton • [Figure: on Join(J), the Joiner sends join messages to the nodes in J] • Hope at least one will answer…
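The Joiner's state and its reaction to Join(J) can be sketched as a small state machine. This is an assumption-laden illustration, not the automaton from the paper: the class and method names are hypothetical, the network is replaced by an `outbox` list, and the rule that the node becomes active once both local components report having joined is an assumed reading of the others-status field.

```python
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    JOINING = "joining"
    ACTIVE = "active"
    FAILED = "failed"


class Joiner:
    def __init__(self, node_id):
        self.node_id = node_id
        self.status = Status.IDLE
        self.others_status = {"recon": Status.IDLE, "reader-writer": Status.IDLE}
        self.initial_world = set()
        self.outbox = []  # (destination, message) pairs standing in for the network

    def join(self, world):
        """Handle Join(J): remember J and send join to every other node in J."""
        if self.status is not Status.IDLE:
            return  # at most one join request per node
        self.status = Status.JOINING
        self.initial_world = set(world)
        for j in sorted(self.initial_world - {self.node_id}):
            self.outbox.append((j, "join"))
        for component in self.others_status:
            self.others_status[component] = Status.JOINING

    def component_joined(self, component):
        """Record that a local component (Recon or Reader-Writer) has joined."""
        if self.status is Status.JOINING:
            self.others_status[component] = Status.ACTIVE

    def try_join_ack(self):
        """Emit Join-ack (return True) once both local components are active."""
        ready = all(s is Status.ACTIVE for s in self.others_status.values())
        if self.status is Status.JOINING and ready:
            self.status = Status.ACTIVE
            return True
        return False

    def fail(self):
        self.status = Status.FAILED


j = Joiner(3)
j.join({1, 2, 3})
print(j.outbox)  # join messages to the rest of the initial world view
```

Sending to every node in J is what the "hope at least one will answer" remark refers to: a single live recipient is enough to bootstrap the new node.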