Distributed Snapshot. Distributed Systems
Introduction: What is a Distributed System? • A network of processes. The nodes are processes, and the edges are communication channels.
Introduction • A computation is a sequence of atomic actions that transform a given initial state to the final state. While such actions are totally ordered in a sequential process, they are only partially ordered in a distributed system.
Introduction • In this context, the state (also known as global state) of a distributed system is the set of local states of all the component processes, as well as the states of every channel through which messages flow.
Introduction • So the important question is: when or how do we record the states of the processes and the channels? Depending on when the states of the individual components are recorded, the value of the global state can vary widely.
Difficulties • Recording the global state may look simple to an external observer who looks at the system from outside. The same problem is surprisingly challenging when one takes a snapshot from inside the system.
Difficulties • Consider a system of three processes numbered 0, 1, and 2 connected by FIFO channels, and assume that an unknown number of indistinguishable tokens are circulating indefinitely through this network. • We want the processes to cooperate with one another to count the exact number of tokens circulating in the system (without ever stopping the system).
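To see why the token count is hard to get right, here is a minimal sketch of uncoordinated counting. It assumes a single token and a hypothetical helper `naive_count` that replays a list of events; neither comes from the slides. Each process's count is read at a different moment while the token keeps moving, so the token can be counted twice or missed entirely.

```python
def naive_count(events):
    """Replay events and add up token counts read at different times.

    Hypothetical event format (an assumption made for this sketch):
      ("read", p)        -> read how many tokens process p holds right now
      ("move", src, dst) -> one token travels from src to dst
    """
    holding = [1, 0, 0]  # a single token, initially held by process 0
    seen = 0
    for ev in events:
        if ev[0] == "move":
            _, src, dst = ev
            holding[src] -= 1
            holding[dst] += 1
        else:
            seen += holding[ev[1]]
    return seen


# The token is read at process 0, moves to process 1, and is read again there.
double_counted = naive_count([("read", 0), ("move", 0, 1), ("read", 1), ("read", 2)])

# Process 1 is read before the token arrives, process 0 after it has left.
missed = naive_count([("read", 1), ("move", 0, 1), ("read", 0), ("read", 2)])

print(double_counted, missed)  # 2 0
```

Both runs contain exactly one token, yet the uncoordinated reads report 2 and 0.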
Difficulties • Deadlock detection. Any process that does not have an eligible action for a prolonged period would like to find out if the system has reached a deadlock configuration. • Termination detection. When a computation runs in phases, a process must know whether every other process has finished its computation in the previous phase before it begins the next one. • Network reset. In case of a malfunction or a loss of coordination, a distributed system will need to roll back to a consistent global state and initiate a recovery. Previous snapshots may be helpful.
Properties of Consistent Snapshots • A snapshot state (SSS) consists of a set of local states, where each local state is the outcome of a recording event that follows a send, or a receive, or an internal action. The important notion here is that of a consistent cut.
Properties of Consistent Snapshots • A cut is a set of events; it contains at least one event per process. • A cut is called consistent if, for each event that it contains, it also includes all events causally ordered before it.
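The definition of a consistent cut can be checked mechanically. The sketch below is one possible encoding, not taken from the slides: an event is a hypothetical pair `(process, index)`, and `messages` maps each receive event to its matching send event. It is enough to check the direct predecessors of every event in the cut (the previous event on the same process, and the send of a received message); all other causally earlier events then follow by induction.

```python
from typing import Dict, Set, Tuple

# (process id, position of the event on that process, starting at 0)
Event = Tuple[int, int]


def is_consistent(cut: Set[Event], messages: Dict[Event, Event]) -> bool:
    """Return True if the cut contains every event causally before its members.

    `messages` maps each receive event to the matching send event.
    """
    for proc, idx in cut:
        # Program order: the previous event on the same process must be in the cut.
        if idx > 0 and (proc, idx - 1) not in cut:
            return False
        # Message order: a receive in the cut requires its send to be in the cut.
        send = messages.get((proc, idx))
        if send is not None and send not in cut:
            return False
    return True


# Process 0: event 0 is internal, event 1 sends m.
# Process 1: event 0 is internal, event 1 receives m.
messages = {(1, 1): (0, 1)}

in_transit = {(0, 0), (0, 1), (1, 0)}   # m sent but not yet received
orphan = {(0, 0), (1, 0), (1, 1)}       # m received but never sent

print(is_consistent(in_transit, messages))  # True
print(is_consistent(orphan, messages))      # False
```

A message that has been sent but not received is allowed in a consistent cut (it is part of the channel state); a message received without having been sent is not.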
Properties of Consistent Snapshots • The set of local states following the most recent events of a consistent cut forms a consistent snapshot. • In a distributed system, many consistent snapshots can be recorded. A snapshot that is often of practical interest is the one that is most recent.
The Chandy-Lamport Algorithm • Let the topology of a distributed system be represented by a strongly connected graph. Each node represents a process and each directed edge represents a FIFO channel. • A process called the initiator initiates the distributed snapshot algorithm. Any process can be an initiator. The initiator process sends a special message, called a marker (*), that prompts other processes in the system to record their states. • The global state consists of the states of the processes as well as the channels. However, channels are passive entities, so the responsibility of recording the state of a channel lies with the process on which the channel is incident.
The Chandy-Lamport Algorithm • DS1 The initiator process, in one atomic action, does the following: • Turns red • Records its own state • Sends a marker along all its outgoing channels • DS2 Every process, upon receiving a marker for the first time and before doing anything else, does the following in one atomic action: • Turns red • Records its state • Sends markers along all its outgoing channels
The Chandy-Lamport Algorithm • The snapshot algorithm terminates when: • Every process has turned red • Every process has received a marker through each of its incoming channels
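Rules DS1 and DS2 and the termination condition can be tried out on the token-counting problem with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: the processes form a directed ring, process 0 is the initiator, a seeded random scheduler stands in for real concurrency, and the names `Process`, `simulate`, `warmup`, and `in_transit` are invented for illustration. The state of a channel is recorded by its receiving process as the tokens that arrive after that process turned red and before the marker arrives on that channel.

```python
import random
from collections import deque

MARKER, TOKEN = "marker", "token"


class Process:
    def __init__(self, pid, tokens):
        self.pid = pid
        self.tokens = tokens       # live local state: tokens currently held
        self.red = False           # True once the local state has been recorded
        self.recorded = None       # recorded local state
        self.markers_seen = set()  # incoming channels on which a marker arrived
        self.in_transit = {}       # incoming channel -> tokens recorded as channel state


def simulate(tokens=(2, 0, 3), seed=0, warmup=10):
    n = len(tokens)
    rng = random.Random(seed)
    edges = [(i, (i + 1) % n) for i in range(n)]  # directed ring (strongly connected)
    chan = {e: deque() for e in edges}            # FIFO channels
    procs = [Process(i, t) for i, t in enumerate(tokens)]
    incoming = {i: [e for e in edges if e[1] == i] for i in range(n)}
    outgoing = {i: [e for e in edges if e[0] == i] for i in range(n)}

    def record(p):
        # DS1 / DS2: turn red, record own state, send a marker on every outgoing channel.
        p.red, p.recorded = True, p.tokens
        p.in_transit = {e: 0 for e in incoming[p.pid]}
        for e in outgoing[p.pid]:
            chan[e].append(MARKER)

    def deliver(e):
        p, msg = procs[e[1]], chan[e].popleft()
        if msg == MARKER:
            if not p.red:
                record(p)
            p.markers_seen.add(e)
        else:
            p.tokens += 1
            if p.red and e not in p.markers_seen:
                p.in_transit[e] += 1  # token was in the channel when p recorded

    def finished():
        # Every process is red and has received a marker on each incoming channel.
        return all(p.red and p.markers_seen == set(incoming[p.pid]) for p in procs)

    step = 0
    while not finished():
        if step == warmup:
            record(procs[0])  # process 0 acts as the initiator
        moves = [("deliver", e) for e in edges if chan[e]]
        moves += [("send", e) for e in edges if procs[e[0]].tokens > 0]
        if moves:
            kind, e = rng.choice(moves)
            if kind == "deliver":
                deliver(e)
            else:
                procs[e[0]].tokens -= 1
                chan[e].append(TOKEN)
        step += 1

    local = [p.recorded for p in procs]
    channels = {e: procs[e[1]].in_transit[e] for e in edges}
    return local, channels


local, channels = simulate()
print(sum(local) + sum(channels.values()))  # 5
```

Tokens keep circulating while the snapshot is taken, yet the recorded local states plus the recorded channel states always add up to the five tokens in the system, whichever seed is used.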
The Chandy-Lamport Algorithm • The individual processes only record the fragments of a snapshot state SSS. It requires another phase of activity to collect these fragments and form a composite view of SSS. Global state collection is not a part of the snapshot algorithm.
The Lai-Yang Algorithm • Lai and Yang proposed an algorithm for distributed snapshot on a network of processes where the channels need not be FIFO. • A message is white if it is sent by a process that has not recorded its state, and a message is red if the sender has already recorded its state. • However, there are no markers; processes are allowed to record their local states spontaneously.
The Lai-Yang Algorithm • LY1. The initiator records its own state. When it needs to send a message m to another process, it sends (m, red). • LY2. When a process receives a message (m, red), it records its state if it has not already done so, and then accepts the message m.
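Rules LY1 and LY2 can be sketched as follows. This is an illustration only: the class `LYProcess`, its methods, and the choice of an integer sum as the local state are assumptions made for the example, and the sketch covers just the coloring rules, not the bookkeeping needed to work out channel states.

```python
class LYProcess:
    """Hypothetical process that follows only rules LY1 and LY2."""

    def __init__(self, pid, state=0):
        self.pid = pid
        self.state = state    # local state: here, a running sum of accepted payloads
        self.red = False      # True once the local state has been recorded
        self.recorded = None

    def record(self):
        # Record the local state at most once.
        if not self.red:
            self.red = True
            self.recorded = self.state

    def send(self, m):
        # LY1: every message carries the sender's color at send time.
        return (m, "red" if self.red else "white")

    def receive(self, tagged):
        m, color = tagged
        if color == "red":
            self.record()     # LY2: record before accepting a red message
        self.state += m       # accept the message


p, q = LYProcess(0, state=10), LYProcess(1)
white = p.send(1)  # sent before p records
p.record()         # p acts as the initiator
red = p.send(5)    # sent after p records

# Non-FIFO delivery: the red message overtakes the white one.
q.receive(red)
q.receive(white)
print(q.recorded, q.state)  # 0 6
```

Here q records its state (0) before accepting the red message, so the recorded state does not reflect a message sent after p's recording. The white message that arrives afterwards was in the channel at recording time.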
The Lai-Yang Algorithm • The approach is “lazy” inasmuch as processes do not send or use any control message for the sake of recording a consistent snapshot. • The good thing is that if a complete snapshot is taken, then it will be consistent. • However, there is no guarantee that a complete snapshot will eventually be taken: if a process i wants to detect termination, then i will record its own state following its last action but send no message, so other processes may not record their states. In such cases, dummy control messages may be needed to propagate the recording.