1 / 16

Global States and Checkpoints

Global States and Checkpoints. Distributed Checkpoints and Rollback Recovery. Fault tolerance is achieved by periodically using stable storage to save the processes’ states during the failure-free execution .

nora
Download Presentation

Global States and Checkpoints

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Global States and Checkpoints CS 271

  2. Distributed Checkpoints and Rollback Recovery • Fault tolerance is achieved by periodically using stable storage to save the processes’ states during the failure-free execution. • Upon a failure, a failed process rolls back from one of its saved states, thereby reducing the amount of lost computation. • Each of the saved states is called a checkpoint CS 271

  3. Checkpoint based Recovery • Uncoordinated checkpointing: Each process takes its checkpoints independently • Coordinated checkpointing: Process coordinate their checkpoints in order to save a system-wide consistent state. • Communication-induced checkpointing: It forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes. CS 271

  4. Domino effect: example RecoveryLine P0 m7 m2 m5 m0 m3 P1 m6 m4 m1 P2 Domino Effect: Cascaded rollback which causes the system to roll back to too far in the computation (even to the beginning), in spite of all the checkpoints CS 271

  5. Global StateChandy and Lamport—TOCS 1985 • Global state of a distributed system • Local state of each process • Messages sent but not received • Many applications need the state of the system • Failure recovery, distributed deadlock detection • Detect stable properties. • Problem: how can you figure out the state of a distributed system? • Each process is independent • Network does not have any processing power. • Distributed snapshot: a consistent global state CS 271

  6. Global State • A consistent cut • An inconsistent cut CS 271

  7. Distributed Snapshot Algorithm • Assume each process communicates with another process using unidirectional FIFO point-to-point channels (e.g, TCPconnections) • Any process can initiate the algorithm • Checkpoint local state • Send MARKERon every outgoing channel • On receiving a first marker on a channel: • Process checkpoints local state and • Send markers on all outgoing channels, and save messages on all other channels. • On receiving subsequent marker on a channel: • stop saving messages for that channel • Saved messages are the state of the channel CS 271

  8. Distributed Snapshot • A process finishes when • It receives a marker on each incoming channel and processes them all • State: local state plus state of all channels • Send state to initiator • Any process can initiate snapshot • Multiple snapshots may be in progress • Each is distinguished by tagging the marker with the initiator ID (and sequence number) CS 271

  9. Snapshot Algorithm Example • Organization of a process and channels for a distributed snapshot CS 271

  10. Snapshot Algorithm Example • Process Q receives a marker for first time and records local state • Q records all incoming message • Q receives a marker for its incoming channel and finishes recording the state of the incoming channel CS 271

  11. Execution Example Sp0 Sp1 Sp2 Sp3 p m1 m2 m3 q Sq0 Sq1 Sq2 Sq3 CS 271

  12. Execution Example q records state as Sq1 , sends marker to p Sp0 Sp1 Sp2 Sp3 p m1 m2 m3 q Sq0 Sq1 Sq2 Sq3 CS 271

  13. Execution Example p records state as Sp2, channel state as empty Sp0 Sp1 Sp2 Sp3 p m1 m2 m3 q Sq0 Sq1 Sq2 Sq3 CS 271

  14. Execution Example q records channel state as m3 Sp0 Sp1 Sp2 Sp3 p m1 m2 m3 q Sq0 Sq1 Sq2 Sq3 CS 271

  15. Execution Example Recorded Global State = ((Sp2, Sq1), (0,m3) ) Sp0 Sp1 Sp2 Sp3 p m1 m2 m3 q Sq0 Sq1 Sq2 Sq3 CS 271

  16. Take home Message(Snapshot and global states) • General solution for global state detection. • Causality based detection of stable properties. • Simple efficient protocol, uses Markers and FIFO properties. • Don’t forget Channel States. • Foundation for Distributed Checkpointing and Rollback Recovery CS 271

More Related