
Recovery From Failure in Distributed Systems CS 188 Distributed Systems February 26, 2015


Presentation Transcript


  1. Recovery From Failure in Distributed Systems • CS 188: Distributed Systems • February 26, 2015

  2. Introduction • Failure is a fact of life in most distributed systems • Particularly crashes of either apps or nodes • If failed element was involved in distributed computations, what then? • Especially if it starts up again

  3. Problems of Failures • Loss of data on failed node • Loss of participation in distributed computation • Results in inconsistent state information • Can partition the network • Lost/corrupted messages

  4. Recovery Has Its Problems, Too • Recovered nodes arrive unexpectedly • If they try to rejoin the system, they are missing critical state • Properly integrating recovering nodes requires extra resources • And may harm distributed computation on non-failed nodes • Nevertheless, you usually want them back • So you’ve got to handle these problems

  5. Recovery Mechanisms • Transactional mechanisms • Replicated data recovery • Rejoining groups • Snapshots and distributed recovery

  6. Transactional Mechanisms • Use mechanisms like the two- or three-phase commit • Storing phases of algorithm in stable storage • On recovery, determine whether commits were in progress • Saved state of the phase tells you what to do

  7. Saving State For Transactions • Save all the state related to your local actions • Since you’ll need to write that if you must commit • Save information about phase of the protocol you know about • Save before sending the message . . .
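One way to picture slide 7's advice, as a minimal Python sketch: the participant appends its pending changes and its protocol phase to a stable-storage log and forces them to disk before sending the corresponding protocol message. The log format, the txn.log file name, and the send callback are illustrative assumptions, not anything the slides specify.

```python
import json
import os

LOG_PATH = "txn.log"  # hypothetical stable-storage location

def log_record(record):
    """Append a record to the log and force it to disk, so it survives
    a crash that happens right after we send the next protocol message."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())

def vote_commit(txn_id, local_changes, send):
    # 1. Save the local changes we would need to apply if we must commit.
    log_record({"txn": txn_id, "type": "changes", "data": local_changes})
    # 2. Save the protocol phase we are about to enter...
    log_record({"txn": txn_id, "type": "phase", "phase": "voted-commit"})
    # 3. ...and only then send the message to the coordinator.
    send({"txn": txn_id, "msg": "ack-canCommit"})
```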

  8. What Does The Recovering Node Do? • Must make the same choice that the nodes that didn’t fail made • If they committed, recovering node must commit • If they aborted, recovering node must abort

  9. Three Phase Commit • [Diagram: three-phase commit message exchange between the coordinator and participant(s): canCommit → ack, startCommit (prepare) → ack, Commit → ack/confirm, with naks or timeouts leading to ABORT and all acks leading to COMMIT] • Let’s say a participant fails and recovers • At what point did it fail? • What did the others do?

  10. How Do You Abort? • Throw away any aborted changes • Could just clear saved record of transaction • Maybe better to write “abort” to it • And perhaps garbage collect later

  11. How Do You Commit? • Apply the local changes you have saved in stable storage • After which you can release that storage • Send “committed” to coordinator • So he can free his records • Could just clear saved record of transaction • Maybe better to write “committed” to it • And perhaps garbage collect later
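Putting slides 8, 10, and 11 together, a hedged sketch of what a recovering participant might do, reusing the hypothetical log_record helper from the earlier sketch; the ask_coordinator and apply_changes callbacks are assumptions standing in for whatever protocol the system actually uses.

```python
def recover(txn_id, log_records, ask_coordinator, apply_changes):
    """Finish an in-progress transaction the same way the nodes that
    didn't fail finished it (assumes apply_changes is idempotent)."""
    changes, outcome = None, None
    for rec in log_records:
        if rec["txn"] != txn_id:
            continue
        if rec["type"] == "changes":
            changes = rec["data"]
        elif rec["type"] == "phase" and rec["phase"] in ("committed", "aborted"):
            outcome = rec["phase"]          # we already decided before the crash

    if outcome is None:
        outcome = ask_coordinator(txn_id)   # follow whatever the others chose

    if outcome == "committed":
        apply_changes(changes)              # apply the saved local changes
    # Either way, record the final outcome; garbage-collect the log later.
    log_record({"txn": txn_id, "type": "phase", "phase": outcome})
```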

  12. Replicated Data Recovery • If recovering site holds replicated copies of data, need to check on status • Perhaps they were updated at another site during the failure • If so, you probably missed any update propagation attempt

  13. Reconciliation • The process of bringing a replica up to date • Perhaps because of failures or partitions • Perhaps part of regular operations • Easier if only primary site can update the data

  14. The Reconciliation Process • Contact some site storing another replica • Preferably the one with the most recent version of that data • Determine what updates the other site has that you haven’t seen • Propagate those updates to your replica
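A rough sketch of that pull loop under a deliberately simple assumption: each item carries a single monotonically increasing version number, which fits the primary-site-update case from slide 13. The remote.versions() and remote.get() calls are hypothetical.

```python
def reconcile(local_store, local_versions, remote):
    """Pull every update the other replica has seen that we haven't.
    Assumes one writer per item, so a plain version number is enough."""
    for key, remote_version in remote.versions().items():
        if remote_version > local_versions.get(key, 0):
            # The other site is strictly newer for this item: copy it over.
            local_store[key] = remote.get(key)
            local_versions[key] = remote_version
```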

  15. Issues for Replication • How do I know whether the remote data is newer? • Timestamps, version vectors, etc. • What if there are conflicts? • Making it cheap
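For the "is the remote data newer, and is there a conflict?" question, here is the standard version-vector comparison; the slides mention version vectors but do not give this code, so treat it as an illustrative sketch.

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts mapping node id -> counter).
    Returns 'newer', 'older', 'equal', or 'conflict'."""
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"   # concurrent updates: someone must resolve them
    if a_ahead:
        return "newer"      # vv_a dominates vv_b
    if b_ahead:
        return "older"      # vv_b dominates vv_a
    return "equal"
```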

  16. The Cost of Reconciliation • Let’s say you replicate at the volume level • Let’s say a volume contains 5000 separate files • How do you determine which ones have changed? • Two options: • Scan both replicas and compare timestamps/version vectors/whatever • Keep track of updates as they occur

  17. Propagating the Data • Potentially very expensive, if there’s a lot of new data • Do you grab it all before allowing local action to proceed? • Do you schedule it for update later? • Perhaps marking the local copy as “dirty”? • Do you prioritize somehow?

  18. Keeping Track of Updates • Results in a list of candidate files to reconcile • Vital that the list be complete • Too inclusive better than not sufficiently inclusive • Typically requires hooks into the file system • So you can update your list as writes occur • Also requires mechanisms to prune the list
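A toy version of the update-tracking option from slide 16, assuming the file-system write hook calls a mark_dirty() function and the reconciler prunes the list by draining it; all names are illustrative.

```python
import threading

_dirty = set()              # candidate files to reconcile
_lock = threading.Lock()

def mark_dirty(path):
    """Called from the file-system write hook; err on the side of
    including too much rather than missing an update."""
    with _lock:
        _dirty.add(path)

def drain_dirty_list():
    """Hand the current candidate list to the reconciler and prune it."""
    with _lock:
        candidates = set(_dirty)
        _dirty.clear()
    return candidates
```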

  19. Rejoining Groups • The failed node might have belonged to some groups before failure • Participating in distributed computations • Being part of a distributed file system • Serving as an element of a reliability mechanism • On recovery, it often wants to rejoin those groups

  20. Mechanisms for Rejoining Groups • If groups use a leader, generally invoke the leader election mechanism • Which ordinarily should be cheap and easy • There’s already a leader and he stays the leader • The recovering node just gets enlisted in his group

  21. Example: The Bully Algorithm • Someone who wasn’t the leader fails • The leader will eventually notice and remove the failed node from the group • On recovery, the failed node will ask if the leader is around • He probably is, so the failed node accepts his leadership and rejoins
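A hedged sketch of that rejoin path, assuming a probe_leader() call that returns the current leader (or None) and a start_election() fallback; neither is spelled out on the slide.

```python
def rejoin_after_recovery(probe_leader, start_election, join_group):
    """Recovering node: check whether a leader is already in place.
    Usually there is one, so rejoining is cheap; only run a full
    Bully election if the probe finds nobody in charge."""
    leader = probe_leader()
    if leader is not None:
        join_group(leader)      # accept the existing leader and rejoin
    else:
        start_election()        # rare case: the leader is gone too
```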

  22. How Important Is Perfect Accuracy? • Is it vital that all nodes in a distributed system know about each other’s status? • Perfectly correctly? • With complete agreement? • It’s easier if you don’t require that • Especially at high scale • If you require it, your system will devote a lot of effort to maintaining groups

  23. An Example: The Locus System • Locus was a distributed operating system • It tried to provide excellent transparency • Which required a very consistent view of the available resources • So you could be sure everyone saw everything the same way • But nodes failed and recovered • And partitions were possible

  24. The Locus Solution • A process called topology change • Run whenever nodes disappeared or reappeared • With the goal of reaching agreement on which nodes were present • And hiding the protocol behavior from applications • E.g., if an application used resources from nodes A and B, failure of C should not affect it

  25. Basics of Locus Topology Change • Run by a master node • If multiple nodes run it, all but one abandons it when that’s known • Ask all available nodes for the set of other nodes they can talk to • Create a partition based on the intersection of all these sets • Iterate until stable and deliver to the nodes in the final set
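A simplified sketch of that iteration (not the real Locus protocol): a master repeatedly asks each candidate node which nodes it says it can talk to and intersects the answers until the set stops changing. reachable_from() is a hypothetical RPC.

```python
def topology_change(initial_candidates, reachable_from):
    """reachable_from(node) -> set of nodes that node says it can talk to.
    Returns a partition whose members all agree they can reach each other."""
    partition = set(initial_candidates)
    while True:
        new_partition = set(partition)
        for node in partition:
            # Intersect with each member's view of who it can reach.
            new_partition &= reachable_from(node) | {node}
        if new_partition == partition:
            return partition    # stable: deliver this set to its members
        partition = new_partition
```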

  26. Combining Resources • After settling on a partition, Locus had to make sure all members knew what resources were available • A second protocol performed that task • E.g., figuring out what file volumes were stored on each node

  27. Handling Failures in the Protocol • At any moment, the node running topology change could fail • Other participating nodes watched for failure • Taking over, if necessary • Leader watched for another node also running it • In which case, one took over

  28. Characteristics of Topology Change • Complex • Especially during unstable periods • Expensive • In messages • Required much effort to get correct • The lesson: be sure you really want this before you try to do it

  29. Checkpoints and Snapshots • We often run distributed computations • Often they take a long time • Some participating nodes might fail in the middle • It’s a bummer if that means we need to totally restart the whole computation • What are our other options?

  30. Checkpointing • Each node in the computation could periodically save its local state to disk • If it fails, on recovery it could consult the disk • Restoring the most recent checkpoint and rejoining the computation • Ideally, the failure only costs a temporary performance penalty
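A minimal local-checkpoint sketch, assuming the computation's vital state fits in one picklable object and that an atomic rename is enough to keep the latest checkpoint intact across a crash; the file name is illustrative.

```python
import os
import pickle

CKPT = "checkpoint.pkl"     # illustrative file name

def save_checkpoint(state):
    """Write the new checkpoint to a temp file, then atomically replace
    the old one, so a crash mid-write never loses the previous checkpoint."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CKPT)

def restore_checkpoint():
    """On recovery, reload the most recent checkpoint (None if there isn't one)."""
    if not os.path.exists(CKPT):
        return None
    with open(CKPT, "rb") as f:
        return pickle.load(f)
```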

  31. Local Checkpoints • Could be quite specific to a particular computation • Only store data vital to that computation • Specific solutions are cheaper, but only apply to that computation • Could be quite general • Store all of the contents of memory • General solutions are expensive, but comprehensive

  32. The Problem With Local Checkpoints • They only capture local state • Leading to two issues: • The other nodes move on while the failed node is recovering • What about messages?

  33. Dealing With Problem 1 • Could hold everyone up when failure is detected • Until the failed node recovers • Loses opportunities for faster execution • Never perfectly synchronized, anyway • Could instead catch recovering node up to everyone else

  34. Dealing With Problem 2 • Local state typically doesn’t include the state of the network • The failed node sent some messages • Maybe between the checkpoint and the failure • Other nodes sent some messages to the failed node • Between the checkpoint and recovery

  35. How Do We Handle These Messages? • Maybe the messages sent to the failed node aren’t critical • Application should deal with lost messages, anyway • Messages that were sent by the failed node are more difficult • Assuming determinism, on checkpoint restoration, they’re resent • Is that OK?
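One common way to make those deterministic resends harmless, offered here as an assumption rather than something the slide prescribes: give every message a per-sender sequence number and have receivers drop anything they have already processed.

```python
last_seen = {}   # sender id -> highest sequence number already processed

def deliver(sender, seqno, payload, handle):
    """Process a message at most once, even if the sender resends it
    after restoring a checkpoint and deterministically replaying."""
    if seqno <= last_seen.get(sender, -1):
        return                  # duplicate of something we already handled
    last_seen[sender] = seqno
    handle(payload)
```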

  36. A Different Approach • On recovery of a failed node, don’t just reset the failed node • Reset the entire computation • To a globally consistent state • Implying checkpoints for all nodes • Also implying capturing the state of the network

  37. Distributed Snapshots • Algorithms for taking consistent pictures of a distributed computation • Including the state of all nodes and the state of the network • Allows restoration of a distributed computation after failure recovery • Also useful for other purposes

  38. Capturing the State of a Distributed System • The distributed system contains two kinds of components • Processors • Network links • Both have state that is part of the overall state

  39. A State Example: A Consensus Algorithm • [Diagram: nodes A through E] • One node starts an election algorithm to choose a leader • Now folks start voting • Node E fails • Maybe before he voted, maybe after • So his voting message might (or might not) be in the network

  40. Solving the Problem With a Distributed Snapshot • If we could halt every element of the system and examine its state, we could decide if E had voted or not • Either there would be a vote message for E somewhere • Or there wouldn’t • How do we gather a correct snapshot?

  41. Distributed Snapshots and State • Distributed snapshots try to produce a consistent state of the system • What’s a state composed of? • Assume M processors, p1 through pM • At a given point in time, pi is in state si • Each local state is composed of the important local values

  42. Capturing States • Assume that, at any point, any processor can record its state • And send it to another processor • Is the problem just capturing the M states at the right instants? • Not quite - there are also messages in transit

  43. The Message Problem • [Diagram: message M traveling between nodes A and B] • B’s snapshot accounts for message M • When the system takes a snapshot of node A, how do we account for message M? • Perhaps it was delivered?

  44. But What If M Wasn’t Delivered? • [Diagram: message M still in transit between A and B] • Neither snapshot accounts for M • So if you just restore A and B’s snapshots, you get the wrong result

  45. Accounting for Messages • Assume a channel Ci,j between every pair of processes pi and pj • Assume FIFO and reliable delivery • Assume channels are unidirectional • The snapshot mechanism must account for messages in those channels • Channel state is the ordered set of outstanding messages • Li,j = (m1, m2, . . ., mk)

  46. Global System State • Combination of processor states and channel states • G=(S,L) • S is set of processor states • L is set of channel states • If only we could obtain all of each set instantaneously . . . • Well, we can’t

  47. Consistent States • We can’t capture all processor and channel states simultaneously • But we can capture them one by one • Can we organize our state capture to produce a consistent result? • Defined as an overall state that could have occurred

  48. Obtaining Consistent States • We will capture a state si for each processor pi • If pi received a message m from pj at the time we capture si • Then sj should indicate that m was sent • We must also capture the L states

  49. Consistent Cuts • Let oi be the event of observing si at pi • A state S is a consistent cut if every pair of observation events oi and oj is concurrent • Why? • Because if they’re concurrent, we know that no send/receives will be mis-ordered
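One concrete way to test the "pairwise concurrent" condition, assuming each observation event oi carries a vector timestamp; vector clocks are not mentioned on the slide, so this is just one standard way to check concurrency.

```python
def happens_before(vt_a, vt_b):
    """Vector-clock order: a -> b iff a <= b componentwise and a != b."""
    keys = set(vt_a) | set(vt_b)
    leq = all(vt_a.get(k, 0) <= vt_b.get(k, 0) for k in keys)
    return leq and vt_a != vt_b

def is_consistent_cut(observations):
    """observations: one vector timestamp per observation event oi.
    The cut is consistent iff no observation happens-before another,
    i.e. the oi are pairwise concurrent."""
    for i, a in enumerate(observations):
        for j, b in enumerate(observations):
            if i != j and happens_before(a, b):
                return False
    return True
```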

  50. How To Achieve Consistent Cuts? • Done relative to some particular observed protocol • Create messages that are outside that protocol that cause states to be observed • Making sure snapshot protocol doesn’t get in the way
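The "messages outside that protocol" described here are essentially the markers of the Chandy-Lamport snapshot algorithm, which the slides do not name; below is a condensed sketch of one process's role in such a snapshot, under the FIFO reliable-channel assumption from slide 45. Delivery of application messages to the application itself is omitted; only the snapshot bookkeeping is shown.

```python
class SnapshotProcess:
    """One process's part in a Chandy-Lamport-style snapshot.
    Assumes FIFO, reliable, unidirectional channels (slide 45) and a
    MARKER message that is outside the observed protocol (slide 50)."""

    def __init__(self, incoming_channels, send_markers, get_local_state):
        self.incoming = incoming_channels    # ids of this process's incoming channels
        self.send_markers = send_markers     # sends one MARKER on every outgoing channel
        self.get_local_state = get_local_state
        self.local_snapshot = None
        self.channel_snapshot = {}           # channel id -> messages caught in transit
        self.recording = set()               # channels still being recorded

    def start(self):
        """Initiator: record your own state, then flood markers."""
        self._record_state()

    def on_message(self, channel, msg):
        if msg == "MARKER":
            if self.local_snapshot is None:
                self._record_state()         # first marker seen: snapshot now
            # The channel the marker arrived on stops recording; its state is
            # whatever application messages we saw on it before the marker.
            self.recording.discard(channel)
        elif channel in self.recording:
            # An application message that was in flight when the snapshot began.
            self.channel_snapshot[channel].append(msg)

    def _record_state(self):
        self.local_snapshot = self.get_local_state()
        for ch in self.incoming:
            self.channel_snapshot[ch] = []
        self.recording = set(self.incoming)
        self.send_markers()
```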
