
ITEC801 Distributed Systems


ulema

Presentation Transcript


  1. ITEC801 Distributed Systems • Fault Tolerance • Coulouris Chapter 8, 14

  2. Introduction • Characteristic feature of a distributed system: partial failure. • A partial failure occurs when one component fails. • It may affect the proper operation of some components while leaving others unaffected. • In contrast, a failure in a non-distributed system is total. • Distributed design goal: automatic recovery from partial failure without seriously affecting overall performance. • The system should continue to operate in an acceptable way.

  3. Introduction to Fault Tolerance • Also known as fail-safe design: a design that enables a system to continue operating, possibly at a reduced level (graceful degradation), rather than failing completely when some part of the system fails. • The system stays more or less fully operational in terms of throughput and response time. • Other examples: motor vehicles, structures. • Fault tolerance is not just a property of individual machines; a protocol such as TCP is an example.

  4. Key Properties • A dependable system has: • Availability: the system is ready to be used immediately. • Reliability: the system can run continuously without failure. • Safety: when the system temporarily fails to operate correctly, nothing catastrophic happens. • Maintainability: how easily a failed system can be repaired. • Note the difference between availability and reliability.
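
The difference between availability and reliability can be made concrete with a small calculation. The sketch below is only an illustration: the two systems and their numbers are assumptions chosen for the example, not figures from the slides. System A fails very often but only briefly, so it is highly available yet unreliable; system B never fails while running but is shut down for two weeks a year, so it is reliable yet less available.

```python
def availability(up_seconds: float, down_seconds: float) -> float:
    """Fraction of time during which the system is ready to be used."""
    return up_seconds / (up_seconds + down_seconds)


# Hypothetical system A: goes down for 1 ms every hour (frequent failures).
a = availability(3600 - 0.001, 0.001)

# Hypothetical system B: runs 50 weeks without failure, then is down 2 weeks.
week = 7 * 24 * 3600
b = availability(50 * week, 2 * week)

print(f"A: availability {a:.7f}, but it fails every hour (low reliability)")
print(f"B: availability {b:.4f}, with no failures while running (high reliability)")
```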

  5. Error and Fault • Error: a part of the system state that may lead to a failure. • Example: damaged packets. • Fault: the cause of an error. • Example: a bad transmission medium.

  6. Classification of Faults • Transient: occurs once and disappears. • Cause: environmental conditions. • Also called a soft fault. • Intermittent: occurs, vanishes, and then reappears. • Difficult to diagnose. • Cause: unstable or varying conditions. • Permanent: continues to exist. • Also called a hard fault.

  7. Failure Models

  8. Failure Models • Crash failure: occurs when a component permanently halts, having worked correctly until it stopped. • Nothing else is heard from it. • Example: an operating system failure. • Omission fault/failure: a component that does not respond to an input from another component, and so does not produce the expected output, exhibits an omission fault; the corresponding failure is an omission failure. • Example: a server fails to respond to a request.

  9. Failure Models • Timing fault/failure: a timing fault causes the component to respond with the correct value but outside the specified interval (either too soon or too late). The corresponding failure is a timing failure. • Example: an overloaded server that processes requests slowly.

  10. Failure Models • Response failure: the response of a component is simply incorrect. • Example: a server returns an incorrect reply. • Value fault/failure: a fault that causes a component to respond within the correct time interval but with an incorrect value is a value fault; the corresponding failure is a value failure. • Example: a faulty communication link. • State transition failure: a component reacts unexpectedly to an incoming request. • Example: a server receives an unrecognizable message.

  11. Failure Models • Arbitrary failures: a component can fail in both the time and the value domains in a manner not covered by any of the previous classes. A failed component that produces such output exhibits an arbitrary (Byzantine) failure. • Examples: • A server produces incorrect output that goes undetected. • Malicious collusion.
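
The time and value domains used in these definitions can be illustrated with a small classifier. This is a simplified sketch under stated assumptions: the function `classify` and its parameters are hypothetical, a single missing response is reported as an omission (a crash cannot be told apart from an omission by one observation), and real Byzantine behaviour covers far more than "wrong value at the wrong time".

```python
from typing import Optional, Tuple


def classify(response: Optional[Tuple[int, float]], expected: int,
             earliest: float, latest: float) -> str:
    """Classify one observed response as (value, arrival_time) or None."""
    if response is None:
        return "omission"              # nothing came back at all
    value, arrival = response
    value_ok = value == expected
    time_ok = earliest <= arrival <= latest
    if value_ok and time_ok:
        return "correct"
    if value_ok:
        return "timing"                # right value, outside the interval
    if time_ok:
        return "value"                 # in time, wrong value
    return "arbitrary"                 # wrong in both domains


print(classify((7, 2.0), expected=7, earliest=0.0, latest=1.0))
```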

  12. Agreement in Faulty Systems • In most cases we assume that a process group can reach agreement. • Examples: • Coordinator election. • Deciding whether or not to commit. • Task division. • Synchronization. • Reaching agreement can be nontrivial. • Assumption: processes cooperate. This may not be the case.

  13. Agreement in Faulty Systems • Challenge: consensus among the nonfaulty processes, reached in a finite number of steps. • Problem: different assumptions about the underlying system require different solutions. • Synchronous versus asynchronous. • Delay bounded or not. • Delivery ordered or not. • Unicasting versus multicasting.

  14. Agreement in Faulty Systems Circumstances under which distributed agreement can be reached.

  15. Byzantine Agreement Problem • The problem of reaching consensus among distributed units when some of them give misleading answers. • The problem is couched in terms of generals deciding on a common plan of attack. • Some traitorous generals may lie about whether they will support a particular plan and about what other generals told them. Exchanging only messages, what decision-making algorithm should the generals use to reach consensus? • What percentage of liars can the algorithm tolerate and still correctly determine a consensus?

  16. Byzantine Agreement Problem

  17. Byzantine Agreement Problem • Figure 8-5. The Byzantine agreement problem for three nonfaulty processes and one faulty process. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3. • With k faulty processes, 2k+1 nonfaulty processes are needed, for a total of 3k+1.
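
The vector exchange behind this figure can be simulated in a few lines. The sketch below is an assumption-laden illustration, not the figure itself: processes are numbered from 0, the function names are hypothetical, and the faulty process lies with a different value to every receiver. Each process announces its value, assembles a vector, forwards the vector to the others, and then takes a per-slot majority over the vectors it received; a slot without a majority is marked UNKNOWN (here `None`).

```python
from collections import Counter

UNKNOWN = None


def arbitrary_lie(sender, receiver, slot):
    """Hypothetical faulty behaviour: a distinct bogus value per message."""
    return ("lie", sender, receiver, slot)


def byzantine_agreement(values, faulty, lie=arbitrary_lie):
    """Return the result vector computed by every nonfaulty process."""
    procs = range(len(values))
    # Steps 1-2: vectors[i][j] is the value process i heard from process j.
    vectors = [[lie(j, i, j) if (j in faulty and j != i) else values[j]
                for j in procs] for i in procs]
    results = {}
    for r in procs:
        if r in faulty:
            continue
        result = []
        for slot in procs:
            # Step 3: r receives the vectors of all others; a faulty sender
            # may put anything in any slot.
            heard = [lie(s, r, slot) if s in faulty else vectors[s][slot]
                     for s in procs if s != r]
            value, count = Counter(heard).most_common(1)[0]
            # Step 4: keep the value only if a strict majority reported it.
            result.append(value if count > len(heard) / 2 else UNKNOWN)
        results[r] = result
    return results


four = byzantine_agreement([1, 2, 3, 4], faulty={2})   # 3 nonfaulty, 1 faulty
three = byzantine_agreement([1, 2, 3], faulty={2})     # 2 nonfaulty, 1 faulty
print(four)
print(three)
```

With four processes the nonfaulty ones agree on every slot except the faulty one; with only three processes no slot reaches a majority, which matches the 3k+1 bound for k = 1.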

  18. Byzantine Agreement Problem • The same as in the previous case, except now with two correct processes and one faulty process.

  19. Reliable Client-Server Communication • Reliable point-to-point communication: TCP. • TCP masks omission failures through acknowledgements and retransmissions. • Crash failures are not masked: the connection is abruptly broken. • Recovery: resend a connection request.

  20. RPC Semantics in the Presence of Failures • Five different classes of failures that can occur in RPC systems: • The client is unable to locate the server. • The request message from the client to the server is lost. • The server crashes after receiving a request. • The reply message from the server to the client is lost. • The client crashes after sending a request.

  21. Server Crashes (1) • Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.

  22. Server Crashes • Three approaches: • At-least-once semantics. • At-most-once semantics. • Guarantee nothing. • Exactly-once semantics is the ideal, but in general it cannot be guaranteed.

  23. Server Crashes (2) • Scenario: remote print job. • The client sends a request; the server has two options: • First send the completion message and then tell the printer. • First tell the printer and then send the completion message. • Three events can happen at the server (in different orderings): • Send the completion message (M). • Print the text (P). • Crash (C). • Strategies for the client to follow: • Never reissue a request. • Always reissue a request. • Reissue only in the absence of an acknowledgement of delivery. • Reissue only in the absence of an acknowledgement of printing.

  24. Server Crashes (3) • These events can occur in six different orderings: • M → P → C: a crash occurs after sending the completion message and printing the text. • M → C (→ P): a crash happens after sending the completion message, but before the text could be printed. • P → M → C: a crash occurs after printing the text and sending the completion message. • P → C (→ M): the text is printed, after which a crash occurs before the completion message could be sent. • C (→ P → M): a crash happens before the server could do anything. • C (→ M → P): a crash happens before the server could do anything.

  25. Server Crashes (4) • Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.
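
The combinations can be enumerated mechanically. The sketch below rests on assumptions that are not on the slide: the four client strategy names follow the usual textbook form of this table (always, never, only when acknowledged, only when not acknowledged), the acknowledgement is taken to be the completion message M, and the recovered server is assumed to print a reissued request exactly once. Each ordering is encoded by the events that happened before the crash.

```python
# Events that completed before the crash, per server strategy.
SERVER_ORDERINGS = {
    "M->P": ["MP", "M", ""],   # MPC, MC(P), C(MP)
    "P->M": ["PM", "P", ""],   # PMC, PC(M), C(PM)
}
STRATEGIES = ["always", "never", "only_when_acked", "only_when_not_acked"]


def outcome(before_crash: str, client_strategy: str) -> str:
    """Return OK (printed once), DUP (printed twice) or ZERO (never printed)."""
    printed = "P" in before_crash
    acked = "M" in before_crash
    reissue = {
        "always": True,
        "never": False,
        "only_when_acked": acked,
        "only_when_not_acked": not acked,
    }[client_strategy]
    prints = int(printed) + int(reissue)   # a reissued request prints once
    return ["ZERO", "OK", "DUP"][prints]


table = {(server, before, client): outcome(before, client)
         for server, orderings in SERVER_ORDERINGS.items()
         for before in orderings
         for client in STRATEGIES}

for client in STRATEGIES:
    row = [table[(s, b, client)] for s in SERVER_ORDERINGS
           for b in SERVER_ORDERINGS[s]]
    print(f"{client:20s}", row)
```

Under these assumptions every client strategy produces a ZERO or a DUP for at least one ordering, so no combination prints the text exactly once in all cases.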

  26. Lost Reply Messages • To the client, a lost reply looks like a server crash. • Idempotent requests can safely be repeated; non-idempotent requests cannot. • Example: bank account operations. • Other mechanisms: • Sequence numbers. • Bookkeeping.
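
Sequence numbers let a server recognise a retransmitted request and replay the earlier reply instead of executing a non-idempotent operation twice. The following is a minimal sketch; the `Account` class, its method names, and the idea of caching replies per (client, sequence number) pair are illustrative assumptions rather than a prescribed design.

```python
class Account:
    """Hypothetical bank account server with duplicate filtering."""

    def __init__(self, balance: int):
        self.balance = balance
        self.replies = {}   # (client_id, seq) -> reply already sent

    def withdraw(self, client_id: str, seq: int, amount: int) -> int:
        key = (client_id, seq)
        if key in self.replies:
            # Retransmission after a lost reply: do not withdraw again.
            return self.replies[key]
        self.balance -= amount
        self.replies[key] = self.balance
        return self.balance


account = Account(100)
first = account.withdraw("client-1", seq=1, amount=30)
retry = account.withdraw("client-1", seq=1, amount=30)   # reply was lost
print(first, retry, account.balance)
```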

  27. Client Crashes • If a client crashes before getting the reply, the computation becomes an orphan (the computation is assumed to take a long time). • Orphans are harmful and a cause of confusion. • Solutions: • Orphan extermination (grand-orphans complicate this). • Reincarnation, using time-based epochs. • Gentle reincarnation. • Expiration.
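
Reincarnation can be sketched as follows. This is an illustration under assumptions: the `Server` class and its methods are hypothetical, epochs are plain integers that a client increases after each reboot, and the server kills every computation that belongs to an older epoch of that client when a new epoch is announced.

```python
class Server:
    """Hypothetical server that tracks client epochs to remove orphans."""

    def __init__(self):
        self.current_epoch = {}   # client -> newest epoch announced
        self.running = []         # (client, epoch, job) still executing

    def start(self, client, epoch, job) -> bool:
        # Refuse requests that come from an earlier incarnation.
        if epoch < self.current_epoch.get(client, 0):
            return False
        self.running.append((client, epoch, job))
        return True

    def announce_epoch(self, client, epoch) -> None:
        # The rebooted client declares a new epoch; older work is killed.
        newest = max(epoch, self.current_epoch.get(client, 0))
        self.current_epoch[client] = newest
        self.running = [r for r in self.running
                        if r[0] != client or r[1] >= newest]


server = Server()
server.start("c", 1, "job-a")        # client then crashes: job-a is an orphan
server.announce_epoch("c", 2)        # client reboots and announces epoch 2
print(server.running)
```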

  28. Fault Tolerance in Groups • The key approach to tolerating a faulty process is to organize several identical processes into a group. • A collection of processes is dealt with as a single abstraction. • When a message is sent to the group, all members receive it. • We have already seen several group communication models. • The objective here is to see how fault tolerance can be achieved.

  29. Receipt & Delivery • The communication layer on a node receives a message. • It informs the communication layers of all other nodes that it has received the message. • When it has received such confirmations from all other nodes, it delivers the message.
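
The split between receiving a message and delivering it can be sketched as a small buffer in the communication layer. The `Node` class below is a hypothetical illustration: it models only the bookkeeping on one node, with confirmations passed in as method calls instead of network messages.

```python
class Node:
    """Communication layer of one group member (illustrative sketch)."""

    def __init__(self, name, group):
        self.name = name
        self.group = set(group)   # current membership, including this node
        self.payloads = {}        # msg_id -> payload received, not yet delivered
        self.confirms = {}        # msg_id -> nodes known to have received it
        self.delivered = []       # payloads handed to the application

    def receive(self, msg_id, payload):
        """Receipt: buffer the message and count this node as a receiver."""
        self.payloads[msg_id] = payload
        self.confirm(msg_id, self.name)

    def confirm(self, msg_id, sender):
        """Record that `sender` has the message; deliver once all members do."""
        self.confirms.setdefault(msg_id, set()).add(sender)
        if msg_id in self.payloads and self.confirms[msg_id] >= self.group:
            self.delivered.append(self.payloads.pop(msg_id))


node = Node("p1", ["p1", "p2", "p3"])
node.receive("m1", "hello")
node.confirm("m1", "p2")
before = list(node.delivered)   # one confirmation is still missing
node.confirm("m1", "p3")
print(before, node.delivered)
```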

  30. Send Message • [Diagram: group of eight processes, numbered 1 to 8]

  31. Send Confirms • [Diagram: group of eight processes, numbered 1 to 8]

  32. Reliability • First, use something like TCP: • Reliable point-to-point. • Ordered. • What happens if the sender fails during sending?

  33. Reliability • When a message has been delivered, it is marked as stable. • So what should be done with unstable messages? • What if the group changes?

  34. Send Message • [Diagram: group of eight processes, numbered 1 to 8]

  35. Failure • Not all confirms will be received. • The message will not be marked as stable. • The removal of the failed process from the group will be noticed.

  36. Group Membership Change • The process that registers the change notifies all remaining members. • On receipt of such a message, a process: • Multicasts its unstable messages. • Multicasts a flush message. • When a process has received a flush message from all remaining processes, it knows the new group membership.

  37. Group Membership Change • [Diagram: group of eight processes, numbered 1 to 8]

  38. Unstable & Flush – process 2 • [Diagram: processes numbered 1 to 8; legend: unstable message, flush message]

  39. Distributed Commit • An operation is performed by every member of a process group, or by none at all. • Reliable multicasting: the operation is the delivery of a message. • Distributed transaction: the operation is a commit. • Often established by means of a coordinator: • Participants are told to perform an operation. • Distributed commit is implemented using: • Two-phase commit. • Three-phase commit.

  40. Two-Phase Commit (1) • Step 1: The coordinator sends VOTE_REQUEST to all participants. • Step 2: Each participant returns VOTE_COMMIT or VOTE_ABORT. • Step 3: The coordinator gathers the responses and issues GLOBAL_COMMIT or GLOBAL_ABORT. • Step 4: Each participant either commits or aborts.
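
The four steps can be sketched as a single coordinator function. This is a failure-free illustration only: the `Participant` class and the function name are hypothetical, message passing is replaced by method calls, and timeouts, logging, and recovery are left out. The coordinator's abort decision is written as GLOBAL_ABORT here.

```python
class Participant:
    """Hypothetical participant that votes according to a fixed flag."""

    def __init__(self, name: str, can_commit: bool):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self) -> str:
        # Step 2: answer the coordinator's VOTE_REQUEST.
        self.state = "READY" if self.can_commit else "ABORT"
        return "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"

    def on_decision(self, decision: str) -> None:
        # Step 4: act on the coordinator's global decision.
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def two_phase_commit(participants) -> str:
    # Step 1: send VOTE_REQUEST to all participants and collect the votes.
    votes = [p.on_vote_request() for p in participants]
    # Step 3: commit only if every participant voted to commit.
    unanimous = all(v == "VOTE_COMMIT" for v in votes)
    decision = "GLOBAL_COMMIT" if unanimous else "GLOBAL_ABORT"
    for p in participants:
        p.on_decision(decision)
    return decision


ok_group = [Participant("p1", True), Participant("p2", True)]
bad_group = [Participant("p1", True), Participant("p2", False)]
print(two_phase_commit(ok_group), two_phase_commit(bad_group))
```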

  41. Two Phase Commit

  42. Two-Phase Commit (1) • (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.

  43. Two-Phase Commit (2) • Actions taken by a participant P when residing in state READY and having contacted another participant Q.

  44. Two-Phase Commit (3) • Outline of the steps taken by the coordinator in a two-phase commit protocol. . . .

  45. Two-Phase Commit (4) • Outline of the steps taken by the coordinator in a two-phase commit protocol. . . .

  46. Two-Phase Commit (5) • (a) The steps taken by a participant process in 2PC.

  47. Two-Phase Commit (7) • (b) The steps for handling incoming decision requests.

  48. Three-Phase Commit (1) • The states of the coordinator and each participant satisfy the following two conditions: • There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state. • There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.

  49. Three-Phase Commit (2) • Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.

  50. Recovery • So far, the focus has been on algorithms that tolerate faults. • Once a failure has occurred, it is essential to bring the process back to a correct state (the state before the failure happened). • What do we mean by recovery? • How are the states recorded?
