Fault Tolerance I CSE5306 Lecture Quiz due 17 July 2014
Fault Tolerance • Single-machine centralized systems go down when essential parts fail. • A distributed system is said to “tolerate a fault” if it recovers and continues to perform while its faulty part is being repaired.
R U O K ? 1. What does “tolerate a fault” mean? • A centralized system limping along after a debilitating failure. • Continuing to perform, except for an unimportant requirements violation caused by a minor failure. • A distributed system recovering and continuing to perform, while its faulty part is being repaired. • All of the above. • None of the above.
Basic Concepts of Fault Tolerance • A distributed system is “dependable,” if it is… • Available: ready for immediate use, MTTF/(MTTF+MTTR). • Reliable: works well, except during maintenance 1-2AM. • Safe: nuclear power plant controller failure does not cause catastrophe. • Maintainable: easily repaired. • Secure: failure does not expose users’ secrets. • Vocabulary words: • Failure: performance that violates system requirements. • Error: a component’s unexpected state that leads to failure. • Fault: a transient (bird strike), intermittent (loose connector) or permanent (burned out chip) cause of an error.
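The availability formula above can be checked with a few lines of Python. This is only a sketch: the function name `availability` and the sample numbers (999 hours between failures, 1 hour to repair) are made up for illustration and do not come from the lecture.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is ready for use: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
print(availability(999, 1))  # 0.999
```

Shortening the repair time (MTTR) raises availability just as surely as lengthening the time between failures (MTTF).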
R U O K ? Match the following terms with their definitions or examples below. 2. Dependable __ 3. Available __ 4. Reliable __ 5. Safe __ 6. Maintainable __ 7. Secure __ 8. Failure __ 9. Error __ 10. Fault __ • Nuclear power plant controller failure does not cause catastrophe. • Works well, except during regularly scheduled maintenance 1-2AM. • Ready for immediate use, MTTF/(MTTF+MTTR). • Easily repaired. • Available, reliable, safe, maintainable and secure. • Any performance that violates system requirements. • Failure does not expose users’ secrets. • A component’s unexpected state that leads to failure. • A transient, intermittent or permanent cause of an error.
Failure Models (severity scale: fatal → serious → annoying) • Crash failure: nothing to do but reboot. • Omission f.: no transport layer, no listening thread, send buffer overflow, infinite loop, scrambled dialog. • Timing f.: server’s late response drops connection, client responds before receive buffer allocation. • Response f.: Web search for beagles returns cats. • State transition f.: server takes default action after reasonable request. • Arbitrary (Byzantine) f.: insecure server sends deliberately wrong answers. • Fail-stop f.: clearly visible to other processes, after a warning perhaps. • Fail-silent systems: marginal performance; e.g., slow responses, so others cannot tell a halted process from a slow one. • Fail-safe faults: pretends to perform, but close analysis reveals nonsense.
R U O K ? Match the following terms with their definitions or examples below. 11. Crash failure __ 12. Omission f. __ 13. Timing f. __ 14. Response f. __ 15. State transition f. __ 16. Arbitrary (Byzantine) f. __ 17. Fail-stop f. __ 18. Fail-silent systems __ 19. Fail-safe faults __ • A server takes default action after reasonable request. • No transport layer, no listening thread, send buffer overflow, infinite loop, scrambled dialog. • Nothing to do but reboot. • Marginal performance; e.g., slow responses. • Pretends to perform, but close analysis reveals nonsense. • Insecure server sends deliberately wrong answers. • Clearly visible to other processes, after a warning perhaps. • Web search for beagles returns cats. • Server’s late response drops connection, client responds before receive buffer allocation.
Failure Masking by Redundancy • Fault tolerant systems hide failures; e.g., by 3-way voting on every decision (“triple modular redundancy” above). • Redundancy: • Information: Hamming code error correcting bits. • Time: hide transient/intermittent faults by aborting transaction and trying again. • Physical: hospital can run on batteries till its diesel generators start.
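The 3-way voting idea can be sketched as a majority voter over replica outputs. This is a minimal illustration, not a hardware TMR design; the name `majority_vote` and the sample values are assumptions.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value a strict majority of replicas produced, else raise."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value

# Three replicas compute the same result; one of them is faulty.
print(majority_vote([42, 42, 7]))  # 42
```

With three replicas the voter hides one faulty output; if two replicas fail differently, no majority exists and the failure is no longer masked.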
R U O K ? Match the following terms with their definitions or examples below. 20. Triple modular redundancy __ 21. Redundancy types __ 22. Information __ 23. Time __ 24. Physical __ • Hospital runs on batteries till its diesel generators start. • Hamming code error correcting bits. • A fault tolerant system hiding failures by 3-way voting on every decision. • Information, time and physical. • Hide transient/intermittent faults by aborting transaction and trying again.
Process Resilience • Collaborative groups of k+1 members can tolerate k crashes (p.331). • Byzantine groups of 3k+1 members can tolerate k lies (Fig. 8-5, p.333). • Processors of different administrative domains are “BAR fault tolerant”; i.e., Byzantine, altruistic and rational. (Their management is beyond the scope of this course, p.335.)
R U O K ? 25. Which of the following is true of fault tolerant group decision making? • Collaborative groups of k+1 members can tolerate k crashes. • Byzantine groups of 3k+1 members can tolerate k lies. • Processors of different administrative domains are “BAR fault tolerant.” • All of the above. • None of the above.
Design Issues • Tolerate a faulty process by organizing several identical processes into a group, in which all members receive the group’s messages. • A process may be a member of many groups, and join or drop out as needed. • Clients who rely upon a group’s services don’t know the members or how many there are.
R U O K ? 26. Which of the following accurately characterizes collaborative groups? • They tolerate faulty processes by organizing identical processes into a group, in which all members receive the group’s messages. • A process may be a member of many groups, and join or drop out as needed. • Clients who rely upon a group’s services don’t know the members or how many there are. • All of the above. • None of the above.
Flat Groups vs. Hierarchical Groups • Flat: all members are equal; symmetry with no single point of failure; voting takes time. • Hierarchical: director assigns specialists; director is single point of failure; her quick decisions don’t distract specialists.
R U O K ? Match the following group attributes with the group types below. 27. Director assigns specialists __ 28. All members are equal __ 29. Director is single point of failure __ 30. Symmetrical __ 31. No single point of failure __ 32. Voting takes time __ 33. Quick decisions don’t distract specialists__ • Flat. • Hierarchical.
Group Membership • A group server uses its calling lists of “first responders” to muster groups (e.g., “Tiger Teams”), as needs arise. • Problem: the group server is a group management single point of failure. • Solution: group members can manage themselves by multicasting their joining message, and by leaving (becoming unresponsive) when needed elsewhere. • Joiners must receive the group’s legacy messages; leavers must not receive group messages. • Protocols must exist for… • Reconstituting a group that loses too many members. • Arbitrating between two contenders for leadership.
R U O K ? 34. Which of the following is a group server’s responsibility? • Mustering new groups as needed. • Reconstituting a group that loses too many members. • Arbitrating among contenders for leadership. • All of the above. • None of the above.
Failure Masking and Replication • Primary-based replication: if a sound statistical analysis shows that a replicated system must tolerate k crashes, • then k backup systems must be ready for election as the primary system’s replacement (e.g., the Catholic Pope). • Flat groups use replicated-write and quorum-based protocols to coordinate groups of identical backup processes. • To tolerate k “sick” processes (i.e., Byzantine failures), 3k+1 flat group members are required. (The 2k+1 honest processes must out-vote the k liars.)
Agreement in Faulty Systems • Distributed agreement algorithms seek consensus among processes in a limited number of steps, under the following assumptions: • All processes march together (synchronous) or not. • All messages arrive within maximum time or not. • Processes’ messages are naturally ordered (TCP) or not. • Unicasting (separate messages) or multicasting. • Agreement is possible in only half of these combinations (see above, not Fig. 8-4, p.333).
R U O K ? • 35. Which of the following describes assumptions that a distributed agreement algorithm designer must make? • All messages arrive within maximum time or not. • Processes’ messages are naturally ordered (TCP) or not. • Unicasting (separate messages) or multicasting. • All of the above. • None of the above.
Agreement in Byzantine Systems • How can a group of 4 agree, in spite of one sick member (see a above)? (Assume synchronous, unicasts and bounded message delays.) The 3-step solution: • Honest nodes send their node numbers to all others; liar sends correct node number only to herself (b above). • Every node sends a vector of everything she received to everyone else (c above). • Every node sees an accurate majority vote in every matrix column.
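The three steps can be traced in a small simulation for 4 nodes with 1 liar. This is a sketch under stated assumptions: the function names, the choice of node 3 as the liar, and the made-up numbers the liar sends (100 + receiver, then 200-series junk) are all illustrative. In this sketch the liar's own column ends up with no majority and is marked `UNKNOWN`, while every honest column comes out right.

```python
from collections import Counter

UNKNOWN = None  # marks a column in which no value reaches a majority

def majority(column):
    value, count = Counter(column).most_common(1)[0]
    return value if count * 2 > len(column) else UNKNOWN

def agree(nodes, liar):
    # Step 1: honest nodes send their node number to everyone; the liar
    # sends a different made-up number (100 + receiver) to each receiver.
    heard = {r: {s: (100 + r if s == liar else s) for s in nodes} for r in nodes}

    # Step 2: every node relays the vector of everything it heard;
    # the liar relays invented junk instead.
    def relayed(sender, receiver):
        if sender == liar:
            return {s: 200 + 10 * receiver + s for s in nodes}
        return heard[sender]

    # Step 3: each honest node takes a per-column majority over the
    # vectors it received from the other nodes.
    result = {}
    for r in nodes:
        if r == liar:
            continue
        vectors = [relayed(s, r) for s in nodes if s != r]
        result[r] = {col: majority([v[col] for v in vectors]) for col in nodes}
    return result

for node, decided in agree([1, 2, 3, 4], liar=3).items():
    print(node, decided)  # each honest node: {1: 1, 2: 2, 3: None, 4: 4}
```

Each honest node receives two honest vectors and one junk vector, so the two honest copies out-vote the liar in every honest column.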
R U O K ? 36. How can a group of 3k+1 Byzantine group members reach an agreement, in spite of its k lying members? • Honest nodes send their node numbers to all others; liars send correct node numbers only to themselves. • Every node sends a vector of everything she received to everyone else. • Every node sees an accurate majority vote in every matrix column. • All of the above. • None of the above.
Failure Detection • How do you know if a server is alive? • Ping it. If no response till time expires, assume it is dead. But… maybe the network is unreliable, and pinging is crude. • If ping times out, ask node to ping via other path. • Gossiping (saying, “I’m alive”) is more reliable. • Regularly exchange information with neighbors. • When ping times out, honor your fallen comrade by failing to ACK too, till the whole group dies!
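The "I'm alive" approach can be sketched as a timeout-based detector fed by heartbeats. The class name `HeartbeatDetector`, the 3-second timeout and the node names are assumptions; the clock is passed in explicitly so the example stays deterministic.

```python
class HeartbeatDetector:
    """Suspects any node whose last "I'm alive" is older than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node -> time of its most recent heartbeat

    def heard_from(self, node, now):
        self.last_seen[node] = now

    def suspects(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

detector = HeartbeatDetector(timeout=3.0)
detector.heard_from("a", now=0.0)
detector.heard_from("b", now=0.0)
detector.heard_from("a", now=4.0)   # "a" keeps gossiping, "b" goes quiet
print(detector.suspects(now=5.0))   # ['b']
```

A suspected node is not necessarily dead: on an unreliable network its heartbeats may simply have been lost, which is why the slide suggests checking via another path.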
R U O K ? 37. How do you find out if a server is alive? • Ping it, and listen for “I’m alive.” • If ping times out, ask another node to ping via its alternate path. • Regularly exchange information with neighbors. • All of the above. • None of the above.
Reliable Point-to-Point Client-Server Communications • Communication failures: • Crash: server can attempt to set up a new connection, if client drops. • Omission: TCP masks lost messages by acknowledging received data and retransmitting whatever goes unacknowledged. • Timing: late message deliveries. • Arbitrary: network may send an old buffered message after sender resends it.
R U O K ? Match the following communication failures with their definitions or remedies below. 38. Crash __ 39. Omission __ 40. Timing __ 41. Arbitrary __ • Network may send an old buffered message, after sender resends it. • Late message deliveries. • Server can attempt to set up a new connection, if client drops. • TCP masks lost messages by acknowledging received data and retransmitting whatever goes unacknowledged.
RPC Semantics in the Presence of Failures • Five different failures foil systems’ attempts to hide communications and make RPCs appear local: • The client cannot locate the server. • The client-to-server request message is lost. • The server crashes after receiving a request. • The server-to-client reply message is lost. • The client crashes after sending a request.
R U O K ? 42. What communications failures can foil systems’ attempts to make RPCs appear local? • The client cannot locate the server. • The client-to-server request message is lost. • The server crashes after receiving a request. • All of the above. • None of the above.
Client Cannot Locate Server • Reasons why client can’t reach server: • Server is down. • Client’s interface protocol is obsolete. • What to do about it: raise an exception. • Drawbacks: • Languages disagree on how to handle exceptions. • Destroys illusion that the RPC is local.
R U O K ? 43. What is wrong with raising an exception, when an RPC client can’t reach the server? • Languages disagree on how to handle exceptions. • It destroys the illusion that the RPC is local. • All of the above. • None of the above.
Lost Request Messages • What to do about it: when reasonable response time expires, resend message. • Drawbacks: • When resent messages get lost too, see “Client Cannot Locate Server” above. • When message response is too slow (message not lost), server must deal with duplicated messages (see “Lost Reply Messages” below).
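The resend-on-timeout rule might look like the sketch below. Everything named here is hypothetical: `call_with_retry` stands in for a client stub, and `make_flaky_send` fakes a lossy network that drops a fixed number of requests. After the last attempt the caller gets an exception, which is the "cannot locate server" case.

```python
def call_with_retry(send, request, attempts=3):
    """Resend `request` each time `send` raises TimeoutError, up to `attempts` tries."""
    for _ in range(attempts):
        try:
            return send(request)
        except TimeoutError:
            continue  # response time expired: resend the same request
    raise ConnectionError(f"no reply after {attempts} attempts")

def make_flaky_send(drops):
    """Hypothetical stand-in for a network that loses the first `drops` requests."""
    state = {"left": drops}

    def send(request):
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError
        return f"reply to {request}"

    return send

print(call_with_retry(make_flaky_send(2), "read block 7"))  # reply to read block 7
```

Note that the client cannot tell a lost request from a slow reply, so the server may see the same request twice.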
Server Crashes • Servers crash (and fail to reply) after or before executing the requested process (b and c above). • What to do: • Wait till server reboots and call again; i.e., “at least once semantics.” • Give up and report a failure; i.e., “at most once semantics.” • Do nothing; i.e., don’t help or even explain. • What if server crashes before printing a large file? • Client can resend request, risking printing two copies. • Client doesn’t resend request, risking getting no print out. • Client can resend, if its request is not ACKed. • Client can resend, if server did not say, “Print out is ready.”
Server Crashes (continued) • What if print server crashes before (or after) printing a large file? • Three events are involved: M (server sends the completion message), P (server prints the text) and C (server crashes). • These events can occur in six different orderings; events in parentheses did not happen: • M →P →C: A crash occurs after sending the completion message and printing the text. • M →C (→P): A crash happens after sending the completion message, but before the text could be printed. • P →M →C: A crash occurs after printing the text and sending the completion message. • P →C (→M): The text is printed, after which a crash occurs before the completion message could be sent. • C (→P →M): A crash happens before the server could do anything. • C (→M →P): A crash happens before the server could do anything. • See the client’s possible/necessary responses and outcomes above.
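The six orderings can be crossed with four client reissue policies to see how many copies get printed. Here M is the completion message, P is printing and C is the crash. This sketch assumes that a reissued request always prints successfully after the reboot and that M, once sent, reaches the client; the dictionary and function names are made up for illustration.

```python
# Events completed before the crash, per server strategy.
# "MP" = both done, "M" = crash before printing, "" = crash before anything.
SERVER_ORDERINGS = {
    "M->P": ["MP", "M", ""],
    "P->M": ["PM", "P", ""],
}

# Client reissue policies, keyed on whether the completion message arrived.
CLIENT_STRATEGIES = {
    "always": lambda got_m: True,
    "never": lambda got_m: False,
    "only if M arrived": lambda got_m: got_m,
    "only if M did not arrive": lambda got_m: not got_m,
}

def outcome(done, reissue):
    """Count printed copies: one if P happened, plus one if the client reissues."""
    copies = int("P" in done) + int(reissue("M" in done))
    return ("ZERO", "OK", "DUP")[copies]

for server, orderings in SERVER_ORDERINGS.items():
    for name, reissue in CLIENT_STRATEGIES.items():
        row = [outcome(done, reissue) for done in orderings]
        print(f"{server:5} {name:25} {row}")
```

Under these assumptions no client policy yields OK for all three crash points of either server strategy, so exactly-once printing cannot be guaranteed by the client alone.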
Lost Reply Messages • Safely repeated requests are idempotent; e.g., resend a file block, not retransfer $1000. • It is safest to assume that no request is idempotent: • Mark all requests with sequence numbers to distinguish originals from repeats. • Set a bit in the header of repeated requests, so that the server can handle it with care, which depends upon circumstances.
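Sequence numbers let a server recognize a repeat and replay its earlier reply instead of redoing the work. The sketch below assumes a hypothetical `Account` server and a reply cache keyed by client and sequence number; neither is prescribed by the lecture.

```python
class Account:
    """Toy server whose transfer operation is NOT idempotent."""

    def __init__(self, balance):
        self.balance = balance
        self.replies = {}  # (client_id, seq) -> reply already sent

    def transfer_out(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:
            # Repeated request: replay the cached reply, do not transfer again.
            return self.replies[key]
        self.balance -= amount
        self.replies[key] = self.balance
        return self.balance

account = Account(balance=5000)
print(account.transfer_out("client-1", seq=1, amount=1000))  # 4000
print(account.transfer_out("client-1", seq=1, amount=1000))  # 4000 (repeat, no second transfer)
```

A real server would also need some rule for discarding old cache entries, which this sketch leaves out.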
R U O K ? 44. What can you do to help safeguard against lost reply messages? • Assume that no request is idempotent. • Mark all requests with sequence numbers to distinguish originals from repeats. • Set a bit in the header of repeated requests, so that the server can handle it with care. • All of the above. • None of the above.
Client Crashes • Computations left running for a crashed client, and their un-received server responses, are “orphans”: • They waste CPU cycles, lock files and use resources. • Their premature arrival after client reboots can be confusing. • What to do about orphans? • Log every step, and read log after reboot. If it shows request was issued, kill the orphan. • Broadcast every step completion and broadcast reboot message. Let listeners kill the orphans. • Upon receiving reboot messages, others try to locate parents. If they are dead, the orphans die. • Orphans die, when client’s response times out. • Killing orphans can have lasting undesired side effects.
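The broadcast-on-reboot remedy could be sketched as follows. The slide does not say how listeners recognize a rebooted client's orphans; this sketch assumes each client numbers its lifetimes with an epoch counter and announces the new epoch when it reboots. The class, method and job names are hypothetical.

```python
class Server:
    def __init__(self):
        self.jobs = {}  # job_id -> (client, epoch in which the request was issued)

    def start(self, job_id, client, epoch):
        self.jobs[job_id] = (client, epoch)

    def on_reboot_broadcast(self, client, new_epoch):
        """Kill every job that `client` started before it rebooted."""
        orphans = sorted(
            job_id
            for job_id, (owner, epoch) in self.jobs.items()
            if owner == client and epoch < new_epoch
        )
        for job_id in orphans:
            del self.jobs[job_id]
        return orphans

server = Server()
server.start("print-17", client="alice", epoch=1)
server.start("sort-3", client="bob", epoch=1)
print(server.on_reboot_broadcast("alice", new_epoch=2))  # ['print-17']
print(sorted(server.jobs))                               # ['sort-3']
```

Deleting the job record is the easy part; as the slide warns, a killed orphan may already have locked files or changed state elsewhere.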
R U O K ? 45. Why should you care about “orphans” (i.e., unreceived server responses)? • They waste CPU cycles and use resources. • Their premature arrival after client reboots can be confusing. • Even if killed without mercy, they can leave devastating lasting effects; e.g., locked files. • All of the above. • None of the above.
Reliable Group Communication • Reliable multicast services are as important as resilient process replication. • Multicasts should guarantee deliveries to all members of a group. • But that ain’t easy…!
Basic Reliable-Multicasting Schemes • TCP only guarantees point-to-point deliveries. • Broadcasting via point-to-point connections is efficient for a few group members (see above). • Sequence numbers on every broadcast message prompt receivers to NAK missing messages. (Sender retains each message till every receiver ACKs.)
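On the receiving side, sequence numbers make gaps visible, and a gap is what triggers a NAK. The sketch below assumes numbering starts at 0 and uses a hypothetical `Receiver` class; the sender's retained-message buffer is left out.

```python
class Receiver:
    def __init__(self):
        self.next_expected = 0
        self.held = {}       # out-of-order messages waiting for a gap to fill
        self.delivered = []  # payloads handed to the application, in order

    def receive(self, seq, payload):
        """Accept one message; return the sequence numbers to NAK."""
        if seq < self.next_expected:
            return []  # duplicate of something already delivered
        self.held[seq] = payload
        while self.next_expected in self.held:
            self.delivered.append(self.held.pop(self.next_expected))
            self.next_expected += 1
        highest = max(self.held, default=self.next_expected)
        return [s for s in range(self.next_expected, highest) if s not in self.held]

r = Receiver()
print(r.receive(0, "a"))  # []
print(r.receive(2, "c"))  # [1]  message 1 is missing, so NAK it
print(r.receive(1, "b"))  # []
print(r.delivered)        # ['a', 'b', 'c']
```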
R U O K ? 46. Which of the following describe basic reliable multicasting? • TCP only guarantees point-to-point deliveries. • Broadcasting via point-to-point connections is efficient for relatively few group members. • Sequence numbering broadcast messages enables receivers to NAK missing messages. • All of the above. • None of the above.
Scalability in Reliable Multicasting • Having receivers send only a few NAKs, rather than many ACKs, scales up to larger groups. • A server that deletes an old message risks the possibility that some receiver still has not received it.
Nonhierarchical Feedback Control • The Scalable Reliable Multicasting protocol does just the right amount of feedback suppression. • When a receiver misses a message, it multicasts its NAK (see above), which suppresses all others’ NAKs. • NAK collisions are prevented by randomly delaying the NAK while listening for others’ NAKs, as in the Ethernet protocol. • WANs with long propagation delays can’t do this very well. Neighboring nodes can team up on NAKing, by communicating with each other via a separate channel.
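The random-delay suppression can be modelled very roughly: each receiver that missed the message draws a back-off timer, and a receiver sends its NAK only if its timer fires before the first NAK reaches it. The function name `naks_sent`, the uniform propagation delay and the timer values are assumptions made for this sketch.

```python
def naks_sent(timers, propagation_delay=0.0):
    """Return indexes of receivers that multicast a NAK.

    `timers` holds each receiver's random back-off. A receiver is suppressed
    once the earliest NAK (sent at min(timers)) has had time to reach it.
    """
    first = min(timers)
    return [
        i for i, t in enumerate(timers)
        if t == first or t < first + propagation_delay
    ]

timers = [0.3, 0.1, 0.7]  # hypothetical back-off values, in seconds
print(naks_sent(timers))                          # [1]     LAN: one NAK suppresses the rest
print(naks_sent(timers, propagation_delay=0.25))  # [0, 1]  WAN: a second NAK slips out
```

The longer the propagation delay is relative to the spread of the timers, the more redundant NAKs escape suppression, which is the WAN problem the slide describes.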
R U O K ? 47. Which of the following accurately characterizes the Scalable Reliable Multicasting protocol doing just the right amount of feedback suppression? • When a receiver misses a message, it multicasts its NAK, which suppresses all others’ NAKs. • NAK collisions are prevented by the receiver’s randomly delaying its NAK while listening for others’ NAKs, as in the Ethernet protocol. • WANs with long propagation delays can’t do this very well, but neighboring nodes can team up on NAKing, by communicating with each other via a separate channel. • All of the above. • None of the above.
Hierarchical Feedback Control • Hierarchical groups scale better than flat ones. • Sender sends to roots of large spanning trees. • Root’s local coordinators buffer and relay messages, as well as handle their subgroups’ ACKs and NAKs. • Application-level multicasting (pp.166-170) can solve the hierarchical subgroups’ dynamic growth and contraction problems.
R U O K ? 48. Which of the following is a reason why hierarchical groups scale better than flat ones? • Sender sends to roots of large spanning trees. • Roots’ local coordinators buffer and relay messages, as well as handle their subgroups’ ACKs and NAKs. • Application-level multicasting can solve the hierarchical subgroups’ dynamic growth and contraction problems. • All of the above. • None of the above.