
Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial

Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. Fred B. Schneider Presenter: Aly Farahat. Contents. Introduction State Machines Fault-Tolerance Agreement & Order Logical Clocks Synchronized Clocks Server Side Ordering Faulty Output Devices


Presentation Transcript


  1. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred B. Schneider Presenter: Aly Farahat CS5090

  2. Contents • Introduction • State Machines • Fault-Tolerance • Agreement & Order • Logical Clocks • Synchronized Clocks • Server Side Ordering • Faulty Output Devices • Faulty Clients • Using Time to Make Requests • Reconfiguration • Managing Reconfiguration • Integrating Repaired Replicas

  3. Client/Server Model

  4. Fault Types • Fail-stop faults: a faulty component enters a predefined state that lets other components detect the failure, and then halts • Byzantine faults: a faulty component behaves arbitrarily, possibly maliciously Q: Why do we need logic for programs?

  5. Fault Tolerance • Based on replication • t-fault-tolerant: the system delivers correct service as long as no more than t components fail • Identical replicas of the server • t+1 replicas for fail-stop faults • 2t+1 replicas for Byzantine faults Q: What kind of fault tolerance is this? What types of faults can it tolerate?
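The replica counts on this slide can be written down as a small helper. This is a minimal sketch; the function name `replicas_needed` and the fault-model strings are assumptions made for illustration, not names from the presentation.

```python
def replicas_needed(t: int, fault_model: str) -> int:
    """Minimum ensemble size that stays correct with up to t faulty replicas."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if fault_model == "fail-stop":
        # A fail-stop replica never produces a wrong output, so a single
        # surviving replica is enough.
        return t + 1
    if fault_model == "byzantine":
        # Up to t replicas may produce wrong outputs, so the correct
        # replicas must form a majority: (t + 1) out of (2t + 1).
        return 2 * t + 1
    raise ValueError(f"unknown fault model: {fault_model!r}")
```

For example, tolerating two faults takes three replicas under fail-stop faults and five under Byzantine faults.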

  6. Replication Scheme

  7. State Machine Model • Each server replica is an identical, deterministic state machine • State machines are request-driven and cannot make progress on their own • A client issues a request to the state machine
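A request-driven state machine can be sketched as an object whose state changes only inside a single `apply` method. The key-value store below is a hypothetical example chosen for illustration; the class name and the request format are assumptions, not taken from the slides.

```python
class KVStateMachine:
    """A deterministic, request-driven state machine (illustrative only)."""

    def __init__(self):
        self.state = {}

    def apply(self, request):
        """Process one request; the output depends only on the requests so far."""
        op, key, *args = request
        if op == "put":
            self.state[key] = args[0]
            return "ok"
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown operation: {op!r}")


# Two replicas that process the same requests in the same order
# end in the same state and produce the same outputs.
requests = [("put", "x", 1), ("put", "x", 2), ("get", "x")]
replica_a, replica_b = KVStateMachine(), KVStateMachine()
outputs_a = [replica_a.apply(r) for r in requests]
outputs_b = [replica_b.apply(r) for r in requests]
```

Because `apply` uses no clocks, randomness, or other outside input, replication reduces to delivering the same request sequence to every replica.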

  8. State Machine Behavior with Respect to Clients • O1: Requests issued by a single client are processed in the order in which they were issued • O2: If a request r1 could have caused a request r2, possibly from a different client, then r1 is processed before r2

  9. Example Q: Find the analogy between a state machine in this context and the FSMs used in sequential circuit synthesis.

  10. Agreement and Order • Replicas must coordinate so that all of them produce the same outputs while respecting O1 and O2 • Agreement: all nonfaulty replicas receive every request and agree on its value • Order: all nonfaulty replicas process requests in the same order • Stable request: a request is stable at a replica once no request that should be ordered before it can still arrive at that replica

  11. Agreement • IC1: All nonfaulty processors agree on the same value • IC2: If the transmitter is nonfaulty, all nonfaulty processors use its value as the one on which they agree Q: How can faulty processors be identified under a Byzantine fault model?

  12. Order and Stability • Order: all replicas process requests in the same order, given by the requests' unique identifiers • Stability: a request is stable at a replica once no request with a lower identifier can still be delivered to that replica; replicas process only stable requests • Protocols: • Logical Clocks • Synchronized Real-Time Clocks • Replica-Generated Identifiers Q: Suggest a scenario in which requests are received out of order.

  13. Logical Clocks
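The body of this slide is not in the transcript. As a reminder of the idea, the sketch below follows the standard Lamport logical clock rules; the class and method names are assumptions for illustration.

```python
class LamportClock:
    """Logical clock following Lamport's two update rules (illustrative)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event or a message send."""
        self.time += 1
        return self.time

    def receive(self, message_time):
        """Advance the clock past the timestamp carried by a received message."""
        self.time = max(self.time, message_time) + 1
        return self.time


# A request sent by client a is timestamped before b's receipt of it.
a, b = LamportClock(), LamportClock()
sent = a.tick()
received = b.receive(sent)
```

Pairing each clock value with a processor identifier makes the timestamps unique, so they can serve as the unique identifiers that the ordering requirement needs.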

  14. Stability Test • Every client periodically sends a request, possibly a null one • r is stable at a replica once the replica has received, from every nonfaulty client, a request r’ with T(r) < T(r’), where T returns the logical clock value attached to a request • Because message delays may be unbounded, this test cannot be used when Byzantine faults are possible
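The logical-clock stability test can be sketched as a single predicate. The function and parameter names are hypothetical, and the sketch assumes that timestamps are `(clock, client_id)` tuples and that the replica already knows which clients are nonfaulty.

```python
def is_stable(request_ts, latest_ts_from, nonfaulty_clients):
    """Return True once every nonfaulty client has sent a later request.

    request_ts:        timestamp T(r) of the request being tested
    latest_ts_from:    client id -> largest timestamp received from that client
    nonfaulty_clients: clients not known to have failed
    """
    return all(
        client in latest_ts_from and latest_ts_from[client] > request_ts
        for client in nonfaulty_clients
    )


latest = {"c1": (5, "c1"), "c2": (3, "c2")}
before = is_stable((4, "c1"), latest, {"c1", "c2"})  # c2 has sent nothing later
latest["c2"] = (6, "c2")
after = is_stable((4, "c1"), latest, {"c1", "c2"})
```

A client that has been detected as failed is simply left out of `nonfaulty_clients`, which is why the test relies on failures being detectable.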

  15. Synchronized Real-Time Clocks • Each processor has a real-time clock that is synchronized with the clocks of all other processors • An upper bound on request delay lets replicas order requests even in the presence of Byzantine failures

  16. Stability Test • Test 1: the replica waits long enough that no request with a smaller identifier can still arrive; disadvantage: every request incurs the wait • Test 2: the replica checks that it has received a request with a larger identifier from every client • In practice the disjunction of both tests is used Q: How are Byzantine failures handled in this case?
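The two real-time tests and their disjunction might look as follows. This is a sketch under stated assumptions: identifiers are the senders' clock readings, and `delta` is a hypothetical bound covering both request delay and clock skew. None of the names come from the slides.

```python
def stable_by_waiting(uid, local_clock, delta):
    """Test 1: anything sent before uid must have arrived by uid + delta."""
    return uid < local_clock - delta


def stable_by_clients(uid, latest_uid_from, clients):
    """Test 2: every client has already sent a request with a larger identifier."""
    return all(
        c in latest_uid_from and latest_uid_from[c] > uid for c in clients
    )


def is_stable(uid, local_clock, delta, latest_uid_from, clients):
    """Disjunction of both tests, as suggested on the slide."""
    return (stable_by_waiting(uid, local_clock, delta)
            or stable_by_clients(uid, latest_uid_from, clients))
```

Test 2 lets a busy system skip the wait, while Test 1 guarantees progress when some client stays silent.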

  17. Replica-Generated Identifiers • Advantage: clients need not communicate with one another; messages flow only between clients and replicas • Phase 1: each replica proposes a candidate unique identifier for a request it has received; the request is then said to be seen • Phase 2: the replicas agree on one of the candidates as the request's identifier; the request is then said to be accepted

  18. Requirements for Stability Agreement • Stability Test: an accepted request r is stable at a replica if every request r’ that the replica has seen but not yet accepted has a candidate identifier strictly greater than the identifier of r

  19. Generating Unique Identifiers Q: What is the significance of the i/N term?
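The formula on this slide is not in the transcript. The sketch below assumes one common form: replica i of N proposes floor(max(SEEN, ACCEPT)) + 1 + i/N, and the accepted identifier is the largest candidate. Both rules, and every name in the code, are assumptions for illustration. Exact fractions are used so that the i/N term never suffers rounding.

```python
from fractions import Fraction
from math import floor


class Replica:
    """Two-phase identifier generation (assumed formula, see lead-in)."""

    def __init__(self, index, total):
        self.index, self.total = index, total  # 0 <= index < total
        self.seen = Fraction(0)    # largest candidate proposed so far
        self.accept = Fraction(0)  # largest identifier accepted so far
        self.pending = {}          # seen but not yet accepted: request -> candidate

    def propose(self, request):
        """Phase 1: propose a candidate identifier; the request becomes seen."""
        base = max(floor(self.seen), floor(self.accept)) + 1
        candidate = base + Fraction(self.index, self.total)
        self.seen = candidate
        self.pending[request] = candidate
        return candidate

    def accept_request(self, request, uid):
        """Phase 2: record the agreed identifier; the request becomes accepted."""
        self.pending.pop(request, None)
        self.accept = max(self.accept, uid)

    def is_stable(self, uid):
        """Stable if every seen-but-unaccepted candidate exceeds uid."""
        return all(candidate > uid for candidate in self.pending.values())


r0, r1 = Replica(0, 2), Replica(1, 2)
uid_a = max(r0.propose("a"), r1.propose("a"))  # agreed identifier for "a"
r0.propose("b")                                # "b" is seen before "a" is accepted
r0.accept_request("a", uid_a)
```

Under this assumed formula, the i/N term makes candidates from different replicas distinct, and the floor ensures each new candidate exceeds everything the replica has already seen or accepted.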

  20. Tolerating Faulty Output Devices • Outputs Used Outside the System • Replicate Output Devices • Replicate Voters • Outputs Used Inside the system • Outputs go back to Clients • Each Client has a voter inside it
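A voter inside a client can be sketched as follows, assuming Byzantine faults, at most t faulty replicas, and hashable output values. The function name `vote` is hypothetical.

```python
from collections import Counter


def vote(replies, t):
    """Return the output reported by at least t + 1 replicas, else None.

    replies: replica id -> output value received from that replica
    t:       maximum number of faulty replicas
    """
    counts = Counter(replies.values())
    for value, count in counts.most_common():
        if count >= t + 1:
            # At most t replicas are faulty, so at least one of these
            # t + 1 matching replies comes from a correct replica.
            return value
    return None  # not enough matching replies yet; keep waiting
```

With fail-stop replicas the voter can be simpler still, since any single reply is known to be correct.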

  21. Tolerating Faulty Clients • Replication • Server State Machine Modification • Voter Inside the State Machine • Requests having same content but different identifiers • Requests having different content and identifiers Q: How is a voter failure inside the server handled?

  22. Defensive Programming • Replicating clients is not always possible • Lack of hardware • Application semantics do not allow replication • Defensive Programming: additional requirements on state machines that prevent a faulty client from triggering possibly destructive actions • Examples: • Memory partitioning and prevention of shared access • Bounding the time a shared resource can be held by scheduling requests on the server side

  23. Timed Requests • Pro: no need to transmit requests • Con: a timed request cannot carry parameters • Default Request: executes at the scheduled time at the server unless the client sends a different request

  24. Reconfiguration

  25. C, O and S • A configuration is a Triplet <C,O,S> • C: the set of operational clients • O: the set of operational output devices • S: the set of operational state machine replicas • C and O are needed by the state machine replicas • S is needed by the agreement protocol

  26. Configurators • Each configurator manages a single object in C, O or S • Detects failures and repairs of this object • Configurators are themselves clients • They issue reconfiguration requests to the state machine replicas • State machines use application-dependent mechanisms for failure detection

  27. Note: The next slides are adapted from a presentation of the same paper by Leon Traille of Georgia Tech.

  28. Integrating a Repaired Object • e[ri]: the state that a nonfaulty system element e should be in after processing requests r0 through ri • An element joining the configuration immediately after request rjoin must be in state e[rjoin] before it can participate • Fail-stop failures • Output device: e[rjoin] is likely to be a small amount of setup information that can be provided from the state variables of smi • Client: e[rjoin] is frequently based on previous sensor values and can be determined from information held by other clients • State machine replica: the information for e[rjoin] is stored in the state variables and pending requests at smi • Byzantine failures • e[rjoin] must be obtained from t + 1 replicas that agree, instead of from just one

  29. Integration with Logical Clocks • Integrating element e by state machine replica smi at request rjoin • Fail-stop processors • If e is a client or an output device: send any relevant portion of the state variables to e before sending any output produced by requests with unique identifiers larger than that of rjoin • If e is a state machine replica smnew: 1) send the values of the state variables and copies of any pending requests to smnew 2) relay to smnew every subsequent request r received from each client c such that uid(r) < uid(rc), where rc is the first request smnew received directly from c after being restarted • Byzantine failures • Because information from smi might be incorrect, t + 1 identical copies of the state information and t + 1 copies of each relayed message must be obtained
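The relay rule for a restarting replica can be sketched as a filter over the requests smi receives after rjoin. The `Request` type, the function name, and the choice to keep relaying for clients from which smnew has not yet heard directly are all assumptions made for illustration.

```python
from typing import NamedTuple


class Request(NamedTuple):
    client: str
    uid: int
    payload: str


def requests_to_relay(received, first_direct_uid):
    """Select the requests that smi should relay to smnew.

    received:         requests that smi received after rjoin
    first_direct_uid: client -> uid(rc), the first request smnew got directly
                      from that client (absent if none has arrived yet)
    """
    return [
        r for r in received
        if r.client not in first_direct_uid   # assumed: relay until rc is known
        or r.uid < first_direct_uid[r.client]
    ]


received = [Request("c1", 7, "x"), Request("c1", 9, "y"), Request("c2", 4, "z")]
relayed = requests_to_relay(received, {"c1": 8})
```

Requests from c1 with identifiers at or above uid(rc) are not relayed, because smnew receives those directly from the client.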

  30. Integration with Real-Time Clocks • Integrating element e by state machine replica smi at request rjoin • Fail-stop processors • If e is a client or an output device: send relevant portions of the state variables to e before sending any output produced by requests with unique identifiers larger than that of rjoin • If e is a state machine replica smnew: 1) send the values of the state variables and copies of any pending requests to smnew 2) relay to smnew every request received during the next interval of duration Δ • Byzantine failures • Because information from smi might be incorrect, t + 1 identical copies of the state information and t + 1 copies of each relayed message must be obtained

  31. Stability Test During Restart • Relaying messages breaks the stability tests • A request r may be received directly from client c, and later a request r’, also from c, may be relayed by smi with uid(r) > uid(r’) • Solution: consider requests from c stable only once no more relayed requests from c can arrive • Stability Test During Restart: a request r received directly from a client c by a restarting state machine replica smnew is stable only after the last request from c relayed by another processor has been received by smnew

  32. Thank you!
