
Fail-Stop Processors

UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department. Fail-Stop Processors. CS 739 Distributed Systems. Andrea C. Arpaci-Dusseau. Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred Schneider, TOCS, May 1984 Example usage of byzantine agreement


Presentation Transcript


  1. UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department Fail-Stop Processors CS 739 Distributed Systems Andrea C. Arpaci-Dusseau • Byzantine Generals in Action: Implementing Fail-Stop Processors, Fred Schneider, TOCS, May 1984 • Example usage of Byzantine agreement • Why fail-stop processors can simplify replicated services • Why fail-stop processors are expensive (impractical?) to build • Remaining Time: Byzantine Werewolves (improved?)

  2. Motivation • Goal: Build systems that continue to work in presence of component failure • Difficulty/cost of building those systems depends upon how components can fail • Fail-stop components make building reliable systems easier than components with byzantine failures

  3. Fail-Stop Processors • What is a failure? • Output (or behavior) that is inconsistent with specification • What is a Byzantine failure? • Arbitrary, even malicious, behavior • Components may collude with each other • Cannot necessarily detect output is faulty • What is a fail-stop processor? • Halts instead of performing erroneous transformations • Others can detect halted state • Others can access uncorrupted stable storage even after failure

  4. Questions to Answer • 1) What are the advantages of fail-stop processors? • 2) Real processors are not fail-stop • Can we build one? • How can we build an approximation of one? • 3) Approximations of fail-stop processors are expensive to build • Under what circumstances is replicated service with fail-stop processors “better”?

  5. 1) Distributed State Machine • Common approach for building a reliable system • Idea: Replicate servers that may fail, coordinate client interactions with replicas • [Diagram: the client's input sequence reaches replicas R, R, R of the state machine via Byzantine agreement; the replica outputs are combined into one output] • t-fault tolerant: Satisfies specification as long as no more than t components fail • Failure model of components determines how many replicas, R, are needed and their interactions

  6. How to build t-fault tolerant state machine? • Inputs • Key: All replicas receive and process same sequence of inputs • 1) Agreement: Every nonfaulty replica receives same request (interactive consistency or byzantine agreement) • 2) Ordering: Every nonfaulty replica processes requests in same order (logical clocks) • Outputs
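The "Outputs" step can be illustrated with a minimal sketch of how a client might combine replica outputs. The function names are hypothetical, and the replica counts (2t+1 for Byzantine replicas, t+1 for fail-stop replicas) are the usual ones for the state-machine approach rather than something this slide states; a halted replica is modelled as `None`.

```python
from collections import Counter


def combine_byzantine(outputs, t):
    """Majority vote over outputs from at least 2t+1 possibly Byzantine replicas."""
    if len(outputs) < 2 * t + 1:
        raise ValueError("need at least 2t+1 replicas")
    value, count = Counter(outputs).most_common(1)[0]
    # With at most t faulty replicas, the correct output appears at least t+1 times.
    if count < t + 1:
        raise RuntimeError("no value reported by t+1 replicas: more than t faults")
    return value


def combine_fail_stop(outputs, t):
    """With fail-stop replicas, any output that was produced at all is correct."""
    if len(outputs) < t + 1:
        raise ValueError("need at least t+1 replicas")
    live = [o for o in outputs if o is not None]  # None marks a halted replica
    if not live:
        raise RuntimeError("all replicas halted: more than t faults")
    return live[0]


print(combine_byzantine([7, 7, 9], t=1))   # one replica lies, the vote masks it
print(combine_fail_stop([None, 7], t=1))   # one replica halted, the other answers
```

The contrast is the point of the lecture's first question: fail-stop replicas need no voting, because a replica either answers correctly or visibly stops.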

  7. 2) Building a Fail-Stop Processor • Must provide stable storage • Volatile: Lost on failure • Stable • Not affected (lost or corrupted) by failure • Can be read by any processor • Benefit: Recover work of failed process • Drawback: Minimize interactions since slow • Can only build approximation of fail-stop processor • Finite hardware -> Finite failures could disable all error detection hardware • k-fail-stop processor: behaves fail-stop unless k+1 or more failures

  8. Implementation of k-FSP: Overview • Two components • k+1 p-processes (program) • 2k+1 s-processes (storage) • Each process runs on own processor, all connected with network • P-Processes (k+1) • Each runs program for state machine • Interacts with s-processes to read and write data • If any fail (if any disagreement), then all STOP • Cannot necessarily detect k+1 failures • S-Processes (2k+1) • Each replicates contents of stable storage for this FSP • Provides reliable data with k failures (cannot just stop) • Detects disagreements/failures across p-processes • How???

  9. Interactive Consistency Requirements • IC1. If a p-process is nonfaulty, then every nonfaulty s-process receives its request within δ seconds (as measured on the s-process clock) • IC2. Non-faulty s-processes in same k-FSP agree on every request from p-process j • S-processes must agree even when p-process is faulty • To provide IC1 and IC2: • Assuming can authenticate sender of messages, use signed message (SM) protocol for Byzantine agreement • Need just k+1 processes for agreement • IC3. For each k-FSP, clocks of all p-processes are synchronized • All non-faulty p-processes must send requests at same time to s-processes

  10. FSP Algorithm Details: Writes • Each p-process, on a write: • Broadcast write to all s-processes • Byzantine agreement across all s-processes (all s-processes must agree on same input value from particular p-process) • Each s-process, on a write (Fig 1): • Ensure each p-process writes same value and receive within time bound • Initial code: Handle messages after at least time δ has transpired since receipt (every s-process should receive by then) • If receive write request from all k+1 p-processes (|M| = k+1), then update value in stable storage • If not, then halt all p-processes • Set failed variable to true • Do not allow future writes
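The s-process side of a write can be sketched roughly as follows. This is an illustration, not the code from Fig 1 of the paper: the `SProcess` class and `handle_writes` method are hypothetical names, and the sketch assumes that Byzantine agreement and the δ wait have already happened, so `requests` holds the agreed request from each p-process that sent one in time.

```python
from dataclasses import dataclass, field


@dataclass
class SProcess:
    k: int                                      # this s-process belongs to a k-FSP
    store: dict = field(default_factory=dict)   # its copy of stable storage
    failed: bool = False                        # the "failed" variable p-processes read

    def handle_writes(self, requests):
        """requests maps p-process id -> (variable, value) received within delta.

        Returns True if the write was applied, False if the FSP is halted.
        """
        if self.failed:
            return False                        # no future writes once failed
        distinct = set(requests.values())
        if len(requests) == self.k + 1 and len(distinct) == 1:
            variable, value = next(iter(distinct))
            self.store[variable] = value        # all k+1 p-processes agree
            return True
        self.failed = True                      # missing or disagreeing request
        return False


s = SProcess(k=2)
print(s.handle_writes({0: ("b", 7), 1: ("b", 7), 2: ("b", 7)}))  # True
print(s.handle_writes({0: ("b", 8), 1: ("b", 8)}))               # False: one p-process silent
print(s.failed)                                                  # True
```

Because every nonfaulty s-process sees the same agreed requests, each one reaches the same decision independently, which is what lets a majority read of `failed` be trusted.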

  11. FSP Algorithm Details: Reads • Each p-process, on a read: • Broadcast request to all s-processes • Use result from majority (k+1 out of 2k+1) • Can read from other FSPs as well • Useful if FSP failed and re-balancing work • Each p-process, determine if halted/failed: • Read failed variable from s-process (use majority)
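A read by majority can be sketched like this. The function name `majority_read` is hypothetical, and a missing reply is modelled as `None`; the only rule taken from the slide is that a value needs k+1 matching responses out of 2k+1.

```python
from collections import Counter


def majority_read(responses, k):
    """Return the value reported by at least k+1 of the 2k+1 s-processes."""
    if len(responses) != 2 * k + 1:
        raise ValueError("expected one response slot per s-process (2k+1)")
    counts = Counter(r for r in responses if r is not None)  # ignore missing replies
    if counts:
        value, count = counts.most_common(1)[0]
        if count >= k + 1:
            return value
    raise RuntimeError("no majority: more than k s-processes have failed")


# k=2: two faulty s-processes claim a=5, the three correct ones still win.
print(majority_read([6, 6, 6, 5, 5], k=2))
```

With three faulty s-processes out of five, the faulty ones could form the majority themselves, which is why the storage side only tolerates k failures.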

  12. FSP Example • k=2, SM code: “b=a+1”; How many p and s processes? • [Diagram: p-processes 0-2 and s-processes 0-4; each s-process stores a: 6, b: (empty), failed: 0] • How do p-processes read a? • 1) Broadcast request to each s-process • 2) Each s-process responds to read request • 3) Each p-process uses majority of responses from s-processes

  13. FSP Example • k=2, SM code: “b=a+1” • [Diagram: p-processes 0-2 and s-processes 0-4; storage fields a, b, failed] • How do p-processes read a? • What if 2 s-processes fail? • E.g., think a=5? • What if 3 s-processes fail?

  14. FSP Example • k=2, SM code: “b=a+1” • [Diagram: p-processes 0-2 and s-processes 0-4; storage fields a, b, failed] • How do p-processes write b? • Each p-process j performs Byzantine agreement using signed message protocol SM(2) across s-processes • Each s-process must agree on what p-process j is doing, even if j is faulty • Each s-process looks at requests after time delta elapsed • If see same write from all k+1 processes, perform write • Otherwise, halt all p-processes; forbid future writes

  15. FSP Example • k=2, SM code: “b=a+1” • [Diagram: p-processes 0-2 and s-processes 0-4; storage fields a, b, failed] • How do p-processes write b? • What if 1 p-process (or network) is very slow? • What if 1 p-process gives incorrect request to all s-processes? • What if 1 p-process gives incorrect request to some? • Byzantine agreement catches: All s-processes agree that p-process is faulty (giving different requests); agree to treat it similarly • When s-processes see it doesn’t agree with other p-processes, they will halt all p-processes • What if 3 p-processes give bad result?

  16. 3) Higher-Level Example • Goal: Service handling k faults; N nodes for performance • Solution: Use N+k k-fail-stop processors • Example: N=2, k=3 • [Diagram: FSP0-FSP4 and SS0-SS4] • What happens if: • 3 p-processes in FSP0 fail? 4 p-processes in FSP0 fail? • 1 p-process in FSP0, FSP1, and FSP2 fail? also in FSP3? • 2 p-processes in FSP0, FSP1, and FSP2 fail? • 1 s-process in SS0 fails? also in SS1, SS2, and SS3? • 4 s-processes in SS0 fail?

  17. Should we use Fail Stop Processors? • Metric: Hardware cost for state machines: • Fail-stop components: • Worst-case (assuming 1 process per processor): • (N+k) * [2k+1 + k+1] = (N+k) * (3k+2) processors • Best-case (assuming s-processes from different FSP share same processor) • (N+k)(k+1) + (2k+1) processors • Byzantine components: • N * (2k+1) • Fail-stop can be better if s-processes share and N>k… • Metric: Frequency of byzantine agreement protocol • Fail-Stop: On every access to stable storage • Byzantine: On every input read • Probably fewer input reads
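The hardware-cost formulas above can be checked with a few lines of arithmetic. The function names are hypothetical; the formulas are the ones on the slide, and the two (N, k) pairs are just sample points (N=2, k=3 is the configuration from the higher-level example).

```python
def fail_stop_worst(n, k):
    """One process per processor: each of the N+k FSPs needs (2k+1) + (k+1) processors."""
    return (n + k) * (3 * k + 2)


def fail_stop_best(n, k):
    """s-processes of all FSPs share one set of 2k+1 processors."""
    return (n + k) * (k + 1) + (2 * k + 1)


def byzantine(n, k):
    """N nodes, each replicated 2k+1 times."""
    return n * (2 * k + 1)


for n, k in [(2, 3), (10, 1)]:
    print(n, k, fail_stop_worst(n, k), fail_stop_best(n, k), byzantine(n, k))
# prints:
# 2 3 55 27 14
# 10 1 55 25 30
```

For N=2, k=3 the Byzantine design is cheaper (14 processors against 27 at best), while for N=10, k=1 the shared-storage fail-stop design wins (25 against 30), matching the remark that fail-stop can be better when s-processes share and N>k.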

  18. Summary • Why build fail-stop components? • Easier for higher layers to model and deal with • Matches assumptions of many distributed protocols • Why not? • Usually more hardware • Usually more agreements needed • Higher-levels may be able to cope with “slightly faulty” components • Violates end-to-end argument • Conclusion: Probably shouldn’t assume fail-stop components

  19. Byzantine Werewolves • Previous: Too easy for villagers to identify werewolves • Villager A had reliable information that Z was werewolf • Villager B could validate that A was villager • Hard for Z to lie that C was werewolf, because D could have checked C too • Signed Protocol: Many could hear what one said • Difficult for werewolves to tell different lies to others • Have to tell everyone same thing • New Changes to give more advantage to werewolves: • Unknown number of werewolves (1 <= w < 1/2 N) • Night: Werewolves convert multiple villagers to wolves (1 <= v <= w) • Key: Info told by moderator will then be stale and wrong! • Day: Villagers can vote to lynch multiple victims

  20. Byzantine-Werewolf Game Rules • Everyone secretly assigned as werewolf or villager • W werewolves, rest are “seeing” villagers • I am moderator • Night round (changed order): • “Close your eyes”; make noises with one hand to hide activity • For all: “NAME, open your eyes” “Pick someone to ask about” • Useless for Werewolves, but hides their identity… • Point to another player • Moderator signs thumbs up for werewolf, down for villager • “NAME, close your eyes” • “Werewolves, open your eyes”: W can see who is who • “Werewolves, pick villagers to convert” • Moderator picks secret number between 1 and W • Silently agree on villagers by pointing • Moderator taps converts on shoulder; should open eyes to see other werewolves • “Werewolves, close your eyes”

  21. Rules: Day Time • Day Time: “Everyone open your eyes; it's daytime” • Agreement time: Everyone talks and votes on who should be “decommissioned” • Villagers try to decommission werewolves • Werewolves try to trick villagers with bad info • Someone must propose who should be killed • Vote until kill villager or no more proposals or no majority • Werewolves really spread at night, so large incentive to kill as many as possible now • Moderator: Uses majority voting to determine who is decommissioned “Okay, NAME is dead” • Person is out of game (can’t talk anymore) and shows card • Repeat cycle until all werewolves dead OR werewolves >= villagers
