Failure detectors
This presentation is the property of its rightful owner.
Sponsored Links
1 / 22

Failure Detectors PowerPoint PPT Presentation


  • 91 Views
  • Uploaded on
  • Presentation posted in: General

Failure Detectors. Slides by I. Gupta, modified by N. Vaidya. Two Different System Models. S ynchronous Distributed System Each message is received within bounded time Each step in a process takes lb < time < ub Each local clock’s drift has a known bound Asynchronous Distributed System

Download Presentation

Failure Detectors

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Failure detectors

Failure Detectors

Slides by I. Gupta, modified by N. Vaidya


Two different system models

Two Different System Models

  • Synchronous Distributed System

    • Each message is received within bounded time

    • Each step in a process takes lb < time < ub

    • Each local clock’s drift has a known bound

  • Asynchronous Distributed System

    • No bounds on process execution

    • No bounds on message transmission delays

    • The drift rate of a clock is arbitrary

      Internet is an asynchronous distributed system


Failure model

Failure Model

  • Process omission failure

    • Crash-stop (fail-stop) – a process halts and does not execute any further operations

    • Crash-recovery – a process does not execute operations for a while

  • Crash-stop failures can be detected in synchronous systems

  • Next: detecting crash-stop failures in asynchronous systems


What s a failure detector

What’s a failure detector?

pi

pj


What s a failure detector1

What’s a failure detector?

Crash-stop failure

pi

pj

X


What s a failure detector2

What’s a failure detector?

needs to know about pj’s failure

Crash-stop failure

pi

pj

X


I heartbeating protocol

I. Heartbeating Protocol

needs to know about pj’s failure

heartbeat

pi

pj

- pj maintains a sequence number

- pj sends pi a heartbeat with incremented

seq. number after every T’(=T) time units

  • if pi has not received a heartbeat for the

  • past T time units, pi declares pj as failed


Ii ping ack protocol

II. Ping-Ack Protocol

needs to know about pj’s failure

(within 2T ms)

ping

pi

pj

ack

- pi queries pj once every T time units

- if pj does not respond within T time units,

pi marks pj as failed

- pj replies


Failure detector properties

Failure Detector Properties

  • Completeness = every process failure is eventually detected

  • Accuracy = every detected failure corresponds to a crashed process

  • Given a failure detector that satisfies both Completeness and Accuracy

    • Consensus is achievable (why?)

    • FLP => one cannot design a failure detector (for an asynchonrous system) that guarantees both above properties


Completeness or accuracy

Completeness or Accuracy?

  • Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% Completeness

  • Heartbeating – satisfies completeness but not accuracy (why?)

  • Ping-Ack – satisfies completeness but not accuracy (why?)

  • Plenty of distributed apps designed assuming 100% completeness, e.g., p2p systems


Failure detection in a distributed system

Failure Detection in a Distributed System

  • Difference from original failure detection is

    • we want all processes to know about failure

  •  May need combine failure detection with a dissemination protocol

    • What’s an example of a dissemination protocol?


Centralized heartbeating

Centralized Heartbeating

pj

pj, Heartbeat Seq. l++

pi

Needs a separate dissemination component


Ring heartbeating

Ring Heartbeating

pj

pj, Heartbeat Seq. l++

pi

Needs a separate dissemination component


All to all heartbeating

All-to-All Heartbeating

pj

pj, Heartbeat Seq. l++

pi

Does not need a separate dissemination component


Failure detector metrics

Failure Detector Metrics

  • Measuring Speed: Detection Time

    • Time between a process crash and its detection

    • Determines speed of failure detector

  • Measuring Accuracy: depends on distributed application

    App1: Failure detection => unavailability, e.g., read-only replicated database with no updates

    App2: Failure detection => exclusion from group, e.g., replicated database with updates


Accuracy metrics for app1

Accuracy metrics for App1

  • App1: Failure detection => unavailability, e.g., read-only replicated database with no updates

  • Tmr: Mistake recurrence time

    • Time between two consecutive mistakes

  • Tm: Mistake duration time

    • Length of time for which correct process is marked as failed

pj

up

FD for pj at pi

Tm

Tmr

pj is down

pj is up


Accuracy metrics for app2

Accuracy metrics for App2

  • App2: Failure detection => exclusion from group, e.g., replicated database with updates

    Possible metrics:

  • Number of false failure detections per time unit

  • Fraction of failure detections that are false


Processes and channels

Processes and Channels


Other failure types

Other Failure Types

  • Communication omission failures

    • Send-omission: loss of messages between the sending process and the outgoing message buffer

    • Channel omission: loss of message in the communication channel.

    • Receive-omission: loss of messages between the incoming message buffer and the receiving process


Other failure types1

Other Failure Types

Arbitrary failures

  • Arbitrary process failure: arbitrarily omits intended processing steps or takes unintended processing steps.

  • Arbitrary channel failures: messages may be corrupted, duplicated, delivered out of order, incurs extremely large delays; or non-existent messages may be delivered.

  • Above are Byzantine failures


  • Omission and arbitrary failures

    Class of failure

    Affects

    Description

    Fail-stop

    Process

    Process halts and remains halted. Other processes may

    detect this state.

    Crash

    Process

    Process halts and remains halted. Other processes may

    not be able to detect this state.

    Omission

    Channel

    A message inserted in an outgoing message buffer never

    arrives at the other end’s incoming message buffer.

    Send-omission

    Process

    A process completes a

    send,

    but the message is not put

    in its outgoing message buffer.

    Receive-omission

    Process

    A message is put in a process’s incoming message

    buffer, but that process does not receive it.

    Arbitrary

    Process or

    Process/channel exhibits arbitrary behaviour: it may

    (Byzantine)

    channel

    send/transmit arbitrary messages at arbitrary times,

    commit omissions; a process may stop or take an

    incorrect step.

    Omission and Arbitrary Failures


    Summary

    Summary

    • Failure detectors required in distributed systems to maintain liveness in spite of process crashes

    • Properties – completeness & accuracy, together unachievable

    • Most apps require 100% completeness, but can tolerate inaccuracy

    • Heartbeating and Ping

    • Distributed FD through heartbeating: Centralized, Ring, All-to-all

    • Accuracy metrics

    • Other Types of Failures


  • Login