CSC 536 Lecture 8

Outline
  • Fault tolerance

    • Reliable client-server communication

    • Reliable group communication

    • Distributed commit

    • Case study

      • Google infrastructure

Reliable client-server communication

Process-to-process communication

  • Reliable process-to-process communication is achieved using the Transmission Control Protocol (TCP)

    • TCP masks omission failures using acknowledgments and retransmissions

    • Completely hidden from the client and server

    • Network crash failures are not masked

RPC/RMI Semantics in the Presence of Failures

  • Five different classes of failures that can occur in RPC/RMI systems:

    • The client is unable to locate the server.

    • The request message from the client to the server is lost.

    • The server crashes after receiving a request.

    • The reply message from the server to the client is lost.

    • The client crashes after sending a request.

RPC/RMI Semantics in the Presence of Failures

  • Five different classes of failures that can occur in RPC/RMI systems:

    • The client is unable to locate the server.

      • Throw exception

  • The request message from the client to the server is lost.

    • Resend request

  • The server crashes after receiving a request.

  • The reply message from the server to the client is lost.

    • Assign each request a unique id and have the server keep track of request ids

  • The client crashes after sending a request.

    • What to do with orphaned RPC/RMIs?
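The lost-reply case above suggests filtering duplicates by request id. A minimal sketch, assuming a hypothetical in-memory server (the names `DedupServer` and `handle` are ours, not from the slides): the server caches the reply for each request id, so a retransmitted request returns the cached reply instead of re-executing the operation.

```python
# Sketch: reply caching keyed by request id, so a retransmission
# does not re-execute the operation (names are illustrative).

class DedupServer:
    def __init__(self):
        self.replies = {}   # request id -> cached reply
        self.counter = 0    # server state mutated by operations

    def handle(self, request_id, operation):
        if request_id in self.replies:          # duplicate: resend old reply
            return self.replies[request_id]
        self.counter = operation(self.counter)  # executed at most once per id
        self.replies[request_id] = self.counter
        return self.counter

server = DedupServer()
inc = lambda x: x + 1
assert server.handle("req-1", inc) == 1
assert server.handle("req-1", inc) == 1   # retransmission: not re-executed
assert server.handle("req-2", inc) == 2
```

The cache must survive crashes (e.g. be logged) for the guarantee to hold across server restarts, which the slides cover next.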

  • Server Crashes

    • A server in client-server communication: (a) the normal case, (b) crash after execution, (c) crash before execution

    What should the client do?

    • Try again: at least once semantics

    • Report back failure: at most once semantics

    • Want: exactly once semantics

    • Impossible to do in general

      • Example: Server is a print server

    Print server crash

    • Three events that can happen at the print server:

      • Send the completion message (M),

      • Print the text (P),

      • Crash (C).

    • Note: M could be sent by the server just before it sends the file to be printed to the printer or just after

    Server Crashes

    • These events can occur in six different orderings:

      • M →P →C: A crash occurs after sending the completion message and printing the text.

      • M →C (→P): A crash happens after sending the completion message, but before the text could be printed.

      • P →M →C: A crash occurs after printing the text and sending the completion message.

      • P →C (→M): The text is printed, after which a crash occurs before the completion message could be sent.

      • C (→P →M): A crash happens before the server could do anything.

      • C (→M →P): A crash happens before the server could do anything.

    Client strategies

    • If the server crashes and subsequently recovers, it will announce to all clients that it is running again

    • The client does not know whether its request to print some text has been carried out

    • Strategies for the client:

      • Never reissue a request

      • Always reissue a request

      • Reissue a request only if client did not receive a completion message

      • Reissue a request only if client did receive a completion message

    Server Crashes

    • Different combinations of client and server strategies in the presence of server crashes.

    • Note that exactly once semantics is not achievable under any client/server strategy.
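The missing combinations table can be recomputed by brute force. This is a sketch (our own encoding, not the book's table): for each of the orderings in which events actually happen before the crash, and each client strategy, count how many times the text ends up printed after the client's reaction.

```python
# Enumerate server orderings x client strategies for the print-server example.
# M = completion message sent, P = text printed, C = crash; a set holds the
# events that happened before the crash.

orderings = [
    ("M,P,C", {"M", "P"}),
    ("M,C",   {"M"}),
    ("P,M,C", {"P", "M"}),
    ("P,C",   {"P"}),
    ("C",     set()),
]

def prints(happened, strategy):
    count = 1 if "P" in happened else 0
    got_ack = "M" in happened
    reissue = {"never": False, "always": True,
               "only if no ack": not got_ack,
               "only if ack": got_ack}[strategy]
    if reissue:
        count += 1   # a reissued request prints again after recovery
    return count

strategies = ["never", "always", "only if no ack", "only if ack"]
for s in strategies:
    print(s, [prints(h, s) for _, h in orderings])
```

Every strategy yields 0 prints or 2 prints under some ordering, confirming the slide's claim that exactly-once semantics is unachievable.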

    Akka client-server communication

    • At most once semantics

    • The developer is left with the job of implementing any additional guarantees required by the application

    Reliable group communication

    Reliable group communication

    • Process replication helps in fault tolerance but gives rise to a new problem:

      • How to construct a reliable multicast service, one that provides a guarantee that all processes in a group receive a message?

  • A simple solution that does not scale:

    • Use multiple reliable point-to-point channels

  • Other problems that we will consider later:

    • Process failures

    • Processes join and leave groups

      We assume that unreliable multicasting is available

    • We assume processes are reliable, for now

  • Implementing reliable multicasting on top of unreliable multicasting

    • Solution attempt 1:

      • (Unreliably) multicast message to process group

      • A process acknowledges receipt with an ack message

      • Resend message if no ack received from one or more processes

  • Problem:

    • Sender needs to process all the acks

    • Solution does not scale

  • Implementing reliable multicasting on top of unreliable multicasting

    • Solution attempt 2:

      • (Unreliably) multicast numbered message to process group

      • A receiving process replies with a feedback message only to inform that it is missing a message

      • Resend missing message to process

  • Problems with solution attempt:

    • Sender must keep a log of all messages it multicast, forever, and

    • The number of feedback messages may still be huge
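Solution attempt 2 can be sketched as a sequence-numbered multicast with negative acknowledgments. This is a toy, single-threaded illustration (class names are ours): the sender logs every message; a receiver that sees a gap in sequence numbers pulls the missing messages instead of acking every one.

```python
# NACK-based reliable multicast sketch: receivers only send feedback
# (here, a synchronous retransmit request) when they detect a gap.

class Sender:
    def __init__(self):
        self.log = []                  # history of sent messages (GC is the hard part)
    def multicast(self, msg):
        self.log.append(msg)
        return len(self.log) - 1, msg  # (seqno, payload)
    def retransmit(self, seqno):
        return seqno, self.log[seqno]

class Receiver:
    def __init__(self):
        self.delivered = []
        self.next_expected = 0
    def receive(self, seqno, msg, sender):
        # gap detected: fetch every missing message before this one
        while self.next_expected < seqno:
            _, missing = sender.retransmit(self.next_expected)
            self.delivered.append(missing)
            self.next_expected += 1
        if seqno == self.next_expected:
            self.delivered.append(msg)
            self.next_expected += 1

s, r = Sender(), Receiver()
m0 = s.multicast("a")
m1 = s.multicast("b")   # suppose this one is lost in the network
m2 = s.multicast("c")
r.receive(*m0, s)
r.receive(*m2, s)        # gap seen -> "b" recovered by retransmission
assert r.delivered == ["a", "b", "c"]
```

The unbounded `log` makes the slide's two problems concrete: the sender cannot know when it is safe to garbage-collect, and many receivers may miss the same message and all request it.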

  • Implementing reliable multicasting on top of unreliable multicasting

    • We look at two more solutions

      • The key issue is the reduction of feedback messages!

      • Also care about garbage collection

    Nonhierarchical Feedback Control

    • Feedback suppression

      Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

    Hierarchical Feedback Control

    • Each local coordinator forwards the message to its children

    • A local coordinator handles retransmission requests

    • How to construct the tree dynamically?

    Hierarchical Feedback Control

    • Each local coordinator forwards the message to its children.

    • A local coordinator handles retransmission requests.

    • How to construct the tree dynamically?

      • Use the multicast tree in the underlying network

    Atomic Multicast

    • We assume now that processes could fail, while multicast communication is reliable

    • We focus on atomic (reliable) multicasting in the following sense:

      • A multicast protocol is atomic if every message multicast to group view G is delivered to each non-faulty process in G

        • if the sender crashes during the multicast, the message is delivered to all non-faulty processes or none

        • Group view: the sender's view of the set of processes in the group at the time message M was multicast

        • Atomic = all or nothing

    Virtual Synchrony

    • The principle of virtual synchronous multicast.

    The model for atomic multicasting

    • The logical organization of a distributed system to distinguish between message receipt and message delivery

    Implementing atomic multicasting

    • How to implement atomic multicasting using reliable multicasting

      • Reliable multicasting could be simply sending a separate message to each member using TCP, or using one of the methods described in slides 43-46.

      • Note 1: Sender could fail before sending all messages

      • Note 2: A message is delivered to the application only when all non-faulty processes have received it (at which point the message is referred to as stable)

      • Note 3: A message is therefore buffered at a local process until it can be delivered

    Implementing atomic multicasting

    • On initialization, for every process:

      • Received = {}

        To A-multicast message m to group G:

      • R-multicast message m to group G

        When process q receives an R-multicast message m:

      • if message m not in Received set:

        • Add message m to Received set

        • R-multicast message m to group G

        • A-deliver message m
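The algorithm above can be sketched directly in code. This is a failure-free, in-memory illustration (our names): each process re-multicasts a message the first time it receives it, so even if the original sender crashes after reaching only one process, that process spreads the message to the rest — all or nothing.

```python
# Sketch of the Received-set atomic multicast algorithm from the slide.

class Process:
    def __init__(self, pid, group):
        self.pid, self.group = pid, group
        self.received, self.delivered = set(), []

    def r_receive(self, m):
        if m not in self.received:        # "if message m not in Received set"
            self.received.add(m)
            for q in self.group:          # R-multicast m to group G
                if q is not self:
                    q.r_receive(m)
            self.delivered.append(m)      # A-deliver m

group = []
procs = [Process(i, group) for i in range(4)]
group.extend(procs)

# Simulate a sender crashing after its multicast reached only process 0:
# process 0's re-multicast still gets m1 to everyone.
procs[0].r_receive("m1")
assert all(p.delivered == ["m1"] for p in procs)
```

The cost is obvious here: every message is re-multicast by every member, which is why the real protocol pairs this with the stability/buffering rules in the notes above.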

    Implementing atomic multicasting

    We must especially guarantee that all messages sent to view G are delivered to all non-faulty processes in G before the next group membership change takes place

    Implementing Virtual Synchrony

    • Process 4 notices that process 7 has crashed, sends a view change

    • Process 6 sends out all its unstable messages, followed by a flush message

    • Process 6 installs the new view when it has received a flush message from everyone else

    Distributed Commit

    Distributed Commit

    • The problem is to ensure that an operation is performed by all processes of a group, or none at all

    • Safety property:

      • The same decision is made by all non-faulty processes

      • Done in a durable (persistent) way so faulty processes can learn committed value when they recover

  • Liveness property:

    • Eventually a decision is made

  • Examples:

    • Atomic multicasting

    • Distributed transactions

  • The commit problem

    • An initiating process communicates with a group of actors, who vote

      • Initiator is often a group member, too

      • Ideally, if all vote to commit we perform the action

      • If any member votes to abort, none performs the action

  • Assume synchronous model

  • To handle asynchronous model, introduce timeouts

    • If timeout occurs the leader can presume that a member wants to abort. Called the presumed abort assumption.

  • Two-Phase Commit Protocol

    • Time-line: in Phase 1 the 2PC initiator solicits votes from all members and all vote “commit”; in Phase 2 the initiator announces the commit decision

    Fault tolerance

    • If no failures, protocol works

    • However, any member can abort any time, even before the protocol runs

    • If failure, we can separate this into three cases

      • Group member fails; initiator remains healthy

      • Initiator fails; group members remain healthy

      • Both initiator and group member fail

  • Further separation

    • Handling recovery of a failed member

  • Fault tolerance

    • Some cases are pretty easy

      • E.g. if a member fails before voting we just treat it as an abort

      • If a member fails after voting commit, we assume that when it recovers it will finish up the commit protocol and perform whatever action was decided

        • All members keep a log of their actions in persistent memory.

  • Hard cases involve crash of initiator

  • Two-Phase Commit

    • The finite state machine for the coordinator in 2PC.

    • The finite state machine for a participant.

    Two-Phase Commit

    • If failure(s), note that coordinator and participants block in some states: INIT, WAIT, READY

    • Use timeouts

    • If participant blocked in INIT state:

      • Abort

  • If coordinator blocks in WAIT:

    • Abort and broadcast GLOBAL_ABORT

  • If participant P blocked in READY:

    • Wait for coordinator or contact another participant Q

  • Two-Phase Commit

    State of Q    Action by P

    COMMIT        Make transition to COMMIT
    ABORT         Make transition to ABORT
    INIT          Make transition to ABORT
    READY         Contact another participant

    • Actions taken by a participant P when residing in state READY and having contacted another participant Q.

    • If all other participants are READY, P must block!

    As a time-line picture

    • Phase 1: the 2PC initiator collects votes and all vote “commit”; Phase 2: the initiator announces the decision

    Why do we get stuck?

    • If process q voted “commit”, the coordinator may have committed the protocol

      • And q may have learned the outcome

      • Perhaps it transferred $10M from a bank account…

      • So we want to be consistent with that

  • If q voted “abort”, the protocol must abort

    • And in this case we can’t risk committing

      Important: In this situation we must block; we cannot restart the transaction, say.

  • Two-Phase Commit

    actions by coordinator:

    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }

    • Outline of the steps taken by the coordinator in a two-phase commit protocol

    Two-Phase Commit

    actions by participant:

    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;   /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }

    • Steps taken by a participant process in 2PC.

    Two-Phase Commit

    actions for handling decision requests: /* executed by separate thread */

    while true {
        wait until any incoming DECISION_REQUEST is received;   /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;   /* participant remains blocked */
    }

    • Steps taken for handling incoming decision requests.
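The failure-free core of the coordinator and participant steps can be condensed into a toy simulation. This is a sketch under simplifying assumptions (no timeouts, no crashes; names are ours): every process appends its protocol records to a list standing in for the persistent log.

```python
# One synchronous round of 2PC with a per-process "log". A single abort
# vote forces GLOBAL_ABORT; unanimous commit votes (including the
# coordinator's own) yield GLOBAL_COMMIT.

def two_phase_commit(votes, coordinator_vote="COMMIT"):
    logs = {i: ["INIT"] for i in range(len(votes))}
    coord_log = ["START_2PC"]          # VOTE_REQUEST multicast implied
    # Phase 1: collect votes
    for i, v in enumerate(votes):
        logs[i].append("VOTE_" + v)
    # Phase 2: decide and broadcast the decision
    if all(v == "COMMIT" for v in votes) and coordinator_vote == "COMMIT":
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    coord_log.append(decision)
    for i in logs:
        logs[i].append(decision)
    return decision, coord_log, logs

assert two_phase_commit(["COMMIT", "COMMIT"])[0] == "GLOBAL_COMMIT"
assert two_phase_commit(["COMMIT", "ABORT"])[0] == "GLOBAL_ABORT"
```

All the protocol's difficulty lives in what this sketch omits: the timeout branches and the blocked READY state handled in the pseudocode above.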

    Three-phase commit protocol

    • Unlike the Two-Phase Commit Protocol, it is non-blocking

    • Idea is to add an extra PRECOMMIT (“prepared to commit”) stage

    3-phase commit

    • Time-line: Phase 1: the 3PC initiator collects votes and all vote “commit”; Phase 2: the initiator sends “prepare to commit” and all say “ok”; Phase 3: the initiator tells the members to commit, and they commit

    Three-Phase Commit

    • Finite state machine for the coordinator in 3PC

    • Finite state machine for a participant

    Why 3 phase commit?

    • A process can deduce the outcome when this protocol is used

    • Main insight?

      • Nobody can enter the commit state unless all are first in the prepared state

      • Makes it possible to determine the state, then push the protocol forward (or back)

    Overview of Google’s distributed systems

    Original Google search engine architecture

    More than just a search engine

    Organization of Google’s physical infrastructure

    • 40-80 PCs per rack (terabytes of disk space each)

    • 30+ racks per cluster

    • clusters spread across data centers

    System architecture requirements

    • Scalability, reliability, high performance

    • Openness (at the beginning, at least)

    Overall Google systems architecture

    Google infrastructure

    Design philosophy

    • Simplicity

      • Software should do one thing and do it well

  • Provable performance

    • “every millisecond counts”

    • Estimate performance costs (accessing memory and disk, sending packet over network, locking and unlocking a mutex, etc.)

  • Testing

    • “If it ain’t broke, you’re not trying hard enough”

    • Stringent testing

  • Data and coordination services

    • Google File System (GFS)

      • Broadly similar to NFS and AFS

      • Optimized for the type of files and data access used by Google

  • BigTable

    • A distributed database that stores (semi-)structured data

    • Just enough organization and structure for the type of data Google uses

  • Chubby

    • a locking service (and more) for GFS and BigTable

  • GFS requirements

    • Must run reliably on the physical platform

      • Must tolerate failures of individual components

        • So application-level services can rely on the file system

  • Optimized for Google’s usage patterns

    • Huge files (100+MB, up to 1GB)

    • Relatively small number of files

    • Accesses dominated by sequential reads and appends

    • Appends done concurrently

  • Meets the requirements of the whole Google infrastructure

    • scalable, reliable, high performance, openness

    • Important: throughput has higher priority than latency

  • GFS architecture

    Files are stored in 64MB chunks in a cluster with

    • a master node (operations log replicated on remote machines)

    • hundreds of chunk servers

    • Chunks replicated 3 times

    Reading and writing

    • When the client wants to access a particular offset in a file

      • The GFS client translates this to a (file name, chunk index) pair

      • and then sends this to the master

        When the master receives the (file name, chunk index) pair

      • It replies with the chunk identifier and replica locations

        The client then accesses the closest chunk replica directly

        No client-side caching

      • Caching would not help in the type of (streaming) access GFS has
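The read path above amounts to integer arithmetic plus one metadata lookup. A minimal sketch, where the `master` dictionary and all file/chunkserver names are hypothetical stand-ins for the master's metadata:

```python
# How a byte offset maps to a GFS chunk index, and the lookup round-trip
# with the master before the client reads from a chunkserver directly.

CHUNK_SIZE = 64 * 2**20   # 64 MB chunks, per the slides

def chunk_index(offset):
    return offset // CHUNK_SIZE

# hypothetical master metadata: (file, chunk index) -> (chunk id, replicas)
master = {("/data/crawl", 0): ("chunk-aa", ["cs1", "cs2", "cs3"]),
          ("/data/crawl", 1): ("chunk-ab", ["cs2", "cs4", "cs5"])}

def lookup(file, offset):
    idx = chunk_index(offset)
    chunk_id, replicas = master[(file, idx)]
    return chunk_id, replicas   # client then reads from the closest replica

assert chunk_index(100 * 2**20) == 1            # byte 100MB falls in chunk 1
assert lookup("/data/crawl", 0)[0] == "chunk-aa"
```

Because only this small (file name, chunk index) → locations mapping goes through the master, the master stays off the data path entirely.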

    Keeping chunk replicas consistent

    Keeping chunk replicas consistent

    • When the master receives a mutation request from a client

      • the master grants a chunk replica a lease (replica is primary)

      • returns identity of primary and other replicas to client

        The client sends the mutation directly to all the replicas

      • Replicas cache the mutation and acknowledge receipt

        The client sends a write request to primary

      • Primary orders mutations and updates accordingly

      • Primary then requests that other replicas do the mutations in the same order

      • When all the replicas have acknowledged success, the primary reports an ack to the client

        What consistency model does this seem to implement?

    GFS (non-)guarantees

    • Writes (at a file offset) are not atomic

      • Concurrent writes to the same location may corrupt replicated chunks

      • If any replica is left inconsistent, the write fails (and is retried a few times)

        Appends are executed atomically “at least once”

      • Offset is chosen by primary

      • May end up with non-identical replicated chunks with some having duplicate appends

      • GFS does not guarantee that the replicas are identical

      • It only guarantees that some file regions are consistent across replicas

        When needed, GFS needs an external locking service (Chubby)

      • As well as a leader election service (also Chubby) to select the primary replica


    • GFS provides raw data storage

    • Also needed:

      • Storage for structured data ...

      • ... optimized to handle the needs of Google’s apps ...

      • ... that is reliable, scalable, high-performance, open, etc

    Examples of structured data

    • URLs:

      • Content, crawl metadata, links, anchors, pagerank, ...

    • Per-user data:

      • User preference settings, recent queries/search results, …

        Geographic locations:

      • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …

    Commercial DB

    • Why not use commercial database?

      • Not scalable enough

      • Too expensive

      • Full-featured relational database not required

      • Low-level optimizations may be needed

    Bigtable table

    • Implementation: Distributed multi-dimensional sparse map

      (row, column, timestamp) → cell contents


    • Each row has a key

      • A string up to 64KB in size

      • Access to data in a row is atomic

  • Rows ordered lexicographically

    • Rows close together lexicographically reside on one or close machines (locality)
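The locality point is easy to demonstrate. A sketch with made-up keys: Bigtable-style row keys use reversed domain names, so plain lexicographic sorting groups pages of the same site together, and those rows land in the same or adjacent tablets.

```python
# Why reversed-domain row keys give locality: lexicographic order
# clusters all com.cnn.* rows next to each other.

urls = ["com.cnn.www/index", "com.cnn.www/sport",
        "org.apache.www/hadoop", "com.cnn.money/markets"]
for key in sorted(urls):
    print(key)
# all com.cnn.* keys sort together -> same or neighboring tablets
assert sorted(urls)[:2] == ["com.cnn.money/markets", "com.cnn.www/index"]
```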

  • Columns

    • (Example anchor-column cell values from the webtable: “CNN Sports”, “CNN world”)

    • Columns have two-level name structure:

      • family:qualifier

  • Column family

    • logical grouping of data

    • groups unbounded number of columns (named with qualifiers)

    • may have a single column with no qualifier

  • Timestamps

    • Used to store different versions of data in a cell

      • default to current time

      • can also be set explicitly by the client

  • Garbage Collection

    • Per-column-family GC settings

      • “Only retain most recent K values in a cell”

      • “Keep values until they are older than K seconds”

      • ...

  • API

    • Create / delete tables and column families

      Table *T = OpenOrDie("/bigtable/web/webtable");

      RowMutation r1(T, "com.cnn.www");

      r1.Set("anchor:com.cnn.www/sport", "CNN Sports");

      Operation op;

      Apply(&op, &r1);

    Bigtable architecture

    An instance of BigTable is a cluster that stores tables

    • library on client side

    • master server

    • tablet servers


    • A table is decomposed into tablets

      • Tablet holds contiguous range of rows

      • 100MB - 200MB of data per tablet

      • Tablet server responsible for ~100 tablets

        Each tablet is represented by

        • A set of files stored in GFS

          • The files use the SSTable format, a mapping of (string) keys to (string) values

        • Log files

    Tablet Server

    • Master assigns tablets to tablet servers

    • Tablet server

      • Handles reads / writes requests to tablets from clients

  • No data goes through master

    • Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table

    • The metadata table contains metadata about actual tablets

      • including location information of associated SSTables and log files

  • Master

    Upon startup, must grab master lock to ensure it is the single master of a set of tablet servers

    • provided by locking service (Chubby)

      Monitors tablet servers

    • periodically scans directory of tablet servers provided by naming service (Chubby)

    • keeps track of tablets assigned to its tablet servers

      • obtains a lock on the tablet server from locking service (Chubby)

        • lock is the communication mechanism between master and tablet server

          Assigns unassigned tablets in the cluster to tablet servers it monitors

  • and moves tablets around to achieve load balancing

    Garbage collects underlying files stored in GFS

  • The storage architecture in Bigtable

    Each SSTable is an ordered and immutable mapping of keys to values

    Tablet Serving

    • Writes committed to log

      • Memtable: ordered log of recent commits (in memory)

      • SSTables really store a snapshot

        When Memtable gets too big

      • Create new empty Memtable

      • Merge old Memtable with SSTables and write to GFS
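The write and read paths above can be sketched together. This is a simplified model (our names; real Bigtable merges on read and compacts in the background): writes go to a commit log and an in-memory memtable; when the memtable is full it is frozen, sorted, and written out as an immutable SSTable; reads consult the memtable first, then SSTables newest-first.

```python
# Tablet-server write path: commit log + memtable, flushed to SSTables.

class TabletServer:
    def __init__(self, memtable_limit=3):
        self.log = []          # commit log (would live in GFS)
        self.memtable = {}
        self.sstables = []     # immutable sorted snapshots (would live in GFS)
        self.limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))   # committed to log first
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # minor compaction: freeze memtable, write it out as an SSTable
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):   # newest SSTable first
            if key in sst:
                return sst[key]
        raise KeyError(key)

ts = TabletServer()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    ts.write(k, v)
assert ts.read("a") == 9   # newest value wins (memtable over SSTable)
assert ts.read("b") == 2   # served from the flushed SSTable
```

The commit log is what makes the in-memory memtable safe: after a crash, replaying the log rebuilds it.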


    • Operations

      • Look up value for key

      • Iterate over all key/value pairs in specified range

  • Relies on lock service (Chubby)

    • Ensure there is at most one active master

    • Administer tablet server death

    • Store column family information

    • Store access control lists

  • Chubby

    • Chubby provides to the infrastructure

      • a locking service

      • a file system for reliable storage of small files

      • a leader election service (e.g. to select a primary replica)

      • a name service

        Seemingly violates “simplicity” design philosophy but...

        ... Chubby really provides a distributed agreement service

    Chubby API

    Overall architecture of Chubby

    Cell: a single instance of the Chubby system

    • 5 replicas

    • 1 master replica

      Each replica maintains a database

    • of directories and files/locks

  • Consistency achieved using Lamport’s Paxos consensus protocol that uses an operation log

  • Chubby internally supports snapshots to periodically GC the operation log

  • Paxos distributed consensus algorithm

    • A distributed consensus protocol for asynchronous systems

    • Used by servers managing replicas in order to reach agreement on update when

      • messages may be lost, re-ordered, duplicated

      • servers may operate at arbitrary speed and fail

      • servers have access to stable persistent storage

        Fact: Consensus not always possible in asynchronous systems

      • Paxos works by ensuring safety (correctness) not liveness (termination)

    Paxos algorithm - step 1

    Paxos algorithm - step 2
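The two steps on the slides above are Paxos's prepare/promise phase and accept/accepted phase. A minimal single-decree sketch, failure-free and in-memory (not Chubby's implementation; names are ours): acceptors promise to ignore lower-numbered proposals, and a proposer must adopt any value already accepted by a majority.

```python
# Single-decree Paxos: Phase 1 (prepare/promise), Phase 2 (accept/accepted).

class Acceptor:
    def __init__(self):
        self.promised = -1
        self.accepted = None           # (n, value) or None

    def prepare(self, n):              # Phase 1b: promise
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, v):            # Phase 2b: accepted
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, v)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1a: send prepare(n); need promises from a majority
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # safety: adopt the value of the highest-numbered accepted proposal
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a: send accept(n, value); chosen once a majority accepts
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
assert propose(acceptors, n=1, value="A") == "A"
# a later proposer must learn the chosen value, not override it
assert propose(acceptors, n=2, value="B") == "A"
```

The second assertion is the heart of the safety argument: once a value is chosen by a majority, every later proposal can only re-propose that same value.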

    The Big Picture

    • Customized solutions for Google-type problems

    • GFS: Stores data reliably

      • Just raw files

  • BigTable: provides key/value map

    • Database-like, but doesn’t provide everything we need

  • Chubby: locking mechanism

  • Common Principles

    • One master, multiple workers

      • MapReduce: master coordinates work amongst map / reduce workers

      • Bigtable: master knows about location of tablet servers

      • GFS: master coordinates data across chunkservers
