CSC 536 Lecture 8


  • Fault tolerance

    • Reliable client-server communication

    • Reliable group communication

    • Distributed commit

  • Case study

    • Google infrastructure

Process-to-process communication

  • Reliable process-to-process communication is achieved using the Transmission Control Protocol (TCP)

    • TCP masks omission failures using acknowledgments and retransmissions

    • Completely hidden from the client and server

    • Network crash failures are not masked

RPC/RMI Semantics in the Presence of Failures

  • Five different classes of failures that can occur in RPC/RMI systems:

    • The client is unable to locate the server.

      • Throw exception

    • The request message from the client to the server is lost.

      • Resend request

    • The server crashes after receiving a request.

    • The reply message from the server to the client is lost.

      • Assign each request a unique id and have the server keep track of request ids

    • The client crashes after sending a request.

      • What to do with orphaned RPC/RMIs?
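The lost-reply case above (resend the request, have the server track request ids) can be sketched with a server that caches replies and deduplicates retransmissions; all names here are illustrative, not a real RPC library:

```python
# Sketch: server-side deduplication of retransmitted requests.
# All names (handle_request, reply_cache) are illustrative.

reply_cache = {}  # request id -> previously computed reply

def handle_request(request_id, operation):
    # If this request was already executed, return the cached reply
    # instead of re-executing (avoids duplicate side effects).
    if request_id in reply_cache:
        return reply_cache[request_id]
    result = operation()          # execute at most once
    reply_cache[request_id] = result
    return result

calls = []
op = lambda: calls.append("printed") or "done"
assert handle_request(1, op) == "done"
assert handle_request(1, op) == "done"   # retransmission: served from cache
assert calls == ["printed"]              # side effect happened only once
```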

    Server Crashes

    • A server in client-server communication

    • (a) The normal case. (b) Crash after execution. (c) Crash before execution.

    What should the client do?

    • Try again: at least once semantics

    • Report back failure: at most once semantics

    • Want: exactly once semantics

    • Impossible to do in general

      • Example: Server is a print server

    Print server crash

    • Three events that can happen at the print server:

      • Send the completion message (M),

      • Print the text (P),

      • Crash (C).

    • Note: M could be sent by the server just before it sends the file to be printed to the printer or just after

    Server Crashes

    • These events can occur in six different orderings:

      • M → P → C: A crash occurs after sending the completion message and printing the text.

      • M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.

      • P → M → C: A crash occurs after printing the text and sending the completion message.

      • P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.

      • C (→ P → M): A crash happens before the server could do anything.

      • C (→ M → P): A crash happens before the server could do anything.

    Client strategies

    • If the server crashes and subsequently recovers, it will announce to all clients that it is running again

    • The client does not know whether its request to print some text has been carried out

    • Strategies for the client:

      • Never reissue a request

      • Always reissue a request

      • Reissue a request only if client did not receive a completion message

      • Reissue a request only if client did receive a completion message

    Server Crashes

    • Different combinations of client and server strategies in the presence of server crashes.

    • Note that exactly once semantics is not achievable under any client/server strategy.
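The claim that exactly-once semantics is unachievable can be checked mechanically for the print-server example above: enumerate the event prefixes the server may complete before crashing (M = completion message sent, P = text printed) and count prints under each client strategy. A toy sketch with illustrative names:

```python
# Sketch: brute-force check that no client reissue strategy gives
# exactly-once semantics for the print server. The server crashes
# after some prefix of events; a reissued request prints again.
# Illustrative model, not a real protocol.

prefixes = [(), ("M",), ("P",), ("M", "P"), ("P", "M")]  # events before crash

def prints(prefix, strategy):
    acked = "M" in prefix                 # client saw a completion message
    printed = 1 if "P" in prefix else 0
    reissue = {"never": False,
               "always": True,
               "if_no_ack": not acked,
               "if_ack": acked}[strategy]
    return printed + (1 if reissue else 0)

for strategy in ["never", "always", "if_no_ack", "if_ack"]:
    outcomes = {prints(p, strategy) for p in prefixes}
    assert outcomes != {1}    # some crash prefix yields 0 or 2 prints
```

Every strategy admits a crash scenario where the text is printed zero or two times, matching the slide's conclusion.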

    Akka client-server communication

    • At most once semantics

    • The developer is left with the job of implementing any additional guarantees required by the application

    Reliable group communication

    • Process replication helps in fault tolerance but gives rise to a new problem:

      • How to construct a reliable multicast service, one that provides a guarantee that all processes in a group receive a message?

  • A simple solution that does not scale:

    • Use multiple reliable point-to-point channels

  • Other problems that we will consider later:

    • Process failures

    • Processes join and leave groups

  • We assume that unreliable multicasting is available

  • We assume processes are reliable, for now

    Implementing reliable multicasting on top of unreliable multicasting

    • Solution attempt 1:

      • (Unreliably) multicast message to process group

      • A process acknowledges receipt with an ack message

      • Resend message if no ack received from one or more processes

  • Problem:

    • Sender needs to process all the acks

    • Solution does not scale

    Implementing reliable multicasting on top of unreliable multicasting

    • Solution attempt 2:

      • (Unreliably) multicast numbered message to process group

      • A receiving process replies with a feedback message only to inform that it is missing a message

      • Resend missing message to process

  • Problems with solution attempt:

    • Sender must keep a log of all messages it multicast forever, and

    • The number of feedback messages may still be huge
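Solution attempt 2 above can be sketched with sequence numbers: a receiver detects a gap and asks only for the messages it missed. A toy, in-process sketch with illustrative names; the `history` dict stands in for the sender's log:

```python
# Sketch of NACK-based reliable multicast (solution attempt 2).
# Messages are numbered; a receiver requests retransmission only
# for messages it is missing. Illustrative names throughout.

class Receiver:
    def __init__(self):
        self.expected = 0          # next sequence number we expect
        self.delivered = []

    def receive(self, seqno, payload, history):
        # history stands in for the sender's log of past messages
        if seqno > self.expected:
            # gap detected: fetch the missing messages first
            for missing in range(self.expected, seqno):
                self.receive(missing, history[missing], history)
        if seqno == self.expected:
            self.delivered.append(payload)
            self.expected += 1

history = {0: "m0", 1: "m1", 2: "m2"}
r = Receiver()
r.receive(0, "m0", history)
r.receive(2, "m2", history)   # m1 was lost; receiver recovers it first
assert r.delivered == ["m0", "m1", "m2"]
```

The sketch also shows why the sender's log is the sticking point: `history` must retain every past message for as long as some receiver might still be missing it.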

    Implementing reliable multicasting on top of unreliable multicasting

    • We look at two more solutions

      • The key issue is the reduction of feedback messages!

      • Also care about garbage collection

    Nonhierarchical Feedback Control

    • Feedback suppression

      Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

    Hierarchical Feedback Control

    • Each local coordinator forwards the message to its children

    • A local coordinator handles retransmission requests

    • How to construct the tree dynamically?

      • Use the multicast tree in the underlying network

    Atomic Multicast

    • We assume now that processes could fail, while multicast communication is reliable

    • We focus on atomic (reliable) multicasting in the following sense:

      • A multicast protocol is atomic if every message multicast to group view G is delivered to each non-faulty process in G

        • if the sender crashes during the multicast, the message is delivered to all non-faulty processes or none

        • Group view: the view on the set of processes contained in the group which sender has at the time message M was multicast

        • Atomic = all or nothing

    Virtual Synchrony

    • The principle of virtual synchronous multicast.

    The model for atomic multicasting

    • The logical organization of a distributed system to distinguish between message receipt and message delivery

    Implementing atomic multicasting

    • How to implement atomic multicasting using reliable multicasting

      • Reliable multicasting could be implemented simply by sending a separate message to each member using TCP, or by using one of the methods described in slides 43-46.

      • Note 1: Sender could fail before sending all messages

      • Note 2: A message is delivered to the application only when all non-faulty processes have received it (at which point the message is referred to as stable)

      • Note 3: A message is therefore buffered at a local process until it can be delivered

    Implementing atomic multicasting

    • On initialization, for every process:

      • Received = {}

    • To A-multicast message m to group G:

      • R-multicast message m to group G

    • When process q receives an R-multicast message m:

      • if message m is not in the Received set:

        • Add message m to the Received set

        • R-multicast message m to group G

        • A-deliver message m
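The steps above can be sketched directly; here the R-multicast is simulated by synchronous calls within one process group (an illustrative simplification, not a real network):

```python
# Sketch of the atomic multicast algorithm above: on first receipt of m,
# a process re-multicasts m before delivering it, so that if the original
# sender crashed mid-multicast, surviving processes still spread m.
# Names and the in-process "network" are illustrative.

class Process:
    def __init__(self, name, group):
        self.name = name
        self.group = group
        self.received = set()      # Received = {}
        self.delivered = []

    def r_receive(self, m):
        if m not in self.received:
            self.received.add(m)
            for p in self.group:   # R-multicast m to group G
                if p is not self:
                    p.r_receive(m)
            self.delivered.append(m)   # A-deliver m

group = []
group.extend(Process(n, group) for n in "pqr")
group[0].r_receive("m")   # as if the sender managed to reach only p
assert all(proc.delivered == ["m"] for proc in group)
```

Even though the "sender" reached only one process, every process ends up delivering the message exactly once: all or nothing.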

    Implementing atomic multicasting

    We must especially guarantee that all messages sent to view G are delivered to all non-faulty processes in G before the next group membership change takes place

    Implementing Virtual Synchrony

    • Process 4 notices that process 7 has crashed, sends a view change

    • Process 6 sends out all its unstable messages, followed by a flush message

    • Process 6 installs the new view when it has received a flush message from everyone else

    Distributed Commit

    • The problem is to ensure that an operation is performed by all processes of a group, or none at all

    • Safety property:

      • The same decision is made by all non-faulty processes

      • Done in a durable (persistent) way so faulty processes can learn committed value when they recover

  • Liveness property:

    • Eventually a decision is made

  • Examples:

    • Atomic multicasting

    • Distributed transactions

    The commit problem

    • An initiating process communicates with a group of actors, who vote

      • Initiator is often a group member, too

      • Ideally, if all vote to commit we perform the action

      • If any votes to abort, none does so

  • Assume synchronous model

  • To handle asynchronous model, introduce timeouts

    • If timeout occurs the leader can presume that a member wants to abort. Called the presumed abort assumption.

    Two-Phase Commit Protocol

    Phase 1

    Phase 2

    2PC initiator

    All vote “commit”

    Fault tolerance

    • If no failures, protocol works

    • However, any member can abort any time, even before the protocol runs

    • If failure, we can separate this into three cases

      • Group member fails; initiator remains healthy

      • Initiator fails; group members remain healthy

      • Both initiator and group member fail

  • Further separation

    • Handling recovery of a failed member

    Fault tolerance

    • Some cases are pretty easy

      • E.g. if a member fails before voting we just treat it as an abort

      • If a member fails after voting commit, we assume that when it recovers it will finish up the commit protocol and perform whatever action was decided

        • All members keep a log of their actions in persistent memory.

  • Hard cases involve crash of initiator

    Two-Phase Commit

    • The finite state machine for the coordinator in 2PC.

    • The finite state machine for a participant.

    Two-Phase Commit

    • If failure(s), note that coordinator and participants block in some states: Init, Wait, Ready

    • Use timeouts

    • If participant blocked in INIT state:

      • Abort

  • If coordinator blocks in WAIT:

    • Abort and broadcast GLOBAL_ABORT

  • If participant P blocked in READY:

    • Wait for coordinator or contact another participant Q

    Two-Phase Commit

    State of Q      Action by P

    COMMIT          Make transition to COMMIT

    ABORT           Make transition to ABORT

    INIT            Make transition to ABORT

    READY           Contact another participant

    • Actions taken by a participant P when residing in state READY and having contacted another participant Q.

    • If all other participants are READY, P must block!
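The cooperative termination rule above can be captured as a simple lookup; names are illustrative:

```python
# Sketch of the 2PC cooperative termination rule: participant P,
# blocked in READY, decides based on the state of another participant Q.
# Illustrative names.

def action_for(q_state):
    return {
        "COMMIT": "transition to COMMIT",       # decision was commit
        "ABORT": "transition to ABORT",         # decision was abort
        "INIT": "transition to ABORT",          # coordinator crashed before phase 1 completed
        "READY": "contact another participant", # still undecided
    }[q_state]

assert action_for("INIT") == "transition to ABORT"
assert action_for("READY") == "contact another participant"
```

The INIT case is safe to abort because Q never even received the vote request, so the coordinator cannot have decided commit. Only when every reachable participant is READY does P have no choice but to block.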

    As a time-line picture

    Phase 1

    Phase 2



    2PC initiator






    All vote “commit”

    Why do we get stuck?

    • If process q voted “commit”, the coordinator may have committed the protocol

      • And q may have learned the outcome

      • Perhaps it transferred $10M from a bank account…

      • So we want to be consistent with that

  • If q voted “abort”, the protocol must abort

    • And in this case we can’t risk committing

      Important: In this situation we must block; we cannot restart the transaction, say.

    Two-Phase Commit

    actions by coordinator:

    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }

    • Outline of the steps taken by the coordinator in a two-phase commit protocol

    Two-Phase Commit

    actions by participant:

    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;   /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }

    • Steps taken by participant process in 2PC.

    Two-Phase Commit

    actions for handling decision requests: /* executed by separate thread */

    while true {
        wait until any incoming DECISION_REQUEST is received;   /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;   /* participant remains blocked */
    }

    • Steps taken for handling incoming decision requests.

    Three-phase commit protocol

    • Unlike the Two Phase Commit Protocol, it is non-blocking

    • Idea is to add an extra PRECOMMIT (“prepared to commit”) stage

    3 phase commit

    Phase 1

    Phase 2

    Phase 3


    Prepare to commit


    3PC initiator






    All vote “commit”

    All say “ok”

    They commit

    Three-Phase Commit

    • Finite state machine for the coordinator in 3PC

    • Finite state machine for a participant

    Why 3 phase commit?

    • A process can deduce the outcomes when this protocol is used

    • Main insight?

      • Nobody can enter the commit state unless all are first in the prepared state

      • Makes it possible to determine the state, then push the protocol forward (or back)

    Overview of Google’s distributed systems

    Original Google search engine architecture

    Organization of Google’s physical infrastructure

    • 40-80 PCs per rack (terabytes of disk space each)

    • 30+ racks per cluster

    • clusters spread across data centers

    System architecture requirements




    • Scalability

    • Reliability

    • Performance

    • Openness (at the beginning, at least)

    Overall Google systems architecture

    Google infrastructure

    Design philosophy

    • Simplicity

      • Software should do one thing and do it well

  • Provable performance

    • “every millisecond counts”

    • Estimate performance costs (accessing memory and disk, sending packet over network, locking and unlocking a mutex, etc.)

  • Testing

    • ”if it ain’t broke, you’re not trying hard enough”

    • Stringent testing

    Data and coordination services

    • Google File System (GFS)

      • Broadly similar to NFS and AFS

      • Optimized for the types of files and data access used by Google

  • BigTable

    • A distributed database that stores (semi-)structured data

    • Just enough organization and structure for the type of data Google uses

  • Chubby

    • a locking service (and more) for GFS and BigTable

    GFS requirements

    • Must run reliably on the physical platform

      • Must tolerate failures of individual components

        • So application-level services can rely on the file system

  • Optimized for Google’s usage patterns

    • Huge files (100+MB, up to 1GB)

    • Relatively small number of files

    • Accesses dominated by sequential reads and appends

    • Appends done concurrently

  • Meets the requirements of the whole Google infrastructure

    • scalable, reliable, high performance, openness

    • Important: throughput has higher priority than latency

    GFS architecture

    Files are stored in 64MB chunks in a cluster with

    • a master node (operations log replicated on remote machines)

    • hundreds of chunk servers

    • Chunks replicated 3 times

    Reading and writing

    • When the client wants to access a particular offset in a file:

      • The GFS client translates this to a (file name, chunk index)

      • and then sends this to the master

    • When the master receives the (file name, chunk index) pair:

      • It replies with the chunk identifier and replica locations

    • The client then accesses the closest chunk replica directly

    • No client-side caching

      • Caching would not help in the type of (streaming) access GFS has
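The read path above can be sketched end to end; the 64MB chunk size comes from the architecture slide, while all names and data structures here are illustrative:

```python
# Sketch of the GFS read path: the client turns a byte offset into a
# (file name, chunk index), asks the master for replica locations,
# then reads from a chunk server directly. Illustrative names only.

CHUNK_SIZE = 64 * 1024 * 1024    # 64MB chunks (from the GFS design)

# master's metadata: (file, chunk index) -> (chunk id, replica locations)
master = {
    ("/logs/web", 0): ("chunk-001", ["cs1", "cs2", "cs3"]),
    ("/logs/web", 1): ("chunk-002", ["cs2", "cs4", "cs5"]),
}

def locate(file_name, offset):
    chunk_index = offset // CHUNK_SIZE            # client-side translation
    chunk_id, replicas = master[(file_name, chunk_index)]
    return chunk_id, replicas                     # client then reads from a replica

chunk_id, replicas = locate("/logs/web", 70 * 1024 * 1024)  # falls in chunk 1
assert chunk_id == "chunk-002"
assert "cs4" in replicas
```

Note that the master hands out only metadata; the file data itself never flows through it, which is what lets a single master scale.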

    Keeping chunk replicas consistent

    • When the master receives a mutation request from a client:

      • the master grants a chunk replica a lease (that replica becomes the primary)

      • returns the identity of the primary and the other replicas to the client

    • The client sends the mutation directly to all the replicas

      • Replicas cache the mutation and acknowledge receipt

    • The client sends a write request to the primary

      • The primary orders mutations and updates accordingly

      • The primary then requests that the other replicas apply the mutations in the same order

      • When all the replicas have acknowledged success, the primary reports an ack to the client

    • What consistency model does this seem to implement?
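The protocol above hinges on the primary imposing a single order on all replicas; a minimal sketch, with illustrative names rather than the real GFS messages:

```python
# Sketch: the primary replica serializes mutations and the secondaries
# apply them in the primary's order, keeping replicas consistent.
# Illustrative names, not the real GFS protocol.

class Replica:
    def __init__(self):
        self.data = []

    def apply(self, mutation):
        self.data.append(mutation)

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def write(self, mutations):
        # The primary picks one serial order for concurrent mutations...
        for m in sorted(mutations):
            self.apply(m)
            for s in self.secondaries:   # ...and imposes it on the secondaries
                s.apply(m)

secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
primary.write({"b", "a"})            # two concurrent client mutations
assert all(s.data == primary.data for s in secondaries)
```

Because every replica applies the same mutations in the same order, the replicas agree even when clients submit mutations concurrently.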

    GFS (non-)guarantees

    • Writes (at a file offset) are not atomic

      • Concurrent writes to the same location may corrupt replicated chunks

      • If any replica is left inconsistent, the write fails (and is retried a few times)

    • Appends are executed atomically “at least once”

      • The offset is chosen by the primary

      • May end up with non-identical replicated chunks, with some having duplicate appends

      • GFS does not guarantee that the replicas are identical

      • It only guarantees that some file regions are consistent across replicas

    • When needed, GFS relies on an external locking service (Chubby)

      • as well as a leader election service (also Chubby) to select the primary replica


    • GFS provides raw data storage

    • Also needed:

      • Storage for structured data ...

      • ... optimized to handle the needs of Google’s apps ...

      • ... that is reliable, scalable, high-performance, open, etc

    Examples of structured data

    • URLs:

      • Content, crawl metadata, links, anchors, pagerank, ...

    • Per-user data:

      • User preference settings, recent queries/search results, …

        Geographic locations:

      • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …

    Commercial DB

    • Why not use a commercial database?

      • Not scalable enough

      • Too expensive

      • Full-featured relational database not required

      • Low-level optimizations may be needed

    Bigtable table

    • Implementation: Distributed multi-dimensional sparse map

      (row, column, timestamp) → cell contents


    • Each row has a key

      • A string up to 64KB in size

      • Access to data in a row is atomic

  • Rows ordered lexicographically

    • Rows close together lexicographically reside on the same or nearby machines (locality)

  • Columns

    • Columns have two-level name structure:

      • family:qualifier

  • Column family

    • logical grouping of data

    • groups unbounded number of columns (named with qualifiers)

    • may have a single column with no qualifier

  • Timestamps

    • Used to store different versions of data in a cell

      • default to current time

      • can also be set explicitly by the client

  • Garbage Collection

    • Per-column-family GC settings

      • “Only retain most recent K values in a cell”

      • “Keep values until they are older than K seconds”

      • ...
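The model described above, a sparse map from (row key, family:qualifier column, timestamp) to cell contents with versioned cells, can be sketched as a dictionary; everything here is an illustrative toy, not the Bigtable API:

```python
# Sketch of Bigtable's data model: a sparse, multi-dimensional map
# (row key, "family:qualifier" column, timestamp) -> cell contents,
# with per-cell versions ordered by timestamp. Illustrative only.

table = {}

def put(row, column, value, timestamp):
    table.setdefault((row, column), []).append((timestamp, value))

def get(row, column):
    # return the most recent version stored in the cell
    versions = table.get((row, column), [])
    return max(versions)[1] if versions else None

put("com.cnn.www", "anchor:com.cnn.www/sport", "CNN Sports", timestamp=1)
put("com.cnn.www", "anchor:com.cnn.www/sport", "CNN Sport",  timestamp=2)
assert get("com.cnn.www", "anchor:com.cnn.www/sport") == "CNN Sport"
```

Sparseness falls out naturally: only cells that were actually written occupy space, and a per-column-family GC policy would simply trim each cell's version list.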

  • API

    • Create / delete tables and column families

      Table *T = OpenOrDie("/bigtable/web/webtable");

      RowMutation r1(T, "com.cnn.www");

      r1.Set("anchor:com.cnn.www/sport", "CNN Sports");

      Operation op;

      Apply(&op, &r1);

    Bigtable architecture

    An instance of BigTable is a cluster that stores tables

    • library on client side

    • master server

    • tablet servers

      • table is decomposed into tablets


    • A table is decomposed into tablets

      • Tablet holds contiguous range of rows

      • 100MB - 200MB of data per tablet

      • Tablet server responsible for ~100 tablets

        Each tablet is represented by

        • A set of files stored in GFS

          • The files use the SSTable format, a mapping of (string) keys to (string) values

        • Log files

    Tablet Server

    • Master assigns tablets to tablet servers

    • Tablet server

      • Handles reads / writes requests to tablets from clients

  • No data goes through master

    • Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table

    • The metadata table contains metadata about actual tablets

      • including location information of associated SSTables and log files

  • Master

    • Upon startup, must grab the master lock, provided by the locking service (Chubby), to ensure it is the single master of a set of tablet servers

    • Monitors tablet servers

      • periodically scans the directory of tablet servers provided by the naming service (Chubby)

      • keeps track of tablets assigned to its tablet servers

      • obtains a lock on each tablet server from the locking service (Chubby)

        • the lock is the communication mechanism between master and tablet server

    • Assigns unassigned tablets in the cluster to the tablet servers it monitors

      • and moves tablets around to achieve load balancing

    • Garbage collects underlying files stored in GFS

    The storage architecture in Bigtable

    Each SSTable is an ordered and immutable mapping of keys to values

    Tablet Serving

    • Writes committed to log

      • Memtable: ordered log of recent commits (in memory)

      • SSTables really store a snapshot

        When Memtable gets too big

      • Create new empty Memtable

      • Merge old Memtable with SSTables and write to GFS
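The write path above can be sketched: writes land in an in-memory table; when it grows too big it is frozen, sorted, and merged out as an immutable SSTable, while reads consult the memtable first and then the SSTables newest-first. A toy sketch with illustrative names and a tiny threshold:

```python
# Sketch of the memtable/SSTable write path: recent commits live in an
# in-memory map; when it gets too big, it is flushed as an ordered,
# immutable SSTable. Reads check memtable, then newest SSTables first.
# Illustrative names; not the real Bigtable structures.

MEMTABLE_LIMIT = 3                 # tiny threshold for the sketch

memtable = {}                      # recent commits, by key
sstables = []                      # immutable, ordered key->value maps

def write(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    global memtable
    sstables.append(dict(sorted(memtable.items())))  # ordered, immutable
    memtable = {}                  # start a fresh memtable

def read(key):
    if key in memtable:            # recent writes win
        return memtable[key]
    for sst in reversed(sstables): # then newest SSTable first
        if key in sst:
            return sst[key]
    return None

write("a", 1); write("b", 2); write("c", 3)   # third write triggers a flush
write("a", 9)                                 # newer version in memtable
assert read("a") == 9 and read("b") == 2
```

The newest-first read order is what makes the flushed SSTables behave like snapshots: an older value is only visible if no newer table shadows it.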


    • Operations

      • Look up value for key

      • Iterate over all key/value pairs in specified range

  • Relies on lock service (Chubby)

    • Ensure there is at most one active master

    • Handle tablet server death

    • Store column family information

    • Store access control lists

  • Chubby

    • Chubby provides to the infrastructure

      • a locking service

      • a file system for reliable storage of small files

      • a leader election service (e.g. to select a primary replica)

      • a name service

        Seemingly violates “simplicity” design philosophy but...

        ... Chubby really provides a distributed agreement service

    Overall architecture of Chubby

    Cell: single instance of Chubby system

    • 5 replicas

    • 1 master replica

    • Each replica maintains a database of directories and files/locks

  • Consistency achieved using Lamport’s Paxos consensus protocol, which uses an operation log

  • Chubby internally supports snapshots to periodically GC the operation log

    Paxos distributed consensus algorithm

    • A distributed consensus protocol for asynchronous systems

    • Used by servers managing replicas in order to reach agreement on an update when

      • messages may be lost, re-ordered, duplicated

      • servers may operate at arbitrary speed and fail

      • servers have access to stable persistent storage

        Fact: Consensus not always possible in asynchronous systems

      • Paxos works by ensuring safety (correctness), not liveness (termination)
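The safety rules Paxos imposes on each server can be sketched; this is only the acceptor half of single-decree Paxos, with illustrative names, not a complete protocol:

```python
# Sketch of a single-decree Paxos acceptor: it promises not to accept
# proposals below a ballot it has seen, and reports any value it has
# already accepted so later proposers converge on one value.
# Illustrative only; proposer and learner sides are omitted.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest ballot promised so far
        self.accepted = None      # (ballot, value) accepted, if any

    def prepare(self, ballot):
        # phase 1: promise if this ballot is the highest seen
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        # phase 2: accept unless we promised a higher ballot
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

a = Acceptor()
assert a.prepare(1) == ("promise", None)
assert a.accept(1, "x") == "accepted"
# A later proposer learns the accepted value in phase 1:
assert a.prepare(2) == ("promise", (1, "x"))
assert a.accept(1, "y") == "nack"      # stale ballot is rejected
```

Rejecting stale ballots and reporting already-accepted values are exactly the safety rules: once a value could have been chosen, no later proposal can choose a different one, even though the protocol makes no promise about terminating.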

    Paxos algorithm - step 1

    Paxos algorithm - step 2

    The Big Picture

    • Customized solutions for Google-type problems

    • GFS: Stores data reliably

      • Just raw files

  • BigTable: provides key/value map

    • Database-like, but doesn’t provide everything we need

      Chubby: locking mechanism

    Common Principles

    • One master, multiple workers

      • MapReduce: master coordinates work amongst map / reduce workers

      • Bigtable: master knows about location of tablet servers

      • GFS: master coordinates data across chunkservers