
CSC 536 Lecture 8



Outline

  • Fault tolerance

    • Reliable client-server communication

    • Reliable group communication

    • Distributed commit

      Case study

    • Google infrastructure



Reliable client-server communication



Process-to-process communication

  • Reliable process-to-process communication is achieved using the Transmission Control Protocol (TCP)

    • TCP masks omission failures using acknowledgments and retransmissions

    • This is completely hidden from the client and server

    • Network crash failures are not masked


RPC/RMI Semantics in the Presence of Failures

  • Five different classes of failures that can occur in RPC/RMI systems:

    • The client is unable to locate the server.

    • The request message from the client to the server is lost.

    • The server crashes after receiving a request.

    • The reply message from the server to the client is lost.

    • The client crashes after sending a request.


RPC/RMI Semantics in the Presence of Failures

  • Five different classes of failures that can occur in RPC/RMI systems:

    • The client is unable to locate the server.

      • Throw exception

  • The request message from the client to the server is lost.

    • Resend request

  • The server crashes after receiving a request.

  • The reply message from the server to the client is lost.

    • Assign each request a unique id and have the server keep track of request ids (see the sketch after this list)

  • The client crashes after sending a request.

    • What to do with orphaned RPC/RMIs?
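A minimal sketch (in Python, not tied to any particular RPC framework) of the lost-reply strategy above: the server remembers the ids of requests it has already executed and replays the cached reply instead of re-executing. The names DedupServer, handle, and execute are illustrative, not from the slides.

    # Sketch: server-side duplicate filtering for the "lost reply" case.
    # Each request carries a client-chosen unique id; the server keeps a
    # reply cache keyed by that id, so a retransmitted request is answered
    # from the cache instead of being executed a second time.

    class DedupServer:
        def __init__(self, execute):
            self.execute = execute          # the actual operation (may have side effects)
            self.reply_cache = {}           # request id -> cached reply

        def handle(self, request_id, payload):
            if request_id in self.reply_cache:
                return self.reply_cache[request_id]   # duplicate: replay the old reply
            reply = self.execute(payload)             # first time: really execute
            self.reply_cache[request_id] = reply
            return reply

    if __name__ == "__main__":
        server = DedupServer(execute=lambda text: f"printed: {text}")
        print(server.handle("req-1", "hello"))   # executes
        print(server.handle("req-1", "hello"))   # retransmission: replayed, not re-executed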



    Server Crashes

    • A server in client-server communication

    • (a) The normal case. (b) Crash after execution. (c) Crash before execution.



    What should the client do?

    • Try again: at least once semantics

    • Report back failure: at most once semantics

    • Want: exactly once semantics

    • Impossible to do in general

      • Example: Server is a print server



    Print server crash

    • Three events that can happen at the print server:

      • Send the completion message (M),

      • Print the text (P),

      • Crash (C).

    • Note: M could be sent by the server just before it sends the file to be printed to the printer or just after



    Server Crashes

    • These events can occur in six different orderings:

      • M → P → C: A crash occurs after sending the completion message and printing the text.

      • M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.

      • P → M → C: A crash occurs after printing the text and sending the completion message.

      • P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.

      • C (→ P → M): A crash happens before the server could do anything.

      • C (→ M → P): A crash happens before the server could do anything.



    Client strategies

    • If the server crashes and subsequently recovers, it will announce to all clients that it is running again

    • The client does not know whether its request to print some text has been carried out

    • Strategies for the client:

      • Never reissue a request

      • Always reissue a request

      • Reissue a request only if client did not receive a completion message

      • Reissue a request only if client did receive a completion message



    Server Crashes

    • Different combinations of client and server strategies in the presence of server crashes.

    • Note that exactly once semantics is not achievable under any client/server strategy.



    Akka client-server communication

    • At most once semantics

    • The developer is left with the job of implementing any additional guarantees required by the application (a minimal sketch follows)
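Since the transport only promises at-most-once delivery, the usual application-level addition is resend-until-acknowledged plus duplicate filtering at the receiver (as in the previous sketch). A framework-agnostic sketch in plain Python rather than Akka; Retrier and send_unreliable are invented names.

    # Sketch: at-least-once on top of an at-most-once (possibly lossy) transport.
    # The sender keeps resending a message until it sees an acknowledgment;
    # the receiver must deduplicate by message id.

    import itertools

    class Retrier:
        def __init__(self, send_unreliable, max_attempts=5):
            self.send_unreliable = send_unreliable    # returns an ack, or None if lost
            self.max_attempts = max_attempts
            self.ids = itertools.count()

        def send(self, payload):
            msg_id = next(self.ids)
            for _ in range(self.max_attempts):
                ack = self.send_unreliable(msg_id, payload)
                if ack is not None:
                    return ack                        # delivered (possibly more than once)
            raise TimeoutError(f"message {msg_id} not acknowledged")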



    Reliable group communication



    Reliable group communication

    • Process replication helps in fault tolerance but gives rise to a new problem:

      • How to construct a reliable multicast service, one that provides a guarantee that all processes in a group receive a message?

  • A simple solution that does not scale:

    • Use multiple reliable point-to-point channels

  • Other problems that we will consider later:

    • Process failures

    • Processes join and leave groups

    • We assume that unreliable multicasting is available

    • We assume processes are reliable, for now



    Implementing reliable multicasting on top of unreliable multicasting

    • Solution attempt 1:

      • (Unreliably) multicast message to process group

      • A process acknowledges receipt with an ack message

      • Resend message if no ack received from one or more processes

  • Problem:

    • Sender needs to process all the acks

    • Solution does not scale



    Implementing reliable multicasting on top of unreliable multicasting

    • Solution attempt 2:

      • (Unreliably) multicast numbered message to process group

      • A receiving process replies with a feedback message only when it detects that it is missing a message (sketched below)

      • Resend missing message to process

  • Problems with solution attempt:

    • Sender must keep a log of all messages it multicast forever, and

    • The number of feedback messages may still be huge
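A sketch of solution attempt 2, assuming per-sender sequence numbers and some unreliable multicast primitive underneath; the class and callback names are invented for illustration. The receiver detects a gap in the sequence numbers and asks only for the messages it is missing.

    # Sketch: NACK-based reliable multicast receiver.
    # The sender numbers messages 0, 1, 2, ...; the receiver delivers them in
    # order and, when it sees a gap, requests retransmission of the missing ones.

    class NackReceiver:
        def __init__(self, send_nack, deliver):
            self.send_nack = send_nack        # callback: ask sender to retransmit seqnos
            self.deliver = deliver            # callback: hand message to the application
            self.next_expected = 0
            self.buffer = {}                  # out-of-order messages, keyed by seqno

        def on_message(self, seqno, payload):
            if seqno < self.next_expected:
                return                        # duplicate, drop
            self.buffer[seqno] = payload
            missing = [s for s in range(self.next_expected, seqno) if s not in self.buffer]
            if missing:
                self.send_nack(missing)       # feedback only when something is missing
            while self.next_expected in self.buffer:      # deliver any in-order prefix
                self.deliver(self.buffer.pop(self.next_expected))
                self.next_expected += 1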



    Implementing reliable multicasting on top of unreliable multicasting

    • We look at two more solutions

      • The key issue is the reduction of feedback messages!

      • Also care about garbage collection



    Nonhierarchical Feedback Control

    • Feedback suppression

      Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
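A sketch of the feedback-suppression idea, assuming retransmission requests are themselves multicast to the whole group: each receiver schedules its request after a random delay and cancels it if another receiver's request for the same message arrives first. Names such as schedule_nack are illustrative.

    # Sketch: nonhierarchical feedback suppression.
    # A receiver that misses message `seqno` waits a random time before multicasting
    # a retransmission request; if someone else's request for it arrives first,
    # it suppresses its own.

    import random

    class FeedbackSuppressor:
        def __init__(self, multicast_nack, max_delay=0.5):
            self.multicast_nack = multicast_nack
            self.max_delay = max_delay
            self.pending = {}                 # missing seqno -> scheduled firing time

        def schedule_nack(self, seqno, now):
            self.pending.setdefault(seqno, now + random.uniform(0, self.max_delay))

        def on_foreign_nack(self, seqno):
            self.pending.pop(seqno, None)     # someone else already asked: suppress ours

        def on_retransmission(self, seqno):
            self.pending.pop(seqno, None)     # message recovered: nothing left to ask for

        def tick(self, now):
            for seqno, fire_at in list(self.pending.items()):
                if now >= fire_at:
                    self.multicast_nack(seqno)    # our timer expired first: send the request
                    del self.pending[seqno]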



    Hierarchical Feedback Control

    • Each local coordinator forwards the message to its children

    • A local coordinator handles retransmission requests

    • How to construct the tree dynamically?



    Hierarchical Feedback Control

    • Each local coordinator forwards the message to its children.

    • A local coordinator handles retransmission requests.

    • How to construct the tree dynamically?

      • Use the multicast tree in the underlying network



    Atomic Multicast

    • We assume now that processes could fail, while multicast communication is reliable

    • We focus on atomic (reliable) multicasting in the following sense:

      • A multicast protocol is atomic if every message multicast to group view G is delivered to each non-faulty process in G

        • if the sender crashes during the multicast, the message is delivered to all non-faulty processes or none

        • Group view: the sender’s view of the set of processes contained in the group at the time message M was multicast

        • Atomic = all or nothing



    Virtual Synchrony

    • The principle of virtual synchronous multicast.



    The model for atomic multicasting

    • The logical organization of a distributed system to distinguish between message receipt and message delivery



    Implementing atomic multicasting

    • How to implement atomic multicasting using reliable multicasting

      • Reliable multicasting could be implemented simply by sending a separate message to each member using TCP, or by using one of the feedback-control methods described in the previous slides.

      • Note 1: Sender could fail before sending all messages

      • Note 2: A message is delivered to the application only when all non-faulty processes have received it (at which point the message is referred to as stable)

      • Note 3: A message is therefore buffered at a local process until it can be delivered



    Implementing atomic multicasting

    • On initialization, for every process:

      • Received = {}

        To A-multicast message m to group G:

      • R-multicast message m to group G

        When process q receives an R-multicast message m:

      • if message m not in Received set:

        • Add message m to Received set

        • R-multicast message m to group G

        • A-deliver message m
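A sketch of the algorithm above, assuming an underlying r_multicast primitive is available (the callback names are invented): every process re-multicasts a message the first time it receives it, so if any correct process delivers m, all correct processes do.

    # Sketch: all-or-nothing delivery by re-multicasting on first receipt,
    # following the R-multicast / A-deliver pseudocode on this slide.

    class AtomicMulticast:
        def __init__(self, r_multicast, a_deliver):
            self.r_multicast = r_multicast    # underlying group send (R-multicast)
            self.a_deliver = a_deliver        # hand message to the application
            self.received = set()             # ids of messages seen so far

        def a_multicast(self, msg_id, payload, group):
            self.r_multicast(msg_id, payload, group)

        def on_receive(self, msg_id, payload, group):
            if msg_id not in self.received:
                self.received.add(msg_id)
                self.r_multicast(msg_id, payload, group)   # relay before delivering
                self.a_deliver(payload)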



    Implementing atomic multicasting

    We must especially guarantee that all messages sent to view G are delivered to all non-faulty processes in G before the next group membership change takes place



    Implementing Virtual Synchrony

    • Process 4 notices that process 7 has crashed, sends a view change

    • Process 6 sends out all its unstable messages, followed by a flush message

    • Process 6 installs the new view when it has received a flush message from everyone else



    Distributed Commit



    Distributed Commit

    • The problem is to ensure that an operation is performed by all processes of a group, or none at all

    • Safety property:

      • The same decision is made by all non-faulty processes

      • Done in a durable (persistent) way so faulty processes can learn the committed value when they recover

  • Liveness property:

    • Eventually a decision is made

  • Examples:

    • Atomic multicasting

    • Distributed transactions



    The commit problem

    • An initiating process communicates with a group of actors, who vote

      • Initiator is often a group member, too

      • Ideally, if all vote to commit we perform the action

      • If any votes to abort, none does so

  • Assume synchronous model

  • To handle asynchronous model, introduce timeouts

    • If a timeout occurs, the leader presumes that the member wants to abort (the presumed-abort assumption).


    Two-Phase Commit Protocol

    [Timeline figure, built up over several slides: in Phase 1 the 2PC initiator sends “Vote?” to participants p, q, r, s, and t, and all vote “commit”; in Phase 2 the initiator sends “Commit!” to all of them.]



    Fault tolerance

    • If there are no failures, the protocol works

    • However, any member can abort at any time, even before the protocol runs

    • If there is a failure, we can separate the analysis into three cases

      • Group member fails; initiator remains healthy

      • Initiator fails; group members remain healthy

      • Both initiator and group member fail

  • Further separation

    • Handling recovery of a failed member



    Fault tolerance

    • Some cases are pretty easy

      • E.g. if a member fails before voting we just treat it as an abort

      • If a member fails after voting commit, we assume that when it recovers it will finish up the commit protocol and perform whatever action was decided

        • All members keep a log of their actions in persistent memory.

  • The hard cases involve a crash of the initiator



    Two-Phase Commit

    • The finite state machine for the coordinator in 2PC.

    • The finite state machine for a participant.



    Two-Phase Commit

    • If there are failures, note that the coordinator and participants may block in some states: INIT, WAIT, READY

    • Use timeouts

    • If participant blocked in INIT state:

      • Abort

  • If coordinator blocks in WAIT:

    • Abort and broadcast GLOBAL_ABORT

  • If participant P blocked in READY:

    • Wait for coordinator or contact another participant Q



    Two-Phase Commit

    State of Q      Action by P
    COMMIT          Make transition to COMMIT
    ABORT           Make transition to ABORT
    INIT            Make transition to ABORT
    READY           Contact another participant

    • Actions taken by a participant P when residing in state READY and having contacted another participant Q.

    • If all other participants are READY, P must block!
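The table above as a small decision function (a sketch only; the enum values mirror the 2PC state names used on these slides).

    # Sketch: what participant P does when blocked in READY and it learns
    # the state of some other participant Q (cooperative termination).

    from enum import Enum

    class State(Enum):
        INIT = "INIT"
        READY = "READY"
        ABORT = "ABORT"
        COMMIT = "COMMIT"

    def action_for_p(state_of_q):
        if state_of_q is State.COMMIT:
            return "transition to COMMIT"
        if state_of_q in (State.ABORT, State.INIT):
            return "transition to ABORT"      # Q aborted, or never even voted
        return "contact another participant"  # Q is READY too; if everyone is, P blocks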


    As a time-line picture

    [Timeline figure: in Phase 1 the 2PC initiator sends “Vote?” to p, q, r, s, t and all vote “commit”; in Phase 2 it sends “Commit!”.]



    Why do we get stuck?

    • If process q voted “commit”, the coordinator may have committed the protocol

      • And q may have learned the outcome

      • Perhaps it transferred $10M from a bank account…

      • So we want to be consistent with that

  • If q voted “abort”, the protocol must abort

    • And in this case we can’t risk committing

      Important: In this situation we must block; we cannot, say, restart the transaction.



    Two-Phase Commit

    actions by coordinator:

    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }

    • Outline of the steps taken by the coordinator in a two-phase commit protocol



    Two-Phase Commit

    actions by participant:

    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;   /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }

    • Steps taken by a participant process in 2PC.



    Two-Phase Commit

    actions for handling decision requests: /* executed by separate thread */

    while true {
        wait until any incoming DECISION_REQUEST is received;   /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;   /* participant remains blocked */
    }

    • Steps taken for handling incoming decision requests.



    Three-phase commit protocol

    • Unlike the two-phase commit protocol, it is non-blocking

    • The idea is to add an extra PRECOMMIT (“prepared to commit”) stage


    3 phase commit

    [Timeline figure: in Phase 1 the 3PC initiator sends “Vote?” to p, q, r, s, t and all vote “commit”; in Phase 2 it sends “Prepare to commit” and all say “ok”; in Phase 3 it sends “Commit!” and they commit.]



    Three-Phase Commit

    • Finite state machine for the coordinator in 3PC

    • Finite state machine for a participant



    Why 3 phase commit?

    • A process can deduce the outcome when this protocol is used

    • Main insight?

      • Nobody can enter the commit state unless all are first in the prepared state

      • Makes it possible to determine the state, then push the protocol forward (or back)
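A simplified sketch of a termination rule consistent with that insight (the state names and the function are illustrative, not from the slides, and a real 3PC termination protocol also elects a new coordinator): because no process commits before everyone is prepared, the surviving processes can inspect each other's states and safely push the protocol forward or back.

    # Sketch: simplified 3PC recovery rule over the states of reachable processes.

    def three_pc_decision(states):
        if "COMMIT" in states:
            return "COMMIT"     # someone already committed: must commit
        if "ABORT" in states:
            return "ABORT"      # someone already aborted: must abort
        if "PRECOMMIT" in states:
            return "COMMIT"     # all prepared or preparable: roll forward
        return "ABORT"          # everyone still uncertain: no one can have committed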



    Overview of Google’s distributed systems



    Original Google search engine architecture



    More than just a search engine


    Organization of Google’s physical infrastructure

    • 40-80 PCs per rack (terabytes of disk space each)

    • 30+ racks per cluster

    • Hundreds of clusters spread across data centers worldwide



    System architecture requirements

    • Scalability

    • Reliability

    • Performance

    • Openness (at the beginning, at least)



    Overall Google systems architecture



    Google infrastructure



    Design philosophy

    • Simplicity

      • Software should do one thing and do it well

  • Provable performance

    • “every millisecond counts”

    • Estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.)

  • Testing

    • “if it ain’t broke, you’re not trying hard enough”

    • Stringent testing



    Data and coordination services

    • Google File System (GFS)

      • Broadly similar to NFS and AFS

      • Optimized for the types of files and data access used by Google

  • BigTable

    • A distributed database that stores (semi-)structured data

    • Just enough organization and structure for the type of data Google uses

  • Chubby

    • a locking service (and more) for GFS and BigTable



    GFS requirements

    • Must run reliably on the physical platform

      • Must tolerate failures of individual components

        • So application-level services can rely on the file system

  • Optimized for Google’s usage patterns

    • Huge files (100+MB, up to 1GB)

    • Relatively small number of files

    • Accesses dominated by sequential reads and appends

    • Appends done concurrently

  • Meets the requirements of the whole Google infrastructure

    • scalable, reliable, high performance, openness

    • Important: throughput has higher priority than latency



    GFS architecture

    Files are stored in 64MB chunks in a cluster with

    • a master node (operations log replicated on remote machines)

    • hundreds of chunk servers

    • Chunks replicated 3 times



    Reading and writing

    • When the client wants to access a particular offset in a file

      • The GFS client translates this to a (file name, chunk index)

      • And then sends this to the master

        When the master receives the (file name, chunk index) pair

      • It replies with the chunk identifier and replica locations

        The client then accesses the closest chunk replica directly

        No client-side caching

      • Caching would not help in the type of (streaming) access GFS has
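A sketch of the read path described above, with 64 MB chunks; master.lookup, chunkservers, and the "closest replica" choice are stand-ins for the real GFS RPCs, not actual API names.

    # Sketch: GFS-style read. The client turns a byte offset into a chunk index,
    # asks the master for that chunk's id and replica locations, then reads the
    # data directly from one of the chunkservers (no client-side data caching).

    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

    def gfs_read(master, chunkservers, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE                        # (file name, chunk index)
        chunk_id, replica_locations = master.lookup(filename, chunk_index)
        replica = replica_locations[0]                            # stand-in for "closest replica"
        return chunkservers[replica].read(chunk_id, offset % CHUNK_SIZE, length)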



    Keeping chunk replicas consistent



    Keeping chunk replicas consistent

    • When the master receives a mutation request from a client

      • the master grants a chunk replica a lease (replica is primary)

      • returns identity of primary and other replicas to client

        The client sends the mutation directly to all the replicas

      • Replicas cache the mutation and acknowledge receipt

        The client sends a write request to primary

      • Primary orders mutations and updates accordingly

      • Primary then requests that other replicas do the mutations in the same order

      • When all the replicas have acknowledged success, the primary reports an ack to the client

        What consistency model does this seem to implement?
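A sketch of the primary's role in the write path above (class and method names invented): the primary assigns a serial order to the already-pushed mutations, forwards that order to the secondaries, and acknowledges the client only when every replica has applied it.

    # Sketch: primary replica ordering mutations for one chunk.
    # The mutation data was already pushed to (and cached at) all replicas by
    # the client; the primary only decides the order in which it is applied.

    class PrimaryReplica:
        def __init__(self, secondaries):
            self.secondaries = secondaries    # other replicas of this chunk
            self.next_serial = 0

        def write(self, mutation_id, apply_locally):
            serial = self.next_serial
            self.next_serial += 1
            apply_locally(mutation_id, serial)                       # apply in the chosen order
            acks = [s.apply(mutation_id, serial) for s in self.secondaries]
            return all(acks)                                         # ack client only if all succeeded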



    GFS (non-)guarantees

    • Writes (at a file offset) are not atomic

      • Concurrent writes to the same location may corrupt replicated chunks

      • If any replica is left inconsistent, the write fails (and is retried a few times)

        Appends are executed atomically “at least once”

      • Offset is chosen by primary

      • May end up with non-identical replicated chunks with some having duplicate appends

      • GFS does not guarantee that the replicas are identical

      • It only guarantees that some file regions are consistent across replicas

        When needed, GFS uses an external locking service (Chubby)

      • As well as a leader election service (also Chubby) to select the primary replica



    Bigtable

    • GFS provides raw data storage

    • Also needed:

      • Storage for structured data ...

      • ... optimized to handle the needs of Google’s apps ...

      • ... that is reliable, scalable, high-performance, open, etc



    Examples of structured data

    • URLs:

      • Content, crawl metadata, links, anchors, pagerank, ...

    • Per-user data:

      • User preference settings, recent queries/search results, …

        Geographic locations:

      • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …



    Commercial DB

    • Why not use a commercial database?

      • Not scalable enough

      • Too expensive

      • Full-featured relational database not required

      • Low-level optimizations may be needed



    Bigtable table

    • Implementation: Distributed multi-dimensional sparse map

      (row, column, timestamp) → cell contents
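A toy version of that map (a sketch only; real Bigtable stores this in SSTables distributed across tablet servers, not a Python dict): cells are keyed by (row, column, timestamp), and a read returns the newest version at or before a given time.

    # Sketch: Bigtable's data model as a sparse, timestamped map
    #   (row, column, timestamp) -> cell contents

    import time

    class ToyTable:
        def __init__(self):
            self.cells = {}                              # (row, column) -> {timestamp: value}

        def set(self, row, column, value, timestamp=None):
            ts = time.time() if timestamp is None else timestamp
            self.cells.setdefault((row, column), {})[ts] = value

        def get(self, row, column, at=None):
            versions = self.cells.get((row, column), {})
            at = time.time() if at is None else at
            usable = [ts for ts in versions if ts <= at]
            return versions[max(usable)] if usable else None

    t = ToyTable()
    t.set("com.cnn.www", "anchor:com.cnn.www/sport", "CNN Sports")
    print(t.get("com.cnn.www", "anchor:com.cnn.www/sport"))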



    Rows

    • Each row has a key

      • A string up to 64KB in size

      • Access to data in a row is atomic

  • Rows ordered lexicographically

    • Rows close together lexicographically reside on the same or nearby machines (locality)



    Columns

    [Example webtable row: key “com.cnn.www”, with column ‘contents:’ holding “<html>…” and anchor columns ‘anchor:com.cnn.www/sport’ = “CNN Sports”, ‘anchor:com.cnn.www/world’ = “CNN world”.]

    • Columns have two-level name structure:

      • family:qualifier

  • Column family

    • logical grouping of data

    • groups unbounded number of columns (named with qualifiers)

    • may have a single column with no qualifier



    Timestamps

    • Used to store different versions of data in a cell

      • default to current time

      • can also be set explicitly by the client

  • Garbage Collection

    • Per-column-family GC settings

      • “Only retain most recent K values in a cell”

      • “Keep values until they are older than K seconds”

      • ...
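A sketch of the two GC policies listed above, applied to one cell's version map; the function names are invented for illustration.

    # Sketch: per-column-family garbage collection of cell versions.

    import time

    def keep_most_recent_k(versions, k):
        """versions: {timestamp: value}; keep only the k newest versions."""
        newest = sorted(versions, reverse=True)[:k]
        return {ts: versions[ts] for ts in newest}

    def drop_older_than(versions, max_age_seconds, now=None):
        """Discard versions older than max_age_seconds."""
        now = time.time() if now is None else now
        return {ts: v for ts, v in versions.items() if now - ts <= max_age_seconds}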



    API

    • Create / delete tables and column families

      Table *T = OpenOrDie(“/bigtable/web/webtable”);

      RowMutation r1(T, “com.cnn.www”);

      r1.Set(“anchor:com.cnn.www/sport”, “CNN Sports”);

      r1.Delete(“anchor:com.cnn.www/world”);

      Operation op;

      Apply(&op, &r1);



    Bigtable architecture

    An instance of BigTable is a cluster that stores tables

    • library on client side

    • master server

    • tablet servers

      • table is decomposed into tablets



    Tablets

    • A table is decomposed into tablets

      • Tablet holds contiguous range of rows

      • 100MB - 200MB of data per tablet

      • Tablet server responsible for ~100 tablets

        Each tablet is represented by

        • A set of files stored in GFS

          • The files use the SSTable format, a mapping of (string) keys to (string) values

        • Log files



    Tablet Server

    • Master assigns tablets to tablet servers

    • Tablet server

      • Handles read/write requests to tablets from clients

  • No data goes through master

    • Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table

    • The metadata table contains metadata about actual tablets

      • including location information of associated SSTables and log files



    Master

    Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers

    • provided by locking service (Chubby)

      Monitors tablet servers

    • periodically scans directory of tablet servers provided by naming service (Chubby)

    • keeps track of tablets assigned to its tablet servers

      • obtains a lock on the tablet server from locking service (Chubby)

        • lock is the communication mechanism between master and tablet server

          Assigns unassigned tablets in the cluster to tablet servers it monitors

  • and moves tablets around to achieve load balancing

    Garbage collects underlying files stored in GFS



    The storage architecture in Bigtable

    Each SSTable is an ordered and immutable mapping of keys to values



    Tablet Serving

    • Writes committed to log

      • Memtable: ordered log of recent commits (in memory)

      • SSTables really store a snapshot

        When Memtable gets too big

      • Create new empty Memtable

      • Merge old Memtable with SSTables and write to GFS
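A sketch of that write path (an LSM-style toy with invented names; in real Bigtable the commit log and SSTables live in GFS): writes go to a commit log and an in-memory memtable, which is frozen into an immutable sorted table when it grows too big.

    # Sketch: memtable + SSTable write path, as on this slide.

    class ToyTabletStore:
        def __init__(self, flush_threshold=4):
            self.log = []                 # commit log (would live in GFS)
            self.memtable = {}            # recent writes, in memory
            self.sstables = []            # immutable sorted tables, oldest first
            self.flush_threshold = flush_threshold

        def write(self, key, value):
            self.log.append((key, value))             # committed to the log first
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                self._flush()

        def _flush(self):
            # Freeze the memtable as a new immutable, sorted SSTable; start a fresh one.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

        def read(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for sst in reversed(self.sstables):       # newest SSTable first
                if key in sst:
                    return sst[key]
            return None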



    SSTable

    • Operations

      • Look up value for key

      • Iterate over all key/value pairs in specified range

  • Relies on lock service (Chubby)

    • Ensure there is at most one active master

    • Administer tablet server death

    • Store column family information

    • Store access control lists



    Chubby

    • Chubby provides to the infrastructure

      • a locking service

      • a file system for reliable storage of small files

      • a leader election service (e.g. to select a primary replica)

      • a name service

        Seemingly violates “simplicity” design philosophy but...

        ... Chubby really provides a distributed agreement service



    Chubby API



    Overall architecture of Chubby

    Cell: a single instance of the Chubby system

    • 5 replicas

    • 1 master replica

    Each replica maintains a database of directories and files/locks

  • Consistency is achieved using Lamport’s Paxos consensus protocol, which uses an operation log

  • Chubby internally supports snapshots to periodically garbage-collect the operation log



    Paxos distributed consensus algorithm

    • A distributed consensus protocol for asynchronous systems

    • Used by servers managing replicas in order to reach agreement on an update when

      • messages may be lost, re-ordered, duplicated

      • servers may operate at arbitrary speed and fail

      • servers have access to stable persistent storage

        Fact: Consensus not always possible in asynchronous systems

      • Paxos works by ensuring safety (correctness), not liveness (termination)



    Paxos algorithm - step 1



    Paxos algorithm - step 2



    The Big Picture

    • Customized solutions for Google-type problems

    • GFS: Stores data reliably

      • Just raw files

  • BigTable: provides key/value map

    • Database-like, but doesn’t provide everything we need

      Chubby: locking mechanism



    Common Principles

    • One master, multiple workers

      • MapReduce: master coordinates work amongst map / reduce workers

      • Bigtable: master knows about location of tablet servers

      • GFS: master coordinates data across chunkservers

