

Group Communication, Still Complex After All These Years

Position paper by Richard Golding and Ohad Rodeh



Table of contents

  • Introduction

  • Three specific examples

  • Why GCSes are complex to build

  • A look at two interesting research systems

  • Summary



Disclaimer

  • This is a position paper

    • It reflects the authors’ opinion

    • It does not convey “scientific truth”

  • Analysis comes from the authors’ experience in

    • IBM storage research

    • HP storage research

  • Published papers on IBM and HP storage systems



Introduction

  • An ideal GCS would be “one-size-fits-all”

    • One system that would fit the needs of all applications

  • The authors claim that this is not feasible

  • The application domain

    • storage systems that use group-communication



Three examples

  • IBM General Parallel File System™

  • IBM Total Storage File System™ (a.k.a. StorageTank)

  • IBM Total Storage Volume Controller™ (SVC)


GPFS

[Figure: cluster nodes N1–N5 connected by the SP2 interconnect to Disk1, Disk2, and Disk3]

  • GPFS is the file-system at the core of the SP™ and ASCI White supercomputers

  • It scales from a few nodes to a few hundred



GPFS (2)

  • The group-services component (HACMP™) allows cluster-nodes to

    • Heartbeat each other

    • Compute membership

    • Perform multi-round voting schemes

    • Execute merge/failure protocols

  • When a node fails, all of its file-system read/write tokens are lost

  • The recovery procedure needs to regenerate lost tokens so that file-system operations can continue
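
To make the recovery step concrete, here is a minimal sketch (in Go, with hypothetical names; not GPFS's actual interface) of how a file system might react to a membership callback from the group-services layer by revoking the tokens held by a failed node, so that they can be regenerated on demand:

```go
package main

import "fmt"

// MembershipEvent is a hypothetical notification delivered by the
// group-services layer when the cluster view changes.
type MembershipEvent struct {
	Epoch  int
	Alive  []string
	Failed []string
}

// TokenManager tracks which node holds each read/write token.
type TokenManager struct {
	owner map[string]string // token ID -> owning node
}

// OnMembershipChange is the callback registered with group services.
// Tokens held by failed nodes are revoked so that surviving nodes can
// re-acquire them and file-system operations can continue.
func (tm *TokenManager) OnMembershipChange(ev MembershipEvent) {
	failed := make(map[string]bool)
	for _, n := range ev.Failed {
		failed[n] = true
	}
	for tok, node := range tm.owner {
		if failed[node] {
			delete(tm.owner, tok) // token is lost; regenerate on next request
			fmt.Printf("epoch %d: revoked token %s held by failed node %s\n",
				ev.Epoch, tok, node)
		}
	}
}

func main() {
	tm := &TokenManager{owner: map[string]string{"inode-17/rw": "N3", "inode-9/r": "N1"}}
	tm.OnMembershipChange(MembershipEvent{Epoch: 42, Alive: []string{"N1", "N2"}, Failed: []string{"N3"}})
}
```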


StorageTank

  • A SAN file-system

    • Clients can perform IO directly to network-attached disks

[Figure: hosts Host1–Host4 exchange control messages over IP with a cluster of meta-data servers MDS1–MDS3; the hosts read/write Disk1–Disk3 directly over the SAN, and the MDS cluster can fence disks]

Storagetank1

StorageTank

  • Contains a set of meta-data servers responsible for all file-system meta-data: the directory structure and file-attributes

  • The file-system name-space is partitioned into a forest of trees

  • The meta-data for each tree is stored in a small database on shared disks

  • Each tree is managed by one metadata server, so the servers share nothing between them.

  • This leads to a rather light-weight requirement from group-services

    • They only need to quickly (100ms) detect when one server fails and tell another to take its place

    • Recovery consists of reading the on-disk database and its logs, and recovering client state such as file-locks from the client
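
As an illustration of how light this requirement is, the sketch below (hypothetical names and timing constants, not StorageTank code) shows a heartbeat monitor that suspects a meta-data server after a few missed heartbeats within a roughly 100 ms budget and hands its trees to a standby:

```go
package main

import (
	"fmt"
	"time"
)

// Heartbeat timing chosen so that a failure is suspected within ~100ms,
// matching the design point above; the exact values are assumptions.
const (
	heartbeatInterval = 25 * time.Millisecond
	suspectAfter      = 4 // missed heartbeats before declaring failure
)

type monitor struct {
	lastSeen map[string]time.Time // meta-data server -> last heartbeat
}

// failover hands the failed server's name-space trees to a survivor.
// Recovery itself (replaying the on-disk database log, reclaiming client
// locks) is done by the new owner and is not modeled here.
func failover(failed, successor string) {
	fmt.Printf("%s failed: reassigning its trees to %s\n", failed, successor)
}

func (m *monitor) check(now time.Time, standby string) {
	for srv, seen := range m.lastSeen {
		if now.Sub(seen) > suspectAfter*heartbeatInterval {
			failover(srv, standby)
			delete(m.lastSeen, srv)
		}
	}
}

func main() {
	m := &monitor{lastSeen: map[string]time.Time{
		"MDS1": time.Now(),
		"MDS2": time.Now().Add(-200 * time.Millisecond), // stale: will be failed over
	}}
	m.check(time.Now(), "MDS3")
}
```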



SVC

  • SVC is a storage controller and SAN virtualizer.

  • It is architected as a small cluster of Linux-based servers situated between a set of hosts and a set of disks.

  • The servers provide caching, virtualization, and copy services (e.g. snapshot) to hosts.

  • The servers use non-volatile memory to cache data writes so that they can immediately reply to host write operations.

[Figure: hosts attached through a SAN (Fibre Channel, iSCSI, Infiniband) to paired SVC nodes L1–L3, which sit in front of Disk1, Disk2, and Disk3]



SVC (2)

  • Fault tolerance requirements imply that the data must be written consistently to at least two servers.

  • This requirement has driven the programming environment in SVC

    • The cluster is implemented using a replicated state machine approach, with a Paxos-type two-phase protocol for sending messages.

    • However, the system runs inside a SAN, so all communication goes over FibreChannel, which does not support multicast

    • Messages must be embedded in the SCSI over FibreChannel (FCP) protocol.

    • This is a major deviation from most GCSes, which are IP/multicast based.
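
The sketch below illustrates the general shape of a two-phase, Paxos-style message delivery with the transport abstracted away, so it could just as well be FCP as IP; it is an assumption-laden illustration, not SVC's actual protocol or code:

```go
package main

import (
	"errors"
	"fmt"
)

// Transport abstracts the message carrier. In SVC the carrier is SCSI
// commands over Fibre Channel (FCP) rather than IP multicast; this
// interface is an assumption made for illustration.
type Transport interface {
	Send(node string, phase, seq int, payload string) error
}

// twoPhaseBroadcast sketches a Paxos-style two-phase delivery: the message
// is proposed to every node and committed only after every propose succeeds.
// Real protocols also handle retries, leader changes, and majority (rather
// than unanimous) acknowledgement.
func twoPhaseBroadcast(t Transport, nodes []string, seq int, payload string) error {
	for _, n := range nodes { // phase 1: propose
		if err := t.Send(n, 1, seq, payload); err != nil {
			return fmt.Errorf("propose to %s failed: %w", n, err)
		}
	}
	for _, n := range nodes { // phase 2: commit
		if err := t.Send(n, 2, seq, payload); err != nil {
			return errors.New("commit incomplete; recovery protocol takes over")
		}
	}
	return nil
}

type fakeTransport struct{}

func (fakeTransport) Send(node string, phase, seq int, payload string) error {
	fmt.Printf("phase %d -> %s: seq=%d %q\n", phase, node, seq, payload)
	return nil
}

func main() {
	_ = twoPhaseBroadcast(fakeTransport{}, []string{"node1", "node2"}, 7, "update cache map")
}
```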



Why GCSes are complex to build

  • Many general-purpose GCSes have been built over the years

    • Spread, Isis

  • Why are these systems complex?

  • Three reasons:

    • Scalability

    • Semantics

    • Target programming environment



Semantics

  • What kind of semantics should a GCS provide?

  • Some storage systems only want limited features in their group services

    • StorageTank only uses membership and failure detection.

    • SVC uses Paxos

  • Some important semantics include:

    • FIFO-order, causal-order, total-order, safe-order.

    • Messaging: all systems support multicast, but what about virtually synchronous point-to-point?

    • Lightweight groups

    • Handling state transfer

    • Channels outside the GCS: how is access to shared disk state coordinated with in-GCS communication?
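
One way to picture how these choices surface at the API is an ordering parameter on multicast. The sketch below is purely illustrative; the `Order` and `Group` types are assumptions, not any real GCS's interface:

```go
package main

import "fmt"

// Order enumerates the delivery guarantees a GCS might offer. Stronger
// orders cost more: total and safe delivery generally need extra rounds or
// a sequencer, which is one reason semantics drive complexity.
type Order int

const (
	FIFO   Order = iota // per-sender order
	Causal              // respects happened-before across senders
	Total               // all members deliver in the same order
	Safe                // total order, delivered only once stable at every member
)

// Group is an illustrative (assumed) GCS handle.
type Group struct{ name string }

// Multicast tags the message with the order the application requires.
func (g *Group) Multicast(o Order, msg string) {
	fmt.Printf("group %s: multicast %q with order %d\n", g.name, msg, o)
}

func main() {
	g := &Group{name: "storage-cluster"}
	g.Multicast(FIFO, "lease renewal")  // cheap, per-sender ordering is enough
	g.Multicast(Total, "config change") // must be seen in the same order everywhere
}
```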



Scalability

  • Traditional GCSes have scalability limits.

  • Roughly speaking, the basic virtual synchrony model cannot scale beyond several hundred nodes

    • GPFS on cluster supercomputers pushes these limits.

  • What pieces of a GCS are required for correct functioning and what can be dropped?

    • Should we keep failure detection but leave virtual synchrony out?

    • What about state transfer?



Environment

  • The product's programming and execution environment bears on the ability to adapt an existing GCS to it.

  • Many storage systems are implemented using special purpose operating systems or programming environments with limited or eccentric functionality.

  • This makes products more reliable, more efficient, and less vulnerable to attacks.

  • SVC is implemented in a special-purpose development environment

    • Designed around its replicated state machine

    • Intended to reduce errors in implementation through rigorous design



Environment (2)

  • Some important aspects of the environment are:

    • Programming Language: C, C++, C#, Java, ML (Caml, SML)?

    • Threading: What is the threading model used?

      • co-routines, non-preemptive, in-kernel threads, user-space threads.

    • Kernel: is the product in-kernel, user-space, both?

    • Resource allocation: how are memory, message buffer space, thread context, and so on allocated and managed?

    • Build and execution environment: What kind of environment is used by the customer?

    • API: What does the GCS API look like: callbacks? socket-like?

  • This wide variability in requirements and environment makes it impossible to build one all-encompassing solution.
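
To illustrate the API point above, the sketch below contrasts a callback-style API, where the GCS upcalls into the application, with a socket-like API, where the application pulls messages at its own pace; the types and names are assumptions for illustration only:

```go
package main

import "fmt"

// Message is a delivered group message.
type Message struct{ Sender, Body string }

// Callback-style API: the GCS upcalls into the application, which must be
// prepared to run on the GCS's thread.
type CallbackGCS struct{ onDeliver func(Message) }

func (g *CallbackGCS) Register(h func(Message)) { g.onDeliver = h }
func (g *CallbackGCS) inject(m Message)         { g.onDeliver(m) } // stand-in for a real delivery path

// Socket-like API: the application pulls messages when it is ready, which
// fits non-preemptive or single-threaded environments better.
type SocketGCS struct{ queue chan Message }

func (g *SocketGCS) Receive() Message { return <-g.queue }

func main() {
	cb := &CallbackGCS{}
	cb.Register(func(m Message) { fmt.Println("upcall:", m.Body) })
	cb.inject(Message{Sender: "N2", Body: "view change"})

	sk := &SocketGCS{queue: make(chan Message, 1)}
	sk.queue <- Message{Sender: "N3", Body: "state transfer chunk"}
	fmt.Println("pulled:", sk.Receive().Body)
}
```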



Two research systems

  • Two systems that attempt to bypass the scalability limits of GCS

    • Palladio

    • zFS


Palladio

[Figure: clients read/write data to storage devices using layout information; a volume manager handles layout changes and failure detection; system control exchanges events and reactions with the volume manager]

  • A research block storage system [Golding & Borowsky, SRDS ’99]

    • Made of lots of small storage bricks

    • Scales from tens of bricks to thousands

    • Data striped, redundant across bricks

  • Main issues:

    • Performance of basic IO operations (read, write)

    • Reliability

    • Scaling to very large systems

  • When performing IO:

    • Expect one-copy serializability

    • Actually atomic and serialized even when accessing discontiguous data ranges


Palladio layering of concerns

[Figure: a timeline of the three layers: IO operations execute within layout epochs; layout management produces new layout epochs in response to failure reports and layout changes; system control drives the system-management decision process]

  • Three layers

    • Each executes in context provided by layer below

    • Optimized for the specific task, defers difficult/expensive things to layer below



Palladio layering (2)

  • IO operations

    • Perform individual read, write, data copy operations

    • Uses optimized 2PC write, 1-phase read (lightweight transactions)

    • Executes within one layout (membership) epoch -> constant membership

    • Freezes on failure

  • Layout management

    • Detects failures

    • Coordinates membership change

    • Fault tolerant; uses AC and leader election subprotocols

    • Concerned only with a single volume

  • System control

    • Policy determination

    • Reacts to failure reports, load measures, new resources

    • Changes layout for one or more volumes

    • Global view, but hierarchically structured
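
A minimal sketch of the IO layer's write path, assuming hypothetical `Brick` and error types (not Palladio's actual interfaces): a lightweight two-phase write that is valid only while every brick agrees on the current layout epoch, and that freezes when a brick fails or the epoch changes:

```go
package main

import (
	"errors"
	"fmt"
)

// Brick is one storage device holding part of the volume's data.
// This interface is an assumption for illustration.
type Brick interface {
	Epoch() int // layout epoch this brick believes in
	Prepare(off int64, data []byte) bool
	Commit(off int64) error
}

// ErrFrozen signals that IO must stop and wait for the layout-management
// layer to produce a new epoch ("freeze on failure").
var ErrFrozen = errors.New("layout epoch changed or brick failed: IO frozen")

// write performs a lightweight 2PC write against the bricks of one volume,
// valid only within a single layout (membership) epoch.
func write(epoch int, bricks []Brick, off int64, data []byte) error {
	for _, b := range bricks { // phase 1: prepare, also validates the epoch
		if b.Epoch() != epoch || !b.Prepare(off, data) {
			return ErrFrozen
		}
	}
	for _, b := range bricks { // phase 2: commit
		if err := b.Commit(off); err != nil {
			return ErrFrozen
		}
	}
	return nil
}

type fakeBrick struct{ epoch int }

func (f fakeBrick) Epoch() int                 { return f.epoch }
func (f fakeBrick) Prepare(int64, []byte) bool { return true }
func (f fakeBrick) Commit(int64) error         { return nil }

func main() {
	bricks := []Brick{fakeBrick{epoch: 3}, fakeBrick{epoch: 3}}
	fmt.Println("write:", write(3, bricks, 4096, []byte("stripe")))
}
```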



Palladio as a specialized group communication system

  • IO processing fills role of an atomic multicast protocol

    • For arbitrary, small set of message recipients

    • With exactly two kinds of “messages” – read and write

    • Scope limited to a volume

  • Layout management is like failure suspicion + group membership

    • Specialized membership roles: manager and data storage devices

    • Works with storage-oriented policy layer below it

    • Scope limited to a volume

  • System control is policy that is outside the data storage/communication system

  • Scalability comes from

    • Limiting IO and layout to a single volume and hence few nodes

    • Structuring system control layer hierarchically



zFS

  • zFS is a server-less file-system developed at IBM Research.

    • Its components are hosts and Object-Stores.

    • It is designed to be extremely scalable.

    • It can be compared to clustered/SAN file-systems that use group-services to maintain lock and meta-data consistency.

  • The challenge is to maintain the same levels of consistency, availability, and performance without a GCS.



An Object Store

  • An object store (ObS) is a storage controller that provides an object-based protocol for data access.

  • Roughly speaking, an object is a file; users can read/write/create/delete objects.

  • An object store handles allocation internally and secures its network connections.

  • In zFS we also assume an object store supports one major lease.

  • Taking a major lease allows access to all objects on the object store.

  • The ObS keeps track of the current owner of the major-lease

    • this replaces a name service for locating lease owners.
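
The object-store surface described above might look roughly like the following interface; the method names and the toy in-memory implementation are assumptions for illustration, not a real ObS protocol definition:

```go
package main

import "fmt"

// ObjectStore sketches the surface described above: per-object operations
// plus a single "major lease".
type ObjectStore interface {
	Create(oid uint64) error
	Delete(oid uint64) error
	Read(oid uint64, off, n int64) ([]byte, error)
	Write(oid uint64, off int64, data []byte) error

	// TakeMajorLease grants access to all objects on the store and records
	// the owner, so the ObS itself doubles as the lookup service for the
	// current lease holder (no separate name server needed).
	TakeMajorLease(owner string) error
	MajorLeaseOwner() (owner string, held bool)
}

// inMemObS is a toy in-memory stand-in that only makes the sketch runnable.
type inMemObS struct {
	objects map[uint64][]byte
	owner   string
}

func (s *inMemObS) TakeMajorLease(owner string) error             { s.owner = owner; return nil }
func (s *inMemObS) MajorLeaseOwner() (string, bool)               { return s.owner, s.owner != "" }
func (s *inMemObS) Create(oid uint64) error                       { s.objects[oid] = nil; return nil }
func (s *inMemObS) Delete(oid uint64) error                       { delete(s.objects, oid); return nil }
func (s *inMemObS) Read(oid uint64, off, n int64) ([]byte, error) { return s.objects[oid], nil }
func (s *inMemObS) Write(oid uint64, off int64, d []byte) error   { s.objects[oid] = d; return nil }

func main() {
	var obs ObjectStore = &inMemObS{objects: map[uint64][]byte{}}
	_ = obs.TakeMajorLease("host7")
	owner, _ := obs.MajorLeaseOwner()
	fmt.Println("major lease held by", owner)
}
```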



Environment and assumptions

  • A zFS configuration can encompass a large number of ObSes and hosts.

    • Hosts can fail,

    • ObSes are assumed to be reliable

      • A large body of work has been invested in making storage-controllers highly reliable.

    • If an ObS fails then a client requiring a piece of data on it will wait until that disk is restored.

  • zFS assumes the timed-asynchronous message-passing model.


File-system layout

  • A file is an object (simplification)

  • A directory is an object

  • A directory entry contains

    • A string name

    • A pointer to the object holding the file/directory

  • The pointer = (object disk, object id)

  • The file-system data and meta-data are spread throughout the set of ObSes

[Figure: objects spread across ObS1, ObS2, and ObS5; directory entries “foo” (2, 4) and “bar” (5, 1) point at oid 4 on ObS2 and oid 1 on ObS5]
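
The layout boils down to two small types: a pointer naming (object store, object id), and a directory entry binding a name to such a pointer. A sketch using the entries from the figure (the type names are assumptions):

```go
package main

import "fmt"

// ObjectPtr locates an object anywhere in the zFS configuration:
// which object store it lives on, and its id within that store.
type ObjectPtr struct {
	ObS uint32 // object-store (disk) identifier
	OID uint64 // object id within that store
}

// DirEntry is one entry of a directory object: a name plus a pointer to
// the object holding the file or sub-directory.
type DirEntry struct {
	Name   string
	Target ObjectPtr
}

func main() {
	// The entries from the figure: "foo" on ObS2, "bar" on ObS5.
	dir := []DirEntry{
		{Name: "foo", Target: ObjectPtr{ObS: 2, OID: 4}},
		{Name: "bar", Target: ObjectPtr{ObS: 5, OID: 1}},
	}
	for _, e := range dir {
		fmt.Printf("%s -> (ObS %d, oid %d)\n", e.Name, e.Target.ObS, e.Target.OID)
	}
}
```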


LMGRs

  • For each ObS in use there is a Lease-Manager (LMGR)

    • The LMGR manages access to the ObS

    • A strict cache consistency policy is used.

  • To locate the LMGR for ObS X, a host accesses ObS X and queries it for the current LMGR

    • If none exists, the host itself becomes the LMGR for ObS X

    • There is no name-server for locating LMGRs

  • All operations are lease based

    • This makes the LMGR state-less

    • No replication of LMGR state is required.

[Figure: an application and lease managers L1, L2, L3, one per object store ObS1–ObS3]


Meta Data operations

  • Metadata operations are difficult to perform

    • create file, delete file, create directory, …

  • For example, file creation should be an atomic operation.

    • Implemented by

      • Creating a directory entry

      • Creating a new object with the requested attributes.

    • A host can fail in between these two steps, leaving the file system inconsistent.

  • A delete operation

    • involves:

      • Removing a directory entry

      • Removing an object.

    • A host performing the delete can fail midway through the operation

    • This leaves dangling objects that are never freed
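
The failure window is easy to see in code. The toy sketch below (assumed types, not zFS code) follows the two steps of file creation in the order given above and marks the crash point; reordering the steps or logging an intent first are common mitigations, though the paper does not say which approach zFS uses:

```go
package main

import "fmt"

type fs struct {
	objects map[uint64]bool   // object id -> exists
	dir     map[string]uint64 // name -> object id
	nextOID uint64
}

// create performs the two file-creation steps described above. A crash
// between them leaves the file system inconsistent.
func (f *fs) create(name string) {
	f.nextOID++
	oid := f.nextOID

	f.dir[name] = oid // step 1: create the directory entry

	// <-- a crash here leaves a dangling entry: the name exists but the
	//     object it points to was never created

	f.objects[oid] = true // step 2: create the object with the requested attributes
	fmt.Printf("created %q as oid %d\n", name, oid)
}

func main() {
	f := &fs{objects: map[uint64]bool{}, dir: map[string]uint64{}}
	f.create("foo")
}
```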



Comparison

  • It is difficult to compare zFS and Palladio; it isn't an apples-to-apples comparison

  • Palladio:

    • A block-storage system

    • Uses group-services outside the IO path.

    • Performs IO to a small set of disks per IO

    • Uses light-weight transactions

    • Does not use reliable disks

    • Uses a name-server to locate internal services

  • zFS:

    • A file-system

    • Does not use group services.

    • Performs IO to a small set of disks per transaction

    • Uses full, heavy-weight, transactions

    • Uses reliable disks

    • Does not use an internal name-server



Summary

  • It is the authors' opinion that group communication is complex and is going to remain complex, without a one-size-fits-all solution.

  • There are three major reasons for this:

    • Semantics,

    • Scalability

    • Operating environment.

  • Different customers

    • Want different semantics from a GCS

    • Have different scalability and performance requirements

    • Have a wide range of programming environments

  • We believe that in the future users will continue to tailor their proprietary GCSes to their particular environment.

