

Group Communication, Still Complex After All These Years

Position paper by Richard Golding and Ohad Rodeh



Table of contents

  • Introduction

  • Three specific examples

  • Why GCSes are complex to build

  • A look at two interesting research systems

  • Summary



Disclaimer

  • This is a position paper

    • It reflects the authors’ opinion

    • It does not convey “scientific truth”

  • Analysis comes from the authors’ experience in

    • IBM storage research

    • HP storage research

  • Published papers on IBM and HP storage systems



Introduction

  • An ideal GCS would be “one-size-fits-all”

    • One system that would fit the needs of all applications

  • The authors claim that this is not feasible

  • The application domain

    • storage systems that use group-communication



Three examples

  • IBM General Parallel File System™

  • IBM Total Storage File System™ (a.k.a. StorageTank)

  • IBM Total Storage Volume Controller™ (SVC)


GPFS

[Figure: cluster nodes N1–N5 connected by the SP2 interconnect to Disk1, Disk2, and Disk3]

  • GPFS is the file-system at the core of the SP™ and ASCI White supercomputers

  • It scales from a few nodes to a few hundred



GPFS (2)

  • The group-services component (HACMP™) allows cluster-nodes to

    • Heartbeat each other

    • Compute membership

    • Perform multi-round voting schemes

    • Execute merge/failure protocols

  • When a node fails, all of its file-system read/write tokens are lost

  • The recovery procedure needs to regenerate lost tokens so that file-system operations can continue
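
To make the recovery step concrete, here is a minimal sketch (in Go, with hypothetical names; not GPFS's actual interface) of how a file system might react to a membership callback from the group-services layer by revoking the tokens held by a failed node, so that they can be regenerated on demand:

```go
package main

import "fmt"

// MembershipEvent is a hypothetical notification delivered by the
// group-services layer when the cluster view changes.
type MembershipEvent struct {
	Epoch  int
	Alive  []string
	Failed []string
}

// TokenManager tracks which node holds each read/write token.
type TokenManager struct {
	owner map[string]string // token ID -> owning node
}

// OnMembershipChange is the callback registered with group services.
// Tokens held by failed nodes are revoked so that surviving nodes can
// re-acquire them and file-system operations can continue.
func (tm *TokenManager) OnMembershipChange(ev MembershipEvent) {
	failed := make(map[string]bool)
	for _, n := range ev.Failed {
		failed[n] = true
	}
	for tok, node := range tm.owner {
		if failed[node] {
			delete(tm.owner, tok) // token is lost; regenerate on next request
			fmt.Printf("epoch %d: revoked token %s held by failed node %s\n",
				ev.Epoch, tok, node)
		}
	}
}

func main() {
	tm := &TokenManager{owner: map[string]string{"inode-17/rw": "N3", "inode-9/r": "N1"}}
	tm.OnMembershipChange(MembershipEvent{Epoch: 42, Alive: []string{"N1", "N2"}, Failed: []string{"N3"}})
}
```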


StorageTank

  • A SAN file-system

    • Clients can perform IO directly to network-attached disks

[Figure: hosts Host1–Host4 exchange control messages over IP with a cluster of meta-data servers MDS1–MDS3; the hosts read/write Disk1–Disk3 directly over the SAN, and the MDS cluster can fence disks]

Storagetank1

StorageTank

  • Contains a set of meta-data servers responsible for all file-system meta-data: the directory structure and file-attributes

  • The file-system name-space is partitioned into a forest of trees

  • The meta-data for each tree is stored in a small database on shared disks

  • Each tree is managed by one metadata server, so the servers share nothing between them.

  • This leads to a rather light-weight requirement from group-services

    • They only need to quickly (100ms) detect when one server fails and tell another to take its place

    • Recovery consists of reading the on-disk database and its logs, and recovering client state such as file-locks from the client
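
As an illustration of how light this requirement is, the sketch below (hypothetical names and timing constants, not StorageTank code) shows a heartbeat monitor that suspects a meta-data server after a few missed heartbeats within a roughly 100 ms budget and hands its trees to a standby:

```go
package main

import (
	"fmt"
	"time"
)

// Heartbeat timing chosen so that a failure is suspected within ~100ms,
// matching the design point above; the exact values are assumptions.
const (
	heartbeatInterval = 25 * time.Millisecond
	suspectAfter      = 4 // missed heartbeats before declaring failure
)

type monitor struct {
	lastSeen map[string]time.Time // meta-data server -> last heartbeat
}

// failover hands the failed server's name-space trees to a survivor.
// Recovery itself (replaying the on-disk database log, reclaiming client
// locks) is done by the new owner and is not modeled here.
func failover(failed, successor string) {
	fmt.Printf("%s failed: reassigning its trees to %s\n", failed, successor)
}

func (m *monitor) check(now time.Time, standby string) {
	for srv, seen := range m.lastSeen {
		if now.Sub(seen) > suspectAfter*heartbeatInterval {
			failover(srv, standby)
			delete(m.lastSeen, srv)
		}
	}
}

func main() {
	m := &monitor{lastSeen: map[string]time.Time{
		"MDS1": time.Now(),
		"MDS2": time.Now().Add(-200 * time.Millisecond), // stale: will be failed over
	}}
	m.check(time.Now(), "MDS3")
}
```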



SVC

  • SVC is a storage controller and SAN virtualizer.

  • It is architected as a small cluster of Linux-based servers situated between a set of hosts and a set of disks.

  • The servers provide caching, virtualization, and copy services (e.g. snapshot) to hosts.

  • The servers use non-volatile memory to cache data writes so that they can immediately reply to host write operations.

[Figure: hosts attached through a SAN (Fibre Channel, iSCSI, Infiniband) to paired SVC nodes L1–L3, which sit in front of Disk1, Disk2, and Disk3]



SVC (2)

  • Fault tolerance requirements imply that the data must be written consistently to at least two servers.

  • This requirement has driven the programming environment in SVC

    • The cluster is implemented using a replicated state machine approach, with a Paxos-type two-phase protocol for sending messages.

    • However, the system runs inside a SAN, so all communication goes over FibreChannel, which does not support multicast

    • Messages must be embedded in the SCSI over FibreChannel (FCP) protocol.

    • This is a major deviation from most GCSes, which are IP/multicast based.
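
The sketch below illustrates the general shape of a two-phase, Paxos-style message delivery with the transport abstracted away, so it could just as well be FCP as IP; it is an assumption-laden illustration, not SVC's actual protocol or code:

```go
package main

import (
	"errors"
	"fmt"
)

// Transport abstracts the message carrier. In SVC the carrier is SCSI
// commands over Fibre Channel (FCP) rather than IP multicast; this
// interface is an assumption made for illustration.
type Transport interface {
	Send(node string, phase, seq int, payload string) error
}

// twoPhaseBroadcast sketches a Paxos-style two-phase delivery: the message
// is proposed to every node and committed only after every propose succeeds.
// Real protocols also handle retries, leader changes, and majority (rather
// than unanimous) acknowledgement.
func twoPhaseBroadcast(t Transport, nodes []string, seq int, payload string) error {
	for _, n := range nodes { // phase 1: propose
		if err := t.Send(n, 1, seq, payload); err != nil {
			return fmt.Errorf("propose to %s failed: %w", n, err)
		}
	}
	for _, n := range nodes { // phase 2: commit
		if err := t.Send(n, 2, seq, payload); err != nil {
			return errors.New("commit incomplete; recovery protocol takes over")
		}
	}
	return nil
}

type fakeTransport struct{}

func (fakeTransport) Send(node string, phase, seq int, payload string) error {
	fmt.Printf("phase %d -> %s: seq=%d %q\n", phase, node, seq, payload)
	return nil
}

func main() {
	_ = twoPhaseBroadcast(fakeTransport{}, []string{"node1", "node2"}, 7, "update cache map")
}
```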



Why GCSes are complex to build

  • Many general-purpose GCSes have been built over the years

    • Spread, Isis

  • Why are these systems complex?

  • Three reasons:

    • Scalability

    • Semantics

    • Target programming environment



Semantics

  • What kind of semantics should a GCS provide?

  • Some storage systems only want limited features in their group services

    • StorageTank only uses membership and failure detection.

    • SVC uses Paxos

  • Some important semantics include:

    • FIFO-order, causal-order, total-order, safe-order.

    • Messaging: all systems support multicast, but what about virtually synchronous point-to-point?

    • Lightweight groups

    • Handling state transfer

    • Channels outside the GCS: how is access to shared disk state coordinated with in-GCS communication?
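
One way to picture how these choices surface at the API is an ordering parameter on multicast. The sketch below is purely illustrative; the `Order` and `Group` types are assumptions, not any real GCS's interface:

```go
package main

import "fmt"

// Order enumerates the delivery guarantees a GCS might offer. Stronger
// orders cost more: total and safe delivery generally need extra rounds or
// a sequencer, which is one reason semantics drive complexity.
type Order int

const (
	FIFO   Order = iota // per-sender order
	Causal              // respects happened-before across senders
	Total               // all members deliver in the same order
	Safe                // total order, delivered only once stable at every member
)

// Group is an illustrative (assumed) GCS handle.
type Group struct{ name string }

// Multicast tags the message with the order the application requires.
func (g *Group) Multicast(o Order, msg string) {
	fmt.Printf("group %s: multicast %q with order %d\n", g.name, msg, o)
}

func main() {
	g := &Group{name: "storage-cluster"}
	g.Multicast(FIFO, "lease renewal")  // cheap, per-sender ordering is enough
	g.Multicast(Total, "config change") // must be seen in the same order everywhere
}
```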



Scalability

  • Traditional GCSes have scalability limits.

  • Roughly speaking, the basic virtual synchrony model cannot scale beyond several hundred nodes

    • GPFS on cluster supercomputers pushes these limits.

  • What pieces of a GCS are required for correct functioning and what can be dropped?

    • Should we keep failure detection but leave virtual synchrony out?

    • What about state transfer?



Environment

  • The product's programming and execution environment bears on the ability to adapt an existing GCS to it.

  • Many storage systems are implemented using special purpose operating systems or programming environments with limited or eccentric functionality.

  • This makes products more reliable, more efficient, and less vulnerable to attacks.

  • SVC is implemented in a special-purpose development environment

    • Designed around its replicated state machine

    • Intended to reduce errors in implementation through rigorous design



Environment (2)

  • Some important aspects of the environment are:

    • Programming Language: C, C++, C#, Java, ML (Caml, SML)?

    • Threading: What is the threading model used?

      • co-routines, non-preemptive, in-kernel threads, user-space threads.

    • Kernel: is the product in-kernel, user-space, both?

    • Resource allocation: how are memory, message buffer space, thread context, and so on allocated and managed?

    • Build and execution environment: What kind of environment is used by the customer?

    • API: What does the GCS API look like: callbacks? socket-like?

  • This wide variability in requirements and environment makes it impossible to build one all-encompassing solution.
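
To illustrate the API point above, the sketch below contrasts a callback-style API, where the GCS upcalls into the application, with a socket-like API, where the application pulls messages at its own pace; the types and names are assumptions for illustration only:

```go
package main

import "fmt"

// Message is a delivered group message.
type Message struct{ Sender, Body string }

// Callback-style API: the GCS upcalls into the application, which must be
// prepared to run on the GCS's thread.
type CallbackGCS struct{ onDeliver func(Message) }

func (g *CallbackGCS) Register(h func(Message)) { g.onDeliver = h }
func (g *CallbackGCS) inject(m Message)         { g.onDeliver(m) } // stand-in for a real delivery path

// Socket-like API: the application pulls messages when it is ready, which
// fits non-preemptive or single-threaded environments better.
type SocketGCS struct{ queue chan Message }

func (g *SocketGCS) Receive() Message { return <-g.queue }

func main() {
	cb := &CallbackGCS{}
	cb.Register(func(m Message) { fmt.Println("upcall:", m.Body) })
	cb.inject(Message{Sender: "N2", Body: "view change"})

	sk := &SocketGCS{queue: make(chan Message, 1)}
	sk.queue <- Message{Sender: "N3", Body: "state transfer chunk"}
	fmt.Println("pulled:", sk.Receive().Body)
}
```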



Two research systems

  • Two systems that attempt to bypass the scalability limits of GCS

    • Palladio

    • zFS


Palladio

[Figure: clients read/write data to storage devices using layout information; a volume manager handles layout changes and failure detection; system control exchanges events and reactions with the volume manager]

  • A research block storage system [Golding & Borowsky, SRDS ’99]

    • Made of lots of small storage bricks

    • Scales from tens of bricks to thousands

    • Data striped, redundant across bricks

  • Main issues:

    • Performance of basic IO operations (read, write)

    • Reliability

    • Scaling to very large systems

  • When performing IO:

    • Expect one-copy serializability

    • Actually atomic and serialized even when accessing discontiguous data ranges


Palladio layering of concerns

[Figure: a timeline of the three layers: IO operations execute within layout epochs; layout management produces new layout epochs in response to failure reports and layout changes; system control drives the system-management decision process]

  • Three layers

    • Each executes in context provided by layer below

    • Optimized for the specific task, defers difficult/expensive things to layer below



Palladio layering (2)

  • IO operations

    • Perform individual read, write, data copy operations

    • Uses optimized 2PC write, 1-phase read (lightweight transactions)

    • Executes within one layout (membership) epoch -> constant membership

    • Freezes on failure

  • Layout management

    • Detects failures

    • Coordinates membership change

    • Fault tolerant; uses AC and leader election subprotocols

    • Concerned only with a single volume

  • System control

    • Policy determination

    • Reacts to failure reports, load measures, new resources

    • Changes layout for one or more volumes

    • Global view, but hierarchically structured
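
A minimal sketch of the IO layer's write path, assuming hypothetical `Brick` and error types (not Palladio's actual interfaces): a lightweight two-phase write that is valid only while every brick agrees on the current layout epoch, and that freezes when a brick fails or the epoch changes:

```go
package main

import (
	"errors"
	"fmt"
)

// Brick is one storage device holding part of the volume's data.
// This interface is an assumption for illustration.
type Brick interface {
	Epoch() int // layout epoch this brick believes in
	Prepare(off int64, data []byte) bool
	Commit(off int64) error
}

// ErrFrozen signals that IO must stop and wait for the layout-management
// layer to produce a new epoch ("freeze on failure").
var ErrFrozen = errors.New("layout epoch changed or brick failed: IO frozen")

// write performs a lightweight 2PC write against the bricks of one volume,
// valid only within a single layout (membership) epoch.
func write(epoch int, bricks []Brick, off int64, data []byte) error {
	for _, b := range bricks { // phase 1: prepare, also validates the epoch
		if b.Epoch() != epoch || !b.Prepare(off, data) {
			return ErrFrozen
		}
	}
	for _, b := range bricks { // phase 2: commit
		if err := b.Commit(off); err != nil {
			return ErrFrozen
		}
	}
	return nil
}

type fakeBrick struct{ epoch int }

func (f fakeBrick) Epoch() int                 { return f.epoch }
func (f fakeBrick) Prepare(int64, []byte) bool { return true }
func (f fakeBrick) Commit(int64) error         { return nil }

func main() {
	bricks := []Brick{fakeBrick{epoch: 3}, fakeBrick{epoch: 3}}
	fmt.Println("write:", write(3, bricks, 4096, []byte("stripe")))
}
```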



Palladio as a specialized group communication system

  • IO processing fills role of an atomic multicast protocol

    • For arbitrary, small set of message recipients

    • With exactly two kinds of “messages” – read and write

    • Scope limited to a volume

  • Layout management is like failure suspicion + group membership

    • Specialized membership roles: manager and data storage devices

    • Works with storage-oriented policy layer below it

    • Scope limited to a volume

  • System control is policy that is outside the data storage/communication system

  • Scalability comes from

    • Limiting IO and layout to a single volume and hence few nodes

    • Structuring system control layer hierarchically



zFS

  • zFS is a server-less file-system developed at IBM Research.

    • Its components are hosts and Object-Stores.

    • It is designed to be extremely scalable.

    • It can be compared to clustered/SAN file-systems that use group-services to maintain lock and meta-data consistency.

  • The challenge is to maintain the same levels of consistency, availability, and performance without a GCS.



An Object Store

  • An object store (ObS) is a storage controller that provides an object-based protocol for data access.

  • Roughly speaking, an object is a file; users can read/write/create/delete objects.

  • An object store handles allocation internally and secures its network connections.

  • In zFS we also assume an object store supports one major lease.

  • Taking a major lease allows access to all objects on the object store.

  • The ObS keeps track of the current owner of the major-lease

    • this replaces a name service for locating lease owners.
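
The object-store surface described above might look roughly like the following interface; the method names and the toy in-memory implementation are assumptions for illustration, not a real ObS protocol definition:

```go
package main

import "fmt"

// ObjectStore sketches the surface described above: per-object operations
// plus a single "major lease".
type ObjectStore interface {
	Create(oid uint64) error
	Delete(oid uint64) error
	Read(oid uint64, off, n int64) ([]byte, error)
	Write(oid uint64, off int64, data []byte) error

	// TakeMajorLease grants access to all objects on the store and records
	// the owner, so the ObS itself doubles as the lookup service for the
	// current lease holder (no separate name server needed).
	TakeMajorLease(owner string) error
	MajorLeaseOwner() (owner string, held bool)
}

// inMemObS is a toy in-memory stand-in that only makes the sketch runnable.
type inMemObS struct {
	objects map[uint64][]byte
	owner   string
}

func (s *inMemObS) TakeMajorLease(owner string) error             { s.owner = owner; return nil }
func (s *inMemObS) MajorLeaseOwner() (string, bool)               { return s.owner, s.owner != "" }
func (s *inMemObS) Create(oid uint64) error                       { s.objects[oid] = nil; return nil }
func (s *inMemObS) Delete(oid uint64) error                       { delete(s.objects, oid); return nil }
func (s *inMemObS) Read(oid uint64, off, n int64) ([]byte, error) { return s.objects[oid], nil }
func (s *inMemObS) Write(oid uint64, off int64, d []byte) error   { s.objects[oid] = d; return nil }

func main() {
	var obs ObjectStore = &inMemObS{objects: map[uint64][]byte{}}
	_ = obs.TakeMajorLease("host7")
	owner, _ := obs.MajorLeaseOwner()
	fmt.Println("major lease held by", owner)
}
```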



Environment and assumptions

  • A zFS configuration can encompass a large number of ObSes and hosts.

    • Hosts can fail,

    • ObSes are assumed to be reliable

      • A large body of work has been invested in making storage-controllers highly reliable.

    • If an ObS fails then a client requiring a piece of data on it will wait until that disk is restored.

  • zFS assumes the timed-asynchronous message-passing model.


File-system layout

  • A file is an object (simplification)

  • A directory is an object

  • A directory entry contains

    • A string name

    • A pointer to the object holding the file/directory

  • The pointer = (object disk, object id)

  • The file-system data and meta-data are spread throughout the set of ObSes

[Figure: objects spread across ObS1, ObS2, and ObS5; directory entries “foo” (2, 4) and “bar” (5, 1) point at oid 4 on ObS2 and oid 1 on ObS5]
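
The layout boils down to two small types: a pointer naming (object store, object id), and a directory entry binding a name to such a pointer. A sketch using the entries from the figure (the type names are assumptions):

```go
package main

import "fmt"

// ObjectPtr locates an object anywhere in the zFS configuration:
// which object store it lives on, and its id within that store.
type ObjectPtr struct {
	ObS uint32 // object-store (disk) identifier
	OID uint64 // object id within that store
}

// DirEntry is one entry of a directory object: a name plus a pointer to
// the object holding the file or sub-directory.
type DirEntry struct {
	Name   string
	Target ObjectPtr
}

func main() {
	// The entries from the figure: "foo" on ObS2, "bar" on ObS5.
	dir := []DirEntry{
		{Name: "foo", Target: ObjectPtr{ObS: 2, OID: 4}},
		{Name: "bar", Target: ObjectPtr{ObS: 5, OID: 1}},
	}
	for _, e := range dir {
		fmt.Printf("%s -> (ObS %d, oid %d)\n", e.Name, e.Target.ObS, e.Target.OID)
	}
}
```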


LMGRs

  • For each ObS in use there is a Lease-Manager (LMGR)

    • The LMGR manages access to the ObS

    • A strict cache consistency policy is used.

  • To locate the LMGR for ObS X, a host accesses ObS X and queries it for the current LMGR

    • If none exists, the host itself becomes the LMGR for ObS X

    • There is no name-server for locating LMGRs

  • All operations are lease based

    • This makes the LMGR state-less

    • No replication of LMGR state is required.

[Figure: an application and lease managers L1, L2, L3, one per object store ObS1–ObS3]


Meta Data operations

  • Metadata operations are difficult to perform

    • create file, delete file, create directory, …

  • For example, file creation should be an atomic operation.

    • Implemented by

      • Creating a directory entry

      • Creating a new object with the requested attributes.

    • A host can fail in between these two steps, leaving the file system inconsistent.

  • A delete operation

    • involves:

      • Removing a directory entry

      • Removing an object.

    • A host performing the delete can fail midway through the operation

    • This leaves dangling objects that are never freed
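
The failure window is easy to see in code. The toy sketch below (assumed types, not zFS code) follows the two steps of file creation in the order given above and marks the crash point; reordering the steps or logging an intent first are common mitigations, though the paper does not say which approach zFS uses:

```go
package main

import "fmt"

type fs struct {
	objects map[uint64]bool   // object id -> exists
	dir     map[string]uint64 // name -> object id
	nextOID uint64
}

// create performs the two file-creation steps described above. A crash
// between them leaves the file system inconsistent.
func (f *fs) create(name string) {
	f.nextOID++
	oid := f.nextOID

	f.dir[name] = oid // step 1: create the directory entry

	// <-- a crash here leaves a dangling entry: the name exists but the
	//     object it points to was never created

	f.objects[oid] = true // step 2: create the object with the requested attributes
	fmt.Printf("created %q as oid %d\n", name, oid)
}

func main() {
	f := &fs{objects: map[uint64]bool{}, dir: map[string]uint64{}}
	f.create("foo")
}
```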



Comparison

  • It is difficult to compare zFS and Palladio; it isn't an apples-to-apples comparison

  • Palladio:

    • A block-storage system

    • Uses group-services outside the IO path.

    • Performs IO to a small set of disks per IO

    • Uses light-weight transactions

    • Does not use reliable disks

    • Uses a name-server to locate internal services

  • zFS:

    • A file-system

    • Does not use group services.

    • Performs IO to a small set of disks per transaction

    • Uses full, heavy-weight, transactions

    • Uses reliable disks

    • Does not use an internal name-server



Summary

  • It is the authors' opinion that group communication is complex and is going to remain complex, without a one-size-fits-all solution.

  • There are three major reasons for this:

    • Semantics,

    • Scalability

    • Operating environment.

  • Different customers

    • Want different semantics from a GCS

    • Have different scalability and performance requirements

    • Have a wide range of programming environments

  • We believe that in the future users will continue to tailor their proprietary GCSes to their particular environment.

