
Reliable Distributed Systems

Presentation Transcript


  1. Reliable Distributed Systems RPC and Client-Server Computing

  2. Remote Procedure Call • Basic concepts • Implementation issues, usual optimizations • Where are the costs? • Firefly RPC, Lightweight RPC, Winsock Direct and VIA • Reliability and consistency • Multithreading debate

  3. A brief history of RPC • Introduced by Birrell and Nelson (1984) • Pre-RPC: most applications were built directly over the Internet primitives • Their idea: mask the distributed system behind a “transparent” abstraction • Looks like a normal procedure call • Hides all aspects of the distributed interaction • Supports an easy programming model • Today, RPC is at the core of many distributed systems

  4. More history • Early focus was on RPC “environments” • Culminated in DCE (the Distributed Computing Environment), which standardized many aspects of RPC • Emphasis then shifted to performance, and many systems improved by a factor of 10 to 20 • Today, RPC is often used from object-oriented systems employing the CORBA or COM standards. Reliability issues are more evident than in the past.

  5.–9. The basic RPC protocol (diagram, built up across slides 5–9) • Client: “binds” to the server, prepares and sends the request, and finally unpacks the reply • Server: registers with the name service, receives the request, invokes the handler, sends the reply

  10. Compilation stage • Server defines and “exports” a header file giving interfaces it supports and arguments expected. Uses “interface definition language” (IDL) • Client includes this information • Client invokes server procedures through “stubs” • provides interface identical to the server version • responsible for building the messages and interpreting the reply messages • passes arguments by value, never by reference • may limit total size of arguments, in bytes
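
As a rough illustration of what a generated stub does (the interface `add`, its procedure number, and the wire format below are invented for this sketch; they are not what DCE or any particular IDL compiler emits), a hand-written client stub might look like this:

```python
import struct

# Hypothetical interface: int add(int a, int b), assigned procedure number 1
# by an IDL compiler. The stub presents the same signature as the server's
# procedure, but only builds and interprets messages; transport is passed in.
ADD_PROC = 1

def add_stub(send_request, a: int, b: int) -> int:
    # Marshal by value: procedure number and two 32-bit ints, network byte order.
    request = struct.pack("!Iii", ADD_PROC, a, b)
    reply = send_request(request)           # blocking send + receive
    (result,) = struct.unpack("!i", reply)  # demarshal the reply
    return result

# A fake transport standing in for the real RPC runtime, so the sketch runs.
def fake_transport(message: bytes) -> bytes:
    proc, a, b = struct.unpack("!Iii", message)
    assert proc == ADD_PROC
    return struct.pack("!i", a + b)

print(add_stub(fake_transport, 2, 3))  # -> 5
```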

  11. Binding stage • Occurs when client and server program first start execution • Server registers its network address with name directory, perhaps with other information • Client scans directory to find appropriate server • Depending on how RPC protocol is implemented, may make a “connection” to the server, but this is not mandatory
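
A toy stand-in for the name directory and the bind step (register/lookup only; the interface name and address below are made up, and a real directory service such as DCE's is far richer):

```python
# Toy name directory: maps interface names to server network addresses.
directory = {}

def register(interface, address):
    """Server registers its address (and possibly other info) at startup."""
    directory[interface] = address

def bind(interface):
    """Client looks up an appropriate server at bind time."""
    return directory[interface]

register("calc.v1", ("10.0.0.5", 9000))   # hypothetical server and address
print(bind("calc.v1"))                    # -> ('10.0.0.5', 9000)
```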

  12. Data in messages • We say that data is “marshalled” into a message and “demarshalled” from it • Representation needs to deal with byte ordering issues (big-endian versus little endian), strings (some CPUs require padding), alignment, etc • Goal is to be as fast as possible on the most common architectures, yet must also be very general
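
For example, Python's struct module makes the byte-ordering and padding issues concrete; this is only an illustration of the idea, not the wire format of any particular RPC system:

```python
import struct

# "!" selects network byte order (big-endian) with no implicit padding, so the
# same bytes mean the same thing on little-endian and big-endian hosts.
value = 0x12345678
net_bytes = struct.pack("!I", value)    # b'\x124Vx': most significant byte first
host_bytes = struct.pack("=I", value)   # native order: differs on little-endian CPUs

# Strings are typically length-prefixed and padded to a 4-byte boundary.
def marshal_string(s: str) -> bytes:
    data = s.encode("utf-8")
    pad = (-len(data)) % 4
    return struct.pack("!I", len(data)) + data + b"\x00" * pad

def demarshal_string(buf: bytes) -> str:
    (n,) = struct.unpack_from("!I", buf, 0)
    return buf[4:4 + n].decode("utf-8")

assert demarshal_string(marshal_string("hello")) == "hello"
```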

  13. Request marshalling • Client builds a message containing the arguments and indicates which procedure to invoke • Due to the need for generality, data representation is a potentially costly issue! • Performs a send I/O operation to send the message • Performs a receive I/O operation to accept the reply • Unpacks the result from the reply message • Returns the result to the client program
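
A minimal sketch of those steps over UDP, reusing the toy request/reply format from the stub example above (a real RPC runtime adds transaction ids, retransmission, and error handling):

```python
import socket
import struct

def call_add(server_addr, a: int, b: int, timeout=1.0) -> int:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        request = struct.pack("!Iii", 1, a, b)   # proc number 1 = add
        sock.sendto(request, server_addr)        # send system call
        reply, _ = sock.recvfrom(4096)           # receive system call (blocks)
        (result,) = struct.unpack("!i", reply)   # unpack the reply
        return result                            # return to the caller
    finally:
        sock.close()
```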

  14. Costs in basic protocol? • Allocation and marshalling data into message (can reduce costs if you are certain client, server have identical data representations) • Two system calls, one to send, one to receive, hence context switching • Much copying all through the O/S: application to UDP, UDP to IP, IP to ethernet interface, and back up to application

  15. Schroeder and Burrows • Studied RPC performance in the O/S kernel • Suggested a series of major optimizations • Resulted in performance improvements of about 10-fold for the DEC SRC Firefly workstation (from 10ms to below 1ms)

  16. Typical optimizations? • Compile the stub “inline” to put arguments directly into message • Two versions of stub; if (at bind time) sender and dest. found to have same data representations, use host-specific rep. • Use a special “send, then receive” system call (requires O/S extension) • Optimize the O/S kernel path itself to eliminate copying – treat RPC as the most important task the kernel will do
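
The second optimization, choosing a host-specific representation when bind-time negotiation shows both sides match, might look roughly like this (the architecture tag and the negotiation step are invented for the sketch):

```python
import struct
import sys

# At bind time the client and server could exchange an architecture tag; if
# the tags match, the stubs skip byte-order conversion and pack in native order.
LOCAL_ARCH = (sys.byteorder, struct.calcsize("P"))

def choose_format(peer_arch) -> str:
    # "=" = native byte order (no conversion), "!" = network byte order.
    return "=" if peer_arch == LOCAL_ARCH else "!"

def marshal_args(fmt_prefix: str, a: int, b: int) -> bytes:
    return struct.pack(fmt_prefix + "ii", a, b)

same_peer = LOCAL_ARCH
different_peer = ("big" if sys.byteorder == "little" else "little",
                  struct.calcsize("P"))
assert marshal_args(choose_format(same_peer), 1, 2) == struct.pack("=ii", 1, 2)
assert marshal_args(choose_format(different_peer), 1, 2) == struct.pack("!ii", 1, 2)
```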

  17. Fancy argument passing • RPC is transparent for simple calls with a small amount of data passed • “Transparent” in the sense that the interface to the procedure is unchanged • But exceptions thrown will include new exceptions associated with network • What about complex structures, pointers, big arrays? These will be very costly, and perhaps impractical to pass as arguments • Most implementations limit size, types of RPC arguments. Very general systems less limited but much more costly.

  18.–21. Overcoming lost packets (diagram, built up across slides 18–21) • Client sends the request; on timeout, it retransmits • Server acks the request, so a retransmitted duplicate is recognized and ignored • Server sends the reply • Client acks the reply
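
A client-side sketch of the timeout-and-retransmit loop above, over UDP with an invented transaction-id field so duplicates and stale replies can be recognized (not any particular implementation):

```python
import socket
import struct

def call_with_retransmit(sock, server_addr, xid: int, payload: bytes,
                         timeout=0.5, max_tries=5) -> bytes:
    """Send a request and retransmit until a reply carrying our xid arrives."""
    message = struct.pack("!I", xid) + payload
    sock.settimeout(timeout)
    for _ in range(max_tries):
        sock.sendto(message, server_addr)          # (re)transmit
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            continue                               # Timeout! retransmit
        (reply_xid,) = struct.unpack_from("!I", reply, 0)
        if reply_xid == xid:                       # ignore stale/duplicate replies
            return reply[4:]
    raise TimeoutError("RPC request timed out")
```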

  22. Costs in fault-tolerant version? • Acks are expensive. Try to avoid them, e.g. if the reply will be sent quickly, suppress the initial ack • Retransmission is costly. Try to tune the delay to be “optimal” • For big messages, send packets in bursts and ack a burst at a time, not one by one

  23. Big packets (diagram) • Client sends the request as a burst of packets • Server acks the entire burst at once, then sends the reply • Client acks the reply

  24. RPC “semantics” • At most once: the request is processed 0 or 1 times • Exactly once: the request is always processed exactly 1 time • At least once: the request is processed 1 or more times ... but exactly once is impossible, because we can’t distinguish packet loss from true failures! In the other two cases, the RPC protocol simply times out when a failure can’t be resolved.

  25. Implementing at most/least once • Use a timer (clock) value and a unique id, plus the sender’s address • Server remembers recent ids and replies with the same data if a request is repeated • Also uses the id to identify duplicates and reject them • Very old requests are detected and ignored by checking the time • Assumes that the clocks are working • In particular, requires “synchronized” clocks
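
A sketch of the server-side duplicate cache this describes; the class name, retention window, and fields are invented, and a real system would also bound the cache size:

```python
import time

class AtMostOnceCache:
    """Remember recent request ids so a retransmitted request is answered
    with the saved reply instead of being executed a second time."""

    def __init__(self, max_age=60.0):
        self.max_age = max_age
        self.replies = {}          # (sender_addr, request_id) -> (timestamp, reply)

    def handle(self, sender_addr, request_id, client_time, execute):
        now = time.time()
        # Very old requests are ignored; this relies on roughly synchronized clocks.
        if now - client_time > self.max_age:
            return None
        key = (sender_addr, request_id)
        if key in self.replies:                    # duplicate: resend saved reply
            return self.replies[key][1]
        reply = execute()                          # first time: run the handler
        self.replies[key] = (now, reply)
        # Drop entries older than the retention window.
        self.replies = {k: v for k, v in self.replies.items()
                        if now - v[0] <= self.max_age}
        return reply
```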

  26. RPC versus local procedure call • Restrictions on argument sizes and types • New error cases: • Bind operation failed • Request timed out • Argument “too large” can occur if, e.g., a table grows • Costs may be very high • ... so RPC is actually not very transparent!

  27. RPC costs in case of local destination process • Often, the destination is right on the caller’s machine! • Caller builds message • Issues send system call, blocks, context switch • Message copied into kernel, then out to dest. • Dest is blocked... wake it up, context switch • Dest computes result • Entire sequence repeated in reverse direction • If scheduler is a process, context switch 6 times!

  28.–31. RPC in the normal case, with the destination on the same site (diagram, built up across slides 28–31) • Source does xyz(a, b, c); the destination and the O/S are initially blocked • Source and destination both block; the O/S runs its scheduler and copies the message from the source’s out-queue to the destination’s in-queue • The destination runs and copies in the message • The same sequence is needed to return the results

  32. Important optimizations: LRPC • Lightweight RPC (LRPC): for the case where sender and destination are on the same machine (Bershad et al.) • Uses memory mapping to pass data • Reuses the same kernel thread to reduce context-switching costs (the caller suspends and the server wakes up on the same kernel thread or “stack”) • Single system call: send_rcv or rcv_send
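
Only the kernel can implement LRPC's thread donation and single send_rcv trap, but the idea of passing arguments through a mapped region instead of copying them through messages can be sketched at user level (everything below is a loose, single-process analogy with invented names, not how LRPC is built):

```python
from multiprocessing import shared_memory
import struct

# Caller writes arguments directly into a region that would be mapped into both
# the client's and the server's address spaces; no per-call message copies.
region = shared_memory.SharedMemory(create=True, size=64)

def caller_side(a: int, b: int) -> None:
    struct.pack_into("!ii", region.buf, 0, a, b)   # arguments land in shared memory

def server_side() -> None:
    a, b = struct.unpack_from("!ii", region.buf, 0)
    struct.pack_into("!i", region.buf, 8, a + b)   # result written in place

caller_side(2, 3)
server_side()
(result,) = struct.unpack_from("!i", region.buf, 8)
print(result)  # -> 5
region.close()
region.unlink()
```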

  33.–34. LRPC (diagram, built up across slides 33–34) • The O/S and the destination are initially idle; the source does xyz(a, b, c) • Control passes directly to the destination; the arguments are directly visible through remapped memory

  35. LRPC performance impact • On same platform, offers about a 10-fold improvement over a hand-optimized RPC implementation • Does two memory remappings, no context switch • Runs about 50 times faster than standard RPC by same vendor (at the time of the research) • Semantics stronger: easy to ensure exactly once

  36. Fbufs • Peterson: tool for speeding up layered protocols • Observation: buffer management is a major source of overhead in layered protocols (ISO style) • Solution: uses memory management, protection to “cache” buffers on frequently used paths • Stack layers effectively share memory • Tremendous performance improvement seen

  37.–41. Fbufs (diagram, built up across slides 37–41) • Control flows through a stack of layers, or a pipeline of processes • Without fbufs, data is copied from each layer’s “out” buffer to the next layer’s “in” buffer • With fbufs, data is placed into an “out” buffer; the other buffers (shaded in the figure) are mapped into the address space but protected against access • The buffer is remapped to eliminate the copy, and the “in” buffer is then reused as the “out” buffer for the next stage

  42. Where are Fbufs used? • This specific system is not widely used, but: • Most kernels use similar ideas to reduce the cost of in-kernel layering • And many application-layer libraries use the same sorts of tricks to achieve a clean structure without excessive overhead from layer crossings

  43. Active messages • Concept developed by Culler and von Eicken for parallel machines • Assumes the sender knows all about the dest, including memory layout, data formats • Message header gives address of handler • Applications copy directly into and out of the network interface
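
The dispatch idea can be sketched as follows, using a handler index into a table as a safe stand-in for the raw handler address carried in a real active-message header (the message layout is invented):

```python
import struct

# In an active message, the header names the handler to run on arrival; the
# receiver dispatches directly to it instead of queueing the message.
def handle_put(offset: int, value: int, memory: list) -> None:
    memory[offset] = value

def handle_get_reply(offset: int, value: int, memory: list) -> None:
    print(f"reply: memory[{offset}] = {value}")

HANDLERS = [handle_put, handle_get_reply]   # index stands in for a code address

def deliver(message: bytes, memory: list) -> None:
    handler_id, offset, value = struct.unpack("!Iii", message)
    HANDLERS[handler_id](offset, value, memory)   # run the handler immediately

memory = [0] * 16
deliver(struct.pack("!Iii", 0, 3, 42), memory)    # "put 42 at offset 3"
assert memory[3] == 42
```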

  44. Performance impact? • Even with optimizations, standard RPC requires about 1000 instructions to send a null message • Active messages: as few as 6 instructions! One-way latency as low as 35 μs • But the model works only if the “same program” runs on all nodes and the application has direct control over the communication hardware

  45. U/Net • Low-latency, high-performance communication for ATM on normal UNIX machines, later extended to fast Ethernet • Developed by von Eicken, Vogels and others at Cornell (1995) • Idea is that the application and the ATM controller share a memory-mapped region; I/O is done by adding messages to a queue or reading them from a queue • Latency reduced 50-fold relative to UNIX, throughput 10-fold better for small messages!

  46. U/Net concepts • Normally, data flows through the O/S to the driver, then is handed to the device controller • In U/Net the device controller sees the data directly in shared memory region • Normal architecture gets protection from trust in kernel • U/Net gets protection using a form of cooperation between controller and device driver

  47. U/Net implementation • Reprogram ATM controller to understand special data structures in memory-mapped region • Rebuild ATM device driver to match this model • Pin shared memory pages, leave mapped into I/O DMA map • Disable memory caching for these pages (else changes won’t be visible to ATM)

  48. U-Net architecture (diagram) • The user’s address space has a direct-mapped communication region, organized as an in-queue, an out-queue, and a free list • The ATM device controller sees the whole region and can transfer directly in and out of it
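
A very rough model of the communication region's queues; this is plain Python, so it only illustrates the in-queue/out-queue/free-list structure, while the real region is pinned memory that the ATM controller reads and writes directly (the field layout is invented):

```python
from collections import deque

class UNetEndpoint:
    """Toy model of a U-Net style communication region: the application and a
    simulated network controller exchange buffers through shared queues."""

    def __init__(self, num_buffers=8, buf_size=2048):
        self.buffers = [bytearray(buf_size) for _ in range(num_buffers)]
        self.free_list = deque(range(num_buffers))  # indices of unused buffers
        self.out_queue = deque()                    # descriptors: (buf_index, length)
        self.in_queue = deque()                     # filled by the "controller"

    def send(self, payload: bytes) -> None:
        i = self.free_list.popleft()                # grab a free buffer
        self.buffers[i][:len(payload)] = payload
        self.out_queue.append((i, len(payload)))    # post a send descriptor

    def controller_poll(self, peer: "UNetEndpoint") -> None:
        # The controller drains our out-queue and fills the peer's in-queue.
        while self.out_queue:
            i, n = self.out_queue.popleft()
            j = peer.free_list.popleft()
            peer.buffers[j][:n] = self.buffers[i][:n]
            peer.in_queue.append((j, n))
            self.free_list.append(i)                # buffer is free again

    def receive(self) -> bytes:
        i, n = self.in_queue.popleft()
        data = bytes(self.buffers[i][:n])
        self.free_list.append(i)
        return data

a, b = UNetEndpoint(), UNetEndpoint()
a.send(b"hello over U-Net")
a.controller_poll(b)
print(b.receive())
```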

  49. U-Net protection guarantees • No user can see contents of any other user’s mapped I/O region (U-Net controller sees whole region but not the user programs) • Driver mediates to create “channels”, user can only communicate over channels it owns • U-Net controller uses channel code on incoming/outgoing packets to rapidly find the region in which to store them

  50. U-Net reliability guarantees • With space available, has the same properties as the underlying ATM (which should be nearly 100% reliable) • When queues fill up, will lose packets • Also loses packets if the channel information is corrupted, etc
