  1. DISTRIBUTED COMPUTING Fall 2007

  2. ROAD MAP: OVERVIEW • Why are distributed systems interesting? • Why are they hard?

  3. GOALS OF DISTRIBUTED SYSTEMS Take advantage of the cost/performance difference between microprocessors and shared memory multiprocessors. Build systems: 1. with a single system image 2. with higher performance 3. with higher reliability 4. for less money than uniprocessor systems. In wide-area distributed systems, information and work are physically distributed, implying that computing needs should be distributed. Besides improving response time, this contributes to political goals such as local control over data.

  4. WHY SO HARD? A distributed system is one in which each process has imperfect knowledge of the global state. Reasons: asynchrony and failures. We discuss problems that these two features raise and algorithms to address these problems. Then we discuss implementation issues for real distributed systems.

  5. ANATOMY OF A DISTRIBUTED SYSTEM A set of asynchronous computing devices connected by a network. Normally, no global clock. Communication is either through messages or shared memory. Shared memory is usually harder to implement.

  6. ANATOMY OF A DISTRIBUTED SYSTEM (cont.) [Figure: each processor has its own clock; processors are connected by an arbitrary network or by a broadcast medium.] Special protocols will be possible for the broadcast medium.

  7. COURSE GOALS 1. To help you understand which system assumptions are important. 2. To present some interesting and useful distributed algorithms and methods of analysis then have you apply them under challenging conditions. 3. To explore the sources for distributed intelligence.

  8. BASIC COMMUNICATION PRIMITIVE: MESSAGE PASSING Paradigm: • Send message to destination • Receive message from origin Nice property: can make distribution transparent, since it does not matter whether destination is at a local computer or at a remote one (except for failures). Clean framework: “Paradigms for Process Interaction in Distributed Programs,” G. R. Andrews, ACM Computing Surveys 23:1 (March 1991) pp. 49-90.

  9. BLOCKING (SYNCHRONOUS) VS. NON-BLOCKING (ASYNCHRONOUS) COMMUNICATION For sender: Should the sender wait for the receiver to receive a message or not? For receiver: When arriving at a reception point and there is no message waiting, should the receiver wait or proceed? Blocking receive is normal (i.e., receiver waits).
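
The distinction can be illustrated with a small sketch. This is not part of the course material; it simply uses Java's java.util.concurrent.BlockingQueue as a stand-in for a message channel, where put/take are blocking and offer/poll are non-blocking.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class SendReceiveDemo {
    public static void main(String[] args) throws InterruptedException {
        // The queue stands in for a message channel between sender and receiver.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(1);

        // Blocking send: put() waits until there is room for the message.
        channel.put("hello");

        // Non-blocking send: offer() returns immediately, reporting success or failure.
        boolean accepted = channel.offer("world");
        System.out.println("second send accepted? " + accepted);

        // Blocking receive (the normal case): take() waits until a message arrives.
        System.out.println("received: " + channel.take());

        // Non-blocking receive: poll() returns null (here, after a short timeout)
        // if no message is waiting, letting the receiver proceed.
        String maybe = channel.poll(100, TimeUnit.MILLISECONDS);
        System.out.println("second receive: " + maybe);
    }
}
```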

  10. [Figure: sender/receiver timelines for a BLOCKING send (the sender does no computation until the receiver’s ACK arrives) and a NON-BLOCKING send (the sender continues immediately; an ACK, if any, arrives later).]

  11. REMOTE PROCEDURE CALL [Figure: client issues a call to the server; the server returns.] The client calls the server using call server(in parameters; out parameters). The call can appear anywhere that a normal procedure call can. The server returns the result to the client. The client blocks while waiting for the response from the server.

  12. RENDEZVOUS FACILITY [Figure: sender sends; receiver accepts; sender is notified that the message was accepted.] • One process sends a message to another process and blocks at least until that process accepts the message. • The receiving process blocks when it is waiting to accept a request. Thus the name: only when both processes are ready for the data transfer do they proceed. We will see examples of rendezvous interactions in CSP and Ada.
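
As a rough illustration (not CSP or Ada, which come later), Java's SynchronousQueue gives rendezvous-like behavior: it holds no elements, so a put blocks until a matching take and vice versa. A minimal sketch:

```java
import java.util.concurrent.SynchronousQueue;

public class RendezvousDemo {
    public static void main(String[] args) throws InterruptedException {
        // A SynchronousQueue holds no elements: each put() must wait for a
        // matching take(), so sender and receiver proceed only together.
        SynchronousQueue<String> channel = new SynchronousQueue<>();

        Thread receiver = new Thread(() -> {
            try {
                // Receiver blocks here until the sender arrives with a message.
                String request = channel.take();
                System.out.println("accepted: " + request);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        receiver.start();

        // Sender blocks here until the receiver is ready to accept.
        channel.put("do some work");
        receiver.join();
    }
}
```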

  13. Beyond send-receive: Conversations Needed when a continuous connection is more efficient and/or only some of the data is needed at a time. Bob and Alice: Bob initiates, Alice responds, then Bob, then Alice, … But what if Bob wants Alice to send messages as they arrive, with Bob doing no more than acknowledging them? A send-only or receive-only mode. Others?

  14. SEPARATION OF CONCERNS Separation of concerns is the software engineering principle that each component should have a single small job to do so it can do it well. In distributed systems, there are at least three concerns having to do with remote services: what to request, where to do it, how to ask for it.

  15. IDEAL SEPARATION • What to request: the application programmer must figure this out, e.g. access the customer database. • Where to do it: the application programmer should not need to know where, because this adds complexity, and if the location changes, applications break. • How to ask for it: we want a uniform interface.

  16. WHERE TO DO IT: ORGANIZATION OF CLIENTS AND SERVERS [Figure: many clients send requests to a service broker, which hands work to many servers.] A service is a piece of work to do; it will be done by a server. A server is a process. A client who wants a service sends a message to a service broker for that service. The server gets work from the broker and commonly responds directly to the client. More basic approach: each server has a port from which it can receive requests. Difference: in the client-broker-server model, many servers can offer the same service. In the direct client-server approach, the client must request a service from a particular server.

  17. ALTERNATIVE: NAME SERVER [Figure: clients consult a name server, then talk to the chosen server directly.] A service is a piece of work to do; it will be done by a server. The name server knows where services are done. Example: the client requests the address of the server from the name server and then communicates directly with that server. Difference: client-server communication is direct, so it may be more efficient.
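
A minimal sketch of the name-server idea, with hypothetical service names and addresses (an in-process toy, not a real directory service):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-process name server: servers register where a service runs,
// clients look the location up and then talk to the server directly.
public class NameServer {
    private final Map<String, String> registry = new ConcurrentHashMap<>();

    public void register(String service, String address) {
        registry.put(service, address);
    }

    public String lookup(String service) {
        return registry.get(service); // null if nobody offers the service
    }

    public static void main(String[] args) {
        NameServer ns = new NameServer();
        ns.register("customer-db", "host17:5432");       // a server registers itself
        String where = ns.lookup("customer-db");          // a client asks where the service is
        System.out.println("send requests to " + where);  // then contacts the server directly
    }
}
```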

  18. HOW TO ASK FOR IT: OBJECT-BASED • Encapsulation of data behind a functional interface. • Inheritance is optional, but the interface is the contract. • So we need a technique for both synchronous and asynchronous procedure calls.

  19. REFERENCE EXAMPLE: CORBA OBJECT REQUEST BROKER • Send an operation to the ORB with its parameters. • The ORB routes the operation to the proper site for execution. • It arranges for the response to be sent to you directly or indirectly. • Operations can be “events,” so the ORB can allow interrupts from servers to clients.

  20. SUCCESSORS TO CORBA Microsoft Products • COM: allows objects to call one another in a centralized setting: classes + objects of those classes. Can create objects and then invoke them. • DCOM: COM + an Object Request Broker. • ActiveX: DCOM for the Web.

  21. SUCCESSORS TO CORBA Java RMI • Remote Method Invocation (RMI): define a service interface in Java. • Register the server in an RMI registry, i.e., an object request broker. • The client may access the server through the registry. • Provides a notion of distributed garbage collection.
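
A hedged sketch of those steps using the standard java.rmi API; the Echo interface, the service name, and the port are made-up examples, not part of the lecture:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical service interface; every remote method must declare RemoteException.
interface Echo extends Remote {
    String echo(String msg) throws RemoteException;
}

public class EchoServer implements Echo {
    public String echo(String msg) { return "echo: " + msg; }

    public static void main(String[] args) throws Exception {
        // Export the server object and register it under a name in the RMI registry.
        Echo stub = (Echo) UnicastRemoteObject.exportObject(new EchoServer(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("Echo", stub);

        // A client (possibly on another host) would then do:
        // Registry r = LocateRegistry.getRegistry("server-host", 1099);
        // Echo e = (Echo) r.lookup("Echo");
        // System.out.println(e.echo("hello"));
    }
}
```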

  22. SUCCESSORS TO CORBA Enterprise Java Beans • Beans are again objects, but they can be customized at runtime. • They support the distributed transaction notion (covered later) as well as backups. • So the transaction notion for persistent storage is another concern that is nice to separate.

  23. REDUCING BUREAUCRACY: automatic registration • SUN also developed an abstraction known as JINI. • A new device finds a lookup service (like an ORB), uploads its interface, and then anyone can access it. • No need to register. • Requires a trusted environment.

  24. COOPERATING DISTRIBUTED SYSTEMS: LINDA [Figure: processes communicating through a shared tuple space.] • Linda supports a shared data structure called a tuple space. • Linda tuples, like database system records, consist of strings and integers. We will see that in the matrix example below.

  25. LINDA OPERATIONS The operations are out (add a tuple to the space); in (read and remove a tuple from the space); and read (read but don’t remove a tuple from the tuple space). A pattern-matching mechanism is used so that tuples can be extracted selectively by specifying the values or data types of some fields. For example, in (“dennis”, ?x, ?y, …) gets a tuple whose first field contains “dennis” and assigns the values in the second and third fields of the tuple to x and y, respectively.
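
The following is a minimal in-memory sketch of the three operations with wildcard matching (null plays the role of ?x). It is an illustration only, not the actual Linda runtime, and it omits inp/readp and matching on data types:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal tuple space: out adds a tuple, read finds one without removing it,
// in finds and removes one. A null field in a pattern acts as a wildcard.
public class TupleSpace {
    private final List<Object[]> tuples = new ArrayList<>();

    public synchronized void out(Object... tuple) {
        tuples.add(tuple);
        notifyAll();
    }

    public synchronized Object[] read(Object... pattern) throws InterruptedException {
        Object[] t;
        while ((t = find(pattern)) == null) wait(); // block until a match appears
        return t;
    }

    public synchronized Object[] in(Object... pattern) throws InterruptedException {
        Object[] t = read(pattern);
        tuples.remove(t); // removes the matched tuple itself
        return t;
    }

    private Object[] find(Object[] pattern) {
        for (Object[] t : tuples) {
            if (t.length != pattern.length) continue;
            boolean match = true;
            for (int i = 0; i < t.length; i++) {
                if (pattern[i] != null && !pattern[i].equals(t[i])) { match = false; break; }
            }
            if (match) return t;
        }
        return null;
    }

    public static void main(String[] args) throws InterruptedException {
        TupleSpace ts = new TupleSpace();
        ts.out("dennis", 3, 4);
        Object[] t = ts.in("dennis", null, null); // like in("dennis", ?x, ?y)
        System.out.println(Arrays.toString(t));   // [dennis, 3, 4]
    }
}
```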

  26. EXAMPLE: MATRIX MULTIPLICATION There are two matrices A and B. We store A’s rows and B’s columns as tuples: (“A”, 1, A’s first row), (“A”, 2, A’s second row), … (“B”, 1, B’s first column), (“B”, 2, B’s second column), … (“Next”, 15). There is a global counter called Next in the range 1 .. (number of rows of A) x (number of columns of B). A process performs an “in” on Next, records the value, and performs an “out” on Next+1, provided Next is still in its range. Convert Next into the row number i and column number j such that Next = i x (total number of columns) + j.

  27. ACTUAL MULTIPLICATION First find i and j: in (“Next”, ?temp); out (“Next”, temp + 1); convert (temp, i, j). Given i and j, a process just reads the values and outputs the result: read (“A”, i, ?row_values); read (“B”, j, ?col_values); out (“result”, i, j, Dotproduct(row_values, col_values)).
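
A sketch of that worker loop in Java, reusing the hypothetical TupleSpace class above (not the actual Linda system). Indices are zero-based here, so next = i × numCols + j and convert is just a divide and a modulo:

```java
// Worker loop for the Linda-style matrix multiplication. Assumes A's rows and
// B's columns were stored as ("A", i, double[]) and ("B", j, double[]) tuples,
// and a ("Next", 0) counter tuple was placed in the space beforehand.
public class MatrixWorker {
    static double dotProduct(double[] row, double[] col) {
        double sum = 0;
        for (int k = 0; k < row.length; k++) sum += row[k] * col[k];
        return sum;
    }

    static void work(TupleSpace ts, int numRows, int numCols) throws InterruptedException {
        while (true) {
            Object[] counter = ts.in("Next", null);       // grab and remove the counter
            int next = (int) counter[1];
            if (next >= numRows * numCols) {              // nothing left to do
                ts.out("Next", next);                     // put the counter back and stop
                return;
            }
            ts.out("Next", next + 1);                     // release the incremented counter

            int i = next / numCols;                       // convert(next, i, j)
            int j = next % numCols;
            double[] row = (double[]) ts.read("A", i, null)[2];
            double[] col = (double[]) ts.read("B", j, null)[2];
            ts.out("result", i, j, dotProduct(row, col));
        }
    }
}
```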

  28. LINDA IMPLEMENTATION OF SHARED TUPLE SPACE The implementers assert that the work represented by the tuples is large enough so that there is no need for shared memory hardware. The question is how to implement out, in, and read (as well as inp and readp).

  29. BROADCAST IMPLEMENTATION 1 Implement out by broadcasting the argument of out to all sites. (Use a negative-acknowledgement protocol for the broadcast.) To implement read, perform the read from the local memory. To implement in, perform a local read and then attempt to delete the tuple from all other sites. If several sites perform an in, only one site should succeed. One approach is to have the site originating the tuple decide which site deletes. Summary: good for reads and outs, not so good for ins.

  30. BROADCAST IMPLEMENTATION 2 Implement out by writing locally. Implement in and read by a global query. (This may have to be repeated if the data is not present.) Summary: better for out, worse for read, the same for in.

  31. COMMUNICATION REVIEW Basic distributed communication when there is no shared memory: send/receive. Location transparency: a broker, a name server, or a tuple space. Synchrony and asynchrony are both useful (e.g. real-time vs. informational sensors). Other mechanisms are possible.

  32. COMMUNICATION BY SHARED MEMORY: beyond locks Framework: Herlihy, Maurice. “Impossibility and Universality Results for Wait-Free Synchronization,” ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC), 1988. In a system that uses mutual exclusion, it is possible that one process may stop while holding a critical resource and hang the entire system. It is therefore of interest to find “wait-free” primitives, in which no process ever waits for another one. The primitive operations include test-and-set, fetch-and-add, and fetch-and-cons. Herlihy shows that certain operations are strictly more powerful than others for wait-free synchronization.

  33. CAN MAKE ANYTHING WAIT-FREE (at a time price) Don’t maintain the data structure at all. Instead, just keep a history of the operations. enq(x): put enq(x) on the end of the history list (fetch-and-cons). deq: put deq on the end of the history list (fetch-and-cons), then replay the history list and figure out what to return. Not extremely practical: the deq takes O(number of deq’s + number of enq’s) time. The suggestion is to have certain operations reconstruct the state in an efficient manner.

  34. GENERAL METHOD: COMPARE-AND-SWAP Compare-and-swap takes two values, v and v’: if the register’s current value is v, it is replaced by v’; otherwise it is left unchanged. The register’s old value is returned. temp := compare-and-swap(register, 0, i) means: if register = 0 then register := i, else register is unchanged. Use this primitive to perform atomic updates to a data structure. In the following figure, what should the compare-and-swap do?
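
Java exposes the same primitive through java.util.concurrent.atomic (compareAndSet reports success rather than returning the old value). A small sketch of an atomic update done as a read-compute-CAS retry loop:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger register = new AtomicInteger(0);

        // compareAndSet(expected, newValue): replaces the value only if it still
        // equals 'expected', and reports whether the swap happened.
        boolean first = register.compareAndSet(0, 7);   // succeeds: 0 -> 7
        boolean second = register.compareAndSet(0, 9);  // fails: value is now 7
        System.out.println(first + " " + second + " value=" + register.get());

        // Typical lock-free update: read, compute, and retry until the CAS wins.
        int oldVal, newVal;
        do {
            oldVal = register.get();
            newVal = oldVal * 2;
        } while (!register.compareAndSet(oldVal, newVal));
        System.out.println("after doubling: " + register.get());
    }
}
```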

  35. PERSISTENT DATA STRUCTURES AND WAIT-FREEDOM [Figure: the original data structure and a new version that shares most nodes; the current pointer selects between them.] One node is added, one node is removed. To establish the change, change the current pointer. The old tree would still be available. Important point: if the process doing the change should abort, then no other process is affected.
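
A sketch of the figure's idea: build the new version off to the side (sharing unchanged nodes) and establish it with a single compare-and-swap on the current pointer; if some other process got there first, the CAS fails and we simply retry. The Node type and push operation here are hypothetical illustrations, not the lecture's data structure:

```java
import java.util.concurrent.atomic.AtomicReference;

public class PersistentListDemo {
    // Immutable list node: versions share structure and are never modified in place.
    record Node(int value, Node next) {}

    // 'current' points at the latest version; older versions remain readable.
    static final AtomicReference<Node> current = new AtomicReference<>(null);

    // Push by building a new head node and swinging 'current' with one CAS.
    static void push(int value) {
        while (true) {
            Node old = current.get();
            Node replacement = new Node(value, old);
            if (current.compareAndSet(old, replacement)) return; // change established
            // CAS failed: some other process updated first; rebuild and retry.
            // If this process aborted before the CAS, nobody else would be affected.
        }
    }

    public static void main(String[] args) {
        push(1);
        push(2);
        for (Node n = current.get(); n != null; n = n.next()) {
            System.out.println(n.value());
        }
    }
}
```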

  36. Logical Level is Nice, but… • We have talked about some programming constructs one can use above a communications infrastructure. • Understanding that infrastructure will be necessary to understand performance and fault tolerance considerations. • Our discussion of that will come from Joe Conron’s lecture notes datacomessence.ppt

  37. ORDER-PRESERVING BROADCAST PROTOCOLS ON A BROADCAST NET Framework: Chang, Jo-Mei. “Simplifying Distributed Database Systems Design by Using a Broadcast Network,” ACM SIGMOD, June 1984. • Proposes a virtual distributed system that implements ordered atomic broadcast and failure detection. • Shows that this makes designing the rest of the system easier. • Shows that implementing these two primitives isn’t so hard. Paradigm: find an appropriate intermediate level of abstraction that can be implemented and that facilitates the higher functions. [Figure: implement an atomic broadcast network, then build facilities that use the broadcast network.]

  38. RATIONALE • Use property of current networks, which are naturally broadcast, although not so reliable. • Common tasks of distributed systems: Send same information to many sites participating in a transaction (update all copies); reach agreement (e.g. transaction commitment).

  39. DESCRIPTION OF THE ABSTRACT MACHINE Services and assurances it provides: • Atomic broadcast: failure atomicity. If a message is received by an application program at one site, it will be received at all operational sites. • A system-wide clock, and all messages are timestamped in sequence. This is the effective message order. Assumptions: failures are fail-stop, not malicious. So, for example, the token site will not lie about messages or sequence numbers. Network failures require extra memory.

  40. CHANG SCHEME [Figure: the sender broadcasts a message; the token site increments its counter and acks with the counter value; the sender then commits the message.] Tools: token-passing scheme + positive acknowledgments + negative acknowledgments.

  41. BEAUTY OF NEGATIVE ACKNOWLEDGMENT How does a site discover that it hasn’t received a message? A non-token site knows that it has missed a message if there is a gap in the counter values that it has received. In that case, it requests that information from the token site (negative ack). Overhead: one positive acknowledgment per broadcast message vs. one acknowledgment per site per message in a naïve implementation.
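
A sketch of the receiver-side gap check (all names hypothetical). A real implementation would also buffer out-of-order messages until the retransmitted ones arrive, which is omitted here for brevity:

```java
// Hypothetical receiver-side logic for gap detection in the Chang scheme:
// messages carry the token site's counter value; a jump in the counter means
// something was missed, so the site asks the token site to resend (negative ack).
public class GapDetector {
    private long expected = 1; // next counter value this site expects

    public void onBroadcast(long counter, String payload) {
        for (long missing = expected; missing < counter; missing++) {
            requestRetransmission(missing); // negative acknowledgment for each gap
        }
        if (counter >= expected) {
            deliver(counter, payload); // a real system would hold this until the gap is filled
            expected = counter + 1;
        } // counters below 'expected' are duplicates and are ignored
    }

    private void requestRetransmission(long counter) {
        System.out.println("NACK: please resend message " + counter);
    }

    private void deliver(long counter, String payload) {
        System.out.println("deliver #" + counter + ": " + payload);
    }

    public static void main(String[] args) {
        GapDetector site = new GapDetector();
        site.onBroadcast(1, "a");
        site.onBroadcast(3, "c"); // gap: message 2 was missed, so a NACK goes out
    }
}
```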

  42. TOKEN TRANSFER [Figure: the token site offers the token (“Here is the token”); the target site accepts (“I can take it”).] Token transfer is a standard message. The target site must acknowledge. To become a token site, the target site must guarantee that it has received all messages since the last time it was a token site. Failure at a non-token site is detected when it fails to accept token responsibility.

  43. REVISIT ASSUMPTIONS Sites do not lie about their state (i.e., no malicious sites; we could use authentication). Sites tell you when they fail (e.g. through redundant circuitry) or by not responding. If there is a network partition, then no negative ack would occur, so a message m must be kept around until every site has acquired the token after m was sent.

  44. LAMPORT’S Time, Clocks PAPER • What is the proper notion of time for Distributed Systems? • Time Is a Partial Order • The Arrow Relation • Logical Clocks • Ordering All Events using a tie-breaking Clock • Achieving Mutual Exclusion Using This Clock • Correctness • Criticisms • Need for Physical Clocks • Conditions for Physical Clocks • Assumptions About Clocks and Messages • How Do We Achieve Physical Clock Goal?

  45. ROAD MAP: TIME ACCORDING TO LAMPORT • How to model time in distributed systems • Languages & constructs for synchronization

  46. TIME Assuming there are no failures, the most important difference between distributed systems and centralized ones is that distributed systems have no natural notion of global time. • Lamport was the first to build a theory around accepting this fact. • That theory has proven to be surprisingly useful, since the partial order that Lamport proposed is enough for many applications.

  47. WHAT LAMPORT DOES 1. The paper (reference on the next slide) describes a message-based criterion for obtaining a time partial order. 2. It converts this time partial order to a total order. 3. It uses the total order to solve the mutual exclusion problem. 4. It describes a stronger notion of physical time and gives an algorithm that sometimes achieves it (depending on the quality of local clocks and message delivery).

  48. NOTIONS OF TIME IN DISTRIBUTED SYSTEMS Lamport, L. “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, vol. 21, no. 7 (July 1978). • A distributed system consists of a collection of distinct processes that are spatially separated. (Each process has a unique identifier.) • Processes communicate by exchanging messages. • Messages arrive in the order they are sent. (This could be achieved by a hand-shaking protocol.) • Consequence: time is a partial order in distributed systems. Some events may not be ordered.

  49. THE ARROW (partial order) RELATION We say A happens before B, or A → B, if: 1. A and B are in the same process and A happens before B in that process (assume processes are sequential); or 2. A is the sending of a message at one process and B is the receiving of that message at another process; or 3. there is a C such that A → C and C → B. In the jargon, → is an irreflexive partial ordering.

  50. LOGICAL CLOCKS Clocks are a way of assigning a number to an event. Each process has its own clock. For now, clocks will have nothing to do with real time, so they can be implemented by counters with no actual timing mechanism. Clock condition: for any events A and B, if A → B, then C(A) < C(B).
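
A minimal sketch of logical clocks satisfying the clock condition: tick before every local event, attach the clock value to outgoing messages, and on receipt jump past the sender's timestamp. The class and method names are illustrative, not from the paper:

```java
// Minimal Lamport logical clock: one counter per process, advanced by the
// rules below. Guarantees the clock condition: A -> B implies C(A) < C(B).
public class LamportClock {
    private long time = 0;

    // Rule 1: tick on every local event.
    public synchronized long localEvent() {
        return ++time;
    }

    // Rule 2: a send is a local event whose timestamp travels with the message.
    public synchronized long send() {
        return ++time;
    }

    // Rule 3: on receive, jump past the sender's timestamp.
    public synchronized long receive(long messageTimestamp) {
        time = Math.max(time, messageTimestamp) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock p = new LamportClock();
        LamportClock q = new LamportClock();
        long ts = p.send();          // p's clock: 1
        q.localEvent();              // q's clock: 1
        long recv = q.receive(ts);   // q's clock: max(1, 1) + 1 = 2
        System.out.println("send at " + ts + ", receive at " + recv);
    }
}
```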
