
Computing in the RAIN: A Reliable Array of Independent Nodes


Presentation Transcript


  1. Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

  2. Introduction Presenter: Ka Hou Wong

  3. Introduction • RAIN • Research collaboration between Caltech and Jet Propulsion Laboratory • Goal • Identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components

  4. Hardware Platform • Heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces through a network of switches • [Figure: computers C0–C9 attached through switches S0–S3; C = computer, S = switch]

  5. Software Platform • Collection of software modules that run in conjunction with operating system services and standard network protocols • [Figure: protocol stack — Application; MPI/PVM; RAIN; TCP/IP; network connections over Ethernet, Myrinet, ATM, or ServerNet]

  6. Key Building Blocks For Distributed Computer Systems • Communication • Fault-tolerant communication topologies • Reliable communication protocols • Fault Management • Group membership techniques • Storage • Distributed data storage schemes based on error-control codes

  7. Features of RAIN • Communication • Provides fault tolerance in the network via the following mechanisms • Bundled interfaces • Link monitoring • Fault-tolerant interconnect topologies

  8. Features of RAIN (cont’d) • Group membership • Identifies healthy nodes that are participating in the cluster • Data storage • Uses redundant storage schemes over multiple disks for fault tolerance

  9. Communication Presenter: Jahanzeb Faizan

  10. Communication • Fault-tolerant interconnect topologies • Network interfaces

  11. Fault-tolerant Interconnect Topologies • Goal • To connect computer nodes to a network of switches in order to maximize the network’s resistance to partitioning • Key question: how do you connect n compute nodes to a ring of n switches? • [Figure: compute nodes C attached to a ring of switches S]

  12. Naïve Approach • Connect the computer nodes to the nearest switches in a regular fashion • The resulting network is only 1-fault-tolerant: it is easily partitioned by two switch failures • [Figure: ring of switches with locally attached compute nodes, split into two halves by two switch failures]

  13. Diameter Construction Approach • Connect computer nodes to the switching network in the most non-local way possible • Computer nodes are connected to maximally distant switches • The two links of each degree-2 compute node should form a diameter of the switch ring

  14. Diameter Construction Approach (cont’d) • Construction (Diameters): let ds = 4 and dc = 2. For all i, 0 ≤ i < n, label the compute nodes ci and the switches si. Connect switch si to s(i+1) mod n, i.e., the switches form a ring. Connect node ci to switches si and s(i + ⌊n/2⌋ + 1) mod n. • The resulting network can tolerate any 3 faults of any kind (node or switch) without partitioning • [Figure: the construction for n = 8 (switches S0–S7) and n = 7 (switches S0–S6)]
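
The construction above is easy to sanity-check in code. The sketch below builds the n = 8 network and brute-forces the 3-fault claim under one interpretation (an assumption, not stated on the slides): a compute node that loses both of its switches is counted as failed rather than as a partition, and since node failures cannot disconnect other nodes it suffices to test switch failures. All names are illustrative.

```python
from itertools import combinations

def build(n):
    """Adjacency sets for the diameter construction (ds = 4, dc = 2)."""
    adj = {v: set() for i in range(n) for v in (f"s{i}", f"c{i}")}
    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)
    for i in range(n):
        link(f"s{i}", f"s{(i + 1) % n}")             # switches form a ring
        link(f"c{i}", f"s{i}")                       # local switch
        link(f"c{i}", f"s{(i + n // 2 + 1) % n}")    # maximally distant switch
    return adj

def partitioned(adj, dead):
    """True if the survivors split into more than one connected component.
    A compute node with no surviving switch is treated as failed."""
    live = {v for v in adj
            if v not in dead and (v.startswith("s") or adj[v] - dead)}
    start = next(iter(live))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()] & live:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen != live

adj = build(8)
switches = [f"s{i}" for i in range(8)]
assert not any(partitioned(adj, set(f)) for f in combinations(switches, 3))
print("n = 8: no 3 switch faults partition the surviving network")
```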

  15. Protocol for Link Failure • Goal • Monitoring of available paths • Requirements • Correctness • Bounded Slack • Stability

  16. Correctness • The protocol must correctly reflect the true state of the channel • With bi-directional communication between A and B, if one side sees time-outs, both sides should mark the channel as down

  17. Bounded Slack • Ensure that both endpoints have a maximum slack of n transitions • Without bounded slack, node A may see many more transitions than node B; with it, nodes A and B see tightly coupled views of the channel • [Figure: link histories at A and B over time, U = link up, D = link down]

  18. Stability • Each real channel event (i.e., a time-out) should cause at most some bounded number of state transitions at each endpoint

  19. Consistent-History Protocol for Link Failures • Monitors available paths in the network for proper functioning • A modified ping protocol guarantees that each side of the communication channel sees the same history (bounded slack)

  20. The Protocol • Reliable message passing • Implementation: • Sliding window protocol • No existing reliable communication layer is needed; reliable messaging is built on top of the ping messages

  21. The Protocol (cont’d) • Ping messages are sent and received using unreliable messaging; tokens are sent and received using reliable messaging • Link up/down detection is implemented by pings or hardware feedback • A consistent history is maintained; tokens are sent on request

  22. Demonstration • [Figure: protocol state machine starting in the Up, t = 2 state; states are Up/Down annotated with a token count t (0, 1, or 2), and edges are labeled “trigger event / tokens sent”, e.g. T/1, T/0, tout/1, where t = token count, T = token arrival event, and tout = time-out event]
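
As a rough illustration of the token idea behind this state machine, here is a hedged sketch (class structure and names are assumptions, not the RAIN implementation): an endpoint may record an up/down transition only while it holds a token, and each recorded transition ships a token to the peer over reliable messaging, so neither side's history can run more than the initial token count ahead of the other's.

```python
class LinkEndpoint:
    """One side of a monitored channel. 'tokens' mirrors t in the state
    machine above; the rules here are a simplification, since the slide
    distinguishes T/1 and T/0 edges."""

    def __init__(self, tokens=2):
        self.up = True          # this endpoint's view of the channel
        self.tokens = tokens    # tokens currently held (t in the diagram)
        self.history = []       # recorded U/D transitions

    def on_timeout(self):
        """tout: pings stopped arriving; record 'down' only if a token
        is available, otherwise wait (this is what bounds the slack)."""
        if self.up and self.tokens > 0:
            self.up = False
            self.tokens -= 1
            self.history.append("D")
            return 1            # 'tout/1': one token is sent to the peer
        return 0

    def on_token(self):
        """T: a token arrived over reliable messaging, so the channel
        demonstrably works; replenish tokens and record 'up'."""
        self.tokens += 1
        if not self.up:
            self.up = True
            self.history.append("U")
```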

  23. Group Membership Presenter: Jonathan Sippel

  24. Group Membership • Provides a level of agreement between non-faulty processes in a distributed application • Tolerates permanent and transient failures in both nodes and links • Based on two mechanisms • Token Mechanism • 911 Mechanism

  25. Token Mechanism • Nodes in the membership are ordered in a logical ring • Token passed at a regular interval from one node to the next • Token carries the authoritative knowledge of the membership • Node updates its local membership information according to the received token
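
A minimal sketch of the token mechanism, assuming a Token structure that carries a sequence number and the membership list (all field and class names here are illustrative, not from the RAIN code):

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    seq: int = 0                           # bumped at every hop
    members: list = field(default_factory=list)

class Node:
    def __init__(self, name):
        self.name = name
        self.membership = []               # local view of the group
        self.token_copy = None             # local copy, used for 911 arbitration

    def on_token(self, token):
        """Adopt the token's authoritative membership, keep a local copy,
        and name the successor in the logical ring to forward the token to."""
        token.seq += 1
        self.membership = list(token.members)
        self.token_copy = Token(token.seq, list(token.members))
        nxt = token.members[(token.members.index(self.name) + 1)
                            % len(token.members)]
        return nxt
```

Passing the token around a membership of, say, ["A", "B", "C", "D"] updates each node's local view in ring order at every hop.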

  26. Token Mechanism (cont’d) • Aggressive failure detection • [Figure: token passing among nodes A–D, before and after a failure is detected]

  27. Token Mechanism (cont’d) • Conservative failure detection • [Figure: token passing among nodes A–D, before and after a failure is detected]

  28. 911 Mechanism • When is the 911 Mechanism used? • Token Regeneration - Regenerate a token that is lost if a node or a link fails • Dynamic Scalability - Add a new node to the system • What is a 911 message? • Request for the right to regenerate the lost token • Must be approved by all the live nodes in the membership

  29. Token Regeneration • Only one node is allowed to regenerate the token • Token sequence number is used to guarantee mutual exclusivity and is incremented every time the token is passed from one node to the next • Each node makes a local copy of the token on receipt • Sequence number on the node’s local copy of the token is added to the 911 message and compared to all the sequence numbers on the local copies of the token on the other live nodes • 911 request is denied by any node with a more recent copy of the token
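
The sequence-number arbitration reduces to a simple comparison. A sketch, assuming the requester must collect approval from every live member (function names hypothetical):

```python
def approve_911(local_copy_seq, request_seq):
    """A live node's vote on a 911 regeneration request: deny if our
    local copy of the token is more recent than the requester's."""
    return request_seq >= local_copy_seq

def may_regenerate(request_seq, live_copy_seqs):
    """The requester regenerates the token only if every live node
    approves, i.e., nobody holds a fresher copy of the token."""
    return all(approve_911(s, request_seq) for s in live_copy_seqs)

# e.g. a requester holding seq 42 wins; one holding seq 41 is denied
# because some live node has already seen the token at seq 42
assert may_regenerate(42, [40, 41, 42]) and not may_regenerate(41, [40, 42])
```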

  30. Dynamic Scalability • 911 message sent by a new node to join the group • Receiving node • Treats the message as a join request because the originating node is not in the membership • Updates the membership the next time it receives the token and sends it to the new node

  31. Data Storage • The RAIN system provides distributed storage based on a class of erasure-correcting codes called array codes, which give a mathematical means of representing data so that lost information can be recovered

  32. Data Storage (cont’d) • Array codes • With an (n, k) erasure-correcting code, k symbols of original data are represented with n symbols of encoded data • With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost • A code is said to be Maximum Distance Separable (MDS) if m = n – k • The only operations necessary to encode/decode an array code are simple binary XOR operations

  33. Data Storage (cont’d) • Data placement scheme for a (6, 4) array code: each of the six nodes stores two data symbols and one parity symbol • Node 1: a, A, B+D+e+f • Node 2: b, B, C+E+f+a • Node 3: c, C, D+F+a+b • Node 4: d, D, E+A+b+c • Node 5: e, E, F+B+c+d • Node 6: f, F, A+C+d+e

  34. Data Storage (cont’d) • Recovery for the (6, 4) array code when nodes 1 and 2 are erased, losing a, A, b, B; nodes 3–6 survive with their data and the parities D+F+a+b, E+A+b+c, F+B+c+d, A+C+d+e • A = C + d + e + (A + C + d + e) • b = A + (E + A + b + c) + c + E • a = b + (D + F + a + b) + D + F • B = F + c + (F + B + c + d) + d
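
These recovery chains are pure XOR, so they are easy to verify. A quick self-contained check (symbol values are random bytes; p3–p6 name the surviving nodes' parity symbols):

```python
import random

v = {s: random.getrandbits(8) for s in "abcdefABCDEF"}
p3 = v["D"] ^ v["F"] ^ v["a"] ^ v["b"]    # node 3 parity: D+F+a+b
p4 = v["E"] ^ v["A"] ^ v["b"] ^ v["c"]    # node 4 parity: E+A+b+c
p5 = v["F"] ^ v["B"] ^ v["c"] ^ v["d"]    # node 5 parity: F+B+c+d
p6 = v["A"] ^ v["C"] ^ v["d"] ^ v["e"]    # node 6 parity: A+C+d+e

A = v["C"] ^ v["d"] ^ v["e"] ^ p6         # first equation
b = A ^ p4 ^ v["c"] ^ v["E"]              # uses the recovered A
a = b ^ p3 ^ v["D"] ^ v["F"]              # uses the recovered b
B = v["F"] ^ v["c"] ^ p5 ^ v["d"]         # last equation
assert (a, A, b, B) == (v["a"], v["A"], v["b"], v["B"])
```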

  35. Data Storage (cont’d) • Distributed store/retrieve operations • For a store operation a block of data of size d is encoded into n symbols, each of size d/k, using an (n, k) MDS array code • For a retrieve operation, symbols are collected from any k nodes and decoded • The original data can be recovered with up to n – k node failures • The encoding scheme provides for dynamic reconfigurability and load balancing
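
To make the store/retrieve flow concrete, here is a minimal sketch using the simplest MDS array code, a single-parity (k+1, k) XOR code with m = n − k = 1. The actual RAIN codes tolerate larger m; this only illustrates splitting a block into symbols of size d/k and decoding from any k of them.

```python
from functools import reduce

def store(block, k):
    """Encode: split a d-byte block into k symbols of size d/k and add
    one XOR parity symbol, giving n = k + 1 symbols (one per node)."""
    assert len(block) % k == 0
    size = len(block) // k
    symbols = [block[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*symbols))
    return symbols + [parity]

def retrieve(symbols, k):
    """Decode from any k surviving symbols (None marks a lost node)."""
    missing = [i for i, s in enumerate(symbols) if s is None]
    assert len(missing) <= 1, "a (k+1, k) code corrects one erasure"
    if missing:
        cols = zip(*(s for s in symbols if s is not None))
        symbols[missing[0]] = bytes(reduce(lambda x, y: x ^ y, c) for c in cols)
    return b"".join(symbols[:k])

data = b"RAIN blocks!"            # d = 12 bytes, k = 4, symbols of 3 bytes
stored = store(data, 4)
stored[2] = None                  # one node fails
assert retrieve(stored, 4) == data
```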

  36. RAIN Contributions to Distributed Computing Systems • Fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures • Fault management techniques based on group membership • Data storage schemes based on computationally efficient error-control codes

  37. References • Vasken Bohossian, Chenggong C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, and Jehoshua Bruck, “Computing in the RAIN: A Reliable Array of Independent Nodes,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 2, February 2001 • http://www.rainfinity.com/
