
Computing in the RAIN: A Reliable Array of Independent Nodes


Presentation Transcript


  1. Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

  2. Introduction Presenter: Ka Hou Wong

  3. Introduction • RAIN • Research collaboration between Caltech and Jet Propulsion Laboratory • Goal • Identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components

  4. Hardware Platform • Heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces through a network of switches • [Figure: computers C0–C9 attached through switches S0–S3; C = computer, S = switch]

  5. Software Platform • Collection of software modules that run in conjunction with operating system services and standard network protocols • [Figure: protocol stack — Application; MPI/PVM; RAIN; TCP/IP; network connections over Ethernet, Myrinet, ATM, or ServerNet]

  6. Key Building Blocks For Distributed Computer Systems • Communication • Fault-tolerant communication topologies • Reliable communication protocols • Fault Management • Group membership techniques • Storage • Distributed data storage schemes based on error-control codes

  7. Features of RAIN • Communication • Provides fault tolerance in the network via the following mechanisms • Bundled interfaces • Link monitoring • Fault-tolerant interconnect topologies

  8. Features of RAIN (cont’d) • Group membership • Identifies healthy nodes that are participating in the cluster • Data storage • Uses redundant storage schemes over multiple disks for fault tolerance

  9. Communication Presenter: Jahanzeb Faizan

  10. Communication • Fault-tolerant interconnect topologies • Network interfaces

  11. Fault-tolerant Interconnect Topologies • Goal • To connect computer nodes to a network of switches in order to maximize the network’s resistance to partitioning • Key question: how do you connect n compute nodes to a ring of n switches? • [Figure: compute nodes C attached to a ring of switches S]

  12. Naïve Approach • Connect the computer nodes to the nearest switches in a regular fashion • The resulting network is only 1-fault-tolerant: it is easily partitioned by two switch failures • [Figure: ring of switches with locally attached compute nodes, split into two halves by two switch failures]

  13. Diameter Construction Approach • Connect computer nodes to the switching network in the most non-local way possible • Computer nodes are connected to maximally distant switches • The two links of each degree-2 compute node should form a diameter of the switch ring

  14. Diameter Construction Approach (cont’d) • Construction (Diameters): let ds = 4 and dc = 2. For all i, 0 ≤ i < n, label the compute nodes ci and the switches si. Connect switch si to s(i+1) mod n, i.e., the switches form a ring. Connect node ci to switches si and s(i + ⌊n/2⌋ + 1) mod n. • The resulting network can tolerate any 3 faults of any kind (node or switch) without partitioning • [Figure: the construction for n = 8 (switches S0–S7) and n = 7 (switches S0–S6)]
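
The construction above is easy to sanity-check in code. The sketch below builds the n = 8 network and brute-forces the 3-fault claim under one interpretation (an assumption, not stated on the slides): a compute node that loses both of its switches is counted as failed rather than as a partition, and since node failures cannot disconnect other nodes it suffices to test switch failures. All names are illustrative.

```python
from itertools import combinations

def build(n):
    """Adjacency sets for the diameter construction (ds = 4, dc = 2)."""
    adj = {v: set() for i in range(n) for v in (f"s{i}", f"c{i}")}
    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)
    for i in range(n):
        link(f"s{i}", f"s{(i + 1) % n}")             # switches form a ring
        link(f"c{i}", f"s{i}")                       # local switch
        link(f"c{i}", f"s{(i + n // 2 + 1) % n}")    # maximally distant switch
    return adj

def partitioned(adj, dead):
    """True if the survivors split into more than one connected component.
    A compute node with no surviving switch is treated as failed."""
    live = {v for v in adj
            if v not in dead and (v.startswith("s") or adj[v] - dead)}
    start = next(iter(live))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()] & live:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen != live

adj = build(8)
switches = [f"s{i}" for i in range(8)]
assert not any(partitioned(adj, set(f)) for f in combinations(switches, 3))
print("n = 8: no 3 switch faults partition the surviving network")
```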

  15. Protocol for Link Failure • Goal • Monitoring of available paths • Requirements • Correctness • Bounded Slack • Stability

  16. Correctness • The protocol must correctly reflect the true state of the channel • With bi-directional communication between A and B, if one side sees time-outs, both sides should mark the channel as down

  17. Bounded Slack • Ensure that both endpoints have a maximum slack of n transitions • Without bounded slack, node A may see many more transitions than node B; with it, nodes A and B see tightly coupled views of the channel • [Figure: link histories at A and B over time, U = link up, D = link down]

  18. Stability • Each real channel event (i.e., a time-out) should cause at most some bounded number of state transitions at each endpoint

  19. Consistent-History Protocol for Link Failures • Monitors available paths in the network for proper functioning • A modified ping protocol guarantees that each side of the communication channel sees the same history (bounded slack)

  20. The Protocol • Reliable message passing • Implementation: • Sliding window protocol • No existing reliable communication layer is needed; reliable messaging is built on top of the ping messages

  21. The Protocol (cont’d) • Ping messages are sent and received using unreliable messaging; tokens are sent and received using reliable messaging • Link up/down detection is implemented by pings or hardware feedback • A consistent history is maintained; tokens are sent on request

  22. Demonstration • [Figure: protocol state machine starting in the Up, t = 2 state; states are Up/Down annotated with a token count t (0, 1, or 2), and edges are labeled “trigger event / tokens sent”, e.g. T/1, T/0, tout/1, where t = token count, T = token arrival event, and tout = time-out event]
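
As a rough illustration of the token idea behind this state machine, here is a hedged sketch (class structure and names are assumptions, not the RAIN implementation): an endpoint may record an up/down transition only while it holds a token, and each recorded transition ships a token to the peer over reliable messaging, so neither side's history can run more than the initial token count ahead of the other's.

```python
class LinkEndpoint:
    """One side of a monitored channel. 'tokens' mirrors t in the state
    machine above; the rules here are a simplification, since the slide
    distinguishes T/1 and T/0 edges."""

    def __init__(self, tokens=2):
        self.up = True          # this endpoint's view of the channel
        self.tokens = tokens    # tokens currently held (t in the diagram)
        self.history = []       # recorded U/D transitions

    def on_timeout(self):
        """tout: pings stopped arriving; record 'down' only if a token
        is available, otherwise wait (this is what bounds the slack)."""
        if self.up and self.tokens > 0:
            self.up = False
            self.tokens -= 1
            self.history.append("D")
            return 1            # 'tout/1': one token is sent to the peer
        return 0

    def on_token(self):
        """T: a token arrived over reliable messaging, so the channel
        demonstrably works; replenish tokens and record 'up'."""
        self.tokens += 1
        if not self.up:
            self.up = True
            self.history.append("U")
```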

  23. Group Membership Presenter: Jonathan Sippel

  24. Group Membership • Provides a level of agreement between non-faulty processes in a distributed application • Tolerates permanent and transient failures in both nodes and links • Based on two mechanisms • Token Mechanism • 911 Mechanism

  25. Token Mechanism • Nodes in the membership are ordered in a logical ring • Token passed at a regular interval from one node to the next • Token carries the authoritative knowledge of the membership • Node updates its local membership information according to the received token
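
A minimal sketch of the token mechanism, assuming a Token structure that carries a sequence number and the membership list (all field and class names here are illustrative, not from the RAIN code):

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    seq: int = 0                           # bumped at every hop
    members: list = field(default_factory=list)

class Node:
    def __init__(self, name):
        self.name = name
        self.membership = []               # local view of the group
        self.token_copy = None             # local copy, used for 911 arbitration

    def on_token(self, token):
        """Adopt the token's authoritative membership, keep a local copy,
        and name the successor in the logical ring to forward the token to."""
        token.seq += 1
        self.membership = list(token.members)
        self.token_copy = Token(token.seq, list(token.members))
        nxt = token.members[(token.members.index(self.name) + 1)
                            % len(token.members)]
        return nxt
```

Passing the token around a membership of, say, ["A", "B", "C", "D"] updates each node's local view in ring order at every hop.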

  26. Token Mechanism (cont’d) • Aggressive failure detection • [Figure: token passing among nodes A–D, before and after a failure is detected]

  27. Token Mechanism (cont’d) • Conservative failure detection • [Figure: token passing among nodes A–D, before and after a failure is detected]

  28. 911 Mechanism • When is the 911 Mechanism used? • Token Regeneration - Regenerate a token that is lost if a node or a link fails • Dynamic Scalability - Add a new node to the system • What is a 911 message? • Request for the right to regenerate the lost token • Must be approved by all the live nodes in the membership

  29. Token Regeneration • Only one node is allowed to regenerate the token • Token sequence number is used to guarantee mutual exclusivity and is incremented every time the token is passed from one node to the next • Each node makes a local copy of the token on receipt • Sequence number on the node’s local copy of the token is added to the 911 message and compared to all the sequence numbers on the local copies of the token on the other live nodes • 911 request is denied by any node with a more recent copy of the token
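
The sequence-number arbitration reduces to a simple comparison. A sketch, assuming the requester must collect approval from every live member (function names hypothetical):

```python
def approve_911(local_copy_seq, request_seq):
    """A live node's vote on a 911 regeneration request: deny if our
    local copy of the token is more recent than the requester's."""
    return request_seq >= local_copy_seq

def may_regenerate(request_seq, live_copy_seqs):
    """The requester regenerates the token only if every live node
    approves, i.e., nobody holds a fresher copy of the token."""
    return all(approve_911(s, request_seq) for s in live_copy_seqs)

# e.g. a requester holding seq 42 wins; one holding seq 41 is denied
# because some live node has already seen the token at seq 42
assert may_regenerate(42, [40, 41, 42]) and not may_regenerate(41, [40, 42])
```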

  30. Dynamic Scalability • 911 message sent by a new node to join the group • Receiving node • Treats the message as a join request because the originating node is not in the membership • Updates the membership the next time it receives the token and sends it to the new node

  31. Data Storage • The RAIN system provides distributed storage based on a class of erasure-correcting codes called array codes, which give a mathematical means of representing data so that lost information can be recovered

  32. Data Storage (cont’d) • Array codes • With an (n, k) erasure-correcting code, k symbols of original data are represented with n symbols of encoded data • With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost • A code is said to be Maximum Distance Separable (MDS) if m = n – k • The only operations necessary to encode/decode an array code are simple binary XOR operations

  33. Data Storage (cont’d) • Data placement scheme for a (6, 4) array code: each of the six nodes stores two data symbols and one parity symbol • Node 1: a, A, B+D+e+f • Node 2: b, B, C+E+f+a • Node 3: c, C, D+F+a+b • Node 4: d, D, E+A+b+c • Node 5: e, E, F+B+c+d • Node 6: f, F, A+C+d+e

  34. Data Storage (cont’d) • Recovery for the (6, 4) array code when nodes 1 and 2 are erased, losing a, A, b, B; nodes 3–6 survive with their data and the parities D+F+a+b, E+A+b+c, F+B+c+d, A+C+d+e • A = C + d + e + (A + C + d + e) • b = A + (E + A + b + c) + c + E • a = b + (D + F + a + b) + D + F • B = F + c + (F + B + c + d) + d
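
These recovery chains are pure XOR, so they are easy to verify. A quick self-contained check (symbol values are random bytes; p3–p6 name the surviving nodes' parity symbols):

```python
import random

v = {s: random.getrandbits(8) for s in "abcdefABCDEF"}
p3 = v["D"] ^ v["F"] ^ v["a"] ^ v["b"]    # node 3 parity: D+F+a+b
p4 = v["E"] ^ v["A"] ^ v["b"] ^ v["c"]    # node 4 parity: E+A+b+c
p5 = v["F"] ^ v["B"] ^ v["c"] ^ v["d"]    # node 5 parity: F+B+c+d
p6 = v["A"] ^ v["C"] ^ v["d"] ^ v["e"]    # node 6 parity: A+C+d+e

A = v["C"] ^ v["d"] ^ v["e"] ^ p6         # first equation
b = A ^ p4 ^ v["c"] ^ v["E"]              # uses the recovered A
a = b ^ p3 ^ v["D"] ^ v["F"]              # uses the recovered b
B = v["F"] ^ v["c"] ^ p5 ^ v["d"]         # last equation
assert (a, A, b, B) == (v["a"], v["A"], v["b"], v["B"])
```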

  35. Data Storage (cont’d) • Distributed store/retrieve operations • For a store operation a block of data of size d is encoded into n symbols, each of size d/k, using an (n, k) MDS array code • For a retrieve operation, symbols are collected from any k nodes and decoded • The original data can be recovered with up to n – k node failures • The encoding scheme provides for dynamic reconfigurability and load balancing
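
To make the store/retrieve flow concrete, here is a minimal sketch using the simplest MDS array code, a single-parity (k+1, k) XOR code with m = n − k = 1. The actual RAIN codes tolerate larger m; this only illustrates splitting a block into symbols of size d/k and decoding from any k of them.

```python
from functools import reduce

def store(block, k):
    """Encode: split a d-byte block into k symbols of size d/k and add
    one XOR parity symbol, giving n = k + 1 symbols (one per node)."""
    assert len(block) % k == 0
    size = len(block) // k
    symbols = [block[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*symbols))
    return symbols + [parity]

def retrieve(symbols, k):
    """Decode from any k surviving symbols (None marks a lost node)."""
    missing = [i for i, s in enumerate(symbols) if s is None]
    assert len(missing) <= 1, "a (k+1, k) code corrects one erasure"
    if missing:
        cols = zip(*(s for s in symbols if s is not None))
        symbols[missing[0]] = bytes(reduce(lambda x, y: x ^ y, c) for c in cols)
    return b"".join(symbols[:k])

data = b"RAIN blocks!"            # d = 12 bytes, k = 4, symbols of 3 bytes
stored = store(data, 4)
stored[2] = None                  # one node fails
assert retrieve(stored, 4) == data
```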

  36. RAIN Contributions to Distributed Computing Systems • Fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures • Fault management techniques based on group membership • Data storage schemes based on computationally efficient error-control codes

  37. References • Vasken Bohossian, Chenggong C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, and Jehoshua Bruck, “Computing in the RAIN: A Reliable Array of Independent Nodes,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 2, February 2001 • http://www.rainfinity.com/
