ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
Reconfiguration

Overview
• Introduction and basic concept
• Fault model and fault coverage
• Two example architectures
  • n-cubes
  • de Bruijn networks
• Summary
ECE 753 Fault Tolerant Computing

Introduction and basic concept
• References
  • Text and some other material
• Basic concept
  • Must avoid using the faulty unit(s), whether it be a process, processor, program, data, link between a pair of units, etc.
• Two types of reconfiguration
  • Fault tolerance via degraded performance
  • Fault tolerance provided by sufficient redundancy at the design stage

Fault model and fault coverage
• Candidate architectures
  • Bus-based systems
  • Crossbar-based systems
  • Mesh-connected systems
  • Hypercube networks
  • de Bruijn networks
  • Tree networks
  • Hexagonal networks
  • Other regular architectures

Fault model and fault coverage (contd.)
• System model
  • Units are represented as nodes
  • Interconnects are represented as links between nodes
• Failure models
  • Nodes may fail or go down: the corresponding unit is unable to interact with other units
  • Interconnects may fail or go down: no units can communicate using the failed or down link

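The system and failure models above can be sketched as a small graph routine. This is a minimal illustration only: the function names `surviving_graph` and `is_connected` and the four-unit ring are assumptions made for this sketch, not part of the course material.

```python
from collections import deque

def surviving_graph(adj, failed_nodes=(), failed_links=()):
    """Return the adjacency map left after removing failed nodes and links."""
    dead_nodes = set(failed_nodes)
    dead_links = {frozenset(link) for link in failed_links}
    return {
        u: {v for v in nbrs
            if v not in dead_nodes and frozenset((u, v)) not in dead_links}
        for u, nbrs in adj.items()
        if u not in dead_nodes
    }

def is_connected(adj):
    """True if every surviving node can reach every other surviving node."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)

# Hypothetical four-unit ring A-B-C-D-A, used only as an example.
ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
```

A failed node removes the unit and every link touching it, while a failed link removes only that link. In the ring, any single node or link failure leaves the survivors connected, but failing both A and C separates B from D.
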
Fault model and fault coverage (contd.)
• Objective of fault tolerance
  • Any pair of units must be able to interact in the presence of
    • Node failures
    • Link failures
• Performance metrics
  • How many faults (node or link failures) can be tolerated (fault coverage)
  • Impact on the route length: the number of hops between pairs of nodes (same as the length of the shortest path between a pair of nodes)
  • Can pay attention to the worst-case scenario or to the impact on the average length of the paths

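The two route-length metrics can be computed with a breadth-first search from every node. The sketch below is an assumption-laden illustration: `hop_counts`, `route_metrics`, and the 2-cube example graphs are names chosen here, and the average is taken over all ordered pairs of distinct nodes.

```python
from collections import deque

def hop_counts(adj, src):
    """Shortest hop count from src to every node reachable from it."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def route_metrics(adj):
    """Return (worst_case_hops, average_hops), or None if some pair is cut off."""
    lengths = []
    for src in adj:
        dist = hop_counts(adj, src)
        if len(dist) < len(adj):
            return None  # at least one node is unreachable from src
        lengths.extend(d for node, d in dist.items() if node != src)
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)

# A 2-cube, healthy and with the link 00-01 down (example data only).
healthy = {"00": {"01", "10"}, "01": {"00", "11"},
           "10": {"00", "11"}, "11": {"01", "10"}}
link_down = {"00": {"10"}, "01": {"11"},
             "10": {"00", "11"}, "11": {"01", "10"}}
```

For the healthy 2-cube this gives a worst case of 2 hops and an average of 4/3. With the link 00-01 down the network stays connected, but the worst case grows to 3 hops and the average to 5/3, which is the kind of degradation the metrics are meant to capture.
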
Two example architectures
• Hypercube architecture
  • An n-cube
    • Contains 2^n nodes
    • Encode the 2^n nodes as binary n-tuples
    • Two nodes are connected using a bi-directional link if and only if the Hamming distance between them is exactly 1
[Figure: a 2-cube with nodes 00, 01, 10, 11, and a 3-cube]

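The n-cube definition translates directly into code. In this sketch, which is an illustration rather than anything taken from the slides, nodes are the integers 0 to 2^n − 1 read as n-bit labels, and the helper names `hypercube` and `hamming` are assumptions.

```python
def hamming(a, b):
    """Hamming distance between two node labels given as integers."""
    return bin(a ^ b).count("1")

def hypercube(n):
    """Adjacency map of the n-cube: flipping any one of the n bits gives a neighbor."""
    return {
        node: {node ^ (1 << bit) for bit in range(n)}
        for node in range(2 ** n)
    }

cube3 = hypercube(3)
for node in sorted(cube3):
    neighbors = ", ".join(format(v, "03b") for v in sorted(cube3[node]))
    print(format(node, "03b"), "->", neighbors)
```

Every node ends up with exactly n neighbors, each at Hamming distance 1.
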
Two example architectures (contd.)
• Hypercube architecture (contd.)
  • A method of sending a message between a pair of nodes
    • Find a route between the two nodes
  • An algorithm for finding a route between nodes n1 and n2
    • Use the binary encodings of n1 and n2; let them be a1 a2 … ak and b1 b2 … bk
    • Determine the locations where these two strings differ, and complement one bit at a time to find a route between the two nodes
    • The length of such a path can be no larger than k

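The routing algorithm above can be sketched as follows. The function name `bit_fix_route` and the left-to-right correction order are assumptions for this illustration; any order of correcting the differing bits gives a valid route of the same length.

```python
def bit_fix_route(src, dst):
    """Route from src to dst (equal-length bit strings), fixing one differing bit per hop."""
    if len(src) != len(dst):
        raise ValueError("labels must have the same length")
    route = [src]
    current = list(src)
    for i, (a, b) in enumerate(zip(src, dst)):
        if a != b:
            current[i] = b              # complement the i-th bit
            route.append("".join(current))
    return route

print(bit_fix_route("0011", "0101"))  # ['0011', '0111', '0101']
```

The number of hops equals the Hamming distance between the two labels, so it can never exceed k.
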
Two example architectures (contd.)
• Hypercube architecture (contd.)
  • Finding a route in the presence of a faulty node
    • Consider an example: find a path between nodes 0011 and 0101 when node 0111 is faulty
    • A possible path is 0011 → 0001 → 0101
  • Result: between every pair of nodes there are k node-disjoint paths
  • The paths are
    • Complement one bit at a time, starting from the leftmost bit and keeping it that way. Thus we will have k starts, and these will lead to k disjoint paths with some careful construction of the paths

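One standard way to build the k node-disjoint paths is sketched below. It is an assumption that this matches the "careful construction" the slide has in mind: each path starts by complementing a different bit. Paths that start on a differing bit correct the differing bits in rotated order, and paths that start on an agreeing bit take a detour through that bit and undo it at the end. The names `flip`, `disjoint_paths`, and `route_avoiding` are hypothetical.

```python
def flip(label, i):
    """Complement bit i of a bit-string label."""
    return label[:i] + ("1" if label[i] == "0" else "0") + label[i + 1:]

def disjoint_paths(src, dst):
    """Return k node-disjoint paths between two distinct k-bit labels."""
    if len(src) != len(dst) or src == dst:
        raise ValueError("need two distinct labels of equal length")
    k = len(src)
    diff = [i for i in range(k) if src[i] != dst[i]]
    same = [i for i in range(k) if src[i] == dst[i]]
    paths = []
    # Paths that start on a differing bit: correct the bits in rotated order.
    for start in range(len(diff)):
        path = [src]
        for i in diff[start:] + diff[:start]:
            path.append(flip(path[-1], i))
        paths.append(path)
    # Paths that start on an agreeing bit: detour, correct, then undo the detour.
    for p in same:
        path = [src, flip(src, p)]
        for i in diff:
            path.append(flip(path[-1], i))
        path.append(flip(path[-1], p))
        paths.append(path)
    return paths

def route_avoiding(src, dst, faulty):
    """First of the disjoint paths whose intermediate nodes are all fault-free."""
    for path in disjoint_paths(src, dst):
        if not any(node in faulty for node in path[1:-1]):
            return path
    return None

print(route_avoiding("0011", "0101", {"0111"}))  # ['0011', '0001', '0101']
```

Because the paths share no intermediate node, a single faulty node can block at most one of them, which is why the example from the slide still finds the route 0011 → 0001 → 0101.
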
Two example architectures (contd.)
• Hypercube architecture (contd.)
  • In a hypercube of dimension k, up to k-1 node faults can be tolerated
  • Some faults cause a degradation, as the path length starts to increase after certain faults
  • The number of link faults that can be tolerated is at least the number of tolerable node faults
  • Problems that have been addressed in the literature
    • Centralized observer (as discussed above)
    • Distributed algorithm in which every node knows the location of the faulty node
    • Distributed algorithms in which only the neighbors of the faulty node know its status

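The k-1 bound can be checked by brute force on small cubes. The sketch below is only a sanity check under stated assumptions (nodes as integers, hypothetical helper names `survivors_connected` and `tolerates`); it enumerates every fault set of a given size, so it is practical only for small k.

```python
from collections import deque
from itertools import combinations

def survivors_connected(k, faulty):
    """True if the fault-free nodes of the k-cube can all still reach each other."""
    faulty = set(faulty)
    alive = [v for v in range(2 ** k) if v not in faulty]
    if not alive:
        return True
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        u = queue.popleft()
        for bit in range(k):
            v = u ^ (1 << bit)          # neighbor across dimension `bit`
            if v not in faulty and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(alive)

def tolerates(k, f):
    """True if every possible set of f faulty nodes leaves the k-cube connected."""
    return all(survivors_connected(k, s)
               for s in combinations(range(2 ** k), f))

print(tolerates(3, 2), tolerates(3, 3))  # True False
```

With k faults the guarantee is lost because the k neighbors of a single node can all fail and isolate it. For example, faults at 001, 010, and 100 cut off node 000 in the 3-cube.
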
Two example architectures (contd.)
• de Bruijn networks
  • Contains 2^n nodes
  • Encode the 2^n nodes as binary n-tuples
  • Two nodes are connected using a bi-directional link if and only if the second node can be derived by shifting the first node left or right by one position, with either a 0 or a 1 shifted in
  • An example de Bruijn network with 3-bit labels (k = 3) is given next

Two example architectures (contd.)
• de Bruijn networks (contd.)
[Figure: de Bruijn network on the eight nodes 000, 001, 010, 011, 100, 101, 110, 111]

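The network in the figure can be regenerated from the shift rule. This sketch assumes the usual binary de Bruijn construction, in which a one-position shift may bring in either a 0 or a 1, and it drops the self-loops at the all-zeros and all-ones nodes. The function name `de_bruijn` is hypothetical.

```python
def de_bruijn(n):
    """Undirected binary de Bruijn network on n-bit labels (self-loops dropped)."""
    labels = [format(i, f"0{n}b") for i in range(2 ** n)]
    adj = {u: set() for u in labels}
    for u in labels:
        for bit in "01":
            # Left shift with `bit` shifted in, then right shift with `bit` shifted in.
            for v in (u[1:] + bit, bit + u[:-1]):
                if v != u:
                    adj[u].add(v)
                    adj[v].add(u)
    return adj

net = de_bruijn(3)
for node in sorted(net):
    print(node, "->", ", ".join(sorted(net[node])))
```

Unlike the hypercube, the node degree does not grow with n: each node has at most four neighbors, and nodes such as 000 and 111 have only two.
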
Two example architectures (contd.)
• de Bruijn networks (contd.)
  • There are at least two node-disjoint paths between any pair of nodes
  • Hence, in the presence of a single node failure, the remaining nodes can continue to interact
  • Many such results are known for de Bruijn networks

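The single-failure claim can be checked exhaustively for the 3-bit network. As before, this is a sketch under assumptions: the shift rule allows either bit to be shifted in, self-loops are dropped, and the helper names `de_bruijn` and `connected_without` are hypothetical.

```python
from collections import deque

def de_bruijn(n):
    """Undirected binary de Bruijn network on n-bit labels (self-loops dropped)."""
    labels = [format(i, f"0{n}b") for i in range(2 ** n)]
    adj = {u: set() for u in labels}
    for u in labels:
        for bit in "01":
            for v in (u[1:] + bit, bit + u[:-1]):
                if v != u:
                    adj[u].add(v)
                    adj[v].add(u)
    return adj

def connected_without(adj, faulty):
    """True if the nodes outside `faulty` can all still reach each other."""
    alive = [u for u in adj if u not in faulty]
    if not alive:
        return True
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in faulty and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(alive)

net = de_bruijn(3)
print(all(connected_without(net, {u}) for u in net))   # True
print(connected_without(net, {"001", "100"}))          # False
```

Every single node failure leaves the other seven nodes connected. Two failures are not always tolerated: 000 has only the neighbors 001 and 100, so losing both isolates it.
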
Summary
• Described two network architectures in which message routes can be reconfigured to maintain network connectivity in the presence of faulty nodes and/or links