ECE 753: FAULT-TOLERANT COMPUTING

Kewal K.Saluja

Department of Electrical and Computer Engineering

Low-Level Fault Tolerance: ECC

Overview
  • Introduction
  • Motivation and Background
  • Hamming Codes – by example
  • SEC-DED Codes – Algebraic method
  • SEC-DED Codes – Hardware
  • SEC-DED-SBD Codes
  • Cyclic Codes – (time permitting)
  • Summary

ECE 753 Fault Tolerant Computing

Introduction
  • References
    • Chapter 3 of Koren and Krishna
    • Appendix A of the book [siew:92] – also included in the set of reading material
    • Following references
      • Reddy – “A class of linear codes …” IEEETC, May 1978
      • Any book on coding theory

Motivation and Background
  • Memories are an integral part of digital systems (computers)
  • The majority of chip and/or board area is taken up by memories
  • Hence, reliability improvement methods must pay attention to memories (RAMs, ROMs, etc.)

Motivation and Background (contd.)
  • Types of faults prevalent in memories
  • During manufacturing
    • Stuck-at
    • Timing faults
    • Coupling and pattern sensitive faults
  • During operation
    • Cell failures due to life, stress – same as stuck-at
    • Alpha particle hits – cell content change
      • Sensitive to system location. Higher hits at altitudes and in flight
    • Need non-testing based solutions
    • Random failures – bit/nibble/byte/card failures

Motivation and Background (contd.)
  • Theoretical Foundation
    • Linear and modern algebra
      • Concept of groups, fields, and vector spaces
      • We will focus on binary codes but will have to include polynomial algebra
  • Theory – Informal definitions and results
    • Vector: A collection of bits represented as a string
    • Information bits - collection of k-bits
    • Code word: encoded information bit string
      • k information bits encoded to n bits. Encoded information word is a code word.
    • Check bits: r (= n-k) extra bits used to encode information bits

Motivation and Background (contd.)
  • Theory – Informal definitions and results
    • Hamming weight of a vector v: Number of 1’s in v
    • Hamming distance (HD) between a pair of vectors v1 and v2: number of places two vectors differ from each other.

$HD(v_1, v_2) = HW(v_1 \oplus v_2)$

    • Code: Collection of code words.
    • Block code: each code word contains same number of bits.
    • Minimum Hamming distance of a code: Minimum of all HDs between all pairs of code words in a code.

Motivation and Background (contd.)

Theory – Informal definitions and results (contd.)

    • Error detection: Erroneous word (a code word with one or more bit errors) is not a code word
  • Basic result 1: A code is capable of detecting t errors if and only if the min HD of the code is at least t+1.
    • Proof: use a sphere packing argument to show this.
  • Example: Use of parity – we know that we can detect a single error.

What is the minimum HD for such a code?

Prove that the min HD is 2 using the argument that no two binary strings with even (odd) Hamming weight can have a HD of 1.
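
The definitions of Hamming weight, Hamming distance, and minimum distance can be checked by brute force on a small parity code. The following is a minimal sketch, not part of the original slides; the function names and the choice of a 4-bit even-parity code (3 data bits plus 1 parity bit) are illustrative assumptions.

```python
from itertools import combinations, product


def hamming_weight(v):
    """Number of 1s in the bit tuple v."""
    return sum(v)


def hamming_distance(v1, v2):
    """Number of positions where v1 and v2 differ, i.e. HW(v1 XOR v2)."""
    return hamming_weight(tuple(a ^ b for a, b in zip(v1, v2)))


def min_distance(code):
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


# Illustrative code (assumption): all 4-bit words with even parity.
even_parity_code = [w for w in product((0, 1), repeat=4) if sum(w) % 2 == 0]

print(hamming_distance((1, 0, 1, 1), (0, 0, 1, 0)))  # 2
print(min_distance(even_parity_code))                # 2
```

A minimum distance of 2 agrees with basic result 1 for t = 1: a single parity bit detects any single error.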

Motivation and Background (contd.)

Theory – Informal definitions and results (contd.)

  • Basic result 2: A code is capable of correcting t errors if and only if the min HD of the code is at least 2t+1.
    • Proof: use a sphere packing argument as before.
  • Combine the two results: A code is capable of correcting t errors and detecting d errors ($d \ge t$) if and only if the min HD of the code is at least t+d+1.

Hamming Codes – by example
  • A linear block code
  • Consider a (7,4) Hamming code
  • Let i1 i2 i3 i4 be information symbols
  • Let p1, p2, p4 be check symbols
  • The parity equations:

$p_1 = i_1 \oplus i_2 \oplus i_4$

$p_2 = i_1 \oplus i_3 \oplus i_4$

$p_4 = i_2 \oplus i_3 \oplus i_4$

Hamming Codes – by example (contd.)
  • Can write the equations as follows (easy to remember)

p1 p2 i1 p4 i2 i3 i4
 1  0  1  0  1  0  1
 0  1  1  0  0  1  1
 0  0  0  1  1  1  1
 1  2  3  4  5  6  7   (bit position)

This encodes a 4-bit information word into a 7-bit codeword
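
The encoding step can be sketched directly from the three parity equations. This is an illustrative sketch; the function name `hamming74_encode` and the sample information word are assumptions, not taken from the slides.

```python
def hamming74_encode(i1, i2, i3, i4):
    """Encode 4 information bits into the 7-bit word p1 p2 i1 p4 i2 i3 i4."""
    p1 = i1 ^ i2 ^ i4  # covers bit positions 1, 3, 5, 7
    p2 = i1 ^ i3 ^ i4  # covers bit positions 2, 3, 6, 7
    p4 = i2 ^ i3 ^ i4  # covers bit positions 4, 5, 6, 7
    return [p1, p2, i1, p4, i2, i3, i4]


print(hamming74_encode(1, 0, 1, 1))  # [0, 1, 1, 0, 0, 1, 1]
```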

Hamming Codes – by example (contd.)
  • Properties of the code
    • If there is no error, all parity equations will be satisfied
    • Denote the outcomes of these parity checks as c1, c2, c4
    • If there is exactly one error, then c4 c2 c1, read as a binary number, gives the position of the erroneous bit
    • The vector c1, c2, c4 is called the syndrome
    • The above (7,4) Hamming code is a SEC code
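
As a rough illustration of these properties, the sketch below recomputes the three checks on a received word laid out as p1 p2 i1 p4 i2 i3 i4 (bit positions 1 to 7) and flips the bit the syndrome points to. The function names and the sample code word are assumptions made for the example.

```python
def hamming74_syndrome(word):
    """Return the syndrome as an integer 4*c4 + 2*c2 + c1 (0 means no error)."""
    c1 = word[0] ^ word[2] ^ word[4] ^ word[6]  # positions 1, 3, 5, 7
    c2 = word[1] ^ word[2] ^ word[5] ^ word[6]  # positions 2, 3, 6, 7
    c4 = word[3] ^ word[4] ^ word[5] ^ word[6]  # positions 4, 5, 6, 7
    return c1 + 2 * c2 + 4 * c4


def hamming74_correct(word):
    """Correct at most one bit error; return (corrected word, error position)."""
    position = hamming74_syndrome(word)
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1  # positions are 1-based
    return fixed, position


codeword = [0, 1, 1, 0, 0, 1, 1]  # a valid code word (information bits 1 0 1 1)
received = list(codeword)
received[4] ^= 1                  # inject an error at bit position 5

print(hamming74_correct(received))  # ([0, 1, 1, 0, 0, 1, 1], 5)
```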

Hamming Codes – by example (contd.)
  • The above method of construction can be generalized to construct an (n,k) Hamming code
  • Simple bound

k = number of information bits

r = number of check bits

n = k + r = total number of bits

n + 1 = number of error cases with at most one error (n single-bit errors plus the no-error case)

Each error (including no error) must have a distinct syndrome

With r check bits, the maximum number of distinct syndromes is $2^r$

Hence: $2^r \ge n + 1$

Hamming Codes – by example (contd.)

Simple bound

When $2^r = n + 1$, the corresponding Hamming code is a perfect code

  • Perfect Hamming codes can be constructed as follows:

Symbol:    p1   p2   i1  p4   i2  i3  i4  p8   i5  . . .
Position:  2^0  2^1  3   2^2  5   6   7   2^3  9   . . .

Parity equations can be written as before from the above matrix representation
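
One way to picture this construction: the column of H at bit position j is the binary representation of j, and check bits sit at the power-of-two positions. The sketch below builds H and the position labels for a given r; the helper names are assumptions introduced for illustration.

```python
def hamming_parity_check_matrix(r):
    """H for the perfect Hamming code with r check bits (n = 2**r - 1).

    Row b holds bit b of each position number 1..n, so the column at
    position j is the binary representation of j (least significant bit first).
    """
    n = 2 ** r - 1
    return [[(pos >> b) & 1 for pos in range(1, n + 1)] for b in range(r)]


def position_labels(r):
    """Label power-of-two positions as check bits, the rest as information bits."""
    labels, info_index = [], 0
    for pos in range(1, 2 ** r):
        if pos & (pos - 1) == 0:      # power of two -> check bit
            labels.append(f"p{pos}")
        else:
            info_index += 1
            labels.append(f"i{info_index}")
    return labels


print(position_labels(3))  # ['p1', 'p2', 'i1', 'p4', 'i2', 'i3', 'i4']
for row in hamming_parity_check_matrix(3):
    print(row)
```

For r = 3 this reproduces the three rows of the (7,4) example.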

SEC-DED Codes – Algebraic method
  • Definitions
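    • (G, *) – An abelian (commutative) group
      • There is a 0 in G (identity)
      • For every a in G, $a^{-1}$ is also in G (inverses)
      • For all a and b in G, a*b = b*a is also in G (closed and commutative)
      • For all a, b, and c in G, (a*b)*c = a*(b*c) (associative)
    • Examples
      • G = {0, 1}; * = $\oplus$ (Exclusive-OR)
      • ($Z_3$, $+_3$), the integers {0, 1, 2} under addition modulo 3, is a commutative group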

SEC-DED Codes – Algebraic method (contd.)
  • Definitions (contd.)
    • (F, +, .) – A Field if
      • (F, +) is an abelian group with identity of 0
      • (F − {0}, .) is an abelian group
      • . distributes over +: a.(b + c) = a.b + a.c
    • Examples
      • (F, $\oplus$, .) is a Field
      • F = {0, 1}; $\oplus$ = Exclusive-OR; . = AND
      • The above Field is called GF(2)

SEC-DED Codes – Algebraic method (contd.)
  • Definitions (contd.)
    • Vector space over a field F
      • (V, +) is an abelian group
      • v in V and c in F $\Rightarrow$ cv is in V
      • c(u + v) = cu + cv
      • (c + d)v = cv + dv
      • c(dv) = (cd)v
    • S $\subseteq$ V is a subspace if S is itself a vector space
    • A linear combination of vectors is a vector
      • $u = c_1 v_1 + c_2 v_2 + c_3 v_3 + \dots + c_n v_n$

SEC-DED Codes – Algebraic method (contd.)
  • Some results and more definitions
    • Over GF(2) a collection of all n-bit vectors forms a vector space
    • Let v1, v2, … , vk be n-bit vectors each. Then all $2^k$ linear combinations of these k vectors form a subspace
    • A set of k vectors v1, v2, … , vk is linearly independent if, whenever not all ci = 0 (i = 1, …, k),

$c_1 v_1 + c_2 v_2 + c_3 v_3 + \dots + c_k v_k \ne 0$

SEC-DED Codes – Algebraic method (contd.)
  • Some results and more definitions (contd.)
    • Largest number of linearly independent vectors in a vector space is the dimension of the space.
      • Dimension of the space containing all n-bit vectors is n
      • Dimension of the space containing all $2^k$ linear combinations of k vectors is no more than k.
    • A binary (n,k) linear block code is a k-dimensional subspace of an n-dimensional vector space

SEC-DED Codes – Algebraic method (contd.)
  • A binary (n,k) linear block code can be described by a collection of k carefully chosen vectors. Each code word is a linear combination of these k-vectors, thus forming a k-dimensional subspace.
  • These k vectors can be written as a $k \times n$ matrix G, called the generator matrix. A code word for a k-bit information word, say vector a, is obtained as aG
  • Example: For the (7,4) Hamming code described earlier

     p1 p2 i1 p4 i2 i3 i4
      1  1  1  0  0  0  0
G =   1  0  0  1  1  0  0
      0  1  0  1  0  1  0
      1  1  0  1  0  0  1

Note: a code word is a linear combination of rows of G
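
To make the role of G concrete, the sketch below forms aG over GF(2) for every 4-bit information word and measures the smallest nonzero code word weight, which for a linear code equals its minimum Hamming distance. The function name `encode` is an assumption for the example.

```python
from itertools import product

# Generator matrix of the (7,4) Hamming code, columns p1 p2 i1 p4 i2 i3 i4.
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]


def encode(a, G):
    """Code word aG over GF(2): XOR of the rows of G selected by the 1s of a."""
    word = [0] * len(G[0])
    for bit, row in zip(a, G):
        if bit:
            word = [w ^ g for w, g in zip(word, row)]
    return tuple(word)


codewords = {encode(a, G) for a in product((0, 1), repeat=4)}
min_weight = min(sum(c) for c in codewords if any(c))

print(encode((1, 0, 1, 1), G))     # (0, 1, 1, 0, 0, 1, 1)
print(len(codewords), min_weight)  # 16 3
```

A minimum distance of 3 is consistent with single error correction (2t + 1 with t = 1).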

SEC-DED Codes – Algebraic method (contd.)
  • Two vectors v1 and v2 are orthogonal if v1 . v2 = 0
  • The code generated by G can also be represented by an $r \times n$ matrix H in which each row of H is orthogonal to every row of G.
  • Hence $G H^T = 0$
  • dim G + dim H = n (that is, k + r = n)
  • Example: For the (7,4) Hamming code described earlier the H matrix is:

     p1 p2 i1 p4 i2 i3 i4
      1  0  1  0  1  0  1
H =   0  1  1  0  0  1  1
      0  0  0  1  1  1  1

  • Check that $G H^T = 0$
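
The orthogonality check can be carried out mechanically. The sketch below multiplies G by the transpose of H with arithmetic modulo 2; the helper name `gf2_matmul_transpose` is an assumption introduced here.

```python
# (7,4) Hamming code, columns p1 p2 i1 p4 i2 i3 i4.
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def gf2_matmul_transpose(A, B):
    """Return A * B^T over GF(2); entry (i, j) is the dot product of row i of A and row j of B."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) % 2 for row_b in B]
            for row_a in A]


result = gf2_matmul_transpose(G, H)
print(result)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
```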

SEC-DED Codes – Algebraic method (contd.)
  • There are two ways to encode data words
    • Use G (generator) matrix
    • Use H (parity check) matrix
  • We will use H – being of lower dimensionality
  • Consider the following representation of H

H = [ Pr | Ir ], where Pr is an $r \times k$ matrix and Ir is the $r \times r$ identity matrix

  • Consider a code word (a1, a2, … , ak, p1, p2 … pr)
  • We can write the parity check equations from the above H, i.e. from $H a^T = 0$ (a here denotes the full code word)
  • Example: For the (7,4) Hamming code we can write the H matrix as:

     a1 a2 a3 a4 p1 p2 p4
      1  1  0  1  1  0  0
H =   1  0  1  1  0  1  0
      0  1  1  1  0  0  1

  • The previous parity equations can be read off the rows of this H: row 1 gives $p_1 = a_1 \oplus a_2 \oplus a_4$, and similarly for p2 and p4

SEC-DED Codes – Algebraic method (contd.)
  • Note that H is specified such that all information bits stay intact and together, and the check bits stay together and depend only on information bits
  • A code specified by an H of the above type is called a systematic code
    • Data bits and check bits stay separate from each other
    • It is easy to extract data bits from a code word
  • Statement: rearrangement of columns of H does not change the code. All it does is that it changes the position of the check bits and information bits
  • Question: when can we write an arbitrary H in systematic form?

SEC-DED Codes – Algebraic method (contd.)
  • Theorem: If H is an $r \times n$ matrix with rank(H) = r (rank r means H contains r linearly independent columns), then H can be transformed to a systematic form
    • A row operation on H is a linear combination of parity check equations. Thus the solution set of the equations does not change
    • First rearrange the columns of H such that the last r columns are linearly independent
    • Next find a matrix M that performs row operations on H such that M, when it multiplies the last r columns, gives the $r \times r$ identity matrix. Thus M is in fact the inverse of the matrix that consists of the last r columns of H
    • Now the matrix MH will be in systematic form
  • Example in class
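
The procedure in the theorem can be sketched with Gauss-Jordan elimination over GF(2), which has the same effect as multiplying by M. This is not the in-class example; the function name `to_systematic` and the rule of picking pivot columns from the right are assumptions, so which columns end up as check positions depends on that choice.

```python
def to_systematic(H):
    """Return (Hs, order) with Hs = [P | I] over GF(2).

    order lists the original column indices in their new arrangement.
    Raises ValueError if rank(H) < r.
    """
    H = [row[:] for row in H]
    r, n = len(H), len(H[0])
    pivots = []                       # pivots[i] = column holding the 1 of identity row i
    row = 0
    for col in reversed(range(n)):    # prefer the rightmost columns as pivots
        if row == r:
            break
        sel = next((i for i in range(row, r) if H[i][col]), None)
        if sel is None:
            continue                  # column depends on earlier pivots; skip it
        H[row], H[sel] = H[sel], H[row]
        for i in range(r):            # clear this column in every other row
            if i != row and H[i][col]:
                H[i] = [a ^ b for a, b in zip(H[i], H[row])]
        pivots.append(col)
        row += 1
    if row < r:
        raise ValueError("rank(H) < r: no systematic form")
    order = [c for c in range(n) if c not in pivots] + pivots
    return [[h[c] for c in order] for h in H], order


H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
Hs, order = to_systematic(H)
for h in Hs:
    print(h)
print(order)
```

The last r columns of the result form the identity matrix, and every row of the result is a linear combination of the original parity check equations.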

SEC-DED Codes – Algebraic method (contd.)
  • Definition: The syndrome S of an n-bit word x is

$S = H x^T$ (note: S is an r-bit vector)

  • Note also that in the above equation $x^T$ selects a linear combination of columns of H
  • Example: consider a (6,3) systematic H and consider a 6-bit vector x
  • Theorem: for an (n,k) linear block code represented by H the syndrome of every code word is 0
    • Proof is more or less based on the way we have defined a block code and H matrix
  • Definition: Error word, E, is a vector that has a 1 in each position where a code word is erroneous
  • Example in class to define all these terms

SEC-DED Codes – Algebraic method (contd.)
  • Theorem: let C be a code word and E be an error word, i.e. C’ = C + E is the erroneous word (code word with error in it). Let S’ be the syndrome of the word C’; then

$S' = H E^T$

    • This follows from $S' = H (C + E)^T = H C^T + H E^T$ and $H C^T = 0$

  • Theorem: A linear block code represented by H is SEC if and only if the columns of H are distinct and non zero
  • Theorem: A linear block code represented by H is SEC-DED if:
    • All columns of H are distinct and non zero
    • Sum of any two columns of H is non zero and is not equal to a third column of H

SEC-DED Codes – Algebraic method (contd.)
  • Consider an H matrix in which each column has an odd number of 1’s. The code generated by such an H matrix is called an odd weight column code
  • Example: consider r = 4. Let us consider an H, a $4 \times 8$ matrix:

      1  0  0  0    0  1  1  1
H =   0  1  0  0    1  0  1  1
      0  0  1  0    1  1  0  1
      0  0  0  1    1  1  1  0
      wt = 1 cols   wt = 3 cols

This is a (8,4) SEC-DED code

  • Theorem: Odd weight column code is a SEC-DED code
  • Theorem: Hamming code with overall parity is a SEC-DED code
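
The column conditions in the SEC and SEC-DED theorems are easy to test by enumeration. The sketch below applies them to the (7,4) Hamming H and to the 4×8 odd weight column H; the function names are assumptions introduced for illustration.

```python
from itertools import combinations


def columns(H):
    """Columns of H as tuples."""
    return [tuple(row[j] for row in H) for j in range(len(H[0]))]


def is_sec(H):
    """All columns non-zero and distinct."""
    cols = columns(H)
    return all(any(c) for c in cols) and len(set(cols)) == len(cols)


def is_sec_ded(H):
    """SEC, and the sum of any two columns is non-zero and not itself a column."""
    if not is_sec(H):
        return False
    cols = columns(H)
    col_set = set(cols)
    for a, b in combinations(cols, 2):
        s = tuple(x ^ y for x, y in zip(a, b))
        if not any(s) or s in col_set:
            return False
    return True


H_hamming = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H_odd = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]

print(is_sec(H_hamming), is_sec_ded(H_hamming))  # True False
print(is_sec(H_odd), is_sec_ded(H_odd))          # True True
```

The odd weight case passes because the sum of two odd weight columns always has even weight, so it can never equal a column of H.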

SEC-DED Codes – Algebraic method (contd.)
  • Shortened codes
    • Sometimes we are interested in codes that do not exactly satisfy the bound derived for perfect Hamming codes. For example, consider the case when k = 8. Clearly we will need r = 4. But we do not want to have a (15,11) code; what we want is a (12,8) code. The following result comes in handy to design such codes and still have error correction capability
    • Result: Deleting columns of H does not alter the error correction capability of the corresponding code
      • Proof: the conditions stated in the theorem (for example columns remaining odd weight columns, or no two columns being identical) do not change by deleting columns of H.
  • What columns to delete? See next hardware issue.
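
A possible shortening of the (15,11) Hamming code to a (12,8) code is sketched below. The deletion rule used here (drop the information columns with the most 1s, which tends to reduce the number of XOR gates) is an assumption for illustration, not necessarily the rule intended in the slides; the helper name is also hypothetical.

```python
def hamming_columns(r):
    """Map each bit position 1..2**r - 1 to its column of H (binary of the position)."""
    return {p: tuple((p >> b) & 1 for b in range(r)) for p in range(1, 2 ** r)}


r, k_wanted = 4, 8
cols = hamming_columns(r)

# Information positions are the ones that are not powers of two.
data_positions = [p for p in cols if p & (p - 1)]

# Assumed rule: delete the heaviest information columns first.
by_weight = sorted(data_positions, key=lambda p: -sum(cols[p]))
deleted = by_weight[: len(data_positions) - k_wanted]

kept = [p for p in cols if p not in deleted]
short_cols = [cols[p] for p in kept]

print("deleted positions:", deleted)
print("n =", len(kept), " k =", len(kept) - r)
# The SEC conditions still hold: remaining columns are non-zero and distinct.
print(all(any(c) for c in short_cols) and len(set(short_cols)) == len(short_cols))
```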


SEC-DED Codes – Hardware
  • Encoding hardware

[Figure: encoding hardware. Labels: k information bits (input), XOR tree, k information bits and r check bits (output)]

SEC-DED Codes – Hardware (contd.)
  • Decoding hardware – Algorithm
    • Compute syndrome S
    • If S = 0 then no error
    • If S $\ne$ 0 { decode S
        • If S is in range (decoded S $\le$ n) then correct the S-th bit
        • Else there is an uncorrectable error

}

  • Note: it is easy to determine if S is 0
  • Decoding S is also straightforward
  • Correction implies a bit flip (EOR operation)
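
The decoding algorithm can be sketched in software for the 4×8 odd weight column code. One assumption differs from the slide wording: instead of testing whether the decoded syndrome is "in range", this sketch looks the syndrome up among the columns of H, which plays the same role for an H whose columns are not position numbers. Function names and the sample code word are illustrative.

```python
H = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]


def syndrome(H, word):
    """S = H x^T over GF(2)."""
    return tuple(sum(h * x for h, x in zip(row, word)) % 2 for row in H)


def decode(H, word):
    """Return (word after correction, status string)."""
    s = syndrome(H, word)
    if not any(s):
        return list(word), "no error"
    cols = [tuple(row[j] for row in H) for j in range(len(H[0]))]
    if s in cols:                      # syndrome matches a column: single error
        j = cols.index(s)
        fixed = list(word)
        fixed[j] ^= 1                  # correction is a bit flip (EOR)
        return fixed, f"corrected bit {j + 1}"
    return list(word), "uncorrectable error"


codeword = [1, 0, 1, 0, 1, 0, 1, 0]    # satisfies all four parity checks

single = list(codeword)
single[5] ^= 1                         # one error, bit 6
double = list(codeword)
double[0] ^= 1
double[1] ^= 1                         # two errors, bits 1 and 2

print(decode(H, codeword))
print(decode(H, single))
print(decode(H, double))
```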


SEC-DED Codes – Hardware (contd.)
  • Decoding hardware – Implementation

[Figure: decoding hardware. Labels: r and k input bits, EOR tree, syndrome, OR, AND, decoder (n outputs), NOR, error corrector (n EORs), corrected word]

SEC-DED Codes – Hardware (contd.)
  • Hardware simplification
    • Reduce number of EORs
      • Have as few 1s in the matrix as possible
    • Reduce delay – depth of EOR tree
      • Have as few 1s in each row of H as possible

SEC-DED-SBD Codes
  • Motivation
    • Many memories are organized as byte-oriented
    • Failures manifest themselves as follows
      • Random failure – bit error
      • Chip failure – byte error
    • Objective is to detect such byte errors while detecting and correcting random errors. Hence the error model
      • Single random error
      • Multiple errors limited within a byte

SEC-DED-SBD Codes (contd.)
  • Theorem (Reddy): Let E1 and E2 be two sets of error patterns with $E_1 \cap E_2 = \emptyset$. A linear block code described by H can correct all errors in E1 and detect all errors in E2 if and only if
    • For every e in $E_1 \cup E_2$, $H e^T \ne 0$
    • For every pair of distinct ei, ej in E1, $H e_i^T \ne H e_j^T$, and
    • For an ei in E2 there is no ej in E1 such that $H e_i^T = H e_j^T$

SEC-DED-SBD Codes (contd.)
  • To demonstrate the use of the theorem, let us look at an example H matrix and its capabilities for a small byte (nibble) size
  • b = number of bits in each memory card
  • n = total number of bits in a code word
  • r = number of check bits
  • $n = b(2^{r-b+1} - 1)$
  • For b = 4 and r = 5 we have n = 12. Thus we will construct a (12,7) code which will be able to correct any single error and detect errors confined to 4-bit nibbles

SEC-DED-SBD Codes (contd.)
  • Many parts of the code are shown as blocks in the following figure

[Figure: block structure of the H matrix. Block labels: "Correction part", "Detect mult. errors in byte"]

SEC-DED-SBD Codes (contd.)
  • Now let us look at the complete matrix

      0 0 0 0   1 1 1 1   1 1 1 1
      1 1 1 1   0 0 0 0   1 1 1 1
H =   1 0 0 0   1 0 0 0   1 0 0 0
      0 1 0 1   0 1 0 1   0 1 0 1
      0 0 1 1   0 0 1 1   0 0 1 1

SEC-DED-SBD Codes (contd.)
  • The capability can be proven as follows
  • E1: single errors; E2: multiple errors limited to a 4-bit nibble
  • All columns are non-zero and any combinations of columns within a 4-bit nibble are also non-zero
  • All columns are distinct – providing single error correction capability
  • The last 3 rows guarantee that no combination of errors limited to a nibble will have a syndrome identical to a single error syndrome
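
The three conditions of the theorem can also be checked exhaustively for the 5×12 matrix, taking E1 as all single-bit errors and E2 as all patterns of two to four errors inside one nibble. The sketch below is an illustrative check, with variable names chosen for the example.

```python
from itertools import combinations

H = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
]
N, B, R = 12, 4, 5  # code length, nibble size, number of check bits


def syndrome(positions):
    """H e^T for the error pattern with 1s at the given column indices."""
    return tuple(sum(row[p] for p in positions) % 2 for row in H)


# E1: single-bit errors. E2: 2 to 4 errors confined to one nibble.
E1 = [(p,) for p in range(N)]
E2 = [combo
      for start in range(0, N, B)
      for size in range(2, B + 1)
      for combo in combinations(range(start, start + B), size)]

s1 = [syndrome(e) for e in E1]
s2 = [syndrome(e) for e in E2]

nonzero = all(any(s) for s in s1 + s2)       # condition 1
distinct_in_e1 = len(set(s1)) == len(s1)     # condition 2
no_alias = set(s1).isdisjoint(s2)            # condition 3

print(B * (2 ** (R - B + 1) - 1))            # 12, the code length from the formula
print(nonzero, distinct_in_e1, no_alias)     # True True True
```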

SEC-DED-SBD Codes (contd.)
  • Comments
    • The code can be converted to a systematic code
    • Distance of the code can be increased by 1 to make it a DED code
    • This code can also be shortened

Summary
  • Why ECC in Fault tolerance
  • Hamming code – by example
  • Algebra and Algebraic coding
    • Codes
    • Hardware
  • SEC-SBD code
