### ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

Low-Level Fault Tolerance: ECC

Overview

- Introduction
- Motivation and Background
- Hamming Codes – by example
- SEC-DED Codes – Algebraic method
- SEC-DED Codes – Hardware
- SEC-DED-SBD Codes
- Cyclic Codes – (time permitting)
- Summary

ECE 753 Fault Tolerant Computing

Introduction

- References
  - Chapter 3 of Koren and Krishna
  - Appendix A of the book [siew:92] – also included in the set of reading material
  - The following references
    - Reddy – “A class of linear codes …” IEEE Transactions on Computers, May 1978
    - Any book on coding theory

Motivation and Background

- Memories are an integral part of digital systems (computers)
- The majority of chip and/or board area is taken up by memories
- Hence, reliability improvement methods must pay attention to memories (RAMs, ROMs, etc.)

Motivation and Background (contd.)

- Types of faults prevalent in memories
  - During manufacturing
    - Stuck-at
    - Timing faults
    - Coupling and pattern-sensitive faults
  - During operation
    - Cell failures due to life, stress – same as stuck-at
    - Alpha particle hits – cell content change
      - Sensitive to system location. Higher hits at altitudes and in flight
      - Need non-testing based solutions
    - Random failures – bit/nibble/byte/card failures

Motivation and Background (contd.)

- Theoretical Foundation
  - Linear and modern algebra
  - Concept of groups, fields, and vector spaces
  - We will focus on binary codes but will have to include polynomial algebra
- Theory – Informal definitions and results
  - Vector: a collection of bits represented as a string
  - Information bits: a collection of k bits
  - Code word: encoded information bit string
    - k information bits are encoded into n bits. The encoded information word is a code word.
  - Check bits: r (= n − k) extra bits used to encode the information bits

Motivation and Background (contd.)

- Theory – Informal definitions and results
  - Hamming weight (HW) of a vector v: number of 1’s in v
  - Hamming distance (HD) between a pair of vectors v1 and v2: number of positions in which the two vectors differ.

    $\mathrm{HD}(v_1, v_2) = \mathrm{HW}(v_1 \oplus v_2)$

  - Code: a collection of code words.
  - Block code: each code word contains the same number of bits.
  - Minimum Hamming distance of a code: the minimum of the HDs over all pairs of distinct code words in the code.
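
A minimal Python sketch of these definitions, assuming vectors are stored as integers whose bits are the vector entries. The function names and the two-word example code are illustrative choices, not part of the course material.

```python
from itertools import combinations

def hamming_weight(v: int) -> int:
    """Number of 1 bits in the vector v."""
    return bin(v).count("1")

def hamming_distance(v1: int, v2: int) -> int:
    """HD(v1, v2) = HW(v1 XOR v2)."""
    return hamming_weight(v1 ^ v2)

def min_distance(code) -> int:
    """Minimum HD over all pairs of distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

repetition_code = [0b000, 0b111]  # illustrative block code with two code words
print(hamming_distance(0b1011, 0b0110))  # 3
print(min_distance(repetition_code))     # 3
```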

Motivation and Background (contd.)

Theory – Informal definitions and results (contd.)

- Error detection: an erroneous word (a code word with one or more bit errors) is detected when it is not a code word
- Basic result 1: A code is capable of detecting t errors if and only if the min HD of the code is at least t+1.
- Proof: use a sphere packing argument to show this.
- Example: Use of parity – we know that we can detect a single error.

What is the minimum HD for such a code?

Prove that the min HD is 2 using the argument that no two binary strings with even (odd) Hamming weight can have an HD of 1.
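
The parity example can also be checked by brute force. The sketch below assumes an even-parity code over k = 3 information bits (an arbitrary choice for illustration) and enumerates every code word; the helper names are hypothetical.

```python
from itertools import combinations, product

def even_parity_code(k: int):
    """All code words of k information bits followed by one even-parity bit."""
    return [bits + (sum(bits) % 2,) for bits in product((0, 1), repeat=k)]

def min_hd(code) -> int:
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(sum(x != y for x, y in zip(a, b))
               for a, b in combinations(code, 2))

code = even_parity_code(3)
print(len(code))     # 8 code words of 4 bits each
print(min_hd(code))  # 2, so t + 1 = 2 and t = 1 error can be detected
```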

Motivation and Background (contd.)

Theory – Informal definitions and results (contd.)

- Basic result 2: A code is capable of correcting t errors if and only if the min HD of the code is at least 2t+1.
- Proof: use a sphere packing argument as before.
- Combine the two results: A code is capable of correcting t errors and detecting d errors (d ≥ t) if and only if the min HD of the code is at least t+d+1.
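
As a worked instance of the combined result, with values chosen purely for illustration and $d_{\min}$ used here as shorthand for the min HD:

$$
\begin{aligned}
d_{\min} = 3 &: \quad t = 1,\ d = 1, \quad t + d + 1 = 3 \quad \text{(single error correction)} \\
d_{\min} = 4 &: \quad t = 1,\ d = 2, \quad t + d + 1 = 4 \quad \text{(single error correction, double error detection)}
\end{aligned}
$$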

Hamming Codes – by example

- A linear block code
- Consider a (7,4) Hamming code
- Let i1 i2 i3 i4 be information symbols
- Let p1 p2 p4 be check symbols
- The parity equations:

$p_1 = i_1 \oplus i_2 \oplus i_4$

$p_2 = i_1 \oplus i_3 \oplus i_4$

$p_4 = i_2 \oplus i_3 \oplus i_4$

Hamming Codes – by example (contd.)

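- The equations can be written as follows (easy to remember)

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Bit | p1 | p2 | i1 | p4 | i2 | i3 | i4 |
| p1 equation | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| p2 equation | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| p4 equation | 0 | 0 | 0 | 1 | 1 | 1 | 1 |

This encodes a 4-bit information word into a 7-bit codeword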
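
The encoding step can be sketched in Python as follows. The function name is hypothetical, and the output order p1 p2 i1 p4 i2 i3 i4 follows the bit positions 1 to 7 used above.

```python
def hamming74_encode(i1: int, i2: int, i3: int, i4: int):
    """Encode 4 information bits into the 7-bit word p1 p2 i1 p4 i2 i3 i4."""
    p1 = i1 ^ i2 ^ i4  # parity over positions 1, 3, 5, 7
    p2 = i1 ^ i3 ^ i4  # parity over positions 2, 3, 6, 7
    p4 = i2 ^ i3 ^ i4  # parity over positions 4, 5, 6, 7
    return [p1, p2, i1, p4, i2, i3, i4]

print(hamming74_encode(1, 0, 1, 1))  # [0, 1, 1, 0, 0, 1, 1]
```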

Hamming Codes – by example (contd.)

- Properties of the code
  - If there is no error, all parity equations will be satisfied
  - Denote the outcomes of these equation checks as c1, c2, c4
  - If there is exactly one error, then c1, c2, c4 point to the position of the error
  - The vector c1, c2, c4 is called the syndrome
  - The above (7,4) Hamming code is an SEC (single-error-correcting) code
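
A sketch of how the syndrome points to the error, assuming the bit order p1 p2 i1 p4 i2 i3 i4 at positions 1 to 7. For this particular layout, reading c4 c2 c1 as a binary number gives the position of a single error; the function names are illustrative.

```python
def hamming74_syndrome(word):
    """Return (c1, c2, c4) for a 7-bit word ordered p1 p2 i1 p4 i2 i3 i4."""
    b = [None] + list(word)  # shift to 1-based positions
    c1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    c2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    c4 = b[4] ^ b[5] ^ b[6] ^ b[7]
    return c1, c2, c4

def hamming74_correct(word):
    """Flip the bit the syndrome points to; position 0 means no error seen."""
    c1, c2, c4 = hamming74_syndrome(word)
    position = c1 + 2 * c2 + 4 * c4
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position

codeword = [0, 1, 1, 0, 0, 1, 1]
received = [0, 1, 1, 0, 1, 1, 1]    # bit at position 5 flipped
print(hamming74_syndrome(codeword))  # (0, 0, 0)
print(hamming74_correct(received))   # ([0, 1, 1, 0, 0, 1, 1], 5)
```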

Hamming Codes – by example (contd.)

- The above method of construction can be generalized to construct an (n,k) Hamming code
- Simple bound

k = number of information bits

r = number of check bits

n = k + r = total number of bits

n + 1 = number of error patterns with at most one error (no error, or one of the n single-bit errors)

Each error pattern (including no error) must have a distinct syndrome

With r check bits, the maximum number of distinct syndromes is $2^r$

Hence: $2^r \ge n + 1$
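
The bound can be turned into a small search for the fewest check bits. This is a sketch under the single-error model above; the function name is hypothetical.

```python
def min_check_bits(k: int) -> int:
    """Smallest r such that 2**r >= n + 1, where n = k + r."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

for k in (4, 8, 11, 64):
    r = min_check_bits(k)
    print(f"k={k}: r={r}, n={k + r}")
# k=4: r=3, n=7
# k=8: r=4, n=12
# k=11: r=4, n=15
# k=64: r=7, n=71
```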

Hamming Codes – by example (contd.)

Simple bound

When $2^r = n + 1$, the corresponding Hamming code is a perfect code

- Perfect Hamming codes can be constructed as follows:

| Bit | p1 | p2 | i1 | p4 | i2 | i3 | i4 | p8 | i5 | . . . |
|---|---|---|---|---|---|---|---|---|---|---|
| Position | $2^0$ | $2^1$ | 3 | $2^2$ | 5 | 6 | 7 | $2^3$ | 9 | . . . |

Parity equations can be written as before from the above matrix representation
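
One way to sketch the general construction: for the (7,4) example, the column of the parity-check matrix at position j is the binary representation of j, with the least significant bit in the top row. Assuming the same rule for any r, a perfect Hamming code matrix can be generated as below; the function names are illustrative.

```python
def hamming_parity_check_matrix(r: int):
    """r x (2**r - 1) matrix whose column at position j is j in binary (LSB on top)."""
    n = 2 ** r - 1
    return [[(pos >> row) & 1 for pos in range(1, n + 1)] for row in range(r)]

def check_bit_positions(r: int):
    """Check bits sit at the power-of-two positions 1, 2, 4, ..."""
    return [2 ** i for i in range(r)]

for row in hamming_parity_check_matrix(3):
    print(row)
# [1, 0, 1, 0, 1, 0, 1]
# [0, 1, 1, 0, 0, 1, 1]
# [0, 0, 0, 1, 1, 1, 1]
print(check_bit_positions(4))  # [1, 2, 4, 8]
```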

SEC-DED Codes – Algebraic method

- Definitions
  - (G, *) – An abelian (commutative) group
    - * is associative: (a*b)*c = a*(b*c)
    - There is a 0 in G (identity)
    - For every a in G, $a^{-1}$ is also in G (inverses)
    - For all a and b in G, a*b = b*a is also in G (closed and commutative)
  - Examples
    - G = {0, 1}; * = ⊕ (Exclusive-OR)
    - ($Z_3$, $+_3$), the integers {0, 1, 2} under addition modulo 3, is a commutative group

SEC-DED Codes – Algebraic method (contd.)

- Definitions (contd.)
  - (F, +, ·) – A field if
    - (F, +) is an abelian group with identity 0
    - (F − {0}, ·) is an abelian group
    - · distributes over +: a·(b + c) = a·b + a·c
  - Examples
    - (F, ⊕, ·) is a field
    - F = {0, 1}; ⊕ = Exclusive-OR; · = AND
    - The above field is called GF(2)

SEC-DED Codes – Algebraic method (contd.)

- Definitions (contd.)
  - Vector space over a field F
    - (V, +) is an abelian group
    - For v in V and c in F, cv is in V
    - c(u + v) = cu + cv
    - (c + d)v = cv + dv
    - c(dv) = (cd)v
  - $S \subseteq V$ is a subspace if S is itself a vector space
  - A linear combination of vectors is a vector
    - u = c1v1 + c2v2 + c3v3 + … + cnvn

SEC-DED Codes – Algebraic method (contd.)

- Some results and more definitions
  - Over GF(2), the collection of all n-bit vectors forms a vector space
  - Let v1, v2, … , vk be n-bit vectors. Then all $2^k$ linear combinations of these k vectors form a subspace
  - A set of k vectors v1, v2, … , vk is linearly independent if, whenever not all ci = 0 (i = 1, …, k),

$c_1v_1 + c_2v_2 + c_3v_3 + \dots + c_kv_k \ne 0$

SEC-DED Codes – Algebraic method (contd.)

- Some results and more definitions (contd.)
  - The largest number of linearly independent vectors in a vector space is the dimension of the space.
  - The dimension of the space containing all n-bit vectors is n
  - The dimension of the space containing all $2^k$ linear combinations of k vectors is no more than k.
  - A binary (n,k) linear block code is a k-dimensional subspace of an n-dimensional vector space
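
Linear independence and dimension over GF(2) can be checked numerically. The sketch below assumes each n-bit vector is stored as a Python integer and uses an XOR-based elimination; the function name and the example vectors are illustrative.

```python
def gf2_rank(vectors) -> int:
    """Dimension of the span of the given bit vectors (ints) over GF(2)."""
    basis = []
    for v in vectors:
        for b in basis:
            # XOR with b only if that clears b's leading 1 bit in v
            v = min(v, v ^ b)
        if v:  # v is not a combination of the vectors kept so far
            basis.append(v)
    return len(basis)

independent = [0b1110000, 0b1001100, 0b0101010, 0b1101001]
dependent = [0b101, 0b011, 0b110]  # 110 = 101 XOR 011

print(gf2_rank(independent))     # 4, so the span holds 2**4 = 16 vectors
print(gf2_rank(dependent))       # 2
```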

SEC-DED Codes – Algebraic method (contd.)

- A binary (n,k) linear block code can be described by a collection of k carefully chosen vectors. Each code word is a linear combination of these k vectors, thus forming a k-dimensional subspace.
- These k vectors can be written as a $k \times n$ matrix G, called the generator matrix. A code word for a k-bit information word, say vector a, is obtained by aG
- Example: For the (7,4) Hamming code described earlier

G =

| p1 | p2 | i1 | p4 | i2 | i3 | i4 |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 |

Note: a code word is a linear combination of rows of G
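
The product aG over GF(2) can be sketched as below, using the G of the (7,4) example. The function name `encode` is an illustrative choice.

```python
G = [
    [1, 1, 1, 0, 0, 0, 0],  # contribution of i1
    [1, 0, 0, 1, 1, 0, 0],  # contribution of i2
    [0, 1, 0, 1, 0, 1, 0],  # contribution of i3
    [1, 1, 0, 1, 0, 0, 1],  # contribution of i4
]

def encode(a, G):
    """Code word aG over GF(2): XOR of the rows of G selected by the 1s in a."""
    n = len(G[0])
    return [sum(a[i] & G[i][j] for i in range(len(G))) % 2 for j in range(n)]

print(encode([1, 0, 1, 1], G))  # [0, 1, 1, 0, 0, 1, 1]
```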

SEC-DED Codes – Algebraic method (contd.)

- Two vectors v1 and v2 are orthogonal if v1 · v2 = 0
- The code generated by G can also be represented by an $r \times n$ matrix H in which each n-bit row vector of H is orthogonal to every row vector of G.
- Hence $GH^T = 0$
- dim G + dim H = n
- Example: For the (7,4) Hamming code described earlier the H matrix is:

H =

| p1 | p2 | i1 | p4 | i2 | i3 | i4 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 1 | 1 | 1 |

- Check that $GH^T = 0$
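
The orthogonality check can be carried out mechanically. This sketch multiplies G by the transpose of H over GF(2) for the (7,4) example; the helper name is hypothetical.

```python
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def gf2_mul_transpose(A, B):
    """Return A times B-transpose over GF(2): entry (i, j) is row_i(A) . row_j(B)."""
    return [[sum(x & y for x, y in zip(ra, rb)) % 2 for rb in B] for ra in A]

product = gf2_mul_transpose(G, H)
print(product)  # a 4 x 3 all-zero matrix
```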

SEC-DED Codes – Algebraic method (contd.)

- There are two ways to encode data words
  - Use the G (generator) matrix
  - Use the H (parity check) matrix
- We will use H – being of lower dimensionality
- Consider the following representation of H

$H = [\, P_r \mid I_r \,]$, where $P_r$ is an $r \times k$ matrix and $I_r$ is the $r \times r$ identity matrix

- Consider a code word (a1, a2, … , ak, p1, p2 … pr)
- We can write the parity check equations from the above H, i.e. from $Ha^T = 0$, with a taken as the code word above
- Example: For the (7,4) Hamming code we can write the H matrix as:

H =

| a1 | a2 | a3 | a4 | p1 | p2 | p4 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 | 0 | 0 | 1 |

- The previous parity equations can be obtained from this H in a simple manner
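
Encoding directly from a systematic H can be sketched as follows: each check bit is the GF(2) sum of the information bits selected by the corresponding row of the left block. The function names are illustrative, and the bit order a1 a2 a3 a4 p1 p2 p4 is the one used in the example matrix.

```python
H_SYS = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def encode_systematic(a, H):
    """Append r check bits to the k information bits in a, using H = [P | I]."""
    r = len(H)
    k = len(H[0]) - r
    checks = [sum(H[row][j] & a[j] for j in range(k)) % 2 for row in range(r)]
    return list(a) + checks

def syndrome(H, x):
    """H times x-transpose over GF(2)."""
    return [sum(h & b for h, b in zip(row, x)) % 2 for row in H]

word = encode_systematic([1, 0, 1, 1], H_SYS)
print(word)                   # [1, 0, 1, 1, 1, 0, 0]
print(syndrome(H_SYS, word))  # [0, 0, 0]
```

The information bits appear unchanged at the front of the code word, which is what makes the form convenient for extraction.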

SEC-DED Codes – Algebraic method (contd.)

- Note that H is specified such that all information bits stay intact and together, and the check bits stay together and depend only on the information bits
- A code specified by an H of the above type is called a systematic code
  - Data bits and check bits stay separate from each other
  - It is easy to extract data bits from a code word
- Statement: rearranging the columns of H does not change the code. It only changes the positions of the check bits and information bits
- Question: when can we write an arbitrary H in systematic form?

SEC-DED Codes – Algebraic method (contd.)

- Theorem: If H is an $r \times n$ matrix with rank(H) = r (rank r means H contains r linearly independent columns), then H can be transformed to a systematic form
  - A row operation on H forms a linear combination of parity check equations. Thus the solution of the equations does not change
  - First rearrange the columns of H such that the last r columns are linearly independent
  - Next find a matrix M that performs row operations on H such that, when M multiplies the last r columns, it gives an $r \times r$ identity matrix. Thus M is in fact the inverse of the matrix that consists of the last r columns of H
  - Now the matrix MH will be in systematic form
  - Example in class
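
The transformation can be sketched with Gauss-Jordan elimination over GF(2). This is a minimal sketch rather than the in-class example: instead of computing M explicitly, it applies the equivalent row operations, then moves the pivot columns to the end. The function name and the returned column order are assumptions made for illustration.

```python
def to_systematic(H):
    """Return (H_sys, order) with H_sys = [P | I]; order lists original column indices."""
    rows = [row[:] for row in H]
    r, n = len(rows), len(rows[0])
    pivots = []
    for col in range(n):
        if len(pivots) == r:
            break
        top = len(pivots)
        # find a row at or below 'top' with a 1 in this column
        hit = next((i for i in range(top, r) if rows[i][col]), None)
        if hit is None:
            continue
        rows[top], rows[hit] = rows[hit], rows[top]
        for i in range(r):  # clear the column everywhere else (row operations)
            if i != top and rows[i][col]:
                rows[i] = [x ^ y for x, y in zip(rows[i], rows[top])]
        pivots.append(col)
    if len(pivots) < r:
        raise ValueError("rank(H) < r: no systematic form exists")
    order = [c for c in range(n) if c not in pivots] + pivots
    return [[row[c] for c in order] for row in rows], order

H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H_sys, order = to_systematic(H)
for row in H_sys:
    print(row)
print(order)  # original column indices, information columns first
```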

SEC-DED Codes – Algebraic method (contd.)

- Definition: The syndrome S of an n-bit word x is

$S = Hx^T$ (note that S is an r-bit vector)

- Note also that in the above equation $x^T$ selects a linear combination of the columns of H
- Example: consider a (6,3) systematic H and a 6-bit vector x
- Theorem: for an (n,k) linear block code represented by H, the syndrome of every code word is 0
- Proof is more or less based on the way we have defined a block code and the H matrix
- Definition: the error word, E, is a vector that marks the positions in which a code word is erroneous
- Example in class to define all these terms

SEC-DED Codes – Algebraic method (contd.)

- Theorem: let C be a code word and E be an error word, i.e. C’ = C + E is the erroneous word (code word with error in it). Let S’ be the syndrome of the word C’; then

$S' = HE^T$

- Theorem: A linear block code represented by H is SEC if and only if the columns of H are distinct and non zero
- Theorem: A linear block code represented by H is SEC-DED if:
- All columns of H are distinct and non zero
- Sum of any two columns of H is non zero and is not equal to a third column of H
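
The syndrome theorem can be checked numerically. The sketch below reuses the (7,4) Hamming H in the order p1 p2 i1 p4 i2 i3 i4, with a code word and a single-bit error word chosen for illustration.

```python
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(H, x):
    """S = H x^T over GF(2)."""
    return [sum(h & b for h, b in zip(row, x)) % 2 for row in H]

C = [0, 1, 1, 0, 0, 1, 1]  # a code word
E = [0, 0, 0, 0, 1, 0, 0]  # error word: single error at position 5
C_err = [c ^ e for c, e in zip(C, E)]

print(syndrome(H, C))      # [0, 0, 0]
print(syndrome(H, C_err))  # [1, 0, 1]
print(syndrome(H, E))      # [1, 0, 1], the column of H at position 5
```

Because the syndrome of a single error is just one column of H, distinct non-zero columns are what make every single error identifiable.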

SEC-DED Codes – Algebraic method (contd.)

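- Consider an H matrix in which each column has an odd number of 1’s. The code generated by such an H matrix is called an odd-weight-column code
- Example: consider r = 4. Let us consider an H, a $4 \times 8$ matrix:

$$
H = \left[\begin{array}{cccc|cccc}
1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 0
\end{array}\right]
$$

The first four columns have weight 1 and the last four columns have weight 3.

This is an (8,4) SEC-DED code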

- Theorem: Odd weight column code is a SEC-DED code
- Theorem: Hamming code with overall parity is a SEC-DED code
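
The column conditions for SEC and SEC-DED can be tested directly on a given H. This sketch checks the (8,4) odd-weight-column matrix above and, for contrast, the (7,4) Hamming matrix; the function names are illustrative.

```python
from itertools import combinations

H84 = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
H74 = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def columns(H):
    return [tuple(row[j] for row in H) for j in range(len(H[0]))]

def is_sec(H) -> bool:
    """All columns distinct and non-zero."""
    cols = columns(H)
    return all(any(c) for c in cols) and len(set(cols)) == len(cols)

def is_sec_ded(H) -> bool:
    """SEC, and no sum of two columns is zero or equal to a third column."""
    cols = columns(H)
    if not is_sec(H):
        return False
    col_set = set(cols)
    for a, b in combinations(cols, 2):
        s = tuple(x ^ y for x, y in zip(a, b))
        if not any(s) or s in col_set:
            return False
    return True

print(is_sec(H84), is_sec_ded(H84))  # True True
print(is_sec(H74), is_sec_ded(H74))  # True False
```

The sum of two odd-weight columns always has even weight, so it can never coincide with another odd-weight column; that is the intuition behind the odd-weight-column theorem.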

SEC-DED Codes – Algebraic method (contd.)

- Shortened codes
- Sometimes we are interested in codes that do not exactly satisfy the bound derived for perfect Hamming codes. For example, consider the case when k = 8. Clearly we will need r = 4, since $2^4 = 16 \ge 8 + 4 + 1$. But we do not want a (15,11) code; what we want is a (12,8) code. The following result comes in handy to design such codes and still have error correction capability
- Result: Deleting columns of H does not alter the error correction capability of the corresponding code
- Proof: the conditions stated in the theorems (for example, columns remaining odd-weight columns, or no two columns being identical) do not change when columns of H are deleted.
- What columns to delete? See next hardware issue.
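
A sketch of shortening the (15,11) Hamming code to a (12,8) code. The deletion rule used here, dropping the heaviest information columns first so that fewer 1s remain in H, is only one plausible choice and is an assumption, not a rule stated in the slides. The function names are illustrative.

```python
def hamming_h(r: int):
    """Perfect Hamming H: the column at position j is j in binary (LSB on top)."""
    n = 2 ** r - 1
    return [[(pos >> row) & 1 for pos in range(1, n + 1)] for row in range(r)]

def shorten(H, n_delete: int):
    """Delete n_delete information columns (weight > 1), heaviest first."""
    n = len(H[0])

    def weight(j):
        return sum(row[j] for row in H)

    candidates = sorted((j for j in range(n) if weight(j) > 1),
                        key=weight, reverse=True)
    drop = set(candidates[:n_delete])
    return [[row[j] for j in range(n) if j not in drop] for row in H]

H15 = hamming_h(4)
H12 = shorten(H15, 3)
cols = [tuple(row[j] for row in H12) for j in range(len(H12[0]))]

print(len(H12[0]))                                      # 12
print(len(set(cols)) == 12 and all(any(c) for c in cols))  # True: still SEC
print(sum(map(sum, H15)), sum(map(sum, H12)))           # 32 22
```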

SEC-DED Codes – Hardware: Encoding hardware

[Figure: encoding hardware. An XOR tree takes the k information bits and produces the r check bits.]

SEC-DED Codes –Hardware (contd.)

- Decoding hardware – Algorithm
  - Compute syndrome S
  - If S = 0 then no error
  - If S ≠ 0, decode S:
    - If S is in range (decoded S ≤ n), then correct the S-th bit
    - Else there is an uncorrectable error
- Note: it is easy to determine if S is 0
- Decoding S is also straightforward
- Correction implies a bit flip (EOR operation)
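
The decoding algorithm can be sketched in software as follows. One assumption differs from the slide: instead of reading S as a bit position, this sketch decodes S by matching it against the columns of H, which also works when the columns are not in binary-counting order. The (8,4) odd-weight-column H is used as the example, and the names are illustrative.

```python
H84 = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]

def decode(H, word):
    """Return (word, status) after the syndrome test and single-bit correction."""
    s = tuple(sum(h & b for h, b in zip(row, word)) % 2 for row in H)
    if not any(s):
        return list(word), "no error"
    cols = [tuple(row[j] for row in H) for j in range(len(H[0]))]
    if s in cols:                 # syndrome matches a column: single error
        j = cols.index(s)
        fixed = list(word)
        fixed[j] ^= 1             # correction is a bit flip (EOR)
        return fixed, f"corrected bit {j + 1}"
    return list(word), "uncorrectable error"

codeword = [0, 1, 1, 1, 1, 0, 0, 0]
print(decode(H84, codeword))                    # no error
print(decode(H84, [0, 1, 1, 1, 1, 1, 0, 0]))    # bit 6 flipped, gets corrected
print(decode(H84, [1, 0, 1, 1, 1, 0, 0, 0]))    # bits 1 and 2 flipped, detected only
```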

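SEC-DED Codes – Hardware (contd.): Decoding hardware – Implementation

[Figure: decoding hardware. An EOR tree computes the syndrome; a syndrome decoder, together with OR, AND, and NOR gates, drives an error corrector built from n EORs, which outputs the corrected word.]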

SEC-DED Codes –Hardware (contd.)

- Hardware simplification
  - Reduce number of EORs
    - Have as few 1s in the matrix as possible
  - Reduce delay – depth of EOR tree
    - Have as few 1s in each row of H as possible
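
A rough cost estimate makes the two goals concrete. The sketch assumes 2-input EOR gates arranged as a balanced tree per syndrome bit, with no sharing of gates between rows; under that assumption a row with w ones needs w − 1 gates and a tree depth of ⌈log2 w⌉. The function name is hypothetical.

```python
def eor_cost(H):
    """Return (total 2-input EOR gates, worst-case tree depth) for the rows of H."""
    gates = 0
    depth = 0
    for row in H:
        w = sum(row)
        if w > 1:
            gates += w - 1
            depth = max(depth, (w - 1).bit_length())  # equals ceil(log2(w))
    return gates, depth

H84 = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
print(eor_cost(H84))  # (12, 2)
```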

SEC-DED-SBD Codes

- Motivation
  - Many memories are byte-organized
  - Failures manifest themselves as follows
    - Random failure – bit error
    - Chip failure – byte error
  - The objective is to detect such byte errors while detecting and correcting random errors. Hence the error model
    - Single random error
    - Multiple errors limited within a byte

SEC-DED-SBD Codes (contd.)

- Theorem (Reddy): Let E1 and E2 be two sets of error patterns with $E_1 \cap E_2 = \emptyset$. A linear block code described by H can correct all errors in E1 and detect all errors in E2 if and only if
  - For e in $E_1 \cup E_2$, $He^T \ne 0$
  - For distinct ei, ej in E1, $He_i^T \ne He_j^T$, and
  - For an ei in E2 there is no ej in E1 such that $He_i^T = He_j^T$

SEC-DED-SBD Codes (contd.)

- To demonstrate the use of the theorem, let us look at an example H matrix and its capabilities for a small byte (nibble) size
  - b = number of bits in each memory card
  - n = total number of bits in a code word
  - r = number of check bits
  - $n = b\,(2^{r-b+1} - 1)$
- For b = 4 and r = 5 we have n = 12. Thus we will construct a (12,7) code which will be able to correct any single error and detect errors confined to 4-bit nibbles
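
Substituting the values used in this example into the length formula gives:

$$
n = b\,(2^{r-b+1} - 1) = 4\,(2^{5-4+1} - 1) = 4 \cdot 3 = 12, \qquad k = n - r = 12 - 5 = 7
$$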

SEC-DED-SBD Codes (contd.)

- Many parts of the code are shown as blocks in the following figure

[Figure: H matrix drawn as blocks, with a "Correction part" and a "Detect multiple errors in byte" part.]

SEC-DED-SBD Codes (contd.)

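- Now let us look at the complete matrix

$$
H = \left[\begin{array}{cccc|cccc|cccc}
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1
\end{array}\right]
$$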

SEC-DED-SBD Codes (contd.)

- The capability can be proven as follows
  - E1: single errors; E2: multiple errors limited to a 4-bit nibble
  - All columns are non-zero, and any combination of columns within a 4-bit nibble is also non-zero
  - All columns are distinct, providing single error correction capability
  - The last 3 rows guarantee that no combination of errors limited to a nibble will have a syndrome identical to a single-error syndrome
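
The three conditions of Reddy's theorem can be checked exhaustively for the 5 × 12 matrix of this example. In the sketch below, E1 is taken as all single-bit errors and E2 as all patterns of two or more errors inside one 4-bit nibble; the function and variable names are illustrative.

```python
from itertools import combinations

H = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
]
B = 4            # nibble size
N = len(H[0])    # code word length

def syndrome(H, error_positions):
    """Syndrome of the error word with 1s at the given 0-based positions."""
    return tuple(sum(row[j] for j in error_positions) % 2 for row in H)

E1 = [(j,) for j in range(N)]
E2 = [tuple(start + i for i in subset)
      for start in range(0, N, B)
      for size in range(2, B + 1)
      for subset in combinations(range(B), size)]

def corrects_e1_detects_e2(H, E1, E2) -> bool:
    s1 = [syndrome(H, e) for e in E1]
    s2 = [syndrome(H, e) for e in E2]
    all_nonzero = all(any(s) for s in s1 + s2)   # condition 1
    e1_distinct = len(set(s1)) == len(s1)        # condition 2
    e2_not_confused = set(s1).isdisjoint(s2)     # condition 3
    return all_nonzero and e1_distinct and e2_not_confused

print(len(E1), len(E2))                   # 12 33
print(corrects_e1_detects_e2(H, E1, E2))  # True
```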

SEC-DED-SBD Codes (contd.)

- Comments
- The code can be converted to a systematic code
- Distance of the code can be increased by 1 to make it a DED code
- This code can also be shortened

Summary

- Why ECC in Fault tolerance
- Hamming code – by example
- Algebra and Algebraic coding
- Codes
- Hardware
- SEC-SBD code
