Number Representation Part 2 Floating Point Representations Rounding

ECE 645: Lecture 5 Number Representation Part 2 Floating Point Representations Rounding Representation of the Galois Field elements

Required Reading Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Design Chapter 17, Floating-Point Representations Chapter 17.5, Rounding schemes Rounding Algorithms 101 http://www.eetimes.com/document.asp?doc_id=1274515

Floating Point Representations

The ANSI/IEEE standard floating-point number representation formats Originally IEEE 754-1985. Superseded by IEEE 754-2008 Standard. - -

Table 17.1 Some features of the ANSI/IEEE standard floatingpoint number representation formats

1.f 2e f = 0: Representation of  f 0: Representation of NaNs f = 0: Representation of 0 f 0: Representation of denormals, 0.f 2–126 Exponent Encoding Exponent encoding in 8 bits for the single/short (32-bit) ANSI/IEEE format 0 1 126 127 128 254 255 Decimal code 00 01 7E 7F 80 FE FF Hex code Exponent value –126 –1 0 +1 +127

Fig. 17.4 Denormals in the IEEE single-precision format.

New IEEE 754-2008 Standard Basic Formats

New IEEE 754-2008 Standard Binary Interchange Formats

Exact result 1 + 2-1 + 2-23 + 2-24 Requirements for Arithmetic Results of the 4 basic arithmetic operations (+, -, , ) as well as square-rooting must match those obtained if all intermediate computations were infinitely precise That is, a floating-point arithmetic operation should introduce no more imprecision than the error attributable to the final rounding of a result that has no exact representation (this is the best possible) Example: (1 + 2-1)  (1 + 2-23 ) Rounded result 1 + 2-1 + 2-22 Error = ½ ulp

Rounding 101

Rounding Modes The IEEE 754-2008 standard includes five rounding modes: Default: Round to nearest, ties to even (rtne) Optional: Round to nearest, ties away from 0 (rtna) Round toward zero (inward) Round toward + (upward) Round toward – (downward)

Rounding Rounding occurs when we want to approximate a more precise number (i.e. more fractional bits L) with a less precise number (i.e. fewer fractional bits L') Example 1: old: 000110.11010001 (K=6, L=8) new: 000110.11 (K'=6, L'=2) Example 2: old: 000110.11010001 (K=6, L=8) new: 000111. (K'=6, L'=0) The following pages show rounding from L>0 fractional bits to L'=0 bits, but the mathematics hold true for any L' < L Usually, keep the number of integral bits the same K'=K

Whole part Fractional part xk–1xk–2 . . . x1x0.x–1x–2 . . . x–lyk–1yk–2 . . . y1y0 Round Rounding Equation • y = round(x)

Rounding Techniques • There are different rounding techniques: • 1) truncation • results in round towards zero in signed magnitude • results in round towards -∞ in two's complement • 2) round to nearest number • 3) round to nearest even number (or odd number) • 4) round towards +∞ • Other rounding techniques • 5) jamming or von Neumann • 6) ROM rounding • Each of these techniques will differ in their error depending on representation of numbers i.e. signed magnitude versus two's complement • Error = round(x) – x

xk–1xk–2 . . . x1x0.x–1x–2 . . . x–lxk–1xk–2 . . . x1x0 trunc ulp 1) Truncation • Truncation in signed-magnitude results in a number chop(x) that is always of smaller magnitude than x. This is called round towards zero or inward rounding • 011.10 (3.5)10 011 (3)10 • Error = -0.5 • 111.10 (-3.5)10 111 (-3)10 • Error = +0.5 • Truncation in two's complement results in a number chop(x) that is always smaller than x. This is called round towards -∞ or downward-directed rounding • 011.10 (3.5)10 011 (3)10 • Error = -0.5 • 100.10 (-3.5)10 100 (-4)10 • Error = -0.5 The simplest possible rounding scheme: chopping or truncation

x chop( ) 4 4 3 3 2 2 1 1 x x – 4 – 3 – 2 – 1 1 2 3 4 – 4 – 3 – 2 – 1 1 2 3 4 – 1 – 1 – 2 – 2 – 3 – 3 – 4 – 4 Truncation Function Graph: chop(x) x chop( ) Fig. 17.5 Truncation or chopping of a signed-magnitude number (same as round toward 0). Fig. 17.6 Truncation or chopping of a 2’s-complement number (same as round to -∞).

Bias in two's complement truncation • Assuming all combinations of positive and negative values of x equally possible, average error is -0.375 • In general, average error = -(2-L'-2-L )/2, where L' = new number of fractional bits

Implementation truncation in hardware • Easy, just ignore (i.e. truncate) the fractional digits from L to L'+1 xk-1 xk-2 .. x1 x0. x-1 x-2 .. x-L = yk-1 yk-2 .. y1 y0. ignore (i.e. truncate the rest)

2) Round to nearest number • Rounding to nearest number what we normally think of when say round • 010.01 (2.25)10 010 (2)10 • Error = -0.25 • 010.11 (2.75)10 011 (3)10 • Error = +0.25 • 010.00 (2.00)10 010 (2)10 • Error = +0.00 • 010.10 (2.5)10 011 (3)10 • Error = +0.5 [round-half-up (arithmetic rounding)] • 010.10 (2.5)10 010 (2)10 • Error = -0.5 [round-half-down]

Round-half-up: dealing with negative numbers • Rounding to nearest number what we normally think of when say round • 101.11 (-2.25)10 110 (-2)10 • Error = +0.25 • 101.01 (-2.75)10 101 (-3)10 • Error = -0.25 • 110.00 (-2.00)10 110 (-2)10 • Error = +0.00 • 101.10 (-2.5)10 110 (-2)10 • Error = +0.5 [asymmetric implementation] • 101.10 (-2.5)10 101 (-3)10 • Error = -0.5 [symmetric implementation]

Round to Nearest Function Graph: rtn(x)Round-half-up version Symmetric implementation Asymmetric implementation

Bias in two's complement round to nearestRound-half-up asymmetric implementation • Assuming all combinations of positive and negative values of x equally possible, average error is +0.125 • Smaller average error than truncation, but still not symmetric error • We have a problem with the midway value, i.e. exactly at 2.5 or -2.5 leads to positive error bias always • Also have the problem that you can get overflow if only allocate K' = K integral bits • Example: rtn(011.10)  overflow • This overflow only occurs on positive numbers near the maximum positive value, not on negative numbers

Implementing round to nearest (rtn) in hardware Round-half-up asymmetric implementation • Two methods • Method 1: Add '1' in position one digit right of new LSB (i.e. digit L'+1) and keep only L' fractional bits xk-1 xk-2 .. x1 x0. x-1 x-2 .. x-L + 1 = yk-1 yk-2 .. y1 y0. y-1 • Method 2: Add the value of the digit one position to right of new LSB (i.e. digit L'+1) into the new LSB digit (i.e. digit L) and keep only L' fractional bits xk-1 xk-2 .. x1 x0. x-1 x-2 .. x-L + x-1 yk-1 yk-2 .. y1 y0. ignore (i.e. truncate the rest) ignore (i.e truncate the rest)

Fig. 17.9 R* rounding or rounding to the nearest odd number. Round to Nearest Even Function Graph: rtne(x) • To solve the problem with the midway value we implement round to nearest-even number (or can round to nearest odd number) Fig. 17.8 Rounding to the nearest even number.

Bias in two's complement round to nearest even (rtne) • average error is now 0 (ignoring the overflow) • cost: more hardware

4) Rounding towards infinity • We may need computation errors to be in a known direction • Example: in computing upper bounds, larger results are acceptable, but results that are smaller than correct values could invalidate upper bound • Use upward-directed rounding (round toward +∞) • up(x) always larger than or equal to x • Similarly for lower bounds, use downward-directed rounding (round toward -∞) • down(x) always smaller than or equal to x • We have already seen that round toward -∞ in two's complement can be implemented by truncation

Rounding Toward Infinity Function Graph: up(x) and down(x) up(x) down(x) down(x) can be implemented by chop(x) intwo's complement

x inward( ) 4 3 2 1 x – 4 – 3 – 2 – 1 1 2 3 4 – 1 – 2 – 3 – 4 Two's Complement Round to Zero • Two's complement round to zero (inward rounding) also exists

Other Methods • Note that in two's complement round to nearest (rtn) involves an addition which may have a carry propagation from LSB to MSB • Rounding may take as long as an adder takes • Can break the adder chain using the following two techniques: • Jamming or von Neumann • ROM-based

5) Jamming or von Neumann Chop and force the LSB of the result to 1 Simplicity of chopping, with the near-symmetry or ordinary rounding Max error is comparable to chopping (double that of rounding) - - - - - - - - - -

xk–1 . . . x4x3x2x1x0.x–1x–2 . . . x–lxk–1 . . . x4y3y2y1y0 ROM ROM address ROM data 6) ROM Rounding Fig. 17.11 ROM rounding with an 8  2 table. Example: Rounding with a 32  4 table Rounding result is the same as that of the round to nearest scheme in 31 of the 32 possible cases, but a larger error is introduced when x3 = x2 = x1 = x0 = x–1 = 1 - - - - - - - - - -

Representation of the Galois Field elements

Evariste Galois (1811-1832)

Evariste Galois (1811-1832) Studied the problem of finding algebraic solutions for the general equations of the degree  5, e.g., f(x) = a5x5+ a4x4+ a3x3+ a2x2+ a1x+ a0 = 0 Answered definitely the question which specific equations of a given degree have algebraic solutions. On the way, he developed group theory, one of the most important branches of modern mathematics.

Evariste Galois (1811-1832) • Galois submits his results for the first time to • the French Academy of Sciences • Reviewer 1 • Augustin-Luis Cauchy forgot or lost the communication. • 1830 Galois submits the revised version of his manuscript, • hoping to enter the competition for the Grand Prize • in mathematics • Reviewer 2 • Joseph Fourier – died shortly after receiving the manuscript. • 1831 Third submission to the French Academy of Sciences • Reviewer 3 • Simeon-Denis Poisson – did not understand the manuscript • and rejected it.

Evariste Galois (1811-1832) • May 1832 Galois provoked into a duel • The night before the duel he wrote a letter to his friend • containing the summary of his discoveries. • The letter ended with a plea: • “Eventually there will be, I hope, some people who • will find it profitable to decipher this mess.” • May 30, 1832 Galois was grievously wounded in the duel and died • in the hospital the following day. • Galois manuscript rediscovered by Joseph Liouville • 1846 Galois manuscript published for • the first time in a mathematical journal.

Field Set F, and two operations typically denoted by (but not necessarily equivalent to) + and * Set F, and definitions of these two operations must fulfill special conditions.

Examples of fields Infinite fields { R= set of real numbers, + addition of real numbers * multiplication of real numbers } Finite fields { set Zp={0, 1, 2, … , p-1}, + (mod p): addition modulo p, * (mod p): multiplication modulo p }

Finite Fields = Galois Fields p – prime pm – number of elements in the field GF(pm) GF(2m) GF(p) Most significant special cases Arithmetic operations present in many libraries Normal basis representation Polynomial basis representation Fast in hardware Fast squaring

Number Representation Part 2 Floating Point Representations Rounding

Number Representation Part 2 Floating Point Representations Rounding

Presentation Transcript

Floating Point

Number Representation Part 2 Floating Point Representations Little-Endian vs. Big-Endian Representations Galois Field Re

Floating Point Representation

9.4 FLOATING-POINT REPRESENTATION

Number Representation (Part 2)

Floating Point: Representation and Instructions

Symbol Representation and Floating Point Numbers Chapters 2 and 3

Number Representation Part 2 Fixed-Radix Signed Representations Floating Point Representations

Floating Point (FLP) Representation

Data Representation: Floating Point for Real Numbers

Floating Point Representations

Floating Point Representation

Integer Arithmetic Floating Point Representation Floating Point Arithmetic

Floating Point Representation

Floating Point Arithmetic – Part I

Floating-Point Representation

Number Representation Part 2 Little-Endian vs. Big-Endian Representations

Number Representation Part 1 Fixed-Radix Unsigned Representations

Floating Point Operations - Part II

Number Representation Part 2 Little-Endian vs. Big-Endian Representations

Floating Point Representation