This seminar presentation by Ho Chun Hok explores the intricacies of floating-point arithmetic, focusing on the IEEE 754 standard. It discusses normal and subnormal numbers, special cases like NaN and infinity, and precision issues with single- and double-precision representations. The talk covers operations such as addition, multiplication, and rounding, highlighting potential pitfalls for programmers. Case studies demonstrate compliance challenges and the importance of correct exception handling in floating-point calculations. A must-watch for computer scientists looking to deepen their understanding of numerical representation.
A gentle introduction to floating point arithmetic Ho Chun Hok (cho@doc.ic.ac.uk) Custom Computing Group Seminar, 25 Nov 2005
IEEE 754: floating point standard A value v occupies vsize bits, split into a sign bit s, an esize-bit exponent field, and an fsize-bit fraction field: • Normal numbers (0 < exponent < max exponent) • v = (-1)^s x 2^(exponent - bias) x (1.fraction) • Subnormal numbers (exponent = 0) • v = (-1)^s x 2^(1 - bias) x (0.fraction) • Special numbers (exponent = max exponent) • Infinity, NaN (not a number) • Precisions • Single: esize = 8, fsize = 23, vsize = 32 (bias = 127) • Double: esize = 11, fsize = 52, vsize = 64 (bias = 1023) • Double extended: vsize > 64 • Operations: • +, -, x, /, sqrt, f2i, i2f, compare, f2d, d2f • Rounding modes • Nearest even, +inf, -inf, towards 0
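The single-precision bit layout above can be sketched in C. This is a minimal illustration, not from the slides; `decode_float` and `recompose` are names introduced here, and the field widths follow the slide (esize = 8, fsize = 23, vsize = 32, bias = 127):

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Split a float into its IEEE 754 sign, biased exponent, and fraction fields. */
static void decode_float(float f, uint32_t *s, uint32_t *e, uint32_t *frac)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the bits without aliasing UB */
    *s    = bits >> 31;               /* 1-bit sign                */
    *e    = (bits >> 23) & 0xFF;      /* 8-bit biased exponent     */
    *frac = bits & 0x7FFFFF;          /* 23-bit fraction           */
}

/* Reassemble the value: normal numbers have an implicit leading 1;
   subnormals (e == 0) use 0.fraction with the minimum exponent 1 - bias. */
static double recompose(uint32_t s, uint32_t e, uint32_t frac)
{
    double sign = s ? -1.0 : 1.0;
    if (e == 0)                       /* subnormal number */
        return sign * ldexp((double)frac / (1 << 23), 1 - 127);
    return sign * ldexp(1.0 + (double)frac / (1 << 23), (int)e - 127);
}
```

For example, 1.0f decodes to sign 0, exponent 127 (i.e. unbiased 0), fraction 0.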
What the IEEE 754 standard is supposed to provide • Approximation to real numbers with a known error bound • the rounding error incurred when mapping any real number to a floating point number can be determined • Results of all operations are correctly rounded when the exact result is inexact • Some math properties hold (in general): x+y == y+x, -(-a) == a, a>=b and c>=0 implies a*c >= b*c, x+0 == x, y*y >= 0 • All exceptions can be detected • Using exception flags • Same results across different machines
How to ensure the standard? • Processor? • Rounding numbers in different modes • Gradual underflow • Raise exceptions • Operating System? • Handle exceptions • Handle operations that may not be supported in hardware (what if a processor cannot handle subnormal numbers?) • Keep track of the floating point unit state (precision, rounding mode) • Programming Language? • Well-defined semantics for floating point (yes, we have the infamous JAVA language) • Compiler? • Preserve the semantics defined by the language • Programmer? • Read: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Case study 1

#include <stdio.h>

int main(void)
{
    double ref, index;
    int i;

    ref = (double) 169.0 / (double) 170.0;
    for (i = 0; i < 250; i++) {
        index = i;
        if (ref == (double) (index / (index + 1.0)))
            break;
    }
    printf("i=%d\n", i);
    return 0;
}
Visual C compiler, running on a Pentium M • Same result on lulu (Pentium III) and irina (Xeon)
VCC:
        fld     qword ptr [ebp-10h]
        fadd    qword ptr [__real@8@3fff8000000000000000 (00426028)]
        fdivr   qword ptr [ebp-10h]
        fcomp   qword ptr [ebp-8]
        fnstsw  ax
        test    ah,40h
        je      main+5Fh (0040106f)
        jmp     main+61h (00401071)

gcc:
        fld1
        faddp   %st,%st(1)
        fldl    0xfffffff0(%ebp)
        fdivp   %st,%st(1)
        fldl    0xfffffff8(%ebp)
        fxch    %st(1)
        fucompp
        fnstsw  %ax
        and     $0x45,%ah
        cmp     $0x40,%ah
        je      0x80483d2 <main+102>

VCC uses the normal stack (ebp), at 64 bits, to store the intermediate result, and compares it with a 64-bit double precision value. gcc uses the FPU register stack (st), at 80 bits, to store the intermediate result, and compares that with a 64-bit double precision value.
Case study 1 • It’s a compiler issue • Using more precision for the intermediate result is a good idea • But the compiler should convert the 80-bit floating point number to 64 bits before the comparison • And it’s a programmer issue too • Equality tests between FP variables are dangerous • We can detect the problem before it hurts… • It is not easy to comply with the standard
Case study 2 • Calculate [formula missing from the slide] • When x is large, result == 0, rather than [expression missing] • Beware: even when everything complies with the standard, the standard cannot guarantee the result is always correct • Again, the programmer should detect this before it hurts • Define a routine to trap the exception • Exceptions are not errors as long as they are handled correctly
Case Study 3 • Jean-Michel Muller’s Recurrence: x[n+1] = 111 - 1130/x[n] + 3000/(x[n] * x[n-1])
Using double extended precision (80-bits) • x[2] = 5.590164e+00 • x[3] = 5.633431e+00 • x[4] = 5.674649e+00 • x[5] = 5.713329e+00 • x[6] = 5.749121e+00 • x[7] = 5.781811e+00 • x[8] = 5.811314e+00 • x[9] = 5.837660e+00 • x[10] = 5.861018e+00 • x[11] = 5.882514e+00 • x[12] = 5.918471e+00 • x[13] = 6.240859e+00 • x[14] = 1.115599e+01 • x[15] = 5.279838e+01 • x[16] = 9.469105e+01 • x[17] = 9.966651e+01 • x[18] = 9.998007e+01 • x[19] = 9.999881e+01 It converges to 100.0, so it seems correct
Case Study 3 • This series can converge to 5, 6, or 100 • It depends on the values of x0, x1 • If x0 = 11/2, x1 = 61/11, the series should converge to 6 • A little round-off error can affect the result dramatically • We can calculate the result analytically by substituting [expression missing] • In general, it’s very difficult to detect this error
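The recurrence from the slides is easy to reproduce; in double precision it behaves just like the 80-bit run above, drifting away from the true limit 6 and locking onto the attracting fixed point 100. A minimal sketch; `muller` is a name introduced here:

```c
/* Muller's recurrence: x[n+1] = 111 - 1130/x[n] + 3000/(x[n]*x[n-1]).
   With x0 = 11/2 and x1 = 61/11 the exact limit is 6, but any finite
   precision perturbs the iterates toward the fixed point 100. */
static double muller(int n)
{
    double x0 = 11.0 / 2.0;    /* exact starting values from the slide */
    double x1 = 61.0 / 11.0;
    for (int i = 2; i <= n; i++) {
        double x2 = 111.0 - 1130.0 / x1 + 3000.0 / (x1 * x0);
        x0 = x1;
        x1 = x2;
    }
    return x1;
}
```

The first iterates match the slide's table (x[2] = 341/61 ≈ 5.590164), yet by n = 30 the sequence has settled at 100 rather than 6.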
Case Study 4 • Table maker’s dilemma • If we want n-digit accuracy for an elementary function like sine or cosine, we (in most cases) need to calculate up to n+2 digits • What if the last 2 digits are “10”? • We can calculate the last 3 digits • What if the last 3 digits are “100”? • We can calculate the last 4 digits • What if… • The results of most elementary function libraries (e.g. libm) are not correctly rounded in some cases
Conclusion • We have a standard representation for floating point numbers • Conforming to the standard requires collaboration between different parties • Even on a standard-compliant platform, caution must be taken with underflow and overflow, and the algorithm must be numerically stable • When using elementary functions, don’t expect the results to be comparable between different machines • Elementary functions are NOT included in the IEEE standard • Floating point, when used properly, can do something serious