This seminar presentation by Ho Chun Hok explores the intricacies of floating-point arithmetic, focusing on the IEEE 754 standard. It discusses normal and subnormal numbers, special cases like NaN and infinity, and precision issues with single- and double-precision representations. The talk covers operations such as addition, multiplication, and rounding, highlighting potential pitfalls for programmers. Case studies demonstrate compliance challenges and the importance of correct exception handling in floating-point calculations. A must-watch for computer scientists looking to deepen their understanding of numerical representation.
A gentle introduction to floating point arithmetic Ho Chun Hok (cho@doc.ic.ac.uk) Custom Computing Group Seminar, 25 Nov 2005
IEEE 754: floating point standard A value v occupies vsize bits, split into a sign bit s, an esize-bit exponent field, and an fsize-bit fraction field: • Normal numbers (0 < exponent < max exponent) • v = (-1)^s x 2^(exponent - bias) x (1.fraction) • Subnormal numbers (exponent = 0) • v = (-1)^s x 2^(1 - bias) x (0.fraction) • Special numbers (exponent = max exponent) • Infinity, NaN (not a number) • Precisions • Single: esize = 8, fsize = 23, vsize = 32 (bias = 127) • Double: esize = 11, fsize = 52, vsize = 64 (bias = 1023) • Double extended: vsize > 64 • Operations: • +, -, x, /, sqrt, f2i, i2f, compare, f2d, d2f • Rounding modes • Nearest even, +inf, -inf, towards 0
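The single-precision bit layout above can be sketched in C. This is a minimal illustration, not from the slides; `decode_float` and `recompose` are names introduced here, and the field widths follow the slide (esize = 8, fsize = 23, vsize = 32, bias = 127):

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Split a float into its IEEE 754 sign, biased exponent, and fraction fields. */
static void decode_float(float f, uint32_t *s, uint32_t *e, uint32_t *frac)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the bits without aliasing UB */
    *s    = bits >> 31;               /* 1-bit sign                */
    *e    = (bits >> 23) & 0xFF;      /* 8-bit biased exponent     */
    *frac = bits & 0x7FFFFF;          /* 23-bit fraction           */
}

/* Reassemble the value: normal numbers have an implicit leading 1;
   subnormals (e == 0) use 0.fraction with the minimum exponent 1 - bias. */
static double recompose(uint32_t s, uint32_t e, uint32_t frac)
{
    double sign = s ? -1.0 : 1.0;
    if (e == 0)                       /* subnormal number */
        return sign * ldexp((double)frac / (1 << 23), 1 - 127);
    return sign * ldexp(1.0 + (double)frac / (1 << 23), (int)e - 127);
}
```

For example, 1.0f decodes to sign 0, exponent 127 (i.e. unbiased 0), fraction 0.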
What the IEEE 754 standard is supposed to provide • Approximation to real numbers with a known error bound • the rounding error incurred when mapping any real number to a floating point number can be determined • Results of all operations are correctly rounded when the exact result is inexact • Some math properties hold (in general): x+y == y+x, -(-a) == a, a>=b and c>=0 implies a*c >= b*c, x+0 == x, y*y >= 0 • All exceptions can be detected • Using exception flags • Same results across different machines
How to ensure the standard? • Processor? • Rounding numbers in different modes • Gradual underflow • Raise exceptions • Operating System? • Handle exceptions • Handle operations that may not be supported in hardware (what if a processor cannot handle subnormal numbers?) • Keep track of the floating point unit state (precision, rounding mode) • Programming Language? • Well-defined semantics for floating point (yes, we have the infamous JAVA language) • Compiler? • Preserve the semantics defined by the language • Programmer? • Read: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Case study 1

#include <stdio.h>

int main(void)
{
    double ref, index;
    int i;

    ref = (double) 169.0 / (double) 170.0;
    for (i = 0; i < 250; i++) {
        index = i;
        if (ref == (double) (index / (index + 1.0)))
            break;
    }
    printf("i=%d\n", i);
    return 0;
}
Visual C compiler, running on a Pentium M • Same result on lulu (Pentium III) and irina (Xeon)
VCC:
        fld     qword ptr [ebp-10h]
        fadd    qword ptr [__real@8@3fff8000000000000000 (00426028)]
        fdivr   qword ptr [ebp-10h]
        fcomp   qword ptr [ebp-8]
        fnstsw  ax
        test    ah,40h
        je      main+5Fh (0040106f)
        jmp     main+61h (00401071)

gcc:
        fld1
        faddp   %st,%st(1)
        fldl    0xfffffff0(%ebp)
        fdivp   %st,%st(1)
        fldl    0xfffffff8(%ebp)
        fxch    %st(1)
        fucompp
        fnstsw  %ax
        and     $0x45,%ah
        cmp     $0x40,%ah
        je      0x80483d2 <main+102>

VCC uses the normal stack (ebp), at 64 bits, to store the intermediate result, and compares it with a 64-bit double precision value. gcc uses the FPU register stack (st), at 80 bits, to store the intermediate result, and compares that with a 64-bit double precision value.
Case study 1 • It’s a compiler issue • Using more precision for the intermediate result is a good idea • But the compiler should convert the 80-bit floating point number to 64 bits before the comparison • And it’s a programmer issue too • Equality tests between FP variables are dangerous • We can detect the problem before it hurts… • It is not easy to comply with the standard
Case study 2 • Calculate [formula missing from the slide] • When x is large, result == 0, rather than [expression missing] • Beware: even when everything complies with the standard, the standard cannot guarantee the result is always correct • Again, the programmer should detect this before it hurts • Define a routine to trap the exception • Exceptions are not errors as long as they are handled correctly
Case Study 3 • Jean-Michel Muller’s Recurrence: x[n+1] = 111 - 1130/x[n] + 3000/(x[n] * x[n-1])
Using double extended precision (80-bits) • x[2] = 5.590164e+00 • x[3] = 5.633431e+00 • x[4] = 5.674649e+00 • x[5] = 5.713329e+00 • x[6] = 5.749121e+00 • x[7] = 5.781811e+00 • x[8] = 5.811314e+00 • x[9] = 5.837660e+00 • x[10] = 5.861018e+00 • x[11] = 5.882514e+00 • x[12] = 5.918471e+00 • x[13] = 6.240859e+00 • x[14] = 1.115599e+01 • x[15] = 5.279838e+01 • x[16] = 9.469105e+01 • x[17] = 9.966651e+01 • x[18] = 9.998007e+01 • x[19] = 9.999881e+01 It converges to 100.0, so it seems correct
Case Study 3 • This series can converge to 5, 6, or 100 • It depends on the values of x0, x1 • If x0 = 11/2, x1 = 61/11, the series should converge to 6 • A little round-off error can affect the result dramatically • We can calculate the result analytically by substituting [expression missing] • In general, it’s very difficult to detect this error
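The recurrence from the slides is easy to reproduce; in double precision it behaves just like the 80-bit run above, drifting away from the true limit 6 and locking onto the attracting fixed point 100. A minimal sketch; `muller` is a name introduced here:

```c
/* Muller's recurrence: x[n+1] = 111 - 1130/x[n] + 3000/(x[n]*x[n-1]).
   With x0 = 11/2 and x1 = 61/11 the exact limit is 6, but any finite
   precision perturbs the iterates toward the fixed point 100. */
static double muller(int n)
{
    double x0 = 11.0 / 2.0;    /* exact starting values from the slide */
    double x1 = 61.0 / 11.0;
    for (int i = 2; i <= n; i++) {
        double x2 = 111.0 - 1130.0 / x1 + 3000.0 / (x1 * x0);
        x0 = x1;
        x1 = x2;
    }
    return x1;
}
```

The first iterates match the slide's table (x[2] = 341/61 ≈ 5.590164), yet by n = 30 the sequence has settled at 100 rather than 6.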
Case Study 4 • Table maker’s dilemma • If we want n-digit accuracy for an elementary function like sine or cosine, we (in most cases) need to calculate up to n+2 digits • What if the last 2 digits are “10”? • We can calculate the last 3 digits • What if the last 3 digits are “100”? • We can calculate the last 4 digits • What if… • The results of most elementary function libraries (e.g. libm) are not correctly rounded in some cases
Conclusion • We have a standard representation for floating point numbers • Conforming to the standard requires collaboration between different parties • Even on a standard-compliant platform, caution must be taken with underflow and overflow, and the algorithm must be numerically stable • When using elementary functions, don’t expect the results to be comparable between different machines • Elementary functions are NOT included in the IEEE standard • Floating point, when used properly, can do something serious