
Floating Point Arithmetic



  1. Floating Point Arithmetic

  2. Hardware vs. Software • Can build the ALU (Arithmetic Logic Unit) to perform Floating Point Arithmetic • Faster • More expensive • Less of an issue as technology improves • Can simulate the operations of Floating Point with multiple integer operations • Done by the compiler • Slower • Cheaper

  3. IEEE Floating Point Layout • Single Precision – 32 bits • Left bit is a sign bit • Next 8 are exponent • Next 23 are mantissa • Double Precision – 64 bits • Left bit is a sign bit • Next 11 are exponent • Next 52 are mantissa
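To make the single-precision layout concrete, here is a minimal C sketch (not from the slides) that unpacks the three fields of a float; the example value -6.25f is an arbitrary choice.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Unpack the three single-precision fields of a float.
   memcpy avoids the undefined behavior of pointer type-punning. */
int main(void) {
    float f = -6.25f;                        /* arbitrary example value */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t sign     = bits >> 31;          /* 1 bit               */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, excess-127  */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits; the leading 1 is implied */

    printf("sign=%u  stored exponent=%u (real %d)  mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}
```

For -6.25f = -1.5625 × 2², this prints sign=1, a stored exponent of 129 (real exponent 2), and mantissa 0x480000.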

  4. Floating Point Addition • Performed in several steps • Line up the binary points (the base-2 analogue of lining up decimal points) • Now the exponents are the same • Add the mantissas • Exponent of the result is the same as the exponents of the operands • Normalize if necessary • Place in proper scientific notation

  5. Equalizing Exponents • In math, we can shift the value with the larger exponent left while decreasing the exponent until the exponents are equal • But the hardware has no place to shift the value left into • The leading 1 sits just left of the implied binary point • We must shift the value with the smaller exponent right and increase the exponent • The bits shifted off the right end are lost • Insignificant – low-order bits – they won’t affect the answer much • Some hardware keeps extra (guard) bits just for computation, not for the answer

  6. Adding • Now the bits for the mantissa can be added. • Just like adding integers (but with fewer than 32 bits) • The exponent of the answer is the same exponent as the operands.

  7. Normalizing • In scientific notation, the mantissa of the operands is between 1 and 2. • After getting the exponents equal, the mantissa of the shifted operand is between 0 and 2. • So, the result is between 1 and 4 • Unless one of the operands is negative; then the result can be between 0 and 2 (in absolute value) • We may need to shift the result left to get a 1 bit into the leftmost bit of the answer • We may need to shift the result right to get the result in the proper range
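The steps on slides 4–7 can be strung together in software, as slide 2 suggested. Below is a toy C sketch that adds two positive, normalized floats using only integer operations; it truncates instead of rounding and ignores zeros, infinities, and NaNs, so it is an illustration of the technique, not a real implementation.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Toy software float adder: positive, normalized inputs only, and it
   truncates instead of rounding. Real hardware keeps extra (guard)
   bits and handles signs, zeros, infinities, and NaNs. */
static float toy_fadd(float a, float b) {
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);

    int32_t  ea = (ua >> 23) & 0xFF, eb = (ub >> 23) & 0xFF;
    uint32_t ma = (ua & 0x7FFFFF) | 0x800000;   /* restore the hidden 1 */
    uint32_t mb = (ub & 0x7FFFFF) | 0x800000;

    /* Step 1: shift the mantissa with the smaller exponent right. */
    if (ea < eb) { ma >>= (eb - ea); ea = eb; }
    else         { mb >>= (ea - eb); }

    /* Step 2: add the mantissas; the result keeps exponent ea. */
    uint32_t m = ma + mb;

    /* Step 3: normalize. The sum of two positives is in [1, 4),
       so at most one right shift moves the leading 1 back to bit 23. */
    if (m & 0x1000000) { m >>= 1; ea += 1; }

    uint32_t ur = ((uint32_t)ea << 23) | (m & 0x7FFFFF);
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}

int main(void) {
    printf("%f\n", toy_fadd(1.5f, 2.25f));   /* prints 3.750000 */
    return 0;
}
```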

  8. Correct Results • What happens when we add two values of very different magnitude? • We must shift one of the values many places • The rightmost bits “fall off” the end • The answer will not be “exact”, but very close • When would this happen? • What if we are summing many, many values? • Sum = Sum + A[I] • Sum can get so big compared to A[I] that Sum does not change.
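A quick C demonstration of the effect: a float carries only 24 significant bits, so by the time the sum reaches 1e8 (about 2^26.6), adding 1.0 changes nothing at all.

```c
#include <stdio.h>

/* A float has only 24 significant bits, so adding 1.0 to 1e8
   shifts the 1 completely off the right end of the mantissa. */
int main(void) {
    float sum = 100000000.0f;   /* 1e8, exactly representable in float */
    sum = sum + 1.0f;
    printf("%.1f\n", sum);      /* still 100000000.0 */
    return 0;
}
```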

  9. Multiplication • Actually a little easier • Do unsigned multiplication with the mantissas • Add the exponents • Normalize the result • Set the sign bit of the result

  10. Multiplication Details • We have already done unsigned multiplication. • To add the exponents we need to look at the notation. • The exponents use excess-127 notation • fpe1 = reale1 + 127 • Result = fpe1 + fpe2 = reale1 + reale2 + 127 + 127 • Need to subtract 127 from the result to get the appropriate value
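As a small sketch of that arithmetic, assuming the excess-127 single-precision exponents above:

```c
#include <stdio.h>

/* Stored exponents carry the +127 bias, so summing two of them
   counts the bias twice; subtracting 127 once rebiases the product. */
static int product_exponent(int fpe1, int fpe2) {
    return fpe1 + fpe2 - 127;   /* = reale1 + reale2 + 127 */
}

int main(void) {
    /* 2^3 * 2^4 = 2^7: stored exponents 130 and 131 give stored 134. */
    printf("%d\n", product_exponent(130, 131));   /* 134, i.e. real exponent 7 */
    return 0;
}
```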

  11. Sign • The sign of the result depends on the signs of the operands • If both operands have the same sign, the result is positive; otherwise the result is negative • This is the XOR function • Of course, we must still normalize the result • This may require more shifts

  12. True Division • Do unsigned division on the mantissas • Discussed with integers. • Subtract the exponents • Now need to add 127 to get the correct representation of the value • Normalize the result • Same as previous methods • Set the sign • Same as with multiplication
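The exponent handling mirrors multiplication, just in the other direction. A sketch, again assuming excess-127 exponents:

```c
#include <stdio.h>

/* Division undoes the bias the other way: subtracting two stored
   exponents cancels the +127 entirely, so it must be added back.
   The sign is the XOR of the operand signs, as with multiplication. */
static int quotient_exponent(int fpe1, int fpe2) {
    return fpe1 - fpe2 + 127;   /* = reale1 - reale2 + 127 */
}

int main(void) {
    /* 2^5 / 2^2 = 2^3: stored exponents 132 and 129 give stored 130. */
    printf("%d\n", quotient_exponent(132, 129));   /* 130, i.e. real exponent 3 */
    return 0;
}
```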

  13. Division by Reciprocal • Calculate a/b as a* (1/b) • This is useful only if we can compute (1/b) without using division. • Use a Newton-Raphson technique (discussed in CSCI 381) • Repeat • r = r * (2 – r*b) • Until r does not change • r starts with a first guess at the reciprocal and gets closer with each iteration
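A small C sketch of the iteration. The starting guess and the iteration cap are choices made for this example; real hardware typically seeds the guess from a small lookup table.

```c
#include <stdio.h>

/* Newton-Raphson reciprocal: computes 1/b using only multiplies and
   subtracts. Converges when 0 < r0*b < 2. */
static double reciprocal(double b, double r0) {
    double r = r0;
    for (int i = 0; i < 100; i++) {      /* cap the loop as a safety net */
        double next = r * (2.0 - r * b);
        if (next == r) break;            /* r no longer changes: done */
        r = next;
    }
    return r;
}

int main(void) {
    double b = 3.0;
    double r = reciprocal(b, 0.3);       /* 0 < 0.3*3 < 2, so it converges */
    printf("1/b = %.17f\n", r);          /* ~0.33333333333333331 */
    printf("a/b = %.17f\n", 7.0 * r);    /* a * (1/b) with a = 7 */
    return 0;
}
```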

  14. Errors • Floating point numbers are not exact • Do NOT compare floating point numbers for equality. Adding 0.1 ten times does not give exactly 1. • Instead of using “if (a == b)” when a and b are floating point, use if (fabs(a - b) < 0.0001) or some other reasonable measure of “close enough”
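A C demonstration of why: 0.1 has no exact binary representation, so ten of them do not sum to exactly 1.0. The 1e-9 tolerance here is an arbitrary choice of “close enough”.

```c
#include <stdio.h>
#include <math.h>

/* Equality fails even though the sum is within one part in 10^16 of 1. */
int main(void) {
    double a = 0.0;
    for (int i = 0; i < 10; i++) a += 0.1;

    printf("a                = %.17f\n", a);                  /* 0.99999999999999989 */
    printf("a == 1.0         : %d\n", a == 1.0);              /* 0 (false)           */
    printf("fabs(a-1) < 1e-9 : %d\n", fabs(a - 1.0) < 1e-9);  /* 1 (true)            */
    return 0;
}
```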

  15. Rounding in Base 2 • Round to the nearest • Ties round to the value whose least significant bit is 0 (“round half to even”) • Round towards 0 • Truncation • Round towards positive infinity • Round up (careful with negative values) • Round towards negative infinity • Round down
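C99 exposes these four directions through <fenv.h>. A sketch, with the caveat that support for the FENV_ACCESS pragma and the optional FE_* macros varies by compiler and platform:

```c
#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON   /* some compilers ignore this pragma */

/* rint() rounds to an integer in the current rounding mode, which
   makes the four IEEE directions easy to compare side by side. */
int main(void) {
    const int   modes[] = { FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD };
    const char *names[] = { "to nearest (ties to even)", "toward zero",
                            "toward +infinity", "toward -infinity" };

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        printf("%-26s rint(2.5) = %4.1f   rint(-2.5) = %4.1f\n",
               names[i], rint(2.5), rint(-2.5));
    }
    return 0;
}
```

Note how “round up” and “round down” treat 2.5 and -2.5 asymmetrically, which is the “careful with negative values” caveat above.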

  16. Overflow and Underflow • Overflow for integers occurs when the result is too big to be held in the number of bits allocated. • The same is true for Floating Point; however, this is determined more by the size of the exponent field than the size of the mantissa field. • Underflow is when a value becomes so small that it becomes 0. • Again, this is governed by the exponent field, but with negative exponents
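A short C demonstration using the limits from <float.h>: overflow saturates to infinity, while underflow passes through the subnormal range before finally flushing to zero.

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    float big  = FLT_MAX;   /* largest finite float  */
    float tiny = FLT_MIN;   /* smallest normal float */

    printf("FLT_MAX * 2     = %g\n", big * 2.0f);    /* inf: overflow           */
    printf("FLT_MIN / 2     = %g\n", tiny / 2.0f);   /* nonzero: a subnormal    */
    printf("FLT_MIN / 1e20f = %g\n", tiny / 1e20f);  /* 0: complete underflow   */
    return 0;
}
```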
