
Floating Point Arithmetic



  1. Floating Point Arithmetic

  2. Hardware vs. Software • Can build the ALU (Arithmetic Logic Unit) to perform Floating Point Arithmetic • Faster • More expensive • Less of an issue as technology improves • Can simulate the operations of Floating Point with multiple integer operations • Done by the compiler • Slower • Cheaper

  3. IEEE Floating Point Layout • Single Precision – 32 bits • Left bit is a sign bit • Next 8 are exponent • Next 23 are mantissa • Double Precision – 64 bits • Left bit is a sign bit • Next 11 are exponent • Next 52 are mantissa
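To make the single-precision layout concrete, here is a minimal C sketch (not from the slides) that unpacks the three fields of a float; the example value -6.25f is an arbitrary choice.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Unpack the three single-precision fields of a float.
   memcpy avoids the undefined behavior of pointer type-punning. */
int main(void) {
    float f = -6.25f;                        /* arbitrary example value */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t sign     = bits >> 31;          /* 1 bit               */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, excess-127  */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits; the leading 1 is implied */

    printf("sign=%u  stored exponent=%u (real %d)  mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}
```

For -6.25f = -1.5625 × 2², this prints sign=1, a stored exponent of 129 (real exponent 2), and mantissa 0x480000.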

  4. Floating Point Addition • Performed in several steps • Line up the binary points (the base-2 analogue of lining up decimal points) • Now the exponents are the same • Add the mantissas • Exponent of the result is the same as the exponents of the operands • Normalize if necessary • Place in proper scientific notation

  5. Equalizing Exponents • In math, we can shift the value with the larger exponent left while decreasing the exponent until the exponents are equal • But the hardware has no place to shift the value left into • The leading 1 sits just left of the implied binary point • We must shift the value with the smaller exponent right and increase the exponent • The bits shifted off the right end are lost • Insignificant – low-order bits – they won’t affect the answer much • Some hardware keeps extra (guard) bits just for computation, not for the answer

  6. Adding • Now the bits for the mantissa can be added. • Just like adding integers (but with fewer than 32 bits) • The exponent of the answer is the same exponent as the operands.

  7. Normalizing • In scientific notation, the mantissa of the operands is between 1 and 2. • After getting the exponents equal, the mantissa of the shifted operand is between 0 and 2. • So, the result is between 1 and 4 • Unless one of the operands is negative; then the result can be between 0 and 2 (in absolute value) • We may need to shift the result left to get a 1 bit into the leftmost bit of the answer • We may need to shift the result right to get the result in the proper range
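The steps on slides 4–7 can be strung together in software, as slide 2 suggested. Below is a toy C sketch that adds two positive, normalized floats using only integer operations; it truncates instead of rounding and ignores zeros, infinities, and NaNs, so it is an illustration of the technique, not a real implementation.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Toy software float adder: positive, normalized inputs only, and it
   truncates instead of rounding. Real hardware keeps extra (guard)
   bits and handles signs, zeros, infinities, and NaNs. */
static float toy_fadd(float a, float b) {
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);

    int32_t  ea = (ua >> 23) & 0xFF, eb = (ub >> 23) & 0xFF;
    uint32_t ma = (ua & 0x7FFFFF) | 0x800000;   /* restore the hidden 1 */
    uint32_t mb = (ub & 0x7FFFFF) | 0x800000;

    /* Step 1: shift the mantissa with the smaller exponent right. */
    if (ea < eb) { ma >>= (eb - ea); ea = eb; }
    else         { mb >>= (ea - eb); }

    /* Step 2: add the mantissas; the result keeps exponent ea. */
    uint32_t m = ma + mb;

    /* Step 3: normalize. The sum of two positives is in [1, 4),
       so at most one right shift moves the leading 1 back to bit 23. */
    if (m & 0x1000000) { m >>= 1; ea += 1; }

    uint32_t ur = ((uint32_t)ea << 23) | (m & 0x7FFFFF);
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}

int main(void) {
    printf("%f\n", toy_fadd(1.5f, 2.25f));   /* prints 3.750000 */
    return 0;
}
```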

  8. Correct Results • What happens when we add two values of very different magnitude? • We must shift one of the values many places • The rightmost bits “fall off” the end • The answer will not be “exact”, but very close • When would this happen? • What if we are summing many, many values? • Sum = Sum + A[I] • Sum can get so big compared to A[I] that Sum does not change.
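A quick C demonstration of the effect: a float carries only 24 significant bits, so by the time the sum reaches 1e8 (about 2^26.6), adding 1.0 changes nothing at all.

```c
#include <stdio.h>

/* A float has only 24 significant bits, so adding 1.0 to 1e8
   shifts the 1 completely off the right end of the mantissa. */
int main(void) {
    float sum = 100000000.0f;   /* 1e8, exactly representable in float */
    sum = sum + 1.0f;
    printf("%.1f\n", sum);      /* still 100000000.0 */
    return 0;
}
```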

  9. Multiplication • Actually a little easier • Do unsigned multiplication with the mantissas • Add the exponents • Normalize the result • Set the sign bit of the result

  10. Multiplication Details • We have already done unsigned multiplication. • To add the exponents we need to look at the notation. • The exponents use excess-127 notation • fpe1 = reale1 + 127 • Result = fpe1 + fpe2 = reale1 + reale2 + 127 + 127 • Need to subtract 127 from the result to get the appropriate value
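As a small sketch of that arithmetic, assuming the excess-127 single-precision exponents above:

```c
#include <stdio.h>

/* Stored exponents carry the +127 bias, so summing two of them
   counts the bias twice; subtracting 127 once rebiases the product. */
static int product_exponent(int fpe1, int fpe2) {
    return fpe1 + fpe2 - 127;   /* = reale1 + reale2 + 127 */
}

int main(void) {
    /* 2^3 * 2^4 = 2^7: stored exponents 130 and 131 give stored 134. */
    printf("%d\n", product_exponent(130, 131));   /* 134, i.e. real exponent 7 */
    return 0;
}
```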

  11. Sign • The sign of the result depends on the signs of the operands • If both operands have the same sign, the result is positive; otherwise the result is negative • This is the XOR function • Of course, we must still normalize the result • This may require more shifts

  12. True Division • Do unsigned division on the mantissas • Discussed with integers. • Subtract the exponents • Now need to add 127 to get the correct representation of the value • Normalize the result • Same as previous methods • Set the sign • Same as with multiplication
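The exponent handling mirrors multiplication, just in the other direction. A sketch, again assuming excess-127 exponents:

```c
#include <stdio.h>

/* Division undoes the bias the other way: subtracting two stored
   exponents cancels the +127 entirely, so it must be added back.
   The sign is the XOR of the operand signs, as with multiplication. */
static int quotient_exponent(int fpe1, int fpe2) {
    return fpe1 - fpe2 + 127;   /* = reale1 - reale2 + 127 */
}

int main(void) {
    /* 2^5 / 2^2 = 2^3: stored exponents 132 and 129 give stored 130. */
    printf("%d\n", quotient_exponent(132, 129));   /* 130, i.e. real exponent 3 */
    return 0;
}
```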

  13. Division by Reciprocal • Calculate a/b as a* (1/b) • This is useful only if we can compute (1/b) without using division. • Use a Newton-Raphson technique (discussed in CSCI 381) • Repeat • r = r * (2 – r*b) • Until r does not change • r starts with a first guess at the reciprocal and gets closer with each iteration
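A small C sketch of the iteration. The starting guess and the iteration cap are choices made for this example; real hardware typically seeds the guess from a small lookup table.

```c
#include <stdio.h>

/* Newton-Raphson reciprocal: computes 1/b using only multiplies and
   subtracts. Converges when 0 < r0*b < 2. */
static double reciprocal(double b, double r0) {
    double r = r0;
    for (int i = 0; i < 100; i++) {      /* cap the loop as a safety net */
        double next = r * (2.0 - r * b);
        if (next == r) break;            /* r no longer changes: done */
        r = next;
    }
    return r;
}

int main(void) {
    double b = 3.0;
    double r = reciprocal(b, 0.3);       /* 0 < 0.3*3 < 2, so it converges */
    printf("1/b = %.17f\n", r);          /* ~0.33333333333333331 */
    printf("a/b = %.17f\n", 7.0 * r);    /* a * (1/b) with a = 7 */
    return 0;
}
```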

  14. Errors • Floating point numbers are not exact • Do NOT compare floating point numbers for equality. Adding 0.1 ten times does not give exactly 1. • Instead of using “if (a == b)” when a and b are floating point, use if (fabs(a - b) < 0.0001) or some other reasonable measure of “close enough”
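A C demonstration of why: 0.1 has no exact binary representation, so ten of them do not sum to exactly 1.0. The 1e-9 tolerance here is an arbitrary choice of “close enough”.

```c
#include <stdio.h>
#include <math.h>

/* Equality fails even though the sum is within one part in 10^16 of 1. */
int main(void) {
    double a = 0.0;
    for (int i = 0; i < 10; i++) a += 0.1;

    printf("a                = %.17f\n", a);                  /* 0.99999999999999989 */
    printf("a == 1.0         : %d\n", a == 1.0);              /* 0 (false)           */
    printf("fabs(a-1) < 1e-9 : %d\n", fabs(a - 1.0) < 1e-9);  /* 1 (true)            */
    return 0;
}
```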

  15. Rounding in Base 2 • Round to the nearest • Ties round to the value whose least significant bit is 0 (“round half to even”) • Round towards 0 • Truncation • Round towards positive infinity • Round up (careful with negative values) • Round towards negative infinity • Round down
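C99 exposes these four directions through <fenv.h>. A sketch, with the caveat that support for the FENV_ACCESS pragma and the optional FE_* macros varies by compiler and platform:

```c
#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON   /* some compilers ignore this pragma */

/* rint() rounds to an integer in the current rounding mode, which
   makes the four IEEE directions easy to compare side by side. */
int main(void) {
    const int   modes[] = { FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD };
    const char *names[] = { "to nearest (ties to even)", "toward zero",
                            "toward +infinity", "toward -infinity" };

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        printf("%-26s rint(2.5) = %4.1f   rint(-2.5) = %4.1f\n",
               names[i], rint(2.5), rint(-2.5));
    }
    return 0;
}
```

Note how “round up” and “round down” treat 2.5 and -2.5 asymmetrically, which is the “careful with negative values” caveat above.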

  16. Overflow and Underflow • Overflow for integers occurs when the result is too big to be held in the number of bits allocated. • The same is true for Floating Point; however, this is determined more by the size of the exponent field than the size of the mantissa field. • Underflow is when a value becomes so small that it becomes 0. • Again, this is governed by the exponent field, but with negative exponents
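A short C demonstration using the limits from <float.h>: overflow saturates to infinity, while underflow passes through the subnormal range before finally flushing to zero.

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    float big  = FLT_MAX;   /* largest finite float  */
    float tiny = FLT_MIN;   /* smallest normal float */

    printf("FLT_MAX * 2     = %g\n", big * 2.0f);    /* inf: overflow           */
    printf("FLT_MIN / 2     = %g\n", tiny / 2.0f);   /* nonzero: a subnormal    */
    printf("FLT_MIN / 1e20f = %g\n", tiny / 1e20f);  /* 0: complete underflow   */
    return 0;
}
```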
