1 / 15

Lecture 16: Computer Arithmetic

Lecture 16: Computer Arithmetic. Today’s topic Floating point numbers IEEE 754 representations FP arithmetic Reminder HW 4 due Monday. Floating Point. Representation for non-integral numbers Including very small and very large numbers Like scientific notation –2.34 × 10 56

Download Presentation

Lecture 16: Computer Arithmetic

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 16: Computer Arithmetic • Today’s topic • Floating point numbers • IEEE 754 representations • FP arithmetic • Reminder • HW 4 due Monday

  2. Floating Point • Representation for non-integral numbers • Including very small and very large numbers • Like scientific notation • –2.34 × 1056 • +0.002 × 10–4 • +987.02 × 109 • In binary • ±1.xxxxxxx2 × 2yyyy • Types float and double in C normalized not normalized

  3. Floating Point Standard • Defined by IEEE Std 754-1985 • Now almost universally adopted • Two representations • Single precision (32-bit) • Double precision (64-bit) • A standard notation enables easy exchange of data between machines and simplifies hardware algorithms – the IEEE 754 standard defines how floating point numbers are represented

  4. IEEE Floating-Point Format single: 8 bitsdouble: 11 bits single: 23 bitsdouble: 52 bits • S: sign bit (0  non-negative, 1  negative) • Normalize significand: 1.0 ≤ |significand| < 2.0 • Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit) • Significand is Fraction with the “1.” restored • Exponent: excess representation: actual exponent + Bias • Ensures exponent is unsigned • Single: Bias = 127; Double: Bias = 1203 S Exponent Fraction

  5. Single Precision Sign Exponent Fraction 1 bit 8 bits 23 bits S E F • More exponent bits  wider range of numbers (not necessarily more • numbers – recall there are infinite real numbers) • More fraction bits  higher precision

  6. Single-Precision Range • Exponents 00000000 and 11111111 reserved • Smallest value • Exponent: 00000001 actual exponent = 1 – 127 = –126 • Fraction: 000…00 significand = 1.0 • ±1.0 × 2–126 ≈ ±1.2 × 10–38 • Largest value • exponent: 11111110 actual exponent = 254 – 127 = +127 • Fraction: 111…11 significand ≈ 2.0 • ±2.0 × 2+127 ≈ ±3.4 × 10+38

  7. Double-Precision Range • Exponents 0000…00 and 1111…11 reserved • Smallest value • Exponent: 00000000001 actual exponent = 1 – 1023 = –1022 • Fraction: 000…00 significand = 1.0 • ±1.0 × 2–1022 ≈ ±2.2 × 10–308 • Largest value • Exponent: 11111111110 actual exponent = 2046 – 1023 = +1023 • Fraction: 111…11 significand ≈ 2.0 • ±2.0 × 2+1023 ≈ ±1.8 × 10+308

  8. Floating-Point Example • Represent –0.75 • –0.75 = (–1)1 × 1.12 × 2–1 • S = 1 • Fraction = 1000…002 • Exponent = –1 + Bias • Single: –1 + 127 = 126 = 011111102 • Double: –1 + 1023 = 1022 = 011111111102 • Single: 1011111101000…00 • Double: 1011111111101000…00

  9. Floating-Point Example • What number is represented by the single-precision float 11000000101000…00 • S = 1 • Fraction = 01000…002 • Fxponent = 100000012 = 129 • x = (–1)1 × (1 + 012) × 2(129 – 127) = (–1) × 1.25 × 22 = –5.0

  10. Details • The number “0” has a special code so that the implicit 1 does not • get added: the code is all 0s • (it may seem that this takes up the representation for 1.0, but • given how the exponent is represented, we’ll soon see that • that’s not the case) • The largest exponent value (with zero fraction) represents +/- infinity • The largest exponent value (with non-zero fraction) represents • NaN (not a number) – for the result of 0/0 or (infinity minus infinity)

  11. More Details • To simplify sort, sign was placed as the first bit • For a similar reason, the representation of the exponent is also • modified: in order to use integer compares, it would be preferable to • have the smallest exponent as 00…0 and the largest exponent as 11…1 • This is the biased notation, where a bias is subtracted from the • exponent field to yield the true exponent • IEEE 754 single-precision uses a bias of 127 (since the exponent • must have values between -127 and 128)…double precision uses • a bias of 1023 • Final representation: (-1)S x (1 + Fraction) x 2(Exponent – Bias)

  12. Floating-Point Addition • Consider a 4-digit decimal example • 9.999 × 101 + 1.610 × 10–1 • 1. Align decimal points • Shift number with smaller exponent • 9.999 × 101 + 0.016 × 101 • 2. Add significands • 9.999 × 101 + 0.016 × 101 = 10.015 × 101 • 3. Normalize result & check for over/underflow • 1.0015 × 102 • 4. Round and renormalize if necessary • 1.002 × 102

  13. Floating-Point Addition • Now consider a 4-digit binary example • 1.0002 × 2–1 + –1.1102 × 2–2 (0.5 + –0.4375) • 1. Align binary points • Shift number with smaller exponent • 1.0002 × 2–1 + –0.1112 × 2–1 • 2. Add significands • 1.0002 × 2–1 + –0.1112 × 2-1= 0.0012 × 2–1 • 3. Normalize result & check for over/underflow • 1.0002 × 2–4, with no over/underflow • 4. Round and renormalize if necessary • 1.0002 × 2–4 (no change) = 0.0625

  14. MIPS Instructions • The usual add.s, add.d, sub, mul, div • Comparison instructions: c.eq.s, c.neq.s, c.lt.s…. • These comparisons set an internal bit in hardware that • is then inspected by branch instructions: bc1t, bc1f • Separate register file $f0 - $f31 : a double-precision • value is stored in (say) $f4-$f5 and is referred to by $f4 • Load/store instructions (lwc1, swc1) must still use • integer registers for address computation

  15. Code Example float f2c (float fahr) { return ((5.0/9.0) * (fahr – 32.0)); } (argument fahr is stored in $f12) lwc1 $f16, const5($gp) lwc1 $f18, const9($gp) div.s $f16, $f16, $f18 lwc1 $f18, const32($gp) sub.s $f18, $f12, $f18 mul.s $f0, $f16, $f18 jr $ra

More Related