Floating point

Lecture B04

Lecture notes section B04

CSE1303 Part B lecture notes

- Arithmetic
- unsigned addition

- Signed integers
- two’s complement system

- More arithmetic
- signed addition
- subtraction
- multiplication

- Fixed point
- Scientific notation
- Floating point
- Anatomy of floating point
- sign
- exponent
- mantissa

- Floating point arithmetic
- multiplication
- addition

- Limitations of floating point

Calculate the area of this circle (diameter d = 1):

Using real arithmetic:

area = $\pi d^2 / 4$

= 3.142 × 1 × 1 / 4

= 0.785

Using integer arithmetic:

area = π * d * d / 4

= 3 * 1 * 1 / 4

= 0

- Real numbers
- numbers which are not necessarily integers
- have an integer part and fractional part

- Need a way to represent (or approximate) real numbers using binary

- Attempt 1: rational numbers
- already know how to represent integers
- express numbers as ratio of two integers
- numerator and denominator

- 9.75 = 39 ÷ 4
- 0 = 0 ÷ 1
- –42 = –42 ÷ 1
- π ≈ 22 ÷ 7
- to represent a rational number, just need to specify both integers

- Problems with rational number representations
- multiple representations of the same value
- 9.75 = 39 ÷ 4 = 975 ÷ 100 = –117 ÷ –12 = ...
- –42 = –42 ÷ 1 = 42 ÷ –1 = ...
- 0 = 0 ÷ 1 = 0 ÷ 2 = 0 ÷ –1 = ...
- makes comparing values difficult

- no exact representations of some values
- π ≈ 22 ÷ 7 ≈ 355 ÷ 113 ≈ 314159 ÷ 100000 ≈ ...
- not possible for irrational numbers anyway
- decimal has same problem

- Attempt 2: fixed point
- imply a “binary point” between two bits
- bits to left of point have place value ≥ 1
- bits to right of point have place value < 1
- e.g., binary words with a 10-bit whole part and a 6-bit fractional part

Whole part, place values $2^{9}$ (512) down to $2^{0}$ (1): 0 1 0 0 0 0 1 1 0 1

Fractional part, place values $2^{-1}$ (0.5) down to $2^{-6}$ (0.015625): 0 1 1 1 1 0

0100001101 . 011110 (binary point between the two parts)

= 256 + 8 + 4 + 1 + 0.25 + 0.125 + 0.0625 + 0.03125 = 269.46875

- Problems with fixed point number representations
- limited range of numbers: for a 10:6 (whole:fraction) format, these are:
- smallest non-zero number = 0.015625
- largest number = 1023.984375

- fixed point can mean wasted bits
- π represented as 3.140625
- 8 highest order bits are all 0

- Scientific notation
- used to represent a wide range of numbers

$-6.341 \times 10^{-23}$

- sign: + or – (here –)
- mantissa: fixed point, 1.000... to 9.999... (here 6.341)
- radix (base): constant (here 10)
- exponent: integer (here –23)

- Same idea works in binary

$-1.01101_2 \times 2^{1101_2}$

- sign: + or –
- mantissa: fixed point binary, 1.000... to 1.111...
- radix (base): constant
- exponent: binary integer

This number equals $-1.40625_{10} \times 2^{13} = -11520_{10}$

- Advantages
- very wide range of representable numbers
- limited by range of exponent

- similar precision for all values
- no wasted bits

- Disadvantages
- some values still not exactly representable
- e.g., π

- Binary representation of numbers using a scientific notation style
- IEEE754 standard

sign: 0

exponent: 11011001

mantissa: 00100110000110101100101

- Sign
- one bit
- signed magnitude representation
- 0: number is positive
- 1: number is negative

+1.5: sign bit 0, exponent 01111111, mantissa 10000000000000000000000

–1.5: sign bit 1, exponent 01111111, mantissa 10000000000000000000000

- Exponent
- could be represented using two’s complement signed notation
- instead represented using excess-k notation
- represented value in exponent field is k more than the intended value
- k is constant (for C float, k = 127)
- exponent field 00000001 ($1_{10}$): exponent is –126
- exponent field 01111111 ($127_{10}$): exponent is 0
- exponent field 11111110 ($254_{10}$): exponent is +127

- exponent fields of 000...000 and 111...111 reserved for special meanings
- all bits 0: denormalized numbers and zero
- all bits 1: infinity and not-a-number (indeterminate)

- Number is normalized
- exponent is chosen such that $1_{10} \le$ mantissa $< 2_{10}$

These are all equally valid representations of the number $0.75_{10}$:

$110.00000_2 \times 2^{-3}$

$11.000000_2 \times 2^{-2}$

$1.1000000_2 \times 2^{-1}$ (choose this one: mantissa is in range $1.000\ldots_2$ to $1.111\ldots_2$)

$0.1100000_2 \times 2^{0}$

$0.0110000_2 \times 2^{1}$

- Mantissa
- represented as fixed-point value between $1.000\ldots_2$ and $1.111\ldots_2$
- first bit (before the point) is always 1
- don’t waste a bit storing it with the number; just assume it is always there

- fixed precision
- for C float, 23 bits available
- plus 1 implicit bit
- 24 significant bits

- mantissa sometimes called significand

- Example: $-11520_{10}$ as a C float
- $= -1.40625_{10} \times 2^{13}$
- sign = 1 because number is negative
- exponent = $+13_{10}$, so store 13 + 127 (excess) = $140_{10} = 10001100_2$
- mantissa = $1.40625_{10} = 1.01101_2$, so store 01101 (skip leading 1), pad on right with 0s

sign: 1

exponent: 10001100

mantissa: 01101000000000000000000

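- C has floating point types of various sizes (the sizes below are typical; the C standard does not fix them)
- 32 bits (float)
- 1 bit sign, 8 bits exponent, 23 bits mantissa stored (24 with the implicit bit), excess 127

- 64 bits (double)
- 1 bit sign, 11 bits exponent, 52 bits mantissa stored (53 with the implicit bit), excess 1023

- 80 bits (long double, on platforms that use the x86 extended format)
- 1 bit sign, 15 bits exponent, 64 bits mantissa (leading bit stored explicitly), excess 16383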

- C does many calculations internally using doubles
- e.g., floating point literals are doubles by default, and float arguments to functions such as printf are promoted to double
- most modern computers can operate on doubles as fast as on floats
- may be less efficient to use float variables, even though smaller in size
- long double operations may be very slow
- may have to be implemented by software

- Using literal floating point values in C
- require decimal point (or an exponent)
- e.g., 5 is of type int; use 5.0 if you want floating point (5.0 is a double, 5.0f is a float)

- optional exponent
- e.g., 5.0e-12 means $5.0 \times 10^{-12}$ (decimal)

- Size of exponent is fixed
- cannot represent very large numbers
- for C float, exponent is 8 bits, excess is 127
- largest exponent is +127 (exponent field $11111110_2 = 254_{10}$, and 254 – 127 = 127)
- 11111111 reserved for infinity (and not-a-number, NaN)

- largest representable numbers
- positive: $1.1111\ldots_2 \times 2^{127} = 3.403\ldots_{10} \times 10^{38}$
- negative: $-1.1111\ldots_2 \times 2^{127} = -3.403\ldots_{10} \times 10^{38}$

- overflow occurs if numbers larger in magnitude than this are produced
- rounds up to ±infinity

- solution: use a floating point format with a larger exponent field
- double (11 bits), long double (15 bits)

- Size of exponent is fixed
- cannot represent very small numbers
- for C float, exponent is 8 bits, excess is 127
- smallest exponent is –126 (exponent field $00000001_2 = 1_{10}$, and 1 – 127 = –126)
- 00000000 reserved for zero (and for denormalized numbers, where the implied bit is 0 and the exponent is –126)

- smallest representable (normalized) numbers
- positive: $1.000\ldots_2 \times 2^{-126} = 1.175\ldots_{10} \times 10^{-38}$
- negative: $-1.000\ldots_2 \times 2^{-126} = -1.175\ldots_{10} \times 10^{-38}$

- underflow occurs if a calculation produces a non-zero number closer to zero than the representable limit
- rounds down to zero

- solution: use a floating point format with a larger exponent field
- double (11 bits), long double (15 bits)

- Size of mantissa is fixed
- limited precision in representations
- C float has 23 (24) bits of mantissa
- smallest possible change in a number is to toggle the LSB
- place value $2^{-23} \approx 1.2 \times 10^{-7}$

- C float has (almost) 7 decimal digits of precision

- solution: use a floating point format with a larger mantissa
- double (53 bits), long double (64 bits)

- Size of mantissa is fixed
- some values cannot be represented exactly
- e.g. $1/3_{10} = 0.010101010101010101\ldots_2$
- continuing binary fraction never ends
- cannot fit in 24 (or 240, or 24000) bits

- solution: none, same problem occurs in decimal scientific notation
- can use higher precision floating point type to improve accuracy
- if exact representation is needed, use rational numbers

- Determine if a floating point number is less than, equal to, or greater than another floating-point number
- Comparison is common operation, so need to be able to do it quickly
- Justification for floating-point representation
- sign-exponent-mantissa ordering within word
- excess-k exponent representation
- normalized representation

- can use integer compare logic to compare floating point numbers
- because of sign-exponent-mantissa order

- two floating point numbers are equal iff they have the same bit pattern
- special cases: +0 and –0 compare equal despite different sign bits, and NaN never compares equal to anything
- otherwise one is less than the other
- compare sign bits
- if different, order is then known

- if same, compare exponent fields
- if different, order is then known

- if same, compare mantissas
- order is then known

- if both numbers are negative, the one with the larger exponent or mantissa is the smaller number

- Multiplication
- this example in decimal
- same method in binary

$(+8.19 \times 10^{9}) \times (-2.35 \times 10^{12})$

- step 1: sign
- + × + = +, – × + = –, + × – = –, – × – = +
- here + × – = –

- step 2: exponent
- add exponents: 9 + 12 = 21, giving $-\ldots \times 10^{21}$

- step 3: mantissa
- multiply mantissas: 8.19 × 2.35 = 19.2465, giving $-19.2465 \times 10^{21}$

- step 4: renormalize and round
- $-19.2465 \times 10^{21} = -1.92 \times 10^{22}$
- some loss of precision is inevitable in floating-point multiplication

- Addition
- this example in decimal
- same method in binary

$(+9.35 \times 10^{5}) + (+8.14 \times 10^{4})$

- step 1: sign and operation
- use signs and magnitudes of numbers to determine sign and whether operation is effectively addition or subtraction
- signs same: addition, result is +

- step 2: match exponents
- rewrite smaller number so that it has same exponent as larger number: $+8.14 \times 10^{4} = +0.814 \times 10^{5}$

- step 3: exponent
- copy exponent, giving $+\ldots \times 10^{5}$

- step 4: mantissa
- add mantissas: 9.35 + 0.814 = 10.164, giving $+10.164 \times 10^{5}$

- step 5: normalize and round
- $+10.164 \times 10^{5} = +1.02 \times 10^{6}$
- loss of precision is likely if numbers are of very different magnitudes

- Addition of floating point numbers not associative
- A + (B + C) can differ from (A + B) + C
- if numbers are significantly different in magnitude
- for instance, with 3 significant digits:
- A = $1.00 \times 10^{3}$
- B = $4.00 \times 10^{0}$
- C = $3.00 \times 10^{0}$

- A + (B + C)
- result of addition B + C: $4.00 \times 10^{0} + 3.00 \times 10^{0} = 7.00 \times 10^{0}$
- rewrite sum to add to A: $0.007 \times 10^{3}$
- add A: $1.00 \times 10^{3} + 0.007 \times 10^{3} = 1.007 \times 10^{3}$
- result after rounding: $1.01 \times 10^{3}$

- (A + B) + C
- rewrite B to add to A: $0.004 \times 10^{3}$
- result of sum A + B: $1.00 \times 10^{3} + 0.004 \times 10^{3} = 1.004 \times 10^{3}$
- result after rounding, carry forward to next addition: $1.00 \times 10^{3}$
- rewrite C to add to sum: $0.003 \times 10^{3}$
- add to sum carried forward: $1.00 \times 10^{3} + 0.003 \times 10^{3} = 1.003 \times 10^{3}$
- result after rounding: $1.00 \times 10^{3}$

- Floating point
- sign
- exponent
- mantissa

- Floating point arithmetic
- multiplication
- addition

- Limitations of floating point
- limited precision
- overflow and underflow
- addition not associative

- Floating point subtraction and division
- Infinities, Not-a-Number and denormalized numbers
- IEEE754’s dustier corners

- Bit manipulation
- bitwise operations
- shifting
- masking

Reading:

Lecture notes section B05