Floating point


# Floating point - PowerPoint PPT Presentation

Floating point. Lecture B04. Lecture notes section B04. Last time. Arithmetic unsigned addition Signed integers two’s complement system More arithmetic signed addition subtraction multiplication. In this lecture. Fixed point Scientific notation Floating point


### Floating point

Lecture B04

Lecture notes section B04

CSE1303 Part B lecture notes

Last time
• Arithmetic
• unsigned addition
• Signed integers
• two’s complement system
• More arithmetic
• signed addition
• subtraction
• multiplication


In this lecture
• Fixed point
• Scientific notation
• Floating point
• Anatomy of floating point
• sign
• exponent
• mantissa
• Floating point arithmetic
• multiplication
• addition
• Limitations of floating point

Geometry test

Calculate the area of this circle (diameter d = 1):

area = πd²/4
= 3.142 × 1 × 1 / 4
= 0.785

The same calculation using only integers:

area = π * d * d / 4
= 3 * 1 * 1 / 4
= 0


Real number representation
• Real numbers
• numbers which are not necessarily integers
• have an integer part and fractional part
• Need a way to represent (or approximate) real numbers using binary


Rational numbers
• Attempt 1: rational numbers
• already know how to represent integers
• express numbers as ratio of two integers
• numerator and denominator
• 9.75 = 39 ÷ 4
• 0 = 0 ÷ 1
• –42 = –42 ÷ 1
• π ≈ 22 ÷ 7
• to represent a rational number, just need to specify both integers


Rational numbers
• Problems with rational number representations
• multiple representations of the same value
• 9.75 = 39 ÷ 4 = 975 ÷ 100 = –117 ÷ –12 = ...
• –42 = –42 ÷ 1 = 42 ÷ –1 = ...
• 0 = 0 ÷ 1 = 0 ÷ 2 = 0 ÷ –1 = ...
• makes comparing values difficult
• no exact representations of some values
• π ≈ 22 ÷ 7 ≈ 355 ÷ 113 ≈ 314159 ÷ 100000 ≈ ...
• not possible for irrational numbers anyway
• decimal has same problem

Fixed point
• Attempt 2: fixed point
• imply a “binary point” between two bits
• bits to left of point have place value ≥ 1
• bits to right of point have place value < 1
• e.g. a 16-bit binary word with a 10-bit whole part and a 6-bit fractional part:

| Place value | Decimal value | Bit |
|---|---|---|
| 2⁹ | 512 | 0 |
| 2⁸ | 256 | 1 |
| 2⁷ | 128 | 0 |
| 2⁶ | 64 | 0 |
| 2⁵ | 32 | 0 |
| 2⁴ | 16 | 0 |
| 2³ | 8 | 1 |
| 2² | 4 | 1 |
| 2¹ | 2 | 0 |
| 2⁰ | 1 | 1 |
| binary point | | . |
| 2⁻¹ | 0.5 | 0 |
| 2⁻² | 0.25 | 1 |
| 2⁻³ | 0.125 | 1 |
| 2⁻⁴ | 0.0625 | 1 |
| 2⁻⁵ | 0.03125 | 1 |
| 2⁻⁶ | 0.015625 | 0 |

0100001101.011110₂ = 269.46875₁₀


Fixed point
• Problems with fixed point number representations
• limited range of numbers - for a 10:6 (whole:fraction) format, these are:
• smallest non-zero number = 0.015625
• largest number = 1023.984375
• fixed point can mean wasted bits
• π represented as 3.140625
• 8 highest order bits are all 0

Scientific notation
• Used to represent a wide range of numbers

–6.341 × 10⁻²³

sign: + or –
mantissa: fixed point, 1.000... to 9.999...
exponent: integer

Scientific notation
• Same idea works in binary

–1.01101 × 2^1101 (all digits binary)

sign: + or –
mantissa: fixed point binary, 1.000... to 1.111...
exponent: binary integer

This number equals –1.40625 × 2¹³ = –11520₁₀


Scientific notation
• very wide range of representable numbers
• limited by range of exponent
• similar precision for all values
• no wasted bits
• some values still not exactly representable
• e.g., π

Floating point
• Binary representation of numbers using a scientific notation style
• IEEE754 standard

| sign | exponent | mantissa |
|---|---|---|
| 0 | 11011001 | 00100110000110101100101 |

Floating point
• Sign
• one bit
• signed magnitude representation
• 0 → number is positive
• 1 → number is negative

| value | sign bit | exponent | mantissa |
|---|---|---|---|
| +1.5 | 0 | 01111111 | 10000000000000000000000 |
| –1.5 | 1 | 01111111 | 10000000000000000000000 |


Floating point
• Exponent
• could be represented using two’s complement signed notation
• instead represented using excess-k notation
• represented value in exponent field is k more than the intended value
• k is constant (for C float, k = 127)
• exponent field 00000001 (1₁₀): exponent is –126
• exponent field 01111111 (127₁₀): exponent is 0
• exponent field 11111110 (254₁₀): exponent is +127
• exponent fields of 000...000 and 111...111 reserved for special meanings
• all bits 0: denormalized numbers and zero
• all bits 1: infinity and not-a-number (indeterminate)

Floating point
• Number is normalized
• exponent is chosen such that 1₁₀ ≤ mantissa < 2₁₀

These are all equally valid representations of the number 0.75₁₀:

110.00000 × 2⁻³
11.000000 × 2⁻²
1.1000000 × 2⁻¹ (choose this one: mantissa is in range 1.000...₂ to 1.111...₂)
0.1100000 × 2⁰
0.0110000 × 2¹


Floating point
• Mantissa
• represented as fixed-point value between 1.000...₂ and 1.111...₂
• first bit (before the point) is always 1
• don’t waste a bit storing it with the number; just assume it is always there
• fixed precision
• for C float, 23 bits available
• plus 1 implicit bit
• 24 significant bits
• mantissa sometimes called significand

Floating point
• Example: –11520₁₀ as a C float
• = –1.40625₁₀ × 2¹³
• sign = 1 because number is negative
• exponent = +13₁₀, so store 13 + 127 (excess) = 140₁₀ = 10001100₂
• mantissa = 1.40625₁₀ = 1.01101₂, so store 01101 (skip leading 1), pad on right with 0s

| sign | exponent | mantissa |
|---|---|---|
| 1 | 10001100 | 01101000000000000000000 |

Floating point
• C has floating point types of various sizes (the sizes below are typical for IEEE754 hardware; the C standard leaves them to the implementation)
• 32 bits (float)
• 1 bit sign, 8 bits exponent, 23 (24) bits mantissa, excess 127
• 64 bits (double)
• 1 bit sign, 11 bits exponent, 52 (53) bits mantissa, excess 1023
• 80 bits (long double, on x86 systems that use the x87 extended format)
• 1 bit sign, 15 bits exponent, 64 bits mantissa (leading bit stored explicitly), excess 16383

Floating point
• C does many calculations internally using doubles (e.g. unsuffixed literals such as 5.0, and <math.h> functions such as sqrt(), are double)
• most modern computers can operate on doubles as fast as on floats
• may be less efficient to use float variables, even though smaller in size
• long double operations may be very slow
• may have to be implemented by software
• Using literal floating point values in C
• require a decimal point (or an exponent)
• e.g. 5 is of type int, use 5.0 if you want floating point (5.0 is a double; 5.0f is a float)
• optional exponent
• e.g. 5.0e-12 means 5.0 × 10⁻¹² (decimal)


Limitations of floating point
• Size of exponent is fixed
• cannot represent very large numbers
• for C float, exponent is 8 bits, excess is 127
• largest exponent is +127 (254₁₀ (11111110₂) – 127)
• 11111111 reserved for infinity (and not-a-number, NaN)
• largest representable numbers
• positive: 1.1111...₂ × 2¹²⁷ = 3.402...₁₀ × 10³⁸
• negative: –1.1111...₂ × 2¹²⁷ = –3.402...₁₀ × 10³⁸
• overflow occurs if numbers larger than this are produced
• rounds up to ±infinity
• solution: use a floating point format with a larger exponent
• double (11 bits), long double (15 bits)


Limitations of floating point
• Size of exponent is fixed
• cannot represent very small numbers
• for C float, exponent is 8 bits, excess is 127
• smallest exponent is –126 (1₁₀ (00000001₂) – 127)
• 00000000 reserved for zero (and denormalized numbers, where implied bit is 0 and exponent = –126)
• smallest representable (normalized) numbers
• positive: 1.000...₂ × 2⁻¹²⁶ = 1.175...₁₀ × 10⁻³⁸
• negative: –1.000...₂ × 2⁻¹²⁶ = –1.175...₁₀ × 10⁻³⁸
• underflow occurs if a calculation produces a number smaller than the representable limit
• rounds down to zero
• solution: use a floating point format with a larger exponent
• double (11 bits), long double (15 bits)


Limitations of floating point
• Size of mantissa is fixed
• limited precision in representations
• C float has 23 (24) bits of mantissa
• smallest possible change in a number is to toggle the LSB
• place value 2⁻²³ ≈ 1.2 × 10⁻⁷
• C float has (almost) 7 decimal digits of precision
• solution: use a floating point format with a larger mantissa
• double (53 bits), long double (64 bits)


Limitations of floating point
• Size of mantissa is fixed
• some values cannot be represented exactly
• e.g. (1/3)₁₀ = 0.010101010101010101...₂
• continuing binary fraction never ends
• cannot fit in 24 (or 240 (or 24000)) bits
• solution: none, same problem occurs in decimal scientific notation
• can use higher precision floating point type to improve accuracy
• if exact representation is needed, use rational numbers


Floating point comparison
• Determine if a floating point number is less than, equal to, or greater than another floating-point number
• Comparison is common operation, so need to be able to do it quickly
• Justification for floating-point representation
• sign-exponent-mantissa ordering within word
• excess-k exponent representation
• normalized representation


Floating point comparison
• can use integer compare logic to compare floating point numbers
• because of sign-exponent-mantissa order
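• two floating point numbers are equal iff they have the same bit pattern (special cases aside: +0 and –0 compare equal, and NaN is never equal to anything)
• otherwise one is less than the other
• compare sign bits
• if different, order is then known
• if same, compare exponent fields
• if different, order is then known
• if same, compare mantissas
• order is then known
• if both numbers are negative, the larger exponent or mantissa belongs to the smaller number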

Floating point arithmetic
• Multiplication
• this example in decimal
• same method in binary

(+8.19 × 10⁹) × (–2.35 × 10¹²)

step 1: sign
+ × + = +
– × + = –
+ × – = –
– × – = +
here + × – gives –

step 2: exponent
add exponents: 9 + 12 = 21, giving × 10²¹

step 3: mantissa
multiply mantissas: 8.19 × 2.35 = 19.2465, giving –19.2465 × 10²¹

step 4: renormalize and round
–19.2465 × 10²¹ = –1.92 × 10²²
some loss of precision is inevitable in floating-point multiplication

Floating point arithmetic
• Addition
• this example in decimal
• same method in binary

(+9.35 × 10⁵) + (+8.14 × 10⁴)

step 1: sign and operation
use signs and magnitudes of numbers to determine sign and whether operation is effectively addition or subtraction
here both numbers are positive: result is +, operation is addition

step 2: match exponents
rewrite smaller number so that it has same exponent as larger number
+8.14 × 10⁴ = +0.814 × 10⁵

step 3: exponent
copy exponent: × 10⁵

step 4: mantissa
add mantissas: 9.35 + 0.814 = 10.164, giving +10.164 × 10⁵

step 5: normalize and round
+10.164 × 10⁵ = +1.02 × 10⁶
loss of precision is likely if numbers are of very different magnitudes


Limitations of floating point
• Addition of floating point numbers not associative
• A + (B + C) ≠ (A + B) + C
• if numbers are significantly different in magnitude
• for instance:
• A = 1.00 × 10³
• B = 4.00 × 10⁰
• C = 3.00 × 10⁰
• 3 significant digits

Limitations of floating point
• A + (B + C)

B: 4.00 × 10⁰
+ C: 3.00 × 10⁰
result of addition B + C: 7.00 × 10⁰

rewrite sum to add to A: 0.007 × 10³
+ A: 1.00 × 10³
= 1.007 × 10³
result after rounding: 1.01 × 10³

Limitations of floating point
• (A + B) + C

B: 4.00 × 10⁰
rewrite B to add to A: 0.004 × 10³
+ A: 1.00 × 10³
result of sum A + B: 1.004 × 10³
result after rounding, carry forward to next addition: 1.00 × 10³

C: 3.00 × 10⁰
rewrite C to add to sum: 0.003 × 10³
+ sum carried forward: 1.00 × 10³
= 1.003 × 10³
result after rounding: 1.00 × 10³


Covered in this lecture
• Floating point
• sign
• exponent
• mantissa
• Floating point arithmetic
• multiplication
• addition
• Limitations of floating point
• limited precision
• overflow and underflow


Going further
• Floating point subtraction and division
• Infinities, Not-a-Number and denormalized numbers
• IEEE754’s dustier corners


Next time
• Bit manipulation
• bitwise operations
• shifting