Floating Point Numbers

It's all just 1s and 0s • Computers are fundamentally driven by logic, and thus by bits of data • Manipulation of bits can be done incredibly quickly • Given n bits of information, there are 2^n possible combinations • These 2^n representations can encode pretty much anything you want: letters, numbers, instructions, ...

Bases of number systems • Base 10 digits: 0,1,2,3,4,5,6,7,8,9 • 3107 = 3×10^3 + 1×10^2 + 0×10^1 + 7×10^0 • Base 2 digits: 0,1 • Place values: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 • 3107 = 1×2^11 + 1×2^10 + 0×2^9 + 0×2^8 + 0×2^7 + 0×2^6 + 1×2^5 + 0×2^4 + 0×2^3 + 0×2^2 + 1×2^1 + 1×2^0 • = 110000100011 in binary • Addition, multiplication, etc. all proceed the same way
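The expansion of 3107 can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the helper names `to_base` and `from_digits` are made up for illustration. It finds the digits by repeated division and rebuilds the value from them.

```python
def to_base(n, base):
    """Digits of a non-negative integer n in the given base, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)  # peel off the lowest digit
        digits.append(remainder)
    return digits[::-1]


def from_digits(digits, base):
    """Rebuild the integer from its digits (Horner's rule)."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


decimal_digits = to_base(3107, 10)
binary_digits = to_base(3107, 2)
binary_string = "".join(str(d) for d in binary_digits)

print(decimal_digits)                 # [3, 1, 0, 7]
print(binary_string)                  # 110000100011
print(from_digits(binary_digits, 2))  # 3107
```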

Base Notation • What does 10 mean? • 10 in binary = 2 decimal • 10 in octal (base 8) = 8 decimal • 10 in decimal = 10 decimal • Need some method of differentiating between these possibilities • To avoid confusion, where necessary we write the base as a subscript • 10_10 = 10 decimal • 10_2 = 2 decimal
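Programming languages face the same ambiguity and resolve it with an explicit base. As a small Python illustration (not from the slides), the built-in `int` takes the base as a second argument, and integer literals carry a prefix such as `0b` or `0o`.

```python
# The same two characters, read in three different bases.
values = {base: int("10", base) for base in (2, 8, 10)}

for base, value in values.items():
    print(f"'10' read in base {base} is {value} decimal")

# Literal prefixes play the role of the subscript.
print(0b10, 0o10, 10)  # 2 8 10
```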

Integer Representation • Integers obviously fit into this base 2 notation • The remaining challenge is to represent negative numbers • 2s complement • Excess-N • An extra choice is the order of bits • This choice is made chip-by-chip • It affects portability
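A minimal Python sketch of the two encodings for negative numbers follows. The 8-bit width and N = 127 are assumptions chosen for illustration, and the function names are hypothetical. The last line shows byte order (endianness), which is offered here as one common form of the ordering choice, not necessarily the one the slide had in mind.

```python
def twos_complement(value, bits=8):
    """Unsigned bit pattern that stores `value` in two's complement."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError("value does not fit in the given width")
    return value & ((1 << bits) - 1)  # wrap negatives around modulo 2**bits


def excess_n(value, n=127, bits=8):
    """Unsigned bit pattern that stores `value` as value + n (Excess-N)."""
    stored = value + n
    if not 0 <= stored < (1 << bits):
        raise ValueError("value does not fit in the given width")
    return stored


print(format(twos_complement(-5), "08b"))  # 11111011
print(format(excess_n(-5), "08b"))         # 01111010

# Same 16-bit value, two byte orders (assumed illustration of the ordering choice).
print((3107).to_bytes(2, "big").hex(), (3107).to_bytes(2, "little").hex())  # 0c23 230c
```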

Floating Point Representation • Computers represent floating point numbers in binary form • For generality, they use a binary form of scientific notation • In binary, we can use powers of 2
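To make the binary scientific notation concrete, the sketch below pulls a Python float apart into its three fields. It assumes the IEEE 754 double layout (1 sign bit, 11 exponent bits stored in excess-1023, 52 fraction bits), which is what Python uses on essentially all current platforms; the function name `decompose` is hypothetical, and the formula in the docstring applies to normal numbers only.

```python
import struct


def decompose(x):
    """Split a double into (sign, stored exponent, fraction bits).

    For normal numbers the value is
        (-1)**sign * (1 + fraction / 2**52) * 2**(exponent - 1023)
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # reinterpret as a 64-bit integer
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction


for x in (1.0, -2.0, 0.15625):
    sign, exponent, fraction = decompose(x)
    rebuilt = (-1) ** sign * (1 + fraction / 2 ** 52) * 2.0 ** (exponent - 1023)
    print(x, sign, exponent - 1023, fraction, rebuilt)
```

Here 0.15625 is 1.25 × 2^-3, so its unbiased exponent is -3 and its fraction field encodes the 0.25.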

Floating Point Size • In IEEE.h • IEEE.h:#define IEEE_FLOAT_SIZE 4 • IEEE.h:#define IEEE_DOUBLE_SIZE 8 • IEEE.h:#define IEEE_QUAD_SIZE 16
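The 4-byte and 8-byte sizes can be confirmed from Python's `struct` module, which uses the IEEE single and double formats in its standard-size mode. This is only a sketch; `struct` has no format code for a 16-byte quad, so that size is not checked here.

```python
import struct

# "<" selects standard sizes, independent of the host compiler.
float_size = struct.calcsize("<f")
double_size = struct.calcsize("<d")

print(float_size, double_size)  # 4 8
```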

Distribution

In Decimal Terms • Each binary floating point double holds roughly 16 decimal digits • technically, the relative precision is 2^(-52), about 2.2e-16 • MATLAB example
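The slide points to a MATLAB example that is not reproduced here. As a stand-in, this Python sketch (assuming IEEE doubles) shows the same quantity: `sys.float_info.epsilon` is the gap between 1.0 and the next larger double.

```python
import sys

eps = sys.float_info.epsilon

print(eps == 2.0 ** -52)     # True: the gap between 1.0 and the next double
print(1.0 + eps > 1.0)       # True: a full gap is visible
print(1.0 + eps / 2 == 1.0)  # True: half a gap is rounded away
print(sys.float_info.dig)    # decimal digits guaranteed to survive a round trip
```

`sys.float_info.dig` reports 15, the number of decimal digits that are always preserved, which is consistent with "roughly 16".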

Advantages • Scientific notation can work on any scale (all handled by exponent) • So long as errors are small relative to scale of data values, calculations are accurate • right?

Example 1 • 1e12 + 0.2 – 1e12
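Example 1 can be run directly. In this Python sketch (assuming IEEE doubles) the names `big`, `small` and `spacing` are illustrative. Near 1e12 the representable doubles are 2^-13 apart, so the 0.2 is rounded to a multiple of that spacing before the subtraction brings it back.

```python
big = 1e12
small = 0.2

result = big + small - big
spacing = 2.0 ** -13  # gap between neighbouring doubles near 1e12

print(result)           # 0.199951171875, not 0.2
print(result == small)  # False
print(abs(result - small) <= spacing / 2)  # True: the error is at most half a gap
```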

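Problem • Nice decimal numbers (0.2) can have non-terminating binary representations • just as 1/3 = 0.3333333… in decimal, 0.2 = 0.0011 0011 0011 0011… in binary • Adding and then subtracting a large number works as it would in decimal with a fixed number of digits: the low-order digits of the small number are lost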
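The repeating pattern can be generated with exact rational arithmetic. This Python sketch uses a hypothetical helper, `binary_fraction_digits`, that doubles the fraction and reads off one bit at a time. It then checks that the stored double is not exactly 1/5.

```python
from decimal import Decimal
from fractions import Fraction


def binary_fraction_digits(frac, count):
    """First `count` binary digits after the point of a fraction in [0, 1)."""
    digits = []
    for _ in range(count):
        frac *= 2
        bit = 1 if frac >= 1 else 0  # the integer part is the next binary digit
        digits.append(str(bit))
        frac -= bit
    return "".join(digits)


bits = binary_fraction_digits(Fraction(1, 5), 16)
print(bits)                              # 0011001100110011
print(Fraction(0.2) == Fraction(1, 5))   # False: the double is only close to 1/5
print(Decimal(0.2))                      # exact decimal value of the stored double
print((0.2).hex())                       # 0x1.999999999999ap-3
```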

Roundoff Error • Roundoff error will always be present • Roundoff error is more significant when you are subtracting two almost equal quantities • e.g. in decimal, 255.67 – 255.69
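The effect of subtracting nearly equal quantities can be measured exactly with rational arithmetic. In this Python sketch (assuming IEEE doubles) each input carries a relative error of about 1e-16 or less, but the relative error of the difference is thousands of times larger, because the leading digits cancel and only the noisy trailing digits remain.

```python
from fractions import Fraction

a, b = 255.67, 255.69
diff = a - b

exact_a, exact_b = Fraction("255.67"), Fraction("255.69")
exact_diff = exact_a - exact_b  # exactly -1/50

# Relative error of the stored inputs versus relative error of the result.
rel_err_inputs = max(abs(Fraction(a) - exact_a) / exact_a,
                     abs(Fraction(b) - exact_b) / exact_b)
rel_err_diff = abs(Fraction(diff) - exact_diff) / abs(exact_diff)

print(diff)
print(float(rel_err_inputs))
print(float(rel_err_diff))
```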

Example 2 • A = 112000000 • B = 100000 • C = 0.0009 • X = A - B / C
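A sketch of Example 2 in Python follows, reading X = A - B / C with the usual precedence, X = A - (B / C). The slide does not say which precision it has in mind, so the comparison against single precision is an assumption added for illustration; `to_single` is a hypothetical helper that rounds a value to the nearest IEEE single.

```python
import struct
from fractions import Fraction


def to_single(x):
    """Round a Python float to the nearest IEEE single and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


A, B, C = 112000000.0, 100000.0, 0.0009

# Exact answer with rational arithmetic: 112000000 - 100000 / (9/10000) = 8000000/9.
exact = Fraction(112000000) - Fraction(100000) / Fraction(9, 10000)

x_double = A - B / C

# Emulate single precision by rounding after every step (assumed scenario).
c_single = to_single(C)
x_single = to_single(A - to_single(B / c_single))

err_double = abs(Fraction(x_double) - exact)
err_single = abs(Fraction(x_single) - exact)

print(float(exact))
print(x_double, float(err_double))
print(x_single, float(err_single))
```

B / C is about 1.1e8, almost equal to A, so dividing by the small C and then subtracting combines both hazards; in single precision the quotient can only be stored to the nearest multiple of 8.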

Common occurrence • Small delta x appears in • finite element methods • numerical differentiation • Places where more closely packed data is expected to give better results

Example 3: Numerical Differentiation
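The slide's own example is not reproduced in the text, so the following Python sketch is an assumed stand-in: a forward difference for the derivative of sin at x = 1, with step sizes from 1e-1 down to 1e-15. The error first shrinks as h decreases, then grows again once roundoff in f(x + h) - f(x) is divided by a very small h.

```python
import math


def forward_difference(f, x, h):
    """One-sided finite difference approximation to f'(x)."""
    return (f(x + h) - f(x)) / h


x = 1.0
exact = math.cos(x)  # derivative of sin at x

errors = {}
for k in range(1, 16):
    h = 10.0 ** -k
    errors[k] = abs(forward_difference(math.sin, x, h) - exact)
    print(f"h = 1e-{k:02d}   error = {errors[k]:.3e}")
```

The smallest error typically appears around h = 1e-8, roughly the square root of the relative precision of a double.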

Example 4: Recursion • Comparing sum of delta x and real sum • t = 0; • N = 10000; dx = 1/N; • for (I = 1:N) • t = t + dx; • end
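The same loop can be sketched in Python. Each addition rounds, and the roundings accumulate, so the running total generally ends up slightly away from 1. For comparison the sketch also uses `math.fsum`, a standard-library function that tracks the lost low-order bits; it is not part of the slide and is named here only as a reference point.

```python
import math

N = 10000
dx = 1 / N

t = 0.0
for _ in range(N):
    t = t + dx  # every addition may round

t_fsum = math.fsum([dx] * N)  # correctly rounded sum of the same terms

print(t, t - 1.0)
print(t_fsum, t_fsum - 1.0)
```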

Avoiding (Large) Roundoff Error • Avoid subtracting almost-equal quantities • Avoid dividing by small quantities • Avoid sums over large loops, especially with different orders of magnitude in the sum • Avoid recursive calculations, where errors will accumulate
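The point about different orders of magnitude can be seen by summing the same terms in two orders. In this Python sketch (assuming IEEE doubles, with illustrative values) neighbouring doubles near 1e16 are 2 apart, so adding 1.0 to 1e16 is rounded away every time, while adding the small terms together first lets them survive.

```python
terms = [1e16] + [1.0] * 1000

# Large term first: each 1.0 is below half the spacing's rounding threshold and is lost.
large_first = 0.0
for value in terms:
    large_first += value

# Small terms first: they add up to 1000.0 exactly before meeting the large term.
small_first = 0.0
for value in sorted(terms):
    small_first += value

print(large_first)                # 1e+16
print(small_first)                # 1.0000000000001e+16
print(small_first - large_first)  # 1000.0
```

Sorting by magnitude is only one simple remedy; it is shown here as an illustration of the bullet, not as a general-purpose fix.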
