Number representation

How to Represent Negative Numbers?

  • So far, unsigned numbers

  • Obvious solution: define leftmost bit to be sign!

    • 0 => +, 1 => -

    • Rest of bits can be numerical value of number

  • Representation called sign and magnitude


Shortcomings of sign and magnitude?

  • Arithmetic circuit more complicated

    • Special steps depending whether signs are the same or not

  • Also, Two zeros

    • 0x00000000 = +0ten

    • 0x80000000 = -0ten

    • What would it mean for programming?

  • Sign and magnitude abandoned


Another try: complement the bits

[Figure: 5-bit one’s complement number line, with 00000, 00001, ..., 01111 on the non-negative side and 10000, ..., 11110, 11111 on the negative side]

  • Example: 7ten = 00111two, -7ten = 11000two

  • Called one’s complement

  • Note: positive numbers have leading 0s, negative numbers have leading 1s.

  • What is -00000 ?

  • How many positive numbers in N bits?

  • How many negative ones?


Shortcomings of ones complement?

  • Arithmetic not too hard

  • Still two zeros

    • 0x00000000 = +0ten

    • 0xFFFFFFFF = -0ten

    • What would it mean for programming?

  • One’s complement eventually abandoned because another solution was better


Search for Negative Number Representation

  • Obvious solution didn’t work, find another

  • What is the result for unsigned numbers if we try to subtract a large number from a small one?

    • Would try to borrow from string of leading 0s, so result would have a string of leading 1s

    • With no obvious better alternative, pick representation that made the hardware simple: leading 0s => positive, leading 1s => negative

    • 000000...xxx is >= 0, 111111...xxx is < 0

  • This representation called two’s complement


Two’s Complement Number Line

  • 2^(N-1) non-negatives

  • 2^(N-1) negatives

  • one zero

  • how many positives?

  • comparison?

  • overflow?

[Figure: 5-bit two’s complement number line. 00000 = 0, 00001 = 1, 00010 = 2, ..., 01111 = 15 on the non-negative side; 11111 = -1, 11110 = -2, ..., 10001 = -15, 10000 = -16 on the negative side]


Two’s Complement Numbers

    0000 ... 0000 0000 0000 0000two = 0ten
    0000 ... 0000 0000 0000 0001two = 1ten
    0000 ... 0000 0000 0000 0010two = 2ten
    . . .
    0111 ... 1111 1111 1111 1101two = 2,147,483,645ten
    0111 ... 1111 1111 1111 1110two = 2,147,483,646ten
    0111 ... 1111 1111 1111 1111two = 2,147,483,647ten
    1000 ... 0000 0000 0000 0000two = -2,147,483,648ten
    1000 ... 0000 0000 0000 0001two = -2,147,483,647ten
    1000 ... 0000 0000 0000 0010two = -2,147,483,646ten
    . . .
    1111 ... 1111 1111 1111 1101two = -3ten
    1111 ... 1111 1111 1111 1110two = -2ten
    1111 ... 1111 1111 1111 1111two = -1ten

  • One zero; the 1st bit tells whether the number is >= 0 or < 0, so it is called the sign bit

    • but one negative number has no positive counterpart: -2,147,483,648ten


Two’s Complement Formula

  • Can represent positive and negative numbers in terms of the bit value times a power of 2:

    • d31 x (-2^31) + d30 x 2^30 + ... + d2 x 2^2 + d1 x 2^1 + d0 x 2^0

  • Example: 1111 1111 1111 1111 1111 1111 1111 1100two

    = 1 x (-2^31) + 1 x 2^30 + 1 x 2^29 + ... + 1 x 2^2 + 0 x 2^1 + 0 x 2^0

    = -2^31 + 2^30 + 2^29 + ... + 2^2 + 0 + 0

    = -2,147,483,648ten + 2,147,483,644ten

    = -4ten

  • Note: need to specify width: we use 32 bits
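The weighted-sum formula above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `twos_complement_value` is made up for illustration, and the width is taken from the length of the bit string.

```python
def twos_complement_value(bits):
    """Evaluate d(n-1) x (-2^(n-1)) + ... + d1 x 2^1 + d0 x 2^0."""
    bits = bits.replace(" ", "")          # allow grouped input such as "1111 1100"
    n = len(bits)
    total = 0
    for i, ch in enumerate(bits):
        weight = 2 ** (n - 1 - i)
        if i == 0:
            weight = -weight              # only the sign bit has a negative weight
        total += int(ch) * weight
    return total


example = "1111 1111 1111 1111 1111 1111 1111 1100"
print(twos_complement_value(example))     # the slide's 32-bit example
```

The same bit pattern evaluated at a different width gives a different value, which is why the width has to be stated.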


Two’s complement shortcut: Negation

  • Invert every 0 to 1 and every 1 to 0, then add 1 to the result

    • Sum of number and its one’s complement must be 111...111two

    • 111...111two= -1ten

    • Let x’ mean the inverted representation of x

    • Then x + x’ = -1 => x + x’ + 1 = 0 => x’ + 1 = -x

  • Example: -4 to +4 to -4

    x :   1111 1111 1111 1111 1111 1111 1111 1100two
    x’:   0000 0000 0000 0000 0000 0000 0000 0011two
    +1:   0000 0000 0000 0000 0000 0000 0000 0100two
    ()’:  1111 1111 1111 1111 1111 1111 1111 1011two
    +1:   1111 1111 1111 1111 1111 1111 1111 1100two
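The invert-and-add-1 shortcut can be tried out on 32-bit patterns. This is a small Python sketch under the assumption that a register is modelled as an unsigned integer masked to 32 bits; `WIDTH`, `MASK` and `negate` are illustrative names, not from the slides.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1                   # 32 one-bits


def negate(x):
    """Invert every bit of the 32-bit pattern x, then add 1 (modulo 2^32)."""
    inverted = x ^ MASK                   # x' in the slide's notation
    return (inverted + 1) & MASK


minus_four = 0xFFFFFFFC                   # 1111 ... 1100two
plus_four = negate(minus_four)
print(f"{plus_four:032b}")
print(f"{negate(plus_four):032b}")
```

Note that negating the most negative pattern, 0x80000000, returns the same pattern, which matches the earlier remark that one negative number has no positive counterpart.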


Two’s comp. shortcut: Sign extension

  • Convert 2’s complement number using n bits to more than n bits

  • Simply replicate the most significant bit (sign bit) of smaller to fill new bits

    • 2’s comp. positive number has infinite 0s

    • 2’s comp. negative number has infinite 1s

    • Bit representation hides leading bits; sign extension restores some of them

    • 16-bit -4ten to 32-bit:

      1111 1111 1111 1100two

      1111 1111 1111 1111 1111 1111 1111 1100two
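Sign extension as described above can be sketched in a few lines of Python. The helper name `sign_extend` and its parameters are assumptions made for this illustration.

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a from_bits-wide two's complement pattern to to_bits bits."""
    sign = (value >> (from_bits - 1)) & 1             # most significant bit
    if sign:
        # Fill every new upper position with a copy of the sign bit.
        fill = ((1 << (to_bits - from_bits)) - 1) << from_bits
        value |= fill
    return value


print(f"{sign_extend(0xFFFC, 16, 32):032b}")          # 16-bit -4 widened to 32 bits
```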


Midterm Exam (UTS) Score Distribution


Addition of Positive Numbers


One-Bit Full Adder (1/3)

  • Example Binary Addition:

    Carries:  1 1 1
    a:        0 0 1 1
    b:        0 1 0 1
    Sum:      1 0 0 0

  • Thus for any bit of addition:

    • The inputs are a_i, b_i, CarryIn_i

    • The outputs are Sum_i, CarryOut_i

  • Note: CarryIn_(i+1) = CarryOut_i


One-Bit Full Adder (2/3)

Definition

Sum = A’B’Cin + A’BCin’ + AB’Cin’ + ABCin    (X’ denotes NOT X)

CarryOut = AB + ACin + BCin


One-Bit Full Adder (3/3)

[Figure: one-bit full adder block with inputs A, B and CarryIn and outputs Sum and CarryOut]

  • To create one-bit full adder:

    • implement gates for Sum

    • implement gates for CarryOut

    • connect all inputs with same name


Ripple-Carry Adders: adding n-bit numbers

  • Critical path of an n-bit ripple-carry adder is n*CP

    • CP = 2 gate-delays (Cout = AB + ACin + BCin)

[Figure: four 1-bit full adders (FA) chained for bits 0 to 3. Stage i takes A_i, B_i and CarryIn_i and produces Sum_i and CarryOut_i; CarryOut_i feeds CarryIn_(i+1), starting from CarryIn0 and ending at CarryOut3]
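The full adder equations and the ripple-carry chain can be simulated bit by bit. The sketch below is illustrative only: `full_adder` and `ripple_add` are made-up names, bits are plain 0/1 integers, and bit lists are stored least significant bit first.

```python
def full_adder(a, b, cin):
    """One-bit full adder built from the sum-of-products equations."""
    na, nb, ncin = 1 - a, 1 - b, 1 - cin              # complemented inputs
    s = (na & nb & cin) | (na & b & ncin) | (a & nb & ncin) | (a & b & cin)
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_add(a_bits, b_bits, carry_in=0):
    """Chain full adders; CarryOut of stage i becomes CarryIn of stage i+1."""
    carry = carry_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):                  # LSB first
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


# 0011 + 0101 from the binary addition example, written LSB first.
print(ripple_add([1, 1, 0, 0], [1, 0, 1, 0]))
```

Because each stage has to wait for the previous carry, the loop mirrors the reason the critical path grows with the number of bits.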


Fast Adders


Carry Look Ahead: reducing Carry Propagation delay

    A   B   C-out
    0   0   0       “kill”
    0   1   C-in    “propagate”
    1   0   C-in    “propagate”
    1   1   1       “generate”

  P = A + B

  G = A · B

  C1 = G0 + C0 · P0
     = A0B0 + C0(A0 + B0)

  C2 = G1 + G0 · P1 + C0 · P0 · P1

  C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2

  C4 = . . .

  For the whole 4-bit block:

  G = G3 + P3 · G2 + P3 · P2 · G1 + P3 · P2 · P1 · G0

  P = P3 · P2 · P1 · P0

[Figure: four adder cells, each with inputs A and B and outputs S, G and P, connected to the carry look-ahead logic; Cin enters at the first cell]


Carry Look Ahead: Delays

Sum = A’B’Cin + A’BCin’ + AB’Cin’ + ABCin    (X’ denotes NOT X)

  • Expression for any carry:

    • C_(i+1) = G_i + P_i G_(i-1) + … + P_i P_(i-1) … P_0 C_0

  • All carries can be obtained in 3 gate-delays:

    • 1 needed to develop all P_i and G_i

    • 2 needed in the AND-OR circuit

  • All sums can be obtained in 6 gate-delays:

    • 3 needed to obtain carries

    • 1 needed to invert carry

    • 2 needed in the AND-OR circuit of Sum’s circuit

  • Independent of the number of bits (n)

  • 4-bit Adder:

    • CLA: 6 gate-delays

    • RC: (3*2 + 3) gate-delays

  • 16-bit Adder:

    • CLA: 6 gate-delays

    • RC: (15*2 + 3) gate-delays
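The carry expression above can be evaluated directly from the G and P signals, without waiting for lower carries to ripple. The following Python sketch is an illustration only: `cla_carries` is a hypothetical name, bits are 0/1 integers stored least significant bit first, and software evaluation says nothing about gate delays.

```python
def cla_carries(a_bits, b_bits, c0=0):
    """Return [C0, C1, ..., Cn], each computed only from G, P and C0."""
    g = [a & b for a, b in zip(a_bits, b_bits)]       # generate:  G = A · B
    p = [a | b for a, b in zip(a_bits, b_bits)]       # propagate: P = A + B
    carries = [c0]
    for i in range(len(g)):
        # C(i+1) = Gi + Pi G(i-1) + ... + Pi P(i-1) ... P0 C0
        carry = g[i]
        product = p[i]                                # running Pi · P(i-1) · ...
        for j in range(i - 1, -1, -1):
            carry |= product & g[j]
            product &= p[j]
        carry |= product & c0
        carries.append(carry)
    return carries


# 0011 + 0101, LSB first: carries into bits 1, 2 and 3 are all 1.
print(cla_carries([1, 1, 0, 0], [1, 0, 1, 0]))
```

The inner loop expands the two-level AND-OR form term by term; its term count grows with i, which is the fan-in problem that the cascaded design on the next slide addresses.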


Cascaded CLA: overcoming Fan-in constraint

[Figure: 4-bit adder blocks, each supplying group G and P signals to a shared carry look-ahead (CLA) unit, which takes C0 and returns the carries for the blocks]

  C1 = G0 + C0 · P0

  C2 = G1 + G0 · P1 + C0 · P0 · P1

  C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2

  C4 = . . .

  • Delay = 3 + 2 + 3 = 8

  • DelayRC = 15*2 + 3 = 33


Signed Addition & Subtraction


Addition & Subtraction Operations

  • Addition:

    • Just add the two numbers

    • Ignore the Carry-out from MSB

    • Result will be correct, provided there’s no overflow

  • Subtraction:

    • Form 2’s complement of the subtrahend

    • Add the two numbers as in Addition

      0 1 0 1   (+5)
    + 0 0 1 0   (+2)
      0 1 1 1   (+7)

      0 1 0 1   (+5)
    + 1 0 1 0   (-6)
      1 1 1 1   (-1)

      0 0 1 0   (+2)            0 0 1 0
    - 0 1 0 0   (+4)    =>    + 1 1 0 0   (-4)
                                1 1 1 0   (-2)

      1 0 1 1   (-5)
    + 1 1 1 0   (-2)
    1 1 0 0 1   (-7)    (carry-out of 1 ignored)

      0 1 1 1   (+7)
    + 1 1 0 1   (-3)
    1 0 1 0 0   (+4)    (carry-out of 1 ignored)

      1 1 1 0   (-2)            1 1 1 0
    - 1 0 1 1   (-5)    =>    + 0 1 0 1   (+5)
                              1 0 0 1 1   (+3)    (carry-out of 1 ignored)


Overflow

    Decimal   Binary      Decimal   2’s Complement
    0         0000        0         0000
    1         0001        -1        1111
    2         0010        -2        1110
    3         0011        -3        1101
    4         0100        -4        1100
    5         0101        -5        1011
    6         0110        -6        1010
    7         0111        -7        1001
                          -8        1000

  • Examples: 7 + 3 = 10 but ...

  • -4 - 5 = -9 but ...

      0 1 1 1   ( 7)          1 1 0 0   (-4)
    + 0 0 1 1   ( 3)        + 1 0 1 1   (-5)
      1 0 1 0   (-6)          0 1 1 1   ( 7)

Overflow Detection

  • Overflow: the result is too large (or too small) to represent properly

    • Example: -8 <= 4-bit binary number <= 7

  • When adding operands with different signs, overflow cannot occur!

  • Overflow occurs when adding:

    • 2 positive numbers and the sum is negative

    • 2 negative numbers and the sum is positive

  • On your own: Prove you can detect overflow by:

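    • Carry into MSB ≠ Carry out of MSB

      0 1 1 1   ( 7)          1 1 0 0   (-4)
    + 0 0 1 1   ( 3)        + 1 0 1 1   (-5)
      1 0 1 0   (-6)          0 1 1 1   ( 7)

    7 + 3: carry into MSB = 1, carry out of MSB = 0

    -4 + (-5): carry into MSB = 0, carry out of MSB = 1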


Overflow Detection Logic

  • Carry into MSB ≠ Carry out of MSB

    • For an N-bit adder: Overflow = CarryIn[N - 1] XOR CarryOut[N - 1]

    X   Y   X XOR Y
    0   0   0
    0   1   1
    1   0   1
    1   1   0

[Figure: 4-bit ripple-carry adder made of 1-bit full adders (FA) with inputs A0..A3 and B0..B3 and outputs Result0..Result3; CarryIn3 and CarryOut3 feed an XOR gate whose output is Overflow]
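The XOR rule can be tried on the two overflow examples. This is a minimal Python sketch; the function name `add_with_overflow` and the choice to pass operands as unsigned n-bit patterns are assumptions made for illustration.

```python
def add_with_overflow(a, b, n=4):
    """Add two n-bit patterns; return (result, carry_in_msb, carry_out_msb, overflow)."""
    mask = (1 << n) - 1
    low_mask = (1 << (n - 1)) - 1
    # Carry into the MSB comes from adding the lower n-1 bits only.
    carry_in_msb = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    total = (a & mask) + (b & mask)
    carry_out_msb = total >> n
    overflow = carry_in_msb ^ carry_out_msb
    return total & mask, carry_in_msb, carry_out_msb, overflow


print(add_with_overflow(0b0111, 0b0011))   # 7 + 3
print(add_with_overflow(0b1100, 0b1011))   # -4 + (-5)
print(add_with_overflow(0b0101, 0b1010))   # 5 + (-6), different signs
```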


Arithmetic & Branching Conditions


Condition Codes

  • CC Flags will be set/cleared by arithmetic operations:

    • N (negative): 1 if result is negative (MSB = 1), otherwise 0

    • C (carry): 1 if carry-out(borrow) is generated, otherwise 0

    • V (overflow): 1 if overflow occurs, otherwise 0

    • Z (zero): 1 if result is zero, otherwise 0

      0 1 0 1   (+5)
    + 0 1 0 0   (+4)
      1 0 0 1   (-7?)

      0 1 0 1   (+5)
    + 1 0 1 0   (-6)
      1 1 1 1   (-1)

      0 0 1 1   (+3)
    + 1 1 0 1   (-3)
    1 0 0 0 0   (0)

      0 1 1 1   (+7)
    + 1 1 0 1   (-3)
    1 0 1 0 0   (+4)
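The four examples above can be used to see how the N, Z, C and V flags come out. The Python sketch below is illustrative and does not model any particular processor; `add_flags` is a made-up name, and V is computed with the sign rule from the overflow slide (operands have the same sign, result has the other sign).

```python
def add_flags(a, b, n=4):
    """Add two n-bit patterns and report the condition codes as a dict."""
    mask = (1 << n) - 1
    msb = 1 << (n - 1)
    a &= mask
    b &= mask
    total = a + b
    result = total & mask
    same_sign_inputs = not ((a ^ b) & msb)
    sign_changed = bool((a ^ result) & msb)
    flags = {
        "N": int(bool(result & msb)),                 # MSB of the result
        "Z": int(result == 0),
        "C": total >> n,                              # carry-out of the MSB
        "V": int(same_sign_inputs and sign_changed),
    }
    return result, flags


for a, b in [(0b0101, 0b0100), (0b0101, 0b1010), (0b0011, 0b1101), (0b0111, 0b1101)]:
    print(f"{a:04b} + {b:04b} ->", add_flags(a, b))
```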


Multiplication of Positive Numbers


Unsigned Multiplication

  • Paper and pencil example (unsigned):

    Multiplicand          1101   (13)
    Multiplier          x 1011   (11)
                          1101
                         1101
                        0000
                       1101
    Product           10001111   (143)

  • m bits x n bits = m+n bit product

  • Binary makes it easy:

    • 0 => place 0 ( 0 x multiplicand)

    • 1 => place a copy ( 1 x multiplicand)


Unsigned Combinational Multiplier

[Figure: 4 x 4 array multiplier. Each row gates A3 A2 A1 A0 with one multiplier bit (B0 to B3) and adds the result to the running sum, which starts at 0 0 0 0; the outputs are the product bits P7 to P0]

  • Stage i accumulates A * 2^i if Bi == 1


How does it work?

[Figure: the same 4 x 4 array, with rows A3 A2 A1 A0 gated by B0 to B3, zeros fed in at the unused positions, and product bits P7 to P0 at the bottom]

  • at each stage shift A left (x 2)

  • use next bit of B to determine whether to add in shifted multiplicand

  • accumulate 2n bit partial product at each stage


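Multiplier Circuit

[Figure: sequential multiplier datapath. The 4-bit Multiplicand and a constant 0 feed a MUX selected by Add/NoAdd; the MUX output and the upper 4 bits of the 8-bit Product register (whose lower 4 bits initially hold the Multiplier) feed a 4-bit FA; the carry C and the Product register are shifted right under Control]

Multiplicand 1101

    C   Product (Multiplier)   Step
    0   0000 1011              (initial)
    0   1101 1011              Add
    0   0110 1101              Shift
    1   0011 1101              Add
    0   1001 1110              Shift
    0   1001 1110              NoAdd
    0   0100 1111              Shift
    1   0001 1111              Add
    0   1000 1111              Shift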
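The add-and-shift sequence in the table can be reproduced in software. In this Python sketch the names `shift_add_multiply`, `a` (upper half of the product register) and `q` (lower half, holding the multiplier) are chosen for illustration; the register layout is an assumption based on the trace above.

```python
def shift_add_multiply(multiplicand, multiplier, n=4):
    """Unsigned n-bit multiply; returns (product, trace of (step, C, A, Q))."""
    mask = (1 << n) - 1
    a = 0                                  # upper half of the product register
    q = multiplier & mask                  # lower half starts as the multiplier
    trace = []
    for _ in range(n):
        c = 0
        if q & 1:                          # low multiplier bit is 1: add
            total = a + (multiplicand & mask)
            c = total >> n                 # carry out of the n-bit adder
            a = total & mask
            trace.append(("Add", c, a, q))
        else:
            trace.append(("NoAdd", c, a, q))
        # Shift C, A, Q right by one position as a single register.
        q = ((a & 1) << (n - 1)) | (q >> 1)
        a = (c << (n - 1)) | (a >> 1)
        trace.append(("Shift", 0, a, q))
    return (a << n) | q, trace


product, trace = shift_add_multiply(0b1101, 0b1011)
for step, c, a, q in trace:
    print(f"{c}   {a:04b} {q:04b}   {step}")
print(product)
```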


Signed-Operand & Multiplication


Signed Multiplication

  • Negative Multiplicand:

    Multiplicand            10011   (-13)
    Multiplier            x 01011   (+11)
                       1111110011
                       111110011
                       00000000
                       1110011
                       000000
    Product            1101110001   (-143)

  • Negative Multiplier:

    • Form 2’s complement of both multiplier and multiplicand

    • Proceed as above


Motivation for Booth’s Algorithm

  • Works well for both Negative & Positive Multipliers

  • Example 2 x 6 = 0010 x 0110:

        0010
      x 0110
      + 0000        shift (0 in multiplier)
      + 0010        add (1 in multiplier)
      + 0100        add (1 in multiplier)
      + 0000        shift (0 in multiplier)
        00001100

  • FA with add or subtract gets same result in more than one way:

      6 = -2 + 8
      0110 = -00010 + 01000 = 11110 + 01000

  • For example

        0010
      x 0110
        00000000    shift (0 in multiplier)
        1111110     sub (first 1 in multiplier)
        000000      shift (mid string of 1s)
      + 00010       add (prior step had last 1)
        00001100


Booth’s Algorithm

    Current Bit   Bit to the Right   Explanation           Example      Op
    1             0                  Begins run of 1s      0001111000   sub
    1             1                  Middle of run of 1s   0001111000   none
    0             1                  End of run of 1s      0001111000   add
    0             0                  Middle of run of 0s   0001111000   none

Originally for Speed (when shift was faster than add)

  • Small number of additions needed when multiplier has a few large blocks of 1s


Booth’s Example (2 x 7)

    Operation          Multiplicand   Product        next?
    0. initial value   0010           0000 0111 0    10 -> sub
    1a. P = P - m      1110           + 1110
                                      1110 0111 0    shift P (sign ext)
    1b.                0010           1111 0011 1    11 -> nop, shift
    2.                 0010           1111 1001 1    11 -> nop, shift
    3.                 0010           1111 1100 1    01 -> add
    4a.                0010           + 0010
                                      0001 1100 1    shift
    4b.                0010           0000 1110 0    done


Booth’s Example (2 x -3)

    Operation          Multiplicand   Product        next?
    0. initial value   0010           0000 1101 0    10 -> sub
    1a. P = P - m      1110           + 1110
                                      1110 1101 0    shift P (sign ext)
    1b.                0010           1111 0110 1    01 -> add
                                      + 0010
    2a.                               0001 0110 1    shift P
    2b.                0010           0000 1011 0    10 -> sub
                                      + 1110
    3a.                0010           1110 1011 0    shift
    3b.                0010           1111 0101 1    11 -> nop
    4a.                               1111 0101 1    shift
    4b.                0010           1111 1010 1    done
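The two traces follow the same loop: look at the current bit and the bit to its right, add or subtract the multiplicand, then shift the product right with sign extension. The Python sketch below is an illustration with made-up names (`booth_multiply`, `upper`, `lower`, `extra`). It assumes the multiplicand is not the most negative n-bit value, since its negation does not fit in n bits.

```python
def booth_multiply(multiplicand, multiplier, n=4):
    """Booth's algorithm on n-bit signed operands; returns the signed product."""
    mask = (1 << n) - 1
    m = multiplicand & mask
    neg_m = (-multiplicand) & mask         # two's complement of the multiplicand
    upper = 0                              # upper half of the product register
    lower = multiplier & mask              # lower half holds the multiplier
    extra = 0                              # the bit to the right, initially 0
    for _ in range(n):
        pair = ((lower & 1) << 1) | extra
        if pair == 0b10:                   # begins a run of 1s: subtract
            upper = (upper + neg_m) & mask
        elif pair == 0b01:                 # end of a run of 1s: add
            upper = (upper + m) & mask
        # Arithmetic shift right of upper:lower:extra (sign bit is replicated).
        extra = lower & 1
        lower = ((upper & 1) << (n - 1)) | (lower >> 1)
        sign = upper >> (n - 1)
        upper = (sign << (n - 1)) | (upper >> 1)
    product = (upper << n) | lower
    if product >> (2 * n - 1):             # interpret the 2n-bit pattern as signed
        product -= 1 << (2 * n)
    return product


print(booth_multiply(2, 7))
print(booth_multiply(2, -3))
```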


Fast Multiplication

(read yourself!)


Integer Division


Divide: Paper & Pencil

                      1001    Quotient
    Divisor 1000 | 1001010    Dividend
                  -1000
                      10
                      101
                      1010
                     -1000
                        10    Remainder (or Modulo result)

  • See how big a number can be subtracted, creating a quotient bit on each step

    Binary => 1 * divisor or 0 * divisor

  • Dividend = Quotient x Divisor + Remainder => | Dividend | = | Quotient | + | Divisor | (here | x | denotes the number of bits in x)


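Division Circuit

[Figure: divider datapath. A 33-bit Divisor register feeds a 33-bit FA; the 65-bit Remainder register, whose lower part holds the Quotient, is shifted left under Control; sign-bit checking of the result drives the Q Setting logic]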


Restoring Division Algorithm

                 0010   Quotient
    Divisor 11 | 1000   Dividend
                  11
                   10   Remainder

    Step        Remainder   Quotient
    Initially   00000       1000
    Shift       00001       000_
    Sub (-11)   11101
    Set q0      11110
    Restore     00001       0000
    Shift       00010       000_
    Sub (-11)   11101
    Set q0      11111
    Restore     00010       0000
    Shift       00100       000_
    Sub (-11)   11101
    Set q0      00001       0001
    Shift       00010       001_
    Sub (-11)   11101
    Set q0      11111
    Restore     00010       0010

    (Sub rows show the value added, 11101 = -11two in 5 bits; Set q0 rows show the resulting remainder.)
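The shift, subtract, set-q0 and restore steps can be sketched in Python as follows. The function name `restoring_divide` is hypothetical, operands are unsigned, and the restore step is modelled by simply not keeping a negative trial result.

```python
def restoring_divide(dividend, divisor, n=4):
    """Unsigned restoring division on an n-bit dividend; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    mask = (1 << n) - 1
    remainder = 0
    quotient = dividend & mask             # quotient bits replace dividend bits
    for _ in range(n):
        # Shift the remainder:quotient pair left by one position.
        remainder = (remainder << 1) | ((quotient >> (n - 1)) & 1)
        quotient = (quotient << 1) & mask
        trial = remainder - divisor        # Sub
        if trial >= 0:
            remainder = trial              # keep the difference, q0 = 1
            quotient |= 1
        # else: restore, i.e. keep the old remainder, q0 stays 0
    return quotient, remainder


print(restoring_divide(0b1000, 0b11))      # the 1000 / 11 example
```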


Floating-point Numbers & Operations


Review of Numbers

  • Computers are made to deal with numbers

  • What can we represent in N bits?

    • Unsigned integers:

      0 to 2^N - 1

    • Signed Integers (Two’s Complement)

      -2^(N-1) to 2^(N-1) - 1


Other Numbers

  • What about other numbers?

    • Very large numbers? (seconds/century) 3,155,760,000ten (3.15576ten x 10^9)

    • Very small numbers? (atomic diameter) 0.00000001ten (1.0ten x 10^-8)

    • Rationals (repeating pattern) 2/3 (0.666666666. . .)

    • Irrationals 2^(1/2) (1.414213562373. . .)

    • Transcendentals e (2.718...), π (3.141...)

  • All represented in scientific notation


Scientific Notation Review

    6.02 x 10^23

[Figure: the number above with its parts labelled: mantissa (6.02), decimal point, radix (base) (10) and exponent (23)]

  • Normalized form: no leading 0s (exactly one digit to left of decimal point)

  • Alternatives to representing 1/1,000,000,000

    • Normalized: 1.0 x 10^-9

    • Not normalized: 0.1 x 10^-8, 10.0 x 10^-10


Scientific Notation for Binary Numbers

    1.0two x 2^-1

[Figure: the number above with its parts labelled: Mantissa (1.0two), “binary point”, radix (base) (2) and exponent (-1)]

  • Computer arithmetic that supports it is called floating point, because it represents numbers where the binary point is not fixed, as it is for integers

    • Declare such variable in C as float


Floating Point Representation (1/2)

  • Normal format: +1.xxxxxxxxxxtwo x 2^yyyytwo

  • Multiple of Word Size (32 bits)

    Bits     31      30-23      22-0
    Field    S       Exponent   Significand
    Width    1 bit   8 bits     23 bits

  • S represents Sign; Exponent represents y’s; Significand represents x’s

  • Represent numbers as small as 2.0 x 10^-38 to as large as 2.0 x 10^38


Floating Point Representation (2/2)

  • What if result too large? (> 2.0 x 10^38)

    • Overflow!

    • Overflow => Exponent larger than represented in 8-bit Exponent field

  • What if result too small? (> 0, < 2.0 x 10^-38)

    • Underflow!

    • Underflow => Negative exponent larger than represented in 8-bit Exponent field

  • How to reduce chances of overflow or underflow?


Double Precision Fl. Pt. Representation

  • Next Multiple of Word Size (64 bits)

    Word 1:   bit 31: S (1 bit)   bits 30-20: Exponent (11 bits)   bits 19-0: Significand (20 bits)
    Word 2:   Significand (cont’d) (32 bits)

  • Double Precision (vs. Single Precision)

    • C variable declared as double

    • Represent numbers almost as small as 2.0 x 10^-308 to almost as large as 2.0 x 10^308

    • But primary advantage is greater accuracy due to larger significand


IEEE 754 Floating Point Standard (1/4)

  • Single Precision, DP similar

  • Sign bit: 1 means negative, 0 means positive

  • Significand:

    • To pack more bits, leading 1 implicit for normalized numbers

    • 1 + 23 bits single, 1 + 52 bits double

    • always true: 0 <= Significand < 1 (for normalized numbers)

  • Note: 0 has no leading 1, so reserve exponent value 0 just for number 0


IEEE 754 Floating Point Standard (2/4)

  • Kahan wanted FP numbers to be used even if no FP hardware; e.g., sort records with FP numbers using integer compares

  • Could break FP number into 3 parts: compare signs, then compare exponents, then compare significands

  • Wanted it to be faster, single compare if possible, especially if positive numbers

  • Then want order:

    • Highest order bit is sign ( negative < positive)

    • Exponent next, so big exponent => bigger #

    • Significand last: exponents same => bigger #


IEEE 754 Floating Point Standard (3/4)

  • Negative Exponent?

    • 2’s comp? 1.0 x 2^-1 v. 1.0 x 2^+1 (1/2 v. 2)

      1/2:   0 | 1111 1111 | 000 0000 0000 0000 0000 0000
      2:     0 | 0000 0001 | 000 0000 0000 0000 0000 0000

  • This notation using integer compare of 1/2 v. 2 makes 1/2 > 2!

  • Instead, pick notation where 0000 0001 is most negative, and 1111 1111 is most positive

    • 1.0 x 2^-1 v. 1.0 x 2^+1 (1/2 v. 2)

      1/2:   0 | 0111 1110 | 000 0000 0000 0000 0000 0000
      2:     0 | 1000 0000 | 000 0000 0000 0000 0000 0000


IEEE 754 Floating Point Standard (4/4)

  • Called Biased Notation, where the bias is the number subtracted to get the real number

    • IEEE 754 uses bias of 127 for single prec.

    • Subtract 127 from Exponent field to get actual value for exponent

    • 1023 is bias for double precision

  • Summary (single precision):

    Bits     31      30-23      22-0
    Field    S       Exponent   Significand
    Width    1 bit   8 bits     23 bits

  • (-1)^S x (1 + Significand) x 2^(Exponent-127)

    • Double precision identical, except with exponent bias of 1023
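The field layout and the summary formula can be checked against real single-precision encodings. This Python sketch uses the standard `struct` module to obtain the 32-bit pattern; `decode_single` and `value_of` are illustrative names, and `value_of` handles normalized numbers only.

```python
import struct


def decode_single(x):
    """Split the IEEE 754 single-precision encoding of x into (S, Exponent, Significand)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # 32-bit pattern as an integer
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF                        # biased 8-bit exponent field
    fraction = bits & 0x7FFFFF                            # 23 stored significand bits
    return sign, exponent, fraction


def value_of(sign, exponent, fraction):
    """(-1)^S x (1 + Significand) x 2^(Exponent-127), normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)


for x in (0.5, 2.0, -0.75):
    s, e, f = decode_single(x)
    print(x, s, f"{e:08b}", f"{f:023b}", value_of(s, e, f))
```

With the biased exponent, the pattern for 1/2 is smaller than the pattern for 2 when both are compared as integers, which is the ordering property motivated on the previous slides.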


Special Numbers

  • What have we defined so far? (Single Precision)

    Exponent   Significand   Object
    0          0             0
    0          nonzero       ???
    1-254      anything      +/- fl. pt. #
    255        0             +/- infinity
    255        nonzero       NaN


Infinity and NaNs

+/- infinity:   S | 1 . . . 1 | 0 . . . 0

  • Result of operation overflows, i.e., is larger than the largest number that can be represented

  • Overflow is not the same as divide by zero (raises a different exception)

  • It may make sense to do further computations with infinity, e.g., X/0 > Y may be a valid comparison

NaN:   S | 1 . . . 1 | non-zero (HW decides what goes here)

  • Not a number, but not infinity (e.g., sqrt(-4))

  • Invalid operation exception (unless operation is = or ≠)

  • NaNs propagate: f(NaN) = NaN
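The exponent and significand patterns for the special values can be inspected from Python. The sketch below is illustrative: `single_bits` and `classify_single` are made-up helpers, and the label used for an exponent of 0 with a nonzero significand anticipates the denormalized numbers covered later. Python itself raises exceptions for some operations (for example `math.sqrt(-4)`), so the examples use `math.inf` and `math.nan` directly.

```python
import math
import struct


def single_bits(x):
    """32-bit IEEE 754 single-precision pattern of x, as an integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


def classify_single(bits):
    """Classify a single-precision pattern by its Exponent and Significand fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


for x in (0.0, 1.0, math.inf, -math.inf, math.nan):
    print(x, classify_single(single_bits(x)))

print(math.inf > 1e308)                    # comparisons with infinity are meaningful
print(math.isnan(math.inf - math.inf))     # an invalid operation yields NaN
print(math.isnan(math.nan + 1.0))          # NaNs propagate
```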


FP Addition

  • Much more difficult than with integers

  • Can’t just add significands

  • How do we do it?

    • De-normalize to match exponents

    • Add significands to get resulting one

    • Keep the same exponent

    • Normalize (possibly changing exponent)

  • Note: If signs differ, just perform a subtract instead.


FP Subtraction

  • Similar to addition

  • How do we do it?

    • De-normalize to match exponents

    • Subtract significands

    • Keep the same exponent

    • Normalize (possibly changing exponent)


Extra Bits for rounding

"Floating Point numbers are like piles of sand; every time you move one you lose a little sand, but you pick up a little dirt."

How many extra bits?

IEEE: As if computed the result exactly and rounded.

  • Guard Digits: digits to the right of the first p digits of significand to guard against loss of digits – can later be shifted left into first P places during normalization.

  • Addition: carry-out shifted in

  • Subtraction: borrow digit and guard

  • Multiplication: carry and guard, Division requires guard

Addition:

      1.xxxxx               1.xxxxx               1.xxxxx
    + 1.xxxxx             + 0.001xxxxx          + 0.01xxxxx
     1x.xxxxy               1.xxxxxyyy           1x.xxxxyyy

    post-normalization    pre-normalization     pre and post


Rounding Digits

Normalized result, but some non-zero digits to the right of the significand --> the number should be rounded

E.g., B = 10, p = 3:

      0 2 1.69   =     1.6900 * 10^(2-bias)
    - 0 0 7.85   =   -  .0785 * 10^(2-bias)
    = 0 2 1.61   =     1.6115 * 10^(2-bias)

One round digit must be carried to the right of the guard digit so that after a normalizing left shift, the result can be rounded, according to the value of the round digit

IEEE Standard:

four rounding modes:

  • round to nearest (default)

  • round towards plus infinity

  • round towards minus infinity

  • round towards 0

round to nearest:

  • round digit < B/2 then truncate

  • round digit > B/2 then round up (add 1 to ULP: unit in last place)

  • round digit = B/2 then round to nearest even digit

It can be shown that this strategy minimizes the mean error introduced by rounding
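The four rounding modes can be tried on decimal values, which matches the B = 10 example above. This Python sketch uses the standard `decimal` module as a stand-in for hardware rounding; the mapping of mode names in `MODES` and the helper `round_to` are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_EVEN

# Hypothetical short names for the four IEEE rounding modes.
MODES = {
    "nearest": ROUND_HALF_EVEN,   # round to nearest, ties to even (default)
    "up": ROUND_CEILING,          # round towards plus infinity
    "down": ROUND_FLOOR,          # round towards minus infinity
    "zero": ROUND_DOWN,           # round towards 0
}


def round_to(value, places, mode="nearest"):
    """Round the decimal string value to the precision given by places, e.g. "0.01"."""
    return Decimal(value).quantize(Decimal(places), rounding=MODES[mode])


print(round_to("1.6115", "0.01"))                 # the p = 3 example
print(round_to("2.5", "1"), round_to("3.5", "1")) # ties go to the even digit
for mode in MODES:
    print(mode, round_to("-1.6115", "0.01", mode))
```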


Sticky Bit

Additional bit to the right of the round digit to better fine tune rounding

      d0 . d1 d2 d3 . . . dp-1   0 0 0
    +  0 .  0  0  X . . .    X   X X S
                                 X X S

Sticky bit: set to 1 if any 1 bits fall off the end of the round digit

      d0 . d1 d2 d3 . . . dp-1   0 0 0
    -  0 .  0  0  X . . .    X   X X 0
                                 X X 0

      d0 . d1 d2 d3 . . . dp-1   0 0 0
    -  0 .  0  0  X . . .    X   X X 1      generates a borrow

Rounding Summary:

  • Radix 2 minimizes wobble in precision

  • Normal operations in +, -, *, / require one carry/borrow bit + one guard digit

  • One round digit needed for correct rounding

  • Sticky bit needed when round digit is B/2 for max accuracy

  • Rounding to nearest has mean error = 0 if uniform distribution of digits is assumed


Denormalized Numbers

[Figure: number line near 0 for normal numbers with hidden bit, B = 2, p = 4, marking 0, 2^(-bias), 2^(1-bias) and 2^(2-bias), with the “denorm gap” between 0 and 2^(-bias)]

The gap between 0 and the next representable number is much larger than the gaps between nearby representable numbers.

IEEE standard uses denormalized numbers to fill in the gap, making the distances between numbers near 0 more alike.

[Figure: the same number line with denormalized numbers filling the gap: p-1 bits of precision below the smallest normal number, p bits of precision above it; same spacing, half as many values!]

NOTE: PDP-11, VAX cannot represent subnormal numbers. These machines underflow to zero instead.
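Gradual underflow can be observed from Python, assuming its `float` is an IEEE 754 double, as it is on common platforms. The variable names below are illustrative.

```python
import sys

smallest_normal = sys.float_info.min       # smallest normalized double, 2^-1022
smallest_subnormal = 5e-324                # smallest denormalized double, 2^-1074

print(smallest_normal)
print(smallest_normal / 2)                 # still non-zero: a denormalized number
print(smallest_subnormal / 2)              # below the last denormal: underflows to 0.0

# With denormals, two distinct nearby values still have a non-zero difference.
a = smallest_normal
b = smallest_normal * 1.5
print(b - a)
```

On a machine that flushes to zero instead, the difference `b - a` would come out as 0 even though `a != b`.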

