
Multiplication - PowerPoint PPT Presentation

Multiplication. Example: multiplicand 1 1 0 0 (12), multiplier 0 1 0 1 (5), partial products 1 1 0 0, 0 0 0 0, 1 1 0 0, 0 0 0 0, product 0 1 1 1 1 0 0 (60). 4 partial products. Repeat n times: compute partial product; shift; add. Note: each bit of partial products is just an AND operation.


PowerPoint Slideshow about ' Multiplication' - kiona

Presentation Transcript

• Example

    multiplicand:       1 1 0 0    (12)
    multiplier:         0 1 0 1    (5)
                        1 1 0 0
                      0 0 0 0
                    1 1 0 0
                  0 0 0 0
    product:      0 1 1 1 1 0 0    (60)

4 partial products

repeat n times: compute partial product; shift; add

note: each bit of partial products is just an AND operation
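
The note about AND operations can be made concrete with a minimal Python sketch. The function name `partial_products` and its argument order are assumptions made for illustration; they are not part of the slides. Each bit of a partial product is the AND of one multiplicand bit with one multiplier bit, and row i is shifted left by i positions.

```python
def partial_products(multiplicand, multiplier, n):
    """Return the n partial products, each already shifted into position."""
    rows = []
    for i in range(n):
        bit = (multiplier >> i) & 1
        row = 0
        for j in range(n):
            # Bit j of this row is (multiplicand bit j) AND (multiplier bit i).
            row |= (((multiplicand >> j) & 1) & bit) << j
        rows.append(row << i)
    return rows


rows = partial_products(0b1100, 0b0101, 4)
print([format(r, "07b") for r in rows])
print(sum(rows))  # adding the shifted rows gives the product
```

For the example above (12 times 5) the four rows sum to 60.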

CSE 567 - Autumn 1998 - Misc. Topics - 1

Sequential Multiplier

• one bit of multiplier applied each cycle

    z = 0;
    repeat n times:
        if (x[0]) z = z + y;
        x = x >> 1; y = y << 1;

[Figure: datapath with labels multiplicand (y), 0, multiplier (x), and result (z)]
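
The loop on this slide translates almost line for line into Python. This is only a sketch: the function name `sequential_multiply` is an assumption, and Python integers are unbounded, so the register widths of a real datapath are not modeled.

```python
def sequential_multiply(x, y, n):
    """Shift-and-add: x is the n-bit multiplier, y is the multiplicand."""
    z = 0
    for _ in range(n):
        if x & 1:      # x[0]: lowest remaining multiplier bit
            z = z + y
        x >>= 1        # expose the next multiplier bit
        y <<= 1        # weight the multiplicand for the next bit
    return z


print(sequential_multiply(0b0101, 0b1100, 4))
```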

Sequential Multiplier (cont’d)

• one bit of multiplier applied each cycle

    z = 0;
    repeat n times:
        if (x[0]) z = z + y * 2^n;
        x = x >> 1; z = z >> 1;

[Figure: datapath with labels multiplicand (y), multiplier (x), and result (z)]
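
In this second version the multiplicand stays in place and the result shifts right instead. A minimal Python sketch follows, assuming the term on the slide is y times 2 to the power n (the multiplicand added into the upper half of the result); the function name is hypothetical.

```python
def sequential_multiply_shift_result(x, y, n):
    """Add y * 2**n when the current multiplier bit is set, then shift z right."""
    z = 0
    for _ in range(n):
        if x & 1:
            z = z + (y << n)   # y * 2^n: add into the upper half of z
        x >>= 1
        z >>= 1                # a contribution from step i ends at weight 2^i
    return z


print(sequential_multiply_shift_result(0b0101, 0b1100, 4))
```

A contribution added in step i is shifted right n - i times in total, so it ends up weighted by 2^i, which is why both versions compute the same product.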

• Fine-grained - bit level

• Pipelining

• same number of functional units

• different latency, but increased throughput

• less work per clock cycle

• Coarse-grained - data-path level

• e.g., multiple arithmetic units

• multi-port register files (read/write from different sources/destinations)

• Processor level

• difficult to take advantage of many levels of parallelism in fixed general-purpose processors

• much easier when the processors are special-purpose, e.g., systolic computations

• Exploit ability to do necessary bit-level computations directly

• exploit redundant logic

• goal - keep all circuits busy, reduce critical path

• Examples

• multipliers

Combinational Multipliers

• Use AND gates to generate all partial products in parallel

[Figure: array of AND gates combining multiplicand and multiplier bits into partial products, LSBs marked]

Combinational Multipliers (cont'd)

• Skew array to send partial products along diagonal and make it square

[Figure: skewed partial-product array, LSB marked]

Combinational Multipliers (cont'd)

• Ripple-carry adder in each row (carries ripple right to left)

• Sums ripple down (shifted one to right)

worst-case delay is 3n

[Figure: array of adder cells with labels A, Cin, Cout, and S, 0s fed in at the edges, LSBs marked]

• Forward carries to next row of adders

• CLA at the end to add last partial product and forwarded carries

no need to optimize carry more than sum

using CLA for final stage makes this faster than previous multiplier (worst-case is 2n)

[Figure: array of adder cells with labels A, B, Cin, Cout, and S, 0s fed in at the edges, LSBs marked, and a CLA as the final stage]
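
Forwarding carries to the next row can be sketched at word level in Python. This is an illustrative model under stated assumptions, not the circuit on the slide: each row is treated as a bank of full adders working on whole integers, the function name `carry_save_multiply` is hypothetical, and the final `+` stands in for the CLA.

```python
def carry_save_multiply(x, y, n):
    """Accumulate partial products as separate sum and carry words."""
    sum_bits, carry_bits = 0, 0
    for i in range(n):
        pp = (y << i) if (x >> i) & 1 else 0
        # One row of full adders: each column is handled independently,
        # so no carry ripples sideways inside the row.
        new_sum = sum_bits ^ carry_bits ^ pp
        new_carry = ((sum_bits & carry_bits) | (sum_bits & pp) | (carry_bits & pp)) << 1
        sum_bits, carry_bits = new_sum, new_carry
    # Only this last addition propagates carries (the CLA in the slide).
    return sum_bits + carry_bits


print(carry_save_multiply(0b0101, 0b1100, 4))
```

The invariant is that `sum_bits + carry_bits` always equals the running total, so carry propagation is paid for once, at the end.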

[Figure: partial products; labels x2 and x1]

Wallace Tree Multiplier

• Use tree structure to reduce number of additions in critical path to O(log n) rather than O(n)

• Difficult structure to lay out and integrate with partial product crossbar

• Wiring constraints make it unattractive in many technologies

[Figure: partial products labeled PP0 to PP8 combined by a tree of adders (+), followed by a CLA that produces the Result]
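
The logarithmic depth can be illustrated with a small Python sketch. It is a word-level model under assumptions: rows are grouped three at a time in list order (real Wallace trees reduce column by column), the helper names are hypothetical, and the final `sum` stands in for the CLA.

```python
def compress_3_2(a, b, c):
    """3:2 compressor on whole words: three operands in, sum and carry out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_multiply(x, y, n):
    """Return (product, number of 3:2 compressor levels used)."""
    rows = [(y << i) if (x >> i) & 1 else 0 for i in range(n)]
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3      # rows that form complete triples
        next_rows = []
        for k in range(0, full, 3):
            next_rows.extend(compress_3_2(rows[k], rows[k + 1], rows[k + 2]))
        next_rows.extend(rows[full:])         # leftover rows pass through
        rows = next_rows
        levels += 1
    return sum(rows), levels                  # final carry-propagate addition


print(wallace_multiply(200, 123, 8))
```

With 8 partial products the row count goes 8, 6, 4, 3, 2, so four compressor levels precede the final adder, compared with seven sequential additions in a linear array.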

• Problem with Wallace tree is 3:2 column reduction

• need 2:1 reduction for binary tree

• One solution: signed-digit binary trees

• represent digits as 0, 1, -1

• similar to Booth's encoding

Signed-digit addition table (two-digit result for each digit pair; x and y as labeled in the original figure):

    digits        result    condition
     1 +  1        1  0
     1 +  0        1 -1     if x>=0 and y>=0
                   0  1     otherwise
     1 + -1        0  0
     0 +  0        0  0
    -1 +  0        0 -1     if x>=0 and y>=0
                  -1  1     otherwise
    -1 + -1       -1  0
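
A minimal Python sketch of signed-digit addition follows. It rests on an assumption the transcript does not confirm: that x and y in the table are the two operand digits one position lower, and that the two-digit results are a transfer digit followed by a sum digit. Function names are hypothetical. Under that reading, no carry travels more than one position.

```python
def sd_add(a, b):
    """Add two signed-digit numbers (digits -1, 0, 1), equal length, LSB first."""
    n = len(a)
    sums, transfers = [], []
    for i in range(n):
        # Assumed meaning of "x>=0 and y>=0": both digits one position lower.
        lower_nonneg = i == 0 or (a[i - 1] >= 0 and b[i - 1] >= 0)
        p = a[i] + b[i]
        if p == 2:
            t, s = 1, 0
        elif p == 1:
            t, s = (1, -1) if lower_nonneg else (0, 1)
        elif p == 0:
            t, s = 0, 0
        elif p == -1:
            t, s = (0, -1) if lower_nonneg else (-1, 1)
        else:  # p == -2
            t, s = -1, 0
        transfers.append(t)
        sums.append(s)
    # Each result digit is a sum digit plus the transfer from the position below.
    result = [sums[i] + (transfers[i - 1] if i > 0 else 0) for i in range(n)]
    result.append(transfers[-1])
    return result


def sd_value(digits):
    """Integer value of an LSB-first signed-digit list."""
    return sum(d * 2**i for i, d in enumerate(digits))


print(sd_add([1, 0, 1], [1, 1, 0]), sd_value(sd_add([1, 0, 1], [1, 1, 0])))
```

Because every result digit depends only on two neighboring positions, two signed-digit operands reduce to one in constant time, which is the 2:1 reduction a binary tree needs.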

• Take care of (retire) more than one bit per shift operation

• Example: shift two bits at a time

                        0  0  1  1  0  1       13   (M)
                        1  1  1  0  1  0       -6
                        0  0 -1  1 -1  0            Booth recoding steps
                           0    -1    -2
      1  1  1  1  1  1  1  0  0  1  1  0            -2*M
      1  1  1  1  1  1  0  0  1  1                  -1*M, shifted 2 bits
      0  0  0  0  0  0  0  0                         0*M, shifted 4 bits
      1  1  1  1  1  0  1  1  0  0  1  0      -78

Booth recoding table

    0 0 0     0*M
    0 0 1     1*M
    0 1 0     1*M
    0 1 1     2*M
    1 0 0    -2*M
    1 0 1    -1*M
    1 1 0    -1*M
    1 1 1     0*M

must be able to add multiplicand times 0, -1, -2, 1, and 2
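
The recoding table can be checked with a short Python sketch. The function names are assumptions for illustration. Each step reads two multiplier bits plus the bit to their right and yields a digit in the range -2 to 2, so two bits are retired per step.

```python
def booth_radix4_digits(x, n):
    """Recode an n-bit two's-complement multiplier (n even) into digits, LSB first."""
    digits = []
    prev = 0  # implicit 0 to the right of the least significant bit
    for i in range(0, n, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)  # matches the table row (b1, b0, prev)
        prev = b1
    return digits


def booth_multiply(m, x, n):
    """Sum the selected multiples of the multiplicand m, two bits per step."""
    return sum(d * m * 4**k for k, d in enumerate(booth_radix4_digits(x, n)))


print(booth_radix4_digits(-6, 6))
print(booth_multiply(13, -6, 6))
```

For the multiplier -6 the digits are -2, -1, 0 from least to most significant, the same 0, -1, -2 shown in the worked example, and the product is -78.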

• Registers have input and output

• output can be fanned out to many destinations

• input can come from many sources

• multiplexer needed on input to select which

[Figure: registers with a multiplexer on the input; labels: inputs from other registers, control signals to choose input source, input, output, outputs to other registers]

• Multiplexers: lots of control signals but full parallelism of transfers

• Busses


• Adding registers along a path

• split combinational logic into multiple cycles

• each cycle smaller than previously

• Told Cold > Tnew Cnew

• increase throughput

CSE 567 - Autumn 1998 - Misc. Topics - 16

• Delay, d, of slowest combinational stage determines performance

• Throughput = 1/d – rate at which outputs are produced

• Latency = n•d – number of stages * clock period

• Pipelining increases circuit utilization

• Registers slow down data, synchronize data paths

• Wave-pipelining

• no pipeline registers - waves of data flow through circuit

• relies on equal-delay circuit paths - no short paths
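
The throughput and latency formulas above can be tried on numbers with a small Python sketch. The stage delays below are made-up values for illustration, the function name is hypothetical, and register overhead (propagation delay and setup time) is ignored.

```python
def pipeline_metrics(stage_delays):
    """Return (throughput, latency) for a pipeline with the given stage delays."""
    d = max(stage_delays)   # the slowest stage sets the clock period
    n = len(stage_delays)
    return 1 / d, n * d


# Hypothetical example: 24 units of logic, unpipelined and split into 3 stages.
print(pipeline_metrics([24]))
print(pipeline_metrics([8, 9, 7]))
```

In this example the unbalanced split raises latency from 24 to 27 while throughput improves from 1/24 to 1/9, which is why balancing the stages matters.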


• Where is the best place to add registers?

• splitting combinational logic

• overhead of registers (propagation delay and setup time requirements)

• What about cycles in data path?


• Process of optimally distributing registers throughout a circuit

• minimize the clock period

• minimize the number of registers


• Fast optimal algorithm (Leiserson & Saxe 1983)

• Retiming rules:

• remove one register from each input and add one to each output

• remove one register from each output and add one to each input

Optimal Pipelining

• Add registers - use retiming to find optimal location

[Figure: example circuit annotated with the values 13, 7, 8, 6, 5, 10]

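Example - Digital Correlator

• y_t = d(x_t, a_0) + d(x_{t-1}, a_1) + d(x_{t-2}, a_2) + d(x_{t-3}, a_3)

• d(x_t, a_0) = 0 if x ≠ a, 1 otherwise (and passes x along to the right)

[Figure: host feeding x_t through a chain of four comparators d (with a_0 to a_3); three adders (+) combine the comparator outputs into y_t, which returns to the host]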

Example - Digital Correlator (cont’d)

• Delays: adder, 7; comparator, 3; host, 0

[Figure: two versions of the correlator circuit (host, four comparators d, adders +), one with cycle time = 24 and one with cycle time = 13]
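
The two cycle times can be reproduced with a short Python sketch. Everything about the graph here is an assumption: the figure did not survive, so the edge list, the register counts, the node names, and the retiming values `R` are reconstructed to match the delays and cycle times quoted on the slide. The cycle time is taken as the longest path of delays that crosses no register.

```python
from functools import lru_cache

# Assumed circuit: host -> c1 -> c2 -> c3 -> c4 (one register per hop);
# adders a1, a2, a3 sum the comparator outputs with no registers in between.
DELAY = {"host": 0, "c1": 3, "c2": 3, "c3": 3, "c4": 3, "a1": 7, "a2": 7, "a3": 7}
EDGES = {
    ("host", "c1"): 1, ("c1", "c2"): 1, ("c2", "c3"): 1, ("c3", "c4"): 1,
    ("c4", "a1"): 0, ("c3", "a1"): 0, ("a1", "a2"): 0, ("c2", "a2"): 0,
    ("a2", "a3"): 0, ("c1", "a3"): 0, ("a3", "host"): 0,
}
# Assumed retiming: r[v] registers move from the outputs of v to its inputs.
R = {"host": 0, "c1": -1, "c2": -1, "c3": -2, "c4": -2, "a1": -2, "a2": -1, "a3": 0}


def retime(edges, r):
    """Apply a retiming: new count on u -> v is w + r[v] - r[u]."""
    new_edges = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
    assert all(w >= 0 for w in new_edges.values()), "illegal retiming"
    return new_edges


def clock_period(edges, delay):
    """Longest register-free path delay (assumes no register-free cycles)."""
    succ = {v: [] for v in delay}
    for (u, v), w in edges.items():
        if w == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        return delay[u] + max((longest_from(v) for v in succ[u]), default=0)

    return max(longest_from(v) for v in delay)


print(clock_period(EDGES, DELAY))
print(clock_period(retime(EDGES, R), DELAY))
```

With these assumed values the original circuit has a critical path of one comparator and three adders (3 + 7 + 7 + 7 = 24), and the retimed circuit a critical path of two comparators and one adder (3 + 3 + 7 = 13).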

• Pipelining can be applied to any of the combinational multipliers

[Figure: adders (+) and CLA blocks divided into pipeline stages]

FF at every intersection of pipe stage and wire

Example - Sorting

[Figure: Comparator cell with labels H, B, and L, and a Parallel Sorter built from comparators]

• Pipelined


• Set of identical processing elements

• specialized or programmable

• Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)

• SIMD-like

• Multiple data flows, converging to engage in computation

Analogy: data flowing through the system in a rhythmic fashion – from main memory through a series of processing elements and back to main memory

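• y_j = x_j w_1 + x_{j+1} w_2 + ... + x_{j+n-1} w_n

[Figure: linear array of cells holding the weights w1 to w4; the x stream (x1, x2, x3, ...) and the y stream (y1, y2, y3, ...) pass through it with gaps between items]

    y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4
    y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_5 w_4
    y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_6 w_4
    ...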

[Figure: successive snapshots of the x and y streams as they advance through the cells w4 w3 w2 w1]

    x stream                        y stream
    x6 - x5 - x4 - x3 - x2 - x1     - - - y1 - y2 - y3
    x6 - x5 - x4 - x3 - x2 -        - - y1 - y2 - y3
    x6 - x5 - x4 - x3 - x2          - y1 - y2 - y3
    x6 - x5 - x4 - x3 -             y1 - y2 - y3
    x6 - x5 - x4 - x3               - y2 - y3
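
The streams can be simulated step by step in Python. This is a sketch under assumptions that the transcript only partly supports: weights stay in the cells in the order w4 w3 w2 w1, x values enter from the left on every other step, y values enter from the right after n - 1 idle steps, and the two streams move in opposite directions. Function names are hypothetical. A direct evaluation of the formula is included as a reference.

```python
def convolve_direct(x, w):
    """y_j = x_j w_1 + x_{j+1} w_2 + ... + x_{j+n-1} w_n (0-based lists)."""
    n = len(w)
    return [sum(x[j + i] * w[i] for i in range(n)) for j in range(len(x) - n + 1)]


def convolve_systolic(x, w):
    """Step-by-step model of a linear array with stationary weights."""
    n = len(w)
    m = len(x) - n + 1
    cell_w = list(reversed(w))     # cells hold w_n ... w_1, left to right
    x_reg = [None] * n             # x values move left to right
    y_reg = [None] * n             # (index, partial sum) pairs move right to left
    out = [None] * m
    for t in range(2 * len(x)):
        if y_reg[0] is not None:   # a finished y leaves the leftmost cell
            k, acc = y_reg[0]
            out[k] = acc
        x_reg = [None] + x_reg[:-1]
        y_reg = y_reg[1:] + [None]
        if t % 2 == 0:
            x_reg[0] = x[t // 2]   # a new x enters on every other step
        lag = t - (n - 1)
        if lag >= 0 and lag % 2 == 0 and lag // 2 < m:
            y_reg[n - 1] = (lag // 2, 0)   # a new y enters, initially zero
        for c in range(n):         # every cell works in the same step
            if x_reg[c] is not None and y_reg[c] is not None:
                k, acc = y_reg[c]
                y_reg[c] = (k, acc + cell_w[c] * x_reg[c])
    return out


x = [1, 2, 3, 4, 5, 6]
w = [1, 2, 3, 4]
print(convolve_direct(x, w))
print(convolve_systolic(x, w))
```

The gaps between stream items are what let each y meet exactly the n values of x it needs, one per cell, while it crosses the array.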

Example: Matrix Multiplication

• C = A × B,  c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}

    c11 c12 c13 c14
    c21 c22 c23 c24
    c31 c32 c33 c34
    c41 c42 c43 c44

                                   |    |    |   b44
                                   |    |   b43  b34
                                   |   b42  b33  b24
                                  b41  b32  b23  b14
                                  b31  b22  b13   |
                                  b21  b12   |    |
                                  b11   |    |    |

     -   -   -  a14 a13 a12 a11   c11  c12  c13  c14
     -   -  a24 a23 a22 a21  -    c21  c22  c23  c24
     -  a34 a33 a32 a31  -   -    c31  c32  c33  c34
    a44 a43 a42 a41  -   -   -    c41  c42  c43  c44

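
The skewed input streams can be modeled in a few lines of Python. This is a timing sketch under assumptions, not a register-level simulation: each cell (i, j) keeps c_ij in place, row i of A arrives i steps late, column j of B arrives j steps late, and the function name is hypothetical. Indices are 0-based.

```python
def systolic_matmul(a, b):
    """Accumulate C = A x B in an n x n grid of cells, one step at a time."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):             # the last operands meet at step 3n - 3
        for i in range(n):
            for j in range(n):
                k = t - i - j              # operand pair reaching cell (i, j) now
                if 0 <= k < n:
                    c[i][j] += a[i][k] * b[k][j]
    return c


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The skew is what makes a_ik and b_kj arrive at cell (i, j) in the same step; all n squared cells work in parallel, so the whole product takes 3n - 2 steps rather than n cubed multiply-adds in sequence.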

• Warp (CMU) - 1987

• linear array of 10 or more processing cells

• optimized inter-cell communication for low latency

• pipelined cells and communication

• conditional execution

• compiler partitions problem into cells and generates microcode

• i-Warp (Intel) - 1990

• successor to Warp

• two-dimensional array

• time-multiplexing of physical busses between cells

• 32x32 array has 20 Gflops peak performance