
# Multiplication - PowerPoint PPT Presentation


## PowerPoint Slideshow about 'Multiplication' - kiona

Multiplication
• Example

multiplicand:          1 1 0 0    12
multiplier:            0 1 0 1     5
                       1 1 0 0
                     0 0 0 0
                   1 1 0 0
                 0 0 0 0
                 0 1 1 1 1 0 0    60

4 partial products

repeat n times: compute partial product; shift; add

note: each bit of partial products is just an AND operation
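The note about AND operations can be illustrated with a small sketch. The function name `partial_products` and its parameters are illustrative choices, not part of the slides; it assumes unsigned n-bit operands.

```python
def partial_products(x, y, n):
    """Return the n partial products of y * x, each already shifted into place."""
    mask = (1 << n) - 1
    products = []
    for i in range(n):
        x_bit = (x >> i) & 1
        # AND every multiplicand bit with multiplier bit i
        row = y & (mask if x_bit else 0)
        products.append(row << i)
    return products


# The example from the slide: multiplicand 1100 (12), multiplier 0101 (5)
rows = partial_products(x=0b0101, y=0b1100, n=4)
print([format(r, "07b") for r in rows])
print(sum(rows))  # 60
```

Summing the four shifted rows gives the product, which is the "shift; add" part of the loop.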

CSE 567 - Autumn 1998 - Misc. Topics - 1

Sequential Multiplier

one bit of multiplier applied each cycle

z = 0;
repeat n times:
    if (x[0]) z = z + y;
    x = x >> 1; y = y << 1;

[Figure: datapath with multiplicand register y (0 shifted in), multiplier register x, and result register z]
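The loop above can be written as a short runnable sketch. The function name is hypothetical, and Python integers stand in for the y, x, and z registers (unsigned operands assumed).

```python
def sequential_multiply(x, y, n):
    """Shift-and-add multiply: one multiplier bit is applied per iteration."""
    z = 0
    for _ in range(n):
        if x & 1:      # x[0], the current multiplier bit
            z += y     # add the (shifted) multiplicand
        x >>= 1        # expose the next multiplier bit
        y <<= 1        # weight the multiplicand for that bit
    return z


print(sequential_multiply(x=5, y=12, n=4))  # 60
```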

Sequential Multiplier (cont’d)

one bit of multiplier applied each cycle

z = 0;
repeat n times:
    if (x[0]) z = z + y * 2^n;
    x = x >> 1; z = z >> 1;

[Figure: datapath with multiplicand register y, multiplier register x, and result register z]
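A sketch of this second variant, where the multiplicand stays in place and the result shifts right instead. The function name is illustrative and unsigned n-bit operands are assumed.

```python
def sequential_multiply_shift_right(x, y, n):
    """Add y into the upper half of z, then shift z right, once per multiplier bit."""
    z = 0
    for _ in range(n):
        if x & 1:
            z += y << n   # z = z + y * 2^n
        x >>= 1
        z >>= 1           # the bits shifted out here are always zero
    return z


print(sequential_multiply_shift_right(x=5, y=12, n=4))  # 60
```

After n iterations the term added for multiplier bit i has been shifted right n - i times, so it carries weight 2^i, as in the left-shifting version.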

Parallelism in hardware
• Fine-grained - bit level
• Pipelining
  • same number of functional units
  • different latency, but increased throughput
  • less work per clock cycle
• Coarse-grained - data-path level
  • e.g., multiple arithmetic units
  • multi-port register files (read/write from different sources/destinations)
• Processor level
  • difficult to take advantage of many levels of parallelism in fixed general-purpose processors
  • much easier when the processors are special-purpose, e.g., systolic computations

Bit level parallelism
• Exploit ability to do necessary bit-level computations directly
  • exploit redundant logic
  • goal - keep all circuits busy, reduce critical path
• Examples
  • multipliers

Combinational Multipliers
• Use AND gates to generate all partial products in parallel

[Figure: array of AND gates combining each multiplicand bit with each multiplier bit to form the partial products; LSBs marked]

Combinational Multipliers (cont'd)
• Skew array to send partial products along diagonal and make it square

[Figure: skewed partial-product array; LSBs marked]

Combinational Multipliers (cont'd)
• Ripple-carry adder in each row (carries ripple right to left)
• Sums ripple down (shifted one to right)

worst-case delay is 3n

[Figure: array of full-adder cells with inputs A, B, Cin and outputs S, Cout; LSBs marked]

Using Carry-Save
• Forward carries to next row of adders
• CLA (carry-lookahead adder) at the end to add last partial product and forwarded carries

no need to optimize carry more than sum

using CLA for final stage makes this faster than previous multiplier (worst-case is 2n)

[Figure: carry-save array of full-adder cells (inputs A, B, Cin; outputs S, Cout) feeding a final CLA; LSBs marked]
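The carry-save idea can be sketched at word level. This is a minimal model, not the gate-level array in the figure: the function names are illustrative, operands are assumed unsigned, and an ordinary integer addition stands in for the final CLA.

```python
def carry_save_add(a, b, c):
    """3:2 compressor on whole words: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    # carries are not propagated within the row; they are forwarded one column left
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def carry_save_multiply(x, y, n):
    """Accumulate partial products in carry-save form, then do one final add."""
    s, c = 0, 0
    for i in range(n):
        partial = (y << i) if (x >> i) & 1 else 0
        s, c = carry_save_add(s, c, partial)
    return s + c  # final carry-propagating addition (the CLA stage)


print(carry_save_add(5, 6, 7))             # (4, 14)
print(carry_save_multiply(5, 12, 4))       # 60
```

Each row costs one full-adder delay regardless of word width, which is why only the last stage needs a fast carry-propagating adder.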

Combinational Multipliers (cont'd)

[Figure: partial products, with rows labeled x1 and x2]

Wallace Tree Multiplier
• Use tree structure to reduce number of additions in critical path to O(log n) rather than O(n)
• Difficult structure to lay out and integrate with partial product crossbar
• Wiring constraints make it unattractive in many technologies

[Figure: partial products PP0 through PP8 combined by a tree of seven adders (+), followed by a CLA that produces Result]
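A rough word-level sketch of the tree reduction follows. It assumes 3:2 carry-save adders grouped greedily per layer, which is one common way to build such a tree and not necessarily the exact wiring in the figure; `wallace_reduce` and its return values are illustrative names.

```python
def carry_save_add(a, b, c):
    """3:2 compressor on whole words: a + b + c == sum_word + carry_word."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_reduce(operands):
    """Reduce operands to two words; return (s, c, levels, adders)."""
    operands = list(operands)
    levels = adders = 0
    while len(operands) > 2:
        next_layer = []
        full = len(operands) - len(operands) % 3
        for i in range(0, full, 3):          # all adders in a layer work in parallel
            next_layer.extend(carry_save_add(*operands[i:i + 3]))
            adders += 1
        next_layer.extend(operands[full:])   # leftovers pass to the next layer
        operands = next_layer
        levels += 1
    operands += [0] * (2 - len(operands))
    return operands[0], operands[1], levels, adders


x, y = 365, 403  # two 9-bit operands give nine partial products
partials = [(y << i) if (x >> i) & 1 else 0 for i in range(9)]
s, c, levels, adders = wallace_reduce(partials)
print(s + c == x * y, levels, adders)  # True 4 7
```

Nine partial products need seven adders either way, but the tree places only four of them on the critical path, compared with seven for a linear chain.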

Binary Tree Multipliers
• Problem with Wallace tree is 3:2 column reduction
  • need 2:1 reduction for binary tree
• One solution: signed-digit binary trees
  • represent digits as 0, 1, -1
  • similar to Booth's encoding

Signed-digit addition of one digit position (result written as carry digit, sum digit; x and y are taken here to be the digits in the next lower position):

   1 +  1  =   1  0
   1 + -1  =   0  0
  -1 + -1  =  -1  0
   0 +  0  =   0  0
   1 +  0  =   1 -1   if x>=0 and y>=0
               0  1   otherwise
  -1 +  0  =   0 -1   if x>=0 and y>=0
              -1  1   otherwise
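The addition rules can be checked with a short sketch. It assumes, as is standard for this scheme, that the x>=0 and y>=0 condition looks at the two digits one position lower; the digit-list representation (least significant digit first) and function names are illustrative.

```python
def sd_value(digits):
    """Value of a signed-digit number, least significant digit first."""
    return sum(d << i if d >= 0 else -((-d) << i) for i, d in enumerate(digits))


def sd_add(a, b):
    """Add two equal-length signed-digit numbers with digits in {-1, 0, 1}."""
    n = len(a)
    carry = [0] * (n + 1)   # carry[i] is the carry digit entering position i
    interim = [0] * n
    for i in range(n):
        total = a[i] + b[i]
        lower_nonneg = i == 0 or (a[i - 1] >= 0 and b[i - 1] >= 0)
        if total == 2:
            carry[i + 1], interim[i] = 1, 0
        elif total == -2:
            carry[i + 1], interim[i] = -1, 0
        elif total == 1:
            carry[i + 1], interim[i] = (1, -1) if lower_nonneg else (0, 1)
        elif total == -1:
            carry[i + 1], interim[i] = (0, -1) if lower_nonneg else (-1, 1)
        # total == 0 leaves carry and interim digit at 0
    # each result digit depends on two neighboring positions only: no carry chain
    return [interim[i] + carry[i] for i in range(n)] + [carry[n]]


print(sd_add([1, 1], [1, 0]))  # 3 + 1 -> [0, 0, 1], i.e. 4
```

Because the choice for a sum of +1 or -1 anticipates the sign of the incoming carry, every result digit stays in {-1, 0, 1} and the addition takes constant time, which is what allows a 2:1 binary tree.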

Booth's Algorithm
• Take care of (retire) more than one bit per shift operation
• Example: shift two bits at a time

multiplicand M:                  0 0 1 1 0 1     13
multiplier:                      1 1 1 0 1 0     –6

Booth recoding steps
one bit at a time:               0 0 –1 1 –1 0
two bits at a time:              0 –1 –2

          1 1 1 1 1 1 1 0 0 1 1 0      –2*M
          1 1 1 1 1 1 0 0 1 1          –1*M, shifted left two bits
          0 0 0 0 0 0 0 0               0*M, shifted left four bits
          1 1 1 1 1 0 1 1 0 0 1 0      –78

Booth recoding table (three adjacent multiplier bits select a multiple of M)

0 0 0    0*M
0 0 1    1*M
0 1 0    1*M
0 1 1    2*M
1 0 0   –2*M
1 0 1   –1*M
1 1 0   –1*M
1 1 1    0*M

must be able to add multiplicand times 0, –1, –2, 1, and 2
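The recoding table translates directly into a lookup. In this sketch the names `BOOTH_TABLE`, `booth_digits`, and `booth_multiply` are illustrative; the multiplier is assumed to fit in n-bit two's complement with n even.

```python
# (bit i+1, bit i, bit i-1) -> multiple of the multiplicand, as in the table
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(multiplier, n):
    """Recode an n-bit two's-complement multiplier into n/2 digits, LSB first."""
    bits = [(multiplier >> i) & 1 for i in range(n)]  # >> sign-extends in Python
    digits = []
    prev = 0  # implicit 0 to the right of the least significant bit
    for i in range(0, n, 2):
        digits.append(BOOTH_TABLE[(bits[i + 1], bits[i], prev)])
        prev = bits[i + 1]
    return digits


def booth_multiply(multiplicand, multiplier, n):
    """Sum the selected multiples, each weighted by 4**i (two bits per step)."""
    return sum(d * multiplicand * 4 ** i
               for i, d in enumerate(booth_digits(multiplier, n)))


print(booth_digits(-6, 6))        # [-2, -1, 0]
print(booth_multiply(13, -6, 6))  # -78
```

The digits come out least significant first, so `[-2, -1, 0]` matches the recoded row 0, –1, –2 in the example.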

Register Transfer
• Registers have input and output
  • output can be fanned out to many destinations
  • input can come from many sources
  • multiplexer needed on input to select which

[Figure: register with a multiplexer on its input; inputs from other registers, control signals to choose input source, outputs to other registers]


Connecting Registers
• Multiplexers: lots of control signals but full parallelism of transfers
• Busses

Pipelining
• Adding registers along a path
  • split combinational logic into multiple cycles
  • each cycle smaller than previously
  • $T_{\text{old}} C_{\text{old}} > T_{\text{new}} C_{\text{new}}$
  • increase throughput

Pipelining
• Delay, d, of slowest combinational stage determines performance
• Throughput = 1/d – rate at which outputs are produced
• Latency = n•d – number of stages * clock period
• Pipelining increases circuit utilization
• Registers slow down data, synchronize data paths
• Wave-pipelining
  • no pipeline registers - waves of data flow through circuit
  • relies on equal-delay circuit paths - no short paths
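The throughput and latency formulas can be tried on made-up numbers. Everything in this sketch is hypothetical: the function name, the stage delays, and the `register_overhead` parameter, which models the propagation and setup time each pipeline register adds.

```python
def pipeline_metrics(stage_delays, register_overhead=0.0):
    """Return (throughput, latency) using throughput = 1/d and latency = n*d."""
    d = max(stage_delays) + register_overhead  # slowest stage sets the clock
    n = len(stage_delays)
    return 1.0 / d, n * d


# One 24-unit block versus the same logic split into three 8-unit stages
print(pipeline_metrics([24.0]))
print(pipeline_metrics([8.0, 8.0, 8.0], register_overhead=1.0))
```

With these assumed numbers the clock period drops from 24 to 9, so throughput rises, while latency grows from 24 to 27 because of the register overhead.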

When and How to Pipeline?
• Where is the best place to add registers?
  • splitting combinational logic
  • overhead of registers (propagation delay and setup time requirements)
• What about cycles in data path?

Retiming
• Process of optimally distributing registers throughout a circuit
  • minimize the clock period
  • minimize the number of registers

Retiming (cont’d)
• Fast optimal algorithm (Leiserson & Saxe 1983)
• Retiming rules:
  • remove one register from each input and add one to each output
  • remove one register from each output and add one to each input

Optimal Pipelining
• Add registers - use retiming to find optimal location

[Figure: two versions of a circuit whose blocks have delays 10, 13, 7, 8, 6, 5]

Example - Digital Correlator
• $y_t = d(x_t, a_0) + d(x_{t-1}, a_1) + d(x_{t-2}, a_2) + d(x_{t-3}, a_3)$
• $d(x, a) = 0$ if $x \neq a$, 1 otherwise (and passes x along to the right)

[Figure: host feeding x_t into a chain of four comparators d with constants a0 to a3; three adders (+) combine the comparator outputs into y_t, which returns to the host]

Example - Digital Correlator (cont’d)
• Delays: adder, 7; comparator, 3; host, 0

[Figure: two versions of the correlator circuit (host, four comparators d, three adders +) with different register placement; one has cycle time = 24, the other cycle time = 13]
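The two cycle times can be reproduced with a small graph model. The figure itself is not preserved, so the wiring below is an assumption: a comparator chain with one register per hop and an adder chain with none, which is consistent with the stated delays and the 24 result. The vertex names, the `LAG` values, and the function names are all illustrative.

```python
from functools import lru_cache

DELAY = {"host": 0, "d0": 3, "d1": 3, "d2": 3, "d3": 3,
         "add0": 7, "add1": 7, "add2": 7}

# (source, destination): number of registers on that wire (assumed topology)
EDGES = {
    ("host", "d0"): 1, ("d0", "d1"): 1, ("d1", "d2"): 1, ("d2", "d3"): 1,
    ("d3", "add2"): 0, ("d2", "add2"): 0, ("add2", "add1"): 0,
    ("d1", "add1"): 0, ("add1", "add0"): 0, ("d0", "add0"): 0,
    ("add0", "host"): 0,
}

# A lag of -k moves k registers from a block's inputs to its outputs
LAG = {"host": 0, "d0": -1, "d1": -1, "d2": -2, "d3": -2,
       "add2": -2, "add1": -1, "add0": 0}


def retime(edges, lag):
    """Apply the retiming rules: each wire u -> v gains lag[v] - lag[u] registers."""
    new_edges = {(u, v): w + lag[v] - lag[u] for (u, v), w in edges.items()}
    assert all(w >= 0 for w in new_edges.values()), "negative register count"
    return new_edges


def clock_period(delay, edges):
    """Longest total delay along any path that crosses no register."""
    successors = {v: [] for v in delay}
    for (u, v), w in edges.items():
        if w == 0:
            successors[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        return delay[u] + max((longest_from(v) for v in successors[u]), default=0)

    return max(longest_from(v) for v in delay)


print(clock_period(DELAY, EDGES))               # 24
print(clock_period(DELAY, retime(EDGES, LAG)))  # 13
```

The lag values here were chosen by hand; finding them automatically is what the Leiserson and Saxe algorithm does, and that part is not shown.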

Pipelined Multipliers
• Pipelining can be applied to any of the combinational multipliers

FF at every intersection of pipe stage and wire

[Figure: tree of adders (+) and CLA blocks divided into pipeline stages]

Example - Sorting

[Figure: Comparator with inputs A and B and outputs H and L]

[Figure: Parallel Sorter]
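The sorter diagram is not preserved, so the following is only one possible arrangement of comparators. It assumes the comparator outputs H and L are the larger and smaller input, and it uses odd-even transposition sort as a stand-in network; the slides may show a different one.

```python
def comparator(a, b):
    """Return (H, L): the larger and the smaller of the two inputs."""
    return (a, b) if a >= b else (b, a)


def odd_even_transposition_sort(values):
    """n stages of neighboring comparators; each stage could run in parallel."""
    data = list(values)
    n = len(data)
    for stage in range(n):
        # even stages compare pairs (0,1), (2,3), ...; odd stages (1,2), (3,4), ...
        for i in range(stage % 2, n - 1, 2):
            high, low = comparator(data[i], data[i + 1])
            data[i], data[i + 1] = low, high
    return data


print(odd_even_transposition_sort([7, 3, 9, 1, 4]))  # [1, 3, 4, 7, 9]
```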


Example - Sorting (cont’d)
• Pipelined


Pipelined Sorter (cont’d)


Better Sorter


Sequential Sorter

Systolic Arrays
• Set of identical processing elements
  • specialized or programmable
• Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)
• SIMD-like
• Multiple data flows, converging to engage in computation

Analogy: data flowing through the system in a rhythmic fashion – from main memory through a series of processing elements and back to main memory

Example - Convolution
• $y_j = x_j w_1 + x_{j+1} w_2 + \dots + x_{j+n-1} w_n$

[Figure: linear array of cells holding w4, w3, w2, w1; the stream – x3 – x2 – x1 and the stream – – – y1 – y2 – y3 – pass through the array]

$y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4$

$y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_5 w_4$

$y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_6 w_4$

. . . .

Example - Convolution (cont’d)

[Figure: successive snapshots of the array w4 w3 w2 w1 as the stream x6 – x5 – x4 – x3 – x2 – x1 and the stream – – – y1 – y2 – y3 advance through it one step at a time]
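The data movement can be simulated cycle by cycle. The snapshots do not state the directions, so this sketch assumes the weights stay in their cells, the x stream moves left to right, the y stream moves right to left, and both streams carry one item every other cycle (the dashes). The function name and 0-based indexing are illustrative.

```python
def systolic_convolve(x, w):
    """Simulate a linear systolic array computing y[j] = sum(w[m] * x[j + m])."""
    n = len(w)
    num_outputs = len(x) - n + 1
    cell_w = w[::-1]        # cells left to right hold w[n-1], ..., w[0]
    x_reg = [None] * n      # x values, moving left to right
    y_reg = [None] * n      # (output index, partial sum), moving right to left
    results = [None] * num_outputs

    last_cycle = 2 * (num_outputs - 1) + 2 * (n - 1)
    for t in range(last_cycle + 1):
        # an x enters the leftmost cell on every other cycle
        new_x = x[t // 2] if t % 2 == 0 and t // 2 < len(x) else None
        x_reg = [new_x] + x_reg[:-1]
        # a y enters the rightmost cell, timed to meet x[j] at the cell holding w[0]
        j, odd = divmod(t - (n - 1), 2)
        new_y = (j, 0) if odd == 0 and 0 <= j < num_outputs else None
        y_reg = y_reg[1:] + [new_y]
        # every cell that holds both an x and a y does one multiply-accumulate
        for p in range(n):
            if x_reg[p] is not None and y_reg[p] is not None:
                idx, acc = y_reg[p]
                y_reg[p] = (idx, acc + cell_w[p] * x_reg[p])
        # a y that has just been updated in the leftmost cell is complete
        if y_reg[0] is not None:
            idx, acc = y_reg[0]
            results[idx] = acc
    return results


print(systolic_convolve([1, 2, 3, 4, 5, 6], [1, 2, 3, 4]))  # [30, 40, 50]
```

Each y meets a new x in every cell it visits because the two streams move in opposite directions with a one-cycle gap between items.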

Example: Matrix Multiplication
• $C = A \times B$, $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

[Figure: 4 × 4 array of cells, one per result element, c11 through c44]

Example: Matrix Multiplication

                               |    |    |   b44
                               |    |   b43  b34
                               |   b42  b33  b24
                              b41  b32  b23  b14
                              b31  b22  b13   |
                              b21  b12   |    |
                              b11   |    |    |

 –   –   –  a14 a13 a12 a11   c11  c12  c13  c14
 –   –  a24 a23 a22 a21  –    c21  c22  c23  c24
 –  a34 a33 a32 a31  –   –    c31  c32  c33  c34
a44 a43 a42 a41  –   –   –    c41  c42  c43  c44
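The skewed input pattern can be simulated with a small model. It assumes each cell keeps its own c value, rows of A enter from the left, columns of B enter from the top, and each row or column is delayed by one cycle relative to the previous one. Zeros stand in for the idle slots, and the function name is illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an n x n array where cell (i, j) accumulates c[i][j]."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # values of A moving left to right
    b_reg = [[0] * n for _ in range(n)]   # values of B moving top to bottom
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                     # row i is delayed by i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                     # column j is delayed by j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # every cell multiplies the pair passing through it and accumulates
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The skew makes a_ik and b_kj arrive at cell (i, j) in the same cycle, so the whole product finishes in 3n - 2 cycles.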

Systolic Computers
• Warp (CMU) - 1987
  • linear array of 10 or more processing cells
  • optimized inter-cell communication for low-latency
  • pipelined cells and communication
  • conditional execution
  • compiler partitions problem into cells and generates microcode
• iWarp (Intel) - 1990
  • successor to Warp
  • two-dimensional array
  • time-multiplexing of physical busses between cells
  • 32x32 array has 20 Gflops peak performance