Multiplication

  • Example

          1 1 0 0    multiplicand (12)
        × 0 1 0 1    multiplier (5)
          1 1 0 0
        0 0 0 0
      1 1 0 0
    0 0 0 0
    0 1 1 1 1 0 0    product (60)

4 partial products (one per multiplier bit)

repeat n times:

compute partial product; shift; add

note: each bit of a partial product is just an AND of one multiplicand bit and one multiplier bit
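To check the example above, here is a small Python sketch. It forms every partial-product bit as one AND, shifts row i left by i, and sums the rows. It assumes unsigned n-bit operands held in Python ints, and `partial_products` is a hypothetical helper, not something from the slides.

```python
# Illustrative sketch (assumption): unsigned n-bit operands as Python ints.
def partial_products(multiplicand, multiplier, n):
    """Return the n partial products, row i already shifted left by i."""
    rows = []
    for i in range(n):
        m_i = (multiplier >> i) & 1              # multiplier bit i
        row = 0
        for j in range(n):
            a_j = (multiplicand >> j) & 1        # multiplicand bit j
            row |= (a_j & m_i) << j              # one AND gate per partial-product bit
        rows.append(row << i)                    # shift row i left by i positions
    return rows


pps = partial_products(0b1100, 0b0101, 4)        # 12 x 5
print(pps, sum(pps))                             # [12, 0, 48, 0] 60
```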

CSE 567 - Autumn 1998 - Misc. Topics - 1


Sequential Multiplier

  • z = 0;

  • repeat n

    • if (x[0]) z = z + y;

    • x = x >> 1; y = y << 1;


one bit of multiplier applied each cycle

[Figure: multiplicand register y (shifted left) and multiplier register x (shifted right) feed a 2n-bit adder whose output is the result register z]
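The loop above can be modeled in a few lines of Python. This sketch assumes unsigned n-bit operands and uses masks to stand in for the 2n-bit widths of the y and z registers. The name `seq_multiply` is hypothetical.

```python
# Sketch (assumption): unsigned operands; masks model fixed 2n-bit registers.
def seq_multiply(x, y, n):
    """Shift-and-add multiply: x is the multiplier, y the multiplicand."""
    mask = (1 << (2 * n)) - 1      # z and y are 2n-bit registers
    z = 0
    for _ in range(n):             # one multiplier bit applied per cycle
        if x & 1:                  # x[0]
            z = (z + y) & mask     # the 2n-bit adder
        x >>= 1
        y = (y << 1) & mask
    return z


print(seq_multiply(0b0101, 0b1100, 4))   # 60
```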



Sequential Multiplier (cont'd)

  • z = 0;

  • repeat n

    • if (x[0]) z = z + (y << n);

    • x = x >> 1; z = z >> 1;


one bit of multiplier applied each cycle

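[Figure: multiplicand y stays fixed; multiplier x shifts right; an n-bit adder adds y into the upper half of the result register z, which also shifts right]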
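Here is the same loop with only an n-bit adder, again as a Python sketch with a hypothetical name (`seq_multiply_half`). Adding y << n never changes the low n bits of z, so an adder is needed only for the upper half. The one extra bit the sum can produce is absorbed by the right shift that follows.

```python
# Sketch (assumption): unsigned operands; Python ints stand in for the registers.
def seq_multiply_half(x, y, n):
    """Shift-and-add with an n-bit adder: y is added into z's upper half, then z shifts right."""
    z = 0                          # 2n-bit product register
    for _ in range(n):
        if x & 1:                  # x[0]
            z += y << n            # upper n bits + y; the carry-out lands in bit 2n
        x >>= 1
        z >>= 1                    # moves that carry-out back into bit 2n - 1
    return z


print(seq_multiply_half(0b0101, 0b1100, 4))   # 60
```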

Parallelism in hardware

  • Fine-grained - bit level

    • e.g., carry-select, carry-lookahead adder

  • Pipelining

    • same number of functional units

    • different latency, but increased throughput

    • less work per clock cycle

  • Coarse-grained - data-path level

    • e.g., multiple arithmetic units

    • multi-port register files (read/write from different sources/destinations)

  • Processor level

    • difficult to take advantage of many levels of parallelism in fixed general-purpose processors

    • much easier when the processors are special-purpose, e.g., systolic computations

Bit level parallelism

  • Exploit ability to do necessary bit-level computations directly

    • exploit redundant logic

    • goal - keep all circuits busy, reduce critical path

  • Examples

    • carry-lookahead adder

    • carry-select adder

    • multipliers
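As one concrete case of bit-level parallelism, here is a sketch of a carry-select adder. Each block computes its sum for carry-in 0 and for carry-in 1 at the same time, and the real incoming carry only drives a multiplexer. The Python loop runs sequentially, but in hardware the two speculative adds of every block happen in parallel and only the mux chain is serial. The block size and the names `ripple_add` and `carry_select_add` are assumptions for illustration.

```python
# Sketch (assumption): block-structured carry-select adder; names are hypothetical.
def ripple_add(a, b, cin, width):
    """Bit-by-bit model of a width-bit ripple-carry adder; returns (sum, carry_out)."""
    s, c = 0, cin
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | (ai & c) | (bi & c)
    return s, c


def carry_select_add(a, b, width, block):
    """Carry-select adder built from block-bit ripple adders (width must be a multiple of block)."""
    assert width % block == 0
    mask = (1 << block) - 1
    result, carry = 0, 0
    for lo in range(0, width, block):
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        sum0, c0 = ripple_add(a_blk, b_blk, 0, block)          # speculate carry-in = 0
        sum1, c1 = ripple_add(a_blk, b_blk, 1, block)          # speculate carry-in = 1
        blk_sum, carry = (sum1, c1) if carry else (sum0, c0)   # multiplexer
        result |= blk_sum << lo
    return result, carry


print(carry_select_add(200, 100, 8, 4))   # (44, 1): 300 = 256 + 44
```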



Combinational Multipliers

[Figure: example multiplicand (across the top) and multiplier (down the side), LSBs marked, with the partial-product bits formed by AND gates]

  • Use AND gates to generate all partial products in parallel



Combinational Multipliers (cont'd)

[Figure: the same partial-product array skewed onto a square grid, LSBs marked]

  • Skew array to send partial products along diagonal and make it square



Combinational Multipliers (cont'd)

[Figure: array of full adders (inputs A, B, Cin; outputs S, Cout), LSBs marked, unused inputs tied to 0]

  • Ripple-carry adder in each row (carries ripple right to left)

  • Sums ripple down (shifted one to right)

worst-case delay is 3n

Using Carry-Save

  • Forward carries to next row of adders

  • CLA at the end to add last partial product and forwarded carries

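[Figure: carry-save array of full adders (A, B, Cin; S, Cout), unused inputs tied to 0, with each carry passed down to the next row and a CLA in the final row]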

no need to optimize carry more than sum

using a CLA for the final stage makes this faster than the previous multiplier (worst case is 2n)

Combinational Multipliers (cont'd)

  • Carry-save adder is a 3-2 adder:

[Figure: three partial-product bits per column reduced to one output of weight ×1 (sum) and one of weight ×2 (carry)]
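Here is a bit-level sketch of the 3-2 adder, using Python ints as bit vectors. The names `full_adder` and `carry_save_add` are hypothetical. Each column's full adder turns three bits into a sum bit of weight ×1 and a carry bit of weight ×2, so three numbers become two without any carry moving sideways.

```python
# Sketch (assumption): Python ints as bit vectors; names are hypothetical.
def full_adder(a, b, c):
    """One 3:2 counter: three bits of weight 1 -> (sum bit, carry bit of weight 2)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def carry_save_add(x, y, z, width):
    """Reduce three width-bit numbers to a sum word and a carry word.

    Every column has its own full adder and no carry ripples along the row,
    so the delay is one full adder regardless of width.
    """
    sum_word, carry_word = 0, 0
    for i in range(width):
        s, c = full_adder((x >> i) & 1, (y >> i) & 1, (z >> i) & 1)
        sum_word |= s << i           # weight x1
        carry_word |= c << (i + 1)   # weight x2: belongs to the next column
    return sum_word, carry_word


s, c = carry_save_add(0b1011, 0b0110, 0b1101, 4)
print(s + c)                         # equals 11 + 6 + 13 = 30
```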



Wallace Tree Multiplier

[Figure: partial products PP0 to PP8 reduced by a tree of seven carry-save adders (+), followed by a CLA that produces the result]

  • Use tree structure to reduce number of additions in critical path to O(log n) rather than O(n)

  • Difficult structure to lay out and integrate with partial product crossbar

  • Wiring constraints make it unattractive in many technologies
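Here is a word-level sketch of the tree reduction, using Python ints as bit vectors. The names `csa` and `wallace_sum` are hypothetical. At each level, operands are grouped three at a time into carry-save adders until only two remain, and then a single carry-propagate add (the CLA in the figure) produces the result. For the nine partial products PP0 to PP8, it uses 4 CSA levels and 7 CSAs. A linear chain of the same 7 CSAs would be 7 levels deep.

```python
# Sketch (assumption): word-level CSAs on Python ints; names are hypothetical.
def csa(a, b, c):
    """3:2 carry-save adder on whole words: a + b + c == sum + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def wallace_sum(operands):
    """Reduce operands with levels of CSAs until two remain, then add once.

    Returns (total, csa_levels, csa_count).
    """
    ops = list(operands)
    levels = count = 0
    while len(ops) > 2:
        nxt = []
        full = len(ops) - len(ops) % 3          # operands that form complete groups of three
        for k in range(0, full, 3):             # all CSAs in one level work in parallel
            nxt.extend(csa(*ops[k:k + 3]))
            count += 1
        nxt.extend(ops[full:])                  # 0 to 2 leftovers pass straight down
        ops = nxt
        levels += 1
    return sum(ops), levels, count              # final carry-propagate add (the CLA)


x, y = 0b101101101, 37                          # hypothetical 9-bit multiplier and multiplicand
pps = [(y << i) if (x >> i) & 1 else 0 for i in range(9)]   # PP0 ... PP8
total, levels, adders = wallace_sum(pps)
print(total == x * y, levels, adders)           # True 4 7
```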

Binary Tree Multipliers

  • Problem with Wallace tree is 3:2 column reduction

    • need 2:1 reduction for binary tree

  • One solution: signed-digit binary trees

    • represent digits as 0, 1, -1

    • similar to Booth's encoding

Signed-digit addition rules (x, y: the two digits one position lower):

      digits      carry  sum
      1 + 0         1    -1    if x >= 0 and y >= 0
                    0     1    otherwise
      1 + 1         1     0
      1 + -1        0     0
      -1 + -1      -1     0
      0 + 0         0     0
      -1 + 0        0    -1    if x >= 0 and y >= 0
                   -1     1    otherwise
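One way to check the table is the Python sketch below. It adds two signed-digit numbers position by position, using only each position's digits and the pair one position lower. As a result, no carry travels more than one place, which is the 2:1 step a binary tree needs. It assumes digit lists stored least significant first. `sd_add` and `sd_value` are hypothetical helpers.

```python
# Sketch (assumption): digits in {-1, 0, 1}, lists stored LSB first.
def sd_add(a, b):
    """Carry-free signed-digit addition following the rule table above."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    carry = [0] * (n + 1)          # carry[i + 1] is produced at position i
    interim = [0] * n
    for i in range(n):
        x, y = (a[i - 1], b[i - 1]) if i > 0 else (0, 0)   # next lower digit pair
        lower_nonneg = x >= 0 and y >= 0
        t = a[i] + b[i]
        if t == 2:
            c, s = 1, 0
        elif t == -2:
            c, s = -1, 0
        elif t == 0:
            c, s = 0, 0
        elif t == 1:
            c, s = (1, -1) if lower_nonneg else (0, 1)
        else:                      # t == -1
            c, s = (0, -1) if lower_nonneg else (-1, 1)
        carry[i + 1], interim[i] = c, s
    # final digit = interim sum + carry from the position below; always in {-1, 0, 1}
    return [interim[i] + carry[i] for i in range(n)] + [carry[n]]


def sd_value(digits):
    """Integer value of an LSB-first signed-digit list."""
    return sum(d << i for i, d in enumerate(digits))


r = sd_add([1, -1, 0, 1], [1, 1, -1, 0])   # 7 + (-1)
print(r, sd_value(r))                        # [0, 1, 1, 0, 0] 6
```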

Booth's Algorithm

  • Take care of (retire) more than one bit per shift operation

  • Example: shift two bits at a time

Booth recoding steps:

                0 0 1 1 0 1      13 (multiplicand M)
              × 1 1 1 0 1 0      –6 (multiplier)
      1 1 1 1 1 1 0 0 1 1 0      –2·M
    1 1 1 1 1 1 0 0 1 1          –1·M, shifted 2
    0 0 0 0 0 0 0 0               0·M, shifted 4
      1 1 1 1 0 1 1 0 0 1 0      –78

multiplier recoded: 0 0 –1 1 –1 0 (one digit per bit), i.e., radix-4 digits 0 –1 –2

Booth recoding table:

  i+1  i  i-1    add
   0   0   0     0·M
   0   0   1     1·M
   0   1   0     1·M
   0   1   1     2·M
   1   0   0    –2·M
   1   0   1    –1·M
   1   1   0    –1·M
   1   1   1     0·M

must be able to add the multiplicand M times 0, –1, –2, 1, and 2
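Here is a small sketch of radix-4 Booth multiplication driven by the recoding table above. It assumes an even word size n and uses Python ints for the two's-complement values. The names `booth_radix4_digits` and `booth_multiply` are hypothetical. It reproduces the 13 × (−6) example: listed least significant first, the recoded digits are −2, −1, 0.

```python
# Sketch (assumption): n-bit two's-complement multiplier, n even; names are hypothetical.
# Recoding table from the slide: (x[i+1], x[i], x[i-1]) -> multiple of M to add
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4_digits(x, n):
    """Recode an n-bit two's-complement multiplier into n/2 digits, LSB first."""
    assert n % 2 == 0
    bits = [(x >> k) & 1 for k in range(n)]   # Python ints sign-extend, so negative x works
    digits = []
    for i in range(0, n, 2):
        prev = bits[i - 1] if i > 0 else 0    # x[-1] is an implicit 0
        digits.append(BOOTH_TABLE[(bits[i + 1], bits[i], prev)])
    return digits


def booth_multiply(m, x, n):
    """Add one multiple of m (0, +-1, +-2) per two multiplier bits, shifting 2 each time."""
    z = 0
    for k, d in enumerate(booth_radix4_digits(x, n)):
        z += (d * m) << (2 * k)
    return z


print(booth_radix4_digits(-6, 6))   # [-2, -1, 0]: read MSB first, 0 -1 -2
print(booth_multiply(13, -6, 6))    # -78
```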

Register Transfer

  • Registers have input and output

    • output can be fanned out to many destinations

    • input can come from many sources

      • multiplexer needed on input to select which source is loaded

[Figure: register with an input multiplexer; inputs come from other registers, control signals choose the input source, and the output goes to other registers]

Connecting Registers

  • Multiplexers: lots of control signals but full parallelism of transfers

  • Busses

Pipelining

  • Adding registers along a path

    • split combinational logic into multiple cycles

    • each cycle shorter than before

    • $T_{old} C_{old} > T_{new} C_{new}$

    • increase throughput

Pipelining

  • Delay, d, of slowest combinational stage determines performance

  • Throughput = 1/d: rate at which outputs are produced

  • Latency = n · d: number of stages × clock period

  • Pipelining increases circuit utilization

  • Registers slow down data, synchronize data paths

  • Wave-pipelining

    • no pipeline registers - waves of data flow through circuit

    • relies on equal-delay circuit paths - no short paths

When and How to Pipeline?

  • Where is the best place to add registers?

    • splitting combinational logic

    • overhead of registers (propagation delay and setup time requirements)

  • What about cycles in data path?

  • Example: 16-bit adder, add 8 bits in each of two cycles
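To see how register placement determines d (and with it throughput = 1/d and latency = n · d), here is a brute-force sketch. It tries every way to cut a purely linear chain of combinational blocks into a given number of stages and charges a fixed register overhead per stage. The chain, its delays, the overhead, and the name `best_pipeline` are all hypothetical and are not taken from the slides.

```python
from itertools import combinations


# Sketch (assumption): a linear chain of blocks and a fixed per-register overhead.
def best_pipeline(delays, stages, reg_overhead):
    """Cut a chain of combinational blocks into `stages` contiguous stages (1 <= stages <= len).

    Returns (clock_period, cuts); cuts are the block indices where a register
    is inserted. The period is the slowest stage plus register overhead
    (propagation delay + setup time).
    """
    n = len(delays)
    best = None
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0,) + cuts + (n,)
        slowest = max(sum(delays[lo:hi]) for lo, hi in zip(bounds, bounds[1:]))
        period = slowest + reg_overhead
        if best is None or period < best[0]:
            best = (period, cuts)
    return best


delays = [4, 6, 3, 5, 2]           # hypothetical block delays
for k in (1, 2, 3):
    d, cuts = best_pipeline(delays, k, reg_overhead=1)
    print(f"{k} stage(s): period {d}, throughput 1/{d}, latency {k * d}, cuts {cuts}")
```

With these made-up numbers, going from one stage to two reduces the period from 21 to 11, while latency rises only from 21 to 22. A third stage brings the period down to just 10, and latency climbs to 30.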

Retiming

  • Process of optimally distributing registers throughout a circuit

    • minimize the clock period

    • minimize the number of registers

Retiming (cont’d)

  • Fast optimal algorithm (Leiserson & Saxe 1983)

  • Retiming rules:

    • remove one register from each input and add one to each output

    • remove one register from each output and add one to each input
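The two rules above are the unit moves of the Leiserson & Saxe formulation. There, a retiming r assigns an integer to every vertex, and the edge u → v then carries w(u → v) + r(v) − r(u) registers. The sketch below does not implement their optimal algorithm. It only applies a given r, rejects r if any edge would need a negative register count, and measures the resulting clock period. The four-vertex loop, its delays, and the function names are hypothetical.

```python
# Sketch (assumption): simple directed graph, edges as {(u, v): register_count}.
def retime(edges, r):
    """Apply retiming r (vertex -> int): w_r(u -> v) = w(u -> v) + r(v) - r(u).

    r(v) = 1 removes one register from every output of v and adds one to
    every input (the second rule above).
    """
    new = {(u, v): w + r.get(v, 0) - r.get(u, 0) for (u, v), w in edges.items()}
    if any(w < 0 for w in new.values()):
        raise ValueError("illegal retiming: an edge would need a negative register count")
    return new


def clock_period(delay, edges):
    """Largest total delay along a register-free path (assumes every cycle has a register)."""
    zero_preds = {v: [] for v in delay}
    for (u, v), w in edges.items():
        if w == 0:
            zero_preds[v].append(u)
    memo = {}

    def arrival(v):  # delay of the longest register-free path ending at v
        if v not in memo:
            memo[v] = delay[v] + max((arrival(u) for u in zero_preds[v]), default=0)
        return memo[v]

    return max(arrival(v) for v in delay)


# Hypothetical loop host -> A -> B -> C -> host, with two registers on host -> A.
delay = {"host": 0, "A": 3, "B": 7, "C": 7}
edges = {("host", "A"): 2, ("A", "B"): 0, ("B", "C"): 0, ("C", "host"): 0}
print(clock_period(delay, edges))                    # 17: A -> B -> C -> host
retimed = retime(edges, {"C": 1, "host": 1})         # one register moves onto B -> C
print(clock_period(delay, retimed))                  # 10
```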



Optimal Pipelining

[Figure: example circuit with block delays 10, 13, 7, 8, 6, and 5, shown before and after registers are added]

  • Add registers - use retiming to find optimal location

Example - Digital Correlator

  • $y_t = d(x_t, a_0) + d(x_{t-1}, a_1) + d(x_{t-2}, a_2) + d(x_{t-3}, a_3)$

  • $d(x, a) = 0$ if $x \neq a$, 1 otherwise (and passes $x$ along to the right)

[Figure: the host sends x_t through four comparators d holding a0 to a3; three adders (+) sum the comparator outputs into y_t]
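Here is a behavioral sketch of the correlator. It reads the dropped comparison symbol as ≠, so d is 1 on a match, and it models the comparator chain that passes x to the right as a Python list. The name `correlator` and the test pattern are hypothetical. The sketch models only the function, not the adder and comparator delays.

```python
# Sketch (assumption): d(x, a) = 1 when x == a, else 0; the window models the comparator chain.
def correlator(stream, a):
    """Yield y_t = sum_i d(x_{t-i}, a_i) once every comparator holds a value."""
    window = [None] * len(a)        # window[i] holds x_{t-i}, the value in comparator i
    out = []
    for x in stream:
        window = [x] + window[:-1]  # each comparator passes its x one cell to the right
        if window[-1] is not None:  # all comparators hold data: y_t is valid
            out.append(sum(1 if xi == ai else 0 for xi, ai in zip(window, a)))
    return out


print(correlator([1, 1, 0, 1, 1, 0, 1], [1, 0, 1, 1]))   # [4, 2, 1, 4]
```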



Example - Digital Correlator (cont'd)

[Figure: the correlator (host, four comparators d, three adders +) before and after retiming]

  • Delays: adder, 7; comparator, 3; host, 0

cycle time = 24 (original)

cycle time = 13 (retimed)

Pipelined Multipliers

  • Pipelining can be applied to any of the combinational multipliers

[Figure: Wallace tree multiplier (carry-save adders +, final CLA) divided into pipeline stages]

FF at every intersection of a pipeline stage boundary and a wire



Example - Sorting

[Figure: comparator with inputs A and B and outputs H and L (higher and lower value); parallel sorter built from comparators]
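The sorter figures did not survive. As a stand-in, this sketch assumes an odd-even transposition network, which is one standard way to build a parallel sorter from the comparator above. It is not necessarily the network on the slides. The comparators in each stage touch disjoint pairs, so in hardware they all fire at once, and placing a register after each stage would pipeline the sorter. The function names are hypothetical.

```python
# Sketch (assumption): odd-even transposition network as a stand-in for the slides' sorter.
def comparator(a, b):
    """Compare-exchange cell: returns (H, L), the higher and lower input."""
    return (a, b) if a >= b else (b, a)


def odd_even_transposition_sort(values):
    """Sort descending using n stages of comparators on alternating neighbor pairs."""
    v = list(values)
    n = len(v)
    for stage in range(n):
        # comparators in one stage work on disjoint pairs and can run in parallel
        for i in range(stage % 2, n - 1, 2):
            v[i], v[i + 1] = comparator(v[i], v[i + 1])   # H to position i, L to i + 1
    return v


print(odd_even_transposition_sort([3, 7, 1, 9, 4]))   # [9, 7, 4, 3, 1]
```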

Example - Sorting (cont’d)

  • Pipelined

Pipelined Sorter (cont’d)

Better Sorter

Sequential Sorter

Systolic Arrays

  • Set of identical processing elements

    • specialized or programmable

  • Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)

  • SIMD-like

  • Multiple data flows, converging to engage in computation

Analogy: data flows through the system in a rhythmic fashion, from main memory through a series of processing elements and back to main memory

Example - Convolution

  • $y_j = x_j w_1 + x_{j+1} w_2 + \dots + x_{j+n-1} w_n$

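[Figure: linear array of cells holding w4, w3, w2, w1; the x stream (- x3 - x2 - x1) and the y stream (- - - y1 - y2 - y3 -) pass through it in opposite directions]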

$y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4$

$y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_5 w_4$

$y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_6 w_4$

…

Example - Convolution (cont’d)

[Figure: successive snapshots of the array; the weights w4 w3 w2 w1 stay in place while the x stream (x1 first) moves right and the y stream (y1 first) moves left, with one empty slot between items in each stream]
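Here is a cycle-by-cycle sketch of the array just described. The weights stay in their cells, x moves right, the partial y values move left, and each stream leaves one empty slot between items, so every y meets every x it needs in successive cells. It assumes 0-based Python lists (x[0] is x1, w[0] is w1), and `systolic_convolve` is a hypothetical name. The y stream starting n − 1 cycles after the x stream is a timing choice made here so that y_j meets x_j in the w1 cell.

```python
# Sketch (assumption): 0-based lists; the y stream starts n - 1 cycles after x.
def systolic_convolve(x, w):
    """Weights stay put, x moves right, partial y sums move left, one bubble between items."""
    n = len(w)
    m = len(x) - n + 1                          # number of outputs y_1 ... y_m
    cell_w = [w[n - 1 - c] for c in range(n)]   # leftmost cell holds w_n, rightmost w_1
    x_reg = [None] * n
    y_reg = [None] * n                          # each entry: [output index, running sum]
    results = {}
    t = 0
    while len(results) < m:
        # x shifts right; a new x enters on the left every other cycle
        incoming_x = x[t // 2] if t % 2 == 0 and t // 2 < len(x) else None
        x_reg = [incoming_x] + x_reg[:-1]
        # y shifts left; the value leaving the leftmost cell is finished
        if y_reg[0] is not None:
            j, total = y_reg[0]
            results[j] = total
        s = t - (n - 1)
        incoming_y = [s // 2, 0] if s >= 0 and s % 2 == 0 and s // 2 < m else None
        y_reg = y_reg[1:] + [incoming_y]
        # every cell holding both an x and a y does one multiply-accumulate
        for c in range(n):
            if x_reg[c] is not None and y_reg[c] is not None:
                y_reg[c][1] += cell_w[c] * x_reg[c]
        t += 1
    return [results[j] for j in range(m)]


print(systolic_convolve([1, 2, 3, 4, 5, 6], [1, 10, 100, 1000]))   # [4321, 5432, 6543]
```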



Example: Matrix Multiplication

    c11 c12 c13 c14
    c21 c22 c23 c24
    c31 c32 c33 c34
    c41 c42 c43 c44

  • $C = A \times B$, $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

Example: Matrix Multiplication

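B enters from the top (column j delayed by j - 1 cycles); A enters from the left (row i delayed by i - 1 cycles):

                               -   -   -  b44
                               -   -  b43 b34
                               -  b42 b33 b24
                              b41 b32 b23 b14
                              b31 b22 b13  -
                              b21 b12  -   -
                              b11  -   -   -
 -   -   -  a14 a13 a12 a11   c11 c12 c13 c14
 -   -  a24 a23 a22 a21  -    c21 c22 c23 c24
 -  a34 a33 a32 a31  -   -    c31 c32 c33 c34
a44 a43 a42 a41  -   -   -    c41 c42 c43 c44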
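Here is a cycle-level sketch of one way to read the layout above, as an output-stationary array. Values of a move right, values of b move down, and cell (i, j) accumulates c_ij. With 0-based indices, row i of A and column j of B enter i and j cycles late, matching the skew shown, so a_ik and b_kj meet in cell (i, j) at cycle i + j + k. The name `systolic_matmul` and the square n × n restriction are assumptions.

```python
# Sketch (assumption): square n x n matrices, 0-based indices, output-stationary cells.
def systolic_matmul(A, B):
    """Cycle-level model of an n x n systolic array computing C = A B."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[None] * n for _ in range(n)]   # a value currently in cell (i, j)
    b_reg = [[None] * n for _ in range(n)]   # b value currently in cell (i, j)
    for t in range(3 * n - 2):               # the last meeting happens at cycle 3n - 3
        for i in range(n):                   # shift a one cell right, feed column 0
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                        # row i is delayed by i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):                   # shift b one cell down, feed row 0
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                        # column j is delayed by j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < n else None
        for i in range(n):                   # every cell does one multiply-accumulate
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```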


Systolic Computers

  • Warp (CMU) - 1987

    • linear array of 10 or more processing cells

    • optimized inter-cell communication for low latency

    • pipelined cells and communication

    • conditional execution

    • compiler partitions problem into cells and generates microcode

  • i-Warp (Intel) - 1990

    • successor to Warp

    • two-dimensional array

    • time-multiplexing of physical busses between cells

    • 32×32 array has 20 Gflops peak performance
