
# Multiplication and Sum-of-Products Circuits



### Multiplication and Sum-of-Products Circuits: Giving Up Simplicity To Gain Speed

Steve Nuchia

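[Slides "In The Beginning", "With Log Table", "Strength In Numbers", "Partial Products" and "Accumulation": worked long-multiplication figures, not recoverable from the transcript. Surviving values: 13701, 095041, 091340, 0561741, 0456700.]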

Pairwise Summation

• The multiplication table is trivial (AND gate).

• No multi-digit entries in the table, so the partial products are well-formed numbers.

• Addition of binary numbers is hard:

• O(1+n/10) with linear hardware

• O(log n) with O(n² log n) hardware.

• Column counting is the accepted solution.

• Wallace Trees circa 1964.
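The first bullets can be illustrated with a minimal Python sketch. It assumes unsigned operands of a fixed width `n`; the function names `partial_products` and `column_counts` are hypothetical and not from the talk. Each partial-product row is the multiplicand ANDed with one multiplier bit, and counting the one bits in each column gives the per-column totals that a column-counting tree would have to reduce.

```python
def partial_products(a, b, n):
    """Return the n shifted partial-product rows of two n-bit unsigned integers."""
    rows = []
    for i in range(n):
        b_i = (b >> i) & 1
        # Each bit of the row is (a_j AND b_i), so the row is a or 0.
        row = a if b_i else 0
        rows.append(row << i)
    return rows


def column_counts(rows, width):
    """Count the one bits in each binary column across all rows."""
    return [sum((r >> k) & 1 for r in rows) for k in range(width)]


rows = partial_products(13, 11, 4)
counts = column_counts(rows, 8)
# The weighted column counts reproduce the product.
product = sum(c << k for k, c in enumerate(counts))
```

The column counts can exceed 1, which is exactly why they still need to be reduced to a single binary number.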

Continuing research in the area has led to steady improvement in the designs for Partial Product Reduction Trees (PPRTs) for parallel multipliers designs, as evidenced in the progression of work in [18], [2], [12], [10], [11], [6]. However, almost all of this prior work focused on finding good basic building blocks (compressors) that could be connected in a regular pattern to build a PPRT. ...

A compressor operates in a single column of the PPRT […] These compressors are made up of full adders that are interconnected in a way to minimize the compressor’s delay. In contrast, our approach is to design a faster PPRT by finding a globally optimal way of interconnecting the low-level components (adders).

The DSP Filter Setting

• Infinite series of data values arriving at a fixed rate.

• Compute the convolution with a specified vector, fast enough to keep up.

• Economic considerations often favor an FPGA (programmable gate array) solution.

• Linear algebra sum-of-products problems are more likely to a) be floating point and b) favor a software-intensive solution.
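As a software model of the setting described above, here is a minimal Python sketch of a streaming sum of products. It assumes integer samples, a fixed coefficient vector, and a zero initial history; `fir_stream` is a hypothetical name, and a hardware design would of course not be written this way.

```python
from collections import deque


def fir_stream(samples, coeffs):
    """Yield one sum of products per arriving sample (history starts at zero)."""
    window = deque([0] * len(coeffs), maxlen=len(coeffs))
    for x in samples:
        window.appendleft(x)  # newest sample first; the oldest one drops off
        yield sum(c * a for c, a in zip(coeffs, window))


outputs = list(fir_stream([1, 2, 3, 4], [1, 0, -1]))
```

Every arriving sample triggers one complete sum of products, which is the work the circuit has to finish within one sample period.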

[Figure: diagram with labels A_{t+2}, A_{t+1}, A_t, A_{t-1}, A_{t-2}, A_{t-3} and C_{-1}, C_0, C_1.]

Improving the Standard Circuit

• The final accumulator has to be fast enough. What if it isn’t?

• Idea: distribute the feedback through the PPRT.

• OK, How?

• Opportunistic feedback: whenever a full adder has fewer than three inputs, give it feedback.

• Problem: The Supermarket Separator.

• Solution starts with the generalized full adder.

Generalized Full Adder

• Inputs represent data and control information.

• Outputs represent the number of “effective” one bits among the inputs.

• Maps directly into a Xilinx FPGA logic cell (with maximum of four inputs).

[Figure: generalized full adder cell with inputs a, b, c, d and outputs C (carry) and S (sum).]
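The transcript does not say how the control information is encoded, so the following Python sketch is only one possible reading. It assumes that `a`, `b` and `c` are data bits and that `d` is a control bit that decides whether `c` counts; the function name is hypothetical. The outputs are the two-bit count of "effective" ones.

```python
def generalized_full_adder(a, b, c, d):
    """Return (C, S) for the count of effective one bits.

    Assumption (not from the talk): d gates c, so c only counts when d is 1.
    """
    count = a + b + (c & d)       # at most 3, so it fits in two bits
    return count >> 1, count & 1  # (carry, sum)
```

Under this assumption each output is a Boolean function of four inputs, which is consistent with the slide's remark about mapping onto a four-input logic cell.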

Supermarket Separator Problem

[Figure: generalized full adder cell (inputs a, b, c, d; outputs C, S) with example bit values and the labels k=1; k=0, t=1; k=q-1, t=0; k=q-2.]

Time Signatures

• To allow for feedback, we need to be able to do the bookkeeping.

• Zeros may appear on some wires as columns are reduced. To exploit this sparseness, we need to detect and manipulate it.

• Time signature algebra: associate a vector with each wire (or bus) giving the maximum arithmetic value carried on the wire in each clock period.
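A minimal Python sketch of this bookkeeping, assuming a time signature is stored as a list with one entry per clock period and that the signature of a bus is the element-wise sum of the signatures of its wires. The name `bus_signature` and the example inputs are illustrative only.

```python
def bus_signature(*wires):
    """Element-wise sum of wire signatures: the maximum value per clock period."""
    return [sum(slot) for slot in zip(*wires)]


# Three wires that are progressively sparser in later clock periods.
bus = bus_signature([1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0])
```

The zeros in the later slots are the sparseness the slide refers to: in those clock periods the bus has spare capacity.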

Time Signature Constraints

• The arithmetic contribution of a signal must be conserved.

• No non-zero contribution can cross over a supermarket barrier.

• Remark: Delaying a signal by one clock should be an identity in the algebra.

Signal Origination

At the top of the tree, the input data are assumed to have time signature 1111.

[Figure: generalized full adder cell (inputs a, b, c, d; outputs C, S). One input is marked "Control or N/C"; the input bit patterns shown are 0 0 0, 1 0 0, 1 1 0 and 1 1 1, and the resulting bus carries time signature 3,2,1,0.]

Pair Splitting

[Figure: generalized full adder cell (inputs a, b, c, d; outputs C, S). The bus signature 3,2,1,0 splits into the signatures 1,1,1,0 and 1,1,0,0.]

A wire can carry no more than a contribution of 1. The sum bit may be a one if the bus carries more than zero. The carry bit may be one if the bus carries more than one.

Note: the carry bit belongs to the next higher file.
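The two conditions above translate directly into a short Python sketch. It assumes a bus signature is a list of per-clock maxima; `pair_split` is a hypothetical name.

```python
def pair_split(bus):
    """Split a bus signature into (sum, carry) wire signatures.

    The sum bit may be 1 wherever the bus carries more than zero;
    the carry bit may be 1 wherever the bus carries more than one.
    """
    sum_ts = [1 if v > 0 else 0 for v in bus]
    carry_ts = [1 if v > 1 else 0 for v in bus]
    return sum_ts, carry_ts


sum_ts, carry_ts = pair_split([3, 2, 1, 0])
```

For the bus 3,2,1,0 this yields the two signatures shown on the slide; the carry signature applies to the next higher file.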

Wire Splitting

[Figure: a wire with signature 1,0,1,1 split into 1,0,G,1 and G,0,1,G.]

G is for Garbage. The information content (contribution) of the wire is split but the electrical signal is not altered.
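One way to check a proposed split is sketched below in Python. The per-slot rules are an assumption inferred from the slide's single example, not a statement of the author's algebra: a contribution of 1 must go to exactly one branch (the other sees garbage), a 0 stays 0 on both, and garbage stays garbage. The names are hypothetical.

```python
G = "G"  # marker for a garbage timeslot


def is_valid_wire_split(original, left, right):
    """Check that every contribution of the original lands on exactly one branch."""
    if not (len(original) == len(left) == len(right)):
        return False
    for o, l, r in zip(original, left, right):
        if o == 1:
            ok = (l, r) in ((1, G), (G, 1))
        elif o == 0:
            ok = (l, r) == (0, 0)
        else:  # original slot is already garbage
            ok = (l, r) == (G, G)
        if not ok:
            return False
    return True
```

A split that copies a contribution onto both branches is rejected, because it would count that contribution twice.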

Right-Shift Rule

A signal may be re-assigned to the next-lower file if it is doubled.

This is occasionally useful when a cell would otherwise be underutilized.
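As a hedged reading of this rule, assume a file is a binary column of weight $2^k$ and that "doubled" means the signal bit $b$ is counted twice in the lower column. The arithmetic contribution is then unchanged:

$$
b \cdot 2^{k} = b \cdot 2^{k-1} + b \cdot 2^{k-1} = 2\,\bigl(b \cdot 2^{k-1}\bigr)
$$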

Diagonal Shift Rule

As long as no contribution slides across a mod q barrier, signals can be reassigned to neighbors on the positive-slope diagonal. The TS is given relative to the rank r, so the TS vector must be “rotated” by the shift length s.

For example, with a shift of one, the signature t0 t1 t2 t3 becomes t3 t0 t1 t2, provided t3 = G or 0.
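The rotation and its side condition can be sketched in Python as follows. The sketch assumes that the entries wrapping around the end of the vector are the ones that would cross the barrier, so they must be 0 or G; `diagonal_shift` is a hypothetical name.

```python
G = "G"  # marker for a garbage timeslot


def diagonal_shift(ts, s=1):
    """Rotate a time signature right by s slots.

    Raises ValueError if a non-zero, non-garbage entry would wrap around,
    i.e. if a real contribution would slide across the barrier.
    """
    s %= len(ts)
    if s == 0:
        return list(ts)
    wrapped = ts[-s:]
    if any(v not in (0, G) for v in wrapped):
        raise ValueError("a contribution would cross the barrier")
    return list(wrapped) + list(ts[:-s])
```

For instance, `diagonal_shift([1, 1, 0, G])` gives `[G, 1, 1, 0]`, while `diagonal_shift([1, 1, 1, 1])` is rejected.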

Sink Rule

• When a signal contains only one active timeslot and that timeslot contains the sole representative of the lowest remaining column, that signal is sunk and is removed from consideration.

• Sunk signals may be stored for parallel output or may be consumed as soon as they are produced, depending on the application.

Gate Rule

Clock-period indicator signals are used to gate out garbage.

[Figure: generalized full adder cell (inputs a, b, c, d; outputs C, S) with indicator labels k=2 and k=1, garbage (G) entries on the inputs, and the signatures 1,0,1,1 and 0,0,0,0.]

Design Generation

• Currently, I have a Prolog program with constraint propagation extensions that knows the algebra. It does not yet successfully generate designs.

• The general strategy is to generate designs rank by rank, under iterative deepening, until a successful (valid and complete) design is found.
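The author's generator is a Prolog program with constraint propagation; the Python below is only a generic sketch of iterative deepening, not a reconstruction of that program. The toy problem is an assumption made for illustration: a single column of height `h`, where one rank may place between 1 and `h // 3` full adders, each replacing three bits with one (carries are ignored). All names are hypothetical.

```python
def depth_limited(state, expand, is_goal, limit):
    """Depth-first search that gives up below the given depth limit."""
    if is_goal(state):
        return [state]
    if limit == 0:
        return None
    for nxt in expand(state):
        path = depth_limited(nxt, expand, is_goal, limit - 1)
        if path is not None:
            return [state] + path
    return None


def iterative_deepening(start, expand, is_goal, max_depth):
    """Retry with limits 0, 1, 2, ... so the first design found uses the fewest ranks."""
    for limit in range(max_depth + 1):
        path = depth_limited(start, expand, is_goal, limit)
        if path is not None:
            return path
    return None


def expand_column(h):
    """Toy successor function: use j full adders, each turning 3 bits into 1."""
    for j in range(1, h // 3 + 1):
        yield h - 2 * j


path = iterative_deepening(9, expand_column, lambda h: h <= 2, max_depth=5)
```

Here `path` lists the column height after each rank; its length minus one is the number of ranks used.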

Generation, Continued

• Once a valid design is found, its cost will be used as an upper bound for an exhaustive search for better designs.

• Efficiently generating candidate designs with feedback is a chicken-and-egg problem. I am using a “suspense list” of inputs not yet connected to outputs to handle this problem.

Generation, Continued

• The routines that implement the TS algebra have to “wire up” the TS rules without knowing the TS of the feedback inputs. Tricky coding problem, but under control.

• The end game is not yet well understood. That needs more study.

• I hope to be generating real designs soon, and to have some idea what an optimal design might look like in January.

Sign Handling

• We haven’t talked about signed numbers. Signed data can be handled rather easily by this circuit, but signed coefficients require some thought.

• The standard circuit sign-extends the partial product terms in the feedback path. To do that, you have to know the sign bit’s value!

• I have a solution: next seminar.

Conclusions

• Inventing an appropriate algebra helped me to formulate the optimization problem for software solution and gives me confidence that the resulting designs are correct.

• Optimality, of course, is a different problem.

• The range of applicability of this circuit is not very broad: it is best suited for FPGA realization near the maximum clock speed of the logic family.