
Multiplication and Sum-of-Products Circuits:



Giving Up Simplicity To Gain Speed

Steve Nuchia


Partial Products


Accumulation

[Figure: worked numeric example of accumulating partial products.]

Pairwise Summation

Column Counting

- The multiplication table is trivial (AND gate).
- No multi-digit entries in the table, so the partial products are well-formed numbers.
- Addition of binary numbers is hard:
  - O(1+n/10) with linear hardware
  - O(log n) with O(n^2 log n) hardware.

- Column counting is the accepted solution.
- Wallace Trees circa 1964.
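The column-counting idea can be sketched in a few lines of Python. This is a minimal illustration that uses only plain full adders (3:2 counters) applied rank by rank, in the spirit of a Wallace tree; it is not the design developed in this talk, and the function names `partial_product_columns`, `reduce_columns`, and `multiply` are hypothetical.

```python
def partial_product_columns(a, b, n):
    """Return the n-by-n partial-product bits grouped by column weight 2**k."""
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            # The binary "multiplication table" is a single AND gate.
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return cols


def full_adder(x, y, z):
    """Count three bits of equal weight: x + y + z == s + 2 * c."""
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    return s, c


def reduce_columns(cols):
    """Apply ranks of full adders until no column holds more than two bits."""
    ranks = 0
    while any(len(col) > 2 for col in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for k, col in enumerate(cols):
            i = 0
            while len(col) - i >= 3:
                s, c = full_adder(col[i], col[i + 1], col[i + 2])
                nxt[k].append(s)      # sum stays in this column
                nxt[k + 1].append(c)  # carry moves to the next higher column
                i += 3
            nxt[k].extend(col[i:])    # leftover bits pass through unchanged
        cols = nxt
        ranks += 1
    return cols, ranks


def multiply(a, b, n):
    """Multiply two n-bit unsigned numbers by column reduction."""
    cols, ranks = reduce_columns(partial_product_columns(a, b, n))
    # Stand-in for the final carry-propagate adder over the two remaining rows.
    product = sum(bit << k for k, col in enumerate(cols) for bit in col)
    return product, ranks
```

Each full adder removes one bit from the array while conserving its arithmetic value, so the loop always terminates; only the last step needs a carry-propagating addition.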

Continuing research in the area has led to steady improvement in the designs for Partial Product Reduction Trees (PPRTs) for parallel multipliers designs, as evidenced in the progression of work in [18], [2], [12], [10], [11], [6]. However, almost all of this prior work focused on finding good basic building blocks (compressors) that could be connected in a regular pattern to build a PPRT. ...

A compressor operates in a single column of the PPRT […] These compressors are made up of full adders that are interconnected in a way to minimize the compressor’s delay. In contrast, our approach is to design a faster PPRT by finding a globally optimal way of interconnecting the low-level components (adders).

- Infinite series of data values arriving at a fixed rate.
- Compute the convolution with a specified vector, fast enough to keep up.
- Economic considerations often favor an FPGA (programmable gate array) solution.
- Linear algebra sum-of-products problems are more likely to a) be floating point and b) favor a software-intensive solution.
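As a software reference for the streaming problem described above, the following sketch computes one sum of products per arriving sample. It only models the arithmetic, not the timing or the hardware; the name `stream_convolve` and the choice to pad the initial window with zeros are assumptions made for illustration.

```python
from collections import deque


def stream_convolve(samples, coeffs):
    """Yield y[t] = sum over k of coeffs[k] * x[t - k] for each arriving sample."""
    # Samples before the start of the stream are assumed to be zero.
    window = deque([0] * len(coeffs), maxlen=len(coeffs))
    for x in samples:
        window.appendleft(x)  # newest sample first; the oldest one drops off
        yield sum(c * a for c, a in zip(coeffs, window))
```

For example, `list(stream_convolve([1, 2, 3, 4], [1, 1, 1]))` produces a running three-sample sum. A hardware version has to finish every such sum of products within one sample period.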

[Figure: data samples A[t+2], A[t+1], A[t], A[t-1], A[t-2], A[t-3] and coefficients C[-1], C[0], C[1].]

- The final accumulator has to be fast enough. What if it isn’t?
- Idea: distribute the feedback through the PPRT.
- OK, How?
- Opportunistic feedback: whenever a full adder has fewer than three inputs, give it feedback.
- Problem: The Supermarket Separator.
- Solution starts with the generalized full adder.

- Inputs represent data and control information.
- Outputs represent the number of “effective” one bits among the inputs.
- Maps directly into a Xilinx FPGA logic cell (with maximum of four inputs).

[Figure: the generalized full adder cell, with inputs a, b, c, d and outputs C and S, shown for the cases k=1; k=0, t=1; k=q-1, t=0; and k=q-2.]

- To allow for feedback, need to be able to do the bookkeeping.
- Zeros may appear on some wires as columns are reduced. To exploit this sparseness, we need to detect and manipulate it.
- Time signature algebra: associate a vector with each wire (or bus) giving the maximum arithmetic value carried on the wire in each clock period.

- The arithmetic contribution of a signal must be conserved.
- No non-zero contribution can cross over a supermarket barrier.
- Remark: Delaying a signal by one clock should be an identity in the algebra.

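At the top of the tree, the input data are assumed to have time signature 1111.

[Figure: generalized full adder with inputs a, b, c, d (one marked "Control or N/C") and outputs C and S. An input bus with time signature 3,2,1,0 gives S the time signature 1,1,1,0 and C the time signature 1,1,0,0.]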

A wire can carry no more than a contribution of 1. The sum bit may be a one if the bus carries more than zero. The carry bit may be one if the bus carries more than one.

Note: the carry bit belongs to the next higher file.
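The sum and carry rule can be written down directly. The sketch below is one reading of the rule, treating a time signature as a list of per-clock-period upper bounds; the function name `adder_output_signatures` is hypothetical.

```python
def adder_output_signatures(input_signatures):
    """Return (bus, sum_ts, carry_ts) for wires feeding one adder cell.

    Each input signature lists the maximum contribution (0 or 1) that the
    wire can carry in each clock period.
    """
    # The bus value in a period is the total of the wires' maxima.
    bus = [sum(slot) for slot in zip(*input_signatures)]
    # The sum bit may be one if the bus carries more than zero.
    sum_ts = [1 if v > 0 else 0 for v in bus]
    # The carry bit may be one if the bus carries more than one;
    # it belongs to the next higher file.
    carry_ts = [1 if v > 1 else 0 for v in bus]
    return bus, sum_ts, carry_ts
```

Feeding it the wires `[1, 1, 1, 0]`, `[1, 1, 0, 0]`, and `[1, 0, 0, 0]` gives a bus of 3,2,1,0 with a sum signature of 1,1,1,0 and a carry signature of 1,1,0,0.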

Splitting: 1,0,1,1 becomes 1,0,G,1 and G,0,1,G.

G is for Garbage. The information content (contribution) of the wire is split but the electrical signal is not altered.
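One way to check a split mechanically is to require that, in every timeslot, the non-garbage entries of the parts add up to the original entry. This is an assumed reading of the conservation rule, not something the slides state in this form; `G` and `is_valid_split` are hypothetical names.

```python
G = "G"  # marker for a timeslot whose value is garbage on this branch


def is_valid_split(original, parts):
    """Return True if the parts conserve the original contribution per timeslot."""
    if any(len(p) != len(original) for p in parts):
        return False
    for slot, value in enumerate(original):
        # Garbage entries claim no contribution in that timeslot.
        claimed = sum(p[slot] for p in parts if p[slot] != G)
        if claimed != value:
            return False
    return True
```

Under this reading, splitting 1,0,1,1 into 1,0,G,1 and G,0,1,G is valid, while splitting it into 1,0,1,1 and G,0,1,G is not, because two branches would then claim the same contribution.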

A signal may be re-assigned to the next-lower file if it is doubled. This is occasionally useful when a cell would otherwise be underutilized.

As long as no contribution slides across a mod q barrier, signals can be reassigned to neighbors on the positive-slope diagonal. The TS is given relative to the rank r, so the TS vector must be “rotated” by the shift length s.

(t0, t1, t2, t3) becomes (t3, t0, t1, t2), if t3 = G or 0.
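The rotation can be sketched as follows. The slide shows only a shift of one slot; extending the same check to a general shift length s (every entry that wraps around must be G or 0) is an assumption here, and `rotate_signature` is a hypothetical name.

```python
G = "G"  # marker for a garbage timeslot


def rotate_signature(ts, s=1):
    """Rotate a time signature right by s slots.

    Raises ValueError if a non-zero contribution would wrap around,
    i.e. slide across the mod q barrier.
    """
    if not 0 <= s <= len(ts):
        raise ValueError("shift length out of range")
    if s == 0:
        return list(ts)
    wrapped = list(ts[-s:])
    if any(v != 0 and v != G for v in wrapped):
        raise ValueError("a contribution would cross the barrier")
    return wrapped + list(ts[:-s])
```

With s=1 this reproduces the rule shown on the slide: the last entry moves to the front, and the move is refused unless that entry is G or 0.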

- When a signal contains only one active timeslot and that timeslot contains the sole representative of the lowest remaining column, that signal is sunk and is removed from consideration.
- Sunk signals may be stored for parallel output or may be consumed as soon as they are produced, depending on the application.

Clock-period indicator signals are used to gate out garbage in the generalized full adder.

[Figure: generalized full adder with inputs a, b, c, d and outputs C and S; labels include k=2, k=1, and the time signatures 1,0,1,1 and 0,0,0,0.]

- Currently, I have a Prolog program with constraint propagation extensions that knows the algebra. It does not yet successfully generate designs.
- The general strategy is to generate designs rank by rank, under iterative deepening, until a successful (valid and complete) design is found.

- Once a valid design is found, its cost will be used as an upper bound for an exhaustive search for better designs.
- Efficiently generating candidate designs with feedback is a chicken-and-egg problem. I am using a “suspense list” of inputs not yet connected to outputs to handle this problem.

- The routines that implement the TS algebra have to “wire up” the TS rules without knowing the TS of the feedback inputs. Tricky coding problem, but under control.
- The end game is not yet well understood. That needs more study.
- I hope to be generating real designs soon, and to have some idea what an optimal design might look like in January.

- We haven’t talked about signed numbers. Signed data can be handled rather easily by this circuit, but signed coefficients require some thought.
- The standard circuit sign-extends the partial product terms in the feedback path. To do that, you have to know the sign bit’s value!
- I have a solution: next seminar.

- Inventing an appropriate algebra helped me to formulate the optimization problem for software solution and gives me confidence that the resulting designs are correct.
- Optimality, of course, is a different problem.
- The range of applicability of this circuit is not very broad: it is best suited for FPGA realization near the maximum clock speed of the logic family.