Multiplication and Sum-of-Products Circuits:

Multiplication and Sum-of-Products Circuits: Giving Up Simplicity To Gain Speed
Steve Nuchia

In The Beginning

With Log Table

Strength In Numbers

Partial Products

Accumulation

Pairwise Summation
[Figure: worked example with the rows 13701, 095041, 091340, 0561741, 0456700]

Column Counting

Binary Multiplication
• The multiplication table is trivial (AND gate).
• No multi-digit entries in the table, so the partial products are well-formed numbers.
• Addition of binary numbers is hard:
  • O(1+n/10) with linear hardware
  • O(log n) with O(n^2 log n) hardware.
• Column counting is the accepted solution.
• Wallace Trees circa 1964.
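The points above can be illustrated with a minimal Python sketch. It assumes unsigned n-bit operands and uses plain 3:2 counters (full adders) for the column reduction; the function names are hypothetical, and the final Python addition stands in for the fast carry-propagate adder. It is not a Wallace or Oklobdzija-Stelling tree, only the column-counting idea.

```python
def partial_product_columns(x, y, n):
    """columns[k] holds the 1-bit partial products of weight 2**k."""
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            # The whole binary "multiplication table" is a single AND gate.
            bit = ((x >> i) & 1) & ((y >> j) & 1)
            columns[i + j].append(bit)
    return columns


def full_adder(a, b, c):
    """3:2 counter: count the ones among three bits as (sum, carry)."""
    total = a + b + c
    return total & 1, total >> 1


def reduce_columns(columns):
    """Apply 3:2 counters until no column holds more than two bits."""
    columns = [list(col) for col in columns]
    while any(len(col) > 2 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            while len(col) >= 3:
                s, c = full_adder(col.pop(), col.pop(), col.pop())
                nxt[k].append(s)      # sum stays in this column
                nxt[k + 1].append(c)  # carry moves one column up
            nxt[k].extend(col)        # leftover bits pass through
        columns = nxt
    return columns


def multiply(x, y, n):
    cols = reduce_columns(partial_product_columns(x, y, n))
    # Stand-in for the final two-row carry-propagate addition.
    return sum(bit << k for k, col in enumerate(cols) for bit in col)
```

Each full adder conserves the arithmetic value of its column, so the reduced rows always sum to the product; only the depth of the reduction, which this sketch does not try to minimize, affects speed.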

Oklobdzija & Stelling 1998 Continuing research in the area has led to steady improvement in the designs for Partial Product Reduction Trees (PPRTs) for parallel multipliers designs, as evidenced in the progression of work in [18], [2], [12], [10], [11], [6]. However, almost all of this prior work focused on finding good basic building blocks (compressors) that could be connected in a regular pattern to build a PPRT. ...

A compressor operates in a single column of the PPRT […] These compressors are made up of full adders that are interconnected in a way to minimize the compressor’s delay. In contrast, our approach is to design a faster PPRT by finding a globally optimal way of interconnecting the low-level components (adders).

The DSP Filter Setting
• Infinite series of data values arriving at a fixed rate.
• Compute the convolution with a specified vector, fast enough to keep up.
• Economic considerations often favor an FPGA (programmable gate array) solution.
• Linear algebra sum-of-products problems are more likely to a) be floating point and b) favor a software-intensive solution.
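As a software reference for what the circuit must compute each clock period, here is a minimal Python sketch of a streaming sum of products. It assumes integer samples, a zero initial history, and coefficients indexed from the newest sample backwards; `fir_stream` is a hypothetical name, not something from the talk.

```python
from collections import deque


def fir_stream(samples, coeffs):
    """Yield one sum of products per arriving sample."""
    # Sliding window of the most recent len(coeffs) samples, newest first.
    window = deque([0] * len(coeffs), maxlen=len(coeffs))
    for a in samples:
        window.appendleft(a)  # the oldest sample falls off the right end
        yield sum(c * x for c, x in zip(coeffs, window))


outputs = list(fir_stream([1, 2, 3, 4], [1, 1, 1]))
```

In hardware, every one of these products contributes partial-product bits to a single reduction tree, and the tree has to finish before the next sample arrives.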

[Figure: data samples A(t+2), A(t+1), A(t), A(t-1), A(t-2), A(t-3) sliding past coefficients C(-1), C(0), C(1)]

Improving the Standard Circuit
• The final accumulator has to be fast enough. What if it isn’t?
• Idea: distribute the feedback through the PPRT.
• OK, how?
• Opportunistic feedback: whenever a full adder has fewer than three inputs, give it feedback.
• Problem: The Supermarket Separator.
• Solution starts with the generalized full adder.

Generalized Full Adder
• Inputs represent data and control information.
• Outputs represent the number of “effective” one bits among the inputs.
• Maps directly into a Xilinx FPGA logic cell (with maximum of four inputs).
[Figure: cell with inputs a, b, c, d and outputs C, S]
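One possible reading of the cell, sketched in Python. The role assignment here is an assumption made for illustration: a and b are data, c is data that only counts when the control input d is high. The slides do not fix which inputs are data and which are control, and other assignments fit the same four-input cell.

```python
from itertools import product


def generalized_full_adder(a, b, c, d):
    """Return (C, S), the 2-bit count of 'effective' one bits.

    Assumed roles (hypothetical): a, b are data; c is data gated by control d.
    """
    effective = a + b + (c & d)
    return effective >> 1, effective & 1


# Each output is a function of four inputs, so it fits a 16-entry lookup table.
LUT_C = {bits: generalized_full_adder(*bits)[0] for bits in product((0, 1), repeat=4)}
LUT_S = {bits: generalized_full_adder(*bits)[1] for bits in product((0, 1), repeat=4)}
```

With d held at 1 the cell behaves as an ordinary full adder on a, b, c; with d at 0 the c input is ignored.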

Supermarket Separator Problem
[Figure: bit patterns entering a generalized full adder (inputs a, b, c, d; outputs C, S), with rows labelled k=1; k=0, t=1; k=q-1, t=0; k=q-2]

Time Signatures
• To allow for feedback, need to be able to do the bookkeeping.
• Zeros may appear on some wires as columns are reduced. To exploit this sparseness, we need to detect and manipulate it.
• Time signature algebra: associate a vector with each wire (or bus) giving the maximum arithmetic value carried on the wire in each clock period.

Time Signature Constraints
• The arithmetic contribution of a signal must be conserved.
• No non-zero contribution can cross over a supermarket barrier.
• Remark: Delaying a signal by one clock should be an identity in the algebra.

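Signal Origination
At the top of the tree, the input data are assumed to have time signature 1111.
[Figure: generalized full adder with inputs a, b, c, d and outputs C, S; input patterns 0 0 0 (control or N/C), 1 0 0, 1 1 0, 1 1 1; output bus time signature 3,2,1,0]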
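A minimal Python sketch of how a bus signature such as 3,2,1,0 could arise, assuming the bus signature is the slot-by-slot sum of the cell's input signatures. The per-input signatures used here are hypothetical values chosen to reproduce the 3,2,1,0 in the figure; they are not read off the slide.

```python
def bus_signature(*inputs):
    """Slot-by-slot sum of the time signatures feeding one cell."""
    return tuple(sum(slot) for slot in zip(*inputs))


# Hypothetical input signatures, one entry per clock period.
a = (1, 1, 1, 0)
b = (1, 1, 0, 0)
c = (1, 0, 0, 0)
d = (0, 0, 0, 0)  # control or not connected: carries no arithmetic value

bus = bus_signature(a, b, c, d)
```

Three full-rate inputs of signature 1111 would give 3,3,3,3 under the same assumption, which is the ordinary full-adder case.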

Pair Splitting
A wire can carry no more than a contribution of 1. The sum bit may be a one if the bus carries more than zero. The carry bit may be one if the bus carries more than one. Note: the carry bit belongs to the next higher file.
[Figure: cell with inputs a, b, c, d and outputs C, S; bus 3,2,1,0 splits into sum 1,1,1,0 and carry 1,1,0,0]
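The pair-splitting rule can be written as a small Python function. This is a sketch of the rule as stated on the slide, with a hypothetical function name; it reproduces the 3,2,1,0 example from the figure.

```python
def split_pair(bus):
    """Split a bus time signature into (sum, carry) wire signatures."""
    # The sum bit may be one whenever the bus can carry more than zero.
    sum_sig = tuple(1 if v > 0 else 0 for v in bus)
    # The carry bit may be one whenever the bus can carry more than one.
    carry_sig = tuple(1 if v > 1 else 0 for v in bus)
    return sum_sig, carry_sig


sum_sig, carry_sig = split_pair((3, 2, 1, 0))
```

The carry signature belongs to the next higher file, so its entries weigh twice as much as those of the sum wire.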

Wire Splitting
G is for Garbage. The information content (contribution) of the wire is split but the electrical signal is not altered.
[Figure: wire 1,0,1,1 splits into 1,0,G,1 and G,0,1,G]
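A minimal Python sketch of wire splitting, assuming each non-zero timeslot is claimed by exactly one of the two branches and shows up as G on the other. The `keep_first` mask and the function names are hypothetical; the example reproduces the 1,0,1,1 case from the figure.

```python
G = "G"  # garbage: the wire toggles, but this branch does not count it


def split_wire(sig, keep_first):
    """Split one signature into two that share the same electrical wire."""
    first, second = [], []
    for value, keep in zip(sig, keep_first):
        if value == 0:        # a known zero is a zero on both branches
            first.append(0)
            second.append(0)
        elif keep:            # branch one claims this timeslot
            first.append(value)
            second.append(G)
        else:                 # branch two claims this timeslot
            first.append(G)
            second.append(value)
    return tuple(first), tuple(second)


def contribution(sig):
    """Arithmetic value per timeslot, counting garbage as zero."""
    return tuple(0 if v == G else v for v in sig)


left, right = split_wire((1, 0, 1, 1), (True, True, False, True))
```

Adding `contribution(left)` and `contribution(right)` slot by slot gives back the original signature, which is the conservation constraint.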

Right-Shift Rule A signal may be re-assigned to the next-lower file if it is doubled. This is occasionally useful when a cell would otherwise be underutilized.

Diagonal Shift Rule
As long as no contribution slides across a mod q barrier, signals can be reassigned to neighbors on the positive-slope diagonal. The TS is given relative to the rank r, so the TS vector must be “rotated” by the shift length s.
[Figure: (t0, t1, t2, t3) becomes (t3, t0, t1, t2), if t3 = G or 0]
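The rotation can be sketched in Python as follows. The one-step case follows the figure; generalizing the barrier check to every entry that wraps around for s > 1 is an assumption, as is the function name.

```python
G = "G"  # garbage marker, as on the Wire Splitting slide


def diagonal_shift(sig, s=1):
    """Rotate a time signature right by s timeslots.

    Entries that wrap around would cross the mod q barrier, so they must
    carry no contribution (0 or G).
    """
    if not 0 <= s <= len(sig):
        raise ValueError("shift length out of range")
    wrapped = sig[len(sig) - s:]
    if any(v not in (0, G) for v in wrapped):
        raise ValueError("non-zero contribution would cross the barrier")
    return wrapped + sig[:len(sig) - s]


shifted = diagonal_shift((1, 1, 0, G))
```

A signature such as (1, 0, 0, 1) is rejected because its last timeslot carries a real contribution.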

Sink Rule
• When a signal contains only one active timeslot and that timeslot contains the sole representative of the lowest remaining column, that signal is sunk and is removed from consideration.
• Sunk signals may be stored for parallel output or may be consumed as soon as they are produced, depending on the application.

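Gate Rule
Clock-period indicator signals are used to gate out garbage in the generalized full adder.
[Figure: generalized full adder (inputs a, b, c, d; outputs C, S) with indicator signals k=2 and k=1 gating inputs that contain G entries; time signatures shown include 1,0,1,1 and 0,0,0,0]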

Design Generation
• Currently, I have a Prolog program with constraint propagation extensions that knows the algebra. It does not yet successfully generate designs.
• The general strategy is to generate designs rank by rank, under iterative deepening, until a successful (valid and complete) design is found.

Generation, Continued
• Once a valid design is found, its cost will be used as an upper bound for an exhaustive search for better designs.
• Efficiently generating candidate designs with feedback is a chicken-and-egg problem. I am using a “suspense list” of inputs not yet connected to outputs to handle this problem.

Generation, Continued
• The routines that implement the TS algebra have to “wire up” the TS rules without knowing the TS of the feedback inputs. Tricky coding problem, but under control.
• The end game is not yet well understood. That needs more study.
• I hope to be generating real designs soon, and to have some idea what an optimal design might look like in January.

Sign Handling
• We haven’t talked about signed numbers. Signed data can be handled rather easily by this circuit, but signed coefficients require some thought.
• The standard circuit sign-extends the partial product terms in the feedback path. To do that, you have to know the sign bit’s value!
• I have a solution: next seminar.

Conclusions
• Inventing an appropriate algebra helped me to formulate the optimization problem for software solution and gives me confidence that the resulting designs are correct.
• Optimality, of course, is a different problem.
• The range of applicability of this circuit is not very broad: it is best suited for FPGA realization near the maximum clock speed of the logic family.
