# Distributed Arithmetic: Implementations and Applications

## Distributed Arithmetic: Implementations and Applications

A Tutorial

### Distributed Arithmetic (DA) [Peled and Liu, 1974]

• An efficient technique for calculating a sum of products, also known as a vector dot product, inner product, or multiply and accumulate (MAC)

• The MAC operation is very common in Digital Signal Processing algorithms

### So Why Use DA?

• The advantages of DA are best exploited in data-path circuit design

• Area savings from using DA can be up to 80% and seldom less than 50% in digital signal processing hardware designs

• An old technique that has been revived by the widespread use of Field Programmable Gate Arrays (FPGAs) for Digital Signal Processing (DSP)

• DA efficiently implements the MAC using the basic building blocks of FPGAs (Look Up Tables)

### An Illustration of MAC Operation

• The following expression represents a multiply and accumulate operation

$$y = \sum_{k=1}^{K} A_k x_k$$

• A numerical example
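A minimal sketch of the MAC operation as a plain dot product. The coefficient and input values are illustrative assumptions, not the slide's own numbers, and `mac` is a hypothetical helper name.

```python
def mac(coeffs, inputs):
    """Multiply and accumulate: y = sum over k of A_k * x_k."""
    if len(coeffs) != len(inputs):
        raise ValueError("coeffs and inputs must have the same length")
    acc = 0
    for a, x in zip(coeffs, inputs):
        acc += a * x  # one multiplication and one addition per term
    return acc


# Illustrative values (assumed for this sketch)
A = [32, 42, 45, 23]
x = [20, 8, 80, 50]
print(mac(A, x))  # 32*20 + 42*8 + 45*80 + 23*50 = 5726
```

Every term costs one multiplier here, which is the cost that DA sets out to remove.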

### A Few Points about the MAC

• Consider this

$$y = \sum_{k=1}^{K} A_k x_k$$

Note a few points

• A = [A1, A2, …, AK] is a vector of “constant” values

• x = [x1, x2, …, xK] is a vector of input “variables”

• Each Ak is M bits wide

• Each xk is N bits wide

• y should be wide enough to accommodate the result (M + N + ⌈log2 K⌉ bits always suffice)

### A Possible Hardware (NOT DA Yet!!!)

• Let,

[Figure: one scaling accumulator per term, each calculating Ai × xi. Labelled parts: shift registers, multi-bit AND gate, registers to hold the sum of partial products, shift right.]
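The scaling accumulator in the figure can be sketched in software as follows. This is a behavioural model under stated assumptions, not the slide's circuit: the input `x` is treated as unsigned and fed LSB first, and the accumulator is wide enough that the right shift never drops a bit. The names `scaling_accumulator` and `mac_bit_serial` are hypothetical.

```python
def scaling_accumulator(a, x, n_bits):
    """Bit-serial product a * x for an unsigned n_bits-wide x, LSB first."""
    acc = 0
    for cycle in range(n_bits):
        bit = (x >> cycle) & 1                 # serial bit from the shift register
        partial = a if bit else 0              # multi-bit AND gate: a AND bit
        # Shift the running sum right, then add the new partial product at the top.
        acc = (acc >> 1) + (partial << (n_bits - 1))
    return acc


def mac_bit_serial(coeffs, inputs, n_bits):
    """One scaling accumulator per term, followed by a final adder."""
    return sum(scaling_accumulator(a, x, n_bits) for a, x in zip(coeffs, inputs))


print(scaling_accumulator(13, 11, 4))
print(mac_bit_serial([32, 42, 45, 23], [20, 8, 80, 50], 7))
```

This arrangement needs K accumulators plus an adder tree, which is the hardware that the DA rearrangement collapses into a single look-up table and one accumulator.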

### How does DA work?

• The “basic” DA technique is bit-serial in nature

• DA is basically a bit-level rearrangement of the multiply and accumulate operation

• DA replaces the explicit multiplications with ROM look-ups, which makes it an efficient technique to implement on Field Programmable Gate Arrays (FPGAs)

### Moving Closer to Distributed Arithmetic

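• Consider once again

$$y = \sum_{k=1}^{K} A_k x_k \qquad \ldots(1)$$

• a. Let xk be an N-bit scaled two’s complement number, i.e.

| xk | < 1

xk : {bk0, bk1, bk2, …, bk(N-1)}

where bk0 is the sign bit

• b. We can express xk as

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad \ldots(2)$$

• c. Substituting (2) in (1),

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad \ldots(3)$$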
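The scaled two's complement expansion of xk can be checked with a short sketch. It assumes each xk is stored as an N-bit integer word whose most significant bit is the sign bit bk0; `bits_of` and `fractional_value` are hypothetical helper names.

```python
def bits_of(word, n_bits):
    """Return [b0, b1, ..., b(N-1)] with the sign bit b0 first."""
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def fractional_value(word, n_bits):
    """x = -b0 + sum over n = 1..N-1 of b_n * 2^-n."""
    b = bits_of(word, n_bits)
    return -b[0] + sum(b[n] * 2.0 ** -n for n in range(1, n_bits))


for word in (0b0110, 0b1010, 0b1000):
    print(format(word, "04b"), fractional_value(word, 4))
```

With N = 4, the word 0110 evaluates to 0.75 and 1010 to -0.75, so flipping the sign bit subtracts exactly 1 from the fractional part.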

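### Moving Even Closer to DA

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad \ldots(3)$$

Expanding this part (the double sum over k and n):

$$
\begin{aligned}
y = -\sum_{k=1}^{K} A_k b_{k0}
 &+ \left[ A_1 b_{11} 2^{-1} + A_1 b_{12} 2^{-2} + \cdots + A_1 b_{1(N-1)} 2^{-(N-1)} \right] \\
 &+ \left[ A_2 b_{21} 2^{-1} + A_2 b_{22} 2^{-2} + \cdots + A_2 b_{2(N-1)} 2^{-(N-1)} \right] \\
 &+ \cdots \\
 &+ \left[ A_K b_{K1} 2^{-1} + A_K b_{K2} 2^{-2} + \cdots + A_K b_{K(N-1)} 2^{-(N-1)} \right]
\end{aligned}
$$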

### Almost there!

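The Final Reformulation (collecting the terms that share the same power of 2):

$$y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} \qquad \ldots(4)$$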

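### Let's See the Change in Hardware

Our Original Equation

$$y = \sum_{k=1}^{K} A_k x_k$$

Bit Level Rearrangement

$$y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$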
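A small sketch can confirm that the bit-level rearrangement gives the same result as the original equation. It assumes N-bit two's complement words with the sign bit first; the integer coefficients and the function names are illustrative assumptions.

```python
def bit(word, n, n_bits):
    """Bit b_n of an n_bits-wide word, where n = 0 is the sign bit."""
    return (word >> (n_bits - 1 - n)) & 1


def value(word, n_bits):
    """Scaled two's complement value of the word."""
    return -bit(word, 0, n_bits) + sum(
        bit(word, n, n_bits) * 2.0 ** -n for n in range(1, n_bits)
    )


def mac_direct(coeffs, words, n_bits):
    """Original form: one full multiplication per term."""
    return sum(a * value(w, n_bits) for a, w in zip(coeffs, words))


def mac_rearranged(coeffs, words, n_bits):
    """Rearranged form: sum over bit positions n of an inner sum over k."""
    y = -sum(a * bit(w, 0, n_bits) for a, w in zip(coeffs, words))  # sign-bit term
    for n in range(1, n_bits):
        inner = sum(a * bit(w, n, n_bits) for a, w in zip(coeffs, words))
        y += inner * 2.0 ** -n
    return y


A = [3, -2, 5, 1]                       # assumed coefficients
X = [0b0110, 0b1010, 0b0001, 0b1111]    # 0.75, -0.75, 0.125, -0.125
print(mac_direct(A, X, 4), mac_rearranged(A, X, 4))
```

In the rearranged form each inner sum only adds coefficients selected by single bits, so no multiplier is needed.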

### So where does the ROM come in?

Note the bracketed portion of (4), $\sum_{k} A_k b_{kn}$. It can be treated as a function of the serial input bits of

{A, B, C, D}

### The ROM Construction

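• The bracketed term in (4),

$$\sum_{k=1}^{K} A_k b_{kn} \qquad \ldots(5)$$

has only $2^K$ possible values, because each bkn is either 0 or 1

• (5) can be pre-calculated for all possible values of b1n b2n … bKn

• We can store these in a look-up table of $2^K$ words addressed by K bits, i.e. b1n b2n … bKn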
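The ROM construction and the bit-serial evaluation can be sketched as below. This is a software model under assumptions: bit k of the address holds bkn, words are N-bit two's complement with the sign bit first, and `build_rom` and `da_mac` are hypothetical names.

```python
def build_rom(coeffs):
    """ROM[addr] = sum of every A_k whose address bit is set (2^K words)."""
    k = len(coeffs)
    return [
        sum(a for i, a in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


def da_mac(coeffs, words, n_bits):
    """Evaluate the dot product with one ROM look-up per bit position."""
    rom = build_rom(coeffs)
    y = 0.0
    for n in range(n_bits):
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> (n_bits - 1 - n)) & 1) << i   # gather bit n of each input
        term = rom[addr] * 2.0 ** -n
        y += -term if n == 0 else term                   # subtract at the sign bit
    return y


A = [3, -2, 5, 1]                       # assumed coefficients
X = [0b0110, 0b1010, 0b0001, 0b1111]    # 0.75, -0.75, 0.125, -0.125
print(len(build_rom(A)), da_mac(A, X, 4))
```

The loop body contains a table look-up, a shift, and an add or subtract, which is all the hardware the basic DA data path needs.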

### Lets See An Example

• Let number of taps K=4

• The fixed coefficients are A1 =0.72, A2= -0.3, A3 = 0.95, A4 = 0.11

• We need a $2^K = 2^4 = 16$-word ROM to hold every value of the bracketed term in (4)
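The 16 ROM words for these four coefficients can be generated with a short sketch. The address layout is an assumption made here: the address is read as b1 b2 b3 b4 with b1 as its most significant bit. `rom_table` is a hypothetical helper name.

```python
A = [0.72, -0.3, 0.95, 0.11]   # coefficients from the example


def rom_table(coeffs):
    """List (address bits, stored word) for all 2^K addresses."""
    k = len(coeffs)
    table = []
    for addr in range(1 << k):
        bits = [(addr >> (k - 1 - i)) & 1 for i in range(k)]  # b1 is the address MSB
        word = sum(a * b for a, b in zip(coeffs, bits))
        table.append((bits, round(word, 2)))
    return table


for bits, word in rom_table(A):
    print("".join(str(b) for b in bits), word)
```

Address 0000 stores 0, address 1111 stores the sum of all four coefficients (1.48), and every other word is the sum of the coefficients selected by the 1 bits of its address.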

### Key Issue: ROM Size

• The size of ROM is very important for high speed implementation as well as area efficiency

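• The number of address lines is equal to the number of elements in the vector, i.e. K

• Vectors of 16 or more elements are common => $2^{16}$ = 64K words of ROM!!!

• We have to reduce the size of ROM

$$x_k = \frac{1}{2}\left[ x_k - (-x_k) \right] \qquad \ldots(6)$$

The 2's-complement of xk is

$$-x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)}$$

so that

$$x_k = \frac{1}{2}\left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \qquad \ldots(7)$$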

### Re-Writing xk in a Different Code

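$$x_k = \frac{1}{2}\left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \qquad \ldots(7)$$

• Define: Offset Code

$$c_{kn} = \begin{cases} b_{kn} - \bar{b}_{kn}, & n \neq 0 \\ -(b_{k0} - \bar{b}_{k0}), & n = 0 \end{cases} \qquad c_{kn} \in \{-1, +1\}$$

• Finally

$$x_k = \frac{1}{2}\left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \qquad \ldots(8)$$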

### Using the New xk

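• Substitute the new xk in here

$$y = \sum_{k=1}^{K} A_k x_k$$

With offset-code digits $c_{kn} \in \{-1, +1\}$ this gives

$$y = \frac{1}{2} \sum_{k=1}^{K} A_k \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \qquad \ldots(9)$$

Let $Q(b_n) = \sum_{k=1}^{K} \frac{A_k}{2} c_{kn}$ and $Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2}$, so that

$$y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$$

• Constant: Q(0) does not depend on the inputs

• Inverse symmetry: complementing every address bit b1n … bKn flips the sign of every ckn, so Q(bn) only changes sign and just $2^{K-1}$ words need to be stored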
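The halved ROM can be sketched in software as follows. This is a behavioural model under assumptions, not the slide's circuit: the stored half is the one with b1 = 0, the first input bit complements the remaining address lines and negates the output, and the look-up is subtracted at sign-bit time. All function names are hypothetical.

```python
def build_half_rom(coeffs):
    """Store Q only for addresses whose first bit b1 is 0: 2^(K-1) words."""
    k = len(coeffs)
    rom = []
    for addr in range(1 << (k - 1)):
        q = -coeffs[0] / 2                       # b1 = 0 means c1 = -1
        for i in range(1, k):
            b = (addr >> (i - 1)) & 1
            q += coeffs[i] / 2 * (1 if b else -1)
        rom.append(q)
    return rom


def lookup_q(rom, bits):
    """Q(b) from the half ROM, using the inverse symmetry Q(not b) = -Q(b)."""
    b1, rest = bits[0], bits[1:]
    if b1:
        rest = [1 - b for b in rest]             # XOR the other address lines with b1
    addr = sum(b << i for i, b in enumerate(rest))
    return -rom[addr] if b1 else rom[addr]


def da_mac_offset(coeffs, words, n_bits):
    rom = build_half_rom(coeffs)
    y = (-sum(coeffs) / 2) * 2.0 ** -(n_bits - 1)   # constant term Q(0)
    for n in range(n_bits):
        bits = [(w >> (n_bits - 1 - n)) & 1 for w in words]
        q = lookup_q(rom, bits)
        y += -q if n == 0 else q * 2.0 ** -n        # subtract at sign-bit time
    return y


A = [3, -2, 5, 1]                       # assumed coefficients
X = [0b0110, 0b1010, 0b0001, 0b1111]    # 0.75, -0.75, 0.125, -0.125
print(len(build_half_rom(A)), da_mac_offset(A, X, 4))
```

For K = 4 the table shrinks from 16 words to 8, at the cost of a row of XOR gates on the address lines and an add/subtract control on the accumulator.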

### Hardware Using Offset Coding

• x1 selects between the two symmetric halves

• Ts indicates when the sign bit arrives

### Speed Concerns

• We considered One Bit At A Time (1 BAAT)

• No. of Clock Cycles Required = N

• If K = N, then essentially we are taking 1 cycle per product term of the dot product => Not bad!

• Opportunity for parallelism exists but at a cost of more hardware

• We could have 2 BAAT or up to N BAAT in the extreme case

• N BAAT => One complete result/cycle
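The trade-off between bits per cycle and cycle count can be sketched as follows. The model assumes one ROM look-up per bit processed in a cycle, so processing more bits at a time costs more ROM copies and adders; `da_mac_baat` and its `baat` parameter are hypothetical names.

```python
def build_rom(coeffs):
    """ROM[addr] = sum of every A_k whose address bit is set."""
    k = len(coeffs)
    return [
        sum(a for i, a in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


def da_mac_baat(coeffs, words, n_bits, baat=1):
    """Return (y, cycles) when `baat` bit positions are processed per cycle."""
    rom = build_rom(coeffs)
    y, cycles = 0.0, 0
    for start in range(0, n_bits, baat):                    # one pass = one clock cycle
        for n in range(start, min(start + baat, n_bits)):   # parallel look-ups
            addr = 0
            for i, w in enumerate(words):
                addr |= ((w >> (n_bits - 1 - n)) & 1) << i
            term = rom[addr] * 2.0 ** -n
            y += -term if n == 0 else term
        cycles += 1
    return y, cycles


A = [3, -2, 5, 1]                       # assumed coefficients
X = [0b0110, 0b1010, 0b0001, 0b1111]    # 0.75, -0.75, 0.125, -0.125
for baat in (1, 2, 4):
    print(baat, da_mac_baat(A, X, 4, baat))
```

The result is the same in every case; only the cycle count changes, from N cycles at 1 BAAT down to a single cycle at N BAAT.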

### The Speed Limit: Carry Propagation

• The speed in the critical path is limited by the width of the carry propagation

• Speed can be improved upon by using techniques to limit the carry propagation

### Speeding Up Further: Using RNS+DA

• By using a Residue Number System (RNS), the computations can be broken down into smaller elements which can be executed in parallel

• Since we are operating on smaller arguments, the carry propagation is naturally limited

• So by using RNS+DA, greater speed benefits can be attained, especially for higher precision calculations
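The residue decomposition can be illustrated with a small sketch. It only shows how a dot product splits into independent small-modulus channels and is recombined with the Chinese Remainder Theorem; in an RNS+DA design each channel would itself be a DA table, which is not modelled here. The moduli and all function names are assumptions chosen for illustration.

```python
from math import prod

MODULI = (7, 11, 13, 15)   # pairwise coprime, assumed; dynamic range 15015


def rns_mac(coeffs, inputs, moduli=MODULI):
    """Dot product computed separately in each residue channel."""
    residues = []
    for m in moduli:                      # channels are independent of each other
        acc = 0
        for a, x in zip(coeffs, inputs):
            acc = (acc + (a % m) * (x % m)) % m   # only small operands, short carries
        residues.append(acc)
    return tuple(residues)


def from_rns(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction into a signed integer."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)   # modular inverse (Python 3.8+)
    total %= big_m
    return total - big_m if total > big_m // 2 else total


A = [32, 42, 45, 23]   # assumed values
x = [20, 8, 80, 50]
print(rns_mac(A, x), from_rns(rns_mac(A, x)))
```

The result is only correct while the true dot product stays inside the dynamic range of the chosen moduli, so higher precision means more or larger moduli rather than wider adders.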

### Conclusion

• Ref: Stanley A. White, “Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review,” IEEE ASSP Magazine, July 1989

• Ref: Xilinx App Note, “The Role of Distributed Arithmetic in FPGA-Based Signal Processing”