Distributed Arithmetic: Implementations and Applications

Presentation Transcript

Distributed Arithmetic (DA) [Peled and Liu, 1974]

- An efficient technique for calculating a sum of products, also known as a vector dot product, inner product, or multiply-and-accumulate (MAC)
- The MAC operation is common to nearly all digital signal processing algorithms

So Why Use DA?

- The advantages of DA are best exploited in data-path circuit design
- Area savings from using DA can be up to 80%, and are seldom less than 50%, in digital signal processing hardware designs
- An old technique that has been revived by the widespread use of Field Programmable Gate Arrays (FPGAs) for Digital Signal Processing (DSP)
- DA implements the MAC efficiently using the basic building blocks of FPGAs (look-up tables)

An Illustration of MAC Operation

- The following expression represents a multiply and accumulate operation: $y = \sum_{k=1}^{K} A_k x_k$
- A numerical example
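The numerical example from the slide did not survive in this transcript. As a stand-in, here is a minimal sketch of the multiply-and-accumulate operation in Python; the values in `A` and `x` are made up for illustration and are not the ones from the original slide.

```python
def mac(coeffs, inputs):
    """Multiply-and-accumulate: y = sum over k of A_k * x_k."""
    acc = 0.0
    for a, x in zip(coeffs, inputs):
        acc += a * x  # one multiplication and one accumulation per term
    return acc

# Hypothetical values, chosen to be exactly representable in binary.
A = [0.5, -0.25, 0.75, 0.125]
x = [0.5, 0.25, -0.5, 0.75]

y = mac(A, x)
print(y)  # 0.25 - 0.0625 - 0.375 + 0.09375 = -0.09375
```

Each term costs one multiplier in a direct hardware implementation, which is the cost DA sets out to avoid.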

A Few Points about the MAC

- Consider this: $y = \sum_{k=1}^{K} A_k x_k$

Note a few points

- $A = [A_1, A_2, \ldots, A_K]$ is a vector of "constant" values
- $x = [x_1, x_2, \ldots, x_K]$ is a vector of input "variables"
- Each $A_k$ is $M$ bits wide
- Each $x_k$ is $N$ bits wide
- $y$ should be wide enough to accommodate the result

A Possible Hardware (NOT DA Yet!!!)

- Let,

[Figure: a possible MAC hardware structure. Labels: shift registers; multi-bit AND gate; adder/subtractor; registers to hold the sum of partial products; shift right. Each scaling accumulator calculates $A_i \times x_i$.]

How does DA work?

- The “basic” DA technique is bit-serial in nature
- DA is basically a bit-level rearrangement of the multiply and accumulate operation
- DA replaces the explicit multiplications with ROM look-ups, which makes it an efficient technique to implement on Field Programmable Gate Arrays (FPGAs)

Moving Closer to Distributed Arithmetic

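- Consider once again

$$y = \sum_{k=1}^{K} A_k x_k \qquad \ldots(1)$$

- a. Let $x_k$ be an $N$-bit scaled two's complement number, i.e. $|x_k| < 1$, with bits $x_k : \{b_{k0}, b_{k1}, b_{k2}, \ldots, b_{k(N-1)}\}$, where $b_{k0}$ is the sign bit
- b. We can express $x_k$ as

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad \ldots(2)$$

- c. Substituting (2) in (1),

$$y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} \qquad \ldots(3)$$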
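To make the bit-level expansion concrete, the sketch below splits a scaled two's complement fraction into its bits and rebuilds the value as $-b_0 + \sum_{n \ge 1} b_n 2^{-n}$. The helper names `to_bits` and `from_bits` are hypothetical, and the input is assumed to be an exact multiple of $2^{-(N-1)}$ in the range $[-1, 1)$.

```python
def to_bits(x, n_bits):
    """Return [b0, b1, ..., b(N-1)] for a two's complement fraction.

    b0 is the sign bit. Assumes -1 <= x < 1 and that x is an exact
    multiple of 2**-(n_bits - 1).
    """
    scaled = round(x * 2 ** (n_bits - 1))  # integer in [-2^(N-1), 2^(N-1))
    scaled &= (1 << n_bits) - 1            # wrap negative values into N bits
    return [(scaled >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def from_bits(bits):
    """Rebuild x = -b0 + sum over n >= 1 of b_n * 2**-n."""
    return -bits[0] + sum(b * 2.0 ** -n for n, b in enumerate(bits) if n >= 1)

bits = to_bits(-0.375, 4)
print(bits)             # [1, 1, 0, 1]
print(from_bits(bits))  # -1 + 0.5 + 0.125 = -0.375
```

The sign bit carries a negative weight, which is why the sign-bit term appears with a minus sign once the expansion is substituted into the sum of products.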

So where does the ROM come in?

Note this portion. It can be treated as a function of the serial input bits of {A, B, C, D}.

The ROM Construction

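- The sum $\sum_{k=1}^{K} A_k b_{kn}$, in which each $b_{kn}$ is 0 or 1, has only $2^K$ possible values, i.e.
- (5) can be pre-calculated for all possible values of $b_{1n} b_{2n} \ldots b_{Kn}$
- We can store these in a look-up table of $2^K$ words addressed by $K$ bits, i.e. $b_{1n} b_{2n} \ldots b_{Kn}$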

…(4)

…(5)

Let's See an Example

- Let the number of taps be $K = 4$
- The fixed coefficients are $A_1 = 0.72$, $A_2 = -0.3$, $A_3 = 0.95$, $A_4 = 0.11$
- We need a $2^K = 2^4 = 16$-word ROM
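The ROM table for this example is not in the transcript. The sketch below rebuilds a 16-word table from the four coefficients and then uses it for a bit-serial dot product. The address ordering (the bit of the first input is the most significant address bit), the helper names, and the test inputs are assumptions. The hardware right-shifts an accumulator; here that is modelled by weighting each look-up with $2^{-n}$.

```python
from itertools import product

A = [0.72, -0.3, 0.95, 0.11]  # coefficients from the example
K = len(A)

# ROM[address] = sum of the A_k whose address bit is 1.
# Address bits are ordered b1 b2 b3 b4, with b1 as the MSB (assumed ordering).
ROM = [sum(a * b for a, b in zip(A, bits)) for bits in product((0, 1), repeat=K)]

def to_bits(x, n_bits):
    """Two's complement fraction bits, sign bit first (hypothetical helper)."""
    scaled = round(x * 2 ** (n_bits - 1)) & ((1 << n_bits) - 1)
    return [(scaled >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def da_dot(rom, x_bits):
    """Bit-serial DA: one ROM look-up per bit position, no multiplier per tap."""
    n_bits = len(x_bits[0])
    acc = 0.0
    for n in range(n_bits):
        addr = 0
        for bits in x_bits:               # gather bit n of every input
            addr = (addr << 1) | bits[n]
        term = rom[addr] * 2.0 ** -n      # models the shift in the accumulator
        acc += -term if n == 0 else term  # the sign-bit look-up is subtracted
    return acc

x = [0.5, -0.25, 0.75, -0.5]              # made-up inputs, 4 bits each
y = da_dot(ROM, [to_bits(v, 4) for v in x])
print(len(ROM), y)                        # 16 words; y is approximately 1.0925
```

The only arithmetic left at run time is the add/subtract in the accumulator; the scaling by $2^{-n}$ is a shift in hardware.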

…(4)

Key Issue: ROM Size

- The size of the ROM is very important for high-speed implementation as well as area efficiency
- ROM size doubles with each added input address line, i.e. it grows exponentially
- The number of address lines is equal to the number of elements in the vector, i.e. $K$
- Vectors of 16 or more elements are common => $2^{16}$ = 64K words of ROM!!!
- We have to reduce the size of the ROM

The Benefit: Only Half Values to Store

Inverse symmetry
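The derivation behind this slide is not in the transcript. As a sketch of the usual offset-binary coding idea, assume each address bit is mapped to $\pm 1$ instead of 0/1 and each table entry is $\frac{1}{2}\sum_k \pm A_k$. Complementing every address bit then negates the entry, so only half the table needs to be stored. The example coefficients are reused; `q`, `half` and `lookup` are hypothetical names, and the correction terms needed for a complete dot product are left out.

```python
from itertools import product

A = [0.72, -0.3, 0.95, 0.11]  # example coefficients
K = len(A)

def q(bits):
    """Offset-coded entry: address bit 1 maps to +1, address bit 0 maps to -1."""
    return 0.5 * sum(a * (1 if b else -1) for a, b in zip(A, bits))

full = {bits: q(bits) for bits in product((0, 1), repeat=K)}

# Inverse symmetry: the entry at the complemented address is the negated entry.
for bits, value in full.items():
    complemented = tuple(1 - b for b in bits)
    assert full[complemented] == -value

# Store only the half of the table whose first address bit is 0.
half = {bits[1:]: value for bits, value in full.items() if bits[0] == 0}

def lookup(bits):
    """Recover any entry from the half table; the first bit selects the half."""
    if bits[0] == 0:
        return half[bits[1:]]
    return -half[tuple(1 - b for b in bits[1:])]

print(len(full), len(half))  # 16 8
```

In hardware, the first input bit would complement the remaining address lines and switch the accumulator between add and subtract, rather than negating a stored value.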

Hardware Using Offset Coding

x1 selects between the two symmetric halves

Ts indicates when the sign bit arrives

Alternate Technique: Decomposing the ROM

Requires an additional adder to sum the partial outputs
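A minimal sketch of the decomposition, assuming the four example coefficients are split into two groups of two: each group gets its own 4-word table, and one extra addition combines the partial outputs. The names `build_rom`, `rom_a`, `rom_b` and `lookup_split` are hypothetical.

```python
from itertools import product

A = [0.72, -0.3, 0.95, 0.11]  # example coefficients

def build_rom(coeffs):
    """Table of all partial sums of coeffs; the first coefficient is the address MSB."""
    return [sum(a * b for a, b in zip(coeffs, bits))
            for bits in product((0, 1), repeat=len(coeffs))]

rom_full = build_rom(A)   # 16 words, 4 address bits
rom_a = build_rom(A[:2])  # 4 words, addressed by b1 b2
rom_b = build_rom(A[2:])  # 4 words, addressed by b3 b4

def lookup_split(addr):
    """addr is b1 b2 b3 b4; the extra adder sums the two partial outputs."""
    return rom_a[addr >> 2] + rom_b[addr & 0b11]

print(len(rom_full), len(rom_a) + len(rom_b))  # 16 8
```

For $K = 16$ split into four groups of four, the same idea replaces one $2^{16}$-word table with four 16-word tables plus three adders.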

Speed Concerns

- We considered One Bit At A Time (1 BAAT)
- Number of clock cycles required per dot product = $N$
- If $K = N$, then on average we are taking 1 cycle per product term. Not bad!
- Opportunity for parallelism exists, but at the cost of more hardware
- We could have 2 BAAT, or up to $N$ BAAT in the extreme case
- $N$ BAAT: one complete result per cycle
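The trade-off can be sketched in software by processing `baat` bit positions per simulated clock cycle. This is only a model: in hardware the look-ups within one cycle would run in parallel on duplicated ROMs and adders, whereas here they run in a loop. The function name, the cycle counter and the test inputs are assumptions.

```python
from itertools import product

A = [0.72, -0.3, 0.95, 0.11]  # example coefficients
ROM = [sum(a * b for a, b in zip(A, bits)) for bits in product((0, 1), repeat=len(A))]

def to_bits(x, n_bits):
    """Two's complement fraction bits, sign bit first (hypothetical helper)."""
    scaled = round(x * 2 ** (n_bits - 1)) & ((1 << n_bits) - 1)
    return [(scaled >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def da_dot_baat(x_bits, baat):
    """Return (result, cycles) when `baat` bit positions are consumed per cycle."""
    n_bits = len(x_bits[0])
    acc, cycles = 0.0, 0
    for start in range(0, n_bits, baat):
        cycles += 1
        # These look-ups would happen in parallel in hardware.
        for n in range(start, min(start + baat, n_bits)):
            addr = 0
            for bits in x_bits:
                addr = (addr << 1) | bits[n]
            term = ROM[addr] * 2.0 ** -n
            acc += -term if n == 0 else term  # the sign-bit look-up is subtracted
    return acc, cycles

x_bits = [to_bits(v, 8) for v in [0.5, -0.25, 0.75, -0.5]]  # made-up inputs, N = 8
for baat in (1, 2, 8):
    print(baat, da_dot_baat(x_bits, baat))  # cycles: 8, 4, 1
```

The result is the same in every case; only the number of cycles and the amount of duplicated look-up hardware change.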

The Speed Limit: Carry Propagation

- The speed in the critical path is limited by the width of the carry propagation
- Speed can be improved upon by using techniques to limit the carry propagation

Speeding Up Further: Using RNS+DA

- By using a Residue Number System (RNS), the computations can be broken down into smaller elements which can be executed in parallel
- Since we are operating on smaller arguments, the carry propagation is naturally limited
- So by using RNS+DA, greater speed benefits can be attained, especially for higher-precision calculations
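A rough sketch of the RNS idea on its own, assuming non-negative integer operands and the hypothetical moduli 7, 11 and 13: each residue channel computes a small, independent dot product, and the Chinese Remainder Theorem recombines the channels. In an RNS+DA design each channel would use its own small DA table; plain modular arithmetic stands in for that here.

```python
from math import prod

MODULI = (7, 11, 13)  # hypothetical pairwise-coprime moduli; dynamic range 1001

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction into [0, prod(MODULI))."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse, Python 3.8+
    return total % big_m

def rns_dot(a, x):
    """Dot product computed independently in each residue channel."""
    channels = []
    for m in MODULI:
        acc = 0
        for a_k, x_k in zip(a, x):
            acc = (acc + (a_k % m) * (x_k % m)) % m  # small operands, short carries
        channels.append(acc)
    return from_rns(channels)

a = [3, 5, 7, 2]  # made-up integer coefficients
x = [4, 6, 1, 9]  # made-up integer inputs
print(rns_dot(a, x))  # 12 + 30 + 7 + 18 = 67
```

The result is only correct while the true dot product stays below the product of the moduli, so the moduli set has to be sized for the required precision.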

Conclusion

- Ref: Stanley A. White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE ASSP Magazine, July 1989
- Ref: Xilinx App Note, "The Role of Distributed Arithmetic in FPGA-Based Signal Processing"
