Distributed Arithmetic: Implementations and Applications

A Tutorial


Distributed Arithmetic (DA) [Peled and Liu, 1974]

  • An efficient technique for calculating a sum of products, also known as a vector dot product, inner product, or multiply-and-accumulate (MAC)

  • The MAC operation is very common in Digital Signal Processing algorithms


So Why Use DA?

  • The advantages of DA are best exploited in data-path circuit design

  • Area savings from using DA can be up to 80% and are seldom less than 50% in digital signal processing hardware designs

  • An old technique that has been revived by the widespread use of Field Programmable Gate Arrays (FPGAs) for Digital Signal Processing (DSP)

  • DA efficiently implements the MAC using the basic building blocks of FPGAs, the look-up tables (LUTs)


An Illustration of MAC Operation

  • The following expression represents a multiply and accumulate operation

    $$y = \sum_{k=1}^{K} A_k x_k = A_1 x_1 + A_2 x_2 + \dots + A_K x_K$$

  • A numerical example
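As a minimal sketch of the operation, the loop below accumulates one product per term. The function name `mac` and the numbers are hypothetical, chosen only for illustration; they are not the figures from the original slide.

```python
def mac(coeffs, inputs):
    """Multiply and accumulate: y = A1*x1 + A2*x2 + ... + AK*xK."""
    acc = 0
    for A, x in zip(coeffs, inputs):
        acc += A * x          # one multiply and one add per term
    return acc

# Hypothetical values, for illustration only.
A = [3, -2, 5, 1]
x = [4, 7, -1, 6]
y = mac(A, x)                 # 12 - 14 - 5 + 6
print(y)
```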


A Few Points about the MAC

  • Consider this

    $$y = \sum_{k=1}^{K} A_k x_k$$

    Note a few points

  • A = [A1, A2, …, AK] is a vector of “constant” values

  • x = [x1, x2, …, xK] is a vector of input “variables”

  • Each Ak is M bits wide

  • Each xk is N bits wide

  • y should be wide enough to accommodate the result
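One conservative way to size y, assuming two's-complement operands: the product of an M-bit and an N-bit number fits in M + N bits, and summing K such products can grow the result by up to ceil(log2 K) further bits. The helper name `accumulator_bits` is an assumption made for this sketch, and a real design may get away with fewer bits when the coefficients are known in advance.

```python
from math import ceil, log2

def accumulator_bits(M, N, K):
    """Conservative width of y for K products of M-bit by N-bit operands."""
    product_bits = M + N             # one full-precision product
    growth_bits = ceil(log2(K))      # headroom for adding K products
    return product_bits + growth_bits

print(accumulator_bits(M=8, N=8, K=4))
```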


A Possible Hardware (NOT DA Yet!!!)

  • Let,

[Figure: bit-serial MAC hardware. Labels: shift registers; multi-bit AND gate; adder/subtractor; registers to hold sum of partial products; shift right. Each scaling accumulator calculates Ai × xi.]
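The behaviour of one scaling accumulator can be sketched in software as follows. This is a model under stated assumptions, not the circuit from the slide: inputs are N-bit scaled two's-complement integers X with x = X / 2^(N-1), bits arrive LSB first, and the names `bits_of`, `scaling_accumulator` and `mac_bit_serial` are hypothetical. Exact fractions are used so the result can be compared without rounding.

```python
from fractions import Fraction

def bits_of(X, N):
    """Two's-complement bits [b0 (sign), b1, ..., b(N-1) (LSB)] of the integer X."""
    return [(X >> (N - 1 - n)) & 1 for n in range(N)]

def scaling_accumulator(A, X, N):
    """Compute A * x bit-serially, where x = X / 2**(N-1)."""
    b = bits_of(X, N)
    acc = Fraction(0)
    for n in range(N - 1, 0, -1):       # input bits arrive LSB first
        acc = (acc + A * b[n]) / 2      # AND with the bit, add, shift right
    return acc - A * b[0]               # the sign bit is subtracted, not added

def mac_bit_serial(coeffs, inputs, N):
    """One scaling accumulator per term, followed by a final sum."""
    return sum(scaling_accumulator(A, X, N) for A, X in zip(coeffs, inputs))

# Hypothetical 4-bit inputs: x = 5/8 and -3/8.
y = mac_bit_serial([3, -2], [5, -3], N=4)
print(y)
```

Note that this arrangement still needs one accumulator per term, which is the cost DA sets out to remove.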


How does DA work?

  • The “basic” DA technique is bit-serial in nature

  • DA is basically a bit-level rearrangement of the multiply and accumulate operation

  • DA replaces the explicit multiplications with ROM look-ups, which map efficiently onto Field Programmable Gate Arrays (FPGAs)


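Moving Closer to Distributed Arithmetic

  • Consider once again

    $$y = \sum_{k=1}^{K} A_k x_k \qquad \ldots(1)$$

    • a. Let xk be an N-bit scaled two’s complement number, i.e.

      | xk | < 1

      xk : {bk0, bk1, bk2, …, bk(N-1)}

      where bk0 is the sign bit

    • b. We can express xk as

      $$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad \ldots(2)$$

    • c. Substituting (2) in (1),

      $$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad \ldots(3)$$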
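The two's-complement expansion of xk can be checked numerically. The sketch below assumes each input is stored as an integer X with x = X / 2^(N-1); the helper names `bits_of` and `value_from_bits` are hypothetical.

```python
from fractions import Fraction

def bits_of(X, N):
    """Two's-complement bits [b0 (sign), b1, ..., b(N-1) (LSB)] of the integer X."""
    return [(X >> (N - 1 - n)) & 1 for n in range(N)]

def value_from_bits(b):
    """x = -b0 + sum over n >= 1 of b_n * 2**-n."""
    return -b[0] + sum(Fraction(b[n], 2**n) for n in range(1, len(b)))

N = 4
for X in range(-2**(N - 1), 2**(N - 1)):
    # Every 4-bit pattern decodes to X / 8.
    assert value_from_bits(bits_of(X, N)) == Fraction(X, 2**(N - 1))

print(bits_of(-3, N), value_from_bits(bits_of(-3, N)))
```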


Moving More Closer to DA

…(3)

Expanding this part


Moving Still More Closer to DA
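The expansion step can be written out term by term. What follows is a reconstruction from the two's-complement form of xk, assuming the standard derivation in White's tutorial review; the layout on the original slide may have differed. The double sum is first expanded row by row (one row per coefficient) and then regrouped column by column (one column per bit position n).

```latex
\begin{align*}
\sum_{k=1}^{K} A_k \sum_{n=1}^{N-1} b_{kn} 2^{-n}
  &= A_1 \left( b_{11} 2^{-1} + b_{12} 2^{-2} + \dots + b_{1(N-1)} 2^{-(N-1)} \right) \\
  &\quad + A_2 \left( b_{21} 2^{-1} + b_{22} 2^{-2} + \dots + b_{2(N-1)} 2^{-(N-1)} \right) \\
  &\quad \;\;\vdots \\
  &\quad + A_K \left( b_{K1} 2^{-1} + b_{K2} 2^{-2} + \dots + b_{K(N-1)} 2^{-(N-1)} \right) \\[4pt]
  &= \left( A_1 b_{11} + A_2 b_{21} + \dots + A_K b_{K1} \right) 2^{-1} \\
  &\quad + \left( A_1 b_{12} + A_2 b_{22} + \dots + A_K b_{K2} \right) 2^{-2} \\
  &\quad \;\;\vdots \\
  &\quad + \left( A_1 b_{1(N-1)} + A_2 b_{2(N-1)} + \dots + A_K b_{K(N-1)} \right) 2^{-(N-1)}
\end{align*}
```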


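Almost there!

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \qquad \ldots(4)$$

The Final Reformulation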
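A quick numerical check of the rearrangement: summing over bit positions first and over coefficients second gives the same result as the direct dot product. The sketch assumes N-bit scaled two's-complement inputs stored as integers X with x = X / 2^(N-1); the function names and the input values are hypothetical.

```python
from fractions import Fraction

def bits_of(X, N):
    """Two's-complement bits [b0 (sign), b1, ..., b(N-1) (LSB)] of the integer X."""
    return [(X >> (N - 1 - n)) & 1 for n in range(N)]

def direct(coeffs, inputs, N):
    """y = sum of A_k * x_k, computed term by term."""
    return sum(Fraction(A) * Fraction(X, 2**(N - 1)) for A, X in zip(coeffs, inputs))

def rearranged(coeffs, inputs, N):
    """Same sum, grouped by bit position n instead of by term k."""
    bits = [bits_of(X, N) for X in inputs]
    y = Fraction(0)
    for n in range(1, N):
        inner = sum(Fraction(A) * b[n] for A, b in zip(coeffs, bits))
        y += inner / 2**n                          # weight of bit position n
    y -= sum(Fraction(A) * b[0] for A, b in zip(coeffs, bits))  # sign bits
    return y

coeffs = [Fraction("0.72"), Fraction("-0.3"), Fraction("0.95"), Fraction("0.11")]
inputs = [3, -5, 7, -8]                            # hypothetical 4-bit inputs
print(direct(coeffs, inputs, 4), rearranged(coeffs, inputs, 4))
```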


Let's See the Change of Hardware

Our Original Equation

Bit Level Rearrangement


So where does the ROM come in?

Note this portion. It can be treated as a function of the serial input bits of

{A, B, C, D}


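The ROM Construction

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \qquad \ldots(4)$$

  • The bracketed term in (4),

    $$\sum_{k=1}^{K} A_k b_{kn} \qquad \ldots(5)$$

    has only 2^K possible values, because each bkn is either 0 or 1

  • (5) can be pre-calculated for all possible values of b1n b2n … bKn

  • We can store these in a look-up table of 2^K words addressed by K bits, i.e. b1n b2n … bKn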
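The look-up table and the bit-serial loop that uses it can be sketched as below. Assumptions: inputs are N-bit scaled two's-complement integers X with x = X / 2^(N-1), bits are processed LSB first, and address bit k of the ROM carries the current bit of input k+1. The names `build_rom` and `da_rom` and the sample inputs are hypothetical.

```python
from fractions import Fraction

def bits_of(X, N):
    """Two's-complement bits [b0 (sign), b1, ..., b(N-1) (LSB)] of the integer X."""
    return [(X >> (N - 1 - n)) & 1 for n in range(N)]

def build_rom(coeffs):
    """2**K words: word[addr] = sum of the A_k whose address bit is 1."""
    K = len(coeffs)
    return [sum(A for k, A in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(2**K)]

def da_rom(coeffs, inputs, N):
    """Bit-serial DA: one ROM look-up and one add (or subtract) per clock."""
    rom = build_rom(coeffs)
    bits = [bits_of(X, N) for X in inputs]
    acc = Fraction(0)
    for n in range(N - 1, 0, -1):                        # LSB first
        addr = sum(b[n] << k for k, b in enumerate(bits))
        acc = (acc + rom[addr]) / 2                      # add, then shift right
    addr = sum(b[0] << k for k, b in enumerate(bits))    # sign-bit column
    return acc - rom[addr]                               # subtract at sign time

coeffs = [Fraction("0.72"), Fraction("-0.3"), Fraction("0.95"), Fraction("0.11")]
inputs = [3, -5, 7, -8]                                  # hypothetical 4-bit inputs
print(da_rom(coeffs, inputs, 4))
```

No multiplier appears anywhere in `da_rom`: the only arithmetic in the loop is an addition and a halving.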


Let's See an Example

  • Let the number of taps K = 4

  • The fixed coefficients are A1 = 0.72, A2 = -0.3, A3 = 0.95, A4 = 0.11

  • We need a ROM of 2^K = 2^4 = 16 words

…(4)


ROM: Address and Contents
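The 16 words for the example coefficients can be generated rather than typed by hand. In this sketch the address is written as b1 b2 b3 b4 with b1 leftmost, which is an assumption about the bit ordering; `rom_table` is a hypothetical helper. `Decimal` is used so the printed sums are exact.

```python
from decimal import Decimal
from itertools import product

A = [Decimal("0.72"), Decimal("-0.3"), Decimal("0.95"), Decimal("0.11")]

def rom_table(coeffs):
    """Map each address string b1 b2 ... bK to the sum of the selected A_k."""
    table = {}
    for bits in product((0, 1), repeat=len(coeffs)):
        address = "".join(map(str, bits))            # b1 is the leftmost bit
        table[address] = sum((a for a, b in zip(coeffs, bits) if b), Decimal(0))
    return table

rom = rom_table(A)
for address, word in rom.items():
    print(address, word)
```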


Key Issue: ROM Size

  • The size of the ROM is very important for high-speed implementation as well as area efficiency

  • ROM size grows exponentially with each added input address line

  • The number of address lines is equal to the number of elements in the vector, i.e. K

  • Vectors of 16 or more elements are common => 2^16 = 64K words of ROM!!!

  • We have to reduce the size of the ROM


A Very Neat Trick:

…(6)

2’s-complement

…(7)
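For reference, the trick as it appears in White's tutorial review is to write xk as half the difference between itself and its own negative, and then to express the negative in two's complement (invert every bit and add one LSB). The lines below are a reconstruction under that assumption; how they pair with the slide's own equation numbers is not certain. Here bk0 is the sign bit of the N-bit number xk and an overbar denotes the complemented bit.

```latex
\begin{align*}
x_k &= \tfrac{1}{2} \left[ x_k - (-x_k) \right] \\[4pt]
-x_k &= -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)}
\end{align*}
```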


Re-Writing xk in a Different Code

  • Define: Offset Code

  • Finally

…(7)

…(8)
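A reconstruction of the offset-code step, again assuming the derivation in White's tutorial review rather than the exact equations of the slide. Subtracting the two's-complement form of -xk from that of xk and halving pairs every bit with its complement; each difference can only be +1 or -1, which defines the offset code c.

```latex
\begin{align*}
x_k &= \tfrac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0})
       + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \\[4pt]
c_{kn} &= b_{kn} - \bar{b}_{kn} \quad (n \neq 0), \qquad
c_{k0} = -(b_{k0} - \bar{b}_{k0}), \qquad c_{kn} \in \{-1, +1\} \\[4pt]
x_k &= \tfrac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right]
\end{align*}
```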


Using the New xk

  • Substitute the new xk in here

…(9)


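The New Formulation in Offset Code

With the offset-code bits $c_{kn} = b_{kn} - \bar{b}_{kn}$ for $n \neq 0$ and $c_{k0} = -(b_{k0} - \bar{b}_{k0})$, each equal to +1 or -1, let

$$Q(b_n) = \sum_{k=1}^{K} \frac{A_k}{2} c_{kn} \qquad \text{and} \qquad Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2} \quad \text{(a constant)}$$

so that

$$y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$$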


The Benefit: Only Half the Values to Store

Inverse symmetry
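The inverse symmetry is easy to confirm numerically. In the sketch below each stored word is Q = ½ Σ Ak·ck, where ck is +1 when the corresponding address bit is 1 and -1 when it is 0; complementing every address bit flips every ck and therefore negates the word. The helper `q_table` is hypothetical, and the coefficients are the ones from the K = 4 example.

```python
from fractions import Fraction

def q_table(coeffs):
    """All 2**K offset-code words, indexed by address."""
    K = len(coeffs)
    table = []
    for addr in range(2**K):
        c = [1 if (addr >> k) & 1 else -1 for k in range(K)]   # offset code
        table.append(sum(Fraction(A) * ck for A, ck in zip(coeffs, c)) / 2)
    return table

coeffs = [Fraction("0.72"), Fraction("-0.3"), Fraction("0.95"), Fraction("0.11")]
Q = q_table(coeffs)
all_ones = len(Q) - 1

# Complementing the address negates the stored word.
symmetric = all(Q[addr] == -Q[addr ^ all_ones] for addr in range(len(Q)))
print(symmetric)
```

Because of this, only the 2^(K-1) words for one half of the address space need to be stored; the other half follows by negation.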


Hardware Using Offset Coding

x1 selects between the two symmetric halves

Ts indicates when the sign bit arrives
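A behavioural sketch of this arrangement, under stated assumptions rather than as a model of the exact circuit on the slide: the half-size ROM holds the words Q = ½ Σ Ak·ck (ck = ±1) for bit columns whose x1 bit is 0; the x1 bit is XORed onto the remaining address lines and also selects add or subtract; the sign-bit time (Ts) inverts the operation once more; and the accumulator starts from the constant Q(0) = -½ Σ Ak. Inputs are N-bit scaled two's-complement integers X with x = X / 2^(N-1). All function names are hypothetical.

```python
from fractions import Fraction

def bits_of(X, N):
    """Two's-complement bits [b0 (sign), b1, ..., b(N-1) (LSB)] of the integer X."""
    return [(X >> (N - 1 - n)) & 1 for n in range(N)]

def half_rom(coeffs):
    """Words for columns whose first bit is 0, addressed by the other K-1 bits."""
    K = len(coeffs)
    rom = []
    for addr in range(2**(K - 1)):
        c = [-1] + [1 if (addr >> i) & 1 else -1 for i in range(K - 1)]
        rom.append(sum(Fraction(A) * ck for A, ck in zip(coeffs, c)) / 2)
    return rom

def q_lookup(rom, column):
    """Q for one bit column (bits of x1..xK) using the half-size ROM."""
    first, rest = column[0], column[1:]
    addr = sum((b ^ first) << i for i, b in enumerate(rest))  # XOR with x1's bit
    return -rom[addr] if first else rom[addr]                 # x1 picks the half

def da_offset(coeffs, inputs, N):
    rom = half_rom(coeffs)
    bits = [bits_of(X, N) for X in inputs]
    q0 = -sum(Fraction(A) for A in coeffs) / 2    # constant Q(0)
    acc = None
    for n in range(N - 1, -1, -1):                # LSB first, sign bit last
        q = q_lookup(rom, [b[n] for b in bits])
        if n == 0:                                # Ts: sign bit has arrived
            q = -q
        acc = q0 + q if acc is None else acc / 2 + q
    return acc

coeffs = [Fraction("0.72"), Fraction("-0.3"), Fraction("0.95"), Fraction("0.11")]
inputs = [3, -5, 7, -8]                           # hypothetical 4-bit inputs
print(da_offset(coeffs, inputs, 4))
```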


Alternate Technique: Decomposing the ROM

Requires an additional adder to sum the partial outputs
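The saving can be illustrated with a small sketch: the K address lines are split into two groups, each group addresses its own smaller ROM, and one extra adder combines the two partial outputs. The split point, the function names, and the integer scaling of the coefficients (the example values multiplied by 100) are assumptions made for this illustration.

```python
def build_rom(coeffs):
    """2**K words: word[addr] = sum of the A_k whose address bit is 1."""
    return [sum(A for k, A in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(2**len(coeffs))]

def split_lookup(rom_low, rom_high, addr, low_bits):
    """Look up the two halves of the address separately and add the results."""
    low = addr & ((1 << low_bits) - 1)
    high = addr >> low_bits
    return rom_low[low] + rom_high[high]          # the additional adder

coeffs = [72, -30, 95, 11]                        # example coefficients x 100
full = build_rom(coeffs)
rom_low, rom_high = build_rom(coeffs[:2]), build_rom(coeffs[2:])

same = all(full[a] == split_lookup(rom_low, rom_high, a, 2) for a in range(16))
print(same, len(full), len(rom_low) + len(rom_high))

# For K = 16: one ROM needs 2**16 words, two 8-input ROMs need 2 * 2**8.
words_full_16 = 2**16
words_split_16 = 2 * 2**8
print(words_full_16, words_split_16)
```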


Speed Concerns

  • We considered One Bit At A Time (1 BAAT)

  • No. of Clock Cycles Required = N

  • If K = N, then the K products of the dot product are formed in N cycles, i.e. effectively 1 cycle per product => Not bad!

  • Opportunity for parallelism exists, but at the cost of more hardware

  • We could have 2 BAAT, or up to N BAAT in the extreme case

  • N BAAT => One complete result per cycle
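The cycle count for L bits at a time can be sketched as below. This is a behavioural model only: in hardware the L look-ups of one clock cycle would come from L parallel copies of the ROM combined by extra adders, whereas the model simply performs them back to back inside one counted cycle. Inputs are N-bit scaled two's-complement integers X with x = X / 2^(N-1); the name `da_baat` is hypothetical.

```python
from fractions import Fraction

def bits_of(X, N):
    """Two's-complement bits [b0 (sign), b1, ..., b(N-1) (LSB)] of the integer X."""
    return [(X >> (N - 1 - n)) & 1 for n in range(N)]

def build_rom(coeffs):
    """2**K words: word[addr] = sum of the A_k whose address bit is 1."""
    return [sum(A for k, A in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(2**len(coeffs))]

def da_baat(coeffs, inputs, N, L=1):
    """DA processing L bits of every input per clock; returns (y, cycles)."""
    rom = build_rom(coeffs)
    bits = [bits_of(X, N) for X in inputs]
    acc, cycles, n = Fraction(0), 0, N - 1
    while n >= 0:
        cycles += 1
        for _ in range(L):                        # L look-ups in this cycle
            if n < 0:
                break
            addr = sum(b[n] << k for k, b in enumerate(bits))
            acc = acc - rom[addr] if n == 0 else (acc + rom[addr]) / 2
            n -= 1
    return acc, cycles

coeffs = [Fraction("0.72"), Fraction("-0.3"), Fraction("0.95"), Fraction("0.11")]
inputs = [3, -5, 7, -8]                           # hypothetical 4-bit inputs
for L in (1, 2, 4):
    print(L, da_baat(coeffs, inputs, 4, L))
```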


Illustration of 2 BAAT


Illustration of N BAAT


The Speed Limit: Carry Propagation

  • The speed in the critical path is limited by the width of the carry propagation

  • Speed can be improved upon by using techniques to limit the carry propagation


Speeding Up Further: Using RNS+DA

  • By using a Residue Number System (RNS), the computations can be broken down into smaller elements which can be executed in parallel

  • Since we are operating on smaller arguments, the carry propagation is naturally limited

  • So by using RNS+DA, greater speed benefits can be attained, especially for higher precision calculations
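The RNS idea can be sketched with integers: each value is replaced by its residues modulo a set of pairwise coprime moduli, the MAC runs independently in every small channel, and the Chinese Remainder Theorem recombines the channel results. The moduli, the function names, and the sample data below are assumptions chosen for illustration, and the sketch does not model the DA look-up inside each channel. The modular inverse via `pow(value, -1, m)` needs Python 3.8 or later.

```python
from math import prod

MODULI = (7, 11, 13, 15)          # pairwise coprime, hypothetical choice

def to_rns(value):
    """Residues of an integer, one per modulus."""
    return tuple(value % m for m in MODULI)

def rns_mac(coeffs, inputs):
    """Run the MAC separately in each small residue channel."""
    A_res = [to_rns(A) for A in coeffs]
    x_res = [to_rns(x) for x in inputs]
    return tuple(
        sum(a[i] * x[i] for a, x in zip(A_res, x_res)) % m
        for i, m in enumerate(MODULI)
    )

def from_rns(residues):
    """Chinese Remainder Theorem, mapped back to a signed range."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m) is the inverse mod m
    total %= M
    return total - M if total > M // 2 else total

y = from_rns(rns_mac([3, -2, 5, 1], [40, 70, -10, 60]))
print(y)                                   # 120 - 140 - 50 + 60
```

The result is only valid while the true value of y stays within the dynamic range set by the product of the moduli (here 15015).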


Conclusion

  • Ref: Stanley A. White, “Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review,” IEEE ASSP Magazine, July 1989

  • Ref: Xilinx App Note, “The Role of Distributed Arithmetic In FPGA Based Signal Processing”

