
faster functional modules: lessons taught from FP-ADDERs


Presentation Transcript


  1. faster functional modules: lessons taught from FP-ADDERs
  Guy Even, Electrical Engineering Dept., Tel-Aviv Univ.
  Silicon Value Seminar (April 29, 2002)

  2. outline
  • FP-Adder: an example of a complicated module
    • brief overview
    • focus on two sub-blocks
  • Counting leading zeros – priority encoders; various design methods:
    • divide & conquer
    • parallel prefix computation
    • redundant addition
  • Adders:
    • fast adders
    • compound adders

  3. Background
  Faster clock rates require faster modules.
  • Example: Floating-Point Adders
    • early designs: 50-60 logic levels; 15-20 gate levels per cycle ⇒ 3-4 cycles!
    • new designs: 25 gate levels ⇒ 2 cycles.
  How? better algorithms and faster sub-blocks…

  4. Floating-Point Add
  • Algorithm: why 50-60 logic levels?
  • Sub-Modules: list of sub-blocks.
  • floating-point number: S-sign, E-exponent, F-mantissa
  • floating-point addition:
    • input: (Sa,Ea,Fa) & (Sb,Eb,Fb)
    • output: (S,E,F) such that (-1)^S · F · 2^E is the (rounded) value of (-1)^Sa · Fa · 2^Ea + (-1)^Sb · Fb · 2^Eb.

  5. FP-Add: naïve algorithm
  • Pre-process: SWAP operands; align mantissa of smaller operand (shift right); compute sticky-bit (OR of bits shifted outside).
  • Add/Sub: add/sub mantissas; convert sum to sign & magnitude (abs of negative sum).
  • Round: normalize sum (shift left); rounding decision; INC according to rounding decision; RESULT.
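
  The three stages above map onto a short behavioral model. Below is a minimal Python sketch of this naïve flow, assuming a toy (sign, exp, frac) representation with a P-bit significand, a few guard bits G, and round-to-nearest-even; the names fp_add_naive, P and G are illustrative and not from the slides.

```python
P = 24          # significand width in bits (e.g. single precision)
G = 3           # extra guard positions kept during alignment

def fp_add_naive(a, b):
    """Add two normalized values given as (sign, exp, frac), where the value
    is (-1)**sign * frac * 2**(exp - (P - 1)) and frac has its MSB set."""
    (sa, ea, fa), (sb, eb, fb) = a, b

    # Pre-process: SWAP so operand a has the larger exponent.
    if eb > ea:
        (sa, ea, fa), (sb, eb, fb) = (sb, eb, fb), (sa, ea, fa)

    # Align the smaller mantissa (shift right); the sticky bit is the
    # OR of the bits shifted outside, kept in the LSB.
    d = ea - eb
    fa <<= G
    fb <<= G
    if d > 0:
        sticky = int((fb & ((1 << d) - 1)) != 0)
        fb = (fb >> d) | sticky

    # Add/Sub mantissas, then convert the sum to sign & magnitude.
    s = fa + fb if sa == sb else fa - fb
    sign = sa if s >= 0 else sb
    mag = abs(s)
    if mag == 0:
        return (0, 0, 0)

    # Normalize: shift right on mantissa overflow, shift left by the
    # leading-zero count otherwise.
    exp = ea
    while mag >= (1 << (P + G)):
        mag = (mag >> 1) | (mag & 1)      # keep the shifted-out bit as sticky
        exp += 1
    while mag < (1 << (P + G - 1)):
        mag <<= 1
        exp -= 1

    # Rounding decision (round-to-nearest-even here) and INC.
    low, mag = mag & ((1 << G) - 1), mag >> G
    if low > (1 << (G - 1)) or (low == (1 << (G - 1)) and (mag & 1)):
        mag += 1
        if mag == (1 << P):               # rounding overflowed the mantissa
            mag >>= 1
            exp += 1
    return (sign, exp, mag)

# e.g. 1.5 + 2.5 = 4.0 with P = 24:
assert fp_add_naive((0, 0, 3 << 22), (0, 1, 5 << 21)) == (0, 2, 1 << 23)
```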

  6. focus: normalization shift
  • Problem: LZ = number of leading zeros; shift left by LZ positions.
  • Use a priority encoder!
  • two types of priority encoders:
    • unary – example: X[1:4]=0010 ⇒ A[1:4]=0011
    • binary – example: X[1:4]=0010 ⇒ Y[2:0]=010

  7. Unary PENC
  • input: X[1:n]
  • output: A[1:n]
  • functionality: A[i] = OR(X[1], …, X[i]), i.e. A[i]=1 iff X[j]=1 for some j ≤ i.
  • Implementation: what is the best design? Delay = Ω(log n) & Cost = Ω(n).
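
  As a reference point for the designs on the next slides, a minimal behavioral Python model of this functionality; the function name unary_penc and the list-of-bits interface are mine, not from the slides.

```python
def unary_penc(x):
    """Unary priority encoder: A[i] = OR(X[1], ..., X[i]).

    x is a list of 0/1 bits for X[1..n] (x[0] is X[1]); the output is
    0...0 up to the leading one of X and 1...1 from there on.
    """
    a, seen = [], 0
    for bit in x:
        seen |= bit
        a.append(seen)
    return a

# Slide example: X[1:4] = 0010  ->  A[1:4] = 0011
assert unary_penc([0, 0, 1, 0]) == [0, 0, 1, 1]
```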

  8. Unary PENC – divide & conquer
  [figure: X[1:n/2] and X[1+n/2:n] feed two U-PENC(n/2) blocks; an OR-tree(n/2) over the left half is OR-ed into the right half's outputs A[1+n/2:n]]
  • share the OR-tree ⇒ slight reduction of cost
  • linear fan-out, logarithmic delay
  • delay: O(log n) is optimal; O(log n) even if fan-out is considered
  • cost: O(n log n) – not optimal
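
  A Python sketch of this recursion under the same interface as before: split X into halves, recurse on each half, and force the right half's outputs to 1 whenever the left half contains a 1. The shared OR-tree is just the left recursion's last output; names are illustrative.

```python
def unary_penc_dc(x):
    """Divide & conquer unary PENC (sketch of the slide's structure)."""
    n = len(x)
    if n == 1:
        return [x[0]]
    a_left = unary_penc_dc(x[:n // 2])
    a_right = unary_penc_dc(x[n // 2:])
    # A[n/2] already equals OR(X[1..n/2]), so the OR-tree over the left
    # half can be shared with the left recursive call ("share OR-tree").
    any_left = a_left[-1]
    return a_left + [b | any_left for b in a_right]

assert unary_penc_dc([0, 0, 1, 0]) == [0, 0, 1, 1]
```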

  9. Unary PENC – improve: use Parallel Prefix Computation (PPC) [FL,BK]!

  10. Unary PENC = PPC(OR)
  [figure: adjacent inputs are OR-ed in pairs and fed to a U-PENC(n/2) that produces the even-indexed outputs A[2], A[4], …, A[n]; the odd-indexed outputs A[3], …, A[n-1] are produced by OR-ing A[2i] with X[2i+1], and A[1] = X[1]]
  • delay = O(log n), cost = O(n)
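
  A functional sketch of the recursion the figure suggests (the Brent-Kung-style PPC): OR adjacent inputs, recurse on the n/2 pairs to get the even-indexed prefixes, then produce the odd-indexed prefixes with one more OR each. It assumes n is a power of two; the gate count is O(n) and the depth O(log n).

```python
def ppc_or(x):
    """Parallel prefix OR, A[i] = OR(X[1..i]), via pairing + recursion.

    n/2 ORs on the way in, one PPC of size n/2, and n/2 - 1 ORs on the
    way out; assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 1:
        return [x[0]]
    # Pair adjacent inputs; the recursion returns A[2], A[4], ..., A[n].
    pairs = [x[2 * i] | x[2 * i + 1] for i in range(n // 2)]
    even = ppc_or(pairs)
    # Odd-indexed prefixes: A[1] = X[1], A[2i+1] = A[2i] OR X[2i+1].
    out = [x[0]]
    for i in range(1, n // 2):
        out += [even[i - 1], even[i - 1] | x[2 * i]]
    out.append(even[-1])
    return out

assert ppc_or([0, 0, 1, 0]) == [0, 0, 1, 1]
assert ppc_or([0, 1, 0, 0, 0, 0, 0, 1]) == [0, 1, 1, 1, 1, 1, 1, 1]
```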

  11. PPC – properties
  [figure: same recursive structure as the previous slide]
  • Fan-out: logarithmic fan-out can be decreased to constant (cost still O(n)).
  • Layout: O(n log n) area.
  • Same design as the "Brent-Kung" adder.
  • Applicable to every associative operator.

  12. Binary PENC
  • input: X[1:n] (n = 2^k)
  • output: Y[k:0]
  • functionality: Y is the binary encoding of the number of leading zeros of X (e.g. X[1:4]=0010 ⇒ Y=010).
  • relation to Unary PENC: the leading-zero count is determined by where A = PPC(OR)(X) switches from 0 to 1.
  • Implementation: what is the best design? Delay = Ω(log n) & Cost = Ω(n).

  13. Binary PENC – simple & optimal
  [pipeline: X[1:n] → PPC(OR) → A[1:n] → diff(n) → encoder(n) → Y[k:0]]
  • delay(diff(n)) = constant, cost(diff(n)) = O(n)
  • delay(encoder(n)) = O(log n), cost(encoder(n)) = O(n)
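
  A behavioral sketch of this pipeline, assuming Y is the leading-zero count (as in the slide-6 example): diff turns the monotone prefix-OR vector into a one-hot marking of the leading one, and the encoder converts the one-hot position to binary. It models functionality only, not the gate-level encoder; names are mine.

```python
def binary_penc(x):
    """Binary PENC via the pipeline PPC(OR) -> diff -> encoder.

    Returns the number of leading zeros of X[1:n] (n if X is all zeros).
    """
    n = len(x)
    # Stage 1: parallel prefix OR, A[i] = OR(X[1..i]).
    a, seen = [], 0
    for bit in x:
        seen |= bit
        a.append(seen)
    # Stage 2: diff(n), constant delay: XOR of adjacent bits of the
    # monotone vector A gives a one-hot marking of the leading one.
    d = [a[0]] + [a[i] ^ a[i - 1] for i in range(1, n)]
    # Stage 3: encoder(n): encode the one-hot position in binary
    # (in hardware, bit j of Y is an OR over the D[i] whose index has bit j set).
    for i, hot in enumerate(d):
        if hot:
            return i          # i leading zeros before the leading one
    return n                  # X was all zeros

assert binary_penc([0, 0, 1, 0]) == 2
assert binary_penc([0, 0, 0, 0]) == 4
```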

  14. Binary PENC – with adder tree
  [pipeline: X[1:n] → PPC(OR) → A[1:n] → ADD-tree(n) → Y[k:0]]
  • problem: each adder(k) in the tree has O(log k) delay ⇒ total delay is O(log n · log k).

  15. Redundant addition
  [figure: three operands a[3:0], b[3:0], c[3:0] are compressed column by column into two results x[3:0], y[3:0]]
  • add columns in parallel using Full-Adders
  • partial compression, or (3:2)-addition: delay is constant!
  • a tree structure enables (n:2)-addition with O(log n) delay
  • (n:2)-addition is used in fast multipliers
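
  A Python sketch of (3:2)- and (n:2)-addition on machine integers, using bitwise operations so that every bit column is handled independently (the point of the constant delay); compress_3_to_2 and compress_n_to_2 are illustrative names.

```python
def compress_3_to_2(a, b, c):
    """(3:2)-addition: one layer of full adders, constant delay.

    Column i computes sum_i = a_i XOR b_i XOR c_i and
    carry_i = majority(a_i, b_i, c_i), weighted into column i+1;
    the pair (sum, carry) is a carry-save representation of a + b + c.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def compress_n_to_2(numbers):
    """(n:2)-addition: a tree of (3:2) compressors, O(log n) levels.

    Reduces a list of numbers to two whose sum equals the sum of the list
    (the structure used in fast multipliers).
    """
    numbers = list(numbers)
    while len(numbers) > 2:
        nxt = []
        while len(numbers) >= 3:          # one layer of full adders
            s, c = compress_3_to_2(numbers.pop(), numbers.pop(), numbers.pop())
            nxt += [s, c]
        numbers = nxt + numbers           # 0-2 leftovers pass straight through
    while len(numbers) < 2:
        numbers.append(0)
    return numbers[0], numbers[1]

vals = [13, 7, 5, 9, 1, 3]
s, c = compress_n_to_2(vals)
assert s + c == sum(vals)   # only the final 2:1 add needs a real carry chain
```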

  16. Binary PENC – O(log n) delay
  [pipeline: X[1:n] → PPC(OR) → A[1:n] → FA-tree(n) → carry-save pair → 2:1-Adder → Y[k:0]]
  • Tree of Full-Adders: delay of each full-adder is constant, depth is O(log n), output is a carry-save number.
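
  A functional sketch of this pipeline. The slide does not show whether the FA-tree sums A or its complement; to obtain the leading-zero count directly, this sketch feeds the complemented prefix-OR bits into the tree (the number of zeros in A equals LZ). All names are mine.

```python
def binary_penc_fast(x):
    """Binary PENC with O(log n) delay: prefix-OR, FA-tree, then one 2:1 add."""
    # PPC(OR): A[i] = OR(X[1..i]); feed NOT A[i] into the adder tree,
    # since the number of zeros in A is the leading-zero count of X.
    ones, seen = [], 0
    for bit in x:
        seen |= bit
        ones.append(1 - seen)

    def fa_3_to_2(a, b, c):
        # one column-parallel full-adder layer (constant delay)
        return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

    # FA-tree: reduce the bits with (3:2) compressors until two numbers remain.
    nums = ones
    while len(nums) > 2:
        nxt = []
        while len(nums) >= 3:
            s, c = fa_3_to_2(nums.pop(), nums.pop(), nums.pop())
            nxt += [s, c]
        nums = nxt + nums
    while len(nums) < 2:
        nums.append(0)

    # 2:1 adder: the only carry-propagate addition, O(log n) delay.
    return nums[0] + nums[1]

assert binary_penc_fast([0, 0, 1, 0]) == 2
assert binary_penc_fast([0, 0, 0, 0]) == 4
```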

  17. Binary PENC – divide & conquer
  [split X[1:n] into XL = X[1:n/2] and XR = X[n/2+1:n]]
  • XL = 00…0 ⇒ YL[k-1] = 1

  18. Binary PENC – divide & conquer
  [figure: X[1:n/2] and X[1+n/2:n] feed two Bin-PENC(n/2) blocks producing YL and YR; YL[k-1] drives a MUX(k) that selects between YL and YR+2^(k-1) (the "+2^(k-1)" step is a Half Adder); the MUX and the Half Adder have constant delay and O(log n) cost]
  • fan-out = k incurs O(log log n) delay
  • initial analysis: delay = O(log n), cost = O(n)
  • bottom line: delay = O(log n · log log n), cost = O(n)
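
  A recursive Python sketch of this scheme, returning the leading-zero count; testing yl == n/2 plays the role of YL[k-1], and adding n/2 to YR models the "+2^(k-1)" half-adder step. Names are illustrative.

```python
def binary_penc_dc(x):
    """Divide & conquer binary PENC: leading-zero count of X[1:n], n a power of two."""
    n = len(x)
    if n == 1:
        return 1 - x[0]
    half = n // 2
    yl = binary_penc_dc(x[:half])      # Bin-PENC(n/2) on the left half
    yr = binary_penc_dc(x[half:])      # Bin-PENC(n/2) on the right half
    left_all_zero = (yl == half)       # i.e. YL[k-1] == 1
    # MUX(k) plus the "+2^(k-1)" increment of the right half's result.
    return half + yr if left_all_zero else yl

assert binary_penc_dc([0, 0, 1, 0]) == 2
assert binary_penc_dc([0, 0, 0, 0]) == 4
assert binary_penc_dc([0, 1, 0, 0, 0, 0, 0, 1]) == 1
```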

  19. PENC – quick summary

  20. PENC – further issues
  • back to the FP-Adder: can we estimate LZ before subtracting?
  • must pre-process to avoid "catastrophic cancellation"!
  • method: partial compression (signed half-adders).

  21. focus – adder
  [figure: a Compound Adder takes the k-bit operands a and b and produces both a+b and a+b+1; the rounding decision drives a MUX that selects the RESULT]
  • Avoid INC after the rounding decision by pre-computing the increment.

  22. PPC Adder [FL,BK]
  • Define: σ[i] = k (kill) if A[i]=B[i]=0, p (propagate) if A[i]≠B[i], g (generate) if A[i]=B[i]=1.
  • computes carry bits C[n:1]
  • sum bits satisfy: S[i] = XOR(A[i], B[i], C[i]).
  • computation of carry bits C[n:1] – example: A[3:0]=0100, B[3:0]=1110 ⇒ σ[3:0]=pgpk, C[4:1]=1100.
  • claim: C[i+1] = 1 iff there is a j ≤ i with σ[j]=g and σ[j+1]=…=σ[i]=p.
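
  A Python sketch that reproduces the slide's example. The kill/propagate/generate definition of σ[i] is reconstructed from the example rather than taken verbatim from the slide, and the carries follow the claim above; function names are mine.

```python
def kpg(a_bit, b_bit):
    """Carry status of one bit position: kill, propagate, or generate."""
    if a_bit & b_bit:
        return 'g'            # both 1: a carry is generated
    if a_bit | b_bit:
        return 'p'            # exactly one 1: an incoming carry propagates
    return 'k'                # both 0: any carry is killed

def carries(a_bits, b_bits):
    """Carry bits per the claim: C[i+1]=1 iff some position j<=i generates
    and every position j+1..i propagates."""
    sigma = [kpg(a, b) for a, b in zip(a_bits, b_bits)]   # sigma[0] is bit 0
    c, carry = [], 0
    for s in sigma:
        carry = 1 if s == 'g' else (carry if s == 'p' else 0)
        c.append(carry)       # c[i] is C[i+1]
    return sigma, c

# Slide example: A[3:0]=0100, B[3:0]=1110  ->  sigma[3:0]=pgpk, C[4:1]=1100
sigma, c = carries([0, 0, 1, 0], [0, 1, 1, 1])    # bits listed from bit 0 up
assert ''.join(reversed(sigma)) == 'pgpk'
assert ''.join(str(b) for b in reversed(c)) == '1100'
```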

  23. PPC adder (cont.)
  • how to compute this event: define an operator ⊗ : {k,p,g} × {k,p,g} → {k,p,g} as follows: g ⊗ x = g, k ⊗ x = k, p ⊗ x = x.
  • claim: ⊗ is associative.
  • definition: Σ[i] = σ[i] ⊗ σ[i-1] ⊗ … ⊗ σ[0].
  • claim: C[i+1] = 1 iff Σ[i] = g.
  • compute Σ[i] using PPC with ⊗-gates!
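
  A sketch of the ⊗ operator and of Σ[i]. The prefix is computed here by a plain scan; in hardware, associativity lets a PPC network of ⊗-gates produce all prefixes in O(log n) depth. The rule p ⊗ x = x is what makes the slide's example (σ=pgpk ⇒ C[4:1]=1100) come out right; names are mine.

```python
def otimes(left, right):
    """The operator on {k, p, g}: g and k override whatever comes from
    below, p is the identity (passes the lower status through)."""
    return right if left == 'p' else left

def prefix_statuses(sigma):
    """Sigma[i] = sigma[i] (x) sigma[i-1] (x) ... (x) sigma[0]."""
    out, acc = [], None
    for s in sigma:
        acc = s if acc is None else otimes(s, acc)
        out.append(acc)
    return out

# Continue the slide's example: sigma[3:0] = pgpk, listed from bit 0 upward.
sigma = list('kpgp')
prefixes = prefix_statuses(sigma)               # ['k', 'k', 'g', 'g']
carry_bits = [1 if s == 'g' else 0 for s in prefixes]
assert carry_bits == [0, 0, 1, 1]               # C[4:1] = 1100
```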

  24. Compound Adder [T]
  • how to compute a+b & a+b+1?
    • use 2 separate adders, or
    • understand the PPC adder.
  • a+b+1 = (a+0.5)+(b+0.5); therefore the +1 behaves like an extra generate position below bit 0.
  • recall: Σ[i] = σ[i] ⊗ … ⊗ σ[0]. Now, for a+b+1: Σ'[i] = σ[i] ⊗ … ⊗ σ[0] ⊗ g.
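
  A sketch of the compound-adder idea: both sums are read off the same σ string, and the carries of a+b+1 are obtained by composing one extra g below bit 0 (Σ'[i] = Σ[i] ⊗ g), exactly the (a+0.5)+(b+0.5) argument. Function names and the bit-list interface are mine.

```python
def otimes(left, right):
    return right if left == 'p' else left

def compound_add(a_bits, b_bits):
    """Compute a+b and a+b+1 from one set of prefix statuses.

    Sigma[i]  = sigma[i] (x) ... (x) sigma[0]          -> carries of a+b
    Sigma'[i] = sigma[i] (x) ... (x) sigma[0] (x) g    -> carries of a+b+1
    so a single prefix network plus one extra gate column serves both sums.
    """
    n = len(a_bits)
    sigma = ['g' if a & b else 'p' if a ^ b else 'k'
             for a, b in zip(a_bits, b_bits)]
    sums, sums_inc, acc, acc_inc = [], [], None, 'g'    # carry-in 0 and 1
    for i in range(n):
        c = 1 if acc == 'g' else 0                      # carry into bit i of a+b
        c_inc = 1 if acc_inc == 'g' else 0              # carry into bit i of a+b+1
        sums.append(a_bits[i] ^ b_bits[i] ^ c)
        sums_inc.append(a_bits[i] ^ b_bits[i] ^ c_inc)
        acc = sigma[i] if acc is None else otimes(sigma[i], acc)
        acc_inc = otimes(sigma[i], acc_inc)
    return sums, sums_inc                               # bit 0 first, mod 2^n

def to_int(bits):
    return sum(b << i for i, b in enumerate(bits))

a, b = [1, 0, 1, 0], [1, 1, 0, 0]          # a = 5, b = 3 (bit 0 first)
s, s1 = compound_add(a, b)
assert to_int(s) == (5 + 3) % 16 and to_int(s1) == (5 + 3 + 1) % 16
```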

  25. Conclusion
  • faster modules require clever designs
  • starting point: gate count (for delay & cost)
  • must take fan-out & layout into account
  • lots of methods:
    • divide & conquer
    • parallel prefix computation
    • redundant arithmetic
