
CprE / ComS 583 Reconfigurable Computing

This lecture recaps reconfigurable computing basics, covers LUT mapping techniques, and surveys adder designs for FPGA arithmetic. It also discusses the performance of carry bypass, carry select, pipelining, and related optimizations.



Presentation Transcript


  1. CprE / ComS 583 Reconfigurable Computing • Prof. Joseph Zambreno, Department of Electrical and Computer Engineering, Iowa State University • Lecture #5 – FPGA Arithmetic

  2. Quick Points • HW #1 due Thursday at 12:00pm • Any comments? CprE 583 – Reconfigurable Computing

  3. Recap • Cluster size of N = [6-8] is good, K = [4-5]

  4. LUT Mapping Techniques

  5. LUT Mapping Techniques (cont.)

  6. LUT Mapping Techniques (cont.)

  7. Outline • Recap • Motivation • Carry / Cascade Logic • Addition • Ripple Carry • Carry Bypass • Carry Select • Carry Lookahead • Basic Multipliers

  8. Motivation • Traditional microprocessors, DSPs, etc. don’t use LUTs • Instead use a w-bit Arithmetic and Logic Unit (ALU) • Carry connections are hard-wired • No switches, no stubs, short wires [Figure: a 2-LUT (inputs A, B) compared with an ALU (inputs A, B, Op); logic operations AND2, OR2, XOR2 and arithmetic operations ADD, SUB, CMP; a one-bit add built from two 3-LUTs on A, B, Cin producing Sum and Cout]

  9. Adder Delays • Assuming a ripple-carry adder: • 32-bit ALU delay – 6 ns • 32 cascaded 4-LUTs – 32 x 2.5 ns = 80 ns • Motivates extra hardware to accelerate carry operations • Compare: 32-bit ALU (0.6λ): 16 ns area optimized, 6 ns delay optimized • 4-LUT delay: 2.0 ns logic delay + 0.5 ns single channel delay = 2.5 ns per bit
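The comparison on this slide is simple arithmetic, and a short sketch makes the gap explicit. The function name and constants below are illustrative; the delay values are the ones quoted on the slide, and the per-bit LUT delay is assumed to be logic delay plus one channel delay.

```python
# Ripple-carry delay model: every bit position adds one per-bit delay.
# All numbers come from the slide; names are illustrative only.

LUT_LOGIC_NS = 2.0      # 4-LUT logic delay
CHANNEL_NS = 0.5        # single routing channel delay
ALU_DELAY_NS = 6.0      # delay-optimized 32-bit ALU


def ripple_delay_ns(bits: int, per_bit_ns: float) -> float:
    """Worst-case delay when the carry ripples through every bit."""
    return bits * per_bit_ns


per_bit_ns = LUT_LOGIC_NS + CHANNEL_NS          # 2.5 ns per bit
lut_adder_ns = ripple_delay_ns(32, per_bit_ns)  # 32 cascaded 4-LUTs
slowdown = lut_adder_ns / ALU_DELAY_NS          # LUT adder relative to ALU

print(f"LUT ripple adder: {lut_adder_ns} ns, {slowdown:.1f}x slower than the ALU")
```

Under these figures the LUT-only adder is more than an order of magnitude slower than the hard-wired ALU, which is the motivation for dedicated carry hardware.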

  10. Altera Flex 8000 Carry Chain

  11. Xilinx XC4000 Carry Chain

  12. Cascades • Large fanin operations (reductions): • Decoding • Matching • Completion detection • Many-to-one reductions • Combining logic is simple • Makes use of dedicated paths

  13. Altera Cascade Logic • LE delay – 2.4 ns • Cascade delay – 0.6 ns
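To see what these two delays imply for a wide reduction, the sketch below models a cascaded AND. It assumes (this is not stated on the slide) that each logic element (LE) absorbs k = 4 inputs, that the first LE costs one full LE delay, and that each further LE on the chain adds only the cascade delay; general routing is ignored. The function names are hypothetical.

```python
import math

LE_DELAY_NS = 2.4       # from the slide
CASCADE_DELAY_NS = 0.6  # from the slide


def cascade_and(bits, k=4):
    """Wide AND built as a chain: each LE ANDs k inputs with the chain input."""
    chain = True
    for i in range(0, len(bits), k):
        lut_out = all(bits[i:i + k])   # one k-input LUT
        chain = chain and lut_out      # dedicated cascade AND
    return chain


def cascade_delay_ns(fanin, k=4):
    """Assumed model: first LE pays LE delay, the rest pay cascade delay."""
    les = math.ceil(fanin / k)
    return LE_DELAY_NS + (les - 1) * CASCADE_DELAY_NS


# A 16-input match (e.g. an address decode) uses four LEs on one chain.
print(cascade_and([1] * 16), round(cascade_delay_ns(16), 1))
```

Under this model a 16-input reduction costs roughly 4.2 ns, against at least two full LE delays plus routing for a two-level LUT tree.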

  14. Why Look at Arithmetic? • Parallelization • Specialization • Architecture • Size • Inputs • Adder problem – delay grows linearly with bit width • Solutions for larger adders: • Pipelining • Carry bypass • Carry select

  15. Adder Pipelining • Not as practical in ASIC world (registers are expensive) • Registers essentially “free” in FPGA logic blocks [Figure: three-stage pipelined adder; stage i adds Ai and Bi to produce Si and passes its Cout to the next stage]
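A rough functional model of the pipelined adder follows. It only traces the data flow (one operand slice per stage, with the carry handed to the next stage); it does not model clock cycles or the skew registers a real pipeline would need. The function name and the 12-bit, three-stage split are assumptions chosen to mirror the three-slice figure.

```python
def pipelined_add(a: int, b: int, width: int = 12, stages: int = 3):
    """Add two width-bit numbers one slice per stage, low slice first.

    Returns (sum modulo 2**width, final carry out).
    """
    if width % stages:
        raise ValueError("width must divide evenly into stages")
    chunk = width // stages
    mask = (1 << chunk) - 1
    result, carry = 0, 0
    for s in range(stages):
        a_s = (a >> (s * chunk)) & mask     # slice A(s+1)
        b_s = (b >> (s * chunk)) & mask     # slice B(s+1)
        total = a_s + b_s + carry           # short adder inside one stage
        result |= (total & mask) << (s * chunk)
        carry = total >> chunk              # value held in the stage register
    return result, carry


total, cout = pipelined_add(0xABC, 0x789)
print(hex(total), cout)
```

Each stage only has to settle a short `chunk`-bit carry chain per clock, which is why the cycle time drops even though total latency in cycles grows.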

  16. Carry Bypass Adders • If all the propagates are 1 (P0P1P2P3 = 1) then Cout3 = Cin • Pi = Ai xor Bi • Skip all the carry logic • Inexpensive [Figure: 4-bit ripple adder (A0..A3, B0..B3, carries Cout0..Cout2) whose final carry Cout3 comes from a mux, selected by P0P1P2P3, between the rippled carry (0) and Cin (1)]
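The bypass rule can be checked with a small bit-level sketch. Bits are listed least significant first, and the function names are illustrative rather than taken from the lecture.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def bypass_block(a_bits, b_bits, cin):
    """4-bit ripple block with a bypass mux on the carry out.

    Returns (sum bits, carry out, group propagate P0P1P2P3).
    """
    sums, carry, group_p = [], cin, 1
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
        group_p &= a ^ b                     # Pi = Ai xor Bi
    cout = cin if group_p else carry         # mux: skip the ripple path
    return sums, cout, group_p


# 0101 + 1010 with Cin = 1: every bit propagates, so Cout3 = Cin.
print(bypass_block([1, 0, 1, 0], [0, 1, 0, 1], 1))
```

The mux never changes the arithmetic result; it only lets the next block see the carry without waiting for it to ripple through four full adders.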

  17. Carry Bypass Performance • Small hardware cost: • 16-bit add – 4 CLB overhead • 32-bit add – 9 CLB overhead • Delay growth still linear, smaller slope [Figure: delay versus N-bit adder width for ripple carry and carry bypass]

  18. Carry Select Adders • Precompute addition value for (Cin = 0) case and (Cin = 1) case • Use mux to select between two with actual Cin value • Cost of this approach? [Figure: 8-bit adder; bits 0-3 added with Cin, bits 4-7 added twice (carry-in 0 and carry-in 1), with a mux choosing S4-7]
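A minimal sketch of the 8-bit arrangement in the figure, assuming a 4-bit lower block and a duplicated 4-bit upper block. Names are illustrative, and inputs are assumed to lie in 0..255.

```python
def add_block(a: int, b: int, cin: int, width: int):
    """Plain width-bit adder: returns (sum, carry out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add8(a: int, b: int, cin: int = 0):
    """8-bit carry-select add built from 4-bit blocks."""
    lo_sum, lo_cout = add_block(a & 0xF, b & 0xF, cin, 4)
    # Both candidates for bits 4-7 are computed in parallel with the low block.
    hi_if_0 = add_block(a >> 4, b >> 4, 0, 4)
    hi_if_1 = add_block(a >> 4, b >> 4, 1, 4)
    hi_sum, cout = hi_if_1 if lo_cout else hi_if_0   # the select mux
    return (hi_sum << 4) | lo_sum, cout


print(carry_select_add8(0x9C, 0x7A))
```

The sketch also answers the cost question on the slide: the upper block is instantiated twice and a mux is added, so area roughly doubles for every block except the first.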

  19. Linear Carry-Select • Adder delay = w, mux delay = 1 [Figure: 32-bit adder in four 8-bit blocks (bits 7-0, 15-8, 23-16, 31-24), each computed for carry-in 0 and carry-in 1; all block sums are ready at t8 and the muxes resolve S7-0, S15-8, S23-16, S31-24 at t9, t10, t11, t12]
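Using the slide's unit model (a w-bit block adder takes w time units, a mux takes 1), the resolve time of each block in a linear carry-select adder can be tabulated directly. The function name is hypothetical, and the model assumes every block, including the first, ends in a mux, as the t9 to t12 labels in the figure suggest.

```python
def linear_carry_select_times(width: int, block: int) -> list:
    """Time at which each equal-size block's mux output is valid."""
    if width % block:
        raise ValueError("block size must divide the adder width")
    blocks = width // block
    # All block adders finish together at t = block; muxes then fire in series.
    return [block + i + 1 for i in range(blocks)]


times = linear_carry_select_times(32, 8)
print(times, "total:", times[-1], "vs ripple:", 32)
```

For 32 bits in 8-bit blocks the last sum settles at 12 units instead of 32, but the later blocks sit idle while the mux chain catches up.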

  20. Square-Root Carry Select • Each carry arrives when the corresponding sum is ready [Figure: 32-bit adder in blocks of 4, 5, 6, 7, 8 and 2 bits (bits 3-0, 8-4, 14-9, 21-15, 29-22, 31-30), each computed for carry-in 0 and carry-in 1; block sums are ready at t4, t5, t6, t7, t8 and the muxes resolve at t5 through t10]
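The benefit of growing block sizes shows up in a small timing model. As a simplifying assumption, a w-bit block takes w units, a mux takes 1 unit, and a block's mux can fire once both its own sums and the incoming carry are ready. The function name is illustrative; the block sizes are read off the bit ranges in the figure.

```python
def carry_select_times(block_sizes):
    """Mux output time for each block, least significant block first."""
    times, carry_ready = [], 0
    for w in block_sizes:
        # Wait for the later of: this block's adders, the incoming carry.
        carry_ready = max(carry_ready, w) + 1
        times.append(carry_ready)
    return times


sqrt_blocks = [4, 5, 6, 7, 8, 2]       # sums to 32 bits
linear_blocks = [8, 8, 8, 8]

print(carry_select_times(sqrt_blocks))    # each carry meets a just-ready sum
print(carry_select_times(linear_blocks))
```

With staggered blocks every mux fires as soon as its block finishes, so the 32-bit result settles at 10 units rather than 12 under the same model.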

  21. Constant Addition • If one operand is constant: • More speed? • Less hardware? [Figure: general 4-bit adder (HA, FA, FA, FA) with inputs A0..A3 and constant bits 0/1 producing S0..S3 and C3, next to a reduced version built from half adders (HA) only]
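One way to see the hardware saving: when the second operand bit is a known constant, each full adder collapses to a two-input function of the data bit and the incoming carry. The sketch below is an illustration under that assumption, with hypothetical names and bits listed least significant first.

```python
def to_bits(value: int, width: int):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def add_constant_bits(a_bits, k_bits):
    """Add a constant using only two-input cells per bit position."""
    sums, carry = [], 0
    for a, k in zip(a_bits, k_bits):
        if k == 0:
            # Constant bit 0: a half adder (XOR for sum, AND for carry).
            s, carry = a ^ carry, a & carry
        else:
            # Constant bit 1: sum is XNOR, carry is OR.
            s, carry = 1 ^ a ^ carry, a | carry
        sums.append(s)
    return sums, carry


sums, c3 = add_constant_bits(to_bits(0b0110, 4), to_bits(0b0101, 4))
print(from_bits(sums), c3)
```

Each stage now depends on two signals instead of three, which is where both the speed and the area gain come from.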

  22. Multiplication • Shift and add operations • Need N bit adder, M cycles • Example: 101010 (42, multiplicand, N bits) x 1010 (10, multiplier, M bits) • Partial products: 000000, 101010 shifted left 1, 000000, 101010 shifted left 3 • Product (N+M bits): 0110100100 = 420
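The shift-and-add procedure can be sketched as a loop that consumes one multiplier bit per cycle. The function name and argument names are illustrative.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, m_bits: int) -> int:
    """Multiply by accumulating one shifted partial product per cycle."""
    product = 0
    for cycle in range(m_bits):                  # M cycles
        if (multiplier >> cycle) & 1:
            product += multiplicand << cycle     # add the partial product
        # A 0 multiplier bit contributes an all-zero partial product.
    return product


result = shift_add_multiply(0b101010, 0b1010, m_bits=4)
print(result, format(result, "010b"))
```

For the slide's operands (42 and 10) the loop adds the multiplicand shifted by 1 and by 3, giving 84 + 336 = 420.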

  23. Array Multiplier • Area – N x M cells • Delay – O(N+M) [Figure: 4 x 4 array multiplier; rows of adders combine X3..X0 gated by Y0..Y3 to produce product bits Z0, Z1, Z2, ...]

  24. Carry-Save [Figure: 4 x 4 carry-save multiplier array with inputs X3..X0 and Y0..Y3 producing product bits Z0, Z1, Z2, ...]
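The slide shows only the array, so the following is a sketch of the usual carry-save idea rather than the exact circuit pictured: each row keeps a sum vector and a carry vector separately, with no carry propagation inside the row, and a single carry-propagate add at the bottom produces the product. Names are illustrative.

```python
def carry_save_multiply(x: int, y: int, m_bits: int) -> int:
    """Accumulate partial products in (sum, carry) form, then merge once."""
    s, c = 0, 0
    for i in range(m_bits):
        pp = (x << i) if (y >> i) & 1 else 0       # AND-gated partial product
        # Bitwise 3:2 compression: every bit position is an independent
        # full adder, so no carry ripples along the row.
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    return s + c                                   # final carry-propagate add


print(carry_save_multiply(0b1011, 0b1101, m_bits=4))
```

The invariant is that `s + c` always equals the running total of partial products, so only the last adder has a long carry chain.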

  25. Multiplier Pipelining • Register cost: • Multiplicand – (N bits/stage x M stages) • Multiplier – (M^2 + M) / 2 bits • Early output values – (M^2 + M) / 2 bits • Total – M x (N + M + 1) bits • Critical path = max: • DFF + FA + setup • Bottom-level adder
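The total follows from summing the three register groups, which a few lines can confirm. The function name is hypothetical; the three terms are the ones listed on the slide.

```python
def pipeline_register_bits(n: int, m: int) -> int:
    """Register bits for an N x M multiplier pipelined at every row."""
    multiplicand = n * m                 # N bits carried through M stages
    multiplier = (m * m + m) // 2        # triangular: M + (M-1) + ... + 1
    early_outputs = (m * m + m) // 2     # finished product bits held to the end
    return multiplicand + multiplier + early_outputs


for n, m in [(8, 8), (16, 16), (6, 4)]:
    print(n, m, pipeline_register_bits(n, m), m * (n + m + 1))
```

For an 8 x 8 multiplier this is 136 register bits, which is cheap on an FPGA where each logic block already contains a flip-flop.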

  26. Constant Multiplier • Can greatly reduce the number of adders • Removes all AND gates [Figure: array multiplier specialized for the constant Y3Y2Y1Y0 = 1010; only the rows for Y1 = 1 and Y3 = 1 remain, feeding a single row of adders that produces Z0, Z1, Z2, ...]

  27. LUT-based Constant Multipliers • k-LUT can perform constant multiply of k-bit number • Break operand into k-bit quantities • Example: 8-bit x 8-bit constant, k=4: 10101011 x NNNNNNNN = AAAAAAAAAAA (N * 1011 (LSN)) + BBBBBBBBBBBBBBB (N * 1010 (MSN)) = SSSSSSSSSSSSSSS (product)
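A small model of the table-lookup scheme: one 16-entry table holds the constant times every possible nibble, the operand is split into its least and most significant nibbles (LSN, MSN), and the two lookups are added with the MSN result shifted by k bits. The function names and the constant 0x5D are made up for illustration.

```python
def make_constant_table(constant: int, k: int = 4):
    """Contents of the k-input LUTs: constant * i for every k-bit value i."""
    return [constant * i for i in range(1 << k)]


def lut_constant_multiply(x: int, table, k: int = 4, x_bits: int = 8) -> int:
    """Multiply x by the table's constant using one lookup per k-bit slice."""
    mask = (1 << k) - 1
    product = 0
    for shift in range(0, x_bits, k):
        nibble = (x >> shift) & mask
        product += table[nibble] << shift    # align the slice result, then add
    return product


table = make_constant_table(0x5D)            # hypothetical 8-bit constant N
x = 0b10101011                               # operand from the slide
print(lut_constant_multiply(x, table), 0x5D * x)
```

In hardware each output bit of the table is its own k-input LUT, so an 8-bit constant needs twelve 4-LUTs per nibble plus one adder to combine the two partial results.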

  28. Summary • Latency overhead of programmable logic • Several approaches to reducing design latency: • Fast carry • Cascade • Hardwired connections • Multiplier optimization goals different from adder • Other techniques: • Logarithmic vs. linear (Wallace Tree multiplier) • Data encoding (Booth’s multiplier)
