
Optimizing Multipliers for the CPU: A ROM based approach

Michael Moeng, Jason Wei. Electrical Engineering and Computer Science, University of California: Berkeley.


Presentation Transcript


  1. Optimizing Multipliers for the CPU: A ROM based approach • Michael Moeng, Jason Wei • Electrical Engineering and Computer Science, University of California: Berkeley

  2. Problem • Many power-limited applications for CPU • Media/Graphics • Portable applications • Investigating the impact of different multiplier designs on power and performance of CPU: • SimpleScalar to model CPU and benchmarks • Modify SimpleScalar multiplier cycle times to model different multiplier architectures

  3. Array Multipliers • AND function to multiply bits • Critical path in carry-chain
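As a rough illustration of the array multiplier idea, the sketch below forms each partial-product bit with an AND and sums the shifted rows. The function name `array_multiply` and the default width are assumptions made for this example; it models the arithmetic only, not the carry-chain timing the slide refers to.

```python
def array_multiply(a, b, width=8):
    """Multiply two unsigned `width`-bit integers from AND-generated partial products."""
    result = 0
    for i in range(width):
        b_bit = (b >> i) & 1
        # Row i of the array: bit j is (a_j AND b_i).
        row = 0
        for j in range(width):
            a_bit = (a >> j) & 1
            row |= (a_bit & b_bit) << j
        # Each row is shifted by its weight and accumulated; in hardware this
        # accumulation is the ripple of adders that forms the critical path.
        result += row << i
    return result
```

For instance, `array_multiply(13, 11)` adds the rows for bits 0, 1 and 3 of 11, i.e. 13 + 26 + 104.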

  4. Wallace Multipliers • Critical path shortened • Final Adder still needed to combine partial products • Power consumption approximately the same as Array Multiplier
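The reduction step can be sketched with 3:2 carry-save adders: three rows are compressed into a sum row and a carry row without propagating carries, and only the last two rows go through a carry-propagating final adder. This is a simplified model under assumed names (`carry_save_add`, `wallace_multiply`), not a gate-accurate Wallace tree.

```python
def carry_save_add(a, b, c):
    """3:2 compressor applied bitwise: returns (sum_row, carry_row) with a+b+c preserved."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def wallace_multiply(a, b, width=8):
    """Multiply unsigned integers by reducing partial products with carry-save adders."""
    # One shifted partial product per multiplier bit.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    # Each pass compresses every complete group of three rows into two rows.
    while len(rows) > 2:
        reduced = []
        for i in range(0, len(rows) - 2, 3):
            reduced.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fit into a group of three pass through unchanged.
        reduced.extend(rows[len(rows) - len(rows) % 3:])
        rows = reduced
    # Final adder combines the remaining rows (at most two).
    return sum(rows)
```

The number of reduction passes grows logarithmically with the number of rows, which is why the critical path is shorter than in the array multiplier.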

  5. Modified Booth Representation • 3 bits ($Y_{i+1}$, $Y_i$, $Y_{i-1}$) examined at a time, even values of i traversed • Reduces partial products by half • However, overhead required to generate signals, MUXes • $Y_{-1} = 0$ • Examples: 1 1 1 1 [0] -> 0, -1 • 0 1 1 0 [0] -> 2, -2
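The recoding in the examples above can be sketched as follows. The function name `booth_radix4_digits` is an assumption for illustration; each digit is computed as $-2Y_{i+1} + Y_i + Y_{i-1}$ over overlapping 3-bit windows, with an appended $Y_{-1} = 0$.

```python
def booth_radix4_digits(y, width):
    """Radix-4 Booth digits of a `width`-bit two's-complement value, least significant first."""
    assert width % 2 == 0, "sketch assumes an even operand width"
    y &= (1 << width) - 1
    extended = y << 1  # append Y_{-1} = 0 below the least significant bit
    digits = []
    for i in range(0, width, 2):  # only even values of i are visited
        window = (extended >> i) & 0b111  # bits Y_{i+1}, Y_i, Y_{i-1}
        y_hi = (window >> 2) & 1
        y_mid = (window >> 1) & 1
        y_lo = window & 1
        digits.append(-2 * y_hi + y_mid + y_lo)  # always in {-2, -1, 0, 1, 2}
    return digits


print(booth_radix4_digits(0b1111, 4))  # [-1, 0]  (slide order, MSB first: 0, -1)
print(booth_radix4_digits(0b0110, 4))  # [-2, 2]  (slide order, MSB first: 2, -2)
```

A 4-bit operand yields two digits instead of four partial products, which is the halving mentioned on the slide.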

  6. Read Only Memory • Desirable because of low power requirements • Drawbacks are read delay and size • Reported figures: 240 MHz -> 4.2 ns delay • Consumes 3.24 mW at 100 MHz (10 ns delay)

  7. ROM-based multipliers • ROM-based multipliers attractive • Issue of space • 32-bit multiplier requires $2^{32} \times 2^{32} \times 64$ bits, which is unrealistic • Techniques to reduce table sizes • Karatsuba Algorithm: • $A = A_{31:16}A_{15:0}$, $B = B_{31:16}B_{15:0}$ • $A \cdot B = (A_{31:16}B_{31:16} \ll 32) + (A_{15:0}B_{31:16} \ll 16) + (A_{31:16}B_{15:0} \ll 16) + A_{15:0}B_{15:0}$ • Reduces table size to $2^{16} \times 2^{16} \times 32$ bits, but requires 4 lookups and 3 additions. • Using multiple, parallel lookups still uses fewer bits than regular table lookup
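A scaled-down sketch of the four-lookup split is given below. To keep the table small enough to build, it uses 8-bit operands split into 4-bit halves (the slide uses 32-bit operands and 16-bit halves); `HALF_BITS`, `PRODUCT_TABLE` and `split_multiply` are names assumed for this example. Note that the split on the slide uses four half-width products, whereas the classic Karatsuba algorithm trades one of them for extra additions and uses three.

```python
HALF_BITS = 4  # assumption for the demo; the slide's design uses 16
HALF_MASK = (1 << HALF_BITS) - 1

# Stand-in for the ROM: one entry per pair of half-width operands.
PRODUCT_TABLE = [
    [x * y for y in range(1 << HALF_BITS)] for x in range(1 << HALF_BITS)
]


def split_multiply(a, b):
    """Multiply two 2*HALF_BITS-bit unsigned integers with 4 lookups and 3 additions."""
    a_hi, a_lo = a >> HALF_BITS, a & HALF_MASK
    b_hi, b_lo = b >> HALF_BITS, b & HALF_MASK
    hh = PRODUCT_TABLE[a_hi][b_hi]  # weight 2^(2*HALF_BITS)
    lh = PRODUCT_TABLE[a_lo][b_hi]  # weight 2^HALF_BITS
    hl = PRODUCT_TABLE[a_hi][b_lo]  # weight 2^HALF_BITS
    ll = PRODUCT_TABLE[a_lo][b_lo]  # weight 1
    return (hh << (2 * HALF_BITS)) + (lh << HALF_BITS) + (hl << HALF_BITS) + ll
```

The four lookups are independent of one another, so in hardware they could be issued in parallel against separate ROM copies or ports.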

  8. ROM-based multipliers cont. • Vinnakota’s approach: use tables of squares • Let $x = \lfloor (A + B)/2 \rfloor$ and $y = \lfloor (A - B)/2 \rfloor$ • If $A_0 \oplus B_0 = 0$: $A \cdot B = x^2 - y^2$ • If $A_0 \oplus B_0 = 1$: $A \cdot B = x^2 - y^2 + B$ • Reduces table size to $2^{32} \times 64$ bits, further reducible with split-tables (introduced later), requires 2 table lookups and 3 (or 4) additions • Hybrid approach: • Use tables of squares to find partial products for Karatsuba algorithm
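The squares identity can be checked with a short sketch. The names `squares_multiply` and `squares` are assumptions for illustration, and the table is indexed by $|y|$ because $y$ may be negative when $A < B$ while $y^2$ is unchanged. When $A + B$ is odd, $A = x + y + 1$ and $B = x - y$, so $A \cdot B = x^2 - y^2 + B$, which is the correction term on the slide.

```python
def squares_multiply(a, b, squares):
    """Multiply unsigned integers using two lookups in a table of squares."""
    x = (a + b) >> 1  # floor((A + B) / 2)
    y = (a - b) >> 1  # floor((A - B) / 2); Python's >> floors for negative values
    product = squares[x] - squares[abs(y)]  # two table lookups, one subtraction
    if (a ^ b) & 1:  # A_0 xor B_0 = 1: A + B is odd, so add the correction term
        product += b
    return product


# Usage with an 8-bit table (assumed size for the demo).
SQUARES_8 = [n * n for n in range(256)]
print(squares_multiply(7, 4, SQUARES_8))  # 28
```

The table has one entry per operand value rather than one per operand pair, which is where the size reduction comes from.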

  9. Proposed Implementation • [Block diagram] $A = A_1A_0$, $B = B_1B_0$ -> $x_{11}$, $y_{11}$, … -> two $2^{16} \times 32$-bit ROMs -> $x_{11}^2$, $y_{11}^2$, … -> $A_1 \cdot B_1$, $A_1 \cdot B_0$, …
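One way to read the diagram is sketched below: each 16-bit partial product is obtained from a $2^{16}$-entry table of squares, and the four partial products are then combined by shifts and additions. This is an interpretation under stated assumptions, not the authors' exact datapath; a single Python list `SQUARES` stands in for both ROMs, and all function names are hypothetical.

```python
HALF_BITS = 16
HALF_MASK = (1 << HALF_BITS) - 1

# Stand-in for the 2^16-entry ROMs: each entry (n^2) fits in 32 bits.
SQUARES = [n * n for n in range(1 << HALF_BITS)]


def product_from_squares(a, b):
    """16-bit x 16-bit product from two square lookups."""
    x = (a + b) >> 1
    y = abs((a - b) >> 1)  # at most 32768, so it stays inside the table
    product = SQUARES[x] - SQUARES[y]
    if (a ^ b) & 1:
        product += b
    return product


def hybrid_multiply(a, b):
    """32-bit x 32-bit unsigned product built from four 16-bit partial products."""
    a1, a0 = a >> HALF_BITS, a & HALF_MASK
    b1, b0 = b >> HALF_BITS, b & HALF_MASK
    return (
        (product_from_squares(a1, b1) << 32)
        + (product_from_squares(a0, b1) << 16)
        + (product_from_squares(a1, b0) << 16)
        + product_from_squares(a0, b0)
    )
```

Each partial product needs two square lookups, so a full multiply performs eight lookups in this sketch; how many of them run in parallel depends on the number of ROM copies or ports.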

  10. Results • Most of the SPEC2000 benchmarks exhibited little or no performance loss (< 0.5%) from extra multiplier cycles: art, bzip*, gcc, gzip*, ijpeg, li, mcf, mesa, parser*, vpr • : Significant • * : Possibly significant • Of applications that did experience a drop in performance (extra cycles): • go.outorder (6.41%) – Go-playing program • m88ksim (5.39%) – chip simulator • perl (0.72%) – Perl interpreter • vortex (2.33%) – object-oriented database

  11. Further Work • Measurements: • Accurate power measurements • More specific benchmarks, targeting multimedia • Optimizations: • Tables: Vinnakota’s split-table work • If $A$, $B$ share lower $k$ bits, $A^2$, $B^2$ share lower $k+1$ bits. • Can change $2^N \times N$ table to $2^N \times (N - [k+1])$ and $2^k \times (k+1)$ tables. • Gives somewhat faster lookups and lower memory requirements. • Adders: • Adders can be optimized; final 64-bit additions are more like 48-bit additions. • Pipelining: multiplication operations can occur in up to 3 stages.
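The split-table property follows from $A^2 - B^2 = (A - B)(A + B)$: if $A$ and $B$ agree in their lower $k$ bits (with $k \ge 1$), then $A - B$ is divisible by $2^k$ and $A + B$ is even, so the squares agree in their lower $k+1$ bits. The sketch below uses this to store the low $k+1$ bits of every square in a small table indexed by the low $k$ bits of the operand. The helper names and the exact entry widths are assumptions for illustration and are not taken from the slides.

```python
def build_split_tables(n_bits, k):
    """Split a table of squares into a small low-bits table and a narrower main table."""
    assert 1 <= k <= n_bits, "the shared-low-bits property needs k >= 1"
    out_mask = (1 << (k + 1)) - 1
    # 2^k entries of k+1 bits: low bits of the square depend only on the low k operand bits.
    low_table = [(v * v) & out_mask for v in range(1 << k)]
    # 2^n_bits entries, each k+1 bits narrower than a full square.
    high_table = [(v * v) >> (k + 1) for v in range(1 << n_bits)]
    return low_table, high_table


def split_square(a, k, low_table, high_table):
    """Recombine the two lookups into the full square of `a`."""
    low = low_table[a & ((1 << k) - 1)]
    return (high_table[a] << (k + 1)) | low


# Usage with assumed demo sizes: 8-bit operands, k = 3.
LOW, HIGH = build_split_tables(8, 3)
print(split_square(200, 3, LOW, HIGH))  # 40000
```

The two lookups are independent, so the narrower main table and the small low-bits table can be read in parallel.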
