
Compiled code acceleration on FPGAs


Presentation Transcript


  1. Compiled code acceleration on FPGAs
  W. Najjar, B. Buyukkurt, Z. Guo, J. Villareal, J. Cortes, A. Mitra
  Computer Science & Engineering, University of California, Riverside

  2. Why? Are FPGAs a new HPC platform?
  Comparison of a dual-core Opteron (2.5 GHz) to Virtex-4 & Virtex-5 FPGAs on double-precision floating point:
  • Balanced allocation of adders, multipliers and registers
  • Use both DSP blocks and logic for multipliers, run at lower speed
  • Logic & wires reserved for I/O interfaces
  Source: David Strensky, "FPGAs Floating-Point Performance -- a pencil and paper evaluation," in HPCwire.com
  Future of Computing - W. Najjar

  3. ROCCC: Riverside Optimizing Compiler for Configurable Computing
  • Code acceleration
    • By mapping circuits to the FPGA
    • Achieves the same speed as hand-written VHDL code
  • Improved productivity
    • Allows design and algorithm space exploration
    • Keeps the user fully in control
    • We automate only what is very well understood

  4. Challenges
  • An FPGA is an amorphous mass of logic
    • Structure is provided by the code being accelerated
    • Repeatedly applied to a large data set: streams
  • Languages reflect the von Neumann execution model:
    • Highly structured and sequential (control-driven)
    • Vast, randomly accessible, uniform memory

  5. ROCCC Overview
  [Figure: compilation flow -- C/C++, Java, SystemC and binary front ends; high-level transformations (procedure, loop and array optimizations) produce Hi-CIRRF; low-level transformations (instruction scheduling, pipelining and storage optimizations) produce Lo-CIRRF; code generation emits VHDL for FPGAs and custom units, alongside CPU, DSP and GPU targets. CIRRF: Compiler Intermediate Representation for Reconfigurable Fabrics.]
  • Limitations on the code:
    • No recursion
    • No pointers

  6. A Decoupled Execution Model
  [Figure: input memory (on- or off-chip) → memory fetch unit → input buffer → multiple loop bodies, unrolled and pipelined → output buffer → memory store unit → output memory (on- or off-chip).]
  • Memory access decoupled from the datapath
  • Parallel loop iterations
  • Pipelined datapath
  • Smart buffer (input) performs data reuse
  • Memory fetch and store units and the datapath are configured by the compiler
  • Off-chip accesses are platform specific

  7. So far, a working compiler with …
  • Extensive optimizations and transformations
    • Traditional and FPGA-specific
    • Systolic arrays, pipelined unrolling, look-up tables
  • Compiler + hardware support for data reuse
    • > 98% reduction in memory fetches on image codes
  • Efficient code generation and pipelining
    • Within 10% of hand-optimized HDL code
  • Import of existing IP cores
    • Leverages a huge wealth of cores, integrated with C source code
  • Support for dynamic partial reconfiguration

  8. Example: 3-tap FIR

  #define N 516
  void begin_hw();
  void end_hw();

  int main() {
    int i;
    const int T[3] = {3, 5, 7};  /* coefficients applied to indices of A[] */
    int A[N], B[N];
    begin_hw();
    L1: for (i = 0; i <= (N-3); i = i + 1) {
      B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
    }
    end_hw();
  }

  Future of Computing - W. Najjar

  9. RC Platform Models
  [Figure: three reconfigurable-computing platform models, numbered 1-3, differing in how CPUs and FPGAs are coupled -- via the CPU's memory interface, via shared SRAM alongside the CPUs, or over a fast network with per-node memory and SRAM.]
  Future of Computing - W. Najjar

  10. What we have learned so far
  • Big speedups are possible
    • 10x to 1,000x on application codes over Xeon and Itanium: molecular dynamics, bioinformatics, etc.
    • Works best with streaming data
  • New paradigms and tools are needed
    • For spatio-temporal concurrency
    • Algorithms, languages, compilers, run-time systems, etc.

  11. Future? Very wide use of FPGAs
  • Why?
    • High throughput (> 10x) AND low power (< 25%)
  • How?
    • Mostly in Models 2 and 3, initially
    • Model 2: see Intel QuickAssist, XtremeData & DRC
    • Model 3: SGI, SRC & Cray
  • Contingency
    • Market brings the price of FPGAs down
    • Availability of some software stack
      • For savvy programmers, initially
  • Potential
    • Multiple “killer apps” (to be discovered)

  12. Conclusion
  We as a research community should be ready. Stamatis was.
  Thank you.
