Hardware Acceleration of Applications Using FPGAs

Andy Nisbet Andy.Nisbet@cs.tcd.ie http://www.cs.tcd.ie/Andy.Nisbet Phone: +353-(0)1-608-3682 FAX: +353-(0)1-677-2204

Content • FPGAs? • High-level language hardware compilation. • High Performance Computing Using FPGAs??? • HandelC. • Research Directions at Trinity College using FPGAs.

FPGAs • FPGAs can be configured after the time of manufacture. • They contain configurable logic blocks, input/output blocks for connecting to external microchip pins, and programmable interconnect. • Logic blocks can be configured and interconnected to form simple combinational logic structures and complex functional units. • FPGAs can provide a standalone solution to a task, or they can be used in tandem with a conventional microprocessor.

FPGA Development Boards

How are FPGAs Programmed? • Conventional techniques use VHDL or Verilog. These require many low-level hardware details. • Synthesis tools then convert the HDL into an EDIF netlist, which is placed and routed onto an FPGA device to produce a bit/configuration file. • High-level languages such as HandelC and SystemC. A hardware compiler translates the specification into VHDL/Verilog, or an EDIF netlist.

Why use an FPGA? • Conventional microprocessors have a fixed architecture. • FPGAs can generate application-specific logic where the balance and mix of functional units can be altered dynamically. • The number and type of functional units that can be instantiated on an FPGA are limited only by the silicon real estate available. • Potential to generate orders of magnitude speedup for computationally intensive algorithms, such as in signal/image-processing. • Maximum FPGA clock speed is <= 400 MHz; designs often run at ~50 MHz.

High Performance Computing using FPGAs • Current FPGAs can instantiate multiple floating-point units. Applications work has focussed on using integer and fixed-point arithmetic. • Logarithmic Number System (LNS) ALU has single 20/32-bit precision with very small area in comparison to standard floating-point units. • Performance benefits for this system have already been demonstrated over Texas Instruments TMS320C3/4x 50 MHz DSP processors on a 2 million gate Xilinx FPGA device.

LNS on XC2V8000 • 14 independent 32-bit ALUs (2-3 GFlop Peak Estimate). • 336 independent 20-bit ALUs (20-40 GFlop Peak Estimate). • Clock 60 MHz (predicted); ADD and SUB latency 6 cycles, pipelined. MUL, DIV, SQRT => depend just on the number of parallel 32-bit adders, subtractors or bit shifters.
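The last bullet follows from how a logarithmic number system stores values: a number is held as its base-2 logarithm, so multiply, divide and square root become add, subtract and halve (a one-bit shift in fixed-point hardware). The following is a minimal Python sketch of that idea only. It uses ordinary floats for the log field rather than the 20/32-bit encoding of the ALU described above, handles positive values only, and all function names are hypothetical.

```python
import math

def to_lns(x):
    """Encode a positive real as its base-2 logarithm (sign/zero handling omitted)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    return math.log2(x)

def from_lns(e):
    """Decode a log-domain value back to an ordinary real."""
    return 2.0 ** e

def lns_mul(a, b):
    return a + b          # log2(x*y) = log2(x) + log2(y): one adder

def lns_div(a, b):
    return a - b          # log2(x/y) = log2(x) - log2(y): one subtractor

def lns_sqrt(a):
    return a / 2          # log2(sqrt(x)) = log2(x)/2: a one-bit right shift

def lns_add(a, b):
    # Addition is the costly operation in LNS:
    # log2(x + y) = max + log2(1 + 2^(min - max)), a non-linear function of the difference.
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```

For example, `from_lns(lns_mul(to_lns(6.0), to_lns(7.0)))` is approximately 42. The non-linear term in `lns_add` is the part that needs real hardware effort, which is consistent with ADD and SUB being the operations quoted with a multi-cycle pipelined latency.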

HandelC • Provided by Celoxica http://www.celoxica.com/ • Based on CSP/OCCAM (we’re not showing CSP aspects in this talk!) • C with hardware extensions. • Enables the compilation of programs into synchronous/clocked hardware. • Many other similar systems: SystemC, Forge, JHDL, SA-C …

HandelC • Fork-join model of parallel computation; parallel statements are placed in a par{} block. • CSP aspect uses channels, useful for multiple clock domains. • Each assignment statement takes ONE clock cycle to execute. • Clock period is determined after place and route; the maximum clock rate is 1/(longest logic gate + routing delay).
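The relationship between the longest delay and the clock rate can be written out as a small calculation. This is only a sketch: the function name and the delay figures used with it are hypothetical and are not taken from any particular device or tool report.

```python
def max_clock_mhz(logic_delay_ns, routing_delay_ns):
    """Maximum clock rate in MHz for a critical path with the given delays in ns."""
    period_ns = logic_delay_ns + routing_delay_ns  # shortest usable clock period
    return 1000.0 / period_ns                      # 1 / period, converted from ns to MHz
```

With an assumed 12 ns of logic delay and 8 ns of routing delay the period is 20 ns, so `max_clock_mhz(12.0, 8.0)` gives 50.0. Shortening the longest path is therefore the only way to raise the clock rate, which is the motivation for splitting long expressions into shorter pipelined steps.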

HandelC Simple Example
// Original code takes one LONG clock cycle.
unsigned int 32 x, a, b, c, d, e, f, g, h;
unsigned int 32 temp1, temp2, temp3, temp4, sum1, sum2;
x = a + b + c + d + e + f + g + h;
// Parallelise and pipeline into something taking 3 SHORT cycles
par {
    // all statements inside the par block are executed in parallel
    temp1 = a + b;
    temp2 = c + d;
    temp3 = e + f;
    temp4 = g + h;
    // position 1 SYNCHRONISATION?????
    sum1 = temp1 + temp2;
    sum2 = temp3 + temp4;
    // position 2 SYNCHRONISATION?????
    x = sum1 + sum2;
}
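One way to explore the SYNCHRONISATION question is to model the par block in software. The Python sketch below assumes the par block runs once per clock cycle (for example inside a loop) with the inputs held constant, that every right-hand side reads the register values from before the clock edge, and that the intermediate registers start at zero. The function names are hypothetical and 32-bit wrap-around is ignored.

```python
def par_cycle(regs):
    """One clock cycle of the par block: all right-hand sides read the old
    register values, then every assignment commits at the same clock edge."""
    old = dict(regs)
    new = dict(regs)
    new["temp1"] = old["a"] + old["b"]
    new["temp2"] = old["c"] + old["d"]
    new["temp3"] = old["e"] + old["f"]
    new["temp4"] = old["g"] + old["h"]
    new["sum1"] = old["temp1"] + old["temp2"]   # sees temp values from the previous cycle
    new["sum2"] = old["temp3"] + old["temp4"]
    new["x"] = old["sum1"] + old["sum2"]        # sees sum values from the previous cycle
    return new

def run(inputs, cycles):
    """Clock the par block `cycles` times and return x (registers assumed to start at 0)."""
    regs = {name: 0 for name in
            ("temp1", "temp2", "temp3", "temp4", "sum1", "sum2", "x")}
    regs.update(inputs)
    for _ in range(cycles):
        regs = par_cycle(regs)
    return regs["x"]
```

With inputs a..h set to 1..8, `run(inputs, 1)` and `run(inputs, 2)` still return 0, and `run(inputs, 3)` returns 36. Under these assumptions a single pass through the par block does not produce the sum: the block behaves as a three-stage pipeline whose first valid result appears after three cycles, after which a new result can emerge every cycle.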

Porting/Optimising Applications for HandelC • Define variable storage & bit width: off-chip SRAM, on-chip FPGA synthesised registers/RAM/Block RAM. • Replace floating-point with LNS or fixed-point arithmetic. • Iterative optimisation process: apply high-level restructuring transformations => see file.HTML
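As an illustration of the floating-point to fixed-point step, here is a minimal Python sketch of a fixed-point format with 16 fractional bits. The choice of 16 bits and all names are assumptions made for the example; a real port would pick the integer and fraction widths per variable and would also have to handle overflow at the chosen register width, which this sketch ignores.

```python
FRAC_BITS = 16            # assumed number of fractional bits
SCALE = 1 << FRAC_BITS    # 2**16

def to_fixed(x):
    """Convert a real value to a scaled integer."""
    return int(round(x * SCALE))

def to_float(q):
    """Convert a scaled integer back to a real value."""
    return q / SCALE

def fixed_add(a, b):
    return a + b                   # same scale on both sides: a plain integer add

def fixed_mul(a, b):
    return (a * b) >> FRAC_BITS    # product carries 2*FRAC_BITS fraction bits; shift back
```

For example, `to_float(fixed_mul(to_fixed(1.5), to_fixed(2.25)))` gives 3.375. Only integer adders, multipliers and shifts are needed, which is what makes the format attractive on an FPGA.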

Efficient HandelC • Replace parallel for loops with parallel while loops. The loop increment can then execute in parallel. • Avoid the use of n-bit (n >> 1) comparators <, <=, >, >= and single-cycle multipliers. • Parallelise and pipeline code as far as possible. • Use dedicated on-chip resources such as multipliers (interface command/VHDL/Verilog). • Sequential statements not on the critical path can share functional units in order to reduce area requirements. • Optimise variable storage: registers, distributed RAM, block RAM, or off-chip in SRAM.

For -> While
unsigned int 8 i;
for (i = 0; i < 255; i++) {
    par { … }
}
// becomes
unsigned int 1 terminate = 0;
i = 0;
while (!terminate) {
    par {
        terminate = (i == 254);
        i++;
    }
}
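The effect of this rewrite can be checked with a small software model. The Python sketch below assumes the usual HandelC timing rules, in which an assignment takes one clock cycle and a loop test takes none, so that the for loop spends one cycle on initialisation and one extra cycle per iteration on the increment. It also assumes the loop body is a single one-cycle par block. The function names are hypothetical.

```python
def for_loop_cycles(n_iterations, body_cycles=1):
    """Cycle count for the for loop: one initialisation cycle, then
    body plus a separate increment cycle on every iteration."""
    return 1 + n_iterations * (body_cycles + 1)

def while_loop_trace():
    """Simulate while(!terminate) par { terminate = (i == 254); i++; } with an 8-bit i.
    Returns (cycles spent in the loop, final value of i); the i = 0 cycle is not counted."""
    i, terminate, cycles = 0, 0, 0
    while not terminate:
        # par semantics: both right-hand sides read the old value of i
        terminate, i = int(i == 254), (i + 1) & 0xFF
        cycles += 1
    return cycles, i
```

Under these assumptions `for_loop_cycles(255)` is 511, while `while_loop_trace()` returns `(255, 255)`: the same 255 iterations at roughly half the cycle count, because the increment and the termination test share a cycle with the body. The test also becomes an equality check rather than a `<` comparison.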

Variable Storage
unsigned int 32 i;                    // REGISTER
unsigned int 23 j[40];                // ARRAY of REGISTERs (fully associative)
ram unsigned 8 myRAM[16];             // single port DISTRIBUTED RAM
mpram {                               // dual port DISTRIBUTED RAM
    ram unsigned int 8 readWrite[16]; // R/W
    rom unsigned int 8 readOnly[16];  // could be ram as well
} myMPRAM;
// to minimise the logic for access to RAMs/ROMs, USE registers for address and data
myRAM[aRegister] = aRegisterDataValue;
// Adding with {block = 1} makes BLOCK RAM
ram unsigned int 8 myBlockRAM[16] with {block = 1};
ram unsigned int 21 twoDim[12][8];
par {
    twoDim[0][aReg] = 0;
    twoDim[1][aReg] = 1;
}

FPGA Research at TCD • Interactive/Automatic iterative conversion from C to HandelC/SystemC/FORGE, prototype using SUIF/NCI (with David Gregg). • Application studies using lattice QCD, image segmentation, image processing (with Jim Sexton, Simon Wilson & Fergal Shevlin). Collision detection and telecommunication applications. • FPGA/SCI work (Michael Manzke). • Exploitation of “striped CPU” FPGAs. • Numerical stability? Floating-point to 20/32-bit LNS and fixed-point. • New work, no results (yet!); focussed on compute-bound applications, as PCI has poor I/O.
