
FPGA based Acceleration of Linear Algebra Computations.



Presentation Transcript


  1. B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan
  FPGA based Acceleration of Linear Algebra Computations

  2. Outline
  • Double Precision Dense Matrix-Matrix Multiplication
    • Motivation
    • Related Work
    • Algorithm
    • Design
    • Results
    • Conclusions
  • Double Precision Sparse Matrix-Vector Multiplication
    • Introduction
    • Prasanna
    • DeLorimier
    • David Gregg et al.
    • What can we do?

  3. FPGA based Double Precision Dense Matrix-Matrix Multiplication.

  4. Motivation
  • FPGAs have been making inroads into high-performance computing.
  • Accelerating BLAS Level 3 is achieved by accelerating matrix-matrix multiplication.
  • Modern FPGAs provide an abundance of resources; we must capitalise on them.

  5. Related Work {1/2}
  • The two main works are Dou and Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak performance.
  • Dou:
    • Optimised for a large Virtex-II Pro device (Xilinx).
    • Created his own MAC (not fully compliant).
    • Sub-block dimensions must be powers of 2.
    • Optimised for low IO bandwidth.

  6. Related Work {2/2}
  • Prasanna:
    • Scaling results in a speed degradation of about 35% (2 PEs to 20 PEs).
    • 2.1 GFLOPS on a Cray XD1 with Virtex-II Pro devices (XC2VP50).
    • For a design-only study (XC2VP125), they report 15% clock degradation going from 2 to 24 PEs.
    • They state that they have made no platform-specific optimisations for the implemented design.

  7. Algorithm
  • Broadcast ‘A’; keep a unique slice of ‘B’ per PE.
  • Multiply, and feed the product into the multiplier pipeline.
  • The output is fed directly to the Adder + RAM (accumulator).
  • When the updated C is ready, stream it out (see the sketch after this list).
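A minimal software analogue of this scheme, assuming each PE privately holds a block of columns of B and the matching block of the C accumulator (NUM_PE, the block sizes and all names below are illustrative, not taken from the slides):

```c
#include <stddef.h>

#define N       8            /* matrix dimension (illustrative)              */
#define NUM_PE  4            /* processing elements, each owns N/NUM_PE cols */
#define COLS_PE (N / NUM_PE)

/* C is built as a sum of rank-one updates: for each k, every PE receives
 * the broadcast column A[:,k], multiplies it with its private slice of
 * row B[k,:], and accumulates into its private slice of C (local BRAM).
 * C must be zero-initialised by the caller.                              */
static void rank_one_update_mm(const double A[N][N],
                               const double B[N][N],
                               double C[N][N])
{
    for (size_t k = 0; k < N; ++k) {                  /* one broadcast step     */
        for (size_t pe = 0; pe < NUM_PE; ++pe) {      /* PEs work in parallel   */
            for (size_t i = 0; i < N; ++i) {          /* broadcast value A[i][k] */
                for (size_t jj = 0; jj < COLS_PE; ++jj) {
                    size_t j = pe * COLS_PE + jj;     /* PE-private B column    */
                    C[i][j] += A[i][k] * B[k][j];     /* MAC into accumulator   */
                }
            }
        }
    }
}
```

In hardware, the inner multiply-accumulate maps onto the pipelined multiplier and the Adder + RAM pair, so each PE's slice of C stays on chip until the final result is streamed out.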

  8. Design-I

  9. Design-II

  10. FPGA Synthesis/PAR Data {1/2}
  • Table: Resource utilisation for the SX95T and SX240T (post PAR)
  • Table: Clock speed in MHz of the overall design for different numbers of PEs

  11. FPGA Synthesis/PAR Data {2/2}
  • Table: Resource utilisation for the Virtex-II Pro XC2VP100 (post PAR)

  12. Conclusions
  • We propose a variation of the rank-one update algorithm for matrix multiplication.
  • We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.
  • The two designs clearly show the effect of local storage on IO bandwidth.
  • The designs achieved a clock speed of 373 MHz with 40 PEs and a sustained performance of 29.8 GFLOPS on a single FPGA (see the check after this list). We also achieve 5.3 GFLOPS on an XC2VP100.
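As a quick check on the headline figure, assuming each PE completes one double-precision multiply-add (2 FLOPs) per clock cycle:

$$40\ \text{PEs} \times 2\ \frac{\text{FLOPs}}{\text{PE}\cdot\text{cycle}} \times 373\times 10^{6}\ \frac{\text{cycles}}{\text{s}} \approx 29.8\ \text{GFLOPS}$$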

  13. FPGA based Double Precision Sparse Matrix-Vector Multiplication.

  14. Introduction
  • There are three main papers we will be looking at:
    • Viktor Prasanna: hybrid method using an HLL + software + HDL.
    • Michael DeLorimier: maximum performance, but unrealistic assumptions.
    • David Gregg et al.: the most realistic assumptions with respect to DRAM.

  15. Prasanna
  • Uses pre-existing IP cores, specifically for an iterative solver (CG).
  • A 4-input reduction circuit computes the dot product, producing partial sums as output (see the sketch after this list).
  • An adder loop with an array sums the dot-product partial sums; created using an HLL.
  • A reduction circuit at the end uses a binary tree to produce the final value.
  • The IPs are available.
  • DRAM is looked at, but not realistically.
  • The order of the matrices is small.
  • DRAM is the bottleneck.
  • With their IPs they have a good architecture; however, one could change the IP and modify the datapath, e.g. use the Dou MAC.
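A rough software analogue of that reduction structure; the real design streams products through pipelined floating-point hardware, and the function and variable names below are illustrative, not taken from the paper:

```c
#include <stddef.h>

/* Dot product of one sparse row with x, organised the way the slide
 * describes: a 4-input reduction stage emits partial sums, and a final
 * tree-shaped pass combines the partials into the row result.          */
static double row_dot_reduced(const double *vals, const int *cols,
                              size_t nnz, const double *x)
{
    double partial[4] = {0.0, 0.0, 0.0, 0.0};

    /* 4-input reduction stage: products are folded into four running
     * partial sums (lane k % 4 handles every fourth product).          */
    for (size_t k = 0; k < nnz; ++k)
        partial[k % 4] += vals[k] * x[cols[k]];

    /* final reduction: combine the partial sums pairwise, as a tree.   */
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
```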

  16. DeLorimier
  • Uses BRAMs for everything.
  • Used for an iterative solver, specifically CG.
  • The MAC requires interleaving (see the sketch after this list).
  • They do load balancing in their partitioner, which requires a communication stage and is very matrix/partitioner dependent.
  • Communication is the bottleneck.
  • Performance: 750 MFLOPS per processor.
  • 16 Virtex-II 6000s.
  • Each has 5 PEs + 1 CE.
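The interleaving point refers to the deeply pipelined floating-point adder in the MAC: back-to-back additions must target independent accumulations or the pipeline stalls. A hedged sketch of one way to read this, interleaving the accumulation of several CSR rows round-robin so no accumulator is reused on consecutive additions (ADD_LATENCY and all names are illustrative assumptions):

```c
#include <stddef.h>

#define ADD_LATENCY 8   /* assumed FP adder pipeline depth (illustrative) */

/* Computes y[i] for rows first_row .. first_row + ADD_LATENCY - 1 of a
 * CSR matrix, interleaving the rows so that consecutive additions always
 * go to different accumulators -- the software analogue of keeping a
 * pipelined adder busy without read-after-write stalls.                  */
static void spmv_rows_interleaved(const int *row_ptr, const int *cols,
                                  const double *vals, const double *x,
                                  double *y, size_t first_row)
{
    int    idx[ADD_LATENCY];
    double acc[ADD_LATENCY];

    for (size_t r = 0; r < ADD_LATENCY; ++r) {
        idx[r] = row_ptr[first_row + r];
        acc[r] = 0.0;
    }

    int active = 1;
    while (active) {
        active = 0;
        for (size_t r = 0; r < ADD_LATENCY; ++r) {        /* round-robin  */
            if (idx[r] < row_ptr[first_row + r + 1]) {
                acc[r] += vals[idx[r]] * x[cols[idx[r]]]; /* independent  */
                idx[r]++;                                 /* accumulation */
                active = 1;
            }
        }
    }

    for (size_t r = 0; r < ADD_LATENCY; ++r)
        y[first_row + r] = acc[r];
}
```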

  17. David Gregg et al. (SPAR)
  • They only report the use of the SPAR architecture on FPGAs.
  • They use very pessimistic DRAM access times; the emphasis is on removing cache misses.
  • They do not use their block RAMs well; maybe something interesting can be done here.
  • 128 MFLOPS for 3 parallel SPAR units, but removing the cache misses gives a peak of 570 MFLOPS.

  18. What can we do?
  • Both use CSR; this is not required, so why not modify the representation? (The baseline CSR kernel is sketched after this list.)
  • Two approaches, and we can try both simultaneously:
    • Prasanna: split across dot products (same row, many PEs).
    • DeLorimier: split across rows (many rows, one PE).
  • Use the data from SPAR; a viable approach. Both do zero multiplies; we get away with one zero multiply per column.
  • Minimise communication or overlap it. We can use interleaving for this: while one stage computes, the previous one communicates.
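For reference, the plain CSR kernel both groups start from, with comments marking where the two parallelisation splits apply (a minimal sketch; the names are illustrative):

```c
#include <stddef.h>

/* Baseline CSR sparse matrix-vector multiply, y = A * x.
 * Prasanna-style parallelism splits the inner loop (one row's dot
 * product shared by many PEs); DeLorimier-style parallelism splits
 * the outer loop (each PE owns a set of whole rows).                */
static void spmv_csr(size_t n_rows, const int *row_ptr, const int *cols,
                     const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; ++i) {        /* split here: rows across PEs    */
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += vals[k] * x[cols[k]];         /* split here: one row across PEs */
        y[i] = sum;
    }
}
```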

  19. Questions?

  20. Thank You
