Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs (presentation transcript, 32 slides)

  1. Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs. Sam Skalicky, Sonia Lopez, Marcin Lukowiak, James Letendre, and Matthew Ryan. High Performance Research Group

  2. Outline • Motivation for designing an FPGA model • Assumptions made • Architecture decomposed into equations • Model equations • Model verification • Performance experiment • Conclusion

  3. Motivation Compute-intensive problems requiring high performance: data assimilation applications • Linear algebra computations: dot product, MV multiply, MM multiply, matrix inverse, and matrix decomposition • Execution time on a general-purpose processor makes their use impractical

  4. Motivation Compute-intensive problems requiring high performance: data assimilation applications • Medical diagnosis • NTEPI: Non-invasive Transmural Electrophysiological Imaging • Kalman filter: ECG & model • 120 electrodes • Solves the inverse propagation problem

  5. Heterogeneous Systems Taking full advantage of the compute capabilities of CPUs, GPUs, and FPGAs. Design decisions: • Which processor architecture to use? • Which implementation should be used? • How to handle different data sizes?

  6. Simulation Design-space exploration necessitates using processor models • CPU [1,2] • GPU [3,4] • FPGA [5,6] • Higher-level analysis prior to implementation • Model error rates of up to 15% • Computation-specific models

  7. Research Goals Modeling support for linear algebra computations on FPGAs • Linear algebra computations: dot product, MV multiply, MM multiply, matrix inverse, and matrix decomposition • Pipelined FPGA architectures • More parallelism = fewer cycles • Model inputs: architecture (latencies), clock speed, memory bandwidth, data size, pipeline size

  8. Assumptions We assume an FPGA system containing: • Memory controller: read/write elements, avoid bank conflicts, buffer up accesses (stored data is sequential) • Start/finish control logic • Data ordering or precomputing

  9. Assumptions We assume the pipeline structure is one of two types: multiple pipelines (a single pipeline replicated), or a scalable pipeline (having data dependencies or feedback) • Stages • Functional units: add, subtract, multiply, etc.

  10. Matrix-Vector Multiply Computation is characterized as: • Pipeline: MAC • Vector value stored • Data sizes: A (m×n), x (n×1) • Used n times • Performs m iterations to calculate a single result


  12. Model: Compute Cycles The core calculation in the model • Combine Uses() and Iterations() • with the number of usable Pipelines()

  13. Model: Compute Cycles [Figure: the 4×4 matrix A (elements A00–A33) and vector x (x0–x3) queued for Pipeline0, which produces outputs y0–y3]

  14. Model: Compute Cycles [Figure: each row of A paired with x is fed through Pipeline0 in turn, producing one output y0–y3 per pass]
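The slides name the ingredients of the compute-cycle count (Uses, Iterations, number of usable pipelines) but the equation itself was an image that did not survive extraction. The sketch below is an assumed reconstruction of that combination, not the authors' exact formula:

```python
import math

# Illustrative sketch only: combine Uses() and Iterations() with the number
# of usable pipelines. The exact equation is not shown on the slides, so
# this formula (ceil-distributed iterations) is an assumption.
def compute_cycles(uses: int, iterations: int, pipelines: int) -> int:
    # Iterations are distributed across pipelines; a partial group of
    # iterations still costs a full pass.
    return uses * math.ceil(iterations / pipelines)

# 4x4 matrix-vector multiply on one MAC pipeline, as in the figures:
# the vector is used 4 times and 4 iterations produce y0..y3.
print(compute_cycles(uses=4, iterations=4, pipelines=1))  # 16
```

With two pipelines the same workload would take half the passes, matching the slides' "more parallelism = fewer cycles" observation.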

  15. Model: Operating Frequency To calculate the frequency at which the design will operate • Memory bandwidth ratio: available to required • Frequency at which the design operates
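The slide names a memory bandwidth ratio (available to required) and a resulting operating frequency, but the equation image is missing. A hedged sketch of one plausible form, where the clock is capped by whichever is lower, the achievable clock or the bandwidth-scaled rate (the min() clamp is my assumption):

```python
# Assumed reconstruction, not the paper's formula: if memory cannot feed
# the pipelines at the design's maximum clock, the effective operating
# frequency scales down by the available/required bandwidth ratio.
def operating_frequency(f_max_hz: float,
                        bw_available_bps: float,
                        bw_required_bps: float) -> float:
    ratio = bw_available_bps / bw_required_bps
    return f_max_hz * min(1.0, ratio)

# Design needs 1.6 Gbps but only 800 Mbps is available: the clock halves.
print(operating_frequency(200e6, 800e6, 1600e6))  # 100000000.0
```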

  16. Model: Total Execution Time Combines all model equations into the final result • Total cycles required for the computation • Compute cycles • Latency (pipeline fill/empty) • Total execution time • Control logic latency • Memory access latency
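The slide lists the components of the total execution time (compute cycles plus pipeline fill/empty latency, converted to time, plus control-logic and memory-access latencies) but not the equation itself. The additive composition below is an assumption based on that list:

```python
# Assumed composition of the model's final result; the slides list these
# terms but do not show the equation.
def total_cycles(compute_cycles: int, pipeline_latency: int) -> int:
    # Pipeline latency accounts for filling and emptying the pipeline.
    return compute_cycles + pipeline_latency

def total_time_s(cycles: int, f_hz: float,
                 control_latency_s: float, memory_latency_s: float) -> float:
    # Convert cycles to seconds at the operating frequency, then add the
    # fixed control-logic and memory-access latencies.
    return cycles / f_hz + control_latency_s + memory_latency_s

cyc = total_cycles(compute_cycles=16, pipeline_latency=8)
print(cyc, total_time_s(cyc, 100e6, 1e-6, 2e-6))
```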

  17. Verification Hardware platform specifications used: • Data sizes: 5 to 8000 • FPGA devices: Virtex-6 & Virtex-7 • Memory bandwidth: 800 Mbps & 1.6 Gbps • Single & double precision floating point

  18. Verification Computations used to validate the accuracy of the model: dot product, MV multiply, MM multiply, Cholesky, and matrix inverse. Linear regression against measured results gives a coefficient of determination (R²) of 1.

  19. Verification: MM Multiply Computation used to validate the accuracy of the model. Linear regression against measured results gives a coefficient of determination (R²) of 1.

  20. Performance Results: Cholesky Control-flow dependent. Execution times for N×N matrices; pipeline sizes chosen to achieve maximum performance. Trade-off in pipeline size: a larger pipeline needs fewer cycles but runs at a slower clock rate, while a smaller pipeline needs more cycles but runs at a faster rate.

  21. Performance Results: M Inverse Hardware-resource dependent. Execution times for N×N matrices; pipeline sizes chosen to achieve maximum performance.

  22. Performance Results: MM Multiply Memory-bandwidth dependent. Execution times for N×N matrices; pipeline sizes chosen to achieve maximum performance.

  23. Conclusion • The model incorporates elements of the architecture through Uses, Iterations, and latencies • We showed that the size of the pipeline is a critical parameter for achieving maximum performance • The model quickly explores the design space without any implementation or simulation • The best accelerator can be configured to meet the needs of the application

  24. References [1] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha, “A framework for performance modeling and prediction,” in Supercomputing, ACM/IEEE 2002 Conference, Nov. 2002, p. 21. [2] M. Laurenzano, M. Tikir, L. Carrington, and A. Snavely, “PEBIL: Efficient static binary instrumentation for Linux,” in Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, March 2010, pp. 175–183. [3] S. Hong and H. Kim, “An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA ’09. New York, NY, USA: ACM, 2009, pp. 152–163. [4] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc, “A performance analysis framework for identifying potential benefits in GPGPU applications,” in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’12. New York, NY, USA: ACM, 2012, pp. 11–22. [5] B. Holland, K. Nagarajan, and A. George, “RAT: RC Amenability Test for Rapid Performance Prediction,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 1, no. 4, 2009. [6] B. Holland, A. George, H. Lam, and M. Smith, “An analytical model for multilevel performance prediction of multi-FPGA systems,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 4, no. 3, 2011.

  25. BACKUP SLIDES

  26. Performance Experiment To compute the time required given a matrix size • Based on the number of pipelines • Incorporating design factors: • Data precision (single or double) • Memory bandwidth • Available logic resources • Clock frequency
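A toy, self-contained version of this experiment can sweep the number of pipelines for an N×N matrix-matrix multiply and report cycle counts. The cost formula here (n MACs per output, n² outputs, plus a fixed fill latency) is my assumption for illustration, not the paper's calibrated model:

```python
import math

# Toy sweep over pipeline counts for an N x N matrix-matrix multiply.
# The cycle formula is an illustrative assumption, not the paper's model.
def mm_cycles(n: int, pipelines: int, fill_latency: int = 8) -> int:
    uses, iterations = n, n * n        # n MACs per result, n*n results
    return uses * math.ceil(iterations / pipelines) + fill_latency

# Report how cycle count falls as pipelines are added (N = 64).
for p in (1, 2, 4, 8):
    print(p, mm_cycles(64, p))
```

Extending this with data precision, memory bandwidth, logic resources, and clock frequency, as the slide lists, would turn the cycle counts into the execution times plotted in the results.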

  27–32. [Backup animation frames: elements of the 4×4 matrix A and vector x streaming through the pipeline, producing outputs y0, y1, y2, y3 in successive passes]