
Presentation Transcript


  1. Solving Challenging Numerical Linear Algebra Algorithms Using GPU Accelerators • Hatem Ltaief, KAUST Supercomputing Laboratory • Stanimire Tomov, University of Tennessee, Knoxville • ISC’12 Tutorial, Hamburg, June 17, 2012

  2. Outline • MAGMA: LAPACK for GPUs • Methodology overview • Use both GPUs and multicore CPUs • MAGMA: from single to multiGPU support • One-sided factorizations and linear solvers • Two-sided factorizations and eigensolvers • Dynamic scheduling approaches to DLA • MAGMA algorithms with dynamic scheduling • Conclusions

  3. MAGMA: LAPACK for GPUs • MAGMA: Matrix Algebra on GPU and Multicore Architectures • To provide LAPACK/ScaLAPACK on hybrid architectures • http://icl.cs.utk.edu/magma/ • MAGMA BLAS: a subset of BLAS for GPUs, highly optimized for NVIDIA GPGPUs • Fast GEMM for Fermi (CUBLAS 3.2) [IJHPCA’10] • MAGMA developers & collaborators: UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia) • Community effort, similar to LAPACK/ScaLAPACK
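
For illustration, here is a minimal sketch of how a hybrid MAGMA routine is typically called from C, using the GPU-interface LU factorization as an example. This is not taken from the slides: the header name, initialization calls, and exact signatures (magma.h, magma_init, magma_dsetmatrix, magma_dgetrf_gpu) follow the MAGMA 1.x style and may differ between versions, so treat it as an assumption-laden outline rather than the library's definitive API.

      /* Sketch: calling MAGMA's hybrid LU factorization (GPU interface).
       * Assumes a MAGMA 1.x-style API (magma.h, magma_dgetrf_gpu); names and
       * signatures may differ between versions. Error handling is minimal. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <cuda_runtime.h>
      #include <magma.h>

      int main(void)
      {
          magma_int_t n = 4096, lda = n, ldda = ((n + 31) / 32) * 32;  /* padded GPU leading dim */
          magma_int_t *ipiv, info;
          double *A, *dA;

          magma_init();                                  /* set up the MAGMA / CUDA context */

          A    = (double*) malloc(lda * n * sizeof(double));
          ipiv = (magma_int_t*) malloc(n * sizeof(magma_int_t));
          cudaMalloc((void**)&dA, ldda * n * sizeof(double));

          /* Fill A with a simple diagonally dominant test matrix (column-major). */
          for (magma_int_t j = 0; j < n; ++j)
              for (magma_int_t i = 0; i < n; ++i)
                  A[i + j*lda] = (i == j) ? 2.0*n : 1.0;

          /* Copy A to the GPU, then run the hybrid factorization: panels are
             factored on the CPU, trailing-matrix updates run on the GPU. */
          magma_dsetmatrix(n, n, A, lda, dA, ldda);
          magma_dgetrf_gpu(n, n, dA, ldda, ipiv, &info);
          magma_dgetmatrix(n, n, dA, ldda, A, lda);

          printf("dgetrf_gpu info = %d\n", (int) info);

          cudaFree(dA); free(A); free(ipiv);
          magma_finalize();
          return 0;
      }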

  4. A New Generation of DLA Software • MAGMA hybrid algorithms (heterogeneity friendly) rely on a hybrid scheduler and hybrid kernels

  5. MAGMA Software Stack

  6. MAGMA 1.1 • 50+ hybrid LAPACK algorithms have been developed (total of 200+ routines) • Every algorithm is in 4 precisions (s/c/d/z) • There are 3 mixed precision algorithms (zc & ds) • These are hybrid algorithms, expressed in terms of BLAS • MAGMA BLAS • A subset of GPU BLAS, optimized for Tesla and Fermi GPUs

  7. MAGMA Methodology • A methodology to use all available resources: MAGMA uses a HYBRIDIZATION methodology based on • Representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them • Properly SCHEDULING tasks' execution over the multicore and GPU hardware components • Successfully applied to fundamental linear algebra algorithms • One- and two-sided factorizations and solvers • Iterative linear and eigen-solvers • Productivity: 1) high-level; 2) leveraging prior developments; 3) exceeding the performance of homogeneous solutions • Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs)

  8. Hybrid Algorithms One-sided factorizations (LU, QR, Cholesky) • Hybridization • Panels (Level 2 BLAS) are factored on CPU using LAPACK • Trailing matrix updates (Level 3 BLAS) are done on the GPU using “look-ahead”
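
To make the look-ahead idea concrete, here is a small compilable skeleton of the panel/update pipeline. The functions factor_panel_cpu, update_panel_gpu, and update_trailing_gpu are hypothetical placeholders (not MAGMA routines); the sketch only demonstrates the ordering that lets the CPU factor the next panel while the GPU updates the rest of the trailing matrix.

      /* Skeleton of a hybrid one-sided factorization with look-ahead.
       * The stubs below are hypothetical placeholders for the real work:
       *   factor_panel_cpu()    -> Level 2 BLAS panel factorization (LAPACK on the CPU)
       *   update_panel_gpu()    -> update of the *next* panel only (GPU)
       *   update_trailing_gpu() -> update of the remaining trailing matrix (GPU)
       * In a real code the GPU calls are asynchronous, so the trailing update
       * of step k overlaps with the CPU factorization of panel k+1. */
      #include <stdio.h>

      static void factor_panel_cpu(int k)    { printf("CPU: factor panel %d\n", k); }
      static void update_panel_gpu(int k)    { printf("GPU: update panel %d (look-ahead)\n", k); }
      static void update_trailing_gpu(int k) { printf("GPU: update trailing matrix behind panel %d\n", k); }

      int main(void)
      {
          int npanels = 4;

          factor_panel_cpu(0);                   /* first panel */
          for (int k = 0; k < npanels; ++k) {
              if (k + 1 < npanels)
                  update_panel_gpu(k + 1);       /* look-ahead: update only the next panel */
              update_trailing_gpu(k);            /* launched asynchronously in a real code */
              if (k + 1 < npanels)
                  factor_panel_cpu(k + 1);       /* overlaps with the trailing update above */
          }
          return 0;
      }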

  9. A Hybrid Algorithm Example • Left-looking hybrid Cholesky factorization in MAGMA 1.0 • The difference from LAPACK: the 3 additional lines (shown in red on the slide) • Line 10 (done on the CPU) is overlapped with work on the GPU

  10. LU Factorization (Single GPU)

  11. From single to multiple GPU support • Data distribution: 1-D block-cyclic distribution of nb-wide panels over the GPUs (GPU0, GPU1, GPU2, GPU0, ...) • Algorithm: the GPU holding the current panel sends it to the CPU • All updates are done in parallel on the GPUs • Look-ahead is done with the GPU holding the next panel
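
As a small worked example of the 1-D block-cyclic layout, the owner of a panel is simply its block index modulo the number of GPUs. The snippet below is plain C for illustration only (not MAGMA code); the values of n, nb, and ngpu are arbitrary.

      /* 1-D block-cyclic ownership of nb-wide panels over ngpu GPUs (illustrative only). */
      #include <stdio.h>

      int main(void)
      {
          int n = 1024, nb = 128, ngpu = 3;
          for (int j = 0; j < n; j += nb) {
              int panel = j / nb;
              int owner = panel % ngpu;          /* GPU0, GPU1, GPU2, GPU0, ... */
              printf("panel %d (cols %d..%d) -> GPU%d\n", panel, j, j + nb - 1, owner);
          }
          return 0;
      }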

  12. LU factorization (multiGPUs)

  13. LU factorization (multiGPUs) • Matrix out of GPU memory

  14. Out of GPU Memory Algorithms • Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms) • The rest of the matrix stays on the CPU • Left-looking versions minimize writing on the CPU • Steps: copy A2 to the GPU; update A2 using A1 (a panel of A1 at a time); factor the updated A2 using the existing hybrid code; copy the factored A2 back to the CPU • Trivially extended to multiGPUs: A2 is “larger” with a 1-D block-cyclic distribution, again reusing existing algorithms • (Figure: factored sub-matrix A1 on the CPU, the sub-matrix A2 to be factored on the GPU, and the untouched part of the matrix)
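
A sketch of the out-of-GPU-memory step described above, written as a C function: only the data movement uses real CUDA/cuBLAS calls (cudaMalloc, cublasSetMatrix, cublasGetMatrix), while apply_update_gpu and factor_A2_hybrid are hypothetical placeholders for the factorization-specific kernels (TRSM/GEMM/SYRK for LU or Cholesky, block reflectors for QR) and for the existing hybrid code.

      /* Sketch of the out-of-GPU-memory step: copy A2 to the GPU, update it with
       * the already-factored A1 one panel at a time, factor it with the existing
       * hybrid code, and copy it back.  Placeholders stand in for the kernels. */
      #include <stddef.h>
      #include <cuda_runtime.h>
      #include <cublas_v2.h>

      static void apply_update_gpu(cublasHandle_t h, const double *dPanel, double *dA2,
                                   int m, int n, int jb)
      { (void)h; (void)dPanel; (void)dA2; (void)m; (void)n; (void)jb; }   /* placeholder */

      static void factor_A2_hybrid(cublasHandle_t h, double *dA2, int ldda, int m, int n)
      { (void)h; (void)dA2; (void)ldda; (void)m; (void)n; }               /* placeholder */

      void factor_out_of_core_block(cublasHandle_t handle,
                                    const double *A1, int lda1, int k,   /* factored part (CPU)   */
                                    double *A2, int lda2, int m, int n,  /* block to factor (CPU) */
                                    int nb)
      {
          double *dA2, *dPanel;
          cudaMalloc((void**)&dA2,    (size_t)lda2 * n  * sizeof(double));
          cudaMalloc((void**)&dPanel, (size_t)lda1 * nb * sizeof(double));

          cublasSetMatrix(m, n, sizeof(double), A2, lda2, dA2, lda2);        /* 1) A2 -> GPU */

          for (int j = 0; j < k; j += nb) {                                  /* 2) update A2 */
              int jb = (j + nb <= k) ? nb : k - j;                           /*    with A1,  */
              cublasSetMatrix(m, jb, sizeof(double),                         /*    one panel */
                              A1 + (size_t)j * lda1, lda1, dPanel, lda1);    /*    at a time */
              apply_update_gpu(handle, dPanel, dA2, m, n, jb);
          }

          factor_A2_hybrid(handle, dA2, lda2, m, n);                         /* 3) factor A2 */
          cublasGetMatrix(m, n, sizeof(double), dA2, lda2, A2, lda2);        /* 4) A2 -> CPU */

          cudaFree(dA2); cudaFree(dPanel);
      }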

  15. Hybrid Algorithms • Two-sided factorizations (to bidiagonal, tridiagonal, and upper Hessenberg forms) for eigen- and singular-value problems • Hybridization • Trailing matrix updates (Level 3 BLAS) are done on the GPU (similar to the one-sided factorizations) • Panels (Level 2 BLAS) are hybrid: operations with memory footprint restricted to the panel are done on the CPU, while the time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU

  16. MultiGPU Two-Sided Factorizations • Need high-performance multi-GPU Level 2 BLAS (e.g., 50% of the flops in the tridiagonal reduction) • (Figure: performance of DSYMV on multiple M2090s) • T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.

  17. Hybrid Two-Sided Factorizations

  18. MAGMA Tridiagonalization in DP • 50% of the flops are in SYMV • Memory bound, i.e., it does not scale well on multicore CPUs • Use the GPU’s high memory bandwidth and optimized SYMV • 8x speedup over 12 Intel cores (X5660 @ 2.8 GHz) • Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB) • T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.
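
To see why SYMV is memory bound, a back-of-the-envelope roofline estimate: DSYMV performs about 2n^2 flops while reading at least n^2/2 matrix elements (8 bytes each, every element used twice), i.e. roughly 0.5 flop per byte, so its performance is capped near half the memory bandwidth. The snippet below just evaluates that bound; the 177 GB/s figure is an assumed nominal peak for an M2090, not a measurement from the slides.

      /* Back-of-the-envelope bound for DSYMV: ~2*n^2 flops over ~4*n^2 bytes
       * (n^2/2 doubles read once, each used twice), so Gflop/s <= 0.5 * GB/s.
       * The bandwidth value is an assumed nominal peak, not a measurement. */
      #include <stdio.h>

      int main(void)
      {
          double n         = 20000.0;            /* matrix size (illustrative)   */
          double flops     = 2.0 * n * n;        /* multiply-adds in y = A*x     */
          double bytes     = 0.5 * n * n * 8.0;  /* lower triangle, 8 B/double   */
          double bw_gb_s   = 177.0;              /* assumed M2090 peak bandwidth */
          double intensity = flops / bytes;      /* ~0.5 flop/byte               */

          printf("arithmetic intensity : %.2f flop/byte\n", intensity);
          printf("bandwidth bound      : %.0f Gflop/s per GPU\n", intensity * bw_gb_s);
          return 0;
      }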

  19. Further GPU Kernel Optimizations (Fermi 2070) • A. Abdelfattah, J. Dongarra, D. Keyes and H. Ltaief, Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators, VECPAR, Japan, 2012.

  20. Further GPU Kernel Optimizations (Fermi 2070)

  21. Further GPU Kernel Optimizations (Fermi 2070)

  22. From Static to Dynamic Scheduling… • Static scheduling may stall in situations where work is available • Hand-tuned optimizations • Hardware heterogeneity • Kernel heterogeneity • Separation of concerns • Dynamic Runtime System

  23. Punch Lines Productivity! Productivity! Productivity! Oh… Did I say Productivity?

  24. Block Algorithms • Panel-Update Sequence • Transformations are blocked/accumulated within the panel (Level 2 BLAS) • Transformations are applied at once to the trailing submatrix (Level 3 BLAS) • Parallelism hidden inside the BLAS • Fork-join model

  25. Block QR Factorization

  26. Fork-Join Paradigm

  27. Leveraging Block Algorithms… • (Figures: column-major data layout vs. tile data layout)

  28. Lessons Learnt from PLASMA • PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures: http://icl.cs.utk.edu/plasma/ • Tile Algorithms on homogeneous x86 cores • Parallelism is brought to the fore • May require the redesign of linear algebra algorithms • Tile data layout translation • Remove unnecessary synchronization points between Panel-Update sequences • DAG execution where nodes represent tasks and edges define dependencies between them • Dynamic runtime system environment QUARK

  29. Example: Tile QR Factorization • First panel factorization and corresponding updates • DAG for a 4x4 tile matrix

  30. Let’s go crazy! DAG of 20x20 tile QR Factorization

  31. Dynamic Scheduling • Conceptually similar to out-of-order processor scheduling because it has: • A dynamic runtime DAG scheduler • Out-of-order execution flow of fine-grained tasks • Task scheduling as soon as dependencies are satisfied • Producer-consumer relationships • Data-flow programming model: a five-decade-old concept • Think "how things connect" rather than "how things happen" • Assembly line • Inherently parallel

  32. Matrices Over Runtime Systems at Exascale (MORSE) • Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems" • Runtime challenges due to the ever-growing hardware complexity • Algorithmic challenges to fully exploit the hardware capabilities • Integrated into the MAGMA software stack

  33. MAGMA-MORSE: x86 + MultiGPUs • Lessons learned from PLASMA! • CUDA-based hybrid systems • New high-performance numerical kernels • StarPU runtime system (Augonnet et al., INRIA Bordeaux) • Both x86 and GPUs => hybrid computations • Similar to LAPACK in functionality

  34. Achieving High Level of Productivity • From Sequential Nested-Loop Code to Parallel Execution
      for (k = 0; k < min(MT, NT); k++) {
          zgeqrt(A[k;k], ...);
          for (n = k+1; n < NT; n++)
              zunmqr(A[k;k], A[k;n], ...);
          for (m = k+1; m < MT; m++) {
              ztsqrt(A[k;k], A[m;k], ...);
              for (n = k+1; n < NT; n++)
                  ztsmqr(A[m;k], A[k;n], A[m;n], ...);
          }
      }

  35. Achieving High Level of Productivity • From Sequential Nested-Loop Code to Parallel Execution
      for (k = 0; k < min(MT, NT); k++) {
          starpu_Insert_Task(&cl_zgeqrt, k, k, ...);
          for (n = k+1; n < NT; n++)
              starpu_Insert_Task(&cl_zunmqr, k, n, ...);
          for (m = k+1; m < MT; m++) {
              starpu_Insert_Task(&cl_ztsqrt, m, k, ...);
              for (n = k+1; n < NT; n++)
                  starpu_Insert_Task(&cl_ztsmqr, m, n, k, ...);
          }
      }
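
For context on what a cl_* object in the loop above is, here is a hedged sketch of a StarPU codelet that pairs a CPU and a CUDA implementation of the same tile kernel, so the runtime can schedule each submitted task on either resource. Field and function names follow the StarPU 1.x style (struct starpu_codelet, starpu_insert_task, STARPU_RW) and may differ in other versions; the kernel bodies are placeholders, and starpu_init() is assumed to have been called elsewhere.

      /* Sketch of a StarPU codelet with both CPU and CUDA implementations
       * (StarPU 1.x-style names; placeholders for the actual kernels). */
      #include <starpu.h>

      static void tile_kernel_cpu(void *buffers[], void *cl_arg)
      {
          double *tile = (double *) STARPU_MATRIX_GET_PTR(buffers[0]);
          (void) tile; (void) cl_arg;              /* call the CPU (LAPACK) kernel here */
      }

      static void tile_kernel_cuda(void *buffers[], void *cl_arg)
      {
          double *dtile = (double *) STARPU_MATRIX_GET_PTR(buffers[0]);
          (void) dtile; (void) cl_arg;             /* launch the CUDA/MAGMA kernel here */
      }

      static struct starpu_codelet cl_tile_kernel = {
          .cpu_funcs  = { tile_kernel_cpu },
          .cuda_funcs = { tile_kernel_cuda },
          .nbuffers   = 1,
          .modes      = { STARPU_RW },
      };

      int submit_tile_task(starpu_data_handle_t tile_handle)
      {
          /* The runtime tracks the RW dependency on tile_handle and decides,
             at run time, whether this task executes on a CPU core or a GPU. */
          return starpu_insert_task(&cl_tile_kernel, STARPU_RW, tile_handle, 0);
      }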

  36. Hybrid Architecture Targeted • PCIe interconnect, 16x, 64 Gb/s: a very thin pipe! • Fermi C2050: 448 CUDA cores, 515 Gflop/s

  37. Performance Charts • Cholesky • QR • LU • Symmetric Matrix Inversion

  38. Cholesky Performance • 8 Intel x86 cores + 3 Tesla GPUs • E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, Software for GPUs, GPU Computing Gems, vol. 2, 2011.

  39. QR Performance 8 AMD x86 cores + 4 Tesla GPUs E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.

  40. QR Performance • Adding the CPU cores contributes roughly an extra 200 Gflop/s, while 12 cores alone deliver only about 150 Gflop/s • 8 AMD x86 cores + 4 Tesla GPUs • E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.

  41. Performance Breakdown • Task distribution observed on StarPU: • sgeqrt: 20% of tasks on GPUs • stsmqr: 92.5% of tasks on GPUs • Taking advantage of heterogeneity! • Only do what you are good at • Don’t do what you are not good at

  42. LU Performance 8 Intel x86 cores + 3 Tesla GPUs E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief and S. Tomov, ACS/IEEE International Conference on Computer Systems and Applications (best paper award), 2011

  43. Symmetric Matrix Inversion • A−1, Seriously??? • YES! • Critical component of the variance-covariance matrix computation in statistics • Three steps: • Cholesky factorization (DPOTRF) • Inverting the Cholesky factor (DTRTRI) • Calculating the product of the inverted Cholesky factor with its transpose (DLAUUM) Built on previous work from E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, and L. Rosenberg
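
On the CPU side, the same inversion can be reproduced with plain LAPACK, which is a useful reference when checking a GPU version. The sketch below uses LAPACKE's dpotrf followed by dpotri; LAPACK's dpotri internally performs the TRTRI and LAUUM steps listed above, and the small SPD test matrix is just an illustrative stand-in.

      /* Reference CPU version of the SPD inversion: Cholesky (dpotrf) then
       * dpotri, which combines the TRTRI and LAUUM steps from the slide.
       * Only the lower triangle of the inverse is returned. */
      #include <stdio.h>
      #include <lapacke.h>

      int main(void)
      {
          /* Small SPD test matrix (column-major, lower triangle referenced). */
          const int n = 3, lda = 3;
          double A[9] = { 4.0, 1.0, 1.0,
                          1.0, 3.0, 0.5,
                          1.0, 0.5, 2.0 };

          int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);   /* A = L * L^T */
          if (info == 0)
              info = LAPACKE_dpotri(LAPACK_COL_MAJOR, 'L', n, A, lda);   /* A := A^{-1} */
          if (info != 0) { printf("inversion failed, info = %d\n", info); return 1; }

          for (int i = 0; i < n; ++i) {            /* print the lower triangle of A^{-1} */
              for (int j = 0; j <= i; ++j)
                  printf("%10.6f ", A[i + j*lda]);
              printf("\n");
          }
          return 0;
      }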

  44. Scheduling Algorithms as DAGs

  45. A−1 Performance 8 Intel x86 cores + 2 Fermi GPUs H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC'11, India

  46. Summary and Future Directions • Two methodologies for solving challenging DLA • Static Scheduling (performance) • Dynamic Scheduling (productivity) • LAPACK compliant API • Source codes freely available in MAGMA • What’s next? • Extended numerical functionality • Distributed Memory Systems

  47. Collaborators / Support • MAGMA [Matrix Algebra on GPU and Multicore Architectures] team, http://icl.cs.utk.edu/magma/ • PLASMA [Parallel Linear Algebra for Scalable Multicore Architectures] team, http://icl.cs.utk.edu/plasma • Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France; KAUST, Saudi Arabia

  48. Questions?
