Retreat into BLIS


Presentation Transcript


  1. Retreat into BLIS Field G. Van Zee

  2. Funding and publications • NSF • Award OCI-1148125: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences. (Funded June 1, 2012 - May 31, 2015.) • Other sources (e.g. Microsoft) • ACM Transactions on Mathematical Software • “BLIS: A Framework for Rapid Instantiation of BLAS Functionality” (submitted) • “The BLIS Framework: Experiments in Portability” (submitted)

  3. Preview • What is BLIS? • Why BLIS and not BLAS? How is BLIS an improvement over existing BLAS implementations? • I’ve heard BLIS will make me more productive. How? • What kind of performance can I expect? • (and many other questions)

  4. What is BLIS? • BLAS-like Library Instantiation Software • BLIS is a framework for • Quickly instantiating high-performance BLAS-like libraries • “Why ‘BLAS-like’?”… • For now, just assume BLAS-like = BLAS

  5. What is BLAS? • Basic Linear Algebra Subprograms • Level 1: vector-vector [Lawson et al. 1979] • Level 2: matrix-vector [Dongarra et al. 1988] • Level 3: matrix-matrix [Dongarra et al. 1990] • Why are BLAS important?
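
As a concrete aside, here is a minimal C sketch of one routine from each level, written against the conventional CBLAS interface; the dimensions and operands are illustrative, not from the slides:

    #include <cblas.h>

    void blas_levels_demo(void)
    {
        double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
        double A[9] = {0}, B[9] = {0}, C[9] = {0};   /* 3x3, column-major */

        /* Level 1 (vector-vector): y := 2*x + y */
        cblas_daxpy(3, 2.0, x, 1, y, 1);

        /* Level 2 (matrix-vector): y := A*x */
        cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3,
                    1.0, A, 3, x, 1, 0.0, y, 1);

        /* Level 3 (matrix-matrix): C := A*B */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 3, 3, 3,
                    1.0, A, 3, B, 3, 0.0, C, 3);
    }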

  6. Why are BLAS important? • BLAS constitute the “bottom of the food chain” for most dense linear algebra applications, as well as other libraries • LAPACK, libflame, MATLAB, PETSc, etc.

  7. Why are BLAS important? • BLAS constitute the “bottom of the food chain” for most dense linear algebra applications, as well as other libraries • LAPACK, libflame, MATLAB, PETSc, etc. • The idea is simple: • if the BLAS interface is “standardized”, and • if an optimized, high-performance implementation exists for your architecture, • then higher-level applications can easily benefit

  8. Why are BLAS important? • Plenty of BLAS implementations available • Vendor • ACML (AMD), ESSL (IBM), MKL (Intel), cuBLAS (NVIDIA), MLIB (HP), MathKeisan (NEC), Accelerate (Apple), etc. • Open source • netlib, GotoBLAS, OpenBLAS, ATLAS, etc. • So why do we need BLIS?

  9. Why do we need BLIS? • Actually, there are two questions • Why do we need BLIS? • Why should we want BLIS? • Let’s look at the first question

  10. Why do we need BLIS? • The BLAS interface is limiting for some applications • To be expected – it was finalized 20-30 years ago! • How exactly is the BLAS interface limiting? • After all, it’s served us well for a long time

  11. Limitations of BLAS interface • Interface only allows column-major storage • We want to support column-major storage, row-major storage, and general stride (tensors) • Further, we want to support operands of mixed storage formats • Example: C += A B, where A is column-stored, B is row-stored, and C has general stride
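
To make the stride idea concrete, here is a minimal sketch (not BLIS's actual API; matrix_elem is a hypothetical helper) of how independent row and column strides address a matrix:

    /* Element (i,j) of a matrix with row stride rs and column stride cs.
     * Column-major:   rs = 1, cs = m   (the only case BLAS supports)
     * Row-major:      rs = n, cs = 1
     * General stride: rs > 1 and cs > 1 (e.g. a tensor slice) */
    static inline double matrix_elem(const double *a, long i, long j,
                                     long rs, long cs)
    {
        return a[i * rs + j * cs];
    }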

  12. Limitations of BLAS interface • Why do we need general stride storage?

  13. Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor

  14. Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor • How do we take an arbitrary slice?

  15. Limitations of BLAS interface • Why do we need general stride storage? • Example: three-dimensional tensor • How do we take an arbitrary slice? • It may be non-contiguous in both dimensions [figure: slice with non-contiguous elements]
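
A minimal C sketch of why such slices arise, assuming a d0 x d1 x d2 tensor stored contiguously with strides (d1*d2, d2, 1); mview_t and slice_k are hypothetical. Fixing the last index yields a matrix view that is non-contiguous in both dimensions, which BLAS cannot describe but independent strides can:

    typedef struct { double *buf; int m, n; long rs, cs; } mview_t;

    /* Slice the tensor at k = k0: the resulting d0 x d1 matrix view
     * has rs = d1*d2 and cs = d2, both non-unit. */
    mview_t slice_k(double *t, int d0, int d1, int d2, int k0)
    {
        mview_t v = { t + k0, d0, d1, (long)d1 * d2, d2 };
        return v;
    }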

  16. Limitations of BLAS interface • Incomplete support for complex operations (no “conjugate without transposition”) • Examples: axpy, gemv, gemm, her, herk, trmv, trmm, trsv, trsm
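
To see the cost of the missing option, here is a sketch of the workaround the BLAS interface forces for y := alpha*A*conj(x) + beta*y: conjugate x into a temporary first. (zgemv_conjx is a hypothetical helper; BLIS instead lets the caller conjugate any input operand directly.)

    #include <cblas.h>
    #include <complex.h>
    #include <stdlib.h>

    void zgemv_conjx(int m, int n, const double complex *alpha,
                     const double complex *a, int lda,
                     const double complex *x, const double complex *beta,
                     double complex *y)
    {
        /* Extra O(n) pass and workspace, only because BLAS has no
         * "conjugate without transposition" option. */
        double complex *xc = malloc((size_t)n * sizeof *xc);
        for (int j = 0; j < n; j++) xc[j] = conj(x[j]);

        cblas_zgemv(CblasColMajor, CblasNoTrans, m, n,
                    alpha, a, lda, xc, 1, beta, y, 1);
        free(xc);
    }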

  17. Limitations of BLAS interface • BLAS API is opaque • No uniform way to access lower-level kernels • Why would one want access to these kernels? • Optimize higher-level (LAPACK-level) operations • Control packing, computation for multithreading • Implement new operations (without “reinventing the wheel”)

  18. Limitations of BLAS interface • Operation support has not changed in over two decades • The BLAS Technical (BLAST) Forum attempted to ratify some improvements • Revisions largely ignored by implementors. Why? • Best guess: no official reference implementation

  19. Why do we need BLIS? • Why does this mean we need BLIS? • The BLAS API cannot be improved without breaking the standard that makes it useful • We can’t get a better interface by building a better BLAS – we need something else altogether • This was actually one of the primary motivations for developing BLIS

  20. Why do we need BLIS? • BLIS addresses the interface issues with BLAS • Independent row and column stride properties allow flexible matrix storage • Any input operand can be conjugated • Experts can directly call lower-level packing, computation kernels • Operation support can grow over time, as needed

  21. Why do we need BLIS? • BLIS addresses the interface issues with BLAS • Independent row and column stride properties allow flexible matrix storage • Any input operand can be conjugated • Experts can directly call lower-level packing, computation kernels • Operation support can grow over time, as needed • This is why BLIS needs to exist

  22. Why should we want BLIS? • Now, why should someone want BLIS?

  23. Why should we want BLIS? • Now, why should someone want BLIS? • If you’re an end-user • Improved interface • You can still use BLAS compatibility layer

  24. Why should we want BLIS? • Now, why should someone want BLIS? • If you’re an end-user • Improved interface • You can still use BLAS compatibility layer • If you’re a developer • As a framework, BLIS makes it easier to implement high-performance BLAS • Case study: Intel SCC
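
For end-users, the compatibility layer means legacy call sites need not change. A sketch, assuming a column-major C := A*B through the standard Fortran-style dgemm_ symbol:

    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    void call_legacy_gemm(int m, int n, int k,
                          const double *a, const double *b, double *c)
    {
        double one = 1.0, zero = 0.0;
        /* Unchanged BLAS call; link against BLIS's BLAS compatibility
         * layer instead of another BLAS and it runs on BLIS. */
        dgemm_("N", "N", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);
    }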

  25. Why should we want BLIS? • How does BLIS make implementing high-performance BLAS easier? • First, let’s discuss: Why is it normally so time-consuming? • Let’s look at general matrix-matrix multiplication (gemm) as implemented by Kazushige Goto in GotoBLAS • [Goto and van de Geijn 2008]

  26. The gemm algorithm [figure: C += A B]

  27. The gemm algorithm [figure: partition along n into panels of width NC]

  28. The gemm algorithm [figure: a single NC-wide panel of B and C]

  29. The gemm algorithm [figure: partition along k into panels of width KC]

  30. The gemm algorithm [figure: a single KC x NC panel of B]

  31. The gemm algorithm [figure: pack the row panel of B]

  32. The gemm algorithm [figure: pack the row panel of B into micro-panels of width NR]

  33. The gemm algorithm [figure: the packed panel of B]

  34. The gemm algorithm [figure: partition along m into blocks of height MC]

  35. The gemm algorithm [figure: a single MC x KC block of A]

  36. The gemm algorithm [figure: pack the block of A]

  37. The gemm algorithm [figure: pack the block of A into micro-panels of height MR]

  38. The gemm algorithm [figure: multiply the packed block of A by the packed panel of B]

  39. The gemm algorithm • Goto called this the “inner kernel” • Typically takes the shape of a block-panel multiply • Consists of three loops • Coded entirely in assembly language (≈ 2000 lines) [figure: block-panel multiply]
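
The loop structure built up in slides 26-39 can be written down schematically. The following C sketch uses illustrative block sizes and a scalar stand-in for the assembly micro-kernel, and omits the packing buffers; it shows the loop nest, not a tuned implementation:

    #include <stddef.h>

    enum { NC = 256, KC = 128, MC = 64, NR = 4, MR = 4 };  /* illustrative */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* C := C + A*B; A is m x k, B is k x n, C is m x n, column-major. */
    void gemm_blocked(int m, int n, int k,
                      const double *A, const double *B, double *C)
    {
        for (int jc = 0; jc < n; jc += NC)        /* NC-wide panels of B, C */
        for (int pc = 0; pc < k; pc += KC)        /* KC panels; pack B here */
        for (int ic = 0; ic < m; ic += MC) {      /* MC blocks; pack A here */
            int nc = MIN(NC, n - jc), kc = MIN(KC, k - pc), mc = MIN(MC, m - ic);
            /* "inner kernel": block-panel multiply, itself three loops */
            for (int jr = 0; jr < nc; jr += NR)   /* NR micro-panels of B   */
            for (int ir = 0; ir < mc; ir += MR)   /* MR micro-panels of A   */
                /* micro-kernel: MR x NR update, scalar stand-in */
                for (int j = jr; j < MIN(jr + NR, nc); j++)
                for (int i = ir; i < MIN(ir + MR, mc); i++) {
                    double cij = C[(ic + i) + (size_t)(jc + j) * m];
                    for (int p = 0; p < kc; p++)
                        cij += A[(ic + i) + (size_t)(pc + p) * m]
                             * B[(pc + p) + (size_t)(jc + j) * k];
                    C[(ic + i) + (size_t)(jc + j) * m] = cij;
                }
        }
    }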

  40. Level-3 BLAS • So I just write one “inner kernel” and I’m done, right? • That would be great! But no.

  41. Level-3 BLAS • General matrix multiply (gemm) • Nine cases: C += op(A) op(B), where op(X) is X, X^T, or X^H, independently for A and B

  42. Level-3 BLAS • So we need three packing routines (at least) • One for each of: No transpose, Transpose, Conjugate-transpose • Three more if packing of A and B isn’t consolidated
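
This combinatorial growth is one motivation for BLIS's stride-based design. A sketch of how independent strides can collapse the transpose cases into one routine (pack_block is a hypothetical routine, not BLIS's): a transpose is just a swap of the two strides passed in, and conjugation is applied as elements are copied:

    #include <complex.h>

    /* Pack an mc x kc block into a contiguous column-major buffer.
     * No transpose: pass (rs, cs); transpose: pass (cs, rs);
     * conjugate-transpose: pass (cs, rs) with conjugate = 1. */
    void pack_block(int mc, int kc, const double complex *a,
                    long rs, long cs, int conjugate,
                    double complex *packed)
    {
        for (int p = 0; p < kc; p++)
            for (int i = 0; i < mc; i++) {
                double complex aip = a[i * rs + p * cs];
                packed[i + (long)p * mc] = conjugate ? conj(aip) : aip;
            }
    }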

  43. Level-3 BLAS • Symmetric matrix multiplication (symm) • Four cases: C += A B (left) and C += B A (right), with symmetric A stored in the lower or upper triangle

  44. Level-3 BLAS • Symmetric matrix multiplication (symm) • Needs a special packing routine for each case • Lower- and upper-stored A, left and right sides • Then we can call the gemm inner kernel as if the block had no structure [figure: symm block-panel multiply]

  45. Level-3 BLAS • So to support gemm and symm, we need one inner kernel and seven pack routines • Hermitian matrix multiply (hemm)? • Can reuse inner kernel • Needs different packing on matrix A (to conjugate the unstored regions) • Okay, one inner kernel and 11 pack routines • What else?

  46. Level-3 BLAS • Symmetric rank-k update (syrk) • Four cases: C += A A^T and C += A^T A, with C stored in the lower or upper triangle

  47. Level-3 BLAS • Symmetric rank-k update (syrk) • Needs two special inner kernels • Lower- and upper-stored matrices C • Also needs to be able to pack the transposed matrix A [figure: syrk block-panel multiply]

  48. Level-3 BLAS • Total so far: three inner kernels and 12 pack routines • What about Hermitian rank-k update (herk)? • Need to be able to pack conjugate-transpose of A • Symmetric/Hermitian rank-2k updates can reuse kernels for rank-k

  49. Level-3 BLAS • Triangular matrix multiplication (trmm) • 24 cases: B := op(A) B (left) or B := B op(A) (right), where op(A) is A, A^T, or A^H, A is lower- or upper-stored, and the diagonal is unit or non-unit

  50. Level-3 BLAS • Triangular matrix multiplication (trmm) • Needs two (or four) special inner kernels • Lower- and upper-stored matrices A (left and right cases?) • Also needs to be able to pack only the stored region of matrix A, possibly [conjugate-]transposed, with unit or non-unit diagonal [figure: trmm block-panel multiply]
