Anatomy of a High-Performance Many-Threaded Matrix Multiplication

Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee. Introduction: shared memory parallelism for GEMM. Many-threaded architectures require more sophisticated methods of parallelism.

Presentation Transcript


  1. Anatomy of a High-Performance Many-Threaded Matrix Multiplication Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee

  2. Introduction • Shared memory parallelism for GEMM • Many-threaded architectures require more sophisticated methods of parallelism • Explore the opportunities for parallelism to explain which ones we will exploit • Need finer-grained parallelism

  3. Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results

  4. GotoBLAS Approach The GEMM operation: C += A B, where C is m × n, A is m × k, and B is k × n
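The operation on this slide can be written as a plain triple loop. The following is only a reference sketch, not the GotoBLAS or BLIS implementation; the function name `gemm_ref` and the row-major storage layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reference GEMM: C += A * B.
   A is m x k, B is k x n, C is m x n, all stored row-major (an assumption). */
void gemm_ref(size_t m, size_t n, size_t k,
              const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t p = 0; p < k; ++p) {
            double a = A[i * k + p];            /* element A(i, p) */
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[p * n + j];
        }
    }
}
```

This version does no blocking for the memory hierarchy, so it is useful mainly as a correctness baseline for the blocked variants.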

  5. [Figure: C += A B shown against the memory hierarchy: registers, L1 cache, L2 cache, L3 cache, main memory]

  6. [Figure: same memory hierarchy; C += A B partitioned with block size nc]

  7. [Figure: same memory hierarchy; C += A B partitioned with block size kc]

  8. [Figure: same memory hierarchy; C += A B partitioned with block size mc]

  9. [Figure: same memory hierarchy; C += A B partitioned with block size nr]

  10. [Figure: same memory hierarchy; C += A B partitioned with block size mr]
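The sequence of partitionings labeled nc, kc, mc, nr, and mr on slides 6 to 10 can be sketched as five nested loops around a small inner kernel. This is a simplified illustration: the block sizes are hypothetical and deliberately tiny, the loop index names (`jc`, `pc`, `ic`, `jr`, `ir`) and the row-major layout are assumptions, and the packing of A and B into contiguous buffers that GotoBLAS-style implementations perform is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes, chosen small for illustration only. */
enum { NC = 8, KC = 4, MC = 4, NR = 2, MR = 2 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Inner kernel: updates an mr x nr block of C with kc rank-1 updates. */
static void micro_kernel(size_t mr, size_t nr, size_t kc,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double *C, size_t ldc)
{
    for (size_t p = 0; p < kc; ++p)
        for (size_t i = 0; i < mr; ++i)
            for (size_t j = 0; j < nr; ++j)
                C[i * ldc + j] += A[i * lda + p] * B[p * ldb + j];
}

/* C += A * B with A (m x k), B (k x n), C (m x n), all row-major. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t jc = 0; jc < n; jc += NC) {              /* partition n by nc */
        size_t nc = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {          /* partition k by kc */
            size_t kc = min_sz(KC, k - pc);
            for (size_t ic = 0; ic < m; ic += MC) {      /* partition m by mc */
                size_t mc = min_sz(MC, m - ic);
                for (size_t jr = 0; jr < nc; jr += NR) { /* nr-wide slices */
                    size_t nr = min_sz(NR, nc - jr);
                    for (size_t ir = 0; ir < mc; ir += MR) { /* mr-tall slices */
                        size_t mr = min_sz(MR, mc - ir);
                        micro_kernel(mr, nr, kc,
                                     &A[(ic + ir) * k + pc], k,
                                     &B[pc * n + jc + jr], n,
                                     &C[(ic + ir) * n + jc + jr], n);
                    }
                }
            }
        }
    }
}
```

In a tuned implementation the block sizes would be chosen so that each partitioned operand fits the cache level shown on the corresponding slide.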

  11. Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results

  12. 3 Loops to Parallelize in GotoBLAS [Figure: C += A B]

  13. 5 Opportunities for Parallelism [Figure: C += A B]

  14. Multiple Levels of Parallelism [Figure: C += A B, ir loop] • All threads share micro-panel of B • Each thread has its own micro-panel of A • Fixed number of iterations:

  15. Multiple Levels of Parallelism [Figure: C += A B, jr loop] • All threads share block of A • Each thread has its own micro-panel of B • Fixed number of iterations • Good if shared L2 cache
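The two innermost loops from slides 14 and 15 can be sketched with OpenMP. This is a minimal illustration under stated assumptions, not the authors' code: the function name `macro_kernel`, the tile sizes, the row-major layout, and the choice of OpenMP are all hypothetical. Here the jr loop is parallelized, so every thread reads the same block of A while working on its own nr-wide slices of B and C.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 2, NR = 2 };   /* hypothetical micro-tile sizes */

/* Multiplies an mc x kc block of A by a kc x nc panel of B and adds the
   result to an mc x nc block of C (row-major, with leading dimensions). */
void macro_kernel(size_t mc, size_t nc, size_t kc,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    assert(A != NULL && B != NULL && C != NULL);

    /* jr loop: iterations touch disjoint columns of C, so they are
       independent. All threads share the block of A. */
    #pragma omp parallel for schedule(static)
    for (size_t jr = 0; jr < nc; jr += NR) {
        size_t nr = (nc - jr < NR) ? nc - jr : NR;

        /* ir loop: parallelizing here instead would make threads share
           one micro-panel of B, each with its own micro-panel of A. */
        for (size_t ir = 0; ir < mc; ir += MR) {
            size_t mr = (mc - ir < MR) ? mc - ir : MR;
            for (size_t p = 0; p < kc; ++p)
                for (size_t i = 0; i < mr; ++i)
                    for (size_t j = 0; j < nr; ++j)
                        C[(ir + i) * ldc + jr + j] +=
                            A[(ir + i) * lda + p] * B[p * ldb + jr + j];
        }
    }
}
```

Compiled without OpenMP support, the pragma is ignored and the function runs sequentially with the same result.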

  16. Multiple Levels of Parallelism • All threads share panel of B • Each thread has its own block of A • Number of iterations is not fixed • Good if multiple L2 caches

  17. Multiple Levels of Parallelism • Each iteration updates entire C • Iterations of the loop are not independent • Requires mutex when updating C • Or a reduction
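The reduction option mentioned on this slide can be sketched as follows. Each thread accumulates its share of the k dimension into a private copy of C, and the private copies are then summed into the shared C one thread at a time. This is only an illustration: the function name `gemm_split_k`, the use of OpenMP, and the row-major layout are assumptions, and a real implementation would likely use a more scalable reduction than a single critical section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* C += A * B, parallelized over the k dimension in chunks of kc.
   A is m x k, B is k x n, C is m x n, all row-major (an assumption). */
void gemm_split_k(size_t m, size_t n, size_t k, size_t kc,
                  const double *A, const double *B, double *C)
{
    assert(kc > 0 && A != NULL && B != NULL && C != NULL);

    #pragma omp parallel
    {
        /* Private accumulator: threads never write the shared C concurrently. */
        double *Cp = calloc(m * n ? m * n : 1, sizeof *Cp);
        assert(Cp != NULL);

        #pragma omp for schedule(static) nowait
        for (size_t pc = 0; pc < k; pc += kc) {
            size_t pe = (pc + kc < k) ? pc + kc : k;
            for (size_t i = 0; i < m; ++i)
                for (size_t p = pc; p < pe; ++p)
                    for (size_t j = 0; j < n; ++j)
                        Cp[i * n + j] += A[i * k + p] * B[p * n + j];
        }

        /* Reduction step: one thread at a time adds its partial result. */
        #pragma omp critical
        {
            for (size_t i = 0; i < m * n; ++i)
                C[i] += Cp[i];
        }
        free(Cp);
    }
}
```

The extra memory for the private copies and the serialized final sum illustrate why this loop is a less attractive target than the loops whose iterations are independent.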

  19. Multiple Levels of Parallelism • All threads share matrix A • Each thread has its own panel of B • Number of iterations is not fixed • Good if multiple L3 caches • Good for NUMA reasons
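Because the number of iterations of this loop depends on the problem size, one way to give each thread its own panel of B is to divide the columns up front. The helper below is a hypothetical sketch (the name `column_range` and its interface are assumptions, not part of any library): it assigns each thread a contiguous range of columns whose width is a multiple of `nr`, except possibly at the right edge of the matrix.

```c
#include <assert.h>
#include <stddef.h>

/* Assigns thread `tid` (of `nthreads`) the column range [*start, *end)
   of an n-column matrix, in units of nr-wide micro-panels. */
void column_range(size_t n, size_t nthreads, size_t tid, size_t nr,
                  size_t *start, size_t *end)
{
    assert(nthreads > 0 && tid < nthreads && nr > 0);
    assert(start != NULL && end != NULL);

    size_t panels = (n + nr - 1) / nr;      /* number of nr-wide micro-panels */
    size_t base   = panels / nthreads;      /* panels every thread receives */
    size_t extra  = panels % nthreads;      /* first `extra` threads get one more */

    size_t first = tid * base + (tid < extra ? tid : extra);
    size_t count = base + (tid < extra ? 1 : 0);

    *start = first * nr;
    *end   = (first + count) * nr;
    if (*start > n) *start = n;             /* clip at the matrix edge */
    if (*end > n)   *end = n;
}
```

When n is small relative to the thread count, some threads receive an empty range, which is one way to see why this loop alone may not supply enough parallelism.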

  20. Outline • GotoBLAS approach • Opportunities for Parallelism • Many-threaded Results

  21. Intel Xeon Phi • Many Threads • 60 cores, 4 threads per core • Need to use > 2 threads per core to utilize FPU • We do not block for the L1 cache • Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache • We consider part of the L2 cache as a virtual L1 • Each core has its own L2 cache

  22. IBM Blue Gene/Q • (Not quite as) Many Threads • 16 cores, 4 threads per core • Need to use > 2 threads per core to utilize FPU • We do not block for the L1 cache • Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache • We consider part of the L2 cache as a virtual L1 • Single large, shared L2 cache
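With 60 cores × 4 threads = 240 hardware threads on the Xeon Phi and 16 × 4 = 64 on Blue Gene/Q, parallelizing several loops at once means each thread needs one index per loop. The sketch below shows one way to split a flat thread id into such indices. The struct, the function name `decompose`, and the example factorizations in the usage note are hypothetical illustrations, not the configurations reported by the authors.

```c
#include <assert.h>

/* One work index per parallelized loop. */
typedef struct { int jc, ic, jr, ir; } loop_ids;

/* Splits a flat thread id into per-loop ids, given the number of ways
   each loop is parallelized. Threads with consecutive ids differ first
   in the innermost loops. */
loop_ids decompose(int tid, int jc_way, int ic_way, int jr_way, int ir_way)
{
    assert(jc_way > 0 && ic_way > 0 && jr_way > 0 && ir_way > 0);
    assert(tid >= 0 && tid < jc_way * ic_way * jr_way * ir_way);

    loop_ids id;
    id.ir = tid % ir_way;  tid /= ir_way;
    id.jr = tid % jr_way;  tid /= jr_way;
    id.ic = tid % ic_way;  tid /= ic_way;
    id.jc = tid;
    return id;
}
```

For example, `decompose(tid, 1, 60, 4, 1)` covers 240 threads with 60 groups of 4, and `decompose(tid, 1, 16, 4, 1)` covers 64 threads with 16 groups of 4; whether such a grouping matches the cache sharing on a given machine depends on how thread ids are pinned to cores.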

  23. Thank You • Questions? • Source code available at: • code.google.com/p/blis/
