SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks


Presentation Transcript


  1. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks Ernie Chan PPoPP 2008

  2. Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008

  3. Inversion of an SPD Matrix • Three Sweeps • Cholesky factorization (Chol): A → U^T U • Inversion of a triangular matrix (Trinv): R := U^-1 • Triangular matrix multiplication by its transpose (Ttmm): A^-1 := R R^T PPoPP 2008
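The same three sweeps can be illustrated with conventional LAPACK routines (dpotrf for Chol, dtrtri for Trinv, dlauum for the triangular product). The sketch below is only an illustration of the mathematical pipeline, not the FLAME/SuperMatrix code used in the talk, and it assumes a LAPACK library that exposes the usual Fortran symbols.

/* Minimal sketch: invert a 2x2 SPD matrix with the three sweeps,
 * using plain LAPACK routines (illustration only; not the FLAME code).
 * Assumes a Fortran LAPACK is linked, e.g. with -llapack. */
#include <stdio.h>

extern void dpotrf_( const char *uplo, const int *n, double *a,
                     const int *lda, int *info );             /* A -> U^T U    */
extern void dtrtri_( const char *uplo, const char *diag, const int *n,
                     double *a, const int *lda, int *info );  /* R := U^-1     */
extern void dlauum_( const char *uplo, const int *n, double *a,
                     const int *lda, int *info );             /* A^-1 := R R^T */

int main( void )
{
    /* Column-major SPD matrix A = [ 4 2 ; 2 3 ]; its exact inverse is
     * [ 0.375 -0.25 ; -0.25 0.5 ]. */
    double A[4] = { 4.0, 2.0, 2.0, 3.0 };
    int    n = 2, lda = 2, info;

    dpotrf_( "U", &n, A, &lda, &info );        /* Sweep 1: Chol  */
    dtrtri_( "U", "N", &n, A, &lda, &info );   /* Sweep 2: Trinv */
    dlauum_( "U", &n, A, &lda, &info );        /* Sweep 3: Ttmm  */

    /* The upper triangle of A now holds the upper triangle of A^-1. */
    printf( "inv(A) upper triangle: %f %f %f\n", A[0], A[2], A[3] );
    return 0;
}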

  4. Inversion of an SPD Matrix • Exposing Parallelism • Parallelizing each of the three sweeps independently creates inherent synchronization points • Programmability • Use tools provided by FLAME PPoPP 2008

  5. Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008

  6. Algorithms-by-Blocks PPoPP 2008

  7. Algorithms-by-Blocks
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    b = min( FLA_Obj_length( ABR ), nb_alg );
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*---------------------------------------------------------------------*/
    FLA_Chol( FLA_UPPER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A12 );
    FLA_Syrk( FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
              FLA_MINUS_ONE, A12, FLA_ONE, A22 );
    /*---------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
  PPoPP 2008
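For readers less familiar with the FLAME notation, the loop body above performs the usual right-looking updates A11 := Chol(A11), A12 := A11^-T A12, A22 := A22 - A12^T A12. A rough translation into plain LAPACK/CBLAS calls (an illustrative sketch under that reading, not the code used in the talk) might look like:

/* Sketch of the same right-looking blocked Cholesky (A = U^T U) written
 * with plain LAPACK/CBLAS calls instead of the FLAME API.  Column-major
 * storage; only the upper triangle of A is referenced and updated. */
#include <cblas.h>

extern void dpotrf_( const char *uplo, const int *n, double *a,
                     const int *lda, int *info );

static void chol_by_blocks( int n, double *A, int lda, int b )
{
    for ( int k = 0; k < n; k += b ) {
        int kb = ( n - k < b ) ? n - k : b;   /* current block size      */
        int nr = n - k - kb;                  /* width of trailing panel */
        int info;

        /* A11 := Chol( A11 ) */
        dpotrf_( "U", &kb, &A[ k + k * lda ], &lda, &info );

        if ( nr > 0 ) {
            /* A12 := A11^-T A12 */
            cblas_dtrsm( CblasColMajor, CblasLeft, CblasUpper,
                         CblasTrans, CblasNonUnit, kb, nr,
                         1.0, &A[ k + k * lda ], lda,
                              &A[ k + ( k + kb ) * lda ], lda );
            /* A22 := A22 - A12^T A12 */
            cblas_dsyrk( CblasColMajor, CblasUpper, CblasTrans, nr, kb,
                         -1.0, &A[ k + ( k + kb ) * lda ], lda,
                          1.0, &A[ ( k + kb ) + ( k + kb ) * lda ], lda );
        }
    }
}

A call such as chol_by_blocks( n, A, n, 192 ) on a column-major SPD matrix (the function name here is hypothetical) would then mirror the FLAME loop above with the block size used in the experiments.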

  8. PPoPP 2008

  9. Algorithms-by-Blocks • Cholesky Factorization • Iteration 1: CHOL, TRSM, TRSM, SYRK, GEMM, SYRK PPoPP 2008

  10. Algorithms-by-Blocks • Cholesky Factorization • Iteration 2: CHOL, TRSM, SYRK PPoPP 2008

  11. Algorithms-by-Blocks • Cholesky Factorization • Iteration 3: CHOL PPoPP 2008
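The task counts on slides 9-11 follow directly from the loop structure: each iteration produces one CHOL on the diagonal block, one TRSM per block to its right, and one SYRK or GEMM per block of the trailing submatrix. A small sketch (a hypothetical helper, not part of FLAME) that enumerates these tasks for the 3 x 3 blocked matrix of the slides:

/* Enumerate the tasks of a right-looking Cholesky by blocks on an
 * nb x nb blocked matrix; reproduces the task lists of slides 9-11. */
#include <stdio.h>

int main( void )
{
    int nb = 3;   /* 3 x 3 blocks, as in the slides */

    for ( int k = 0; k < nb; k++ ) {
        printf( "Iteration %d:\n", k + 1 );
        printf( "  CHOL( A%d%d )\n", k, k );
        for ( int j = k + 1; j < nb; j++ )
            printf( "  TRSM( A%d%d, A%d%d )\n", k, k, k, j );
        for ( int j = k + 1; j < nb; j++ )
            for ( int i = k + 1; i <= j; i++ )
                printf( "  %s( A%d%d )\n", i == j ? "SYRK" : "GEMM", i, j );
    }
    return 0;
}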

  12. Algorithms-by-Blocks PPoPP 2008

  13. Algorithms-by-Blocks PPoPP 2008

  14. PPoPP 2008

  15. Algorithms-by-Blocks • FLASH • Matrix of matrices PPoPP 2008
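FLASH stores a matrix hierarchically: the top-level object is a matrix whose elements are themselves submatrix blocks, so an algorithm-by-blocks simply iterates over 1 x 1 "elements". Below is a minimal data-structure sketch of that idea; the struct names and layout are illustrative assumptions, not the actual FLASH implementation (which reuses FLA_Obj at both levels).

/* Sketch of a "matrix of matrices": a grid of contiguously stored blocks. */
#include <stdlib.h>

typedef struct {
    int     m, n;       /* block dimensions                               */
    double *buffer;     /* contiguous column-major storage for one block  */
} Block;

typedef struct {
    int     mb, nb;     /* number of blocks in each dimension             */
    Block **blocks;     /* blocks[i][j] is block (i,j)                    */
} HierMatrix;

/* Allocate an mb x nb grid of b x b blocks, each stored contiguously. */
static HierMatrix *hier_create( int mb, int nb, int b )
{
    HierMatrix *H = malloc( sizeof *H );
    H->mb = mb;  H->nb = nb;
    H->blocks = malloc( mb * sizeof *H->blocks );
    for ( int i = 0; i < mb; i++ ) {
        H->blocks[i] = malloc( nb * sizeof **H->blocks );
        for ( int j = 0; j < nb; j++ ) {
            H->blocks[i][j].m = H->blocks[i][j].n = b;
            H->blocks[i][j].buffer = calloc( (size_t) b * b, sizeof( double ) );
        }
    }
    return H;
}

int main( void )
{
    HierMatrix *A = hier_create( 3, 3, 192 );   /* 3 x 3 grid of 192 x 192 blocks */
    A->blocks[1][2].buffer[0] = 1.0;            /* element (0,0) of block (1,2)   */
    return 0;
}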

  16. Algorithms-by-Blocks
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------------------------------*/
    /* Chol Variant 3 */
    FLASH_Chol( FLA_UPPER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A12 );
    FLASH_Syrk( FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
                FLA_MINUS_ONE, A12, FLA_ONE, A22 );
    /*---------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
  PPoPP 2008

  17. PPoPP 2008

  18. Algorithms-by-Blocks
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------------------------------*/
    /* Trinv Variant 3 */
    FLASH_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_MINUS_ONE, A11, A12 );
    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_ONE, A01, A12, FLA_ONE, A02 );
    FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A01 );
    FLASH_Trinv( FLA_UPPER_TRIANGULAR, FLA_NONUNIT_DIAG, A11 );
    /*---------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
  PPoPP 2008

  19. Algorithms-by-Blocks
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------------------------------*/
    /* Ttmm Variant 1 */
    FLASH_Syrk( FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_ONE, A01, FLA_ONE, A00 );
    FLASH_Trmm( FLA_RIGHT, FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A01 );
    FLASH_Ttmm( FLA_UPPER_TRIANGULAR, A11 );
    /*---------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
  PPoPP 2008

  20. Algorithms-by-Blocks • SuperMatrix: Analyzer • Decomposes subproblems into component tasks • Enqueues tasks onto a global task queue • Internally calculates all dependencies between tasks, which form a directed acyclic graph
  FLASH_Chol_op ( A );
  FLASH_Trinv_op( A );
  FLASH_Ttmm_op ( A );
  FLASH_Queue_exec( );
  PPoPP 2008
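As a rough illustration of the analyzer, the sketch below records, for every block, the task that last wrote it; when a later task reads that block, a flow (read-after-write) dependence edge is added to the DAG. The data structures and the one-input-per-task simplification are assumptions for illustration, not the SuperMatrix API.

/* Sketch of dependence analysis for tasks that operate on blocks. */
#include <stdio.h>

#define MAX_TASKS  16
#define MAX_BLOCKS 16

typedef struct {
    const char *name;
    int         deps;                    /* unmet incoming dependencies        */
} Task;

static Task tasks[MAX_TASKS];
static int  n_tasks;
static int  last_writer[MAX_BLOCKS];     /* last task to write each block, -1 if none */

/* Record a task that reads block `in` (-1 if none) and writes block `out`. */
static int analyze( const char *name, int in, int out )
{
    int id = n_tasks++;
    tasks[id].name = name;
    if ( in >= 0 && last_writer[in] >= 0 ) {
        tasks[id].deps++;                /* flow dependence: wait for the writer */
        printf( "%-15s depends on %s\n", name, tasks[ last_writer[in] ].name );
    }
    last_writer[out] = id;
    return id;
}

int main( void )
{
    for ( int i = 0; i < MAX_BLOCKS; i++ ) last_writer[i] = -1;

    /* Tasks of a 2x2 blocked Cholesky; blocks: 0 = A11, 1 = A12, 2 = A22. */
    analyze( "CHOL(A11)",      -1, 0 );
    analyze( "TRSM(A11,A12)",   0, 1 );
    analyze( "SYRK(A12,A22)",   1, 2 );
    analyze( "CHOL(A22)",       2, 2 );
    return 0;
}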

  21. Algorithms-by-Blocks • SuperMatrix: Dispatcher • Places ready and available tasks on a global waiting queue • Threads asynchronously dequeue tasks from the head of the waiting queue • Once a task completes execution, notifies dependent tasks and updates the waiting queue • Loops until all tasks complete execution PPoPP 2008
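A minimal sketch of such a dispatcher loop with POSIX threads, assuming the hypothetical Task structure below rather than the SuperMatrix internals: tasks whose dependence counts reach zero are appended to a shared ready queue, and idle threads pick them up.

/* Worker threads repeatedly dequeue ready tasks, "execute" them,
 * and release their dependents.  Illustrative sketch only. */
#include <pthread.h>
#include <stdio.h>

#define MAX_TASKS 16

typedef struct Task {
    const char  *name;
    int          deps;                 /* unmet incoming dependencies     */
    int          n_succ;               /* number of dependent tasks       */
    struct Task *succ[MAX_TASKS];      /* tasks that wait on this one     */
} Task;

static Task *ready[MAX_TASKS];         /* global waiting (ready) queue    */
static int   head, tail, tasks_left;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;

static void enqueue( Task *t )         /* lock held, or threads not started yet */
{
    ready[ tail++ ] = t;
    pthread_cond_signal( &wake );
}

static void *worker( void *arg )
{
    for ( ;; ) {
        pthread_mutex_lock( &lock );
        while ( head == tail && tasks_left > 0 )
            pthread_cond_wait( &wake, &lock );
        if ( tasks_left == 0 ) {       /* all tasks done: wake the rest and exit */
            pthread_cond_broadcast( &wake );
            pthread_mutex_unlock( &lock );
            return NULL;
        }
        Task *t = ready[ head++ ];
        pthread_mutex_unlock( &lock );

        printf( "thread %ld executes %s\n", (long) arg, t->name );

        pthread_mutex_lock( &lock );   /* notify dependent tasks */
        for ( int i = 0; i < t->n_succ; i++ )
            if ( --t->succ[i]->deps == 0 )
                enqueue( t->succ[i] );
        if ( --tasks_left == 0 )
            pthread_cond_broadcast( &wake );
        pthread_mutex_unlock( &lock );
    }
}

int main( void )
{
    /* Chain of tasks from a 2x2 blocked Cholesky:
     * CHOL(A11) -> TRSM(A12) -> SYRK(A22) -> CHOL(A22). */
    Task t[4] = { { "CHOL(A11)", 0 }, { "TRSM(A12)", 1 },
                  { "SYRK(A22)", 1 }, { "CHOL(A22)", 1 } };
    for ( int i = 0; i < 3; i++ ) { t[i].succ[0] = &t[i + 1]; t[i].n_succ = 1; }

    tasks_left = 4;
    enqueue( &t[0] );                  /* only CHOL(A11) is ready initially */

    pthread_t thr[2];
    for ( long i = 0; i < 2; i++ ) pthread_create( &thr[i], NULL, worker, (void *) i );
    for ( int  i = 0; i < 2; i++ ) pthread_join( thr[i], NULL );
    return 0;
}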

  22. PPoPP 2008

  23. Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008

  24. Performance • Target Architecture • 16-CPU Itanium2 • ccNUMA • 8 dual-processor nodes • OpenMP • Intel Compiler 9.0 • BLAS • GotoBLAS 1.15 • Intel MKL 8.1 PPoPP 2008

  25. Performance • Implementations • SuperMatrix + serial BLAS • FLAME + multithreaded BLAS • LAPACK + multithreaded BLAS • Block size = 192 • Processors = 16 PPoPP 2008

  26. Performance • SuperMatrix Implementation • Fixed block size • Varying block sizes can lead to better performance • Experiments show 192 is generally the best • Simplest scheduling • No sorting to execute tasks on the critical path earlier • No attempt to improve data locality in these experiments PPoPP 2008

  27. Performance PPoPP 2008

  28. Performance PPoPP 2008

  29. Performance PPoPP 2008

  30. Performance PPoPP 2008

  31. Performance • Results • The only difference between FLAME and LAPACK is the use of different algorithmic variants • GotoBLAS and MKL yield similar performance curves • SuperMatrix performance ramps up much faster PPoPP 2008

  32. Outline • Inversion of a Symmetric Positive Definite Matrix • Algorithms-by-Blocks • Flow vs. Anti-Dependencies • Performance • Conclusion PPoPP 2008

  33. Conclusion • Abstractions hide details of parallelization from users • SuperMatrix extracts parallelism across subroutine boundaries PPoPP 2008

  34. Authors • Field G. Van Zee • Paolo Bientinesi • Enrique S. Quintana-Ortí • Gregorio Quintana-Ortí • Robert van de Geijn • The University of Texas at Austin • Duke University • Universidad Jaume I PPoPP 2008

  35. Acknowledgements • We thank the other members of the FLAME team for their support • Funding • NSF Grants • CCF-0540926 • CCF-0702714 PPoPP 2008

  36. References
  [1] Paolo Bientinesi, Brian Gunter, and Robert van de Geijn. Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix. ACM Transactions on Mathematical Software. To appear.
  [2] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations on SMP and Multi-Core Architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
  [3] Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying Your Dependencies with SuperMatrix. In Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91-99, Austin, TX, USA, September 2007.
  [4] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: The LU Factorization. Accepted to MTAAP 2008.
  [5] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Accepted to Euromicro PDP 2008.
  PPoPP 2008

  37. Conclusion • More Information http://www.cs.utexas.edu/~flame • Questions? echan@cs.utexas.edu PPoPP 2008
