
Satisfying Your Dependencies with SuperMatrix


Presentation Transcript


  1. Satisfying Your Dependencies with SuperMatrix
  Ernie Chan
  Cluster 2007

  2. Motivation
  • Transparent Parallelization of Matrix Operations for SMP and Multi-Core Architectures
  • Schedule submatrix operations out-of-order via dependency analysis
  • Programmability
    • High-level abstractions to hide details of parallelization from the user

  3. Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

  4. SuperMatrix

  5. SuperMatrix
  • Blocked LU factorization without pivoting, coded with FLAME:

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------------*/
    FLA_LU_nopiv( A11 );                        /* A11 = L11 * U11        */
    FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );              /* A12 := inv(L11) * A12  */
    FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );              /* A21 := A21 * inv(U11)  */
    FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12,
              FLA_ONE, A22 );                   /* A22 := A22 - A21 * A12 */
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************** */    /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  6. SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 1
  [Diagram: tasks generated in iteration 1 on a 3 x 3 matrix of blocks: 1 LU, 4 TRSMs, and 4 GEMMs, with the TRSMs depending on the LU and the GEMMs on the TRSMs]

  7. SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 2
  [Diagram: tasks generated in iteration 2: 1 LU, 2 TRSMs, and 1 GEMM]

  8. SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 3
  [Diagram: the final iteration generates a single LU task]

  9. SuperMatrix
  • FLASH
  • Matrix of matrices
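To make "matrix of matrices" concrete, here is a minimal C sketch of a storage-by-blocks layout. The names block_t, hmatrix_t, and hmatrix_create are hypothetical illustrations only; they are not the FLASH API, which instead represents a hierarchical matrix as an FLA_Obj whose elements are themselves FLA_Objs.

  #include <stdlib.h>

  /* One leaf block: a small, contiguously stored b x b submatrix. */
  typedef struct {
      int     b;        /* block dimension            */
      double *elem;     /* column-major b x b storage */
  } block_t;

  /* The hierarchical matrix: an m x n grid whose "elements" are blocks. */
  typedef struct {
      int       m, n;   /* grid dimensions in blocks    */
      block_t **blk;    /* blk[i][j] is submatrix (i,j) */
  } hmatrix_t;

  /* Allocate an (m*b) x (n*b) matrix stored as an m x n grid of blocks. */
  static hmatrix_t *hmatrix_create( int m, int n, int b )
  {
      hmatrix_t *A = malloc( sizeof( hmatrix_t ) );
      A->m = m;  A->n = n;
      A->blk = malloc( m * sizeof( block_t * ) );
      for ( int i = 0; i < m; i++ )
      {
          A->blk[ i ] = malloc( n * sizeof( block_t ) );
          for ( int j = 0; j < n; j++ )
          {
              A->blk[ i ][ j ].b    = b;
              A->blk[ i ][ j ].elem = calloc( (size_t) b * b, sizeof( double ) );
          }
      }
      return A;
  }

Because each block is stored contiguously, a task that touches one block touches exactly one unit of data, which is what lets the runtime use blocks as the granularity of dependence analysis.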

  10. SuperMatrix
  • The same loop over a FLASH matrix of matrices; repartitioning moves one block at a time, and the FLASH_* calls enqueue tasks instead of executing them:

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*------------------------------------------------------------------*/
    FLASH_LU_nopiv( A11 );
    FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );
    FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
    /*------------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************** */    /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }
  FLASH_Queue_exec( );   /* dispatch the queued tasks in dependency order */

  11. SuperMatrix
  • Analyzer
  • Delay execution and place tasks on a queue
  • Tasks are function pointers annotated with input/output information
  • Compute dependence information (flow, anti, output) between all tasks
  • Create DAG of tasks (a task record is sketched below)
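A minimal sketch, in C, of what such an annotated task might look like. task_t, operand_t, and every field name are hypothetical, not the actual SuperMatrix data structures:

  #include <stdbool.h>

  #define MAX_OPERANDS 4
  #define MAX_DEPS     16

  typedef struct task task_t;

  /* One operand of a task: which block it touches and how. */
  typedef struct {
      void *block;      /* the submatrix block            */
      bool  writes;     /* true: output (or input/output) */
  } operand_t;

  /* A task is a function pointer plus its annotated operands. */
  struct task {
      void      (*func)( task_t *t );   /* e.g. wraps FLA_Gemm on blocks  */
      operand_t   operand[ MAX_OPERANDS ];
      int         n_operands;

      /* Dependence edges of the DAG, filled in by the analyzer. */
      task_t     *successor[ MAX_DEPS ];
      int         n_successors;
      int         n_unsatisfied;        /* predecessors not yet completed */
  };

Flow (read-after-write), anti (write-after-read), and output (write-after-write) dependences can all be detected by comparing a new task's operands against those of earlier tasks that touch the same blocks.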

  12. SuperMatrix
  • Dispatcher
  • Use DAG to execute tasks out-of-order in parallel
  • Akin to Tomasulo’s algorithm and instruction-level parallelism, but on blocks of computation
  • SuperScalar vs. SuperMatrix

  13. SuperMatrix
  • Dispatcher
  • 4 threads
  • 5 x 5 matrix of blocks
  • 55 tasks
  • 18 stages
  [Diagram: the full task DAG, iteration by iteration: 1 LU + 8 TRSMs + 16 GEMMs, then 1 + 6 + 9, then 1 + 4 + 4, then 1 + 2 + 1, then a final LU]
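The 55-task figure follows from the loop structure: the iteration whose trailing submatrix is t blocks on a side generates 1 LU, 2t TRSMs, and t*t GEMMs. A small helper (purely illustrative) reproduces the count:

  /* Count the tasks generated by blocked LU without pivoting on an
     n x n matrix of blocks: each iteration with t trailing blocks
     contributes 1 LU, 2*t TRSMs, and t*t GEMMs. */
  static int lu_task_count( int n )
  {
      int total = 0;
      for ( int t = n - 1; t >= 0; t-- )
          total += 1 + 2 * t + t * t;   /* = (t + 1)^2 */
      return total;                     /* n = 5 gives 55 */
  }

For n = 5 this gives 25 + 16 + 9 + 4 + 1 = 55.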

  14. Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

  15. Implementation
  • Analyzer
  [Diagram: the linear task queue of LU, TRSM, and GEMM tasks, and the DAG of tasks that the analyzer builds over them]

  16. Implementation
  • Analyzer
  • FLASH routines enqueue tasks onto a global task queue
  • Dependencies between each pair of tasks are calculated and stored in the task structure
  • Each submatrix block stores the last task enqueued that writes to it
  • Flow dependencies occur when a subsequent task reads that block (see the sketch below)
  • DAG is embedded in the task queue
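Continuing the hypothetical task_t sketch from slide 11, the last-writer scheme takes only a few lines. Only flow dependences are shown here; anti and output dependences would additionally require tracking the readers of each block. All names remain illustrative assumptions:

  /* Per-block bookkeeping: the last enqueued task that writes this
     block.  (Hypothetical; the real analyzer keeps equivalent state
     with each submatrix block.) */
  typedef struct {
      void   *data;
      task_t *last_writer;
  } tracked_block_t;

  /* Record a dependence edge and bump the successor's pending count. */
  static void add_edge( task_t *from, task_t *to )
  {
      from->successor[ from->n_successors++ ] = to;
      to->n_unsatisfied++;
  }

  /* Called as each FLASH routine enqueues its task. */
  static void analyze( task_t *t )
  {
      for ( int i = 0; i < t->n_operands; i++ )
      {
          tracked_block_t *blk = t->operand[ i ].block;

          /* Flow dependence: this task touches a block that an earlier
             task writes, so it must wait for that writer to finish.  */
          if ( blk->last_writer != NULL && blk->last_writer != t )
              add_edge( blk->last_writer, t );

          /* This task becomes the block's last writer for later tasks. */
          if ( t->operand[ i ].writes )
              blk->last_writer = t;
      }
  }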

  17. Implementation
  • Dispatcher
  [Diagram: the global task queue feeding a waiting queue of ready LU, TRSM, and GEMM tasks, from which the threads dequeue work]

  18. Implementation
  • Dispatcher
  • Place ready and available tasks on a global waiting queue
  • The first task on the task queue is always ready and available
  • Threads asynchronously dequeue tasks from the head of the waiting queue
  • Once a task completes execution, notify dependent tasks and update the waiting queue
  • Loop until all tasks complete execution (see the sketch below)
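A minimal sketch of such a dispatcher loop using POSIX threads, again with the hypothetical task_t. The waiting queue is a fixed-size circular buffer, and no attempt is made to sort tasks or improve locality, matching the simple scheduling described on slide 23:

  #include <pthread.h>

  /* Global dispatcher state (hypothetical, simplified): at most 1024
     tasks may be ready at once.  Before the workers start, tasks_left
     must be set to the total number of enqueued tasks, and every task
     with no unsatisfied dependences must be placed on the queue. */
  static task_t         *waiting[ 1024 ];
  static int             head = 0, tail = 0;
  static int             tasks_left;
  static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

  static void enqueue_ready( task_t *t )           /* caller holds lock */
  {
      waiting[ tail++ % 1024 ] = t;
      pthread_cond_signal( &ready );
  }

  static void *worker( void *arg )
  {
      (void) arg;
      for ( ;; )
      {
          pthread_mutex_lock( &lock );
          while ( head == tail && tasks_left > 0 )
              pthread_cond_wait( &ready, &lock );  /* nothing ready yet */
          if ( tasks_left == 0 )
          {                                        /* all tasks done    */
              pthread_cond_broadcast( &ready );
              pthread_mutex_unlock( &lock );
              return NULL;
          }
          task_t *t = waiting[ head++ % 1024 ];    /* dequeue from head */
          pthread_mutex_unlock( &lock );

          t->func( t );                            /* execute the task  */

          pthread_mutex_lock( &lock );
          tasks_left--;
          for ( int i = 0; i < t->n_successors; i++ )
              if ( --t->successor[ i ]->n_unsatisfied == 0 )
                  enqueue_ready( t->successor[ i ] );  /* now ready     */
          if ( tasks_left == 0 )
              pthread_cond_broadcast( &ready );    /* wake idle threads */
          pthread_mutex_unlock( &lock );
      }
  }

Spawning the 4 threads of slide 13 is then one pthread_create( &tid[ i ], NULL, worker, NULL ) per thread, followed by pthread_join on each.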

  19. Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

  20. Performance Results

  21. Performance Results
  • GotoBLAS 1.13 installed on all machines
  • Supported Operations
    • LAPACK-level functions
      • Cholesky factorization
      • LU factorization without pivoting
    • All level-3 BLAS
      • GEMM, TRMM, TRSM
      • SYMM, SYRK, SYR2K
      • HEMM, HERK, HER2K

  22. Performance Results
  • Implementations
    • SuperMatrix + serial BLAS
    • FLAME + multithreaded BLAS
    • LAPACK + multithreaded BLAS
  • Block size = 192
  • Processing elements = 8

  23. Performance Results
  • SuperMatrix implementation
    • Fixed block size
      • Varying block sizes can lead to better performance
      • Experiments show 192 is generally the best
    • Simplest scheduling
      • No sorting to execute tasks on the critical path earlier
      • No attempt to improve data locality in these experiments

  24. Performance Results
  [Performance graph]

  25. Performance Results
  [Performance graph]

  26. Performance Results
  [Performance graph]

  27. Performance Results
  [Performance graph]

  28. Performance Results
  [Performance graph]

  29. Performance Results
  [Performance graph]

  30. Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

  31. Conclusion
  • Apply out-of-order execution techniques to schedule tasks
  • The whole is greater than the sum of the parts
    • Exploit parallelism between operations
  • Despite having to calculate dependencies, SuperMatrix incurs only small performance penalties

  32. Conclusion
  • Programmability
    • Code at a high level without needing to deal with aspects of parallelization

  33. Authors
  • Ernie Chan
  • Field G. Van Zee
  • Enrique S. Quintana-Ortí
  • Gregorio Quintana-Ortí
  • Robert van de Geijn
  • The University of Texas at Austin
  • Universidad Jaume I

  34. Acknowledgements
  • We thank the Texas Advanced Computing Center (TACC) for access to their machines and their support
  • Funding
    • NSF Grants
      • CCF-0540926
      • CCF-0702714

  35. References
  [1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
  [2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks. Submitted to PPoPP 2008.
  [3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Submitted to Euromicro PDP 2008.

  36. Conclusion
  • More information: http://www.cs.utexas.edu/users/flame
  • Questions? echan@cs.utexas.edu
