
SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures



  1. SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures Ernie Chan The University of Texas at Austin SPAA 2007

  2. Motivation • Motivating Example • Cholesky Factorization • A → L L^T SPAA 2007

  3. Motivation [performance graph; peak performance: 96 Gflops]

  4. Outline • Performance • FLAME • SuperMatrix • Conclusion

  5. Performance • Target Architecture • 16 CPU Itanium2 • NUMA • 8 dual-processor nodes • OpenMP • Intel Compiler 9.0 • BLAS • GotoBLAS 1.06 • Intel MKL 8.1

  6. Performance • Implementations • Multithreaded BLAS (Sequential algorithm) • LAPACK dpotrf • FLAME var3 • Serial BLAS (Parallel algorithm) • Data-flow

  7. Performance • Implementations • Column-major order storage • Varying block sizes • { 64, 96, 128, 160, 192, 224, 256 } • Select best performance for each problem size

  8. Performance

  9. Outline • Performance • FLAME • SuperMatrix • Conclusion

  10. FLAME • Formal Linear Algebra Methods Environment • High-level abstraction away from indices • “Views” into matrices • Seamless transition from algorithms to code

  11. FLAME • Cholesky Factorization
for ( j = 0; j < n; j++ ) {
   A[j,j] = sqrt( A[j,j] );
   for ( i = j+1; i < n; i++ )
      A[i,j] = A[i,j] / A[j,j];
   for ( k = j+1; k < n; k++ )
      for ( i = k; i < n; i++ )
         A[i,k] = A[i,k] - A[i,j] * A[k,j];
}

  12. FLAME • LAPACK dpotrf • Different variant (right-looking)
DO J = 1, N, NB
   JB = MIN( NB, N-J+1 )
   CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
   CALL DTRSM( 'Right', 'Lower', 'Transpose',
  $            'Non-unit', N-J-JB+1, JB, ONE,
  $            A( J, J ), LDA, A( J+JB, J ), LDA )
   CALL DSYRK( 'Lower', 'No transpose',
  $            N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
  $            ONE, A( J+JB, J+JB ), LDA )
ENDDO

  13. FLAME • Partitioning Matrices

  14. FLAME

  15. FLAME

  16. FLAME
FLA_Part_2x2( A, &ATL, &ATR,
                 &ABL, &ABR, 0, 0, FLA_TL );
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
   b = min( FLA_Obj_length( ABR ), nb_alg );
   FLA_Repart_2x2_to_3x3(
      ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
      /* ******** */    /* **************** */
                        &A10, /**/ &A11, &A12,
      ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
      b, b, FLA_BR );
   /*---------------------------------------------*/
   FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
   FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
             FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
             FLA_ONE, A11, A21 );
   FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
             FLA_MINUS_ONE, A21, FLA_ONE, A22 );
   /*---------------------------------------------*/
   FLA_Cont_with_3x3_to_2x2(
      &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                          A10, A11, /**/ A12,
      /* ********** */    /* ************* */
      &ABL, /**/ &ABR,    A20, A21, /**/ A22,
      FLA_TL );
}

  17. FLAME

  18. Outline • Performance • FLAME • SuperMatrix • Data-flow • 2D data affinity • Contiguous storage • Conclusion

  19. SuperMatrix • Cholesky Factorization • Iteration 1 [block diagram: tasks Chol, Trsm, Trsm, Syrk, Gemm, Syrk]

  20. SuperMatrix • Cholesky Factorization • Iteration 2 [block diagram: tasks Chol, Trsm, Syrk]

  21. SuperMatrix • Cholesky Factorization • Iteration 3 [block diagram: task Chol]

  22. SuperMatrix • Analyzer • Delay execution and place tasks on queue • Tasks are function pointers annotated with input/output information • Compute dependence information (flow, anti, output) between all tasks • Create DAG of tasks

  23. SuperMatrix • Analyzer [diagram: task queue (Chol, Trsm, Trsm, Syrk, Gemm, Syrk, Chol, …) and the corresponding DAG of tasks]

  24. SuperMatrix • FLASH • Matrix of matrices

  25. SuperMatrix
FLA_Part_2x2( A, &ATL, &ATR,
                 &ABL, &ABR, 0, 0, FLA_TL );
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
   FLA_Repart_2x2_to_3x3(
      ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
      /* ******** */    /* **************** */
                        &A10, /**/ &A11, &A12,
      ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
      1, 1, FLA_BR );
   /*---------------------------------------------*/
   FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
   FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
               FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
               FLA_ONE, A11, A21 );
   FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
               FLA_MINUS_ONE, A21, FLA_ONE, A22 );
   /*---------------------------------------------*/
   FLA_Cont_with_3x3_to_2x2(
      &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                          A10, A11, /**/ A12,
      /* ********** */    /* ************* */
      &ABL, /**/ &ABR,    A20, A21, /**/ A22,
      FLA_TL );
}
FLASH_Queue_exec( );

  26. SuperMatrix • Dispatcher • Use DAG to execute tasks out-of-order in parallel • Akin to Tomasulo’s algorithm and instruction-level parallelism on blocks of computation • SuperScalar vs. SuperMatrix

  27. SuperMatrix • Dispatcher • 4 threads • 5 x 5 matrix of blocks • 35 tasks • 14 stages [DAG diagram of the Chol, Trsm, Syrk, and Gemm tasks]

  28. SuperMatrix

  29. SuperMatrix • Dispatcher • Tasks write to block [2,2] • No data affinity [DAG diagram of the Chol, Trsm, Syrk, and Gemm tasks]

  30. SuperMatrix • Data Affinity • Owner computes rule: tasks are assigned to threads according to the blocks of matrices they overwrite • CPU Affinity: threads are bound to processors

  31. SuperMatrix • Data Affinity • 2D block cyclic decomposition (ScaLAPACK) • 4 x 4 matrix of blocks assigned to 2 x 2 mesh

  32. SuperMatrix

  33. SuperMatrix • Contiguous Storage • One level of blocking • User inherently does not need to know about the underlying storage of data

  34. SuperMatrix

  35. SuperMatrix • GotoBLAS vs. MKL • All previous graphs link with GotoBLAS • MKL better tuned for small matrices on Itanium2

  36. SuperMatrix

  37. SuperMatrix

  38. SuperMatrix

  39. SuperMatrix

  40. SuperMatrix • Results • LAPACK chose a bad variant • Data affinity and contiguous storage have clear advantage • Multithreaded GotoBLAS tuned for large matrices • MKL better tuned for small matrices

  41. SuperMatrix

  42. Outline • Performance • FLAME • SuperMatrix • Conclusion

  43. Conclusion • Key Points • View blocks of matrices as units of computation instead of scalars • Apply instruction-level parallelism to blocks • Abstractions away from low-level details of scheduling

  44. Authors • Ernie Chan • Enrique S. Quintana-Ortí • Gregorio Quintana-Ortí • Robert van de Geijn • Universidad Jaume I • The University of Texas at Austin

  45. Acknowledgements • We thank the other members of the FLAME team for their support • Field Van Zee • Funding • NSF grant CCF-0540926

  46. References
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms on distributed memory using packed storage. In Computational Science - Para 2006, B. Kågström and E. Elmroth, eds., accepted for Lecture Notes in Computer Science. Springer-Verlag, 2007.
[6] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.

  47. Conclusion • More Information http://www.cs.utexas.edu/users/flame • Questions? echan@cs.utexas.edu
