
Runtime Data Flow Graph Scheduling of Matrix Computations


Presentation Transcript


  1. Runtime Data Flow Graph Scheduling of Matrix Computations Ernie Chan

  2. Teaser • [Performance chart; slide annotations: "Better" (arrow) and "Theoretical Peak Performance"]

  3. Goals • Programmability • Use tools provided by FLAME • Parallelism • Directed acyclic graph (DAG) scheduling

  4. Outline • Introduction • SuperMatrix • Scheduling • Performance • Conclusion

  5. SuperMatrix • Formal Linear Algebra Methods Environment (FLAME) • High-level abstractions for expressing linear algebra algorithms • Cholesky Factorization

  6. SuperMatrix
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */  /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*-----------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* *************** */ /* ***************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  7. SuperMatrix • Cholesky Factorization • Each iteration of the blocked algorithm performs • CHOL: Chol( A11 ) • TRSM: A21 A11^-T • SYRK: A22 – A21 A21^T (shown on the slide for Iterations 1 and 2)

  8. SuperMatrix • LAPACK-style Implementation
      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

  9. SuperMatrix • FLASH • Storage-by-blocks, algorithm-by-blocks
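A minimal sketch of what storage-by-blocks can look like in C, to make the FLASH idea concrete; the type and field names (block_matrix, blk) are illustrative assumptions, not the actual FLASH data structure:

    #include <stdlib.h>

    /* Storage-by-blocks: the matrix is an m_blks x n_blks array of pointers,
       each pointing to a contiguously stored b x b submatrix (a block).     */
    typedef struct {
        int      b;        /* block (tile) dimension                        */
        int      m_blks;   /* number of block rows                          */
        int      n_blks;   /* number of block columns                       */
        double **blk;      /* blk[ i * n_blks + j ] -> b x b block (i, j)   */
    } block_matrix;

    block_matrix *block_matrix_create( int m_blks, int n_blks, int b )
    {
        block_matrix *A = malloc( sizeof( *A ) );
        A->b = b;  A->m_blks = m_blks;  A->n_blks = n_blks;
        A->blk = malloc( (size_t) m_blks * n_blks * sizeof( double * ) );
        for ( int i = 0; i < m_blks * n_blks; i++ )
            A->blk[i] = calloc( (size_t) b * b, sizeof( double ) );
        return A;
    }

    /* An algorithm-by-blocks then operates on whole blocks, e.g. A(i, j): */
    double *block_of( block_matrix *A, int i, int j )
    {
        return A->blk[ i * A->n_blks + j ];
    }

An algorithm-by-blocks, such as the Cholesky loop on the next slide, touches these blocks as indivisible units, which is what lets each kernel call become one task in the DAG.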

  10. SuperMatrix
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */  /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* *************** */ /* ***************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  11. SuperMatrix • Cholesky Factorization • Iteration 1 • CHOL0: Chol( A0,0 )

  12. SuperMatrix • Cholesky Factorization • Iteration 1 • CHOL0: Chol( A0,0 ) • TRSM1: A1,0 A0,0^-T • TRSM2: A2,0 A0,0^-T

  13. SuperMatrix • Cholesky Factorization • Iteration 1 • CHOL0: Chol( A0,0 ) • TRSM1: A1,0 A0,0^-T • TRSM2: A2,0 A0,0^-T • SYRK3: A1,1 – A1,0 A1,0^T • GEMM4: A2,1 – A2,0 A1,0^T • SYRK5: A2,2 – A2,0 A2,0^T

  14. SuperMatrix • Cholesky Factorization • Iteration 2 • CHOL6: Chol( A1,1 ) • TRSM7: A2,1 A1,1^-T • SYRK8: A2,2 – A2,1 A2,1^T

  15. SuperMatrix • Cholesky Factorization • Iteration 3 • CHOL9: Chol( A2,2 )

  16. SuperMatrix • Cholesky Factorization • [complete DAG for the 3 × 3 matrix of blocks]

  17. SuperMatrix • Separation of Concerns • Analyzer • Decomposes subproblems into component tasks • Stores tasks sequentially in a global task queue • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task • Dispatcher • Spawns threads • Schedules and dispatches tasks to threads in parallel
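As a rough illustration of how such an analyzer can build the DAG purely from each task's input and output blocks, here is a hedged C sketch; the task structure and the dependence test are a simplification of the idea, not the actual SuperMatrix implementation:

    #define MAX_SUCC 64

    typedef struct task {
        void        (*func)( struct task * ); /* wrapper that calls the kernel (CHOL, TRSM, ...) */
        const void   *in[3];                  /* blocks this task reads            */
        const void   *out[1];                 /* block this task overwrites        */
        int           n_in, n_out;
        struct task  *succ[MAX_SUCC];         /* tasks that must wait on this one  */
        int           n_succ;
        int           n_pred;                 /* count of unsatisfied dependencies */
    } task;

    /* Does t read or write block b?  (writes_only restricts the check to writes.) */
    static int touches( const task *t, const void *b, int writes_only )
    {
        for ( int i = 0; i < t->n_out; i++ ) if ( t->out[i] == b ) return 1;
        if ( !writes_only )
            for ( int i = 0; i < t->n_in; i++ ) if ( t->in[i] == b ) return 1;
        return 0;
    }

    /* Append t_new as task number n in the global task queue, linking it to every
       earlier task with which it shares a block where at least one access is a
       write (flow, anti, or output dependence).                                   */
    void analyze( task *queue[], int n, task *t_new )
    {
        for ( int k = 0; k < n; k++ ) {
            task *t_old = queue[k];
            int   dep   = 0;
            for ( int i = 0; i < t_new->n_out; i++ )   /* anti / output dependences */
                dep |= touches( t_old, t_new->out[i], 0 );
            for ( int i = 0; i < t_new->n_in; i++ )    /* flow (read-after-write)   */
                dep |= touches( t_old, t_new->in[i], 1 );
            if ( dep && t_old->n_succ < MAX_SUCC ) {
                t_old->succ[ t_old->n_succ++ ] = t_new;
                t_new->n_pred++;
            }
        }
        queue[n] = t_new;
    }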

  18. Outline • Introduction • SuperMatrix • Scheduling • Performance • Conclusion

  19. Scheduling • Dispatcher
      foreach task in DAG do
        if task is ready then
          Enqueue task
        end
      end
      while tasks are available do
        Dequeue task
        Execute task
        foreach dependent task do
          Update dependent task
          if dependent task is ready then
            Enqueue dependent task
          end
        end
      end

  20. Scheduling • Dispatcher (pseudocode repeated from slide 19)
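The dispatcher pseudocode on slide 19 maps almost directly onto worker threads draining a shared ready queue. A hedged C sketch, reusing the hypothetical task type from the sketch after slide 17; ready_enqueue, ready_dequeue, and tasks_remaining stand in for any thread-safe queue (FIFO, priority, per-thread, ...):

    extern void  ready_enqueue( task *t );
    extern task *ready_dequeue( void );        /* returns NULL if nothing is ready  */
    extern int   tasks_remaining( void );      /* tasks not yet executed overall    */

    void dispatcher( task *all_tasks[], int n_tasks )
    {
        /* Seed the ready queue with every task that has no predecessors. */
        for ( int i = 0; i < n_tasks; i++ )
            if ( all_tasks[i]->n_pred == 0 )
                ready_enqueue( all_tasks[i] );

        /* Each worker thread runs this loop, e.g. inside an OpenMP parallel region. */
        while ( tasks_remaining() > 0 ) {
            task *t = ready_dequeue();
            if ( t == NULL ) continue;         /* nothing ready yet; try again      */
            t->func( t );                      /* execute task                      */
            for ( int i = 0; i < t->n_succ; i++ ) {
                task *d = t->succ[i];          /* update dependent task             */
                int remaining;
                #pragma omp atomic capture
                remaining = --d->n_pred;
                if ( remaining == 0 )          /* dependent task is now ready       */
                    ready_enqueue( d );
            }
        }
    }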

  21. Scheduling • Supermarket • One line per cashier • Efficient enqueue and dequeue • Schedule depends on task-to-thread assignment • Bank • One line for all tellers • Enqueue and dequeue become bottlenecks • Dynamic dispatching of tasks to threads

  22. Scheduling • Single Queue • Set of all ready and available tasks • FIFO, priority • [diagram: one shared queue; enqueue feeds it, PE0 … PEp-1 dequeue from it]

  23. Scheduling • Multiple Queues • Work stealing, data affinity • [diagram: one queue per processing element PE0 … PEp-1]

  24. Scheduling • Data Affinity • Assign all tasks that write to a particular block to the same thread • Owner computes rule • 2D block cyclic distribution (3 × 3 grid of blocks over 4 threads on the slide: 2 0 0 / 3 1 1 / 2 0 0) • Execution Trace • Cholesky factorization: • Total time: 2D data affinity ~ FIFO queue • Idle threads: 2D ≈ 27% and FIFO ≈ 17%
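The owner-computes rule under a 2D block cyclic distribution reduces to a one-line mapping from block coordinates to a thread. A small C sketch (the function name and the 2 × 2 thread grid are illustrative assumptions):

    /* Owner computes: the thread that owns block (i, j) executes every task
       that writes block (i, j).  The p threads are viewed as a tr x tc grid. */
    int owner_of_block( int i, int j, int tr, int tc )
    {
        return ( i % tr ) * tc + ( j % tc );
    }

    /* Example: with 4 threads as a 2 x 2 grid, owner_of_block( 2, 1, 2, 2 ) == 1,
       so all tasks that overwrite block (2, 1) go to thread 1's queue.            */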

  25. Scheduling • Data Granularity • Cost of a task >> cost of enqueue and dequeue • Single vs. Multiple Queues • FIFO queue improves load balance • 2D data affinity decreases data communication • Combine the best aspects of both!

  26. Scheduling • Cache Affinity • Single priority queue sorted by task height • Software cache • LRU • Line = block • Fully associative • [diagram: shared queue feeding PE0 … PEp-1, each with its own software cache $0 … $p-1]

  27. Scheduling • Cache Affinity • Dequeue • Search the queue for a task with an output block in the thread's software cache • If found, return that task • Otherwise return the head task • Enqueue • Insert task • Sort queue by task heights • Dispatcher • Update software caches via a cache coherency protocol with write invalidation
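A hedged C sketch of the cache-affinity dequeue from slides 26 and 27, reusing the hypothetical task type from the sketch after slide 17; the list node and software-cache types are simplifications, and locking around the shared queue is omitted:

    #include <stdlib.h>

    typedef struct {                     /* per-thread software cache: fully      */
        const void *line[8];             /* associative, LRU, one block per line  */
        int         n_lines;
    } soft_cache;

    typedef struct node { task *t; struct node *next; } node;

    static int in_cache( const soft_cache *c, const void *blk )
    {
        for ( int i = 0; i < c->n_lines; i++ )
            if ( c->line[i] == blk ) return 1;
        return 0;
    }

    /* Dequeue: prefer a task whose output block this thread already caches;
       otherwise fall back to the head task of the height-sorted queue.      */
    task *affinity_dequeue( node **head, const soft_cache *mine )
    {
        for ( node **p = head; *p != NULL; p = &(*p)->next )
            if ( in_cache( mine, (*p)->t->out[0] ) ) {
                node *hit = *p;
                task *t   = hit->t;
                *p = hit->next;          /* unlink the matching entry            */
                free( hit );
                return t;
            }
        if ( *head == NULL ) return NULL;
        node *h = *head;                 /* no cached block found: take the head */
        task *t = h->t;
        *head   = h->next;
        free( h );
        return t;
    }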

  28. Scheduling • Multiple Graphics Processing Units • View a GPU as a single accelerator as opposed to being composed of hundreds of streaming processors • Must explicitly transfer data from main memory to GPU • No hardware cache coherency provided • Hybrid Execution Model • Execute tasks on both CPU and GPU

  29. Scheduling • Software Managed Cache Coherency • Use the software caches developed for cache affinity to handle data transfers! • Allow blocks to remain dirty on a GPU until they are requested by another GPU • Apply any scheduling algorithm when utilizing GPUs, particularly cache affinity
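One way to realize such software-managed coherence is to tag each block with where its current copy lives and to copy lazily right before a task needs it on a particular GPU. A hedged CUDA-runtime sketch: the gpu_block descriptor and acquire_on_gpu are illustrative assumptions; only cudaSetDevice and cudaMemcpy are real API calls:

    #include <cuda_runtime.h>

    #define N_GPUS 4

    typedef struct {
        double *host;         /* copy in main memory                                  */
        double *dev[N_GPUS];  /* pre-allocated device buffer on each GPU              */
        int     owner_gpu;    /* GPU holding the dirty copy, or -1 if host is current */
        size_t  bytes;
    } gpu_block;

    /* Make blk current on gpu_id before a task there reads or writes it. */
    void acquire_on_gpu( gpu_block *blk, int gpu_id )
    {
        if ( blk->owner_gpu == gpu_id ) return;     /* dirty copy is already here   */
        if ( blk->owner_gpu >= 0 ) {                /* another GPU holds the dirty  */
            cudaSetDevice( blk->owner_gpu );        /* copy: flush it to the host   */
            cudaMemcpy( blk->host, blk->dev[ blk->owner_gpu ],
                        blk->bytes, cudaMemcpyDeviceToHost );
        }
        cudaSetDevice( gpu_id );
        cudaMemcpy( blk->dev[gpu_id], blk->host, blk->bytes,
                    cudaMemcpyHostToDevice );
        blk->owner_gpu = gpu_id;   /* block may stay dirty here until requested elsewhere */
    }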

  30. Outline • Introduction • SuperMatrix • Scheduling • Performance • Conclusion

  31. Performance • CPU Target Architecture • 4 socket 2.66 GHz Intel Dunnington • 24 cores • Linux and Windows • 16 MB shared L3 cache per socket • OpenMP • Intel compiler 11.1 • BLAS • Intel MKL 10.2

  32. Performance • Implementations • SuperMatrix + serial MKL • FIFO queue, cache affinity • FLAME + multithreaded MKL • Multithreaded MKL • PLASMA + serial MKL • Double precision real floating point arithmetic • Tuned block size

  33. Performance • [performance graph]

  34. Performance • [performance graph]

  35. Performance • Inversion of a Symmetric Positive Definite Matrix • Cholesky factorization CHOL • Inversion of a triangular matrix TRINV • Triangular matrix multiplication by its transpose TTMM

  36. Performance • Inversion of an SPD Matrix

  37. Performance • [performance graph]

  38. Performance • Generalized Eigenproblem A x = λ B x where A is symmetric and B is symmetric positive definite • Cholesky Factorization B = L L^T where L is a lower triangular matrix so that A x = λ L L^T x

  39. Performance • Multiplying the equation on the left by L^-1 and inserting L^-T L^T gives L^-1 A L^-T ( L^T x ) = λ ( L^T x ) • Standard Form C y = λ y where C = L^-1 A L^-T and y = L^T x • Reduction from Symmetric Definite Generalized Eigenproblem to Standard Form
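Written out, the reduction on slides 38 and 39 follows the standard derivation (restated here in LaTeX, not verbatim from the slides):

    A x = \lambda B x, \qquad B = L L^{T}
    \;\Longrightarrow\; L^{-1} A \,\bigl( L^{-T} L^{T} \bigr)\, x = \lambda\, L^{T} x
    \;\Longrightarrow\; \underbrace{\bigl( L^{-1} A L^{-T} \bigr)}_{C}\,
                        \underbrace{\bigl( L^{T} x \bigr)}_{y}
                        = \lambda\, \underbrace{\bigl( L^{T} x \bigr)}_{y}
    \;\Longrightarrow\; C y = \lambda y .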

  40. Performance • Reduction from …

  41. Performance • [performance graph]

  42. Performance • GPU Target Architecture • 2 socket 2.82 GHz Intel Harpertown with NVIDIA Tesla S1070 • 4 × 602 MHz Tesla C1060 GPUs • 4 GB DDR memory per GPU • Linux • CUDA • CUBLAS 3.0 • Single precision real floating point arithmetic

  43. Performance • [performance graph]

  44. Performance • Results • Cache affinity vs. FIFO queue • SuperMatrix out-of-order vs. PLASMA in-order • High variability of work stealing vs. predictable cache affinity performance • Strong scalability on CPU and GPU • Representative performance of other dense linear algebra operations

  45. Outline • Introduction • SuperMatrix • Scheduling • Performance • Conclusion

  46. Conclusion • Separation of Concerns • Allows us to experiment with different scheduling algorithms • Port the runtime system to multiple GPUs • Locality, Locality, Locality • Data communication is as important as load balance when scheduling matrix computations

  47. Current Work • Intel Single-chip Cloud Computer • 48 cores on a single die • Cores communicate via message passing buffer • RCCE_send • RCCE_recv • Software managed cache coherency for off-chip shared memory • RCCE_shmalloc

  48. Acknowledgments • We thank the other members of the FLAME team for their support • Funding • Intel • Microsoft • NSF grants • CCF–0540926 • CCF–0702714

  49. Conclusion • More Information http://www.cs.utexas.edu/~flame • Questions? echan@cs.utexas.edu
