
Presentation Transcript


  1. Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures Ernie Chan

  2. How to Program SCC? • 48 cores in 6×4 mesh with 2 cores per tile • 4 DDR3 memory controllers [Figure: 6×4 mesh of tiles connected by routers (R); each tile holds Core 0 and Core 1 with private L2 caches (L2$0, L2$1) and a message passing buffer (MPB); four memory controllers and a system interface sit at the edges of the mesh] MARC symposium
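To make the topology above concrete, here is a small C sketch that maps a core id (0–47) to its tile and mesh coordinates. The numbering convention (two consecutive ids per tile, tiles counted row-major across the 6×4 mesh) is an assumption made only for illustration; the SCC's actual core numbering may differ.

```c
/* Illustrative only: assumes cores are numbered two per tile and
 * tiles are numbered row-major across the 6-wide, 4-high mesh. */
#include <stdio.h>

int main(void)
{
    for (int core = 0; core < 48; core++) {
        int tile   = core / 2;   /* 2 cores per tile              */
        int local  = core % 2;   /* Core 0 or Core 1 on that tile */
        int tile_x = tile % 6;   /* column in the 6x4 mesh        */
        int tile_y = tile / 6;   /* row in the 6x4 mesh           */
        printf("core %2d -> tile %2d (x=%d, y=%d), local core %d\n",
               core, tile, tile_x, tile_y, local);
    }
    return 0;
}
```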

  3. Outline • How to Program SCC? • Elemental • Collective Communication • Off-Chip Shared-Memory • Conclusion MARC symposium

  4. Elemental • New, Modern Distributed-Memory Dense Linear Algebra Library • Replacement for PLAPACK and ScaLAPACK • Object-oriented data structures for matrices • Coded in C++ • Torus-wrap/elemental mapping of matrices to a two-dimensional process grid • Implemented entirely using bulk synchronous communication MARC symposium

  5. Elemental • Two-Dimensional Process Grid (2 × 3): 0 2 4 / 1 3 5 • Tile the process grid over the matrix to assign each matrix element to a process MARC symposium

  6. Elemental • Two-Dimensional Process Grid (2 × 3): 0 2 4 / 1 3 5 • Tile the process grid over the matrix to assign each matrix element to a process MARC symposium

  7. Elemental • Two-Dimensional Process Grid (2 × 3): 0 2 4 / 1 3 5 • Tile the process grid over the matrix to assign each matrix element to a process MARC symposium
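The elemental (torus-wrap) mapping illustrated on slides 5–7 can be sketched in a few lines of C: entry (i, j) is owned by process (i mod r, j mod c) on an r × c grid. The 2 × 3 grid and the column-major rank ordering below mirror the slides; the code is an illustration of the distribution, not Elemental's API.

```c
/* Sketch of the elemental (cyclic) mapping onto a 2 x 3 process grid. */
#include <stdio.h>

#define GRID_ROWS 2
#define GRID_COLS 3

/* Rank of the process that owns matrix entry (i, j),
 * with ranks ordered column-major over the grid. */
static int owner(int i, int j)
{
    return (i % GRID_ROWS) + (j % GRID_COLS) * GRID_ROWS;
}

int main(void)
{
    int m = 6, n = 6;                 /* small example matrix */
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++)
            printf("%d ", owner(i, j));
        printf("\n");                 /* prints the tiling of 0..5 */
    }
    return 0;
}
```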

  8. Elemental • Redistributing the Matrix Over a Process Grid • Collective communication MARC symposium

  9. Outline • How to Program SCC? • Elemental • Collective Communication • Off-Chip Shared-Memory • Conclusion MARC symposium

  10. Collective Communication • RCCE Message Passing API • Blocking send and receive: int RCCE_send( char *buf, size_t num, int dest ); int RCCE_recv( char *buf, size_t num, int src ); • Potential for deadlock [Figure: cores 0–5 in a cycle, each blocked sending to its neighbor] MARC symposium

  11. Collective Communication • Avoiding Deadlock • Even number of cores in cycle [Figure: two phases of the exchange among cores 0–5, alternating send and receive by parity] MARC symposium

  12. Collective Communication • Avoiding Deadlock • Odd number of cores in cycle [Figure: three phases of the exchange among cores 0–4] MARC symposium
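A minimal sketch of the parity trick from slides 11–12, written against the RCCE_send/RCCE_recv signatures on slide 10: even-ranked cores send before they receive, odd-ranked cores do the opposite, so a cyclic shift over an even number of cores cannot deadlock. RCCE_ue() and RCCE_num_ues() are assumed to return the calling core's rank and the core count, as in the RCCE API; the odd-cycle case needs the extra phase shown on slide 12 and is omitted here.

```c
#include "RCCE.h"

/* Deadlock-free cyclic shift with blocking send/receive,
 * assuming an even number of cores. */
void cyclic_shift(char *sendbuf, char *recvbuf, size_t num)
{
    int me    = RCCE_ue();
    int cores = RCCE_num_ues();
    int right = (me + 1) % cores;          /* destination */
    int left  = (me - 1 + cores) % cores;  /* source      */

    if (me % 2 == 0) {                     /* even ranks: send, then receive */
        RCCE_send(sendbuf, num, right);
        RCCE_recv(recvbuf, num, left);
    } else {                               /* odd ranks: receive, then send  */
        RCCE_recv(recvbuf, num, left);
        RCCE_send(sendbuf, num, right);
    }
}
```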

  13. Collective Communication • Scatter int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents before the scatter] MARC symposium

  14. Collective Communication • Scatter int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents after the scatter] MARC symposium

  15. Collective Communication • Allgather int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm ); [Figure: buffer contents before the allgather] MARC symposium

  16. Collective Communication • Allgather int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm ); [Figure: buffer contents after the allgather] MARC symposium

  17. Collective Communication • Minimum Spanning Tree Algorithm • Scatter MARC symposium

  18. Collective Communication • Minimum Spanning Tree Algorithm • Scatter MARC symposium

  19. Collective Communication • Minimum Spanning Tree Algorithm • Scatter MARC symposium

  20. Collective Communication • Minimum Spanning Tree Algorithm • Scatter MARC symposium
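A sketch of the minimum spanning tree scatter from slides 17–20, built only on the blocking send/receive of slide 10; it is an illustration, not the actual RCCE_scatter implementation. Every core is assumed to hold a buffer of p × num bytes in which the data destined for rank r sits at offset r × num; at each level the current range of ranks is halved and the half not containing the current root is forwarded in a single message.

```c
#include "RCCE.h"

/* MST scatter over ranks first..last (inclusive); call initially as
 * mst_scatter(buf, num, root, 0, RCCE_num_ues() - 1). */
void mst_scatter(char *buf, size_t num, int root, int first, int last)
{
    int me = RCCE_ue();
    if (first == last)
        return;                               /* one rank left: done */

    int mid     = (first + last) / 2;         /* split: [first,mid] and [mid+1,last] */
    int lo_root = (root <= mid) ? root : first;
    int hi_root = (root >  mid) ? root : mid + 1;

    if (root <= mid) {                        /* root keeps the low half */
        if (me == root)
            RCCE_send(buf + (size_t)(mid + 1) * num,
                      (size_t)(last - mid) * num, hi_root);
        if (me == hi_root)
            RCCE_recv(buf + (size_t)(mid + 1) * num,
                      (size_t)(last - mid) * num, root);
    } else {                                  /* root keeps the high half */
        if (me == root)
            RCCE_send(buf + (size_t)first * num,
                      (size_t)(mid - first + 1) * num, lo_root);
        if (me == lo_root)
            RCCE_recv(buf + (size_t)first * num,
                      (size_t)(mid - first + 1) * num, root);
    }

    if (me <= mid) mst_scatter(buf, num, lo_root, first, mid);
    else           mst_scatter(buf, num, hi_root, mid + 1, last);
}
```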

  21. Collective Communication • Cyclic (Bucket) Algorithm • Allgather MARC symposium

  22. Collective Communication • Cyclic (Bucket) Algorithm • Allgather MARC symposium

  23. Collective Communication • Cyclic (Bucket) Algorithm • Allgather MARC symposium

  24. Collective Communication • Cyclic (Bucket) Algorithm • Allgather MARC symposium

  25. Collective Communication • Cyclic (Bucket) Algorithm • Allgather MARC symposium

  26. Collective Communication • Cyclic (Bucket) Algorithm • Allgather MARC symposium
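A sketch of the cyclic (bucket) allgather from slides 21–26, again using only blocking send/receive plus the parity ordering discussed earlier (so it assumes an even number of cores). Each core's buffer holds p blocks of num bytes, with rank r's contribution at offset r × num; in each of p − 1 steps a core passes the block it most recently received to its right neighbor and receives a new block from its left neighbor. This is an illustration, not the RCCE_allgather implementation.

```c
#include "RCCE.h"

/* Bucket (ring) allgather: after cores-1 steps every core holds
 * all cores' blocks in its buffer. */
void bucket_allgather(char *buf, size_t num)
{
    int me    = RCCE_ue();
    int cores = RCCE_num_ues();
    int right = (me + 1) % cores;
    int left  = (me - 1 + cores) % cores;

    for (int step = 0; step < cores - 1; step++) {
        /* Block to pass on this step and slot to receive into. */
        int send_block = (me   - step + cores) % cores;
        int recv_block = (left - step + cores) % cores;
        char *sendptr  = buf + (size_t)send_block * num;
        char *recvptr  = buf + (size_t)recv_block * num;

        if (me % 2 == 0) {        /* parity ordering avoids deadlock */
            RCCE_send(sendptr, num, right);
            RCCE_recv(recvptr, num, left);
        } else {
            RCCE_recv(recvptr, num, left);
            RCCE_send(sendptr, num, right);
        }
    }
}
```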

  27. Collective Communication MARC symposium

  28. Elemental MARC symposium

  29. Elemental MARC symposium

  30. Elemental MARC symposium

  31. Elemental MARC symposium

  32. Outline • How to Program SCC? • Elemental • Collective Communication • Off-Chip Shared-Memory • Conclusion MARC symposium

  33. Off-Chip Shared-Memory • Distributed vs. Shared-Memory [Figure: the SCC tile mesh with routers, four memory controllers, and the system interface; the off-chip memory behind the controllers can be viewed either as per-core distributed memory or as shared-memory] MARC symposium

  34. Off-Chip Shared-Memory • SuperMatrix • Map dense matrix computation to a directed acyclic graph • No matrix distribution • Store DAG and matrix on off-chip shared-memory [Figure: DAG for the Cholesky factorization of a 3×3 blocked matrix with tasks CHOL0, TRSM1, TRSM2, SYRK3, GEMM4, SYRK5, CHOL6, TRSM7, SYRK8, CHOL9] MARC symposium
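The ten tasks in the figure are what an algorithm-by-blocks for the Cholesky factorization of a 3 × 3 blocked matrix unrolls into. The C sketch below prints that task list in the order shown; a SuperMatrix-style runtime would additionally record the read/write dependencies between tasks and execute them out of order from the DAG, which is not shown here.

```c
/* Unroll a blocked (lower-triangular) Cholesky factorization into
 * its task list; nb = 3 reproduces CHOL0 .. CHOL9 from the figure. */
#include <stdio.h>

int main(void)
{
    int nb = 3;     /* number of blocks along one dimension */
    int id = 0;     /* running task number, as in the figure */

    for (int k = 0; k < nb; k++) {
        printf("CHOL%d: factor A[%d][%d]\n", id++, k, k);
        for (int i = k + 1; i < nb; i++)
            printf("TRSM%d: solve A[%d][%d] against A[%d][%d]\n",
                   id++, i, k, k, k);
        for (int i = k + 1; i < nb; i++) {
            for (int j = k + 1; j < i; j++)
                printf("GEMM%d: A[%d][%d] -= A[%d][%d] * A[%d][%d]'\n",
                       id++, i, j, i, k, j, k);
            printf("SYRK%d: A[%d][%d] -= A[%d][%d] * A[%d][%d]'\n",
                   id++, i, i, i, k, i, k);
        }
    }
    return 0;
}
```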

  35. Off-Chip Shared-Memory • Non-cacheable vs. Cacheable Shared-Memory • Non-cacheable • Allow for a simple programming interface • Poor performance • Cacheable • Need software managed cache coherency mechanism • Execute on data stored in cache • Interleave distributed and shared-memory programming concepts MARC symposium
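A minimal sketch of the software-managed coherency pattern slide 35 alludes to for cacheable shared-memory: invalidate before reading a block that another core may have written, and flush after writing a block that another core will read. The helpers shm_invalidate() and shm_flush() are hypothetical placeholders for whatever cache-control mechanism the platform provides; they are not part of the RCCE API shown earlier.

```c
#include <stddef.h>

/* Hypothetical cache-control helpers; the empty bodies exist only so
 * the sketch compiles.  A real system would invalidate or write back
 * the cache lines covering [addr, addr + bytes). */
static void shm_invalidate(void *addr, size_t bytes) { (void)addr; (void)bytes; }
static void shm_flush(void *addr, size_t bytes)      { (void)addr; (void)bytes; }

/* Run one task (e.g. a CHOL or TRSM from the DAG) on a block that
 * lives in cacheable off-chip shared-memory. */
void run_task(double *block, size_t n, void (*task)(double *, size_t))
{
    /* See the latest copy written by whichever core ran the
     * predecessor task. */
    shm_invalidate(block, n * sizeof(double));

    /* Operate on cached data for performance. */
    task(block, n);

    /* Write the update back to off-chip memory before dependent
     * tasks on other cores are allowed to start. */
    shm_flush(block, n * sizeof(double));
}
```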

  36. Off-Chip Shared-Memory MARC symposium

  37. Outline • How to Program SCC? • Elemental • Collective Communication • Off-Chip Shared-Memory • Conclusion MARC symposium

  38. Conclusion • Distributed vs. Shared-Memory • Elemental vs. SuperMatrix? • A Collective Communication Library for SCC • RCCE_comm: released under LGPL and available on the public Intel SCC software repository http://marcbug.scc-dc.com/svn/repository/trunk/rcce_applications/UT/RCCE_comm/ MARC symposium

  39. Acknowledgments • We thank the other members of the FLAME team for their support • Bryan Marker, Jack Poulson, and Robert van de Geijn • We thank Intel for access to SCC and their help • Timothy G. Mattson and Rob F. Van Der Wijngaart • Funding • Intel Corporation • National Science Foundation MARC symposium

  40. Conclusion • More Information http://www.cs.utexas.edu/~flame • Questions? echan@cs.utexas.edu MARC symposium
