
Data Transfer and Programming on Novel Heterogeneous Parallel Computers. Yifeng Chen (陈一峯), School of EECS, Peking University






Presentation Transcript


  1. Data Transfer and Programming on Novel Heterogeneous Parallel Computers. Yifeng Chen (陈一峯), School of EECS, Peking University

  2. Manycore-Accelerated Clusters: Tianhe-1A: 1 GPU / 2 CPUs; Tsubame: 3 GPUs / 2 CPUs; Mole-8.5: 6 GPUs / 2 CPUs; PKU McClus: 2 GPUs / 1 CPU

  3. Various Manycore Designs: Tilera, APU, Fermi/Kepler, MIC, Single-Chip Cloud, Larrabee, Cell

  4. Parallel Programming: Toolbox vs. Writing Case (OpenMP, CUDA, MPI; irregular structures, array-data-parallel, task-parallel)

  5. The “only-need-to-learn” of PARRAY: • Dimensions in a tree • A dimension may refer to another array type.

  6. Memory Layout Re-ordering
     for (int i = 0; i < 4; i++)
       for (int j = 0; j < 4; j++)
         memcpy(b + i*8 + j*2, a + i*2 + j*8, 2 * sizeof(float));
     #parray {paged float [4][4][2]} D
     #parray {paged float [[#D_0][#D_1]][#D_2]} A
     #parray {paged float [[#D_1][#D_0]][#D_2]} B
     #insert DataTransfer(a, A, b, B) {}
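  For reference, a self-contained C harness around the slide's hand-written loop (the fill pattern, array names, and printout are illustrative additions, not from the slide); it makes explicit that the re-ordering computes b[i][j][k] = a[j][i][k]:

      #include <stdio.h>
      #include <string.h>

      int main(void) {
          float a[4][4][2], b[4][4][2];
          /* encode the logical index (i,j,k) into each element */
          for (int i = 0; i < 4; i++)
              for (int j = 0; j < 4; j++)
                  for (int k = 0; k < 2; k++)
                      a[i][j][k] = i * 100 + j * 10 + k;
          /* swap the two outer dimensions, copying each innermost [2] row whole */
          for (int i = 0; i < 4; i++)
              for (int j = 0; j < 4; j++)
                  memcpy(&b[i][j][0], &a[j][i][0], 2 * sizeof(float));
          printf("b[1][3][0] = %g (expect 310, i.e. a[3][1][0])\n", b[1][3][0]);
          return 0;
      }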

  7. Network Communication: MPI_Alltoall
     #parray {mpi[4]} M
     #parray {paged float [4][2]} D
     #parray {[[#M][#D_0]][#D_1]} A
     #parray {[[#D_0][#M]][#D_1]} B
     #insert DataTransfer(a, A, b, B) {}
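  For comparison, a minimal hand-written MPI sketch of the same exchange (assuming 4 ranks, each holding a local float[4][2]; every rank sends one [2]-row to every rank):

      #include <mpi.h>

      int main(int argc, char **argv) {
          float a[4][2], b[4][2];   /* local pieces of the distributed arrays */
          MPI_Init(&argc, &argv);
          /* ... fill a with local data ... */
          /* send 2 floats to each of the 4 ranks, receive 2 floats from each */
          MPI_Alltoall(a, 2, MPI_FLOAT, b, 2, MPI_FLOAT, MPI_COMM_WORLD);
          MPI_Finalize();
          return 0;
      }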

  8. PCI Data Transfer among GPU0..GPU3: cudaMemcpy(d2d)
     #parray {pthd[4]} P
     #parray {dmem float [4][2]} D
     #parray {[[#P][#D_0]][#D_1]} A
     #parray {[[#D_0][#P]][#D_1]} B
     #insert DataTransfer(a, A, b, B) {}
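  A hedged CUDA sketch of the device-to-device traffic behind this pattern (the buffer names d_a/d_b and the helper function are illustrative; peer access may additionally be enabled with cudaDeviceEnablePeerAccess for direct PCI transfers):

      #include <cuda_runtime.h>

      #define NGPU 4

      /* d_a[g], d_b[g]: device buffers of 4*2 floats already allocated on GPU g */
      void all_to_all_d2d(float *d_a[NGPU], float *d_b[NGPU]) {
          for (int src = 0; src < NGPU; src++)
              for (int dst = 0; dst < NGPU; dst++)
                  /* row dst of GPU src's source goes to row src of GPU dst's target */
                  cudaMemcpyPeer(d_b[dst] + 2 * src, dst,
                                 d_a[src] + 2 * dst, src,
                                 2 * sizeof(float));
      }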

  9. MPI or IB/verbs vs. CUDA + Pthread
     MPI or IB/verbs:
       #parray {mpi[4]} M
       #parray {paged float [4][2]} D
       #parray {[[#M][#D_0]][#D_1]} A
       #parray {[[#D_0][#M]][#D_1]} B
       #mainhost {
         #detour M {
           float *a, *b;
           #create D(a)
           #create D(b)
           #insert DataTransfer(a, A, b, B) {}
           #destroy D(a)
           #destroy D(b)
         }
       }
     CUDA + Pthread:
       #parray {pthd[4]} P
       #parray {dmem float [4][2]} D
       #parray {[[#P][#D_0]][#D_1]} A
       #parray {[[#D_0][#P]][#D_1]} B
       #mainhost {
         #detour P {
           float *a, *b;
           INIT_GPU($tid$);
           #create D(a)
           #create D(b)
           #insert DataTransfer(a, A, b, B) {}
           #destroy D(a)
           #destroy D(b)
         }
       }
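  A sketch of what the pthd branch amounts to when written by hand: one host thread per GPU that binds to its device, allocates the device buffers, and releases them afterwards (the worker name and buffer sizes are illustrative, and the mapping to #create/#destroy is only indicative):

      #include <pthread.h>
      #include <cuda_runtime.h>

      #define NGPU 4

      static void *worker(void *arg) {
          int tid = (int)(long)arg;
          float *d_a, *d_b;
          cudaSetDevice(tid);                                 /* role of INIT_GPU($tid$) */
          cudaMalloc((void **)&d_a, 4 * 2 * sizeof(float));   /* #create D(a) */
          cudaMalloc((void **)&d_b, 4 * 2 * sizeof(float));   /* #create D(b) */
          /* ... inter-GPU DataTransfer as on slide 8 ... */
          cudaFree(d_a);                                      /* #destroy D(a) */
          cudaFree(d_b);                                      /* #destroy D(b) */
          return NULL;
      }

      int main(void) {
          pthread_t t[NGPU];
          for (long i = 0; i < NGPU; i++)
              pthread_create(&t[i], NULL, worker, (void *)i);
          for (int i = 0; i < NGPU; i++)
              pthread_join(t[i], NULL);
          return 0;
      }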

  10. Discontiguous Communication
      #parray { mpi[7168] } M
      #parray { pinned[2][14336][14336] } D
      #parray {[[#M][#D_0][#D_1]][#D_2]} S
      #parray {[[#D_1][#M][#D_0]][#D_2]} T
      #insert DataTransfer(t, T, s, S) {}
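  The transposition above moves non-contiguous slices between ranks; one common hand-written technique for such transfers is an MPI derived datatype instead of manual packing. A toy sketch with small placeholder sizes (not the 7168-rank, 14336-squared case, and not PARRAY's generated code):

      #include <mpi.h>

      int main(int argc, char **argv) {
          float s[2][8][8];   /* toy stand-in for one rank's pinned[2][N][N] piece */
          MPI_Datatype slice;
          MPI_Init(&argc, &argv);
          /* the slice s[0..1][j][0..7] for a fixed j: 2 blocks of 8 contiguous
             floats, separated by a stride of 8*8 = 64 floats */
          MPI_Type_vector(2, 8, 64, MPI_FLOAT, &slice);
          MPI_Type_commit(&slice);
          /* ... use `slice` as the send type in point-to-point or collective calls ... */
          (void)s;
          MPI_Type_free(&slice);
          MPI_Finalize();
          return 0;
      }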

  11. Hierarchical SPMDs
      #mainhost {
        #parallel {
          #detour pthd[3] {
            ……
            #detour mpi[4] { …… }
          }
          ……
          #detour cuda[2][128] {
            ……
            #detour cuda[4][256] { …… }
            ……
          }
          ……
        }
      }
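  The right-hand branch (a cuda detour nested inside another cuda detour) can be approximated by hand with CUDA dynamic parallelism; a minimal sketch under that assumption (kernel names and bodies are placeholders; requires compute capability 3.5+ and nvcc -rdc=true):

      #include <cuda_runtime.h>

      __global__ void inner(void) {
          /* inner SPMD level: 4 blocks x 256 threads */
      }

      __global__ void outer(void) {
          /* outer SPMD level: 2 blocks x 128 threads */
          if (threadIdx.x == 0)          /* one child launch per outer block */
              inner<<<4, 256>>>();
      }

      int main(void) {
          outer<<<2, 128>>>();
          cudaDeviceSynchronize();
          return 0;
      }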

  12. GPU SGEMM

  13. 620 Gflops on Fermi C1060

  14. Large FFT (ICS'10, PPoPP'12)

  15. Direct Simulation of Turbulent Flows
      • Scale: 12 distributed arrays, 128 TB; the entire Tianhe-1A with 7168 GPUs
      • Progress: 4096³ completed, 8192³ half-way, 14336³ tested for performance
      • Software technologies: PARRAY (ACM PPoPP'12); the code is only 300 lines

  16. Discussions
      • Performance transparency: macros are compiled out.
      • Completeness: any index expressions using add/mul/mod/div/fcomp (sketched below).
      • Regular structures come from applications and the target manycore hardware.
      • Irregular structures are allowed but better supported by other tools.
      • Typical training = 3 days.
      • Released at http://code.google.com/p/parray-parallel-array/
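  To make "any index expressions using add/mul/mod/div" concrete, here is a hand-written sketch (not generated code) of the index arithmetic behind the slide-6 layouts, mapping an element's flat offset in A = [[#D_0][#D_1]][#D_2] to its flat offset in B = [[#D_1][#D_0]][#D_2]:

      /* A stores (i,j,k) at i*8 + j*2 + k; B stores it at j*8 + i*2 + k. */
      int offset_in_B(int offset_in_A) {
          int i = offset_in_A / 8;        /* dimension D_0 */
          int j = (offset_in_A % 8) / 2;  /* dimension D_1 */
          int k = offset_in_A % 2;        /* dimension D_2 */
          return j * 8 + i * 2 + k;
      }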
