
Data Transfer and Programming on Novel Heterogeneous Parallel Computers. Yifeng Chen (陈一峯), School of EECS, Peking University






Presentation Transcript


  1. Data Transfer and Programming on Novel Heterogeneous Parallel Computers. Yifeng Chen (陈一峯), School of EECS, Peking University

  2. Manycore-Accelerated Clusters: Tianhe-1A: 1 GPU / 2 CPUs; Tsubame: 3 GPUs / 2 CPUs; Mole-8.5: 6 GPUs / 2 CPUs; PKU McClus: 2 GPUs / 1 CPU

  3. Various Manycore Designs: Tilera, APU, Fermi/Kepler, MIC, Single-Chip Cloud, Larrabee, Cell

  4. Parallel Programming: Toolbox vs. Writing Case (OpenMP, CUDA, MPI; irregular structures, array-data-parallel, task-parallel)

  5. The “only-need-to-learn” of PARRAY: • Dimensions in a tree • A dimension may refer to another array type.

  6. Memory Layout Re-ordering
     for (int i = 0; i < 4; i++)
       for (int j = 0; j < 4; j++)
         memcpy(b + i*8 + j*2, a + i*2 + j*8, 2 * sizeof(float));
     #parray {paged float [4][4][2]} D
     #parray {paged float [[#D_0][#D_1]][#D_2]} A
     #parray {paged float [[#D_1][#D_0]][#D_2]} B
     #insert DataTransfer(a, A, b, B) {}
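  For reference, a self-contained C harness around the slide's hand-written loop (the fill pattern, array names, and printout are illustrative additions, not from the slide); it makes explicit that the re-ordering computes b[i][j][k] = a[j][i][k]:

      #include <stdio.h>
      #include <string.h>

      int main(void) {
          float a[4][4][2], b[4][4][2];
          /* encode the logical index (i,j,k) into each element */
          for (int i = 0; i < 4; i++)
              for (int j = 0; j < 4; j++)
                  for (int k = 0; k < 2; k++)
                      a[i][j][k] = i * 100 + j * 10 + k;
          /* swap the two outer dimensions, copying each innermost [2] row whole */
          for (int i = 0; i < 4; i++)
              for (int j = 0; j < 4; j++)
                  memcpy(&b[i][j][0], &a[j][i][0], 2 * sizeof(float));
          printf("b[1][3][0] = %g (expect 310, i.e. a[3][1][0])\n", b[1][3][0]);
          return 0;
      }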

  7. Network Communication: MPI_Alltoall
     #parray {mpi[4]} M
     #parray {paged float [4][2]} D
     #parray {[[#M][#D_0]][#D_1]} A
     #parray {[[#D_0][#M]][#D_1]} B
     #insert DataTransfer(a, A, b, B) {}
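  For comparison, a minimal hand-written MPI sketch of the same exchange (assuming 4 ranks, each holding a local float[4][2]; every rank sends one [2]-row to every rank):

      #include <mpi.h>

      int main(int argc, char **argv) {
          float a[4][2], b[4][2];   /* local pieces of the distributed arrays */
          MPI_Init(&argc, &argv);
          /* ... fill a with local data ... */
          /* send 2 floats to each of the 4 ranks, receive 2 floats from each */
          MPI_Alltoall(a, 2, MPI_FLOAT, b, 2, MPI_FLOAT, MPI_COMM_WORLD);
          MPI_Finalize();
          return 0;
      }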

  8. PCI Data Transfer among GPU0..GPU3: cudaMemcpy(d2d)
     #parray {pthd[4]} P
     #parray {dmem float [4][2]} D
     #parray {[[#P][#D_0]][#D_1]} A
     #parray {[[#D_0][#P]][#D_1]} B
     #insert DataTransfer(a, A, b, B) {}
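  A hedged CUDA sketch of the device-to-device traffic behind this pattern (the buffer names d_a/d_b and the helper function are illustrative; peer access may additionally be enabled with cudaDeviceEnablePeerAccess for direct PCI transfers):

      #include <cuda_runtime.h>

      #define NGPU 4

      /* d_a[g], d_b[g]: device buffers of 4*2 floats already allocated on GPU g */
      void all_to_all_d2d(float *d_a[NGPU], float *d_b[NGPU]) {
          for (int src = 0; src < NGPU; src++)
              for (int dst = 0; dst < NGPU; dst++)
                  /* row dst of GPU src's source goes to row src of GPU dst's target */
                  cudaMemcpyPeer(d_b[dst] + 2 * src, dst,
                                 d_a[src] + 2 * dst, src,
                                 2 * sizeof(float));
      }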

  9. MPI or IB/verbs vs. CUDA + Pthread
     MPI or IB/verbs:
       #parray {mpi[4]} M
       #parray {paged float [4][2]} D
       #parray {[[#M][#D_0]][#D_1]} A
       #parray {[[#D_0][#M]][#D_1]} B
       #mainhost {
         #detour M {
           float *a, *b;
           #create D(a)
           #create D(b)
           #insert DataTransfer(a, A, b, B) {}
           #destroy D(a)
           #destroy D(b)
         }
       }
     CUDA + Pthread:
       #parray {pthd[4]} P
       #parray {dmem float [4][2]} D
       #parray {[[#P][#D_0]][#D_1]} A
       #parray {[[#D_0][#P]][#D_1]} B
       #mainhost {
         #detour P {
           float *a, *b;
           INIT_GPU($tid$);
           #create D(a)
           #create D(b)
           #insert DataTransfer(a, A, b, B) {}
           #destroy D(a)
           #destroy D(b)
         }
       }
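  A sketch of what the pthd branch amounts to when written by hand: one host thread per GPU that binds to its device, allocates the device buffers, and releases them afterwards (the worker name and buffer sizes are illustrative, and the mapping to #create/#destroy is only indicative):

      #include <pthread.h>
      #include <cuda_runtime.h>

      #define NGPU 4

      static void *worker(void *arg) {
          int tid = (int)(long)arg;
          float *d_a, *d_b;
          cudaSetDevice(tid);                                 /* role of INIT_GPU($tid$) */
          cudaMalloc((void **)&d_a, 4 * 2 * sizeof(float));   /* #create D(a) */
          cudaMalloc((void **)&d_b, 4 * 2 * sizeof(float));   /* #create D(b) */
          /* ... inter-GPU DataTransfer as on slide 8 ... */
          cudaFree(d_a);                                      /* #destroy D(a) */
          cudaFree(d_b);                                      /* #destroy D(b) */
          return NULL;
      }

      int main(void) {
          pthread_t t[NGPU];
          for (long i = 0; i < NGPU; i++)
              pthread_create(&t[i], NULL, worker, (void *)i);
          for (int i = 0; i < NGPU; i++)
              pthread_join(t[i], NULL);
          return 0;
      }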

  10. Discontiguous Communication
      #parray { mpi[7168] } M
      #parray { pinned[2][14336][14336] } D
      #parray {[[#M][#D_0][#D_1]][#D_2]} S
      #parray {[[#D_1][#M][#D_0]][#D_2]} T
      #insert DataTransfer(t, T, s, S) {}
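  The transposition above moves non-contiguous slices between ranks; one common hand-written technique for such transfers is an MPI derived datatype instead of manual packing. A toy sketch with small placeholder sizes (not the 7168-rank, 14336-squared case, and not PARRAY's generated code):

      #include <mpi.h>

      int main(int argc, char **argv) {
          float s[2][8][8];   /* toy stand-in for one rank's pinned[2][N][N] piece */
          MPI_Datatype slice;
          MPI_Init(&argc, &argv);
          /* the slice s[0..1][j][0..7] for a fixed j: 2 blocks of 8 contiguous
             floats, separated by a stride of 8*8 = 64 floats */
          MPI_Type_vector(2, 8, 64, MPI_FLOAT, &slice);
          MPI_Type_commit(&slice);
          /* ... use `slice` as the send type in point-to-point or collective calls ... */
          (void)s;
          MPI_Type_free(&slice);
          MPI_Finalize();
          return 0;
      }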

  11. Hierarchical SPMDs
      #mainhost {
        #parallel {
          #detour pthd[3] {
            ……
            #detour mpi[4] { …… }
          }
          ……
          #detour cuda[2][128] {
            ……
            #detour cuda[4][256] { …… }
            ……
          }
          ……
        }
      }
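  The right-hand branch (a cuda detour nested inside another cuda detour) can be approximated by hand with CUDA dynamic parallelism; a minimal sketch under that assumption (kernel names and bodies are placeholders; requires compute capability 3.5+ and nvcc -rdc=true):

      #include <cuda_runtime.h>

      __global__ void inner(void) {
          /* inner SPMD level: 4 blocks x 256 threads */
      }

      __global__ void outer(void) {
          /* outer SPMD level: 2 blocks x 128 threads */
          if (threadIdx.x == 0)          /* one child launch per outer block */
              inner<<<4, 256>>>();
      }

      int main(void) {
          outer<<<2, 128>>>();
          cudaDeviceSynchronize();
          return 0;
      }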

  12. GPU SGEMM

  13. 620 Gflops on Fermi C1060

  14. Large FFT (ICS'10, PPoPP'12)

  15. Direct Simulation of Turbulent Flows
      • Scale: 12 distributed arrays, 128 TB; the entire Tianhe-1A with 7168 GPUs
      • Progress: 4096³ completed, 8192³ half-way, 14336³ tested for performance
      • Software technologies: PARRAY (ACM PPoPP'12); the code is only 300 lines

  16. Discussions
      • Performance transparency: macros are compiled out.
      • Completeness: any index expressions using add/mul/mod/div/fcomp (sketched below).
      • Regular structures come from applications and the target manycore hardware.
      • Irregular structures are allowed but better supported by other tools.
      • Typical training = 3 days.
      • Released at http://code.google.com/p/parray-parallel-array/
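  To make "any index expressions using add/mul/mod/div" concrete, here is a hand-written sketch (not generated code) of the index arithmetic behind the slide-6 layouts, mapping an element's flat offset in A = [[#D_0][#D_1]][#D_2] to its flat offset in B = [[#D_1][#D_0]][#D_2]:

      /* A stores (i,j,k) at i*8 + j*2 + k; B stores it at j*8 + i*2 + k. */
      int offset_in_B(int offset_in_A) {
          int i = offset_in_A / 8;        /* dimension D_0 */
          int j = (offset_in_A % 8) / 2;  /* dimension D_1 */
          int k = offset_in_A % 2;        /* dimension D_2 */
          return j * 8 + i * 2 + k;
      }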
