Matrix Transpose Results with Hybrid OpenMP / MPI
O. Haan
Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen (GWDG), Germany
SCICOMP 2000, SDSC, La Jolla

Overview
Hybrid Programming Model
Distributed Matrix Transpose
Performance Measurements
Summary of Results
Two level hierarchy
SPMD program with MPI tasks
OpenMP threads within each task
communication between MPI tasks
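      program hybrid_example
      include 'mpif.h'
c     explicit declarations; maxthr is an assumed upper bound on kp
      integer maxthr
      parameter (maxthr = 64)
      integer com, nk, my_task, kp, my_thread, i, ierr
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
      real thread_res(0:maxthr-1), node_res, glob_res
      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com,nk,ierr)
      call MPI_COMM_RANK(com,my_task,ierr)
c     kp: processors per node, assumed equal to the OpenMP team size
      kp = OMP_GET_NUM_PROCS()
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work(my_thread,kp,my_task,nk,thread_res)
!$OMP END PARALLEL
c     sum the per-thread results on this node
      node_res = 0.0
      do i = 0 , kp-1
        node_res = node_res + thread_res(i)
      end do
c     sum the per-node results on task 0
      call MPI_REDUCE(node_res,glob_res,1,
     :                MPI_REAL,MPI_SUM,0,com,ierr)
      call MPI_FINALIZE(ierr)
      stop
      end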
Advantages (+) and disadvantages (-): the net score depends on the problem
n1 x n2 matrix A( i1 , i2 ) -> n2 x n1 matrix B( i2 , i1 )
decompose n1, n2 into local and global parts:
n1 = n1l * np , n2 = n2l * np
write matrices A, B as 4-dim arrays (the index after ";" is distributed over the np processors):
A( i1l , i1g , i2l ; i2g ) , B( i2l , i2g , i1l ; i1g )
step 1 : local reorder
A( i1l , i1g , i2l ; i2g ) -> a1( i1l , i2l , i1g ; i2g )
step 2 : global reorder
a1( i1l , i2l , i1g ; i2g ) -> a2( i1l , i2l , i2g ; i1g )
step 3 : local transpose
a2( i1l , i2l , i2g ; i1g ) -> B( i2l , i2g , i1l ; i1g )
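The three steps can be checked on a small example. The following is a minimal serial sketch in Python, not the author's implementation: the np processors are simulated by a list of dictionaries, the function name distributed_transpose and the parameter name nproc (standing in for np) are made up for this illustration, and the index tuples only mirror the index order on the slide, not a real memory layout.

```python
def distributed_transpose(A, n1, n2, nproc):
    """Transpose the n1 x n2 matrix A (list of rows) via the three steps.

    nproc simulated processors; n1 and n2 are assumed divisible by nproc.
    """
    n1l, n2l = n1 // nproc, n2 // nproc

    # Distribute A( i1l, i1g, i2l ; i2g ): processor i2g owns a block of columns.
    a = [{(i1l, i1g, i2l): A[i1l + n1l * i1g][i2l + n2l * i2g]
          for i1l in range(n1l) for i1g in range(nproc) for i2l in range(n2l)}
         for i2g in range(nproc)]

    # Step 1, local reorder: ( i1l, i1g, i2l ) -> ( i1l, i2l, i1g ) on each processor.
    a1 = [{(i1l, i2l, i1g): v for (i1l, i1g, i2l), v in a[i2g].items()}
          for i2g in range(nproc)]

    # Step 2, global reorder: block i1g held by processor i2g moves to
    # processor i1g, where it is stored as block i2g.
    a2 = [dict() for _ in range(nproc)]
    for i2g in range(nproc):
        for (i1l, i2l, i1g), v in a1[i2g].items():
            a2[i1g][(i1l, i2l, i2g)] = v

    # Step 3, local transpose: ( i1l, i2l, i2g ) -> ( i2l, i2g, i1l ).
    b = [{(i2l, i2g, i1l): v for (i1l, i2l, i2g), v in a2[i1g].items()}
         for i1g in range(nproc)]

    # Gather B( i2, i1 ) from the processors for inspection.
    B = [[None] * n1 for _ in range(n2)]
    for i1g in range(nproc):
        for (i2l, i2g, i1l), v in b[i1g].items():
            B[i2l + n2l * i2g][i1l + n1l * i1g] = v
    return B
```

Only step 2 moves data between processors; steps 1 and 3 touch local data only, which is why they can be shared among OpenMP threads within a task.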
a1( * , * , i1g ; i2g ) -> a2( * , * , i2g ; i1g )
global reorder on np processors in np steps
[Figure: block exchange among processors p0, p1, p2 in step0, step1, step2]
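One common way to organise a global reorder on np processors in np steps is a cyclic shift. The sketch below assumes that pattern, since the exact pairing used on the slide is not recoverable from the text. The function name exchange_schedule and the parameter nproc (standing in for np) are hypothetical.

```python
def exchange_schedule(nproc):
    """Return one list of (sender, receiver) pairs per step.

    Assumed cyclic-shift pattern: in step s, processor p sends its block
    for processor (p + s) mod nproc. Step 0 is the local copy p -> p.
    """
    return [[(p, (p + s) % nproc) for p in range(nproc)]
            for s in range(nproc)]
```

With this pattern every processor sends exactly one block and receives exactly one block in each step, so no link is used twice within a step.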
Hardware model: nk nodes with kp procs each
np = nk * kp is the total number of procs
Switch model: nk concurrent links between nodes
latency tlat , bandwidth c
execution model for Hybrid: reorder on nk nodes:
nk steps with n1*n2 / nk**2 data per node
execution model for MPI: reorder on np processors:
np steps with n1*n2 / np**2 data per node
switch links shared between kp procs
Hybrid timing model:
MPI timing model:
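A timing model consistent with the execution models stated above can be sketched as follows. This is a reading of those statements under explicit assumptions, not the formulas from the slide: each step costs one latency plus data volume divided by bandwidth, steps are not overlapped, and in the pure MPI case the kp procs of a node share one link, so each sees a bandwidth of c / kp. The function names are made up for this illustration.

```python
def hybrid_time(n1, n2, nk, tlat, c):
    """Assumed hybrid model: nk steps, n1*n2 / nk**2 words per step and node."""
    return nk * (tlat + n1 * n2 / (nk**2 * c))


def mpi_time(n1, n2, nk, kp, tlat, c):
    """Assumed MPI model: np = nk*kp steps, n1*n2 / np**2 words per step,
    with the node's link bandwidth c shared between kp procs."""
    nproc = nk * kp
    return nproc * (tlat + kp * n1 * n2 / (nproc**2 * c))
```

Under these assumptions the bandwidth terms of both models are equal, n1*n2 / (nk*c), and the models differ only in the latency term, nk*tlat against np*tlat. That would favour the hybrid version mainly for small matrices.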
2dim complex array of size
Execution time on nk nodes :
where r : computational speed per node
c : transpose speed per node
effective execution speed per node :
r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and programming model
nk = 16 , n = 10**6 to 10**9
hybrid: c = 5.6 to 7.8 Mword/s
MPI: c = 2.5 to 7.0 Mword/s
effective execution speed per node
hybrid = 208 to 338 Mflop/s
MPI = 108 to 317 Mflop/s
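The quoted speeds can be reproduced to within rounding by a simple model, sketched here in Python. The model is an assumption and is not stated in the text: it supposes an FFT-like workload with 5 * log2(n) flops per complex word and two transposes of the whole array, with n taken as the powers of two nearest to the quoted sizes (2**20 and 2**30). The function name effective_speed is hypothetical.

```python
from math import log2


def effective_speed(n, r, c):
    """Effective speed per node in Mflop/s under the assumed model.

    n: number of complex words, r: compute speed per node in Mflop/s,
    c: transpose speed per node in Mword/s.
    Assumed cost per word: 5*log2(n) flops plus 2 transposed words.
    """
    flops_per_word = 5 * log2(n)
    return flops_per_word / (flops_per_word / r + 2 / c)


# r = 800 Mflop/s; c values as quoted for hybrid and MPI at both sizes.
for label, n, c in [("hybrid", 2**20, 5.6), ("hybrid", 2**30, 7.8),
                    ("MPI", 2**20, 2.5), ("MPI", 2**30, 7.0)]:
    print(label, n, round(effective_speed(n, 800.0, c), 1))
```

If the assumption holds, the transpose term 2 / c dominates the time per word at both sizes, which would explain why the effective speed stays well below r = 800 Mflop/s.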
