Matrix Transpose Results with Hybrid OpenMP / MPI
O. Haan
Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, Germany (GWDG)
SCICOMP 2000, SDSC, La Jolla

Overview
Hybrid Programming Model
Distributed Matrix Transpose
Performance Measurements
Summary of Results
O. Haan, Matrix Transpose Results, SCICOMP 2000
Two level hierarchy
SPMD program with MPI tasks
OpenMP threads within each task
communication between MPI tasks
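      program hybrid_example
      include 'mpif.h'
! declarations are not shown on the slide; types and the bound 64 are assumed
      integer com, ierr, nk, my_task, kp, my_thread, i
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
      real thread_res(0:63), node_res, glob_res
      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
! nk = number of MPI tasks, my_task = rank of this task
      call MPI_COMM_SIZE(com,nk,ierr)
      call MPI_COMM_RANK(com,my_task,ierr)
! kp = number of processors available to the threads of this task
      kp = OMP_GET_NUM_PROCS()
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work(my_thread,kp,my_task,nk,thread_res)
!$OMP END PARALLEL
! sum the per-thread results on this node, then reduce over all tasks
      node_res = 0.0
      do i = 0 , kp-1
         node_res = node_res + thread_res(i)
      end do
      call MPI_REDUCE(node_res,glob_res,1,
     :                MPI_REAL,MPI_SUM,0,com,ierr)
      call MPI_FINALIZE(ierr)
      stop
      end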
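The two-level reduction in the program above can be imitated without MPI or OpenMP. The following is a minimal sketch, assuming a thread pool stands in for the OpenMP threads of one task and a plain sum stands in for MPI_REDUCE with MPI_SUM; the body of `work` is hypothetical, since the slide does not show it.

```python
from concurrent.futures import ThreadPoolExecutor


def work(my_thread, kp, my_task, nk):
    """Hypothetical stand-in for the slide's work(): one partial result per thread."""
    return float(my_task * kp + my_thread)


def node_result(my_task, nk, kp):
    """Run kp threads inside one task and sum their results (the loop over thread_res)."""
    with ThreadPoolExecutor(max_workers=kp) as pool:
        thread_res = list(pool.map(lambda t: work(t, kp, my_task, nk), range(kp)))
    return sum(thread_res)


def global_result(nk, kp):
    """Sum the node results over all tasks, as MPI_REDUCE with MPI_SUM would on rank 0."""
    return sum(node_result(task, nk, kp) for task in range(nk))


# 4 tasks with 2 threads each: first level sums within a task, second across tasks.
glob_res = global_result(nk=4, kp=2)
```

With this stand-in for `work`, the partial results are the integers 0 to 7, so the two levels together give the same value as a flat sum over all nk * kp workers.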
Hybrid model, advantages (+) and drawbacks [lists not preserved]:
the net score depends on the problem
n1 x n2 matrix A( i1 , i2 ) -> n2 x n1 matrix B( i2 , i1 )
decompose n1, n2 in local and global parts:
n1 = n1l * np , n2 = n2l * np
write matrices A, B as 4-dim arrays:
A( i1l , i1g , i2l ; i2g ) , B( i2l , i2g , i1l ; i1g )
step 1 : local reorder
A( i1l , i1g , i2l ; i2g ) -> a1( i1l , i2l , i1g ; i2g )
step 2 : global reorder
a1( i1l , i2l , i1g ; i2g ) -> a2( i1l , i2l , i2g ; i1g )
step 3 : local transpose
a2( i1l , i2l , i2g ; i1g ) -> B( i2l , i2g , i1l ; i1g )
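The three steps can be checked on a small matrix in a single process. This is a sketch under stated assumptions, not the author's implementation: the index after the semicolon is taken to be the processor that owns the data, the split is assumed to be i1 = i1l + n1l * i1g and i2 = i2l + n2l * i2g, and each processor's memory is modelled as a Python dict. The name `np_` stands for the slide's np.

```python
def distributed_transpose(A, n1, n2, np_):
    """Transpose A[i1][i2] into B[i2][i1] through the three steps of the slide."""
    if n1 % np_ or n2 % np_:
        raise ValueError("n1 and n2 must be multiples of np")
    n1l, n2l = n1 // np_, n2 // np_

    # Initial distribution: processor i2g holds A(i1l, i1g, i2l ; i2g).
    local = [{(i1l, i1g, i2l): A[i1l + n1l * i1g][i2l + n2l * i2g]
              for i1l in range(n1l) for i1g in range(np_) for i2l in range(n2l)}
             for i2g in range(np_)]

    # Step 1, local reorder: gather each (i1l, i2l) block, one block per i1g.
    a1 = [[{(i1l, i2l): local[i2g][(i1l, i1g, i2l)]
            for i1l in range(n1l) for i2l in range(n2l)}
           for i1g in range(np_)] for i2g in range(np_)]

    # Step 2, global reorder: block i1g on processor i2g moves to processor i1g.
    a2 = [[a1[i2g][i1g] for i2g in range(np_)] for i1g in range(np_)]

    # Step 3, local transpose: processor i1g builds B(i2l, i2g, i1l ; i1g).
    b = [{(i2l, i2g, i1l): a2[i1g][i2g][(i1l, i2l)]
          for i2l in range(n2l) for i2g in range(np_) for i1l in range(n1l)}
         for i1g in range(np_)]

    # Gather the distributed result into one n2 x n1 list of rows for checking.
    return [[b[i1 // n1l][(i2 % n2l, i2 // n2l, i1 % n1l)] for i1 in range(n1)]
            for i2 in range(n2)]


A = [[10 * i1 + i2 for i2 in range(6)] for i1 in range(4)]
B = distributed_transpose(A, n1=4, n2=6, np_=2)
```

Only step 2 moves data between processors; steps 1 and 3 touch local memory only, which is why the communication model concentrates on the global reorder.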
a1( * , * , i1g ; i2g ) -> a2( * , * , i2g ; i1g )
global reorder on np processors in np steps
[Figure: data exchange among processors p0, p1, p2 in step0, step1, step2]
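One way to organise a global reorder on np processors in np steps is a cyclic shift, in which processor p sends to processor (p + s) mod np in step s. The sketch below assumes this schedule; the pairing used in the original figure is not preserved, so it may differ. The names `exchange_schedule` and `global_reorder` are illustrative.

```python
def exchange_schedule(np_):
    """For each step s, list the (sender, receiver) pairs of a cyclic shift."""
    return [[(p, (p + s) % np_) for p in range(np_)] for s in range(np_)]


def global_reorder(a1, np_):
    """a1[p][q] is the block on processor p destined for q; returns a2[q][p]."""
    a2 = [[None] * np_ for _ in range(np_)]
    for step in exchange_schedule(np_):
        for src, dst in step:
            # In step 0 src == dst, so that block is only copied locally.
            a2[dst][src] = a1[src][dst]
    return a2


schedule = exchange_schedule(3)
blocks = [[f"p{p}->p{q}" for q in range(3)] for p in range(3)]
reordered = global_reorder(blocks, 3)
```

In every step each processor sends exactly one block and receives exactly one block, so no processor is the target of two messages at once.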
Hardware model: nk nodes with kp processors each
np = nk * kp is the total processor count
Switch model: nk concurrent links between nodes
latency tlat , bandwidth c
execution model for Hybrid: reorder on nk nodes:
nk steps with n1*n2 / nk**2 data per node
execution model for MPI: reorder on np processors:
np steps with n1*n2 / np**2 data per processor
switch links shared between kp processors
Hybrid timing model:
MPI timing model:
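The formulas of the two timing models are not preserved in this transcript. The sketch below is one plausible reading of the execution models stated earlier, not the author's equations: each step is assumed to cost the latency tlat plus the transferred data divided by the available bandwidth, and sharing a link among kp processors is assumed to divide the bandwidth c by kp.

```python
def hybrid_time(n1, n2, nk, tlat, c):
    """nk steps, each moving n1*n2/nk**2 words per node over its own link."""
    return nk * (tlat + (n1 * n2 / nk**2) / c)


def mpi_time(n1, n2, nk, kp, tlat, c):
    """np = nk*kp steps, n1*n2/np**2 words per processor, link shared by kp processors."""
    np_ = nk * kp
    return np_ * (tlat + (n1 * n2 / np_**2) / (c / kp))


# Hypothetical parameters: 16 nodes of 4 processors, 10 microsecond latency,
# bandwidth of 1e7 words per second.
t_hybrid = hybrid_time(1024, 1024, nk=16, tlat=1e-5, c=1e7)
t_mpi = mpi_time(1024, 1024, nk=16, kp=4, tlat=1e-5, c=1e7)
```

Under these assumptions both models have the same bandwidth term, n1*n2 / (nk * c), and differ only in the latency term: nk * tlat for hybrid against np * tlat for MPI. The difference therefore matters most for small matrices.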
2-dim complex array of size
Execution time on nk nodes :
where r : computational speed per node
c : transpose speed per node
effective execution speed per node :
r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and programming model

nk = 16                    n = 10**6    n = 10**9
c, hybrid (Mword/s)           5.6          7.8
c, MPI    (Mword/s)           2.5          7.0
effective execution speed per node:
hybrid    (Mflop/s)           208          338
MPI       (Mflop/s)           108          317
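The quoted effective speeds can be approximated from r and c with a simple model. The sketch below rests on assumptions inferred from the numbers, not taken from the slide: 5 * log2(n) floating-point operations per array element (the customary FFT operation count; the transcript does not name the algorithm) and 2 words transposed per element. The function name and its defaults are illustrative.

```python
import math


def effective_speed(n, r, c, flops_per_point=5.0, words_per_point=2.0):
    """Effective Mflop/s per node, with r in Mflop/s and c in Mword/s.

    Per element the time is flops / r for computing plus words / c for
    transposing; the effective speed is the flop count over that time.
    """
    flops = flops_per_point * math.log2(n)  # assumed flops per array element
    return flops / (flops / r + words_per_point / c)


hybrid_small = effective_speed(1e6, r=800.0, c=5.6)
mpi_small = effective_speed(1e6, r=800.0, c=2.5)
hybrid_large = effective_speed(1e9, r=800.0, c=7.8)
mpi_large = effective_speed(1e9, r=800.0, c=7.0)
```

With these assumptions the four values agree with the table to within 2 Mflop/s. The node count nk cancels out of the per-node speed and enters only through c.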