
Matrix Transpose Results with Hybrid OpenMP / MPI

O. Haan

Gesellschaft für wissenschaftliche Datenverarbeitung, Göttingen, Germany (GWDG)

SCICOMP 2000, SDSC, La Jolla



Overview

  • Hybrid Programming Model

  • Distributed Matrix Transpose

  • Performance Measurements

  • Summary of Results


Architecture of Scalable Parallel Computers

Two-level hierarchy:

  • cluster of SMP nodes: distributed memory, high-speed interconnect

  • SMP nodes with multiple processors: shared memory, bus- or switch-connected


Programming Models

  • message passing over all processors: MPI implementation for shared memory, multiple access to switch adapters (SP: 4-way Winterhawk2 +, 8-way Nighthawk -)

  • shared memory over all processors: virtual global address space (SP: -)

  • hybrid message passing / shared memory: message passing between nodes, shared memory within nodes (SP: +)


Hybrid Programming Model

SPMD program with MPI tasks; OpenMP threads within each task; communication between MPI tasks


Example of Hybrid Program

      program hybrid_example
! hybrid SPMD example: one MPI task per node, OpenMP threads within each task
      include 'mpif.h'
      integer com, ierr, nk, my_task, kp, my_thread, i
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
! thread_res is assumed large enough for the threads of one node
      real thread_res(0:15), node_res, glob_res

      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com, nk, ierr)
      call MPI_COMM_RANK(com, my_task, ierr)
      kp = OMP_GET_NUM_PROCS()

! every thread computes its contribution (routine work not shown)
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work(my_thread, kp, my_task, nk, thread_res)
!$OMP END PARALLEL

! sum the thread results of this node ...
      node_res = 0.0
      do i = 0, kp-1
         node_res = node_res + thread_res(i)
      end do

! ... and reduce the node results over all MPI tasks
      call MPI_REDUCE(node_res, glob_res, 1,
     :                MPI_REAL, MPI_SUM, 0, com, ierr)

      call MPI_FINALIZE(ierr)
      stop
      end


Hybrid Programming vs. Pure Message Passing

Advantages (+):

  • works on all SP configurations

  • coarser internode communication granularity

  • faster intranode communication

Drawbacks (-):

  • larger programming effort

  • additional synchronization steps

  • reduced reuse of cached data

The net score depends on the problem.


Distributed Matrix Transpose


3-step Transpose

n1 x n2 matrix A( i1 , i2 ) --> n2 x n1 matrix B( i2 , i1 )

decompose n1, n2 in local and global parts:

n1 = n1l * np , n2 = n2l * np

write matrices A, B as 4-dim arrays (the index after the semicolon is the distributed node index):

A( i1l , i1g , i2l ; i2g ) , B( i2l , i2g , i1l ; i1g )

step 1 : local reorder

A( i1l , i1g , i2l ; i2g ) -> a1( i1l , i2l , i1g ; i2g )

step 2 : global reorder

a1( i1l , i2l , i1g ; i2g ) -> a2( i1l , i2l , i2g ; i1g )

step 3 : local transpose

a2( i1l , i2l , i2g ; i1g ) -> B( i2l , i2g , i1l ; i1g )
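Read together, the three steps are nothing more than nested copy loops over the four indices. Below is a minimal serial Fortran sketch under the notation above (the routine name is made up here, and the node index, which in the parallel code is distributed over the MPI tasks, is kept as an ordinary array dimension):

      subroutine transpose_3step(A, B, n1l, n2l, np)
! serial sketch of the 3-step transpose; in the parallel code the last
! dimension (the node index) is distributed over the MPI tasks
      integer n1l, n2l, np, i1l, i1g, i2l, i2g
      real A(n1l,np,n2l,np), B(n2l,np,n1l,np)
      real a1(n1l,n2l,np,np), a2(n1l,n2l,np,np)

! step 1: local reorder   A(i1l,i1g,i2l;i2g) -> a1(i1l,i2l,i1g;i2g)
      do i2g = 1, np
        do i2l = 1, n2l
          do i1g = 1, np
            do i1l = 1, n1l
              a1(i1l,i2l,i1g,i2g) = A(i1l,i1g,i2l,i2g)
            end do
          end do
        end do
      end do

! step 2: global reorder  a1(i1l,i2l,i1g;i2g) -> a2(i1l,i2l,i2g;i1g)
!         (this is the internode exchange in the parallel code)
      do i1g = 1, np
        do i2g = 1, np
          do i2l = 1, n2l
            do i1l = 1, n1l
              a2(i1l,i2l,i2g,i1g) = a1(i1l,i2l,i1g,i2g)
            end do
          end do
        end do
      end do

! step 3: local transpose a2(i1l,i2l,i2g;i1g) -> B(i2l,i2g,i1l;i1g)
      do i1g = 1, np
        do i1l = 1, n1l
          do i2g = 1, np
            do i2l = 1, n2l
              B(i2l,i2g,i1l,i1g) = a2(i1l,i2l,i2g,i1g)
            end do
          end do
        end do
      end do
      end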


Local Steps: Copy with Reorder

  • data in memory: speed limited by the performance of the bus and memory subsystem; Winterhawk2: all processors share the same bus; bandwidth: 1.6 GB/s

  • data in cache: speed limited by processor performance; Winterhawk2: one load plus one store per cycle; bandwidth: 8 Byte per cycle * 375 MHz = 3 GB/s
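A minimal OpenMP sketch of such a copy-with-reorder kernel (the routine name, array shapes and use of 8-byte words are illustrative, not taken from the talk). For arrays much larger than the cache its rate is bounded by the shared 1.6 GB/s memory bus no matter how many threads run; for cache-resident arrays each processor can move data at its own load/store rate:

      subroutine copy_reorder(a, b, n, m)
! copy a(n,m) into b(m,n); the threads of one node share the outer loop
      integer n, m, i, j
      real*8 a(n,m), b(m,n)
!$OMP PARALLEL DO PRIVATE(i)
      do j = 1, m
         do i = 1, n
            b(j,i) = a(i,j)
         end do
      end do
!$OMP END PARALLEL DO
      end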


Copy: Data in Memory


Copy: Prefetch


Copy: Data in Cache


Global Reorder

a1( *, *, i1g ; i2g ) -> a2( * , * , i2g ; i1g )

global reorder on np processors in np steps

[Diagram: the blocks of a1 held by processors p0, p1, p2 are exchanged in steps 0, 1, 2, one block per processor and step, until every block has reached its target processor.]
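One common way to implement these steps is a shifted pairwise schedule with MPI_SENDRECV, sketched below; the routine, the block layout and the schedule are assumptions for illustration, not the code behind the measurements:

      subroutine global_reorder(a1, a2, nblk, np, me, com)
! a1(:,q) holds the block destined for task q, a2(:,q) receives the
! block coming from task q; nblk = n1l * n2l words per block
      include 'mpif.h'
      integer nblk, np, me, com, step, psend, precv, ierr
      integer status(MPI_STATUS_SIZE)
      real*8 a1(nblk,0:np-1), a2(nblk,0:np-1)

      do step = 0, np-1
! in step s send to the task s ranks ahead and receive from the task
! s ranks behind (step 0 degenerates to a local self-copy)
         psend = mod(me + step, np)
         precv = mod(me - step + np, np)
         call MPI_SENDRECV(a1(1,psend), nblk, MPI_DOUBLE_PRECISION,
     :                     psend, 0,
     :                     a2(1,precv), nblk, MPI_DOUBLE_PRECISION,
     :                     precv, 0, com, status, ierr)
      end do
      end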


Performance Modelling

Hardware model: nk nodes with kp procs each

np = nk * kp is the total processor count

Switch model: nk concurrent links between nodes

latency tlat , bandwidth c

execution model for Hybrid: reorder on nk nodes:

nk steps with n1*n2 / nk**2 data per node

execution model for MPI: reorder on np processors:

np steps with n1*n2 / np**2 data per processor; switch links shared between kp procs


Performance Modelling

Hybrid timing model:

MPI timing model:
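The two timing formulas on this slide are figures that are not reproduced in the transcript. Under the execution models stated above (per-step latency t_lat, link bandwidth c per node, shared by kp procs in the pure MPI case), a plausible reconstruction, offered here as an assumption, is:

    t_{hybrid} = n_k \left( t_{lat} + \frac{n_1 n_2 / n_k^2}{c} \right)
               = n_k \, t_{lat} + \frac{n_1 n_2}{n_k \, c}

    t_{MPI}    = n_p \left( t_{lat} + \frac{n_1 n_2 / n_p^2}{c / k_p} \right)
               = n_p \, t_{lat} + \frac{n_1 n_2}{n_k \, c}

In this reading both models transfer the same volume per node, but the pure MPI version pays the per-step latency kp times more often, which matches the later observation that the hybrid advantage is largest for small matrices.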


Timing of Global Reorder (internode part)


Timing of Global Reorder (internode part)


Timing of Global Reorder


Timing of Transpose


Scaling of Transpose


Timing of Transpose Steps


Summary of Results: Hardware

  • Memory access in the Winterhawk2 is not adequate: copy rate of 400 MB/s = 50 Mword/s vs. a peak CPU rate of 6000 Mflop/s, roughly a factor of 100 between computational speed and memory speed

  • Sharing of the switch link by 4 processors degrades communication speed: bandwidth smaller by more than a factor of 4 (a factor of 4 expected); latency larger by nearly a factor of 4 (a factor of 1 expected)


Summary of Results: Hybrid vs. MPI

  • Hybrid OpenMP / MPI programming is profitable for the distributed matrix transpose: 1000 x 1000 matrix on 16 nodes: 2.3 times faster; 10000 x 10000 matrix on 16 nodes: 1.1 times faster

  • Competing influences: MPI programming enhances the reuse of cached data; hybrid programming has lower communication latency and coarser communication granularity


Summary of Results: Use of Transpose in FFT

2-dim complex array of size n1 x n2 with n = n1 * n2 points

Execution time on nk nodes (see the reconstruction sketched below):

where r : computational speed per node , c : transpose speed per node

effective execution speed per node (see below):
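The formulas themselves are not in the transcript. A reconstruction consistent with the numbers on the next slide, assuming the usual 5 n log2 n flop count for a complex FFT of n points and one transpose moving 2n real words (both assumptions, not statements from the talk):

    t(n_k) = \frac{5 \, n \log_2 n}{n_k \, r} + \frac{2 \, n}{n_k \, c}

    v_{eff} = \frac{5 \, n \log_2 n}{n_k \, t(n_k)}
            = \frac{5 \log_2 n}{5 \log_2 n / r \, + \, 2 / c}

(v_eff is just shorthand introduced here for the effective execution speed per node.)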


Summary of Results: Use of Transpose in FFT (Example SP)

r = 4 * 200 Mflop/s = 800 Mflop/s; c depends on n, nk and the programming model

nk = 16                     n = 10**6    n = 10**9

transpose speed:
  hybrid   c =                  5.6          7.8    Mword/s
  MPI      c =                  2.5          7.0    Mword/s

effective execution speed per node:
  hybrid                        208          338    Mflop/s
  MPI                           108          317    Mflop/s
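As a plausibility check of the reconstructed formula above, the n = 10**6 hybrid column (an illustration, not a number from the talk):

    \frac{5 \log_2 10^6}{5 \log_2 10^6 / 800 \, + \, 2 / 5.6}
      \approx \frac{99.7}{0.125 + 0.357} \approx 207 \ \mathrm{Mflop/s}

in agreement with the 208 Mflop/s quoted above.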
