
Halo exchange in Cosmo


Presentation Transcript


  1. Non-blocking communications in RK dynamics. Current status and future work. Stefano Zampini, CASPUR/CNMCA. WG-6 PP POMPA @ Cosmo GM, Rome, September 6, 2011

  2. Halo exchange in Cosmo
  • 3 types of point-to-point communication: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
  • Halo swapping needs completion of the East-West exchange before the South-North communication starts (implicit corner exchange); see the sketch below
  • Also: a choice between explicit buffering and derived MPI datatypes
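  As a concrete illustration of the two-phase ordering, here is a minimal sketch of the fully blocking variant. All names (halo_swap_sendrecv, field, nbl, pleft/pright/pdown/pup) are assumptions for the sketch, not the actual COSMO exchg_boundaries code. Because phase 2 sends entire rows including the East-West halo columns just received in phase 1, the corner blocks travel implicitly:

    subroutine halo_swap_sendrecv(field, ie, je, nbl, pleft, pright, pdown, pup, comm)
      use mpi
      implicit none
      integer, intent(in)    :: ie, je, nbl       ! interior size, halo width
      integer, intent(in)    :: pleft, pright, pdown, pup, comm  ! neighbours (MPI_PROC_NULL at domain edges)
      real(8), intent(inout) :: field(1-nbl:ie+nbl, 1-nbl:je+nbl)
      integer :: status(MPI_STATUS_SIZE), ierr
      real(8) :: sbuf(max(nbl*je, (ie+2*nbl)*nbl)), rbuf(max(nbl*je, (ie+2*nbl)*nbl))

      ! Phase 1: East-West (send eastmost interior columns, receive west halo)
      sbuf(1:nbl*je) = reshape(field(ie-nbl+1:ie, 1:je), (/ nbl*je /))
      call MPI_SENDRECV(sbuf, nbl*je, MPI_DOUBLE_PRECISION, pright, 10, &
                        rbuf, nbl*je, MPI_DOUBLE_PRECISION, pleft,  10, comm, status, ierr)
      if (pleft /= MPI_PROC_NULL) &
           field(1-nbl:0, 1:je) = reshape(rbuf(1:nbl*je), (/ nbl, je /))
      ! ... symmetric West-East exchange omitted for brevity ...

      ! Phase 2: South-North over the FULL x-extent (interior + E-W halos),
      ! so the corner points are exchanged implicitly
      sbuf(1:(ie+2*nbl)*nbl) = reshape(field(1-nbl:ie+nbl, je-nbl+1:je), (/ (ie+2*nbl)*nbl /))
      call MPI_SENDRECV(sbuf, (ie+2*nbl)*nbl, MPI_DOUBLE_PRECISION, pup,   20, &
                        rbuf, (ie+2*nbl)*nbl, MPI_DOUBLE_PRECISION, pdown, 20, comm, status, ierr)
      if (pdown /= MPI_PROC_NULL) &
           field(1-nbl:ie+nbl, 1-nbl:0) = reshape(rbuf(1:(ie+2*nbl)*nbl), (/ ie+2*nbl, nbl /))
      ! ... symmetric North-South exchange omitted for brevity ...
    end subroutine halo_swap_sendrecv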

  3. Details on the non-blocking exchange
  • Full halo exchange including corners: 2x messages, same amount of data on the network.
  • 3 different stages: send, receive and wait.
  • Minimizing overhead: at the first time step, persistent requests are created with calls to MPI_SEND_INIT and MPI_RECV_INIT.
  • During the model run: MPI_STARTALL is used to start the requests; MPI_TESTANY/MPI_WAITANY are used for completion.
  • Current implementation supports explicit send and receive buffering only: it needs to be extended to derived MPI datatypes.
  • Strategy used in RK dynamics (manual implementation, condensed in the sketch after this list):
    - Sends are posted as soon as the needed data has been computed locally.
    - Receives are posted as soon as the receive buffer is ready to be used.
    - Waits are posted just before the data is needed for the next local computation.
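  The persistent-request pattern above, in condensed form. This is a sketch with assumed names (req, lfirst, sbuf_w/e, rbuf_w/e, ncol), showing 8 requests for one scenario of 4 sends + 4 receives; the module itself keeps an ilocalreq(16) array covering all swap scenarios. Persistent requests bind to fixed buffer addresses at init time, which is one reason the current implementation requires explicit buffers:

    ! Sketch: inside a module with "use mpi" and fixed module-level
    ! buffers sbuf_w/e, rbuf_w/e of length ncol.
    integer, save :: req(8) = MPI_REQUEST_NULL
    logical, save :: lfirst = .true.
    integer       :: i, idx, ierr, status(MPI_STATUS_SIZE)

    if (lfirst) then   ! first time step: create persistent requests once
       call MPI_SEND_INIT(sbuf_w, ncol, MPI_DOUBLE_PRECISION, pleft,  1, comm, req(1), ierr)
       call MPI_SEND_INIT(sbuf_e, ncol, MPI_DOUBLE_PRECISION, pright, 2, comm, req(2), ierr)
       call MPI_RECV_INIT(rbuf_w, ncol, MPI_DOUBLE_PRECISION, pleft,  2, comm, req(3), ierr)
       call MPI_RECV_INIT(rbuf_e, ncol, MPI_DOUBLE_PRECISION, pright, 1, comm, req(4), ierr)
       ! ... req(5:8): same pattern for the South-North phase ...
       lfirst = .false.
    end if

    ! every time step: pack the send buffers, then reactivate all requests
    call MPI_STARTALL(8, req, ierr)

    ! completion in arrival order; a completed persistent request becomes
    ! inactive, so looping count times drains the whole array
    do i = 1, 8
       call MPI_WAITANY(8, req, idx, status, ierr)
       ! if req(idx) was a receive, its buffer can be unpacked right here
    end do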

  4. New synopsis for the swap subroutine
  • Current call: subroutine exchg_boundaries
  • 4 more arguments in the call to subroutine iexchg_boundaries:
    - ilocalreq(16): array of requests (integers declared as module variables, one for each swap scenario inside the module)
    - operation(3): array of logicals indicating the stages to perform (send, recv, wait)
    - istartpar, iendpar: needed for the definition of the corners
  • A schematic usage example follows below.
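  Schematically, splitting the exchange into stages lets communication overlap local computation. The argument list here is heavily abbreviated ("..." stands for the unchanged exchg_boundaries arguments, which are not reproduced on the slide), and the (send, recv, wait) ordering inside operation is an assumption:

    ! post sends and receives as early as their data/buffers are ready
    operation = (/ .true., .true., .false. /)    ! send + recv, no wait
    call iexchg_boundaries( ..., ilocalreq, operation, istartpar, iendpar )

    ! ... local computations that do not touch the halo ...

    ! complete the exchange just before the halo data is needed
    operation = (/ .false., .false., .true. /)   ! wait only
    call iexchg_boundaries( ..., ilocalreq, operation, istartpar, iendpar )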

  5. New Implementation

  6. Benchmark details
  • COSMO RAPS 5.0 with the MeteoSwiss namelist (25 hours of forecast)
  • Cosmo2 (520x350x60, dt = 20 s) and Cosmo7 (393x338x60, dt = 60 s)
  • Decompositions: tiny (10x12+4), small (20x24+4) and usual (28x35+4)
  • Code compiled with Intel ifort 11.1.072 and HPMPI:
    COMFLG1 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
    COMFLG2 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
    COMFLG3 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
    COMFLG4 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
    LDFLG = -finline-functions -O3
  • Runs on the PORDOI Linux cluster at CNMCA: 128 dual-socket quad-core nodes (1024 cores in total)
  • Each socket: quad-core Intel Xeon E5450 @ 3.00 GHz with 1 GB RAM per core
  • Profiling with Scalasca 1.3.3 (very small overhead)

  7. Early results: COSMO 7
  • Total time (s) for model runs
  • Mean total time for RK dynamics

  8. Early results: COSMO 2
  • Total time (s) for model runs
  • Mean total time for RK dynamics

  9. Comments and future work
  • Almost identical computational times for the test cases considered with the INTEL-HPMPI configuration
  • Not shown: 5% improvement in computational times with PGI-MVAPICH2 (but worse absolute times)
  • CFL check performed only locally when izdebug < 2.
  • Still a lot of synchronization in collective calls during multiplicative filling in the semi-Lagrangian scheme: Allreduce and Allgather operations in multiple calls to the sum_DDI subroutine (bottleneck for more than 1000 cores)
  • Bad performance in w_bbc_rk_up5 during the RK loop over small time steps. Rewrite the loop code?
  • What about automatic detection/insertion of swapping calls in the microphysics and other parts of the code?
  • Is Testany/Waitany the most efficient way to ensure completion? See the sketch below.
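  On the last question: when nothing has to be unpacked per message, a single MPI_WAITALL over the request array is the natural alternative to a WAITANY loop, completing everything in one call and letting the MPI library optimize internally (a sketch, not a measured recommendation; req as in the earlier sketch):

    integer :: statuses(MPI_STATUS_SIZE, 8), ierr

    ! completes all 8 requests at once instead of looping with MPI_WAITANY
    call MPI_WAITALL(8, req, statuses, ierr)

  WAITANY still pays off when unpacking one receive buffer can overlap the completion of the remaining messages; the two variants would have to be compared in the Scalasca profiles.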
