
Recent Developments for Parallel CMAQ


Presentation Transcript


  1. Recent Developments for Parallel CMAQ • Jeff Young, AMDB/ASMD/ARL/NOAA • David Wong, SAIC – NESCC/EPA

  2. AQF-CMAQ • Running in “quasi-operational” mode at NCEP • 26+ minutes for a 48-hour forecast (on 33 processors)

  3. • Code modifications to improve data locality • Some vdiff optimization • Jerry Gipson’s MEBI/EBI chem solver • Tried CGRID( SPC, LAYER, COLUMN, ROW ) – tests indicated it was not good for MPI communication of data (see the sketch below) • Some background: MPI data communication – ghost (halo) regions
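As context for the CGRID test above, here is a minimal sketch of the two index orderings being compared; the dimension names, the species count, and the module itself are assumptions for illustration, not the CMAQ source.

     ! Illustrative only: dimension names and the species count are assumed.
     ! Fortran is column-major, so the first index varies fastest in memory.
     MODULE CGRID_LAYOUTS
        IMPLICIT NONE
        INTEGER, PARAMETER :: NCOLS = 166, NROWS = 142, NLAYS = 22, NSPCS = 80

        ! Conventional ordering: horizontal (column/row) neighbors stay close
        ! in memory, which suits the ghost-region (halo) exchange described next.
        REAL :: CGRID_STD( NCOLS, NROWS, NLAYS, NSPCS )

        ! Tested ordering: all species of one cell are contiguous, improving
        ! data locality for cell-by-cell chemistry, but the tests noted above
        ! indicated it was not good for MPI communication of the data.
        REAL :: CGRID_ALT( NSPCS, NLAYS, NCOLS, NROWS )
     END MODULE CGRID_LAYOUTS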

  4. Ghost (Halo) Regions

     DO J = 1, N
        DO I = 1, M
           ! References reach up to 2 columns and 1 row beyond the loop
           ! bounds, so those cells must be held locally as ghost cells.
           DATA( I,J ) = A( I+2,J ) * A( I,J-1 )
        END DO
     END DO

  5. Horizontal Advection and Diffusion Data Requirements
     [diagram: a processor’s subdomain with its ghost region, neighboring subdomains, and the exterior boundary]
     Stencil Exchange data communication function
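To make the stencil-exchange idea concrete, below is a minimal ghost-region exchange written against plain MPI. It is an illustrative stand-in, not the CMAQ stencil-exchange library: the decomposition (rows only), the ghost depth of 1, and the names (NX, NY, NORTH, SOUTH) are all assumptions; a wider stencil such as the A( I+2,J ) reference above needs a correspondingly deeper halo.

     ! Minimal ghost-row exchange for a 1-D (row-direction) decomposition.
     ! Not the CMAQ stencil-exchange routine; names are assumptions.
     SUBROUTINE HALO_EXCHANGE( A, NX, NY, NORTH, SOUTH, COMM )
        IMPLICIT NONE
        INCLUDE 'mpif.h'
        INTEGER, INTENT( IN )    :: NX, NY          ! local interior columns, rows
        INTEGER, INTENT( IN )    :: NORTH, SOUTH    ! neighbor ranks (MPI_PROC_NULL at the exterior boundary)
        INTEGER, INTENT( IN )    :: COMM
        REAL,    INTENT( INOUT ) :: A( NX, 0:NY+1 ) ! rows 0 and NY+1 are ghost rows
        INTEGER :: STATUS( MPI_STATUS_SIZE ), IERR

        ! Send the top interior row north; receive the south ghost row.
        CALL MPI_SENDRECV( A( 1, NY ), NX, MPI_REAL, NORTH, 1, &
                           A( 1, 0 ),  NX, MPI_REAL, SOUTH, 1, &
                           COMM, STATUS, IERR )

        ! Send the bottom interior row south; receive the north ghost row.
        CALL MPI_SENDRECV( A( 1, 1 ),    NX, MPI_REAL, SOUTH, 2, &
                           A( 1, NY+1 ), NX, MPI_REAL, NORTH, 2, &
                           COMM, STATUS, IERR )

     END SUBROUTINE HALO_EXCHANGE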

  6. Architectural Changes for Parallel I/O • Tests confirmed that the latest releases (May, Sep 2003) are not very scalable • Background: original version, latest release, AQF version

  7. Parallel I/O
     • 2002 Release: computation only on most processors; read, write, and computation on one processor
     • 2003 Release: read and computation; write and computation
     • AQF: asynchronous, write-only processor; data transfer by message passing

  8. Modifications to the I/O API for Parallel I/O
     For 3D data, e.g.:
     • time interpolation:
       INTERP3 ( FileName, VarName, ProgName, Date, Time, Ncols*Nrows*Nlays, Data_Buffer )
     • spatial subset:
       XTRACT3 ( FileName, VarName, StartLay, EndLay, StartRow, EndRow, StartCol, EndCol, Date, Time, Data_Buffer )
     • windowed time interpolation:
       INTERPX ( FileName, VarName, ProgName, StartCol, EndCol, StartRow, EndRow, StartLay, EndLay, Date, Time, Data_Buffer )
     • patch write:
       WRPATCH ( FileID, VarID, TimeStamp, Record_No ), called from PWRITE3 (pario)
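A minimal sketch of a windowed read using the INTERPX signature listed above. The subroutine name, the file and variable names, and the window bounds are assumptions; it also assumes MET_CRO_3D has already been opened through the I/O API elsewhere.

     ! Sketch only: each worker reads just its own patch of a met variable
     ! instead of the full-domain INTERP3 read. Names and bounds are assumed.
     SUBROUTINE READ_LOCAL_PATCH( JDATE, JTIME )
        IMPLICIT NONE

        INTEGER, INTENT( IN ) :: JDATE, JTIME   ! current date (YYYYDDD) and time (HHMMSS)

        LOGICAL, EXTERNAL :: INTERPX            ! I/O API windowed time interpolation

        ! This worker's patch of the 166 x 142 x 22 grid (bounds are assumed).
        INTEGER, PARAMETER :: STARTCOL = 1, ENDCOL = 42
        INTEGER, PARAMETER :: STARTROW = 1, ENDROW = 36
        INTEGER, PARAMETER :: NLAYS    = 22

        REAL :: TA( STARTCOL:ENDCOL, STARTROW:ENDROW, NLAYS )   ! local buffer for temperature

        ! Time-interpolated, spatially windowed read: only this processor's
        ! patch is moved, not the whole Ncols*Nrows*Nlays array.
        IF ( .NOT. INTERPX( 'MET_CRO_3D', 'TA', 'READ_LOCAL_PATCH',        &
                            STARTCOL, ENDCOL, STARTROW, ENDROW, 1, NLAYS,  &
                            JDATE, JTIME, TA ) ) THEN
           CALL M3EXIT( 'READ_LOCAL_PATCH', JDATE, JTIME,                  &
                        'Could not read TA from MET_CRO_3D', 2 )
        END IF

     END SUBROUTINE READ_LOCAL_PATCH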

  9. Standard (May 2003 Release)
     DRIVER
       read IC's into CGRID
       begin output timestep loop
         advstep (determine sync timestep)
         couple
         begin sync timestep loop
           SCIPROC
             X-Y-Z advect
             adjadv
             hdiff
             decouple
             vdiff (DRYDEP)
             cloud (WETDEP)
             gas chem (aero, VIS)
             couple
         end sync timestep loop
         decouple
         write conc and avg conc (CONC, ACONC)
       end output timestep loop

  10. AQF-CMAQ
      DRIVER
        set WORKERs, WRITER
        if WORKER, read IC's into CGRID
        begin output timestep loop
          if WORKER
            advstep (determine sync timestep)
            couple
            begin sync timestep loop
              SCIPROC
                X-Y-Z advect
                hdiff
                decouple
                vdiff
                cloud
                gas chem (aero)
                couple
            end sync timestep loop
            decouple
            MPI send conc, aconc, drydep, wetdep, (vis)
          if WRITER
            completion-wait: for conc, write conc (CONC); for aconc, write aconc; etc.
        end output timestep loop
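Below is a schematic of the WORKER/WRITER split, written against plain MPI; it is not the AQF-CMAQ driver code. The rank layout (last rank as the writer), the tag, the equal patch sizes, and all names are illustrative assumptions; the point is simply that workers hand their patches off and keep computing while the writer handles the file output.

     ! Sketch of the dedicated-writer pattern; names and layout are assumed.
     SUBROUTINE OUTPUT_CONC( CONC, NLOCAL, MYRANK, NPROCS, COMM )
        IMPLICIT NONE
        INCLUDE 'mpif.h'
        INTEGER, INTENT( IN ) :: NLOCAL                 ! size of this worker's CONC patch
        INTEGER, INTENT( IN ) :: MYRANK, NPROCS, COMM
        REAL,    INTENT( IN ) :: CONC( NLOCAL )

        INTEGER, PARAMETER :: TAG_CONC = 100
        INTEGER :: WRITER, W, IERR
        INTEGER :: STATUS( MPI_STATUS_SIZE )
        REAL    :: PATCH( NLOCAL )                      ! writer-side receive buffer

        WRITER = NPROCS - 1                             ! last rank is the dedicated writer

        IF ( MYRANK .NE. WRITER ) THEN
           ! WORKER: hand the patch to the writer and return to computation;
           ! the file write no longer sits on the workers' critical path.
           CALL MPI_SEND( CONC, NLOCAL, MPI_REAL, WRITER, TAG_CONC, COMM, IERR )
        ELSE
           ! WRITER: collect each worker's patch and write it to the output
           ! file (e.g. via the WRPATCH/PWRITE3 path above) while workers move on.
           DO W = 0, NPROCS - 2
              CALL MPI_RECV( PATCH, NLOCAL, MPI_REAL, W, TAG_CONC, COMM, STATUS, IERR )
              ! ... write PATCH into worker W's window of the CONC file ...
           END DO
        END IF

     END SUBROUTINE OUTPUT_CONC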

  11. Power3 Cluster (NESCC's LPAR)
      [diagram: three nodes, each with CPUs 0–15 sharing one memory, connected by a switch]
      • The platform (cypress00, cypress01, cypress02) consists of 3 SP Nighthawk nodes
      • Memory 2X slower than NCEP's
      • All CPUs share user applications with file servers, interactive use, etc.

  12. Power4 p690 Servers (NCEP's LPAR)
      [diagram: 4-CPU LPAR nodes, each with its own memory, connected by a switch; 2 “colony” SW connectors per node]
      • ~2X the performance of the cypress nodes
      • Each platform (snow, frost) is composed of 22 p690 (Regatta) servers
      • Each server has 32 CPUs, LPAR-ed into 8 nodes per server (4 CPUs per node)
      • Some nodes are dedicated to file servers, interactive use, etc.
      • There are effectively 20 servers for general use (160 nodes, 640 CPUs)

  13. “ice” Beowulf Cluster
      [diagram: six dual-CPU nodes, each with its own memory, on an internal network]
      • Pentium 3 – 1.4 GHz
      • Internal network isolated from outside network traffic

  14. “global” MPICH Cluster
      [diagram: six dual-CPU nodes (global, global1–global5), each with its own memory, on a shared network]
      • Pentium 4 XEON – 2.4 GHz

  15. RESULTS
      • 5-hour (12Z–17Z) and 24-hour (12Z–12Z) runs
      • 20 Sept 2002 test data set used for developing the AQF-CMAQ
      • Input met from ETA, processed through PRDGEN and PREMAQ
      • 166 columns X 142 rows X 22 layers at 12 km resolution (domain seen on following slides)
      • CB4 mechanism, no aerosols
      • Pleim's Yamartino advection for AQF-CMAQ; PPM advection for the May 2003 Release

  16. The Matrix
      • Processor configurations: 4 = 2 X 2, 8 = 4 X 2, 16 = 4 X 4, 32 = 8 X 4, 64 = 8 X 8, 64 = 16 X 4
      • run = comparison of run times; wall = comparison of relative wall times for the main science processes
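For reference, a small stand-alone sketch (not CMAQ code; the remainder-spreading rule is an assumption) of how the 166 X 142 grid splits into the column X row configurations above. The resulting patch dimensions are what the Aspect Ratios slides at the end show, e.g. 21- or 20-column by 36- or 35-row patches for the 8 X 4 case.

     ! Sketch: block decomposition of the 166-column x 142-row AQF grid over
     ! NPCOL x NPROW workers; any remainder goes to the leading processors,
     ! one extra column/row each (assumed rule). Not CMAQ source.
     PROGRAM DECOMP_SIZES
        IMPLICIT NONE
        INTEGER, PARAMETER :: GL_NCOLS = 166, GL_NROWS = 142
        INTEGER :: NPCOL, NPROW, I, J, NCOLS, NROWS

        NPCOL = 8                      ! e.g. the 32 = 8 X 4 configuration
        NPROW = 4

        DO J = 1, NPROW
           DO I = 1, NPCOL
              NCOLS = GL_NCOLS / NPCOL
              IF ( I .LE. MOD( GL_NCOLS, NPCOL ) ) NCOLS = NCOLS + 1   ! 21 or 20 columns
              NROWS = GL_NROWS / NPROW
              IF ( J .LE. MOD( GL_NROWS, NPROW ) ) NROWS = NROWS + 1   ! 36 or 35 rows
              WRITE( *, '( A, I3, A, I4, A, I4 )' )                    &
                 'PE ', ( J - 1 ) * NPCOL + I - 1, ':  NCOLS = ', NCOLS, '  NROWS = ', NROWS
           END DO
        END DO

     END PROGRAM DECOMP_SIZES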

  17. 24-Hour AQF vs. 2003 Release [side-by-side concentration maps: 2003 Release and AQF-CMAQ; data shown at peak hour]

  18. 24-Hour AQF vs. 2003 Release • Almost 9 ppb maximum difference between Yamartino and PPM • Less than 0.2 ppb difference between AQF on cypress and snow

  19. AQF vs. 2003 Release [chart: absolute run times (sec); 24 hours, 8 worker processors]

  20. AQF vs. 2003 Release on cypress, and AQF on snow [chart: absolute run times (sec); 24 hours, 32 worker processors]

  21. AQF-CMAQ – Various Platforms [chart: relative run times, % of slowest; 5 hours, 8 worker processors]

  22. AQF-CMAQ cypress vs. snow [chart: relative run times, % of slowest; 5 hours]

  23. AQF-CMAQ on Various Platforms [chart: relative run times, % of slowest, vs. number of worker processors; 5 hours]

  24. AQF-CMAQ on cypress [chart: relative wall times (%); 5 hours]

  25. AQF-CMAQ on snow [chart: relative wall times (%); 5 hours]

  26. AQF vs. 2003 Release on cypress [chart: relative wall times (%); 24 hr, 8 worker processors; PPM vs. Yamo advection]

  27. AQF vs. 2003 Release on cypress [chart: relative wall times (%); 24 hr, 8 worker processors; snow added for 8 and 32 processors]

  28. AQF Horizontal Advection [chart: relative wall times (%); 24 hr, 8 worker processors. Legend: x-r-l = x-row-loop, y-c-l = y-column-loop; the hppm kernel solver runs in each loop]

  29. AQF Horizontal Advection [chart: relative wall times (%); 24 hr, 32 worker processors]

  30. AQF & Release Horizontal Advection [chart: relative wall times (%); 24 hr, 32 worker processors. Legend: the r- prefix denotes the release version]

  31. Future Work • Add aerosols back in • TKE vdiff • Improve horizontal advection/diffusion scalability • Some I/O improvements • Layer-variable horizontal advection time steps

  32. Aspect Ratios [diagram: per-processor subdomain column X row dimensions for decompositions of the 166 X 142 grid, e.g. 83 X 71 for 2 X 2, 42/41 X 36/35 for 4 X 4, 21/20 X 36/35 for 8 X 4]

  33. Aspect Ratios [diagram: per-processor subdomain dimensions for the larger decompositions, e.g. 21/20 X 18/17 for 8 X 8 and 11/10 X 36/35 for 16 X 4]
