
PP POMPA (WG6) Overview Talk

  1. PP POMPA (WG6) Overview Talk: 1st Birthday. COSMO GM11, Rome

  2. Who is POMPA?
  • ARPA-EMR: Davide Cesari
  • C2SM/ETH: Xavier Lapillonne, Anne Roches, Carlos Osuna
  • CASPUR: Stefano Zampini, Piero Lanucara, Cristiano Padrin
  • Cray: Jeffrey Poznanovic, Roberto Ansaloni
  • CSCS: Matthew Cordery, Mauro Bianco, Jean-Guillaume Piccinali, William Sawyer, Neil Stringfellow, Thomas Schulthess, Ugo Varetto
  • DWD: Ulrich Schättler, Kristina Fröhlich
  • KIT: Andrew Ferrone, Hartwig Anzt
  • MeteoSwiss: Petra Baumann, Oliver Fuhrer, André Walser
  • NVIDIA: Tim Schröder, Thomas Bradley
  • Roshydromet: Dmitry Mikushin
  • SCS: Tobias Gysi, Men Muheim, David Müller, Katharina Riedinger
  • USAM: David Palella, Alessandro Cheloni, Pier Francesco Coppola
  • USI: Daniel Ruprecht

  3. Kickoff Workshop
  • May 3-4 2011, hosted by CSCS in Manno
  • 15 talks, 18 participants
  • Goal: get to know each other, report on work already done, plan and coordinate future activities
  • Revised project plan

  4. Task Overview
  • Task 1: Performance analysis and documentation
  • Task 2: Redesign memory layout and data structures
    - Closely linked to work in Tasks 5 and 6
  • Task 3: Improve current parallelization
  • Task 4: Parallel I/O
    - Focus on NetCDF (which is still written from a single core)
    - Technical problems
    - New person (Carlos Osuna, C2SM) starting work on 15.09.2011
  • Task 5: Redesign implementation of dynamical core
  • Task 6: Explore GPU acceleration
  • Task 7: Implementation documentation
    - No progress

  5. Performance Analysis (Task 1) Goal
  • Understand the code from a performance perspective (workflow, data movement, bottlenecks, problems, …)
  • Guide and prioritize the work in the other tasks
  • Try to ensure exchange of information and performance-portability developments

  6. Performance Analysis (Task 1) Work
  • COSMO RAPS 5.0 benchmark with DWD, MeteoSwiss and IPCC/ETH runscripts on hpcforge.org (Ulrich Schättler, Oliver Fuhrer, Anne Roches)
  • Workflow of the RK timestep (Ulrich Schättler): http://www.c2sm.ethz.ch/research/COSMO-CCLM/hp2c_one_year_meeting/2a_schaettler
  • Performance analysis
    - COSMO RAPS 5.0 on Cray XT4, XT5 and XE6 (Jean-Guillaume Piccinali, Anne Roches)
    - COSMO-ART (Oliver Fuhrer)
  • Wiki page

  7.-9. [Performance analysis plots; images not included in the transcript] Jean-Guillaume Piccinali and Anne Roches

  10. Problem: Overfetching
  • Computational intensity is the ratio of floating point operations (ops) per memory reference (ref)
  • When accessing a single array value, a complete cache line (64 bytes = 8 double precision values) is loaded into L1 cache
  • Example: with nboundlines = 3 the loop below starts at i = 4, but fetching A(4) loads the whole cache line, so A(1), A(2), A(3) are also loaded without ever being used

      do i = 1+nboundlines, ie-nboundlines
        A(i) = 0.0d0
      end do

  • If the subdomain on a processor is very small, many values loaded from memory never get used for computation
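  To make the intensity argument concrete, here is a minimal Fortran sketch (illustrative only, not from the talk): fusing two sweeps over an array into one doubles the flops extracted per value fetched, so fewer of the loaded cache lines are wasted.

      ! Two separate sweeps: each loads A once for a single flop (low intensity).
      do i = 1+nboundlines, ie-nboundlines
        A(i) = A(i) + 1.0d0
      end do
      do i = 1+nboundlines, ie-nboundlines
        A(i) = A(i) * 2.0d0
      end do

      ! Fused sweep: two flops per load of A, twice the computational intensity.
      do i = 1+nboundlines, ie-nboundlines
        A(i) = (A(i) + 1.0d0) * 2.0d0
      end do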

  11. Performance Analysis: Wiki https://wiki.c2sm.ethz.ch/Wiki/ProjPOMPATask1

  12. Improve Current Parallelization (Task 3)
  • Loop-level hybrid parallelization (OpenMP/MPI) (Matthew Cordery, Davide Cesari, Stefano Zampini); see the sketch after this list
    - No clear benefit of this approach vs. flat MPI parallelization
    - Is the approach suitable for memory-bandwidth-bound code?
    - Restructuring of the code (into blocks) may help!
  • Overlap communication with computation using non-blocking MPI calls (Stefano Zampini)
  • Lumped halo-updates for COSMO-ART (Christoph Knote, Andrew Ferrone)
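  A minimal sketch of what loop-level hybrid parallelization looks like (loop bounds and array names are illustrative, not the actual COSMO source): MPI decomposes the domain into subdomains, and OpenMP threads split the outer loop within each subdomain.

      ! OpenMP threads share the j loop inside one MPI subdomain.
      ! For a memory-bandwidth-bound kernel the threads also share one
      ! memory bus, which is why the gain over flat MPI can be small.
      !$omp parallel do private(i, k)
      do j = jstart, jend
        do k = 1, ke
          do i = istart, iend
            t(i,k,j) = t(i,k,j) + dt * ttens(i,k,j)
          end do
        end do
      end do
      !$omp end parallel do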

  13. Halo exchange in COSMO
  • 3 types of point-to-point communications: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
  • Halo swapping needs completion of East-West before starting South-North communication (implicit corner exchange)
  • New version which communicates the corners explicitly (2x more messages); see the sketch below
  Stefano Zampini
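  A simplified sketch of the fully non-blocking variant (buffer packing and declarations omitted; buffer and neighbour names are illustrative): once the corner points travel in their own messages, the South-North exchange no longer has to wait for East-West completion.

      ! Post receives and sends for all neighbours at once, then wait.
      call MPI_Irecv(recv_w, n, MPI_DOUBLE_PRECISION, nbr_w, tag, comm, reqs(1), ierr)
      call MPI_Irecv(recv_e, n, MPI_DOUBLE_PRECISION, nbr_e, tag, comm, reqs(2), ierr)
      call MPI_Isend(send_e, n, MPI_DOUBLE_PRECISION, nbr_e, tag, comm, reqs(3), ierr)
      call MPI_Isend(send_w, n, MPI_DOUBLE_PRECISION, nbr_w, tag, comm, reqs(4), ierr)
      ! ... same pattern for the N/S neighbours and the four corners
      !     (hence roughly 2x more messages than the old scheme) ...
      call MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE, ierr)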

  14. New halo-exchange routine
  • OLD (all communication time in one blocking call):
      CALL exch_boundaries(A)
  • NEW (communication time split over several calls):
      CALL exch_boundaries(A,2)
      CALL exch_boundaries(A,2)
      CALL exch_boundaries(A,3)
  Stefano Zampini
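  The point of splitting the exchange is presumably the standard start/compute/complete pattern sketched below (routine names are hypothetical; the actual phase arguments of exch_boundaries are not documented in this talk): interior computation then hides the communication time.

      call start_halo_exchange(A)    ! post non-blocking sends/receives
      call compute_interior(A)       ! grid points that need no halo data
      call finish_halo_exchange(A)   ! wait for the pending requests
      call compute_boundary(A)       ! grid points that touch the halo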

  15. Early results: COSMO-2
  • [Plots not included in the transcript: total time (s) for model runs; mean total time for RK dynamics]
  • Is Testany/Waitany the most efficient way to assure completion? (see the sketch below)
  • Restructuring of the code to find more work (B) could help!
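  On the Testany/Waitany question: the usual pattern is a loop that processes whichever message completes first, rather than waiting in a fixed order (generic sketch, not the COSMO implementation; unpack_halo is hypothetical).

      ! nreq pending requests in reqs(:); handle completions in arrival order.
      do m = 1, nreq
        call MPI_Waitany(nreq, reqs, idx, MPI_STATUS_IGNORE, ierr)
        call unpack_halo(idx)   ! unpack the buffer of the neighbour that finished
      end do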

  16. Explore GPU Acceleration (Task 6)
  Goal
  • Investigate whether and how GPUs can be leveraged for numerical weather prediction with COSMO
  Background
  • Early investigations by Michalakes et al. using WRF physical parametrizations
  • Full port of the JMA next-generation model (ASUCA) to GPUs via a rewrite in CUDA
  • New model developments (e.g. NIM at NOAA) which have GPUs as a target architecture in mind from the very start

  17. GPU Motivation

  Chip                 Intel Westmere       NVIDIA Fermi M2090
  Architecture         6 cores @ 3.4 GHz    512 cores @ 1.3 GHz
  Peak Performance     81.6 GFlops          665 GFlops
  Memory Bandwidth     32 GB/s              155 GB/s
  Power Consumption    130 Watt             225 Watt
  Price per Node       X $                  X $

  Ratios (M2090 vs. Westmere): compute bound ×8, memory bound ×5, "power bound" ×1.7

  18. Programming GPUs
  • Programming languages (OpenCL, CUDA C, CUDA Fortran, …)
    - Two codes to maintain
    - Highest control, but requires a complete rewrite
    - Highest performance (if done by an expert)
  • Directive-based approach (PGI, OpenMP-acc, HMPP, …); see the sketch below
    - Smaller modifications to the original code
    - The resulting code is still understandable by Fortran programmers and can be easily modified
    - Possible performance sacrifice (w.r.t. a rewrite)
    - No standard for the moment
  • Source-to-source translation (F2C-acc, Kernelgen, …)
    - One source code
    - Can achieve very good performance
    - Legacy codes often don't map very well onto GPUs
    - Hard to debug
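  For illustration, a directive-based port of a simple Fortran loop might look like the sketch below, using PGI Accelerator-style annotations (the exact directive syntax is an assumption here and differs between compilers); the loop body stays ordinary Fortran.

      ! The directives ask the compiler to generate a GPU kernel for the
      ! loop nest; without a GPU target the same source compiles for the CPU.
      !$acc region
      do j = 1, je
        do i = 1, ie
          A(i,j) = B(i,j) + c * D(i,j)
        end do
      end do
      !$acc end region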

  19. Challenges
  • How to change a wheel on a moving car?
    - GPU hardware and programming models are rapidly changing
    - Several approaches are vendor-bound and/or not part of a standard
    - COSMO is also rapidly evolving
  • How to have a single readable code which also compiles onto GPUs?
    - Efficiency may require restructuring or even a change of algorithm
    - Directives jungle
  • Efficient GPU implementation requires…
    - executing all of COSMO on the GPU
    - enough fine-grain parallelism (i.e. threads)

  20. Explore GPU Acceleration (Task 6) Work
  • Source-to-source translation of the whole model (Dmitry Mikushin)
  • Porting of physical parametrizations using PGI directives or f2c-acc (Xavier Lapillonne, Cristiano Padrin) (next talk)
  • Rewrite of the dynamical core for GPUs (Oliver Fuhrer) (talk after next)

  21. HP2C OPCODE Project
  • Additional proposal to the Swiss HP2C initiative to build an "OPerational COSMO DEmonstrator (OPCODE)"
  • Project proposal accepted
  • Project runs from 1 June 2011 until the end of 2012
  • Project lead: André Walser
  • Project resources:
    - second contract with IT company SCS to continue the collaboration until end of 2012
    - 2 new positions at MeteoSwiss for about 1 year
    - contribution to a position at C2SM
    - contribution from CSCS

  22. HP2C OPCODE Project Main Goals
  [Image: GPU-based hardware (a few rack units) vs. Cray XT4 (3 cabinets)]
  • Leverage the research results of the ongoing HP2C COSMO project
  • Prototype implementation of the MeteoSwiss production suite making aggressive use of GPU technology
  • Similar time-to-solution on hardware with substantially lower power consumption and price

  23. Thank you!
