
GPU Requirements for Large Scale Scientific Applications


Presentation Transcript


  1. GPU Requirements for Large Scale Scientific Applications. “Begin with the end in mind…” Dr. Mark Seager, Asst. DH for Advanced Technology. UCRL-PRES-206123, August 7, 2004. Presented to the GP2 Workshop. This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.

  2. Overview
     • Code characteristics
     • Hardware requirements
     • Software requirements
     • Runtime requirements
     Example calculations shown on the slide:
     • Gilmer: 10M atoms, 800 CPUs @ 2 GiB, 110 GB output, 48 hr; 10K LOC, 35% PE efficiency, 95% parallel efficiency
     • Bringa: 350M atoms, 1,944 CPUs @ 1 GiB, 50 TB output, 7 days; 10K LOC, 35% PE efficiency, 95% parallel efficiency

  3. Simulations and the experimental program are tightly coupled for overall confidence in the stockpile. A simulation's value depends on the other elements of the integrated program.

  4. Code Characteristics
     • Complex multi-physics package applications
     • Typically solving multiple types of PDEs
     • Time-evolution calculations (100K time steps → weeks of runtime)
     • Non-linear solves in each package (100s)
     • Linear solves within each non-linear solve (1,000s) (loop structure sketched below)
     • Multiple physical-properties databases
     • Languages include C, C++, Fortran 90, Python
     • 50K–1.5M LOC
     • Heavy use of complex structures and C++ templates
     • Need programming-model and platform-architecture stability for horizontal (platform) and vertical (time-dependent) portability
     • Very complex makefiles, controllers (Perl and Python), and pre- and post-processing
     • Designed and written from the ground up for MPI- and OpenMP-style parallelism
     • Targeted at hierarchical memory systems
     • Lots of low-level parallelism left to be exploited, but with short vector lengths
     • Written by large (5–25 person) teams
     • Core-physics physicists
     • Computer scientists
     • Mathematicians
     • Timespans: 3–5 years to develop, 10 years of usage, 5–10 years as legacy
     • Constant evolution of codes to add physics features, debug, and improve validation and databases
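A minimal C++ sketch of the nested solve structure described above: a time-evolution loop, a handful of non-linear (Newton-style) iterations per package per step, and an inner linear solve inside each non-linear iteration. All names (Package, assemble_jacobian, krylov_solve, the iteration counts) are illustrative assumptions, not taken from the original codes.

```cpp
#include <vector>

struct State { std::vector<double> u; };

struct Package {
    // One non-linear solve: a few Newton iterations, each needing a linear solve.
    void advance(State& s, double dt) {
        for (int newton = 0; newton < 10 && !converged(s); ++newton) {
            assemble_jacobian(s, dt);   // PDE-specific assembly (hypothetical)
            krylov_solve(s);            // inner linear solve: 1,000s per run
        }
    }
    bool converged(const State&) const { return false; }  // placeholder
    void assemble_jacobian(State&, double) {}              // placeholder
    void krylov_solve(State&) {}                            // placeholder
};

int main() {
    State state;
    std::vector<Package> packages(3);   // multi-physics: several PDE packages
    double t = 0.0, dt = 1.0e-6;
    for (long step = 0; step < 100000; ++step) {  // ~100K time steps -> weeks
        for (auto& p : packages) p.advance(state, dt);
        t += dt;
    }
    return 0;
}
```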

  5. Application Performance Characteristics
     • Node code
     • No hot spots – e.g., a package has 20 routines with ~5% of the runtime each
     • Compute intensive, with 5–35% performance efficiency
     • 5–20% FMA
     • Random-access and block-access memory patterns
     • Most don't have math-library (e.g., BLAS3) usage
     • Typically use 0.5–4.0 GiB of memory
     • MPI
     • Long and short messages, depending on the package
     • Exchanges for FEM
     • Random connections for sparse-matrix operations
     • Highly dependent on Barrier and ALL_REDUCE (message patterns sketched below)
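A minimal sketch of the two MPI message patterns noted above: point-to-point halo exchanges for FEM-style boundaries, and a global MPI_Allreduce of the kind whose latency these codes are highly sensitive to. The 1-D neighbor layout, message size, and payload are illustrative assumptions.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Halo exchange with left/right neighbors (long or short messages per package).
    std::vector<double> send(1024, rank), recv(1024, 0.0);
    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;
    MPI_Request reqs[2];
    MPI_Irecv(recv.data(), static_cast<int>(recv.size()), MPI_DOUBLE, left,  0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send.data(), static_cast<int>(send.size()), MPI_DOUBLE, right, 0,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Global reduction: convergence norms and dot products gate every solver step.
    double local_norm = 1.0, global_norm = 0.0;
    MPI_Allreduce(&local_norm, &global_norm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```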

  6. GPU hardware requirements
     • 64-bit arithmetic predominates, but some 32-bit is acceptable
     • Need better IEEE arithmetic
     • Better FP behavior, not full compliance
     • Exception-generation mechanism
     • Large memory and access to node memory
     • Streaming access to node memory
     • Random-access and block-access modes
     • Reduced texture-memory restrictions
     • Efficient gather and scatter mechanisms (sketched below)
     • Short vectors → low overhead to start parallelism
     • Conditional execution essential for vectorization of if-tests
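An illustrative C++ sketch (not from the original slides) of two of the patterns the hardware bullets refer to: indexed gather/scatter through an indirection array, as in sparse-matrix and unstructured-mesh codes, and an if-test written as a branch-free select so short vectors of work items can be vectorized or predicated. Function names are hypothetical; output vectors are assumed pre-sized by the caller.

```cpp
#include <cstddef>
#include <vector>

// Gather: pull values through an indirection array.
void gather(const std::vector<double>& src, const std::vector<std::size_t>& idx,
            std::vector<double>& dst) {
    for (std::size_t i = 0; i < idx.size(); ++i)
        dst[i] = src[idx[i]];
}

// Scatter: push results back through the same indirection.
void scatter(const std::vector<double>& src, const std::vector<std::size_t>& idx,
             std::vector<double>& dst) {
    for (std::size_t i = 0; i < idx.size(); ++i)
        dst[idx[i]] = src[i];
}

// Conditional execution: evaluate and select instead of branching, so every
// lane does the same work and the loop remains vectorizable.
void clamp_floor(std::vector<double>& v, double floor_value) {
    for (double& x : v)
        x = (x < floor_value) ? floor_value : x;  // compiles to a select/predicate
}
```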

  7. GPU software requirements
     • Languages
     • The closer to C and C++ the better
     • Porting to OpenGL is not an option
     • Challenge is to be able to express data parallelism (streams) in portable C (see the sketch below)
     • Ability to debug is essential
     • How to efficiently utilize multiple GPUs?
     • Multiple levels of parallelism (data parallel, multi-GPU, GPU–CPU, multi-CPU)
     • Open source
     • Device drivers, compilers, debuggers, etc.
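A hedged sketch of what "expressing data parallelism (streams) in portable C/C++" might look like: the kernel is an ordinary C++ callable applied element-wise over input streams by a generic map, so the same source can run on a CPU (OpenMP here) or be handed to a GPU back end. The names stream_map and the lambda kernel are hypothetical, not an existing API.

```cpp
#include <cstddef>
#include <vector>

// Apply a per-element kernel over two input streams; the parallel back end
// (OpenMP here, a GPU runtime elsewhere) is hidden behind this one function.
template <class Kernel>
void stream_map(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& out, Kernel k) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(out.size()); ++i)
        out[i] = k(a[i], b[i]);
}

int main() {
    std::vector<double> p(1 << 20, 2.0), q(1 << 20, 3.0), r(1 << 20);
    // The "kernel" stays plain C++: no OpenGL, no shader language.
    stream_map(p, q, r, [](double x, double y) { return x * y + 1.0; });
    return 0;
}
```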

  8. Runtime requirements
     • Dynamically load programs into the GPU with dynamically linked libraries (see the sketch below)
     • Need an exception mechanism
     • Ability to cleanly map node memory into GPU memory
     • Move data with portable constructs
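A sketch of the dynamic-loading requirement using the host-side analogue that already exists: loading a compiled kernel from a shared library at run time with POSIX dlopen/dlsym. The library name, symbol name, and kernel signature are hypothetical; the point is load/link/invoke at run time rather than at build time, with node memory handed to the kernel through ordinary pointers.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical kernel signature exported by the dynamically loaded library.
typedef void (*kernel_fn)(const double* in, double* out, int n);

int main() {
    void* handle = dlopen("./libgpu_kernel.so", RTLD_NOW);  // hypothetical library
    if (!handle) { std::fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

    kernel_fn kernel = (kernel_fn)dlsym(handle, "advect_kernel");  // hypothetical symbol
    if (!kernel) { std::fprintf(stderr, "dlsym failed: %s\n", dlerror()); return 1; }

    double in[16] = {0}, out[16] = {0};
    kernel(in, out, 16);  // in a GPU runtime, in/out would be node memory
                          // cleanly mapped into GPU-visible memory
    dlclose(handle);
    return 0;
}
```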

  9. Possible approaches
     • HPC market potential could be used to induce vendors to improve the environment
     • 1K clusters have 2–4K slots for GPUs…
     • Large market
     • Libraries
     • Not widely used
     • A few key applications could benefit
     • Key functions (EOS-lookup sketch follows this slide)
     • Monte Carlo random-number generation
     • EOS evaluation utilizing “free” interpolation
     • FEM element-by-element matrix operations
     • Secondary calculations (diagnostics, visualization)
     • Work with early adopters
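A small sketch of one candidate library function from the list above: EOS evaluation by table lookup with bilinear interpolation, which is exactly the kind of interpolation GPU texture units provide "for free". The table layout and the (density, temperature) indexing scheme are illustrative assumptions.

```cpp
#include <vector>

struct EosTable {
    int nr, nt;                  // table dimensions in density and temperature
    double r0, dr, t0, dt;       // uniform grid origin and spacing
    std::vector<double> p;       // pressure values, row-major [it * nr + ir]

    double pressure(double rho, double temp) const {
        double x = (rho - r0) / dr, y = (temp - t0) / dt;
        int ir = static_cast<int>(x), it = static_cast<int>(y);
        if (ir < 0) ir = 0;
        if (ir > nr - 2) ir = nr - 2;
        if (it < 0) it = 0;
        if (it > nt - 2) it = nt - 2;
        double fx = x - ir, fy = y - it;
        double p00 = p[it * nr + ir],       p10 = p[it * nr + ir + 1];
        double p01 = p[(it + 1) * nr + ir], p11 = p[(it + 1) * nr + ir + 1];
        // Bilinear blend: the operation texture hardware performs in one fetch.
        return (1 - fx) * (1 - fy) * p00 + fx * (1 - fy) * p10
             + (1 - fx) * fy * p01 + fx * fy * p11;
    }
};

int main() {
    EosTable eos{2, 2, 0.0, 1.0, 0.0, 1.0, {1.0, 2.0, 3.0, 4.0}};
    double p_mid = eos.pressure(0.5, 0.5);  // -> 2.5, the bilinear blend
    return p_mid > 0.0 ? 0 : 1;
}
```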

  10. Conclusions
     • Large scientific simulations have enormous computational requirements
     • GPUs offer unique capabilities and are becoming more usable
     • Widespread adoption awaits more general-purpose usability
     Example calculations shown on the slide:
     • Woodward: 8B-zone turbulent hydro, 2K CPUs @ 1.5 GiB, 25 days, 25 TB vis data
     • Langer: 6.8T-zone laser–plasma interaction, 1,920 CPUs @ 1 GiB, 10 days for 35 picoseconds, 14 TB vis data
