
GPU Requirements for Large Scale Scientific Applications


Presentation Transcript


  1. GPU Requirements for Large Scale Scientific Applications. “Begin with the end in mind…” Dr. Mark Seager, Asst. DH for Advanced Technology. UCRL-PRES-206123, August 7, 2004. Presented to the GP2 Workshop. This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.

  2. Overview
     • Code characteristics
     • Hardware requirements
     • Software requirements
     • Runtime requirements
     Example calculations shown on the slide:
     • Gilmer: 10M atoms, 800 CPUs @ 2 GiB, 110 GB output, 48 hr; 10K LOC, 35% PE efficiency, 95% parallel efficiency
     • Bringa: 350M atoms, 1,944 CPUs @ 1 GiB, 50 TB output, 7 days; 10K LOC, 35% PE efficiency, 95% parallel efficiency

  3. Simulations and the experimental program are tightly coupled for overall confidence in the stockpile. A simulation's value depends on the other elements of the integrated program.

  4. Code Characteristics
     • Complex multi-physics package applications
     • Typically solving multiple types of PDEs
     • Time-evolution calculations (100K time steps → weeks of runtime)
     • Non-linear solves in each package (100s)
     • Linear solves within each non-linear solve (1,000s) (loop structure sketched below)
     • Multiple physical-properties databases
     • Languages include C, C++, Fortran 90, Python
     • 50K–1.5M LOC
     • Heavy use of complex structures and C++ templates
     • Need programming-model and platform-architecture stability for horizontal (platform) and vertical (time-dependent) portability
     • Very complex makefiles, controllers (Perl and Python), and pre- and post-processing
     • Designed and written from the ground up for MPI- and OpenMP-style parallelism
     • Targeted at hierarchical memory systems
     • Lots of low-level parallelism left to be exploited, but with short vector lengths
     • Written by large (5–25 person) teams
     • Core-physics physicists
     • Computer scientists
     • Mathematicians
     • Timespans: 3–5 years to develop, 10 years of usage, 5–10 years as legacy
     • Constant evolution of codes to add physics features, debug, and improve validation and databases
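A minimal C++ sketch of the nested solve structure described above: a time-evolution loop, a handful of non-linear (Newton-style) iterations per package per step, and an inner linear solve inside each non-linear iteration. All names (Package, assemble_jacobian, krylov_solve, the iteration counts) are illustrative assumptions, not taken from the original codes.

```cpp
#include <vector>

struct State { std::vector<double> u; };

struct Package {
    // One non-linear solve: a few Newton iterations, each needing a linear solve.
    void advance(State& s, double dt) {
        for (int newton = 0; newton < 10 && !converged(s); ++newton) {
            assemble_jacobian(s, dt);   // PDE-specific assembly (hypothetical)
            krylov_solve(s);            // inner linear solve: 1,000s per run
        }
    }
    bool converged(const State&) const { return false; }  // placeholder
    void assemble_jacobian(State&, double) {}              // placeholder
    void krylov_solve(State&) {}                            // placeholder
};

int main() {
    State state;
    std::vector<Package> packages(3);   // multi-physics: several PDE packages
    double t = 0.0, dt = 1.0e-6;
    for (long step = 0; step < 100000; ++step) {  // ~100K time steps -> weeks
        for (auto& p : packages) p.advance(state, dt);
        t += dt;
    }
    return 0;
}
```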

  5. Application Performance Characteristics
     • Node code
     • No hot spots – e.g., a package has 20 routines with ~5% of the runtime each
     • Compute intensive, with 5–35% performance efficiency
     • 5–20% FMA
     • Random-access and block-access memory patterns
     • Most don't have math-library (e.g., BLAS3) usage
     • Typically use 0.5–4.0 GiB of memory
     • MPI
     • Long and short messages, depending on the package
     • Exchanges for FEM
     • Random connections for sparse-matrix operations
     • Highly dependent on Barrier and ALL_REDUCE (message patterns sketched below)
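A minimal sketch of the two MPI message patterns noted above: point-to-point halo exchanges for FEM-style boundaries, and a global MPI_Allreduce of the kind whose latency these codes are highly sensitive to. The 1-D neighbor layout, message size, and payload are illustrative assumptions.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Halo exchange with left/right neighbors (long or short messages per package).
    std::vector<double> send(1024, rank), recv(1024, 0.0);
    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;
    MPI_Request reqs[2];
    MPI_Irecv(recv.data(), static_cast<int>(recv.size()), MPI_DOUBLE, left,  0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send.data(), static_cast<int>(send.size()), MPI_DOUBLE, right, 0,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Global reduction: convergence norms and dot products gate every solver step.
    double local_norm = 1.0, global_norm = 0.0;
    MPI_Allreduce(&local_norm, &global_norm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```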

  6. GPU hardware requirements
     • 64-bit arithmetic predominates, but some 32-bit is acceptable
     • Need better IEEE arithmetic
     • Better FP behavior, not full compliance
     • Exception-generation mechanism
     • Large memory and access to node memory
     • Streaming access to node memory
     • Random-access and block-access modes
     • Reduced texture-memory restrictions
     • Efficient gather and scatter mechanisms (sketched below)
     • Short vectors → low overhead to start parallelism
     • Conditional execution essential for vectorization of if-tests
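An illustrative C++ sketch (not from the original slides) of two of the patterns the hardware bullets refer to: indexed gather/scatter through an indirection array, as in sparse-matrix and unstructured-mesh codes, and an if-test written as a branch-free select so short vectors of work items can be vectorized or predicated. Function names are hypothetical; output vectors are assumed pre-sized by the caller.

```cpp
#include <cstddef>
#include <vector>

// Gather: pull values through an indirection array.
void gather(const std::vector<double>& src, const std::vector<std::size_t>& idx,
            std::vector<double>& dst) {
    for (std::size_t i = 0; i < idx.size(); ++i)
        dst[i] = src[idx[i]];
}

// Scatter: push results back through the same indirection.
void scatter(const std::vector<double>& src, const std::vector<std::size_t>& idx,
             std::vector<double>& dst) {
    for (std::size_t i = 0; i < idx.size(); ++i)
        dst[idx[i]] = src[i];
}

// Conditional execution: evaluate and select instead of branching, so every
// lane does the same work and the loop remains vectorizable.
void clamp_floor(std::vector<double>& v, double floor_value) {
    for (double& x : v)
        x = (x < floor_value) ? floor_value : x;  // compiles to a select/predicate
}
```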

  7. GPU software requirements
     • Languages
     • The closer to C and C++ the better
     • Porting to OpenGL is not an option
     • Challenge is to be able to express data parallelism (streams) in portable C (see the sketch below)
     • Ability to debug is essential
     • How to efficiently utilize multiple GPUs?
     • Multiple levels of parallelism (data parallel, multi-GPU, GPU–CPU, multi-CPU)
     • Open source
     • Device drivers, compilers, debuggers, etc.
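A hedged sketch of what "expressing data parallelism (streams) in portable C/C++" might look like: the kernel is an ordinary C++ callable applied element-wise over input streams by a generic map, so the same source can run on a CPU (OpenMP here) or be handed to a GPU back end. The names stream_map and the lambda kernel are hypothetical, not an existing API.

```cpp
#include <cstddef>
#include <vector>

// Apply a per-element kernel over two input streams; the parallel back end
// (OpenMP here, a GPU runtime elsewhere) is hidden behind this one function.
template <class Kernel>
void stream_map(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& out, Kernel k) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(out.size()); ++i)
        out[i] = k(a[i], b[i]);
}

int main() {
    std::vector<double> p(1 << 20, 2.0), q(1 << 20, 3.0), r(1 << 20);
    // The "kernel" stays plain C++: no OpenGL, no shader language.
    stream_map(p, q, r, [](double x, double y) { return x * y + 1.0; });
    return 0;
}
```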

  8. Runtime requirements
     • Dynamically load programs into the GPU with dynamically linked libraries (see the sketch below)
     • Need an exception mechanism
     • Ability to cleanly map node memory into GPU memory
     • Move data with portable constructs
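A sketch of the dynamic-loading requirement using the host-side analogue that already exists: loading a compiled kernel from a shared library at run time with POSIX dlopen/dlsym. The library name, symbol name, and kernel signature are hypothetical; the point is load/link/invoke at run time rather than at build time, with node memory handed to the kernel through ordinary pointers.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical kernel signature exported by the dynamically loaded library.
typedef void (*kernel_fn)(const double* in, double* out, int n);

int main() {
    void* handle = dlopen("./libgpu_kernel.so", RTLD_NOW);  // hypothetical library
    if (!handle) { std::fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

    kernel_fn kernel = (kernel_fn)dlsym(handle, "advect_kernel");  // hypothetical symbol
    if (!kernel) { std::fprintf(stderr, "dlsym failed: %s\n", dlerror()); return 1; }

    double in[16] = {0}, out[16] = {0};
    kernel(in, out, 16);  // in a GPU runtime, in/out would be node memory
                          // cleanly mapped into GPU-visible memory
    dlclose(handle);
    return 0;
}
```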

  9. Possible approaches
     • HPC market potential could be used to induce vendors to improve the environment
     • 1K clusters have 2–4K slots for GPUs…
     • Large market
     • Libraries
     • Not widely used
     • A few key applications could benefit
     • Key functions (EOS-lookup sketch follows this slide)
     • Monte Carlo random-number generation
     • EOS evaluation utilizing “free” interpolation
     • FEM element-by-element matrix operations
     • Secondary calculations (diagnostics, visualization)
     • Work with early adopters
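A small sketch of one candidate library function from the list above: EOS evaluation by table lookup with bilinear interpolation, which is exactly the kind of interpolation GPU texture units provide "for free". The table layout and the (density, temperature) indexing scheme are illustrative assumptions.

```cpp
#include <vector>

struct EosTable {
    int nr, nt;                  // table dimensions in density and temperature
    double r0, dr, t0, dt;       // uniform grid origin and spacing
    std::vector<double> p;       // pressure values, row-major [it * nr + ir]

    double pressure(double rho, double temp) const {
        double x = (rho - r0) / dr, y = (temp - t0) / dt;
        int ir = static_cast<int>(x), it = static_cast<int>(y);
        if (ir < 0) ir = 0;
        if (ir > nr - 2) ir = nr - 2;
        if (it < 0) it = 0;
        if (it > nt - 2) it = nt - 2;
        double fx = x - ir, fy = y - it;
        double p00 = p[it * nr + ir],       p10 = p[it * nr + ir + 1];
        double p01 = p[(it + 1) * nr + ir], p11 = p[(it + 1) * nr + ir + 1];
        // Bilinear blend: the operation texture hardware performs in one fetch.
        return (1 - fx) * (1 - fy) * p00 + fx * (1 - fy) * p10
             + (1 - fx) * fy * p01 + fx * fy * p11;
    }
};

int main() {
    EosTable eos{2, 2, 0.0, 1.0, 0.0, 1.0, {1.0, 2.0, 3.0, 4.0}};
    double p_mid = eos.pressure(0.5, 0.5);  // -> 2.5, the bilinear blend
    return p_mid > 0.0 ? 0 : 1;
}
```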

  10. Conclusions
     • Large scientific simulations have enormous computational requirements
     • GPUs offer unique capabilities and are becoming more usable
     • Widespread adoption awaits more general-purpose usability
     Example calculations shown on the slide:
     • Woodward: 8B-zone turbulent hydro, 2K CPUs @ 1.5 GiB, 25 days, 25 TB vis data
     • Langer: 6.8T-zone laser–plasma interaction, 1,920 CPUs @ 1 GiB, 10 days for 35 picoseconds, 14 TB vis data
