Developing a computational infrastructure for parallel high performance FE/FVM simulations

  1. Developing a computational infrastructure for parallel high performance FE/FVM simulations
  Dr. Stan Tomov, Brookhaven National Laboratory, August 11, 2003

  2. Outline
  • Motivation and overview
  • Mesh generation
  • Mesh partitioning and load balancing
  • Code optimization
  • Parallel FE/FVM using pthreads/OpenMP/MPI
  • Code organization and data structures
  • Applications
  • Visualization
  • Extensions and future work
  • Conclusions

  3. Motivation
  • Technological advances facilitate research requiring very large scale computations
  • High computing power is needed in many FE/FVM simulations (fluid flow & transport in porous media, heat & mass transfer, elasticity, etc.)
  • Higher demand for simulation accuracy → higher demand for computing power
  • To meet the demand for high computational power:
    - the use of sequential machines is often insufficient (physical limitations of both system memory and processing speed) → use parallel machines
    - develop better algorithms:
      - accuracy and reliability of the computational method
      - efficient use of the available computing resources
  • Closely related to: error control and adaptive mesh refinement, and optimization

  4. Motivation: parallel HP FE/FVM simulations (issues)
  • Choosing the solver: direct or iterative
    - sparse matrices, storage considerations, parallelization, preconditioners
  • How to parallelize:
    - extract parallelism from a sequential algorithm, or
    - develop algorithms with enhanced parallelism
    - domain decomposition data distribution
  • Mesh generation
    - importance of finding a "good" mesh
    - in parallel, adaptive
  • Data structures to maintain
    - preconditioners

  5. Overview (diagram of the infrastructure components: MPI, OpenMP, pthreads, OpenGL)

  6. Mesh generation
  • Importance and requirements
  • Sequential generators
    - Triangle (2D triangular meshes)
    - Netgen (3D tetrahedral meshes)
  • ParaGrid
    - based on the sequential generators
    - adaptively refines a starting mesh in parallel
    - provides data structures suitable for domain decomposition and multilevel type preconditioners

  7. Mesh refinement

  8. Mesh partitioning
  • Mesh partitioners (a Metis call is sketched below)
    - Metis (University of Minnesota)
    - Chaco (Sandia National Laboratories)
    - Jostle (University of Greenwich, London)
  • Requirements
    - a balanced number of elements per subdomain and a minimal interface between subdomains
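As a rough illustration of how a distributed mesh's connectivity graph can be split with Metis, here is a minimal sketch. It assumes the METIS 5.x C API (the original work used an earlier Metis release) and a tiny hypothetical graph in CSR form; it is not ParaGrid's actual partitioning code.

    // Partition a small graph into 2 parts with METIS 5.x; "part" receives a
    // subdomain id per vertex.  Graph and sizes are illustrative only.
    #include <metis.h>
    #include <cstdio>

    int main() {
      // Hypothetical 4-vertex path graph 0-1-2-3 in CSR form (xadj/adjncy)
      idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
      idx_t xadj[]   = {0, 1, 3, 5, 6};
      idx_t adjncy[] = {1, 0, 2, 1, 3, 2};
      idx_t part[4];

      int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                       nullptr, nullptr, nullptr,   // no vertex/edge weights
                                       &nparts, nullptr, nullptr,   // default target weights
                                       nullptr, &objval, part);     // default options
      if (status == METIS_OK)
        for (idx_t v = 0; v < nvtxs; ++v)
          std::printf("vertex %d -> subdomain %d\n", (int)v, (int)part[v]);
      return 0;
    }

In practice the weights arguments are where element counts or error estimates (as on the load balancing slide) would be passed instead of nullptr.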

  9. Load balancing (in AMR)
  • For steady-state problems
    - Algorithm 1: locally adapt the mesh (sequentially); split it using Metis; refine uniformly in parallel
    - Algorithm 2: use error estimates as weights when splitting the mesh; do parallel AMR
  • For transient problems
    - Algorithm 3: ParMetis is used to check the load balance and, if needed, elements are "transferred" between subdomains

  10. Code optimization
  • Main concepts:
    - locality of reference (to improve memory performance)
    - software pipelining (to improve CPU performance)
  • Locality of reference (or "keep things used together close together"):
    - motivated by the memory hierarchy: disk, network → RAM (200 CP) → cache levels (L2: 6 CP, L1: 3 CP) → registers (0 CP); data for an SGI Origin 2000 (MIPS R10000, 250 MHz)
  • Techniques (for cache-friendly algorithms in numerical analysis):
    - loop interchange: for i, j, k = 0..100 do A[i][j][k] += B[i][j][k]*C[i][j][k] is about 10x faster than the k, j, i = 0..100 ordering (see the sketch below)
    - vertex reordering: for example the Cuthill-McKee algorithm (CG example 1.16x faster)
    - blocking: related to the domain decomposition data distribution
    - fusion: merge multiple loops into one, e.g. the vector operations in CG, GMRES, etc., to improve reuse
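To make the loop-interchange point concrete, here is a small self-contained sketch (mine, not from the slides) that times the cache-friendly i, j, k ordering against the strided k, j, i ordering on the same triple loop; the array size and timing mechanism are illustrative assumptions.

    // Loop interchange sketch: the i,j,k order walks A, B, C with unit stride
    // (contiguous memory), while the k,j,i order jumps N*N elements per step.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      const int N = 100;
      std::vector<double> A(N * N * N, 1.0), B(N * N * N, 2.0), C(N * N * N, 3.0);
      auto at = [N](std::vector<double>& v, int i, int j, int k) -> double& {
        return v[(i * N + j) * N + k];            // row-major layout, k fastest
      };

      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < N; ++i)                 // cache-friendly: unit stride in k
        for (int j = 0; j < N; ++j)
          for (int k = 0; k < N; ++k)
            at(A, i, j, k) += at(B, i, j, k) * at(C, i, j, k);
      auto t1 = std::chrono::steady_clock::now();

      for (int k = 0; k < N; ++k)                 // cache-unfriendly: stride N*N in i
        for (int j = 0; j < N; ++j)
          for (int i = 0; i < N; ++i)
            at(A, i, j, k) += at(B, i, j, k) * at(C, i, j, k);
      auto t2 = std::chrono::steady_clock::now();

      std::printf("i,j,k: %lld us   k,j,i: %lld us\n",
          (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
          (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
      return 0;
    }

The exact speedup depends on the machine's cache sizes; the 10x figure on the slide is for the SGI system described above.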

  11. Code optimization
  • Software pipelining (SWP):
    - machine dependent: applicable when the CPU functional units are pipelined
    - can be turned on with compiler options; computing A[i][j][k] += B[i][j][k]*C[i][j][k], i, j, k = 0..100 with SWP increased performance about 100x
    - techniques that help the compiler pipeline: inlining, loop splitting/fusing, loop unrolling (see the unrolling sketch below)
  • Performance monitoring & benchmarking:
    - important for guiding code optimization
    - on SGI we use ssrun, prof, and perfex
    - SGI's pmchart to monitor cluster network traffic
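As an illustration of the loop-unrolling technique mentioned above (my example, not from the slides), the sketch below unrolls an axpy-style loop by four; whether this actually helps depends on the compiler and on how deeply the CPU's functional units are pipelined.

    // Manual 4-way unrolling of y += a*x: exposes independent multiply-adds
    // that a pipelined FPU (or the compiler's software pipeliner) can overlap.
    #include <cstddef>

    void axpy_unrolled(std::size_t n, double a, const double* x, double* y) {
      std::size_t i = 0;
      for (; i + 4 <= n; i += 4) {        // unrolled body: 4 independent updates
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
      }
      for (; i < n; ++i)                  // remainder loop
        y[i] += a * x[i];
    }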

  12. Parallel FE/FVM with pthreads
  • Pthreads are portable and simple
  • Used on shared memory parallel systems
  • Low level parallel programming: the user has to build the more complicated parallel constructs (e.g. the barrier below)
    - not widely used in parallel FE/FVM simulations
  • We use them on HP systems that are both distributed memory parallel and shared memory parallel

      // A reusable barrier built from a mutex and a condition variable
      extern pthread_mutex_t mlock;
      extern pthread_cond_t  sync_wait;
      extern int barrier_counter;
      extern int number_of_threads;

      void pthread_barrier() {
        pthread_mutex_lock(&mlock);
        if (barrier_counter) {                      // not the last arrival: wait
          barrier_counter--;
          pthread_cond_wait(&sync_wait, &mlock);
        } else {                                    // last arrival: reset and release the waiters
          barrier_counter = number_of_threads - 1;
          pthread_cond_broadcast(&sync_wait);       // wake all waiting threads
        }
        pthread_mutex_unlock(&mlock);
      }

  • We use: (1) the "peer model" of parallelism (threads working concurrently), and (2) a "main thread" that handles the MPI communications (see the sketch below)
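A minimal sketch (mine, not ParaGrid's code) of the peer model: a fixed pool of worker threads is created once, each works on its own slice of the data, and all of them synchronize at a barrier between computational phases. Names such as worker and NTHREADS are illustrative, and a POSIX pthread_barrier_t stands in for the custom barrier above.

    // Peer-model sketch: each thread owns one "subdomain" slice of a vector
    // and the peers meet at a barrier between phases of the computation.
    #include <pthread.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    const int NTHREADS = 4;                      // illustrative thread count
    pthread_barrier_t barrier;
    std::vector<double> u(1000, 1.0);

    void* worker(void* arg) {
      long rank = (long)arg;
      std::size_t chunk = u.size() / NTHREADS;
      std::size_t lo = rank * chunk;
      std::size_t hi = (rank == NTHREADS - 1) ? u.size() : lo + chunk;

      for (std::size_t i = lo; i < hi; ++i) u[i] *= 2.0;   // phase 1: local work
      pthread_barrier_wait(&barrier);                      // all peers synchronize
      // ...phase 2 would start here, safely seeing all of phase 1's updates...
      return nullptr;
    }

    int main() {
      pthread_barrier_init(&barrier, nullptr, NTHREADS);
      pthread_t tid[NTHREADS];
      for (long r = 0; r < NTHREADS; ++r) pthread_create(&tid[r], nullptr, worker, (void*)r);
      for (int r = 0; r < NTHREADS; ++r) pthread_join(tid[r], nullptr);
      pthread_barrier_destroy(&barrier);
      std::printf("u[0] = %g\n", u[0]);
      return 0;
    }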

  13. Parallel FE/FVM with OpenMP
  • OpenMP is a portable and simple set of compiler directives and functions for parallel shared memory programming
  • Higher level parallel programming; implementations are often based on pthreads
  • Iterative solvers scale well (Table 3: parallel CG on a problem of size 1024x1024)
  • Used, like pthreads, on mixed distributed and shared memory parallel systems
  • On NUMA architectures the arrays must be properly distributed among the processors:
    - #pragma distribute, #pragma redistribute, #pragma distribute_reshape
  • We use:
    - domain decomposition data distribution
    - a programming model similar to MPI: one parallel region (see the sketch below)

      ...                                   // sequential initialization
      #pragma omp parallel
      {
        int myrank = omp_get_thread_num();
        // data distribution via the "first touch" rule
        S[myrank] = new Subdomain(myrank, ...);
        ...
      }
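A compilable sketch of the one-parallel-region model outlined above, with a hypothetical Subdomain class standing in for ParaGrid's data structures: each thread allocates ("first touches") its own subdomain inside the parallel region so that, on a NUMA machine, the memory pages land near the thread that uses them.

    // One-parallel-region OpenMP model: each thread creates and works on its
    // own subdomain, mimicking an MPI-style decomposition in shared memory.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    struct Subdomain {                    // hypothetical stand-in for ParaGrid's subdomain
      int rank;
      std::vector<double> u;
      Subdomain(int r, int n) : rank(r), u(n, 0.0) {}   // allocation here = "first touch"
      double local_work() { double s = 0; for (double& x : u) { x += rank; s += x; } return s; }
    };

    int main() {
      int nthreads = omp_get_max_threads();
      std::vector<Subdomain*> S(nthreads, nullptr);
      double global_sum = 0.0;

      #pragma omp parallel reduction(+ : global_sum)
      {
        int myrank = omp_get_thread_num();
        S[myrank] = new Subdomain(myrank, 100000);   // touched first by its owner thread
        global_sum += S[myrank]->local_work();       // purely local computation
      }

      std::printf("sum = %g\n", global_sum);
      for (Subdomain* s : S) delete s;
      return 0;
    }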

  14. Parallel FE/FVM with MPI
  • MPI is a system of functions for parallel distributed memory programming
  • Parallel processes communicate by sending and receiving messages
  • Domain decomposition data distribution approach
  • Usually only 6 or 7 functions are used (see the dot-product sketch below):
    - MPI_Allreduce: in computing dot products
    - MPI_Isend and MPI_Recv: in computing matrix-vector products
    - MPI_Barrier: many uses
    - MPI_Bcast: to broadcast sequential input
    - MPI_Comm_rank, MPI_Comm_size
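For the dot-product case, a minimal sketch (mine, not ParaGrid's code) of how MPI_Allreduce combines per-subdomain partial sums; the vector sizes are arbitrary.

    // Distributed dot product: each rank sums its local entries, then
    // MPI_Allreduce combines the partial sums so every rank gets the result.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Illustrative local pieces of two distributed vectors
      std::vector<double> x(100000, 1.0), y(100000, 2.0);

      double local = 0.0, global = 0.0;
      for (std::size_t i = 0; i < x.size(); ++i) local += x[i] * y[i];
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0) std::printf("(x,y) = %g on %d ranks\n", global, size);
      MPI_Finalize();
      return 0;
    }

The matrix-vector product follows the same pattern, except that interface values are exchanged between neighboring subdomains with MPI_Isend/MPI_Recv before the local multiply.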

  15. Mixed implementations
  • MPI & pthreads/OpenMP in a cluster environment (see the hybrid sketch below)
  • Example: parallel CG on a problem of size 314,163, run on a commodity-based cluster (4 nodes, each with 2 Pentium III processors at 1 GHz, connected by a 100 Mbit or 1 Gbit network)
  • Table 1: MPI implementation scalability over the two networks
  • Table 2: MPI implementation scalability vs. the mixed implementation (pthreads on the dual processors)
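To make the mixed model concrete, here is a rough sketch (not the original code) of the hybrid layout: one MPI rank per node, with the on-node parallelism expressed here with OpenMP rather than raw pthreads, and only the main thread calling MPI. It assumes an MPI implementation that supports the MPI_THREAD_FUNNELED level.

    // Hybrid sketch: MPI between nodes, OpenMP threads inside each node.
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      std::vector<double> x(1 << 20, 1.0);            // this rank's piece of a distributed vector
      double local = 0.0;

      #pragma omp parallel for reduction(+ : local)   // on-node threads share the loop
      for (long i = 0; i < (long)x.size(); ++i)
        local += x[i] * x[i];

      double global = 0.0;                            // only the main thread calls MPI
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0) std::printf("||x||^2 = %g\n", global);

      MPI_Finalize();
      return 0;
    }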

  16. ParaGrid code organization

  17. ParaGrid data structures
  • Connections between the different subdomains are kept in terms of packets (a sketch of such a packet follows)
    - a vertex packet is the set of all vertices shared by the same group of subdomains
  • The subdomains sharing a packet have:
    - their own copy of the packet
    - "pointers" to the packet copies in the other subdomains
    - exactly one subdomain is the owner of the packet
  • Similar packets exist for edges and faces, used in:
    - refinement
    - problems with degrees of freedom on edges or faces
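A rough C++ sketch (my illustration, not ParaGrid's actual classes) of what a vertex packet and its cross-subdomain links could look like; all field names are hypothetical.

    // Hypothetical vertex packet: the vertices one subdomain shares with a
    // fixed group of neighbors, plus links to the neighbors' copies.
    #include <vector>

    struct PacketRef {
      int subdomain;            // rank of a neighboring subdomain
      int remote_packet_id;     // "pointer" to that subdomain's copy of the packet
    };

    struct VertexPacket {
      std::vector<int> sharing_subdomains;  // all subdomains that share these vertices
      int owner;                            // exactly one of them owns the packet
      std::vector<int> local_vertex_ids;    // the shared vertices, in local numbering
      std::vector<PacketRef> remote_copies; // where the other copies live
    };

    struct Subdomain {
      int rank;
      std::vector<VertexPacket> vertex_packets;   // edge/face packets would look analogous
    };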

  18. Applications
  • Generation of large, sparse linear systems of equations on massively parallel computers
    - generated on the fly, with no need to store large meshes or linear systems
    - distributed among the processing nodes
    - used at LLNL to generate test problems for the HYPRE project (scalable software for solving such problems)
  • Various FE/FVM discretizations (used at TAMU and LLNL) with applications to:
    - heat and mass transfer
    - linear elasticity
    - flow and transport in porous media

  19. Applications
  • A posteriori error control and AMR (at TAMU and BNL)
    - accuracy and reliability of a computational method
    - efficient use of the available computational resources
  • Studies in domain decomposition and multigrid preconditioners (at LLNL and TAMU)
  • Studies in domain decomposition on non-matching grids (at LLNL and TAMU)
    - interior penalty discontinuous approximations
    - mortar finite element approximations
  • Visualization (at LLNL, TAMU, and BNL)
  • Benchmarking hardware (at BNL)
    - CPU performance
    - network traffic, etc.

  20. Visualization
  • Importance
  • Integration of ParaGrid with visualization (not compiled together):
    - save the mesh & solution in files for later visualization, or
    - send the mesh & solution directly through sockets for visualization
  • GLVis
    - portable, based on OpenGL (also compiles against Mesa)
    - visualizes simple geometric primitives (vertices, lines, and polygons)
    - can be used as a "server": waits for data to be visualized and forks after every data set received (see the server sketch below)
    - combines the parallel input (from ParaGrid) into a sequential visualization
  • VTK based viewer
    - added to support volume visualization
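A stripped-down sketch (my own, with a made-up port and no real message framing) of the "server" pattern described above: listen on a socket and fork a child for each incoming data set, so the parent can keep accepting new connections from the parallel senders.

    // Fork-per-dataset socket server sketch: the parent only accepts
    // connections; each child reads one data set and would hand it to the
    // visualization code.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <csignal>
    #include <cstdio>

    int main() {
      int listener = socket(AF_INET, SOCK_STREAM, 0);
      sockaddr_in addr{};
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(19916);                 // illustrative port number
      if (bind(listener, (sockaddr*)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
      listen(listener, 8);
      signal(SIGCHLD, SIG_IGN);                     // finished children are reaped automatically

      for (;;) {
        int conn = accept(listener, nullptr, nullptr);
        if (conn < 0) continue;
        if (fork() == 0) {                          // child: handle this data set
          char buf[4096];
          ssize_t n;
          while ((n = read(conn, buf, sizeof(buf))) > 0)
            ;                                       // ...parse mesh & solution, then visualize...
          close(conn);
          _exit(0);
        }
        close(conn);                                // parent: keep accepting
      }
    }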

  21. Visualization (diagram: GLVis code structure and features; abstract base classes with derived classes for 2D/3D scalar and 2D/3D vector data visualization)

  22. Extensions and future work
  • Extend and use the developed technology together with other existing HPC tools
    - legacy FE/FVM (or user specific) software
    - interfaces to external solvers (including direct solvers) and preconditioners, etc.
  • Extend the use to various applications
    - electromagnetics
    - elasticity, etc.
  • Tune the code to particular architectures
    - benchmarking and optimization
    - commodity-based clusters

  23. Extensions and future work
  • Further develop methods and tools for adaptive error control and mesh refinement
    - time dependent and non-linear problems
    - better study of the constants involved in the estimates
  • Visualization
    - user specific
    - the GPU as a coprocessor?
  • Create user-friendly interfaces

  24. Conclusions
  • A step toward developing a computational infrastructure for parallel HPC
  • Domain decomposition framework
    - a fundamental concept/technique for parallel computing with a wide area of applications
    - needed for parallel HPC research in numerical PDEs
  • Benefit to computational researchers
    - they require efficient techniques to solve linear systems with millions of unknowns
  • Finding a "good" mesh is essential for developing an efficient computational methodology based on FE/FVM
