Developing a computational infrastructure for parallel high performance FE/FVM simulations

  1. Developing a computational infrastructure for parallel high performance FE/FVM simulations
  Dr. Stan Tomov, Brookhaven National Laboratory, August 11, 2003

  2. Outline
  • Motivation and overview
  • Mesh generation
  • Mesh partitioning and load balancing
  • Code optimization
  • Parallel FE/FVM using pthreads/OpenMP/MPI
  • Code organization and data structures
  • Applications
  • Visualization
  • Extensions and future work
  • Conclusions

  3. Motivation
  • Technological advances facilitate research requiring very large scale computations
  • High computing power is needed in many FE/FVM simulations (fluid flow & transport in porous media, heat & mass transfer, elasticity, etc.)
  • Higher demand for simulation accuracy → higher demand for computing power
  • To meet the demand for high computational power:
    - the use of sequential machines is often insufficient (physical limitations of both system memory and processing speed) → use parallel machines
    - develop better algorithms:
      - accuracy and reliability of the computational method
      - efficient use of the available computing resources
  • Closely related to: error control and adaptive mesh refinement, and optimization

  4. Motivation: parallel HP FE/FVM simulations (issues)
  • Choosing the solver: direct or iterative
    - sparse matrices, storage considerations, parallelization, preconditioners
  • How to parallelize:
    - extract parallelism from a sequential algorithm, or
    - develop algorithms with enhanced parallelism
    - domain decomposition data distribution
  • Mesh generation
    - importance of finding a "good" mesh
    - in parallel, adaptive
  • Data structures to maintain
    - preconditioners

  5. Overview (diagram of the infrastructure components: MPI, OpenMP, pthreads, OpenGL)

  6. Mesh generation
  • Importance and requirements
  • Sequential generators
    - Triangle (2D triangular meshes)
    - Netgen (3D tetrahedral meshes)
  • ParaGrid
    - based on the sequential generators
    - adaptively refines a starting mesh in parallel
    - provides data structures suitable for domain decomposition and multilevel type preconditioners

  7. Mesh refinement

  8. Mesh partitioning
  • Mesh partitioners (a Metis call is sketched below)
    - Metis (University of Minnesota)
    - Chaco (Sandia National Laboratories)
    - Jostle (University of Greenwich, London)
  • Requirements
    - a balanced number of elements per subdomain and a minimal interface between subdomains
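As a rough illustration of how a distributed mesh's connectivity graph can be split with Metis, here is a minimal sketch. It assumes the METIS 5.x C API (the original work used an earlier Metis release) and a tiny hypothetical graph in CSR form; it is not ParaGrid's actual partitioning code.

    // Partition a small graph into 2 parts with METIS 5.x; "part" receives a
    // subdomain id per vertex.  Graph and sizes are illustrative only.
    #include <metis.h>
    #include <cstdio>

    int main() {
      // Hypothetical 4-vertex path graph 0-1-2-3 in CSR form (xadj/adjncy)
      idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
      idx_t xadj[]   = {0, 1, 3, 5, 6};
      idx_t adjncy[] = {1, 0, 2, 1, 3, 2};
      idx_t part[4];

      int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                       nullptr, nullptr, nullptr,   // no vertex/edge weights
                                       &nparts, nullptr, nullptr,   // default target weights
                                       nullptr, &objval, part);     // default options
      if (status == METIS_OK)
        for (idx_t v = 0; v < nvtxs; ++v)
          std::printf("vertex %d -> subdomain %d\n", (int)v, (int)part[v]);
      return 0;
    }

In practice the weights arguments are where element counts or error estimates (as on the load balancing slide) would be passed instead of nullptr.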

  9. Load balancing (in AMR)
  • For steady-state problems
    - Algorithm 1: locally adapt the mesh (sequentially); split it using Metis; refine uniformly in parallel
    - Algorithm 2: use error estimates as weights when splitting the mesh; do parallel AMR
  • For transient problems
    - Algorithm 3: ParMetis is used to check the load balance and, if needed, elements are "transferred" between subdomains

  10. Code optimization
  • Main concepts:
    - locality of reference (to improve memory performance)
    - software pipelining (to improve CPU performance)
  • Locality of reference (or "keep things used together close together"):
    - motivated by the memory hierarchy: disk, network → RAM (200 CP) → cache levels (L2: 6 CP, L1: 3 CP) → registers (0 CP); data for an SGI Origin 2000 (MIPS R10000, 250 MHz)
  • Techniques (for cache-friendly algorithms in numerical analysis):
    - loop interchange: for i, j, k = 0..100 do A[i][j][k] += B[i][j][k]*C[i][j][k] is about 10x faster than the k, j, i = 0..100 ordering (see the sketch below)
    - vertex reordering: for example the Cuthill-McKee algorithm (CG example 1.16x faster)
    - blocking: related to the domain decomposition data distribution
    - fusion: merge multiple loops into one, e.g. the vector operations in CG, GMRES, etc., to improve reuse
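To make the loop-interchange point concrete, here is a small self-contained sketch (mine, not from the slides) that times the cache-friendly i, j, k ordering against the strided k, j, i ordering on the same triple loop; the array size and timing mechanism are illustrative assumptions.

    // Loop interchange sketch: the i,j,k order walks A, B, C with unit stride
    // (contiguous memory), while the k,j,i order jumps N*N elements per step.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      const int N = 100;
      std::vector<double> A(N * N * N, 1.0), B(N * N * N, 2.0), C(N * N * N, 3.0);
      auto at = [N](std::vector<double>& v, int i, int j, int k) -> double& {
        return v[(i * N + j) * N + k];            // row-major layout, k fastest
      };

      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < N; ++i)                 // cache-friendly: unit stride in k
        for (int j = 0; j < N; ++j)
          for (int k = 0; k < N; ++k)
            at(A, i, j, k) += at(B, i, j, k) * at(C, i, j, k);
      auto t1 = std::chrono::steady_clock::now();

      for (int k = 0; k < N; ++k)                 // cache-unfriendly: stride N*N in i
        for (int j = 0; j < N; ++j)
          for (int i = 0; i < N; ++i)
            at(A, i, j, k) += at(B, i, j, k) * at(C, i, j, k);
      auto t2 = std::chrono::steady_clock::now();

      std::printf("i,j,k: %lld us   k,j,i: %lld us\n",
          (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
          (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
      return 0;
    }

The exact speedup depends on the machine's cache sizes; the 10x figure on the slide is for the SGI system described above.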

  11. Code optimization
  • Software pipelining (SWP):
    - machine dependent: applicable when the CPU functional units are pipelined
    - can be turned on with compiler options; computing A[i][j][k] += B[i][j][k]*C[i][j][k], i, j, k = 0..100 with SWP increased performance about 100x
    - techniques that help the compiler pipeline: inlining, loop splitting/fusing, loop unrolling (see the unrolling sketch below)
  • Performance monitoring & benchmarking:
    - important for guiding code optimization
    - on SGI we use ssrun, prof, and perfex
    - SGI's pmchart to monitor cluster network traffic
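As an illustration of the loop-unrolling technique mentioned above (my example, not from the slides), the sketch below unrolls an axpy-style loop by four; whether this actually helps depends on the compiler and on how deeply the CPU's functional units are pipelined.

    // Manual 4-way unrolling of y += a*x: exposes independent multiply-adds
    // that a pipelined FPU (or the compiler's software pipeliner) can overlap.
    #include <cstddef>

    void axpy_unrolled(std::size_t n, double a, const double* x, double* y) {
      std::size_t i = 0;
      for (; i + 4 <= n; i += 4) {        // unrolled body: 4 independent updates
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
      }
      for (; i < n; ++i)                  // remainder loop
        y[i] += a * x[i];
    }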

  12. Parallel FE/FVM with pthreads
  • Pthreads are portable and simple
  • Used on shared memory parallel systems
  • Low level parallel programming: the user has to build the more complicated parallel constructs (e.g. the barrier below)
    - not widely used in parallel FE/FVM simulations
  • We use them on HP systems that are both distributed memory parallel and shared memory parallel

      // A reusable barrier built from a mutex and a condition variable
      extern pthread_mutex_t mlock;
      extern pthread_cond_t  sync_wait;
      extern int barrier_counter;
      extern int number_of_threads;

      void pthread_barrier() {
        pthread_mutex_lock(&mlock);
        if (barrier_counter) {                      // not the last arrival: wait
          barrier_counter--;
          pthread_cond_wait(&sync_wait, &mlock);
        } else {                                    // last arrival: reset and release the waiters
          barrier_counter = number_of_threads - 1;
          pthread_cond_broadcast(&sync_wait);       // wake all waiting threads
        }
        pthread_mutex_unlock(&mlock);
      }

  • We use: (1) the "peer model" of parallelism (threads working concurrently), and (2) a "main thread" that handles the MPI communications (see the sketch below)
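A minimal sketch (mine, not ParaGrid's code) of the peer model: a fixed pool of worker threads is created once, each works on its own slice of the data, and all of them synchronize at a barrier between computational phases. Names such as worker and NTHREADS are illustrative, and a POSIX pthread_barrier_t stands in for the custom barrier above.

    // Peer-model sketch: each thread owns one "subdomain" slice of a vector
    // and the peers meet at a barrier between phases of the computation.
    #include <pthread.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    const int NTHREADS = 4;                      // illustrative thread count
    pthread_barrier_t barrier;
    std::vector<double> u(1000, 1.0);

    void* worker(void* arg) {
      long rank = (long)arg;
      std::size_t chunk = u.size() / NTHREADS;
      std::size_t lo = rank * chunk;
      std::size_t hi = (rank == NTHREADS - 1) ? u.size() : lo + chunk;

      for (std::size_t i = lo; i < hi; ++i) u[i] *= 2.0;   // phase 1: local work
      pthread_barrier_wait(&barrier);                      // all peers synchronize
      // ...phase 2 would start here, safely seeing all of phase 1's updates...
      return nullptr;
    }

    int main() {
      pthread_barrier_init(&barrier, nullptr, NTHREADS);
      pthread_t tid[NTHREADS];
      for (long r = 0; r < NTHREADS; ++r) pthread_create(&tid[r], nullptr, worker, (void*)r);
      for (int r = 0; r < NTHREADS; ++r) pthread_join(tid[r], nullptr);
      pthread_barrier_destroy(&barrier);
      std::printf("u[0] = %g\n", u[0]);
      return 0;
    }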

  13. Parallel FE/FVM with OpenMP
  • OpenMP is a portable and simple set of compiler directives and functions for parallel shared memory programming
  • Higher level parallel programming; implementations are often based on pthreads
  • Iterative solvers scale well (Table 3: parallel CG on a problem of size 1024x1024)
  • Used, like pthreads, on mixed distributed and shared memory parallel systems
  • On NUMA architectures the arrays must be properly distributed among the processors:
    - #pragma distribute, #pragma redistribute, #pragma distribute_reshape
  • We use:
    - domain decomposition data distribution
    - a programming model similar to MPI: one parallel region (see the sketch below)

      ...                                   // sequential initialization
      #pragma omp parallel
      {
        int myrank = omp_get_thread_num();
        // data distribution via the "first touch" rule
        S[myrank] = new Subdomain(myrank, ...);
        ...
      }
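A compilable sketch of the one-parallel-region model outlined above, with a hypothetical Subdomain class standing in for ParaGrid's data structures: each thread allocates ("first touches") its own subdomain inside the parallel region so that, on a NUMA machine, the memory pages land near the thread that uses them.

    // One-parallel-region OpenMP model: each thread creates and works on its
    // own subdomain, mimicking an MPI-style decomposition in shared memory.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    struct Subdomain {                    // hypothetical stand-in for ParaGrid's subdomain
      int rank;
      std::vector<double> u;
      Subdomain(int r, int n) : rank(r), u(n, 0.0) {}   // allocation here = "first touch"
      double local_work() { double s = 0; for (double& x : u) { x += rank; s += x; } return s; }
    };

    int main() {
      int nthreads = omp_get_max_threads();
      std::vector<Subdomain*> S(nthreads, nullptr);
      double global_sum = 0.0;

      #pragma omp parallel reduction(+ : global_sum)
      {
        int myrank = omp_get_thread_num();
        S[myrank] = new Subdomain(myrank, 100000);   // touched first by its owner thread
        global_sum += S[myrank]->local_work();       // purely local computation
      }

      std::printf("sum = %g\n", global_sum);
      for (Subdomain* s : S) delete s;
      return 0;
    }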

  14. Parallel FE/FVM with MPI
  • MPI is a system of functions for parallel distributed memory programming
  • Parallel processes communicate by sending and receiving messages
  • Domain decomposition data distribution approach
  • Usually only 6 or 7 functions are used (see the dot-product sketch below):
    - MPI_Allreduce: in computing dot products
    - MPI_Isend and MPI_Recv: in computing matrix-vector products
    - MPI_Barrier: many uses
    - MPI_Bcast: to broadcast sequential input
    - MPI_Comm_rank, MPI_Comm_size
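For the dot-product case, a minimal sketch (mine, not ParaGrid's code) of how MPI_Allreduce combines per-subdomain partial sums; the vector sizes are arbitrary.

    // Distributed dot product: each rank sums its local entries, then
    // MPI_Allreduce combines the partial sums so every rank gets the result.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Illustrative local pieces of two distributed vectors
      std::vector<double> x(100000, 1.0), y(100000, 2.0);

      double local = 0.0, global = 0.0;
      for (std::size_t i = 0; i < x.size(); ++i) local += x[i] * y[i];
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0) std::printf("(x,y) = %g on %d ranks\n", global, size);
      MPI_Finalize();
      return 0;
    }

The matrix-vector product follows the same pattern, except that interface values are exchanged between neighboring subdomains with MPI_Isend/MPI_Recv before the local multiply.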

  15. Mixed implementations
  • MPI & pthreads/OpenMP in a cluster environment (see the hybrid sketch below)
  • Example: parallel CG on a problem of size 314,163, run on a commodity-based cluster (4 nodes, each with 2 Pentium III processors at 1 GHz, connected by a 100 Mbit or 1 Gbit network)
  • Table 1: MPI implementation scalability over the two networks
  • Table 2: MPI implementation scalability vs. the mixed implementation (pthreads on the dual processors)
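To make the mixed model concrete, here is a rough sketch (not the original code) of the hybrid layout: one MPI rank per node, with the on-node parallelism expressed here with OpenMP rather than raw pthreads, and only the main thread calling MPI. It assumes an MPI implementation that supports the MPI_THREAD_FUNNELED level.

    // Hybrid sketch: MPI between nodes, OpenMP threads inside each node.
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      std::vector<double> x(1 << 20, 1.0);            // this rank's piece of a distributed vector
      double local = 0.0;

      #pragma omp parallel for reduction(+ : local)   // on-node threads share the loop
      for (long i = 0; i < (long)x.size(); ++i)
        local += x[i] * x[i];

      double global = 0.0;                            // only the main thread calls MPI
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0) std::printf("||x||^2 = %g\n", global);

      MPI_Finalize();
      return 0;
    }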

  16. ParaGrid code organization

  17. ParaGrid data structures
  • Connections between the different subdomains are kept in terms of packets (a sketch of such a packet follows)
    - a vertex packet is the set of all vertices shared by the same group of subdomains
  • The subdomains sharing a packet have:
    - their own copy of the packet
    - "pointers" to the packet copies in the other subdomains
    - exactly one subdomain is the owner of the packet
  • Similar packets exist for edges and faces, used in:
    - refinement
    - problems with degrees of freedom on edges or faces
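A rough C++ sketch (my illustration, not ParaGrid's actual classes) of what a vertex packet and its cross-subdomain links could look like; all field names are hypothetical.

    // Hypothetical vertex packet: the vertices one subdomain shares with a
    // fixed group of neighbors, plus links to the neighbors' copies.
    #include <vector>

    struct PacketRef {
      int subdomain;            // rank of a neighboring subdomain
      int remote_packet_id;     // "pointer" to that subdomain's copy of the packet
    };

    struct VertexPacket {
      std::vector<int> sharing_subdomains;  // all subdomains that share these vertices
      int owner;                            // exactly one of them owns the packet
      std::vector<int> local_vertex_ids;    // the shared vertices, in local numbering
      std::vector<PacketRef> remote_copies; // where the other copies live
    };

    struct Subdomain {
      int rank;
      std::vector<VertexPacket> vertex_packets;   // edge/face packets would look analogous
    };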

  18. Applications
  • Generation of large, sparse linear systems of equations on massively parallel computers
    - generated on the fly, with no need to store large meshes or linear systems
    - distributed among the processing nodes
    - used at LLNL to generate test problems for the HYPRE project (scalable software for solving such problems)
  • Various FE/FVM discretizations (used at TAMU and LLNL) with applications to:
    - heat and mass transfer
    - linear elasticity
    - flow and transport in porous media

  19. Applications
  • A posteriori error control and AMR (at TAMU and BNL)
    - accuracy and reliability of a computational method
    - efficient use of the available computational resources
  • Studies in domain decomposition and multigrid preconditioners (at LLNL and TAMU)
  • Studies in domain decomposition on non-matching grids (at LLNL and TAMU)
    - interior penalty discontinuous approximations
    - mortar finite element approximations
  • Visualization (at LLNL, TAMU, and BNL)
  • Benchmarking hardware (at BNL)
    - CPU performance
    - network traffic, etc.

  20. Visualization
  • Importance
  • Integration of ParaGrid with visualization (not compiled together):
    - save the mesh & solution in files for later visualization, or
    - send the mesh & solution directly through sockets for visualization
  • GLVis
    - portable, based on OpenGL (also compiles against Mesa)
    - visualizes simple geometric primitives (vertices, lines, and polygons)
    - can be used as a "server": waits for data to be visualized and forks after every data set received (see the server sketch below)
    - combines the parallel input (from ParaGrid) into a sequential visualization
  • VTK based viewer
    - added to support volume visualization
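A stripped-down sketch (my own, with a made-up port and no real message framing) of the "server" pattern described above: listen on a socket and fork a child for each incoming data set, so the parent can keep accepting new connections from the parallel senders.

    // Fork-per-dataset socket server sketch: the parent only accepts
    // connections; each child reads one data set and would hand it to the
    // visualization code.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <csignal>
    #include <cstdio>

    int main() {
      int listener = socket(AF_INET, SOCK_STREAM, 0);
      sockaddr_in addr{};
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(19916);                 // illustrative port number
      if (bind(listener, (sockaddr*)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
      listen(listener, 8);
      signal(SIGCHLD, SIG_IGN);                     // finished children are reaped automatically

      for (;;) {
        int conn = accept(listener, nullptr, nullptr);
        if (conn < 0) continue;
        if (fork() == 0) {                          // child: handle this data set
          char buf[4096];
          ssize_t n;
          while ((n = read(conn, buf, sizeof(buf))) > 0)
            ;                                       // ...parse mesh & solution, then visualize...
          close(conn);
          _exit(0);
        }
        close(conn);                                // parent: keep accepting
      }
    }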

  21. Visualization (diagram: GLVis code structure and features; abstract base classes with derived classes for 2D/3D scalar and 2D/3D vector data visualization)

  22. Extensions and future work
  • Extend and use the developed technology together with other existing HPC tools
    - legacy FE/FVM (or user specific) software
    - interfaces to external solvers (including direct solvers) and preconditioners, etc.
  • Extend the use to various applications
    - electromagnetics
    - elasticity, etc.
  • Tune the code to particular architectures
    - benchmarking and optimization
    - commodity-based clusters

  23. Extensions and future work
  • Further develop methods and tools for adaptive error control and mesh refinement
    - time dependent and non-linear problems
    - better study of the constants involved in the estimates
  • Visualization
    - user specific
    - the GPU as a coprocessor?
  • Create user-friendly interfaces

  24. Conclusions
  • A step toward developing a computational infrastructure for parallel HPC
  • Domain decomposition framework
    - a fundamental concept/technique for parallel computing with a wide area of applications
    - needed for parallel HPC research in numerical PDEs
  • Benefit to computational researchers
    - they require efficient techniques to solve linear systems with millions of unknowns
  • Finding a "good" mesh is essential for developing an efficient computational methodology based on FE/FVM
