Presentation Transcript


    1. Parallel Computing 2: Parallel Programming Styles and Hybrids

    2. Objectives Discuss the major classes of parallel programming models; hybrid programming; examples

    3. Introduction Parallel computer architectures have evolved, and so have the programming styles needed to use these architectures effectively. Two styles have become de facto standards: the MPI library for message passing and OpenMP compiler directives for multithreading. Both are widely used and are available on virtually every parallel system. With current parallel systems it often makes more sense to mix multithreading and message passing to maximize performance

    4. HPC Architectures Based on memory distribution. Shared: all processors have equal access to one or more banks of memory (Cray Y-MP, SGI Challenge, dual- and quad-processor workstations). Distributed: each processor has its own memory, which may or may not be visible to other processors (IBM SP2 and clusters of uniprocessor machines)

    5. Distributed shared memory: NUMA (non-uniform memory access) systems such as the SGI Origin 3000 and HP Superdome, and clusters of SMPs (shared-memory systems) such as the IBM SP and Beowulf clusters

    6. Parallel Programming Styles Explicit threading: not commonly used on distributed systems; uses locks, semaphores, and mutexes; synchronization and parallelization are handled by the programmer; the usual example is POSIX threads (the pthreads library), as sketched below
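    A minimal pthreads sketch of this style (the thread count, the worker function, and the partial-sum computation are illustrative choices, not taken from the slides). The programmer creates the threads, divides the work, and protects the shared result with a mutex:

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4

        static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
        static double total = 0.0;

        /* Each thread computes a partial sum, then merges it under a mutex. */
        static void *worker(void *arg)
        {
            int id = *(int *)arg;
            double partial = 0.0;
            for (int i = id; i < 1000; i += NTHREADS)
                partial += 0.5 * i;

            pthread_mutex_lock(&sum_lock);   /* programmer-managed synchronization */
            total += partial;
            pthread_mutex_unlock(&sum_lock);
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NTHREADS];
            int ids[NTHREADS];

            for (int t = 0; t < NTHREADS; t++) {
                ids[t] = t;
                pthread_create(&tid[t], NULL, worker, &ids[t]);
            }
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);

            printf("total = %f\n", total);
            return 0;
        }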

    7. Message passing interface (MPI) The application consists of several processes that communicate by passing data to one another (send/receive, broadcast/gather). Synchronization is still required of the programmer, but locking is not, since nothing is shared. A common approach is domain decomposition, where each task is assigned a subdomain and communicates its edge values to neighbouring subdomains, as in the sketch below
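    A minimal sketch of that decomposition, assuming a 1-D split of the domain across ranks (the array size, neighbour logic, and dummy data are illustrative). Each process exchanges only its edge values with its neighbours:

        #include <mpi.h>
        #include <stdio.h>

        #define NLOCAL 100   /* interior points owned by each rank (illustrative) */

        int main(int argc, char *argv[])
        {
            int rank, size;
            double u[NLOCAL + 2];   /* local subdomain plus one ghost cell per side */

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            for (int i = 0; i < NLOCAL + 2; i++)
                u[i] = rank;                       /* dummy data */

            int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
            int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

            /* Exchange edge values with the neighbouring subdomains. */
            MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                         &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                         &u[0],          1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            printf("rank %d: ghost values %f %f\n", rank, u[0], u[NLOCAL + 1]);
            MPI_Finalize();
            return 0;
        }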

    8. Compiler directives (OpenMP) Special comments are added to serial programs in parallelizable regions, which requires a compiler that understands these directives. Locking and synchronization are handled by the compiler unless overridden by directives (implicit and explicit). Decomposition is done primarily by the programmer. Scalability is more limited than that of MPI applications, because the programmer has less control over how the code is parallelized (see the sketch below)
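    A minimal sketch of the directive style (the loop body and array size are illustrative). The pragma is the only parallel construct; with an OpenMP compiler the iterations are divided among threads, otherwise the same source builds as a serial program:

        #include <stdio.h>

        #define N 1000000

        static double a[N], b[N];

        int main(void)
        {
            double sum = 0.0;

            /* Compiled with OpenMP support (e.g. -fopenmp), the directive splits
               the iterations across threads and the reduction clause combines the
               partial sums; without it, the pragma is ignored. */
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < N; i++) {
                a[i] = 0.5 * i;
                b[i] = 2.0 * i;
                sum += a[i] * b[i];
            }

            printf("dot product = %e\n", sum);
            return 0;
        }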

    9. Hybrid A mixture of MPI and OpenMP, used on distributed shared-memory systems. Applications usually consist of computationally expensive loops punctuated by calls to MPI; in many cases these loops can be further parallelized by adding OpenMP directives. Not a solution for all parallel programs, but quite suitable for certain algorithms

    10. Why Hybrid? Performance considerations. Scalability: for a fixed problem size, hybrid code will scale to higher processor counts before being overwhelmed by communication overhead; the Laplace equation is a good example. It may not be effective where performance is limited by the speed of the interconnect rather than by the processors

    11. Computer architecture Some architectural limitations force the use of hybrid computing (e.g. limits on the number of MPI processes per node or cluster block). Some algorithms, notably FFTs, run better on machines where the local memory bandwidth is much greater than that of the network, due to the O(N) behaviour of the bandwidth required. With a hybrid approach the number of MPI processes can be lowered while retaining the same total number of processors

    12. Algorithms Some applications, such as computational fluid dynamics codes, benefit greatly from a hybrid approach. The solution space is separated into interconnecting zones: the interaction between zones is handled by MPI, while the fine-grained computations inside a zone are handled by OpenMP

    13. Considerations on MPI, OpenMP, and Hybrid Styles General considerations: Amdahl's law. Amdahl's law states that the speedup obtainable through parallelization is limited by the portion of the code that cannot be parallelized. In a hybrid program, if the fraction of each MPI process's work that is parallelized by OpenMP is not high, then the overall speedup is limited
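    In its standard form (stated here for reference; the notation is ours), Amdahl's law gives the speedup on N processors as S(N) = 1 / ((1 - p) + p/N), where p is the fraction of the work that can be parallelized; as N grows, S(N) approaches 1/(1 - p). For example, if only 90% of the work in each MPI process is threaded with OpenMP (p = 0.9), adding threads can never speed that process up by more than a factor of 10.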

    14. Communication patterns How do the program's communication needs match the underlying hardware? A hybrid approach might increase performance where pure MPI code leads to rapid growth in communication traffic

    15. Machine balance How do the memory, CPUs, and interconnect affect the performance of a program? If the processors are fast, communication might become the bottleneck; cache behaviour may also differ among the machines in a cluster (for example, older machines in a Beowulf cluster)

    16. Memory access patterns Cache memory (primary, secondary, tertiary) has to be used effectively in order to achieve good performance on clusters

    17. Advantages and Disadvantages of OpenMP Advantages Comparatively easy to implement; in particular, it is easy to retrofit an existing serial code for parallel execution. The same source code can be used for both parallel and serial versions. More natural for shared-memory architectures. Dynamic scheduling makes load balancing easier than with MPI. Useful for both fine- and coarse-grained problems

    18. Disadvantages Can only run on shared-memory systems, which limits the number of processors that can be used. Data placement and locality may become serious issues, especially on SGI NUMA architectures, where the cost of remote memory access may be high. Thread-creation overhead can be significant unless enough work is performed in each parallel loop. Implementing coarse-grained solutions in OpenMP is usually about as involved as constructing the analogous MPI application. Explicit synchronization is required

    19. General characteristics Most effective for problems with fine-grained (i.e. loop-level) parallelism; can also be used for coarse-grained parallelism. Overall intra-node memory bandwidth may limit the number of processors that can be used effectively. Each thread sees the same global memory but has its own private memory. Messaging is implicit, giving a higher level of abstraction than MPI

    20. Advantages and Disadvantages of MPI Advantages Any parallel algorithm can be expressed in terms of the MPI paradigm. Runs on both distributed- and shared-memory systems, and performance is generally good in either environment. Allows explicit control over communication, which can give high efficiency by overlapping communication and computation. Allows for static task handling. Data placement problems are rarely observed. For suitable problems MPI scales well to very large numbers of processors. MPI is portable, and current implementations are efficient and optimized

    21. Disadvantages Application development is difficult: retrofitting existing serial code with MPI is often a major undertaking, requiring extensive restructuring of the serial code. It is less useful for fine-grained problems, where communication costs may dominate. For all-to-all operations, the effective number of point-to-point interactions increases as the square of the number of processors, so communication costs rise rapidly. Dynamic load balancing is difficult to implement. Different vendors' implementations of the MPI library vary: some may not implement all the calls, while others offer extensions

    22. General characteristics MPI is most effective for problems with coarse-grained parallelism, for which the problem decomposes into quasi-independent pieces and communication needs are minimized

    23. The Best of Both Worlds Use hybrid programming when the code exhibits limited scaling with MPI, the code could make use of dynamic load balancing, the code exhibits fine-grained parallelism or a combination of fine- and coarse-grained parallelism, or the application makes use of replicated data

    24. Problems When Mixing Modes Environment variables may not be passed correctly to the remote MPI processes. This has negative implications for hybrid jobs, because each MPI process needs to read the MP_SET_NUMTHREADS environment variable in order to start the proper number of OpenMP threads. The problem can be solved by always setting the number of OpenMP threads within the code, as sketched below
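    A minimal sketch of that fix (the thread count of 4 is an arbitrary illustration): each MPI process sets its own OpenMP thread count by calling the library routine instead of relying on an environment variable:

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char *argv[])
        {
            int rank;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Do not rely on MP_SET_NUMTHREADS (or OMP_NUM_THREADS) reaching
               the remote processes; fix the thread count in the code. */
            omp_set_num_threads(4);

            #pragma omp parallel
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());

            MPI_Finalize();
            return 0;
        }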

    25. Calling MPI communication functions within OpenMP parallel regions Hybrid programming works by having OpenMP threads spawned from MPI processes; it does not work the other way around, and attempting it will result in a runtime error. Keep MPI calls outside the threaded regions, as in the sketch below
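    A compact sketch of the supported structure (the buffer names and the choice of MPI_Allreduce are illustrative): the MPI process does all of its communication outside the OpenMP region, and only the computation is threaded:

        #include <mpi.h>
        #include <stdio.h>

        #define N 1000

        int main(int argc, char *argv[])
        {
            int rank;
            double local[N], global_max = 0.0, my_max = 1.0;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* MPI communication happens here, outside any parallel region. */
            MPI_Allreduce(&my_max, &global_max, 1, MPI_DOUBLE, MPI_MAX,
                          MPI_COMM_WORLD);

            /* Only the computation is threaded; no MPI calls appear inside
               the parallel region. */
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                local[i] = global_max * i;

            if (rank == 0)
                printf("max = %f, last = %f\n", global_max, local[N - 1]);
            MPI_Finalize();
            return 0;
        }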

    26. Laplace Example Outline: serial, MPI, OpenMP, hybrid

    27. Outline The Laplace equation in two dimensions is also known as the potential equation and is usually one of the first PDEs (partial differential equations) encountered: c(∂²u/∂x² + ∂²u/∂y²) = 0. It is the governing equation for electrostatics, heat diffusion, and fluid flow. Adding a source term f(x,y) gives Poisson's equation, adding a first derivative in time gives the diffusion equation, and adding a second derivative in time gives the wave equation. A numerical solution to this PDE can be computed using a finite-difference approach

    28. Using an iterative method to solve the equation, we get the following update (here n is the iteration number, not an exponent):
        du_{i,j}^{n+1} = (u_{i-1,j}^n + u_{i+1,j}^n + u_{i,j-1}^n + u_{i,j+1}^n) / 4 - u_{i,j}^n
        u_{i,j}^{n+1} = u_{i,j}^n + du_{i,j}^{n+1}

    29. Serial A cache-friendly approach incrementally computes the du values, compares each with the current maximum, and then updates all u values. This can usually be done without additional memory operations, which is good for clusters. See code (a sketch follows)
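    Since the original code is not reproduced in the transcript, here is a minimal serial sketch of the update from slide 28 (the grid size, boundary values, and tolerance are illustrative):

        #include <math.h>
        #include <stdio.h>

        #define NX 100
        #define NY 100

        static double u[NX + 2][NY + 2], unew[NX + 2][NY + 2];

        int main(void)
        {
            for (int j = 0; j <= NY + 1; j++)
                u[0][j] = 1.0;               /* fixed boundary on one edge */

            double dumax = 1.0;
            int iter = 0;
            while (dumax > 1.0e-4 && iter < 10000) {
                dumax = 0.0;
                /* sweep rows in memory order (cache friendly), computing du
                   on the fly and tracking the current maximum */
                for (int i = 1; i <= NX; i++)
                    for (int j = 1; j <= NY; j++) {
                        double du = 0.25 * (u[i-1][j] + u[i+1][j]
                                          + u[i][j-1] + u[i][j+1]) - u[i][j];
                        if (fabs(du) > dumax) dumax = fabs(du);
                        unew[i][j] = u[i][j] + du;
                    }
                /* then update all u values */
                for (int i = 1; i <= NX; i++)
                    for (int j = 1; j <= NY; j++)
                        u[i][j] = unew[i][j];
                iter++;
            }
            printf("dumax = %e after %d iterations\n", dumax, iter);
            return 0;
        }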

    30. MPI Currently the most widely used method for distributed-memory systems. Note that processes are not the same as processors: an MPI process can be thought of as a thread, and multiple such processes can run on a single processor; the system is responsible for mapping the MPI processes to physical processors. Each process is an exact copy of the program, except that each copy has its own unique id

    31. Hello World
        PROGRAM hello
        INCLUDE 'mpif.h'
        INTEGER ierror, rank, size
        CALL MPI_INIT(ierror)
        CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
        IF (rank .EQ. 2) PRINT *, 'P:', rank, 'Hello World'
        PRINT *, 'I have rank', rank, 'out of', size
        CALL MPI_FINALIZE(ierror)
        END

        #include <mpi.h>
        #include <stdio.h>
        int main(int argc, char *argv[])
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            if (rank == 2) printf("P:%d Hello World\n", rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            printf("I am %d out of %d.\n", rank, size);
            MPI_Finalize();
            return 0;
        }

    32. OpenMP OpenMP is a tool for writing multi-threaded applications in a shared-memory environment. It consists of a set of compiler directives and library routines; the compiler generates multi-threaded code based on the specified directives. OpenMP is essentially a standardization of the last 18 years or so of SMP (symmetric multi-processor) development and practice. See code

    33. Hybrid Remember that you are running both MPI and OpenMP, so compile and link with both, e.g. f90 -O3 -mp file.f90 -lmpi. See code (a sketch follows)
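    Since the code itself is not reproduced in the transcript, here is a hedged sketch of what a hybrid Laplace kernel can look like (1-D decomposition over rows; the sizes, iteration count, and names are illustrative): MPI exchanges the ghost rows between subdomains, and OpenMP threads share the local sweeps:

        #include <mpi.h>
        #include <stdio.h>

        #define NLOC 100   /* rows per MPI process (illustrative) */
        #define NY   100

        static double u[NLOC + 2][NY + 2], unew[NLOC + 2][NY + 2];

        int main(int argc, char *argv[])
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
            int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

            for (int iter = 0; iter < 1000; iter++) {
                /* coarse grain: MPI exchanges ghost rows between subdomains */
                MPI_Sendrecv(u[1],        NY + 2, MPI_DOUBLE, up,   0,
                             u[NLOC + 1], NY + 2, MPI_DOUBLE, down, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Sendrecv(u[NLOC],     NY + 2, MPI_DOUBLE, down, 1,
                             u[0],        NY + 2, MPI_DOUBLE, up,   1,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

                /* fine grain: OpenMP threads share the local sweep */
                #pragma omp parallel for
                for (int i = 1; i <= NLOC; i++)
                    for (int j = 1; j <= NY; j++)
                        unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                           + u[i][j-1] + u[i][j+1]);

                #pragma omp parallel for
                for (int i = 1; i <= NLOC; i++)
                    for (int j = 1; j <= NY; j++)
                        u[i][j] = unew[i][j];
            }

            if (rank == 0) printf("done after 1000 iterations\n");
            MPI_Finalize();
            return 0;
        }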
