Parallel Computing Explained
How to Parallelize a Code
Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
  2.1 Automatic Compiler Parallelism
  2.2 Data Parallelism by Hand
  2.3 Mixing Automatic and Hand Parallelism
  2.4 Task Parallelism
  2.5 Parallelism Issues
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

How to Parallelize a Code
• This chapter describes how to turn a single-processor program into a parallel one, focusing on shared memory machines.
• Both automatic compiler parallelization and parallelization by hand are covered.
• The details for accomplishing both data parallelism and task parallelism are presented.

Automatic Compiler Parallelism
• With automatic compiler parallelism, you specify a single compiler option and let the compiler do the work.
• The advantage is that it is easy to use.
• The disadvantages are:
  • The compiler does only loop-level parallelism, not task parallelism.
  • The compiler tries to parallelize every do loop in your code. If you have hundreds of do loops, this creates too much parallel overhead.

Automatic Compiler Parallelism
• To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following:
    ifort -parallel -O2 ... prog.f
• The compiler creates conditional code that will run with any number of threads.
• Specify the number of threads with setenv, then run the program and make sure you still get the right answers:
    setenv OMP_NUM_THREADS 4
    a.out > results

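What "loop-level parallelism" means for an automatic parallelizer can be illustrated with a small sketch. It is written in C rather than Fortran purely for illustration, and the function names scale_copy and running_sum are hypothetical. Whether a particular compiler actually parallelizes the first loop depends on its own analysis; the sketch only shows the difference between a loop that is a candidate and one that is not.

```c
#include <stddef.h>

/* Independent iterations: a[i] depends only on b[i], so the iterations can
   run in any order. `restrict` promises the compiler that the two arrays
   do not overlap. */
void scale_copy(size_t n, const double *restrict b, double *restrict a)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0 * b[i];
    }
}

/* Loop-carried dependence: a[i] needs the a[i-1] produced by the previous
   iteration, so the iterations cannot simply be divided among threads. */
void running_sum(size_t n, double *a)
{
    for (size_t i = 1; i < n; i++) {
        a[i] += a[i - 1];
    }
}
```

A loop like the first is the typical candidate for automatic parallelization; a loop like the second is usually left serial unless it is restructured.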
Data Parallelism by Hand
• First identify the loops that use most of the CPU time (the Profiling lecture describes how to do this).
• By hand, insert into the code OpenMP directive(s) just before the loop(s) you want to make parallel.
• Some code modifications may be needed to remove data dependencies and other inhibitors of parallelism.
• Use your knowledge of the code and data to assist the compiler.
• For the SGI Origin2000 computer, insert into the code an OpenMP directive just before the loop that you want to make parallel:
    !$OMP PARALLEL DO
    do i=1,n
      … lots of computation ...
    end do
    !$OMP END PARALLEL DO

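One common data dependency that has to be dealt with by hand is a shared accumulator. The sketch below is a C analogue of the Fortran directive above (OpenMP uses `#pragma omp parallel for` in C where Fortran uses `!$OMP PARALLEL DO`). The function name sum_of_squares is hypothetical, and the reduction clause is one standard OpenMP way to handle this case, not necessarily the modification the original course had in mind.

```c
/* Sum of i*i for i = 1..n, computed with an OpenMP parallel loop. */
double sum_of_squares(int n)
{
    double total = 0.0;
    int i;

    /* Without the reduction clause, every thread would update `total` at
       the same time, which is a data race. With it, each thread keeps a
       private partial sum and OpenMP combines them when the loop ends. */
    #pragma omp parallel for reduction(+:total)
    for (i = 1; i <= n; i++) {
        total += (double)i * (double)i;
    }
    return total;
}
```

If the file is compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result, which makes it easy to compare the parallel answer against the serial one.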
Data Parallelism by Hand
• Compile with the -mp compiler option:
    f90 -mp ... prog.f
• As before, the compiler generates conditional code that will run with any number of threads.
• If you want to rerun your program with a different number of threads, you do not need to recompile; just re-specify the setenv command:
    setenv OMP_NUM_THREADS 8
    a.out > results2
• The setenv command can be placed anywhere before the a.out command.
• The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type setenv by itself:
    setenv
• It produces a listing of your environment variable settings.

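Because a mistyped variable name is silently ignored, it can also help to have the program itself report how many threads it will use. The following is a minimal sketch in C; the helper names max_threads and report_threads are hypothetical, while omp_get_max_threads is a standard OpenMP runtime routine.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on the number of threads the next parallel region may use.
   Falls back to 1 when the program was compiled without OpenMP support. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Print the thread count so that a mistyped setenv line shows up in the
   program output instead of going unnoticed. */
void report_threads(void)
{
    printf("running with up to %d thread(s)\n", max_threads());
}
```

Calling report_threads() once at program start and comparing the printed value with the intended OMP_NUM_THREADS setting catches the typo case described above.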
Mixing Automatic and Hand Parallelism
• You can have one source file parallelized automatically by the compiler, and another source file parallelized by hand. Suppose you split your code into two files named prog1.f and prog2.f:
    f90 -c -apo … prog1.f    (automatic parallelism for prog1.f)
    f90 -c -mp … prog2.f     (parallelism by hand for prog2.f)
    f90 prog1.o prog2.o      (creates one executable)
    a.out > results          (runs the executable)

Task Parallelism
• You can accomplish task parallelism as follows:
    !$OMP PARALLEL
    !$OMP SECTIONS
      … lots of computation in part A …
    !$OMP SECTION
      … lots of computation in part B ...
    !$OMP SECTION
      … lots of computation in part C ...
    !$OMP END SECTIONS
    !$OMP END PARALLEL
• Compile with the -mp compiler option:
    f90 -mp … prog.f
• Use the setenv command to specify the number of threads:
    setenv OMP_NUM_THREADS 3
    a.out > results

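For comparison, the same sections structure can be sketched in C. This is only an illustration: part_a, part_b, part_c, and run_parts are hypothetical stand-ins for the three independent pieces of work, and each one writes to its own array element so that the sections do not share any data.

```c
/* Hypothetical stand-ins for the three independent pieces of work. */
static int part_a(void) { return 1; }
static int part_b(void) { return 2; }
static int part_c(void) { return 3; }

/* Run parts A, B, and C as separate OpenMP sections. Each section writes
   to its own element of `out`, so the sections do not interfere. */
void run_parts(int out[3])
{
    #pragma omp parallel sections
    {
        #pragma omp section
        { out[0] = part_a(); }

        #pragma omp section
        { out[1] = part_b(); }

        #pragma omp section
        { out[2] = part_c(); }
    }
}
```

With three sections, at most three threads have work to do, which is why the thread count is set to 3 in the example above.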
Parallelism Issues
• There are some issues to consider when parallelizing a program.
  • Should data parallelism or task parallelism be used?
  • Should automatic compiler parallelism or parallelism by hand be used?
  • Which loop in a nested loop situation should be the one that becomes parallel?
  • How many threads should be used?

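The nested-loop question can be made concrete with a small sketch. It shows one common choice, placing the directive on the outer loop, under the assumption that the outer iterations are independent and that there are enough of them to keep all threads busy. The function name scale_grid and the row-major layout are hypothetical choices for illustration.

```c
/* Multiply every element of a rows x cols grid (stored row-major) by
   `factor`. */
void scale_grid(int rows, int cols, double *grid, double factor)
{
    int i;

    /* The directive is on the outer loop, so the parallel region is
       entered once and each thread handles whole rows. */
    #pragma omp parallel for
    for (i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {   /* j is private to each thread */
            grid[i * cols + j] *= factor;
        }
    }
}
```

Putting the directive on the inner loop instead would enter and leave a parallel region once per row, which multiplies the parallel overhead by the number of rows. When the outer loop has very few iterations or carries a dependence, the inner loop may still be the better candidate.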