Introduction to OpenMP Part I

Introduction to OpenMP Part I • White Rose Grid Computing Training Series • Deniz Savas, Alan Real, Mike Griffiths • RTP Module • 2007-2010

Historical Perspective
• Single-processor CPUs
• Pipelined processors (specialised pipes)
• Vector processors (SIMD)
• Multi-processors with distributed memory (MIMD)
• Multi-processors with shared memory (SMP, meaning Symmetric Multi-Processing)

Single Instruction Single Data Model
Diagram: a program counter points into the code segment (instructions), which operates on the data segment (data).
The program counter steps through the instructions in sequence unless a jump instruction changes its position. Each instruction fetches at most two items from the data segment.

Pipelining and Specialised Units
Diagram: the CPU unit is divided into a floating point unit, an integer/logical unit, an instruction handling unit and a memory management unit.
CPU functions are distributed to specialised units, all of which can operate in parallel with each other. The finest-grained optimisation is concerned with this level and is normally provided by the compiler. The memory management unit is related to cache control issues. Almost all CPUs are built this way now.

Vector Processors as Single Instruction Multiple Data Engines
Diagram: the scalar unit(s) issue a single instruction to the vector processor units, which combine two vectors of data values under a mask vector of 0s and 1s to give a vector of results.
The scalar unit passes the single vector instruction to the vector processor, along with the address range of the vector of values to operate on and the address where the results are to be stored. The vector unit performs the operations independently of the scalar unit and signals the scalar unit when it finishes. The scalar unit can perform other scalar instructions in parallel with the vector unit. Examples: Cray, Cyber.

Multi-Processors with Distributed Memory
Diagram: CPU1, CPU2 and CPU3, each with its own memory (Memory1, Memory2, Memory3) and disk (Disk1, Disk2, Disk3), linked by MPI.
Message passing between CPUs is the most suitable form of communication for these architectures. Data is local to each CPU. Examples: processor farms, Intel Hypercube, Transputers.

Shared Memory Multi-Processors: ‘Symmetric Multi-Processing Machines’
Diagram: CPU1, CPU2, CPU3, CPU4, CPU5 ... CPU.n all connected to a single shared memory.
This is the most suitable hardware configuration for using OpenMP. Examples: Sun, SGI.

Why use OpenMP
• OpenMP is suitable for shared-memory, multi-processor architectures. The key feature is that every processor can access the same memory (i.e. a single address space).
• Increase speed by utilising all the available CPUs (reduce wall-clock time).
• Improve the portability of parallel programs to other (hardware or software) platforms. Before OpenMP came along, each vendor (e.g. Sun, SGI, Cray) had its own non-standard in-line compiler directives, which rendered such programs non-portable to other platforms.
• Take advantage of the shared-memory hardware of multi-processor machines (Sun, SGI, Cray, and new Intel and AMD platforms).

OpenMP Philosophy
• A program can be made up of multiple threads. A thread is a lightweight process which may have its own data.
• Each thread runs under its own program control.
• Data can be either private to a thread or shared between the threads.
• Communication between the threads is achieved via the shared data (as they all have access to it).
• The master thread is responsible for coordinating the team of threads, as shown later by the Fork and Join Model diagram.
• Serial code is simply a process with a single thread.

Shared and Private Data
Diagram: THREAD 1, THREAD 2 ... THREAD n each hold their own private data, and all are connected to a common block of shared data.
Each thread has access to its own private data plus all the shared data. A private variable can only be seen (i.e. read or written) by its own thread, whereas all threads can read or write any shared data.

Parallel Regions
• An OMP program is a conventional program that contains parallel regions, defined via OMP PARALLEL constructs.
• The program begins as a single thread (the master thread). When a parallel region is reached, the master thread creates multiple threads, and each thread executes the statements contained within the parallel region on its own until the end of that region is reached. At the end of the parallel region, when all the threads have completed their tasks, the master thread continues executing the rest of the program until the end or until another parallel region is reached. This is known as the fork-join model.
• Branching into or out of a parallel region is not allowed (i.e. no GOTO statements, overlapping DO loops, etc.).

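Fork and Join Model
Diagram: a serial program runs as a single master thread from start to finish. A program using OpenMP instructions also starts as a master thread, FORKs into a team of threads at each parallel region, and JOINs back into the master thread at the end of that region.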

Program listing, with a FORK into a team of threads at the start of each parallel region and a JOIN back to the master thread at its end:
FORTRAN:
      PROGRAM xyz
      :
!$OMP PARALLEL
      :
!$OMP END PARALLEL
      :
!$OMP PARALLEL
      :
!$OMP END PARALLEL
      END
C/C++:
   main( )
   {
      :
   #pragma omp parallel
      {
         :
      }
      :
   #pragma omp parallel
      {
         :
      }
   }

Three components of OpenMP Programming
• OMP Directives: these form the major elements of OpenMP programming. They create threads, share the work amongst threads and synchronise threads.
• Library Routines: these routines can be used to control and query the parallel execution environment, such as the number of processors that are available for use.
• Environment Variables: the execution environment, such as the number of threads to be made available to an OMP program, can also be set or queried at the operating system level before program execution starts (an alternative to calling library routines).

Directives and sentinels
A directive is a special line of source code with meaning only to certain compilers. A directive is distinguished by a sentinel (a special character string) at the start of the line. The OpenMP sentinels are:
• For Fortran: !$OMP (or C$OMP or *$OMP)
• For C/C++: #pragma omp
Compilers that do not have OpenMP features simply ignore these lines according to the rules of the language: a line starting with !$OMP is an ordinary Fortran comment, and an unrecognised #pragma line is skipped by a C/C++ compiler.

Conditional Compilation
As the previous slide indicates, compilers that do not support OpenMP will always ignore the OpenMP directives. In addition, two further methods exist to ensure that code written to support OpenMP can also be compiled by serial compilers with minimal changes to the source.
Method 1: Applies only to Fortran. Any line that starts with !$ will be visible (with !$ replaced by two spaces) to any compiler supporting OpenMP. Compilers that do not support OpenMP will simply see these lines as comments.
Method 2: Applies to C and C++ preprocessors and to Fortran preprocessors that support macro definitions. When using a compiler that supports OpenMP, the symbol _OPENMP is predefined. Conditional compilation statements of the form #ifdef _OPENMP can therefore be used to bracket the pieces of code that should be compiled only if OpenMP features are available.
Example:
   NUM_WORKERS = 1;
   #ifdef _OPENMP
   NUM_WORKERS = omp_get_num_threads( );
   #endif
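A slightly fuller version of Method 2 might look like the sketch below. The function name count_workers is hypothetical, and the query is placed inside a parallel region on the assumption that the team size is what is wanted, since omp_get_num_threads() returns 1 when called from serial code. The same source should build with or without OpenMP enabled.

```c
#ifdef _OPENMP
#include <omp.h>   /* only needed when the compiler has OpenMP enabled */
#endif

/* Hypothetical helper: returns the size of the thread team when built
   with OpenMP, or 1 when built by a serial compiler. */
int count_workers(void)
{
    int num_workers = 1;                 /* serial fallback */
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* one thread of the team records the team size */
        #pragma omp single
        num_workers = omp_get_num_threads();
    }
#endif
    return num_workers;
}
```

Calling count_workers() from a serial build always gives 1; an OpenMP build gives whatever team size the runtime chose.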

OMP PARALLEL directive
Defines a block (region) of code which will be executed in parallel by multiple threads.
Syntax:
FORTRAN:
   !$OMP PARALLEL … optional extra clauses
      block
   !$OMP END PARALLEL
C/C++:
   #pragma omp parallel … optional extra clauses
   { block }
A number of extra clauses control the relationship between the threads, such as the sharing of variables. Branching into or out of these blocks is not allowed. The actual number of threads to be used is controlled by:
• the OMP_NUM_THREADS environment variable (see environment variables),
• the NUM_THREADS clause, or
• the library calls (see later).

Program execution schematic
      PROGRAM myprog
      :
      CALL york()
!$OMP PARALLEL
      CALL leeds()
!$OMP END PARALLEL
      CALL shef()
!$OMP PARALLEL
      CALL hull()
!$OMP END PARALLEL
      CALL brad()
      :
      END
Diagram (along the execution-time axis): york runs once on the master thread, leeds runs on every thread of the team, shef runs once, hull runs on every thread of the team, and brad runs once.

Useful functions
To find out how many threads are being used:
Fortran:
   INCLUDE 'omp_lib.h'
   INTEGER FUNCTION OMP_GET_NUM_THREADS()
C/C++:
   #include <omp.h>
   int omp_get_num_threads(void);
Returns 1 if called outside a parallel region; otherwise returns the number of threads in the team executing the region.
To identify individual threads by number:
Fortran:
   INCLUDE 'omp_lib.h'
   INTEGER FUNCTION OMP_GET_THREAD_NUM()
C/C++:
   #include <omp.h>
   int omp_get_thread_num(void);
Returns a value between 0 and OMP_GET_NUM_THREADS() - 1.
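As an illustration of the two calls, the sketch below has every thread announce itself. The function name say_hello and the message text are assumptions made for this example only, and the serial branch simply mimics a team of one thread.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: each thread prints its number and the team size.
   Returns the team size as seen by thread 0. */
int say_hello(void)
{
    int team_size = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        int id = omp_get_thread_num();        /* 0 .. nthreads-1 */
        int nthreads = omp_get_num_threads(); /* size of this team */
        printf("Hello from thread %d of %d\n", id, nthreads);
        if (id == 0)
            team_size = nthreads;             /* only thread 0 writes */
    }
#else
    printf("Hello from thread 0 of 1\n");
#endif
    return team_size;
}
```

The order of the printed lines is not fixed, so repeated runs may list the threads differently.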

OpenMP on the Sheffield grid node ‘iceberg’
OpenMP support is built into the PGI Fortran 90 and C compilers on iceberg and its workers. Users are encouraged to do all their OpenMP-related work in an interactive or batch session, started with the qsh and qsub commands respectively. Programs that use OpenMP work as a team of ‘threads’. For maximum efficiency we should have as many threads as there are available processors on the system and assign each thread to a different processor. Some of iceberg’s workers have 6 processors each, and by using OpenMP we can take advantage of these nodes. As ‘iceberg’ can only provide a maximum of six processors, we shall resort to simulating larger numbers of processors by declaring and using more threads than the available processors. If the number of threads needed by an OpenMP job is greater than the number of processors available, the threads start sharing the processors, which diminishes efficiency.

Interactive shell for OpenMP on ‘iceberg’
OpenMP code development can be done interactively by starting an interactive, OpenMP-friendly shell. While logged onto iceberg, type:
   qsh -pe openmp 6
and work in the new shell. This makes a shell available on a multi-processor worker node to facilitate OpenMP code development. We also need to set an environment variable named OMP_NUM_THREADS to declare the number of threads that will be required. The value of this variable defines the number of threads an OMP job will create by default when it starts executing:
   export OMP_NUM_THREADS=6
If the number of threads requested this way is greater than the number of available processors, the threads will share the processors amongst themselves.

Interactive shell for OpenMP on ‘iceberg’
Setting the number of threads. If you are using the bash shell (the default shell on iceberg), you can define the number of threads that will be made available to a running OpenMP job by setting the OMP_NUM_THREADS environment variable as follows:
   export OMP_NUM_THREADS=nn
where nn is the number of threads you want to use (maximum 16), e.g.
   export OMP_NUM_THREADS=12
If you are using a C shell (csh, tcsh), the same can be done as follows:
   setenv OMP_NUM_THREADS nn

Compiling on the Sheffield Grid Node
OpenMP support is built into the Fortran 90 and C compilers on the iceberg cluster. To compile a program using OpenMP, specify the -mp flag for both Fortran and C programs.
Example:
   > pgf90 -mp prog.f90
or
   > pgcc -mp proc.c
Compiler optimisation will be raised to -O3 automatically. Any additional flags can be specified on the command line, e.g. -fast. If compiling and linking in separate stages, be sure to use identical compiler options for both.

Running the program
Run as you would a serial program:
   > ./a.out
(or ./progname, as specified by the -o progname compiler flag), e.g.
   > f90 program.f90 -o progname
   > ./progname
You can redefine the number of threads to use at any time by resetting the environment variable OMP_NUM_THREADS. There is no need to recompile the program after changing this environment variable.
In sh and bash:
   export OMP_NUM_THREADS=nn
In csh and tcsh:
   setenv OMP_NUM_THREADS nn
(where nn is the number of threads to use)

Batch execution of OpenMP programs
Batch queues allow exclusive access to all CPUs of a worker node to run OpenMP jobs. To submit a batch job use the qsub command:
   qsub -pe openmp <np> scriptfile
where:
• <np> is the number of processors to reserve.
• scriptfile contains the commands that will be executed by the batch job.
The environment variable OMP_NUM_THREADS should also be set inside the script. On iceberg, do not request more than 8 processors, as 8 is the maximum number of processors per machine that can be made available.

Example batch submission script
   #$ -pe openmp 6 -l h_rt=1:00:00
   #$ -cwd
   export OMP_NUM_THREADS=${NSLOTS}
   ./prog
The value of the NSLOTS environment variable is set to the number of processors allocated to the job, so it can be used to set OMP_NUM_THREADS for maximum efficiency. Options can be included on the command line or within the script with the #$ line prefix. The script above requests 1 hour of runtime. Avoid running on more processors than NSLOTS indicates.

Exercises: The OpenMP examples are contained in the directory /usr/local/courses/openmp/exercises. Copy them into your own directory. The aim is to compile and run a few trivial OpenMP programs. Read the readme.txt file in the exercises directory and follow the instructions to compile and run the first two programs, either in C or in Fortran. Vary the number of threads using the OMP_NUM_THREADS environment variable and run the code again. Run the code several times. Is the output consistent?

OMP PARALLEL Directive
Defines a block (region) of code which will be executed in parallel by multiple threads.
Syntax:
FORTRAN:
   !$OMP PARALLEL (optional clauses)
      block
   !$OMP END PARALLEL
C/C++:
   #pragma omp parallel (optional clauses)
   { block }

Optional OMP PARALLEL clauses
The clauses can be separated by commas or spaces in Fortran. In C/C++ they are conventionally separated by spaces, although the OpenMP specification also accepts commas there. The clauses can be one or more of the following:
• PRIVATE(var_list)
• FIRSTPRIVATE(var_list)
• SHARED(var_list)
• DEFAULT(PRIVATE|SHARED|NONE)
• REDUCTION({operator|intrinsic_func_name}:vars_list)
• COPYIN(var_list)
• IF(logical_expression)
• NUM_THREADS(number)
All of these clauses will be explained in later slides. Sometimes these clauses render the OMP PARALLEL directive too long to fit onto a single line, in which case the directive can be continued onto the following line(s).

Continuation lines for OMP directives
OMP directives can take a number of optional clauses, which may make the directive too long to fit onto a single line, so that it needs to be split over multiple lines as described below.
Fortran, free source form: the continued line must end with an &. The & after !$OMP on the continuation line(s) is optional.
   !$OMP PARALLEL DEFAULT(NONE), &
   !$OMP PRIVATE(i,myid), SHARED(a,n)
C/C++: the continued line must end with \
   #pragma omp parallel default(none) \
           private(i,myid) shared(a,n)

OMP PARALLEL - If clause
A parallel region can be made conditional by using the IF clause on the PARALLEL directive. This may be useful if, for example, there is not enough work to make parallel execution worthwhile, as in the example below, or if there are not enough threads available.
Syntax:
Fortran: !$OMP PARALLEL … IF (scalar_logical_expression)
C: #pragma omp parallel … if (scalar_expression)
Example: here the enclosed region will be executed serially if we have fewer than 100 tasks to do.
   #pragma omp parallel if( ntasks >= 100 )
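The effect of the IF clause can be observed with a small sketch such as the one below. The function name team_size_for is hypothetical, and the threshold of 100 is simply the one used in the example above. When the condition is false the region still executes, but with a team of a single thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: reports how many threads execute the region
   for a given amount of work. */
int team_size_for(int ntasks)
{
    int team = 1;
    #pragma omp parallel if (ntasks >= 100)
    {
#ifdef _OPENMP
        if (omp_get_thread_num() == 0)
            team = omp_get_num_threads();   /* 1 when the IF test fails */
#endif
    }
    return team;
}
```

With this sketch, team_size_for(10) gives 1, while team_size_for(1000) gives whatever team size the runtime provides.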

OMP PARALLEL - Num_threads clause
Syntax: OMP PARALLEL ... NUM_THREADS(scalar)
Normally, when a parallel region is entered, the number of threads to be started is determined by the last call to the OMP_SET_NUM_THREADS routine, if there was such a call. If there were no such calls, as is usually the case, the value of the environment variable OMP_NUM_THREADS determines the number of threads to be started. This clause, however, allows the user to dictate the number of threads to be started for the parallel region, overriding the previous two methods. Any number specified that is not sensible (such as a negative number) will be ignored. Also, any number exceeding the number of threads allowed by the operating system will be pegged to that limit. It is therefore sensible to test the actual number of threads being used by the parallel region via a call to OMP_GET_NUM_THREADS, and not to rely on the requested number of threads being made available. See Example-11.
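A minimal sketch of that advice, requesting a team size and then checking what was actually delivered, could read as follows. The function name request_threads is an assumption for illustration, and the argument is assumed to be a positive integer.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: asks for 'wanted' threads (assumed positive) and
   returns the number of threads the region really ran with. */
int request_threads(int wanted)
{
    int got = 1;
    #pragma omp parallel num_threads(wanted)
    {
#ifdef _OPENMP
        if (omp_get_thread_num() == 0)
            got = omp_get_num_threads();   /* may be fewer than requested */
#endif
    }
    return got;
}
```

The returned value should be used for any later work division, rather than the number that was asked for.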

Shared and private clauses
Within a parallel region, variables can be:
• shared: every thread sees the same copy, or
• private: each thread has its own copy.
The following optional clauses control the privacy attributes of the variables:
Syntax:
   OMP PARALLEL … SHARED(var_list)
   OMP PARALLEL … PRIVATE(var_list)
   OMP PARALLEL … DEFAULT(SHARED|PRIVATE|NONE)
The default is for all variables to be shared. DEFAULT(NONE) means that all variables must be explicitly declared as private or shared.

Shared and private examples
In this example each thread initialises its own column of a shared array named a().
   !$OMP PARALLEL DEFAULT(NONE), &
   !$OMP& PRIVATE(i,myid),SHARED(a,n)
      myid=omp_get_thread_num()+1
      do i=1,n
         a(i,myid)=1.0
      end do
   !$OMP END PARALLEL
• i is the local loop index and should be private.
• myid is the local thread number: private.
• a is the main array: shared.
• n is only read, so there is no need to store extra copies: shared (saves memory).
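A C rendering of the same idea is sketched below. Because C arrays are stored row by row, each thread fills its own row rather than a column. The function name init_rows, the flat array layout and the guard against having more threads than rows are assumptions added for this sketch and are not part of the Fortran example.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: a points to nrows*n doubles stored row by row.
   Thread number myid sets every element of row myid to 1.0. */
void init_rows(double *a, int nrows, int n)
{
    int i, myid;
    #pragma omp parallel default(none) private(i, myid) shared(a, n, nrows)
    {
#ifdef _OPENMP
        myid = omp_get_thread_num();   /* 0-based, so no +1 as in Fortran */
#else
        myid = 0;                      /* serial build: a single "thread" */
#endif
        if (myid < nrows) {            /* ignore threads beyond the last row */
            for (i = 0; i < n; i++)
                a[myid * n + i] = 1.0;
        }
    }
}
```

Only the rows whose index matches a running thread are touched, so the caller should size the array for the number of threads it expects.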

Which variables to share?
• Most variables are shared.
• Loop indices are private.
• Loop temporaries are private.
• Read-only variables are shared.
• Main arrays are shared, ‘with caution!’
• Write-before-read scalars are usually private.
Sometimes either is OK, but the choice may have performance implications.
Note: arrays can be private as well as scalars.

Initialising the private variables
There is no ambiguity about the values of shared variables at the start of a parallel region: they keep the values they had before the region started. However, as new copies of private variables (one per thread) come into existence on entry into a parallel region, no such assumption can be made about their values. By default, the values of private variables are undefined on entry into a parallel region unless they are declared in a FIRSTPRIVATE clause.
FIRSTPRIVATE clause: variables declared as firstprivate are private variables that are initialised with the values that existed immediately prior to the parallel region.
Syntax: OMP PARALLEL … FIRSTPRIVATE(list)

Firstprivate clause - example
In the example below, variable b is private to each thread. At the start of each thread it is initialised with its value in the master thread (i.e. 5.0). For safe programming, the value of b should be treated as undefined upon exit from the parallel region, although on most platforms b will then contain the value from thread 0's copy.
   b = 5.0;
   #pragma omp parallel firstprivate(b) \
           private(i,myid) shared(c)
   {
      myid = omp_get_thread_num();
      for (i = 0; i < n; i++) {
         b += c[myid][i];
      }
      c[myid][n] = b;
   }
   bnew = b;   /* this means the value of b for thread 0 */

THREADPRIVATE
This directive makes the declared variables and common blocks private to each thread but global within each thread.
   !$OMP THREADPRIVATE (list of variables and/or common-block names)
Common-block names are written between slashes, e.g. /blk/. Note that this declaration must be repeated in each subroutine or function that declares the same common block or variables.

WORK SHARING Constructs
Upon entering a parallel region, a team of threads is created and each thread executes the parallel region on its own. In this mode of operation, unless there are some controls (if statements etc.) based on the thread number, the work is simply repeated number_of_threads times. In many circumstances we would like a given set of tasks to be performed once by a team of threads rather than repeated on each thread, and for this we use one of the following three directives, known as the Work Sharing Directives.
• FORTRAN: !$OMP DO [clauses] … !$OMP END DO
  C/C++: #pragma omp for [clauses] for-loop
• FORTRAN: !$OMP SECTIONS … !$OMP END SECTIONS
  C/C++: #pragma omp sections
• FORTRAN: !$OMP WORKSHARE … !$OMP END WORKSHARE

Parallel DO/for loops
Syntax:
Fortran:
   !$OMP DO [clauses]
      do loop
   [!$OMP END DO]
The !$OMP END DO line is optional, as the compiler takes the END DO that closes the loop as the end of the construct.
C/C++:
   #pragma omp for [clauses]
      for loop
Note that there are no curly braces here, as the for-loop is taken to be the block for this pragma.

!$OMP DO / #pragma omp for directive
This directive allows the enclosed do-loop or for-loop to be work-shared amongst threads. Note that the !$OMP DO / #pragma omp for directive must itself be enclosed within a parallel region for parallel processing to be initiated, and that it can take a number of optional clauses that will be listed later.
Example:
   !$OMP PARALLEL
   !$OMP DO
      DO i = 1, 800
         A(i) = B(i) + C(i)
      END DO
   !$OMP END DO
   !$OMP END PARALLEL
This distributes the do-loop over the different threads, each thread computing part of the iterations. For example, if 4 threads are in use, then in general each thread computes 200 iterations of the do-loop: thread 0 computes from i=1 to 200, thread 1 from 201 to 400, and so on.
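For comparison, a C form of the same work-shared loop might look like the sketch below. The function name add_arrays and its argument list are assumptions for illustration, and the loop index is 0-based as usual in C.

```c
/* Hypothetical helper: a[i] = b[i] + c[i] for i = 0 .. n-1, with the
   iterations divided amongst the threads of the team. */
void add_arrays(const double *b, const double *c, double *a, int n)
{
    #pragma omp parallel
    {
        /* the for-loop itself is the block of this pragma: no braces */
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }
}
```

How the iterations are divided between threads is left to the implementation here; a serial compiler ignores both pragmas and runs the loop as normal.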

OMP DO example
Schematic of the 800-iteration loop running on 4 threads:
   Serial region: thread 0
   !$OMP PARALLEL
   !$OMP DO
   Parallel region: thread 0: do i = 1,200 | thread 1: do i = 201,400 | thread 2: do i = 401,600 | thread 3: do i = 601,800
   !$OMP END DO
   !$OMP END PARALLEL
   Serial region: thread 0

OMP DO/for directive clauses
• The OMP DO/for directive can take one or more of the following clauses to control the scheduling of the tasks and the behaviour of the private and shared variables.
• These are: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SCHEDULE and ORDERED.
• Of these, only the ORDERED and SCHEDULE clauses are specific to this directive. The other clauses are more general and are also used with other directives such as OMP PARALLEL.
• We shall therefore study only these two clauses here and leave the discussion of the others to later sections.

OMP for directive’s C/C++ restrictions
As the for loop in C is a general while loop, there are restrictions on the form it can take. It must have a determinable trip count, i.e. it must be of the form:
   for (var = a; var logical-op b; incr-exp)
where:
• logical-op is one of <, <=, >, >=
• incr-exp is var = var +/- incr (or var++ / var--)
Also, var must not be modified within the loop body.

Parallel loop identification
How can you tell if a loop is parallel or not?
(1) Every iteration should be independent of the other iterations,
(2) i.e. the iterations can be run in any order.
(3) Reverse order test: a loop is almost certainly parallel if it can be run backwards and still produce the same results.
Jumps out of the loop are not allowed.
For example, the loop below cannot be parallelised because it violates (1), (2) and (3):
   DO i=2,n
      a(i) = 2 * a(i-1)
   END DO

Parallel loop identification examples
Example 2:
   ix = base
   do i=1,n
      a(ix) = a(ix)*b(i)
      ix = ix + stride
   end do
Not parallel as written: ix is calculated during the previous iteration.
Example 3:
   do i=1,n
      b(i) = (a(i)-a(i-1))*0.5
   end do
Parallel: a() and b() are independent arrays, so previous iterations have no influence on the current one.
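The reverse order test can be tried out in serial code before any directive is added. The sketch below applies it to the recurrence a(i) = 2*a(i-1) and to the difference loop of Example 3. The array length, the initial values and all function names are assumptions chosen for this illustration.

```c
#include <string.h>

#define N 8   /* illustrative array length */

/* a[i] = 2*a[i-1], run forwards or backwards over i = 1 .. N-1 */
static void recurrence(double *a, int backwards)
{
    if (!backwards)
        for (int i = 1; i < N; i++)      a[i] = 2.0 * a[i - 1];
    else
        for (int i = N - 1; i >= 1; i--) a[i] = 2.0 * a[i - 1];
}

/* b[i] = (a[i]-a[i-1])*0.5, run forwards or backwards over i = 1 .. N-1 */
static void difference(const double *a, double *b, int backwards)
{
    if (!backwards)
        for (int i = 1; i < N; i++)      b[i] = (a[i] - a[i - 1]) * 0.5;
    else
        for (int i = N - 1; i >= 1; i--) b[i] = (a[i] - a[i - 1]) * 0.5;
}

/* Returns 1 if both loop directions give identical results, else 0. */
int recurrence_passes_reverse_test(void)
{
    double fwd[N], bwd[N];
    for (int i = 0; i < N; i++) fwd[i] = bwd[i] = 1.0;
    recurrence(fwd, 0);
    recurrence(bwd, 1);
    return memcmp(fwd, bwd, sizeof fwd) == 0;
}

/* Returns 1 if both loop directions give identical results, else 0. */
int difference_passes_reverse_test(void)
{
    double a[N], fwd[N], bwd[N];
    for (int i = 0; i < N; i++) {
        a[i] = (double)(i * i);
        fwd[i] = bwd[i] = 0.0;
    }
    difference(a, fwd, 0);
    difference(a, bwd, 1);
    return memcmp(fwd, bwd, sizeof fwd) == 0;
}
```

The recurrence fails the test because each element depends on one written earlier in the same loop, whereas the difference loop only reads a and so passes.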

Parallel loop example
   !$OMP PARALLEL DO
      do i=1,n
         b(i) = (a(i)-a(i-1))*0.5
      end do
   !$OMP END PARALLEL DO
• a, b and n are shared by default.
• i is private.

Parallelising, despite loop dependencies!
We have seen that loops that exhibit dependencies between iterations cannot normally be parallelised. That is to say, if one iteration of a loop uses values that are updated in another iteration of that loop, the loop cannot be parallelised. For example, loops containing either of the following statements cannot be parallelised:
   x(i) = x(i-1) + a
   x[i] = x[i] + x[i+1];
However, there is an important exception to this restriction for a commonly occurring class of operations, namely reduction operations.

Reductions
Reduction operations are those operations that produce a single value from an associative operation, e.g. addition, multiplication, subtraction, AND (&), OR (|), and a number of intrinsic functions such as MIN and MAX.
Example:
   sum=0.0
   DO i=1, N
      sum = sum + b(i)
   END DO
When parallelising this loop, the variable sum needs to be declared in a REDUCTION clause associated with the operator +. This allows sum to be calculated correctly: each thread does the accumulation in its own private copy of sum, and at the end of the loop these partial sums are added together to give a final result that is stored in the shared variable sum.
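In C, the summation loop with its reduction clause could be written along the lines of the sketch below. The function name sum_array is hypothetical, and the combined parallel for directive is used on the assumption that the loop is not already inside a parallel region.

```c
/* Hypothetical helper: returns b[0] + b[1] + ... + b[n-1]. */
double sum_array(const double *b, int n)
{
    double sum = 0.0;
    /* each thread accumulates into a private copy of sum; the partial
       sums are combined into the shared sum at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += b[i];
    return sum;
}
```

Because the partial sums may be combined in a different order from the serial loop, floating-point results can differ in the last digits between runs with different thread counts.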

Reductions
Variables can be given the REDUCTION attribute:
Fortran: REDUCTION(op:list)
C/C++: reduction(op:list)
• op is a reduction operator or intrinsic function (for example +, *, -, .AND., .OR., MIN, MAX in Fortran; +, *, -, &, |, &&, || in C/C++).
• list is the list of reduction variables.
OpenMP allows array reductions only in Fortran.
