
Introduction to OpenMP



Presentation Transcript


  1. Introduction to OpenMP I - Introduction ITCS4145/5145, Parallel Programming C. Ferner and B. Wilkinson Feb 3, 2016

2. OpenMP structure • A standard developed in the 1990s for thread-based programming on shared memory systems. • Higher level than using low-level APIs such as Pthreads or Java threads. • Consists of a set of compiler directives, plus a few library routines and environment variables. • gcc supports OpenMP with the -fopenmp option, so no additional software is needed.

3. OpenMP compiler directives • OpenMP uses #pragma compiler directives to parallelize a program (“pragmatic” directive) • The programmer inserts #pragma statements into the sequential program to tell the compiler how to parallelize it • When the compiler comes across a compiler directive, it creates the corresponding thread-based parallel code • The basic OpenMP pattern is the thread-pool pattern

4. Thread-pool pattern • Basic OpenMP pattern • A parallel region indicates a section of code executed by all threads • At the end of a parallel region, all threads wait for each other as if there were a “barrier” (unless otherwise specified) • Code outside a parallel region is executed by the master thread only
[Figure: the master thread forks multiple threads at each parallel region, with a synchronization point at the end of each region]

5. Parallel Region
Syntax:
#pragma omp parallel
structured_block
• omp indicates an OpenMP pragma (other compilers will ignore it)
• parallel indicates a parallel region
• structured_block is either: • a single statement terminated with ; or • a block of statements, i.e. { statement1; statement2; … }

6. Hello World
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
  #pragma omp parallel
  {
    printf("Hello World from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}
These routines return the thread ID (from 0 onwards) and the total number of threads, respectively (easy to confuse as the names are similar).
Very important: the opening brace must be on a new line.

7. Compiling and Output
$ gcc -fopenmp hello.c -o hello      (-fopenmp tells gcc to interpret OpenMP directives)
$ ./hello
Hello World from thread 2 of 4
Hello World from thread 0 of 4
Hello World from thread 3 of 4
Hello World from thread 1 of 4
$

8. Number of threads • Three ways to indicate how many threads you want:
1. Use the num_threads clause within the directive, e.g. #pragma omp parallel num_threads(5)
2. Use the omp_set_num_threads function, e.g. omp_set_num_threads(6);
3. Use the OMP_NUM_THREADS environment variable, e.g.
$ export OMP_NUM_THREADS=8
$ ./hello
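A minimal sketch (not from the slides) showing the first two methods in a single program; the file name, thread counts and messages are arbitrary choices. The num_threads clause applies only to its own region, omp_set_num_threads sets the default for subsequent regions, and OMP_NUM_THREADS sets the default when neither is used:

// set_threads.c -- compile with: gcc -fopenmp set_threads.c -o set_threads
#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(6);              /* method 2: library routine */
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      printf("Region 1 uses %d threads\n", omp_get_num_threads());
  }

  #pragma omp parallel num_threads(5)  /* method 1: clause, this region only */
  {
    if (omp_get_thread_num() == 0)
      printf("Region 2 uses %d threads\n", omp_get_num_threads());
  }
  return 0;
}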

9. Shared versus Private Data
int main(int argc, char *argv[])
{
  int x;
  int tid;
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) x = 42;
    printf("Thread %d, x = %d\n", tid, x);
  }
}
x is shared by all threads; tid is private – each thread has its own copy.
Variables declared outside the parallel construct are shared unless otherwise specified.

10. Abstractly, it looks like this
[Figure: four processors, each with its own local memory holding a private copy of tid (values 0, 1, 2, 3), all connected to a shared memory holding x = 42]

11. Shared versus Private Data
$ ./data
Thread 3, x = 0
Thread 2, x = 0
Thread 1, x = 0
Thread 0, x = 42
Thread 4, x = 42
Thread 5, x = 42
Thread 6, x = 42
Thread 7, x = 42
tid has a separate value for each thread; x has the same value for each thread (well… almost)

12. Another Example: Shared versus Private
int x, tid, n, a[100];
#pragma omp parallel private(tid, n)
{
  tid = omp_get_thread_num();
  n = omp_get_num_threads();
  a[tid] = 10*n;
}
a[ ] is shared; tid and n are private.
Optionally, the shared variables can be listed explicitly:
#pragma omp parallel private(tid, n) shared(a)
...

  13. Specifying Work Inside a Parallel Region (Work Sharing Constructs) • Four constructs: • section – each section executed by a different thread • for – one or more iterations executed by a (potentially) different thread • single – executed by a single thread (sequential) • master – executed by the master only (sequential) • Barrier after each construct (except master) unless a nowait clause is given • Constructs must be used within a parallel region

14. Sections Syntax
#pragma omp parallel            /* parallel region */
{
  #pragma omp sections
  {
    #pragma omp section
      structured_block
    #pragma omp section
      structured_block
    ...
  }
}
The sections are executed by available threads.

15. Sections Example
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  #pragma omp sections nowait    /* threads do not wait after finishing a section */
  {
    #pragma omp section          /* one thread does this */
    {
      printf("Thread %d doing section 1\n", tid);
      for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]=%f\n", tid, i, c[i]);
      }
    }

16. Sections example continued
    #pragma omp section          /* another thread does this */
    {
      printf("Thread %d doing section 2\n", tid);
      for (i = 0; i < N; i++) {
        d[i] = a[i] * b[i];
        printf("Thread %d: d[%d]=%f\n", tid, i, d[i]);
      }
    }
  } /* end of sections */
  printf("Thread %d done\n", tid);
} /* end of parallel section */

17. Sections Output
Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 done
Thread 2 done
Thread 1 doing section 2
Thread 1: d[0]= 0.000000
Thread 1: d[1]= 6.000000
Thread 1: d[2]= 14.000000
Thread 1: d[3]= 24.000000
Thread 0 done
Thread 1: d[4]= 36.000000
Thread 1 done
Threads do not wait (i.e. no barrier).

18. If we remove the nowait clause
Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 doing section 2
Thread 3: d[0]= 0.000000
Thread 3: d[1]= 6.000000
Thread 3: d[2]= 14.000000
Thread 3: d[3]= 24.000000
Thread 3: d[4]= 36.000000
Thread 3 done
Thread 1 done
Thread 2 done
Thread 0 done
There is a barrier at the end of the sections construct: threads wait until they are all done with their sections.

19. Parallel For Syntax
#pragma omp parallel       /* enclosing parallel region */
{
  #pragma omp for
  for (i = 0; i < N; i++) {
    ...
  }
}
Different iterations will be executed by available threads.
Must be a simple C for loop, where the lower and upper bounds are constants (actually loop invariant).

20. Iteration Space • Suppose N = 13. • The iteration space is the set of iterations 0, 1, 2, …, 12.

21. Iteration Partitions • Without further specification, iterations per partition (chunk size) = ⌈N/P⌉ for P threads, e.g. ⌈13/4⌉ = 4 for 4 threads.

22. Mapping • Iteration partitions are assigned to processors (threads) • Ex. with N = 13, chunk size 4 and 4 threads: thread 0 → iterations 0–3, thread 1 → 4–7, thread 2 → 8–11, thread 3 → 12.
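A minimal sketch (not from the slides) that prints which thread executes each iteration under the default schedule; N = 13 and the 4-thread count follow the example above, and the exact chunk boundaries are up to the implementation:

// mapping.c -- print the iteration-to-thread mapping (output order may vary)
#include <stdio.h>
#include <omp.h>

#define N 13

int main(void)
{
  int i;
  #pragma omp parallel for num_threads(4)
  for (i = 0; i < N; i++) {
    printf("iteration %2d executed by thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}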

23. Parallel For Example
#pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  if (tid == 0) {
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
  }
  printf("Thread %d starting...\n", tid);
  #pragma omp for
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
    printf("Thread %d: i = %d, c[%d] = %f\n", tid, i, i, c[i]);
  }
} /* end of parallel section */
Without "nowait", threads wait after finishing the loop.

24. Parallel For Output
Thread 1 starting...
Thread 1: i = 2, c[2] = 9.000000
Thread 1: i = 3, c[3] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[4] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[1] = 7.000000
Iterations of the loop are mapped to threads. In this example (N = 5, 4 threads) the mapping is: thread 0 → {0, 1}, thread 1 → {2, 3}, thread 2 → {4}, thread 3 → {}.
Barrier here (at the end of the for construct).

25. Combining Directives • If a parallel region consists of only one parallel for or parallel sections construct, they can be combined: • #pragma omp parallel sections • #pragma omp parallel for

26. Combining Directives Example
#pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
for (i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}
Declares a parallel region with a parallel for.

27. Scheduling a Parallel For • By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to available threads (static mapping):
Thread 1 starting...
Thread 1: i = 2, c[2] = 9.000000
Thread 1: i = 3, c[3] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[4] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[1] = 7.000000
Default chunk size here is 2 iterations. Barrier here.

28. Scheduling a Parallel For • Static – Partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion:
#pragma omp parallel for schedule (static, chunk_size)
(If chunk_size is not specified, the chunks are approximately equal in size, with at most one chunk per thread.)
• Dynamic – Chunk-sized blocks of iterations are assigned to threads as they become available:
#pragma omp parallel for schedule (dynamic, chunk_size)
(If chunk_size is not specified, it defaults to 1.)
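A minimal sketch (not from the slides) contrasting the two clauses on the same loop; the chunk size of 2, the loop bound and the thread count are arbitrary choices:

// schedules.c -- compare static and dynamic scheduling of the same loop
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
  int i;

  /* static: chunks of 2 iterations dealt out round-robin to the threads */
  #pragma omp parallel for schedule(static, 2) num_threads(4)
  for (i = 0; i < N; i++) {
    printf("static : iteration %d on thread %d\n", i, omp_get_thread_num());
  }

  /* dynamic: chunks of 2 iterations claimed by whichever thread is free next */
  #pragma omp parallel for schedule(dynamic, 2) num_threads(4)
  for (i = 0; i < N; i++) {
    printf("dynamic: iteration %d on thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}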

29. Scheduling a Parallel For • Guided – Similar to dynamic, but the chunk size starts large and gets smaller:*
#pragma omp parallel for schedule (guided)
• Runtime – Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used:
#pragma omp parallel for schedule (runtime)
* The actual algorithm is slightly more complicated; see the OpenMP standard.
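A minimal usage sketch (not from the slides) of the runtime schedule; myprog and work() are hypothetical names, and "guided,4" is just one possible value of OMP_SCHEDULE:

/* in the program: the schedule is chosen at run time from OMP_SCHEDULE */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < N; i++) {
  work(i);   /* hypothetical per-iteration work */
}

$ export OMP_SCHEDULE="guided,4"
$ ./myprog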

30. Static assignment • Suppose there are • N = 100 iterations • chunk_size = 15 • P = 4 threads • There are ⌈100/15⌉ = 7 chunks (six of 15 iterations and one of 10) • Each thread will get at most ⌈7/4⌉ = 2 chunks

31. Cyclic Assignment of Blocks • The 7 chunks are dealt out to the 4 threads in round-robin (cyclic) fashion: thread 0 gets chunks 0 and 4, thread 1 gets chunks 1 and 5, thread 2 gets chunks 2 and 6, and thread 3 gets chunk 3.
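A minimal sketch (not from the slides) that makes this assignment visible by printing which thread handles the start of each chunk for N = 100, chunk size 15 and 4 threads:

// cyclic.c -- show the round-robin (cyclic) assignment of chunks to threads
#include <stdio.h>
#include <omp.h>

#define N 100

int main(void)
{
  int i;
  /* schedule(static,15): chunks of 15 iterations dealt out round-robin */
  #pragma omp parallel for schedule(static, 15) num_threads(4)
  for (i = 0; i < N; i++) {
    if (i % 15 == 0)   /* print once per chunk to keep the output short */
      printf("chunk starting at i = %2d handled by thread %d\n",
             i, omp_get_thread_num());
  }
  return 0;
}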

  32. Question Guided scheduling is similar to Static except that the chunk sizes start large and get smaller. What is the advantage of using Guided versus Static? Answer: Guided improves load balance Is there any disadvantage of using Guided? Answer: Overhead

33. Reduction • A reduction applies a binary commutative operator to a collection of values, producing a single value.

34. Reduction • A binary commutative operator applied to a collection of values, producing a single value • E.g., applying summation to the values shown on the slide [list of values omitted in this transcript] produces the single value 549 • The OpenMP (and MPI) standards do not specify how reduction should be implemented; however, … • Commutative: changing the order of the operands does not change the result. • Associative: the order in which the operations are performed does not matter (with the operand sequence unchanged), i.e. rearranging parentheses does not alter the result.

35. Reduction Implementation • A reduction could be implemented fairly efficiently on multiple processors using a tree, in which case the time is O(log P).
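A minimal sketch (not from the slides, and not necessarily how any OpenMP runtime actually implements it) of combining per-thread partial sums pairwise in a tree, halving the number of active values at each step:

// tree_reduce.c -- illustrative pairwise (tree) combination of partial sums
#include <stdio.h>

#define P 8   /* number of partial results, e.g. one per thread */

int main(void)
{
  double partial[P] = {1, 2, 3, 4, 5, 6, 7, 8};   /* per-thread partial sums */

  /* log2(P) steps: at each step, element i absorbs element i+stride */
  for (int stride = 1; stride < P; stride *= 2) {
    for (int i = 0; i + stride < P; i += 2 * stride) {
      partial[i] += partial[i + stride];
    }
  }
  printf("reduced value = %f\n", partial[0]);   /* 36.000000 */
  return 0;
}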

36. Reduction Operators [operator table omitted in this transcript] • In C/C++ the reduction operators include +, *, -, &, |, ^, && and || • Subtract is in the spec, but the partial results are still combined with a summation.
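A minimal sketch (not from the slides) using one of the other operators, a logical-AND reduction; the array contents and the all_positive name are arbitrary choices:

// and_reduce.c -- logical-AND reduction: is every element positive?
#include <stdio.h>

#define N 6

int main(void)
{
  double v[N] = {1.5, 2.0, 3.0, 0.5, 4.0, 2.5};
  int all_positive = 1;   /* identity value for && */

  #pragma omp parallel for reduction(&&:all_positive)
  for (int i = 0; i < N; i++) {
    all_positive = all_positive && (v[i] > 0.0);
  }
  printf("all positive? %s\n", all_positive ? "yes" : "no");
  return 0;
}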

37. Reduction
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++) {
  sum = sum + funct(k);
}
The operation and the variable are named in the reduction clause. A private copy of sum is created for each thread by the compiler; each private copy is added into sum at the end. This eliminates the need for a critical section here (a hand-coded equivalent is sketched below).
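For comparison, a minimal hand-coded sketch (not from the slides) of what the reduction clause does for you, assuming sum and funct(k) are doubles as in the fragment above:

/* hand-coded equivalent of reduction(+:sum), for illustration only */
sum = 0;
#pragma omp parallel
{
  double local = 0;                  /* per-thread partial sum */
  #pragma omp for
  for (k = 0; k < 100; k++) {
    local = local + funct(k);
  }
  #pragma omp critical               /* serialize only the final update */
  sum = sum + local;
}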

38. Single
#pragma omp parallel
{
  ...
  #pragma omp single
    structured_block
  ...
}
Only one thread executes this section; there is no guarantee of which one.

39. Single Example
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf("Thread %d starting...\n", tid);
  #pragma omp single
  {
    printf("Thread %d doing work\n", tid);
    ...
  } // end of single
  printf("Thread %d done\n", tid);
} // end of parallel section

40. Single Results
Thread 0 starting...
Thread 0 doing work
Thread 3 starting...
Thread 2 starting...
...
Thread 1 starting...
Thread 0 done
Thread 1 done
Thread 2 done
Thread 3 done
Only one thread executes the section. "nowait" was NOT specified, so the other threads wait for that thread to finish (barrier here).

41. Master
#pragma omp parallel
{
  ...
  #pragma omp master
    structured_block
  ...
}
Only one thread (the master) executes this section. "nowait" cannot be specified here; there is no barrier after this block, so the other threads will NOT wait.

42. Master Example
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf("Thread %d starting...\n", tid);
  #pragma omp master
  {
    printf("Thread %d doing work\n", tid);
    ...
  } // end of master
  printf("Thread %d done\n", tid);
} // end of parallel section

43. Is there any difference between these two approaches?
Master directive:
#pragma omp parallel
{
  ...
  #pragma omp master
    structured_block
  ...
}
Using an if statement:
#pragma omp parallel private(tid)
{
  ...
  tid = omp_get_thread_num();
  if (tid == 0)
    structured_block
  ...
}
(Neither has an implied barrier; compare with single, which has a barrier unless a nowait clause is given.)

44. Assignment 1 Assignment posted: • Part 1: a tutorial on compiling and executing sample code – on your own computer • Part 2: parallelizing matrix multiplication – on your own computer • Part 3: executing the matrix multiplication code on the cluster. Due Sunday January 31st, 2016 (Week 3).

  45. Questions

  46. More information http://openmp.org/wp/
