Parallel Programming with OpenMP

Presentation Transcript


  1. Parallel Programming with OpenMP Ing. Andrea Marongiu a.marongiu@unibo.it

  2. The Multicore Revolution is Here! • More instruction-level parallelism hard to find • Very complex designs needed for small gain • Thread-level parallelism appears live and well • Clock frequency scaling is slowing drastically • Too much power and heat when pushing envelope • Cannot communicate across chip fast enough • Better to design small local units with short paths • Effective use of billions of transistors • Easier to reuse a basic unit many times • Potential for very easy scaling • Just keep adding processors/cores for higher (peak) performance

  3. The Road Here • One processor used to require several chips • Then one processor could fit on one chip • Now many processors fit on a single chip • This changes everything, since single processors are no longer the default

  4. Multiprocessing is Here • Multiprocessor and multicore systems are the future • Some current examples

  5. Vocabulary in the Multi Era • AMP, Asymmetric MP: Each processor has local memory, tasks statically allocated to one processor • SMP, Shared-Memory MP: Processors share memory, tasks dynamically scheduled to any processor

  6. Vocabulary in the Multi Era • Heterogeneous: Specialization among processors. Often different instruction sets. Usually AMP design. • Homogeneous: all processors have the same instruction set, can run any task, usually SMP design.

  7. Keeping shared data consistent • Cache coherency: • Fundamental technology for shared memory • Local caches for each processor in the system • Multiple copies of shared data can be present in caches • To maintain correct function, caches have to be coherent • When one processor changes shared data, no other cache will (eventually) contain the old data

  8. Future Embedded Systems

  9. The Software becomes the Problem • The arrival of MPSoCs on the market has raised the need for standard parallel programming models. • Parallelism is required to gain performance • Parallel hardware is “easy” to design • Parallel software is (very) hard to write • Fundamentally hard to grasp true concurrency • Especially in complex software environments • Existing software assumes a single processor • Might break in new and interesting ways • Multitasking is no guarantee of running well on a multiprocessor • Exploiting parallelism has historically required significant programmer effort • This could be alleviated by a programming model that exposes the mechanisms that are useful for the programmer to control, while hiding the details of their implementation.

  10. Message-Passing & Shared-Memory • Message passing: local memory for each task, explicit messages for communication • Shared memory: all tasks can access the same memory, communication is implicit

  11. Programming Parallel Machines • Synchronize & coordinate execution • Communicate data & status between tasks • Ensure parallelism-safe access to shared data • Components of the shared-memory solution: • All tasks see the same memory • Locks to protect shared data accesses • Synchronization primitives to coordinate execution

  12. Programming Model: MPI • Message-passing API • Explicit messages for communication • Explicit distribution of data to each thread for work • Shared memory not visible in the programming model • Best scaling for large systems (1000s of CPUs) • Quite hard to program • Well-established in HPC
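
To make the contrast with OpenMP concrete, here is a minimal MPI sketch (not part of the original slides): every process runs its own copy of the program and identifies itself through explicit runtime calls; any data exchange would use explicit messages (e.g. MPI_Send/MPI_Recv).

#include <mpi.h>
#include <stdio.h>

/* Compile with an MPI wrapper (e.g. mpicc) and launch with mpirun. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}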

  13. Programming model: POSIX Threads • Standard API • Explicit operations • Strong programmer control, arbitrary work in each thread • Create & manipulate • Locks • Mutexes • Threads • etc.
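
As a point of reference for the explicit operations listed above, here is a minimal Pthreads sketch (not from the slides): threads are created and joined by hand, and a mutex protects a shared counter.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    pthread_mutex_lock(&lock);      /* the programmer manages locking explicitly */
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);   /* explicit thread creation */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);                    /* explicit join */
    printf("counter = %d\n", counter);                 /* prints 4 */
    return 0;
}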

  14. Programming model: OpenMP • De-facto standard for the shared memory programming model • A collection of compiler directives, library routines and environment variables • Easy to specify parallel execution within a serial code • Requires special support in the compiler • Generates calls to threading libraries (e.g. pthreads) • Focus on loop-level parallel execution • Popular in high-end embedded

  15. Fork/Join Parallelism • Initially only master thread is active • Master thread executes sequential code • Fork: Master thread creates or awakens additional threads to execute parallel code • Join: At the end of parallel code created threads die or are suspended
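
A short sketch of the fork/join model (an assumed example, not from the slides): the master executes the sequential parts alone, a team of threads is forked at the parallel region, and the threads join at its end.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("Sequential part: only the master runs here\n");

    #pragma omp parallel              /* fork: a team of threads is created */
    {
        printf("Parallel part: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                 /* join: implicit barrier, extra threads die or are suspended */

    printf("Sequential part again: back to the master only\n");
    return 0;
}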

  16. Pragmas • Pragma: a compiler directive in C or C++ • Stands for “pragmatic information” • A way for the programmer to communicate with the compiler • The compiler is free to ignore pragmas: the original sequential semantics are not altered • Syntax: #pragma omp <rest of pragma>
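
A sketch of the point about preserved sequential semantics (an assumed example): compiled without OpenMP support (e.g. gcc without -fopenmp), the pragma is ignored and the loop runs serially; compiled with OpenMP, the iterations are distributed among threads, with the same result.

#include <stdio.h>

int main(void)
{
    int a[10];

    #pragma omp parallel for          /* ignored by a compiler without OpenMP support */
    for (int i = 0; i < 10; i++)
        a[i] = i * i;

    for (int i = 0; i < 10; i++)      /* same output either way */
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}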

  17. Components of OpenMP • Directives: • Parallel regions: #pragma omp parallel • Work sharing: #pragma omp for, #pragma omp sections • Synchronization: #pragma omp barrier, #pragma omp critical, #pragma omp atomic • Clauses (data scope attributes): private, shared, reduction • Runtime library: • Thread forking/joining: omp_parallel_start(), omp_parallel_end() • Number of threads: omp_get_num_threads() • Thread IDs: omp_get_thread_num()

  18. #pragma omp parallel • Most important directive • Enables parallel execution through a call to pthread_create • Code within its scope is replicated among threads • A sequential program.. ..is easily parallelized: int main() { #pragma omp parallel { printf (“\nHello world!”); } } • The compiler lowers this to: int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf (“\nHello world!”); }

  19. #pragma omp parallel Code originally contained within the scope of the pragma is outlined to a new function within the compiler int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf (“\nHello world!”); } int main() { #pragma omp parallel { printf (“\nHello world!”); } }

  20. #pragma omp parallel The #pragma construct in the main function is replaced with function calls to the runtime library int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf (“\nHello world!”); } int main() { #pragma omp parallel { printf (“\nHello world!”); } }

  21. #pragma omp parallel First we call the runtime to fork new threads, and pass them a pointer to the function to execute in parallel int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf (“\nHello world!”); } int main() { #pragma omp parallel { printf (“\nHello world!”); } }

  22. #pragma omp parallel Then the master itself calls the parallel function int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf (“\nHello world!”); } int main() { #pragma omp parallel { printf (“\nHello world!”); } }

  23. #pragma omp parallel Finally we call the runtime to synchronize threads with a barrier and suspend them int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf (“\nHello world!”); } int main() { #pragma omp parallel { printf (“\nHello world!”); } }
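
For reference, a compilable version of the “Hello world” example walked through in slides 18-23 (a sketch: headers and straight quotes added, compile flag assumed). The omp_parallel_start()/omp_parallel_end() calls shown above are compiler-internal runtime entry points generated during lowering, not something the programmer writes.

#include <stdio.h>

/* Compile with e.g. gcc -fopenmp hello.c */
int main(void)
{
    #pragma omp parallel
    {
        printf("\nHello world!");     /* replicated on every thread of the team */
    }
    return 0;
}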

  24. #pragma omp parallel: data scope attributes • A slightly more complex example: int main() { int id; int a = 5; #pragma omp parallel { id = omp_get_thread_num(); if (id == 0) { a = a * 2; printf (“Master: a = %d.”, a); } else printf (“Slave: a = %d.”, a); } } • Call the runtime to get the thread ID: every thread sees a different value • Master and slave threads access the same variable a

  25. #pragma omp parallel: data scope attributes • A slightly more complex example: int main() { int id; int a = 5; #pragma omp parallel { id = omp_get_thread_num(); if (id == 0) { a = a * 2; printf (“Master: a = %d.”, a); } else printf (“Slave: a = %d.”, a); } } • Call the runtime to get the thread ID: every thread sees a different value • Master and slave threads access the same variable a • How to inform the compiler about these different behaviors?

  26. #pragma omp parallel: data scope attributes • A slightly more complex example: int main() { int id; int a = 5; #pragma omp parallel shared(a) private(id) { id = omp_get_thread_num(); if (id == 0) { a = a * 2; printf (“Master: a = %d.”, a); } else printf (“Slave: a = %d.”, a); } } • shared: insert code to retrieve the address of the shared object from within each parallel thread • private: allow symbol privatization, so each thread contains a private copy of this variable
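
A compilable version of the data-scope example from slides 24-26 (a sketch: headers added, output format slightly adjusted). The variable a is shared by the whole team, while each thread gets its own private copy of id.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int id;
    int a = 5;

    #pragma omp parallel shared(a) private(id)
    {
        id = omp_get_thread_num();       /* private: each thread sees its own value */
        if (id == 0) {
            a = a * 2;                   /* shared: the master's update is visible to all */
            printf("Master: a = %d.\n", a);
        } else {
            printf("Slave: a = %d.\n", a);   /* may print 5 or 10, depending on timing */
        }
    }
    return 0;
}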

  27. #pragma omp for • The parallel pragma instructs every thread to execute all of the code inside the block • If we encounter a for loop that we want to divide among threads, we use the for pragma #pragma omp for

  28. #pragma omp for: loop partitioning algorithm • Chunk size: C = ceil(N / Nthr) • Lower bound: LB = C * TID • Upper bound: UB = min(C * (TID + 1), N) • Example: N=10, Nthr=4, data chunk of 3 elements: #pragma omp parallel for { for (i=0; i<10; i++) a[i] = i; } • Thread ID (TID): 0 1 2 3 • Lower bound (LB): 0 3 6 9 • Upper bound (UB): 3 6 9 10
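
The partitioning rule above can be written out as a small C helper (a hypothetical illustration of the algorithm, not an OpenMP API): chunk C = ceil(N/Nthr), LB = C*TID, UB = min(C*(TID+1), N).

#include <stdio.h>

static void bounds(int N, int Nthr, int TID, int *LB, int *UB)
{
    int C = (N + Nthr - 1) / Nthr;               /* ceil(N / Nthr) */
    *LB = C * TID;
    *UB = (C * (TID + 1) < N) ? C * (TID + 1) : N;
}

int main(void)
{
    int LB, UB;
    for (int tid = 0; tid < 4; tid++) {          /* N = 10, Nthr = 4, as in the slide */
        bounds(10, 4, tid, &LB, &UB);
        printf("thread %d: [%d, %d)\n", tid, LB, UB);   /* [0,3) [3,6) [6,9) [9,10) */
    }
    return 0;
}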

  29. #pragma omp for The code of the for loop is moved inside the parallel function. Every thread works on a different subset of the iteration space int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { for (i=LB; i<UB; i++) a[i] = i; } int main() { #pragma omp parallel for { for (i=0; i<10; i++) a[i] = i; } }

  30. #pragma omp for Lower and upper boundaries (LB, UB) are computed based on the thread ID, according to the previously described algorithm int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { for (i=LB; i<UB; i++) a[i] = i; } int main() { #pragma omp parallel for { for (i=0; i<10; i++) a[i] = i; } }
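
A compilable sketch of the loop example from slides 28-30 (the array size and the final printing loop are assumptions). Note that in standard OpenMP the for loop directly follows the combined parallel for directive, without an extra brace block.

#include <stdio.h>

#define N 10

int main(void)
{
    int a[N];

    #pragma omp parallel for          /* each thread gets a contiguous chunk of iterations */
    for (int i = 0; i < N; i++)
        a[i] = i;

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}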

  31. #pragma omp sections • The for pragma allows us to exploit data parallelism in loops • OpenMP also provides a directive to exploit task parallelism: #pragma omp sections

  32. Task Parallelism Example int main() { #pragma omp parallel { #pragma omp sections { v = alpha(); #pragma omp section w = beta (); } #pragma omp sections { x = gamma (v, w); #pragma omp section y = delta (); } } printf (“%f\n”, epsilon (x, y)); } The dataflow graph of this application shows some parallelism (i.e. functions whose execution doesn’t depend on the outcomes of other functions)

  33. #pragma omp sections int main() { #pragma omp parallel { #pragma omp sections { v = alpha(); #pragma omp section w = beta (); } #pragma omp sections { x = gamma (v, w); #pragma omp section y = delta (); } } printf (“%f\n”, epsilon (x, y)); }

  34. #pragma omp sections int main() { #pragma omp parallel { #pragma omp sections { v = alpha(); #pragma omp section w = beta (); } #pragma omp sections { x = gamma (v, w); #pragma omp section y = delta (); } } printf (“%f\n”, epsilon (x, y)); } alpha and beta execute in parallel gamma and delta execute in parallel

  35. #pragma omp sections int main() { #pragma omp parallel { #pragma omp sections { v = alpha(); #pragma omp section w = beta (); } #pragma omp sections { x = gamma (v, w); #pragma omp section y = delta (); } } printf (“%f\n”, epsilon (x, y)); } A barrier is implied at the end of the sections pragma that prevents gamma from consuming v and w before they are produced
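
A compilable sketch of the task-parallel example from slides 32-35. The functions alpha/beta/gamma/delta/epsilon are assumed stand-ins for real work (gamma is renamed gamma_ here only to avoid a clash with the C math library function of the same name).

#include <stdio.h>

double alpha(void)                 { return 1.0; }
double beta(void)                  { return 2.0; }
double gamma_(double v, double w)  { return v + w; }
double delta(void)                 { return 3.0; }
double epsilon(double x, double y) { return x * y; }

int main(void)
{
    double v = 0, w = 0, x = 0, y = 0;

    #pragma omp parallel
    {
        #pragma omp sections               /* alpha and beta run in parallel */
        {
            #pragma omp section
            v = alpha();
            #pragma omp section
            w = beta();
        }                                  /* implicit barrier: v and w are ready */
        #pragma omp sections               /* gamma_ and delta run in parallel */
        {
            #pragma omp section
            x = gamma_(v, w);
            #pragma omp section
            y = delta();
        }
    }
    printf("%f\n", epsilon(x, y));
    return 0;
}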

  36. #pragma omp barrier • Most important synchronization mechanism in shared memory fork/join parallel programming • All threads participating in a parallel region wait until everybody has finished before computation flows on • This prevents later stages of the program from working with inconsistent shared data • It is implied at the end of parallel constructs, as well as of for and sections constructs (unless a nowait clause is specified)
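
A sketch showing an explicit barrier together with the nowait clause (an assumed example): the first loop drops its implicit barrier, so an explicit barrier is needed before threads read elements written by other threads.

#include <stdio.h>

#define N 100

int main(void)
{
    int data[N], out[N];

    #pragma omp parallel
    {
        #pragma omp for nowait        /* no implicit barrier after this loop */
        for (int i = 0; i < N; i++)
            data[i] = i;

        #pragma omp barrier           /* all of data[] must be written before it is read */

        #pragma omp for
        for (int i = 0; i < N; i++)
            out[i] = data[N - 1 - i]; /* reads elements written by other threads */
    }
    printf("out[0] = %d\n", out[0]);  /* prints 99 */
    return 0;
}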

  37. #pragma omp critical • Critical Section: a portion of code that only one thread at a time may execute • We denote a critical section by putting the pragma #pragma omp critical in front of a block of C code
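
A minimal critical-section sketch (an assumed example): every thread increments a shared counter, and the critical directive ensures the read-modify-write is executed by one thread at a time.

#include <stdio.h>

int main(void)
{
    int hits = 0;

    #pragma omp parallel
    {
        #pragma omp critical
        {
            hits++;                   /* lock acquired before, released after this block */
        }
    }
    printf("hits = %d\n", hits);      /* equals the number of threads in the team */
    return 0;
}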

  38. π-finding code example double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) { for (i=0; i<n; i++) { x = (i +0.5)/n; area += 4.0/(1.0 + x*x); } } pi = area/n; We must synchronize accesses to shared variable area to avoid inconsistent results. If we don’t..

  39. Race condition • …we set up a race condition in which one process may “race ahead” of another and not see its change to shared variable area

  40. Race condition (Cont’d)

  41. π-finding code example double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) { for (i=0; i<n; i++) { x = (i +0.5)/n; #pragma omp critical area += 4.0/(1.0 + x*x); } } pi = area/n; #pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution

  42. Correctness, not performance! • In practice, using locks makes execution sequential • To limit this effect we should use fine-grained locking (i.e. make critical sections as small as possible) • A single instruction to compute the value of area in the previous example is translated into many simpler instructions within the compiler! • The programmer is not aware of the real granularity of the critical section

  43. Correctness, not performance! • In practice, using locks makes execution sequential • To limit this effect we should use fine-grained locking (i.e. make critical sections as small as possible) • A single instruction to compute the value of area in the previous example is translated into many more instructions • The programmer is not aware of the real granularity of the critical section • This is a dump of the intermediate representation of the program within the compiler

  44. Correctness, not performance! • This is the way the compiler represents the instruction area += 4.0/(1.0 + x*x); • Call the runtime to acquire the lock • Lock-protected operations (critical section) • Call the runtime to release the lock • THIS IS DONE AT EVERY LOOP ITERATION!

  45. Correctness, not performance! • A programming pattern such as area += 4.0/(1.0 + x*x); in which we: • Fetch the value of an operand • Add a value to it • Store the updated value ..is called a reduction, and is so common that OpenMP provides support for it • OpenMP takes care of storing partial results in private variables and combining the partial results after the loop

  46. Correctness, not performance! double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) reduction(+:area) { for (i=0; i<n; i++) { x = (i +0.5)/n; area += 4.0/(1.0 + x*x); } } pi = area/n; The reduction clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop partial sums are combined on the shared area variable

  47. Correctness, not performance! • UNOPTIMIZED CODE: the shared variable is updated at every iteration; execution of this critical section is SERIALIZED • With the reduction clause: within the LOOP, each thread computes partial sums on a private copy of the reduction variable; the shared variable is only updated at the end of the loop, when the partial sums are combined • double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) reduction(+:area) { for (i=0; i<n; i++) { x = (i +0.5)/n; area += 4.0/(1.0 + x*x); } } pi = area/n; • The reduction clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop partial sums are combined on the shared area variable

  48. Correctness, not performance! • UNOPTIMIZED CODE: the shared variable is updated at every iteration; execution of this critical section is SERIALIZED • With the reduction clause, the partial sums are combined with a single atomic write: __sync_fetch_and_add(&.omp_data_i->area, area); the target architecture may provide such an instruction • double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) reduction(+:area) { for (i=0; i<n; i++) { x = (i +0.5)/n; area += 4.0/(1.0 + x*x); } } pi = area/n; • The reduction clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop partial sums are combined on the shared area variable
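
For reference, a compilable sketch of the reduction version of the pi program (assumptions: n is fixed here, the loop directly follows the combined directive, and the shared(area) clause is dropped, since a reduction variable should not also appear in another data-sharing clause on the same directive).

#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double area = 0.0, pi, x;

    #pragma omp parallel for private(x) reduction(+:area)
    for (int i = 0; i < n; i++) {
        x = (i + 0.5) / n;
        area += 4.0 / (1.0 + x * x);  /* each thread accumulates into a private copy */
    }
    pi = area / n;                    /* private copies have been combined into area */

    printf("pi = %f\n", pi);          /* ~3.141593 */
    return 0;
}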

  49. Summary

  50. Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy • Memory latency is well recognized as a severe performance bottleneck • MPSoCs feature a complex memory hierarchy, with multiple cache levels, and private and shared on-chip and off-chip memories • Using the memory hierarchy efficiently is of the utmost importance to exploit the computational power of MPSoCs, but..
