
High Performance Computing – CISC 811



Presentation Transcript


  1. High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics

  2. Assignment 3 • Posted – on OpenMP and concepts in parallelization • Use HPCVL to run assignments • Note there are a lot of wrong answers for question number 3 on the web 

  3. HPCVL Accounts • I have created a group for the course • If you don’t already have an account, go to www.hpcvl.org and fill out the account application for an existing group • Group name: hpcg1142 • Should take about a day to set up • Accounts will last for duration of course

  4. Today’s Lecture Shared Memory Parallelism II • Part 1: OpenMP pragmas cont • Part 2: Useful features of the API • Part 3: Advanced features of the API (scaling to large numbers of processors)

  5. Part 1: OpenMP API continued • More details on data scoping • Schedule clause: iteration scheduling options

  6. Reminder • Recall private variables are uninitialized • Motivation: no need to copy in from the serial section • Private variables do not carry a value into the serial parts of the code • Motivation: no need to copy from parallel to serial • The API provides two mechanisms to circumvent this issue • FIRSTPRIVATE • LASTPRIVATE

  7. FIRSTPRIVATE • Declaring a variable FIRSTPRIVATE will ensure that its value is copied in from any prior piece of serial code • However (of course) if the variable is not initialized in the serial section it will remain uninitialized • Happens only once for a given thread set • Try to avoid writing to variables declared FIRSTPRIVATE

  8. FIRSTPRIVATE example • The lower bound of the values in r() is set to the value of a (5.0 here); without the FIRSTPRIVATE clause a would be uninitialized inside the loop (effectively a=0.0)
      a=5.0
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& FIRSTPRIVATE(a)
      do i=1,n
        r(i)=max(a,r(i))
      end do

  9. LASTPRIVATE • Occasionally it may be necessary to know the last value of a variable from the end of the loop • LASTPRIVATE variables will initialize the value of the variable in the serial section using the last (sequential) value of the variable from the parallel loop
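A minimal sketch of the behaviour (not from the original slides – the array r, its size n, and the initialization loop are illustrative): after the parallel loop, a carries the value it was given on the final sequential iteration i=n back into the serial code.
      program lastpriv
      integer i,n
      parameter (n=8)
      real a,r(n)
      do i=1,n
        r(i)=real(i)
      end do
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& LASTPRIVATE(a)
      do i=1,n
        a=r(i)
      end do
C     a now holds r(n), the value from the last sequential iteration
      write(*,*) 'last value of a = ',a
      end
Without the LASTPRIVATE clause, a would be undefined after the parallel loop.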

  10. Important Caveat • LASTPRIVATE values are taken on the very last (sequential) iteration • Suppose an array is declared LASTPRIVATE • Any values written in the last iteration will be stored • Values written in previous iterations (if untouched in the last iteration) will be lost • These parts of the array will be uninitialized when entering the serial section • No easy way of determining what these values should be

  11. ORDERED • Synchronization and ordering instruction • Suppose you have a small section of code that always needs to be executed in sequential order • However, the remaining work can be done in any order • Adding the ORDERED clause to the parallel do and enclosing that section in ORDERED/END ORDERED directives forces threads to execute it in sequential iteration order

  12. Example
C$OMP PARALLEL DO
C$OMP& ORDERED
      do i=1,n
        call work(i)
      end do

      subroutine work(k)
C$OMP ORDERED
      write(*,*) k
C$OMP END ORDERED
      return
      end
Potentially useful if I/O needs to follow a certain order

  13. COPYIN • Common blocks and global variables in C/C++ are typically shared objects, but if necessary can be declared private • The THREADPRIVATE directive is used to make the block private • In a parallel do section COPYIN will then ensure that all threads are initialized with the same values as in the serial section of the code • `FIRSTPRIVATE’ for common blocks/globals

  14. THREADPRIVATE • COPYIN needs to be used in conjunction with THREADPRIVATE • Allows a thread to keep its own private variables that are visible in every parallel section of the program • Thread-local common blocks are scoped as THREADPRIVATE where they are initially declared:
      common /cblock/…
C$OMP THREADPRIVATE(/cblock/)
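A minimal sketch of the two directives together (the common block name cblock follows the slide; the variable scale and the loop are illustrative): each thread keeps its own copy of the common block, and COPYIN initializes every copy from the master thread's value on entry to the parallel region.
      program tpdemo
      integer i
      real scale
      common /cblock/ scale
C$OMP THREADPRIVATE(/cblock/)
      scale=2.0
C$OMP PARALLEL DO COPYIN(/cblock/)
      do i=1,4
C       every thread starts with scale=2.0 in its private copy of /cblock/
        write(*,*) 'iteration ',i,' scale = ',scale
      end do
      end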

  15. SCHEDULE • This is the mechanism for determining how work is spread among threads • Important for ensuring that work is spread evenly among the threads – just giving each thread the same number of iterations does not guarantee they all complete at the same time • Four types of scheduling are possible: STATIC, DYNAMIC, GUIDED, RUNTIME

  16. Remember!!!! • When running in parallel you are only as fast as your slowest thread • In the example, the total work is 40 seconds and there are 4 CPUs • The best possible time would be 40/4 = 10 secs • To achieve that, every thread would have to take exactly 10 secs • The figure shows poor load balance: the slowest thread takes 16 secs, so the speed-up is only 40/16 = 2.5 despite using 4 processors

  17. STATIC scheduling • Simplest of the four • If SCHEDULE is unspecified, STATIC scheduling will result • Default behaviour is to simply divide the iterations evenly among the threads, ~n/(# threads) each • SCHEDULE(STATIC,chunksize) creates a cyclic distribution of chunks of iterations

  18. Comparison (16 iterations over 4 threads)
STATIC, no chunksize – THREAD 1: 1 2 3 4; THREAD 2: 5 6 7 8; THREAD 3: 9 10 11 12; THREAD 4: 13 14 15 16
STATIC, chunksize=1 – THREAD 1: 1 5 9 13; THREAD 2: 2 6 10 14; THREAD 3: 3 7 11 15; THREAD 4: 4 8 12 16

  19. Chunksize & cache line issues • If you are accessing arrays using the loop index, as below, ensure chunksize > words in a cache line – false sharing results otherwise
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,..), SHARED(a,..)
C$OMP& SCHEDULE(STATIC,1)
      do i=1,n
        **work**
        a(i)=…
      end do

  20. Drawbacks to STATIC scheduling • Naïve method of achieving good load balance • Works well if the work in each iteration is constant • If work varies by factor ~10 then usually cyclic load balancing can help • If work varies by larger factors then dynamic balancing is usually better

  21. Example [plot of execution time (secs) versus Ncpu] Gains become greater with a larger number of threads

  22. DYNAMIC scheduling • DYNAMIC scheduling is a personal favourite • Specify using SCHEDULE(DYNAMIC,chunksize) • Simple implementation of master-worker type distribution of iterations • Master thread passes off chunksize iterations at a time to the workers • Not a silver bullet: if the load imbalance is too severe (i.e. one thread takes longer than the rest combined) an algorithm rewrite is necessary • Also not good if you need a regular access pattern for data locality
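A minimal sketch of the clause, in the style of the earlier loop examples (r, n, and the routine work_varies are illustrative placeholders, not part of the original slides):
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& SCHEDULE(DYNAMIC,4)
      do i=1,n
C       work_varies is a placeholder for work whose cost changes per iteration;
C       each thread grabs the next chunk of 4 iterations as soon as it is free
        call work_varies(i,r)
      end do
A larger chunksize reduces scheduling overhead but gives coarser load balancing; chunksize 1 gives the finest balance at the highest overhead.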

  23. Master-Worker Model [diagram: THREAD 1, THREAD 2 and THREAD 3 each send a REQUEST to the Master thread, which hands out the next chunk of iterations]

  24. GUIDED scheduling • GUIDED scheduling is a bit of a compromise • The iteration space is divided up into exponentially decreasing chunks • Final chunksize is usually 1, unless set by the programmer • Chunks of work are dynamically obtained • Works quite well provided the work per iteration is constant – if it is unknown, DYNAMIC is better
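A minimal sketch, again reusing the illustrative r and n from the earlier examples; the 2 sets the smallest chunk the runtime will hand out as the chunk sizes shrink:
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& SCHEDULE(GUIDED,2)
      do i=1,n
C       chunk sizes start large and shrink roughly exponentially down to 2
        r(i)=sqrt(real(i))
      end do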

  25. GUIDED scheduling [plot of the `chunk' size given to each thread versus # of iterations assigned]

  26. RUNTIME • RUNTIME allows you to specify the type of scheduling at execution time using an environment variable • Useful if the type of parallelism you need is determined by a data file you load in • e.g. a dataset that is irregularly populated might require dynamic scheduling to load balance well, while a regularly populated dataset can simply be statically load balanced
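A minimal sketch (r and n illustrative as before): the loop is compiled once and the schedule actually used is read from the OMP_SCHEDULE environment variable when the program starts.
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& SCHEDULE(RUNTIME)
      do i=1,n
C       schedule type and chunksize are taken from OMP_SCHEDULE at run time
        r(i)=2.0*r(i)
      end do
Before running, set for example OMP_SCHEDULE to "dynamic,4" or "static,8" in the environment (export OMP_SCHEDULE="dynamic,4" under bash, setenv OMP_SCHEDULE "dynamic,4" under csh); if the variable is unset, the implementation's default schedule is used.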

  27. Comparison of scheduling options

  28. Lock Routines • CRITICAL sections and ATOMIC provide simple access to “locks” • However, they are limited in functionality • OpenMP also provides a set of lock routines that can be combined with your source code to provide more sophisticated locking behaviour

  29. How to use lock routines • Locks follow a “create-test-set-unset-destroy” pattern • Any lock variable must first be initialized using OMP_INIT_LOCK • Once it is initialized, threads trying to acquire the lock test to see whether it is available • If it is free, the lock is set by the thread and unset at the end of the locked section of code

  30. Difference between SET and TEST • Remember: OMP_SET_LOCK blocks execution until the lock is available; OMP_TEST_LOCK does not block execution and does not guarantee the lock will be acquired – it allows you to do other work while the lock is unavailable

  31. Example
      program lock
      external omp_test_lock
      logical omp_test_lock
      integer lck
      call OMP_INIT_LOCK(lck)               ! create the lock
C$OMP PARALLEL SHARED(lck) PRIVATE(id)
      id=OMP_GET_THREAD_NUM()
      call OMP_SET_LOCK(lck)                ! blocks until the lock is free, then sets it
      write(*,*) 'I am thread ',id
      call OMP_UNSET_LOCK(lck)              ! unset the lock
      do while (.NOT. OMP_TEST_LOCK(lck))   ! sets the lock if free, else returns without blocking
        call spin()
      end do
      call work()
      call OMP_UNSET_LOCK(lck)              ! unset the lock
C$OMP END PARALLEL
      call OMP_DESTROY_LOCK(lck)            ! destroy the lock
      end

  32. Pitfalls of using locks • Must be careful to avoid deadlocking • Happens when one thread acquires a lock but is unable to free it (easy to do with two locks) • The same issue occurs in MPI

  33. Example of a deadlock
      call OMP_INIT_LOCK(lcka)
      call OMP_INIT_LOCK(lckb)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      call OMP_SET_LOCK(lcka)
      call OMP_SET_LOCK(lckb)
      call use_a_and_b()
      call OMP_UNSET_LOCK(lckb)
      call OMP_UNSET_LOCK(lcka)
C$OMP SECTION
      call OMP_SET_LOCK(lckb)
      call OMP_SET_LOCK(lcka)
      call use_b_and_a()
      call OMP_UNSET_LOCK(lcka)
      call OMP_UNSET_LOCK(lckb)
C$OMP END SECTIONS
Nested locks acquired in opposite orders – can result in a deadlock

  34. Performance issues - False Sharing • You may parallelize your algorithm and find performance is less than stellar: [plot of speed-up versus Ncpu]

  35. Example • A possible cause of poor performance is something called `false sharing'. Simple code which sums the rows of a matrix:
      integer m,n,i,j
      real a(m,n),s(m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
        s(i)=0.0
        do j=1,n
          s(i)=s(i)+a(i,j)
        end do
      end do

  36. Execution time line • Set m=4, what happens in each thread? • Since memory is laid out in four-word cache lines, at each stage all four threads are fighting for the same cache line
t=0: s(1)=0.0           s(2)=0.0           s(3)=0.0           s(4)=0.0
t=1: s(1)=s(1)+a(1,1)   s(2)=s(2)+a(2,1)   s(3)=s(3)+a(3,1)   s(4)=s(4)+a(4,1)
t=2: s(1)=s(1)+a(1,2)   s(2)=s(2)+a(2,2)   s(3)=s(3)+a(3,2)   s(4)=s(4)+a(4,2)

  37. Cache line invalidation • For each thread, prior to the next operation on s(), it must retrieve a new version of the s(1:4) cache line from main memory (its own copy of s(1:4) has been invalidated by the other threads' writes) • Fetches from anywhere other than the local cache are much slower • The result is a significantly increased run time

  38. Simple solution • Just need to spread out the elements of s() so that each of them sits in its own cache line:
      integer m,n,i,j
      real a(m,n),s(32,m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
        s(1,i)=0.0
        do j=1,n
          s(1,i)=s(1,i)+a(i,j)
        end do
      end do

  39. Layout of s(,) [diagram: with the padding, consecutive items of interest s(1,i-1), s(1,i), s(1,i+1), s(1,i+2) are each separated by 32 words, i.e. multiple 8-word cache lines]

  40. Tips to avoid false sharing • Minimize the number of variables that are shared • Segregate rarely updated variables from those that are updated frequently (“volatile”) • Isolate the volatile items in separate cache lines • You have to accept the waste of memory to improve performance

  41. OpenMP subroutines and functions (s = subroutine, f = function; C/C++ versions are all lower case)
OMP_SET_NUM_THREADS (s)
OMP_GET_NUM_THREADS (f)
OMP_GET_MAX_THREADS (f)
OMP_GET_THREAD_NUM (f)
OMP_GET_NUM_PROCS (f)
OMP_IN_PARALLEL (f)
OMP_SET_DYNAMIC (s)
OMP_GET_DYNAMIC (f)
OMP_SET_NESTED (s)
OMP_GET_NESTED (f)

  42. Omp_lib • In f90 include the following: use omp_lib

  43. Useful functions/subroutines • OMP_SET_NUM_THREADS is useful to change the number of threads of execution: call OMP_SET_NUM_THREADS(num_threads) in Fortran, or void omp_set_num_threads(int num_threads) in C/C++ • However, it will accept values higher than the number of CPUs – which will result in poor execution times

  44. OMP_GET_NUM_THREADS • Returns the number of threads currently being used for execution • Will obviously produce a different result when executed in a parallel loop, versus a serial part of the code – be careful!
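A minimal sketch of the pitfall (the CRITICAL section is only there to stop the output lines from interleaving): the call returns 1 in the serial part of the code and the team size inside the parallel region.
      program numthr
      integer OMP_GET_NUM_THREADS
      external OMP_GET_NUM_THREADS
C     in the serial section this prints 1
      write(*,*) 'serial:   ',OMP_GET_NUM_THREADS()
C$OMP PARALLEL
C$OMP CRITICAL
C     inside the parallel region it returns the number of threads in the team
      write(*,*) 'parallel: ',OMP_GET_NUM_THREADS()
C$OMP END CRITICAL
C$OMP END PARALLEL
      end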

  45. OMP_GET_MAX_THREADS • While OMP_GET_NUM_THREADS returns the number of threads currently being used, OMP_GET_MAX_THREADS returns the maximum possible number • This value will be the same whether the code is executing in a serial or parallel section • Remember: that is not true for OMP_GET_NUM_THREADS!

  46. OMP_GET_THREAD_NUM • Very useful function – returns your thread number from 0 to (number of threads) – 1 • Can be used to control access to data by using the thread number to index to start and end points of a section • Can also be useful in debugging race conditions
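A minimal sketch of using the thread number to hand each thread its own contiguous block of an array (n and the block arithmetic are illustrative, not from the slides):
      program slices
      integer i,n,myid,nthr,chunk,istart,iend
      parameter (n=1000)
      integer OMP_GET_THREAD_NUM,OMP_GET_NUM_THREADS
      external OMP_GET_THREAD_NUM,OMP_GET_NUM_THREADS
      real r(n)
C$OMP PARALLEL SHARED(r)
C$OMP& PRIVATE(myid,nthr,chunk,istart,iend,i)
      myid=OMP_GET_THREAD_NUM()
      nthr=OMP_GET_NUM_THREADS()
C     divide 1..n into contiguous blocks, one per thread
      chunk=(n+nthr-1)/nthr
      istart=myid*chunk+1
      iend=min(n,istart+chunk-1)
      do i=istart,iend
        r(i)=real(myid)
      end do
C$OMP END PARALLEL
      write(*,*) r(1),r(n)
      end
Thread 0 fills r(1:chunk), thread 1 fills r(chunk+1:2*chunk), and so on; the min() guards the final block when n is not an exact multiple of the thread count.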

  47. Environment variables • The OpenMP standard defines a number of environment variables (some of which we have met): OMP_NUM_THREADS, OMP_SCHEDULE, OMP_DYNAMIC, OMP_NESTED

  48. Environment variables • All of these variables can be overridden by using the subroutines we discussed (with the exception of the OMP_SCHEDULE variable) • OMP_DYNAMIC is set to false by default • OMP_NESTED is set to false by default • If you are using nesting it is probably safer to ensure you turn on nesting within the code
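A minimal sketch of turning nesting on from within the code rather than relying on OMP_NESTED (whether nested regions actually receive extra threads remains up to the implementation):
      program nest
      logical OMP_GET_NESTED
      external OMP_GET_NESTED
      call OMP_SET_NESTED(.true.)
      if (OMP_GET_NESTED()) write(*,*) 'nested parallelism enabled'
      end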

  49. Thread Stack size • One of the most significant environment variables is not defined within the OpenMP standard: the thread stack size • Each parallel thread may require a large stack on which to hold its private variables • Typical sizes are 1-8 MB, but certain codes may require more • I often run problems where I need over 100 MB of thread stack (for example)

  50. Quick example • Consider the following code: you are placing 128^3 * 3 words (= 24 MB) of private data on each thread's stack
      real r(3,2097152)
C$OMP PARALLEL DO
C$OMP& PRIVATE(r,i)
      do i=1,10
        call work(r)
      end do
