
Computational Methods in Physics PHYS 3437


Presentation Transcript


  1. Computational Methods in Physics PHYS 3437 Dr Rob Thacker Dept of Astronomy & Physics (MM-301C) thacker@ap.smu.ca

  2. Today’s Lecture • Recap from end of last lecture • Some technical details related to parallel programming • Data dependencies • Race conditions • Summary of other clauses you can use in setting up parallel loops

  3. Recap • Comment pragmas for FORTRAN – the ampersand is necessary for continuation lines • C$OMP PARALLEL DO denotes a region of code for parallel execution • Good programming practice: declare the nature of all variables (DEFAULT(NONE)) • Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously • Thread PRIVATE variables: each thread must have its own copy of this variable (in this case i is the only private variable)

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

  4. SHARED and PRIVATE • The most commonly used clauses, necessary to ensure correct execution • PRIVATE: any variable declared as private will be local to a given thread and is inaccessible to others (it is also uninitialized) • This means that if you have a variable, say t, in the serial section of the code and then use it in a loop, the value of t in the loop will not carry over the value of t from the serial part (see the sketch below) • Watch out for this – but there is a way around it… • SHARED: any variable declared as shared will be accessible by all other threads of execution
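
A minimal sketch of the pitfall with t described above (the variable t, the array w and the bound n are hypothetical, for illustration only):

      t = 2.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,t), SHARED(w,n)
      do i=1,n
C        Each thread's private copy of t is uninitialized here;
C        it does NOT hold the value 2.0 set in the serial section.
         w(i) = t*float(i)
      end do

Declaring t as FIRSTPRIVATE(t) instead (next slides) would copy the serial value 2.0 into each thread's private copy.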

  5. Example • The SHARED and PRIVATE specifications can be long! C$OMP& PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz, C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf, C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd, C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz, C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2, C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp, C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow, C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i, C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac, C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin, C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i, C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi, C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij, C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32, C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp, C$OMP& Gro)

  6. FIRSTPRIVATE • Declaring a variable FIRSTPRIVATE will ensure that its value is copied in from any prior piece of serial code • However (of course) if the variable is not initialized in the serial section it will remain uninitialized • The copy happens only once for a given thread set (when the parallel loop starts), not once per iteration • Try to avoid writing to variables declared FIRSTPRIVATE

  7. FIRSTPRIVATE example • The lower bound of the values in r is set to the value of a; without the FIRSTPRIVATE clause the private copy of a would be uninitialized (0.0) inside the loop

      a=5.0
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& FIRSTPRIVATE(a)
      do i=1,n
        r(i)=max(a,r(i))
      end do

  8. LASTPRIVATE • Occasionally it may be necessary to know the last value a variable took in the loop • After the loop, a LASTPRIVATE variable holds the value from the last (sequential) iteration of the parallel loop, so that value carries into the serial section (a sketch follows below)
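
A minimal sketch of LASTPRIVATE, assuming arrays x and y and a bound n purely for illustration:

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(x,y,n)
C$OMP& LASTPRIVATE(t)
      do i=1,n
         t = x(i)*y(i)
      end do
C     After the loop t holds the value from the sequentially last
C     iteration (i=n), exactly as it would in a serial run.
      print *, t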

  9. Default behaviour • You can actually omit the SHARED and PRIVATE clauses – what is the expected behaviour? • The parallel loop index is private by default • Everything else – scalars and arrays alike – defaults to shared • Relying on the defaults is bad practice in my opinion – specify the scoping for everything

  10. DEFAULT • I recommend using DEFAULT(NONE) at all times • Forces explicit scoping of every variable in the loop • Alternatively, you can use DEFAULT(SHARED) or DEFAULT(PRIVATE) to specify that un-scoped variables will default to the particular type chosen • e.g. choosing DEFAULT(PRIVATE) will ensure any un-scoped variable is private
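
For comparison, a sketch (reusing the X, Y, a, n names from the recap slide) in which DEFAULT(SHARED) means only the loop index needs explicit scoping – with DEFAULT(NONE) every variable would have to be listed:

C$OMP PARALLEL DO
C$OMP& DEFAULT(SHARED)
C$OMP& PRIVATE(i)
      do i=1,n
         Y(i) = a*X(i) + Y(i)
      end do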

  11. The Parallel Do Pragmas • So far we've considered a small subset of functionality • Before we talk more about data dependencies, let's look briefly at what other clauses can be used in a parallel do loop • Besides PRIVATE and SHARED variables there are a number of other clauses that can be applied

  12. Loop Level Parallelism in more detail • For each parallel do (for) pragma, the following clauses are possible:

C/C++:   private, shared, firstprivate, lastprivate, reduction, ordered, schedule, copyin
FORTRAN: PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, ORDERED, SCHEDULE, COPYIN, DEFAULT

(Red = most frequently used; clauses in italics we have already seen.)

  13. More background on data dependencies • Suppose you try to parallelize the following loop • It won't work as written, since iteration i depends upon iteration i-1 and thus we can't start anything in parallel • To see this explicitly, let n=20 and start thread 1 at i=1 and thread 2 at i=11: thread 1 sets Y(1)=1.0 and thread 2 sets Y(11)=1.0 (which is wrong – it should be 11.0!)

      c=0.0
      do i=1,n
        c=c+1.0
        Y(i)=c
      end do

  14. Simple solution • This loop can easily be re-written in a way that can be parallelized (a parallel version is sketched below): • There is no longer any dependence on the previous iteration • Private variables: i; Shared variables: Y(), c, n

      c=0.0
      do i=1,n
        Y(i)=c+float(i)
      end do
      c=c+n
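
A sketch of the parallel version of the rewritten loop, scoped in the style of the recap slide:

      c = 0.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(Y,c,n)
      do i=1,n
C        c is only read inside the loop, so SHARED is safe here
         Y(i) = c + float(i)
      end do
      c = c + n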

  15. Types of Data Dependencies • Suppose we have operations O1, O2, with O1 executed before O2 • True Dependence: O2 has a true dependence on O1 if O2 reads a value written by O1 • Anti-Dependence: O2 has an anti-dependence on O1 if O2 writes a value read by O1 • Output Dependence: O2 has an output dependence on O1 if O2 writes a variable written by O1

  16. Examples • True dependence: A1=A2+A3 followed by B1=A1+B2 • Anti-dependence: B1=A1+B2 followed by A1=C2 • Output dependence: B1=5 followed by B1=2

  17. Dealing with Data Dependencies • Any loop where an iteration depends upon a previous one has a potential problem • Any result which depends upon the order of the iterations will be a problem • Good first test of whether something can be parallelized: reverse the loop iteration order – if the result changes, there is an order dependence • Not all data dependencies can be eliminated • Accumulations of variables (e.g. the sum of the elements in an array) can be dealt with easily

  18. Accumulations • Consider the following loop: it apparently has a data dependency – however, each thread can accumulate a partial sum independently • OpenMP provides an explicit interface for this kind of operation ("REDUCTION"; a sketch follows below)

      a=0.0
      do i=1,n
        a=a+X(i)
      end do
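
A sketch of the same loop using the REDUCTION clause introduced on the next slide:

      a = 0.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,n)
C$OMP& REDUCTION(+:a)
      do i=1,n
         a = a + X(i)
      end do

Each thread accumulates its own partial sum in a; the partial sums are combined when the loop ends.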

  19. REDUCTION clause • This clause deals with parallel versions of the loops below • The outcome is determined by a 'reduction' over the values held by each thread • e.g. the max over a set equals the max of the maxima of its subsets: max(A), where A = A1 ∪ A2 ∪ … ∪ Ak, is equal to max( max(A1), max(A2), …, max(Ak) )

      do i=1,N
        a=max(a,b(i))
      end do

      do i=1,N
        a=min(a,b(i))
      end do

      do i=1,n
        a=a+b(i)
      end do

  20. Examples • Syntax: REDUCTION(OP: variable) where OP = max, min, +, -, * (& logical operations)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i=1,N
        a=max(a,b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i=1,N
        a=min(a,b(i))
      end do

  21. What is REDUCTION actually doing? • Saving you from writing more code • The reduction clause generates an array of the reduction variables, and each thread is responsible for a certain element in the array • The final reduction over all the array elements (when the loop is finished) is performed transparently to the user
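
Conceptually it is doing something like the hand-written version below – a sketch only, using a per-thread partial-sum array asum and the runtime routines omp_get_max_threads/omp_get_thread_num (the array b, the bound n and the result a are the names from the REDUCTION slides):

      integer MAXT
      parameter (MAXT=64)
      real asum(MAXT)
      integer it, nt, i
      integer omp_get_max_threads, omp_get_thread_num
C     One partial-sum slot per thread (assumes <= MAXT threads)
      nt = omp_get_max_threads()
C$OMP PARALLEL PRIVATE(it,i) SHARED(asum,b,n)
      it = omp_get_thread_num() + 1
      asum(it) = 0.0
C$OMP DO
      do i=1,n
         asum(it) = asum(it) + b(i)
      end do
C$OMP END DO
C$OMP END PARALLEL
C     Final (serial) reduction over the per-thread partial sums
      a = 0.0
      do it=1,nt
         a = a + asum(it)
      end do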

  22. Initialization • Reduction variables are initialized as follows (from the standard):

Operator   Initialization
+          0
*          1
-          0
MAX        Smallest representable number
MIN        Largest representable number

  23. Race Conditions • A common operation is to resolve a spatial position into an array index: consider the following loop • It looks innocent enough – but suppose two particles have the same position… • r(): array of positions; A(): array that is modified using information from r()

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A,n)
      do i=1,n
        j=int(r(i))
        A(j)=A(j)+1.
      end do

  24. Race Conditions: A concurrency problem • Two different threads of execution can concurrently attempt to update the same memory location (time runs downwards):

Start: A(j)=1.
Thread 1: Gets A(j)=1. Adds 1. A(j)=2.     Thread 2: Gets A(j)=1. Adds 1. A(j)=2.
Thread 1: Puts A(j)=2.                      Thread 2: Puts A(j)=2.
End state: A(j)=2. INCORRECT – the serial result would have been A(j)=3.

  25. Dealing with Race Conditions • We need a mechanism to ensure that updates to single variables occur within a critical section • Any thread entering a critical section blocks all others from entering it • Critical sections can be established by using "lock variables" • Think of lock variables as preventing more than one thread from working on a particular piece of code at any one time • Just like a lock on a door prevents other people from entering a room
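
OpenMP's runtime library exposes lock variables directly; a minimal sketch of the idea (the loop body and the shared counter a are schematic, and the lock-variable declaration follows the omp_lib.h include file):

      include 'omp_lib.h'
      integer (kind=omp_lock_kind) lck
      call omp_init_lock(lck)
C$OMP PARALLEL DO PRIVATE(i) SHARED(lck,a,n)
      do i=1,n
C        ... work that can proceed in parallel ...
         call omp_set_lock(lck)
C        Only one thread at a time can hold the lock
         a = a + 1.
         call omp_unset_lock(lck)
      end do
      call omp_destroy_lock(lck)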

  26. Deadlocks: The pitfall of locking • You must ensure you do not create a situation where threads holding resources request each other's resources, so that a deadlock results • Nested locks are a classic example of this • The same problem can occur with multiple processes – the 'deadly embrace' • [Diagram: Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1 – neither can proceed.]

  27. Solutions • Need to ensure memory reads/writes occur without any overlap • If the access occurs to a single region, we can use a critical section: • Only one thread will be allowed inside the critical section at a time • I have given a name to the critical section but you don't have to do this

      do i=1,n
        **work**
C$OMP CRITICAL(lckx)
        a=a+1.
C$OMP END CRITICAL(lckx)
      end do

  28. ATOMIC • If all you want to do is ensure the correct update of one variable, you can use the atomic update facility: • Exactly the same as a critical section around one single update statement

C$OMP PARALLEL DO
      do i=1,n
        **work**
C$OMP ATOMIC
        a=a+1.
      end do

  29. Locking can be inefficient • If other threads are waiting to enter the critical section then the program may even degenerate to a serial code! • Make sure there is much more work outside the locked region than inside it! • [Diagram: FORK/JOIN timeline of a parallel section where each thread waits for the lock before being able to proceed (legend: doing work vs. waiting for lock) – a complete disaster.]

  30. COPYIN & ORDERED • Suppose you have a small section of code that must always be executed in sequential order • However, the remaining work can be done in any order • Adding the ORDERED clause to the parallel do, and placing the ORDERED directive around that section of code, forces the threads to execute it sequentially (see the sketch below) • If a common block is specified as private in a parallel do, COPYIN will ensure that all threads are initialized with the same values as in the serial section of the code • Essentially 'FIRSTPRIVATE' for common blocks/globals
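
A sketch of the ORDERED mechanism – here the ordered part just writes results in iteration order, and expensive_work is a hypothetical thread-safe function used for illustration:

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,tmp), SHARED(X,n)
C$OMP& ORDERED
      do i=1,n
C        This part runs in parallel, in any order
         tmp = expensive_work(X(i))
C$OMP ORDERED
C        This part executes in sequential iteration order
         write(*,*) i, tmp
C$OMP END ORDERED
      end do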

  31. Subtle point about running in parallel • When running in parallel you are only as fast as your slowest thread • In the example, the total work is 40 seconds and we have 4 CPUs • The best possible time would therefore be 40/4 = 10 secs • All four threads have to take 10 secs each to give that maximum speed-up • Example of poor load balance: with the slowest thread taking 16 secs we get only a 40/16 = 2.5 speed-up despite using 4 processors

  32. SCHEDULE • This is the mechanism for determining how work is spread among threads • Important for ensuring that work is spread evenly among the threads – just giving each thread the same number of iterations may not guarantee they all complete at the same time • Four types of scheduling are possible: STATIC, DYNAMIC, GUIDED, RUNTIME

  33. STATIC scheduling • The simplest of the four • If SCHEDULE is unspecified, STATIC scheduling results • The default behaviour is to simply divide the iterations evenly among the threads, ~n/(# threads) contiguous iterations each • SCHEDULE(STATIC,chunksize) instead deals chunks of the given size out to the threads cyclically

  34. Comparison (16 iterations, 4 threads)

STATIC, no chunksize:
  THREAD 1: iterations 1–4
  THREAD 2: iterations 5–8
  THREAD 3: iterations 9–12
  THREAD 4: iterations 13–16

STATIC, chunksize=1 (cyclic):
  THREAD 1: iterations 1, 5, 9, 13
  THREAD 2: iterations 2, 6, 10, 14
  THREAD 3: iterations 3, 7, 11, 15
  THREAD 4: iterations 4, 8, 12, 16

  35. DYNAMIC scheduling • DYNAMIC scheduling is a personal favourite • Specify it using SCHEDULE(DYNAMIC,chunksize) (see the sketch below) • A simple implementation of a master-worker style distribution of iterations • The master thread passes off chunks of chunksize iterations to the workers • Not a silver bullet: if the load imbalance is too severe (i.e. one thread takes longer than all the rest combined) an algorithm rewrite is necessary • Also not good if you need a regular access pattern for data locality
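
A sketch of the clause in place, with a chunksize of 10 (the routine do_variable_work is hypothetical and stands for an iteration whose cost varies):

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,n)
C$OMP& SCHEDULE(DYNAMIC,10)
      do i=1,n
C        Threads grab the next block of 10 iterations as they finish
         call do_variable_work(X,i)
      end do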

  36. Master-Worker Model • [Diagram: Threads 1–3 each send a REQUEST to the Master, which hands out the next chunk of iterations in response.]

  37. Other ways to use OpenMP • We've really only skimmed the surface of what you can do • However, we have covered the important details • OpenMP provides programming models beyond the loop-level approach used here • It isn't that much harder, but you need to think slightly differently • Check out www.openmp.org for more details

  38. Applying to algorithms used in the course • What could we apply OpenMP to? • Root finding algorithms are actually fundamentally serial! • Global bracket finder: subdivide the region and let each CPU search its allotted space in parallel • LU decomposition can be parallelized • Numerical integration can be parallelized (a sketch follows below) • ODE solvers are not usually good parallelization candidates, but it is problem dependent • MC methods usually (but not always) parallelize well
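
As an example of the numerical-integration case, a hedged sketch of a midpoint-rule quadrature parallelized with REDUCTION (the integrand function f, the limits a and b, and the number of panels m are assumptions for illustration; f must be thread-safe):

      h = (b - a)/float(m)
      s = 0.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,x), SHARED(a,h,m)
C$OMP& REDUCTION(+:s)
      do i=1,m
C        Midpoint of panel i
         x = a + (float(i)-0.5)*h
         s = s + f(x)
      end do
      s = s*h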

  39. Summary • The main difficulty in loop-level parallel programming is figuring out whether there are data dependencies or race conditions • Remember that variable values do not naturally carry into a parallel loop, or for that matter out of one – use FIRSTPRIVATE and LASTPRIVATE when you need to do this • SCHEDULE provides many options: use DYNAMIC when you have an unknown amount of work per iteration, and STATIC when you need a regular access pattern to an array

  40. Next Lecture • Introduction to visualization
