
Computational Methods in Physics PHYS 3437


Presentation Transcript


  1. Computational Methods in Physics PHYS 3437 Dr Rob Thacker Dept of Astronomy & Physics (MM-301C) thacker@ap.smu.ca

  2. Today’s Lecture • Recap from end of last lecture • Some technical details related to parallel programming • Data dependencies • Race conditions • Summary of other clauses you can use in setting up parallel loops

  3. Recap • Comment pragmas for FORTRAN – the ampersand is necessary for continuation lines • C$OMP PARALLEL DO denotes a region of code for parallel execution • Good programming practice: declare the nature of all variables (DEFAULT(NONE)) • Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously • Thread PRIVATE variables: each thread must have its own copy of this variable (in this case i is the only private variable)

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

  4. SHARED and PRIVATE • The most commonly used clauses, necessary to ensure correct execution • PRIVATE: any variable declared as private will be local to a given thread and is inaccessible to others (it is also uninitialized) • This means that if you have a variable, say t, in the serial section of the code and then use it in a loop, the value of t in the loop will not carry over the value of t from the serial part (see the sketch below) • Watch out for this – but there is a way around it… • SHARED: any variable declared as shared will be accessible by all other threads of execution
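
A minimal sketch of the pitfall with t described above (the variable t, the array w and the bound n are hypothetical, for illustration only):

      t = 2.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,t), SHARED(w,n)
      do i=1,n
C        Each thread's private copy of t is uninitialized here;
C        it does NOT hold the value 2.0 set in the serial section.
         w(i) = t*float(i)
      end do

Declaring t as FIRSTPRIVATE(t) instead (next slides) would copy the serial value 2.0 into each thread's private copy.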

  5. Example • The SHARED and PRIVATE specifications can be long! C$OMP& PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz, C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf, C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd, C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz, C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2, C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp, C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow, C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i, C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac, C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin, C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i, C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi, C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij, C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32, C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp, C$OMP& Gro)

  6. FIRSTPRIVATE • Declaring a variable FIRSTPRIVATE will ensure that its value is copied in from any prior piece of serial code • However (of course) if the variable is not initialized in the serial section it will remain uninitialized • The copy happens only once for a given thread set (when the parallel loop starts), not once per iteration • Try to avoid writing to variables declared FIRSTPRIVATE

  7. FIRSTPRIVATE example • The lower bound of the values in r is set to the value of a; without the FIRSTPRIVATE clause the private copy of a would be uninitialized (0.0) inside the loop

      a=5.0
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& FIRSTPRIVATE(a)
      do i=1,n
        r(i)=max(a,r(i))
      end do

  8. LASTPRIVATE • Occasionally it may be necessary to know the last value a variable took in the loop • After the loop, a LASTPRIVATE variable holds the value from the last (sequential) iteration of the parallel loop, so that value carries into the serial section (a sketch follows below)
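
A minimal sketch of LASTPRIVATE, assuming arrays x and y and a bound n purely for illustration:

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(x,y,n)
C$OMP& LASTPRIVATE(t)
      do i=1,n
         t = x(i)*y(i)
      end do
C     After the loop t holds the value from the sequentially last
C     iteration (i=n), exactly as it would in a serial run.
      print *, t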

  9. Default behaviour • You can actually omit the SHARED and PRIVATE clauses – what is the expected behaviour? • The parallel loop index is private by default • Everything else – scalars and arrays alike – defaults to shared • Relying on the defaults is bad practice in my opinion – specify the scoping for everything

  10. DEFAULT • I recommend using DEFAULT(NONE) at all times • Forces explicit scoping of every variable in the loop • Alternatively, you can use DEFAULT(SHARED) or DEFAULT(PRIVATE) to specify that un-scoped variables will default to the particular type chosen • e.g. choosing DEFAULT(PRIVATE) will ensure any un-scoped variable is private
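
For comparison, a sketch (reusing the X, Y, a, n names from the recap slide) in which DEFAULT(SHARED) means only the loop index needs explicit scoping – with DEFAULT(NONE) every variable would have to be listed:

C$OMP PARALLEL DO
C$OMP& DEFAULT(SHARED)
C$OMP& PRIVATE(i)
      do i=1,n
         Y(i) = a*X(i) + Y(i)
      end do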

  11. The Parallel Do Pragmas • So far we've considered a small subset of functionality • Before we talk more about data dependencies, let's look briefly at what other clauses can be used in a parallel do loop • Besides PRIVATE and SHARED variables there are a number of other clauses that can be applied

  12. Loop Level Parallelism in more detail • For each parallel do (for) pragma, the following clauses are possible:

C/C++:   private, shared, firstprivate, lastprivate, reduction, ordered, schedule, copyin
FORTRAN: PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, ORDERED, SCHEDULE, COPYIN, DEFAULT

(Red = most frequently used; clauses in italics we have already seen.)

  13. More background on data dependencies • Suppose you try to parallelize the following loop • It won't work as written, since iteration i depends upon iteration i-1 and thus we can't start anything in parallel • To see this explicitly, let n=20 and start thread 1 at i=1 and thread 2 at i=11: thread 1 sets Y(1)=1.0 and thread 2 sets Y(11)=1.0 (which is wrong – it should be 11.0!)

      c=0.0
      do i=1,n
        c=c+1.0
        Y(i)=c
      end do

  14. Simple solution • This loop can easily be re-written in a way that can be parallelized (a parallel version is sketched below): • There is no longer any dependence on the previous iteration • Private variables: i; Shared variables: Y(), c, n

      c=0.0
      do i=1,n
        Y(i)=c+float(i)
      end do
      c=c+n
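
A sketch of the parallel version of the rewritten loop, scoped in the style of the recap slide:

      c = 0.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(Y,c,n)
      do i=1,n
C        c is only read inside the loop, so SHARED is safe here
         Y(i) = c + float(i)
      end do
      c = c + n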

  15. Types of Data Dependencies • Suppose we have operations O1, O2, with O1 executed before O2 • True Dependence: O2 has a true dependence on O1 if O2 reads a value written by O1 • Anti-Dependence: O2 has an anti-dependence on O1 if O2 writes a value read by O1 • Output Dependence: O2 has an output dependence on O1 if O2 writes a variable written by O1

  16. Examples • True dependence: A1=A2+A3 followed by B1=A1+B2 • Anti-dependence: B1=A1+B2 followed by A1=C2 • Output dependence: B1=5 followed by B1=2

  17. Dealing with Data Dependencies • Any loop where an iteration depends upon a previous one has a potential problem • Any result which depends upon the order of the iterations will be a problem • Good first test of whether something can be parallelized: reverse the loop iteration order – if the result changes, there is an order dependence • Not all data dependencies can be eliminated • Accumulations of variables (e.g. the sum of the elements in an array) can be dealt with easily

  18. Accumulations • Consider the following loop: it apparently has a data dependency – however, each thread can accumulate a partial sum independently • OpenMP provides an explicit interface for this kind of operation ("REDUCTION"; a sketch follows below)

      a=0.0
      do i=1,n
        a=a+X(i)
      end do
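
A sketch of the same loop using the REDUCTION clause introduced on the next slide:

      a = 0.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,n)
C$OMP& REDUCTION(+:a)
      do i=1,n
         a = a + X(i)
      end do

Each thread accumulates its own partial sum in a; the partial sums are combined when the loop ends.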

  19. REDUCTION clause • This clause deals with parallel versions of the loops below • The outcome is determined by a 'reduction' over the values held by each thread • e.g. the max over a set equals the max of the maxima of its subsets: max(A), where A = A1 ∪ A2 ∪ … ∪ Ak, is equal to max( max(A1), max(A2), …, max(Ak) )

      do i=1,N
        a=max(a,b(i))
      end do

      do i=1,N
        a=min(a,b(i))
      end do

      do i=1,n
        a=a+b(i)
      end do

  20. Examples • Syntax: REDUCTION(OP: variable) where OP = max, min, +, -, * (& logical operations)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i=1,N
        a=max(a,b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i=1,N
        a=min(a,b(i))
      end do

  21. What is REDUCTION actually doing? • Saving you from writing more code • The reduction clause generates an array of the reduction variables, and each thread is responsible for a certain element in the array • The final reduction over all the array elements (when the loop is finished) is performed transparently to the user
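
Conceptually it is doing something like the hand-written version below – a sketch only, using a per-thread partial-sum array asum and the runtime routines omp_get_max_threads/omp_get_thread_num (the array b, the bound n and the result a are the names from the REDUCTION slides):

      integer MAXT
      parameter (MAXT=64)
      real asum(MAXT)
      integer it, nt, i
      integer omp_get_max_threads, omp_get_thread_num
C     One partial-sum slot per thread (assumes <= MAXT threads)
      nt = omp_get_max_threads()
C$OMP PARALLEL PRIVATE(it,i) SHARED(asum,b,n)
      it = omp_get_thread_num() + 1
      asum(it) = 0.0
C$OMP DO
      do i=1,n
         asum(it) = asum(it) + b(i)
      end do
C$OMP END DO
C$OMP END PARALLEL
C     Final (serial) reduction over the per-thread partial sums
      a = 0.0
      do it=1,nt
         a = a + asum(it)
      end do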

  22. Initialization • Reduction variables are initialized as follows (from the standard):

Operator   Initialization
+          0
*          1
-          0
MAX        Smallest representable number
MIN        Largest representable number

  23. Race Conditions • A common operation is to resolve a spatial position into an array index: consider the following loop • It looks innocent enough – but suppose two particles have the same position… • r(): array of positions; A(): array that is modified using information from r()

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A,n)
      do i=1,n
        j=int(r(i))
        A(j)=A(j)+1.
      end do

  24. Race Conditions: A concurrency problem • Two different threads of execution can concurrently attempt to update the same memory location (time runs downwards):

Start: A(j)=1.
Thread 1: Gets A(j)=1. Adds 1. A(j)=2.     Thread 2: Gets A(j)=1. Adds 1. A(j)=2.
Thread 1: Puts A(j)=2.                      Thread 2: Puts A(j)=2.
End state: A(j)=2. INCORRECT – the serial result would have been A(j)=3.

  25. Dealing with Race Conditions • We need a mechanism to ensure that updates to single variables occur within a critical section • Any thread entering a critical section blocks all others from entering it • Critical sections can be established by using "lock variables" • Think of lock variables as preventing more than one thread from working on a particular piece of code at any one time • Just like a lock on a door prevents other people from entering a room
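
OpenMP's runtime library exposes lock variables directly; a minimal sketch of the idea (the loop body and the shared counter a are schematic, and the lock-variable declaration follows the omp_lib.h include file):

      include 'omp_lib.h'
      integer (kind=omp_lock_kind) lck
      call omp_init_lock(lck)
C$OMP PARALLEL DO PRIVATE(i) SHARED(lck,a,n)
      do i=1,n
C        ... work that can proceed in parallel ...
         call omp_set_lock(lck)
C        Only one thread at a time can hold the lock
         a = a + 1.
         call omp_unset_lock(lck)
      end do
      call omp_destroy_lock(lck)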

  26. Deadlocks: The pitfall of locking • You must ensure you do not create a situation where threads holding resources request each other's resources, so that a deadlock results • Nested locks are a classic example of this • The same problem can occur with multiple processes – the 'deadly embrace' • [Diagram: Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1 – neither can proceed.]

  27. Solutions • Need to ensure memory reads/writes occur without any overlap • If the access occurs to a single region, we can use a critical section: • Only one thread will be allowed inside the critical section at a time • I have given a name to the critical section but you don't have to do this

      do i=1,n
        **work**
C$OMP CRITICAL(lckx)
        a=a+1.
C$OMP END CRITICAL(lckx)
      end do

  28. ATOMIC • If all you want to do is ensure the correct update of one variable, you can use the atomic update facility: • Exactly the same as a critical section around one single update statement

C$OMP PARALLEL DO
      do i=1,n
        **work**
C$OMP ATOMIC
        a=a+1.
      end do

  29. Locking can be inefficient • If other threads are waiting to enter the critical section then the program may even degenerate to a serial code! • Make sure there is much more work outside the locked region than inside it! • [Diagram: FORK/JOIN timeline of a parallel section where each thread waits for the lock before being able to proceed (legend: doing work vs. waiting for lock) – a complete disaster.]

  30. COPYIN & ORDERED • Suppose you have a small section of code that must always be executed in sequential order • However, the remaining work can be done in any order • Adding the ORDERED clause to the parallel do, and placing the ORDERED directive around that section of code, forces the threads to execute it sequentially (see the sketch below) • If a common block is specified as private in a parallel do, COPYIN will ensure that all threads are initialized with the same values as in the serial section of the code • Essentially 'FIRSTPRIVATE' for common blocks/globals
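
A sketch of the ORDERED mechanism – here the ordered part just writes results in iteration order, and expensive_work is a hypothetical thread-safe function used for illustration:

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,tmp), SHARED(X,n)
C$OMP& ORDERED
      do i=1,n
C        This part runs in parallel, in any order
         tmp = expensive_work(X(i))
C$OMP ORDERED
C        This part executes in sequential iteration order
         write(*,*) i, tmp
C$OMP END ORDERED
      end do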

  31. Subtle point about running in parallel • When running in parallel you are only as fast as your slowest thread • In the example, the total work is 40 seconds and we have 4 CPUs • The best possible time would therefore be 40/4 = 10 secs • All four threads have to take 10 secs each to give that maximum speed-up • Example of poor load balance: with the slowest thread taking 16 secs we get only a 40/16 = 2.5 speed-up despite using 4 processors

  32. SCHEDULE • This is the mechanism for determining how work is spread among threads • Important for ensuring that work is spread evenly among the threads – just giving each thread the same number of iterations may not guarantee they all complete at the same time • Four types of scheduling are possible: STATIC, DYNAMIC, GUIDED, RUNTIME

  33. STATIC scheduling • The simplest of the four • If SCHEDULE is unspecified, STATIC scheduling results • The default behaviour is to simply divide the iterations evenly among the threads, ~n/(# threads) contiguous iterations each • SCHEDULE(STATIC,chunksize) instead deals chunks of the given size out to the threads cyclically

  34. Comparison (16 iterations, 4 threads)

STATIC, no chunksize:
  THREAD 1: iterations 1–4
  THREAD 2: iterations 5–8
  THREAD 3: iterations 9–12
  THREAD 4: iterations 13–16

STATIC, chunksize=1 (cyclic):
  THREAD 1: iterations 1, 5, 9, 13
  THREAD 2: iterations 2, 6, 10, 14
  THREAD 3: iterations 3, 7, 11, 15
  THREAD 4: iterations 4, 8, 12, 16

  35. DYNAMIC scheduling • DYNAMIC scheduling is a personal favourite • Specify it using SCHEDULE(DYNAMIC,chunksize) (see the sketch below) • A simple implementation of a master-worker style distribution of iterations • The master thread passes off chunks of chunksize iterations to the workers • Not a silver bullet: if the load imbalance is too severe (i.e. one thread takes longer than all the rest combined) an algorithm rewrite is necessary • Also not good if you need a regular access pattern for data locality
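
A sketch of the clause in place, with a chunksize of 10 (the routine do_variable_work is hypothetical and stands for an iteration whose cost varies):

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,n)
C$OMP& SCHEDULE(DYNAMIC,10)
      do i=1,n
C        Threads grab the next block of 10 iterations as they finish
         call do_variable_work(X,i)
      end do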

  36. Master-Worker Model • [Diagram: Threads 1–3 each send a REQUEST to the Master, which hands out the next chunk of iterations in response.]

  37. Other ways to use OpenMP • We've really only skimmed the surface of what you can do • However, we have covered the important details • OpenMP provides programming models beyond the loop-level approach used here • It isn't that much harder, but you need to think slightly differently • Check out www.openmp.org for more details

  38. Applying to algorithms used in the course • What could we apply OpenMP to? • Root finding algorithms are actually fundamentally serial! • Global bracket finder: subdivide the region and let each CPU search its allotted space in parallel • LU decomposition can be parallelized • Numerical integration can be parallelized (a sketch follows below) • ODE solvers are not usually good parallelization candidates, but it is problem dependent • MC methods usually (but not always) parallelize well
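
As an example of the numerical-integration case, a hedged sketch of a midpoint-rule quadrature parallelized with REDUCTION (the integrand function f, the limits a and b, and the number of panels m are assumptions for illustration; f must be thread-safe):

      h = (b - a)/float(m)
      s = 0.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,x), SHARED(a,h,m)
C$OMP& REDUCTION(+:s)
      do i=1,m
C        Midpoint of panel i
         x = a + (float(i)-0.5)*h
         s = s + f(x)
      end do
      s = s*h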

  39. Summary • The main difficulty in loop-level parallel programming is figuring out whether there are data dependencies or race conditions • Remember that variable values do not naturally carry into a parallel loop, or for that matter out of one – use FIRSTPRIVATE and LASTPRIVATE when you need to do this • SCHEDULE provides many options: use DYNAMIC when you have an unknown amount of work per iteration, and STATIC when you need a regular access pattern to an array

  40. Next Lecture • Introduction to visualization
