
Computational Methods in Astrophysics


Presentation Transcript


  1. Computational Methods in Astrophysics Dr Rob Thacker (AT319E) thacker@ap

  2. Column vs Row major storage • In FORTRAN the storage is such that A(3,N) is laid out as A(1,1),A(2,1),A(3,1),A(1,2),A(2,2),A(3,2),... in memory • The first index is contiguous in memory (column major) • In C/C++ A[N][3] is laid out as A[0][0],A[0][1],A[0][2],A[1][0],A[1][1],A[1][2],... • The last index is contiguous (row major)
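As a hedged illustration (not from the slides), here is a short C sketch of why loop order matters for a row-major array; the array name and sizes are invented for the example.

      #include <stdio.h>

      #define N 1024
      #define M 1024

      static double A[N][M];   /* row major: A[i][j] and A[i][j+1] are adjacent */

      int main(void)
      {
          double sum = 0.0;

          /* Cache-friendly traversal: the inner loop runs over the contiguous
             (last) index, so successive accesses stay within a cache line. */
          for (int i = 0; i < N; ++i)
              for (int j = 0; j < M; ++j)
                  sum += A[i][j];

          /* Swapping the loops (j outer, i inner) would stride through memory
             by M doubles per access and is typically much slower. */
          printf("sum = %f\n", sum);
          return 0;
      }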

  3. Loop Level Parallelism • Consider the single precision vector add-multiply operation Y=aX+Y (“SAXPY”)

FORTRAN:
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C/C++:
      for (i=1;i<=n;++i) {
        Y[i]+=a*X[i];
      }

      #pragma omp parallel for \
        private(i) shared(X,Y,n,a)
      for (i=1;i<=n;++i) {
        Y[i]+=a*X[i];
      }

  4. In more detail

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

• Comment pragmas for FORTRAN; the ampersand is necessary for continuation lines • PARALLEL DO denotes this is a region of code for parallel execution • DEFAULT(NONE) is good programming practice: you must declare the nature of all variables • Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously • Thread PRIVATE variables: each thread must have its own copy of this variable (in this case i is the only private variable)

  5. Reminder • Recall private variables are uninitialized on entry to the parallel region • Motivation: no need to copy in from the serial section • Private variables do not carry a value back into the serial parts of the code • Motivation: no need to copy from parallel to serial • The API provides two mechanisms to circumvent this issue, as sketched below • FIRSTPRIVATE • LASTPRIVATE
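A minimal C sketch (my addition, not from the slides) of how the two clauses address these issues; the variable names are invented for illustration.

      #include <stdio.h>

      int main(void)
      {
          int scale = 10;   /* initialized in the serial section */
          int last  = 0;

          /* firstprivate: each thread's private copy of 'scale' starts with the
             serial value 10.  lastprivate: after the loop, 'last' holds the value
             from the sequentially final iteration (i = 99). */
          #pragma omp parallel for firstprivate(scale) lastprivate(last)
          for (int i = 0; i < 100; ++i) {
              last = i * scale;
          }

          printf("last = %d\n", last);   /* prints 990 */
          return 0;
      }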

  6. Load balancing • When running in parallel you are only as fast as your slowest thread • In the example, the total work is 40 seconds and we have 4 CPUs • The best possible time is therefore 40/4 = 10 secs • To achieve that maximum speed-up, every thread must take exactly 10 secs • Example of poor load balance: if the slowest thread takes 16 secs, the speed-up is only 40/16 = 2.5 despite using 4 processors

  7. STATIC scheduling • Simplest of the four scheduling options • If SCHEDULE is unspecified, STATIC scheduling results • The default behaviour is to simply divide the iterations among the threads, ~n/(# threads) contiguous iterations each • SCHEDULE(STATIC,chunksize) instead deals out chunks of iterations cyclically, as in the sketch below
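A hedged C sketch (my addition, not from the slides) of the two forms of static scheduling described above; the loop body and array name are placeholders.

      #define N 16

      void static_schedules(double *x)
      {
          int i;

          /* Default STATIC: each thread gets one contiguous block of ~N/nthreads
             iterations (e.g. thread 0 gets 0..3 with 4 threads and N = 16). */
          #pragma omp parallel for schedule(static)
          for (i = 0; i < N; ++i)
              x[i] = 2.0 * x[i];

          /* STATIC with chunksize 1: iterations are dealt out cyclically, so
             thread 0 gets 0, 4, 8, 12, thread 1 gets 1, 5, 9, 13, and so on. */
          #pragma omp parallel for schedule(static, 1)
          for (i = 0; i < N; ++i)
              x[i] = 2.0 * x[i];
      }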

  8. Comparison (16 iterations on 4 threads)

STATIC, no chunksize:
  THREAD 1: iterations 1 2 3 4
  THREAD 2: iterations 5 6 7 8
  THREAD 3: iterations 9 10 11 12
  THREAD 4: iterations 13 14 15 16

STATIC, chunksize=1:
  THREAD 1: iterations 1 5 9 13
  THREAD 2: iterations 2 6 10 14
  THREAD 3: iterations 3 7 11 15
  THREAD 4: iterations 4 8 12 16

  9. Chunksize & Cache line issues • If you are accessing arrays using the loop index, e.g.

C$OMP PARALLEL DO
C$OMP& PRIVATE(i,..), SHARED(a,..)
C$OMP& SCHEDULE(STATIC,1)
      do i=1,n
        **work**
        a(i)=…
      end do

• Ensure chunksize > number of words in a cache line • Otherwise false sharing results (threads repeatedly invalidate each other's copy of the cache line holding a())

  10. Drawbacks to STATIC scheduling • It is a naïve method of achieving good load balance • Works well if the work in each iteration is constant • If the work varies by a factor of ~10 then cyclic distribution (small chunksize) can usually help • If the work varies by larger factors then dynamic balancing is usually better

  11. DYNAMIC scheduling • DYNAMIC scheduling is a personal favourite • Specify using SCHEDULE(DYNAMIC,chunksize) • Simple implementation of a master-worker style distribution of iterations • The master thread hands out chunks of chunksize iterations to worker threads as they become free • Not a silver bullet: if the load imbalance is too severe (i.e. one thread takes longer than the rest combined) an algorithm rewrite is necessary • Also not good if you need a regular access pattern for data locality
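A minimal C sketch (my addition, not from the slides) of dynamic scheduling; the worker routine and chunk size are placeholders.

      void expensive_work(double *x);   /* hypothetical routine, cost varies per call */

      void process(double *item, int n)
      {
          /* schedule(dynamic, 4): each free thread grabs the next chunk of 4
             iterations, so a few expensive iterations do not pin the whole
             loop to one thread. */
          #pragma omp parallel for schedule(dynamic, 4)
          for (int i = 0; i < n; ++i)
              expensive_work(&item[i]);
      }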

  12. Master-Worker Model [Diagram: worker threads 1-3 each send a REQUEST to the master thread, which hands back the next chunk of iterations]

  13. Performance issues - False Sharing • You may parallelize your algorithm and find performance is less than stellar [Plot: speed-up vs Ncpu]

  14. Example • A possible cause of poor performance is something called 'false sharing' • Simple code which sums rows of a matrix:

      integer m,n,i,j
      real a(m,n),s(m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
        s(i)=0.0
        do j=1,n
          s(i)=s(i)+a(i,j)
        end do
      end do

  15. Execution time line • Set m=4, what happens in each thread? • Since memory is laid out in four word cache lines, at each stage all four threads are fighting for the same cache line

  t=0:  s(1)=0.0           s(2)=0.0           s(3)=0.0           s(4)=0.0
  t=1:  s(1)=s(1)+a(1,1)   s(2)=s(2)+a(2,1)   s(3)=s(3)+a(3,1)   s(4)=s(4)+a(4,1)
  t=2:  s(1)=s(1)+a(1,2)   s(2)=s(2)+a(2,2)   s(3)=s(3)+a(3,2)   s(4)=s(4)+a(4,2)

  16. Cache line invalidation • For each thread, prior to the next operation on s(), it must retrieve a new version of the s(1:4) cache line from main memory (its own copy of s(1:4) has been invalidated by the other threads' writes) • Fetches from anywhere other than the local cache are much slower • The result is a significantly increased run time

  17. Simple solution • Just need to spread out the elements of s() so that each of them sits in its own cache line:

      integer m,n,i,j
      real a(m,n),s(32,m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
        s(1,i)=0.0
        do j=1,n
          s(1,i)=s(1,i)+a(i,j)
        end do
      end do

  18. Layout of s(,) [Diagram: with s declared as s(32,m), the used elements s(1,i-1), s(1,i), s(1,i+1), s(1,i+2) are each 32 words apart, so each item of interest is now separated by (multiple) 8 word cache lines]

  19. Tips to avoid false sharing • Minimize the number of variables that are shared • Segregate rarely updated variables from those that are updated frequently (“volatile”) • Isolate the volatile items into separate cache lines • You have to accept some wasted memory to improve performance, as in the sketch below
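A hedged C sketch (my addition) of the padding idea used in the earlier Fortran example; the 64-byte cache line size is an assumption.

      #define CACHE_LINE 64                     /* assumed cache line size in bytes */
      #define PAD (CACHE_LINE / sizeof(double)) /* doubles per cache line (8 here) */

      /* C analogue of the slide's s(32,m): only element [i][0] of each padded row
         is used, so sums updated by different threads never share a cache line. */
      void sum_rows(int m, int n, double a[m][n], double s[m][PAD])
      {
          #pragma omp parallel for
          for (int i = 0; i < m; ++i) {
              s[i][0] = 0.0;
              for (int j = 0; j < n; ++j)
                  s[i][0] += a[i][j];
          }
      }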

  20. OpenMP subroutines and functions (s = subroutine, f = function; the C/C++ versions are all lower case)
  OMP_SET_NUM_THREADS (s)
  OMP_GET_NUM_THREADS (f)
  OMP_GET_MAX_THREADS (f)
  OMP_GET_THREAD_NUM (f)
  OMP_GET_NUM_PROCS (f)
  OMP_IN_PARALLEL (f)
  OMP_SET_DYNAMIC (s)
  OMP_GET_DYNAMIC (f)
  OMP_SET_NESTED (s)
  OMP_GET_NESTED (f)

  21. Useful functions/subroutines • OMP_SET_NUM_THREADS is useful to change the number of threads of execution:

      call OMP_SET_NUM_THREADS(num_threads)
or
      void omp_set_num_threads(int num_threads)

• However, it will accept values higher than the number of CPUs, which will result in poor execution times
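A small C usage sketch (my addition, not from the slides): tie the thread count to the number of processors reported by the runtime so the machine is not oversubscribed.

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          /* Never ask for more threads than there are processors. */
          int nprocs = omp_get_num_procs();
          omp_set_num_threads(nprocs);

          #pragma omp parallel
          {
              #pragma omp single
              printf("running with %d threads on %d processors\n",
                     omp_get_num_threads(), nprocs);
          }
          return 0;
      }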

  22. OMP_GET_NUM_THREADS • Returns the number of threads currently being used for execution • It will obviously produce a different result when executed in a parallel loop (the team size) versus a serial part of the code (where it returns 1), so be careful!

  23. OMP_GET_MAX_THREADS • While OMP_GET_NUM_THREADS returns the number of threads currently being used, OMP_GET_MAX_THREADS returns the maximum number of threads that could be used • This value will be the same whether the code is executing in a serial or parallel section • Remember: that is not true for OMP_GET_NUM_THREADS!

  24. OMP_GET_THREAD_NUM • Very useful function: returns your thread number, from 0 to (number of threads) - 1 • Can be used to control access to data by using the thread number to compute the start and end points of a section • Can also be useful in debugging race conditions, as in the sketch below
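A tiny C sketch (my addition) of the debugging use just mentioned: tag diagnostic output with the thread number.

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          #pragma omp parallel
          {
              /* Each thread reports its own number (0 .. nthreads-1);
                 handy when chasing race conditions. */
              printf("thread %d of %d checking in\n",
                     omp_get_thread_num(), omp_get_num_threads());
          }
          return 0;
      }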

  25. Environment variables • The OpenMP standard defines a number of environment variables (some of which we have met):
  OMP_NUM_THREADS
  OMP_SCHEDULE
  OMP_DYNAMIC
  OMP_NESTED

  26. Environment variables • All of these variables can be overridden by using the subroutines we discussed (with the exception of the OMP_SCHEDULE variable) • OMP_DYNAMIC is set to false by default • OMP_NESTED is set to false by default • If you are using nesting it is probably safer to ensure you turn on nesting within the code

  27. Thread Stack size • One of the most significant environment variables is not defined within the OpenMP standard: the thread stack size • Each parallel thread may require a large stack on which to hold its private variables • Typical sizes are 1-8 MB, but certain codes may require more • I often run problems where I need over 100 MB of thread stack (for example)

  28. Quick example • Consider the following code:

      real r(3,2097152)
C$OMP PARALLEL DO
C$OMP& PRIVATE(r,i)
      do i=1,10
        call work(r)
      end do

• You are placing 128³ × 3 words onto each thread's stack = 24 MB per thread

  29. Conditional Parallelism • To save code replication, OpenMP allows you to decide at run time whether a loop should execute in parallel or serial: 'conditional parallelism' • The PARALLEL DO directive can carry an IF clause • Only if the IF expression is true is the loop executed in parallel

C$OMP PARALLEL DO
C$OMP& IF(L.gt.Lrmax)
C$OMP& PRIVATE(…)
      code….
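The C form of the same clause, as a hedged sketch of my own (the threshold and names are invented):

      #define LRMAX 10000   /* hypothetical threshold */

      void scale(double *y, double a, int L)
      {
          /* Run in parallel only when the trip count makes the thread
             start-up overhead worthwhile. */
          #pragma omp parallel for if(L > LRMAX)
          for (int i = 0; i < L; ++i)
              y[i] *= a;
      }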

  30. How this avoids replication • Suppose you have a routine that is applied to a certain data set, but you may have a large single data set, or alternatively many small data sets • Option 1: a single instance of 'call foo', parallel across the machine • Option 2: multiple (parallel) serial calls to 'call foo'

  31. Conditional Parallelism • To determine whether the routine is already executing in parallel, the omp_in_parallel function can be used:

      subroutine foo(bar)
      real bar
      logical inpar
      inpar = omp_in_parallel()
C$OMP PARALLEL DO
C$OMP& IF(.NOT. inpar)
C$OMP& PRIVATE(…)
C$OMP& SHARED(…)
      do i=1,N
        work….
      end do
      return
      end

  32. Comments • Applications that have a recursive nature can often be significantly reduced in size by using conditional parallelism • Most significant benefit is probably in code correctness – any bugs need only be fixed once!

  33. Conditional Compilation • It's good practice to be able to run your OpenMP code in serial • However, if you call OpenMP runtime functions they may not be defined when compiling without OpenMP • Conditional compilation provides a simple way around this

  34. Conditional Compilation • Two methods can be used to guard calls to OpenMP routines • The C$ sentinel is only compiled when OpenMP is enabled:

C$    i=OMP_GET_THREAD_NUM()

• Or C preprocessor statements can be used:

#ifdef _OPENMP
      i=OMP_GET_THREAD_NUM()
#endif
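For C code the same preprocessor guard looks like this (a minimal sketch of my own):

      #ifdef _OPENMP
      #include <omp.h>
      #endif

      int get_tid(void)
      {
          int tid = 0;                      /* sensible default for a serial build */
      #ifdef _OPENMP
          tid = omp_get_thread_num();       /* only compiled when OpenMP is enabled */
      #endif
          return tid;
      }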

  35. Simple Example • Suppose you use the thread number function given in the previous example. Original version (requires OpenMP):

C$OMP PARALLEL DO
C$OMP& PRIVATE(j,js,je,k)
      do i=1,n
        j=OMP_GET_THREAD_NUM()
        js=j*iblk+1
        je=(j+1)*iblk
        do k=js,je
          ! Use j to point to parts of an array
        end do
      end do
C$OMP END PARALLEL DO

• Conditionally compiled version: add a default value and ensure all constants are adjusted accordingly

C$OMP PARALLEL DO
C$OMP& PRIVATE(j,js,je,k)
      do i=1,n
        j=1
C$      j=OMP_GET_THREAD_NUM()
        js=j*iblk+1
        je=(j+1)*iblk
        do k=js,je
          ! Use j to point to parts of an array
        end do
      end do
C$OMP END PARALLEL DO

  36. Parallel I/O • Provided you are prepared to read and write multiple files, OpenMP can give moderate I/O speed-up • Even if you are writing only one array, sections of it can be written within the files, e.g.

      write(13) (r(i),i=1,block)
      write(13) (r(i),i=block+1,2*block)

  37. I/O continued • However, writing to one file system implies a bandwidth limit • Multiple file systems on one machine can show great benefits • Note: some machines require that applications be compiled with special flags • A parallel I/O flag may be needed at compile time

  38. Code Example • Writes the array r() out in nthread files simultaneously:

      blck_len=n/OMP_GET_NUM_THREADS()
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j,filename,js,je)
C$OMP& SHARED(blck_len,r)
      do i=1,nthread
        write(filename,'("r",i4.4,".out")') i
        js=(i-1)*blck_len+1
        je=i*blck_len
        open(i+10,file=filename,status='new',form='unformatted')
        write(i+10) (r(j),j=js,je)
        close(i+10)
      end do

  39. Performance examples • Ultimately this depends on the system you are on

  40. Alternative programming methodology • OpenMP also allows you to program in a way more akin to MPI (the SPMD model) • Using the PARALLEL construct you don't need a do loop, or sections, to execute in parallel:

C$OMP PARALLEL
      call foo(I,a,b,c)
C$OMP END PARALLEL

  41. Execution [Diagram: each of the four threads makes its own call to foo]

  42. PARALLEL environment • To do useful work it is necessary to have each subroutine call access different data • Can use the thread id to select different pieces of an array, similar in principle to using process ids in MPI (see the sketch below) • Good scaling requires that the replicated calls really do perform parallel work, rather than just replicating execution
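A hedged C sketch (my addition, not from the slides) of this SPMD style; foo and the block decomposition are placeholders.

      #include <omp.h>

      void foo(double *chunk, int len);   /* hypothetical worker routine */

      void spmd_driver(double *data, int n)
      {
          /* SPMD style: every thread calls foo, but the thread id selects a
             different slice of the data, much as an MPI rank would. */
          #pragma omp parallel
          {
              int tid      = omp_get_thread_num();
              int nthreads = omp_get_num_threads();
              int len      = n / nthreads;
              int start    = tid * len;
              int count    = (tid == nthreads - 1) ? n - start : len;

              foo(data + start, count);
          }
      }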

  43. Pros/cons of this approach • Cons: it may be difficult to take an already written code and convert it to this style • It can also lead to tough-to-debug code • Pros: codes written in this style often scale very well • Potentially very useful if you are starting from scratch

  44. BARRIERS & SINGLE SECTIONS • Within the parallel region it may be necessary to synchronize and/or have only one thread execute a piece of code • This can be achieved by using the SINGLE directive and BARRIER:

C$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
C$OMP BARRIER
C$OMP SINGLE
      CALL OUTPUT(X)
      CALL INPUT(Y)
C$OMP END SINGLE
      CALL WORK(Y)
C$OMP END PARALLEL

• BARRIER: block until all threads arrive here • SINGLE: the block is executed only by the first thread to arrive at this part of the code

  45. SINGLE DIRECTIVE • The point of using the SINGLE directive is to avoid the overhead of stopping and then restarting the parallel section • There is an implicit barrier at the END SINGLE statement • This can be avoided by adding NOWAIT (a C sketch follows below):

C$OMP END SINGLE NOWAIT
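In C the clause is spelled nowait on the single pragma; a minimal sketch of my own:

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          #pragma omp parallel
          {
              /* Only one thread prints the banner; nowait removes the implicit
                 barrier at the end of the single block, so the other threads
                 continue immediately. */
              #pragma omp single nowait
              printf("set-up done by thread %d\n", omp_get_thread_num());

              /* ... work that does not depend on the single block ... */
          }
          return 0;
      }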

  46. MASTER DIRECTIVE • Instead of using SINGLE, one can also use the MASTER directive to select code that only the master thread should execute • Unlike the SINGLE directive there is no implied barrier at the END MASTER statement • MASTER regions can be used to keep tabs on the values of control variables • SINGLE regions are better for I/O (for example)

  47. DO • Within parallel regions, DO loops can be specified as parallel DOs • Exactly the same idea as loop level parallelism: you have to specify the scope of the variables, scheduling, etc.

  48. Example for 4 threads • Without a DO directive every thread executes the whole loop, giving 16 lines of output; with the DO directive the iterations are shared out, giving 4 lines of output (one 'Hello' per thread)

C$OMP PARALLEL
      do i=1,4
        write(*,*) 'Hello'
      end do
C$OMP END PARALLEL

C$OMP PARALLEL
C$OMP DO
      do i=1,4
        write(*,*) 'Hello'
      end do
C$OMP END DO
C$OMP END PARALLEL
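The C analogue of this example, as a minimal sketch of my own (not from the slides):

      #include <stdio.h>

      int main(void)
      {
          /* Every thread runs the whole loop: with 4 threads this prints
             "Hello" 16 times. */
          #pragma omp parallel
          for (int i = 0; i < 4; ++i)
              printf("Hello\n");

          /* The for directive shares the iterations out, so "Hello" is
             printed 4 times in total. */
          #pragma omp parallel
          {
              #pragma omp for
              for (int i = 0; i < 4; ++i)
                  printf("Hello\n");
          }
          return 0;
      }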

  49. Task Queue in the SPMD model

      program tasks
      integer my_task,mytid,index
      index=1
C$OMP PARALLEL
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(mytid),SHARED(index)
C$OMP CRITICAL
      mytid=my_task(index)
C$OMP END CRITICAL
      do while (mytid.ne.-1)
        call work(mytid)
C$OMP CRITICAL
        mytid=my_task(index)
C$OMP END CRITICAL
      end do
C$OMP END PARALLEL
      end

      integer function my_task(index)
      integer index,MAX
      MAX=some value
      if (index.lt.MAX) then
        index=index+1
        my_task=index
      else
        my_task=-1
      end if
      return
      end

  50. WORKSHARE • This is a new feature in the OpenMP 2.0 standard • Allows parallelization of f90 array expressions and FORALL statements • The directive was brought in to get around problems with f90 array syntax • Example, if A & B are arrays:

C$OMP WORKSHARE
      A=B
C$OMP END WORKSHARE
