
Parallel Programming with OpenMP



Presentation Transcript


  1. Parallel Programming with OpenMP Ing. Andrea Marongiu a.marongiu@unibo.it

  2. Programming model: OpenMP • De-facto standard for the shared memory programming model • A collection of compiler directives, library routines and environment variables • Easy to specify parallel execution within a serial code • Requires special support in the compiler • Generates calls to threading libraries (e.g. pthreads) • Focus on loop-level parallel execution • Popular in high-end embedded systems

  3. Fork/Join Parallelism • Initially only the master thread is active • The master thread executes sequential code • Fork: the master thread creates or awakens additional threads to execute parallel code • Join: at the end of the parallel code, the created threads are suspended upon barrier synchronization (Figure: sequential program vs. parallel program execution timelines)

  4. Pragmas • Pragma: a compiler directive in C or C++ • Stands for “pragmatic information” • A way for the programmer to communicate with the compiler • The compiler is free to ignore pragmas: the original sequential semantics are not altered • Syntax: #pragma omp <rest of pragma>
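
A minimal sketch of this point (the file name and compiler flag are illustrative): built without OpenMP support, the directive below is simply ignored and the loop runs sequentially with identical results.

    /* pragma.c: behaves the same with and without OpenMP support */
    #include <stdio.h>

    int main(void)
    {
        int a[10];

        #pragma omp parallel for   /* a hint for the compiler, not a command */
        for (int i = 0; i < 10; i++)
            a[i] = i * i;

        for (int i = 0; i < 10; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }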

  5. Components of OpenMP • Directives: • Parallel regions (#pragma omp parallel) • Work sharing (#pragma omp for, #pragma omp sections) • Synchronization (#pragma omp barrier, #pragma omp critical, #pragma omp atomic) • Clauses: • Data scope attributes (private, shared, reduction) • Loop scheduling (static, dynamic) • Runtime library: • Thread forking/joining (omp_parallel_start(), omp_parallel_end()) • Loop scheduling • Thread IDs (omp_get_thread_num(), omp_get_num_threads())
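
A small sketch (not from the slides) touching all three groups above: a work-sharing directive, data-scope and scheduling clauses, and runtime library calls. The number of threads and the output order depend on the runtime.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int n = 16;
        int a[16];

        /* directive + clauses: share the array, use static loop scheduling */
        #pragma omp parallel for shared(a) schedule(static)
        for (int i = 0; i < n; i++)
            a[i] = omp_get_thread_num();   /* runtime library call */

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)   /* only the master reports */
                printf("Team size: %d threads\n", omp_get_num_threads());
        }

        printf("a[0] was written by thread %d\n", a[0]);
        return 0;
    }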

  6. Outlining parallelism: The parallel directive • Fundamental construct to outline parallel computation within a sequential program • Code within its scope is replicated among threads • Defers the implementation of parallel execution to the runtime (machine-specific, e.g. pthread_create) A sequential program... int main() { #pragma omp parallel { printf ("\nHello world!"); } } ...is easily parallelized. The compiler outlines it into calls to the runtime: int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf ("\nHello world!"); }
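
The “Hello world” above as a self-contained program; assuming GCC, it can be built with gcc -fopenmp hello.c (other compilers use a different switch).

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* every thread in the team executes this block */
            printf("Hello world from thread %d!\n", omp_get_thread_num());
        }
        return 0;
    }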

  7. #pragma omp parallel Code originally contained within the scope of the pragma is outlined to a new function within the compiler int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf ("\nHello world!"); } int main() { #pragma omp parallel { printf ("\nHello world!"); } }

  8. #pragma omp parallel The #pragma construct in the main function is replaced with function calls to the runtime library int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf ("\nHello world!"); } int main() { #pragma omp parallel { printf ("\nHello world!"); } }

  9. #pragma omp parallel First we call the runtime to fork new threads, and pass them a pointer to the function to execute in parallel int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf ("\nHello world!"); } int main() { #pragma omp parallel { printf ("\nHello world!"); } }

  10. #pragma omp parallel Then the master itself calls the parallel function int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf ("\nHello world!"); } int main() { #pragma omp parallel { printf ("\nHello world!"); } }

  11. #pragma omp parallel Finally we call the runtime to synchronize threads with a barrier and suspend them int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { printf ("\nHello world!"); } int main() { #pragma omp parallel { printf ("\nHello world!"); } }

  12. #pragma omp parallel: Data scope attributes A slightly more complex example: int main() { int id; int a = 5; #pragma omp parallel { id = omp_get_thread_num(); if (id == 0) printf("Master: a = %d.", a*2); else printf("Slave: a = %d.", a); } } Call the runtime to get the thread ID: every thread sees a different value. Master and slave threads access the same variable a.

  13. #pragma omp parallel: Data scope attributes A slightly more complex example: int main() { int id; int a = 5; #pragma omp parallel { id = omp_get_thread_num(); if (id == 0) printf("Master: a = %d.", a*2); else printf("Slave: a = %d.", a); } } Call the runtime to get the thread ID: every thread sees a different value. Master and slave threads access the same variable a. How do we inform the compiler about these different behaviors?

  14. #pragma omp parallel: Data scope attributes A slightly more complex example: int main() { int id; int a = 5; #pragma omp parallel shared(a) private(id) { id = omp_get_thread_num(); if (id == 0) printf("Master: a = %d.", a*2); else printf("Slave: a = %d.", a); } } shared(a): insert code to retrieve the address of the shared object from within each parallel thread. private(id): allow symbol privatization; each thread contains a private copy of this variable.

  15. #pragma omp parallel: Data scope attributes Correctness issues. What if: • a was not marked for shared access? • id was not marked for private access? int main() { int id; int a = 5; #pragma omp parallel shared(a) private(id) { id = omp_get_thread_num(); if (id == 0) printf("Master: a = %d.", a*2); else printf("Slave: a = %d.", a); } }

  16. #pragma omp parallel: Data scope attributes Performance issues. Each thread works on its own private copy of id, while accesses to the shared variable a are serialized. int main() { int id; int a = 5; #pragma omp parallel shared(a) private(id) { id = omp_get_thread_num(); if (id == 0) printf("Master: a = %d.", a*2); else printf("Slave: a = %d.", a); } } (Figure: four private copies of id, one shared a, serialized accesses to a.)
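
A runnable version of the example above, assuming an OpenMP-enabled compiler; the interleaving of the output lines is nondeterministic.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int id;
        int a = 5;

        #pragma omp parallel shared(a) private(id)
        {
            id = omp_get_thread_num();   /* each thread writes its own copy of id */
            if (id == 0)
                printf("Master: a = %d.\n", a * 2);
            else
                printf("Slave: a = %d.\n", a);
        }
        return 0;
    }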

  17. Sharing work among threads: The for directive • The parallel pragma instructs every thread to execute all of the code inside the block • If we encounter a for loop that we want to divide among threads, we use the for pragma: #pragma omp for

  18. #pragma omp for int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { int LB = …; int UB = …; for (i=LB; i<UB; i++) a[i] = i; } The code of the for loop is moved inside the outlined function. int main() { #pragma omp parallel for for (i=0; i<10; i++) a[i] = i; }

  19. #pragma omp for int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { int LB = …; int UB = …; for (i=LB; i<UB; i++) a[i] = i; } Every thread works on a different subset of the iteration space.. int main() { #pragma omp parallel for for (i=0; i<10; i++) a[i] = i; }

  20. #pragma omp for int main() { omp_parallel_start(&parfun, …); parfun(); omp_parallel_end(); } int parfun(…) { int LB = …; int UB = …; for (i=LB; i<UB; i++) a[i] = i; } ..since the lower and upper boundaries (LB, UB) are computed locally. int main() { #pragma omp parallel for for (i=0; i<10; i++) a[i] = i; }
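
A runnable sketch of this loop; note that in standard OpenMP the directive must immediately precede the for statement (no extra braces in between).

    #include <stdio.h>

    int main(void)
    {
        int a[10];

        /* the iteration space 0..9 is divided among the threads of the team */
        #pragma omp parallel for
        for (int i = 0; i < 10; i++)
            a[i] = i;

        for (int i = 0; i < 10; i++)
            printf("a[%d] = %d\n", i, a[i]);
        return 0;
    }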

  21. The schedule clause: Static Loop Partitioning Example: 12 iterations (N), 4 threads (Nthr). The chunk size is C = ceil(N / Nthr) = 3 iterations per thread. #pragma omp for schedule(static) for (i=0; i<12; i++) a[i] = i; Useful for: • Simple, regular loops • Iterations with equal duration Each thread ID (TID = 0, 1, 2, 3) is assigned the iterations [LB, UB), with lower bound LB = C * TID (0, 3, 6, 9) and upper bound UB = min(C * (TID + 1), N) (3, 6, 9, 12). (Figure: the iteration space divided into four chunks of 3 iterations.)

  22. The schedule clause: Static Loop Partitioning (cont’d) Same example as above: static chunks of C = ceil(N / Nthr) = 3 iterations per thread, with LB = C * TID and UB = min(C * (TID + 1), N). What happens with static scheduling when iterations have different durations?
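
A small sketch (plain C, no OpenMP needed) that reproduces the bound computation above for N = 12 iterations and Nthr = 4 threads.

    #include <stdio.h>

    int main(void)
    {
        int N = 12, Nthr = 4;
        int C = (N + Nthr - 1) / Nthr;                         /* C = ceil(N / Nthr) = 3 */

        for (int tid = 0; tid < Nthr; tid++) {
            int LB = C * tid;                                  /* lower bound */
            int UB = (C * (tid + 1) < N) ? C * (tid + 1) : N;  /* upper bound */
            printf("thread %d: iterations [%d, %d)\n", tid, LB, UB);
        }
        return 0;
    }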

  23. The schedule clause: Static Loop Partitioning A variable amount of work in each iteration: #pragma omp for schedule(static) for (i=0; i<12; i++) { int start = rand(); int count = 0; while (start++ < 256) count++; a[count] = foo(); } (Figure: with static chunks, iterations of very different cost are pinned to cores 1-4, so some cores sit idle while others are still working.) UNBALANCED workloads

  24. The schedule clause: Dynamic Loop Partitioning Replacing schedule(static) with schedule(dynamic): #pragma omp for schedule(dynamic) for (i=0; i<12; i++) { int start = rand(); int count = 0; while (start++ < 256) count++; a[count] = foo(); } A unit of work is generated for every single iteration and handed to threads on demand.

  25. The schedule clause: Dynamic Loop Partitioning A work queue is implemented in the runtime environment, where tasks are stored. Work is fetched by threads from the runtime environment through synchronized accesses to the work queue: int parfun(…) { int LB, UB; GOMP_loop_dynamic_next(&LB, &UB); for (i=LB; i<UB; i++) {…} } (Figure: the iteration space mapped onto the runtime's work queue.)

  26. The schedule clause: Dynamic Loop Partitioning Remember the results with static scheduling: with dynamic scheduling the same loop now produces BALANCED workloads across cores 1-4, and the parallel region finishes earlier. Speedup! (Figure: per-core timelines for static vs. dynamic scheduling of the unbalanced loop.)
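
A sketch of the same idea with the standard schedule(dynamic) clause; work() is a hypothetical stand-in for an iteration whose cost grows with i, so a static split would be unbalanced.

    #include <stdio.h>

    /* hypothetical variable-cost workload: later iterations cost more */
    static double work(int i)
    {
        double s = 0.0;
        for (int k = 0; k < (i + 1) * 100000; k++)
            s += 1.0 / (k + 1);
        return s;
    }

    int main(void)
    {
        double a[12];

        /* idle threads fetch the next iteration from the runtime's work queue */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < 12; i++)
            a[i] = work(i);

        printf("a[11] = %f\n", a[11]);
        return 0;
    }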

  27. Sharing work among threads: The sections directive • The for pragma allows us to exploit data parallelism in loops • OpenMP also provides a directive to exploit task parallelism: #pragma omp sections

  28. Task Parallelism Example int main() { v = alpha(); w = beta(); y = delta(); x = gamma(v, w); z = epsilon(x, y); printf("%f\n", z); } Identify independent nodes in the task graph, and outline parallel computation with the sections directive. FIRST SOLUTION

  29. Task Parallelism Example int main() { #pragma omp parallel sections { v = alpha(); w = beta(); } #pragma omp parallel sections { y = delta(); x = gamma(v, w); } z = epsilon(x, y); printf("%f\n", z); } Barriers are implied at the end of each parallel sections block!

  30. Task Parallelism Example int main() { #pragma omp parallel sections { v = alpha(); w = beta(); } #pragma omp parallel sections { y = delta(); x = gamma(v, w); } z = epsilon(x, y); printf("%f\n", z); } Each parallel task within a sections block identifies a section

  31. Task Parallelism Example int main() { #pragma omp parallel sections { #pragma omp section v = alpha(); #pragma omp section w = beta(); } #pragma omp parallel sections { #pragma omp section y = delta(); #pragma omp section x = gamma(v, w); } z = epsilon(x, y); printf("%f\n", z); } Each parallel task within a sections block identifies a section

  32. Task Parallelism Example int main() { v = alpha(); w = beta(); y = delta(); x = gamma(v, w); z = epsilon(x, y); printf("%f\n", z); } Identify independent nodes in the task graph, and outline parallel computation with the sections directive. SECOND SOLUTION

  33. Task Parallelism Example int main() { #pragma omp parallel sections { v = alpha(); w = beta(); y = delta(); } #pragma omp parallel sections { x = gamma(v, w); z = epsilon(x, y); } printf("%f\n", z); } A barrier is implied at the end of the first parallel sections block!

  34. Task Parallelism Example int main() { #pragma omp parallel sections { v = alpha(); w = beta(); y = delta(); } #pragma omp parallel sections { x = gamma(v, w); z = epsilon(x, y); } printf("%f\n", z); } Each parallel task within a sections block identifies a section

  35. Task Parallelism Example int main() { #pragma omp parallel sections { #pragma omp section v = alpha(); #pragma omp section w = beta(); #pragma omp section y = delta(); } #pragma omp parallel sections { #pragma omp section x = gamma(v, w); #pragma omp section z = epsilon(x, y); } printf("%f\n", z); } Each parallel task within a sections block identifies a section
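
A self-contained sketch of the first solution; the bodies of alpha/beta/delta/gamma/epsilon are not given in the slides, so trivial stubs are used here (and gamma is renamed gamma_ to avoid clashing with the C math library function).

    #include <stdio.h>

    double alpha(void)                 { return 1.0; }   /* stub */
    double beta(void)                  { return 2.0; }   /* stub */
    double delta(void)                 { return 3.0; }   /* stub */
    double gamma_(double v, double w)  { return v + w; } /* stub */
    double epsilon(double x, double y) { return x * y; } /* stub */

    int main(void)
    {
        double v, w, x, y, z;

        #pragma omp parallel sections
        {
            #pragma omp section
            v = alpha();
            #pragma omp section
            w = beta();
        }   /* implied barrier: v and w are ready */

        #pragma omp parallel sections
        {
            #pragma omp section
            y = delta();
            #pragma omp section
            x = gamma_(v, w);
        }   /* implied barrier: x and y are ready */

        z = epsilon(x, y);
        printf("%f\n", z);
        return 0;
    }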

  36. #pragma omp barrier • The most important synchronization mechanism in shared memory fork/join parallel programming • All threads participating in a parallel region wait until everybody has finished before computation flows on • This prevents later stages of the program from working with inconsistent shared data • It is implied at the end of parallel constructs, as well as for and sections (unless a nowait clause is specified)
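
A sketch of an explicit barrier between two phases; phase two reads elements written by other threads in phase one, so no thread may proceed early.

    #include <omp.h>
    #include <stdio.h>

    #define N 64

    int main(void)
    {
        int a[N], b[N];

        #pragma omp parallel
        {
            int nthr = omp_get_num_threads();
            int tid  = omp_get_thread_num();

            /* phase 1: each thread fills its own slice of a[] */
            for (int i = tid; i < N; i += nthr)
                a[i] = i;

            #pragma omp barrier   /* wait until all of a[] has been written */

            /* phase 2: reads elements that other threads wrote in phase 1 */
            for (int i = tid; i < N; i += nthr)
                b[i] = a[(i + 1) % N];
        }

        printf("b[0] = %d\n", b[0]);
        return 0;
    }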

  37. #pragma omp critical • Critical Section: a portion of code that only one thread at a time may execute • We denote a critical section by putting the pragma #pragma omp critical in front of a block of C code

  38. π-finding code example double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) for (i=0; i<n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area/n; Synchronize accesses to the shared variable area to avoid inconsistent results

  39. Race condition • Ensure atomic updates of the shared variable area to avoid a race condition in which one thread may “race ahead” of another, ignoring its changes

  40. Race condition (Cont’d) Timeline: • Thread A reads “11.667” into a local register • Thread B reads “11.667” into a local register • Thread A updates area with “11.667 + 3.765” • Thread B ignores the write from thread A and updates area with “11.667 + 3.563”

  41. π-finding code example double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) for (i=0; i<n; i++) { x = (i + 0.5)/n; #pragma omp critical area += 4.0/(1.0 + x*x); } pi = area/n; #pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution

  42. Correctness, not performance! • As a matter of fact, using locks makes execution sequential • To mitigate this effect we should use fine-grained locking (i.e. make critical sections as small as possible) • The simple statement that updates area in the previous example is translated into many simpler instructions by the compiler! • The programmer is not aware of the real granularity of the critical section

  43. Correctness, not performance! (cont’d) This is a dump of the intermediate representation of the program within the compiler.

  44. Correctness, not performance! (cont’d) This is how the compiler represents the statement area += 4.0/(1.0 + x*x);

  45. Correctness, not performance! (cont’d) The critical section is lowered into a call into the runtime to acquire the lock, the lock-protected operations (the critical section proper), and a call into the runtime to release the lock. THIS IS DONE AT EVERY LOOP ITERATION!

  46. π-finding code example double area, pi, x; int i, n; #pragma omp parallel for private(x) shared(area) for (i=0; i<n; i++) { x = (i + 0.5)/n; #pragma omp critical area += 4.0/(1.0 + x*x); } pi = area/n; (Figure: per-core timelines on cores 1-4, split into parallel work, sequential lock-protected updates, and time spent waiting for the lock.)
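
One way to shrink the critical section by hand (a sketch, not the slides' solution): accumulate into a private partial sum and take the lock once per thread instead of once per iteration. The reduction clause introduced on the next slides achieves the same effect automatically.

    #include <stdio.h>

    int main(void)
    {
        double area = 0.0, pi;
        int n = 1000000;

        #pragma omp parallel
        {
            double partial = 0.0;                  /* private accumulator */

            #pragma omp for
            for (int i = 0; i < n; i++) {
                double x = (i + 0.5) / n;
                partial += 4.0 / (1.0 + x * x);    /* no lock needed here */
            }

            #pragma omp critical
            area += partial;                       /* one lock acquisition per thread */
        }

        pi = area / n;
        printf("pi = %f\n", pi);
        return 0;
    }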

  47. Correctness, not performance! • A programming pattern such as area += 4.0/(1.0 + x*x); in which we: • Fetch the value of an operand • Add a value to it • Store the updated value is called a reduction, and is commonly supported by parallel programming APIs • OpenMP takes care of storing partial results in private variables and combining partial results after the loop

  48. Correctness, not performance! double area, pi, x; int i, n; #pragma omp parallel for private(x) reduction(+:area) for (i=0; i<n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area/n; The reduction(+:area) clause (which replaces shared(area)) instructs the compiler to create private copies of the area variable for every thread. At the end of the loop, the partial sums are combined into the shared area variable.

  49. Correctness, not performance! double area, pi, x; int i, n; #pragma omp parallel for private(x) reduction(+:area) for (i=0; i<n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area/n; Each thread computes partial sums on its private copy of the reduction variable; the shared variable is only updated at the end of the loop, when the partial sums are combined.

  50. Correctness, not performance! double area, pi, x; int i, n; #pragma omp parallel for private(x) reduction(+:area) for (i=0; i<n; i++) { x = (i + 0.5)/n; area += 4.0/(1.0 + x*x); } pi = area/n; (Figure: per-core timelines on cores 1-4; with the reduction, threads work fully in parallel and only combine their partial sums once at the end of the loop.)
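
The reduction version as a self-contained program (n here is illustrative); each thread accumulates into a private copy of area, and the runtime combines the partial sums with + at the end of the loop.

    #include <stdio.h>

    int main(void)
    {
        double area = 0.0, pi;
        int n = 1000000;

        #pragma omp parallel for reduction(+:area)
        for (int i = 0; i < n; i++) {
            double x = (i + 0.5) / n;
            area += 4.0 / (1.0 + x * x);
        }

        pi = area / n;
        printf("pi = %f\n", pi);   /* approaches 3.141593 as n grows */
        return 0;
    }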
