
Introduction to OpenMP



Presentation Transcript


  1. Introduction to OpenMP I - Introduction ITCS4145/5145, Parallel Programming C. Ferner and B. Wilkinson Feb 3, 2016

2. OpenMP structure • A standard developed in the 1990s for thread-based programming on shared memory systems. • Higher level than using low-level APIs such as Pthreads or Java threads. • Consists of a set of compiler directives, plus a few library routines and environment variables. • gcc supports OpenMP with the -fopenmp option, so no additional software is needed.

3. OpenMP compiler directives • OpenMP uses #pragma compiler directives to parallelize a program (“pragmatic” directive) • The programmer inserts #pragma statements into the sequential program to tell the compiler how to parallelize it • When the compiler comes across a compiler directive, it creates the corresponding thread-based parallel code • The basic OpenMP pattern is the thread-pool pattern

4. Thread-pool pattern • Basic OpenMP pattern • A parallel region indicates a section of code executed by all threads • At the end of a parallel region, all threads wait for each other as if there were a “barrier” (unless otherwise specified) • Code outside a parallel region is executed by the master thread only
[Figure: the master thread forks multiple threads at each parallel region, with a synchronization point at the end of each region]

5. Parallel Region
Syntax:
#pragma omp parallel
structured_block
• omp indicates an OpenMP pragma (other compilers will ignore it)
• parallel indicates a parallel region
• structured_block is either: • a single statement terminated with ; or • a block of statements, i.e. { statement1; statement2; … }

6. Hello World
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
  #pragma omp parallel
  {
    printf("Hello World from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}
These routines return the thread ID (from 0 onwards) and the total number of threads, respectively (easy to confuse as the names are similar).
Very important: the opening brace must be on a new line.

7. Compiling and Output
$ gcc -fopenmp hello.c -o hello      (-fopenmp tells gcc to interpret OpenMP directives)
$ ./hello
Hello World from thread 2 of 4
Hello World from thread 0 of 4
Hello World from thread 3 of 4
Hello World from thread 1 of 4
$

8. Number of threads • Three ways to indicate how many threads you want:
1. Use the num_threads clause within the directive, e.g. #pragma omp parallel num_threads(5)
2. Use the omp_set_num_threads function, e.g. omp_set_num_threads(6);
3. Use the OMP_NUM_THREADS environment variable, e.g.
$ export OMP_NUM_THREADS=8
$ ./hello
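A minimal sketch (not from the slides) showing the first two methods in a single program; the file name, thread counts and messages are arbitrary choices. The num_threads clause applies only to its own region, omp_set_num_threads sets the default for subsequent regions, and OMP_NUM_THREADS sets the default when neither is used:

// set_threads.c -- compile with: gcc -fopenmp set_threads.c -o set_threads
#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(6);              /* method 2: library routine */
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      printf("Region 1 uses %d threads\n", omp_get_num_threads());
  }

  #pragma omp parallel num_threads(5)  /* method 1: clause, this region only */
  {
    if (omp_get_thread_num() == 0)
      printf("Region 2 uses %d threads\n", omp_get_num_threads());
  }
  return 0;
}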

9. Shared versus Private Data
int main(int argc, char *argv[])
{
  int x;
  int tid;
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) x = 42;
    printf("Thread %d, x = %d\n", tid, x);
  }
}
x is shared by all threads; tid is private – each thread has its own copy.
Variables declared outside the parallel construct are shared unless otherwise specified.

10. Abstractly, it looks like this
[Figure: four processors, each with its own local memory holding a private copy of tid (values 0, 1, 2, 3), all connected to a shared memory holding x = 42]

11. Shared versus Private Data
$ ./data
Thread 3, x = 0
Thread 2, x = 0
Thread 1, x = 0
Thread 0, x = 42
Thread 4, x = 42
Thread 5, x = 42
Thread 6, x = 42
Thread 7, x = 42
tid has a separate value for each thread; x has the same value for each thread (well… almost)

12. Another Example: Shared versus Private
int x, tid, n, a[100];
#pragma omp parallel private(tid, n)
{
  tid = omp_get_thread_num();
  n = omp_get_num_threads();
  a[tid] = 10*n;
}
a[ ] is shared; tid and n are private.
Optionally, the shared variables can be listed explicitly:
#pragma omp parallel private(tid, n) shared(a)
...

  13. Specifying Work Inside a Parallel Region (Work Sharing Constructs) • Four constructs: • section – each section executed by a different thread • for – one or more iterations executed by a (potentially) different thread • single – executed by a single thread (sequential) • master – executed by the master only (sequential) • Barrier after each construct (except master) unless a nowait clause is given • Constructs must be used within a parallel region

14. Sections Syntax
#pragma omp parallel            /* parallel region */
{
  #pragma omp sections
  {
    #pragma omp section
      structured_block
    #pragma omp section
      structured_block
    ...
  }
}
The sections are executed by available threads.

15. Sections Example
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  #pragma omp sections nowait    /* threads do not wait after finishing a section */
  {
    #pragma omp section          /* one thread does this */
    {
      printf("Thread %d doing section 1\n", tid);
      for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]=%f\n", tid, i, c[i]);
      }
    }

16. Sections example continued
    #pragma omp section          /* another thread does this */
    {
      printf("Thread %d doing section 2\n", tid);
      for (i = 0; i < N; i++) {
        d[i] = a[i] * b[i];
        printf("Thread %d: d[%d]=%f\n", tid, i, d[i]);
      }
    }
  } /* end of sections */
  printf("Thread %d done\n", tid);
} /* end of parallel section */

17. Sections Output
Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 done
Thread 2 done
Thread 1 doing section 2
Thread 1: d[0]= 0.000000
Thread 1: d[1]= 6.000000
Thread 1: d[2]= 14.000000
Thread 1: d[3]= 24.000000
Thread 0 done
Thread 1: d[4]= 36.000000
Thread 1 done
Threads do not wait (i.e. no barrier).

18. If we remove the nowait clause
Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 doing section 2
Thread 3: d[0]= 0.000000
Thread 3: d[1]= 6.000000
Thread 3: d[2]= 14.000000
Thread 3: d[3]= 24.000000
Thread 3: d[4]= 36.000000
Thread 3 done
Thread 1 done
Thread 2 done
Thread 0 done
There is a barrier at the end of the sections construct: threads wait until they are all done with their sections.

19. Parallel For Syntax
#pragma omp parallel       /* enclosing parallel region */
{
  #pragma omp for
  for (i = 0; i < N; i++) {
    ...
  }
}
Different iterations will be executed by available threads.
Must be a simple C for loop, where the lower and upper bounds are constants (actually loop invariant).

20. Iteration Space • Suppose N = 13. • The iteration space is the set of iterations 0, 1, 2, …, 12.

21. Iteration Partitions • Without further specification, iterations per partition (chunk size) = ⌈N/P⌉ for P threads, e.g. ⌈13/4⌉ = 4 for 4 threads.

22. Mapping • Iteration partitions are assigned to processors (threads) • Ex. with N = 13, chunk size 4 and 4 threads: thread 0 → iterations 0–3, thread 1 → 4–7, thread 2 → 8–11, thread 3 → 12.
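A minimal sketch (not from the slides) that prints which thread executes each iteration under the default schedule; N = 13 and the 4-thread count follow the example above, and the exact chunk boundaries are up to the implementation:

// mapping.c -- print the iteration-to-thread mapping (output order may vary)
#include <stdio.h>
#include <omp.h>

#define N 13

int main(void)
{
  int i;
  #pragma omp parallel for num_threads(4)
  for (i = 0; i < N; i++) {
    printf("iteration %2d executed by thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}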

23. Parallel For Example
#pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  if (tid == 0) {
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
  }
  printf("Thread %d starting...\n", tid);
  #pragma omp for
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
    printf("Thread %d: i = %d, c[%d] = %f\n", tid, i, i, c[i]);
  }
} /* end of parallel section */
Without "nowait", threads wait after finishing the loop.

24. Parallel For Output
Thread 1 starting...
Thread 1: i = 2, c[2] = 9.000000
Thread 1: i = 3, c[3] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[4] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[1] = 7.000000
Iterations of the loop are mapped to threads. In this example (N = 5, 4 threads) the mapping is: thread 0 → {0, 1}, thread 1 → {2, 3}, thread 2 → {4}, thread 3 → {}.
Barrier here (at the end of the for construct).

25. Combining Directives • If a parallel region consists of only one parallel for or parallel sections construct, they can be combined: • #pragma omp parallel sections • #pragma omp parallel for

26. Combining Directives Example
#pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
for (i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}
Declares a parallel region with a parallel for.

27. Scheduling a Parallel For • By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to available threads (static mapping):
Thread 1 starting...
Thread 1: i = 2, c[2] = 9.000000
Thread 1: i = 3, c[3] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[4] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[1] = 7.000000
Default chunk size here is 2 iterations. Barrier here.

28. Scheduling a Parallel For • Static – Partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion:
#pragma omp parallel for schedule (static, chunk_size)
(If chunk_size is not specified, the chunks are approximately equal in size, with at most one chunk per thread.)
• Dynamic – Chunk-sized blocks of iterations are assigned to threads as they become available:
#pragma omp parallel for schedule (dynamic, chunk_size)
(If chunk_size is not specified, it defaults to 1.)
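A minimal sketch (not from the slides) contrasting the two clauses on the same loop; the chunk size of 2, the loop bound and the thread count are arbitrary choices:

// schedules.c -- compare static and dynamic scheduling of the same loop
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
  int i;

  /* static: chunks of 2 iterations dealt out round-robin to the threads */
  #pragma omp parallel for schedule(static, 2) num_threads(4)
  for (i = 0; i < N; i++) {
    printf("static : iteration %d on thread %d\n", i, omp_get_thread_num());
  }

  /* dynamic: chunks of 2 iterations claimed by whichever thread is free next */
  #pragma omp parallel for schedule(dynamic, 2) num_threads(4)
  for (i = 0; i < N; i++) {
    printf("dynamic: iteration %d on thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}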

29. Scheduling a Parallel For • Guided – Similar to dynamic, but the chunk size starts large and gets smaller:*
#pragma omp parallel for schedule (guided)
• Runtime – Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used:
#pragma omp parallel for schedule (runtime)
* The actual algorithm is slightly more complicated; see the OpenMP standard.
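A minimal usage sketch (not from the slides) of the runtime schedule; myprog and work() are hypothetical names, and "guided,4" is just one possible value of OMP_SCHEDULE:

/* in the program: the schedule is chosen at run time from OMP_SCHEDULE */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < N; i++) {
  work(i);   /* hypothetical per-iteration work */
}

$ export OMP_SCHEDULE="guided,4"
$ ./myprog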

30. Static assignment • Suppose there are • N = 100 iterations • chunk_size = 15 • P = 4 threads • There are ⌈100/15⌉ = 7 chunks (six of 15 iterations and one of 10) • Each thread will get at most ⌈7/4⌉ = 2 chunks

31. Cyclic Assignment of Blocks • The 7 chunks are dealt out to the 4 threads in round-robin (cyclic) fashion: thread 0 gets chunks 0 and 4, thread 1 gets chunks 1 and 5, thread 2 gets chunks 2 and 6, and thread 3 gets chunk 3.
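A minimal sketch (not from the slides) that makes this assignment visible by printing which thread handles the start of each chunk for N = 100, chunk size 15 and 4 threads:

// cyclic.c -- show the round-robin (cyclic) assignment of chunks to threads
#include <stdio.h>
#include <omp.h>

#define N 100

int main(void)
{
  int i;
  /* schedule(static,15): chunks of 15 iterations dealt out round-robin */
  #pragma omp parallel for schedule(static, 15) num_threads(4)
  for (i = 0; i < N; i++) {
    if (i % 15 == 0)   /* print once per chunk to keep the output short */
      printf("chunk starting at i = %2d handled by thread %d\n",
             i, omp_get_thread_num());
  }
  return 0;
}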

  32. Question Guided scheduling is similar to Static except that the chunk sizes start large and get smaller. What is the advantage of using Guided versus Static? Answer: Guided improves load balance Is there any disadvantage of using Guided? Answer: Overhead

33. Reduction • A reduction applies a binary commutative operator to a collection of values, producing a single value.

34. Reduction • A binary commutative operator applied to a collection of values, producing a single value • E.g., applying summation to the values shown on the slide [list of values omitted in this transcript] produces the single value 549 • The OpenMP (and MPI) standards do not specify how reduction should be implemented; however, … • Commutative: changing the order of the operands does not change the result. • Associative: the order in which the operations are performed does not matter (with the operand sequence unchanged), i.e. rearranging parentheses does not alter the result.

35. Reduction Implementation • A reduction could be implemented fairly efficiently on multiple processors using a tree, in which case the time is O(log P).
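A minimal sketch (not from the slides, and not necessarily how any OpenMP runtime actually implements it) of combining per-thread partial sums pairwise in a tree, halving the number of active values at each step:

// tree_reduce.c -- illustrative pairwise (tree) combination of partial sums
#include <stdio.h>

#define P 8   /* number of partial results, e.g. one per thread */

int main(void)
{
  double partial[P] = {1, 2, 3, 4, 5, 6, 7, 8};   /* per-thread partial sums */

  /* log2(P) steps: at each step, element i absorbs element i+stride */
  for (int stride = 1; stride < P; stride *= 2) {
    for (int i = 0; i + stride < P; i += 2 * stride) {
      partial[i] += partial[i + stride];
    }
  }
  printf("reduced value = %f\n", partial[0]);   /* 36.000000 */
  return 0;
}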

36. Reduction Operators [operator table omitted in this transcript] • In C/C++ the reduction operators include +, *, -, &, |, ^, && and || • Subtract is in the spec, but the partial results are still combined with a summation.
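A minimal sketch (not from the slides) using one of the other operators, a logical-AND reduction; the array contents and the all_positive name are arbitrary choices:

// and_reduce.c -- logical-AND reduction: is every element positive?
#include <stdio.h>

#define N 6

int main(void)
{
  double v[N] = {1.5, 2.0, 3.0, 0.5, 4.0, 2.5};
  int all_positive = 1;   /* identity value for && */

  #pragma omp parallel for reduction(&&:all_positive)
  for (int i = 0; i < N; i++) {
    all_positive = all_positive && (v[i] > 0.0);
  }
  printf("all positive? %s\n", all_positive ? "yes" : "no");
  return 0;
}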

37. Reduction
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++) {
  sum = sum + funct(k);
}
The operation and the variable are named in the reduction clause. A private copy of sum is created for each thread by the compiler; each private copy is added into sum at the end. This eliminates the need for a critical section here (a hand-coded equivalent is sketched below).
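For comparison, a minimal hand-coded sketch (not from the slides) of what the reduction clause does for you, assuming sum and funct(k) are doubles as in the fragment above:

/* hand-coded equivalent of reduction(+:sum), for illustration only */
sum = 0;
#pragma omp parallel
{
  double local = 0;                  /* per-thread partial sum */
  #pragma omp for
  for (k = 0; k < 100; k++) {
    local = local + funct(k);
  }
  #pragma omp critical               /* serialize only the final update */
  sum = sum + local;
}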

38. Single
#pragma omp parallel
{
  ...
  #pragma omp single
    structured_block
  ...
}
Only one thread executes this section; there is no guarantee of which one.

39. Single Example
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf("Thread %d starting...\n", tid);
  #pragma omp single
  {
    printf("Thread %d doing work\n", tid);
    ...
  } // end of single
  printf("Thread %d done\n", tid);
} // end of parallel section

40. Single Results
Thread 0 starting...
Thread 0 doing work
Thread 3 starting...
Thread 2 starting...
...
Thread 1 starting...
Thread 0 done
Thread 1 done
Thread 2 done
Thread 3 done
Only one thread executes the section. "nowait" was NOT specified, so the other threads wait for that thread to finish (barrier here).

41. Master
#pragma omp parallel
{
  ...
  #pragma omp master
    structured_block
  ...
}
Only one thread (the master) executes this section. "nowait" cannot be specified here; there is no barrier after this block, so the other threads will NOT wait.

42. Master Example
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf("Thread %d starting...\n", tid);
  #pragma omp master
  {
    printf("Thread %d doing work\n", tid);
    ...
  } // end of master
  printf("Thread %d done\n", tid);
} // end of parallel section

43. Is there any difference between these two approaches?
Master directive:
#pragma omp parallel
{
  ...
  #pragma omp master
    structured_block
  ...
}
Using an if statement:
#pragma omp parallel private(tid)
{
  ...
  tid = omp_get_thread_num();
  if (tid == 0)
    structured_block
  ...
}
(Neither has an implied barrier; compare with single, which has a barrier unless a nowait clause is given.)

44. Assignment 1 Assignment posted: • Part 1: a tutorial on compiling and executing sample code – on your own computer • Part 2: parallelizing matrix multiplication – on your own computer • Part 3: executing the matrix multiplication code on the cluster. Due Sunday January 31st, 2016 (Week 3).

  45. Questions

  46. More information http://openmp.org/wp/
