
Introduction to OpenMP

For a more detailed tutorial see:

http://www.openmp.org

Look at the presentations



Concepts

  • Directive-based programming

    • declare properties of language structures (sections, loops)

    • declare the scoping of variables (shared, private)

  • A few service routines

    • get information

  • Compiler options

  • Environment variables
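
As a minimal sketch of how these pieces combine (the printf body is illustrative; the directive, service routine, and environment variable are standard OpenMP):

#include <omp.h>
#include <stdio.h>

/* run with, e.g., OMP_NUM_THREADS=4 set in the environment */
int main()
{
#pragma omp parallel                /* directive: declares a parallel region */
    {
        /* service routine: get information about this thread */
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}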



OpenMP Programming Model

  • fork-join parallelism

  • Master thread spawns a team of threads as needed.



Typical OpenMP Use

  • Generally used to parallelize loops

    • Find the most time-consuming loops

    • Split iterations up between threads

/* Sequential version */
int main()
{
    double Res[1000];

    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
    return 0;
}

/* Parallel version: one directive splits the iterations across threads */
int main()
{
    double Res[1000];

#pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
    return 0;
}



Thread Interaction

  • OpenMP operates using shared memory

    • Threads communicate via shared variables

  • Unintended sharing can lead to race conditions

    • output changes due to thread scheduling

  • Control race conditions using synchronization

    • synchronization is expensive

    • change the way data is stored to minimize the need for synchronization
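
As a minimal sketch (counter is an illustrative shared variable), the commented-out update below races, while the critical version is correct at the cost of serializing every update:

int counter = 0;

#pragma omp parallel
{
    /* RACE: two threads can read the same old value and both write back */
    /* counter = counter + 1; */

    /* correct, but threads take turns */
#pragma omp critical
    counter = counter + 1;
}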



Syntax format

  • Compiler directives

    • C/C++

      • #pragma omp construct [clause [clause] …]

    • Fortran

      • C$OMP construct [clause [clause] … ]

      • !$OMP construct [clause [clause] … ]

      • *$OMP construct [clause [clause] … ]

  • Because OpenMP is expressed as directives (comments in Fortran, pragmas in C/C++), a compiler that doesn’t support OpenMP can still compile the program unchanged



Using OpenMP

  • Some compilers can insert directives automatically, given the right option

    • -qsmp=auto (IBM)

    • xlf_r and xlc do a good job (IBM)

    • some loops may speed up, some may slow down

  • A compiler option is required when you write directives yourself

    • -qsmp=omp (IBM)

    • -mp (sgi)

  • Can mix directives with automatic parallelization

    • -qsmp=auto:omp (IBM)

  • Scoping variables is the hard part!

    • shared variables, thread private variables
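
Representative compile lines built from the options above (the IBM flags are those named on this slide; -fopenmp is the GNU equivalent, shown for comparison):

xlc -qsmp=omp prog.c          compile with OpenMP directives (IBM)
xlf_r -qsmp=auto:omp prog.f   directives plus automatic parallelization (IBM)
gcc -fopenmp prog.c           GNU equivalent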



OpenMP Directives

  • 5 categories

    • Parallel Regions

    • Worksharing

    • Data Environment

    • Synchronization

    • Runtime functions / environment variables

  • Basically the same between C/C++ and Fortran



Parallel Regions

  • Create threads with omp parallel

  • Threads share A (default behavior)

  • All threads start together, then synchronize at an implicit barrier at the end of the region before serial execution continues

double A[1000];

omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    dosomething(ID, A);
}



Sections construct

  • The sections construct gives a different structured block to each thread

  • By default there is a barrier at the end; use the nowait clause to turn it off

#pragma omp parallel
#pragma omp sections
{
    X_calculation();
#pragma omp section
    y_calculation();
#pragma omp section
    z_calculation();
}



Work-sharing constructs

  • The for construct splits up loop iterations

  • By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier.

#pragma omp parallel
#pragma omp for
for (I = 0; I < N; I++)
{
    NEAT_STUFF(I);
}



Short-hand notation

  • Can combine parallel and work sharing constructs

  • There is also a “parallel sections” construct

#pragma omp parallel for
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}



A Rule

  • In order to be made parallel, a loop must have canonical “shape”:

for (index = start; index < end; index++)

    • the test may be any of <, <=, >=, >

    • the increment expression may be any of:

index++;        ++index;
index--;        --index;
index += inc;   index -= inc;
index = index + inc;   index = inc + index;   index = index - inc;
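
For instance (work() and found() are placeholder functions), the first loop below has canonical shape, while the second does not because its trip count cannot be computed up front:

/* canonical shape: parallelizable */
#pragma omp parallel for
for (i = 0; i < n; i += 2)
    work(i);

/* not canonical: break makes the trip count unpredictable */
for (i = 0; i < n; i++)
{
    if (found(i)) break;   /* break is not allowed in an omp for loop */
}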



An example

#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
    for (j = 0; j < n; j++)
        a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);

By definition, private variable values are undefined at loop entry and exit. To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses:

x[0] = complex_function();

#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < m; j++)
        x[j] = g(i, x[j-1]);     /* firstprivate: x[0] is initialized */
    answer[i] = x[m-1] - x[i];
}



Scheduling Iterations

  • The schedule clause affects how loop iterations are mapped onto threads

  • schedule(static [,chunk])

    • Deals out blocks of iterations of size “chunk” to each thread.

  • schedule(dynamic[,chunk])

    • Each thread grabs “chunk” iterations off a queue until all iterations have been handled.

  • schedule(guided[,chunk])

    • Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds.

  • schedule(runtime)

    • Schedule and chunk size taken from the OMP_SCHEDULE environment variable.
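
As a sketch, schedule(runtime) defers the choice to the environment; process() here is a placeholder loop body:

/* before running: export OMP_SCHEDULE="dynamic,4" (for example) */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++)
    process(i);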



An example

#pragma omp parallel for private(j) schedule(static, 2)
for (i = 0; i < n; i++)
    for (j = 1; j < m; j++)
        x[i][j] = g(i, x[i][j-1]);

You can play with the chunk size to address load-balancing issues, etc.



Scheduling considerations

  • Dynamic is most general and provides load balancing

  • If the choice of schedule has a (big) impact on performance, something is wrong:

    • overhead too big => the work per loop iteration is too small

  • The loop bound n can be a specification expression, not just a constant



Synchronization Directives

  • BARRIER

    • inside PARALLEL, all threads synchronize

  • CRITICAL (lock) / END CRITICAL (lock)

    • a section that can be executed by only one thread at a time

    • lock is an optional name that distinguishes several critical constructs from each other
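
A minimal C sketch with two named critical constructs (xaxis, yaxis, fx, and fy are illustrative); because the names differ, updating one sum does not block the other:

#pragma omp parallel for
for (i = 0; i < n; i++)
{
#pragma omp critical (xlock)
    xaxis += fx(i);

#pragma omp critical (ylock)
    yaxis += fy(i);
}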



An example

double area, pi, x;
int i, n;

area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++)
{
    x = (i + 0.5)/n;
#pragma omp critical
    area += 4.0/(1.0 + x*x);
}
pi = area / n;



Reductions

  • Sometimes you want each thread to compute a partial value, then combine all the partial values into a single result

  • This is done with the reduction clause

area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 0; i < n; i++)
{
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
}
pi = area / n;


Another Example

OpenMP issues to address in this code:

  • Each thread needs a different random number seed

  • count is shared, and we need the aggregate

/* A Monte Carlo algorithm for calculating pi */

int count;             /* points inside the unit quarter circle */
unsigned short xi[3];  /* random number seed */
int i;                 /* loop index */
int samples;           /* number of points to generate */
double x, y;           /* coordinates of points */
double pi;             /* estimate of pi */

xi[0] = 1;             /* these statements set up the random seed */
xi[1] = 1;
xi[2] = 0;
count = 0;
for (i = 0; i < samples; i++)
{
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);



OpenMP Version

/* A Monte Carlo algorithm for calculating pi */

int count;             /* points inside the unit quarter circle */
unsigned short xi[3];  /* random number seed */
int i;                 /* loop index */
int samples;           /* number of points to generate */
double x, y;           /* coordinates of points */
double pi;             /* estimate of pi */

omp_set_num_threads(omp_get_num_procs());
/* note: called outside the parallel region, omp_get_thread_num() returns 0,
   so every thread's firstprivate copy of xi starts from the same seed */
xi[0] = 1;  xi[1] = 1;  xi[2] = omp_get_thread_num();
count = 0;
#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++)
{
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);



An alternate version

#pragma omp parallel private(xi, tid, t, i, x, y, local_count)
{
    xi[0] = 1;  xi[1] = 1;
    xi[2] = tid = omp_get_thread_num();   /* per-thread seed */
    t = omp_get_num_threads();
    local_count = 0;
    for (i = tid; i < samples; i += t)
    {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) local_count++;
    }
#pragma omp critical
    count += local_count;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);



Conditional Execution

  • Overhead of fork/join is high

  • If a loop is small, you don’t want to parallelize it

  • But you may not know how big it is until run time

  • Conditional clause for parallel execution

    • if ( expression )

area = 0.0;
#pragma omp parallel for private(x) reduction(+:area) if (n > 5000)
for (i = 0; i < n; i++)
{
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
}
pi = area / n;



Scope Rules

  • Shared memory programming model

    • most variables are shared by default

  • Global variables are shared

  • But not everything is shared

    • stack variables in functions are private

  • a variable that is set and then used inside the DO loop is made PRIVATE

  • an array whose subscript is constant with respect to the PARALLEL DO, and which is set and then used within the DO, is PRIVATE
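
A small C sketch of these defaults (work() is a placeholder function):

double global_data[100];          /* global: shared by all threads */

void example(void)
{
    int before = 42;              /* declared before the region: shared */
#pragma omp parallel
    {
        int inside;               /* declared inside the region: private */
        inside = work(before, global_data);
    }
}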



Scope Clauses

  • The DO and for directives take extra clauses; the most important are:

    • PRIVATE (variable list)

    • REDUCTION (op: variable list)

      • op is an operation such as sum, min, or max

      • the variable must be a scalar; XLF also allows arrays



Scope Clauses (2)

  • PARALLEL, PARALLEL DO, and PARALLEL SECTIONS also accept

    • DEFAULT (PRIVATE | SHARED | NONE)

      • scope of unlisted variables is determined by this rule

    • SHARED (variable list)

    • IF (scalar logical expression)

      • with IF, directives act like a programming-language extension rather than a compiler option


A Fortran Example

      integer i, j, n
      real*8 a(n,n), b(n)

      read (1) b

!$OMP PARALLEL DO PRIVATE(i,j) SHARED(a,b,n)
      do j = 1, n
         do i = 1, n
            a(i,j) = sqrt(1.d0 + b(j)*i)
         end do
      end do
!$OMP END PARALLEL DO



Matrix Multiply

!$OMP PARALLEL DO PRIVATE(i,j,k)
      do j = 1, n
         do i = 1, n
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
         end do
      end do



Analysis

  • Outer loop is parallel: columns of c

  • Not optimal for cache use

  • Directives could be added to the inner loops as well (see the sketch below)

  • But then the granularity might be too fine
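
A C sketch of the trade-off (i, j, k, n and the n-by-n arrays a, b, c are assumed declared): with the directive on the outer loop the threads fork once; moved inward, a parallel region is created and joined n times:

/* coarse grain: one fork/join, columns of c split across threads */
#pragma omp parallel for private(i, k)
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];

/* fine grain: n fork/joins, one per outer iteration */
for (j = 0; j < n; j++)
{
#pragma omp parallel for private(k)
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
}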



OMP Functions

  • int omp_get_num_procs()

  • int omp_get_num_threads()

  • int omp_get_thread_num()

  • void omp_set_num_threads(int)
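
A short sketch that exercises all four routines, mirroring how the Monte Carlo version chose its thread count:

#include <omp.h>
#include <stdio.h>

int main()
{
    omp_set_num_threads(omp_get_num_procs());   /* one thread per processor */
#pragma omp parallel
    {
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}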



Serial Directives

  • MASTER / END MASTER

    • executed by master thread only

  • DO SERIAL / END DO SERIAL

    • loop immediately following should not be parallelized

    • useful with -qsmp=omp:auto

  • SINGLE

    • only one thread executes the block
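
A minimal C sketch contrasting MASTER and SINGLE (setup() is a placeholder); single implies a barrier at its end, master does not:

#pragma omp parallel
{
#pragma omp master
    setup();                /* master thread only; no implied barrier */

#pragma omp barrier         /* make the master's work visible to all threads */

#pragma omp single
    printf("Printed by exactly one thread\n");   /* implied barrier follows */
}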



Example Serial Execution

/* A Monte Carlo algorithm for calculating pi */

omp_set_num_threads(omp_get_num_procs());
xi[0] = 1;  xi[1] = 1;  xi[2] = omp_get_thread_num();
count = 0;
#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++)
{
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) count++;

    /* note: strictly, a single construct may not be nested inside an
       omp for region; it is shown here to illustrate its effect */
#pragma omp single
    {
        printf("Loop Iteration: %d\n", i);
    }
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);



Fortran Parallel Directives

  • PARALLEL / END PARALLEL

  • PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS

  • DO / END DO

    • work sharing directive for DO loop immediately following

  • PARALLEL DO / END PARALLEL DO

    • combined section and work sharing

