Open Multiprocessing

Dr. Bo Yuan

E-mail: yuanb@sz.tsinghua.edu.cn

OpenMP
  • An API for shared memory multiprocessing (parallel) programming in C, C++ and Fortran.
    • Supports multiple platforms (processor architectures and operating systems).
    • Higher level implementation (a block of code that should be executed in parallel).
  • A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
  • Based on preprocessor directives (#pragma)
    • Requires compiler support (see the compile sketch after this list).
    • omp.h
  • References
    • http://openmp.org/
    • https://computing.llnl.gov/tutorials/openMP/
    • http://supercomputingblog.com/openmp/
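A minimal compile sketch, assuming GCC or Clang (the source file name hello.c is illustrative); the -fopenmp flag both enables the pragmas and links the OpenMP runtime:

gcc -fopenmp hello.c -o hello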
Hello, World!

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);   /* Thread function prototype */

int main(int argc, char* argv[]) {
   /* Get number of threads from command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
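A possible run of the program above (the executable name is assumed; with 4 threads the output lines can appear in any order):

./hello 4
Hello from thread 2 of 4
Hello from thread 0 of 4
Hello from thread 3 of 4
Hello from thread 1 of 4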

Definitions

Clause: text that modifies a directive.

# pragma omp parallel [clauses]
{ code_block }

An implicit barrier is executed at the end of the parallel block.

Error Checking

#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif

Thread Team = Master + Slaves
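A minimal self-contained sketch of the same guard wrapped in helper functions; the names Get_rank and Get_count are hypothetical, not part of the lecture code:

#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helpers: return sensible serial values when the
   program is compiled without OpenMP support. */
int Get_rank(void) {
#ifdef _OPENMP
   return omp_get_thread_num();
#else
   return 0;
#endif
}

int Get_count(void) {
#ifdef _OPENMP
   return omp_get_num_threads();
#else
   return 1;
#endif
}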

The Trapezoidal Rule

/* Input: a, b, n */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   x_i = a + i*h;
   approx += f(x_i);
}
approx = h*approx;

[Figure: the area under f is split into sub-intervals, each assigned to a thread (e.g. Thread 0, Thread 2).]

Shared Memory → Shared Variables → Race Condition

# pragma omp critical
global_result += my_result;

The critical Directive

# pragma omp critical
y = f(x);
...
double f(double x) {
#  pragma omp critical
   z = g(x);
   ...
}

Cannot be executed simultaneously!

# pragma omp critical(one)
y = f(x);
...
double f(double x) {
#  pragma omp critical(two)
   z = g(x);
   ...
}

Deadlock

The atomic Directive
  • # pragma omp atomic
     x <op>= <expression>;
  • <op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
  • Higher performance than the critical directive.
  • Only a single C assignment statement is protected.
  • Only the load and store of x are protected.
  • <expression> must not reference x.
  • The forms x++, ++x, x--, and --x are also allowed.

# pragma omp atomic
x += f(y);

# pragma omp critical
x = g(x);

Can be executed simultaneously! (An atomic update does not exclude a critical section that modifies the same variable.)
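A minimal sketch of atomic protecting a shared counter (the variable name hits is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
   long hits = 0;

#  pragma omp parallel num_threads(4)
   {
      int i;
      for (i = 0; i < 100000; i++) {
#        pragma omp atomic
         hits += 1;            /* only the update of hits is protected */
      }
   }

   printf("hits = %ld\n", hits);   /* expected: 4 * 100000 = 400000 */
   return 0;
}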

Locks

/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;

void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
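A minimal sketch of the lock API around a shared accumulator (the variable and function names are illustrative):

#include <omp.h>

double global_result = 0.0;
omp_lock_t result_lock;

void Add_result(double my_result) {
   omp_set_lock(&result_lock);      /* attempt to acquire the lock */
   global_result += my_result;      /* critical section */
   omp_unset_lock(&result_lock);    /* release the lock */
}

int main(void) {
   omp_init_lock(&result_lock);     /* executed by one thread */

#  pragma omp parallel num_threads(4)
   Add_result(1.0);

   omp_destroy_lock(&result_lock);  /* executed by one thread */
   return 0;
}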

Trapezoidal Rule in OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Trap(double a, double b, int n, double* global_result_p);

int main(int argc, char* argv[]) {
   double global_result = 0.0;
   double a, b;
   int n, thread_count;

   thread_count = strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);

#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n = %d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}

Trapezoidal Rule in OpenMP

void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b-a)/n;
   local_n = n/thread_count;
   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;

   my_result = (f(local_a) + f(local_b))/2.0;
   for (i = 1; i <= local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
   }
   my_result = my_result*h;

#  pragma omp critical
   *global_result_p += my_result;
}
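The integrand f is never shown on the slides; a possible placeholder definition for testing (purely an assumption) is:

double f(double x) {
   return x*x;   /* illustrative integrand; replace with the function of interest */
}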

Scope of Variables
  • In serial programming:
    • Function-wide scope
    • File-wide scope
  • Shared Scope
    • Accessible by all threads in a team
    • Declared before a parallel directive
    • In the trapezoid program: a, b, n, global_result, thread_count, and *global_result_p
  • Private Scope
    • Only accessible by a single thread
    • Declared in the code block (or in functions called from it)
    • In the trapezoid program: my_rank, my_result, and the pointer global_result_p itself
  • A minimal sketch of the two scopes follows below.
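A minimal sketch of shared versus private scope (the variable names are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
   int shared_x = 5;   /* declared before the parallel directive: shared */

#  pragma omp parallel num_threads(2)
   {
      int private_y = omp_get_thread_num();   /* declared in the block: private */
      printf("x = %d, y = %d\n", shared_x, private_y);
   }
   return 0;
}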
Another Trap Function

double Local_trap(double a, double b, int n);

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
#     pragma omp critical
      global_result += Local_trap(a, b, n);
   }

Here the entire call to Local_trap sits inside the critical section, so the threads' computations are serialized. The version below only serializes the final addition:

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result = 0.0;   /* Private */
      my_result = Local_trap(a, b, n);
#     pragma omp critical
      global_result += my_result;
   }

The Reduction Clause
  • Reduction: a computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.
  • Note:
    • The reduction variable itself is shared.
    • A private variable is created for each thread in the team.
    • The private variables are initialized to 0 for the addition operator.

reduction(<operator>: <variable list>)

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result = Local_trap(a, b, n);
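A minimal sketch of reduction with a different operator (a product over a small illustrative array; each thread's private copy of prod starts at the multiplicative identity 1):

#include <stdio.h>

int main(void) {
   double a[4] = {1.0, 2.0, 3.0, 4.0};   /* illustrative data */
   double prod = 1.0;
   int i;

#  pragma omp parallel for num_threads(2) reduction(*: prod)
   for (i = 0; i < 4; i++)
      prod *= a[i];

   printf("prod = %f\n", prod);   /* 24.0 */
   return 0;
}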

The parallel for Directive

Parallel version:

h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
for (i = 1; i <= n-1; i++) {
   approx += f(a+i*h);
}
approx = h*approx;

Serial version:

h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   approx += f(a+i*h);
}
approx = h*approx;

  • The code block must be a for loop.
  • Iterations of the for loop are divided among threads.
  • approx is a reduction variable.
  • i is a private variable.
The parallel for Directive
  • Sounds like a truly wonderful approach to parallelizing serial programs.
  • Does not work with while or do-while loops.
    • How about converting them into for loops?
  • The number of iterations must be determined in advance.

Loops whose iteration count cannot be determined in advance:

for (; ;) {
   ...
}

for (i = 0; i < n; i++) {
   if (...) break;
   ...
}

int x, y;
#  pragma omp parallel for num_threads(thread_count)
for (x = 0; x < width; x++) {
   for (y = 0; y < height; y++) {
      finalImage[x][y] = f(x, y);
   }
}

Here only the outer loop variable x is automatically private; y is declared before the directive and is therefore shared by default, so private(y) must be added to the pragma.

Estimating π

The code uses the series π = 4(1 − 1/3 + 1/5 − 1/7 + …). Serial version:

double factor = 1.0;
double sum = 0.0;
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

A first parallel attempt:

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for \
      num_threads(thread_count) \
      reduction(+: sum)
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

Does this work? No: the value of factor computed in one iteration is used by the next iteration, a loop-carried dependence.

Estimating π

Two equivalent ways to compute factor directly from k, removing the dependence:

if (k % 2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

factor = (k % 2 == 0) ? 1.0 : -1.0;
sum += factor/(2*k+1);

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) private(factor)
for (k = 0; k < n; k++) {
   if (k % 2 == 0)
      factor = 1.0;
   else
      factor = -1.0;
   sum += factor/(2*k+1);
}
pi_approx = 4.0*sum;

Scope Matters
  • With the default(none) clause, we need to specify the scope of each variable used in the block that was declared outside the block.
  • The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+: sum) private(k, factor) shared(n)
for (k = 0; k < n; k++) {
   if (k % 2 == 0)
      factor = 1.0;
   else
      factor = -1.0;
   sum += factor/(2*k+1);
}
pi_approx = 4.0*sum;

Note: the initial value of the private factor inside the loop is unspecified; it does not inherit the 1.0 assigned before the block.

Bubble Sort

for (len = n; len >= 2; len--)
   for (i = 0; i < len-1; i++)
      if (a[i] > a[i+1]) {
         tmp = a[i];
         a[i] = a[i+1];
         a[i+1] = tmp;
      }

  • Can we make it faster?
  • Can we parallelize the outer loop?
  • Can we parallelize the inner loop?
Odd-Even Sort

Any opportunities for parallelism?

void Odd_even_sort(int a[], int n) {
   int phase, i, temp;

   for (phase = 0; phase < n; phase++)
      if (phase % 2 == 0) {   /* Even phase */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                /* Odd phase */
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
}

Odd-Even Sort in OpenMP

for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {   /* Even phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                /* Odd phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}

Odd-Even Sort in OpenMP

#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, temp, phase)
   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {   /* Even phase */
#        pragma omp for
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                /* Odd phase */
#        pragma omp for
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
   }

Here the thread team is forked only once; the inner # pragma omp for directives share out the iterations among the existing threads, with an implicit barrier at the end of each worksharing loop.

Data Partitioning

[Figure: iterations assigned to threads 0, 1, and 2 under a block partition (each thread gets one contiguous chunk of iterations) and under a cyclic partition (iterations are dealt out to the threads in round-robin fashion).]
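An illustrative assignment, assuming 12 iterations and 3 threads (the numbers are not from the slides):

Block:  Thread 0 → iterations 0-3,   Thread 1 → 4-7,        Thread 2 → 8-11
Cyclic: Thread 0 → 0, 3, 6, 9        Thread 1 → 1, 4, 7, 10  Thread 2 → 2, 5, 8, 11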

Scheduling Loops

sum = 0.0;
for (i = 0; i <= n; i++)
   sum += f(i);

double f(int i) {
   int j, start = i*(i+1)/2, finish = start + i;
   double return_val = 0.0;

   for (j = start; j <= finish; j++) {
      return_val += sin(j);
   }
   return return_val;
}

Load Balancing: f(i) performs i+1 calls to sin, so the cost of an iteration grows with i; dividing the iterations evenly by count does not divide the work evenly.

The schedule clause

sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(static, 1)
for (i = 0; i < n; i++)
   sum += f(i);

schedule(static, chunksize) assigns chunks of chunksize consecutive iterations to the threads in round-robin fashion; the default static schedule behaves roughly like schedule(static, total_iterations/thread_count).

The dynamic and guided Types
  • In a dynamic schedule:
    • Iterations are broken into chunks of chunksize consecutive iterations.
    • Default chunksize value: 1
    • Each thread executes a chunk.
    • When a thread finishes a chunk, it requests another one.
  • In a guided schedule:
    • Each thread executes a chunk.
    • When a thread finishes a chunk, it requests another one.
    • As chunks are completed, the size of the new chunks decreases.
    • Each new chunk's size is approximately the number of iterations remaining divided by the number of threads.
    • The size of chunks decreases down to chunksize or 1 (default); see the sketch after this list.
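A minimal sketch of these clauses applied to the earlier sum loop (reusing sum, f, i, n, and thread_count from the load-balancing example above; the chunk size 2 is illustrative):

/* dynamic: chunks of 2 consecutive iterations, handed out to threads on demand */
sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(dynamic, 2)
for (i = 0; i < n; i++)
   sum += f(i);

/* guided: chunk sizes start large and shrink as iterations are completed */
sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(guided)
for (i = 0; i < n; i++)
   sum += f(i);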
Which schedule?
  • The optimal schedule depends on:
    • The type of problem
    • The number of iterations
    • The number of threads
  • Overhead
    • guided>dynamic>static
    • If you are getting satisfactory results (e.g., close to the theoretically maximum speedup) without a schedule clause, go no further.
  • The Cost of Iterations
    • If it is roughly the same, use the default schedule.
    • If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.
    • If it cannot be determined in advance, try to explore different options.
Performance Issue

[Figure: matrix-vector multiplication y = A·x, where A is m-by-n.]

#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j) shared(A, x, y, m, n)
   for (i = 0; i < m; i++) {
      y[i] = 0.0;
      for (j = 0; j < n; j++)
         y[i] += A[i][j]*x[j];
   }

Performance Issue

[Figure: cache misses and false sharing.]

Performance Issue
  • 8,000,000-by-8 (m-by-n)
    • y has 8,000,000 elements → potentially a large number of write misses.
  • 8-by-8,000,000
    • x has 8,000,000 elements → potentially a large number of read misses.
  • 8-by-8,000,000
    • y has 8 elements (8 doubles) → could be stored in the same cache line (64 bytes).
    • Potentially serious false sharing effect for multiple processors.
  • 8000-by-8000
    • y has 8,000 elements (8,000 doubles).
    • With a block partition over 4 threads: Thread 2 handles y[4000] to y[5999], Thread 3 handles y[6000] to y[7999].
    • Only the cache line at the boundary, {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]}, is shared between the two threads.
    • The effect of false sharing is highly unlikely.
Thread Safety
  • How to generate random numbers in C?
    • First, call srand() with an integer seed.
    • Second, call rand() to create a sequence of random numbers.
  • Pseudorandom Number Generator (PRNG)
  • Is it thread safe?
    • Can it be simultaneously executed by multiple threads without causing problems?
    • rand() relies on shared internal state, so concurrent calls from several threads can interfere with each other (see the sketch below).
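A minimal sketch of one thread-safe alternative using the POSIX reentrant generator rand_r with a per-thread seed (the seeding scheme is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
#  pragma omp parallel num_threads(4)
   {
      /* Each thread owns its seed, so there is no shared PRNG state. */
      unsigned int seed = 1u + (unsigned int) omp_get_thread_num();
      int r = rand_r(&seed);          /* POSIX reentrant variant of rand() */
      printf("thread %d drew %d\n", omp_get_thread_num(), r);
   }
   return 0;
}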

Foster’s Methodology
  • Partitioning
    • Divide the computation and the data into small tasks.
    • Identify tasks that can be executed in parallel.
  • Communication
    • Determine what communication needs to be carried out.
    • Local Communication vs. Global Communication
  • Agglomeration
    • Group tasks into larger tasks.
    • Reduce communication.
    • Task Dependence
  • Mapping
    • Assign the composite tasks to processes/threads.
The n-body Problem
  • To predict the motion of a group of objects that interact with each other gravitationally over a period of time.
    • Inputs: Mass, Position and Velocity
  • Astrophysicist
    • The positions and velocities of a collection of stars
  • Chemist
    • The positions and velocities of a collection of molecules
The Basic Algorithm

Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

Computing the total force on each particle:

for each particle q {
   forces[q][0] = forces[q][1] = 0;
   for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}

The Reduced Algorithm

By Newton's third law, the force of particle k on particle q is the negative of the force of q on k, so each pair is computed only once:

for each particle q
   forces[q][0] = forces[q][1] = 0;
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}

Position and Velocity

/* Forward Euler update with time step delta_t */
for each particle q {
   pos[q][0] += delta_t*vel[q][0];
   pos[q][1] += delta_t*vel[q][1];
   vel[q][0] += delta_t*forces[q][0]/masses[q];
   vel[q][1] += delta_t*forces[q][1]/masses[q];
}

Communications

[Figure: communication between the tasks for particles q and r. Quantities involved: positions sq(t), sr(t), velocities vq(t), vr(t), and forces Fq(t), Fr(t) at time t, and the corresponding quantities sq(t + △t), vq(t + △t), sr(t + △t), vr(t + △t), Fq(t + △t), Fr(t + △t) at time t + △t.]

Agglomeration

[Figure: the tasks for particle q (sq, vq, Fq) and for particle r (sr, vr, Fr) are grouped into one composite task per particle; between timesteps t and t + △t the composite tasks exchange the positions sq and sr.]

Parallelizing the Basic Solver

#  pragma omp parallel
   for each timestep {
      if (timestep output) {
#        pragma omp single nowait
         Print positions and velocities of particles;
      }
#     pragma omp for
      for each particle q
         Compute total force on q;
#     pragma omp for
      for each particle q
         Compute position and velocity of q;
   }

Race Conditions?

Parallelizing the Reduced Solver

#  pragma omp for
   for each particle q
      forces[q][0] = forces[q][1] = 0;

#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         forces[q][0] += force_qk[0];
         forces[q][1] += force_qk[1];
         forces[k][0] -= force_qk[0];
         forces[k][1] -= force_qk[1];
      }
   }

Does it work properly?
  • Consider 2 threads and 4 particles.
  • Thread 1 is assigned particle 0 and particle 1.
  • Thread 2 is assigned particle 2 and particle 3.
  • F3 = -f03 - f13 - f23
  • Who will calculate f03 and f13?
  • Who will calculate f23?
  • Any race conditions?
Thread Contributions

[Figure: which thread contributes to which particle's force, with 3 threads, 6 particles, and a block partition.]

First Phase

#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         loc_forces[my_rank][q][0] += force_qk[0];
         loc_forces[my_rank][q][1] += force_qk[1];
         loc_forces[my_rank][k][0] -= force_qk[0];
         loc_forces[my_rank][k][1] -= force_qk[1];
      }
   }

Second Phase

#  pragma omp for
   for (q = 0; q < n; q++) {
      forces[q][0] = forces[q][1] = 0;
      for (thread = 0; thread < thread_count; thread++) {
         forces[q][0] += loc_forces[thread][q][0];
         forces[q][1] += loc_forces[thread][q][1];
      }
   }

  • In the first phase, each thread carries out the same calculations as before, but the values are stored in its own array of forces (loc_forces).
  • In the second phase, the thread that has been assigned particle q adds the contributions that have been computed by the different threads.

Race Conditions?

Evaluating the OpenMP Codes
  • In the reduced code:
    • Loop 1: Initialization of the loc_forces array
    • Loop 2: The first phase of the computation of forces
    • Loop 3: The second phase of the computation of forces
    • Loop 4: The updating of positions and velocities
  • Which schedule should be used?
Review
  • What are the major differences between MPI and OpenMP?
  • What is the scope of a variable?
  • What is a reduction variable?
  • How do we ensure mutual exclusion in a critical section?
  • What are the common loop scheduling options?
  • What factors may potentially affect the performance of an OpenMP program?
  • What is a thread-safe function?