ECE1747 Parallel Programming

Shared Memory Multithreading: Pthreads

Shared Memory
  • All threads access the same shared memory data space.

[Figure: processors proc1 … procN all connected to a single Shared Memory Address Space]

Shared Memory (continued)
  • Concretely, it means that a variable x, a pointer p, or an array a[] refers to the same object, no matter which processor the reference originates from.
  • We have more or less implicitly assumed this to be the case in earlier examples.
Shared Memory

[Figure: a single copy of array a in shared memory, accessible to proc1 … procN]

Distributed Memory - Message Passing

The alternative model to shared memory.

[Figure: each processor proc1 … procN has its own memory mem1 … memN, each holding its own copy of a; the processors communicate over a network]

Shared Memory vs. Message Passing
  • The same terminology is used to distinguish hardware.
  • For us: distinguish programming models, not hardware.
Programming vs. Hardware
  • One can implement
    • a shared memory programming model
    • on shared or distributed memory hardware
    • (also in software or in hardware)
  • One can implement
    • a message passing programming model
    • on shared or distributed memory hardware
Portability of programming models

[Figure: both the shared memory programming model and the message passing programming model can be implemented on either a shared memory machine or a distributed memory machine]

Shared Memory Programming: Important Point to Remember
  • No matter what the implementation, it conceptually looks like shared memory.
  • There may be some (important) performance differences.
Multithreading
  • The user has explicit control over threads.
  • Good: this control can be exploited for performance.
  • Bad: the user has to deal with it.
Pthreads
  • POSIX standard shared-memory multithreading interface.
  • Provides primitives for process management and synchronization.
What does the user have to do?
  • Decide how to decompose the computation into parallel parts.
  • Create (and destroy) processes to support that decomposition.
  • Add synchronization to make sure dependences are covered.
General Thread Structure
  • Typically, a thread is a concurrent execution of a function or a procedure.
  • So, your program needs to be restructured such that parallel parts form separate procedures or functions.
Example of Thread Creation (contd.)

[Figure: main() calls pthread_create(func), which starts a new thread that executes func()]

Thread Joining Example

    void *func(void *) { ….. }

    pthread_t id;
    int X;

    pthread_create(&id, NULL, func, &X);
    …..
    pthread_join(id, NULL);
    …..
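Putting these fragments together, a minimal self-contained sketch of the create/join pattern (the function name worker, the value 42, and the printf are illustrative additions, not from the slides):

    #include <stdio.h>
    #include <pthread.h>

    /* Thread body: receives a pointer argument, returns a result pointer (unused here). */
    void *worker(void *arg)
    {
        int x = *(int *)arg;
        printf("thread sees x = %d\n", x);
        return NULL;
    }

    int main(void)
    {
        pthread_t id;
        int X = 42;

        pthread_create(&id, NULL, worker, &X);   /* start the thread, passing &X */
        /* ... main can do other work here ... */
        pthread_join(id, NULL);                  /* wait for the thread to finish */
        return 0;
    }

(Compile with cc prog.c -pthread.)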

Example of Thread Creation (contd.)

[Figure: main() calls pthread_create(func); the new thread runs func() and finishes with pthread_exit(), while main() waits in pthread_join(id)]

Sequential SOR

    for some number of timesteps/iterations {

        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j]
                                    + grid[i][j-1] + grid[i][j+1] );

        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                grid[i][j] = temp[i][j];
    }

Parallel SOR
  • First (i,j) loop nest can be parallelized.
  • Second (i,j) loop nest can be parallelized.
  • Must wait to start the second loop nest until all processors have finished the first.
  • Must wait to start the first loop nest of the next iteration until all processors have finished the second loop nest of the previous iteration.
  • Give n/p rows to each processor.
Pthreads SOR: Parallel parts (1)

    void* sor_1(void *s)
    {
        int slice = (int) s;
        int from = (slice*n)/p;
        int to = ((slice+1)*n)/p;
        int i, j;

        for( i=from; i<to; i++ )
            for( j=0; j<n; j++ )
                temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j]
                                    + grid[i][j-1] + grid[i][j+1] );
        return NULL;
    }

Pthreads SOR: Parallel parts (2)

    void* sor_2(void *s)
    {
        int slice = (int) s;
        int from = (slice*n)/p;
        int to = ((slice+1)*n)/p;
        int i, j;

        for( i=from; i<to; i++ )
            for( j=0; j<n; j++ )
                grid[i][j] = temp[i][j];
        return NULL;
    }

Pthreads SOR: main

    for some number of timesteps {

        for( i=0; i<p; i++ )
            pthread_create(&thrd[i], NULL, sor_1, (void *)i);
        for( i=0; i<p; i++ )
            pthread_join(thrd[i], NULL);

        for( i=0; i<p; i++ )
            pthread_create(&thrd[i], NULL, sor_2, (void *)i);
        for( i=0; i<p; i++ )
            pthread_join(thrd[i], NULL);
    }

Summary: Thread Management
  • pthread_create(): creates a new thread that executes a given function with a given argument; the thread identifier is returned through the first parameter.
  • pthread_exit(): terminates the calling thread.
  • pthread_join(): waits for the thread with a particular thread identifier to terminate.
Summary: Program Structure
  • Encapsulate parallel parts in functions.
  • Use function arguments to parameterize what a particular thread does.
  • Call pthread_create() with the function and arguments, and save the resulting thread identifier.
  • Call pthread_join() with that thread identifier.
Pthreads Synchronization
  • Create/exit/join
    • provide some form of synchronization,
    • at a very coarse level,
    • requires thread creation/destruction.
  • Need for finer-grain synchronization
    • mutex locks,
    • condition variables.
Use of Mutex Locks
  • To implement critical sections (a minimal sketch follows below).
  • Pthreads provides only exclusive locks.
  • Some other systems allow shared-read, exclusive-write locks.
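As a minimal illustration (not from the slides), a critical section protecting a shared counter with a Pthreads mutex; every access to counter goes through the same lock, so no increments are lost:

    #include <pthread.h>

    int counter = 0;                                    /* shared data */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects counter */

    void *increment(void *arg)
    {
        int i;
        for( i = 0; i < 100000; i++ ) {
            pthread_mutex_lock(&lock);     /* enter critical section */
            counter++;
            pthread_mutex_unlock(&lock);   /* leave critical section */
        }
        return NULL;
    }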
Barrier Synchronization
  • A wait at a barrier causes a thread to wait until all threads have performed a wait at the barrier.
  • At that point, they all proceed.
Implementing Barriers in Pthreads
  • Count the number of arrivals at the barrier.
  • Wait if this is not the last arrival.
  • Make everyone unblock if this is the last arrival.
  • Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.
Implementing Barriers in Pthreads

    void barrier()
    {
        pthread_mutex_lock(&mutex_arr);
        arrived++;
        if( arrived < N ) {
            pthread_cond_wait(&cond, &mutex_arr);
        }
        else {
            pthread_cond_broadcast(&cond);
            arrived = 0;   /* be prepared for next barrier */
        }
        pthread_mutex_unlock(&mutex_arr);
    }

Parallel SOR with Barriers (1 of 2)

    void* sor(void* arg)
    {
        int slice = (int)arg;
        int from = (slice * (n-1))/p + 1;
        int to = ((slice+1) * (n-1))/p + 1;

        for some number of iterations { … }
    }

Parallel SOR with Barriers (2 of 2)

    for( i=from; i<to; i++ )
        for( j=1; j<n; j++ )
            temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j]
                                + grid[i][j-1] + grid[i][j+1] );
    barrier();

    for( i=from; i<to; i++ )
        for( j=1; j<n; j++ )
            grid[i][j] = temp[i][j];
    barrier();

Parallel SOR with Barriers: main

    int main(int argc, char *argv[])
    {
        pthread_t thrd[p];
        int i;

        /* Initialize mutex and condition variables */

        for( i=0; i<p; i++ )
            pthread_create(&thrd[i], &attr, sor, (void*)i);
        for( i=0; i<p; i++ )
            pthread_join(thrd[i], NULL);

        /* Destroy mutex and condition variables */
    }

Note again
  • Many shared memory programming systems (other than Pthreads) have barriers as a basic primitive.
  • If they do, use the built-in barrier rather than constructing it yourself (see the sketch below).
  • The built-in implementation may be more efficient than what you can do yourself.
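As an aside (not on the slides): barriers were added to later versions of the POSIX threads standard, so on current systems the built-in primitive can be used directly. A minimal sketch, assuming p worker threads:

    #include <pthread.h>

    pthread_barrier_t bar;

    /* once, before creating the workers */
    pthread_barrier_init(&bar, NULL, p);

    /* inside each worker, wherever barrier() is called in the SOR code above */
    pthread_barrier_wait(&bar);

    /* once, after joining all the workers */
    pthread_barrier_destroy(&bar);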
Busy Waiting
  • Not an explicit part of the API.
  • Available in a general shared memory programming environment.
Busy Waiting

    initially: flag = 0;

    P1: produce data;
        flag = 1;

    P2: while( !flag ) ;
        consume data;

Use of Busy Waiting
  • On the surface, simple and efficient.
  • In general, not a recommended practice.
  • Often leads to messy and unreadable code (blurs data/synchronization distinction).
  • May be inefficient, since a spinning thread consumes processor cycles.
Private Data in Pthreads
  • To make a variable private in Pthreads, you need to make an array out of it.
  • Index the array by thread identifier, which you can get with the pthread_self() call, or by an index passed in at thread creation (see the sketch below).
  • Not very elegant or efficient.
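A minimal sketch of this idiom (illustrative only): since a pthread_t is not directly usable as an array index, the sketch passes each thread a small integer index at creation; P and the array name partial_sum are assumptions, not from the slides.

    #include <pthread.h>

    #define P 4                    /* number of threads (assumed) */

    double partial_sum[P];         /* one "private" slot per thread */

    void *work(void *arg)
    {
        int me = (int)(long)arg;   /* this thread's index, 0..P-1 */
        partial_sum[me] = 0.0;     /* only thread 'me' touches this slot */
        /* ... accumulate into partial_sum[me] ... */
        return NULL;
    }

    /* creation: pthread_create(&thrd[i], NULL, work, (void *)(long)i); */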
Other Primitives in Pthreads
  • Set the attributes of a thread.
  • Set the attributes of a mutex lock.
  • Set scheduling parameters.

ECE 1747 Parallel Programming

Machine-Independent Performance Optimization Techniques

Returning to Sequential vs. Parallel
  • Sequential execution time: t seconds.
  • Startup overhead of parallel execution: t_st seconds (depends on architecture)
  • (Ideal) parallel execution time: t/p + t_st.
  • If t/p + t_st > t, there is no gain (see the worked example below).
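As an illustrative example (numbers not from the slides): with t = 8 s, p = 4, and t_st = 1 s, parallel execution takes 8/4 + 1 = 3 s, a speedup of about 2.7 rather than the ideal 4; with t = 0.5 s and the same overhead, 0.5/4 + 1 = 1.125 s > 0.5 s, so parallel execution is a net loss.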
General Idea
  • Parallelism limited by dependences.
  • Restructure code to eliminate or reduce dependences.
  • Sometimes possible by compiler, but good to know how to do it by hand.
Summary
  • Reorganize code such that
    • dependences are removed or reduced
    • large pieces of parallel work emerge
    • loop bounds become known
  • Code can become messy … there is a point of diminishing returns.
Factors that Determine Speedup
  • Characteristics of parallel code
    • granularity
    • load balance
    • locality
    • communication and synchronization
Granularity
  • Granularity = size of the program unit that is executed by a single processor.
  • May be a single loop iteration, a set of loop iterations, etc.
  • Fine granularity leads to:
    • (positive) ability to use lots of processors
    • (positive) finer-grain load balancing
    • (negative) increased overhead
Granularity and Critical Sections
  • Small granularity => more processors => more critical section accesses => more contention.
Issues in Performance of Parallel Parts
  • Granularity.
  • Load balance.
  • Locality.
  • Synchronization and communication.
Load Balance
  • Load imbalance = difference in execution time across processors between barriers.
  • Execution time may not be predictable.
    • Regular data parallel: yes.
    • Irregular data parallel or pipeline: perhaps.
    • Task queue: no.
Static vs. Dynamic
  • Static: done once, by the programmer
    • block, cyclic, etc.
    • fine for regular data parallel
  • Dynamic: done at runtime
    • task queue
    • fine for unpredictable execution times
    • usually high overhead
  • Semi-static: done at run-time, once or periodically
Choice is not inherent
  • MM (matrix multiplication) or SOR could be done using task queues: put all iterations in a queue.
    • In heterogeneous environment.
    • In multitasked environment.
  • TSP could be done using static partitioning: give a length-1 path to each processor.
    • If we did exhaustive search.
Static Load Balancing
  • Block
    • best locality
    • possibly poor load balance
  • Cyclic
    • better load balance
    • worse locality
  • Block-cyclic
    • load balancing advantages of cyclic (mostly)
    • better locality (see later); a block vs. cyclic sketch follows below
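To make the distinction concrete, a sketch (not from the slides) of how a thread with index id out of p threads picks its iterations of an n-iteration loop; process() stands for the loop body:

    /* Block: thread 'id' gets one contiguous chunk of the n iterations. */
    int from = (id * n) / p;
    int to   = ((id + 1) * n) / p;
    for( i = from; i < to; i++ )
        process(i);

    /* Cyclic: thread 'id' gets iterations id, id+p, id+2p, ... */
    for( i = id; i < n; i += p )
        process(i);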
Dynamic Load Balancing (1 of 2)
  • Centralized: single task queue (see the sketch below).
    • Easy to program
    • Excellent load balance
  • Distributed: task queue per processor.
    • Less communication/synchronization
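A minimal sketch of the centralized scheme (illustrative; num_tasks and do_task() are assumed). The "queue" is just a shared iteration counter protected by a mutex, sometimes called self-scheduling:

    #include <pthread.h>

    int next_task = 0;                                   /* shared queue head */
    pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;   /* protects next_task */

    void *worker(void *arg)
    {
        for( ;; ) {
            pthread_mutex_lock(&qlock);
            int task = next_task++;        /* grab the next unit of work */
            pthread_mutex_unlock(&qlock);

            if( task >= num_tasks )        /* no work left */
                break;
            do_task(task);
        }
        return NULL;
    }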
Dynamic Load Balancing (2 of 2)
  • Task stealing:
    • Processes normally remove and insert tasks from their own queue.
    • When queue is empty, remove task(s) from other queues.
      • Extra overhead and programming difficulty.
      • Better load balancing.
Semi-static Load Balancing
  • Measure the cost of program parts.
  • Use measurement to partition computation.
  • Done once, done every iteration, done every n iterations.
Molecular Dynamics (MD)
  • Simulation of a set of bodies under the influence of physical laws.
  • Atoms, molecules, celestial bodies, ...
  • All have the same basic structure.
Molecular Dynamics (Skeleton)

    for some number of timesteps {
        for all molecules i
            for all other molecules j
                force[i] += f( loc[i], loc[j] );
        for all molecules i
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics
  • To reduce the amount of computation, account only for interactions with nearby molecules.
Molecular Dynamics (continued)

    for some number of timesteps {
        for all molecules i
            for all nearby molecules j
                force[i] += f( loc[i], loc[j] );
        for all molecules i
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics (continued)

    for each molecule i:
        count[i]  -  number of nearby molecules
        index[j]  -  array of indices of nearby molecules ( 0 <= j < count[i] )
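Declared concretely in C, the neighbor list might look like the following sketch (NUM_MOL and MAX_NEIGHBORS are assumed constants; the slides write index[j] as shorthand for the per-molecule list written index[i][j] here):

    #define NUM_MOL       1024     /* number of molecules (assumed) */
    #define MAX_NEIGHBORS   64     /* assumed upper bound on neighbors */

    int    count[NUM_MOL];                 /* count[i]: number of nearby molecules of i */
    int    index[NUM_MOL][MAX_NEIGHBORS];  /* index[i][j]: j-th neighbor of molecule i  */
    double force[NUM_MOL], loc[NUM_MOL];

    /* force computation using the neighbor list (inside the timestep loop) */
    for( i = 0; i < NUM_MOL; i++ )
        for( j = 0; j < count[i]; j++ )
            force[i] += f( loc[i], loc[ index[i][j] ] );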

Molecular Dynamics (continued)

    for some number of timesteps {
        for( i=0; i<num_mol; i++ )
            for( j=0; j<count[i]; j++ )
                force[i] += f( loc[i], loc[index[j]] );
        for( i=0; i<num_mol; i++ )
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics (simple)

    for some number of timesteps {

        #pragma omp parallel for private(j)   /* j must be private to each thread */
        for( i=0; i<num_mol; i++ )
            for( j=0; j<count[i]; j++ )
                force[i] += f( loc[i], loc[index[j]] );

        #pragma omp parallel for
        for( i=0; i<num_mol; i++ )
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics (simple)
  • Simple to program.
  • Possibly poor load balance
    • block distribution of i iterations (molecules)
    • could lead to uneven neighbor distribution
    • cyclic does not help
Better Load Balance
  • Assign iterations such that each processor has ~ the same number of neighbors.
  • Array of “assign records”
    • size: number of processors
    • two elements:
      • beginning i value (molecule)
      • ending i value (molecule)
  • Recompute the partition periodically (see the sketch below).
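A sketch of how such a partition might be computed (not from the slides; the slides access the records as assign[pr]->b, while this sketch uses a plain array of structs, with P, num_mol, and count[] assumed globals). Each processor gets a contiguous range of molecules whose neighbor counts add up to roughly total/P:

    struct assign_record { int b, e; } assign[P];   /* molecules b..e-1 owned by each processor */

    void compute_assignment(void)
    {
        int i, pr, total = 0;
        for( i = 0; i < num_mol; i++ )
            total += count[i];              /* total amount of force work */

        int per_proc = total / P;           /* target work per processor */
        i = 0;
        for( pr = 0; pr < P; pr++ ) {
            int work = 0;
            assign[pr].b = i;
            /* last processor takes everything that is left */
            while( i < num_mol && (work < per_proc || pr == P-1) ) {
                work += count[i];
                i++;
            }
            assign[pr].e = i;
        }
    }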
Molecular Dynamics (continued)

    for some number of timesteps {

        #pragma omp parallel
        {
            int pr = omp_get_thread_num();
            int i, j;                       /* loop indices private to each thread */
            for( i=assign[pr]->b; i<assign[pr]->e; i++ )
                for( j=0; j<count[i]; j++ )
                    force[i] += f( loc[i], loc[index[j]] );
        }

        #pragma omp parallel for
        for( i=0; i<num_mol; i++ )
            loc[i] = g( loc[i], force[i] );
    }

Frequency of Balancing
  • Every time the neighbor list is recomputed:
    • once during initialization.
    • every iteration.
    • every n iterations.
  • Extra overhead vs. better approximation and better load balance.
Summary
  • Parallel code optimization
    • Critical section accesses.
    • Granularity.
    • Load balance.