# CS 584 - PowerPoint PPT Presentation

1 / 64

CS 584. Designing Parallel Algorithms. Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit from a methodical approach. Framework for algorithm design

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

CS 584

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

### Designing Parallel Algorithms

• Designing a parallel algorithm is not easy.

• There is no recipe or magical ingredient

• Except creativity

• We can benefit from a methodical approach.

• Framework for algorithm design

• Most problems have several parallel solutions which may be totally different from the best sequential algorithm.

### PCAM Algorithm Design

• 4 Stages to designing a parallel algorithm

• Partitioning

• Communication

• Agglomeration

• Mapping

• P & C focus on concurrency and scalability.

• A & M focus on locality and performance.

### PCAM Algorithm Design

• Partitioning

• Computation and data are decomposed.

• Communication

• Agglomeration

• Combining of tasks for performance

• Mapping

• Assignment of tasks to processors

### Partitioning

• Ignore the number of processors and the target architecture.

• Expose opportunities for parallelism.

• Divide up both the computation and data

• Can take two approaches

• domain decomposition

• functional decomposition

### Domain Decomposition

• Start algorithm design by analyzing the data

• Divide the data into small pieces

• Approximately equal in size

• Then partition the computation by associating it with the data.

• Communication issues may arise as one task needs the data from another task.

1

4

ò

+

2

1

x

0

### Domain Decomposition

• Evaluate the definite integral.

0

1

### Split up the domain

0

1

Split up the domain

0

1

Split up the domain

Now each

evaluates

the integral

in their range.

All that is

left is to

sum up each

for the total.

0

0.25

0.5

0.75

1

### Domain Decomposition

• Consider dividing up a 3-D grid

• What issues arise?

• Other issues?

• What if your problem has more than one data structure?

• Different problem phases?

• Replication?

### Functional Decomposition

• Focus on the computation

• Divide the computation into disjoint tasks

• Avoid data dependency among tasks

• After dividing the computation, examine the data requirements of each task.

### Functional Decomposition

• Not as natural as domain decomposition

• Consider search problems

• Often functional decomposition is very useful at a higher level.

• Climate modeling

• Ocean simulation

• Hydrology

• Atmosphere, etc.

### Partitioning Checklist

• Define a LOT of tasks?

• Avoid redundant computation and storage?

• Does the number of tasks scale with the problem size?

• Have you identified several alternative partitioning schemes?

### Communication

• The information flow between tasks is specified in this stage of the design

• Remember:

• Data dependencies may limit concurrency.

### Communication

• Define Channel

• Link the producers with the consumers.

• Consider the costs

• Intellectual

• Physical

• Distribute the communication.

• Specify the messages that are sent.

### Communication Patterns

• Local vs. Global

• Structured vs. Unstructured

• Static vs. Dynamic

• Synchronous vs. Asynchronous

### Local Communication

• Communication within a neighborhood.

Algorithm choice determines communication.

### Global Communication

• Not localized.

• Examples

• All-to-All

• Master-Worker

1

5

2

3

7

### Avoiding Global Communication

• Distribute the communication and computation

1

5

2

3

7

13

10

3

1

5

3

7

2

1

7

8

3

5

1

5

3

2

1

### Divide and Conquer

• Partition the problem into two or more subproblems

• Partition each subproblem, etc.

Results in structured nearest neighbor communication pattern.

### Structured Communication

• Is there a pattern?

### Unstructured Communication

• No regular pattern that can be exploited.

• Examples

• Unstructured Grid

• Resolution changes

Complicates the next stages of design

### Synchronous Communication

• Both consumers and producers are aware when communication is required

• Explicit and simple

t = 1

t = 2

t = 3

### Asynchronous Communication

• Timing of send/receive is unknown.

• No pattern

• Consider: very large data structure

• Distribute among computational tasks (polling)

• Shared Memory

### Problems to Avoid

• A centralized algorithm

• Distribute the computation

• Distribute the communication

• A sequential algorithm

• Seek for concurrency

• Divide and conquer

• Small, equal sized subproblems

### Communication Design Checklist

• Is communication balanced?

• Is communication limited to neighborhoods?

• Restructure global to local if possible.

• Can communications proceed concurrently?

• Can the algorithm proceed concurrently?

• Find the algorithm with most concurrency.

• Be careful!!!

### Agglomeration

• Partition and Communication steps were abstract

• Agglomeration moves to concrete.

• Combine tasks to execute efficiently on some parallel computer.

• Consider replication.

### Agglomeration Goals

• Reduce communication costs by

• increasing computation

• decreasing/increasing granularity

• Retain flexibility for mapping and scaling.

• Reduce software engineering costs.

### Changing Granularity

• A large number of tasks does not necessarily produce an efficient algorithm.

• We must consider the communication costs.

• Reduce communication by

• sending less messages (batching)

### Surface to Volume Effects

• Communication is proportional to the surface of the subdomain.

• Computation is proportional to the volume of the subdomain.

• Increasing computation will often decrease communication.

How many messages total?

How much data is sent?

How many messages total?

How much data is sent?

### Replicating Computation

• Trade-off replicated computation for reduced communication.

• Replication will often reduce execution time as well.

s = sum

How many steps?

### Using Replication

Butterfly to Hypercube

### Avoid Communication

• Look for tasks that cannot execute concurrently because of communication requirements.

• Replication can help accomplish two tasks at the same time, like:

• Summation

### Preserve Flexibility

• Create more tasks than processors.

• Overlap communication and computation.

• Don't incorporate unnecessary limits on the number of tasks.

### Agglomeration Checklist

• Reduce communication costs by increasing locality.

• Do benefits of replication outweigh costs?

• Does replication compromise scalability?

• Does the number of tasks still scale with problem size?

• Is there still sufficient concurrency?

### Mapping

• Specify where each task is to operate.

• Mapping may need to change depending on the target architecture.

• Mapping is NP-complete.

### Mapping

• Goal: Reduce Execution Time

• Concurrent tasks ---> Different processors

• High communication ---> Same processor

• Mapping is a game of trade-offs.

### Mapping

• Many domain-decomposition problems make mapping easy.

• Grids

• Arrays

• etc.

### Mapping

• Unstructured or complex domain decomposition based algorithms are difficult to map.

### Other Mapping Problems

• Variable amounts of work per task

• Unstructured communication

• Heterogeneous processors

• different speeds

• different architectures

• Static

• Determined a priori

• Based on work, processor speed, etc.

• Probabilistic

• Random

• Dynamic

• Based on a priori knowledge.

• Goal: Equal WORK on all processors

• Algorithms:

• Basic

• Recursive Bisection

### Basic

• Divide up the work based on

• Work required

• Processor speed

æ

ö

ç

÷

p

=

i

r

R

ç

÷

å

i

p

ç

÷

è

ø

i

### Recursive Bisection

• Divide work in half recursively.

• Based on physical coordinates.

### Dynamic Algorithms

• Local or Global

• Many tasks with weak locality requirements.

• Manager-Worker model.

• Manager-Worker

• Hierarchical Manager-Worker

• Uses submanagers

• Decentralized

• No central manager

• Task pool on each processor

• Less bottleneck

### Mapping Checklist

• Are there communication bottlenecks?

• Have you evaluated the costs?

### PCAM Algorithm Design

• Partition

• Domain or Functional Decomposition

• Communication

• Agglomeration

• Mapping

• Divide up the tasks for balanced execution

### Example: Atmosphere Model

• Simulate atmospheric processes

• Wind

• Clouds, etc.

• Solves a set of partial differential equations describing the fluid behavior