
## Chapter 3

Parallel Algorithm Design

### Outline

• Task/channel model

• Algorithm design methodology

• Case studies

### Task/Channel Model

• Parallel computation = set of tasks

• Task

• Program

• Local memory

• Collection of I/O ports

• Tasks interact by sending messages through channels

### Foster’s Design Methodology

• Partitioning

• Divide the problem into tasks

• Communication

• Determine what must be communicated between tasks over channels

• Agglomeration

• Group or consolidate tasks to improve efficiency or simplify the programming solution

• Mapping

• Assign tasks to processors

### Step 1: Partitioning – Divide Computation & Data into Pieces

• Domain Decomposition – Data Centric Approach

• Divide up most frequently used data

• Associate the computations with the divided data

• Functional Decomposition – Computation Centric Approach

• Divide up the computation

• Associate the data with the divided computations

• The resulting pieces from either decomposition are called primitive tasks

• The goal is to have as many primitive tasks as possible
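As a concrete illustration of domain decomposition, the sketch below splits a 1-D array of `n` data items among `p` primitive tasks in contiguous blocks (the function name and block strategy are assumptions for illustration, not from the text):

```python
# Hypothetical helper: block domain decomposition of n data items among
# p primitive tasks, with block sizes differing by at most one element.
def block_decompose(n, p):
    ranges = []
    lo = 0
    for task in range(p):
        # The first n % p tasks receive one extra element.
        size = n // p + (1 if task < n % p else 0)
        ranges.append((lo, lo + size))
        lo += size
    return ranges

print(block_decompose(10, 3))  # → [(0, 4), (4, 7), (7, 10)]
```

Each task would then perform the computation associated with its own index range.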

### Partitioning Checklist

• Many more primitive tasks than processors

• e.g., at least 10x more primitive tasks than processors in target computer

• Minimize redundant computations and data

• Primitive tasks roughly the same size

• Scalable

• Number of tasks an increasing function of problem size

### Step 2: Communication – Determine Communication Patterns between Primitive Tasks

• Local Communication

• When Tasks need data from a small number of other Tasks

• Global Communication

• When a significant number of tasks must contribute data to perform a computation

• Channels for this type of communication are not created during this step

### Communication Checklist

• Balanced

• Communication operations balanced among tasks

• Small degree

• Each task communicates with only small group of neighbors

• Concurrency

• Tasks can perform communications concurrently

• Tasks can perform computations concurrently

### Step 3: Agglomeration – Group Tasks to Improve Efficiency or Simplify Programming

• Increase Locality

• Remove communication by agglomerating tasks that communicate with one another

• Combine groups of sending and receiving tasks

• Send fewer, larger messages rather than many short messages, since each message incurs latency

• Maintain Scalability of the Parallel Design

• Be careful not to agglomerate tasks so much that the design cannot move to a machine with more processors

• Reduce Software Engineering costs

• Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm
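The benefit of sending fewer, larger messages can be shown with a simple cost model: transferring n items costs latency plus n divided by bandwidth. The parameter values below are assumptions for illustration only:

```python
# Illustrative message cost model (assumed values, not from the text):
# time = latency + n_items / bandwidth.
def transfer_time(n_items, latency=1e-4, bandwidth=1e8):
    return latency + n_items / bandwidth

separate = 10 * transfer_time(1_000)   # ten short messages: pay latency 10 times
combined = transfer_time(10_000)       # one agglomerated message: pay latency once
print(combined < separate)  # → True
```

Under any positive latency, combining messages of the same total size strictly reduces transfer time.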

### Agglomeration Can Improve Performance

• Combine groups of sending and receiving tasks

### Agglomeration Checklist

• Locality of parallel algorithm has increased

• Tradeoff between agglomeration and code modifications costs is reasonable

• Agglomerated tasks have similar computational and communications costs

• Number of tasks increases with problem size

• Number of tasks suitable for likely target systems

### Step 4: Mapping – Assign Tasks to Processors

• Maximize Processor Utilization

• Ensure computation is evenly balanced across all processors

• Minimize Interprocess Communication

### Optimal Mapping

• Finding optimal mapping is NP-hard

• Must rely on heuristics

### Mapping Checklist

• Mappings based on one task per processor and on multiple tasks per processor have been considered

• Both static and dynamic allocation of tasks to processors have been evaluated

• If dynamic allocation of tasks to processors is chosen, the task allocator is not a bottleneck

• If static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1
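A dynamic allocation scheme can be sketched as a shared work queue that idle workers pull from; this toy thread-based version is an assumption for illustration, not how a distributed-memory allocator would actually be built:

```python
import queue
import threading

# Toy sketch of dynamic task allocation: idle workers repeatedly pull
# tasks from a shared queue until it is empty.
def run_dynamic(tasks, n_workers):
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return  # no tasks left; this worker finishes
            r = t * t  # stand-in for the real computation
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Because workers claim tasks as they become free, load stays balanced even when task costs vary, at the price of contention on the queue.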

### Case Studies

• Boundary value problem

• Finding the maximum

• The n-body problem

(figure: a thin rod with insulated sides and both ends held in ice water – the boundary value problem setup)

### Partitioning (Boundary Value Problem)

• One data item per grid point

• Associate one primitive task with each grid point

• Two-dimensional domain decomposition

### Communication

• Identify communication pattern between primitive tasks:

• Each interior primitive task has three incoming and three outgoing channels
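The three channels per interior task mirror the finite-difference update behind this problem: each point's next value depends on itself and its two neighbors. A minimal sketch (the constant `r` stands in for k·Δt/Δx², an assumed illustrative value):

```python
# Finite-difference sketch of the rod's temperature update (assumed
# parameters). Each interior point is updated from its own value and its
# two neighbors, matching the three incoming channels per interior task.
def update(u, r=0.25):
    return [u[0]] + [
        u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]  # boundary values stay fixed (ends in ice water)

print(update([0, 100, 0]))  # → [0, 50.0, 0]
```

In the parallel design, the neighbor values at a block's edges are exactly what flows through the channels each iteration.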

### Agglomeration

### Execution Time Estimate

• χ – time to update element

• n – number of elements

• m – number of iterations

• Sequential execution time: mnχ

• p – number of processors

• λ – message latency

• Parallel execution time: m(⌈n/p⌉χ + 2λ)
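Plugging sample numbers into the estimate makes the tradeoff concrete. The parameter values below are assumptions chosen for illustration:

```python
# Worked instance of the execution-time estimate (assumed values):
chi, lam = 10e-6, 100e-6     # per-element update time and message latency (s)
n, m, p = 1_000_000, 100, 8

sequential = m * n * chi                  # m * n * chi
per_iter = -(-n // p) * chi + 2 * lam     # ceil(n/p) * chi + 2 * lambda
parallel = m * per_iter
speedup = sequential / parallel           # approaches p when n/p * chi >> lambda
```

With these numbers the sequential time is 1000 s, the parallel time about 125 s, and the speedup just under 8: the two λ terms per iteration cost little because each processor's block of updates dominates.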

### Reduction

• Given associative operator ⊕

• Compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1

• Examples

• Multiply

• And, Or

• Maximum, Minimum

### Parallel Reduction Evolution

(figures: worked examples of pairwise reduction; each step halves the number of active tasks, and the resulting communication pattern is a binomial tree, a subgraph of a hypercube)
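The binomial-tree pattern can be sketched sequentially: each pass of the loop below models one parallel communication step, so p partial values are combined in log p steps (the sketch assumes the number of values is a power of two):

```python
# Sketch of binomial-tree reduction. Each while-loop pass models one
# parallel step in which half the remaining tasks send their partial
# result to a partner; assumes len(values) is a power of two.
def tree_reduce(values, op):
    vals = list(values)
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals), 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))  # → 36
```

Because ⊕ is associative, any grouping gives the same answer, which is what lets the tree reorder the combines; the same routine works for max, min, and, or, and so on.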

### Partitioning (The n-Body Problem)

• Domain partitioning

• Assume one task per particle

• Task has particle’s position, velocity vector

• Iteration

• Get positions of all other particles

• Compute new position, velocity
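One task's iteration can be sketched as follows, once it has gathered the positions of all other particles. This 2-D, unit-mass, softened-gravity version is purely illustrative; all parameter names and simplifications are assumptions:

```python
# Illustrative single-particle update for the n-body iteration
# (2-D, unit masses, G = 1, softened gravity; assumed simplifications).
def step(i, positions, velocities, dt=0.01, eps=1e-3):
    px, py = positions[i]
    ax = ay = 0.0
    for j, (qx, qy) in enumerate(positions):
        if j == i:
            continue
        dx, dy = qx - px, qy - py
        dist3 = (dx * dx + dy * dy + eps) ** 1.5  # softening avoids div-by-zero
        ax += dx / dist3
        ay += dy / dist3
    vx, vy = velocities[i]
    vx, vy = vx + ax * dt, vy + ay * dt
    return (px + vx * dt, py + vy * dt), (vx, vy)
```

The inner loop is why every task needs every other task's position each iteration, which drives the all-gather communication pattern below.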

• Communication alternatives: complete graph vs. hypercube

(figure: particle data gathered along hypercube links, combining blocks such as 12 and 34 into 1234 over log p steps)

### Scatter in log p Steps

(figure: the root's data, items 1–8, distributed across eight tasks in log p = 3 steps)

### Summary: Task/Channel Model

• Parallel computation = set of tasks

• Tasks interact through channels

• Good designs

• Maximize local computations

• Minimize communications

• Scale up

### Summary: Design Steps

• Partition computation

• Determine communication between tasks

• Agglomerate tasks

• Map tasks to processors

• Goals

• Maximize processor utilization

• Minimize inter-processor communication

### Summary: Fundamental Algorithms

• Reduction

• Gather and scatter

• All-gather