1 / 54

# Chapter 3 - PowerPoint PPT Presentation

Chapter 3. Parallel Algorithm Design. Outline. Task/channel model Algorithm design methodology Case studies. Task/Channel Model. Parallel computation = set of tasks Task Program Local memory Collection of I/O ports Tasks interact by sending messages through channels. Task. Channel.

Related searches for Chapter 3

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Chapter 3' - mahon

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Chapter 3

Parallel Algorithm Design

• Algorithm design methodology

• Case studies

• Parallel computation = set of tasks

• Program

• Local memory

• Collection of I/O ports

• Tasks interact by sending messages through channels

Channel

• Partitioning

• Dividing the Problem into Tasks

• Communication

• Determine what needs to be communicated between the Tasks over Channels

• Agglomeration

• Group or Consolidate Tasks to improve efficiency or simplify the programming solution

• Mapping

• Assign tasks to the Computer Processors

Step 1: PartitioningDivide Computation & Data into Pieces

• Domain Decomposition – Data Centric Approach

• Divide up most frequently used data

• Associate the computations with the divided data

• Functional Decomposition – Computation Centric Approach

• Divide up the computation

• Associate the data with the divided computations

Resulting Pieces from either Decomposition

• The goal is to have as many of these as possible

• Lots of Tasks

• e.g, at least 10x more primitive tasks than processors in target computer

• Minimize redundant computations and data

• Primitive tasks roughly the same size

• Scalable

• Number of tasks an increasing function of problem size

Step 2: CommunicationDetermine Communication Patterns between Primitive Tasks

• Local Communication

• When Tasks need data from a small number of other Tasks

• Channel from Producing Task to Consuming Task Created

• Global Communication

• When Task need data from many or all other Tasks

• Channels for this type of communication are not created during this step

• Balanced

• Communication operations balanced among tasks

• Small degree

• Each task communicates with only small group of neighbors

• Concurrency

• Tasks can perform communications concurrently

• Task can perform computations concurrently

Step 3: AgglomerationGroup Tasks to Improve Efficiency or Simplify Programming

• Increase Locality

• remove communication by agglomerating Tasks that Communicate with one another

• Combine groups of sending & receiving task

• Send fewer, larger messages rather than more short messages which incur more message latency.

• Maintain Scalability of the Parallel Design

• Be careful not to agglomerate Tasks so much that moving to a machine with more processors will not be possible

• Reduce Software Engineering costs

• Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm

• Eliminate communication between primitive tasks agglomerated into consolidated task

• Combine groups of sending and receiving tasks

• Locality of parallel algorithm has increased

• Tradeoff between agglomeration and code modifications costs is reasonable

• Agglomerated tasks have similar computational and communications costs

• Number of tasks increases with problem size

• Number of tasks suitable for likely target systems

Step 4: MappingAssigning Tasks to Processors

• Maximize Processor Utilization

• Ensure computation is evenly balanced across all processors

• Minimize Interprocess Communication

• Finding optimal mapping is NP-hard

• Must rely on heuristics

• Mapping based on one task per processor and multiple tasks per processor have been considered

• Both static and dynamic allocation of tasks to processors have been evaluated

• If a dynamic allocation of tasks to processors is chosen, the Task allocator is not a bottleneck

• If Static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1

Boundary value problem

Finding the maximum

The n-body problem

Ice water

Rod

Insulation

• One data item per grid point

• Associate one primitive task with each grid point

• Two-dimensional domain decomposition

• Identify communication pattern between primitive tasks:

• Each interior primitive task has three incoming and three outgoing channels

Agglomeration and Mapping

•  – time to update element

• n – number of elements

• m – number of iterations

• Sequential execution time: mn

• p – number of processors

•  – message latency

• Parallel execution time m(n/p+2)

• Given associative operator 

• a0  a1  a2  …  an-1

• Examples

• Multiply

• And, Or

• Maximum, Minimum

Binomial Trees

4

2

0

7

-3

5

-6

-3

8

1

2

3

-4

4

6

-1

1

7

-6

4

4

5

8

2

8

-2

9

10

Finding Global Sum

25

sum

sum

sum

Agglomeration

• Domain partitioning

• Assume one task per particle

• Task has particle’s position, velocity vector

• Iteration

• Get positions of all other particles

• Compute new position, velocity

Hypercube

Communication Time

12

5678

56

34

78

Scatter in log p Steps

12345678

• Parallel computation

• Set of tasks

• Interactions through channels

• Good designs

• Maximize local computations

• Minimize communications

• Scale up

• Partition computation

• Map tasks to processors

• Goals

• Maximize processor utilization

• Minimize inter-processor communication

• Reduction

• Gather and scatter

• All-gather