Chapter 3

1 / 54

# Parallel Algorithm Design - PowerPoint PPT Presentation

Chapter 3. Parallel Algorithm Design. Outline. Task/channel model Algorithm design methodology Case studies. Task/Channel Model. Parallel computation = set of tasks Task Program Local memory Collection of I/O ports Tasks interact by sending messages through channels. Task. Channel.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Parallel Algorithm Design' - mahon

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Chapter 3

Parallel Algorithm Design

Outline
• Algorithm design methodology
• Case studies
• Parallel computation = set of tasks
• Program
• Local memory
• Collection of I/O ports
• Tasks interact by sending messages through channels
Foster’s Design Methodology
• Partitioning
• Dividing the Problem into Tasks
• Communication
• Determine what needs to be communicated between the Tasks over Channels
• Agglomeration
• Group or Consolidate Tasks to improve efficiency or simplify the programming solution
• Mapping
• Assign tasks to the Computer Processors
Step 1: PartitioningDivide Computation & Data into Pieces
• Domain Decomposition – Data Centric Approach
• Divide up most frequently used data
• Associate the computations with the divided data
• Functional Decomposition – Computation Centric Approach
• Divide up the computation
• Associate the data with the divided computations

Resulting Pieces from either Decomposition

• The goal is to have as many of these as possible
Partitioning Checklist
• e.g, at least 10x more primitive tasks than processors in target computer
• Minimize redundant computations and data
• Primitive tasks roughly the same size
• Scalable
• Number of tasks an increasing function of problem size
Step 2: CommunicationDetermine Communication Patterns between Primitive Tasks
• Local Communication
• When Tasks need data from a small number of other Tasks
• Global Communication
• Channels for this type of communication are not created during this step
Communication Checklist
• Balanced
• Communication operations balanced among tasks
• Small degree
• Each task communicates with only small group of neighbors
• Concurrency
• Tasks can perform communications concurrently
• Task can perform computations concurrently
Step 3: AgglomerationGroup Tasks to Improve Efficiency or Simplify Programming
• Increase Locality
• remove communication by agglomerating Tasks that Communicate with one another
• Combine groups of sending & receiving task
• Send fewer, larger messages rather than more short messages which incur more message latency.
• Maintain Scalability of the Parallel Design
• Be careful not to agglomerate Tasks so much that moving to a machine with more processors will not be possible
• Reduce Software Engineering costs
• Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm
Agglomeration Can Improve Performance
• Combine groups of sending and receiving tasks
Agglomeration Checklist
• Locality of parallel algorithm has increased
• Tradeoff between agglomeration and code modifications costs is reasonable
• Agglomerated tasks have similar computational and communications costs
• Number of tasks increases with problem size
• Number of tasks suitable for likely target systems
Step 4: MappingAssigning Tasks to Processors
• Maximize Processor Utilization
• Ensure computation is evenly balanced across all processors
• Minimize Interprocess Communication
Optimal Mapping
• Finding optimal mapping is NP-hard
• Must rely on heuristics
Mapping Goals
• Mapping based on one task per processor and multiple tasks per processor have been considered
• Both static and dynamic allocation of tasks to processors have been evaluated
• If a dynamic allocation of tasks to processors is chosen, the Task allocator is not a bottleneck
• If Static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1
Case Studies

Boundary value problem

Finding the maximum

The n-body problem

Boundary Value Problem

Ice water

Rod

Insulation

Partitioning
• One data item per grid point
• Associate one primitive task with each grid point
• Two-dimensional domain decomposition
Communication
• Identify communication pattern between primitive tasks:
• Each interior primitive task has three incoming and three outgoing channels
Execution Time Estimate
•  – time to update element
• n – number of elements
• m – number of iterations
• Sequential execution time: mn
• p – number of processors
•  – message latency
• Parallel execution time m(n/p+2)
Reduction
• Given associative operator 
• a0  a1  a2  …  an-1
• Examples
• Multiply
• And, Or
• Maximum, Minimum
Finding Global Sum

4

2

0

7

-3

5

-6

-3

8

1

2

3

-4

4

6

-1

Partitioning
• Domain partitioning
• Assume one task per particle
• Task has particle’s position, velocity vector
• Iteration
• Get positions of all other particles
• Compute new position, velocity

1234

12

5678

56

34

78

Scatter in log p Steps

12345678

• Parallel computation
• Interactions through channels
• Good designs
• Maximize local computations
• Minimize communications
• Scale up
Summary: Design Steps
• Partition computation