
Chapter 3


Parallel Algorithm Design

- Task/channel model
- Algorithm design methodology
- Case studies

- Parallel computation = set of tasks
- Task
- Program
- Local memory
- Collection of I/O ports

- Tasks interact by sending messages through channels
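The task/channel model above can be mimicked on a single machine. The following is a minimal sketch, assuming threads stand in for tasks and a `queue.Queue` stands in for a channel; the names `producer`, `consumer`, `run_pipeline`, and the `None` end-of-stream marker are illustrative choices, not part of the slides.

```python
import queue
import threading

def producer(out_port, values):
    """Task: send each local value through its output port."""
    for v in values:
        out_port.put(v)       # put() on an unbounded queue does not wait for the receiver
    out_port.put(None)        # hypothetical end-of-stream marker

def consumer(in_port, result):
    """Task: receive values and accumulate them in local memory."""
    total = 0
    while True:
        v = in_port.get()     # get() blocks until a message is available
        if v is None:
            break
        total += v
    result.append(total)

def run_pipeline(values):
    channel = queue.Queue()   # the channel connecting the two tasks
    result = []
    tasks = [
        threading.Thread(target=producer, args=(channel, values)),
        threading.Thread(target=consumer, args=(channel, result)),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return result[0]

print(run_pipeline([1, 2, 3, 4]))  # prints 10
```

Each task touches only its own local variables and its port; all interaction goes through the channel.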

Task

Channel

- Partitioning
- Dividing the Problem into Tasks

- Communication
- Determine what needs to be communicated between the Tasks over Channels

- Agglomeration
- Group or Consolidate Tasks to improve efficiency or simplify the programming solution

- Mapping
- Assign tasks to the Computer Processors

- Domain Decomposition – Data Centric Approach
- Divide up most frequently used data
- Associate the computations with the divided data

- Functional Decomposition – Computation Centric Approach
- Divide up the computation
- Associate the data with the divided computations

- Primitive Tasks
- The pieces that result from either decomposition

- The goal is to identify as many of these as possible
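Domain decomposition as described above can be sketched in a few lines. This is an illustrative example only: the block-partitioning formula and the names `block_range`, `decompose`, and `local_sums` are assumptions made for the sketch, not something the slides specify.

```python
def block_range(task_id, num_tasks, n):
    """Half-open index range [lo, hi) of the data owned by task_id."""
    lo = task_id * n // num_tasks
    hi = (task_id + 1) * n // num_tasks
    return lo, hi

def decompose(data, num_tasks):
    """Divide the data first; one block per task."""
    blocks = []
    for t in range(num_tasks):
        lo, hi = block_range(t, num_tasks, len(data))
        blocks.append(data[lo:hi])
    return blocks

def local_sums(data, num_tasks):
    """Then associate a computation (here, a sum) with each block."""
    return [sum(block) for block in decompose(data, num_tasks)]

blocks = decompose(list(range(10)), 4)
print(blocks)                           # [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
print(local_sums(list(range(10)), 4))   # [1, 9, 11, 24]
```

With this formula the block sizes differ by at most one element, which keeps the resulting tasks roughly the same size.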

- Lots of Tasks
- e.g., at least 10x more primitive tasks than processors in the target computer

- Minimize redundant computations and data
- Load Balancing
- Primitive tasks roughly the same size

- Scalable
- Number of tasks an increasing function of problem size

- Local Communication
- When a task needs data from a small number of other tasks
- Create a channel from the producing task to the consuming task

- Global Communication
- When a task needs data from many or all other tasks
- Channels for this type of communication are not created during this step

- Balanced
- Communication operations balanced among tasks

- Small degree
- Each task communicates with only small group of neighbors

- Concurrency
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently

- Increase Locality
- Remove communication by agglomerating tasks that communicate with one another
- Combine groups of sending and receiving tasks
- Send fewer, larger messages rather than many short ones, since every message incurs latency

- Maintain Scalability of the Parallel Design
- Do not agglomerate tasks so heavily that the design cannot move to a machine with more processors

- Reduce Software Engineering costs
- Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm
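The advice to send fewer, larger messages can be illustrated with a simple cost model. This sketch assumes each message costs a fixed latency plus a transfer time proportional to its size; the bandwidth term and the function names are assumptions added for illustration, since the slides mention only latency.

```python
def message_time(num_items, latency, bandwidth):
    """Assumed cost of one message: fixed latency plus size / bandwidth."""
    return latency + num_items / bandwidth

def total_time(num_messages, items_per_message, latency, bandwidth):
    """Cost of sending num_messages messages of equal size, one after another."""
    return num_messages * message_time(items_per_message, latency, bandwidth)

# Same 100 items, sent two ways (units are arbitrary).
many_small = total_time(100, 1, latency=10, bandwidth=1)
one_large = total_time(1, 100, latency=10, bandwidth=1)
print(many_small, one_large)  # 1100.0 110.0
```

The transfer term is identical in both cases; the difference comes entirely from paying the latency 100 times instead of once.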

- Eliminate communication between primitive tasks agglomerated into a consolidated task
- Combine groups of sending and receiving tasks

- Locality of parallel algorithm has increased
- Tradeoff between agglomeration and code modification costs is reasonable
- Agglomerated tasks have similar computational and communications costs
- Number of tasks increases with problem size
- Number of tasks suitable for likely target systems

- Maximize Processor Utilization
- Ensure computation is evenly balanced across all processors

- Minimize Interprocess Communication

- Finding optimal mapping is NP-hard
- Must rely on heuristics
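One example of such a heuristic is a greedy "largest task first" assignment. The sketch below is an illustration chosen for this note, not a method the slides prescribe; it balances computation only and ignores communication costs. The name `greedy_map` and the cost values are hypothetical.

```python
import heapq

def greedy_map(task_costs, num_procs):
    """Assign tasks, largest first, to whichever processor is least loaded."""
    heap = [(0, p) for p in range(num_procs)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    by_cost = sorted(enumerate(task_costs), key=lambda tc: -tc[1])
    for task, cost in by_cost:
        load, proc = heapq.heappop(heap)        # least-loaded processor
        assignment[task] = proc
        heapq.heappush(heap, (load + cost, proc))
    loads = [0] * num_procs
    for task, proc in assignment.items():
        loads[proc] += task_costs[task]
    return assignment, loads

assignment, loads = greedy_map([7, 5, 4, 3, 1], 2)
print(loads)  # [10, 10]
```

A greedy pass like this runs quickly but carries no guarantee of finding the best mapping, which is the trade-off heuristics make.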

- Designs based on one task per processor and on multiple tasks per processor have been considered
- Both static and dynamic allocation of tasks to processors have been evaluated
- If dynamic allocation of tasks to processors is chosen, the task allocator is not a bottleneck
- If static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1

Case studies

- Boundary value problem
- Finding the maximum
- The n-body problem
- Adding data input

[Figure: boundary value problem setup, with parts labeled ice water, rod, and insulation]

- One data item per grid point
- Associate one primitive task with each grid point
- Two-dimensional domain decomposition

- Identify communication pattern between primitive tasks:
- Each interior primitive task has three incoming and three outgoing channels

Agglomeration

- χ – time to update element
- n – number of elements
- m – number of iterations
- Sequential execution time: m n χ
- p – number of processors
- λ – message latency
- Parallel execution time: m(χ n/p + 2λ)
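The two execution-time estimates above can be compared numerically. In this sketch `chi` stands for the time to update one element and `lam` for the message latency; those parameter names, the function names, and the sample values are assumptions chosen for illustration.

```python
def sequential_time(m, n, chi):
    """m iterations, each updating n elements at chi time units apiece."""
    return m * n * chi

def parallel_time(m, n, p, chi, lam):
    """Each iteration updates n/p elements per processor plus 2 message latencies."""
    return m * (chi * n / p + 2 * lam)

seq = sequential_time(m=100, n=1000, chi=1)
par = parallel_time(m=100, n=1000, p=10, chi=1, lam=5)
print(seq, par)             # 100000 11000.0
print(round(seq / par, 2))  # 9.09
```

In this example the speedup on 10 processors is about 9.09 rather than 10, because every iteration also pays the 2 × latency communication cost.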


- Given associative operator ⊕
- a₀ ⊕ a₁ ⊕ a₂ ⊕ … ⊕ aₙ₋₁
- Examples
- Add
- Multiply
- And, Or
- Maximum, Minimum
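Because the operator is associative, neighboring values can be combined in pairs, halving the number of partial results at every step. The following is a minimal sequential simulation of such a tree-structured reduction; the function name `tree_reduce` is illustrative, and a real parallel version would perform each step's combinations concurrently on different tasks.

```python
import operator

def tree_reduce(values, op):
    """Combine adjacent pairs until one value remains.

    Returns (result, steps). Only associativity of op is required,
    because the left-to-right order of the operands is preserved.
    """
    vals = list(values)
    if not vals:
        raise ValueError("need at least one value")
    steps = 0
    while len(vals) > 1:
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # an unpaired last value carries over unchanged
            nxt.append(vals[-1])
        vals = nxt
        steps += 1
    return vals[0], steps

data = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
print(tree_reduce(data, operator.add))  # (25, 4)
print(tree_reduce(data, max))           # (8, 4)
```

Sixteen values need four combining steps, compared with fifteen if a single task combined them one at a time.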

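[Figure: binomial tree, a subgraph of a hypercube]

[Figure: finding a global sum with a binomial tree. Sixteen tasks start with the values 4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1. Partial sums after each step: 1, 7, -6, 4, 4, 5, 8, 2; then 8, -2, 9, 10; then 17, 8; then the total, 25. A further panel shows four nodes, each labeled "sum".]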

- Domain partitioning
- Assume one task per particle
- Task has particle’s position, velocity vector
- Iteration
- Get positions of all other particles
- Compute new position, velocity
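Getting the positions of all other particles means every task must end up with every task's data. One way to do this is an exchange along each dimension of a hypercube, sketched below as a sequential simulation. It assumes the number of tasks is a power of two; the function name `hypercube_all_gather` and the integer placeholders for particle data are illustrative.

```python
def hypercube_all_gather(items):
    """Simulate an all-gather over a hypercube of len(items) tasks.

    Returns (held, steps), where held[t] is the list of items task t
    ends up with.
    """
    p = len(items)
    if p == 0 or p & (p - 1) != 0:
        raise ValueError("number of tasks must be a power of two")
    held = [[item] for item in items]      # each task starts with its own item
    steps = 0
    bit = 1
    while bit < p:
        new_held = []
        for t in range(p):
            partner = t ^ bit              # neighbor across this dimension
            lo, hi = min(t, partner), max(t, partner)
            new_held.append(held[lo] + held[hi])   # both sides swap all they hold
        held = new_held
        bit <<= 1
        steps += 1
    return held, steps

held, steps = hypercube_all_gather([1, 2, 3, 4, 5, 6, 7, 8])
print(held[0], steps)  # [1, 2, 3, 4, 5, 6, 7, 8] 3
```

With 8 tasks, each task holds 2 items after the first exchange, 4 after the second, and all 8 after the third, so the number of exchange steps grows as log₂ of the number of tasks, while message sizes double at each step.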

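[Figure: communication patterns for collecting all particle positions: complete graph and hypercube. In the hypercube panel, data blocks 12, 34, 56, and 78 merge into 1234 and 5678, then into 12345678.]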

- Parallel computation
- Set of tasks
- Interactions through channels

- Good designs
- Maximize local computations
- Minimize communications
- Scale up

- Partition computation
- Agglomerate tasks
- Map tasks to processors
- Goals
- Maximize processor utilization
- Minimize inter-processor communication

- Reduction
- Gather and scatter
- All-gather