Chapter 3

Chapter 3 Parallel Algorithm Design

Outline • Task/channel model • Algorithm design methodology • Case studies

Task/Channel Model • Parallel computation = set of tasks • Task • Program • Local memory • Collection of I/O ports • Tasks interact by sending messages through channels

Task Channel Task/Channel Model

Foster’s Design Methodology • Partitioning • Dividing the Problem into Tasks • Communication • Determine what needs to be communicated between the Tasks over Channels • Agglomeration • Group or Consolidate Tasks to improve efficiency or simplify the programming solution • Mapping • Assign tasks to the Computer Processors

Foster’s Methodology

Step 1: PartitioningDivide Computation & Data into Pieces • Domain Decomposition – Data Centric Approach • Divide up most frequently used data • Associate the computations with the divided data • Functional Decomposition – Computation Centric Approach • Divide up the computation • Associate the data with the divided computations • Primitive Tasks: Resulting Pieces from either Decomposition • The goal is to have as many of these as possible

Example Domain Decompositions

Example Functional Decomposition

Partitioning Checklist • Lots of Tasks • e.g, at least 10x more primitive tasks than processors in target computer • Minimize redundant computations and data • Load Balancing • Primitive tasks roughly the same size • Scalable • Number of tasks an increasing function of problem size

Step 2: CommunicationDetermine Communication Patterns between Primitive Tasks • Local Communication • When Tasks need data from a small number of other Tasks • Channel from Producing Task to Consuming Task Created • Global Communication • When Task need data from many or all other Tasks • Channels for this type of communication are not created during this step

Communication Checklist • Balanced • Communication operations balanced among tasks • Small degree • Each task communicates with only small group of neighbors • Concurrency • Tasks can perform communications concurrently • Task can perform computations concurrently

Step 3: AgglomerationGroup Tasks to Improve Efficiency or Simplify Programming • Increase Locality • remove communication by agglomerating Tasks that Communicate with one another • Combine groups of sending & receiving task • Send fewer, larger messages rather than more short messages which incur more message latency. • Maintain Scalability of the Parallel Design • Be careful not to agglomerate Tasks so much that moving to a machine with more processors will not be possible • Reduce Software Engineering costs • Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm

Agglomeration Can Improve Performance • Eliminate communication between primitive tasks agglomerated into consolidated task • Combine groups of sending and receiving tasks

Agglomeration Checklist • Locality of parallel algorithm has increased • Tradeoff between agglomeration and code modifications costs is reasonable • Agglomerated tasks have similar computational and communications costs • Number of tasks increases with problem size • Number of tasks suitable for likely target systems

Step 4: MappingAssigning Tasks to Processors • Maximize Processor Utilization • Ensure computation is evenly balanced across all processors • Minimize Interprocess Communication

Optimal Mapping • Finding optimal mapping is NP-hard • Must rely on heuristics

Decision Tree for Parallel Algorithm Design

Mapping Goals • Mapping based on one task per processor and multiple tasks per processor have been considered • Both static and dynamic allocation of tasks to processors have been evaluated • If a dynamic allocation of tasks to processors is chosen, the Task allocator is not a bottleneck • If Static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1

Case Studies Boundary value problem Finding the maximum The n-body problem Adding data input

Boundary Value Problem Ice water Rod Insulation

Rod Cools as Time Progresses

Finite Difference Approximation

Partitioning • One data item per grid point • Associate one primitive task with each grid point • Two-dimensional domain decomposition

Communication • Identify communication pattern between primitive tasks: • Each interior primitive task has three incoming and three outgoing channels

Agglomeration Agglomeration and Mapping

Execution Time Estimate •  – time to update element • n – number of elements • m – number of iterations • Sequential execution time: mn • p – number of processors •  – message latency • Parallel execution time m(n/p+2)

Finding the Maximum Error 6.25%

Reduction • Given associative operator  • a0  a1  a2  …  an-1 • Examples • Add • Multiply • And, Or • Maximum, Minimum

Parallel Reduction Evolution

Subgraph of hypercube Binomial Trees

Finding Global Sum 4 2 0 7 -3 5 -6 -3 8 1 2 3 -4 4 6 -1

Finding Global Sum 1 7 -6 4 4 5 8 2

Finding Global Sum 8 -2 9 10

Finding Global Sum 17 8

Binomial Tree Finding Global Sum 25

Agglomeration

sum sum sum sum Agglomeration

The n-body Problem

Partitioning • Domain partitioning • Assume one task per particle • Task has particle’s position, velocity vector • Iteration • Get positions of all other particles • Compute new position, velocity

Gather

All-gather

Complete Graph for All-gather

Hypercube for All-gather

Complete graph Hypercube Communication Time

Adding Data Input

Scatter

Chapter 3

Chapter 3

Presentation Transcript

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

chapter 3

CHAPTER 3-3

Chapter 3-3

Chapter 3 Chapter 3

CHAPTER 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

CHAPTER 3

Chapter 3