Chapter 3
Sponsored Links
This presentation is the property of its rightful owner.
1 / 54

Chapter 3 PowerPoint PPT Presentation


  • 62 Views
  • Uploaded on
  • Presentation posted in: General

Chapter 3. Parallel Algorithm Design. Outline. Task/channel model Algorithm design methodology Case studies. Task/Channel Model. Parallel computation = set of tasks Task Program Local memory Collection of I/O ports Tasks interact by sending messages through channels. Task. Channel.

Download Presentation

Chapter 3

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Chapter 3

Parallel Algorithm Design


Outline

  • Task/channel model

  • Algorithm design methodology

  • Case studies


Task/Channel Model

  • Parallel computation = set of tasks

  • Task

    • Program

    • Local memory

    • Collection of I/O ports

  • Tasks interact by sending messages through channels


Task

Channel

Task/Channel Model


Foster’s Design Methodology

  • Partitioning

    • Dividing the Problem into Tasks

  • Communication

    • Determine what needs to be communicated between the Tasks over Channels

  • Agglomeration

    • Group or Consolidate Tasks to improve efficiency or simplify the programming solution

  • Mapping

    • Assign tasks to the Computer Processors


Foster’s Methodology


Step 1: PartitioningDivide Computation & Data into Pieces

  • Domain Decomposition – Data Centric Approach

    • Divide up most frequently used data

    • Associate the computations with the divided data

  • Functional Decomposition – Computation Centric Approach

    • Divide up the computation

    • Associate the data with the divided computations

  • Primitive Tasks:

    Resulting Pieces from either Decomposition

    • The goal is to have as many of these as possible


Example Domain Decompositions


Example Functional Decomposition


Partitioning Checklist

  • Lots of Tasks

    • e.g, at least 10x more primitive tasks than processors in target computer

  • Minimize redundant computations and data

  • Load Balancing

    • Primitive tasks roughly the same size

  • Scalable

    • Number of tasks an increasing function of problem size


Step 2: CommunicationDetermine Communication Patterns between Primitive Tasks

  • Local Communication

    • When Tasks need data from a small number of other Tasks

    • Channel from Producing Task to Consuming Task Created

  • Global Communication

    • When Task need data from many or all other Tasks

    • Channels for this type of communication are not created during this step


Communication Checklist

  • Balanced

    • Communication operations balanced among tasks

  • Small degree

    • Each task communicates with only small group of neighbors

  • Concurrency

    • Tasks can perform communications concurrently

    • Task can perform computations concurrently


Step 3: AgglomerationGroup Tasks to Improve Efficiency or Simplify Programming

  • Increase Locality

    • remove communication by agglomerating Tasks that Communicate with one another

    • Combine groups of sending & receiving task

      • Send fewer, larger messages rather than more short messages which incur more message latency.

  • Maintain Scalability of the Parallel Design

    • Be careful not to agglomerate Tasks so much that moving to a machine with more processors will not be possible

  • Reduce Software Engineering costs

    • Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm


Agglomeration Can Improve Performance

  • Eliminate communication between primitive tasks agglomerated into consolidated task

  • Combine groups of sending and receiving tasks


Agglomeration Checklist

  • Locality of parallel algorithm has increased

  • Tradeoff between agglomeration and code modifications costs is reasonable

  • Agglomerated tasks have similar computational and communications costs

  • Number of tasks increases with problem size

  • Number of tasks suitable for likely target systems


Step 4: MappingAssigning Tasks to Processors

  • Maximize Processor Utilization

    • Ensure computation is evenly balanced across all processors

  • Minimize Interprocess Communication


Optimal Mapping

  • Finding optimal mapping is NP-hard

  • Must rely on heuristics


Decision Tree for Parallel Algorithm Design


Mapping Goals

  • Mapping based on one task per processor and multiple tasks per processor have been considered

  • Both static and dynamic allocation of tasks to processors have been evaluated

  • If a dynamic allocation of tasks to processors is chosen, the Task allocator is not a bottleneck

  • If Static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1


Case Studies

Boundary value problem

Finding the maximum

The n-body problem

Adding data input


Boundary Value Problem

Ice water

Rod

Insulation


Rod Cools as Time Progresses


Finite Difference Approximation


Partitioning

  • One data item per grid point

  • Associate one primitive task with each grid point

  • Two-dimensional domain decomposition


Communication

  • Identify communication pattern between primitive tasks:

    • Each interior primitive task has three incoming and three outgoing channels


Agglomeration

Agglomeration and Mapping


Execution Time Estimate

  •  – time to update element

  • n – number of elements

  • m – number of iterations

  • Sequential execution time: mn

  • p – number of processors

  •  – message latency

  • Parallel execution time m(n/p+2)


Finding the Maximum Error

6.25%


Reduction

  • Given associative operator 

  • a0  a1  a2  …  an-1

  • Examples

    • Add

    • Multiply

    • And, Or

    • Maximum, Minimum


Parallel Reduction Evolution


Parallel Reduction Evolution


Parallel Reduction Evolution


Subgraph of hypercube

Binomial Trees


Finding Global Sum

4

2

0

7

-3

5

-6

-3

8

1

2

3

-4

4

6

-1


Finding Global Sum

1

7

-6

4

4

5

8

2


Finding Global Sum

8

-2

9

10


Finding Global Sum

17

8


Binomial Tree

Finding Global Sum

25


Agglomeration


sum

sum

sum

sum

Agglomeration


The n-body Problem


The n-body Problem


Partitioning

  • Domain partitioning

  • Assume one task per particle

  • Task has particle’s position, velocity vector

  • Iteration

    • Get positions of all other particles

    • Compute new position, velocity


Gather


All-gather


Complete Graph for All-gather


Hypercube for All-gather


Complete graph

Hypercube

Communication Time


Adding Data Input


Scatter


1234

12

5678

56

34

78

Scatter in log p Steps

12345678


Summary: Task/channel Model

  • Parallel computation

    • Set of tasks

    • Interactions through channels

  • Good designs

    • Maximize local computations

    • Minimize communications

    • Scale up


Summary: Design Steps

  • Partition computation

  • Agglomerate tasks

  • Map tasks to processors

  • Goals

    • Maximize processor utilization

    • Minimize inter-processor communication


Summary: Fundamental Algorithms

  • Reduction

  • Gather and scatter

  • All-gather


  • Login