Chapter 3

Parallel Algorithm Design

Outline
  • Task/channel model
  • Algorithm design methodology
  • Case studies
Task/Channel Model
  • Parallel computation = set of tasks
  • Task
    • Program
    • Local memory
    • Collection of I/O ports
  • Tasks interact by sending messages through channels
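As a rough illustration (not from the slides), the task/channel model can be sketched in Python, with a queue standing in for a one-way channel:

```python
from queue import Queue

def producer_task(out_channel):
    """A task: a program with local memory and I/O ports (here, a queue)."""
    local_data = [3, 1, 2]            # task-local memory
    out_channel.put(sum(local_data))  # send a message through the channel

def consumer_task(in_channel):
    """The consuming task blocks until a message arrives on its channel."""
    return in_channel.get()

channel = Queue()                     # one-way channel: producer -> consumer
producer_task(channel)
result = consumer_task(channel)
```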
Foster’s Design Methodology
  • Partitioning
    • Divide the problem into tasks
  • Communication
    • Determine what must be communicated between tasks over channels
  • Agglomeration
    • Group or consolidate tasks to improve efficiency or simplify the programming solution
  • Mapping
    • Assign tasks to processors
Step 1: Partitioning
Divide Computation & Data into Pieces
  • Domain Decomposition – Data Centric Approach
    • Divide up most frequently used data
    • Associate the computations with the divided data
  • Functional Decomposition – Computation Centric Approach
    • Divide up the computation
    • Associate the data with the divided computations
  • Primitive tasks: the pieces that result from either decomposition
    • The goal is to have as many primitive tasks as possible
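As an illustrative sketch (an assumed example, not from the slides), domain decomposition of a summation divides the data first and lets the computation follow each piece:

```python
data = list(range(16))        # the most frequently used data
num_tasks = 4

# Domain decomposition: divide the data into contiguous blocks...
size = len(data) // num_tasks
chunks = [data[i * size:(i + 1) * size] for i in range(num_tasks)]

# ...then associate a computation (here, a partial sum) with each block.
partial_sums = [sum(chunk) for chunk in chunks]
total = sum(partial_sums)     # combining step
```

Functional decomposition would instead split the computation (e.g. different processing stages) and attach the data each stage needs.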
Partitioning Checklist
  • Lots of Tasks
    • e.g., at least 10x more primitive tasks than processors in the target computer
  • Minimize redundant computations and data
  • Load Balancing
    • Primitive tasks roughly the same size
  • Scalable
    • Number of tasks an increasing function of problem size
Step 2: Communication
Determine Communication Patterns between Primitive Tasks
  • Local communication
    • When tasks need data from a small number of other tasks
    • A channel is created from each producing task to each consuming task
  • Global communication
    • When a task needs data from many or all other tasks
    • Channels for this type of communication are not created during this step
Communication Checklist
  • Balanced
    • Communication operations balanced among tasks
  • Small degree
    • Each task communicates with only small group of neighbors
  • Concurrency
    • Tasks can perform communications concurrently
    • Tasks can perform computations concurrently
Step 3: Agglomeration
Group Tasks to Improve Efficiency or Simplify Programming
  • Increase locality
    • Remove communication by agglomerating tasks that communicate with one another
    • Combine groups of sending and receiving tasks
      • Send fewer, larger messages rather than many short messages, each of which incurs message latency
  • Maintain scalability of the parallel design
    • Do not agglomerate tasks so heavily that the design cannot move to a machine with more processors
  • Reduce software engineering costs
    • Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm
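A back-of-the-envelope model (with assumed cost constants) shows why sending fewer, larger messages pays off:

```python
LATENCY = 100e-6   # assumed per-message startup cost, in seconds
PER_BYTE = 1e-9    # assumed per-byte transfer cost, in seconds

def transfer_time(num_messages, total_bytes):
    """Simple linear cost model: each message pays the latency once."""
    return num_messages * LATENCY + total_bytes * PER_BYTE

many_small = transfer_time(1000, 8000)  # 1000 eight-byte messages
one_large = transfer_time(1, 8000)      # same data agglomerated into one message
```

With these constants the agglomerated transfer is dominated by a single latency, roughly a thousand times cheaper.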
Agglomeration Can Improve Performance
  • Eliminate communication between primitive tasks agglomerated into consolidated task
  • Combine groups of sending and receiving tasks
Agglomeration Checklist
  • Locality of parallel algorithm has increased
  • Tradeoff between agglomeration and code modification costs is reasonable
  • Agglomerated tasks have similar computational and communications costs
  • Number of tasks increases with problem size
  • Number of tasks suitable for likely target systems
Step 4: Mapping
Assigning Tasks to Processors
  • Maximize Processor Utilization
    • Ensure computation is evenly balanced across all processors
  • Minimize Interprocess Communication
Optimal Mapping
  • Finding optimal mapping is NP-hard
  • Must rely on heuristics
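Since finding an optimal mapping is NP-hard, one common heuristic (sketched here with hypothetical task costs) assigns each task, largest first, to the currently least-loaded processor:

```python
import heapq

def greedy_map(task_costs, p):
    """Longest-processing-time-first heuristic for mapping tasks to processors."""
    loads = [(0, proc) for proc in range(p)]  # (current load, processor id)
    heapq.heapify(loads)
    assignment = {}
    for task, cost in sorted(enumerate(task_costs), key=lambda tc: -tc[1]):
        load, proc = heapq.heappop(loads)     # least-loaded processor so far
        assignment[task] = proc
        heapq.heappush(loads, (load + cost, proc))
    return assignment

mapping = greedy_map([4, 3, 3, 2, 2], p=2)
```

This does not guarantee an optimal mapping, but it is fast and usually close to balanced.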
Mapping Goals
  • Designs based on one task per processor and on multiple tasks per processor have been considered
  • Both static and dynamic allocation of tasks to processors have been evaluated
  • If dynamic allocation is chosen, the task allocator is not a bottleneck
  • If static allocation is chosen, the ratio of tasks to processors is at least 10 to 1
Case Studies

  • Boundary value problem
  • Finding the maximum
  • The n-body problem
  • Adding data input

Boundary Value Problem

[Figure: a rod, insulated along its length, with both ends held in ice water]

Partitioning
  • One data item per grid point
  • Associate one primitive task with each grid point
  • Two-dimensional domain decomposition
Communication
  • Identify communication pattern between primitive tasks:
    • Each interior primitive task has three incoming and three outgoing channels
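The data each interior task must receive is visible in one update of the underlying computation; a minimal sketch, assuming an explicit finite-difference update for the rod:

```python
def step(u, r=0.25):
    """One time step: each interior point is updated from itself and its two
    neighbors -- exactly the values a task receives over its incoming channels.
    The end points (held at the ice-water temperature) stay fixed."""
    return ([u[0]] +
            [u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
             for i in range(1, len(u) - 1)] +
            [u[-1]])

u = [0.0, 100.0, 100.0, 100.0, 0.0]  # hot rod interior, ends in ice water
u = step(u)
```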
Execution Time Estimate
  • χ – time to update one element
  • n – number of elements
  • m – number of iterations
  • Sequential execution time: mnχ
  • p – number of processors
  • λ – message latency
  • Parallel execution time: m(χ⌈n/p⌉ + 2λ)
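Plugging assumed values into these estimates (χ = per-element update time, λ = message latency) makes the comparison concrete:

```python
from math import ceil

chi = 1e-6    # assumed time to update one element (seconds)
lam = 100e-6  # assumed message latency (seconds)
n, m, p = 10_000, 1_000, 8

sequential = m * n * chi                      # m * n * chi
parallel = m * (chi * ceil(n / p) + 2 * lam)  # m * (chi * ceil(n/p) + 2 * lambda)
speedup = sequential / parallel
```

With these numbers the per-iteration latency term (2λ) is small relative to the computation, so the speedup is close to, but below, p.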
Reduction
  • Given an associative operator ⊕
  • a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1
  • Examples
    • Add
    • Multiply
    • And, Or
    • Maximum, Minimum
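A sketch of such a reduction: pairwise combining is valid for any associative operator and needs only ⌈log2 n⌉ combining steps when pairs run in parallel:

```python
def tree_reduce(op, values):
    """Reduce values with an associative operator by pairwise combining."""
    values = list(values)
    while len(values) > 1:
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:              # odd element passes through this round
            paired.append(values[-1])
        values = paired
    return values[0]

total = tree_reduce(lambda a, b: a + b, [1, 2, 3, 4])   # add
largest = tree_reduce(max, [3, 9, 4, 1])                # maximum
```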
Finding Global Sum

[Figure: parallel reduction tree over 16 values: 4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
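A quick check of the figure's result (the 16 starting values sum to 25), using the same pairwise combining:

```python
values = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]

def pairwise_sum(vals):
    """Halve the number of partial sums each step: 16 values need 4 steps."""
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

total = pairwise_sum(values)
```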

Partitioning (The n-body Problem)
  • Domain partitioning
  • Assume one task per particle
  • Task has particle’s position, velocity vector
  • Iteration
    • Get positions of all other particles
    • Compute new position, velocity
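One iteration can be sketched as below (a toy 1-D force law, not real gravity); note that every particle's update reads every other particle's position, which is why this step requires global communication:

```python
def nbody_step(positions, velocities, dt=0.1, g=1.0):
    """Toy n-body iteration: get all other particles' positions, then update
    velocity and position. The all-pairs force sum is the global dependence."""
    new_vel = []
    for i, xi in enumerate(positions):
        force = sum(g * (xj - xi) for j, xj in enumerate(positions) if j != i)
        new_vel.append(velocities[i] + force * dt)
    new_pos = [x + v * dt for x, v in zip(positions, new_vel)]
    return new_pos, new_vel

pos, vel = nbody_step([0.0, 1.0, 2.0], [0.0, 0.0, 0.0])
```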
Scatter in log p Steps

[Figure: recursive-halving scatter: 12345678 → 1234, 5678 → 12, 34, 56, 78]
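The recursive-halving scheme in the figure can be simulated directly; with p processes the data reaches everyone in log2(p) rounds:

```python
def scatter_log_steps(blocks, p):
    """Process 0 starts with all p blocks; each round, every holder sends the
    upper half of what it still has to the process at distance `half`."""
    assert len(blocks) == p and p & (p - 1) == 0   # p must be a power of two
    held = {0: list(blocks)}       # process id -> blocks it currently holds
    half, rounds = p // 2, 0
    while half >= 1:
        for proc in list(held):    # snapshot: a receiver does not also send this round
            mine = held[proc]
            if len(mine) > 1:
                mid = len(mine) // 2
                held[proc + half] = mine[mid:]   # send upper half to partner
                held[proc] = mine[:mid]
        half //= 2
        rounds += 1
    return held, rounds

held, rounds = scatter_log_steps([1, 2, 3, 4, 5, 6, 7, 8], p=8)
```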

Summary: Task/Channel Model
  • Parallel computation
    • Set of tasks
    • Interactions through channels
  • Good designs
    • Maximize local computations
    • Minimize communications
    • Scale up
Summary: Design Steps
  • Partition computation
  • Determine communication
  • Agglomerate tasks
  • Map tasks to processors
  • Goals
    • Maximize processor utilization
    • Minimize inter-processor communication
Summary: Fundamental Algorithms
  • Reduction
  • Gather and scatter
  • All-gather