ME964 High Performance Computing for Engineering Applications

Parallel Programming Patterns

Oct. 16, 2008


Before we get started…

  • Last Time

    • CUDA arithmetic support

      • Performance implications

      • Accuracy implications

    • A software design exercise: prefix scan

      • Preamble to the topic of software design patterns

  • Today

    • Parallel programming patterns

  • Other issues:

    • HW due on Saturday, 11:59 PM.

    • I will be in Milwaukee next Thursday; Mik will be the lecturer, tracing the evolution of hardware support for the graphics pipeline

2


“If you build it, they will come.”

“And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department … and they haven’t come.

Programmers haven’t come to program these wonderful machines.

The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?”

- Mattson, Sanders, Massingill (2005)

3

HK-UIUC


Objective

  • Outline a framework that draws on techniques & best practices used by experienced parallel programmers in order to enable you to

    • Think about the problem of parallel programming

    • Address functionality and performance issues in your parallel program design

    • Discuss your design decisions with others

    • Select an appropriate implementation platform that will facilitate application development

4


Fundamentals of Parallel Computing

  • Parallel computing requires that

    • The problem can be decomposed into sub-problems that can be safely solved at the same time

    • The programmer structures the code and data to solve these sub-problems concurrently

  • The goals of parallel computing are

    • To solve problems in less time, and/or

    • To solve bigger problems

The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.

5

HK-UIUC


Recommended Reading

Mattson, Sanders, Massingill, Patterns for Parallel Programming, Addison Wesley, 2005, ISBN 0-321-22811-1.

  • Used for this presentation

  • A good overview of challenges, best practices, and common techniques in all aspects of parallel programming

  • Book is on reserve at Wendt Library

6

HK-UIUC


Parallel Programming: Key Steps

  • Find the concurrency in the problem

  • Structure the algorithm so that concurrency can be exploited

  • Implement the algorithm in a suitable programming environment

  • Execute and tune the performance of the code on a parallel system

NOTE: In reality, these steps have not been separated into levels of abstraction that can be dealt with independently.

7

HK-UIUC


What’s Next?

Focus on this for a while

From “Patterns for Parallel Programming”

8


Challenges of Parallel Programming

  • Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle

  • Dependences need to be identified and managed

    • The order of task execution may change the answers

      • Obvious: One step feeds result to the next steps

      • Subtle: numeric accuracy may be affected by ordering steps that are logically parallel with each other

9

HK-UIUC


Challenges of Parallel Programming (cont.)

  • Performance can be drastically reduced by many factors

    • Overhead of parallel processing

    • Load imbalance among processor elements

    • Inefficient data sharing patterns

    • Saturation of critical resources such as memory bandwidth

10

HK-UIUC


Shared Memory vs. Message Passing

  • We will focus on shared memory parallel programming

    • This is what CUDA is based on

    • Future massively parallel microprocessors are expected to support shared memory at the chip level

  • The programming considerations of the message-passing model are quite different!

11

HK-UIUC


Finding Concurrency in Problems

  • Identify a decomposition of the problem into sub-problems that can be solved simultaneously

    • A task decomposition that identifies tasks that can execute concurrently

    • A data decomposition that identifies data local to each task

    • A way of grouping tasks and ordering the groups to satisfy temporal constraints

    • An analysis of the data sharing patterns among the concurrent tasks

    • A design evaluation that assesses the quality of the choices made in all the steps

12

HK-UIUC


Finding Concurrency – The Process

Decomposition (Task Decomposition, Data Decomposition) → Dependence Analysis (Group Tasks, Order Tasks, Data Sharing) → Design Evaluation

This is typically an iterative process.

Opportunities exist for dependence analysis to play an earlier role in decomposition.

13

HK-UIUC


Find Concr. 1: Decomp. Stage: Task Decomposition

  • Many large problems have natural independent tasks

    • The number of tasks used should be adjustable to the execution resources available.

    • Each task must include sufficient work to compensate for the overhead of managing its parallel execution.

    • Tasks should maximize reuse of sequential program code to minimize design/implementation effort.

“In an ideal world, the compiler would find tasks for the programmer. Unfortunately, this almost never happens.”

- Mattson, Sanders, Massingill

14

HK-UIUC


Example: Task Decomposition - Square Matrix Multiplication


  • P = M * N, where M, N, and P are WIDTH × WIDTH matrices

    • One natural task (sub-problem) produces one element of P (see the kernel sketch below)

    • All tasks can execute in parallel

[Figure: matrices M, N, and P, each WIDTH × WIDTH]
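
A minimal CUDA sketch of this task decomposition, assuming square row-major matrices and one thread per element of P (kernel name and launch configuration are illustrative):

// One task = one element of P: each thread computes a single dot product.
__global__ void MatrixMulElement(const float *M, const float *N, float *P, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}
// Possible launch: dim3 block(16, 16), grid((width + 15) / 16, (width + 15) / 16);
//                  MatrixMulElement<<<grid, block>>>(dM, dN, dP, width);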

15

HK-UIUC


Task Decomposition Example: Molecular Dynamics

  • Simulation of motions of a large molecular system

  • A natural task is to perform calculations of individual atoms in parallel

  • For each atom, there are several tasks to complete

    • Covalent forces acting on the atom

    • Neighbors that must be considered in non-bonded forces

    • Non-bonded forces

    • Update position and velocity

    • Misc physical properties based on motions

  • Some of these can go in parallel for an atom

It is common that there are multiple ways to decompose any given problem.
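
As a rough illustration (not from the slides), one possible per-atom task for the non-bonded forces, assuming coordinates and force accumulators in flat arrays and a placeholder force law:

/* Hypothetical per-atom task: accumulate pairwise non-bonded forces on atom i.
   All n_atoms such tasks are independent and can run in parallel. */
void nonbonded_force_task(int i, int n_atoms,
                          const double *x, const double *y, const double *z,
                          double *fx, double *fy, double *fz)
{
    for (int j = 0; j < n_atoms; ++j) {
        if (j == i) continue;
        double dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
        double r2 = dx * dx + dy * dy + dz * dz;
        double s  = 1.0 / (r2 * r2);   /* placeholder force law, not a real potential */
        fx[i] += s * dx;
        fy[i] += s * dy;
        fz[i] += s * dz;
    }
}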

16

HK-UIUC


Find Concr. 2: Decomp. Stage: Data Decomposition

  • The most compute intensive parts of many large problems manipulate a large data structure

    • Similar operations are being applied to different parts of the data structure, in a mostly independent manner.

    • This is what CUDA is optimized for.

  • The data decomposition should lead to

    • Efficient data usage by each UE (unit of execution) within the partition

    • Few dependencies between the UEs that work on different partitions

    • Adjustable partitions that can be varied according to the hardware characteristics

17

HK-UIUC


Data Decomposition Example - Square Matrix Multiplication


  • Row blocks of P (adjacent full rows); see the sketch below

    • Computing each partition requires access to the entire N array

  • Square sub-blocks

    • Only bands of M and N are needed


  • Think how data is stored in the Matrix type

    • Computing things row-wise or column-wise might actually lead to different performance…

[Figure: matrices M, N, and P, each WIDTH × WIDTH]
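
A minimal sketch of the row-block decomposition, assuming UE id (out of num_ue UEs) computes a contiguous block of rows of P; the names are illustrative:

/* Each UE computes rows [row_start, row_end) of P; it reads all of N
   but only its own band of M. */
void matmul_row_block(int id, int num_ue, const float *M, const float *N,
                      float *P, int width)
{
    int rows_per_ue = width / num_ue;
    int row_start   = id * rows_per_ue;
    int row_end     = (id == num_ue - 1) ? width : row_start + rows_per_ue;
    for (int i = row_start; i < row_end; ++i)
        for (int j = 0; j < width; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[i * width + k] * N[k * width + j];
            P[i * width + j] = sum;
        }
}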

18


Find Concr. 3: Dependence Analysis - Task Grouping

  • Sometimes natural tasks of a problem can be grouped together to improve efficiency

    • Reduced synchronization overhead - when tasks are grouped, there is often little or no need for synchronization within the group

    • All tasks in the group efficiently share data loaded into a common on-chip, shared storage (Shared Memory)

19

HK-UIUC


Task Grouping Example - Square Matrix Multiplication


  • Tasks calculating a P sub-block

    • Extensive input data sharing, reduced memory bandwidth using Shared Memory

    • All are synchronized in execution

[Figure: matrices M, N, and P, each WIDTH × WIDTH]

20

HK-UIUC


Find Concr. 4: Dependence Analysis - Task Ordering

  • Identify the data required by a group of tasks before they can execute

    • Find the task group that creates it (look upwind)

    • Determine a temporal order that satisfies all data constraints

21

HK-UIUC


Task Ordering Example: Molecular Dynamics

[Task-ordering diagram: Neighbor List → Non-bonded Force; Covalent Forces (independent of the neighbor list); both force groups feed Update atomic positions and velocities → Next Time Step]

22

HK-UIUC


Find Concr. 5: Dependence Analysis - Data Sharing

  • Data sharing can be a double-edged sword

    • An algorithm that calls for excessive data sharing can drastically reduce the advantage of parallel execution

    • Localized sharing can improve memory bandwidth efficiency

    • Use the execution of task groups to interleave with (mask) global memory data accesses

  • Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires a higher level of synchronization

23

HK-UIUC


Data Sharing Example – Matrix Multiplication

  • Each task group will finish usage of each sub-block of N and M before moving on

    • N and M sub-blocks loaded into Shared Memory for use by all threads of a P sub-block

    • Amount of on-chip Shared Memory strictly limits the number of threads working on a P sub-block

  • Read-only shared data can be efficiently accessed as Constant or Texture data (on the GPU)

    • Frees up the shared memory for other uses
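
A minimal sketch of the constant-memory option mentioned above, assuming a small read-only coefficient array (array name and size are illustrative):

#include <cuda_runtime.h>

#define COEFF_COUNT 64                       // illustrative size

// Read-only data placed in constant memory: cached on chip and visible to all
// threads, which leaves shared memory free for other uses.
__constant__ float d_coeff[COEFF_COUNT];

void upload_coefficients(const float *h_coeff)
{
    // Copy host data into the constant-memory symbol before launching kernels.
    cudaMemcpyToSymbol(d_coeff, h_coeff, COEFF_COUNT * sizeof(float));
}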

24


Data Sharing Example – Molecular Dynamics

  • The atomic coordinates

    • Read-only access by the neighbor list, bonded force, and non-bonded force task groups

    • Read-write access for the position update task group

  • The force array

    • Read-only access by position update group

    • Accumulate access by bonded and non-bonded task groups

  • The neighbor list

    • Read-only access by non-bonded force task groups

    • Generated by the neighbor list task group

25

HK-UIUC


Find Concr. 6: Design Evaluation

  • Key questions to ask

    • How many UEs can be used?

    • How are the data structures shared?

    • Is there a lot of data dependency that leads to excessive synchronization needs?

    • Is there enough work in each UE between synchronizations to make parallel execution worthwhile?

26

HK-UIUC


Key Parallel Programming Steps

  • Find the concurrency in the problem

  • Structure the algorithm so that concurrency can be exploited

  • Implement the algorithm in a suitable programming environment

  • Execute and tune the performance of the code on a parallel system

27


What’s Next?

Focus on this for a while

From “Patterns for Parallel Programming”

28


Algorithm: Definition

  • A step by step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer

    • Definiteness – the notion that each step is precisely stated

    • Effective computability – each step can be carried out by a computer

    • Finiteness – the procedure terminates

  • Multiple algorithms can be used to solve the same problem

    • Some require fewer steps and some exhibit more parallelism

29

HK-UIUC


Algorithm Structure - Strategies

Organize by Tasks
  • Task Parallelism
  • Divide and Conquer

Organize by Data
  • Geometric Decomposition
  • Recursive Data Decomposition

Organize by Flow
  • Pipeline
  • Event Condition

30

HK-UIUC


Choosing Algorithm Structure

Algorithm Design

  • Organize by Task
    • Linear → Task Parallelism
    • Recursive → Divide and Conquer

  • Organize by Data
    • Linear → Geometric Decomposition
    • Recursive → Recursive Data

  • Organize by Data Flow
    • Regular → Pipeline
    • Irregular → Event Driven

31

HK-UIUC


Alg. Struct. 1: Organize by Structure ~ Linear Parallel Tasks ~

  • Common in the context of distributed memory models

  • These are the algorithms where the work needed can be represented as a collection of decoupled or loosely coupled tasks that can be executed in parallel

    • The tasks don’t even have to be identical

    • Load balancing between UE can be an issue (dynamic load balancing?)

    • If there is no data dependency involved there is no need for synchronization: the so called “embarrassingly parallel” scenario

  • Examples: ray tracing, a good portion of the N-body problem, Monte Carlo-type simulations
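
A minimal sketch of the embarrassingly parallel case, assuming a Monte Carlo estimate of π in which each UE processes an independent batch of samples with its own seed (the inline LCG is only for illustration):

/* Each UE runs an independent batch; no synchronization is needed until the
   per-UE counts are combined at the very end (pi ~ 4 * inside / total samples). */
long monte_carlo_batch(unsigned int seed, long samples)
{
    unsigned int state = seed;
    long inside = 0;
    for (long s = 0; s < samples; ++s) {
        state = state * 1664525u + 1013904223u;     /* tiny LCG, illustration only */
        double px = (state >> 8) / 16777216.0;      /* uniform in [0, 1) */
        state = state * 1664525u + 1013904223u;
        double py = (state >> 8) / 16777216.0;
        if (px * px + py * py <= 1.0)
            ++inside;
    }
    return inside;
}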

32


Alg. Struct. 2: Organize by Structure ~ Recursive Parallel Tasks (Divide & Conquer) ~

  • Valid when you can break a problem into a set of decoupled or loosely coupled smaller problems, which in turn…

    • This pattern is applicable when you can solve the small problems (the leaves) concurrently and with little synchronization

  • Chances are you need synchronization when you have this balanced tree type algorithm

    • Often required by the merging step (assembling the result from the “subresults”)

  • Examples: FFT, Linear Algebra problems (see FLAME project), the vector reduce operation
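
A minimal sequential sketch of the divide-and-conquer shape, using the vector reduce example; the two recursive calls are independent and could run concurrently, and the final addition is the merging step (the leaf cutoff is illustrative):

/* Reduce (sum) a vector by splitting it in half until a small leaf remains. */
double reduce_sum(const double *x, int n)
{
    if (n <= 1024) {                      /* leaf: small enough to do directly */
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += x[i];
        return s;
    }
    int half = n / 2;
    double left  = reduce_sum(x, half);            /* these two calls are        */
    double right = reduce_sum(x + half, n - half); /* independent of each other  */
    return left + right;                  /* merging step needs both results */
}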

33


Computational ordering can have major effects on memory bank conflicts and control flow divergence.


34


Alg. Struct. 3: Organize by Data ~ Linear (Geometric Decomposition) ~

  • This is the case where the UEs gather around and work on a big chunk of data with little or no synchronization

    • Imagine a car that needs to be painted:

      • One robot paints the front left door, another one the rear left door, another one the hood, etc.

      • The car is the “data”; it is parceled out, with a collection of UEs acting on these chunks of data

  • NOTE: this is exactly the algorithmic approach enabled best by the GPU & CUDA

  • Examples: Matrix multiplication, matrix convolution

35


Tiled (Stenciled) Algorithms are Important for G80

[Figure: tiled matrix multiplication - thread block (bx, by) computes the BLOCK_WIDTH × BLOCK_WIDTH sub-block Psub of P, with threads indexed (tx, ty) in 0..bsize-1, using the corresponding bands of the WIDTH × WIDTH matrices M and N]
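
A minimal CUDA sketch of such a tiled kernel, assuming square row-major matrices whose WIDTH is a multiple of BLOCK_WIDTH and a grid of BLOCK_WIDTH × BLOCK_WIDTH thread blocks covering P (a simplification, not the exact course code):

#define BLOCK_WIDTH 16

__global__ void MatrixMulTiled(const float *M, const float *N, float *P, int width)
{
    __shared__ float Ms[BLOCK_WIDTH][BLOCK_WIDTH];   // tile of M for this block
    __shared__ float Ns[BLOCK_WIDTH][BLOCK_WIDTH];   // tile of N for this block

    int row = blockIdx.y * BLOCK_WIDTH + threadIdx.y;
    int col = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    float sum = 0.0f;

    // March the two tiles along the bands of M and N needed for this Psub.
    for (int t = 0; t < width / BLOCK_WIDTH; ++t) {
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_WIDTH + threadIdx.y) * width + col];
        __syncthreads();                             // wait until both tiles are loaded

        for (int k = 0; k < BLOCK_WIDTH; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                             // wait before overwriting the tiles
    }
    P[row * width + col] = sum;
}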

36

HK-UIUC


Alg. Struct. 4: Organize by Data ~ Recursive Data Scenario ~

  • This scenario comes into play when the data that you operate on is structured in a recursive fashion

    • Balanced Trees

    • Graphs

    • Lists

  • Sometimes you don’t even have to have the data represented as such

    • See example with prefix scan

    • You choose to look at data as having a balanced tree structure

  • Many problems that seem inherently sequential can be approached in this framework

    • This is typically associated with a net increase in the amount of work you have to do

    • Work goes up from O(n) to O(n log(n)) (see, for instance, the Hillis-Steele algorithm)

    • The key question is whether the parallelism gained puts you ahead of the sequential alternative

37


Example: The Prefix Scan ~ Reduction Step ~

for k = 0 to M-1
    offset = 2^k
    for j = 1 to 2^(M-k-1) in parallel do
        x[j·2^(k+1) - 1] = x[j·2^(k+1) - 1] + x[j·2^(k+1) - 2^k - 1]
    endfor
endfor
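
A minimal CUDA sketch of this reduction step, assuming the whole array of n = 2^M elements fits in one block's shared memory and the kernel is launched with n/2 threads (kernel name and launch configuration are illustrative):

__global__ void scan_reduce_step(float *g_data, int n)
{
    extern __shared__ float x[];          // n floats of shared memory
    int tid = threadIdx.x;                // 0 .. n/2 - 1

    x[2 * tid]     = g_data[2 * tid];     // each thread loads two elements
    x[2 * tid + 1] = g_data[2 * tid + 1];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {   // offset = 2^k
        int j = tid + 1;                               // j = 1 .. 2^(M-k-1)
        if (j <= n / (2 * offset)) {
            int hi = j * 2 * offset - 1;               // j*2^(k+1) - 1
            x[hi] += x[hi - offset];                   // add the element 2^k positions to the left
        }
        __syncthreads();
    }

    g_data[2 * tid]     = x[2 * tid];     // write partial sums back
    g_data[2 * tid + 1] = x[2 * tid + 1];
}
// Possible launch: scan_reduce_step<<<1, n / 2, n * sizeof(float)>>>(d_data, n);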

38


Example: The Array Reduction (the bad choice)

Array elements:  0    1    2    3    4    5    6    7    8    9    10    11

Iteration 1:     0+1       2+3       4+5       6+7       8+9       10+11
Iteration 2:     0...3               4..7                8..11
Iteration 3:     0..7                                    8..15

39

HK-UIUC


Alg. Struct. 5: Organize by Data Flow ~ Regular: Pipeline ~

  • Tasks are parallel, but there is a high degree of synchronization (coordination) between the UEs

    • The input of one UE is the output of an upwind UE (the “pipeline”)

    • The time elapsed between two pipeline ticks is dictated by the slowest stage of the pipe (the bottleneck)

  • Commonly employed by sequential chips

  • For complex problems having a deep pipeline is attractive

    • At each pipeline tick you output one set of results

  • Examples:

    • CPU: instruction fetch, decode, data fetch, instruction execution, data write are pipelined.

    • The Cray vector architecture draws heavily on this algorithmic structure when performing linear algebra operations

40


Alg. Struct. 6: Organize by Data Flow ~ Irregular: Event Driven Scenarios ~

  • The furthest away from what the GPU can support today

  • Well supported by MIMD architectures

  • You coordinate UEs through asynchronous events

  • Critical aspects:

    • Load balancing – should be dynamic

    • Communication overhead, particularly in real-time applications

  • Suitable for action-reaction type simulations

  • Examples: computer games, traffic control algorithms

41


    Key Parallel Programming Steps

    • Find the concurrency in the problem

    • Structure the algorithm so that concurrency can be exploited

    • Implement the algorithm in a suitable programming environment

    • Execute and tune the performance of the code on a parallel system

    42


    What’s Next?

    Focus on this for a while

    43


    Objective

    • To understand SPMD and the role of the CUDA supporting structures

      • How it relates to other common parallel programming structures.

      • What the supporting structures entail for applications development.

      • How the supporting structures could be used to implement parameterized applications for different levels of efficiency and complexity.

    44


    Parallel Programming Coding Styles - Program and Data Models

    Program Models: SPMD, Master/Worker, Loop Parallelism, Fork/Join

    Data Models: Shared Data, Shared Queue, Distributed Array

    These are not necessarily mutually exclusive.

    45


    Program Models

    • SPMD (Single Program, Multiple Data)

      • All PEs (Processor Elements) execute the same program in parallel

      • Each PE has its own data

      • Each PE uses a unique ID to access its portion of data

      • Different PE can follow different paths through the same code

      • This is essentially the CUDA Grid model

      • SIMD is a special case - the warp

    • Master/Worker

    • Loop Parallelism

    • Fork/Join

    46


    Program Models

    • SPMD (Single Program, Multiple Data)

    • Master/Worker

      • A Master thread sets up a pool of worker threads and a bag of tasks

      • Workers execute concurrently, removing tasks until done

    • Loop Parallelism

      • Loop iterations execute in parallel

      • FORTRAN do-all (truly parallel), do-across (with dependence)

    • Fork/Join

      • The most general, generic way of creating threads

    47


    Review: Algorithm Structure

    Algorithm Design

      • Organize by Task
        • Linear → Task Parallelism
        • Recursive → Divide and Conquer

      • Organize by Data
        • Linear → Geometric Decomposition
        • Recursive → Recursive Data

      • Organize by Data Flow
        • Regular → Pipeline
        • Irregular → Event Driven

    48


    Algorithm Structures vs. Coding Styles (four smiles is best)

    Source: Mattson, et al.

    49


    Supporting Structures vs. Program Models

    50


    More on SPMD

    • Dominant coding style of scalable parallel computing

      • MPI code is mostly developed in SPMD style

      • Much OpenMP code is also written in SPMD style (next to loop parallelism)

      • Particularly suitable for algorithms based on data parallelism, geometric decomposition, divide and conquer.

    • Main advantage

      • Tasks and their interactions are visible in one piece of source code; no need to correlate multiple sources

    SPMD is by far the most commonly used pattern for structuring parallel programs.

    51


    Typical SPMD Program Phases

    • Initialize

      • Establish localized data structure and communication channels

    • Obtain a unique identifier

      • Each thread acquires a unique identifier, typically ranging from 0 to N-1, where N is the number of threads.

      • Both OpenMP and CUDA have built-in support for this.

    • Distribute Data

      • Decompose global data into chunks and localize them, or

      • Sharing/replicating the major data structure, using thread IDs to associate subsets of the data with threads

    • Run the core computation

      • More details in next slide…

    • Finalize

      • Reconcile global data structure, prepare for the next major iteration
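
    A minimal CUDA sketch of the “obtain a unique identifier” and “distribute data” phases, assuming a 1-D grid and a block-wise partition of an array (names are illustrative):

    __global__ void spmd_phase_sketch(const float *in, float *out, int n, int chunk)
    {
        // Obtain a unique identifier in 0 .. (total number of threads - 1).
        int my_id = blockIdx.x * blockDim.x + threadIdx.x;

        // Distribute data: this thread owns elements [begin, end) of the global array.
        int begin = my_id * chunk;
        int end   = min(begin + chunk, n);

        // Core computation on the local chunk.
        for (int i = begin; i < end; ++i)
            out[i] = 2.0f * in[i];        // placeholder computation
    }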

    52


    Core Computation Phase

    • Thread IDs are used to differentiate behavior of threads

      • Use thread ID in loop index calculations to split loop iterations among threads

      • Use thread ID or conditions based on thread ID to branch to their specific actions

    Both can have very different performance results and code complexity depending on the way they are done.

    53


    A Simple Example

    • Assume

      • The computation being parallelized has 1,000,000 iterations.

    • Sequential code:

    num_steps = 1000000;
    for (i = 0; i < num_steps; i++) {
        ....
    }

    54


    SPMD Code Version 1

    • Assign a chunk of iterations to each thread

      • The last thread also finishes up the remaining iterations

    num_steps = 1000000;
    i_start = my_id * (num_steps/num_threads);
    i_end   = i_start + (num_steps/num_threads);
    if (my_id == (num_threads-1)) i_end = num_steps;
    for (i = i_start; i < i_end; i++) {
        ....
    }

    Reconciliation of results across threads if necessary.

    55


    Problems with Version 1

    • The last thread executes more iterations than others

    • The number of extra iterations is up to the total number of threads – 1

      • This is not a big problem when the number of threads is small

      • When there are thousands of threads, this can create serious load imbalance problems.

    • Also, the extra if statement is a typical source of “branch divergence” in CUDA programs.

    56


    SPMD Code Version 2

    • Assign one more iteration to some of the threads

    int rem = num_steps % num_threads;
    i_start = my_id * (num_steps/num_threads);
    i_end   = i_start + (num_steps/num_threads);
    if (rem != 0) {
        if (my_id < rem) {
            i_start += my_id;
            i_end   += (my_id + 1);
        }
        else {
            i_start += rem;
            i_end   += rem;
        }
    }
    /* ... then iterate from i_start to i_end as in Version 1 */

    Less load imbalance

    More branch divergence.

    57


    SPMD Code Version 3

    • Use a cyclic distribution of iterations

    num_steps = 1000000;
    for (i = my_id; i < num_steps; i += num_threads) {
        ....
    }

    Less load imbalance

    Loop branch divergence in the last Warp

    Data padding further eliminates divergence.
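
    In CUDA, this cyclic distribution corresponds to a grid-stride style loop; a minimal sketch, assuming a 1-D launch (names are illustrative):

    __global__ void cyclic_sketch(float *data, int num_steps)
    {
        int my_id       = blockIdx.x * blockDim.x + threadIdx.x;
        int num_threads = gridDim.x * blockDim.x;

        // Each thread handles iterations my_id, my_id + num_threads, my_id + 2*num_threads, ...
        for (int i = my_id; i < num_steps; i += num_threads)
            data[i] = 0.0f;               // placeholder work for iteration i
    }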

    58


    Comparing the Three Versions

    [Diagram: the iterations assigned to threads ID=0 through ID=3 under Version 1 (block partition, remainder on the last thread), Version 2 (block partition, one extra iteration on the first rem threads), and Version 3 (cyclic partition)]

    Padded Version 1 may be best for some data access patterns.

    59


    Data Styles

    • Shared Data

      • All threads share a major data structure

      • This is what CUDA supports

    • Shared Queue

      • All threads see a “thread safe” queue that maintains ordering of data communication

    • Distributed Array

      • Decomposed and distributed among threads

      • Limited support in CUDA Shared Memory

    60

