Parallel programming
Parallel Programming. Sathish S. Vadhiyar Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009. Motivation for Parallel Programming. Faster Execution time due to non-dependencies between regions of code Presents a level of modularity Resource constraints. Large databases.

Download Presentation

Parallel Programming

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Parallel programming

Parallel Programming

Sathish S. Vadhiyar

Course Web Page:

http://www.serc.iisc.ernet.in/~vss/courses/PPP2009


Motivation for Parallel Programming

  • Faster Execution time due to non-dependencies between regions of code

  • Presents a level of modularity

  • Resource constraints. Large databases.

  • Certain classes of algorithms lend themselves naturally to parallelization

  • Aggregate bandwidth to memory/disk. Increase in data throughput.

    • Clock rate improvement in the past decade – about 40% per year

    • Memory access time improvement in the past decade – about 10% per year

  • Grand challenge problems (more later)


Challenges / Problems in Parallel Algorithms

  • Building efficient algorithms.

  • Avoiding

    • Communication delay

    • Idling

    • Synchronization


Challenges

[Figure: timelines of processes P0 and P1, broken into computation, communication, synchronization and idle time]


How do we evaluate a parallel program?

  • Execution time, Tp

  • Speedup, S

    • S(p, n) = T(1, n) / T(p, n)

    • Usually, S(p, n) < p

    • Sometimes S(p, n) > p (superlinear speedup)

  • Efficiency, E

    • E(p, n) = S(p, n)/p

    • Usually, E(p, n) < 1

    • Sometimes E(p, n) > 1 (when the speedup is superlinear)

  • Scalability – limits on parallel performance, and how they depend on n and p
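
The definitions of S and E above can be checked with a few lines of Python. This is only a minimal sketch: the timings (120 s on one processor, 20 s on eight) are made-up values for illustration, not measurements from the course.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S(p, n) = T(1, n) / T(p, n)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """E(p, n) = S(p, n) / p."""
    return speedup(t_serial, t_parallel) / p


# Hypothetical timings: 120 s on one processor, 20 s on eight processors.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
print(f"S = {s:.2f}, E = {e:.2f}")  # prints "S = 6.00, E = 0.75"
```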


Speedups and efficiency

[Figure: speedup S versus p and efficiency E versus p, each showing an ideal curve and a practical curve]


Limitations on speedup – Amdahl’s law

  • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

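  • The overall speedup is expressed in terms of the fraction of computation time that can use the enhancement and the improvement the enhancement provides.

  • Places a limit on the speedup due to parallelism.

  • Speedup = 1 / (fs + fp/P), where fs and fp are the serial and parallel fractions of the computation and P is the number of processors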


Amdahl’s law Illustration

S = 1 / (s + (1-s)/p)

Courtesy:

http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html

http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm


Amdahl’s law analysis

  • For a fixed serial fraction, the speedup falls further and further behind the number of processors as the processor count grows.

  • Thus Amdahl’s law is a bit depressing for parallel programming.

  • In practice, the number of parallel portions of work has to be large enough to match a given number of processors.
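
The effect can be seen by evaluating S = 1 / (s + (1-s)/p) for a fixed serial fraction. The sketch below assumes s = 0.1 purely for illustration; the function name is hypothetical.

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Amdahl's law: S = 1 / (s + (1 - s) / p), where s is the serial fraction."""
    return 1.0 / (s + (1.0 - s) / p)


serial_fraction = 0.1  # assumed value, for illustration only
for p in (2, 8, 64, 1024):
    sp = amdahl_speedup(serial_fraction, p)
    print(f"p = {p:5d}  speedup = {sp:6.2f}  efficiency = {sp / p:.3f}")

# However large p becomes, the speedup stays below 1 / s.
print(f"upper bound: {1.0 / serial_fraction:.1f}")
```

With these assumptions the speedup never reaches 10, so the efficiency keeps dropping as p grows.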


Gustafson’s Law

  • Amdahl’s law – keeps the problem size (total work) fixed as processors are added

  • Gustafson’s law – keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel/sequential work) to match that time

  • For a particular number of processors, find the problem size for which the parallel time equals the fixed time

  • For that problem size, find the sequential time and the corresponding speedup

  • The resulting speedup is called the scaled speedup
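
A common closed form for the scaled speedup is S = s + p(1 - s), where s is the serial fraction of the time measured on the parallel machine (note this is a different s from the one in Amdahl's law). The sketch below uses that form; the value s = 0.1 is an assumption for illustration.

```python
def gustafson_speedup(s: float, p: int) -> float:
    """Scaled speedup: the parallel run takes time s + (1 - s) = 1,
    while one processor would need s + p * (1 - s) for the same problem."""
    return s + p * (1.0 - s)


serial_fraction = 0.1  # assumed fraction of the parallel run time that is serial
for p in (2, 8, 64, 1024):
    print(f"p = {p:5d}  scaled speedup = {gustafson_speedup(serial_fraction, p):8.1f}")
```

Under this model the speedup keeps growing almost linearly with p, because the problem size grows with the machine.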


Metrics (Contd..)

Table 5.1: Efficiency as a function of n and p.


Scalability

  • Efficiency decreases with increasing P; increases with increasing N

  • How effectively the parallel algorithm can use an increasing number of processors

  • How the amount of computation performed must scale with P to keep E constant

  • This function, which gives the required computation in terms of P, is called the isoefficiency function.

  • An algorithm with an isoefficiency function of O(P) is highly scalable while an algorithm with quadratic or exponential isoefficiency function is poorly scalable


Scalability Analysis – Finite Difference algorithm with 1D decomposition

For constant efficiency E, the function of P substituted for N must keep the efficiency relation satisfied as P increases.

This can be satisfied with N = P, except for small P.

Hence the isoefficiency function is O(P^2), since the computation is O(N^2).


Scalability Analysis – Finite Difference algorithm with 2D decomposition

This can be satisfied with N = sqrt(P).

Hence the isoefficiency function is O(P), since the computation is O(N^2) = O(P).

The 2D algorithm is therefore more scalable than the 1D one.
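
The two scaling rules can be tried out on a simplified cost model. Everything in the sketch below is an assumption made for illustration and is not the model used in the slides: an N x N grid, a per-point compute cost T_C, a message startup cost T_S, a per-word transfer cost T_W, and a scale factor K applied to N.

```python
import math

T_C, T_S, T_W = 1.0, 10.0, 1.0  # assumed compute, startup and per-word costs
K = 32                          # assumed scale factor: N = K * P or N = K * sqrt(P)


def efficiency_1d(n: int, p: int) -> float:
    """N x N grid cut into P strips: each processor exchanges 2 edges of N points."""
    compute = T_C * n * n
    comm_per_proc = 2 * (T_S + T_W * n)
    return compute / (compute + p * comm_per_proc)


def efficiency_2d(n: int, p: int) -> float:
    """N x N grid cut into sqrt(P) x sqrt(P) blocks: 4 edges of N / sqrt(P) points."""
    compute = T_C * n * n
    comm_per_proc = 4 * (T_S + T_W * n / math.sqrt(p))
    return compute / (compute + p * comm_per_proc)


for p in (16, 64, 256, 1024):
    n1 = K * p                 # 1D: N grows like P, so work grows like P^2
    n2 = K * math.isqrt(p)     # 2D: N grows like sqrt(P), so work grows like P
    print(f"P = {p:5d}  1D: N = {n1:6d} E = {efficiency_1d(n1, p):.3f}"
          f"   2D: N = {n2:5d} E = {efficiency_2d(n2, p):.3f}")
```

In this model both efficiencies stay (nearly) constant, but the 1D decomposition needs the grid side to grow with P while the 2D decomposition only needs it to grow with sqrt(P).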


Parallel Algorithm Design


Steps

  • Decomposition – Splitting the problem into tasks or modules

  • Mapping – Assigning tasks to processors

  • Mapping’s contradictory objectives

    • To minimize idle times

    • To reduce communications


Mapping

  • Static mapping

    • Mapping based on Data partitioning

      • Applicable to dense matrix computations

      • Block distribution

      • Block-cyclic distribution

    • Graph partitioning based mapping

      • Applicable for sparse matrix computations

    • Mapping based on task partitioning

[Figure: two assignments of nine blocks to processes 0–2: 0 0 0 1 1 1 2 2 2 and 0 1 2 0 1 2 0 1 2]
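
As a one-dimensional illustration of the two data distributions named above, the owner of element i can be computed directly. The function names and the default block size of 1 (a plain cyclic distribution) are assumptions made for this sketch.

```python
import math


def block_owner(i: int, n: int, p: int) -> int:
    """Block distribution: n elements are split into p contiguous chunks."""
    chunk = math.ceil(n / p)
    return i // chunk


def block_cyclic_owner(i: int, p: int, block_size: int = 1) -> int:
    """Block-cyclic distribution: blocks of block_size are dealt out round-robin."""
    return (i // block_size) % p


n, p = 9, 3
print([block_owner(i, n, p) for i in range(n)])     # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print([block_cyclic_owner(i, p) for i in range(n)]) # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Block-cyclic distributions are typically preferred when the amount of work per element varies across the matrix, since each process then receives pieces from every region.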


Based on Task Partitioning

  • Based on task dependency graph

  • In general, finding an optimal mapping is NP-complete

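[Figure: binary task-dependency tree mapped onto 8 processes. Root: 0; next level: 0, 4; next level: 0, 2, 4, 6; leaves: 0, 1, 2, 3, 4, 5, 6, 7]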


Mapping

  • Dynamic Mapping

    • A process/global memory can hold a set of tasks

    • Distribute some tasks to all processes

    • Once a process completes its tasks, it asks the coordinator process for more tasks

    • Referred to as self-scheduling, work-stealing
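
A minimal sketch of self-scheduling, assuming Python threads stand in for processes and a shared queue plays the role of the coordinator. The task (squaring a number) and the function name are placeholders.

```python
import queue
import threading


def run_work_pool(tasks, num_workers=4):
    """Each worker repeatedly takes the next task from the shared pool until it is empty."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()  # ask the pool for more work
            except queue.Empty:
                return                    # no tasks left: this worker is done
            value = task * task           # stand-in for the real computation
            with lock:
                results[task] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


print(run_work_pool(range(10), num_workers=3))
```

Because a worker only asks for a new task when it has finished the previous one, faster workers automatically end up doing more tasks, which reduces idle time.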


Interaction Overheads

  • In spite of the best efforts in mapping, there can be interaction overheads

  • Due to frequent communications, exchanging large volume of data, interaction with the farthest processors etc.

  • Some techniques can be used to minimize interactions


Parallel Algorithm Design - Containing Interaction Overheads

  • Maximizing data locality

    • Minimizing volume of data exchange

      • Using higher dimensional mapping

      • Not communicating intermediate results

    • Minimizing frequency of interactions

  • Minimizing contention and hot spots

    • Avoid having every process use the same communication pattern at the same time (e.g. all processes contacting the same partner in the same step)


Parallel Algorithm Design - Containing Interaction Overheads

  • Overlapping computations with interactions

    • Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)

    • Initiate communication for type 1; During communication, perform type 2

  • Overlapping interactions with interactions

  • Replicating data or computations

    • Balancing the extra computation or storage cost with the gain due to less communication
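
Overlapping computation with communication can be sketched as follows. The "communication" here is simulated with a sleep in a background thread, and all function names and values are hypothetical; in a message-passing program the same pattern would be written with non-blocking sends and receives.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_remote_boundary():
    """Stand-in for a communication step (e.g. receiving boundary values)."""
    time.sleep(0.2)
    return [1.0, 2.0]


def interior_update(values):
    """Type 2 work: needs only local data."""
    return [v * 0.5 for v in values]


def boundary_update(local, remote):
    """Type 1 work: needs the communicated data."""
    return [local[0] + remote[0], local[-1] + remote[-1]]


def step(local):
    with ThreadPoolExecutor(max_workers=1) as ex:
        pending = ex.submit(fetch_remote_boundary)  # initiate communication
        interior = interior_update(local)           # type 2 work runs meanwhile
        remote = pending.result()                   # wait for the data to arrive
    return interior, boundary_update(local, remote) # type 1 work


print(step([2.0, 4.0, 6.0]))
```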


Parallel Algorithm Classification

– Types

- Models


Parallel Algorithm Types

  • Divide and conquer

  • Data partitioning / decomposition

  • Pipelining


Divide-and-Conquer

  • Recursive in structure

    • Divide the problem into sub-problems that are similar to the original, smaller in size

    • Conquer the sub-problems by solving them recursively. If small enough, solve them in a straightforward manner

    • Combine the solutions to create a solution to the original problem


Divide-and-Conquer Example: Merge Sort

  • Problem: Sort a sequence of n elements

  • Divide: Split the sequence into two subsequences of n/2 elements each

  • Conquer: Sort the two subsequences recursively using merge sort

  • Combine: Merge the two sorted subsequences to produce sorted answer
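
The three steps can be written out as a short sequential sketch. The two recursive calls work on disjoint halves and do not depend on each other, which is what makes the algorithm a candidate for parallel execution; the code itself runs them one after the other.

```python
def merge(left, right):
    """Combine: merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_sort(seq):
    if len(seq) <= 1:              # small enough: solve directly
        return list(seq)
    mid = len(seq) // 2            # divide
    left = merge_sort(seq[:mid])   # conquer (independent of the right half)
    right = merge_sort(seq[mid:])  # conquer (independent of the left half)
    return merge(left, right)      # combine


print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```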


Partitioning

  • Breaking up the given problem into p independent subproblems of almost equal sizes

  • Solving the p subproblems concurrently

  • Mostly splitting the input or output into non-overlapping pieces

  • Example: Matrix multiplication

  • Either the inputs (A or B) or output (C) can be partitioned.
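
One way to partition the output is to give each worker a contiguous block of rows of C. The sketch below assumes plain Python lists and a thread pool; the function names are hypothetical, and in CPython the threads illustrate the decomposition rather than deliver a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def row_block(A, B, rows):
    """Compute the rows of C = A * B listed in `rows`; no other rows are touched."""
    cols = range(len(B[0]))
    inner = range(len(B))
    return [[sum(A[i][k] * B[k][j] for k in inner) for j in cols] for i in rows]


def partitioned_matmul(A, B, p=2):
    n = len(A)
    chunk = -(-n // p)  # ceiling of n / p
    blocks = [range(start, min(start + chunk, n)) for start in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=p) as ex:
        parts = list(ex.map(lambda rows: row_block(A, B, rows), blocks))
    C = []
    for part in parts:  # blocks come back in order, so rows are in order too
        C.extend(part)
    return C


print(partitioned_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Since the row blocks of C do not overlap, the subproblems need no synchronization beyond collecting the results.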


Pipelining

Occurs with image processing applications where a number of images undergoes a sequence of transformations.
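
A pipeline of this kind can be sketched with one thread per stage and a queue between consecutive stages. The stage functions below are arbitrary arithmetic placeholders rather than image transformations, and the names are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))  # [20, 30, 40]
```

While the second stage works on the first item, the first stage can already start on the second item, which is where the parallelism comes from.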


Parallel Algorithm Models

  • Data parallel model

    • Processes perform identical tasks on different data

  • Task parallel model

    • Different processes perform different tasks on same or different data – based on task dependency graph

  • Work pool model

    • Any task can be performed by any process. Tasks are added to a work pool dynamically

  • Pipeline model

    • A stream of data passes through a chain of processes – stream parallelism


Parallel Program Classification

- Models

- Structure

- Paradigms


Parallel Program Models

  • Single Program Multiple Data (SPMD)

  • Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Parallel Program Structure Types

  • Master-Worker / parameter sweep / task farming

  • Embarrassingly/pleasingly parallel

  • Pipeline / systolic / wavefront

  • Tightly-coupled

  • Workflow

[Figure: two program-structure diagrams, each drawn over processes P0, P1, P2, P3 and P4]


Programming Paradigms

  • Shared memory model – Threads, OpenMP

  • Message passing model – MPI

  • Data parallel model – HPF

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Parallel Architectures Classification

- Classification

- Cache coherence in shared memory platforms

- Interconnection networks


Classification of Architectures – Flynn’s classification

  • Single Instruction Single Data (SISD): Serial Computers

  • Single Instruction Multiple Data (SIMD)

    - Vector processors and processor arrays

    - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Flynn’s classification

  • Multiple Instruction Single Data (MISD): Not popular

  • Multiple Instruction Multiple Data (MIMD)

    - Most popular

    - IBM SP and most other supercomputers,

    clusters, computational Grids etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Based on Memory

  • Shared memory

  • 2 types – UMA and NUMA

  • NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q

[Figure: UMA and NUMA shared-memory organizations]

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Based on Memory

  • Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  • Recently multi-cores

  • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids


Cache Coherence

- for details, read 2.4.6 of book

Interconnection networks

- for details, read 2.4.2-2.4.5 of book


Cache Coherence in SMPs

  • All processes read variable ‘x’ residing in cache line ‘a’

  • Each process updates ‘x’ at different points of time

[Figure: CPU0, CPU1, CPU2 and CPU3, each with its own cache (cache0 to cache3) holding a copy of cache line ‘a’; main memory also holds ‘a’]

  • Challenge: To maintain a consistent view of the data

  • Protocols:

    • Write update

    • Write invalidate


Cache Coherence Protocols and Implementations

  • Write update – propagate the updated cache line to the other processors on every write by a processor

  • Write invalidate – a write invalidates the other processors' copies; each processor gets the updated cache line when it next reads the stale data

  • Which is better??


Caches – False Sharing

  • Different processors update different parts of the same cache line

  • Leads to ping-pong of cache lines between processors

  • Situation better in update protocols than invalidate protocols. Why?

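  • Modify the algorithm to change the stride

[Figure: CPU0 and CPU1 update interleaved elements (A0, A2, A4… and A1, A3, A5…); cache0 and cache1 both hold the affected lines from main memory. Line labels shown: A0 – A8, A9 – A15]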
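
The effect of changing the stride can be illustrated by counting how many cache lines both CPUs write to. The line size of 8 elements and the 16-element array are assumptions for this sketch; the code only models the index-to-line mapping, not real cache behaviour.

```python
LINE_SIZE = 8  # assumed number of array elements per cache line


def lines_written(indices):
    """Set of cache-line numbers touched by the given element indices."""
    return {i // LINE_SIZE for i in indices}


def shared_lines(assign0, assign1):
    """Cache lines written by both CPUs: candidates for ping-ponging."""
    return lines_written(assign0) & lines_written(assign1)


n = 16
interleaved = (range(0, n, 2), range(1, n, 2))   # CPU0: even, CPU1: odd elements
blocked = (range(0, n // 2), range(n // 2, n))   # each CPU gets a contiguous half

print(len(shared_lines(*interleaved)))  # 2: every line is written by both CPUs
print(len(shared_lines(*blocked)))      # 0: no line is written by both CPUs
```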


Cache Coherence using Invalidate Protocols

  • 3 states associated with data items

    • Shared – a variable shared by 2 caches

    • Invalid – another processor (say P0) has updated the data item

    • Dirty – state of the data item in P0

  • Implementations

    • Snoopy

      • for bus based architectures

      • Memory operations are propagated over the bus and snooped

      • Instead of broadcasting memory operations to all processors, propagate coherence operations to relevant processors

    • Directory-based

      • A central directory maintains states of cache blocks, associated processors

      • Implemented with presence bits
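
The three states can be traced with a toy model of a single cache line. This is only a sketch of a simple write-invalidate scheme under stated assumptions: a line that has never been loaded is also treated as Invalid, a Dirty copy is written back to memory before anyone else reads or writes, and the class and method names are hypothetical.

```python
SHARED, INVALID, DIRTY = "shared", "invalid", "dirty"


class CacheLine:
    """One memory location cached by several processors."""

    def __init__(self, num_caches, value=0):
        self.memory = value
        self.state = [INVALID] * num_caches  # assumption: not-yet-loaded == invalid
        self.copy = [None] * num_caches

    def _flush_dirty_owner(self):
        # If some cache holds a dirty copy, write it back and downgrade it.
        for owner, st in enumerate(self.state):
            if st == DIRTY:
                self.memory = self.copy[owner]
                self.state[owner] = SHARED

    def read(self, cpu):
        if self.state[cpu] == INVALID:       # stale or missing: fetch fresh data
            self._flush_dirty_owner()
            self.copy[cpu] = self.memory
            self.state[cpu] = SHARED
        return self.copy[cpu]

    def write(self, cpu, value):
        if self.state[cpu] != DIRTY:
            self._flush_dirty_owner()
            for other in range(len(self.state)):
                if other != cpu:
                    self.state[other] = INVALID  # invalidate all other copies
        self.copy[cpu] = value
        self.state[cpu] = DIRTY


line = CacheLine(num_caches=2, value=1)
line.read(0); line.read(1)
print(line.state)    # ['shared', 'shared']
line.write(0, 5)
print(line.state)    # ['dirty', 'invalid']
print(line.read(1))  # 5
print(line.state)    # ['shared', 'shared']
```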


Interconnection Networks

  • An interconnection network is defined by switches, links and interfaces

    • Switches – provide mapping between input and output ports, buffering, routing etc.

    • Interfaces – connect nodes with the network

  • Network topologies

    • Static – point-to-point communication links among processing nodes

    • Dynamic – Communication links are formed dynamically by switches


Interconnection Networks

  • Static

    • Bus – SGI Challenge

    • Completely connected

    • Star

    • Linear array, Ring (1-D torus)

    • Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus

    • k-d mesh: d dimensions with k nodes in each dimension

    • Hypercubes – a 2-ary (log p)-dimensional mesh, i.e. k = 2 and d = log p – e.g. many MIMD machines

    • Trees – our campus network

  • Dynamic – Communication links are formed dynamically by switches

    • Crossbar – Cray X series – non-blocking network

    • Multistage – SP2 – blocking network.

  • For more details, and evaluation of topologies, refer to book


Evaluating Interconnection topologies

  • Diameter – maximum distance between any two processing nodes

    • Fully connected – 1

    • Star – 2

    • Ring – p/2

    • Hypercube – log P

  • Connectivity – multiplicity of paths between 2 nodes. Minimum number of arcs that must be removed from the network to break it into two disconnected networks

    • Linear array – 1

    • Ring – 2

    • 2-d mesh – 2

    • 2-d mesh with wraparound – 4

    • d-dimensional hypercube – d
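
The diameter values can be checked by building small instances of each topology and running a breadth-first search from every node. The builder functions and the choice of 16 nodes are assumptions made for this sketch.

```python
from collections import deque


def ring(p):
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def hypercube(d):
    # Nodes are d-bit labels; neighbours differ in exactly one bit.
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(1 << d)}


def star(p):
    adj = {0: list(range(1, p))}           # node 0 is the hub
    adj.update({i: [0] for i in range(1, p)})
    return adj


def fully_connected(p):
    return {i: [j for j in range(p) if j != i] for i in range(p)}


def diameter(adj):
    """Largest shortest-path distance between any two nodes."""
    def eccentricity(src):
        dist = {src: 0}
        frontier = deque([src])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        return max(dist.values())
    return max(eccentricity(s) for s in adj)


print(diameter(fully_connected(16)))  # 1
print(diameter(star(16)))             # 2
print(diameter(ring(16)))             # 8  (p/2)
print(diameter(hypercube(4)))         # 4  (log2 of 16)
```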


Evaluating Interconnection topologies

  • Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves

    • Ring – 2

    • P-node 2-D mesh – sqrt(P)

    • Tree – 1

    • Star – 1

    • Completely connected – P^2/4

    • Hypercube – P/2


Evaluating Interconnection topologies

  • channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes

  • channel rate – performance of a single physical wire

  • channel bandwidth – channel rate times channel width

  • bisection bandwidth – maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth
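
The last two definitions are simple products, as the sketch below shows. The bisection widths are the closed forms for each topology; the machine parameters (64 nodes, 16 wires per link, 1e9 bits per second per wire) are invented for illustration, and P is assumed to be an even perfect square or a power of two where the formula needs it.

```python
import math


def bisection_width(topology: str, p: int) -> int:
    """Closed-form bisection widths for a P-node network."""
    widths = {
        "ring": 2,
        "2d_mesh": math.isqrt(p),
        "tree": 1,
        "star": 1,
        "completely_connected": p * p // 4,
        "hypercube": p // 2,
    }
    return widths[topology]


def channel_bandwidth(channel_rate: float, channel_width: int) -> float:
    """Channel bandwidth = channel rate times channel width."""
    return channel_rate * channel_width


def bisection_bandwidth(topology, p, channel_rate, channel_width):
    """Bisection bandwidth = bisection width times channel bandwidth."""
    return bisection_width(topology, p) * channel_bandwidth(channel_rate, channel_width)


# Hypothetical machine: 64 nodes, 16 wires per link, 1e9 bits/s per wire.
for topo in ("ring", "2d_mesh", "hypercube", "completely_connected"):
    bw = bisection_bandwidth(topo, 64, 1e9, 16)
    print(f"{topo:22s} {bw:.3e} bits/s")
```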

