
Parallel Programming

Sathish S. Vadhiyar

Course Web Page:

http://www.serc.iisc.ernet.in/~vss/courses/PPP2009


Motivation for Parallel Programming

  • Faster execution time, by exploiting non-dependencies between regions of code

  • Presents a level of modularity

  • Resource constraints (e.g., large databases)

  • Certain classes of algorithms lend themselves naturally to parallelism

  • Aggregate bandwidth to memory/disk. Increase in data throughput.

    • Clock rate improvement in the past decade – 40%

    • Memory access time improvement in the past decade – 10%

  • Grand challenge problems (more later)


Challenges / Problems in Parallel Algorithms

  • Building efficient algorithms.

  • Avoiding

    • Communication delay

    • Idling

    • Synchronization


Challenges

[Figure: timeline for two processes P0 and P1, showing computation, communication, synchronization, and idle time]


How do we evaluate a parallel program?

  • Execution time, Tp

  • Speedup, S

    • S(p, n) = T(1, n) / T(p, n)

    • Usually, S(p, n) < p

    • Sometimes S(p, n) > p (superlinear speedup)

  • Efficiency, E

    • E(p, n) = S(p, n)/p

    • Usually, E(p, n) < 1

    • Sometimes, greater than 1

  • Scalability – limitations of parallel computing, and how they relate to n and p
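As an illustration of these metrics, here is a minimal sketch (a hypothetical helper, not from the course material) that computes S(p, n) and E(p, n) from a measured serial time and a measured parallel time:

    #include <stdio.h>

    /* Hypothetical helper: compute speedup and efficiency from a measured
     * serial time t1 and parallel time tp on p processors. */
    static void report_metrics(double t1, double tp, int p)
    {
        double speedup    = t1 / tp;       /* S(p, n) = T(1, n) / T(p, n) */
        double efficiency = speedup / p;   /* E(p, n) = S(p, n) / p       */
        printf("p = %d  S = %.2f  E = %.2f\n", p, speedup, efficiency);
    }

    int main(void)
    {
        report_metrics(100.0, 15.0, 8);    /* prints S = 6.67, E = 0.83 */
        return 0;
    }

The timings themselves would typically come from MPI_Wtime or omp_get_wtime in a real measurement.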


Speedups and Efficiency

[Figure: speedup S and efficiency E plotted against the number of processors p, with ideal and practical curves]


Limitations on speedup – Amdahl’s law

  • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

  • Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the performance increase of the enhancement

  • Places a limit on the speedup due to parallelism.

  • Speedup = 1 / (fs + fp/P), where fs is the serial (non-parallelizable) fraction, fp = 1 - fs is the parallel fraction, and P is the number of processors
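To make the limit concrete, here is a small sketch (an assumed illustration, not part of the original slides) that evaluates the formula for a fixed serial fraction; with fs = 0.1 the speedup can never exceed 1/fs = 10, no matter how many processors are used:

    #include <stdio.h>

    /* Amdahl's law: speedup with serial fraction fs (fp = 1 - fs) on P processors. */
    static double amdahl_speedup(double fs, int P)
    {
        return 1.0 / (fs + (1.0 - fs) / P);
    }

    int main(void)
    {
        /* With fs = 0.1 the speedup approaches 10 as P grows (6.4 at P = 16,
         * about 9.9 at P = 1024). */
        for (int P = 1; P <= 1024; P *= 4)
            printf("P = %4d  S = %.2f\n", P, amdahl_speedup(0.1, P));
        return 0;
    }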


Amdahl’s law Illustration

S = 1 / (s + (1-s)/p)

Courtesy:

http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html

http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm


Amdahl’s law analysis

  • For a fixed serial fraction, the achievable speedup falls further and further behind the number of processors as the processor count grows.

  • Thus Amdahl’s law is a bit depressing for parallel programming.

  • In practice, the number of parallel portions of work has to be large enough to match a given number of processors.


Gustafson’s Law

  • Amdahl’s law – keeps the total problem size (and hence the amount of parallel work) fixed

  • Gustafson’s law – keep computation time on parallel processors fixed, change the problem size (fraction of parallel/sequential work) to match the computation time

  • For a particular number of processors, find the problem size for which parallel time is equal to the constant time

  • For that problem size, find the sequential time and the corresponding speedup

  • The speedup obtained in this way is called scaled speedup
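Written out, Gustafson's scaled speedup for a serial fraction s of the parallel run time is s + (1 - s) * p, which grows linearly with p instead of saturating at 1/s. A minimal sketch (the numbers are only an assumed example):

    #include <stdio.h>

    /* Gustafson's law (scaled speedup): s is the serial fraction of the
     * time measured on the parallel run with p processors. */
    static double gustafson_speedup(double s, int p)
    {
        return s + (1.0 - s) * p;
    }

    int main(void)
    {
        printf("s = 0.1, p = 64  ->  scaled speedup = %.1f\n",
               gustafson_speedup(0.1, 64));   /* prints 57.7 */
        return 0;
    }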


Metrics (Contd..)

[Table 5.1: Efficiency as a function of n and p]


Scalability

  • Efficiency decreases with increasing P; increases with increasing N

  • How effectively the parallel algorithm can use an increasing number of processors

  • How the amount of computation performed must scale with P to keep E constant

  • This function, expressing the required computation in terms of P, is called the isoefficiency function.

  • An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable.


Scalability Analysis – Finite Difference algorithm with 1D decomposition

For constant efficiency, N, expressed as a function of P, must grow so that the ratio of communication to computation (and hence E) stays fixed as P increases.

This can be satisfied by choosing N proportional to P, except for small P.

Hence the isoefficiency function is O(P^2), since the computation is O(N^2).


Scalability Analysis – Finite Difference algorithm with 2D decomposition

This can be satisfied by choosing N proportional to sqrt(P).

Hence the isoefficiency function is O(P).

The 2D algorithm is therefore more scalable than the 1D algorithm.
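As a rough sketch of where these isoefficiency functions come from (using the standard per-iteration cost model for an N x N grid; the constants are assumptions, since the slide's relation is not reproduced here): with a 1D decomposition each process computes on N^2/P points and exchanges boundary rows of about N points, so E is roughly N^2 / (N^2 + c*P*N) = 1 / (1 + c*P/N); keeping E constant requires N to grow in proportion to P, and since the total computation is O(N^2) the isoefficiency function is O(P^2). With a 2D decomposition each process holds an (N/sqrt(P)) x (N/sqrt(P)) block and exchanges boundaries of about N/sqrt(P) points, so E is roughly N^2 / (N^2 + c*N*sqrt(P)) = 1 / (1 + c*sqrt(P)/N); constant E now only requires N to grow like sqrt(P), giving an isoefficiency function of O(P).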



Steps

  • Decomposition – Splitting the problem into tasks or modules

  • Mapping – Assigning tasks to processors

  • Mapping’s contradictory objectives:

    • To minimize idle times

    • To reduce communications


Mapping

  • Static mapping

    • Mapping based on Data partitioning

      • Applicable to dense matrix computations

      • Block distribution

      • Block-cyclic distribution

    • Graph partitioning based mapping

      • Applicable for sparse matrix computations

    • Mapping based on task partitioning

[Figure: a block distribution assigns nine consecutive elements to processes as 0 0 0 1 1 1 2 2 2, while a block-cyclic distribution deals them out as 0 1 2 0 1 2 0 1 2]
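A minimal sketch of the two distributions shown above (hypothetical helper functions, assuming n elements, p processes, and a block size b for the cyclic case):

    #include <stdio.h>

    /* Block distribution: elements are split into p contiguous blocks. */
    static int block_owner(int i, int n, int p)
    {
        int block = (n + p - 1) / p;     /* ceil(n / p) elements per process */
        return i / block;
    }

    /* Block-cyclic distribution: blocks of size b are dealt out round-robin. */
    static int block_cyclic_owner(int i, int b, int p)
    {
        return (i / b) % p;
    }

    int main(void)
    {
        /* Nine elements on three processes, as in the figure above. */
        for (int i = 0; i < 9; i++)
            printf("%d: block -> P%d, cyclic (b = 1) -> P%d\n",
                   i, block_owner(i, 9, 3), block_cyclic_owner(i, 1, 3));
        return 0;
    }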


Based on Task Partitioning

  • Based on task dependency graph

  • In general, the mapping problem is NP-complete

[Figure: a binary task-dependency tree mapped onto eight processes; successive levels of tasks are assigned to processes 0; 0 and 4; 0, 2, 4 and 6; and 0 through 7]


Mapping (Contd.)

  • Dynamic Mapping

    • A process/global memory can hold a set of tasks

    • Distribute some tasks to all processes

    • Once a process completes its tasks, it asks the coordinator process for more tasks

    • Referred to as self-scheduling or work-stealing
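A minimal MPI sketch of such self-scheduling (the task contents and the process_task routine are hypothetical placeholders; only the request/assign loop is the point here):

    #include <mpi.h>

    #define NUM_TASKS 100
    #define TAG_WORK  1
    #define TAG_STOP  2

    /* Hypothetical task body: a real program would do the work for task id 'task'. */
    static void process_task(int task) { (void)task; }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                      /* coordinator */
            int next = 0, request, active = size - 1;
            MPI_Status st;
            while (active > 0) {
                /* A worker asks for work (the message content is unused). */
                MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (next < NUM_TASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                              /* worker */
            int task, ready = 0;
            MPI_Status st;
            for (;;) {
                MPI_Send(&ready, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                process_task(task);
            }
        }

        MPI_Finalize();
        return 0;
    }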


Interaction Overheads

  • Despite the best efforts in mapping, there can be interaction overheads

  • These arise from frequent communication, exchanging large volumes of data, interacting with the farthest processors, etc.

  • Some techniques can be used to minimize interactions


Parallel Algorithm Design - Containing Interaction Overheads

  • Maximizing data locality

    • Minimizing volume of data exchange

      • Using higher dimensional mapping

      • Not communicating intermediate results

    • Minimizing frequency of interactions

  • Minimizing contention and hot spots

    • Avoid having all processes use the same communication pattern (and hence contend for the same links or destinations) at the same time


Parallel Algorithm Design - Containing Interaction Overheads (Contd.)

  • Overlapping computations with interactions

    • Split the computations into those that depend on communicated data (type 1) and those that do not (type 2)

    • Initiate the communication needed by type 1; while it is in progress, perform the type-2 computations (see the sketch after this list)

  • Overlapping interactions with interactions

  • Replicating data or computations

    • Balancing the extra computation or storage cost with the gain due to less communication
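A minimal sketch of the overlap idea using MPI non-blocking calls (the buffer sizes, the ring-neighbour exchange, and the two compute routines are assumed placeholders):

    #include <mpi.h>

    #define N 1024

    /* Hypothetical placeholders for the two phases of the computation. */
    static void compute_independent_part(double *local) { (void)local; }
    static void compute_dependent_part(double *local, double *halo)
    { (void)local; (void)halo; }

    int main(int argc, char *argv[])
    {
        double local[N], halo[N];
        int rank, size, left, right;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        left  = (rank - 1 + size) % size;
        right = (rank + 1) % size;
        for (int i = 0; i < N; i++) local[i] = rank;

        /* Start the communication needed by the type-1 (dependent) phase. */
        MPI_Irecv(halo,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(local, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Type-2 work, which does not need the halo, overlaps the transfer. */
        compute_independent_part(local);

        /* Wait for the halo, then do the type-1 work that depends on it. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_dependent_part(local, halo);

        MPI_Finalize();
        return 0;
    }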


Parallel Algorithm Classification

- Types

- Models


Parallel Algorithm Types

  • Divide and conquer

  • Data partitioning / decomposition

  • Pipelining


Divide-and-Conquer

  • Recursive in structure

    • Divide the problem into sub-problems that are similar to the original, smaller in size

    • Conquer the sub-problems by solving them recursively; if they are small enough, solve them in a straightforward manner

    • Combine the solutions to create a solution to the original problem


Divide-and-Conquer Example: Merge Sort

  • Problem: Sort a sequence of n elements

  • Divide the sequence into two subsequences of n/2 elements each

  • Conquer: Sort the two subsequences recursively using merge sort

  • Combine: Merge the two sorted subsequences to produce sorted answer
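A compact sketch of this scheme using OpenMP tasks to run the two recursive conquer steps in parallel (the cutoff of 1000 elements and the scratch buffer are implementation choices assumed here, not taken from the slides):

    #include <stdlib.h>
    #include <string.h>

    /* Combine: merge the sorted halves a[lo..mid) and a[mid..hi) via scratch space. */
    static void merge(int *a, int *tmp, int lo, int mid, int hi)
    {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
    }

    /* Divide, conquer the halves (one half as an OpenMP task), then combine. */
    static void merge_sort(int *a, int *tmp, int lo, int hi)
    {
        if (hi - lo < 2) return;
        int mid = lo + (hi - lo) / 2;
        #pragma omp task if (hi - lo > 1000)
        merge_sort(a, tmp, lo, mid);
        merge_sort(a, tmp, mid, hi);
        #pragma omp taskwait
        merge(a, tmp, lo, mid, hi);
    }

    int main(void)
    {
        enum { N = 100000 };
        int *a = malloc(N * sizeof(int)), *tmp = malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) a[i] = rand();
        #pragma omp parallel
        #pragma omp single
        merge_sort(a, tmp, 0, N);
        free(a); free(tmp);
        return 0;
    }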


Partitioning

  • Breaking up the given problem into p independent subproblems of almost equal sizes

  • Solving the p subproblems concurrently

  • Mostly splitting the input or output into non-overlapping pieces

  • Example: Matrix multiplication

  • Either the inputs (A or B) or output (C) can be partitioned.
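As a sketch of output partitioning, the rows of C can be computed independently; the OpenMP version below parallelizes over those rows (the matrix size and the initialization values are arbitrary assumptions):

    #include <stdlib.h>

    #define N 512

    /* C = A * B, with the rows of the output C partitioned among threads. */
    static void matmul(const double *A, const double *B, double *C)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i * N + k] * B[k * N + j];
                C[i * N + j] = sum;
            }
    }

    int main(void)
    {
        double *A = malloc(N * N * sizeof(double));
        double *B = malloc(N * N * sizeof(double));
        double *C = malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
        matmul(A, B, C);                 /* every entry of C becomes 2.0 * N */
        free(A); free(B); free(C);
        return 0;
    }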


Pipelining

Common in image-processing applications, where a stream of images undergoes a sequence of transformations.


Parallel Algorithm Models

  • Data parallel model

    • Processes perform identical tasks on different data

  • Task parallel model

    • Different processes perform different tasks on same or different data – based on task dependency graph

  • Work pool model

    • Any task can be performed by any process. Tasks are added to a work pool dynamically

  • Pipeline model

    • A stream of data passes through a chain of processes – stream parallelism


Parallel Program Classification

- Models

- Structure

- Paradigms


Parallel Program Models

  • Single Program Multiple Data (SPMD)

  • Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
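A minimal SPMD sketch in MPI: every process runs the same program, and the rank returned by MPI_Comm_rank selects which branch (and which part of the data) each process works on:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Same executable everywhere; behaviour is selected by rank. */
        if (rank == 0)
            printf("Process 0 of %d acts as the coordinator\n", size);
        else
            printf("Process %d of %d acts as a worker\n", rank, size);

        MPI_Finalize();
        return 0;
    }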


Parallel Program Structure Types

  • Master-Worker / parameter sweep / task farming

  • Embarrassingly / pleasingly parallel

  • Pipeline / systolic / wavefront

  • Tightly-coupled

  • Workflow

[Figure: two example arrangements of processes P0–P4]


Programming Paradigms

  • Shared memory model – Threads, OpenMP

  • Message passing model – MPI

  • Data parallel model – HPF

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
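For the shared-memory paradigm, a minimal OpenMP sketch (the summation is only a placeholder for real shared-memory work):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double sum = 0.0;

        /* The team of threads shares 'sum' through shared memory; the
         * reduction clause combines the per-thread partial sums safely. */
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 1; i <= 1000; i++)
            sum += 1.0 / i;

        printf("max threads = %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }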


Parallel Architectures

- Classification

- Cache coherence in shared memory platforms

- Interconnection networks


Classification of Architectures – Flynn’s Classification

  • Single Instruction Single Data (SISD): Serial Computers

  • Single Instruction Multiple Data (SIMD)

    - Vector processors and processor arrays

    - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Flynn’s Classification (Contd.)

  • Multiple Instruction Single Data (MISD): Not popular

  • Multiple Instruction Multiple Data (MIMD)

    - Most popular

    - IBM SP and most other supercomputers, clusters, computational Grids, etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Based on Memory

  • Shared memory – 2 types: UMA and NUMA

    • NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q

[Figure: UMA and NUMA shared-memory organizations]

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Based on Memory (Contd.)

  • Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  • More recently, multi-core processors

  • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids


Cache Coherence

- For details, read Section 2.4.6 of the book

Interconnection Networks

- For details, read Sections 2.4.2–2.4.5 of the book


Cache Coherence in SMPs

  • All processes read variable ‘x’ residing in cache line ‘a’

  • Each process updates ‘x’ at different points of time

[Figure: four CPUs (CPU0–CPU3), each holding a copy of cache line ‘a’ in its private cache (cache0–cache3), with the line also resident in main memory]

  • Challenge: To maintain consistent view of the data

  • Protocols:

  • Write update

  • Write invalidate



Cache Coherence Protocols and Implementations

  • Write update – propagate the updated cache line to the other processors on every write

  • Write invalidate – other copies are invalidated on a write; each processor fetches the updated cache line when it next reads the stale data

  • Which is better??


Caches – False Sharing

  • Different processors update different parts of the same cache line

  • Leads to ping-pong of cache lines between processors

  • The situation is better with update protocols than with invalidate protocols. Why?

[Figure: CPU0 updates elements A0, A2, A4, ... while CPU1 updates A1, A3, A5, ...; the elements lie in shared cache lines (A0–A8, A9–A15) held in cache0 and cache1 and backed by main memory]

  • Modify the algorithm to change the stride

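A small sketch of the effect and of the usual fix (the 64-byte padding assumes a typical cache-line size; the counters and the loop are hypothetical):

    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS 4
    #define ITERS    10000000L

    /* Unpadded, the four counters would share one cache line and that line
     * would ping-pong between the CPUs; padding each counter out to an
     * assumed 64-byte line removes the false sharing. */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];
    };

    int main(void)
    {
        struct padded_counter counters[NTHREADS] = {{0}};

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                counters[id].value++;      /* each thread stays on its own line */
        }

        long total = 0;
        for (int t = 0; t < NTHREADS; t++) total += counters[t].value;
        printf("total = %ld\n", total);
        return 0;
    }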


Cache Coherence Using Invalidate Protocols

  • 3 states associated with data items

    • Shared – a variable shared by 2 caches

    • Invalid – another processor (say P0) has updated the data item

    • Dirty – state of the data item in P0

  • Implementations

    • Snoopy

      • for bus based architectures

      • Memory operations are propagated over the bus and snooped

      • Instead of broadcasting memory operations to all processors, propagate coherence operations to relevant processors

    • Directory-based

      • A central directory maintains states of cache blocks, associated processors

      • Implemented with presence bits


Interconnection Networks

  • An interconnection network is defined by switches, links, and interfaces

    • Switches – provide mapping between input and output ports, buffering, routing, etc.

    • Interfaces – connect the nodes to the network

  • Network topologies

    • Static – point-to-point communication links among processing nodes

    • Dynamic – Communication links are formed dynamically by switches


Interconnection Networks (Contd.)

  • Static

    • Bus – SGI challenge

    • Completely connected

    • Star

    • Linear array, Ring (1-D torus)

    • Mesh – Intel ASCI Red (2-D) , Cray T3E (3-D), 2DTorus

    • k-d mesh: d dimensions with k nodes in each dimension

    • Hypercubes – a (log p)-dimensional mesh with 2 nodes in each dimension – e.g., many MIMD machines

    • Trees – our campus network

  • Dynamic – Communication links are formed dynamically by switches

    • Crossbar – Cray X series – non-blocking network

    • Multistage – SP2 – blocking network.

  • For more details, and evaluation of topologies, refer to book


Evaluating Interconnection Topologies

  • Diameter – maximum distance between any two processing nodes

    • Fully connected – 1

    • Star – 2

    • Ring – p/2

    • Hypercube – log p

  • Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks

    • Linear array – 1

    • Ring – 2

    • 2-D mesh – 2

    • 2-D mesh with wraparound – 4

    • d-dimensional hypercube – d


Evaluating Interconnection Topologies (Contd.)

  • Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves

    • Ring – 2

    • P-node 2-D mesh – sqrt(P)

    • Tree – 1

    • Star – 1

    • Completely connected – P^2/4

    • Hypercube – P/2


Evaluating Interconnection Topologies (Contd.)

  • channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes

  • channel rate – performance of a single physical wire

  • channel bandwidth – channel rate times channel width

  • bisection bandwidth – maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth
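A small worked example (with assumed numbers, not taken from the slides): a 64-node hypercube has bisection width P/2 = 32 links. If each link has a channel width of 16 bits and a channel rate of 250 MHz, the channel bandwidth is 16 x 250 Mbit/s = 4 Gbit/s (500 MB/s) per link, and the bisection bandwidth is 32 x 500 MB/s = 16 GB/s.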

