

Parallel Programming

Sathish S. Vadhiyar

Course Web Page:

http://www.serc.iisc.ernet.in/~vss/courses/PPP2009


Motivation for Parallel Programming

  • Faster execution time, by exploiting non-dependencies between regions of code

  • Presents a level of modularity

  • Resource constraints (e.g., large databases)

  • Certain classes of algorithms lend themselves naturally to parallelism

  • Aggregate bandwidth to memory/disk; increase in data throughput

    • Clock rate improvement in the past decade – 40%

    • Memory access time improvement in the past decade – 10%

  • Grand challenge problems (more later)


Challenges / Problems in Parallel Algorithms

  • Building efficient algorithms.

  • Avoiding

    • Communication delay

    • Idling

    • Synchronization


Challenges

[Figure: execution timeline of two processes P0 and P1, showing computation, communication, synchronization, and idle time]


How do we evaluate a parallel program?

  • Execution time, Tp

  • Speedup, S

    • S(p, n) = T(1, n) / T(p, n)

    • Usually, S(p, n) < p

    • Sometimes S(p, n) > p (superlinear speedup)

  • Efficiency, E

    • E(p, n) = S(p, n)/p

    • Usually, E(p, n) < 1

    • Sometimes, greater than 1

  • Scalability – how the limitations of parallel computing relate to n and p
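A quick worked example of these metrics (illustrative numbers, not from the slides): if a program takes T(1, n) = 100 s on one processor and T(8, n) = 16 s on eight processors, then S(8, n) = 100/16 = 6.25 and E(8, n) = 6.25/8 ≈ 0.78.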


Speedups and efficiency

[Figure: speedup S and efficiency E plotted against p, comparing ideal and practical curves]


Limitations on speedup – Amdahl’s law

  • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

  • Expresses overall speedup in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhanced portion

  • Places a limit on the speedup due to parallelism.

  • Speedup = 1 / (fs + fp/P), where fs is the serial fraction, fp = 1 - fs is the parallelizable fraction, and P is the number of processors
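A quick worked example (illustrative numbers, not from the slides): with a serial fraction fs = 0.1 and P = 16 processors, Speedup = 1 / (0.1 + 0.9/16) ≈ 6.4; even as P grows without bound, the speedup can never exceed 1/fs = 10.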


Amdahl’s law Illustration

S = 1 / (s + (1-s)/p)

Courtesy:

http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html

http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm


Amdahl’s law analysis

  • For a fixed serial fraction, the achievable speedup falls further and further below the number of processors as the processor count grows.

  • Thus Amdahl’s law is a bit depressing for parallel programming.

  • In practice, the number of parallel portions of work has to be large enough to match a given number of processors.


Gustafson’s Law

  • Amdahl’s law – keep the parallel work fixed

  • Gustafson’s law – keep computation time on parallel processors fixed, change the problem size (fraction of parallel/sequential work) to match the computation time

  • For a particular number of processors, find the problem size for which parallel time is equal to the constant time

  • For that problem size, find the sequential time and the corresponding speedup

  • The resulting speedup is therefore called scaled speedup
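In its usual algebraic form (not shown on the slides), Gustafson's scaled speedup for a serial fraction s of the parallel run time is S_scaled = s + p(1 - s). A quick worked example with illustrative numbers: for s = 0.1 and p = 16, S_scaled = 0.1 + 16 × 0.9 = 14.5, far more optimistic than the Amdahl figure above because the parallel portion of the problem is allowed to grow with p.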


Metrics (Contd..)

Table 5.1: Efficiency as a function of n and p.


Scalability

  • Efficiency decreases with increasing P; increases with increasing N

  • How effectively the parallel algorithm can use an increasing number of processors

  • How the amount of computation performed must scale with P to keep E constant

  • This function, which expresses the required computation as a function of P, is called the isoefficiency function.

  • An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable


Scalability Analysis – Finite Difference algorithm with 1D decomposition

For constant efficiency E, the function of P that is substituted for N must keep the efficiency expression constant as P increases.

This can be satisfied with N = P (except for small P).

Hence the isoefficiency function is O(P²), since the computation is O(N²).


Scalability Analysis – Finite Difference algorithm with 2D decomposition

This can be satisfied with N = √P.

Hence the isoefficiency function is O(P).

The 2-D algorithm is therefore more scalable than the 1-D algorithm.
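A rough sketch of where these isoefficiency results come from, assuming the usual cost model for an N × N grid (per-iteration computation of roughly N²/P per process plus a boundary exchange): with a 1-D decomposition each process holds N/P rows and exchanges boundary rows of N points each, so keeping the efficiency expression constant forces N to grow in proportion to P, and the total computation N² therefore grows as O(P²). With a 2-D decomposition each process holds an (N/√P) × (N/√P) block and exchanges boundaries of only N/√P points, so N only needs to grow as √P and the total computation grows as O(P).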


Parallel Algorithm Design


Steps

  • Decomposition – Splitting the problem into tasks or modules

  • Mapping – Assigning tasks to processors

  • Mapping's conflicting objectives

    • To minimize idle times

    • To reduce communications


Mapping

  • Static mapping

    • Mapping based on Data partitioning

      • Applicable to dense matrix computations

      • Block distribution

      • Block-cyclic distribution (both illustrated in the figure and the sketch below)

    • Graph partitioning based mapping

      • Applicable for sparse matrix computations

    • Mapping based on task partitioning

[Figure: distributing nine array elements over three processes. A block distribution assigns contiguous chunks (owners 0 0 0 1 1 1 2 2 2); a cyclic distribution assigns elements round-robin (owners 0 1 2 0 1 2 0 1 2)]
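As a minimal sketch of the two distributions above (hypothetical helper functions, not from the slides), the owner of an array element can be computed as follows; with block size b = 1 the block-cyclic mapping reduces to the cyclic pattern in the figure.

    /* Owner of element i when n elements are block-distributed over p processes
       (the first n % p processes receive one extra element). */
    int block_owner(int i, int n, int p) {
        int base = n / p, extra = n % p;
        if (i < extra * (base + 1))
            return i / (base + 1);
        return extra + (i - extra * (base + 1)) / base;
    }

    /* Owner of element i under a block-cyclic distribution with block size b:
       blocks are dealt out round-robin, so block k goes to process k % p. */
    int block_cyclic_owner(int i, int b, int p) {
        return (i / b) % p;
    }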


Based on Task Partitioning

  • Based on task dependency graph

  • In general, the mapping problem is NP-complete

[Figure: a task dependency tree mapped onto 8 processes; the root task runs on process 0, the second level on processes 0 and 4, the third level on processes 0, 2, 4 and 6, and the 8 leaf tasks on processes 0-7]


Mapping

  • Dynamic Mapping

    • A process/global memory can hold a set of tasks

    • Distribute some tasks to all processes

    • Once a process completes its tasks, it asks the coordinator process for more tasks

    • Referred to as self-scheduling, work-stealing
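A minimal self-scheduling sketch (illustrative code, not from the slides): worker threads repeatedly take the next unprocessed task index from a shared pool, so faster workers automatically end up doing more tasks.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_TASKS   100
    #define NUM_WORKERS 4

    static int next_task = 0;                        /* the shared task pool: just an index */
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    static void do_task(int t) { printf("task %d\n", t); }   /* placeholder for real work */

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&pool_lock);          /* ask the pool for another task */
            int t = (next_task < NUM_TASKS) ? next_task++ : -1;
            pthread_mutex_unlock(&pool_lock);
            if (t < 0) break;                        /* pool exhausted */
            do_task(t);
        }
        return NULL;
    }

    int main(void) {
        pthread_t w[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++) pthread_create(&w[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_WORKERS; i++) pthread_join(w[i], NULL);
        return 0;
    }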


Interaction Overheads

  • In spite of the best efforts in mapping, there can be interaction overheads

  • These arise from frequent communication, exchange of large volumes of data, interaction with distant processors, etc.

  • Some techniques can be used to minimize interactions


Parallel Algorithm Design - Containing Interaction Overheads

  • Maximizing data locality

    • Minimizing volume of data exchange

      • Using higher dimensional mapping

      • Not communicating intermediate results

    • Minimizing frequency of interactions

  • Minimizing contention and hot spots

    • Do not make every process use the same communication pattern with the other processes; stagger or decentralize the pattern so that no single process or link becomes a hot spot


Parallel Algorithm Design - Containing Interaction Overheads

  • Overlapping computations with interactions

    • Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)

    • Initiate the communication needed for type 1; while it is in progress, perform type 2 (see the sketch after this list)

  • Overlapping interactions with interactions

  • Replicating data or computations

    • Balancing the extra computation or storage cost with the gain due to less communication
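A minimal sketch of the overlap idea using non-blocking MPI calls (illustrative function and buffer names; error handling omitted): the communication needed by the type-1 work is posted first, the type-2 work runs while the messages are in flight, and the type-1 work runs after the wait.

    #include <mpi.h>

    static void compute_interior(void)         { /* type-2 work: needs no remote data    */ }
    static void compute_boundary(double *halo) { (void)halo; /* type-1 work: uses the halo */ }

    /* Exchange a halo with one neighbour while computing on the interior. */
    static void exchange_and_compute(double *send_halo, double *recv_halo, int n,
                                     int neighbour, MPI_Comm comm) {
        MPI_Request reqs[2];
        MPI_Irecv(recv_halo, n, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
        MPI_Isend(send_halo, n, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);
        compute_interior();                      /* overlap: work while messages are in flight */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(recv_halo);             /* now the communicated data is available */
    }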


Parallel Algorithm Classification

– Types

- Models


Parallel Algorithm Types

  • Divide and conquer

  • Data partitioning / decomposition

  • Pipelining


Divide-and-Conquer

  • Recursive in structure

    • Divide the problem into sub-problems that are similar to the original, smaller in size

    • Conquer the sub-problems by solving them recursively. If they are small enough, solve them in a straightforward manner

    • Combine the solutions to create a solution to the original problem


Divide-and-Conquer Example: Merge Sort

  • Problem: Sort a sequence of n elements

  • Divide the sequence into two subsequences of n/2 elements each

  • Conquer: Sort the two subsequences recursively using merge sort

  • Combine: Merge the two sorted subsequences to produce sorted answer
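A compact sketch of how the divide step maps onto parallel work, here using OpenMP tasks (one possible realisation, not prescribed by the slides): each half is sorted by a separate task, and the merge waits for both.

    #include <stdlib.h>
    #include <string.h>

    /* Combine: merge the sorted halves a[lo..mid) and a[mid..hi) via scratch space. */
    static void merge(int *a, int *tmp, int lo, int mid, int hi) {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
    }

    /* Divide: spawn a task for one half, recurse on the other, then combine. */
    static void merge_sort(int *a, int *tmp, int lo, int hi) {
        if (hi - lo < 2) return;                   /* small enough: already sorted */
        int mid = lo + (hi - lo) / 2;
        #pragma omp task shared(a, tmp) if (hi - lo > 1000)
        merge_sort(a, tmp, lo, mid);
        merge_sort(a, tmp, mid, hi);
        #pragma omp taskwait                       /* both halves must be sorted first */
        merge(a, tmp, lo, mid, hi);
    }

    int main(void) {
        enum { N = 100000 };
        int *a = malloc(N * sizeof *a), *tmp = malloc(N * sizeof *tmp);
        for (int i = 0; i < N; i++) a[i] = rand();
        #pragma omp parallel
        #pragma omp single                         /* one thread starts the recursion */
        merge_sort(a, tmp, 0, N);
        free(a); free(tmp);
        return 0;
    }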


Partitioning

  • Breaking up the given problem into p independent subproblems of almost equal sizes

  • Solving the p subproblems concurrently

  • Mostly splitting the input or output into non-overlapping pieces

  • Example: Matrix multiplication

  • Either the inputs (A or B) or output (C) can be partitioned.
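As a minimal illustration of output partitioning (an illustrative sketch, not from the slides): the rows of C are split across threads with OpenMP, so each thread computes a disjoint block of the output.

    /* C = A * B for n x n matrices stored row-major. The rows of the output C are
       partitioned across threads; each thread needs all of B but writes only its rows. */
    void matmul_row_partitioned(const double *A, const double *B, double *C, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)              /* row i of C is owned by one thread */
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }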


Pipelining

Occurs, for example, in image processing applications where a number of images undergo a sequence of transformations.


Parallel Algorithm Models

  • Data parallel model

    • Processes perform identical tasks on different data

  • Task parallel model

    • Different processes perform different tasks on same or different data – based on task dependency graph

  • Work pool model

    • Any task can be performed by any process. Tasks are added to a work pool dynamically

  • Pipeline model

    • A stream of data passes through a chain of processes – stream parallelism


Parallel Program Classification

- Models

- Structure

- Paradigms


Parallel Program Models

  • Single Program Multiple Data (SPMD)

  • Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
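A minimal SPMD sketch in C with MPI (illustrative only): every process runs the same program, and the rank returned by MPI decides what each process does.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes?  */

        if (rank == 0)
            printf("coordinator: %d processes in total\n", size);
        else
            printf("worker %d: doing my share of the work\n", rank);

        MPI_Finalize();
        return 0;
    }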


Parallel Program Structure Types

  • Master-Worker / parameter sweep / task farming

  • Embarrassingly/pleasingly parallel

  • Pipeline / systolic / wavefront

  • Tightly-coupled

  • Workflow

[Figures: two arrangements of processes P0-P4 illustrating the structure types listed above]


Programming Paradigms

  • Shared memory model – Threads, OpenMP

  • Message passing model – MPI

  • Data parallel model – HPF

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Parallel Architectures Classification

- Classification

- Cache coherence in shared memory platforms

- Interconnection networks


Classification of Architectures – Flynn’s classification

  • Single Instruction Single Data (SISD): Serial Computers

  • Single Instruction Multiple Data (SIMD)

    - Vector processors and processor arrays

    - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Flynn’s classification

  • Multiple Instruction Single Data (MISD): Not popular

  • Multiple Instruction Multiple Data (MIMD)

    - Most popular

    - IBM SP and most other supercomputers, clusters, computational Grids, etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Based on Memory

  • Shared memory

  • 2 types – UMA and NUMA

[Figures: UMA and NUMA shared-memory organizations. NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q]

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/


Classification of Architectures – Based on Memory

  • Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  • More recently: multi-core processors

  • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids


Cache Coherence

- for details, read Section 2.4.6 of the book

Interconnection networks

- for details, read Sections 2.4.2-2.4.5 of the book


Cache Coherence in SMPs

  • All processes read variable ‘x’ residing in cache line ‘a’

  • Each process updates ‘x’ at different points of time

[Figure: four CPUs (CPU0-CPU3), each with a private cache (cache0-cache3) holding a copy of cache line 'a', all connected to main memory]

  • Challenge: To maintain consistent view of the data

  • Protocols:

  • Write update

  • Write invalidate



Cache Coherence Protocols and Implementations

  • Write update – propagate the updated cache line to the other processors on every write

  • Write invalidate – a write invalidates all other cached copies; a processor fetches the updated cache line when it next reads the stale data

  • Which is better?


Caches – False sharing

  • Different processors update different parts of the same cache line

  • Leads to ping-pong of cache lines between processors

  • Situation better in update protocols than invalidate protocols. Why?

[Figure: CPU0 updates elements A0, A2, A4, … and CPU1 updates A1, A3, A5, …; in main memory the elements fall into shared cache lines (A0-A8 and A9-A15), so the two caches keep invalidating each other's copies]

  • Remedy: modify the algorithm to change the stride so that the processors update different cache lines (see the sketch below)

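A minimal sketch of the problem and of one remedy (illustrative code, not from the slides): two threads accumulate into adjacent array slots that share a cache line; padding each thread's slot onto its own cache line stops the ping-pong.

    #include <omp.h>

    #define CACHE_LINE 64                        /* assumed cache-line size in bytes */

    /* False sharing: sums[0] and sums[1] sit in the same cache line, so every update
       by one thread invalidates the other thread's cached copy of that line. */
    double sum_false_sharing(const double *a, int n) {
        double sums[2] = {0.0, 0.0};
        #pragma omp parallel num_threads(2)
        {
            int tid = omp_get_thread_num();
            for (int i = tid; i < n; i += 2)     /* thread 0: even i, thread 1: odd i */
                sums[tid] += a[i];
        }
        return sums[0] + sums[1];
    }

    /* Remedy: pad each thread's slot to a full cache line so updates do not collide. */
    double sum_padded(const double *a, int n) {
        struct { double v; char pad[CACHE_LINE - sizeof(double)]; } sums[2] = {{0}};
        #pragma omp parallel num_threads(2)
        {
            int tid = omp_get_thread_num();
            for (int i = tid; i < n; i += 2)
                sums[tid].v += a[i];
        }
        return sums[0].v + sums[1].v;
    }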


Cache Coherence Using Invalidate Protocols

  • 3 states associated with data items (summarized in the table below)

    • Shared – the data item is shared (cached) by two or more caches

    • Invalid – another processor (say P0) has updated the data item, so this copy is stale

    • Dirty – the state of the updated data item in P0, not yet written back to memory

  • Implementations

    • Snoopy

      • for bus based architectures

      • Memory operations are propagated over the bus and snooped

    • Directory-based

      • Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors

      • A central directory maintains states of cache blocks, associated processors

      • Implemented with presence bits
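As a rough summary of how the three states interact under a write-invalidate protocol (a simplified view; exact transitions vary by implementation):

  Event at a processor           Coherence action                              New state of the line
  Read hit                       none                                          unchanged
  Read miss                      fetch the line (a Dirty copy in another       Shared
                                 cache is written back first)
  Write to a Shared line         invalidate the other cached copies            Dirty
  Write miss / write to Invalid  fetch the line, invalidate the other copies   Dirty
  Invalidation received          discard the local copy                        Invalid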


Interconnection Networks

  • An interconnection network is defined by its switches, links and interfaces

    • Switches – provide mapping between input and output ports, buffering, routing etc.

    • Interfaces – connect nodes to the network

  • Network topologies

    • Static – point-to-point communication links among processing nodes

    • Dynamic – Communication links are formed dynamically by switches


Interconnection Networks

  • Static

    • Bus – SGI Challenge

    • Completely connected

    • Star

    • Linear array, Ring (1-D torus)

    • Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus

    • k-d mesh: d dimensions with k nodes in each dimension

    • Hypercubes – a k-d mesh with k = 2 and d = log p – e.g. many MIMD machines

    • Trees – our campus network

  • Dynamic – Communication links are formed dynamically by switches

    • Crossbar – Cray X series – non-blocking network

    • Multistage – SP2 – blocking network.

  • For more details, and evaluation of topologies, refer to book


Evaluating Interconnection topologies

  • Diameter – maximum distance between any two processing nodes

    • Fully connected – 1

    • Star – 2

    • Ring – p/2

    • Hypercube – log p

  • Connectivity – multiplicity of paths between two nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks

    • Linear array – 1

    • Ring – 2

    • 2-D mesh – 2

    • 2-D mesh with wraparound – 4

    • d-dimensional hypercube – d


Evaluating Interconnection topologies

  • bisection width – minimum number of links to be removed from network to partition it into 2 equal halves

    • Ring – 2

    • P-node 2-D mesh – √P

    • Tree – 1

    • Star – 1

    • Completely connected – P²/4

    • Hypercube – P/2


Evaluating Interconnection topologies

  • channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes

  • channel rate – performance of a single physical wire

  • channel bandwidth – channel rate times channel width

  • bisection bandwidth – maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth
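A quick worked example (illustrative numbers, not from the slides): in a 64-node hypercube the bisection width is 64/2 = 32 links; if each link has a channel width of 32 bits and a channel rate of 1 GHz, the channel bandwidth is 4 GB/s per link and the bisection bandwidth is 32 × 4 = 128 GB/s.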

