Parallel Programming

Sathish S. Vadhiyar

Course Web Page:

http://www.serc.iisc.ernet.in/~vss/courses/PPP2009

Motivation for Parallel Programming
  • Faster execution time due to non-dependencies between regions of code
  • Presents a level of modularity
  • Resource constraints, e.g., large databases
  • Certain classes of algorithms lend themselves naturally to parallelization
  • Aggregate bandwidth to memory/disk; increase in data throughput
    • Clock rate improvement in the past decade – 40%
    • Memory access time improvement in the past decade – 10%
  • Grand challenge problems (more later)
Challenges / Problems in Parallel Algorithms
  • Building efficient algorithms.
  • Avoiding
    • Communication delay
    • Idling
    • Synchronization
Challenges

[Figure: execution timeline of processes P0 and P1, showing computation, communication, idle time, and synchronization]

How do we evaluate a parallel program?
  • Execution time, Tp
  • Speedup, S
    • S(p, n) = T(1, n) / T(p, n)
    • Usually, S(p, n) < p
    • Sometimes S(p, n) > p (superlinear speedup)
  • Efficiency, E
    • E(p, n) = S(p, n)/p
    • Usually, E(p, n) < 1
    • Sometimes, greater than 1
  • Scalability – Limitations in parallel computing, relation to n and p.
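
To make these definitions concrete, here is a minimal C sketch that computes S(p, n) and E(p, n) from a serial and a parallel execution time; the timing values and processor count are made-up illustrations, not measurements from this course.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical measured times (seconds) for the same problem size n */
    double t1 = 100.0;   /* T(1, n): serial execution time */
    double tp = 16.0;    /* T(p, n): time on p processors  */
    int    p  = 8;

    double speedup    = t1 / tp;        /* S(p, n) = T(1, n) / T(p, n) */
    double efficiency = speedup / p;    /* E(p, n) = S(p, n) / p       */

    printf("S = %.2f, E = %.2f\n", speedup, efficiency);  /* S = 6.25, E = 0.78 */
    return 0;
}
```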
Speedups and efficiency

[Figure: ideal vs. practical speedup (S) and efficiency (E) curves as functions of the number of processors p]

Limitations on speedup – Amdahl’s law
  • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
  • Overall speedup is expressed in terms of the fractions of computation time spent with and without the enhancement, and the speedup of the enhanced portion.
  • Places a limit on the speedup due to parallelism.
  • Speedup = 1 / (fs + (fp/P)), where fs is the serial fraction, fp = 1 − fs is the parallel fraction, and P is the number of processors
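
A small C sketch of this formula (the serial fraction fs = 0.05 is an arbitrary illustration): no matter how large P grows, the speedup saturates near 1/fs.

```c
#include <stdio.h>

int main(void) {
    double fs = 0.05;        /* serial fraction (illustrative value) */
    double fp = 1.0 - fs;    /* parallelizable fraction              */

    for (int P = 1; P <= 1024; P *= 4) {
        double speedup = 1.0 / (fs + fp / P);   /* Amdahl's law */
        printf("P = %4d  speedup = %6.2f\n", P, speedup);
    }
    /* As P grows, the speedup saturates near 1/fs = 20. */
    return 0;
}
```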

Amdahl’s law Illustration

S = 1 / (s + (1-s)/p)

Courtesy:

http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html

http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

Amdahl’s law analysis
  • For a fixed serial fraction, the achievable speedup falls further and further below the number of processors as the processor count grows.
  • Thus Amdahl’s law is a bit depressing for parallel programming.
  • In practice, the number of parallel portions of work has to be large enough to match a given number of processors.
Gustafson’s Law
  • Amdahl’s law – keep the parallel work fixed
  • Gustafson’s law – keep computation time on parallel processors fixed, change the problem size (fraction of parallel/sequential work) to match the computation time
  • For a particular number of processors, find the problem size for which parallel time is equal to the constant time
  • For that problem size, find the sequential time and the corresponding speedup
  • Thus speedup is scaled or scaled speedup
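
A C sketch of scaled speedup, assuming s denotes the serial fraction of the time measured on the parallel machine (the 5% value is illustrative):

```c
#include <stdio.h>

int main(void) {
    double s = 0.05;   /* serial fraction of the parallel run (illustrative) */

    for (int P = 1; P <= 1024; P *= 4) {
        /* Gustafson's scaled speedup: the parallel work is scaled up with P */
        double scaled = s + (1.0 - s) * P;
        printf("P = %4d  scaled speedup = %8.2f\n", P, scaled);
    }
    /* Unlike Amdahl's fixed-size speedup, this grows nearly linearly in P. */
    return 0;
}
```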
Metrics (contd.)

Table 5.1: Efficiency as a function of n and p.

Scalability
  • Efficiency decreases with increasing P; increases with increasing N
  • How effectively the parallel algorithm can use an increasing number of processors
  • How the amount of computation performed must scale with P to keep E constant
  • This function of computation in terms of P is called isoefficiency function.
  • An algorithm with an isoefficiency function of O(P) is highly scalable while an algorithm with quadratic or exponential isoefficiency function is poorly scalable
Scalability Analysis – Finite Difference algorithm with 1D decomposition

For constant efficiency E, the function of P that is substituted for N must satisfy the efficiency relation as P increases.

Can be satisfied with N = P, except for small P.

Hence the isoefficiency function is O(P²), since the computation is O(N²)

Scalability Analysis – Finite Difference algorithm with 2D decomposition

Can be satisfied with N = √P

Hence isoefficiency function = O(P)

2D algorithm is more scalable than 1D
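
A rough C sketch of why the 2D decomposition wins, under illustrative assumptions (grid size N, ignoring per-message latency and constant factors): it compares the per-process communication-to-computation ratio of the two decompositions.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double N = 4096.0;   /* N x N finite-difference grid (illustrative size) */

    for (int P = 4; P <= 4096; P *= 4) {
        double work   = N * N / P;                    /* grid points updated per process      */
        double comm1d = 2.0 * N;                      /* 1D strips: two boundary rows of N    */
        double comm2d = 4.0 * (N / sqrt((double)P));  /* 2D blocks: four sides of N/sqrt(P)   */
        printf("P=%5d  comm/comp 1D=%.4f  2D=%.4f\n",
               P, comm1d / work, comm2d / work);
    }
    /* The 1D ratio grows linearly with P, the 2D ratio only as sqrt(P), which is
       why the 2D decomposition's isoefficiency is O(P) rather than O(P^2). */
    return 0;
}
```

(Compile with something like cc -O2 file.c -lm on a typical system.)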

Steps
  • Decomposition – Splitting the problem into tasks or modules
  • Mapping – Assigning tasks to processors
  • Mapping’s contradictory objectives
    • To minimize idle times
    • To reduce communications
Mapping
  • Static mapping
    • Mapping based on Data partitioning
      • Applicable to dense matrix computations
      • Block distribution
      • Block-cyclic distribution
    • Graph partitioning based mapping
      • Applicable for sparse matrix computations
    • Mapping based on task partitioning

[Figure: block and block-cyclic distributions of matrix rows/columns over processes 0, 1, and 2]
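
A C sketch of the block and block-cyclic distributions shown above; the helper names block_owner and block_cyclic_owner, the array size, and the block size are illustrative choices, not part of any library.

```c
#include <stdio.h>

/* Block distribution: contiguous chunks of roughly n/p elements per process. */
static int block_owner(int i, int n, int p) {
    int chunk = (n + p - 1) / p;      /* ceiling of n/p */
    return i / chunk;
}

/* Block-cyclic distribution: blocks of size b dealt out round-robin. */
static int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}

int main(void) {
    int n = 18, p = 3, b = 2;
    printf("index: block owner / block-cyclic owner (block size %d)\n", b);
    for (int i = 0; i < n; i++)
        printf("%2d: %d / %d\n", i, block_owner(i, n, p), block_cyclic_owner(i, b, p));
    return 0;
}
```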

Based on Task Partitioning
  • Based on task dependency graph
  • In general, the problem is NP-complete

[Figure: task dependency graph over tasks 0–7 and its mapping to processes]

Mapping
  • Dynamic Mapping
    • A process/global memory can hold a set of tasks
    • Distribute some tasks to all processes
    • Once a process completes its tasks, it asks the coordinator process for more tasks
    • Referred to as self-scheduling, work-stealing
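
A C/MPI sketch of this self-scheduling scheme, with rank 0 acting as the coordinator; the number of tasks, the message tags, and the dummy work of squaring a task index are illustrative assumptions.

```c
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* coordinator: hand out tasks on demand */
        int next = 0, done = 0, msg;
        MPI_Status st;
        while (done < size - 1) {
            /* A worker announces it is idle by sending its last result (or -1). */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                               /* worker: ask, compute, repeat */
        int task, result = -1;
        MPI_Status st;
        while (1) {
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;          /* stand-in for real work */
        }
    }
    MPI_Finalize();
    return 0;
}
```

With a typical MPI installation this would be built with mpicc and launched with mpirun on a few processes.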
Interaction Overheads
  • In spite of the best efforts in mapping, there can be interaction overheads
  • Due to frequent communications, exchange of large volumes of data, interaction with the farthest processors, etc.
  • Some techniques can be used to minimize interactions
Parallel Algorithm Design - Containing Interaction Overheads
  • Maximizing data locality
    • Minimizing volume of data exchange
      • Using higher dimensional mapping
      • Not communicating intermediate results
    • Minimizing frequency of interactions
  • Minimizing contention and hot spots
    • Avoid having every process use the same communication pattern with the other processes at the same time
Parallel Algorithm Design - Containing Interaction Overheads
  • Overlapping computations with interactions
    • Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
    • Initiate communication for type 1; During communication, perform type 2
  • Overlapping interactions with interactions
  • Replicating data or computations
    • Balancing the extra computation or storage cost with the gain due to less communication
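
A C/MPI sketch of overlapping communication with computation using non-blocking MPI_Isend/MPI_Irecv; the ring of neighbours, buffer sizes, and the split into "interior" and "boundary" work are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, size;
    double halo_in[N], halo_out[N], interior[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* illustrative ring of neighbours */
    int left  = (rank - 1 + size) % size;

    for (int i = 0; i < N; i++) { interior[i] = rank; halo_out[i] = rank; }

    /* Phase 1: initiate the communication needed by the type-1 (dependent) computation. */
    MPI_Irecv(halo_in,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Phase 2: type-2 computation that does not depend on the incoming halo. */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += interior[i] * 0.5;

    /* Phase 3: wait for the halo, then do the computation that needs it. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double boundary = 0.0;
    for (int i = 0; i < N; i++) boundary += halo_in[i] * 0.5;

    printf("rank %d: interior %.1f boundary %.1f\n", rank, local, boundary);
    MPI_Finalize();
    return 0;
}
```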
Parallel Algorithm Types
  • Divide and conquer
  • Data partitioning / decomposition
  • Pipelining
Divide-and-Conquer
  • Recursive in structure
    • Divide the problem into sub-problems that are similar to the original, smaller in size
    • Conquer the sub-problems by solving them recursively; if small enough, solve them in a straightforward manner
    • Combine the solutions to create a solution to the original problem
Divide-and-Conquer Example: Merge Sort
  • Problem: Sort a sequence of n elements
  • Divide the sequence into two subsequences of n/2 elements each
  • Conquer: Sort the two subsequences recursively using merge sort
  • Combine: Merge the two sorted subsequences to produce sorted answer
Partitioning
  • Breaking up the given problem into p independent subproblems of almost equal sizes
  • Solving the p subproblems concurrently
  • Mostly splitting the input or output into non-overlapping pieces
  • Example: Matrix multiplication
  • Either the inputs (A or B) or output (C) can be partitioned.
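
A C/OpenMP sketch of output (C) partitioning: each thread owns a disjoint set of rows of C, so no synchronization is needed inside the loop nest. The matrix size and the choice of OpenMP rather than MPI here are illustrative.

```c
#include <stdio.h>
#include <omp.h>

#define N 512

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0; }

    /* The rows of the output C are partitioned among threads: each C[i][j]
       is written by exactly one thread. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}
```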
Pipelining

Occurs in image processing applications where a number of images undergo a sequence of transformations.

Parallel Algorithm Models
  • Data parallel model
    • Processes perform identical tasks on different data
  • Task parallel model
    • Different processes perform different tasks on same or different data – based on task dependency graph
  • Work pool model
    • Any task can be performed by any process. Tasks are added to a work pool dynamically
  • Pipeline model
    • A stream of data passes through a chain of processes – stream parallelism
Parallel Program Classification

- Models

- Structure

- Paradigms

Parallel Program Models
  • Single Program Multiple Data (SPMD)
  • Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
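
A minimal C/MPI SPMD sketch: every process runs the same program and specializes its behaviour by branching on its rank; the message contents are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* same binary on every process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Rank 0 takes the "master" branch of the single program. */
        for (int dest = 1; dest < size; dest++) {
            value = 100 + dest;
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
        printf("rank 0 of %d sent data to %d workers\n", size, size - 1);
    } else {
        /* All other ranks take the "worker" branch. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d\n", rank, value);
    }
    MPI_Finalize();
    return 0;
}
```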

Parallel Program Structure Types
  • Master-Worker / parameter sweep / task farming
  • Embarrassingly/pleasingly parallel
  • Pipeline / systolic / wavefront
  • Tightly-coupled
  • Workflow

[Figures: example structures with processes P0–P4, such as master-worker and pipeline arrangements]

Programming Paradigms
  • Shared memory model – Threads, OpenMP
  • Message passing model – MPI
  • Data parallel model – HPF

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Parallel Architectures Classification

- Classification

- Cache coherence in shared memory platforms

- Interconnection networks

Classification of Architectures – Flynn’s classification
  • Single Instruction Single Data (SISD): Serial Computers
  • Single Instruction Multiple Data (SIMD)

- Vector processors and processor arrays

- Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Flynn’s classification
  • Multiple Instruction Single Data (MISD): Not popular
  • Multiple Instruction Multiple Data (MIMD)

- Most popular

- IBM SP and most other supercomputers,

clusters, computational Grids etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Based on Memory
  • Shared memory
  • 2 types – UMA and NUMA

[Figures: UMA and NUMA shared-memory organizations]

NUMA examples: HP Exemplar, SGI Origin, Sequent NUMA-Q

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Based on Memory
  • Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  • More recently, multi-cores
  • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Cache Coherence

- for details, read Section 2.4.6 of the book

Interconnection networks

- for details, read Sections 2.4.2–2.4.5 of the book

Cache Coherence in SMPs
  • All processes read variable ‘x’ residing in cache line ‘a’
  • Each process updates ‘x’ at different points of time

[Figure: CPU0–CPU3, each with a private cache (cache0–cache3) holding a copy of cache line 'a' from main memory]

  • Challenge: To maintain consistent view of the data
  • Protocols:
  • Write update
  • Write invalidate


Cache Coherence Protocols and Implementations
  • Write update – propagate cache line to other processors on every write to a processor
  • Write invalidate – each processor gets the updated cache line whenever it reads stale data
  • Which is better??
Caches – False Sharing
  • Different processors update different parts of the same cache line
  • Leads to ping-pong of cache lines between processors
  • Situation better in update protocols than invalidate protocols. Why?

[Figure: CPU0 updates elements A0, A2, A4, … and CPU1 updates A1, A3, A5, …; their caches (cache0, cache1) hold cache lines A0–A8 and A9–A15 from main memory]

  • Modify the algorithm to change the stride
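
A C/OpenMP sketch of the situation above; the thread count, iteration count, and 64-byte cache-line padding are illustrative assumptions. The packed counters share a cache line and ping-pong between caches, while the padded counters do not.

```c
#include <stdio.h>
#include <omp.h>

#define NTHREADS 4
#define ITERS 10000000L
#define PAD 8                                   /* 8 doubles = 64 bytes, a common line size */

static volatile double packed[NTHREADS];        /* counters packed into one cache line */
static volatile double padded[NTHREADS][PAD];   /* one counter per cache line          */

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        packed[id] = 0.0;
        for (long i = 0; i < ITERS; i++)
            packed[id] += 1.0;                  /* neighbours share a line: false sharing */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        padded[id][0] = 0.0;
        for (long i = 0; i < ITERS; i++)
            padded[id][0] += 1.0;               /* counters live on separate cache lines  */
    }
    double t2 = omp_get_wtime();

    printf("packed (false sharing): %.3f s\n", t1 - t0);
    printf("padded (no sharing)   : %.3f s\n", t2 - t1);
    return 0;
}
```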


Cache Coherence Using Invalidate Protocols
  • 3 states associated with data items
    • Shared – a variable shared by 2 caches
    • Invalid – another processor (say P0) has updated the data item
    • Dirty – state of the data item in P0
  • Implementations
    • Snoopy
      • for bus based architectures
      • Memory operations are propagated over the bus and snooped
      • Instead of broadcasting memory operations to all processors, propagate coherence operations to relevant processors
    • Directory-based
      • A central directory maintains states of cache blocks, associated processors
      • Implemented with presence bits
Interconnection Networks
  • An interconnection network defined by switches, links and interfaces
    • Switches – provide mapping between input and output ports, buffering, routing etc.
    • Interfaces – connects nodes with network
  • Network topologies
    • Static – point-to-point communication links among processing nodes
    • Dynamic – Communication links are formed dynamically by switches
Interconnection Networks
  • Static
    • Bus – SGI Challenge
    • Completely connected
    • Star
    • Linear array, Ring (1-D torus)
    • Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
    • k-d mesh: d dimensions with k nodes in each dimension
    • Hypercubes – a k-d mesh with k = 2 and d = log p – e.g. many MIMD machines
    • Trees – our campus network
  • Dynamic – Communication links are formed dynamically by switches
    • Crossbar – Cray X series – non-blocking network
    • Multistage – SP2 – blocking network.
  • For more details, and evaluation of topologies, refer to book
Evaluating Interconnection topologies
  • Diameter – maximum distance between any two processing nodes
    • Fully connected – 1
    • Star – 2
    • Ring – p/2
    • Hypercube – log p
  • Connectivity – multiplicity of paths between 2 nodes; minimum number of arcs to be removed from the network to break it into two disconnected networks
    • Linear array – 1
    • Ring – 2
    • 2-D mesh – 2
    • 2-D mesh with wraparound – 4
    • d-dimensional hypercube – d
Evaluating Interconnection topologies
  • Bisection width – minimum number of links to be removed from the network to partition it into two equal halves
    • Ring – 2
    • P-node 2-D mesh – √P
    • Tree – 1
    • Star – 1
    • Completely connected – P²/4
    • Hypercube – P/2
Evaluating Interconnection topologies
  • channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes
  • channel rate – performance of a single physical wire
  • channel bandwidth – channel rate times channel width
  • bisection bandwidth – maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth
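
A small C sketch that puts these definitions together for a hypercube; the channel width and channel rate values are arbitrary illustrations.

```c
#include <stdio.h>

int main(void) {
    int d = 4;                              /* 4-dimensional hypercube: P = 2^d = 16 nodes */
    int P = 1 << d;

    int    bisection_width = P / 2;         /* links cut to split a hypercube in half      */
    double channel_width   = 32.0;          /* bits carried simultaneously per link (illustrative) */
    double channel_rate    = 1.0e9;         /* transfers per second on one wire (illustrative)     */

    double channel_bw   = channel_width * channel_rate;   /* bits/s per link              */
    double bisection_bw = bisection_width * channel_bw;   /* bits/s across the bisection  */

    printf("P = %d, bisection width = %d links\n", P, bisection_width);
    printf("channel bandwidth   = %.2e bits/s\n", channel_bw);
    printf("bisection bandwidth = %.2e bits/s\n", bisection_bw);
    return 0;
}
```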