Parallel Programming

Sathish S. Vadhiyar

Course Web Page:

http://www.serc.iisc.ernet.in/~vss/courses/PPP2009

- Faster execution time, because independent regions of code can run at the same time
- Presents a level of modularity
- Resource constraints, e.g. large databases
- Certain classes of algorithms lend themselves to parallel execution
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade – about 40% per year
- Memory access time improvement in the past decade – about 10% per year

- Grand challenge problems (more later)

- Building efficient algorithms.
- Avoiding
- Communication delay
- Idling
- Synchronization

[Figure: execution timelines of processes P0 and P1, showing computation, communication, synchronization and idle time]

- Execution time, Tp
- Speedup, S
- S(p, n) = T(1, n) / T(p, n)
- Usually, S(p, n) < p
- Sometimes S(p, n) > p (superlinear speedup)

- Efficiency, E
- E(p, n) = S(p, n)/p
- Usually, E(p, n) < 1
- Sometimes, greater than 1

- Scalability – Limitations in parallel computing, relation to n and p.
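
The speedup and efficiency definitions above can be sketched in a few lines of Python. The timings are hypothetical values chosen for illustration, not measurements from the course.

```python
def speedup(t_serial, t_parallel):
    """S(p, n) = T(1, n) / T(p, n)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """E(p, n) = S(p, n) / p."""
    return speedup(t_serial, t_parallel) / p


# Hypothetical timings in seconds for a fixed problem size n.
t1 = 120.0  # one processor
t8 = 20.0   # eight processors
print(speedup(t1, t8))        # 6.0
print(efficiency(t1, t8, 8))  # 0.75
```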

[Figure: speedup S versus p and efficiency E versus p, each with ideal and practical curves]

- Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- Expresses the overall speedup in terms of the fractions of computation time spent with and without the enhancement, and the improvement provided by the enhancement.
- Places a limit on the speedup due to parallelism.
- Speedup = 1 / (fs + (fp / P))

S = 1 / (s + (1-s)/p)
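
A minimal sketch of the formula S = 1 / (s + (1-s)/p), assuming a serial fraction of 0.1 chosen only for illustration. With that fraction the speedup can never exceed 1/s = 10, however many processors are used.

```python
def amdahl_speedup(s, p):
    """Speedup S = 1 / (s + (1 - s) / p) for serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)


# Assumed serial fraction of 10%; the speedup approaches 1 / 0.1 = 10.
for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(0.1, p), 2))
```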

Courtesy:

http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html

http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

- For a fixed serial fraction, the speedup falls further below the number of processors as the number of processors grows.
- Thus Amdahl’s law is a bit depressing for parallel programming.
- In practice, the number of parallel portions of work has to be large enough to match a given number of processors.

- Amdahl’s law – keep the parallel work fixed
- Gustafson’s law – keep computation time on parallel processors fixed, change the problem size (fraction of parallel/sequential work) to match the computation time
- For a particular number of processors, find the problem size for which parallel time is equal to the constant time
- For that problem size, find the sequential time and the corresponding speedup
- The speedup computed this way is called scaled speedup
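
The slides do not give a formula for scaled speedup. The sketch below uses the commonly quoted form of Gustafson's law, S = s + p(1 - s), where s is assumed to be the serial fraction of the time measured on the parallel machine. It follows from normalizing the parallel time to 1 = s + (1 - s), so that the same work takes s + p(1 - s) on one processor.

```python
def scaled_speedup(s, p):
    """Gustafson's scaled speedup for serial fraction s (of parallel run time) on p processors."""
    return s + p * (1.0 - s)


# Assumed serial fraction of 10%; unlike Amdahl's bound, this keeps growing with p.
for p in (1, 10, 100, 1000):
    print(p, scaled_speedup(0.1, p))
```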

Table 5.1: Efficiency as a function of n and p.

- Efficiency decreases with increasing P; increases with increasing N
- How effectively the parallel algorithm can use an increasing number of processors
- How the amount of computation performed must scale with P to keep E constant
- This function of computation in terms of P is called isoefficiency function.
- An algorithm with an isoefficiency function of O(P) is highly scalable while an algorithm with quadratic or exponential isoefficiency function is poorly scalable

For constant efficiency, a function of P, when substituted for N must satisfy the following relation for increasing P and constant E.

Can be satisfied with N = P, except for small P.

Hence isoefficiency function = O(P^2) since computation is O(N^2)

Can be satisfied with N = sqrt(P)

Hence isoefficiency function = O(P)

2D algorithm is more scalable than 1D

Parallel Algorithm Design

- Decomposition – Splitting the problem into tasks or modules
- Mapping – Assigning tasks to processors
- Mapping’s contradictory objectives
- To minimize idle times
- To reduce communications

- Static mapping
- Mapping based on Data partitioning
- Applicable to dense matrix computations
- Block distribution
- Block-cyclic distribution

- Graph partitioning based mapping
- Applicable for sparse matrix computations

- Mapping based on task partitioning

- Mapping based on Data partitioning

[Figure: two mappings of 9 array elements onto processes 0 to 2. Block: 0 0 0 1 1 1 2 2 2. Cyclic: 0 1 2 0 1 2 0 1 2]
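
The block and block-cyclic distributions can be expressed as owner functions that map an element index to a process. This is a sketch for a one-dimensional array; the function names and the 9-element, 3-process example are assumptions made for illustration.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceil(n / p)
    return i // block


def block_cyclic_owner(i, b, p):
    """Owner of element i when blocks of size b are dealt round-robin to p processes."""
    return (i // b) % p


n, p = 9, 3
print([block_owner(i, n, p) for i in range(n)])         # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print([block_cyclic_owner(i, 1, p) for i in range(n)])  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

With block size 1 the block-cyclic mapping reduces to a plain cyclic distribution; larger block sizes trade load balance against locality.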

- Based on task dependency graph
- In general the problem is NP complete

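[Figure: binary-tree task dependency graph mapped onto 8 processes: the root task on process 0, the next level on processes 0 and 4, then 0, 2, 4, 6, and the leaf tasks on processes 0 to 7]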

- Dynamic Mapping
- A process/global memory can hold a set of tasks
- Distribute some tasks to all processes
- Once a process completes its tasks, it asks the coordinator process for more tasks
- Referred to as self-scheduling, work-stealing
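
A minimal sketch of self-scheduling, assuming the shared-pool variant in which tasks sit in a global queue rather than with a coordinator process. Threads stand in for processes, and `run_self_scheduled` is a hypothetical helper name. Each worker takes the next task as soon as it finishes the previous one, so faster workers naturally process more tasks.

```python
import queue
import threading


def run_self_scheduled(tasks, num_workers, work):
    """Workers repeatedly pull the next task from a shared pool until it is empty."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()
            except queue.Empty:
                return  # no tasks left in the pool
            value = work(task)
            with lock:  # protect the shared results dictionary
                results[task] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


squares = run_self_scheduled(range(10), num_workers=4, work=lambda x: x * x)
print(sorted(squares.items()))
```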

- In spite of the best efforts in mapping, there can be interaction overheads
- Due to frequent communications, exchanging large volume of data, interaction with the farthest processors etc.
- Some techniques can be used to minimize interactions

- Maximizing data locality
- Minimizing volume of data exchange
- Using higher dimensional mapping
- Not communicating intermediate results

- Minimizing frequency of interactions

- Minimizing volume of data exchange
- Minimizing contention and hot spots
- Do not have every process follow the same communication order with the other processes, e.g. all processes sending to the same process at the same step

- Overlapping computations with interactions
- Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
- Initiate communication for type 1; During communication, perform type 2
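
The two-phase idea can be sketched with a background thread standing in for the communication. `fetch_remote_data` is a hypothetical placeholder that sleeps instead of using a network; in a message passing program its role would be played by a non-blocking receive.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_remote_data():
    """Stand-in for a communication step (hypothetical; sleeps instead of using a network)."""
    time.sleep(0.1)
    return [1, 2, 3]


def local_work():
    """Type 2 computation: needs no communicated data."""
    return sum(range(1000))


def dependent_work(data):
    """Type 1 computation: needs the communicated data."""
    return sum(data)


with ThreadPoolExecutor(max_workers=1) as executor:
    pending = executor.submit(fetch_remote_data)  # initiate communication
    local = local_work()                          # compute while the data is in flight
    remote = dependent_work(pending.result())     # wait, then use the data

print(local, remote)
```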

- Overlapping interactions with interactions
- Replicating data or computations
- Balancing the extra computation or storage cost with the gain due to less communication

Parallel Algorithm Classification

- Types

- Models

- Divide and conquer
- Data partitioning / decomposition
- Pipelining

- Recursive in structure
- Divide the problem into sub-problems that are similar to the original, smaller in size
- Conquer the sub-problems by solving them recursively. If small enough, solve them in a straightforward manner
- Combine the solutions to create a solution to the original problem

- Problem: Sort a sequence of n elements
- Divide the sequence into two subsequences of n/2 elements each
- Conquer: Sort the two subsequences recursively using merge sort
- Combine: Merge the two sorted subsequences to produce sorted answer
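
The three steps map directly onto a sequential merge sort, sketched below. In a parallel version the two recursive calls are independent and could run on different processes; that part is not shown here.

```python
def merge(left, right):
    """Combine: merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(seq):
    """Divide the sequence in two, conquer each half recursively, then combine."""
    if len(seq) <= 1:
        return list(seq)  # small enough: already sorted
    mid = len(seq) // 2
    return merge(merge_sort(seq[:mid]), merge_sort(seq[mid:]))


print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))  # [1, 2, 2, 3, 4, 5, 6, 7]
```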

- Breaking up the given problem into p independent subproblems of almost equal sizes
- Solving the p subproblems concurrently
- Mostly splitting the input or output into non-overlapping pieces
- Example: Matrix multiplication
- Either the inputs (A or B) or output (C) can be partitioned.
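
As an illustration of partitioning the output, the sketch below splits C into p row blocks and computes each block as an independent task. The helper names are assumptions, and Python threads are used only to show the structure; they are not expected to give real speedup for this kind of computation.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Compute the rows of C = A * B that correspond to the given rows of A."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a_rows]


def partitioned_matmul(a, b, p):
    """Split the output C into p row blocks and compute each block as an independent task."""
    n = len(a)
    block = -(-n // p)  # ceil(n / p) rows per block
    chunks = [a[i:i + block] for i in range(0, n, block)]
    with ThreadPoolExecutor(max_workers=p) as executor:
        parts = list(executor.map(multiply_rows, chunks, [b] * len(chunks)))
    return [row for part in parts for row in part]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(partitioned_matmul(a, b, 2))  # [[19, 22], [43, 50]]
```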

Pipelining occurs in image processing applications, where a number of images undergo a sequence of transformations.

- Data parallel model
- Processes perform identical tasks on different data

- Task parallel model
- Different processes perform different tasks on same or different data – based on task dependency graph

- Work pool model
- Any task can be performed by any process. Tasks are added to a work pool dynamically

- Pipeline model
- A stream of data passes through a chain of processes – stream parallelism
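
The pipeline model can be sketched as a chain of stages connected by queues, one thread per stage. `run_pipeline` and the `DONE` sentinel are hypothetical names used for this illustration. While one item is in the second stage, the next item can already be in the first.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result on to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Stream items through one thread per function; stages are linked by queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[k], queues[k + 1]))
        for k, f in enumerate(funcs)
    ]
    for th in threads:
        th.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for th in threads:
        th.join()
    return results


print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))  # [20, 30, 40]
```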

Parallel Program Classification

- Models

- Structure

- Paradigms

- Single Program Multiple Data (SPMD)
- Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

- Master-Worker / parameter sweep / task farming
- Embarrassingly/pleasingly parallel
- Pipeline / systolic / wavefront
- Tightly-coupled
- Workflow

[Figure: two diagrams, each showing processes P0 to P4]

- Shared memory model – Threads, OpenMP
- Message passing model – MPI
- Data parallel model – HPF

Parallel Architectures Classification

- Classification

- Cache coherence in shared memory platforms

- Interconnection networks

- Single Instruction Single Data (SISD): Serial Computers
- Single Instruction Multiple Data (SIMD)
- Vector processors and processor arrays

- Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

- Multiple Instruction Single Data (MISD): Not popular
- Multiple Instruction Multiple Data (MIMD)
- Most popular

- IBM SP and most other supercomputers, clusters, computational Grids etc.

- Shared memory
- 2 types – UMA and NUMA

[Figure: UMA and NUMA shared memory organizations]

NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q

- Distributed memory

- Recently multi-cores
- Yet another classification – MPPs (massively parallel processors), NOW (network of workstations, Berkeley), COW (cluster of workstations), Computational Grids

Cache Coherence

- for details, read 2.4.6 of book

Interconnection networks

- for details, read 2.4.2-2.4.5 of book

- All processes read variable ‘x’ residing in cache line ‘a’
- Each process updates ‘x’ at different points of time

[Figure: CPU0 to CPU3 with private caches cache0 to cache3, each holding a copy of cache line 'a', which also resides in main memory]

- Challenge: To maintain consistent view of the data
- Protocols:
- Write update
- Write invalidate

- Write update – propagate the updated cache line to the other processors on every write by a processor
- Write invalidate – a write invalidates the copies held by other processors; each processor gets the updated cache line when it next reads the stale data
- Which is better?

- False sharing – different processors update different parts of the same cache line
- Leads to ping-pong of cache lines between processors
- The situation is better with update protocols than with invalidate protocols. Why?

[Figure: CPU0 and CPU1, with caches cache0 and cache1, update interleaved elements (A0, A2, A4… and A1, A3, A5…) of the memory blocks A0 – A8 and A9 – A15]

- Modify the algorithm to change the stride
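
To see why changing the stride helps, the sketch below counts the cache lines written by more than one CPU under two assignments of array elements. The array size and the line size of 8 elements are assumed values for illustration only.

```python
def lines_touched(indices, line_size):
    """Set of cache-line numbers covering the given array indices."""
    return {i // line_size for i in indices}


def shared_lines(assignment, line_size):
    """Cache lines written by more than one CPU; assignment maps each CPU to its indices."""
    seen = {}
    for cpu, indices in assignment.items():
        for line in lines_touched(indices, line_size):
            seen.setdefault(line, set()).add(cpu)
    return sorted(line for line, cpus in seen.items() if len(cpus) > 1)


n, line_size = 16, 8  # assumed: 16 elements, 8 elements per cache line
interleaved = {0: range(0, n, 2), 1: range(1, n, 2)}  # stride 2: even / odd elements
blocked = {0: range(0, n // 2), 1: range(n // 2, n)}  # stride 1: contiguous halves
print(shared_lines(interleaved, line_size))  # [0, 1]
print(shared_lines(blocked, line_size))      # []
```

With the interleaved assignment both CPUs write to every cache line; with contiguous halves aligned to line boundaries no line is written by both.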

- 3 states associated with data items
- Shared – a variable shared by 2 caches
- Invalid – another processor (say P0) has updated the data item
- Dirty – state of the data item in P0
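
The three states can be traced with a toy model of a single cache line under a write-invalidate protocol. This is a simplified sketch and not a full protocol: it assumes every cache starts with a clean copy and ignores main memory and replacement.

```python
SHARED, INVALID, DIRTY = "shared", "invalid", "dirty"


class CacheLine:
    """States of one cache line across num_procs caches under a write-invalidate protocol."""

    def __init__(self, num_procs):
        self.state = [SHARED] * num_procs  # assumption: every cache starts with a clean copy

    def write(self, proc):
        # The writer's copy becomes dirty; every other copy is invalidated.
        self.state = [DIRTY if q == proc else INVALID for q in range(len(self.state))]

    def read(self, proc):
        if self.state[proc] == INVALID:
            # The dirty owner supplies the data, so its copy and the reader's become shared.
            self.state = [SHARED if s == DIRTY else s for s in self.state]
            self.state[proc] = SHARED


line = CacheLine(2)
line.write(0)
print(line.state)  # ['dirty', 'invalid']
line.read(1)
print(line.state)  # ['shared', 'shared']
```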

- Implementations
- Snoopy
- for bus based architectures
- Memory operations are propagated over the bus and snooped
- Directory-based
- Instead of broadcasting memory operations to all processors, propagate coherence operations to relevant processors
- A central directory maintains states of cache blocks, associated processors
- Implemented with presence bits

- An interconnection network is defined by switches, links and interfaces
- Switches – provide mapping between input and output ports, buffering, routing etc.
- Interfaces – connect nodes to the network

- Network topologies
- Static – point-to-point communication links among processing nodes
- Dynamic – Communication links are formed dynamically by switches

- Static
- Bus – SGI Challenge
- Completely connected
- Star
- Linear array, Ring (1-D torus)
- Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
- k-d mesh: d dimensions with k nodes in each dimension
- Hypercubes – 2-(log p) mesh, i.e. log p dimensions with 2 nodes in each dimension – e.g. many MIMD machines
- Trees – our campus network

- Dynamic – Communication links are formed dynamically by switches
- Crossbar – Cray X series – non-blocking network
- Multistage – SP2 – blocking network.

- For more details, and evaluation of topologies, refer to book

- Diameter – maximum distance between any two processing nodes
- Fully connected – 1
- Star – 2
- Ring – p/2
- Hypercube – log P

- Connectivity – multiplicity of paths between 2 nodes. Minimum number of arcs to be removed from network to break it into two disconnected networks
- Linear array – 1
- Ring – 2
- 2-d mesh – 2
- 2-d mesh with wraparound – 4
- d-dimensional hypercube – d
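
The diameter values for the ring, hypercube and star can be checked on small instances with a breadth-first search. This is a sketch: the graph builders are illustrative helpers and the sizes are chosen arbitrarily.

```python
from collections import deque


def ring(p):
    """p nodes, each linked to its two neighbours."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def hypercube(d):
    """2**d nodes; neighbours differ in exactly one bit of the node number."""
    return {i: [i ^ (1 << k) for k in range(d)] for i in range(2 ** d)}


def star(p):
    """Node 0 is the centre, linked to the other p - 1 nodes."""
    graph = {0: list(range(1, p))}
    graph.update({i: [0] for i in range(1, p)})
    return graph


def eccentricity(graph, source):
    """Largest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return max(dist.values())


def diameter(graph):
    """Maximum distance between any two nodes."""
    return max(eccentricity(graph, u) for u in graph)


print(diameter(ring(8)), diameter(hypercube(3)), diameter(star(8)))  # 4 3 2
```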

- bisection width – minimum number of links to be removed from network to partition it into 2 equal halves
- Ring – 2
- P-node 2-D mesh – sqrt(P)
- Tree – 1
- Star – 1
- Completely connected – P^2/4
- Hypercubes – P/2

- channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes
- channel rate – performance of a single physical wire
- channel bandwidth – channel rate times channel width
- bisection bandwidth – maximum volume of communication between two halves of network, i.e. bisection width times channel bandwidth
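
A worked example of these definitions, using made-up link parameters for a 64-node hypercube (bisection width P/2 = 32). The numbers are assumptions for illustration and do not describe any real machine.

```python
def channel_bandwidth(channel_rate, channel_width):
    """Channel bandwidth = channel rate times channel width."""
    return channel_rate * channel_width


def bisection_bandwidth(bisection_width, channel_rate, channel_width):
    """Bisection bandwidth = bisection width times channel bandwidth."""
    return bisection_width * channel_bandwidth(channel_rate, channel_width)


# Assumed values: 16 wires per link, each wire carrying 10**9 bits per second.
nodes = 64
print(bisection_bandwidth(nodes // 2, 10**9, 16))  # 512000000000 bits per second
```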