
Parallel Processing & Parallel Algorithm


Presentation Transcript


  1. Parallel Processing & Parallel Algorithm May 8, 2003 B4 Yuuki Horita

  2. Chapter 3 Principles of Parallel Programming

  3. Principles of Parallel Programming
  • Data Dependency
  • Processor Communication
  • Mapping
  • Granularity

  4. Data Dependency
  • data flow dependency
  • data anti-dependency
  • data output dependency
  • data input dependency
  • data control dependency

  5. Data flow dependency & data anti-dependency
  - data flow dependency (S2 reads a value that S1 writes):
      S1: A = B + C
      S2: D = A + E
  - data anti-dependency (S2 overwrites a value that S1 reads):
      S1: A = B + C
      S2: B = D + E

  6. Data output dependency & data input dependency
  - data output dependency (S1 and S2 write the same variable):
      S1: A = B + C
      S2: A = D + E
  - data input dependency (S1 and S2 read the same variable):
      S1: A = B + C
      S2: D = B + E

  7. Data control dependency
      S1: A = B - C
          if ( A > 0 )
      S2:   then D = 1;
      S3:   else D = 2;
          endif
  Whether S2 or S3 executes depends on the value S1 computes: S1 ⇒ S2, S3

  8. Dependency Graph
  G = ( V, E )   V (vertices): statements, E (edges): dependencies
  ex)
      S1: A = B + C
      S2: B = A * 3
      S3: A = 2 * C
      S4: P = B ≧ 0
          if ( P is True )
      S5:   then D = 1
      S6:   else D = 2
          endif
  Edges:
      S1 → S2 : flow dep. (A), anti-dep. (B)
      S1 → S3 : output dep. (A), input dep. (C)
      S2 → S3 : anti-dep. (A)
      S2 → S4 : flow dep. (B)
      S4 → S5, S6 : control dep.

  9. Elimination of the dependencies
  • data output dependency
  • data anti-dependency
  ⇒ renaming can remove these forms of dependency

  10. Elimination of the dependencies (2): Renaming
  ex)
      S1: A = B + C        S1, S2 : anti-dep. (B)
      S2: B = D + E        S1, S3 : output dep. (A)
      S3: A = F + G
  After renaming:
      S1: A  = B + C
      S2: B' = D + E
      S3: A' = F + G
  No dependency!
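
To make the renaming concrete, here is a minimal C sketch of the same transformation (B2 and A2 stand in for B' and A', which are not legal C identifiers; the operand values are illustrative):

    #include <stdio.h>

    int main(void) {
        int B = 1, C = 2, D = 3, E = 4, F = 5, G = 6;

        /* Before renaming, execution order is forced:
           S1: A = B + C;   S2 anti-depends on S1 through B,
           S2: B = D + E;   S3 output-depends on S1 through A.
           S3: A = F + G;                                      */

        /* After renaming, every statement writes a fresh name,
           so S1, S2', S3' carry no dependencies and could run
           in parallel. */
        int A  = B + C;  /* S1  */
        int B2 = D + E;  /* S2' */
        int A2 = F + G;  /* S3' */

        printf("A=%d B2=%d A2=%d\n", A, B2, A2);
        return 0;
    }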

  11. Processor Communication
  • Message-passing communication: processors communicate via communication links
  • Shared-memory communication: processors communicate via common memory

  12. Message Passing System
  [Diagram: processing elements PE1 … PEm, each with its own local memory M1 … Mm, connected by an Interconnection Network]

  13. Message Passing System (2)
  • Send and Receive operations: blocking or nonblocking
  • Synchronous system: blocking send and blocking receive operations
  • Asynchronous system: nonblocking send and blocking receive operations (the messages are buffered); see the sketch below
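
As an illustration, the blocking and nonblocking operations above map directly onto MPI's point-to-point calls; a minimal sketch for two processes (the value and tags are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Blocking send: returns once the buffer may be reused
               (the runtime may buffer the message internally). */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

            /* Nonblocking send: MPI_Isend returns immediately and
               MPI_Wait completes the operation later. */
            MPI_Request req;
            MPI_Isend(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* Blocking receives: each blocks until the matching
               message arrives. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("PE1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }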

  14. Shared Memory System
  [Diagram: processing elements PE1 … PEm connected through an Interconnection Network to a Global Memory made of modules M1 … Mm]
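
A minimal pthreads sketch of shared-memory communication: the threads stand in for PEs and a mutex-protected counter stands in for a word of global memory (all names and counts are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    /* One word of "global memory" shared by all PEs (threads). */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);   /* serialize access to the shared word */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t pe[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&pe[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(pe[i], NULL);
        printf("counter = %ld\n", counter);  /* 400000 */
        return 0;
    }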

  15. Mapping … matching parallel algorithms to parallel architectures
  Mapping to:
  - asynchronous architecture
  - synchronous architecture
  - distributed architecture

  16. Mapping to Asynchronous Architecture
  Mapping a program to an asynchronous shared-memory computer has the following steps:
  • Allocate each statement in the program to a processor
  • Allocate each variable to a memory
  • Specify the control flow for each processor
  The sender may send many messages without the receiver removing them from the channel ⇒ messages need to be buffered (see the sketch below)
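
A sketch of such a buffered channel, assuming a hypothetical fixed capacity CAP and pthreads for mutual exclusion; the sender blocks only when the buffer is full and the receiver only when it is empty:

    #include <pthread.h>

    #define CAP 16  /* illustrative channel capacity */

    /* A buffered channel: the sender can deposit up to CAP messages
       before the receiver removes any of them. */
    typedef struct {
        int buf[CAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_full, not_empty;
    } channel;

    channel ch = { .lock      = PTHREAD_MUTEX_INITIALIZER,
                   .not_full  = PTHREAD_COND_INITIALIZER,
                   .not_empty = PTHREAD_COND_INITIALIZER };

    void channel_send(channel *c, int msg) {
        pthread_mutex_lock(&c->lock);
        while (c->count == CAP)               /* block only when full */
            pthread_cond_wait(&c->not_full, &c->lock);
        c->buf[c->tail] = msg;
        c->tail = (c->tail + 1) % CAP;
        c->count++;
        pthread_cond_signal(&c->not_empty);
        pthread_mutex_unlock(&c->lock);
    }

    int channel_recv(channel *c) {
        pthread_mutex_lock(&c->lock);
        while (c->count == 0)                 /* block until a message exists */
            pthread_cond_wait(&c->not_empty, &c->lock);
        int msg = c->buf[c->head];
        c->head = (c->head + 1) % CAP;
        c->count--;
        pthread_cond_signal(&c->not_full);
        pthread_mutex_unlock(&c->lock);
        return msg;
    }

With this channel, a send never blocks until CAP messages are outstanding, which is exactly the buffering the asynchronous mapping requires.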

  17. Mapping to Synchronous Architecture
  • Mapping a program is the same as for the asynchronous architecture
  • A common clock is used for synchronization purposes
  • Each processor executes an instruction at each step (at each clock tick)
  • Only one message exists at any one time on a channel ⇒ no buffering is needed

  18. Mapping to Distributed Architecture
  • Each local memory is accessible only by its owning processor
  • Only the pair of processors joined by a channel communicates along it
  • Mapping a program is the same as in the asynchronous shared-memory architecture, except that each variable is allocated either to a processor or to a channel

  19. Granularity … the ratio of the amount of computation to the amount of communication
  • fine : parallelism at the statement level
  • medium : parallelism at the procedure level
  • coarse : parallelism at the program level

  20. Program Level Parallelism
  • A program creates a new process by creating a complete copy of itself: fork() (UNIX), as in the sketch below
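
A minimal UNIX example: fork() duplicates the calling process, and both copies continue from the return point, with the return value telling them apart:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();       /* duplicate the whole process */
        if (pid == 0) {
            printf("child:  pid=%d\n", (int)getpid());  /* runs in the copy */
        } else if (pid > 0) {
            printf("parent: forked child %d\n", (int)pid);
            wait(NULL);           /* reap the child */
        } else {
            perror("fork");
            return 1;
        }
        return 0;
    }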

  21. Statement Level Parallelism
  ・ Parbegin-Parend block
      Par-begin
        Statement1
        Statement2
        :
        Statementn
      Par-end
  ⇒ the statements Statement1 … Statementn are executed in parallel

  22. Statement Level Parallelism (2)
  ex) (a + b) * (c + d) – (e / f)
      Par-begin
        Par-begin
          t1 = a + b
          t2 = c + d
        Par-end
        t4 = t1 * t2
        t3 = e / f
      Par-end
      t5 = t4 – t3
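
One way to realize this nesting is with threads: the outer Par-begin forks two branches, and the left branch itself forks the two additions before the sequential multiply. A minimal pthreads sketch with illustrative operand values:

    #include <pthread.h>
    #include <stdio.h>

    static double a = 1, b = 2, c = 3, d = 4, e = 8, f = 2;
    static double t1, t2, t3, t4;

    static void *add_ab(void *x) { (void)x; t1 = a + b; return NULL; }
    static void *add_cd(void *x) { (void)x; t2 = c + d; return NULL; }
    static void *div_ef(void *x) { (void)x; t3 = e / f; return NULL; }

    /* Left branch of the outer Par-begin: the inner Par-begin/Par-end
       pair followed by the sequential multiply. */
    static void *left_branch(void *x) {
        (void)x;
        pthread_t p1, p2;
        pthread_create(&p1, NULL, add_ab, NULL);   /* inner Par-begin */
        pthread_create(&p2, NULL, add_cd, NULL);
        pthread_join(p1, NULL);                    /* inner Par-end */
        pthread_join(p2, NULL);
        t4 = t1 * t2;
        return NULL;
    }

    int main(void) {
        pthread_t lb, rb;
        pthread_create(&lb, NULL, left_branch, NULL);  /* outer Par-begin */
        pthread_create(&rb, NULL, div_ef, NULL);
        pthread_join(lb, NULL);                        /* outer Par-end */
        pthread_join(rb, NULL);
        printf("t5 = %g\n", t4 - t3);  /* (1+2)*(3+4) - 8/2 = 17 */
        return 0;
    }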

  23. Statement Level Parallelism (3)
  ・ Fork, Join, Quit
  - Fork x : causes a new process to be created that starts executing at the instruction labeled x
  - Join t, y :
      t = t – 1
      if ( t = 0 ) then go to y
  - Quit : the process terminates

  24. Statement Level Parallelism (4)
  ex) (a + b) * (c + d) – (e / f)
      n = 2
      m = 2
      Fork P2
      Fork P3
      P1: t1 = a + b;   Join m, P4; Quit;
      P2: t2 = c + d;   Join m, P4; Quit;
      P4: t4 = t1 * t2; Join n, P5; Quit;
      P3: t3 = e / f;   Join n, P5; Quit;
      P5: t5 = t4 – t3
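
Join t, y decrements the shared counter t, and only the process that brings it to zero continues at y; every other process Quits. A sketch of the same computation using C11 atomics for the two join counters (operand values are illustrative):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static double t1, t2, t3, t4, t5;
    static atomic_int m = 2, n = 2;   /* the slide's join counters */

    static void P5(void) { t5 = t4 - t3; printf("t5 = %g\n", t5); }

    /* Join n, P5: the last arrival runs P5, earlier arrivals Quit. */
    static void P4(void) {
        t4 = t1 * t2;
        if (atomic_fetch_sub(&n, 1) == 1) P5();
    }

    static void *P1(void *x) {   /* t1 = a + b; Join m, P4; Quit */
        (void)x;
        t1 = 1 + 2;
        if (atomic_fetch_sub(&m, 1) == 1) P4();
        return NULL;
    }

    static void *P2(void *x) {   /* t2 = c + d; Join m, P4; Quit */
        (void)x;
        t2 = 3 + 4;
        if (atomic_fetch_sub(&m, 1) == 1) P4();
        return NULL;
    }

    static void *P3(void *x) {   /* t3 = e / f; Join n, P5; Quit */
        (void)x;
        t3 = 8.0 / 2;
        if (atomic_fetch_sub(&n, 1) == 1) P5();
        return NULL;
    }

    int main(void) {
        pthread_t w1, w2, w3;
        pthread_create(&w1, NULL, P1, NULL);   /* Fork P2, Fork P3 */
        pthread_create(&w2, NULL, P2, NULL);
        pthread_create(&w3, NULL, P3, NULL);
        pthread_join(w1, NULL);
        pthread_join(w2, NULL);
        pthread_join(w3, NULL);
        return 0;
    }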
