
Parallel Algorithms and Computing: Selected Topics



  1. Parallel Algorithms and Computing: Selected Topics. Parallel Architecture

  2. References • An Introduction to Parallel Algorithms, Joseph Jaja • Introduction to Parallel Computing, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis • Parallel Sorting Algorithms, Selim G. Akl

  3. Models Three models: • Graphs (DAG: Directed Acyclic Graph) • Parallel Random Access Machine (PRAM) • Network

  4. Graphs Not studied here

  5. Parallel Architecture Parallel Random Access Machine (PRAM)

  6. Parallel Random Access Machine • Flynn classifies parallel machines based on: • data flow • instruction flow • Each flow can be: • single • multiple

  7. Parallel Random Access Machine • Flynn classification: combining the two flows (single or multiple data flow × single or multiple instruction flow) gives four classes: SISD, SIMD, MISD, MIMD

  8. Parallel Random Access Machine • Extends the traditional RAM (Random Access Machine) model • Multiple processors P1, P2, …, Pp • An interconnection network between the global (shared) memory and the processors

  9. Parallel Random Access Machine Characteristics • Processors Pi (0 ≤ i ≤ p-1) • each with a local memory • i is a unique identifier for processor Pi • A global shared memory • it can be accessed by all processors

  10. Parallel Random Access Machine Types of operation: • Synchronous • processors work in lockstep • at each step, a processor is either active or idle • suited for SIMD and MIMD architectures • Asynchronous • each processor has its own local clock • the processors must be synchronized explicitly • suited for MIMD architectures

  11. Parallel Random Access Machine • Example of synchronous operation Algorithm: processor i (i = 0 … 3) Input: A, B (i is the processor id) Output: C Begin If (B == 0) then C = A Else C = A / B End

  12. Parallel Random Access Machine Example trace (4 processors):

Initial: P0: A=5, B=0, C=0 | P1: A=4, B=2, C=0 | P2: A=2, B=1, C=0 | P3: A=7, B=0, C=0

Step 1 (processors with B = 0 active, the others idle): P0: C=5 (active) | P1: C=0 (idle) | P2: C=0 (idle) | P3: C=7 (active)

  13. Parallel Random Access Machine Step 2 (processors with B ≠ 0 active, the others idle): P0: C=5 (idle) | P1: C=2 (active) | P2: C=2 (active) | P3: C=7 (idle)
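The two-phase trace above can be played out sequentially in Python (an illustration, not part of the slides): each loop iteration stands in for one processor, and the two loops correspond to the two synchronous steps.

```python
# Data from the slides: one (A, B, C) triple per processor P0..P3.
A = [5, 4, 2, 7]
B = [0, 2, 1, 0]
C = [0, 0, 0, 0]

# Step 1: processors where B == 0 are active and set C = A; the rest idle.
for i in range(4):
    if B[i] == 0:
        C[i] = A[i]

# Step 2: processors where B != 0 are active and set C = A / B; the rest idle.
for i in range(4):
    if B[i] != 0:
        C[i] = A[i] // B[i]

print(C)  # [5, 2, 2, 7], matching the final state on slide 13
```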

  14. Parallel Random Access Machine Read/Write conflicts • EREW: Exclusive-Read, Exclusive-Write • no concurrent operation (read or write) on the same variable • CREW: Concurrent-Read, Exclusive-Write • concurrent reads allowed on the same variable • writes remain exclusive

  15. Parallel Random Access Machine • ERCW: Exclusive-Read, Concurrent-Write • CRCW: Concurrent-Read, Concurrent-Write

  16. Parallel Random Access Machine Concurrent write on a variable X • Common CRCW: succeeds only if all processors write the same value to X • Sum CRCW: the sum of all written values is stored in X • Random CRCW: one processor is chosen at random and its value is written to X • Priority CRCW: the processor with the highest priority writes to X

  17. Parallel Random Access Machine Example: concurrent write to X by processors P1 (50 → X), P2 (60 → X), P3 (70 → X) • Common CRCW or ERCW: failure • Sum CRCW: X holds the sum (180) of the written values • Random CRCW: the final value of X ∈ { 50, 60, 70 }
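The four write-resolution policies can be sketched as a small Python function (an illustration, not from the slides; the name `crcw_write` and the convention that a smaller processor id means higher priority are assumptions made here):

```python
import random

def crcw_write(requests, policy):
    """Resolve concurrent write requests to a single cell X.

    requests: list of (processor_id, value) pairs.
    Assumption: a smaller processor id means higher priority.
    """
    values = [v for _, v in requests]
    if policy == "common":
        # Common CRCW: defined only when all processors agree on the value.
        if len(set(values)) != 1:
            raise ValueError("Common CRCW: processors wrote different values")
        return values[0]
    if policy == "sum":
        return sum(values)                 # Sum CRCW: X gets the sum
    if policy == "random":
        return random.choice(values)       # Random CRCW: arbitrary winner
    if policy == "priority":
        return min(requests)[1]            # Priority CRCW: lowest id wins
    raise ValueError(f"unknown policy {policy!r}")

reqs = [(1, 50), (2, 60), (3, 70)]         # the example from slide 17
print(crcw_write(reqs, "sum"))       # 180
print(crcw_write(reqs, "priority"))  # 50
```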

  18. Parallel Random Access Machine Basic input/output operations • On global memory • global read(X, x) • global write(Y, y) • On local memory • read(X, x) • write(Y, y)

  19. Example 1: Matrix-Vector product • Matrix-vector product Y = AX • A is an n×n matrix • X = [x1, x2, …, xn], a vector of n elements • p processors (p ≤ n) and r = n/p • Each processor is assigned a block of r = n/p elements

  20. Example 1: Matrix-Vector product [Figure: the global memory holds the n×n matrix A, the input vector X = (x1, …, xn) and the result vector Y = (y1, …, yn); the processors P1, P2, …, Pp all access it.]

  21. Example 1: Matrix-Vector product Partition A into p row blocks A1, …, Ap of r = n/p rows each, so that A = [A1; A2; …; Ap] • Compute the p partial products in parallel • Processor Pi computes the partial product Yi = Ai X

  22. Example 1: Matrix-Vector product Processor Pi computes Yi = Ai X: P1 computes Y(1 : r) = A(1 : r, 1:n) X, P2 computes Y(r+1 : 2r) = A(r+1 : 2r, 1:n) X, …, Pp computes Y((p-1)r+1 : pr) = A((p-1)r+1 : pr, 1:n) X

  23. Example 1: Matrix-Vector product The solution requires: • p concurrent reads of the vector X • each processor Pi performs an exclusive read of the block Ai = A((i-1)r+1 : ir, 1:n) • each processor Pi performs an exclusive write to the block Yi = Y((i-1)r+1 : ir) Required architecture: CREW PRAM

  24. Example 1: Matrix-Vector product Algorithm: processor Pi (i = 1, 2, …, p) • Input • A: n×n matrix in global memory • X: a vector in global memory • Output • Y = AX (Y is a vector in global memory) • Local variables • i: processor Pi's id • p: the number of processors • n: the dimension of A and X • Begin • 1. global read(X, z) • 2. global read(A((i-1)r+1 : ir, 1:n), B) • 3. compute W = Bz • 4. global write(W, Y((i-1)r+1 : ir)) • End

  25. Example 1: Matrix-Vector product Analysis • Computation cost Line 3: O(n²/p) arithmetic operations by Pi (r rows × n operations each, with r = n/p) • Communication cost Line 1: O(n) numbers transferred from global to local memory by Pi Line 2: O(n²/p) numbers transferred from global to local memory by Pi Line 4: O(n/p) numbers transferred from local to global memory by Pi • Overall: the algorithm runs in O(n²/p) time
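The row-block CREW algorithm of slides 24–25 can be simulated sequentially in Python (a sketch under the assumption that p divides n; the outer loop stands in for the p processors running in parallel):

```python
def pram_matvec(A, X, p):
    """Sequential simulation of the row-block CREW matrix-vector product."""
    n = len(A)
    r = n // p                    # rows per processor (assume p divides n)
    Y = [0] * n
    for i in range(p):            # each iteration = one processor's work
        B = A[i * r:(i + 1) * r]  # exclusive read of the row block Ai
        for k, row in enumerate(B):
            # exclusive write into this processor's slice of Y
            Y[i * r + k] = sum(a * x for a, x in zip(row, X))
    return Y

A = [[1, 2], [3, 4]]
X = [1, 1]
print(pram_matvec(A, X, 2))  # [3, 7]
```

All p "processors" read X (a concurrent read on a real PRAM), but each writes a disjoint slice of Y, which is why CREW suffices.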

  26. Example 1: Matrix-Vector product Another way to partition the matrix is vertically (by columns) • A and X are split into blocks • A1, A2, …, Ap • X1, X2, …, Xp • Solution in two phases: • compute the partial products Z1 = A1X1, …, Zp = ApXp • synchronize the processors • add the partial results to get Y: Y = AX = Z1 + Z2 + … + Zp

  27. Example 1: Matrix-Vector product [Figure: processor P1 multiplies the first r columns of A by X(1 : r); …; processor Pp multiplies the last r columns, A(1:n, (p-1)r+1 : pr), by X((p-1)r+1 : pr); a synchronization follows before the partial vectors are added.]

  28. Example 1: Matrix-Vector product Algorithm: processor Pi (i = 1, 2, …, p) • Input • A: n×n matrix in global memory • X: a vector in global memory • Output • Y = AX (Y is a vector in global memory) • Local variables • i: processor Pi's id • p: the number of processors • n: the dimension of A and X • Begin • 1. global read(X((i-1)r+1 : ir), z) • 2. global read(A(1:n, (i-1)r+1 : ir), B) • 3. compute W = Bz (* W is a vector of length n *) • 4. global write(W, Z(1:n, i)) (* store the partial vector Zi *) • 5. synchronize the processors Pi (i = 1, 2, …, p) • 6. compute S = Z((i-1)r+1 : ir, 1) + … + Z((i-1)r+1 : ir, p) • 7. global write(S, Y((i-1)r+1 : ir)) • End

  29. Example 1: Matrix-Vector product Analysis • Work out the details • Overall: the algorithm runs in O(n²/p) time
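The column-block variant can likewise be simulated sequentially (a sketch, assuming p divides n; the list Z plays the role of the partial vectors held in global memory, and the comment marks where the synchronization barrier would fall):

```python
def pram_matvec_cols(A, X, p):
    """Sequential simulation of the column-block matrix-vector product."""
    n = len(A)
    r = n // p                    # columns per processor (assume p divides n)
    Z = []
    # Phase 1 (in parallel on a PRAM): Zi = A(1:n, cols_i) * X(cols_i).
    for i in range(p):
        cols = range(i * r, (i + 1) * r)
        Z.append([sum(A[row][j] * X[j] for j in cols) for row in range(n)])
    # -- synchronization barrier: all Zi must be complete before summing --
    # Phase 2: Y = Z1 + Z2 + ... + Zp.
    return [sum(z[row] for z in Z) for row in range(n)]

A = [[1, 2], [3, 4]]
X = [1, 1]
print(pram_matvec_cols(A, X, 2))  # [3, 7]
```

Unlike the row-block version, every partial vector Zi has length n, so a reduction step after the barrier is needed to combine them into Y.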

  30. Example 2: Sum on the PRAM model • An array A of n = 2^k numbers • A PRAM machine with n processors • Compute S = A(1) + A(2) + … + A(n) • Construct a binary tree to compute the sum in log2 n time

  31. Example 2: Sum on the PRAM model [Figure: binary summation tree for n = 8. At level 1, Pi sets B(i) = A(i); at each level h > 1, Pi computes B(i) = B(2i-1) + B(2i); at the root, P1 holds S = B(1).]

  32. Example 2: Sum on the PRAM model Algorithm: processor Pi (i = 1, 2, …, n) • Input • A: array of n = 2^k elements in global memory • Output • S, where S = A(1) + A(2) + … + A(n) • Local variables of Pi • n: the number of elements • i: processor Pi's identity • Begin • 1. global read(A(i), a) • 2. global write(a, B(i)) • 3. for h = 1 to log n do • if (i ≤ n / 2^h) then begin • global read(B(2i-1), x) • global read(B(2i), y) • z = x + y • global write(z, B(i)) • end • 4. if i = 1 then global write(z, S) • End
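A sequential Python simulation of the tree sum (an illustration, not from the slides; the 1-based indexing follows the pseudocode above):

```python
import math

def pram_sum(A):
    """Sequential simulation of the log-time PRAM sum over n = 2**k values."""
    n = len(A)                 # assume n is a power of two
    B = [0] + list(A)          # B[1..n], copied from A "in parallel"; B[0] unused
    for h in range(1, int(math.log2(n)) + 1):
        # At level h only processors i <= n / 2**h are active.
        for i in range(1, n // 2**h + 1):
            B[i] = B[2 * i - 1] + B[2 * i]
    return B[1]                # P1 writes the final sum S

print(pram_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Ascending order of i is safe in this sequential stand-in: each active processor writes B(i) but reads B(2i-1) and B(2i), which lie strictly to its right for i > 1, so no value is overwritten before it is read.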

  33. Parallel Architecture Network model

  34. Network model Characteristics • The communication structure is important • The network can be seen as a graph G = (N, E): • each node i ∈ N is a processor • each edge (i, j) ∈ E represents a two-way communication link between processors i and j • Basic communication operations • send(X, Pi) • receive(X, Pi) • No global shared memory

  35. Network model • Linear array of n processors: P1 - P2 - … - Pn • Ring of n processors: a linear array with an extra link between Pn and P1

  36. Network model • Grid of n² processors: Pij arranged in n rows and n columns, each connected to its neighbors • Torus of n² processors: a grid whose rows and columns are closed into n rings (wrap-around links)

  37. Network model • Hypercube of n = 2^k processors P0, …, Pn-1 (figure: k = 3, processors P0 … P7); two processors are connected iff their binary labels differ in exactly one bit


  39. Example 1: Matrix-Vector product on a linear array • A = [aij], an n×n matrix, i, j ∈ [1, n] • X = [xi], i ∈ [1, n] • Compute Y = AX, i.e. yi = ai1 x1 + ai2 x2 + … + ain xn

  40. Example 1: Matrix-Vector product on a linear array Systolic array algorithm for n = 4 [Figure: the vector X enters processor P1 from the left and moves right through P1 … P4; row i of A (ai1 first) is fed into processor Pi from the top, skewed so that aij arrives at step i + j - 1.]

  41. Example 1: Matrix-Vector product on a linear array • At step j, xj enters processor P1. At each step, processor Pi receives (when available) a value from its left and a value from the top, and updates its partial sum as follows: • yi = yi + aij · xj, j = 1, 2, 3, … • The values xj and aij reach processor Pi at the same time, at step (i + j - 1): • (x1, a11) reach P1 at step 1 = (1 + 1 - 1) • (x3, a13) reach P1 at step 3 = (1 + 3 - 1) • In general, yi is completed at step n + i - 1

  42. Example 1: Matrix-Vector product on a linear array • The computation is complete when x4 and a44 reach processor P4, at step n + n - 1 = 2n - 1 • Conclusion: the algorithm requires (2n - 1) steps; at each step, each active processor performs one addition and one multiplication • Complexity of the algorithm: O(n)
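The schedule can be checked with a small sequential Python simulation (a sketch, not from the slides): it uses only the timing fact stated above, that xj and aij meet at processor Pi at step i + j - 1, and runs for 2n - 1 steps.

```python
def systolic_matvec(A, X):
    """Simulate the linear systolic array: x_j meets a_ij at P_i at step i+j-1."""
    n = len(A)
    Y = [0] * n
    for step in range(1, 2 * n):          # steps 1 .. 2n-1
        for i in range(1, n + 1):         # processor Pi
            j = step - i + 1              # index of the x value arriving now
            if 1 <= j <= n:               # Pi is active at this step
                Y[i - 1] += A[i - 1][j - 1] * X[j - 1]
    return Y

A = [[1, 2], [3, 4]]
X = [1, 1]
print(systolic_matvec(A, X))  # [3, 7]
```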

  43.–45. Example 1: Matrix-Vector product on a linear array Systolic array algorithm: time-cost trace for n = 4 (each active processor performs one addition and one multiplication per step):

Step 1: active P1; idle P2, P3, P4
Step 2: active P1, P2; idle P3, P4
Step 3: active P1, P2, P3; idle P4
Step 4: active P1, P2, P3, P4 (y1 = Σj a1j xj completed)
Step 5: active P2, P3, P4; idle P1 (y2 completed)
Step 6: active P3, P4; idle P1, P2 (y3 completed)
Step 7: active P4; idle P1, P2, P3 (y4 completed)

  46. Example 2: Matrix multiplication on a 2-D n×n mesh • Given two n×n matrices A = [aij] and B = [bij], i, j ∈ [1, n] • Compute the product C = AB, where C is given by cij = ai1 b1j + ai2 b2j + … + ain bnj

  47. Example 2: Matrix multiplication on a 2-D n×n mesh • At step i, row i of A (starting with ai1) is entered from the top into column i of the mesh (through processor P1i) • At step j, column j of B (starting with b1j) is entered from the left into row j of the mesh (through processor Pj1) • The values aik and bkj reach processor Pji at step (i + j + k - 2) • At the end of this step, aik is sent down and bkj is sent right

  48. Example 2: Matrix multiplication on a 2-D n×n mesh Example: systolic mesh algorithm for n = 4, step 1 [Figure: the skewed rows of A wait above the 4×4 mesh and the skewed columns of B wait to its left; a11 and b11 are about to enter processor (1,1).]

  49. Example 2: Matrix multiplication on a 2-D n×n mesh Example: systolic mesh algorithm for n = 4, step 5 [Figure: snapshot at step 5; the skewed fronts of A and B have advanced diagonally through the mesh.]

  50. Example 2: Matrix multiplication on a 2-D n×n mesh Analysis • To determine the number of steps needed to complete the multiplication of the matrices, we must find the step at which the terms ann and bnn reach processor Pnn • Values aik and bkj reach processor Pji at step i + j + k - 2 • Substituting n for i, j and k yields: n + n + n - 2 = 3n - 2 • Complexity of the solution: O(n)
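The mesh schedule can likewise be checked with a sequential Python sketch (an illustration, not from the slides), using only the fact that aik and bkj meet at the processor computing cij at step i + j + k - 2, so the last update happens at step 3n - 2:

```python
def systolic_matmul(A, B):
    """Simulate the 2-D systolic mesh: a_ik and b_kj meet at step i+j+k-2."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    last_step = 0
    for step in range(1, 3 * n - 1):      # steps 1 .. 3n-2
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                k = step - i - j + 2      # index of the pair arriving now
                if 1 <= k <= n:           # this processor is active
                    C[i - 1][j - 1] += A[i - 1][k - 1] * B[k - 1][j - 1]
                    last_step = step
    return C, last_step

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, steps = systolic_matmul(A, B)
print(C)      # [[19, 22], [43, 50]]
print(steps)  # 4, i.e. 3n - 2 for n = 2
```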
