
Creating Coarse-grained Parallelism for Loop Nests

Chapter 6, Sections 6.3 through 6.9

Yaniv Carmeli


Last time…

  • Single loop methods

    • Privatization

    • Loop distribution

    • Alignment

    • Loop Fusion


This time…

  • Perfect Loop Nests

    • Loop Interchange

    • Loop Selection

    • Loop Reversal

    • Loop Skewing

    • Profitability-Based Methods


This time…

  • Imperfectly Nested Loops

    • Multilevel Loop Fusion

    • Parallel Code Generation

  • Packaging Parallelism

    • Strip Mining

    • Pipeline Parallelism

    • Guided Self Scheduling


Loop Interchange

DO I = 1, N
  DO J = 1, M
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , = )

  • Vectorization: OK

  • Parallelization: Problematic

After loop interchange:

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO

  • Vectorization: Bad

  • Parallelization: Good


Loop Interchange (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , < )

  • Loop interchange doesn’t work, as both loops carry the dependence!

Best we can do:

DO I = 1, N
  PARALLEL DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  END PARALLEL DO
ENDDO

When can a loop be moved to the outermost position in the nest, and be guaranteed to be parallel?


Loop Interchange (Cont.)

  • Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that loop contains only “=” entries.

  • Proof. If: A column with only “=” entries represents a loop that can be interchanged to the outermost position and carries no dependence.

  • Only if: Suppose there is a non-“=” entry in that column:

    • If it is “>”, the loop cannot be interchanged to the outermost position (the dependence would be reversed).

    • If it is “<”, the loop can be interchanged, but it then carries the dependence, so it cannot be parallelized anyway.
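The theorem can be checked mechanically. The sketch below assumes a direction matrix stored as a list of rows of the strings "<", "=", ">", one column per loop; the function name is illustrative and not from the slides.

```python
def outermost_parallel_loops(direction_matrix):
    """Return the column indices whose entries are all '='.

    By the theorem above, exactly these loops can be moved to the
    outermost position and run in parallel.
    """
    if not direction_matrix:
        return []
    n_loops = len(direction_matrix[0])
    return [c for c in range(n_loops)
            if all(row[c] == "=" for row in direction_matrix)]


# Columns are (I, J).
# A(I+1, J) = A(I, J) + B(I, J) has D = (<, =): only J qualifies.
print(outermost_parallel_loops([["<", "="]]))
# A(I+1, J+1) = A(I, J) + B(I, J) has D = (<, <): no loop qualifies.
print(outermost_parallel_loops([["<", "<"]]))
```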


Loop Interchange (Cont.)

  • Working with the direction matrix

    • 1. Move every loop whose column has only “=” entries to the outermost position and parallelize it. Remove its column from the matrix.

    • 2. Move the loop with the most “<” entries to the next outermost position and sequentialize it. Eliminate its column and any rows representing dependences it carries.

    • 3. Repeat from step 1.
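A minimal sketch of this procedure, under two assumptions not stated on the slide: a loop is only sequentialized if its column has no ">" in the remaining rows (so moving it outward is legal), and ties on the number of "<" entries go to the loop that was originally outermost. The names are illustrative.

```python
def plan_loops(names, matrix):
    """Greedy loop selection on a direction matrix.

    names:  loop names, outermost first (one per matrix column).
    matrix: rows of '<', '=', '>' entries.
    Returns [(loop_name, 'parallel' | 'sequential'), ...], outermost first.
    """
    loops = list(range(len(names)))
    rows = [list(r) for r in matrix]
    plan = []
    while loops:
        # Step 1: loops whose column is all '=' carry nothing: parallelize.
        free = [c for c in loops if all(r[c] == "=" for r in rows)]
        if free:
            for c in free:
                plan.append((names[c], "parallel"))
                loops.remove(c)
            continue
        # Step 2: among loops that may legally move outward (no '>'),
        # sequentialize the one with the most '<' entries.
        legal = [c for c in loops if all(r[c] != ">" for r in rows)]
        if not legal:
            raise ValueError("not a valid direction matrix")
        pick = max(legal, key=lambda c: sum(r[c] == "<" for r in rows))
        plan.append((names[pick], "sequential"))
        loops.remove(pick)
        # Rows with '<' in the chosen column are carried by it: drop them.
        rows = [r for r in rows if r[pick] != "<"]
    return plan


# Direction matrix with rows (<,=,=), (=,=,<), (<,<,<) over loops I, J, K.
example = [["<", "=", "="], ["=", "=", "<"], ["<", "<", "<"]]
print(plan_loops(["I", "J", "K"], example))
```

On this matrix the sketch sequentializes I, parallelizes J, and leaves K sequential.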


Loop Interchange (Cont.)

  • Example:

DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1, J,K) = A(I, J,K) + X1
      B(I, J,K+1) = B(I, J,K) + X2
      C(I+1, J+1,K+1) = C(I, J,K) + X3
    ENDDO
  ENDDO
ENDDO

Direction matrix:

< = =
= = <
< < <

Result:

DO I = 1, N
  PARALLEL DO J = 1, M
    DO K = 1, L
      A(I+1, J,K) = A(I, J,K) + X1
      B(I, J,K+1) = B(I, J,K) + X2
      C(I+1, J+1,K+1) = C(I, J,K) + X3
    ENDDO
  END PARALLEL DO
ENDDO


Loop Selection: Optimal?

< < = =
< = < =
< = = <
= < = =
= = < =
= = = <

  • Is the approach of selecting the loop with the most “<” directions optimal?

  • It results in NO parallelization for this matrix,

  • while other selections may allow parallelization.

Is it possible to derive a selection heuristic that provides optimal code?


Loop Selection

< < = =
< = < =
< = = <
= < = =
= = < =
= = = <

  • The problem of loop selection is NP-complete

    • Loop selection is best done by a heuristic!

  • Favor the selection of loops that must be sequentialized before parallelism can be uncovered.


Heuristic Loop Selection (Cont.)

  • Example of the principles involved in heuristic loop selection

    DO I = 2, N
      DO J = 2, M
        DO K = 2, L
          A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
                       A(I, J+1, K+1) + A(I-1, J, K+1)
        ENDDO
      ENDDO
    ENDDO

Direction matrix:

= < =
< = <
= < <
< = >

The J-loop must be sequentialized because of the first dependence.

The I-loop must be sequentialized because of the fourth dependence.

DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
                   A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO


Loop Reversal

  • Using loop reversal to create coarse-grained parallelism. Consider:

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix:

= < >
< = >

After reversing the K loop:

DO I = 2, N+1
  DO J = 2, M+1
    DO K = L, 1, -1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix:

= < <
< = <

After moving the K loop to the outermost position:

DO K = L, 1, -1
  DO I = 2, N+1
    DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

The K loop now carries both dependences, so the inner loops can run in parallel:

DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  END PARALLEL DO
ENDDO
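The effect of reversal on the direction matrix can be sketched as follows. The helper names are illustrative, and the legality test used here is the standard one: every row's leftmost non-"=" entry must be "<".

```python
FLIP = {"<": ">", ">": "<", "=": "="}


def reverse_loop(matrix, col):
    """Reversing a loop flips every direction in its column."""
    return [[FLIP[e] if c == col else e for c, e in enumerate(row)]
            for row in matrix]


def move_outermost(matrix, col):
    """Interchange the loop in column `col` to the outermost position."""
    return [[row[col]] + list(row[:col]) + list(row[col + 1:])
            for row in matrix]


def is_legal(matrix):
    """Legal iff the leftmost non-'=' entry of every row is '<'."""
    for row in matrix:
        for e in row:
            if e == "<":
                break
            if e == ">":
                return False
    return True


# Columns are (I, J, K) for the nest in this section.
original = [["=", "<", ">"], ["<", "=", ">"]]
print(is_legal(move_outermost(original, 2)))   # K outermost as is: illegal
reversed_k = reverse_loop(original, 2)
print(is_legal(move_outermost(reversed_k, 2))) # after reversal: legal
```

After reversal and interchange, every row starts with "<" in the K column, which is why the I and J loops no longer carry any dependence.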


Loop Skewing

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO

Distance matrix:

0 1 0
1 0 0
0 0 1
0 0 0

Direction matrix:

= < =
< = =
= = <
= = =

Skewing with k = K + I + J yields:

DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO

Direction matrix after skewing:

= < <
< = <
= = <
= = =

After moving the k loop to the outermost position:

DO k = 5, N+M+L+2
  PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-3)
    PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1)
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    END PARALLEL DO
  END PARALLEL DO
ENDDO
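A small check of the skewed iteration space may help. The sketch below enumerates the wavefronts k = I + J + K using bounds derived directly from 2 ≤ I ≤ N+1, 2 ≤ J ≤ M+1 and 1 ≤ K ≤ L (so k runs from 5 to N+M+L+2, and I is at most k-3), and compares them with the original nest. The function names are illustrative.

```python
def original_points(N, M, L):
    """Iteration space of the original (I, J, K) nest."""
    return {(i, j, k)
            for i in range(2, N + 2)
            for j in range(2, M + 2)
            for k in range(1, L + 1)}


def wavefront_points(N, M, L):
    """Iterations grouped by wavefront kk = I + J + K, outermost first."""
    fronts = []
    for kk in range(5, N + M + L + 3):              # kk = 5 .. N+M+L+2
        front = []
        for i in range(max(2, kk - M - L - 1), min(N + 1, kk - 3) + 1):
            for j in range(max(2, kk - i - L), min(M + 1, kk - i - 1) + 1):
                front.append((i, j, kk - i - j))    # original K = kk - I - J
        fronts.append(front)
    return fronts


fronts = wavefront_points(2, 3, 4)
flat = [pt for front in fronts for pt in front]
print(set(flat) == original_points(2, 3, 4), len(flat) == len(set(flat)))
```

Every dependence source, (I, J-1, K), (I-1, J, K) or (I, J, K-1), lies on wavefront k-1, so all iterations within one wavefront are independent. The wavefronts also differ in size, which is the load imbalance that skewing introduces.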


Loop Skewing - Main Benefits

  • Eliminates “>” signs in the matrix

  • Transforms the skewed loop so that, after outward interchange, it carries all dependences formerly carried by the loop with respect to which it is skewed


Loop Skewing - Drawback

  • The resulting parallelism is usually unbalanced: the resulting loop executes a variable number of iterations each time.

    • As we shall see, this is not really a problem for asynchronous parallelism (unlike vectorization).


Loop Skewing (Cont.)

  • Updated strategy

    1. Parallelize the outermost loop if possible.

    2. Sequentialize at most one outer loop to find parallelism in the next loop.

    3. If 1 and 2 fail, try skewing.

    4. If 3 fails, sequentialize the loop that can be moved to the outermost position and covers the most other loops.


  • In practice:

    • Sometimes execution times are much worse than they would have been had we parallelized fewer or different loops.


Profitability-Based Methods

  • Static performance estimation function

    • No need to be accurate, just good at selecting the better of two alternatives

  • Key considerations

    • Cost of memory references

    • Sufficiency of granularity


Profitability-Based Methods (Cont.)

  • Impractical to choose from all arrangements

  • Consider only subset of the possible code arrangements, based on properties of the cost function

    • In our case: consider only the inner-most loop


Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

  • Subdivide all the references in the loop body into reference groups

    • Two references are in the same group if:

      • There is a loop independent dependence between them.

      • There is a constant-distance loop carried dependence between them.


Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

  • Determine whether subsequent accesses to the same reference are

    • Loop invariant

      • Cost = 1

    • Unit stride

      • Cost = number of iterations / cache line size

    • Non-unit stride

      • Cost = number of iterations


Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

  • Compute loop cost:


Profitability-Based Methods: Example

DO I = 1, N

DO J = 1, N

DO K = 1, N

C(I, J) = C(I, J) + A(I, K) * B(K, J)

ENDDO

ENDDO

ENDDO


Cost of each reference group for each choice of innermost loop (L is the cache line size):

Innermost loop    C      A      B      Cost
K                 1      N      N/L    N^3(1+1/L) + N^2
J                 N      1      N      2N^3 + N^2          (Worst)
I                 N/L    N/L    1      2N^3/L + N^2        (Best)
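The table can be reproduced with a short sketch. The loop-cost formula itself is not reproduced in this transcript, so the code assumes what the table implies: the per-reference costs are summed and multiplied by the trip counts of the two other loops (N each). It also assumes Fortran column-major storage, so the first subscript is the unit-stride one. All names are illustrative.

```python
def reference_cost(subscripts, inner, n, line):
    """Cost of one reference group when `inner` is the innermost loop."""
    if inner not in subscripts:
        return 1              # loop invariant
    if subscripts[0] == inner:
        return n / line       # unit stride (column-major: first subscript)
    return n                  # non-unit stride


def loop_cost(inner, refs, n, line):
    """Assumed loop cost: sum of reference costs times the other trip counts."""
    return sum(reference_cost(s, inner, n, line) for s in refs) * n * n


# One entry per reference group of C(I,J) = C(I,J) + A(I,K) * B(K,J).
REFS = [("I", "J"), ("I", "K"), ("K", "J")]


def innermost_to_outermost(n, line):
    """Loops ordered by increasing cost, i.e. innermost first."""
    return sorted("IJK", key=lambda v: loop_cost(v, REFS, n, line))


for v in "KJI":
    print(v, loop_cost(v, REFS, 100, 4))
print(innermost_to_outermost(100, 4))
```

With N = 100 and a line size of 4, the three costs match the closed forms in the table.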


Profitability-Based Methods: Example

  • Reorder the loops from innermost to outermost by increasing loop cost: I, K, J

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

We can’t always have the desired loop order, as some permutations are illegal. Try to find the possible permutation closest to the desired one.


Profitability-Based Methods (Cont.)

  • Goal: Given a desired loop order and a direction matrix for a loop nest, find the legal permutation closest to the desired one.

  • Method:

    Until there are no more loops:

    From all the loops that can be interchanged to the outermost position, choose the one that is outermost in the desired permutation. Drop that loop.

It can be shown that if a legal permutation with the desired innermost loop in the innermost position exists, this algorithm will find such a permutation.
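A minimal sketch of this method, assuming a loop can be interchanged to the outermost position when its column has no ">" among the dependences not yet carried, and that dropping a loop also drops the rows it carries. The function and parameter names are illustrative.

```python
def closest_legal_order(desired, names, matrix):
    """Return a legal loop order (outermost first) close to `desired`.

    desired: loop names, outermost first, as we would like them.
    names:   loop names in the original nest order (matrix columns).
    matrix:  direction matrix rows of '<', '=', '>'.
    """
    col = {name: i for i, name in enumerate(names)}
    rows = [list(r) for r in matrix]
    remaining = list(desired)
    order = []
    while remaining:
        # Take the first loop in the desired order that may move outermost.
        for name in remaining:
            c = col[name]
            if all(r[c] != ">" for r in rows):
                break
        else:
            raise ValueError("no loop can legally move outermost")
        order.append(name)
        remaining.remove(name)
        # Dependences carried by the chosen loop constrain nothing further.
        rows = [r for r in rows if r[c] != "<"]
    return order


# Matrix multiply: the only dependence, on C(I, J), is carried by K.
print(closest_legal_order(["J", "K", "I"], ["I", "J", "K"], [["=", "=", "<"]]))
# A (<, >) dependence forbids putting J outside I.
print(closest_legal_order(["J", "I"], ["I", "J"], [["<", ">"]]))
```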


Profitability-Based Methods (Cont.)

DO J = 1, N

DO K = 1, N

DO I = 1, N

C(I, J) = C(I, J) + A(I, K) * B(K, J)

ENDDO

ENDDO

ENDDO

  • For performance reasons, the compiler may mark the inner loop as “not meant for parallelization”: running it sequentially exploits locality in memory accesses.


Multilevel Loop Fusion

  • Commonly used for imperfect loop nests

  • Used after maximal loop distribution


Multilevel Loop Fusion

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

After distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
ENDDO

DO I = 1, N
  DO J = 1, M
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

After parallelizing each nest:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = B(I, J) + D
  ENDDO
END PARALLEL DO

After distribution, each nest is better with a different outer loop. Can’t fuse!


Multilevel Loop Fusion (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
    C(I, J+1) = A(I, J) + C(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

After maximal distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
  ENDDO
ENDDO

DO I = 1, N
  DO J = 1, M
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
ENDDO

DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
ENDDO

DO I = 1, N
  DO J = 1, M
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

[Figure: fusion graph. A (parallel in i or j) feeds B (parallel in j) and C (parallel in i); B and C feed D (parallel in j).]

Which loop should be fused into the A loop?


Multilevel Loop Fusion (Cont.)

Fusing A loop with B loop

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

[Figure: fusion graph after fusing A with B: AB (j), C (i), D (j).]

2 barriers


Multilevel Loop Fusion (Cont.)

Fusing A loop with C loop

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

Now we can also fuse B-D

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

[Figure: fusion graph after fusing A with C: AC (i), B (j), D (j); then AC (i), BD (j).]

1 barrier


Multilevel Loop Fusion (Cont.)

[Figure: the fusion graph of A (i, j), B (j), C (i) and D (j) again, annotated “A barrier is inevitable!!”]

  • Decision making needs look-ahead

  • Strategy: Fuse with the loop that cannot be fused with one of its successors

    Rationale: If it can’t be fused with its successors, a barrier will be formed anyway.


Parallel Code Generation

Code generation scheme:

Parallelize(l,D)

  • Try methods for perfect nests (loop interchange, loop skewing, loop reversal), and stop if parallelism is found.

  • If nest can be distributed: distribute, run recursively on the distributed nests, and merge.

  • Else sequentialize outer loop, eliminate the dependences it carries, and try recursively on each of the loops nested in it.


Parallel Code Generation

procedure Parallelize(l, Dl);
  ParallelizeNest(l, success); // try methods for perfect nests
  if ¬success then begin
    if l can be distributed then begin
      distribute l into loop nests l1, l2, …, ln;
      for i := 1 to n do begin
        Parallelize(li, Di);
      end
      Merge({l1, l2, …, ln});
    end
    else begin // l cannot be distributed
      for each outer loop l0 nested in l do begin
        let D0 be the set of dependences between statements in l0 less dependences carried by l;
        Parallelize(l0, D0);
      end
      let S be the set of outer loops and statements left in l;
      if ||S|| > 1 then Merge(S);
    end
  end
end Parallelize


Parallel Code Generation (Cont.)

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

Both loops carry dependence, so loop interchange will not find sufficient parallelism.

Try distribution…

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
ENDDO

I loop can be parallelized

DO J = 1, M
  DO I = 1, N
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

Both loops can be parallelized

After parallelizing each nest:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
END PARALLEL DO

Type: (I-loop, parallel)

PARALLEL DO J = 1, M
  DO I = 1, N !Left sequential for memory hierarchy
    X(I, J) = A(I, J) + C
  ENDDO
END PARALLEL DO

Type: (J-loop, parallel)

Now fusing…

Different types: can’t fuse


Parallel Code Generation (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
    C(I, J+1) = A(I, J) + C(I,J)
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
ENDDO

[Figure: fusion graphs over the nodes A, B, C, D]

After distribution and parallelization of each nest:

PARALLEL DO J = 1, M
  DO I = 1, N !Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

After fusing the outer loops of the A and B nests:

PARALLEL DO J = 1, M
  DO I = 1, N !Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
  ENDDO
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

After fusing their inner loops as well:

PARALLEL DO J = 1, M
  DO I = 1, N !Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I,J)
  ENDDO
END PARALLEL DO

J loop, parallel

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I,J)
  ENDDO
END PARALLEL DO

I loop, parallel

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
  ENDDO
END PARALLEL DO

J loop, parallel


Erlebacher

DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)

DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)

DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0

DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)

DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)

After parallelizing each nest:

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)

DO K = 2, N-1
  PARALLEL DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)

DO K = 2, N-1
  PARALLEL DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)


Erlebacher

PARALLEL DO J = 1, JMAXD
  L1: DO I = 1, IMAXD
        F(I, J, 1) = F(I, J, 1) * B(1)
  L2: DO K = 2, N-1
        DO I = 1, IMAXD
          F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
  L3: DO I = 1, IMAXD
        TOT(I, J) = 0.0
  L4: DO I = 1, IMAXD
        TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  L5: DO K = 2, N-1
        DO I = 1, IMAXD
          TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
END PARALLEL DO

[Figure: fusion graph over the inner nests L1, L2, L3, L4, L5]


Erlebacher

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO


Packaging of Parallelism

  • Trade-off between parallelism and granularity of synchronization.

    • Larger work units mean synchronization is needed less frequently, but at the cost of less parallelism and poorer load balance.


Strip Mining

  • Converts available parallelism into a form more suitable for the hardware

    DO I = 1, N
      A(I) = A(I) + B(I)
    ENDDO

    k = CEIL (N / P)
    PARALLEL DO I = 1, N, k
      DO i = I, MIN(I + k-1, N)
        A(i) = A(i) + B(i)
      ENDDO
    END PARALLEL DO

    • Interruptions may be disastrous

The value of P is unknown until runtime, so strip mining is often handled by special hardware (Convex C2 and C3)
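As an illustration, the strip bounds can be computed at run time once P is known. This sketch uses a thread pool as a stand-in for the PARALLEL DO; the function names are illustrative, and n ≥ 1 and p ≥ 1 are assumed.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def strips(n, p):
    """1-based inclusive (first, last) bounds of each strip, k = ceil(n / p)."""
    k = math.ceil(n / p)
    return [(lo, min(lo + k - 1, n)) for lo in range(1, n + 1, k)]


def strip_mined_add(a, b, p):
    """A(I) = A(I) + B(I), with one strip per unit of parallel work."""
    def work(bounds):
        lo, hi = bounds
        for i in range(lo, hi + 1):      # the sequential inner loop
            a[i - 1] += b[i - 1]         # Python lists are 0-based
    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(work, strips(len(a), p)))
    return a


print(strips(10, 4))
print(strip_mined_add([1, 2, 3, 4, 5], [10, 10, 10, 10, 10], 2))
```

Each strip touches a disjoint range of A, so the strips need no synchronization among themselves.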


Strip Mining (Cont.)

  • What if the execution time varies among iterations?

    PARALLEL DO I = 1, N
      DO J = 2, I
        A(J, I) = A(J-1, I) * 2.0
      ENDDO
    END PARALLEL DO

Solution: use a smaller unit size to allow a more balanced distribution


Pipeline Parallelism

  • The Fortran DOACROSS construct pipelines parallel loop iterations with cross-iteration synchronization.

  • Useful where full parallelization is not available

  • High synchronization costs

DOACROSS I = 2, N
  S1: A(I) = B(I) + C(I)
      POST(EV(I))
      IF (I>2) WAIT (EV(I-1))
  S2: C(I) = A(I-1) + A(I)
ENDDO
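The POST/WAIT pattern can be emulated with one thread per iteration and one event per iteration. This is only an illustrative sketch of the synchronization, not how a DOACROSS runtime is built; the function names are assumptions, and the lists are treated as 1-based with index 0 unused.

```python
import threading


def doacross(a, b, c):
    """Run the DOACROSS loop above; a, b, c are 1-based (index 0 unused)."""
    n = len(a) - 1
    ev = [threading.Event() for _ in range(n + 1)]

    def iteration(i):
        a[i] = b[i] + c[i]          # S1
        ev[i].set()                 # POST(EV(I))
        if i > 2:
            ev[i - 1].wait()        # WAIT(EV(I-1)): A(I-1) is now ready
        c[i] = a[i - 1] + a[i]      # S2

    threads = [threading.Thread(target=iteration, args=(i,))
               for i in range(2, n + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def sequential(a, b, c):
    """Reference: the same loop executed in order."""
    for i in range(2, len(a)):
        a[i] = b[i] + c[i]
        c[i] = a[i - 1] + a[i]


a1, b1, c1 = [0, 1, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5], [0, 5, 4, 3, 2, 1]
a2, c2 = list(a1), list(c1)
doacross(a1, b1, c1)
sequential(a2, b1, c2)
print(a1 == a2 and c1 == c2)
```

All S1 instances may overlap freely; only S2 of iteration I has to wait for S1 of iteration I-1.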


Scheduling Parallel Work

[Figure: trade-off between load balance and little synchronization]


Scheduling Parallel Work

  • Parallel execution is slower than serial execution if

  • Bakery-counter scheduling

    • Moderate synchronization overhead

N: number of iterations

B: time of one iteration

p: number of processors

σ0: constant overhead per processor


Guided Self-Scheduling

  • Incorporates some level of static scheduling to guide dynamic self-scheduling

    • Schedules groups of iterations

    • Goes from large to small chunks of work

  • The number of iterations dispensed at time t follows:


Guided Self-Scheduling (Cont.)

  • GSS: (20 iterations, 4 processors)

  • Not completely balanced

  • Required synchronizations: 9

    In bakery counter: 20


Guided Self-Scheduling (Cont.)

  • In the example, the last 4 allocations are each for a single iteration. Coincidence?

    • The last p-1 allocations will always be of 1 iteration.

  • GSS(2): No block of iterations smaller than 2

  • GSS(k): No block is smaller than k
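The dispensing formula is not reproduced in this transcript, so the sketch below assumes the usual guided self-scheduling rule: each request receives ⌈remaining / p⌉ iterations, with k as the minimum block size for GSS(k). Under that assumption it reproduces the 9 dispatches quoted for 20 iterations on 4 processors. The function name is illustrative.

```python
import math


def gss_chunks(n, p, k=1):
    """Chunk sizes handed out by GSS(k) for n iterations on p processors."""
    chunks = []
    remaining = n
    while remaining > 0:
        # Assumed rule: ceil(remaining / p), never below k,
        # and never more than what is left.
        size = min(remaining, max(k, math.ceil(remaining / p)))
        chunks.append(size)
        remaining -= size
    return chunks


print(gss_chunks(20, 4))       # plain GSS
print(gss_chunks(20, 4, k=2))  # GSS(2)
```

Raising k trades a little load balance at the tail for fewer synchronizations.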


Yaniv Carmeli

B.A. in CS

Thanks for your attention!