
Using a Compiler-Directed Approach to Create MPI Code Automatically: Paraguin Compiler Patterns

This presentation discusses the Paraguin compiler, developed by Dr. C. Ferner, which uses patterns such as Scatter/Gather, Broadcast/Reduce, and Stencil to automatically generate MPI code for parallel programs. It includes examples of these patterns applied to matrix addition, numerical integration of a function, and Jacobi iteration.


Presentation Transcript


  1. Using a Compiler-Directed Approach to Create MPI Code Automatically: Paraguin Compiler Patterns
ITCS 4145/5145, Parallel Programming
Clayton Ferner / B. Wilkinson, March 11, 2014. ParagionSlides2abw.ppt

  2. The Paraguin compiler is being developed by Dr. C. Ferner, UNC-Wilmington. The following material is based upon his slides.

  3. Patterns
• As of right now, there are only two patterns implemented in Paraguin:
• Scatter/Gather
• Stencil

  4. Scatter/Gather
[Diagram: the master (rank 0) prepares the input; the input is scattered to all processors (ranks 0 through 7); the processors work independently (no communication); the partial results are then gathered back to rank 0 to build the final result.]

  5. Scatter/Gather
• This pattern is expressed as a template rather than a single pragma (a rough MPI equivalent is sketched below):
• Master prepares input
• Scatter input
• Compute partial results
• Gather partial results into the final result
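
For reference, the MPI that this template corresponds to looks roughly like the sketch below. This is a minimal hand-written illustration, not Paraguin's literal output; the function name, buffer names, the placeholder computation, and the assumption that N divides evenly by the number of processes are all ours.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: scatter row blocks of the input, compute independently,
   gather the partial results back on the master. */
void scatter_gather_sketch(double *input, double *result, int N, int M) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int rows = N / size;  /* rows per process (assumes N % size == 0) */
    double *in_part  = malloc(rows * M * sizeof(double));
    double *out_part = malloc(rows * M * sizeof(double));
    /* The master's input is distributed in contiguous row blocks */
    MPI_Scatter(input, rows * M, MPI_DOUBLE, in_part, rows * M, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    for (int i = 0; i < rows * M; i++)   /* independent local work */
        out_part[i] = in_part[i] * 2.0;  /* placeholder computation */
    /* Partial results are collected back on the master (rank 0) */
    MPI_Gather(out_part, rows * M, MPI_DOUBLE, result, rows * M, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    free(in_part);
    free(out_part);
}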

  6. Scatter/Gather Example: Matrix Addition

int main(int argc, char *argv[]) {
  int i, j, error = 0;
  double a[N][N], b[N][N], c[N][N];
  char *usage = "Usage: %s file\n";
  FILE *fd;

  // Make sure we have the correct number of arguments
  if (argc < 2) {
    fprintf(stderr, usage, argv[0]);
    error = -1;
  }

  // Make sure we can open the input file
  if (!error && (fd = fopen(argv[1], "r")) == NULL) {
    fprintf(stderr, "%s: Cannot open file %s for reading.\n", argv[0], argv[1]);
    fprintf(stderr, usage, argv[0]);
    error = -1;
  }

#pragma paraguin begin_parallel
#pragma paraguin bcast error
  if (error) return error;
#pragma paraguin end_parallel

The variable error is used to stop the other processors: the error code is broadcast to all of them so that they know to exit. If we just had a "return -1" in the two if statements above, only the master would exit and the workers would not, causing a deadlock.

  7. Scatter/Gather Example: Matrix Addition (continued)

// Master prepares input
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    fscanf(fd, "%lf", &a[i][j]);
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    fscanf(fd, "%lf", &b[i][j]);
fclose(fd);

#pragma paraguin begin_parallel

// Scatter input
#pragma paraguin scatter a b

// Compute partial results: parallelize the loop nest, assigning
// iterations of the outermost loop (i) to different partitions.
#pragma paraguin forall
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    c[i][j] = a[i][j] + b[i][j];
  }
}
; // Semicolon prevents the gather pragma from being placed INSIDE the above loop nest

// Gather partial results into the final result
#pragma paraguin gather c
#pragma paraguin end_parallel

  8. More on Scatter/Gather
• The scatter/gather pattern can also use broadcast, reduction, or both:
• Master prepares input
• Broadcast input
• Compute partial results
• Reduce partial results into the final result

  9. Broadcast/Reduce Example: Integration
To demonstrate broadcast/reduce, consider the problem of integrating a function using rectangles: each rectangle has width h and height f(x), so as h approaches zero, the total area of the rectangles approaches the area under the curve y = f(x) between a and b.
[Diagram: the curve y = f(x), with one rectangle of width h spanning x to x + h and height f(x), over the interval from a to b; f(x + h) marks the curve at the rectangle's right edge.]
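
In serial C, this rectangle rule is just a few lines; here is a minimal sketch (the function name integrate is our own):

/* Approximate the integral of f over [a, b] with N left-endpoint rectangles. */
double integrate(double (*f)(double), double a, double b, int N) {
    double h = (b - a) / N;  /* width of each rectangle */
    double area = 0.0;
    for (int i = 0; i < N; i++)
        area += f(a + i * h) * h;  /* height f(x) times width h */
    return area;
}

The parallel version on the next slides broadcasts a, b, and N, splits the loop over the rectangles among the processors, and reduces the partial sums.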

  10. Let f(x) = 4 sin(1.5x) + 5

f(x) needs to run in parallel. Previously, functions needed:
// #pragma paraguin begin_parallel
// #pragma paraguin end_parallel

double f(double x) {
  return 4.0 * sin(1.5 * x) + 5;
}

int main(int argc, char *argv[]) {
  char *usage = "Usage: %s a b N\n";
  int i, error = 0, N;
  double a, b, x, y, h, area, overall_area;

  // Master prepares input.
  // Make sure we have the correct number of arguments
  if (argc < 4) {
    fprintf(stderr, usage, argv[0]);
    error = -1;
  } else {
    a = atof(argv[1]);
    b = atof(argv[2]);
    N = atoi(argv[3]);
    if (b <= a) {
      fprintf(stderr, "a should be smaller than b\n");
      error = -1;
    }
  }

#pragma paraguin begin_parallel
#pragma paraguin bcast error
  if (error) return error;

The variable error is used to stop the other processors: the error code is broadcast to all of them so that they know to exit.
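
As a sanity check (our addition, not part of the original slides): this f has the closed-form antiderivative F(x) = 5x - (8/3)cos(1.5x), so the program's output can be compared against F(b) - F(a).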

  11. Broadcast/Reduce Example: Integration (continued)

; // Semicolon prevents the bcast pragma from being placed INSIDE the above if statement

// Broadcast input
#pragma paraguin bcast a b N
  h = (b - a) / N;
  area = 0.0;

// Compute partial results: since this is a forall loop, each
// processor computes a partition of the rectangles.
#pragma paraguin forall
  for (i = 0; i < N; i++) {
    x = a + i * h;
    y = f(x);
    area += y * h;
  }
; // Semicolon prevents the reduce pragma from being placed INSIDE the above loop nest

// Reduce partial results into the final result
#pragma paraguin reduce sum area overall_area
#pragma paraguin end_parallel

The final area is in overall_area.
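
For reference, the reduce pragma corresponds roughly to a single MPI collective; a hand-written sketch (not Paraguin's literal output):

/* Sum each process's partial area into overall_area on rank 0. */
MPI_Reduce(&area, &overall_area, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);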

  12. Stencil Pattern

  13. Jacobi Iteration

  14. Basic Jacobi Iteration

int main() {
  int i, j, time;
  double A[N][M], B[N][M];
  // A is initialized with data somehow

  for (time = 0; time < MAX_ITERATION; time++) {
    // Skip the boundary values
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        // Multiplying by 0.25 is faster than dividing by 4.0
        B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;

    // Newly computed values are placed in a new array,
    // then copied back into the original.
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        A[i][j] = B[i][j];
  }
  ...
}

  15. Improved Jacobi Iteration Used in Paraguin

Add another dimension of size 2 to A: A[0] is the old A and A[1] is the old B. We toggle between the two copies of the array, which avoids copying values back into the original array.

int main() {
  int i, j, time, current, next;
  double A[2][N][M];
  // A[0] is initialized with data somehow and duplicated into A[1]

  current = 0;
  next = (current + 1) % 2;
  for (time = 0; time < MAX_ITERATION; time++) {
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                         A[current][i][j-1] + A[current][i][j+1]) * 0.25;
    // Toggle between the two copies of the array
    current = next;
    next = (current + 1) % 2;
  }
  // Final result is in A[current]
  ...
}

  16. Partitioning: Row versus Block Partitioning
• With block partitioning, we would need to communicate data across both rows and columns. This results in too much communication (too fine a granularity).
• With row partitioning, each processor needs to communicate with at most 2 other processors.
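
As a rough comparison (our numbers, assuming an N x N grid and p processes): with row partitioning, each processor exchanges at most 2 boundary rows of N values per iteration, i.e. at most 2 messages. With block partitioning into sqrt(p) x sqrt(p) blocks, each processor exchanges up to 4 block edges of N/sqrt(p) values each, i.e. up to 4 messages (plus corner values for some stencils). Each message is shorter, but the per-message startup cost is paid about twice as often, which is what makes the granularity too fine.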

  17. Communication Pattern with Row Partitioning

  18. Paraguin Stencil Pragma

A stencil pattern is specified with a stencil pragma:

#pragma paraguin stencil <data> <#rows> <#cols> <max_iterations> <fname>

where:
• <data> is a 3-dimensional array of size 2 x <#rows> x <#cols>
• <max_iterations> is the number of iterations of the time loop
• <fname> is the name of a function that performs each calculation

  19. Paraguin Stencil Pragma: Function <fname>

The function that performs each calculation should be declared as:

<type> <fname> (<type> <data>[ ][ ], int i, int j)

where <type> is the base type of the array and (i, j) is the location in the array to be computed. The function should calculate and return the value at location <data>[i][j]. It should not modify that value, but simply return the new value.

  20. Paraguin Stencil Program

// Previously, functions needed:
// #pragma paraguin begin_parallel
// #pragma paraguin end_parallel

int __guin_current = 0; // This is needed to access the last copy of the data

// Function to compute each value
double computeValue(double A[][M], int i, int j) {
  return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

int main() {
  int i, j, n, m, max_iterations;
  double A[2][N][M]; // A has a 3rd dimension of size 2
  // A[0] is initialized with data somehow and duplicated into A[1]

#pragma paraguin begin_parallel
  // All pragma parameters must be literals or variables;
  // no preprocessor constants.
  n = N;
  m = M;
  max_iterations = TOTAL_TIME;
#pragma paraguin stencil A n m max_iterations computeValue
#pragma paraguin end_parallel

  // Final result is in A[__guin_current] or A[max_iterations % 2]
}

  21. The Stencil Pragma is Replaced with Code That:
• Broadcasts the 3-dimensional array given as an argument to the stencil pragma to all available processors.
• Sets __guin_current to zero and __guin_next to one.
• Creates a loop that iterates max_iterations times. Within that loop, code is inserted to perform the steps on the next slide.

  22. Within each iteration:
• Each processor (except the last one) sends its last row to the processor with rank one greater than its own.
• Each processor (except the first one) receives that row from the processor with rank one less than its own.
• Each processor (except the first one) sends its first row to the processor with rank one less than its own.
• Each processor (except the last one) receives that row from the processor with rank one greater than its own.
• Each processor iterates through the values of the rows for which it is responsible and uses the provided function to compute the next value.
• __guin_current and __guin_next are toggled.
• The data is gathered back to the root processor (rank 0).
A hand-written MPI sketch of the row exchange follows below.
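
The four send/receive steps above correspond roughly to the following hand-written MPI. The layout (each process holds its own rows plus one ghost row above and below) and all names are our assumptions; MPI_Sendrecv with MPI_PROC_NULL at the edges expresses the "except the first/last" cases and avoids deadlock:

#include <mpi.h>

/* local holds rows+2 rows of M doubles: ghost row 0, real rows 1..rows,
   ghost row rows+1. Exchange boundary rows with the neighboring ranks. */
void exchange_halos(double *local, int rows, int M, int rank, int size) {
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    /* Send my last real row down; receive my top ghost row from above */
    MPI_Sendrecv(&local[rows * M], M, MPI_DOUBLE, down, 0,
                 &local[0],        M, MPI_DOUBLE, up,   0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my first real row up; receive my bottom ghost row from below */
    MPI_Sendrecv(&local[1 * M],          M, MPI_DOUBLE, up,   1,
                 &local[(rows + 1) * M], M, MPI_DOUBLE, down, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}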

  23. Stopping the Iterations Based Upon a Condition
• The stencil pattern executes a fixed number of iterations.
• What if we want to continue until the data converges to a solution?
• For example, we might stop when the maximum difference between the values in A[0] and A[1] is less than some tolerance, such as 0.0001.
• The problem with doing this in parallel is that it requires communication.

  24. Why Communication is Needed to Test for a Termination Condition
• There are 2 reasons inter-processor communication is needed to test for a termination condition:
• The data is scattered across the processors; and
• The processors must all agree on whether to continue or terminate.
• Parts of the data may converge faster than others, so some processors may decide to stop while others do not.
• Without agreement, there will be a deadlock. (A single collective operation provides this agreement; see the sketch below.)
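
The reduce-then-broadcast sequence Paraguin uses on the following slides is roughly equivalent to a single MPI_Allreduce; a hand-written sketch:

/* Every process contributes its local maximum and receives the global
   maximum, so all ranks make the same continue/terminate decision. */
double global_max_diff;
MPI_Allreduce(&max_diff, &global_max_diff, 1, MPI_DOUBLE, MPI_MAX,
              MPI_COMM_WORLD);
if (global_max_diff <= tol)
    done = 1;  /* all ranks agree */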

  25. Paraguin Stencil Pragma with Termination Condition

// This part is the same as before.
int __guin_current = 0; // This is needed to access the last copy of the data

// Function to compute each value
double computeValue(double A[][M], int i, int j) {
  return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

int main() {
  int i, j, n, m, max_iterations, done;
  double A[2][N][M], diff, max_diff, tol;
  // A[0] is initialized with data somehow and duplicated into A[1]

#pragma paraguin begin_parallel
  // New variable: tol is used to determine whether the termination
  // condition has been met. When the changes in the values are ALL
  // less than tol, the values have converged sufficiently.
  // Note that the initializations are within the parallel region.
  tol = 0.0001;
  n = N;
  m = M;
  max_iterations = TOTAL_TIME;

  26. Paraguin Stencil Pragma with Termination Condition (continued)

  // We need a logically controlled loop rather than a fixed iteration count.
  done = 0; // false
  while (!done) {
    ; // Semicolon makes sure the following pragma is placed inside the while loop
#pragma paraguin stencil A n m max_iterations computeValue

    // Each processor determines the maximum change in the values of its
    // partition: the maximum absolute difference between the old values
    // and the newly computed values. The loop bounds must be 1 and n-1
    // (and 1 and m-1) to match the bounds of the stencil; otherwise the
    // partitioning will be incorrect.
    max_diff = 0.0;
#pragma paraguin forall
    for (i = 1; i < n - 1; i++) {
      for (j = 1; j < m - 1; j++) {
        diff = fabs(A[__guin_current][i][j] - A[__guin_next][i][j]);
        if (diff > max_diff)
          max_diff = diff;
      }
    }

  27. Paraguin Stencil Pragma with Termination Condition (continued)

    ; // Semicolon prevents the pragma from being placed INSIDE the above loop nest

    // Reduce to find the maximum difference across all processors.
    // (The variable diff is being reused here.)
#pragma paraguin reduce max max_diff diff

    // Broadcast diff so that all processes agree whether to continue or terminate.
#pragma paraguin bcast diff

    // Termination condition: the maximum change in the values is within tolerance.
    if (diff <= tol)
      done = 1; // true
  }
#pragma paraguin end_parallel

  // Final result is in A[__guin_current]. We cannot use A[max_iterations % 2]
  // here because the number of iterations actually executed is not fixed.
}

  28. Questions?
