
Using a Compiler-Directed Approach to Create MPI Code Automatically: Paraguin Compiler Patterns

This presentation discusses the Paraguin compiler, developed by Dr. C. Ferner, which uses patterns such as Scatter/Gather, Broadcast/Reduce, and Stencil to automatically generate MPI code for parallel programs. It includes examples of these patterns applied to matrix addition, numerical integration of a function, and Jacobi iteration.


Presentation Transcript


  1. Using a Compiler-Directed Approach to Create MPI Code Automatically: Paraguin Compiler Patterns
ITCS 4145/5145, Parallel Programming
Clayton Ferner / B. Wilkinson, March 11, 2014. ParagionSlides2abw.ppt

  2. The Paraguin compiler is being developed by Dr. C. Ferner, UNC-Wilmington. The following material is based upon his slides.

  3. Patterns
• As of right now, there are only two patterns implemented in Paraguin:
• Scatter/Gather
• Stencil

  4. Scatter/Gather
[Diagram: the master (rank 0) prepares the input; the input is scattered to all processors (ranks 0 through 7); the processors work independently (no communication); the partial results are then gathered back to rank 0 to build the final result.]

  5. Scatter/Gather
• This pattern is expressed as a template rather than a single pragma (a rough MPI equivalent is sketched below):
• Master prepares input
• Scatter input
• Compute partial results
• Gather partial results into the final result
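
For reference, the MPI that this template corresponds to looks roughly like the sketch below. This is a minimal hand-written illustration, not Paraguin's literal output; the function name, buffer names, the placeholder computation, and the assumption that N divides evenly by the number of processes are all ours.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: scatter row blocks of the input, compute independently,
   gather the partial results back on the master. */
void scatter_gather_sketch(double *input, double *result, int N, int M) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int rows = N / size;  /* rows per process (assumes N % size == 0) */
    double *in_part  = malloc(rows * M * sizeof(double));
    double *out_part = malloc(rows * M * sizeof(double));
    /* The master's input is distributed in contiguous row blocks */
    MPI_Scatter(input, rows * M, MPI_DOUBLE, in_part, rows * M, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    for (int i = 0; i < rows * M; i++)   /* independent local work */
        out_part[i] = in_part[i] * 2.0;  /* placeholder computation */
    /* Partial results are collected back on the master (rank 0) */
    MPI_Gather(out_part, rows * M, MPI_DOUBLE, result, rows * M, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    free(in_part);
    free(out_part);
}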

  6. Scatter/Gather Example: Matrix Addition

int main(int argc, char *argv[]) {
  int i, j, error = 0;
  double a[N][N], b[N][N], c[N][N];
  char *usage = "Usage: %s file\n";
  FILE *fd;

  // Make sure we have the correct number of arguments
  if (argc < 2) {
    fprintf(stderr, usage, argv[0]);
    error = -1;
  }

  // Make sure we can open the input file
  if (!error && (fd = fopen(argv[1], "r")) == NULL) {
    fprintf(stderr, "%s: Cannot open file %s for reading.\n", argv[0], argv[1]);
    fprintf(stderr, usage, argv[0]);
    error = -1;
  }

#pragma paraguin begin_parallel
#pragma paraguin bcast error
  if (error) return error;
#pragma paraguin end_parallel

The variable error is used to stop the other processors: the error code is broadcast to all of them so that they know to exit. If we just had a "return -1" in the two if statements above, only the master would exit and the workers would not, causing a deadlock.

  7. Scatter/Gather Example: Matrix Addition (continued)

// Master prepares input
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    fscanf(fd, "%lf", &a[i][j]);
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    fscanf(fd, "%lf", &b[i][j]);
fclose(fd);

#pragma paraguin begin_parallel

// Scatter input
#pragma paraguin scatter a b

// Compute partial results: parallelize the loop nest, assigning
// iterations of the outermost loop (i) to different partitions.
#pragma paraguin forall
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    c[i][j] = a[i][j] + b[i][j];
  }
}
; // Semicolon prevents the gather pragma from being placed INSIDE the above loop nest

// Gather partial results into the final result
#pragma paraguin gather c
#pragma paraguin end_parallel

  8. More on Scatter/Gather
• The scatter/gather pattern can also use broadcast, reduction, or both:
• Master prepares input
• Broadcast input
• Compute partial results
• Reduce partial results into the final result

  9. Broadcast/Reduce Example: Integration
To demonstrate broadcast/reduce, consider the problem of integrating a function using rectangles: each rectangle has width h and height f(x), so as h approaches zero, the total area of the rectangles approaches the area under the curve y = f(x) between a and b.
[Diagram: the curve y = f(x), with one rectangle of width h spanning x to x + h and height f(x), over the interval from a to b; f(x + h) marks the curve at the rectangle's right edge.]
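
In serial C, this rectangle rule is just a few lines; here is a minimal sketch (the function name integrate is our own):

/* Approximate the integral of f over [a, b] with N left-endpoint rectangles. */
double integrate(double (*f)(double), double a, double b, int N) {
    double h = (b - a) / N;  /* width of each rectangle */
    double area = 0.0;
    for (int i = 0; i < N; i++)
        area += f(a + i * h) * h;  /* height f(x) times width h */
    return area;
}

The parallel version on the next slides broadcasts a, b, and N, splits the loop over the rectangles among the processors, and reduces the partial sums.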

  10. Let f(x) = 4 sin(1.5x) + 5

f(x) needs to run in parallel. Previously, functions needed:
// #pragma paraguin begin_parallel
// #pragma paraguin end_parallel

double f(double x) {
  return 4.0 * sin(1.5 * x) + 5;
}

int main(int argc, char *argv[]) {
  char *usage = "Usage: %s a b N\n";
  int i, error = 0, N;
  double a, b, x, y, h, area, overall_area;

  // Master prepares input.
  // Make sure we have the correct number of arguments
  if (argc < 4) {
    fprintf(stderr, usage, argv[0]);
    error = -1;
  } else {
    a = atof(argv[1]);
    b = atof(argv[2]);
    N = atoi(argv[3]);
    if (b <= a) {
      fprintf(stderr, "a should be smaller than b\n");
      error = -1;
    }
  }

#pragma paraguin begin_parallel
#pragma paraguin bcast error
  if (error) return error;

The variable error is used to stop the other processors: the error code is broadcast to all of them so that they know to exit.
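
As a sanity check (our addition, not part of the original slides): this f has the closed-form antiderivative F(x) = 5x - (8/3)cos(1.5x), so the program's output can be compared against F(b) - F(a).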

  11. Broadcast/Reduce Example: Integration (continued)

; // Semicolon prevents the bcast pragma from being placed INSIDE the above if statement

// Broadcast input
#pragma paraguin bcast a b N
  h = (b - a) / N;
  area = 0.0;

// Compute partial results: since this is a forall loop, each
// processor computes a partition of the rectangles.
#pragma paraguin forall
  for (i = 0; i < N; i++) {
    x = a + i * h;
    y = f(x);
    area += y * h;
  }
; // Semicolon prevents the reduce pragma from being placed INSIDE the above loop nest

// Reduce partial results into the final result
#pragma paraguin reduce sum area overall_area
#pragma paraguin end_parallel

The final area is in overall_area.
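
For reference, the reduce pragma corresponds roughly to a single MPI collective; a hand-written sketch (not Paraguin's literal output):

/* Sum each process's partial area into overall_area on rank 0. */
MPI_Reduce(&area, &overall_area, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);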

  12. Stencil Pattern

  13. Jacobi Iteration

  14. Basic Jacobi Iteration

int main() {
  int i, j, time;
  double A[N][M], B[N][M];
  // A is initialized with data somehow

  for (time = 0; time < MAX_ITERATION; time++) {
    // Skip the boundary values
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        // Multiplying by 0.25 is faster than dividing by 4.0
        B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;

    // Newly computed values are placed in a new array,
    // then copied back into the original.
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        A[i][j] = B[i][j];
  }
  ...
}

  15. Improved Jacobi Iteration Used in Paraguin

Add another dimension of size 2 to A: A[0] is the old A and A[1] is the old B. We toggle between the two copies of the array, which avoids copying values back into the original array.

int main() {
  int i, j, time, current, next;
  double A[2][N][M];
  // A[0] is initialized with data somehow and duplicated into A[1]

  current = 0;
  next = (current + 1) % 2;
  for (time = 0; time < MAX_ITERATION; time++) {
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                         A[current][i][j-1] + A[current][i][j+1]) * 0.25;
    // Toggle between the two copies of the array
    current = next;
    next = (current + 1) % 2;
  }
  // Final result is in A[current]
  ...
}

  16. Partitioning: Row versus Block Partitioning
• With block partitioning, we would need to communicate data across both rows and columns. This results in too much communication (too fine a granularity).
• With row partitioning, each processor needs to communicate with at most 2 other processors.
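
As a rough comparison (our numbers, assuming an N x N grid and p processes): with row partitioning, each processor exchanges at most 2 boundary rows of N values per iteration, i.e. at most 2 messages. With block partitioning into sqrt(p) x sqrt(p) blocks, each processor exchanges up to 4 block edges of N/sqrt(p) values each, i.e. up to 4 messages (plus corner values for some stencils). Each message is shorter, but the per-message startup cost is paid about twice as often, which is what makes the granularity too fine.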

  17. Communication Pattern with Row Partitioning

  18. Paraguin Stencil Pragma

A stencil pattern is specified with a stencil pragma:

#pragma paraguin stencil <data> <#rows> <#cols> <max_iterations> <fname>

where:
• <data> is a 3-dimensional array of size 2 x <#rows> x <#cols>
• <max_iterations> is the number of iterations of the time loop
• <fname> is the name of a function that performs each calculation

  19. Paraguin Stencil Pragma: Function <fname>

The function that performs each calculation should be declared as:

<type> <fname> (<type> <data>[ ][ ], int i, int j)

where <type> is the base type of the array and (i, j) is the location in the array to be computed. The function should calculate and return the value at location <data>[i][j]. It should not modify that value, but simply return the new value.

  20. Paraguin Stencil Program

// Previously, functions needed:
// #pragma paraguin begin_parallel
// #pragma paraguin end_parallel

int __guin_current = 0; // This is needed to access the last copy of the data

// Function to compute each value
double computeValue(double A[][M], int i, int j) {
  return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

int main() {
  int i, j, n, m, max_iterations;
  double A[2][N][M]; // A has a 3rd dimension of size 2
  // A[0] is initialized with data somehow and duplicated into A[1]

#pragma paraguin begin_parallel
  // All pragma parameters must be literals or variables;
  // no preprocessor constants.
  n = N;
  m = M;
  max_iterations = TOTAL_TIME;
#pragma paraguin stencil A n m max_iterations computeValue
#pragma paraguin end_parallel

  // Final result is in A[__guin_current] or A[max_iterations % 2]
}

  21. The Stencil Pragma is Replaced with Code That:
• Broadcasts the 3-dimensional array given as an argument to the stencil pragma to all available processors.
• Sets __guin_current to zero and __guin_next to one.
• Creates a loop that iterates max_iterations times. Within that loop, code is inserted to perform the steps on the next slide.

  22. Within each iteration:
• Each processor (except the last one) sends its last row to the processor with rank one greater than its own.
• Each processor (except the first one) receives that row from the processor with rank one less than its own.
• Each processor (except the first one) sends its first row to the processor with rank one less than its own.
• Each processor (except the last one) receives that row from the processor with rank one greater than its own.
• Each processor iterates through the values of the rows for which it is responsible and uses the provided function to compute the next value.
• __guin_current and __guin_next are toggled.
• The data is gathered back to the root processor (rank 0).
A hand-written MPI sketch of the row exchange follows below.
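
The four send/receive steps above correspond roughly to the following hand-written MPI. The layout (each process holds its own rows plus one ghost row above and below) and all names are our assumptions; MPI_Sendrecv with MPI_PROC_NULL at the edges expresses the "except the first/last" cases and avoids deadlock:

#include <mpi.h>

/* local holds rows+2 rows of M doubles: ghost row 0, real rows 1..rows,
   ghost row rows+1. Exchange boundary rows with the neighboring ranks. */
void exchange_halos(double *local, int rows, int M, int rank, int size) {
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    /* Send my last real row down; receive my top ghost row from above */
    MPI_Sendrecv(&local[rows * M], M, MPI_DOUBLE, down, 0,
                 &local[0],        M, MPI_DOUBLE, up,   0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my first real row up; receive my bottom ghost row from below */
    MPI_Sendrecv(&local[1 * M],          M, MPI_DOUBLE, up,   1,
                 &local[(rows + 1) * M], M, MPI_DOUBLE, down, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}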

  23. Stopping the Iterations Based Upon a Condition
• The stencil pattern executes a fixed number of iterations.
• What if we want to continue until the data converges to a solution?
• For example, we might stop when the maximum difference between the values in A[0] and A[1] is less than some tolerance, such as 0.0001.
• The problem with doing this in parallel is that it requires communication.

  24. Why Communication is Needed to Test for a Termination Condition
• There are 2 reasons inter-processor communication is needed to test for a termination condition:
• The data is scattered across the processors; and
• The processors must all agree on whether to continue or terminate.
• Parts of the data may converge faster than others, so some processors may decide to stop while others do not.
• Without agreement, there will be a deadlock. (A single collective operation provides this agreement; see the sketch below.)
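
The reduce-then-broadcast sequence Paraguin uses on the following slides is roughly equivalent to a single MPI_Allreduce; a hand-written sketch:

/* Every process contributes its local maximum and receives the global
   maximum, so all ranks make the same continue/terminate decision. */
double global_max_diff;
MPI_Allreduce(&max_diff, &global_max_diff, 1, MPI_DOUBLE, MPI_MAX,
              MPI_COMM_WORLD);
if (global_max_diff <= tol)
    done = 1;  /* all ranks agree */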

  25. Paraguin Stencil Pragma with Termination Condition

// This part is the same as before.
int __guin_current = 0; // This is needed to access the last copy of the data

// Function to compute each value
double computeValue(double A[][M], int i, int j) {
  return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

int main() {
  int i, j, n, m, max_iterations, done;
  double A[2][N][M], diff, max_diff, tol;
  // A[0] is initialized with data somehow and duplicated into A[1]

#pragma paraguin begin_parallel
  // New variable: tol is used to determine whether the termination
  // condition has been met. When the changes in the values are ALL
  // less than tol, the values have converged sufficiently.
  // Note that the initializations are within the parallel region.
  tol = 0.0001;
  n = N;
  m = M;
  max_iterations = TOTAL_TIME;

  26. Paraguin Stencil Pragma with Termination Condition (continued)

  // We need a logically controlled loop rather than a fixed iteration count.
  done = 0; // false
  while (!done) {
    ; // Semicolon makes sure the following pragma is placed inside the while loop
#pragma paraguin stencil A n m max_iterations computeValue

    // Each processor determines the maximum change in the values of its
    // partition: the maximum absolute difference between the old values
    // and the newly computed values. The loop bounds must be 1 and n-1
    // (and 1 and m-1) to match the bounds of the stencil; otherwise the
    // partitioning will be incorrect.
    max_diff = 0.0;
#pragma paraguin forall
    for (i = 1; i < n - 1; i++) {
      for (j = 1; j < m - 1; j++) {
        diff = fabs(A[__guin_current][i][j] - A[__guin_next][i][j]);
        if (diff > max_diff)
          max_diff = diff;
      }
    }

  27. Paraguin Stencil Pragma with Termination Condition (continued)

    ; // Semicolon prevents the pragma from being placed INSIDE the above loop nest

    // Reduce to find the maximum difference across all processors.
    // (The variable diff is being reused here.)
#pragma paraguin reduce max max_diff diff

    // Broadcast diff so that all processes agree whether to continue or terminate.
#pragma paraguin bcast diff

    // Termination condition: the maximum change in the values is within tolerance.
    if (diff <= tol)
      done = 1; // true
  }
#pragma paraguin end_parallel

  // Final result is in A[__guin_current]. We cannot use A[max_iterations % 2]
  // here because the number of iterations actually executed is not fixed.
}

  28. Questions?
