
ACTS Tools Case Studies with User Codes


Presentation Transcript


  1. ACTS Tools: Case Studies with User Codes. Osni Marques and Tony Drummond (LBNL/NERSC), oamarques@lbl.gov, ladrummond@lbl.gov

  2. PETSc: Applications
  • Prometheus code (unstructured meshes in solid mechanics), 26 million DOF, 640 nodes on NERSC’s Cray T3E (M. Adams and J. Demmel).
  • 3D incompressible Euler, tetrahedral grid, up to 11 million unknowns, fully implicit steady-state, based on a legacy NASA code, FUN3d (W. K. Anderson). Courtesy of D. Kaushik and D. Keyes.
  • The parallel version of M3D, a multi-level 3D plasma physics code developed at the Princeton Plasma Physics Laboratory (PPPL), makes extensive use of PETSc for the parallelization and solution of an unstructured mesh problem. One member of the PPPL team recently stated that the parallelization of M3D “would have been very difficult without PETSc, and would have required several physicists to spend a significant amount of time reinventing numerical algorithms instead of doing physics.”
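As a rough illustration of the usage pattern these applications share, the sketch below shows a minimal PETSc linear solve in C. It is our own toy example (a 1D Laplacian), not code from Prometheus, FUN3d or M3D, and the call names follow recent PETSc releases, which differ from the API these early-2000s codes used; error checking is omitted for brevity.

    /* Minimal sketch of the typical PETSc solve pattern: assemble a
       distributed sparse matrix, hand it to a Krylov solver (KSP), solve.
       Toy 1D Laplacian only; error checks (PetscCall) omitted. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat A; Vec x, b; KSP ksp;
      PetscInt i, n = 100, Istart, Iend;

      PetscInitialize(&argc, &argv, NULL, NULL);

      /* Distributed sparse matrix: each rank owns a block of rows. */
      MatCreate(PETSC_COMM_WORLD, &A);
      MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
      MatSetFromOptions(A);
      MatSetUp(A);
      MatGetOwnershipRange(A, &Istart, &Iend);
      for (i = Istart; i < Iend; i++) {
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
      }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

      MatCreateVecs(A, &x, &b);
      VecSet(b, 1.0);

      /* Krylov solver; method and preconditioner are selectable at
         run time, which is much of PETSc's appeal to applications. */
      KSPCreate(PETSC_COMM_WORLD, &ksp);
      KSPSetOperators(ksp, A, A);
      KSPSetFromOptions(ksp);
      KSPSolve(ksp, b, x);

      KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
      PetscFinalize();
      return 0;
    }

Being able to switch Krylov methods and preconditioners from the command line (e.g., -ksp_type gmres -pc_type asm) is what lets application teams experiment with solvers without touching their discretization code.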

  3. ScaLAPACK: Applications
  • Advanced Computational Research in Fusion (SciDAC project, PI Mitch Pindzola). Point of contact: Dario Mitnik (Dept. of Physics, Rollins College).
  • Mitnik attended the workshop on the ACTS Toolkit in September 2000. Since then he has been actively using some of the ACTS tools, in particular ScaLAPACK, for which he has provided insightful feedback.
  • Dario is currently working on the development, testing and support of new scientific simulation codes for the study of atomic dynamics, using time-dependent close-coupling lattice and time-independent methods. He reports that this work could not be carried out on sequential machines and that ScaLAPACK is fundamental to the parallelization of these codes.
  [Figure: induced current (white arrows) and charge density (colored plane and gray surface) in crystallized glycine due to an external field (Louie, Yoon, Pfrommer and Canning).]

  4. Eigenmodes of Damped Detuned Structures (DDS) http://scidac.nersc.gov/accelerator/presentations.html
  • A DDS is an accelerator cavity that incorporates damping and detuning to suppress the transverse wake fields it produces.
  • Omega3P code:
    • Quadratic finite element discretization: K x = λ M x, with K and M large, sparse and symmetric.
    • Inexact shift-invert Lanczos as band-pass filtering.
    • JOCC refinement (Jacobi iteration).
  • Poor convergence in applications where the matrices are ill-conditioned and the eigenvalues are not well separated, as is the case in long structures with many cells.
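For readers unfamiliar with the terminology, the shift-invert transformation behind the Lanczos "band-pass filtering" can be summarized as follows (standard eigensolver material, our gloss rather than the slide's). For a shift σ placed inside the frequency band of interest, the generalized problem K x = λ M x is transformed into

\[
  (K - \sigma M)^{-1} M \, x = \theta \, x,
  \qquad \theta = \frac{1}{\lambda - \sigma},
\]

so eigenvalues λ near σ become the largest and best-separated θ, which Lanczos finds quickly; the wanted eigenvalues are recovered as λ = σ + 1/θ. Sweeping σ across a band extracts the modes in that band, hence the band-pass analogy. "Inexact" shift-invert applies (K − σM)⁻¹ only approximately (e.g., by an inner iterative solve), trading factorization cost for inner iterations.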

  5. PARPACK and SuperLU_dist
  • Parallel exact shift-invert eigensolver.
  • Problem of size 380698 with 15844364 nonzeros (nprocs = 8).
  • Early tests show that the computation of ~100 eigenvalues is faster than the current eigensolver in the electromagnetic simulation code (which can compute only a few eigenvalues at the moment).
  • SuperLU:
      EQUIL time          1.33
      COLPERM time       10.52
      ROWPERM time        6.94
      SYMBFACT time      18.60
      DISTRIBUTE time    18.82
      FACTOR time       408.90
      Factor flops 1.949737e+12   Mflops 4768.25
  • PARPACK:
      Total number of update iterations          =   31
      Total number of OP*x operations            =  421
      Total number of B*x operations             = 1163
      Total number of reorthogonalization steps  =  292
      Total time in user OP*x operation          = 2380.469727
      Total time in user B*x operation           =  104.050049
      Total time in Arnoldi update routine       = 2690.000000
      Total time in basic Arnoldi iteration loop = 2551.629395
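The division of labor behind these numbers — factor (K − σM) once with SuperLU, then reuse the factors for every OP*x that PARPACK requests — is sketched below in C. The helpers factor_shifted, sparse_solve and mass_matvec are hypothetical placeholders (bodies omitted), not the real SuperLU_dist or PARPACK interfaces, whose reverse-communication details are beyond a slide-sized example.

    /* Hedged sketch of the exact shift-invert operator
       OP = (K - sigma*M)^{-1} M.  The expensive step, factoring
       K - sigma*M, happens once; each OP*x then costs only a sparse
       triangular solve plus a mass-matrix product.  All three helpers
       are hypothetical placeholders, NOT actual library calls. */
    #include <stddef.h>

    typedef struct SparseFactor SparseFactor;  /* opaque LU factors */

    SparseFactor *factor_shifted(double sigma);                  /* hypothetical */
    void sparse_solve(const SparseFactor *F, const double *rhs,
                      double *sol, size_t n);                    /* hypothetical */
    void mass_matvec(const double *x, double *y, size_t n);      /* hypothetical */

    /* One OP*x application, as the Arnoldi iteration requests it. */
    static void apply_op(const SparseFactor *F, const double *x,
                         double *y, double *work, size_t n)
    {
      mass_matvec(x, work, n);      /* work = M*x (cheap, repeated)      */
      sparse_solve(F, work, y, n);  /* y = (K - sigma*M)^{-1} * work     */
    }

Because the factors are reused, the one-time ~409-second factorization reported above is amortized over the 421 OP*x applications of the Arnoldi iteration.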

  6. Cosmic Microwave Background (CMB) Analysis
  • The CMB is the faint echo of the Big Bang.
  • The statistics of the tiny variations in the CMB allow the determination of the fundamental parameters of cosmology to the percent level or better.
  • MADCAP (Microwave Anisotropy Dataset Computational Analysis Package):
    • Makes maps from observations of the CMB and then calculates their angular power spectra.
    • ScaLAPACK-based code.
    • See http://www.nersc.gov/~borrill/cmb/madcap.html

  7. MADCAP
  • Calculations are dominated by the solution of linear systems of the form M = A⁻¹B for dense n×n matrices A and B, scaling as O(n³) in flops.
  • On the NERSC Cray T3E (original code):
    • Cholesky factorization and triangular solve.
    • Typically reached 70-80% of peak performance.
    • Solution of systems with n ~ 10⁴ using tens of processors.
    • The results demonstrated that the Universe is spatially flat (overall), comprising 70% dark energy, 25% dark matter, and only 5% ordinary matter.
  • On the NERSC IBM SP:
    • Porting was trivial, but tests showed only 20-30% of peak performance.
    • Code rewritten to use Cholesky factorization, triangular matrix inversion and triangular matrix multiplication (one day of work, thanks to the completeness and coherence of ScaLAPACK).
    • Performance increased to 50-60% of peak.
    • Solution of previously intractable systems with n ~ 10⁵ using hundreds of processors.
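To make explicit why the rewrite helped (our gloss, not from the slide): with the Cholesky factor of A, both code versions compute the same product,

\[
  A = L L^{T}, \qquad
  M = A^{-1} B = L^{-T}\left(L^{-1} B\right),
\]

but they apply the factor differently. The original T3E code used triangular solves; the IBM SP version instead inverted the triangular factor explicitly, an extra O(n³/3) flops, and then formed M with triangular matrix-matrix multiplications (e.g., pdtrtri/pdtrmm-style ScaLAPACK/PBLAS operations), which are multiply-rich kernels that run far closer to peak on the SP's memory hierarchy than triangular solves do.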

  8. http://www.physics.ucsb.edu/~boomerang

  9. Profiling with TAU: Fortran 90 (sum-of-cubes program)
      PROGRAM SUM_OF_CUBES
        integer profiler(2)
        save profiler
        INTEGER :: H, T, U
        call TAU_PROFILE_INIT()
        call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
        call TAU_PROFILE_START(profiler)
        call TAU_PROFILE_SET_NODE(0)
        ! This program prints all 3-digit numbers that
        ! equal the sum of the cubes of their digits.
        DO H = 1, 9
          DO T = 0, 9
            DO U = 0, 9
              IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
                PRINT "(3I1)", H, T, U
              ENDIF
            END DO
          END DO
        END DO
        call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES

  10. Profiling with TAU: C and OpenMP
      #include <stdio.h>
      #include <stdlib.h>
      #include <Profile/Profiler.h>
      extern void mytimer_(int *);
      #ifdef _OPENMP
      # include <omp.h>
      #endif
      #include "ppm.h"
      field iterations;

      int main(int argc, char *argv[]) {
        double xmin, xmax, ymin, ymax, dx, dy;
        int numpe, maxiter, ix, iy;
        TAU_PROFILE_TIMER(mt, "main()", "int (int, char **)", TAU_DEFAULT);
        TAU_PROFILE_SET_NODE(0);
        TAU_PROFILE_START(mt);
        if ( argc != 6 ) {
          fprintf(stderr, "%s: xmin xmax ymin ymax maxiter\n", argv[0]);
          fprintf(stderr, "Using defaults: -.59 -.56 .47 .5 216\n");
          xmin = -.59; xmax = -.56; ymin = .47; ymax = .5; maxiter = 216;
        } else {
          xmin = strtod(argv[1], 0);
          xmax = strtod(argv[2], 0);
          ymin = strtod(argv[3], 0);
          ymax = strtod(argv[4], 0);
          maxiter = atoi(argv[5]);
        }
        /* --- initialization --- */
        numpe = 1;
        dx = (xmax - xmin) / width;
        dy = (ymax - ymin) / height;
        /* --- calculate mandelbrot set --- */
        mytimer_(0);
        #pragma omp parallel
        {
          TAU_PROFILE_TIMER(pt, "Parallel Region", " ", TAU_DEFAULT);
          TAU_PROFILE_START(pt);
        #ifdef _OPENMP
          numpe = omp_get_num_threads();
        #endif
        #pragma omp for private(ix,iy)
          for (ix = 0; ix < width; ++ix) {
            double x = xmin + ix*dx;
            TAU_PROFILE_TIMER(fl, "For loop", " ", TAU_DEFAULT);
            TAU_PROFILE_START(fl);
            for (iy = 0; iy < height; ++iy) {
              double y = ymin + iy*dy;
              double zx, zy, cx, cy, ox, oy;
              int count;
              zx = 0.0; zy = 0.0; ox = 0.0; oy = 0.0;
              cx = x; cy = y; count = 0;
              while ( (ox*ox + oy*oy) < 16 && count < maxiter ) {
                zx = ox*ox - oy*oy + cx;
                zy = ox*oy + ox*oy + cy;
                ++count;
                ox = zx; oy = zy;
              }
              iterations[ix][iy] = count;
            }
            TAU_PROFILE_STOP(fl);
          }
          TAU_PROFILE_STOP(pt);
        }
        mytimer_(&numpe);
        /* --- generate ppm file --- */
        printf("Writing picture ...\n");
        ppmwrite("mandel.ppm", iterations, maxiter);
        TAU_PROFILE_STOP(mt);
        exit(0);
      }

  11. Profiling with TAU: C++ and PAPI
      /* This demonstrates how data cache misses can affect the performance
         of an application.  We show how the time/counts for a simple matrix
         multiplication algorithm dramatically reduce when we employ a strip
         mining optimization. */
      #include <Profile/Profiler.h>
      #define SIZE 128
      #define CACHE 64
      double A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];

      double multiply(void) {
        int i, j, k, n, m;
        int vl, sz, strip;
        TAU_PROFILE("multiply", "void (void)", TAU_USER);
        TAU_PROFILE_TIMER(t1, "multiply-regular", "void (void)", TAU_USER);
        TAU_PROFILE_TIMER(strip_timer, "multiply-with-strip-mining-optimization",
                          "void (void)", TAU_USER);
        for (n = 0; n < SIZE; n++)
          for (m = 0; m < SIZE; m++) {
            A[n][m] = B[n][m] = n + m;
            C[n][m] = 0;
          }
        TAU_PROFILE_START(t1);
        for (i = 0; i < SIZE; i++) {
          for (j = 0; j < SIZE; j++) {
            for (k = 0; k < SIZE; k++)
              C[i][j] += A[i][k] * B[k][j];
          }
        }
        TAU_PROFILE_STOP(t1);
        /* Now we employ the strip mining optimization */
        for (n = 0; n < SIZE; n++)
          for (m = 0; m < SIZE; m++)
            C[n][m] = 0;
        TAU_PROFILE_START(strip_timer);
        for (i = 0; i < SIZE; i++)
          for (k = 0; k < SIZE; k++)
            for (sz = 0; sz < SIZE; sz += CACHE) {
              // vl = min(SIZE-sz, CACHE);
              vl = (SIZE - sz < CACHE ? SIZE - sz : CACHE);
              for (strip = sz; strip < sz + vl; strip++)
                C[i][strip] += A[i][k] * B[k][strip];
            }
        TAU_PROFILE_STOP(strip_timer);
        return C[SIZE-10][SIZE-10]; // So KCC doesn't optimize this loop away.
      }

      int main(int argc, char **argv) {
        TAU_PROFILE("main()", "int (int, char **)", TAU_DEFAULT);
        TAU_PROFILE_SET_NODE(0);
        multiply();
        return 0;
      }

  12. Profiling EVH1 with TAU
  • EVH1 (Enhanced Virginia Hydrodynamics #1) benchmark.
  • MPI code developed from VH1, based on the piecewise parabolic method (PPM) of Colella and Woodward.
  • PPM is a technique for compressible, non-turbulent hydrodynamics. It has been used in a variety of astrophysical contexts, in addition to some ideal gas computations and studies of convection.

  13. Profiling EVH1 with TAU: JRACY, time spent in each process.

  14. Profiling EVH1 with TAU: JRACY, mean inclusive time.

  15. Profiling EVH1 with TAU: JRACY, exclusive and inclusive floating point counts for all routines (except PARABOLA), mean over 16 processors.

  16. Profiling EVH1 with TAU: JRACY, exclusive and inclusive level 1 data cache misses for all routines (except PARABOLA), mean over 16 processors.

  17. Profiling EVH1 with TAU: visualizing TAU traces with Vampir, a commercial trace visualization tool from Pallas GmbH.

  18. Profiling EVH1 with TAU: visualizing TAU traces with Vampir. Timeline view on process 1.

  19. Profiling EVH1 with TAU: visualizing TAU traces with Vampir. Processes that participate in an activity at a given time.

  20. Acknowledgments
  • Sherry Li
  • Chao Yang
  • Julian Borrill
  • Evan Welbourne
  • Sameer Shende
  • PERC: http://perc.nersc.gov/main.htm

  21. Please mark your calendars!

  22. ACTS acts-support@nersc.gov http://acts.nersc.gov
