Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code

Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora, U. Alberta, Canada NERSC User Training Lectures at June NUG Meeting June 6-7, 2000 ORNL

• Motivation - Very large scale calculations which are extremely demanding computionally are required when we attempt to model experimental situations - Optimization of performance is necessary to improve efficiency and make parameter scans feasible • Outline - Brief model description - Optimization strategy and implementation - Results - Summary

UCLA-Canada (UCAN) Global Gyrokinetic Particle Code • Designed to study ion microinstabilities and ensuing transport in toroidal magnetic fusion devices of the tokamak type • Physics Attributes: - Three dimensional - Toroidal - Nonlinear - Cartesian geometry - Covers the whole plasma cross-section=> Global - Adiabatic electrons Refs: - R. D. Sydora, V. K. Decyk, and J. M. Dawson, Plasma Phys. Control. Fusion 38 (1996) A281-A294 - J. N. Leboeuf et al. , Phys. Plasmas 7 (2000) 1795-1801

UCLA-Canada (UCAN) Global Gyrokinetic Particle Code • Numerical Attributes - Uses low-noise nonlinear delta-f methods to integrate ions orbits - Massively parallel implementation using PLIB library of particle and field management routines - MPI for message passing - One dimensional domain decomposition (toroidal or z direction) Ref: - V. K. Decyk, Computer Physics Communications 87 (1995) 87-94

Global Gyrokinetic Particle Simulations of Turbulence in UCLA’s Electric Tokamak (ET) • Massively parallel computers enable modeling of plasma conditions (plasma weather) inside a large scale fusion device like ET Nonlinear Steady State Linear Stage

Need for Optimization and Optimization Strategy • The CRAY T3E is notoriously cache starved with an 8K level 1 cache compared to a 64K level 1 cache on the IBM SP => minimizing cache usage is a crucial issue on the T3E • Follow strategy outlined in “Optimization of particle-in-cell codes for RISC processors”, V. K. Decyk, S. R. Karmesin, A. de Boer, and P. C. Liewer, Computers in Physics 10 (1996) 290-298 • Main elements: - Loop reordering - Data restructuring - Particle sorting • Should help even on the IBM SP

OPTIMIZATION: Loop Reordering in Particle Push and Accumulation Routines After do 33 l = 1, nblok c nleng = nleng0 ntot = npp(l) - nps(l) + 1 if( ntot .lt. nleng ) nleng = ntot ndiv = float( ntot - 1 ) / float( nleng ) + 1. nlast = ntot - ( ndiv - 1 ) * nleng c do 10 ncon = 1,ndiv nlengl = nleng if( ncon .eq. ndiv ) nlengl = nlast jfst = nps(l) + ( ncon - 1 ) * nleng - 1 c c c do 20 k = 1,4 c k2 = k + k c k1 = k2 - 1 c do 20 jj = 1,nlengl c i = jfst + jj c btot = cb1*rmajor / (rmajor + (part(1,i,l) - xxo) ) rhoi = sqrt( 2. * part(11,i,l) * btot ) * wcirv c do 30 k=1,4 k2=k+k k1=k2-1 c Before do 33 l = 1, nblok c nleng = nleng0 ntot = npp(l) - nps(l) + 1 if( ntot .lt. nleng ) nleng = ntot ndiv = float( ntot - 1 ) / float( nleng ) + 1. nlast = ntot - ( ndiv - 1 ) * nleng c do 10 ncon = 1,ndiv nlengl = nleng if( ncon .eq. ndiv ) nlengl = nlast jfst = nps(l) + ( ncon - 1 ) * nleng - 1 c c do 20 k = 1,4 k2 = k + k k1 = k2 - 1 c do 30 jj = 1,nlengl c i = jfst + jj c btot = cb1*rmajor / (rmajor + (part(1,i,l) - xxo) ) rhoi = sqrt( 2. * part(11,i,l) * btot ) * wcirv

OPTIMIZATION: Loop Reordering in FFT Routines Before After c first transform in z nrz = nxyz/nz do 580 l = 1, indz ns = 2**(l - 1) ns2 = ns + ns km = nzh/ns kmr = km*nrz do 570 m = 1, jblok do 560 k = 1, km k1 = ns2*(k - 1) k2 = k1 + ns do 550 j = 1, ns j1 = j + k1 j2 = j + k2 s = conjg(sct(1+kmr*(j-1))) do 540 i = 1, kxp do 530 n = 1, ny t = s*g(j2,n,i,m) g(j2,n,i,m) = g(j1,n,i,m) - t g(j1,n,i,m) = g(j1,n,i,m) + t 530 continue 540 continue 550 continue 560 continue 570 continue 580 continue c first transform in z nrz = nxyz/nz do 580 l = 1, indz ns = 2**(l - 1) ns2 = ns + ns km = nzh/ns kmr = km*nrz do 570 m = 1, jblok do 560 i = 1, kxp do 550 n = 1, ny do 540 k = 1, km k1 = ns2*(k - 1) k2 = k1 + ns do 530 j = 1, ns j1 = j + k1 j2 = j + k2 s = conjg(sct(1+kmr*(j-1))) t = s*g(j2,n,i,m) g(j2,n,i,m) = g(j1,n,i,m) - t g(j1,n,i,m) = g(j1,n,i,m) + t 530 continue 540 continue 550 continue 560 continue 570 continue 580 continue

OPTIMIZATION: Data Restructuring • Separate electric field components elx, ely, elz regrouped into single array el(1,...)=elx(...), el(2,...)=ely(...) el(3,...)=elz(...) so that they are now contiguous in memory Before After call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sct, cdddd1 incx+1, incy+1, incz, kstrt, 1 incx , incy , incz, kstrt, 2 ncxv, ncyppv, nczv, 3 nncxppp, nnczp, nnczp1, nblok, nblok, ncypp, 4 ncypp, ncy ) c do 810 l = 1,nblok do 810 k = 1,nczp do 810 j = 1,nncy1 do 810 i = 1,nncx1 elx(i,j,k,l) = g1x(i,j,k,l) 810 continue c call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sct, cdddd1 incx+1, incy+1, incz, kstrt, 1 incx , incy , incz, kstrt, 2 ncxv, ncyppv, nczv, 3 nncxppp, nnczp, nnczp1, nblok, nblok, ncypp, 4 ncypp, ncy ) c do 810 l = 1,nblok do 810 k = 1,nczp do 810 j = 1,nncy1 do 810 i = 1,nncx1 elx(i,j,k,l) = g1x(i,j,k,l) el(1,i,j,k,l)=g1x(i,j,k,l) 810 continue c

OPTIMIZATION: Memory Access Control • Particle pusher modified to take advantage of data restructuring, so as to minimize memory excursions and improve data locality After c dx= 1 el(1,lx,ly ,lz,l)*alf1+el(1,lxr,ly ,lz,l)*alf2 1 +el(1,lx,lyr,lz,l)*alf3+el(1,lxr,lyr,lz,l)*alf4 dy= 1 el(2,lx,ly ,lz,l)*alf1+el(2,lxr,ly ,lz,l)*alf2 1 +el(2,lx,lyr,lz,l)*alf3+el(2,lxr,lyr,lz,l)*alf4 dz= 1 el(3,lx,ly ,lz,l)*alf1+el(3,lxr,ly ,lz,l)*alf2 1 +el(3,lx,lyr,lz,l)*alf3+el(3,lxr,lyr,lz,l)*alf4 c c..... e.s. field c extt(k,jj,l)= dx 2 + el(1,lx ,ly ,lzr,l)*alf5 + el(1,lxr,ly ,lzr,l)*alf6 3 + el(1,lx ,lyr,lzr,l)*alf7 + el(1,lxr,lyr,lzr,l)*alf8 c eytt(k,jj,l)= dy 2 + el(2,lx ,ly ,lzr,l)*alf5 + el(2,lxr,ly ,lzr,l)*alf6 3 + el(2,lx ,lyr,lzr,l)*alf7 + el(2,lxr,lyr,lzr,l)*alf8 c eztt(k,jj,l)= dz 2 + el(3,lx ,ly ,lzr,l)*alf5 + el(3,lxr,ly ,lzr,l)*alf6 3 + el(3,lx ,lyr,lzr,l)*alf7 + el(3,lxr,lyr,lzr,l)*alf8 c Before c extt(k,jj,l)= 1 elx(lx ,ly ,lz ,l)*alf1 + elx(lxr,ly ,lz ,l)*alf2 1 + elx(lx ,lyr,lz ,l)*alf3 + elx(lxr,lyr,lz ,l)*alf4 2 + elx(lx ,ly ,lzr,l)*alf5 + elx(lxr,ly ,lzr,l)*alf6 3 + elx(lx ,lyr,lzr,l)*alf7 + elx(lxr,lyr,lzr,l)*alf8 c eytt(k,jj,l)= 1 ely(lx ,ly ,lz ,l)*alf1 + ely(lxr,ly ,lz ,l)*alf2 1 + ely(lx ,lyr,lz ,l)*alf3 + ely(lxr,lyr,lz ,l)*alf4 2 + ely(lx ,ly ,lzr,l)*alf5 + ely(lxr,ly ,lzr,l)*alf6 3 + ely(lx ,lyr,lzr,l)*alf7 + ely(lxr,lyr,lzr,l)*alf8 c eztt(k,jj,l)= 1 elz(lx ,ly ,lz ,l)*alf1 + elz(lxr,ly ,lz ,l)*alf2 1 + elz(lx ,lyr,lz ,l)*alf3 + elz(lxr,lyr,lz ,l)*alf4 2 + elz(lx ,ly ,lzr,l)*alf5 + elz(lxr,ly ,lzr,l)*alf6 3 + elz(lx ,lyr,lzr,l)*alf7 + elz(lxr,lyr,lzr,l)*alf8 c

OPTIMIZATION: Particle Sorting No Sorting Sorting Every 5 Time Steps cc cc initial sort cc call psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1 nnopmax,nblok,ny1,nyzpm1) c c...................................................................... c...... main loop ..................................................... nsortt=5 c...................................................................... call timera(-1,'total ') c 100 continue c nt = nt + 1 nb = ( nt / nbb ) * nbb nsort=(nt/nsortt)*nsortt call pmove3( part, 1 edges, npp, sbufr, sbufl, rbufr, rbufl, ihole, 2 jsr, jsl, jss, 3 ncz, kstrt, nvp, idimp, nnopmax, nblok, idps, 4 nbmax, ntmax, ierr ) c if(nsort.eq.nt) then call psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1 nnopmax,nblok,ny1,nyzpm1) end if c c c...................................................................... c...... main loop ..................................................... c...................................................................... call timera(-1,'total ') c 100 continue c nt = nt + 1 nb = ( nt / nbb ) * nbb c call pmove3( part, 1 edges, npp, sbufr, sbufl, rbufr, rbufl, ihole, 2 jsr, jsl, jss, 3 ncz, kstrt, nvp, idimp, nnopmax, nblok, idps, 4 nbmax, ntmax, ierr ) c

Performance Improvements on NERSC’s CRAY T3E Resulting from Optimizations • Improvements by better than a factor of 2 achieved in terms of overall time per particle per step and flops as measured by the ‘pat’ utility on NERSC’s CRAY T3E

Performance Improvements on NERSC’s CRAY T3E and IBM SP Resulting from Optimizations • Optimization for cache starved T3E still results in ~ 30% performance improvement on SP

Performance with Increasing Problem Size and Number of Processors on NERSC’s IBM SP fit • Linear scaling obtained when number of particles increases with number of processors (factor of 2 each at a time)

Comparison on Performance on Appleseed Mcintosh G4 cluster and IBM SP • Equivalent performance up to 8 processors

Summary • Optimization strategy successfully implemented • Optimizations have improved performance on both the T3E and SP • With up to 8 processors, performance is comparable on UCLA’s Appleseed cluster of Macintosh G4s and on NERSC’s IBM SP

Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code