K. Roche Future Technologies Group Computer Science and Mathematics Division

THE UNEDF COLLABORATION –COMPUTER SCIENCE RESEARCH AND SOFTWARE DEVELOPMENT 1. Superfluid Local Density Approximation Imaginary Time DFT Solver –Magierski, Bulgac Time Dependent DFT –Bulgac, Yu 2. Coupled Cluster Tensor Based Products -Dean K. Roche Future Technologies Group Computer Science and Mathematics Division Oak Ridge National Laboratory

Time-Dependent Superfluid Local Density Approximation (TD-SLDA) generalization of Kohn-Sham LDA to superfluid systems • Space-time lattice • use of FFTW for spatial derivative • No matrix operations Features of this extension to DFT: All nuclei (odd, even, spherical, deformed) Any quantum numbers of QRPA modes Fully selfconsistent, no self-consistent symmetriesimposed Observables and spectrum Higgs mode of the pairing field in a homogeneous unitary Fermi gas - Bulgac, Yoon

C version and F90 versions exist and are numerically consistent • Energy and particle number are conserved in time • Compiled and run on various platforms • Intel Xeon dual processor, quad-core (2X) • AMD Opteron single processor, dual-core (quad-core forthecoming) • e.g. 40^3 on jaguarcnl on 3200 nodes at 58s / time step • (1600 pes) 750e12 INS , 46e12 FLOP / time step • Scaling behavior • max lattice size ~ 66^3 on jaguar ~ ( 31,328 PEs , 2GB RAM / PE , 250TF ) • (per time step)STORAGE ~ O(2^40 BYTE) , COMPUTE ~ O(2^40 FLOP) • (per time step)Petascale problem at Nx = Ny = Nz > 100 • strong, fix lattice dimensions and add MPI processes thus • reducing the number of qp wavefunctions per MPI process • weak, increased problem complexity and hardware allocation, fixed walltime • Metric ~ cost($,time) / lattice site / wavefunction / timestep / pe • not ideal –memory bound (e.g. Nx = Ny = Nz = 30 )

Lattice DFT solver (imaginary time evolution) -SLDA Computed Energies: Harmonic Oscillator, Spline, Multiwavelets, and 3D-lattice • Goal is to compute ground state energies for all nuclei –a basis for time dependent code • Lattice code w/ plane wave basis and extensive use of FFT • Current version is not explicitly parallel • General loss of orthogonality –must diagonalize / re-orthogonalize • .75 walltime ~ evolution ; .25 walltime ~ diagonalization • Parallel discretization (will be) same as for time dependent slda lattice code

Coupled Cluster • ab initio technique • calculates ground-state properties of closed-shell (or sub-shell) nuclei • solves coupled nonlinear sets of equations (largest ~10M unknowns) • CCSD ~ O(nu^2 * No^4) (nu:=unoccupied orbitals; No:=occupied) • CCSDT ~ O(nu^3 * No^5) GOAL n~100 , N~1000 O( 2^40 BYTES ) –not memory bound O( 1E13 FLOP )-CCSD O( 1E18 FLOP )-CCSDT WHERE i,j,k,l = n := occupied AND a,b,c,d = N – n := unoccupied

HARD PROBLEM sum of wall time: 449.967129 sum of cpu time: 561.210000 • Irregular data access patterns for • copying / computation • Forced network / disk usage bc • of non-local terms • Products of rectangular arrays don’t • perform like square ones • Performance is Mixture • of BLAS1 , BLAS2 , BLAS3 • Large variance per term

Related (current and past) Computer Science Efforts • Tensor Contraction Engine (TCE) • Rely on combination of local disk and communication network • Cannon’s algorithm, replicated data blocks , accumulation via • all-all reduction over partial sums • Portable Remote Memory Copy Library for Distributed Array Libraries and • Compiler Run-time Systems (Aggregate Remote Memory Copy Interface, • Global Arrays) • Shared and remote-memory based universal matrix multiplication • algorithm • Co-array Fortran (Coupled Cluster cont.)

Software Dependencies • LIBRARIES • MPI1,2 (MPICH, Cray, Intel, OpenMPI) • POSIX Threads • BLAS(ATLAS,GOTO,LIBSCI,ACML,MKL) • LAPACK(ATLAS,LIBSCI,ACML,MKL) • ScaLAPACK(LIBSCI,CMKL) • FFTW(LIBSCI,ACML,MKL) • LUSTRE • OPERATING SYSTEMS • UNIX (Compute Node • Linux, Cray) • LANGUAGES / COMPILERS • C, Fortran90 • GCC(3.*,4.*) • Portland Group(7.*) • Intel(9.*,10*) • ISO/IEC JTC1/SC22 - Programming languages and operating systems • C99 + TC1 + TC2, WG14 N1124, dated 2005-05-06 • http://www.open-std.org/JTC1/SC22/WG14/www/standards.html • F03 (08): P1(base) + P2(variable length strings) + P3(conditional compilation) • http://www.nag.co.uk/sc22wg5 • http://www.co-array.org/caf_intro.htm (co-arrays -remote memory ops) • POSIX Threads: IEEE Std 1003.1, 2004 Edition;ISO/IEC standard 9945-1:1996 • http://www.unix.org/single_unix_specification

BONUS SLIDES

How to Speedup Relevant SLDA Computations Exploiting Lightweight Processes (PTHREADS) on Multicore Nodes • Communication and I/O (MPI_THREAD_Init()) • swap quasi-particle wavefunctions to / from • network disk / memory • overlap of reduction and computation • Partial densities –e.g. local summations • Strong scaling of computations to number • of processor cores • Discrete Fourier Transforms • Multithreaded versions for transpose and • multiplication

Coupled Cluster: Macro-index construction from rank four object • Construct macro indices I = {k,c} , J = {d,l} , K = {b,j} , L= {a,i} • Expensive copy • Q(J,I) , P(K,J) , S(I,L) are the results • Make operation look like sequence of matrix multiplications • Compute P(K,J) * Q(J,I) = R(K,I) • Compute R(K,I) * S(I,L) = W(K,L) • Store W(K,L) as copy I=0 Do a=1,N Do b=1,N I = I +1 J = 0 Do i=1,n Do j=1,n J = j + n*(i-1) J = J + 1 A(I,J) = t(a,b,i,j)

Disk and network memory (MPI) based arbitrary rank object storage and manipulation: data structure explorations MPI_COMM_WORLD Virtual rectangular Process grid Subgroup formation Subgroup Bcast , parse Multithreaded range retrieval 2 0 0 1 2 0 1 2 0 1 2 0 1 2 1 3 3 4 5 3 4 5 3 4 5 4 5 Higher Rank Objects Matrices 8 100 100 400 400 ./31243.dat.0 0 1073741823 ./31243.dat.1 1073741824 2147483647 ./31243.dat.2 2147483648 3221225471 ./31243.dat.3 3221225472 4294967295 ./31243.dat.4 4294967296 5368709119 ./31243.dat.5 5368709120 6442450943 ./31243.dat.6 6442450944 7516192767 ./31243.dat.7 7516192768 8589934591 ./31243.dat.8 8589934592 9663676415 ./31243.dat.9 9663676416 10737418239 ./31243.dat.10 10737418240 11811160063 ./31243.dat.11 11811160064 12799999999 [rochekj@athena0 ~]$ cat /share/data1/rochekj/2111149.dat 8 32768 32768 /share/data1/rochekj/2111149.dat.0 0 1073741823 /share/data1/rochekj/2111149.dat.1 1073741824 2147483647 /share/data1/rochekj/2111149.dat.2 2147483648 3221225471 /share/data1/rochekj/2111149.dat.3 3221225472 4294967295 /share/data1/rochekj/2111149.dat.4 4294967296 5368709119 /share/data1/rochekj/2111149.dat.5 5368709120 6442450943 /share/data1/rochekj/2111149.dat.6 6442450944 7516192767 /share/data1/rochekj/2111149.dat.7 7516192768 8589934591 *replace PATH w/ struct holding MPI id and ptr

Examples: kfil_rd_r() , kfil_rd_panel_r() –as needed for PBLAS, ScaLAPACK based operations [rochekj@athena0 kr-pt]$ ./xkfil_rd 1 /share/data1/rochekj/2111149.dat 0 /share/data1/rochekj/2111149.dat: wall_t: 0.009448 cpu_t: 0.000000 dim[0] = 32768 dim[1] = 32768 [0 / 1]: read BYTES := 8589934592 BYTES(kfil_rd_r) : 8589934592 t[s]: 45.968010 178.2MBps [rochekj@athena0 kr-pt]$ 1 process [rochekj@athena0 kr-pt]$ ./xkfil_rd 4 /share/data1/rochekj/2111149.dat 0 /share/data1/rochekj/2111149.dat: wall_t: 0.009792 cpu_t: 0.000000 dim[0] = 32768 dim[1] = 32768 [0] myoff=0 sb=0 eb=2147483647 tb=2147483648 [1] myoff=268435456 sb=2147483648 eb=4294967295 tb=2147483648 [2] myoff=536870912 sb=4294967296 eb=6442450943 tb=2147483648 [3] myoff=805306368 sb=6442450944 eb=8589934591 tb=2147483648 [0 / 4]: read BYTES := 2147483648 [1 / 4]: read BYTES := 2147483648 [2 / 4]: read BYTES := 2147483648 [3 / 4]: read BYTES := 2147483648 BYTES(kfil_rd_r) : 8589934592 t[s]: 33.940000 241.4MBps [rochekj@athena0 kr-pt]$ 4 pthread processes from within process Multithreaded range retrieval 0

Multi-threaded Asynchronous List Construction and Control in MPI Space • Useful to both slda and cc software efforts • Based on producer driven bounded buffer problem • client / server based producer consumer model • MPI_THREAD_Init() • MPI_THREAD_{SINGLE,FUNNELED,SERIALIZED,MULTIPLE} • Discuss benefits here …

BLAS remain one tool of choice but Moore’s Law still lingers … • Computation: • -Theoretical peak:(# cores)*(flops/cycle/core)*(cycles/second) Memory: -Theoretical peak: (bus width)*(bus speed) 6GBytes/second • y =  x + y :3 operands (24 Bytes) needed for 2 flops • (3operands/2flop)*(8bytes/operand)*2core*2(flop/cyc/core)*2.6e9(cyc/sec) • requires ~125 GBytes per second; have 6GBps  how to get sustainability?? BLAS 1: O(n) operations on O(n) operands BLAS 2: O(n**2) operations on O(n**2) operands BLAS 3: O(n**3) operations on O(n**2) operands

BE AWARE AND LEARN FROM PAST EXPERIENCES! 1. Form group G1 from MPI_COMM_WORLD 2. Form group G2 := outliers (feature extraction) 3. Form group G3 = G1 \ G2 and COMM3 (work group and communicator)

Imaginary Time DFT Solver

The renormalized energy density functional -

Bogoliubov quasi-particle wavefunction with index label,k • Computation is parallelized over index label k defined on • a periodic cubic lattice of dimension Nx**3 in a plane wave basis • Each process has a copy of the lattices (x,k) • Partial densities are computed locally, reduced and broadcast • Computation of gradients is required for this step • Fixing the number density of the homogeneous gas assigns energy • per particle in k-space, makes the chemical potential and the pairing • gap computable, and leads to the parameters a,b,g in the energy • functional • We can now evaluate the energy functional and begin time evolution

Adams-Bashforth (Moulton): predictor, corrector, modifier ‘ • After computation of p(n+1), and m(n+1), the kinetic energy density • must be updatedwe must compute time derivatives and gradients of the • qpwf and subsequently the required densities • Next, c(n+1) and phi(n+1) can be computed and used to update the densities • Again (as in previous step) • The energy functional and total particle number are evaluated, • time is bumped, the values of the quasi-particle wavefunctions at different • time indices are shifted (copied in memory)

K. Roche Future Technologies Group Computer Science and Mathematics Division