
Using VASP on Ranger

Hang Liu. Using VASP on Ranger. Part of an AUS project for VASP users in the UCSB computational materials science group led by Prof. Chris Van de Walle.


Presentation Transcript


  1. Using VASP on Ranger. Hang Liu

  2. About this work and talk
  • Part of an AUS project for VASP users in the UCSB computational materials science group led by Prof. Chris Van de Walle
  • Collaborative effort with Dodi Heryadi and Mark Vanmoer at NCSA and Anderson Janotti and Maosheng Miao at UCSB, coordinated by Amitava Majumdar at SDSC and Bill Barth at TACC
  • Many heuristics from the HPC group at TACC and from other users and their tickets
  • Goal: get VASP running on Ranger with reasonable performance

  3. VASP Basics
  • An ab initio quantum mechanical molecular dynamics package. The current version is 4.6; many users run the latest development version
  • Straightforward compilation with both the Intel and PGI compilers plus MVAPICH
  • Several performance libraries are needed: BLAS, LAPACK, FFT, and ScaLAPACK

  4. Standard Compilation
  • Intel + MVAPICH: FFLAGS = -O1 -xW
  • PGI + MVAPICH: FFLAGS = -tp barcelona-64
  • GotoBLAS + LAPACK + FFTW3
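
  For reference, a minimal makefile fragment for the Intel + MVAPICH build might look like the sketch below. The $(TACC_GOTOBLAS_LIB) and $(TACC_FFTW3_LIB) paths follow TACC's module conventions; the exact library and object names are illustrative assumptions, not the verified Ranger settings.

      # hypothetical VASP 4.6 makefile fragment (Intel + MVAPICH on Ranger)
      FC     = mpif90                  # MVAPICH wrapper around ifort
      FFLAGS = -O1 -xW                 # flags from the slide above
      # GotoBLAS + LAPACK + FFTW3, via TACC module environment variables
      BLAS   = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64 -lpthread
      LAPACK = ../vasp.4.lib/lapack_double.o
      FFT3D  = fftw3d.o fft3dlib.o -L$(TACC_FFTW3_LIB) -lfftw3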

  5. Performance profiling of a test run with 120 MPI tasks by IPM:

      #              [total]      <avg>        min          max
      # wallclock    136076       1133.97      1133.97      1133.98
      # user         135664       1130.54      1127.63      1131.5
      # system       65.5808      0.546507     0.36         2.432
      # mpi          31809.9      265.083      235.753      299.535
      # %comm                     23.3763      20.7899      26.4147
      # gflop/sec    222.097      1.85081      1.71003      1.8556
      # gbytes       78.4051      0.653376     0.646839     0.684002
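
  Gathering such a profile with IPM usually requires no recompilation, only preloading the IPM library at run time. A minimal sketch, assuming IPM is installed as a module on Ranger (the $TACC_IPM_LIB variable name is an assumption in TACC's usual style):

      module load ipm
      setenv LD_PRELOAD $TACC_IPM_LIB/libipm.so   # intercept MPI calls at run time
      ibrun ./vasp                                # run the job as usual
      # afterwards, convert the IPM XML log into an HTML report:
      #   ipm_parse -html <ipm_log_file>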

  6. From the IPM profile:
  • Reasonable performance: 1.9 GFLOPS per task
  • Not memory intensive: 0.7 GB per task
  • Somewhat communication intensive: 23% of time in MPI
  • Instructions, communication, and timings are well balanced across tasks

  7. Observing performance bottlenecks in VASP with TAU
  • The most instruction-intensive routines execute with very good performance
  • The most time-consuming routines look like random number generation and MPI communication; what does wave.f90 do?
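
  A hedged sketch of how such a TAU profile can be collected; the module name and TAU makefile path below are assumptions, while tau_f90.sh, pprof, and paraprof are TAU's standard tools:

      module load tau
      setenv TAU_MAKEFILE $TACC_TAU_LIB/Makefile.tau-mpi-pdt   # illustrative path
      # rebuild VASP with TAU's Fortran wrapper, e.g. FC = tau_f90.sh in the makefile
      ibrun ./vasp      # each MPI task writes a profile.* file
      pprof             # text summary; paraprof opens the graphical view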

  8. Hybrid Compilation and NUMA Control
  • Prompted by a user ticket: VASP runs much slower on Ranger than on Lonestar

  9. User reported the following timings of a VASP calculation.

  On Lonestar:
      LOOP: VPU time 62.07: CPU time 62.52
      LOOP: VPU time 76.34: CPU time 76.
      LOOP: VPU time 101.73: CPU time 101.83
      LOOP: VPU time 101.84: CPU time 102.15
      LOOP: VPU time 113.80: CPU time 114.01
      LOOP: VPU time 105.28: CPU time 105.38
      LOOP: VPU time 102.89: CPU time 103.00
      LOOP: VPU time 94.83: CPU time 94.93
      LOOP: VPU time 113.42: CPU time 113.53
      LOOP: VPU time 102.02: CPU time 102.08
      LOOP: VPU time 113.96: CPU time 114.04
      LOOP: VPU time 102.45: CPU time 102.53
      LOOP: VPU time 66.74: CPU time 66.84
      LOOP+: VPU time 1365.13: CPU time 1389.13

  On Ranger:
      LOOP: VPU time 226.76: CPU time 173.41
      LOOP: VPU time 288.46: CPU time 218.79
      LOOP: VPU time 383.08: CPU time 287.26
      LOOP: VPU time 385.57: CPU time 275.94
      LOOP: VPU time 405.54: CPU time 303.41
      LOOP: VPU time 378.68: CPU time 279.45
      LOOP: VPU time 383.70: CPU time 283.03
      LOOP: VPU time 353.88: CPU time 259.22
      LOOP: VPU time 407.47: CPU time 300.22
      LOOP: VPU time 378.02: CPU time 276.17
      LOOP: VPU time 414.57: CPU time 307.78
      LOOP: VPU time 382.99: CPU time 282.14
      LOOP: VPU time 248.97: CPU time 188.06
      LOOP+: VPU time 4754.26: CPU time 3591.14

  Almost 3 times slower; something is clearly not right.

  10. In the user's makefile:

      MKLPATH = ${TACC_MKL_LIB}
      BLAS    = -L$(MKLPATH) $(MKLPATH)/libmkl_em64t.a $(MKLPATH)/libguide.a -lpthread
      LAPACK  = $(MKLPATH)/libmkl_lapack.a

  Is MKL on Ranger multi-threaded? It looks like it is. Rerunning with one thread per task and proper affinity:

      -pe 8way 192
      setenv OMP_NUM_THREADS 1
      ibrun tacc_affinity ./vasp

      LOOP: VPU time 61.31: CPU time 62.44
      LOOP: VPU time 75.21: CPU time 75.33
      LOOP: VPU time 97.97: CPU time 98.02
      LOOP: VPU time 98.58: CPU time 98.65
      LOOP: VPU time 108.35: CPU time 108.50
      LOOP: VPU time 102.18: CPU time 102.45
      LOOP: VPU time 99.29: CPU time 99.37
      LOOP: VPU time 92.51: CPU time 92.57
      LOOP: VPU time 108.44: CPU time 108.50
      LOOP: VPU time 99.44: CPU time 99.51
      LOOP: VPU time 108.74: CPU time 108.82
      LOOP: VPU time 99.13: CPU time 99.23
      LOOP: VPU time 64.47: CPU time 64.54
      LOOP+: VPU time 1336.91: CPU time 1378.16

  • Right number of threads
  • NUMA control commands
  • Proper core-memory affinity
  • Performance comparable to that on Lonestar
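
  Putting these pieces together, a minimal SGE job script for the 8way, single-threaded configuration might look as follows; the job name, queue, and runtime limit are illustrative:

      #!/bin/csh
      #$ -N vasp-8way
      #$ -pe 8way 192               # 8 tasks per 16-core node
      #$ -q normal
      #$ -l h_rt=02:00:00
      setenv OMP_NUM_THREADS 1      # keep the threaded MKL BLAS to one thread per task
      ibrun tacc_affinity ./vasp    # tacc_affinity pins tasks to sockets and local memory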

  11. How can multi-threaded BLAS improve VASP performance?
  • The VASP guide says: for good performance, VASP requires highly optimized BLAS routines
  • Multi-threaded BLAS is available on Ranger: MKL and GotoBLAS
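
  Thread counts can be pinned per library. MKL honors OMP_NUM_THREADS (as used above) and also MKL_NUM_THREADS; GotoBLAS reads GOTO_NUM_THREADS. A minimal csh sketch for a 4-thread hybrid run (whether GotoBLAS also falls back to OMP_NUM_THREADS should be verified on the system):

      setenv OMP_NUM_THREADS 4       # BLAS threads per MPI task (MKL)
      setenv GOTO_NUM_THREADS 4      # same for GotoBLAS builds
      ibrun tacc_affinity ./vasp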

  12. Case 1: both BLAS and LAPACK from MKL, 4way, 4 threads per task

      LOOP: VPU time 123.00: CPU time 66.00
      LOOP: VPU time 157.92: CPU time 82.67
      LOOP: VPU time 190.06: CPU time 97.56
      LOOP: VPU time 179.26: CPU time 93.55
      LOOP: VPU time 193.40: CPU time 99.40
      LOOP: VPU time 209.87: CPU time 107.05
      LOOP: VPU time 185.16: CPU time 95.44
      LOOP: VPU time 185.77: CPU time 96.56
      LOOP: VPU time 190.02: CPU time 99.51
      LOOP: VPU time 201.10: CPU time 105.07
      LOOP: VPU time 191.07: CPU time 99.26
      LOOP: VPU time 195.49: CPU time 101.31
      LOOP: VPU time 193.11: CPU time 99.38
      LOOP: VPU time 147.91: CPU time 76.65
      LOOP+: VPU time 2677.34: CPU time 1477.63

  ==> Almost the same as the 8way x 1 thread case. No improvement.

  13. Case 2: both BLAS and LAPACK from GotoBLAS, 4way, 4 threads per task

      BLAS = -L$(TACC_GOTOBLAS_LIB) -lgoto_lp64_mp -lpthread

      LOOP: VPU time 153.81: CPU time 46.27
      LOOP: VPU time 198.21: CPU time 58.55
      LOOP: VPU time 235.09: CPU time 69.63
      LOOP: VPU time 225.93: CPU time 66.80
      LOOP: VPU time 236.93: CPU time 71.55
      LOOP: VPU time 256.36: CPU time 77.62
      LOOP: VPU time 226.96: CPU time 68.61
      LOOP: VPU time 230.06: CPU time 69.34
      LOOP: VPU time 236.31: CPU time 71.27
      LOOP: VPU time 251.50: CPU time 76.00
      LOOP: VPU time 236.78: CPU time 71.45
      LOOP: VPU time 241.77: CPU time 73.01
      LOOP: VPU time 236.59: CPU time 71.39
      LOOP: VPU time 182.20: CPU time 54.38
      LOOP+: VPU time 3404.57: CPU time 1075.91

  ==> The multi-threaded BLAS in GotoBLAS is much better than MKL's: about 30% faster for this case.
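
  For reference, the run configuration behind this 4way, 4-thread case would look roughly like the following, assuming the same 192-slot reservation as the earlier 8way run (Ranger nodes have 16 cores, so 4 tasks x 4 threads fills a node):

      #$ -pe 4way 192               # same nodes as before, now 4 tasks per node
      setenv OMP_NUM_THREADS 4      # 4 GotoBLAS threads per task
      ibrun tacc_affinity ./vasp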

  14. 4way, 1 thread per task:

      LOOP: VPU time 63.08: CPU time 63.72
      LOOP: VPU time 80.91: CPU time 80.98
      LOOP: VPU time 95.91: CPU time 96.00
      LOOP: VPU time 91.77: CPU time 91.86
      LOOP: VPU time 97.23: CPU time 97.42
      LOOP: VPU time 105.29: CPU time 105.42
      LOOP: VPU time 93.45: CPU time 93.57
      LOOP: VPU time 94.48: CPU time 94.57
      LOOP: VPU time 97.43: CPU time 97.52
      LOOP: VPU time 103.22: CPU time 103.31
      LOOP: VPU time 97.28: CPU time 97.35
      LOOP: VPU time 99.45: CPU time 99.55
      LOOP: VPU time 97.44: CPU time 97.50
      LOOP: VPU time 74.86: CPU time 74.92
      LOOP+: VPU time 1418.64: CPU time 1443.82

  4way, 2 threads per task:

      LOOP: VPU time 89.57: CPU time 49.98
      LOOP: VPU time 115.40: CPU time 63.41
      LOOP: VPU time 136.96: CPU time 75.27
      LOOP: VPU time 131.39: CPU time 72.32
      LOOP: VPU time 138.68: CPU time 77.14
      LOOP: VPU time 149.38: CPU time 83.26
      LOOP: VPU time 132.74: CPU time 73.71
      LOOP: VPU time 134.08: CPU time 74.40
      LOOP: VPU time 138.26: CPU time 76.82
      LOOP: VPU time 146.69: CPU time 81.61
      LOOP: VPU time 138.44: CPU time 76.86
      LOOP: VPU time 140.98: CPU time 78.38
      LOOP: VPU time 138.54: CPU time 76.99
      LOOP: VPU time 106.45: CPU time 58.89
      LOOP+: VPU time 1998.97: CPU time 1148.74
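
  The thread-count comparison above can be scripted inside a single 4way job; a csh sketch (the OUTCAR renaming is illustrative bookkeeping):

      # rerun the same test case with 1, 2 and 4 BLAS threads per task
      foreach nt (1 2 4)
          setenv OMP_NUM_THREADS $nt
          ibrun tacc_affinity ./vasp
          mv OUTCAR OUTCAR.${nt}threads   # keep each run's output apart
      end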

  15. Summary and Outlook
  • VASP compiles straightforwardly and delivers reasonable performance
  • When linking multi-threaded libraries, set the proper number of threads and use NUMA control commands
  • Multi-threaded GotoBLAS gives a clear performance improvement
  • ScaLAPACK: may not scale very well
  • Task geometry: can a specific process-thread arrangement minimize communication cost?
