Parallel CC & Petaflop Applications - PowerPoint PPT Presentation

Presentation Transcript

  1. Parallel CC & Petaflop Applications Ryan Olson, Cray, Inc.

  2. Did you know … • Teraflop - Current • Petaflop - Imminent • What’s next? • Exaflop • Zettaflop • YOTTAflop!

  3. Outline • Sanibel Symposium topics, and what this talk covers for each: • Programming Models --> The Distributed Data Interface • Parallel CC Implementations --> GAMESS MP-CCSD(T) • Benchmarks --> O vs. V • Petascale Applications --> Local & Many-Body Methods

  4. Programming Models: The Distributed Data Interface (DDI) • Programming interface, not programming model • Choose the key functionality from the best programming models and provide: • A common interface - simple and portable • A general implementation • Provide an interface to: • SPMD: TCGMSG, MPI • AMOs: SHMEM, GA • SMPs: OpenMP, pThreads • SIMD: GPUs, vector directives, SSE, etc. • Use the best models for the underlying hardware.
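To make the "interface, not model" idea concrete, here is a minimal single-process sketch of a DDI-style one-sided API. The class name and layout are hypothetical (the real DDI is a Fortran/C library inside GAMESS); the sketch only mimics the three core operations the later slides use: PUT, GET and ACC (+=) on a matrix whose columns are divided among P processes.

```python
class ToyDDI:
    """Toy stand-in for a DDI-style distributed matrix (hypothetical API)."""

    def __init__(self, nrows, ncols, nproc):
        self.nrows, self.ncols, self.nproc = nrows, ncols, nproc
        # Each process owns a contiguous block of columns (its "subpatch"),
        # modeled here as one dict of (row, col) -> value per process.
        self.patches = [dict() for _ in range(nproc)]

    def owner(self, col):
        """Which process owns this column (even column-wise distribution)."""
        per = -(-self.ncols // self.nproc)  # ceiling division
        return col // per

    def put(self, row, col, value):
        """One-sided write into the owning process's patch."""
        self.patches[self.owner(col)][(row, col)] = value

    def get(self, row, col):
        """One-sided read; untouched elements read as 0.0."""
        return self.patches[self.owner(col)].get((row, col), 0.0)

    def acc(self, row, col, value):
        """One-sided accumulate: remote element += value."""
        patch = self.patches[self.owner(col)]
        patch[(row, col)] = patch.get((row, col), 0.0) + value
```

In a real implementation these calls map onto whichever underlying model the hardware supports best (SHMEM, MPI-2 one-sided, or data-server processes), which is exactly the portability argument of this slide.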

  5. Overview • Application level: GAMESS • High-level API: Distributed Data Interface (DDI) • Native implementations: SHMEM / GPSHMEM, MPI-2 • Non-native implementations: MPI-1 + GA, MPI-1, TCP/IP • Hardware APIs: Elan, GM, etc.; System V IPC


  6. Programming Models: The Distributed Data Interface • Overview • Virtual Shared-Memory Model (Native) • Cluster Implementation (Non-Native) • Shared Memory/SMP Awareness • Clusters of SMP (DDI versions 2-3) • Goal: Multilevel Parallelism • Intra/Inter-node Parallelism • Maximize Data Locality • Minimize Latency / Maximize Bandwidth

  7. Virtual Shared Memory Model • Distributed matrix: DDI_Create(Handle, NRows, NCols) creates an NRows × NCols matrix divided column-wise into subpatches, one per CPU (CPU0–CPU3 each own one subpatch of the distributed memory storage). • Key point: the physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
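The column-wise division behind DDI_Create can be sketched as a small helper (a hypothetical function, not the actual DDI routine): each of the P CPUs gets a contiguous subpatch of columns, with any remainder spread over the first few CPUs.

```python
def column_patch(ncols, nproc, rank):
    """Return the half-open [first, last) column range owned by `rank`.

    Columns are divided as evenly as possible: the first (ncols mod nproc)
    ranks each get one extra column.
    """
    base, extra = divmod(ncols, nproc)
    first = rank * base + min(rank, extra)
    last = first + base + (1 if rank < extra else 0)
    return first, last
```

For example, a 10-column matrix on 4 CPUs splits into the ranges (0, 3), (3, 6), (6, 8), (8, 10), so every column has exactly one owner.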

  8. Non-Native Implementations (and lost opportunities …) • Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3) • Compute processes (0–3) issue PUT, GET and ACC (+=) operations • Data servers (4–7) hold the distributed memory storage (on separate data servers)

  9. DDI till 2003 …

  10. System V Shared Memory (Fast Model) • Compute processes (0–3) issue PUT, GET and ACC (+=) operations • Data servers (4–7) serve the distributed memory storage, now held in System V shared memory segments

  11. DDI v2 - Full SMP Awareness • Compute processes (0–3) access node-local shared memory segments directly; data servers (4–7) handle remote PUT, GET and ACC (+=) operations • Distributed memory storage in separate System V shared memory segments

  12. Proof of Principle - 2003 • UMP2 gradient calculation - 380 basis functions • Dual AMD MP2200 cluster using the SCI network (2003 results) • Note: DDI v1 was especially problematic on the SCI network.

  13. DDI v2 • The DDI library is SMP aware. • It offers new interfaces to make applications SMP aware. • DDI programs inherit improvements in the library, but they do not automatically become SMP aware unless they use the new interfaces.

  14. Parallel CC and Threads(Shared Memory Parallelism) • Bentz and Kendall • Parallel BLAS3 • WOMPAT ‘05 • OpenMP • Parallelized Remaining Terms • Proof of Principle

  15. Results • Au4 ==> GOOD • CCSD and (T) costs are comparable • No disk I/O problems • Both CCSD and (T) scale well • Au+(C3H6) ==> POOR/AVERAGE • CCSD scales poorly due to the I/O vs. FLOP balance • (T) scales well, but is overshadowed by the bad CCSD performance • Au8 ==> GOOD • CCSD scales reasonably (greater FLOP count, about equal I/O) • The N7 (T) step dominates the relatively small CCSD time • (T) scales well, so the overall performance is good

  16. Detailed Speedups …

  17. DDI v3: Shared Memory for ALL • Compute processes and data servers on a node all attach to the aggregate distributed storage • Replicated storage: ~500 MB – 1 GB • Shared memory: ~1 GB – 12 GB • Distributed memory: ~10 – 1000 GB

  18. DDI v3 • Memory Hierarchy • Replicated, Shared and Distributed • Program Models • Traditional DDI • Multilevel Model • DDI Groups (a different talk) • Multilevel Models • Intra/Internode Parallelism • Superset of MPI/OpenMP and/or MPI/pThreads models • MPI lacks “true” one-sided messaging

  19. Parallel Coupled Cluster(Topics) • Data Distribution for CCSD(T) • Integrals Distributed • Amplitudes in Shared Memory once per node • Direct [vv|vv] term • Parallelism based on Data Locality • First Generation Algorithm • Ignore I/O • Focus on Data and FLOP parallelism

  20. Important Array Sizes (in GB) • [vv|oo], [vo|vo] and the T2 amplitudes: o²v² elements • [vv|vo]: ov³ elements

  21. MO Based Terms

  22. Some code …
      DO 123 I=1,NU
         IOFF=NO2U*(I-1)+1
         CALL RDVPP(I,NO,NU,TI)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,TI,NU2,ONE,
     &              T2(IOFF),NO2)
  123 CONTINUE
      CALL TRMD(O2,TI,NU,NO,20)
      CALL TRMD(VR,TI,NU,NO,21)
      CALL VECMUL(O2,NO2U2,HALF)
      CALL ADT12(1,NO,NU,O1,O2,4)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,VR,NOU,O2,NOU,ONE,VL,NOU)
      CALL ADT12(2,NO,NU,O1,O2,4)
      CALL VECMUL(O2,NO2U2,TWO)
      CALL TRMD(O2,TI,NU,NO,27)
      CALL TRMD(T2,TI,NU,NO,28)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
      CALL TRANMD(O2,NO,NU,NU,NO,23)
      CALL TRANMD(T2,NO,NU,NU,NO,23)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
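As a reading aid, here is a plain-Python paraphrase of what the DO 123 loop appears to do (the semantics are inferred from the call names, so treat the helper names as assumptions): for each virtual orbital index I, one integral block TI is read and O2 × TI is accumulated into the I-th slab of the T2 amplitudes, which is exactly the role of the DGEMM call with beta = ONE.

```python
def matmul_acc(C, A, B):
    """C += A @ B for lists of lists (a stand-in for DGEMM with beta = 1)."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] += sum(A[i][k] * B[k][j] for k in range(len(B)))

def mo_term(no2, nu, nu2, O2, read_block):
    """Paraphrase of the DO 123 loop (hypothetical helper names).

    T2 is stored as nu slabs of shape (no2 x nu), one per virtual index I,
    mirroring the IOFF = NO2U*(I-1)+1 offset arithmetic in the Fortran.
    """
    T2 = [[[0.0] * nu for _ in range(no2)] for _ in range(nu)]
    for i in range(nu):
        TI = read_block(i)           # like CALL RDVPP(I,NO,NU,TI)
        matmul_acc(T2[i], O2, TI)    # like DGEMM('N','N',NO2,NU,NU2,...)
    return T2
```

The point of the original code is that the expensive work is pushed into DGEMM, so the term runs at near-peak FLOP rates on each node.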

  23. MO Parallelization • Each node holds its own portions of [vo*|vo*], [vv|o*o*] and [vv|v*o*] (starred indices distributed) and updates its own patch of the T2 solution • Goal: disjoint updates to the solution matrix. Avoid locking/critical sections whenever possible.

  24. Direct [VV|VV] Term
  do μ = 1,nshell
    do ν = 1,nshell
      compute the AO integrals for the (μν) shell pair
      transform the occupied indices
    end do
  end do
  transform, contract, then PUT the half-transformed columns for each ij
  (No·No occupied-index columns of length Nbf²), spread over processes 0 … P-1
  synchronize
  for each “local” ij column do
    GET the required blocks
    reorder: shell --> AO order
    transform
    STORE in the “local” solution vector
  end do

  25. (T) Parallelism • Trivial -- in theory • [vv|vo] distributed • v3 work arrays • at large v -- stored in shared memory • disjoint updates where both quantities are shared

  26. Timings … • (H2O)6 Prism - aug’-cc-pVTZ • Fastest timing: < 6 hours on 8x8 Power5

  27. Improvements … • Semi-Direct [vv|vv] term (IKCUT) • Concurrent MO terms • Generalized amplitudes storage

  28. Semi-Direct [VV|VV] Term
  do I = 1,nshell      ! I-SHELL
    do K = 1,I         ! K-SHELL
      compute and transform as in the direct algorithm
      store the (I,K) integral batch if: LEN(I)+LEN(K) > IKCUT
    end do
  end do
  • Define IKCUT • Automatic contention avoidance • Adjustable: fully direct to fully conventional.
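The IKCUT criterion from this slide can be sketched in a few lines (the function names are hypothetical; only the LEN(I)+LEN(K) > IKCUT test is from the slide): large shell-pair batches are stored and reused, small ones are recomputed each iteration, so one parameter slides the algorithm between fully direct and fully conventional.

```python
def should_store(len_i, len_k, ikcut):
    """Store the (I, K) integral batch if LEN(I) + LEN(K) > IKCUT."""
    return len_i + len_k > ikcut

def partition_shells(shell_sizes, ikcut):
    """Split all (I, K) shell pairs with K <= I into stored vs recomputed."""
    stored, recomputed = [], []
    for i, li in enumerate(shell_sizes):
        for k in range(i + 1):
            pair = (i, k)
            if should_store(li, shell_sizes[k], ikcut):
                stored.append(pair)    # kept on disk, read back later
            else:
                recomputed.append(pair)  # regenerated every iteration
    return stored, recomputed
```

Setting IKCUT above the largest possible LEN(I)+LEN(K) stores nothing (fully direct); setting it to zero stores everything (fully conventional), matching the "adjustable" bullet above.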

  29. Semi-Direct [vv|vv] Timings • Water tetramer / aug’-cc-pVTZ • Storage: shared NFS mount (a bad example). Local disk or a higher-quality parallel file system (Lustre, etc.) should perform better. • However: GPUs generate AOs much faster than they can be read off the disk.

  30. Concurrency • Should everything be N-ways parallel? NO • Biggest mistake: parallelizing every MO term over all cores • The fix: concurrency

  31. Concurrent MO Terms • The [vv|vv] term and the MO terms run concurrently on disjoint sets of nodes • MO terms are parallelized over the minimum number of nodes that is still efficient and fast • MO nodes join the [vv|vv] term already in progress … dynamic load balancing

  32. Adaptive Computing • Self adjusting / self tuning • Concurrent MO terms • Value of IKCUT • Use the iterations to improve the calculation: • Adjust initial node assignments • Increase IKCUT • Monte Carlo approach to tuning parameters

  33. Conclusions … • A good first start … • [vv|vv] scales perfectly with node count • Multilevel parallelism • Adjustable I/O usage • A lot left to do … • Improve intra-node memory bottlenecks • Concurrent MO terms • Generalized amplitude storage • Adaptive computing • Use the knowledge from these hand-coded methods to refine the CS structure in automated methods

  34. Acknowledgements • People: Mark Gordon, Mike Schmidt, Jonathan Bentz, Ricky Kendall, Alistair Rendell • Funding: DoE SciDAC, SCL (Ames Lab), APAC / ANU, NSF MSI

  35. Petaflop Applications (benchmarks, too) • Petaflop = ~125,000 2.2 GHz AMD Opteron cores • O vs. V • Small O, big V ==> CBS limit • Big O ==> see below • Local and many-body methods • FMO, EE-MB, etc. - use existing parallel methods • Sampling
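The core-count claim is easy to sanity-check. Assuming 4 floating-point operations per cycle per core (a common peak figure for Opterons of that era; the flops-per-cycle number is our assumption, not from the slide):

```python
# Back-of-the-envelope check of "petaflop = ~125,000 2.2 GHz Opteron cores".
ghz = 2.2
flops_per_cycle = 4                        # assumed peak rate per core
per_core = ghz * 1e9 * flops_per_cycle     # 8.8e9 FLOPS peak per core
cores_at_peak = 1e15 / per_core            # cores needed for 1 PF at peak
```

This gives roughly 114,000 cores at theoretical peak, consistent with the slide's ~125,000 once real applications run somewhat below peak.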