Parallel CC & Petaflop Applications - PowerPoint PPT Presentation

Presentation Transcript

  1. Parallel CC & Petaflop Applications Ryan Olson, Cray, Inc.

  2. Did you know … • Teraflop - Current • Petaflop - Imminent • What’s next? • Exaflop • Zettaflop • YOTTAflop!

  3. Outline • Sanibel Symposium topics, and what this talk covers for each: • Programming Models --> The Distributed Data Interface • Parallel CC Implementations --> GAMESS MP-CCSD(T) • Benchmarks --> O vs. V • Petascale Applications --> Local & Many-Body Methods

  4. Programming Models: The Distributed Data Interface (DDI) • Programming interface, not programming model • Choose the key functionality from the best programming models and provide: • A common interface - simple and portable • A general implementation • Provide an interface to: • SPMD: TCGMSG, MPI • AMOs: SHMEM, GA • SMPs: OpenMP, pThreads • SIMD: GPUs, vector directives, SSE, etc. • Use the best models for the underlying hardware.
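To make the "interface, not model" idea concrete, here is a minimal single-process sketch of a DDI-style one-sided API. The class name and layout are hypothetical (the real DDI is a Fortran/C library inside GAMESS); the sketch only mimics the three core operations the later slides use: PUT, GET and ACC (+=) on a matrix whose columns are divided among P processes.

```python
class ToyDDI:
    """Toy stand-in for a DDI-style distributed matrix (hypothetical API)."""

    def __init__(self, nrows, ncols, nproc):
        self.nrows, self.ncols, self.nproc = nrows, ncols, nproc
        # Each process owns a contiguous block of columns (its "subpatch"),
        # modeled here as one dict of (row, col) -> value per process.
        self.patches = [dict() for _ in range(nproc)]

    def owner(self, col):
        """Which process owns this column (even column-wise distribution)."""
        per = -(-self.ncols // self.nproc)  # ceiling division
        return col // per

    def put(self, row, col, value):
        """One-sided write into the owning process's patch."""
        self.patches[self.owner(col)][(row, col)] = value

    def get(self, row, col):
        """One-sided read; untouched elements read as 0.0."""
        return self.patches[self.owner(col)].get((row, col), 0.0)

    def acc(self, row, col, value):
        """One-sided accumulate: remote element += value."""
        patch = self.patches[self.owner(col)]
        patch[(row, col)] = patch.get((row, col), 0.0) + value
```

In a real implementation these calls map onto whichever underlying model the hardware supports best (SHMEM, MPI-2 one-sided, or data-server processes), which is exactly the portability argument of this slide.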

  5. Overview • Application level: GAMESS • High-level API: Distributed Data Interface (DDI) • Native implementations: SHMEM / GPSHMEM, MPI-2 • Non-native implementations: MPI-1 + GA, MPI-1, TCP/IP • Hardware APIs: Elan, GM, etc.; System V IPC


  6. Programming Models: The Distributed Data Interface • Overview • Virtual Shared-Memory Model (Native) • Cluster Implementation (Non-Native) • Shared Memory/SMP Awareness • Clusters of SMP (DDI versions 2-3) • Goal: Multilevel Parallelism • Intra/Inter-node Parallelism • Maximize Data Locality • Minimize Latency / Maximize Bandwidth

  7. Virtual Shared Memory Model • Distributed matrix: DDI_Create(Handle, NRows, NCols) creates an NRows × NCols matrix divided column-wise into subpatches, one per CPU (CPU0–CPU3 each own one subpatch of the distributed memory storage). • Key point: the physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
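The column-wise division behind DDI_Create can be sketched as a small helper (a hypothetical function, not the actual DDI routine): each of the P CPUs gets a contiguous subpatch of columns, with any remainder spread over the first few CPUs.

```python
def column_patch(ncols, nproc, rank):
    """Return the half-open [first, last) column range owned by `rank`.

    Columns are divided as evenly as possible: the first (ncols mod nproc)
    ranks each get one extra column.
    """
    base, extra = divmod(ncols, nproc)
    first = rank * base + min(rank, extra)
    last = first + base + (1 if rank < extra else 0)
    return first, last
```

For example, a 10-column matrix on 4 CPUs splits into the ranges (0, 3), (3, 6), (6, 8), (8, 10), so every column has exactly one owner.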

  8. Non-Native Implementations (and lost opportunities …) • Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3) • Compute processes (0–3) issue PUT, GET and ACC (+=) operations • Data servers (4–7) hold the distributed memory storage (on separate data servers)

  9. DDI till 2003 …

  10. System V Shared Memory (Fast Model) • Compute processes (0–3) issue PUT, GET and ACC (+=) operations • Data servers (4–7) serve the distributed memory storage, now held in System V shared memory segments

  11. DDI v2 - Full SMP Awareness • Compute processes (0–3) access node-local shared memory segments directly; data servers (4–7) handle remote PUT, GET and ACC (+=) operations • Distributed memory storage in separate System V shared memory segments

  12. Proof of Principle - 2003 • UMP2 gradient calculation - 380 basis functions • Dual AMD MP2200 cluster using the SCI network (2003 results) • Note: DDI v1 was especially problematic on the SCI network.

  13. DDI v2 • The DDI library is SMP aware. • It offers new interfaces to make applications SMP aware. • DDI programs inherit improvements in the library, but they do not automatically become SMP aware unless they use the new interfaces.

  14. Parallel CC and Threads(Shared Memory Parallelism) • Bentz and Kendall • Parallel BLAS3 • WOMPAT ‘05 • OpenMP • Parallelized Remaining Terms • Proof of Principle

  15. Results • Au4 ==> GOOD • CCSD and (T) costs are comparable • No disk I/O problems • Both CCSD and (T) scale well • Au+(C3H6) ==> POOR/AVERAGE • CCSD scales poorly due to the I/O vs. FLOP balance • (T) scales well, but is overshadowed by the bad CCSD performance • Au8 ==> GOOD • CCSD scales reasonably (greater FLOP count, about equal I/O) • The N7 (T) step dominates the relatively small CCSD time • (T) scales well, so the overall performance is good

  16. Detailed Speedups …

  17. DDI v3: Shared Memory for ALL • Compute processes and data servers on a node all attach to the aggregate distributed storage • Replicated storage: ~500 MB – 1 GB • Shared memory: ~1 GB – 12 GB • Distributed memory: ~10 – 1000 GB

  18. DDI v3 • Memory Hierarchy • Replicated, Shared and Distributed • Program Models • Traditional DDI • Multilevel Model • DDI Groups (a different talk) • Multilevel Models • Intra/Internode Parallelism • Superset of MPI/OpenMP and/or MPI/pThreads models • MPI lacks “true” one-sided messaging

  19. Parallel Coupled Cluster(Topics) • Data Distribution for CCSD(T) • Integrals Distributed • Amplitudes in Shared Memory once per node • Direct [vv|vv] term • Parallelism based on Data Locality • First Generation Algorithm • Ignore I/O • Focus on Data and FLOP parallelism

  20. Important Array Sizes (in GB) • [vv|oo], [vo|vo] and the T2 amplitudes: o²v² elements • [vv|vo]: ov³ elements

  21. MO Based Terms

  22. Some code …
      DO 123 I=1,NU
         IOFF=NO2U*(I-1)+1
         CALL RDVPP(I,NO,NU,TI)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,TI,NU2,ONE,
     &              T2(IOFF),NO2)
  123 CONTINUE
      CALL TRMD(O2,TI,NU,NO,20)
      CALL TRMD(VR,TI,NU,NO,21)
      CALL VECMUL(O2,NO2U2,HALF)
      CALL ADT12(1,NO,NU,O1,O2,4)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,VR,NOU,O2,NOU,ONE,VL,NOU)
      CALL ADT12(2,NO,NU,O1,O2,4)
      CALL VECMUL(O2,NO2U2,TWO)
      CALL TRMD(O2,TI,NU,NO,27)
      CALL TRMD(T2,TI,NU,NO,28)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
      CALL TRANMD(O2,NO,NU,NU,NO,23)
      CALL TRANMD(T2,NO,NU,NU,NO,23)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
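As a reading aid, here is a plain-Python paraphrase of what the DO 123 loop appears to do (the semantics are inferred from the call names, so treat the helper names as assumptions): for each virtual orbital index I, one integral block TI is read and O2 × TI is accumulated into the I-th slab of the T2 amplitudes, which is exactly the role of the DGEMM call with beta = ONE.

```python
def matmul_acc(C, A, B):
    """C += A @ B for lists of lists (a stand-in for DGEMM with beta = 1)."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] += sum(A[i][k] * B[k][j] for k in range(len(B)))

def mo_term(no2, nu, nu2, O2, read_block):
    """Paraphrase of the DO 123 loop (hypothetical helper names).

    T2 is stored as nu slabs of shape (no2 x nu), one per virtual index I,
    mirroring the IOFF = NO2U*(I-1)+1 offset arithmetic in the Fortran.
    """
    T2 = [[[0.0] * nu for _ in range(no2)] for _ in range(nu)]
    for i in range(nu):
        TI = read_block(i)           # like CALL RDVPP(I,NO,NU,TI)
        matmul_acc(T2[i], O2, TI)    # like DGEMM('N','N',NO2,NU,NU2,...)
    return T2
```

The point of the original code is that the expensive work is pushed into DGEMM, so the term runs at near-peak FLOP rates on each node.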

  23. MO Parallelization • Each node holds its own portions of [vo*|vo*], [vv|o*o*] and [vv|v*o*] (starred indices distributed) and updates its own patch of the T2 solution • Goal: disjoint updates to the solution matrix. Avoid locking/critical sections whenever possible.

  24. Direct [VV|VV] Term
  do μ = 1,nshell
    do ν = 1,nshell
      compute the AO integrals for the (μν) shell pair
      transform the occupied indices
    end do
  end do
  transform, contract, then PUT the half-transformed columns for each ij
  (No·No occupied-index columns of length Nbf²), spread over processes 0 … P-1
  synchronize
  for each “local” ij column do
    GET the required blocks
    reorder: shell --> AO order
    transform
    STORE in the “local” solution vector
  end do

  25. (T) Parallelism • Trivial -- in theory • [vv|vo] distributed • v3 work arrays • at large v -- stored in shared memory • disjoint updates where both quantities are shared

  26. Timings … • (H2O)6 Prism - aug’-cc-pVTZ • Fastest timing: < 6 hours on 8x8 Power5

  27. Improvements … • Semi-Direct [vv|vv] term (IKCUT) • Concurrent MO terms • Generalized amplitudes storage

  28. Semi-Direct [VV|VV] Term
  do I = 1,nshell      ! I-SHELL
    do K = 1,I         ! K-SHELL
      compute and transform as in the direct algorithm
      store the (I,K) integral batch if: LEN(I)+LEN(K) > IKCUT
    end do
  end do
  • Define IKCUT • Automatic contention avoidance • Adjustable: fully direct to fully conventional.
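The IKCUT criterion from this slide can be sketched in a few lines (the function names are hypothetical; only the LEN(I)+LEN(K) > IKCUT test is from the slide): large shell-pair batches are stored and reused, small ones are recomputed each iteration, so one parameter slides the algorithm between fully direct and fully conventional.

```python
def should_store(len_i, len_k, ikcut):
    """Store the (I, K) integral batch if LEN(I) + LEN(K) > IKCUT."""
    return len_i + len_k > ikcut

def partition_shells(shell_sizes, ikcut):
    """Split all (I, K) shell pairs with K <= I into stored vs recomputed."""
    stored, recomputed = [], []
    for i, li in enumerate(shell_sizes):
        for k in range(i + 1):
            pair = (i, k)
            if should_store(li, shell_sizes[k], ikcut):
                stored.append(pair)    # kept on disk, read back later
            else:
                recomputed.append(pair)  # regenerated every iteration
    return stored, recomputed
```

Setting IKCUT above the largest possible LEN(I)+LEN(K) stores nothing (fully direct); setting it to zero stores everything (fully conventional), matching the "adjustable" bullet above.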

  29. Semi-Direct [vv|vv] Timings • Water tetramer / aug’-cc-pVTZ • Storage: shared NFS mount (a bad example). Local disk or a higher-quality parallel file system (Lustre, etc.) should perform better. • However: GPUs generate AOs much faster than they can be read off the disk.

  30. Concurrency • Should everything be N-ways parallel? NO • Biggest mistake: parallelizing every MO term over all cores • The fix: concurrency

  31. Concurrent MO Terms • The [vv|vv] term and the MO terms run concurrently on disjoint sets of nodes • MO terms are parallelized over the minimum number of nodes that is still efficient and fast • MO nodes join the [vv|vv] term already in progress … dynamic load balancing

  32. Adaptive Computing • Self adjusting / self tuning • Concurrent MO terms • Value of IKCUT • Use the iterations to improve the calculation: • Adjust initial node assignments • Increase IKCUT • Monte Carlo approach to tuning parameters

  33. Conclusions … • A good first start … • [vv|vv] scales perfectly with node count • Multilevel parallelism • Adjustable I/O usage • A lot left to do … • Improve intra-node memory bottlenecks • Concurrent MO terms • Generalized amplitude storage • Adaptive computing • Use the knowledge from these hand-coded methods to refine the CS structure in automated methods

  34. Acknowledgements • People: Mark Gordon, Mike Schmidt, Jonathan Bentz, Ricky Kendall, Alistair Rendell • Funding: DoE SciDAC, SCL (Ames Lab), APAC / ANU, NSF MSI

  35. Petaflop Applications (benchmarks, too) • Petaflop = ~125,000 2.2 GHz AMD Opteron cores • O vs. V • Small O, big V ==> CBS limit • Big O ==> see below • Local and many-body methods • FMO, EE-MB, etc. - use existing parallel methods • Sampling
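The core-count claim is easy to sanity-check. Assuming 4 floating-point operations per cycle per core (a common peak figure for Opterons of that era; the flops-per-cycle number is our assumption, not from the slide):

```python
# Back-of-the-envelope check of "petaflop = ~125,000 2.2 GHz Opteron cores".
ghz = 2.2
flops_per_cycle = 4                        # assumed peak rate per core
per_core = ghz * 1e9 * flops_per_cycle     # 8.8e9 FLOPS peak per core
cores_at_peak = 1e15 / per_core            # cores needed for 1 PF at peak
```

This gives roughly 114,000 cores at theoretical peak, consistent with the slide's ~125,000 once real applications run somewhat below peak.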