
Download Presentation

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


SimuTools, Malaga, Spain

March 16, 2010

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

Kalyan S. Perumalla, Ph.D.

Senior R&D Manager, Oak Ridge National Laboratory

Adjunct Professor, Georgia Institute of Technology


In a Nutshell

Dramatic improvements in speed


Outline


Agent Based Modeling and Simulation (ABMS)

Game of Life

Afghan Leadership

ABMS: Motivating Demonstrations

GOL (Game of Life)

LDR (Afghan Leadership)
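For concreteness, the update rule driving the Game of Life (GOL) demonstration can be sketched in a few lines of Python. This is an illustrative serial version, not the GPU code from the talk; it uses a non-periodic boundary (out-of-range neighbors count as dead).

```python
def life_step(grid):
    """One synchronous Game of Life step on a 2D list of 0/1 cells.

    Boundary cells treat out-of-range neighbors as dead (non-periodic).
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the 8 neighbors, clipping at the grid edge.
            n = sum(
                grid[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ) - grid[r][c]
            # Survival with 2-3 live neighbors, birth with exactly 3.
            nxt[r][c] = 1 if (n == 3 or (n == 2 and grid[r][c])) else 0
    return nxt
```

Because every cell depends only on its immediate neighbors, each step is embarrassingly data-parallel, which is what makes GOL a natural GPU benchmark.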


GPU-based ABMS References

  • Examples:

    • K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008

    • R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007


Hierarchical GPU System Hardware


Host initiates “launch” of many SIMD threads

Threads get “scheduled” in batches on GPU hardware

CUDA claims extremely efficient thread-launch implementation

Millions of CUDA threads at once

Computation Kernels on each GPU (e.g., CUDA Threads)


GPU Memory Types (CUDA)

  • GPU memory comes in several flavors

    • Registers

    • Local Memory

    • Shared Memory

    • Constant Memory

    • Global Memory

    • Texture Memory

  • An important challenge is organizing the application to make the most effective use of this memory hierarchy


GPU Communication Latencies (CUDA)


CUDA + MPI

  • An economical cluster solution

    • Affordable GPUs, each providing one-node CUDA

    • MPI on gigabit Ethernet for inter-node communication

  • Memory speed-constrained system

    • Inter-memory transfers can dominate runtime

    • Runtime overhead can be severe

  • Need a way to tie CUDA and MPI

    • Algorithmic solution needed

    • Need to overcome latency challenge


Analogous Networked Multi-core System


[Figure: a 3×3 grid of blocks Block0,0 … Block2,2, each of side B, mapped one-to-one onto processors P0,0 … P2,2]

Parallel Execution: Conventional Method


Latency Challenge: Conventional Method

  • High latency between GPU and CPU memories

    • CUDA inter-memory data transfer primitives

  • Very high latency across CPU memories

    • MPI communication for data transfers

  • Naïve method gives a very poor computation-to-communication ratio

    • Slow-downs instead of speedups

  • Need a latency-resilient method …
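The latency problem can be caricatured with a back-of-the-envelope model of the conventional method, which pays one full GPU↔CPU copy plus MPI round trip per simulated step. The constants below are assumed for illustration only, not measurements from the talk:

```python
# Assumed, illustrative constants (microseconds):
LATENCY_US = 5000.0    # one GPU<->CPU halo copy + MPI round trip per step
CELL_COST_US = 0.01    # per-cell update cost on the GPU

def step_time_us(block_side):
    """Conventional method: one halo exchange every simulated step.

    Returns (total time, compute time) for one step on a square block.
    """
    compute = CELL_COST_US * block_side * block_side
    return LATENCY_US + compute, compute

total, compute = step_time_us(32)
# For a modest block, compute is a tiny fraction of the step time:
# almost all of each step is spent waiting on the exchange.
```

With numbers in this regime, the compute fraction is well under 1%, which is why the naïve method can produce slowdowns rather than speedups.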


[Figure: the same 3×3 block grid, with each block of size B extended by a surrounding halo region of width R]

Our Solution: B2R Method


B2R Algorithm
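The idea of the B2R algorithm can be sketched in Python on a 1D ring automaton (rule 90, i.e., XOR of neighbors) for brevity; the talk's 2D Game of Life version works the same way. Each block keeps a halo of width R, exchanges it once, and then takes R communication-free local steps over a region that shrinks by one cell per side per step. This is an illustrative reconstruction, not the authors' code:

```python
def step_ref(state):
    """Reference: one global step of rule 90 (XOR of the two neighbors),
    periodic boundary."""
    n = len(state)
    return [state[(i - 1) % n] ^ state[(i + 1) % n] for i in range(n)]

def b2r_run(state, P, R, rounds):
    """B2R on a 1D ring: P blocks of B cells, halo width R,
    one halo exchange per R steps instead of one per step."""
    n = len(state)
    assert n % P == 0
    B = n // P
    blocks = [state[b * B:(b + 1) * B] for b in range(P)]
    exchanges = 0
    for _ in range(rounds):
        # Halo exchange: R cells from each neighbor, all taken at
        # round start (one message per side per block).
        padded = []
        for b in range(P):
            left = blocks[(b - 1) % P][-R:]
            right = blocks[(b + 1) % P][:R]
            padded.append(left + blocks[b] + right)
            exchanges += 2
        # R local steps; the valid region shrinks by 1 cell per side
        # per step, ending exactly at the B owned cells.
        for b in range(P):
            a = padded[b]
            for _ in range(R):
                a = [a[i - 1] ^ a[i + 1] for i in range(1, len(a) - 1)]
            blocks[b] = a
    return [c for blk in blocks for c in blk], exchanges
```

The result is bit-identical to stepping the global state, but a run of S steps needs only S/R exchanges per block instead of S, which is the source of the latency resilience.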


Total Runtime Cost: Analytical Form

At any level in the hierarchy, total runtime F is given by:

Most interesting aspect: cubic in R!
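The shape of the runtime can be explored numerically with an illustrative cost model for one B2R round on a 2D block. The constants are assumed for illustration, not the paper's exact expression, but the structure matches the slide: the halo transfer is quadratic in R and the shrinking-region compute sum is cubic in R.

```python
def round_cost(R, B=32, L=5000.0, t=1.0, c=0.01):
    """Illustrative B2R cost of one round of R steps on a BxB block.

    Assumed constants: L = per-exchange latency, t = per-halo-cell
    transfer cost, c = per-cell compute cost.
    """
    # One exchange of an R-wide halo: (B+2R)^2 - B^2 = 4R(B+R) cells.
    comm = L + t * 4 * (B + R) * R
    # Sub-step k updates a (B + 2(R-k))-sided region; summing these
    # squares over the round is what makes the cost cubic in R.
    comp = c * sum((B + 2 * j) ** 2 for j in range(R))
    return comm + comp

def per_step_cost(R, **kw):
    """Amortize one round over its R simulated steps."""
    return round_cost(R, **kw) / R
```

Plotting `per_step_cost` over R reproduces the trend discussed next: a high cost at R = 1 (the conventional method), a pronounced interior minimum at moderate R, and growth again at large R as the cubic compute term dominates.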


Implications of being Cubic in R

  • Benefits with B2R are not immediately seen for small R

    • In fact, there is degradation for small R!

  • Dramatic improvement is possible beyond small R

    • Our experiments confirm this trend!

  • Too large an R is also bad

    • One can't profit indefinitely!

[Plot: total execution time vs. R — high at small R, a minimum at moderate R, rising again for large R]


Sub-division Across Levels (e.g., MPI to Blocks to Threads)

MPI: Rm

Block: Rb

Thread: Rt


Hierarchy and Recursive Use of B & R

B2R can be applied at all levels!

E.g., CUDA Hierarchy

  • A different R can be chosen at every level, e.g.:

    • Rb for block-level R

    • Rt for thread-level R

  • Simple constraints exist for possible values of R

    • Between R and B

    • Between R’s at different levels

    • Details in our paper


B2R Implementation within CUDA


Performance

Over 100× speedup with MPI+CUDA

Speedup relative to naïve method with no latency-hiding


Multi-GPU MPI+CUDA – Game of Life


Multi-core MPI+pthreads – Game of Life




Multi-core MPI+pthreads – Leadership


Summary

  • B2R Algorithm applies across heterogeneous, hierarchical platforms

    • Deep GPU hierarchies

    • Deep CPU multi-core systems

  • Cubic nature of the runtime's dependence on R is a remarkable aspect

    • A maximum and minimum exist

    • Optimal (minimum) can be dramatically low

  • Results show clear performance improvement

    • Up to 150× in the best case (fine-grained)


Future Work

  • Generate cross-platform code

    • E.g., implement in OpenCL

  • Add to CUDA-MPI levels

    • Multi-GPU per node

  • Implement and test with more benchmarks

    • E.g., from existing ABMS suites such as NetLogo & Repast

  • Generalize to unstructured inter-agent graphs

    • E.g., Social networks

  • Potential to apply to other domains

    • E.g., Stencil computations


Thank you! Questions?

Additional material at our webpage:

Discrete Computing Systems

www.ornl.gov/~2ip
