
Download Presentation

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


SimuTools, Malaga, Spain

March 16, 2010

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

Kalyan S. Perumalla, Ph.D.

Senior R&D Manager, Oak Ridge National Laboratory

Adjunct Professor, Georgia Institute of Technology


In a Nutshell

Dramatic improvements in speed


Outline


Agent Based Modeling and Simulation (ABMS)

Game of Life

Afghan Leadership

ABMS: Motivating Demonstrations

GOL (Game of Life)

LDR (Afghan Leadership)
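For concreteness, the update rule driving the Game of Life (GOL) demonstration can be sketched in a few lines of Python. This is an illustrative serial version, not the GPU code from the talk; it uses a non-periodic boundary (out-of-range neighbors count as dead).

```python
def life_step(grid):
    """One synchronous Game of Life step on a 2D list of 0/1 cells.

    Boundary cells treat out-of-range neighbors as dead (non-periodic).
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the 8 neighbors, clipping at the grid edge.
            n = sum(
                grid[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ) - grid[r][c]
            # Survival with 2-3 live neighbors, birth with exactly 3.
            nxt[r][c] = 1 if (n == 3 or (n == 2 and grid[r][c])) else 0
    return nxt
```

Because every cell depends only on its immediate neighbors, each step is embarrassingly data-parallel, which is what makes GOL a natural GPU benchmark.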


GPU-based ABMS References

  • Examples:

    • K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008

    • R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007


Hierarchical GPU System Hardware


Host initiates “launch” of many SIMD threads

Threads get “scheduled” in batches on GPU hardware

CUDA claims extremely efficient thread-launch implementation

Millions of CUDA threads at once

Computation Kernels on each GPU (e.g., CUDA Threads)


GPU Memory Types (CUDA)

  • GPU memory comes in several flavors

    • Registers

    • Local Memory

    • Shared Memory

    • Constant Memory

    • Global Memory

    • Texture Memory

  • An important challenge is organizing the application to make the most effective use of this memory hierarchy


GPU Communication Latencies (CUDA)


CUDA + MPI

  • An economical cluster solution

    • Affordable GPUs, each providing one-node CUDA

    • MPI on gigabit Ethernet for inter-node communication

  • Memory speed-constrained system

    • Inter-memory transfers can dominate runtime

    • Runtime overhead can be severe

  • Need a way to tie CUDA and MPI

    • Algorithmic solution needed

    • Need to overcome latency challenge


Analogous Networked Multi-core System


[Figure: a 3×3 grid of blocks Block0,0 … Block2,2, each of side B, mapped one-to-one onto processors P0,0 … P2,2]

Parallel Execution: Conventional Method


Latency Challenge: Conventional Method

  • High latency between GPU and CPU memories

    • CUDA inter-memory data transfer primitives

  • Very high latency across CPU memories

    • MPI communication for data transfers

  • Naïve method gives a very poor computation-to-communication ratio

    • Slow-downs instead of speedups

  • Need a latency-resilient method …
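The latency problem can be caricatured with a back-of-the-envelope model of the conventional method, which pays one full GPU↔CPU copy plus MPI round trip per simulated step. The constants below are assumed for illustration only, not measurements from the talk:

```python
# Assumed, illustrative constants (microseconds):
LATENCY_US = 5000.0    # one GPU<->CPU halo copy + MPI round trip per step
CELL_COST_US = 0.01    # per-cell update cost on the GPU

def step_time_us(block_side):
    """Conventional method: one halo exchange every simulated step.

    Returns (total time, compute time) for one step on a square block.
    """
    compute = CELL_COST_US * block_side * block_side
    return LATENCY_US + compute, compute

total, compute = step_time_us(32)
# For a modest block, compute is a tiny fraction of the step time:
# almost all of each step is spent waiting on the exchange.
```

With numbers in this regime, the compute fraction is well under 1%, which is why the naïve method can produce slowdowns rather than speedups.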


[Figure: the same 3×3 block grid, with each block of size B extended by a surrounding halo region of width R]

Our Solution: B2R Method


B2R Algorithm
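The idea of the B2R algorithm can be sketched in Python on a 1D ring automaton (rule 90, i.e., XOR of neighbors) for brevity; the talk's 2D Game of Life version works the same way. Each block keeps a halo of width R, exchanges it once, and then takes R communication-free local steps over a region that shrinks by one cell per side per step. This is an illustrative reconstruction, not the authors' code:

```python
def step_ref(state):
    """Reference: one global step of rule 90 (XOR of the two neighbors),
    periodic boundary."""
    n = len(state)
    return [state[(i - 1) % n] ^ state[(i + 1) % n] for i in range(n)]

def b2r_run(state, P, R, rounds):
    """B2R on a 1D ring: P blocks of B cells, halo width R,
    one halo exchange per R steps instead of one per step."""
    n = len(state)
    assert n % P == 0
    B = n // P
    blocks = [state[b * B:(b + 1) * B] for b in range(P)]
    exchanges = 0
    for _ in range(rounds):
        # Halo exchange: R cells from each neighbor, all taken at
        # round start (one message per side per block).
        padded = []
        for b in range(P):
            left = blocks[(b - 1) % P][-R:]
            right = blocks[(b + 1) % P][:R]
            padded.append(left + blocks[b] + right)
            exchanges += 2
        # R local steps; the valid region shrinks by 1 cell per side
        # per step, ending exactly at the B owned cells.
        for b in range(P):
            a = padded[b]
            for _ in range(R):
                a = [a[i - 1] ^ a[i + 1] for i in range(1, len(a) - 1)]
            blocks[b] = a
    return [c for blk in blocks for c in blk], exchanges
```

The result is bit-identical to stepping the global state, but a run of S steps needs only S/R exchanges per block instead of S, which is the source of the latency resilience.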


Total Runtime Cost: Analytical Form

At any level in the hierarchy, total runtime F is given by:

Most interesting aspect: cubic in R!
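The shape of the runtime can be explored numerically with an illustrative cost model for one B2R round on a 2D block. The constants are assumed for illustration, not the paper's exact expression, but the structure matches the slide: the halo transfer is quadratic in R and the shrinking-region compute sum is cubic in R.

```python
def round_cost(R, B=32, L=5000.0, t=1.0, c=0.01):
    """Illustrative B2R cost of one round of R steps on a BxB block.

    Assumed constants: L = per-exchange latency, t = per-halo-cell
    transfer cost, c = per-cell compute cost.
    """
    # One exchange of an R-wide halo: (B+2R)^2 - B^2 = 4R(B+R) cells.
    comm = L + t * 4 * (B + R) * R
    # Sub-step k updates a (B + 2(R-k))-sided region; summing these
    # squares over the round is what makes the cost cubic in R.
    comp = c * sum((B + 2 * j) ** 2 for j in range(R))
    return comm + comp

def per_step_cost(R, **kw):
    """Amortize one round over its R simulated steps."""
    return round_cost(R, **kw) / R
```

Plotting `per_step_cost` over R reproduces the trend discussed next: a high cost at R = 1 (the conventional method), a pronounced interior minimum at moderate R, and growth again at large R as the cubic compute term dominates.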


Implications of being Cubic in R

  • Benefits with B2R are not immediately seen for small R

    • In fact, there is degradation for small R!

  • Dramatic improvement is possible beyond small R

    • Our experiments confirm this trend!

  • Too large an R is also bad

    • One can't profit indefinitely!

[Plot: total execution time vs. R — high at small R, a minimum at moderate R, rising again for large R]


Sub-division Across Levels (e.g., MPI to Blocks to Threads)

MPI: Rm

Block: Rb

Thread: Rt


Hierarchy and Recursive Use of B & R

B2R can be applied at all levels!

E.g., CUDA Hierarchy

  • A different R can be chosen at every level, e.g.:

    • Rb for block-level R

    • Rt for thread-level R

  • Simple constraints exist for possible values of R

    • Between R and B

    • Between R’s at different levels

    • Details in our paper


B2R Implementation within CUDA


Performance

Over 100× speedup with MPI+CUDA

Speedup relative to naïve method with no latency-hiding


Multi-GPU MPI+CUDA – Game of Life


Multi-core MPI+pthreads – Game of Life




Multi-core MPI+pthreads – Leadership


Summary

  • B2R Algorithm applies across heterogeneous, hierarchical platforms

    • Deep GPU hierarchies

    • Deep CPU multi-core systems

  • Cubic nature of the runtime's dependence on R is a remarkable aspect

    • A maximum and minimum exist

    • Optimal (minimum) can be dramatically low

  • Results show clear performance improvement

    • Up to 150× in the best case (fine-grained)


Future Work

  • Generate cross-platform code

    • E.g., implement in OpenCL

  • Add to CUDA-MPI levels

    • Multi-GPU per node

  • Implement and test with more benchmarks

    • E.g., from existing ABMS suites such as NetLogo & Repast

  • Generalize to unstructured inter-agent graphs

    • E.g., Social networks

  • Potential to apply to other domains

    • E.g., Stencil computations


Thank you! Questions?

Additional material at our webpage:

Discrete Computing Systems

www.ornl.gov/~2ip
