Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

Presentation Transcript

SimuTools, Malaga, Spain

March 16, 2010

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

Kalyan S. Perumalla, Ph.D.

Senior R&D Manager, Oak Ridge National Laboratory

Adjunct Professor, Georgia Institute of Technology

In a Nutshell

Dramatic improvements in speed

GPU-based ABMS References
  • Examples:
    • K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008
    • R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007
Computation Kernels on Each GPU (e.g., CUDA Threads)

Host initiates “launch” of many SIMD threads

Threads get “scheduled” in batches on GPU hardware

CUDA claims an extremely efficient thread-launch implementation

Millions of CUDA threads at once
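The launch-then-schedule model above can be sketched in plain Python (a toy illustration of the execution model, not CUDA API code; the function names and the trivial per-thread kernel are ours):

```python
# Toy model of a CUDA-style thread launch: the host enqueues a grid of
# logical threads; the hardware then runs them in fixed-size SIMD batches.

WARP_SIZE = 32  # CUDA schedules threads in batches ("warps") of 32


def launch(kernel, n_threads, warp_size=WARP_SIZE):
    """Apply `kernel` to every logical thread id, batch by batch."""
    results = [None] * n_threads
    for start in range(0, n_threads, warp_size):
        batch = range(start, min(start + warp_size, n_threads))
        for tid in batch:  # all lanes of one batch run in lockstep
            results[tid] = kernel(tid)
    return results


# "Launch" a million logical threads, each doing a trivial per-agent update.
out = launch(lambda tid: tid * 2, 1_000_000)
```

The point of the sketch is the shape of the contract: the host pays one cheap launch for millions of logical threads, and batching is the hardware's concern, not the programmer's.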
gpu memory types cuda
GPU Memory Types (CUDA)
  • GPU memory comes in several flavors
    • Registers
    • Local Memory
    • Shared Memory
    • Constant Memory
    • Global Memory
    • Texture Memory
  • An important challenge is organizing the application to make the most effective use of this memory hierarchy
CUDA + MPI
  • An economical cluster solution
    • Affordable GPUs, each providing one-node CUDA
    • MPI on gigabit Ethernet for inter-node communication
  • Memory speed-constrained system
    • Inter-memory transfers can dominate runtime
    • Runtime overhead can be severe
  • Need a way to tie CUDA and MPI
    • Algorithmic solution needed
    • Need to overcome latency challenge
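Why the CUDA–MPI tie-in is memory-speed-constrained can be seen from a per-step cost sketch (plain Python standing in for CUDA/MPI calls; all cost constants are made-up illustrative numbers, not measurements):

```python
# Naive CUDA+MPI coupling: every simulated time step pays two GPU<->CPU
# copies plus one MPI neighbor exchange. Constants are illustrative only.

GPU_COPY_US = 10.0       # one device<->host transfer (cudaMemcpy-style)
MPI_EXCHANGE_US = 100.0  # one round of neighbor exchange over Ethernet
COMPUTE_US = 1.0         # per-step kernel time for one block


def naive_step_cost():
    """Cost of one time step in the conventional (per-step-exchange) method."""
    compute = COMPUTE_US
    comm = GPU_COPY_US + MPI_EXCHANGE_US + GPU_COPY_US  # D->H, MPI, H->D
    return compute + comm


def naive_total(n_steps):
    return n_steps * naive_step_cost()


# Communication dwarfs computation: the ratio is far below 1.
ratio = COMPUTE_US / (naive_step_cost() - COMPUTE_US)
```

With numbers of this rough shape, over 99% of each step is inter-memory transfer, which is exactly the latency challenge the next slides address.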
Parallel Execution: Conventional Method

[Figure: the agent space partitioned into a 3×3 grid of blocks, Block0,0 through Block2,2, each of side B, with each block mapped to its own processor P0,0 through P2,2.]
Latency Challenge: Conventional Method
  • High latency between GPU and CPU memories
    • CUDA inter-memory data transfer primitives
  • Very high latency across CPU memories
    • MPI communication for data transfers
  • Naïve method gives very poor computation to communication ratio
    • Slow-downs instead of speedups
  • Need a latency-resilient method …
Our Solution: B2R Method

[Figure: the same 3×3 decomposition into blocks of side B, with each block extended by a surrounding region of width R.]
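The B2R idea in the figure can be sketched in plain Python (a toy 1-D version; the function names and the nearest-neighbor update rule are our illustrative choices, not the paper's code): each block keeps an extra region of width R on each side, computes R steps locally on a shrinking valid region, and only then communicates.

```python
# Toy 1-D B2R: a block of B cells plus an extra region of R cells on each
# side. The extra cells let the block advance R time steps between
# communications, because each step of this nearest-neighbor update
# invalidates one more cell at each end of the region.


def step(cells):
    """One nearest-neighbor averaging step; the result is 2 cells shorter."""
    return [(cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
            for i in range(1, len(cells) - 1)]


def b2r_superstep(block, left_extra, right_extra):
    """Advance a block R steps locally, R being the width of the extra regions."""
    R = len(left_extra)
    region = left_extra + block + right_extra  # B + 2R cells
    for _ in range(R):                         # R local steps, no communication
        region = step(region)
    return region                              # exactly B valid cells remain


B, R = 6, 2
out = b2r_superstep([1.0] * B, [0.0] * R, [0.0] * R)
```

One communication round per R steps replaces one per step, which is the source of the latency resilience claimed above.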

Total Runtime Cost: Analytical Form

At any level in the hierarchy, total runtime F is given by a closed-form expression:

[Equation not reproduced in transcript]

Most interesting aspect: F is cubic in R!

Implications of being Cubic in R
  • Benefits of B2R are not immediately seen for small R
    • In fact, degradation for small R!
  • Dramatic improvement possible beyond small R
    • Our experiments confirm this trend!
  • Too large an R is bad, too
    • Can’t profit indefinitely!

[Plot: total execution time vs. R]
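The qualitative shape claimed above (poor for small R, a pronounced interior minimum, losses again for large R) can be reproduced with a toy cost model (our own constants and functional form, not the paper's equation; a 1-D block keeps the arithmetic small, whereas the paper's 2-D regions make the compute term cubic in R):

```python
# Toy B2R cost model for a 1-D block of B cells over T time steps.
# Each super-step of R time steps computes on regions of up to B + 2R
# cells and then pays one fixed-latency communication. Fractional
# super-step counts are allowed in this toy model.


def total_runtime(R, B=64, T=1000, c_compute=1.0, latency=900.0):
    """Total cost of T steps executed as T/R super-steps of R steps each."""
    # Compute: step r of a super-step works on B + 2r cells.
    compute_per_superstep = sum(c_compute * (B + 2 * r) for r in range(1, R + 1))
    comm_per_superstep = latency  # one exchange per super-step
    supersteps = T / R
    return supersteps * (compute_per_superstep + comm_per_superstep)


costs = {R: total_runtime(R) for R in range(1, 65)}
best_R = min(costs, key=costs.get)  # an interior value of R wins
```

In this model the per-step latency term falls like 1/R while the inflated-region compute term grows with R, so total runtime first drops steeply and then rises again, exactly the trend the slide describes.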

Hierarchy and Recursive Use of B & R

B2R can be applied at all levels!

E.g., CUDA Hierarchy

  • A different R can be chosen at every level, e.g.:
    • Rb for block-level R
    • Rt for thread-level R
  • Simple constraints exist for possible values of R
    • Between R and B
    • Between R’s at different levels
    • Details in our paper
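One plausible reading of the recursive scheme (the divisibility and size constraints below are our illustrative guesses at the "simple constraints" the slide alludes to, not the paper's exact ones): a block-level super-step of Rb time steps is itself executed as inner thread-level super-steps of Rt steps.

```python
# Hypothetical recursive B2R schedule: Rb steps per block-level exchange,
# executed as inner super-steps of Rt steps per thread-level exchange.
# The constraints asserted here are our plausible reading, not the paper's.


def schedule(T, B, Rb, Rt):
    """Count communications at each level for T steps on blocks of side B."""
    assert 2 * Rb <= B, "block-level extension must fit inside the block"
    assert Rb % Rt == 0, "thread-level R must divide block-level R"
    block_exchanges = T // Rb                        # node/block-level comms
    thread_exchanges = block_exchanges * (Rb // Rt)  # thread-level syncs
    return block_exchanges, thread_exchanges


counts = schedule(T=1200, B=64, Rb=12, Rt=4)
```

The useful consequence is that expensive (node-level) exchanges become rarer than cheap (thread-level) ones, with each level's R tuned to its own latency.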
Performance

Over 100× speedup with MPI+CUDA

Speedup relative to naïve method with no latency-hiding

Summary
  • B2R Algorithm applies across heterogeneous, hierarchical platforms
    • Deep GPU hierarchies
    • Deep CPU multi-core systems
  • Cubic nature of runtime dependence on R is a remarkable aspect
    • A maximum and minimum exist
    • Optimal (minimum) can be dramatically low
  • Results show clear performance improvement
    • Up to 150× in the best case (fine grained)
Future Work
  • Generate cross-platform code
    • E.g., Implement in OpenCL
  • Add to CUDA-MPI levels
    • Multi-GPU per node
  • Implement and test with more benchmarks
    • E.g., from existing ABMS suites (NetLogo, Repast)
  • Generalize to unstructured inter-agent graphs
    • E.g., Social networks
  • Potential to apply to other domains
    • E.g., Stencil computations

Thank you! Questions?

Additional material at our webpage:

Discrete Computing Systems

www.ornl.gov/~2ip
