
SimuTools, Malaga, Spain

March 16, 2010

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

Kalyan S. Perumalla, Ph.D.

Senior R&D Manager, Oak Ridge National Laboratory

Adjunct Professor, Georgia Institute of Technology



In a Nut Shell

Dramatic improvements in simulation speed: up to 150× over the naïve method in the best case (see Summary)



Outline


ABMS: Motivating Demonstrations

Agent-Based Modeling and Simulation (ABMS)

  • Game of Life (GOL)

  • Afghan Leadership (LDR)



GPU-based ABMS References

  • Examples:

    • K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008

    • R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007



Hierarchical GPU System Hardware


Computation Kernels on Each GPU, e.g., CUDA Threads

  • Host initiates a “launch” of many SIMD threads

  • Threads get “scheduled” in batches on the GPU hardware

  • CUDA claims an extremely efficient thread-launch implementation

  • Millions of CUDA threads at once (a minimal launch sketch follows)
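As a minimal sketch of this launch model (illustrative, not from the slides; the kernel name and per-cell update are placeholders):

```c
#include <cuda_runtime.h>

// One thread per cell; the hardware schedules threads in batches.
__global__ void step(const int *in, int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];  // per-cell update goes here
}

void launch(const int *d_in, int *d_out, int width, int height)
{
    dim3 threads(16, 16);  // 256 threads per block
    dim3 blocks((width  + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y);
    // a width x height grid can put millions of threads in flight at once
    step<<<blocks, threads>>>(d_in, d_out, width, height);
}
```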



GPU Memory Types (CUDA)

  • GPU memory comes in several flavors

    • Registers

    • Local Memory

    • Shared Memory

    • Constant Memory

    • Global Memory

    • Texture Memory

  • An important challenge is organizing the application to make the most effective use of this memory hierarchy (a small sketch of the memory spaces follows)
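A hedged sketch of how these memory spaces appear in CUDA code (illustrative names; texture and local memory are omitted):

```c
__constant__ float params[16];  // constant memory: cached, read-only on device

__global__ void kernel(const float *global_in, float *global_out)
{
    __shared__ float tile[256];  // shared memory: fast per-block scratchpad
    float scale = params[0];     // held in a register: per-thread scalar
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = global_in[i];  // global memory: large but high-latency
    __syncthreads();
    global_out[i] = tile[threadIdx.x] * scale;
}
```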



GPU Communication Latencies (CUDA)
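The slide's latency figures did not survive transcription; one way to measure the host-to-device hop yourself is with CUDA events (a hedged micro-benchmark sketch, assuming pinned host memory):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    float *h, *d;
    cudaMallocHost(&h, sizeof(float));  // pinned host buffer
    cudaMalloc(&d, sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    for (int i = 0; i < 1000; i++)      // average over many tiny copies
        cudaMemcpy(d, h, sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms;  // total ms over 1000 copies == average microseconds per copy
    cudaEventElapsedTime(&ms, t0, t1);
    printf("avg host-to-device latency: %.2f us\n", ms);
    return 0;
}
```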



CUDA + MPI

  • An economical cluster solution

    • Affordable GPUs, each providing one-node CUDA

    • MPI on gigabit Ethernet for inter-node communication

  • Memory speed-constrained system

    • Inter-memory transfers can dominate runtime

    • Runtime overhead can be severe

  • Need a way to tie CUDA and MPI (a basic sketch follows this list)

    • Algorithmic solution needed

    • Need to overcome latency challenge
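A hedged sketch of the basic glue pattern (our illustration with assumed names, not the authors' code). Device data must be staged through host memory, so every exchange pays both the GPU-CPU and the CPU-CPU latency:

```c
#include <mpi.h>
#include <cuda_runtime.h>

// Exchange an n-element boundary with left/right neighbor ranks.
void exchange_halo(float *d_halo, float *h_halo, int n, int left, int right)
{
    // 1. GPU -> CPU copy (first latency hop)
    cudaMemcpy(h_halo, d_halo, n * sizeof(float), cudaMemcpyDeviceToHost);

    // 2. CPU -> CPU over gigabit Ethernet (second, larger hop)
    MPI_Sendrecv_replace(h_halo, n, MPI_FLOAT, right, 0, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 3. CPU -> GPU copy (third hop)
    cudaMemcpy(d_halo, h_halo, n * sizeof(float), cudaMemcpyHostToDevice);
}
```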



Analogous Networked Multi-core System


Parallel Execution: Conventional Method

[Figure: the agent grid is decomposed into a 3×3 array of blocks, Block0,0 … Block2,2, each of side B, mapped one-to-one onto processors P0,0 … P2,2]
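In the conventional method, every simulation step ends with a halo exchange. A hedged sketch (our assumption, reusing exchange_halo() from the earlier sketch; step_kernel is a placeholder):

```c
__global__ void step_kernel(const float *in, float *out, int w, int h);
// (assumed per-step update kernel, defined elsewhere)

void run_conventional(float *d_in, float *d_out, float *h_halo,
                      int width, int height, int halo_n,
                      int left, int right, int num_steps)
{
    dim3 threads(16, 16);
    dim3 blocks((width + 15) / 16, (height + 15) / 16);
    for (int t = 0; t < num_steps; t++) {
        step_kernel<<<blocks, threads>>>(d_in, d_out, width, height);
        cudaDeviceSynchronize();
        exchange_halo(d_out, h_halo, halo_n, left, right);  // every step!
        float *tmp = d_in; d_in = d_out; d_out = tmp;       // ping-pong buffers
    }
    // (a full version would return the final buffer to the caller)
}
```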



Latency Challenge: Conventional Method

  • High latency between GPU and CPU memories

    • CUDA inter-memory data transfer primitives

  • Very high latency across CPU memories

    • MPI communication for data transfers

  • Naïve method gives a very poor computation-to-communication ratio

    • Slow-downs instead of speedups

  • Need a latency-resilient method …


Our Solution: B2R Method

[Figure: the same 3×3 block-to-processor mapping, but each block of side B is extended by a ghost region of depth R on every side, so each processor holds an overlapping (B+2R)-wide region]


B2r algorithm

B2R Algorithm
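A hedged reconstruction of the algorithm's shape (our sketch from the talk's description, not the authors' code): fetch a depth-R ghost region once, then advance R steps locally, amortizing one exchange over R steps; the valid interior shrinks by one cell per sub-step, which the ghost region absorbs:

```c
__global__ void step_kernel(const float *in, float *out, int w, int h);
// (assumed per-step update kernel, as in the earlier sketch)

void run_b2r(float *d_in, float *d_out, float *h_ghost,
             int B, int R, int ghost_n, int left, int right, int num_steps)
{
    int W = B + 2 * R;  // extended block width, including the ghost region
    dim3 threads(16, 16);
    dim3 blocks((W + 15) / 16, (W + 15) / 16);
    for (int t = 0; t < num_steps; t += R) {
        // one ghost exchange per R steps instead of per step
        exchange_halo(d_in, h_ghost, ghost_n, left, right);
        for (int s = 0; s < R; s++) {
            step_kernel<<<blocks, threads>>>(d_in, d_out, W, W);
            float *tmp = d_in; d_in = d_out; d_out = tmp;  // ping-pong buffers
        }
    }
    // (a full version would return the final buffer to the caller)
}
```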



Total Runtime Cost: Analytical Form

At any level in the hierarchy, total runtime F is given by an analytical formula (the equation on the slide was lost in transcription).

Most interesting aspect: cubic in R!
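Since the slide's equation did not survive, here is a plausible per-round form for a 2D model (our assumption, not the paper's exact formula), with λ the per-exchange latency, β the per-cell transfer cost, and γ the per-cell compute cost:

```latex
% Hedged reconstruction: one round = one ghost exchange plus R local steps
% on a (B+2R)-wide extended block; T steps take T/R such rounds.
\[
  F(R) \;\approx\; \frac{T}{R}\left[\,
      \underbrace{\lambda + \beta\,R\,(B+2R)}_{\text{communication}}
      \;+\;
      \underbrace{\gamma\,R\,(B+2R)^{2}}_{\text{computation}}
  \,\right]
\]
% The bracketed round cost contains R(B+2R)^2, a cubic in R: small R is
% dominated by the amortized latency term, large R by the cubic term.
```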



Implications of being Cubic in R

  • Benefits with B2R not immediately seen for small R

    • In fact, degradation for small R!

  • Dramatic improvement possible beyond small R

    • Our experiments confirm this trend!

  • Too large an R is bad, too

    • Can’t profit indefinitely!

[Plot: total execution time vs. R — poor for small R, a pronounced minimum at moderate R, rising again for large R]
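Given measured λ, β, γ for a platform, the minimum can be found by a direct sweep. An illustrative helper matching the hedged cost model above (all names are our assumptions):

```c
#include <math.h>

int best_R(double lambda, double beta, double gamma, int B, int T, int Rmax)
{
    int best = 1;
    double best_cost = INFINITY;
    for (int R = 1; R <= Rmax; R++) {
        double W = B + 2.0 * R;                  // extended block width
        double round = lambda + beta * R * W + gamma * R * W * W;
        double total = ((double)T / R) * round;  // T steps = T/R rounds
        if (total < best_cost) { best_cost = total; best = R; }
    }
    return best;
}
```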



Sub-division Across Levels, e.g., MPI to Blocks to Threads

  • MPI level: Rm

  • Block level: Rb

  • Thread level: Rt



Hierarchy and Recursive Use of B & R

B2R can be applied at all levels!

E.g., CUDA Hierarchy

  • A different R can be chosen at every level, e.g.:

    • Rb for block-level R

    • Rt for thread-level R

  • Simple constraints exist for possible values of R

    • Between R and B

    • Between R’s at different levels

    • Details in our paper (an illustrative encoding follows this list)
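As one plausible encoding of such constraints (purely our assumption; the paper gives the exact conditions):

```c
#include <stdbool.h>

// Hedged feasibility check for nested B2R parameters: block side B and
// the per-level R values at the MPI (Rm), block (Rb), and thread (Rt) levels.
bool feasible(int B, int Rm, int Rb, int Rt)
{
    if (Rt < 1 || Rb < Rt || Rm < Rb)
        return false;  // assumed: outer levels amortize at least as many steps
    if (2 * Rm >= B)
        return false;  // assumed: ghost regions must leave a nonempty interior
    return true;
}
```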



B2R Implementation within CUDA
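A hedged sketch of what thread-level B2R can look like inside a CUDA kernel (our assumed shape, not the authors' code): each block loads a (TILE+2·RT)-wide tile into shared memory once, then advances RT sub-steps with only __syncthreads() between them, avoiding global-memory traffic per sub-step:

```c
#define TILE 16
#define RT   2   // thread-level R

// Launch with dim3 threads(TILE + 2*RT, TILE + 2*RT) so every tile
// element, including the ghost ring, is loaded by exactly one thread.
__global__ void b2r_tile_step(const float *in, float *out, int width)
{
    __shared__ float tile[TILE + 2 * RT][TILE + 2 * RT];

    int gx = blockIdx.x * TILE + threadIdx.x - RT;  // global coords incl. ghost
    int gy = blockIdx.y * TILE + threadIdx.y - RT;
    int lx = threadIdx.x, ly = threadIdx.y;

    // cooperative load of the tile plus its depth-RT ghost ring (edge-clamped)
    int cx = min(max(gx, 0), width - 1);
    int cy = min(max(gy, 0), width - 1);
    tile[ly][lx] = in[cy * width + cx];
    __syncthreads();

    // RT sub-steps entirely in shared memory; valid region shrinks by 1/step
    for (int s = 1; s <= RT; s++) {
        float v = tile[ly][lx];
        if (lx >= s && lx < TILE + 2 * RT - s &&
            ly >= s && ly < TILE + 2 * RT - s)
            v = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                         tile[ly - 1][lx] + tile[ly + 1][lx]);  // stand-in update
        __syncthreads();
        tile[ly][lx] = v;
        __syncthreads();
    }

    // write back only the interior TILE x TILE cells this block owns
    if (lx >= RT && lx < TILE + RT && ly >= RT && ly < TILE + RT &&
        gx < width && gy < width)
        out[gy * width + gx] = tile[ly][lx];
}
```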



Performance

Over 100× speedup with MPI+CUDA

Speedup is relative to the naïve method with no latency-hiding



Multi-GPU MPI+CUDA – Game of Life



Multi-core MPI+Pthreads – Game of Life



Multi-core MPI+Pthreads – Game of Life (continued)



Multi-core MPI+Pthreads – Leadership



Summary

  • B2R Algorithm applies across heterogeneous, hierarchical platforms

    • Deep GPU hierarchies

    • Deep CPU multi-core systems

  • Cubic nature of runtime dependence on R is a remarkable aspect

    • A maximum and minimum exist

    • Optimal (minimum) can be dramatically low

  • Results show clear performance improvement

    • Up to 150× in the best case (fine-grained)



Future Work

  • Generate cross-platform code

    • E.g., implement in OpenCL

  • Add to CUDA-MPI levels

    • Multi-GPU per node

  • Implement and test with more benchmarks

    • E.g., from existing ABMS suites such as NetLogo & Repast

  • Generalize to unstructured inter-agent graphs

    • E.g., Social networks

  • Potential to apply to other domains

    • E.g., Stencil computations



Thank you! Questions?

Additional material at our webpage:

Discrete Computing Systems

www.ornl.gov/~2ip

