Hybrid Redux: CUDA / MPI


CUDA / MPI Hybrid – Why?
  • Harness more hardware
    • 16 CUDA GPUs > 1!
  • You have a legacy MPI code that you’d like to accelerate.
CUDA / MPI – Hardware?
  • In a cluster environment, a typical configuration is one of the following (see the sketch after this list for how a rank can check what it sees):
    • Tesla S1050 cards attached to (some) nodes.
      • 1 GPU for each node.
    • Tesla S1070 server node (4 GPUs) connected to two different host nodes via PCI-E.
      • 2 GPUs per node.
    • Sooner’s CUDA nodes are like the latter.
      • Those nodes are used when you submit a job to the queue “cuda”.
  • You can also attach multiple cards to a workstation.
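
A minimal sketch of that run-time check (the program below is illustrative, not from the slides): each MPI process asks the CUDA runtime how many devices its node exposes, which makes the one-GPU-per-node vs. two-GPUs-per-node configurations above visible at run time.

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, ngpus = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* How many CUDA devices does this rank's node expose? */
    cudaGetDeviceCount(&ngpus);
    printf("rank %d sees %d GPU(s)\n", rank, ngpus);

    MPI_Finalize();
    return 0;
}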
CUDA / MPI – Approach
  • CUDA will likely be:
    • Doing most of the computational heavy lifting
    • Dictating your algorithmic pattern
    • Dictating your parallel layout
      • i.e. which nodes have how many cards
  • Therefore:
    • We want to design the CUDA portions first, and
    • We want to do most of the compute in CUDA, and use MPI to move work and results around when needed (see the sketch below).
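
A sketch of that split (the function and buffer names are placeholders, not from the slides): rank 0 hands each rank a chunk of the input with MPI, every rank runs the CUDA kernel on its own chunk, and MPI gathers the results back.

/* driver.c -- MPI moves the work around, CUDA does the compute */
#include <stdlib.h>
#include <mpi.h>

void run_kernel(const float *in, float *out, int n);  /* defined in kernel.cu */

void hybrid_step(float *all_in, float *all_out, int n_per_rank)
{
    float *my_in  = malloc(n_per_rank * sizeof(float));
    float *my_out = malloc(n_per_rank * sizeof(float));

    /* MPI: hand one chunk of the input to each rank */
    MPI_Scatter(all_in, n_per_rank, MPI_FLOAT,
                my_in,  n_per_rank, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* CUDA: do the heavy lifting on this rank's GPU */
    run_kernel(my_in, my_out, n_per_rank);

    /* MPI: collect every rank's result back on rank 0 */
    MPI_Gather(my_out, n_per_rank, MPI_FLOAT,
               all_out, n_per_rank, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(my_in);
    free(my_out);
}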
Writing and Testing
  • Use MPI for what CUDA can’t do/does poorly
    • communication!
  • Basic Organization:
    • Get data to node/CUDA card
    • Compute
    • Get result out of node/CUDA card
  • Focus testing on these three spots to pinpoint bugs
  • Be able to “turn off” either MPI or CUDA for testing
    • Keep a serial version of the CUDA code around just for this (see the sketch below).
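
A minimal sketch of the kernel side under that scheme (scale_kernel, run_kernel, and the 2*x computation are placeholders, not from the slides): keep two interchangeable definitions of run_kernel, one CUDA and one serial, and link whichever one you are testing.

/* kernel.cu -- GPU version; extern "C" so the C driver links against it */
#include <cuda_runtime.h>

__global__ void scale_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

extern "C" void run_kernel(const float *in, float *out, int n)
{
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in, n * sizeof(float), cudaMemcpyHostToDevice);    /* data in    */
    scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);             /* compute    */
    cudaMemcpy(out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);  /* result out */
    cudaFree(d_in);
    cudaFree(d_out);
}

/* kernel_serial.c -- same computation with no GPU, kept around just for testing */
void run_kernel(const float *in, float *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}

Comparing the two builds on the same input helps pinpoint whether a bug lives in the transfer in, the compute, or the transfer out.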
Compiling
  • Basically, we need to use both mpicc and nvcc
    • These are both just wrappers for other compilers
  • Dirty trick – wrap mpicc with nvcc:

nvcc -arch sm_13 --compiler-bindir mpicc driver.c kernel.cu

  • Add -g (host) and -G (device) to include debug info for gdb and cuda-gdb
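
If the wrapper trick fights with your MPI install, a common alternative (sketch only; the CUDA library path varies by system) is to compile the two halves separately and let mpicc do the linking:

nvcc -arch sm_13 -c kernel.cu -o kernel.o
mpicc -c driver.c -o driver.o
mpicc driver.o kernel.o -o hybrid -L/usr/local/cuda/lib64 -lcudart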
Running CUDA / MPI
  • Run no more than one MPI process per GPU.
    • On Sooner, this means each CUDA node should run no more than 2 MPI processes.
  • On Sooner, these cards can be reserved by using:

#BSUB -R "select[cuda> 0]“

#BSUB -R "rusage[cuda=2]“

in your .bsub file.
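
Inside the code, each of those processes also has to claim its own card. A minimal sketch (the round-robin mapping below assumes consecutive ranks land on the same node, which is my assumption, not something from the slides):

#include <mpi.h>
#include <cuda_runtime.h>

/* Call once, after MPI_Init and before any other CUDA work:
   with 2 GPUs and 2 ranks per node, rank % ngpus gives each
   MPI process its own device, so no two processes share a card. */
void select_gpu(void)
{
    int rank, ngpus = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 0)
        cudaSetDevice(rank % ngpus);
}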