
Killdevil


Presentation Transcript


  1. Killdevil: Running CUDA programs on the cluster

  2. Requesting permission • https://onyen.unc.edu/cgi-bin/unc_id/services

  3. Compiling CUDA programs • module load cuda • Run script : compile.sh • nvcc -o MatrixMul -I/usr/local/cuda/include/ -L/usr/local/lib64 -L/usr/local/cuda/lib64 MatrixMul.cu
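For reference, a minimal placeholder MatrixMul.cu that the nvcc line above would build; this is only an illustrative sketch, not the course's actual file.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel so the file has something for the GPU to run;
    // the real MatrixMul.cu used on the cluster is not shown in the slides.
    __global__ void emptyKernel() { }

    int main()
    {
        emptyKernel<<<1, 1>>>();        // one block of one thread
        cudaDeviceSynchronize();
        printf("kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }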

  4. Running CUDA programs • ssh killdevil.unc.edu • module load cuda • Run script : submitjob.sh • bsub -q gpu -a gpuexcl_t -n 1 -o MYGPUJOB.o%J <myprogramscript>

  5. CUDA SDK • https://developer.nvidia.com/cuda-downloads • Download the SDK depending on your OS • Windows : Requires Visual Studio to compile the samples • Linux : Requires gcc

  6. CUDA : Threads

  7. Recap • Kernel program is executed by a grid of threads

  8. Thread Organization • Organized in two-level hierarchy • Grid composed of Blocks • gridDim : Number of blocks the grid has • Blocks composed of Threads • blockDim : Number of threads the block has • Each block gets a unique Id • blockIdx • Each thread gets a unique Id • threadIdx
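A sketch of these built-in variables in action; the grid and block sizes below are arbitrary, chosen only for illustration.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void showIds()
    {
        // gridDim/blockDim describe the launch configuration;
        // blockIdx/threadIdx identify this particular thread within it.
        printf("grid %ux%u blocks, block %ux%u threads, blockIdx (%u,%u), threadIdx (%u,%u)\n",
               gridDim.x, gridDim.y, blockDim.x, blockDim.y,
               blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
    }

    int main()
    {
        dim3 grid(2, 2);     // gridDim  = (2, 2, 1): 4 blocks
        dim3 block(4, 2);    // blockDim = (4, 2, 1): 8 threads per block
        showIds<<<grid, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }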

  9. Thread Organization • Each block has equal number of threads • blockDim.x, blockDim.y, blockDim.z • threadIdx is always local to the block

  10. 1D Example • Grid = 128 blocks • Block = 32 threads • blockDim.x in kernel returns 32 • Total threads = 128 x 32 = 4096 • Each thread has a unique Id • blockIdx.x * blockDim.x + threadIdx.x
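The same 1D example as code, assuming an output array with one int per thread; the kernel and array names are illustrative.

    #include <cuda_runtime.h>

    __global__ void fill(int *out)
    {
        // Unique global index for each of the 128 * 32 = 4096 threads
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        out[id] = id;
    }

    int main()
    {
        const int n = 128 * 32;               // 4096 threads in total
        int *d_out;
        cudaMalloc(&d_out, n * sizeof(int));
        fill<<<128, 32>>>(d_out);             // grid = 128 blocks, block = 32 threads
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }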

  11. Multi-Dimension Example

  12. Things to Note • Blocks are organized into 3D arrays of threads • 1D, 2D, 3D depending on your problem • Vector sum : 1D; Matrix multiplication : 2D • All blocks in a grid have the same dimensions • i.e. all blocks have an equal number of threads in each dimension • The total size of a block is limited to 512 threads • blockDim can be (512, 1, 1), (8, 16, 2), (16, 16, 2) • But not (32, 32, 1) • Total threads : 32 x 32 x 1 = 1024, which exceeds 512
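The limit can be checked directly from the block dimensions. A small host-side sketch using the shapes from this slide; the 512-thread cap applies to the older GPUs these slides assume, while newer devices allow 1024 threads per block.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Candidate block shapes from the slide; the product of the three
        // dimensions must stay within the 512-thread per-block limit.
        dim3 shapes[] = { dim3(512, 1, 1), dim3(8, 16, 2),
                          dim3(16, 16, 2), dim3(32, 32, 1) };
        for (int i = 0; i < 4; ++i) {
            unsigned total = shapes[i].x * shapes[i].y * shapes[i].z;
            printf("(%u, %u, %u) = %u threads -> %s\n",
                   shapes[i].x, shapes[i].y, shapes[i].z, total,
                   total <= 512 ? "OK" : "too large");
        }
        return 0;
    }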

  13. Using blockIdx and threadIdx • [Figure: a width x width matrix of elements indexed (column, row), running from (0, 0), (1, 0), (2, 0), …, (width-1, 0) across the first row down to (width-1, width-1) in the last]
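A sketch of how each thread derives its (col, row) position in such a width x width matrix from blockIdx and threadIdx; the names and sizes here are illustrative, not the slide's code.

    #include <cuda_runtime.h>

    __global__ void touch(float *m, int width)
    {
        // (col, row) coordinates of this thread across the whole grid
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < width)
            m[row * width + col] = 1.0f;     // element (col, row) of the matrix
    }

    int main()
    {
        const int width = 100;               // deliberately not a multiple of the block size
        float *d_m;
        cudaMalloc(&d_m, width * width * sizeof(float));

        dim3 block(16, 16);                                  // 256 threads per block
        dim3 grid((width + 15) / 16, (width + 15) / 16);     // enough blocks to cover the matrix
        touch<<<grid, block>>>(d_m, width);
        cudaDeviceSynchronize();
        cudaFree(d_m);
        return 0;
    }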

  14. Matrix-Multiplication with larger size
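The slide's own code is not reproduced in this transcript. Below is a minimal sketch of a multiplication kernel that works for matrices larger than a single block, assuming square width x width matrices in row-major order and one output element per thread, with no shared-memory tiling; the inputs are left uninitialized since this only illustrates the launch.

    #include <cuda_runtime.h>

    __global__ void matrixMul(const float *A, const float *B, float *C, int width)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += A[row * width + k] * B[k * width + col];   // dot product of row and column
            C[row * width + col] = sum;
        }
    }

    int main()
    {
        const int width = 1000;                              // larger than any single block
        size_t bytes = width * width * sizeof(float);
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);

        // A 2D grid of 16x16 blocks large enough to cover the whole matrix
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (width + block.y - 1) / block.y);
        matrixMul<<<grid, block>>>(dA, dB, dC, width);
        cudaDeviceSynchronize();

        cudaFree(dA);
        cudaFree(dB);
        cudaFree(dC);
        return 0;
    }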

  15. Simple example

  16. Updated kernel code

  17. Block scheduling on device

  18. Thread Assignment

  19. Thread Assignment

  20. Questions?
