
First CUDA Program


Presentation Transcript


  1. First CUDA Program

  2. First C program / first CUDA program

  First C program:

      #include <stdio.h>

      int main()
      {
          printf("Hello, world\n");
          return 0;
      }

  Compilation:

      gcc -o first first.c
      ./first

  First CUDA program:

      #include <cuda.h>
      #include <stdio.h>

      __global__ void kernel(void) { }

      int main(void)
      {
          kernel<<<1, 1>>>();
          printf("Hello World!\n");
          return 0;
      }

  Compilation:

      nvcc -o first first.cu
      ./first

  3. Kernels
  • CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
  • A kernel is defined using the __global__ declaration specifier.
  • The number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax (a sketch follows).
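  For illustration, a minimal sketch of the launch syntax (the empty kernel and the launch parameters 4 and 64 are hypothetical, not from the slides): a kernel declared with __global__ and launched with <<<4, 64>>> is executed by 4 blocks of 64 threads each, i.e. 256 CUDA threads in total.

      #include <cuda.h>
      #include <stdio.h>

      __global__ void kernel(void) { }   // runs on the device; body is empty

      int main(void)
      {
          // <<<blocks, threadsPerBlock>>>: 4 blocks of 64 threads each,
          // so the kernel body is executed 4 * 64 = 256 times in parallel
          kernel<<<4, 64>>>();
          printf("Launched 4 x 64 = 256 threads\n");
          return 0;
      }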

  4. Example Program 1

      #include <cuda.h>
      #include <stdio.h>

      __global__ void kernel(void) { }

      int main(void)
      {
          kernel<<<1, 1>>>();
          printf("Hello World!\n");
          return 0;
      }

  • "__global__" says the function is to be compiled to run on a "device" (GPU), not the "host" (CPU).
  • The angle brackets "<<<" and ">>>" pass launch parameters to the runtime.
  • A function executed on the GPU (device) is usually called a "kernel".
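  A side note that is not on the slides: on GPUs of compute capability 2.0 and later, the kernel itself may call printf(); in that case the host should call cudaDeviceSynchronize() before exiting so the device output is flushed. A minimal sketch:

      #include <cuda.h>
      #include <stdio.h>

      __global__ void kernel(void)
      {
          printf("Hello from the device!\n");   // device-side printf
      }

      int main(void)
      {
          kernel<<<1, 1>>>();
          cudaDeviceSynchronize();   // wait for the kernel so its output appears
          printf("Hello from the host!\n");
          return 0;
      }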

  5. Example Program 2
  • We can pass parameters to a kernel as we would with any C function:

      __global__ void add(int a, int b, int *c)
      {
          *c = a + b;
      }

  • We need to allocate device memory to do anything useful on a device:

      int main(void)
      {
          int c, *dev_c;
          cudaMalloc((void **)&dev_c, sizeof(int));
          add<<<1, 1>>>(2, 7, dev_c);   // <<<blocksPerGrid, threadsPerBlock>>>
          cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
          printf("2 + 7 = %d\n", c);
          cudaFree(dev_c);
          return 0;
      }

  6. CUDA Device Memory Allocation
  • cudaMalloc():

      cudaError_t cudaMalloc(void **devPtr, size_t size)

    • Allocates an object in device global memory: size bytes of linear memory on the device, returning in *devPtr a pointer to the allocated memory.
    • Requires two parameters:
      • Address of a pointer to the allocated object
      • Size of the allocated object
  • cudaFree():

      cudaError_t cudaFree(void *devPtr)

    • Frees an object from device global memory, given a pointer to that object.
  [Slide figure: a grid with blocks (0,0) and (1,0), each holding per-block shared memory and per-thread registers for threads (0,0) and (1,0); the host accesses device global memory.]
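  Both calls return a cudaError_t, so their results can be checked; a minimal sketch (the error-handling style is an illustration, not from the slides):

      #include <cuda.h>
      #include <stdio.h>

      int main(void)
      {
          int *dev_c = NULL;

          // cudaMalloc returns a cudaError_t that can be inspected
          cudaError_t err = cudaMalloc((void **)&dev_c, sizeof(int));
          if (err != cudaSuccess) {
              fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
              return 1;
          }

          cudaFree(dev_c);
          return 0;
      }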

  7. CUDA Device Memory Allocation (cont.)
  • Code example:
    • Allocate a 64 * 64 single-precision float array
    • Attach the allocated storage to Md
    • "d" is often used to indicate a device data structure

      const int TILE_WIDTH = 64;
      float *Md;
      int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

      cudaMalloc((void **)&Md, size);
      cudaFree(Md);

  8. Memory model
  [Slide figure: per-thread local memory, per-block shared memory, and per-device global memory used by sequential kernels (Kernel 0, Kernel 1); host memory is connected to Device 0 / Device 1 memory via cudaMemcpy().]
  The CUDA programming model assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
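  To make the constant memory space concrete, a minimal sketch, assuming the standard cudaMemcpyToSymbol() call (the kernel scale, the symbol factor, and the data values are hypothetical, not from the slides): the host writes the constant once, and every kernel launch can read it.

      #include <cuda.h>
      #include <stdio.h>

      #define N 4

      // constant memory: written by the host, read-only for kernels,
      // and persistent across kernel launches by the same application
      __constant__ float factor;

      __global__ void scale(float *data)
      {
          data[blockIdx.x] *= factor;   // every thread reads the same constant
      }

      int main(void)
      {
          float h[N] = { 1.0f, 2.0f, 3.0f, 4.0f };
          float f = 10.0f;
          float *d;

          cudaMemcpyToSymbol(factor, &f, sizeof(float));  // host -> constant memory
          cudaMalloc((void **)&d, N * sizeof(float));
          cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

          scale<<<N, 1>>>(d);   // one block per element

          cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
          for (int i = 0; i < N; i++)
              printf("%f\n", h[i]);   // expect 10, 20, 30, 40

          cudaFree(d);
          return 0;
      }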

  9. [Slide figure: a grid with blocks (0,0) and (1,0), per-block shared memory, per-thread registers for threads (0,0) and (1,0), and host access to global and constant memory.]
  • Each thread can:
    • Read/write per-thread registers
    • Read/write per-thread local memory
    • Read/write per-block shared memory
    • Read/write per-grid global memory
    • Read (only) per-grid constant memory
  (A sketch of these memory spaces follows.)
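  A minimal sketch of those spaces in use (the kernel sum_block and its block-wide sum are illustrative, not from the slides): tid lives in a per-thread register, buf in per-block shared memory, and out in per-grid global memory.

      #include <cuda.h>
      #include <stdio.h>

      #define THREADS 64

      __global__ void sum_block(int *out)
      {
          __shared__ int buf[THREADS];   // per-block shared memory
          int tid = threadIdx.x;         // local variable, held in a register

          buf[tid] = tid;                // each thread writes its own slot
          __syncthreads();               // wait until the whole block has written

          if (tid == 0) {                // one thread sums the block's data
              int s = 0;
              for (int i = 0; i < THREADS; i++)
                  s += buf[i];
              *out = s;                  // result goes to per-grid global memory
          }
      }

      int main(void)
      {
          int h, *d;

          cudaMalloc((void **)&d, sizeof(int));
          sum_block<<<1, THREADS>>>(d);
          cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
          printf("sum of 0..63 = %d\n", h);   // expect 2016
          cudaFree(d);
          return 0;
      }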

  10. CUDA Host-Device Data Transfer
  • cudaMemcpy(): memory data transfer

      cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

  • Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer:
      • Host to Host: cudaMemcpyHostToHost
      • Host to Device: cudaMemcpyHostToDevice
      • Device to Host: cudaMemcpyDeviceToHost
      • Device to Device: cudaMemcpyDeviceToDevice
  [Slide figure: the same grid/block/shared-memory/registers diagram, with the host connected to device global memory.]

  11. CUDA Host-Device Data Transfer (cont.)
  • Code example:
    • Transfer a 64 * 64 single-precision float array
    • M is in host memory and Md is in device memory
    • cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

  12. Summing Vectors
  A simple example to illustrate threads and how we use them to code with CUDA C.

  13. Traditional C code on the CPU:

      #include <stdio.h>

      #define N 10

      void add(int *a, int *b, int *c)
      {
          int tid = 0;              // this is CPU zero, so we start at zero
          while (tid < N) {
              c[tid] = a[tid] + b[tid];
              tid += 1;             // we have one CPU, so we increment by one
          }
      }

      int main(void)
      {
          int a[N], b[N], c[N];

          // fill the arrays 'a' and 'b' on the CPU
          for (int i = 0; i < N; i++) {
              a[i] = -i;
              b[i] = i * i;
          }

          add(a, b, c);

          // display the results
          for (int i = 0; i < N; i++) {
              printf("%d + %d = %d\n", a[i], b[i], c[i]);
          }
          return 0;
      }
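  The comments hint at how the same loop would be split over more than one CPU core; a sketch of that generalization (the helper add_core and its parameters are hypothetical, and N is the #define above): core 0 starts at index 0, core 1 at index 1, and each steps by the number of cores. CUDA applies the same pattern with blocks and threads instead of cores.

      // hypothetical multi-core version of add(): each core handles a strided
      // subset of the indices
      void add_core(int core_id, int num_cores, int *a, int *b, int *c)
      {
          int tid = core_id;        // each core starts at its own index
          while (tid < N) {
              c[tid] = a[tid] + b[tid];
              tid += num_cores;     // and steps by the number of cores
          }
      }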

  14. GPU Vector Sums
  We can accomplish the same addition very similarly on a GPU by writing add() as a device function.

      #define N 10

      int main(void)
      {
          int a[N], b[N], c[N];
          int *dev_a, *dev_b, *dev_c;

          // allocate the memory on the GPU
          cudaMalloc((void **)&dev_a, N * sizeof(int));
          cudaMalloc((void **)&dev_b, N * sizeof(int));
          cudaMalloc((void **)&dev_c, N * sizeof(int));

          // fill the arrays 'a' and 'b' on the CPU
          for (int i = 0; i < N; i++) {
              a[i] = -i;
              b[i] = i * i;
          }

          // copy the arrays 'a' and 'b' to the GPU
          cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
          cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

          add<<<N, 1>>>(dev_a, dev_b, dev_c);
          // (continued on the next slide)

  15. (continued)

          // copy the array 'c' back from the GPU to the CPU
          cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

          // display the results
          for (int i = 0; i < N; i++) {
              printf("%d + %d = %d\n", a[i], b[i], c[i]);
          }

          // free the memory allocated on the GPU
          cudaFree(dev_a);
          cudaFree(dev_b);
          cudaFree(dev_c);
          return 0;
      }

  Declaration specifiers:
  • kernel callable from the host:         __global__ void KernelFunc(...);
  • function callable on the device:       __device__ void DeviceFunc(...);
  • variable in device memory:             __device__ int GlobalVar;
  • variable in per-block shared memory:   __shared__ int SharedVar;

      // Kernel definition
      __global__ void add(int *a, int *b, int *c)
      {
          int tid = blockIdx.x;     // handle the data at this index
          if (tid < N)
              c[tid] = a[tid] + b[tid];
      }
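  Put together (the kernel definition plus the main() spread over slides 14-15, with the #include lines from the first example), the program compiles and runs like before; assuming the source file is named, say, vecsum.cu:

      nvcc -o vecsum vecsum.cu
      ./vecsum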

  16. In

      add<<<N, 1>>>(dev_a, dev_b, dev_c);

  • N is the number of blocks that we want to run in parallel.
  • If we call add<<<4, 1>>>(...), the function will have four copies running in parallel, where each copy is called a block.
  • A thread block is a (data-)parallel task:
    • all blocks in a kernel have the same entry point,
    • but they may execute any code they want.

  17. This is what the code actually executed in each of the four parallel blocks looks like after the runtime substitutes the appropriate block index for blockIdx.x (see the reconstruction below). The runtime launches the same kernel in every block, each block receiving one of these indices, so the work is done in parallel.
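  The slide's figure is not in the transcript; a hedged reconstruction of what the four blocks of an add<<<4, 1>>>(...) launch effectively compute:

      // block 0:  tid = 0  ->  c[0] = a[0] + b[0];
      // block 1:  tid = 1  ->  c[1] = a[1] + b[1];
      // block 2:  tid = 2  ->  c[2] = a[2] + b[2];
      // block 3:  tid = 3  ->  c[3] = a[3] + b[3];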
