
More CUDA Examples



  1. More CUDA Examples

  2. Different Levels of Parallelism
  • Thread parallelism: each thread is an independent thread of execution
  • Data parallelism: across threads in a block; across blocks in a kernel
  • Task parallelism: different blocks are independent; independent kernels are independent of one another

  3. Thread IDs
  • Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
  • threadIdx is a 3-component vector, so threads can be identified using a one-, two-, or three-dimensional thread index, forming a one-, two-, or three-dimensional thread block.
  • This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.
  • Block ID: 1D or 2D (blockIdx.{x,y})
  • Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
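  As a concrete illustration (not from the original slides), the small kernel below simply prints each thread's coordinates; device-side printf assumes a device of compute capability 2.0 or later:

  #include <cstdio>

  // Each thread reports its own block and thread indices.
  __global__ void ShowIds()
  {
      printf("block (%d,%d) thread (%d,%d)\n",
             blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
  }

  int main()
  {
      dim3 grid(2, 2);    // 2 x 2 = 4 blocks
      dim3 block(4, 2);   // 4 x 2 = 8 threads per block
      ShowIds<<<grid, block>>>();
      cudaDeviceSynchronize();  // flush device-side printf before exiting
      return 0;
  }

  Launching 4 blocks of 8 threads prints 32 lines, one per thread, in no guaranteed order.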

  4. A general guideline is that a block should consist of at least 192 threads in order to hide memory access latency; 256 and 512 threads are therefore common and practical choices.
  The following kernel is invoked with one block of N threads:

  // Kernel invocation with N threads
  VecAdd<<<1, N>>>(A, B, C);

  Here, each of the N threads that execute VecAdd() performs one pair-wise addition:

  __global__ void VecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;
      if (i < N)
          C[i] = A[i] + B[i];
  }

  The simplest choice is to have each thread calculate one, and only one, element in the final result array.
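  For orientation (this driver is not on the slide), a complete program around this kernel might look like the following minimal sketch; the fixed size N = 1024 and the host array contents are illustrative assumptions, and error checking is omitted:

  #include <cstdlib>

  #define N 1024   // assumed size; one block of N threads requires N <= max threads per block

  __global__ void VecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;
      if (i < N)
          C[i] = A[i] + B[i];
  }

  int main()
  {
      size_t bytes = N * sizeof(float);
      float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
      for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2 * i; }   // illustrative data

      float *dA, *dB, *dC;
      cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
      cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

      VecAdd<<<1, N>>>(dA, dB, dC);   // one block of N threads, as on the slide

      cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      free(hA); free(hB); free(hC);
      return 0;
  }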

  5. The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. The dimensions of the thread block are accessible within the kernel through the built-in blockDim variable.
  Suppose we have 10000 elements and 256 threads per block. Then the number of blocks required = ceil(10000 / 256) = 40 (plain integer division would give 39 and leave the last elements unprocessed).
  An array of 16 elements divided into 4 blocks:

  int idx = blockDim.x * blockIdx.x + threadIdx.x;

  N = 16, blockDim.x = 4 -> 4 blocks
  blockIdx.x = 0: threadIdx.x = 0,1,2,3 -> idx = 0,1,2,3
  blockIdx.x = 1: threadIdx.x = 0,1,2,3 -> idx = 4,5,6,7
  blockIdx.x = 2: threadIdx.x = 0,1,2,3 -> idx = 8,9,10,11
  blockIdx.x = 3: threadIdx.x = 0,1,2,3 -> idx = 12,13,14,15
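  The rounding-up step can be written directly in the launch code; a small sketch (variable names are illustrative):

  int n = 10000;                // number of elements
  int threadsPerBlock = 256;
  // Integer ceiling division: plain n / threadsPerBlock would truncate to 39
  int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // = 40

  // Inside the kernel, a bounds check guards the extra threads in the
  // last, partially filled block:
  //   int idx = blockDim.x * blockIdx.x + threadIdx.x;
  //   if (idx < n) { ... }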

  6. 2D Examples: add two matrices
  • Case 1: matrix dimensions and block dimensions are the same
  • Works only for small matrices: row * column may not exceed the maximum number of threads per block (512 on compute capability 1.x, 1024 on 2.x)
  • Number of blocks needed: 1
  • dim3 threadsPerBlock(row, column)
  • dim3 blocksPerGrid(1)
  • Kernel invocation:
  AddMatrix<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, cols)

  7. The following code adds two matrices A and B of size NxN and stores the result into matrix C:

  // Kernel definition
  __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
  {
      int i = threadIdx.x;
      int j = threadIdx.y;
      C[i][j] = A[i][j] + B[i][j];
  }

  int main()
  {
      ...
      // Kernel invocation with one block of N * N * 1 threads
      int numBlocks = 1;
      dim3 threadsPerBlock(N, N);
      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
      ...
  }
  // N * N may not exceed the max threads per block (1024), so N <= 32 here

  8. Case 2: MatAdd() example extended to handle multiple blocks

  // Kernel definition
  __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      if (i < N && j < N)
          C[i][j] = A[i][j] + B[i][j];
  }

  int main()
  {
      ...
      // Kernel invocation (as written, this assumes N is a multiple of 16)
      dim3 threadsPerBlock(16, 16);
      dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
      ...
  }

  The total number of threads is equal to the number of threads per block times the number of blocks.
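  If N need not be a multiple of the block size, the grid can be rounded up with the same ceiling-division trick as before; the bounds check in the kernel already covers the stragglers. A small variant sketch:

  // Round the grid size up so that N need not divide evenly by 16
  dim3 threadsPerBlock(16, 16);
  dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                 (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
  MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);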

  9. There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core.

  10. If
  • Max number of threads per block: 512
  • Max number of blocks per streaming multiprocessor: 8
  • Number of streaming multiprocessors: 30
  then the total number of threads available = 30 x 8 x 512 = 122880

  11. Compute Capability
  Compute capability 1.x
  • Thread block dimension: 1D, 2D, or 3D
  • Grid dimension: 1D or 2D
  • Max threads / block: 512
  Compute capability 2.x
  • Thread block dimension: 1D, 2D, or 3D
  • Grid dimension: 1D, 2D, or 3D
  • Max threads / block: 1024

  12. Matrix Multiplication: memory layout of a matrix
  In the CUDA C code used here, matrices are stored linearly in row-major order: element (i, j) of an M x N matrix lives at index i * N + j. (Libraries such as cuBLAS instead assume column-major order.)

  [Figure: a 4x4 matrix M, elements M0,0 through M3,3, and its linearized layout in memory]

  (M X P) * (P X N) => (M X N)

  13. Matrix Multiplication

  // Matrix multiplication on the (CPU) host
  void MatrixMulOnHost(float* mat1, float* mat2, float* R, int M, int P, int N)
  {
      for (int i = 0; i < M; ++i)
          for (int j = 0; j < N; ++j) {
              double sum = 0;
              for (int k = 0; k < P; ++k) {
                  double a = mat1[i * P + k];
                  double b = mat2[k * N + j];
                  sum += a * b;
              }
              R[i * N + j] = sum;
          }
  }

  [Figure: mat1 (M x P) times mat2 (P x N) gives R (M x N); i indexes rows of mat1 and R, j indexes columns of mat2 and R, k runs along the shared dimension P]

  14. Matrix multiplication on the GPU: each thread calculates one value in the resulting matrix.

  __global__ void MatrixMulOnDevice(float* mat1, float* mat2, float* R, int M, int P, int N)
  {
      float sum = 0;
      int row = threadIdx.y;
      int col = threadIdx.x;
      for (int k = 0; k < P; ++k) {
          float a = mat1[row * P + k];
          float b = mat2[k * N + col];
          sum += a * b;
      }
      R[row * N + col] = sum;
  }

  // Launch configuration: grid first, then block
  MatrixMulOnDevice<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, m, p, n);
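  A host-side driver for this single-block kernel might look like the sketch below; hA, hB, and hC are hypothetical host arrays (not named on the slide), error handling is omitted, and m x n must not exceed the max threads per block since the whole result is computed by one block:

  float *dA, *dB, *dC;
  size_t aBytes = m * p * sizeof(float);   // mat1 is m x p
  size_t bBytes = p * n * sizeof(float);   // mat2 is p x n
  size_t cBytes = m * n * sizeof(float);   // R    is m x n
  cudaMalloc(&dA, aBytes); cudaMalloc(&dB, bBytes); cudaMalloc(&dC, cBytes);
  cudaMemcpy(dA, hA, aBytes, cudaMemcpyHostToDevice);   // hA, hB: hypothetical host inputs
  cudaMemcpy(dB, hB, bBytes, cudaMemcpyHostToDevice);

  dim3 threadsPerBlock(n, m);   // x indexes columns, y indexes rows
  dim3 blocksPerGrid(1, 1);     // a single block computes the whole result
  MatrixMulOnDevice<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, m, p, n);

  cudaMemcpy(hC, dC, cBytes, cudaMemcpyDeviceToHost);   // hC: hypothetical host output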

  15. Limitation:
  • The size of the matrix is limited by the number of threads allowed in a thread block
  • Solution: use multiple thread blocks
  • Kernel invocation (see the kernel sketch after this list):
  int threads = 16;   // 16 x 16 = 256 threads per block; 64 x 64 would exceed the 1024-thread limit
  dim3 threadsPerBlock(threads, threads);
  dim3 blocksPerGrid(n / threads, m / threads);   // x spans the n columns, y spans the m rows
  Multiply<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, m, p, n);
  • Thread IDs:
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
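  Putting the pieces together, the multi-block kernel needs only the global row/column computation plus a bounds check; a sketch assuming the launch configuration above (the Multiply name matches the invocation on the slide):

  __global__ void Multiply(float* mat1, float* mat2, float* R, int M, int P, int N)
  {
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < M && col < N) {   // guard partially filled edge blocks
          float sum = 0.0f;
          for (int k = 0; k < P; ++k)
              sum += mat1[row * P + k] * mat2[k * N + col];
          R[row * N + col] = sum;
      }
  }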

  16. Another solution:
  • Give each thread more work
  • Instead of doing one operation, each thread is assigned more jobs
  • A tile of WIDTH * WIDTH entries (see the sketch below)
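  The slide does not spell the scheme out; one common reading is that each thread produces a small TILE x TILE patch of the result instead of a single element. A sketch under that assumption (TILE and the kernel name are illustrative, not from the slide):

  #define TILE 4   // assumed: each thread computes a TILE x TILE patch of R

  __global__ void MultiplyTiled(float* mat1, float* mat2, float* R, int M, int P, int N)
  {
      // Top-left corner of this thread's patch
      int baseRow = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
      int baseCol = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
      for (int r = baseRow; r < baseRow + TILE && r < M; ++r)
          for (int c = baseCol; c < baseCol + TILE && c < N; ++c) {
              float sum = 0.0f;
              for (int k = 0; k < P; ++k)
                  sum += mat1[r * P + k] * mat2[k * N + c];
              R[r * N + c] = sum;
          }
  }

  // Launch with a correspondingly smaller grid, e.g. each grid dimension
  // rounded up to cover ceil(N / (blockDim * TILE)) patches.

  Fewer, heavier threads reduce scheduling and index-computation overhead per output element; the classic alternative is shared-memory tiling, which the course treats separately.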

  17. Question??
  • Write a program to implement the kernel function Increment(a[], b)
  • The function increments each element of the array a by b units
  • The array must be dynamically allocated
  • Number of threads per block: 256
  • The number of blocks must be computed at run time from the size of the array
  • Each thread should perform one increment operation on one array element
  • Do the same for a two-dimensional array
  • with one block
  • with a number of blocks
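  Not part of the slide, but one possible solution sketch for the 1D version of the exercise (names are my own):

  // Each thread increments exactly one element; the guard handles the last block.
  __global__ void Increment(float* a, float b, int n)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < n)
          a[i] += b;
  }

  // Host side: block count derived at run time from the array size.
  //   float* dA;                      // device array of n floats
  //   cudaMalloc(&dA, n * sizeof(float));
  //   int threadsPerBlock = 256;
  //   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  //   Increment<<<blocks, threadsPerBlock>>>(dA, b, n);

  The 2D variants follow the same pattern as the MatAdd examples on slides 7 and 8.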
