
CUDA Programming Model Overview Hardware Hierarchy and Optimization




Presentation Transcript


  1. CUDA Programming Model Overview: Hardware Hierarchy and Optimization Yukai Hung a0934147@gmail.com Department of Mathematics National Taiwan University

  2. CUDA Development Environment

  3. CUDA Development Environment [figure: CUDA software stack] • one layer for developers who want low-level APIs • one layer for developers who want high-level APIs • both share the same back-end compiler and optimization technology 3

  4. CUDA Development Environment 4

  5. CUDA Development Environment • CUDA source files must be compiled by nvcc • - create host object code directly • - create device object code directly • - create Parallel Thread eXecution source code directly • - compile into device emulation mode for host simulation • Executable CUDA code requires two libraries • - dynamic linking with cuda core library • - dynamic linking with cuda runtime library 5

  6. CUDA Development Environment • CUDA emulation mode • - use host to simulate kernel execution • - no need of any device and CUDA driver • - each device thread is emulated with a host thread • Running in device emulation mode can • - use host native debugger support • - call any host function from device code and vice-versa • - access any device-specific data from host code and vice-versa • - detect deadlock situation caused by improper synchronization 6

  7. CUDA Development Environment • Linux and Windows environment 7

  8. CUDA Development Environment [figure: compilation flow] • virtual layer: the C/C++ CUDA application is compiled by NVCC into CPU code and PTX code • physical layer: the PTX-to-target compiler turns PTX into target code for the GPU (G80, …) • example source: float4 me=gx[gtid]; me.x=me.x+me.y*me.z; • corresponding PTX: ld.global.v4.f32 {$f1,$f3,$f5,$f7},[$r9+0]; mad.f32 $f1,$f5,$f3,$f1; 8

  9. CUDA Development Environment • How to install CUDA? • How to compile CUDA file? • How to compile CUDA file into PTX? 9
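  A hedged sketch of the usual answers (standard nvcc usage; exact flags vary between toolkit versions): install the display driver, the CUDA toolkit, and the SDK samples for your platform, then compile with commands such as

      nvcc program.cu -o program            (build an executable from a .cu file)
      nvcc -c program.cu -o program.o       (build object code only)
      nvcc -ptx program.cu -o program.ptx   (emit the PTX intermediate code)

  Older toolkits also accepted a device-emulation switch (-deviceemu) for the emulation mode described on slide 6; newer toolkits have dropped it.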

  10. CUDA Development Environment • Windows Nexus IDE • - first IDE for massively parallel threaded applications • - accelerates coprocessing application development • - complete Visual Studio integrated development environment 10

  11. CUDA Development Environment • Linux and Mac Visual Profiler 11

  12. Start from Matrix-Matrix Multiplication naïve and tiled algorithm

  13. Naïve Matrix-Matrix Multiplication • Each thread calculates one result element • - read one row and one column from global memory • - perform the inner product and store the result back to global memory [figure: matrices M, N, and P, each WIDTH x WIDTH] 13

  14. Naïve Matrix-Matrix Multiplication
  //compute matrix matrix multiplication P=M*N
  __global__ void SGEMM(float* Amatrix,float* Bmatrix,float* Pmatrix)
  {
      float value;
      float Melement;
      float Nelement;

      //calculate each thread global index
      ...

      value=0.0f;

      //each thread performs its inner product
      for(int loop=0;loop<dim;loop++)
      {
          Melement=Amatrix[yidx*dim+loop];
          Nelement=Bmatrix[loop*dim+xidx];
          value=value+Melement*Nelement;
      }

      //store the final result into the P matrix
      Pmatrix[yidx*dim+xidx]=value;
  } 14
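  The index calculation elided above ("...") is typically written as follows for a 2D launch; the names xidx, yidx, and dim follow the slide's code and are otherwise hypothetical:

      //typical global index calculation (sketch)
      int xidx=blockIdx.x*blockDim.x+threadIdx.x;   //column index into N and P
      int yidx=blockIdx.y*blockDim.y+threadIdx.y;   //row index into M and P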

  15. Naïve Matrix-Matrix Multiplication • Analyze the global memory usage • - each thread computes one element of the result matrix • - loads one row and one column of elements from matrices M and N • - performs one multiply and one addition for each pair of elements • - the compute to global memory access ratio is close to 1:1 • Analyze the global memory bandwidth • - 4 bytes x 346.5 GFLOPs = 1386 GB/s is required to achieve the peak • - 86.4 GB/s limits the actual performance to about 21.6 GFLOPs • - the actual naïve code runs at about 15 GFLOPs 15
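  The arithmetic behind these figures: with a 1:1 ratio every floating-point operation needs one 4-byte float from global memory, so sustaining the 346.5 GFLOPs peak would require 4 bytes x 346.5 G = 1386 GB/s of bandwidth; conversely, the available 86.4 GB/s divided by 4 bytes per operation caps the kernel at 86.4 / 4 = 21.6 GFLOPs.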

  16. Naïve Matrix-Matrix Multiplication • Need to drastically cut down global memory accesses • - each row of matrix M is read many times • - each column of matrix N is read many times [figure: matrices M, N, and P, each WIDTH x WIDTH] 16

  17. Tiled Matrix-Matrix Multiplication • Use the tiled algorithm to reuse matrix data • - each block computes one square sub-matrix of the result • - each thread computes one element of that sub-matrix [figure: M, N, and P (each WIDTH x WIDTH) partitioned into TILE_WIDTH x TILE_WIDTH tiles, with the sub-matrix Pdsub highlighted] 17

  18. Tiled Matrix-Matrix Multiplication • Each block should have many threads • - tile size 16 creates 16x16=256 threads per block • - a 1024x1024 matrix needs 64x64=4096 blocks on the grid • Each thread block loads two sub-matrices and computes its results • - load 2x256=512 floats from global memory to shared memory • - perform 256x(2x16)=512x16 operations on the shared memory • - global memory bandwidth is no longer a limiting factor • - the compute to global memory access ratio is close to 16:1 18

  19. Tiled Matrix-Matrix Multiplication
  //compute matrix matrix multiplication P=M*N
  void SGEMM(float* h_A,float* h_B,float* h_C)
  {
      dim3 gridSize;
      dim3 blockSize;

      //setup the execution configuration
      blockSize.x=TILED_WIDTH;
      blockSize.y=TILED_WIDTH;
      gridSize.x=WIDTH/TILED_WIDTH;
      gridSize.y=WIDTH/TILED_WIDTH;

      //launch the device matrix-matrix multiplication kernel
      //(device pointers d_A, d_B, d_C are assumed to be allocated
      //and filled from h_A, h_B, h_C in the elided code)
      sgemm<<<gridSize,blockSize>>>(d_A,d_B,d_C);
      ...
  } 19

  20. Tiled Matrix-Matrix Multiplication
  //compute matrix matrix multiplication P=M*N
  __global__ void sgemm(float* d_A,float* d_B,float* d_C)
  {
      //setup the block index and thread index
      int bx=blockIdx.x;
      int by=blockIdx.y;
      int tx=threadIdx.x;
      int ty=threadIdx.y;

      float value=0.0f;

      //declare the shared memory per block
      __shared__ float Mshared[TILED_WIDTH][TILED_WIDTH];
      __shared__ float Nshared[TILED_WIDTH][TILED_WIDTH];
      ... 20

  21. Tiled Matrix-Matrix Multiplication
      //loop over all sub-matrices of M and N required
      //to compute this block's sub-matrix result
      for(int loop=0;loop<WIDTH/TILED_WIDTH;loop++)
      {
          //get the pointers to the current sub-matrices of M (d_A) and N (d_B)
          float* Msub=GetSubMatrix(d_A,loop,bx,by,tx,ty,WIDTH);
          float* Nsub=GetSubMatrix(d_B,loop,bx,by,tx,ty,WIDTH);

          //each thread loads one element into shared memory
          Mshared[ty][tx]=GetMatrixElement(Msub,tx,ty);
          Nshared[ty][tx]=GetMatrixElement(Nsub,tx,ty);

          //synchronize to make sure the sub-matrices are loaded
          //to shared memory before computing the partial result
          __syncthreads(); 21

  22. Tiled Matrix-Matrix Multiplication • void __syncthreads() function • - synchronizes all threads executing in the same block • - execution resumes normally when all threads have reached the same point • - used to avoid RAW/WAR/WAW hazards when accessing memory 22
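  A minimal sketch of the read-after-write hazard that __syncthreads() prevents (hypothetical kernel, assuming 256 threads per block): every thread writes one shared-memory slot and then reads a slot written by a different thread, so the barrier must sit between the write and the read.

      //reverse the elements of each block through shared memory (sketch)
      __global__ void reverse_in_block(const float* in,float* out)
      {
          __shared__ float buffer[256];          //assumes blockDim.x == 256

          int tid=threadIdx.x;
          int gid=blockIdx.x*blockDim.x+tid;

          buffer[tid]=in[gid];                   //every thread writes one slot
          __syncthreads();                       //all writes must be visible...

          out[gid]=buffer[blockDim.x-1-tid];     //...before reading another thread's slot
      }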

  23. Tiled Matrix-Matrix Multiplication
          //each thread computes one element of the sub-matrix
          for(int loope=0;loope<TILED_WIDTH;loope++)
              value=value+Mshared[ty][loope]*Nshared[loope][tx];

          //synchronize to make sure that the preceding
          //computation is done before loading two new
          //sub-matrices of M and N in the next iteration
          __syncthreads();
      }

      //get the pointer to the block sub-matrix of P (d_C);
      //the tile index is unused for the output, so 0 is passed
      float* Psub=GetSubMatrix(d_C,0,bx,by,tx,ty,WIDTH);

      //save the block sub-matrix result to global memory
      SetMatrixElement(Psub,value,bx,by,tx,ty,WIDTH);

      return;
  } 23
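  For reference, here is a self-contained sketch that folds slides 19-23 into one kernel, assuming square matrices whose width is a multiple of TILED_WIDTH and indexing global memory directly instead of using the elided GetSubMatrix/GetMatrixElement/SetMatrixElement helpers (names and layout here are assumptions, not the slides' exact code):

      #define TILED_WIDTH 16

      //tiled matrix-matrix multiplication P=M*N for square width x width matrices
      __global__ void sgemm_tiled(const float* M,const float* N,float* P,int width)
      {
          __shared__ float Mshared[TILED_WIDTH][TILED_WIDTH];
          __shared__ float Nshared[TILED_WIDTH][TILED_WIDTH];

          int tx=threadIdx.x;
          int ty=threadIdx.y;
          int row=blockIdx.y*TILED_WIDTH+ty;   //global row of the P element
          int col=blockIdx.x*TILED_WIDTH+tx;   //global column of the P element

          float value=0.0f;

          //loop over all tile pairs of M and N needed by this block
          for(int tile=0;tile<width/TILED_WIDTH;tile++)
          {
              //each thread loads one element of each tile into shared memory
              Mshared[ty][tx]=M[row*width+tile*TILED_WIDTH+tx];
              Nshared[ty][tx]=N[(tile*TILED_WIDTH+ty)*width+col];
              __syncthreads();                 //tiles fully loaded before use

              //partial inner product on the shared-memory tiles
              for(int k=0;k<TILED_WIDTH;k++)
                  value+=Mshared[ty][k]*Nshared[k][tx];
              __syncthreads();                 //done with tiles before overwriting
          }

          //each thread stores its one result element
          P[row*width+col]=value;
      }

  Launched with blockSize=(TILED_WIDTH,TILED_WIDTH) and gridSize=(width/TILED_WIDTH,width/TILED_WIDTH), exactly as on slide 19.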

  24. Tiled Matrix-Matrix Multiplication 24

  25. Tiled Matrix-Matrix Multiplication • What is the limitation on the matrix tile size? • - the compute to memory access ratio depends on the tile size • - a larger tile size gives a higher memory reuse ratio • - the ratio is limited by the shared memory size and the block size • What is the performance difference between the two algorithms? • - naïve matrix-matrix multiplication runs at about 15 GFLOPs • - tiled matrix-matrix multiplication runs at about 90 GFLOPs • What kind of tiled algorithm is used in the Intel Math Kernel Library? 25

  26. Matrix-Matrix Multiplication [figure: memory hierarchy comparison - the GPU version stages data from device memory into shared memory, the CPU version stages data from host memory through the L3 and L2 caches] 26

  27. Matrix-Matrix Multiplication 27

  28. Matrix-Matrix Multiplication • What about the memory access pattern? [figure: matrices M, N, and P, each WIDTH x WIDTH] 28

  29. Tiled Matrix-Matrix Multiplication • How to accelerate the current tiled algorithm? • - prefetch the next sub-matrix data while computing • - overlap the loading and computing time to hide latency 29
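  A minimal sketch of this prefetching idea, under the same assumptions as the self-contained tiled kernel above (square matrices, width a multiple of TILED_WIDTH; names are assumptions): each iteration issues the global loads for the next tile into registers before computing on the current tile, so the memory latency overlaps with the arithmetic.

      #define TILED_WIDTH 16

      //tiled P=M*N with register prefetching of the next tile pair (sketch)
      __global__ void sgemm_prefetch(const float* M,const float* N,float* P,int width)
      {
          __shared__ float Mshared[TILED_WIDTH][TILED_WIDTH];
          __shared__ float Nshared[TILED_WIDTH][TILED_WIDTH];

          int tx=threadIdx.x;
          int ty=threadIdx.y;
          int row=blockIdx.y*TILED_WIDTH+ty;
          int col=blockIdx.x*TILED_WIDTH+tx;

          //prefetch the first tile pair into registers
          float mreg=M[row*width+tx];
          float nreg=N[ty*width+col];

          float value=0.0f;
          int tiles=width/TILED_WIDTH;

          for(int tile=0;tile<tiles;tile++)
          {
              //deposit the prefetched tile into shared memory
              Mshared[ty][tx]=mreg;
              Nshared[ty][tx]=nreg;
              __syncthreads();

              //issue the loads for the next tile pair before computing
              if(tile+1<tiles)
              {
                  mreg=M[row*width+(tile+1)*TILED_WIDTH+tx];
                  nreg=N[((tile+1)*TILED_WIDTH+ty)*width+col];
              }

              //compute on the current tile while the loads are in flight
              for(int k=0;k<TILED_WIDTH;k++)
                  value+=Mshared[ty][k]*Nshared[k][tx];
              __syncthreads();
          }

          P[row*width+col]=value;
      }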

  30. Recommended Program Strategy • Global memory resides in device memory • - much slower to access than shared memory • A profitable way of programming the device is to tile the data • - divide the data into subsets that fit into shared memory • - handle each data subset with one thread block by: • 1: load the subset from global memory to shared memory • use multiple threads to exploit memory-level parallelism • 2: perform the computation on the subset in shared memory • 3: store the final results from shared memory back to global memory 30

  31. Same Idea for Matrix Transpose naïve and tiled algorithm

  32. Naïve Matrix Transpose • Each thread handles one element of the matrix • - load one element from a row of the original matrix • - store it into a column of the transposed matrix [figure: WIDTH x WIDTH input and output matrices] 32

  33. Naïve Matrix Transpose [figure: access pattern] loads from global memory are contiguous (stride = 1), while stores into global memory are strided by the matrix dimension (stride = dimension) 33
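  A minimal sketch of the naïve transpose kernel this describes (hypothetical names; each thread copies one element, so reads are coalesced but writes stride by the matrix dimension):

      //naive transpose: each thread copies one element (sketch)
      __global__ void transpose_naive(const float* in,float* out,int width)
      {
          int x=blockIdx.x*blockDim.x+threadIdx.x;   //column index
          int y=blockIdx.y*blockDim.y+threadIdx.y;   //row index

          if(x<width&&y<width)
              out[x*width+y]=in[y*width+x];          //coalesced read, strided write
      }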

  34. Tiled Matrix Transpose Consider the tiled algorithm on the matrix! [figure: the matrix is loaded from memory and stored into memory tile by tile] 34

  35. Tiled Matrix Transpose Consider one of the thread blocks! [figure: global memory and shared memory] load a tile from global memory into shared memory, transpose it within shared memory, then store it from shared memory back to global memory 35
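  A minimal sketch of this tiled transpose (hypothetical names; assumes a TILE_DIM x TILE_DIM thread block): the tile is staged through shared memory so that both the global load and the global store proceed in row order and stay coalesced. The +1 padding on the shared array is a common extra trick against shared-memory bank conflicts and is not discussed on the slide.

      #define TILE_DIM 16

      //tiled transpose through shared memory (sketch)
      __global__ void transpose_tiled(const float* in,float* out,int width)
      {
          __shared__ float tile[TILE_DIM][TILE_DIM+1];   //+1 pad against bank conflicts

          int x=blockIdx.x*TILE_DIM+threadIdx.x;
          int y=blockIdx.y*TILE_DIM+threadIdx.y;

          //read the tile from global memory in row order
          if(x<width&&y<width)
              tile[threadIdx.y][threadIdx.x]=in[y*width+x];
          __syncthreads();

          //write the transposed tile: swap the block indices,
          //keep the per-thread access in row order
          x=blockIdx.y*TILE_DIM+threadIdx.x;
          y=blockIdx.x*TILE_DIM+threadIdx.y;
          if(x<width&&y<width)
              out[y*width+x]=tile[threadIdx.x][threadIdx.y];
      }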

  36. Hardware Hierarchy Architecture Details

  37. Hardware Hierarchy Architecture [figure: G80 block diagram - host, input assembler, setup/raster/ZCull, vertex and pixel thread issue, work distribution, SP/TF/L1 shader clusters, L2 caches, and frame buffer (FB) partitions] the work distribution unit distributes all blocks onto the shader multiprocessors 37

  38. Hardware Hierarchy Architecture this means you need enough blocks to fill all the shader pipelines; all blocks are scheduled round-robin based on the number of shaders 38

  39. Hardware Hierarchy Architecture [same G80 block diagram] the texture (TF) units perform graphics texture operations: texture filtering and pixel interpolation 39

  40. Hardware Hierarchy Architecture [same G80 block diagram] the shader processor array (SPA) spans the chip; the number of shaders depends on the hardware 40

  41. Hardware Hierarchy Architecture [same G80 block diagram] a shader multiprocessor (SM) inside the SPA; each thread block is scheduled onto one SM 41

  42. Hardware Hierarchy Architecture [same G80 block diagram] a shader multiprocessor (SM) inside the SPA; each thread block is scheduled onto one SM 42

  43. Hardware Hierarchy Architecture the shader processor (SP) handles regular arithmetic; each thread is scheduled onto one shader processor 43

  44. Hardware Hierarchy Architecture the low-precision special function unit (SFU) supports intrinsic functions in hardware 44

  45. Hardware Hierarchy Architecture • Intrinsic functions are supported by the hardware SFU • - intrinsic functions come from traditional usage • - they are less accurate but faster versions of the standard functions • - their names are prefixed with a double underscore, as in __function() • Selectable operation rounding modes • - functions suffixed with _rn use round-to-nearest-even • - functions suffixed with _rz use round-towards-zero • - functions suffixed with _rd use round-down • - functions suffixed with _ru use round-up • The SFU and SP units are two independent execution units • - SFU and SP operations can be performed simultaneously 45
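  A short sketch of how these intrinsics appear in device code (sinf, __sinf, and __fadd_rz are standard CUDA device functions; the kernel itself is hypothetical):

      //compare a standard math function, its SFU intrinsic, and a rounding-mode intrinsic
      __global__ void intrinsic_demo(const float* in,float* out,int n)
      {
          int idx=blockIdx.x*blockDim.x+threadIdx.x;
          if(idx>=n) return;

          float x=in[idx];

          float accurate=sinf(x);       //standard single-precision sine
          float fast=__sinf(x);         //faster, less accurate SFU intrinsic

          //explicit rounding mode: single-precision add with round-towards-zero
          out[idx]=__fadd_rz(accurate,fast);
      }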

  46. Hardware Hierarchy Architecture the on-chip shared memory on each SM shares data within one thread block; the SM fetches one instruction and dispatches it to all threads in the same warp 46

  47. Warp Architecture • The warp is purely a hardware concept, unrelated to the software design • - designed to handle variable block sizes • - designed to handle divergence efficiently • - designed to hide memory access latency • Each block is divided into several warps • - each warp contains 32 threads on current shaders • - the warp size is chosen for hardware reasons • Each shader multiprocessor executes only one warp at a time • - each shader multiprocessor can monitor and switch between several warps • - synchronization is not necessary among threads in the same warp 47

  48. Warp Architecture [figure: a block divided into warps] block size 256 gives warp number 8; block size 280 gives warp number 9, where 8 threads in the last warp are idle 48
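  A small sketch of the warp bookkeeping implied here (warpSize is the built-in CUDA constant, 32 on this hardware; the kernel and names are hypothetical):

      //which warp and lane a thread belongs to, assuming a 1D block
      __global__ void warp_info(int* warpIdOut,int* laneIdOut)
      {
          int tid=threadIdx.x;
          warpIdOut[tid]=tid/warpSize;    //warp index within the block
          laneIdOut[tid]=tid%warpSize;    //lane index within the warp
      }

      //host-side warp count for a block, rounded up:
      //  warps = (blockSize + warpSize - 1) / warpSize
      //  blockSize 256 -> 8 warps, blockSize 280 -> 9 warps (last warp has 8 idle lanes)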

  49. Warp Architecture • Shader multiprocessor scheduling [figure: one shader multiprocessor holding block 0 and block 1, each divided into warps 0-3; the SM switches among the resident warps] 49

  50. Warp Architecture • Shader multiprocessor synchronization [same figure: block 0 and block 1, each divided into warps 0-3, resident on one shader multiprocessor] 50
