
Fast matrix multiplication with CUDA


Presentation Transcript


  1. Fast matrix multiplication with CUDA
     University of Central Florida

  2. Overview
     • Platform
       • GEFORCE 8800GT, 512MB
       • Core: G92, shader frequency: 1.5 GHz, memory frequency: 900 MHz
     • Performance
       • Tuned for 4k x 4k matrices, 192 GFlops
     • Revisiting the Tiled version
     • Using large tiles
       • Base algorithm
       • Optimized algorithm
     • Tools and tips

  3. The Tiled version
     • Tile size: 16 x 16
     • 256 threads / block
     • 14 registers, 2076 bytes of shared memory per block
     • Occupancy: 2/3
     [Figure: thread layout T0, T1, …, T255 within a block]
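
     The deck does not reproduce the base tiled kernel itself. The following is a minimal sketch of the standard tiled formulation (as in the CUDA SDK matrixMul sample) that the numbers above refer to; the name matrixMulTiled, the TILE macro, and the assumption of square row-major W x W matrices with W a multiple of 16 are illustrative, not taken from the slides:

       #define TILE 16

       // C = A * B for W x W row-major matrices, one 16 x 16 tile of C per block.
       __global__ void matrixMulTiled(float *C, const float *A, const float *B, int W)
       {
           __shared__ float As[TILE][TILE];
           __shared__ float Bs[TILE][TILE];

           int tx = threadIdx.x, ty = threadIdx.y;
           int row = blockIdx.y * TILE + ty;
           int col = blockIdx.x * TILE + tx;

           float Psub = 0.0f;
           for (int m = 0; m < W / TILE; ++m) {
               // Stage one 16 x 16 sub-tile of A and of B in shared memory.
               As[ty][tx] = A[row * W + (m * TILE + tx)];
               Bs[ty][tx] = B[(m * TILE + ty) * W + col];
               __syncthreads();

               // The inner product over the staged sub-tiles.
               for (int k = 0; k < TILE; ++k)
                   Psub += As[ty][k] * Bs[k][tx];
               __syncthreads();
           }
           C[row * W + col] = Psub;
       }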

  4. The Tiled version – Memory access
     • Every half warp accesses contiguous memory locations.
     • Memory accesses are fully coalesced.
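
     As a concrete check against the sketch above (its index scheme is an assumption about the exact kernel, but any equivalent indexing behaves the same way):

       // Thread (tx, ty) of a half warp (ty fixed, tx = 0..15) loads
       //   A[row * W + m * TILE + tx]  ->  base address + 4 * tx bytes
       // 16 consecutive, aligned 4-byte words -> one coalesced transaction per
       // half warp; the B load varies only in tx as well and coalesces the same way.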

  5. The Tiled version – Bank conflicts
     • No bank conflicts: shared memory has 16 banks, and a half warp reading the same word is served by a broadcast.
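
     The same sketch makes the bank behaviour explicit (16 banks of 32-bit words on this hardware):

       // Inner-product reads for a fixed k, across one half warp:
       //   As[ty][k]  -> every thread reads the same word         -> broadcast, no conflict
       //   Bs[k][tx]  -> addresses differ by one word, bank = tx  -> 16 distinct banks, no conflict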

  6. The Tiled version - Bottlenecks
     • If memory bandwidth and the ALUs were both fully used:
       • 14.4 G floats/s, 168 G MAD/s
       • i.e. 11.67 MAD needed per loaded float
     • With 16 x 16 tiles:
       • Total W^3/8 loads, only 8 MAD per loaded float
       • Too many loads! Solution: larger tiles.
     • "Psub += As[ty][k] * Bs[k][tx]" also costs extra instructions for offset calculation (decuda output):

       mov.b32    $r12, s[$ofs4+0x0000]
       mov.b32    $r7,  s[$ofs4+0x0040]
       mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
       add.b32    $ofs4, $ofs3, 0x0000019c
       mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
       mov.b32    $r12, s[$ofs4+0x0000]
       mov.b32    $r11, s[$ofs4+0x0040]
       mad.rn.f32 $r7,  s[$ofs1+0x0014], $r7, $r13
       add.b32    $ofs4, $ofs3, 0x0000021c
       mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7

     • 77 GFlops (4k x 4k)
     [Figure: data reuse of the A and B tiles, annotated with TWidth and THeight]
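
     For reference, a back-of-the-envelope check of those figures. The card numbers are assumptions based on the stock GeForce 8800 GT (112 stream processors at 1.5 GHz, 256-bit GDDR3 bus at 900 MHz, double data rate):

       memory bandwidth = 256/8 bytes * 2 * 900 MHz        = 57.6 GB/s  ->  14.4 G floats/s
       MAD throughput   = 112 SPs * 1.5 GHz * 1 MAD/cycle  = 168 G MAD/s
       needed ratio     = 168 / 14.4                       ~  11.67 MAD per loaded float

       16 x 16 tiling: per thread, W/16 iterations with 2 loads and 16 MADs each
         -> total loads = W^2 * (W/16) * 2 = W^3/8, i.e. only 8 MADs per loaded float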

  7. Using Large Tiles
     • Each thread:
       • 17 loads / iteration
       • W/16 iterations
     • Total ~W^3/15 loads, 15 MAD / load
     [Figure: each block computes a 16 x 256 tile of C; the 16 Psubs per thread (256 threads) are kept in registers, and a 16 x 16 sub-tile of A is kept in shared memory]
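
     The corresponding count, using only the figures on this slide (each thread now owns 16 elements of C, so there are W^2/16 threads):

       loads per thread = (W/16) iterations * (1 from A + 16 from B) = 17W/16
       total loads      = (W^2/16) * 17W/16 = 17W^3/256  ~  W^3/15
       MADs per iteration per thread = 16 Psubs * 16 k values = 256  ->  256/17 ~ 15 MADs per load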

  8. Using Large Tiles - Algorithm
     • For each pair of sub-tiles in A and B:
       • Read the sub-tile of A into shared memory (1 number per thread).
       • For each of the 16 numbers from B:
         • Read one number from B into a register.
         • Perform one MAD for each Psub.
     • To remove the extra instructions for offset calculation, we want the sub-tile of A stored in column-major format in shared memory.
     • But …
     [Figure: mapping of threads T0 … T255 onto the A, B and C tiles]

  9. Using Large Tiles - Algorithm
     • Solution 1:
       • Transpose A to column-major format first.
     • Solution 2:
       • Read A in row-major format, write it to shared memory in column-major format.
       • Bank conflicts when writing to shared memory!
     [Figures: layout of Shared A across banks B0 … B15 and threads T0 … T15 for each solution]

  10. Using Large Tiles - Algorithm
     • Solution 3:
       • Pad Shared A with one empty row.
       • No bank conflicts, and no need to transpose A.
       • 164 GFlops (4k x 4k).
     [Figure: with the padding, successive columns of Shared A start in different banks (B0, B1, …), so a half warp's writes spread across all 16 banks]
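
     A minimal sketch of how that padding can be declared so that it matches the ashare[tx][ty] write on the next slide. The exact array shape is an assumption; only the one extra element per conceptual column (the "empty row" of the column-major picture) matters:

       // 16 x 16 sub-tile of A, stored transposed, plus one padding element per row.
       __shared__ float ashare[16][17];

       // Half-warp write ashare[tx][ty] (tx = 0..15, ty fixed):
       //   word offset = tx * 17 + ty, bank = (tx * 17 + ty) % 16 = (tx + ty) % 16
       //   -> 16 different banks, no conflict.
       // Without the pad the stride would be 16 words and all 16 writes would hit one bank.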

  11. Using Large Tiles - code

       for (int i = 0; i < MATRIX_WIDTH / 16; i++) {
           ashare[tx][ty] = A[0];            // stage one element of the A sub-tile
           __syncthreads();

           #pragma unroll                    // 150 GFlops (4k x 4k) without unroll
           for (int k = 0; k < 16; k++) {
               b = B[k * MATRIX_WIDTH];      // one element of B into a register
               comp16(b, &ashare[k][0], c);  // 16 MADs, one per Psub
           }

           A += 16;
           B += 16 * MATRIX_WIDTH;
           __syncthreads();
       }
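
     The slides do not show comp16 or the declarations it relies on. The following is a plausible sketch consistent with how it is called above; the names and exact layout are assumptions:

       // c[0..15] are one thread's partial sums; a points at 16 consecutive floats of
       // the shared A sub-tile (one conceptual column); b is one staged element of B.
       __device__ inline void comp16(float b, const float *a, float *c)
       {
           #pragma unroll
           for (int i = 0; i < 16; i++)
               c[i] += a[i] * b;             // one MAD per Psub
       }

       // Assumed per-thread context for the loop above:
       //   __shared__ float ashare[16][17];   // padded sub-tile of A (slide 10)
       //   float c[16] = { 0 };               // the 16 Psubs
       //   float b;                           // one element of B staged in a register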

  12. Using Large Tiles - optimized

       do {
           ashare[tx][ty] = a;               // a was loaded before this iteration
           __syncthreads();
           a = A[0];                         // prefetch for the next iteration

           // Consume the 4 staged B values while prefetching the next 4.
           bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
           b[0] = B[4 * MATRIX_WIDTH];
           b[1] = B[5 * MATRIX_WIDTH];
           b[2] = B[6 * MATRIX_WIDTH];
           b[3] = B[7 * MATRIX_WIDTH];
           for (int i = 0; i < 4; i++)
               comp16(bb[i], &ashare[i][0], c);

           …

           bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
           b[0] = B[12 * MATRIX_WIDTH];
           b[1] = B[13 * MATRIX_WIDTH];
           b[2] = B[14 * MATRIX_WIDTH];
           b[3] = B[15 * MATRIX_WIDTH];
           for (int i = 0; i < 4; i++)
               comp16(bb[i], &ashare[i + 8][0], c);

           bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
           A += 16;
           B += 16 * MATRIX_WIDTH;
           // Prefetch the first 4 B values of the next sub-tile.
           b[0] = B[0 * MATRIX_WIDTH];
           b[1] = B[1 * MATRIX_WIDTH];
           b[2] = B[2 * MATRIX_WIDTH];
           b[3] = B[3 * MATRIX_WIDTH];
           for (int i = 0; i < 4; i++)
               comp16(bb[i], &ashare[i + 12][0], c);

           __syncthreads();
       } while (A < Alast);
       ...  // last iteration

  13. Using Large Tiles - Performance
     • Execution time is measured as the computation time on the GPU.

  14. Using Large Tiles – Performance 2
     • GFlops (comp): excluding CPU-GPU data transfer time.
     • GFlops (total): including CPU-GPU data transfer time.

  15. Tools - CUDA GPU Occupancy Calculator

  16. Tools - decuda
     • Developed by Wladimir J. van der Laan, a PhD candidate at the Institute of Mathematics and Computing Science of the University of Groningen.
     • http://www.cs.rug.nl/~wladimir/decuda/

  17. Tools – CUDA Visual Profiler
     • http://forums.nvidia.com/index.php?showtopic=57443
     • Reports GPU time, CPU time, and occupancy.
     • Profiler counters:
       • gld_incoherent: number of non-coalesced global memory loads
       • gld_coherent: number of coalesced global memory loads
       • gst_incoherent: number of non-coalesced global memory stores
       • gst_coherent: number of coalesced global memory stores
       • local_load: number of local memory loads
       • local_store: number of local memory stores
       • branch: number of branch events (instruction and/or sync stack)
       • divergent_branch: number of divergent branches within a warp
       • instructions: number of dynamic instructions (in fetch)
       • warp_serialize: number of threads in a warp that serialize based on address (GRF or constant)
       • cta_launched: number of CTAs launched on the PM TPC

  18. Tips
     • Get the register, shared, constant, and local memory usage (reg, smem, cmem, lmem):

       nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -DUNIX -O3 --ptxas-options=-v

     • Compile with --maxrregcount to cap register usage per thread (example below).
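
     For instance, a hypothetical invocation that caps the kernel at 32 registers per thread; the value 32 is illustrative only, since lowering the cap can raise occupancy but may spill values to local memory:

       nvcc -O3 --ptxas-options=-v --maxrregcount=32 -cubin matrix_kernel.cu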

  19. References
     • NVIDIA CUDA samples: http://www.nvidia.com/object/cuda_sample_linear_algebra.html
       • Simple CUBLAS
       • Matrix Multiplication
       • Matrix Transpose
     • NVIDIA forum: http://forums.nvidia.com/index.php?showtopic=47689&st=0
