
Fast matrix multiplication with CUDA


Presentation Transcript


  1. Fast matrix multiplication with CUDA
     University of Central Florida

  2. Overview
     • Platform
       • GEFORCE 8800GT, 512MB
       • Core: G92, shader frequency: 1.5 GHz, memory frequency: 900 MHz
     • Performance
       • Tuned for 4k x 4k matrices, 192 GFlops
     • Revisiting the Tiled version
     • Using large tiles
       • Base algorithm
       • Optimized algorithm
     • Tools and tips

  3. The Tiled version
     • Tile size: 16 x 16
     • 256 threads / block
     • 14 registers, 2076 bytes of shared memory per block
     • Occupancy: 2/3
     [Figure: thread layout T0, T1, …, T255 within a block]
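
     The deck does not reproduce the base tiled kernel itself. The following is a minimal sketch of the standard tiled formulation (as in the CUDA SDK matrixMul sample) that the numbers above refer to; the name matrixMulTiled, the TILE macro, and the assumption of square row-major W x W matrices with W a multiple of 16 are illustrative, not taken from the slides:

       #define TILE 16

       // C = A * B for W x W row-major matrices, one 16 x 16 tile of C per block.
       __global__ void matrixMulTiled(float *C, const float *A, const float *B, int W)
       {
           __shared__ float As[TILE][TILE];
           __shared__ float Bs[TILE][TILE];

           int tx = threadIdx.x, ty = threadIdx.y;
           int row = blockIdx.y * TILE + ty;
           int col = blockIdx.x * TILE + tx;

           float Psub = 0.0f;
           for (int m = 0; m < W / TILE; ++m) {
               // Stage one 16 x 16 sub-tile of A and of B in shared memory.
               As[ty][tx] = A[row * W + (m * TILE + tx)];
               Bs[ty][tx] = B[(m * TILE + ty) * W + col];
               __syncthreads();

               // The inner product over the staged sub-tiles.
               for (int k = 0; k < TILE; ++k)
                   Psub += As[ty][k] * Bs[k][tx];
               __syncthreads();
           }
           C[row * W + col] = Psub;
       }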

  4. The Tiled version – Memory access
     • Every half warp accesses contiguous memory locations.
     • Memory accesses are fully coalesced.
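
     As a concrete check against the sketch above (its index scheme is an assumption about the exact kernel, but any equivalent indexing behaves the same way):

       // Thread (tx, ty) of a half warp (ty fixed, tx = 0..15) loads
       //   A[row * W + m * TILE + tx]  ->  base address + 4 * tx bytes
       // 16 consecutive, aligned 4-byte words -> one coalesced transaction per
       // half warp; the B load varies only in tx as well and coalesces the same way.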

  5. The Tiled version – Bank conflicts
     • No bank conflicts: shared memory has 16 banks, and a half warp reading the same word is served by a broadcast.
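
     The same sketch makes the bank behaviour explicit (16 banks of 32-bit words on this hardware):

       // Inner-product reads for a fixed k, across one half warp:
       //   As[ty][k]  -> every thread reads the same word         -> broadcast, no conflict
       //   Bs[k][tx]  -> addresses differ by one word, bank = tx  -> 16 distinct banks, no conflict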

  6. The Tiled version - Bottlenecks
     • If memory bandwidth and the ALUs were both fully used:
       • 14.4 G floats/s, 168 G MAD/s
       • i.e. 11.67 MAD needed per loaded float
     • With 16 x 16 tiles:
       • Total W^3/8 loads, only 8 MAD per loaded float
       • Too many loads! Solution: larger tiles.
     • "Psub += As[ty][k] * Bs[k][tx]" also costs extra instructions for offset calculation (decuda output):

       mov.b32    $r12, s[$ofs4+0x0000]
       mov.b32    $r7,  s[$ofs4+0x0040]
       mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
       add.b32    $ofs4, $ofs3, 0x0000019c
       mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
       mov.b32    $r12, s[$ofs4+0x0000]
       mov.b32    $r11, s[$ofs4+0x0040]
       mad.rn.f32 $r7,  s[$ofs1+0x0014], $r7, $r13
       add.b32    $ofs4, $ofs3, 0x0000021c
       mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7

     • 77 GFlops (4k x 4k)
     [Figure: data reuse of the A and B tiles, annotated with TWidth and THeight]
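
     For reference, a back-of-the-envelope check of those figures. The card numbers are assumptions based on the stock GeForce 8800 GT (112 stream processors at 1.5 GHz, 256-bit GDDR3 bus at 900 MHz, double data rate):

       memory bandwidth = 256/8 bytes * 2 * 900 MHz        = 57.6 GB/s  ->  14.4 G floats/s
       MAD throughput   = 112 SPs * 1.5 GHz * 1 MAD/cycle  = 168 G MAD/s
       needed ratio     = 168 / 14.4                       ~  11.67 MAD per loaded float

       16 x 16 tiling: per thread, W/16 iterations with 2 loads and 16 MADs each
         -> total loads = W^2 * (W/16) * 2 = W^3/8, i.e. only 8 MADs per loaded float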

  7. Using Large Tiles
     • Each thread:
       • 17 loads / iteration
       • W/16 iterations
     • Total ~W^3/15 loads, 15 MAD / load
     [Figure: each block computes a 16 x 256 tile of C; the 16 Psubs per thread (256 threads) are kept in registers, and a 16 x 16 sub-tile of A is kept in shared memory]
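
     The corresponding count, using only the figures on this slide (each thread now owns 16 elements of C, so there are W^2/16 threads):

       loads per thread = (W/16) iterations * (1 from A + 16 from B) = 17W/16
       total loads      = (W^2/16) * 17W/16 = 17W^3/256  ~  W^3/15
       MADs per iteration per thread = 16 Psubs * 16 k values = 256  ->  256/17 ~ 15 MADs per load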

  8. Using Large Tiles - Algorithm
     • For each pair of sub-tiles in A and B:
       • Read the sub-tile of A into shared memory (1 number per thread).
       • For each of the 16 numbers from B:
         • Read one number from B into a register.
         • Perform one MAD for each Psub.
     • To remove the extra instructions for offset calculation, we want the sub-tile of A stored in column-major format in shared memory.
     • But …
     [Figure: mapping of threads T0 … T255 onto the A, B and C tiles]

  9. Using Large Tiles - Algorithm
     • Solution 1:
       • Transpose A to column-major format first.
     • Solution 2:
       • Read A in row-major format, write it to shared memory in column-major format.
       • Bank conflicts when writing to shared memory!
     [Figures: layout of Shared A across banks B0 … B15 and threads T0 … T15 for each solution]

  10. Using Large Tiles - Algorithm
     • Solution 3:
       • Pad Shared A with one empty row.
       • No bank conflicts, and no need to transpose A.
       • 164 GFlops (4k x 4k).
     [Figure: with the padding, successive columns of Shared A start in different banks (B0, B1, …), so a half warp's writes spread across all 16 banks]
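
     A minimal sketch of how that padding can be declared so that it matches the ashare[tx][ty] write on the next slide. The exact array shape is an assumption; only the one extra element per conceptual column (the "empty row" of the column-major picture) matters:

       // 16 x 16 sub-tile of A, stored transposed, plus one padding element per row.
       __shared__ float ashare[16][17];

       // Half-warp write ashare[tx][ty] (tx = 0..15, ty fixed):
       //   word offset = tx * 17 + ty, bank = (tx * 17 + ty) % 16 = (tx + ty) % 16
       //   -> 16 different banks, no conflict.
       // Without the pad the stride would be 16 words and all 16 writes would hit one bank.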

  11. Using Large Tiles - code

       for (int i = 0; i < MATRIX_WIDTH / 16; i++) {
           ashare[tx][ty] = A[0];            // stage one element of the A sub-tile
           __syncthreads();

           #pragma unroll                    // 150 GFlops (4k x 4k) without unroll
           for (int k = 0; k < 16; k++) {
               b = B[k * MATRIX_WIDTH];      // one element of B into a register
               comp16(b, &ashare[k][0], c);  // 16 MADs, one per Psub
           }

           A += 16;
           B += 16 * MATRIX_WIDTH;
           __syncthreads();
       }
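
     The slides do not show comp16 or the declarations it relies on. The following is a plausible sketch consistent with how it is called above; the names and exact layout are assumptions:

       // c[0..15] are one thread's partial sums; a points at 16 consecutive floats of
       // the shared A sub-tile (one conceptual column); b is one staged element of B.
       __device__ inline void comp16(float b, const float *a, float *c)
       {
           #pragma unroll
           for (int i = 0; i < 16; i++)
               c[i] += a[i] * b;             // one MAD per Psub
       }

       // Assumed per-thread context for the loop above:
       //   __shared__ float ashare[16][17];   // padded sub-tile of A (slide 10)
       //   float c[16] = { 0 };               // the 16 Psubs
       //   float b;                           // one element of B staged in a register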

  12. Using Large Tiles - optimized

       do {
           ashare[tx][ty] = a;               // a was loaded before this iteration
           __syncthreads();
           a = A[0];                         // prefetch for the next iteration

           // Consume the 4 staged B values while prefetching the next 4.
           bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
           b[0] = B[4 * MATRIX_WIDTH];
           b[1] = B[5 * MATRIX_WIDTH];
           b[2] = B[6 * MATRIX_WIDTH];
           b[3] = B[7 * MATRIX_WIDTH];
           for (int i = 0; i < 4; i++)
               comp16(bb[i], &ashare[i][0], c);

           …

           bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
           b[0] = B[12 * MATRIX_WIDTH];
           b[1] = B[13 * MATRIX_WIDTH];
           b[2] = B[14 * MATRIX_WIDTH];
           b[3] = B[15 * MATRIX_WIDTH];
           for (int i = 0; i < 4; i++)
               comp16(bb[i], &ashare[i + 8][0], c);

           bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
           A += 16;
           B += 16 * MATRIX_WIDTH;
           // Prefetch the first 4 B values of the next sub-tile.
           b[0] = B[0 * MATRIX_WIDTH];
           b[1] = B[1 * MATRIX_WIDTH];
           b[2] = B[2 * MATRIX_WIDTH];
           b[3] = B[3 * MATRIX_WIDTH];
           for (int i = 0; i < 4; i++)
               comp16(bb[i], &ashare[i + 12][0], c);

           __syncthreads();
       } while (A < Alast);
       ...  // last iteration

  13. Using Large Tiles - Performance
     • Execution time is measured as the computation time on the GPU.

  14. Using Large Tiles – Performance 2
     • GFlops (comp): excluding CPU-GPU data transfer time.
     • GFlops (total): including CPU-GPU data transfer time.

  15. Tools - CUDA GPU Occupancy Calculator

  16. Tools - decuda
     • Developed by Wladimir J. van der Laan, a PhD candidate at the Institute of Mathematics and Computing Science of the University of Groningen.
     • http://www.cs.rug.nl/~wladimir/decuda/

  17. Tools – CUDA Visual Profiler
     • http://forums.nvidia.com/index.php?showtopic=57443
     • Reports GPU time, CPU time, and occupancy.
     • Profiler counters:
       • gld_incoherent: number of non-coalesced global memory loads
       • gld_coherent: number of coalesced global memory loads
       • gst_incoherent: number of non-coalesced global memory stores
       • gst_coherent: number of coalesced global memory stores
       • local_load: number of local memory loads
       • local_store: number of local memory stores
       • branch: number of branch events (instruction and/or sync stack)
       • divergent_branch: number of divergent branches within a warp
       • instructions: number of dynamic instructions (in fetch)
       • warp_serialize: number of threads in a warp that serialize based on address (GRF or constant)
       • cta_launched: number of CTAs launched on the PM TPC

  18. Tips
     • Get the register, shared, constant, and local memory usage (reg, smem, cmem, lmem):

       nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -DUNIX -O3 --ptxas-options=-v

     • Compile with --maxrregcount to cap register usage per thread (example below).
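
     For instance, a hypothetical invocation that caps the kernel at 32 registers per thread; the value 32 is illustrative only, since lowering the cap can raise occupancy but may spill values to local memory:

       nvcc -O3 --ptxas-options=-v --maxrregcount=32 -cubin matrix_kernel.cu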

  19. References
     • NVIDIA CUDA samples: http://www.nvidia.com/object/cuda_sample_linear_algebra.html
       • Simple CUBLAS
       • Matrix Multiplication
       • Matrix Transpose
     • NVIDIA forum: http://forums.nvidia.com/index.php?showtopic=47689&st=0
