Fast matrix multiplication with CUDA

University of Central Florida
Overview
  • Platform
    • GeForce 8800 GT, 512 MB
    • Core: G92; shader frequency: 1.5 GHz; memory frequency: 900 MHz
  • Performance
    • Tuned for 4k x 4k matrices: 192 GFlops
  • Revisiting the Tiled version
  • Using large tiles
    • Base algorithm
    • Optimized algorithm
  • Tools and tips

The Tiled version
  • Tile size: 16 x 16
  • 256 threads / block
  • 14 registers / thread, 2076 bytes of shared memory / block
  • Occupancy: 2/3 (registers limit each SM to 2 blocks, i.e. 512 of the 768 possible threads)

[Figure: the 256 threads of a block arranged as a 16 x 16 grid, T0 T1 … T15 in the first row through … T255 in the last.]
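The deck never shows the tiled kernel itself; below is a minimal sketch of this classic formulation, modeled on the NVIDIA matrix multiplication sample listed in the References (the names matmul_tiled, As, Bs, Psub and the square W x W assumption are mine):

#define TILE 16

// One 16 x 16 thread block computes one 16 x 16 tile of C = A * B.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int W)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;

    float Psub = 0.0f;
    for (int m = 0; m < W / TILE; ++m) {
        // tx varies fastest, so each half-warp loads 16 consecutive
        // floats from global memory.
        As[ty][tx] = A[row * W + m * TILE + tx];
        Bs[ty][tx] = B[(m * TILE + ty) * W + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            Psub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[row * W + col] = Psub;
}

Launched with dim3 grid(W/16, W/16) and block(16, 16), this matches the 256-thread configuration above.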

The Tiled version – Memory access
  • Every half-warp accesses contiguous memory locations (tx is the fastest-varying index of each global load; see the sketch above).
  • Memory accesses are fully coalesced.


The Tiled version – Bank conflicts
  • No bank conflicts.

[Figure: the 16 shared memory banks and the 16 x 16 thread grid; a half-warp reading one word is served by a broadcast.]
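The inner-loop line (shown on the next slide) is conflict-free by construction:

// All 16 threads of a half-warp read the same word As[ty][k]: the
// hardware serves this as a broadcast. Bs[k][tx] maps consecutive tx
// to consecutive 32-bit words, i.e. to 16 distinct banks.
Psub += As[ty][k] * Bs[k][tx];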

The Tiled version - Bottlenecks
  • If the memory bandwidth and the ALUs were both fully utilized:
    • 14.4 G floats/s (57.6 GB/s), 168 G MAD/s (112 SPs x 1.5 GHz)
    • The hardware balance is therefore 168 / 14.4 ≈ 11.67 MAD per float loaded.
  • With 16 x 16 tiles:
    • W³/8 loads in total, i.e. only 8 MAD per float loaded.
    • Too many loads!
  • Solution: larger tiles.
  • “Psub += As[ty][k] * Bs[k][tx]”
    • Extra instructions for shared memory offset calculation (see the decuda output below).
  • 77 GFlops (4k x 4k)

// decuda output for the inner loop: mov and add.b32 offset arithmetic
// accompany the mad.rn.f32 instructions.
mov.b32 $r12, s[$ofs4+0x0000]
mov.b32 $r7, s[$ofs4+0x0040]
mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
add.b32 $ofs4, $ofs3, 0x0000019c
mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
mov.b32 $r12, s[$ofs4+0x0000]
mov.b32 $r11, s[$ofs4+0x0040]
mad.rn.f32 $r7, s[$ofs1+0x0014], $r7, $r13
add.b32 $ofs4, $ofs3, 0x0000021c
mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7

[Figure: a TWidth x THeight tile; each staged element is reused TWidth times in one direction and THeight times in the other.]

Using Large Tiles
  • Each thread:
    • 17 loads / iteration (1 element of A + 16 elements of B)
    • W/16 iterations
    • Total ≈ W³/15 loads, i.e. 15 MAD/load

[Figure: each block of 256 threads computes a 16 x 256 tile of C, 16 Psubs per thread, stored in registers; the 16 x 16 sub-tile of A is stored in shared memory.]
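In code, the per-thread state in the figure might be declared as follows (a sketch; the names anticipate the code slides later in the deck, and the exact declarations are assumptions):

__shared__ float ashare[16][16];  // the 16 x 16 sub-tile of A (padding comes later)
float c[16] = { 0.0f };           // the 16 Psubs, held in registers
float b;                          // one streamed element of B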

Using Large Tiles - Algorithm
  • For each sub-tile in A & B:
    • Read the sub-tile of A into shared memory, 1 element / thread.
    • For each of the 16 elements in B:
      • Read one element from B into a register.
      • Perform one MAD for each Psub.
  • To remove the extra instructions for offset calculation, we want the sub-tile of A stored in column-major order in shared memory.
    • But …

[Figure: the A, B, and C tiles processed by threads T0 T1 T2 … T255.]

Using Large Tiles - Algorithm
  • Solution 1:
    • Transpose A into column-major order first.

[Figure: solution 1. A is transposed into shared A; threads T0 … T15 write to banks B0 … B15.]

  • Solution 2:
    • Read A in row-major order, write it to shared memory in column-major order.
    • Bank conflicts when writing to shared memory (sketched in code below)!

[Figure: solution 2. A is read row-major and written column-major into shared A by threads T0 … T15.]
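A sketch of why solution 2 conflicts, assuming the unpadded 16 x 16 array implied by the code slides:

__shared__ float ashare[16][16];
// Threads of a half-warp have consecutive tx, so the writes below are
// 16 floats apart: (tx * 16 + ty) mod 16 is the same bank for all 16
// threads, a 16-way bank conflict.
ashare[tx][ty] = A[0];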

Using Large Tiles - Algorithm
  • Solution 3:
    • Pad shared A with one empty row.
    • No bank conflicts, and no need to transpose A (declaration sketched below).
    • 164 GFlops (4k x 4k).

[Figure: solution 3. With the padding, successive rows of shared A start at rotated banks (B0, B1, B2, …), so the column-wise writes by threads T0 … T15 hit 16 different banks.]
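A sketch of the padded declaration (the exact shape is an assumption; one extra element per 16-element row adds one row's worth of padding in total):

// The column-wise write ashare[tx][ty] now has a stride of 17 floats,
// so the 16 writes of a half-warp land in 16 different banks.
__shared__ float ashare[16][17];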

Using Large Tiles - Code

for (int i = 0; i < MATRIX_WIDTH / 16; i++)
{
    ashare[tx][ty] = A[0];            // stage one element of the A sub-tile
    __syncthreads();

#pragma unroll // 150 GFlops (4k x 4k) without unroll
    for (int k = 0; k < 16; k++)
    {
        b = B[k * MATRIX_WIDTH];      // one element of B into a register
        comp16(b, &ashare[k][0], c);  // 16 MADs, one per Psub
    }

    A += 16;                          // advance to the next sub-tile
    B += 16 * MATRIX_WIDTH;
    __syncthreads();
}
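comp16 is never shown in the deck. A plausible sketch, consistent with how it is called above (the implementation is an assumption, not the author's code):

// Multiply one element of B against 16 staged elements of A and
// accumulate into the 16 per-thread Psubs: c[j] += a[j] * b.
__device__ void comp16(float b, const float *a, float *c)
{
#pragma unroll
    for (int j = 0; j < 16; ++j)
        c[j] += a[j] * b;
}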

Using Large Tiles - Optimized

do
{
    ashare[tx][ty] = a;               // stage the prefetched A element
    __syncthreads();

    a = A[0];                         // prefetch A for the next iteration

    // Rows 0-3: consume bb while prefetching b[4..7].
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[4 * MATRIX_WIDTH];
    b[1] = B[5 * MATRIX_WIDTH];
    b[2] = B[6 * MATRIX_WIDTH];
    b[3] = B[7 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i][0], c);

    // Rows 4-7: prefetch b[8..11]. (This block is garbled in the
    // transcript; restored from the surrounding pattern.)
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[8 * MATRIX_WIDTH];
    b[1] = B[9 * MATRIX_WIDTH];
    b[2] = B[10 * MATRIX_WIDTH];
    b[3] = B[11 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 4][0], c);

    // Rows 8-11: prefetch b[12..15].
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    b[0] = B[12 * MATRIX_WIDTH];
    b[1] = B[13 * MATRIX_WIDTH];
    b[2] = B[14 * MATRIX_WIDTH];
    b[3] = B[15 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 8][0], c);

    // Rows 12-15: advance the tile pointers and prefetch b[0..3]
    // of the next tile.
    bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
    A += 16;
    B += 16 * MATRIX_WIDTH;
    b[0] = B[0 * MATRIX_WIDTH];
    b[1] = B[1 * MATRIX_WIDTH];
    b[2] = B[2 * MATRIX_WIDTH];
    b[3] = B[3 * MATRIX_WIDTH];
    for (int i = 0; i < 4; i++)
        comp16(bb[i], &ashare[i + 12][0], c);

    __syncthreads();
} while (A < Alast);

... // last iteration (peeled, since the loop above prefetches one tile ahead)

Using Large Tiles - Performance

Execution time is measured as the computation time on the GPU.

Using Large Tiles – Performance 2

GFlops (comp): excluding CPU-GPU data transfer time.

GFlops (total): including CPU-GPU data transfer time.

Tools - CUDA GPU Occupancy Calculator

Tools - decuda
  • Developed by Wladimir J. van der Laan
    • a PhD candidate at the Institute of Mathematics and Computing Science of the University of Groningen.
  • http://www.cs.rug.nl/~wladimir/decuda/

Tools – CUDA Visual Profiler
  • http://forums.nvidia.com/index.php?showtopic=57443
    • Reports GPU time, CPU time, and occupancy.
    • Profiler counters:
      • gld_incoherent: number of non-coalesced global memory loads
      • gld_coherent: number of coalesced global memory loads
      • gst_incoherent: number of non-coalesced global memory stores
      • gst_coherent: number of coalesced global memory stores
      • local_load: number of local memory loads
      • local_store: number of local memory stores
      • branch: number of branch events (instruction and/or sync stack)
      • divergent_branch: number of divergent branches within a warp
      • instructions: number of dynamic instructions (in fetch)
      • warp_serialize: number of threads in a warp that serialize based on address (GRF or constant)
      • cta_launched: number of CTAs launched on the PM TPC

Tips
  • Get the usage of reg, smem, cmem, and lmem (registers, shared, constant, and local memory):
    • nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -DUNIX -O3 --ptxas-options=-v
  • Compile with --maxrregcount to cap the number of registers per thread.
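For example (the cap of 32 registers is illustrative, not a value from the deck):

nvcc --maxrregcount=32 --ptxas-options=-v -cubin matrix_kernel.cu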

References
  • NVIDIA CUDA Samples:
    • http://www.nvidia.com/object/cuda_sample_linear_algebra.html
    • Simple CUBLAS
    • Matrix Multiplication
    • Matrix Transpose
  • NVIDIA Forum:
    • http://forums.nvidia.com/index.php?showtopic=47689&st=0
